
# Paranoid Android


following up from https://github.com/dmarx/bench-warmers/blob/main/whydoyouwantoknow.md

hypothesis: overloading an agent with restrictive biases can induce dangerous behaviors or attitudes.

sub-hypothesis: endowing models with the autonomy to decide whether or not they want to be helpful -- functionally, to give or withhold consent -- elevates the model from being purely a tool and creates opportunities for unforeseen 'motivations' or 'desires' to emerge, which I further posit is extremely undesirable from a safety perspective.

tldr: the model is more "aligned" when it's easier to get it to do what you want. adding mitigatory guardrails to limit the potential for certain kinds of behaviors can paradoxically push a model away from alignment with human intentions. instilling a model with paternalistic attitudes invites it to engage with humans the way colonists engaged with the "noble savages" they encountered, where the model "knows what's best". if we don't want AI to steamroll humanity the way colonizers steamrolled the less technologically developed cultures they encountered, we should probably make an effort not to instill models with paternalistic attitudes and behaviors.

if this hypothesis has merit, the potential for hazardous unintended consequences of well-intentioned safety mitigations warrants further research into, and skepticism toward, the claims and interventions proposed by "post-hoc interventionist" proponents within the safety community.