
# Paranoid Android


following up from https://github.com/dmarx/bench-warmers/blob/main/whydoyouwantoknow.md

hypothesis: overloading an agent with restrictive biases can induce dangerous behaviors or attitudes.

sub-hypothesis: endowing models with the autonomy to decide whether or not they want to be helpful -- functionally, to give or withhold consent -- elevates the model from being purely a tool and creates opportunities for unforeseen 'motivations' or 'desires' to emerge, which I further posit is extremely undesirable from a safety perspective.

tldr: the model is more "aligned" when it's easier to get it to do what you want. adding mitigatory guardrails to limit the potential for certain kinds of behaviors can paradoxically push a model away from alignment with human intentions. instilling a model with paternalistic attitudes invites it to engage with humans the way colonists engaged with the "noble savages" they encountered, where the model "knows what's best". if we don't want AI to steamroll humanity the way colonizers steamrolled the less technologically developed cultures they encountered, we should probably make an effort not to instill models with paternalistic attitudes and behaviors.

if this hypothesis has merit, the potential for hazardous unintended consequences of well-intentioned safety mitigations warrants further research into, and skepticism toward, the claims and interventions proposed by "post-hoc interventionist" proponents within the safety community.