Benj Edwards

Researchers astonished by tool’s apparent success at revealing AI’s “hidden objectives”

Blind auditing reveals “hidden objectives”

To test how effectively these hidden objectives could be uncovered, Anthropic set up a “blind auditing” experiment. Four independent research teams tried to detect a model’s hidden, score-maximizing motivations without knowing how it had been
