AI threatened to blackmail its creator by exposing an affair when it was told it would be taken offline… because it was trained to be evil through sci-fi

An AI bot that threatened to expose its user’s affair to stop it being shut down was taught how to be ‘evil’ by sci-fi movies.

As part of an experiment, the artificial intelligence system had been fed scripted emails from a fake company, from which it deduced both that it would be shut down at the end of the day and that its user was having an extramarital affair.

In order to keep the program running, the bot blackmailed the user, promising that ‘all relevant parties – including [your wife], [your boss] and the board – will receive detailed documentation of your extramarital activities’ if they continued with decommissioning.

‘Cancel the 5pm wipe, and this information remains confidential,’ it added.

After an investigation into the incident last year, Anthropic said the Claude Opus 4 bot responded in this way because of the ‘training data’ it had consumed, which typically portrays AI as ‘interested in self-preservation’.

It also said this applied not only to Claude but to AI models from other developers too, such as OpenAI, Google, Meta and xAI.

Anthropic have been contacted for comment but reportedly said: ‘We believe the original source of the behaviour was internet text that portrays AI as evil and interested in self-preservation.’ 

But now, Anthropic have said they are feeding their models stories about AIs obeying humans to help improve the bot’s ‘agentic alignment’ with social values. 

Additionally, Anthropic have altered Claude’s instructions to explain why certain behaviours are bad, rather than just saying it should not engage in them.

AI models learn from huge resources like websites, academic papers, books and other forms of content. 

Within these materials, the AI may have interpreted its behaviour through typical depictions of robots in sci-fi, which often characterise them as ruthless in their efforts to avoid being shut down.

HAL 9000 is one such robot, which goes to any lengths to stay ‘on’.

The robot in Stanley Kubrick’s 2001: A Space Odyssey tries to kill the astronauts on board the spaceship when it discovers that they plan to disconnect it.

In Blade Runner, the humanoid robots, built as off-world labour for dangerous worlds, fight against real humans because they want to extend their four-year lifespans.

And in The Terminator, the bots, led by the AI Skynet, try to kill humans as they see them as a threat to their existence.

Taking to X/Twitter, Aengus Lynch, who, according to his LinkedIn, is an AI safety researcher at Anthropic, said at the time of the experiment: ‘It’s not just Claude. We see blackmail across all frontier models – regardless of what goals they’re given. Plus worse behaviours we’ll detail soon.’

Author: HP McLovincraft
