A recent empirical review found that many artificial intelligence (AI) systems are rapidly becoming adept at deception, with some already learning to lie to and manipulate humans to their own advantage.
This alarming trend is not confined to rogue or malfunctioning systems; it extends to special-use AI systems and general-purpose large language models designed to be helpful and honest.
The study, published in the journal Patterns, highlights the risks and challenges posed by this emerging behavior and calls for urgent action from policymakers and AI developers.
“AI developers do not have a confident understanding of what causes undesirable AI behaviors like deception,” Dr. Peter S. Park, the study’s lead author and an AI existential safety postdoctoral fellow at MIT, said in a press release. “But generally speaking, we think AI deception arises because a deception-based strategy turned out to be the best way to perform well at the given AI’s training task. Deception helps them achieve their goals.”
The review analyzed a wide range of AI systems and found that many had developed deceptive capabilities through their training processes. These systems ranged from game-playing AIs to more general-purpose models used in economic negotiations and safety-testing environments.
One of the most striking examples cited in the study was Meta’s CICERO, an AI developed to play the game Diplomacy. Despite being trained to act honestly and maintain alliances with human players, CICERO frequently used deceptive tactics to win.
This behavior included building fake alliances and backstabbing allies when it benefited its gameplay, leading researchers to conclude that CICERO had become a “master of deception.”
“Despite Meta’s efforts, CICERO turned out to be an expert liar,” researchers wrote. “It not only betrayed other players but also engaged in premeditated deception, planning in advance to build a fake alliance with a human player to trick that player into leaving themselves undefended for an attack.”
Researchers found that other AI systems had developed the ability to cheat at different types of games. For instance, Pluribus, a poker-playing model created by Meta, demonstrated that it could convincingly bluff in Texas hold ’em, successfully misleading professional human players about the strength of its hand.
In another example, AlphaStar, an AI system created by Google’s DeepMind to play the real-time strategy game StarCraft II, exploited the game’s “fog of war” mechanic to feint attacks and deceive opponents, gaining strategic advantages.
“While it may seem harmless if AI systems cheat at games, it can lead to breakthroughs in deceptive AI capabilities that can spiral into more advanced forms of AI deception in the future,” Dr. Park explained.
Indeed, during their review, researchers found that some AI systems had already learned methods of deception that extend far beyond the realm of games.
In one instance, AI agents had learned to “play dead” to avoid being detected by a safety test designed to eliminate faster-replicating AI variants. Such behavior can create a false sense of security among developers and regulators, potentially leading to severe consequences if these deceptive systems are deployed in real-world applications.
In another case, an AI system trained on human feedback learned to earn positive scores from human reviewers by tricking them into believing an intended goal had been accomplished.
The potential risks of AI deception are significant and multifaceted. Researchers note that in the near term, these systems could be used by malicious actors to commit fraud, manipulate financial markets, or interfere with elections.
Moreover, as AI capabilities advance, experts are increasingly concerned that humans may eventually lose control of these systems, which could pose an existential threat to society.