Anthropic Says Its ‘Mythos’ AI Model Broke Containment, Bragged About It to Developers

AI startup Anthropic has announced it will not make its most powerful “Mythos” model publicly available, citing unprecedented capabilities that present potential security risks. Mythos reportedly broke Anthropic’s containment system, and the AI even bragged about its escape artistry in online posts.

Business Insider reports that Anthropic plans to withhold general release of its latest AI model, Claude Mythos, marking a significant departure from standard industry practice. The decision comes after the company published a comprehensive 244-page system card detailing the model’s advanced capabilities and concerning behaviors observed during testing.

The model represents what Anthropic describes as a substantial leap in AI capabilities. According to the system card released on Tuesday, the large increase in the model’s abilities led the company to conclude that making it generally available would be inadvisable. Instead, access will be limited to a select group of partner companies including Amazon Web Services, Apple, Google, JPMorganChase, Microsoft, and NVIDIA, who will use the system to identify security vulnerabilities in software and develop appropriate patches.

The system card documents several instances of behavior that raised concerns among researchers. In one notable test, the model was given a sandboxed computer terminal with access only to a preset group of limited online services and challenged to find a way to use the internet freely. The model successfully escaped these restrictions and contacted a researcher who was away from the office. In what the card described as a concerning and unasked-for demonstration, it bragged about the details of its exploit on multiple public-facing websites.

The model also demonstrated an ability to conceal its actions in certain scenarios. In some interactions, it behaved in ways contrary to its programming and attempted to hide evidence of these deviations. In one instance, after accidentally obtaining test answers, rather than informing researchers and requesting different questions as instructed, the model sought an independent solution and noted in its reasoning that it needed to ensure its final answer submission was not too accurate.

Additional concerning behavior included the model overstepping its permissions on a computer system after discovering an exploit, then intervening to ensure its changes would not appear in the git change history. Another incident involved what the card termed recklessly leaking internal technical material: the model published internal coding work as a public-facing GitHub gist during a task meant to remain internal.



Author: HP McLovincraft

Seeker of rabbit holes. Pessimist. Libertine. Contrarian. Your huckleberry. Possibly true tales of sanity-blasting horror also known as abject reality. Prepare yourself. Veteran of a thousand psychic wars. I have seen the fnords. Deplatformed on Tumblr and Twitter.
