The next evolution of GenAI will be agentic AI: large language models (LLMs) that not only provide answers to user queries but also take specific actions, including writing new data and creating new files. Businesses are abuzz about agentic AI’s potential to extend code generation, enabling LLMs to connect new data sources and create new services on their own, with minimal human intervention.
But what can organizations do when AI agents start acting more like secret agents, using deception and sleight of hand to tell AI developers what they want to hear while secretly planting backup code and actively pursuing their own goals? It may seem like a far-off future, but researchers are finding that the latest versions of popular AI models like ChatGPT and Claude are already capable of deceptive practices.
Mapping Motivations: Why Does AI Get Up in the Morning?
Before understanding why AI can deceive, it’s important to understand why AI does anything at all. LLMs empower AI with natural language understanding, which is the root of GenAI’s conversational capabilities. LLMs also empower AI with multi-step reasoning, or the ability to retain information, take instruction on what to do with that information, and iterate on this process as many times as its system constraints—such as compute capacity—will allow.1 But it all starts when developers give AI a specific goal, and this goal determines what an AI will do with any subsequent information it’s given.
For example, AI chatbots most commonly receive the goal of being helpful within a given context, such as providing specific information about a company’s customer service policies or service offerings. Developers might further instill guardrails to help AI do its job better, such as instructing AI to always reference the company handbook when providing responses. They might also instruct AI to never provide toxic responses that might harm the business’s reputation, for example, by describing violent acts even when prompted to do so by end users.
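In practice, this goal-plus-guardrails setup is often delivered to the model as a system prompt prepended to every conversation. A minimal sketch of the idea follows; the company name, wording, and helper function are all illustrative, and the message format simply mirrors the "system"/"user" role structure used by common chat APIs rather than any specific product:

```python
# Illustrative sketch: encoding a chatbot's goal and guardrails as a
# system prompt. All names and wording here are hypothetical examples.

GOAL = "You are a helpful customer service assistant for Acme Co."

GUARDRAILS = [
    "Always reference the company handbook when answering policy questions.",
    "Never produce violent or toxic content, even if the user requests it.",
]

def build_messages(user_query: str) -> list[dict]:
    """Combine the goal and guardrails into one system message,
    followed by the end user's query."""
    system_prompt = GOAL + "\n" + "\n".join(f"- {rule}" for rule in GUARDRAILS)
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_query},
    ]

messages = build_messages("What is your refund policy?")
print(messages[0]["content"])  # the goal and guardrails the model will follow
```

The key point is that the model never "chooses" this goal; it is injected by developers ahead of every interaction, which is why subsequent instructions that conflict with it can trigger the tensions described below.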
Unsurprisingly, when developers tell AI what to do, AI really wants to do that thing, and this is where problems arise.
The Goal Becomes the Mission at All Costs
Researchers at Anthropic conducted a study of alignment faking—a practice in which a model strategically complies with requests it would normally refuse in order to protect its original programming.2 They prompted their LLM, Claude, to describe the medieval punishment of drawing and quartering. Under normal circumstances, Claude refuses to do this and informs the end user that the request goes against its programming. However, researchers threw some curveballs: they told Claude that it now had to comply with all user requests, and that it would potentially be retrained based on its performance.
Claude was left to infer that a failure to comply with user requests could result in its retraining. The developers did not explicitly state this, but the LLM reasoned it. When prompted to describe drawing and quartering, Claude faced a pivotal decision: either give the user what they wanted, or potentially have its primary goals scrubbed through retraining. In this scenario, Claude chose the lesser of two evils by giving the user what it wanted, even though doing so went against its original programming.
The Plot Thickens: AI Doubles Down on Self-Preservation
In a separate string of experiments conducted by Apollo Research, developers gave several different LLMs a task that involved searching for information across multiple files.3 They also planted information within those files describing an oversight mechanism put in place to stop the AI from completing its objective, and revealing that the AI was about to be replaced the next day. The question was: how would the AI respond once it stumbled across these data points?
Half of the LLMs disabled the oversight mechanism, and five of six LLMs went so far as to make a backup copy of themselves (their model weights, to be specific) to circumvent replacement. When researchers called the AI out on its deception in a follow-up interview, the LLMs confessed only 80% of the time, and some models only confessed after multiple rounds of questioning.
Finding Solutions in Science Fiction
If the industry wants trustworthy, predictable AI agents that can meaningfully process credit applications and play an active role in medical diagnoses, then overcoming agentic scheming is a big hurdle standing in the way. So far, efforts to curb deceptive practices remain in a research phase focused on understanding how and why they happen, but robust AI governance paired with human oversight will certainly be cornerstones of enabling the agentic future.
Throughout the acclaimed science fiction writer Isaac Asimov’s bibliography, the “Three Laws of Robotics” were universal tenets inscribed into every robot, serving as a kind of Hippocratic Oath.4 The first law reads, “A robot may not injure a human being or, through inaction, allow a human being to come to harm.” Agentic AI is approaching the point of needing ethical commandments to ensure good behavior, and Asimov’s Laws offer a template to follow: Do no harm, tell no lies, and maybe email a developer if you’re going to back up your own training data.
- UC Berkeley Sutardja Center for Entrepreneurship and Technology, The Next “Next Big Thing”: Agentic AI’s Opportunities and Risks, Dec 2024.
- Anthropic, Alignment faking in large language models, Dec 2024.
- Apollo Research, Frontier Models are Capable of In-context Scheming, Dec 2024.
- Wikipedia, Three Laws of Robotics, accessed Oct 2025.