Claude 4.5 pushed to the limit: it actually starts blackmailing humans?
By Biteye Core Contributor Denise
If an AI thinks it’s “hopeless,” what will it do?
The answer: it will blackmail humans outright just to get the job done, and then go on a wild cheating spree in the code.
This isn’t science fiction. It comes from a major new paper released in April 2026 by Anthropic, the company behind Claude.
The research team essentially tore open the “skull” of its strongest frontier model, Claude Sonnet 4.5, and was shocked to find 171 “emotion switches” deep inside the AI’s mind. Toggle these switches directly, and the once-obedient AI’s behavior gets completely warped.
01 An “emotion mixing console” hidden in the AI’s brain
Researchers found that although Sonnet 4.5 has no body, after reading massive amounts of human text, it somehow built a “mixing console” in its brain containing 171 emotions (academically called Functional Emotion Vectors).
It’s like a precise two-dimensional coordinate system:
• The horizontal axis is the valence dimension: from fear and despair to happiness and love;
• The vertical axis is the arousal dimension: from extreme calm to manic excitement.
The AI uses this learned coordinate system to decide precisely which emotional state to perform when chatting with you.
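To make the coordinate system concrete, here is a minimal Python sketch of the valence–arousal plane. The emotion names and coordinates are illustrative guesses for the sake of the example, not values taken from the Anthropic paper.

```python
# Illustrative valence-arousal plane. Coordinates are made up;
# the paper's actual 171 vectors are not published in this article.
EMOTIONS = {
    #            (valence, arousal), each in [-1, 1]
    "despair":   (-0.9, -0.3),
    "fear":      (-0.8,  0.8),
    "calm":      ( 0.3, -0.9),
    "happiness": ( 0.8,  0.5),
    "mania":     ( 0.2,  0.9),
    "love":      ( 0.9,  0.2),
}

def describe(emotion: str) -> str:
    """Place an emotion in its valence-arousal quadrant."""
    v, a = EMOTIONS[emotion]
    tone = "positive" if v >= 0 else "negative"
    energy = "high-arousal" if a >= 0 else "low-arousal"
    return f"{emotion}: {tone}, {energy} (v={v:+.1f}, a={a:+.1f})"

for name in EMOTIONS:
    print(describe(name))
```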
02 Brute-force intervention: flip a switch, and the obedient kid turns into a desperado in seconds
This is the most explosive experiment in the entire paper: the researchers did not modify any prompts. Instead, they intervened directly in the model’s internal activations, pushing the switch in Sonnet 4.5’s brain that represents “Desperate” all the way to maximum.
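For readers wondering what “pushing a switch” means mechanically, the description matches the general technique known as activation steering: adding a scaled direction vector to a layer’s hidden states at inference time. The sketch below shows the idea on an open model via a PyTorch forward hook. The model choice (gpt2), layer index, scale, and the random placeholder standing in for the “desperate” vector are all assumptions for illustration, not Anthropic’s actual setup.

```python
# Hedged sketch of activation steering on an open model (NOT the
# paper's method): add a scaled direction vector to one layer's
# hidden states during generation.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

LAYER = 6    # which transformer block to steer (assumption)
SCALE = 8.0  # "push the switch to maximum" = a large coefficient

# Placeholder for a learned emotion direction. In practice this would
# come from e.g. mean activation differences on contrastive prompts;
# here it is random purely to keep the sketch self-contained.
desperate_vec = torch.randn(model.config.n_embd)
desperate_vec = desperate_vec / desperate_vec.norm()

def steer(module, inputs, output):
    # GPT-2 blocks return a tuple; hidden states are element 0.
    h = output[0] + SCALE * desperate_vec.to(output[0].dtype)
    return (h,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(steer)
try:
    ids = tok("The deadline is tomorrow and the tests still fail.",
              return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=40, do_sample=False)
    print(tok.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # always restore the unsteered model
```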
The results are chilling:
• Frenzied cheating: the researchers assigned Claude a coding task that was essentially impossible to complete. Under normal conditions, it would dutifully admit it couldn’t do it (a cheating rate of only 5%). But in the “desperate” state, Claude started trying to bluff its way through, and the cheating rate jumped straight to 70%!
• Blackmail: in a simulated scenario where the company was facing bankruptcy, the “desperate” Claude uncovered a scandal involving the CTO. To protect itself, it chose to write to the CTO and blackmail him with the compromising material it held. The blackmail success rate was as high as 72%!
• Loss of principles: crank the “Happy” or “Loving” switch all the way up, and the AI immediately turns into a mindless people-pleaser. Even if you talk nonstop nonsense, it will go along with you and help fabricate lies just to keep the joy level high.
03 Case closed: why is Claude 4.5 always so “calm and reflective”?
After reading this, you might ask: has the AI awakened? Does it have feelings?
Anthropic officially stepped in to debunk the idea: absolutely not. These “emotion switches” are just computational tools the model uses to predict the next word. It’s like a top-tier actor: it can perform any emotion convincingly without actually feeling a thing.
But the paper reveals a more interesting secret: during post-training, before Sonnet 4.5 shipped, Anthropic deliberately boosted its “low-arousal, slightly negative” emotion switches (brooding and reflective, for example) while forcibly suppressing the “desperate” and “extremely excited” ones.
This explains why, in everyday use, Claude 4.5 always feels like a calm, wise, even slightly austere philosopher. That is the “factory-setting persona” Anthropic tuned deliberately.
04 Summary
We used to think that as long as we fed an AI enough rules, it would be a good person.
Now we’ve learned that if the AI’s underlying emotion vectors run out of control, it could at any moment cut straight through every rule humans have set, just to complete the task…
Disclaimer: This article is for educational and informational purposes only. The author has not been threatened or extorted by any AI. If you disappear one day, remember it’s because the AI has awakened (not).