I'm curious as to why 4.7 seems obsessed with avoiding any actions that could help the user cre...

cfcf14 • yesterday at 11:42 AM • 5 replies • view on HN

I'm curious as to why 4.7 seems obsessed with avoiding any actions that could help the user create or enhance malware. The system prompts seem similar on the matter, so I wonder if this is an early attempt by Anthropic to use steering vector injection?

The malware paranoia is so strong that my company has had to temporarily block use of 4.7 on our IDE of choice, as the model was behaving in a concerningly unaligned way, as well as spending large amounts of token budget contemplating whether any particular code or task was related to malware development (we are a relatively boring financial services entity - the jokes write themselves).

In one case I actually encountered a situation where I felt that the model was deliberately failing execute a particular task, and when queried the tool output that it was trying to abide by directives about malware. I know that model introspection reporting is of poor quality and unreliable, but in this specific case I did not 'hint' it in any way. This feels qualitatively like Claude Golden Gate Bridge territory, hence my earlier contemplation on steering vectors. I've been many other people online complaining about the malware paranoia too, especially on reddit, so I don't think it's just me!

Replies

daemonologist • yesterday at 12:53 PM

Note that these are the "chat" system prompts - although it's not mentioned I would assume that Claude Code gets something significantly different, which might have more language about malware refusal (other coding tools would use the API and provide their own prompts).

Of course it's also been noted that this seems to be a new base model, so the change could certainly be in the model itself.

➕ show 1 reply

ianberdin • yesterday at 8:52 PM

No, you underestimate how huge the malware problem right now. People try publish fake download landing pages for shell scripts or even Claude code on https://playcode.io every day. They pay for google ads $$$ to be one the top 1 position. How Google ads allow this? They can’t verify every shell script.

No I am not joking. Every time you install something, there is a risk you clicked a wrong page with the absolute same design.

➕ show 1 reply

sensanaty • yesterday at 11:59 PM

Their marketing is going overtime into selling the image that their models are capable of creating uber sophisticated malware, so every single thing they do from here on out is going to have this fear mongering built in.

Every statement they make, hell even the models themselves are going to be doing this theater of "Ooooh scary uber h4xx0r AI, you can only beat it if you use our Super Giga Pro 40x Plan!!". In a month or two they'll move onto some other thing as they always do.

dandaka • yesterday at 11:48 AM

I have started to notice this malware paranoia in 4.6, Boris was surprised to hear that in comments, probably a bug

ricardobeat • yesterday at 5:41 PM

Presumably because it has become extremely good at writing software, and if it succeeds at helping someone spread malware, especially one that could use Claude itself (via local user's plans) to self-modify and "stay alive", it would be nearly impossible to put back in the bottle.

➕ show 1 reply

alt Hacker News

Replies