saidnooneever • today at 11:53 AM • 6 replies

Malware authors are pretty excited about guard-rails. You can add prompts to your malware that make LLM scanners hit their guard-rails and stop their runs. The new Shai-Hulud npm worm campaign, for example, includes prompts requesting biological weapon schematics/creation etc. to ensure that LLM scanners probing npm packages refuse to scan it.

These AI companies have zero clue about how threat actors actually work. None of their mitigations or guard-rails are effective, and now they are even being turned against them.

Additionally, unless they all implement equally effective guard-rails, there will always be some model you can abuse to do the work anyway. So there is zero effect on threat actors: they will just run some local model that is 5% lower quality, which doesn't matter to them one bit.

Replies

brookst • today at 12:27 PM

I’ve never understood the “if I don’t enable bad behavior, someone else will, so I might as well enable bad behavior” argument. Can you elaborate?

From where I sit it seems reasonable for Anthropic to not want their product used to create malware, even if they can’t solve the entire problem globally for every model. What’s wrong with that position? What should they do differently?

user43928 • today at 2:02 PM

Mythos is supposedly good at security research.

Local Qwen 3.6 27B can hardly debug 5 lines of CSS or copy a short snippet from A to B without mangling it.

It's not like you can use the local model for security research or engineering biological weapons.

If you have $200k, maybe you can get the hardware to run the larger open-source models, but even those are behind the latest proprietary models.

vlovich123 • today at 2:40 PM

The guard rails aren’t about blocking professional malware authors. They’re about not enabling a significantly larger, less talented population to acquire those capabilities. That’s a very different threat model, and just because it’s not effective in one area doesn’t mean there isn’t value in making it harder for some random Joe Schmoe to build an atomic bomb, even if a kid once did so successfully and turned his garage into a radiation hazard site.

ryukoposting • today at 5:58 PM

I just assumed the guardrails were thinly-veiled product segmentation.

teravor • today at 4:04 PM

The way the Fable guardrails (the ones that degrade it to Opus) work seems to me to involve another model working over Fable's tokens. I suppose it's true that making the model itself heavy-handed on refusals degrades it everywhere else too.

