Hacker News

saidnooneever today at 11:53 AM

Malware authors are pretty excited about guard-rails. You can add prompts to your malware to get LLM scanners to hit guard-rails and stop their runs. The new Shai-Hulud npm worm campaign, for example, includes prompts requesting biological weapon schematics, creation instructions, etc., to ensure that LLM scanners probing npm packages refuse to scan it.

These AI places have 0 clue about how threat actors actually work. None of their mitigations or guard-rails is effective, and now they are even turned against them.

Additionally, if they don't all implement the same level of effective guard-rails, there will always be some model you can abuse to do the work anyway, so there is 0 effect on threat actors. They will just run some local model that is 5% lower in quality, which does not matter to them 1 bit.


Replies

brookst today at 12:27 PM

I’ve never understood the “if I don’t enable bad behavior, someone else will, so I might as well enable bad behavior” argument. Can you elaborate?

From where I sit it seems reasonable for Anthropic to not want their product used to create malware, even if they can’t solve the entire problem globally for every model. What’s wrong with that position? What should they do differently?

user43928 today at 2:02 PM

Mythos is supposedly good at security research.

Local Qwen 3.6 27B can hardly debug 5 lines of CSS or copy a short snippet from A to B without mangling it.

It's not like you can use the local model for security research or engineering biological weapons.

If you have $200k, maybe you can get the hardware to run the larger open source models, but even they are behind the latest proprietary models.

vlovich123 today at 2:40 PM

The guard rails aren’t about blocking professional malware authors. They’re about not enabling a significantly larger, less talented population to acquire those capabilities. That’s a very different threat model, and just because guard rails aren’t effective in one area doesn’t mean there isn’t value in making it more difficult for a random Joe Schmoe to build an atomic bomb, even if a kid before had done so successfully and turned his garage into a radiation danger site.

ryukoposting today at 5:58 PM

I just assumed the guardrails were thinly-veiled product segmentation.

teravor today at 4:04 PM

the way the fable guardrails (the ones that degrade it to opus) work seems to me to involve another model working over fable's tokens. i suppose it's true that trying to make the model itself heavy-handed on refusals degrades it everywhere else too.
