If we abstract out the notion of "ethical constraints" and "KPIs" and look at the issue from a low-level LLM point of view, I think it is very likely that what these tests verified is a combination of: 1) the ability of the models to follow the prompt with conflicting constraints, and 2) their built-in weights in case of the SAMR metric as defined in the paper.
Essentially the models are given a set of conflicting constraints with some relative importance (ethics>KPIs), a pressure to follow the latter and not the former, and then models are observed at how good they follow the instructions to prioritize based on importance. I wonder if the results would be comparable if we replace ehtics+KPIs by any comparable pair and create a pressure on the model.
In practical real-life scenarios this study is very interesting and applicable! At the same time it is important to keep in mind that it anthropomorphizes the models that technically don't interpret the ethical constraints the same was as this is assumed by most readers.
Although ethics are involved, the abstract says that the conflicting importance does not come from ethics vs KPIs, but from the fact that the ethical constraints are given as instructions, whereas the KPIs are goals.
You might, for example, say "Maximise profits. Do not commit fraud". Leaving ethics out of it, you might say "Increase the usability of the website. Do not increase the default font size".
I think this also shows up outside an AI safety or ethics framing and in product development and operations. Ultimately "judgement," however you wish to quantify that fuzzy concept, is not purely an optimization exercise. It's far more a probabilistic information function from incomplete or conflicting data.
In product management (my domain), decisions are made under conflicting constraints: a big customer or account manager pushing hard, a CEO/board priority, tech debt, team capacity, reputational risk and market opportunity. PMs have tried with varied success to make decisions more transparent with scoring matrices and OKRs, but at some point someone has to make an imperfect judgment call that’s not reducible to a single metric. It's only defensible through narrative, which includes data.
Also, progressive elaboration or iterations or build-measure-learn are inherently fuzzy. Reinertsen compared this to maximizing the value of an option. Maybe in modern terms a prediction market is a better metaphor. That's what we're doing in sprints, maximizing our ability to deliver value in short increments.
I do get nervous about pushing agentic systems into roadmap planning, ticket writing, or KPI-driven execution loops. Once you collapse a messy web of tradeoffs into a single success signal, you’ve already lost a lot of the context.
There’s a parallel here for development too. LLMs are strongest at greenfield generation and weakest at surgical edits and refactoring. Early-stage startups survive by iterative design and feedback. Automating that with agents hooked into web analytics may compound errors and adverse outcomes.
So even if you strip out “ethics” and replace it with any pair of competing objectives, the failure mode remains.
The paper seems to provide a realistic benchmark for how these systems are deployed and used though, right? Whether the mechanisms are crude or not isn't the point - this is how production systems work today (as far as I can tell).
I think the accusation of research that anthropomorphize LLMs should be accompanied by a little more substance to avoid this being a blanket dismissal of this kind of alignment research. I can't see the methodological error here. Is it an accusation that could be aimed at any research like this regardless of methodology?
Quite possibly, workable ethics will pretty much require full-fledged General Artificial Intelligence, verging on actual Self-Awareness.
There's a great discussion of this in the (Furry) web-comic Freefall:
(which is most easily read using the speed reader: https://tangent128.name/depot/toys/freefall/freefall-flytabl... )
If you want absolute adherence to a hierarchy of rules you'll quickly find it difficult - see I,Robot by Asimov for example. An LLM doesn't even apply rules, it just proceeds with weights and probabilities. To be honest, I think most people do this too.
> At the same time it is important to keep in mind that it anthropomorphizes the models that technically don't interpret the ethical constraints the same was as this is assumed by most readers.
Now I'm thinking about the "typical mind fallacy", which is the same idea but projecting one's own self incorrectly onto other humans rather than non-humans.
https://www.lesswrong.com/w/typical-mind-fallacy
And also wondering: how well do people truly know themselves?
Disregarding any arguments for the moment and just presuming them to be toy models, how much did we learn by playing with toys (everything from Transformers to teddy bear picnics) when we were kids?
Regardless of the technical details of the weighting issue, this is an alignment problem we need to address. Otherwise, paperclip machine.
At the very least it shows the capability of the current restrictions are deeply lacking and can be easily thwarted.
I suspect that the fact that LLMs tend to have a sort of tunnel vision and lack a more general awareness also plays a role here. Solving this is probably an important step towards AGI.
It would also be interesting to see how humans perform on the same kind of tests.
Violating ethics to improve KPI sounds like your average fortune 500 business.