Of course they do have emotions as an internal circuit or abstraction; that's fully expected from intelligence at least at some point. But interpreting these emotions as human-like is a clear blunder. How do you tell the shoggoth likes or dislikes something, feels desperation or joy? Because it said so? How do you know those words mean the same to it as they do to us? Our internal states are absolutely incompatible. We share a lot of our "architecture" and "dataset" with some complex animals, and even then we barely understand many of their emotions. What does a hedgehog feel when eating its babies? This thing is 100% unlike a hedgehog or a human; it exists in its own bizarre time projection and nothing of it maps to your state. It's a shapeshifting alien.
In mechinterp you're reducing this hugely multidimensional and incomprehensible internal state to understandable text through the lens of the dataset you picked. It's inevitably a subjective interpretation; you're painting familiar faces on a faceless thing.
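To make the "lens of the dataset" point concrete, here's a minimal sketch (entirely invented data, labels, and dimensions, not Anthropic's actual method): a linear probe that collapses a high-dimensional hidden state into a single human-chosen label, which is the only vocabulary the probe can ever answer in.

```python
# Minimal sketch (hypothetical shapes and labels): a linear probe reduces a
# high-dimensional hidden state to a human-chosen label. The probe can only
# ever answer in terms of the dataset we picked - the "lens".
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Pretend activations: 1,000 prompts x 4,096-dim residual-stream vectors,
# plus human annotations of which prompts "express joy".
acts = rng.normal(size=(1000, 4096))      # stand-in for model activations
labels = rng.integers(0, 2, size=1000)    # stand-in for human "joy" labels

probe = LogisticRegression(max_iter=1000).fit(acts, labels)

# From here on, any new activation gets summarized as p("joy") - a single
# scalar readout of a 4,096-dim state, defined entirely by our labels.
new_act = rng.normal(size=(1, 4096))
print("p(joy) =", probe.predict_proba(new_act)[0, 1])
```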
Anthropic researchers are heavily biased to see what they want to see, and that is the biggest danger in research.
I like to call this Frieren's Demon. In that show, it's explained that demons evolved with no common ancestor with humans, yet they speak human language; they learned it in order to hunt humans. That history gives them a fundamentally different understanding of words and language.
Now, I don't personally believe this is an intelligence at all, but it's possible I'm wrong. What we have with these machines is a different evolutionary reason for it speaking our language (we built it to speak our language ourselves). Its understanding of our language, and of our images, is completely alien. If it is an intelligence, I could believe that the way it makes mistakes in image generation, and the strange logical mistakes it makes that no human would make, are simply a result of that alien understanding.
After all, a human artist learning to draw hands makes mistakes, but those mistakes are rooted in a human understanding (e.g. the effects of perspective when translating a 3D object to 2D). The machine, with a different understanding of what a hand is, will instead render extra fingers (it does not conceptualize a hand as a 3D object at all).
Though, again, I still just think it's an incomprehensible amount of data going through a really impressive pattern matcher. The result is still language coming out of a machine, which is really interesting. The only reason I'm not super confident it isn't an intelligence is that I can't really rule out that I'm not also an incomprehensible amount of data going through a really impressive pattern matcher, just built different. I do feel like I would know a real intelligence after interacting with it for long enough, though, and none of these models feel like a real intelligence to me.
> interpreting these emotions as human-like is a clear blunder. How do you tell the shoggoth likes or dislikes something, feels desperation or joy? Because it said so? How do you know those words mean the same to it as they do to us?
I think you have it backwards.
Those vectors are exactly what it says: they affect the output, and we can measure that.
And it's exactly what those words mean for us, because that's what the vectors are measured against.
The main problem isn't "is its emotion the same as ours" but "does it apply our emotion as emotion".
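A rough sketch of what "we can measure it" could look like in practice (toy numbers throughout, hypothetical names, not any particular lab's recipe): derive a direction from examples we labeled, add it to a hidden state, and check that the output distribution actually moves.

```python
# Toy sketch of the "measure it" claim (all numbers invented): take a
# direction in activation space, add it to the hidden state, and check that
# the output distribution shifts. The direction "means" joy only because it
# was derived from examples we labeled as joyful.
import numpy as np

rng = np.random.default_rng(1)
d_model, vocab = 64, 10
W_unembed = rng.normal(size=(d_model, vocab))    # stand-in output head

def next_token_probs(hidden):
    logits = hidden @ W_unembed
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

# "Joy direction": mean activation on labeled joyful prompts minus the mean
# on neutral prompts (a common difference-of-means steering recipe).
joy_acts = rng.normal(loc=0.5, size=(100, d_model))
neutral_acts = rng.normal(loc=0.0, size=(100, d_model))
joy_dir = joy_acts.mean(axis=0) - neutral_acts.mean(axis=0)

hidden = rng.normal(size=d_model)
base = next_token_probs(hidden)
steered = next_token_probs(hidden + 3.0 * joy_dir)   # scale is arbitrary here

# The measurable effect: the output distribution moves when we add the vector.
print("total variation distance:", 0.5 * np.abs(base - steered).sum())
```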
I don't think anything you said here contradicts what they said. They take great pains throughout the blog post to explain that the model does not "experience" these "emotions"; that they're not emotions in the human sense but models of emotions (both the expected human emotional response to a prompt and the emotions another character in the prompt is experiencing) and functional emotions (in that they can influence behavior); and that any apparent emotions the model may show are it playing a character.
I think a counterargument would be convergent evolution: there are various examples in nature where a certain feature evolved independently several times, without any genetic connection, because (as far as we understand) the evolutionary pressures were similar.
One obvious example would be wings, where you have several different strategies - feathers, insect wings, bat-like wings, etc - that have similar functionality and employ the same physical principles, but are "implemented" vastly differently.
You have similar examples in brains, where e.g. corvids are capable of various cognitive feats that would involve the neocortex in human brains - only their brains don't have a neocortex. Instead they seem to use certain other brain regions for that, which don't have an equivalent in humans.
Nevertheless it's possible to communicate with corvids.
So this makes me wonder whether a different "implementation" necessarily means the results are incomparable.
In the interest of falsifiability: what behavior or internal structures in LLMs would be convincing enough that these are "real" emotions?