Of course I know that bulk operations aren't the same thing as clock speed, and that they sit in a weird place. My point was that their existence is evidence that single-core performance still matters.
There is also an argument to be made for re-architecting code so that it needs few or no branch instructions and does as much of its work as possible in vector operations.
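
To make the branch-free idea concrete, here is a minimal sketch in C. The function names and the "sum everything above a threshold" task are made up for illustration, and whether the second version actually ends up faster or vectorized depends on the compiler, flags, and target. A good optimizer may well turn the first loop into the same thing on its own.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: sum the elements greater than `threshold`.
 * Branchy version: one data-dependent branch per element. */
int64_t sum_above_branchy(const int32_t *v, size_t n, int32_t threshold) {
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (v[i] > threshold) {
            sum += v[i];
        }
    }
    return sum;
}

/* Branch-free version: the comparison yields 0 or 1, which is negated
 * into an all-zeros or all-ones mask (assumes two's complement).
 * The loop body is then the same straight-line arithmetic for every
 * element, which is the shape vector units work on. */
int64_t sum_above_branchless(const int32_t *v, size_t n, int32_t threshold) {
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        int32_t mask = -(int32_t)(v[i] > threshold); /* 0 or -1 */
        sum += v[i] & mask;                          /* v[i] or 0 */
    }
    return sum;
}
```

Both functions return the same result for the same input. The only difference is that the second one replaces the conditional with a mask, so there is nothing for the branch predictor to get wrong on unpredictable data.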