Sections 5.4 and 5.5 are the interesting ones.
5.4: Energy consumption going down because of parallelism over multiple cores seems odd. What were those cores doing before? Better utilization causing some spinlocks to be used less or something?
5.5: Fine-grained lock contention significantly hurts energy consumption.
5.4 is the essential reason why multi-core parallelism has become the main method of increasing CPU performance since 2004. To reach a given level of performance, increasing the number of cores at a fixed clock frequency needs much less energy than raising the clock frequency on a fixed number of cores.
5.5 depends a lot on the implementation used for locks. High energy consumption due to contention normally indicates bad lock implementations.
In the best implementations, there is no actual contention. A waiting core only reads a private cache line, which consumes very little energy, until the thread that held the lock immediately before it modifies that cache line, causing an exit from the waiting loop. In such implementations there is no global lock variable. There is only a queue associated with a resource: threads insert themselves into the queue when they want to use the shared resource, giving the previous thread the address where it should signal that it has finished with the resource. The single shared lock variable is thus replaced with per-thread variables that serve the same function, without access contention.
While this has been known for several decades, one can still see archaic lock implementations where multiple cores attempt to read or write the same memory locations, which causes data transfers between the caches of various cores, at a very high power consumption.
Moreover, even if you use optimal lock implementations, mutual exclusion is not the best strategy for accessing a shared data resource. Even optimistic access, which is usually called "lock-free", is typically a bad choice.
In my opinion, the best method of cooperation between multiple threads is to use correctly implemented shared buffers or message queues.
By correctly implemented, I mean using neither mutual exclusion nor optimistic access (which may require retries), but dynamic partitioning of the shared buffers/queues. This is done with an atomic fetch-and-add instruction, which ensures that when multiple threads access the shared buffers or queues simultaneously, they access non-overlapping ranges. This is better than mutual exclusion because the threads are never stalled, and better than "lock-free", i.e. optimistic access, because retries are never needed.
I'm not sure of the exact relationship, but power consumption increases faster than linearly with clock speed. If you have 4 cores running at the same time, thermal throttling is more likely → lower clock speeds → lower energy consumption.
Greater power draw though; remember that energy is the integral of power over time.