Should have funded the entire GIL-removal effort by selling carbon credits. Here's an industry waiting to happen: issue carbon credits for optimizing CPU and GPU resource usage in established libraries.
> Similarly, workloads where threads frequently access and modify the same objects show reduced improvements or even degradation due to lock contention.
Perhaps I'm stating the obvious, but you deal with this with lock-free data structures, immutable data, siloing data per thread, fine-grained locks, etc.
Basically you avoid locks as much as possible.
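A minimal sketch of the per-thread siloing idea, using a hypothetical word-count-style aggregation: each worker fills its own private `Counter` instead of mutating a shared, lock-protected one, and the partial results are merged once at the end.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_chunk(chunk):
    # Thread-private accumulator: no shared state, so no lock needed.
    local = Counter()
    for item in chunk:
        local[item] += 1
    return local

def parallel_count(data, workers=4):
    # Silo the data: each thread gets its own chunk.
    size = max(1, len(data) // workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Merging happens in one thread, so it's contention-free.
        for partial in pool.map(count_chunk, chunks):
            total += partial
    return total
```

Contention only ever happens at the merge step, which runs in a single thread, so there are no locks anywhere in the hot loop.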
That reminded me of how back in 2008 I removed the GIL from Python to run thousands of Python modules in 10,000 threads. We were fighting for every clock cycle and byte, and it worked. It took nearly 20 years for GIL removal to land upstream and become available to the public.
Our experience on memory usage, in comparison, has been generally positive.
Previously we had to use ProcessPoolExecutor, which meant maintaining multiple copies of the runtime and shared data in memory and paying high IPC costs. Being able to switch to ThreadPoolExecutor was hugely beneficial in terms of both speed and memory.
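For readers who haven't made this switch: the two executors share the same API, so the swap itself is a one-line change. A hedged sketch with a stand-in CPU-bound task (the function and job sizes here are illustrative, not from the paper):

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def cpu_bound(n):
    # Stand-in for a CPU-heavy task. Under the free-threaded build,
    # threads can run this in parallel with no pickling or IPC cost.
    return sum(i * i for i in range(n))

def run(executor_cls, jobs):
    # Same API for both executors, so only the class name changes.
    with executor_cls(max_workers=4) as pool:
        return list(pool.map(cpu_bound, jobs))

# run(ProcessPoolExecutor, [10_000] * 8)  # old: one runtime per process, IPC
# run(ThreadPoolExecutor, [10_000] * 8)   # new: one runtime, shared memory
```

With processes, each worker holds its own copy of the interpreter and any shared data, and arguments/results cross a pickle boundary; with threads, everything lives in one address space.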
It almost feels like programming in a modern (circa 1996) environment like Java.
Might be worth noting that this seems to be just running some tests using the current implementation, and these are not necessarily general implications of removing the GIL.
Sections 5.4 and 5.5 are the interesting ones.
5.4: Energy consumption going down because of parallelism over multiple cores seems odd. What were those cores doing before? Better utilization causing some spinlocks to be used less or something?
5.5: Fine-grained lock contention significantly hurts energy consumption.
Can’t it just profile them and pick the right one accordingly?
Title shortened - Original title:
Unlocking Python’s Cores: Hardware Usage and Energy Implications of Removing the GIL
I am curious about the choice of the NumPy workload, given its more limited impact on CPython performance.
From [2603.04782] "Unlocking Python's Cores: Hardware Usage and Energy Implications of Removing the GIL" (2026) https://arxiv.org/abs/2603.04782 :
> Abstract: [...] The results highlight a trade-off. For parallelizable workloads operating on independent data, the free-threaded build reduces execution time by up to 4 times, with a proportional reduction in energy consumption, and effective multi-core utilization, at the cost of an increase in memory usage. In contrast, sequential workloads do not benefit from removing the GIL and instead show a 13-43% increase in energy consumption
I have a suspicion that this paper is basically a summary with some benchmarks, done with LLMs.
> Across all workloads, energy consumption is proportional to execution time
Race-to-idle used to be the best path before multicore. Now it's trickier to determine how to clock the device. Especially in battery powered cases. This is why all modern CPU manufacturers are looking into heterogeneous compute (efficiency vs performance cores).
Put differently, I don't think we should be killing ourselves over this at the software level. If you are actually concerned about the impact on raw energy consumption, you should move your workloads from AMD/Intel to ARM/Apple. Everything else would be noise compared to that.
One thing I'm curious about here is the operational impact.
In production systems we often see Python services scaling horizontally because of the GIL limitations. If true parallelism becomes common, it might actually reduce the number of containers/services needed for some workloads.
But that also changes failure patterns — concurrency bugs, race conditions, and deadlocks might become more common in systems that were previously "protected" by the GIL.
It will be interesting to see whether observability and incident tooling evolves alongside this shift.