logoalt Hacker News

dgb23today at 11:27 AM2 repliesview on HN

Not a C++ programmer and I think the solution is neat.

But it's not necessarily an apples to apples comparison. It's not unfair to python because of the runtime overhead. It's unfair because it's a different algorithm with fundamentally different memory characteristics.

A fairer comparison would be to stream the file in C++ as well and maintain internal state for the count. For most people that would be the first/naive approach as well when they programmed something like this I think. And it would showcase what the actual overhead of the python version is.


Replies

VorpalWaytoday at 11:33 AM

> A fairer comparison would be to stream the file in C++ as well and maintain internal state for the count.

Wouldn't memory mapping the data in Python be the more fair comparison? If the language doesn't support that, then this seems to absolutely be a fair comparison.

> For most people that would be the first/naive approach as well when they programmed something like this I think.

I disagree, my mind immediately goes to mmap when I have to deal with a single file that I have to read in it's entirety. I think the non-obvious solution here is rather io-uring (which I would expect to be faster if dealing with lots of small files, as you can load them async concurrently from the file system).

show 1 reply
zahlmantoday at 4:03 PM

> It's unfair because it's a different algorithm with fundamentally different memory characteristics. A fairer comparison would be to stream the file in C++ as well and maintain internal state for the count.

The C++ code is still building a tally by incrementing keys of a hash map one at a time, and then dumping (reversed) key/value pairs out into a list and sorting. The file is small and the Python code is GCing the `line` each time through the outer loop. At any rate it seems like a big chunk of the Python memory usage is just constant (sort of; stuff also gets lazily loaded) overhead of the Python runtime, so.