logoalt Hacker News

londons_exploretoday at 7:50 AM6 repliesview on HN

This sort of analysis is great.

Now why can't compilers do this sort of thing automatically?

Almost any problem seems to be possible to speed up 1000x in AVX512+days of thought compared to the naive version written in a python loop. If we could automate that whole process for big codebases the performance gains could be huge.


Replies

cmovqtoday at 8:04 AM

Compilers can’t really, in a meaningful way, change the layout of your data in memory. And you do need to think about your memory layout to get any benefit from SIMD. You’ll notice a lot of compiler auto vectorization insert many instructions just to shuffle data around to get to a usable layout, which negates much of the benefit.

show 3 replies
flohofwoetoday at 9:06 AM

> 1000x in AVX512+days of thought compared to the naive version written in a python loop

Out of this 1000x speedup you get 100x by just not using python though ;)

Also IIRC the main problem specifically with AVX512 was that mainstream CPUs simply didn't have it, so a smart compiler won't be of much use when the output code only runs on a handful devices.

diamondlovesyoutoday at 7:59 AM

> Now why can't compilers do this sort of thing automatically?

They do - they just can't assume GFNI instructions are present unless you explicitly say so: https://godbolt.org/z/eYasbKsse

mamcxtoday at 3:01 PM

> Now why can't compilers do this sort of thing automatically?

Because they are not query compilers, ie: They don't know the data.

For example a query compiler could swap index to full scan because it "see" (by runtime statistics) the data not benefit for it.

In the other hand, an optimization here can pessimism there. So optimizers in general should be very conservative because butterfly effects!

toshtoday at 2:26 PM

it is not easy for a compiler to vectorize

a pragmatic approach: write in a high level interpreted language that rhymes with modern CPUs, vector extensions, memory bandwidth

e.g. apl [0], bqn [1], k [2], kiwi [3]

  - vectors are dense (not boxed)
  - optimized internal representation (e.g. bitpacked bool vectors)
  - primitives act on vectors + use avx, neon if possible
[0] https://www.dyalog.com [1] https://mlochbaum.github.io/BQN/ [2] https://kx.com [3] https://kiwilang.com

great article by marshall on BQN performance compared to C and how to think about it

https://mlochbaum.github.io/BQN/implementation/versusc.html

related:

  - columnar databases: kdb, duckdb, clickhouse
  - machine learning frameworks: pytorch, keras, jax, mlx
lifthrasiirtoday at 9:10 AM

Intents matter. Compilers can't see through your skull to infer your intents and thus behave very conservatively unless you override that behavior somehow. This inference, alas, also takes (much) time, so compilers have to balance the compilation time with quality of intents guessed as well. (This is why we can't exactly use LLMs in mainstream compilers, by the way.) So go and make a programming language that preserves your intents by every means; but making it practical would be very difficult.