This scans very much as AI-written.

carbocation • yesterday at 9:22 PM • 2 replies

Replies

This is obvious Claude slop writing. The author would be well advised to use vale [1], with samples of their own writing as a guide.

> Performance begins with the roofline. On the M1 the engine holds about 12 fp16 TFLOP/s of compute against a DRAM-bandwidth ceiling. The roofline has a ridge point near 141 FLOP per byte, a 2 MB working-set threshold, a 0.23 ms floor under any single dispatch, and efficiency near 0.37 picojoules per FLOP at the compute optimum. On a 256-channel 3x3 convolution it runs about 3.8 times faster than the same chip’s GPU and 9 times more energy-efficient. The roofline pairs the engine’s throughput ceilings with its measured power.

> Reaching the engine is not the same as running an arbitrary graph on it. The operations the engine executes are distinct from the ones a capability bit only advertises. A feature attested in the hardware tables or accepted by the compiler frontend counts only once a compile-and-run confirms it, and several advertised operations, three-dimensional convolution among them, never lower to the engine at all. Weight compression on the direct path cuts bandwidth, not only stored size. On the unentitled engine, int4 lookup-table weights run about 2.37 times faster than fp16, and structured sparsity 1.55 to 1.64 times faster at 0.43 times the bytes.
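
For anyone trying to check the quoted numbers against each other: a roofline is the minimum of a compute roof and a bandwidth roof, so the quoted peak and ridge point imply a bandwidth figure that the passage never states. A minimal sketch, assuming the 12 TFLOP/s and 141 FLOP/byte values from the quote. The function and constant names are mine, and the bandwidth is derived, not something the author reports.

```python
# Roofline sanity check using only the two figures given in the quoted passage.
PEAK_FLOPS = 12e12        # fp16 FLOP/s, as quoted
RIDGE_INTENSITY = 141.0   # FLOP per byte, as quoted

# At the ridge point the two roofs meet, so bandwidth = peak / ridge.
# This is an implied value (assumption), not a number from the source.
IMPLIED_BANDWIDTH = PEAK_FLOPS / RIDGE_INTENSITY  # bytes per second


def attainable_flops(intensity: float) -> float:
    """Attainable FLOP/s at a given arithmetic intensity (FLOP per byte)."""
    return min(PEAK_FLOPS, IMPLIED_BANDWIDTH * intensity)


def implied_bandwidth_gbs() -> float:
    """Implied DRAM bandwidth in GB/s (decimal gigabytes)."""
    return IMPLIED_BANDWIDTH / 1e9


if __name__ == "__main__":
    print(f"implied bandwidth: {implied_bandwidth_gbs():.1f} GB/s")
    for intensity in (10.0, 70.5, 141.0, 500.0):
        tflops = attainable_flops(intensity) / 1e12
        print(f"{intensity:6.1f} FLOP/byte -> {tflops:.2f} TFLOP/s")
```

That works out to roughly 85 GB/s. Below the ridge a kernel is bandwidth-bound and above it the compute roof applies, which is all the "ridge point" sentence in the quote is saying.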
