Hacker News

timschmidt, yesterday at 5:51 AM

Reading weights out of memory is the definition of a large linear read. I'm a bit mystified that no one has put an embarrassingly parallel flash storage controller next to some tensor processors on a PCIe card. It could have 4 TB of flash hanging off enough channels to saturate SRAM, skipping DRAM entirely, and could even offload prompt processing to a GPU in the same workstation, so long as it got reasonable tokens/s in inference. I'd buy one tomorrow.
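A back-of-the-envelope sketch of why this is a bandwidth problem: if decoding is limited by weight reads, tokens/s can be no higher than aggregate read bandwidth divided by the bytes of weights touched per token. The function name, channel count, per-channel speed, and per-token read size below are all illustrative assumptions, not figures for any real card.

```python
def max_tokens_per_second(bandwidth_bytes_per_s: float,
                          active_weight_bytes: float) -> float:
    """Upper bound on decode speed when weight reads are the bottleneck."""
    if bandwidth_bytes_per_s <= 0 or active_weight_bytes <= 0:
        raise ValueError("bandwidth and weight size must be positive")
    return bandwidth_bytes_per_s / active_weight_bytes


# Hypothetical card: 64 flash channels, each sustaining 1 GB/s of linear reads.
channels = 64
per_channel_bytes_per_s = 1e9
aggregate = channels * per_channel_bytes_per_s

# Hypothetical model that touches 20 GB of weights per generated token.
bound = max_tokens_per_second(aggregate, 20e9)
print(f"upper bound: {bound:.1f} tokens/s")
```

The bound ignores compute, KV-cache traffic, and controller overhead, so a real device would land below it.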


Replies

adrian_b, yesterday at 6:21 AM

Over the last year, several companies have been developing products that include HBF (high-bandwidth flash memory) as a supplement to HBM, in order to run inference for big LLMs at a reasonable cost, e.g. on one GPU-like card.

HBF was initially announced by SanDisk early in 2025. Then, early this year, Hynix announced that it had joined SanDisk in producing HBF, and that the common specification would be standardized under the Open Compute Project.

With HBF, it would be easy to make a GPU card with 4 TB of HBF, which could run the biggest existing open-weight LLMs in their native unquantized form.
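The capacity side of that claim is simple arithmetic: weight storage is parameter count times bytes per parameter. A minimal sketch, assuming "native unquantized" means 16-bit weights (2 bytes per parameter) and using a hypothetical 1-trillion-parameter model rather than any specific release:

```python
def weight_bytes(num_params: int, bytes_per_param: int = 2) -> int:
    """Storage needed for the weights alone (no KV cache, no activations)."""
    return num_params * bytes_per_param


def fits(num_params: int, capacity_bytes: int, bytes_per_param: int = 2) -> bool:
    """True if the weights fit in the given capacity."""
    return weight_bytes(num_params, bytes_per_param) <= capacity_bytes


HBF_CAPACITY = 4 * 10**12        # 4 TB, decimal terabytes
HYPOTHETICAL_PARAMS = 10**12     # assumed 1T-parameter model

needed = weight_bytes(HYPOTHETICAL_PARAMS)
print(f"{needed / 10**12:.1f} TB needed, fits: {fits(HYPOTHETICAL_PARAMS, HBF_CAPACITY)}")
```

Under these assumptions the weights take 2 TB, leaving half of the card's flash free.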

zozbot234, yesterday at 8:19 AM

For sparse MoE models, the individual expert layers that inference actually selects for each token are quite small: single-digit megabytes or so.
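A rough way to sanity-check that size: one expert is a small feed-forward block, so its footprint is the number of weight matrices times their dimensions times bytes per parameter. The sketch below assumes a gated feed-forward expert with three matrices, 8-bit weights, and hypothetical dimensions; none of these come from a specific model.

```python
def expert_bytes(d_model: int, d_ff: int,
                 num_matrices: int = 3, bytes_per_param: int = 1) -> int:
    """Weight footprint of a single expert's feed-forward block.

    num_matrices=3 assumes a gated design (gate, up, and down projections).
    """
    return num_matrices * d_model * d_ff * bytes_per_param


# Hypothetical dimensions, chosen only for illustration.
size = expert_bytes(d_model=2048, d_ff=768)
print(f"{size / 1e6:.2f} MB per expert per layer")
```

With these assumed numbers one expert in one layer comes to about 4.7 MB, so each read is small even though the full model is large.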