CODA: Rewriting Transformer Blocks as GEMM-Epilogue Programs

96 points • by matt_d • today at 4:54 AM • 12 comments

Comments

Strictly speaking, this is very domain-specific and doesn't enable any performance that Triton couldn't already achieve (eliminating global memory round-trips via epilogue fusion is nothing new). The real takeaway is the design shift for LLM-driven codegen rather than handcrafted kernels.

LLMs are still bad at low-level hardware optimizations, but really good at high-level composition. Designing compiler abstractions with a restricted, composable API so an LLM can easily glue expert-written blocks together is a smart move. I suspect this will eventually become the norm for codegens as we move to agentic development.

augment_me • today at 9:37 AM

TLDR:

Authors realize that global row-wise dependent functions like RMSNorm/LayerNorm have baked-in scales that are commutative in certain setups, so they can be moved out after a subsequent projection and be partially aggregated on tiles of rows.

So (W1 @ gamma * globally_computed_scale) * W2 can be written as (W1 @ gamma * W2) * globally_computed_scale, as long as the scale only interacts along rows.

This was usually not done before because left-to-right graph compilers like torch.compile can't assume that a global row-wise reduction between GEMMs can be commutative.
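
A quick way to convince yourself of the row-scale identity is a minimal pure-Python sketch. It assumes plain RMSNorm (no mean subtraction, so it does not cover LayerNorm as-is), and the helper names `norm_then_project` and `project_then_scale` are made up for illustration. They are not CODA's API, and nothing here is tiled or fused.

```python
import math

def matmul(a, b):
    # Naive dense matmul on lists of lists: (n x k) @ (k x m) -> (n x m).
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def inv_rms(row, eps=1e-6):
    # Per-row scalar 1 / sqrt(mean(x^2) + eps): the "global" row-wise reduction.
    return 1.0 / math.sqrt(sum(v * v for v in row) / len(row) + eps)

def norm_then_project(x, gamma, w):
    # Reference order: RMSNorm first, then the projection.
    normed = []
    for row in x:
        s = inv_rms(row)
        normed.append([v * g * s for v, g in zip(row, gamma)])
    return matmul(normed, w)

def project_then_scale(x, gamma, w):
    # Reordered: apply gamma, run the GEMM, then apply the row scale afterwards
    # (the part that could live in a GEMM epilogue).
    pre = [[v * g for v, g in zip(row, gamma)] for row in x]
    y = matmul(pre, w)
    return [[v * inv_rms(row) for v in out_row] for row, out_row in zip(x, y)]

# Hypothetical toy inputs: 2 rows, hidden size 3, projection to 2 columns.
x = [[1.0, 2.0, 3.0], [-4.0, 0.5, 2.0]]
gamma = [0.5, 1.5, -1.0]
w = [[1.0, 0.0], [2.0, -1.0], [0.5, 3.0]]

a = norm_then_project(x, gamma, w)
b = project_then_scale(x, gamma, w)
max_diff = max(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))
print(max_diff)  # should be at floating-point noise level
```

The two orders agree because the scale is one scalar per row, so it factors out of every dot product in that row. In a real kernel the sum of squares would presumably be accumulated per tile alongside the GEMM, which this sketch does not attempt.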

saagarjha • today at 9:14 AM

Guys who have only written CUTLASS GEMM epilogue fusions, seeing their second kernel: Getting a lot of "GEMM epilogue fusion" vibes from this

maxignol • today at 7:06 AM

« LLMs can successfully author CODA kernels » That might speed up progress in this area then

cold_harbor • today at 2:32 PM

synthesis-only is the hard part. with execution feedback — run, profile, patch — the gap closes fast. it's basically an RL problem in disguise
