Hacker News

CODA: Rewriting Transformer Blocks as GEMM-Epilogue Programs

96 points by matt_d today at 4:54 AM | 12 comments

Comments

rahen today at 6:26 AM

Strictly speaking, this is very domain-specific and doesn't enable any performance that Triton couldn't already achieve (eliminating global memory round-trips via epilogue fusion is nothing new). The real takeaway is the design shift for LLM-driven codegen rather than handcrafted kernels.

LLMs are still bad at low-level hardware optimizations, but really good at high-level composition. Designing compiler abstractions with a restricted, composable API so an LLM can easily glue expert-written blocks together is a smart move. I suspect this will eventually become the norm for codegens as we move to agentic development.

augment_me today at 9:37 AM

TLDR:

Authors realize that global row-wise dependent functions like RMSNorm/LayerNorm have baked-in scales that are commutative in certain setups, so they can be moved out after a subsequent projection and be partially aggregated on tiles of rows.

So (W1 @ gamma * globally_computed_scale) * W2 can be written as (W1 @ gamma * W2) * globally_computed_scale, as long as the scale only has row-wise interactions.

This was usually not done before because left-to-right graph compilers like torch.compile can't assume that a global row-wise reduction between GEMMs can be commutative.
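
To make the identity concrete, here is a minimal plain-Python sketch. It is not taken from CODA; the function names, the RMSNorm formulation, and the `eps` value are assumptions for illustration. It only checks the algebra: a per-row scale applied before a projection gives the same result as the same scale applied after it. A real kernel would presumably accumulate the scale tile by tile alongside the GEMM, which this sketch does not attempt.

```python
import math
import random

def matmul(a, b):
    """Plain-Python matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def row_scales(x, eps=1e-6):
    """Per-row RMSNorm scale: 1 / sqrt(mean(x_i^2) + eps). eps is an assumed value."""
    return [1.0 / math.sqrt(sum(v * v for v in row) / len(row) + eps) for row in x]

def norm_then_project(x, gamma, w):
    """Reference order: scale each row, apply gamma per column, then run the GEMM."""
    s = row_scales(x)
    y = [[v * si * g for v, g in zip(row, gamma)] for row, si in zip(x, s)]
    return matmul(y, w)

def project_then_scale(x, gamma, w):
    """Deferred order: GEMM on the gamma-weighted input, row scale applied afterwards."""
    s = row_scales(x)
    xg = [[v * g for v, g in zip(row, gamma)] for row in x]
    out = matmul(xg, w)
    return [[v * si for v in row] for row, si in zip(out, s)]

# Small random problem (sizes are arbitrary, chosen only for the check).
random.seed(0)
rows, d_in, d_out = 4, 8, 3
x = [[random.uniform(-1, 1) for _ in range(d_in)] for _ in range(rows)]
gamma = [random.uniform(0.5, 1.5) for _ in range(d_in)]
w = [[random.uniform(-1, 1) for _ in range(d_out)] for _ in range(d_in)]

ref = norm_then_project(x, gamma, w)
fused = project_then_scale(x, gamma, w)

# Largest elementwise difference between the two orderings.
max_err = max(abs(a - b) for ra, rb in zip(ref, fused) for a, b in zip(ra, rb))
print(max_err < 1e-9)
```

The two orderings agree up to floating-point rounding because the scale is one scalar per row, so it factors out of every dot product in that row of the output.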

saagarjha today at 9:14 AM

Guys who have only written CUTLASS GEMM epilogue fusions, seeing their second kernel: Getting a lot of "GEMM epilogue fusion" vibes from this

maxignol today at 7:06 AM

« LLMs can successfully author CODA kernels » That might speed up progress in this area then

cold_harbor today at 2:32 PM

synthesis-only is the hard part. with execution feedback — run, profile, patch — the gap closes fast. it's basically an RL problem in disguise
