This is fantastic, great work. I will attempt to run it on my 16GB M1 but I doubt it'll run.
Out of curiosity, how did you go about replacing the CUDA-specific ops? Were there any resources you relied on, or was it just experience? I'd love to learn more.