Hacker News

Fault Tolerant Llama training

52 points by Mougatine last Monday at 9:30 AM | 8 comments

Comments

d4l3k today at 12:16 AM

Hey, nice to see this here!

I'm the primary author so happy to answer any questions you might have!

zxexz today at 6:12 AM

This is awesome, can’t wait to try out these techniques. At least a week a year of my time for the past few years has gone towards recovering from a fault crashing a training run. Sometimes environment-related, sometimes shared storage, sometimes just because of a slightly faulty IB cable.

bjt12345 today at 1:47 AM

This is severely underrated work. Why aren't there more mid-sized companies helping with this? Ultra Ethernet just got released.

anonymousDan today at 9:21 AM

What kind of failures are you typically concerned with here?

timzaman today at 12:31 AM

300 L40s? What's this, 1998?
