logoalt Hacker News

hinkleytoday at 7:57 PM0 repliesview on HN

For performance reasons we needed a set of assets on all copies of a service. We were using consul for the task management, which is effectively a tree of data that’s tantamount to a json file (in fact we usually pull trees of data as a json file).

Among other problems I knew the next thing we were going to have to do was autoscaling and the system we had for call and response was a mess from that respect. Unanswered questions were: How do you know when all agents have succeeded, how do you avoid overwriting your peers’ data, and what do you do with agents that existed yesterday and don’t today?

I ended up rewriting all of the state management data so that each field had one writer and one or more readers. It also allowed me to move the last live service call for another service and decommission it. Instead of having a admin service you just called one of the peers at random and elected it leader for the duration of that operation. I also arranged the data so the leader could watch the parent key for the roll call and avoid needing to poll.

Each time a task was created the leader would do a service discovery call to get a headcount and then wait for everyone to set a suggests or failure state. Some of these state transitions were idempotent, so if you reissued a task you didn’t need to delete the old results. Everyone who already completed it would noop, and the ones that failed or the new servers that joined the cluster would finish up. If there was a delete operation later then the data would be purged from the data set and the agents, a subsequent call would be considered new.

Long story short, your CS program should have distributed computing classes because this shit is hard to work out from first principles when you don’t know what the principles even are.