The LinkedIn site reliability engineering team has a problem…albeit a good problem. They are tasked with running the business world’s most popular social network, with hundreds of millions of users using its sites every day. When LinkedIn implemented SaltStack almost four years ago, it had about 5,000 Salt Minions under management. That number has ballooned to more than 70,000 today.
So how did LinkedIn scale SaltStack to meet that kind of growth? In his SaltConf15 talk, Thomas Jackson, LinkedIn senior site reliability engineer, discusses the valuable lessons his team learned while growing their SaltStack-managed infrastructure and explores what worked and what didn’t work for them.
The question facing Jackson when he began using SaltStack in 2012 was, “What could we do to make sure our infrastructure was more reliable, available and performant? To improve reliability, you need to make sure you write maintainable, debuggable code that works. And then test, test, test before rolling into production. Next, to have strong availability, you need to first define what availability means to you and then take steps to measure, monitor and remediate. Finally, to improve performance, you can do less, faster and better, while prioritizing what areas of performance need to improve.”
For more specifics on what LinkedIn did to improve the reliability, availability, and performance of SaltStack, watch Jackson’s SaltConf15 talk, “SaltStack at Web Scale…Better, Stronger, Faster,” embedded below, or check out the slides on SlideShare. You can also find more great talks in the SaltConf15 video content blog post.