John Fremlin's blog: Making reliability out of randomness

Posted 2022-04-25 22:00:00 GMT

The cloud is an amazing thing. You can just buy more computers instantly. You can use that to buy reliability, by putting multiple replicas in different failure domains.

The arithmetic for this is very simple: if you have your system running in n failure domains D1, D2, D3, ..., Dn, and somehow make it work so that it survives as long as it is running in at least one of them, then it is only broken when it is broken in all of them. Let P(Di) be the probability that failure domain Di (say, datacenter i) is broken. If the failures are independent, the probability that the application is working is 1 - P(D1)P(D2)P(D3)...P(Dn). We know from experience of course that some datacenters are flakier than others, but if we assume that each fails independently with the same probability p this is just 1 - p^n. It doesn't actually matter whether the application is flaky or the datacenter is, as long as the failures aren't correlated.

If you have an application where one copy works 90% of the time (p = 0.1), then you can make one that is up 99.9% of the time with copies in three failure domains, since 1 - 0.1^3 = 0.999.
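To make the arithmetic concrete, here is a minimal Python sketch. The function names (availability, availability_identical, replicas_needed) are made up for illustration, and everything here assumes failures are independent, which is exactly the assumption that real systems tend to violate.

```python
import math


def availability(failure_probs):
    """Probability that at least one replica is up.

    failure_probs holds P(Di) for each failure domain; the product formula
    is only valid if the failures are independent.
    """
    return 1.0 - math.prod(failure_probs)


def availability_identical(p, n):
    """Special case: n replicas, each broken with the same probability p."""
    return 1.0 - p ** n


def replicas_needed(p, target):
    """Smallest n such that n independent replicas reach the target availability."""
    if not (0.0 <= p < 1.0) or not (0.0 <= target < 1.0):
        raise ValueError("need 0 <= p < 1 and 0 <= target < 1")
    n = 1
    # Compare failure probabilities: we need p^n <= 1 - target.
    while p ** n > 1.0 - target:
        n += 1
    return n
```

With p = 0.1, availability_identical(0.1, 3) comes out at about 0.999, and replicas_needed(0.5, 0.9) says a coin-flip service needs four copies to be up 90% of the time.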

Of course, this doesn't help with bugs like a query of death where the software crashes every time it gets the request. Then sending it around the datacenters will just crash them all.
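A small Monte Carlo sketch shows how much a correlated failure hurts. The model is an assumption for illustration only: in each trial, with probability common_mode something like a query of death takes out every replica at once; otherwise each replica fails independently with probability p. The function name and parameters are hypothetical.

```python
import random


def simulate_outage_rate(n, p, common_mode=0.0, trials=100_000, seed=0):
    """Estimate the fraction of trials in which all n replicas are down."""
    rng = random.Random(seed)  # seeded so runs are repeatable
    outages = 0
    for _ in range(trials):
        if rng.random() < common_mode:
            # Correlated failure: every replica crashes together.
            outages += 1
            continue
        # Independent failures: an outage only if every replica is down.
        if all(rng.random() < p for _ in range(n)):
            outages += 1
    return outages / trials


independent = simulate_outage_rate(n=3, p=0.1)
correlated = simulate_outage_rate(n=3, p=0.1, common_mode=0.05)
print(f"independent only: {independent:.4f}, with common mode: {correlated:.4f}")
```

Under this model the outage probability is roughly common_mode + (1 - common_mode) * p^n, so a 5% chance of a shared failure swamps the 0.1% you get from three independent replicas, and adding more replicas barely helps.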

Replicating unreliable components gets you the uptime without the expense of fixing known bugs. Cloud providers encourage this approach to reliability as they don't want one datacenter fire to take out their customers. They work hard to avoid multi-zone outages and even harder to avoid multi-region ones, e.g. by making each zone able to operate independently of the others, and by not deploying new software to all zones at once.

An ideal case where you can take advantage of independence is elastic scaling. Requests to scale up may be refused in one zone but granted easily in another. Downstream datastore and streaming outages from battle-tested software tend to be independent too. Another easy win can be using independent unreliable Redis clusters to provide reliable lookups more cheaply and at lower latency than persistent databases.
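The "works as long as one copy works" part can be sketched as a lookup that tries each replica in turn. This is a rough illustration, not a real client: lookup_with_fallback and AllReplicasFailed are hypothetical names, and each replica is assumed to be a plain callable that returns a value or raises on failure (for example a thin wrapper around a cache client for one cluster).

```python
import random


class AllReplicasFailed(Exception):
    """Raised when every replica raised an error for the same key."""

    def __init__(self, errors):
        super().__init__(f"all {len(errors)} replicas failed")
        self.errors = errors


def lookup_with_fallback(key, replicas, rng=None):
    """Try replicas in random order and return the first successful result.

    A result of None (e.g. a cache miss) counts as success here; only an
    exception is treated as that replica being broken.
    """
    order = list(replicas)
    (rng or random).shuffle(order)  # spread load across replicas
    errors = []
    for lookup in order:
        try:
            return lookup(key)
        except Exception as exc:  # this replica is broken; try the next one
            errors.append(exc)
    raise AllReplicasFailed(errors)


# Usage with stand-in replicas (hypothetical; no real network calls):
def broken(key):
    raise ConnectionError("cluster unreachable")


def healthy(key):
    return f"value-for-{key}"


print(lookup_with_fallback("user:1", [broken, healthy, broken]))
```

Note that this only buys the 1 - p^n availability if the replicas really do fail independently; a request that crashes every replica will walk through all of them and fail anyway.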
