John Fremlin's blog: A single point of failure is ok

Posted 2016-10-05 01:11:00 GMT

When building big systems out of many computers, people often end up with lower reliability than a single computer would give them - and, amusingly, often with worse performance too. There's a strong temptation to avoid a single point of failure by introducing multiple points of failure: one computer is actually quite unlikely to fail, but with many machines, failures are common. If the failures are uncorrelated and there's some way to switch over transparently, then having multiple machines might make sense, and it's an obvious goal. Who wants to admit that a single broken hard drive took down a big website for a few hours?
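
To see how quickly this adds up, here is a back-of-the-envelope sketch in Python. The 1% monthly failure rate is purely illustrative, not a figure from this post; the point is only how fast the probability of some failure grows with the number of machines when failures are independent.

    # Illustrative arithmetic: assume each machine independently has a 1%
    # chance of failing in a given month (a made-up number).
    p_single = 0.01
    for n in (1, 10, 100):
        p_any_failure = 1 - (1 - p_single) ** n
        print(f"{n:3d} machines: P(at least one failure) = {p_any_failure:.1%}")
    # Prints roughly: 1 machine ~1.0%, 10 machines ~9.6%, 100 machines ~63.4%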

Embarrassing though that would be, in trying to make it impossible for a single machine to take things down, engineers end up building systems so complex that the bugs in them take things down far more often than a single machine ever would. The chance of failure grows with software complexity, and those failures are likely to be correlated between machines. Distributed systems are much more complex by nature, so there is a correspondingly high software engineering cost to making them reliable. With many machines there are many failures, and working around all their complicated, correlated consequences can keep a big team happily in work and on call.

A typical example of adding unreliability in the name of reliability is the use of distributed consensus, often embodied by Zookeeper. Operationally, if the system is ever misconfigured or runs out of disk space, Zookeeper will aggressively stop working. It offers guarantees about avoiding inconsistency, not about achieving uptime, so perhaps that is the right choice. Unfortunately, Paxos-style algorithms are vulnerable to never reaching consensus when hosts keep coming in and out of action, which makes sense given that consensus needs participants to stick around. In human affairs we deputize a leader to take charge when a quick decision is needed. Having a single old-school replicated SQL database provide consistency is not hip, but it would typically get more 9s of uptime and be more manageable in a crisis.
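
As a concrete sketch of that old-school alternative, leader election can be a single lease row in the SQL database, claimed with an atomic UPDATE acting as a compare-and-swap. This is only an illustration, not something from the post: it uses sqlite3 so it runs anywhere, the table name and TTL are made up, and against a real replicated primary you would simply point the connection there instead.

    import sqlite3, time, uuid

    # A lease row in one SQL database standing in for distributed consensus.
    db = sqlite3.connect("leases.db", isolation_level=None)  # autocommit
    db.execute("""CREATE TABLE IF NOT EXISTS lease (
                      name TEXT PRIMARY KEY,
                      holder TEXT,
                      expires_at REAL)""")
    db.execute("INSERT OR IGNORE INTO lease VALUES ('primary-role', NULL, 0)")

    def try_acquire(holder, ttl=30.0):
        """Atomically take the lease if it is free, expired, or already ours."""
        now = time.time()
        cur = db.execute(
            """UPDATE lease
                  SET holder = ?, expires_at = ?
                WHERE name = 'primary-role'
                  AND (holder IS NULL OR holder = ? OR expires_at < ?)""",
            (holder, now + ttl, holder, now))
        return cur.rowcount == 1  # exactly one row updated means we hold it

    me = str(uuid.uuid4())
    if try_acquire(me):
        print("acting as leader; renew the lease well before it expires")
    else:
        print("someone else holds the lease; retry later")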

This can be hard to grasp in heavily virtualized environments, where the connection between services and the systems they run on is deliberately weak, but there is actually one place where a single point of failure is fine: the device the person is using to connect to the system. In fact it's unavoidable. After all, if the phone you're using crashes, you can't expect to keep using a remote service without reconnecting. Other failures are less acceptable.

By an end-to-end argument, retries and recovery should therefore be concentrated in the machines people operate directly, and any other reliability measures should be seen purely as performance optimizations. Simplicity isn't easy for junior engineers eager to make their names with a heavily acronymed agglomeration of frameworks and a many-tiered architecture - but it leads to really great results. A sketch of what that looks like at the edge follows below.
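
For instance, the client can own recovery entirely, retrying flaky calls with capped, jittered backoff and treating any server-side redundancy as a latency optimization rather than the thing correctness depends on. The URL, timeout and retry limits here are made-up examples, not values from the post.

    import random
    import time
    import urllib.request

    # The client - the unavoidable single point of failure - owns retries.
    def fetch_with_retries(url, attempts=5, base_delay=0.5, cap=10.0):
        for attempt in range(attempts):
            try:
                with urllib.request.urlopen(url, timeout=5) as resp:
                    return resp.read()
            except OSError:  # covers URLError, connection resets, timeouts
                if attempt == attempts - 1:
                    raise
                # Capped exponential backoff with full jitter.
                time.sleep(random.uniform(0, min(cap, base_delay * 2 ** attempt)))

    if __name__ == "__main__":
        print(len(fetch_with_retries("https://example.com/")))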
