John Fremlin's blog: Testing systems at scale

Posted 2020-04-04 22:00:00 GMT

Test driven development is trendy. Interview loops and promotion ladders insist on testing. We as an industry have agreed that software engineers need to do their own testing, rather than throwing incomplete code over the wall to a QA department. People love to throw out idealistic simplifications and insist on investing in one kind of testing and ignoring others. This is wrong.

There is no easy answer. 100% code coverage is expensive and doesn't guarantee correctness. In fact, testing is a trade-off between effort and the quality of the outcome. Different kinds of bugs have very different impact, from business ending to trivial, and effort should be dedicated appropriately.

That said, there are good answers. Medical and aviation software is very reliable. Beacons of quality and reliability exist in every industry. It is possible to make systems with few bugs — the question is how expensive it will be. And the irony is that software without bugs often ends up being much cheaper than software with bugs.

How should you spend testing effort to reduce bugs? Once this scales to a group of people, just as with any other human endeavour, clarity in the definition of goals and measurement is important. One simple measurement, in the absence of application-specific goals, is the rate at which serious issues are discovered.

Another important measurement is the effort spent adjusting test environments or modifying old tests when adding new features. Poorly set up, overly specific tests and convoluted test frameworks lead to technical debt and slow down new development. They spend the testing effort budget uselessly. A sign of this is engineers complaining about other people's tests. Still, even imperfect tests are better than none and should be encouraged.

A bad measurement is the amount of time spent testing. It's hard to disentangle this from development.

An ideal test would give confidence that there are no important bugs, and would run from development all the way to production, where it joins monitoring and alerting systems. In many environments some level of unreliability is an expected part of the contract, so tests should not fail for this. I've worked on many systems with engineers complaining about flaky tests when the tests fail at the same rate as production. If it's not a problem in production, the tests should be ok to fail at that rate.
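One way to let a test fail at the production rate is to run the operation many times and only complain when the number of failures would be surprising at that rate. The following is a minimal sketch, not a recommendation of a specific framework: the names count_failures and failure_rate_acceptable, the 1% significance level and the idea of passing in the production rate as a plain number are all assumptions made for illustration.

```python
import math


def binomial_tail(trials: int, failures: int, rate: float) -> float:
    """Probability of seeing `failures` or more failures in `trials`
    independent attempts when each attempt fails with probability `rate`."""
    return sum(
        math.comb(trials, i) * rate**i * (1 - rate) ** (trials - i)
        for i in range(failures, trials + 1)
    )


def count_failures(operation, trials: int) -> int:
    """Run `operation` repeatedly and count how many calls raise."""
    failures = 0
    for _ in range(trials):
        try:
            operation()
        except Exception:
            failures += 1
    return failures


def failure_rate_acceptable(
    failures: int, trials: int, production_rate: float, alpha: float = 0.01
) -> bool:
    """True unless this many failures would be very unlikely (below `alpha`)
    if the code under test failed at the rate production already tolerates."""
    return binomial_tail(trials, failures, production_rate) >= alpha
```

With a production failure rate of 1%, two failures in 200 runs pass, while twenty failures in 200 runs fail the check. The production rate itself has to come from monitoring; this sketch just takes it as an input.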

End to end, integration and unit tests can be useful, especially when there is a clear definition of proper behaviour at their corresponding levels. On the other hand, enshrining the behaviour as written can enshrine the wrong behaviour, give false confidence and make the software unnecessarily hard to modify.

Each context needs its own testing. For user interfaces, tests that produce screenshots or videos are very useful. For systems, soak tests that stress the system at peak concurrent load can reproduce difficult bugs, and shadow testing, where new code is given production input but has its output ignored, can iron out networking and deployment issues. For machine learning systems, automated tests of cross-validation quality assert that the output is meeting its primary objective.
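To make shadow testing concrete, here is a rough sketch of the idea in Python. The ShadowRunner class and its field names are hypothetical, and a real deployment would usually run the candidate asynchronously or on a separate host rather than inline; the point is only that the caller always gets the primary result, while differences and crashes in the new code are recorded on the side.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Tuple


@dataclass
class ShadowRunner:
    """Serve requests from `primary`; feed the same input to `candidate`
    and record what it does without ever returning its output."""

    primary: Callable[[Any], Any]
    candidate: Callable[[Any], Any]
    mismatches: List[Tuple[Any, Any, Any]] = field(default_factory=list)
    errors: List[Tuple[Any, str]] = field(default_factory=list)

    def handle(self, request: Any) -> Any:
        result = self.primary(request)
        try:
            shadow = self.candidate(request)
            if shadow != result:
                # Keep (input, live output, shadow output) for later review.
                self.mismatches.append((request, result, shadow))
        except Exception as exc:
            # A crash in the new code must never affect the live response.
            self.errors.append((request, repr(exc)))
        return result
```

The recorded mismatches and errors can then be reviewed or turned into alerts before the candidate is allowed to serve real traffic.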

To offer a heuristic as a hint: my experience is that a test environment using historical snapshots of production data can often strike a good balance between discovering issues early and cost to maintain. Dummy test environments take a long time to set up and enshrine wrong assumptions. Historical data is often a useful kind of messy.
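As an illustration of the heuristic, the sketch below replays a small snapshot of recorded inputs through the code under test and checks invariants instead of exact outputs. Everything specific here is made up for the example: the record layout, the normalise function and the inline SNAPSHOT string, which in practice would be a file exported from production.

```python
import json

# Stand-in for a historical export: one JSON record per line, with the
# usual mess (numbers as strings, stray whitespace, missing values).
SNAPSHOT = """\
{"id": 1, "amount": "12.50", "currency": "usd"}
{"id": 2, "amount": 7, "currency": "EUR"}
{"id": 3, "amount": null, "currency": "GBP"}
{"id": 4, "amount": " 3.00 ", "currency": null}
"""


def normalise(record):
    """Hypothetical code under test: clean one raw record."""
    amount = record.get("amount")
    currency = record.get("currency")
    return {
        "id": record["id"],
        "amount_cents": None if amount is None else round(float(amount) * 100),
        "currency": (currency or "UNKNOWN").upper(),
    }


def replay(snapshot_text, function):
    """Run `function` over every record in the snapshot.

    Returns (records processed, list of (line number, error) pairs)."""
    failures = []
    processed = 0
    for line_number, line in enumerate(snapshot_text.splitlines(), start=1):
        if not line.strip():
            continue
        record = json.loads(line)
        processed += 1
        try:
            output = function(record)
            # Invariants rather than exact expected values.
            assert output["amount_cents"] is None or output["amount_cents"] >= 0
            assert output["currency"].isupper()
        except Exception as exc:
            failures.append((line_number, repr(exc)))
    return processed, failures
```

Checking invariants rather than golden outputs keeps the test from enshrining the current behaviour, while the messy records still exercise the code paths that tidy dummy data tends to miss.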

The right testing trade-off depends on the application. If the output is a machine learning decision, then the quality of that decision should be assessed quantitatively; unit testing all the functions involved doesn't tell you if the system is working well. Datasets, both input and output, should be tested, maybe more than the code. Each application is different and needs its testing budget spent wisely!
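A quantitative check of this kind can be as simple as a test that fails when cross-validated quality drops below an agreed floor. The sketch below uses a toy one-dimensional nearest-centroid classifier and a toy dataset purely so that it is self-contained; the function names, the five folds and the choice of accuracy as the metric are illustrative assumptions, and a real system would plug in its own model and metric.

```python
from statistics import mean


def train_centroids(examples):
    """Toy model: the mean feature value of each label."""
    by_label = {}
    for value, label in examples:
        by_label.setdefault(label, []).append(value)
    return {label: mean(values) for label, values in by_label.items()}


def predict(centroids, value):
    """Return the label whose centroid is closest to `value`."""
    return min(centroids, key=lambda label: abs(centroids[label] - value))


def cross_validated_accuracy(examples, folds=5):
    """Mean held-out accuracy over `folds` interleaved folds."""
    scores = []
    for fold in range(folds):
        held_out = [e for i, e in enumerate(examples) if i % folds == fold]
        training = [e for i, e in enumerate(examples) if i % folds != fold]
        centroids = train_centroids(training)
        correct = sum(predict(centroids, v) == label for v, label in held_out)
        scores.append(correct / len(held_out))
    return mean(scores)


# Toy dataset: values 0..9 are labelled 0, values 20..29 are labelled 1.
EXAMPLES = [(v, 0) for v in range(10)] + [(v, 1) for v in range(20, 30)]
```

A test can then assert that cross_validated_accuracy(EXAMPLES) stays above the floor the application needs, so a regression in the data or the model fails the build even when every unit test still passes.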
