John Fremlin's blog: Bad unit tests impart a false sense of security

Posted 2016-06-21 10:45:00 GMT

Testing improves software. So much so that a lack of unit tests is called technical debt, and blanket statements from celebrated engineers like "Any programmer not writing unit tests for their code in 2007 should be considered a pariah" are uncontroversial. When a defect is noticed in software it's easy to say it could have been found by better testing, and often it's simple to add a test that would catch its recurrence. Done well, tests can be very helpful. However, they can also be harmful: in particular when they cause people to be overly confident about their understanding of the consequences of a change.

A good test
— covers the code that runs in production
— tests behaviour that actually matters
— does not fail for spurious reasons or when code is refactored

For example, I made a change to the date parsing function in Wine. Here, adding a unit test to record the externally defined behaviour is uncontroversial.
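To illustrate what recording externally defined behaviour can look like, here is a minimal sketch in Python (not the Wine change itself, which is not shown here). The wrapper name parse_mail_date is hypothetical; the expected value is taken from the example message in RFC 5322, so the test checks the code against an outside specification rather than against whatever the code happens to return today.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def parse_mail_date(text):
    """Hypothetical function under test: parse an RFC 5322 date header.

    For this sketch it simply delegates to the standard library.
    """
    return parsedate_to_datetime(text)


def test_parses_rfc_5322_example():
    # The input string is the Date header from the example message in
    # RFC 5322, so the expected result comes from an external spec.
    parsed = parse_mail_date("Fri, 21 Nov 1997 09:55:06 -0600")
    expected = datetime(1997, 11, 21, 9, 55, 6,
                        tzinfo=timezone(timedelta(hours=-6)))
    assert parsed == expected
    # Aware datetimes compare by instant, so check the offset separately.
    assert parsed.utcoffset() == timedelta(hours=-6)


if __name__ == "__main__":
    test_parses_rfc_5322_example()
```

A test like this should survive a rewrite of the parser's internals, because it only depends on the input format and the documented result.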

Tests do take time. The MS paper suggests that they add about 15-35% more development time. If correctness is not a priority (and it can be reasonable for it not to be) then adding automatic tests could be a bad use of resources: the chance of the project surviving might be low and depend only on a demo, so taking on technical debt is actually the right choice. More importantly, tests take time from other people: especially if some subjective and unimportant behaviour is enshrined in a test, then the poor people who come later to modify the code will suffer. This is especially true for engineers who aren't confident making sweeping refactorings, so that adding or removing a parameter from an internal function is turned into (for them) a tiresome project. The glib answer is not to accept contributions from these people, but that's really sad — it means rejecting people from diverse backgrounds with specialised skills (just not fluent coding) who would contribute meaningfully otherwise.

Unit tests in particular can enshrine a sort of circular thinking: a test is defined as the observed behaviour of a function, without thinking about whether that behaviour is the right behaviour. For example, this change I made to Pandas involved changing more test code than real code that people will use. This imbalance of effort leaves less time for improving the behaviour itself.

In my experience, the worst effect of automatic tests is the shortcut they offer engineers: the assumption that a change is correct if the tests pass. Without tests, it's obvious that one must think hard about the correctness of a change and try to validate it. With tests, skipping this validation step is easy to rationalise. In this way, bugs are shipped to production that would have been easy to catch by just running the software once in a setting closer to the production one.

It's hard to write a good test and so, so much easier to write a bad test that is tautologically correct and avoids all behaviour relevant to production. These bad tests are easy to skim past in code review as they're typically boring to read, but they give a warm fuzzy feeling that things are being tested when they're not. Rather than counting how much code the tests cover as a metric, we could improve it by measuring how much of the code that really runs in production the tests cover. Unfortunately, these are not the same thing. False confidence from irrelevant tests can reduce reliability.
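The difference between a tautological test and one that exercises production behaviour can be sketched in Python as follows. The function apply_discount and both tests are hypothetical, invented for illustration; the first test passes no matter what the real function does, because it never calls it.

```python
from unittest import mock


def apply_discount(price_cents, percent):
    """Hypothetical production function: price after a percentage discount."""
    return price_cents - price_cents * percent // 100


def test_discount_tautological():
    # Bad: the stand-in returns what it was told to return, and the
    # assertion checks exactly that. apply_discount is never executed,
    # so this passes even if the real function is badly broken.
    fake_discount = mock.Mock(return_value=900)
    assert fake_discount(1000, 10) == 900
    fake_discount.assert_called_once_with(1000, 10)


def test_discount_behaviour():
    # Better: runs the real code and checks results worked out by hand.
    assert apply_discount(1000, 10) == 900
    assert apply_discount(1000, 0) == 1000
    assert apply_discount(1000, 100) == 0


if __name__ == "__main__":
    test_discount_tautological()
    test_discount_behaviour()
```

Both tests would show up as green in a test report, which is part of why the first kind is easy to wave through in review.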
