The difference between Nearly Clean and Really Clean

LADEE in the clean room, presumably unpowdering her nose – really clean!

Toothpaste adverts leave no doubt about how much “really clean” matters, even when the actual difference is beyond the powers of human perception.  But for regression suites, the gap between nearly clean and really clean really does make the difference between a useful set of checks that make the product better and easier for everyone to work on, and a millstone that wastes time and drags down morale.

Until recently, I was heading up the team responsible for system testing our network protocol stack code.  We had some decent test tools (barring some historical idiosyncrasies) which let us “cable up” and spin up a whole network of VMs (actually containers these days), and it was easy to create a script to run through a bunch of checks based on that network.  So we ended up with a lot of regression scripts that checked all our functionality over a wide range of our products.  And our scripts had a pretty good false positive rate (mostly <1%) – we were nearly clean.  So surely we were sitting pretty?

No.  We had a lot of scripts (thousands) and we had enough false positives that no one really trusted the scripts when a check failed.  We had someone spending an hour a day looking over the “fail” results.  And mostly we decided to “wait and see if it happens again”.  And if we did suspect a bug and sent it off to whoever had made changes the previous day, their response was usually “don’t think it’s me, probably a false positive”.  And because we’d waited a day or two, different people had made changes in the meantime and squabbled over whose fault it probably was and so who should investigate first.  We had a load of automated checks that drained a load of time, and despite repeated “quality pushes”, the average number of failures (and the false positive rate) slowly ticked up over time.

Our one saving grace was a couple of suites of scripts which were really clean.  The false positive rate was very low (<0.1% or so) and, crucially, it was low enough compared to the number of scripts that when people saw a check failure their default assumption was that there was a bug.  People dug into every check fail while the issue was new and fresh, and if they did find a false positive, they fixed up the script, so our false positive rate slowly got better and better.

And that for me is the big difference.  If your suite is “really clean”, so the default expectation is that failed checks in your regression suite indicate product bugs (and that expectation had better hold because it’s actually true), then whatever you do, your suite should improve and get cheaper to maintain.  Conversely, if the default expectation is that there’s a good chance of a false positive, then it doesn’t matter how “nearly clean” your suite is: over time it will get worse and your maintenance will get more and more expensive.  (As a side note – Michael Bolton has a good post on what actually happens when you see a check failure.  The whole point of getting “really clean” is to reach the point where it’s a reasonable working assumption that the issue is in the product, which saves you ever-so-much work.)

Over the last few years we slowly moved a lot of our regression suites from “nearly” to “really” clean.  It took a lot of time and effort, but it’s paid off.  We used to burn something like 50 days/year just on maintenance and we’re now probably down to a tenth of that.  And that’s just raw maintenance time – not including savings on bug fixing.

So how do you get from “nearly clean” to “really clean”?  Some thoughts based on our experience of improving our scripts.

  • Expect to put a lot of effort in.  With 200 scripts, a <1% false positive rate still means one or two false failures a night.  You need to get an order of magnitude better than that before people will expect issues to be in their code and not the scripts (see the rough sketch after this list).
  • Focus on one area at a time.  Getting one area over the “really clean” hump wins you more than getting everything a bit closer to nearly clean.  This doesn’t even need to be a particular product or functional area.  If you have 200 scripts, then break out the 20 best ones into a separate output and call that an area – and move other scripts over as and when you get them working well.
  • The key test for “really clean” is the belief and trust in people’s minds (which can be irrespective of the actual level of cleanness, but may be helped by seeing the quality bar enforced).  Be clear about which output is “really clean” and which isn’t, so that you can build that trust.
  • Hold that “really clean” quality bar hard.  Inevitably people will mess things up, but we found you can’t let things slide even for “special one-off” reasons.  We used a “fix it up or back it out” policy which worked well for us.

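To put rough numbers on that first bullet, here’s a quick back-of-the-envelope sketch in Python.  The suite size and rates are just the illustrative figures from above, and it assumes false positives hit each script independently:

    # Back-of-the-envelope: how noisy is a nightly run for a given suite size
    # and per-script false-positive rate?  Figures are illustrative only.

    def nightly_noise(num_scripts: int, fp_rate: float) -> tuple[float, float]:
        """Return (expected false positives per night, chance of a fully clean night)."""
        expected_per_night = num_scripts * fp_rate
        p_clean_night = (1 - fp_rate) ** num_scripts  # assumes independent failures
        return expected_per_night, p_clean_night

    for rate in (0.01, 0.001):  # "nearly clean" (<1%) vs "really clean" (<0.1%)
        expected, p_clean = nightly_noise(num_scripts=200, fp_rate=rate)
        print(f"rate {rate:.1%}: ~{expected:.1f} false positives/night, "
              f"{p_clean:.0%} chance of a completely clean run")

At 1% that works out to roughly two false failures every night and only about a 13% chance of a completely clean run; at 0.1% most nights are clean, which is what lets people treat a failure as a real bug by default.
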
One thought on “The difference between Nearly Clean and Really Clean”

  1. Entirely agree 🙂
    I think regular failures are also key here (as well as the intermittent ones you are discussing above).
    The team I work for has 20,000 test scripts, spread across 20 or so components. Most sets of these we have kept on top of, and have no long-running (around for more than a few days) failures.
    However, the components where we do get a failure which we don’t fix promptly have a habit of snowballing. Because there are already 2 failures in the component, the person who makes the third will feel less inclined to fix it quickly.
    Worse, once we do fix the original cause of those first 2 failures, we sometimes (this actually happens remarkably often, in at least 10% of cases) find someone else has layered another break on top of the original one.

