Testing, Risk and Quality in The Martian

A Martian demonstrating risk


Thinking about testing conflict depicted in The Martian made me think a bit closer about the consequences of defining quality as value to someone who matters.

I recently watched The Martian.  At one point, NASA are trying to launch a space probe in a hurry, and cut 10 days of launch site testing having established that historically there’s only been a 1/20 chance of finding an issue during that testing.  It was unusual to see a film actually covering the concept of “How do we minimise risk with the resources that we have?” rather than just pitting established process against a maverick.

It was also an interesting scene (and film) in that having decided to launch a rescue mission (I’ll leave aside the rationality behind that for this post) the film draws out in several places the contrast between the logically optimal strategy (maximising the chances of rescuing a stranded astronaut) and the socially acceptable strategy (minimising the damage to NASA’s public reputation and future spaceflight if something goes wrong with that rescue mission).  As a specific example, the choice above to dodge some testing was a logical way to save time at a 5% risk cost – time which could then be spent more effectively elsewhere to reduce the risk by more than that.  The problem is that specifically choosing not to do testing you might otherwise have done is hard to explain away if things go wrong – when 20/20 hindsight suggests that was the wrong call.
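The trade-off in the scene can be written down as a back-of-envelope expected-value calculation.  The 1/20 figure and the 10 days are from the film; the alternative use of those days (and the 1%-per-day risk reduction) is an invented illustration:

```python
# Back-of-envelope sketch of the trade-off: keep the launch-site testing,
# or skip it and spend the time elsewhere?  The 1/20 figure is from the
# film; the per-day risk reduction of the alternative is made up.

p_issue_found_by_testing = 1 / 20   # chance the skipped tests catch something

# Option A: keep the 10 days of launch-site testing.
risk_removed_by_testing = p_issue_found_by_testing           # 0.05

# Option B: skip the tests and spend those 10 days elsewhere, where
# (hypothetically) each day removes 1% of overall mission risk.
risk_removed_elsewhere = 10 * 0.01                           # 0.10

best = "skip" if risk_removed_elsewhere > risk_removed_by_testing else "test"
print(best)  # under these numbers, skipping the tests is the lower-risk call
```

Under these (invented) numbers the maths says skip the testing – which is exactly the call that’s hard to defend once hindsight kicks in.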

Most of us see that as a conflict between the “right thing to do” and some stupid politics.  However, looking further…

If the goal of some process is to maximise the quality of something, and we define quality using the standard “value to someone who matters”, then the above conflict is simply that we have multiple people who matter with different values.  That means there’s not a “right” and “wrong” argument above – maximising quality means making a compromise and balancing the different “values”.

And that matches up with what we see in software development.  We run QA phases and focus on mainline install processes, not to optimise for globally reduced risk – we could spend that time finding and fixing other bugs – but because the customer disproportionately values “things working at first” and “lack of regressions”.  While it might not seem like we’re optimising for quality, we are – it’s that quality is defined by value to people that matter, and their views of quality may be different to ours or expressed in different ways.
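One way to make “value to people who matter” concrete is to score each candidate plan as a weighted sum of stakeholder valuations.  The stakeholders, weights, and scores below are all invented for illustration:

```python
# A minimal sketch of "quality = value to someone who matters":
# score a plan as a weighted sum of each stakeholder's valuation.
# All stakeholders, weights, and valuations here are invented.

def quality(plan_values, weights):
    """Weighted value of a plan across the stakeholders who matter."""
    return sum(weights[who] * value for who, value in plan_values.items())

weights = {"customer": 0.6, "dev_team": 0.4}

# Plan A: spend the time on a QA phase (customers value "works first time").
plan_a = {"customer": 0.9, "dev_team": 0.5}
# Plan B: spend the same time finding and fixing other bugs.
plan_b = {"customer": 0.6, "dev_team": 0.8}

print(quality(plan_a, weights))  # plan A scores higher under these weights
print(quality(plan_b, weights))
```

The interesting part isn’t the arithmetic – it’s that changing the weights (who matters, and how much) changes which plan counts as “higher quality”.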



What is Quality?

Clearly the best kind of street.

What is it that we’re trying to achieve with the software development process[1]?  Most people agree that roughly speaking we’re trying to maximise the quality of our product or solution, within some constraints (money/time/people/etc.).  That’s all well and good, but what is quality?

Oxford says: Quality – The standard of something as measured against other things of a similar kind; the degree of excellence of something.

Ok, but not that helpful if you’re trying to maximise it.  And it turns out that when you try to pin it down more tightly, you rapidly run into strong differences.  For example, in Korea, quality is “Brand new, cutting edge”, which is a very different thing to optimise for than the UK’s “Solid, lasts, doesn’t break”.  (For more examples, try this interesting post.)

Jerry Weinberg has a good definition, which I’ve heard around a few times now: Quality is value to some person who matters.  It sounds a bit trite, but I love it because you can do things with it.  Identify who matters.  Find out what they value.  Then you can start optimising that.


[1] I use “software development process” as a whole rather than “testing” to sidestep the “Why do we test?”, “Are testers trying to improve quality?” question.  That’s probably worth a separate post at some point.  Regardless, as a tester, you’ll make better decisions if you have the context of what your overall team and organisation are trying to achieve.



There’s no substitute for testing

This is the story of how an award-winning building that was an architectural and engineering triumph very nearly ended up as a not-at-all-award-winning heap of rubble but was saved by some last minute testing.

Engineering nerds can also check out the flat arches over the windows.
The Queen’s Building, Emmanuel College, Cambridge, UK

This is the Queen’s building at Emmanuel College, Cambridge.  It’s a pretty cool building and an even cooler piece of engineering.  It’s made of the same limestone (from the same quarry) that the rest of the college was built from some 400 years earlier, but where the original buildings have the massive metre-thick walls required when working with such a weak stone, the Queen’s building doesn’t.  The reason it doesn’t just collapse is hinted at by the little portholes that you can see dotting the building.  Limestone, like concrete, is weak under shear or tension forces, but strong under compression.  If you looked inside one of those portholes you’d see the ends of steel rods running down the centre of each stone column, pre-compressing it, similar to pre-stressed concrete.

As this was a brand new construction technique, a bunch of bright people did careful calculations to work out what pressures the stone would be under, what the stresses and strains would be, and modelled everything to make very sure that the stresses on the building were well within the tolerances required – all good stuff.  Work started.

Enter Prof. Chris Burgoyne, a fellow of Emmanuel College and structural engineer, who worked through the plans and calculations and said, “That’s great in theory – but we need to test that.”  Chris had spent much of his life trying to remove over-confidence in models from his engineering students and had an engineering lab just down the road, so despite reassurances that it would all be just fine, he insisted on the testing.

Nothing is more satisfying than finding a bug
Test column in the Engineering Lab, Cambridge University, UK.

This is one of the test columns[1] in a load press in the Cambridge University Engineering Lab.  If you look closely, you can see the cracks as it comes apart at a load significantly below the required safe load.  This posed a bit of a problem, as the building was already partially built!

So what had happened?

The models were right, inasmuch as they were correctly modelling things.  The pre-tension technique is sound.  The building stands today as a quiet triumph of ingenuity and engineering brilliance[2].  However, three things were causing the columns to crumble – none of which had been identified in the original models, but all of which became obvious in testing.

  • The modern mortar mix used, when placed on the absorbent limestone, set very hard, very quickly.  So instead of the load being carried uniformly across the surface, the rough surface created a relatively small number of contact points, which acted as pressure points that started fractures.
  • The blocks that made up the columns had cramps (basically big steel staples) holding them in place.  These too were acting as pressure points.
  • Some of the limestone blocks had the bedding planes (the lines where layers of sand were laid on top of each other before being squished by Geology) aligned vertically.  Limestone can take much higher pressure when the bedding planes are horizontal – think standing on a stack of paper laid horizontally or vertically!

Today, in addition to the tensioning rods originally planned, the Queen’s building has some additional unusual features which you can’t see.  It is built with very thin Roman-style lime cement (which sets slower and more putty-like, and so doesn’t cause fracture points).  There are no staples holding the column blocks together, and the bedding planes in the limestone blocks are all very carefully aligned.  In addition to being a triumph of architecture and engineering, it is also a triumph of testing.


[1] Technically it isn’t.  It’s the second of several smaller columns built and tested after the first column failed spectacularly.  However, most of my photos of this testing were very under-exposed and rubbish.  I was 13 at the time – I’d been allowed in as I was keen on becoming a structural engineer and Chris was a colleague of my dad’s and friend of the family.  I didn’t end up as a structural engineer, but did take to heart the importance of good testing!

[2] Paper on the structure of the Emmanuel College Queen’s Building

Edit: Updated with feedback from Chris Burgoyne, who corrected a few minor details and was also kind enough to provide the copy of the paper included above.

Book: Thinking Fast and Slow


My goodness, this book is dense.  Don’t get me wrong, it’s well written and very accessible – you could take it as a holiday read and “read through” it – but there is 25 years of psychology research packed into a few hundred pages.  We covered this book in our System Test book club, and found that even covering a chapter or two each session, we had plenty to talk about and discuss.

Roughly speaking, the fundamental theme of this book is that we as humans have two systems in our brains, which Kahneman helpfully labels “System 1” and “System 2”.

  • System 1 is fast, reactive, emotional, and runs on assumptions and work that the brain finds “easy” – analogies, associations, stereotypes, anecdotal evidence.  It’s great for letting us deal with day to day life without going nuts.
  • System 2 is the slow, careful, rational thinking that we think we are all the time.  It’s great for coming to the logically correct answer, but it’s way too expensive and slow for us to use all the time.

As System 2 is really expensive to run vs System 1, a lot of the time we actually use System 1 with System 2 unthinkingly rubber-stamping the answer.  An early uncomfortable conclusion of the research is that we’re not the rational beings we think we are.  The rest of the book covers the ways in which we actually think, the various heuristics and biases that we engage in, and so on.  Kahneman has spent his life picking these apart and uses this model to give good explanations of why they come about, and what we can (and can’t!) do to try to get to the actual logical answers, both at work and in general life.

I thoroughly recommend this book to anyone interested in their development, whether a tester or otherwise  (Kahneman also provides a good bibliography and references if you’re interested in digging further).  And if you’re thinking of starting some kind of discussion group or book club, this is a good “hook” book to get started with.

Using Dungeons and Dragons to understand your motivations and have more fun at work.

My first ever AD&D character was a chaotic good wizard with 6 INT. He didn't do very well.
A neutral good tester prepares to unravel the mysteries of the universe (probably by hitting bits of it with lightning).

Hi, I’m Edmund, and I’m chaotic good.  That means if you want me to do something for you, tell me about how much it will help you (or someone else) and enthuse about how new, different, and exciting it is.  If you do that, I’m much more likely to help you out, and I’ll have more fun helping.

This post is about people’s fundamental motivations, how to think about them, and about how 2 minutes of thought and a change of briefing will make your team and your manager work better and enjoy the work they’re doing more.

The model I’m using originally came from my previous boss, Jon Berger (lawful evil), who deserves all the credit here.  It’s based on the alignment system from the Dungeons and Dragons RPG.  It’s a handy model because it describes basic character motivations using a system that almost every geek you come across already knows.  If you haven’t come across it, you’re missing out!

AD&D 2nd edition was clearly the best. Thac0 did wonders for our mental maths.
The alignments at AD&D 2nd edition.

Here’s the updated, but similar alignment model I have for people at work.

Fundamental motivations at work

Some more explanation of the two axes.

  • Good/Evil.  People tend to be fundamentally people driven or goal driven.  Most people like both helping others and getting things done, but which actually gives you the real buzz at the end of a project – that 3000 customers are better off or that you’ve created an amazing thing that works really well?  Good people will spot that horrible “Working as designed but it doesn’t solve the customer’s problem” bug, Evil people will make sure everyone focuses and completes everything required to ship the product.
  • Lawful/Chaotic.  People tend to enjoy working with rules, or without them.  Lawful people create strong processes and can be relied on to do everything needed, but can struggle if not given enough structure to build on.  They drive change by defining new methods that people can follow.  Chaotic people try more new things and uncover new ideas, but can struggle to complete the details that have to be done.  They drive change by trailblazing and championing.
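The two axes can be sketched as a toy lookup: given someone’s alignment, pick which angles of a briefing to emphasise.  The phrasing of each hook below is invented, not part of the model:

```python
# A toy sketch of the two-axis model: choose which parts of a briefing
# to emphasise based on someone's alignment.  The hook wording is invented.

EMPHASIS = {
    "good":    "who this helps and how much better off they'll be",
    "evil":    "the goal itself and the amazing thing that gets shipped",
    "lawful":  "the structure: clear steps, conditions, and a test report",
    "chaotic": "what's new, different, and exciting to explore",
}

def briefing_hooks(alignment):
    """Return the angles to stress for e.g. 'chaotic good'."""
    return [EMPHASIS[axis] for axis in alignment.split()]

for hook in briefing_hooks("chaotic good"):
    print(hook)
```

It’s deliberately crude – the point is only that the same task can be framed along either axis.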

Ok.  So there’s this model.  What can I do with it?

First off – this is a model.  All models are inaccurate.  They’re not a replacement for thinking, but they can help you think about the thing that you’re modelling.  So have a think about yourself and the people you work with and how you fit into this model.

When you want someone to do something.  If you’re asking for help.  If you’re briefing them on a new part of the project.  If you’re trying to help them develop.  Whatever.  Appeal to their fundamental motivation (and crucially don’t assume that their motivations/comfort zones are the same as yours!).  Compare the following briefings.

  • Marty and I are behind on the flux capacitor testing, and that’s the key part that lets our customers reach 88 mph.  We need you to spend a few days helping on this to pull us out of the hole.  It’s also a chance to play with the flux capacitor and learn more about it.
  • I need you to help get the flux capacitor working – it’s the critical component that makes the Delorean more than just a car with silly doors.  We need you to spend a few days nailing through the following test conditions, and as a critical component it’s of course crucial we have a clear test report.

Of course, the task is the same in both cases, and indeed the detailed task briefing can be identical.  However, by emphasising the parts of the project they’ll enjoy and appealing to their motivation, they’ll tie the task to what drives them and (a) do a better job and (b) get more of a buzz out of the result.

As a final note, if you’re at all like me, when you first try this (and I recommend you do try!) you’ll feel a bit like a Machiavellian, manipulative fraud.  However, there’s nothing secret about this – point them at this post and let them think about what motivates them too.  Discuss it.  Compare and contrast your motivations.  It might help both of you decide how to choose which tasks to pull off the pile in the next scrum kickoff meeting you’re in.

Boost your development: “Do one thing”

Eating one of these every day may not help you become a better tester.
A Boost.  Eating one of these a day is not guaranteed to develop your testing.

One of the tricky things when trying to improve yourself is connecting a high level goal to specific day to day actions that will get you there.  How do you get from “Within 9 months I want to be a lead tester that people seek help from.” to “Here’s what I have to do this week to achieve that.” ?  Lots of people seem to end up doing their job and hoping that intention and osmosis will get them there eventually.

Here’s a simple thing that I’ve found works for me.  I also use it with people I manage and it’s a good way to frame development as something that happens all the time rather than just as part of the annual/6 month performance review.

Each Monday, decide one extra thing that you’re going to do *this week* to work towards your goal.  It has to be an explicit action that you can tick off (in other words, make it SMART), it has to be something that isn’t just part of doing your job well, and it has to be something that will help you reach your goal (even if it’s not obvious that it’s the best thing to do).

Tell someone what you’re doing (telling your manager in your one-on-one is great, as it reminds them that you’re pushing your development and willing to do more than “just your job”,  and it also gives you both a chance to discuss/agree other actions too).  Then do it.

The actions don’t have to be big – for example…

  • I will ask Dave the Developer to review the structure of the test script functions that I’m writing, and I’ll find at least one thing in his feedback that I can apply next time I write scripts.
  • I will read James Bach’s blog and brief the team about one thing that I’ve learned and applied.
  • I will spend an hour paired testing with Laura the Lead Tester, and note at least two things that she does that I should regularly do when testing.  I’ll explicitly do those later this week.
  • I will review my notes taken during testing sessions on Wednesday and Thursday, and find 2 holes where I should go back and retest.

Also, note that some of those examples involve other people.  You’ll need to check with the other people that they’re up for helping you, but if you ask people for small specific bits of help to learn to be good like them, they almost always say yes.

The difference between Nearly Clean and Really Clean

LADEE in the clean room, presumably unpowdering her nose
Really Clean!

Toothpaste adverts leave no doubt about how much “really clean” matters, even when the actual difference is beyond the powers of human perception.  But for regression suites this can really make the difference between a useful set of checks that make the product better and easier for everyone to work on, and a millstone that wastes time and drags down morale.

Until recently, I was heading up the team responsible for system testing our network protocol stack code.  We had some decent test tools (barring some historical idiosyncrasies) which let us “cable up” and spin up a whole network of VMs (actually containers these days), and it was easy to create a script to run through a bunch of checks based on that network.  So we ended up with a lot of regression scripts that checked all our functionality over a wide range of our products.  And our scripts had a pretty good false positive rate (mostly <1%) – we were nearly clean.  So surely we were sitting pretty?

No.  We had a lot of scripts (thousands) and we had enough false positives that no one really trusted the scripts when a check failed.  We had someone spending an hour a day looking over the “fail” results.  And mostly we decided to “wait and see if it happens again”.  And if we did suspect a bug and send it off to whoever had made changes the previous day, their response was usually “don’t think it’s me, probably a false positive”.  And because we’d waited a day or two, different people had made changes and squabbled over whose fault it probably was, and so who should investigate first.  We had a load of automated checks that drained a load of time, and despite repeated “quality pushes”, the average number of failures (and the false positive rate) slowly ticked up over time.

Our one saving grace was a couple of suites of scripts which were really clean.  The false positive rate was very low (<0.1% or so) and, crucially, it was low enough compared to the number of scripts that when people saw a check failure their default assumption was that there was a bug.  People dug into every check fail while the issue was new and fresh, and if they did find a false positive, they fixed up the script, so our false positive rate slowly got better and better.

And that for me is the big difference.  If your suite is “really clean” so the default expectation is that failed checks in your regression suite indicate product bugs (which had better be because that’s true), then whatever you do, your suite should improve and get cheaper to maintain.  Conversely, if the default expectation is that there’s a good chance of a false positive, then it doesn’t matter how “nearly clean” your suite is, over time it will get worse and your maintenance will get more and more expensive.  (As a side note – Michael Bolton has a good post on what actually happens  when you see a check failure.  The whole point to getting “really clean” is to get to the point where it’s a reasonable working assumption that the issue is in the product, which saves you ever-so-much work).
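The shift in default expectation drops straight out of Bayes’ rule: with the same underlying bug rate, cutting the false positive rate changes what a red check most likely means.  The real-bug rate per script-run below is an assumed figure for illustration; the false positive rates are the ones quoted above:

```python
# Why "really clean" changes the default reaction to a red check:
# Bayes' rule on P(real bug | check failed).  The per-run real-bug
# rate (0.1%) is an assumed figure purely for illustration.

def p_bug_given_failure(p_bug, p_false_positive):
    """P(real bug | failure), assuming failures come from bugs or flakes."""
    p_fail = p_bug + (1 - p_bug) * p_false_positive
    return p_bug / p_fail

p_bug = 0.001  # assumed chance a given script-run hits a real product bug

nearly_clean = p_bug_given_failure(p_bug, 0.01)   # ~9%: "probably a flake"
really_clean = p_bug_given_failure(p_bug, 0.001)  # ~50%: worth digging into
print(f"{nearly_clean:.0%} vs {really_clean:.0%}")
```

At <1% false positives a red check is probably noise, so “wait and see” is individually rational; at <0.1% it’s a coin-flip or better that there’s a real bug, so digging in immediately becomes the sensible default.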

Over the last few years we slowly moved a lot of our regression suites from “nearly” to “really” clean.  It took a lot of time and effort, but it’s paid off.  We used to burn something like 50 days/year just on maintenance and we’re now probably down to a tenth of that.  And that’s just raw maintenance time – not including savings on bug fixing.

So how do you get from “nearly clean” to “really clean”?  Some thoughts based on our experience of improving our scripts.

  • Expect to put a lot of effort in.  With 200 scripts, <1% false positives still means one or two a night.  You need to get an order of magnitude better than that before people will expect issues to be in their code and not the scripts.
  • Focus on one area at a time.  Getting one area over the “really clean” hump wins you more than getting everything a bit closer to nearly clean.  This doesn’t even need to be a particular product or functional area.  If you have 200 scripts, then break out the 20 best ones into a separate output and call that an area – and move other scripts over as and when you get them working well.
  • The key test for “really clean” is the belief and trust in people’s minds (which can be independent of the actual level of cleanness, but is helped by seeing the standard enforced).  Make sure that you’re clear about what output is “really clean” and what isn’t, so that you can build that trust.
  • Hold that “really clean” quality bar hard.  Inevitably people will mess things up, but we found you can’t let things slide even for “special one-off” reasons.  We used a “fix it up or back it out” policy which worked well for us.
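The “one or two a night” figure above falls out of simple arithmetic, and it’s worth also looking at how often you get a completely clean night.  The script counts and rates are the ones quoted in this post:

```python
# The arithmetic behind "200 scripts at <1% still means one or two a night":
# expected false positives per nightly run, and the chance of a fully clean
# night, at the two rates quoted in the post.

def nightly(n_scripts, fp_rate):
    expected = n_scripts * fp_rate           # mean false positives per night
    p_clean_night = (1 - fp_rate) ** n_scripts  # chance nothing flakes
    return expected, p_clean_night

exp_nearly, clean_nearly = nightly(200, 0.01)    # 2.0 expected, ~13% clean
exp_really, clean_really = nightly(200, 0.001)   # 0.2 expected, ~82% clean
print(exp_nearly, round(clean_nearly, 2))
print(exp_really, round(clean_really, 2))
```

At <1%, a clean night is the exception (roughly one night in eight), so people learn to shrug at red; at <0.1%, most nights are clean, and a failure actually stands out.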