Cost Per Defect – not so economically crazy

Capers Jones, one of my favorite authors on metrics in IT, has often called “Cost Per Defect” a bad idea.  His argument is that in situations of high quality, cost per defect will be higher.  And one can see how the math bears this out.

If you spend $100 (1 hour @ $100/hour, let’s say) testing something and find 1 bug, the cost per defect is $100.  If you spend $100 testing and find 10 bugs, then the cost per defect is $10.  In this case, it’s cheaper to test where quality is poor.

Now, I am not advocating a test-only approach because it makes cost per defect cheaper.  But I think this mathematical reality raises a much more important question: if quality is good, why do you continue to test?

Testing may very well be a fixed cost regardless of quality.  You, in theory, don’t know if the code delivered to you is good or bad.  Therefore, you write a set of tests assuming that it all needs to be covered and run those tests.  If the quality is good, it appears expensive to test and if the quality is bad, then testing appears cheap.  Thus, Jones argues, cost per defect encourages poor quality.

I argue it encourages something else – an opportunity to revisit whether you should be testing at all.  Jones argues that the cost per defect measure is economically perverse because it punishes high quality by making testing look expensive.  I’d argue that testing was never value added in the first place.  Therefore, when high quality is delivered, why do you continue to waste your time testing?

Instead, I propose that cost per defect is a great measurement for a test organization.  It clearly articulates the issue that testing isn’t of value where quality is high.  And therefore the cost per defect is high.  When no defects are found, cost per defect is undefined, and although this is mathematically annoying, it represents a measure of pure waste.  Testing and finding no defects has absolutely no value to your customers.

If you think testing in quality is the way to achieve good code, then Jones’s argument makes sense.  If you’re always going to test regardless of the situation, then it is an odd measurement.  If, however, you think that testing should not be conducted (or at least done far less) when you know the quality is good, cost per defect drives the right behavior – less testing where quality is good.

It’s not chaotic, it’s percent error

Generally, when we talk about correlation, people imagine a nice positive (or negative) relationship between the independent and dependent variable.  It typically looks something like the magenta data set below – for each increase in X, there’s a corresponding increase in Y, plus or minus some random fixed amount of noise.

In software development, we also see the other pattern on this graph – which still has the corresponding increase in Y as X increases, but the random amount of noise seems to get bigger and bigger as X increases.  The result is a “spray” pattern, which we tend to conclude means that things are getting out of control or that the relationship isn’t really there.  Otherwise we’d get a nice positive correlation, right?

There is another explanation – percent error.  A positive correlation might be expressed as Y = 2X + e, where e is some fixed error, let’s say +/-2.  So, if X = 10, Y should fall somewhere between 18 and 22 (2*10 +/- 2).  However, when you have a percentage error, e is some percentage of X.  When X is small, the absolute error of Y is small, and when X is large, the error of Y is larger.  If X is 2, then +/- 10% is 0.2, but if X is 200, +/- 10% is 20.  One can quickly see how this would cause a spray-like pattern in your data.
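A quick simulation, using the Y = 2X + e coefficients from the example above, shows how the two error models differ: the fixed-error spread stays constant, while the percent-error spread grows with X, producing the “spray.”

```python
# Contrast the two error models: fixed error Y = 2X + e with e in [-2, 2],
# versus percent error where e is +/-10% of X.

import random

random.seed(42)

def fixed_error_y(x: float) -> float:
    # Noise band is the same width no matter how big X gets.
    return 2 * x + random.uniform(-2, 2)

def percent_error_y(x: float) -> float:
    # Noise band scales with X: +/-0.2 at X=2, but +/-20 at X=200.
    return 2 * x + random.uniform(-0.10, 0.10) * x

for x in (2, 20, 200):
    print(x, round(fixed_error_y(x), 1), round(percent_error_y(x), 1))
```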

It’s also a common pattern in software development, in my experience.  As you code, a certain percentage of the amount of code you create will be faulty.  It appears chaotic, and it does create a larger variance in the number of defects you’ll get as the project gets larger, but it’s actually not particularly surprising.

Knowing this, you can create models that help predict defects which use % faulty instead of count.  It’s still not going to make it easier to predict the absolute number of defects you’ll get, but it will at least help provide a realistic expectation that as project size increases (whether you measure it by function points, lines of code or something else) the number of defects you find may appear to vary more widely than a simple positive correlation would allow.

What we don’t find

I had a sort of odd conversation the other day. We had been studying defect patterns for a group and based on the data had concluded that the team had a serious regression problem. Something on the order of what typically gets referred to as “fix one, break one.”

This is hardly an unheard-of pattern in my experience.  Many immature teams struggle to keep on top of the bugs, and when they do try to fix stuff, they often make it worse.  If you are careless about fixes, it is far too easy to have unintended consequences.

At any rate, this person I was talking with decided that he needed to do his own research. He couldn’t believe they had that kind of issue. So, he randomly picked a number of defects and he reviewed them with the team. So far, so good. Then he came back to me and said “we looked at some bugs and we could only identify about thirty percent as related to a prior change.” The polite implication was that I was wrong.

I may very well be wrong, but this isn’t a good way to figure it out.  For one, he was reliant on the developers to tell him whether each was a regression bug.  It isn’t in the developer’s interest to do that.

But secondly, and in my mind, more importantly, the lack of finding something doesn’t mean that it is not there.  This is why statistics uses the null hypothesis.  We say that there is no evidence of a relationship between two variables, not that there is no relationship between them.  The language choice seems subtle, but it allows a possibility that most don’t consider.  Namely, people think not finding something means it isn’t there to be found.  Sadly, that’s not true.  It might not be there, but it also could be there and you just didn’t see it.

If you wanted to prove it wasn’t a regression bug, you would have to prove what it actually is, and that cause would have to be mutually exclusive of being a regression defect.  For example, if you could show that the bug was introduced prior to the recent change, then you didn’t introduce it.  But you can’t say “I didn’t find evidence that I introduced it, therefore I must not have introduced it.”  That statement is a logical fallacy.

The 900 pound gorilla in your data

Q: Where does a 900 pound gorilla sit?

A: Anywhere he wants to.

I never thought that joke was particularly funny as a kid, and it still isn’t.  However, today, I had a chance to contemplate the 900 pound gorilla in my data.  When something is so overwhelmingly large, it can take over your entire data set, leaving useful information hidden by its sheer size.

Take metrics such as defect containment rate, for example.  Let’s say that you have many products and you want to calculate a defect containment rate for each.  Let’s also assume that one product has 8,000 test defects and 2,000 prod incidents, for an 80% containment rate.  The other two have 90 test defects with 10 prod incidents, and 180 test defects with 20 prod incidents.  You’ll quickly recognize both of these as 90% containment rates.

Now, let’s say you calculate a rolled-up containment rate for your three products.  There are two ways to do it.  The first is to simply sum everything up; the other is to average the averages.  Across the three products, you have 8,270 test defects and 2,030 prod incidents, for an aggregate containment rate of 80.3%.  Or, you could average the three containment numbers and get 86.7%.  In this example, the difference is only about six points, so what’s the big deal, you say?
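The two roll-ups can be sketched in a few lines, using the counts from the example above:

```python
# Two ways to roll up containment: sum the raw counts, or average the rates.

products = [
    (8000, 2000),  # big product: 80% containment
    (90, 10),      # small product: 90% containment
    (180, 20),     # small product: 90% containment
]

def containment(test_defects: int, prod_incidents: int) -> float:
    return test_defects / (test_defects + prod_incidents)

total_test = sum(t for t, p in products)
total_prod = sum(p for t, p in products)

aggregate = containment(total_test, total_prod)
average_of_averages = sum(containment(t, p) for t, p in products) / len(products)

# The big product drags the aggregate toward its own 80% rate.
print(f"aggregate:           {aggregate:.1%}")            # -> 80.3%
print(f"average of averages: {average_of_averages:.1%}")  # -> 86.7%
```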

Well, what if there’s far more divergence among the data?  What if your one product is large and horrible while your other products are smaller and have good containment?  What’s the accurate picture for any aggregation, if aggregation is appropriate at all?

In one scenario, the large product with all the defects simply drowns out the others.  Although it presents a pessimistic view of the overall portfolio of products you have, it may not lead to appropriate action.  If you looked at the aggregate number would you try and shore up testing everywhere, or just for your one problematic product?  More importantly, if you’re a siloed organization, which most organizations are, does one team’s poor performance necessarily represent a failure of the overall testing process?  Not necessarily.  Sometimes taking scale into account provides too much attention to one piece of data and ignores data that is important.

Just because these other products are small doesn’t mean they haven’t learned something about containing defects that couldn’t be taught to the larger product.  You don’t want to throw the baby out with the bathwater, just because your overall metric says containment isn’t good.

That same 900 pound gorilla in your data can wreak havoc on predictive modeling as well, as large outlier values tend to drag around the estimated coefficients.  There are places where letting a 900 pound gorilla sit where it wants is not the right choice.  Keep an eye out for it, and decide how you want an out-sized data element to influence your measurements.

A Defect Containment Rate Philosophy

Defect containment rate at its most basic is a fairly simple metric.  It measures the number of defects which your process detects as a proportion of all defects (those which you do and do not detect).  Typically, it’s defined as test_defects / (test_defects + production_defects).  Nobody really argues with the formula, but we do debate how to define its terms.  (As an aside, I’d actually prefer pre_production_defects to test_defects, since we should recognize the value of all upstream quality activities in detecting defects, not just testing.)

Test defects are pretty easy.  Almost everyone can agree on what a test defect is.  It’s anything you find in test that you don’t like, typically variation from the requirements as defined.  But it might also include detection of missed requirements, usability issues, performance issues and other unspecified functionality, etc.

We get into debates when it comes to defining a production defect.  For one, people want to debate the difference between a defect and an enhancement.  In my opinion, if the user doesn’t like it, it doesn’t matter how it got into production (deliberately or not) – it’s a defect.

But ITIL makes it even worse.  ITIL recognizes two things – incidents and problems.  In theory, many incidents can be caused by a single problem.  An incident is the effect the user experiences.  A problem is what caused the effect.  A problem can be anything from a hardware failure to a software bug.  When it comes to calculating defect containment, which do you count?  Do you count the problems (the unique defects which you didn’t contain) or the incidents (all the impacts caused by a single problem)?  A single line of bad code can cause many different undesirable behaviors in production and/or it can cause the same bad thing to happen over and over and over to the same customer or to many different customers.

I believe the answer is you should count incidents.  Here’s why.  For one, incidents represent what the user experiences.  Sure, I’ve heard many an argument that not all incidents are preventable, but guess what, your user doesn’t care.  If your system is down, it’s down.  If all your incidents are caused because you bought cheap hardware to run your system on, you still need to deal with it.  Part of containing defects should include assuring adequate performance.  That’s why we have the concept of performance testing.

Secondly, a problem is a unique deficiency in the code.  Let’s say you discover this deficiency in January in production.  If you fix it, it remains a one-time event.  However, if you choose to do nothing, it will occur again in production.  Depending on the issue, it may happen again soon, or perhaps not for a long time.  Maybe it’ll happen to the same customer, but maybe it’ll happen to a different one.

If you only count the deficiencies in the code once, you are treating them as if they go away simply because you’ve identified them, but that’s not the case.  If a defect escapes into production, it stays there until you do something to remove it.  Your measurement system needs to reflect this.

If you choose to just measure the problems, instead of the incidents, you can slowly and steadily erode production quality without ever taking a major hit to your defect containment rate.  Each bug, in isolation, may not be a big deal, but stacked one upon another over time, you suddenly don’t have a very good product anymore.

Think of it from a manufacturing perspective.  If your manufacturing process is imperfect, it introduces the same defect over and over again into the product.  With software, you put out one copy of the software to production and it’s quickly duplicated to every user.  In effect, left unhandled, software defects are like a repetitive manufacturing defect, except they spread much faster.

Fail to act on the production problem and the defect remains uncontained.  Create a measurement system which reflects that.  Measure test_defects / (test_defects + production_incidents).
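A small sketch with invented counts shows how far apart the two definitions can drift when problems recur:

```python
# Contrast the two containment definitions: counting unique problems
# versus counting every production incident. All numbers are made up.

def containment(test_defects: int, escapes: int) -> float:
    return test_defects / (test_defects + escapes)

test_defects = 95
problems = 5             # unique uncontained deficiencies in the code
incidents = problems * 8  # each unfixed problem recurs, say, 8 times

# Counting problems once flatters the process; counting incidents
# reflects what users actually keep experiencing.
print(f"by problems:  {containment(test_defects, problems):.1%}")   # -> 95.0%
print(f"by incidents: {containment(test_defects, incidents):.1%}")  # -> 70.4%
```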

Measure pessimistically

Sometimes it’s better to be a glass-half-empty kind of person, because the worst that can happen is that you’ll end up being pleasantly surprised.  Sure, we all want our projects to have great success, but the reality is that there are just so many opportunities to fail to make a change, for all kinds of unfortunate reasons.

The best thing I think you can do is to measure pessimistically.  What does this mean?  Well, if you’re counting defects, count everything, even the things that are ultimately cancelled.  Why?  Because if your customer reported it, at the very least they were confused about how to use the system.  Just because it’s “working as designed” doesn’t mean your customer gets how it was supposed to work.

Reality is, there’s always opportunity to improve, so if you’re going to err, it’d be better to err on the side of “we’re not good enough” instead of “we’re doing great.”  At worst, if you measure pessimistically, you’ll take action to try to get better and not get anywhere.  If you measure optimistically, you may fail to take action when you really should be doing something.

Measurement isn’t about trying to tell a happy story.  It’s about trying to tell an accurate story.  But, if for some reason you can’t do that (measurement system noise or whatnot), tell a pessimistic story instead.

Overcomplicated models

How many black belts does it take to decide on a place for dinner?  The answer, at least in my case, turns out to be about 10 to 12.  There’s a group of us who try to get together now and again just to catch up and reminisce about times past.  Everyone in the crowd is a great person, and we’re always sad when someone doesn’t attend, but life gets in the way sometimes – and apparently, so do the analytical minds of a dozen people.

It all began innocently enough.  “Where should we meet?” someone asked.  A few potential meeting locations were thrown out.  Then someone (you knew it would happen with a bunch of Six Sigma Black Belts) decided we needed a more scientific approach to deciding.  She listed out all our home towns, identified the furthest points (most northern, most southern, most eastern and most western locations) and drew a cross between them.  Thus, she approximated the mid-point, which turned out to be Framingham, MA.

Alas, once that was done, we were off to the races on alternative models.  Another black belt chimed in, pointing out that we probably needed some sort of weighted distribution since there were more people from the north than the south and by picking a more northern location we’d be minimizing total miles travelled (good for the environment, not so good for the one person with the longest drive).

Fortunately, a bit of quick research turned up a tool that could do exactly that, so we entered all our data and it told us that we should be picking somewhere around the Concord/Acton region of Massachusetts.

But, then someone pointed out that we really couldn’t decide on a place until we figured out who was attending.  True enough, even though we’re all invited, we can’t always come.  A couple people had already expressed their regrets due to existing conflicts.  They argued we needed to repeat the experiment, but with only the subset of people attending.

Another person pointed out that attendance varies, and at least based on prior experience, people who accept the invitation can decline at the last minute, so even if you are free now, you might not actually be able to attend.  However, we could use past attendance data to model who is likely to really attend and perhaps arrive at a most likely scenario via Monte Carlo simulations.

So I entered everyone’s location into a table, built a rough set of probabilities of each person attending and ran the simulation 100,000 times (which was probably overkill) to determine the best location based on minimizing the driving distance for a population of likely attendees.  The answer from that simulation was, ironically, Framingham, MA – the same city we started out with in the first place!
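For the curious, a stripped-down version of that simulation might look something like this.  The positions, attendance probabilities, and one-dimensional distances are all invented for illustration – the real model used actual towns and driving distances.

```python
# Monte Carlo sketch: pick the dinner spot that minimizes expected total
# driving distance over randomly sampled sets of attendees.

import random

random.seed(0)

# (position along a north-south axis in miles, probability of attending)
people = [(0, 0.9), (10, 0.8), (15, 0.7), (40, 0.9), (60, 0.5)]
candidate_spots = [0, 10, 15, 40, 60]

def expected_total_drive(spot: int, trials: int = 10_000) -> float:
    """Average, over simulated attendee sets, of total distance driven."""
    total = 0.0
    for _ in range(trials):
        # Each trial, each person independently shows up with probability p.
        total += sum(abs(pos - spot) for pos, p in people if random.random() < p)
    return total / trials

best = min(candidate_spots, key=expected_total_drive)
print("best spot:", best)
```

With far fewer than 100,000 trials the ranking of candidate spots already stabilizes, which is part of the point: the elaborate model mostly confirms the back-of-the-envelope answer.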

I replied to the long email thread: “ok, I’ve completed the model and based on that I agree that Framingham is the best location.  I’ll spare you the boring details.”

And to that we got the reply from one of the group: “but the details are so cool!  He even had charts!”

What does all of this tell us?  Besides the fact that we all have too much time on our hands some days. 

Sometimes you don’t need a complicated model; a simple model will often generate an answer that is more than adequate.  Why?  Because for every nuance you add to a complicated model, there’s often another nuance that counterbalances it.  Each little addition gets wiped out by a subsequent one. 

The reality is, the critical few often elude us when we start building complex models.  When we lose sight of which variables really matter, we keep adding and adding variables to account for smaller and smaller effects.  In the end, those tiny effects don’t matter much.  Don’t get me wrong, there’s a place for complex models when we need to separate noise from signal and we don’t know what matters, but you can overdo it.

Start simple.  The goal is to make more good decisions in business than bad ones; you’re never going to make 100% correct decisions all the time.

Good data: pay me now or pay me later

Let’s be honest, nobody thinks much of data while things are going well.  It’s only after a particularly horrific experience, usually brought on by a flood of customer complaints seemingly out of nowhere, that we think “hmm, maybe we ought to be keeping track of some of this stuff.”

Don’t wait for that to happen!  Even if you intend to do nothing with the data right now (which is a shame, because there’s always room for improvement), don’t delay to start getting good data.

Getting good data isn’t hard, there are just a few simple rules to follow.

  1. Get your like data in one place.  Nobody wants to search Excel workbooks, Word documents, issues lists, and thirteen different defect tracking tools to figure out how many defects you have.  Not only is it hard, it creates opportunities for duplicate defects, so that when you do want to count, it becomes a heavily manual exercise to figure out what’s in and what’s out.  You can have different systems to, say, track timesheets separately from defects, but if you do, and you ever want to tie things together (like defects created per development hour of effort), you have to define at least a few standard fields.  As a corollary to this: Excel is an accounting tool, not a defect tracking system.  At least use a database like MS Access.
  2. For your data fields, less is more.  The more data someone has to enter to complete a defect ticket, for example, the more there is to get wrong.  Keep it basic – date, location detected (in prod, in test, in coding, etc.), application affected, project that injected it, and a description.  What else do you really need?  We often want to make lots and lots of slices of the data; resist the urge to enable this – it really isn’t as important as you think it is.  It’s far better to get a handful of fields right.  For time tracking, don’t make the buckets too fine-grained, or you won’t get good data from your employees.
  3. Don’t allow data duplication.  As a simple example, the impacted application typically determines which team should get the defect to fix.  If that’s true, pick one field – typically the application – and get it right.  Make the assignment group be determined by the application, so that if people want the bug routed to the right place to be fixed, they have to get the application right.  Simple tricks like this error-proof data entry.  Don’t invite data inconsistency by having people enter two fields where one should determine the other.
  4. Set naming conventions.  Don’t allow people to enter the application in a free-form field.  Use a drop-down.  Don’t allow people to add new applications to the drop-down willy-nilly.  If you intend to analyze based on a field later, it cannot be free-form and it has to be controlled to prevent garbage from getting into the field.  Make sure the naming conventions cross between systems.  If you have a time tracking system for projects, then when entering a bug you should use the same project code in the defect as the place where you were billing.  Naming conventions should be global, not system specific.
  5. Make choices in data fields mutually exclusive.  Don’t use ITIL’s configuration item concept.  Sorry, ITIL, but this is a bad idea.  Yes, it’s super generic, but when an application fails on one server but not another, what do you put in this field?  The application that failed?  The server that failed?  Unfortunately, configuration items don’t have to be orthogonal, so they can’t be in one field.  If the choices for a field aren’t mutually exclusive, either change the choices so they are or add additional fields that separate out the information.  For example, don’t use pre-existing as a choice for the bug injection.  Any bug can be pre-existing and also be a coding error or requirements error, so these values aren’t mutually exclusive.
  6. Don’t worry if your measurement system won’t let you compare yourself to the rest of the world.  You are always competing against perfection, not another company, so who cares where they stand.  You must be striving to get better anyway.  Even if you are the head of the pack, that’s not good enough.
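Several of these rules – controlled vocabularies, mutually exclusive choices, and one field determining another – can be error-proofed right in the tracking tool’s data model.  A hypothetical sketch, with all names invented:

```python
# Error-proofed defect record: enums instead of free-form text, and the
# assignment group derived from the application rather than entered twice.

from dataclasses import dataclass
from enum import Enum

class Application(Enum):     # drop-down, not free-form (rule 4)
    BILLING = "billing"
    PORTAL = "portal"

class Phase(Enum):           # mutually exclusive choices (rule 5)
    CODING = "coding"
    TEST = "test"
    PRODUCTION = "production"

# One field determines the other: no chance of inconsistent entry (rule 3).
ASSIGNMENT_GROUP = {
    Application.BILLING: "billing-dev-team",
    Application.PORTAL: "portal-dev-team",
}

@dataclass
class Defect:                # a handful of fields, no more (rule 2)
    date: str
    application: Application
    detected_in: Phase
    description: str

    @property
    def assignment_group(self) -> str:
        return ASSIGNMENT_GROUP[self.application]

bug = Defect("2013-05-01", Application.PORTAL, Phase.TEST, "login fails")
print(bug.assignment_group)  # -> portal-dev-team
```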

Most of all, don’t delay.  If you want to make a change in the future, you are going to need data about your past.  Otherwise, expect to spend a lot of time waiting around to get data and in the meantime making gut-based decisions instead of data-driven ones.

Focus on internal measures

In trying to faithfully translate LEAN thinking into software, I’ve been reading whatever LEAN books I can get my hands on.  One of the things that I’ve struggled with, as I’m sure many have, is that software lacks the volume and certainly the repetitiousness of manufacturing.  Indeed, nothing we make is the same as last time, or else we’d just deliver the same code over and over again.  Replicating code is easy.

Finding parallels is therefore difficult, but a book called Made-To-Order Lean by Greg Lane caught my attention.  The focus of the book is how to apply lean when you’re dealing with a low-volume, high-mix manufacturing environment.  For me, that seems like a better parallel to software than the application to high-volume shops.

Indeed, there is much to be learned from this book, but one thing caught my attention in particular in regards to metrics:

RPPM (returned parts per million) is an example of an important and necessary measurement, but it often represents the customer’s inspection/perception.  There is also usually a delay in obtaining and acting on this data, and some customer complaints are either outside the internal inspection limits or are caused by outside factors (for example, shipping damage, improper application, and so on).  So although this metric must be looked at and all returns must be analyzed, it is sometimes better for the quality and production managers to track and base immediate actions on internal quality measures rather than quality issues that make it through to customers (p. 3)

This observation really struck a chord for me when it comes to software and how often we struggle to measure things like defect containment rate and even simply the volume of production incidents.

We often find that production support teams are more interested in tracking work (indeed, they are employed only if quality is poor) than in improving quality.  Motivated by being busy, they put little focus on obtaining accurate characteristic data about an incident, often neglecting to record:

  • whether the problem was in a production or test environment
  • what application was affected (or confusing a software bug with a server issue because a reboot cleared it up)
  • what change injected the issue

In the absence of this information we struggle to measure defect containment rate, because we experience significant measurement system noise (through the first two bullets) and an inability to calculate the measure on a project-by-project basis.  Instead, we survive with a month-over-month containment number which, because production incidents don’t all manifest themselves immediately, is hardly representative of how changes in our process are (or are not) making a difference.

But it’s exactly these issues that Mr. Lane is pointing out in his book – the external measures of quality are bound to be problematic.  Besides our own lack of motivation to accurately record incidents, we’ve discovered in many cases that layers of production support have developed workarounds which mask significant volumes of incidents.  Again, the incentive for support teams is to have a level of quality poor enough that there’s a job to be done.  I don’t see it as malicious, but it’s human nature to stay on the hamster wheel you are on.  Support teams take pride in solving the problem but then fail to follow through to have the issue resolved once and for all.  Compound that with customers who fail to report defects because we’ve failed to respond in the past, and the picture is indeed a murky one.

Answers to all of this may lie in simply measuring defect density up to the point of testing.  As a proxy for what the external world will experience, if quality is poor coming into testing, then we can be assured that the output will be less than satisfactory.  Testing is, after all, an imperfect net.  As Mr. Lane suggests, we need to continue to look at and analyze data from the customer, but internal measures of quality can lead to a far more responsive improvement program.

Using imperfect rulers or “it’s about the pattern”

It turns out that I’m not greatly concerned with my weight, at least not on a day to day basis.  Why?  Well, things like water retention, the time of day I weigh myself, or a gluttonous moment all contribute to fairly significant fluctuations in my weight – perhaps 3-4 pounds.  According to my scale, I tend to be somewhere around 175 pounds, give or take.  I’ve been about this weight for a while, and after a gluttonous weekend, before I step on the scale, I hazard a guess at what it’ll tell me.  Did I reach 180 this time?  Do I really care?  No.

There’s error in my measurement system, mostly from the fact that scales are imperfect but also from these slight variations that aren’t meaningful change in my weight.  What’s really important to me about my weight is the trend.  I cared, say a year and a half ago, when I reliably weighed 200 pounds.  Based on the variations I observe today, it was somewhere around 200, perhaps 198, perhaps 202, but more or less 200 pounds.

Then I decided to eat healthier.  I won’t bore you with the details, this isn’t a blog about dieting, I promise.  My weight started to go down.  Slowly but surely, I’d see new lows and some ups, but more new lows and so on until I stabilized at this new weight – more or less 175 pounds.

Now, let’s say my scale is biased, perhaps it weighs me at 10 pounds lighter than I really am.  Maybe I’m really 185, not 175.  But that also means that before I started dieting I was 210, not 200.  The important thing is that I lost weight, and using a consistent ruler (my bathroom scale) I was able to detect a difference.  Plus there was corroborating evidence in the fact that I needed to buy smaller pants, so I know my scale hasn’t been slowly lying to me more and more as each day passes.

All rulers we have are imperfect, and software is no exception.  Rulers can introduce bias, like my example above or they can introduce noise.  My scale, depending on how exactly I place my feet on it, will weigh me plus or minus one half pound if I try taking several weights one right after the other.  I know I’ve not eaten anything, nor had time to burn much in the way of calories.  It’s simply noise.

The thing is, the pattern of change matters to me more than the exact weight that I am.  If we’re talking defect containment rate, and you see an eroding pattern of containment, it really doesn’t matter that your current measurement says 55% and the true value is 60%.  It matters that over time your containment rate has dropped a significant amount and you need to do something to fix it.

I’m essentially never concerned with changes in the one percent, five percent, even ten percent range.  I’m worried about big change, change that can’t be happening due to measurement system noise alone.  I’m almost never worried about bias, as long as the bias remains constant.  I’m concerned with patterns.  Patterns indicate change, for better or worse.  And change, if we’re trying to improve a process is what we ultimately want.

In software, things not to do include:

  • Trying to exclude defect tickets on the basis of cancellation, bug vs. enhancement, etc.  All these special cases in your selection will create opportunities for people to game the system.  “Hey, if I just mark the ticket an enhancement, it doesn’t get counted against me.  Nobody will notice.”  In all the studies I’ve done, the false positives run about 5 percent.  I’m willing to accept 5 percent noise to avoid a situation where someone could game the numbers.
  • Worrying about tracking every last defect.  On the flip side of the above, it’s OK to miss a few defects that are being sent around via email.  What you don’t want to miss is the vast majority of the work.  If you have general compliance with using a bug tracking system except for that one rogue employee, don’t fret about it too much.  If quality is getting better, it’ll show up not only in the bug tracking system but presumably equally in a reduction in the bug reports that employee is emailing around.

If you have a consistent ruler that doesn’t introduce so much noise as to be meaningless, then you needn’t spend your time perfecting the ruler.  Accept, and if necessary publicly acknowledge, that the ruler is imperfect but still capable of telling you a meaningful story.  Then get on with your measuring.