How to construct a good ROI

Far too often, the proposed cost-benefit analyses I see aren’t worth the paper they’re printed on. For the most part, this stems from the benefit side of the calculation, not the cost side. Although we do make poor estimates, getting better at estimating isn’t terribly difficult. Start by reading Steve McConnell’s “Software Estimation” and you’ll be well on your way.

The benefit side is where things go haywire. Let’s say we’re talking about the benefit of better code reviews. There’s plenty of industry data indicating that code reviews are valuable when done well.

So the math in people’s heads might go something like this: better code reviews reduce defects. Let’s assume a test defect costs… I don’t know… $1,000 each, that we can cut defects by 75% by doing better code reviews, and that a code review can be done in ten minutes. Even if the basic formula is right, all the inputs are wrong. Just like a computer program: garbage in, garbage out.

To do the benefits half of the equation, you need some data to support your assumptions. The things you’re assuming are likely knowable, or at least you can get into the right ballpark. Want to know what it costs to fix a defect? Do a brief time-study exercise. Or, if you know the cost of a production defect (which, for some reason, we often do), use research like Roger Pressman’s to arrive at an approximate cost of finding that defect in the testing or coding phases. The number is probably closer to $500.

Next, look at what the industry data says about the efficacy of code reviews. A 65% improvement is not unheard of, but assuming you’ll capture the entire benefit (plus more) right out of the gate is pure optimism. First of all, you might already be doing some reviews today, which blunts your benefit because the remaining potential gain is smaller. Secondly, you most likely won’t be able to capture the entire potential benefit. In one example I looked at, the difference in defect density between teams that did and didn’t do code reviews was 20%. So, if effective code reviews are 65% effective, the maximum opportunity was only about 40%, not the proposed 75%. Worse, when buying third-party tools or services, you can’t rely on the salesperson to provide good numbers. They have a vested interest in your buying the product, and thus in making the ROI work.

And then, on the ongoing cost side, it takes a lot longer than ten minutes to do a good code review. All in all, code reviews are certainly worth it, but you won’t get this too-good-to-be-true benefit from them. In many cases, we have a solution in mind but no idea how much benefit we might receive, so we make up numbers. Sure, that fills out the required paperwork, but it isn’t due diligence. We have an obligation to support our assumptions with data (our own or external).
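To make the shape of the calculation concrete, here’s a minimal sketch. Every input is a placeholder assumption (the defect cost and effectiveness echo the rough figures above; the volumes and rates are invented) that you’d replace with your own measured data.

# Rough ROI sketch for "better code reviews" -- every input is an assumption
# that should come from your own time studies and defect data.

defects_per_release = 200        # assumed: defects currently reaching test
cost_per_test_defect = 500       # assumed: from a time study / Pressman-style figure
review_effectiveness = 0.40      # assumed: realistic capture, not the optimistic 75%

reviews_per_release = 300        # assumed number of reviews per release
hours_per_review = 1.0           # assumed: far longer than "ten minutes"
loaded_hourly_rate = 75          # assumed fully loaded cost per engineer-hour

benefit = defects_per_release * review_effectiveness * cost_per_test_defect
cost = reviews_per_release * hours_per_review * loaded_hourly_rate

print(f"Estimated benefit per release: ${benefit:,.0f}")
print(f"Estimated cost per release:    ${cost:,.0f}")
print(f"Estimated net / ROI:           ${benefit - cost:,.0f} "
      f"({(benefit - cost) / cost:.0%})")

The arithmetic is trivial; the due diligence is in where those six numbers come from.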

An (accidental) study on the causality of static analysis

Capers Jones, in his book “The Economics of Software Quality,” lends support to the idea that static analysis is an effective tool for improving quality. He doesn’t directly address whether static analysis is merely correlated with better quality or actually causes it.

The problem with static analysis is the open question of whether static analysis causes better quality, or whether teams that already care about quality are simply more prone to using static analysis tools. While I can’t answer the question definitively, I can provide a data point.

A large organization in financial services had installed a commercial static analysis tool a couple of years earlier. During that time they collected lots of data from the tool but never acted on the vast majority of its recommendations. In effect, they accidentally conducted an experiment on the direction of causality. The organization also had enough of a function point counting capability to measure productivity and functional correctness while accounting for the variability in project size via function points.

In the absence of any action on the tool’s findings, we ought to expect that applications which score better in the static analysis tool would show evidence of higher team productivity or better functional quality. In essence, static analysis should predict better quality even in the absence of any other action. However, it didn’t. We found no evidence of a relationship between static analysis results and the outcomes customers care about – productivity or quality. Now, it may have just been the particular tool selected, and without more experiments we can’t rule that out. But it’s one data point on whether static analysis tools make a meaningful difference to quality.
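For what it’s worth, the analysis itself doesn’t need to be fancy. A sketch of the kind of check involved might look like the following – the numbers below are invented, and the real study used the organization’s per-application function point data.

# Sketch: is there a relationship between an application's static analysis
# score and its measured outcomes? All values here are invented for illustration.
import numpy as np
from scipy import stats

# One row per application: static analysis score, production defects per
# function point, and function points delivered per staff-month (productivity).
sa_score       = np.array([62, 71, 55, 80, 90, 45, 68, 77, 59, 85])
defects_per_fp = np.array([0.9, 1.1, 0.7, 1.0, 0.8, 1.2, 0.6, 1.3, 0.9, 1.0])
fp_per_month   = np.array([10, 12, 9, 11, 10, 13, 9, 12, 11, 10])

for name, outcome in [("defects/FP", defects_per_fp), ("FP/month", fp_per_month)]:
    rho, p = stats.spearmanr(sa_score, outcome)
    print(f"SA score vs {name}: rho={rho:+.2f}, p={p:.2f}")

# If the tool's scores carried signal, you'd expect a consistent, sizeable rho
# with a small p-value; in the case described above, nothing of the sort appeared.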

Statistical significance vs. practical significance

One concept that often gets lost when people become enamored with data is the difference between statistical significance and practical significance. Missing this important difference leads to some common problems, the most frequent of which is death by a thousand cuts. We find evidence that all kinds of things might influence a project outcome, but the size of the effect, while statistically significant, isn’t practically significant. We minimize one thing and get a one percent quality improvement while ignoring the less understood thing that causes 50% of the quality issues. Having found something that has an effect, regardless of its size, we insist on working on it.

There are always more things we could monitor than we should monitor. Knowing whether a difference is statistically significant or practically significant matters.
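A quick illustration of the trap, with made-up numbers: given enough samples, even a trivial difference will clear the p < 0.05 bar.

# Sketch: with enough data, a trivial effect becomes "statistically significant"
# without being practically significant. The distributions are invented.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 1_000_000
baseline = rng.normal(loc=10.00, scale=2.0, size=n)  # e.g. defects per KLOC
improved = rng.normal(loc=9.98, scale=2.0, size=n)   # a 0.2% "improvement"

t, p = stats.ttest_ind(baseline, improved)
print(f"p-value: {p:.3g}")   # vanishingly small: statistically significant
print(f"effect:  {baseline.mean() - improved.mean():.3f} defects/KLOC")
# Statistically "real", but no customer will ever notice 0.02 defects per KLOC.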

The only outcome that really matters is the one the customer sees

The other day I was reviewing a set of proposed metrics with a group of business analysts. We had already been through one round with their metrics and cut the quantity from fifty-plus down to about twenty-five. It was an improvement, but it still wasn’t good enough.

In the time between the first meeting and this one, which had been a while, my thinking on metrics had evolved some. The issue wasn’t that the team wanted to measure their performance – that’s a good start. The issue was that the team wanted to measure only their performance.

In particular, one measure that caught my attention was the requirements defect rate. In essence, it’s a simple measure: the number of requirements defects divided by the total defects found. But while the definition is simple, the implementation is not. First off, what does it mean to have a requirements defect? Is a misunderstanding about a requirement a requirements defect or a coding defect? If a requirement is missing, is that a requirements defect, or was it simply an implied requirement that any reasonable developer ought to have known? Certainly there are some defects where the source of injection is well understood, but there are many others where it is not.

But more importantly, it finally clicked for me when Dr. Hammer said that, regardless of your role, you should be measured against the customer outcome. The example he used at the time was that a parcel delivery service ought to be measured on on-time delivery – and everyone in the organization should be, regardless of their job. Why? Because everyone has a hand in making sure the package gets there, directly or indirectly. And if some teams are allowed to measure themselves differently, they can choose a set of measures that sub-optimizes the entire process. In essence, if the business analysts focused on what percentage of defects were requirements issues, quality could get worse (higher overall defect density) while the number of requirements defects stayed about the same. The end result: the customer would be more unhappy, the business analysts would look unjustifiably good, and nobody would undertake appropriate corrective action.
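A quick sketch with invented numbers shows how the metric hides the damage:

# Sketch of the sub-optimization: the BA-facing metric stays flat while the
# customer-facing outcome gets worse. All counts are invented for illustration.
before = {"requirements_defects": 20, "total_defects": 100}
after  = {"requirements_defects": 40, "total_defects": 200}  # quality got worse

for label, d in [("before", before), ("after", after)]:
    rate = d["requirements_defects"] / d["total_defects"]
    print(f"{label}: requirements defect rate = {rate:.0%}, "
          f"defects the customer sees = {d['total_defects']}")

# Both rates print 20%: the BA measure looks unchanged even though the customer
# now sees twice as many defects.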

What end does it serve, as a business analyst or any other team, to have a measure that simply allows you to say “not my fault” while the customer suffers? No, I argue: if you want to make sure the customer is getting what they need and want, then the only measures that matter are the ones that capture what the customer experiences, not some slice that serves only to absolve one team of its responsibility to help the organization meet that goal.

The software prediction “myth”

I was told today that software is unpredictable – that the very idea of planning a project is ridiculous because software is inherently unpredictable. Unfortunately, I think the comment stemmed from a misunderstanding of what it means to be predictable.

If you smoke cigarettes your whole life, odds are you will end up with cancer, heart issues, or some other horrid disease. Now, there are people who smoke their entire lives and never suffer any significant ill effects; they die from something else first. And yet, although those people exist, we can say with some certainty that, for everyone who does end up with a smoking-related disease, it was ‘predictable.’ In the same manner, it’s predictable that if you shoot yourself in the head you will die, and yet from time to time people survive having done exactly that.

Secondly, predictable doesn’t necessarily mean superbly accurate. Weathermen predict the weather and rarely get it exactly right, but over the last decade or so their accuracy has gone way up. They still get things wrong, but compared to the distant past, theirs is a reasonable prediction. In fact, some research I’ve seen puts outlets like the Weather Channel at 80% or better accuracy (within three degrees of the predicted temperature) over the long run.

To say software isn’t predictable implies that all outcomes are completely random, and yet we know that isn’t the case at all. Even the most diehard agilista will support unit testing in some form because the outcome of doing unit testing is predictable: you get better-quality code. Fair coins, dice, and the lottery are unpredictable (and, to be fair, there have even been lab studies showing that a coin flip can be predicted if you control enough of the variables).

If we want to improve our predictions – a separate issue from whether software is predictable at all – we have to study the factors and outcomes of projects to establish what matters. But software is predictable; don’t let anyone tell you otherwise.

A diversity of evidence

How many times have you read conflicting research? Eggs are good for you. Eggs are bad for you. A glass of wine is good for your heart. Alcohol is bad for your heart. Sometimes it makes you wonder what the heck scientists are doing. How can they flip-flop so often? Don’t they know what they’re doing?

In fact, they do. That’s the way science works. At any given time, one or more scientists is studying some hypothesis. To do so, they must select a research design and a measurement system, and contend with random chance and countless biases. So, if you pick up any given study, it’s likely to show some result… but which result, exactly?

Take pair programming, for example. If you search for research on it, you’ll find studies that indicate positive outcomes, studies that indicate negative outcomes, and studies that indicate no detectable effect. Which of these studies should you believe?

Well, in fact, you shouldn’t ever believe any single study. Science doesn’t work that way; it relies on a diversity of evidence. We expect other scientists to attempt to replicate our findings under different experimental conditions. Then we can look at the many experiments assessing the same effect and ascertain whether any one study was a fluke. You can’t simply pick the study that matches your world view – pretty much anyone can find a study that suits their pet theory. That isn’t the point. We must seek a diversity of evidence to determine whether what we are seeing is a true effect or just the fluke of a single study. Once we have many studies, we can estimate the likely effect by computing a mean effect size and use funnel plots to check whether the available evidence is skewed by publication bias.
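If you’re curious what computing a mean effect size looks like mechanically, here’s a minimal fixed-effect sketch with invented study results. Real meta-analysis involves more care (random-effects models, heterogeneity checks), but the idea is the same.

# Sketch: combining several studies of, say, pair programming into one pooled
# estimate via an inverse-variance-weighted mean. Effect sizes and standard
# errors below are invented for illustration.
import numpy as np

effects = np.array([0.30, -0.10, 0.05, 0.20, 0.00])  # per-study effect sizes
std_err = np.array([0.15,  0.20, 0.10, 0.25, 0.12])  # per-study standard errors

weights = 1.0 / std_err**2
pooled = np.sum(weights * effects) / np.sum(weights)
pooled_se = np.sqrt(1.0 / np.sum(weights))

print(f"pooled effect: {pooled:.2f} +/- {1.96 * pooled_se:.2f} (95% CI)")

# A funnel plot is just each study's effect size plotted against its precision
# (1 / std_err); an asymmetric funnel hints at publication bias.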

What is an inch anyway?

When it comes to measuring things, we often rely on long-established standards to know how big something is, how heavy it is, how much energy it possesses, and so on. In software, many of these standard measures simply don’t exist. How much software was delivered? Well, you could measure that in lines of code, story points, or one of a dozen or so competing function point counting methods. As much as we fret over the lack of standards, which ruler you use doesn’t actually matter as much as you might think. Just as you can work in metric or US units, you can use pretty much whatever system you want for software. However, when working in unfamiliar measurement territory, you do have to figure out whether the ruler is being applied consistently.

Fortunately, there’s a way to measure the measurement system: Gage R&R (repeatability and reproducibility). Let’s say I want to count how much software was delivered. There are two things I need to understand. First, if I assign Bob to count it, will he produce the same answer over and over again? Second, if I assign the same thing to Bob or Sue, will they both arrive at the same answer? Those are the questions Gage R&R seeks to answer.

To figure this out, you select a representative sample of the things you want counted and have each person count every item. Then you plot the answers on a few graphs comparing how consistent each person was and how consistently each item was measured. If you know the ‘correct’ answer for how something should have been measured, you can also look for bias in the measurement system. In our case, we tend not to care about that very much – if we knew the correct answer, we wouldn’t need to figure out how to count it. So, simply focus on keeping the variation as small as possible regardless of who does the counting or when, and you’re well on your way to a sound measurement system.
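Here’s a bare-bones sketch of the idea with invented counts. A real Gage R&R study would use ANOVA-based variance components, but even this crude version makes the two questions concrete.

# Sketch of a minimal Gage R&R-style look at counting consistency. The counts
# are invented: three "parts" (applications) each counted twice by two people.
import pandas as pd

data = pd.DataFrame({
    "counter": ["Bob"] * 6 + ["Sue"] * 6,
    "app":     ["A", "A", "B", "B", "C", "C"] * 2,
    "fp":      [120, 118, 300, 310, 95, 97,      # Bob's repeated counts
                125, 124, 305, 298, 101, 99],    # Sue's repeated counts
})

# Repeatability: how much does the same counter vary on the same app?
repeatability = data.groupby(["counter", "app"])["fp"].std().mean()

# Reproducibility: how far apart are the two counters' averages per app?
per_counter = data.groupby(["app", "counter"])["fp"].mean().unstack()
reproducibility = (per_counter["Bob"] - per_counter["Sue"]).abs().mean()

print(f"avg within-counter spread (repeatability):  {repeatability:.1f} FP")
print(f"avg between-counter gap (reproducibility):  {reproducibility:.1f} FP")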

What data is “not valuable?”

The other day I overheard “let’s get rid of the data that isn’t valuable.” There’s certainly some “data” that isn’t valuable in the sense that it’s known to be wrong, but that wasn’t the gist of this conversation. Instead, they were talking about data for which they could find no current use. For example, imagine you were collecting data about people and couldn’t find a relationship between, say, shoe size and heart rate. One might argue that, if you were looking for predictors of heart rate, shoe size is no longer valuable data and you should get rid of it.

In a given piece of research, that might be true. What if, however, you were collecting data about people (as marketing folks do) to help understand buying habits, and right now you could find no use for their shoe size? It’s taking up space in your database, albeit probably very little. You can’t use it in any of your current models. Should you throw it away?

Not so fast. The frustrating thing about statistics is that just because you don’t see a relationship doesn’t mean there isn’t one. We may not yet understand how to use shoe size in our model… maybe it has a fascinating interaction effect with hand size to predict buying habits? Who knows.

The point isn’t really about shoe size and whether it’s useful. More generally, if you can get good (by which I mean correct) data on something you at least guessed might be useful, I’m not so sure you should throw it away just because you haven’t found a use for it yet. Someday you may have a hypothesis about shoe size, and where will you be if you’ve discarded all that data?

Now, if the costs of storing or collecting that data are so onerous that you have to make a choice, by all means, discard away. But just getting rid of information because you don’t know how to use it yet… not so much.

How much cheaper must it be?

How many times have you given an estimate only to have the business partner try to negotiate you down? In my recollection, pretty much every time I’ve submitted an estimate there’s been pushback. Now, that’s not to say my estimates were any better than anyone else’s, or that my teams were more efficient – those were questions I didn’t, at the time, think enough about to collect the data to answer.

But the estimated cost of a project came up today. It was a huge project we were discussing, perhaps several million dollars in total spend. At some point the conversation turned to one small piece of the estimate. It was just ten thousand dollars or so, but we were debating whether it was the right number. Think about it… in the scheme of several million dollars, what’s ten thousand? It’s less than the likely error in the estimate, that’s for sure.

Which is the point of this post: at what point do you know that a proposed reduction in the estimate is meaningful? If you give a point estimate, you probably don’t have any frame of reference. If you provide best-case, likely, and worst-case estimates, however, you can begin to imagine how you’d figure that out. If the changes made don’t bring the likely cost below the best-case cost, you’re probably arguing about estimating error, not a meaningful difference in the scope or scale of the work.
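The check is almost embarrassingly simple once you have a three-point estimate. A sketch, with invented figures:

# Sketch of the "is this cut meaningful?" check using a three-point estimate.
# All dollar figures are invented for illustration.
best, likely, worst = 2_600_000, 3_000_000, 3_800_000   # assumed 3-point estimate
proposed_cut = 10_000                                    # the contested line item

new_likely = likely - proposed_cut
if new_likely >= best:
    print("The cut is smaller than the estimate's own uncertainty: "
          "you're negotiating estimating error, not scope.")
else:
    print("The cut pushes the likely cost below your best case: it only makes "
          "sense if the scope or scale of the work actually changed.")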

From folks like Steve McConnell we know that developers are chronic under-estimators. Why, then, would you allow yourself to be pushed into an even smaller estimate, particularly when you know you’re likely to come out worse than your likely-case estimate anyway? If you’re going to revise your estimate downwards, make sure it’s for a meaningful change in the scope of the work, not just optimism or bullying on the part of the business. In the long run, you’re doing them no favors by caving in when you can’t reasonably deliver for that cost. Now, figuring out how to be more efficient – that’s an entirely different topic.

Why I’ve soured on Defect Containment Rate

At one point in time, if you had asked me, I would have wholeheartedly agreed with Capers Jones and said that the one critical measure you need to have is DCR (Defect Containment Rate). Now that I’ve made several attempts to turn DCR into a reality, I’m convinced it’s one of the least useful measures you could have.

Where to begin:

1. It’s incredibly lagging. Once you put something into production, it can take weeks or months for all the missed defects to surface, because of the way the system gets used and your users’ potential lack of interest in reporting issues. Pretty much any measurement you take will be biased towards optimism, and optimism is not something you want in your measurement system, because optimism drives inaction.

2. It’s hard to do. Knowing what test defects you found is easy; knowing whether a production defect was caused by your project takes work. You have to figure out when it manifested itself in the code, possibly loading multiple versions of the application into development environments to figure out when it was actually introduced. And then there are the odd side effects you can never be quite sure whether you caused – like stability issues that only manifest because the new features increased usage of the application. Did this project cause the issue? Well, no, not directly, but it potentially contributed to its manifestation.

3. It doesn’t credit good behavior throughout the lifecycle. Ideally, DCR should capture all the defects you contain in all the stages of software development. Do code reviews and find things to fix? You should count those, right? Well, without running the code, you can’t actually be sure whether the thing you found during code review would ever have resulted in a defect. Sometimes it’s obvious that it would (like a null pointer dereference) or would not (like a shortage of comments), but often we don’t know. We know we don’t like the way the code was written and that there’s a cleaner way to do it. That hopefully contributes to the long-term stability of the application. In some sense that means avoiding future defects, but did you actually contain a defect? No, probably not, and DCR is never going to give you credit for avoiding a future defect.

4. It tells you something we already know. We’ve got this thing about not believing the industry. Though several researchers have found that testing removes 35-50% of defects per cycle (I’ve seen numbers a bit lower as well), we insist on measuring our own test capability. Given that reaching the lower end of that range isn’t hard – divide requirements into tests, write and run the tests, and record the results – do we really need to know how our testing is doing? Let me give you a hint: it works about the same as everyone else’s testing. Do three cycles of testing and assume you’ll get roughly 75-80% of the defects out (the quick sketch after this list shows the arithmetic). Now go measure something you don’t know much about, like the quality of the software coming into test.

5. It focuses you on the wrong thing. Guess what: testing will never be the best way to produce high-quality code. It’s a supporting player, at best. But if you measure defect containment, you are basically admitting that you are reliant on your test capability to keep bugs out of production. The best thing you can do to keep bugs out of production is to never write them in the first place, or to catch them much earlier than testing, where your odds of success are diminished. Get good-quality code coming into test, and the fact that you’ll only get 35% of the defects removed per test cycle won’t matter nearly as much.
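Here’s the quick arithmetic referenced in point 4. The 35-50% per-cycle figure comes from the research cited there; the rest is just compounding.

# Cumulative defect removal after n test cycles, each removing a fixed fraction
# of the defects still present.
def cumulative_removal(per_cycle_rate: float, cycles: int) -> float:
    return 1 - (1 - per_cycle_rate) ** cycles

for rate in (0.35, 0.50):
    print(f"{rate:.0%} per cycle, 3 cycles -> "
          f"{cumulative_removal(rate, 3):.0%} of defects removed")

# Prints roughly 73% and 88% -- i.e., about the 75-80% you could have assumed
# without bothering to measure your own test capability.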

I’m afraid DCR is not something I’m going to spend much time thinking about anymore, except perhaps to reiterate why we should be looking for better measures elsewhere.