More measurement mess

It’s always critical to read up on the trends within your industry, and software development is no different from any other.  One thing we fall short on, however, is reading both the praise and the criticism for any new idea that’s out there.  If we wish to treat computer science as a science, then there’s value in both the research which supports our theories and the research which refutes them.

Take, for example, this interesting review of a study on Test Driven Development.  The link to the original study is broken, but you can find it here.  The author does a great job pulling some critical observations from the provided data that the original study didn’t call out.  That’s not to say the study was dishonest, but as the author points out, there are “lies, damned lies, and… [statistics].”  It’s this type of critical review of all work that Karl Popper encouraged with his principle of falsifiability – we must not only find evidence to support our theories but also attempt to refute them in the pursuit of truth.

I’d like to see more critical research like this throughout the internet (and I mean that in the form of constructive criticism, not ad hominem attacks).  I should point out my own review of another study as an example.  We’re too enamored of new ideas, especially given the past few years’ fervor about Agile, and we’re not taking a hard enough look at what really happens when we adopt these approaches.  New ideas are good; blind zealotry is not.

By no means is everything about Agile bad, but it deserves a balanced scientific look if we ever intend software development to be a science instead of an art.  By the same token, I place no confidence in studies about test automation from companies that sell automation software.

Everyone is a statistics expert…

… when they don’t like what the data is telling them.  I happen to be a big fan of xkcd.com, a webcomic that covers a range of topics but, because of the author’s background, regularly mixes in science content.  Entertainingly, every comic comes with additional mouseover text.  Sometimes an extra joke is hidden there.  Sometimes something thought-provoking is.

This particular comic stuck with me, not so much because of the comic itself, but because of the mouseover text.  If you hover your mouse over it you’ll see, “You don’t use science to show that you’re right, you use science to become right.”  Statistics is no different from any other science in this regard.

A team had recently been using control charts to track their coding performance.  Due to the sparseness of the data, they were using an individuals control chart to track the trend and stability of their code quality.  One afternoon, I received an email from one of the team members telling me that we should be calculating the mean and the upper and lower control limits using only the most recent 10 projects.

My response to this request was “why?”  In turn, I was informed that quality had been better during the most recent 10 projects and therefore the mean line should be recalculated to reflect this.  Of course, if you looked back just one more project, you’d find a project with horrific quality.  That 11th project was a valid example of the potential outcomes the team could expect to experience.

The control limits on the chart represent exactly that – not just what you are currently seeing, but what you could reasonably expect to see through random variation alone.  There are well-known patterns in a control chart that signal a fundamental shift in performance – six points in a row all ascending or descending, nine points in a row on the same side of the mean line, and others.  But this particular chart showed none of the patterns that would be unusual in a stochastic process.
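For anyone who wants to check these rules against their own data, here is a minimal sketch of an individuals (XmR) chart calculation in Python.  The defect-density numbers are made up for illustration, and the run-rule helpers are my own simplified versions of the standard signals mentioned above, not anything from the team’s actual tooling.

# Minimal sketch of an individuals (XmR) control chart.
# The defect-density values per project below are hypothetical.
import numpy as np

defects_per_kloc = np.array([4.2, 3.8, 5.1, 4.0, 9.7,   # one deliberately bad project
                             4.4, 3.9, 4.6, 4.1, 3.7, 4.3])

mean = defects_per_kloc.mean()
moving_ranges = np.abs(np.diff(defects_per_kloc))
mr_bar = moving_ranges.mean()

# Standard XmR constants: limits sit 2.66 * average moving range from the mean.
ucl = mean + 2.66 * mr_bar
lcl = max(mean - 2.66 * mr_bar, 0.0)

def nine_on_one_side(x, center):
    """Signal: nine consecutive points on the same side of the center line."""
    signs = np.sign(x - center)
    run = 1
    for prev, curr in zip(signs, signs[1:]):
        run = run + 1 if curr == prev and curr != 0 else 1
        if run >= 9:
            return True
    return False

def six_trending(x):
    """Signal: six consecutive points all increasing or all decreasing."""
    diffs = np.sign(np.diff(x))
    run = 1
    for prev, curr in zip(diffs, diffs[1:]):
        run = run + 1 if curr == prev and curr != 0 else 1
        if run >= 5:          # five consecutive rises/falls = six trending points
            return True
    return False

print(f"mean={mean:.2f}  UCL={ucl:.2f}  LCL={lcl:.2f}")
print("shift signal:", nine_on_one_side(defects_per_kloc, mean))
print("trend signal:", six_trending(defects_per_kloc))

The point of the sketch is that the limits come from the full history of variation, so trimming the history to the ten most recent projects changes what counts as “unusual” without any actual change in the underlying process.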

Yes, recalculating the control limits based on only the recent projects would give the appearance of an improvement in quality.  Since the mean is what gets reported on the team’s scorecard, they’re invested in improving it.  But it isn’t good science to cherry-pick your data to make your numbers look good.  In doing so, you are withholding information you know about – information that tells a more complete story.

Instead, the team should be exploring why that past project was so bad.  They could potentially learn why they had such an unusual experience and act to prevent it in the future.  In doing so, they could permanently improve quality and create a desirable special cause variation pattern in their control chart.  Science – use it to help the team become right, rather than just show they are right.

Incorrect measures can yield bad decisions

Recently I was reading this interesting blog entry and related paper on the use of lightweight experiments to make decisions in Agile projects.  Before I get into my points of concern, I want to make clear that I think the philosophy is dead on.  Using data in this manner is a powerful way to make more informed decisions about which way you should or should not change your process.

However, though not intentionally, the author also exposes a significant risk of the scientific method: if you select the wrong measurement, you can get results that mislead.  More on that in a minute…

As a brief aside, I take issue with one general statement the author makes in regard to scientific study – “statistical significance is overrated.”  The author goes on to say that the purpose of lightweight experiments is not to uncover universal truths.  But achieving statistical significance does not mean you have discovered a universal truth.  In fact, any experiment conducted within a single company potentially has a sampling frame that prevents generalization even if you achieve statistical significance in your results.  What statistical significance does mean is that the result you are observing (within the boundaries of how far it can be generalized) is likely more than just random chance.

This is very important, for in the paper the author concludes that paired programming is faster, cheaper, and only slightly worse in quality.  Yet, lacking statistical significance, nobody should draw any of those conclusions.  The lack of statistical significance means that, at least in what you are measuring, there is no evidence of a difference between sample A and sample B.  In the author’s case study, it should simply return to a matter of wills whether the team in the experiment chose pair programming or individual programming plus Fagan-style inspections.  The evidence didn’t support one over the other.

Now, back to my main point about choosing the right measurement.  If you read the paper, you’ll see that the author compares the productivity of the paired programming tasks and the individual programming tasks using hours per line of code.  Lines of code (LOC) has long been derided as a poor measure of efficiency, primarily because developers can influence the measurement: it’s a simple matter to write less compact code that does exactly the same thing as more compact code would.  So the fact that the author uses exactly such a measure on p. 294 was a red flag.

What we need to ascertain is whether two populations doing the exact same activity would arrive at the same solution regardless of paired vs. individual programming.  In an experiment in a lab environment, we could probably control for this, but in the wild we choose not to, since it would waste resources and we have real work to get done.  Nevertheless, this means we have to figure out whether the paired programmers produced the same results (or at least approximately the same) as the individuals before we can compare their productivity at doing so.  Fortunately, Agile gives us a reasonable way to look at this.  Agile teams size stories by applying story points, typically a rating of the approximate size of the feature being requested.  The author’s data set also had this information.

One way we can look at this is to compare the median lines of code written by individual programmers for a story of X points to the median lines of code written by paired programmers for a story of X points.  A simple scatter of story points vs. LOC shows the issue:

As you can see, it seems that pair programming at 5 story points results in more LOC than the individual programmers produce.  Paired programmers have 2 of the 3 largest LOC solutions for 5 story points, and they have none of the smallest solutions in the data set.  Indeed, if we create a measure (LOC per story point) and compare the two populations using a Kruskal-Wallis test, we get the following:

Kruskal-Wallis Test on MLOC / SP

                         Ave
Experiment   N  Median  Rank      Z
Individual   7   25.90   5.4  -1.22
Pair         5   60.20   8.0   1.22
Overall     12           6.5

H = 1.48  DF = 1  P = 0.223

While the results (likely due to the small sample size) aren’t statistically significant, notice that the median for a pair is 60.2 LOC per story point, while the individual programmer produces less than half as much code – 25.9 LOC per story point!

And indeed, if we compare hours / story point in the two samples, we see a striking result – paired programmers spend more effort per story point than individual programmers do on average!  Pairs spend 5.6 hours per story point while individuals spend 5.31 hours per story point.  Again, the results aren’t statistically significant, but they’re notably different from the result you get by using lines of code as the measure.
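If you want to run this kind of nonparametric comparison on your own project data, here is a minimal sketch using scipy.  The arrays below are hypothetical placeholder values, not the raw data from the paper, so the H and p values will differ from the Minitab output above.

# Minimal sketch of a Kruskal-Wallis comparison of two samples.
# The values are hypothetical placeholders, NOT the study's raw data.
from scipy.stats import kruskal

# LOC per story point for each completed story (made-up numbers)
individual_loc_per_sp = [22.0, 31.5, 18.4, 27.9, 25.9, 40.2, 24.1]   # n = 7
pair_loc_per_sp       = [55.3, 72.8, 60.2, 48.9, 66.0]               # n = 5

h_stat, p_value = kruskal(individual_loc_per_sp, pair_loc_per_sp)
print(f"H = {h_stat:.2f}, p = {p_value:.3f}")

# A p-value above your chosen alpha (say 0.05) means there is no evidence of a
# difference between the two samples -- not that the samples are the same.

The same call works for hours per story point; just swap in that ratio for each story.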

Now, I’d hardly go so far as to argue that story points are a perfect proxy for the size of the request, but the discrepancy between the results using LOC and the results using story points is instructive.  As a Lean organization, not only do we need to define value in the eyes of the customer, but we also need to make sure our measurements support value the way the customer sees it.

The measurement the author chose shows that pairs write more lines of code, but not necessarily that they deliver more functionality to the user.  As a result of the experiment, the team chose a methodology that may have actually hurt the customer more than it helped.  They’re writing more lines of code, but a distinct possibility is that paired programmers simply produce less compact solutions.

Regardless of the ability to generalize to the larger world, if you are conducting studies even to make your own process decisions, it’s important to make sure you are measuring the right thing and that you’ve achieved statistical significance.

The above average driver

According to some studies, nearly 65% of vehicle drivers rate themselves as “above-average” drivers. These studies cite “optimism bias” as the cause, and (though I haven’t read the studies) that may well be the case. People simply think they’ll do better than the other guy.

On the other hand, a simpler explanation might exist. What does average mean to you? For someone like me, who happens to live statistics every day, average (or median) means something very specific. However, I think if you went out on the street and asked “are you an above-average, average, or below-average driver?”, most people would translate that question in their heads into “are you a good, so-so, or bad driver?” The issue simply becomes one of what people interpret average to mean. I wouldn’t cop to being a bad driver. Plus, I lack the perspective to know what an average driver experiences. Are we talking accidents per year? Speeding tickets per year? I’ve never had a single moving violation in my entire driving career to date, but I’ve been in 1 accident where the car I was driving was totaled. So where exactly does my skill land? I don’t know. I’ve never seen data about what other drivers experience.

Let’s say the “average” driver has 1 accident in their lifetime. I know that’s not true, but I’m just proposing it to illustrate a point. If there were extremely low variation in the population (and it were normally distributed), above-average drivers might have 0.9 accidents in their lifetime while below-average drivers have 1.1. It might be a statistical difference, but it’s hardly a practical difference. Being below average in such a population doesn’t mean much at all.

At issue here isn’t what the “average” statistic really means, but what it means to people when they hear they are above average or below average. To people, being below average means something really bad, when in fact a below-average performance might still be more than acceptable. We equate anything below average with unacceptable, which it doesn’t necessarily have to be.

For example, take a professional baseball player, put them on a team of little league players, and even their lowest “below-average” day would far surpass the skill of every kid on every team in the little leagues.

Ok, so what’s all this got to do with anything I write about? Don’t take for granted that people know what you mean when you use words like “average.” Average is a measure of central tendency, but there’s a more colloquial definition that means mediocre. Thus, a below-average performance might not mean “a statistically different performance compared to the larger population” to people, but rather “we suck.”

Why measurement is necessary

This evening, my daughter, in an attempt to be helpful, offered to put our dog’s food into his bowl for dinner. She does this every now and again, and usually the mess is kept to a minimum. Regardless, I can hear from across the room whether his food goes into the metal bowl or onto the floor.

It isn’t atypical to hear the clatter of a few pieces of food hitting the wood floor along the way, and this night was no exception. As my daughter raced back to me with the now-empty food scoop in hand, she said, “I spilled some.” Now, I don’t know what some means to you, but it doesn’t mean the vast majority of the food, right?

Apparently “some” meant exactly that to my daughter. Certainly, as I looked down, there was an amount of food in the bowl and another amount outside the bowl on the floor. But “some” is not the word I would have used to describe the amount on the floor. “Most,” if I have to be inexact about it, is the word I would have used.

I know it’s a dumb example, but this is exactly why we need to measure things. All the words we have to describe portions are inexact. What does a few mean? More than 2, certainly, but is a “few” deaths resulting from the millions of products you sold 3 people, 300 people, or 3,000 people? Compared to the whole, even 3,000 out of a million might be a few!

The “majority” suffers this issue in news reporting all the time. When the majority of people approve of the President’s performance, it means some number greater than 50% approve. 50.0000001% is a majority. It’s not an overwhelming majority, but it’s a majority technically. And I’ve seen “most” used to mean a simple majority as well, which is crazy, since most clearly means something higher than that. Is 75% most? 80%? 90%? Who knows? The definition is variable.

What about “some”? Officially, it’s just a number greater than 0. Some of the food was outside the bowl. Indeed, not all of the food was outside the bowl, so some is a fair statement. But my version of some and my daughter’s version are really different in this case.

And it’s because English is an inexact language that true measurement is needed. A proportion would have told a much better story. Not that I would have expected my daughter to say “Daddy, I spilled seventy-five percent of the food,” but I can expect that from an adult.

Let’s talk real numbers in business. Put a scale to it – the proportion that is a problem, the count that is a problem, some real measure of how much is actually wrong. “We have some issues with code quality,” after seeing my daughter’s definition of “some” tonight, has a whole new meaning for me.

Not one of the critical few

Ingrained in the Six Sigma school of thought is the critical few – the 80/20 rule. It is an important rule. In practice, there are a handful of things which often allow you to make big leaps from an incapable process to a capable one. There are more subtle characteristics of the process which can be refined to continually improve the performance, but this isn’t step change, it is refinement. And then there’s a class of things that just don’t matter.

Recently, while attempting to facilitate a process design effort, I spent a lot of time thinking about the things that don’t matter. That may have been because that’s all anyone spent their time talking about, and as facilitators, we enabled it to drag on. Having been instructed to drive to a single standard process and toolset, we discussed every little one-off thing that people wanted the process to allow for, to see if we could squeeze it out. A day’s worth of 25 people’s time to design a process, spent talking about the equivalent of the carpet color.

We wanted perfect compliance to the standard, and that meant a standard which was not necessarily all-inclusive (because some of these one-off requests were truly ridiculous by any standard). This is where I believe we got off track with process work. Process design is about controlling the critical few things which will make the difference in process performance.

But that is not what we were discussing. We were discussing nuances, oddball cases, odd uses of the process, and data elements that some teams wanted and others didn’t. We talked about the 1% and largely ignored the 99%. We talked about things that weren’t going to make a difference whether they went one way or the other.

To begin with, we didn’t know what was going to make the difference. We hadn’t studied the existing processes to understand what made them work – what really mattered and what didn’t. This created unnecessary room for debate because we were unable to bring adequate materials to the table to help the team work through their differences. We had little to no information on what mattered and what didn’t.

Instead of define-measure-analyze-improve-control we just went right into improve. And there we got bogged down discussing every little quirk, because we didn’t know what else we ought to be talking about. Or more importantly, what we shouldn’t be talking about.

Instead of a conversation of “do we really need that? How many of our teams use that process step?” we could have said “sure, it doesn’t matter to me if you allow for that.”  And we’d be saying that not because we didn’t care, but because we actually knew what did matter.  Everything else – the little things we debated with the teams – could have been bargaining chips we could dole out in heaps while giving up basically nothing that really mattered.  We could have had a strong position, not because we won all the arguments, but because we knew which battles were worth fighting and which were worth conceding.

Had we known which things were not among the critical few, we could have appeared very agreeable and allowed the teams as much “leeway” in the process as they claimed they needed. All along we’d be giving up nothing. Nothing that really mattered, anyway.

It’s a reminder of why thorough measurement and analysis of a process is important. It isn’t just discovering what the current state is (measurement); it’s also understanding why it works (analysis). And from there, narrowing down to the bits of process that really do matter and letting the rest go. Some things just don’t matter.

Excited about nothing

Recently I’ve been working on measuring organizational efficiency, and I was comparing the test case execution patterns of a quality assurance team to those of a known sample from an IBM whitepaper. IBM had recognized that there is an S-shaped curve to the cumulative execution of test cases: you start off slow, ramp up, and then, as you reach the end, the last few cases take longer to get done. I don’t know exactly why this is, but I wondered whether the same pattern applied in this situation.
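As an aside, if you want to test whether your own cumulative execution data follows that S-shape, a quick way is to fit a logistic curve to it. Here is a minimal sketch in Python; the daily counts are invented for illustration, and scipy’s generic curve fitter stands in for whatever model IBM actually used.

# Minimal sketch: fit a logistic (S-shaped) curve to cumulative test executions.
# The daily counts below are made-up example data.
import numpy as np
from scipy.optimize import curve_fit

def logistic(t, total, rate, midpoint):
    """Cumulative executions modeled as logistic growth."""
    return total / (1.0 + np.exp(-rate * (t - midpoint)))

days = np.arange(1, 21)
cumulative_cases = np.array([  3,   8,  15,  28,  50,  85, 130, 180, 230, 275,
                             310, 340, 362, 378, 388, 394, 397, 399, 400, 400])

params, _ = curve_fit(logistic, days, cumulative_cases,
                      p0=[cumulative_cases.max(), 0.5, days.mean()])
total, rate, midpoint = params
print(f"asymptote ~ {total:.0f} cases, steepest progress around day {midpoint:.1f}")

If the fitted curve tracks the data closely, you are seeing the same slow-start, ramp-up, slow-finish shape IBM described.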

And that reminded me of a story about a college professor. Professor Reid was a geology professor at my college, and the way my college curriculum worked, even if you weren’t majoring in the natural sciences you still had to either take a certain number of courses or do a project in the natural sciences. I opted to do a project, though I had no idea what that project was going to be. Fortunately, someone lined me up with Professor Reid.

Professor Reid told me that he had taken a bunch of high school students (on some sort of outreach program) to Shapiro Brook, a generally unremarkable brook which ran down the side of a nearby mountain. At the top of the mountain where the brook sprang from the ground was a quarry.

Now, I’m probably going to get this wrong, so if you are a science buff, I apologize. If you are a science student looking for information on conductivity or pH, this is NOT the place you want to look. You’ve been warned.

Anyway, apparently the behavior of “normal” brooks is that when the water springs from the ground it has a relatively high pH and low conductivity. This is due to there being lots of free H+ ions in the water. As the brook travels over the surface, the free H+ ions are bound by potassium (K) and sodium (Na). As a result, the water becomes more neutral in pH (the pH drops) and more conductive (the conductivity rises). As I said, that’s the “normal” behavior.

What Dr. Reid and his students found was the exact opposite. For some reason, pH rose and conductivity dropped. He found this fascinating and wanted me to repeat the experiment, bring back results and finally even put some of that stuff through a Plasma Mass Spectrometer. The Plasma Mass Spectrometer is the kind of equipment that GRAD students wait in line to use, so I was super excited to have the opportunity. Dr. Reid thought, by the way, that the active quarry at the mountaintop was somehow impacting the pH and conductivity of the brook, though he wasn’t sure what the mechanism was exactly.

Anyway, early that fall, I walked up the mountain with a conductivity meter and about 40 little plastic vials which I had properly cleaned with DI water… blah, blah, blah, I won’t bore you with the details of my experiment preparations. Every 50 yards or so I took a vial of water and a conductivity reading. When I got back to the bottom of the mountain, I pulled out the map I had been given. I don’t know why I did this AFTER the fact, but I did. And that’s when I realized I had walked the WRONG brook. Now, I was a college student just trying to complete a coursework requirement. I could’ve just used the data I had and ignored whether the results were honest or not. But no, I felt guilty doing such a thing, though it crossed my mind, so I went back to the lab, cleaned 40 more vials, and trudged back up the mountain, this time with my map out from the start.

Again, I went down the mountain collecting samples every 50 yards or so. Once winter fell, I returned to the same brook to repeat the experiment; we did this to make sure that little feeder streams weren’t influencing the main brook. Of course, this time, instead of walking down some of the mountainside, I fell and tore up my hand and wrist pretty badly. Determined not to make yet another trip, I ripped off some of my shirt, wrapped my hand and wrist (that was probably melodramatic of me), and proceeded to complete my measurements.

When I got back to the lab, I carefully tested the pH of every vial and recorded the data. Then, I brought all my results and readings back to Dr. Reid. I couldn’t really make heads or tails of it, but he could. He literally started bouncing up and down in his chair with excitement. Not in some sort of ridiculous way, but just a little more spring as he talked to me, and his eyes lit up, and a big smile came to his face.

“NOTHING! Shapiro Brook behaves just as it should!” he exclaimed.

I was heartbroken. How was I supposed to write a college paper on nothing? Dr. Reid was undeterred. He proceeded to tell me how great this was, to disprove that there was anything special about Shapiro Brook at all. To find that the world worked exactly as we would expect it to was, to him, joyful. “You could be a science guy,” he said to me, “have you ever considered switching concentrations?”

And that stuck with me through all these years. When Dr. Reid passed away in the early 2000s, it was this story that first came to mind, and the story that came to mind when I pulled together my data for Quality Assurance.

Sure enough, the QA team experienced the same pattern of progress that IBM had observed. The S-shaped curve wasn’t just some IBM myth. I’m not a QA person, just as I wasn’t a “science guy” back in college, so maybe all QA people know this, but I didn’t. There was excitement in discovering that they were just like everyone else, so I sent an email titled “so cool!!!” with the details of my findings to a good friend who I knew would appreciate it. There is satisfaction in finding out that we are not special or different – that, despite what people believe, what the outside world experiences also applies to us. It gives us hope that what we learn elsewhere is transferable knowledge.

Risk-based testing doesn’t change the goal

I was just out to lunch with a friend who was telling me that the quality assurance department they work for is switching over to risk-based testing.  It’s a simple concept as I understand it – test more where there is more risk, test less where there is less risk.  Risk is determined by experience, typically in the form of a scoring system that rates how risky a given change or application is.  The higher the score, the more testing, or the more types of testing, you do.

Now, I’m not a testing expert by any means, but the conversation turned to how they were going to measure their success.  Prior to risk-based testing, the measure of success for the department was defect containment rate (DCR).

Defect containment rate is fairly basic as well.  It’s simply (every defect you find in testing) / (every defect you find in testing + every defect you find in production).  In effect, if you find 75 defects while testing and 25 more defects are found after the code reaches production, then you have a 75% (75 / (75 + 25)) defect containment rate.  Generally, the higher your DCR, the better.

But, no, I’m told by my friend that the new measurement will count only defects in the areas QA tested.  So, by that logic, if through risk-based testing you determined function A wasn’t risky enough to be worth testing, and it breaks in production, that defect shouldn’t be counted against you…  Such a decision would only serve to shrink the denominator.  You’d still report all the bugs you found in test, but for each production defect found, you’d get to decide whether or not you meant to test for that bug.  Suddenly, 25 defects in production might only count as 10 or 15 if you deemed the remainder “things we weren’t looking for.”  Now instead of 75% containment (75 / (75 + 25)) you’d have an 88% (75 / (75 + 10)) containment rate.  Hey, you improved!!!  Wrong!
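To make the arithmetic concrete, here is a tiny sketch of the two calculations side by side, using the counts from the example above.  The function name is mine, not anything from an actual metrics tool.

# Minimal sketch of how excluding "we weren't testing for that" production
# defects inflates defect containment rate; counts are from the example above.
def defect_containment_rate(test_defects: int, production_defects: int) -> float:
    """DCR = defects found in test / (defects found in test + defects found in production)."""
    return test_defects / (test_defects + production_defects)

test_defects = 75
all_production_defects = 25
# Same release, but 15 of the production defects are waved off as "out of scope"
counted_production_defects = 10

print(f"honest DCR: {defect_containment_rate(test_defects, all_production_defects):.0%}")    # 75%
print(f"gamed DCR:  {defect_containment_rate(test_defects, counted_production_defects):.0%}") # ~88%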

Something is amiss here!  Since when does changing the way you do things change what is important to your customer?  If your prior measurement – defect containment rate – captured what your customer expected of you, where did you get a free pass to stop accomplishing that goal?

You don’t design metrics around what will look good.  Looking good is NOT equivalent to doing good.  Actually being good and meeting your customers’ needs is the goal.  Refusing to measure how defects impact your customer just because you weren’t looking for those defects doesn’t make the defects go away.