Getting back to writing

As you can see from my prior posts, I’ve taken quite a long leave of absence from writing – approximately two years.  In some ways, that’s good.  My interests have changed as my experience has grown, and I’ve moved away from process engineering toward a more statistics-heavy approach to my work.

That doesn’t mean I’ll never write about process again; indeed, I think the purpose of using data is to decide what choices to make (about process, policy, whatever), so I can’t imagine never touching on the subject again.  However, I want to refocus my efforts on the types of statistical errors I frequently see, so that I can help make the world a bit smarter when it comes to thinking about data.  And at the same time, I hope to fill the gap between “Statistics for Dummies” (which is just a toe in the water) and the Andrew Gelmans of the world (where almost everyone is in over their heads).

On “normalizing” data

I was teaching a class today on the visual analysis of data, which included using my favorite teaching aid, the statapult (it’s a miniature catapult, in case you haven’t seen one). We were talking about how to effectively present information, and I went off on a tangent about why simply counting defects is a bad idea. I explained to the class that just counting defects ignores the “opportunity” for a defect. Simply put, it’s common sense that if you deliver a bigger project you will deliver more defects (all other things being equal). So how you measure the size of the opportunity matters; you can’t be completely arbitrary about it.

The statapult provides a good chance to show what it means to normalize data. The adjustment on the statapult for fine control is the pullback angle. So I had the class try a bunch of pullback angles and measure the distance of each shot. I then had them divide the distance by the pullback angle. This “normalizes” the data to inches per degree, and you can easily see that the value is consistent across all the pullback angles.

But you can’t normalize the distance by dividing by the number of people on the team; that doesn’t correct for anything that has to do with the statapult. One person can fire it, or a team of three or five or even more, so dividing distance by people is arbitrary at best.
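
To make that concrete, here’s a minimal sketch in Python with made-up numbers (nothing here comes from a real statapult session); it just shows a sensible divisor staying steady while an arbitrary one bounces around:

    # Hypothetical statapult data: pullback angle drives distance, team size doesn't.
    pullback_angles = [100, 120, 140, 160, 180]        # degrees
    distances       = [52.0, 61.5, 73.0, 82.5, 93.0]   # inches (made up)
    team_sizes      = [1, 3, 5, 3, 1]                  # people at the statapult

    for angle, dist, team in zip(pullback_angles, distances, team_sizes):
        print(f"{dist / angle:.2f} in/deg    {dist / team:.1f} in/person")

The inches-per-degree column hovers around 0.52 because the angle actually drives the distance; the inches-per-person column jumps all over the place because team size has nothing to do with how far the ball flies.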

And yet, while that makes perfect sense in a classroom setting, we often fail to do anything as sensible in the real world. To be a good normalizing factor, the divisor must be correlated with the numerator. Calculating defects per function point makes sense. Calculating defects per developer probably doesn’t (developers can work on something for many months, so just counting the team size is arbitrary).

Don’t be arbitrary when normalizing data. Dividing one value by some other arbitrary value does not result in a sensible calculation.

Two points make a line…

… But not a pattern.

Back in middle school I was taught that two points are adequate to define a line. That’s true. But when it comes to stats and collecting data, two points are far from adequate. If I collect a couple of data points on some system, I can certainly calculate a line that connects the two, but I have no idea how much the data I collected is affected by random noise. The line might be a good approximation of the relationship between the two variables, but it just as easily might not be.

That is why you need to collect a bunch of data. What matters isn’t whether you can draw a perfect line through all the data points; what matters is that you can understand the pattern the data makes. Perhaps the relationship is linear, but perhaps it follows some other pattern: logarithmic, exponential, seasonal, or something else entirely.
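
Here’s a small, hypothetical sketch of that point – the logarithmic relationship and the noise level are invented purely for illustration. Any two points give you a slope, but which slope depends on which two noisy points you happened to grab; the full data set lets you compare candidate patterns.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical system: the true relationship is logarithmic, plus noise.
    x = np.linspace(1, 50, 40)
    y = 10 * np.log(x) + rng.normal(0, 2, size=x.size)

    # Any two points define a line, but the slope you get depends entirely
    # on which two (noisy) points you happened to collect.
    for _ in range(3):
        i, j = rng.choice(x.size, size=2, replace=False)
        print(f"slope through points {i} and {j}: {(y[j] - y[i]) / (x[j] - x[i]):.2f}")

    # With all 40 points you can compare a straight-line fit against a fit
    # in log(x) and see which shape the data actually supports.
    print("linear fit (slope, intercept):", np.polyfit(x, y, 1))
    print("fit against log(x):", np.polyfit(np.log(x), y, 1))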

While any two points may define a line, it’s not necessarily the right line.

Agile Slaves?

A friend forwarded me this article – a review of Agile ten years on.  The author makes an interesting point: there appears to be a common thread in that Agile has become the next process, when it was supposed to free us from process.  Instead of being agile, companies are “doing Agile.”  I can find blog entries from 2006 indicating that something along the same lines has been observed anecdotally as well – what Stevey calls “Big A” Agile vs. “Little a” agile.

So, after reading the first article, I wondered: what would LEAN have to say about this?  On one hand, this could be a good thing.  As Stevey’s blog post points out, if everyone seems to do the process poorly, then maybe there’s something wrong with the process.  But if slavish adherence to Agile has standardized the process into something leaner (even if not all the way there), then perhaps that’s a good thing, right?

On the other hand, it doesn’t go far enough.  If people have become slaves to sprints, iterations, stories, and index cards and aren’t continuing to improve the process, then we have a real problem.  Being a slave to a leaner process is only better insofar as you stay ahead of your competition.  The point of continuous improvement is to compete against perfection, so that there isn’t room for your competitors to leapfrog you while you rest on your Agile laurels.

In the end, I’m not sure how I feel about this revelation that people have become slaves to Agile processes just as they were slaves to Waterfall.  Perhaps most importantly, it confirms something Boehm and Turner wrote in “Balancing Agility and Discipline”: statistically speaking, half of everyone is below average (below the median, technically).  Maybe the illusion is that there are enough people who can effectively riff on an existing process to make the purist vision of Agile work; for the rest of us, having something standard that we can slowly improve over the long run may be the better choice.

Hurry up and wait

Why testers want to be involved early in the project is beyond me.  Time and time again I hear “we need to be brought in sooner.”  To do what?  To listen to requirements that are going to change?  To write test plans that are going to be stale?

If you want to show up early, you need to deliver early value, and sitting and listening in is not delivering value.  While I don’t agree with a lot of what the Poppendiecks wrote comparing LEAN to Agile, the idea of delaying commitment is an interesting one.  It’s a bit like not overproducing in manufacturing, since creating something too often or too early results in throwaway work.  Having testers (notice I specifically said testers and not QA) show up early when their value comes very late in the project doesn’t make much sense at all.

When things aren’t stable, building derivative works from them (like test cases from requirements) simply invites extra work.  Instead, involving testing at the latest possible moment gives a better chance that a larger portion of the system is stable.  It’s silly to hurry up and wait, or in the case of software, hurry up and rework.

The definition of insanity

Einstein supposedly said that the definition of insanity is doing the same thing over and over and expecting different results.  In software quality, running the same test over and over again while expecting to find a new defect fits that definition.  Far too often, developers perform “unit testing” that’s really integration testing.  Then the formal QA department performs its own round of testing, which is probably really just more integration testing.  Finally, the users perform “acceptance testing,” which is, again, integration testing.

We don’t really distinguish these various types of testing from one another or clarify what their goals are, so instead each group covers the same ground in turn, hoping to find something the prior group didn’t.  First off, it’s sad that they sometimes do.  The reason that happens is that defects get introduced by inconsistent build and software promotion processes, and the checks performed when the tests do occur are incomplete.

But even when testing is run this poorly, the odds of finding defects this way aren’t great.  Once fixed, except on the most disastrous of development teams, the bugs aren’t coming back.  So instead, focus each team on doing a different type of testing so that new ground gets covered:

  • Developers perform unit tests, which check that methods or classes are robust and behave well.  They check for off-by-one errors, null pointer exceptions, and the like (see the sketch after this list).  They might perform some integration tests, but not exhaustive ones.
  • QA performs integration tests, which make sure large parts of the system hold together and deliver the right functionality.
  • Users perform acceptance tests, which make sure the system behaves as they expected it to.  These differ from integration tests in that users are more concerned with the end outcome than with what happens to the database or files in the interim steps.  If code quality is good, acceptance tests can be performed in parallel with integration tests.  If you find yourself having to wait until the end to perform acceptance testing because you don’t want the users “exposed to bad code,” then there’s something else entirely wrong with the way you develop software.
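
Here’s the sketch referenced above – a hedged, hypothetical Python example of what a unit test looks for.  The paginate function and its rules are invented for illustration, not taken from any real project:

    import unittest

    # Hypothetical function under test: split a list of items into pages.
    def paginate(items, page_size):
        if items is None:
            return []
        return [items[i:i + page_size] for i in range(0, len(items), page_size)]

    class PaginateUnitTests(unittest.TestCase):
        # Unit tests probe one small piece of code for robustness at its boundaries.
        def test_exact_multiple_has_no_trailing_empty_page(self):
            self.assertEqual(paginate([1, 2, 3, 4], 2), [[1, 2], [3, 4]])

        def test_remainder_lands_on_last_page(self):    # classic off-by-one territory
            self.assertEqual(paginate([1, 2, 3], 2), [[1, 2], [3]])

        def test_none_input_is_handled(self):           # Python's analogue of a null pointer
            self.assertEqual(paginate(None, 2), [])

    if __name__ == "__main__":
        unittest.main()

An integration test, by contrast, would wire paging into a real service and database and check that a whole slice of functionality hangs together, while an acceptance test would ask whether the paging behavior is what users actually wanted.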

Testing should be a non-event, not the main event.  You’re placing far too much risk at the end of the project if testing is the only way you find bugs, particularly if the way you test is the definition of insanity.

Unit Testing

It’s not often that I write a blog post simply to call attention to another spot on the site, but that’s what I’m doing in this case.  We’ve just added a new resource to address some of the practical implications of trying to unit test.  There are lots of sites out there that will give you unit test theory, so we’ve stuck to a quick overview and then jumped into a simple example that can help you see how to overcome some of the most common issues with unit testing effectively.

Why write this?  Doesn’t everyone know how to unit test already?  Unfortunately not.  We find that teams often mistake integration testing done by the developer for unit testing.  There’s nothing wrong with integration testing – far from it – but unit testing, done properly, looks for a different kind of defect.  These defects – null pointer exceptions, off-by-one errors, and the like – are harder to find through traditional functional testing, leaving a clear gap in your test coverage.

More importantly, you often want to integrate unit testing into your teams’ existing practices, but that means they’ve got code which isn’t necessarily well suited to unit testing.  We believe this short tutorial will help you resolve those issues so that you and your team don’t get into an ideological battle.  Instead of “we can’t unit test because the code isn’t structured for it” (while knowing full well the code will *never* get restructured for it), help your team unit test as effectively as possible even in an imperfect world.
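
As one hedged illustration of the kind of workaround I mean (the weekend_discount function and its rules are entirely made up, not taken from the tutorial), you can often open a small seam for tests without restructuring anything for production callers:

    from datetime import datetime

    # Legacy-style code that reads the real clock directly is awkward to unit
    # test. Exposing the clock as a parameter with a default is one small seam:
    # production callers change nothing, and tests can pin the time.
    def weekend_discount(price, now=None):
        now = now or datetime.now()
        if now.weekday() >= 5:           # Saturday (5) or Sunday (6)
            return round(price * 0.9, 2)
        return price

    # In a test, inject a fixed date instead of waiting for the weekend.
    def test_discount_applies_on_saturday():
        assert weekend_discount(100.0, now=datetime(2024, 6, 1)) == 90.0   # a Saturday

    def test_no_discount_midweek():
        assert weekend_discount(100.0, now=datetime(2024, 6, 5)) == 100.0  # a Wednesday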

It’s not a univariate world

Apparently we’ve done the world a disservice with the word correlation.  Recently I’ve been involved in several discussions surrounding the effectiveness of unit testing.  As you’re probably aware, unit testing, like all forms of testing, is only about 35-50% effective at removing defects.  That is, if you inject 100 defects, unit testing will typically find between 35 and 50 of them.

As a result, we say that there’s a correlation between unit testing and better quality.  That’s correlation as in “there’s a relationship between the two,” not necessarily the linear correlation people think of that could be expressed in the form y = mx + b.

In these discussions, multiple people expressed concern with our measurement system, which simply measured defects per unit of work delivered.  It’s a common measure of defect density, which accounts for the size of the software being delivered; the unit of work can be function points, lines of code, or something else.  At any rate, they felt our data was too “noisy” because we counted all defects, whether they were coding errors, requirements errors, or anything else.

So, essentially, we were comparing the defect density of a population that did unit testing against one that did not.  We simply counted all the bugs in both populations, whether or not unit testing was designed to find them.  And what do you know, we were able to detect a statistically significant difference between the populations.

This seems to really confuse people.  How is it that we measure everything, including things we didn’t control for, and yet still detect a difference?  The issue was that they wanted to measure a very specific defect density – a unit-test-specific defect density.

That isn’t necessary.  There are certainly other factors – whether you did code reviews, design reviews, and so on – but if those things vary randomly between the population that unit tests and the population that doesn’t, it’s still possible to detect a difference.
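
A small simulation sketches the point.  Every number below – the defect densities, the size of the unit testing effect, the noise contributed by other practices – is invented, but the mechanics are the same: if the other factors vary randomly across both groups, a plain two-sample comparison can still pick up the difference.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    n = 200  # hypothetical projects per population

    # Noise from factors we didn't control for (reviews, churn, experience...)
    # that varies randomly in BOTH populations.
    def other_factors(size):
        return rng.normal(0, 1.5, size) + rng.binomial(1, 0.5, size) * rng.normal(0, 1.0, size)

    # Invented effect: unit testing shaves a bit off the defect density.
    without_ut = 6.0 + other_factors(n)
    with_ut    = 5.2 + other_factors(n)

    result = stats.ttest_ind(without_ut, with_ut, equal_var=False)
    print(f"t = {result.statistic:.2f}, p = {result.pvalue:.4f}")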

It isn’t a univariate world, but statistics can handle that.  We needn’t over-complicate our measurement systems to try to get a laser focus on what we think we care about.  We can simply measure the overall outcome, and if something makes a difference, we ought to be able to detect it.

Fat and Short Tails

If you’ve ever done regression analysis, then you’re probably familiar with some of the diagnostic plots you get out of the work.  If you’re not, I’d encourage you to read up on regression analysis, because there’s far more to it than just getting an r-squared or even an adjusted r-squared.  You need to examine the diagnostic plots of the residuals to understand whether the model is decent.  And one of the most frequently used diagnostic plots is the normality plot of the residuals.

The assumption regarding the residuals is that they are normally distributed around a mean of zero.  But sometimes the residuals wander away, and then what do you do?  There are lots of ways that this could happen, but I’ll show you two – fat tails and short tails.
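
If you want to produce one of these plots yourself, here’s a minimal sketch; the straight-line model and data are invented, and scipy’s probplot does the heavy lifting:

    import numpy as np
    from scipy import stats
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(2)

    # Hypothetical data with a simple straight-line fit.
    x = np.linspace(0, 10, 80)
    y = 3.0 + 1.2 * x + rng.normal(0, 1.0, x.size)

    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (intercept + slope * x)

    # Normal probability plot of the residuals: points hugging the reference
    # line support the normality assumption; systematic departures at the ends
    # are the fat- and short-tail patterns discussed below.
    stats.probplot(residuals, dist="norm", plot=plt)
    plt.title("Normal probability plot of residuals")
    plt.show()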

Short tails look like this on the normality plot:

Notice the distinct “S” shape to the residuals.  It’s also an easy mnemonic device – “S” shape = short tails.  Short tails indicate that the data is more tightly packed around the mean than a normal distribution would expect.  In the simple example I’ve created above, one way to get short tails is to have 2 populations with the same mean but different variances.  For example, one has a mean of 0 and standard deviation of 1, while the other has a mean of 0 and standard deviation of 0.5.  The other possibility is that you have a single population whose error varies more (or less) as the independent variable increases.  In either case, there’s more data close to the mean than you’d expect.

The other option is fat tails, which look like this:

I guess you could call it the opposite of the “S” shape that you see with short tails.  Fat tails occur when you have a missing explanatory variable that defines two different levels of the dependent variable.  For example, you might have two populations, one with a mean of 0 and standard deviation of 0.5 and the other with a mean of 1 and standard deviation of 0.5.  Notice that in this case it’s the means that differ instead of the standard deviations.

The great thing about fat tails is that you can go looking for another explanatory variable to correct the problem.  For short tails, I’m not certain there’s a lot you can do, though I have to admit I don’t have a doctorate in statistics.  You may be able to calculate a percent error instead of an absolute error if the issue is variance that increases with the independent variable, but there may be other corrections I’m unaware of.

What’s important to note is that even if the residual plots aren’t what you hoped for, they can still help you learn about what’s going on, and you can use that information to improve your software processes.  We learn when we fail, so rather than shrug and give up on a failed model, see what you can take away from it.  Fat and short tails are at least two things to get you started.

Jidoka’s place in LEAN IT

I recently read this whitepaper from Dr. Curtis at CAST Software (you have to register to read it, but it’s worth it, in my opinion).  Dr. Curtis covers an aspect of LEAN that he says Agile has overlooked in favor of LEAN’s “Just in Time” aspects.  That aspect is Jidoka.

In manufacturing, Jidoka refers to self-monitoring processes and systems that allow a single person to oversee many machines.  The machines can detect their own failures or inconsistencies, so you don’t waste people whose only job is to error-check the machines.  It’s a powerful technique, and one that I agree has been overlooked by the community at large.

Simply put, Jidoka is about not making mistakes, which is different from Agile’s “fail fast” mentality.  Certainly, I’d rather fail fast than fail slow, but as I’ve written before, and as Dr. Curtis articulates better than I do, it’s better never to fail in the first place.