What are you evaluating your model against?

I enjoy reading Kaiser Fung’s blogs (junkcharts and numbersruleyourworld). One entry in particular caught my attention because it was relevant to some work I had been doing recently.

We were working on a model to improve project estimation for an organization. The group in question was exerting a lot of effort to deliver each estimate. There are a lot of players involved in a project, and each team wants a chance to weigh in on its part. In fact, Steve McConnell notes in his book Software Estimation: Demystifying the Black Art that the best person to give an estimate is the person doing the work. But because of the huge backlog of items needing estimation, the organization was drowning in estimates and not getting anything done. They wanted to know if a model could improve the process.

So we collected some data and constructed a very simple model that tried to estimate the total effort of a project by extrapolating from a single team’s input. We’ve had success with similar models elsewhere, so it seemed like a plausible route here.

How to evaluate the model then becomes the question. With a simple linear model like we were proposing, the first thing we’d look at is the R-squared. Ideally, if the input perfectly predicts the output, your R-sq will be 100%. But since models are never perfect, the R-sq is usually something less. In this case, the best model we had was 25%. The worst model we had produced a negative R-sq! You get a negative R-sq when the model’s error is bigger than the error you’d get by simply guessing the average, in other words, when the model fits worse than no model at all. At this point, using a model to help this organization seemed hopeless. And that’s when Kaiser’s article popped to mind. We didn’t need a perfect model; we simply needed a model that was better than what they were doing today.
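To make the R-sq point concrete, here’s a minimal sketch in Python. The effort numbers are invented (none of the organization’s data appears in this post) and deliberately chosen so the R-sq comes out negative:

```python
import numpy as np

def r_squared(actual, predicted):
    """R^2 = 1 - SS_res / SS_tot; it goes negative when the model's
    residual error exceeds the variation around the plain mean."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ss_res = np.sum((actual - predicted) ** 2)      # error left over after the model
    ss_tot = np.sum((actual - actual.mean()) ** 2)  # error from just guessing the mean
    return 1 - ss_res / ss_tot

# Made-up effort numbers, purely to show the mechanics:
actual_effort    = [120, 340, 200, 510, 280]
model_prediction = [300, 280, 320, 290, 310]
print(r_squared(actual_effort, model_prediction))  # about -0.13: worse than guessing the mean
```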

Although we evaluated the model against various measures of goodness of fit, the real test was whether it outperformed the alternative. In this case the alternative was expert judgement. We already knew that producing an estimate with the model was substantially cheaper than having a whole bunch of teams weigh in. So, could a really terrible model be better than the experts? It turns out the answer is yes. The model outperformed expert judgement about 60% of the time, despite its poor quality by other measures. One could hardly call the model good, but then again, it wasn’t like the experts were any good either.
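Here’s the kind of head-to-head check I mean, sketched with invented numbers that just happen to land at the same 60%: score the model by how often its absolute error is smaller than the experts’.

```python
import numpy as np

def model_win_rate(actual, model_est, expert_est):
    """Fraction of projects where the model's absolute error
    is smaller than the experts' absolute error."""
    actual, model_est, expert_est = map(np.asarray, (actual, model_est, expert_est))
    model_err  = np.abs(actual - model_est)
    expert_err = np.abs(actual - expert_est)
    return np.mean(model_err < expert_err)

# Invented numbers, just to show the comparison:
actual  = [120, 340, 200, 510, 280]
model   = [150, 300, 260, 400, 310]
experts = [ 90, 250, 180, 350, 200]   # chronic under-estimation
print(model_win_rate(actual, model, experts))  # 0.6: "beats the experts 60% of the time"
```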

We have this illusion that when we provide an estimate based on our “expertise,” it is somehow unbiased and of high quality. But referring back to McConnell’s book, we know this is not the case. There’s significant uncertainty in early-stage estimates, so why should we presume that the expert knows something that hasn’t been articulated by anyone? They don’t know what they don’t know. And that’s where models can be helpful. Because the model has no preconceived notions about the work and isn’t going to be optimistic for no reason, it is likely to make as good an estimate as any expert would.

For some reason it reminds me of the saying about being chased by a bear. You don’t need to outrun the bear, just the other person the bear is chasing. And so it goes with models. The model doesn’t have to be perfect to be useful; it simply has to be better than the alternative.

The software prediction “myth”

I got told today that software is unpredictable. That the idea of planning a project is ridiculous because software is inherently unpredictable. Unfortunately, I think the comment stemmed from a misunderstanding of what it means to be predictable.

If you smoke cigarettes your whole life, odds are that you will end up with cancer, heart issues or some other horrid disease. Now, there are people who smoke their entire lives and don’t have any significant ill effects. They die from something else first. And yet, although those people exist, we can say with some certainty that for all those who do end up with a smoking-related disease, it was ‘predictable.’ In the same manner, it’s predictable that if you shoot yourself in the head with a gun you will die, and yet people survive from time to time after having done exactly that.

Second, predictable doesn’t necessarily mean superbly accurate. Weathermen predict the weather and barely ever get it exactly right. But it turns out that over the last decade or so their accuracy has gone way up. They still get things wrong, but compared to the distant past, it’s a reasonable prediction. In fact, some research I’ve seen would put stations like The Weather Channel at 80% or better accuracy (within three degrees of the forecast temperature) over the long run.

To say software isn’t predictable implies that all outcomes are completely random, and yet we know that isn’t the case at all. Even the most diehard agilista will support unit testing of some form, because the outcome of doing unit testing is predictable: you get better quality code. Fair coins, dice and the lottery are unpredictable (and to be fair, there have even been lab studies showing that flipping a coin can be predictable if you control enough of the variables).

If we want to seek to improve our predictions, which is a separate issue from whether software is predictable or not, we have to study the factors and outcomes of projects to establish what matters. But software is predictable; don’t let anyone tell you otherwise.

What data is “not valuable?”

The other day I overheard “let’s get rid of the data that isn’t valuable.” There’s certainly some “data” that isn’t valuable in that it is known to be wrong, but that wasn’t the gist of this conversation. Instead, they were talking about data for which they could find no current use. For example, imagine you were collecting data about people and couldn’t find a relationship between, say, shoe size and heart rate. One might argue that if you were looking for predictors of heart rate, shoe size is no longer valuable data and you should get rid of it.

In a given piece of research that might be true. What if, however, you were collecting data about people (like marketing folks do) to help understand buying habits? What if, right now, you could find no use for their shoe size? It’s taking up space in your database, albeit probably very little. You can’t use it in any of your current models. Should you throw it away?

Not so fast. The frustrating thing about statistics is that just because you don’t see a relationship doesn’t mean there isn’t one. We may not yet understand how to use shoe size in our model… maybe it has a fascinating interaction effect with hand size to predict buying habits? Who knows.
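As a toy illustration of that “who knows” (the data below is entirely simulated; shoe and hand size are just stand-ins), two variables can look worthless on their own and still carry real signal through their interaction:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Simulated customers: neither measurement predicts spending on its own,
# but their product does (an interaction effect).
shoe = rng.normal(0, 1, n)
hand = rng.normal(0, 1, n)
spend = 5 * shoe * hand + rng.normal(0, 1, n)

def r2(features, y):
    """R-squared of an ordinary least squares fit with an intercept."""
    X = np.column_stack([np.ones(len(y)), features])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

print(r2(np.column_stack([shoe, hand]), spend))               # near 0: looks "not valuable"
print(r2(np.column_stack([shoe, hand, shoe * hand]), spend))  # near 1: the interaction finds the signal
```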

The point isn’t really about shoe size and whether it is useful. More generally, if you can get good (by which I mean correct) data on something that you at least guessed might be useful, I’m not so sure you should throw it away just because you haven’t found a use for it yet. Some day you may have a hypothesis about shoe size, and where will you be if you discarded all that data?

Now, if the costs of storing or collecting that data are so onerous that you have to make a choice, by all means, discard away. But just getting rid of information because you don’t know how to use it yet… not so much.

How much cheaper must it be?

How many times have you given an estimate only to have the business partner try to negotiate you down? In my own recollection, pretty much every time I’ve ever submitted an estimate there’s been push back. Now, that’s not to say my estimates are any better than anyone else’s or that my teams are more efficient. Those are questions that, at the time, I didn’t think enough about to collect the data to answer.

But the estimated cost of a project came up today. It was a huge project we were discussing, perhaps several million dollars in total spend. At some point the conversation turned to a small piece of the estimate. It was just ten thousand dollars or so, but we were discussing whether it was the right number. Think about it… In the scheme of several million dollars, what’s ten thousand? It’s less than the likely error in the estimate, that’s for sure.

Which is the point of my post. At what point do you know that a proposed reduction in the estimate is meaningful? If you give a point estimate, you probably don’t have any frame of reference. If you provide a best, likely and worst case estimate, however, you can begin to see how you’d figure that out. If the changes don’t bring the likely cost below the best case cost, you’re probably arguing about estimation error and not a meaningful difference in the scope or scale of the work.
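Here’s that check as a small sketch (the dollar figures are hypothetical): with a three-point estimate in hand, a proposed number only signals a real change in scope when it drops below your best case.

```python
def reduction_is_meaningful(best, likely, worst, proposed):
    """A proposed number above the best-case estimate is probably just
    haggling over estimation error; below it, the scope or scale of the
    work has to have genuinely changed."""
    if not best <= likely <= worst:
        raise ValueError("expected best <= likely <= worst")
    return proposed < best

# Hypothetical three-point estimate, in dollars:
best, likely, worst = 2_500_000, 3_000_000, 4_200_000
print(reduction_is_meaningful(best, likely, worst, proposed=2_990_000))  # False: inside the noise
print(reduction_is_meaningful(best, likely, worst, proposed=2_200_000))  # True: scope must change
```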

From folks like Steve McConnell we know that developers are chronic under-estimators. Why then would you allow yourself to be pushed into an even smaller estimate, particularly when you know you’re likely to come out worse than your likely case anyway? If you’re going to revise your estimate downwards, make sure it’s for a meaningful change in the scope of the work, not just optimism or bullying on the part of the business. In the long run, you’re doing them no favors by caving in when you can’t reasonably deliver for that cost. Now, figuring out how to be more efficient, that’s an entirely different topic.

Never teach a man to fish

The old saying goes “give a man a fish and he eats for a day. Teach a man to fish, and he eats for a lifetime.” Or something like that…

Sometimes it’s not quite so in software. The idea of making someone self-sufficient can create more problems than it solves. If someone comes asking you for data, it seems like it would be easier to teach them how to write and run a SQL query, right? I mean, after all, if they can fish they can learn how to extract new data elements, do interesting joins, and perhaps even discover something you didn’t think of.

But what happens when the lake dries up… Er, I mean, what happens when someone moves the data? Now, instead of having to just repoint an ODBC connection or two, you get emails from dozens of unhappy people who want to know where all the fish have gone. And they don’t just want you to send them data; they now have dozens or hundreds of their own poorly written queries (because they’re amateur coders) built against the way the data used to be organized.

The problem with teaching someone to fish is, you really can’t just teach him to fish. You have to teach him how to repair his fishing pole when it breaks, how to find bait, how to scout fishing locations… things that are related to, but not exactly, fishing, if you want a truly self-sufficient person. Otherwise, what you have is a problem waiting to happen, one that may be more dire than the one you started with, and certainly more annoying to fix.

So, when someone asks for data, you should at least ponder for a minute if they’re really equipped to be a fisherman or if you should just hand over the fish.

On average, customers saved…

How many auto insurance commercials have you seen where the announcer says “customers who switched to [company] saved $350 on average.”  The implication is that if you call as well, you too will save a large amount of money.  It seems like a great deal, but how is it that every company seems to be able to claim this?  Logic would dictate that in order for one company to claim this, its competitors must be more expensive, and significantly so.  If they all claim it, something is amiss.

It could be a lie, since everyone knows that advertising isn’t exactly truthful, but my guess is this is more of a statistical half-truth.  I’d call it a form of survivorship bias.  Let’s say you call up an insurance company and get a quote.  Perhaps, $800 a year.  You then compare that to your current insurance and decide whether to switch companies or not.  If your current insurance is cheaper, well, you don’t switch.  Because, after all, who would pay more for insurance?

Thus, anyone who would lose money doesn’t change companies.  That eliminates roughly half the population from even being considered – note the subtle wording of “customers who switched.”  This isn’t customers who got a quote from the insurance company.  They had to choose to switch, which means they’re already selected for a success story.
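You can see the half-truth in a tiny simulation (all the prices below are invented): give two companies identical pricing, let only the people who would save bother to switch, and the “average savings of customers who switched” comes out comfortably positive anyway.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Invented premiums: both companies draw from the same distribution,
# so neither is actually cheaper on average.
current_price = rng.normal(1000, 200, n)
quoted_price  = rng.normal(1000, 200, n)

savings = current_price - quoted_price
switchers = savings > 0          # only people who would save bother to switch

print(savings.mean())            # about $0: no real advantage
print(savings[switchers].mean()) # around $225 "saved on average", even though neither company is cheaper
```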

It’s a fun little example of how statistics lie without being an outright lie.

Odds are not guarantees

Ever bet on a horse race?  Me neither.  But I’m sure you’re aware of odds.  Odds-makers describe the chances of everything: the chances a horse will win, place, show, and so on.  Horses with long-shot odds pay out better than horses with good odds.  Why?  Because in order to get a big win, you have to bet on a rare event, like a lame old nag winning the race.  Something that nobody predicts is going to happen.

Even when odds-makers give a horse good odds of winning, there is no guarantee that the horse will win.  If there were, betting on horse races would be silly and pointless.  If the outcomes were always known, there would be no big wins to be had.

This is generally true of all statistics.  Statistics refer to the odds of an event.  When some report says “people who smoke have a 150% increased risk of developing cancer” (I’m making those numbers up) they are talking about the odds.  I’m sure you know someone who smoked their entire life and never got cancer.  Fact is, nobody debates that the odds were against them… except the smoker, who says “look at me, I smoked my entire life and nothing bad ever happened to me!”

It may very well be true.  But that person only experienced one lifetime.  If they had to live their life 100 times over, they would develop cancer more often than not.  When we run software projects, the same holds true.  If we understand factors that make project failure more likely (like not having a risk management plan), there is still no guarantee that the project will fail.  In fact, many times it will succeed, and because people remember those successes, they won’t change their behavior.
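To see what “100 times over” means, here’s a toy replay in Python; like the 150% figure above, the risk numbers are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Made-up numbers, in the same spirit as the 150% figure above:
baseline_risk = 0.25                 # hypothetical lifetime risk for a non-smoker
smoker_risk = baseline_risk * 2.5    # "150% increased risk"

lifetimes = rng.random(100) < smoker_risk   # replay the same life 100 times
print(lifetimes.sum(), "of 100 lifetimes develop the disease")
print((~lifetimes).sum(), "lifetimes get away with it entirely")
```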

But (assuming your data is good), over the long run, projects that don’t do risk management will fail more often.  Nobody may notice any one project, but over the long run, ignoring statistical realities means doing harm to your company that you could clearly avoid.

Unfortunately, odds also mean that sometimes you will press someone to take action based on the odds, and they won’t, and things will turn out ok.  They’ll use that data point to say “hah!  See, you were wrong, I didn’t do my risk management plan and the project turned out fine.”

To that you should respond, “odds are not guarantees; we do these things because they improve the odds of success.”  Managing the odds is what good management is about.  Nobody can ever guarantee you a good outcome or a bad one.  There are miraculous results, like people surviving plane crashes, that nobody can explain, when the odds were clearly against them.  There are also unexplained successes, like projects going well when the odds were clearly against them too.  Don’t let an unexpected success dissuade you from taking the odds into account.

In the long run, and that’s what really matters, the odds will catch up to you.

Half will be below average

The funny thing about averages is that, in order to have them, some stuff must be above average and some stuff must be below average.  Assuming a normal distribution, 50% of everything is below average (strictly speaking, half of everything falls below the median, but for a symmetric distribution like the normal the mean and median coincide).  That’s just the way it is.  We tend to be offended by being “below average,” but something has to be, or else you can’t have an average.
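A quick check with made-up quality scores (drawn from a normal distribution, per the assumption above) shows the same thing:

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical quality scores; the specific mean and spread don't matter.
scores = rng.normal(loc=75, scale=10, size=100_000)
print((scores < scores.mean()).mean())  # about 0.50: half of everything is below average
```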

In the world of software quality assurance that means there’s an unfortunate truth.  Half the time, you have to inform your project team that the quality you are seeing is below average.  Guaranteed.  That’s what average means, after all: it’s the central tendency of the data, and for a symmetric distribution it sits in the MIDDLE of the data.  One half will be better and one half will be worse.

In the pursuit of perfection, half the time you are going to have to tell people to do better.  If you don’t, your competition likely will, and then even your best 50% of projects won’t be good enough.  We don’t tend to think that’s what average means.  Instead, we equate average to mean not good, or at best so-so, like “an average dinner out.”  When we say that, we mean we didn’t like it that much.  But, when it comes to software (and dinner), average is the reality.  You can’t avoid it.

So, yes, we may not be happy with being on a below average project, but no matter how good everything gets, that distribution still exists and there is still opportunity for improvement until everything is perfect.  And why shouldn’t you pursue perfection in all aspects – cost, time and quality?  It does mean you’ll always, always be telling some team that they’re below average, but that’s just the way things are.

It’s not chaotic, it’s percent error

Generally, when we talk about correlation, people imagine a nice positive (or negative) relationship between the independent and dependent variable.  It typically looks something like the magenta data set below – for each increase in X, there’s a corresponding increase in Y, plus or minus some random fixed amount of noise.

In software development, we also see the other pattern on this graph – which still has the corresponding increase in Y as X increases, but we see that the random amount of noise seems to get bigger and bigger as X increases.  It results in this “spray” pattern which we tend to conclude means that things are getting out of control or that the relationship isn’t really there.  Otherwise we’d get a nice positive correlation right?

There is another explanation: percent error.  A positive correlation might be expressed Y = 2X + e, where e is some fixed error, let’s say +/-2.  So, if X = 10, Y should fall somewhere between 18 and 22 (2*10 +/- 2).  However, when you have a percentage error, e is some percentage of X.  When X is small, the absolute error of Y is small, and when X is large, the error of Y is large.  If X is 2, then +/- 10% is 0.2, but if X is 200, +/- 10% is 20.  One can quickly see how this would cause a spray-like pattern in your data.
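A quick simulation makes the difference visible (the 2X relationship and the error sizes are just the numbers from the example above); statisticians call the widening spread heteroscedasticity:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(1, 200, 1000)

# Fixed error: Y = 2X + e, with e uniform in [-2, 2]
y_fixed = 2 * x + rng.uniform(-2, 2, x.size)

# Percent error: Y = 2X * (1 + e), with e uniform in [-0.10, 0.10]
y_pct = 2 * x * (1 + rng.uniform(-0.10, 0.10, x.size))

for label, y in [("fixed error  ", y_fixed), ("percent error", y_pct)]:
    resid = y - 2 * x
    print(label,
          "spread at small X:", round(resid[x < 50].std(), 1),
          "spread at large X:", round(resid[x >= 150].std(), 1))
# The fixed-error spread stays flat; the percent-error spread grows with X,
# which is exactly the "spray" pattern described above.
```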

It’s also a common pattern in software development, in my experience.  As you code, a certain percentage of the code you create will be faulty.  It appears chaotic, and it does create a larger variance in the number of defects you’ll get as the project gets larger, but it’s actually not particularly surprising.

Knowing this, you can create models that predict defects using a percent-faulty rate instead of a raw count.  It still won’t make it easier to predict the absolute number of defects you’ll get, but it will at least set the realistic expectation that as project size increases (whether you measure it by function points, lines of code or something else), the number of defects you find will vary more widely than a fixed-error model would suggest.

Standard deviation matters

I was recently reviewing two health plans that I could join.  One was a typical HMO-type plan and the other a high deductible plan.  High deductible plans have been pushed pretty heavily lately, what with the focus on rising health care costs, the recent legislation, etc.  I don’t want to jump into the fray on the political aspects, but I do want to share with you an observation on being risk averse.

The way that a high deductible plan keeps premiums down is by requiring the policy holder to pay everything out of pocket until they reach some limit.  For that reason, high deductible plans are often called catastrophic coverage plans, since they only become helpful if you incur massive costs.  In exchange, you pay very low premiums.  And that’s because you get very little benefit most of the time.  These plans, from everything I’ve read online, are great for young and healthy people, but not so good (obviously) for people with chronic conditions or families.

I am married and have two kids, but I was willing to run the numbers and look at the odds to see whether I should go with a traditional plan or a high deductible plan.  First you have the premiums to pay.  If you use no care at all, then a high deductible plan beats a traditional plan hands down on premiums.  The rest is an odds game.  How many times will you have to visit the doctor?  How many prescriptions will you need?  What are your odds of having a catastrophic illness?  For a traditional plan, each one of these events costs you just a copay, typically $20.  For a high deductible plan, you have to pay the full amount until you reach the yearly limit.  (Never mind the fact that if you have a chronic condition you’ll reach that limit year after year after year.)

Because I’m a data person, I happen to have tons of data on how often we visit the doctor and my current insurer was kind enough to send me a complete history of all charges so I had a reasonable idea of what various things would cost me.  Then, I built two models, one using a traditional plan and the other using a high deductible plan.

Guess what: in the end, the odds were that even with my family (and we make our fair share of doctor visits), I’d likely pay less with the high deductible plan than with a traditional plan.  It wasn’t true in every situation, since the range of potential costs for the high deductible plan overlaps with the range for the traditional plan, but the premiums for a traditional plan are so much higher that they were difficult to overcome.

But there was something else I observed.  My models calculated not only the average result but also the range of possible results.  And the high deductible plan had a much, much higher standard deviation.  That’s to be expected, since each medical event in a high deductible plan costs a lot more out of pocket, so the variability in the result is potentially much greater.
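For what it’s worth, the shape of those models looks roughly like this sketch (every premium, copay, deductible and visit count below is invented; my actual numbers aren’t shown): simulate a year of visits many times and compare the mean and the standard deviation of each plan’s cost.

```python
import numpy as np

rng = np.random.default_rng(4)
sims = 100_000

# Every number here is invented for illustration:
visits = rng.poisson(12, sims)   # doctor visits per simulated year for the family
visit_cost = 180                 # billed cost per visit under the high deductible plan

def traditional(visits):
    premiums, copay = 6_000, 20
    return premiums + copay * visits

def high_deductible(visits):
    premiums, deductible = 2_400, 6_000
    out_of_pocket = np.minimum(visits * visit_cost, deductible)
    return premiums + out_of_pocket

for name, plan in [("traditional    ", traditional), ("high deductible", high_deductible)]:
    cost = plan(visits)
    print(name, "mean:", round(cost.mean()), "std dev:", round(cost.std()))
# In this toy setup the high deductible plan comes out cheaper on average,
# yet its standard deviation is several times larger, which is the point.
```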

Ultimately, that’s why I stuck with a traditional plan.  Yes, I’m very likely to pay more, but the result is predictable, and I like predictable outcomes.  (I also like not having to concern myself with whether it’s worthwhile to go to the doctor.)  If you’re a risk averse person, and if studies about investing are to be believed you are more risk averse than you think you are, then a traditional plan produces a more comfortable financial result.  It’s like buying bonds instead of stocks.  You may not come out way ahead, but it’s much less likely that you’ll come out far behind.

Consider that when designing a process as well.  Consistency is important.  You can end up with a software development process that sometimes produces perfect code, but sometimes produces awful code, or you could have a process which produces consistently average code.  Which one is easier to deal with for continuous improvement?