The problem with any approach to estimation

Despite all the times you may have tried, estimating an IT project always seems to be a losing proposition. Why can’t we seem to get it right? I’ve had the opportunity to try my hand at estimating approaches a number of times, and I believe I’ve come upon one useful insight: any estimating process you define has to account for human behavior.

In IT, the majority of costs associated with any project are likely to be labor. Numerous strategies exist for trying to get at the “right” amount of labor. You can estimate by proxy, by analogy, top down, bottom up, Wideband Delphi… you name it. (For a great resource, check out Steve McConnell’s “Software Estimation: Demystifying the Black Art.”) But I’m going to propose that no typical strategy here will help. The problem isn’t how you arrive at the amount of labor; it’s what happens afterwards.

Across a portfolio of projects, a good estimating process should be unbiased. That is, it should produce estimates that are as likely to be over-estimates as under-estimates. Any single project may still have estimating error, but so long as the process is well centered, your business partners ought to be able to make reasonable assumptions about likely outcomes.

In one situation I worked on, we developed a simple regression model based on historical data to produce estimates for future projects. The model performed well on the historical data used to create it, and it held up when we tested it against other, randomly selected historical projects. Everything seemed to indicate that the model was good and wasn’t overfit to the data used to create it.
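To make the setup concrete, here’s a minimal sketch of that kind of model: a plain linear regression fit on historical projects and checked against a held-out sample. The data and field names (size in function points, team size, actual hours) are made up for illustration; this is not our actual model or data.

```python
# Minimal sketch: fit a linear regression on synthetic "historical" projects
# and check it against a held-out sample. All fields here are hypothetical.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Stand-in historical data: project size, team size, and actual labor hours.
size = rng.uniform(50, 500, 200)
team_size = rng.integers(2, 12, 200)
actual_hours = 6.0 * size + 40.0 * team_size + rng.normal(0, 200, 200)

X = np.column_stack([size, team_size])
X_train, X_test, y_train, y_test = train_test_split(
    X, actual_hours, test_size=0.3, random_state=0
)

model = LinearRegression().fit(X_train, y_train)

# On held-out historical projects the errors look roughly centered...
residuals = model.predict(X_test) - y_test
print(f"median error on held-out projects: {np.median(residuals):.0f} hours")
# ...which says nothing about how people book time once the estimate itself
# starts anchoring behavior on future projects.
```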

However, a year after we implemented the model, new, real-world projects it had estimated showed a bias towards being underestimated. What was happening? Was our process flawed? Yes and no.

If we were dealing with a process that didn’t involve humans, it probably would have worked fine. But because humans are involved, I’m going to propose that any estimate you create, regardless of the process, will have one of two characteristics: either it will be biased towards underestimation, or it will be so outrageously large that nobody will want to do business with you. Here’s why…

When you estimate a project, you create an effect called anchoring. By stating how much effort the project will take, your teams will align resources to the expectation of that effort. During the life cycle of the project there will be days, weeks, or months when individuals are more or less busy. When they are busier, they will book time equal to the time they worked. But when they are less busy, because they have been aligned to the project and likely have nothing else to do, they will book hours to the project anyway. For estimates to be unbiased against actual outcomes, the light times must counterbalance the busy times. However, in order to get paid during the light times, the humans (this is where they mess it all up) still have to book time to your project. So the offsetting light times never materialize, and the process ends up biased towards underestimating.

The problem that follows gets even worse. If you use the actual outcomes from these biased results as feedback into your estimate process, it will cause an inflationary effect. Future estimates will be made larger to account for the appearance of under estimating and the process will repeat on subsequent projects. The result will spiral until IT becomes so expensive the business starts looking elsewhere.

It’s grim, but I believe there’s an answer, and it lies in how (frustrating as it may be) the business often treats IT already. Rather than making estimates larger in response to the data, you should adjust your estimating process to make them smaller! I know, this sounds crazy, but hear me out. Let’s say you find your projects are 10% underestimated at the median. Adjust your process to make estimates smaller, say by 5%, and then review the results. If projects are still 10% underestimated, the underestimation you were seeing was likely the result of this waste effect. Continue to shrink your estimates until the underestimation error starts to grow. At that point, you have likely squeezed out the bias caused by light times versus busy times, and the underestimates you are now seeing are the result of actually allocating too little effort to the project. Simply undo that last shrink to get back to your original 10% (or whatever it was) bias, and set your process there. Sure, you’ll always be a little biased towards underestimating projects, but that’s better than an ever-bloating IT budget.
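Here is a rough sketch of that tuning loop in code, under my own interpretation of the procedure. The observe_cycle callback, the 5% step, and the simulated numbers are hypothetical stand-ins; in practice each “cycle” would be a real review period of completed projects.

```python
def tune_shrink_factor(observe_cycle, step=0.05, start=1.0):
    """observe_cycle(shrink) should run a review cycle with that shrink factor
    applied to every estimate and return the median under-estimate observed."""
    shrink = start
    bias = observe_cycle(shrink)
    while shrink - step > 0:
        candidate = round(shrink - step, 2)
        new_bias = observe_cycle(candidate)
        if new_bias > bias:      # error grew: we finally cut into real effort
            return shrink        # undo that last shrink and settle here
        shrink, bias = candidate, new_bias
    return shrink

def simulated_cycle(shrink, waste_bias=0.10, true_floor=0.80):
    """Toy stand-in for a real review cycle: while shrunk estimates stay above
    the 'true' effort floor, work expands to fill them and the apparent
    under-estimate stays flat; below the floor it genuinely grows."""
    return waste_bias if shrink >= true_floor else waste_bias + (true_floor - shrink)

print(tune_shrink_factor(simulated_cycle))  # settles at 0.8 in this toy example
```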

Cargo cults in IT

The end of WW II gave rise to a striking example of a causality problem. On small, remote islands, indigenous people encountered large numbers of military servicemen for the first time. As we, and others, landed ships on these islands, cleared jungles for runways, erected control towers, and ran military drills, the islanders came to associate newfound well-being with the cargo our servicemen brought with them.

When the war ended, the military packed up their temporary bases and took their cargo planes with them. And with them went the newfound wealth of the native people. So, what did they do?

Well, they replicated the actions they saw taking place. They cut airstrips out of the jungle. They erected towers similar to the landing towers they saw. They executed military style drills. They carved wooden headsets to wear like the ones they saw servicemen wearing.

Since they didn’t understand the causality of the entire event (a war that led to the need for new bases, which led to airfields and cargo planes), they figured that if they recreated the parts they observed, they’d get the same outcome. Cargo planes ought to land and bring cargo. Of course, it doesn’t work that way, which is what makes the idea of a ‘Cargo Cult’ so interesting. We clearly see the logical problem in that situation.

When it comes to the corporate environment, however, we often fail to see our own Cargo Cult behaviors. We observe a great company, say Google, and see that they have bicycles on their campus, so we buy bicycles for our campus. Or Facebook runs hackathons, so we start doing them too. But buying bicycles or running hackathons is not going to make your company like those companies. You are simply emulating behaviors that look like theirs without understanding the underlying culture that causes them to do these things, and as a result you’re likely to get disappointing results from the mimicry. These companies aren’t successful because they engage in these behaviors; they are successful first, and therefore may engage in these behaviors.

Which brings me to another point. We often look at older companies that fail, observe that they didn’t engage in these behaviors, and use that as evidence that these behaviors are necessary to survive. But we can learn something from Mark Buchanan’s ‘The Social Atom’ on this point. In his book he demonstrates a ridiculously simple model that predicts the rise and fall of organizations based simply on time and getting too big. As I recall, you don’t need to model any behavioral factors to get the effect. So, probabilistically, large old companies will decline even if their behaviors hold steady. There will always be companies coming and going, and we will always be able to be selective and say, “See, that company didn’t do X and they failed. We need to do X.”

If you find yourself saying that in the future, just remember that you may now be a card carrying member of a cargo cult.

What if static analysis is all wrong?

I just got back from a meeting with one of my former college professors. I’ve kept in touch because the academic world and its research have much to teach us about how to operate in the business world. For one, without the financial pressures, academia is free to explore some crazier ideas that may one day create value.

In this recent meeting we were discussing static analysis and machine learning. Static analysis has proven frustrating in some of my own analysis, since it shows no evidence of predictive power over the outcomes we care about: defects the user would experience and team productivity. And yet we keep talking about doing more static analysis. Is it that the particular tool is bad, or is the idea fundamentally flawed in some way?

What turned out to be a non-event for machine learning might be an interesting clue to the underlying challenges with static analysis. This particular group does research on genetic programming; essentially, they are evolving software to solve problems, which is valuable in spaces where the solution isn’t well understood. In this particular piece of research the team was trying to see if modularity would help solve problems faster. That is, if the programs could evolve and share useful functions, would problems be solved more easily? The odd non-event was that it didn’t seem to help at all. No matter how they biased the experiments, the evolution of solutions preferred copying and tweaking code over using a shared function. Although the team didn’t look into it much, they suspect that modularity actually creates fragility in software. That is, if many subsystems use a single function and that function is changed, the ripple effects may be disastrous; if there are many copies of the function and one is changed, the impact is much smaller. One might argue that this could apply to human-created code as well: perhaps making code more modular and reusable only pays off under certain circumstances. If true, it would fly in the face of what we know about writing better software. And importantly, it would quickly devalue what static analysis tools do, which is push you towards a set of commonly agreed upon (but possibly completely wrong) rules.
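This is not the researchers’ setup, but a toy model makes the fragility argument easy to see: one risky change to a helper shared by a dozen callers can break several of them, while the same change to a private copy can break at most one. The numbers (twelve callers, a 20% chance any change breaks a caller) are invented purely for illustration.

```python
# Toy model of the "shared function vs. copies" fragility argument.
import random

def broken_callers(callers, p_break, shared=True):
    """Number of callers broken by one change, under the toy assumptions."""
    if shared:
        # One change to the shared helper; each caller may break independently.
        return sum(random.random() < p_break for _ in range(callers))
    # One change to one private copy: at most that single caller breaks.
    return int(random.random() < p_break)

random.seed(1)
trials = 10_000
shared_avg = sum(broken_callers(12, 0.2, shared=True) for _ in range(trials)) / trials
copied_avg = sum(broken_callers(12, 0.2, shared=False) for _ in range(trials)) / trials
print(f"avg callers broken per change: shared={shared_avg:.2f}, copied={copied_avg:.2f}")
```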

What can a snowblower tell us about software?

If you’re in the northeastern United States, you’re probably thinking about snow right now. And if you’re responsible for clearing the snow from your drive or walkways, you might also be all too familiar with time behind a snowblower. For years I hand-shoveled my walkways, but when we moved to this new house they were simply far too long for that.

It takes me about an hour to do all the clearing I am responsible for, so that’s a lot of time to think, which isn’t necessarily a bad thing. This particular snow is the deepest we’ve had yet. My snowblower has three forward speeds on it and presumably you use a slower speed when you have more snow to clear. The slower speed allows the auger to clear the snow before it gets all backed up.

So, as I was clearing the drive, I noticed something. Even at the lowest speed there was enough snow that some of it was being simply pushed out of the way by the blower. That meant that I’d have to do clean-up passes just to get that little bit of snow that the blower wouldn’t get on the first pass. And that got me to thinking. What if I just went faster? After all, if I was going to have to make a second pass anyway, who cares if it’s a tiny bit of snow or a little bit more?

And that got me to thinking about software. One approach might be to take it slow and carefully, but if you’re going to create bugs anyway, then perhaps going that slow isn’t the right answer. You’re still going to need the clean-up pass so you might as well let it happen and just clean up a bit more, right?

That sort of makes sense, if you think a second pass over the code is as effective as a second pass with the snowblower. In terms of dealing with snow, the blower is relentless. If it goes over the same ground twice, it will do so with the same vigor as before. Testing, on the other hand, is imperfect. Each pass only finds about 35-50% of the defects (tending towards the 35% end). It isn’t like a snowblower at all. If you push aside a relatively big pile of snow with a snowblower, it’ll get it on the second go. If you create a big pile of bugs in the code on your first go, a round of testing will likely reduce the pile by less than half. Then you need another pass, and another, just to get to an industry-average 85%.
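The arithmetic behind that is worth spelling out. Assuming each test pass removes a fixed fraction of whatever defects remain (the 35-50% rule of thumb above, not data from any particular project), here is roughly how many passes it takes to reach 85% total removal:

```python
# Rough arithmetic: passes needed if each pass removes a fixed fraction
# of the remaining defects. Rates are the rule-of-thumb range cited above.
def passes_to_reach(target_removal, per_pass):
    remaining, passes = 1.0, 0
    while 1.0 - remaining < target_removal:
        remaining *= (1.0 - per_pass)
        passes += 1
    return passes

print(passes_to_reach(0.85, per_pass=0.35))  # 5 passes at 35% per pass
print(passes_to_reach(0.85, per_pass=0.50))  # 3 passes at 50% per pass
```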

There’s one other thing about going too fast that I learned with my snowblower. Sometimes you get a catastrophic failure. In my case, going too fast broke the shear pin on the auger. It’s a safety feature to prevent damage to the engine, but it does make it impossible to keep moving snow. And software is a bit like that too. Go too fast and you may introduce a major issue that you’ll spend a whole lot of time cleaning up. It is not all about speed.

Want something fixed? Give it to someone who doesn’t want the job

I recently had a great experience with simplifying IT processes. Due to recent org changes, a home-grown product (a ton of spaghetti code) got turned over to a new team. The thing was, the new team’s job wasn’t a support role, and they didn’t particularly relish coding. Over time, the organization had become dependent on the product, although people suspected that was partially because they didn’t know any better: there were off-the-shelf tools that could do the same job.

Well, it turns out that if you want to get rid of some job, the best people to give it to might be the people who don’t want it in the first place. If you’re content building and maintaining a bunch of spaghetti code, and I give you an additional product, you’re likely to keep maintaining that one too. But if you don’t particularly like coding, you are going to try to avoid doing it. And one of the best ways to avoid doing something is to get rid of it.

In fact, this is exactly what happened. The team, which had no desire to fix everyone’s problems, figured out how to replace the home-grown product in just a few months. For years the organization had been told it couldn’t be replaced. The difference? The old team was content to maintain it, perhaps even proud of it. The new team had nothing invested in it and didn’t want to.

“I don’t think necessity is the mother of invention. Invention . . . arises directly from idleness, possibly also from laziness. To save oneself trouble.”
― Agatha Christie, An Autobiography

Tom DeMarco talks ethics

I never took ethics in college. I also never planned on attending a conference to hear a talk on ethics. After all, ethics were sort of a base assumption from my perspective, not given much thought beyond large companies having employees sign a code of ethics at some regular interval.

In fact, I thought I was attending a talk on decision making, and thus expected something about decision theory, game theory, maybe psychology. Certainly not ethics. But Tom took me from “where is he going with this?” to “wow!” I’ll attempt to do it justice, but I can’t promise that I will. To the best of my ability, here’s what I took away:

Aristotle believed that ethics were logically provable. Metaphysics contains all the things we can know about the world. Epistemology is built on that: everything we can derive from what we know about the world. For example, Socrates is a man; all men are mortal; therefore Socrates is mortal. Ethics, Aristotle “promised,” were provable with logic. For something like 2,400 years all kinds of philosophers tried to make good on this promise and were unable to. At some point someone, David Hume perhaps, classified metaphysics as “is” statements and ethics as “ought” statements, and argued that it is impossible to derive an ought statement from an is statement.

Along comes Alasdair MacIntyre. He argues that if something is defined by its purpose (this is a watch, for example), then the ought statements follow naturally. What ought a watch do? It ought to tell good time. So that raises the question: what is the purpose of man?

We go back to Aristotle. Aristotle also created a mechanism for defining things. His definition requires that you group something and then differentiate it. So a definition for man might be “an animal with the ability to speak.” That’s an is statement, for sure, but by MacIntyre’s requirements it doesn’t define man’s purpose. MacIntyre goes on to define man as an animal that practices; creating practices becomes man’s purpose. A practice is something we attempt to systematically extend. Tennis, for example, is a practice. Current tennis players have extended the sport in new and interesting ways, such that although famous tennis players of yore would recognize the sport, they probably couldn’t compete anymore (even if they were young enough) because the sport has been systematically extended.

So, if that’s right, that man is an animal who practices, then for each practice we create, the oughts follow naturally. If you are a software developer and your manager comes to you and says “we need to take three months off this project” what ought you do? Well, first you ought to be honest – cutting the timeline will hurt the quality of my practice. Second, you must be just – quality matters to our customer, and we can’t deliver poor quality. It’s a disservice. And lastly, you must be courageous – go back in there and tell them no!

How many times has one of our employees, by this framework, acted ethically and we viewed it as a problem? Far too many times, I’d guess. The person with ethics, who values his or her practice and whose ought statements are clear, can be frustrating. But viewed through the lens of Tom DeMarco’s talk, suddenly what they’re doing makes a lot of sense.

Do something counterintuitive

A post over at one of my favorite management blogs reminds me of my own recent experience with going for it on fourth down. Recently I’ve been working on a project to improve estimating. It’s not uncommon to hear that estimates should be created by those doing the work. Indeed, if a random person unfamiliar with the ins and outs of your system (namely management) estimates a project for you, odds are it’s going to be bad. But we can take it one step further: what if there were evidence that even when the person doing the work makes the estimate, you should override that judgment with a model instead?

Steve McConnell notes in his book on estimating that various experiments have shown developers to be eternal optimists. One way he suggests to correct for this is simply to make estimates larger. Unfortunately, when the evidence shows you have a bias, your gut call on fourth down, so to speak, isn’t going to be the right one. In our own research, a model helped compensate for human fallibility. We still got an estimate from the developer, but when we combined their data with historical information in a model, we got an outcome that outperformed expert judgment alone 65-80% of the time. That’s not perfect, but it’s surely better than no model at all.
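For flavor, here is a sketch of the “expert plus model” idea on synthetic data. It is not our actual model; the feature names and the optimism bias baked into the fake experts are assumptions. The point is only that a regression can learn how much to trust the human estimate when it is combined with historical attributes.

```python
# Sketch: treat the developer's estimate as one more input feature alongside
# historical project attributes, and let a regression learn how to weight it.
# All data here is synthetic and the field names are hypothetical.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)
n = 300
size_fp = rng.uniform(50, 500, n)
true_hours = 6.0 * size_fp + rng.normal(0, 150, n)
# Optimistic experts: systematically low, plus noise.
expert_estimate = 0.8 * true_hours + rng.normal(0, 100, n)

X = np.column_stack([expert_estimate, size_fp])
model = LinearRegression().fit(X[:200], true_hours[:200])

combined = model.predict(X[200:])
expert_only = expert_estimate[200:]
wins = np.mean(np.abs(combined - true_hours[200:]) < np.abs(expert_only - true_hours[200:]))
print(f"combined model beats expert-only estimate on {wins:.0%} of held-out projects")
```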

We always want to believe in the greatness of the human mind to make decisions and in a massive number of cases we don’t know a better system, but as Curious Cat points out, sometimes the evidence isn’t what you’d expect it to be at all.

The difference between measure and incentivize

Afraid of taking measurements of your organization because of the behaviors they might create? Don’t be. Measurement alone isn’t harmful to the organization, and understanding how your organization works can be very useful. Oftentimes the outcomes we want aren’t directly controllable. We want more sales, but you can’t just say to your team “make more sales” and actually expect it to happen. On the other hand, with internal measures of performance, people often can just make the numbers look better.

Tell people to be more productive, and if you measure productivity by counting lines of code, you’ll get undesirable behaviors like excess code and a resistance to optimizing existing code. If, however, you figure out which behaviors naturally cause more code to get produced, you can encourage or direct those behaviors in your organization.

For example, I frequently measure organizational productivity as function points per hour. That’s measurement. If I simply tell folks to make our productivity number go up, that’s incentivizing. If instead I identify the behaviors that matter, I can continue to measure and understand what matters without it becoming an incentive that breeds bad behavior – like making your coworkers inefficient so you appear more efficient.

Measure, but don’t directly incentivize people on the outcome you want. Figure out the behaviors that matter and focus there. The outcomes will follow.

Is variability in software development good?

I myself wrote an article not too long ago on the subject of variability in process. Under some circumstances I think variability might be desirable, although I wasn’t particularly referring to software development. Last week I attended a webinar hosted by ITMPI and given by one of the employees at QSM. His talk was about the measurement of productivity in IT, specifically focused on how to account for variations in productivity when estimating. The talk was pretty good, but one of his early slides bothered me.

On one early slide he argued that software development isn’t like manufacturing and that productivity therefore can’t be measured the way manufacturing measures it. Unfortunately, he offered no alternative during the talk. Instead he focused on measuring unexplained variation in project outcomes and aggregating it into a somewhat vague productivity calculation. On the whole, that’s useful for estimating if you just want to know the odds of a certain effort outcome, but not so useful if you want to learn about what factors impact productivity.

It’s true that software development doesn’t have a lot in common with manufacturing, and the analogies drawn are often strained. That doesn’t concern me much, because the spirit of what management is asking is right: what choices can I make to do this faster, better, or cheaper? In that context, productivity isn’t just something you find out about after the fact; it’s something you want to understand.

With my own research, we’ve found measurable evidence that certain activities do make a difference in productivity. Colocation is worth about 5% better productivity. Committing resources is worth about 0.4% for every additional 1% of average team commitment to a project. Which gets back to the question I posed in the title: is variability good?

In short, no. But the longer answer is that, just like any process, you have to know what to control. Even with a highly thought-intensive process, there are things you can and should seek to control to produce more predictable outcomes. It is true that software development is more like the design of a product than the manufacture of one, but that doesn’t mean anything goes.
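If you want to use numbers like those as adjustment factors, a back-of-the-envelope calculation might look like the sketch below. Only the 5% and 0.4%-per-1% figures come from the research above; the example commitment change and the choice to combine the effects multiplicatively are my own assumptions.

```python
def productivity_multiplier(colocated, commitment_increase_pct):
    """Productivity multiplier relative to a baseline team, combining the two
    effects multiplicatively (an assumption, not part of the findings)."""
    m = 1.05 if colocated else 1.0              # ~5% for colocation
    m *= 1.0 + 0.004 * commitment_increase_pct  # ~0.4% per extra 1% commitment
    return m

# e.g. colocating the team and raising average commitment from 60% to 80%
print(productivity_multiplier(colocated=True, commitment_increase_pct=20))  # ~1.13
```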

When do I get to be a “trusted partner”?

If I had a nickel for every time I’ve heard someone in IT say something along the lines of “we want to be a trusted partner,” I’d be wealthy. If I had a nickel for every time a business person said it, I’d be broke. Becoming a trusted partner seems to be something that IT is obsessed with, but not so much on the business side.

I do think that being trusted is important, but no matter how much you talk about it, it will never be granted to you. It must be earned. While I can’t tell you how to earn it, I can give you a simple example of how it could be earned.

I’m not an auto mechanic, so I am forced to trust the guy at the shop to take care of my car. Because of the information inequality – he knows way more than I do about cars – I am always suspicious of his motives. After all, he is in the position to diagnose my problem and then make money on me by fixing the supposed problem. Here’s a person I inherently distrust. Sort of sounds like IT as well to me…

One day I took my car in because I swore something was wrong with it. I was mentally prepared to pay for new brakes. So when they put it up on the lift, told me the brakes were fine, and then didn’t charge me, I started to see the shop in a different light. And it wasn’t just once that they didn’t push unnecessary work on me, but over several visits. Usually I was just in for an oil change, and since I was there I’d ask about something else. Time and time again they probably could have fleeced me and didn’t.

After that, I trusted them to tell me when things were wrong, and I was more willing to have the work done. That’s what establishes trust. It’s not doing as you are asked, even if you do it cheaply. It’s not suggesting all kinds of new and shiny things you could do. It starts with doing something that is truly in the customer’s best interest, in a way they can see. Sure, a fancy architecture might be in their best interest over the really long run, but your customer isn’t going to sense that.

Save all that for later, when they’re finally listening to you. Start off by demonstrating a willingness to solve their immediate problems, to save them money and time, and to help them avoid unnecessary work and you will have a much better chance of becoming trusted by the business. Continue to pretend you know better and you can just keep talking about becoming one.