The problem with any approach to estimation

Despite all the times you may have tried, estimating an IT project always seems to be a losing proposition. Why can’t we seem to get it right? I’ve had the opportunity to try my hand at estimation a number of times, and I believe I’ve come upon one useful insight: any estimating process you define has to take human behavior into account!

In IT, the majority of costs associated with any project are likely to be labor. Numerous strategies exist for trying to get at the “right” amount of labor. You can estimate by proxy, by analogy, top down, bottom up, Wideband Delphi… you name it. (For a great resource, check out Steve McConnell’s “Software Estimation: Demystifying the Black Art.”) But I’m going to propose that no typical strategy here will help. The problem isn’t how you arrive at the amount of labor; it’s what happens afterwards.

Across a portfolio of projects, a good estimating process should be unbiased. That is, it should produce estimates that are as likely to be overestimates as underestimates. Any single project may still have estimating error, but so long as the process is well centered, your business partners ought to be able to make reasonable assumptions about likely outcomes.

In one situation I worked on, we developed a simple regression model from historical data to predict effort on future projects. When the model was created from historical data, it performed really well. Even when we tested it against other randomly selected historical projects, it performed well. Everything seemed to indicate that the model was good and wasn’t overfit to the data used to create it.
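
For illustration, here’s a minimal sketch of the kind of model and holdout check involved. The sizing input, coefficients, and data below are all invented, but the shape of the exercise is the same: fit a simple regression on historical projects, then confirm the errors still look centered on projects held out of the fit.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical historical data: one early sizing input per project and the
# actual effort hours each project ultimately consumed.
rng = np.random.default_rng(42)
sizing_input = rng.uniform(100, 2000, size=200).reshape(-1, 1)
actual_hours = 3.5 * sizing_input.ravel() + rng.normal(0, 300, size=200)

# Hold out some historical projects to test the fit, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    sizing_input, actual_hours, test_size=0.3, random_state=0)

model = LinearRegression().fit(X_train, y_train)
print("Holdout R^2:", model.score(X_test, y_test))

# The bias check that matters later: on held-out projects the residuals
# should be centered near zero (as likely over- as under-estimated).
residuals = y_test - model.predict(X_test)
print("Median residual (hours):", np.median(residuals))
```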

However, a year after the model went into use, the new real-world projects it had estimated showed a bias towards being underestimated. What was happening? Was our process flawed? Yes and no.

If we were dealing with a process that didn’t involve humans, it probably would have worked pretty well. However, because humans are involved, I’m going to propose that any estimate you create, regardless of the process, will have one of two characteristics: either it will be biased towards underestimation, or the estimates will be so outrageously large that nobody will want to do business with you. Here’s why…

When you estimate a project you create an effect called anchoring. By stating how much effort the project will take, your teams will align resources to the expectation of that effort. On some days, weeks, or months during the life cycle of the project, individuals will be more or less busy. When they are busier, they will book time equal to the time they worked. But when they are less busy, because the resource has been aligned to the project and likely has nothing else to do, they will still book hours to the project. For the estimate to be unbiased versus actual outcomes, the light times must counterbalance the busy times. However, in order to get paid during the light times, the humans (this is where they mess it all up) still have to book time to your project. Thus the offsetting light times never come to fruition, and the estimate ends up biased towards always being an underestimate.
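
Here’s a toy simulation of that booking dynamic. Every number below is invented, but it shows why the light weeks never offset the busy ones.

```python
import numpy as np

rng = np.random.default_rng(7)
allocated_per_week = 40                    # hours the estimate aligned per person
actual_work = rng.normal(40, 10, size=26)  # real demand varies week to week

# If light weeks were booked honestly, busy and light weeks would offset
# and actuals would land near the estimate.
honest_total = actual_work.sum()

# But people aligned to the project bill the full allocation even in light weeks.
booked_total = np.maximum(actual_work, allocated_per_week).sum()

estimate = allocated_per_week * len(actual_work)
print(f"Estimate: {estimate:.0f}  Honest actuals: {honest_total:.0f}  "
      f"Booked actuals: {booked_total:.0f}")
# Booked actuals exceed the estimate, so the project looks underestimated.
```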

The problem gets even worse from there. If you feed the actual outcomes from these biased results back into your estimating process, it causes an inflationary effect: future estimates are made larger to account for the apparent underestimation, and the process repeats on subsequent projects. The result spirals until IT becomes so expensive that the business starts looking elsewhere.

It’s grim, but I believe there’s an answer, and it lies in how (frustrating as it may be) the business often treats IT already. Rather than making estimates larger in response to the data, you should adjust your estimating process to make them smaller! I know, this sounds crazy, but hear me out. Let’s say you find your projects are 10% underestimated at the median. Adjust your process to make estimates smaller, say by 5%, and then review the results. If projects are still 10% underestimated, the underestimation you were seeing was likely the result of this waste effect. Continue to shrink your estimates until the underestimating error starts to grow. At that point, you have likely squeezed out the bias caused by light times versus busy times, and the underestimates you now see are the result of actually allocating too little effort to the project. Simply undo the last bit of shrinking to get back to your original 10% (or whatever it was) bias, and set your process there. Sure, you’ll always be a little biased towards underestimating projects, but that’s better than an ever-bloating IT budget.
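
Here’s a rough sketch of that adjustment loop, assuming you can periodically compare estimates to actuals after each tweak. The simulated bias function, step size, and thresholds are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)

def median_underestimate(shrink):
    """Simulated stand-in for 'review the results': the median under-estimation
    error observed after shrinking estimates by `shrink`. In this toy, light-time
    padding absorbs shrinkage up to ~15%; past that, projects are genuinely
    under-allocated and the error climbs."""
    excess = max(0.0, shrink - 0.15)
    return 0.10 + 2.0 * excess + rng.normal(0, 0.003)

step = 0.05
shrink = 0.0
baseline = median_underestimate(0.0)   # the original ~10% bias

# Keep shrinking while the observed bias holds steady; once it starts to grow,
# stop (i.e., undo the last step) and run the process there.
while median_underestimate(shrink + step) <= baseline + 0.01:
    shrink += step

print(f"Settle on estimates shrunk by about {shrink:.0%}")  # ~15% in this toy
```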

What are you evaluating your model against?

I enjoy reading Kaiser Fung’s blogs (junkcharts and numbersruleyourworld). One entry in particular caught my attention because it was relevant to some work I had been doing recently.

We were working on a model to improve project estimation for an organization. In this situation, the group was exerting a lot of effort to deliver an estimate. There are a lot of players involved in a project, and each team wants a chance to weigh in on its part. In fact, in Software Estimation: Demystifying the Black Art, Steve McConnell notes that the best person to give an estimate is the person doing the work. But because of the huge backlog of items needing estimation, the organization was drowning in estimates and not getting anything done. They wanted to know if a model could improve the process.

So we collected some data and constructed a very simple model that attempts to estimate the total effort of a project by extrapolating from a single team’s input. We’ve had success with similar models elsewhere, so it seemed like a plausible route here.

How to evaluate a model then becomes the question. With a simple linear model like the one we were proposing, the first thing we’d look at is the R-squared. Ideally, if the input perfectly predicts the output, your R-squared will be 100%. But since models are not perfect, it is usually something less. In this case, the best model we had was 25%. The worst model we had produced a negative R-squared! You get a negative R-squared when the model’s error is bigger than the error of simply predicting the mean. At this point, using a model to help this organization seemed hopeless. And that’s when Kaiser’s article popped to mind. We didn’t need a perfect model; we simply needed a model that was better than what they were doing today.
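
To make the negative R-squared point concrete, here’s a tiny sketch with made-up numbers. R-squared goes negative whenever the model’s squared error is larger than the squared error of just predicting the mean.

```python
import numpy as np
from sklearn.metrics import r2_score

actual = np.array([100, 250, 400, 800, 1200])    # actual effort (hypothetical)

good_fit = np.array([120, 230, 420, 760, 1250])
bad_fit  = np.array([600, 600, 600, 100, 2500])  # worse than guessing the mean

print(r2_score(actual, good_fit))  # close to 1.0
print(r2_score(actual, bad_fit))   # negative: model error > error of the mean
```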

Although we evaluated the model against various measures of goodness of fit, the real test was whether it outperformed the alternative. In this case the alternative was expert judgement. We already knew that producing an estimate with a model was substantially cheaper than having a whole bunch of teams weigh in. So, could a really terrible model be better than the experts? It turns out the answer is yes. The model outperformed expert judgement about 60% of the time, despite its poor quality by other measures. One could hardly call the model good, but then again, it wasn’t as if the experts were any good either.
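
The head-to-head comparison itself is simple to compute. Here’s a sketch with hypothetical figures: count how often the model’s estimate lands closer to the actual outcome than the experts’ did.

```python
import numpy as np

# Hypothetical effort figures, in hours, for completed projects.
actual = np.array([400, 900, 1500, 300, 2200, 750])
model  = np.array([500, 700, 1400, 450, 1800, 900])
expert = np.array([300, 600, 1000, 500, 1500, 650])

model_err  = np.abs(model - actual)
expert_err = np.abs(expert - actual)

win_rate = np.mean(model_err < expert_err)
print(f"Model beats expert judgement on {win_rate:.0%} of projects")
```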

We have this illusion that when we provide an estimate based on our “expertise,” it is somehow unbiased and of high quality. But referring back to McConnell’s book, we know this is not the case. There’s significant uncertainty in early-stage estimates, so why should we presume that an expert knows something that hasn’t been articulated by anyone? They don’t know what they don’t know. And that’s where models can be helpful. Because the model has no preconceived notions about the work and isn’t going to be optimistic for no reason, it is likely to make as good an estimate as any expert would.

For some reason it reminds me of the saying about being chased by a bear: you don’t need to outrun the bear, just the other person the bear is chasing. And so it goes with models. The model doesn’t have to be perfect to be useful; it simply has to be better than the alternative.

Cargo cults in IT

The end of WWII gave rise to a striking example of a causality problem. On small, remote islands, indigenous people encountered large numbers of military servicemen for the first time. As we, and others, landed ships on these islands, cleared jungles for runways, erected control towers, and ran military drills, the islanders observed newfound well-being in the cargo our servicemen brought with them.

When the war ended, the military packed up their temporary bases and took their cargo planes with them. And with it went the newfound wealth of the native people. So, what did they do?

Well, they replicated the actions they saw taking place. They cut airstrips out of the jungle. They erected towers similar to the landing towers they saw. They executed military style drills. They carved wooden headsets to wear like the ones they saw servicemen wearing.

Since they didn’t understand the causality of the entire event – a war that created the need for new bases, which led to airfields and cargo planes – they figured that if they recreated the parts they observed, they’d get the same outcome. Cargo planes ought to land and bring cargo. Of course, it doesn’t work that way, which is what makes the idea of a “cargo cult” so interesting. We clearly see the logical problem in that situation.

When it comes to the corporate environment, however, we often fail to see our own cargo-cult behaviors. We observe a great company, say Google, and we see that they have bicycles on their campus, so we buy bicycles for our campus. Or Facebook runs hackathons, so we start to do so too. But buying bicycles or holding hackathons is not going to make your company like those other companies. You are simply emulating behaviors that look like the other company without understanding the underlying culture that causes them to do these things. As a result, you’re likely to get disappointing results from the mimicry. These companies aren’t successful because they engage in these behaviors; they are successful first, and therefore may engage in these behaviors.

Which brings me to another point. We often look at older companies that fail, observe that they didn’t engage in these behaviors, and use that as evidence that the behaviors are necessary to survive. But we can learn something from Mark Buchanan’s ‘The Social Atom’ on this point. In his book he demonstrates a ridiculously simple model that predicts the rise and fall of organizations based simply on time and getting too big. As I recall, you don’t need to model any behavioral factors at all to get the effect. So, probabilistically, large old companies will decline even if their behaviors are held steady. There will always be companies coming and going, and we will always be able to be selective and say, “See, that company didn’t do X and they failed. We need to do X.”

If you find yourself saying that in the future, just remember that you may now be a card carrying member of a cargo cult.

What if static analysis is all wrong?

I just got back from a meeting with one of my former college professors. I’ve kept in touch because the academic world and its research have much to teach us about how to operate in the business world. For one, without the financial pressures, academia is free to explore some crazier ideas that may one day create value.

In this recent meeting we were discussing static analysis and machine learning. Static analysis has proven frustrating in some of my own analyses, since I’ve found no evidence it has predictive power over the outcomes we care about – defects the user would experience and team productivity. And yet we keep talking about doing more static analysis. Is the particular tool bad, or is the idea fundamentally flawed in some way?

What turned out to be a non-event for machine learning might be an interesting clue to the underlying challenges with static analysis. This particular group does research on genetic programming. Essentially, they are evolving software to solve problems, which is valuable in spaces where the solution isn’t well understood. In this particular piece of research, the team was trying to see if modularity would help solve problems faster. That is, if the programs could evolve and share useful functions, would that cause problems to be more easily solved? The odd non-event was that it didn’t seem to help at all. No matter how they biased the experiments, the evolving solutions preferred copying and tweaking code over using a shared function. Although the team didn’t look into it much, they suspect that modularity actually creates fragility in software: if many subsystems use a single shared function and that function is changed, the ripple effects may be disastrous, whereas if many copies of the function exist and one is changed, the impact is much smaller. One might argue that this could apply to human-created code as well. Perhaps making code more modular and reusable pays off only under certain circumstances. If true, it would fly in the face of what we think we know about writing better software. And importantly, it would quickly devalue what static analysis tools do, which is push you towards a set of commonly agreed upon (but possibly completely wrong) rules.

What does your dashboard look like?

On my drive today I was thinking about my car’s dashboard. I drive a relatively modern car, so the dashboard is pretty simple – engine temperature, speed, tachometer, and fuel gauge. There’s not a lot to it. Looking at it reminded me, for some reason, of old car dashboards. They aren’t all super complicated, but then I found this example of a Bentley dashboard.

[Image: dashboard of a vintage Bentley]

Wow. That’s a lot of things. If you look closely, they aren’t all gauges, but there are certainly far more gauges than we have on a modern car. Why, I wondered? Well, it didn’t take much thinking. What’s the purpose of my car’s dashboard? It helps me not break the law (speedometer), not break the car (tachometer and temperature gauge), and make sure I get where I’m going (fuel gauge). While cars today are vastly more complicated than they used to be, the dashboards have gotten simpler, not more complex. As cars have become more reliable, and more of a black box, it has become less necessary (and less desirable) to display excess information. These four gauges cover the vast majority of what I need to know while driving. I could have gauges for all kinds of stuff, including running trends of every message every sensor sends to the on-board computer. But they’re not there, because even if they were, I wouldn’t know what to do with the information. In fact, were I not driving a standard, I could probably do without the tachometer; on an automatic, engine speed and shifting are handled for me.

Which brings me to my point. Why is it that as cars have gotten more sophisticated our dashboards have gotten simpler, but in IT our dashboards have gotten more complex as our software process has matured? I suspect the reason is simply that we can. There’s tons of data to be had from software development, and very little of it actually has much influence over the outcome of a project. If you keep a handful of things under control, there’s no need for excessive measures. As cars became more robust, there was less reason to monitor every possible system, and the components started to disappear from the dashboard. If your software process becomes more standard, and there is less deviation to monitor, then your dashboard should become simpler as well. So, if you’re ending up with a complicated dashboard because your management “needs the information to make decisions,” maybe it’s time to start asking which decisions simply don’t need to be made. Standardize and make the process robust; simplify the dashboard.

Is variability in software development good?

I wrote an article not too long ago on the subject of variability in process. Under some circumstances I think variability might be desirable, although I wasn’t particularly referring to software development. Last week I attended a webinar hosted by ITMPI and given by one of the employees at QSM. His talk was about the measurement of productivity in IT, specifically how to account for variations in productivity when estimating. The talk was pretty good, but one of his early slides bothered me.

On that one slide he argued that software development isn’t like manufacturing and therefore productivity can’t be measured the way manufacturing measures it. Unfortunately, he offered no alternative during the talk. Instead he focused on how to measure unexplained variation in project outcomes and aggregate it into some sort of vague productivity calculation. On the whole, that’s useful for estimating if you just want to know the odds of a certain effort outcome, but not so useful if you want to learn about what factors impact productivity.

It’s true that software development doesn’t have a lot in common with manufacturing, and the analogies drawn are often strained. That’s not so concerning to me, as the spirit of what management is asking is right: what choices can I make to do this faster, better, or cheaper? In that context, productivity isn’t just something you find out about after the fact; it’s something you want to understand.

With my own research, we’ve found measurable evidence that certain activities do make a difference in productivity. Colocation is worth about 5% better productivity. Committing resources to projects is worth about 0.4% for every 1% more committed, on average, your team is to a project. Which gets back to the question I posed in the title: is variability good?

In short, no. But the longer answer is that, just like any process, you have to know what to control. Even with a highly thought-intensive process, there are things you can and should seek to control to produce more predictable outcomes. It is true that software development is more like the design of a product than the manufacture of one, but that doesn’t mean anything goes.
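
To make the arithmetic on those factors concrete, here’s a back-of-the-envelope sketch. The 5% and 0.4%-per-1% figures come from the findings above; the baseline effort and scenario are invented.

```python
baseline_hours = 10_000       # invented baseline effort for a project

colocated = True
avg_commitment_gain = 20      # team is 20 percentage points more committed

# Factors from the findings above: colocation ~5% productivity, and ~0.4%
# for every 1% increase in average resource commitment.
multiplier = 1.0
if colocated:
    multiplier *= 1.05
multiplier *= 1 + 0.004 * avg_commitment_gain

adjusted_hours = baseline_hours / multiplier
print(f"Adjusted effort: {adjusted_hours:.0f} hours "
      f"(~{1 - adjusted_hours / baseline_hours:.0%} less)")
```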

Where you work may determine how you approach process improvements

Recently I had been looking at the process work of a number of teams within an organization. There were process groups within the Project Management Office, the Business Analysts, Quality Assurance, and Development. It wasn’t until I got to see a bunch of different teams all at work at the same time that I realized each team approached process improvement completely differently.

You might say that “if you give a man a hammer, the world looks like a nail” applies here. In the PMO, process improvements had clear charters, project plans, and a sense of urgency around execution – but little to no analysis. After all, project managers are all about execution and ROI, so they focused on the things they knew well and gave little thought to those they didn’t know.

The Business Analysts developed a taxonomy to refer to each part of the work they already do, and then proceeded, in excruciating detail, to write down everything they knew about the topic. They documented the current process, the desired process, and details about how to write use cases (even though that information is freely available on the Internet). No stone was left unturned in the documentation process – but they did no planning, and they had no mechanism to roll out or monitor the changes.

In Development, every process improvement involved acquiring some sort of tool. Whether it was a structural analysis tool, a code review tool, or a log analyzer, there had to be a tool. For developers, the manifestation of process is a tool. After all, that’s what most of software development is – implementing technology to automate some business process.

For QA, there was lots of analysis (after all, that’s what testing really is – analysis of the functionality) but little planning, and the solutions were usually awkward ones that created work and relied on more inspection rather than taking work away.

The issue here is that each team did what they were good at, and by doing so failed to produce a complete result. Just like the development projects themselves, a complete set of activities must occur to make process change work. You need to understand the problem and lay out a method for delivering on it. You must analyze the problem and understand the solution. And you must implement the change in a way that makes doing the new process easier and better than the old process.

But the key here is that process improvements involve a complete set of activities, and you can’t simply approach process improvement in the same way that you’d approach your siloed job in software development. We all do what we’re comfortable with, but that is a big piece of why we need process improvements in the first place. After all, if you give a man a hammer…

The allure of tools

Over the weekend I helped my brother install chair rail and wainscoting around his entire downstairs. I’ve done chair rail before, but I’d never done wainscoting. Still, I had a decent idea of how it was supposed to be done, so I was happy to help.

He has a relatively new house, so I figured that compared to my antique home, the walls would be straighter and the work easier. For the chair rail I simply set up the power mitre saw on a workbench in the garage and made cuts as we needed them. We didn’t need to batch-cut anything because we were mostly able to use full-length pieces or cut sections to fit between windows and so on. It took about five hours to complete that stage, and we went to bed feeling pretty satisfied with our work.

The next morning we started on the wainscoting. For each section of wall we would be building picture-frame-like boxes out of quarter-round stock. I figured this would be a great time to batch-process parts. After all, new walls ought to mean consistent heights for the boxes, and I could simply cut thirty or forty vertical pieces quickly. So first I set out to make a jig for the mitre saw, which required a run to the local hardware store for wood.

I’m still cautious, so I measured the first vertical piece by hand, and discovered that cutting it on the saw was no small task. First you had to measure the length and cut the piece square. Then you had to rotate the saw to 45 degrees to trim one end, and then rotate it 45 degrees in the other direction to trim the other end. Each rotation of the mitre saw (which is older and not well oiled) takes time and effort. The process of cutting even a couple of vertical pieces was slow. Then I’d have to walk up from the garage with the piece, check that it fit well, and if not, go back down and trim it again, and come back up and down and up and down…

It took a really long time just to do the first three boxes. I finally said to my brother: go to the hardware store and buy a manual mitre box and saw. I could bring that little device into the room we were working in. Yes, each cut (the actual act of cutting a piece) took far longer, but consider all the waste I was able to remove from the process. I no longer had to walk up and down the stairs to the garage. I could simply measure the piece, walk a few steps to the mitre box, cut it, and walk a few steps to check the fit. Setup of the mitre box is practically instantaneous: simply pick the saw up and move it to one of the sets of slots for an angled or straight cut. And since I didn’t have to remember the length while walking down to the garage, I no longer needed to write anything down; if I forgot the measurement by the time I got to the mitre box, the cost of remeasuring was low.

By switching from a power mitre saw to a manual mitre box I saved a ton of time and effort. There’s a certain allure to power tools that isn’t always justified. The setup costs and limits on where you can put them can more than offset the advantage of one quick cut. It’s a great example of why you have to consider the whole process when designing a solution. Speeding up a tiny part of the process may incur costs that undo the benefit and then some.

Could a high standard deviation be good for your business?

If you had asked me a few weeks ago, I would have told you that a high standard deviation in your process is never a good thing. After all, being unpredictable isn’t good. That’s what gets drilled into your head when you study Six Sigma.

The conversation started with a group of people discussing sports and, in particular, the Oakland A’s, who figure prominently in the book “Moneyball.” As stats geeks, we admire all that went into making the A’s a great team on a much smaller budget than many others. But the A’s are boring. You might get excited about Moneyball but, at least from where I stand, the A’s don’t fare well. If you use attendance as a measure, the A’s rank near the bottom – between 27th and 29th, depending on the site you use.

So, how do we rationalize this disconnect? Solid data suggests that a far less monied team can compete, but the fans don’t show up. And the reality is that sports is ultimately a business just like any other, and you need the fans to show up. So why is it that a better average performance didn’t draw in the fans?

Well, because predictable is… boring. There are places where predictability isn’t good and a high standard deviation is. What do people remember about the Red Sox? Eighty-some-odd years of coming close and failing, then the excitement of a few years of wins, and we’re back to the flame-out stories. We like it when the game is exciting. And for winning to be exciting, there must be losses. Frankly, I’ll turn off a game if it’s a blowout. If you could design a team that was reliably better than every other team, I bet nobody would watch. Predictable outcomes aren’t fun outcomes.

Don’t believe me? Check out the analysis of JCPenney’s “no more coupons” approach. Lower prices are more predictable and, overall, should be better for customers, but it doesn’t work. Sometimes we have to appeal to something other than a reliable, repeatable experience. It is important to figure out when it’s time to be predictable – a lot of the time, but not all the time.

The only outcome that really matters is the one the customer sees

The other day I was reviewing a set of proposed metrics with a group of business analysts. We had already been through one round with their metrics and cut the quantity from fifty-plus down to about twenty-five. It was an improvement, but it still wasn’t good enough.

In the time between the first meeting and this one, which had been a while, my thinking on metrics had evolved some. The issue wasn’t that the team wanted to measure their performance – that’s a good start. The issue was that the team wanted to measure only their performance.

In particular, one measure that caught my attention was the requirements defect rate. In essence, this is a simple measure: the number of requirements defects divided by the total defects found. But while the definition is simple, the implementation is not. First off, what does it mean to have a requirements defect? Is a misunderstanding about a requirement a requirements defect or a coding defect? If a requirement is missing, is that a requirements defect, or was it simply an implied requirement that any reasonable developer ought to have known? For certain, there are some defects where the source of injection is well understood, but many others where it is not.

But more importantly, it finally clicked for me when Dr. Hammer said that, regardless of your role, you should be measured against the customer outcome. The example he used at the time was that everyone at a parcel delivery service ought to be measured on on-time delivery, regardless of their job. Why? Because everyone has a hand, directly or indirectly, in making sure the package gets there. And if some teams are allowed to measure themselves differently, they can choose a set of measures that cause sub-optimization of the entire process. In essence, if the business analysts focused on what percentage of defects were requirements issues, quality could get worse (higher overall defect density) while the number of requirements defects stayed about the same. The end result: the customer would be more unhappy, the business analysts would look unjustifiably good, and nobody would undertake appropriate corrective action.
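
Here’s a quick numeric illustration of that sub-optimization risk (all figures invented): the requirements defect rate can hold steady or even improve while the outcome the customer feels gets worse.

```python
releases = {
    # release: (requirements defects, total defects, size in KLOC)
    "before": (20, 100, 50),
    "after":  (25, 150, 50),
}

for name, (req, total, kloc) in releases.items():
    req_rate = req / total   # what the BA team's metric sees
    density = total / kloc   # closer to what the customer experiences
    print(f"{name}: requirements defect rate {req_rate:.0%}, "
          f"defect density {density:.1f}/KLOC")

# before: 20% and 2.0/KLOC; after: 17% and 3.0/KLOC -- the team's metric
# improved while overall quality got worse.
```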

What end does it serve, as a business analyst or any other team, to have a measure that simply allows you to say “not my fault” while the customer suffers? No, I argue, if you want to make sure the customer is getting what they need and want, then the only measures that matter are the ones that capture what the customer experiences – not some slice that serves only to absolve some team of its responsibility to help the organization meet that goal.