The problem with any approach to estimation

Despite all the times you may have tried, estimating an IT project always seems to be a losing proposition. Why can’t we seem to get it right? I’ve had the opportunity to try my hand at estimating a number of times, and I believe I’ve come upon one useful insight: any estimating process you define has to take human behavior into account!

In IT, the majority of costs associated with any project are likely to be labor. Numerous strategies exist for trying to get at the “right” amount of labor: you can estimate by proxy, by analogy, top down, bottom up, Wideband Delphi… you name it. (For a great resource, check out Steve McConnell’s “Software Estimation: Demystifying the Black Art.”) But I’m going to propose that no typical strategy here will help. The problem isn’t how you arrive at the amount of labor; it’s what happens afterwards.

Across a portfolio of projects, a good estimating process should be unbiased. That is, it should produce estimates that are as likely to be overestimates as underestimates. Any single project may still experience an estimating error, but so long as the process is well centered, your business partners ought to be able to make reasonable assumptions about likely outcomes.
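
To make “unbiased” concrete, here’s a minimal sketch (with invented numbers) of the kind of portfolio check I mean: a well-centered process shows roughly half the projects coming in over and half under, with a median error near zero.

```python
# Hypothetical estimate/actual pairs (hours) for a portfolio of projects.
estimates = [400, 1200, 250, 3000, 800, 560, 1500]
actuals   = [430, 1150, 240, 3300, 760, 575, 1500]

# Relative error per project: positive means the project was underestimated.
errors = sorted((a - e) / e for e, a in zip(estimates, actuals))

under = sum(1 for err in errors if err > 0)   # actual exceeded the estimate
over  = sum(1 for err in errors if err < 0)   # actual came in below the estimate
median_error = errors[len(errors) // 2]

print(f"underestimated: {under}, overestimated: {over}, median error: {median_error:+.1%}")
```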

In one situation, we developed a simple regression model from historical data to produce estimates for future projects. When the model was tested against the historical data used to build it, it performed really well. Even when we held out other randomly selected historical projects to test the model, it performed well. Everything seemed to indicate that the model was good and wasn’t overfit to the data used to create it.
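
The back-testing itself was nothing exotic. A minimal sketch of that kind of regression-plus-holdout check, with entirely made-up data (the real model’s inputs aren’t the point here), might look like this:

```python
# A simple regression estimating model plus a holdout check, on invented data.
# A generic size proxy stands in for whatever the real model's predictors were.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
size_proxy = rng.uniform(50, 500, size=80)                   # hypothetical size measure per project
total_hours = 6 * size_proxy + rng.normal(0, 300, size=80)   # historical total effort

X = size_proxy.reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(X, total_hours, random_state=1)

model = LinearRegression().fit(X_train, y_train)
print("holdout R-squared:", r2_score(y_test, model.predict(X_test)))
```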

However, a year after implementing the model, the new real-world projects it had predicted showed a bias towards being underestimated. What was happening? Was our process flawed? Yes and no.

If we were dealing with a process that didn’t involve humans, it probably would have worked pretty well. But because humans are involved, I’m going to propose that any estimate you create, regardless of the process, will have one of two characteristics: either it will be biased towards underestimation, or it will be so outrageously large that nobody will want to do business with you. Here’s why…

When you estimate a project, you create an effect called anchoring. By stating how much effort the project will take, your teams will align resources on the expectation of that effort. During the life cycle of the project, individuals will be more or less busy from one day, week, or month to the next. When they are busier, they will book time equal to the time they worked. But when they are less busy, because the resource has been aligned to the project and likely has nothing else to do, they will also book hours to the project. For estimates to be unbiased against actual outcomes, the light times must counterbalance the busy times. However, in order to get paid during the light times, the humans (this is where they mess it all up) still have to book time to your project. Thus, the offsetting light times never come to fruition and the estimate becomes biased towards always being underestimated.

The problem gets even worse from there. If you feed the actual outcomes from these biased results back into your estimating process, it will cause an inflationary effect. Future estimates will be made larger to account for the apparent underestimation, and the process will repeat on subsequent projects. The result will spiral until IT becomes so expensive that the business starts looking elsewhere.

It’s grim, but I believe there’s an answer, and it lies in how (frustrating as it may be) the business often treats IT already. Rather than making estimates larger in response to the data, you should adjust your estimating process to make them smaller! I know, this sounds crazy, but hear me out. Let’s say you are finding your projects are underestimated by 10% at the median. Adjust your process to make estimates smaller, say by 5%, and then review the results. If projects are still underestimated by 10%, the underestimation you were seeing was likely the result of this waste effect. Continue to shrink your estimates until the underestimation error starts to grow. At that point, you have likely squeezed out the bias caused by light times versus busy times, and the underestimates you are now seeing are the result of actually allocating too little effort to the project. Simply undo that last bit of shrinking to get back to your original 10% (or whatever it was) bias, and set your process there. Sure, you’ll always be a little biased towards underestimating projects, but it’s better than an ever-bloating IT budget.
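
Here’s a toy simulation of that dynamic, just to show the shape of it. The numbers and the simple booking rule (light time gets booked up to the padded estimate, and the real effort sets a floor) are my assumptions, not data:

```python
# Toy model: people book hours up to the (padded) estimate during light times,
# but the project's true effort sets a floor on what gets booked.
TRUE_EFFORT = 900      # hours the project really needs (invented)
RAW_ESTIMATE = 1000    # what the unadjusted estimating process produces (invented)
PADDING_RATE = 0.10    # extra hours booked during light times, as a share of the estimate

def observed_underestimate(shrink):
    estimate = RAW_ESTIMATE * (1 - shrink)
    booked = max(TRUE_EFFORT, estimate * (1 + PADDING_RATE))
    return (booked - estimate) / estimate

for shrink in (0.00, 0.05, 0.10, 0.15, 0.20, 0.25):
    print(f"shrink {shrink:.0%}: observed underestimate {observed_underestimate(shrink):+.1%}")
```

In this toy, the observed bias stays flat at 10% as you shrink the estimates, right up until the shrinking bites into the real effort, at which point the error starts growing. That inflection is the signal to back off one step.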

What are you evaluating your model against?

I enjoy reading Kaiser Fung’s blogs (junkcharts and numbersruleyourworld). One entry in particular caught my attention because it was relevant to some work I had been doing recently.

We were working on a model to improve project estimation for an organization. In this situation, the group was exerting a lot of effort to deliver each estimate. There are a lot of players involved in a project, and each team wants a chance to weigh in on their part. In fact, Steve McConnell notes in Software Estimation: Demystifying the Black Art that the best person to give an estimate is the person who will do the work. But because of the huge backlog of items needing estimates, the organization was drowning in estimation work and not getting anything done. They wanted to know if a model could improve the process.

So we collected some data and constructed a very simple model that attempted to estimate the total effort of a project by extrapolating from a single team’s input. We’ve had success with similar models elsewhere, so it seemed like a plausible route to take here.

How to evaluate a model then becomes the question. With a simple linear model like the one we were proposing, the first thing we’d look at is the R-squared. Ideally, if the input perfectly predicts the output, your R-squared will be 100%. But since models are not perfect, the R-squared is usually something less. In this case, the best model we had was 25%. The worst model we had produced a negative R-squared! You get a negative R-squared when the model’s error is bigger than the error you’d get by simply predicting the average. At this point, using a model to help this organization seemed hopeless. And that’s when Kaiser’s article popped to mind. We didn’t need a model that was perfect; we simply needed a model that was better than what they were doing today.
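
As an aside, here’s the arithmetic behind a negative R-squared, with invented numbers: R-squared compares the model’s squared error to the squared error of simply predicting the mean, so a model that does worse than the mean scores below zero.

```python
# Invented actual vs. predicted values, chosen so the model is worse than useless.
actual    = [100, 220, 150, 400, 310]
predicted = [300, 120, 260, 180, 450]

mean = sum(actual) / len(actual)
ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))   # model's squared error
ss_tot = sum((a - mean) ** 2 for a in actual)                   # "just predict the mean" error

r_squared = 1 - ss_res / ss_tot
print(f"R-squared = {r_squared:.2f}")   # negative, because the mean beats the model
```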

Although we evaluated the model against various measures of goodness of fit, the real test was whether it outperformed the alternative. In this case, the alternative was expert judgement. We already knew that producing an estimate with a model was substantially cheaper than having a whole bunch of teams weigh in. So, could a really terrible model be better than the experts? It turns out the answer is yes. The model outperformed expert judgement about 60% of the time, despite its poor quality by other measures. One could hardly call the model good, but then again, it wasn’t as if the experts were any good either.
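
The comparison itself is simple: for each project, whichever of the model or the experts landed closer to the actual effort wins. Here’s a sketch with hypothetical numbers (the real study just tallied these head-to-head wins and found the model ahead roughly 60% of the time):

```python
# Hypothetical actuals and the two competing estimates for six projects (hours).
actuals = [500, 1200, 800, 300, 950, 700]
model   = [620, 1000, 780, 420, 900, 640]
experts = [400, 1500, 900, 260, 700, 780]

# Count the projects where the model's absolute error is smaller than the experts'.
model_wins = sum(
    abs(m - a) < abs(e - a)
    for a, m, e in zip(actuals, model, experts)
)
print(f"model beats the experts on {model_wins} of {len(actuals)} projects")
```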

We have this illusion that when we provide an estimate based on our “expertise,” it is somehow unbiased and of high quality. But referring back to McConnell’s book, we know this not to be the case. There’s significant uncertainty in early-stage estimates, so why should we presume that the expert knows something that hasn’t been articulated by anyone? They don’t know what they don’t know. And that’s where models can be helpful. Because the model doesn’t have any preconceived notions about the work and isn’t going to be optimistic for no reason, it is likely to make as good an estimate as any expert would.

For some reason it reminds me of the saying about being chased by a bear: you don’t need to outrun the bear, just the other person the bear is chasing. And so it goes with models. The model doesn’t have to be perfect to be useful; it simply has to be better than the alternative.

What if static analysis is all wrong?

I just got back from a meeting with one of my former college professors. I’ve kept in touch because the academic world and its research have much to teach us about how to operate in the business world. For one, without the financial pressures, academia is free to explore some crazier ideas that may one day create value.

In this recent meeting we were discussing static analysis and machine learning. Static analysis has proven frustrating in some of my own analyses, since it shows no evidence of predictive power for the outcomes we care about: defects the user would experience and team productivity. And yet we keep talking about doing more static analysis. Is it that a particular tool is bad, or is the idea fundamentally flawed in some way?

What turned out to be a non-event for machine learning might be an interesting clue to the underlying challenges with static analysis. This particular group does research on genetic programming; essentially, they evolve software to solve problems, which is valuable in spaces where the solution isn’t well understood. In this particular piece of research, the team was trying to see if modularity would help solve problems faster. That is, if the programs could evolve and share useful functions, would that make problems easier to solve? The odd non-event was that it didn’t seem to help at all. No matter how they biased the experiments, the evolving solutions preferred copying and tweaking code over using a shared function.

Although the team didn’t look into it much, they suspect that modularity actually creates fragility in software. That is, if you have a single function that many subsystems use, then changing that function can have disastrous ripple effects. If there are many copies of the function and one is changed, the impact is much smaller. One might argue that this could apply to human-created code as well: making code more modular and reusable may pay off only under certain circumstances. If true, it would fly in the face of what we think we know about writing better software. And importantly, it would quickly devalue what static analysis tools do, which is push you towards a set of commonly agreed upon (but possibly completely wrong) rules.

What does your dashboard look like?

On my drive today I was thinking about my car’s dashboard. I drive a relatively modern car, so the dashboard is pretty simple – engine temperature, speed, tachometer, and fuel gauge. There’s not a lot to it. Looking at it reminded me, for some reason, of old car dashboards. They aren’t all super complicated, but then I found this example of a Bentley dashboard.

[Image: vintage Bentley dashboard]

Wow. That’s a lot of things. If you look closely, they aren’t all gauges, but there are certainly far more gauges than we have on a modern car. Why, I wondered? Well, it didn’t take too much thinking. What’s the purpose of my car’s dashboard? It helps me not break the law (speedometer), not break the car (tachometer and temperature), and make sure I get where I’m going (fuel gauge). While cars today are vastly more complicated than they used to be, the dashboards have gotten simpler, not more complex. As cars have become more reliable, and more of a black box, it has become less necessary (and less desirable) to display excess information. These four gauges pretty much cover the vast majority of what I need to know while driving my car. I could have gauges for all kinds of stuff, including running trends of every message every sensor sends to the on-board computer. But they’re not there, because even if they were, I wouldn’t know what to do with the information. In fact, were I not driving a standard, I could probably do without the tachometer; on an automatic, engine speed and shifting are handled for me.

Which brings me to my point. Why is it that as cars have gotten more sophisticated our dashboards have gotten simpler, but in IT our dashboards have gotten more complex as our software process has matured? I suspect the reason is simply that we can. There’s a ton of data to be had from software development, and very little of it actually has much influence over the outcome of a project. If you keep a handful of things under control, there’s no need for excessive measures. As cars became more robust, there was less reason to monitor every possible system, and the components started to disappear off the dashboard. If your software process becomes more standard, and there is less deviation to monitor, then your dashboard should become simpler as well. So, if you’re ending up with a complicated dashboard because your management “needs the information to make decisions,” maybe it’s time to start asking which decisions simply don’t need to be made. Standardize and make the process robust; simplify the dashboard.

Where you work may determine how you approach process improvements

Recently I had been looking at the process work of a number of teams within an organization. There were process groups within the Project Management Office, the Business Analysts, Quality Assurance, and the Developers. It wasn’t until I got to see all of these teams at work at the same time that I realized each one approached process improvement completely differently.

You might say that “if you give a man a hammer, everything looks like a nail” applies here. In the PMO, process improvements had clear charters, project plans, and a sense of urgency around execution, but little to no analysis. After all, project managers are all about execution and ROI, so they focused on the things they knew well and gave little thought to the things they didn’t.

The Business Analysts developed a taxonomy to refer to each part of the work they already do, and then proceeded, in excruciating detail, to write down everything they knew about the topic. They documented the current process, the desired process, and details about how to write use cases (even though that information is freely available on the Internet). No stone was left unturned in the documentation, but there was no planning, and no mechanism to roll out or monitor the changes.

In Development, every process improvement involved acquiring some sort of tool. Whether it was a structural analysis tool or a code review tool or a log analyzer, there had to be a tool. For developers, the manifestation of process is a tool. After all, that’s what most of software development is: implementing technology to automate some business process.

For QA, there was lots of analysis (after all, that’s what testing really is: analysis of the functionality) but little planning, and the solutions were usually awkward ones that created work and relied on more inspection rather than taking things away.

The issue here is that each team did what they were good at, and by doing so failed to produce a complete result. Just like the development projects themselves, a complete set of activities must occur to make process change work. You need to understand the problem and lay out a method for delivering on it. You must analyze the problem and understand the solution. And you must implement the change in a way that makes doing the new process easier and better than the old process.

But the key here is that process improvements involve a complete set of activities, and you can’t simply approach process improvement in the same way that you’d approach your siloed job in software development. We all do what we’re comfortable with, but that is a big piece of why we need process improvements in the first place. After all, if you give a man a hammer…

The allure of tools

Over the weekend I helped my brother install chair rail and wainscoting around his entire downstairs. I’ve done chair rail before, but I’d never done wainscoting. Still, I had a decent idea of how it was supposed to be done, so I was happy to help.

He has a relatively new house, so I figured that, compared to my antique home, the walls would be straighter and the work easier. For the chair rail I simply set up the power mitre saw on a workbench in the garage and made cuts as we needed them. We didn’t need to batch-cut anything because we were mostly able to use full-length pieces or cut sections to fit between windows and so on. It took about five hours to complete that stage, and we went to bed feeling pretty satisfied with our work.

The next morning we started on the wainscoting. For each section of wall we would be building picture frame-like boxes out of quarter round stock. I figured that this would be a great time to batch process parts. After all, new walls ought to lead to consistent heights for the boxes and I could simply cut thirty or forty vertical pieces quickly. So, first I set out to make a jig for the mitre saw. That required a run to the local hardware store for wood.

I’m still cautious, so I measured the first vertical piece by hand and discovered it was no small task to cut on the saw. First you had to measure the length and cut the piece square. Then you had to rotate the saw to 45 degrees to trim one end. And then you had to rotate it 45 degrees in the other direction and trim the other end. Each rotation of the mitre saw (which is older and not well oiled) takes time and effort. The process of cutting even a couple of vertical pieces was slow. Then I’d have to walk up from the garage with the piece, check that it fit well, and if not, go back down and trim it again, and come back up and down and up and down…

It took a really long time just to do the first three boxes. I finally told my brother to go to the hardware store and buy a manual mitre box and saw. I could bring this little device right into the room we were working in. Yes, each cut (the actual act of cutting a piece) took far longer, but consider all the waste I was able to remove from the process. I no longer had to walk up and down the stairs to the garage. I could simply measure the piece, walk a few steps to the mitre box, cut it, and walk a few steps to check the fit. Setup of the mitre box is practically instantaneous: simply pick up the saw and drop it into the slots for an angled or straight cut. And no more writing anything down. Since I didn’t have to remember the length while walking down to the garage, I no longer needed to record each cut, and if I had forgotten by the time I got to the mitre box, the cost of remeasuring was low.

By switching from a power mitre saw to a manual mitre box I saved a ton of time and effort. There’s a certain allure to power tools that isn’t always justified. The setup costs and limits on where you can put them can more than offset the advantage of one quick cut. It’s a great example of why you have to consider the whole process when designing a solution. Speeding up a tiny part of the process may incur costs that undo the benefit and then some.

The only outcome that really matters is the one the customer sees

The other day I was reviewing a set of proposed metrics with a group of business analysts. We had already been through one round with their metrics and cut the quantity from fifty-plus down to about twenty-five. It was an improvement, but it still wasn’t good enough.

In the time between the first meeting and this one, which had been a while, my thinking on metrics had evolved some. The issue wasn’t that the team wanted to measure their performance – that’s a good start. The issue was that the team wanted to measure only their performance.

In particular, one measure that caught my attention was the requirements defect rate. In essence, this is a simple measure: the number of requirements defects divided by total defects found. But while the definition is simple, the implementation is not. First off, what does it mean to have a requirements defect? Is a misunderstanding about a requirement a requirements defect or a coding defect? If a requirement is missing, is that a requirements defect, or was it simply an implied requirement that any reasonable developer ought to have known? For certain, there are some defects where the source of injection is well understood, but there are many others where it is not.

But more importantly, it finally clicked for me when Dr. Hammer said that, regardless of your role, you should be measured against the customer outcome. The example he used at the time was that a parcel delivery service ought to be measured on on-time delivery, and everyone in the organization should be, regardless of their job. Why? Because everyone has a hand, directly or indirectly, in making sure the package gets there. And if some teams are allowed to measure themselves differently, they can choose a set of measures that sub-optimize the overall process. In essence, if the business analysts focused on what percentage of defects were requirements issues, quality could get worse (higher overall defect density) while the number of requirements defects stayed about the same. The end result is that the customer would be more unhappy, the business analysts would look unjustifiably good, and nobody would undertake appropriate corrective action.
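
To put rough (and purely hypothetical) numbers on it: suppose a release ships with 100 defects, 10 of them traced to requirements, for a requirements defect rate of 10%. If the next release ships with 200 defects and the requirements count stays at 10, the rate falls to 5%. The analysts’ metric just improved, yet the customer is living with twice as many defects.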

What end does it serve, as a business analyst or any other team, to have a measure that simply allows you to say “not my fault” while the customer suffers? No, I argue, if you want to make sure the customer is getting what they need and want, then the only measures that matter are the ones that capture what the customer experiences, not some slice that serves only to absolve one team of its responsibility to help the organization meet that goal.

Too many fields

If you’ve read “Everything I need to know about Manufacturing I learned in Joe’s Garage” then you’re familiar with the part where they’re discussing power drills.  Having many tools in use at the same time means that, statistically speaking, it is more likely that one will break at any given time.  Thus, if you become reliant on a lot of drills, you must have lots of backups.

Having too many fields in a defect entry form, or any workflow for that matter, is like having too many drills in your shop.  If you ask people to fill in that many fields, you’ll probably get very few of them right all the time.  Suddenly, just opening a defect requires you to fill out dozens of fields, when in fact nobody runs a report off any of those fields at all.  I can’t think of very much that you really need to enter into a defect ticket (a minimal sketch of such a record follows the list):

  • who opened it (which the system can figure out from who is logged in)
  • when it was opened (which the system can figure out as well).  I don’t see a need for a separate “when was it found”.  Although there’s probably a lag between found and opened, I’m not sure that it’s so critical that you need people to enter it.  Also, probably a date/time for each change to the ticket, but the system can automatically do that too.
  • what test phase were you in (or production if you didn’t catch it).  Useful for containment calculations.
  • what system were you working on.  Yes, I know you could include build numbers, etc, etc. but the combination of the date and the system ought to be enough to clue a developer in to what version was being run at the time in test.  For production, if you allow various versions to be in the field at the same time this might be more useful.
  • a status (open, closed, ready for retest).  Notice there’s no “cancelled” or “deferred”.  A cancelled ticket is a closed ticket that didn’t get a code change (or some other resolution).  A deferred defect is open… it isn’t fixed.
  • a description of the problem.
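
Here’s that minimal sketch as a data structure.  The field names, the enum values, and the auto-population comments are my assumptions about how a tracking system might handle it; the reporter only has to supply the description, the phase, and the system.

```python
# A minimal defect record: three fields typed by hand, everything else auto-filled.
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum

class Status(Enum):
    OPEN = "open"
    READY_FOR_RETEST = "ready for retest"
    CLOSED = "closed"            # a cancelled ticket is just closed with no code change

class Phase(Enum):
    UNIT_TEST = "unit test"
    SYSTEM_TEST = "system test"
    UAT = "user acceptance test"
    PRODUCTION = "production"    # escaped to the field

@dataclass
class Defect:
    description: str             # the only free text the reporter has to write
    phase: Phase                 # where it was caught; feeds containment calculations
    system: str                  # which system was under test
    opened_by: str = ""          # auto-filled from the logged-in user
    opened_at: datetime = field(default_factory=datetime.now)  # auto-filled timestamp
    status: Status = Status.OPEN # no separate "cancelled" or "deferred" states
```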

What else do you really need?  Keep it simple and minimize the risk of a field being full of useless junk.  Think really, really hard about whether you truly need that field you are adding.  Start by studying a random sample of tickets.  Add the new field on a spreadsheet only and classify the sample.  Mock up some reports, figure out what you think that information would tell you, and decide whether there’s any action you would take on it.  If not, you don’t need the field.  Let the system auto-populate anything it can to minimize data entry.

Avoid too many fields in your system, just as you should avoid too many drills in your shop.

If the action isn’t different…

Our family has two pets, a cat and a dog.  Both are getting older, and as with humans, the cost of health care for pets rises as they age.  End-of-life care, or attempting to prolong life, is especially expensive.  Atul Gawande tells some of that story in his article “Letting Go.”  In addition, I happened to be listening to NPR the other day, where a guest pointed out that 66% of Medicaid costs were incurred by just 25% of the people, mostly elderly in end-of-life care situations.  For pets, we often have to make hard decisions about what care is worth giving and what is not.

Please don’t take this as cold-hearted about my pets, because we do love them and have enjoyed having them in our lives for 11 years, but I’m a data person and I listen hard to the facts.  I accept, as much as we might want to think otherwise, that we’re unlikely to escape the statistical realities.  But I digress…

Our cat, Lily, had stopped eating.  We took her to the vet for an exam, X-rays, blood work, fluid injections and more to try to get a clue as to what was going on.  After the tests came back, the vet called and gave us two possible diagnoses.  One was some sort of gastrointestinal issue; the other was cancer.  Her opinion (I’m not sure it was actually backed by real statistics) was that it was a 50/50 chance of being one or the other.  She proposed that we bring Lily back in for an ultrasound, which is more sensitive to soft tissues than an X-ray.

I asked the vet, “once we know the outcome, what are the treatments?”  In reply, she told me that if it was cancer, she would give Lily steroids to improve her energy level and appetite (plus apparently steroids make cats thirsty so she’d also drink more and avoid dehydration).  But, ultimately, that would only help for a short while and she would degrade again and there’d be little we could do beyond that.  If, however, it was an intestinal issue, the steroids would improve her eating and energy level (as well as her drinking) and she’d be on them the rest of her life, however long that might be.

What value would the ultrasound provide, I wondered?  If it’s cancer, the treatment is steroids.  If it’s not cancer, the treatment is still steroids.  The only difference is whether they would work for a short time or a long time.  So I declined the proposed ultrasound and told her to prescribe the steroids.

The reality is that, although it might be interesting to have a diagnosis, either outcome leads to the same course of treatment.  The avoided cost of the ultrasound, which would run several hundred dollars, could then be spent directly on steroids and specialized food.  Being the optimist, I picked up the steroids and popped off to the pet store for a whole case of special cat food.  My wife thought I was crazy coming home with a whole case.

Of course, just 2 days ago we used up the last of that case, and I bought another case of the food at the store.  I’m glad that the issue appears to be intestinal and not cancer.  But the story applies for all kinds of decisions.  If you take a step to analyze, collect data, etc., but regardless of the outcome it leads to the same next action, skip the analysis.  It’s just a waste of time.

Time, in my case, that could be better spent not traumatizing our cat and instead cuddling with her for however many more months or years of time we get.

Unnecessary process steps

Lean is greatly concerned with waste, and yet we often don’t see it when it’s right before our eyes.  Take, for example, granting access to a particular website.  These days, it’s downright commonplace to simply create your own account and get going.  You name it: Facebook, banks, online retailers, all of them just let you make your own account.  And some of these places are dealing with your finances or your credit card!

Yet we don’t think much of it.  In fact, we sort of take for granted that people just create their accounts as they need them.  But in corporate environments, not so much.  Here we rely on access requests, either emailed around or managed through service request tracking systems.  In some cases, the protection is necessary: companies often have closely guarded marketing plans that they want as few people as possible to know about.  But in many cases, we make people request access to things we’d never deny them access to.

It might not seem like much to deal with an access request.  What is it, after all: less than a minute, maybe two at most, to create the account for someone?  But what does that cost add up to over months or years?  It’s like a form of water torture… drip, drip, drip.  And what about the person waiting for access, twiddling their thumbs while someone else finally gets around to granting the access they presumably need to do their job?

Things like this slowly eat away at your productivity, one little innocent but unnecessary request at a time.  If you’re never going to deny an access request, then why have access requests at all?  Just open up the system and let the users in, or leverage single sign-on to track who actually uses the resources.  But having the process step of requesting access just because it’s a corporate norm… silly (and wasteful).