This one graphic DOESN’T explain it all

How many times have you been reading Facebook, or your favorite blog, or a site like BuzzFeed, and seen an entry with a title like “this one simple graphic explains [insert topic here] once and for all,” or something to that effect?

These titles suggest to the reader that if you just looked at some problem in a specific way, it would suddenly all become clear. Of course, the next step is that your Democrat or Republican friends post these items to Facebook with a helpful comment like “for my [opposing party] friends.” And really, nobody’s mind is changed.

First off, I’m not going to spend much time addressing cognitive dissonance. The reality is that giving people evidence that isn’t in line with their world view tends to strengthen their current views, not weaken them.

But secondly, for any sufficiently complicated topic (which in my world is pretty much all of them), there is no one graphic that explains it all. And I suspect most of your situations are like that as well. Let me use an example: organizational productivity. We measure effort per function point against industry norms, and we wanted to demonstrate in our “one chart that explains it all” that productivity had improved since a recent change. Except one chart won’t do it. The chart makes the main point, but we needed at least five other charts checking things like whether the measurement system had been tampered with, whether quality had suffered as apparent productivity rose, and so on. In IT, most of the things we measure are proxy measures for some outcome we care about. As proxy measures, we always have to worry about the measurement system and the unintended consequences of our choices. As a result, no analysis is ever complete on a single chart.

Treat anyone and anything that is explained to you in “one simple chart” with suspicion. If it seems too simple and obvious, it probably is.

It’s not a scorecard if there is no score

Since the World Cup is going on right now, it makes sense to spend some time talking about keeping score. I’m from the US, so I took note of the US beating Ghana 2 goals to 1 recently. That’s the great thing about keeping score: you actually know who came out on top.

We use scorecards (or something equivalent) widely in professional sports. In baseball, for example, each player has a set of statistics associated with their performance: batting average, ERA, RBI, and so on. Using these statistics, we have some clue as to which players are better than others. That doesn’t mean we can precisely rank two great players against each other, but we can get them into a general stratification and figure out who to keep and who to cut.

In software we often desire to have scorecards, but we fail to use them appropriately. Perhaps we consider three things important about every project – on time, on budget, and with decent quality. Sure, it’s a simple model, but perhaps better than nothing.

So we collect this data on our projects and we notice one project that, based on our data, appears to be late, over budget, and below average quality. So we go ask the team what’s going on… and wouldn’t you know it, they’ve got a reason for everything. The scope changed, so they can’t be held to the date; they couldn’t get the resources they really needed, so people are working overtime; and, well, you know, if you don’t have top people then quality will suffer. Clearly this project’s scorecard says it is in trouble, but the team wants to call it all clear.

The outcome of a software project is often like the outcome of a sporting event: eventually it will either succeed or fail, just as one of the teams will win. Sometimes, for both groups, the getting there isn’t pretty. In the end, the winning sports team may say something like “we won, but we could’ve played better,” which demonstrates a far better understanding of the probability of success than most project teams have. Instead, the project team wants to put out its own version of reality, one that is inconsistent with how every other project is looked at. Suddenly you don’t have a scorecard anymore. Who won the game when the score reported is 2 to Purple-People-Eater? You haven’t a clue.

I’m not proposing that measurement in software is an easy task, but if we choose measures that act as a proxy for a project or organization’s performance, we must use a consistent ruler to measure them all. Otherwise you don’t have a scorecard; you just have, as Kaiser Fung likes to call it, story time.
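
To make the consistent-ruler idea concrete, here is a minimal sketch of scoring every project with the same three checks. The field names and the quality threshold are hypothetical illustrations, not anything from the post itself:

```python
# A minimal sketch of a consistent project scorecard.
# The field names and the quality threshold are hypothetical illustrations.

from dataclasses import dataclass

@dataclass
class Project:
    name: str
    planned_days: float
    actual_days: float
    planned_cost: float
    actual_cost: float
    defects_per_kloc: float  # stand-in proxy for quality

def score(p: Project, quality_threshold: float = 5.0) -> dict:
    """Apply the same three checks to every project, no exceptions."""
    return {
        "on_time": p.actual_days <= p.planned_days,
        "on_budget": p.actual_cost <= p.planned_cost,
        "acceptable_quality": p.defects_per_kloc <= quality_threshold,
    }

projects = [
    Project("Alpha", 100, 130, 500_000, 640_000, 7.2),
    Project("Beta", 60, 58, 250_000, 240_000, 3.1),
]

for p in projects:
    print(p.name, score(p))
```

The particular thresholds don’t matter; what matters is that the same function is applied to every project, so “the scope changed” becomes a conversation about the inputs rather than an excuse to opt out of the score.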

Why value delivered is a misleading metric

A recent conversation I was having turned to the idea of standing development teams and measuring performance in terms of value delivered.

There’s been some debate back and forth between colleagues and me about whether value is something IT should measure. After all, it commingles elements IT can’t control (the goodness of the business idea) with IT’s delivery. But, that aside, let’s assume we ought to measure IT in a way that includes value.

The idea of measuring value delivered seems to make sense for fixed-size teams. Under typical circumstances you always want to account for opportunity in your measures. For example, if you were comparing two teams, one of 10 people and the other of 100, you’d intuit that 100 people are going to turn out more value than 10 simply by virtue of being an order of magnitude larger. However, if your team size is fixed, then adjusting for team size doesn’t matter: you’d be taking value delivered and dividing it by the same number every time.

However, there is still a problem with measuring only value delivered. Let’s say you are chugging along delivering business value and the business says, “We’d like even more value! Can you hire another person?” Your team size would still be reasonable, so you say yes. Indeed, with the new person on board, value delivery rises, so everyone’s happy, right? Not necessarily. Adding the person added value, so the slope of your value-delivered line would rise, but did it add enough value to overcome the additional cost? That depends. If a team of 10 was delivering $1m in value, then a team of 11 ought to deliver at least $1.1m for the value delivered per person to stay constant. On the flip side, if your team shrinks from 10 to 9, you’d expect value delivered to drop as well. In fact, if it dropped from $1m to $950k, the value delivered per person would actually increase! And by the way, if the value delivered didn’t drop with fewer people on the team, what does that say about the contributions of the former member(s)? When your business folks say “IT costs too much,” what they are perceiving is the value they’re getting for the cost, not just the value, and despite the statement, not just the cost either.
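
The arithmetic behind that is worth making explicit. Here is a minimal sketch using the same hypothetical dollar figures as above:

```python
# Value delivered per person, using the hypothetical figures from the paragraph above.

def value_per_person(total_value: float, team_size: int) -> float:
    return total_value / team_size

print(value_per_person(1_000_000, 10))  # 100,000.0  -> the baseline team of 10
print(value_per_person(1_050_000, 11))  # ~95,454.5  -> total value rose, the rate fell
print(value_per_person(1_100_000, 11))  # 100,000.0  -> break-even point for the 11th hire
print(value_per_person(950_000, 9))     # ~105,555.6 -> total value fell, the rate rose
```

Total value rose in the second case and fell in the last one, yet the per-person rate moved the opposite way each time, which is exactly why the denominator matters.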

Of course, if you only measure the value half and not the cost half (whether that’s people on the team, hours billed, or whatever), you’ll never know this. Capers Jones has pointed to evidence in his research that larger projects experience lower productivity, or in essence, smaller amounts of value delivered per unit of effort exerted. The idea of simply attempting to maximize value delivered underpins the mythical man-month that Fred Brooks wrote about so many years ago: the incorrect belief that if ten people are good, then twenty must be better. To know the optimal mix for your organization, you must attempt to measure productivity, not just the value half.

The problem with value delivered is that it’s a useful measure only as long as the team stays fixed in size in perpetuity. That’s unrealistic. If the team size changes for any reason (hiring, quitting, a leave of absence), even temporarily, you must account for the rate of value delivered and not just the sum total. The BLS reports that the median tenure of an employee is around five years, so you ought to expect some instability in your team over time. Otherwise, when someone comes knocking saying you aren’t delivering the value they expect, you’ll have no basis for the conversation.

What are you evaluating your model against?

I enjoy reading Kaiser Fung’s blogs (junkcharts and numbersruleyourworld). One entry in particular caught my attention because it was relevant to some work I had been doing recently.

We were working on a model to improve project estimation for an organization. In this situation, the group was exerting a lot of effort to deliver an estimate. There are a lot of players involved in a project, and each team wants a chance to weigh in on its part. In fact, Steve McConnell notes in his book Software Estimation: Demystifying the Black Art that the best person to give an estimate is the person doing the work. But because of the huge backlog of items needing estimation, the organization was drowning in estimates and not getting anything done. They wanted to know if a model could improve the process.

So we collected some data and constructed a very simple model that attempted to estimate the total effort of a project by extrapolating from a single team’s input. We’ve had success with similar models elsewhere, so it seemed like a plausible route here.

How to evaluate a model then becomes the question. With a simple linear model like the one we were proposing, the first thing we’d look at is the R-squared. Ideally, if the input perfectly predicts the output, your R-squared will be 100%. But since models are not perfect, the R-squared is usually something less. In this case, the best model we had was 25%. The worst model we had produced a negative R-squared! You get a negative R-squared when the model’s error is larger than what you’d get by simply predicting the mean every time. At this point, using a model to help this organization out seemed hopeless. And that’s when Kaiser’s article popped to mind. We didn’t need a model that was necessarily perfect; we simply needed a model that was better than what they were doing today.
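
As a quick illustration with made-up numbers (not the client’s data), R-squared falls straight out of the residuals, and it goes negative as soon as the model’s squared error exceeds that of a mean-only prediction:

```python
import numpy as np

def r_squared(y_true, y_pred):
    """1 - SS_residual / SS_total; negative when the model is worse than predicting the mean."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

actual_effort = [120, 300, 80, 450, 200]   # hypothetical project efforts
weak_model = [200, 200, 200, 300, 300]     # a weak but still informative model
bad_model = [400, 100, 300, 150, 500]      # a model worse than the mean

print(r_squared(actual_effort, weak_model))  # roughly 0.29
print(r_squared(actual_effort, bad_model))   # roughly -2.9
```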

Although we evaluated the model against various measures of goodness of fit, the real test was whether it outperformed the alternative. In this case the alternative was expert judgement. We already knew that producing an estimate with a model was substantially cheaper than having a whole bunch of teams weigh in. So, could a really terrible model be better than the experts? It turns out the answer is yes. The model outperformed expert judgement about 60% of the time, despite its poor quality by other measures. One could hardly call the model good, but then again, it wasn’t as if the experts were any good either.
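
The head-to-head comparison is just as simple to sketch: for each project, ask whose estimate landed closer to the actual. The figures below are invented purely to show the mechanics:

```python
# Fraction of projects where the model's estimate landed closer to the actual
# than the expert's did. All figures are invented for illustration.

actuals = [120, 300, 80, 450, 200]
model_estimates = [160, 260, 130, 400, 310]
expert_estimates = [70, 150, 60, 250, 120]

model_wins = sum(
    abs(m - a) < abs(e - a)
    for a, m, e in zip(actuals, model_estimates, expert_estimates)
)
print(f"Model beat the expert on {model_wins / len(actuals):.0%} of projects")
```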

We have this illusion that when we provide an estimate based on our “expertise,” it is somehow unbiased and of high quality. But referring back to McConnell’s book, we know this is not the case. There’s significant uncertainty in early-stage estimates, so why should we presume that the expert knows something that hasn’t been articulated by anyone? They don’t know what they don’t know. And that’s where models can be helpful. Because the model doesn’t have any preconceived notions about the work and isn’t going to be optimistic for no reason, it is likely to make as good an estimate as any expert would.

For some reason it reminds me of the saying about being chased by a bear: you don’t need to outrun the bear, just the other person the bear is chasing. And so it goes with models. The model doesn’t have to be perfect to be useful; it simply has to be better than the alternative.

What does your dashboard look like?

On my drive today I was thinking about my car’s dashboard. I drive a relatively modern car, so the dashboard is pretty simple – engine temperature, speed, tachometer, and fuel gauge. There’s not a lot to it. Looking at it reminded me, for some reason, of old car dashboards. They aren’t all super complicated, but then I found this example of a Bentley dashboard.

[Photo of a vintage Bentley dashboard crowded with gauges.]

Wow. That’s a lot of things. If you look closely, they aren’t all gauges, but there are certainly far more gauges than we have on a modern car. Why, I wondered? Well, it didn’t take too much thinking. What’s the purpose of my car dashboard? It helps me not break the law (speedometer), not break the car (tachometer and temperature), and make sure I get where I’m going (fuel gauge). While cars today are vastly more complicated than they used to be, the dashboards have gotten simpler, not more complex. As cars have become more reliable, and more of a black box, it has become less necessary (and less desirable) to display excess information. These four gauges cover the vast majority of what I need to know while driving my car. I could have gauges for all kinds of stuff, including running trends of every message every sensor sends to the on-board computer. But they’re not there, because even if they were, I wouldn’t know what to do with the information. In fact, were I not driving a standard, I could probably do without the tachometer; on an automatic, engine speed and shifting are handled for me.

Which brings me to my point. Why is it that as cars have gotten more sophisticated our dashboards have gotten simpler, but in IT our dashboards have gotten more complex as our software processes have matured? I suspect the reason is simply that we can. There’s a ton of data to be had from software development, and very little of it actually has much influence over the outcome of a project. If you keep a handful of things under control, there’s no need for excessive measures. As cars became more robust, there was less reason to monitor every possible system, and the components started to disappear from the dashboard. If your software process becomes more standard, and there is less deviation to monitor, then your dashboard should become simpler as well. So, if you’re ending up with a complicated dashboard because your management “needs the information to make decisions,” maybe it’s time to start asking which decisions simply don’t need to be made. Standardize and make the process robust; simplify the dashboard.

Scrum’s 50% failure rate?

So in a class today I was flipping through the materials and saw this:

[Photo of a slide from the class materials stating that roughly half of Scrum projects fail, and that “bad Scrum” is a major cause.]

What does one make of such a statement? If it is true, then using Scrum as a methodology is no better than flipping a coin: half the time it works, the other half it doesn’t. And if your methodology choice is functionally equivalent to coin flipping, then the methodology isn’t adding value. Sure, you could argue that if you’re on the failing side of the equation you’re “doing it wrong,” but some consideration should be given to the idea that choosing a methodology (any methodology) is no predictor of success. All that said, even one of the original signers of the Agile Manifesto is still obligated to produce data showing this is so. It’s an incredibly broad generalization.

The other thing is that the second part of the statement is far less specific. “Bad Scrum” is a “major cause”? What does major mean here: 50%? 25%? Something else? If half your projects are failing, is fixing the bad Scrum (whatever that may mean) going to make all of them succeed? Unlikely. We can reasonably assume that any process, no matter how well executed, will fail under some circumstances, so how far will fixing bad Scrum actually take us? It’s very hard to say from this statement.

If I trust that there’s analysis underlying these statements, I’d far rather hear something like “in a recent study of N projects, 50% failed [how did they fail?]. Of the failures, N% can be attributed to ‘bad Scrum.’” Sure, it doesn’t read like a nice little sound bite you can put on a slide in a training deck, but it’s far more complete and far more useful to a reader trying to understand the opportunity in fixing the problem.

Seeing through messes rather than cleaning them up

At the Cutter conference I recently attended, a gentleman named Robert Scott gave the keynote address. In it he talked about what he thought the future of technology leaders would look like. One thing he said stuck with me – future leaders will need to be able to see through messes rather than clean them up.

Although I found it an interesting idea, I couldn’t quite fit it into something I could relate to. That was until a discussion today. I was talking with someone about measuring a quality assurance organization, and each time I proposed a metric he would counter with something like “but if we measure it that way, then what about the situation where X doesn’t apply?”, essentially implying that there was noise in the measurement system and we’d have to clean it up before the measurement would be useful.

And that’s when Robert’s comment finally made sense to me. As a data person, I’ve learned to live with and even embrace the noise inherent in measuring software development. It’s more like economics or sociology than physics or chemistry. Given some exact set of inputs, the output will be directionally the same, but not exactly the same. But software people don’t go for mostly right or directionally right. It bothers them. After all, software is either wrong or it’s right. Either it’s a bug or it isn’t. There’s no such thing as “sort of a defect.” So, when you introduce a measurement system that naturally contains noise, the discomfort sets in.

Here’s the thing. If we assume Robert Scott is right, that future leaders must be able to see through messes rather than clean them up, then the measurement of software stands a chance of moving forward. It will always be imperfect, but the data helps us to make better decisions. Yes, it’s messy. It contains noise, but rather than always trying to clean it up, acknowledge it is there, and start to look beyond the noise at the signal underlying it.

Do something counterintuitive

A post over at one of my favorite management blogs reminds me of my own recent experience with going for it on fourth down. Recently I’ve been working on a project to improve estimation. It’s not uncommon to hear that estimates should be created by those doing the work. Indeed, if a random person unfamiliar with the ins and outs of your system (namely, management) estimates a project for you, odds are it’s going to be bad. But we can take it one step further: what if there were evidence that even when the person doing the work makes an estimate, you should override it with a model instead?

Steve McConnell notes in his book on estimation that various experiments have shown developers to be eternal optimists. One correction he suggests is simply to make estimates larger. Unfortunately, when the evidence shows you have a bias, you aren’t going to make the right call on fourth down, so to speak. In our own research, a model helped compensate for human fallibility. Although we still got an estimate from the developer, when we combined their estimate with historical information in a model, we got an outcome that outperformed expert judgement alone 65-80% of the time. That’s not perfect, but it’s surely better than no model at all.
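
The post doesn’t spell out the model we used, but the general shape of this kind of correction is easy to sketch: fit the historical relationship between developer estimates and actual effort, then run new estimates through it. The data and coefficients below are invented for illustration:

```python
import numpy as np

# Historical data: what developers estimated vs. what projects actually took.
# All figures are invented for illustration.
developer_estimates = np.array([100, 200, 150, 400, 250, 300], dtype=float)
actual_effort = np.array([140, 260, 210, 520, 330, 410], dtype=float)

# Fit a simple linear correction: actual ~ slope * estimate + intercept
slope, intercept = np.polyfit(developer_estimates, actual_effort, deg=1)

def calibrated_estimate(dev_estimate: float) -> float:
    """Adjust an optimistic developer estimate using the historical relationship."""
    return slope * dev_estimate + intercept

print(calibrated_estimate(180))  # the model's suggestion for a new 180-unit developer estimate
```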

We always want to believe in the greatness of the human mind at making decisions, and in a massive number of cases we don’t know a better system. But as Curious Cat points out, sometimes the evidence isn’t what you’d expect it to be at all.

The difference between measure and incentivize

Afraid of taking measurements of your organization because of the behaviors they might create? Don’t be. Measurement alone isn’t harmful to the organization, and understanding how your organization works can be very useful. Oftentimes, the outcomes we want aren’t directly controllable. We want more sales, but you can’t just say to your team “make more sales” and actually expect it to happen. On the other hand, with internal measures of performance, people often can just make the numbers look better.

Tell people to be more productive, and if you measure productivity by counting lines of code, you’ll get undesirable behaviors like excess code and resistance to optimizing existing code. However, if you figure out which behaviors cause more code to be produced naturally, you can encourage or direct different behaviors in your organization.

For example, I frequently measure organizational productivity as function points per hour. That’s measurement. If I simply tell folks to make our productivity measure go up, that’s incentivizing. If instead I identify the behaviors that matter, I can continue to measure and understand what matters without the measure becoming an incentive that breeds bad behavior, like making your coworkers inefficient so you appear more efficient.
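
As a minimal sketch of the measurement half, with hypothetical teams and figures:

```python
# Productivity tracked as function points per hour, per team and quarter.
# The teams and figures are hypothetical; the point is to observe, not to set targets.

deliveries = [
    {"team": "Billing", "quarter": "Q1", "function_points": 120, "hours": 2400},
    {"team": "Billing", "quarter": "Q2", "function_points": 150, "hours": 2600},
    {"team": "Portal", "quarter": "Q1", "function_points": 90, "hours": 2000},
]

for d in deliveries:
    fp_per_hour = d["function_points"] / d["hours"]
    print(f'{d["team"]} {d["quarter"]}: {fp_per_hour:.3f} FP/hour')
```

Nothing in that snippet attaches a target or a reward to the number; that is the line between measuring and incentivizing.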

Measure, but don’t directly incentivize people on the outcome you want. Figure out the behaviors that matter and focus there. The outcomes will follow.

What is our obligation as data scientists in presenting information?

I just spent three great days at a conference. I hadn’t been to one recently, and many times when I have gone I’ve been disappointed by the heavy vendor focus on selling products and giving away trinkets. This particular conference was much more intimate, and well worth my time to pick other people’s brains.

In one session, about fifteen of us discussed designing effective dashboards. Although I love Edward Tufte, the conversation here never once touched on the data-ink ratio or any of his other great ideas. Instead, we spent a large portion of the roundtable debating the obligation of a data scientist to guide the dashboard design process.

For example, not that long ago Stephen Few held a dashboard design competition (here’s his result). The challenge I had with this competition wasn’t in how well the data was presented. Indeed, I learned a great deal by looking at the winning solution and Stephen’s solution. What left me feeling unsatisfied was the missed opportunity to discuss what should be presented versus how it should be presented.

And this, to me, is the central question of information display. There are many valuable rules about presenting information once you decide to present it, but scant advice on how to decide whether to present it at all. Statistical literacy is weak in many organizations, so as the likely stats expert in yours, what do you owe them?

  • Help them to identify the outcomes that matter first. Absent this, no dashboard can be useful. It will just be a lot of beautifully presented garbage.
  • Help them determine potential leading indicators and help them assess whether they matter. It’s not enough to have good ideas about what might matter; we have an obligation to test the relationships (a minimal sketch of such a check follows this list).
  • Help them think about the sometimes subtle and insidious problems of statistics… false causation, mathematical quirks that change apparent relationships (like log transforms on a single axis or sharing a variable between two composite measures), and other things that will mislead.
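
On the second bullet, here’s a minimal sketch of checking whether a candidate leading indicator actually relates to the outcome before it earns a place on the dashboard. The data is invented, and in practice you’d want more than a single correlation:

```python
import numpy as np
from scipy import stats

# Invented history: a candidate leading indicator vs. the outcome we care about.
code_review_coverage = np.array([0.2, 0.4, 0.5, 0.6, 0.8, 0.9])  # candidate indicator
escaped_defects = np.array([30, 26, 22, 18, 12, 10])             # outcome we care about

r, p_value = stats.pearsonr(code_review_coverage, escaped_defects)
print(f"correlation r = {r:.2f}, p = {p_value:.3f}")

# Only indicators that show a real, stable relationship (and survive a check for
# confounding) should earn a place on the dashboard.
```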

If we fail to do these things for our organizations, then we do a disservice to our science and to those we work with. Dashboards should not be created simply to provide confirming evidence for the world view we want to hold, but to help us seek out information that disconfirms our beliefs as well.