Overcomplicated models

How many black belts does it take to decide on a place for dinner?  The answer, at least in my case, turns out to be about 10 to 12.  There’s a group of us who try to get together now and again just to catch up and reminisce about times past.  Everyone in the crowd is a great person, and we’re always sad when someone doesn’t attend, but life gets in the way sometimes – and apparently so do the analytical minds of a dozen people.

It all began innocently enough.  “Where should we meet?” someone asked.  A few potential meeting locations were thrown out.  Then someone (you knew it would happen with a bunch of Six Sigma Black Belts) decided we needed a more scientific approach.  She listed out all our home towns, identified the extreme points (the most northern, most southern, most eastern and most western locations) and drew a cross between them.  That approximated the mid-point, which turned out to be Framingham, MA.
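For the curious, that bounding-box midpoint is only a few lines of Python.  The towns and coordinates below are invented for illustration, not our actual roster:

    # Hypothetical (latitude, longitude) pairs for each home town.
    homes = {
        "Nashua":     (42.77, -71.47),
        "Worcester":  (42.26, -71.80),
        "Providence": (41.82, -71.41),
        "Boston":     (42.36, -71.06),
    }

    lats = [lat for lat, lon in homes.values()]
    lons = [lon for lat, lon in homes.values()]

    # Cross the extreme north/south and east/west points to approximate a midpoint.
    mid_lat = (max(lats) + min(lats)) / 2
    mid_lon = (max(lons) + min(lons)) / 2
    print(f"Bounding-box midpoint: ({mid_lat:.2f}, {mid_lon:.2f})")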

Alas, once that was done, we were off to the races on alternative models.  Another Black Belt chimed in, pointing out that we probably needed some sort of weighted approach: since more of us live in the north than the south, picking a more northern location would minimize total miles travelled (good for the environment, not so good for the one person with the longest drive).
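Minimizing total miles travelled is, geometrically, finding the point that minimizes the sum of distances to everyone (the geometric median), which naturally drifts toward the bigger cluster of people.  Here’s a minimal sketch using Weiszfeld’s algorithm on flat (x, y) coordinates, a fair approximation at this scale, with hypothetical points:

    import math

    def geometric_median(points, iterations=100):
        """Weiszfeld's algorithm: repeatedly average the points,
        weighting each by the inverse of its distance to the
        current estimate."""
        x = sum(p[0] for p in points) / len(points)
        y = sum(p[1] for p in points) / len(points)
        for _ in range(iterations):
            num_x = num_y = denom = 0.0
            for px, py in points:
                d = math.hypot(px - x, py - y)
                if d == 0:
                    continue  # already sitting on this point
                num_x += px / d
                num_y += py / d
                denom += 1 / d
            x, y = num_x / denom, num_y / denom
        return x, y

    # Three people in the north, one in the south: the answer skews north.
    print(geometric_median([(42.8, -71.5), (42.6, -71.3), (42.4, -71.1), (41.8, -71.4)]))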

Fortunately, a bit of quick research turned up www.meetinbetween.us, which could do exactly that, so we entered all our data and it told us we should be picking somewhere around the Concord/Acton region of Massachusetts.

But then someone pointed out that we really couldn’t decide on a place until we figured out who was attending.  True enough; even though we’re all invited, we can’t always come.  A couple of people had already expressed their regrets due to existing conflicts.  The argument was that we needed to repeat the experiment with only the subset of people actually attending.

Another person pointed out that attendance varies: based on prior experience, people who accept the invitation can decline at the last minute, so even if you are free now, you might not actually be able to attend.  However, we could use past attendance data to model who is likely to really attend, and perhaps arrive at a most likely scenario via Monte Carlo simulation.

So I entered everyone’s location into a table, built a rough set of probabilities of each person attending, and ran the simulation 100,000 times (which was probably overkill) to find the location that minimized the total driving distance for the likely attendees.  The answer from that simulation was, ironically, Framingham, MA – the same city we started with in the first place!
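For what it’s worth, the simulation itself was nothing fancy.  A stripped-down sketch of the idea, with invented names, probabilities and mileages standing in for the real data:

    import random

    # Hypothetical attendance probabilities, estimated from past turnout.
    prob = {"Alice": 0.9, "Bob": 0.6, "Carol": 0.8, "Dave": 0.5}

    # Hypothetical one-way driving miles from each person to each candidate city.
    miles = {
        "Framingham": {"Alice": 20, "Bob": 35, "Carol": 15, "Dave": 50},
        "Concord":    {"Alice": 30, "Bob": 20, "Carol": 25, "Dave": 65},
    }

    def expected_total_miles(city, trials=100_000):
        total = 0.0
        for _ in range(trials):
            # Sample one plausible set of attendees...
            attendees = [p for p, pr in prob.items() if random.random() < pr]
            # ...and tally their driving distance to this candidate city.
            total += sum(miles[city][p] for p in attendees)
        return total / trials

    best = min(miles, key=expected_total_miles)
    print("Best location:", best)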

I replied to the long email thread: “OK, I’ve completed the model, and based on it I agree that Framingham is the best location.  I’ll spare you the boring details.”

And to that came the reply from one of the group: “But the details are so cool!  He even had charts!”

What does all of this tell us, besides the fact that we all have too much time on our hands some days?

Sometimes you don’t need a complicated model; a simple model will often generate an answer that is more than adequate.  Why?  Because for every nuance you add to a complicated model, there’s often another nuance that counterbalances it.  Each little addition gets wiped out by a subsequent one. 

The reality is, the critical few often elude us when we start building complex models.  When we lose sight of which variables really matter, we keep adding variables to account for smaller and smaller effects.  In the end, those tiny effects don’t matter much.  Don’t get me wrong, there’s a place for complex models when we need to separate signal from noise and we don’t yet know what matters, but you can overdo it.

Start simple.  The goal is to make more good decisions in business than bad ones; you’re never going to make the correct decision 100% of the time.

What does “highly correlated” mean?

I ran into a bit of interesting confusion about what being “highly correlated” means.  It’s a bit of a nit, but an important myth to dispel nonetheless.  If two variables are highly correlated, all it means is that as the independent variable changes, the dependent variable changes with it, in some consistent positive or negative relationship.

For example, if you made a random data set of independent variables (X) and then assembled a dependent variable (Y) where Y = X, a Pearson correlation test would show these two variables as perfectly correlated.  And if you made Y = 2X, they’d still be perfectly correlated.  Changing the slope of the line does NOT increase or decrease the correlation.  Correlation coefficients are reduced only when changes in X don’t result in corresponding changes in Y.  That’s it.
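You can convince yourself of this in a few lines of Python (assuming NumPy and SciPy are handy):

    import numpy as np
    from scipy.stats import pearsonr

    x = np.random.rand(100)            # random independent variable
    for slope in (1.0, 2.0, 0.5):
        r, _ = pearsonr(x, slope * x)
        print(slope, round(r, 4))      # r comes out 1.0 every time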

Why’s this important?  Well, identifying something as highly correlated does not mean there’s a one-to-one relationship between the two variables.  It also doesn’t mean the Pearson correlation measures the slope of the line.  If you assembled a data set where Y = 0.5X, the Pearson correlation would still be 1.0 (perfectly correlated), not 0.5 (the coefficient of X).  If you want the slope of the line, use linear regression (assuming the relationship is linear).
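Continuing the sketch above, linear regression recovers the slope while the correlation stays perfect:

    import numpy as np
    from scipy.stats import linregress

    x = np.random.rand(100)
    fit = linregress(x, 0.5 * x)
    print(fit.slope)     # ~0.5, the coefficient of X
    print(fit.rvalue)    # 1.0, still perfectly correlated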

This became a blog entry because we were talking about whether we should measure test validation points (of which there can be more than one per test case) or whether we could just measure the number of test cases.  I asked, “Are test cases and validation points highly correlated?” and the response I got was, “No, there are about 3 validation points per test case.”  Of course, knowing the above, that answer makes no sense.  If the ratio is really that consistent, the two are highly correlated.  Thus, if counting the number of test cases is easy but counting the number of validation points is hard, we can reasonably use the number of test cases as a proxy for validation points.  They’re highly correlated.