The other day I overheard “let’s get rid of the data that isn’t valuable.” There’s certainly some “data” that isn’t valuable in that it is known to be wrong, but that wasn’t the gist of this conversation. Instead, they were talking about data for which they could find no current use. For example, imagine you were collecting data about people and couldn’t find a relationship between, say, shoe size and heart rate. One might argue that if you were looking for predictors of heart rate, shoe size is no longer valuable data and you should get rid of it.
In a given piece of research that might be true. What if, however, you were collecting data about people (like marketing folks do) to help understand buying habits? What if, right now, you could find no use for their shoe size? It’s taking up space in your database, albeit probably very little. You can’t use it in any of your current models. Should you throw it away?
Not so fast. The frustrating thing about statistics is that just because you don’t see a relationship doesn’t mean there isn’t one. We may not yet understand how to use shoe size in our model… maybe it has a fascinating interaction effect with hand size to predict buying habits? Who knows.
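To make that concrete, here is a toy sketch in Python. Everything in it is made up for illustration: the data is synthetic, the -1/+1 coding for "small" and "large" is an assumption, and the outcome is constructed so that it depends only on the product of shoe size and hand size. It is not evidence that shoe size predicts anything in real life. It only shows that a variable can look useless on its own and still matter in combination with another.

```python
from itertools import product
from statistics import mean


def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Synthetic, perfectly balanced data: every combination of
# small/large shoe (-1/+1) and small/large hand (-1/+1), repeated 25 times.
rows = list(product([-1, 1], repeat=2)) * 25
shoe = [r[0] for r in rows]
hand = [r[1] for r in rows]

# Hypothetical outcome driven ONLY by the interaction of the two features.
buys = [s * h for s, h in zip(shoe, hand)]

# The interaction term we would only think to build if we still had both columns.
interaction = [s * h for s, h in zip(shoe, hand)]

corr_shoe = pearson(shoe, buys)                # shoe size alone
corr_hand = pearson(hand, buys)                # hand size alone
corr_interaction = pearson(interaction, buys)  # the two combined

print(corr_shoe, corr_hand, corr_interaction)
```

In this contrived setup, shoe size and hand size each show no linear relationship with the outcome, while their product tracks it exactly. Anyone who had deleted the shoe size column after checking it in isolation would have no way to find that.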
The point isn’t really about shoe size and whether it is useful, but more generally, if you can get good (by which I mean correct) data on something that you at least guessed might be useful, I’m not so sure you should throw it away because you haven’t found a use for it yet. Some day you may have a hypothesis about shoe size, and where will you be if you discarded all that data?
Now, if the costs of storing or collecting that data are so onerous that you have to make a choice, by all means, discard away. But just getting rid of information because you don’t know how to use it yet… not so much.