Data is in the eye of the beholder

26 Nov 2012

I’ve recently been thinking, and reading, about displaying data. Specifically, about getting others to understand data. As usually happens when one delves into a topic, I’m seeing connections in disparate sources.

To expand: I have a large, but not necessarily complex, set of data I wish to show. A bar chart doesn’t suffice as the data is broken down along two orthogonal axes. There are many ways of displaying data of this form, but narrowing down to one is proving harder than expected.

The issue is the data is, essentially, random. Rather, there are many data sets but I’m trying to pick one visualization. Programmatically. The problem with this is, of course, that the presentation of data greatly colors the perception of it. Given a map of the US with states colored as they voted for the president, trends immediately emerge. A list of state names with a bar showing what percent voted Democrat merely befuddles. However, what if our data was “percent of the population with a last name beginning with ‘C’”? We would expect little to no geographic correlation for this. This isn’t strictly true, of course. There could be a large immigrant population in some area that tends to have ‘C’ last names. But, hey, we can pretend. For this data, the only thing of potential interest might be the outliers from the national average. Or, even more simply, nothing.

But we had to have an intuition about the data before we could decide on the best graph. We could have a library of chart types and a ‘fitness’ function for each that scores the input data to see if it is a fit. Or we could just throw all the charts at the user and let them figure it out (a tempting idea, but “let the user figure it out” is not the most maintainable solution). I think I’ve stumbled into what some call A Hard Problem. Note the capitalization; this is different from a hard problem, which is merely something one is too lazy to solve at the moment. A Hard Problem requires coffee in addition to time.
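To make the ‘fitness’ function idea a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption made up for illustration: the data shape (one number per category), the three chart names, the scoring rules, and the thresholds (12 categories, 2 standard deviations). It also skips the geographic case entirely, since that would need location data. It is not a real solution, just the shape one might take.

```python
from statistics import mean, pstdev
from typing import Callable, Dict

# Hypothetical data shape: one value per category, e.g. {"Ohio": 0.07, ...}
Data = Dict[str, float]


def spread(data: Data) -> float:
    """Standard deviation relative to the mean (0.0 when all values match)."""
    values = list(data.values())
    mu = mean(values)
    return pstdev(values) / abs(mu) if mu else 0.0


def fit_bar_chart(data: Data) -> float:
    # Assumption: bars read well for a handful of categories that differ visibly.
    few = 1.0 if len(data) <= 12 else 12 / len(data)
    return few * min(spread(data), 1.0)


def fit_outlier_list(data: Data) -> float:
    # Assumption: an outlier list suits data where a small share of the
    # values sit more than 2 standard deviations from the mean.
    values = list(data.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return 0.0
    far = sum(1 for v in values if abs(v - mu) > 2 * sigma)
    return 1.0 - far / len(values) if far else 0.0


def fit_nothing(data: Data) -> float:
    # Assumption: nearly uniform data is best served by showing nothing.
    return 1.0 - min(spread(data) * 10, 1.0)


# The "library of chart types": a name mapped to its fitness function.
CHARTS: Dict[str, Callable[[Data], float]] = {
    "bar chart": fit_bar_chart,
    "outlier list": fit_outlier_list,
    "nothing": fit_nothing,
}


def pick_chart(data: Data) -> str:
    """Score the data against every chart type and return the best-scoring name."""
    scores = {name: fitness(data) for name, fitness in CHARTS.items()}
    return max(scores, key=scores.get)
```

Calling `pick_chart` on twenty categories where one value is three times the rest selects the outlier list, while perfectly flat data selects nothing. The catch is visible right in the code: each fitness function is just my intuition about the data written down as a number, so the Hard Problem has moved rather than gone away.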

And the threads I saw in my reading then came together: an article about hindsight bias followed shortly by a bit in Tufte’s Visual Explanations. The relevant portion of both the article and the Tufte was the Challenger explosion. Yudkowsky argues the O-ring problem was obvious in hindsight, but not at the time of the incident. That is, once we know the problem, it becomes obvious to our minds how we could have fixed it.

This seemed a valid argument at the time I read it, knowing little about the Challenger explosion. But Tufte presented a completely different view: the data gathered on damage to past O-rings compared to temperature, and the expected launch conditions, could lead to only one conclusion: delaying the launch until warmer weather. And the engineers at the time had come to the same conclusion but botched the graphs. They didn’t present enough data along some axes (previous launches), but presented too much data along others (issues not correlated with temperature).

hazzens scribe
