Exploratory Data Analysis
From CS294-10 Visualization Sp11
Class on Feb 2, 2011
- Chapter 8: Data Density and Small Multiples, In The Visual Display of Quantitative Information. Tufte.
- Chapter 2: Macro/Micro Readings, In Envisioning Information. Tufte.
- Chapter 4: Small Multiples, In Envisioning Information. Tufte.
- Low-Level Components of Analytic Activity in Information Visualization. Robert Amar, James Eagan, and John Stasko. IEEE InfoVis 2005 (pdf)
- Exploratory Data Analysis, NIST Engineering Statistics Handbook
- Exploratory Data Analysis, Wikipedia
Julian Limon - Feb 01, 2011 10:39:35 pm
In general, I tend to avoid generalizations and "must-do" rules, and I guess that's why I take issue with some of Tufte's arguments in these chapters. He argues that "more information is better than less information" and that we should strive to use large datasets -- this intuitively makes sense. It's better to have a rich set of data that you can "dumb down" if needed than a scarce set of data that you can't improve. It also seems plausible that, when too little information is presented, the reader may become suspicious and subject the visualization to a second reading. However, this seems to contradict the need to "tell a story" in a visualization. When you cluster thousands or millions of data points in a visualization, multiple stories can be told and interpreted. Attention can be shifted to insignificant details, irrelevant spikes, or casual curiosities. I believe that when simple stories need to be told, simple visualizations can be used. I don't see the point of shrinking a graph to increase its density if it looks like a waste of space. Granted, the use of space may not be optimal, but the interpretation of the graph can be eased. I guess it depends on the task at hand: complex and elaborate stories may require micro/macro readings and data density. Simple and quick stories, on the other hand, may benefit from focus and aesthetics.
Michael Hsueh - Feb 02, 2011 07:55:07 pm
Many of the low-level components discussed by Amar et al. are similar to those of database operations, as acknowledged by the authors. One can't help but wonder, perhaps as a pipe dream, whether any of the research done in databases can be usefully transferred to designing visualizations that facilitate analytic activities. For example, is it possible that some useful wisdom can be gleaned from the body of relational algebra parsing or query optimization research? Such information might be useful in the design of visualization-generating systems: the paper actually seems to loosely map some analytic activities to system-level implementation and algorithmic details.
Of course, this is all hypothetical... But imagine that a random user has a data set and wishes to extract some analytic insights from it. He might know how to use relational-algebra-like syntax to accurately and unambiguously specify the information he wants. A visualization system might then magically map the query to an interpretation of which analytic activities the user wants to carry out, and optimize the parameters used for its visualization generation accordingly.
A stretch, probably, but the parallels between the models for manipulating data and analytic activities (as presented by the paper) are interesting at the very least. One thing to be careful about is the tendency to mix up the two models. In this case, I think the paper's choice of a distinct set of vocabulary is advantageous, as it avoids confusion about the subtle differences between certain analytic and relational tasks. For example, the "Retrieve Value" analytic task is really quite similar to a SELECT statement in SQL. "Cluster" is approximately like GROUP-BY, and so on.
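Michael's analogy can be made concrete with a small sketch. This is purely illustrative: the table, column names, and mutual-fund-style data below are hypothetical, and the mapping of Amar et al.'s task names to SQL clauses follows the loose correspondences above (Retrieve Value ~ SELECT, Filter ~ WHERE, Cluster / Compute Derived Value ~ GROUP BY with an aggregate).

```python
import sqlite3

# Hypothetical toy table, used only to illustrate the task/SQL parallels.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE funds (name TEXT, category TEXT, ret REAL)")
conn.executemany("INSERT INTO funds VALUES (?, ?, ?)",
                 [("Alpha", "growth", 0.12),
                  ("Beta", "growth", 0.08),
                  ("Gamma", "value", 0.05)])

# "Retrieve Value": look up an attribute of a named case.
ret = conn.execute(
    "SELECT ret FROM funds WHERE name = 'Alpha'").fetchone()[0]

# "Filter": keep only cases satisfying a condition.
growth = conn.execute(
    "SELECT name FROM funds WHERE category = 'growth'").fetchall()

# "Cluster" + "Compute Derived Value": group cases, aggregate per group.
means = conn.execute(
    "SELECT category, AVG(ret) FROM funds GROUP BY category").fetchall()
```

Of course, the interesting part of the analogy is exactly what the SQL does not capture: the visual, perceptual side of carrying out these tasks on a graphic rather than a table.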
Ultimately I think the paper takes an important step in formalizing a structure for discussing visualization in a cognitive context. Up until this point we've mostly been reading about "rules of thumb" that seem convincing from a mostly high level perspective but don't necessarily speak to the underlying, lower-level analytic activities enabled by those visualizations.
Siamak Faridani - Feb 03, 2011 05:06:16 am
One thing that I wanted to see in Tufte and the other readings was some discussion of choropleth maps and heat maps, and in general of how to encode data for geolocated values. It seems to me that a heat map can hardly be explained in terms of the 7 visualization elements (shape, size, rotation, ...). Perhaps color is part of that list, but a heat map has no specific shape nor any orientation.
I also looked at Tableau, and it seems it is incapable of generating heat maps. I wonder whether a heat map is considered a good visualization at all, and why we don't see more about it in the literature and in software tools (MATLAB has a built-in heatmap function, but R does not have a native one).
Michael Cohen - Feb 03, 2011 02:09:57 pm
Siamak, I was having similar thoughts about network graphs. They are made up of the visual primitives of shapes and lines, and they use position to encode information, but the position of any given element doesn't correspond to any particular quantitative piece of information (i.e., there's no data attribute that you can map deterministically to the x or y position). It's cases like these that make me a bit wary of attempts at automatic visualization; it seems to me that there are a lot of data sets where knowledge of the underlying semantics is crucial to coming up with a useful visualization, and in many cases that useful visualization won't be a straightforward mapping of visual variables to data dimensions (or even aggregates of data dimensions).
However, that concern applies more to coming up with a final product that tells the story in the "best" way. For early steps in exploratory data analysis, I can see how having auto-generated "pretty good" approaches could save a lot of time and potentially lead to insights that would be missed if a human imposed a specific visualization strategy from the outset. The human's first choice of strategy might be appropriate, but it might also reflect a preconception about what the data will reveal that could mask other trends.
Brandon Liu - Feb 03, 2011 02:39:31 pm
In reply to Siamak's mention of heat maps, I think they are simply a 2D generalization of a histogram. The major choices that must be made in creating one are (1) the size and range of the bins, and (2) the choice of colors. For example, in (1), if the 'red' range had been extended into orange, interpretations of the heat map would change. The range of colors would also affect how salient the patterns are.
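A rough sketch of Brandon's point about bin choice (pure Python with synthetic data; `hist2d` is an illustrative helper I made up, not a library function): the same points yield very different pictures depending on the grid, before any color mapping is even applied.

```python
import random

random.seed(0)
pts = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(1000)]

def hist2d(points, bins, lo=-3.0, hi=3.0):
    """Bin (x, y) points into a bins x bins grid over [lo, hi)^2."""
    width = (hi - lo) / bins
    grid = [[0] * bins for _ in range(bins)]
    for x, y in points:
        i = int((x - lo) // width)
        j = int((y - lo) // width)
        if 0 <= i < bins and 0 <= j < bins:
            grid[i][j] += 1  # points outside the range are dropped
    return grid

# Same data, two bin choices: coarse bins smooth structure away,
# fine bins expose it (and also expose noise).
coarse = hist2d(pts, 5)
fine = hist2d(pts, 50)
```

The second choice Brandon names, the color scale, is the function that later maps each cell count to a color; moving a color boundary (red into orange) re-segments the counts just as moving a bin edge re-segments the points.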
Sally Ahn - Feb 03, 2011 03:50:19 pm
One thing I wondered about while reading Tufte's chapter on data density was the process of transforming a data matrix into a different form of visualization (or transforming one visualization into another), and whether it is fair to compare the data density of these different forms in terms of the "number of entries in the data matrix." For example, in the map of the 30,000 communes of France he presents, he argues that its data density is about 9,000 numbers per square inch, where these numbers represent latitudes, longitudes, and six numbers describing the shape of each commune. In translating these numbers to the map, we gain a much clearer sense of their meaning but lose precision; the locations and shapes of the communes become much clearer to the human visual system in the context of a complete map, but we would not be able to tell the latitude/longitude differences among the communes from the map alone. Thus, there is a gain and a loss in perception and precision in the transformation from numerical data to its graphic form that Tufte doesn't seem to address in his discussion of data density.
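Sally's figures can be checked back-of-the-envelope. This is a sketch assuming, per her reading of Tufte, eight numbers per commune (latitude, longitude, six shape numbers); the implied map area is my own inference, not a figure stated in the chapter.

```python
# Back-of-the-envelope check of the density figure for the communes map.
communes = 30_000
numbers_per_commune = 2 + 6           # lat, lon, plus six shape numbers
total_numbers = communes * numbers_per_commune   # 240,000 numbers

density = 9_000                        # numbers per square inch, per Tufte
implied_area = total_numbers / density # ~ 26.7 square inches of map

print(total_numbers, round(implied_area, 1))
```

Which underlines her point: the "density" counts the numbers that went into drawing the map, not the numbers a viewer can actually read back out of it.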
Saung Li - Feb 03, 2011 06:36:53 pm
Tufte's discussion of data density is quite interesting. He relies on the fact that we are capable of understanding dense visualizations, so we can pack a lot of information into a display. The map of all the commune boundaries in France is a great example: there is a huge number of them, yet we can clearly visualize their locations in this one graphic. One should be careful of putting in too much information, though, as that could cause confusion. I find the Shrink Principle to be very useful, as one can shrink graphics "way down" and still be able to analyze them effortlessly. This allows one to put many graphics together in a small area. However, I still find larger graphics more comfortable to the eye, and think they should be enlarged when possible. This discussion of data density is similar to that of data-ink maximization, and applying the two principles together can help one convey as much information as possible, clearly, in a minimal graphic.
Sergeyk - Feb 03, 2011 07:29:07 pm
I found it a little funny that, right after reading Tufte rail against unnecessary decoration and present several rules all stating that the presentation of data should be as simple and clean as possible, we read Tufte rail against unnecessary simplification of data (Micro/Macro Readings) while showing us graphics that to me clearly seem needlessly decorated (like the Paris map).
There were some excellent quotes in the reading though. "There is nothing as mysterious as a fact clearly described." I was inspired by Tufte's suggestion that it may be better to have one really dense and informative figure than to spread the same information over several figures (or slides, in his words). I definitely think this is true for scientific publications--the most informative figure in the paper is basically how the ideas are communicated in person. "Hey, I read a cool paper on [something]. They do this cool thing. Here, let me just show you." I will try to unite the main ideas and results into one full-page figure in my future papers.
Michael Porath - Feb 03, 2011 07:36:42 pm
I found Michael's comment about the parallels between RDBMSs and the basic tasks of exploratory data analysis to be an interesting idea.
My take on the topic is that the relation between the two comes naturally because both databases and data visualizations maximize their usefulness when operating on large data sets. Indeed, what made RDBMSs useful and successful was the capability of clustering, filtering, ordering, etc. What sets the two disciplines apart is that formulating a database query is an explicit process. The user states clearly what she is looking for and how the data should be arranged. Exploratory data analysis, on the other hand, allows for serendipitous discovery of facts and relations. What always strikes me when I look at data visualizations is how well our perceptual system is capable of implicitly discovering what otherwise has to be explicitly formulated as a database query.
The authors used students' hypotheses to categorize the analytical approaches. If it were true that a user approaches an analytical problem with a clear hypothesis, databases might indeed be better suited to accept or reject the hypothesis.
Dan - Feb 03, 2011 07:55:53 pm
I liked the reading on high-resolution data graphics. It's interesting that such dense graphics can be scaled down and still retain their legibility. I wonder: if you scale down many unrelated high-density graphics and compose them in patterns, do they lose some meaning? Could that cause them to lose some legibility? Just a thought.
The Chromosomal Pictorial Legacy image was interesting: lots of scientific data packed into a single image. However, I feel that this type of image isn't received the same way by all audiences. Some esoteric knowledge seems to be required, whereas visualizations should ideally give any viewer or layman a sense of what is happening.
I also agree with Julian about Tufte's writings. Even though large data sets are good, sometimes the story or question behind a visualization can have a simple concept or answer.
I also thought Michael's comment about SQL statements was interesting -- good analogy!
Krishna - Feb 03, 2011 09:48:08 pm
I found the paper by Amar et al. to be quite insightful. However, I am not sure their taxonomy is complete, even given their stated goal of abstracting the lowest-level questions that could be asked of a visualization. For example, it was surprising to see no questions about the history of an information/data facet; I am surprised that no student queried the history of a mutual fund over a time period. I believe asking about the evolution of a data dimension is low-level enough to be included in the taxonomy -- or am I missing something?
Interesting discussion on RDBMSs and visualization. Adding on to Michael Porath's comments, good visualizations have this amazing ability to show history, trends, and local and global phenomena all at the same time, in a space-efficient way -- something that RDBMSs lack.
David Wong - Feb 04, 2011 12:04:48 am
In response to Siamak's post on heatmaps and encoding using the 7 visualization elements, I think we could treat each pixel as an independent visual component, which has a shape, orientation and color. The position indicates where that level of intensity, noted by the color, occurs, and the aggregate of these is the visualization. Given this definition, the heatmap could be explained in terms of the 7 visualization elements.
I like Michael Cohen's point on the arbitrariness of node placement in network graphs, and even of rectangle placement within treemaps for that matter. There are heuristics applied in generating these visualizations that give position meaning, but only in relation to the positions of other data (best use of space in treemaps, best illustration of clusters in network graphs). A human's understanding of the underlying semantics is another heuristic that can be applied in generating a visualization. In the end, they're both really sort of the same thing if we view visualizations as separate cuts of highly dimensional data where the cuts are our heuristics. An automatic approach can illustrate the data in a way that humans will find informative and surprising if they've come in with some bias on how to visualize the data.
Karl He - Feb 04, 2011 08:56:35 pm
I don't entirely agree with Tufte's advocacy of dense data. A scatter plot with more points may be better for seeing trends, for example. But in that case it isn't the individual points we really care about; it's the trend itself, in which case a trendline would convey the information better. I feel more or less the same way about each of his examples of small multiples: what matters is the trend, not the individual pieces of data. While density may be a good idea in some cases, the density itself is not what is important.
In terms of, say, the Vietnam Veterans Memorial or the map of Tokyo, any particular person would likely only care about a tiny fragment of the data, such as one name or one square. It may be useful to have access to the rest of the data, but no one would specifically want to know about most of the visualization.
I guess the takeaway is that visualizations which present dense data allow the user to pick out for themselves what is useful about the visualization, rather than the visualization itself telling the user what is interesting. It could have utility in some cases, but in general I would shy away from this technique unless that specific effect is intended.
Matthew Can - Feb 06, 2011 02:44:41 am
Like others who have commented, I took issue with Tufte's view on maximizing data density. I think the principle of maximizing data density is generally something for which we should strive. What I disagree with is Tufte's extreme position on the principle and the way he puts it into effect, particularly his quantitative definition of data density. I think it's misleading when he praises visualizations that display tens or hundreds of thousands of data numbers per square inch (for example, the map of the galaxies). Just because that much data went into the creation of the visualization does not mean the visualization makes all of it readily available to the viewer. And in fact it should not. The real virtue of Tufte's examples is not how many numbers they pack into each square inch but rather how deftly they show the gestalt of all those numbers. His chapter would have been much more informative had he focused less on data density measures and more on techniques for qualitatively increasing data density.
Furthermore, Tufte argued for maximizing data density in the visual display of quantitative information. Does this principle hold for non-quantitative information?