From CS294-10 Visualization Fa07
 Good Visualization
I selected my good visualization from an article which appeared in the New England Journal of Medicine on July 26th 2007:
Upon its publication, this article received a fair amount of coverage in mainstream media, including this article in the New York Times on July 25th, 2007.
Although there are several interesting visualizations from the article (including a good animation), I chose the following graphic:
The above graphic depicts a social network of approximately 2200 people from the Framingham Heart Study. The purpose of this study was to determine whether obesity is linked to person-to-person social ties. In this graph, each node represents a person from the study. The color of each node's border represents gender (red: female; blue: male), and the size of the node is proportional to the person's body mass index (BMI). Besides this representation of BMI, the authors further categorize each person as either 1) obese, or 2) not obese, depending on whether their BMI is above or below 30. This designation is encoded in the graph by the fill color of each node: yellow represents an obese person, green represents a person who is not obese. The color of the edges signifies the type of connection between nodes: orange represents a familial tie and purple represents friendship or marriage.
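The encoding described above can be written down as a small mapping from data attributes to visual attributes. The following is a minimal sketch in Python, not the authors' code: the names `Person`, `node_style`, `edge_color`, and `size_per_bmi` are hypothetical, and treating a BMI of exactly 30 as obese is an assumption, since the text only says "above or below 30".

```python
from dataclasses import dataclass

# Cutoff stated in the text; counting exactly 30 as obese is an assumption.
OBESITY_BMI_THRESHOLD = 30


@dataclass
class Person:
    gender: str  # "female" or "male" (nominal)
    bmi: float   # body mass index (quantitative)


def node_style(person, size_per_bmi=1.0):
    """Map one participant to the visual attributes described in the text."""
    return {
        # Fill color encodes the obese / not obese category.
        "fill": "yellow" if person.bmi >= OBESITY_BMI_THRESHOLD else "green",
        # Border color encodes gender.
        "border": "red" if person.gender == "female" else "blue",
        # Node size is proportional to BMI; the scale factor is hypothetical.
        "size": size_per_bmi * person.bmi,
    }


def edge_color(tie_type):
    """Orange for familial ties, purple for friendship or marriage."""
    return "orange" if tie_type == "familial" else "purple"
```

For example, `node_style(Person("female", 32.0))` gives a yellow fill with a red border, while `edge_color("friendship")` gives purple. Note that BMI is encoded twice, once by size and once (after thresholding) by fill color.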
The position of each node was generated by an algorithm which attempts to fix the distance between any pair of nodes. It is designed to place connected nodes close to each other such that edges do not overlap more than necessary. This means that nodes with many connections are more likely to be in the center of the network. Although not directly encoding any of the data, this placement of the nodes helps make the network more compact and easier to understand.
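The text does not name the layout algorithm, so the following is only a stand-in: a minimal stress-minimization sketch in Python, in the spirit of spring-based layouts, that nudges nodes until the drawn distance between each pair approaches their distance in the graph (number of hops). The function names, the circular starting positions, and the `steps` and `rate` parameters are all assumptions made for illustration. It does not try to reduce edge crossings.

```python
import math
from collections import deque


def graph_distances(nodes, edges):
    """All-pairs shortest path lengths (in hops) via breadth-first search."""
    neighbors = {n: set() for n in nodes}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    dist = {}
    for source in nodes:
        seen = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in neighbors[u]:
                if v not in seen:
                    seen[v] = seen[u] + 1
                    queue.append(v)
        dist[source] = seen
    return dist


def stress_layout(nodes, edges, steps=500, rate=0.05):
    """Move nodes so that drawn distance approaches graph distance."""
    target = graph_distances(nodes, edges)
    # Deterministic start: spread the nodes around a unit circle.
    pos = {
        n: [math.cos(2 * math.pi * i / len(nodes)),
            math.sin(2 * math.pi * i / len(nodes))]
        for i, n in enumerate(nodes)
    }
    for _ in range(steps):
        for u in nodes:
            for v in nodes:
                if u == v or v not in target[u]:
                    continue  # skip self and pairs with no connecting path
                dx = pos[u][0] - pos[v][0]
                dy = pos[u][1] - pos[v][1]
                d = math.hypot(dx, dy) or 1e-9
                # Positive error: drawn too far apart, so pull u toward v.
                # Negative error: drawn too close, so push u away from v.
                error = d - target[u][v]
                pos[u][0] -= rate * error * dx / d
                pos[u][1] -= rate * error * dy / d
    return pos
```

On a three-node chain such as `stress_layout(["a", "b", "c"], [("a", "b"), ("b", "c")])`, the two end nodes should end up roughly twice as far apart as either is from the middle node. Because a well-connected node has short graph distances to many others, this kind of layout tends to pull it toward the middle of the drawing, which matches the by-product described above.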
The data set from which this visualization was generated is large and complex. Although not represented in this visualization, the progression of time is a key factor in this study and the data include longitudinal results spanning decades. Most likely, this graphic is based on a relational data model made up of records consisting of textual strings, floats, and integers describing various attributes of each participant. For this visualization, the database need only contain gender labels (nominal: male/female), BMI (quantitative: integers or floats), and a list of connected nodes (probably integer record IDs, which act as nominal labels even though they are stored as numbers) as well as the type of connection (nominal labels: familial/friendship).
The image model is a visualization of the network in a 2D plot of the nodes (representing people) and the edges between nodes (representing connections between people). The size of each node encodes the person's BMI, the fill color labels the person as either obese or non-obese, the color of the node's outline encodes gender, the edges between nodes correspond to relationships between people, and the color of each edge corresponds to either a familial or a friendship connection. Node position (x-axis and y-axis) does not directly encode any of the data, but it indirectly encodes how many connections a node has (nodes with many connections tend to be near the center). Basically, the placement of nodes is automated using an algorithm to reduce the complexity of the network visualization, and the positions are simply a by-product of this process.
I chose this to be my good visualization because I think it demonstrates the utility of effective graphical representations for large and complex data. In this study, it would be nearly impossible to get an overall and immediate sense of the data, particularly the connections between participants as well as the pattern related to obesity, without exploiting the perceptual bandwidth of the human visual system. The sheer size of the data set would make it much more difficult and time-consuming to examine the data textually, or study the relationships between different variables in a piecewise manner.
This type of visualization is most useful as a tool for exploring the data and looking for patterns that may emerge as different attributes are encoded visually. Without this graphic, finding a pattern in the data could conceivably take many hours and countless analyses. Given this visualization, however, the link between obesity and social connections is relatively easy to see as one examines the various clusters of similar-looking nodes. This type of investigation of the data can help one to pinpoint which subsequent quantitative analyses are most likely to lead to significant results.
Although I think this visualization is critical for assessing this type of large data set, there are still a few problems with it. In this 2D graphic, the compactness of the node positions makes it easier to see the entire subcomponent, but this can lead to a problem with occlusion. It isn't always clear how different nodes are connected in the center of the network where many nodes and edges overlap with each other.
 Bad Visualization
I chose my bad visualization from a psychology textbook that was used in an undergraduate Intro to Psych course:
The above graphic was presented in a section of the book devoted to sleep research. The bar graph summarizes data describing the number of traffic accidents in Canada before and after daylight-savings time adjustments for the years 1991 and 1992 (combined). The purpose of this graph is to suggest a correlation between lost sleep and traffic accidents. More broadly, I think the intention is to simply show that a lack of sleep may impair the physical and mental abilities associated with safe driving.
The graph depicts Accident Frequency according to different time period categories. This visualization is based on a statistical data model in which accident frequency is most likely represented by integers which encode the count of accidents for a given day of the year. For this graph, the authors only examined particular time period categories (nominal labels). The data consist of two main categories: the spring time change (losing an hour) and the fall time change (gaining an hour). Each of these categories is further subdivided into pre and post time-change categories. On some level, these categories act as simple nominal labels, but there is an important ordering to them (before and after) as well as a direction (gaining vs losing an hour), and the data can only be properly interpreted with these considerations in mind.
The image model is based on univariate data, and the authors encode the accident frequency rates as a simple bar graph. In the graph, the category labels lie along the X-axis and the accident frequency is plotted along the Y-axis. The graph only examines rates for the Mondays before and after the time change, most likely to ensure that the day of the week is not a factor in any measured effect. Since this is a bar graph, the height of each bar corresponds to the accident rate. Additionally, the bars are color-coded according to the pre and post time-change categories, presumably to make it easier to compare the different conditions between the spring and fall.
There are several reasons I believe this graph fails to give an accurate impression of the data. First, the two main categories (spring, fall) are depicted on separate charts, each of which has a different range on the Y-axis. The max for the spring time change is 2800, while the fall time change maximum value is 4200 (although this top tick-mark is actually positioned below the top tick-mark for the spring). There are more accidents during the fall time change overall, yet both the pre and post data bars from this category are visually smaller than the respective bars in the spring category. Similarly, the y-axes cross the x-axes at different values other than zero: the spring chart starts at 2400 accidents, while the fall chart starts at 3600. By only visualizing a small portion of the data, differences may appear striking (and more significant) despite the fact that they only account for a small percent change in the total number of accidents. This also makes it more difficult to compare the two charts directly.
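The effect of a truncated y-axis can be made concrete with a little arithmetic. The sketch below, in Python, compares the ratio of the drawn bar heights with the actual relative change in the data. The function names are hypothetical, and the bar values 2550 and 2700 are made-up numbers chosen to fall inside the spring chart's 2400 to 2800 range; they are not read from the original figure.

```python
def visual_ratio(before, after, axis_start):
    """Ratio of the drawn bar heights when the y-axis starts at axis_start."""
    return (after - axis_start) / (before - axis_start)


def actual_change(before, after):
    """Relative change in the underlying counts."""
    return (after - before) / before


# Hypothetical spring values, NOT taken from the textbook figure.
before, after = 2550, 2700

print(visual_ratio(before, after, axis_start=2400))  # 2.0: bar looks twice as tall
print(visual_ratio(before, after, axis_start=0))     # about 1.06 with a zero baseline
print(actual_change(before, after))                  # about 0.059, i.e. roughly 6%
```

With these assumed values, the post-change bar is drawn twice as tall as the pre-change bar even though the count rose by only about 6%. Starting the axis at zero makes the drawn ratio match the ratio in the data.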
More importantly, the tick-marks for the two different charts represent different ranges. Although the two charts are shown side-by-side to facilitate comparisons, each tick-mark on the left chart represents 100 accidents while a similar tick-mark on the right represents 200 accidents. The tick marks are spaced identical distances apart on paper, yet that spacing corresponds to different numerical quantities in the data.
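A short calculation shows what the mismatched tick scales do to a side-by-side comparison. In this Python sketch, the tick values (100 and 200 accidents) come from the description above, while the function name and the 10 mm tick spacing are assumptions made purely for illustration.

```python
def drawn_height_mm(delta_accidents, accidents_per_tick, mm_per_tick):
    """Height on paper of a given difference in accident counts."""
    return delta_accidents / accidents_per_tick * mm_per_tick


MM_PER_TICK = 10  # hypothetical physical spacing, the same on both charts

# The same difference of 100 accidents, drawn on each chart.
spring = drawn_height_mm(100, accidents_per_tick=100, mm_per_tick=MM_PER_TICK)
fall = drawn_height_mm(100, accidents_per_tick=200, mm_per_tick=MM_PER_TICK)

print(spring)  # 10.0
print(fall)    # 5.0
```

Under this assumed spacing, an identical change of 100 accidents is drawn twice as tall on the spring chart as on the fall chart, so differences in bar height cannot be compared across the two panels.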
I think another failure of the graph, which in some way is related to a shortcoming of the analysis in general, is that there is no context in which to view these data. We only see accident rates on the Monday before the time change versus the Monday after the time change, and only for a combined two-year period. Although the authors are specifically interested in how accident rates change as a result of daylight savings, it would still be useful to see how accident rates change on a weekly basis through an entire year. Likewise, it may be useful to see how these rates change over the course of several years. The data in this graphic appear to support the authors' argument, but these types of changes may occur frequently over the course of the year and the timing may be coincidental. We also cannot tell if this trend is repeated year after year. The power of the graph is weakened by this lack of contextual information.
Finally, the visualization of the bar graph includes some unnecessary (and even a little confusing) 3D styling, despite the fact that there are only two dimensions to the data (accident rate and time periods). Specifically, each bar in the graph casts a slight shadow on the background. With this added dimension, it is difficult to measure the exact height of the bars (do we use the bars, the shadows, or somewhere in between?). Most likely, the shading was added for style alone, but it only serves to add a little more ambiguity to an already weak graphical representation.
The above figure is a simple redesign of my bad visualization in order to more accurately represent the data. Rather than separate the spring and fall time changes into completely separate charts, I put all data points on the same graph in order to facilitate direct comparisons. This also avoids the previous problem of having different scales on the y-axis for the two different charts, thereby making it easier to compare the change in accident rate between the two seasons. I also set the x-axis to cross the y-axis at the zero point (rather than some seemingly arbitrary value which only serves to exaggerate the effect of the time change), in order to provide a better sense of what the change in accident rate is with respect to the overall rate for the specified time periods. For example, the height of the bars in the previous (bad) graph suggested that the accident rate practically doubled in the spring, while my redesign shows that the actual change in accident rate is relatively modest compared to the overall rate. I also adopted one of Tufte's techniques: removing horizontal grid lines in favor of simple markings that lie directly on the bars themselves. As he mentioned, this removes "chartjunk" (thereby increasing my data-ink ratio) and even makes it easier to judge what the values are at the height of each bar. Finally, I omitted any 3D styling, providing a simple flat bar graph which avoids any of the ambiguities associated with the cast shadows.
My final criticism of the bad visualization was that it did not provide any context with which to examine the data of interest. Presumably the authors of the study had access to traffic accident rates for more days than simply the Mondays before and after daylight savings time changes. It may very well be that the seemingly simple correlation between amount of sleep and traffic accidents is actually more complicated and nuanced, and the argument may not be as strong when the data for an entire year of Mondays are examined in a single graphic. If, on the other hand, the data from the entire year are relatively stable except for the times right after the daylight savings time changes, then another good visualization would be to plot the number of accidents over the course of the entire year. Since I do not have access to actual traffic accident data from Canada, I simply fabricated an ideal data set which supports the argument the authors are trying to make. Here is a second redesigned graphic:
Now we are plotting the average number of traffic accidents for Mondays over the entire year. In this graph, the periods of daylight savings time changes are highlighted, and the data clearly show precipitous changes in the accident rate at these times. As in the previous redesign, this chart is more effective than the bad visualization since all the data are plotted on a single graph to allow the viewer to compare accident numbers at different time periods. In this case, however, the context of traffic accident rates over the entire year is provided, making it easier to see how changes in sleep (which are assumed to occur after the time changes) can affect accident rates. I should emphasize that this visualization works particularly well because the extended data set I created still supports the authors' argument. In reality, the traffic accident data probably do not look this good. This type of visualization may in fact diminish the strength of the authors' argument if the change in number of accidents as a result of daylight savings time is not significantly different from the overall changes in accident rates over the course of the year.
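One rough way to check whether a jump after a time change stands out from ordinary week-to-week variation is to compare it against all the other weekly changes in the year. The Python sketch below does this with a simple z-score. The function name and the seven-week series are hypothetical, and the counts are invented for illustration only, in the same spirit as the fabricated data set above; this is an informal check, not a substitute for a proper statistical test.

```python
import statistics


def change_z_score(weekly_counts, week):
    """Z-score of the change into `week` relative to all other weekly changes.

    `week` is a 0-based index into weekly_counts and must be at least 1.
    """
    changes = [b - a for a, b in zip(weekly_counts, weekly_counts[1:])]
    focus = changes[week - 1]
    # Every other week-to-week change serves as the baseline.
    others = changes[:week - 1] + changes[week:]
    return (focus - statistics.mean(others)) / statistics.stdev(others)


# Invented Monday accident counts; index 4 plays the post-time-change Monday.
counts = [100, 102, 99, 101, 120, 103, 100]

spike = change_z_score(counts, week=4)
ordinary = change_z_score(counts, week=2)
print(spike > 2)          # True: the jump stands out from the other changes
print(abs(ordinary) < 1)  # True: an ordinary week does not
```

If the real data resembled the fabricated ideal, the post-change Mondays would score far above the other weeks. If the scores for those Mondays were similar to the scores for ordinary weeks, the full-year plot would weaken the authors' argument rather than support it.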