# Data and Image Models

Lecture on Jan 24, 2011

## Contents

• Chapter 1: Graphical Excellence, In The Visual Display of Quantitative Information. Tufte.
• Chapter 2: Graphical Integrity, In The Visual Display of Quantitative Information. Tufte.
• Chapter 3: Sources of Graphical Integrity, In The Visual Display of Quantitative Information. Tufte.
• Levels of Measurement, Wikipedia

• The eyes have it, Shneiderman. (html)
• The structure of the information visualization design space. Card & Mackinlay. (ieee)(Google Scholar sources)
• On the theory of scales of measurement. Stevens. (jstor)

## Boheekim - Jan 24, 2011 07:40:20 pm

We were discussing visual variables in class today, and it seemed that we were looking at pieces of data that could be represented as a singular point or shape in a visualization. I was wondering, when we connect two points of data, using the line to signify that these objects have a relationship of some sort, which visual variable category does that fall under?

## Sally Ahn - Jan 25, 2011 12:28:19 am

In response to Boheekim's comment, I found a paper that explains Bertin's visual variables in greater detail (Considering Visual Variables as a Basis for Information Visualisation), and I think such a line would be considered a mark, or "the most basic visual unit" (point, line, area, surface/volume) which the visual variables modify. Any of Bertin's seven visual variables might modify this line to make it more meaningful (e.g. color: red may show decline over time and green may show increase).

One comment I have from the Tufte reading is that some of his examples from the first chapter seem to leave out crucial information necessary for interpreting the data as he describes. For example, Tufte analyzes the "small multiple" graph of air pollutants (page 42) by explaining the peaks at different times with emissions from refineries, power plants, or major freeways, but the graphic itself provides no indication of these locations. Thus, it cannot independently convey the information needed to explain the varying peaks, although it does successfully depict that variations exist. Adding simple marks to indicate the locations of these major factors on the county map slice may address this problem, although I think the occlusions of the peaks and this 3D planar view may make it difficult to make accurate interpretations. This makes me wonder how important it should be for a visualization to be "independent" in presenting a particular story and whether the pros of such independence outweigh the cons of information overload.

## Saung Li - Jan 25, 2011 03:35:37 am

It is interesting to see how visualizations have evolved. Though different, they all attempt to accomplish the task of turning data into something we can better understand.

The Napoleon's March graphic is a great example of how a visualization can help picture what actually happened from historical data. I have seen this image multiple times, yet I've been interested by it each time. It shows the initially massive army marching into doom through numerous cities and back, dying off as temperatures drop. The thinning line shows just how many people are dying, and the start and end values differ substantially. I also find it interesting that the Japanese seem to emphasize visualizations more than we do. Perhaps we can learn from them by observing how they convey their visualizations. We need to pick the correct, relevant values, think of artistic ways to show them, and put more emphasis on statistics.

## Jvoytek - Jan 25, 2011 12:24:39 pm

www.brainscanr.com
Regarding The eyes have it, by Ben Shneiderman: His assessment of the 7 tasks (3 additional tasks are added to the traditional overview, zoom, filter, details-on-demand) has been reflected remarkably well in my own work. Recently my husband and I created a website that visualizes the relationship between terms in neuroscience, www.brainscanr.com. We started small with just one overview of the 20 terms most strongly associated with the search term. It included details of the connections through a simple roll-over, and showed the relation of each term to each other term by a line of varying width. We immediately got requests to zoom and filter the data, which we're currently working to implement. In addition to the 4 traditional tasks we've also gotten several requests from researchers to extract the data and the graphs for use in other presentations, and to see how this data has changed over time. I am astonished at how well these tasks relate to what our users are requesting!

## Krishna - Jan 25, 2011 11:00:52 am

In response to Boheekim: I think your example would fit into the 'Network' category (from Shneiderman's seven data types). I am not sure about the visual variable category, but the curve that results from these connections can probably be thought of as a shape with its own texture, value and color.

Comments on 'The Eyes Have It':

I feel Shneiderman did not emphasize enough the importance of context when it comes to strategizing overviews. Overview strategies, I believe, should be derived from a context; for example, they could be based on the user seeing the visualization. When I go to Google Maps, the overview is based on my current location. Also, the paper does not address the fact that most data models have an inherent uncertainty factor associated with them. This makes me wonder when and how visualizations should accommodate this uncertainty. I am not sure Shneiderman's seven data types adequately address this; or maybe, given this argument, most data types we deal with are inherently >2-d or multi-dimensional.

## Brandon Liu - Jan 25, 2011 01:17:43 pm

Regarding the discussion on nominal vs. ordinal variables in class: If experiments are assigned ordered sequences, such as flowers 1-50, this sequence can be considered either ordinal or nominal. If the experiment design was such that all flowers should have been treated with the same conditions, it is absolutely a nominal variable. However, this is unrealistic, and treating it as an ordinal variable may be able to point out flaws or biases in the experimental design.

For example, if the ordering of flower IDs corresponded to the time or position in which they were planted, the results could be analyzed for correlations of dependent variables and the ID treated as an ordinal variable: examples of how this could affect the experiment are if the position determined the ID, and the sunlight conditions were different for varying positions.

## Matthew Can - Jan 25, 2011 01:51:24 pm

In "The Eyes Have It" Shneiderman describes seven tasks that information visualizations should support. These tasks adequately let the user navigate through the data and perform queries on it (for example, extract, zoom, filter, details-on-demand). What seem to be missing are tasks that allow the user to modify the data as part of his or her exploration of it.

Are such tasks outside the scope of information visualization?

## Siamak Faridani - Jan 25, 2011 02:29:15 pm

It was interesting to see that there are theories and procedures for doing visualization. I am personally impressed by how Jacques Bertin foresaw the need to think about data structures and ways to express information in visualization. Tufte's readings were interesting. I personally really liked his book template. He uses footnotes and images on the sides and the main text in the center, which makes it easy to read. We have already discussed many parts of the first chapter in class. I wish he could make some comments on what type of software is being used in order to generate some of these visualizations. Although I sometimes felt that as much as the graphics were informative, the text failed to tell a coherent story. Also in chapter 3 "The doctrine that statistical data are boring" seems to be inaccurate. At least in recent years, everyone gets very excited when a government or an organization open-sources large amounts of valuable data. Tools like R, Excel and Google Refine have enabled everyone to make sense of large datasets, and I am not sure if that assumption that data is boring still holds.

## Natalie Jones - Jan 25, 2011 04:07:27 pm

Looking at Wattenberg's Map of the Market in class yesterday made me think about how visualizations can require varying levels of investment from the viewer in order to understand the information. For example, with the Map of the Market, the viewer must learn the way the treemap works and what the shapes, colors and categorizations signify before getting much benefit from it. Though every visualization carries some element of this, some require more than others, and for some there is more payback than others. I wonder about how to gauge your audience's threshold for investing in understanding a visualization - I would guess it probably varies depending on the audience and the subject matter, and of course, the quality of the visualization. I'm looking forward to exploring that more as we go through the class.

On an unrelated note, in response to Siamak's suggestion that "everyone gets excited" when data becomes available: I think your perception of who "everyone" is might be a little skewed. I agree that data seems to have become more popular and meaningful to a greater number of people in recent years, and I think that's largely due to technological improvements in making data available and presenting it, as you mention. But I would imagine that the number of people who actually get excited about raw data is still pretty small. What we get excited about is what the data can tell us, and when we can interpret it more easily to gain those insights, which of course, is the whole point of visualization. So Tufte's assertion might be a little outdated, but I don't think the crux of his argument is inaccurate.

## Michael Cohen - Jan 25, 2011 04:06:48 pm

In response to Brandon: Deciding what your independent and dependent variables really are is actually a key point in experimental design and analysis that is somewhat at odds with the exploratory way we might design a visualization. In terms of experimental work (and visualizing it) I disagree somewhat with your approach.

Generally when we look for statistically significant effects, we're looking for a relationship that's significant at around the 95% level, which means that if there were no real effect and you pulled a large number of samples, 5% of them would seem to show an effect as strong as the one you observed simply due to random variation in sampling. The good news is that if you decide to test a single hypothesis (or perhaps a few) with a sound theoretical foundation and you come up with a significant result, you can be fairly confident (though not certain) that it wasn't due to chance, because only 5% of samples would have shown your result simply due to chance. The bad news is that if you measure a large number of relationships (say, 20) in a sample and test them all just for kicks, you're likely to find a "significant" effect or two that is really just a random variation. Thus, in the flower example, if the position (or time) of the flower was important, it should have been recorded as part of the experimental design and treated as real data. Fishing for a correlation with an index variable after the fact is likely to net you a "suspicious" result in about 5% of your studies, but it won't be meaningful or useful unless your theory has something to say about the effect of that index variable -- and if it does, you really should be recording the actual (quantitative) time or position that relates to that index rather than just the (ordinal) index. Which I guess is a long way of asserting: I don't think it's ever good science to treat a sample index as ordinal. It should either be just a label and therefore nominal, or meaningful and therefore replaced with something quantitative (or perhaps occasionally something ordinal but more physical).

I know this is a bit nit-picky on the statistics, but I think it's relevant more generally for visualization because with the huge data sets and powerful tools we have available nowadays, there's a tendency to "fish" and look for interesting trends in whatever data we happen to have at hand. Sometimes this can lead to important insights, but sometimes it can also lead us to make too much out of chance occurrences that we stumble across.
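The arithmetic behind the "test 20 relationships" warning above can be sketched in a few lines. This is only a rough illustration under assumptions that are not in the original comment: the 20 tests are treated as independent, each uses a 5% significance threshold, and no real effect exists in any of them. The variable names are made up for the example.

```python
# Assumptions (not from the original comment): the tests are independent,
# each uses alpha = 0.05, and there is no real effect in any of them.
alpha = 0.05      # per-test false positive rate
num_tests = 20    # number of relationships tested "just for kicks"

# Expected number of spurious "significant" results.
expected_false_positives = alpha * num_tests

# Probability that at least one test comes out "significant" by chance.
prob_at_least_one = 1 - (1 - alpha) ** num_tests

print(f"Expected false positives: {expected_false_positives:.2f}")
print(f"P(at least one false positive): {prob_at_least_one:.3f}")
```

Under these assumptions, about one spurious result is expected, and the chance of seeing at least one is roughly 64%, which is consistent with the "a significant effect or two" estimate.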

## Tanushree - Jan 25, 2011 04:57:03 pm

I enjoyed following the story in the much acclaimed statistical graphic by Minard on Napoleon's March in Tufte's chapter. It wasn't quite clear to me looking at it the first time, until I read the little description that came with it. The three dimensions of space, time and temperature are used well to depict the strength and the movement of Napoleon's army. It uses several visual variables - color, shape/thickness, position, value (of several variables). I am not sure if Bertin's impassable barrier is broken here or not.

I had been thinking about how time series can be visualized over space (geographical or otherwise) such that it is still static (and not interactive). I thought Napoleon's march was a great example of that. However, a constraint here is that the data to be depicted is progressively moving over space with time. It might not be possible if the data were not such. Also, this infographic told me an actual story of Napoleon's march - the strength of the army, how they moved across the border and where they turned back, how low the temperature was, how they lost a substantial part of the army when crossing the river, how they were then joined by another troop, and how a very depleted force made it back. I found the "story" aspect of the graphic very fascinating.

## Poezn - Jan 25, 2011 05:36:47 pm

The discussion about the taxonomy of variables reminded me of a discussion about Infographics and Data Visualizations on Quora a few weeks ago. We have been talking about the taxonomy of variables, but wouldn't it also be interesting to talk about the taxonomy for the whole information visualization space?

Off the top of my head I can think of a number of categories for such a taxonomy: interactive vs. static, degree of automated generation, purpose, technique used, etc. Has anyone ever stumbled across a formalized structure for data visualizations?

## Thomas Schluchter - Jan 25, 2011 10:22:15 pm

In response to Brandon's and Michael's discussion: I'm wondering increasingly about the interrelation of statistics and information visualisation. Tufte really hammers home the point that every visualisation tells a story, and that there are many wrong or misleading ways to tell a story about data. Statistical tests are designed to defend or refute the claim that results are both significant and meaningful. Yet nothing stops anyone from visualising a data set whose relations have no statistical meaning.

Think about a data set with a couple of outliers that are covered by the 5% of expected variation in the sample. In a scatterplot, they attract attention through their odd positioning. The scatterplot accurately represents the data by Tufte's standards, so the visualisation tells "the truth" about the data. This particular truth might not have any significance though. Does the visualisation in this case make sense? Or wouldn't it rather be appropriate to omit the outliers from the representation?

## Michael Hsueh - Jan 25, 2011 11:49:35 pm

I was intrigued by the lecture slide presenting Bertin's assertion that images typically face an impassable three-dimension barrier. I was not completely sure of how to interpret it given that I had seen a number of effective visualizations that seem to circumvent this limitation. Was he talking about an inherent limitation of the medium in which information is being conveyed (i.e. 2D surface)? Or was he speaking in terms of the extent to which human cognition allows us to glean information? Many of the examples of graphical excellence presented in Tufte's text incorporate more than three variables. It also seemed that Bertin's classification of seven visual variables was a means to this end. After some further investigation into the context of Bertin's statement, I came to the belief that he was likely commenting about the case of simple scatter plot images (Bertin's writings were focused primarily on flat graphics). For Bertin, three attributes can be graphically mapped to a two-dimensional space with varying point size. Additional dimensions must be visualized using additional plots set side by side or by cleverly superimposing additional dimensions onto the existing space (for example, as is done on pg 34 of the Tufte text).

## David Wong - Jan 25, 2011 11:39:22 pm

I also want to reiterate Matt Can's point. To what extent are interactive, dynamic visualization systems also considered visualizations--or did I just answer my question? I believe that letting a user personalize a visualization would allow them to better understand the data.

Also, regarding the nominal versus ordinal variables in regards to the flower example, I agree with Brandon that the sample ID, so case 1-50, can be considered either nominal or ordinal. In response to Michael's point, I think that the case can be considered an independent variable and still be treated ordinally. I see it merely as a way to organize your data. For instance, you could generate a scatterplot with the case number on the x axis and some aggregation of the data on the y-axis. If it were solely nominal data, graphing it in this manner may not make sense. However, in regards to independent cases of flower sampling, I think that this graph could be an interesting visualization. Nevertheless, I do agree that if you treat it as a measure of something like time, then a quantitative value would be much better suited.

## Julian Limon - Jan 26, 2011 02:33:43 am

I have to admit that Stevens's levels of measurement made much more sense in class than in the Wikipedia article. While reading the article, I was still a bit confused about the admissible transformations and mathematical structures, but after the class it was all clarified. One thing we didn't talk about in much detail in class was the difference between interval scales and ratio measurements. In the examples we saw in class, we basically treated all numbers as quantitative measurements and focused on differentiating them from ordinal and nominal categories disguised as numbers. However, I still wonder whether the difference between interval scales and ratio measurements will play a role in our visualizations. Tools like Tableau do not require the user to specify this -- they only distinguish between measures and dimensions. Is it because that's what really matters when you visualize information? Are the other differences more of an exercise for other kinds of analysis?
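One way to make the interval versus ratio distinction concrete is temperature, a standard textbook case that the comment above does not itself mention: Celsius is usually treated as an interval scale (its zero is arbitrary) and Kelvin as a ratio scale (its zero is absolute). The sketch below, with made-up readings of 10 and 20 degrees, suggests that differences survive the change of scale while ratios do not.

```python
def celsius_to_kelvin(celsius):
    """Convert a Celsius reading to Kelvin (shift by 273.15)."""
    return celsius + 273.15

# Hypothetical readings chosen only for illustration.
cool_c, warm_c = 10.0, 20.0
cool_k, warm_k = celsius_to_kelvin(cool_c), celsius_to_kelvin(warm_c)

# Differences are meaningful on both scales and agree with each other.
diff_c = warm_c - cool_c
diff_k = warm_k - cool_k

# Ratios only make sense on the ratio scale (Kelvin).
ratio_c = warm_c / cool_c   # 2.0, but "twice as hot" is not a valid reading
ratio_k = warm_k / cool_k   # about 1.035

print(f"Difference: {diff_c:.1f} C, {diff_k:.1f} K")
print(f"Ratio: {ratio_c:.3f} (Celsius), {ratio_k:.3f} (Kelvin)")
```

For a visualization, this might matter when choosing an encoding: a bar whose length starts at zero invites ratio comparisons, which would be misleading for interval data such as Celsius readings.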

## Dan Lynch - Jan 26, 2011 03:44:26 am

In regards to the levels of measurement reading, I found one particular aspect to be quite interesting. When transforming ordinal sets by monotonically increasing functions, they maintain the order, but perhaps can provide a different perspective (this is my take on that part although it wasn't directly mentioned in the article). For example, the diagram showing cooking ability with the various transformations can give you an idea of how this works. When we subtract 8 and Dana's ability is the zero point, you can very quickly determine where all other people stand in relation to Dana. However, when you look at the cubed score, you would have to do much more analysis with the numbers before inferring anything about the relationships.
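The order-preserving property described above can be checked with a small sketch. The scores here are hypothetical: only Dana's value of 8 is implied by the "subtract 8" transformation in the comment, and the other people and numbers are invented for illustration.

```python
# Hypothetical cooking-ability scores. Only Dana's score of 8 is implied by
# the comment above; the other names and values are made up.
scores = {"Dana": 8, "Person A": 5, "Person B": 10, "Person C": 7}

def ranking(values):
    """Return names ordered from lowest to highest score."""
    return sorted(values, key=values.get)

# Two monotonically increasing transformations of the same scores.
shifted = {name: score - 8 for name, score in scores.items()}   # Dana becomes 0
cubed = {name: score ** 3 for name, score in scores.items()}

# The ordering is identical under every transformation.
assert ranking(scores) == ranking(shifted) == ranking(cubed)

print(ranking(scores))
print(shifted)
print(cubed)
```

The shifted values can be read directly as "how far from Dana", while the cubed values keep the same order but stretch the gaps, which is why they take more effort to interpret.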

## Karl He - Jan 26, 2011 06:46:10 am

Regarding Bertin's visual variables discussed in class, I had thought about this a little at the time, but not all of the visual variables are capable of encoding the same information. Attributes such as texture and color can distinguish data from other data, but generally cannot provide any meaningful quantitative data.

I feel that these variables should be separated into two categories, those that can provide quantitative data, such as size and position, and those that can only convey categorical data of some sort, such as color and texture.

Orientation is an interesting property to think about, especially if you consider the orientation of an arrow shape. This can convey a limited form of quantitative information, and in more obvious ways than the categorical properties can.