From CS 294-10 Visualization Sp10



Data Domain

At UC Berkeley's School of Information (I School), two main lists are used for discussion among students, alumni, and staff: noise@ischool and fun@ischool. As a subscriber to these lists for two years, I have noticed that the demographics of list participants do not match the demographics of the school. Specifically, traffic seems to be dominated by men, PhD candidates, and alumni.

My interactive visualization will let the user explore messages sent to these two mailing lists to see trends in message volume and the demographics of senders. I prepared a corpus of cleaned messages last semester for an NLP project. The dataset covers both lists from July 2004 to July 2009 and contains 15,868 messages in 7,609 threads.

Interaction techniques

Presenting the volume of messages over time is one possible format for these data. My visualization will provide a filter panel that lets the user perform dynamic queries to show only the messages that fit certain criteria. Viewing message volume over time under different filters will let the user investigate the demographic patterns I described above, as well as other issues I haven't thought about.

When the user changes the query in the filter panel, I will gray out the non-matching data points, as in the "Attribute Explorer" video we saw in class. Keeping the filtered-out points visible will help users compare their selected query against the entire dataset. For example, when the user filters for messages from current students, she will be able to see the number of messages from current students compared with the total number of messages.


I'm not sure if this is the best way to do these storyboards; I'll change them if I see someone else doing something more helpful.




Based on these storyboards, I'll have to implement a time-range selector for the user to select what portion of the data to show from the overview. This will set the range of the data used to display the detail.
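A minimal sketch of how the selected time range could restrict the data passed to the detail view. It is plain JavaScript rather than Protovis code, and the record fields (`list`, `date`) and the function name `inRange` are assumptions for illustration, not the real schema of my corpus.

```javascript
// Hypothetical message records; the field names are assumptions.
const sampleMessages = [
  { list: "noise", date: new Date(Date.UTC(2004, 6, 15)) },
  { list: "fun", date: new Date(Date.UTC(2006, 2, 1)) },
  { list: "noise", date: new Date(Date.UTC(2009, 5, 30)) },
];

// Return the messages whose date falls in the half-open range [start, end).
function inRange(msgs, start, end) {
  return msgs.filter((m) => m.date >= start && m.date < end);
}

// Example: the overview selection covers calendar year 2006.
const detailMessages = inRange(
  sampleMessages,
  new Date(Date.UTC(2006, 0, 1)),
  new Date(Date.UTC(2007, 0, 1))
);
```

Using a half-open range means that adjacent selections never count the same message twice.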

Each time the user changes an option in the filter panel, the chart will be re-drawn. Data points that match the currently specified criteria will be drawn in the color that corresponds to the list they are part of. Points that do not match will be drawn in gray.
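The coloring rule above could be sketched roughly as follows. This is plain JavaScript, not Protovis code; the color values, the field names, and the shape of the filter object (field name mapped to required value) are all assumptions made for illustration.

```javascript
// Assumed colors, one per list, plus gray for non-matching points.
const LIST_COLORS = { noise: "steelblue", fun: "darkorange" };
const GRAY = "#cccccc";

// A filter maps field names to required values; an empty filter matches everything.
function matchesFilter(msg, filter) {
  return Object.keys(filter).every((key) => msg[key] === filter[key]);
}

// Matching points take their list's color; everything else is drawn in gray.
function colorFor(msg, filter) {
  return matchesFilter(msg, filter) ? LIST_COLORS[msg.list] : GRAY;
}
```

A function like `colorFor` would be evaluated for every point on each re-draw, so the grayed-out points stay in place while the query changes.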

I haven't figured out how best to implement it while accounting for privacy concerns, but I'd like to implement details on demand. Some options might include clicking on a message to see minimal details (the subject) or clicking to view messages from the same sender.

Execution Plan

I am planning to create this visualization using Protovis, although I know that 15,000 points is beyond its ideal performance bounds. I think I can group points in large-scale views so that Protovis doesn't actually have to draw 15,000 points at a time. If performance turns out to be a serious problem, I'll use Piccolo2D.
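One way to group points for the large-scale view is to count messages per calendar month, so the chart draws one value per month instead of one mark per message. The sketch below is plain JavaScript under assumed field names (`date`); monthly bins are my assumption here, and weekly bins would work the same way.

```javascript
// Build a sortable "YYYY-MM" key from a Date (UTC to avoid time zone drift).
function monthKey(date) {
  const year = date.getUTCFullYear();
  const month = String(date.getUTCMonth() + 1).padStart(2, "0");
  return `${year}-${month}`;
}

// Count messages per month and return the bins in chronological order.
function countByMonth(msgs) {
  const counts = new Map();
  for (const msg of msgs) {
    const key = monthKey(msg.date);
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return [...counts.entries()]
    .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0))
    .map(([month, count]) => ({ month, count }));
}
```

For data spanning July 2004 to July 2009, this would reduce the overview to about 60 bins per list.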


I converted my data into JavaScript object literals that I loaded in a webpage. Using Protovis, jQuery, jQuery UI, and a date-manipulation library, I created an interactive visualization with an overview, detail, and filter panel. The initial page load is somewhat sluggish, both because it requires downloading 3 MB of data files and because processing the files in the browser takes 5-10 seconds. Both of these steps could be optimized significantly by using a less verbose file format and by pre-processing the data more before it arrives in the browser.
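As an illustration of what a less verbose format might look like, the sketch below stores the field names once and then one plain array per message, instead of repeating every key in every object literal. The field names, the sample values, and the `unpack` helper are hypothetical, and I have not measured how much this would save on the real files.

```javascript
// Compact form: field names listed once, then one array per message.
// All names and values here are hypothetical.
const packed = {
  fields: ["list", "timestamp", "senderId"],
  rows: [
    ["noise", 1089849600, 12],
    ["fun", 1141171200, 7],
  ],
};

// Expand the compact form back into one object per message.
function unpack(p) {
  return p.rows.map((row) => {
    const obj = {};
    p.fields.forEach((name, i) => {
      obj[name] = row[i];
    });
    return obj;
  });
}

const unpacked = unpack(packed);
```

Storing numeric timestamps rather than date strings would also avoid parsing dates in the browser.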

The user can move or resize a zoom box in the overview panel, which changes the display in the detail panel.
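The link between the zoom box and the detail panel comes down to converting the box's pixel position into a date range. Here is a minimal sketch in plain JavaScript, assuming the overview maps time linearly onto a horizontal axis; the names `pixelToDate` and `boxToRange` and the box fields `x` and `dx` are assumptions for illustration.

```javascript
// Map an x position in [0, width] linearly onto the time span [start, end].
function pixelToDate(x, width, start, end) {
  const span = end.getTime() - start.getTime();
  return new Date(start.getTime() + (x / width) * span);
}

// Convert a zoom box (left edge x, width dx, in pixels) to a date range.
function boxToRange(box, width, start, end) {
  return {
    start: pixelToDate(box.x, width, start, end),
    end: pixelToDate(box.x + box.dx, width, start, end),
  };
}
```

Moving the box changes `x`, and resizing it changes `dx`; either way the detail panel only needs the resulting start and end dates.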

I decided not to let the user toggle the display of the two different lists because the added complexity didn't seem to have a marked benefit. With the stacked area and stacked bar charts, I think the user can adequately answer many questions.

One significant failure of this visualization is that it cannot show the disparity between male and female participation as effectively as I had hoped. When you filter message traffic to show only messages from women, there is a marked decrease. The user might deduce that this is simply because there are fewer women than men at the School of Information. However, the proportion of messages sent by women is still markedly out of line with the demographics of the student body, which is about 35% female. To improve this visualization, I would have to find some way to show the composition of possible senders alongside the message counts.
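One possible way to bring the composition of possible senders into the picture is to divide a group's share of messages by its share of the population. The sketch below uses made-up message counts purely for illustration; only the 35% figure comes from the text above, and it describes the student body, not the alumni and staff who also post.

```javascript
// Ratio of a group's share of messages to its share of the population.
// 1.0 means proportional participation; below 1.0 means under-represented.
function representationRatio(groupMessages, totalMessages, groupShareOfPopulation) {
  return groupMessages / totalMessages / groupShareOfPopulation;
}

// Hypothetical counts: 200 of 1000 messages sent by a group that makes up
// 35% of the population.
const exampleRatio = representationRatio(200, 1000, 0.35);
```

A ratio like this could be plotted over time in place of raw counts, so a smaller group would not automatically look less active.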

I spent about 10 hours preparing the dataset for this visualization, much longer than I had planned. It took a long time to extract email addresses from the corpus, associate them with people, and then devise a way to associate varying statuses with individuals over time (a person who sent a message in 2006 as a student might send one as an alumnus in 2009). Although wrangling data is often a part of visualization, this felt excessive and probably wasn't a good use of my time for this assignment.

After I had all the data available, I spent another 10-15 hours implementing this solution using Protovis and other JavaScript libraries. Figuring out the best way to work with time range data was one of the challenges I ran across. The Protovis examples were helpful, although I sometimes found myself confused when I tried to arrange panels and marks myself. I did appreciate some of the "simple" but extremely useful tools that Protovis supplies, like pv.Scale, which maps your domain to a range, provides nice tick marks, etc.
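The idea of a person's status changing over time can be sketched as a list of dated intervals per person, looked up by message date. This is a simplified illustration, not the structure I actually used; the record layout, the sample dates, and the `statusAt` helper are all hypothetical.

```javascript
// Hypothetical record: each status applies over a half-open interval [from, to).
// Dates are ISO "YYYY-MM-DD" strings, which sort correctly as plain strings.
const examplePerson = {
  name: "example sender",
  statuses: [
    { status: "student", from: "2005-08-01", to: "2007-06-01" },
    { status: "alumnus", from: "2007-06-01", to: null }, // null means still current
  ],
};

// Return the status that applied on the given date, or "unknown" if none did.
function statusAt(person, isoDate) {
  const hit = person.statuses.find(
    (s) => s.from <= isoDate && (s.to === null || isoDate < s.to)
  );
  return hit ? hit.status : "unknown";
}
```

With this layout, a message sent in 2006 resolves to "student" and one sent in 2009 resolves to "alumnus" for the same person.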

You can access this application at http://people.ischool.berkeley.edu/~ryan/cs294/a4 or download a zip at http://people.ischool.berkeley.edu/~ryan/cs294/a4_ryan_greenberg.zip . Because this is a private mailing list and there are some privacy concerns, I'd prefer that only students in this class access the application. To access the app, use the same username and password as for this class's readings.
