A3-JeffBowman

From CS294-10 Visualization Fa08

Assignment: Assignment 3: Creating Interactive Visualization Software

My proposal is to use Macromedia Flash to create a metadata visualizer for arguably the most famous public domain database in the world: Wikipedia.

Proposal

Storyboard Mockup

Right now, there is no easy way to see the frequency of Wikipedia edits as they occur over time. The pattern of edits corresponds to real-world events as they happen; for instance, Sarah Palin has 8,644 edits to her page (as of the time of writing), with fewer than 10% of those predating her selection as Republican Vice Presidential nominee in August 2008. Seeing the quantity of edits yields sociological information about when and how information is brought to the web. By connecting this data with edit histories from other related articles, the user can see times when the Wikipedia information in that group of articles changed significantly, and how it did so.

For the mockup, I chose to display the information as a set of vertical lines (instead of a typical line or area graph) because I feel that visual density is a potentially more accurate description than a histogram with an arbitrarily large number of buckets. In extreme cases, the choice of bucket boundaries (August 1-September 1 versus August 2-September 2) could affect the way the data is displayed and perceived; this would produce discontinuities in a line graph, while a density graph stays consistent. I also posit that data is easier to compare in density graphs than in line or area graphs, as visual texture can be compared across multiple rows more easily than a series of line graphs. Furthermore, using visual density leaves position free to encode additional data, such as whether bytes were added to or removed from the article in question.
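
The bucket-boundary problem described above can be illustrated with a minimal sketch. The edit dates here are invented for illustration, and `monthly_counts` is a hypothetical helper, not part of the project's source:

```python
from datetime import date

def monthly_counts(edit_days, start_day_of_month):
    """Count edits per one-month bucket, where each bucket starts on the
    given day of the month (1 = calendar months, 2 = shifted by one day)."""
    counts = {}
    for d in edit_days:
        year, month = d.year, d.month
        # An edit before the bucket's start day falls in the previous bucket.
        if d.day < start_day_of_month:
            month -= 1
            if month == 0:
                year, month = year - 1, 12
        key = (year, month)
        counts[key] = counts.get(key, 0) + 1
    return counts

# Hypothetical data: one edit in mid-August and a burst of five on September 1.
edits = [date(2008, 8, 15)] + [date(2008, 9, 1)] * 5

calendar = monthly_counts(edits, 1)  # buckets run August 1 to September 1
shifted = monthly_counts(edits, 2)   # buckets run August 2 to September 2

print(calendar)  # {(2008, 8): 1, (2008, 9): 5}
print(shifted)   # {(2008, 8): 6}
```

Shifting the bucket boundary by a single day moves the whole burst from one bar to another, which is the kind of distortion the density view is meant to avoid.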

I intend to allow these density charts to be horizontally "zoomed" dynamically, so the user can focus on specific (shorter) lengths of time. I also intend to allow visualizations of individual articles to be added and removed dynamically.

The depth of data this project exposes means that it will need access to an online data source, likely a downloaded version of Wikipedia or a live cache of Wikipedia-sourced metadata.

This project can be extended in a number of ways, time permitting. Aside from the aforementioned "net byte count change" extension, I also intend to parse the article for important dates and links, which can be used to suggest "related" articles.

Finished project

Screenshot of the final submission

Source code: A3-Jeff Bowman-Source code v1.zip

Live installation (Wikipedia): [1]

Live installation (294 Wiki): [2]

With a little extra enthusiasm and extra time on this project, I created a Flex web application that reads data (through a simple passthrough proxy on my server) directly from Wikipedia (or, via a different MediaWiki API URL, the CS294 wiki).
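
As a rough sketch of what such a request might look like, the following builds a MediaWiki API query for an article's revision timestamps and sizes. The endpoint URL, the chosen parameters, and the `revisions_url` helper are assumptions for illustration; the page does not say which query the application or its proxy actually sends:

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Assumed endpoint; the proxy and API URLs actually used are not given on this page.
WIKIPEDIA_API = "https://en.wikipedia.org/w/api.php"

def revisions_url(title, api_url=WIKIPEDIA_API, limit=500):
    """Build a query URL asking for revision timestamps and sizes of one article."""
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvprop": "timestamp|size",  # when each edit happened and the page length after it
        "rvlimit": str(limit),       # revisions per request; longer histories need paging
        "redirects": "1",            # follow article redirects to the destination article
        "format": "json",
    }
    return api_url + "?" + urlencode(params)

url = revisions_url("Al Gore")
query = parse_qs(urlparse(url).query)
print(url)
```

Pointing `api_url` at a different MediaWiki installation (for example the CS294 wiki's own `api.php`) would query that wiki instead, which is presumably how one code path can serve both installations.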

Features

The main feature appears in a view very much like the storyboard above: the timestamp data, which captures a particular moment in time rather than a particular day or other large category, is charted as a time series in density, using ordinal grayscale values (light to dark) to indicate the number of Wikipedia revisions that occur over time.
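
One way to turn raw timestamps into such a chart is to count edits per display column over the visible time window. The sketch below assumes that approach; `bin_edits` and the sample revision times are hypothetical and not taken from the project's source:

```python
from datetime import datetime

def bin_edits(timestamps, window_start, window_end, columns):
    """Count edits per display column within [window_start, window_end)."""
    counts = [0] * columns
    span = (window_end - window_start).total_seconds()
    for t in timestamps:
        if window_start <= t < window_end:
            offset = (t - window_start).total_seconds()
            # Guard against floating-point rounding at the right edge.
            index = min(int(offset / span * columns), columns - 1)
            counts[index] += 1
    return counts

# Hypothetical revision times.
edits = [
    datetime(2008, 1, 10),
    datetime(2008, 8, 29, 9, 0),
    datetime(2008, 8, 29, 18, 30),
    datetime(2008, 8, 30),
]

whole_year = bin_edits(edits, datetime(2008, 1, 1), datetime(2009, 1, 1), 4)
zoomed = bin_edits(edits, datetime(2008, 8, 29), datetime(2008, 8, 31), 4)

print(whole_year)  # [1, 0, 3, 0]
print(zoomed)      # [1, 1, 1, 0]
```

Under this scheme, zooming is just a narrower window over the same timestamps: the three late-August edits that share one column in the yearly view separate into three columns in the two-day view.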

Because the data exploration was so new, I produced a very limited storyboard in the original design; however, many of the features implemented here, including live Wikipedia data and the ability to add and remove articles dynamically, were already part of that proposal. Some of the features I added were:

  • The ability to zoom. Though the zooming interface is not very polished, it allows a deeper exploration of the data.
  • A time scale. Otherwise, Edward Tufte would be ashamed.
  • Limited length data. This was available through the public feed, so I added it as an extended feature that could be activated or removed.
  • Hover-over information. Though it won't win any design awards, the hover-over information provides more concrete numbers than a purely visual exploration of the data.
  • The choice between a logarithmic and linear representation, and between universally-normalized and individually-normalized data.
  • Suggestions based on Wikipedia search, which came from the inability to find the exactly-named article I was looking for during the development and testing process. Later I also implemented a feature that forwards "article redirects" to the destination article, rather than just the source.
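
The scale and normalization choices in the list above can be sketched as follows. This is an illustrative reading of those options rather than the application's code; `shades`, its parameters, and the sample counts are hypothetical:

```python
import math

def shades(counts_by_article, logarithmic=False, shared_scale=True):
    """Map per-column edit counts to darkness values in [0, 1] (1 = black)."""
    def scale(n):
        # log1p keeps zero counts at zero while compressing large spikes.
        return math.log1p(n) if logarithmic else float(n)

    scaled = {title: [scale(n) for n in counts]
              for title, counts in counts_by_article.items()}
    overall_max = max((max(v, default=0.0) for v in scaled.values()), default=0.0)

    result = {}
    for title, values in scaled.items():
        # Shared scale: darkest stripe across all articles is black.
        # Individual scale: each article's own darkest stripe is black.
        top = overall_max if shared_scale else max(values, default=0.0)
        result[title] = [v / top if top > 0 else 0.0 for v in values]
    return result

# Hypothetical counts: one heavily edited article and one quiet one.
counts = {"Busy": [0, 10, 1000], "Quiet": [0, 5, 10]}

linear_shared = shades(counts)
linear_individual = shades(counts, shared_scale=False)
log_shared = shades(counts, logarithmic=True)

print(linear_shared["Quiet"])
print(linear_individual["Quiet"])
print(log_shared["Quiet"])
```

With a shared linear scale the quiet article is nearly invisible next to the busy one; individual normalization or a logarithmic scale brings its structure back, at the cost of making stripes less directly comparable across articles.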

Commentary

I spent an inordinate amount of time working with the data via extremely large downloads from the Wikipedia download site. I eventually produced a time series in a static file, but it was much too large and too old to be of use. I then stumbled upon the MediaWiki API site, which helped immensely. This is also my first project in Flex, though I have considerable ActionScript background from outside of the class; this helped produce a system that is relatively encapsulated and object-oriented. Naturally, I've run up quite a bit of "technical debt," merely because the timespan of this project was so short. All in all, I probably spent 40 hours on this project, not including the computation time when I left my computer to parse the multi-gigabyte Wikipedia file (despite the oh-so-futile results of that venture).

Three major features are missing that I wish I had had time to produce:

  • A data legend, providing numeric quantities to the darkest and lightest stripes.
  • A list of suggestions based on forward and backward links from the article.
  • A clearer interface for zooming and panning.

Interesting data

The Wikipedia article histories on Al Gore, An Inconvenient Truth, the Nobel Peace Prize, and the Intergovernmental Panel on Climate Change.
This section was added to the page after its original submission.

Unlike many other projects, this project uses live data streamed from Wikipedia; therefore, the number of interesting stripes and comparisons is nearly limitless.

Please add to this list if you find anything interesting; Wikipedia is a large enough dataset that there are plenty of interesting data to play with. Try entering these to see the more curious visualizations:

  • Black stripes appear for Al Gore and his film An Inconvenient Truth at the time of its Academy Award for Best Documentary in 2007; Al Gore also has a black stripe matching one in Nobel Peace Prize from when he won that award in 2007, which also had a minor effect on An Inconvenient Truth and the Intergovernmental Panel on Climate Change. Notably, the article on the Nobel Peace Prize was not heavily edited in 2006 or 2008, and has a much smaller edit spike in 2005.
  • Sarah Palin is better viewed on the logarithmic scale, because her nomination at the beginning of September dwarfs the rest of her line. In fact, if you leave your settings on "linear" and "consistent", you may need to remove her, as the quantity of her edits in that timespan dwarfs nearly everything else. Typing "Wasilla, AK" into the search box provides a number of articles that only recently gained interest; the article Wasilla doubled in size after her vice-presidential nomination.
  • Likewise, Heath Ledger's article quadrupled in size after his mysterious death earlier this year. You can see that spike in edits mirrored in the article for The Dark Knight, but that article does not show a matching change in size.
  • General David Petraeus delivered a report to Congress on September 10, 2007; the edit war that followed did not change the page's length, and ultimately resulted in the page being protected from anonymous edits.
  • Another notable edit war comes from the article Bling-bling, a hip-hop neologism with a frequently-vandalized page that experienced an inexplicable burst of edits in March 2007.
  • The article on Wood has a single black "spike" where it was mentioned in the geek-joke webcomic xkcd (see comic here).

