A3-DavidSun

From CS294-10 Visualization Fa07

David Sun & Andy Carle

Specification

Data Domain

We will be working with data obtained from the World Bank [[1]] and Earth Trends [[2]]. Both organizations provide query and download services for their online databases, which are repositories of a broad range of socioeconomic indicators for countries around the globe.

Interaction Techniques

We shall employ a number of interaction techniques discussed in class for this work, with appropriate adaptations:

  • TreeMap. The UN groups the world's countries by geographic location, e.g. East Asia and Pacific, South Asia, Middle East and North Africa, etc. The geographic regions will be translated into nodes in the treemap (the root being, trivially, the entire world) and roughly positioned at their actual locations on a map of the world. This design should help users quickly orient themselves in the visualization.
  • Dynamic query. We will support dynamic querying along both spatial and temporal dimensions. We are considering using slider bars to let users specify the years and the countries they are interested in. We will also let users control the number of indicators to visualize by presenting them with check boxes that act as data filters. This design is meant to help users cope with the quantity and high dimensionality of the data.
  • Overview+context. To further help reduce cognitive load, we will adopt the overview + context metaphor (more explanation in the storyboard) through a zoomable user interface (ZUI).

Storyboard

To explain our initial design, we will step through a sample data analysis session. (Note: No real data is included anywhere in these samples.)

The user opens the software and is presented with a series of boxes depicting the countries of the world. The size of each box is determined by population. The boxes are roughly grouped by geographic area.

The user is presented with a series of check boxes. These are used to change which variables are displayed in the interface. In the following mockup, the user has selected "GDP," "Literacy," and "Phone Service."


Image:A3-AJCDS1.jpg


The colors that appear in the interface represent positive and negative trends in the numbers. For instance, the top-rightmost country box contains three colors: a bright green, a darker green, and a bright red. This indicates that this country's GDP is very good, literacy is above average, and phone service is very poor.

While individual data items are relatively clear here, they would be less so in a full version with the full number of countries (and with more variables displayed at once). This level of zoom is appropriate for spotting trends and interesting outliers, but is not appropriate for noting detail.

The user is intrigued by the top right corner of the display. She wants to compare the two right-hand countries on the top row in Europe. So, she clicks as shown:


Image:A3-AJCDS2.jpg


The user interface zooms in to reveal:


Image:A3-AJCDS3.jpg


We can now see more detail: the names of the countries and labels on the data. Additional data will most likely be included here but is omitted from this mockup. Examples include the actual numeric values or other items useful for comparisons.

Seeing this data has piqued the user's interest in this region. She now decides to examine a different set of variables: Internet and Water. This is accomplished by changing the checkboxes on the left. The resultant interface looks like this:


Image:A3-AJCDS4.jpg


As you can see, the number of values per box has changed to two to reflect the reduction in variables. The colors have updated to reflect the new variables as well.

The user is now interested in changes over time. To pursue this, she will look at the data from 1995. This is done by moving the slider at the top of the screen to 1995. This produces:


Image:A3-AJCDS5.jpg


This shows the values of the variables in 1995. The user could now examine other variables in this year, change the year again, or zoom back out and examine the global view of the conditions she has selected in this local view.

Implementation

The system that we have implemented is essentially the same as that specified above. We coded our project in Java due to the robust selection of graphics toolkits available for the platform. We used the Piccolo zoomable user interface framework for our design. The treemap layout algorithm used to generate the geometry of region boxes is part of the java treemap package provided by Ben Bederson and Martin Wattenberg. All the major interface decisions shown in the storyboards above have been implemented.
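
To make the treemap geometry concrete, here is a minimal sketch of slice-and-dice, the simplest treemap layout: it divides a rectangle into strips whose areas are proportional to each item's weight. This is not necessarily the algorithm used from the treemap package named above. The class name, the method name, and the weights are illustrative assumptions, and the weights are assumed to be positive.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a slice-and-dice layout; not the treemap package's API.
class SliceAndDiceSketch {

    /**
     * Splits the rectangle (x, y, w, h) into one strip per weight.
     * Each result is {x, y, width, height}; strip area is proportional to weight.
     */
    static List<double[]> layout(double[] weights, double x, double y,
                                 double w, double h, boolean sideBySide) {
        double total = 0;
        for (double weight : weights) {
            total += weight;
        }
        List<double[]> rects = new ArrayList<>();
        double offset = 0;
        for (double weight : weights) {
            double share = weight / total;
            if (sideBySide) {
                // Strips run left to right and share the full height.
                rects.add(new double[] {x + offset, y, w * share, h});
                offset += w * share;
            } else {
                // Strips run top to bottom and share the full width.
                rects.add(new double[] {x, y + offset, w, h * share});
                offset += h * share;
            }
        }
        return rects;
    }

    public static void main(String[] args) {
        double[] populations = {50, 30, 20}; // illustrative weights, not real data
        for (double[] r : layout(populations, 0, 0, 100, 60, true)) {
            System.out.printf("x=%.1f y=%.1f w=%.1f h=%.1f%n", r[0], r[1], r[2], r[3]);
        }
    }
}
```

A nested treemap would apply the same split recursively, alternating direction at each level: regions first, then countries within each region.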

The data we have been using was obtained from the World Bank WDI database. The raw data was retrieved in comma-separated (CSV) format. After some experimentation with Java's StringTokenizer, we realized that CSV is more complex than a naive StringTokenizer can handle. We eventually borrowed a Java-based CSV parser from [3]. The data consists of 54 world development indices for 209 countries for years ranging from 2001 to 2006, for a total of 11,286 country-indicator pairs (54 × 209) in each year. The World Bank clusters the 209 countries into seven regions by rough geographical proximity, and the treemap reflects this clustering. Trends and scaling factors are calculated and variables are transformed before everything is put into a custom data structure that acts as the database. A set of APIs was then implemented to allow the front end to interface with the back-end data repository.
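
The following sketch illustrates why a plain StringTokenizer struggles with CSV: it splits on every comma, including commas inside quoted fields, and it silently skips empty fields. The small parser here is a hypothetical stand-in that handles quoted fields and doubled quotes; it is not the parser borrowed from [3], and the sample row is made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical sketch; handles quoted fields and "" escapes within one line only.
class CsvLineSketch {

    /** Splits one CSV record into fields, keeping empty fields. */
    static List<String> parseLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        current.append('"'); // doubled quote is a literal quote
                        i++;
                    } else {
                        inQuotes = false;    // closing quote
                    }
                } else {
                    current.append(c);       // commas are ordinary text here
                }
            } else if (c == '"') {
                inQuotes = true;
            } else if (c == ',') {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());
        return fields;
    }

    public static void main(String[] args) {
        String row = "\"Korea, Rep.\",GDP,,123.4"; // illustrative row, not real data
        StringTokenizer naive = new StringTokenizer(row, ",");
        while (naive.hasMoreTokens()) {
            System.out.println("naive token: [" + naive.nextToken() + "]");
        }
        System.out.println("parsed fields: " + parseLine(row));
    }
}
```

With the sample row, the naive tokenizer breaks the quoted country name in two and drops the empty third field, while the sketch keeps four fields in their original positions.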

Changes from Storyboard

No major changes have been made from the storyboard. Minor changes include:

  • Moving the time slider into the side panel (Piccolo doesn't play particularly well with embedded Swing objects).
  • The interactions have changed very slightly from that indicated in the storyboard. Zooming in is now done with a double click. Panning uses the single click (plus drag). Zooming out is achieved with a right click.
  • The number of variables we have chosen to include forced a scroll bar on the variable pane.
  • We could not find complete data for many of our variables. As such, we added an extra encoding. White boxes represent a lack of data.
  • We did end up adding more text to the display. Values are now displayed on each colored region once you are zoomed far enough in.

The following screen shots show our interface at the world, regional, and country views.

Image:A3-ACDS-world.jpg Image:A3-ACDS-region.jpg Image:A3-ACDS-country.jpg

Implementation Difficulties

  • Applying the UMD treemap algorithms to Piccolo nodes was not entirely straightforward. One particularly frustrating Java issue is interesting (and amusing) enough to mention here. The boxes defining countries in our interface extend the central class of Piccolo, PNode, and implement the treemap unit interface, Mappable. Both Mappable and PNode define a method getBounds() that takes no arguments. Unfortunately, these methods have different return types and need to do fundamentally different things. In Java, there is no particularly graceful way to handle this issue, because a single class cannot declare two methods that differ only in return type. We used the simplest solution and renamed the method throughout the treemap source code, but not before spending an hour or so establishing whether there is a way to handle this situation in Java.
  • An initial attempt was made at parsing the data through a naive Java string tokenizer. It quickly became apparent that a naive tokenizer is incapable of handling additional spaces and symbols in the string elements to be parsed. We eventually resorted to a Java implementation of a CSV parser.
  • The data we downloaded needed to be scaled in various ways to make the visualization look correct. A recurring problem was double representation of data. A good example of this issue is the values we downloaded for GDP. In the initial version of our project, we colored the GDP variable in its raw state. However, because box size already encodes population, doing this redundantly encoded population in a confusing way. GDP tends to increase roughly proportionally to population. Thus, it was only natural that the biggest countries would be the brightest green -- causing greens to dominate the map. We corrected this problem by turning these types of values into ratios of value per population.
  • A second data transformation that proved necessary for visual effect was a visual scaling of variables. A straightforward, range-based mapping of data points to colors in the red-to-green color space resulted in an excessively red display. This is due to the considerable disparity between the "haves" and "have nots" of the world -- the middle point of the range of values (which we were coding with the middle color) tends to be far above the average. To counter this effect and produce a more visually diverse display, we opted to independently map the range above the mean and the range below the mean into the green and red color spaces, respectively. This means that the red and green values are displayed on different scales -- a difference in color on the red side may mean something slightly different from a difference in color on the green side. While this is a bit of a stretch of the metaphor we employ, we felt that it made it easier to effectively analyze trends in the interface.
  • It became apparent that the data from the World Bank was of dubious quality. There were large gaps in the data between years, and these gaps affected countries to varying degrees (e.g., the US lacks an internet user count for 2005). In general, the more affluent countries have more complete data than the developing countries, which isn't all that surprising given their relative financial capability to perform data collection of this nature. The year 2005 turned out to be the year with the most complete data, which is why we chose it as the default year when the application starts. We performed some elementary data cleaning by replacing missing data with NaN objects and using white to depict those missing data points in the user interface. When calculating the average value of a variable for a given year, we also excluded countries whose value for that variable is NaN, to prevent skewing of the scaling factor.
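
The mean-split color mapping and the NaN handling described in the last two items can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual code: the class and method names are hypothetical, the minimum channel intensity of 96 is an arbitrary choice, and the sample values are made up.

```java
import java.awt.Color;

// Hypothetical sketch of a mean-split red/green scale with white for missing data.
class MeanSplitColorSketch {

    /** Mean of the non-NaN entries, so missing values do not skew the scale. */
    static double meanIgnoringNaN(double[] values) {
        double sum = 0;
        int count = 0;
        for (double v : values) {
            if (!Double.isNaN(v)) {
                sum += v;
                count++;
            }
        }
        return count == 0 ? Double.NaN : sum / count;
    }

    /** Maps t in [0, 1] to a channel intensity from 96 (dark) to 255 (bright). */
    static int shade(double t) {
        double clamped = Math.max(0.0, Math.min(1.0, t));
        return (int) Math.round(96 + clamped * (255 - 96));
    }

    /** Values at or above the mean use the green scale; values below use the red scale. */
    static Color colorFor(double v, double min, double mean, double max) {
        if (Double.isNaN(v)) {
            return Color.WHITE; // missing data
        }
        if (v >= mean) {
            double t = max > mean ? (v - mean) / (max - mean) : 0.0;
            return new Color(0, shade(t), 0);
        }
        double t = mean > min ? (mean - v) / (mean - min) : 0.0;
        return new Color(shade(t), 0, 0);
    }

    public static void main(String[] args) {
        double[] values = {1, 2, 3, Double.NaN, 94}; // illustrative, skewed like the real data
        double mean = meanIgnoringNaN(values);
        for (double v : values) {
            System.out.println(v + " -> " + colorFor(v, 1, mean, 94));
        }
    }
}
```

Because each side is normalized by its own range, the skewed high value no longer pushes every other value toward a single shade of red, at the cost of the two sides using different scales.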

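For the getBounds() clash described in the first item above, one alternative to renaming the method is composition: the node exposes a separate adapter object that implements the layout interface on its behalf. The sketch below uses small stand-in types to stay self-contained. SceneNode, LayoutRect, and LayoutItem are hypothetical and do not reproduce the real Piccolo or treemap APIs, and this is not the approach we took.

```java
// All types here are hypothetical stand-ins, not the real Piccolo or treemap APIs.
class BoundsAdapterSketch {

    /** Stand-in for a scene-graph base class whose getBounds() returns an array. */
    static class SceneNode {
        private double[] box = {0, 0, 0, 0};

        double[] getBounds() {
            return box.clone();
        }

        void setBounds(double x, double y, double w, double h) {
            box = new double[] {x, y, w, h};
        }
    }

    /** Stand-in for the layout library's rectangle type. */
    static class LayoutRect {
        final double x, y, w, h;

        LayoutRect(double x, double y, double w, double h) {
            this.x = x;
            this.y = y;
            this.w = w;
            this.h = h;
        }
    }

    /** Stand-in for the layout library's item interface. */
    interface LayoutItem {
        double getSize();
        LayoutRect getBounds();
        void setBounds(LayoutRect r);
    }

    // "class CountryNode extends SceneNode implements LayoutItem" would not compile:
    // both supertypes declare getBounds() with incompatible return types.
    static class CountryNode extends SceneNode {
        private final double population;

        CountryNode(double population) {
            this.population = population;
        }

        /** Exposes this node to layout code through a separate adapter object. */
        LayoutItem asLayoutItem() {
            return new LayoutItem() {
                @Override public double getSize() {
                    return population;
                }
                @Override public LayoutRect getBounds() {
                    double[] b = CountryNode.this.getBounds();
                    return new LayoutRect(b[0], b[1], b[2], b[3]);
                }
                @Override public void setBounds(LayoutRect r) {
                    CountryNode.this.setBounds(r.x, r.y, r.w, r.h);
                }
            };
        }
    }

    public static void main(String[] args) {
        CountryNode node = new CountryNode(5.0);
        node.asLayoutItem().setBounds(new LayoutRect(1, 2, 3, 4));
        System.out.println(java.util.Arrays.toString(node.getBounds()));
    }
}
```

The trade-off is one extra object per node and a conversion on every call, in exchange for leaving both libraries' source untouched.
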
Lingering Technical Issues

  • The Piccolo framework has trouble displaying very small text. This became an issue for us on very small countries. In these instances, country names cannot properly scale to fit in their boxes, so we opted to let them overflow. In situations where data labels would not fit in the boxes, we omit them to clean up the interface. A potential solution to this issue is to scale all other interface elements to make Piccolo's minimum text size fit in the smallest boxes. Unfortunately, Piccolo does not handle large objects particularly well either, making this intractable. A final potential solution is to distort the sizes of boxes to make the text fit. We chose to preserve the accuracy of the display rather than pursue this option.
  • The response of the user interface to the temporal slider bar is very slow. This is mainly an implementation issue caused by the way that the dataset API is accessed. Instead of using bulk query interfaces, the current implementation makes separate queries for each individual country per year per variable. This issue should have a simple fix in the next iteration of the application. Another improvement would be to switch from a custom-implemented data repository to a MySQL/Oracle-based back end, and keep the current data repository implementation as a fast cache for frequently accessed information. This should both improve the response time and make the data repository more robust and scalable.
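
One way to picture the bulk-query fix for the slider is to index the repository by year, so that a slider move fetches a whole year in a single call rather than one call per country and variable. The sketch below is a hypothetical illustration only: the class, the method names, and the "country|indicator" key format are assumptions and do not describe the current data repository.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory store; names and key format are assumptions.
class YearIndexedStoreSketch {
    // year -> ("country|indicator" -> value)
    private final Map<Integer, Map<String, Double>> byYear = new HashMap<>();

    private static String key(String country, String indicator) {
        return country + "|" + indicator;
    }

    void put(int year, String country, String indicator, double value) {
        byYear.computeIfAbsent(year, y -> new HashMap<>())
              .put(key(country, indicator), value);
    }

    /** Fine-grained lookup: one value per call; NaN when the value is missing. */
    double get(int year, String country, String indicator) {
        Double value = sliceFor(year).get(key(country, indicator));
        return value == null ? Double.NaN : value;
    }

    /** Bulk lookup: a read-only view of every value recorded for one year. */
    Map<String, Double> sliceFor(int year) {
        Map<String, Double> slice = byYear.get(year);
        if (slice == null) {
            return Collections.<String, Double>emptyMap();
        }
        return Collections.unmodifiableMap(slice);
    }

    public static void main(String[] args) {
        YearIndexedStoreSketch store = new YearIndexedStoreSketch();
        store.put(2005, "CountryA", "GDP", 1.5);      // illustrative values, not real data
        store.put(2005, "CountryB", "Literacy", 0.9);
        // A slider move to 2005 needs only this one call to refresh every box.
        System.out.println(store.sliceFor(2005).size() + " values for 2005");
    }
}
```

The same year-keyed map could also serve as the cache layer in front of a database back end.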

Delegation of Labor

David and Andy participated equally in the initial design process. For the implementation phase, Andy dealt with the front end while David worked on the back end. We then worked together to connect the two halves, debug the whole package, and produce this write-up.

Specifically, Andy implemented the simple Swing components for selecting variables and the year, the Piccolo components that make up the interface, the integration of the UMD treemap algorithm into the display, and the event handling that drives the interface. Piccolo functions made implementing the basic ZUI metaphor simple, but the other interactions were coded from scratch. The most time-consuming part of this process was building the basic functionality of the display from the ground up. While Piccolo provides nice tools, they are all oriented towards the ZUI metaphor rather than any particular type of application. Making Piccolo's pieces do things that are relevant to the semantics of your application is non-trivial. Andy's portion of the project took ~20 hours.

David implemented the parsers on top of the CSV parser library to read in the data, the custom data repository (from scratch), the algorithms to scale the variable values, and the libraries to interface the back end with the GUI components. David's portion of the project took ~20 hours.

Download

Source files as zip

Executable JAR and datafiles as a zip. (Extract the two jars and the datafiles to a directory, then double click or java -jar A3-AndyCarleDavidSun.jar)

It is easiest to run our project by downloading and executing the above jar file.

We have included all of the library files we used along with our own code in the source code zip. The piccolo and Treemap folders are libraries we used, while the VIZA3 and WorldData folders are our code. The easiest way to build our application from source is to import all four projects into Eclipse -- the .project files are included for each.

Double click zooms in through the various levels, right click zooms out, and click and drag pans. As you turn variables on and off you may notice lots of white areas -- these are a result of the aforementioned missing data.


