From CS294-10 Visualization Sp11
I started with the World Bank Data Catalog, a well-curated and well-documented collection of datasets. I focused on the World Development Indicators (WDI) and generated a couple of interesting visualizations. The problem is that the dataset is already well explored, so I could not tell a new story with my visualization. Japan has improved a great deal since 1961, Malaysia seemed to miscalculate or misrepresent its WDI figures, and Afghanistan has always been in trouble. It was hard to find anything in that dataset that is not already on the internet.
I then turned to a visualization that I knew could be improved. The following is a visualization of 16,000 Mechanical Turk workers in the United States, drawn on a Google Map that is available here. As we can see, the large icon used for each Turker makes the data hard to interpret. Specifically, the high density of Turkers along the coastlines hides the complete story: the viewer can only tell that Turkers are very active on the coasts and much less active in the middle of the country.
After playing with the source of that page a little, I realized that the data is pulled from an XML file on the techlist server. Each Turker has an <entry> tag, and each tag contains the latitude and longitude of the worker. A snapshot of an entry for a Turker is shown below.
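The dataset can then be easily read and parsed using the XML package in R:
library(XML)
# Fetch the feed and parse it into an internal XML tree
doc <- xmlInternalTreeParse("http://techlist.com/mturk/mturkgeo.xml")
# Get the root node of the document
top <- xmlRoot(doc)
top[]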
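As a sketch of how the coordinates could then be pulled out of the parsed tree, assume each <entry> holds its coordinates in child tags named lat and lng. Those tag names, the sample document, and the helper entries_to_df are hypothetical and only for illustration; the real feed may name or nest its fields differently.

```r
library(XML)

# Hypothetical stand-in for the feed; the real tag names may differ.
sample_xml <- '<feed>
  <entry><lat>37.87</lat><lng>-122.27</lng></entry>
  <entry><lat>40.73</lat><lng>-73.99</lng></entry>
</feed>'

# Turn every <entry> into one row of a data frame with numeric coordinates.
entries_to_df <- function(xml_text) {
  doc <- xmlParse(xml_text, asText = TRUE)
  data.frame(
    lat = as.numeric(xpathSApply(doc, "//entry/lat", xmlValue)),
    lng = as.numeric(xpathSApply(doc, "//entry/lng", xmlValue))
  )
}

workers <- entries_to_df(sample_xml)
print(workers)
```

Using XPath keeps the extraction to one expression per field, so adapting the sketch to the actual feed should mostly mean changing the two path strings.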
I later received a dataset curated by Professor Ipeirotis of NYU. This dataset, which is much larger than the one I generated using R, gives the number of Turkers active in 31,000 ZIP codes in the United States. I believe he will soon publish a blog post with his own analysis of the dataset, but the following are my visualizations of it. The dataset has two columns: Zip Code and Lift. A lift of less than 1 means that the number of Turkers in the area is lower than what would be expected by chance, and a value larger than 1 means that there are more Turkers than expected if we assume a uniform distribution of Turkers across the country.
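To make the interpretation of lift concrete, here is a small sketch with made-up numbers. It assumes lift is an area's share of all Turkers divided by its share of the total population, which is one reading of "expected by chance"; the dataset's exact formula may differ, and compute_lift is a hypothetical helper.

```r
# Made-up counts for three hypothetical areas (not from the dataset).
areas <- data.frame(
  zip        = c("00001", "00002", "00003"),
  turkers    = c(30, 10, 10),
  population = c(1000, 1000, 2000)
)

# Assumed definition: share of Turkers divided by share of population.
compute_lift <- function(turkers, population) {
  (turkers / sum(turkers)) / (population / sum(population))
}

areas$lift <- compute_lift(areas$turkers, areas$population)
print(areas)
```

Under this assumed definition, the first area has 2.4 times as many Turkers as its population would suggest, while the other two fall below 1.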
Lift is a value for the number of Turkers relative to the population of the area, which allows us to compare areas with different populations. The following shows the states on a 2D map sorted by their lift values.
As we would expect, NY and CA have the highest number of workers. It was interesting to see Pennsylvania among the states with a large number of Turkers.
Now we can draw the sum of the lift values and the population on the same scale.
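A minimal sketch of that step, using made-up values and assuming a hypothetical state column has already been joined to each ZIP code: sum lift and population per state, then min-max rescale both series to [0, 1] so they can share one axis. Min-max scaling is an assumption here; a different normalization would work the same way.

```r
# Made-up ZIP-level rows; the state and population columns are hypothetical.
zips <- data.frame(
  state      = c("CA", "CA", "NY", "PA"),
  lift       = c(1.5, 2.5, 3.0, 1.0),
  population = c(100, 300, 200, 100)
)

# Sum lift and population within each state.
by_state <- aggregate(cbind(lift, population) ~ state, data = zips, FUN = sum)

# Min-max rescale a numeric vector to the range [0, 1].
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))

by_state$lift_scaled <- rescale01(by_state$lift)
by_state$pop_scaled  <- rescale01(by_state$population)

# Side-by-side bars, one pair per state, on the shared 0-1 scale.
plot_by_state <- function(d) {
  barplot(t(as.matrix(d[, c("lift_scaled", "pop_scaled")])),
          beside = TRUE, names.arg = d$state,
          legend.text = c("sum of lift", "population"))
}
```

Calling plot_by_state(by_state) draws the chart. Rescaling only changes the axis, not the ordering of states within each series.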
Replication of the visualization in Tableau
Finally, I used Tableau to replicate the same visualization.
The same visualization in R
Tableau visualizations seemed to be limited, so I tried to replicate the same visualization in R. Here are my three failed attempts in R.
Here is my final attempt to generate a nicer visualization in R. I used the maps library in R, and the legend is added with the shape library.
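For reference, here is a simplified sketch of how a state-level map can be colored with maps and given a color-bar legend with shape: bin each state's value into a palette color, then pass the colors to map() in the order of its polygons. The values, palette, and helper names are made up for illustration, so this is a sketch rather than the exact script behind the figure.

```r
# Bin values into palette colors: the smallest values get the first color.
value_to_color <- function(values, palette) {
  bins <- cut(values, breaks = length(palette), labels = FALSE)
  palette[bins]
}

# Draw the lower-48 states filled by value. `state_values` is a named
# numeric vector keyed by lower-case state name.
draw_state_map <- function(state_values, palette = rev(heat.colors(5))) {
  # Polygon names look like "new york:manhattan"; keep the state part.
  polys  <- maps::map("state", plot = FALSE, namesonly = TRUE)
  states <- sub(":.*$", "", polys)
  cols   <- value_to_color(state_values, palette)
  names(cols) <- names(state_values)
  # States without a value get NA and are left unfilled.
  maps::map("state", fill = TRUE, col = cols[states])
  # Color-bar legend from the shape package.
  shape::colorlegend(col = palette, zlim = range(state_values))
}

# Made-up values for four states (not from the dataset).
example_values <- c("california" = 3.1, "new york" = 2.8,
                    "pennsylvania" = 1.9, "texas" = 0.7)
example_colors <- value_to_color(example_values, rev(heat.colors(5)))
```

With both packages installed, draw_state_map(example_values) draws the map. Matching colors to polygon names matters because some states are split into several polygons.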
Unfortunately, Tableau does not support heatmaps. Interpolating between geolocation values seems to be difficult, as R also has limited support for generating heatmaps.
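One workaround that needs only base R is to skip interpolation, bin the points into a regular longitude/latitude grid, and draw the counts with image(). The sketch below uses randomly generated coordinates and an arbitrary grid size; both are placeholders rather than the Turker data, and grid_counts and draw_heat are hypothetical helpers.

```r
set.seed(1)
# Placeholder coordinates roughly inside the continental US bounding box.
lon <- runif(500, min = -125, max = -67)
lat <- runif(500, min = 25, max = 49)

# Count points per grid cell; rows are longitude bins, columns latitude bins.
grid_counts <- function(lon, lat, n_bins = 20) {
  lon_bin <- cut(lon, breaks = n_bins)
  lat_bin <- cut(lat, breaks = n_bins)
  unclass(table(lon_bin, lat_bin))
}

counts <- grid_counts(lon, lat)

# Draw the grid of counts as a heatmap.
draw_heat <- function(counts) {
  image(counts, col = rev(heat.colors(12)),
        xlab = "longitude bin", ylab = "latitude bin")
}
```

Calling draw_heat(counts) renders the grid. A coarser grid gives a smoother but less detailed picture, so n_bins is the main knob to tune.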
I am wondering, though: in terms of the 8 fundamental elements of visualization, how can we interpret heatmaps, given that they are continuous meta-information on top of a geographic map?