A2-JamesOshea
From CS294-10 Visualization Fa07
Data
The data I used in this assignment were a subset of the World Development Indicators (WDI) published by the World Bank. These data include over 900 indicators of development for 228 different countries. They are described here and you can query the database here. I used Tableau to visualize the data.
I decided to look specifically at how the prevalence of HIV affects different indicators of economic development. I selected over 40 different variables from the years 1990-2006. The statistics include such measures as GDP, total health expenditures, percent of roads that are paved, HIV prevalence, etc. The World Bank does not have complete data for all countries, however, so there were many empty cells in the tables I downloaded. In particular, I was only able to get HIV prevalence data for the years 2003 and 2005.
Reformatting
Unfortunately, the data needed to be reformatted before they could be loaded by Tableau. Initially, the data were arranged in tables with countries representing the rows and years representing the columns, and each indicator had its own table. I wrote a C program to load in these files and reformat them into a single relational data table. The code I wrote can be found here.
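The reshaping step described above can be illustrated with a minimal sketch. It is written in Python rather than C and is not the original program. The input layout (one CSV per indicator with a "Country" column followed by year columns), the function name wide_to_long, and the indicator names are assumptions made for illustration, and the values are placeholders rather than WDI data.

```python
import csv
import io


def wide_to_long(indicator_tables):
    """Flatten per-indicator wide tables into (country, year, indicator, value) rows.

    indicator_tables maps an indicator name to CSV text whose header row is
    "Country,<year>,<year>,..." and whose remaining rows hold one country each.
    Empty cells (missing data) are skipped rather than stored.
    """
    rows = []
    for indicator, csv_text in indicator_tables.items():
        reader = csv.reader(io.StringIO(csv_text))
        header = next(reader)
        years = [int(label) for label in header[1:]]
        for record in reader:
            if not record:
                continue  # ignore blank lines
            country = record[0]
            for year, cell in zip(years, record[1:]):
                cell = cell.strip()
                if cell == "":
                    continue  # no value reported for this country and year
                rows.append((country, year, indicator, float(cell)))
    return rows


# Placeholder tables: made-up countries and values, not World Bank figures.
example_tables = {
    "hiv_prevalence": "Country,2003,2005\nCountryA,1.5,\nCountryB,,0.2\n",
    "gdp_per_capita": "Country,2003,2005\nCountryA,1000,1100\n",
}
long_rows = wide_to_long(example_tables)
```

Each output row carries its own country, year, and indicator, so a tool that expects a single relational table can filter or pivot on any of the three.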
Questions
What is the distribution of HIV prevalence across different countries?
First I wanted to simply get a sense of the overall HIV prevalence around the world. Here I plotted HIV prevalence according to geographic region:
Drilling down into the data a little more, I plotted HIV prevalence by country, filtering out any country with a prevalence rate less than 1%. Here we see the specific countries in the Sub-Saharan Africa region which account for the high rate in that area of the world:
How does HIV affect the labor population?
Then I decided to look for specific correlations with HIV prevalence, still filtering out countries with prevalence rates less than 1%. I first examined how the total labor population is affected by HIV prevalence, expecting there to be an inverse correlation:
As expected, there appears to be a negative correlation between HIV prevalence and the total labor population. Not all countries follow this trend, however: Sudan has a relatively low HIV prevalence rate but also a very low labor force. Given Sudan's longstanding conflict involving the SPLA, as well as the more recent Darfur conflict, it would be interesting to cross-index these data with statistics describing civil unrest and political turmoil. Perhaps the low labor force results from something unrelated to HIV (e.g., war). I'm not sure what is going on with Suriname, though.
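The negative correlation above was judged by eye from the chart. One way to put a number on it is a Pearson correlation coefficient computed over the countries that pass the 1% filter. The following is a minimal sketch, not part of the original analysis; the function names and the sample pairs are placeholders, not WDI values.

```python
import math


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / math.sqrt(var_x * var_y)


def correlation_above_threshold(pairs, min_prevalence=1.0):
    """Correlate prevalence with a second indicator, keeping prevalence >= threshold."""
    kept = [(p, v) for p, v in pairs if p >= min_prevalence]
    return pearson([p for p, _ in kept], [v for _, v in kept])


# Placeholder (HIV prevalence %, labor indicator) pairs, invented for illustration.
placeholder_pairs = [(0.5, 10.0), (1.0, 8.0), (2.0, 6.0), (3.0, 4.0)]
r = correlation_above_threshold(placeholder_pairs)
```

A coefficient near -1 would support the inverse relationship, while outliers such as Sudan would pull it toward 0.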
Does the labor force affect GDP?
Having demonstrated that there is likely a correlation between high HIV prevalence and low labor force, I wanted to see how the labor force affects GDP. Specifically I was thinking that GDP (per capita) would decrease as the labor force (as a percent of total population) decreases. To examine this relationship, I plotted per capita GDP as a function of labor force participation:
For many countries, there does appear to be a relationship between labor participation and GDP. For the developed nations in particular, GDP appears to increase as a function of labor participation. It should be noted, however, that there are still many countries with high rates of labor participation but strikingly low per capita GDP.
How does HIV prevalence look with respect to the labor force participation and GDP?
Given that we've shown some relationship between HIV prevalence and labor participation, I decided to add HIV prevalence to the previous chart:
Here I've plotted GDP (per capita) as a function of labor participation (percent of total population), but the size of the marks now encodes HIV prevalence for each country. Although it seems that labor force participation has something to do with GDP in certain areas of the world, it is clear that countries with high rates of HIV prevalence do NOT have high GDPs.
How does HIV prevalence correlate with per capita GDP?
Given the previous chart, I decided to look directly at HIV prevalence and per capita GDP:
Note that the countries with high rates of HIV have some of the lowest GDP. Although there appear to be many countries with low rates of HIV that still have low GDPs, no country with a HIGH rate of HIV has a high GDP.
Is the effect due to decreased life expectancy?
Although there does appear to be some relationship between HIV prevalence and GDP, I started to think that it may just boil down to life expectancy in general. It isn't hard to imagine that countries with high HIV prevalence also have low life expectancy, so I first addressed this relationship with the following graphic:
Here I've plotted life expectancy as a function of HIV prevalence. In this visualization, the size of the mark encodes the country's GDP (per capita). Not surprisingly, HIV prevalence does inversely correlate with life expectancy. It also appears that life expectancy is an indicator for per capita GDP. I decided to look into this further with the following question.
Does life expectancy affect GDP?
Now that we know that life expectancy and HIV prevalence appear related, how does life expectancy predict GDP? In the following graph I've plotted GDP as a function of life expectancy, with the size of the marks corresponding to HIV prevalence.
I believe this is the most compelling visualization of my project. The link between life expectancy and GDP is clear, and it is evident that no country with a high prevalence of HIV has a GDP (per capita) above $5000.
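The observation that no high-prevalence country exceeds $5000 in per capita GDP can also be checked directly against the table instead of being read off the chart. The following is a minimal sketch under stated assumptions: the text does not define what counts as a "high" prevalence rate, so the cutoff is left as a parameter, and the function name and records are placeholders rather than WDI data.

```python
def high_prevalence_above_gdp(records, prevalence_cutoff, gdp_cutoff=5000.0):
    """Return the countries at or above the prevalence cutoff whose GDP exceeds gdp_cutoff.

    records holds (country, hiv_prevalence_percent, gdp_per_capita) tuples.
    An empty result means the "no high-HIV country has a high GDP" claim holds.
    """
    return sorted(
        country
        for country, prevalence, gdp in records
        if prevalence >= prevalence_cutoff and gdp > gdp_cutoff
    )


# Placeholder records, invented for illustration only.
placeholder_records = [
    ("CountryA", 12.0, 900.0),
    ("CountryB", 0.3, 30000.0),
    ("CountryC", 6.0, 4200.0),
]
exceptions = high_prevalence_above_gdp(placeholder_records, prevalence_cutoff=5.0)
```

Any country returned by the check would be a counterexample worth inspecting individually.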
Conclusion
It is difficult to infer direct causality from the visualizations I have described in this project, but it is clear that the prevalence of HIV is related to various economic indicators for the countries examined in these data. Although there are many factors influencing a country's GDP, HIV prevalence, and particularly its effect on life expectancy, plays a role.
More importantly, this project demonstrates the utility of interactive visualization for exploring a large and complex data set. It is through these types of data manipulations, in real time, that one may begin to gain a greater understanding of the relationships between different variables, start to identify patterns, and eventually learn how the various measurements affect each other.








