From CS294-10 Visualization Fa08


Domain and Initial Question

It's job-hunting season on campus, and companies from a wide variety of industries are catering to students with showings at career fairs, infosessions, and technical talks. Most students follow companies hiring from their major, but some take the opportunity to explore career paths divergent from their academic pursuits. For some, the prospect of a higher salary or a more preferable work location drives their exploration. UC Berkeley's own Career Center records average salaries as reported by alumni in various fields, which may convince some students to take a look at a wider range of lucrative career opportunities.

However, the Career Center data has little information on where these jobs are located, which brings me to my initial question:

Is there a correlation between location (in the United States) and salaries for various occupations?

I am most interested in the distribution of location and salaries for Computer Science-related occupations, but I predict that they will be hard to single out because of the wide variety of job titles that Computer Science majors hold (e.g. "Software Developer", "Programmer", etc.).

Finding the Data

First, by drilling down through several lists of free data repositories, I find the "Occupational Employment and Wage Estimates" page on the website of the Bureau of Labor Statistics: http://www.bls.gov/oes/oes_dl.htm

I notice there are data sets called "Metropolitan Area Cross-Industry estimates" and decide to download the one from 2007 (the most recent available). It is a ZIP archive, with three data files in Excel format and an additional "field_descriptions" Excel file. Upon opening one of the data files, I see that they contain many columns of data about various occupation titles (e.g. "Marketing Manager"). I'm looking for geographical information and I do see that the "AREA_NAME" column holds explicit names of "areas". These "areas" vary from city names ("Anchorage, AK") to large multi-city regions ("San Francisco-San Mateo-Redwood City, CA"). The "AREA" column holds numbers that look like ZIP codes, but aren't. Reading the "field_descriptions" file reveals that these areas are actually organized by MSA Codes (i.e. "Metropolitan Statistical Area Codes"). Apparently, MSA codes are used to group metropolitan areas into statistically independent regions and are more appropriate than ZIP codes in many contexts.

Importing Into Tableau

I import this data into Tableau with little trouble. I hope to use the MSA codes to map the data right away, but I run into a major problem. I can mark the "MSA_Code" field with the "CMSA" geographical role, but CMSA codes ("Consolidated Metropolitan Statistical Area Codes") are apparently not compatible with MSA codes. I spend some time looking for a way to have Tableau recognize my MSA codes, but there seems to be no way to get the points mapped short of entering the coordinates for each MSA individually. Because Tableau cannot determine the geographical locations of my MSA codes, any chart I build from the generated latitudes and longitudes places every point at (0,0):

The Problem with MSA Codes

There is a "STATE" field in the data as well, so I decide to use that instead. Tableau is much more cooperative here; it can map state abbreviations to coordinates directly, and I am able to get my Average Annual Wage measure plotted with circles of differing sizes:

One Third of the States with Average Annual Wages

Switching the Data Source

I notice that I only have about a third of the states represented here, and remember that there are two other data files I haven't looked at yet. At this point, I think switching away from the "Metropolitan Area Cross-Industry estimates" data set is a wise decision, since I have given up on trying to use the Metropolitan Statistical Area Codes. The Bureau of Labor Statistics provides a "State Cross-Industry estimates" data set for 2007 that contains the same information packaged in one convenient Excel file with all the states represented, and I decide to use it to replace my data source in Tableau.

I spend some time hiding the less interesting columns, renaming the confusing ones, and resetting the data type of each column. I recreate the Average Annual Wage map, but decide to use color (along a spectrum) instead of circle size to represent the average annual wage of each state:

State View with Colored Annual Wages

This is my first interesting result! It seems that wages are generally much higher along the West Coast, in the Midwest, and in New England. This makes sense to me, as these areas are generally better developed and have a higher standard of living.

Fixing the Data

After thinking a bit about the data that I'm plotting, I realize that there are major flaws in the visualization. I am plotting average annual wages across all occupations, and there are certainly different mixes of occupations in different areas of the nation. Come to think of it, I haven't yet addressed my question about how occupations relate to location at all. I am just taking a huge aggregated average salary for all occupations for each state, a very naive attempt at analysing the data.

There is a bigger flaw as well. Looking at the data files, it seems that there are many levels of aggregate rows that are interspersed within the data for each occupation. For example, the very first row has an occupation title of "All Occupations". Not only that, every fifteen rows or so, there seems to be data for a category of occupations: the occupations that immediately follow that row. For example, a "Sales and related occupations" row is followed by "First-line supervisors/managers of retail sales workers", ..., "Cashiers", "Counter and retail clerks", etc. So I am certain that the above charts are depicting values that have been compounded many times over for each state.
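To see why mixing hierarchy levels inflates the numbers, here is a minimal Python sketch. The occupation titles are the ones mentioned above, but the `level` labels and every employment figure are made-up placeholders, not values from the BLS files; the toy state also has only one occupation category.

```python
# Hypothetical rows for one state. Titles come from the text above;
# the "level" labels and employment numbers are invented for illustration.
rows = [
    {"title": "All Occupations", "level": "total", "employment": 100},
    {"title": "Sales and related occupations", "level": "category", "employment": 100},
    {"title": "Cashiers", "level": "detail", "employment": 60},
    {"title": "Counter and retail clerks", "level": "detail", "employment": 40},
]

# Summing every row counts each worker once per hierarchy level.
naive_total = sum(r["employment"] for r in rows)

# Summing a single level counts each worker exactly once.
detail_total = sum(r["employment"] for r in rows if r["level"] == "detail")

print(naive_total)   # 300: three levels, so every worker is counted three times
print(detail_total)  # 100
```

Any aggregate taken over all rows at once (a sum or an average) mixes these levels in the same way, which is why the rows need to be restricted to one level first.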

To fix this, I notice that there is a column named "GROUP" in the data. This column is either null or contains the word "major" for each row in the data. It is only "major" on those "occupation category" rows, excluding the "All Occupations" rows. If I can just filter the data to only include those rows that are "major", I should just have data for each occupational category per state with no overlap.
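Outside of Tableau, the same idea could be sketched in a few lines of Python. This is only an illustration: "GROUP" and "STATE" are column names mentioned in this write-up, while `OCC_TITLE`, `A_MEAN`, the helper `keep_major`, and all wage values are assumptions made up for the example.

```python
# Hypothetical sample rows. OCC_TITLE and A_MEAN are assumed column names,
# and the wage values are placeholders, not BLS figures.
rows = [
    {"STATE": "CA", "OCC_TITLE": "All Occupations", "GROUP": None, "A_MEAN": 45000},
    {"STATE": "CA", "OCC_TITLE": "Sales and related occupations", "GROUP": "major", "A_MEAN": 38000},
    {"STATE": "CA", "OCC_TITLE": "Cashiers", "GROUP": None, "A_MEAN": 22000},
    {"STATE": "CA", "OCC_TITLE": "Computer and mathematical occupations", "GROUP": "major", "A_MEAN": 80000},
]


def keep_major(rows):
    """Return only the occupation-category rows, i.e. those with GROUP == "major"."""
    return [r for r in rows if r["GROUP"] == "major"]


categories = keep_major(rows)
for r in categories:
    print(r["STATE"], r["OCC_TITLE"], r["A_MEAN"])
```

Both the "All Occupations" row and the individual occupations have a null GROUP, so a single equality test is enough to drop them.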

I discover that Tableau actually has a facility for filtering data (so that I don't have to modify the Excel spreadsheet directly). Adding this filter and then having occupation (category) be a column in my chart results in a nice "small multiples" view of the distribution of salaries in each state:

Small Multiples View of Annual Wages

This result gets me much closer to answering my question accurately.

Refining the Question

At this point, even though I've limited the chart to broad categories of occupations, there is simply too much data for me to take in. I realize that asking about all occupations is inherently too broad a question, and believe I can now narrow my scope to just Computer Science-related data. The closest category I can see is the "Computer and mathematical occupations" group, so I add another filter and get a much more viewable single map:

Computer and Mathematical Occupations Only
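The second filter amounts to keeping one category and looking up a single wage per state. A minimal Python sketch of that step follows, assuming the rows have already been reduced to occupation categories; the column names `OCC_TITLE` and `A_MEAN`, the helper `wage_by_state`, and all wage values are hypothetical.

```python
TARGET = "Computer and mathematical occupations"

# Hypothetical category-level rows; wages are placeholders, not BLS figures.
rows = [
    {"STATE": "CA", "OCC_TITLE": "Computer and mathematical occupations", "A_MEAN": 80000},
    {"STATE": "CA", "OCC_TITLE": "Sales and related occupations", "A_MEAN": 38000},
    {"STATE": "TX", "OCC_TITLE": "Computer and mathematical occupations", "A_MEAN": 70000},
    {"STATE": "TX", "OCC_TITLE": "Sales and related occupations", "A_MEAN": 33000},
]


def wage_by_state(rows, category):
    """Map each state to the average annual wage of one occupation category."""
    return {r["STATE"]: r["A_MEAN"] for r in rows if r["OCC_TITLE"] == category}


cs_wages = wage_by_state(rows, TARGET)
print(cs_wages)  # {'CA': 80000, 'TX': 70000}
```

With one value per state, the result maps directly onto a single colored map instead of a grid of small multiples.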

Although I have lost the ability to compare this category of occupations to others (i.e. to see differences in salary "hot-spots" between them), I do get a much clearer view of this specific slice of the data. I decide to take advantage of this additional resolution and add another measure to the visualization: the total number of employees in that occupation group in each state. This brings me to my final result:

Final Result

Average Annual Wages and Workforce Size of Computer and Mathematical Occupations in 2007

Caption: Average Annual Wages and Workforce Size of Computer and Mathematical Occupations in 2007

(Refined) Question Answered: How are workers in the computer/mathematical field distributed in the United States, and how is that related to their average annual wages?

This visualization depicts the geographical distribution of the workforce in computer/mathematical occupations and its average annual wages, as estimated in 2007 by the Bureau of Labor Statistics. The data is separated by state, with each circle on the map representing the workforce in the field within that state. The size of each circle encodes the number of people in the occupational field, and the color encodes the average annual wage on a spectrum from $42,520 (red, low) to $83,660 (green, high). The chart shows that the majority of the people in this occupational field are in California, Texas, and states on the East Coast. Many of those states are also home to the highest wages. Conversely, in states where there are very few people in this occupational field, average wages are lower. Florida is interesting in that it has a sizable workforce in this field yet a relatively low average annual wage. For now, it seems that California is still the dominant state in this field, with a large workforce and competitively high wages.
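The visual impression that larger workforces go with higher wages could also be checked numerically. The sketch below computes a Pearson correlation coefficient between workforce size and average wage. It is not part of the original analysis: the helper `pearson` is written here for illustration, and the three data points are placeholders rather than BLS values.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Placeholder values for three hypothetical states (not real data).
employment = [10_000, 50_000, 200_000]
avg_wage = [50_000, 60_000, 80_000]

r = pearson(employment, avg_wage)
print(round(r, 3))  # close to +1 means large workforces go with high wages
```

A state like Florida, with a sizable workforce but a relatively low wage, would pull such a coefficient down, so the map and the number are best read together.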
