A3-MasonSmith

From CS 294-10 Visualization Sp10


Introduction

Initially, I wanted to look at some factors affecting success in education. The National Center for Education Statistics has a really useful tool that allows you to build custom data tables for broad educational data organized at the state, district, or even individual school level. However, most of the data was more administrative and didn't really employ the variables I was interested in, so I decided to take another route.

I decided to compare some linguistic properties of English names vs. English text. Specifically, I compared (case-insensitive) character and bigram frequencies of American surnames with those of the English language in general.


Data

I used the Genealogy Data from the 2000 US Census to get a list of the top 1000 surnames. These surnames account for over 40% of all surnames, so they are a reasonable representation of character and bigram frequencies. I wrote a program in Common Lisp to compute bigram and unigram frequencies in the English names, weighted by the relative counts of each name. For the English language, I used the relative frequencies from a paper referenced in a Wikipedia article on letter frequencies.
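The weighted counting described above could look roughly like the following Python sketch. It is not the original Common Lisp program, which is not shown here. The function name `weighted_ngram_frequencies`, the input layout (a dict mapping each surname to its count), and the choice to count bigrams only within a name (never across two names) are all assumptions made for illustration, and the sample counts are made up rather than taken from the Census file.

```python
from collections import Counter


def weighted_ngram_frequencies(name_counts, n):
    """Relative frequency of each character n-gram across all names.

    name_counts maps a surname to the number of people bearing it
    (hypothetical layout). Each n-gram occurrence in a name is weighted
    by that name's count. Case is ignored.
    """
    totals = Counter()
    for name, count in name_counts.items():
        letters = name.lower()
        # Slide a window of width n over the name; n-grams never span two names.
        for i in range(len(letters) - n + 1):
            totals[letters[i:i + n]] += count
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {gram: c / grand_total for gram, c in totals.items()}


# Made-up counts for illustration only; not the Census figures.
sample = {"SMITH": 3, "LEE": 1}
unigrams = weighted_ngram_frequencies(sample, 1)
bigrams = weighted_ngram_frequencies(sample, 2)
```

With these sample counts, each letter of SMITH contributes 3 to its total and each letter occurrence of LEE contributes 1, so the frequencies reflect how common a name is and not just whether it appears in the list.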

Process

First, I just created a scatterplot of letter frequencies in the top 1000 names vs. frequencies in the English language.

Image:masonsmith_cfsp.jpg

I forced the axes to be the same, so that the viewer can easily draw a mental "y=x" line from the bottom-left to the top-right corner. Entries above the mental line are relatively more common in names, whereas entries below the line are more common in general English text. R and T immediately jump out as examples of each case, respectively.

"Etaoin Shrdlu" is a (not so) popular mnemonic for the 12 most common English letters, in order. One can verify this with the above visualization by reading the labels from right to left (for some reason, Tableau did not want to print the "I" label, despite all my coaxing). The equivalent ordering for common name letters is somewhat harder to read off.

Ideally, I wanted a table with each letter sorted by frequency in the name set, with an additional column for the English frequency and the relative difference in rank. Unfortunately, Tableau doesn't seem to let you compute rank of an entry from a list.
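Such a table is easy to build outside Tableau. The Python sketch below is one way to do it, assuming the two frequency tables are available as dicts over the same set of letters. The function names, the tie-breaking rule (alphabetical), and the sign convention (a positive rank change means the letter ranks higher in names than in English) are assumptions, and the sample frequencies are made up.

```python
def rank_by_frequency(freqs):
    """Map each letter to its rank (1 = most frequent); ties break alphabetically."""
    ordered = sorted(freqs, key=lambda letter: (-freqs[letter], letter))
    return {letter: rank for rank, letter in enumerate(ordered, start=1)}


def rank_table(name_freqs, english_freqs):
    """Rows of (letter, name frequency, English frequency, rank change),
    sorted by frequency in the name set.

    Both dicts must cover the same letters. Rank change is the English
    rank minus the name rank, so positive means "ranks higher in names".
    """
    name_rank = rank_by_frequency(name_freqs)
    english_rank = rank_by_frequency(english_freqs)
    rows = []
    for letter in sorted(name_freqs, key=name_rank.get):
        change = english_rank[letter] - name_rank[letter]
        rows.append((letter, name_freqs[letter], english_freqs[letter], change))
    return rows


# Made-up frequencies for three letters, for illustration only.
example_names = {"a": 0.5, "e": 0.3, "r": 0.2}
example_english = {"e": 0.5, "a": 0.3, "r": 0.2}
table = rank_table(example_names, example_english)
```

The resulting rows could then be exported as a CSV file and loaded into Tableau as an ordinary data source, with the rank change as a precomputed column.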

Instead, I computed the difference in frequency for each letter and plotted this in a bar graph:

Image:masonsmith_cfdiffbar.jpg

I think the above graph has a few shortcomings, though. Neither the coloring nor the ordering is intuitive; reading the caption is necessary. Furthermore, the property I wanted to show foremost (rank change) is given a less prominent encoding (hue/lightness change). Luckily, Tableau lets you adjust the color center, so that exactly half the bars are blue and half are orange. (By default, most of the bars are orange, since most letters after ETA occur only 3.04% of the time on average.) With this median adjustment, determining relative rank is easier. On the other hand, I had to compute the median manually, rather than being able to use a formula to determine the color center within Tableau.
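The manual median step can be sketched in a few lines of Python. This assumes the measure that drives the color is available as a dict of per-letter values; the function name `color_center_split` and the sample values are hypothetical. With 26 distinct values, the median falls between the 13th and 14th, which is why centering the diverging palette there puts 13 bars on each side.

```python
from statistics import median


def color_center_split(values):
    """Return the median of the values plus the letters below and above it.

    values maps each letter to the measure used for color (hypothetical
    layout). With an even number of distinct values, the two groups are
    the same size.
    """
    center = median(values.values())
    below = sorted(letter for letter, v in values.items() if v < center)
    above = sorted(letter for letter, v in values.items() if v > center)
    return center, below, above


# Made-up values for four letters, for illustration only.
example_values = {"a": 0.080, "b": 0.015, "c": 0.028, "d": 0.043}
center, below, above = color_center_split(example_values)
```

The value of `center` is what would be typed into Tableau's color-center field by hand.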

The Census data also included breakdowns by race, so I also wanted to see if there were any interesting relations with character frequencies and ethnicities.

My first idea was to use a scatter plot for each ethnicity.

I used letters as marks to save room. This graph doesn't really facilitate comparison between ethnicities, though it does give a good overall view. Using a grouped bar chart was too unwieldy (7 x 26 = 182 bars in a row), but using individual charts was reasonable.

(I also attempted a stacked bar chart, but unfortunately I couldn't get Tableau to produce one to my liking.)

One problem with both is that I feel they try to present too much information at once. By narrowing down the frequencies to vowels and only a few consonants, one can make more meaningful comparisons.

Final Visualization

I combined two of my previous visualizations after a bit of refining (namely, standardizing the axes in the scatterplots). The left portion answers my original questions fairly well. You can quickly see the relative ordering of frequencies for the census data, as well as direct comparisons to the English character frequency.

The scatterplots on the right give a broad overview of the ethnicity-specific data. One can quickly spot various peculiarities. For instance, 'E' is no longer the most frequent character in three of the six categories. I think a dot-dash-plot or range-frame plot would make the scatterplots more useful. For instance, it would emphasize the particularly skewed distribution of Hispanic and Asian/Pacific Islander surnames.


