A2-DavidPurdy
From CS294-10 Visualization Fa07
Background
For years, I've had an interest in language. One of my professors in college, Bert Vaux, has conducted several large web surveys about dialect variations of speakers of English. I thought that one of these studies might be an interesting introductory exploration in Spotfire and Tableau.
The dataset of interest is called The Dialect Survey. There were 47471 respondents, each of whom was given a series of 122 questions. Each question offered a varying number of discrete choices about which words or pronunciations the respondent preferred. For each respondent, we also have their city, state, and ZIP code.
More information about the survey can be found [at its home page]. I'm grateful to Prof. Vaux for allowing me to use this data for this purpose.
Questions of interest
- What dialect features are most strongly associated with specific regions? (E.g. If someone says "tag sale" rather than "yard sale", or something else, it's a strong indicator they're from Connecticut or a nearby state.)
- Which features show much variation in usage, where the variation does not seem to depend on geography? (E.g. there are 2 popular ways of saying something, and whether one says A or B doesn't appear to depend on where they're from.)
- Can we cluster features into some kind of phylogenetic representation?
- Can we use dialectal homogeneity or heterogeneity to describe the variation within geographic regions, and can we develop some kind of iso-dialectal maps (comparable to isothermal or isobaric maps used in meteorology)?
Results
The first two questions appear to be impossible to address with Tableau, as its support for categorical data appears to be undeveloped. I compromised by asking easier questions and used Spotfire for the results below. The 3rd and 4th questions were far beyond the capabilities of both Tableau and Spotfire. It is possible that Spotfire could be tortured enough to make it produce something acceptable, but a pragmatic person would do the important calculations in another environment.
Note: The steps I took to pre-process and clean the data are omitted for brevity.
Zooming in on populous regions
Unfortunately, Spotfire doesn't offer much support for examining all 122 features against all 51 locations (states + DC). So, I compromised and picked 3 states with large numbers of responses. These were California (CA), New York (NY), and Texas (TX). Below is a histogram of the number of respondents per state, with CA, NY, and TX highlighted (CA is the tallest bar, but the program has the label underneath the second bar.)
Most popular dialectal variant per region
In addition, it wasn't apparent how to produce 3 rows of histograms, one row per state, covering every variable (I later found a possible way to do this), so I opted instead to use parallel coordinate plots. A further compromise was needed: drawing one line per respondent produced too much "chart ink", and it did not seem possible to color the lines by density (or use "alpha blending"), so I opted for one line per state. The columns correspond to individual questions, and each vertex marks the most popular response given by respondents from the given state.
In the plot below, we can examine the most popular choice, per state, for questions 1-20. Where there is only one flat line for a question, the same choice was most popular in all 3 states. Where there are 2 lines, one state diverged from the other two. There are no instances of 3 lines in this example, but if there were, it would indicate that each state had its own most popular dialect variant for the given question.
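Outside Spotfire, the per-state "most popular response" behind each vertex could be computed along these lines. This is only a minimal sketch, not the procedure used for the plots: the record layout (state, question number, option index) and the toy values are assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical toy records: (state, question number, chosen option index).
# The real survey has far more respondents and 122 questions.
responses = [
    ("CA", 5, 2), ("CA", 5, 2), ("CA", 5, 1),
    ("NY", 5, 1), ("NY", 5, 1), ("NY", 5, 2),
    ("TX", 5, 2), ("TX", 5, 2),
    ("CA", 7, 1), ("NY", 7, 1), ("TX", 7, 1),
]

def most_popular_by_state(records):
    """Return {question: {state: most common option}}."""
    counts = defaultdict(Counter)
    for state, question, option in records:
        counts[(question, state)][option] += 1
    modes = defaultdict(dict)
    for (question, state), counter in counts.items():
        # most_common(1) yields the top (option, count) pair;
        # ties go to the option that was counted first.
        modes[question][state] = counter.most_common(1)[0][0]
    return dict(modes)

modes = most_popular_by_state(responses)
print(modes)
```

Each inner dictionary corresponds to one column of the parallel coordinate plot, with one value per state line.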
Examining only disagreements
We can then omit all of the questions where the lines are flat (indicating the universally most popular responses), allowing us to drop 13 of the 20 questions, and focus on 7 questions for further analysis: #s 5, 6, 8, 9, 16, 18, and 20. Although this is a very simple way to look for differences, it allows us to jump straight to a few interesting cases.
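The filtering step amounts to keeping only those questions where the states' top choices are not all identical. A minimal sketch, assuming the top choices are already available as a dictionary; the function name is hypothetical, the entries for questions 5 and 6 follow the examples discussed in the next section, and the entry for question 7 is invented to stand in for a "flat" question:

```python
def disagreeing_questions(modes):
    """Return the sorted question numbers whose top choice differs between states."""
    return sorted(
        question
        for question, by_state in modes.items()
        if len(set(by_state.values())) > 1  # more than one distinct top choice
    )

# Top choice per state, keyed by question number (illustrative values).
modes = {
    5: {"CA": 2, "NY": 1, "TX": 2},
    6: {"CA": 1, "NY": 4, "TX": 1},
    7: {"CA": 1, "NY": 1, "TX": 1},  # hypothetical flat question
}

kept = disagreeing_questions(modes)
print(kept)
```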
Two specific examples
Looking further, at questions 5 and 6, we have the following histograms:
For question 5, we see that the second response is most popular among Californians and Texans, while New Yorkers have a slight preference for the first option, which is much less popular for those further west. This particular question corresponds to the pronunciation of the vowel in the second syllable of cauliflower. Apparently New Yorkers prefer to pronounce it as a rhyme with "see", while those from Texas and California prefer it to rhyme with "sit". (I wonder how native New Yorkers pronounce "California".)
For question 6, again, CA and TX agree on a pronunciation, while New Yorkers go their own way. The question pertains to the pronunciation of the last vowel in "centaur". The first option (the preferred choice among TX and CA respondents) is that it rhymes with "car", while the fourth option is that it rhymes with "sore" and "more" (popular among New Yawkers, ironically).
Footnotes
A data set of this scale (~50K rows x ~100 columns, for 5M entries) is just large enough to give us some challenges for visualization. I soon learned that the categorical nature of it was more of a problem. I used 4 programs in the course of this work: Tableau, Spotfire, JMP, and GGobi via R (the 'rggobi' package). A short summary of my experiences:
- Tableau: This program is either inappropriate for large categorical data sets, or else I did not find the answers in the online help and gallery of example visualizations. I believe that small categorical data sets may be viewed with this program, but it makes many idiosyncratic, automatic decisions about manipulating data and visualizations, which are not statistically intuitive. I won't go into much detail about this, but the seeming inability to condition on one categorical variable and examine histograms of many other categorical variables is a serious shortcoming for this program. Basically, the program seems to aim for the business reports market, where real-valued data is probably more common than discrete data, such as survey responses. I gave up on this program.
- Spotfire had many features for data analysis and did allow me to examine a number of properties of the categorical variables. However, I quickly discovered a lot of bugs, such as poor processing of text strings (though its regular expression support made these easy to clean), and an automatically derived summary of column data, entitled something like "Number of distinct values", that is sorted lexicographically rather than numerically. It has modest support for GIS data (it will import older shapefiles, though it doesn't seem to allow one to discard specific sections, such as certain territories of the US that don't appear in my data set). The bigger problem was Spotfire's very limited statistical functionality for categorical data. I wanted to examine the entropy of variables within subsets and differences across subsets (e.g. explore something like "For question 100, show me the amount of variation (not variance!), conditioned on the state" or "Sort the 122 questions by the amount their entropy varies across states"), but was unable to do so in a straightforward manner via Spotfire's limited ability to create new variables (it really should have categorical analyses built in, though).
- JMP was my next option, as it's developed by SAS, and is known for excellent interactive data exploration functionality. It is extremely impressive, and I had an enjoyable time trying various visualizations and going back and forth with statistical analyses, but the learning curve seemed to be beyond the scope of this assignment.
- GGobi has been very appealing and useful for me in the past, but unfortunately it crashed when attempting to load this data set. Several failed attempts later, I was back to Spotfire.
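The entropy queries mentioned in the Spotfire item could be sketched in a general-purpose language roughly as follows. This is an illustration under assumptions, not something I ran on the survey: the record layout and toy values are invented, and "how much entropy varies across states" is interpreted here as the maximum minus the minimum per-state entropy, which is only one of several reasonable measures.

```python
import math
from collections import Counter, defaultdict

def entropy(options):
    """Shannon entropy, in bits, of a list of categorical responses."""
    counts = Counter(options)
    total = len(options)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def entropy_by_state(records):
    """Return {question: {state: entropy of that state's responses}}."""
    grouped = defaultdict(list)
    for state, question, option in records:
        grouped[(question, state)].append(option)
    result = defaultdict(dict)
    for (question, state), options in grouped.items():
        result[question][state] = entropy(options)
    return dict(result)

def rank_by_entropy_spread(per_state):
    """Sort questions by (max - min) entropy across states, largest first."""
    spread = {q: max(h.values()) - min(h.values()) for q, h in per_state.items()}
    return sorted(spread, key=spread.get, reverse=True)

# Hypothetical toy records: (state, question number, chosen option).
records = [
    ("CA", 100, "a"), ("CA", 100, "b"),  # evenly split in CA
    ("NY", 100, "a"), ("NY", 100, "a"),  # unanimous in NY
    ("CA", 101, "a"), ("CA", 101, "a"),  # unanimous in CA
    ("NY", 101, "b"), ("NY", 101, "b"),  # unanimous in NY
]

per_state = entropy_by_state(records)
ranking = rank_by_entropy_spread(per_state)
print(per_state[100], ranking)
```

Note that question 101 has zero entropy within each state even though the two states disagree completely, so this ranking captures differences in within-state variation, not differences in the preferred answer.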





