A2-WesleyWillett
From CS294-10 Visualization Fa07
Assignment 2: Creating Visualizations
I chose to analyze race result data from the WCCC (Western Collegiate Cycling Conference) 2006 mountain bike season. This data includes rider names, placings, scores (assigned using a system specified by the national governing body) and occasionally times for all competitors in each of the races held during the season. The races cover endurance events such as the Cross Country and Short Track, as well as gravity events including Downhill, Super D, Dual Slalom, and Mountain Cross. Separate races are held for each of the 5 rider categories (Men's A, Men's B, Men's C, Women's A, Women's B), and individual riders in each category amass points towards the season Omnium for that category. Scores from all racers on each team are totaled to decide the team omnium, which is the conference championship.
This data was obtained from the WCCC website: [[1]]
Questions
Q1: Does total race attendance dwindle over the course of the season?
Q2: Similarly, riders usually assume that home teams have an advantage based on their knowledge of their home course and their ability to draw more of their own riders to a race at a nearby location. Is this actually borne out by the data? Do teams actually field more riders or score better at home?
Q3: Do good performances in endurance events correspond to good performances in gravity events or do the two appeal to mutually exclusive sets of riders?
Data Formatting
The original data was obtained as ASCII text files corresponding to each of the Events for the season. Entries document every race entered by every rider who competed over the course of the season. For each rider in each race the data files contain fixed size columns with values corresponding to:
Finishing Position; Rider Number (a unique number handed out to every competitor before each race); Rider Name; Team; Finishing Time (where available); Total Points Awarded; Individual Omnium Points Awarded.
Within each file these columns were broken up into smaller blocks by Race Category (Men's A, Women's A,...), each with their own individual headers. To create the final data file, I pooled all of the valid data tuples from each of the text files and formatted them in Excel. To each tuple I added Category, Event, Venue and Date information by hand.
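The pooling step was done by hand in Excel, but it could also be scripted. The following is only a minimal sketch under stated assumptions: the column names and widths in FIELDS, the sample rows, the venue name, and the date are all hypothetical, since the actual file layout is not reproduced here. It shows how category headers could be carried down onto each fixed-width result row.

```python
from itertools import accumulate

# Hypothetical column names and widths; the real files' layout is not given.
FIELDS = [("position", 4), ("rider_number", 6), ("name", 20), ("team", 16),
          ("time", 10), ("points", 6), ("omnium_points", 6)]
CATEGORIES = {"Men's A", "Men's B", "Men's C", "Women's A", "Women's B"}

def make_row(*values):
    """Build one fixed-width line (used only to create sample input)."""
    return "".join(str(v).ljust(w) for v, (_, w) in zip(values, FIELDS))

def parse_results(lines, event, venue, date):
    """Yield one dict per result row, tagged with the enclosing category."""
    ends = list(accumulate(w for _, w in FIELDS))
    starts = [0] + ends[:-1]
    category = None
    for line in lines:
        text = line.rstrip("\n")
        if text.strip() in CATEGORIES:       # block header: switch category
            category = text.strip()
            continue
        if category is None or not text[:FIELDS[0][1]].strip().isdigit():
            continue                         # skip column headers and blanks
        row = {name: text[s:e].strip()
               for (name, _), s, e in zip(FIELDS, starts, ends)}
        row.update(category=category, event=event, venue=venue, date=date)
        yield row

# Made-up sample input in the assumed layout.
sample = [
    "Men's A",
    make_row("Pos", "Num", "Name", "Team", "Time", "Pts", "Omn"),
    make_row(1, 101, "Rider One", "Team X", "1:02:03", 50, 50),
    make_row(2, 102, "Rider Two", "Team Y", "", 45, 45),   # no time recorded
    "Women's A",
    make_row(1, 201, "Rider Three", "Team X", "1:10:00", 50, 50),
]
rows = list(parse_results(sample, "Cross Country", "Venue 1", "2006-09-30"))
```

Each resulting dict corresponds to one tuple in the final data file, with Category, Event, Venue and Date attached.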
The data contained 1368 tuples, each corresponding to one racer in a single race. This accounts for 333 racers competing across 17 different events. The median number of events per racer was 3. Three racers competed in all 17 possible events, while 94 racers competed only once.
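Summary figures like these can be derived from the pooled tuples with a few counts. The sketch below uses a handful of made-up (rider, event) pairs in place of the real 1368 records, so its numbers are illustrative only.

```python
from collections import Counter
from statistics import median

# Hypothetical (rider, event) tuples standing in for the real records.
records = [("A", "XC 1"), ("A", "DH 1"), ("A", "ST 1"),
           ("B", "XC 1"), ("C", "XC 1"), ("C", "DH 1")]

events_per_rider = Counter(rider for rider, _ in records)
n_events = len({event for _, event in records})

summary = {
    "tuples": len(records),
    "racers": len(events_per_rider),
    "events": n_events,
    "median_events_per_racer": median(events_per_rider.values()),
    "raced_every_event": sum(1 for n in events_per_rider.values() if n == n_events),
    "raced_once": sum(1 for n in events_per_rider.values() if n == 1),
}
```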
Data Analysis
Q1: Does total race attendance dwindle over the course of the season? To test this, I first plotted both the total number of points awarded and total number of race records by Venue.
However, this did not provide a time-ordered result. Simply adding the date of the event to the rows shelf and sorting by month did not alleviate this since one of the race weekends was split with one day in September and one in October. To solve this issue, I created a new calculated field - "Race Weekend" - by assigning numbers based on the Venue name. Ordering using these, I produced the second set of charts.
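Outside Tableau, the same idea amounts to a lookup from Venue to an ordinal. This is a sketch only: the venue names and dates are placeholders, not the real calendar. It shows why grouping by month splits a weekend that straddles September and October, while grouping by the hand-assigned weekend number does not.

```python
from collections import Counter

# Hypothetical venue names; the real calendar order would be filled in by hand.
RACE_WEEKEND = {"Venue A": 1, "Venue B": 2, "Venue C": 3}

# Made-up records; Venue B's weekend straddles two months.
records = [
    {"venue": "Venue B", "date": "2006-10-01"},
    {"venue": "Venue A", "date": "2006-09-23"},
    {"venue": "Venue B", "date": "2006-09-30"},
    {"venue": "Venue C", "date": "2006-10-14"},
]
for r in records:
    r["race_weekend"] = RACE_WEEKEND[r["venue"]]   # the calculated field

# Grouping by month splits Venue B's weekend across two buckets...
records_per_month = Counter(r["date"][:7] for r in records)
# ...while grouping by weekend number keeps it together and sorts in season order.
records_per_weekend = Counter(r["race_weekend"] for r in records)
ordered = sorted(records_per_weekend.items())
```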
However, this still doesn't tell the entire story. Because we're plotting the total number of records, we see only the total number of entries across all events, not how many racers attended each venue. Since many riders enter multiple races per weekend, this may not be an accurate attendance tally. For example, I know that the final race weekend in the calendar was cut short by a day, so only two races were held rather than the expected four. That weekend therefore has fewer race results, even though a larger number of participants may have raced. The natural operation here is to count the number of unique names rather than the total number of tuples. Frustratingly, however, Tableau does not support the "Distinct Counts" calculation for any input data format other than a relational database or a Tableau data extract, which took me a long time to realize. Once my data was repackaged as a data extract, I was able to plot a final chart showing the total number of racers by weekend.
Riders by Race Weekend
The chart above plots the number of riders for the first through fifth (and final) race weekends of the WCCC mountain bike season. Notice how the number of racers declines as the season wears on.
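The difference between counting records and counting distinct riders is easy to see in a small example. The records below are invented for illustration: the second weekend has fewer entries than the first but more individual riders, which is the situation a shortened weekend could produce.

```python
from collections import defaultdict

# Hypothetical (race_weekend, rider, event) records.
records = [
    (1, "Rider A", "Cross Country"), (1, "Rider A", "Downhill"),
    (1, "Rider A", "Short Track"),   (1, "Rider B", "Cross Country"),
    (2, "Rider A", "Cross Country"), (2, "Rider B", "Cross Country"),
    (2, "Rider C", "Downhill"),
]

entries = defaultdict(int)   # total race records per weekend
riders = defaultdict(set)    # distinct rider names per weekend
for weekend, rider, _event in records:
    entries[weekend] += 1
    riders[weekend].add(rider)

unique_riders = {w: len(names) for w, names in riders.items()}
```

Here entries gives 4 and 3 for weekends 1 and 2, while unique_riders gives 2 and 3, so the two measures can rank weekends in opposite orders.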
Q2: Do teams actually field more riders or score better at home? To investigate whether teams are able to draw more racers at home, I plotted the attendance for each of the five hosting teams across each of the five race weekends. Results for each team are clustered together and color coded so that they can be easily compared.
In all but one case (UC Berkeley), we find that a team's home race drew the most people from that team of any race during the season.
Recasting the same plot using total points earned finds the same result, indicating that teams almost always scored better at home.
Finally, to test Tableau's automatic publishing and summarizing functionality, I output my final chart to PDF (and then pasted a snapshot back to the wiki). Caption and key are auto-generated.
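The home-versus-away comparison can also be checked numerically rather than by eye. The sketch below is an assumption-laden illustration: the team names, host assignments, and records are made up. For each hosting team it finds the weekend with the most distinct riders and the weekend with the most points, then tests whether each is the home weekend.

```python
from collections import defaultdict

# Hypothetical hosts and records: (race_weekend, team, rider, points).
HOME_WEEKEND = {"Team X": 1, "Team Y": 2}
records = [
    (1, "Team X", "x1", 50), (1, "Team X", "x2", 40), (1, "Team Y", "y1", 45),
    (2, "Team X", "x1", 30), (2, "Team Y", "y1", 50), (2, "Team Y", "y2", 35),
    (2, "Team Y", "y1", 20),
]

riders = defaultdict(set)    # distinct riders per (team, weekend)
points = defaultdict(int)    # total points per (team, weekend)
for weekend, team, rider, pts in records:
    riders[team, weekend].add(rider)
    points[team, weekend] += pts
attendance = {key: len(names) for key, names in riders.items()}

def best_weekend(team, metric):
    """Weekend on which `team` had its highest value of `metric`.

    Ties go to the first weekend encountered, so real data with ties
    would need a more careful rule.
    """
    weekends = [w for (t, w) in metric if t == team]
    return max(weekends, key=lambda w: metric[team, w])

# (drew most riders at home?, scored most points at home?) per hosting team
home_best = {team: (best_weekend(team, attendance) == home,
                    best_weekend(team, points) == home)
             for team, home in HOME_WEEKEND.items()}
```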
Q3: Do good performances in endurance events correspond to good performances in gravity events or do the two appeal to mutually exclusive sets of riders? To investigate this, I used a calculated field to find, for each rider, the average placing across all endurance events ("Short Track" and "Cross Country") and the average placing across all gravity events ("Downhill", "Super D", "Dual Slalom", and "Mountain Cross"). I then generated a scatterplot comparing the two (excluding racers who competed in only a single race and could not possibly have participated in both).
The resulting image shows that only a minority of racers (150 of 332 who were scored for at least one race) competed in both gravity and endurance events. The rest appear along the axes of the chart. This seems to indicate that a large number of racers are discipline specific. However, if we size the points based on the number of omnium points each participant earned (a good metric of their success over the course of the season), we obtain the following result.
Here we see that the most successful riders in the current scoring system (those with large marks) are situated towards the lower left of the chart and are clearly proficient at both disciplines. However, we also observe a clustering on the left-hand side of the chart. This seems to show that riders who place well in endurance events are not necessarily inclined to place well in gravity events. In fact, many riders who scored relatively well in the overall season standings were top 5 or top 10 performers in endurance events but placed in the high teens and twenties on average in gravity events.
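The per-rider averages behind these scatterplots could be computed along the following lines. This is a sketch, not the Tableau calculated field itself: the records are invented, and riders with no results in one discipline get None for that axis rather than being drawn along an axis.

```python
from collections import defaultdict
from statistics import mean

ENDURANCE = {"Short Track", "Cross Country"}
GRAVITY = {"Downhill", "Super D", "Dual Slalom", "Mountain Cross"}

# Hypothetical (rider, event type, finishing position) records.
records = [
    ("Rider A", "Cross Country", 2), ("Rider A", "Short Track", 4),
    ("Rider A", "Downhill", 18),
    ("Rider B", "Cross Country", 7), ("Rider B", "Short Track", 9),
    ("Rider C", "Downhill", 3), ("Rider C", "Super D", 5),
    ("Rider D", "Dual Slalom", 1),
]

placings = defaultdict(lambda: {"endurance": [], "gravity": []})
for rider, event_type, position in records:
    if event_type in ENDURANCE:
        placings[rider]["endurance"].append(position)
    elif event_type in GRAVITY:
        placings[rider]["gravity"].append(position)

# (average endurance placing, average gravity placing) per rider
points_xy = {}
for rider, p in placings.items():
    if len(p["endurance"]) + len(p["gravity"]) < 2:
        continue  # single-race riders cannot appear in both disciplines
    points_xy[rider] = (mean(p["endurance"]) if p["endurance"] else None,
                        mean(p["gravity"]) if p["gravity"] else None)

# Riders with results in both disciplines
both = [r for r, (x, y) in points_xy.items() if x is not None and y is not None]
```

In this toy data Rider A averages 3rd in endurance events but 18th in gravity events, the same pattern as the left-hand clustering described above.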
Moving to a small-multiple display organized by category gives another insight into this distribution of scores.
As before, we see a shift towards the left hand side of the graph in each of the categories, with the most successful riders being those who competed in both types of events, but with high scores in endurance events not necessarily translating into high scores in gravity events. This display also shows the disparity in sizes between the different categories. The comparatively small size of the women's fields makes them difficult to read. If we filter to show only them, we see a similar profile in the Women's A field. However, the small number of data points makes it difficult to generalize.
To produce the final figure, I'll revert to the small-multiples version of the display, filter to eliminate a few outliers, and introduce a redundant color encoding by category to help differentiate the multiples. As with question 2, I've chosen to test Tableau's automatic caption generation using the 'Publish to PDF' option.
It's interesting to note that while in the figure for Question 2, Tableau's caption generation functionality worked fairly well, it is less successful here. I suspect the application is simply taking the VizQL query string which is used to generate the chart and performing some natural language synthesis based on it.
Notes
In general, this data set is probably still too small and too variable to support many sweeping conclusions. The limited number of race weekends, varying attendance rates, and differences in race conditions between venues all present additional problems. Analyzing more complete data over multiple seasons might provide more reliable results.












