From CS294-10 Visualization Sp11
Choosing a Dataset
Originally, I wanted to look at startup companies, but reliable data was hard to find. I used the CrunchBase API along with some Google Docs I scavenged and experimented with them, but didn't really get anywhere. I decided instead to use the complete post history of one of my favorite blogs, Boing Boing! It's available here: http://www.boingboing.net/2011/01/25/eleven-years-worth-o.html
My intention in choosing the Boing Boing dataset was that it would permit very general questions about online blogs.
Thinking from the beginning, my questions are:
1. When was the post created?
2. How does the day of week affect # of comments?
3. How does the hour of the day affect # of comments?
4. How does the time of year affect # of comments?
5. How has commenting grown over time?
6. How does the category affect # of comments?
7. How does the author affect # of comments?
8. How does the length of the article affect the # of comments?
In general, my overarching question was: How do author, category, and length of posts affect how many comments a post will get?
Data Schema and Transformation
Boing Boing provides the entire history of its posts as a 38.3 MB zip file, which expands to a ~130 MB XML file. The schema for this file includes the date of publication, the title, the author, the text of the post, the categories and the number of comments. Some transformations I did on this included:
1. Parsing the date into year, month, and day
2. Transforming the date object into Day of Week and using this as a dimension
3. Only taking the first category (how would I do more than one category in Tableau?). This could have affected the results.
4. Calculating the length of the post body and using that as a dimension/measure
The data otherwise seemed relatively clean, with a few NULL bodies, and I didn't have any problems with minor string things like capitalization.
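The transformations listed above can be sketched in a few lines. This is a minimal Python sketch, not the Ruby script that was actually used; the record keys (`date`, `categories`, `body`) and the timestamp format are assumptions and would need to be matched to the real export.

```python
from datetime import datetime

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

def derive_fields(post, date_format="%Y-%m-%d %H:%M:%S"):
    """Compute the derived columns for one post record (a plain dict).

    The keys 'date', 'categories' and 'body' and the default date_format
    are hypothetical names chosen for illustration.
    """
    published = datetime.strptime(post["date"], date_format)
    categories = post.get("categories") or []
    body = post.get("body") or ""  # treat a NULL body as empty text
    return {
        "year": published.year,
        "month": published.month,
        "day": published.day,
        "day_of_week": DAY_NAMES[published.weekday()],  # Monday is 0
        # Only the first listed category is kept, as in the writeup.
        "first_category": categories[0] if categories else None,
        "body_length": len(body),
    }
```

Keeping only the first category makes each post land in exactly one bucket, at the cost of ignoring any secondary categories.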
Early decisions
Looking at the raw data
I was really interested in two measures in particular: 1. the length of the post, and 2. the number of comments. After inspecting the raw data, I saw that commenting volume didn't really pick up until 2007 - thus, I cut out years 2004-2007 from the raw file (easy and fast in vim). I then ran a naive parser over the entire thing (using Ruby's REXML library).
If anyone has a better way of turning XML to CSV programmatically and FAST, let me know! I wanted the richness of Ruby's DateTime object transformations, but it was very, very slow (took about an hour). Thankfully, I had something else to do while this ran.
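One possible answer is a streaming parse, which handles one record at a time instead of building the whole document tree in memory. The sketch below uses Python's standard `xml.etree.ElementTree.iterparse` rather than REXML; the record tag `post` and the field names are hypothetical stand-ins for the real schema, and whether this would beat the hour-long run is untested.

```python
import csv
import xml.etree.ElementTree as ET

def posts_to_csv(xml_source, csv_file, record_tag="post",
                 fields=("date", "title", "author", "comments")):
    """Stream record elements from xml_source into csv_file.

    record_tag and fields are assumed names, not the real schema.
    Returns the number of data rows written.
    """
    writer = csv.writer(csv_file)
    writer.writerow(fields)  # header row
    rows = 0
    # An "end" event fires once an element's closing tag has been read,
    # so the record's children are complete at that point.
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag != record_tag:
            continue
        # A missing child element becomes an empty cell.
        writer.writerow([elem.findtext(name, default="") for name in fields])
        rows += 1
        elem.clear()  # discard the finished record's children and text
    return rows
```

`xml_source` can be a filename or a binary file object, and `csv_file` should be opened with `newline=""` as the `csv` module recommends.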
Importing the data into Tableau
The data was imported smoothly into Tableau, bar having to convert some units. The first thing I did was look at authors and post counts:
From this, I figured that I should look mostly at the top four authors - these are, after checking Wikipedia, the four co-editors of Boing Boing.
I also found a few outliers in the data: the most heavily commented posts are 1. the moderation policy and 2. a post called "Untitled", which had lots of random comments. I looked them up here: http://www.boingboing.net/2008/04/24/untitled-1.html
Next, I was interested in the total number of posts by an author, and the ratio to the total number of comments garnered:
This showed that in general all the authors had a consistent average comment count on their posts. Nothing really interesting yet.
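The per-author comparison above was done in Tableau, but the same aggregation can be sketched in Python. The input shape here, an iterable of `(author, comment_count)` pairs, is an assumption made for illustration.

```python
from collections import defaultdict

def comments_per_post_by_author(posts):
    """Map author -> (post count, total comments, average comments per post).

    posts is an iterable of (author, comment_count) pairs; this input
    shape is hypothetical.
    """
    totals = defaultdict(lambda: [0, 0])  # author -> [posts, comments]
    for author, comments in posts:
        totals[author][0] += 1
        totals[author][1] += comments
    return {author: (n, c, c / n) for author, (n, c) in totals.items()}
```

Comparing the third value across authors is the "consistent average" check: if the averages are close, no author stands out.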
Looking at the number of comments created per year:
This shows the rise in # of comments per year, confirming my early suspicion from the raw data that comment volume didn't really pick up until 2007. This also let me know that the 2011 part of the data was (obviously) incomplete.
Next I looked at the day of the week and post volume:
There are significantly fewer posts on the weekends than on weekdays.
Now, do people comment more on average on weekends?
There was, however, an increase year by year in the average number of comments:
How about the month of the year?
The January statistic seems to be slightly different (maybe since they report on gadgets and it's the holiday season) but the trendline gives me a p-value of 0.059. I'm not a statistics expert, but since that is above the usual 0.05 cutoff, I think it means the January difference could be explained by chance.
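A p-value of 0.059 says roughly that, if there were no real trend, a slope at least this steep would still turn up about 6% of the time. The sketch below illustrates that idea with a permutation test, which is a substitute technique and not the calculation Tableau performs for its trendline; the function names and parameters are made up for this example.

```python
import random

def slope(xs, ys):
    """Least-squares slope of ys against xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

def permutation_p_value(xs, ys, trials=2000, seed=0):
    """Estimate how often shuffled ys give a slope at least as steep.

    Shuffling breaks any real link between xs and ys, so the fraction
    of shuffles that match or beat the observed slope approximates a
    two-sided p-value for "no trend".
    """
    rng = random.Random(seed)  # fixed seed keeps the estimate repeatable
    observed = abs(slope(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)
        if abs(slope(xs, shuffled)) >= observed:
            hits += 1
    # Add one to both counts so the estimate is never exactly zero.
    return (hits + 1) / (trials + 1)
```

Here `xs` would be the month numbers and `ys` the average comment counts per month; a result above 0.05 would match the reading that the monthly pattern is not clearly distinguishable from chance.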
I then split this view by category and got some interesting results: (keep in mind that category was processed to only include the first one listed)
It looks like Civil Liberties posts on average attract a lot of comments, while Art and Design posts attract few. This is intuitively plausible. Also, commenting seems to be disabled on Tweets, or else no one is interested in them.
Of course, here's the histogram of post size:
We see that "specials" and "features" are often much longer than usual.
Here's authors and their post lengths:
Some interesting points: Xeni makes a lot of the short (< 200) posts, and Xeni and Cory write most of the longer posts.
Finally, I started exploring one measure: the ratio of the length of a post to the number of comments the post got. This was the most interesting of the questions I thought up while exploring the data in Tableau.
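As a small illustration of this measure, the sketch below computes the ratio for a single post. Returning `None` for posts with no comments is an assumption about how to handle division by zero; how Tableau treated those posts is not recorded here.

```python
def length_per_comment(body_length, comment_count):
    """Post length divided by comment count, or None when undefined.

    Returning None for zero-comment posts is an illustrative choice,
    not necessarily what the original Tableau calculation did.
    """
    if comment_count <= 0:
        return None
    return body_length / comment_count
```

A low value means a post drew many comments relative to its length; a high value means a long post that drew few.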
Here's the split on the ratio and day of week:
I didn't see any interesting trends here: I'm looking at the 'gist' of the shape of the data.
Finally, here's the split by category, where the data gets really interesting:
There are distinct shapes for each category, which suggest that:
- Action consistently gets lots of comments at any length
- For some reason, longer Art and Design posts get fewer comments
- Civil Liberties posts get plenty of comments, as do Science posts
- Video posts are mainly short
You can pick out which authors focus on which categories, and which of those posts garner lots of comments: Cory writes a lot of Art/Design and Gadgets, and consistently gets lots of comments in the Civil Liberties category. David and Mark, on the other hand, write a lot of posts in the Video category, which are shorter. Xeni doesn't really focus on any category.
Of the top 9 categories, the longest in general is "action". There weren't very many spotlight/feature/special posts in total.
The category that consistently gets lots of comments is Civil Liberties: there's a notably denser area at the top of the chart. I don't totally trust my judgement, though, as the color or overlapping layout of the dots may be misleading.
The answer to my original question is that if you want to find posts with lots of comments, check out Civil Liberties posts. Length of post doesn't seem to significantly affect # of comments, as I had hypothesized (at the top of the graphs, these charts are not biased left or right).
1. It's really easy to exclude outliers - Tableau is great for this.
2. I didn't realize the importance of changing the plot shape, color and size until my last two visualizations. This could seriously affect how I perceive patterns, which leads to:
3. I understand better how perceptual science is needed to design these visualizations. Since a lot of my conclusions come down to the "area", or perceiving close dots as contiguous, the color or shape of the marks may be misleading. The default settings didn't seem to be great.
4. I always had to fiddle around with the internal state of my filters - it would be nice if this could be expanded into a "visual programming language" like Max/MSP. I think I read a paper assigned for CS160 on the topic, but coupling it with a really fast backend like the one in Tableau would be a great product. Of course, I bet Cycling '74 and Tableau own all the relevant patents already =)