From CS 294-10 Visualization Sp10

Revision as of 20:33, 22 February 2010 by Ryangreenberg (Talk | contribs)


Getting Started

I started by thinking about questions in two possible areas: countries' success in the Olympics and movies' success at the Oscars. For each area, I wrote down some questions to consider.

Olympics (Winter or Summer)

  • What countries have improved over time? Improved could mean total numbers of medals received, rankings of finishing athletes, number of athletes sent.
  • What countries are most successful in terms of medals per participating athlete?
  • How does a country's success (using one of the above measures) correlate with the country's wealth? What about training resources (athletic facilities, etc.)?
  • Does the country hosting the Olympics see an increase overall in number of medals won?

Academy Awards

  • What is the relationship between popular movies and critically successful movies? We can measure critical success as the number of Oscar nominations or Oscar wins. We can measure popularity by box office returns (inflation adjusted) or by box office divided by the price of a ticket at that time.
  • What about the relationship between production cost and success? Marketing costs and success?
  • Is a movie with a best actor or best actress more likely to win best picture as well?
  • What studios have been most productive in terms of critically successful movies?

After quickly perusing some available datasets, it looks like there is better data available for the Academy Awards. Many of the Olympics datasets I found were incomplete or only encompassed a single year or single event.

Refining the Question

How does critical success, measured in terms of nominations, relate to a movie's chance of winning the Academy Award for Best Picture?

To collect this dataset, I started with the list of all best picture winners and nominees at Wikipedia. I combined this with a list of nomination counts from Box Office Mojo (which only extended back to 1978). If I wanted to include data from before 1978, I would have to enter it manually or find a data source that would let me import it programmatically.
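
As a rough sketch of how two such lists could be combined, the snippet below joins nominee rows to nomination counts on title and year. The function name, field names, and sample rows are placeholders invented for illustration; they are not the actual Wikipedia or Box Office Mojo data.

```python
def merge_nominations(nominees, nomination_counts):
    """Left-join nominee rows with nomination counts keyed on (title, year).

    Rows with no matching count keep nominations=None, which is what
    happens for years the second source does not cover.
    """
    # Normalize titles so small differences in case or spacing still match.
    counts = {
        (row["title"].strip().lower(), row["year"]): row["nominations"]
        for row in nomination_counts
    }
    merged = []
    for row in nominees:
        key = (row["title"].strip().lower(), row["year"])
        merged.append({**row, "nominations": counts.get(key)})
    return merged


# Placeholder rows (made up for illustration only).
nominees = [
    {"year": 1990, "title": "Film A", "won": True},
    {"year": 1990, "title": "Film B", "won": False},
    {"year": 1970, "title": "Film C", "won": True},  # outside the second source's range
]
nomination_counts = [
    {"year": 1990, "title": "film a ", "nominations": 9},
    {"year": 1990, "title": "Film B", "nominations": 4},
]

merged = merge_nominations(nominees, nomination_counts)
for row in merged:
    print(row)
```

Rows that come back with `nominations` set to `None` are the ones that would need manual entry or another source.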

The first chart I made with Tableau was relatively uninteresting. I plotted wins against number of nominations:


One interesting point exposed by this view is that movies with fewer nominations tend not to win best picture. I think this suggests that a number of things have to be good about a movie before it enters real contention for best picture.

Shifting comparisons

What if we change the question to look at the relationship between critical success and popular success in pictures nominated for an Academy Award? I tried this by making a scatter plot of box office returns (popular success) against number of nominations (critical success). Again, I only have this data for movies from the present back to 1978. I also wanted to highlight the films that actually won in a different color and to label certain films. I was surprised how easy Tableau made this, although it is understandably difficult to label the concentrated masses of points.


I'd like to say something about this data graphic, but before I can say anything meaningful I need a new graphic.

Let's get real

In terms of data, the last chart is terrible because it uses nominal box office receipts, which are not comparable across years, instead of constant dollars. I couldn't find a quick way to convert a batch of nominal dollar values to real dollars, so I put together a little Python class that uses the BLS historical CPI figures to do the conversions. Then I redid the previous chart after converting every movie's box office to January 2010 dollars. There is some inaccuracy in these conversions because I assume that a movie's entire box office came in during its year of release and use the average CPI for that year to convert to real dollars.
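
The conversion itself is a single ratio: real dollars = nominal dollars × (CPI in the target period ÷ average CPI in the release year). The class below is a minimal sketch of that idea, not the class I actually wrote. The name `CPIConverter` and its methods are hypothetical, and the CPI numbers are round placeholders rather than BLS figures.

```python
class CPIConverter:
    """Convert nominal dollars to real dollars using annual average CPI.

    annual_cpi maps a year to that year's average CPI; target_cpi is the
    CPI of the period to convert into (for example, January 2010).
    """

    def __init__(self, annual_cpi, target_cpi):
        self.annual_cpi = dict(annual_cpi)
        self.target_cpi = target_cpi

    def to_real(self, nominal, year):
        # Assumes the whole amount was earned in `year`, the same
        # simplification described in the text above.
        if year not in self.annual_cpi:
            raise KeyError(f"no CPI value for {year}")
        return nominal * self.target_cpi / self.annual_cpi[year]


# Placeholder CPI values, chosen to make the arithmetic easy to follow.
# They are NOT real BLS numbers.
converter = CPIConverter({1980: 100.0, 2000: 200.0}, target_cpi=250.0)

real_1980 = converter.to_real(100.0, 1980)  # prices rose 2.5x since 1980
real_2000 = converter.to_real(100.0, 2000)  # prices rose 1.25x since 2000
print(real_1980, real_2000)
```

With real data, `annual_cpi` would be loaded from the BLS historical table and `target_cpi` set to the January 2010 value.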


Using real dollars instead of nominal dollars shifted the landscape of the scatter somewhat (notice that ET is the highest grossing nominee). I thought that the four outliers made it harder to see a trend in the more common region, so I tried excluding them.


Getting more data

Box office returns aren't a great indicator of popular success. There is also the issue of confounding variables, since winning the Academy Award for best picture means a movie is more likely to draw theater-goers. I went looking for some other source of data and settled on the user ratings for movies at IMDB. I used IMDbPY, a Python pseudo-API for the Internet Movie Database, to download user ratings for movies, as well as MPAA rating, running time, movie genres, and aspect ratio (since it was slow to get the data, I decided to add whatever I could to my dataset). I read my existing data using Python's csv module, added the new data, and re-output the data. I replaced box office returns with user ratings:
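
The read, augment, and re-output step could look roughly like the sketch below. The function `augment_csv`, the column names, and the `fake_lookup` stub are all hypothetical; the stub stands in for the IMDbPY calls so the example runs without network access, and the values it returns are made up.

```python
import csv
import io


def augment_csv(src, dst, lookup, extra_fields):
    """Copy rows from src to dst, adding extra_fields obtained from lookup.

    lookup(title, year) returns a dict of extra values, or None when
    nothing was found; missing values are written as empty strings.
    """
    reader = csv.DictReader(src)
    fieldnames = list(reader.fieldnames) + [
        f for f in extra_fields if f not in reader.fieldnames
    ]
    writer = csv.DictWriter(dst, fieldnames=fieldnames)
    writer.writeheader()
    for row in reader:
        extra = lookup(row["title"], row["year"]) or {}
        for field in extra_fields:
            row[field] = extra.get(field, row.get(field, ""))
        writer.writerow(row)


def fake_lookup(title, year):
    # Stand-in for a slow remote lookup; the values are invented.
    data = {("Film A", "1990"): {"user_rating": 8.1, "runtime_min": 120}}
    return data.get((title, year))


# In-memory files keep the example self-contained.
source = io.StringIO("title,year,nominations\nFilm A,1990,7\nFilm B,1991,5\n")
output = io.StringIO()
augment_csv(source, output, fake_lookup, ["user_rating", "runtime_min"])

result_rows = list(csv.DictReader(io.StringIO(output.getvalue())))
print(result_rows)
```

Because the lookup is slow, writing every available field in one pass avoids having to query the same movies again later.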


This seemed less interesting than I was hoping for. Then I started to wonder about the relationship between user ratings, box office returns, and whether a movie won or not. I was able to create a more data-dense display by combining these four variables into a single chart. Although circle sizes aren't great for comparing data, here I think they work relatively well because they show a relatively small range of discrete values. This display brings out some interesting details. The most highly rated nominee on IMDB (The Shawshank Redemption) didn't win best picture. The Hurt Locker, one of two favorites in this year's race, has one of the lowest box office totals across all the movies depicted. Again, I left out the four outliers that were making it difficult to see trends in the rest of the data (Avatar, ET, Titanic, Indiana Jones).


This chart does illustrate one problem with my approach, which is that I started with movies that were nominated for best picture. A better approach would be to start with all movies, then show whether each critical or box office hit was recognized by the Academy. If I took this approach, it would be better to use a measure of critical success other than number of nominations, because that number appears to be biased towards large productions: only productions that cost tens of millions of dollars are nominated for best actor/actress, best sound editing, best visual effects, etc. I investigated using aggregate critics' ratings from Rotten Tomatoes or Metacritic, but neither of these sites provided its data in an easily consumed format.

Note: I realized at the end of this process that I could have used Freebase and its API to supplement the data I have. Probably the most interesting addition here would be production budget.

Final Dataset

My dataset: File:Oscars best picture.xls.txt Note: to use this file you have to remove the .txt extension. MediaWiki does not support Excel documents.

Side notes

Increasing Runtime?

While I was reading up on the Academy Awards and the best picture award, I read the following comment in Wikipedia: "Another point of contention is the recent inclination toward long films". This seemed like a statement I could analyze with the data I had assembled, so I produced the following chart.


It looked like the average runtime of movies was indeed trending upward (although there is a period in the 60s when the average runtime is quite high). I decided to split the movies into categories based on whether a movie won or was merely nominated:


Based on this chart, it does appear that the Academy favors longer-running movies over the average for the set of films in a given year. Why are there two gaps? (On the left) there were no Oscars awarded in 1933. (On the right) there was an accidental null point in my dataset because of formatting of data from IMDB.
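
For anyone who wants to reproduce this comparison outside Tableau, here is a minimal sketch of the grouping, assuming each film is a record with `year`, `runtime_min`, and `won` fields (hypothetical names). The sample films are placeholders, not rows from my dataset. Films with a missing runtime are skipped, which is how a null value ends up as a gap in the chart.

```python
from collections import defaultdict


def mean_runtime_by_year(films):
    """Return {year: {"winner": mean or None, "nominee": mean or None}}."""
    buckets = defaultdict(lambda: {"winner": [], "nominee": []})
    for film in films:
        if film["runtime_min"] is None:
            continue  # a missing runtime leaves a gap rather than a zero
        group = "winner" if film["won"] else "nominee"
        buckets[film["year"]][group].append(film["runtime_min"])

    result = {}
    for year in sorted(buckets):
        result[year] = {
            group: (sum(values) / len(values) if values else None)
            for group, values in buckets[year].items()
        }
    return result


# Placeholder films (made up for illustration only).
films = [
    {"year": 2000, "title": "Film A", "runtime_min": 150, "won": True},
    {"year": 2000, "title": "Film B", "runtime_min": 100, "won": False},
    {"year": 2000, "title": "Film C", "runtime_min": 120, "won": False},
    {"year": 2001, "title": "Film D", "runtime_min": None, "won": True},
    {"year": 2001, "title": "Film E", "runtime_min": 110, "won": False},
]

averages = mean_runtime_by_year(films)
for year, groups in averages.items():
    print(year, groups)
```

A year whose winner has no runtime shows up with `None` for the winner series, matching the kind of gap described above.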

Nominees and Winners by Rating

A couple of years ago I did a small data graphic looking at the MPAA rating for best picture nominees and winners. Since I have now assembled a dataset with more information than I had then, I thought I would see what the data looked like in Tableau. It's worth noting some historical changes that affect the data here.

