From CS294-10 Visualization Sp11
I want to explore the conditions that foster or hinder innovation in organizations. However, few datasets compare different organizations or companies. I tried the Department of Labor, the United Nations, NationMaster, StateMaster, and several industry-specific statistics, to no avail. Finally, I found that the OECD has statistics from about 30 nations down to the region level. Although it is not as granular as company-level data, it is divided by region (there are several regions in every state) and will hopefully allow me to draw some conclusions and find interesting correlations.
"Innovation" is an elusive and fuzzy term. Thus, it may be hard to measure or compare. However, the number of patent applications can be used as a proxy to measure innovation.
I downloaded the dataset called Innovation Indicators at region-level (TL3) from the OECD website. The data initially only contained the number of patent applications in the last ten years. I later realized that other dimensions were also available, so I included patent applications per million inhabitants and patent applications by industry (green, ICT, biotech, and nanotech) as well. I also increased the range to include data from the last twenty years.
How has the mix of patents changed over time?
Is there a relationship between the number of employees in an organization and the number of patent applications? Has this relationship changed over time?
I realized that the original dataset did not contain enough information to draw interesting conclusions. I kept looking at the OECD website and found that it also contained demographic and economic information about all regions. It probably contains too much information, but I downloaded it all nevertheless. It took four different SQL statements to get all the information I wanted (because the OECD site limits every query to 1 million datapoints). Later, I used Google Refine and the command line to combine the four results into a single file.
After looking at the raw data, I noticed that not all datapoints are available for all the regions. In some cases, the information is only aggregated at the state or country level.
It is also odd that every datapoint sits in its own row, rather than all the datapoints for a region and year sharing a single row. For example, there is one row for occupation rate in Northern California in 1999, one row for the number of ICT companies in Northern California in 1999, and so on.
Here's one example:
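To make the row layout concrete, here is a minimal sketch in Python of how such one-datapoint-per-row data can be collapsed into one record per region and year. The field names (region, year, indicator, value) and all of the numbers are invented for illustration; they are not the actual OECD headers or figures.

```python
from collections import defaultdict

# Hypothetical long-format rows: one row per (region, year, indicator).
# Field names and values are made up; the real OECD export differs.
long_rows = [
    {"region": "Northern California", "year": 1999,
     "indicator": "occupation_rate", "value": 0.71},
    {"region": "Northern California", "year": 1999,
     "indicator": "ict_companies", "value": 1200},
    {"region": "Northern California", "year": 2000,
     "indicator": "occupation_rate", "value": 0.72},
]


def pivot_wide(rows):
    """Collapse long rows into one dict of indicators per (region, year)."""
    wide = defaultdict(dict)
    for row in rows:
        key = (row["region"], row["year"])
        wide[key][row["indicator"]] = row["value"]
    return dict(wide)


wide = pivot_wide(long_rows)
for (region, year), indicators in sorted(wide.items()):
    print(region, year, indicators)
```

Indicators that are missing for a given region and year are simply absent from that record, which mirrors the gaps in the raw data.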
I had a hard time trying to link the two datasets: the first dataset contained the patent data and the second dataset contained the economics and demographics data. Both datasets had the same fields, but they were in different order and had different titles. I tried joining tables in Tableau, having a multiple table dataset, and using Google Refine. Finally, I created a custom SQL statement to put the right fields (measures and dimensions) where they belonged.
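The idea behind that statement can be sketched in Python rather than SQL: map each dataset's own column titles onto one shared set of names, then stack the rows. The column titles and sample rows below are hypothetical, since the real headers are not reproduced here.

```python
# Hypothetical column titles for the two exports, each mapped to shared names.
PATENT_COLUMNS = {"Region": "region", "Year": "year",
                  "Indicator": "indicator", "Value": "value"}
ECON_COLUMNS = {"TL3 Name": "region", "Time": "year",
                "Variable": "indicator", "Obs": "value"}


def normalize(rows, mapping):
    """Rename the keys of each row using mapping; unmapped columns are dropped."""
    return [{mapping[k]: v for k, v in row.items() if k in mapping}
            for row in rows]


# Made-up sample rows; note the different titles and column order.
patent_rows = [{"Region": "Tokyo", "Year": 2003,
                "Indicator": "patents", "Value": 950}]
econ_rows = [{"Obs": 41000, "Variable": "establishments",
              "Time": 2003, "TL3 Name": "Tokyo"}]

combined = (normalize(patent_rows, PATENT_COLUMNS)
            + normalize(econ_rows, ECON_COLUMNS))
for row in combined:
    print(row)
```

Because both datasets end up with identical field names, the combined rows can be treated as a single table of measures and dimensions.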
When analyzing the raw data, I found some inconsistencies. Although the OECD claimed to have data from 1990 onwards, I discovered that, for some countries, the information started in 1995 or later. Moreover, the data for 2008 was mostly unavailable.
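A quick way to spot this kind of gap is to compute, per country, the first and last year that actually carry a value. The sketch below is a hypothetical Python version of that check; the country labels, field names, and numbers are placeholders, not OECD data.

```python
def coverage(rows):
    """Return {country: (first_year, last_year)} over rows that have a value."""
    span = {}
    for row in rows:
        if row["value"] is None:  # skip years with no reported value
            continue
        year = row["year"]
        first, last = span.get(row["country"], (year, year))
        span[row["country"]] = (min(first, year), max(last, year))
    return span


# Placeholder rows: country "A" reports from 1990, country "B" only from 1995.
rows = [
    {"country": "A", "year": 1990, "value": 10},
    {"country": "A", "year": 2007, "value": 12},
    {"country": "A", "year": 2008, "value": None},
    {"country": "B", "year": 1990, "value": None},
    {"country": "B", "year": 1995, "value": 3},
]

spans = coverage(rows)
for country, (first, last) in sorted(spans.items()):
    print(country, first, last)
```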
Attached are a couple of pictures of the aggregated tables and a time-only bar-chart that shows the aggregated data for all countries and all types of measures.
I was mostly interested in identifying situations that influence innovation (and hence, patent applications) at the regional level, so I used Tableau to find the specific dates and regions where the data would be applicable for that purpose. I used a map to show the regions with the most patents over time and a "small multiples" version to show the differences from year to year. I had to make some tweaks because the original data used non-standard country codes. I manually changed them in Tableau in order to display the following map.
I also plotted the information in a bar chart and found that the USA, Germany, and Japan are the countries with the most information available.
I wanted to relate the number of patents to other variables at the regional level. Sadly enough, after playing around with the data I learned that only the data from Japan (and only from 2001 to 2006) contained the number of companies per industry, the number of employees per industry, and the patents at the regional level.
I started drawing some correlations. Plotting the number of establishments versus the number of patents didn't show any interesting data. It looks like, as one would suspect, the more establishments (companies) in a region, the more patents will be published.
Later, I explored the relationship between the number of employees per company (establishment) and the patents generated. I created a ratio in Tableau that computed the average number of employees per company. This wasn't a trivial task, as the data for the same region was spread across several rows. However, after experimenting with aggregation functions and calculated fields, I was able to generate the ratio. I didn't find any surprising correlations here. Regions whose companies have more employees (on average) tend to generate more patents.
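The aggregation step can be sketched outside Tableau as well. The Python below is an assumption-laden illustration, not the calculated field actually used: it sums employees and establishments across all rows for a region and year, then divides the totals. Field names and figures are invented.

```python
from collections import defaultdict

# Invented rows: several rows per region and year, one per indicator/industry.
rows = [
    {"region": "R1", "year": 2001, "indicator": "employees",
     "industry": "ICT", "value": 5000},
    {"region": "R1", "year": 2001, "indicator": "employees",
     "industry": "biotech", "value": 1000},
    {"region": "R1", "year": 2001, "indicator": "establishments",
     "industry": "ICT", "value": 100},
    {"region": "R1", "year": 2001, "indicator": "establishments",
     "industry": "biotech", "value": 50},
]


def employees_per_establishment(rows):
    """Sum each indicator per (region, year), then divide the totals."""
    totals = defaultdict(lambda: defaultdict(float))
    for r in rows:
        totals[(r["region"], r["year"])][r["indicator"]] += r["value"]
    ratios = {}
    for key, t in totals.items():
        if t.get("establishments"):  # skip regions with no establishment count
            ratios[key] = t["employees"] / t["establishments"]
    return ratios


ratios = employees_per_establishment(rows)
print(ratios)
```

Dividing the summed totals, rather than averaging per-industry ratios, keeps large industries from being weighted the same as small ones.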
After coming to some conclusions at the regional level, I wanted to validate whether the same held true for countries as a whole. Therefore, I plotted the same Employee/Company ratio versus patents at the national level. The data here showed something interesting: there seems to be a tipping point in the number of people a company can have and still be innovative. If a company gets too big, its size may actually hurt its ability to innovate.
I was initially puzzled because the information was contradictory and I could not find any hard evidence that could help companies or organizations innovate better. However, I finally found that all these different graphs started to tell a story. I discovered that the size of the organization affects not innovation in general, but the pace of innovation. In order to confirm this last claim, I created a new ratio (patents per person) and plotted it against the average company size (persons per establishment). What I discovered is that very small organizations do not innovate as fast as medium-sized organizations. This makes sense, since more people are more likely to come up with new ideas and to recombine existing ideas. However, size starts to take a toll when an organization gets too big. When organizations exceed a certain size, relationships aren't as easy as before and innovation slows down. The number of patents generated per person in the organization diminishes. Once an organization gets bigger than about 200 employees, its patents per capita start looking worse than those of the very small organizations.
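For reference, the two ratios plotted against each other can be written down as a small Python sketch. The function name and the three sample rows are hypothetical and the numbers are invented purely to show the calculation; they are not results from the OECD data.

```python
def innovation_ratios(patents, employees, establishments):
    """Return (persons per establishment, patents per person)."""
    if employees <= 0 or establishments <= 0:
        raise ValueError("employees and establishments must be positive")
    return employees / establishments, patents / employees


# Invented totals per region: (patents, employees, establishments).
regions = {
    "small-firm region": (20, 1000, 200),
    "medium-firm region": (300, 5000, 100),
    "large-firm region": (30, 3000, 10),
}

# Sort by average company size so the points read left to right on the x axis.
points = sorted(innovation_ratios(*totals) for totals in regions.values())
for size, per_person in points:
    print(f"{size:7.1f} persons/establishment  {per_person:.3f} patents/person")
```

Each resulting pair is one point in the scatter plot: average company size on the horizontal axis and patents per person on the vertical axis.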