From CS294-10 Visualization Fa07
There is a long-running dispute about how to judge a pitcher's performance. We are often impressed by power pitchers who can throw a 100 mph fastball and rack up strikeouts easily. By contrast, we tend to overlook a pitcher's stability, which may be the key to long-term success. I will visualize some baseball statistics (http://en.wikipedia.org/wiki/Baseball_statistics) to find out which factors determine whether a pitcher is good.
Which factor better determines a pitcher's level of success: power or stability?
The data come from The Baseball Archive (http://baseball1.com/statistics/). The dataset is an Excel file that contains the complete career records of all pitchers who have ever pitched in Major League Baseball. It includes basic pitching statistics such as W (wins), L (losses), ERA (earned run average), SO (strikeouts), BB (bases on balls, also called "walks"), and so on.
The goal is to investigate whether power or stability better determines a pitcher's performance. First I pick the statistic that best represents each: 1. SO (Strikeouts). The number of strikeouts indicates the power of a pitcher, since power pitchers tend to record many strikeouts. 2. BB (Walks). The number of walks indicates the stability of a pitcher. A stable pitcher has good control, which lets him throw the ball into the strike zone consistently. In general, a stable pitcher issues fewer walks.
Strikeouts and Walks are used here as the variables that influence a pitcher's performance. Performance is represented by ERA, the average number of earned runs a pitcher gives up per nine innings. The lower the ERA, the better the pitcher. So I will visualize two relations: Strikeouts versus ERA and Walks versus ERA.
I use Tableau for the visualization.
1. The relation between Strikeouts and ERA:
The initial visualization is not satisfactory because there is a problem with the underlying statistic. The x-axis is the total number of strikeouts (SO) recorded by a pitcher. The range of x values is very wide: many pitchers recorded fewer than 100 strikeouts, while some recorded several thousand. We can't learn much from this, since strikeout totals are cumulative and mostly reflect how long a pitcher played. So I normalize strikeouts by the number of innings pitched and change the x variable to K/9 (the average number of strikeouts a pitcher records per nine innings). The original Excel file doesn't have a K/9 column, so I added it by dividing SO by IP (innings pitched) and multiplying by 9. The modified graph looks like this:
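The per-nine-innings normalization described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic (SO divided by IP, times 9). The function name `per_nine` and the sample totals are made up for the example, and it assumes IP is stored as a plain decimal number of innings.

```python
def per_nine(count, innings_pitched):
    """Return the rate of an event (strikeouts, walks, ...) per nine innings.

    Assumes innings_pitched is a plain decimal number of innings.
    """
    if innings_pitched <= 0:
        raise ValueError("innings_pitched must be positive")
    return 9.0 * count / innings_pitched


# Made-up career totals, for illustration only (not taken from the dataset).
career_strikeouts = 1800
career_innings = 2000.0

k9 = per_nine(career_strikeouts, career_innings)
print(f"K/9 = {k9:.2f}")
```

The same function gives BB/9 when it is passed a walk total instead of a strikeout total.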
The graph looks better, but it is still problematic because it contains distracting outliers. Some pitchers have more than 15 strikeouts per nine innings, and some have none. These cases are rare and come only from pitchers who appeared in the major leagues very briefly, say for only a few innings. To visualize the relationship more effectively, I added a filter that removes pitchers who pitched fewer than 200 innings. The improved graph looks like this:
In the visualization above, it is hard to see any correlation between strikeouts and ERA. This may imply that the number of strikeouts doesn't necessarily determine the quality of a pitcher.
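The 200-inning filter was applied in Tableau, but the same idea can be sketched in Python to show why short careers produce extreme rates. The rows below are invented for illustration, and the field names (`name`, `IP`, `SO`) and the helper `filter_by_innings` are assumptions, not the dataset's actual layout.

```python
MIN_INNINGS = 200  # threshold used in the write-up

# Invented sample rows, for illustration only.
pitchers = [
    {"name": "A", "IP": 2.0, "SO": 4},        # tiny sample: K/9 would be 18
    {"name": "B", "IP": 1500.0, "SO": 1200},
    {"name": "C", "IP": 199.0, "SO": 150},    # just under the threshold
    {"name": "D", "IP": 200.0, "SO": 100},    # exactly at the threshold
]


def filter_by_innings(rows, min_innings=MIN_INNINGS):
    """Keep only pitchers with at least min_innings innings pitched."""
    return [row for row in rows if row["IP"] >= min_innings]


kept = filter_by_innings(pitchers)
for row in kept:
    k9 = 9.0 * row["SO"] / row["IP"]
    print(row["name"], round(k9, 2))
```

Pitcher A, with 4 strikeouts in 2 innings, is exactly the kind of outlier the filter removes.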
2. The relation between Walks and ERA:
As with Strikeouts-ERA, the visualization of the Walks-ERA relationship uses BB/9 (the average number of walks per nine innings) as the x variable. The original dataset doesn't contain this column, so I added a new BB/9 column by dividing BB (number of walks) by IP (innings pitched) and multiplying by 9. Pitchers who pitched fewer than 200 innings are filtered out. The graph looks like this:
The graph above shows a positive correlation between walks and ERA: a pitcher's ERA rises along with his number of walks per nine innings. This may imply that the walk rate helps determine the quality of a pitcher.
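The write-up judges correlation by eye from the scatter plot. One way to put a number on it is the Pearson correlation coefficient, sketched below. This is a swapped-in technique, not something the original analysis computed, and the BB/9 and ERA values are made up for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / math.sqrt(var_x * var_y)


# Made-up values, for illustration only (not taken from the dataset).
bb9 = [1.5, 2.5, 3.0, 4.0, 5.0]
era = [2.9, 3.4, 3.8, 4.3, 5.1]

r = pearson_r(bb9, era)
print(f"r = {r:.3f}")
```

A value near +1 indicates a strong positive linear relationship, a value near 0 indicates little or none, and a value near -1 indicates a strong negative one.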
3. The relation between WHIP and ERA:
Now I know that walks are more influential than strikeouts. But how influential are they? I would like to compare walks with another popular statistic, WHIP (walks plus hits per inning pitched). WHIP is considered a powerful indicator of a pitcher's performance. Here is the visualization of the relation between WHIP and ERA:
The graph above shows a clearly strong correlation between WHIP and ERA.
To compare the three factors, I place the visualizations side by side to see the difference:
As expected, WHIP strongly influences a pitcher's ERA, and BB/9 also influences it. K/9 is not a determining factor for ERA.
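For reference, WHIP follows directly from its name: walks plus hits, divided by innings pitched. The sketch below shows the arithmetic. The function name `whip` and the season totals are made up, and it assumes a hits column is available alongside BB and IP.

```python
def whip(walks, hits, innings_pitched):
    """Walks plus hits per inning pitched."""
    if innings_pitched <= 0:
        raise ValueError("innings_pitched must be positive")
    return (walks + hits) / innings_pitched


# Made-up totals, for illustration only (not taken from the dataset).
value = whip(walks=50, hits=180, innings_pitched=200.0)
print(f"WHIP = {value:.2f}")
```

Because WHIP counts walks in its numerator, it partly overlaps with BB/9, which is worth keeping in mind when comparing the two against ERA.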
 Final Visualization
For the final visualization, trend lines are added to the three graphs.
Figure 1. Factors that influence a pitcher's performance. The y-axis shows ERA, and the x-axes show BB/9, K/9, and WHIP, respectively. The lines depict the correlation between x and y.
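The trend lines in the figure were drawn by Tableau. The source does not say which model was used, so the sketch below assumes a straight line fitted by ordinary least squares. The function name `fit_line` and the data points are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up points, for illustration only (not taken from the dataset).
bb9 = [1.5, 2.5, 3.0, 4.0, 5.0]
era = [2.9, 3.4, 3.8, 4.3, 5.1]

slope, intercept = fit_line(bb9, era)
print(f"ERA ~ {slope:.2f} * BB/9 + {intercept:.2f}")
```

A slope close to zero corresponds to an almost horizontal trend line, while a clearly positive slope means ERA tends to rise with the x variable.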
The visualization shows that walks (BB/9) are a good indicator of whether a pitcher is good or bad. Strikeouts (K/9) alone contribute little to the performance evaluation, as the trend line is almost horizontal. The rightmost graph shows that WHIP is a strong indicator of a pitcher's performance, as expected. Comparing walks with WHIP, we can see that the influence of walks alone is weaker but still significant.
Thus we can conclude that a pitcher's stability (estimated by his walk rate) is a much better indicator of his performance than his power (estimated by his strikeout rate). Although people like to watch power pitchers who record lots of strikeouts, the hidden fact is that a stable pitcher with better control is more likely to dominate the game.