# A4-BrienColwell

## Statistical Analysis Assignment

My question:

Does the distribution of the presented values affect the viewer's accuracy?

My hypothesis:

The viewer's error should increase as the distribution (spread) of the values decreases. That is, the error and the distribution of values should be negatively correlated. I think this because outliers help the viewer make comparisons.

I chose this question because I noticed I was more accurate with the bar and pie charts, and to a lesser degree with the tables, when there were large outliers in the data. I think this is because, with more non-uniform data, I could usually tell or guess the ordering from the first label I found (it took me a while to find the labels on the pie charts). For example, if the first label has more than half the pie, it must be the largest. Or, if the first label is small and the rest are large, then it is unlikely to be the largest, and vice versa if the first label is large.

The independent variable of my experiment is the distribution of the presented values (which ranges on [0,inf)), which will be represented by their standard deviation. The dependent variable is the error (which ranges on {0,1}).
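
As a minimal sketch of how the independent variable could be computed, the snippet below takes the values shown in one trial and returns their standard deviation. The trial data and the `spread` helper are hypothetical, and the choice of the population form (`pstdev`) rather than the sample form is an assumption; the spreadsheet function actually used is not recorded here.

```python
import statistics


def spread(values):
    """Standard deviation of the values presented in one trial.

    Assumption: population standard deviation. Swap in statistics.stdev
    if the sample form was used instead.
    """
    return statistics.pstdev(values)


# Hypothetical trials: one nearly uniform, one with a large outlier.
uniform_trial = [20, 21, 19, 20, 20]
outlier_trial = [5, 5, 5, 5, 80]

print("uniform trial spread:", spread(uniform_trial))
print("outlier trial spread:", spread(outlier_trial))
```

The outlier trial produces a much larger spread than the near-uniform one, which is the property the hypothesis relies on.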

The Pearson product moment correlation coefficient r (computed using Open Calc's PEARSON function) is -0.08. This indicates a negative correlation; however, it is not significant. Thus the data do not fit a linear regression well. Logarithmic regression (using Microsoft Office) gives a similarly insignificant correlation, with an R squared value of -0.02.
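
For readers without a spreadsheet at hand, the correlation coefficient can be sketched in plain Python. The `pearson_r` function and the four data points below are illustrative assumptions, not the experiment's data; the formula is the standard product moment definition.

```python
import math


def pearson_r(xs, ys):
    """Pearson product moment correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: spread of each trial and whether the answer was wrong (1) or right (0).
spreads = [2.0, 4.0, 6.0, 8.0]
errors = [1, 1, 0, 0]

print("r =", pearson_r(spreads, errors))
```

With a binary dependent variable such as the error here, this value is also known as the point-biserial correlation.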

To investigate further, the distribution data was separated into two groups: those with error 1 and those with error 0. The number of values in each group was truncated to 620. The distributions of the groups were not measured; they were assumed to be approximately normal. The groups are summarized below.
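
The grouping step might look like the sketch below. The `split_by_error` helper is hypothetical, and keeping the first `size` values of each group is an assumption; how the values were actually chosen for truncation is not stated.

```python
def split_by_error(spreads, errors, size=None):
    """Split spreads into (error 0, error 1) groups and truncate both to one size.

    Assumption: truncation keeps the first `size` values of each group.
    If `size` is None, the smaller group's length is used.
    """
    group_a = [s for s, e in zip(spreads, errors) if e == 0]
    group_b = [s for s, e in zip(spreads, errors) if e == 1]
    if size is None:
        size = min(len(group_a), len(group_b))
    return group_a[:size], group_b[:size]


# Hypothetical data; in the analysis above, size would be 620.
spreads = [1.0, 2.0, 3.0, 4.0, 5.0]
errors = [0, 1, 0, 0, 1]

group_a, group_b = split_by_error(spreads, errors)
print("Group A:", group_a)
print("Group B:", group_b)
```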

| | Group A (distributions with error 0) | Group B (distributions with error 1) |
| --- | --- | --- |
| mean | 12.60 | 11.50 |
| standard deviation | 5.74 | 5.37 |
| median | 9.95 | 8.98 |
| average abs. deviation from mean | 4.74 | 3.97 |

Table 1: Basic statistical analysis of the two sample sets

Figure 1: (left to right) A plot of the mean with standard deviation marked; the same plot with the data "swarm" shown; and a box plot of the two sample sets
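
The four statistics reported for each group can be reproduced with Python's standard library. This is a sketch only: the `summarize` helper and the sample are hypothetical, and the use of the sample standard deviation (`statistics.stdev`) is an assumption.

```python
import statistics


def summarize(sample):
    """Return the four summary statistics reported for each group."""
    mean = statistics.mean(sample)
    return {
        "mean": mean,
        # Assumption: sample (n - 1) standard deviation.
        "standard deviation": statistics.stdev(sample),
        "median": statistics.median(sample),
        "average abs. deviation from mean": sum(abs(x - mean) for x in sample) / len(sample),
    }


# Hypothetical sample, not the experiment's data.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
for name, value in summarize(sample).items():
    print(f"{name}: {value:.2f}")
```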

These numbers support the hypothesis. The mean distribution for errors is lower than that for non-errors. Interestingly, the distributions for errors are more concentrated near the mean.

Although the groups were separated using a dependent random variable*, an unpaired T test may still offer some insight. We have two pseudo-independent* sample sets, and we want to know whether the difference of their means is statistically significant. The unpaired T value (using [0]) is 3.49 (with 620 degrees of freedom), which gives the probability of the difference of the means occurring by chance as ~0.05%. This is statistically significant.
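
As a rough cross-check, the unpaired T value can be recomputed from the rounded summary statistics above. The sketch below assumes a pooled-variance (equal variance) test and approximates the two-tailed probability with a normal distribution, which is reasonable only because the groups are large; the online tool may compute both differently. The function names are hypothetical. In this pooled formulation the degrees of freedom are n_a + n_b - 2.

```python
import math


def unpaired_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Pooled-variance unpaired t statistic and its degrees of freedom."""
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2) / df
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean_a - mean_b) / std_err, df


def two_tailed_p_normal_approx(t):
    """Two-tailed probability using a normal approximation (assumes large df)."""
    return math.erfc(abs(t) / math.sqrt(2))


# Rounded values from the summary of Group A and Group B, 620 values each.
t, df = unpaired_t(12.60, 5.74, 620, 11.50, 5.37, 620)
p = two_tailed_p_normal_approx(t)
print(f"t = {t:.2f}, df = {df}, two-tailed p = {p:.4%}")
```

Because the inputs are rounded to two decimals, the result should land close to, but not exactly on, the reported value.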

(I think the unpaired T test is the one to use here. However, the two-tailed, paired T test (Open Calc's TTEST function) also gives a probability of ~0.05%. The one-tailed, paired T test gives a probability of ~0.03%. These values assume the means are paired. Both are statistically significant.)

In conclusion, although there is no strong correlation between the independent variable and the dependent variable, there seems to be a relation. The validity of the unpaired T test result above is questionable because the experiment -- the way I separated the values -- may be wrong; however, it is interesting that it strongly supports the hypothesis.

The above analysis was performed using Open Calc and the [0] Online Tools for Science (provided by St. John's University) at http://www.physics.csbsju.edu/stats.