Quantitative Evaluation

From CS 160 User Interfaces Sp10

  • Measuring the User Experience: Collecting, Analyzing, and Presenting Usability Metrics. Ch 2. Tom Tullis and Bill Albert.

Mattkc7 - Mar 11, 2010 07:27:00 pm

Alexander Sydell - 3/6/2010 15:30:19

The statistical tests in these readings seem to apply mostly for larger data sets. However, the user testing methods that we have covered so far tend to involve at most 5 or so users in a test due to time/budget concerns. What kind of user interface testing techniques can be used alongside statistical testing? Is statistical analysis better suited for something like finding out the likes/dislikes of a target user group before designing an interface, or for gathering opinions on a more mature interface from a larger group of testers?

Wei Wu - 3/7/2010 13:54:10

Martin's run-through of the ways to design and run an experiment reveals many intricacies to consider that I had previously never thought of before. After reading through an overview of the entire process, from defining the dependent and independent variables to designing the actual experiment to statistical ways of actually interpreting the data, it seems to me that the most difficult factor to maintain in an experiment is its "validity." This notion of validity is mostly established through the method in which it is carried out, particularly in the choice of test subjects, and the use of control variables. Both these elements must be chosen to eliminate any confounding factors that make it difficult to say that the independent variable actually caused a change in the dependent variable. At the same time, the experiment must also be broad enough to reach a generalized conclusion.

Thus, the majority of time in the experiment process should be spent in scrutinizing the design of the experiment to account for the above points so that the results are "valid" and accepted by the community. While quantifying the data is relatively easy given the many statistical tools that have already been established, it is the methods by which the data is procured in an experiment that determines the experiment's success.

Daniel Ritchie - 3/7/2010 23:50:50

From what I've read, it seems that most HCI experiments measure their dependent variables by verbal or written report from the test subjects (for instance, "thinking out loud" during Wizard of Oz testing). Though I understand that the equipment could prove prohibitively bulky, I wonder if quantitative physiological measurements such as fMRI have also been used. I can think of several interesting applications of the technology for HCI research: for instance, by comparing MRI images of a user interacting with, e.g., a tablet PC to those of the same user writing on a piece of paper, it might be possible to observe how well the tablet interface mimics the pen-and-paper experience. In other words, it might provide a quantitative measure of just how direct a direct manipulation interface really is.

Matt Vaznaian - 3/9/2010 14:04:13

I think that the best method for determining test reliability is alternative form. I agree with the table that test-retest can produce skewed results when the same test is given multiple times, which is why I like the idea of an alternative test. I don't really think using different items lowers reliability. If the two tests are conducted in such a way that they are essentially identical in what their goals are but seem different from a subject's point of view, then they achieve reliability.

Eric Fung - 3/9/2010 17:04:06

The idea of having pilot experiments may not be totally intuitive for people unfamiliar with science testing. If you're going to run an experiment, shouldn't you get it right the first time? But the purpose of having the pilot is much like ironing out errors in a prototype before displaying it to customers: to make key adjustments early on and get an idea of the experience before it costs too much to change.

I think people also frequently lose sight of the idea of statistical significance when considering experiment results. If your interface cuts down on 1 second of interaction time, it can be considered a great improvement or a meaningless blip. The numbers need to be placed in the proper context--you have much more confidence in your experiment's results after a statistical analysis.
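To make the "great improvement or meaningless blip" point concrete, here is a minimal sketch of an exact permutation test. It is not taken from the readings; the function name and the task times are hypothetical, chosen only to show that the same 1-second difference in means can look very different depending on how noisy the measurements are.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(a, b):
    """Two-sided exact permutation test on the difference of group means.

    Returns the fraction of all ways to split the pooled data into groups
    of the original sizes whose mean difference is at least as large (in
    absolute value) as the one actually observed.
    """
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    n = len(pooled)
    hits = 0
    total = 0
    for idx in combinations(range(n), len(a)):
        chosen = set(idx)
        group_a = [pooled[i] for i in idx]
        group_b = [pooled[i] for i in range(n) if i not in chosen]
        # Small tolerance guards against floating-point rounding.
        if abs(mean(group_a) - mean(group_b)) >= observed - 1e-12:
            hits += 1
        total += 1
    return hits / total


# Hypothetical task times in seconds; both pairs differ by 1 s on average.
noisy_old, noisy_new = [30, 42, 25, 38], [29, 41, 24, 37]
tight_old, tight_new = [30.0, 30.2, 29.9, 30.1], [29.0, 29.2, 28.9, 29.1]

print("noisy data p-value:", permutation_p_value(noisy_old, noisy_new))
print("tight data p-value:", permutation_p_value(tight_old, tight_new))
```

With the noisy times, most reshufflings produce a difference at least as large as 1 second, so the improvement is indistinguishable from chance; with the tightly clustered times, almost none do.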

Annette Trujillo - 3/9/2010 18:51:18

Pilot experiments remind me a lot of heuristic evaluations and lo-fi prototype testing on users. Particularly, I think pilot experiments relate to what our group did: before testing our lo-fi prototype with users, we tested it with ourselves, and in this way we made the necessary changes to make sure that our user testing is more effective and successful. This included removing some errors that may have messed up our testing statistics, such as something that may be more familiar to one user we are testing than to another user we are testing. If we modified this thing to make it easier for both users to understand it equally, this makes our statistics unbiased.

Vinson Chuong - 3/9/2010 20:36:59

In some aspects, I would say that the "quantitative" in quantitative evaluation is a misnomer. While it is true that the quantitative evaluations described in the readings--controlled experiments--produce some kind of numerical quantity, exactly what those numbers quantify may or may not be useful. As the readings indicated, there are a lot of pitfalls and trade-offs with regard to extracting useful data from a controlled experiment.

This set of readings should remind us that the empirical data we've been taking at face value thus far may or may not be valid depending on the circumstances under which they were extracted. That being said, having at least some indication of correlation is better than having complete uncertainty.

David Zeng - 3/9/2010 22:15:59

I found the section on how to analyze data more interesting than the section on how to choose variables. The analysis of data falls largely into the category of statistics and economics. Besides the methods that they mentioned, there are much more advanced techniques which allow people to get rid of outliers, which may skew the data. Another technique which may alleviate some of the pressure in choosing variables is to control for certain things. You can do this by setting up dummy variables that test for whether a trait is active or not. Not only does this allow the researcher to get more data, it allows them to see the effect of that trait and whether it has an effect on the overall result.
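As a rough illustration of the dummy-variable idea, the sketch below fits y = b0 + b1 * d by ordinary least squares, where d is 1 when a trait is present and 0 otherwise. The trait ("has used a touchscreen before"), the function name, and the numbers are all assumptions made up for this example. With a single dummy, b0 comes out as the mean of the group without the trait and b1 as the difference between the two group means.

```python
from statistics import mean


def fit_with_dummy(y, dummy):
    """Least-squares fit of y = b0 + b1 * dummy for a 0/1 dummy variable."""
    d_bar = mean(dummy)
    y_bar = mean(y)
    sxy = sum((d - d_bar) * (v - y_bar) for d, v in zip(dummy, y))
    sxx = sum((d - d_bar) ** 2 for d in dummy)
    b1 = sxy / sxx          # estimated effect of the trait
    b0 = y_bar - b1 * d_bar  # baseline for participants without the trait
    return b0, b1


# Hypothetical task times (seconds) and a hypothetical trait flag.
task_time = [10, 12, 14, 6, 8]
used_touchscreen = [0, 0, 0, 1, 1]

b0, b1 = fit_with_dummy(task_time, used_touchscreen)
print("baseline time:", b0)
print("change when trait is present:", b1)
```

In a real analysis the dummy would usually sit alongside the independent variable of interest in a multiple regression; this one-variable version only shows what the coefficient means.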

Kathryn Skorpil - 3/9/2010 22:31:03

The first thing I thought of when I read Chapter 7 from David Martin's book regarding "validity" was the SATs. Back when I took the SATs it was out of 1600 and it felt like it was the most important thing in the world. Looking back now, it really isn't that important except for colleges who need some kind of "standard" for admitting their students. While I believe that the SATs are not accurate to determine the success of a student, I can't think of another way to "standardize" data like a student's success. For example, I have a friend who did horribly on the SATs, particularly in the verbal portion, but now he is working at a successful magazine company as an editor. Obviously for him the SATs did not matter in the long run. I did fairly well on the SATs, but my grades at Berkeley are certainly not excellent. This is why validity, while helpful in giving a good sense of how reliable a test may be, is also not the final say in the matter. (Ironically, my statements are only based on a handful of people I know, so I suppose I would need to get more subjects to actually prove my opinion.)

Calvin Lin - 3/10/2010 2:17:52

When it comes to tests where there is no clear way to measure the dependent variable (such as aggressiveness), there is an endless journey of refining such a measurement if a researcher gets too caught up in technicality. There is no universal definition of aggressiveness, and as a result, what should be measured is going to vary among different people. A researcher may never be satisfied with the results because he may think his measuring sticks of the dependent variable are not sufficiently accurate. Clearly, a non-scientific judgment call has to be made at some point. Researchers have to resort to using their common sense and general understanding of how people behave, such as what constitutes aggressive behavior.

A problem with the TV-aggression example is that it would be easy to make your focus too narrow. In Martin’s discussion of determining how to measure aggression, he fails to consider other factors that could lead to aggression. Examples include: abusive parents, bullies at school, influence of friends. If a precise experiment is run and it produces good results that point to a cause-effect relationship existing, it’s easy to interpret this as a conclusion. However, this could very well be a coincidence, taking the other factors into account. With these kinds of tests, it’s nearly impossible to isolate two subjects of interest such as this example, but researchers again have to make judgment calls and simply accept the results at some point.

Owen Lin - 3/10/2010 2:19:47

I'm wondering how we could apply this reading to the development of our app. It's not clear how we can test out our application using the scientific method, but I guess that we could, say, design an experiment that tests how fast people can learn to use an interface efficiently (of course, we'd have to define what "efficient" means by using an operational definition). We could see if we could figure out what makes certain interfaces easier to learn and become fluent in, and we would try to implement those elements in our interface to make it as intuitive as possible. By knowing about the scientific method in detail, we could avoid certain pitfalls and get the most out of such an experiment that could help us develop a good user interface.

Aneesh Goel - 3/10/2010 3:37:56

The use of statistics and experimentation is no doubt very helpful in user interface design; one would hope our quantitative analysis tools use constants taken from statistics, not magic numbers, and that the results of their equations are backed by statistics. Even heuristic analysis benefits from statistical approaches, though the sample size of analyzers might be a bit small for this to be at its most effective, and of course experiments are a good way to test the usability of your application or to control all design elements but one you want to test and using that as an independent variable.

It's also very obviously significant for making user interfaces for presenting and handling large amounts of data; knowing how statisticians present information, how to mislead with accurate statistics, and how to avoid misleading with potentially inaccurate statistics makes presenting data all the more useful.

That said, the reading seems a little generic for a useful response... these are nifty tools, there are clear applications for them, but the basic structure of an experiment and the beginner's introduction to statistics are hard to discuss in significant depth.

Jonathan Hirschberg - 3/10/2010 4:05:56

An assumption that is necessary for the lo-fi prototype testing to work (to give an accurate representation of how users will use the product) is that the differences between having users operate a paper model while being observed versus using the actual application on their own are not significant in the variables we want to measure. If they were, it might not be possible to use the results (and the design issues that are raised) from the lo-fi prototype test to predict the ease or difficulty of using the actual product. Since paper cutouts are clearly different from computer screens, it is necessary to define the scope of the test and what kinds of results are to be expected. We need to limit our testing to the actual interactions with the interface and not worry about other things like methods of text entry, for example. But will the difference between writing something on a prototype versus typing it in cause a different result? Would it be possible to test a group with the paper prototype while testing another group with a real application or a prototype that looks real and see if the results are the same (for example, users in both groups get stuck at the same place)? Can we show the differences to be statistically insignificant, so that any differences in results can be attributed to chance and not to a difference in the testing methods? If so, then maybe paper prototyping does give an adequate indication of how users will use the interface of the real product.

Geoffrey Wing - 3/10/2010 9:23:00

These readings were a lot easier to read than previous readings, but I still learned a lot. The discussion of experimental procedures and statistics is a nice review of concepts I learned in high school. After completing these readings, I now have a good idea of what to look for and what to measure when conducting our lo-fi tests. We have a wide array of avenues to display our data and conduct analysis, and as discussed in the readings, it is important to choose carefully the way to present your data, so we don't lose information or just look at one aspect of the data.

Before we analyze our data, we will need to create an experiment to gather data. We will need to decide our independent and dependent variables, and probably have a pilot experiment before we conduct the full-scale experiment. Though we do not have access to an MRI machine, we still have many ways of looking at the behavior of our test users - check their facial expressions, what they say, how they say it, etc.

Daniel Nguyen - 3/10/2010 10:09:03

The section concerning using computers to help interpret test results makes some statements that I feel weaken the reading's argument on why you must beware of the drawbacks of computers. The reading says that you must be careful to understand what the results mean when using computers for statistical analysis, which is hard to do when you are just inputting data, but in my opinion, if you understand what the analysis you are performing entails and what the output data represents, interpreting the results through a computer is no harder than interpreting the same results that may have been done by hand. Also, the reading warns about making errors when entering data, such as entering data from the wrong condition into the analyzer. However, this mistake can be made just as often when doing statistical analysis by hand, except with possibly more work that has to be done when making the error without a computer. Overall, I feel that the reasons Martin gives for being wary of utilizing computers are actually things that could go wrong without computers also, resulting in a weak argument.

Charlie Hsu - 3/10/2010 11:02:56

I found the validity and reliability discussion relevant to the lo-fi prototype experimentation we'll be doing with our users. Our dependent variables might be a composite of the time it takes an user to complete the task and the difficulty with which it is completed. Though we may not be able to test reliability (since we are primarily concerned with intuition afforded by the design and thus, first tests would contaminate future tests), we can prove validity because time to complete task is a directly observable dependent variable. We can also assume that since we are testing with subjects not emotionally tied to us (not friends, for example), their feedback on difficulty will be relatively objective and reliable.

Bryan Trinh - 3/10/2010 11:07:03

One of the most important concepts I learned from these articles is the idea of a functional definition. Some words or ideas are much too expansive to be encapsulated in one word, so the experimenter must define a particular subset of the ideas a word can generate. Many times when I read a statistic headline on the news I can't help but say to myself, "What! How could they say that? That's not even testable." At least now I can say that they are testable. They might not be tested well, but at least they are testable.

The interpretation of results outlines the various methods used to draw conclusions from the experiments. The meta analysis was particularly interesting because it provides a way to combine existing data from different experiments into one consolidated pool of data.

Dan Lynch - 3/10/2010 11:21:10

The reading discusses how to create an effective experiment. Certain issues include picking the right independent and dependent variables, and scoping the range of these variables to ensure effective results, or as the author states, "show effect". Reliability and validity are discussed, as in the last reading. However, this time, a more in-depth discussion is present. The author goes into reliability methods, known as Test-retest, Alternative-form, and Split-half. In the first method, you simply test the same individuals twice using the same test. The Alternative-form method is when you make new tests for the same group of individuals. The last, Split-half, is splitting the test into two parts and making sure the results correlate. An interesting part of the writing was the discussion of physiological measures. One technique for quantifying thought is using an EEG to look at brain waves and their correlations with behavior.
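The split-half method described above can be sketched in a few lines. This is only an illustration under stated assumptions: the questionnaire scores are invented, the odd/even split is one of several possible ways to halve a test, and the Spearman-Brown step-up (2r / (1 + r)) is a commonly used correction that this summary of the reading does not mention.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def split_half_reliability(scores):
    """scores: one row per participant, one column per test item.

    Returns (r, corrected), where r is the correlation between the two
    half-test totals and corrected applies the Spearman-Brown formula.
    """
    first_half = [sum(row[0::2]) for row in scores]   # items 1, 3, 5, ...
    second_half = [sum(row[1::2]) for row in scores]  # items 2, 4, 6, ...
    r = pearson_r(first_half, second_half)
    return r, 2 * r / (1 + r)


# Hypothetical 1-5 ratings from five participants on a six-item questionnaire.
ratings = [
    [4, 5, 4, 4, 5, 4],
    [2, 1, 2, 3, 2, 2],
    [3, 3, 4, 3, 3, 3],
    [5, 4, 5, 5, 4, 5],
    [1, 2, 1, 2, 1, 1],
]
r, corrected = split_half_reliability(ratings)
print("half-test correlation:", round(r, 3))
print("Spearman-Brown estimate:", round(corrected, 3))
```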

Peter So - 3/10/2010 11:22:55

The design of experiments appears to be more of an art than a science. The experimenter has to have experience with the subject he is trying to measure in order to recognize and rule out undesired contributions to the data. The reading made me aware of internal and external validity in experimental data. The one new concept I learned was the effect of statistical regression of re-sampled outlier data points converging toward the mean. I feel this is an especially relevant concept to our project designs as we are using an iterative approach. Each time we evaluate our user interface we need to be mindful of how we identify particular problems and take into account confounding variables such as the context in which we are using an application, which can change drastically with portable applications. To guard against criticisms we could test the application in a variety of user environments to argue randomization to preserve our project design's validity.
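Regression toward the mean is easy to see in a small simulation. The sketch below is an assumption-laden toy model, not anything from the reading: each simulated person has a fixed "true" score plus independent random noise on each test, and the population size, noise level, and seed are arbitrary.

```python
import random
from statistics import mean


def simulate_retest(n_people=2000, top_fraction=0.1, seed=0):
    """Pick the top scorers on test 1 and see how they do on test 2.

    Returns (top group's mean on test 1, top group's mean on test 2,
    everyone's mean on test 2).
    """
    rng = random.Random(seed)
    true_score = [rng.gauss(100, 10) for _ in range(n_people)]
    test1 = [t + rng.gauss(0, 10) for t in true_score]  # true score + noise
    test2 = [t + rng.gauss(0, 10) for t in true_score]  # fresh noise
    n_top = int(n_people * top_fraction)
    cutoff = sorted(test1, reverse=True)[n_top - 1]
    top = [i for i in range(n_people) if test1[i] >= cutoff]
    top_first = mean([test1[i] for i in top])
    top_second = mean([test2[i] for i in top])
    return top_first, top_second, mean(test2)


first, second, overall = simulate_retest()
print("top group, test 1:", round(first, 1))
print("top group, test 2:", round(second, 1))
print("everyone,  test 2:", round(overall, 1))
```

Nothing was done to the top group between the two tests, yet their second score drifts back toward the overall mean, because part of what put them in the top group was lucky noise. An iterative design process that only re-tests the worst-performing tasks could see the same apparent "improvement" for free.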

Saba Khalilnaji - 3/10/2010 11:53:25

Pilot runs should be used like heuristic evaluations. Heuristic evaluations can iron out any major UI problems before resources are spent testing a user interface with actual users; in the same way, a pilot run for an experiment can be used to iron out any major flaws in the procedure before any resources are spent to run the actual experiment. Also, one has to be careful in choosing a dependent variable in an experiment because the result can be misleading. For example, in the tracing example performance was measured by the number of times the participant crossed the boundaries; however, the figure showed how the measurement did not give accurate results. The dependent variable needs to properly measure what you are trying to test; perhaps in the pilot run one can ensure that the chosen variable accurately tests the hypothesis. Lastly, it is important to plot the data in order to be able to properly analyze the information. You can use a correlation coefficient to determine the relationship, with 0 being unrelated.

Long Do - 3/10/2010 12:29:10

The readings should be pretty helpful in designing a user-friendly interface. The only problem I would foresee is the operational definition of "user-friendly". The only way to get such an interface is by testing it with many users and having some majority agree that the design is user-friendly, or if it is not, say how they would like to see it change to become so. The rest of the readings don't seem very applicable to our project since they deal with statistics and more exact calculations. We will most likely not be doing exact measurements, but more subjective ones like opinions and preferences. The independent variable would be the entire interface and the dependent variables would be how much the users like the interface, and that would vary between users.

Conor McLaughlin - 3/10/2010 12:50:42

Martin talks about the need for an operational definition when conducting an experiment that involves both independent and dependent variables. Being able to repeat the experiment with an operational definition is essential to testing a user interface in industry over multiple workplaces and teams, something like Word would be an excellent example. We've been talking about Raskin's methodology for evaluating the time it takes to conduct an action, but using Martin's scientific experimentation we could really identify how changing an independent variable like the placement of a button could improve the reaction time or use of an interface. We spoke in lecture about how Raskin's model doesn't hold for everything, like the Google example, so things like an operational definition with carefully chosen variables for reliability and validity could offer another path for testing the effects of change in a user interface without having to pay or entice more independent testers.

Raymond Lee - 3/10/2010 13:00:16

I found the discussion on careful analysis on dependent variables to be interesting. The provided example on murders containing manslaughters, murders, and attempted murders illustrates how easy it is to select a flawed dependent variable.

The example of using panels to operationally define a variable strikes me as akin to having an experiment in preparation for an experiment, but I appreciate the rigor involved in generating the most correctly defined variables possible.

Hugh Oh - 3/10/2010 13:15:46

When analyzing statistical data, it seems that it is heavily dependent upon the researcher to filter and interpret this data at his or her own discretion. This, of course, pertains more to statistics that are almost insignificant. Even statistics paired with a finding from an experiment can be interpreted completely differently by a reader.

Wei Yeh - 3/10/2010 13:17:00

In an age of tremendous technological growth and information overflow, experimental techniques have become incredibly diverse and varied. Entirely new goals for experiments and new ways to conduct these experiments have emerged in the recent years that are very different from what existed just 20 years ago. Take for example telemetric data sent from users of operating systems like Windows Vista, Mac OS X, and the iPhone OS. When an application you are running on Windows Vista crashes, a dialog box prompts you for your permission to send a report to Microsoft about that crash. If you choose to do so, your report is sent to Microsoft's servers and aggregated with millions of other such crash reports from people all over the world, which are used to help Microsoft figure out which bugs to fix and monitor the quality of Windows Vista. This is a whole new way of conducting an experiment: users remotely "vote" for crashes in the OS and, as a result, the experimenter (in this case, Microsoft) gains extremely valuable data on which bugs should be prioritized for fixing, and whether bugs that were thought to have been fixed are actually fixed. The experimenter and experimentees never come into direct contact, yet so much can be learned. This would not have been possible just 20 years ago, but now that almost every machine is connected to the internet, doing such experiments has become a no-brainer. For a more in-depth look into Microsoft's telemetry techniques, check out the video at http://channel9.msdn.com/posts/Charles/Vince-Orgovan-Windows-Vista-Telemetry/

Tomomasa Terazaki - 3/10/2010 13:36:39

The article was a continuation of chapter 2. It mainly talked about what to do before and after your experiments. It talked about which variables you should be keeping and how you should be taking notes. The other article talks about how you should take notes after you are done with the experiments. Some of the important things to note are the modes and means. This is important for finding out what types of things most likely happen while the program is being run. I really enjoyed reading these two articles because I took AP Spanish in high school so it took me back to memories in high school.

Linsey Hansen - 3/10/2010 13:47:22

So, for starters, I really liked the part about 'pilot experiments' since I feel like that is pretty much what we did with our Lo-Fi prototype, and it made me think of building an interface as being more of an experimental procedure. On that note, the parts about choosing how to define different terms were also pretty interesting, since I never actually thought about how we were defining the various terms for our iPhone application, or even how I defined things such as our target user's experience with our Lo-Fi prototype. However, now that this has been brought to my attention, I think it would be neat to try and do something to better define parts of the user's experience (i.e., what does it mean for the application to be useful, easy to use, convenient, etc.) as well as to define 'variables' of sorts amongst our application's UI features.

Angela Juang - 3/10/2010 13:55:51

The experiments described in these readings seem to be related better to experiments that can be easily measured quantitatively. In terms of user interface testing, most likely what we'd want to be measuring in a quantitative sense is the amount of time for the user to get certain tasks done. However, things like key presses and mouse clicks go by too quickly for a person to time them manually - these measurements would have to be taken by the computer while the user is interacting with it (i.e., logging button presses, etc.). For a low fidelity prototype such as our current assignment for the iPhone project, it probably won't be feasible for us to make these kinds of measurements to gather data.
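For a working prototype (as opposed to a paper one), the kind of computer-side logging mentioned above could look roughly like the sketch below. The class, its method names, and the event labels are hypothetical; a real iPhone application would hook equivalent calls into its own event handlers.

```python
import time


class InteractionLog:
    """Records (timestamp, event name) pairs for later timing analysis."""

    def __init__(self, clock=time.perf_counter):
        # The clock is injectable so the log can be tested without waiting.
        self._clock = clock
        self.events = []

    def record(self, name):
        """Call this from each button press, tap, or key handler."""
        self.events.append((self._clock(), name))

    def task_time(self):
        """Seconds elapsed between the first and last recorded events."""
        if len(self.events) < 2:
            return 0.0
        return self.events[-1][0] - self.events[0][0]

    def gaps(self):
        """Seconds elapsed between each pair of consecutive events."""
        times = [t for t, _ in self.events]
        return [later - earlier for earlier, later in zip(times, times[1:])]


log = InteractionLog()
log.record("task_start")
log.record("tap_search_button")
log.record("task_done")
print("total task time (s):", log.task_time())
print("time between events (s):", log.gaps())
```

Long gaps between events are a cheap quantitative hint about where users hesitated, which can then be followed up qualitatively.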

Richard Lan - 3/10/2010 14:07:58

Determining the independent and dependent variables for an experiment can be quite challenging. This is due to the fact that the relationships between independent and dependent variables are not always clear. Furthermore, it is not always possible to measure the desired dependent variable, so researchers have to rely on other indicators of the dependent variable, such as physiological changes. The tests used to measure the changes in dependent variables must be reliable and validated for the type of information the researcher wishes to gather. Even after choosing the variables, the researcher must define how they wish to measure the variable. Such a definition is usually a specification of how to measure the variable, and is called an operational definition. Operational definitions for different experiments do not always agree, as different testers may feel the need to define their variables in different ways, based on the characteristics of the experiment.

Weizhi Li - 3/10/2010 14:44:11

One important point mentioned in chapter 2 is the trade-off between precision and generality, which means that the designer has to balance how many variables should be randomized against how many should be controlled. From my prior knowledge and this reading, in my opinion, randomization can be just as challenging as controlling because there are so many exceptions that have to be handled. The chapter mentions several threats to internal validity. Another similar example is when someone sends out surveys to randomly selected individuals and there are always some people who refuse to participate. As a result, the final set of subjects is biased.

bobbylee - 3/10/2010 14:51:56

It is the best reading so far in the class. I believe this is largely because Martin supports his statements with concrete examples, so I can see the ideas at work in the examples. As I was reading, I found that no matter how much we try, we cannot eliminate all confounding variables: when you try to eliminate one confounding variable, a new one might pop up. For example, the readings say that IQ tests are prone to statistical regression. My solution to eliminate that would be to do multiple tests and determine each person's IQ by the mean of their test scores. However, this might introduce another confounding variable: practice might make some of them smarter. Anyway, my last word is that it is impossible to eliminate all the errors that happen in experiments, but what we can do is eliminate the major confounding variables.

Yu Li - 3/10/2010 15:03:41

Although the experimental method allows for both the collection of data and causal statements to be made concerning circumstances that result in a change in behavior, it does not take into consideration the ethics of an experiment. For example, the Milgram experiment on obedience to authority figures follows all the guidelines of the experimental method, including independent, dependent, and random variables. In this sense it was a valid experiment, but it did not take into account the emotional trauma and negative effect the experiment would have on its participants. After many of the participants learned that they were tricked into "killing" a person, they became distraught and upset, leading to extreme emotional stress. This just proves that even if an experiment follows the correct method, in many cases that does not justify how good it really is.

Boaz Avital - 3/10/2010 15:08:47

How can we record the results of our user testing experiments into a statistical model? Number of mistakes? Mistakes per minute? Also, since we're only supposed to get between 3 and 5 testers (ever, not just for this assignment), would we ever get enough data for a statistically significant model?
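One way to get a feel for the "only 3 to 5 testers" concern is an exact sign-flip (paired permutation) test on per-user differences. The sketch below is illustrative only: the function name and the mistake counts are invented, and it assumes each tester tried both versions of an interface.

```python
from itertools import product


def sign_flip_p_value(diffs):
    """Two-sided exact sign-flip test for paired differences.

    Under the null hypothesis that the two designs are equivalent, each
    user's difference is equally likely to be positive or negative, so we
    enumerate every possible sign assignment and count how many give a
    total at least as extreme as the one observed.
    """
    observed = abs(sum(diffs))
    hits = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        if abs(sum(s * d for s, d in zip(signs, diffs))) >= observed:
            hits += 1
        total += 1
    return hits / total


# Hypothetical: mistakes with design A minus mistakes with design B,
# one number per tester. Every tester did better with design B.
five_users = [2, 1, 3, 2, 1]
print("p-value with 5 users:", sign_flip_p_value(five_users))
```

With five testers there are only 2^5 = 32 sign patterns, so even when every single user improves, this test cannot return a two-sided p-value below 2/32 = 0.0625. Under that framing, a handful of testers is better suited to finding problems than to proving a difference at the usual 0.05 level.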

Sally Ahn - 3/10/2010 15:25:11

Today's reading talked about validity, which reminded me of the last section of the Martin reading from Monday. He presented an example experiment at the end on the relationship between lecture pace and student's attentiveness. However, I wondered about the validity of his method for measuring student's attentiveness. He used the noise level of the students with the assumption that "when students were quietest, they were most attentive." I find that to be a pretty bold assumption that has significant impact on the overall experiment; students are not necessarily more talkative when they are inattentive (they may be thinking about other things, doing homework, sleeping, etc.).

Jungmin Yun - 3/10/2010 15:27:24

This reading is very interesting and informative even though I do not have any psychological background. It is well organized and clearly shows how to perform experiments using scientific methods. I learned a lot about control variables, experiment validity, and generalization. In particular, I was impressed by the idea that we need the right number of control variables. If we have too many, the experiment can't be generalized, and if we have too few, the results will end up with too many confounding variables. In the independent variable section, I liked the idea of a pilot experiment where we conduct an informal experiment to iron out those small bugs. And we also need to change our independent variables as needed during the experiment. In the dependent variable section, it mentioned that operational definitions perform the experiment.

Jessica Cen - 3/10/2010 15:35:37

I agree with Martin at the end of the chapter when he says that computers are your friends. But I also think that computers must be treated as tools instead of absolute reference sources. As Martin says, "garbage in, garbage out," which agrees with the fact that computers can only be useful if their input is data that has significance and their output is what we expected.

Furthermore, after reading the three chapters on "Doing Psychology Experiments," I realized that even though we can calculate the statistics for any amount of data, some of that statistical data may not be a good representation of the results. For example, when Martin describes the ways to express the central tendency, he highlights the fact that we lose some information when we describe our data in terms of the mode and median. Therefore, it is important that when interpreting statistics, we make sure that they reflect the most relevant and important information.
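
The point about losing information can be sketched with a small example. The task times below are made up for illustration (they are not from the reading), and `summarize` is a hypothetical helper name. The two data sets share the same median and mode, so reporting only those two numbers would hide the one very slow participant that the mean picks up.

```python
from statistics import mean, median, mode


def summarize(data):
    """Return the three common measures of central tendency for a data set."""
    return {"mean": mean(data), "median": median(data), "mode": mode(data)}


# Hypothetical task-completion times in seconds (illustrative values only).
without_outlier = [20, 22, 22, 25, 27, 30, 31]
with_outlier = [20, 22, 22, 25, 27, 30, 120]  # one participant got stuck

for label, data in [("no outlier", without_outlier), ("outlier", with_outlier)]:
    print(label, summarize(data))
```

Here the median and mode are identical for both sets while the means differ, which is one reason reports often list more than one measure.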

Wilson Chau - 3/10/2010 15:38:51

All the readings were by Martin, from his textbook. Of these, the one that was the most interesting, and that I felt helped me the most in figuring out what to look for in our lo-fi prototyping, was Ch 7. It concentrated on what to focus on in our testing, like what to change and what to keep track of. It helped give me a better idea of what the independent and dependent variables were for my project.

Brian Chin - 3/10/2010 15:44:50

I thought the reading was interesting and informative. The parts about how to create an experiment and choose independent and dependent variables were a good overview of how to do experiments in the social sciences, I felt. I question the practicality of many of the techniques in this reading, though. For example, if some people were attempting to create an application that was meant to reduce stress, one might conduct an experiment to determine what background image would create the most relaxation in users. These chapters would help greatly in choosing variables, creating an experiment, and interpreting the results. However, I feel that this whole methodology is too cumbersome to use when developing an application. If you had to conduct an experiment for every aspect of an application, the application would never be finished. The information from these experiments would be useful, though, so it might be wise for application designers to look it up and see if anyone has done the experiment already.

Long Chen - 3/10/2010 16:02:13

This is the exact same subject we just discussed in my Psych and Econ class. As an economics double major, I understand experiments are imperative to improvement and better understanding. I'm both surprised and thrilled that CS160 is the first computer science class in which we actually do hands-on interaction with everyday users. The experiments Martin discusses in his writing are more targeted as psychological studies where the user is studied along with the topic. These experiments differ from an economic experiment in many ways, and the focus on the user is one of the key differences. Psychological experiments are also more qualitative, and there is less numerical data to record. My Econ 119 course on psych and econ proposes a hybrid experimental procedure where the numbers are derived from the qualitative user responses and are also used as part of computational models to understand the user and the situation. I believe that same kind of experimental approach could be useful in our CS160 course as well.

Vidya Ramesh - 3/10/2010 16:25:35

In the three chapters from Martin's book that we were assigned for reading, the general idea was to define the practice of running experiments using a reliable and valid methodology. Martin started off by defining terms such as independent variable, dependent variable, and random variables. The first is the variable that is being manipulated by experimenters, yet he points out that this variable is independent of the participant's behavior. If an independent variable is independent of the participant's behavior, I think that he is pointing out that the experiment is measuring the influence of the variable on the participant's behavior. The participant's behavior is the dependent variable and the random variables are the circumstances that can vary, but do not bias the experiment. In the next chapter, Martin brings up the difficulties of finding the correct independent variable within the experiment. He explains that there is a difference in the precision that the general public will accept in defining a term and what experimental psychologists will accept, and that any definition of an independent variable definitely must satisfy the latter and the former. He also describes various ways of ensuring validity and reliability like test-retest reliability, content validity, and predictive validity. In the final chapter, he describes how to manipulate the data that is collected during the experiments. He focuses on how plotting frequency distributions is usually pretty helpful and the statistics for describing relationships between two variables.

Arpad Kovacs - 3/10/2010 16:32:31

After reading these chapters, I realized that setting up scientifically valid and reliable psychology experiments is very difficult, due to the high possible variability among test subjects and the requirement for highly precise operational definitions. Luckily, the chapters provide practical advice for selecting realistic, demonstrable ranges, and running pilot tests to calibrate independent variables. The most useful part of chapter 7 was the discussion on formal methods for ensuring reliability and validity of dependent variables, as well as how many dependent variables to measure, and to what degree they should be composited.

However, I am still not sure what the best way is to apply this knowledge to our lo-fi prototype usability studies. First of all, what is the independent variable? Our experimental participants are testing a single interface; should we change particular elements of the interface in the middle of an experiment to observe how it affects usability (e.g., increase the size of a button)? Or would it be wiser to split the participants into control and experimental groups, and compare their performances? It is also going to be hard to ensure the reliability of the dependent variable, which I presume will be the time to complete each task. Test-retest reliability is not an option, since it requires time between the two trials, and practice gained from performing identical tasks will contaminate the results. Alternate-form is somewhat better, since the trials can be conducted close together, but running multiple similar trials for each person may induce fatigue and thus introduce additional undesired variables. Split-half seems to be an appealing option, since it can be performed in one sitting, although it is unclear to me how exactly we can split a usability test into two interleaved halves. In summary, there are many concerns that need to be taken into account when designing a psychological experiment, and even more factors come into play when we specialize in HCI experiments.
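
One possible way to interleave the halves, offered only as a rough sketch: treat each task as an item, sum each participant's times on the odd-numbered and even-numbered tasks, and correlate the two half-scores. The data below are invented, the function names are hypothetical, and the Spearman-Brown step is the standard textbook correction for half-length tests rather than something taken from the reading. It also assumes the tasks are roughly comparable in difficulty, which may not hold for a real usability test.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def split_half_reliability(rows):
    """Correlate odd-task and even-task totals, then apply Spearman-Brown."""
    odd_half = [sum(row[0::2]) for row in rows]   # tasks 1, 3, 5, ...
    even_half = [sum(row[1::2]) for row in rows]  # tasks 2, 4, 6, ...
    r = pearson(odd_half, even_half)
    return 2 * r / (1 + r)  # estimate for the full-length test


# Hypothetical times in seconds: one row per participant, one column per task.
task_times = [
    [12, 14, 11, 15, 13, 12],
    [20, 22, 19, 24, 21, 20],
    [15, 15, 16, 17, 14, 16],
    [30, 28, 31, 27, 29, 30],
]
print(split_half_reliability(task_times))
```

With only a handful of participants the estimate will be very noisy, so this is more useful as a sanity check than as a formal reliability figure.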

Jeffrey Bair - 3/10/2010 16:40:52

The way that we do experiments in any kind of setting is to control our variables. It is interesting to see that, similar to the way that we control variables in Chemistry or Biology, we also have to control the independent and dependent variables in order to see how the experiment affects the user. It is also important for us to make sure that there are no confounding variables, so that it is easier for us to determine if there are certain problems with certain variables. The mortality part of the reading is also interesting in that, since we will always have different users, their different opinions about a certain function may just be attributable to the user, and it is difficult to replicate a problem just by asking if someone else has a similar one.

Plotting distributions seems like a good idea if we have a lot of testers, but with our limited amount of time we may find that our sample is too small to create a good frequency distribution. However, it is interesting to see that different kinds of graphs can vastly change how you view the relationships in your data. E.g., the line graph makes me feel as if there is a relationship between the data, whereas a bar graph may just represent the amount for each piece of data. However, again we must be vigilant in our experiments to make sure that we don't simply read wrong information from a graph that was due to unforeseen variables. Otherwise the data that we collect may be useless and/or harmful to our design.

Andrew Finch - 3/10/2010 16:44:22

I found Martin's section on misinterpreting statistical tests very interesting. I feel that misinterpretation of statistical data is so common that it obscures our beliefs about the world and misguides our practices. Martin points out that some people think that when a statistical test fails to show a significant difference in the levels of an independent variable, then the levels are significantly the same. This is obviously incorrect, and can lead to a great deal of confusion and false information if interpreted this way.
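
A small sketch can make this concrete. It uses an exact permutation test, which is a swapped-in technique and not one the reading necessarily covers, and the task times and the function name `permutation_p_value` are made up for illustration. With three users per design there are only 20 ways to split the six times into two groups, so even two groups that do not overlap at all cannot get a two-sided p-value below 2/20 = 0.1. The test "fails to show a difference" at the 5% level, yet the designs are clearly not the same.

```python
from itertools import combinations


def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation test on the absolute difference in means."""
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    total = sum(pooled)

    def abs_mean_diff(indices):
        sum_a = sum(pooled[i] for i in indices)
        return abs(sum_a / n_a - (total - sum_a) / n_b)

    observed = abs_mean_diff(range(n_a))
    splits = list(combinations(range(len(pooled)), n_a))
    # Count relabelings at least as extreme as the observed one
    # (small tolerance guards against floating-point rounding).
    extreme = sum(1 for s in splits if abs_mean_diff(s) >= observed - 1e-12)
    return extreme / len(splits)


# Hypothetical task times in seconds for two designs, three users each.
design_a = [30, 32, 31]
design_b = [50, 52, 51]
p = permutation_p_value(design_a, design_b)
print(p)  # 0.1
```

The lack of significance here reflects the tiny sample rather than any evidence that the two designs perform equally.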

Spencer Fang - 3/10/2010 16:45:06

The dual-task methodology mentioned in the reading is an interesting way of measuring how much focus a particular part of an interface would require. I can imagine an experiment similar to the text where a user interacts with an application, and at the same time, must listen for a beep and hit a button as quickly as possible when the beep sounds. A long delay would indicate that the user encountered some part of the interface that is more difficult to understand than others. These parts of the interface can be targeted for further improvement.

Alexis He - 3/10/2010 16:52:16

On "How to do Experiments", Mortality: I think it's crucial that results from experiments take into account the differential mortality of the various study groups. For example, when high schools advertise that 90% of their graduating class goes on to a 4-year college, they're not taking into account the mortality rate (the number of dropouts who are not part of the graduating class). A lot of times, in order to understand a statistic or a piece of research (not necessarily just experiments), it's important to know what the facts are neglecting, to read between the lines.

Esther Cho - 3/10/2010 16:53:14

How to conduct psychology experiments does not seem far off from any other scientific experiment (I think that was the goal of creating psychology experiments). Because of this, the information that Martin presents doesn't seem new (even up to how to present the data in graphs and interpret them). The only new things I read about were the different variables, like confounding variables, and including random variables in an experiment. However, how is this different from a scientific experiment?

Richard Heng - 3/10/2010 17:12:49

It seems like the difficulty in determining independent variables comes from the subjectivity of the definitions of the variables. It might be less misleading to not use those subjective words in the study. It would be most accurate to frame the study in terms of the concrete recorded variables.

Victoria Chiu - 3/10/2010 17:25:30

Choosing what data to manipulate or measure is tough, just like choosing how specific a test should be. Martin suggests that experimenters can first do "pilot experiments" to figure out the range of the independent variables. This idea is similar to lo-fi prototyping; pilot experiments let the experimenters know if the range is inappropriate before a great amount of time and effort is spent. While lo-fi prototyping requires testing with the same target users as the final product, pilot experiments might just require ourselves or our friends in a casual setting. It is, however, hard to tell if we and our friends are similar in ways that might be different from the general public, and the pilot experiments could possibly be biased.

Joe Cadena - 3/10/2010 17:27:29

It seems to me that a bit of subjectivity exists when operationally defining an experimental variable. In this case, I believe some results are prone to being skewed. Relating this topic to low-fidelity prototyping and heuristic evaluations, is it then possible for the "experimenters" to interpret the results in their favor so that maybe a new design is adopted or a new function included? Maybe, in addition to the results, the design team determines its validity based upon general acceptance.

Chris Wood - 3/10/2010 17:28:01

I can see how isolating one variable for testing a UI can become very complicated. I think the most important part of this type of testing is "operational definition" so other experimenters can follow your testing methods and see any overlooked flaws in the design. The part of the first article on using clues to read the psychological processes of the user seems dubious at best to me. The discussion on using statistical techniques to interpret data was great fun. I read the book "How to Lie with Statistics" and this reminded me a lot of that book and how to realistically portray and read statistical information.

Mohsen Rezaei - 3/10/2010 17:29:01

A couple of very important points were made in the reading that will definitely help in designing user interfaces. The first important point was that sometimes we are not able to test all the aspects of our design. As mentioned, a pilot cannot put himself in a situation where he can learn and know all his problems in the area of piloting. Sometimes test models and experimentation do not reveal all the problems and unknowns of a specific design. In such cases, we ought to be creative, and we need to be able to imagine the situations where something could possibly go wrong. Along the same lines, sometimes we can't run an experiment even if we can think of where our system could possibly break. In this case, knowing how to handle the situation would be our best bet. An example of this would be, as the text mentioned, measuring the aggressiveness of children of a particular age. We can't really let a child watch violent movies/TV shows to see how the child's aggressiveness would change. We might hurt the child more than we would hurt the design. Reliability was the other important thing mentioned in the article for this week. An example of reliability would be hardware response to weather. Some hardware doesn't respond the same in cold weather as in hot weather. The performance drops in cold weather because cold weather introduces delay in the hardware and makes it hard for the hardware to carry out tasks the way it normally does. If the software or design relies on the hardware performance, then we need to make sure that the design does not fail in the place where it is going to be used. Another thing within the same idea is retesting the system. Even though a system/design passes a test, it doesn't mean that the system is unbreakable. Retesting the system ensures that passing the tests wasn't accidental.

Jason Wu - 3/10/2010 17:34:48

I really like the Martin readings because the writing is very clear, and the author includes many examples as well as drawings/figures that illustrate or help the user remember important concepts. Although I have never taken a statistics class, his numerous examples of different types of distributions and graphs really helped me understand the basics of disseminating data acquired through experiments. In particular, I think I finally get why statistics for test results often include all three measures of central tendency as well as variance and standard deviation.

Chapter 7 of the reading raised a couple of interesting points. I like the suggestion to run a pilot experiment before investing more time and money into performing the actual experiment. Even though the lo-fi user testing assignment is not exactly an experiment, my group and I found it very useful to have a test run before going out and testing with actual users. Just by trying out the prototype on one another, my group and I found some usability issues and came up with some best practices for the real user testing. Furthermore, Martin mentioned that some experimenters use EEG and/or fMRI to measure physiological response as an indirect variable. I wonder if any user interface designers run experiments with such equipment, since I imagine that seeing which parts of the brain are stimulated when a user is trying to perform a particular task through the user interface could be helpful.

Jeffrey Doker - 3/10/2010 17:44:10

These readings gave a good (albeit wordy) overview of a variety of basic experimental design and statistical analysis techniques. The lengthy descriptions of technical terms felt unnecessary for the simpler concepts, but I was grateful for them on the trickier statistical tools. Actually, the most interesting aspect of these articles to me was the ubiquitous use of pun-based cartoons. Although at first I couldn't understand why these were useful, I found when I read the summary section at the end of each article that the bolded vocabulary words instantly brought to mind their associated cartoon, and that gave me a key to recalling the definition of that word. This makes me want to learn more about the psychology behind memory triggers, which, though not the point of these articles, I think is a valid topic to investigate in this class.

Jordan Klink - 3/10/2010 17:46:31

Since I'm rather unfamiliar with psychology and psychological practices, I found the reading to be very helpful in understanding how a psychology experiment works. Specifically it highlighted the importance of preparation before an experiment. If careful planning is not made, the experiment will just be a waste of time (and possibly lots of money). The only concern I have is just how applicable psychological practices are in regards to user-interface design. It is likely I will want to test out my design on various subjects, but whether or not a full-fledged experiment will be necessary is questionable. Even more of a concern is if an actual experiment is even desirable. For example, it may be preferable for a much more relaxed approach to keep the subject as comfortable as possible. Regardless, I am glad to have gained the knowledge just in case I do have the necessity to run an experiment in my design process.

Kyle Conroy - 3/10/2010 17:48:10

Interpreting experimental results, or any collection of data, can be difficult for computer scientists who have never taken formal statistics courses. Zed Shaw, in his coarsely-worded rant "Programmers Need To Learn Statistics Or I Will Kill Them All", correctly identifies the problem that many programmers do not understand basic statistics and, worse, do not know they are ignorant. Therefore, it is great that we read Martin's "How to Interpret Experimental Results", which hopefully fixes some misunderstandings regarding data. Programmers today also constantly run into experimental data. Whether it is website analytics, sales data, sign-up rates, or A/B testing, programmers must be able to understand these basic concepts to make informed decisions.

Brandon Liu - 3/10/2010 17:48:59

There was an interesting discussion at the end involving the "significance" of tests. I looked up the origin of the popular thresholds of significance: 0.05 and 0.01 (5% and 1%). These come from the 1920s, when books would publish tables of F-distributions (See Stigler, http://www.springerlink.com/content/p546581236kw3g67/). The author of the article suggests that statistical significance at these levels has since been "abused". I was curious why specifically a "1 in 20" chance was chosen as the standard, and was surprised to learn that it was for a rather mundane reason. The proper way to state such a result is "statistically significant at the 5% level".

Mikhail Shashkov - 3/10/2010 17:51:32

Don't really have much to say. No disrespect, but I am confused why we are learning about the scientific method in an upper division college class. All of this stuff is fairly straight-forward and was taught years ago.

Hopefully some interesting takes with respect to HCI will be presented in class.

Darren Kwong - 3/10/2010 17:55:17

The readings show that there are a lot of things to consider when planning an experiment, conducting it, and interpreting its results. For lo-fi prototypes and Wizard of Oz testing, the different variables and conditions should be kept in mind. Real-time processing by a "human computer" might be important to the interpretation of certain results, so the dependent variables that are defined should be independent of conditions and variables that would differ significantly between the prototype and the actual product.

Richard Mar - 3/10/2010 17:56:25

The chapter on interpreting experimental results was essentially a chapter on basic statistics. There was correlation, various distributions, plot types, and a section on statistical tests. The more of these readings I go through the less I like them; very few of them are useful or interesting.

Divya Banesh - 3/11/2010 10:56:11

In the reading by Martin, we read about confounding variables and how it's important to make sure the variable we're testing is really the variable we think we're testing. This is also related to the principle of Occam's Razor, that to get a solution, we need to strip away everything that's not essential to the problem. I feel this is very important while testing prototypes or creating interfaces, but it's also something that can lead to problems. It's important to make sure that, for example, if the designer is testing which color pops out more for the user, red or yellow, both are presented in the same spot on the screen with everything else on the interface being the same.
