A3-DavidPurdy and DaisyWang

From CS294-10 Visualization Fa07

Jump to: navigation, search

Contents

[edit] Density Exploration for Machine Learning

[edit] Background

Many scientific pursuits increasingly involve very large, very high dimensional data sets. Existing modeling and visualization methodologies for low dimensional data, of any size, are relatively adequate, so we aim to develop interactive visualization that will assist humans in modeling and visualizing very high dimensional data sets.

This project will meld one machine learning technique, namely Classification and Regression Trees (CART), and one interactive visualization technique for examination of the interaction of the data set and the model. This method will either be parallel coordinate plots or trellised histograms.

[edit] Data Domain

We have not settled on a particular data set yet. The tool itself should be applicable to nearly every data set used encountered in statistics and machine learning. A simple demonstration of the tool may involve simulated data, but we will aim to include a real world data set, such as those from the Machine Learning Repository at UC Irvine, or from various machine learning competitions. Some examples include identifying whether an item on eBay will sell for above or below the mean sale price in its category or classifying email as spam/non-spam.

For simplicity, we will choose a classification problem, rather than a regression or ranking problem, as it simplifies the modeling and focuses more attention on the visualization. However, the project would have no limitations in being used for those other modeling purposes.

[edit] Story Board

Below, we have illustrated several of the individual components. Since interaction with the machine learning system is beyond the scope of this project, linking will be unidirectional: selecting components of the tree will update the visualization of the data distribution, but linking the modeling tool to brushed/selected data won't be feasible at this stage.

Also, the decision about whether to use Parallel Coordinate (PC) Plots or Trellised Histograms (TH) will depend on the difficulty of overcoming several ink/data issues (see Notes below) and the difficulty relative to TH plots. In an ideal implementation, both would be available.

[edit] CART

CART ("Classification and Regression Trees", by Breiman, Friedman, Olshen, and Stone, 1984) is a system that produces binary trees for classification and regression. The tree recursively partitions the data, with splits at each node based on a threshold for a particular variable. The selection of the variable for splitting and the appropriate threshold is based on improvements in predictive accuracy and measures of dispersion such as the Gini index or entropy.

In addition to being a powerful tool for statistical modeling, decision trees are also excellent for exploratory data analysis (EDA) and visualization. Here is an example of a CART tree on simulated data. The first node splits on whether variable 21 (V21) is greater than or equal to 2.165, and the left branch is for the subset of data for which this is true, and the right branch is for the subset with a smaller value for V21.

image:hw3-treeA.jpg

A user may opt to see the data for the very large values of V21, by selecting on the lowest left node, where the response (Y) value has a mean of 0.01706


image:hw3-treeA1.jpg

Conversely, the user may select on other nodes; for simplicity, we show the selection for the smallest values of V21, on the lower right.

image:hw3-treeA2.jpg

[edit] Parallel Coordinate Plots

In the example below, the data has been conditioned on the response value (red for Y = 1, blue for Y = 0). Such a plot can be conditioned on the data subset still under consideration at a specific node in the decision tree, so that "local" patterns may be visible. This picture also illustrates the difficulty of attempting to find patterns when all of the data is presented - the lack of shading makes it difficult to see much beyond patterns such as large values of V21 correspond to small values of Y and small values of V22. See the notes below for additional comments on changes we may make to the usual PC plot.

image:hw3B2.jpg

[edit] Trellised Histograms

This first plot illustrates a single histogram, for V21, conditioned on the Y response values (NB: although we are trying to predict the Y value, conditioning on it for training data can illustrate the relationships between predictor variables and the response variable). The X-axis partitions the response value into its binary states, and the Y axis is the range of values for V21 in this subset of the data; referring back to the X-axis: the width of a histogram refers to the count or frequency of data with the corresponding V21 value from the Y-axis. In this case, for large values of V21, there is no support for Y = 1, as all of the instances correspond to Y = 0. Similarly, for very small values of V21, Y is most likely to equal 1.

image:hw3-hist3.jpg

An example of trellising two histograms, for V1 and V21, is shown below. By highlighting parts of the tree, it would be possible to see which sections of the data are represented. To make it easier to interpret, it seems possible that we would need to have all bars of equal height, or at least give an option for that kind of change.

image:TH-merge.jpg

[edit] Notes

  1. Related work includes Fua et al (1999), though that paper focuses on clustering (unsupervised learning) in conjunction with parallel coordinate plots. A brief literature search has not turned up other instances of a supervised machine learning tool being paired with an interactive system for simultaneous exploration of many dimensions of the data (such as trellised histograms or parallel coordinate plots).
  2. We have examined the implementation of parallel coordinate (PC) plots in Spotfire and GGobi (with R), and have identified several limitations, which may affect our own approach. Notably, the amount of "ink" is extraordinary. At a mere 10,000 observations, trends are impossible to observe, as these implementations emphasize points over distributions. This made us realize that we will likely need to combine several modifications, such as shading (to emphasize dense regions or relationships) and binning. For instance, for each variable the visualization may bin the data into deciles or quintiles, and line segments joining the variables may be shaded to indicate the conditional distribution of the data (i.e. conditioned on variable N, the distribution of the data into variable N+1). We have a number of ideas about this, but our decision to implement PC plots will depend on the difficulty of implementing such modifications.
  3. We have yet to decide on the visualization system, such as Prefuse or Processing. The choice will mainly depend on the difficulty of implementing PC or TH plots.



[edit] Final Writeup

[edit] Overview: Linking recursive partitioning and histograms

In this project, we developed a system for showing multiple histograms of recursively partitioned data. We have recently learned that the visualization community may refer to manual versions of this as "Query Driven Visualization" (cf. multiple research projects from LBL, using VisIt, under Wes Bethel). However, our current implementation is only a piece of what we have designed: this is the beginning of full use of machine learning to determine the specific partitioning at each stage.

[edit] Application (Income prediction) and visual layout

In this example, the data is drawn from a UC Irvine data set from census records. The prediction task is to determine whether a person earned more or less than $50K/year. The variables used for prediction include age, education level, ethniciy, gender, the hours per week that they work, their marital status, their relationship within the family, an occupation identifier, capital gains or losses for the previous year, and their native country. A screenshot is included below. The left panel shows a series of conditions or selections applied to the data - first, subselecting data where the respondent is a "husband" OR "unmarried", then, following the top branch, selecting those cases where their education level identifier is greater than 13, i.e. they have a bachelor's degree, and so on. Partitioning into true and false subsets isn't yet included in this, due to time constraints.

On the right panels, histograms are given for relation (categorical), education level, and age. The squares are for the subset of people who earn more than $50K, and the circles are for those who earn less than $50K. In the data set used, the total population is 32561, of which only 24% (N=7841) are in the higher earning category, and 24720 are in the lower bracket. So, whenever the ratio of counts is other than 3:1 for the circles to squares, for a given subset of the data, it could be interesting, especially in cases where the square is above the circle - i.e. that there are more high earners than low earners for the given conditions.

image:screenshot1.jpg

An example can be found when clicking on the "EducationLevel > 13" node. In the lower right panel, for the histogram of age, one can see that after a certain age the squares greatly exceed the circles - there are far more people with a bachelors degree who earn more than $50K than there are people who have this degree and do not. In clicking on "Age > 45", in the middle panel, for educational status, we can see that after a certain level of education, corresponding roughly to high school graduation, the higher earners are a large and increasing fraction of the population.

The difference between this implementation and the storyboard is that we have currently implemented the selection (or "true") for the recursive partitioning. It turns out that this is very much like "query driven visualization", i.e. with an AND operation on each query. We were pleasantly surprised to learn of such a concept on 10/12, at a presentation at MSRI, by Wes Bethel, of Lawrence Berkeley Lab; in such cases, the partitions, or queries, are generated by a human exploring the data.

[edit] Installation

To install and run:

  1. Download the eclipse environment from: http://www.cs.berkeley.edu/~daisyw/prefuse.zip
  2. Unzip the file into a location of your choice
  3. In Eclipse, select "New->Java Project" and then point at the directory storing the unzipped project
  4. Under "demos", the program is Assig3.java, right click it and then select "Run As -> Java Application"

[edit] Development Process

Daisy handled the implementation in prefuse and Java. Over about 3 days, about 24 hours were spent in implementing it. David handled getting the data, working with CART, producing trees in XML, and the writeup. About 15-20 hours were spent on the project.

At a high level, we had to pick a specific application domain, and David selected a data set from UCI and prepared several trees through statistical analyses with R, and matched the data with nodes in the trees (scripting in Perl and R). Daisy worked in Java with prefuse to render the trees and render the histograms for the variables of interest for the subpopulation selected by a given node on the tree.

For Daisy, the greatest difficulty was in combining the linking and brushing from the tree to the corresponding histograms, and updating the histograms when the user clicks on a new conditional node.

For David, the greatest difficulty was in extending the trees produced by R and CART (using the 'rpart' package) to an XML representation that addresses both "state" of the data and the conditions that produce the data. Most current approaches to decision trees do not report a representation of the data until the leaf nodes, where it is common to only report an expected value of the subset corresponding to the given leaf.



[add comment]