FP-David Purdy and Daisy Wang

From CS294-10 Visualization Fa07

Proposal

Group Members

  • David Purdy
  • Daisy Wang

Description

Overview

For very large, very high-dimensional data sets, an approach to modeling and prediction known as Machine Learning (ML) has developed at the intersection of statistics and computer science. Machine learning is often described as a black-box approach to modeling a response, Y, as a function of input data, X. Typically only researchers understand the underlying equations and optimizations well enough to trace what a machine learning algorithm is doing. The scale of the data sets often makes it difficult to analyze the data and recognize its features, and the complexity and black-box nature of the algorithms often make it infeasible to inspect how an algorithm arrives at its result. In addition, automatic machine learning has reached a point where little further improvement can be made without human involvement.

In contrast, the goals of interactive visualization are, on the one hand, to make patterns in data less opaque and, on the other, to involve humans in the decision process. Recent developments in high-dimensional visualization can be paired with recent developments in high-dimensional machine learning to give a much better understanding of model development and of the properties of the data set. In addition, such pairings can suggest where researchers may focus their attention to resolve anomalies or investigate interesting properties that arise at the intersection of the model and the data.

Thus, our goal is to apply interactive visualization to machine learning processes, algorithms, and model selection. For researchers, we provide better debugging tools for checking whether an algorithm is working as expected by examining its internal data structures; for analysts, we provide tools for discovering patterns in large, high-dimensional data, which helps them build models; for general users, we show how different models perform in a specific application.

Details

Machine learning models use a variety of data structures, and a large category of them uses a tree structure. In the final project, we continue to focus on tree-structured machine learning models, but we significantly extend the scope of the third assignment.

In the third assignment, we linked sequential sub-selection of data, presented as a tree, to histograms of multiple variables for the selected subset of the data. For the final project, we have the following goals:

  • Apply the visualization we developed for CART (classification and regression trees) in assignment 3 to boosting on high-dimensional sparse data.
  • Apply the visualization to a real dataset from a practical industrial problem in natural language processing or search.
  • Discuss and implement the histograms appropriate to different models, such as CART, boosting, and SVM.
  • Integrate a machine learning system with the visualization, ideally so that a user can execute a method and then examine the results. A future direction is to add a feedback loop from the interactive visualization to the machine learning system.
  • Indicate correctly and incorrectly classified data in the visualization, and support selecting either subset.
  • Specify the variables to display at any particular point in the model-building path, determined by properties of the model from the machine learning system.
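
To make the linkage between tree-based sub-selection and histograms concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the project's implementation: the toy records, the feature names `x1` and `x2`, and the helpers `select`, `histogram`, and `split_by_correctness` are all hypothetical. It shows how a path of splits narrows the data, how a histogram can be computed for the selected subset, and how the data can be partitioned into correctly and incorrectly classified records.

```python
# Hypothetical toy records: two features and a class label.
ROWS = [
    {"x1": 0.2, "x2": 5, "label": "a"},
    {"x1": 0.8, "x2": 3, "label": "b"},
    {"x1": 0.4, "x2": 9, "label": "a"},
    {"x1": 0.9, "x2": 7, "label": "b"},
    {"x1": 0.7, "x2": 8, "label": "a"},
]


def select(rows, path):
    """Apply each (feature, op, threshold) split on the path in turn.

    This mimics walking from the root of a tree to one node: every
    split keeps only the rows that fall on the chosen side.
    """
    subset = rows
    for feature, op, threshold in path:
        if op == "<=":
            subset = [r for r in subset if r[feature] <= threshold]
        elif op == ">":
            subset = [r for r in subset if r[feature] > threshold]
        else:
            raise ValueError(f"unsupported operator: {op}")
    return subset


def histogram(rows, feature, edges):
    """Count rows per bin; bin i covers the half-open range [edges[i], edges[i+1])."""
    counts = [0] * (len(edges) - 1)
    for r in rows:
        for i in range(len(edges) - 1):
            if edges[i] <= r[feature] < edges[i + 1]:
                counts[i] += 1
                break
    return counts


def split_by_correctness(rows, predict):
    """Partition rows into (correctly classified, incorrectly classified)."""
    correct = [r for r in rows if predict(r) == r["label"]]
    wrong = [r for r in rows if predict(r) != r["label"]]
    return correct, wrong


# Select the node reached by the single split x1 > 0.5, then
# summarize x2 for the rows at that node.
node_rows = select(ROWS, [("x1", ">", 0.5)])
x2_counts = histogram(node_rows, "x2", [0, 4, 8, 12])

# A hypothetical one-split classifier, used only to show the partition.
correct, wrong = split_by_correctness(
    ROWS, lambda r: "b" if r["x1"] > 0.5 else "a"
)
```

In an interactive setting, each click on a tree node would correspond to a new `path`, and the histograms for every displayed variable would be recomputed from the resulting subset.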

We believe our approach to visualizing CART, Boosting and other ML methods using one interactive visualization interface is novel. We will elaborate on this in our first presentation.

Related work

Anticipated references include:

  • Breiman. "Statistical Modeling: The Two Cultures." Statistical Science, 16:3, 2001.
  • Breiman, Friedman, Olshen, and Stone. "Classification and Regression Trees." Chapman & Hall, 1984.
  • Friedman and Tukey. "A Projection Pursuit Algorithm for Exploratory Data Analysis." IEEE Trans. Computers, 1974.
  • Fua, Ward, and Rundensteiner. "Hierarchical Parallel Coordinates for Exploration of Large Datasets." IEEE Visualization 1999.
  • Gosink, Anderson, Bethel, and Joy. "Variable Interactions in Query-Driven Visualization." IEEE Visualization 2007.
  • Hastie, Tibshirani, and Friedman. "The Elements of Statistical Learning." Springer, 2001.
  • Long and Servedio. "Martingale Boosting." COLT 2005.
  • Mansour and McAllester. "Boosting Using Branching Programs." COLT 2000.
  • Rodrigues, Traina, and Traina. "Frequency Plot and Relevance Plot to Enhance Visual Data Exploration." SIBGRAPI 2003.
  • Solka, Wegman, and Marchette. "Data Mining Strategies for the Detection of Chemical Warfare Agents." in Statistical Data Mining and Knowledge Discovery, (H. Bozdogan, ed.) Chapman & Hall, 2004.
  • Unwin, Theus, and Hofmann. "Graphics of Large Datasets: Visualizing a Million." Springer, 2006.
  • Hadley Wickham, personal communications.
  • Wilkinson, Anand, and Grossman. "Graph-theoretic Scagnostics." INFOVIS 2005.
  • Xu, Hong, Chen, Li, Liu, and Zhang. "Parallel Filter: A Visual Classifier Based on Parallel Coordinates and Multivariate Data Analysis." in Lecture Notes in Computer Science, Vol. 4682, "Advanced Intelligent Computing Theories and Applications. With Aspects of Artificial Intelligence." Springer, 2007.


Anticipated software references include:

  • GGobi
  • JMP
  • prefuse
  • SPSS
  • Statistica Data Miner - overview
  • VisIt
  • XmdvTool

Initial Problem Presentation

  • Link to slides here

Midpoint Design Discussion

Final Deliverables


