FP-Jerry Ye and Jimmy Chen
From CS294-10 Visualization Fa07
Proposal
Group Members
- Jerry Ye
- Jih-Yin Chen (Jimmy)
Description
Gradient Boosted Decision Trees (GBDT) are an additive classification or regression model consisting of an ensemble of trees. Each tree is fitted to the current residuals (the negative gradients of the loss function) in a forward stage-wise manner. In the traditional boosting framework, the weak learners are generally shallow decision trees with only a few leaf nodes. GBDT ensembles are found to work well when there are hundreds of such decision trees. GBDT was introduced by Jerome Friedman in 1999. Salford Systems' TreeNet and the gbm package in R are implementations of GBDT.
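To make the forward stage-wise fitting concrete, here is a minimal sketch, not the implementation used in this project. It assumes squared-error loss (so the negative gradient is simply the residual), a single numeric feature, and depth-1 trees. The names `Stump`, `fit_stump`, `fit_gbdt`, and `predict_gbdt` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Stump:
    """A depth-1 regression tree: one split, two leaf values."""
    threshold: float
    left: float   # prediction when x <= threshold
    right: float  # prediction when x > threshold

    def predict(self, x: float) -> float:
        return self.left if x <= self.threshold else self.right


def fit_stump(xs, residuals):
    """Pick the split that minimizes squared error on the residuals.

    Assumes xs contains at least two distinct values.
    """
    best = None
    # The largest value is skipped because it would leave the right side empty.
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, Stump(t, lm, rm))
    return best[1]


def fit_gbdt(xs, ys, n_trees=100, learning_rate=0.1):
    """Forward stage-wise fitting under squared-error loss."""
    base = sum(ys) / len(ys)  # initial constant model
    preds = [base] * len(ys)
    trees = []
    for _ in range(n_trees):
        # For squared error, the negative gradient is the plain residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        preds = [p + learning_rate * tree.predict(x) for x, p in zip(xs, preds)]
    return base, trees


def predict_gbdt(model, x, learning_rate=0.1):
    """Sum the base value and the shrunken contribution of every tree."""
    base, trees = model
    return base + learning_rate * sum(t.predict(x) for t in trees)
```

Calling `fit_gbdt(xs, ys)` on a small list of inputs and targets returns the base value together with the list of stumps. The learning rate passed to `predict_gbdt` has to match the one used during fitting.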
In machine learning, understanding how one's model succeeds or fails on a particular dataset or sample is important for building better models. For Naive Bayes classifiers, a list of frequency counts and a confusion matrix are generally informative enough to understand what went wrong. For traditional decision trees, tracing how a sample traverses the tree shows which features caused it to be classified a particular way.
For GBDT models, since there are hundreds of decision trees, we cannot realistically expect a researcher to look at every tree and at the path a sample takes through each one. Our proposal is to work on visualizations of GBDT ensembles for error analysis. We will focus on generating a single representative tree from the ensemble, on visualizing the paths that a particular sample traverses, and on showing which features impacted the final result the most.
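As a rough illustration of what tracing a sample through every tree involves, the sketch below walks a sample down each tree of a small ensemble and tallies how often each feature appears on its paths. The nested-dict tree encoding and the function names `trace_path` and `features_on_paths` are assumptions made for this example, not the project's data format.

```python
from collections import Counter


# Hypothetical encoding: an internal node is a dict with "feature",
# "threshold", "left", and "right"; a leaf is a dict with only "value".
def trace_path(tree, sample):
    """Return the decisions taken by the sample and the leaf value reached."""
    path = []
    node = tree
    while "value" not in node:
        go_left = sample[node["feature"]] <= node["threshold"]
        path.append((node["feature"], node["threshold"], go_left))
        node = node["left"] if go_left else node["right"]
    return path, node["value"]


def features_on_paths(trees, sample):
    """Sum the leaf values and count feature occurrences over all paths."""
    counts = Counter()
    score = 0.0
    for tree in trees:
        path, value = trace_path(tree, sample)
        score += value
        counts.update(feature for feature, _, _ in path)
    return score, counts


# Two made-up trees and one made-up sample.
tree_a = {
    "feature": "age", "threshold": 30,
    "left": {"value": -0.5},
    "right": {
        "feature": "income", "threshold": 50,
        "left": {"value": 0.2},
        "right": {"value": 0.7},
    },
}
tree_b = {
    "feature": "income", "threshold": 40,
    "left": {"value": -0.1},
    "right": {"value": 0.3},
}
sample = {"age": 42, "income": 45}

score, counts = features_on_paths([tree_a, tree_b], sample)
```

Here the sample passes through `age` once and `income` twice. Counting occurrences is only the simplest possible summary; a real analysis would likely weight each split by how much it contributes to the final score.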
References
- Mulvaney, R., Phatak, D. S. A Method to Merge Ensembles of Bagged or Boosted Forced-Split Decision Trees. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2003.
- Barlow, T., Neville, P. Case Study: Visualization for Decision Tree Analysis in Data Mining. IEEE Symposium on Information Visualization, 2001.
- Teoh, S. T., Ma, K.-L. PaintingClass: Interactive Construction, Visualization and Exploration of Decision Trees. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
Initial Problem Presentation
Media:initial problem presentation
Midpoint Design Discussion
1. Train on the dataset to generate a boosted decision tree classifier. Given the model and a sample, feed the sample into the classifier to get the classification result.
2. Transform the classification result into XML that conforms to the TreeML format in prefuse. Load the XML into prefuse, visualize each boosted decision tree, and highlight the path of classification (Fig. 1).
3. Design the visualization of the ensemble of trees as an undirected graph (Fig. 2) or a directed graph (Fig. 3); in the directed version, arrows point from the parent node to the child node. The size of a node is proportional to the information gain attributed to its feature across the trees; the thickness of an edge is proportional to the weighted count of times the sample traverses that edge through the trees.
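The aggregation in step 3 can be sketched as follows. This is an illustrative reading of the design, not the project's code: it assumes one graph node per feature, a hypothetical per-split `"gain"` field, an optional per-tree weight for the "weighted counts", and it ignores edges into leaves. The function name `build_ensemble_graph` and the tree encoding are made up for the example.

```python
from collections import defaultdict


def build_ensemble_graph(trees, sample, tree_weights=None):
    """Collapse all trees into one directed graph keyed by feature name.

    Returns (node_gain, edge_weight):
      node_gain[feature]           -> summed gain of every split on that feature
      edge_weight[(parent, child)] -> weighted count of the sample's traversals
    """
    if tree_weights is None:
        tree_weights = [1.0] * len(trees)  # assumption: equal weight per tree
    node_gain = defaultdict(float)
    edge_weight = defaultdict(float)

    def add_gains(node):
        # Accumulate gain for every split in the tree, visited or not.
        if "value" in node:
            return
        node_gain[node["feature"]] += node["gain"]
        add_gains(node["left"])
        add_gains(node["right"])

    for tree, weight in zip(trees, tree_weights):
        add_gains(tree)
        node = tree
        # Follow the sample down the tree, recording feature-to-feature edges.
        while "value" not in node:
            go_left = sample[node["feature"]] <= node["threshold"]
            child = node["left"] if go_left else node["right"]
            if "value" not in child:
                edge_weight[(node["feature"], child["feature"])] += weight
            node = child
    return dict(node_gain), dict(edge_weight)


# Two made-up trees with hypothetical gain values, and one made-up sample.
tree_a = {
    "feature": "age", "threshold": 30, "gain": 0.4,
    "left": {"value": -0.5},
    "right": {
        "feature": "income", "threshold": 50, "gain": 0.1,
        "left": {"value": 0.2},
        "right": {"value": 0.7},
    },
}
tree_b = {
    "feature": "age", "threshold": 40, "gain": 0.2,
    "left": {"value": -0.1},
    "right": {
        "feature": "income", "threshold": 60, "gain": 0.05,
        "left": {"value": 0.0},
        "right": {"value": 0.3},
    },
}
sample = {"age": 42, "income": 45}

node_gain, edge_weight = build_ensemble_graph(
    [tree_a, tree_b], sample, tree_weights=[1.0, 0.5]
)
```

In this toy case the sample follows the `age` to `income` edge in both trees, so that edge accumulates both tree weights. For the undirected variant, one option would be to key edges by `frozenset((parent, child))` instead of an ordered pair.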

Fig. 1. The boosted decision trees highlighted with the path of classification.

Fig. 2. Visualization of tree ensemble: undirected graph.

Fig. 3. Visualization of tree ensemble: directed graph.
Final Deliverables
- Code: http://people.ischool.berkeley.edu/~jimmy/vig.zip (see README inside)
- Final Paper: http://people.ischool.berkeley.edu/~jimmy/vis.pdf
- Final Poster: http://people.ischool.berkeley.edu/~jerryye/vis-poster.pdf
