  • Ragnar Skulason


Effective visualization aid for statistical data cleaning

Initial Problem Presentation


Description of the problem and motivation explaining why it is worth addressing

Data cleaning is a well-researched area with many specialists, but its visualizations still need improvement. In particular, cleaning multidimensional data held in two-dimensional matrices or database tables, and how best to visualize it, has received less study. Most of the tools I have used in this area have been complicated, difficult to use, and very time-consuming.

A background survey of related work and a list of references.
QCC - Quality Control Center
Potter's Wheel - http://portal.acm.org/citation.cfm?id=645927.672045&coll=Portal&dl=GUIDE&CFID=8254055&CFTOKEN=15391418
Declarative Data Cleaning - http://www.vldb.org/conf/2001/P371.pdf

A list of the key technical challenges your group expects to face and a description (or storyboard) of the approach you plan to use to address the challenge.

  • Setting up the visualization and implementation
  • How to use my statistical tools for this project
  • How to visualize my statistics in my visualizations

I will use the spiral model for the design:

1. The starting point will be determining the first objective: which statistical models are relevant, which visualizations would work, and so on. The UI will be designed, the visualization studied, and an action list made.
2. A prototype will be designed from these objectives.
3. The prototype will be implemented and programmed. This stage will require the most work.
4. The program will be tested, evaluated and checked for what could be improved.
5. Go to step 1.

A list of milestones breaking the project into smaller chunks and a description of what each person in the group will work on.
1. The first iteration will be finished before November 10th.
2. The second iteration will be finished before November 17th.
3. The third iteration will be finished before November 24th.
4. The fourth and last iteration will be finished on December 1st.

Slides (.ppt)
Slides (google docs)

Midpoint Design Discussions (11/24)


  • Think about how to tackle other datatypes, such as nominal
  • Other statistical functions for other datatypes
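
As an illustration of what a statistical function for nominal data could look like, here is a minimal sketch in Python. It is an assumption for discussion, not the method used in this project: it flags category labels whose relative frequency falls below a threshold, since rare labels in a nominal column are often misspellings or entry errors. The function name `flag_rare_categories`, the parameter `min_share`, and the sample data are all hypothetical.

```python
from collections import Counter


def flag_rare_categories(values, min_share=0.05):
    """Return the labels whose relative frequency is below min_share.

    Hypothetical helper: rare labels are treated as cleaning candidates,
    not as confirmed errors.
    """
    total = len(values)
    if total == 0:
        return set()
    counts = Counter(values)  # label -> number of occurrences
    return {label for label, n in counts.items() if n / total < min_share}


# Made-up sample column: "bleu" appears once among 24 values.
colors = ["red"] * 12 + ["blue"] * 11 + ["bleu"]
suspects = flag_rare_categories(colors)
```

A visualization could then highlight the flagged labels, for example in a bar chart of category counts, and leave the decision to correct or keep them to the user.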

Slides (google docs)

Final Paper and Presentation (12/10)


Poster (.pdf)

