Visualizing Web Content

From CS294-10 Visualization Fa08


Lecture on Oct 8, 2008



Readings

  • Summarizing personal web browsing sessions, Dontcheva et al. (pdf)
  • Vispedia: Interactive visual exploration of Wikipedia data via search-based integration, Chan et al. (pdf)

Optional Readings

  • MashMaker (look through the tutorials and try it out), Ennals et al. (website)
  • Zoetrope: Interacting with the ephemeral web, Adar et al. (pdf)
  • Relations, cards, and search templates: User-Guided data integration and layout, Dontcheva et al. (pdf)

Simon Tan - Oct 07, 2008 01:19:50 am

While reading Dontcheva's paper and the description of "extraction patterns" used for automatically parsing and saving web content, I was reminded of other "macros for the web" browser plugins, such as MIT's Chickenfoot or IBM's CoScripter, which automate actions so that you can repeat them. However, Dontcheva's system seems more geared towards scraping data from the web, and it's easy to see how that specialization can lead to more opportunities for visualizing that data.

So when it came time to learn what Dontcheva's system could do with the recorded data, I was hoping to see more options for output visualizations in addition to the table- and anchor-based layout templates introduced in the paper. Perhaps it is difficult to do with the huge scope of the types of content that could be pulled in from web pages, but I imagine more complex analysis and more interesting visualizations could arise from the basic data (i.e. numbers and text). Why not a bar graph or scatterplot summary template?
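To make the suggestion concrete, here is a minimal sketch in Python (with made-up records, not output from Dontcheva's system) of what a bar-graph summary template might produce once the extracted fields are flattened into label/number pairs:

 # Hypothetical records pulled out by an extraction pattern: (label, price).
 import matplotlib.pyplot as plt
 
 records = [("Hotel A", 129.0), ("Hotel B", 189.0), ("Hotel C", 99.0)]
 labels = [name for name, _ in records]
 prices = [price for _, price in records]
 
 plt.bar(labels, prices)
 plt.ylabel("Nightly price (USD)")
 plt.title("Bar-graph summary of scraped prices")
 plt.show()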

Nicholas Kong - Oct 08, 2008 12:42:10 pm

I thought the extraction patterns paper was fascinating and the idea lends itself to a lot of possible future research. One such direction could be adding more analytical visualizations, as Simon points out. While bar charts or scatterplots might not be useful for some of the applications of the tool, such as gathering paper links and abstracts, they would be useful for tasks such as product price comparison. It may then prove interesting to combine these more complex visualizations with the map layout that has already been provided.

To that end, it would be really interesting to see if one could interface the database with a visualization tool such as flare. There would likely have to be an interim step in the pipeline between the database and the layout/visualization mechanism to extract data from the DOM nodes, though. As Simon points out, this may be difficult given the generality afforded to selecting objects.

Dontcheva et al. also note that adapting this tool to collaborative analysis environments is a possible future research direction. I find this possibility, combined with possible additional annotation tools, intriguing: how would participants effectively create and share extraction patterns, and how would they add metadata (such as comments)?

Kuang - Oct 08, 2008 03:27:47 pm

Both papers present practical systems for extracting structured data from unstructured / semi-structured web pages.

Whereas Mira's system is more UI-oriented, built around browsing-based aggregation, Vispedia performs automatic aggregation across semi-structured DBpedia data.

There's an obvious trade-off between how structured the data is and how much user involvement is required. With Mira's system, the user needs to define mappings to a predefined schema through tags. In Vispedia, the user searches for relevant schema columns through a nice relevant-column suggestion interface.

For Mira's system, it would be interesting to see how a mediated schema for a domain, like travel destinations, could help decrease the level of user interaction and perhaps increase the robustness of extraction rules. I'm thinking that when a user visits a page, we classify it into a domain and apply a canonical extraction rule; failing that, the user can create semantic mappings to the mediated schema representing that domain.
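As a toy illustration of that fallback flow (the domains, keywords, class names, and rules below are all invented, not part of Dontcheva et al.'s system):

 # Toy fallback flow: classify a page into a domain, try a canonical
 # extraction rule, and only ask the user for mappings if that fails.
 DOMAIN_KEYWORDS = {
     "travel": ["hotel", "flight", "destination"],
     "dining": ["menu", "restaurant", "reservation"],
 }
 
 # Hypothetical canonical rules: schema field -> CSS class assumed to hold it.
 CANONICAL_RULES = {
     "travel": {"name": "hotel-name", "price": "room-price"},
     "dining": {"name": "biz-name", "rating": "star-rating"},
 }
 
 def classify_domain(page_text):
     text = page_text.lower()
     scores = {d: sum(k in text for k in kws) for d, kws in DOMAIN_KEYWORDS.items()}
     best = max(scores, key=scores.get)
     return best if scores[best] > 0 else None
 
 def extract(page_text, find_by_class, ask_user_for_mapping):
     """find_by_class(css_class) returns the element text or None."""
     domain = classify_domain(page_text)
     rule = CANONICAL_RULES.get(domain)
     if rule:
         result = {field: find_by_class(css) for field, css in rule.items()}
         if all(result.values()):
             return result
     return ask_user_for_mapping(domain)  # fall back to manual semantic mapping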

Ketrina Yim - Oct 09, 2008 09:04:42 am

Of all the tools presented during the guest lecture, I found Zoetrope to be the most fascinating. I never thought anyone would create a tool that would allow one to browse webpage snapshots over time, but that was before I heard of the inconveniences that often come with looking for and correlating chronological data online. Even more intriguing were the filter and visualization features.

One question I still have, however, is: where are all those snapshots stored? Though I don't know the resolution of the images, it seems that it would require a lot of space to store webpage snapshots that are sampled hourly, compounded by the fact that there may be many webpages to sample.
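A quick back-of-the-envelope estimate in Python; every figure here is my own assumption, since the paper doesn't state snapshot sizes:

 # Rough storage estimate for hourly snapshots; all figures are assumptions.
 snapshot_kb = 200          # assume ~200 KB per stored page snapshot
 snapshots_per_day = 24     # hourly sampling
 pages_tracked = 1000       # assume 1,000 pages being tracked
 
 daily_gb = snapshot_kb * snapshots_per_day * pages_tracked / 1024.0 / 1024.0
 yearly_gb = daily_gb * 365
 print("%.1f GB/day, %.0f GB/year" % (daily_gb, yearly_gb))
 # -> 4.6 GB/day, 1671 GB/year, so deduplication or diff-based storage seems necessary.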

Scott Murray - Oct 10, 2008 02:10:24 pm

What a great lecture that raised so many issues. I appreciated how Mira maintained an awareness of the user experience at all times, even while grappling with unsolved questions in many areas of computer science.

I was most interested in the questions that arose toward the end of the lecture: What next? How can we continue developing new ways to visualize and interact with data that has never before been visible? Whether we're talking about temporal web browsing or user-driven collections of non-schematized data from disparate sources, how can we present the information in a way that makes sense to the user, someone unfamiliar with the task they are about to perform, since it hasn't been done before?

On the web, generally the best practice recommendation for unfamiliar UIs is to introduce new users with a quick "tour". (Kayak.com and other sites do this for first-time users.) It's also good to hide most of the UI to start, and enable the user to reveal more UI elements and expert-level controls as s/he learns the application. But sometimes, as with the Zoetrope project, the tool introduces an entirely original (and foreign) conceptual model. In that case, perhaps a quick demo (as we had in class) is the best way to grasp what's possible with the application and how to use it.

Maxwell Pretzlav - Oct 13, 2008 10:05:22 pm

I was very impressed with Mira's web summary software. Two aspects of her presentation caught my interest most. First, I was struck by how their software found equivalent entries for the same location on multiple sites by simply querying Google directly and ranking the results based on certain specific criteria. At first, the ability to find equivalent entries on alternate sites seemed almost impossible -- how could her software possibly know the search API of every restaurant review site out there? When she showed how Google did all the work under the hood, it made me wonder if there are other classes of data location and integration problems that can be easily solved (or at least circumvented) by careful application of a general-purpose tool intended for human use. The Vispedia software explores some similar ideas by searching related Wikipedia pages, but I wonder if there are larger classes of problems that can be solved this way.

Along those lines, the idea of a database of user-entered labels for different sections of web pages was also very appealing to me. I can imagine people slowly annotating, over time, the different pieces of data on many disparate websites by using Mira's (or similar) software. If there were a repository for all those annotations, software could query that set of annotations to easily pull data from many sites without having to use a specific API for each site.
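A rough sketch of how such a repository might be consulted; the hostname, field names, and XPath expressions are purely illustrative inventions:

 # Query a hypothetical shared-annotation repository for a site's selectors,
 # then apply them to a fetched page with lxml.
 from urllib.parse import urlparse
 from lxml import html
 
 SHARED_ANNOTATIONS = {
     "www.example-reviews.com": {
         "name":   "//h1[contains(@class, 'biz-name')]/text()",
         "rating": "//span[contains(@class, 'star-rating')]/@title",
     },
 }
 
 def scrape(url, page_source):
     selectors = SHARED_ANNOTATIONS.get(urlparse(url).netloc)
     if selectors is None:
         return None  # nobody has annotated this site yet
     tree = html.fromstring(page_source)
     return {field: tree.xpath(xp) for field, xp in selectors.items()}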

Witton Chou - Oct 14, 2008 07:49:55 pm

The things that Mira presented in her lecture were amazing. The simplicity of doing research by extracting only the data you want and applying the same DOM extraction techniques across different pages is a very handy concept. Some of us really wanted the tools she presented, which attests to how useful they are. I think if we all had these tools, doing research would be much simpler and more efficient. I don't know how much time I've spent trying to research a particular product I'd like to buy on Amazon, or gathering material for a research paper, comparing specs or information across different pages on the same site that use the same layout. Being able to look at the DOM and perform searches that return only the information you want to see is great, and I really look forward to seeing what comes out of Mira's research.

Michael So - Oct 14, 2008 10:01:22 pm

That Firefox extension that helps a user collect information while browsing the web was pretty neat. The whole concept of extracting the desired parts of a web page was something I never really thought about. And applying that extraction pattern across similar web pages is very handy when looking up different hotels or restaurants, for example. "Summary" is a good name for these extractions because they are summaries of what you browsed on the web.

The neatest thing during Mira's lecture was at the end, when she introduced Zoetrope. Being able to rewind a web page as if it were a video was also something I had never thought about, and I think that is pretty novel. Finding articles on a news website from yesterday seems like it would be easy to do with Zoetrope. Also, the demo where Mira showed that you can make a visualization of price changes for some book/DVD was really neat. Knowing when the book/DVD was cheapest and most expensive, and how much it cost at those points, is something that a web site like amazon.com doesn't seem to show. I would be very interested in trying this tool out.

James Hamlin - Oct 15, 2008 05:01:38 am

The Vispedia paper's discussion of the problems of data integration reveals the (near?) intractability of joining the massive amount of diversely structured data on the web. That working within the relatively uniform space of Wikipedia tables and infoboxes can present such technical challenges demonstrates the trouble we're in when we try to capture the 'semantic web.' This is where the variegated approaches to web 'mash-ups,' of which Mira's project is one great example, find some of their greatest technical challenges (as opposed to the HCI concerns). People working in this space seem to be at a unique nexus, connected to psychophysics, graphics, data integration, and interpretation. Mira's brief comment on the relative lack of AI/NLP results applied to this space was a bit surprising at first, but it does make sense, especially given the novelty of this work. Researchers in web visualization are attacking, first and directly, the actual problems users would want solved. As more is learned about how the interactions should occur, additional iterations can incorporate more advanced techniques for solving these hard data integration problems.

Chris - Oct 15, 2008 05:36:00 am

The "Summarizing Personal Web Browsing Sessions" paper reminded me of a related project that some of my roommates from years past did. The project was a web browser called TrailBlazer which featured a visual history browser. The history browser, brought up by clicking a button on the toolbar, presents the user with a tree of the pages visited past web browsing sessions. The nodes in the tree are the webpages visited and are represented with a screenshot of the webpage. The edges represent link traversal. The thumbnail view of each webpage as well as the contextual data (which pages led where) is more useful than the standard "list of titles of pages you've viewed in the past" features in most browsers even today.

The project still has an active webpage. The page has an intro video which shows some sample interaction and sessions.

Yuta Morimoto

I was very impressed by the presentation. It was the first time I had seen visualization applied to the time sequence of a web page. It seems very usable for anyone, and very useful for taking web page snapshots and collecting time-sequence data, especially in fields like sociology or ethnology. I think many data mining approaches focus on the marketing angle of revealing correlations in chronological online data, yet only a savvy computer engineer can analyze the very complex web log files, which definitely lack visualization. Actually, there already exists great log-file visualization software, such as glTail; it achieves impressive visualization, but it seems focused on network engineers. So Mira's project is a very useful application for people working on interpreting visualized data.

Matt Gedigian - Oct 15, 2008 10:58:30 pm

I was really impressed with the Web Summaries, even though visualization seemed to be a relatively minor component. As @Maxwell mentioned, the use of Google was clever, as were the heuristics used to link information. In general, the system seemed to have a very nice combination of novel ideas and good software engineering, which don't always come together.

It seemed like having everyone using the software mark the same divs in Yelp or Restaurant.com would be a lot of duplicated effort. A public repository of patterns would be a huge timesaver. But in "Summarizing Personal Web Browsing Sessions", they say that users weren't interested in a shared library of extraction patterns. What is wrong with people?

Dmason - Oct 15, 2008 01:42:10 pm

The Document Object Model was a new concept to me (did I miss something earlier?), and it started an interesting search into a whole paradigm of document creation and analysis. A quick Google search, for instance, shows an interesting and convoluted history (here, as usual, the Wikipedia entry is invaluable: [1]). I had no idea that the DOM was imposed by the W3C, and like all things they impose, some parties (I'm looking at Microsoft here) totally or mostly disregard it. My impression is that compliance is directly proportional to the economic viability of a webpage, and that this model works best for large commercial webpages.

What I like about the DOM is that, like all the databases we've looked at before, DOM data needs to be massaged, and a fair amount of the time spent developing DOM analysis software must go toward accounting for and correcting non-compliance. (An analogy occurs, for instance, in Google's PageRank software, which must account for possible looping paths of links and for broken links.) It never even occurred to me that there was a prescribed form like this for HTML webpages, or that it could be so useful and frustrating.
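Lenient HTML parsers do exactly this kind of massaging. As a small illustration (not part of any of the presented systems), BeautifulSoup with the html5lib parser repairs the unclosed tags in a made-up fragment into a well-formed tree before any extraction happens:

 # Browser-grade error recovery turns sloppy markup into a usable DOM.
 from bs4 import BeautifulSoup
 
 messy = "<table><tr><td class=price>$12.99<td>Widget</table>"  # unclosed tags
 soup = BeautifulSoup(messy, "html5lib")
 
 print(soup.find("td", class_="price").get_text())  # -> $12.99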

Sarah Van Wart - Oct 15, 2008 12:14:07 PM

In response to Maxwell / Matt: I also found it interesting that Mira used Google as the underlying engine to search for events in a certain area, but then added a layer of post-processing on top of the search results to make the results and ordering "smarter." In the field of IR, it's interesting to see examples of how the power of an automated information retrieval tool that queries unstructured web documents really well (i.e., no content-based markup) can be coupled with a set of individual preferences and mapped relationships in order to really give users relevant results. Also, in class on Wednesday, a number of people asked Mira how the collective associations made between events, places, and spatial contexts could be leveraged into some greater product. I think that's a really interesting concept to think about -- what are the ways in which we can aggregate overlapping preferences in useful and intelligent ways to interrelate disparate data in particular contexts?

NickDoty - Oct 15, 2008 02:24:56 pm

I just want to add my skepticism (following several others in class, I think) about the robustness of Mira's projects like Web Summaries and Zoetrope, which take advantage of the existing non-rigorous, non-semantic structure of web pages. I know I have trouble with simple applications that use the DOM, like Web Clips on my Dashboard. Will this really work long term? And with a wide variety of pages? But more than that, if Web Summaries proves to be very successful, won't content providers actually have perverse incentives to change or obfuscate the structure of their documents?

Jeff Bowman - Oct 15, 2008 12:46:25 pm

@NickDoty: As a web developer, I know I put a fair amount of useful quasi-semantic markup into my pages via ids and class names. For instance, a table with id="products" may contain cells marked class="price", which could then be picked up with an XPath expression like //table[@id='products']//td[contains(@class, 'price')]. Well-coded, computer-generated webpages have a lot of that data built in. That said, her examples probably took advantage of particularly well-formatted sites, and it would be much more difficult to do the same thing with something like a newspaper site with varying features, page counts, advertisement configurations, and such.
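To make that concrete, here is a small hypothetical example (the markup is invented) of applying such an expression with lxml:

 # Pull the price cells out of a product table using the XPath above.
 from lxml import html
 
 page = """
 <table id="products">
   <tr><td class="name">Widget</td><td class="price">$12.99</td></tr>
   <tr><td class="name">Gadget</td><td class="price">$24.50</td></tr>
 </table>
 """
 
 tree = html.fromstring(page)
 prices = tree.xpath("//table[@id='products']//td[contains(@class, 'price')]/text()")
 print(prices)  # ['$12.99', '$24.50']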

I was particularly intrigued by the ability to draw a "window" and step backwards through that data in time. As my recently completed project focuses on Wikipedia, there is a large quantity of timestamped Wikipedia data for any given paragraph or article, and visualizing how a particular paragraph changes over time could use something similar to the "scented widgets" slider she demonstrated in her presentation.

Calvin Ardi - Oct 15, 2008 03:03:06 pm

Like Ketrina, I found Zoetrope to be an interesting tool. The only thing that provides somewhat similar functionality is archive.org, but in general I believe only major changes are documented there. It was interesting to see the changes shown "in place" over time. As Nick mentioned, a lot of sites may not necessarily follow a universal model or guideline, perhaps because there is no incentive to do so.

I enjoyed Mira's lecture and, as many of the other students have mentioned, its main focus on usability and on getting the information the user needs in an efficient manner. It was interesting to see the problem broken down from what a user would like to do (e.g., plan a vacation or buy a house) down to the details (examining the DOM, extracting and storing relevant data), as well as how to present it. As an idea to get around all the different layouts that exist, perhaps a library of extraction rules for these sites could be developed and maintained, hopefully by some community.

Matt Gedigian - Oct 15, 2008 03:30:53 pm

This article, Datascraping Wikipedia with Google Spreadsheets, shows how to scrape HTML tables (such as those on Wikipedia) into a Google spreadsheet and then use the charting tools to visualize the data.
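For anyone who prefers a script to a spreadsheet, the same idea can be approximated in Python with pandas.read_html, which pulls every HTML table on a page into a data frame; the URL below is just an example:

 # Read all <table> elements from a page into DataFrames, then chart as needed.
 import pandas as pd
 
 url = "https://en.wikipedia.org/wiki/Comparison_of_web_browsers"  # example page
 tables = pd.read_html(url)       # one DataFrame per HTML table found
 print(len(tables), "tables found")
 print(tables[0].head())          # inspect, then plot the columns of interest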

Seth Horrigan - Oct 15, 2008 06:23:46 pm

I had seen Mira's work on web summaries before and found it interesting, although, as others mentioned, it does not seem like it could be deployed on a large scale and work well. Still, it does show that the system could work in theory, and for certain sites, it does work. I was very impressed with Zoetrope though. It seems to be a sentiment shared by many of my classmates. Certainly, as some others commented, it will not work well as a visualization for all webpages, but for certain types of pages where the structure remains constant for an extended period of time but the content changes within that structure, it provides an amazing - and seemingly fairly robust - tool. I was so impressed with the idea, and the elegant implementation, that I have started looking into job opportunities at the Adobe Advanced Technology Lab.

Razvan Carbunescu - Oct 18, 2008 12:16:05 pm

A little late, but I'd still like to comment on the great lecture/talk given by Mira last Wednesday. I especially enjoyed Web Summaries and the way they were developed to improve searching capabilities. I was actually thinking about Web Summaries as a tool to help me when, for Assignment 3, I kept landing on pages where the data I needed was spread all around the page and some of it was missing for some countries. The ability to quickly select some fields and create a summary would have let me find which source had more complete data, and it might even have allowed extraction of the data from those websites. I also enjoyed the concept of cards, and how by just creating a simple template and using drag & drop you could easily select important information from a large set of webpages to visualize.


