A4-ArpadKovacs

From CS 294-10 Visualization Sp10

Part 1: Description of Data Domain and Storyboard Interactive Visualization Techniques

Overview

I plan to create a visualization utilizing dynamic queries that makes it easy to geographically display the location and subnet of an IP address. Alternatively, my tool could also show the routing of IP packets over the Internet.

Interaction Techniques

Services for geolookup of IP addresses already exist, but none of them lets the user perform dynamic queries that continuously update the filtered data. Dynamic queries are especially useful in this domain, since they allow the user to see firsthand the hierarchical organization of Internet Protocol networks as he/she refines the search results. I hope that this project will lead to a greater understanding of the architecture and structure of the Internet and its subnets.

The user can begin to enter an IP address and see the corresponding geographical area highlighted on a map of the United States (or of the world, if time permits). As the user enters additional digits of the IP address, the location is dynamically refined: the points that match the specified subnet remain highlighted, while other points that only share a parent link (but fall within a different subnet) fade. For packet routing, the user can enter the address of a remote server (either as a domain name, e.g. www.berkeley.edu, or an IP address), and the program will display the intermediate nodes that the packet passes through, as detected by the traceroute utility. The intermediate nodes will appear as they are detected by traceroute, and will then fade out as new nodes are detected.

Datasource

I am using the IP address geolocation SQL database freely provided by IPInfoDB. This data contains a list of known IP addresses, along with the longitude and latitude coordinates that I will need for geolocation.

Whois and traceroute are well-known Unix system utilities that are bundled/readily available in most modern Linux/Unix distributions.

Storyboard

My design was stylistically inspired by Ben Fry's ZipDecode. Since I am not good at drawing, I decided to mock the storyboard up in Inkscape, but unfortunately these pictures turned out to be of higher fidelity than I planned; maybe I'll retry my horrendous pencil and paper skills next time.

Entering an IP address the user is interested in, and restricting the search domain through a dynamic query:

File:AKDynamicIPQuery01.png

File:AKDynamicIPQuery02.png

File:AKDynamicIPQuery03.png

The user tries out the interactive traceroute functionality on nytimes.com:

File:AKDynamicIPQuery04.png

File:AKDynamicIPQuery05.png

File:AKDynamicIPQuery06.png

File:AKDynamicIPQuery07.png

File:AKDynamicIPQuery08.png

File:AKDynamicIPQuery09.png

Differences Between Storyboard Sketch and Implementation

I have implemented the traceroute functionality, but unfortunately I had to disable it, since it relies on JNI (Java Native Interface), which is very finicky and in my experience does not run on some computers, or crashes often. I needed to use JNI because traceroute requires low-level network access (e.g. raw ICMP packets), which is not available in the default Java SDK.

Otherwise, the functionality I implemented is largely the same as described in the storyboard above.
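One alternative to raw ICMP access is to call the system traceroute utility and parse its text output. The following is a minimal Python sketch of that idea, not the JPCap-based approach used in the application. It assumes a Unix-style `traceroute` that accepts `-n` (numeric output) is installed, and the function names `parse_hops` and `trace` are hypothetical.

```python
import re
import subprocess

# Matches a hop line of `traceroute -n` output: hop number, then an IPv4 address.
HOP_LINE = re.compile(r"^\s*(\d+)\s+(\d{1,3}(?:\.\d{1,3}){3})\b")


def parse_hops(output: str):
    """Extract (hop_number, ip) pairs; hops that timed out ('* * *') are skipped."""
    hops = []
    for line in output.splitlines():
        m = HOP_LINE.match(line)
        if m:
            hops.append((int(m.group(1)), m.group(2)))
    return hops


def trace(host: str, timeout: float = 60.0):
    """Run the system traceroute (assumed to be installed) and parse its output."""
    result = subprocess.run(["traceroute", "-n", host], capture_output=True,
                            text=True, timeout=timeout, check=False)
    return parse_hops(result.stdout)
```

Because this waits for traceroute to finish, hops would appear all at once. Showing nodes as they are detected would require reading the process output line by line instead.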

Application Description

Online Applet: http://inst.eecs.berkeley.edu/~cs160-di/cs294/as4.html

Source code: File:AKNetVis.tar.gz

Download and extract the archive, then run the AKNetVis.jar file in the root directory using the following command: java -jar AKNetVis.jar

Optionally, specify a different dataset: java -jar AKNetVis.jar world.gz

My application allows for searching of the world-wide IP address space using dynamic queries. The user is presented with a screen similar to ZipDecode, and can enter the IP address of a site and discover the geographic location of the corresponding host. The input can be in raw binary, domain name (e.g. google.com), or IPv4 (e.g. 128.032.112.223) form.

The map implements dynamic queries by filtering the content by color: sites that match the query are drawn in white, while sites that belong to other subnets are drawn in brown. There is also auto-zoom functionality to focus in on locations of interest, but this is not recommended, since it slows the program down immensely.
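The writeup does not say how auto-zoom chooses its view. One plausible way, sketched here in Python as an assumption rather than the application's actual logic, is to zoom to the bounding box of the sites that currently match the query. The function name `bounding_box` and the `margin` parameter are hypothetical.

```python
def bounding_box(points, margin=0.05):
    """Return (min_lon, min_lat, max_lon, max_lat) padded by a margin.

    points is an iterable of (longitude, latitude) pairs for matching sites.
    """
    pts = list(points)
    if not pts:
        return None  # nothing matches, so the caller keeps the current view
    lons = [p[0] for p in pts]
    lats = [p[1] for p in pts]
    pad_lon = (max(lons) - min(lons)) * margin
    pad_lat = (max(lats) - min(lats)) * margin
    return (min(lons) - pad_lon, min(lats) - pad_lat,
            max(lons) + pad_lon, max(lats) + pad_lat)
```

Recomputing this box over every matching point on each keystroke is linear in the number of matches, which may be one reason zooming is expensive on a large dataset.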

Data Preprocessing

Preprocessing took about 5 hours in total, since I frequently had to filter and restructure my dataset for performance and data coherence reasons. I processed the data by importing it into a MySQL database and running various queries on the original data source to aggregate the data into tab-delimited files.

I initially created a dataset for the continental US with the following fields:

  • int ip - decimal representation of 32-bit IP address
  • float latitude
  • float longitude
  • string city
  • string state

Unfortunately, I found out that many tuples in the source dataset were missing key information (e.g. missing or invalid latitudes/longitudes; some tuples used the geographic center of the US as default lat/long coordinates). I also had to change the IP representation due to limitations of the prefuse search facilities (described under Implementation below), resulting in the following fields:

  • string ip - binary string representation of 32-bit IP address
  • float latitude
  • float longitude
  • string location - name of the location, eg city, state

mysql> create table USData as
    -> select lpad(bin(ip_start), 32, '0') as ip, latitude, longitude,
    ->        concat(city, ', ', region_name) as location
    -> from ip_group_city
    -> where country_name = 'United States'
    ->   and region_name != 'Hawaii'
    ->   and region_name != 'Alaska'
    ->   and region_name != ''
    ->   and region_name != 'Armed Forces Africa'
    ->   and region_name != 'Armed Forces Pacific';
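The `lpad(bin(ip_start), 32, '0')` expression in the query turns each integer IP address into a fixed-width string of 32 binary digits. A minimal Python sketch of the same conversion and its inverse follows. The function names are hypothetical and are not part of the application, which is written in Java.

```python
def ip_int_to_bits(ip: int) -> str:
    """Return the 32-character, zero-padded binary string for a 32-bit IP."""
    if not 0 <= ip < 2**32:
        raise ValueError("expected an unsigned 32-bit integer")
    return format(ip, "032b")


def bits_to_ip_int(bits: str) -> int:
    """Inverse of ip_int_to_bits: parse a 32-character bit string."""
    if len(bits) != 32 or set(bits) - {"0", "1"}:
        raise ValueError("expected exactly 32 binary digits")
    return int(bits, 2)
```

The fixed width matters: without the left padding, addresses below 128.0.0.0 would produce shorter strings, and a character-by-character prefix comparison would line up the wrong bits.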

Eventually, after solving my performance problems by migrating to the Processing toolkit, I created a world data set, which better represents the hierarchical structure of the IP address space than the default US dataset I was using.

Implementation

I found out that prefuse does not support numerical input in textboxes; it is limited to either searching plain textual data in textboxes via SearchQueryBinding, or representing range data with sliders via RangeQueryBinding. I spent a few hours attempting to add support for numerical textboxes to RangeQueryBinding, but eventually decided that this was a bad idea, since it required extensive changes to parts of the prefuse framework that I was not very familiar with and that were not as well documented as the user-visible portions (e.g. http://prefuse.org/doc/manual/interaction/dynamic_queries/ reads "Under Construction! Coming soon..."). I spent about 5 hours on this as well.

At this point, I decided to take another approach: I would convert the decimal IP address into a binary string representation, and then use the existing string-based search functionality to compare digits one-by-one. Unfortunately, this incurred severe performance overhead, and dynamic queries are not fun when each refinement takes 30 seconds to process. So I spent some more time (~2 hours) trying to tweak performance by reading documentation and writing a BitwisePredicate for masking the integer IP address to support performing dynamic queries by subnet.
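The masking idea behind a bitwise subnet test can be sketched as follows. This is a Python illustration with hypothetical names, not the actual prefuse BitwisePredicate code: an address matches a typed prefix of n bits when both agree after the low 32 - n bits are masked off.

```python
def prefix_mask(prefix_len: int) -> int:
    """Mask with the top prefix_len bits set (0 <= prefix_len <= 32)."""
    if not 0 <= prefix_len <= 32:
        raise ValueError("prefix length must be between 0 and 32")
    return (0xFFFFFFFF << (32 - prefix_len)) & 0xFFFFFFFF


def matches_prefix(ip: int, query_bits: str) -> bool:
    """True if the integer address starts with the typed binary prefix."""
    n = len(query_bits)
    if n == 0:
        return True  # an empty query matches every address
    query = int(query_bits, 2) << (32 - n)  # left-align the typed bits
    return (ip & prefix_mask(n)) == query
```

This gives the same answer as comparing the binary strings digit by digit, but each test is a single AND and comparison on integers instead of a string scan.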

Eventually, I realized that for this project, prefuse's abstractions were more of a hindrance than a help, and out of frustration I decided to seek out an alternative toolkit that would also provide better performance. I settled on Processing, which turned out to handle my large geoip datasets (containing 500,000+ coordinates) with ease, whereas prefuse was already choking on a much-reduced dataset of 100,000 tuples. I found that I could get more done in Processing than in prefuse: although I had to implement a lot of basic functionality manually, I had much greater control over the program's control flow, and could thus customize the application to fit my needs rather than trying to hack a framework into working for me. I spent about 1 hour learning Processing and configuring it to work in Eclipse (unfortunately the default Processing IDE does not contain a debugger), and then another 5 hours or so reimplementing my program in Processing.

I spent a lot of time on this project (30 hours), partly because I have been sick lately, and my programming efficiency has suffered greatly as a result. In summary, I think that prefuse is probably not that hard to use, but it could use better documentation. Overall, though, I think that Processing fits my needs better: it offers some of the abstractions that make prefuse very convenient, but it also offers direct access to low-level graphics APIs, which is very hard to achieve in prefuse.

Attribution

I used libraries and sample source code from the following projects:

  • Prefuse - initial (unacceptably slow) implementation
  • Processing - revised implementation for better performance
  • JPCap - for Traceroute

Screenshots

File:AKNetVis01.png

Original (slow) implementation in prefuse

File:AKNetVis02.png

Query refinement in old implementation

File:WorldStart.png

New implementation in Processing

File:WorldWest.png

Dynamic query in new interface, based on binary IP address

File:WorldNorthAmerica.png

Refinement of dynamic query, to show North American addresses

File:WorldEurope.png

Refinement of dynamic query, to show European addresses

File:WorldItaly.png

Further refinement of dynamic query, now showing Italian addresses

File:USUrl.png

URL entry mode

File:USIPv4.png

Conversion of URL into IPv4 string

File:USBinary.png

Once the user's input has been parsed, he/she can go back to binary mode to expand/reduce the scope of the query.

Postscript

On Tuesday, I added some pictures to show what the interface looks like; however, the code/writeup has not changed, so please grade my submission from Monday.

I tweaked the performance and made a few aesthetic changes to my original implementation, so that it would run in an applet. The result can be seen at http://inst.eecs.berkeley.edu/~cs160-di/cs294/as4.html.

This applet allows for searching of the world-wide IP address space using dynamic queries. The user is presented with a map of the world, and can enter the domain name (e.g. csua.berkeley.edu), IPv4 address (e.g. 128.032.112.223), or raw binary data (e.g. 10000000001000000111000011011111) of a site and discover the geographic location of the corresponding host. The user can change the input type using the TAB key.
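The three input forms describe the same 32-bit value, so each can be normalized to a binary string before filtering. Below is a minimal Python sketch of that normalization, with hypothetical function names. It is not the applet's Java code. The IPv4 parser accepts zero-padded octets such as 032, and the domain name case assumes network access for the DNS lookup.

```python
import socket


def ipv4_to_bits(text: str) -> str:
    """Convert dotted IPv4 text (octets may be zero-padded) to 32 bits."""
    octets = text.split(".")
    if len(octets) != 4:
        raise ValueError("expected four dot-separated octets")
    values = [int(o, 10) for o in octets]
    if any(not 0 <= v <= 255 for v in values):
        raise ValueError("each octet must be in 0..255")
    return "".join(format(v, "08b") for v in values)


def bits_to_ipv4(bits: str) -> str:
    """Convert 32 bits back to zero-padded dotted form."""
    if len(bits) != 32 or set(bits) - {"0", "1"}:
        raise ValueError("expected exactly 32 binary digits")
    return ".".join(f"{int(bits[i:i + 8], 2):03d}" for i in range(0, 32, 8))


def host_to_bits(name: str) -> str:
    """Resolve a domain name via DNS (needs network access) and convert it."""
    return ipv4_to_bits(socket.gethostbyname(name))
```

With these helpers, the IPv4 example 128.032.112.223 and the binary example in the paragraph above convert into each other.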

The map implements dynamic queries by filtering the results as the user types each binary bit, or each octet of the IPv4 address. Sites that match the query are drawn in white, while sites that belong to other subnets are drawn in brown. The size (side-length) of each square corresponds to how many bits match the specified query. There is also auto-zoom functionality to focus in on locations of interest.
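The color and size rule can be sketched as a function of the number of leading bits a site shares with the query. The white and brown colors come from the description above. The linear size mapping and the `min_side` and `step` values are assumptions for illustration, and the function names are hypothetical.

```python
def matching_bits(address_bits: str, query_bits: str) -> int:
    """Number of leading bits on which the address and the query agree."""
    count = 0
    for a, q in zip(address_bits, query_bits):
        if a != q:
            break
        count += 1
    return count


def marker_style(address_bits: str, query_bits: str,
                 min_side: float = 1.0, step: float = 0.25):
    """Return (color, side_length) for one site; sizes are illustrative."""
    shared = matching_bits(address_bits, query_bits)
    # A site matches only if it agrees with every bit typed so far.
    color = "white" if shared == len(query_bits) else "brown"
    return color, min_side + step * shared
```

Under this rule, a non-matching site that shares a long prefix with the query is still drawn larger than one that diverges at the first bit, which makes nearby subnets visible.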

Unfortunately, the visualization does not reveal much information for American domain names, since the allocation of IP addresses in the United States has become relatively disordered due to IPv4 address space exhaustion and reliance on local mirroring/load balancing services such as Akamai. However, the allocation of IP addresses in European and Asian nations remains more organized. For example, an interesting exercise is to start by entering welt.de and then delete one binary bit at a time, which reveals the hierarchy of the German and European addresses.
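Deleting a trailing bit from the query is the same as shortening the subnet prefix by one, which doubles the size of the matched address block. The Python sketch below expresses a typed binary prefix in CIDR notation using the standard `ipaddress` module. The function names are hypothetical, and the example prefix in the usage note is the first 16 bits of the binary example given earlier, not the address of welt.de.

```python
import ipaddress


def prefix_to_cidr(bits: str) -> str:
    """Describe a typed binary prefix as a CIDR block."""
    if len(bits) > 32 or set(bits) - {"0", "1"}:
        raise ValueError("expected at most 32 binary digits")
    base = int(bits.ljust(32, "0"), 2)  # pad the untyped bits with zeros
    return str(ipaddress.IPv4Network((base, len(bits))))


def widening_steps(bits: str, count: int):
    """CIDR blocks seen when deleting one trailing bit at a time."""
    count = min(count, len(bits))  # cannot delete more bits than were typed
    return [prefix_to_cidr(bits[:len(bits) - i]) for i in range(count + 1)]
```

For instance, `widening_steps("1000000000100000", 2)` walks from a /16 block out to the /15 and /14 blocks that contain it.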

Example of a dynamic query starting from welt.de, and expanding outwards by decreasing the size of the subnet mask.

File:Welt01.png

File:Welt02.png

File:Welt03.png

File:Welt04.png

File:Welt05.png

File:Welt06.png

File:Welt07.png


Old Idea (For Reference)

I plan to create a visualization that makes it easy to geographically find people and their phone numbers (as listed in the phonebook) through dynamic queries.

The search domain can be specified by entering at least the first two letters of the target person's last name, and the search space can be refined by entering additional letters of the person's first or last name or their location by state, city, or zip code. If more than 100 names satisfy the query, only the first 100 will be listed.

The results will be plotted on a map of the United States using the Processing toolkit. (This might be a bit too ambitious for a 1-week project...)

I plan to use the White Pages web service for looking up people. Information about the White Pages API can be found at the Developer Portal.

Postscript: Unfortunately, I discovered some limitations of the White Pages API: it only returns 20 matches per lookup and limits the number of lookups within a time interval, so it is not suitable for the real-time dynamic query system that I proposed.


