From CS294-10 Visualization Sp11
I took the SNAC dataset for analysis. The SNAC dataset has biographical and social graph information on 130,000 entities, including historical public figures, families, and institutions. The information about these entities was collected by curating and merging a corpus of archive description records from a number of cultural institutions, including libraries and museums. Each entity record in SNAC has biographical information that includes name(s), gender, birth and death dates, occupation(s), and biographical text. Entities in the database are linked, and each link is categorized as either correspondedWith or associatedWith. In addition, entities are associated with topics (library subject headings).
Motivations and goals
The SNAC dataset is incomplete: not every entity has every piece of information. As a member of the SNAC project, my first objective for this analysis was to collect statistics about what information is present (and not present) in the dataset and to visualize it. Next, I wanted to choose a subset of entities and visualize their social networks. My expectation was that these goals would evolve as the analysis progressed and that more specific questions would emerge.
What information is present (and not present)?
I wanted to focus my analysis on person entities. Specifically, I was interested in investigating the amount of missing information. The following visualizations depict the amount of missing information and also some basic summaries across dimensions for person entities.
Nationality and Gender
The visualization shows only countries with more than 200 entities. Among the person entities that have a recorded nationality, most are from the United States. This is understandable given the source of these records, which is mainly cultural institutions across North America. It is also clear that a majority of records in SNAC do not have nationality information. This was surprising; from the perspective of the SNAC research project, of which I am a member, I think we should consider exploring other data sources to fill in this information. The lack of gender information for most of the entities is even more surprising: as these records come from archivists and historians, I expected them to include such basic personal information.
Occupations and Topics
As mentioned above, SNAC entities have occupation and subject heading information. I did not attempt to visualize the topic and occupation distributions, as the SNAC website gives a detailed summary of them. But I found that, as with nationality and gender, topic and occupation information is missing for a majority of SNAC entities. It is understandable that the topic dimension has a better availability ratio: the primary use of these records is to help find the associated archives, and librarians use subject headings (or topics) for this purpose.
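The per-dimension statistics discussed above come down to counting, for each field, how many records leave it empty. The following is a minimal sketch of that computation, not the code used for this assignment. The field names and the sample records are assumptions made up for illustration and do not reflect the actual SNAC schema.

```python
from collections import Counter

# Hypothetical field names; the real SNAC records may be laid out differently.
FIELDS = ["nationality", "gender", "occupations", "topics"]


def missing_ratios(records, fields=FIELDS):
    """Return, for each field, the fraction of records where it is absent or empty."""
    missing = Counter()
    for record in records:
        for field in fields:
            # None, "" and [] all count as missing.
            if not record.get(field):
                missing[field] += 1
    total = len(records)
    return {field: (missing[field] / total if total else 0.0) for field in fields}


# Made-up records, for illustration only.
sample = [
    {"name": "P1", "nationality": "US", "gender": None, "occupations": [], "topics": ["Physics"]},
    {"name": "P2", "nationality": None, "gender": None, "occupations": ["Physicists"], "topics": []},
    {"name": "P3", "topics": ["War"]},
    {"name": "P4", "nationality": "GB", "gender": "female", "occupations": [], "topics": []},
]

ratios = missing_ratios(sample)
for field, ratio in ratios.items():
    print(f"{field}: {ratio:.0%} missing")
```

Treating empty lists and empty strings as missing is a design choice; a stricter analysis might distinguish "field absent" from "field present but empty".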
Distribution of Birth and Death Year
The following visualizations depict the distribution of birth and death dates of SNAC entities. It is interesting to note that the dataset covers a wide date range, although most of the information is about people who lived in the 19th and 20th centuries.
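A distribution like this can be approximated by binning birth years by century. The sketch below shows one way to do that under stated assumptions: the field name birth_year is hypothetical, the sample values are made up, and records without a usable year are skipped.

```python
from collections import Counter


def century_of(year):
    """Map a year to its century, e.g. 1879 -> 19, 1900 -> 19, 1901 -> 20."""
    return (year - 1) // 100 + 1


def births_per_century(records):
    """Count records per birth century, skipping records with no birth year."""
    counts = Counter()
    for record in records:
        year = record.get("birth_year")  # hypothetical field name
        if isinstance(year, int) and year > 0:
            counts[century_of(year)] += 1
    return dict(sorted(counts.items()))


# Made-up values, for illustration only.
sample_births = [
    {"birth_year": 1879},
    {"birth_year": 1904},
    {"birth_year": 1900},
    {"birth_year": None},
    {"birth_year": 1756},
]

by_century = births_per_century(sample_births)
```

The same function applies to death dates if the field name is swapped.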
I chose to analyze the social network of physicists in SNAC. All the visualizations below were created using Protovis. I used MongoDB for backend indexing and querying. The manually annotated visualization on the left shows the social network of entities with the occupation 'Physicists'. The graph shows connections that represent both correspondedWith and associatedWith. The thicker lines represent correspondedWith relationships.
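To draw such a graph, the links have to be turned into a node list and an edge list in which the relation type controls the line thickness. The following is a rough sketch of that preparation step. The two relation names come from the dataset description above; the tuple layout, the dictionary keys, and the width values are assumptions chosen for illustration.

```python
# Assumed line widths: correspondedWith is drawn thicker than associatedWith.
LINE_WIDTH = {"correspondedWith": 3, "associatedWith": 1}


def build_graph(links):
    """Turn (source, target, relation) tuples into node and edge lists.

    Edges refer to nodes by their position in the node list, which is a
    common input shape for node-link layouts.
    """
    names = sorted({name for source, target, _ in links for name in (source, target)})
    index = {name: i for i, name in enumerate(names)}
    nodes = [{"name": name} for name in names]
    edges = [
        {"source": index[source], "target": index[target], "width": LINE_WIDTH[relation]}
        for source, target, relation in links
    ]
    return nodes, edges


# Made-up links between placeholder entities, for illustration only.
sample_links = [
    ("P1", "P2", "correspondedWith"),
    ("P2", "P3", "associatedWith"),
]

nodes, edges = build_graph(sample_links)
```

The resulting lists can be serialized to JSON and handed to whatever layout library draws the graph.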
It was interesting to see that only 9 entities matched the exact query 'Physicists'. To improve the search results, I queried using 'physics' as a pattern. The number of vertices in this graph was too large to annotate, so I thought it best to analyze it interactively, which Protovis allows. The graph can be accessed at: . The graph is interactive: hovering the mouse over a node shows its name, and nodes can be dragged to get a clearer picture of their network.
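The difference between the exact query and the pattern query can be illustrated without a database. The sketch below uses plain Python in place of MongoDB; the record layout and sample occupations are made up, and case-insensitive matching is an assumption. One detail worth noting is that the literal string 'physics' does not occur inside 'Physicists', so a pattern query and an exact query can return disjoint sets, and a shorter stem such as 'physic' would cover both.

```python
import re


def exact_match(records, occupation):
    """Records that list exactly this occupation string."""
    return [r for r in records if occupation in r.get("occupations", [])]


def pattern_match(records, pattern):
    """Records with any occupation containing the pattern (case-insensitive)."""
    regex = re.compile(pattern, re.IGNORECASE)
    return [
        r for r in records
        if any(regex.search(occupation) for occupation in r.get("occupations", []))
    ]


# Made-up records, for illustration only.
sample_people = [
    {"name": "P1", "occupations": ["Physicists"]},
    {"name": "P2", "occupations": ["Nuclear physics"]},
    {"name": "P3", "occupations": ["Astrophysicists"]},
    {"name": "P4", "occupations": ["Chemists"]},
]

exact = [r["name"] for r in exact_match(sample_people, "Physicists")]
by_physics = [r["name"] for r in pattern_match(sample_people, "physics")]
by_stem = [r["name"] for r in pattern_match(sample_people, "physic")]
```

In MongoDB, a pattern query of this kind would typically be expressed with the $regex operator on the occupation field.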
An interesting analysis I wanted to do was to find out how many of these physicists and their associations were involved with the topics 'War' or 'Nuclear'. The graph clearly shows the involvement of Oppenheimer with these topics. It is interesting to note that Einstein is not associated with either.
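The selection behind this graph can be sketched as two steps: collect the physicists together with their direct neighbours, then keep those whose topics mention one of the keywords. The code below is an illustrative sketch under assumptions, with hypothetical field names, placeholder entities, and simple case-insensitive substring matching on topic strings.

```python
def mentions_any(record, keywords):
    """True if any topic of the record contains any keyword (case-insensitive)."""
    topics = [topic.lower() for topic in record.get("topics", [])]
    return any(keyword.lower() in topic for keyword in keywords for topic in topics)


def on_topic(records, links, seed_names, keywords):
    """Names of seed entities and their direct neighbours that match the keywords."""
    by_name = {record["name"]: record for record in records}
    seeds = set(seed_names)
    candidates = set(seeds)
    for source, target in links:
        if source in seeds:
            candidates.add(target)
        if target in seeds:
            candidates.add(source)
    return sorted(
        name for name in candidates
        if name in by_name and mentions_any(by_name[name], keywords)
    )


# Made-up records and links, for illustration only.
sample_records = [
    {"name": "P1", "topics": ["Nuclear weapons"]},
    {"name": "P2", "topics": ["Relativity"]},
    {"name": "P3", "topics": ["World War, 1939-1945"]},
    {"name": "P4", "topics": ["War"]},  # matches a keyword but is not linked to a seed
]
sample_edges = [("P2", "P3")]

matched = on_topic(sample_records, sample_edges, ["P1", "P2"], ["war", "nuclear"])
```

Substring matching is deliberately crude: 'war' would also match a heading such as 'Awards', so a real analysis would probably match on whole words or on the structured subject headings.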
I had two objectives for this assignment. First, I wanted to find missing information in the SNAC dataset across its various dimensions. The analysis showed that the ratio of missing or unavailable information in SNAC is rather large. Second, I wanted to visualize the social networks for a subset of entities within SNAC. I chose to analyze physicists. My analysis revealed interesting connections for and between famous physicists such as Einstein and Oppenheimer. I also tried visualizing physicists annotated with the 'war' and 'nuclear' topics. As expected, the visualization showed Oppenheimer to be an important entity for these two topics.