BigDND: Big Dynamic Network Data

Erik Demaine (MIT) & MohammadTaghi Hajiaghayi (UMD)

Networks are everywhere, and there is an increasing amount of data about networks viewed as graphs: nodes and edges/connections. But this data typically ignores a third key component of networks: time. This repository provides free, big datasets for real-world networks viewed as a dynamic (multi)graph, with two types of temporal data:

A timeseries of instantaneous edge events, such as messages sent between people. Many such events can occur between the same pair of nodes.
Timestamped edge insertions and edge deletions, such as friending and defriending in a social network. Generally only one such edge can exist at any specific time, but the same edge can be added and deleted multiple times.
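To make the two kinds of temporal data concrete, here is a minimal Python sketch. The tuple layouts, node names, and helper functions (`event_counts`, `edges_at`) are illustrative assumptions, not a format used by any dataset listed on this page.

```python
from collections import Counter

# Type 1 (hypothetical layout): instantaneous edge events as
# (time, source, target). Many events may share the same node pair.
events = [
    (1, "alice", "bob"),
    (2, "alice", "bob"),
    (5, "bob", "carol"),
]

# Type 2 (hypothetical layout): timestamped edge insertions ("+")
# and deletions ("-") as (time, operation, source, target).
updates = [
    (1, "+", "alice", "bob"),
    (3, "+", "bob", "carol"),
    (4, "-", "alice", "bob"),
    (6, "+", "alice", "bob"),
]


def event_counts(events):
    """Count how many events occurred between each ordered node pair."""
    return Counter((u, v) for _, u, v in events)


def edges_at(updates, t):
    """Return the set of edges present at time t by replaying the updates."""
    present = set()
    for time, op, u, v in sorted(updates):
        if time > t:
            break
        if op == "+":
            present.add((u, v))
        else:
            present.discard((u, v))
    return present
```

In this sketch the event list behaves like a multigraph (the alice-bob pair appears twice), while the update list describes a simple graph whose edge set changes over time: the alice-bob edge exists at time 3, is gone at time 5, and is back at time 6.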

Our hope is that these datasets will promote new research into the dynamics of complex networks, improving our understanding of their behavior, and helping the community to experimentally evaluate their big-data algorithms: approximation, fixed-parameter, external-memory, streaming, and network-analysis algorithms.

Help us:

If you have a dynamic network dataset, email us at dnd (at) csail.mit.edu with a brief description of the data, its format, its license, and how/where to download it. We will link to it with appropriate credit/citation.
If you have interesting visualizations and/or analyses of these datasets, email us at dnd (at) csail.mit.edu and we will post them with appropriate credit/citation.

DBLP Data

News: A ranking of CS departments based on the number of DBLP papers in theoretical computer science has just been released. Similar rankings for other areas of CS, and a general CS ranking, are planned.

The computer science bibliography DBLP offers its entire dataset of bibliography entries in XML format under the Open Data Commons Attribution License (ODC-BY 1.0). The data is updated daily and includes the year of each publication, which makes it timeseries data. As of October 2014, it consists of 4,215,613 papers and 9,086,030 edges between papers and authors.

We have developed free software to compute timestamped graph data for this DBLP data.
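As a rough illustration of what timestamped graph data derived from bibliography XML can look like, the Python sketch below extracts (year, paper, author) edges from a tiny hand-made sample. This is not the software mentioned above. The sample records and the function name `timestamped_edges` are assumptions for illustration, and the full DBLP dump is far larger and would likely need streaming parsing and character-entity handling that this sketch omits.

```python
import xml.etree.ElementTree as ET

# Hand-made sample in the general shape of DBLP records (illustrative only).
SAMPLE = """<dblp>
  <article key="journals/x/A14">
    <author>Ann Author</author>
    <author>Bob Writer</author>
    <title>First Paper.</title>
    <year>2013</year>
  </article>
  <inproceedings key="conf/y/B14">
    <author>Ann Author</author>
    <title>Second Paper.</title>
    <year>2014</year>
  </inproceedings>
</dblp>"""


def timestamped_edges(xml_text):
    """Yield (year, paper_key, author) for each paper-author pair."""
    root = ET.fromstring(xml_text)
    for record in root:
        year = record.findtext("year")
        if year is None:
            continue  # skip records without a publication year
        for author in record.findall("author"):
            yield int(year), record.get("key"), author.text


# Sorting by year gives the bipartite paper-author edges as a timeseries.
edges = sorted(timestamped_edges(SAMPLE))
```

Each resulting tuple is one edge of the bipartite paper-author graph, stamped with the publication year.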

Social Network Data from MPI-SWS

The Max Planck Institute for Software Systems has gathered several large dynamic network datasets from a variety of social networks. This data is publicly available by emailing Alan Mislove at amislove (at) mpi-sws (dot) org.

Facebook: 60,290 users, 1,545,686 friendships, 838,092 timestamped wall posts. [Viswanath, Mislove, Cha, Gummadi 2009]
Flickr: 1,620,392–2,570,535 users, 11,195,144 photos, 17,034,807–33,140,018 timestamped links, 34,734,221 timestamped favorite markings, from November 2–December 3, 2006 and February 3–May 18, 2007. [Cha, Mislove, Gummadi 2009]
YouTube, LiveJournal, and Orkut data are also available.

Google+ Social Network Data with Node Attributes

UC Berkeley has published four snapshots, taken at different times, of the same subset of the Google+ social network. Thus each object has a coarse timestamp between 1 and 4. The network starts with 4,693,129 nodes and 47,130,325 edges, and grows to 28,942,911 nodes and 462,994,069 edges. Nodes additionally have optional attributes of employer, school, major, and places lived.
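One way to turn a sequence of snapshots into coarse timestamps is to record, for each edge, the first snapshot in which it appears. The Python sketch below does this on a toy example. The function name `first_seen` and the in-memory representation (one set of edges per snapshot) are assumptions for illustration, not the format of the published data.

```python
def first_seen(snapshots):
    """Map each edge to the 1-based index of the first snapshot containing it."""
    stamp = {}
    for index, edges in enumerate(snapshots, start=1):
        for edge in edges:
            stamp.setdefault(edge, index)  # keep the earliest index only
    return stamp


# Toy data: four growing snapshots of the same small network.
snapshots = [
    {(1, 2)},
    {(1, 2), (2, 3)},
    {(1, 2), (2, 3), (1, 3)},
    {(1, 2), (2, 3), (1, 3), (3, 4)},
]
stamps = first_seen(snapshots)
```

The same idea applies to nodes: replace each edge set with the set of node identifiers present in that snapshot.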

Twitter Data

The Max Planck Institute for Software Systems has gathered Twitter data encompassing 54,981,152 user accounts, 1,963,263,821 follow links (based on a snapshot in August 2009, no timestamps), and 1,755,925,520 timestamped tweets. This data is publicly available by emailing twitter-contact (at) mpi-sws.org.

Paper Citation Data

In these networks, nodes represent papers/publications and directed edges represent citations.

arXiv.org HEP-PH (high energy physics phenomenology). 34,546 timestamped papers and 421,578 citations, from January 1993 to April 2003. Available for download from SNAP. [From KDD Cup 2003]
arXiv.org HEP-TH (high energy physics theory). 27,770 timestamped papers and 352,807 citations, from January 1993 to April 2003. Available for download from SNAP. [From KDD Cup 2003]
USA Patents. 3,774,768 timestamped patents and 16,518,948 citations, from 1975 to 1999. Available for download from SNAP. [From National Bureau of Economic Research]
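A citation network of this kind is often distributed as a plain-text edge list. The Python sketch below assumes one whitespace-separated "citing cited" pair per line with "#" comment lines, and a separate mapping from paper to year. Both the layout and the names `read_edges`, `citations_by_year`, and `year_of` are assumptions for illustration, so check the actual files before relying on them.

```python
import io


def read_edges(lines):
    """Parse 'citing cited' integer pairs, skipping blanks and '#' comments."""
    edges = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        src, dst = line.split()[:2]
        edges.append((int(src), int(dst)))
    return edges


def citations_by_year(edges, year_of):
    """Count citations made per year, dating each by its citing paper."""
    counts = {}
    for src, _dst in edges:
        year = year_of.get(src)
        if year is not None:  # ignore papers with unknown dates
            counts[year] = counts.get(year, 0) + 1
    return counts


# Toy data standing in for a downloaded edge list and date table.
SAMPLE = """# citing_paper cited_paper
3 1
3 2
2 1
"""
year_of = {1: 1993, 2: 1994, 3: 1995}

edges = read_edges(io.StringIO(SAMPLE))
counts = citations_by_year(edges, year_of)
```

Dating each citation by its citing paper turns the static directed graph into a stream of timestamped edge insertions.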

HUGE: Brain Connectome Data

Johns Hopkins University's Open Connectome Project has gathered a huge amount of brain network/connectome data. All data is available for download.

HUGE: Web Graph Data

The University of Mannheim has gathered a Web Hyperlink Graph based on the Common Crawl 2012 web corpus, featuring 3,500,000,000 webpages and 128,000,000,000 hyperlinks. All data is available for download.

Codes

We provide two programs for measuring quantitative properties of graphs. Both are specifically designed to run on large graphs.

matching.cpp approximates the size of the largest matching of a graph based on the work of Esfandiari et al.
dense.cpp approximates the size of the densest subgraph of a graph based on the work of Esfandiari et al.
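For readers who want a simple baseline to compare against, the Python sketch below implements two classical textbook heuristics: a greedy maximal matching, whose size is at least half that of a maximum matching, and greedy peeling, which returns a density (edges divided by nodes) at least half that of the densest subgraph. These are not the algorithms of Esfandiari et al. and not the contents of matching.cpp or dense.cpp. The function names are our own, and the quadratic-time peeling loop is meant for small examples only.

```python
def greedy_matching_size(edges):
    """Size of a greedy maximal matching (at least half the maximum)."""
    matched = set()
    size = 0
    for u, v in edges:
        if u != v and u not in matched and v not in matched:
            matched.update((u, v))
            size += 1
    return size


def peel_density(edges):
    """Best density |E|/|V| seen while repeatedly removing a min-degree node."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    m = sum(len(nbrs) for nbrs in adj.values()) // 2
    best = 0.0
    while adj:
        best = max(best, m / len(adj))
        node = min(adj, key=lambda x: len(adj[x]))
        neighbors = adj.pop(node)
        for nb in neighbors:
            adj[nb].discard(node)
        m -= len(neighbors)
    return best


# Toy graph: a 4-clique on nodes 1-4 plus a pendant node 5.
toy = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4), (4, 5)]
matching = greedy_matching_size(toy)
density = peel_density(toy)
```

On this toy graph the matching heuristic finds 2 edges and the peeling heuristic finds the 4-clique, with density 6/4 = 1.5.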

Related Sites

For more datasets, check out SNAP and Kaggle.

About Us

We have been designing graph algorithms for our whole lives, and have been collaborating since 2001, with over 60 joint publications. Erik Demaine is a MacArthur Fellow, Sloan Fellow, Guggenheim Fellow, Presburger Award recipient, and Pólya Lecturer; he has published over 400 papers with over 450 co-authors, has given over 300 plenary and invited talks around the world, and writes open-source software. MohammadTaghi Hajiaghayi is the Jack and Rita G. Minker Professor, an NSF CAREER recipient, an ONR Young Investigator recipient, and a two-time Google Faculty Research Award recipient; he has published over 250 papers with over 215 co-authors, has given over 70 invited talks around the world, has over 13 granted or filed patents, and runs the Predictaa platform.