Thesis: 1.4 Network datasets
To test the methods proposed within this thesis, we have collected one weighted and longitudinal dataset. This dataset is an online social network created from a virtual community for students at the University of California, Irvine, in the period between April and October 2004 (see Panzarasa et al., 2009, for a descriptive analysis). In this network, the nodes are 1,899 students. When joining the community, each student was asked to create a profile. This profile contained a number of personal details, including the user’s demographic characteristics, list of friends, personal blogs, and forum postings. Based on these details, students could search for others and decide with whom to communicate.
A directed tie is established from one student to another when the former sends the latter at least one online message; in total, 59,835 messages were exchanged. The weight of a directed tie is defined as the number of messages sent from one student to another. The maximum and average tie weights are 98 and 2.95, respectively. The students have on average 10.69 directed ties to others.
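As an illustration of how the tie weights and the summary figures above can be derived from a message log, consider the following minimal sketch. It assumes that the log is available as a list of (sender, receiver, timestamp) tuples; the function names, the tuple layout, and the toy data are hypothetical and do not reproduce the actual dataset or the code used in this thesis.

```python
from collections import Counter


def tie_weights(messages):
    """Count the messages sent along each ordered (sender, receiver) pair.

    `messages` is assumed to be an iterable of (sender, receiver, timestamp)
    tuples; the timestamp is ignored for the purpose of weighting.
    """
    weights = Counter()
    for sender, receiver, _timestamp in messages:
        weights[(sender, receiver)] += 1
    return weights


def summarise(weights, n_nodes):
    """Return the maximum tie weight, the average tie weight, and the
    average number of directed ties per node."""
    n_ties = len(weights)
    total_messages = sum(weights.values())
    return {
        "max_weight": max(weights.values()),
        "mean_weight": total_messages / n_ties,
        "mean_out_ties": n_ties / n_nodes,
    }


# Toy log (hypothetical): four messages among three users.
messages = [(1, 2, "t1"), (1, 2, "t2"), (2, 1, "t3"), (3, 1, "t4")]
weights = tie_weights(messages)
stats = summarise(weights, n_nodes=3)
```

In this toy log, the tie from user 1 to user 2 has a weight of 2, whereas the two other ties have a weight of 1. The same calculation links the figures reported above: with 1,899 students and 10.69 directed ties per student, there are roughly 20,300 ties, and 59,835 messages spread over these ties give an average weight of approximately 2.95.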
To ensure the protection of individuals and compliance with privacy laws, all individual identifiers were removed before we received the data. This included usernames, email and IP addresses, and personal descriptions. Each user was randomly assigned an identification number. In addition, the content of the online messages was not made available. For the messages, the information we received included only the identification numbers of the sender and receiver, and the time at which the message was sent. Furthermore, two companies that gained access to the community through students for the purpose of mass communication were excluded in order to filter out spamming activities. Moderators and other technical support staff, whose only aim was to facilitate the smooth functioning of the community, were also excluded.
In addition, we have relied on six datasets containing 11 weighted networks used in the literature. The first three networks are from Freeman’s EIES dataset (Freeman, 1978), also used in Wasserman and Faust (1994). This dataset was collected in 1978 and contains three networks of 32 researchers. The first is an acquaintance network of the group recorded at the beginning of the study (time 1). The second network is similar, but the data were recorded at the end of the study (time 2). The third is a frequency matrix of the number of messages sent between the researchers using an electronic communication tool. In the two acquaintance networks, all ties have a weight between 0 and 4. A weight of 4 represents a close personal friend of the researcher; 3 represents a friend; 2 represents a person the researcher had met; 1 represents a person the researcher had heard of, but not met; and 0 represents a person unknown to the researcher. In the frequency matrix, the average tie weight is 33.7 and the maximum weight is 559.
The second dataset contains four intra-organisational networks, two from a consulting company and two from a research team in a manufacturing company (Cross and Parker, 2004; we thank Andrew Parker at Stanford University for supplying this dataset). The consulting company had 46 employees, who are the nodes in the first two networks. The ties in the first network are differentiated in terms of the frequency of information or advice requests, whereas the ties in the second network reflect the value placed on the information or advice received. In both these networks, the directed ties are weighted on a scale from 0 to 5. The company had offices both in Europe and in the US. The US employees were divided into two tightly knit groups, whereas the European employees did not group together in the same way. The last two networks are based on a research team in a manufacturing company. The nodes in these networks are the 77 employees of the company. The ties in the first network are based on advice, whereas in the second network, they are based on the awareness of others’ knowledge and skills. In both these networks, the directed ties are weighted on a scale from 0 to 6. The networks were recorded after an organisational restructuring in which four separate units in different European countries had been combined. The research team was partitioned into strong communities based on employees’ previous geographical location (Cross and Parker, 2004, pp. 15-17).
The third dataset includes a network representing political support in the US Senate (101st Congress, 1989/1990, also used in Skvoretz, 2002; we thank John Skvoretz at University of South Florida for making this dataset available to us). The nodes are 102 senators and ties are based on co-sponsorship of bills. The average tie weight is 2.68 and the maximum weight is 29.
The fourth dataset contains the neural network of the Caenorhabditis elegans worm. This was examined by Watts and Strogatz (1998) in their study of the small-world phenomenon. (This dataset was obtained from the Collective Dynamics Group’s (Duncan Watts) website.) In this network, the nodes are 306 neurons and a tie joins two neurons if they are connected by either a synapse or a gap junction. The weight of a tie represents the number of these synapses and gap junctions between two neurons. The average weight is 3.74 and the maximum is 70.
The fifth dataset is the network of commercial airports in the United States (the US airport network). This network is publicly available on the website of the US Department of Transportation. The nodes in this network are the 676 commercial airports in the US. Two airports are tied together if at least one flight was scheduled between them in 2002. The weight of each tie corresponds to the average number of seats per day available on the flights connecting the two airports (Barrat et al., 2004; Guimera et al., 2005). There are a total of 3,523 ties, or scheduled routes, each consisting of one or more flights.
Finally, the sixth dataset is a scientific collaboration network collected by Newman (2001b,c). This network is constructed from the papers published on the arXiv repository in the area of condensed matter physics in the period from 1995 to 1999. The nodes in this network are the authors of those papers, and a tie between two authors is established if they have co-authored at least one paper. Following Newman (2001c), the weight attached to each tie is the sum, over all the papers the two authors have co-authored, of the inverse of the size of the collaboration minus one. In other words, each paper increases the node strength of an author by 1, which is equally distributed across the weights attached to the ties directed towards the collaborators on that paper. For example, if an author writes a single paper with two others, the weight attached to each of the two ties would be 0.5. A consequence of this is that the node strength is a proxy of productivity, whereas the average strength reflects whether an author collaborates multiple times with a small group of others.
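The weighting scheme described above can be sketched as follows. This is a minimal illustration rather than the code used by Newman (2001c) or in this thesis: the function names and the toy list of papers are hypothetical, ties are treated as undirected, and single-authored papers are assumed to be skipped because they create no ties.

```python
from collections import defaultdict
from itertools import combinations


def collaboration_weights(papers):
    """Sum 1 / (n - 1) over all papers co-authored by each pair of authors,
    where n is the number of authors on the paper."""
    weights = defaultdict(float)
    for authors in papers:
        authors = sorted(set(authors))  # remove duplicates, fix pair order
        n = len(authors)
        if n < 2:
            continue  # assumption: single-authored papers create no ties
        share = 1.0 / (n - 1)
        for a, b in combinations(authors, 2):
            weights[(a, b)] += share
    return dict(weights)


def node_strength(weights):
    """Sum the weights of all ties attached to each author."""
    strength = defaultdict(float)
    for (a, b), w in weights.items():
        strength[a] += w
        strength[b] += w
    return dict(strength)


# Toy data (hypothetical): one three-author paper and one two-author paper.
papers = [["A", "B", "C"], ["A", "B"]]
weights = collaboration_weights(papers)
strength = node_strength(weights)
```

In this toy example, the three-author paper adds 0.5 to each of the three ties, and the two-author paper adds 1 to the tie between A and B, which therefore has a weight of 1.5. Authors A and B each have a node strength of 2, equal to the number of co-authored papers they have written, whereas C has a node strength of 1.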