**Abstract**

Companies do not operate in a vacuum. As companies move towards an increasingly specialized production function and their reach becomes truly global, their aptitude in managing and shaping their inter-organizational network is a determining factor of their health. Current models of company financial health often lack variables describing the inter-organizational network, and as such, assume that (1) all networks are the same and (2) the performance of partners does not impact companies. This paper aims to be a first step towards removing these assumptions. Specifically, the impact is illustrated by examining the effects of customer and supplier concentrations and partners’ credit risk on credit-default swap (CDS) spreads while controlling for credit risk and size. We rely upon supply-chain data from Bloomberg that provides insight into companies’ relationships. The empirical results show that a well-diversified customer network lowers CDS spreads, while having stable partners with low default probabilities increases spreads. The latter result suggests that successful companies do not focus on building a stable ecosystem around themselves, but instead focus on their own profit maximization at the cost of the financial health of their suppliers and customers. At a more general level, the results indicate the importance of considering inter-organizational networks, and highlight the value of including network variables in credit risk models.

**1. Introduction**

The inter-organizational network surrounding companies provides opportunities and constrains behavior. By creating an extensive network of customers and suppliers (collectively called partners), a company is likely to increase shareholder value by specializing in core products and services. This feature can be traced all the way back to Adam Smith and the division of labor, albeit for workers instead of organizations. At an organizational level, similar effects are likely to impact companies’ performance and financial health. For example, the assembly of electronic products is becoming an increasingly international process, where components and technology are sourced from specialist suppliers. This process enables companies to focus on core skills, where return on investment and shareholder value are likely to be optimized.

A host of academic papers have applied a network perspective to identify and quantify the impact of inter-organizational networks on various success factors (Cross et al., 2003). According to the network perspective, organizations are embedded within networks of interconnected relationships. Different positions in a network are associated with a range of outcomes, such as imitation, adaptation, innovation, firm survival, and performance (Brass et al., 2004). For example, companies in a position of brokering between others have relatively higher returns (Bae and Gargiulo, 2004).

Unlike traditional perspectives that treat organizations as independent observations and focus simply on their attributes, network science incorporates additional structural information. These two perspectives are not mutually exclusive and can be combined to conduct a superior analysis by including variables based on network characteristics alongside attribute-based ones.

**1.1. Concentration**

The first feature of the inter-organizational network that this article attempts to tackle is concentration. For a company, concentrations can exist both on the customer side and the supplier side. These two aspects represent two different types of underlying risk. First, customer concentration is an immediate threat to the revenue stream as a single or a select few customers provide the majority of revenue received. For example, Table 1 highlights the top customers of TripAdvisor. As Expedia represents more than a quarter of TripAdvisor’s revenue, TripAdvisor is vulnerable to two scenarios: (1) Expedia is in a position to squeeze its profit margins, and (2) a substantial chunk of revenue would potentially be affected were Expedia to default, downsize, or restructure.

| Customers | Value | % Revenue |
|---|---|---|
| Expedia Inc | 54.51M | 25.63% |
| Priceline Group Inc/The | 42.17M | 19.83% |
| Orbitz Worldwide Inc | 11.56M | 5.44% |
| CTRIP.COM International Ltd | 10.20M | 4.79% |
| Google Inc | 3.25M | 1.53% |
| American Airlines Group Inc | 3.02M | 1.42% |

Table 1: Customers with a quantified relationship representing more than 1% of TripAdvisor Inc’s annual revenue

Second, concentration might also occur among suppliers. In this case, a default event could threaten business continuity and act as a source of operational risk, which ultimately could be detrimental to the financial health of a company. Whereas customer concentration might imply a squeeze on the focal company’s profit margins through revenue reduction, supplier concentration could lead to cost increases.

**1.2. Influence: Partners’ default probability**

A key feature of networks is the relatively high number of ties among similar nodes (McPherson et al., 2001). This feature is often referred to as homophily. Two causal mechanisms lead to homophily: selection and influence (Aral et al., 2009). First, similar nodes tend to form ties together (i.e., selection). Second, dissimilar nodes tend to become more similar over time if they are tied (i.e., influence). The latter feature is of key interest when understanding the consequences of the network as opposed to the mechanisms underpinning the network. An understanding of influence forms the basis for assessing and quantifying contagion, diffusion, and cascades.

From an inter-organizational network perspective, the financial health of companies’ partners could provide additional insight into their own financial health. Specifically, influence can be thought of as directly improving or worsening a company’s health (e.g., investment-grade companies are more stable and thereby bring about less volatility to their partners). This lower volatility is likely to bring about a smaller risk premium, and as such, enable a smoother operation of the overall system. Conversely, the system might not be driven by global optimization, but by local optimization instead. Companies are likely to prioritize their own profit maximization (i.e., local optimization) at the expense of ensuring a stable overall ecosystem (i.e., global optimization). Given these opposing theoretical aspects, it is an empirical question whether, and to what extent, interacting with stable partners is positively or negatively related to the financial health of companies.

The remainder of this article is organized as follows. First the methodology, including data collection and metric construction, is presented. This is followed by results. Finally, a conclusion and discussion section ends the paper with notes on general applicability, limitations, and avenues of future work.

**2. Methodology**

To model the effect of customer and supplier concentrations and partners’ credit risk on the financial health of companies, we apply a regression framework with credit-default swap (CDS) spreads as the dependent variable while controlling for credit risk and market cap. Our observations are all companies with 5-year CDS spreads on Bloomberg on April 29, 2014. This includes 828 companies. We limit our observations by excluding banks as they “lack a hard asset/manufacturing-type of supply chain” (Advisory note on the Bloomberg SPLC-screen when analyzing financial institutions). This implies removing 152 financial companies from the sample. As such, the total number of observations is 676. Additionally, we weight the observations by the reciprocal of companies’ logged market cap in billions to account for the fact that larger companies are more likely to be included in the sample than smaller companies. This alleviates potential correlation between the dependent variable and inclusion probability (Fuller, 2009) and enables a better understanding of the financial health of all companies instead of simply the ones that are more likely to have an observable 5-year CDS spread.

The sub-sections below provide details on the data collection and metric operationalization for quantifying the impact of the inter-organizational networks on CDS spreads. Note that all variables tend to be skewed and, as shown below for the dependent variable, taking the natural logarithm of the variables lessens the skewness. For count variables (e.g., number of suppliers) with a potential zero score, the log is taken of 1 plus the variable. More details and distribution plots for all variables are found in the Appendix of the accompanying paper.

**2.1. Dependent variable: CDS spreads**

Success or performance of a company can be quantified by a number of metrics. A traditional risk framework focuses on the financial health of companies by predicting default probabilities. These probabilities are often arrived at by modeling historical default events in a logistic-regression framework or by applying Merton’s structural default model. However, few companies, and especially large public companies, default. As such, these frameworks are hard to calibrate and exposed to rare-event bias.

To overcome potential biases, we choose an outcome variable that is quantified for a large number of companies and also closely related to the financial health of a company (Longstaff et al., 2005): the CDS spread. Specifically, we use the 5-year market CDS spread (in basis points; bps) as listed on Bloomberg under the Default Risk (DRSK) view.

The distribution of the 5-year market CDS spreads is highly right-skewed (Figure 1a), but conforms closer to a Gaussian shape after transformation with the natural logarithm (Figure 1b). This transformation of the dependent variable enables an ordinary least squares (OLS) regression framework to be applied, which lessens the complexity of the model. A well-tested simple model allows us to focus on incorporating novel metrics to assess the impact of the inter-organizational network on the financial health of companies.
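As a quick illustration of the effect of this transformation, the sketch below uses simulated right-skewed data (a stand-in for the Bloomberg sample, not the actual data) and shows that the log-transformed values are approximately symmetric:

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated right-skewed CDS spreads (bps); an illustrative stand-in
# for the Bloomberg sample, not the actual data.
spreads = rng.lognormal(mean=5.0, sigma=1.0, size=676)

def skewness(x):
    """Sample skewness: the third standardized moment."""
    x = np.asarray(x, dtype=float)
    return np.mean((x - x.mean()) ** 3) / x.std() ** 3

# The raw spreads are strongly right-skewed; after taking the natural
# logarithm, the skewness is close to zero.
print(skewness(spreads))
print(skewness(np.log(spreads)))
```

For count variables with possible zeros, `np.log1p` plays the same role as the log of 1 plus the variable described above.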

**2.2. Inter-organizational network data**

The interdependencies among companies are not straightforward to measure, assess, and collect. In fact, maintaining the confidentiality of this information might be of strategic importance to the companies. For example, it could enable customers to circumvent the focal company by purchasing directly from its suppliers. Nevertheless, some companies publish their supply chain (e.g., Apple provides a list of their top 200 suppliers, see apple.com/supplierresponsibility/) and others become known through media coverage and quarterly reports.

One source that attempts to collect the interdependencies among companies is Bloomberg. The supply chain analysis (SPLC) view gives an insight into the inter-organizational network in which a company operates by listing large suppliers, customers, and peers. The entities in the network are identified by combining Bloomberg analysts’ assessments, company reports, quarterly filings, company news releases, and media coverage. To gauge the coverage of this data, we performed an ad-hoc analysis of Apple’s top-200-suppliers list. There are 695 suppliers listed in the Bloomberg data, and the intersection between the two lists is 191 suppliers, which indicates a 95.5% coverage rate.

A number of identified relationships are quantified (e.g., Toyota Motors is 3M’s biggest customer, responsible for 4.4% of their revenue). This information is, however, only populated for about 37% of the relationships identified. In total, there are 100,030 relationships (63,001 supplier relationships and 37,029 customer relationships) for the companies included in this study. Out of these, 37,014 relationships are quantified (23,497 supplier relationships and 13,517 customer relationships). Moreover, the quality of these estimates is uncertain as this information is not required in regulatory filings.

A key limitation of using this data for network analysis is the difficulty of extracting it for multiple companies. To overcome this limitation, we created a custom tool to aid this otherwise manual process.

**2.3. Independent variables: Concentration**

We consider three features when assessing the concentration of customers and suppliers. First, the number of customers and suppliers is key to understanding how concentrated an inter-organizational network is. This is akin to degree within the network science literature (Opsahl et al., 2010). An increase in these numbers is likely to suggest additional concentration as only large partners, which represent concentration, are listed in Bloomberg.

Second, the distribution of relationship values tends to be skewed (e.g., see Table 1 for TripAdvisor Inc’s customers). Skewness brings about greater concentration than the number of partners alone would suggest.

Third, concentration can also arise through other companies. For example, both TripAdvisor and Expedia are suppliers to American Airlines, and since TripAdvisor and Expedia are mutually connected, American Airlines’ supplier base is more concentrated than the number of suppliers and skewness alone would suggest.

To measure the concentrations of customers and suppliers, it is common to apply the Herfindahl index. For example, antitrust regulators in the United States apply it to assess the competitiveness of sectors and the creation of monopolies when considering whether or not to approve mergers. It is defined as the sum of squared proportions. The square ensures that a single large concentration weighs more than many smaller ones. In our context, we use the percentage of revenue that a relationship represents for customer concentration (percentage of total costs for supplier concentration). In the example of customer concentration for TripAdvisor (Table 1), the metric is the sum of 0.2563^{2}, 0.1983^{2}, 0.0544^{2}, 0.0479^{2}, 0.0153^{2}, 0.0142^{2}, and so on for all customers, which is about 11%. Formally, it is:

$latex H_i = \sum_j p_{ij}^2, \qquad p_{ij} = \frac{w_{ij}}{w_i}$

where $latex w_{ij}$ is the relationship value from company *i* to company *j*, $latex w_i$ is the total revenue (cost) for the customer (supplier) analysis, and $latex p_{ij}$ is the fraction between these two values.
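As a concrete check, the index for TripAdvisor’s quantified customer relationships can be computed in a few lines (the shares below are copied from Table 1; smaller customers are omitted):

```python
def herfindahl(shares):
    """Herfindahl index: the sum of squared proportions."""
    return sum(s ** 2 for s in shares)

# TripAdvisor's quantified customer relationships (Table 1),
# as fractions of annual revenue.
tripadvisor = [0.2563, 0.1983, 0.0544, 0.0479, 0.0153, 0.0142]
print(round(herfindahl(tripadvisor), 4))  # 0.1107, i.e., about 11%
```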

It is worth noting that the Herfindahl index does not consider the third feature listed above: indirect concentrations. To incorporate this feature, constraint, an advanced version of the Herfindahl index, is often used in network analysis (Burt, 1992). This metric comes from the structural holes literature and is formally defined as:

$latex C_i = \sum_j \left( p_{ij} + \sum_q p_{iq} p_{qj} \right)^2$

where *q* are companies indirectly connected to company *i* and company *j*. If there were no indirect connections, this metric would be equal to the Herfindahl index as the $latex \sum_q p_{iq} p_{qj}$-term would be equal to 0.
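A minimal sketch of constraint (with hypothetical proportions; the nested dictionary maps each company to the proportional weights of its partners) illustrates that it collapses to the Herfindahl index when the focal company’s partners are not connected to each other:

```python
def constraint(p, i):
    """Burt's constraint for focal node i: the sum over partners j of
    (p_ij + sum over shared contacts q of p_iq * p_qj) squared.
    `p` maps each node to a dict of proportional tie weights."""
    total = 0.0
    for j in p[i]:
        indirect = sum(p[i][q] * p.get(q, {}).get(j, 0.0)
                       for q in p[i] if q != j)
        total += (p[i][j] + indirect) ** 2
    return total

# Hypothetical proportions with no ties among i's partners:
# constraint equals the Herfindahl index, 0.5^2 + 0.3^2 + 0.2^2 = 0.38.
p = {"i": {"a": 0.5, "b": 0.3, "c": 0.2}}
print(round(constraint(p, "i"), 2))  # 0.38

# Adding a tie between partners raises the score above the Herfindahl index.
p2 = {"i": {"a": 0.5, "b": 0.5}, "a": {"b": 1.0}}
print(round(constraint(p2, "i"), 2))  # 1.25
```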

However, this metric requires an exponentially larger data collection effort as the network of all partner companies would need to be acquired. As such, we have computed the following five metrics for both customer and supplier lists:

- Number of companies
- Number of companies with a Bloomberg identifier, which are likely to be public companies
- Number of customers (suppliers) with percentage of revenue (cost) defined
- The sum of percentage of revenue (cost) that customers (suppliers) represent
- The Herfindahl Index

It is worth noting that the last two are only calculated on the subset of partners with a quantified relationship. As such, these metrics only consider about 37% of the relational data.

**2.4. Independent variables: Influence**

We consider the default probability of partners to test whether the financial health of partners impacts the focal company. Specifically, we take the average of customers’ (suppliers’) Bloomberg-defined default probability derived from a structural Merton model. We chose this variable instead of CDS spreads as CDS spreads are only available for a limited sample of companies, and as such, would lead to sparse-data issues and dropped observations. By contrast, a default probability is available for the partners of 70% of relationships. Out of the total number of relationships, the partners of 69,637 relationships have a default probability (45,844 supplier relationships and 23,793 customer relationships).

For both suppliers and customers, we take a simple average and a weighted one. The weights are based on the percentage of revenue (costs) that the particular customer (supplier) represents. While the simple average is applied to about 70% of relationships, the weighted average is only calculated on 21% (12%) of customer (supplier) relationships as the percentage of revenue (cost) is only available for 37% of relationships.
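The difference between the two averages can be sketched as follows (the default probabilities and revenue shares below are hypothetical):

```python
def weighted_avg_dp(partners):
    """Average partner default probability, weighted by the share of
    revenue (cost) each partner represents. `partners` is a list of
    (default_probability, share) pairs; shares need not sum to 1."""
    total_weight = sum(w for _, w in partners)
    return sum(dp * w for dp, w in partners) / total_weight

# Hypothetical customers: (default probability, fraction of revenue).
customers = [(0.01, 0.40), (0.05, 0.10)]

simple = (0.01 + 0.05) / 2             # 0.03
weighted = weighted_avg_dp(customers)  # 0.018: the large customer dominates
```

The weighted version lets a dominant customer or supplier pull the average towards its own default probability, which matches the concentration logic above.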

**2.5. Control variables**

To control for general financial information, we include the corporate default probability as derived by Bloomberg’s structural Merton model (Bloomberg, 2013). This variable is correlated with CDS spreads (pair-wise correlation of 0.40; R^{2} is 0.16 in a univariate model). In fact, it can be argued that these two variables are the same. However, the model-derived default probability only considers the independent variables used in the model, and the purpose of this paper is to highlight that inter-organizational network variables have the potential to increase the explanatory power in a combined model.

Additionally, we include the market cap of companies as a proxy of size and the liquidity of the swaps. Larger companies are likely to provide more information to the market, be rated by a larger set of investors, and have more contracts with various maturity dates than smaller ones. As such, their CDS contracts are likely to be traded more frequently, which increases the liquidity.

We further control for the Global Industry Classification Standard (GICS) sector of the companies. This ensures that sectoral differences are parsed out.

Finally, indicator variables for the country of risk associated with the companies are included. This variable attempts to overcome spread differences due to countries. The country of risk is determined based on “four factors listed in order of importance: management location, country of primary listing, country of revenue and reporting currency of the issuer. Management location is defined by country of domicile unless location of such key players as Chief Executive Officer (CEO), Chief Financial Officer (CFO), Chief Operating Officer (COO), and/or General Counsel is proven to be otherwise”.

**3. Results**

The regression results from a select set of combination of variables are listed in Table 2. Model 1 is a baseline model without any inter-organizational variables. We find a strong link between default probability and CDS spreads. Together with country and sector indicator variables and market cap, this baseline or control model explains 68% of the variance in CDS spreads. In the Appendix, descriptive statistics and pair-wise correlation for the main variables are listed in Table 3 and results for the indicator variables are in Table 4.

| Variables | M1 | M2 | M3 |
|---|---|---|---|
| *Concentration* | | | |
| Suppliers (log) | | -0.081** | -0.04 |
| | | (0.027) | (0.026) |
| Customers (log) | | 0.135*** | 0.114*** |
| | | (0.022) | (0.022) |
| *Influence* | | | |
| Suppliers (log) | | | -0.251*** |
| | | | (0.033) |
| Customers (log) | | | -0.077** |
| | | | (0.029) |
| DP (log) | 0.315*** | 0.363*** | 0.396*** |
| | (0.022) | (0.024) | (0.023) |
| Market Cap (log) | -0.007 | -0.026 | -0.020 |
| | (0.019) | (0.025) | (0.023) |
| GICS sector indicators | incl. | incl. | incl. |
| Country of Risk indicators | incl. | incl. | incl. |
| Constant | 7.284*** | 7.538*** | 5.599*** |
| | (0.159) | (0.201) | (0.303) |
| Observations | 676 | 676 | 676 |
| R^{2} | 0.6806 | 0.6997 | 0.7319 |
| Adjusted R^{2} | 0.6642 | 0.6832 | 0.7163 |
| ΔR^{2} (bps; from M1) | | 191 | 513 |

Table 2: Regression results; full table available in the Appendix. *p<0.05; **p<0.01; ***p<0.001.

We find a relationship between CDS spreads and the various inter-organizational variables. Specifically, Model 2 shows that having many large suppliers lowers the CDS spread while having many large customers increases the spread. The latter effect is maintained in Model 3 when the influence variables are included. Both influence variables are negatively related to the spread. This indicates that higher default probabilities of partners are associated with lower spreads of the focal company. This effect suggests that financially healthy focal companies prioritize profit maximization over overall stability in their inter-organizational network by, for example, squeezing their suppliers’ profit margins, which in turn is detrimental to the suppliers’ financial health. For example, Walmart has the potential to exert pressure on suppliers if it represents a large proportion of their revenue. As such, it does seem that local optimization is favored over global optimization.

The inter-organizational variables increase the explanatory power of the framework. By including concentration and influence variables, the R^{2} increases from 0.68 to 0.73. This 513bps increase suggests that inter-organizational variables provide novel and additional insight into CDS spreads.

For a more comprehensive set of variable combinations, see Table 5 in the Appendix. It is worth noting that some of these combinations bring about greater model improvement, but we have chosen the simpler operationalizations of concentration and influence in Table 2. To ensure an identical sample, we set the suppliers’ (customers’) average default probability to the average of observed values when missing, as 61 companies have either no suppliers or no customers with a defined default probability. An alternative to Model 3 with these observations dropped increases the R^{2} to 0.8220. To be conservative, we chose the identical, larger sample used in Models 1 and 2 with the lower model improvement for comparability. As a robustness check, the analysis was conducted on the complete set of observations (i.e., including financial institutions) and the model improvement was maintained, albeit with smaller R^{2} values (M1: 0.6231; M2: 0.6300; M3: 0.6585; ΔM2: 69bps; ΔM3: 354bps).

**4. Conclusion and discussion**

This project has shown that the inclusion of inter-organizational network variables increases the explanatory power of models predicting CDS spreads, and in general the financial health of companies. We applied a simple OLS regression framework and found an increase in R^{2} of 191bps and 513bps when including concentration and influence variables, cumulatively. These results have direct applicability to credit risk frameworks, and suggest that they can be improved by including inter-organizational network information.

The analysis performed in this paper has a number of limitations. Chief among these is the simplicity of the analysis performed. It would surely be improved in more advanced models. For example, we applied a static framework, but spreads vary over time and the volatility could be modeled. Additionally, the supply-chain data used is solely available for public companies. While we attempted to mitigate unequal inclusion probabilities through weighting, this limits the applicability of the analysis. Moreover, we did not collect sufficient data to analyze network constraint, given the non-incremental effect of the Herfindahl index over the number of relationships. Finally, although CDS spreads are market variables, they are impacted by algorithmic trading. In turn, this might imply that they converge around an aggregation of the various trading algorithms used.

A number of avenues of future work exist. We are particularly interested in appending the inter-organizational network variables to other existing credit risk frameworks, such as probability of default models. This would enable an understanding of whether network variables do increase the explanatory power in advanced frameworks predicting the default likelihood.

**References**

Aral, S., Muchnik, L., Sundararajan, A., 2009. Distinguishing influence-based contagion from homophily-driven diffusion in dynamic networks. Proceedings of the National Academy of Sciences 106(51), 21544-21549.

Bae, J., Gargiulo, M., 2004. Partner substitutability, alliance network structure, and firm profitability in the telecommunications industry. Academy of Management Journal 47(6), 860-875.

Bloomberg, 2013. Bloomberg credit risk DRSK: Framework, Methodology & Usage. Bloomberg L.P.

Brass, D.J., Galaskiewicz, J., Greve, H.R., Tsai, W., 2004. Taking Stock of Networks and Organizations: A Multilevel Perspective. Academy of Management Journal 47(6), 795-817.

Burt, R.S., 1992. Structural holes: The social structure of competition. Harvard University Press.

Cross, R., Parker, A., Sasson, L., 2003. Networks in the Knowledge Economy. Oxford University Press.

Fuller, W.A., 2009. Sampling Statistics. Wiley.

Longstaff, F., Mithal, S., Neis, E., 2005. Corporate yield spreads: Default risk or liquidity? New evidence from the credit-default swap market. Journal of Finance 60(5), 2213-2253.

McPherson, M., Smith-Lovin, L., Cook, J.M., 2001. Birds of a feather: Homophily in social networks. Annual Review of Sociology 27, 415-444.

Opsahl, T., Agneessens, F., Skvoretz, J., 2010. Node centrality in weighted networks: Generalizing degree and shortest paths. Social Networks 32 (3), 245-251.

This post is available in paper form on arXiv. If you use any of the information, please cite: Opsahl, T., Newton, W., 2015. Credit risk and companies’ inter-organizational networks: Assessing impact of suppliers and buyers on CDS spreads. arXiv:1602.06585.

I would like to acknowledge William Newton for helping to develop the idea behind this post.

The content of this post and the accompanying paper does not reflect the opinion of the author(s)’ employer. Responsibility for the information and views expressed therein lies entirely with the author(s).

Unlike previous posts, I am unable to upload the data due to licensing. As such, I have attempted to describe the data processing in great detail and provided a substantial appendix in the paper.

**Abstract**

As the vast majority of network measures are defined for one-mode networks, two-mode networks often have to be projected onto one-mode networks to be analyzed. A number of issues arise in this transformation process, especially when analyzing ties among nodes’ contacts. For example, the values attained by the global and local clustering coefficients on projected random two-mode networks deviate from the expected values in corresponding classical one-mode networks. Moreover, both the local clustering coefficient and constraint (structural holes) are inversely associated with nodes’ two-mode degree. To overcome these issues, this paper proposes redefinitions of the clustering coefficients for two-mode networks.

**Motivation**

The clustering coefficients for one-mode networks are a measure of cohesion or group formation. These measures are defined around triplets (i.e., three nodes with at least two ties among them) and whether or not these triplets are closed (i.e., they form part of a triangle). Two-mode networks are often projected onto one-mode networks to be analyzed. These projected networks often contain many more triangles than prototypical networks, and thus overestimate the level of clustering in a network. Methodological issues exist at a local level as well. Specifically, when calculating the local clustering coefficient (Watts and Strogatz, 1998) or the structural-holes measure constraint (Burt, 1992) on projected two-mode networks, the measures are inversely correlated with nodes’ two-mode degree on a randomly tie-reshuffled two-mode network (each node maintains its degree). Below are the average (a) local clustering coefficient and (b) constraint scores for nodes in a random version of the Scientific Collaboration Network (Newman, 2001) for various levels of two-mode degree.

As a result, a host of clustering measures for two-mode networks has been developed. For example, Robins and Alexander (2004) defined a coefficient as the number of four-cycles divided by the number of three-paths. Four-cycles are the smallest possible cycles in two-mode networks (just as triangles are the smallest possible cycles in one-mode networks). However, this measure is distinctly different from the idea of triadic closure as it only includes two primary nodes. In fact, a four-cycle is an indication of reinforcement or agreement between two nodes, not of cohesion or group formation.
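The four-cycle coefficient can be illustrated with a brute-force sketch. Note that counting conventions for ordered paths vary across implementations, so this is illustrative rather than the exact published operationalization:

```python
from itertools import product

def four_cycle_coefficient(edges):
    """Ratio of closed 3-paths (four-cycles) to all 3-paths in a
    two-mode network given as (primary, secondary) edge pairs.
    Brute-force sketch; ordering conventions differ across packages."""
    edges = set(edges)
    primaries = {p for p, s in edges}
    secondaries = {s for p, s in edges}
    paths = closed = 0
    # A 3-path: p1 - s1 - p2 - s2 with the three required ties present.
    for p1, s1, p2, s2 in product(primaries, secondaries,
                                  primaries, secondaries):
        if p1 == p2 or s1 == s2:
            continue
        if (p1, s1) in edges and (p2, s1) in edges and (p2, s2) in edges:
            paths += 1
            if (p1, s2) in edges:  # the tie closing the four-cycle
                closed += 1
    return closed / paths

# In the complete bipartite network K_{2,2}, every 3-path is closed.
k22 = [(1, "a"), (1, "b"), (2, "a"), (2, "b")]
print(four_cycle_coefficient(k22))  # 1.0
```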

The paper proposes redefinitions of the global and local clustering coefficients for two-mode networks. The measures are defined around 4-paths or triplets of primary nodes in two-mode networks. Specifically, the global coefficient is defined as the number of 4-paths that are closed divided by the total number, while the local is similar but focused on 4-paths centred on the focal node. For more details, see the paper (Social Networks; arXiv) or the tnet documentation (tnet » Two-mode Networks » Clustering).

**Want to test it with your data?**

The clustering_tm and clustering_local_tm functions in tnet allow you to calculate the global and local clustering coefficients for two-mode networks (both binary and weighted) on your own dataset.

```r
# Load tnet
library(tnet)

# Load a sample network (Figure 3A of the paper)
net <- rbind(
  c(1,1), c(1,2),
  c(2,1), c(2,3),
  c(3,2), c(3,3),
  c(4,3))

# Calculate the global clustering coefficient
clustering_tm(net)

# Calculate the local clustering coefficient
clustering_local_tm(net)
```

**References**

Burt, R.S., 1992. Structural holes. Harvard University Press, Cambridge, MA.

Newman, M. E. J., 2001. Scientific collaboration networks. II. Shortest paths, weighted networks, and centrality. Physical Review E 64, 016132.

Opsahl, T., 2013. Triadic closure in two-mode networks: Redefining the global and local clustering coefficients. Social Networks 35, doi: 10.1016/j.socnet.2011.07.001.

Robins, G., Alexander, M., 2004. Small worlds among interlocking Directors: Network structure and distance in bipartite graphs. Computational and Mathematical Organization Theory 10 (1), 69–94.

Watts, D. J., Strogatz, S. H., 1998. Collective dynamics of “small-world” networks. Nature 393, 440-442.

If you use any of the information in this post, please cite: Opsahl, T., 2013. Triadic closure in two-mode networks: Redefining the global and local clustering coefficients. Social Networks 35, doi: 10.1016/j.socnet.2011.07.001.

- Reprogramme everything in (and learn) C++
- Get more resources

While the first solution might be the most appropriate one for repetitive tasks or production code, the second one might be quicker and easier for data scientists doing one-off analyses. It is possible to get more resources without buying a new server or more memory chips by using cloud computing. In essence, cloud computing allows analysts to rent resources when they need them. Amazon is one provider with the Elastic Compute Cloud (EC2). For example, it is possible to rent a server with 8 cores and 68.4 GB of memory for $2 per hour instead of buying a $5,000+ server.

The downside to using cloud computing is security. When I first started using R on Amazon EC2, I followed the instructions in Bioconductor in the cloud (thanks guys for maintaining an updated AMI and a great tutorial!). This tutorial shows how to set up an account, create public/private keys, change the firewall, launch a machine image with the latest version of R pre-installed, connect to the command line using Secure Shell (SSH), use R through the command-line interface, and use R through a web browser (RStudio). While the command-line interface is secure using SSH and the private key, the web interface is not secure (standard username, password, and port as well as non-encrypted traffic). This means that anyone who knows the hostname or IP address could log in to an R session. In fact, a port scan of the standard port across a range of IP addresses could allow a hacker to detect vulnerable servers and get access to their computing resources and data. In the rest of this post, I will borrow from Bioconductor in the cloud but suggest points to increase the security of using their machine image on Amazon EC2.

This tutorial is basic and highlights all the steps needed to get up and running. It is written using Windows and Putty, but should be applicable to people using other software with a bit of fudging. Before you start, please download PuTTY and PuTTYgen and save them in a convenient location. Note: These programmes do not need to be installed.

**Getting to the Management Console**

The first thing to do is to set up an Amazon account (yes, the same one as you buy books with), and register for Amazon Web Services (AWS). Then you need to go to the AWS Management Console. The link will be on the very top of the page when you are signed in with the Amazon id. The Management Console controls a number of AWS products, so you want to go to Elastic Compute Cloud by clicking on “Amazon EC2” on the horizontal menu.

**Public/private key pair**

The first thing to do in the Management Console is to create a public/private key pair. This is important because the private key acts as the “username and password” when accessing the server. This is done by clicking on “0 Key Pairs” under “My Resources” on the right, and then “Create Key Pair”. You need to give it an arbitrary name and then save the resulting pem-file somewhere safe on your computer.

Unlike other SSH programmes, PuTTY cannot read pem-files directly. They must be converted to ppk-files. This can be done using PuTTYgen. When running the programme, click on the Load button in the middle of the screen (do not use the top menu!). Change the file-type drop-down menu from “PuTTY Private Key Files (*.ppk)” to “All Files (*.*)”, and select the pem-file with the private key downloaded from the Management Console. Then, click on the “Save private key” button to save the private key as a ppk-file. You do not need to password protect the file if you store it in a location only you have access to.
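If you use OpenSSH (for example on Linux or Mac) instead of PuTTY, no conversion is needed as OpenSSH reads pem-files directly. A minimal sketch, assuming your key was saved as `my-key.pem` and using the example server address from later in this post:

```shell
# Restrict the key's permissions first, or ssh will refuse to use it
chmod 400 my-key.pem

# Connect as root (the user on this AMI), authenticating with the private key
ssh -i my-key.pem root@ec2-184-72-187-196.compute-1.amazonaws.com
```

Remember to replace the key filename and address with your own.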

**Firewall settings**

Communication to a server occurs through ports. A firewall is a mechanism for opening some ports and closing others. Part of maintaining a secure server is to open only the ports you need. The Bioconductor in the cloud-guide suggests that you open ports 22 and 8787. Port 22 is used to connect to the command line of the server using SSH. Port 8787 is the web interface of RStudio. While SSH traffic is encrypted, http traffic towards port 8787 is not. As such, this is a potential security vulnerability. I suggest that you only open port 22. Later on in this post, I will show how you can reach the web interface securely over port 22.

The firewall protecting servers on Amazon EC2 is controlled through the EC2 Dashboard’s Security Groups. A security group is a collection of instructions or rules. By default, there should be one security group called default. To see the details of this group, click on “1 Security Group” on the EC2 Dashboard and then click on “default”. The rules are listed under the “Inbound”-tab. If “22 (SSH)” is not listed, you need to open it. This is done by selecting SSH from the drop-down menu, clicking “+Add Rule”, and then clicking “Apply Rule Changes”. The servers with the default security settings will now be reachable on port 22.
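For reference, the same firewall rule can be added from the command line instead of the web console. A hedged sketch using the modern AWS CLI (which assumes the CLI is installed and configured with your credentials; the IP address is a placeholder). Restricting the source to your own IP address is even safer than opening port 22 to the whole internet:

```shell
# Open port 22 (SSH) in the default security group,
# but only for a single source IP address (replace with your own)
aws ec2 authorize-security-group-ingress \
    --group-name default \
    --protocol tcp \
    --port 22 \
    --cidr 203.0.113.10/32
```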

**Running a server**

Now you have completed all the one-off set-up tasks, and you are ready to launch a server or instance. By clicking on “Instances” on the left-side, you should see the instances that are running as well as be able to start new ones. Click on “Launch Instance” to get started, and select “Launch Classic Wizard”. There are five parts to this process:

*1: Choose an AMI*

The first question you are asked is which kind of software system or machine image (AMI) you want. There are a number of standard ones, but to save time and many lines of code, I will show you how to make use of Bioconductor in the cloud’s 64-bit Linux system with the latest version of R installed. To load this, select the Community AMIs-tab, enter ami-b5a079dc in the search box (R-2.15; check their website for a new AMI id when a new version of R is released), and then click the Select-button.

*2: Instance Details*

The next question you are asked is the resources you would like, and where you would like the server to be located (note that prices vary based on location with “US East (Virginia)” often being the cheapest). Please refer to Amazon’s current pricing table. At the time of writing, the Hi-Memory On-Demand Instances (Linux) cost the following:

| Instance | Processor Units | Memory | Price per hour |
|---|---|---|---|
| Extra Large | 2 cores / 6.5 ECUs | 17.1 GB | $0.50 |
| Double Extra Large | 4 cores / 13 ECUs | 34.2 GB | $1.00 |
| Quadruple Extra Large | 8 cores / 26 ECUs | 68.4 GB | $2.00 |

By clicking continue, you will be offered a number of more advanced options. The default values are ok. On the third screen, you are asked to give the instance a name (e.g., R-server).

*3: Create Key Pair*

You should already have completed this part, so you should see the arbitrary name chosen in a drop-down box and be able to just click Continue.

*4: Configure Firewall*

We have also already completed this step, so make sure the default group is selected and click Continue.

*5: Review*

On the final page, you are able to review all the settings. Below is an example of a 68.4 GB memory instance.

After hitting launch, a server will be allocated and the AMI will be loaded onto it. When it is complete, the status light will turn green and state “running”. Do note that you are being charged from this moment. See the final section for information on how to stop being charged.

**Connecting to the command line using SSH**

To control the server, you need to use SSH. The first thing we need to find out is the address of the server. This information is found by clicking on a running instance in the Instance-page of the EC2 Dashboard under “Public DNS”. An address will be similar to “ec2-184-72-187-196.compute-1.amazonaws.com”. Write this address down, or more easily, copy it to your clipboard. I will use this example address in the rest of the post; remember to change it to the address of your instance!

In this post, I am showing how PuTTY can be used for this; however, there are a number of other programmes out there that do the same thing. In PuTTY, we need to enter the address of the server and load the ppk-file created earlier with the private key. The screenshots below show the server’s address entered under Session and the private key loaded under SSH > Auth.

By clicking on Open, PuTTY connects to the server. The first time you connect to a server, you will have to accept its public key. You can check the fingerprint against the one listed on the Key Pairs-page of the EC2 Dashboard. When asked “login as:”, simply enter `root`

to get full privileges on the server.

**Running R and installing tnet using SSH**

When you have access to the command line, you can start R by simply typing `R`

and hitting enter. This version of R comes with the Bioconductor-packages. To install the latest version of tnet, you need to type `install.packages("tnet")`

After downloading and compiling tnet and its dependencies, you can load tnet by typing `library(tnet)`

**Connecting securely to RStudio Server**

There are certain limitations to using the command line interface with R. First, it does not allow for graphical output. Second, it is more cumbersome than the standard R for Windows GUI. To overcome these limitations, RStudio Server can be used. This software is a nice GUI for Linux servers running R, and is pre-installed on the Bioconductor AMI. If you opened port 8787 as the Bioconductor in the cloud-tutorial suggests, you could reach this interface by typing `http://ec2-184-72-187-196.compute-1.amazonaws.com:8787`

in a web browser (remember to replace the address with the one of your instance). However, as mentioned above, this leaves a large security hole open and allows others to “borrow” the resources you are paying for as well as being able to steal your data.

It is possible to communicate securely with RStudio Server using an SSH tunnel. An SSH tunnel is an encrypted wrapper for other internet traffic. It is possibly best described using a diagram:

In a standard connection, the web browser connects directly to the RStudio Server on port 8787 (e.g., `http://ec2-184-72-187-196.compute-1.amazonaws.com:8787`

). This traffic can be intercepted. Conversely, when using an SSH tunnel, the web browser connects to PuTTY, which encrypts the traffic and sends it to the SSH server, which decrypts it and sends it to RStudio Server. By not opening port 8787 in the Firewall, RStudio Server is only available to people logged on to the server.
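For OpenSSH users, the tunnel set-up collapses into a single command. A sketch assuming a key named `my-key.pem` and the example address used earlier in this post:

```shell
# -L forwards local port 8787 through the encrypted SSH connection
# to port 8787 on the server itself, where RStudio Server listens
ssh -i my-key.pem -L 8787:localhost:8787 root@ec2-184-72-187-196.compute-1.amazonaws.com
```

With the tunnel up, pointing your browser at `http://localhost:8787` reaches RStudio Server without port 8787 ever being opened in the firewall.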

To configure PuTTY to run an SSH tunnel, you need to follow the instructions for connecting to the command line. Additionally, you need to enter the following details under SSH > Tunnels:

- Source port: 8787
- Destination: localhost:8787

Do remember to click “Add”. The panel should look similar to this:

When you then connect to the instance (click Open and login), the SSH tunnel will be active. Congratulations: You can then open a web browser and type `http://localhost:8787`

to securely connect to the instance. The default username and password are ubuntu and bioc. In RStudio, you can install tnet by selecting it from the “Packages”-tab in the lower-right panel.

**A second less-secure alternative**

By looking at the length of this post, I do realise that there are quite a few steps to achieve a secure http connection with an Amazon EC2 instance. Although the above solution ensures that it is not possible to eavesdrop on the traffic between your computer and the EC2 instance, there is a simpler trick that should stop most people trying to log into your session: change the default password. You still need to connect to the command line using SSH. When you are there, you should write `passwd ubuntu`

to be prompted to enter a new password. Note that this procedure would require you to open port 8787 in addition to port 22 in the firewall (instead of selecting SSH from the drop-down menu, select “Custom TCP rule” and enter 8787 in the port range). Having said that, I do strongly encourage taking the extra step and using an SSH tunnel to ensure that your data and resources are safe.

**Stopping and Terminating**

As a final note, it is important to stop instances when you are done with them. Otherwise, you will continue to be charged! This is done by selecting an instance on the Instance-page of the EC2 Dashboard, and selecting Stop or Terminate from the Instance Actions drop-down box. Stop means that the server will be shut-down, but all the data and programmes on it will be saved on the Amazon Elastic Block Store (EBS; not free but quite cheap). A stopped instance can easily be restarted by choosing it and selecting Start from the Instance Actions drop-down box. Conversely, Terminate stops an instance without saving it to EBS. It is not possible to restart a terminated instance.
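The same stop and terminate actions are also scriptable. A hedged sketch using the modern AWS CLI, which was not around when the console-based steps above were written; the instance id is a placeholder:

```shell
# Stop the instance: compute charges stop, but the EBS volume (and your data) persists
aws ec2 stop-instances --instance-ids i-0abc1234def567890

# Terminate the instance: the instance is discarded for good and cannot be restarted
aws ec2 terminate-instances --instance-ids i-0abc1234def567890
```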

Alaska is a sparsely populated, isolated region with a disproportionately large, for its population size, number of airports. Most Alaskan airports have connections only to other Alaskan airports. This fact makes sense geographically. However, distance-wise, it also would make sense for some Alaskan airports to be connected to airports in Canada’s Northern Territories. These connections are, however, absent. Instead, a few Alaskan airports, singularly Anchorage, are connected to the continental U.S. The reason is clear: the Alaskan population needs to be connected to the political centers, which are located in the continental U.S., whereas there are political constraints making it difficult to have connections to cities in Canada, even to ones that are close geographically ([Guimera and Amaral, 2004]). It is now obvious why Anchorage’s centrality is so large. Indeed, the existence of nodes with anomalous centrality is related to the existence of regions with a high density of airports but few connections to the outside. The degree-betweenness anomaly is therefore ultimately related to the existence of communities in the network.

While many researchers and practitioners highlight this finding, I do not believe it is completely accurate. There are two reasons for this:

**Issue 1: Binary ties**

Admittedly this might be a personal bias as most of my work has been on weighted networks. Without going into much detail in this blog post, I actually strongly believe that if you assign the same importance to the connection between London Heathrow and New York’s JFK as you do to the connection between Pack Creek Airport and Sitka Harbor Sea Plane Base in Alaska (map), then there is a potential for measurement error. The table below lists the top ten airports in terms of betweenness when analyzing the binary and weighted (by passengers) versions of the Bureau of Transportation Statistics (BTS) Transtats data (Brandes, 2001). The code to replicate these results can be found at the end of this page.

| Rank | Airport (binary) | Betweenness (binary) | Airport (weighted) | Betweenness (weighted) |
|---|---|---|---|---|
| 1 | ANC (Anchorage, AK, USA) | 465272 | SEA (Seattle/Tacoma, WA, USA) | 834217 |
| 2 | FAI (Fairbanks, AK, USA) | 215503 | ANC (Anchorage, AK, USA) | 761834 |
| 3 | YYZ (Toronto, Canada) | 131562 | ATL (Atlanta, GA, USA) | 735628 |
| 4 | LAX (Los Angeles, CA, USA) | 129246 | LAX (Los Angeles, CA, USA) | 531980 |
| 5 | SEA (Seattle/Tacoma, WA, USA) | 125151 | ORD (Chicago, IL, USA) | 409001 |
| 6 | JFK (New York, NY, USA) | 124927 | DEN (Denver, CO, USA) | 314764 |
| 7 | HPN (White Plains, NY, USA) | 121096 | JFK (New York, NY, USA) | 247791 |
| 8 | MIA (Miami, FL, USA) | 120643 | MIA (Miami, FL, USA) | 206547 |
| 9 | DEN (Denver, CO, USA) | 120342 | BOS (Boston, MA, USA) | 168140 |
| 10 | MSP (Minneapolis, MN, USA) | 111188 | FAI (Fairbanks, AK, USA) | 157491 |

This table demonstrates that Anchorage has twice the betweenness of the runner-up, Fairbanks, Alaska, in the binary analysis. In the weighted analysis, Anchorage loses first place to Seattle and Fairbanks drops to 10th place. It is also worth noticing that the top-ten lists in both analyses contain almost exclusively US airports (Toronto being the sole exception), which leads me on to the second issue with using the BTS data: sample selection.

**Issue 2: Sample selection**

This issue affects all network studies, and it is something I have been interested in for a while. We define a population and analyse the connections among its members. For example, I have analysed the scientific collaboration network based on the papers uploaded to the arXiv preprint server (e.g., Opsahl et al., 2008) with the full knowledge that there are many more scientific publications out there, as well as other forms of collaboration and channels for knowledge flow among scientists, such as grant proposals and conference attendance. By simply restricting ourselves to data that is easy to collect (often stored in a central location / repository), the research is vulnerable to sample selection bias.

When it comes to airport networks, the Bureau of Transportation Statistics (BTS) Transtats data is straightforward to collect: go here, select what you want (Origin, Destination), and click Download! However, there is a small note on another page explaining the dataset: “*This table combines domestic and international market data reported by U.S. and foreign air carriers, and contains market data by carrier, origin and destination, and service class for enplaned passengers, freight, and mail. For a uniform end date for the combined databases, the last 3 months U.S. carrier domestic data released in T-100 Domestic Market (U.S. Carriers Only) are not included. Flights with both origin and destination in a foreign country are not included.*” It is the last line of this description that highlights the potential sample selection bias. While the data contain all US airports and all domestic flights, they only contain non-US flights that leave from or terminate at a US airport, and the non-US airports on the other end of these flights. As such, a section of the square adjacency matrix is missing (flights between pairs of non-US airports in the dataset), as well as the entire rows and columns for airports without flights to the US. To exemplify this bias, I have plotted the routes on a world map below.

While it is possible to see a concentration of routes in the US in the above picture, the sample selection becomes much more apparent when highlighting Europe. In the picture below, it is possible to see that no flights run between any pair of European cities, nor between any other pair of non-US points on the map. Would you have to transit at New York’s JFK to get from London to Barcelona? This gap highlights the need to look for more complete data sources than the Bureau of Transportation Statistics when analysing airport networks.

**Alternative data-source**

There are a couple of authoritative databases with world-wide airline routes. However, most of them are proprietary as they have enormous business intelligence potential and, as a consequence, are difficult to obtain. OAG Worldwide is one such database, and it should be noted that Guimera et al. (2004) jumped through the hoops to get this data, and therefore had a much more complete view of the airport network than if they had used the BTS Transtats data. While I do not have access to such a database, Openflights.org is a crowdsourced alternative. Although using this data comes without any guarantee, it has the potential to showcase the limitations of the BTS Transtats data. As a first step, I mapped the data to ensure there were no obvious pockets of missing data.

**Conclusion 1: Anchorage is not the most important airport**

As can be seen from this picture, there are no obvious areas without any form of airline traffic. To show how this data impacts a betweenness analysis, I have computed betweenness on both the binary and weighted (by number of routes, as passenger numbers were not available) versions of the network. As can be seen in the table below, major airports located around the globe get the highest scores in these analyses instead of only US airports. Specifically, Anchorage is only the third most central airport in the binary analysis, and the 14th most central in the weighted analysis. As such, it is still an important airport in the network, but maybe not the most important.

| Rank | Airport (binary) | Betweenness (binary) | Airport (weighted) | Betweenness (weighted) |
|---|---|---|---|---|
| 1 | FRA (Frankfurt, Germany) | 587531 | LHR (London, United Kingdom) | 1858349 |
| 2 | CDG (Paris, France) | 520707 | LAX (Los Angeles, United States) | 1310287 |
| 3 | ANC (Anchorage, United States) | 481044 | JFK (New York, United States) | 1084392 |
| 4 | DXB (Dubai, United Arab Emirates) | 443314 | BKK (Bangkok, Thailand) | 797785 |
| 5 | GRU (Sao Paulo, Brazil) | 402882 | SIN (Singapore) | 739981 |
| 6 | YYZ (Toronto, Canada) | 398869 | SEA (Seattle, United States) | 723145 |
| 7 | LHR (London, United Kingdom) | 389846 | MAD (Madrid, Spain) | 707354 |
| 8 | LAX (Los Angeles, United States) | 356600 | GRU (Sao Paulo, Brazil) | 684057 |
| 9 | DME (Moscow, Russia) | 353902 | NRT (Tokyo, Japan) | 639074 |
| 10 | BKK (Bangkok, Thailand) | 352682 | DXB (Dubai, United Arab Emirates) | 610765 |
| … | … | … | … | … |
| 14 | … | … | ANC (Anchorage, United States) | 469203 |
| 18 | … | … | FRA (Frankfurt, Germany) | 392418 |

**Conclusion 2: Finding the global superhub using a weighted approach**

London Heathrow is the most central airport when considering both tie weights and the global airport network. Unlike Anchorage, this is not a surprising finding, as it is the airport with the most international passengers (Airports Council International, 2011).

To further investigate the effects on the ranking when considering tie weights in the global airport networks, I considered the change in ranking of the two airports ranked first in the binary and weighted analyses, Frankfurt and London Heathrow. Frankfurt went from having the highest betweenness in the binary analysis to only having 18th highest betweenness in the weighted analysis. Conversely, London Heathrow went from having the seventh highest to the highest betweenness score. To look into this cross-over of rankings, I compared the degree (number of airports with direct flights) and strength (number of direct routes) from these two airports:

| Airport | Degree | Node strength | w=1 | w=2 | w=3 | w=4 | w=5 |
|---|---|---|---|---|---|---|---|
| FRA (Frankfurt, Germany) | 237 | 349 | 142 | 82 | 9 | 4 | 0 |
| LHR (London, United Kingdom) | 157 | 288 | 71 | 55 | 22 | 4 | 5 |

This table shows that Frankfurt has direct flights to 51% more airports than London Heathrow, but only 21% more routes. The variation in tie weights can be further investigated by looking at the weight distribution. While there are only four airports with four direct routes from Frankfurt, there are nine airports with four or five direct routes from London Heathrow.

Moreover, by looking at which airports have strong ties (i.e., tie weights greater than or equal to 4) with Frankfurt and London Heathrow, it is possible to see that the geographical distributions are strikingly different. Frankfurt has four direct routes to each of Antalya (Turkey), Madrid (Spain), Mallorca (Spain), and Vienna (Austria), which sum to 5,597 kilometres (average: 1,399km). Conversely, London Heathrow has five routes to each of Delhi (India), Dubai (UAE), Hong Kong (China), Los Angeles (LAX, USA), and New York City (JFK, USA) and four routes to each of Bangkok (Thailand), Mumbai (India), Boston (USA), and Miami (USA), which sum to 65,376 kilometres (average: 7,264km). By having strong ties to geographically distant instead of close airports, London Heathrow acts as an intercontinental hub rather than a continental hub. Additionally, the airports with strong ties to London Heathrow have high betweenness themselves, and therefore act as hubs in their respective regions. As such, London Heathrow can be seen as the global hub of the world-wide airport network.

**References**

Airports Council International, 2011. Year to date International Passenger Traffic, Apr-2011, accessed August 12, 2011.

Brandes, U., 2001. A Faster Algorithm for Betweenness Centrality. Journal of Mathematical Sociology 25, 163-177.

Bureau of Transportation Statistics, 2011. Air Carrier Statistics (Form 41 Traffic): T-100 Market (All Carriers), accessed August 12, 2011.

Guimera, R., Amaral, L. A. N., 2004. Modeling the world-wide airport network. The European Physical Journal B 38, 381–385.

Guimera, R., Mossa, S., Turtschi, A., Amaral, L. A. N., 2004. The worldwide air transportation network: Anomalous centrality, community structure, and cities’ global roles. Proceedings of the National Academy of Sciences 102(22), 7794-7799.

Openflights.org, 2011. Airport, airline and route data, accessed August 12, 2011.

Opsahl, T., Agneessens, F., Skvoretz, J., 2010. Node centrality in weighted networks: Generalizing degree and shortest paths. Social Networks 32 (3), 245-251.

Opsahl, T., Colizza, V., Panzarasa, P., Ramasco, J. J., 2008. Prominence and control: The weighted rich-club effect. Physical Review Letters 101 (168702).

If you use any of the information in this post, please cite: Opsahl, T., Agneessens, F., Skvoretz, J., 2010. Node centrality in weighted networks: Generalizing degree and shortest paths. Social Networks 32 (3), 245-251.

I would like to thank Bernie Hogan for helping to develop the idea behind this post.

**Code used to create the results in this blog post**

Below is the code to redo the analysis in this post. You need to have the R-packages geosphere, maps, and tnet installed before running the code. You also need to download the Bureau of Transportation Statistics (BTS) Transtats data. Please see the notes in the code.

```r
###################################
## US Airport network (BTS data) ##
###################################

# Load BTS Transtats data
# Downloaded from http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=292 using
# Filters: Geography=all; Year=2010; Months=all
# Columns: Passengers, Origin, OriginCountryName, Dest, DestCountryName
BTS <- read.csv("data/344989982_T_T100_MARKET_ALL_CARRIER.csv", header=TRUE, stringsAsFactors=FALSE)
BTS <- BTS[,c("ORIGIN", "ORIGIN_COUNTRY_NAME", "DEST", "DEST_COUNTRY_NAME", "PASSENGERS")]

# Load airport information (incl. geolocations)
# Downloaded from http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=288 (select all columns)
BTSairports <- read.csv("data/344990073_T_MASTER_CORD.csv", stringsAsFactors=FALSE)

# Replace airport codes with id numbers (net1)
net1 <- BTS
net1.labels <- unique(c(net1[,"ORIGIN"], net1[,"DEST"]))
net1.labels <- net1.labels[order(net1.labels)]
net1[,"ORIGIN"] <- factor(x=net1[,"ORIGIN"], levels=net1.labels)
net1[,"DEST"] <- factor(x=net1[,"DEST"], levels=net1.labels)
net1 <- data.frame(i=as.integer(net1[,"ORIGIN"]), j=as.integer(net1[,"DEST"]), w=net1[,"PASSENGERS"])

# Add up duplicated entries (multiple routes)
net1 <- net1[order(net1[,"i"], net1[,"j"]),]
index <- !duplicated(net1[,c("i","j")])
net1 <- data.frame(net1[index,c("i","j")], w=tapply(net1[,"w"], cumsum(index), sum))

# Take out routes with no passengers (cargo)
net1 <- net1[net1[,"w"]>0,]

# Take out routes from an airport to itself
net1 <- net1[net1[,"i"]!=net1[,"j"],]

# Load tnet and the network as a tnet object
library(tnet)
net1 <- as.tnet(net1, type="weighted one-mode tnet")

# Calculate binary and weighted betweenness
tmp0 <- betweenness_w(net1, alpha=0)
tmp1 <- betweenness_w(net1, alpha=1)

# Create output object with top x airports
x <- 10
out <- data.frame(
  tmp0[order(-tmp0[,"betweenness"]),][1:x,],
  tmp1[order(-tmp1[,"betweenness"]),][1:x,])
dimnames(out)[[2]] <- c("BTS.bb.node", "BTS.bb.score", "BTS.wb.node", "BTS.wb.score")

# Shorten country name for display
BTSairports[BTSairports[,"TR_COUNTRY_NAME"]=="United States of America","TR_COUNTRY_NAME"] <- "USA"
for(i in 1:x) {
  # Insert label of airport ID (binary)
  tmp2 <- net1.labels[as.integer(out[i,"BTS.bb.node"])][1]
  tmp2 <- BTSairports[BTSairports[,"AIRPORT"]==tmp2,][1,]
  out[i,"BTS.bb.node"] <- paste(tmp2["AIRPORT"], " (", tmp2["TR_CITY_NAME"], ", ", tmp2["TR_COUNTRY_NAME"], ")", sep="")
  # Insert label of airport ID (weighted)
  tmp2 <- net1.labels[as.integer(out[i,"BTS.wb.node"])][1]
  tmp2 <- BTSairports[BTSairports[,"AIRPORT"]==tmp2,][1,]
  out[i,"BTS.wb.node"] <- paste(tmp2["AIRPORT"], " (", tmp2["TR_CITY_NAME"], ", ", tmp2["TR_COUNTRY_NAME"], ")", sep="")
}

#########################
## Plot the US network ##
#########################
# Based on FlowingData's blog post (http://flowingdata.com/2011/05/11/how-to-map-connections-with-great-circles/) Thanks Nathan!

# Load required packages (type ?install.packages if you get an error)
library(maps)
library(geosphere)

# Symmetrise the network to get the correct tie weight for visualisation
# as the two directed ties are plotted on top of each other
net1s <- as.data.frame(symmetrise_w(net1, method="SUM"))
net1s <- net1s[net1s[,"i"]<net1s[,"j"],]

# Put labels back in summed up network (i.e., no duplicates)
net1s[,"i"] <- net1.labels[net1s[,"i"]]
net1s[,"j"] <- net1.labels[net1s[,"j"]]

# Sort data so that weak ties are plotted first
net1s <- net1s[order(net1s[,"w"]),]

# Set up world map and colors for lines
pdf("airport_BTS_plot.pdf", width=11, height=7)
map("world", col="#eeeeee", fill=TRUE, bg="white", lwd=0.05)
pal <- colorRampPalette(c("#cccccc", "black"))
colors <- pal(length(unique(net1s[,"w"])))
colors <- rep(colors, times=as.integer(table(net1s[,"w"])))

# Plot ties
for(i in 1:nrow(net1s)) {
  # Get longitude and latitude of the two airports
  tmp1 <- BTSairports[BTSairports["AIRPORT"]==net1s[i,"i"],c("LONGITUDE","LATITUDE")][1,]
  tmp2 <- BTSairports[BTSairports["AIRPORT"]==net1s[i,"j"],c("LONGITUDE","LATITUDE")][1,]
  # Get the geographical distance to see how many points on the Great Circle to plot
  tmp3 <- 10*ceiling(as.numeric(log(3963.1 * acos((sin(tmp1[2]/(180/pi))*sin(tmp2[2]/(180/pi)))+(cos(tmp1[2]/(180/pi))*cos(tmp2[2]/(180/pi))*cos(tmp1[1]/(180/pi)-tmp2[1]/(180/pi)))))))
  # Line coordinates
  inter <- gcIntermediate(tmp1, tmp2, n=round(tmp3), addStartEnd=TRUE, breakAtDateLine=TRUE)
  # Plot one line if the line does not cross the date line; two if so
  if(is.matrix(inter)) {
    lines(inter, col=colors[i], lwd=0.6)
  } else {
    for(j in 1:length(inter)) lines(inter[[j]], col=colors[i], lwd=0.6)
  }
}
dev.off()

##########################
## Openflights.org data ##
##########################

# Download airport geolocations from openflights.org/data.html, and set column headings
# "http://openflights.svn.sourceforge.net/viewvc/openflights/openflights/data/airports.dat"
OFairports <- read.csv("data/airports.dat", header=FALSE, stringsAsFactors=FALSE)
dimnames(OFairports)[[2]] <- c("Airport ID", "Name", "City", "Country", "IATA/FAA", "ICAO", "Latitude", "Longitude", "Altitude", "Timezone", "DST")

# Download routes from openflights.org/data.html
# "http://openflights.svn.sourceforge.net/viewvc/openflights/openflights/data/routes.dat"
OF <- read.csv("data/routes.dat", header=FALSE, stringsAsFactors=FALSE)
dimnames(OF)[[2]] <- c("Airline", "Airline ID", "Source airport", "Source airport ID", "Destination airport", "Destination airport ID", "Codeshare", "Stops", "Equipment")

# Remove code-shares as these are duplicated entries
net2 <- OF[OF[,"Codeshare"]=="",c("Source airport ID", "Destination airport ID")]

# Take out routes from an airport to itself and the missing cases (~1%)
net2 <- net2[net2[,"Source airport ID"]!=net2[,"Destination airport ID"],]
net2 <- net2[net2[,"Source airport ID"]!="\\N",]
net2 <- net2[net2[,"Destination airport ID"]!="\\N",]

# As passengers per route is not available, create a weighted network
# with the weight equal to number of routes
net2 <- data.frame(i=as.integer(net2[,"Source airport ID"]), j=as.integer(net2[,"Destination airport ID"]))
net2 <- shrink_to_weighted_network(net2)

##################################
## Plot the OpenFlights network ##
##################################

# Symmetrise data for visualisation
net2s <- as.data.frame(symmetrise_w(net2, method="SUM"))
net2s <- net2s[net2s[,"i"]<net2s[,"j"],]

# Sort data so that weak ties are plotted first
net2s <- net2s[order(net2s[,"w"]),]

# Set up world map and colors for lines
pdf("airport_OF_plot.pdf", width=11, height=7)
map("world", col="#eeeeee", fill=TRUE, bg="white", lwd=0.05)
pal <- colorRampPalette(c("#cccccc", "black"))
colors <- pal(length(unique(net2s[,"w"])))
colors <- rep(colors, times=as.integer(table(net2s[,"w"])))

# Plot ties
for(i in 1:nrow(net2s)) {
  # Get longitude and latitude of the two airports
  tmp1 <- as.numeric(OFairports[OFairports["Airport ID"]==net2s[i,"i"],c("Longitude","Latitude")][1,])
  tmp2 <- as.numeric(OFairports[OFairports["Airport ID"]==net2s[i,"j"],c("Longitude","Latitude")][1,])
  # Get the geographical distance to see how many points on the Great Circle to plot
  tmp3 <- 10*ceiling(as.numeric(log(3963.1 * acos((sin(tmp1[2]/(180/pi))*sin(tmp2[2]/(180/pi)))+(cos(tmp1[2]/(180/pi))*cos(tmp2[2]/(180/pi))*cos(tmp1[1]/(180/pi)-tmp2[1]/(180/pi)))))))
  # Line coordinates
  inter <- gcIntermediate(tmp1, tmp2, n=round(tmp3), addStartEnd=TRUE, breakAtDateLine=TRUE)
  # Plot one line if the line does not cross the date line; two if so
  if(is.matrix(inter)) {
    lines(inter, col=colors[i], lwd=0.6)
  } else {
    for(j in 1:length(inter)) lines(inter[[j]], col=colors[i], lwd=0.6)
  }
}
dev.off()

#####################################
## Analyse the OpenFlights network ##
#####################################

# Calculate binary and weighted betweenness (on the directed network, net2)
tmp0 <- betweenness_w(net2, alpha=0)
tmp1 <- betweenness_w(net2, alpha=1)

# Create output object with top x airports
out <- data.frame(out,
  tmp0[order(-tmp0[,"betweenness"]),][1:x,],
  tmp1[order(-tmp1[,"betweenness"]),][1:x,])
dimnames(out)[[2]][5:8] <- c("OF.bb.node", "OF.bb.score", "OF.wb.node", "OF.wb.score")
for(i in 1:x) {
  # Insert label of airport ID (binary)
  tmp2 <- OFairports[OFairports[,"Airport ID"]==out[i,"OF.bb.node"],]
  out[i,"OF.bb.node"] <- paste(tmp2["IATA/FAA"], " (", tmp2["City"], ", ", tmp2["Country"], ")", sep="")
  # Insert label of airport ID (weighted)
  tmp2 <- OFairports[OFairports[,"Airport ID"]==out[i,"OF.wb.node"],]
  out[i,"OF.wb.node"] <- paste(tmp2["IATA/FAA"], " (", tmp2["City"], ", ", tmp2["Country"], ")", sep="")
}

###########################
## Comparing FRA and LHR ##
###########################

# Get FRA and LHR's airport ids
ids <- sapply(c("FRA", "LHR"), function(a) OFairports[OFairports[,"IATA/FAA"]==a,"Airport ID"])

# Rank and Score of FRA
tmp1 <- as.data.frame(tmp1[order(-tmp1[,"betweenness"]),])
tmp1[tmp1[,"node"]==ids["FRA"],]

# Degree and Node strength
tmp3 <- degree_w(net2)
tmp3[ids,]

# Weight distribution
sapply(ids, function(a) table(net2[net2[,"i"]==a,3]))

# Airports with strong ties (w>=4)
tmp4 <- lapply(ids, function(a) data.frame(net2[net2[,"i"]==a & net2[,"w"]>=4,], label="", geo.dist=NaN, stringsAsFactors=FALSE))

# Insert labels
for(a in 1:2) {
  for(b in 1:nrow(tmp4[[a]])) {
    tmp2 <- OFairports[OFairports[,"Airport ID"]==tmp4[[a]][b,"j"],][1,]
    tmp4[[a]][b, "label"] <- paste(tmp2["IATA/FAA"], " (", tmp2["City"], ", ", tmp2["Country"], ")", sep="")
  }
}

# Geographical distance
for(a in 1:2) {
  tmp5 <- as.numeric(OFairports[OFairports["Airport ID"]==ids[a],c("Longitude","Latitude")][1,])
  for(b in 1:nrow(tmp4[[a]])) {
    tmp6 <- as.numeric(OFairports[OFairports["Airport ID"]==tmp4[[a]][b,"j"],c("Longitude","Latitude")][1,])
    tmp4[[a]][b, "geo.dist"] <- 6378.7 * acos((sin(tmp5[2]/(180/pi))*sin(tmp6[2]/(180/pi)))+(cos(tmp5[2]/(180/pi))*cos(tmp6[2]/(180/pi))*cos(tmp5[1]/(180/pi)-tmp6[1]/(180/pi))))
  }
}
sapply(1:2, function(a) mean(tmp4[[a]][,"geo.dist"]))
sapply(1:2, function(a) sum(tmp4[[a]][,"geo.dist"]))
```

$$k_i = \sum_j^N x_{ij}$$

where *j* represents all other nodes, *N* is the total number of nodes, and *x* is the adjacency matrix, in which the cell $x_{ij}$ is defined as 1 if node *i* is connected to node *j*, and 0 otherwise.

Degree has generally been extended to the sum of weights when analysing weighted networks, and labelled node strength (Barrat et al., 2004). This measure can be formalised as follows:

$$s_i = \sum_j^N w_{ij}$$

where *w* is the weighted adjacency matrix, in which $w_{ij}$ is greater than 0 if node *i* is connected to node *j*, and its value represents the weight of the tie. This is equal to the definition of degree if the network is binary, i.e. each tie has a weight of 1. Conversely, in weighted networks, the outcomes of these two measures differ. Since node strength takes the weights of ties into consideration, it has been the preferred measure for analysing weighted networks (e.g., Barrat et al., 2004; Opsahl et al., 2008).

To combine both the number of ties and the tie weights, Opsahl et al. (2010) proposed the following generalisation of degree centrality:

$$C_D^{w\alpha}(i) = k_i \times \left(\frac{s_i}{k_i}\right)^{\alpha}$$

where $\alpha$ is a positive tuning parameter that controls the relative importance of the number of ties and the sum of tie weights. Specifically, there are two benchmark values (0 and 1), and if the parameter is set to either of these values, an existing measure is reproduced. If the parameter is set to the benchmark value of 0, the outcome of the measure is based solely on the number of ties, and is equal to the one found when applying Freeman’s (1978) measure to a binary version of the network in which all ties with a weight greater than 0 are set to present. Conversely, if the value of the parameter is 1, the outcome of the measure is based on tie weights only, and is identical to the already-proposed generalisation of degree (Barrat et al., 2004). For other values of $\alpha$, alternative outcomes are attained, which are based on both the number of ties and tie weights. In particular, two ranges of values can be distinguished. First, a parameter set between 0 and 1 positively values both the number of ties and tie weights. This implies that increases in either node degree or node strength will increase the outcome. Second, if the value of the parameter is above 1, the measure positively values tie strength and negatively values the number of ties. Nodes with, on average, stronger ties will then attain a higher score.

All of the above measures are insensitive to variation in tie weights. For example, the two nodes, A and B, in this diagram have the same number of connections and the same node strength, and they attain the same score using the second generalisation, as it is a function of degree and node strength only. While the closeness and betweenness measures proposed in Opsahl et al. (2010) are sensitive to variation in tie weights, the degree measure was designed not to be. However, a measure closely related to the closeness and betweenness measures that is sensitive to tie-weight differences can be defined as follows:

$$\sum_j^N \left(w_{ij}\right)^{\alpha}$$

By exponentiating each tie weight instead of the average tie weight, the measure becomes sensitive to variation in tie weights. For example, node A and node B would get the following scores using the various measures:

| Measure | Node A | Node B |
|---|---|---|
| Freeman’s | 2 | 2 |
| Barrat et al.’s | 4 | 4 |
| Opsahl et al.’s, alpha=0.5 | 2.83 | 2.83 |
| Opsahl et al.’s, alpha=1.5 | 5.66 | 5.66 |
| New measure, alpha=0.5 | 2.83 | 2.73 |
| New measure, alpha=1.5 | 5.66 | 6.20 |

As the table shows, the new measure is closely linked to the generalisation proposed by Opsahl et al. (2010); however, when the tie weights differ, the outcomes of the new measure vary between the two nodes. As in the other centrality measures with a tuning parameter, the tuning parameter in this measure controls the relative importance of the number of ties and the sum of tie weights. In addition, it also controls whether variation in tie weights is discounted or favoured. A parameter between 0 and 1 discounts variation, whereas a parameter above 1 increases the outcome of the measure when tie weights differ.
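The table entries can be reproduced in a couple of lines of base R. The tie weights used here — two ties of weight 2 for node A, and ties of weight 1 and 3 for node B — are read off the sample network used in this post, so both nodes have degree 2 and strength 4:

```r
wA <- c(2, 2)  # node A's tie weights
wB <- c(1, 3)  # node B's tie weights (same degree and strength as A)

# Opsahl et al.'s combined measure: degree^(1-alpha) * strength^alpha
round(length(wA)^(1 - 0.5) * sum(wA)^0.5, 2)  # 2.83 (identical for node B)

# New measure: sum of tie weights, each raised to alpha
round(c(A = sum(wA^0.5), B = sum(wB^0.5)), 2)  # A: 2.83, B: 2.73
round(c(A = sum(wA^1.5), B = sum(wB^1.5)), 2)  # A: 5.66, B: 6.2
```

With alpha=1.5 the new measure rewards node B's strong tie of weight 3, pushing it above node A despite equal strength.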

**Want to try it with your data?**

Below is the code to calculate the proposed degree measure. You need to have the R package tnet installed before running the code.

```r
# Load tnet
library(tnet)

# Load a function to calculate the new measures
degree2_w <- function(net, type="out", alpha=1) {
  net <- as.tnet(net, type="weighted one-mode tnet")
  if (type == "in") {
    net <- data.frame(i=net[,2], j=net[,1], w=net[,3])
    net <- net[order(net[,"i"], net[,"j"]),]
  }
  index <- cumsum(!duplicated(net[,1]))
  k.list <- cbind(unique(net[,1]), NaN, NaN, NaN)
  dimnames(k.list)[[2]] <- c("node", "degree", "output", "alpha")
  k.list[,"degree"] <- tapply(net[,"w"], index, length)
  k.list[,"output"] <- tapply(net[,"w"], index, sum)
  net[,"w"] <- net[,"w"]^alpha
  k.list[,"alpha"] <- tapply(net[,"w"], index, sum)
  if (max(net[,c("i","j")]) != nrow(k.list)) {
    k.list <- rbind(k.list, cbind(1:max(net[,c("i","j")]), 0, 0, 0))
    k.list <- k.list[order(k.list[,"node"]),]
    k.list <- k.list[!duplicated(k.list[,"node"]),]
  }
  return(k.list)
}

# Load a sample network
net <- cbind(
  i=c(1,1,2,2),
  j=c(2,3,1,3),
  w=c(2,2,1,3))

# Calculate the measures
degree_w(net, measure=c("degree","output","alpha"), alpha=1.5)
degree_w(net, measure=c("degree","output","alpha"), alpha=0.5)
degree2_w(net, alpha=0.5)
degree2_w(net, alpha=1.5)
```

**References**

Barrat, A., Barthelemy, M., Pastor-Satorras, R., Vespignani, A., 2004. The architecture of complex weighted networks. Proceedings of the National Academy of Sciences 101 (11), 3747-3752.

Freeman, L. C., 1978. Centrality in social networks: Conceptual clarification. Social Networks 1, 215-239.

Opsahl, T., Agneessens, F., Skvoretz, J. (2010). Node centrality in weighted networks: Generalizing degree and shortest paths. Social Networks 32, 245-251.

Opsahl, T., Colizza, V., Panzarasa, P., Ramasco, J. J., 2008. Prominence and control: The weighted rich-club effect. Physical Review Letters 101 (168702).

Wasserman, S., Faust, K., 1994. Social Network Analysis: Methods and Applications. Cambridge University Press, New York, NY.

If you use any of the information in this post, please cite: Opsahl, T., Agneessens, F., Skvoretz, J., 2010. Node centrality in weighted networks: Generalizing degree and shortest paths. Social Networks 32 (3), 245-251

]]>The main foundation of the paper is a gender representation law that required all public limited companies to compose their boards with at least 40% of each gender by January 2008. The paper attempted to strike a balance between the urgency of studying the gender representation law and the amount of data available (the analysis relied on data from August 2009). The supporting website tries to alleviate this tension by providing up-to-date data for the analysis conducted in the paper.

The paper gathered news coverage in the major Norwegian newspapers; online versions are available through e24.no, dn.no, and Der Spiegel.

**Abstract**

Governments have implemented various affirmative action policies to address vertical sex segregation in organizations. A gender representation law was introduced in Norway, which required public limited companies’ boards to have at least 40 percent representation of each sex by 2008. This law acted as an external shock, and this paper aims to explore its effects. In particular, it explores the gender bias, the emergence and sex of prominent directors, and directors’ social capital. We utilize data from May 2002 to August 2009 to analyze these aspects. The implied intention of the law was to create a larger pool of women acting as directors on boards, and the law has had the effect of increasing the representation of women on boards. However, it has also created a small elite of women directors who rank among the top on a number of proxies of influence.

If you use any of the information in this post, please cite: Seierstad, C., Opsahl, T., 2011. For the few not the many? The effects of affirmative action on presence, prominence, and social capital of women directors in Norway. Scandinavian Journal of Management 27 (1), 44-54

]]>**Abstract**

Ties often have a strength naturally associated with them that differentiate them from each other. Tie strength has been operationalized as weights. A few network measures have been proposed for weighted networks, including three common measures of node centrality: degree, closeness, and betweenness. However, these generalizations have solely focused on tie weights, and not on the number of ties, which was the central component of the original measures. This paper proposes generalizations that combine both these aspects. We illustrate the benefits of this approach by applying one of them to Freeman’s EIES dataset.

**Motivation**

The three measures have already been generalised to weighted networks. Barrat et al. (2004) generalised degree by taking the sum of weights instead of the number of ties, while Newman (2001) and Brandes (2001) utilised Dijkstra’s (1959) algorithm of shortest paths to generalise closeness and betweenness, respectively. Dijkstra’s algorithm defines the length of a path as the sum of the costs of its ties (e.g., time in GPS calculations), where the cost of a tie is generally defined as the inverse of its weight. All these generalisations fail to take into account the main feature of the original measures formalised by Freeman (1978): the number of ties.
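As a minimal sketch of this inversion (with assumed toy weights, not data from any of the cited papers), the Dijkstra-style cost of a two-tie path is:

```r
# Two ties on a path, with hypothetical weights 4 and 2
w <- c(4, 2)
cost <- 1 / w  # Dijkstra-style costs: the inverse of each tie weight
sum(cost)      # path length: 0.25 + 0.5 = 0.75
```

Note that stronger ties (higher weights) yield lower costs, so shortest paths preferentially run through strong ties.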

This limitation is highlighted for degree centrality by the three ego networks from Freeman’s third EIES network. The three nodes have sent roughly the same number of messages, but to quite different numbers of others. If Freeman’s (1978) original measure were applied, the centrality score of the node in panel A would be almost five times as high as that attained by the node in panel C. However, when using Barrat et al.’s generalisation, they get roughly the same score.

This article proposes a new generation of node centrality measures for weighted networks. This second generation of measures takes into consideration both the weight of ties and the number of ties, and the relative importance of these two aspects is controlled by a tuning parameter.

**Want to test it with your data?**

The degree_w, closeness_w, and betweenness_w functions in tnet allow you to calculate the binary, the weighted, and the combined measures on your own dataset.

For example, to calculate the second-generation node centrality measures (alpha = 0.5) on the sample network above, you can run the code below in R. The degree function easily calculates the binary and first-generation measures as well; however, this is not the case for the closeness and betweenness functions. If you would like the binary version, you can either use the dichotomise function or set alpha=0. If you would like the first-generation weighted measures, you can set alpha=1 (the default value).

```r
# Load tnet
library(tnet)

# Load network
net <- cbind(
  i=c(1,1,2,2,2,2,3,3,4,5,5,6),
  j=c(2,3,1,3,4,5,1,2,2,2,6,5),
  w=c(4,2,4,4,1,2,2,4,1,2,1,1))

# Calculate degree centrality (note that alpha is included in the list of measures)
degree_w(net, measure=c("degree", "output", "alpha"), alpha=0.5)

# Calculate closeness centrality
closeness_w(net, alpha=0.5)

# Calculate betweenness centrality
betweenness_w(net, alpha=0.5)
```

To test it on Freeman’s third EIES network from the datasets-page and recreate Table 3 of the paper, you can do the following:

```r
# Load tnet
library(tnet)

# Load network
data(Freemans.EIES)
net <- Freemans.EIES.net.3.n32

# Calculate measures
tmp <- data.frame(
  Freemans.EIES.node.Name.n32,
  degree_w(net, measure=c("degree", "output", "alpha"), alpha=0.5),
  degree_w(net, measure="alpha", alpha=1.5)[,"alpha"],
  stringsAsFactors=FALSE)
dimnames(tmp)[[2]] <- c("name", "node", "a00", "a10", "a05", "a15")
tmp <- tmp[,c("name","a00","a05","a10","a15")]

# Merge names and order table
out <- data.frame(
  seq.int(nrow(tmp)),
  tmp[order(-tmp[,"a00"], -tmp[,"a10"]),c("name", "a00")],
  tmp[order(-tmp[,"a05"], -tmp[,"a10"]),c("name", "a05")],
  tmp[order(-tmp[,"a10"], -tmp[,"a10"]),c("name", "a10")],
  tmp[order(-tmp[,"a15"], -tmp[,"a10"]),c("name", "a15")])
dimnames(out)[[2]] <- c("Rank", "a00.name","a00", "a05.name","a05", "a10.name","a10", "a15.name","a15")

# Display table
out
```

**References**

Brandes, U., 2001. A Faster Algorithm for Betweenness Centrality. Journal of Mathematical Sociology 25, 163-177.

Dijkstra, E. W., 1959. A note on two problems in connexion with graphs. Numerische Mathematik 1, 269-271.

Freeman, L. C., 1978. Centrality in social networks: Conceptual clarification. Social Networks 1, 215-239.

Newman, M. E. J., 2001. Scientific collaboration networks. II. Shortest paths, weighted networks, and centrality. Physical Review E 64, 016132.

Opsahl, T., Agneessens, F., Skvoretz, J. (2010). Node centrality in weighted networks: Generalizing degree and shortest paths. Social Networks 32, 245-251.

If you use any of the information in this post, please cite: Opsahl, T., Agneessens, F., Skvoretz, J., 2010. Node centrality in weighted networks: Generalizing degree and shortest paths. Social Networks 32 (3), 245-251

]]>This network gives a concrete example of the closeness measure. The distance between node G and node H is infinite as a direct or indirect path does not exist between them (i.e., they belong to separate components). As long as at least one node is unreachable by the others, the sum of distances to all other nodes is infinite. As a consequence, researchers have limited the closeness measure to the largest component of nodes (i.e., measured intra-component). The distance matrix for the nodes in the sample network is:

(“all” = all-inclusive; “intra” = intra-component)

| Node | A | B | C | D | E | F | G | H | I | J | K | Farness (all) | Closeness (all) | Farness (intra) | Closeness (intra) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A | … | 1 | 1 | 2 | 2 | 3 | 3 | Inf | Inf | Inf | Inf | Inf | 0 | 12 | 0.08 |
| B | 1 | … | 1 | 2 | 1 | 2 | 3 | Inf | Inf | Inf | Inf | Inf | 0 | 10 | 0.10 |
| C | 1 | 1 | … | 1 | 2 | 2 | 2 | Inf | Inf | Inf | Inf | Inf | 0 | 9 | 0.11 |
| D | 2 | 2 | 1 | … | 2 | 1 | 1 | Inf | Inf | Inf | Inf | Inf | 0 | 9 | 0.11 |
| E | 2 | 1 | 2 | 2 | … | 1 | 3 | Inf | Inf | Inf | Inf | Inf | 0 | 11 | 0.09 |
| F | 3 | 2 | 2 | 1 | 1 | … | 2 | Inf | Inf | Inf | Inf | Inf | 0 | 11 | 0.09 |
| G | 3 | 3 | 2 | 1 | 3 | 2 | … | Inf | Inf | Inf | Inf | Inf | 0 | 14 | 0.07 |
| H | Inf | Inf | Inf | Inf | Inf | Inf | Inf | … | 1 | 2 | Inf | Inf | 0 | 3 | 0.33 |
| I | Inf | Inf | Inf | Inf | Inf | Inf | Inf | 1 | … | 1 | Inf | Inf | 0 | 2 | 0.50 |
| J | Inf | Inf | Inf | Inf | Inf | Inf | Inf | 2 | 1 | … | Inf | Inf | 0 | 3 | 0.33 |
| K | Inf | Inf | Inf | Inf | Inf | Inf | Inf | Inf | Inf | Inf | … | Inf | 0 | 0 | Inf |

Although the intra-component closeness scores are not infinite for all the nodes in the network, it would be inaccurate to use them as a closeness measure. This is due to the fact that the sums of distances would contain different numbers of distances (e.g., there are two distances from node H to the other nodes in its component, while there are six distances from node G to the other nodes in its component). In fact, nodes in smaller components would generally be seen as closer to others than nodes in larger components. Thus, researchers have focused solely on the largest component. However, this leads to a number of methodological issues, including sample selection.

To develop this measure, I went back to the original equation:

$$C(i) = \left[ \sum_{j}^{N} d(i,j) \right]^{-1}$$

where $i$ is the focal node, $j$ is another node in the network, and $d(i,j)$ is the shortest distance between these two nodes. In this equation, the distances are summed before being inversed, and when the sum includes an infinite distance, the outcome is infinite. To overcome this issue while staying consistent with the existing measure of closeness, I took advantage of the fact that the limit of a number divided by infinity is zero. Although infinity is not an exact number, the inverse of a very high number is very close to 0. In fact, 0 is returned if you enter 1/Inf in the statistical programme *R*. By taking advantage of this feature, it is possible to rewrite the closeness equation as the *sum of inversed distances* to all other nodes instead of the *inverse of the sum* of distances to all other nodes. The equation would then be:

$$C(i) = \sum_{j}^{N} \frac{1}{d(i,j)}$$
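The difference between the two formulations can be checked directly in base R (toy distances, not the sample network): the inverse of a sum containing Inf collapses to 0, while the sum of inverses still distinguishes reachable nodes.

```r
d <- c(1, 2, Inf)  # distances from a focal node; Inf marks an unreachable node
1 / sum(d)         # inverse of the sum: 0, uninformative
sum(1 / d)         # sum of inverses: 1 + 0.5 + 0 = 1.5
```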

To exemplify this change, for the example network above, the inversed distances and closeness scores are:

| Node | A | B | C | D | E | F | G | H | I | J | K | Sum | Normalized |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A | … | 1.00 | 1.00 | 0.50 | 0.50 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 3.67 | 0.37 |
| B | 1.00 | … | 1.00 | 0.50 | 1.00 | 0.50 | 0.33 | 0 | 0 | 0 | 0 | 4.33 | 0.43 |
| C | 1.00 | 1.00 | … | 1.00 | 0.50 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 4.50 | 0.45 |
| D | 0.50 | 0.50 | 1.00 | … | 0.50 | 1.00 | 1.00 | 0 | 0 | 0 | 0 | 4.50 | 0.45 |
| E | 0.50 | 1.00 | 0.50 | 0.50 | … | 1.00 | 0.33 | 0 | 0 | 0 | 0 | 3.83 | 0.38 |
| F | 0.33 | 0.50 | 0.50 | 1.00 | 1.00 | … | 0.50 | 0 | 0 | 0 | 0 | 3.83 | 0.38 |
| G | 0.33 | 0.33 | 0.50 | 1.00 | 0.33 | 0.50 | … | 0 | 0 | 0 | 0 | 3.00 | 0.30 |
| H | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 1.00 | 0.50 | 0 | 1.50 | 0.15 |
| I | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1.00 | … | 1.00 | 0 | 2.00 | 0.20 |
| J | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.50 | 1.00 | … | 0 | 1.50 | 0.15 |
| K | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 |

As can be seen from this table, a closeness score is attained for all nodes, taking into consideration an equal number of distances for each node irrespective of the size of the node’s component. Moreover, nodes belonging to a larger component generally attain a higher score. This is deliberate, as these nodes can reach a greater number of others than nodes in smaller components. The normalized scores are bound between 0 and 1: a node scores 0 if it is an isolate, and 1 if it is directly connected to all others.
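Node A’s row in the table can be reproduced with a few lines of base R, with its distances taken from the distance matrix above:

```r
# Node A's shortest distances to the ten other nodes (Inf = unreachable)
d_A <- c(1, 1, 2, 2, 3, 3, Inf, Inf, Inf, Inf)
sum(1 / d_A)                # closeness: 3.67
sum(1 / d_A) / length(d_A)  # normalized by N - 1: 0.37
```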

This measure can easily be extended to weighted networks by introducing Dijkstra’s (1959) algorithm as proposed in Average shortest distance in weighted networks.

**References**

Dijkstra, E. W., 1959. A note on two problems in connexion with graphs. Numerische Mathematik 1, 269-271.

Freeman, L. C., 1978. Centrality in social networks: Conceptual clarification. Social Networks 1, 215-239.

Opsahl, T., Agneessens, F., Skvoretz, J. (2010). Node centrality in weighted networks: Generalizing degree and shortest paths. Social Networks 32, 245-251.

Wasserman, S., Faust, K., 1994. Social Network Analysis: Methods and Applications. Cambridge University Press, New York, NY.

**Want to try it with your data?**

Below is the code to calculate the closeness measure on the sample network above.

```r
# Load tnet
library(tnet)

# Load network
# Node K is assigned node id 8 instead of 10 as isolates at the end of id sequences are not recorded in edgelists
net <- cbind(
  i=c(1,1,2,2,2,3,3,3,4,4,4,5,5,6,6,7,9,10,10,11),
  j=c(2,3,1,3,5,1,2,4,3,6,7,2,6,4,5,4,10,9,11,10),
  w=c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1))

# Calculate measures
closeness_w(net, gconly=FALSE)
```

This post is the explanation of a footnote in the node centrality paper. If you use any of the information in this post, please cite: Opsahl, T., Agneessens, F., Skvoretz, J., 2010. Node centrality in weighted networks: Generalizing degree and shortest paths. Social Networks 32 (3), 245-251

]]>The content of this post has been integrated in the *tnet* manual, see Clustering in Two-mode Networks.

]]>The content of this post has been integrated in the *tnet* manual, see Datasets.

]]>*“we measured users’ average out-strength (in-strength) as the average number of messages sent to (received from) others (Opsahl, Colizza, Panzarasa, & Ramasco, 2008). We expected hubs to be weakly connected to others, based on the conjecture that all users are homogeneously limited by the same constraints of resources and time. In this case, having more contacts should reduce the amount of resources and time spent on each of them (Burt, 1992). We were surprised to find a positive and significant (p<0.001) Pearson’s pairwise correlation coefficient between average out-strength (in-strength) and out-degree (in-degree) of 0.28 (0.44). This signals that hubs spend more time and resources with each of their contacts than the less connected users.”* (excerpt from page 919).

The heterogeneity in average tie weight for users with different levels of gregariousness might indicate that node degree and node strength are not correlated. **This post aims to test this for the online social network used in the paper and compare degree and strength distributions.**

Given that this is a directed network, each analysis is conducted twice: once for outgoing ties and once for incoming ties. The simplest way to test the association between two variables is to calculate the Pearson pairwise correlation coefficient. This coefficient tests the linear relationship between two variables, and ranges from -1 to 1. It equals 1 when there is perfect correlation between the two variables, and -1 when the two variables are exact opposites of each other. A value of 0 is attained when there is no linear relationship between the two variables. For out-degree and out-strength, the coefficient is 0.90, and for in-degree and in-strength, the coefficient is 0.89. This indicates that degree and strength are highly correlated with each other (Cohen, 1988).
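The benchmark values of the coefficient are easy to illustrate with base R’s cor() on toy vectors (hypothetical numbers, unrelated to the network data):

```r
x <- c(1, 2, 3, 4, 5)
cor(x, 2 * x)             # perfect linear relationship: 1
cor(x, -x)                # exact opposites: -1
cor(x, c(2, 1, 5, 3, 4))  # an intermediate positive association: 0.6
```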

Since high correlation coefficients were found, it might be interesting to plot the relationships to ensure that extreme values are not distorting the coefficient. The relationships between the two types of degree and strength are:

As the plots above show, there are a number of nodes with extremely high values of degree and strength. However, there are clear trajectories at low values of degree and strength, which suggests that the outliers are not distorting the correlation coefficients. The fact that there are nodes with extremely high degree values is not surprising given that power-law degree distributions with exponents of 0.89 and 1.005 were found in the paper:

Given the similarity between degree and strength, it would be interesting to test whether the strength distributions also follow a power-law distribution, and if so, if the exponent is similar to the ones for the degree-distributions:

The exponents of the strength distributions are 0.87 and 1.004. Although I expected some similarity between the degree distributions’ exponents (0.89 and 1.005) and the strength distributions’ exponents, the numerical similarity is striking.
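The fitting step behind these exponents uses R’s nls() with the model $p(k) = C k^{-\tau}$. A minimal self-contained sketch of the same step, run on simulated data rather than the network’s distributions, is:

```r
set.seed(1)
k  <- 1:50
pk <- 0.5 * k^(-1) * exp(rnorm(50, sd = 0.02))  # synthetic power law with tau = 1

# Non-linear least-squares fit of p(k) = C * k^(-t)
fit <- nls(pk ~ C * k^(-t), start = list(C = 1, t = 1))
coef(fit)  # estimates of C and t, close to 0.5 and 1
```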

**References**

Burt, R. S., 1992. Structural Holes: The Social Structure of Competition. Harvard University Press, Cambridge, MA.

Cohen, J., 1988. Statistical power analysis for the behavioral sciences (2nd edition). Hillsdale, NJ: Erlbaum.

Opsahl, T., Colizza, V., Panzarasa, P., Ramasco, J. J., 2008. Prominence and control: The weighted rich-club effect. Physical Review Letters 101 (168702). arXiv:0804.0417.

Panzarasa, P., Opsahl, T., Carley, K.M., 2009. Patterns and dynamics of users’ behavior and interaction: Network analysis of an online community. Journal of the American Society for Information Science and Technology 60 (5), 911-932, doi: 10.1002/asi.21015

**Want to try it with your data?**

Below is the code to calculate the numbers and create the diagrams used in this post. If you would also like to calculate the power-law with exponential cut-off, remove the # in front of the commented-out cat() line.

```r
# Load tnet
library(tnet)

# Load network
data(OnlineSocialNetwork.n1899)

`script` <- function(net){
  output <- list()
  # Calculate out/in-degree/strength
  k <- cbind(degree_w(net), degree_w(net, type="in"))
  dimnames(k)[[2]] <- c("node","ko","so","node2","ki","si")
  if(sum(k[,"node"] == k[,"node2"])!=nrow(k))
    stop("Node ids does not match")
  k <- k[,c("node","ko","ki","so","si")]
  output[[1]] <- k
  # Get pair-wise correlation coefficients
  corro <- cor.test(k[,"ko"], k[,"so"])
  corri <- cor.test(k[,"ki"], k[,"si"])
  cat(paste("Pair-wise correlation between degree and strength:\n Out: ",
    corro$estimate, " (p-value: ", corro$p.value,
    ")\n In: ", corri$estimate, " (p-value: ", corri$p.value,
    ")\n Note: If p-value equal 0, p-value is less than 2.2e-16\n", sep=""))
  output[[2]] <- corro
  output[[3]] <- corri
  # Degree distributions
  cat("Degree distributions\n")
  looprange <- c("ko","so","ki","si")
  for(j in 1:length(looprange)) {
    i <- looprange[j]
    tmp <- table(k[,i])
    tmp <- tmp[which(rownames(tmp)!="0")]
    tmp <- tmp/(sum(tmp))
    tmp <- as.data.frame(cbind(k=as.numeric(rownames(tmp)), pk=tmp))
    plaw <- nls(pk ~ C*k^(-t), data=tmp, start=list(C=1, t=1))
    plaweco <- nls(pk ~ C*k^(-t)*exp(-k/K), data=tmp, start=list(C=1, t=1, K=30))
    cat(switch(i, "ko" = " Out-degree", "so" = " Out-strength", "ki" = " In-degree", "si" = " In-strength"))
    cat(paste("\n Powerlaw: pk =", plaw$call$formula[3], "\n Coefficients:\n Con =", coef(plaw)["C"], "\n tau =", coef(plaw)["t"]))
#    cat(paste("\n Powerlaw with exponential cut-off: pk ", plaweco$call$formula[3], "\n Coefficients:\n Con =", coef(plaweco)["C"], "\n tau =", coef(plaweco)["t"], "\n cut =", coef(plaweco)["K"]))
    cat("\n")
    output[[(length(output)+1)]] <- tmp
    output[[(length(output)+1)]] <- plaw
    output[[(length(output)+1)]] <- plaweco
  }
  cat(" Note: These regressions in the article were performed in Stata 9\n The value of the cut-off parameter varies slightly between R and Stata\n")
  return(output)
}

output <- script(OnlineSocialNetwork.n1899.net)
k <- output[[1]]

# Scatter plots of degree against strength
plot(k[,"ko"], k[,"so"], main="Outgoing ties", xlab="out-degree", ylab="out-strength")
plot(k[,"ki"], k[,"si"], main="Incoming ties", xlab="in-degree", ylab="in-strength")

# Degree and strength distributions with fitted power laws
plot(output[[4]][,1], output[[4]][,2], main="Out-degree distribution", xlab="out-degree", ylab="p(out-degree)", log="xy")
lines(output[[4]][,1], fitted(output[[5]]))
plot(output[[7]][,1], output[[7]][,2], main="Out-strength distribution", xlab="out-strength", ylab="p(out-strength)", log="xy")
lines(output[[7]][,1], fitted(output[[8]]))
plot(output[[10]][,1], output[[10]][,2], main="In-degree distribution", xlab="in-degree", ylab="p(in-degree)", log="xy")
lines(output[[10]][,1], fitted(output[[11]]))
plot(output[[13]][,1], output[[13]][,2], main="In-strength distribution", xlab="in-strength", ylab="p(in-strength)", log="xy")
lines(output[[13]][,1], fitted(output[[14]]))
```

I would like to acknowledge Vittoria Colizza in helping to develop the idea behind this post.

If you use any of the information in this post, please cite: Opsahl, T., Panzarasa, P., 2009. Clustering in weighted networks. Social Networks 31 (2), 155-163

]]>The content of this post has been integrated in the *tnet* manual, see Clustering in Two-mode Networks.

]]>The content of this post has been integrated in the *tnet* manual, see Software.

]]>The content of this post has been integrated in the *tnet* manual, see Weighted Rich-club Effect in Two-mode Networks.

]]>**Acknowledgements**

The theme of this thesis is interdependence among elements. In fact, this thesis is not just a product of myself, but also of my interdependence with others. Without the support of a number of people, it would not have been possible to write. It is my pleasure to have the opportunity to express my gratitude to many of them here.

For my academic achievements, I would like to acknowledge the constant support of my supervisors. In particular, I thank Pietro Panzarasa for taking an active part in all the projects I have worked on. I have also had the pleasure of collaborating with people other than my supervisors. I worked with Vittoria Colizza and Jose J. Ramasco on the analysis and method presented in Chapter 2, Kathleen M. Carley on an empirical analysis of the online social network used throughout this thesis, and Martha J. Prevezer on a project related to knowledge transfer in emerging countries. In addition to these direct collaborations, I would also like to thank Filip Agneessens, Sinan Aral, Steve Borgatti, Ronald Burt, Mauro Faccioni Filho, Thomas Friemel, John Skvoretz, and Vanina Torlo for encouragement and helpful advice. In particular, I would like to thank Tom A. B. Snijders and Klaus Nielsen for their insightful reading of this thesis and many productive remarks and suggestions. I have also received feedback on my work at a number of conferences and workshops, and I would like to express my gratitude to the participants.

On a social note, I would like to thank John, Claudius, and my family for their continuing support. Without them I would have lost focus. My peers and the administrative staff have also been a great source of support. In particular, I would like to extend my acknowledgements to Mariusz Jarmuzek, Geraldine Marks, Roland Miller, Jenny Murphy, Cathrine Seierstad, Lorna Soar, Steven Telford, and Eshref Trushin.

]]>The content of this post has been integrated in the *tnet* manual, see Projecting Two-mode Networks.

]]>The content of this post has been integrated in the *tnet* manual, see Clustering.

]]>**Abstract**

In recent years, researchers have investigated a growing number of weighted networks where ties are differentiated according to their strength or capacity. Yet, most network measures do not take weights into consideration, and thus do not fully capture the richness of the information contained in the data. In this paper, we focus on a measure originally defined for unweighted networks: the global clustering coefficient. We propose a generalization of this coefficient that retains the information encoded in the weights of ties. We then undertake a comparative assessment by applying the standard and generalized coefficients to a number of network datasets.

**Motivation**

In this sample network the binary clustering coefficient is 0.33, as a third of the triplets are closed by being part of a triangle. By looking at the weights, it is possible to see that the strongest ties are part of the closed triplets. This is not reflected in the binary clustering coefficient.

By applying the proposed generalisation of the coefficient, using the arithmetic mean method for defining triplet value, the clustering coefficient increases to 0.42. The increase over the binary coefficient reflects the fact that the strongest ties are part of the closed triplets.
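In symbols, this is a sketch of the definition from Opsahl and Panzarasa (2009): the generalised coefficient is the total value of closed triplets over the total value of all triplets, where the value $\omega$ of a triplet centred on node $j$ (with ties to $i$ and $k$) is, under the arithmetic mean method, the mean of its two tie weights:

```latex
C_{\omega} = \frac{\sum_{\tau_{\Delta}} \omega}{\sum_{\tau} \omega},
\qquad
\omega_{am} = \frac{w_{ij} + w_{jk}}{2}
```

With binary weights, every triplet has value 1 and the expression reduces to the standard ratio of closed triplets to all triplets.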

**Want to test it with your data?**

The clustering_w function in tnet allows you to test the generalised clustering coefficient on your own dataset.

For example, to test the clustering_w function on the sample network above, you can run the following code in R:

```r
# Load tnet
library(tnet)

# Load network
net <- cbind(
  i=c(1,1,2,2,2,2,3,3,4,5,5,6),
  j=c(2,3,1,3,4,5,1,2,2,2,6,5),
  w=c(4,2,4,4,1,2,2,4,1,2,1,1))

# Run function
clustering_w(net, measure=c("am", "gm", "ma", "mi"))
# The output is:
#        am        gm        ma        mi
# 0.4166667 0.4361302 0.3750000 0.5000000
```

To test in on Freeman’s third EIES network from the datasets page, you can do the following:

```r
# Load tnet
library(tnet)

# Load network
data(Freemans.EIES)

# Run function
clustering_w(Freemans.EIES.net.3.n32, measure=c("am", "gm", "ma", "mi"))
# The output is:
# 0.7378310 0.7331536 0.7410959 0.7249982
```

If you use any of the information in this post, please cite: Opsahl, T., Panzarasa, P., 2009. Clustering in weighted networks. Social Networks 31 (2), 155-163

]]>The content of this post has been integrated in the *tnet* manual, see Sliding Window.

]]>**Abstract**

This research draws on longitudinal network data from an online community to examine patterns of users’ behavior and social interaction and infer the processes underpinning dynamics of system use. The online community represents a prototypical example of a complex evolving social network in which connections between users are established over time by online messages. We study the evolution of a variety of properties since the inception of the system, including how users create, reciprocate, and deepen relationships with one another, variations in users’ gregariousness and popularity, reachability and typical distances among users, and the degree of local redundancy in the system. Results indicate that the system is a “small world” characterized by the emergence, in its early stages, of a hub-dominated structure with highly heterogeneous users’ behavior. We investigate whether hubs are responsible for holding the system together and facilitating information flow, examine first-mover advantages underpinning users’ ability to rise to system prominence, and uncover gender differences in users’ gregariousness, popularity, and local redundancy. We discuss the implications of the results for research on system use and evolving social networks, and for a host of applications, including information diffusion, communities of practice, and the security and robustness of information systems.

If you use any of the information in this post, please cite: Panzarasa, P., Opsahl, T., Carley, K.M., 2009. Patterns and dynamics of users’ behavior and interaction: Network analysis of an online community. Journal of the American Society for Information Science and Technology 60 (5), 911-932

]]>The content of this post has been integrated in the *tnet* manual, see Node Centrality in Weighted Networks.

]]>The content of this post has been integrated in the *tnet* manual, see Defining Weighted Networks.

]]>The content of this post has been integrated in the *tnet* manual, see Clustering in Weighted Networks.

]]>The content of this post has been integrated in the *tnet* manual, see Shortest Paths in Weighted Networks.

]]>The content of this post has been integrated in the *tnet* manual, see The Weighted Rich-club Effect.

]]>