Weighted local clustering coefficient

January 23, 2009

TripletA fundamental measure that has long received attention in both theoretical and empirical research is the clustering coefficient. This measure assesses the degree to which nodes tend to cluster together. Evidence suggests that in most real-world networks, and in particular social networks, nodes tend to create tightly knit groups characterised by a relatively high density of ties (Feld, 1981; Friedkin, 1984; Heider, 1946; Holland and Leinhardt, 1970; Louch, 2000; Snijders, 2001; Snijders et al., 2006; Watts and Strogatz, 1998). In real-world networks, this likelihood tends to be greater than the average probability of a tie randomly established between two nodes (Holland and Leinhardt, 1971; Wasserman and Faust, 1994; Watts and Strogatz, 1998).

Two versions of this measure exist: the global and the local. The global version was designed to give an overall indication of the clustering in the network, whereas the local gives an indication of the embeddedness of single nodes. This post is concerned with the latter version, but I will briefly introduce the former quickly here.

The global clustering coefficient is based on triplets of nodes. A triplet is three nodes that are connected by either two (open triplet) or three (closed triplet) undirected ties. A triangle consists of three closed triplets, one centred on each of the nodes. The global clustering coefficient is the number of closed triplets (or 3 x triangles) over the total number of triplets (both open and closed). The first attempt to measure it was made by Luce and Perry (1949). This measure gives an indication of the clustering in the whole network (global), and can be applied to both undirected and directed networks (often called transitivity, see Wasserman and Faust, 1994, page 243). However, it cannot be applied to weighted networks.

Sample networkFor this sample network, the binary global clustering coefficient would be 3 over 9, or 0.33. The closed triplets are B→A←C; A→B←C; A→C←B; whereas the open triplets are A→B←D; A→B←E; C→B←D; C→B←E; D→B←E; B→E←F. As can be seen from the sample network, the strongest ties are inside the triangle. This is not captured by the binary coefficient as the weights are not considered. In the second chapter of my thesis, I proposed a generalisation of the global clustering coefficient to weighted networks (both undirected and directed). For this network, the outcome would be 0.44. The increase from the binary coefficient (0.33) reflects the tendency of strong ties to be part of closed triplets.

The local clustering coefficient is based on ego network density or local density (Scott, 2000; Uzzi and Spiro, 2005; Watts and Strogatz, 1998). For a node, this is the fraction of the number of present ties over the total number of possible ties between the node’s neighbours. Therefore, the outcome ranges between 0 and 1: 0 if no ties exist between the neighbours, and 1 if all possible ties exists. For undirected networks, the local clustering coefficient is formally defined as:

C = \frac{\mbox{actual ties between a node's neighbours}}{\mbox{possible ties between a node's neighbours}}

The number of possible ties between a node’s neighbours can be calculated by multiplying the number of neighbours by the number of neighbours minus 1.

This formula can be re-written in terms of the adjacency matrix a whose entries a_{i,j} is 1 if node i is connected with node j and 0 otherwise:

Local clustering coefficient (adjacency matrix)

The main advantage of this version of the clustering coefficient is that a score is assigned to each node (local measure). This enables researchers to study associations between the coefficient and other nodal properties (e.g. Panzarasa et al., 2009) and perform regression analyses with the observations being the nodes of a network (e.g. Uzzi and Lancaster, 2004). However, this version of the clustering coefficient suffers from three major limitations. First, its outcome does not take into consideration the weight of the ties in the network. Second, the local clustering coefficient cannot be calculated on directed networks. Third, a negatively correlation with degree is often found in real-world networks. This is due to the fact that it is “easier” for a node with two neighbours to get a score of 1 (only one tie is need) than for a node with 10 neighbours (45 ties must be present).

Sample networkFor the sample network, nodes A and C would get a score of 1 since all possible ties among their neighbours are present, whereas node E would get a score of 0 since none of the possible ties among its neighbours are present. Nodes D and F would not get a value since the number of possible ties among their neighbours is 0 (if a node has less than 2 neighbours, the coefficient is undefined). For node B, one out of six possible ties is present, so the coefficient is 1/6 or 0.1667.

Barrat et al. (2004) proposed a generalisation of the local clustering coefficient to weighted networks by taking the weights explicitly into account. First, they assigned a triplet value to each triplet in the network based on the arithmetic mean (normal average). Then for each node, they summed the value of the closed triplets that were centred on the node and divided it by the total value of all triplets centred on the node. Formally, they proposed the following equation:

Weighted local clustering coefficient

For the above sample network, all the nodes, except node B, would get the same outcome as they get the extreme values. Node B would get a value of 0.242. The difference between this value and the one obtain in the binary analysis (0.167) is a reflection of node B’s strong ties being directed towards neighbours that are themselves connected.

However, the arithmetic mean is insensitive to differences between the two tie weights as an extreme value can have a major impact on the triplet value. Therefore, I propose three additional methods for defining the triplet value. These were originally used in my proposed generalisation of the global clustering coefficient.

First, it can be defined as the geometric mean of the weights attached to the two ties. This method overcomes some of the sensitivity issues as a triplet made up by a tie with a low value and a tie with a high value will have a lower value than if the arithmetic mean were used.

Second, it can be defined as the maximum value of the two weights. This method offsets a low tie weight and makes a triplet with a strong tie and a weak tie equal to a triplet with two strong ties.

Third, it can be defined as the minimum value of the two weights. This method offsets a high tie weight by making triplets with a strong tie and a weak tie equal to triplets with two weak ties.

The table below highlights the differences between the methods of defining the triplet value (adopted from Chapter 2 of my thesis).

Triplet value

It is vital to use an appropriate method for defining the value of a triplet as this impacts on the outcome of the coefficient. The method should be chosen based on the research question as well as the way in which the strength of the ties are operationalised into weights. For example, in a network where the weights correspond to the level of flow, and a weak tie would act as a bottleneck, the minimum method might be most appropriate to use.

Want to test it with your data?

First, you need to create an edgelist of your network (see the tnet documentation). The edgelist for the sample network above can be entered into R using the following command:

net <- cbind(
i=c(1,1,2,2,2,2,3,3,4,5,5,6),
j=c(2,3,1,3,4,5,1,2,2,2,6,5),
w=c(4,2,4,4,1,2,2,4,1,2,1,1))

Then you can run the following code to calculate the local clustering coefficient using the four definitions for triplet value.

# Load tnet
library(tnet)

# Calculated all measures (including the binary one)
clustering_w_local(net, measure=c("am","gm","ma","mi","bi"))

The output should be something along these lines (am is the weighted local clustering coefficient (wlcc) using the arithmetic mean; gm is the wlcc using the geometric mean; ma is the wlcc using the maximum method; mi is the wlcc using the minimum method; and bi is the binary version of the wlcc):

     node        am        gm        ma        mi        bi
[1,]    1 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
[2,]    2 0.2424242 0.2654092 0.1818182 0.3636364 0.1666667
[3,]    3 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
[4,]    4       NaN       NaN       NaN       NaN       NaN
[5,]    5 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
[6,]    6       NaN       NaN       NaN       NaN       NaN

Nodes 4 and 6 do not have a value as they have less than 2 neighbours. Nodes 1 and 3 get a value of 1 because all their neighbours are connected, whereas node 5 get a value of 0 because none of its neighbours are connected.

Node 2 is the interesting node as it does not get the extreme values. The local clustering coefficient obtained by using the four definitions of triplet differ. The choice of which definition to use should be based on the research question. Moreover, within a regression framework, the different values might be used as robustness checks.

Empirical test

I have also tested this extension on the network of the 500 busiest commercial airports in the US linked together by scheduled flights (Thanks Vittoria for uploading this network!).

The code for downloading and calculating the measures is:

# Load tnet
library(tnet)

# Load network
## Please cite Colizza et al. (2007) doi:10.1038/nphys560
data(USairport.n500)
net <- USairport.n500.net

# Since this network has are very large tie weights in this network, the geometric method in R fails. This limitation in R can be overcome by transforming the values.
# Note: This step is not required unless you receive warnings when running the function.
net[,3] <- (net[,3]/min(net[,3]))

# Calculate measure
output <- clustering_w_local(net, measure=c("am","gm","ma","mi","bi"))

# Find the average value for nodes with a degree greater than 1
colMeans(output[,c("am","gm","ma","mi","bi")], na.rm=T)
       am        gm        ma        mi        bi
0.7660205 0.7751321 0.7610040 0.7828600 0.7264579

References

Barrat, A., Barthelemy, M., Pastor-Satorras, R., Vespignani, A., 2004. The architecture of complex weighted networks. Proceedings of the National Academy of Sciences 101 (11), 3747-3752. arXiv:cond-mat/0311416

Colizza, V., Pastor-Satorras, R., Vespignani, A., 2007. Reaction-diffusion processes and metapopulation models in heterogeneous networks. Nature Physics 3, 276-282. arXiv:cond-mat/0703129

Feld, S. L., 1981. The focused organization of social ties. American Journal of Sociology 86, 1015-1035.

Friedkin, N. E., 1984. Structural cohesion and equivalence explanations of social homogeneiety. Sociological Methods and Research 12, 235-261.

Heider, F., 1946. Attitudes and cognitive organization. Journal of Psychology 21, 107-112.

Holland, P. W., Leinhardt, S., 1970. A method for detecting structure in sociometric data. American Journal of Sociology 76, 492-513.

Holland, P. W., Leinhardt, S., 1971. Transitivity in structural models of small groups. Comparative Group Studies 2, 107-124.

Louch, H., 2000. Personal network integration: Transitivity and homophily in strong-tie relations. Social Networks 22, 45-64.

Luce, R. D., Perry, A. D., 1949. A method of matrix analysis of group structure. Psychometrika 14 (1), 95-116.

Panzarasa, P., Opsahl, T., Carley, K. M., 2009. Patterns and dynamics of users’ behavior and interaction: Network analysis of an online community. Journal of the American Society for Information Science and Technology 60 (5), 911-932.

Scott, J., 2000. Social Network Analysis: A Handbook. Sage Publications, London, UK.

Snijders, T. A. B., 2001. The statistical evaluation of social network dynamics. Sociological Methodology 31, 361-395.

Snijders, T. A. B., Pattison, P. E., Robins, G. L., Handcock, M. S., 2006. New specifications for exponential random graph models. Sociological Methodology 35, 99-153.

Uzzi, B., Lancaster, R., 2004. Embeddedness and price formation in the corporate law market. American Sociological Review 69, 319-344.

Uzzi, B., Spiro, J., 2005. Collaboration and creativity: The small world problem. American Journal of Sociology 111, 447-504.

Wasserman, S., Faust, K., 1994. Social Network Analysis. Cambridge University Press, Cambridge, MA.

Watts, D. J., Strogatz, S. H., 1998. Collective dynamics of “small-world” networks. Nature 393, 440-442.

Please cite or link to this post if you use it.

Entry Filed under: Network thoughts. Tags: , , , , , , , , .

1 Comment Add your own

  • 1. David Hope  |  May 10, 2009 at 12:01 pm

    Great to see someone trying to tackle weights in networks – this is very similar to what I’m trying to do for my Ph.D. (over language) I’m just reading your paper ‘Clustering In Weighted Networks’ – very interesting. I actually came up with a (local) weighted clustering coefficient which I use to find cohesion in language: this is simply the sum of the products of the paths and can deal with weighted, ‘unweighted’ (i.e. uniform weights) and both directed and undirected graphs.

    Reply

Leave a Comment

Required

Required, hidden

Some HTML allowed:
<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <pre> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>

Trackback this post  |  Subscribe to the comments via RSS Feed


Welcome

Tore OpsahlMy aim for this blog is to explore and throw out in the open some of the ideas about social network analysis that I have, but no time to implement. Many of my ideas stem from my interest in weighted networks and my belief that the weights are an enormous source of data. However, many social network measures require that the weights are discarded. In so doing, the richness of the data is considerably reduced. In turn, this limits the analysis.

Recent Posts

Upcoming Posts

Creating an ensemble of binary networks from a weighted one

Closeness in weighted networks

tnet: Software for Analysing Two-Mode Networks

Links

Feeds

Licensing

The information on this blog is published under the Creative Commons Attribution-Noncommercial 3.0-lisence.

This means that you are free to:
· share (copy, distribute and transmit)
· remix (adapt)
under the following conditions:
· attribution (you must cite this blog)
· noncommercial (you may not use it for
   commercial purposes).