Why Anchorage is not (that) important: Binary ties and Sample selection

August 12, 2011 at 1:39 am

A surprising finding when analysing airport networks is the importance of Anchorage airport in Alaska. In fact, it is the most central airport in the network when applying betweenness! Betweenness for a node is defined as the number of shortest paths between all other pairs of nodes that pass through the node (see Opsahl et al., 2010, for a review). A host of explanations have been offered to account for the high betweenness of Anchorage given its relatively few connections. For example, Guimera et al. (2004, p. 7797) reasoned that:

Alaska is a sparsely populated, isolated region with a disproportionately large, for its population size, number of airports. Most Alaskan airports have connections only to other Alaskan airports. This fact makes sense geographically. However, distance-wise, it also would make sense for some Alaskan airports to be connected to airports in Canada’s Northern Territories. These connections are, however, absent. Instead, a few Alaskan airports, singularly Anchorage, are connected to the continental U.S. The reason is clear: the Alaskan population needs to be connected to the political centers, which are located in the continental U.S., whereas there are political constraints making it difficult to have connections to cities in Canada, even to ones that are close geographically ([Guimera and Amaral, 2004]). It is now obvious why Anchorage’s centrality is so large. Indeed, the existence of nodes with anomalous centrality is related to the existence of regions with a high density of airports but few connections to the outside. The degree-betweenness anomaly is therefore ultimately related to the existence of communities in the network.

While many researchers and practitioners highlight this finding, I do not believe it is completely accurate. There are two reasons for this:

Issue 1: Binary ties

Admittedly, this might be a personal bias, as most of my work has been on weighted networks. Without going into much detail in this blog post, I strongly believe that assigning the same importance to the connection between London Heathrow and New York’s JFK as to the connection between Pack Creek Airport and Sitka Harbor Sea Plane Base in Alaska (map) introduces measurement error. The table below lists the top ten airports in terms of betweenness (Brandes, 2001) when analysing the binary and weighted (by passengers) versions of the Bureau of Transportation Statistics (BTS) Transtats data. The code to replicate these results can be found at the end of this page.

Top ten airports by betweenness (BTS data)

| Rank | Airport (binary analysis) | Betweenness | Airport (weighted analysis) | Betweenness |
|---|---|---|---|---|
| 1 | ANC (Anchorage, AK, USA) | 465272 | SEA (Seattle/Tacoma, WA, USA) | 834217 |
| 2 | FAI (Fairbanks, AK, USA) | 215503 | ANC (Anchorage, AK, USA) | 761834 |
| 3 | YYZ (Toronto, Canada) | 131562 | ATL (Atlanta, GA, USA) | 735628 |
| 4 | LAX (Los Angeles, CA, USA) | 129246 | LAX (Los Angeles, CA, USA) | 531980 |
| 5 | SEA (Seattle/Tacoma, WA, USA) | 125151 | ORD (Chicago, IL, USA) | 409001 |
| 6 | JFK (New York, NY, USA) | 124927 | DEN (Denver, CO, USA) | 314764 |
| 7 | HPN (White Plains, NY, USA) | 121096 | JFK (New York, NY, USA) | 247791 |
| 8 | MIA (Miami, FL, USA) | 120643 | MIA (Miami, FL, USA) | 206547 |
| 9 | DEN (Denver, CO, USA) | 120342 | BOS (Boston, MA, USA) | 168140 |
| 10 | MSP (Minneapolis, MN, USA) | 111188 | FAI (Fairbanks, AK, USA) | 157491 |

This table demonstrates that Anchorage has more than twice the betweenness of the runner-up, Fairbanks, Alaska, in the binary analysis. In the weighted analysis, Anchorage loses first place to Seattle, and Fairbanks moves to 10th place. It is also worth noticing that, apart from Toronto in the binary analysis, only US airports are in the top ten lists, which leads me on to the second issue with using the BTS data: sample selection.
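To make the difference between the binary and weighted analyses concrete, here is a minimal sketch on a made-up four-node network (not the BTS data). It assumes, in the spirit of Opsahl et al. (2010), that the cost of travelling along a tie is the inverse of its weight in the weighted case and 1 in the binary case. The helper names `shortest_dist` and `toy_betweenness` are illustrative only and are not part of tnet; the counting shortcut only works because every shortest path in this toy network is unique.

```r
# Toy undirected network: the two endpoints of each tie and the tie weight
edges <- data.frame(
  i = c("A", "A", "B", "A"),
  j = c("B", "C", "C", "D"),
  w = c(1, 10, 10, 5),
  stringsAsFactors = FALSE)
nodes <- sort(unique(c(edges$i, edges$j)))

# All-pairs shortest distances (Floyd-Warshall) for a given cost per tie
shortest_dist <- function(cost) {
  n <- length(nodes)
  d <- matrix(Inf, n, n, dimnames = list(nodes, nodes))
  diag(d) <- 0
  for (e in seq_len(nrow(edges))) {
    d[edges$i[e], edges$j[e]] <- cost[e]
    d[edges$j[e], edges$i[e]] <- cost[e]
  }
  for (k in nodes) for (s in nodes) for (t in nodes)
    d[s, t] <- min(d[s, t], d[s, k] + d[k, t])
  d
}

# Count the unordered pairs (s, t) whose shortest path passes through v.
# Assumption: shortest paths are unique, which holds for this toy network.
toy_betweenness <- function(d) {
  sapply(nodes, function(v) {
    pairs <- combn(setdiff(nodes, v), 2)
    sum(apply(pairs, 2, function(p)
      abs(d[p[1], v] + d[v, p[2]] - d[p[1], p[2]]) < 1e-9))
  })
}

binary   <- toy_betweenness(shortest_dist(rep(1, nrow(edges))))  # every tie costs 1
weighted <- toy_betweenness(shortest_dist(1 / edges$w))          # strong ties are "short"
print(rbind(binary, weighted))
```

In the binary version, the direct but weak tie between A and B is the shortest path, so C lies on no shortest path. Once weights are considered, the detour through C over two strong ties is shorter, and C gains betweenness.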

Issue 2: Sample selection

This issue affects all network studies, and it is something I have been interested in for a while. We define a population and analyse the connections among its members. For example, I have analysed the scientific collaboration network based on the papers uploaded to the arXiv preprint server (e.g., Opsahl et al., 2008) with the full knowledge that there are many more scientific publications out there as well as other forms of collaboration and channels for knowledge flow among scientists, such as grant proposals and conference attendance. By simply restricting ourselves to data that are easy to collect (often stored in a central location or repository), the research is vulnerable to sample selection bias.

When it comes to airport networks, the Bureau of Transportation Statistics (BTS) Transtats data are straightforward to collect: go to the Transtats download page, select what you want (Origin, Destination), and click Download! However, there is a small note on another page explaining the dataset: “This table combines domestic and international market data reported by U.S. and foreign air carriers, and contains market data by carrier, origin and destination, and service class for enplaned passengers, freight, and mail. For a uniform end date for the combined databases, the last 3 months U.S. carrier domestic data released in T-100 Domestic Market (U.S. Carriers Only) are not included. Flights with both origin and destination in a foreign country are not included.” It is the last line of this description that highlights the potential sample selection bias. While the data contain all US airports and all domestic flights, they only contain non-US flights that leave from or terminate at a US airport, and the non-US airports on the other end of these flights. As such, a section of the square adjacency matrix is missing (flights between the non-US airports in the dataset), as are the entire rows and columns for airports without flights to the US. To exemplify this bias, I have plotted the routes on a world map below.

The BTS airport network in 2010 where the line colour is based on the number of passengers. The code to replicate this image can be found at the end of this page. If you click on the image, a vector graphic version of it is available (pdf; 5.95mb).

While it is possible to see a concentration in the US in the above picture, the sample selection becomes much more apparent when highlighting Europe. In the picture below, it is possible to see that there are no flights between any pair of European cities, nor between a European city and any other point on the map. Would you have to transit at New York’s JFK to get from London to Barcelona? This gap highlights the need to look for more complete data sources than the Bureau of Transportation Statistics when analysing airport networks.

The European part of the BTS airport network
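The missing section of the adjacency matrix described above can be sketched with a small, made-up route list. The airport codes are real, but the routes and the `country` lookup are hypothetical and only serve to mimic the BTS rule that flights with both origin and destination in a foreign country are not included.

```r
# Hypothetical directed routes (illustration only, not BTS or OpenFlights data)
routes <- data.frame(
  from = c("JFK", "JFK", "LHR", "LHR", "BCN", "OSL"),
  to   = c("LHR", "BCN", "BCN", "OSL", "OSL", "LHR"),
  stringsAsFactors = FALSE)
country <- c(JFK = "US", LHR = "GB", BCN = "ES", OSL = "NO")
airports <- names(country)

# BTS-style sample: keep a route only if it starts or ends in the US
in_bts <- country[routes$from] == "US" | country[routes$to] == "US"
bts <- routes[in_bts, ]

# Airports without any US flight lose their entire row and column
missing_airports <- setdiff(airports, unique(c(bts$from, bts$to)))

# Build adjacency matrices over the full set of airports for comparison
adjacency <- function(r) {
  a <- matrix(0, length(airports), length(airports),
              dimnames = list(airports, airports))
  a[cbind(r$from, r$to)] <- 1
  a
}
full    <- adjacency(routes)
sampled <- adjacency(bts)

# The non-US to non-US block is populated in the full data but empty in the sample
non_us <- airports[country != "US"]
print(full[non_us, non_us])
print(sampled[non_us, non_us])
print(missing_airports)
```

In this toy sample, London and Barcelona are only connected through JFK, and Oslo disappears altogether, which is the same pattern visible in the European map.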

Alternative data-source

There are a couple of authoritative databases with world-wide airline routes. However, most of them are proprietary, as they have enormous business intelligence potential, and are consequently difficult to obtain. OAG Worldwide is one such database, and it should be noted that Guimera et al. (2004) jumped through the hoops to get these data and therefore had a much more complete view of the airport network than if they had used the BTS Transtats data. While I do not have access to such a database, Openflights.org is a crowdsourced alternative. Although these data come without any guarantee, they have the potential to showcase the limitations of the BTS Transtats data. As a first step, I mapped the data to ensure there were no obvious pockets of missing data.

The Openflights.org airport network where the line colour is based on the number of routes (accessed on August 12, 2011). The code to replicate this image can be found at the end of this page. If you click on the image, a vector graphic version of it is available (pdf; 5.25mb).

Conclusion 1: Anchorage is not the most important airport

As can be seen from this picture, there are no obvious areas without any form of airline traffic. To show how these data affect a betweenness analysis, I have computed betweenness on both the binary and weighted (by number of routes, as passenger numbers were not available) versions of the network. As can be seen in the table below, major airports located around the globe get the highest scores in these analyses instead of mainly US airports. Specifically, Anchorage is only the third most central airport in the binary analysis, and the 14th most central in the weighted analysis. As such, it is still an important airport in the network, but maybe not the most important.

Top airports by betweenness (Openflights.org data)

| Rank | Airport (binary analysis) | Betweenness | Airport (weighted analysis) | Betweenness |
|---|---|---|---|---|
| 1 | FRA (Frankfurt, Germany) | 587531 | LHR (London, United Kingdom) | 1858349 |
| 2 | CDG (Paris, France) | 520707 | LAX (Los Angeles, United States) | 1310287 |
| 3 | ANC (Anchorage, United States) | 481044 | JFK (New York, United States) | 1084392 |
| 4 | DXB (Dubai, United Arab Emirates) | 443314 | BKK (Bangkok, Thailand) | 797785 |
| 5 | GRU (Sao Paulo, Brazil) | 402882 | SIN (Singapore) | 739981 |
| 6 | YYZ (Toronto, Canada) | 398869 | SEA (Seattle, United States) | 723145 |
| 7 | LHR (London, United Kingdom) | 389846 | MAD (Madrid, Spain) | 707354 |
| 8 | LAX (Los Angeles, United States) | 356600 | GRU (Sao Paulo, Brazil) | 684057 |
| 9 | DME (Moscow, Russia) | 353902 | NRT (Tokyo, Japan) | 639074 |
| 10 | BKK (Bangkok, Thailand) | 352682 | DXB (Dubai, United Arab Emirates) | 610765 |
| 14 | | | ANC (Anchorage, United States) | 469203 |
| 18 | | | FRA (Frankfurt, Germany) | 392418 |

Conclusion 2: Finding the global superhub using a weighted approach

London Heathrow is the most central airport when considering both tie weights and the global airport network. Unlike the Anchorage result, this is not a surprising finding, as Heathrow is the airport with the most international passengers (Airports Council International, 2011).

To further investigate the effects on the ranking when considering tie weights in the global airport network, I considered the change in ranking of the two airports ranked first in the binary and weighted analyses, Frankfurt and London Heathrow. Frankfurt went from having the highest betweenness in the binary analysis to only the 18th highest betweenness in the weighted analysis. Conversely, London Heathrow went from having the seventh highest to the highest betweenness score. To look into this cross-over of rankings, I compared the degree (number of airports with direct flights) and strength (number of direct routes) of these two airports:

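| Airport | Degree | Node strength | Ties of weight 1 | Weight 2 | Weight 3 | Weight 4 | Weight 5 |
|---|---|---|---|---|---|---|---|
| FRA (Frankfurt, Germany) | 237 | 349 | 142 | 82 | 9 | 4 | 0 |
| LHR (London, United Kingdom) | 157 | 288 | 71 | 55 | 22 | 4 | 5 |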

This table shows that Frankfurt has direct flights to 51% more airports than London Heathrow, but only 21% more routes. The variation in tie weights can be further investigated by looking at the weight distribution. While there are only four airports with four direct routes from Frankfurt, there are nine airports with four or five direct routes from London Heathrow.
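The percentages and counts quoted above follow directly from the strength distribution in the table. As a quick check, the sketch below recomputes degree, node strength, and the relative differences from the tie-weight counts; the variable names are mine and chosen for illustration.

```r
# Number of ties by tie weight (1 to 5), copied from the table above
weight_counts <- rbind(FRA = c(142, 82, 9, 4, 0),
                       LHR = c(71, 55, 22, 4, 5))
colnames(weight_counts) <- 1:5

# Degree is the number of ties; strength is the sum of the tie weights
degree   <- rowSums(weight_counts)
strength <- drop(weight_counts %*% 1:5)

# Relative advantage of Frankfurt over London Heathrow, in percent
pct_more_airports <- round(100 * (degree[["FRA"]] / degree[["LHR"]] - 1))
pct_more_routes   <- round(100 * (strength[["FRA"]] / strength[["LHR"]] - 1))

# Number of strong ties (weight of 4 or 5) per airport
strong_ties <- rowSums(weight_counts[, c("4", "5")])

print(rbind(degree, strength, strong_ties))
print(c(airports = pct_more_airports, routes = pct_more_routes))
```

The recomputed degree and strength match the table, which confirms that the weight distribution accounts for every tie of both airports.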

Moreover, by looking at which airports have strong ties (i.e., with tie weights greater than or equal to 4) with Frankfurt and London Heathrow, it is possible to see that the geographical distribution is strikingly different. Frankfurt has four direct routes to each of Antalya (Turkey), Madrid (Spain), Mallorca (Spain), and Vienna (Austria), which together span 5,597 kilometres (average: 1,399km). Conversely, London Heathrow has five routes to each of Delhi (India), Dubai (UAE), Hong Kong (China), Los Angeles (LAX, USA), and New York City (JFK, USA), and four routes to each of Bangkok (Thailand), Mumbai (India), Boston (USA), and Miami (USA), which together span 65,376 kilometres (average: 7,264km). By having strong ties to geographically distant instead of close airports, London Heathrow acts as an intercontinental hub instead of a continental hub. Additionally, the airports with strong ties to London Heathrow have high betweenness and, therefore, act as hubs in their respective regions. As such, London Heathrow can be seen as the global hub of the world-wide airport network.
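The distances above are great-circle distances. A minimal sketch of the calculation is given below, assuming the spherical law of cosines and the Earth radius of 6378.7 km that appears in the code at the end of this page. The function name `great_circle_km` is illustrative and not part of any package, and the clamping step is my addition to guard against rounding error.

```r
# Great-circle distance in kilometres between two points given as
# c(longitude, latitude) in degrees, using the spherical law of cosines
great_circle_km <- function(p1, p2, radius = 6378.7) {
  rad <- pi / 180
  lon1 <- p1[1] * rad; lat1 <- p1[2] * rad
  lon2 <- p2[1] * rad; lat2 <- p2[2] * rad
  cos_angle <- sin(lat1) * sin(lat2) + cos(lat1) * cos(lat2) * cos(lon1 - lon2)
  # Clamp to [-1, 1] so rounding error cannot push acos() into NaN
  radius * acos(max(-1, min(1, cos_angle)))
}

# Sanity checks on cases where the answer is known analytically
quarter <- great_circle_km(c(0, 0), c(90, 0))    # a quarter of the equator
same    <- great_circle_km(c(10, 50), c(10, 50)) # identical points

# Averages quoted in the text: total distance divided by number of strong ties
avg_fra <- 5597 / 4
avg_lhr <- 65376 / 9
print(round(c(FRA = avg_fra, LHR = avg_lhr)))
```

A haversine formula would be more robust for very short distances, but for airport pairs hundreds of kilometres apart the law of cosines is adequate.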

References

Airports Council International, 2011. Year to date International Passenger Traffic, Apr-2011, accessed August 12, 2011.

Brandes, U., 2001. A Faster Algorithm for Betweenness Centrality. Journal of Mathematical Sociology 25, 163-177.

Bureau of Transportation Statistics, 2011. Air Carrier Statistics (Form 41 Traffic): T-100 Market (All Carriers), accessed August 12, 2011.

Guimera, R., Amaral, L. A. N., 2004. Modeling the world-wide airport network. The European Physical Journal B 38, 381–385.

Guimera, R., Mossa, S., Turtschi, A., Amaral, L. A. N., 2004. The worldwide air transportation network: Anomalous centrality, community structure, and cities’ global roles. Proceedings of the National Academy of Sciences 102(22), 7794-7799.

Openflights.org, 2011. Airport, airline and route data, accessed August 12, 2011.

Opsahl, T., Agneessens, F., Skvoretz, J., 2010. Node centrality in weighted networks: Generalizing degree and shortest paths. Social Networks 32 (3), 245-251.

Opsahl, T., Colizza, V., Panzarasa, P., Ramasco, J. J., 2008. Prominence and control: The weighted rich-club effect. Physical Review Letters 101 (168702).

If you use any of the information in this post, please cite: Opsahl, T., Agneessens, F., Skvoretz, J., 2010. Node centrality in weighted networks: Generalizing degree and shortest paths. Social Networks 32 (3), 245-251.
I would like to acknowledge Bernie Hogan for helping to develop the idea behind this post.

Code used to create the results in this blog post

Below is the code to redo the analysis in this post. You need to have the R packages geosphere, maps, and tnet installed before running the code. You also need to download the Bureau of Transportation Statistics (BTS) Transtats data. Please see the notes in the code.

###################################
## US Airport network (BTS data) ##
###################################

# Load BTS Transtats data
# Downloaded from http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=292 using
# Filters: Geography=all; Year=2010; Months=all
# Columns: Passengers, Origin, OriginCountryName, Dest, DestCountryName
BTS <- read.csv("data/344989982_T_T100_MARKET_ALL_CARRIER.csv", header=TRUE, stringsAsFactors=FALSE)
BTS <- BTS[,c("ORIGIN", "ORIGIN_COUNTRY_NAME", "DEST", "DEST_COUNTRY_NAME", "PASSENGERS")]

# Load airport information (incl. geolocations)
# Downloaded from http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=288 (select all columns)
BTSairports <- read.csv("data/344990073_T_MASTER_CORD.csv", stringsAsFactors=FALSE)

# Replace airport codes with id numbers (net1)
net1 <- BTS
net1.labels <- unique(c(net1[,"ORIGIN"], net1[,"DEST"]))
net1.labels <- net1.labels[order(net1.labels)]
net1[,"ORIGIN"] <- factor(x=net1[,"ORIGIN"], levels=net1.labels)
net1[,"DEST"]   <- factor(x=net1[,"DEST"], levels=net1.labels)
net1 <- data.frame(i=as.integer(net1[,"ORIGIN"]), j=as.integer(net1[,"DEST"]), w=net1[,"PASSENGERS"])

# Add up duplicated entries (multiple routes)
net1 <- net1[order(net1[,"i"], net1[,"j"]),]
index <- !duplicated(net1[,c("i","j")])
net1 <- data.frame(net1[index,c("i","j")], w=tapply(net1[,"w"], cumsum(index), sum))

# Take out routes with no passengers (cargo)
net1 <- net1[net1[,"w"]>0,]

# Take out routes from an airport to itself
net1 <- net1[net1[,"i"]!=net1[,"j"],]

# Load tnet and the network as a tnet object
library(tnet)
net1 <- as.tnet(net1, type="weighted one-mode tnet")

# Calculate binary and weighted betweenness
tmp0 <- betweenness_w(net1, alpha=0)
tmp1 <- betweenness_w(net1, alpha=1)

# Create output object with top x airports
x <- 10
out <- data.frame(
  tmp0[order(-tmp0[,"betweenness"]),][1:x,],
  tmp1[order(-tmp1[,"betweenness"]),][1:x,])
dimnames(out)[[2]] <- c("BTS.bb.node", "BTS.bb.score", "BTS.wb.node", "BTS.wb.score")
BTSairports[BTSairports[,"TR_COUNTRY_NAME"]=="United States of America","TR_COUNTRY_NAME"]  <- "USA"
for(i in 1:x) {
  # Insert label of airport ID (binary)
  tmp2 <- net1.labels[as.integer(out[i,"BTS.bb.node"])][1]
  tmp2 <- BTSairports[BTSairports[,"AIRPORT"]==tmp2,][1,]
  out[i,"BTS.bb.node"] <- paste(tmp2["AIRPORT"], " (", tmp2["TR_CITY_NAME"], ", ", tmp2["TR_COUNTRY_NAME"], ")", sep="")
  # Insert label of airport ID (weighted)
  tmp2 <- net1.labels[as.integer(out[i,"BTS.wb.node"])][1]
  tmp2 <- BTSairports[BTSairports[,"AIRPORT"]==tmp2,][1,]
  out[i,"BTS.wb.node"] <- paste(tmp2["AIRPORT"], " (", tmp2["TR_CITY_NAME"], ", ", tmp2["TR_COUNTRY_NAME"], ")", sep="")
}


#########################
## Plot the US network ##
#########################

# Based on FlowingData's blog post (http://flowingdata.com/2011/05/11/how-to-map-connections-with-great-circles/) Thanks Nathan! 
# Load required packages (type ?install.packages if you get an error)
library(maps)
library(geosphere)

# Symmetrise the network to get the correct tie weight for visualisation as the two directed ties are plotted on top of each other
net1s <- as.data.frame(symmetrise_w(net1, method="SUM"))
net1s <- net1s[net1s[,"i"]<net1s[,"j"],]

# Put labels back in summed up network (i.e, no duplicates) 
net1s[,"i"] <- net1.labels[net1s[,"i"]]
net1s[,"j"] <- net1.labels[net1s[,"j"]]

# Sort data so that weak ties are plotted first
net1s <- net1s[order(net1s[,"w"]),]

# Set up world map and colors for lines
pdf("airport_BTS_plot.pdf", width=11, height=7)
map("world", col="#eeeeee", fill=TRUE, bg="white", lwd=0.05)
pal <- colorRampPalette(c("#cccccc", "black"))
colors <- pal(length(unique(net1s[,"w"])))
colors <- rep(colors, times=as.integer(table(net1s[,"w"])))

# Plot ties
for(i in 1:nrow(net1s)) {
  # Get longitude and latitude of the two airports
  tmp1 <- BTSairports[BTSairports["AIRPORT"]==net1s[i,"i"],c("LONGITUDE","LATITUDE")][1,]
  tmp2 <- BTSairports[BTSairports["AIRPORT"]==net1s[i,"j"],c("LONGITUDE","LATITUDE")][1,]
  # Get the geographical distance to see how many points on the Great Circle to plot
  tmp3 <- 10*ceiling(as.numeric(log(3963.1 * acos((sin(tmp1[2]/(180/pi))*sin(tmp2[2]/(180/pi)))+(cos(tmp1[2]/(180/pi))*cos(tmp2[2]/(180/pi))*cos(tmp1[1]/(180/pi)-tmp2[1]/(180/pi)))))))
  # Line coordinates
  inter <- gcIntermediate(tmp1, tmp2, n=round(tmp3), addStartEnd=TRUE, breakAtDateLine=TRUE)
  # Plot one line if the line does not cross the date line; two if so
  if(is.matrix(inter)) {
    lines(inter, col=colors[i], lwd=0.6)
  } else {
    for(j in 1:length(inter))
      lines(inter[[j]], col=colors[i], lwd=0.6)
  }
}
dev.off()


##########################
## Openflights.org data ##
##########################

# Download airport geolocations from openflights.org/data.html, and set column headings
# "http://openflights.svn.sourceforge.net/viewvc/openflights/openflights/data/airports.dat"
OFairports <- read.csv("data/airports.dat", header=FALSE, stringsAsFactors=FALSE)
dimnames(OFairports)[[2]] <- c("Airport ID", "Name", "City", "Country", "IATA/FAA", "ICAO", "Latitude", "Longitude", "Altitude", "Timezone", "DST")

# Download routes from openflights.org/data.html
# "http://openflights.svn.sourceforge.net/viewvc/openflights/openflights/data/routes.dat"
OF <- read.csv("data/routes.dat", header=FALSE, stringsAsFactors=FALSE)
dimnames(OF)[[2]] <- c("Airline", "Airline ID", "Source airport", "Source airport ID", "Destination airport", "Destination airport ID", "Codeshare", "Stops", "Equipment")

# Remove code-shares as these are duplicated entries
net2 <- OF[OF[,"Codeshare"]=="",c("Source airport ID", "Destination airport ID")]

# Take out routes from an airport to itself and the missing cases (~1%)
net2 <- net2[net2[,"Source airport ID"]!=net2[,"Destination airport ID"],]
net2 <- net2[net2[,"Source airport ID"]!="\\N",]
net2 <- net2[net2[,"Destination airport ID"]!="\\N",]

# As passengers per route is not available, create a weighted network with the weight equal to number of routes
net2 <- data.frame(i=as.integer(net2[,"Source airport ID"]), j=as.integer(net2[,"Destination airport ID"]))
net2 <- shrink_to_weighted_network(net2)


##################################
## Plot the OpenFlights network ##
##################################

# Symmetrise data for visualisation
net2s <- as.data.frame(symmetrise_w(net2, method="SUM"))
net2s <- net2s[net2s[,"i"]<net2s[,"j"],]

# Sort data so that weak ties are plotted first
net2s <- net2s[order(net2s[,"w"]),]

# Set up world map and colors for lines
pdf("airport_OF_plot.pdf", width=11, height=7)
map("world", col="#eeeeee", fill=TRUE, bg="white", lwd=0.05)
pal <- colorRampPalette(c("#cccccc", "black"))
colors <- pal(length(unique(net2s[,"w"])))
colors <- rep(colors, times=as.integer(table(net2s[,"w"])))

# Plot ties
for(i in 1:nrow(net2s)) {
  # Get longitude and latitude of the two airports
  tmp1 <- as.numeric(OFairports[OFairports["Airport ID"]==net2s[i,"i"],c("Longitude","Latitude")][1,])
  tmp2 <- as.numeric(OFairports[OFairports["Airport ID"]==net2s[i,"j"],c("Longitude","Latitude")][1,])
  # Get the geographical distance to see how many points on the Great Circle to plot
  tmp3 <- 10*ceiling(as.numeric(log(3963.1 * acos((sin(tmp1[2]/(180/pi))*sin(tmp2[2]/(180/pi)))+(cos(tmp1[2]/(180/pi))*cos(tmp2[2]/(180/pi))*cos(tmp1[1]/(180/pi)-tmp2[1]/(180/pi)))))))
  # Line coordinates
  inter <- gcIntermediate(tmp1, tmp2, n=round(tmp3), addStartEnd=TRUE, breakAtDateLine=TRUE)
  # Plot one line if the line does not cross the date line; two if so
  if(is.matrix(inter)) {
    lines(inter, col=colors[i], lwd=0.6)
  } else {
    for(j in 1:length(inter))
      lines(inter[[j]], col=colors[i], lwd=0.6)
  }
}
dev.off()


#####################################
## Analyse the OpenFlights network ##
#####################################

# Calculate binary and weighted betweenness (on the directed network, net2)
tmp0 <- betweenness_w(net2, alpha=0)
tmp1 <- betweenness_w(net2, alpha=1)

# Create output object with top x airports
out <- data.frame(out,
  tmp0[order(-tmp0[,"betweenness"]),][1:x,],
  tmp1[order(-tmp1[,"betweenness"]),][1:x,])
dimnames(out)[[2]][5:8] <- c("OF.bb.node", "OF.bb.score", "OF.wb.node", "OF.wb.score")
for(i in 1:x) {
  # Insert label of airport ID (binary)
  tmp2 <- OFairports[OFairports[,"Airport ID"]==out[i,"OF.bb.node"],]
  out[i,"OF.bb.node"] <- paste(tmp2["IATA/FAA"], " (", tmp2["City"], ", ", tmp2["Country"], ")", sep="")
  # Insert label of airport ID (weighted)
  tmp2 <- OFairports[OFairports[,"Airport ID"]==out[i,"OF.wb.node"],]
  out[i,"OF.wb.node"] <- paste(tmp2["IATA/FAA"], " (", tmp2["City"], ", ", tmp2["Country"], ")", sep="")
}


###########################
## Comparing FRA and LHR ##
###########################

# Get FRA and LHR's airport ids
ids <- sapply(c("FRA", "LHR"), function(a) OFairports[OFairports[,"IATA/FAA"]==a,"Airport ID"])
# Rank and Score of FRA
tmp1 <- as.data.frame(tmp1[order(-tmp1[,"betweenness"]),])
tmp1[tmp1[,"node"]==ids["FRA"],]
# Degree and Node strength
tmp3 <- degree_w(net2)
tmp3[ids,]
# Weight distribution
sapply(ids, function(a) table(net2[net2[,"i"]==a,3]))
# Airports with strong ties (w>=4)
tmp4 <- lapply(ids, function(a) data.frame(net2[net2[,"i"]==a & net2[,"w"]>=4,], label="", geo.dist=NaN, stringsAsFactors=FALSE))
# Insert labels
for(a in 1:2) {
  for(b in 1:nrow(tmp4[[a]])) {
    tmp2 <- OFairports[OFairports[,"Airport ID"]==tmp4[[a]][b,"j"],][1,]
    tmp4[[a]][b, "label"] <- paste(tmp2["IATA/FAA"], " (", tmp2["City"], ", ", tmp2["Country"], ")", sep="")
  }
}
# Geographical distance
for(a in 1:2) {
  tmp5 <- as.numeric(OFairports[OFairports["Airport ID"]==ids[a],c("Longitude","Latitude")][1,])
  for(b in 1:nrow(tmp4[[a]])) {
    tmp6 <- as.numeric(OFairports[OFairports["Airport ID"]==tmp4[[a]][b,"j"],c("Longitude","Latitude")][1,])
    tmp4[[a]][b, "geo.dist"] <- 6378.7 * acos((sin(tmp5[2]/(180/pi))*sin(tmp6[2]/(180/pi)))+(cos(tmp5[2]/(180/pi))*cos(tmp6[2]/(180/pi))*cos(tmp5[1]/(180/pi)-tmp6[1]/(180/pi))))
  }
}
sapply(1:2, function(a) mean(tmp4[[a]][,"geo.dist"]))
sapply(1:2, function(a) sum(tmp4[[a]][,"geo.dist"]))

Entry filed under: Network thoughts.

4 Comments

  • 1. John McCreery  |  December 8, 2011 at 8:09 am

    Don’t know if it makes a difference to your data; but as someone who has been flying to Asia from the U.S. since 1969, I have vivid memories of what used to be refueling stops in Anchorage before the introduction of planes able to do direct flights between the USA and Tokyo, Taipei or Hong Kong.

    • 2. Tore Opsahl  |  December 8, 2011 at 2:17 pm

      Hi John,

      This definitely used to be the case with Anchorage, as you said. The data used in the post are from 2010 (BTS) and August 2011 (Openflights), so it should have a limited impact. Although there might be some smaller planes which have to refuel, the aim of the post was not to analyse the airport network, but to highlight two statistical problems: sample selection and sensitivity to measurement using a dataset where “true”-data exist.

      “True” data is immensely important when proposing new methods in efforts to validate them. For the airport network, it is possible to use data from the Airports Council International (2011). This data states that large cities are the ones with the highest throughput of international passengers. Admittedly, by using transit of international passengers, US airports fall on the ranking as the US has a very high level of domestic air travel.

      An interesting extension of weighted betweenness is a measure that does not assume all nodes to be equal. For example, fewer people live in Oslo than in Los Angeles, and hence fewer fly. As such, the shortest path connecting major metropolitan areas might be more important than the one connecting two smaller cities. In turn, a node on the former path should be assigned a higher score than a node on the latter one. By weighting both ties and nodes, the new measure might be more appropriate than current ones. Any thoughts on this measure?

      Best,
      Tore

      • 3. jlmccreery  |  December 8, 2011 at 2:38 pm

        I wouldn’t call it a thought in any well-formed sense. Being still in the early stages of learning R, I have been looking at my data using Pajek and comparing betweenness measures and valued cores for weighted and unweighted 1-mode networks projected from my 2-mode networks. Not knowing the underlying algorithms, I am being utterly empirical and seeing how things vary depending on where I start. The starting points are 2-mode networks simplified either with the [number of lines] setting that retains the number of roles linking ads and creators and assigns them as values to the single lines that replace the multiple lines or the [single line] setting that sets the line values of all the single lines to 1. I am now looking into whether this makes any difference when I project the 1-mode networks. In the [single lines] case, I know that Net>Transform>2-mode to 1-mode command generates a network in which line values correspond to the number of connections in overlaps for ad teams (events) or participation rates for creators (actors). Still have to test what happens when I try the same command starting with the [number of lines] 2-mode networks.

        I can see the mathematical point of the examples (A,B) and (B,C) with both line values set to 2 versus having one set to 3 and the other to 1. I’m still thinking about what the sociological or other implications might be.

      • 4. Tore Opsahl  |  December 8, 2011 at 3:38 pm

        Hi John,

        I am not entirely sure how Pajek does two-mode projection, but here is the documentation for the projecting_tm function in tnet: tnet » Two-mode Networks » Projection

        Best,
        Tore

Licensing

The information on this blog is published under the Creative Commons Attribution-Noncommercial 3.0 license.

This means that you are free to:
· share
· adapt
under the following conditions:
· attribution (cite it)
· noncommercial (email me).
