## Thesis: 4.5 Sensitivity to the number of control cases

A paper with Bernie Hogan based on this chapter is available. It was written after this chapter and contains a number of changes.

The findings presented in the previous sections are based on a sample of 20 observations for each observed tie: the tie that was formed and 19 randomly selected non-formed ties. However, several rules of thumb exist in the literature with regard to the appropriate number of control observations (Cosslett, 1981; Hosmer and Lemeshow, 2000; King and Zeng, 2001). Cosslett (1981) argued that the optimal number of control cases is the same as the number of realised cases (i.e., observed ties). This implies that the sample of observations in the regression is strictly balanced (i.e., a ratio of 1:1). King and Zeng (2001), by contrast, argued for a sensitivity analysis of the number of control cases. Under this approach, the “optimal” number is reached when an additional control case no longer decreases the standard errors (or increases the significance).
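The sampling scheme described above (one formed tie plus 19 randomly drawn non-formed ties) can be sketched as follows. This is a minimal illustration and not the code used for the analysis: the function name `sample_choice_set`, the node labels, and the assumption that `candidates` already contains only nodes eligible to receive the tie are all hypothetical.

```python
import random


def sample_choice_set(formed_target, candidates, n_controls=19, rng=None):
    """Return the formed tie's target plus n_controls randomly drawn controls.

    Each element is (target_node, formed), where formed is 1 for the tie
    that was actually created and 0 for a sampled non-formed tie.
    """
    rng = rng or random.Random()
    # Controls are drawn without replacement from the eligible nodes,
    # excluding the node that actually received the tie.
    pool = [node for node in candidates if node != formed_target]
    if len(pool) < n_controls:
        raise ValueError("fewer eligible nodes than requested control cases")
    controls = rng.sample(pool, n_controls)
    return [(formed_target, 1)] + [(node, 0) for node in controls]


# Hypothetical example: 100 eligible nodes, the tie was sent to node "n7".
nodes = [f"n{i}" for i in range(100)]
choice_set = sample_choice_set("n7", nodes, rng=random.Random(42))
```

Setting `n_controls=1` would give the strictly balanced 1:1 design, while larger values give the designs compared in the sensitivity analysis.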

We have undertaken a similar sensitivity analysis for the assessment of both the binary and weighted networks. Figure 15 shows the significance (z-score) of in-degree, tested independently in a) the binary network and b) the weighted network, as the number of control cases increases. As shown, the marginal increase in z-scores is very small beyond approximately 20 control cases. Thus, additional control cases would not add value to the analysis. In fact, they might introduce measurement errors (King and Zeng, 2001) and would increase the computational requirements.

Figure 15: Significance (z-score) of in-degree independently tested in the binary (a) and the weighted (b) network with an increasing number of control cases. The z-scores are averages over 30 regressions, each with a different set of control nodes.
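The stopping rule implied by this sensitivity analysis can be written down as a small helper. The sketch below is only one possible reading of "the marginal increase is very small": it assumes a per-control-case tolerance, and both the function name `smallest_sufficient_controls` and the z-score values are made up for illustration. They are not the values plotted in Figure 15.

```python
def smallest_sufficient_controls(mean_z_by_controls, tolerance=0.01):
    """Return the smallest number of control cases after which the average
    z-score gain per additional control case falls below `tolerance`.

    mean_z_by_controls maps a number of control cases to the mean z-score
    obtained over repeated regressions with that many control cases.
    """
    counts = sorted(mean_z_by_controls)
    for prev, curr in zip(counts, counts[1:]):
        gain = mean_z_by_controls[curr] - mean_z_by_controls[prev]
        gain_per_case = gain / (curr - prev)
        if gain_per_case < tolerance:
            return prev
    # The gains never dropped below the tolerance within the tested range.
    return counts[-1]


# Made-up z-scores, for illustration only.
illustrative_z = {1: 4.0, 5: 7.0, 10: 8.5, 20: 9.2, 40: 9.3}
chosen = smallest_sufficient_controls(illustrative_z, tolerance=0.01)
```

The tolerance is a judgement call. A stricter value would push the chosen number of control cases upwards, at the cost of the larger samples discussed in this section.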

First, King and Zeng (2001) found that the statistical properties of logistic regression models are not invariant to the unconditional mean of the dependent variable. They showed through a number of simulations that the estimates obtained by traditional logistic regression models are biased. In fact, the estimates are biased in a specific direction: they are underestimated. This bias can be corrected through a number of methods; however, the problem can be overcome altogether by using a limited number of control cases.

Second, the number of observations in the regression equals the number of observed ties multiplied by the number of control cases plus one (the formed tie itself), so the computational requirements of running a regression model depend on the number of control cases. If there is no constraint on the number of control nodes taken into consideration (i.e., it is the number of other nodes in the network), the number of observations can become extremely large and demand a great deal of memory. In fact, a regression of the binary network with 20,296 observed ties when $A_t$ was not bounded required 36 gigabytes of memory in the statistical programme R (R Development Team, 2008). This is beyond the capabilities of Windows XP, and generally outside the scope of computers running Linux or Unix. Thus, dedicated servers must be used to estimate models of this size. Yet, the online social network is a fairly small network. For larger networks, such as the network of 2.9 million utility patents connected through 16.5 million citations used by Hall et al. (2001), it would not be feasible to estimate a model. If all patents were available from $t=1$ and $A_t$ was not bounded, there would be 48,310 billion observations in the sample. A regression model of this size would certainly be outside the scope of current computational capabilities.
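The arithmetic behind these sample sizes can be checked with a short calculation. The sketch assumes that each observed tie contributes one row for the formed tie and one row per control case. For the patent network it further assumes, as a rough approximation, that every citation is compared against every patent. The function name `n_observations` is hypothetical.

```python
def n_observations(observed_ties, controls_per_tie):
    """Rows in the regression: each observed tie plus its control cases."""
    return observed_ties * (controls_per_tie + 1)


# Bounded case used in this chapter: 20,296 observed ties, 19 controls each.
bounded = n_observations(20_296, 19)  # 405,920 rows

# Unbounded patent network, using the rounded figures quoted in the text:
# roughly 16.5 million citations, each compared against 2.9 million patents.
approx_unbounded = 16_500_000 * 2_900_000  # about 4.8e13 rows
```

With the rounded inputs the product is about 47,850 billion rows, the same order of magnitude as the 48,310 billion quoted above, which presumably rests on the unrounded patent and citation counts.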