# Newman, PNAS, 2001.

## Citation

M. E. J. Newman. 2001. The Structure of Scientific Collaboration Networks. Proceedings of the National Academy of Sciences. 404-409.

## Databases

MEDLINE (biomedical research)[1]

Los Alamos e-Print Archive (physics)[2]

NCSTRL (computer science)[3]

## Summary

This paper investigates the structure of scientific collaboration networks. The author used data from several databases covering different fields: biomedicine, physics, and computer science. The main properties of these networks are:

• In all cases, scientific communities seem to constitute a "small world,"[4] in which the average distance between scientists via a line of intermediate collaborators varies logarithmically with the size of the relevant community.
• Those networks are highly clustered, meaning that two scientists are much more likely to have collaborated if they have a third common collaborator than are two scientists chosen at random from the community.
• Distributions of both the number of collaborators of scientists and the numbers of papers are well fit by power-law forms with an exponential cutoff. This cutoff may be caused by the finite time window (1995-1999) used in the study.
• There are a number of significant statistical differences between different scientific communities. Some of these are obvious, such as the much higher number of authors per paper in high-energy physics (SPIRES).

## Background

Social networks have been the subject of both empirical and theoretical study in the social sciences for at least 50 years. Although many of these studies directly probe the structure of the relevant social networks, they suffer from two substantial shortcomings that limit their usefulness. First, the studies are labor intensive, so the size of the network that can be mapped is limited, typically to a few tens or hundreds of people. Second, these studies are highly sensitive to subjective bias on the part of interviewees. In this paper, the author presents a study of a genuine network of human acquaintances that is large (containing over a million people) and for which a precise definition of acquaintance is possible. That network is the network of scientific collaboration, as documented in the papers scientists write.
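As a rough illustration of how such a network can be assembled from publication records, the sketch below links two scientists whenever they appear together on at least one paper. The function name `build_collaboration_network` and the toy author labels are hypothetical; they are not taken from the paper.

```python
from itertools import combinations


def build_collaboration_network(papers):
    """Build an undirected coauthorship graph.

    `papers` is a list of author lists. Returns a dict that maps each
    author to the set of that author's collaborators.
    """
    adj = {}
    for authors in papers:
        unique = set(authors)  # ignore a name repeated on one paper
        for a in unique:
            adj.setdefault(a, set())  # single-author papers still add a node
        for a, b in combinations(unique, 2):
            adj[a].add(b)
            adj[b].add(a)
    return adj


# Toy data (made up): three papers, five authors.
papers = [["A", "B", "C"], ["C", "D"], ["E"]]
network = build_collaboration_network(papers)
mean_collaborators = sum(len(n) for n in network.values()) / len(network)
```

With this representation, the number of collaborators of a scientist is simply the size of that scientist's neighbor set.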

## Brief Description of Experiment Method and Result Analysis

• Number of Authors: The author estimates the true number of authors by carrying out the analysis twice. The first analysis uses all initials of each author, which addresses the problem that two different authors may share the same name. The second analysis uses only the first initial of each author, which addresses the problem that one author may be identified in different ways on different papers. These two analyses give upper and lower bounds on the number of authors, and also indicate the expected precision of many of the other measurements. The results are given in Table 1.
• Mean Papers per Author and Authors per Paper: Table 1 shows that the average number of authors per paper in the SPIRES high-energy physics database is much higher than in the other databases. The reason is that the SPIRES database contains data on experimental as well as theoretical work.
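The two-pass author count described above can be sketched as follows. This is a minimal illustration and not the author's code: it assumes each author is recorded as a `(surname, initials)` pair, and the function name `count_authors` and the sample names are hypothetical.

```python
def count_authors(papers):
    """Return (lower_bound, upper_bound) on the number of distinct authors.

    `papers` is a list of papers, each a list of (surname, initials) pairs.
    Keying on surname plus first initial merges name variants (lower bound);
    keying on surname plus all initials keeps them apart (upper bound).
    """
    all_initials = set()
    first_initial = set()
    for authors in papers:
        for surname, initials in authors:
            name = surname.strip().lower()
            letters = [c for c in initials.upper() if c.isalpha()]
            all_initials.add((name, "".join(letters)))
            first_initial.add((name, letters[0] if letters else ""))
    return len(first_initial), len(all_initials)


# Toy data (made up): "Smith, J." may or may not be "Smith, J.A.".
papers = [
    [("Smith", "J.A."), ("Jones", "R.T.")],
    [("Smith", "J."), ("Lee", "K.")],
    [("Smith", "J.A."), ("Smith", "J.B.")],
]
lower, upper = count_authors(papers)
```

The gap between the two counts gives a feel for how much name ambiguity could affect the other statistics.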

• Number of Collaborations: Fig. 1 shows histograms of the numbers of collaborators of scientists in four of the smaller databases. According to Barabási and Albert's "Emergence of scaling in random networks", if one makes a similar plot for the number of connections (or "links") z to or from sites on the World Wide Web, the resulting distribution closely follows a power law: $P(z) \approx z^{-t}$, where t is a constant exponent with (in that case) a value of about 2.5. The author's data, however, do not follow a power-law form perfectly. If they did, the curves in Fig. 1 would be straight lines on the logarithmic scales used. Instead, the data are well fitted by a power-law form with an exponential cutoff: $P(z) \approx z^{-t} e^{-z/z_{c}}$, where $t$ and $z_{c}$ are constants. Fits to this form are shown as the solid lines in Fig. 1. The exponent $t$ of the power-law distribution is interesting. The author notes that in all of the "hard sciences," this exponent takes values close to 1. In the MEDLINE (biomedicine) database, however, its value is 2.5, similar to that noted for the World Wide Web. The value $t = 2$ forms a dividing line between two fundamentally different behaviors of the network. For $t < 2$, the average properties of the network are dominated by the few individuals who have a large number of collaborators, whereas networks with $t > 2$ are dominated by the "little people", those with few collaborators.
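To see why a power law with an exponential cutoff is not a straight line on logarithmic scales, note that its log-log slope is $-t - z/z_{c}$, which steepens as z grows, whereas a pure power law has the constant slope $-t$. The sketch below checks this numerically. The parameter values are illustrative only and are not the fitted values from the paper; the function names are hypothetical.

```python
import math


def power_law_cutoff(z, t, z_c):
    """Unnormalized P(z) = z^(-t) * exp(-z / z_c)."""
    return z ** (-t) * math.exp(-z / z_c)


def loglog_slope(f, z, eps=1e-6):
    """Numerical slope d(log f) / d(log z) at z, by central difference."""
    hi, lo = z * (1 + eps), z * (1 - eps)
    return (math.log(f(hi)) - math.log(f(lo))) / (math.log(hi) - math.log(lo))


t, z_c = 1.3, 50.0  # illustrative values, not taken from the paper

slopes = [
    loglog_slope(lambda x: power_law_cutoff(x, t, z_c), z)
    for z in (1.0, 10.0, 100.0)
]
# Analytic slopes for comparison: -t - z / z_c at each z.
expected = [-t - z / z_c for z in (1.0, 10.0, 100.0)]
```

For small z the slope stays close to $-t$, so the cutoff only becomes visible for scientists whose number of collaborators is comparable to $z_{c}$.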

• Average Degrees of Separation: The random graph is the simplest model of a network. Social networks are measurably different from random graphs, but the random graph nonetheless provides a useful benchmark against which to compare them. On a random graph with N vertices and an average of z links per vertex, the average distance between vertices is approximately $\log N / \log z$. Fig. 3 shows the average distance between all pairs of scientists for each of the networks studied. Using the appropriate values of N and z from Table 1, there is a strong correlation ($R^{2} = 0.83$) between the measured distances and the expected log N behavior, indicating that distances do indeed vary logarithmically with the number of scientists in a community.
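The comparison above can be sketched in a few lines: measure the average shortest-path distance with breadth-first search, and compare it with the standard random-graph estimate $\log N / \log z$. This is a minimal sketch on a toy graph, assuming the network is stored as a dict of neighbor sets; the function names are hypothetical and the paper's own implementation is not described here.

```python
import math
from collections import deque


def average_distance(adj):
    """Mean shortest-path length over all connected ordered pairs."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from `source`
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1  # exclude the source itself
    return total / pairs


def random_graph_estimate(n, z):
    """Approximate average distance on a random graph: log N / log z."""
    return math.log(n) / math.log(z)


# Toy graph (made up): six scientists arranged in a ring.
ring = {i: {(i - 1) % 6, (i + 1) % 6} for i in range(6)}
measured = average_distance(ring)
benchmark = random_graph_estimate(1000, 10)
```

Running one breadth-first search per vertex costs on the order of N times the number of links, which is why all-pairs distances are feasible even for networks with many vertices.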

• Clustering: The presence of clustering in network data can be measured by the fraction of "transitive triples" in a network[5], also called the clustering coefficient C. The MEDLINE database differs from the other databases in that it has a much lower clustering coefficient. This appears to indicate that it is significantly less common in biological research for scientists to broker new collaborations between their acquaintances than it is in physics or computer science. This could be a result of the "top-down" organization of laboratories under laboratory directors, which tends to produce "tree-like" collaboration networks. Such tree-like networks are known to possess low clustering coefficients.
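A minimal sketch of the clustering coefficient as a fraction of transitive triples: count every pair of neighbors of a vertex (a connected triple) and check whether that pair is itself linked. The dict-of-sets representation, the function name, and the two toy graphs are assumptions made for illustration.

```python
from itertools import combinations


def clustering_coefficient(adj):
    """Fraction of connected triples whose two end vertices are linked."""
    triples = 0  # paths of length two, counted at their middle vertex
    closed = 0   # those paths whose end vertices are also neighbors
    for neighbors in adj.values():
        for a, b in combinations(sorted(neighbors), 2):
            triples += 1
            if b in adj[a]:
                closed += 1
    return closed / triples if triples else 0.0


# Toy graph (made up): a triangle A-B-C with one extra collaborator D of C.
triangle_plus_one = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}
# Toy tree (made up): a lab director with three students who never coauthor.
star = {"PI": {"s1", "s2", "s3"}, "s1": {"PI"}, "s2": {"PI"}, "s3": {"PI"}}

c_mixed = clustering_coefficient(triangle_plus_one)
c_tree = clustering_coefficient(star)
```

The star graph has no closed triples at all, which illustrates why a tree-like, top-down collaboration structure yields a low clustering coefficient.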

## Related Works

The model used to analyze the number of collaborators in this paper is strongly influenced by Barabási and Albert's "Emergence of scaling in random networks", which proposes a power-law result that may apply to most networks.

An interesting further study of one of the databases (SPIRES) is "Physicists thrive with paperless publishing".