Miray Dongyang Niting project proposal

Citation Network Evolution Miray Kas, Niting Qi, Dongyang Teng Electrical & Computer Engineering Carnegie Mellon University Pittsburgh, PA 15217 USA

Email: 1. mkas@andrew.cmu.edu 2. nqi@andrew.cmu.edu 3. dongyant@andrew.cmu.edu

A PDF version of this proposal is available: File:Proposal.pdf

ABSTRACT

The staggering growth in the availability of electronic resources over the Internet enables rapid dissemination of ideas and rapid changes in trends and interaction patterns. In this project, we plan to investigate how a specific research area (e.g. high-energy physics) changes over time. Our goal is to study techniques that might be useful for predicting the future of a particular research area from the changes observed in the currently available data.

CATEGORIES AND SUBJECT DESCRIPTORS

[Social Networks]: Dynamic Network Models, Network Evolution, Citation Analysis.

GENERAL TERMS

Algorithms, Sociology.

KEYWORDS

Social Networks, Dynamic Network Models, Network Evolution, Citation Analysis, Pattern Matching.

INTRODUCTION

The study of networks, including social networks, biological networks, information networks, and many other kinds, has long been a focus of scientific research. Many of the results obtained are not only solutions to problems in the field of networks but are also applicable to other fields. Citation networks, which are the principal focus of this paper, have been studied quantitatively almost from the moment citation databases first became available. In 1965, Derek J. de Solla Price described the inherent linking characteristic of the Science Citation Index (SCI) in his paper titled "Networks of Scientific Papers" (1). The links between citing and cited papers became dynamic when the SCI began to be published online. In 1973, Henry Small published his classic work on co-citation analysis (2), which became a self-organizing classification system that led to document clustering experiments and eventually to what was later called "Research Reviews".

Autonomous citation indexing was introduced in 1998 by Giles, Lawrence and Bollacker (3), enabling automated extraction and grouping of citations for academic/scientific documents. While previous citation extraction was a manual process, citation measures now could scale up and be computed for any scholarly and scientific field, not just those selected by organizations. Further information on the field history can be found at (4).

Nowadays, many innovations and new research areas emerge from existing work. The creation of a specific research topic, as well as its development, can be traced by mining paper citations. Changes in citations show how a research topic evolves over time and also help us understand the lineage of topics. Citation networks can also help researchers identify topics that are related to a specific research topic.

Since a research paper contains more information than a bag of words, a comprehensive model should be developed to observe the evolution of a specific research area and, further, to predict research trends.

DATASET

KDD Cup 2003 Dataset

For our project, we plan to use the KDD Cup 2003 dataset, which is publicly available online. The data consists of roughly 29,000 papers in arXiv (1993-2003) within the field of high-energy physics. Each paper has a unique id (a random number between 1 and 100,000). The data is structured as follows: (i) LaTeX sources of each paper (classified by year), (ii) the abstract of each paper, (iii) the SLAC date of each paper, and (iv) citation graph data in the form of (citing_paper_id, cited_paper_id) pairs. The citation graph has roughly 342K entries. It does not contain any information about citations to or from papers that are not covered by the dataset.
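As an illustration only, the citation pairs in item (iv) could be loaded along the following lines. The sketch assumes one whitespace-separated pair per line and treats lines starting with "#" as comments; the exact file layout, the function name parse_citation_pairs, and the sample ids are assumptions made for this example, not part of the dataset description.

```python
from collections import defaultdict


def parse_citation_pairs(lines):
    """Build adjacency sets from 'citing cited' pairs.

    Blank lines and lines starting with '#' are skipped (assumed layout).
    """
    cites = defaultdict(set)     # paper id -> ids of papers it cites
    cited_by = defaultdict(set)  # paper id -> ids of papers citing it
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        citing, cited = line.split()[:2]
        cites[citing].add(cited)
        cited_by[cited].add(citing)
    return cites, cited_by


# Hypothetical sample standing in for the real citation file.
sample = ["# citing cited", "1001 1002", "1003 1002", "1003 1001", ""]
cites, cited_by = parse_citation_pairs(sample)
print(len(cited_by["1002"]))  # number of citations received by paper 1002
```

Keeping both directions of the relation makes it cheap to ask either what a paper cites or which papers cite it.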

IDEAS

General Idea

Social networks have been studied extensively in the social sciences (5) (6). A general feature of these studies is that they are often restricted to small systems and often treat networks as static graphs whose nodes represent individuals and whose links represent their social interactions.

In contrast, we plan to take a different and complementary approach and analyze social networks that are neither static nor small. The KDD Cup 2003 dataset will enable us to build a citation network that expands constantly through the addition of new authors and of new links between authors. This network is likely to become extremely large and complicated even after a short period of time, especially when a new research area is developing. Generally, the topological properties of such a network are determined by dynamical growth processes (7). Consequently, understanding the dynamical process that drives its evolution is crucial to understanding its topology. It then becomes possible to construct a model that captures the dynamical features of the citation network, and this model can also be used to make certain predictions.

We have a few ideas that might be useful in predicting the future of a particular research area:

First, it is possible to parse this dataset and build yearly snapshots of the semantic (conceptual) network. This kind of network might be interesting for “hot topic detection”, looking at the concepts that frequently appear in papers and/or at the citations that papers on a particular topic (e.g. high-energy physics) receive over time.

Second, since the KDD Cup 2003 dataset is already broken into years, it is also possible to take yearly or even more frequent snapshots of the citation network. It might also be possible to identify how the key actors in a citation network change over time.
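One possible way to build such snapshots is sketched below, using cumulative citation counts (in-degree) as a simple stand-in for "key actor" status. The sketch assumes that each citation is dated by the year of the citing paper; the function name yearly_indegree, the paper ids, and the years are hypothetical.

```python
from collections import Counter


def yearly_indegree(edges, paper_year):
    """Return {year: Counter of cumulative citations received up to that year}.

    edges: iterable of (citing, cited) pairs.
    paper_year: {paper id: year}; a citation is dated by its citing paper.
    """
    per_year = {}
    for citing, cited in edges:
        year = paper_year[citing]
        per_year.setdefault(year, Counter())[cited] += 1

    snapshots, running = {}, Counter()
    for year in sorted(per_year):
        running.update(per_year[year])    # add this year's new citations
        snapshots[year] = running.copy()  # freeze the state at end of year
    return snapshots


# Hypothetical toy data.
paper_year = {"a": 1993, "b": 1994, "c": 1994, "d": 1995}
edges = [("b", "a"), ("c", "a"), ("d", "b"), ("d", "c"), ("d", "a")]
snapshots = yearly_indegree(edges, paper_year)
print(snapshots[1995].most_common(1))  # most-cited paper as of 1995
```

Comparing the top entries of consecutive snapshots gives a first, rough view of how the most-cited papers change from year to year. Other centrality measures could be substituted for in-degree.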

Expanded application

It is important to emphasize that the properties of citation networks are not unique to them. The ideas that we intend to develop through this course project might be applicable to various types of networks.

For instance, the WWW is also a complex evolving network in which nodes and links are added (and removed) at a very high rate, so its topology is also profoundly shaped by these dynamical features (8), (9). It is natural to apply the methods used for citation networks to the analysis of the WWW. As a specific example, one could identify the most popular websites among a certain group of people (e.g. teenagers), which would make it easier to understand their needs and serve them better. The methods may also be useful for predicting market trends in finance. In addition, the development trend of a company might be predicted by analyzing the news and customers related to that company.

RELATED WORK

In this section, we provide a brief overview of the related work we have identified so far. We break the related work into two parts: we first discuss papers that use the same dataset (KDD Cup 2003) and then papers in closely related research areas. The second part covers two aspects: work done in the field of citation networks, and work done in the field of dynamic network analysis and network evolution.

In the KDD Cup competition, three main tasks were evaluated on this dataset: (i) predicting how the number of citations to each paper in the dataset will change over time, (ii) extracting useful data from a huge set of source/text files (i.e. data cleaning), and (iii) estimating the number of downloads a paper receives in its first two months after it is uploaded to arXiv.

In KDD Cup 2003, for the citation prediction task, the winning team converted the data into a time series and applied regression analysis to it (10). The authors of (11) also use time series conversion as their first step and comment further on the factors that affect the citations received by a paper, such as the reputation of the authors, publishing seasons (related to the overhead of the academic year or conferences), and hot topics in the field. For the download estimation task, the winning team focused on an extension of the bag-of-words approach, using linear regression as the learning algorithm (12).
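To make the time series idea concrete, the sketch below fits a straight line to a paper's per-period citation counts by ordinary least squares and extrapolates it. This is a deliberately simple illustration of regression on a time series, not a reconstruction of the method in (10) or (11); the function names and the sample counts are hypothetical.

```python
def fit_line(ys):
    """Ordinary least squares fit of y = a + b*t for t = 0..n-1 (needs n >= 2)."""
    n = len(ys)
    t_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(ys))
    slope = sxy / sxx
    intercept = y_mean - slope * t_mean
    return intercept, slope


def forecast(ys, steps=1):
    """Extrapolate the fitted line `steps` periods past the last observation."""
    intercept, slope = fit_line(ys)
    return intercept + slope * (len(ys) - 1 + steps)


# Hypothetical citation counts for one paper over four consecutive periods.
counts = [2, 4, 6, 8]
print(forecast(counts))  # extrapolated count for the next period
```

A linear trend is only a baseline. Features such as author reputation or seasonality, which (11) discusses, would need a richer model.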

Beyond the work performed on this specific dataset, other papers investigate citation networks and the evolution of networks more broadly, including other types of networks.

For instance, various studies investigate properties of citation networks such as the small-world property and couplings, along with many other properties. For example, (13) investigates the non-acyclicity problem in citation networks. Since citation networks are almost acyclic, large strongly connected components are very likely to indicate an error in the collected citation data. Strong components might occur because multiple versions of the same paper are available, and such cases need to be detected and eliminated. Another interesting problem in citation networks is to understand how topics evolve over time and how this can be detected using citation networks (14). One other paper based on the evolution of citation networks investigates the influence of marketing journals in subfields of marketing, analyzing which journals emerge as the most influential ones and how this changes over time (15).
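The acyclicity check described above can be sketched with a standard strongly connected components routine. The version below uses Kosaraju's algorithm written iteratively, so that deep citation chains do not hit Python's recursion limit. It is our own illustration, not the algorithm from (13), and the paper ids are hypothetical.

```python
def strongly_connected_components(graph):
    """Kosaraju's algorithm. graph: {node: iterable of successor nodes}."""
    nodes = set(graph)
    for succs in graph.values():
        nodes.update(succs)

    # Pass 1: record nodes in order of DFS completion on the original graph.
    order, seen = [], set()
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack = [(start, iter(graph.get(start, ())))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(graph.get(nxt, ()))))
                    break
            else:  # all successors explored: node is finished
                order.append(node)
                stack.pop()

    # Pass 2: explore the reversed graph in reverse completion order.
    reverse = {n: [] for n in nodes}
    for src, succs in graph.items():
        for dst in succs:
            reverse[dst].append(src)

    components, assigned = [], set()
    for start in reversed(order):
        if start in assigned:
            continue
        assigned.add(start)
        component, todo = [], [start]
        while todo:
            node = todo.pop()
            component.append(node)
            for prev in reverse[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    todo.append(prev)
        components.append(component)
    return components


# Hypothetical data: p1 -> p2 -> p3 -> p1 is a cycle, p4 merely cites p1.
citations = {"p1": ["p2"], "p2": ["p3"], "p3": ["p1"], "p4": ["p1"]}
suspicious = [sorted(c) for c in strongly_connected_components(citations)
              if len(c) > 1]
print(suspicious)
```

Any component with more than one paper contains a citation cycle and is a candidate for manual inspection, for example for multiple versions of the same paper.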

(16) investigates the graphical structure of large-scale, time-evolving citation networks using three different analysis techniques (i.e., a probabilistic mixture model fitted with an expectation-maximization algorithm, a modularity-maximization based network clustering method, and an analysis of how eigenvector centrality scores vary over time). As one final example, (17) studies the evolution of social networks in scientific collaboration (co-authorship) networks. (17) provides many interesting results that confirm the scale-free topology of co-authorship networks, whose growth is governed by preferential attachment rules.
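Preferential attachment can be illustrated with a toy growth model in which each new paper cites earlier papers with probability proportional to the citations they have already received. The sketch below is our own simplification and not the model analyzed in (17): the "+1" smoothing (so that uncited papers can still be chosen), the fixed number of references per paper, and all names and parameter values are assumptions.

```python
import random


def grow_citation_network(n_papers, refs_per_paper=3, seed=0):
    """Grow a toy citation network by preferential attachment.

    Paper i may only cite papers 0..i-1, so the result is acyclic.
    Returns (edges, in_degree) with edges as (citing, cited) pairs.
    """
    rng = random.Random(seed)
    in_degree = [0]  # paper 0 exists at the start and cites nothing
    edges = []
    for new in range(1, n_papers):
        k = min(refs_per_paper, new)           # cannot cite more papers than exist
        weights = [d + 1 for d in in_degree]   # "+1" smoothing (assumption)
        targets = set()
        while len(targets) < k:                # redraw until k distinct targets
            targets.add(rng.choices(range(new), weights=weights)[0])
        for cited in targets:
            edges.append((new, cited))
            in_degree[cited] += 1
        in_degree.append(0)
    return edges, in_degree


edges, in_degree = grow_citation_network(200)
print(max(in_degree), sorted(in_degree)[len(in_degree) // 2])  # max vs. median
```

Rebuilding the weight list for every new paper makes this quadratic in the number of papers, which is acceptable for a toy example but would need a more efficient sampling scheme at the scale of the real dataset.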

METHODOLOGY/TOOLS

For the coding part, we plan to write scripts for reading, parsing, and converting the available datasets. For now, we are planning to use MATLAB for coding our analyses and ideas. There are also other tools that might be useful for us:

ORA: For visualization and analysis purposes, we will try using ORA (18). ORA is an interactive network analysis tool that maintains the internal structure of an organization or social network as a set of agents, tasks, and resources, and identifies the relationships among them. However, since it is an interactive tool, we are not sure whether we will be able to open the citation data with 350K entries in it or visualize something meaningful with it. This is subject to change; we might also look for other visualization tools designed to handle a larger number of nodes.

AUTOMAP: AutoMap is a software tool used to perform Semantic Network Analysis (19). It can work in batch mode, and it analyzes the existence, frequencies, and covariance of terms and themes. Since we also have the LaTeX source files for the papers, depending on how much time we have and on the final algorithm we want to implement, this tool might be useful if we want to look at idea-to-idea or idea-to-people networks in addition to people-to-people networks.
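Independently of any particular tool, the kind of term-frequency analysis that could support hot topic detection can be sketched in a few lines. The example below counts terms per year in paper abstracts and reports the terms whose counts grew the most between two years. It does not use or imitate AutoMap; the tiny stopword list, the function names, and the sample abstracts are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (assumption; a real list would be longer).
STOPWORDS = {"the", "of", "a", "in", "and", "we", "to", "is", "for", "on"}


def term_counts_by_year(abstracts):
    """abstracts: iterable of (year, text). Returns {year: Counter of terms}."""
    counts = {}
    for year, text in abstracts:
        words = re.findall(r"[a-z]+", text.lower())
        kept = (w for w in words if w not in STOPWORDS)
        counts.setdefault(year, Counter()).update(kept)
    return counts


def rising_terms(counts, prev_year, year, top=3):
    """Terms whose count grew the most from prev_year to year."""
    previous = counts.get(prev_year, Counter())
    growth = Counter()
    for term, n in counts[year].items():
        delta = n - previous[term]  # Counter returns 0 for unseen terms
        if delta > 0:
            growth[term] = delta
    return growth.most_common(top)


# Hypothetical abstracts.
abstracts = [
    (1996, "string theory and supergravity"),
    (1997, "branes in string theory"),
    (1997, "branes and duality of branes"),
]
counts = term_counts_by_year(abstracts)
print(rising_terms(counts, 1996, 1997, top=1))
```

Raw counts favor years with more papers, so normalizing by the number of papers per year would be a natural refinement.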

REFERENCES

1. Networks of Scientific Papers. Price, Derek J. de Solla. s.l. : Science, 1965, Vol. 149.

2. Co-citation in the Scientific Literature: A New Measure of the Relationship Between Two Documents. Small, Henry. s.l. : Journal of the American Society for Information Science, 1973, Vol. 24.

3. CiteSeer: An Automatic Citation Indexing System. Giles, C.L. and Bollacker, K.D. and Lawrence, S. 1998 : Proceedings of the 3rd ACM Conference on Digital Libraries.

4. Wikipedia. [Online] click here.

5. S.Wasserman, K.Faust. Social Network Analysis. Cambridge : Cambridge University Press, 1994.

6. The Small World. (Ed.), M.Kochen. NJ : Ablex, Norwood, 1989.

7. Evolution of the social network of scientific collaborations. A.L.Barabasi, H.Jeong, Z.Neda, E.Ravasz. 590-614, s.l. : Physica, 2002, Vol. 311.

8. R.Albert, H.Jeong, A.L.Barabasi. s.l. : Nature, 1999, Vol. 400.

9. S.Lawrence, C.L.Giles. s.l. : Science, 1998, Vol. 280.

10. Citation Prediction Using Time Series Approach KDD Cup 2003 (Task 1). Manjunatha, J. N. and Pandey, Raghavendra and Sivaramakrishnan, R. and Murty, Narasimha. Washington, DC : SIGKDD, 2003.

11. Predicting Citation Rates for Physics Papers: Constructing Features for an Ordered Probit Model. Perlich, Claudia and Provost, Foster and Macskassy, Sofus. Washington, DC : SIGKDD, 2003.

12. The Download Estimation Task on KDD Cup 2003. Brank, Janez and Leskovec, Jure. Washington, DC : SIGKDD, 2003.

13. Batagelj, Vladimir. Efficient Algorithms for Citation Network Analysis. Ljubljana, Slovenia : University of Ljubljana, Department of Mathematics, 2003.

14. Detecting Topic Evolution in Scientific Literature: How Can Citations Help? Qi He, Bi Chen, Jian Pei, Baojun Qiu, Prasenjit Mitra, C. Lee Giles. Hong Kong, China : CIKM, 2009.

15. The structural influence of marketing journals: A citation analysis of the discipline and its subareas over time. Baumgartner, H. and Pieters, R. s.l. : Journal of Marketing, 2003, pp. 123-139.

16. Large-scale Structure of Time Evolving Citation Networks. Leicht, EA and Clarkson, G. and Shedden, K. and Newman, M.E.J. 1, s.l. : The European Physical Journal B-Condensed Matter and Complex Systems, 2007, Vol. 59.

17. Evolution of the Social Network of Scientific Collaborations. Barabasi, A.L. and Jeong, H. and Neda, Z. and Ravasz, E. and Schubert, A. and Vicsek, T. 3-4, s.l. : Physica A: Statistical Mechanics and its Applications, 2002, Vol. 311.

18. Carley, K.M. and Reminga, J. and Storrick, J. and Columbus, D. CMU-ISR-10-120 ORA User's Guide 2010. Pittsburgh : Carnegie Mellon University, 2010.

19. Carley, K.M. and Columbus, D. and Bigrigg, M. and Kunkel, F. AutoMap User's Guide 2010. Pittsburgh : Carnegie Mellon University, 2010.
