Miray Dongyang Niting project proposal
Email: 1. mkas@andrew.cmu.edu 2. nqi@andrew.cmu.edu 3. dongyant@andrew.cmu.edu
PDF version of this proposal is available File:Proposal.pdf
Contents
ABSTRACT
Staggering growth of availability of electronic resources over the Internet enables rapid dissemination of the ideas and changes in the trends and the interaction patterns. In this project, we plan to investigate how a specific research area (e.g. high-energy physics) changes over time. More specifically, we plan to explore application of Discrete Fourier Transform (DFT) on citation networks and evaluate its potential in predicting the future of a particular research area looking at the changes in the currently available data.
CATEGORIES AND SUBJUCT DESCRIPTORS
[Social Networks]: Dynamic Network Models, Network Evolution, Citation Analysis.
GENERAL TERMS
Algorithms, Sociology, Signal Processing.
KEYWORDS
Social Networks, Dynamic Network Models, Network Evolution, Citation Analysis, Pattern Matching.
INTRODUCTION
Study of networks, including social networks, biological networks, information works, and many other kinds, is always a focus topic in scientific research. Many of the obtained results are not only solutions for problems in the field of networks, but they are also applicable to other fields. Citation networks, which are the principal focus of this paper, have been studied quantitatively almost from the moment citation databases first became available. In 1965, Derek J. de Solla Price described the inherent linking characteristic of the SCI in his paper titled "Networks of Scientific Papers" (1). The links between citing and cited papers became dynamic when the SCI began to be published online. In 1973, Henry Small published his classic work on co-citation analysis (2) which became a self-organizing classification system that led to document clustering experiments and eventually what is later called "Research Reviews".
Autonomous citation indexing was introduced in 1998 by Giles, Lawrence and Bollacker (3), enabling automated extraction and grouping of citations for academic/scientific documents. While previous citation extraction was a manual process, citation measures now could scale up and be computed for any scholarly and scientific field, not just those selected by organizations. Further information on the field history can be found at (4).
Nowadays, many innovations and new research areas emerged from existing references. Creation of a specific research topic as well as the development of it can be traced by mining paper citations. The changes of citations shows how a research topic evolves over times and also help us understand the lineage of topics. Citation networks can also help researchers identify topics that are related to a specific research topic.
Since the interactions among the researchers of a particular scientific field and idea propagation within the field involves a lot more than a bag of buzz words, a comprehensive technique should be developed to observe the evolution of a specific research area, further to predict the trend of research.
In this project, we plan to apply DFT on citation networks to filter out some “noisy” papers which have no contribution to the result of dataset analysis and prediction. Combining the DFT in analyzing a large-scale network can lead to a more valuable result. Since some “noisy” papers have been filter out before the analysis, this technique can be regarded as a pre-prediction process.
DATASET
KDD Cup 2003 Dataset
For our project, we plan to use KDD Cup 2003 dataset. The data is publicly available online: click here to access KDD Cup 2003 Dateset The data consists of roughly 29,000 papers in arXiv (1993-2003), within the field of high-energy physics. Each paper has a unique id (a random between 1 and 100,000). The data is structured as follows: (i) LaTeX sources of each paper (classified by year), (ii) abstract of each paper, (iii) SLAC dates of each paper, and (iv) citation graph data in the form of (citing_paper_id, cited_paper_id). The citation graph has roughly 342K entries. The citation graph does not contain any information about the citations that are not covered by the dataset.
PROPOSED WORK
Social networks have been much studied in social sciences (5) (6). The general features of these studies is that they are often restricted to small systems, and often consider the networks as static graphs, whose nodes represent individuals while links represent their social interactions.
In contrast, we plan to take a different and complementary way to analyze social networks, which are neither static nor small systems. KDD Cup 2003 dataset will enable us to build an evolving citation network, which will be expanded constantly by the addition of new papers, as well as the addition of new citation links between papers.
It is obvious that this network will be extremely large and complicated even after a short period of time, especially when a new research area is developed. Generally, the topological properties of this network are determined by dynamical growth processes (7). The information embedded in the growth history of the network can be used to predict its future.
Since KDD Cup 2003 dataset is already broken into years, it is also possible to get yearly or even more frequent (monthly or quarterly) discrete snapshots of the network which makes it possible to apply digital signal processing techniques like Discrete Fourier Transform (DFT) on this dataset.
DFT is a specific kind of discrete transform, used in Fourier analysis. In particular, the DFT is widely employed in signal processing and related fields to analyze the frequencies contained in a sampled signal, to solve partial differential equations, and to perform other operations such as convolutions or multiplying large integers. It transforms one function (which is often a function in the time domain) into another, which is called the frequency domain representation. DFT requires an input function that is discrete and whose non-zero values have a limited (finite) duration. If we treat citations to a paper as different signals/samples, it becomes applicable to citation networks as well.
In our dataset, the citation number of each paper is discrete on timeline. It can be represented as follows: time-domain sampling number is n, while T represents month or year. Its DFT is:
Different papers in the citation network might show different behavior about the number of citations they receive over time. For instance, a seminal paper might be continuously receiving citations while another paper might be hot topic for a number of years and highly cited within those years, not receiving further citations after a while. Similarly, a paper may always remain as an incremental contribution to the field, not being of significant interest at any point in time.
With the application of DFT over multiple snapshots of the citation network, we intend to differentiate between the papers that belong to different categories. If the DFT of the number of citation is mainly in the low frequency, it is highly possible that the citation number of this paper does not change much during 10 years (the dataset contains papers from 1993-2003). So we can predict that the citation number of the paper will remain the same during the next period. On the contrary, if the DFT of the number of citation has many high frequency components, we can predict that the paper will receive more citations in the future.
RELATED WORK
In this section, we provide a brief overview of the related work we could identify so far. We break down the related work section mainly into two parts, where we first discuss papers that use the same dataset (KDD Cup 2003) and then papers in closer research areas. This covers two aspects: work done in the field of citation networks, and work done in the field of dynamic network analysis and network evolution.
In KDD Cup competition, there were mainly three tasks evaluated on this dataset: (i) predicting how the number of citations to each paper in the dataset will change over time, (ii) extracting useful data from a huge set of source/text files (i.e. data cleaning), and (iii) estimating the number of downloads a paper receives in its first two months after it is uploaded to arXiv.
In KDD Cup 2003, for citation prediction task, the method used by the winner team includes conversion of data into a time series and applying regression analysis on it (8). (9) also focuses on time series conversion as their first step. The authors of (9) comment further on the factors that affect citations received by a paper such as the reputation of the authors, publishing seasons (related to the overhead of academic year or conferences), and hot topics in the field. For download estimation tasks, the winner team focused on an extension of bag-of-words approach, using linear regression as the learning algorithm (10).
Other than the work performed using this specific dataset, there are other papers that investigate citation networks and evolution of networks, including citation networks and other types of networks.
For instance, there are various studies that investigate different properties of citation networks such as small world and couplings along with many other properties. Another interesting problem in citation networks is to understand how topics evolve over time and how this can be detected using citation networks (11). (12) introduces a model of network evolution that gives a more realistic description of the local processes, incorporating the addition of new nodes, new links, and the rewiring of links. The prediction fits the distribution of the network well. One other paper based on the evolution of citation networks investigates the influence of marketing journals in subfields of marketing, analyzing which journals emerge as the most influential ones and how this changes over time (13). (14) shows a method in finding similarity data in a time-sequenced dataset by using DFT. This method is classic since it uses signal processing technique in analyzing the dataset at an early time.
(15) investigates the graphical structure of the large-scale time evolving citation networks using three different techniques of analysis (i.e., probabilistic mixture model using an expectation–maximization algorithm, modularity-maximization based network clustering method, and analysis of how eigenvector centrality scores vary over time). As one final example, (16) studies the evolution of social networks in scientific collaboration (co-authorship) networks. (16) provides many interesting results that confirm scale-free topology of co-authorship networks which are governed by preferential attachment rules in their growth.
METHODOLOGY/TOOLS
For the coding part, we plan to write scripts for reading/parsing/converting the available datasets. For now, we are planning to use matlab for our analysis/idea coding. There are also other tools that might be useful for us:
ORA: For visualization and analysis purposes, we will try using ORA (18). ORA is an interactive network analysis tool that maintains the internal structure of an organization/social network as a set of agents, tasks, resources and identifies the relationships among them. However, since it is an interactive tool, we are not sure whether we will be able to open up the citation data with 350K entries in it or visualize something meaningful using it. This is subject to change; we might also look for other visualization tools customized for handling larger number of nodes.
AUTOMAP: Automap is a software tool which is used to perform Semantic Network Analysis (19). It can work in the batch ode and it analyzes existence, frequencies, and covariance of terms and themes. Since we also have the LaTeX source files for the papers, depending on how much time we have, and the final algorithm we want to implement, this tool might be useful if we want to look at idea-to-idea or idea-to-people type of networks in addition to people-to-people networks.
REFERENCES
1. Networks of Scientific Paper. Price, Derek J. de Solla. s.l. : Science, 1965, Vol. 149.
2. Co-citation in the Scientific Literature: A New Measure of the Relationship Between Two Documents. Small, Henry. s.l. : Journal of the American Society for Information Science, 1973, Vol. 24.
3. CiteSeer: An Automatic Citation Indexing System. Giles, C.L. and Bollacker, K.D. and Lawrence, S. 1998 : Proceedings of the 3rd ACM Conference on Digital Libraries.
4. Wikipedia. [Online] click here.
5. S.Wasserman, K.Faust. Social Network Analysis. Cambridge : Cambridge University Press, 1994.
6. The Small World. (Ed.), M.Kochen. NJ : Ablex, Norwood, 1989.
7. Evolution of the social network of scientific collaborations. A.L.Barabasi, H.Jeong, Z.Neda, E.Ravasz. 690-614, s.l. : Physica, 2002, Vol. 311.
8. Citation Prediction Using Time Series Approach KDD Cup 2003 (Task 1). Manjunatha, J. N. and Pandey, Raghavendra and Sivaramakrishnan, R. and Murty, Narasimha. Washington, DC : SIGKDD, 2003.
9. Predicting Citation Rates for Physics Papers: Constructing Features for an Ordered Probit Model. Mackassy, Claudia Perlich and Foster Provost and Sofus. Washington, DC. : SIGKDD, 2003.
10. The Download Estimation Task on KDD Cup 2003. Leskovic, Janez Brank and Jure. Washington, DC. : SIGKDD, 2003.
11. Detecting Topic Evolution in Scientific Literature: How Can Citations Help? Qi He, Bi Chen, Jian Pei, Baojun Qiu, Prasenjit Mitra, C. Lee Giles. Hong Kong, China : CIKM, 2009.
12. Barabási, Réka Albert and Albert-László. Topology of Evolving Networks: Local Events and Universality. PHYSICAL REVIEW LETTERS. 1999, Vol. 85.
13. The structural influence of marketing journals: A citation analysis of the discipline and its subareas over time. Baumgartner, H. and Pieters, R. s.l. : Journal of Marketing, 2003, Vols. 123--139.
14. Efficient Similarity Search In Sequence. Faloutsos, Rakesh Agrawal and Christos. San Jose : s.n., 1993.
15. Large-scale Structure of Time Evolving Citation Networks. Leicht, EA and Clarkson, G. and Shedden, K. and Newman, M.E.J. 1, s.l. : The European Physical Journal B-Condensed Matter and Complex Systems, 2007, Vol. 59.
16. Evolution of the Social Network of Scientific Collaborations. Barabasi, A.L. and Jeong, H. and Neda, Z. and Ravasz, E. and Schubert, A. and Vicsek, T. 3-4, s.l. : Physica A: Statistical Mechanics and its Applications, 2002, Vol. 311.
17. Carley, K.M. and Reminga, J. and Storrick, J. and Columbus, D. CMU-ISR-10-120 ORA user's Guide 2010. Pittsburgh : Carnegie Mellon University. , 2010.
18. Carley, K.M. and Columbus, D. and Bigrigg, M. and Kunkel, F. AutoMap User's Guide 2010. Pittsburgh : Carnegie Mellon University, 2010.
19. Batagelj, Vladimir. Efficient Algorithms for Citation Network Analysis. Ljubljana, Slovenia : University of Ljubljana, Department of Mathematics, 2003.
20. R.Albert, H.jeong, A.L. Barabasi. s.l. : Nature, 1999, Vol. 400.
21. Giles. S.Lawrence, C.L. s.l. : Science, 1998, Vol. 280.