Difference between revisions of "Miray Dongyang Niting project proposal"

From Cohen Courses
Jump to navigationJump to search
 
(13 intermediate revisions by 3 users not shown)
Line 1: Line 1:
 
                                       <center>'''Citation Network Evolution'''</center>     
 
                                       <center>'''Citation Network Evolution'''</center>     
                                       <center>Miray Kas, Niting Qi, Dongyang Teng</center>
+
                                       <center>Miray Kas, Niting Qi, Dongyang Teng</center>
 
                                         <center>Electrical & Computer Engineering</center>
 
                                         <center>Electrical & Computer Engineering</center>
 
                                             <center>Carnegie Mellon University</center>
 
                                             <center>Carnegie Mellon University</center>
Line 11: Line 11:
 
== '''ABSTRACT''' ==
 
== '''ABSTRACT''' ==
  
Staggering growth of availability of electronic resources over the Internet enables rapid dissemination of the ideas and changes in the trends and the interaction patterns. In this project, we plan to investigate how a specific research area (e.g. high-energy physics) changes over time. Our goal is to do research on techniques that might be useful in predicting the future of a particular research area looking at the changes in the currently available data.
+
Staggering growth of availability of electronic resources over the Internet enables rapid dissemination of the ideas and changes in the trends and the interaction patterns. In this project, we plan to investigate how a specific research area (e.g. high-energy physics) changes over time. More specifically, we plan to explore application of Discrete Fourier Transform (DFT) on citation networks and evaluate its potential in predicting the future of a particular research area looking at the changes in the currently available data.
  
=== '''CATEGORIES AND SUBJUCT DESCRIPTORS'''===
+
''' CATEGORIES AND SUBJUCT DESCRIPTORS '''
  
 
[Social Networks]: Dynamic Network Models, Network Evolution, Citation Analysis.
 
[Social Networks]: Dynamic Network Models, Network Evolution, Citation Analysis.
  
=== '''GENERAL TERMS''' ===
+
'''GENERAL TERMS'''  
  
Algorithms, Sociology.
+
Algorithms, Sociology, Signal Processing.
  
=== '''KEYWORDS''' ===
+
'''KEYWORDS'''  
  
 
Social Networks, Dynamic Network Models, Network Evolution, Citation Analysis, Pattern Matching.
 
Social Networks, Dynamic Network Models, Network Evolution, Citation Analysis, Pattern Matching.
  
=== '''INTRODUCTION ''' ===
+
== '''INTRODUCTION ''' ==
  
 
Study of networks, including social networks, biological networks, information works, and many other kinds, is always a focus topic in scientific research. Many of the obtained results are not only solutions for problems in the field of networks, but they are also applicable to other fields.  
 
Study of networks, including social networks, biological networks, information works, and many other kinds, is always a focus topic in scientific research. Many of the obtained results are not only solutions for problems in the field of networks, but they are also applicable to other fields.  
Line 34: Line 34:
 
Nowadays, many innovations and new research areas emerged from existing references. Creation of a specific research topic as well as the development of it can be traced by mining paper citations. The changes of citations shows how a research topic evolves over times and also help us understand the lineage of topics. Citation networks can also help researchers identify topics that are related to a specific research topic.
 
Nowadays, many innovations and new research areas emerged from existing references. Creation of a specific research topic as well as the development of it can be traced by mining paper citations. The changes of citations shows how a research topic evolves over times and also help us understand the lineage of topics. Citation networks can also help researchers identify topics that are related to a specific research topic.
  
Since a research paper contains more information than a bag of words, a comprehensive model should be developed to observe the evolution of a specific research area, further to predict the trend of research.
+
Since the interactions among the researchers of a particular scientific field and idea propagation within the field involves a lot more than a bag of buzz words, a comprehensive technique should be developed to observe the evolution of a specific research area, further to predict the trend of research.
 +
 
 +
In this project, we plan to apply DFT on citation networks to filter out some “noisy” papers which have no contribution to the result of dataset analysis and prediction. Combining the DFT in analyzing a large-scale network can lead to a more valuable result. Since some “noisy” papers have been filter out before the analysis, this technique can be regarded as a pre-prediction process.
  
 
== '''DATASET''' ==
 
== '''DATASET''' ==
  
  
'''2.1 KDD Cup 2003 Dataset'''
+
===''' KDD Cup 2003 Dataset'''===
  
 
For our project, we plan to use KDD Cup 2003 dataset. The data is publicly available online: [http://www.cs.cornell.edu/projects/kddcup/datasets.html click here to access KDD Cup 2003 Dateset]  
 
For our project, we plan to use KDD Cup 2003 dataset. The data is publicly available online: [http://www.cs.cornell.edu/projects/kddcup/datasets.html click here to access KDD Cup 2003 Dateset]  
 
The data consists of roughly 29,000 papers in arXiv (1993-2003), within the field of high-energy physics. Each paper has a unique id (a random between 1 and 100,000). The data is structured as follows: (i) LaTeX sources of each paper (classified by year), (ii) abstract of each paper, (iii) SLAC dates of each paper, and (iv) citation graph data in the form of (citing_paper_id, cited_paper_id). The citation graph has roughly 342K entries. The citation graph does not contain any information about the citations that are not covered by the dataset.
 
The data consists of roughly 29,000 papers in arXiv (1993-2003), within the field of high-energy physics. Each paper has a unique id (a random between 1 and 100,000). The data is structured as follows: (i) LaTeX sources of each paper (classified by year), (ii) abstract of each paper, (iii) SLAC dates of each paper, and (iv) citation graph data in the form of (citing_paper_id, cited_paper_id). The citation graph has roughly 342K entries. The citation graph does not contain any information about the citations that are not covered by the dataset.
  
== '''IDEAS''' ==
+
== '''PROPOSED WORK''' ==
 
 
 
 
=== '''1 General Idea''' ===
 
  
  
 
Social networks have been much studied in social sciences (5) (6). The general features of these studies is that they are often restricted to small systems, and often consider the networks as static graphs, whose nodes represent individuals while links represent their social interactions.
 
Social networks have been much studied in social sciences (5) (6). The general features of these studies is that they are often restricted to small systems, and often consider the networks as static graphs, whose nodes represent individuals while links represent their social interactions.
  
In contrast, we plan to take a different and complementary way to analyze social networks, which are neither static nor small systems. KDD Cup 2003 dataset will enable us to build a citation network, which will be expanded constantly by the addition of new authors, as well as the addition of new links between authors. It is obvious that this network will be extremely large and complicated even after a short period of time, especially when a new research area is developed. Generally, the topological properties of this network are determined by dynamical growth processes (7). Consequently, in order to figure out its topology, understanding the dynamical process that determines its evolution is crucial. Then it is possible to construct a model which has dynamical features of the citation network. This model can also be used to do certain prediction.
+
In contrast, we plan to take a different and complementary way to analyze social networks, which are neither static nor small systems. KDD Cup 2003 dataset will enable us to build an evolving citation network, which will be expanded constantly by the addition of new papers, as well as the addition of new citation links between papers.  
 
 
So, there are a few ideas which might be useful in predicting the future of a particular research area:
 
  
First, it is possible to parse this dataset and build up yearly snapshots of the semantic (conceptual) network. This kind of network might be interesting for “hot topic detection”, looking at the concepts that frequently appear on papers and/or citations papers on a particular topic (e.g. high energy physics) receive over time.  
+
It is obvious that this network will be extremely large and complicated even after a short period of time, especially when a new research area is developed. Generally, the topological properties of this network are determined by dynamical growth processes (7). The information embedded in the growth history of the network can be used to predict its future.
  
Second, since KDD Cup 2003 dataset is already broken into years, it is also possible to get yearly or even more frequent snapshots of the network. It might also be possible to identify how the key actors in a citation network change over time.
+
Since KDD Cup 2003 dataset is already broken into years, it is also possible to get yearly or even more frequent (monthly or quarterly) discrete snapshots of the network which makes it possible to apply digital signal processing techniques like Discrete Fourier Transform (DFT) on this dataset.  
  
=== '''2 Expanded application''' ===
+
DFT is a specific kind of discrete transform, used in Fourier analysis. In particular, the DFT is widely employed in signal processing and related fields to analyze the frequencies contained in a sampled signal, to solve partial differential equations, and to perform other operations such as convolutions or multiplying large integers. It transforms one function (which is often a function in the time domain) into another, which is called the frequency domain representation. DFT requires an input function that is discrete and whose non-zero values have a limited (finite) duration. If we treat citations to a paper as different signals/samples, it becomes applicable to citation networks as well.
  
 +
In our dataset, the citation number of each paper is discrete on timeline. It can be represented as follows: <math>x_{discrete}\left ( t \right )= \sum_{n=0}^{N-1}x\left ( nT \right )\delta \left ( t-nT \right )</math>
 +
time-domain sampling number is n, while T represents month or year. Its DFT is: <math>X_{k}= \sum_{n=0}^{N-1}x_{n}e^{-\frac{2\pi i}{N}kn}</math>
  
It is important to emphasize that the properties of citation networks are not unique to them. The ideas that we intend to develop through this course project might be applicable to various types of networks.  
+
Different papers in the citation network might show different behavior about the number of citations they receive over time. For instance, a seminal paper might be continuously receiving citations while another paper might be hot topic for a number of years and highly cited within those years, not receiving further citations after a while. Similarly, a paper may always remain as an incremental contribution to the field, not being of significant interest at any point in time.  
  
For instance, the WWW is also a complex evolving network, where nodes and links are added (and removed) at a very high rate, so its network topology is also profoundly determined by these dynamical features (8), (9).  It is natural that we may apply the methods used in citation network to analyze WWW. To exemplify a specific case, it is possible figure out the hottest websites among certain group of people (e.g. teenagers), thus it’s easy to know their needs and better satisfy them. The methods are also useful in predicting the trend of market in financial field. In addition, the development trend of a company can be predicted by analyzing the news and customers related to this company.
+
With the application of DFT over multiple snapshots of the citation network, we intend to differentiate between the papers that belong to different categories. If the DFT of the number of citation is mainly in the low frequency, it is highly possible that the citation number of this paper does not change much during 10 years (the dataset contains papers from 1993-2003). So we can predict that the citation number of the paper will remain the same during the next period. On the contrary, if the DFT of the number of citation has many high frequency components, we can predict that the paper will receive more citations in the future.
  
 
== ''' RELATED WORK ''' ==
 
== ''' RELATED WORK ''' ==
  
 
In this section, we provide a brief overview of the related work we could identify so far. We break down the related work section mainly into two parts, where we first discuss papers that use the same dataset (KDD Cup 2003) and then papers in closer research areas. This covers two aspects: work done in the field of citation networks, and work done in the field of dynamic network analysis and network evolution.  
 
In this section, we provide a brief overview of the related work we could identify so far. We break down the related work section mainly into two parts, where we first discuss papers that use the same dataset (KDD Cup 2003) and then papers in closer research areas. This covers two aspects: work done in the field of citation networks, and work done in the field of dynamic network analysis and network evolution.  
 
+
In KDD Cup competition, there were mainly three tasks evaluated on this dataset: (i) predicting how the number of citations to each paper in the dataset will change over time, (ii) extracting useful data from a huge set of source/text files (i.e. data cleaning), and (iii) estimating the number of downloads a paper receives in its first two months after it is uploaded to arXiv.  
In KDD Cup competition, there were mainly three tasks evaluated on this dataset: (i) predicting how the number of citations to each paper in the dataset will change over time, (ii) extracting useful data from a huge set of source/text files (i.e. data cleaning), and (iii) estimating the number of downloads a paper  
+
In KDD Cup 2003, for citation prediction task, the method used by the winner team includes conversion of data into a time series and applying regression analysis on it (8). (9) also focuses on time series conversion as their first step. The authors of (9) comment further on the factors that affect citations received by a paper such as the reputation of the authors, publishing seasons (related to the overhead of academic year or conferences), and hot topics in the field. For download estimation tasks, the winner team focused on an extension of bag-of-words approach, using linear regression as the learning algorithm (10).
receives in its first two months after it is uploaded to arXiv.  
 
 
 
In KDD Cup 2003, for citation prediction task, the method used by the winner team includes conversion of data into a time series and applying regression analysis on it (10). (11) also focuses on time series conversion as their first step. The authors of (11) comment further on the factors that affect citations received by a paper such as the reputation of the authors, publishing seasons (related to the overhead of academic year or conferences), and hot topics in the field. For download estimation tasks, the winner team focused on an extension of bag-of-words approach, using linear regression as the learning algorithm (12).
 
 
 
 
Other than the work performed using this specific dataset, there are other papers that investigate citation networks and evolution of networks, including citation networks and other types of networks.
 
Other than the work performed using this specific dataset, there are other papers that investigate citation networks and evolution of networks, including citation networks and other types of networks.
  
For instance, there are various studies that investigate different properties of citation networks such as small world and couplings along with many other properties. For example, (13) investigates non-acyclicity problem in citation networks. Large strongly connected components are very likely to indicate an error in the collected citation data since citation networks are almost acyclic. Strong components might occur due to multiple versions of the same paper being available, and this needs to be detected and eliminated.
+
For instance, there are various studies that investigate different properties of citation networks such as small world and couplings along with many other properties. Another interesting problem in citation networks is to understand how topics evolve over time and how this can be detected using citation networks (11). (12) introduces a model of network evolution that gives a more realistic description of the local processes, incorporating the addition of new nodes, new links, and the rewiring of links. The prediction fits the distribution of the network well. One other paper based on the evolution of citation networks investigates the influence of marketing journals in subfields of marketing, analyzing which journals emerge as the most influential ones and how this changes over time (13). (14) shows a method in finding similarity data in a time-sequenced dataset by using DFT. This method is classic since it uses signal processing technique in analyzing the dataset at an early time.  
Another interesting problem in citation networks is to understand how topics evolve over time and how this can be detected using citation networks (14). One other paper based on the evolution of citation networks investigates the influence of marketing journals in subfields of marketing, analyzing which journals emerge as the most influential ones and how this changes over time (15).  
 
 
 
(16) investigates the graphical structure of the large-scale time evolving citation networks using three different techniques of analysis (i.e., probabilistic mixture model using an expectation–maximization algorithm, modularity-maximization based network clustering method, and analysis of how eigenvector centrality scores vary over time). As one final example, (17) studies the evolution of social networks in scientific collaboration (co-authorship) networks. (17) provides many interesting results that confirm scale-free topology of co-authorship networks which are governed by preferential attachment rules in their growth.  
 
  
 +
(15) investigates the graphical structure of the large-scale time evolving citation networks using three different techniques of analysis (i.e., probabilistic mixture model using an expectation–maximization algorithm, modularity-maximization based network clustering method, and analysis of how eigenvector centrality scores vary over time). As one final example, (16) studies the evolution of social networks in scientific collaboration (co-authorship) networks. (16) provides many interesting results that confirm scale-free topology of co-authorship networks which are governed by preferential attachment rules in their growth.
  
 
== ''' METHODOLOGY/TOOLS ''' ==
 
== ''' METHODOLOGY/TOOLS ''' ==
  
For the coding part, we plan to write scripts for reading/parsing/converting the available datasets. For now, we are planning to use matlab for our analysis/idea coding.
+
For the coding part, we plan to write scripts to read, parse, and convert the available data in to the form we can use in matlab. For now, we are planning to use matlab for our analysis/idea coding.  
There are also other tools that might be useful for us:
 
 
 
'''ORA''': For visualization and analysis purposes, we will try using ORA (18). ORA is an interactive network analysis tool that maintains the internal structure of an organization/social network as a set of agents, tasks, resources and identifies the relationships among them. However, since it is an interactive tool, we are not sure whether we will be able to open up the citation data with 350K entries in it or visualize something meaningful using it. This is subject to change; we might also look for other visualization tools customized for handling larger number of nodes.
 
 
 
'''AUTOMAP''': Automap is a software tool which is used to perform Semantic Network Analysis (19). It can work in the batch ode and it analyzes existence, frequencies, and covariance of terms and themes. Since we also have the LaTeX source files for the papers, depending on how much time we have, and the final algorithm we want to implement, this tool might be useful if we want to look at idea-to-idea or idea-to-people type of networks in addition to people-to-people networks.  
 
  
 +
In addition, for visualization and analysis purposes, we will try using ORA (17). ORA is an interactive network analysis tool that maintains the internal structure of an organization/social network as a set of agents, tasks, and resources if applicable and identifies the relationships, key actors among them. However, since it is an interactive tool, we are not sure whether we will be able to open up the citation data with 350K entries in it or visualize something meaningful (i.e. something which does not look like a ball of yarn). This is subject to change; we might also look for other visualization tools customized for handling larger number of nodes.
  
 
== ''' REFERENCES''' ==
 
== ''' REFERENCES''' ==
Line 111: Line 100:
 
690-614, s.l. : Physica, 2002, Vol. 311.
 
690-614, s.l. : Physica, 2002, Vol. 311.
  
8. R.Albert, H.jeong, A.L. Barabasi. s.l. : Nature, 1999, Vol. 400.
+
8. Citation Prediction Using Time Series Approach KDD Cup 2003 (Task 1). Manjunatha, J. N. and Pandey, Raghavendra and Sivaramakrishnan, R. and Murty, Narasimha. Washington, DC : SIGKDD, 2003.
 +
 
 +
9. Predicting Citation Rates for Physics Papers: Constructing Features for an Ordered Probit Model. Mackassy, Claudia Perlich and Foster Provost and Sofus. Washington, DC. : SIGKDD, 2003.
 +
 
 +
10. The Download Estimation Task on KDD Cup 2003. Leskovic, Janez Brank and Jure. Washington, DC. : SIGKDD, 2003.
  
9. Giles. S.Lawrence, C.L. s.l. : Science, 1998, Vol. 280.
+
11. Detecting Topic Evolution in Scientific Literature: How Can Citations Help? Qi He, Bi Chen, Jian Pei, Baojun Qiu, Prasenjit Mitra, C. Lee Giles. Hong Kong, China : CIKM, 2009.
  
10. Citation Prediction Using Time Series Approach KDD Cup 2003 (Task 1). Manjunatha, J. N. and Pandey, Raghavendra and Sivaramakrishnan, R. and Murty, Narasimha. Washington, DC : SIGKDD, 2003.
+
12. Barabási, Réka Albert and Albert-László. Topology of Evolving Networks: Local Events and Universality. PHYSICAL REVIEW LETTERS. 1999, Vol. 85.
  
11. Predicting Citation Rates for Physics Papers: Constructing Features for an Ordered Probit Model. Mackassy, Claudia Perlich and Foster Provost and Sofus. Washington, DC. : SIGKDD, 2003.
+
13. The structural influence of marketing journals: A citation analysis of the discipline and its subareas over time. Baumgartner, H. and Pieters, R. s.l. : Journal of Marketing, 2003, Vols. 123--139.
  
12. The Download Estimation Task on KDD Cup 2003. Leskovic, Janez Brank and Jure. Washington, DC. : SIGKDD, 2003.
+
14. Efficient Similarity Search In Sequence. Faloutsos, Rakesh Agrawal and Christos. San Jose : s.n., 1993.
  
13. Batagelj, Vladimir. Efficient Algorithms for Citation Network Analysis. Ljubljana, Slovenia : University of Ljubljana, Department of Mathematics, 2003.
+
15. Large-scale Structure of Time Evolving Citation Networks. Leicht, EA and Clarkson, G. and Shedden, K. and Newman, M.E.J. 1, s.l. : The European Physical Journal B-Condensed Matter and Complex Systems, 2007, Vol. 59.
  
14. Detecting Topic Evolution in Scientific Literature: How Can Citations Help? Qi He, Bi Chen, Jian Pei, Baojun Qiu, Prasenjit Mitra, C. Lee Giles. Hong Kong, China : CIKM, 2009.
+
16. Evolution of the Social Network of Scientific Collaborations. Barabasi, A.L. and Jeong, H. and Neda, Z. and Ravasz, E. and Schubert, A. and Vicsek, T. 3-4, s.l. : Physica A: Statistical Mechanics and its Applications, 2002, Vol. 311.
  
15. The structural influence of marketing journals: A citation analysis of the discipline and its subareas over time. Baumgartner, H. and Pieters, R. s.l. : Journal of Marketing, 2003, Vols. 123--139.
+
17. Carley, K.M. and Reminga, J. and Storrick, J. and Columbus, D. CMU-ISR-10-120 ORA user's Guide 2010. Pittsburgh : Carnegie Mellon University. , 2010.
  
16. Large-scale Structure of Time Evolving Citation Networks. Leicht, EA and Clarkson, G. and Shedden, K. and Newman, M.E.J. 1, s.l. : The European Physical Journal B-Condensed Matter and Complex Systems, 2007, Vol. 59.
+
18. Carley, K.M. and Columbus, D. and Bigrigg, M. and Kunkel, F. AutoMap User's Guide 2010. Pittsburgh : Carnegie Mellon University, 2010.
  
17. Evolution of the Social Network of Scientific Collaborations. Barabasi, A.L. and Jeong, H. and Neda, Z. and Ravasz, E. and Schubert, A. and Vicsek, T. 3-4, s.l. : Physica A: Statistical Mechanics and its Applications, 2002, Vol. 311.
+
19. Batagelj, Vladimir. Efficient Algorithms for Citation Network Analysis. Ljubljana, Slovenia : University of Ljubljana, Department of Mathematics, 2003.
  
18. Carley, K.M. and Reminga, J. and Storrick, J. and Columbus, D. CMU-ISR-10-120 ORA user's Guide 2010. Pittsburgh : Carnegie Mellon University. , 2010.
+
20. R.Albert, H.jeong, A.L. Barabasi. s.l. : Nature, 1999, Vol. 400.
  
19. Carley, K.M. and Columbus, D. and Bigrigg, M. and Kunkel, F. AutoMap User's Guide 2010. Pittsburgh : Carnegie Mellon University, 2010.
+
21. Giles. S.Lawrence, C.L. s.l. : Science, 1998, Vol. 280.

Latest revision as of 23:18, 14 February 2011

Citation Network Evolution
Miray Kas, Niting Qi, Dongyang Teng
Electrical & Computer Engineering
Carnegie Mellon University
Pittsburgh, PA 15217 USA

Email: 1. mkas@andrew.cmu.edu 2. nqi@andrew.cmu.edu 3. dongyant@andrew.cmu.edu


PDF version of this proposal is available File:Proposal.pdf


ABSTRACT

Staggering growth of availability of electronic resources over the Internet enables rapid dissemination of the ideas and changes in the trends and the interaction patterns. In this project, we plan to investigate how a specific research area (e.g. high-energy physics) changes over time. More specifically, we plan to explore application of Discrete Fourier Transform (DFT) on citation networks and evaluate its potential in predicting the future of a particular research area looking at the changes in the currently available data.

CATEGORIES AND SUBJUCT DESCRIPTORS

[Social Networks]: Dynamic Network Models, Network Evolution, Citation Analysis.

GENERAL TERMS

Algorithms, Sociology, Signal Processing.

KEYWORDS

Social Networks, Dynamic Network Models, Network Evolution, Citation Analysis, Pattern Matching.

INTRODUCTION

Study of networks, including social networks, biological networks, information works, and many other kinds, is always a focus topic in scientific research. Many of the obtained results are not only solutions for problems in the field of networks, but they are also applicable to other fields. Citation networks, which are the principal focus of this paper, have been studied quantitatively almost from the moment citation databases first became available. In 1965, Derek J. de Solla Price described the inherent linking characteristic of the SCI in his paper titled "Networks of Scientific Papers" (1). The links between citing and cited papers became dynamic when the SCI began to be published online. In 1973, Henry Small published his classic work on co-citation analysis (2) which became a self-organizing classification system that led to document clustering experiments and eventually what is later called "Research Reviews".

Autonomous citation indexing was introduced in 1998 by Giles, Lawrence and Bollacker (3), enabling automated extraction and grouping of citations for academic/scientific documents. While previous citation extraction was a manual process, citation measures now could scale up and be computed for any scholarly and scientific field, not just those selected by organizations. Further information on the field history can be found at (4).

Nowadays, many innovations and new research areas emerged from existing references. Creation of a specific research topic as well as the development of it can be traced by mining paper citations. The changes of citations shows how a research topic evolves over times and also help us understand the lineage of topics. Citation networks can also help researchers identify topics that are related to a specific research topic.

Since the interactions among the researchers of a particular scientific field and idea propagation within the field involves a lot more than a bag of buzz words, a comprehensive technique should be developed to observe the evolution of a specific research area, further to predict the trend of research.

In this project, we plan to apply DFT on citation networks to filter out some “noisy” papers which have no contribution to the result of dataset analysis and prediction. Combining the DFT in analyzing a large-scale network can lead to a more valuable result. Since some “noisy” papers have been filter out before the analysis, this technique can be regarded as a pre-prediction process.

DATASET

KDD Cup 2003 Dataset

For our project, we plan to use KDD Cup 2003 dataset. The data is publicly available online: click here to access KDD Cup 2003 Dateset The data consists of roughly 29,000 papers in arXiv (1993-2003), within the field of high-energy physics. Each paper has a unique id (a random between 1 and 100,000). The data is structured as follows: (i) LaTeX sources of each paper (classified by year), (ii) abstract of each paper, (iii) SLAC dates of each paper, and (iv) citation graph data in the form of (citing_paper_id, cited_paper_id). The citation graph has roughly 342K entries. The citation graph does not contain any information about the citations that are not covered by the dataset.

PROPOSED WORK

Social networks have been much studied in social sciences (5) (6). The general features of these studies is that they are often restricted to small systems, and often consider the networks as static graphs, whose nodes represent individuals while links represent their social interactions.

In contrast, we plan to take a different and complementary way to analyze social networks, which are neither static nor small systems. KDD Cup 2003 dataset will enable us to build an evolving citation network, which will be expanded constantly by the addition of new papers, as well as the addition of new citation links between papers.

It is obvious that this network will be extremely large and complicated even after a short period of time, especially when a new research area is developed. Generally, the topological properties of this network are determined by dynamical growth processes (7). The information embedded in the growth history of the network can be used to predict its future.

Since KDD Cup 2003 dataset is already broken into years, it is also possible to get yearly or even more frequent (monthly or quarterly) discrete snapshots of the network which makes it possible to apply digital signal processing techniques like Discrete Fourier Transform (DFT) on this dataset.

DFT is a specific kind of discrete transform, used in Fourier analysis. In particular, the DFT is widely employed in signal processing and related fields to analyze the frequencies contained in a sampled signal, to solve partial differential equations, and to perform other operations such as convolutions or multiplying large integers. It transforms one function (which is often a function in the time domain) into another, which is called the frequency domain representation. DFT requires an input function that is discrete and whose non-zero values have a limited (finite) duration. If we treat citations to a paper as different signals/samples, it becomes applicable to citation networks as well.

In our dataset, the citation number of each paper is discrete on timeline. It can be represented as follows: time-domain sampling number is n, while T represents month or year. Its DFT is:

Different papers in the citation network might show different behavior about the number of citations they receive over time. For instance, a seminal paper might be continuously receiving citations while another paper might be hot topic for a number of years and highly cited within those years, not receiving further citations after a while. Similarly, a paper may always remain as an incremental contribution to the field, not being of significant interest at any point in time.

With the application of DFT over multiple snapshots of the citation network, we intend to differentiate between the papers that belong to different categories. If the DFT of the number of citation is mainly in the low frequency, it is highly possible that the citation number of this paper does not change much during 10 years (the dataset contains papers from 1993-2003). So we can predict that the citation number of the paper will remain the same during the next period. On the contrary, if the DFT of the number of citation has many high frequency components, we can predict that the paper will receive more citations in the future.

RELATED WORK

In this section, we provide a brief overview of the related work we could identify so far. We break down the related work section mainly into two parts, where we first discuss papers that use the same dataset (KDD Cup 2003) and then papers in closer research areas. This covers two aspects: work done in the field of citation networks, and work done in the field of dynamic network analysis and network evolution. In KDD Cup competition, there were mainly three tasks evaluated on this dataset: (i) predicting how the number of citations to each paper in the dataset will change over time, (ii) extracting useful data from a huge set of source/text files (i.e. data cleaning), and (iii) estimating the number of downloads a paper receives in its first two months after it is uploaded to arXiv. In KDD Cup 2003, for citation prediction task, the method used by the winner team includes conversion of data into a time series and applying regression analysis on it (8). (9) also focuses on time series conversion as their first step. The authors of (9) comment further on the factors that affect citations received by a paper such as the reputation of the authors, publishing seasons (related to the overhead of academic year or conferences), and hot topics in the field. For download estimation tasks, the winner team focused on an extension of bag-of-words approach, using linear regression as the learning algorithm (10). Other than the work performed using this specific dataset, there are other papers that investigate citation networks and evolution of networks, including citation networks and other types of networks.

For instance, there are various studies that investigate different properties of citation networks such as small world and couplings along with many other properties. Another interesting problem in citation networks is to understand how topics evolve over time and how this can be detected using citation networks (11). (12) introduces a model of network evolution that gives a more realistic description of the local processes, incorporating the addition of new nodes, new links, and the rewiring of links. The prediction fits the distribution of the network well. One other paper based on the evolution of citation networks investigates the influence of marketing journals in subfields of marketing, analyzing which journals emerge as the most influential ones and how this changes over time (13). (14) shows a method in finding similarity data in a time-sequenced dataset by using DFT. This method is classic since it uses signal processing technique in analyzing the dataset at an early time.

(15) investigates the graphical structure of the large-scale time evolving citation networks using three different techniques of analysis (i.e., probabilistic mixture model using an expectation–maximization algorithm, modularity-maximization based network clustering method, and analysis of how eigenvector centrality scores vary over time). As one final example, (16) studies the evolution of social networks in scientific collaboration (co-authorship) networks. (16) provides many interesting results that confirm scale-free topology of co-authorship networks which are governed by preferential attachment rules in their growth.

METHODOLOGY/TOOLS

For the coding part, we plan to write scripts to read, parse, and convert the available data in to the form we can use in matlab. For now, we are planning to use matlab for our analysis/idea coding.

In addition, for visualization and analysis purposes, we will try using ORA (17). ORA is an interactive network analysis tool that maintains the internal structure of an organization/social network as a set of agents, tasks, and resources if applicable and identifies the relationships, key actors among them. However, since it is an interactive tool, we are not sure whether we will be able to open up the citation data with 350K entries in it or visualize something meaningful (i.e. something which does not look like a ball of yarn). This is subject to change; we might also look for other visualization tools customized for handling larger number of nodes.

REFERENCES

1. Networks of Scientific Paper. Price, Derek J. de Solla. s.l. : Science, 1965, Vol. 149.

2. Co-citation in the Scientific Literature: A New Measure of the Relationship Between Two Documents. Small, Henry. s.l. : Journal of the American Society for Information Science, 1973, Vol. 24.

3. CiteSeer: An Automatic Citation Indexing System. Giles, C.L. and Bollacker, K.D. and Lawrence, S. 1998 : Proceedings of the 3rd ACM Conference on Digital Libraries.

4. Wikipedia. [Online] click here.

5. S.Wasserman, K.Faust. Social Network Analysis. Cambridge : Cambridge University Press, 1994.

6. The Small World. (Ed.), M.Kochen. NJ : Ablex, Norwood, 1989.

7. Evolution of the social network of scientific collaborations. A.L.Barabasi, H.Jeong, Z.Neda, E.Ravasz. 690-614, s.l. : Physica, 2002, Vol. 311.

8. Citation Prediction Using Time Series Approach KDD Cup 2003 (Task 1). Manjunatha, J. N. and Pandey, Raghavendra and Sivaramakrishnan, R. and Murty, Narasimha. Washington, DC : SIGKDD, 2003.

9. Predicting Citation Rates for Physics Papers: Constructing Features for an Ordered Probit Model. Mackassy, Claudia Perlich and Foster Provost and Sofus. Washington, DC. : SIGKDD, 2003.

10. The Download Estimation Task on KDD Cup 2003. Leskovic, Janez Brank and Jure. Washington, DC. : SIGKDD, 2003.

11. Detecting Topic Evolution in Scientific Literature: How Can Citations Help? Qi He, Bi Chen, Jian Pei, Baojun Qiu, Prasenjit Mitra, C. Lee Giles. Hong Kong, China : CIKM, 2009.

12. Barabási, Réka Albert and Albert-László. Topology of Evolving Networks: Local Events and Universality. PHYSICAL REVIEW LETTERS. 1999, Vol. 85.

13. The structural influence of marketing journals: A citation analysis of the discipline and its subareas over time. Baumgartner, H. and Pieters, R. s.l. : Journal of Marketing, 2003, Vols. 123--139.

14. Efficient Similarity Search In Sequence. Faloutsos, Rakesh Agrawal and Christos. San Jose : s.n., 1993.

15. Large-scale Structure of Time Evolving Citation Networks. Leicht, EA and Clarkson, G. and Shedden, K. and Newman, M.E.J. 1, s.l. : The European Physical Journal B-Condensed Matter and Complex Systems, 2007, Vol. 59.

16. Evolution of the Social Network of Scientific Collaborations. Barabasi, A.L. and Jeong, H. and Neda, Z. and Ravasz, E. and Schubert, A. and Vicsek, T. 3-4, s.l. : Physica A: Statistical Mechanics and its Applications, 2002, Vol. 311.

17. Carley, K.M. and Reminga, J. and Storrick, J. and Columbus, D. CMU-ISR-10-120 ORA user's Guide 2010. Pittsburgh : Carnegie Mellon University. , 2010.

18. Carley, K.M. and Columbus, D. and Bigrigg, M. and Kunkel, F. AutoMap User's Guide 2010. Pittsburgh : Carnegie Mellon University, 2010.

19. Batagelj, Vladimir. Efficient Algorithms for Citation Network Analysis. Ljubljana, Slovenia : University of Ljubljana, Department of Mathematics, 2003.

20. R.Albert, H.jeong, A.L. Barabasi. s.l. : Nature, 1999, Vol. 400.

21. Giles. S.Lawrence, C.L. s.l. : Science, 1998, Vol. 280.