Yandongl writeup of Lee Giles 2009
This is a review of Giles_Talk_2009 by user:Yandongl
Information overload is a pressing problem today. The amount of data is growing too fast, and the data is heterogeneous in origin: it can be crawled from the Web automatically, submitted manually as scientific documents, or uploaded by users as datasets. The data also often contains elements in rich formats: text, tables, algorithms, and figures. All of this makes it extremely difficult to access and manage data efficiently. Indeed, there is no open-source integrated system for building a data management system that covers all phases of information and knowledge extraction. Giles proposes CiteSeerX, the successor of CiteSeer, to address these problems. Its design is modular, scalable, extensible and robust, and it is organized in three tiers: interface, application and data. It automatically crawls papers from the Web, then parses them and extracts metadata from PDF files. After converting PostScript files to plain text, it recognizes entities with the help of machine learning approaches such as SVM and CRF. It also maintains a list of parent URLs, which makes it convenient to crawl again for new information, and documents found through those parent URLs are usually of high quality.
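To make the parent-URL idea concrete, here is a minimal sketch of how a crawler might remember parent pages and report only the documents that are new on a revisit. It is not CiteSeerX's actual code: the class name `ParentUrlIndex`, the method `new_documents`, and the rule of treating links ending in `.pdf` or `.ps` as papers are all assumptions made for illustration, and the page HTML is passed in as a string rather than fetched over the network.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


class ParentUrlIndex:
    """Hypothetical store: parent URL -> document URLs already seen there."""

    # Assumption: a link is a paper if it ends in one of these suffixes.
    DOC_SUFFIXES = (".pdf", ".ps")

    def __init__(self):
        self.seen = {}

    def new_documents(self, parent_url, html):
        """Return document links on the page that were not seen before."""
        collector = _LinkCollector()
        collector.feed(html)
        found = set()
        for href in collector.hrefs:
            absolute = urljoin(parent_url, href)  # resolve relative links
            if absolute.lower().endswith(self.DOC_SUFFIXES):
                found.add(absolute)
        known = self.seen.setdefault(parent_url, set())
        fresh = sorted(found - known)
        known.update(fresh)  # remember them for the next revisit
        return fresh
```

Calling `new_documents` again with the same parent URL and an updated copy of the page returns only the links added since the last visit, which is what makes recrawling a known parent page cheap.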
Some of the ideas, such as formula search, are really innovative. But many of these possibilities are domain-specific, and one might have to maintain a separate design for each domain in order to provide better supporting features.
Since information extraction is an important module of the whole system, a lot of machine learning work remains to be done.