Nschneid writeup of Giles talk
This is Nschneid's review of SeerSuite_2009_Talk
Lee Giles spoke yesterday about his work with the SeerSuite platform for extracting and sharing academic data. CiteSeer is a database and search engine of CS publications obtained via web crawls, with metadata extracted from PDFs. The main problem he mentioned with citation extraction was author resolution: common names are currently not disambiguated very well. Aside from citation information, there is work on extracting and indexing other types of information, such as tables, graphs, algorithms, chemical compounds (ChemSeer), and maps (ArchSeer for archaeologists). An open source release of the tools is intended to encourage others to build similar systems. Other issues mentioned included the design of crawling algorithms (what makes a bot "ethical"?), connections to other academic databases, and cultural differences between research communities (such as publishers' control over information in some fields, and resistance to sharing data sets in others).
Later, Dr. Giles mentioned that there is also a great deal of interest in extracting mathematical equations from PDFs. I have long wished for an app that would perform a similar task: producing a table that defines the variables and symbols used in a paper's notation. For example, the sentence "Let x be a vector of words" would produce a row mapping x to "a vector of words". Has anyone attempted this for PDFs or even LaTeX sources?
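To make the idea concrete, here is a minimal sketch of the symbol table I have in mind, run over plain text. It only handles the literal "Let X be Y" phrasing from the example above; the pattern and the function name `extract_definitions` are my own illustration, not anything from SeerSuite, and a real tool would need far more patterns plus PDF or LaTeX parsing.

```python
import re

# Assumption: definitions are phrased exactly as "Let <symbol> be <meaning>".
# The symbol may be a plain identifier (x) or a LaTeX-style macro (\alpha).
DEFINITION = re.compile(
    r"\bLet\s+(?P<symbol>\\?[A-Za-z]\w*)\s+be\s+(?P<meaning>[^.;]+)"
)

def extract_definitions(text):
    """Return (symbol, meaning) rows for each 'Let X be Y' phrase in text."""
    return [
        (m.group("symbol"), m.group("meaning").strip())
        for m in DEFINITION.finditer(text)
    ]

if __name__ == "__main__":
    sample = "Let x be a vector of words. Let y be its label."
    # Print one tab-separated row per definition found.
    for symbol, meaning in extract_definitions(sample):
        print(f"{symbol}\t{meaning}")
```

A definition phrased any other way ("where x denotes ...", "x is defined as ...") is missed by this pattern, which is exactly why I would like to know whether someone has done this properly.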