Liuliu writeup of Giles talk

From Cohen Courses
Revision as of 10:42, 3 September 2010 by WikiAdmin (talk | contribs) (1 revision)

This is a review of SeerSuite_2009_Talk by user:Liuliu


I think the theme of Prof. Giles' talk is data: how to access, manage, preserve, capture, research, and search it. He mainly introduced the infrastructure and components of the systems he developed for data search and management, but did not go very deep into the models or methods used to solve the subproblems embedded in these systems.

The talk starts with a comparison between data access in big science and in small science. The data of big science, like physics, is easy to handle, archive, and understand, while the data of small science, like chemistry, is heterogeneous, vast, and 2-3 times bigger than that of big science. To manage the growing volume of data, we need to create indices to limit search and design parallel algorithms to support better data management.

SeerSuite is a framework, or infrastructure, on which one can build a search engine, and CiteSeerX is one such example. It consists of three main parts that are kept separate: interface, application, and data. CiteSeerX manages millions of citations and authors. It starts with focused crawling: it incrementally crawls papers starting from parent URLs. After crawling the files, it extracts metadata from the PDF/PS files. It first converts the PDF/PS files to text, applies a document filter, and then uses machine learning methods, SVM and SRF, to extract entities from the text. They have lots of rule-based templates to ensure the accuracy of entity extraction. However, as Prof. Giles pointed out later in his talk, the accuracy of the extraction is still a bottleneck for the whole system. CiteSeerX also performs name disambiguation to disambiguate, cluster, and link names in the large digital library.
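
To make the "incremental crawl from parent URLs" idea concrete for myself, here is a minimal sketch. It is not the CiteSeerX crawler: the toy link graph `TOY_WEB`, the function names, the depth limit, and the rule "a paper is any URL ending in .pdf or .ps" are all my own assumptions for illustration, and no real network access is involved.

```python
from collections import deque

# Hypothetical toy web: page URL -> outgoing links (stands in for real fetching).
TOY_WEB = {
    "http://example.org/lab": [
        "http://example.org/lab/pubs",
        "http://example.org/lab/people",
    ],
    "http://example.org/lab/pubs": [
        "http://example.org/lab/p1.pdf",
        "http://example.org/lab/p2.ps",
    ],
    "http://example.org/lab/people": ["http://example.org/lab/photo.jpg"],
}


def is_paper(url):
    """Assumed focus rule: only PDF/PS files count as papers."""
    return url.lower().endswith((".pdf", ".ps"))


def focused_crawl(parent_urls, web, known_papers=frozenset(), max_depth=2):
    """Breadth-first crawl from the parent URLs.

    Pages are revisited on every run, but only papers that are not in
    known_papers are returned, which is what makes the crawl incremental.
    """
    visited = set()
    new_papers = []
    queue = deque((url, 0) for url in parent_urls)
    while queue:
        url, depth = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        if is_paper(url):
            if url not in known_papers:
                new_papers.append(url)
            continue
        if depth < max_depth:
            queue.extend((link, depth + 1) for link in web.get(url, []))
    return new_papers
```

Running it once from `http://example.org/lab` collects the two papers; running it again with those papers passed as `known_papers` returns only whatever has been added since.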

Prof. Giles also introduced several other Seer systems. I think the formula search problem in ChemSeer is very interesting. Identifying a standalone mathematical formula is, I think, fairly easy. However, identifying formulas that are embedded in text is not. I remember he showed us two pieces of text and said one of them was a formula while the other was not. I couldn't tell the difference between the two. I guess we might need some domain knowledge to solve this problem, but it's still cool to extract formulas out of text.
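
A small sketch of why the embedded case seems hard to me. This is not the method from the talk; it is a purely syntactic check I made up, with a deliberately tiny, hand-picked subset of element symbols (`ELEMENTS`) and a hypothetical function name. It only asks whether a token can be read as a sequence of element symbols with optional counts.

```python
import re

# Small, hand-picked subset of real element symbols (assumption: enough for a demo).
ELEMENTS = {"H", "He", "C", "N", "O", "I", "In", "Na", "Cl", "S", "K", "Ca", "Fe", "No", "As", "At", "Be"}

# One formula unit: a capital letter, an optional lowercase letter, optional digits.
UNIT = re.compile(r"([A-Z][a-z]?)(\d*)")
WHOLE = re.compile(r"(?:[A-Z][a-z]?\d*)+")


def looks_like_formula(token):
    """True if the token splits entirely into known element symbols."""
    if not WHOLE.fullmatch(token):
        return False
    return all(symbol in ELEMENTS for symbol, _count in UNIT.findall(token))


for token in ["CH4", "NaCl", "He", "I", "In", "As", "The", "talk"]:
    print(token, looks_like_formula(token))
```

Tokens such as "He", "I", "In", and "As" pass the check even though in a sentence they are usually ordinary English words, so a syntactic test alone cannot separate the two cases; some use of the surrounding context or domain knowledge seems necessary.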