Rbosaghz writeup of SeerSuite Talk
SeerSuite_2009_Talk by user:Rbosaghz
This talk was an overview of how the Seer suite of search engines is built and served, how research is done for them, and other topics.
At the start, Lee Giles pointed out that the amount of available data has grown 10-fold over the past several years, which is reasonable motivation for building more specialized search engines like the ones presented in the SeerSuite.
A particularly interesting and relevant part of his talk was how they do bibliography extraction. He mentioned that they use a hierarchical CRF to extract bibliographic information, which is a nice approach. They also use SVMs to choose parameters for some of their tools (this was glossed over as time was running out, which was unfortunate).
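The talk didn't go into implementation details, so to make the CRF idea concrete, here is a minimal sketch of tagging citation tokens with bibliographic field labels using a flat linear-chain CRF (via the third-party sklearn-crfsuite package). This is not SeerSuite's hierarchical CRF or its actual feature set; the labels, features, and tiny training example below are all illustrative assumptions.

```python
# Illustrative sketch only: a flat linear-chain CRF for tagging citation tokens
# with bibliographic field labels. SeerSuite's extractor uses a hierarchical CRF
# with its own features and training data; everything below is assumed.
import sklearn_crfsuite  # pip install sklearn-crfsuite

def token_features(tokens, i):
    """Simple per-token features a citation tagger might use."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_capitalized": tok[:1].isupper(),
        "is_digit": tok.isdigit(),
        "looks_like_year": tok.isdigit() and len(tok) == 4,
        "has_period": "." in tok,
        "position": i / len(tokens),  # relative position in the reference string
    }

# Tiny hand-labeled example standing in for a real training corpus (assumption).
train_tokens = [["Giles", ",", "C.", "L.", "1998", ".", "CiteSeer", ":",
                 "an", "automatic", "citation", "indexing", "system", "."]]
train_labels = [["AUTHOR", "AUTHOR", "AUTHOR", "AUTHOR", "DATE", "OTHER",
                 "TITLE", "TITLE", "TITLE", "TITLE", "TITLE", "TITLE",
                 "TITLE", "OTHER"]]

X = [[token_features(seq, i) for i in range(len(seq))] for seq in train_tokens]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=50)
crf.fit(X, train_labels)

test = ["Lawrence", ",", "S.", "2001", ".", "Online", "or", "invisible", "?"]
print(crf.predict([[token_features(test, i) for i in range(len(test))]]))
```

A real extractor would of course train on thousands of labeled reference strings and use richer features (author and venue dictionaries, layout cues), but the token-level sequence-labeling structure is the same.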
I particularly liked the presentation of the Personal Homepage search engine, which should be a valuable tool in academia: it is often enough to search only homepages to find what you are looking for.
There was a short mention of ethical crawling with respect to broken robots.txt files. When a robots.txt file is broken, the crawler must decide how to proceed (to crawl or not). He noted that Microsoft's crawler is the most "ethical", apparently as a result of lawsuits against Microsoft over the years.
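The talk didn't say how the SeerSuite crawler resolves this, but one conservative policy, sketched below with Python's standard urllib.robotparser (the crawler name and the fallback rules are my own assumptions), is to skip a host whenever its robots.txt is unreachable or unreadable rather than assume permission:

```python
# Sketch of a conservative robots.txt policy: if the file is broken or
# unreachable, err on the side of not crawling. This is an illustration,
# not the policy of any particular crawler discussed in the talk.
import urllib.error
import urllib.request
import urllib.robotparser

USER_AGENT = "ExampleResearchBot"  # hypothetical crawler name

def allowed_to_fetch(url: str, robots_url: str) -> bool:
    try:
        with urllib.request.urlopen(robots_url, timeout=10) as resp:
            body = resp.read().decode("utf-8", errors="strict")
    except urllib.error.HTTPError as e:
        # A missing robots.txt (404) is conventionally treated as "allow all";
        # any other error status (403, 500, ...) is treated conservatively here.
        return e.code == 404
    except (OSError, UnicodeDecodeError):
        return False  # unreachable or undecodable file: do not crawl

    parser = urllib.robotparser.RobotFileParser()
    parser.parse(body.splitlines())
    return parser.can_fetch(USER_AGENT, url)

if __name__ == "__main__":
    print(allowed_to_fetch("http://example.com/papers/",
                           "http://example.com/robots.txt"))
```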
Another interesting topic of the talk was a search engine for chemists that makes chemical compound formulas easily searchable.
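This was only mentioned at a high level (presumably the ChemXSeer work), so the following is just a rough sketch of the kind of formula normalization such a search engine might rely on: turning a formula string into element counts so that equivalent writings of a compound can be matched. The parsing rules and examples are my own assumptions, and real formula search has to handle much more (parenthesized groups, hydrates, charges).

```python
# Sketch: parse a flat chemical formula string into element counts so that,
# e.g., "H2SO4" and "SO4H2" index to the same composition. Illustration only;
# it handles nothing beyond simple Element+count sequences.
import re
from collections import Counter

ELEMENT_COUNT = re.compile(r"([A-Z][a-z]?)(\d*)")

def formula_to_counts(formula: str) -> Counter:
    counts = Counter()
    for element, count in ELEMENT_COUNT.findall(formula):
        counts[element] += int(count) if count else 1
    return counts

if __name__ == "__main__":
    print(formula_to_counts("H2SO4"))    # H=2, S=1, O=4
    print(formula_to_counts("C6H12O6"))  # C=6, H=12, O=6
```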