Wikipedia Infobox Generator Using Cross Lingual Unstructured Text
From Cohen Courses
Wikipedia Infobox Generator By Combining Multi Lingual Unstructured Text
Team Members
Project Idea
Key facts in Wikipedia articles are often summarized in short blocks called 'infoboxes'. For example, the article on Brazil lists "Capital: Brasília", "Largest city: São Paulo", "Official language: Portuguese", etc. Because these infoboxes are created and maintained manually, many articles have missing or outdated infobox information (for example, a fact that was revised in the article's plain text but not in the infobox).
- g
- b
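To make the infobox idea concrete, here is a minimal Python sketch of how the key/value pairs of an infobox could be read out of an article's wikitext. It is an illustration only, not the project's method: the function name `extract_infobox`, the sample wikitext, and the field names in it are assumptions made for this example, and real infoboxes contain cases (comments, references, unnamed parameters) that this sketch ignores.

```python
import re


def extract_infobox(wikitext):
    """Return the named fields of the first {{Infobox ...}} template as a dict."""
    match = re.search(r"\{\{\s*infobox", wikitext, re.IGNORECASE)
    if match is None:
        return {}
    depth = 0                 # nesting level of {{...}} and [[...]]
    parts, current = [], []
    i = match.start()
    while i < len(wikitext):
        pair = wikitext[i:i + 2]
        if pair in ("{{", "[["):
            depth += 1
            current.append(pair)
            i += 2
        elif pair in ("}}", "]]"):
            depth -= 1
            if depth == 0:    # closing braces of the infobox itself
                break
            current.append(pair)
            i += 2
        elif wikitext[i] == "|" and depth == 1:
            # A top-level pipe separates two infobox fields.
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(wikitext[i])
            i += 1
    parts.append("".join(current))

    fields = {}
    for part in parts[1:]:    # parts[0] is the template name
        key, sep, value = part.partition("=")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


# Hypothetical wikitext, shaped like the Brazil example above.
sample_wikitext = """{{Infobox country
| capital = [[Brasília]]
| largest_city = [[São Paulo]]
| official_languages = [[Portuguese language|Portuguese]]
}}
Brazil is a country in South America."""

fields = extract_infobox(sample_wikitext)
```

Tracking the nesting depth keeps the pipe inside `[[Portuguese language|Portuguese]]` from being mistaken for a field separator.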
Corpus
- Wikipedia XML Dumps (Current Revision only)
- http://en.wikipedia.org/wiki/Wikipedia_database#Other_languages
- English corpus size: 31 GB uncompressed
- With 5 languages, approximately 200 GB total
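Dumps of this size cannot be loaded into memory at once, so they would typically be read as a stream. The following is a rough Python sketch of one way to do that with the standard library. The function name `iter_pages`, the namespace version in the sample, and the sample pages are assumptions for illustration; a real dump would be opened with `open(path, "rb")` or `bz2.open(path, "rb")` rather than from an in-memory buffer.

```python
import io
import xml.etree.ElementTree as ET


def iter_pages(stream):
    """Yield (title, wikitext) pairs from a MediaWiki XML export stream."""
    context = ET.iterparse(stream, events=("start", "end"))
    _, root = next(context)   # the enclosing <mediawiki> element
    title, text = None, ""
    for event, elem in context:
        if event != "end":
            continue
        tag = elem.tag.rsplit("}", 1)[-1]   # drop the XML namespace prefix
        if tag == "title":
            title = elem.text
        elif tag == "text":
            text = elem.text or ""
        elif tag == "page":
            yield title, text
            title, text = None, ""
            root.clear()      # discard finished pages so memory stays bounded


# Tiny stand-in for a real dump (hypothetical content).
SAMPLE_DUMP = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Brazil</title>
    <revision><text>{{Infobox country | capital = Brasilia}}</text></revision>
  </page>
  <page>
    <title>Portugal</title>
    <revision><text>Plain article text.</text></revision>
  </page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE_DUMP)))
```

Because the corpus contains the current revision only, each page carries a single `<text>` element, which is why the sketch keeps just one text value per page.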
Reference Papers
- Wu, Weld. Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management (CIKM), 2007
- Eytan Adar, Michael Skinner, Daniel S. Weld. Information arbitrage across multi-lingual Wikipedia. Proceedings of the Second ACM International Conference on Web Search and Data Mining, February 9-12, 2009, Barcelona, Spain