Wikipedia Infobox Generator Using Cross-Lingual Unstructured Text
Key facts about Wikipedia articles are often summarized in short structured sections called infoboxes. For example, the article on Brazil lists "Capital: Brasília", "Largest city: São Paulo", "Official language: Portuguese", and so on. Because infoboxes are manually created and maintained, many articles have missing or outdated entries (facts that were revised only in the running text). We also note that articles written in a language close to the subject's native speakers tend to carry much more detail. For example, articles on Latin American soccer players in the Spanish-language Wikipedia include facts such as debut match, number of goals, and record wins, whereas the English version lacks these crucial stats.
In our SPLODD project, we propose to achieve two objectives:
- Extract facts from unstructured wikipedia text to generate infoboxes.
- Combine facts in multiple languages for an article to generate infoboxes with comprehensive information.
Our planned approach:
- Training CRF models for fact identification and extraction
- Clustering to align entity names across languages, based on page topic
Challenges:
- Large-scale data
- Noise: the same infobox field name can be written in several ways even within one language, e.g., "Number of Goals", "No. of Goals", "Goals", "# Goals".
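As a first pass at the noise problem, variant field names like those above could be collapsed into a canonical form before clustering. The sketch below is illustrative only; the alias table is a hypothetical hand-built resource, not something the project has produced.

```python
import re

# Hypothetical alias table mapping normalized surface variants to a
# canonical field name; in practice this would be learned or curated.
ALIASES = {
    "number of goals": "goals",
    "no. of goals": "goals",
    "# goals": "goals",
    "goals": "goals",
}

def canonicalize(field_name):
    """Lowercase, collapse whitespace, and look up a canonical form."""
    key = re.sub(r"\s+", " ", field_name.strip().lower())
    return ALIASES.get(key, key)

print(canonicalize("No. of Goals"))  # -> goals
print(canonicalize("# Goals"))       # -> goals
```

Unrecognized names fall through unchanged, so the table can be grown incrementally as new variants are observed in the dumps.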
Data:
- Wikipedia XML dumps (current revisions only)
- English corpus size: 31 GB uncompressed
- With 5-6 languages, approximately 200 GB total
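Since the dumps are far too large to hold in memory, one plausible processing strategy is a streaming parse that emits one page at a time. The sketch below uses Python's `xml.etree.ElementTree.iterparse`; the MediaWiki export namespace string is an assumption and should be checked against the header of the actual dump file.

```python
import xml.etree.ElementTree as ET

# Namespace used by MediaWiki export XML; the exact version string varies
# between dump releases (an assumption to verify against the file header).
NS = "{http://www.mediawiki.org/xml/export-0.6/}"

def iter_pages(dump_path):
    """Stream (title, wikitext) pairs without loading the whole dump."""
    for event, elem in ET.iterparse(dump_path, events=("end",)):
        if elem.tag == NS + "page":
            title = elem.findtext(NS + "title")
            text = elem.findtext(f"{NS}revision/{NS}text") or ""
            yield title, text
            elem.clear()  # free the subtree we have already processed
```

The same generator could be fed a `bz2.open(...)` file object to read the compressed dump directly, avoiding a full decompression to disk.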
References:
- F. Wu and D. S. Weld. Autonomously Semantifying Wikipedia. In Proceedings of the 16th ACM Conference on Information and Knowledge Management (CIKM), 2007.
- E. Adar, M. Skinner, and D. S. Weld. Information Arbitrage Across Multi-lingual Wikipedia. In Proceedings of the Second ACM International Conference on Web Search and Data Mining (WSDM), Barcelona, Spain, February 2009.
Comments from William
There's lots of cool space for a project here, and you guys are aware of the closely-related Weld work, which seems good. My only worry is the size of the data you're working with - do you have plans to boil this down to a size that's manageable for a class project? BTW might also want to look at the DBpedia dumps, which have regularized some of the infobox fields (using rules, I think), and distribute them. I don't know if there are non-English DBPedia dumps available.
--Wcohen 20:45, 22 September 2011 (UTC)
Comments from Noah
This project has been shifted to my "mentorship." Here's some advice: drop the topic modeling part. You can use standard classification methods to categorize pages; that isn't really the interesting part of this project. For labeled data, consider wikipedia's categories and pick out pages semi-automatically that you think are reliable.
ASAP, you need to carefully work through the steps your algorithms are going to take. What models need to be built, on what data? What annotation will you need to provide? Or can you get by without it? How do they feed into each other? And, very importantly, how are you going to evaluate performance at different stages? You need to carefully circumscribe what you are going to do, because the space of possibilities is huge.
Aligning bits of information across languages is really interesting -- but don't underestimate how difficult it will be.
I agree with William that this may be hard to do on a very large scale. Maybe it's better to pick one category and focus on that, leaving scalability for later. What we want you to learn in this class is how to use structured models, and that is directly at odds with scalability.
--Nasmith 21:27, 9 October 2011 (UTC)