Wikipedia Infobox Generator Using Cross Lingual Unstructured Text

Team Members

Project Idea

[Image: WikipediaExample.png, an example Wikipedia article with its infobox]

Key facts about Wikipedia articles are often summarized in short structured sections called 'infoboxes'. For example, the article on Brazil lists "Capital: Brasília", "Largest city: São Paulo", "Official language: Portuguese", and so on. Because infoboxes are manually created and maintained, many articles have missing or outdated information (facts that were revised only in the article's plain text). We also note that articles written in the language of the subject's region tend to contain much more detailed information. For example, articles on Latin American soccer players list facts such as debut match, number of goals, and record wins in the Spanish version of Wikipedia, whereas the English version often lacks these statistics.


In our SPLODD project, we propose to achieve two objectives:

  • Extract facts from unstructured Wikipedia text to generate infoboxes.
  • Combine facts from multiple language editions of an article to generate a single infobox with comprehensive information (see the merge sketch after this list).
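
The second objective amounts to merging per-language fact sets into one table. Below is a minimal Python sketch, assuming facts from each edition have already been extracted and their field names mapped to a common vocabulary; the function name and sample facts are hypothetical:

    # Merge per-language fact dictionaries into one comprehensive infobox.
    # Earlier editions in the list take priority when a field appears in both.
    def merge_infobox_facts(editions):
        merged = {}
        for facts in editions:  # e.g., [english_facts, spanish_facts]
            for field, value in facts.items():
                merged.setdefault(field, value)
        return merged

    english_facts = {"Capital": "Brasília", "Official language": "Portuguese"}
    spanish_facts = {"Capital": "Brasília", "Largest city": "São Paulo"}
    print(merge_infobox_facts([english_facts, spanish_facts]))
    # {'Capital': 'Brasília', 'Official language': 'Portuguese', 'Largest city': 'São Paulo'}

A real merge would also need conflict resolution, e.g., preferring the more recently edited edition when values disagree.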


Methodology used:

  • Topic modelling to detect the type of page (e.g., country, person, athlete, actor); see the sketches after this list
  • Training CRF models for fact identification and extraction; also sketched after this list
  • Clustering to align entity names across languages, based on page topic
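
For the topic modelling step, one plausible choice is LDA via the gensim library; the toy token lists below stand in for tokenized article text and are purely illustrative:

    from gensim import corpora, models

    # Toy "articles": one resembles a country page, one an athlete page.
    texts = [
        ["capital", "city", "population", "official", "language"],
        ["goals", "debut", "club", "season", "striker"],
    ]
    dictionary = corpora.Dictionary(texts)
    bows = [dictionary.doc2bow(t) for t in texts]

    # Fit a small LDA model; in practice topic ids would be mapped to page
    # types (country, person, athlete, ...) by inspecting top words per topic.
    lda = models.LdaModel(bows, num_topics=2, id2word=dictionary, passes=10)
    print(lda.get_document_topics(bows[0]))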
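
For the CRF step, one possible setup treats fact extraction as BIO token tagging, here sketched with the sklearn-crfsuite package; the features, sentence, and label scheme are assumptions, not the project's actual design:

    import sklearn_crfsuite

    def token_features(sent, i):
        # Simple per-token features; a real system would add POS tags,
        # gazetteers, section headers, etc.
        word = sent[i]
        return {
            "word.lower": word.lower(),
            "word.istitle": word.istitle(),
            "word.isdigit": word.isdigit(),
            "prev.lower": sent[i - 1].lower() if i > 0 else "<s>",
            "next.lower": sent[i + 1].lower() if i < len(sent) - 1 else "</s>",
        }

    # One toy training sentence with BIO labels marking an infobox fact.
    sent = ["The", "capital", "of", "Brazil", "is", "Brasília", "."]
    labels = ["O", "O", "O", "O", "O", "B-CAPITAL", "O"]
    X = [[token_features(sent, i) for i in range(len(sent))]]
    y = [labels]

    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
    crf.fit(X, y)
    print(crf.predict(X))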


Challenges:

  • Large-scale data
  • Noise: the same infobox field name can be written in different ways within one language, e.g., "Number of Goals", "No. of Goals", "Goals", "# Goals" (see the normalization sketch after this list)
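
One way to tame the variant-name noise before clustering is to normalize each surface form to a canonical key so that variants collapse together; the rules below are illustrative, not exhaustive:

    import re

    def canonical_field(name):
        # Normalize one infobox field name to a canonical key.
        name = name.lower().strip()
        name = name.replace("#", "number ").replace("no.", "number")
        name = re.sub(r"[^a-z0-9 ]", " ", name)               # drop punctuation
        name = re.sub(r"\bnumber of\b|\bnumber\b", "", name)  # strip count words
        return " ".join(name.split())

    variants = ["Number of Goals", "No. of Goals", "Goals", "# Goals"]
    print({v: canonical_field(v) for v in variants})
    # all four variants normalize to 'goals'

Canonical keys like these could then seed the cross-language clustering step described under Methodology.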

Corpus

Reference Papers


Comments from William

There's lots of cool space for a project here, and you guys are aware of the closely-related Weld work, which seems good. My only worry is the size of the data you're working with - do you have plans to boil this down to a size that's manageable for a class project? BTW might also want to look at the DBpedia dumps, which have regularized some of the infobox fields (using rules, I think), and distribute them. I don't know if there are non-English DBpedia dumps available.

--Wcohen 20:45, 22 September 2011 (UTC)