Nschneid writeup of Wang talk
This is Nschneid's review of Wang_thesis_defense
Language-Independent Class Instance Extraction Using the Web
- SEAL: Use structured Web data to expand sets from a small list of seed examples via wrappers based on HTML source. Ranking of candidates is based on random walks in graph containing wrappers and their fillers. No tokenization required => language-independent.
- Evaluated on manually constructed data sets: 12 sets for each of English, Chinese, and Japanese (e.g. Japanese provinces)
- Constraining wrappers to be shared by multiple seeds reduces noise: e.g. seeds Seattle and Boston; based on contexts of Seattle alone we might propose Carnegie Mellon, but this is less likely for contexts hosting both Seattle and Boston.
- Limitation: performance drops significantly with more than 5 seeds
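To make the shared-wrapper idea concrete, here is a minimal sketch, not the SEAL implementation itself. It assumes a wrapper is just a fixed-width pair of character contexts (SEAL's actual wrapper induction and its random-walk ranking are not reproduced), and the names `PAGE`, `contexts`, `shared_wrappers`, and `extract` are hypothetical. It works on raw characters, so no tokenizer is involved.

```python
import re

# Toy page: a list of cities plus an unrelated region that also mentions Seattle.
PAGE = (
    "<li>Seattle</li>\n"
    "<li>Boston</li>\n"
    "<li>Denver</li>\n"
    "<b>Seattle</b>\n"
    "<b>Carnegie Mellon</b>\n"
)

def contexts(page, seed, width):
    """Collect (left, right) character contexts around every occurrence of seed."""
    found = set()
    start = page.find(seed)
    while start != -1:
        end = start + len(seed)
        # Keep only occurrences with a full-width context on both sides.
        if start >= width and end + width <= len(page):
            found.add((page[start - width:start], page[end:end + width]))
        start = page.find(seed, start + 1)
    return found

def shared_wrappers(page, seeds, width=4):
    """Keep only the wrappers that bracket every seed at least once."""
    return set.intersection(*(contexts(page, s, width) for s in seeds))

def extract(page, wrappers):
    """Return every string that some wrapper brackets on the page."""
    candidates = set()
    for left, right in wrappers:
        pattern = re.escape(left) + r"(.+?)" + re.escape(right)
        candidates.update(re.findall(pattern, page))
    return candidates

one_seed = extract(PAGE, shared_wrappers(PAGE, ["Seattle"]))
two_seeds = extract(PAGE, shared_wrappers(PAGE, ["Seattle", "Boston"]))
```

On this toy page, `one_seed` includes Carnegie Mellon because the `<b>` context of Seattle counts as a wrapper, while `two_seeds` contains only the three cities because Boston never appears in that context.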
- iSEAL (iterative SEAL)
- Supervised version: at every iteration, seeds are obtained from a reliable source. Works with a fixed seed size.
- Bootstrapping stage: seeds for the ith stage are selected from candidate items from the previous stage. A fixed seed size doesn't work; instead, for the ith iteration, randomly select (i+1) seeds from the original seeds plus the candidates from previous stages.
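The growing seed schedule in the bootstrapping stage can be sketched as follows. This is an illustration under assumptions: seeds are sampled uniformly from the pool (the notes do not say how the random selection is weighted), and `expand` is a hypothetical stand-in for one SEAL run, here replaced by the toy function `toy_expand`.

```python
import random

CITIES = ["Seattle", "Boston", "Denver", "Austin", "Miami"]

def toy_expand(seeds):
    """Hypothetical stand-in for one SEAL run: the seeds plus one new city."""
    unseen = [c for c in CITIES if c not in seeds]
    return list(seeds) + unseen[:1]

def bootstrap(initial_seeds, expand, iterations, rng=None):
    """Iteratively expand a set, using i + 1 seeds at iteration i."""
    rng = rng or random.Random(0)
    # Pool of original seeds plus all candidates seen so far (order kept, no duplicates).
    pool = list(dict.fromkeys(initial_seeds))
    for i in range(1, iterations + 1):
        k = min(i + 1, len(pool))      # seed count grows with the iteration number
        seeds = rng.sample(pool, k)    # assumption: uniform sampling from the pool
        for item in expand(seeds):
            if item not in pool:
                pool.append(item)
    return pool

result = bootstrap(["Seattle", "Boston"], toy_expand, iterations=3)
```

A real system would presumably sample from ranked candidates rather than treating every pooled item equally; the sketch only shows how the seed count grows from one iteration to the next.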
- ASIA: Finding instances from category names
- Bilingual SEAL: for better precision, uses agreement on category instances extracted from data in multiple languages
- ANET (Automatic Named Entity Translation system): using bilingual snippets, ranks target language chunks (segments surrounded by punctuation or text in another language) based on how closely they correlate with the input string
- Binary SEAL: extracting binary relations (i.e. pairs of class-instance relations)
- Learns binary wrappers, which have a middle context string as well as left and right contexts
- e.g. learn state-governor pairs; federal agency name-acronym pairs
- Bootstrapping approach for unary relations also works for binary relations
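A binary wrapper with left, middle, and right contexts might look like the sketch below. It is a simplification under assumptions, not Binary SEAL itself: the left and right contexts are fixed-width, the middle context is whatever text separates the two seed items, and the names `PAGE`, `binary_contexts`, `shared_binary_wrappers`, and `extract_pairs` are hypothetical. The toy data uses agency name-acronym pairs.

```python
import re

PAGE = (
    "<li>Federal Bureau of Investigation (FBI)</li>\n"
    "<li>Central Intelligence Agency (CIA)</li>\n"
    "<li>National Security Agency (NSA)</li>\n"
)

def binary_contexts(page, first, second, width):
    """Collect (left, middle, right) contexts for each co-occurrence of a seed pair."""
    found = set()
    pattern = re.escape(first) + r"(.*?)" + re.escape(second)
    for m in re.finditer(pattern, page):
        left = page[max(0, m.start() - width):m.start()]
        right = page[m.end():m.end() + width]
        found.add((left, m.group(1), right))
    return found

def shared_binary_wrappers(page, seed_pairs, width=4):
    """Keep only the wrappers that fit every seed pair."""
    return set.intersection(
        *(binary_contexts(page, a, b, width) for a, b in seed_pairs)
    )

def extract_pairs(page, wrappers):
    """Apply each wrapper to the page and return the (first, second) pairs it brackets."""
    pairs = set()
    for left, middle, right in wrappers:
        pattern = (re.escape(left) + r"(.+?)" + re.escape(middle)
                   + r"(.+?)" + re.escape(right))
        pairs.update(re.findall(pattern, page))
    return pairs

seed_pairs = [
    ("Federal Bureau of Investigation", "FBI"),
    ("Central Intelligence Agency", "CIA"),
]
wrappers = shared_binary_wrappers(PAGE, seed_pairs)
pairs = extract_pairs(PAGE, wrappers)
```

Here the two seed pairs share the single wrapper `("<li>", " (", ")</l")`, which also picks up the unseen pair for the National Security Agency.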
- Future work
- identify category names given some instances
- ...
- Question: Could these sorts of models be queried to estimate how likely a given phrase is to belong to a particular class? This could be useful when many instances of the class are known, and also if only the class name (or a few instances) are known.