Nschneid writeup of Wang talk

From Cohen Courses

This is Nschneid's review of Wang_thesis_defense

Language-Independent Class Instance Extraction Using the Web

  • SEAL: Uses structured Web data to expand sets from a small list of seed examples via wrappers based on HTML source. Candidates are ranked by random walks in a graph containing wrappers and their fillers. No tokenization required => language-independent.
    • Evaluated on manually-constructed data sets, 12 sets for each of (English, Chinese, Japanese), e.g. Japanese provinces
    • Constraining wrappers to be shared by multiple seeds reduces noise: e.g. seeds Seattle and Boston; based on contexts of Seattle alone we might propose Carnegie Mellon, but this is less likely for contexts hosting both Seattle and Boston.
    • Limitation: performance drops significantly with more than 5 seeds
  • iSEAL (iterative SEAL)
    • Supervised version: at every iteration, seeds are obtained from a reliable source. Works with a fixed seed size.
    • Bootstrapping stage: seeds for the ith stage are selected from candidate items from the previous stage. A fixed seed size doesn't work; instead, for the ith iteration, randomly select (i+1) seeds from the original seeds plus the candidates from previous stages.
  • ASIA: Finding instances from category names
  • Bilingual SEAL: for better precision, uses agreement on category instances extracted from data in multiple languages
    • ANET (Automatic Named Entity Translation system): using bilingual snippets, ranks target language chunks (segments surrounded by punctuation or text in another language) based on how closely they correlate with the input string
  • Binary SEAL: extracting binary relations (i.e. pairs of related class instances)
    • Learns binary wrappers, which have a middle context string as well as left and right contexts
    • e.g. learn state-governor pairs; federal agency name-acronym pairs
    • Bootstrapping approach for unary relations also works for binary relations
  • Future work
    • identify category names given some instances
    • ...
  • Question: Could these sorts of models be queried to estimate how likely a given phrase is to belong to a particular class? This could be useful when many instances of the class are known, and also when only the class name (or a few instances) is known.
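
To make the wrapper idea from the SEAL notes above more concrete, here is a minimal sketch of character-level wrapper induction: find the left and right contexts shared by all seeds on a page, then extract whatever else appears between those same contexts. This is an illustration only, not the system described in the talk. The function names (induce_wrapper, apply_wrapper), the window parameter, the toy page, and the use of only the first occurrence of each seed are all assumptions made for the example, and no ranking by random walk is attempted.

```python
import re


def common_prefix(strings):
    """Longest string that every input starts with."""
    prefix = strings[0]
    for s in strings[1:]:
        n = 0
        while n < min(len(prefix), len(s)) and prefix[n] == s[n]:
            n += 1
        prefix = prefix[:n]
    return prefix


def common_suffix(strings):
    """Longest string that every input ends with."""
    return common_prefix([s[::-1] for s in strings])[::-1]


def induce_wrapper(page, seeds, window=20):
    """Return the (left, right) contexts shared by every seed, or None.

    Hypothetical simplification: only the first occurrence of each seed is
    used, and contexts are capped at `window` characters on each side.
    """
    lefts, rights = [], []
    for seed in seeds:
        i = page.find(seed)
        if i < 0:
            return None  # a seed is missing, so no shared wrapper exists
        lefts.append(page[max(0, i - window):i])
        rights.append(page[i + len(seed):i + len(seed) + window])
    left, right = common_suffix(lefts), common_prefix(rights)
    if not left or not right:
        return None
    return left, right


def apply_wrapper(page, wrapper):
    """Extract every string enclosed by the wrapper's left and right contexts."""
    left, right = wrapper
    # Lookarounds keep the contexts unconsumed, so adjacent items can share them.
    pattern = "(?<=" + re.escape(left) + ")(.+?)(?=" + re.escape(right) + ")"
    return re.findall(pattern, page)


# Toy page (made up for this example): a list of cities plus an unrelated sentence.
page = ("<ul><li>Seattle</li><li>Boston</li><li>Denver</li><li>Austin</li></ul>"
        "<p>Seattle is far from Carnegie Mellon.</p>")

wrapper = induce_wrapper(page, ["Seattle", "Boston"])
print(wrapper)
print(apply_wrapper(page, wrapper))
```

On this toy page, the wrapper shared by both seeds proposes Denver as a new candidate and does not propose Carnegie Mellon, which only appears in a context hosting Seattle alone. Because the contexts are taken as long as possible, the last list item (Austin) is missed here; that is a limitation of this sketch, not a claim about SEAL itself. Note also that nothing in the sketch tokenizes the text, which is the property the notes credit for language independence.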