Ritter et al, EMNLP 2011. Named Entity Recognition in Tweets: An Experimental Study


Named Entity Recognition in Tweets: An Experimental Study, by A. Ritter, S. Clark, Mausam, O. Etzioni. In Empirical Methods in Natural Language Processing, 2011.

This paper is available online [1].

Summary

This paper designs an NLP pipeline from the ground up for Twitter, covering POS tagging, shallow parsing (chunking), and named entity recognition. Off-the-shelf NER systems perform poorly on tweets because tweets are noisy (misspellings, abbreviations, slang) and terse (140-character limit), and contain a large number of distinctive named entity types.

The authors experimentally evaluate the performance of off-the-shelf, news-trained NLP tools on Twitter data. POS tagging accuracy is reported to drop from 0.97 on news text to 0.80 on tweets.

In addition, the authors introduce a new approach to distant supervision (Mintz et al., 2009) using a topic model. Distant supervision leverages an existing source of labeled data, here large entity lists with their types, to constrain a model that may otherwise be unsupervised.

Brief description of the method

Part-of-Speech Tagging

The authors manually annotated 800 tweets using the PennTreeBank as the base tagset. They added new tags for Twitter-specific phenomena such as retweets, @usernames, #hashtags, and URLs. Their dataset can be found here.
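
As a hypothetical illustration of how such Twitter-specific tokens can be handled, one can pre-tag them deterministically before the statistical tagger runs. This sketch is not the authors' code, and the tag names (USR, HT, URL, RT) are assumptions for illustration, not quotes from the paper:

    import re

    # Hypothetical pre-tagger for Twitter-specific tokens; tag names are assumed.
    TWITTER_TAGS = [
        (re.compile(r'^RT$'), 'RT'),             # retweet marker
        (re.compile(r'^@\w+$'), 'USR'),          # @username
        (re.compile(r'^#\w+$'), 'HT'),           # #hashtag
        (re.compile(r'^https?://\S+$'), 'URL'),  # url
    ]

    def pre_tag(token):
        """Return a Twitter-specific tag, or None to defer to the learned tagger."""
        for pattern, tag in TWITTER_TAGS:
            if pattern.match(token):
                return tag
        return None

    print([pre_tag(t) for t in ['RT', '@user', '#nintendo', 'http://t.co/x', 'announced']])
    # -> ['RT', 'USR', 'HT', 'URL', None]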

To help with OOV words, they clustered distributionally similar words together, running hierarchical clustering with JCluster (Goodman, 2001) over 52 million tweets.
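
A common way to use such hierarchical clusters as features is to take bit-string prefixes of each word's cluster path at several depths, so that an OOV word shares features with distributionally similar in-vocabulary words. A minimal sketch, assuming a precomputed word-to-bit-string table (JCluster's own training and output format are not reproduced here):

    # Assumed file format: one "bitstring<TAB>word" pair per line, produced by
    # a hierarchical clustering run over unlabeled tweets.
    def load_clusters(path):
        table = {}
        with open(path, encoding='utf-8') as f:
            for line in f:
                bits, word = line.rstrip('\n').split('\t')[:2]
                table[word] = bits
        return table

    def cluster_features(word, table, prefixes=(4, 8, 12)):
        """Cluster bit-string prefixes at several depths; OOV words that were
        clustered on unlabeled text share prefixes with similar known words."""
        bits = table.get(word.lower())
        if bits is None:
            return {}
        return {'cluster_p%d' % p: bits[:p] for p in prefixes}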

The POS tagging system, T-POS, uses a CRF for sequence labeling. Its features include lexical features (prefixes and suffixes) and the word clusters above, in addition to standard features such as a POS dictionary, spelling, and context.
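
The paper does not specify the CRF implementation's API, but a tagger with these feature families can be sketched with the sklearn-crfsuite package, reusing cluster_features from the sketch above; the exact feature set here is illustrative:

    import sklearn_crfsuite  # pip install sklearn-crfsuite

    def token_features(sent, i, clusters):
        w = sent[i]
        feats = {
            'word.lower': w.lower(),
            'prefix3': w[:3], 'suffix3': w[-3:],               # lexical features
            'is_upper': w.isupper(), 'is_title': w.istitle(),  # spelling features
            'has_digit': any(c.isdigit() for c in w),
        }
        feats.update(cluster_features(w, clusters))            # distributional clusters
        if i > 0:                                              # contextual features
            feats['prev.lower'] = sent[i - 1].lower()
        if i < len(sent) - 1:
            feats['next.lower'] = sent[i + 1].lower()
        return feats

    def featurize(sents, clusters):
        return [[token_features(s, i, clusters) for i in range(len(s))] for s in sents]

    crf = sklearn_crfsuite.CRF(algorithm='lbfgs', c1=0.1, c2=0.1, max_iterations=100)
    # crf.fit(featurize(train_sents, clusters), train_tag_sequences)
    # predicted = crf.predict(featurize(test_sents, clusters))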

Shallow parsing

The authors annotated the same 800 tweets with tags from the CoNLL'00 shared task for shallow parsing (the BIO labeling scheme). They used the shallow parsing features described in Sha & Pereira (2003), in addition to the clustering information used for POS tagging.
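
In the BIO scheme, each token is labeled B-X (begins a chunk of type X), I-X (inside a chunk of type X), or O (outside any chunk). An illustrative sentence, not drawn from the dataset:

    The        B-NP
    pop        I-NP
    star       I-NP
    announced  B-VP
    a          B-NP
    world      I-NP
    tour       I-NP
    .          O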

Rather than training on only the 16K tokens of in-domain tweets, they also trained on 210K tokens of CoNLL newswire data.

Capitalization

In standard NER datasets, capitalization is an important orthographic cue for recognizing named entities; in tweets, it is unreliable. Hence, the authors designed a classifier, T-CAP, which predicts whether a tweet has been informatively capitalized, i.e., whether capitalization is used in a way that actually signals named entities. This is a binary classification task, for which they trained a Support Vector Machine on the 800 manually annotated tweets.
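
A minimal sketch of such a tweet-level classifier with scikit-learn; the feature set below is an assumption for illustration, not the paper's exact features:

    from sklearn.svm import LinearSVC

    def cap_features(tokens):
        """Tweet-level capitalization cues (an assumed feature set)."""
        n = max(len(tokens), 1)
        return [
            sum(t[:1].isupper() for t in tokens) / n,             # fraction capitalized
            sum(t.isupper() and len(t) > 1 for t in tokens) / n,  # fraction ALL-CAPS
            float(tokens[0][:1].isupper()) if tokens else 0.0,    # first token capitalized
            float('i' in tokens),  # bare lowercase "i" hints at careless capitalization
        ]

    svm = LinearSVC(C=1.0)
    # svm.fit([cap_features(t.split()) for t in tweets], informative_labels)
    # svm.predict([cap_features(new_tweet.split())])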

Named Entity Recognition

They treated named entity segmentation and classification as separate tasks, so that each could use the techniques best suited to it.

For segmenting named entities, they cast the problem as sequence labeling with IOB encoding and used a CRF for learning and inference.
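
Concretely, segmentation only marks entity boundaries, labeling each token B (begins an entity), I (inside an entity), or O (outside); types are assigned later. An illustrative (invented) tweet:

    so        O
    excited   O
    for       O
    the       O
    nintendo  B
    3ds       I
    release   O
    in        O
    march     O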

For classifying named entities, they leveraged out-of-domain data to provide context for the wide variety of types that appear in tweets. As a source of distant supervision, they gathered large lists of entities and their types from an open-domain ontology (Freebase). They used Labeled LDA to model each entity's distribution over the range of types it can take according to Freebase. Instead of limiting each entity to a single type, each entity is modeled as a mixture of several types; this mixture information is shared across mentions, which handles entities whose mentions can refer to different types.
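
A minimal sketch of a collapsed Gibbs sampler for Labeled LDA under this setup, where each "document" is the bag of context words collected for one entity string and its topic set is restricted to that entity's Freebase types. Hyperparameters, data layout, and initialization are assumptions, not the paper's implementation:

    import numpy as np

    def labeled_lda(docs, labels, K, V, iters=200, alpha=0.1, beta=0.01, seed=0):
        """docs: list of word-id lists (the context words of one entity string each);
        labels: list of allowed type-id lists (that entity's Freebase types).
        Returns a per-entity distribution over the K types."""
        rng = np.random.default_rng(seed)
        n_dk = np.zeros((len(docs), K))  # entity-type counts
        n_kw = np.zeros((K, V))          # type-word counts
        n_k = np.zeros(K)
        z = []
        for d, doc in enumerate(docs):
            zd = [int(rng.choice(labels[d])) for _ in doc]  # init within the label set
            for w, k in zip(doc, zd):
                n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
            z.append(zd)
        for _ in range(iters):
            for d, doc in enumerate(docs):
                allowed = np.array(labels[d])
                for i, w in enumerate(doc):
                    k = z[d][i]
                    n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1  # remove token
                    # standard LDA sampling distribution, restricted to allowed types
                    p = (n_dk[d, allowed] + alpha) * (n_kw[allowed, w] + beta) \
                        / (n_k[allowed] + V * beta)
                    k = int(rng.choice(allowed, p=p / p.sum()))
                    z[d][i] = k
                    n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
        theta = np.zeros_like(n_dk)
        for d, lab in enumerate(labels):
            theta[d, lab] = n_dk[d, lab] + alpha  # smooth only the allowed types
        return theta / theta.sum(axis=1, keepdims=True)

The returned theta row for an entity is exactly the type mixture that is shared across all of that entity's mentions.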

For entities never seen before, they fall back on a prior distribution over types, estimated from the entities encountered during training.

Experimental Results

POS Tagging

[Figure: POS tagging results (Pos result.png)]

Shallow parsing

[Figure: Shallow parsing (chunking) results (Chunking result.png)]

Named Entity Recognition

[Figure: Capitalization classification results (Cap result.png)]

[Figure: Named entity segmentation results (Seg ner result.png)]

[Figure: Named entity classification results (Ner result.png)]

Dataset

The authors have released the TwitterNER dataset and the source code for the paper. The demo and data are available online at [2].

Related Papers

Gimpel et al. ACL 2011 - a Twitter POS tagging paper from ACL 2011.

Blei et al. JMLR 2003 - the Latent Dirichlet Allocation paper; the Labeled LDA model used here to model entity distributions over possible types builds on it.

Cohen and Sarawagi. KDD 2004 - this paper also leveraged an external dictionary for NER. In their case, they used a semi-Markov model, which scores multi-token segments at a time.