OOV Detection from ASR Hypothesis - Project Midway Report
Contents
Data description
We plan to use the same dataset proposed at the beginning of the semester: the TDT4 broadcast news corpus (speech with text transcripts and annotations), with named-entity labels from ACE. The broadcast news comes from ABC, CNN, NBC, PRI, VOA, and MNB.
However, there is a problem we did not realize at the beginning: the ACE labels do not cover the full corpus; only a representative sample was annotated. The full dataset (labeled and unlabeled) contains 2,410,855 tokens, of which only 33,479 (about 1.4%) are labeled; 3,164 of the labeled tokens are named entities.
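The coverage figures can be checked directly from the token counts above (a trivial computation, included only to make the proportions explicit):

```java
public class LabelCoverage {
    public static void main(String[] args) {
        long total = 2_410_855;   // all tokens in the TDT4 corpus
        long labeled = 33_479;    // tokens covered by ACE labels
        long entities = 3_164;    // labeled tokens that are named entities
        System.out.printf("labeled: %.2f%% of corpus%n", 100.0 * labeled / total);
        System.out.printf("entities: %.2f%% of labeled tokens%n", 100.0 * entities / labeled);
    }
}
```

So roughly 1.4% of the corpus is labeled, and within that labeled sample about 9.5% of tokens are named entities.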
Have we changed our plan?
We are keeping the same goal. However, the limited amount of labeled data presents an opportunity to explore a more challenging approach: instead of a fully supervised method, we will now tackle the problem in a semi-supervised manner. During training, we will combine a portion of the labeled data with the large amount of unlabeled data, and evaluate on the held-out labeled data. We will explore several known semi-supervised approaches, including co-training and EM.
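As a sketch of the co-training idea: two classifiers, each trained on a different "view" of the data, take turns labeling the unlabeled examples they are most confident about and adding them to the shared training set. The toy numeric features and nearest-centroid classifiers below are illustrative assumptions, not our actual NER features or models:

```java
import java.util.*;

// Toy co-training loop with two feature views and nearest-centroid classifiers.
public class CoTrainingSketch {

    // Nearest-centroid classifier over a single feature view.
    static class Centroid {
        double posMean, negMean;
        void fit(List<double[]> X, List<Integer> y, int view) {
            double ps = 0, ns = 0; int pc = 0, nc = 0;
            for (int i = 0; i < X.size(); i++) {
                if (y.get(i) == 1) { ps += X.get(i)[view]; pc++; }
                else               { ns += X.get(i)[view]; nc++; }
            }
            posMean = ps / pc; negMean = ns / nc;
        }
        int predict(double[] x, int view) {
            return Math.abs(x[view] - posMean) <= Math.abs(x[view] - negMean) ? 1 : 0;
        }
        double margin(double[] x, int view) {  // confidence = distance margin
            return Math.abs(Math.abs(x[view] - negMean) - Math.abs(x[view] - posMean));
        }
    }

    // Move the unlabeled example the classifier is most confident about into the
    // labeled set, using the classifier's own prediction as its label.
    static void teach(Centroid c, int view,
                      List<double[]> X, List<Integer> y, List<double[]> U) {
        int best = 0;
        for (int i = 1; i < U.size(); i++)
            if (c.margin(U.get(i), view) > c.margin(U.get(best), view)) best = i;
        double[] u = U.remove(best);
        X.add(u);
        y.add(c.predict(u, view));
    }

    // Runs two co-training rounds and returns predictions for two probe points.
    static int[] demo() {
        List<double[]> X = new ArrayList<>(Arrays.asList(          // labeled seed
            new double[]{0.1, 0.2}, new double[]{0.9, 0.8}));
        List<Integer> y = new ArrayList<>(Arrays.asList(0, 1));
        List<double[]> U = new ArrayList<>(Arrays.asList(          // unlabeled pool
            new double[]{0.15, 0.1}, new double[]{0.85, 0.9},
            new double[]{0.2, 0.25}, new double[]{0.8, 0.75}));

        Centroid a = new Centroid(), b = new Centroid();
        for (int round = 0; round < 2 && U.size() >= 2; round++) {
            a.fit(X, y, 0);       // view A = feature 0
            b.fit(X, y, 1);       // view B = feature 1
            teach(a, 0, X, y, U); // A labels an example for the shared pool
            teach(b, 1, X, y, U); // B does the same
        }
        a.fit(X, y, 0);           // final model on the enlarged labeled set
        return new int[]{ a.predict(new double[]{0.3, 0.0}, 0),
                          a.predict(new double[]{0.7, 0.0}, 0) };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(demo()));  // prints "[0, 1]"
    }
}
```

The key property, which carries over to our NER setting, is that each view's confident predictions become training data for the other view, so the labeled set grows without further human annotation.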
What code have we written so far?
We have done some data preprocessing using Perl. We are currently decoding the speech input with a limited vocabulary (40k words) to obtain the noisy ASR hypotheses we want to work with. We have yet to write the code for the NER part, which we plan to write in Java.
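Since the decoder's vocabulary is capped at 40k words, any reference word outside that vocabulary can never appear in the hypothesis; those words are exactly the OOV targets. A minimal sketch of flagging such tokens against a vocabulary list (the tiny vocabulary and sentence below are placeholders, not our actual 40k word list):

```java
import java.util.*;

public class OovTagger {
    // Placeholder standing in for the decoder's 40k-word vocabulary.
    static final Set<String> VOCAB = new HashSet<>(Arrays.asList(
        "the", "president", "spoke", "in", "washington"));

    // Tag every token that falls outside the decoding vocabulary as OOV.
    static List<String> tag(String[] tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens)
            out.add(VOCAB.contains(t.toLowerCase(Locale.ROOT)) ? t : t + "/OOV");
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tag("the president spoke in Ouagadougou".split(" ")));
        // prints "[the, president, spoke, in, Ouagadougou/OOV]"
    }
}
```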
Off-the-shelf packages
For the recognition task, we are using CMU Sphinx 3. We initially tried CMU Sphinx 4, a Java implementation of the decoder, but it proved too slow: 500 utterances, each about 3 seconds long on average, took about 5 hours to decode (roughly 36 seconds per utterance). We therefore switched to Sphinx 3, which is implemented in C; it decoded 800 utterances in about 1.5 hours (roughly 6.8 seconds per utterance), a speedup of about 5x.
We have also downloaded and installed Minorthird, which we plan to use later on for the NER task.