Nschneid Liuy Project status update

From Cohen Courses
Revision as of 10:42, 3 September 2010 by WikiAdmin

Our project consists of two components: a small language for describing features, and a framework for improving the effectiveness of the feature space (primarily by selecting only a subset of the initial feature set without hurting accuracy). Below we describe our progress on each of these components in turn.

We will be using the SemEval 2007 data set for frame-semantic parsing. Some statistics for this data are displayed in the table at right.

[Image: FN SemEval stats.gif — statistics for the SemEval 2007 frame-semantic parsing data set]

Feature language component

The project proposal document briefly described our feature language. Using the PLY lexer/parser toolkit, we have built a Python compiler which recognizes a subset of this language.
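For illustration, the kind of tokenization involved can be sketched without PLY using Python's re module. This is not our actual PLY grammar; the token names and patterns here are simplified stand-ins:

```python
import re

# Hypothetical token set; the real PLY lexer defines many more token types.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("NAME",   r"[A-Za-z_]\w*"),
    ("OP",     r"[\[\]{}:,+\-*/]"),
    ("SKIP",   r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def tokenize(text):
    """Yield (token_type, lexeme) pairs, skipping whitespace."""
    for m in MASTER.finditer(text):
        if m.lastgroup != "SKIP":
            yield (m.lastgroup, m.group())

# A range literal from our language tokenizes into brackets, numbers, and a colon.
print(list(tokenize("[1:11]")))
```

In the real compiler, PLY generates the lexer from declarative token specifications and feeds the token stream to a yacc-style parser; the sketch above only mirrors the lexing step.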

So far, the compiler recognizes many types of expressions and can interpret them on the fly or generate corresponding Python code. As our language is largely similar to Python, this is fairly straightforward. Python syntax for atomic and string literals, arithmetic expressions, boolean expressions, list literals, sequence operations (indexing/slicing, concatenation, length), type conversion, and constant assignment is now supported. We have implemented a few additional features specific to our language:

  • syntax for regular expression literals and matching operations
  • special syntax for integer ranges in lists or sets (e.g. [1:11] represents the list of integers from 1 to 10, and {1:11} is the set of these integers)
  • special slicing of a sequence using a list of indices or ranges

We are now ready to implement support for the primary novel functionality of our language: the ability to define feature templates and functions which can refer to input data. This entails implementing:

  • function definitions and function calls
  • use of resources (i.e. tables representing input data)
    For simplicity we will initially limit operations on resources to counting the lines, accessing an entire row (by number), and accessing a value in a row (by row and column number)
  • indexers, integer-valued variables which range over the data points or portions of those data points and are used to index feature vectors
  • feature template definitions
    For simplicity we will initially allow only binary features, specified by feature templates of the form ftrName[...](...) |= condition, where condition is a boolean expression determining whether the feature fires given the values of the parameters/indexers.
  • feature set definitions, where each feature set contains a subset of the feature templates
    We might restrict the use of resources to arguments of feature sets
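To make the planned resource and feature-template semantics concrete, here is a hedged Python sketch. The class and function names are ours for illustration, and the three resource operations are exactly the limited set listed above:

```python
class Resource:
    """A table of input data. For now we support only: counting lines,
    fetching a whole row by number, and fetching a cell by row and column."""
    def __init__(self, rows):
        self._rows = [list(r) for r in rows]
    def num_lines(self):
        return len(self._rows)
    def row(self, i):
        return self._rows[i]
    def value(self, i, j):
        return self._rows[i][j]

def make_binary_feature(condition):
    """A template ftrName[...](...) |= condition compiles, roughly, to a
    function over indexer values that returns 1 when the condition holds."""
    def feature(*indexers):
        return 1 if condition(*indexers) else 0
    return feature

# Toy usage: a binary feature that fires when token i is capitalized.
tokens = Resource([["John"], ["saw"], ["Mary"]])
is_cap = make_binary_feature(lambda i: tokens.value(i, 0)[0].isupper())
print([is_cap(i) for i in range(tokens.num_lines())])  # fires on "John" and "Mary"
```

Here the indexer i ranges over the data points (rows of the resource), matching the role of indexers in feature-vector indexing described above.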

The language will be compiled into Python so as to be usable from Java via Jython.

Feature-selection component

We'll focus first on regularization (L2 regularization, and L1 regularization via the LASSO). We have a Matlab implementation of the LASSO that works for a few thousand features; we've used it to train a model for a subset of the frame-parsing data points by randomly selecting features. However, the Matlab approach will not scale to the millions of features required for our application. As the frame parser is written in Java, we would prefer a Java implementation of the LASSO and other forms of regularization. The LingPipe package (http://alias-i.com/lingpipe/) provides APIs for logistic regression with various types of regularization; we will see whether their implementation can be generalized to log-linear models where the classes are not consistent across data points. We also know of a C implementation that might be appropriate.
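The key property of L1 regularization that we rely on for feature selection is that it drives weights of uninformative features exactly to zero. A toy sketch of this effect, using iterative soft thresholding (ISTA) on L1-regularized least squares; this is illustrative only, not the solver we plan to use, and production implementations choose step sizes and convergence criteria far more carefully:

```python
def soft_threshold(x, t):
    # Proximal operator of the L1 norm: the core step of LASSO-style solvers
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0

def lasso_ista(X, y, lam, step=0.01, iters=2000):
    """Toy ISTA solver for: min_w 0.5*||Xw - y||^2 + lam*||w||_1."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(iters):
        # Gradient of the squared-error term
        resid = [sum(X[i][j] * w[j] for j in range(d)) - y[i] for i in range(n)]
        grad = [sum(X[i][j] * resid[i] for i in range(n)) for j in range(d)]
        # Gradient step on the smooth term, then soft-threshold for the L1 term
        w = [soft_threshold(w[j] - step * grad[j], step * lam) for j in range(d)]
    return w

# y depends only on the first column; L1 should zero out the irrelevant weight.
X = [[1.0, 0.1], [2.0, -0.2], [3.0, 0.05], [4.0, 0.0]]
y = [2.0, 4.0, 6.0, 8.0]
w = lasso_ista(X, y, lam=0.5)
print(w)  # first weight near 2, second weight driven to (near) zero
```

On real data the zeroed-out weights let us discard the corresponding feature templates, which is the sense in which regularization doubles as feature selection.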

Once we complete experiments with regularization, if there's time we'll look at other feature selection techniques like information gain.