Stylistic Structure in Historic Legal Text

From Cohen Courses
Revision as of 20:16, 5 October 2011 by Emayfiel (talk | contribs)

This will be the project page for Elijah Mayfield and William Y. Wang.


The Background

In this project, we are interested in understanding the stylistic differences among judges in historical legal opinions. We specifically focus on cases regarding slaves as property. Slaves remained the largest source of wealth until the 1840s. Judicial preferences and styles could generate variations in the security of slaves.

We are interested in studying how these cases were handled in different regions of the United States with varying views towards slavery. Because this is a longitudinal data set, we are also interested in understanding how styles change over the course of decades.

To do this, we will use a comparable aligned corpus of judicial opinions and overviews on the same cases. Our belief is that by capturing the topical overlap between an opinion and a neutral overview, the non-content word structure of the judge's opinion that remains will be indicative of the style in which that information is being presented.

To measure this, we will use local structured prediction tasks to generate a feature representation of a text based on those stylistic cues. We will then compare that representation to a simpler unigram or LDA-based feature space on a classification task (region identification) and a regression task (year identification). Our belief is that our stylistic model will be more accurate quantitatively (as measured by performance on these tasks) and more interesting qualitatively (by leveraging features other than topic-based cue words to make a classification).


The Dataset

We have collected a corpus of slave-related and property-related US state supreme court legal opinions from LexisNexis. The dataset includes 6,014 slave-related state supreme court cases from 24 states, covering the period 1730-1866. It also includes 14,580 property-related cases from the same period. Most of the cases contain the following data fields: Parties, Court, Date, Judge Opinion, Previous Court and Judges, Disposition, Case Overview, Procedural Posture, Outcome, Core Terms Generated by Lexis, Headnote, Counsel, and Judge(s).


The Theory

We focus on the issue of author engagement, an attempt to describe the extent to which an author aligns themselves with the content of what they are writing. Examples of low engagement may be signalled by distancing with modal phrases ("it may be the case that...") or by attribution to another source ("the defendant claims that..."). High engagement may be signalled by pronouncement ("Of course it's true that...") or explicit endorsement of a third-party claim ("The defendant has demonstrated that..."). On the other hand, speakers may make statements with no engagement (simply stating a fact), suggesting that they believe that fact will be taken for granted or is entirely obvious to any reader.

These levels of engagement with the facts of a case demonstrate alignment with certain facts or sides in a legal case. Our belief is that the way in which facts, entities, and events are referenced by a judge in an opinion will be influenced heavily by other factors surrounding the judgment, such as the location, time period, and outcome of the verdict. Therefore, if we can extract these behaviors in a systematic way, we can then use them as observed features in a generative model. Moreover, these features are likely to be more informative and interesting for social scientists than simpler n-gram features, even if they perform no better at classification, due to their more descriptive nature.


The Approach

Qualitative analysis of our data set immediately showed a major disparity between the two largest text fields in each case - Judge Opinion and Case Overview. The first, written by the judge in delivering a verdict, is littered with examples of author engagement, with markers for opinionated, convincing, judgmental, or attributed facts. This is only natural for a judgment that must collect myriad testimonies and sources of evidence into a single verdict. On the other hand, the Case Overview section of each case lacks author engagement entirely. Facts and testimonies are recorded impassively, with no attempt to persuade the reader - it is a simple summary.

Most intriguingly, these texts are about the same pieces of evidence, the same testimonies, the same series of events. This means that we have, in effect, fairly large pseudo-parallel corpora for engaged and disengaged authors. However, these texts are not the same length - on average, an overview is roughly 10% of the size of the judge's opinion. Therefore, it is not practical to attempt sentence-by-sentence alignment.


Evaluation

Our task is to build structured representations of text which are informative for describing the stylistic structure of a written text. To test whether we are, in fact, getting any signal from our structured representation, we will attempt a classification task and a regression task. The first will be to predict whether an opinion was written in a slave state, free state, or border state. The second will attempt to predict the year in which an opinion was written.

We can then measure these results both quantitatively (mean squared error over predicted years for regression, and accuracy or kappa for classification) and qualitatively (by checking that the distribution of features in different categories is indeed informative). For this latter interpretation and analysis, we will be collaborating with a historian from Columbia University and an economist from American University, from whom we received access to this corpus.
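
The quantitative measures named above can be sketched in a few lines of plain Python. This is only an illustration, not the project's evaluation code: the function names are our own, the region labels ("slave", "free", "border") follow the classification task described above, and returning 0.0 when kappa is undefined is an assumption made here for convenience.

```python
from collections import Counter


def mean_squared_error(true_years, predicted_years):
    """Average squared difference between true and predicted years."""
    pairs = list(zip(true_years, predicted_years))
    return sum((t - p) ** 2 for t, p in pairs) / len(pairs)


def accuracy(true_labels, predicted_labels):
    """Fraction of opinions whose predicted region matches the true region."""
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)


def cohen_kappa(true_labels, predicted_labels):
    """Agreement between predictions and truth, corrected for chance."""
    n = len(true_labels)
    observed = accuracy(true_labels, predicted_labels)
    true_counts = Counter(true_labels)
    pred_counts = Counter(predicted_labels)
    # Chance agreement: probability that both pick the same label independently.
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if expected == 1.0:
        # Kappa is undefined here; returning 0.0 is an assumption of this sketch.
        return 0.0
    return (observed - expected) / (1 - expected)


true_regions = ["slave", "slave", "free", "border"]
predicted_regions = ["slave", "free", "free", "border"]
print(accuracy(true_regions, predicted_regions))
print(cohen_kappa(true_regions, predicted_regions))
print(mean_squared_error([1820, 1850], [1825, 1845]))
```

Kappa is worth reporting alongside accuracy because the three region classes are unlikely to be balanced, and raw accuracy can look high for a classifier that mostly predicts the majority class.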


Baseline

We will attempt two baselines. The first will be a bag-of-words representation of an opinion. The second will be based on LDA topic modeling, using default settings.

It is possible that these baselines will perform well. However, we believe that if they do, it will be because of shallow features which are not informative for social scientists. By contrast, stylistic features which describe a deeper level of linguistic structure may still be interesting even if they perform slightly worse at the overall tasks.
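
As a rough illustration of the first baseline, a bag-of-words representation maps each opinion to a vector of word counts over a fixed vocabulary. The sketch below uses only the standard library; the tokenizer (lowercased alphabetic tokens) and the function names are assumptions made for this example, not choices stated in the project.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens (a simplifying assumption)."""
    return re.findall(r"[a-z']+", text.lower())


def build_vocabulary(documents):
    """Map every distinct token in the corpus to a column index."""
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(tokens)}


def bag_of_words(text, vocabulary):
    """Return a count vector for one document; unknown tokens are ignored."""
    vector = [0] * len(vocabulary)
    for tok, count in Counter(tokenize(text)).items():
        if tok in vocabulary:
            vector[vocabulary[tok]] = count
    return vector


opinions = ["The court finds the claim valid.", "The claim fails."]
vocabulary = build_vocabulary(opinions)
print(vocabulary)
print([bag_of_words(doc, vocabulary) for doc in opinions])
```

Vectors of this kind discard word order entirely, which is why the representation captures topical cue words but little of the stylistic structure discussed above.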


Engagement Structure Extraction

A key aspect of our representation will be finding the features that surround content words, and describing them succinctly. We have three categories of spans of text in each sentence:

  • Content words - discussing entities or events in a case, or specific case numbers, etc. We may be able to identify these automatically or through straightforward TF-IDF measures.
  • Engagement words - words surrounding the content words in a section which show how that information is being construed by the author. These are what we are interested in.
  • Uninteresting words - words which are not contentful and do not relate to the author's positioning.
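
To make the TF-IDF idea for content words concrete, the sketch below scores each token by its relative frequency in a document times the log inverse document frequency, so that words specific to one case score higher than words shared by every case. The particular weighting (unsmoothed `log(N / df)`), the tokenizer, and the function names are assumptions chosen for this example; many TF-IDF variants exist.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens (a simplifying assumption)."""
    return re.findall(r"[a-z']+", text.lower())


def tfidf_scores(documents):
    """Return one {token: score} dict per document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each token once per document
    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        scores.append({
            tok: (count / len(tokens)) * math.log(n_docs / doc_freq[tok])
            for tok, count in term_freq.items()
        })
    return scores


def top_terms(score_dict, k):
    """Highest-scoring tokens, ties broken alphabetically."""
    ranked = sorted(score_dict.items(), key=lambda item: (-item[1], item[0]))
    return [tok for tok, _ in ranked[:k]]


cases = ["the slave was sold", "the land was sold", "the court may rule"]
scores = tfidf_scores(cases)
print(top_terms(scores[0], 1))
```

Under this weighting a token that appears in every document receives a score of zero, which is one simple way to separate case-specific content words from the rest.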

Our goal is to identify Engagement words in a sentence. This can be viewed as a superset of the hedge detection problem from the NLP literature. Our work will be largely unsupervised, but we will start with a seed word list from linguistic literature. We can then label those key terms as engagement indicators with high confidence in our training data.
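
The seed-list step described above can be sketched as simple phrase matching. The seed phrases below are taken from the examples in The Theory section and stand in for a real list from the linguistic literature; the labels "low" and "high" and the function name are hypothetical, introduced only for this illustration.

```python
import re

# Hypothetical seed lexicon: phrases come from the examples in The Theory
# section, not from an actual published word list.
SEED_PHRASES = {
    "it may be the case that": "low",
    "claims that": "low",
    "of course": "high",
    "has demonstrated that": "high",
}


def find_engagement_spans(sentence, seeds=SEED_PHRASES):
    """Return sorted (start, end, matched_text, label) tuples for seed phrases."""
    spans = []
    for phrase, label in seeds.items():
        pattern = r"\b" + re.escape(phrase) + r"\b"
        for match in re.finditer(pattern, sentence, flags=re.IGNORECASE):
            spans.append((match.start(), match.end(), match.group(0), label))
    return sorted(spans)


print(find_engagement_spans("The defendant claims that the deed was forged."))
print(find_engagement_spans("Of course it is true that the deed was forged."))
```

Spans found this way would serve only as high-confidence starting labels; the largely unsupervised step of extending them to unseen engagement words is not covered by this sketch.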


Judge-Year-Region Topic Model

The second stage of our research project is not just to define a feature space, but also to use it for classification in a more intentionally designed way than a simple linear model combining features. We propose an extension of Latent Dirichlet Allocation based on the Author-Topic model (Rosen-Zvi et al., 2004) which incorporates judge, year, and region rather than just author.

[Figure: Slide1.png]

This idea isn't fully fleshed out yet.