Stylistic Structure in Historic Legal Text

This will be the project page for Elijah Mayfield and William Y. Wang.

The Background

In this project, we are interested in understanding the stylistic differences among judges in historical legal opinions. We specifically focus on cases regarding slaves as property. Slaves remained the largest source of wealth until the 1840s. Judicial preferences and styles could generate variations in the security of slaves.

We are interested in studying how these cases were handled in different regions of the United States with varying views towards slavery. Because this is a longitudinal data set, we are also interested in understanding how styles change over the course of decades.

To do this, we will utilize a comparable aligned corpus of judicial opinions and overviews on the same cases. Our belief is that by capturing the topical overlap between an opinion and a neutral overview, the non-content word structure of the judge's opinion that remains will be indicative of the style in which that information is being presented.

To measure this, we will utilize local structured prediction tasks to generate a feature representation of a text based on those stylistic cues. We will then compare that representation to a simpler, unigram or LDA-based feature space at a classification task (region identification) and a regression task (year identification). Our belief is that our stylistic model will be more accurate quantitatively (by measuring accuracy at these tasks) and more interesting qualitatively (by leveraging features other than topic-based cue words to make a classification).

The Dataset

We have collected a corpus of slave-related and property-related US state supreme court legal opinions from LexisNexis. The dataset includes 6,014 slave-related state supreme court cases from 24 states, during the period 1730-1866. It also includes 14,580 property-related cases from the same period. Most of the cases consist of the following data fields: Parties, Court, Date, Judge Opinion, Previous Court and Judges, Disposition, Case Overview, Procedural Posture, Outcome, Core Terms Generated by Lexis, Headnote, Counsel, and Judge(s).

The Theory

We focus on the issue of author engagement, an attempt to describe the extent to which an author aligns themselves with the content of what they are writing. Examples of low engagement may be signalled by distancing with modal phrases ("it may be the case that...") or by attribution to another source ("the defendant claims that..."). High engagement may be signalled by pronouncement ("Of course it's true that...") or explicit endorsement of a third-party claim ("The defendant has demonstrated that..."). On the other hand, speakers may make statements with no engagement (simply stating a fact), suggesting that they believe that fact will be taken for granted or is entirely obvious to any reader.

These levels of engagement with the facts of a case demonstrate alignment with certain facts or sides in a legal case. Our belief is that the way in which facts, entities, and events are referenced by a judge in an opinion will be influenced heavily by other factors surrounding the judgment, such as the location, time period, and outcome of the verdict. Therefore, if we can extract these behaviors in a systematic way, we can then use them as observed features in a generative model. Moreover, these features are likely to be more informative and interesting for social scientists than simpler n-gram features, even if they perform no better at classification, due to their more descriptive nature.

Comments: I think you mean more abstract, not more descriptive? --Nasmith 20:53, 9 October 2011 (UTC)

The Approach

Qualitative analysis of our data set immediately showed a major disparity between the two largest text fields in each case - Judge Opinion and Case Overview. The first, written by the judge in delivering a verdict, is littered with examples of author engagement, with markers for opinionated, convincing, judgmental, or attributed facts. This is only natural for a judgment that must collect myriad testimonies and sources of evidence into a single verdict. On the other hand, the Case Overview section of each case lacks author engagement entirely. Facts and testimonies are recorded impassively, with no attempt to persuade the reader - it is a simple summary.

Most intriguingly, these texts are about the same pieces of evidence, the same testimonies, the same series of events. This means that we have, in effect, fairly large pseudo-parallel corpora for engaged and disengaged authors. However, these texts are not the same length - on average, an overview is roughly 10% of the size of the judge's opinion. Therefore, it is not practical to attempt sentence-by-sentence alignment.

We will therefore focus on two aspects of this dataset. First, we investigate latent variable models that jointly model judges, topics, and geographical and temporal information. Second, we explore dependency parsing features (Joshi et al., 2010), named entity features (Wang et al., 2011), and other syntactic, semantic, and discourse features, and analyze efficient methods for incorporating them into the generative model.

Comments: strange choices of citations for dependency parsing and named entities. I'm not clear on why you want a generative model of all these things; don't you want to use them as features in a discriminative model? --Nasmith 20:56, 9 October 2011 (UTC)

Evaluation

Our task is to build structured representations of text which are informative for describing the stylistic structure of a written text. To test whether we are, in fact, getting any signal from our structured representation, we will attempt a classification task and a regression task. The first will be to predict whether an opinion was written in a slave state, free state, or border state. The second will attempt to predict the year in which an opinion was written.

We can then measure these results both quantitatively (mean squared error (in years) for regression, and classification accuracy or kappa for classification) and qualitatively (by checking that the distribution of features in different categories is indeed informative). For this latter interpretation and analysis, we will be collaborating with a historian from Columbia University and an economist from American University, from whom we received access to this corpus.
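As a minimal sketch of the quantitative measures named above, the following computes classification accuracy, Cohen's kappa, and mean squared error in plain Python. The label names and the toy gold/predicted values are illustrative assumptions, not results from our corpus. Note that mean squared error is expressed in squared years; its square root gives an error in years.

```python
from collections import Counter
import math

def accuracy(gold, pred):
    """Fraction of items whose predicted label matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def cohens_kappa(gold, pred):
    """Agreement between gold and predicted labels, corrected for chance."""
    n = len(gold)
    observed = accuracy(gold, pred)
    gold_counts = Counter(gold)
    pred_counts = Counter(pred)
    # Chance agreement: product of the two marginal label frequencies.
    expected = sum(gold_counts[c] * pred_counts[c] for c in gold_counts) / (n * n)
    return (observed - expected) / (1 - expected)

def mean_squared_error(gold_years, pred_years):
    """Average squared difference between gold and predicted years."""
    return sum((g - p) ** 2 for g, p in zip(gold_years, pred_years)) / len(gold_years)

# Toy values, invented purely for illustration.
gold_regions = ["slave", "free", "border", "slave"]
pred_regions = ["slave", "free", "slave", "slave"]
gold_years = [1820, 1850]
pred_years = [1825, 1840]

acc = accuracy(gold_regions, pred_regions)
kappa = cohens_kappa(gold_regions, pred_regions)
mse = mean_squared_error(gold_years, pred_years)
rmse = math.sqrt(mse)  # error in years rather than squared years
```

Kappa is useful alongside accuracy here because the three region classes are unlikely to be balanced, so a majority-class guesser could score a deceptively high accuracy.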

Baselines

We will attempt three to four baselines. The candidate baseline methods include: a bag-of-words representation of an opinion; the original LDA topic model (Blei et al., 2003), using default settings; the Author-Topic model (Rosen-Zvi et al., 2004); and a variation of supervised LDA, the Labeled LDA model (Ramage et al., 2009).

It is possible these baselines will perform well. However, we believe that if they do, it will be because of shallow features which are not informative for social scientists. By contrast, stylistic features which describe a deeper level of linguistic structure may still be interesting even if they perform slightly worse at the overall tasks.
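The simplest of these baselines, the bag-of-words representation, can be sketched as follows. This is only an illustration under our own assumptions: the regular-expression tokenizer, the vocabulary built from the toy documents, and the example sentences are all hypothetical, and the resulting count vectors would be handed to a standard classifier or regressor.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())

def bag_of_words(text, vocabulary):
    """Return a count vector aligned with the given vocabulary order."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

# Toy opinions, invented for illustration only.
docs = [
    "The defendant claims that the deed is void.",
    "The deed was recorded.",
]
vocabulary = sorted({word for doc in docs for word in tokenize(doc)})
vectors = [bag_of_words(doc, vocabulary) for doc in docs]
```

Each opinion becomes one row of counts, so topic-bearing words such as "deed" and function words such as "the" are treated identically; this is exactly the shallowness the stylistic features are meant to move beyond.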

Comments: I don't see how LDA or the author-topic LDA solve your problem (they do not make predictions of an output variable). I think you are talking here about baseline feature sets (for the classification or regression task), and you plan to use standard classification/regression models. Is that right? It's also not clear how you are going to use topic models to create features (there are lots of ways you could do it). Another sensible baseline is to consider all candidate elements (which I think are phrases) from your fancy engagement model, simply as binary or count features, and let the supervised learner learn which ones to trust. --Nasmith 20:58, 9 October 2011 (UTC)

Engagement Structure Extraction

A key aspect of our representation will be finding the features that surround content words, and describing them succinctly. We have three categories of spans of text in each sentence:

• Content words - discussing entities or events in a case, or specific case numbers, etc. We may be able to identify these automatically or through straightforward TF-IDF measures.
• Engagement words - words surrounding the content words in a section which show how that information is being construed by the author. These are what we are interested in.
• Uninteresting words - words which are not contentful and do not relate to the author's positioning.

Our goal is to identify Engagement words in a sentence. This can be viewed as a superset of the hedge detection problem from the NLP literature. Our work will be largely unsupervised, but we will start with a seed word list from linguistic literature. We can then label those key terms as engagement indicators with high confidence in our training data.

We can also label content terms based on words that overlap, by some metric to be decided, with the overview text. Those texts do not have any of the stylistic indicators of engagement that we wish to annotate, so the words that will overlap most strongly are either from the uninteresting or content words category. Content terms can be further labeled with named entity recognition.
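The seed-word and overlap labeling described above might look roughly like the sketch below. Everything specific in it is an assumption made for illustration: the seed list (borrowed from the examples in The Theory), the small function-word list used to separate uninteresting words from content words, and exact token match as the overlap metric, which is still to be decided.

```python
import re

# Hypothetical seed list of engagement markers; the real list would come
# from the linguistic literature.
SEED_ENGAGEMENT = {"may", "claims", "demonstrated"}
# Hypothetical function-word list standing in for the "uninteresting" class.
FUNCTION_WORDS = {"the", "that", "a", "of", "to", "was", "is", "it"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())

def label_tokens(opinion_sentence, overview_text):
    """Assign a partial label to every token of an opinion sentence."""
    overview_vocab = set(tokenize(overview_text))
    labels = []
    for token in tokenize(opinion_sentence):
        if token in SEED_ENGAGEMENT:
            label = "ENGAGEMENT"      # high-confidence seed match
        elif token in FUNCTION_WORDS:
            label = "UNINTERESTING"   # not contentful, not positioning
        elif token in overview_vocab:
            label = "CONTENT"         # overlaps with the neutral overview
        else:
            label = "UNLABELED"       # left for bootstrapping
        labels.append((token, label))
    return labels

# Toy overview and opinion sentence, invented for illustration.
overview = "The plaintiff sold the slave to the defendant."
sentence = "The defendant claims that the sale was void."
labeled = label_tokens(sentence, overview)
```

Tokens such as "sale" stay unlabeled because exact matching misses the overview's "sold", which suggests that a stemmed or otherwise softer overlap metric may be worth considering.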

These steps give us a partially-labeled training corpus. We may explore bootstrapping approaches to get more labeled data, based on those seed words. We will also do a qualitative analysis of the output of this step, to ensure that there are no systematic mistakes in the way data is being labeled. In particular, we need to watch for the use of prior court decisions as citations, a feature of this domain that is not considered by the sociolinguistic literature, which focuses on more general text and, in many cases, on classroom interactions in particular.

In the best case, our data will be annotated as is shown below, using seed words. These seed words are reasonably high precision (by qualitative checking of results) and low recall, occurring in 3.5% of sentences in our corpus. The structured annotations as shown below are likely to be very difficult to reproduce reliably between human annotators.

Given such an annotation, we then need to find patterns of text which we can extract into features. This can be done through a variety of different options, and none has been settled on yet. We can treat this as a sequence tagging problem in the same sense as named entity recognition or hedge detection; we can use features based on dependency parses of our corpus; or, as an option relevant to recent research in my advisor's group, we can adapt or enhance the stretchy pattern framework described in Gianfortoni et al. (2011). In the end, the features will likely be some combination of lexical features, grouped categories of words, dependency or part-of-speech information, and length-based features (by number of tokens).

The resulting feature space, which will be stripped of content words and will represent the output of this engagement-based feature extraction process, will be passed to the next stage.

Latent Variable Models

The second stage of our research project is not just to define a feature space, but also to use it for latent variable models in a more intentionally designed way than a simple linear model combining features. We propose an extension of Latent Dirichlet Allocation based on the Author-Topic model (Rosen-Zvi et al., 2004) which incorporates judge, year, and region (we are also thinking about syntactic and semantic features) rather than just author. We then compare our results to LDA and the Author-Topic model.

In the original LDA model (Blei et al., 2003), the joint likelihood can be represented as

${\displaystyle P({\boldsymbol {W}},{\boldsymbol {Z}},{\boldsymbol {\theta }},{\boldsymbol {\varphi }};\alpha ,\beta )=\prod _{i=1}^{K}P(\varphi _{i};\beta )\prod _{j=1}^{M}P(\theta _{j};\alpha )\prod _{t=1}^{N}P(Z_{j,t}|\theta _{j})P(W_{j,t}|\varphi _{Z_{j,t}})}$

and the generative story is as follows: for each document ${\displaystyle j}$, the model first draws ${\displaystyle \theta _{j}}$ from a Dirichlet prior with parameter ${\displaystyle \alpha }$; then, for each word position ${\displaystyle t}$ in the document, a topic ${\displaystyle Z_{j,t}}$ is drawn from the multinomial distribution ${\displaystyle \theta _{j}}$. Finally, each word is generated with conditional probability ${\displaystyle P(W_{j,t}|\varphi _{Z_{j,t}})}$.
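The generative story for LDA can be made concrete with a small sampler. This is a sketch under stated assumptions rather than an implementation we plan to use: it assumes symmetric Dirichlet priors (a single scalar for each of alpha and beta), a fixed document length, and integer word ids, and all function names are our own.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalizing independent gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_categorical(probs, rng):
    """Draw one index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_corpus(num_docs, doc_len, num_topics, vocab_size, alpha, beta, seed=0):
    """Sample (topic, word) pairs for every document, following the LDA story."""
    rng = random.Random(seed)
    # One word distribution phi_i per topic, drawn from Dirichlet(beta).
    phi = [sample_dirichlet([beta] * vocab_size, rng) for _ in range(num_topics)]
    corpus = []
    for _ in range(num_docs):
        # One topic distribution theta_j per document, drawn from Dirichlet(alpha).
        theta = sample_dirichlet([alpha] * num_topics, rng)
        doc = []
        for _ in range(doc_len):
            z = sample_categorical(theta, rng)   # topic Z_{j,t} given theta_j
            w = sample_categorical(phi[z], rng)  # word W_{j,t} given phi_{Z_{j,t}}
            doc.append((z, w))
        corpus.append(doc)
    return corpus

corpus = generate_corpus(num_docs=3, doc_len=5, num_topics=2,
                         vocab_size=6, alpha=0.5, beta=0.5)
```

The three nested sampling steps correspond to the three products in the joint likelihood: over topics, over documents, and over word positions within a document.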

However, with the original LDA model, it is impossible to model the judge (the author of the document) and other valuable information. Thus, the Author-Topic model was proposed (Rosen-Zvi et al., 2004), which has the following graphical model representation:

where it has two latent variables: the topic ${\displaystyle \mathbf {Z} }$ and the author ${\displaystyle \mathbf {X} }$. The generative story is that, for each word in a document, it first chooses an author ${\displaystyle \mathbf {X} }$, and then a topic ${\displaystyle \mathbf {Z} }$. Finally, it assigns the word a probability given the topic and the author. The result contains a topic distribution for each author and a word distribution for each topic, which in turn can be thought of as describing, given an author, how likely he or she is to use a particular word.
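A rough sampler for the Author-Topic story is sketched below. It follows the published formulation as we understand it, in which an author is chosen uniformly from the document's author list for each word, a topic is drawn from that author's topic distribution, and the word is drawn from the topic's word distribution. The judge names, symmetric prior values, and function names are hypothetical.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalizing independent gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(doc_authors, author_topic, topic_word, doc_len, rng):
    """Sample (author, topic, word) triples for one document."""
    tokens = []
    for _ in range(doc_len):
        # Author X: chosen uniformly among the document's authors.
        x = rng.choice(doc_authors)
        # Topic Z: drawn from the chosen author's topic distribution.
        z = rng.choices(range(len(author_topic[x])), weights=author_topic[x], k=1)[0]
        # Word W: drawn from the chosen topic's word distribution.
        w = rng.choices(range(len(topic_word[z])), weights=topic_word[z], k=1)[0]
        tokens.append((x, z, w))
    return tokens

rng = random.Random(0)
num_topics, vocab_size = 2, 5
judges = ["judge_a", "judge_b"]  # hypothetical author names
author_topic = {name: sample_dirichlet([0.5] * num_topics, rng) for name in judges}
topic_word = [sample_dirichlet([0.5] * vocab_size, rng) for _ in range(num_topics)]
tokens = generate_document(judges, author_topic, topic_word, doc_len=8, rng=rng)
```

For most opinions in our data the author list would contain a single judge, in which case the author draw is trivial and the model reduces to one topic distribution per judge.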

Even though the Author-Topic model has been shown to be superior to the original LDA model in modeling author-specific topics, in the context of our dataset it still fails to address the available regional and temporal information, let alone syntactic and semantic information. Thus, we propose a latent variable model for modeling legal documents:

In this initial model, there are four unobserved variables: the topic ${\displaystyle \mathbf {Z} }$, the judge ${\displaystyle \mathbf {x_{1}} }$, the state ${\displaystyle \mathbf {x_{2}} }$, and the year ${\displaystyle \mathbf {x_{3}} }$. The generative story of this proposed model can be thought of as first choosing a judge ${\displaystyle \mathbf {x_{1}} }$, a region (state) ${\displaystyle \mathbf {x_{2}} }$, and a year ${\displaystyle \mathbf {x_{3}} }$, and then generating the topic ${\displaystyle \mathbf {Z} }$ for that particular setting. In the end, we assign the word distribution ${\displaystyle \mathbf {W} }$ over the chosen topic. One way in which this can be used is to explicitly assign each individual word in a document variables corresponding to its judge, region, and year. This is similar to the idea of heteroglossia, where some words are coming from other speakers (representing another's opinion), but it's not clear how closely the sociolinguistic insight matches the actual use of the model in a real setting.

Thus, given a test document ${\displaystyle \mathbf {T} }$ of length ${\displaystyle \mathbf {N} }$ and ${\displaystyle \mathbf {K} }$ distinct word posterior distributions that involve the region ${\displaystyle \mathbf {x_{2}} }$, our region classification task can be represented as maximizing the joint likelihood:

${\displaystyle \approx \arg \max _{x}\prod _{i}^{N}P(w_{i}|Z,\phi ;\alpha ,\beta )=\arg \max _{x}\prod _{j}^{K}\prod _{i}^{N}P(w_{i}|\phi ;\beta )P(Z|x_{1},x_{2},x_{3},\theta ;\alpha )}$

If we want to model the marginal distribution of ${\displaystyle x_{2}}$ in the joint probability ${\displaystyle P(x_{1},x_{2},x_{3},\theta ;\alpha ,\beta )}$, we can marginalize out ${\displaystyle x_{1}}$, ${\displaystyle x_{3}}$, and ${\displaystyle \theta }$:

${\displaystyle \Pr(X_{2}=x)=\sum _{y}\sum _{w}\sum _{q}\Pr(X_{2}=x,X_{1}=y,X_{3}=w,\Theta =q)}$
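The marginalization above amounts to summing a joint table over every variable except the region. The sketch below shows this on a tiny hand-made joint distribution; the judge names, years, discretized values of theta, and probabilities are all invented for illustration, and in practice theta is continuous and would require an integral or an approximation.

```python
from collections import defaultdict

def marginal_over_region(joint):
    """Sum a joint table over judge, year, and theta, keeping only the region."""
    marginal = defaultdict(float)
    for (judge, region, year, theta), prob in joint.items():
        marginal[region] += prob
    return dict(marginal)

# Keys are (x1 = judge, x2 = region, x3 = year, theta); values are joint
# probabilities. All entries are toy numbers that sum to one.
joint = {
    ("judge_a", "slave", 1830, "q1"): 0.25,
    ("judge_a", "free", 1830, "q1"): 0.125,
    ("judge_b", "slave", 1850, "q2"): 0.25,
    ("judge_b", "free", 1850, "q1"): 0.375,
}
region_marginal = marginal_over_region(joint)
predicted_region = max(region_marginal, key=region_marginal.get)
```

With real models the table is far too large to enumerate, which is one reason the alternative estimation techniques mentioned in the text may be needed.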

We might also consider other optimization methods, such as conditional likelihood estimation, sum of conditional likelihood, and cost-sensitive estimation techniques for this task. However, for the regression task, the appropriate objective function is not yet clear to us.

The latest mixed-effect model has the following generative probability of words (added on Dec. 5, 2011):

${\displaystyle P(w_{n}^{(d)}|z_{n}^{(d)},\eta ,m,y_{r},y_{q})\propto \exp {\big (}\eta _{z_{n}^{(d)}}^{(T)}+\eta _{y^{(r)}}^{(R)}+\eta _{y^{(q)}}^{(Q)}+\eta _{y^{(r)},y^{(q)},z_{n}^{(d)}}^{(I)}+m{\big )}}$
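To illustrate how this word probability is normalized, the sketch below adds the topic, region, second-metadata (Q), and interaction effects to a background term and applies a softmax over the vocabulary. We assume here that each effect is a vector with one entry per vocabulary word and that m is a background vector of the same length; these shapes, the function name, and the toy values are our assumptions.

```python
import math

def word_distribution(eta_topic, eta_region, eta_q, eta_interaction, m):
    """Softmax over the vocabulary of the summed additive effects."""
    scores = [
        t + r + q + i + b
        for t, r, q, i, b in zip(eta_topic, eta_region, eta_q, eta_interaction, m)
    ]
    # Subtract the maximum score before exponentiating for numerical stability.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

zeros = [0.0, 0.0, 0.0]
# With every effect at zero, all words are equally likely.
uniform = word_distribution(zeros, zeros, zeros, zeros, zeros)
# A positive topic effect on word 0 raises its probability relative to the rest.
boosted = word_distribution([1.0, 0.0, 0.0], zeros, zeros, zeros, zeros)
```

Because the effects are additive in log space, each one can be read as a deviation from the background distribution for words associated with a topic, a region, or their combination.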

The combined variational mean-field method optimizes this final log-likelihood:

${\displaystyle {\begin{alignedat}{1}{\mathcal {L}}=\sum _{d}\langle \log P(\theta _{d}|\alpha )\rangle +\sum _{n}^{N_{d}}{\big \langle }\log P(w_{n}^{(d)}|z_{n}^{(d)},\eta ,m,y_{r},y_{q}){\big \rangle }\\+{\big \langle }\log P(Z_{n}^{(d)}|\theta _{d}){\big \rangle }+\sum _{k}\langle \log P(\eta _{k}|0,\tau _{k})\rangle \\+\sum _{k}\langle \log P(\tau _{k}|\gamma )\rangle +\sum _{j}\langle \log P(\eta _{j}|0,\tau _{j})\rangle \\+\sum _{j}\langle \log P(\tau _{j}|\gamma )\rangle +\sum _{q}\langle \log P(\eta _{q}|0,\tau _{q})\rangle \\+\sum _{q}\langle \log P(\tau _{q}|\gamma )\rangle -{\big \langle }\log Q(\tau ,z,\theta ){\big \rangle }\\\end{alignedat}}}$

The gradient for ${\displaystyle \eta ^{(T)}}$ was updated to

${\displaystyle {\frac {dl}{d\eta _{k}^{(T)}}}=\langle c_{k}^{(T)}\rangle -\sum _{q}\sum _{j}\langle C_{qjk}\rangle \beta _{qjk}-diag(\langle (\tau _{k}^{(T)})^{-1}\rangle )\eta _{k}^{(T)}}$

Comments: it's not totally clear to me whether you are proposing this model as a way to do prediction, or simply as the first stage where you are doing un/semisupervised feature induction. It strikes me that topics are not the right way to mediate all these different effects. Take a look at this paper to get a different way of thinking about these different effects on language. You can include topics, or not. Generally speaking, supervised LDA approaches have not worked very well on their own (though it's conceivable that they can discover useful abstractions that a downstream discriminative learner can exploit for good performance). I'm not sure that topics are really an intuitive abstraction to present to social scientists, though they do seem to be the most popular thing going right now. --Nasmith 21:07, 9 October 2011 (UTC)

References

• Blei, David M.; Ng, Andrew Y.; Jordan, Michael I (January 2003). Lafferty, John. ed. "Latent Dirichlet allocation". Journal of Machine Learning Research 3 (4–5): pp. 993–1022. doi:10.1162/jmlr.2003.3.4-5.993
• Gianfortoni, Philip and Adamson, David and Rosé, Carolyn P., "Modeling of Stylistic Variation in Social Media with Stretchy Patterns", in Proceedings of the DIALECTS workshop of ACL HLT 2011.
• Mahesh Joshi and Dipanjan Das and Kevin Gimpel and Noah A. Smith, "Movie reviews and revenues: An experiment in text regression", in Proceedings of NAACL-HLT, 2010.
• Ramage, Daniel and Hall, David and Nallapati, Ramesh and Manning, Christopher D., "Labeled LDA: a supervised topic model for credit attribution in multi-labeled corpora", in Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing.
• Rosen-Zvi et al., "The author-topic model for authors and documents", UAI 2004.
• William Yang Wang, Kapil Thadani, and Kathleen R. McKeown, "Identifying Event Descriptions using Co-training with Online News Summaries", to appear in Proceedings of the 5th International Joint Conference on Natural Language Processing (IJCNLP 2011).