Stylistic Structure in Historic Legal Text
This will be the project page for Elijah Mayfield and William Y. Wang.
The Background
Slaves remained the largest source of wealth until the 1840s. Judicial preferences and styles could generate variations in the security of slaves. In this project, we are interested in understanding the stylistic differences among judges across pro-slavery, anti-slavery, and boundary states. We are also interested in investigating how the stylistic patterns in judges' opinions differ from those in the case summaries, and how they can inform judicial decisions.
The Dataset
We have collected a corpus of slave-related and property-related US state supreme court legal opinions from LexisNexis. The dataset includes 6,014 slave-related state supreme court cases from 24 states, covering the period 1730 - 1866. It also includes 14,580 property-related cases from the same period. Most of the cases contain the following data fields: Parties, Court, Date, Judge Opinion, Previous Court and Judges, Disposition, Case Overview, Procedural Posture, Outcome, Core Terms Generated by Lexis, Headnote, Counsel, and Judge(s).
The Theory
Our work will be informed at the highest level by the sociolinguistic theory of heteroglossia. This theory describes the ways in which speakers align themselves with the content of their language. For instance, a writer can state a fact without justification, elaboration, or citation, in which case the writer assumes that the fact is taken as given by all readers. This is termed monoglossic language. On the other hand, language can be adapted with any number of different markers to show uncertainty about a fact - this is termed heteroglossic language. Examples of this type of behavior include distancing (by attributing the fact to an authority or a witness, for instance), justifying (e.g. by giving a causal explanation to show what other information the fact is based on), or hedging (by adding modal auxiliaries or other "softening" language). Importantly, heteroglossic statements can include facts that the writer believes beyond a shadow of a doubt, if the writer feels the need to justify or explain that belief - any marker showing that a fact is not completely inarguable marks heteroglossia.
Our belief is that the way in which facts, entities, and events are referenced by a judge in an opinion will be influenced heavily by other factors surrounding the judgment, such as the location, time period, and outcome of the case. Therefore, if we can extract these behaviors in a systematic way, we can then use them as observables in a generative model of these variables. Moreover, because these features are more descriptive, they are likely to be more informative and interesting for social scientists than simpler n-gram features, even if they perform no better at classification.
The Approach
Qualitative analysis of our data set immediately showed a major disparity between the two largest text fields in each case - Judge Opinion and Case Overview. The first, written by the judge in delivering a verdict, is heavily heteroglossic, with markers for opinionated, convincing, judgmental, or attributed facts. This is only natural for a judgment that must collect myriad testimonies and sources of evidence into a single verdict. On the other hand, the Case Overview section of each case is heavily monoglossic. Facts and testimonies are recorded impassively, with no attempt to persuade the reader - it is a simple summary.
Most intriguingly, these texts are about the same pieces of evidence, the same testimonies, the same series of events. This means that we have, in effect, fairly large pseudo-parallel corpora for monoglossia and heteroglossia. Our project will be organized into three stages: Alignment, Extraction, and Prediction.
In the Alignment stage, we will develop an approach for aligning sentences between Judge Opinion and Case Overview that are referring to the same facts from the case. This may be done using simple string similarity, or we may include a more complex framework for named entity resolution, coreference, and event detection. This will produce a more closely aligned corpus, on a sentence level, compared to the large paragraphs of parallel texts from a whole case.
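As a rough illustration of the "simple string similarity" option, the sketch below pairs each Case Overview sentence with its most similar Judge Opinion sentence. It is only a minimal sketch: the function names, the naive sentence splitter, and the 0.4 threshold are all assumptions made for this example, and the character-level ratio from Python's difflib stands in for whatever similarity measure the project eventually adopts.

```python
import re
from difflib import SequenceMatcher


def split_sentences(text):
    # Naive splitter (assumption): break after ., ?, or ! followed by whitespace.
    # Historic legal prose with abbreviations would need something more careful.
    return [s.strip() for s in re.split(r"(?<=[.?!])\s+", text) if s.strip()]


def similarity(a, b):
    # Character-level similarity in [0, 1], case-insensitive.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def align(opinion, overview, threshold=0.4):
    """Return (overview_sentence, opinion_sentence, score) triples.

    Each overview sentence is matched to its single best opinion sentence;
    matches scoring below the (hypothetical) threshold are dropped.
    """
    opinion_sents = split_sentences(opinion)
    pairs = []
    if not opinion_sents:
        return pairs
    for ov in split_sentences(overview):
        best = max(opinion_sents, key=lambda op: similarity(op, ov))
        score = similarity(best, ov)
        if score >= threshold:
            pairs.append((ov, best, score))
    return pairs


# Invented toy texts, not taken from the corpus.
opinion = ("It appears from the testimony that the plaintiff sold the land "
           "to the defendant. We are therefore of opinion that the judgment "
           "must be affirmed.")
overview = "The plaintiff sold the land to the defendant. The judgment was affirmed."

for ov, op, score in align(opinion, overview):
    print(f"{score:.2f} | {ov} <-> {op}")
```

A greedy best-match like this allows several overview sentences to map to the same opinion sentence, which may or may not be desirable; a richer alignment using entities, coreference, or events would replace the `similarity` function rather than the overall loop.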
In the Extraction stage, we will use unsupervised methods to detect the stylistic differences between parallel sentences describing the same entities or events. This may be based on syntactic structure, or it may be a flatter and less processing intensive method. Our goal is to extract a feature space representation of the extracted patterns for marking text as heteroglossic in a given sentence.
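One of the "flatter" options could look something like the following sketch, which counts the words that an opinion sentence adds relative to its aligned overview sentence and keeps those that recur across pairs. This is an assumption-laden illustration rather than the project's method: the function names, the tokenizer, and the `min_count` cutoff are hypothetical, and the example sentences are invented.

```python
import re
from collections import Counter


def tokens(sentence):
    # Lowercased word tokens; punctuation is discarded.
    return re.findall(r"[a-z']+", sentence.lower())


def candidate_markers(pairs, min_count=2):
    """Count words present in the opinion sentence but absent from its
    aligned overview sentence, across all (overview, opinion) pairs.

    Words that recur in at least `min_count` pairs are returned as
    candidate heteroglossic markers, most frequent first.
    """
    counts = Counter()
    for overview, opinion in pairs:
        extra = set(tokens(opinion)) - set(tokens(overview))
        # Sort so that ties come out in a deterministic order.
        counts.update(sorted(extra))
    return [(word, c) for word, c in counts.most_common() if c >= min_count]


# Invented toy pairs, not taken from the corpus.
pairs = [
    ("The plaintiff sold the land.",
     "It appears that the plaintiff may have sold the land."),
    ("The deed was recorded.",
     "It appears from the witness that the deed may have been recorded."),
]

print(candidate_markers(pairs))
```

On these toy pairs the recurring additions are words such as "appears" and "may", while content words shared by both sides drop out. A syntactic variant would compare parse fragments instead of bags of words.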
In the final Prediction stage, we will use the output of the previous two stages to represent each case not by the topical content of the ruling, but by an empirically derived representation of the judge's linguistic positioning towards the facts of the case. We can then use this representation to predict three different output variables - the state or region the judge is from; the time period that the judge is from; and the outcome of the case. Our hope is that we can gain leverage from this representation in a way that surface-level, n-gram-based models cannot.
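To make the idea of representing a case by its stylistic markers concrete, here is a minimal sketch that turns a text into a vector of marker rates and classifies it with a nearest-centroid rule. Everything here is a placeholder: the marker list, the labels, and the toy texts are invented, and the nearest-centroid classifier is a deliberately simple stand-in, not the model the project proposes.

```python
import math
import re
from collections import defaultdict

# Hypothetical marker vocabulary; in the project this would come from
# the Extraction stage rather than being written by hand.
MARKERS = ["may", "appears", "because", "testified"]


def featurize(text, markers=MARKERS):
    # Rate of each marker per token, so long and short texts are comparable.
    toks = re.findall(r"[a-z']+", text.lower())
    n = max(len(toks), 1)
    return [toks.count(m) / n for m in markers]


def centroids(examples):
    """examples: list of (feature_vector, label). Returns label -> mean vector."""
    groups = defaultdict(list)
    for vec, label in examples:
        groups[label].append(vec)
    return {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in groups.items()
    }


def predict(vec, cents):
    # Choose the label whose centroid is closest in Euclidean distance.
    return min(cents, key=lambda label: math.dist(vec, cents[label]))


# Invented toy training data with placeholder labels.
train = [
    (featurize("It may be that the deed appears valid."), "region_a"),
    (featurize("The witness testified because he was asked."), "region_b"),
]
cents = centroids(train)

print(predict(featurize("The title may be good, as it appears."), cents))
```

The same feature vectors could be used for any of the three output variables (region, time period, or outcome) by changing only the labels attached to the training examples.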