A collection of datasets and tasks for legal machine learning
Datasets for Machine Learning in Law
This is a collection of pointers to datasets/tasks/benchmarks pertaining to the intersection of machine learning and law.
This page is continually being updated. If I missed something, please contact me at [email protected] and I'll add it!
Task agnostic datasets
These datasets can be used for pretraining larger models. Alternatively, you can use them to construct artificial tasks.
- Caselaw Access Project: all official, book-published United States case law.
- Legifrance: a French legal publisher providing access to law codes and legal decisions. Requires scraping (Paper).
- US Supreme Court Database: information about every case decided by the US Supreme Court between 1791 and today.
- European Parliament Proceedings: Parallel text of the proceedings of the European Parliament, collected in 11 languages.
- US Code: downloadable version of the US Code in XML format
- Patent Litigation Docket Reports: detailed patent litigation data on over 80k unique district court cases
- Pile of Law: a 256GB dataset of legal, administrative, and contractual texts.
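As a rough illustration of constructing an artificial task from raw text, the sketch below turns unlabeled text into cloze (fill-in-the-blank) examples by masking one word per sentence. The function name `make_cloze_examples`, the `[MASK]` placeholder, and the naive sentence splitter are all assumptions made for this example; none of them come from the datasets listed above, and real legal text would need a more careful segmenter.

```python
import random
import re

MASK = "[MASK]"  # hypothetical placeholder token, not tied to any tokenizer


def make_cloze_examples(text, seed=0, min_words=4):
    """Build cloze examples by masking one random word in each sentence."""
    rng = random.Random(seed)  # seeded so the same text yields the same examples
    # Naive split after '.', '?' or '!' followed by whitespace.
    # Citations and abbreviations in legal text will break this heuristic.
    sentences = [s.strip() for s in re.split(r"(?<=[.?!])\s+", text) if s.strip()]
    examples = []
    for sentence in sentences:
        words = sentence.split()
        if len(words) < min_words:
            continue  # skip fragments too short to give useful context
        idx = rng.randrange(len(words))
        masked = words[:idx] + [MASK] + words[idx + 1:]
        examples.append({"input": " ".join(masked), "target": words[idx]})
    return examples


if __name__ == "__main__":
    sample = (
        "The court granted the motion to dismiss. "
        "The plaintiff appealed the ruling promptly."
    )
    for example in make_cloze_examples(sample):
        print(example)
```

Each example pairs a masked sentence with the word that was removed, so a model can be trained to recover the target from its context.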
Benchmarks which combine multiple types of tasks
- LexGLUE: a GLUE-inspired set of legal tasks
- LegalBench: a large language model benchmark for legal reasoning
Training a model to predict the outcome of a case from various case-specific features.
- European Court of Human Rights: 11.5k cases from ECHR's public database. Paper.
Training a model to annotate sentences/clauses/sections in a contract (or other document) according to various criteria (e.g. unfairness, argument structure, etc).
- Detecting unfair clauses from online terms-of-service: ~12k sentences from 50 terms-of-service agreements. Paper.
- Usable Privacy Project Data: a collection of datasets for privacy policies, including OPP-115, APP-350, MAPS, and the ACL/COLING 2014 Dataset.
- Contract extraction dataset: 3,500 English contracts manually annotated with 11 different contract elements. Paper.
- EURLEX with EUROVOC annotations: 57k legislative documents from the EU's public document database, annotated with concepts from EUROVOC. Paper.
- Cornell eRulemaking Corpus: Collection of 731 user comments on the Consumer Debt Collection Practices rule by the CFPB, with annotations containing information about argument structure. Paper.
- German rental agreements (in English): ~913 sentences from German rental agreements annotated by semantic type. Paper.
- Segmenting US court decision opinions into issue parts: 316 court decisions on cyber crime and trade secrets, manually segmented into 6 content-based "types" (encompassing categories like "Introduction", "Dissent", or "Background"). Paper.
- ContractNLI: A Dataset for Document-level Natural Language Inference for Contracts
Training a model to summarize complex contractual jargon or legal analysis.
- Summarizing contracts into plain English: 446 contracts with parallel plain-text section-level English summaries. Paper.
- Cookie policies from 151 companies: User agreements for 151 services with sections annotated by TOS;DR. Paper.
- Australian case citation summarization: 4000 cases from the Federal Court of Australia with citation-based summaries.
- Board of Veterans' Appeals Case Summarization: Summarizing BVA cases concerning PTSD. Paper.
- Multi-LexSum: Summarizing civil rights opinions at different granularities!
Linking / question answering
Training a model to answer questions or to identify passages from a target document that are relevant to a specified query.
- Linking Supreme Court Opinions to the US Constitution: 36k paragraphs from USC opinions with 41k links to the US Constitution. Paper.
- StAtutory Reasoning Assessment (SARA): Collection of rules extracted from the US Internal Revenue Code and natural language questions requiring application of those rules. Paper.
- PrivacyQA: 1750 questions on mobile application privacy policies and 3500 relevant expert annotations. Paper.
- CaseHOLD: 53,000+ multiple-choice questions that require identifying the correct holding for a case citation from the preceding context. Paper.
- LegalSupport: inferring BlueBook support signals from legal texts
Training a model to classify a (typically lengthy) legal filing or document.
- EDGAR: Online public database for US Securities and Exchange Commission. Filings can be classified by filing type. Paper.
Datasets which don't fit into the above categories:
- Segmenting sentences in US cases: ~26k sentences from 80 cases. Paper.