legal-ml-datasets

A collection of datasets and tasks for legal machine learning


Datasets for Machine Learning in Law

This is a collection of pointers to datasets/tasks/benchmarks pertaining to the intersection of machine learning and law.

This page is continually being updated. If there's a dataset/resource you think should be included here, please make a pull request adding it, or contact me at [email protected] and I'll add it!

Neel Guha

Task agnostic datasets

These datasets can be used for pretraining larger models. Alternatively, you can use them to construct artificial tasks.

Caselaw Access Project: all official, book-published United States case law.
Legifrance: a French legal publisher providing access to law codes and legal decisions. Requires scraping (Paper).
US Supreme Court Database: information about every case decided by the US Supreme Court between 1791 and today.
European Parliament Proceedings: Parallel text of the proceedings of the European Parliament, collected in 11 languages.
US Code: downloadable version of the US Code in XML format
Patent Litigation Docket Reports: detailed patent litigation data on over 80k unique district court cases
Pile of Law: a 256GB dataset of legal, administrative, and contractual texts.
Open Australian Legal Corpus: The first and only multijurisdictional open corpus of Australian legislative and judicial documents.
Ontario Laws and Regs: A dataset comprised of the most recent version of all current and revoked laws and regulations from Ontario, Canada, totalling around 5,000 documents.
The Cambridge Law Corpus: A dataset consisting of raw text and metadata for 250,000+ court cases from the UK, dating back to the 16th century. Additional expert annotations are provided for a sample of 638 cases.
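
As a rough illustration of what constructing an artificial task from raw text could look like, the sketch below builds sentence-order examples: pairs of adjacent sentences, labeled by whether they appear in their original order. The function names, the sample text, and the choice of task are all assumptions made for illustration; none of them come from the datasets listed above, and the regex-based sentence splitter is far too naive for real case law (it will break on abbreviations such as "v." or "U.S.").

```python
import random
import re


def split_sentences(text):
    """Naive splitter: breaks after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def make_order_examples(text, seed=0):
    """Build (first, second, label) triples from adjacent sentences.

    label is 1 when the pair is in its original order and 0 when swapped.
    """
    rng = random.Random(seed)  # seeded so the same text gives the same examples
    sentences = split_sentences(text)
    examples = []
    for a, b in zip(sentences, sentences[1:]):
        if rng.random() < 0.5:
            examples.append((a, b, 1))
        else:
            examples.append((b, a, 0))
    return examples


# Invented sample text, standing in for a document from one of the corpora.
sample = (
    "The plaintiff filed a complaint. The defendant moved to dismiss. "
    "The court denied the motion."
)

for first, second, label in make_order_examples(sample):
    print(label, "|", first, "|", second)
```

A model trained on such pairs never needs human annotation, which is the appeal of this kind of task for large unlabeled corpora.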

Benchmarks which combine multiple types of tasks

LexGLUE: a GLUE-inspired set of legal tasks
LegalBench: a large language model benchmark for legal reasoning

Judgement prediction

Training a model to predict the outcome of a case from various case specific features.

European Court of Human Rights: 11.5k cases from ECHR's public database. Paper.

Document/contract annotation

Training a model to annotate sentences/clauses/sections in a contract (or other document) according to various criteria (e.g. unfairness, argument structure, etc).

Detecting unfair clauses from online terms-of-service: ~12k sentences from 50 terms-of-service agreements. Paper.
Usable Privacy Project Data: a collection of datasets for privacy policies, including OPP-115, APP-350, MAPS, and the ACL/COLING 2014 Dataset.
Contract extraction dataset: 3,500 English contracts manually annotated with 11 different contract elements. Paper.
EURLEX with EUROVOC annotations: 57k legislative documents from the EU's public document database, annotated with concepts from EUROVOC. Paper.
Cornell eRulemaking Corpus: Collection of 731 user comments on the Consumer Debt Collection Practices rule by the CFPB, with annotations containing information about argument structure. Paper.
German rental agreements (in English): ~913 sentences from German rental agreements annotated by semantic type. Paper.
Segmenting US court decision opinions into issue parts: 316 court decisions on cyber crime and trade secrets, manually segmented into 6 content based "types" (encompassing categories like "Introduction", "Dissent", or "Background"). Paper
ContractNLI: A Dataset for Document-level Natural Language Inference for Contracts

Summarization

Training a model to summarize complex contractual jargon or legal analysis.

Summarizing contracts into plain English: 446 contracts with parallel plain-text section-level English summaries. Paper.
Cookie policies from 151 companies: User agreements for 151 services with sections annotated by TOS;DR. Paper.
Australian case citation summarization: 4000 cases from the Federal Court of Australia with citation-based summaries.
Board of Veterans' Appeals Case Summarization: Summarizing BVA cases concerning PTSD. Paper.
Multi-LexSum: Summarizing civil rights opinions at different granularities.
EUR-Lex-Sum: Dataset for cross-lingual summarization based on manually curated document summaries of legal acts from the European Union law platform.

Linking / question answering

Training a model to answer questions or to identify passages from a target document that are relevant to a specified query.

Linking Supreme Court Opinions to the US Constitution: 36k paragraphs from USC opinions with 41k links to the US Constitution. Paper.
StAtutory Reasoning Assessment (SARA): Collection of rules extracted from US Internal Revenue Code and natural language questions requiring application of those rules. Paper.
PrivacyQA: 1750 questions on mobile application privacy policies and 3500 relevant expert annotations. Paper.
CaseHOLD: 53,000+ multiple-choice questions that require identifying the correct holding for a case citation from the preceding context. Paper.
LegalSupport: inferring BlueBook support signals from legal texts
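
To make the "identify relevant passages" framing concrete, here is a minimal sketch of a token-overlap ranker, the kind of trivial baseline one might compare a trained model against. It is not the evaluation protocol or method of any dataset listed here; the function names, the query, and the toy passages are all invented for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_passages(query, passages):
    """Return passage indices ordered by descending token overlap with the query.

    Ties are broken by the passage's original position.
    """
    query_counts = Counter(tokenize(query))
    scored = []
    for idx, passage in enumerate(passages):
        passage_counts = Counter(tokenize(passage))
        # Multiset intersection: each shared token counts min(query, passage) times.
        overlap = sum((query_counts & passage_counts).values())
        scored.append((overlap, idx))
    return [idx for overlap, idx in sorted(scored, key=lambda s: (-s[0], s[1]))]


# Toy passages, invented for this example.
passages = [
    "The tenant must pay rent on the first day of each month.",
    "The landlord may terminate the lease with thirty days notice.",
    "Pets are not allowed on the premises.",
]

print(rank_passages("When must the tenant pay rent?", passages))  # [0, 1, 2]
```

Raw overlap rewards common words such as "the", so a realistic baseline would at least weight tokens by rarity (for example with TF-IDF or BM25).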

Document classification

Training a model to classify a (typically lengthy) legal filing or document.

EDGAR: Online public database for US Securities and Exchange Commission. Filings can be classified by filing type. Paper.

Misc

Datasets which don't fit into the above categories:

Segmenting sentences in US cases: ~26k sentences from 80 cases. Paper.
Demosthenes Corpus for argument mining in legal documents.


Unless otherwise noted, the content of this website is licensed under a Creative Commons Attribution 4.0 International License.
