LegalBenchPro is a research benchmark for evaluating large language models on Chinese civil judgments and public legal-exam materials, with a manuscript in preparation. Beyond the dataset, the repository is organized as a reproducibility-first, AI-assisted research workflow: a Codex-assisted scoring and audit pipeline that organizes 20,768 LLM response cells across 22 model configurations, together with structured rubrics, machine-readable metadata, and documented safeguards for AI-assisted research decisions.
The project asks two questions in parallel:
- Do models that perform well on scalable public-exam tasks also transfer to de-identified, practice-oriented case analysis?
- What does a defensible, auditable AI-assisted evaluation pipeline look like for legal and institutional text research?
Status (as of April 2026): manuscript draft in preparation; 20,768 LLM response cells collected across 22 model configurations; human-validation pilot underway; full data release pending licensing, privacy, and source-distribution review.
- Draft introduction
- AI-assisted research workflow and safeguards
- Annotation protocol and scoring design
- Data card
- Hongyu Wang (UC Santa Barbara) - project initiator and lead; benchmark design, scoring-rubric design, AI-assisted scoring/audit pipeline, public repository packaging, and manuscript drafting.
- Yilun Zhao (Yale NLP Lab) - weekly research collaborator; benchmark-design feedback, scoring-protocol review, and manuscript revision discussions.
- Yixin Liu (Yale NLP Lab) - project feedback on benchmark design and error-analysis protocols.
- Xuandong Zhao (UC Berkeley) - project feedback on scoring rubrics and evaluation methodology.
The benchmark overview figure (outputs/figures/benchmark_overview.png) is generated from committed public metadata:
data/metadata/dataset_summary.json and data/metadata/source_distribution.csv.
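To inspect those inputs directly, a minimal sketch is shown below; it assumes only the two committed file paths named above and prints the contents as-is rather than relying on an assumed schema.

```python
# Inspect the committed public metadata that feeds the overview figure.
# Only the two file paths named above are assumed.
import csv
import json
from pathlib import Path

metadata_dir = Path("data/metadata")

# Aggregate counts used when rendering the overview figure.
summary = json.loads((metadata_dir / "dataset_summary.json").read_text(encoding="utf-8"))
print(json.dumps(summary, indent=2, ensure_ascii=False))

# One row per source/domain family.
with (metadata_dir / "source_distribution.csv").open(encoding="utf-8") as f:
    for row in csv.DictReader(f):
        print(row)
```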
- Scope: Chinese institutional and legal text, with both scalable public-exam prompts and de-identified civil-judgment reasoning tasks.
- Evaluation design: comparable task construction, model-configuration metadata, scoring regimes, and staged human-validation plans.
- Reproducibility: Python sample extraction, machine-readable metadata, tests, data-card documentation, and an explicit workflow audit trail.
- Research workflow: public artifacts are organized so that readers can inspect the path from workbook-derived metadata to samples, documentation, figures, and manuscript materials.
| Dimension | Cardinality | Values |
|---|---|---|
| Model configurations | 22 | Closed, open-weight, reasoning-enabled, and step-by-step prompting variants |
| Main task instances | 944 | 76 Chinese real-case issue-stance prompts + 868 public-exam instances |
| Jurisdiction/source families | 3 | Chinese civil judgments, U.S. state bar materials, U.K. legal-exam materials |
| Evaluation settings | 2 | De-identified real-case reasoning and scalable public-exam scoring |
| Main response cells | 20,768 | 944 task instances x 22 model configurations |
| Human-validation pilot | 90 rows | 10 real-case rows + 80 public-exam rows |
| Public preview | 30 rows | 10 translated real-case excerpts + 20 public-exam excerpts with capped cell length |
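The headline numbers above are internally consistent; a quick arithmetic check using only the counts in the table:

```python
# Verify the response-cell count implied by the table above.
real_case_prompts = 76        # Chinese real-case issue-stance prompts
public_exam_instances = 868   # public-exam instances
model_configurations = 22

task_instances = real_case_prompts + public_exam_instances   # 944
response_cells = task_instances * model_configurations       # 20,768

assert task_instances == 944
assert response_cells == 20_768
```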
| Component | Current count | Evaluation design |
|---|---|---|
| Chinese real-case split | 76 issue-stance prompts | Citation-aware rubric with human validation in progress |
| Source judgments | 15 de-identified civil judgments | Paired support/opposition issue prompts |
| Public-exam split | 868 instances | Reference-answer consistency scoring |
| Model configurations | 22 | Standard, reasoning-enabled, and step-by-step prompting modes |
| Main multimodel response cells | 20,768 LLM-generated responses | 944 task instances x 22 model configurations |
| Human validation pilots | 10 real-case rows; 80 public-exam rows | Staged for reviewer calibration and agreement analysis |
The public preview includes 10 translated preview rows from the Chinese real-case split, 20 preview rows from the public-exam split, model-configuration metadata, and compact source/domain distribution tables. Preview CSV cells are capped at 420 characters.
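A minimal sketch for loading the preview files and confirming the cap, assuming only the committed sample paths (no column names are assumed):

```python
# Load both public preview CSVs and confirm that no cell exceeds the
# documented 420-character cap. Only the committed file paths are assumed.
import csv
from pathlib import Path

preview_files = [
    Path("data/sample/legalbenchpro_cn_judgments_sample.csv"),
    Path("data/sample/legalbenchpro_public_exam_sample.csv"),
]

for path in preview_files:
    with path.open(encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    longest = max((len(value or "") for row in rows for value in row.values()), default=0)
    print(f"{path.name}: {len(rows)} rows, longest cell = {longest} chars")
    assert longest <= 420, f"cell cap exceeded in {path.name}"
```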
LegalBenchPro is designed around a gap in current legal LLM evaluation: public legal benchmarks are scalable and convenient, but legal practice often requires working from long facts, contested interpretations, jurisdiction-specific authorities, and defensible argument structure. This project contributes:
- a two-part benchmark that separates public-exam evaluation from real-case legal analysis;
- a curated Chinese civil judgment split with paired issue-stance prompts;
- a multimodel evaluation matrix spanning 22 model configurations and 20,768 LLM-generated response cells;
- a scoring protocol that distinguishes answer matching from citation-aware legal reasoning;
- a reproducible public workflow for sample extraction, metadata generation, figure rendering, and manuscript tracking.
For empirical social-science research, the project is also a small example of how LLM-assisted analysis can be made auditable: institutional text is treated as data, model outputs are treated as evidence to be validated rather than accepted, and scoring decisions are documented through schemas, rubrics, provenance notes, and rerunnable scripts.
For a quick review of the project, start with:
- paper/LegalBenchPro_intro_draft.pdf for the current draft introduction;
- docs/DATA_CARD.md for scope, counts, intended uses, and release constraints;
- docs/ANNOTATION_PROTOCOL.md for human-validation and scoring design;
- docs/SCORING_RUBRIC.md for the compact scoring rubric;
- docs/AI_WORKFLOW.md for auditability and AI-assistance safeguards;
- data/README.md for a compact public data preview;
- data/sample/legalbenchpro_cn_judgments_sample.csv for real-case content excerpts;
- data/sample/legalbenchpro_public_exam_sample.csv for public-exam content excerpts;
- data/metadata/source_distribution.csv and data/metadata/model_configurations.csv for concise metadata;
- scripts/extract_public_sample.py and scripts/render_benchmark_overview.py for the reproducible export and figure-rendering workflow.
paper/
LegalBenchPro_intro_draft.pdf # Current draft introduction
introduction_revised.tex # Dataset-aligned introduction for Overleaf
manuscript_working_draft.md # Working paper skeleton for GitHub readers
docs/
DATA_CARD.md # Dataset scope, fields, release status, risks
ANNOTATION_PROTOCOL.md # Human validation plan and scoring dimensions
AI_WORKFLOW.md # AI-assisted research workflow and safeguards
SCORING_RUBRIC.md # Compact scoring rubric
MANUSCRIPT_STATUS.md # What is complete and what remains
data/
README.md
sample/legalbenchpro_cn_judgments_sample.csv
sample/legalbenchpro_public_exam_sample.csv
metadata/dataset_summary.json
metadata/model_configurations.csv
metadata/source_distribution.csv
outputs/
figures/benchmark_overview.png # Public metadata overview figure
scripts/
extract_public_sample.py # Rebuilds the public sample and metadata
render_benchmark_overview.py # Rebuilds the README overview figure
src/legalbenchpro/
workbook.py # Small workbook helpers used by scripts
tests/
test_workbook.py # Lightweight smoke tests for public utilities
If you have access to the private workbook, the public sample and metadata can be regenerated from the local source file.
macOS/Linux:
python -m venv .venv
source .venv/bin/activate
python -m pip install -r requirements.txt
export PYTHONPATH="$PWD/src"
python scripts/extract_public_sample.py \
--workbook "/path/to/Data Set.xlsx" \
--out-dir data \
--cn-sample-size 10 \
--bar-sample-size 20 \
--max-cell-chars 420
python scripts/render_benchmark_overview.py
Windows PowerShell:
python -m venv .venv
.\.venv\Scripts\Activate.ps1
python -m pip install -r requirements.txt
$env:PYTHONPATH = "$PWD\src"
python .\scripts\extract_public_sample.py `
--workbook "C:\path\to\Data Set.xlsx" `
--out-dir data `
--cn-sample-size 10 `
--bar-sample-size 20 `
--max-cell-chars 420
python .\scripts\render_benchmark_overview.py
The repository includes a small test suite:
macOS/Linux:
export PYTHONPATH="$PWD/src"
python -m unittest discover -s tests
python -m compileall scripts src
Windows PowerShell:
$env:PYTHONPATH = "$PWD\src"
python -m unittest discover -s tests
python -m compileall scripts src
This repository is intentionally organized as a research-engineering artifact, not only as a dataset announcement. It demonstrates:
- Python scripts that regenerate public samples, metadata, and the README overview figure from structured inputs;
- explicit dataset documentation, release constraints, and annotation protocol files;
- lightweight tests for workbook parsing utilities (a minimal sketch follows this list);
- an audit trail for AI-assisted coding and research workflow decisions;
- manuscript-facing materials that separate current evidence from future validation.
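As a rough illustration of the smoke-test style in tests/test_workbook.py, the sketch below exercises a hypothetical cell-truncation helper; the actual utilities in src/legalbenchpro/workbook.py may be named and shaped differently.

```python
# Illustrative smoke test in the style of tests/test_workbook.py.
# "truncate_cell" is a hypothetical stand-in for the real workbook helpers.
import unittest


def truncate_cell(value, max_chars=420):
    """Hypothetical helper: render a cell as text and cap its length."""
    text = "" if value is None else str(value)
    return text[:max_chars]


class TruncateCellSmokeTest(unittest.TestCase):
    def test_short_cell_unchanged(self):
        self.assertEqual(truncate_cell("short"), "short")

    def test_long_cell_is_capped(self):
        self.assertEqual(len(truncate_cell("x" * 1000)), 420)

    def test_none_becomes_empty_string(self):
        self.assertEqual(truncate_cell(None), "")


if __name__ == "__main__":
    unittest.main()
```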
This is a research preview, not a final benchmark release. The public content samples are excerpted and do not include the full prompt matrix, full reference answers, full model outputs, row-level full indexes, or human review sheets. The full dataset will require final licensing, privacy, source-distribution, and validation review before release.
See Project Team.
This repository is for research on model evaluation. It is not legal advice, a legal research product, or a substitute for jurisdiction-specific legal review.
