Toward a National Eviction Data Collection Strategy Using Natural Language Processing
How the Eviction Research Network mined 111,740 Washington court records — buried in scanned PDFs and handwritten paper — to expose the racial and gender disparities that structured data could not see. The paper is the baseline methodology for working with any jurisdiction's eviction data, in any format.
Published 2024 in Cityscape, vol. 26, no. 1 (HUD Office of Policy Development & Research)
Most of the country's eviction data is unreadable. We wrote the pipeline to change that.
Eviction is the most common civil legal action in the United States, and the data it generates is the baseline on which every downstream question — racial disparity, cost of displacement, program evaluation — depends. Yet in most states, that data is trapped in scanned court images, inconsistent portals, and paper-based case management systems. When researchers can't read the records, the disparities stay invisible.
The Eviction Research Network's 2024 Cityscape paper, "Toward a National Eviction Data Collection Strategy Using Natural Language Processing," lays out the pipeline we built to solve this problem in Washington State and argues that it is the template for collecting comparable eviction data anywhere in the country. We downloaded 111,740 court record images across Pierce, Snohomish, Whatcom, and King counties, digitized the text with OCR, extracted defendant addresses with rule-based patterns followed by spaCy's neural-network named-entity recognizer, geocoded to the rooftop where possible, and estimated race and gender using published Bayesian packages validated against legal-aid intake data.
What we found was not marginal. In Pierce County, one in five Black female-headed renter households was named in an eviction filing between 2013 and 2017. In King County, it was one in nine. These numbers were in the court files the whole time; no one had the tools to read them at scale.
Background
Why structured eviction data misses most of the country
The pipeline
OCR → address extraction → geocoding → validation → demographics
Case study
Four western Washington counties, 2004–2017
Findings
Sizable racial and gender disparities invisible in structured data
Future strategies
Document layout analysis, LLM-assisted extraction, and a national roadmap
If you are working with court data in any jurisdiction, start here
The paper is cited, and used, as the reference methodology for demographic estimation and court-record work across the Eviction Research Network's state portfolio. Five things it settles, and one direction it opens.
You can extract structured data from scanned court records at scale
A two-step pipeline (Tesseract OCR, then rule-based extraction plus spaCy neural-network NER) pulled defendant addresses out of 111,740 PDF and TIFF images. A Naive Bayes classifier then separated the defendant's address from attorney and court addresses with 98% confidence.
Geocoding platforms differ. Use more than one.
We benchmarked OSM, Census, Google, Azure, and Esri Business Analyst. Azure returned the highest raw success rate; Esri BA returned the most reproducible results. Overall, 93% of records were geocoded to rooftop-to-block accuracy, the threshold for trustworthy tract-level analysis.
Bayesian race estimation works — with known limits
Imai & Khanna's wru package, backstopped by Xie's rethnicity Bi-LSTM model, yielded race estimates within a few percentage points of actual King County legal-aid intake data. Training data were voter files (largely Southern), so we document where this pipeline is likely to drift outside the South.
Racial disparities are larger than structured data suggest
Structured county-level filings hid what neighborhood-level, demographic-estimated records exposed: Black female-headed households were filed against at 3–5× the rate of White-headed households across the four Washington counties in the study.
Court cooperation is the hardest problem, not the code
One county clerk wanted $350,000 in per-page fees for 14 years of records. Others were enthusiastic. The technical pipeline is portable; the access strategy — mandates, grant support, legal-aid partnerships — is the scarce resource. The paper documents what worked and what didn't.
What's next: document layout analysis and LLM-assisted extraction
The paper tests ChatGPT-class LLMs on court documents and flags the obvious privacy problems alongside the accuracy gaps. It points to document-layout models (computer vision + NLP) as the next generation of extraction: an approach with a lower skill barrier that could further democratize this kind of research.
Five steps from a scanned court record to a demographically estimated filing
Each stage was benchmarked, validated, and documented so that other teams could reproduce the pipeline — or adapt it to their jurisdiction's quirks — without starting from scratch.
Access & download
111,740 eviction case images pulled from three separate Washington court portals via a throttled HTML scraper, using case numbers obtained from a state records request in partnership with the Washington Office of Civil Legal Aid.
OCR & text cleaning
Tesseract OCR converted scanned PDF and TIFF summons into text. Common error patterns ("1" → "|", "th" → straight quote, row numbers embedded in addresses) were identified and handled in post-processing.
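A minimal Python sketch of this kind of post-processing, covering the three error patterns named above. The function name `clean_ocr_line`, the specific regular expressions, and the sample address are illustrative assumptions, not the rules used in the paper.

```python
import re

# Hypothetical clean-up rules for the three OCR error patterns described above.
PIPE_AFTER_DIGIT = re.compile(r"(?<=\d)\|")      # "4|5"  -> "415"
PIPE_BEFORE_DIGIT = re.compile(r"\|(?=\d)")      # "|23"  -> "123"
# \x27 is a straight single quote, \x22 a straight double quote.
QUOTE_AFTER_DIGIT = re.compile(r"(?<=\d)[\x27\x22](?=\s|$)")  # "38'" -> "38th"
# A one- or two-digit row number at the start of a line, followed by a house number.
LEADING_ROW_NUMBER = re.compile(r"^\s*\d{1,2}\s+(?=\d+\s)")


def clean_ocr_line(line: str) -> str:
    """Apply the assumed clean-up rules to one line of OCR text."""
    line = PIPE_AFTER_DIGIT.sub("1", line)
    line = PIPE_BEFORE_DIGIT.sub("1", line)
    line = QUOTE_AFTER_DIGIT.sub("th", line)
    line = LEADING_ROW_NUMBER.sub("", line)
    # Collapse runs of whitespace left behind by the scan.
    return " ".join(line.split())


if __name__ == "__main__":
    # Made-up example line, not taken from any court record.
    print(clean_ocr_line("12  4|5 S 38' St Tacoma WA 98418"))
```

Rules like these are deliberately narrow: each one fires only next to a digit, so ordinary punctuation elsewhere in the document is left alone.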
Address extraction
Two-pass extraction: rule-based regex (house number + street type + ZIP patterns) followed by spaCy neural-network Named Entity Recognition. A Naive Bayes classifier distinguished defendant addresses from attorney and court addresses with 98% confidence.
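The first, rule-based pass can be sketched roughly as follows. This is an illustration only: the street-type list, the pattern `ADDRESS_RE`, and the helper `extract_candidate_addresses` are assumptions, and the sample text is invented. The spaCy NER pass and the Naive Bayes classifier are not shown here.

```python
import re

# Assumed, incomplete list of street types; a real list would be longer.
STREET_TYPES = (
    r"(?:St|Street|Ave|Avenue|Rd|Road|Blvd|Boulevard|Dr|Drive|"
    r"Ln|Lane|Ct|Court|Pl|Place|Way|Hwy|Highway)"
)

ADDRESS_RE = re.compile(
    rf"""
    \b(?P<number>\d{{1,6}})\s+                                   # house number
    (?P<street>(?:[A-Za-z0-9.]+\s+){{1,5}}?{STREET_TYPES})\b\.?  # street name ending in a street type
    (?P<rest>[^\n]{{0,60}}?)                                     # unit, city, state (same line only)
    \b(?P<zip>\d{{5}})(?:-\d{{4}})?\b                            # 5-digit ZIP, optional +4
    """,
    re.VERBOSE | re.IGNORECASE,
)


def extract_candidate_addresses(text: str) -> list[dict]:
    """Return every span that looks like 'number street-type ... ZIP'."""
    candidates = []
    for match in ADDRESS_RE.finditer(text):
        candidates.append(
            {
                "number": match.group("number"),
                "street": " ".join(match.group("street").split()),
                "zip": match.group("zip"),
                "raw": " ".join(match.group(0).split()),
            }
        )
    return candidates


if __name__ == "__main__":
    sample = "Defendant resides at 415 S 38th St, Apt 2, Tacoma, WA 98418."
    for candidate in extract_candidate_addresses(sample):
        print(candidate)
```

A pattern like this returns every address-shaped span, including attorney and courthouse addresses, which is why a separate classification step is still needed afterward.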
Geocoding & validation
Extracted addresses ran through Azure and Esri BA; 93% resolved to rooftop-to-block accuracy. Addresses resolving only to street or county centroid were dropped. Against manually extracted 2013 addresses, the algorithmic output achieved an average Levenshtein ratio of 0.82.
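For readers unfamiliar with the metric, here is a small sketch of one common way to turn Levenshtein edit distance into a 0-to-1 similarity ratio. The normalization used below (one minus distance divided by the longer string's length) is an assumption; other definitions exist, and this summary does not state which one the paper used.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance: insertions, deletions, substitutions cost 1 each."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (0 if char_a == char_b else 1)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def similarity_ratio(a: str, b: str) -> float:
    """Assumed normalization: 1.0 means identical, 0.0 means nothing in common."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def mean_ratio(pairs: list[tuple[str, str]]) -> float:
    """Average ratio over (manual, algorithmic) address pairs."""
    if not pairs:
        raise ValueError("need at least one pair")
    return sum(similarity_ratio(m, g) for m, g in pairs) / len(pairs)


if __name__ == "__main__":
    # Invented pair for illustration only.
    print(similarity_ratio("415 S 38th St", "415 S 38th Street"))
```

Under this definition, small formatting differences such as "St" versus "Street" lower the ratio even when both strings point to the same building, so an average below 1.0 does not by itself mean the extracted address was wrong.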
Race & gender estimation
The R package wru (Imai & Khanna) estimated race by Bayesian inference from surname and tract composition; rethnicity (Xie) handled residual names via a Bi-LSTM model. Gender was inferred with the gender package (Mullen et al.), which checks first names against SSA and IPUMS registries. Estimates were cross-checked against King County Bar Housing Justice Project intake data and agreed within a few percentage points.
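The surname-plus-geography update behind this step can be sketched in a few lines. This is a Python illustration of the general Bayes rule, P(race | surname, tract) ∝ P(race | surname) × P(tract | race), and not the wru API, which is an R package. The function name and every probability below are hypothetical.

```python
def posterior_race(
    p_race_given_surname: dict[str, float],
    p_tract_given_race: dict[str, float],
) -> dict[str, float]:
    """Combine a surname-based prior with tract composition via Bayes' rule.

    Assumes surname and tract are conditionally independent given race.
    """
    unnormalized = {
        race: p_race_given_surname[race] * p_tract_given_race[race]
        for race in p_race_given_surname
    }
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("no race category has nonzero probability")
    return {race: value / total for race, value in unnormalized.items()}


if __name__ == "__main__":
    # Hypothetical numbers for illustration; not drawn from Census data or the paper.
    surname_prior = {"Black": 0.30, "White": 0.60, "Other": 0.10}
    tract_likelihood = {"Black": 0.004, "White": 0.001, "Other": 0.002}
    for race, probability in posterior_race(surname_prior, tract_likelihood).items():
        print(f"{race}: {probability:.3f}")
```

The example shows why geography matters: a surname that leans one way on its own can be pulled toward another category when the defendant lives in a tract where that category is concentrated.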
What structured data missed
The 2013–2017 five-year eviction filing rates below are the central empirical contribution: neighborhood-level filings divided by the estimated number of renter households in that demographic. Pierce County absorbed a disproportionate share of the Black households displaced from gentrifying Seattle in King County, and the filing rate among Black female-headed renter households there was 19%.
| Householder group | King | Pierce | Snohomish | Whatcom |
|---|---|---|---|---|
| Black female | 11% | 19% | 12% | 26%* |
| Black male | 8% | 15% | 7% | 6% |
| Asian female | 6% | 16% | 10% | 5% |
| Asian male | 4% | 16% | 9% | 6% |
| Latinx female | 8% | 14% | 8% | 5% |
| Latinx male | 8% | 12% | 10% | 4% |
| White female | 6% | 13% | 8% | 5% |
| White male | 7% | 16% | N/A | N/A |
* The Whatcom County Black-female rate reflects a very small sample (n=15 named defendants). The Pierce County rate is the paper's headline — it is based on 1,392 filings against an estimated 7,233 Black female-headed renter households.
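As a check on the arithmetic, the Pierce County headline rate can be reproduced from the two figures in the note above. The function name `filing_rate` is illustrative.

```python
def filing_rate(filings: int, renter_households: int) -> float:
    """Five-year filing rate: filings divided by estimated renter households."""
    if renter_households <= 0:
        raise ValueError("renter_households must be positive")
    return filings / renter_households


if __name__ == "__main__":
    # Figures quoted in the note above for Black female-headed renter
    # households in Pierce County, 2013-2017.
    pierce = filing_rate(1392, 7233)
    print(f"{pierce:.1%}")                            # 19.2%
    print(f"about one in {1 / pierce:.1f} households")
```

Rounded to a whole percentage, this is the 19% shown in the table, or roughly one in five.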
What 111,740 court records told us that structured data couldn't
Every serious eviction study in the past decade has had to answer the same procurement question: where does the data come from? The national-scale answers — Eviction Lab's third-party purchases, Legal Services Corporation's partnership scrapes — are powerful, but they cover what is already structured. The jurisdictions that print their filings onto paper, scan them, and call that the public record have stayed dark. And those dark jurisdictions are where many of the racial and gender disparities live.
Washington State was a deliberate test case. It is moderately large, moderately cooperative, and — crucially — has no centralized digital eviction record system. The 39 county court clerks operate on three different web portals. Records are held at the clerk's discretion. One clerk wanted $0.25 per downloaded page, which at 14 years of filings came out to roughly $350,000. That is a reasonable proxy for the friction a research team should expect when attempting this work in any state with a fragmented court system — and it is why a portable, inexpensive technical pipeline matters.
The pipeline, in one paragraph
Once case numbers were obtained from a state records request, a throttled HTML scraper pulled 111,740 PDF and TIFF images from the three portals. Tesseract OCR converted the scans to text. A regex-first, spaCy-NER-second extractor pulled candidate addresses; a Naive Bayes classifier distinguished the defendant's address from the attorney's and the courthouse's. Azure and Esri Business Analyst geocoded the addresses: Azure for raw success rate, Esri BA for reproducibility. Ninety-three percent resolved to rooftop-to-block accuracy; the rest were dropped rather than trusted at the tract level. The wru package then estimated race from surname and tract composition, rethnicity handled the residual names via a Bi-LSTM neural network, and the gender package checked first names against SSA and IPUMS. Estimates were validated against King County legal-aid intake data and held within a few percentage points.
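To make the classification step concrete, here is a toy multinomial Naive Bayes classifier written with the standard library only. It assumes the features are words found near each candidate address; that feature choice, the training examples, and the function names `train` and `predict` are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def train(examples: list[tuple[list[str], str]], alpha: float = 1.0) -> dict:
    """Count labels and per-label tokens; alpha is the Laplace smoothing constant."""
    label_counts: Counter = Counter()
    token_counts: dict[str, Counter] = {}
    vocabulary: set[str] = set()
    for tokens, label in examples:
        label_counts[label] += 1
        token_counts.setdefault(label, Counter()).update(tokens)
        vocabulary.update(tokens)
    return {"labels": label_counts, "tokens": token_counts,
            "vocab": vocabulary, "alpha": alpha}


def predict(model: dict, tokens: list[str]) -> str:
    """Return the label with the highest log posterior for the given context words."""
    n_examples = sum(model["labels"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = "", -math.inf
    for label, count in model["labels"].items():
        counts = model["tokens"][label]
        total = sum(counts.values())
        score = math.log(count / n_examples)  # log prior
        for token in tokens:
            if token in model["vocab"]:       # ignore words never seen in training
                score += math.log((counts[token] + model["alpha"])
                                  / (total + model["alpha"] * vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented context words around candidate addresses, for illustration only.
EXAMPLES = [
    (["defendant", "tenant", "premises", "located"], "defendant"),
    (["tenant", "occupies", "premises"], "defendant"),
    (["attorney", "plaintiff", "suite", "wsba"], "attorney"),
    (["law", "office", "suite", "attorney"], "attorney"),
    (["superior", "court", "clerk", "county"], "court"),
    (["court", "clerk", "filed"], "court"),
]

if __name__ == "__main__":
    model = train(EXAMPLES)
    print(predict(model, ["premises", "tenant"]))
```

A production classifier would be trained on hand-labeled addresses from real summonses and evaluated on held-out cases; the toy data here only shows the mechanics.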
Findings, and why they moved policy
The numbers did not stay on the page. In Pierce County, 19% of Black female-headed renter households were named in an eviction filing from 2013 to 2017. In King County, it was 11%. There, the bulk of evictions concentrated in the historically diverse south-county neighborhoods that have long been the displacement destination for Black households pushed out of central Seattle. At the neighborhood level, the highest filing rates tracked the region's lowest rents and highest racial diversity. These were not marginal disparities; they were the basic structure of who was losing housing in the Puget Sound region during the mid-2010s.
That evidence fed directly into two Washington State policy wins: the 2019 extension of the pay-or-vacate notice period from 3 to 14 days (ESSB 5600), and the adoption of just-cause eviction protections. A statewide right-to-counsel program followed in 2021. None of these policies required the NLP pipeline to exist, but the demographic visibility it produced made the case for them far harder to ignore.
What this means for other jurisdictions
The case study is Washington, but the paper's actual contribution is the recipe: a scalable, technically modest, ethically auditable pipeline for taking any jurisdiction's court data — in whatever format it happens to exist — and producing structured, demographically estimated eviction records. For research teams in states where the Eviction Lab and LSC do not yet have complete coverage, this is the reference implementation. For policymakers and legal-aid organizations wondering whether their own state's filing data is worth the engineering cost of extraction, the answer from Washington is yes, and the effort is compressible.
The concluding section of the paper lays out future technical directions — document layout analysis, LLM-assisted extraction, hybrid pipelines — and flags the ethical considerations around privacy and data integrity that any interdisciplinary team entering this space needs to plan for. These are not afterthoughts; they are part of the method.
From court-record image to Olympia
The estimates produced by the pipeline were used in legislative testimony and in coalition work with the Washington Office of Civil Legal Aid, the King County Bar Housing Justice Project, and tenant attorneys across the four counties in the study. Several policy changes followed.
None of these reforms is attributable to a single study. But the paper documents the direct channel between reading previously unreadable data and changing who gets to stay housed.
Pay-or-vacate extended 3 → 14 days
ESSB 5600 (2019). Gives tenants meaningful time to access rental-assistance funds before a filing is initiated.
Just-cause eviction statewide
Removes no-cause termination of long-term tenancies — the mechanism through which many of the filings in the study began.
Right to counsel for low-income tenants
2021. Washington becomes one of the first states in the country to guarantee legal representation in eviction proceedings.
The research team
The work was funded in part by the Moore/Sloan Foundation, the Bill & Melinda Gates Foundation, and the MacArthur Foundation. The authors thank the King, Pierce, Snohomish, and Whatcom county court clerks; the King County Bar Association (Edmund Whitter); Washington State Representative Nicole Macri; Jim Bamberger and the Washington Office of Civil Legal Aid; the University of Washington eScience Institute, iSchool, Department of Sociology, and Center for Studies in Demography and Ecology; UC Berkeley's Institute of Governmental Studies, D-Lab, and BIDS; and Karen Chapple and the Urban Displacement Project.
Read the full paper
The 19-page article is published open-access by HUD's Office of Policy Development & Research in Cityscape vol. 26, no. 1 (2024), "Local Data for Local Action" issue.