Methodology

Eviction data acquisition, cleaning, and analysis

Table of contents

  1. Acquisition
  2. Court record mining
  3. Data cleaning
    1. Geocoding
    2. Personally identifying information
    3. Demographic estimation
    4. Deduplication

Acquisition

(Coming soon)

Data providers:

  1. Legal Services Corporation

    States included: Delaware, Indiana, Minnesota, Missouri, Ohio, Pennsylvania

  2. Portland State University

    Eviction data for the entire state of Oregon were provided by Lisa Bates, PhD, director of EvictedInOregon at Portland State University. The records contained fields for case number, date of filing, each party listed on a case, the side of the listed party, type of eviction, and whether the filing occurred during a moratorium.

  3. Chicago Legal Aid / ACLU

    Eviction data for Cook County, DuPage County, Kane County, McHenry County, and Will County were provided through FOIA requests, web scraping, and Chicago Legal Aid. The information available varied by county, and not all records contained sufficient information for reporting. The discrepancies are noted in the Illinois state profile.

  4. Baltimore City Sheriff's Department

    (Updated soon)

  5. Washington state county court clerks' online portals

    (Updated soon)


Court record mining

(Coming soon)


Data cleaning

Once all records are in a machine-readable format, records from disparate jurisdictions and data requests can be consolidated.

For all data sources, a main_id field was created from the case number, the year in which the case was filed, and the county FIPS code of the jurisdiction from which it came. To distinguish individual instances of eviction, we made the following assumptions about case numbers:

  1. Unique case numbers correspond to unique filings but not necessarily to a unique defendant. That is to say, while case numbers distinguish individual cases from one another, they are not sufficient to distinguish individual evictions, which are the unit of observation we wish to analyze.
  2. Case numbers contain only alphanumeric characters, with no punctuation or whitespace. This was tested by examining the instances of non-alphanumeric characters in the case number string. These occurrences appeared in a small minority of cases [report percentage] and primarily consisted of `][;,` characters, all of which are placed around the perimeter of the home row on a standard QWERTY keyboard. It is therefore reasonable to assume that these are typos and do not truly identify a unique case. All instances of such characters are replaced with empty strings.
  3. Case numbers are not case sensitive.
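
As an illustration of these assumptions, the following minimal sketch builds an identifier from the three components. The function names, the underscore separator, and the field order are hypothetical choices for this example, not the project's actual implementation.

```python
import re


def clean_case_number(raw: str) -> str:
    """Strip every non-alphanumeric character and uppercase the result,
    so that typos and letter case do not create spurious distinct cases."""
    return re.sub(r"[^A-Za-z0-9]", "", raw).upper()


def make_main_id(case_number: str, filing_year: int, county_fips: str) -> str:
    """Combine case number, filing year, and county FIPS code.

    The separator and ordering are assumptions made for illustration.
    """
    return f"{clean_case_number(case_number)}_{filing_year}_{county_fips}"
```

Under this sketch, two records whose case numbers differ only by stray punctuation or letter case receive the same main_id, provided the filing year and county match.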

Geocoding

Next, we geocoded the data using a combination of US Census Bureau, ArcGIS, and OpenStreetMap geocoding services. Geocoding is the process of creating spatial data by establishing the latitude and longitude of individual addresses, which we used to determine the census tract and county where each eviction occurred. The US Census Bureau’s geocoding service can process up to 10,000 addresses per request, so we used it first. For records it could not geocode, we used ArcGIS or OpenStreetMap, depending on which address fields were available.
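
The batching and fallback logic can be sketched as follows. This is a simplified illustration, not the project's pipeline: the geocoders are passed in as plain callables (hypothetical stand-ins for the real services), and the fallback is a fixed cascade rather than a choice based on which address fields are available.

```python
from typing import Callable, Dict, Iterator, List, Optional, Tuple

Coordinates = Tuple[float, float]  # (latitude, longitude)
Geocoder = Callable[[str], Optional[Coordinates]]  # hypothetical interface

CENSUS_BATCH_LIMIT = 10_000  # per-request limit stated in the text


def batches(addresses: List[str], size: int = CENSUS_BATCH_LIMIT) -> Iterator[List[str]]:
    """Split the address list into chunks no larger than the batch limit."""
    for start in range(0, len(addresses), size):
        yield addresses[start:start + size]


def geocode_with_fallback(
    addresses: List[str], geocoders: List[Geocoder]
) -> Tuple[Dict[str, Coordinates], List[str]]:
    """Try each geocoder in order, passing only unmatched addresses onward.

    Returns the matched coordinates and the addresses no service resolved.
    """
    results: Dict[str, Coordinates] = {}
    remaining = list(addresses)
    for geocode in geocoders:
        still_unmatched = []
        for address in remaining:
            coords = geocode(address)
            if coords is None:
                still_unmatched.append(address)
            else:
                results[address] = coords
        remaining = still_unmatched
    return results, remaining
```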

Personally identifying information

After geocoding, we used regular expressions and other string manipulation methods to clean and extract the first and last names of individual defendants. The data include information about eviction filings among (1) individual households, (2) businesses, and (3) unnamed tenants. For these state profiles, we are only interested in analyzing evictions of individual households, not commercial evictions, so we filtered out cases where the name suggested the defendant was a business rather than a person.
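
A minimal sketch of this kind of name handling is shown below. The keyword lists and the two name layouts it recognizes are assumptions chosen for illustration; the actual patterns used for the state profiles are not documented here and are likely more extensive.

```python
import re
from typing import Tuple

# Illustrative keyword lists (assumptions, not the project's actual patterns).
BUSINESS_PATTERN = re.compile(
    r"\b(LLC|INC|CORP|CORPORATION|COMPANY|LTD|LLP|ASSOCIATES|ENTERPRISES)\b",
    re.IGNORECASE,
)
UNNAMED_PATTERN = re.compile(
    r"\b(JOHN DOE|JANE DOE|UNKNOWN|ALL (OTHER )?OCCUPANTS)\b",
    re.IGNORECASE,
)


def classify_defendant(name: str) -> str:
    """Label a defendant name as 'business', 'unnamed', or 'individual'."""
    cleaned = name.replace(".", "")  # so "L.L.C." is read as "LLC"
    if BUSINESS_PATTERN.search(cleaned):
        return "business"
    if UNNAMED_PATTERN.search(cleaned):
        return "unnamed"
    return "individual"


def split_name(name: str) -> Tuple[str, str]:
    """Return (first, last) in uppercase from 'LAST, FIRST ...' or 'FIRST ... LAST'."""
    if "," in name:
        last, _, rest = name.partition(",")
        rest_parts = rest.split()
        first = rest_parts[0] if rest_parts else ""
        return first.upper(), last.strip().upper()
    parts = name.split()
    if not parts:
        return "", ""
    return parts[0].upper(), parts[-1].upper()
```

Only records classified as "individual" would be carried forward in this sketch; a keyword approach like this can misclassify people whose names contain a listed word, so any real list needs manual review.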

Demographic estimation

Using the surname extracted from the defendant name field, we estimated the race of each defendant using a Bayesian prediction model. This ecological inference method, developed by Imai and Khanna, uses Bayes’ rule to combine two pieces of information: the racial distribution of frequently occurring surnames in Census name data, and the racial composition of the neighborhood (census tract) where the evicted defendant lived. From these, we computed the predicted probability of each racial category (White, Black, Latine, Asian, or other) for any given individual. For example, a person with the last name Jackson, a common Black surname, living in a neighborhood where a large share of the population is Black would have a higher likelihood of being estimated as Black than a person with the same surname living in a neighborhood where a smaller share of the population is Black. Neighborhood racial composition is defined by the 2020 Decennial Census tract geography.
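
The combination of surname and tract information can be sketched as a single application of Bayes' rule. The sketch below assumes, as this class of model typically does, that surname and tract are independent given race, which gives a posterior proportional to P(race | surname) × P(race | tract) / P(race). The function name and the three input dictionaries are hypothetical, and any probabilities passed in for experimentation are placeholders rather than Census figures.

```python
from typing import Dict

RACES = ("white", "black", "latine", "asian", "other")


def posterior_race_probs(
    p_race_given_surname: Dict[str, float],
    p_race_given_tract: Dict[str, float],
    p_race_overall: Dict[str, float],
) -> Dict[str, float]:
    """Combine surname and tract evidence under conditional independence.

    Dividing by the overall race share converts the tract composition
    P(race | tract) into a term proportional to P(tract | race).
    """
    unnormalized = {
        race: p_race_given_surname[race]
        * p_race_given_tract[race]
        / p_race_overall[race]
        for race in RACES
    }
    total = sum(unnormalized.values())
    return {race: value / total for race, value in unnormalized.items()}
```

With the same surname probabilities, a tract with a larger Black population share yields a larger posterior probability of Black, which mirrors the Jackson example above.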

When we aggregated evictions to the tract and county level, we calculated the racial composition of each geographic area by adding the predicted probabilities of each race across all eviction cases. For example, if there are three individuals with eviction filings in a tract, and their predicted probabilities of being Asian are 0.3, 0.8, and 0.2 respectively, we would say that there are an estimated (0.3 + 0.8 + 0.2) = 1.3 evictions among Asian defendants in that tract.
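
The aggregation amounts to summing probabilities within each geography. A minimal sketch follows, assuming each case is a dictionary with hypothetical "tract" and "race_probs" keys; the real records will have a different layout.

```python
from collections import defaultdict
from typing import Dict, List


def expected_counts_by_tract(cases: List[dict]) -> Dict[str, Dict[str, float]]:
    """Sum each race's predicted probability across all cases in a tract."""
    totals = defaultdict(lambda: defaultdict(float))
    for case in cases:
        for race, prob in case["race_probs"].items():
            totals[case["tract"]][race] += prob
    # Convert nested defaultdicts to plain dicts for a predictable return type.
    return {tract: dict(by_race) for tract, by_race in totals.items()}
```

The same function applies at the county level if the grouping key is a county identifier instead of a tract identifier.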

We also estimated the sex of each defendant by cross-referencing the individual's first name with the Social Security Administration (SSA) Name Registry from 1932 to 2012 and the US Census Integrated Public Use Microdata Series (IPUMS).

Deduplication

After cleaning names and standardizing addresses with geocoding services, we deduplicated the records. This is necessary because the legal process for evictions differs not only between states but also between jurisdictions, and there is no standard method of assigning case numbers. Many records have different case numbers but identical name and address fields. These records cannot be counted as unique eviction events, because they are more likely to reflect paperwork than separate attempts to remove an individual from their place of residence. When records have identical name and address fields but distinct case numbers, only the earliest case is kept.
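
The keep-the-earliest rule can be sketched as follows. The record layout ("name", "address", "filing_date", "case_number" keys) is a hypothetical one chosen for this example, and the sketch assumes names and addresses have already been cleaned and standardized so that exact matching is meaningful.

```python
from datetime import date
from typing import Dict, List, Tuple


def deduplicate(records: List[dict]) -> List[dict]:
    """Keep only the earliest-filed record for each (name, address) pair."""
    earliest: Dict[Tuple[str, str], dict] = {}
    for record in records:
        key = (record["name"], record["address"])
        kept = earliest.get(key)
        if kept is None or record["filing_date"] < kept["filing_date"]:
            earliest[key] = record
    return list(earliest.values())
```

For instance, two filings against the same name and address dated date(2021, 3, 1) and date(2021, 5, 9) collapse to the March record, regardless of their case numbers.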