The Database

Data Engineering: Wrangling, Cleaning, and Processing

To make the public testimony more accessible, our team built a database that links to a PDF of each individual submission.

We first used web scraping tools in Python to automatically download all of the testimony PDF files from the NYCDC website. Next, we converted each PDF into a plaintext file, a necessary step before running Named Entity Recognition (NER). NER is a computational text analysis method that identifies key terms in a text corpus, such as locations, people, identities, laws, events, and organizations. We used the NER output to create a database of the public testimony compiled during New York City’s 2022 redistricting process.
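The link-gathering part of the scraping step can be sketched with Python's standard library alone. This is a minimal illustration, not the team's actual script: the class name `PdfLinkCollector`, the helper `find_pdf_links`, the sample HTML, and the `example.org` URLs are all hypothetical, and the real NYCDC pages may be structured differently.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PdfLinkCollector(HTMLParser):
    """Collect the absolute URL of every <a href="..."> that points to a PDF."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pdf_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Ignore any query string when checking the file extension.
        if href and href.lower().split("?")[0].endswith(".pdf"):
            self.pdf_links.append(urljoin(self.base_url, href))


def find_pdf_links(html, base_url):
    """Return all PDF links found in an HTML string, resolved against base_url."""
    collector = PdfLinkCollector(base_url)
    collector.feed(html)
    return collector.pdf_links


# Hypothetical page fragment, for illustration only.
sample_html = """
<ul>
  <li><a href="/testimony/001.pdf">Testimony 1</a></li>
  <li><a href="about.html">About</a></li>
  <li><a href="https://example.org/files/002.PDF?download=1">Testimony 2</a></li>
</ul>
"""
links = find_pdf_links(sample_html, "https://example.org/hearings/")
```

Each URL in `links` could then be downloaded and passed to a PDF-to-text converter before running NER.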

The NER process organizes the public testimony data into general categories, including dates, organizations, people, numbers, and laws. We created a codebook to define those categories further and to eliminate categories that were not relevant to our research questions. The codebook guided the manual coding process, which was the most labor-intensive and time-consuming part of the project. It serves two purposes: (1) it streamlines data cleaning by ensuring that each member of our team follows the same protocol while reviewing testimony, and (2) it defines each entity type and the criteria used to determine its category. For more information on the data we collected and cleaned, please review our Meta Data sheet.

Using the codebook as a guide, we then cleaned the data by hand. Each member of our five-person team was assigned a section of the data and tagged each entry with its entity type and the appropriate city council district or districts. We corrected inaccurate entity types assigned during the NER process and assigned both geographic and qualitative labels to each piece of testimony. We eliminated works of art, events, dates, times, numbers not related to city council districts, objects, and the names of all people except members of the New York City Council. We kept streets, buildings, parks, companies, council members, numbers identifying council districts, laws, nationalities, and religious and political groups.
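The keep/eliminate rules above could be expressed as a first-pass triage that runs before hand review. The sketch below assumes OntoNotes-style entity labels (such as `FAC`, `NORP`, and `WORK_OF_ART`); the project's codebook may use different names. The function `triage_entity` and the sample council member names are hypothetical. Anything the rules cannot settle is marked for a human reviewer, as in the manual process described above.

```python
# Assumed OntoNotes-style labels; the real codebook categories may differ.
KEPT_LABELS = {"FAC", "LOC", "ORG", "LAW", "NORP"}  # places, groups, laws
DROPPED_LABELS = {"WORK_OF_ART", "EVENT", "DATE", "TIME", "PRODUCT"}
NUMBER_LABELS = {"CARDINAL", "ORDINAL"}


def triage_entity(text, label, council_members):
    """Return 'keep', 'drop', or 'review' for a single NER result."""
    if label in KEPT_LABELS:
        return "keep"
    if label in DROPPED_LABELS:
        return "drop"
    if label == "PERSON":
        # Only names of council members are retained.
        return "keep" if text in council_members else "drop"
    if label in NUMBER_LABELS:
        # Whether a number names a council district depends on context,
        # so it is left for a human reviewer.
        return "review"
    return "review"


# Hypothetical names, for illustration only.
members = {"Jane Doe", "John Roe"}
results = [
    triage_entity("Example Park", "LOC", members),
    triage_entity("Jane Doe", "PERSON", members),
    triage_entity("Alex Smith", "PERSON", members),
    triage_entity("last Tuesday", "DATE", members),
    triage_entity("26", "CARDINAL", members),
]
```

A pass like this only narrows the work; it does not replace the manual correction of mislabeled entities.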

In addition to the NER data, we added shapefile data for all identified NYC neighborhoods and links to each piece of testimony; the testimony PDFs are stored in a Dropbox folder. To download the entire database as a CSV file, follow this link.
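Once downloaded, the CSV can be explored with Python's built-in `csv` module. The example below is a sketch under stated assumptions: the column names (`entity`, `entity_type`, `districts`, `testimony_url`), the semicolon-separated district format, and the sample rows are all hypothetical, so check the Meta Data sheet for the actual schema.

```python
import csv
import io


def rows_for_district(csv_file, district):
    """Return every row tagged with the given council district.

    Assumes a 'districts' column holding one or more district numbers
    separated by semicolons (a hypothetical format).
    """
    wanted = str(district)
    matches = []
    for row in csv.DictReader(csv_file):
        districts = [d.strip() for d in row["districts"].split(";") if d.strip()]
        if wanted in districts:
            matches.append(row)
    return matches


# In-memory stand-in for the downloaded file, with made-up rows.
sample_csv = io.StringIO(
    "entity,entity_type,districts,testimony_url\n"
    "Example Park,LOC,1;2,https://example.org/a.pdf\n"
    "Example Avenue,FAC,3,https://example.org/b.pdf\n"
    "Example Tenants Association,ORG,2,https://example.org/c.pdf\n"
)
district_2 = rows_for_district(sample_csv, 2)
```

To run this against the real file, replace `sample_csv` with `open("database.csv", newline="", encoding="utf-8")`, where `database.csv` is a placeholder for whatever name the download is saved under.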