Process

The first step in analyzing the socioeconomic background of Carls using the Zoobooks was to format the Zoobook data for usability. The data is organized with a student’s picture first, followed by their name, high school, city, and state.

The most reliable way to extract data from the Zoobook was to copy and paste the transcript of each Zoobook, provided by the Carleton College Digital Collections, into a Google Sheets. We believe these transcripts were generated using OCR (optical character recognition) software, which converted the PDFs into machine-readable text before being posted as transcriptions for each page. However, we encountered issues with the quality of the PDFs, which affected how accurately the OCR software processed the text. Errors included misspelled city names, incorrect spacing, cities being grouped into a single line, or, in the worst cases, entire pages failing to transcribe which was an issue we faced with the 2021 Zoobook. While we tried to use OCR software on Zoobooks we got worse results than the actual transcripts so we decided to stick with the transcripts provided to us.

An example of a Google Sheets with only raw data.

Once we had the raw data in our Google Sheets, we began preprocessing and cleaning it. We first tried to automate this process using AI, but we had little confidence in the accuracy of AI generated cleaning, and paywalls prevented us from pursuing that approach further. Ultimately, we decided that manually reviewing and correcting the data was the most reliable method. This involved going through the raw data and fixing any errors in the raw data manually

After cleaning the raw data as thoroughly as possible, we extracted city and state information using an Excel formula that searched for state abbreviations in each cell and copied them into a separate sheet titled ‘structured_year’ for further cleaning and preprocessing. For the first two columns, making any corrections required referring back to the original raw data sheet to identify and fix errors in the structured sheet. Once all errors were corrected, we created two new columns containing only the cleaned values, ensuring they were no longer linked to the raw data sheet so that we could continue with preprocessing.

Excel equation that returned cells that contained city and state data only.
Structured Excel Sheet that would allow for preprocessing.

To preprocess our data, we needed to obtain the geographic coordinates of each location so that we could accurately place each student on our visualization. This process, called geocoding, can be done in several ways, but the most effective method we found was using a Google Sheets extension called Geocode by Awesome Table. This extension grouped our city and state columns into a single column called ‘Full Address’ and then used that column to geocode each location.

We then consolidated each year’s data into a single sheet called ‘Consolidated,’ which contained data from 2015 to 2024, with every location geocoded. This consolidated sheet included all the necessary information to create a meaningful visualization.

Consolidated Google Sheet with every location geocoded, missing is a column called ‘year’ which we added to associate each row with its corresponding yea

We chose to proceed with Social Explorer to visualize our data. Initially, we tried to use Flourish as we thought we could utilize we could use the base map taken from Social Explorer and over lap with different layers to include all the students, however, we were not able to download the base map in a format that was required by Social Explorer. We had a similar issue with ArcGIS.

On Social Explorer we changed the data for the base map to be Poverty Total by county. We then chose the year 2023 for the map to be most recent that was available to me.

With Social Explorer, we used the Longitude and Latitude for each student and marked them with a star when we found out we could overlay the layers with this data. We then proceeded to compare the map before and after COVID-19 by using a slider to show the difference.

Overall, the process of using Social Explorer was straightforward and moderately easy to use. While there were some limitations on the site, the sliding overlay made it possible for us to do some data analysis.