Our Process

The first step of our process was extracting data from the Zoobook, a digital book for each year of new students at Carleton starting in 1955. Each year's book contains student names, high schools, and hometowns. To scrape the data, Ethan used the Python libraries PDFMiner and PyPDF2 and collected the results in an Excel dataset.
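As a rough illustration of this step, the sketch below pulls the text out of a PDF with PyPDF2 and writes it to a CSV file that Excel can open. It is not the script Ethan used: the function names, the one-row-per-line layout, and the column names are assumptions made for the example, and the PDFMiner half of the workflow is not shown.

```python
import csv


def extract_page_texts(pdf_path):
    """Return the extracted text of every page in a PDF, one string per page."""
    # Lazy import: only this function needs PyPDF2 installed.
    from PyPDF2 import PdfReader

    reader = PdfReader(pdf_path)
    # extract_text() may return nothing for scanned (image-only) pages.
    return [page.extract_text() or "" for page in reader.pages]


def page_texts_to_rows(year, page_texts):
    """Turn page texts into (year, page_number, line) rows, skipping blank lines."""
    rows = []
    for page_number, text in enumerate(page_texts, start=1):
        for line in text.splitlines():
            line = line.strip()
            if line:
                rows.append((year, page_number, line))
    return rows


def write_rows(rows, csv_path):
    """Write the rows to a CSV file (hypothetical column names)."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["year", "page", "line"])
        writer.writerows(rows)
```

Each row still has to be split into name, high school, and hometown afterwards. How to do that depends on the layout of each year's book, so it is left out here.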

After obtaining the data, the team manually cleaned it to remove obvious misspellings and extra whitespace between words. Next, we used a Geocoder web extension to extract the latitude and longitude from the full address of each observation. Schools for which the geocoder could not return a latitude and longitude were excluded to maintain data consistency. All data is available in our GitHub.
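The team did this cleaning by hand, but the whitespace cleanup and the exclusion of schools without coordinates could also be scripted. The sketch below shows one way to do it. The field names (name, high_school, hometown, lat, lon) are hypothetical and would need to match the real column headers.

```python
import re


def clean_text(value):
    """Collapse runs of whitespace into single spaces and trim both ends."""
    return re.sub(r"\s+", " ", value).strip()


def clean_records(records, text_fields=("name", "high_school", "hometown")):
    """Tidy text fields and drop records the geocoder could not locate.

    Each record is a dict. A record is kept only if both its "lat" and
    "lon" values are present (field names are assumptions for this sketch).
    """
    cleaned = []
    for record in records:
        if record.get("lat") is None or record.get("lon") is None:
            continue  # no coordinates: exclude, as described above
        row = dict(record)  # copy so the input is left untouched
        for field in text_fields:
            if isinstance(row.get(field), str):
                row[field] = clean_text(row[field])
        cleaned.append(row)
    return cleaned
```

Misspellings are harder to fix automatically, so that part of the cleaning would still need a manual pass.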

The next step was running the dataset against the National Center for Education Statistics (NCES) databases; NCES is part of the Institute of Education Sciences. We ran our data against both the public and private school databases to determine whether each school in our dataset is public or private.
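The lookup described above can be pictured as a simple matching step. The sketch below assumes the NCES public and private school lists have been exported as lists of dicts with name and city fields, and that our records have high_school and city fields. All of these names are hypothetical, and the matching rule (exact match after lowercasing and whitespace cleanup) is only one possible choice.

```python
def normalize(text):
    """Lowercase a string and collapse whitespace so small differences still match."""
    return " ".join(text.lower().split())


def build_index(schools):
    """Build a set of (name, city) keys from a list of school dicts."""
    return {(normalize(s["name"]), normalize(s["city"])) for s in schools}


def classify(record, public_index, private_index):
    """Return "public", "private", or "unknown" for one record.

    The public list is checked first, so a key that appears in both
    lists is reported as public.
    """
    key = (normalize(record["high_school"]), normalize(record["city"]))
    if key in public_index:
        return "public"
    if key in private_index:
        return "private"
    return "unknown"
```

Records that come back as "unknown" are the ones that would need a manual check or a looser matching rule.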

After this, the team created the data visualizations on this website. Margo created an interactive bar chart on Flourish showing the top 10 high schools attended by incoming classes at Carleton College, and an ArcGIS map was created with the data points classified as private or public schools. Lastly, a heatmap was created to show where the high school students who went on to attend Carleton came from.
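The numbers behind a top 10 chart like this one come from counting how often each high school appears in the cleaned dataset. A minimal sketch of that count, assuming each record is a dict with a hypothetical high_school field, could look like this:

```python
from collections import Counter


def top_schools(records, n=10):
    """Return the n most common high schools as (school, count) pairs."""
    counts = Counter(record["high_school"] for record in records)
    return counts.most_common(n)
```

The resulting pairs can be saved as a two-column table and uploaded to a charting tool such as Flourish.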

Finally, all of this is showcased on this website, which Jenna and Kenton set up. They also helped clean the information and the final products that were uploaded to the pages of the website.

We had several limitations while scraping and cleaning our data. PDF scrapers don’t always accurately match the high schools and towns, which led to some inaccuracies. Additionally, the geocoder used in Google Sheets was not completely accurate, and because our dataset is so large, we were unable to go back and correct locations manually. We also ended up having to remove large amounts of data due to a lack of information or unclear names. While this may have caused slight inaccuracies, we believe our data still allows us to visualize overall trends.