Our process began with collecting data on students and alumni. We had hoped to obtain some pre-compiled datasets that would make our lives easier, so we reached out to several offices at Carleton, including Alumni Relations, Institutional Research, and Admissions. While everyone we talked to was eager to help, the short time frame and the lack of stored data meant that no pre-compiled datasets were available. As a result, we had to get creative.
Alumni Data
We obtained the alumni data by web scraping the Carleton Alumni Directory. We queried the database by industry, which returned a series of .csv files containing anonymized data on alumni class year, major, industry, and address:
Initial webscraped alumni data
We then wrote a Python script to combine them into a single .csv file, appending “Industry” as a column (see our GitHub repository for the script):
Final alumni data
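A minimal sketch of that combining step is below. The directory layout and the assumption that each scraped file is named after its industry query are illustrative; our actual script is in the GitHub repository.

```python
import glob
import os

import pandas as pd

# Each scraped file is assumed to be named after its industry query,
# e.g. "Healthcare.csv" -- an assumption for illustration.
frames = []
for path in glob.glob("scraped/*.csv"):
    industry = os.path.splitext(os.path.basename(path))[0]
    df = pd.read_csv(path)
    df["Industry"] = industry  # append the industry as a new column
    frames.append(df)

# Concatenate all industries into one dataset and write it out.
combined = pd.concat(frames, ignore_index=True)
combined.to_csv("alumni_combined.csv", index=False)
```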
The downside of web scraping the data was that the Directory capped the number of results per query at 500. This means that industries with large numbers of alums -- such as Computers/IT, Communications and Media, Healthcare, and Legal Professions -- are underrepresented in our data. Within each industry, the queries returned results alphabetically by last name, so for capped industries we only received the first 500 alumni in the alphabet. As a result, the counts for some cities will be disproportionately low, particularly industry hubs (such as San Francisco and Seattle in the tech industry). Because of this limitation, we acknowledge that our data does not provide a complete picture of the alumni network. However, given the large sample size, we can still provide a broad picture of change over time.
While the alumni data collection was relatively straightforward, the data cleaning was more difficult. Using OpenRefine, we separated the address data into city, state, and country, removing street addresses to protect privacy. While this process worked well for most US addresses, several longer addresses (mostly of international alumni) were not so easily split into these three categories. This was a consistent problem throughout the data cleaning process -- how to deal with the differing formats of US and international addresses. We found several ways around this problem (discussed more below), but for the alumni data we ended up using a combination of splitting columns in Excel and manually parsing the data. After several passes through the spreadsheet, we had a cleaned dataset (see the GitHub repository for the final dataset).
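For the common US case the split is mechanical. Here is a rough Python equivalent of the column split we performed in OpenRefine and Excel; the trailing “City, ST ZIP” layout is an assumption for illustration:

```python
import re

def split_us_address(address: str):
    """Split a US-style address into (city, state, country).

    Assumes the address ends with "..., City, ST 55057";
    anything that doesn't match is flagged for manual review.
    """
    match = re.search(r"([A-Za-z .'-]+),\s*([A-Z]{2})\s*\d{5}(?:-\d{4})?$", address)
    if match:
        city, state = match.group(1).strip(), match.group(2)
        return city, state, "United States"
    return None  # international or irregular address: parse by hand

# Example: a Northfield street address
print(split_us_address("1 N College St, Northfield, MN 55057"))
# -> ('Northfield', 'MN', 'United States')
```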
Student Data
Compiling the student datasets was a slightly more complicated process. With no pre-compiled data, we ended up using the PDF scans of Carleton’s Zoobooks, available online through the Gould Library in the Carleton College Archives Digital Collections. After downloading the Zoobook scans from the past fifty years, we extracted their OCR’ed text into workable files using pdfminer.six, a Python library. We found the library more effective than the PDF-to-Excel converters we tried: not only did it not require a membership, but it also converted each PDF relatively neatly into a plain-text file. While the OCR was not perfect (see below), it was definitely usable, allowing us to turn our attention to the data cleaning.
Zoobook data after OCR. Note the added spaces and the floating cities and high schools
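The extraction step itself is short. A minimal sketch using pdfminer.six; the directory and file names are placeholders:

```python
from pathlib import Path

from pdfminer.high_level import extract_text

# Output directory for the plain-text versions; paths are placeholders.
Path("text").mkdir(exist_ok=True)

# Pull the OCR'ed text out of each scanned Zoobook PDF
# and save it as plain text for cleaning.
for pdf in Path("zoobooks").glob("*.pdf"):
    text = extract_text(str(pdf))
    Path("text", pdf.stem + ".txt").write_text(text, encoding="utf-8")
```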
To streamline data cleaning, each group member was assigned a decade of Zoobooks. Because of the variable formatting of the text files and our different backgrounds with DH tools, we each took a slightly different approach. Some of us used OpenRefine to split the data into columns based on separators such as commas or periods, parse the necessary data, and collapse the columns into city, state, and country. Others wrote Python programs to recognize state names, split each string on commas, and output the relevant information to a .csv file. Even those who wrote Python programs, however, still ended up finishing the cleaning in OpenRefine.
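As a rough illustration of the second approach, here is a sketch of a state-recognition parser; the two-letter-state heuristic and the file names are assumptions, and our actual scripts varied by decade:

```python
import csv

# Two-letter USPS codes used to spot the state field in a line.
US_STATES = {
    "AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "FL", "GA", "HI", "IA",
    "ID", "IL", "IN", "KS", "KY", "LA", "MA", "MD", "ME", "MI", "MN", "MO",
    "MS", "MT", "NC", "ND", "NE", "NH", "NJ", "NM", "NV", "NY", "OH", "OK",
    "OR", "PA", "RI", "SC", "SD", "TN", "TX", "UT", "VA", "VT", "WA", "WI",
    "WV", "WY",
}

def parse_line(line):
    """Split an OCR'ed line on commas and look for a US state code.

    Returns (city, state) if a state is recognized, else None so the
    line can be set aside for manual cleaning in OpenRefine.
    """
    fields = [f.strip() for f in line.split(",")]
    for i, field in enumerate(fields):
        if field.upper() in US_STATES and i > 0:
            return fields[i - 1], field.upper()
    return None

# File names are placeholders for one year's Zoobook text.
with open("zoobook_1972.txt") as src, open("students_1972.csv", "w", newline="") as dst:
    writer = csv.writer(dst)
    writer.writerow(["City", "State"])
    for line in src:
        parsed = parse_line(line)
        if parsed:
            writer.writerow(parsed)
```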
Despite our different approaches, we all faced similar problems with the data cleaning process. Some of these included:
- Cities with two words, such as “Saint Paul”: due to our splitting techniques, these cities were often either cut in half or joined together (e.g. “Paul” or “SaintPaul”). We had to write a Python program to add spaces back to each combined city (see the GitHub repository, and the sketch after this list)
- Different formats between Zoobooks: the format of the Zoobooks changed in 1989 from one student per row to multiple rows per student (see image). Furthermore, different years used different separators: some used commas, some used periods, and between 1990 and 1995 there were no separators at all between city and state, causing us extreme data cleaning headaches
- International cities: because there was no “state” field for international locations, we had to figure out ways of parsing city and country. International cities were also occasionally labeled as being in the United States, so for most of them we had to double-check that they were properly formatted
- Canadian cities: the formatting for students from Canada varied across years: sometimes a province was included, sometimes not. As a result, some provinces got classified as countries or cities
- OCR issues: while the quality of the OCR was mostly good, there were still some issues. Some cities gained spaces (“M i n n e a p o l i s”); some states were misread (“San Francisco, CA” became “San Francisco, GA”); and periods often became commas and vice versa
- Random Zoobook junk: there was a lot of irrelevant information in each Zoobook, including lists of number of students by state, title pages, and introductions. We realized early on that deleting this information in each text file, before trying to clean the data, would save us a lot of time
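As promised above, here is a minimal sketch of the city-name repairs, covering both the joined “SaintPaul” case and the OCR-spaced “M i n n e a p o l i s” case; the heuristics are simplified from our actual script:

```python
import re

def fix_city(city: str) -> str:
    """Repair two common artifacts in OCR'ed city names.

    "SaintPaul"              -> "Saint Paul"   (space lost during splitting)
    "M i n n e a p o l i s"  -> "Minneapolis"  (spaces gained during OCR)
    """
    # Collapse single letters separated by spaces back into one word.
    if re.fullmatch(r"(?:\w )+\w", city):
        return city.replace(" ", "")
    # Insert a space before each interior capital letter.
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", city)

print(fix_city("SaintPaul"))              # Saint Paul
print(fix_city("M i n n e a p o l i s"))  # Minneapolis
```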
When we had compiled all the years into a single dataset, we took turns cleaning up the last few issues in OpenRefine -- mostly OCR problems and the syntax of two-word city entries.
Geocoding
Once we had our two complete datasets, we geocoded the data using Geocode by Awesome Table, a Google Sheets plugin. The geocoding took a long time (approximately five hours, divided among multiple computers), and while it was relatively accurate, there were two big issues. First, the geocoder was unable to locate places based on zip code alone; we initially thought we would have to exclude these data points, but ultimately the geocoder in ArcGIS was able to locate them. Second, several entries were incorrectly geocoded, mostly international cities that were placed in the United States. Thanks to some solid manual data cleaning by Ethan, we were able to identify those locations, add them to a spreadsheet containing all of the improperly coded data points, and upload it into ArcGIS for re-geocoding. We then created feature layers for each dataset.
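We did all of this through the Sheets plugin and the ArcGIS geocoder rather than in code, but the same pass could be scripted. A sketch with geopy and the free Nominatim service, which are not the tools we used:

```python
from geopy.extra.rate_limiter import RateLimiter
from geopy.geocoders import Nominatim

# Nominatim is a free service; rate-limit requests to stay polite.
geolocator = Nominatim(user_agent="carleton-alumni-mapping")
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)

places = ["Saint Paul, MN, United States", "Nairobi, Kenya"]
for place in places:
    location = geocode(place)
    if location is None:
        print(f"{place}: not found, set aside for manual review")
    else:
        print(f"{place}: {location.latitude:.4f}, {location.longitude:.4f}")
```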
Geographic Analysis
After importing our data into ArcGIS, we began to create maps that would highlight the main trends in our datasets. We noticed three major patterns in the data: a decrease in midwestern students, an increase in international students, and clustering of alumni in particular cities. We decided to center our StoryMaps and data visualizations on these findings.
To create our maps, we used different feature layers to highlight particular trends. This often involved using the Filter tool in ArcGIS to look at specific demographics, such as Asian international students or alumni who graduated after 2000. Additionally, we used the analysis tools offered by ArcGIS, such as heat maps, to visualize the data in a user-friendly manner.
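We applied these filters in the ArcGIS interface; the equivalent selections, sketched in pandas with assumed column names, would look like this:

```python
import pandas as pd

alumni = pd.read_csv("alumni_combined.csv")

# Column names are assumptions for illustration.
recent = alumni[alumni["ClassYear"] > 2000]
tech = alumni[alumni["Industry"] == "Computers/IT"]
print(len(recent), "alumni graduated after 2000;", len(tech), "work in tech")
```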
Finally, we compiled the different maps into one StoryMap. We supplemented the various data visualizations with qualitative evidence we had gathered from archival research and conversations with the administration.
Bibliography
Brenda, Chuck. “Fulfilling the Dream.” Voice 73, no. 3 (2008): 20-25. https://contentdm.carleton.edu/digital/collection/Voice/id/24637/rec/37.
Carleton College. Alumni Directory. Northfield, MN: Carleton College, 1939-2020. https://apps.carleton.edu/alumni/directory.
Carleton College. Zoobooks, Carleton College. Northfield, MN: Carleton College, 1970-2020. https://contentdm.carleton.edu/digital/collection/Zoobooks.
“Carleton College International Student Report.” College Factual. Accessed March 1, 2021. https://www.collegefactual.com/colleges/carleton-college/student-life/international/#secCountryDeepDive.
“Celebrating Our Staars!” Unpublished booklet, last modified 2020. PDF file.
Crosby, Petra. Interview by Carleton International Students at AAF Alumni Workshop A, September 17, 2005. Interview 285, MP3. Series 2001-2010. Special Events Department Recordings 1953-2013. Carleton College Archives, Northfield, MN. https://archivedb.carleton.edu/index.php?p=digitallibrary/digitalcontent&id=80651.
“How to Help Carleton Students.” Career Center. Accessed March 1, 2021. https://apps.carleton.edu/career/alumni/help.
Lachance, Danny, and Jan Senn. “Finding Carleton.” Voice 73, no. 3 (2007): 12-19. https://contentdm.carleton.edu/digital/collection/Voice/id/18448.
Leatham, Rachel. Carleton Alumni Data. Northfield, MN: Carleton Career Center, 2021.
Meltzer, Leah. “Admissions Woos South and West.” The Carletonian, May 9, 2015, Features and Arts. https://thecarletonian.com/2015/05/09/admissions-woos-south-and-west.
“Second-Generation Students Here.” Voice 21, no. 4 (1956): 4-6. https://contentdm.carleton.edu/digital/collection/Voice/id/3998.
Stein, Perrin. “Funding Carleton’s Rising International Community.” The Carletonian, May 18, 2014, News. https://thecarletonian.com/2014/05/18/funding-carletons-rising-international-community.