As we looked through the OpenRefine data for the Metropolitan Museum of Art, we noticed that provinces were not listed under the expected “Region” column. This was a problem, because we needed geographic locations at least as precise as the province level. The Met has a free and open-source API, so we figured the API would have the data. It did not, and we were unsure how to format the data, or even how to obtain the data we needed to map the artifacts in ArcGIS.
We realized that the Met Museum’s searchable online collection does include province data for the ceramic artifacts, so we decided to use a tool called wget to download the web pages from Terminal. We filtered for all the ceramic artifacts from China, put the HTML links into a text file, and ran wget to fetch the HTML for every page. Once the pages were downloaded, we needed to extract the details for each artifact. We wrote a Python script that uses the BeautifulSoup library to parse the HTML, grab all the sections labeled “details”, and write them to a CSV file. This is the script we used:
from bs4 import BeautifulSoup
import os
import csv

# Folder containing the downloaded artifact pages
path = r"/Users/ezrabarber/ChinaCeramics"

fieldnames = ["File Name", "Title", "Period", "Culture", "Medium", "Dimensions", "Classification", "Credit", "Accession Number"]

with open("value_spans_pottery.csv", "w", newline="") as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(fieldnames)
    for filename in os.listdir(path):
        if filename.endswith(".html"):
            fullpath = os.path.join(path, filename)
            # Get page, make soup
            with open(fullpath) as page:
                soup = BeautifulSoup(page, features="html.parser")
            # Collect every "tombstone" detail value on the page
            lst = [str(filename)]
            for results in soup.findAll("span", {"class": "artwork-tombstone--value"}):
                lst.append(results.text)
            writer.writerow(lst)
print("Done writing")

We were able to load the Met data as a CSV in ArcGIS, add the objects to a layer, and count them in our heat map. Later, we discovered that we only needed OpenRefine to get this information after all: the important values were not under “Country” but under “Culture”. Once we filtered on the “Culture” column, the majority of the artifacts had the province data either in its own column or under the “Medium” column as specialty wares. This proved much easier than downloading HTML and scripting, so we ended up using both methods in our datasets. Lesson learned: check the filter parameters and all the columns before you settle on one data-cleaning method!
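For readers who prefer to stay in Python, the OpenRefine filtering step we describe can be sketched with the standard csv module. This is a minimal illustration, not our actual workflow: the sample rows and column names here are hypothetical stand-ins for a Met collection export.

```python
import csv
import io

# Hypothetical sample of a Met collection export; a real export
# has many more columns and rows.
sample = """Object ID,Culture,Medium,Region
1,China,Porcelain,
2,France,Earthenware,Ile-de-France
3,"China, Jingdezhen ware",Porcelain painted in cobalt blue,
"""

# Keep only rows whose Culture field mentions China, mirroring
# the "Culture" facet we used in OpenRefine instead of "Country".
reader = csv.DictReader(io.StringIO(sample))
chinese_rows = [row for row in reader if "China" in row["Culture"]]

for row in chinese_rows:
    print(row["Object ID"], row["Culture"])
```

The same idea extends to the “Medium” column: a second pass over `chinese_rows` can pull out specialty wares (such as Jingdezhen ware) whose names imply a province.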