Sources
We downloaded CSV files of the entire collections from the official websites of the British Museum, the Cleveland Museum of Art, and the Metropolitan Museum of Art.
Data Cleaning with OpenRefine
We initially cleaned the data with OpenRefine. Given the large amount of data, we decided to focus only on ceramics produced before 1949. We used filters to select “ceramics”, “earthenware”, “porcelain”, and similar terms in the medium field. We also used regular expressions to extract the production place; most entries list a city. After sorting, we added a province column and manually looked up the province in which each city is located. Entries without a production place or date were dropped. This cleaning workflow was applied to the British Museum and the Cleveland Museum of Art datasets.
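The medium-and-date filtering above was done interactively in OpenRefine, but the same step can be sketched in pandas. The column names (`Medium`, `Production date`) and the toy rows here are assumptions for illustration, not the museums' actual schemas:

```python
import pandas as pd

# Hypothetical column names and sample rows; the real CSVs differ per museum.
df = pd.DataFrame({
    "Medium": ["porcelain", "earthenware", "bronze", "stoneware"],
    "Production date": [1720, 1850, 1500, 1960],
})

# Keep only ceramic-related media produced before 1949.
ceramic_terms = ["ceramic", "earthenware", "porcelain", "stoneware"]
mask = df["Medium"].str.lower().str.contains("|".join(ceramic_terms))
ceramics = df[mask & (df["Production date"] < 1949)]
```

The bronze object is dropped by the medium filter and the 1960 stoneware piece by the date cut-off.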
Data Cleaning with Pandas
We found that the raw CSV files are very messy, and some columns contain information that we could not extract with OpenRefine. We therefore also used Python for data cleaning, relying on the well-known Pandas library.
In the raw CSV files we downloaded from GitHub, the Met records do not give an exact Chinese province for each relic. The description column only contains phrases like “(Jingdezhen ware)”, where Jingdezhen is the name of a town. We therefore extracted the word before “ ware” in Python by splitting the string into substrings and taking the wanted entry. This is tricky because the description may also mention other “ ware” terms outside parentheses, which we do not want.
After extracting all the places, we binned them, downloaded the resulting table, then manually searched for each place and mapped it to its province.
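The binning and the manual town-to-province mapping can be sketched as a `value_counts` plus a hand-built lookup table; the three entries below are illustrative examples, not the full table used in the project:

```python
import pandas as pd

# Hand-built town -> province lookup (illustrative subset).
town_to_province = {
    "Jingdezhen": "Jiangxi",
    "Longquan": "Zhejiang",
    "Dehua": "Fujian",
}

towns = pd.Series(["Jingdezhen", "Dehua", "Jingdezhen"])
bins = towns.value_counts()          # frequency of each place ("binning")
provinces = towns.map(town_to_province)
```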
We found that there are too many relics in the gathered data, so it made sense to categorize them further, since they reflect different cultures and technologies. We therefore binned relics by dynasty. Some relics list two or more dynasties, such as “Ming Dynasty; Qing Dynasty”, and others pair a dynasty with extraneous information such as the name of an emperor, in either order. Our solution was to split the string on the semicolon, delete the original row, and create two new rows for the same relic, one saying “Ming Dynasty” and the other “Qing Dynasty”. Finally, we grouped the rows, turned each dynasty bin into a CSV file, and downloaded them.
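The split-and-duplicate step maps naturally onto pandas' `explode`, which replaces each list-valued row with one row per list element. This is a sketch under assumed column names (`Object ID`, `Dynasty`); the `contains("Dynasty")` filter is one simple way to drop emperor names and other extraneous parts:

```python
import pandas as pd

df = pd.DataFrame({
    "Object ID": [1, 2],
    "Dynasty": ["Ming Dynasty; Qing Dynasty", "Qing Dynasty; Kangxi"],
})

# Split multi-dynasty entries on ";" and give each part its own row.
df["Dynasty"] = df["Dynasty"].str.split(";")
df = df.explode("Dynasty")
df["Dynasty"] = df["Dynasty"].str.strip()

# Drop parts that are not dynasties (e.g. emperor names like "Kangxi").
df = df[df["Dynasty"].str.contains("Dynasty")]

# One CSV per dynasty bin.
for dynasty, group in df.groupby("Dynasty"):
    group.to_csv(f"{dynasty}.csv", index=False)
```

Object 1 ends up in both the Ming and Qing bins, while “Kangxi” is discarded.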
We also thought it would be interesting to investigate which people or parties contributed most to the museum collections by selling or donating relics to them. Our goal was to extract this information, again with Python, from the credit line and/or provenance columns. However, this data is also messy: an entry sometimes contains multiple names, or narrates in detail when the person gave the object to the museum. We therefore filtered out the unwanted information and stripped prefixes such as “Gift of ” or “Bequest of ”.
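The prefix stripping can be sketched as below. The prefix list, the comma heuristic for trailing detail (often a year), and the sample credit line are all assumptions for illustration, not the project's exact rules:

```python
import re

def clean_donor(credit_line: str) -> str:
    # Drop leading prefixes such as "Gift of" or "Bequest of"
    # (illustrative list), then cut off trailing detail after
    # the first comma, which is often a year or a narrative.
    name = re.sub(r"^(Gift|Bequest|Purchase|Donation) of\s+", "", credit_line)
    return name.split(",")[0].strip()

clean_donor("Gift of Mrs. John D. Rockefeller Jr., 1942")
```

Grouping the cleaned names and counting rows per name then yields the most frequent contributors.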