What is OCR?
Optical Character Recognition (OCR) is the process of detecting text in a static file, such as a scanned image, and converting it into machine-encoded text that can be selected, searched, and copied in the output file. For example, running OCR on a static image that contains text returns the same image with the recognized text embedded as a layer on top of it.
Why OCR?
OCR is especially useful for extracting text from static files that do not already contain embedded text. It identifies the text in the file and embeds the corresponding characters so that a reader can highlight and search the recognized text. For our purposes, identifying text in the scanned microfiche was essential, because our sentiment analysis was based solely on the recognized text.
Implementation
A wide variety of OCR software is available, ranging from free to very expensive. For our OCR step, we chose four OCR packages to try, with the aim of comparing their efficacy. We tried one free tool, OCR.space, which only requires uploading a file in a web browser, and three paid products: ABBYY, Adobe Acrobat Pro DC, and OmniPage. We did not have enough time to critically compare the results of the four packages, but we would like to look into this in future work. Determining which OCR software is most effective at identifying text in microfiche scans will be useful for anyone conducting text analysis on microfiche scans in the future.
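One way such a comparison could be approached is with a character error rate (CER): the edit distance between an engine's output and a hand-typed reference transcription, divided by the length of the reference. The sketch below is illustrative only and is not part of the work described above. It assumes a manually transcribed reference page is available, and the function names, engine labels, and sample strings are all hypothetical placeholders rather than real results.

```python
from typing import Dict, List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalized by reference length; lower is better."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def rank_engines(reference: str, outputs: Dict[str, str]) -> List[Tuple[str, float]]:
    """Return (engine label, CER) pairs sorted from best to worst."""
    scores = {name: character_error_rate(reference, text)
              for name, text in outputs.items()}
    return sorted(scores.items(), key=lambda item: item[1])


# Hypothetical placeholder data, not actual output from any OCR package.
reference_text = "hello world"
ocr_outputs = {
    "engine_a": "hello world",
    "engine_b": "hel1o w0rld",
}
for name, cer in rank_engines(reference_text, ocr_outputs):
    print(f"{name}: CER = {cer:.3f}")
```

A CER of 0 means the output matches the reference exactly. Because microfiche scans vary in quality, a measure like this would likely need to be averaged over several transcribed pages before drawing conclusions about any one package.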