Tools and Techniques for Optical Character Recognition

Letter from the Director of the Digital Humanities Core: June 2025

By Linda García Merchant

June 11, 2025

When we began developing training courses for the Digital Humanities Core, we focused on the processes required to create sustainable digital scholarship. We spent less time on the tools and techniques used by projects and more time on planning, preparing, and engaging with collaborators. Now we are turning to the tools and techniques suited to specific types of work and processes, so that humanists better understand the role and function of these digital production devices.

Our first course is Textual Recovery training, designed to build an understanding of a tool and technique adapted to literary text analysis. TR113 Textual Recovery: OCR Clean up using Python is the first in a series of courses that allow a researcher to automate the clean-up of text that has been converted to machine-readable form through the Optical Character Recognition (OCR) process.

So, why is this kind of course necessary? If the tool already delivers an effective process (cleaning up text), why do we need to learn how it works? Why can’t we just use it? The OCR clean-up course isn’t just about using the clean-up tool; it is about understanding how the process works so that each user can adapt the tool to their own kinds of documents.

The OCR clean-up tool, designed by DHC Research Tech Jun Kim for use with 18th-century text, is a series of Python scripts that rely on an optimized original image to recognize characters, then render them as machine-readable text with minimal error. The key here is “minimal error,” which is why the tool is a series of Python scripts, each performing one of several tasks toward that result. One challenge of cleaning up 18th-century texts is the way the text was originally printed: each printer had its own standards of character use and a unique approach to printing. The second challenge is the age of the scanned material. Text that is centuries old shows inconsistencies in ink distribution (fading) and paper degradation.
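For readers curious what one scripted clean-up step might look like, here is a minimal sketch. It is not the DHC tool: the substitution table and the function name clean_ocr_text are hypothetical, chosen only to illustrate how a short Python script can normalize characters typical of early printing (such as the long s) once OCR has produced text.

```python
import re

# Hypothetical substitution table (an assumption, not taken from the DHC tool):
# characters common in early printing mapped to modern equivalents.
SUBSTITUTIONS = {
    "ſ": "s",    # long s
    "ﬁ": "fi",   # ligatures that OCR engines sometimes emit as single characters
    "ﬂ": "fl",
    "ﬀ": "ff",
    "ﬃ": "ffi",
    "ﬄ": "ffl",
    "ﬅ": "st",
    "ﬆ": "st",
}


def clean_ocr_text(raw: str) -> str:
    """Apply a few rule-based corrections to OCR output."""
    text = raw
    for old, new in SUBSTITUTIONS.items():
        text = text.replace(old, new)
    # Rejoin words split by a hyphen at a line break. This is naive: it also
    # joins genuine hyphenated compounds that happen to break across lines.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Trim stray whitespace at the start and end of each line.
    return "\n".join(line.strip() for line in text.split("\n"))


if __name__ == "__main__":
    sample = "The Paſ-\nſions  of the ſoul"
    print(clean_ocr_text(sample))
```

Each rule here is deliberately small and independent, which mirrors the idea of a tool built from several scripts that each handle one task. A real workflow would tune the table to the printer and period of the source.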

The goal of the course is to teach researchers how to apply the tool to scanned images and then convert them to OCR text, producing machine-readable characters that need a minimum of correction. Manual correction will still be needed, but on a document with a 10 to 15% error rate instead of a 40 to 50% error rate. This means text can be manually cleaned more quickly because fewer errors have to be fixed.
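The letter does not say how these error rates are measured. One common measure, assumed here purely for illustration, is the character error rate: the number of single-character edits needed to turn the OCR output into a hand-corrected reference, divided by the length of that reference. The function names below are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(ocr_text: str, reference: str) -> float:
    """Edits per reference character (an assumed metric, see lead-in)."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(ocr_text, reference) / len(reference)


if __name__ == "__main__":
    rate = character_error_rate("tbe cat", "the cat")
    print(round(rate, 3))  # one wrong character out of seven: 0.143
```

Under this measure, a hypothetical 2,000-character page at a 45% error rate would need about 900 corrections, while the same page at 12.5% would need about 250.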

The second goal of this course is to teach researchers how to apply the tool to any scanned image that generates a high OCR error rate. Courses like this allow the DHC to introduce the world of scripting and scripting languages to a community often unfamiliar with them. We hope that once we begin to teach humanists how languages like Python work, we can then offer more robust courses on Python for humanists.

Best,

Linda García Merchant, Ph.D.
Director, Digital Humanities Core

Division of Research
