CorpusBuilder 1.0

In mid-2017, OpenITI and SHARIAsource at Harvard Law School began collaborating on the creation of CorpusBuilder 1.0, a user-friendly OCR pipeline and post-correction interface intended to make the new developments in Arabic-script OCR (which OpenITI pioneered in collaboration with Benjamin Kiessling) accessible to humanities scholars and students. Funding for the development of CorpusBuilder 1.0 was generously provided by SHARIAsource at Harvard Law School.

Moving beyond the OCR toolbox/workflow model, CorpusBuilder 1.0 integrates the latest OCR engines (Kraken, Tesseract 4) into a single user-friendly pipeline that enables users with no technical background to OCR and post-correct their own texts without the intervention of technical experts. Because CorpusBuilder incorporates neural network-based OCR engines that utilize line segmentation, user post-correction data can also be recycled as training data to improve the OCR models.
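
To illustrate how line-level post-correction can feed back into training, here is a minimal sketch in Python. It is not CorpusBuilder's actual code or data model: the `OcrLine` record, its field names, and the `training_pairs` function are all hypothetical, and the output is a generic (image, region, text) triple, not the ground-truth format of Kraken or Tesseract.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

BBox = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in page pixels


@dataclass
class OcrLine:
    """One segmented line on a page (hypothetical record, for illustration)."""
    page_image: str                       # path or identifier of the page scan
    bbox: BBox                            # where the line sits on the page
    ocr_text: str                         # raw engine output
    corrected_text: Optional[str] = None  # stays None until a user reviews the line


def training_pairs(lines: List[OcrLine]) -> List[Tuple[str, BBox, str]]:
    """Keep only user-reviewed lines and pair each line region with its corrected text."""
    pairs = []
    for line in lines:
        if line.corrected_text is None:
            continue  # unreviewed engine output is not trusted as ground truth
        pairs.append((line.page_image, line.bbox, line.corrected_text))
    return pairs


lines = [
    OcrLine("page1.png", (0, 0, 100, 20), "helo world", "hello world"),
    OcrLine("page1.png", (0, 25, 100, 45), "second line"),  # not yet reviewed
]
print(training_pairs(lines))
```

Because every correction is tied to one segmented line, each reviewed line already has the shape of a training example: a small image region plus its verified transcription.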

The CorpusBuilder user interface (i.e., web portal) was designed with our intended user community in mind: non-technically-inclined humanists. While the web interface is built to be simple and intuitive, the backend of CorpusBuilder is built on a robust version-controlled database similar to those in source-code control systems such as Git. This database manages several types of records related to the layout of the page and the position of text on it. The version-control mechanism allows automatic processes, such as layout analysis and OCR, and human processes, such as transcription and correction, to be interleaved conveniently and without loss of data.
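
The idea of interleaving automatic and human edits without losing data can be sketched with a small append-only revision history. This is only an illustration under stated assumptions, not CorpusBuilder's actual schema: `LineRecord`, `Revision`, `PageHistory`, and the author labels are hypothetical names, and each revision stores a full snapshot for simplicity.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

BBox = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in page pixels


@dataclass(frozen=True)
class LineRecord:
    bbox: BBox  # position of the line on the page
    text: str   # transcription of the line


@dataclass(frozen=True)
class Revision:
    rev_id: int
    parent: Optional[int]         # previous revision, None for the first one
    author: str                   # e.g. "ocr:engine" or "human:editor" (hypothetical labels)
    lines: Dict[str, LineRecord]  # full snapshot, keyed by line id


class PageHistory:
    """Append-only history for one page: earlier revisions are never modified."""

    def __init__(self) -> None:
        self.revisions: List[Revision] = []

    def commit(self, author: str, changes: Dict[str, LineRecord]) -> Revision:
        parent = self.revisions[-1] if self.revisions else None
        snapshot = dict(parent.lines) if parent else {}
        snapshot.update(changes)  # apply only the lines this author touched
        rev = Revision(len(self.revisions),
                       parent.rev_id if parent else None,
                       author, snapshot)
        self.revisions.append(rev)
        return rev

    def text_at(self, rev_id: int) -> List[str]:
        rev = self.revisions[rev_id]
        # Sorting by line id stands in for reading order in this sketch.
        return [rev.lines[key].text for key in sorted(rev.lines)]


history = PageHistory()
history.commit("ocr:engine", {
    "l1": LineRecord((0, 0, 100, 20), "helo world"),
    "l2": LineRecord((0, 25, 100, 45), "second line"),
})
history.commit("human:editor", {"l1": LineRecord((0, 0, 100, 20), "hello world")})
print(history.text_at(0))  # the raw OCR output is still retrievable
print(history.text_at(1))  # the human-corrected state
```

Since a correction is recorded as a new revision instead of overwriting the old one, an automatic process can run again later and a person can still compare or recover any earlier state.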

A demo version of CorpusBuilder was completed in early 2018, at which point OpenITI (with the support of the SHARIAsource project) launched pilot CorpusBuilder-based Arabic and Persian OCR projects with JSTOR and with a group of Persian literature scholars led by Principal Investigator Miller and his colleagues, Professors Alexander Jabbari and Austin O’Malley. These pilot projects provided an important first round of testing, debugging, and user feedback, which was essential in preparing for the beta (open source) release of CorpusBuilder in March 2019.

The original development of CorpusBuilder 1.0 was funded by SHARIAsource at Harvard Law School. It is currently being expanded and enhanced through a generous grant from the Scholarly Communications Program of The Andrew W. Mellon Foundation. For more information on the transformation of CorpusBuilder 1.0 into CorpusBuilder 2.0, please visit the OpenITI AOCP project page here.



CorpusBuilder is available here and here.

The generic web portal/user interface for CorpusBuilder is available here and here.

More information on CorpusBuilder can be found here.

A detailed technical overview of CorpusBuilder is available here.