OCR Guidelines

The following is intended as a guide for users of the OpenITI CorpusBuilder. For a thorough overview of the terms governing published OCR content on OpenITI, you are encouraged to read the Terms of Use and the Privacy Policy.

Uploading your text to CorpusBuilder

1. Filling out metadata:

It is essential that you fill out the metadata boxes as accurately as possible, because this information will be used to create a URI when your text is added to the OpenITI corpus.

1. Give the full title of the text to be uploaded (multivolume works should ideally be uploaded together).

2. Under author, give the author's full name. (You may find that your author has already been added. If so, select the existing entry from the drop-down.)

3. Do the same for the editor or editors of the work.

4. Ensure you select the document type and document language, or the upload will fail.
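
If you prepare your metadata in advance (for example, in a spreadsheet or script), the checklist above can be sketched as a small pre-upload check. This is only an illustration: the field names below are hypothetical and do not correspond to a documented CorpusBuilder interface, and treating title and author as required is an assumption based on the steps above. The sample values are placeholders.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Hypothetical field names chosen for this sketch; they are not taken
# from CorpusBuilder itself.
REQUIRED_FIELDS = ("title", "author", "document_type", "language")


@dataclass
class UploadMetadata:
    title: str = ""
    author: str = ""
    editors: list[str] = field(default_factory=list)  # optional in this sketch
    document_type: str = ""
    language: str = ""


def missing_fields(meta: UploadMetadata) -> list[str]:
    """Return the names of required fields that are empty or whitespace-only."""
    return [name for name in REQUIRED_FIELDS if not getattr(meta, name).strip()]


# Placeholder values for illustration only.
meta = UploadMetadata(title="Example Title", author="Example Author", language="Arabic")
print(missing_fields(meta))
```

An empty list means every field this sketch treats as required has been filled in. A non-empty list names the boxes to complete before clicking upload.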

2. Uploading the text:

1. Once the metadata has been filled out, add the files you wish to upload. For the best results, use a high-quality PDF scan of the work.

2. Click upload. Depending on the size of the text and the number of files you have to upload, it could take up to 5 minutes for everything to upload.

3. When the upload is complete you will be prompted to select an OCR engine (Tesseract or Kraken). Select one and type in the language of the text.

4. Click save and continue. If your upload has been successful you will be told ‘Document created successfully’.

Post-correcting your text

1. Once you have uploaded your text, you will be able to post-correct it. This step is very important: OCR software is never perfect, but it will learn from the corrections that you make.

2. Go to the documents tab and click on ‘unpublished documents’.

3. Find your uploaded text in the list and click ‘view/edit’.*

* Please note: When you first open view/edit, you are unlikely to see your text. Instead, you will initially see only a loading page, because the OCR process takes time. For example, for a book of around 300 words, it will take about an hour for the OCR process to complete. The whole process runs on our servers, so if you are processing large texts you can close your browser or shut down your computer and return to the text at a later point.

4. Once your text has been processed, it will be visible to edit when you click the ‘view/edit’ button. Please proceed to correct the text.

Guide on publishing copyrighted texts

All texts uploaded to CorpusBuilder will eventually be added to online repositories for public use and data analysis. OpenITI has a duty to ensure that no content of uploaded texts is subject to copyright (for more details see our copyright policy [link]). To comply with this, all editions which are still in copyright must have their copyrighted content removed (that is, all editorial input, such as footnotes and introductory matter). You should endeavor to remove all of this material either during your scanning process or at the post-correction stage.

What does OpenITI do with your uploaded texts?

The goal of OpenITI CorpusBuilder is to build an open-access corpus of machine-readable texts for the use of scholars and for computational analysis. To this end, you are advised that all uploaded texts will be automatically published to a public online repository after a period of 3 years.
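
To estimate when a given upload would become public, the 3-year period can be sketched as simple date arithmetic. This is a rough illustration only: exactly how CorpusBuilder counts the period is not specified in this guide, so the function name and the handling of 29 February are assumptions.

```python
from datetime import date

EMBARGO_YEARS = 3  # period stated in this guide


def earliest_publication_date(upload_date: date, years: int = EMBARGO_YEARS) -> date:
    """Return the date `years` years after `upload_date`."""
    try:
        return upload_date.replace(year=upload_date.year + years)
    except ValueError:
        # 29 February has no counterpart in a non-leap year; falling back
        # to 1 March is an assumption made for this sketch.
        return date(upload_date.year + years, 3, 1)


print(earliest_publication_date(date(2024, 5, 14)))  # 2027-05-14
```

If that date is too early for your text, the exception process is the route to take, and the request should be made before uploading.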

We accept that in some exceptional cases, a text may need to remain out of the public domain for longer than 3 years, or indefinitely. In these cases, you are encouraged to apply for an exception by writing to openiti@umd.edu before uploading your text. For more details, see the ‘Exceptional User Contribution’ terms in the Terms of Use.