The written heritage of the “Islamicate” cultures that stretch from modern Bengal to Spain is as vast as it is understudied and underrepresented in the digital humanities. The sheer volume and diversity of the surviving works produced in Persian and Arabic by denizens of these lands in the premodern period make this body of texts ideal for computational forms of analysis. Efforts to utilize these new digital forms of analysis, however, have been stymied by poor OCR technology for Arabic-script languages and the lack of an open-access, standards-compliant Islamicate corpus.
The Open Islamicate Texts Initiative (OpenITI) is a multi-institutional effort to construct the first machine-actionable scholarly corpus of premodern Islamicate texts. Led by researchers at the Aga Khan University (AKU), Universität Wien (UW), and the Roshan Institute for Persian Studies at the University of Maryland (College Park), together with an interdisciplinary advisory board of leading digital humanists and Islamic, Persian, and Arabic studies scholars, OpenITI aims to develop the digital infrastructure necessary to achieve this goal, including improved Arabic-script OCR, Arabic-script standards for OCR output and text encoding, and platforms for collaborative corpus creation (e.g., CorpusBuilder). In the process, OpenITI will enable new synergies between Digital Humanities and the inter-related Islamicate fields of Islamic, Persian, and Arabic Studies.
OpenITI Development Plans
Since its founding in 2016, OpenITI's work has focused on two primary areas: (1) improvement of Arabic-script OCR, and (2) corpus building. Our work on OCR, done in collaboration with Benjamin Kiessling of Universität Leipzig, has produced some of the most accurate results to date on Arabic-script texts (see full results here). Most importantly, these results were achieved on an open-source OCR engine (Kraken) that is retrainable and can be adapted for highly specific scholarly needs. In 2017, OpenITI also began collaborating with the SHARIAsource project of Harvard Law School on the creation of a digital text production pipeline called CorpusBuilder. CorpusBuilder is a user-friendly, web-based, open-source application that allows users to upload, OCR, post-correct, annotate, and structurally tag a document. It includes robust version control (built on the git model) and an API, both critically important features that will help facilitate the collaborative model of corpus production that OpenITI champions. (For more information on CorpusBuilder 1.0, please see its project page.) OpenITI's OCR work received a tremendous boost in 2019 with an $800,000 grant from The Andrew W. Mellon Foundation for the development of CorpusBuilder 1.0 into a full digital text production pipeline and the improvement of Persian and Arabic OCR. (For more details on this project, please see its project page.)
OpenITI's second focus flows out of our OCR work: our ultimate goal is the creation of a machine-actionable and standards-compliant scholarly corpus of Persian and Arabic texts. (We sincerely hope to expand to Ottoman Turkish and Urdu texts in the near future too, as soon as funding permits.) After completing experimental Persian and Arabic corpus development projects over the course of 2015 (i.e., the OpenArabic, KITAB (Knowledge, Information, and The Arabic Book), and Persian Digital Library (PDL) projects), OpenITI team members drafted a development plan that would bring these efforts together in a single Islamicate textual corpus of approximately 10,000 texts (ca. 7,000 Arabic and 3,000 Persian). This plan calls for us to: (1) review and format existing open-access premodern Persian and Arabic texts according to the CapiTainS canonical text services (CTS) and TEI-XML standards; (2) enrich these texts with as much verified metadata as possible; and (3) after reviewing the results of the first phase, develop and execute a plan to achieve greater parity in the number, genre, and chronological coverage of both Persian and Arabic texts in the OpenITI corpus. (This need to make the existing collection of digital Persian and Arabic texts more representative of these traditions as a whole is the impetus for our work on Arabic-script OCR.)