The Open Islamicate Texts Initiative Arabic-script OCR Catalyst Project (OpenITI AOCP)
Funded through a generous grant from the Scholarly Communications Program of The Andrew W. Mellon Foundation, the Open Islamicate Texts Initiative Arabic-script OCR Catalyst Project (OpenITI AOCP) is the first grant of its kind to specifically tackle the technical and organizational barriers currently stymying the development of Arabic-script OCR and digital text production for Islamicate Studies. OpenITI AOCP is led by an interdisciplinary team of co-principal investigators in the humanities, computer science, and digital humanities from the Roshan Institute for Persian Studies at the University of Maryland (College Park), Northeastern University’s NULab for Texts, Maps, and Networks, the Aga Khan University’s Institute for the Study of Muslim Civilisations (London), and the Maryland Institute for Technology in the Humanities at the University of Maryland (College Park). We are also proud to partner with the SHARIAsource project of the Program in Islamic Law at Harvard Law School for the technical development portion of the project. For more information, please visit the OpenITI AOCP project page.
In 2017, OpenITI joined forces with the SHARIAsource project of the Program in Islamic Law at Harvard Law School to develop a robust and user-friendly OCR pipeline called CorpusBuilder. This project was generously funded by the Program in Islamic Law at Harvard Law School. Version 1.0 of CorpusBuilder was released in March 2019, and both the SHARIAsource and OpenITI projects currently use it in their corpus-building work. A generous grant from the Scholarly Communications Program of The Andrew W. Mellon Foundation is currently funding development work on CorpusBuilder 2.0. For more information on CorpusBuilder 1.0, please visit the CorpusBuilder project page. For more information on CorpusBuilder 2.0, please visit the OpenITI AOCP project page.
Projects Affiliated with OpenITI
Knowledge, Information Technology, and the Arabic Book (KITAB): funded by the European Research Council (no. 772989) and led by OpenITI co-PI Dr. Sarah Bowen Savant of the Aga Khan University-ISMC (London), the KITAB project leverages text reuse algorithms developed in collaboration with Dr. David Smith of Northeastern University to study Arabic book history and cultural memory. The KITAB project has supported, and continues to work on, the development and refinement of OpenITI’s digital infrastructure, corpus, and data standards. For more information on the KITAB project, please see the project’s website.
Digital Sira Project (DSP) at the Qatar National Library: the Digital Sira Project is working to create an online corpus and digital research pipeline for the Sira of Ibn Ishaq (d. 767). The Sira is an important and exemplary case within the early Arabic tradition of a dispersed text: no single original, complete text survives today, but rather multiple versions, in fragmentary form, scattered within hundreds of other books from the ninth century through to early modern times. Well-known witnesses to the text include Ibn Hisham’s (d. 828) commentary, which contains only two of the four original parts but is often mistakenly referred to as the Sira of Ibn Ishaq. Research questions relate, by way of example, to the manner of production, transmission, and circulation of texts from Ibn Ishaq’s lifetime to the present. The digital research pipeline relies on innovations in Optical Character Recognition (OCR), Text Reuse Detection, Data Modeling, and Data Visualisation.