AI Tools for Historical Documents in Malay
A short summary of the project’s aims:
Jawi is an adaptation of the Perso-Arabic script for writing Malay and other languages across Southeast Asia. For centuries, it served as one of the primary writing systems throughout the Malay-speaking world. From the late nineteenth century through the mid-twentieth century, Jawi was the dominant script for Malay-language journalism in Singapore and the broader region. These newspapers chronicle a transformative period in Southeast Asian history, documenting debates over nationalism, identity, and modernization. Yet despite their significance, they remain largely inaccessible to contemporary researchers.
To address this problem, we are starting with SEA-LION (Southeast Asian Languages in One Network), a family of open-weight models developed by AI Singapore and designed specifically to work with regional languages, including Malay. Through a process called supervised fine-tuning (SFT), we adjust the models' weights to suit our specific task. We also want our models to tell us when they are likely to be wrong. For this, we use a range of techniques from Explainable AI (xAI) to detect likely errors. We can then direct our human experts to passages of particular difficulty, which might, for example, contain unusual words or names.
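To make the idea of routing difficult passages to human experts concrete, here is a minimal Python sketch of one simple confidence heuristic. It is an illustration under stated assumptions, not the project's actual method: it assumes the model exposes a log-probability for each output token, and the names `Passage`, `geometric_mean_probability`, `flag_for_review`, the 0.8 threshold, and the sample data are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Passage:
    """A model output with per-token log-probabilities (hypothetical structure)."""
    passage_id: str
    tokens: list[str]
    logprobs: list[float]  # natural-log probability the model gave each output token


def geometric_mean_probability(passage: Passage) -> float:
    """Return the geometric mean of the token probabilities (0.0 if empty)."""
    if not passage.logprobs:
        return 0.0
    return math.exp(sum(passage.logprobs) / len(passage.logprobs))


def flag_for_review(passages: list[Passage], threshold: float = 0.8) -> list[tuple[str, float]]:
    """Return (passage_id, score) pairs scoring below the threshold, least confident first.

    The threshold is an assumed value; in practice it would need tuning
    against passages that experts have already checked.
    """
    scored = [(p.passage_id, geometric_mean_probability(p)) for p in passages]
    flagged = [item for item in scored if item[1] < threshold]
    return sorted(flagged, key=lambda item: item[1])


# Made-up example data: the second passage contains a personal name,
# which we assume the model is less sure about.
passages = [
    Passage("p1", ["surat", "khabar", "Melayu"], [-0.01, -0.02, -0.05]),
    Passage("p2", ["Tuan", "Syed", "Alsagoff"], [-0.1, -1.6, -0.9]),
]

flagged = flag_for_review(passages)
for passage_id, score in flagged:
    print(f"{passage_id}: confidence {score:.2f}")
```

With this made-up data, only the passage containing the name falls below the threshold and would be sent to an expert. Token probability is just one possible signal; the techniques the project draws on from xAI may differ.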
By developing reliable tools to transliterate, search through and analyze these newspapers, we make it possible to study this crucial period with unprecedented depth and scale.