Our project goal is to find and record all matching verses among the Chinese, Tibetan and Sanskrit texts of the Mahāratnakūṭa Collection.
We use the Python programming language and the Django web framework to build custom applications and interfaces for analyzing these texts in a variety of ways. First, we import Unicode texts from CBETA, rKTs, BDRC and other online repositories, and then use a statistical method to locate phrases that repeat frequently across the extant witnesses in our collection. Phrases that appear often are treated as likely candidates for matching points between texts. We decompose these phrases into their constituent n-grams and check them against dictionaries and other known information about our texts. Taken together, these processes allow our system to recommend potential alignments, which are then checked by human editors.
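
The phrase-finding step can be illustrated with a minimal Python sketch. It is not the project's actual code: the function names, the witness labels and the sample strings are hypothetical, and plain frequency counting of character n-grams stands in for the statistical method, which is not specified here. The sketch counts every run of n characters across a set of witnesses and keeps those that occur at least a given number of times, together with the witnesses in which they appear.

```python
from collections import Counter, defaultdict


def char_ngrams(text, n):
    """Yield every contiguous run of n characters in text."""
    for i in range(len(text) - n + 1):
        yield text[i:i + n]


def frequent_ngrams(witnesses, n=4, min_count=2):
    """Return n-grams that occur at least min_count times across all witnesses.

    witnesses maps a (hypothetical) witness label to its text.
    The result maps each n-gram to (total count, sorted witness labels).
    """
    counts = Counter()
    seen_in = defaultdict(set)
    for label, text in witnesses.items():
        for gram in char_ngrams(text, n):
            counts[gram] += 1
            seen_in[gram].add(label)
    return {
        gram: (count, sorted(seen_in[gram]))
        for gram, count in counts.items()
        if count >= min_count
    }


if __name__ == "__main__":
    # Toy sample strings, used only for illustration.
    sample = {
        "A": "如是我聞一時佛在王舍城",
        "B": "如是我聞一時佛住舍衛國",
    }
    for gram, (count, labels) in frequent_ngrams(sample).items():
        print(gram, count, labels)
```

Character n-grams are a natural unit for Chinese text. For Tibetan or Sanskrit one would more plausibly count syllables or words instead, which in this sketch would mean replacing `char_ngrams` with a generator over a token list.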