The Alignment Engine

We aim to create a digital toolkit for multilingual alignment of arbitrary texts, using the Chinese, Tibetan, and Sanskrit witnesses of the Mahāratnakūṭa Collection as a proof of concept.

By textual alignment we mean identifying sentences in two texts that mean the same thing. While the basic idea is straightforward, what counts as ‘the same’ is subjective and fraught with gray areas and edge cases, and describing it formally presents various challenges.

An alignment in our system consists of a string and its location in one text, together with a string and its location in another text to which it has some relationship. For the initial phase of the Open Philology project, this means in practice a set of relationships between Chinese and Tibetan texts, Chinese and Chinese texts, and Tibetan and Tibetan texts. Many supposed Sanskrit originals for these texts are available only in fragmentary form and will be added later.
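
As a rough illustration of such a record, here is a minimal sketch in Python. The class names, field names, text identifiers, and the score value are all assumptions made for this example, not the project's actual schema; placeholder English strings stand in for the Chinese and Tibetan witnesses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A string together with its location in one text (hypothetical schema)."""
    text_id: str   # identifier of the parent text or witness
    start: int     # character offset where the string begins
    end: int       # character offset just past the end of the string
    content: str   # the string itself


@dataclass(frozen=True)
class Alignment:
    """A relationship between a span in one text and a span in another."""
    source: Span
    target: Span
    relation: str  # free-form label for the kind of relationship
    score: float   # degree of alignment, assumed here to lie in [0, 1]


def make_span(text_id: str, full_text: str, start: int, end: int) -> Span:
    """Cut a span out of a text so that offsets and content always agree."""
    return Span(text_id, start, end, full_text[start:end])


# Placeholder texts and an arbitrary placeholder score, for illustration only.
text_a = "Then he went to the Buddha and bowed."
text_b = "He went to the Lord."
alignment = Alignment(
    source=make_span("text_a", text_a, 5, 26),
    target=make_span("text_b", text_b, 0, 19),
    relation="conceptual",
    score=0.8,
)
print(alignment.source.content, "<->", alignment.target.content)
```

Storing offsets alongside the string keeps each alignment traceable to its exact place in the parent text, which the positional criterion described below relies on.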

We consider two phrases to be aligned to some degree when they meet any of the following criteria:

  1. Match or partial match of literal characters
    "Buddha saw the bed" vs. "Buddha saw the bee"
  2. Similarity of relative position within parent text
    Alignment of character 54/4323 with character 135/10807 (i.e. 54/4323 ≈ 0.0125 and 135/10807 ≈ 0.0125)
  3. Similarity of conceptual meaning
    "He went to the Buddha" vs. "He went to the Lord"
  4. Similarity of narrative purpose
    "He gave the Buddha a book" vs. "He gave the Buddha a gift"

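The first three criteria above can be approximated with very simple measures. The sketch below is an assumption-laden illustration, not our scoring code: it uses Python's standard difflib for literal matching, a plain difference of relative positions, and a one-entry hypothetical concept table (CONCEPTS) standing in for a real dictionary resource. The fourth criterion, narrative purpose, is not attempted here.

```python
from difflib import SequenceMatcher

# Hypothetical stand-in for a dictionary resource: maps a term to a shared concept.
CONCEPTS = {"lord": "buddha"}


def literal_similarity(a: str, b: str) -> float:
    """Criterion 1: share of matching characters, from 0.0 to 1.0."""
    return SequenceMatcher(None, a, b).ratio()


def position_similarity(pos_a: int, len_a: int, pos_b: int, len_b: int) -> float:
    """Criterion 2: 1.0 when both positions fall at the same relative point."""
    return 1.0 - abs(pos_a / len_a - pos_b / len_b)


def concept_similarity(a: str, b: str) -> float:
    """Criterion 3: word overlap (Jaccard) after mapping words to concepts."""
    words_a = {CONCEPTS.get(w, w) for w in a.lower().split()}
    words_b = {CONCEPTS.get(w, w) for w in b.lower().split()}
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


print(literal_similarity("Buddha saw the bed", "Buddha saw the bee"))
print(position_similarity(54, 4323, 135, 10807))
print(concept_similarity("He went to the Buddha", "He went to the Lord"))
```

Each function returns a value between 0 and 1, so the separate criteria can later be combined into a single score.
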
We automatically produce and score alignments using a custom genetic algorithm that combines statistical analysis methods with traditional dictionaries and other philological resources.
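
To give a sense of what a genetic search over alignments can look like, here is a toy sketch. It is not our algorithm: it assumes a candidate is simply a list that maps each source sentence to one target sentence, scores candidates by character similarity alone (no dictionaries or other philological resources), and uses hypothetical function and parameter names throughout.

```python
import random
from difflib import SequenceMatcher


def similarity(a, b):
    """Character-level similarity between two strings, from 0.0 to 1.0."""
    return SequenceMatcher(None, a, b).ratio()


def fitness(mapping, source, target):
    """Total similarity when source[i] is aligned with target[mapping[i]]."""
    return sum(similarity(source[i], target[j]) for i, j in enumerate(mapping))


def evolve(source, target, population_size=30, generations=40,
           mutation_rate=0.2, seed=0):
    """Evolve candidate mappings; return the best one and the score history."""
    rng = random.Random(seed)
    n, m = len(source), len(target)
    population = [[rng.randrange(m) for _ in range(n)]
                  for _ in range(population_size)]
    history = []
    for _ in range(generations):
        # Rank candidates from fittest to least fit.
        population.sort(key=lambda c: fitness(c, source, target), reverse=True)
        history.append(fitness(population[0], source, target))
        survivors = population[: population_size // 2]
        children = [survivors[0][:]]  # elitism: carry the best over unchanged
        while len(children) < population_size:
            parent_a, parent_b = rng.sample(survivors, 2)
            cut = rng.randrange(1, n) if n > 1 else 0
            child = parent_a[:cut] + parent_b[cut:]  # single-point crossover
            if rng.random() < mutation_rate:
                child[rng.randrange(n)] = rng.randrange(m)  # point mutation
            children.append(child)
        population = children
    best = max(population, key=lambda c: fitness(c, source, target))
    return best, history


source = ["Buddha saw the bed", "He went to the Buddha", "He gave the Buddha a book"]
target = ["He gave the Buddha a gift", "Buddha saw the bee", "He went to the Lord"]
best, history = evolve(source, target)
for i, j in enumerate(best):
    print(source[i], "<->", target[j])
```

Because the fittest candidate is always copied into the next generation, the best score never decreases from one generation to the next. A realistic fitness function would also weigh position, meaning, and dictionary evidence.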


Our complete alignment system comprises three layers:

  1. Input layer: front-end interface accepts one or more strings for search or comparison
    Text/witness title(s)
    Text/witness and fragment
    Fragment and fragment
  2. Analysis layer: custom algorithm determines the proper route for the query and outputs the appropriate response (e.g. a list of alignments related to the input string(s))
  3. Output layer: front-end display of results
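
The routing step in the analysis layer might be sketched as follows. This is only an illustration under assumptions: the function name, its parameters, and the route labels it returns are hypothetical, chosen to mirror the three kinds of input listed above.

```python
from typing import Sequence


def route_query(titles: Sequence[str] = (), fragments: Sequence[str] = ()) -> str:
    """Pick an analysis route from the kind of input received.

    The route names returned here are hypothetical labels.
    """
    if titles and not fragments:
        return "compare_texts"             # text/witness title(s)
    if len(titles) == 1 and len(fragments) == 1:
        return "search_text_for_fragment"  # text/witness and fragment
    if not titles and len(fragments) == 2:
        return "compare_fragments"         # fragment and fragment
    raise ValueError("unsupported combination of inputs")


print(route_query(titles=["Text A", "Text B"]))
print(route_query(titles=["Text A"], fragments=["He went to the Lord"]))
print(route_query(fragments=["Buddha saw the bed", "Buddha saw the bee"]))
```

Keeping this decision in a single function means the input and output layers never need to know which analysis ran, only what kind of query was submitted.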

Technical stack: open source, composed of Ubuntu Linux, Python, and Django.

Details of the technical implementation of our software designs may be found on our software developer's GitHub page: github.com/handyc