Semantic closeness of canonical (Tipitaka) and post-canonical books in percentages
Dear friends,
This might be of interest to some of you. As you all know the "Tipitaka"
consists of various text strata. This is very obvious of course to anyone
reading and comparing the vocabulary, style and grammatical expressions used
in the Vinaya, Sutta and Abhidhamma texts. Prof. Kingsbury did a statistical
analysis on this a couple of years ago (see
So, whenever you use CST4 or similar tools to search for and compare
text snippets, you can see that certain expressions always seem to surface in
certain books, while other books contain not a single entry for that
particular word or phrase (take for instance "sabhāv*" - you won't find it
in the 4 Nikāya (for obvious reasons), but already the Milinda mentions it,
So, while Prof. Kingsbury's approach was very straightforward (but complex),
it covered only a small portion of the available books and sorted
those few into just three basic categories (early, middle, late).
Taking a much simpler approach I created the following report which you can
download (see link below). What I was interested in is to map out,
automatically, the relationship (in percentages) between all canonical and
post-canonical books. The concept is pretty simple: Quite often Pali texts
from similar text strata show a remarkable closeness in the vocabulary they
So, based on that fact I wrote a little program which extracted a-declension
nominative forms as indicators of a certain semantic text-chain from all 217
books (VRI Tipitaka edition) and compared them against each other (47089
I sorted the resulting table by percentage and uploaded it as well (see
below). Of course this has to be taken with caution and is very crude, as we
are comparing just one characteristic (nom. sing. a-decl.). However, because
this test is applied to the entire range of texts, we can still use the
percentages as a rough indicator of proximity: the higher the percentage
between two books, the more vocabulary they share. This is especially
interesting when we compare the relationships among multiple books. One could
play around with this even more, comparing other grammatical features and then
overlaying those percentages to arrive at an even stronger indicator of the
relationship between the various books.
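For anyone who wants to experiment with the idea, here is a minimal sketch in Python of the kind of comparison described above. It is not the program I uploaded, only an illustration under stated assumptions: it treats every token ending in "-o" as a candidate a-declension nominative singular (a crude proxy), and it scores two books by the share of such forms they have in common out of all forms found in either book (Jaccard overlap). The function names and the scoring formula are hypothetical choices made for this example.

```python
import re
import unicodedata
from itertools import combinations

# Matches runs of Unicode letters (no digits, no underscore).
WORD = re.compile(r"[^\W\d_]+")


def nominative_forms(text):
    """Return the set of lowercased tokens ending in 'o'.

    Assumption: masculine a-declension nominative singulars end in "-o",
    so any token ending in "o" counts as a candidate form. This is a rough
    proxy and will also catch unrelated words.
    """
    # NFC keeps letters with diacritics as single precomposed characters.
    text = unicodedata.normalize("NFC", text).lower()
    return {w for w in WORD.findall(text) if len(w) > 2 and w.endswith("o")}


def similarity_percent(forms_a, forms_b):
    """Shared forms as a percentage of all forms in either book (Jaccard)."""
    union = forms_a | forms_b
    if not union:
        return 0.0
    return 100.0 * len(forms_a & forms_b) / len(union)


def compare_books(books):
    """books: dict mapping book name -> full text.

    Returns a list of (percent, name_a, name_b), highest percentage first.
    """
    forms = {name: nominative_forms(text) for name, text in books.items()}
    rows = [
        (similarity_percent(forms[a], forms[b]), a, b)
        for a, b in combinations(sorted(forms), 2)
    ]
    rows.sort(key=lambda row: row[0], reverse=True)
    return rows
```

Calling compare_books with a dictionary of book names and their texts gives one row per pair of books, which is the shape of the sorted table. Other grammatical features could be plugged in by swapping out nominative_forms.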
However, for my purposes, this first run (which took 2 hours to complete) was
already more than enough. I guess there is a lot of information in it,
especially for the lexicographers among you, and you are welcome to
re-use the source code, which I uploaded as well.
But it is quite interesting to see which books form groups in terms of their
"semantic" (vocabulary) proximity. For instance, you will see that the 4
Nikaya share a high percentage of similarity, as expected. We can also see
that parts of the AN match the Puggalapannatti, or observe the closeness
between Nettipakarana and Petakopadesa. From here we can go through the list
and discover interesting relationships which may not have been that obvious.
So this might help some of you find the "next best book" to read / study.
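As a rough illustration of how one might look up the "next best book" from such a table, here is a small Python sketch. It assumes the table is available as rows of (percentage, book A, book B); that layout and the function name closest_books are assumptions made for this example, not the format of the uploaded report.

```python
def closest_books(rows, book, top=3):
    """Return up to `top` (percent, other_book) pairs for `book`.

    rows: iterable of (percent, name_a, name_b) tuples, one per pair of
    books (hypothetical layout, see the note above).
    """
    matches = []
    for percent, a, b in rows:
        # A book can appear on either side of a pair.
        if a == book:
            matches.append((percent, b))
        elif b == book:
            matches.append((percent, a))
    # Highest percentage first, i.e. the closest vocabulary.
    matches.sort(key=lambda m: m[0], reverse=True)
    return matches[:top]
```

Given the full table, closest_books(rows, "some book name") would list the books whose vocabulary overlaps most with the one you have just finished.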
Download the report here: