Semantic closeness of canonical (Tipitaka) and post canonical books in percentages

  • Lennart Lopin
    Message 1 of 1 , Feb 12, 2010
      Dear friends,

      This might be of interest to some of you. As you all know the "Tipitaka"
      consists of various text strata. This is very obvious of course to anyone
      reading and comparing the vocabulary, style and grammatical expressions used
      in the Vinaya, Sutta and Abhidhamma texts. Prof. Kingsbury did a statistical
      analysis on this a couple of years ago (see

      Anyone who uses CST4 or similar tools to search and compare text
      snippets will notice that certain expressions surface again and again in
      certain books, while other books contain not a single occurrence of that
      particular word or phrase (take for instance "sabhāv*": you won't find it
      in the 4 Nikāya (for obvious reasons), but already the Milinda mentions it,
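
      A wildcard search of this kind can be imitated outside CST4. The
      following is only a minimal sketch, not how CST4 works internally: the
      function name prefix_hits and the dictionary layout (book name mapped to
      plain text) are assumptions made for illustration.

```python
import re
import unicodedata

# Matches runs of Unicode letters, so Pali diacritics such as "ā" are kept.
WORD_RE = re.compile(r"[^\W\d_]+")


def prefix_hits(books, pattern):
    """Count, per book, the words matching a trailing-* pattern like 'sabhāv*'.

    books: dict mapping a book name to its text (hypothetical layout).
    """
    # Normalise to NFC so precomposed and decomposed diacritics compare equal.
    prefix = unicodedata.normalize("NFC", pattern.rstrip("*")).lower()
    counts = {}
    for name, text in books.items():
        words = WORD_RE.findall(unicodedata.normalize("NFC", text).lower())
        counts[name] = sum(1 for word in words if word.startswith(prefix))
    return counts
```

      Books whose count is zero are the ones that "contain not a single
      entry" for the searched stem.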

      So, while Prof. Kingsbury's approach was very straightforward (but complex),
      it only covered a small portion of available books and only categorized
      those few into three basic categories (early, middle, late).

      Taking a much simpler approach I created the following report which you can
      download (see link below). What I was interested in is to map out,
      automatically, the relationship (in percentages) between all canonical and
      post-canonical books. The concept is pretty simple: Pali texts from the
      same text stratum often share a remarkably similar vocabulary.

      So, based on that fact I wrote a little program which extracted a-declension
      nominative forms as indicators of a certain semantic text-chain from all 217
      books (VRI Tipitaka edition) and compared them against each other (47089
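
      The approach described here can be sketched roughly as follows. This is
      not the uploaded program, only an illustration under stated assumptions:
      a word ending in "-o" is treated as an a-declension nominative singular,
      similarity is measured as Jaccard overlap of the two form sets, and each
      unordered pair of books is compared once. The function names and the
      dictionary layout are hypothetical.

```python
import re
import unicodedata
from itertools import combinations

# Matches runs of Unicode letters, so Pali diacritics are kept inside words.
WORD_RE = re.compile(r"[^\W\d_]+")


def nominative_forms(text):
    """Return the set of words ending in 'o' (assumed nom. sing. a-decl. filter).

    Very crude: particles such as 'kho' slip through, two-letter words are skipped.
    """
    words = WORD_RE.findall(unicodedata.normalize("NFC", text).lower())
    return {w for w in words if len(w) > 2 and w.endswith("o")}


def similarity_percent(forms_a, forms_b):
    """Jaccard overlap of two form sets as a percentage (an assumed metric)."""
    union = forms_a | forms_b
    if not union:
        return 0.0
    return 100.0 * len(forms_a & forms_b) / len(union)


def compare_books(books):
    """books: dict name -> text. Returns (percent, name_a, name_b) rows, highest first."""
    forms = {name: nominative_forms(text) for name, text in books.items()}
    rows = [
        (similarity_percent(forms[a], forms[b]), a, b)
        for a, b in combinations(sorted(forms), 2)
    ]
    rows.sort(reverse=True)
    return rows
```

      Calling compare_books on a dictionary of book texts yields a table that
      is already sorted by percentage, similar in spirit to the report.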

      I sorted the resulting table by percentage and uploaded it as well (see
      below). Of course this has to be taken with caution and is very crude as we
      are just comparing one characteristic (nom. sing. a-decl). However, because
      this test is applied to the entire range of texts we can still use the
      percentages as a crude indicator of proximity. The higher the percentage
      between two books, the more vocabulary they share. This is esp. interesting
      when we compare the relationship between multiple books. One could play
      around with this even more, comparing other grammatical features and then
      overlaying those percentages to arrive at an even stronger indicator of the
      relationship between the various books.
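
      One possible way to overlay such percentages is a plain average per
      pair of books. The sketch below assumes one table per grammatical
      feature, each mapping a pair of book names to a percentage; the function
      name and the table layout are hypothetical, and an unweighted mean is
      only one of several reasonable choices.

```python
def overlay_percentages(feature_tables):
    """Average the percentages of several feature tables pair by pair.

    feature_tables: dict feature name -> dict {(book_a, book_b): percent}.
    Only pairs present in every table are kept, so each mean uses all features.
    """
    tables = list(feature_tables.values())
    if not tables:
        return {}
    # Keys shared by all tables (iterating a dict yields its keys).
    common_pairs = set(tables[0]).intersection(*tables[1:])
    return {
        pair: sum(table[pair] for table in tables) / len(tables)
        for pair in common_pairs
    }
```

      A weighted mean could replace the plain average if some features turn
      out to be more telling than others.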

      However, for my purposes, this first run (which took 2 hours to complete)
      was already more than enough. I guess there is plenty of information in
      it, especially for the lexicographers among you, and you are welcome to
      re-use the source code, which I uploaded as well.

      It is quite interesting to see which books form groups in terms of their
      "semantic" (vocabulary) proximity. For instance, you will see that the 4
      Nikāya show a high percentage of similarity, as expected. We can also see
      that parts of the AN match the Puggalapannatti, or observe the closeness
      between Nettipakarana and Petakopadesa. From here we can go through the
      list and discover interesting relationships which may not have been that
      obvious.

      So this might help some of you find the "next best book" to read / study.

      Download the report here:




      [Non-text portions of this message have been removed]