
The Dater Program

  • Peter Kirby
    Message 1 of 2 , Apr 30 5:53 PM
      THE ORIGINS OF THE APPLICATION AND THE PROBLEM

      Since the subject of the dating of the gospels has come up in recent discussion,
      I thought list members might be interested in an introduction to the
      application that I have developed over the past month or so. This application
      was developed through dialogue with Quentin David Jones, fellow Computer Science
      student and New Testament dilettante, and other members of a different
      discussion list.

      The problem that the application was developed to solve is a common one: the
      assignment of a date or dating range to a document based on merely probabilistic
      data about internal evidence and about whether certain authors cited a document.

      Specifically, we have to take account of the arguments from silence concerning
      whether or not an author may have known about the document, and we have to make
      use of vague 'citations' that indicate only a probable knowledge of the
      document.

      For example, suppose that Ephesians was written around 90 CE and that Ephesians
      makes no allusions to the Gospel of John. And suppose that there is a 10%
      chance that the author of Ephesians would have cited the Gospel of John (or
      shown dependence on John) given that the author knew about the Gospel of John.
      The effect of this data, all else being equal, should be a very slight
      preference for a date of John that is after 90 CE.

      Or, for example, suppose that Ignatius of Antioch wrote around 110 CE. Further
      suppose that we have weighed the evidence for Johannine allusions and come to
      accept a 30% probability that Ignatius was dependent on John. The effect of
      this data, all else being equal, should be a slight preference for a date of
      John that is prior to 110 CE.

      Note that in neither case does the data point allow us to come to a firm
      conclusion on the terminus a quo or the terminus ad quem of the Gospel of John.
      We do not need any complicated algorithm in order to determine the latest
      possible date and the earliest possible date for a document. We do need a
      mathematical procedure, however, for quantifying the probabilities that a
      document was written in any given year within the range established by the
      terminus a quo and the terminus ad quem.

      THE DATA INVOLVED IN THE PROBLEM

      Here are the variables involved for each author who may have cited the document:

      1. C, the chance that the author cited the document in question. Using more
      appropriate language, C is the chance that the author was dependent on the
      document in question in the author's extant writings.

      2. CgivenK, the chance that the author would have cited the document given that
      the author knew about the document. This variable has to be guessed based on
      the interests of the author and the extent of the author's writings.

      3. Circulation, the number of years that we would expect it to take typically
      for the document to circulate to the author in question. The author has a
      higher chance of knowing about the document if it was written before the
      circulation period than if it was written after the circulation period. (This
      circulation period is sometimes called the "overlap," although the original
      reason for calling it this has become obsolete. Originally, I had assigned a
      100% chance that the author would have known about the document given that it
      was written before the circulation period. If the author knew about the
      document, it was written any time before him including the "overlap" period; if
      the author did not know the document, it was written any time after the
      beginning of the "overlap" period; in the "overlap" period, the author may or
      may not have known of the document.)

      4. KgivenD, the chance that the author would have known about the document
      given that the author wrote during the document's circulation period.

      5. KgivenB, the chance that the author would have known about the document
      given that the author wrote after the document's circulation period.

      6. Date, the estimated year in which the author wrote.

      In addition to these variables, we have some constants.

      1. KgivenA, which is 0. If the document came after the author, the author
      doesn't know it.

      2. CgivenNotK, which is 0. If the author doesn't know it, the author doesn't
      cite it.

      3. NotKgivenC, which is 0. If the author cites it, the author knows of it.
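As a rough sketch (not taken from the dater source), the per-author variables and constants listed above could be grouped in C++ like this. The names AuthorData, isValid and kGivenPeriod are hypothetical, and treating a document written in the same year as the author as "after" is an assumption made only for illustration.

```cpp
#include <cassert>

// Hypothetical container for the six per-author variables described above.
struct AuthorData {
    double c;            // C: chance the author was dependent on the document
    double cGivenK;      // CgivenK: chance of citing, given knowledge
    int    circulation;  // Circulation: years for the document to reach the author
    double kGivenD;      // KgivenD: chance of knowing, document written during circulation
    double kGivenB;      // KgivenB: chance of knowing, document written before circulation
    int    date;         // Date: estimated year in which the author wrote
};

// The three constants from the text.
const double K_GIVEN_A     = 0.0;  // document came after the author
const double C_GIVEN_NOT_K = 0.0;  // cannot cite what is not known
const double NOT_K_GIVEN_C = 0.0;  // citing implies knowing

// True when every probability lies in [0, 1] and the circulation is not negative.
bool isValid(const AuthorData& a) {
    auto inUnit = [](double p) { return p >= 0.0 && p <= 1.0; };
    return inUnit(a.c) && inUnit(a.cGivenK) && inUnit(a.kGivenD) &&
           inUnit(a.kGivenB) && a.circulation >= 0;
}

// Chance that the author knew the document, given the year it was written.
// Assumption: a document written in the author's own year counts as "after".
double kGivenPeriod(const AuthorData& a, int documentYear) {
    assert(isValid(a));
    if (documentYear >= a.date) return K_GIVEN_A;                  // after the author
    if (documentYear >= a.date - a.circulation) return a.kGivenD;  // during circulation
    return a.kGivenB;                                              // before circulation
}
```

For instance, an author writing in 90 with a circulation of 10 years would get KgivenB for a document dated 75 and KgivenD for one dated 85.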

      There are also variables and constants involved in each of the pieces of
      internal evidence. This exposition of the dater program focuses on the authors,
      which are the more complex points of data. It is hoped that the interested
      reader can look at the source code for the application concerning the internal
      evidence and ask any questions they may have about it.

      THE IMPORTANCE OF BAYES'S THEOREM

      The first major breakthrough in the development of the application came when I
      discovered the relevance of Bayes's Theorem.

      According to Larsen and Marx in _An Introduction to Mathematical Statistics and
      Its Applications_ (Prentice-Hall 1986, 2nd ed.), Bayes's theorem is defined as
      follows (p. 56, mathematical notation paraphrased with subscripts in []):

      "Let {A [i]} with i from i = 1 to n be a set of n events, each with a positive
      probability, that partition S in such a way that the union of A [i] with i from
      1 to n = S and A [i] intersect A [j] = empty set for i != j. For any event B
      (also defined on S), where P(B) > 0,

      P( A [j] | B ) = P( B | A [j] ) * P( A [j] ) / the sum of P( B | A[i] ) * P( A
      [i] ) with i from i=1 to n,

      for any 1 <= j <= n."

      Let me explain what the symbols usually mean in English. The events A [1], A
      [2], ... A [n] represent n different hypotheses, the actual probability of which
      we would like to determine. The values of P( A[1] ), P( A[2] ) ... P( A [n] )
      are assigned based on some prior knowledge about the likelihood of these
      hypotheses -- or, perhaps, evenly distributed to each hypothesis, giving each a
      probability of 1/n.

      The event B represents an "outcome" or "experiment result" or "data point." We
      know the probability that this outcome would show up given each individual
      hypothesis. So we wish to determine the probability of the hypothesis given
      that the outcome occurred. And this is what the formula in Bayes's theorem
      provides.
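To make the formula concrete, here is a minimal C++ sketch of Bayes's theorem for n hypotheses. The function name bayesPosterior and the use of std::vector are illustrative choices, not part of the dater program.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// prior[i]      = P( A[i] ), the background probability of hypothesis i
// likelihood[i] = P( B | A[i] ), the chance of the outcome under hypothesis i
// returns         P( A[i] | B ) for every i
std::vector<double> bayesPosterior(const std::vector<double>& prior,
                                   const std::vector<double>& likelihood) {
    assert(prior.size() == likelihood.size());

    // Denominator of the theorem: the sum of P( B | A[i] ) * P( A[i] ), i.e. P( B ).
    double evidence = 0.0;
    for (std::size_t i = 0; i < prior.size(); ++i)
        evidence += likelihood[i] * prior[i];
    assert(evidence > 0.0);  // the theorem requires P( B ) > 0

    std::vector<double> posterior(prior.size());
    for (std::size_t i = 0; i < prior.size(); ++i)
        posterior[i] = likelihood[i] * prior[i] / evidence;
    return posterior;
}
```

As a quick check with made-up numbers, priors of .25 and .75 with likelihoods of .8 and .4 give a denominator of .2 + .3 = .5 and posteriors of .4 and .6.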

      If a discussion in the abstract is confusing, perhaps playing with some concrete
      values would help wrap your mind around the concept. This page will let you use
      a Bayes's Theorem calculator assuming you have some flavor of Java installed.

      http://members.aol.com/johnp71/bayes.html

      Let us return to the example of Ephesians.

      Let us assume that the prior probability that the author knew of the document
      and the prior probability that the author didn't know of the document are both
      50%.

      I had assigned a 0% chanceCited and a 30% chanceWouldHaveCited (a different
      illustrative value from the 10% used in the opening example), which I
      understood to mean P( Cited | Knew ), or the chance that the author would have
      cited the document given that the author knew of the document. I will say that
      C means that the author cited the document and that K means that the author knew
      of the document. This means the following:

      Hypotheses: K, ~K
      Outcomes: C, ~C

      P(K) = .5
      P(~K) = .5
      P(C) = 0
      P(~C) = 1

      P( C | K ) = .3

      We can deduce:

      P( ~C | K ) = .7

      And common sense says:

      P( C | ~K ) = 0
      P( ~C | ~K ) = 1

      Or, in other words, if he doesn't know of it, he couldn't have cited it.

      Now Bayes's Theorem says:

      P( K | ~C ) = P( ~C | K ) * P( K ) / ( P( ~C | ~K ) * P( ~K ) + P( ~C | K ) *
      P( K ) )
      P( K | ~C ) = .7 * .5 / ( 1 * .5 + .7 * .5 ) = .35 / .85 = .4117647

      P( ~K | ~C ) = P( ~C | ~K ) * P( ~K ) / ( P( ~C | ~K ) * P( ~K ) + P( ~C | K ) *
      P( K ) )
      P( ~K | ~C ) = 1 * .5 / ( 1 * .5 + .7 * .5 ) = .5 / .85 = .5882353

      And a quick sanity check verifies that P( K | ~C ) + P( ~K | ~C ) = 1.

      This means that, for Ephesians, given that there is a 0% chance that the author
      cited the document and given that there is a 30% chance that the author would
      have cited the document if the author knew of the document, and given even
      background probabilities for the chances that the author knew or didn't know of
      the document, there is a resulting 58.82% chance that the author didn't know of
      the document and a 41.18% chance that the author did know of the document.

      And this coincides with our intuitions that Ephesians should provide a slight
      bias towards dates after the time of writing.
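The Ephesians arithmetic can be reproduced with a few lines of C++. This is only a sketch of the two-hypothesis case worked above; the function name probKnewGivenNotCited is made up for illustration, and it builds in the common-sense value P( ~C | ~K ) = 1.

```cpp
#include <cassert>

// P( K | ~C ): chance the author knew the document given that he did not cite it.
// priorK  = P( K ), the background chance that the author knew the document
// cGivenK = P( C | K ), the chance he would have cited it if he knew it
double probKnewGivenNotCited(double priorK, double cGivenK) {
    assert(priorK >= 0.0 && priorK <= 1.0);
    assert(cGivenK >= 0.0 && cGivenK <= 1.0);

    double notCitedAndKnew    = (1.0 - cGivenK) * priorK;  // P( ~C | K ) * P( K )
    double notCitedAndNotKnew = 1.0 * (1.0 - priorK);      // P( ~C | ~K ) * P( ~K )
    double pNotCited = notCitedAndKnew + notCitedAndNotKnew;
    assert(pNotCited > 0.0);  // fails only when priorK = 1 and cGivenK = 1

    return notCitedAndKnew / pNotCited;
}
```

Calling probKnewGivenNotCited( .5, .3 ) evaluates .35 / .85, the .4118 figure derived above; lowering the prior for K lowers the result, which is why the choice of background probability matters.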

      However, this example with Ephesians is not a complete algorithm; it is only an
      explanation of the importance of Bayes's Theorem. In actuality, the background
      probabilities for whether the author knew about the document are not even; they
      depend on the date at which the author wrote relative to the terminus a quo and
      terminus ad quem. A complete algorithm for applying the data of each author is
      explained in the next section.

      THE HEART OF THE ALGORITHM

      Hypotheses: Document before Overlap (B), Document during Overlap (D),
      Document after Overlap (A)
      Outcomes: Author knows document (K), Author does not know document (~K)

      I have assumed that Ephesians was written in 90 CE. I have assumed that the
      'circulation' equals 10 years. I have assumed that, given just a terminus a
      quo and a terminus ad quem, all dates in this range have equal background
      probability.

      Let us assume that the Epistle to the Ephesians was the first document to be
      entered into our algorithm. In this case, we would have the following
      background probabilities, given a terminus a quo of 70 and a terminus ad
      quem of 170 -- which I am using here for mathematical simplicity; other,
      more precise values could be agreed upon.

      P( B ) = 10 / 100 = .1
      P( D ) = 10 / 100 = .1
      P( A ) = 80 / 100 = .8
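The background probabilities for B, D, and A follow directly from the date range once every year is given equal weight. A possible C++ sketch is below; the names PeriodPriors and uniformPeriodPriors are hypothetical, and clamping the period boundaries to the terminus a quo and terminus ad quem is an assumption about how an author near the edge of the range would be handled.

```cpp
#include <algorithm>
#include <cassert>

// Background probabilities that the document was written before the
// circulation period (B), during it (D), or after the author wrote (A).
struct PeriodPriors {
    double before;
    double during;
    double after;
};

// Assumes every year between quo and adQuem is equally likely a priori.
PeriodPriors uniformPeriodPriors(int quo, int adQuem, int authorDate, int circulation) {
    assert(quo < adQuem);
    assert(circulation >= 0);

    // Keep both period boundaries inside the allowed date range.
    int afterStart  = std::max(quo, std::min(adQuem, authorDate));
    int duringStart = std::max(quo, std::min(adQuem, authorDate - circulation));

    const double total = static_cast<double>(adQuem - quo);
    PeriodPriors p;
    p.before = (duringStart - quo) / total;
    p.during = (afterStart - duringStart) / total;
    p.after  = (adQuem - afterStart) / total;
    return p;
}
```

With a range of 70 to 170, an author at 90 and a circulation of 10, this yields the 10/100, 10/100 and 80/100 split shown above.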

      Now, let us assign a value to the probability that the author would have known
      about the document given that the document was written more than 'circulation'
      years before. (Note again that I am picking values just for example.)

      P( K | B ) = .8

      Obviously, we can deduce:

      P( ~K | B ) = .2

      Common sense tells us that the chance that the author knew of the document given
      that the document was written after the author is zero.

      P( K | A ) = 0

      And common sense also says that the chance that the author didn't know of the
      document given that the document was written after the author is one.

      P( ~K | A ) = 1

      The difficult case is determining the probability that the author knew of the
      document given that the document was written between 'circulation' and 0 years
      before the author wrote. Obviously, there is some chance that the author knew
      of the document even if it was written during these years. But this chance is
      just as obviously less than the chance that the author would have known about
      the document given that it was written 'circulation' years or more before. Out
      of simplicity, I decide to halve the chanceWouldHaveKnown for the "overlap"
      period.

      P( K | D ) = .4

      And from this we can deduce:

      P( ~K | D ) = .6

      Now we can punch this into a handy Bayesian calculator found here. (You may
      wish to write out the formulas if you're unsure of the calculator.)

      http://members.aol.com/johnp71/bayes.html

      P( B | K ) = .667
      P( D | K ) = .333
      P( A | K ) = 0

      P( B | ~K ) = .023
      P( D | ~K ) = .068
      P( A | ~K ) = .909

      This means:

      If the document was known to the author, there is a 66.7% chance that the
      document was written more than 'circulation' years before (70-80 CE) and a 33.3%
      chance that the document was written less than 'circulation' years before (and
      yet before, 80-90 CE).

      If the document was not known to the author, there is a 2.3% chance that the
      document was written 70-80 CE, a 6.8% chance that the document was written
      80-90 CE, and a 90.9% chance that the document was written 90-170 CE.

      Now I can reveal the total algorithm:

      Start with background probabilities for B, D, and A. In the case of Ephesians,
      these are .1, .1, and .8.

      Use these background probabilities, P( K | B ), P( K | D ), P( K | A ), and
      Bayes's theorem to determine P( B | K ), P( D | K ), P( A | K ), P( B | ~K ),
      P( D | ~K ), and P( A | ~K ). For Ephesians, these values are listed above.

      Now use these values and the following formula in order to determine the
      background probabilities for K and ~K. The formula is a corollary of the
      definition of P( A | B ):

      "P( A | B ) = P( A ^ B ) / P( B ), where ^ means intersection (both A and B occur)"
      "P( A ^ B ) = P( A | B ) * P( B )"

      So we get the following values:

      P( K ^ B ) = P( K | B ) * P( B ) = .8 * .1 = .08
      P( K ^ D ) = P( K | D ) * P( D ) = .4 * .1 = .04
      P( K ^ A ) = P( K | A ) * P( A ) = 0 * .8 = 0

      P( ~K ^ B ) = P( ~K | B ) * P( B ) = .2 * .1 = .02
      P( ~K ^ D ) = P( ~K | D ) * P( D ) = .6 * .1 = .06
      P( ~K ^ A ) = P( ~K | A ) * P( A ) = 1 * .8 = .8000

      To determine background probabilities for K and ~K:

      P( K ) = P( K ^ B ) + P( K ^ D ) + P( K ^ A ) = .08 + .04 + 0 = 0.12
      P( ~K ) = P( ~K ^ B ) + P( ~K ^ D ) + P( ~K ^ A ) = .02 + .06 + .8000 = .88

      And the sum check works. Great!

      Use these background probabilities for K and ~K, P( C | K ), and Bayes's
      Theorem to determine P( K | ~C ) and P( ~K | ~C ). Note that P( K | C ) = 1
      and P( ~K | C ) = 0.

      Remember that before, at this step, we settled for the default P( K ) of .5.
      Now we have applied the background knowledge about the probabilities for B,
      D, and A in order to make a much better estimate that P( K ) = .12. In other
      words, given that there were only twenty years between the earliest possible
      date for the document and Ephesians, yet that there are eighty years between
      Ephesians and the latest possible date for the document, it is most likely
      that the author of Ephesians did not know of the document, pending further
      evidence. That further evidence is P( ~C ), which in this case just serves to
      reinforce the background probability.

      We assume that P( C | K ) = .3
      From this we deduce that P( ~C | K ) = .7

      And then the obvious ones:

      P( C | ~K ) = 0
      P( ~C | ~K ) = 1


      Now using Bayes's Theorem:

      P( K | ~C ) = P( ~C | K ) * P( K ) / ( P( ~C | ~K ) * P( ~K ) + P( ~C | K ) *
      P( K ) )

      P( K | ~C ) = .7 * .12 / ( 1 * .88 + .7 * .12 ) = .084 / .964 = .087

      P( ~K | ~C ) = P( ~C | ~K ) * P( ~K ) / ( P( ~C | ~K ) * P( ~K ) + P( ~C | K ) *
      P( K ) )

      P( ~K | ~C ) = 1 * .88 / ( 1 * .88 + .7 * .12 ) = .88 / .964 = .913

      And we can verify that .087 + .913 = 1

      Use P( ~C ) and P( K | ~C ) to determine P( K ^ ~C ) with the formula P( K ^
      ~C ) = P( K | ~C ) * P( ~C ) = .087 * 1 = .087, where ^ is the intersection symbol.

      Use P( ~C ) and P( ~K | ~C ) to determine P( ~K ^ ~C ) with the formula P( ~K ^
      ~C ) = P( ~K | ~C ) * P( ~C ) = .913 * 1 = .913.

      Use P( C ) and P( K | C ) to determine P( K ^ C ) with the formula P( K ^
      C ) = P( K | C ) * P( C ) = 1 * P( C ) = P( C ) = 0.

      Use P( C ) and P( ~K | C ) to determine P( ~K ^ C ) with the formula P( ~K ^ C )
      = P( ~K | C ) * P( C ) = 0 * P( C ) = 0.

      OK so you know the probabilities for all combinations of K and C. Now
      determine P( K ) like so:

      P( K ) = P( K ^ ~C ) + P( K ^ C ) = P( K ^ ~C ) + P( C ) = .087 + 0 = .087.

      And determine P( ~K ) like so:

      P( ~K ) = P( ~K ^ ~C ) + P( ~K ^ C ) = P( ~K ^ ~C ) = .913.

      And we can check that .087 + .913 = 1

      OK so now you know the probability of K and the probability of ~K.

      Now recall that we computed these values many lines ago.

      P( B | K ) = .667
      P( D | K ) = .333
      P( A | K ) = 0

      P( B | ~K ) = .023
      P( D | ~K ) = .068
      P( A | ~K ) = .909

      We can now use the formulas to find all combinations of B, D, and A with K
      and ~K.

      P( B ^ K ) = P( B | K ) * P( K ) = .667 * .087 = .058029
      P( D ^ K ) = P( D | K ) * P( K ) = .333 * .087 = .028971
      P( A ^ K ) = P( A | K ) * P( K ) = 0 * .087 = 0

      P( B ^ ~K ) = P( B | ~K ) * P( ~K ) = .023 * .913 = .020999
      P( D ^ ~K ) = P( D | ~K ) * P( ~K ) = .068 * .913 = .062084
      P( A ^ ~K ) = P( A | ~K ) * P( ~K ) = .909 * .913 = .829917

      And we can sum up to find the probabilities of B, D, and A.

      P( B ) = P( B ^ K ) + P( B ^ ~K ) = .058029 + .020999 = .079028
      P( D ) = P( D ^ K ) + P( D ^ ~K ) = .028971 + .062084 = .091055
      P( A ) = P( A ^ K ) + P( A ^ ~K ) = 0 + .829917 = .829917

      And we can verify that P( B ) + P( D ) + P( A ) = .079028 + .091055 +
      .829917 = 1

      The results are intuitively correct, as the presence of Ephesians introduces a
      slight bias towards dates after Ephesians and a slight preference for the
      circulation period against the pre-circulation period. The results are
      theoretically correct as explained throughout this section.
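The steps of this section can be strung together in one function. What follows is a sketch of the procedure as described here, not the code from the dater source; the names Periods and updateForAuthor are hypothetical. It keeps full precision throughout, so its results differ in the third or fourth decimal place from the figures above, which were rounded to three places at intermediate steps.

```cpp
#include <cassert>

// Probabilities that the document was written before the circulation
// period (b), during it (d), or after the author wrote (a).
struct Periods {
    double b;
    double d;
    double a;
};

// Applies one author's data to the background probabilities for B, D, A.
// c is P( C ), the chance the author cited the document.
Periods updateForAuthor(const Periods& prior, double kGivenB, double kGivenD,
                        double cGivenK, double c) {
    const double kGivenA = 0.0;  // constant: a later document cannot be known

    // Joint probabilities P( K ^ X ) and P( ~K ^ X ) for X in {B, D, A}.
    double kB = kGivenB * prior.b, kD = kGivenD * prior.d, kA = kGivenA * prior.a;
    double nB = prior.b - kB,      nD = prior.d - kD,      nA = prior.a - kA;

    // Background probabilities for K and ~K.
    double pK = kB + kD + kA;
    double pNotK = nB + nD + nA;
    assert(pK > 0.0 && pNotK > 0.0);  // this sketch does not handle degenerate priors

    // Bayes's theorem: P( K | ~C ), using P( ~C | ~K ) = 1.
    double kGivenNotC = (1.0 - cGivenK) * pK / ((1.0 - cGivenK) * pK + pNotK);

    // Revised P( K ) = P( K | ~C ) * P( ~C ) + P( K | C ) * P( C ), with P( K | C ) = 1.
    double newK = kGivenNotC * (1.0 - c) + c;
    double newNotK = 1.0 - newK;

    // P( X ) = P( X | K ) * P( K ) + P( X | ~K ) * P( ~K ).
    Periods result;
    result.b = kB / pK * newK + nB / pNotK * newNotK;
    result.d = kD / pK * newK + nD / pNotK * newNotK;
    result.a = kA / pK * newK + nA / pNotK * newNotK;
    return result;
}
```

For the Ephesians inputs (priors .1, .1, .8; P( K | B ) = .8; P( K | D ) = .4; P( C | K ) = .3; P( C ) = 0) this gives roughly .079, .091 and .830 for B, D, and A.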

      MULTIPLYING PROBABILITIES

      Now we know how to go from background probabilities for B, D, and A combined
      with information on any given author in order to form new probabilities for B,
      D, and A. How do we use this knowledge to form a complete algorithm that
      calculates the values for B, D, and A after the data from several authors is
      applied?

      The problem is combining this information in order to find the probability that
      the document was written in a certain year given that we have data for multiple
      authors. That is, we want to know P( Yn | E ), where E is the set of all
      evidence and authors.

      And the version of Bayes's rule that treats the pieces of evidence as
      independent of one another comes in handy here. It says:

      P( Yn | E1, E2, E3, ... ) = ( P( Yn ) * P( E1 | Yn ) * P( E2 | Yn ) * P( E3
      | Yn ) ... ) / ( P( E1 ) * P( E2 ) * P( E3 ) ... )

      But how do we find P( E1 | Yn )? Well, if we apply the definition of P( A |
      B ) = P( A, B ) / P( B ), we get:

      P( E1 | Yn ) = P( E1, Yn ) / P( Yn )

      And if we apply the corollary to the definition of P( A | B ), which states
      that P( A, B ) = P( A | B ) * P( B ), we get:

      P( Yn, E1 ) = P( Yn | E1 ) * P( E1 )

      And when we substitute that in:

      P( E1 | Yn ) = P( Yn | E1 ) * P( E1 ) / P( Yn )

      And when we substitute that into the independent version of Bayes' rule:

      P( Yn | E1, E2, E3, ... ) = ( P( Yn ) * P( Yn | E1 ) * P( E1 ) / P( Yn )
      * P( Yn | E2 ) * P( E2 ) / P( Yn ) * P( Yn | E3 ) * P( E3 ) / P( Yn )
      ... ) / ( P( E1 ) * P( E2 ) * P( E3 ) ... )

      And when we simplify:

      P( Yn | E1, E2, E3, ... ) = ( P( Yn ) * P( Yn | E1 ) * P( Yn | E2 ) * P( Yn
      | E3 ) ... ) / ( P( Yn ) * P( Yn ) * P( Yn ) ... )

      When the background probability P( Yn ) is the same for every year, as
      assumed above, the denominator is the same for every year and acts as a
      "normalizing constant," i.e., it is in the formula to make sure that
      everything sums up to 1. This means that we can ignore it for a while, and
      then make sure that things sum up to 1 when we are done. With all that
      knowledge, here is the code for combining the data to find the new
      probability that the document was written in each year.

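      // Combines the background probability of each year with the per-year
      // values from every author and every piece of internal evidence, then
      // rescales so that the results sum to 1. The arrays and counts used here
      // (years, newyears, authors, evidences, numYears, quo, ...) are defined
      // elsewhere in the full source.
      void calcNewYears( void )
      {
          double sum = 0;
          double normalizingConstant;

          int y, a, e; // year, author, evidence

          for ( y = 0; y < numYears; y++ )
          {
              // start from the background probability for this year
              newyears[ y ] = years[ y ];

              // multiply in the value each author contributes for this year
              for ( a = 0; a < numAuthors; a++ )
                  newyears[ y ] *= authors[ a ].getYear( y, quo, numYears );

              // multiply in the value each piece of internal evidence contributes
              for ( e = 0; e < numEvidences; e++ )
                  newyears[ y ] *= evidences[ e ].getYear( y, quo, numYears );

              sum += newyears[ y ];
          }

          // rescale so that the probabilities over all years sum to 1
          normalizingConstant = 1 / sum;

          for ( y = 0; y < numYears; y++ )
              newyears[ y ] *= normalizingConstant;
      }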
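Because the routine above relies on arrays and classes that are only in the full source, here is a self-contained variant of the same multiply-then-normalize idea for readers who want to experiment. The function name combineYears and the plain-vector interface are assumptions; each inner vector stands in for whatever an author's or evidence item's getYear would return for each year.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// prior[y]      : background probability for year y
// factors[s][y] : per-year value contributed by source s (an author or a
//                 piece of internal evidence)
// returns       : combined probabilities over the years, rescaled to sum to 1
std::vector<double> combineYears(const std::vector<double>& prior,
                                 const std::vector<std::vector<double>>& factors) {
    std::vector<double> result(prior);
    double sum = 0.0;

    for (std::size_t y = 0; y < result.size(); ++y) {
        // multiply the background value by every source's value for this year
        for (const std::vector<double>& f : factors) {
            assert(f.size() == prior.size());
            result[y] *= f[y];
        }
        sum += result[y];
    }

    assert(sum > 0.0);  // at least one year must remain possible

    // apply the normalizing constant
    for (double& p : result)
        p /= sum;
    return result;
}
```

With four equally likely years and two sources contributing { .1, .2, .3, .4 } and { .4, .3, .2, .1 }, the products are .01, .015, .015, .01, which normalize to .2, .3, .3, .2.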

      The whole source code and the executable program are found here:

      http://home.earthlink.net/~kirby/dater4.html

      You might want to play with the executable a little. The answers it gives
      seem approximately correct. That's good, because the algorithm seems to be
      theoretically correct, as described in this section on the independent version
      of Bayes' Rule and the previous sections on Bayes' Theorem.

      APPLICATION

      The application described in this article has uses outside the New Testament
      world and should be of interest to all historians involved in determining the
      time of writing for documents. Someone even suggested to me that it could be
      adapted to phylogenetic reconstruction, although I don't know much about that.
      In a separate article, I will describe the data that could be used in
      determining a dating range for the Gospel of John.

      QUESTION

      Now that you have read an exposition of the dater program, do you think that a
      paper based on this idea would be accepted for publication by some scholarly
      journal? If it makes a difference, I am an undergraduate of computer science.

      thanks,
      Peter Kirby
    • bjtraff
      Message 2 of 2 , Apr 30 8:53 PM
        Hello Peter

        I find your idea intriguing. If I may, I would like to know if you
        have tested your formula with documents which already have relatively
        known and fixed dates? Thus, for example, rather than using disputed
        and disputable dated documents like Ephesians and GJohn, what were
        your results when testing, say Paul's letter to the Romans, and 2nd
        Century documents from Ignatius, Polycarp, Justin Martyr or Papias?
        Did your formula help to establish probable dates for Romans?

        I would be interested in how it worked for other documents as well of
        course, especially as texts with already relatively fixed dates
        should serve as a reasonable test for validity of the theory as a
        whole.

        Thank you again,

        Brian Trafford
        Calgary, AB, Canada