The Dater Program

Message 1 of 2 , Apr 30 5:53 PM
THE ORIGINS OF THE APPLICATION AND THE PROBLEM

Since the subject of the dating of the gospels has come up in recent discussion,
I thought it might be interesting to the list members for me to introduce the
application that I have developed over the past month or so. This application
was developed through dialogue with Quentin David Jones, fellow Computer Science
student and New Testament dilettante, and other members of a different
discussion list.

The problem that the application was developed to solve is a common one: the
assignment of a date or dating range to a document based on merely probabilistic
data about internal evidence and about whether certain authors cited a document.

Specifically, we have to take account of the arguments from silence concerning
whether or not an author may have known about the document, and we have to make
use of vague 'citations' that indicate only a probable knowledge of the
document.

For example, suppose that Ephesians was written around 90 CE and that Ephesians
makes no allusions to the Gospel of John. And suppose that there is a 10%
chance that the author of Ephesians would have cited the Gospel of John (or
shown dependence on John) given that the author knew about the Gospel of John.
The effect of this data, all else being equal, should be a very slight
preference for a date of John that is after 90 CE.

Or, for example, suppose that Ignatius of Antioch wrote around 110 CE. Further
suppose that we have weighed the evidence for Johannine allusions and come to
accept a 30% probability that Ignatius was dependent on John. The effect of
this data, all else being equal, should be a slight preference for a date of
John that is prior to 110 CE.

Note that in neither case does the point of data allow us to come to a firm
conclusion on the terminus a quo or the terminus ad quem of the Gospel of John.
We do not need any complicated algorithm in order to determine the latest
possible date and the earliest possible date for a document. We do need a
mathematical procedure, however, for quantifying the probabilities that a
document was written in any given year within the range established by the
terminus a quo and the terminus ad quem.

THE DATA INVOLVED IN THE PROBLEM

Here are the variables involved for each author who may have cited the document:

1. C, the chance that the author cited the document in question. Using more
appropriate language, C is the chance that the author was dependent on the
document in question in the author's extant writings.

2. CgivenK, the chance that the author would have cited the document given that
the author knew about the document. This variable has to be guessed based on
the interests of the author and the extent of the author's writings.

3. Circulation, the number of years that we would expect it to take typically
for the document to circulate to the author in question. The author has a
higher chance of knowing about the document if it was written before the
circulation period than if it was written after the circulation period. (This
circulation period is sometimes called the "overlap," although the original
reason for calling it this has become obsolete. Originally, I had assigned a
100% chance that the author would have known about the document given that it
was written before the circulation period. If the author knew about the
document, it was written any time before him including the "overlap" period; if
the author did not know the document, it was written any time after the
beginning of the "overlap" period; in the "overlap" period, the author may or
may not have known of the document.)

4. KgivenD, the chance that the author would have known about the document
given that the author wrote during the document's circulation period.

5. KgivenB, the chance that the author would have known about the document
given that the author wrote after the document's circulation period.

6. Date, the estimated year in which the author wrote.

In addition to these variables, we have some constants.

1. KgivenA, which is 0. If the document came after the author, the author
doesn't know it.

2. CgivenNotK, which is 0. If the author doesn't know it, the author doesn't
cite it.

3. NotKgivenC, which is 0. If the author cites it, the author knows of it.
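
As a minimal sketch, the per-author variables and the three constants could be
grouped as follows in C++. The struct name, the field names, and the two helper
functions are hypothetical; they are not taken from the dater program's source,
which may organize this data differently.

```cpp
#include <cassert>

// Hypothetical record for one author; each field mirrors a variable listed above.
struct AuthorData {
    double c;            // C: chance the author's extant writings depend on the document
    double cGivenK;      // CgivenK: chance of citing, given the author knew the document
    int    circulation;  // Circulation: years expected for the document to reach the author
    double kGivenD;      // KgivenD: chance of knowing, author wrote during the circulation period
    double kGivenB;      // KgivenB: chance of knowing, author wrote after the circulation period
    int    date;         // Date: estimated year in which the author wrote
};

// The three constants from the text.
constexpr double K_GIVEN_A     = 0.0;  // document later than the author: author cannot know it
constexpr double C_GIVEN_NOT_K = 0.0;  // author does not know it: author cannot cite it
constexpr double NOT_K_GIVEN_C = 0.0;  // author cites it: author must know it

// True when p is a usable probability.
inline bool isProbability(double p) { return p >= 0.0 && p <= 1.0; }

// Basic sanity check on one author's data (an assumed validation step).
inline bool isValid(const AuthorData& a) {
    return isProbability(a.c) && isProbability(a.cGivenK) &&
           isProbability(a.kGivenD) && isProbability(a.kGivenB) &&
           a.circulation >= 0;
}

// P( ~C | K ), the complement of CgivenK.
inline double notCitedGivenKnew(const AuthorData& a) {
    assert(isValid(a));
    return 1.0 - a.cGivenK;
}
```

For instance, `AuthorData{0.0, 0.3, 10, 0.4, 0.8, 90}` would describe an author
writing in 90 CE with a ten-year circulation period and no detected dependence
on the document.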

There are also variables and constants involved in each of the pieces of
internal evidence. This exposition of the dater program focuses on the authors,
which are the more complex points of data. It is hoped that the interested
reader can look at the source code for the application concerning the internal
evidence.

THE IMPORTANCE OF BAYES'S THEOREM

The first major breakthrough in the development of the application came when I
discovered the relevance of Bayes's Theorem.

According to Larsen and Marx in _An Introduction to Mathematical Statistics and
Its Applications_ (Prentice-Hall 1986, 2nd ed.), Bayes's theorem is defined as
follows (p. 56, mathematical notation paraphrased with subscripts in []):

"Let {A [i]} with i from i = 1 to n be a set of n events, each with a positive
probability, that partition S in such a way that the union of A [i] with i from
1 to n = S and A [i] intersection A [j] = empty set for i != j. For any event B (also
defined on S), where P(B) > 0,

P( A [j] | B ) = P( B | A [j] ) * P( A [j] ) / the sum of P( B | A[i] ) * P( A
[i] ) with i from i=1 to n,

for any 1 <= j <= n."

Let me explain what the symbols usually mean in English. The events A [1], A
[2], ... A [n] represent n different hypotheses, the actual probability of which
we would like to determine. The values of P( A[1] ), P( A[2] ) ... P( A [n] )
are assigned based on some prior knowledge about the likelihood of these
hypotheses -- or, perhaps, evenly distributed to each hypothesis, giving each a
probability of 1/n.

The event B represents an "outcome" or "experiment result" or "data point." We
know the probability that this outcome would show up given each individual
hypothesis. So we wish to determine the probability of the hypothesis given
that the outcome occurred. And this is what the formula in Bayes's theorem
provides.
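
The formula can be sketched directly in C++. This is only an illustration: the
function name bayesPosterior and its vector-based interface are assumptions and
do not come from the dater program. The entry prior[ i ] stands for P( A [i] )
and likelihood[ i ] stands for P( B | A [i] ).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns P( A[j] | B ) for every hypothesis j, given the priors P( A[j] )
// and the likelihoods P( B | A[j] ).
std::vector<double> bayesPosterior(const std::vector<double>& prior,
                                   const std::vector<double>& likelihood) {
    assert(prior.size() == likelihood.size());

    // Denominator of Bayes's Theorem: the sum of P( B | A[i] ) * P( A[i] ).
    double evidence = 0.0;
    for (std::size_t i = 0; i < prior.size(); ++i)
        evidence += likelihood[i] * prior[i];
    assert(evidence > 0.0);  // the theorem requires P( B ) > 0

    // Numerator for each hypothesis, divided by the shared denominator.
    std::vector<double> posterior(prior.size());
    for (std::size_t j = 0; j < prior.size(); ++j)
        posterior[j] = likelihood[j] * prior[j] / evidence;
    return posterior;
}
```

Because every posterior shares the same denominator, the returned values sum
to 1 whenever the priors do.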

If a discussion in the abstract is confusing, perhaps playing with some concrete
numbers will help. Here is a Bayes's Theorem calculator, assuming you have some
flavor of Java installed.

http://members.aol.com/johnp71/bayes.html

Let us assume that the prior probability of the chance that the author knew of
the document and the chance that the author didn't know of the document are both
50%.

I had assigned a 0% chanceCited and a 30% chanceWouldHaveCited, which I
understood to mean P( Cited | Knew ), or the chance that the author would have
cited the document given that the author knew of the document. I will say that
C means that the author cited the document and that K means that the author knew
of the document. This means the following:

Hypotheses: K, ~K
Outcomes: C, ~C

P(K) = .5
P(~K) = .5
P(C) = 0
P(~C) = 1

P( C | K ) = .3

We can deduce:

P( ~C | K ) = .7

And common sense says:

P( C | ~K ) = 0
P( ~C | ~K ) = 1

Or, in other words, if he doesn't know of it, he couldn't have cited it.

Now Bayes's Theorem says:

P( K | ~C ) = P( ~C | K ) * P( K ) / ( P( ~C | ~K ) * P( ~K ) + P( ~C | K ) *
P( K ) )
P( K | ~C ) = .7 * .5 / ( 1 * .5 + .7 * .5 ) = .35 / .85 = .4117647

P( ~K | ~C ) = P( ~C | ~K ) * P( ~K ) / ( P( ~C | ~K ) * P( ~K ) + P( ~C | K ) *
P( K ) )
P( ~K | ~C ) = 1 * .5 / ( 1 * .5 + .7 * .5 ) = .5 / .85 = .5882353

And a quick sanity check verifies that P( K | ~C ) + P( ~K | ~C ) = 1.

This means that, for Ephesians, given that there is a 0% chance that the author
cited the document and given that there is a 30% chance that the author would
have cited the document if the author knew of the document, and given even
background probabilities for the chances that the author knew or didn't know of
the document, there is a resulting 58.82% chance that the author didn't know of
the document and a 41.18% chance that the author did know of the document.

And this coincides with our intuitions that Ephesians should provide a slight
bias towards dates after the time of writing.
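
The two-hypothesis calculation above can be sketched as a small C++ function.
The name probKnewGivenNotCited is hypothetical, and the function hard-codes the
common-sense value P( ~C | ~K ) = 1 used in the example.

```cpp
#include <cassert>

// P( K | ~C ): chance the author knew the document, given no citation.
// pK            = prior P( K )
// pCitedGivenK  = P( C | K )
double probKnewGivenNotCited(double pK, double pCitedGivenK) {
    assert(pK >= 0.0 && pK <= 1.0);
    assert(pCitedGivenK >= 0.0 && pCitedGivenK <= 1.0);

    double knewAndSilent    = (1.0 - pCitedGivenK) * pK;  // P( ~C | K ) * P( K )
    double unawareAndSilent = 1.0 * (1.0 - pK);           // P( ~C | ~K ) * P( ~K )
    double silent = knewAndSilent + unawareAndSilent;     // P( ~C )
    assert(silent > 0.0);

    return knewAndSilent / silent;
}
```

With the numbers from the example, probKnewGivenNotCited( 0.5, 0.3 ) works out
to .35 / .85, the same .4117647 found by hand, and one minus that value gives
P( ~K | ~C ).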

However, this example with Ephesians is not a complete algorithm; it is only an
explanation of the importance of Bayes's Theorem. In actuality, the background
probabilities for whether the author knew about the document are not even; they
depend on the date at which the author wrote relative to the terminus a quo and
terminus ad quem. A complete algorithm for applying the data of each author is
explained in the next section.

THE HEART OF THE ALGORITHM

Hypotheses: Document before Overlap (B), Document during Overlap (D),
Document after Overlap (A)
Outcomes: Author knows document (K), Author does not know document (~K)

I have assumed that Ephesians was written in 90 CE. I have assumed that the
'circulation' equals 10 years. I have assumed that, given just a terminus a
quo and a terminus ad quem, all dates in this range have equal background
probability.

Let us assume that the Epistle to the Ephesians was the first document to be
entered into our algorithm. In this case, we would have the following
background probabilities, given a terminus a quo of 70 and a terminus ad
quem of 170 -- which I am using here for mathematical simplicity; other,
more precise values could be agreed upon.

P( B ) = 10 / 100 = .1
P( D ) = 10 / 100 = .1
P( A ) = 80 / 100 = .8

Now, let us assign a value to the probability that the author would have known
about the document given that the document was written more than 'circulation'
years before. (Note again that I am picking values just for example.)

P( K | B ) = .8

Obviously, we can deduce:

P( ~K | B ) = .2

Common sense tells us that the chance that the author knew of the document given
that the document was written after the author is zero.

P( K | A ) = 0

And common sense also says that the chance that the author didn't know of the
document given that the document was written after the author is one.

P( ~K | A ) = 1

The difficult case is determining the probability that the author knew of the
document given that the document was written between 'circulation' and 0 years
before the author wrote. Obviously, there is some chance that the author knew
of the document even if it was written during these years. But this chance is
just as obviously less than the chance that the author would have known about
the document given that it was written 'circulation' years or more before. Out
of simplicity, I decide to halve the chanceWouldHaveKnown for the "overlap"
period.

P( K | D ) = .4

And from this we can deduce:

P( ~K | D ) = .6

Now we can punch this into a handy Bayesian calculator found here. (You may
wish to write out the formulas if you're unsure of the calculator.)

http://members.aol.com/johnp71/bayes.html

P( B | K ) = .667
P( D | K ) = .333
P( A | K ) = 0

P( B | ~K ) = .023
P( D | ~K ) = .068
P( A | ~K ) = .909

This means:

If the document was known to the author, there is a 66.7% chance that the
document was written more than 'circulation' years before (70-80 CE) and a 33.3%
chance that the document was written less than 'circulation' years before (and
yet before, 80-90 CE).

If the document was not known to the author, there is a 2.3% chance that the
document was written 70-80 CE, a 6.8% chance that the document was written
80-90 CE, and a 90.9% chance that the document was written 90-170 CE.
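
For readers who would rather write out the formulas than trust the calculator,
here is a rough C++ sketch of the same step. The names Periods,
PeriodPosteriors, and periodPosteriors are made up for this illustration; index
0 stands for B, 1 for D, and 2 for A.

```cpp
#include <array>
#include <cassert>

using Periods = std::array<double, 3>;  // index 0 = B, 1 = D, 2 = A

struct PeriodPosteriors {
    Periods givenK;     // P( B | K ),  P( D | K ),  P( A | K )
    Periods givenNotK;  // P( B | ~K ), P( D | ~K ), P( A | ~K )
};

// prior[ i ]  = background probability of each period
// kGiven[ i ] = P( K | period i )
PeriodPosteriors periodPosteriors(const Periods& prior, const Periods& kGiven) {
    // Denominators of Bayes's Theorem for the outcomes K and ~K.
    double pK = 0.0, pNotK = 0.0;
    for (int i = 0; i < 3; ++i) {
        pK    += kGiven[i] * prior[i];
        pNotK += (1.0 - kGiven[i]) * prior[i];
    }
    assert(pK > 0.0 && pNotK > 0.0);

    PeriodPosteriors out{};
    for (int i = 0; i < 3; ++i) {
        out.givenK[i]    = kGiven[i] * prior[i] / pK;
        out.givenNotK[i] = (1.0 - kGiven[i]) * prior[i] / pNotK;
    }
    return out;
}
```

Feeding in the priors { .1, .1, .8 } and the values { .8, .4, 0 } for
P( K | B ), P( K | D ), and P( K | A ) should give the six numbers listed
above, up to rounding.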

Now I can reveal the total algorithm:

Start with background probabilities for B, D, and A. In the case of Ephesians,
these are .1, .1, and .8.

Use these background probabilities, P( K | B ), P( K | D ), P( K | A ), and
Bayes's theorem to determine P( B | K ), P( D | K ), P( A | K ), P( B | ~K ),
P( D | ~K ), and P( A | ~K ). For Ephesians, these values are listed above.

Now use these values and the following formula in order to determine the
background probabilities for K and ~K. The formula is a corollary of the
definition of P( A | B ):

"P( A | B ) = P( A ^ B ) / P( B ), where ^ means intersection"
"P( A ^ B ) = P( A | B ) * P( B )"

So we get the following values:

P( K ^ B ) = P( K | B ) * P( B ) = .8 * .1 = .08
P( K ^ D ) = P( K | D ) * P( D ) = .4 * .1 = .04
P( K ^ A ) = P( K | A ) * P( A ) = 0 * .8 = 0

P( ~K ^ B ) = P( ~K | B ) * P( B ) = .2 * .1 = .02
P( ~K ^ D ) = P( ~K | D ) * P( D ) = .6 * .1 = .06
P( ~K ^ A ) = P( ~K | A ) * P( A ) = 1 * .8 = .8000

To determine background probabilities for K and ~K:

P( K ) = P( K ^ B ) + P( K ^ D ) + P( K ^ A ) = .08 + .04 + 0 = 0.12
P( ~K ) = P( ~K ^ B ) + P( ~K ^ D ) + P( ~K ^ A ) = .02 + .06 + .8000 = .88

And the sum check works. Great!

Use these background probabilities for K and ~K, P( C | K ), and Bayes's
Theorem to determine P( K | ~C ) and P( ~K | ~C ). Note that P( K | C ) = 1
and P( ~K | C ) = 0.

Remember that before, at this step, we settled for the default P( K ) of .5.
Now we have applied the background knowledge about the probabilities for B,
D, and A in order to make a much better estimate that P( K ) = .12 for the
background probability. In other words, given that there were only twenty years
between Ephesians and the earliest possible date for the document, yet that
there are eighty years between Ephesians and the latest possible date for
the document, it is most likely that the author of Ephesians did not know of
the document, pending further evidence. That further evidence is P( ~C ),
which in this case just serves to reinforce the background probability.

We assume that P( C | K ) = .3
We deduce that P( ~C | K ) = .7

And then the obvious ones:

P( C | ~K ) = 0
P( ~C | ~K ) = 1

Now using Bayes's Theorem:

P( K | ~C ) = P( ~C | K ) * P( K ) / ( P( ~C | ~K ) * P( ~K ) + P( ~C | K ) *
P( K ) )

P( K | ~C ) = .7 * .12 / ( 1 * .88 + .7 * .12 ) = .084 / .964 = .087

P( ~K | ~C ) = P( ~C | ~K ) * P( ~K ) / ( P( ~C | ~K ) * P( ~K ) + P( ~C | K ) *
P( K ) )

P( ~K | ~C ) = 1 * .88 / ( 1 * .88 + .7 * .12 ) = .88 / .964 = .913

And we can verify that .087 + .913 = 1

Use P( ~C ) and P( K | ~C ) to determine P( K ^ ~C ) with the formula P( K ^
~C ) = P( K | ~C ) * P( ~C ) = .087 * 1 = .087, where ^ is the intersection symbol.

Use P( ~C ) and P( ~K | ~C ) to determine P( ~K ^ ~C ) with the formula P( ~K ^
~C ) = P( ~K | ~C ) * P( ~C ) = .913 * 1 = .913.

Use P( C ) and P( K | C ) to determine P( K ^ C ) with the formula P( K ^
C ) = P( K | C ) * P( C ) = 1 * P( C ) = P( C ) = 0.

Use P( C ) and P( ~K | C ) to determine P( ~K ^ C ) with the formula P( ~K ^ C )
= P( ~K | C ) * P( C ) = 0 * P( C ) = 0.

OK so you know the probabilities for all combinations of K and C. Now
determine P( K ) like so:

P( K ) = P( K ^ ~C ) + P( K ^ C ) = P( K ^ ~C ) + P( C ) = .087 + 0 = .087.

And determine P( ~K ) like so:

P( ~K ) = P( ~K ^ ~C ) + P( ~K ^ C ) = P( ~K ^ ~C ) = .913.

And we can check that .087 + .913 = 1

OK so now you know the probability of K and the probability of ~K.

Now recall that we computed these values many lines ago.

P( B | K ) = .667
P( D | K ) = .333
P( A | K ) = 0

P( B | ~K ) = .023
P( D | ~K ) = .068
P( A | ~K ) = .909

We can now use the formulas to find all combinations of B, D, and A with K
and ~K.

P( B ^ K ) = P( B | K ) * P( K ) = .667 * .087 = .058029
P( D ^ K ) = P( D | K ) * P( K ) = .333 * .087 = .028971
P( A ^ K ) = P( A | K ) * P( K ) = 0 * .087 = 0

P( B ^ ~K ) = P( B | ~K ) * P( ~K ) = .023 * .913 = .020999
P( D ^ ~K ) = P( D | ~K ) * P( ~K ) = .068 * .913 = .062084
P( A ^ ~K ) = P( A | ~K ) * P( ~K ) = .909 * .913 = .829917

And we can sum up to find the probabilities of B, D, and A.

P( B ) = P( B ^ K ) + P( B ^ ~K ) = .058029 + .020999 = .079028
P( D ) = P( D ^ K ) + P( D ^ ~K ) = .028971 + .062084 = .091055
P( A ) = P( A ^ K ) + P( A ^ ~K ) = 0 + .829917 = .829917

And we can verify that P( B ) + P( D ) + P( A ) = .079028 + .091055 +
.829917 = 1

The results are intuitively correct, as the presence of Ephesians introduces a
slight bias towards dates after Ephesians and a slight preference for the
circulation period against the pre-circulation period. The results are
theoretically correct as explained throughout this section.
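
The whole single-author update described in this section can be strung
together in a short C++ sketch. This is an illustration under stated
assumptions, not the dater program's own code: the names Periods and
updateWithAuthor are invented here, and the function carries full precision
instead of rounding intermediate values to three places.

```cpp
#include <array>
#include <cassert>

using Periods = std::array<double, 3>;  // index 0 = B, 1 = D, 2 = A

// prior    = background probabilities for B, D, A
// kGivenB  = P( K | B ),  kGivenD = P( K | D ),  P( K | A ) is fixed at 0
// cGivenK  = P( C | K ),  c = P( C )
Periods updateWithAuthor(const Periods& prior, double kGivenB, double kGivenD,
                         double cGivenK, double c) {
    const Periods kGiven = {kGivenB, kGivenD, 0.0};

    // Background P( K ) and P( ~K ), summed over the three periods.
    double pK = 0.0, pNotK = 0.0;
    for (int i = 0; i < 3; ++i) {
        pK    += kGiven[i] * prior[i];
        pNotK += (1.0 - kGiven[i]) * prior[i];
    }
    assert(pK > 0.0 && pNotK > 0.0);

    // Bayes's Theorem for P( K | ~C ), with P( ~C | ~K ) = 1.
    double pKGivenNotC = (1.0 - cGivenK) * pK / ((1.0 - cGivenK) * pK + pNotK);

    // New P( K ) = P( K | ~C ) * P( ~C ) + P( K | C ) * P( C ), with P( K | C ) = 1.
    double newPK = pKGivenNotC * (1.0 - c) + c;

    // New P( X ) = P( X | K ) * P( K ) + P( X | ~K ) * P( ~K ) for X in B, D, A.
    Periods posterior{};
    for (int i = 0; i < 3; ++i) {
        double xGivenK    = kGiven[i] * prior[i] / pK;
        double xGivenNotK = (1.0 - kGiven[i]) * prior[i] / pNotK;
        posterior[i] = xGivenK * newPK + xGivenNotK * (1.0 - newPK);
    }
    return posterior;
}
```

Called with the Ephesians numbers (priors { .1, .1, .8 }, P( K | B ) = .8,
P( K | D ) = .4, P( C | K ) = .3, P( C ) = 0), it should come out close to the
hand-computed .079, .091, and .830, with small differences in the later
decimal places because nothing is rounded along the way.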

MULTIPLYING PROBABILITIES

Now we know how to go from background probabilities for B, D, and A combined
with information on any given author in order to form new probabilities for B,
D, and A. How do we use this knowledge to form a complete algorithm that
calculates the values for B, D, and A after the data from several authors is
applied?

The problem is combining this information in order to find the probability that
the document was written in a certain year given that we have data for multiple
authors. That is, we want to know P( Yn | E ), where E is the set of all
evidence and authors.

And the independent version of Bayes's rule comes in handy here. It says:

P( Yn | E1, E2, E3, ... ) = ( P( Yn ) * P( E1 | Yn ) * P( E2 | Yn ) * P( E3
| Yn ) ... ) / ( P( E1 ) * P( E2 ) * P( E3 ) ... )

But how do we find P( E1 | Yn )? Well, if we apply the definition of P( A |
B ) = P( A, B ) / P( B ), we get:

P( E1 | Yn ) = P( E1, Yn ) / P( Yn )

And if we apply the corollary to the definition of P( A | B ), which states
that P( A, B ) = P( A | B ) * P( B ), we get:

P( Yn, E1 ) = P( Yn | E1 ) * P( E1 )

And when we substitute that in:

P( E1 | Yn ) = P( Yn | E1 ) * P( E1 ) / P( Yn )

And when we substitute that into the independent version of Bayes' rule:

P( Yn | E1, E2, E3, ... ) = ( P( Yn ) * P( Yn | E1 ) * P( E1 ) / P( Yn )
* P( Yn | E2 ) * P( E2 ) / P( Yn ) * P( Yn | E3 ) * P( E3 ) / P( Yn )
... ) / ( P( E1 ) * P( E2 ) * P( E3 ) ... )

And when we simplify:

P( Yn | E1, E2, E3, ... ) = ( P( Yn ) * P( Yn | E1 ) * P( Yn | E2 ) * P( Yn
| E3 ) ... ) / ( P( Yn ) * P( Yn ) * P( Yn ) ... )

When every year has the same background probability P( Yn ), as assumed
earlier, the denominator is the same for every year. It then acts as a
"normalizing constant," i.e., it is in the formula to make sure that
everything sums up to 1. This means that we can ignore it for a while, and
then make sure that things sum up to 1 when we are done. With all that
knowledge, here is the code for combining the data to find the new
probability that the document was written in each year.

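void calcNewYears( void )
{
    double sum = 0;
    double normalizingConstant;

    int y, a, e; // year, author, evidence

    for ( y = 0; y < numYears; y++ )
    {
        // start from the background probability for this year
        newyears[ y ] = years[ y ];

        // multiply in the factor contributed by each author
        for ( a = 0; a < numAuthors; a++ )
            newyears[ y ] *= authors[ a ].getYear( y, quo, numYears );

        // multiply in the factor contributed by each piece of internal evidence
        for ( e = 0; e < numEvidences; e++ )
            newyears[ y ] *= evidences[ e ].getYear( y, quo, numYears );

        sum += newyears[ y ];
    }

    // if every year was ruled out, there is nothing to rescale
    if ( sum == 0 )
        return;

    // rescale so that the probabilities over all years sum to 1
    normalizingConstant = 1 / sum;

    for ( y = 0; y < numYears; y++ )
        newyears[ y ] *= normalizingConstant;
}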
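
The function above depends on globals and on a getYear method that are not
shown in this message. A self-contained sketch of the same multiply-then-
normalize idea follows. The name combineYears and the factors table are
assumptions made for illustration: factors[ s ][ y ] stands in for whatever
getYear returns for source s and year y.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// prior[ y ]        = background probability for year y
// factors[ s ][ y ] = weight that source s (an author or a piece of
//                     internal evidence) gives to year y
std::vector<double> combineYears(const std::vector<double>& prior,
                                 const std::vector<std::vector<double>>& factors) {
    std::vector<double> result(prior);
    double sum = 0.0;

    for (std::size_t y = 0; y < result.size(); ++y) {
        // multiply the background probability by every source's weight
        for (const std::vector<double>& f : factors) {
            assert(f.size() == prior.size());
            result[y] *= f[y];
        }
        sum += result[y];
    }

    // rescale so the probabilities over all years sum to 1
    assert(sum > 0.0);
    for (double& p : result)
        p /= sum;
    return result;
}
```

Passing the data in as arguments, instead of reading globals, makes the step
easy to try on small made-up tables before applying it to real authors.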

The whole source code and the executable program are found here:

You might want to play with the executable a little. The answers it gives
seem approximately correct. That's good, because the algorithm seems to be
theoretically correct, as described in this section on the independent version
of Bayes' Rule and the previous sections on Bayes' Theorem.

APPLICATION

The application described in this article has uses outside the New Testament
world and should be of interest to all historians involved in determining the
time of writing for documents. Someone even suggested to me that it could be
In a separate article, I will describe the data that could be used in
determining a dating range for the Gospel of John.

QUESTION

Now that you have read an exposition of the dater program, do you think that a
paper based on this idea would be accepted for publication by some scholarly
journal? If it makes a difference, I am an undergraduate of computer science.

thanks,
Peter Kirby
Message 2 of 2 , Apr 30 8:53 PM
Hello Peter

I find your idea intriguing. If I may, I would like to know if you
have tested your formula with documents which already have relatively
known and fixed dates? Thus, for example, rather than using disputed
and disputable dated documents like Ephesians and GJohn, what were
your results when testing, say Paul's letter to the Romans, and 2nd
Century documents from Ignatius, Polycarp, Justin Martyr or Papias?
Did your formula help to establish probable dates for Romans?

I would be interested in how it worked for other documents as well of
course, especially as texts with already relatively fixed dates
should serve as a reasonable test for validity of the theory as a
whole.

Thank you again,

Brian Trafford