- I've done some more work on the statistics I previously reported, and

happily, besides getting what I believe are much more reliable results, I

now have enough material to use this as a class project.

Let me apologize in advance to the non-mathematically inclined for the

mathematical complications.

First, the problem with the current set of results is this: The tests

involved, particularly the t-test to determine the p-value, assume that the

distributions being correlated are normally distributed. To the extent that

this is not true, the results are unreliable.

Examining the distribution of the frequency of the words involved reveals

that they are not normally distributed.

There is a large concentration near zero, and some extreme outliers in the tails. They are highly "leptokurtic".

The effect is that the outliers dominate the results, we do not fully

utilize all available information, and we overstate the confidence level.

My first attempt at a solution to this problem was to use a non-parametric method. Rather than using the actual data values, this method ranks the data, then tries to correlate the ranks. The advantage here is that we can do a significance test that does not involve any assumptions about the distribution of the data. The drawback is that we lose even more information.
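As a sketch of this rank-based approach, here is how a Spearman-style rank correlation could be computed for two word-frequency vectors. The counts below are made-up placeholders, not the actual Synoptic data; a distribution-free significance test could then be run by permuting one of the vectors.

```python
def ranks(xs):
    # Assign average ranks (1-based), giving tied values the mean of their ranks
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank of the tied block
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def pearson(a, b):
    # Ordinary product-moment correlation
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sum((x - ma) ** 2 for x in a) ** 0.5
    sb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (sa * sb)

# Spearman's rho is the Pearson correlation of the ranks
freq_a = [0, 0, 1, 3, 0, 2, 7, 0]  # hypothetical counts in one category
freq_b = [0, 1, 0, 2, 0, 3, 5, 1]  # hypothetical counts in another
rho = pearson(ranks(freq_a), ranks(freq_b))
```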

The expected result is that only the strongest correlations from the previous attempts will show up. This is indeed what we find.

I'll post the full results later tonight, but here are the values that are

significant at the .0003 level.

Positive

------------

012-112

221-121

221-021

222-220

211-210

112-012

112-102** (new result) (.0003 level exactly)

221-121

221-021

221-220

121-021

121-120

021-121

002-112

002-012

020-120

200-202

200-201

Negative

----------

002-200

002-202

002-221

I note here that since the 102-202 connection has disappeared, and 102-112

has appeared, this can be viewed as rather positive news for the FH, and

the 3SH. Influential outliers must have been largely responsible for

previous results.

AUTON is the biggest offender here. However, in the next method I describe, all of the above results appear, and more, *except* for 112-102. Plus, 102 and 201 remain symmetric with respect to 202 in the next set of results.

Other than to state the obvious, that 102 seems to be related to both Luke and 202, I've no more insight as to why 112-102 appears in this test and not the next one. In the next test we can make use of all the zeros, so we are using more information.

==========

The problem with this non-parametric approach is that we are making poor use of the data. By ranking the values we lose information. Also, we are still not making any use of the zero values. The next method solves these problems.

We can, effectively and correctly, use all the data including the zeros.

The results, it turns out, are also free of those annoying minor effects. While it might be possible for a redactor to preferentially retain a specific word, it is unlikely that this happens over many words. So, by effectively using *ALL* the data, we can remove many of these effects. Whereas before we had to push the confidence levels very high to eliminate them, I can now go as low as .99 confidence for individual results without seeing anything bizarre, and I get even more results that seem very plausible.

The method involves maximum likelihood fitting, and a likelihood ratio test

to determine significance. The first question is: if a normal distribution is not appropriate, what distribution is? Realizing that we are dealing with frequencies, that is, integer counts, leads us quickly to the Poisson distribution.

An example of a Poisson process is the frequency of customer arrival at a

store. There is an average arrival rate (gamma), and in any given time

interval we can calculate the exact probability that 1 person arrives, 2

people arrive, 0 people arrive, etc., by using a Poisson distribution. The

only parameter we need is the average rate (gamma).

I treat each different word as a separate Poisson process with its own

gamma. I first estimate the gamma by looking at the overall frequency of

the word, and the number of words in a category.

Example:

ABRAAM occurs 18 times in all categories.

There are 25843 words studied in total.

1220 words are in category 200.

The expected number in 200 based on this is 18*1220 / 25843 = about 2.67.

So 2.67 will be our gamma estimate.

We can calculate the probability of 0 occurrences: .06

of 1 occurrence: .18

of 2 occurrences: .24

of 3: .22

of 4: .14

and so on. (The actual observed count is 3.)
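A quick sketch that reproduces these probabilities (to within rounding) from the gamma estimate:

```python
import math

def poisson_pmf(k, gamma):
    # P(K = k) for a Poisson distribution with rate gamma
    return gamma ** k * math.exp(-gamma) / math.factorial(k)

# Gamma estimate for ABRAAM in category 200, from the example above
gamma = 2.67
probs = [poisson_pmf(k, gamma) for k in range(5)]
```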

Once we have the probability of the actual observed occurrence for each

word, multiplying the results would give the total probability. Since this

would involve multiplying many fractions, the result would be tiny. So, the

preferred method is to take the log of each probability, and add the

individual results.
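The underflow problem, and the log trick that avoids it, can be seen directly in a toy illustration (these are not the actual word probabilities):

```python
import math

probs = [0.1] * 400  # many small per-word probabilities

# Multiplying directly underflows double precision to exactly 0.0
product = 1.0
for p in probs:
    product *= p

# Summing logs preserves the same information without underflow
log_likelihood = sum(math.log(p) for p in probs)
```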

The next step is to ask whether information from another category (say 202) might be useful in predicting 200. A frequency estimate based on 202 would be calculated in a similar manner to the estimated frequency based on all categories.

I then assign a variable "beta" to weight the estimates.

Best estimate = B * estimate based on other category + (1-B) * estimate

based on overall frequency.

I then use Excel's solver feature to find the value of beta that will most

improve the overall calculated likelihood.
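A minimal sketch of this fitting step, using a simple grid search over beta in place of Excel's solver; all the counts and gamma estimates here are hypothetical placeholders, not the actual Synoptic data:

```python
import math

def log_poisson(k, lam):
    # Log of the Poisson pmf: k*ln(lam) - lam - ln(k!)
    return k * math.log(lam) - lam - math.lgamma(k + 1)

# Hypothetical per-word data for one pair of categories
observed = [3, 0, 1, 5, 2]             # counts in the target category
overall  = [2.7, 0.4, 1.1, 3.9, 2.2]   # gamma estimates from overall frequency
other    = [4.1, 0.1, 0.8, 5.5, 1.9]   # gamma estimates from the other category

def loglik(beta):
    # Total log-likelihood under the blended estimate
    total = 0.0
    for k, g_all, g_oth in zip(observed, overall, other):
        lam = beta * g_oth + (1 - beta) * g_all  # B * other + (1-B) * overall
        total += log_poisson(k, lam)
    return total

# Grid search for the beta that maximizes the likelihood (solver stand-in)
best_beta = max((b / 100 for b in range(101)), key=loglik)
```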

If beta is 0 we conclude the category is unrelated. If there is a positive

relation, we need a test for significance.

The test statistic is -2 * ln( Lr / Lu ), where Lu is the likelihood for the model with the beta, and Lr is the likelihood of the model based only on the overall frequency. The statistic is distributed chi-squared with n degrees of freedom, where n is the number of parameters added; in this case, 1.
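The likelihood ratio test can be sketched as follows. The two log-likelihood values are hypothetical, and the chi-squared survival function for 1 degree of freedom is computed via the complementary error function:

```python
import math

def chi2_sf_1df(x):
    # Survival function of chi-squared with 1 df: P(X > x) = erfc(sqrt(x/2))
    return math.erfc(math.sqrt(x / 2))

ll_restricted = -152.8    # hypothetical log-likelihood, overall-frequency model (beta = 0)
ll_unrestricted = -148.1  # hypothetical log-likelihood with beta fitted

# -2 ln(Lr/Lu) expressed in log-likelihoods; always >= 0
stat = -2 * (ll_restricted - ll_unrestricted)
p_value = chi2_sf_1df(stat)
significant = p_value < 0.01  # the .99 confidence level used below
```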

The method does not test for negative relations. Also note that we do not

have absolute symmetry. 002 + overall may predict 112 better than 112 +

overall predicts 002.

The full results will be posted later. Here I will list results significant

at the .99 level.

If the result is not also significant at the .0003 level, I'll mark it with

an *.

Luke group

------------

012-112

112-002

002-012*

Matthew group

--------------

212-211

212-210

210-211

Sayings group

-----------

200-201

200-202

201-202*

202-102*

Central group

-----------

222-220

222-022*

Mark group

----------

020-021

020-120

020-121

020-221*

121-120

121-221

121-122

121-021

021-120

021-221

120-122*

Mark-central connections

---------------

022-021*

220-221

222-221

I'm sure Ron will be happy about the support the 212 results give for

Luke's use of Matthew.

Again, full results will be uploaded tonight. Comments are welcome.

David Gentile

Riverside, Illinois

M.S. Physics

Ph.D. Management Science candidate

Synoptic-L Homepage: http://www.bham.ac.uk/theology/synoptic-l

List Owner: Synoptic-L-Owner@...

> You're welcome. I'm just glad, at this point, that I can use it as a class

> project.

> It's due in 2 weeks, and I wasn't having a lot of other inspirations for

> projects.

>

> Dave Gentile

> Riverside, Illinois

> M.S. Physics

> Ph.D. Management Science candidate

>

> >

> > Dave, I think it would be a daunting task and there may not be enough

> to


> > provide a useful sample. Maybe it would make a good Ph.D. dissertation

> project

> > for someone. In any event, thanks for all your good work.

> >

> > Ted

> >

>

>
