> A specific value has been suggested. Bruce Merrill wrote, some
> weeks ago, --
> >
> > There is no widely accepted objective definition of a "strong"
> > correlation. A correlation of 0.7 explains only half the variability
> > in the data (i.e., if r=0.7, r**2 = 0.7**2 = 0.49, or approximately
> > half), so many researchers begin talking about meaningful
> > correlations when the absolute value reaches 0.7, with strong
> > correlations being somewhat greater than 0.7 (or strong negative
> > correlations being somewhat less than -0.7).
> >
> On this view, a cut-off absolute value for r of 0.3 is far too low. The
> value only *begins* to be meaningful at 0.7.

I've finally figured out why this doesn't work in this analysis. Dave
Gentile's analysis works this way:

1 For each category, count the number of times each of 807 different
words is used in the synoptic gospels (words used only once or twice, and
words used very frequently indeed, basically KAI and DE, are not counted).

2 Convert each word count to a word frequency by dividing by the overall

number of words in each category.

3 Convert each word frequency to a relative frequency (or frequency

shift) by subtracting the average word frequency across all the categories.

4 Correlate the relative frequency values.
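The four steps above can be sketched in code. This is a minimal illustration with made-up random counts (the real analysis uses actual word counts from the 19 synoptic categories); numpy is assumed.

```python
# Sketch of the four analysis steps with hypothetical data:
# rows are the 807 words, columns are the 19 categories.
import numpy as np

rng = np.random.default_rng(0)

# Step 1: word counts per category (random stand-in data here).
counts = rng.integers(0, 50, size=(807, 19)).astype(float)

# Step 2: convert counts to frequencies by dividing by each
# category's total word count.
freqs = counts / counts.sum(axis=0)

# Step 3: convert to relative frequencies (frequency shifts) by
# subtracting each word's average frequency across all categories.
rel = freqs - freqs.mean(axis=1, keepdims=True)

# Step 4: correlate the relative-frequency columns of two categories.
r = np.corrcoef(rel[:, 0], rel[:, 1])[0, 1]
print(round(r, 3))
```

Note that after step 3 each word's values sum to zero across the categories, which is exactly what produces the negative bias discussed below.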

Some people have questioned why step 3 is necessary, and it is indeed

perfectly possible to correlate the frequencies produced by step 2. In

fact, if you do that, you get correlation coefficients (r) ranging from r =

0.397 for 212-021 to r = 0.890 for 112-002 (and note that ALL the 'r' values

are positive). Now, on this basis I think it would be perfectly fair to say

that 112-002 really is a 'strong' correlation. So, why have we been

performing step 3 at all, and getting confused over whether r = 0.3 means

anything, and also what negative values of r really mean?

The answer is that step 3 is trying to remove that part of the positive

correlation that is due to all the categories having been written in Greek.

In other words, common Greek words in one category tend to also be common in

other categories, and uncommon Greek words in one category tend to be

uncommon in other categories. As a result, there is a 'built in' positive

correlation despite different people having written (or edited) different

categories, because they all used Greek. What step 3 does is not only
remove the 'language' positive correlation but, in the process, expand the
differences that are left, making it easier to see the remaining
correlations. This is basically what happens in the PCA (Principal
Component Analysis) also

performed by Dave Gentile. As Dave reported, the 'Prin1' column in the PCA

results indicates that the biggest correlating factor is language (it's all

Greek), and only when that factor is removed can you then see other

differences. For example, once language is removed, Prin2 says that the

biggest factor determining correlations is Mark vs. non-Mark.
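The 'built in' language correlation can be demonstrated with synthetic data: give every category one shared frequency profile (standing in for "it's all Greek") plus smaller category-specific variation, and the raw frequencies correlate strongly even though the category-specific parts are independent. Step 3 strips the shared profile out. All numbers and names here are illustrative, not the real gospel data.

```python
# Demonstration (with made-up data) that a shared "language" profile
# induces a built-in positive correlation, and that subtracting the
# per-word average (step 3) removes it.
import numpy as np

rng = np.random.default_rng(1)
n_words, n_cats = 807, 19

# Every category shares one common frequency profile...
common = rng.exponential(1.0, size=(n_words, 1))
# ...plus its own smaller, independent category-specific variation.
specific = 0.3 * rng.normal(size=(n_words, n_cats))
freqs = common + specific

# Raw frequencies: strongly positively correlated because of 'common'.
raw_r = np.corrcoef(freqs[:, 0], freqs[:, 1])[0, 1]

# Step 3: subtract each word's average across categories.
rel = freqs - freqs.mean(axis=1, keepdims=True)
shifted_r = np.corrcoef(rel[:, 0], rel[:, 1])[0, 1]

print(round(raw_r, 2), round(shifted_r, 2))
```

The raw correlation comes out large and positive purely because of the shared profile; after step 3 it collapses to near zero, leaving only the (here, nonexistent) genuine relationship between the two categories.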

Now, the mechanism used by Dave Gentile to remove the 'language' effect is

Step 3. By subtracting the average word frequencies he creates values that

are positive where the frequency is greater than the average, and negative

where the frequency is less than the average. As a result, values of 'r'

that were in the range 0 to +1 now become values in the range -? to +?. The

'?'s indicate that it's not simply a question of shifting all values by

(say) 0.5 and having everything in the range -0.5 to +0.5. Unfortunately

for us, every value of 'r' gets changed by a different amount!

Nevertheless, there is an overall trend of reducing the value of 'r' even

for very strong correlations found in Step 2. For example 112-002 reduces

from r = 0.89 at step 2 to 0.52 at step 3 (and even lower, 0.48, when the

'zeros' are included), and for 212-021 the value of r = 0.397 becomes r

= -0.21!

This 'negative shift' is more easily seen by reducing the number of

categories from 19 to something smaller. For example, just take 202 and

201. At step 2 their 'r' value is 0.772. Now, if we perform step 3 on just

these two categories, for each word we calculate an average of just two

values, and for every word this average lies midway between the 202 and 201

frequency values. As a result, every single data point of value 'X' in 202

is '-X' in 201, resulting in an 'r' value of exactly -1, and this is true

for any pair of categories we choose. Whatever 'r' value we got at step 2,

the 'r' value at step 3 is always -1. Now, as we add more categories this

'negative bias' effect is reduced. For example, when the average is

calculated for 202, 201, and 200, we get step 3 'r' values as follows:

202-201 = -0.355, 202-200 = -0.674, 201-200 = -0.452. As more categories

are included in the calculation of the average the 'r' values climb from

their '-1' start, until with all 19 categories 'r' reaches the values we are

familiar with. However, the negative bias hasn't gone away. Instead, it's

just been reduced. Dave Gentile calculated the value of the bias

as -1/(X-1) (where X is the number of categories), so for X=2 the bias

is -1, for X=3 the bias is -0.5, for X=4 it's -0.333, and for X=19

it's -0.0556.
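This -1/(X-1) bias can be checked numerically: even for completely independent random "frequencies" (so the true correlation is zero), subtracting the per-word average across X categories pushes the pairwise correlations toward -1/(X-1). A quick simulation, with arbitrary synthetic data:

```python
# Numerical check of the -1/(X-1) negative bias caused by step 3.
import numpy as np

rng = np.random.default_rng(2)

for X in (2, 3, 4, 19):
    freqs = rng.normal(size=(5000, X))   # independent columns: true r = 0
    rel = freqs - freqs.mean(axis=1, keepdims=True)
    # average correlation over all pairs of categories
    c = np.corrcoef(rel, rowvar=False)
    pairs = c[np.triu_indices(X, k=1)]
    print(X, round(float(pairs.mean()), 3), round(-1 / (X - 1), 3))
```

For X=2 the correlation is exactly -1 (each row's two values are forced to be X and -X), and as X grows the observed average tracks -1/(X-1), matching Dave Gentile's formula.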

Overall, step 3 affects the 'r' values in two different ways.

1 In the first place, because one of the major factors causing a

correlation (language) has been removed, all values of 'r' are changed.

2 In the second place, the step 3 mechanism doubles the potential range

of 'r' values, changing it from 0 to +1, to -1 to +1.

What all the above tells us is that because of step 3, it is completely

unrealistic to use an 'r >=0.7' cut off to determine whether there is a

strong correlation between the categories or not. However, the above still

doesn't tell us what the correct value is. One approach might be to take

all the 'step 2' values with r >= 0.8 and see what they become at step 3

(but only for those where P<=0.0003):

           Step 2   Step 3
112-002     0.890    0.519
121-021     0.853    0.693
201-200     0.849    0.349
202-200     0.848    0.397
121-120     0.841    0.442
222-220     0.828    0.484
211-210     0.820    0.497
221-121     0.813    0.357
120-020     0.810    0.461

This suggests that values down to at least r = 0.35 must be regarded as

valid. What about going down to 'step 2' values of r >=0.75?

           Step 2   Step 3
202-102     0.797    0.316
222-002     0.773   -0.333
202-201     0.772    0.349
222-221     0.766    0.244
120-002     0.760   -0.267
121-002     0.757   -0.236
211-221     0.756    0.282
220-002     0.752   -0.325
120-021     0.750    0.409

As expected, all 'r' values drop from step 2 to step 3, but by widely

varying amounts. What I think this shows is that the 'language effect'

completely dominates at step 2, and we have to remove it to be able to see

any other effects. It also shows (I think) that we would be safe at least

regarding r >=0.35 as a 'strong' correlation, but it's hard to push this

logic down to 0.3, let alone lower numbers.

Another possibility is simply to try to determine "what does 0.7 at step 2
become at step 3?" First, we can remove the remnant of the 'negative bias'
caused by having 19 categories by reducing the 'r' cut-off by 0.0556, which
brings it down to 0.6444. Then, to expand the range from 0 to +1 to -1 to

+1 all we have to do is double the difference between 1 and the 'r' value,

i.e. Step3 = 1-2*(1-Step2), or Step3 = 2*Step2-1. So, 0.6444 at step 2

becomes 0.2888 at step 3. So, does this work? Does this suggest that we

can use r >=0.2888 to indicate a strong correlation in this analysis, or is

my logic completely off the wall?
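The proposed mapping is small enough to write out directly. This just restates the arithmetic above in code form:

```python
# Proposed mapping from a step-2 cut-off to a step-3 cut-off:
# remove the residual negative bias for 19 categories, then double
# the range from [0, +1] to [-1, +1].
bias = -1 / (19 - 1)            # -0.0556 for X = 19 categories
step2_cutoff = 0.7

debiased = step2_cutoff + bias  # 0.7 - 0.0556 = 0.6444
step3_cutoff = 2 * debiased - 1 # same as 1 - 2*(1 - debiased)
print(round(step3_cutoff, 4))
```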

Any thoughts?

Dave Inglis

david@...

3538 O'Connor Drive

Lafayette, CA, USA

Synoptic-L Homepage: http://www.bham.ac.uk/theology/synoptic-l

List Owner: Synoptic-L-Owner@...