[Synoptic-L] What value of 'r' should we use in correlations?
- Brian Wilson wrote:
> A specific value has been suggested. Bruce Merrill wrote, some
> weeks ago, --
> >There is no widely accepted objective definition of a "strong"
> >correlation. A correlation of 0.7 explains only half the variability
> >in the data (i.e., if r=0.7, r**2 =0.7**2 =0.49 or approximately
> >half), so many researchers begin talking about meaningful
> >correlations when the absolute value reaches 0.7, with strong
> >correlations being somewhat greater than 0.7 (or strong negative
> >correlations being somewhat less than -0.7).
> On this view, a cut off absolute value for r of 0.3 is far too low. The
> value only *begins* to be meaningful at 0.7.
I've finally figured out why this doesn't work in this analysis. Dave
Gentile's analysis works this way:
1 For each category, count the number of times each of 807 different
words is used in the synoptic gospels. (Words used only once or twice, and
words used very frequently indeed (basically KAI and DE), are not counted.)
2 Convert each word count to a word frequency by dividing by the overall
number of words in each category.
3 Convert each word frequency to a relative frequency (or frequency
shift) by subtracting the average word frequency across all the categories.
4 Correlate the relative frequency values.
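The four steps above can be sketched in a few lines of Python. This is only a minimal illustration, not Dave Gentile's actual code: the counts and totals are made-up toy numbers, the function names are hypothetical, and it assumes that the 'average' in step 3 is the plain unweighted mean of the category frequencies. The step 1 filtering of rare and very common words is assumed to have been done already.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient r of two equal-length lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def word_frequencies(counts, totals):
    """Step 2: divide each word count by the category's total word count."""
    return {cat: [c / totals[cat] for c in row] for cat, row in counts.items()}

def frequency_shifts(freqs):
    """Step 3: subtract, word by word, the mean frequency over all categories."""
    cats = list(freqs)
    n_words = len(freqs[cats[0]])
    means = [sum(freqs[cat][w] for cat in cats) / len(cats) for w in range(n_words)]
    return {cat: [f - m for f, m in zip(freqs[cat], means)] for cat in cats}

# Step 1 (toy data, NOT real counts): one list of word counts per category,
# with the words in the same order in every list.
counts = {
    "202": [12, 3, 7, 1, 5],
    "201": [10, 4, 5, 2, 3],
    "200": [20, 5, 9, 6, 14],
}
totals = {"202": 400, "201": 300, "200": 700}  # hypothetical category sizes

freqs = word_frequencies(counts, totals)
shifts = frequency_shifts(freqs)

# Step 4: correlate a pair of categories, before and after step 3.
r_step2 = pearson(freqs["202"], freqs["201"])
r_step3 = pearson(shifts["202"], shifts["201"])
print("202-201 at step 2:", round(r_step2, 3))
print("202-201 at step 3:", round(r_step3, 3))
```

Correlating `freqs` directly corresponds to stopping at step 2; correlating `shifts` corresponds to the full procedure.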
Some people have questioned why step 3 is necessary, and it is indeed
perfectly possible to correlate the frequencies produced by step 2. In
fact, if you do that, you get correlation coefficients (r) ranging from r =
0.397 for 212-021 to r = 0.890 for 112-002 (and note that ALL the 'r' values
are positive). Now, on this basis I think it would be perfectly fair to say
that 112-002 really is a 'strong' correlation. So, why have we been
performing step 3 at all, and getting confused over whether r = 0.3 means
anything, and also what negative values of r really mean?
The answer is that step 3 is trying to remove that part of the positive
correlation that is due to all the categories having been written in Greek.
In other words, common Greek words in one category tend to also be common in
other categories, and uncommon Greek words in one category tend to be
uncommon in other categories. As a result, there is a 'built in' positive
correlation despite different people having written (or edited) different
categories, because they all used Greek. What step 3 does is not only to
remove the 'language' positive, but in the process expand the differences
that are left to make it easier to see the remaining correlations. This is
basically what happens in the PCA (Principal Component Analysis) also
performed by Dave Gentile. As Dave reported, the 'Prin1' column in the PCA
results indicates that the biggest correlating factor is language (it's all
Greek), and only when that factor is removed can you then see other
differences. For example, once language is removed, Prin2 says that the
biggest factor determining correlations is Mark vs. non-Mark.
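The 'language' effect can be illustrated with simulated data. The sketch below is an assumption-laden toy, not the real word counts: every category shares the same underlying word frequencies (the 'Greek' baseline) plus a little independent noise, and the sizes and noise level are arbitrary choices. It is meant only to show that a shared baseline produces a high positive 'r' before step 3, and that subtracting the per-word average removes it.

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient r of two equal-length lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

random.seed(0)
N_WORDS = 2000   # arbitrary toy size
N_CATS = 5       # arbitrary number of categories

# Shared 'language' baseline: how common each word is in Greek generally.
base = [random.uniform(0.0, 0.01) for _ in range(N_WORDS)]

# Each category = baseline + small independent noise (no real authorship signal).
cats = [[b + random.gauss(0.0, 0.0005) for b in base] for _ in range(N_CATS)]

# Step 3: subtract the per-word average across all categories.
means = [sum(col) / N_CATS for col in zip(*cats)]
shifted = [[f - m for f, m in zip(cat, means)] for cat in cats]

r_raw = pearson(cats[0], cats[1])            # dominated by the shared baseline
r_shifted = pearson(shifted[0], shifted[1])  # baseline removed
print("step 2 style r:", round(r_raw, 3))
print("step 3 style r:", round(r_shifted, 3))
```

In this toy the categories have nothing in common except the baseline, so the high step 2 value is entirely a 'language' effect, and what is left after step 3 is slightly negative rather than zero.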
Now, the mechanism used by Dave Gentile to remove the 'language' effect is
Step 3. By subtracting the average word frequencies he creates values that
are positive where the frequency is greater than the average, and negative
where the frequency is less than the average. As a result, values of 'r'
that were in the range 0 to +1 now become values in the range -? to +?. The
'?'s indicate that it's not simply a question of shifting all values by
(say) 0.5 and having everything in the range -0.5 to +0.5. Unfortunately
for us, every value of 'r' gets changed by a different amount!
Nevertheless, there is an overall trend of reducing the value of 'r' even
for very strong correlations found in Step 2. For example 112-002 reduces
from r = 0.89 at step 2 to 0.52 at step 3 (and even lower, 0.48, when the
'zeros' are included), and for 212-021 the value of r = 0.397 becomes r
This 'negative shift' is more easily seen by reducing the number of
categories from 19 to something smaller. For example, just take 202 and
201. At step 2 their 'r' value is 0.772. Now, if we perform step 3 on just
these two categories, for each word we calculate an average of just two
values, and for every word this average lies midway between the 202 and 201
frequency values. As a result, every single data point of value 'X' in 202
is '-X' in 201, resulting in an 'r' value of exactly -1, and this is true
for any pair of categories we choose. Whatever 'r' value we got at step 2,
the 'r' value at step 3 is always -1. Now, as we add more categories this
'negative bias' effect is reduced. For example, when the average is
calculated for 202, 201, and 200, we get step 3 'r' values as follows:
202-201 = -0.355, 202-200 = -0.674, 201-200 = -0.452. As more categories
are included in the calculation of the average the 'r' values climb from
their '-1' start, until with all 19 categories 'r' reaches the values we are
familiar with. However, the negative bias hasn't gone away. Instead, it's
just been reduced. Dave Gentile calculated the value of the bias
as -1/(X-1) (where X is the number of categories), so for X=2 the bias
is -1, for X=3 the bias is -0.5, for X=4 it's -0.333, and for X=19
it's -0.0556.
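The -1/(X-1) bias can be checked numerically. The following is a rough simulation under stated assumptions rather than a reproduction of the real analysis: the 'frequencies' are independent uniform random numbers (so any correlation after step 3 is pure bias), and the function name, word count and seed are arbitrary.

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient r of two equal-length lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def step3_r(n_categories, n_words=20000, seed=1):
    """r between the first two of n_categories unrelated random categories,
    after subtracting the per-word average across all of them (step 3)."""
    rng = random.Random(seed)
    cats = [[rng.random() for _ in range(n_words)] for _ in range(n_categories)]
    means = [sum(col) / n_categories for col in zip(*cats)]
    a = [f - m for f, m in zip(cats[0], means)]
    b = [f - m for f, m in zip(cats[1], means)]
    return pearson(a, b)

for x in (2, 3, 4, 19):
    # Simulated r next to the predicted bias -1/(X-1).
    print(x, round(step3_r(x), 3), round(-1 / (x - 1), 3))
```

With two categories the result is -1 up to floating-point error, whatever the data; with more categories the simulated value should sit close to -1/(X-1), with some sampling scatter.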
Overall, step 3 affects the 'r' values in two different ways.
1 In the first place, because one of the major factors causing a
correlation (language) has been removed, all values of 'r' are changed.
2 In the second place, the step 3 mechanism doubles the potential range
of 'r' values, changing it from 0 to +1, to -1 to +1.
What all the above tells us is that because of step 3, it is completely
unrealistic to use an 'r >=0.7' cut off to determine whether there is a
strong correlation between the categories or not. However, the above still
doesn't tell us what the correct value is. One approach might be to take
all the 'step 2' values with r >= 0.8 and see what they become at step 3
(but only for those where P<=0.0003):
Pair      Step 2   Step 3
112-002   0.890    0.519
121-021   0.853    0.693
201-200   0.849    0.349
202-200   0.848    0.397
121-120   0.841    0.442
222-220   0.828    0.484
211-210   0.820    0.497
221-121   0.813    0.357
120-020   0.810    0.461
This suggests that values down to at least r = 0.35 must be regarded as
valid. What about going down to 'step 2' values of r >=0.75?
Pair      Step 2   Step 3
202-102   0.797    0.316
222-002   0.773   -0.333
202-201   0.772    0.349
222-221   0.766    0.244
120-002   0.760   -0.267
121-002   0.757   -0.236
211-221   0.756    0.282
220-002   0.752   -0.325
120-021   0.750    0.409
As expected, all 'r' values drop from step 2 to step 3, but by widely
varying amounts. What I think this shows is that the 'language effect'
completely dominates at step 2, and we have to remove it to be able to see
any other effects. It also shows (I think) that we would be safe at least
regarding r >=0.35 as a 'strong' correlation, but it's hard to push this
logic down to 0.3, let alone lower numbers.
Another possibility is simply to try to determine "what does 0.7 at step 2
become at step 3"? First, we can remove the remnant of the 'negative bias'
caused by having 19 categories by reducing the 'r' cut off by 0.0556, which
brings it down to 0.6444. Then to expand the range from 0 to +1 to -1 to
+1 all we have to do is double the difference between 1 and the 'r' value,
i.e. Step3 = 1-2*(1-Step2), or Step3 = 2*Step2-1. So, 0.6444 at step 2
becomes 0.2888 at step 3. So, does this work? Does this suggest that we
can use r >=0.2888 to indicate a strong correlation in this analysis, or is
my logic completely off the wall?
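The arithmetic of this proposal can be written out as a small function. It simply encodes the heuristic suggested above (subtract the 1/(X-1) bias, then stretch 0..+1 onto -1..+1); it is not an established statistical correction, and the function name is hypothetical.

```python
def proposed_step3_cutoff(step2_cutoff, n_categories):
    """Map a step 2 'r' cut off to a step 3 cut off using the heuristic above."""
    bias = 1.0 / (n_categories - 1)      # magnitude of the negative bias
    adjusted = step2_cutoff - bias       # 0.7 - 0.0556 for 19 categories
    return 2.0 * adjusted - 1.0          # Step3 = 2*Step2 - 1

cutoff = proposed_step3_cutoff(0.7, 19)
print(round(cutoff, 4))
```

Without rounding the intermediate value the result is about 0.2889; the 0.2888 quoted above comes from rounding the intermediate value to 0.6444 first.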
3538 O'Connor Drive
Lafayette, CA, USA
Synoptic-L Homepage: http://www.bham.ac.uk/theology/synoptic-l
List Owner: Synoptic-L-Owner@...