
[Synoptic-L] What value of 'r' should we use in correlations?

  • David Inglis, Message 1 of 1, Jan 28, 2002
      Brian Wilson wrote:

      > A specific value has been suggested. Bruce Merrill wrote, some
      > weeks ago, --
      > >
      > >There is no widely accepted objective definition of a "strong"
      > >correlation. A correlation of 0.7 explains only half the variability
      > >in the data (i.e., if r=0.7, r**2 =0.7**2 =0.49 or approximately
      > >half), so many researchers begin talking about meaningful
      > >correlations when the absolute value reaches 0.7, with strong
      > >correlations being somewhat greater than 0.7 (or strong negative
      > >correlations being somewhat less than -0.7).
      > >
      > On this view, a cut-off absolute value for r of 0.3 is far too low. The
      > value only *begins* to be meaningful at 0.7.

      I've finally figured out why this doesn't work in this analysis. Dave
      Gentile's analysis works this way:

      1. For each category, count the number of times each of 807 different
      words is used in the synoptic gospels (words used only once or twice,
      and words used very frequently indeed, basically KAI and DE, are not
      counted).
      2. Convert each word count to a word frequency by dividing by the
      overall number of words in each category.
      3. Convert each word frequency to a relative frequency (or frequency
      shift) by subtracting the average word frequency across all the
      categories.
      4. Correlate the relative frequency values.
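
The four steps can be sketched in Python. The word list and counts below are made up purely for illustration (the real analysis uses 807 words across 19 categories), and the category labels are borrowed from the post:

```python
from statistics import mean

def pearson(xs, ys):
    """Sample Pearson correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)
    return cov / var ** 0.5

# Step 1: raw word counts per category (made-up numbers).
counts = {
    "202": {"LEGW": 30, "OUN": 12, "TOTE": 8},
    "201": {"LEGW": 25, "OUN": 15, "TOTE": 2},
    "200": {"LEGW": 40, "OUN": 5,  "TOTE": 10},
}

# Step 2: word frequency = count / total words in that category.
freqs = {cat: {w: n / sum(ws.values()) for w, n in ws.items()}
         for cat, ws in counts.items()}

# Step 3: relative frequency = frequency minus the average frequency
# of that word across all categories.
words = list(next(iter(freqs.values())))
avg = {w: mean(freqs[c][w] for c in freqs) for w in words}
rel = {c: [freqs[c][w] - avg[w] for w in words] for c in freqs}

# Step 4: correlate the relative-frequency vectors for a pair of categories.
r = pearson(rel["202"], rel["201"])
print(round(r, 3))
```

Note that after step 3 each category's relative frequencies sum to zero (every frequency above average is balanced by one below), which is what opens the door to negative 'r' values.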

      Some people have questioned why step 3 is necessary, and it is indeed
      perfectly possible to correlate the frequencies produced by step 2. In
      fact, if you do that, you get correlation coefficients (r) ranging from r =
      0.397 for 212-021 to r = 0.890 for 112-002 (and note that ALL the 'r' values
      are positive). Now, on this basis I think it would be perfectly fair to say
      that 112-002 really is a 'strong' correlation. So, why have we been
      performing step 3 at all, and getting confused over whether r = 0.3 means
      anything, and also what negative values of r really mean?

      The answer is that step 3 is trying to remove that part of the positive
      correlation that is due to all the categories having been written in Greek.
      In other words, common Greek words in one category tend to also be common in
      other categories, and uncommon Greek words in one category tend to be
      uncommon in other categories. As a result, there is a 'built in' positive
      correlation despite different people having written (or edited) different
      categories, because they all used Greek. Step 3 not only removes this
      'language' positive correlation but, in the process, expands the
      differences that are left, making it easier to see the remaining
      correlations. This is basically what happens in the PCA (Principal
      Component Analysis) also
      performed by Dave Gentile. As Dave reported, the 'Prin1' column in the PCA
      results indicates that the biggest correlating factor is language (it's all
      Greek), and only when that factor is removed can you then see other
      differences. For example, once language is removed, Prin2 says that the
      biggest factor determining correlations is Mark vs. non-Mark.
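
The 'built-in' language correlation is easy to simulate. In this sketch (purely illustrative random data, not Dave Gentile's counts), two hypothetical categories share one underlying 'Greek' frequency profile plus independent author-specific deviations; the shared profile alone produces a large positive r even though the deviations themselves are uncorrelated:

```python
import random
from statistics import mean

def pearson(xs, ys):
    """Sample Pearson correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)
    return cov / var ** 0.5

random.seed(4)
greek = [random.random() for _ in range(500)]   # shared language profile
dev_a = [0.3 * random.random() for _ in greek]  # author A's own usage
dev_b = [0.3 * random.random() for _ in greek]  # author B's own usage
cat_a = [g + d for g, d in zip(greek, dev_a)]
cat_b = [g + d for g, d in zip(greek, dev_b)]

print(round(pearson(cat_a, cat_b), 2))  # large positive, from language alone
print(round(pearson(dev_a, dev_b), 2))  # near zero: the deviations don't correlate
```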

      Now, the mechanism used by Dave Gentile to remove the 'language' effect is
      Step 3. By subtracting the average word frequencies he creates values that
      are positive where the frequency is greater than the average, and negative
      where the frequency is less than the average. As a result, values of 'r'
      that were in the range 0 to +1 now become values in the range -? to +?. The
      '?'s indicate that it's not simply a question of shifting all values by
      (say) 0.5 and having everything in the range -0.5 to +0.5. Unfortunately
      for us, every value of 'r' gets changed by a different amount!
      Nevertheless, there is an overall trend of reducing the value of 'r' even
      for very strong correlations found in Step 2. For example 112-002 reduces
      from r = 0.89 at step 2 to 0.52 at step 3 (and even lower, 0.48, when the
      'zeros' are included), and for 212-021 the value of r = 0.397 becomes r
      = -0.21!

      This 'negative shift' is more easily seen by reducing the number of
      categories from 19 to something smaller. For example, just take 202 and
      201. At step 2 their 'r' value is 0.772. Now, if we perform step 3 on just
      these two categories, for each word we calculate an average of just two
      values, and for every word this average lies midway between the 202 and 201
      frequency values. As a result, every single data point of value 'X' in 202
      is '-X' in 201, resulting in an 'r' value of exactly -1, and this is true
      for any pair of categories we choose. Whatever 'r' value we got at step 2,
      the 'r' value at step 3 is always -1. Now, as we add more categories this
      'negative bias' effect is reduced. For example, when the average is
      calculated for 202, 201, and 200, we get step 3 'r' values as follows:
      202-201 = -0.355, 202-200 = -0.674, 201-200 = -0.452. As more categories
      are included in the calculation of the average the 'r' values climb from
      their '-1' start, until with all 19 categories 'r' reaches the values we are
      familiar with. However, the negative bias hasn't gone away. Instead, it's
      just been reduced. Dave Gentile calculated the value of the bias
      as -1/(X-1) (where X is the number of categories), so for X=2 the bias
      is -1, for X=3 the bias is -0.5, for X=4 it's -0.333, and for X=19
      it's -0.0556.
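
This negative bias can be checked numerically. In the sketch below (random data standing in for the real word frequencies), with X = 2 the mean-subtracted columns are exact mirror images, so r = -1; with X = 3 independent columns the pairwise values cluster around the predicted bias of -1/(X-1) = -0.5:

```python
import random
from statistics import mean

def pearson(xs, ys):
    """Sample Pearson correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)
    return cov / var ** 0.5

def step3(columns):
    """Subtract the per-word average across the given categories."""
    avgs = [mean(vals) for vals in zip(*columns)]
    return [[v - a for v, a in zip(col, avgs)] for col in columns]

random.seed(3)
cols = [[random.random() for _ in range(2000)] for _ in range(3)]

# X = 2: the shifted columns are exact mirror images, so r is exactly -1.
a2, b2 = step3(cols[:2])
print(round(pearson(a2, b2), 6))   # -1.0

# X = 3: for independent columns the built-in bias is -1/(X-1) = -0.5.
a3, b3, c3 = step3(cols)
rs = [pearson(a3, b3), pearson(a3, c3), pearson(b3, c3)]
print(round(mean(rs), 2))          # close to -0.5
```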

      Overall, step 3 affects the 'r' values in two different ways.

      1 In the first place, because one of the major factors causing a
      correlation (language) has been removed, all values of 'r' are changed.
      2 In the second place, the step 3 mechanism doubles the potential range
      of 'r' values, changing it from 0 to +1, to -1 to +1.

      What all the above tells us is that because of step 3, it is completely
      unrealistic to use an 'r >= 0.7' cut-off to determine whether there is a
      strong correlation between the categories or not. However, the above still
      doesn't tell us what the correct value is. One approach might be to take
      all the 'step 2' values with r >= 0.8 and see what they become at step 3
      (but only for those where P<=0.0003):

      Pair      Step 2   Step 3
      112-002   0.890    0.519
      121-021   0.853    0.693
      201-200   0.849    0.349
      202-200   0.848    0.397
      121-120   0.841    0.442
      222-220   0.828    0.484
      211-210   0.820    0.497
      221-121   0.813    0.357
      120-020   0.810    0.461

      This suggests that values down to at least r = 0.35 must be regarded as
      valid. What about going down to 'step 2' values of r >=0.75?

      Pair      Step 2   Step 3
      202-102   0.797    0.316
      222-002   0.773   -0.333
      202-201   0.772    0.349
      222-221   0.766    0.244
      120-002   0.760   -0.267
      121-002   0.757   -0.236
      211-221   0.756    0.282
      220-002   0.752   -0.325
      120-021   0.750    0.409

      As expected, all 'r' values drop from step 2 to step 3, but by widely
      varying amounts. What I think this shows is that the 'language effect'
      completely dominates at step 2, and we have to remove it to be able to see
      any other effects. It also shows (I think) that we would be safe at least
      regarding r >=0.35 as a 'strong' correlation, but it's hard to push this
      logic down to 0.3, let alone lower numbers.

      Another possibility is simply to try to determine "what does 0.7 at step
      2 become at step 3?". First, we can remove the remnant of the 'negative
      bias' caused by having 19 categories by reducing the 'r' cut-off by
      0.0556, which brings it down to 0.6444. Then to expand the range from 0
      to +1 to -1 to
      +1 all we have to do is double the difference between 1 and the 'r' value,
      i.e. Step3 = 1-2*(1-Step2), or Step3 = 2*Step2-1. So, 0.6444 at step 2
      becomes 0.2888 at step 3. So, does this work? Does this suggest that we
      can use r >=0.2888 to indicate a strong correlation in this analysis, or is
      my logic completely off the wall?
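
The arithmetic in that paragraph, written out (using the unrounded bias -1/18 rather than the rounded 0.0556, which shifts the last digit slightly):

```python
n_categories = 19
step2_cutoff = 0.7

bias = -1 / (n_categories - 1)     # -0.0556 for 19 categories
adjusted = step2_cutoff + bias     # about 0.6444
step3_cutoff = 2 * adjusted - 1    # expand the range from 0..+1 to -1..+1

print(round(step3_cutoff, 4))      # 0.2889 (0.2888 with the rounded bias)
```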

      Any thoughts?

      Dave Inglis
      3538 O'Connor Drive
      Lafayette, CA, USA

      Synoptic-L Homepage: http://www.bham.ac.uk/theology/synoptic-l
      List Owner: Synoptic-L-Owner@...