Re: [Synoptic-L] Statistical Significance - missing data points

From: David Inglis
Date: Jan 23, 2002
      > Ron Price writes:
      >
      > I don't understand this logic.
      > Suppose source A has 20 occurrences of a word. How can it possibly be
      > logical to count this against the correlation with source B if it has
      > one occurrence, but not count it against if it has zero occurrences? One
      > occurrence could surely also reflect a limited corpus. This is something
      > we just have to live with.

      This is my view, which is why I asked for data including the zeros.
      >
      > ===========
      >
      > Dave Gentile: With the design we are using, it's true that a zero does
      > contribute a little to the result, since a zero must be below the average
      > frequency. But picture a design where every zero was plotted at point
      > (0,0) on an x,y grid. Adding these points could have no effect on r, but
      > would raise p. Our design is only a little different than that.

      The problem with the (0,0) points is that in the correlations they actually
      show up as (-X,-X) points, where X is the overall average frequency for that
      word. As a result, they all fit on a single line of points with a +45
      degree slope in the bottom left quadrant of the scatter diagrams. These
      points therefore all represent a positive correlation, and so they
      increase the value of 'r'. How much 'r' is increased depends on the 'X'
      values for each of the words, with points further from the origin increasing
      'r' more.
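      To make the effect concrete, here is a small Python sketch using made-up
      Poisson "word counts" (not our actual data): it computes Pearson's r for
      two unrelated count series, then again after appending a block of (0,0)
      observations, and r (and the apparent significance) typically jumps.

          import numpy as np
          from scipy import stats

          rng = np.random.default_rng(0)
          x = rng.poisson(5, size=50).astype(float)   # hypothetical counts, source A
          y = rng.poisson(5, size=50).astype(float)   # hypothetical counts, source B
          r_before, p_before = stats.pearsonr(x, y)

          # Words that occur in neither source contribute (0,0) observations,
          # which after mean-centring all sit at (-mean_x, -mean_y).
          x_all = np.concatenate([x, np.zeros(30)])
          y_all = np.concatenate([y, np.zeros(30)])
          r_after, p_after = stats.pearsonr(x_all, y_all)

          print(f"without (0,0): r = {r_before:+.3f}, p = {p_before:.3g}")
          print(f"with    (0,0): r = {r_after:+.3f}, p = {p_after:.3g}")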

      > Dave Inglis suggests using only points where one observation is zero, but
      > not both. However, I can't see how this could easily be accomplished.

      This is currently a problem, for the reason above. As Dave Gentile reports,
      I wanted to include the (0,X) or (X,0) points, but not the (0,0).
      Unfortunately, this is not easy. As (I think) Stephen pointed out, adding a
      whole bunch of (0,0) observations to something small like the 212 category
      could cause problems. So, here's my plan. First, I am doing a new version
      of the spreadsheet, based on including all the zeros. Then, I plan to use
      the individual word counts and Excel to examine some of the correlations to
      see how removing just the (0,0) points affects the results (I don't plan on
      doing this for all 171 results - just some of the more interesting or
      'marginal' ones). Overall, I think the best way to use the latest data is:

      1 To confirm what we already know
      2 To indicate new correlations where at present the data is marginal
      3 To see if any existing results change

      The last one is potentially a problem: what do we do if a currently
      positive result turns negative with the new data? Let's hope it doesn't
      happen.
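      For anyone who wants to try the same check outside Excel, here is a rough
      Python sketch of the idea (the counts below are invented, not the real
      word data): keep the (0,X) and (X,0) rows, drop only the rows that are
      zero in both categories, and recompute r and P.

          import numpy as np
          from scipy import stats

          def r_without_double_zeros(counts_a, counts_b):
              a = np.asarray(counts_a, dtype=float)
              b = np.asarray(counts_b, dtype=float)
              keep = ~((a == 0) & (b == 0))      # drop only the (0,0) rows
              return stats.pearsonr(a[keep], b[keep])

          # Hypothetical per-word counts for two categories:
          counts_a = [3, 0, 1, 0, 7, 0, 2, 0, 0, 5]
          counts_b = [2, 0, 0, 1, 5, 0, 3, 0, 0, 4]
          print(stats.pearsonr(counts_a, counts_b))          # all rows, (0,0) included
          print(r_without_double_zeros(counts_a, counts_b))  # (0,0) rows removed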

      > So as it stands, I think Stephen's idea is probably over-conservative,
      > while including all points over-states p.

      So, it probably makes sense to use the original (conservative) data when
      showing how particular hypotheses can be falsified, and then to use the new
      data either to confirm what the conservative data already shows, or to
      provide new insights that can't be seen in the original data.

      > ============
      >
      > Ron: There are a few quirky results amongst the new set. Perhaps these
      > could be reduced by tightening the probability cut-off value.

      For example, 222-121: r = -0.13457 at P = 0.000126. I haven't thought
      about things like this yet. This was a marginal negative before
      (r = -0.18905 at P = 0.003772), and has not gone away with the new data.
      This is one that probably needs looking at in more detail.
      >
      > ===========
      >
      > Dave Gentile: That's what I would do. Unfortunately SAS does not have an
      > option to look at values for p less than .0001

      No problem. Dave Gentile showed me how to calculate P to allow very small
      values to be selected.
      As far as I can see at the moment, you need to set the 'P' cut-off to around
      0.00000000001 with the new data to get approximately the same set of
      positives as we got with a 'P' cut-off of 0.0003 with the earlier data.
      With the 'P' cut-off at 0.0003 on the new data you actually get far too
      many correlations (I think it's 36 positives!) to easily see what is
      happening.
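      For anyone who wants to reproduce this outside SAS: one standard way to
      get P-values below the .0001 display floor (not necessarily the exact
      formula Dave Gentile passed along) is to convert r to a t statistic and
      use the t-distribution with n - 2 degrees of freedom, where n is the
      number of paired word observations.

          import math
          from scipy import stats

          def p_from_r(r, n):
              """Two-tailed P-value for Pearson's r with n paired observations."""
              t = r * math.sqrt((n - 2) / (1.0 - r * r))
              return 2.0 * stats.t.sf(abs(t), df=n - 2)

          # Purely illustrative values, not taken from the spreadsheet:
          print(p_from_r(0.25, 400))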

      > If Dave Inglis provides us with a new spreadsheet with these results
      > included, we could raise the cut-off.

      I should get it posted today.

      > This is a judgement call, of course. I believe by setting it high enough
      > we eliminate not only random effects, but minor effects not directly
      > related to authorship.

      Watching the order in which correlations show up as the 'P' cut-off is
      increased seems to me a very effective way of seeing how the results
      cluster together.
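      A trivial way to look at that ordering, once the r and P values are
      tabulated, is simply to sort the category pairs by P (the pair names and
      numbers below are made up):

          results = {
              ("A", "B"): 1.2e-12,
              ("A", "C"): 3.1e-7,
              ("B", "C"): 1.3e-4,
          }
          for (x, y), p in sorted(results.items(), key=lambda kv: kv[1]):
              print(f"{x}-{y}: P = {p:.3g}")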

      Dave Inglis
      david@...
      3538 O'Connor Drive
      Lafayette, CA, USA






      Synoptic-L Homepage: http://www.bham.ac.uk/theology/synoptic-l
      List Owner: Synoptic-L-Owner@...