## Re: [Synoptic-L] Statistical Significance - missing data points

Message 1 of 7, Jan 23, 2002
> Ron Price writes:
>
> I don't understand this logic.
> Suppose source A has 20 occurrences of a word. How can it possibly be
> logical to count this against the correlation with source B if it has
> one occurrence, but not count it if it has zero occurrences? One
> occurrence could surely also reflect a limited corpus. This is something
> we just have to live with.

This is my view, which is why I asked for data including the zeros.
>
> ===========
>
> Dave Gentile: With the design we are using, it's true that a zero does
> contribute a little to the result, since a zero must be below the average
> frequency. But picture a design where every zero was plotted at point (0,0)
> on an x,y grid. Adding these points could have no effect on r, but would
> raise p. Our design is only a little different than that.

The problem with the (0,0) points is that in the correlations they actually
show up as (-X,-X) points, where X is the overall average frequency for that
word. As a result, they all fit on a single line of points with a +45
degree slope in the bottom left quadrant of the scatter diagrams. These
points therefore all represent a positive correlation, and therefore do
increase the value of 'r'. How much it is increased depends on the 'X'
values for each of the words, with points further from the origin increasing
'r' more.
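A quick numerical sketch of this effect (made-up frequencies, not the actual word counts from the study): appending (0,0) observations to otherwise uncorrelated positive data pulls r upward, because after mean-centering they all fall on one positively sloped line in the lower-left quadrant.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up word frequencies for two sources (NOT the actual study data):
# 30 uncorrelated observations, all positive.
x = rng.uniform(1, 5, size=30)
y = rng.uniform(1, 5, size=30)

r_without = np.corrcoef(x, y)[0, 1]

# Append 15 double-zero observations. After mean-centering they all sit
# at (-mean_x, -mean_y) in the lower-left quadrant, on one positively
# sloped line, so they pull r upward.
x_z = np.concatenate([x, np.zeros(15)])
y_z = np.concatenate([y, np.zeros(15)])
r_with = np.corrcoef(x_z, y_z)[0, 1]

print(f"r without (0,0) points: {r_without:+.3f}")
print(f"r with (0,0) points:    {r_with:+.3f}")
```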

> Dave Inglis suggests using only points where one observation is zero, but
> not both. However, I can't see how this could easily be accomplished.

This is currently a problem, for the reason given above. As Dave Gentile
reports, I wanted to include the (0,X) and (X,0) points, but not the (0,0)
points.
Unfortunately, this is not easy. As (I think) Stephen pointed out, adding a
whole bunch of (0,0) observations to something small like the 212 category
could cause problems. So, here's my plan. First, I am doing a new version
of the spreadsheet, based on including all the zeros. Then, I plan to use
the individual word counts and Excel to examine some of the correlations to
see how removing just the (0,0) points affects the results (I don't plan on
doing this for all 171 results - just some of the more interesting or
'marginal' ones). Overall, I think the best way to use the latest data is:

1 To confirm what we already know
2 To indicate new correlations where at present the data is marginal
3 To see if any existing results change

The last one is potentially a problem, since what do we do if a current
positive becomes negative? Let's hope it doesn't happen.
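The filtering step itself is simple enough once the individual word counts are in hand; a minimal sketch with invented counts (the real counts live in the spreadsheet):

```python
import numpy as np

# Made-up per-word counts for two categories (illustrative only; the
# real counts live in the spreadsheet).
a = np.array([3.0, 0, 5, 0, 2, 0, 4, 1, 0, 6])
b = np.array([2.0, 0, 4, 1, 0, 0, 3, 2, 0, 5])

# Keep the (0,X) and (X,0) observations, drop only the (0,0) pairs.
keep = ~((a == 0) & (b == 0))

r_all = np.corrcoef(a, b)[0, 1]
r_filtered = np.corrcoef(a[keep], b[keep])[0, 1]

print(f"r with all points:  {r_all:+.3f}")
print(f"r without (0,0):    {r_filtered:+.3f}")
```

With these invented counts the (0,0) pairs inflate r, so removing them lowers it, as expected from the (-X,-X) argument above.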

> So as it stands, I think Stephen's idea is probably over-conservative,
> while including all points over-states p.

So, it probably makes sense to use the original (conservative) data when
showing how particular hypotheses can be falsified, and then use the new
data to confirm what the conservative data already shows, or provide new
insights that can't be seen with the original data.

> ============
>
> Ron: There are a few quirky results amongst the new set. Perhaps these
> could be reduced by tightening the probability cut-off value.

For example, 222-121: r = -0.13457 at P = 0.000126. I haven't thought
about things like this yet. This was a marginal negative before (r
= -0.18905 at P = 0.003772), and has not gone away with the new data. This
is one that probably needs looking at in more detail.
>
> ===========
>
> Dave Gentile: That's what I would do. Unfortunately SAS does not have an
> option to look at values for p less than .0001.

No problem. Dave Gentile showed me how to calculate P to allow very small
values to be selected.
As far as I can see at the moment, you need to set the 'P' cut-off to around
0.00000000001 with the new data to get approximately the same set of
positives as we got with a 'P' cut-off of 0.0003 with the earlier data.
With the 'P' cut-off at 0.0003 with the new data you actually get far too
many correlations (I think it's 36 positives!) to see easily what is
happening.
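One generic way to get P values below a package's display floor is to compute them directly from r and n. Here is a stdlib-only Python sketch using Fisher's z approximation; this is a standard textbook formula, not necessarily the exact calculation Dave showed me.

```python
import math

def p_from_r(r, n):
    """Two-sided P for a Pearson r from n observations, via Fisher's z.

    atanh(r) is approximately normal with standard deviation
    1/sqrt(n - 3), so P values far below a display floor such as
    0.0001 can be computed directly.
    """
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2.0))

# A modest r becomes extremely significant once n is large enough.
print(p_from_r(0.13, 1000))   # far below 0.0001
print(p_from_r(0.13, 100))    # nowhere near significant
```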

> If Dave Inglis provides us with a new spreadsheet with these results
> included, we could raise the cut-off.

I should get it posted today.

> This is a judgement call, of course. I believe by setting it high enough
> we eliminate not only random effects, but minor effects not directly
> related to authorship.

Watching the order in which correlations show up as P is increased seems to
me a very effective way of seeing how the results cluster together.
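That ordering is just a sort of the result set by ascending P. A small sketch (the 222-121 figures are from this message; the other pairs and numbers are invented, since the full 171 results are in the spreadsheet):

```python
# Made-up results, except 222-121, whose r and P come from this message.
results = [
    ("202-200", +0.41, 3.2e-12),
    ("222-121", -0.13457, 1.26e-4),
    ("211-212", +0.22, 8.9e-7),
    ("102-120", +0.05, 0.21),
]

# Raising the P cut-off step by step admits correlations in ascending-P
# order; pairs that enter together suggest a cluster.
for pair, r, p in sorted(results, key=lambda row: row[2]):
    print(f"{pair}: r = {r:+.3f}, P = {p:.2e}")
```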

Dave Inglis
david@...
3538 O'Connor Drive
Lafayette, CA, USA

Synoptic-L Homepage: http://www.bham.ac.uk/theology/synoptic-l
List Owner: Synoptic-L-Owner@...