Re: [Synoptic-L] Statistical Significance - missing data points
> Ron Price writes:
> I don't understand this logic.
> Suppose source A has 20 occurrences of a word. How can it possibly be
> logical to count this against the correlation with source B if it has
> one occurrence, but not count it against if it has zero occurrences? One
> occurrence could surely also reflect a limited corpus. This is something
> we just have to live with.

This is my view, which is why I asked for data including the zeros.

> Dave Gentile: With the design we are using, it's true that a zero does
> little to the result, since a zero must be below the average frequency. But
> picture a design where every zero was plotted at point (0,0) on an x,y
> grid. Adding these points could have no effect on r, but would raise p. Our
> design is only a little different than that.

The problem with the (0,0) points is that in the correlations they actually
show up as (-X,-X) points, where X is the overall average frequency for that
word. As a result, they all fit on a single line of points with a +45
degree slope in the bottom left quadrant of the scatter diagrams. These
points therefore all represent a positive correlation, and therefore do
increase the value of 'r'. How much it is increased depends on the 'X'
values for each of the words, with points further from the origin increasing

> Dave Inglis suggests using only points where one observation is zero, but
> not both. However, I can't see how this could easily be accomplished.

This currently is a problem, for the above reason. As Dave Gentile reports,
I wanted to include the (0,X) or (X,0) points, but not the (0,0).
Unfortunately, this is not easy. As (I think) Stephen pointed out, adding a
whole bunch of (0,0) observations to something small like the 212 category
could cause problems. So, here's my plan. First, I am doing a new version
of the spreadsheet, based on including all the zeros. Then, I plan to use
the individual word counts and Excel to examine some of the correlations to
see how removing just the (0,0) points affects the results (I don't plan on
doing this for all 171 results - just some of the more interesting or
'marginal' ones). Overall, I think the best way to use the latest data is:
1 To confirm what we already know
2 To indicate new correlations where at present the data is marginal
3 To see if any existing results change
The last one is potentially a problem, since what do we do if a current
positive becomes negative? Let's hope it doesn't happen.
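
As a rough illustration of the comparison planned above (the same correlation with and without the (0,0) points), here is a minimal Python sketch. It assumes each word is plotted as its frequency in each of two categories minus that word's overall average frequency X, so that a word absent from both categories lands at (-X,-X). The word rows and the function names are invented for illustration; they are not taken from the spreadsheet.

```python
import math

def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical rows, one per word:
# (frequency in category A, frequency in category B, overall average X)
words = [
    (0.030, 0.010, 0.020),
    (0.005, 0.015, 0.010),
    (0.012, 0.000, 0.008),  # present in A only: kept in both runs
    (0.000, 0.009, 0.006),  # present in B only: kept in both runs
    (0.000, 0.000, 0.004),  # absent from both: plotted at (-X,-X)
    (0.000, 0.000, 0.015),  # absent from both: plotted at (-X,-X)
    (0.000, 0.000, 0.025),  # absent from both: plotted at (-X,-X)
]

def deviations(rows):
    """Turn raw frequencies into deviations from each word's overall average."""
    xs = [a - avg for a, b, avg in rows]
    ys = [b - avg for a, b, avg in rows]
    return xs, ys

def r_with_and_without_double_zeros(rows):
    """Return (r using every word, r after dropping words absent from both)."""
    r_all = pearson_r(*deviations(rows))
    kept = [row for row in rows if not (row[0] == 0 and row[1] == 0)]
    r_kept = pearson_r(*deviations(kept))
    return r_all, r_kept

r_all, r_kept = r_with_and_without_double_zeros(words)
print(f"all points: r = {r_all:.3f}; without (0,0): r = {r_kept:.3f}")
```

With these made-up rows the double-zero words all sit on the +45 degree line in the bottom left quadrant, so including them pulls r upward compared with the run that drops them.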

> So as it stands, I think Stephen's idea is probably over-conservative,
> while including all points over-states p.

So, it probably makes sense to use the original (conservative) data when
showing how particular hypotheses can be falsified, and then use the new
data to confirm what the conservative data already shows, or provide new
insights that can't be seen with the original data.

> ============
> Ron: There are a few quirky results amongst the new set. Perhaps these
> could be reduced by tightening the probability cut-off value.

For example, 222-121: r = -0.13457 at P = 0.000126. I haven't thought
about things like this yet. This was a marginal negative before (r
= -0.18905 at P = 0.003772), and has not gone away with the new data. This
is one that probably needs looking at in more detail.

> Dave Gentile: That's what I would do. Unfortunately SAS does not have an
> look at values for p less than .0001

No problem. Dave Gentile showed me how to calculate P to allow very small
values to be selected.
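
For anyone wanting to reproduce very small P values outside SAS or Excel, one possible approach is sketched below in Python. This is not necessarily the calculation Dave Gentile supplied: it is the standard Fisher z transformation with a normal approximation, and the function name and the choice of approximation are my own assumptions. It needs the number of points n behind each r, which has to come from the data.

```python
import math

def approx_two_sided_p(r, n):
    """Approximate two-sided P for a Pearson r computed over n points.

    Assumption: Fisher z transformation with a normal approximation,
    which is reasonable for large n but is not an exact t-test.
    """
    if n <= 3 or abs(r) >= 1:
        raise ValueError("need n > 3 and -1 < r < 1")
    z = math.atanh(r) * math.sqrt(n - 3)
    # erfc keeps precision in the far tail, so tiny P values do not round to 0
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical inputs, chosen only to show a P far below 0.0001
print(approx_two_sided_p(0.5, 200))
```

Because the tail is computed with `math.erfc` rather than as one minus a cumulative probability, cut-offs far below 0.0001 can still be told apart.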
As far as I can see at the moment, you need to set the 'P' cut-off to around
0.00000000001 with the new data to get approximately the same set of
positives as we got with a 'P' cut-off of 0.0003 with the earlier data.
With the 'P' cut-off at 0.0003 with the new data you actually get far too
many correlations (I think it's 36 positives!) to see easily what is

> If Dave Inglis provides us with a new spreadsheet with these results
> included, we could raise the cut-off.

I should get it posted today.

> This is a judgement call, of course. I believe by setting it high enough we
> eliminate not only random effects, but minor effects not directly related
> to authorship.

Seeing the order in which correlations show up as P is increased seems to me
to be a very effective way of seeing how the results cluster together.
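
The idea of watching the positives appear as the cut-off is relaxed can be sketched in a few lines of Python. The pair labels and the (r, P) values below are invented purely for illustration; only the two cut-offs (0.00000000001 and 0.0003) come from the discussion above.

```python
def positives_by_cutoff(results, cutoffs):
    """Map each cut-off to the positively correlated pairs with P below it.

    Pairs are listed from smallest P to largest, i.e. in the order they
    would show up as the cut-off is raised.
    """
    ranked = sorted((p, label) for label, (r, p) in results.items() if r > 0)
    return {c: [label for p, label in ranked if p < c] for c in cutoffs}

# Hypothetical (r, P) values, not taken from the spreadsheet
results = {
    "pair-1": (0.41, 2e-15),
    "pair-2": (0.22, 4e-9),
    "pair-3": (0.12, 2e-4),
    "pair-4": (-0.13, 1e-4),  # negative r, so never counted as a positive
}

table = positives_by_cutoff(results, [1e-11, 3e-4])
for cutoff, labels in table.items():
    print(cutoff, labels)
```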
3538 O'Connor Drive
Lafayette, CA, USA
Synoptic-L Homepage: http://www.bham.ac.uk/theology/synoptic-l
List Owner: Synoptic-L-Owner@...