> Ron Price writes:
>
> I don't understand this logic. Suppose source A has 20 occurrences of a
> word. How can it possibly be logical to count this against the correlation
> with source B if it has one occurrence, but not count it against if it has
> zero occurrences? One occurrence could surely also reflect a limited
> corpus. This is something we just have to live with.

This is my view, which is why I asked for data including the zeros.

> ===========

>

> Dave Gentile: With the design we are using, it's true that a zero does
> little to the result, since a zero must be below the average frequency. But
> picture a design where every zero was plotted at point (0,0) on an x,y
> grid. Adding these points could have no effect on r, but would raise p. Our
> design is only a little different than that.

The problem with the (0,0) points is that in the correlations they actually
show up as (-X,-X) points, where X is the overall average frequency for that
word. As a result, they all fit on a single line of points with a +45
degree slope in the bottom-left quadrant of the scatter diagrams. These
points therefore all represent a positive correlation, and therefore do
increase the value of 'r'. How much it is increased depends on the 'X'
values for each of the words, with points further from the origin
increasing 'r' more.
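To make this concrete, here is a minimal Python sketch. The frequencies are invented, not the real Synoptic word counts; it just shows that appending (0,0) observations to data with positive means pulls 'r' upward, because after subtracting the means each zero pair lands in the bottom-left quadrant on a positively sloped line.

```python
import numpy as np

# Invented frequencies for five words in two sources -- NOT the real
# Synoptic word counts -- chosen so the sources are negatively correlated.
x = np.array([3.0, 1.0, 4.0, 1.0, 5.0])
y = np.array([2.0, 7.0, 1.0, 8.0, 2.0])
r_without = np.corrcoef(x, y)[0, 1]

# Add five words that occur in neither source, i.e. five (0,0) points.
x0 = np.concatenate([x, np.zeros(5)])
y0 = np.concatenate([y, np.zeros(5)])
r_with = np.corrcoef(x0, y0)[0, 1]

# After subtracting the means, each (0,0) pair sits at (-mean_x, -mean_y)
# in the bottom-left quadrant, on a positively sloped line, pulling r up.
print(f"r without zeros: {r_without:+.3f}")   # about -0.91
print(f"r with zeros:    {r_with:+.3f}")      # about +0.14
```

With these numbers a strongly negative correlation flips to a slightly positive one, which is exactly the distortion described above.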

> Dave Inglis suggests using only points where one observation is zero, but
> not both. However, I can't see how this could easily be accomplished.

This is currently a problem, for the above reason. As Dave Gentile reports,
I wanted to include the (0,X) or (X,0) points, but not the (0,0).

Unfortunately, this is not easy. As (I think) Stephen pointed out, adding a
whole bunch of (0,0) observations to something small like the 212 category
could cause problems. So, here's my plan. First, I am doing a new version
of the spreadsheet, based on including all the zeros. Then, I plan to use
the individual word counts and Excel to examine some of the correlations to
see how removing just the (0,0) points affects the results (I don't plan on
doing this for all 171 results - just some of the more interesting or
'marginal' ones). Overall, I think the best way to use the latest data is:

1 To confirm what we already know
2 To indicate new correlations where at present the data is marginal
3 To see if any existing results change

The last one is potentially a problem, since what do we do if a current
positive becomes negative? Let's hope it doesn't happen.
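For what it's worth, dropping only the (0,0) pairs while keeping the (0,X) and (X,0) ones is a one-line mask in Python/numpy. The counts below are invented for illustration; this is just a sketch of the filtering step, not the actual spreadsheet procedure.

```python
import numpy as np

# Invented per-word counts for two sources (illustration only).
counts_a = np.array([5.0, 0.0, 3.0, 0.0, 2.0, 0.0, 7.0])
counts_b = np.array([4.0, 6.0, 0.0, 0.0, 1.0, 0.0, 5.0])

# Keep (0,X) and (X,0) observations but drop the (0,0) pairs,
# i.e. words that occur in neither source.
keep = (counts_a != 0) | (counts_b != 0)
a, b = counts_a[keep], counts_b[keep]

r_all      = np.corrcoef(counts_a, counts_b)[0, 1]
r_filtered = np.corrcoef(a, b)[0, 1]
print(len(a), round(r_all, 3), round(r_filtered, 3))
```

The same mask could in principle be built with an IF() column in Excel, but it is fiddly to do for 171 category pairs, which is why only the interesting cases are worth checking by hand.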

> So as it stands, I think Stephen's idea is probably over-conservative,
> while including all points over-states p.

So, it probably makes sense to use the original (conservative) data when
showing how particular hypotheses can be falsified, and then use the new
data to confirm what the conservative data already shows, or to provide new
insights that can't be seen with the original data.

> ============

> Ron: There are a few quirky results amongst the new set. Perhaps these
> could be reduced by tightening the probability cut-off value.

For example, 222-121: r = -0.13457 at P = 0.000126. I haven't thought
about things like this yet. This was a marginal negative before
(r = -0.18905 at P = 0.003772), and has not gone away with the new data.
This is one that probably needs looking at in more detail.

> ===========
>
> Dave Gentile: That's what I would do. Unfortunately, SAS does not have an
> option to look at values for p less than .0001.

No problem. Dave Gentile showed me how to calculate P to allow very small
values to be selected.
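I don't know exactly which formula Dave Gentile used, but one standard way to get arbitrarily small P values for a Pearson r is the exact t-transformation with n - 2 degrees of freedom. A Python sketch (the n = 800 in the example is purely hypothetical):

```python
from math import sqrt
from scipy import stats

def pearson_p(r: float, n: int) -> float:
    """Two-tailed P for a Pearson correlation r over n observations,
    using the exact t-transformation with n - 2 degrees of freedom.
    Unlike a fixed display cut-off, this can return arbitrarily small P."""
    t = abs(r) * sqrt((n - 2) / (1.0 - r * r))
    return 2.0 * stats.t.sf(t, df=n - 2)

# r reported above for 222-121, with a purely hypothetical n of 800
# (I don't know the real number of observations behind that figure):
print(pearson_p(-0.13457, 800))
```

This gives the same P that a statistics package would report for the correlation, but without any lower display limit, so very small cut-offs can be applied directly.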

As far as I can see at the moment, you need to set the 'P' cut-off to around
0.00000000001 with the new data to get approximately the same set of
positives as we got with a 'P' cut-off of 0.0003 with the earlier data.
With the 'P' cut-off at 0.0003 with the new data you actually get far too
many correlations (I think it's 36 positives!) to see easily what is
happening.

> If Dave Inglis provides us with a new spreadsheet with these results
> included, we could raise the cut-off.

I should get it posted today.

> This is a judgement call, of course. I believe by setting it high enough we
> eliminate not only random effects, but minor effects not directly related
> to authorship.

Seeing the order in which correlations show up as P is increased seems to me
to be a very effective way of seeing how the results cluster together.
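As an illustration of that idea, sorting the results by ascending P lists the correlations in the order they "show up" as the cut-off is raised. Only the 222-121 row below uses figures from this thread; the other pairs and values are made up.

```python
# Invented results table: only the 222-121 row uses figures from this
# thread; the other pairs and values are placeholders for illustration.
results = [
    ("222-121", -0.13457, 0.000126),
    ("AAA-BBB", +0.21000, 0.000010),
    ("CCC-DDD", +0.08000, 0.020000),
]

# Listing results in ascending P shows the order in which correlations
# pass an increasing P cut-off.
for pair, r, p in sorted(results, key=lambda row: row[2]):
    print(f"{pair}: r = {r:+.5f}, P = {p:g}")
```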

Dave Inglis
david@...
3538 O'Connor Drive
Lafayette, CA, USA

Synoptic-L Homepage: http://www.bham.ac.uk/theology/synoptic-l
List Owner: Synoptic-L-Owner@...