> Some questions to David.

> 1] Why a subtraction and not a division ?

> > So now we can compute the frequencies by category, relative

> > to what we would expect.

> > For "the" in category "222" we get .02 - .025 = -.005.

> > "the" occurs less frequently in "222" than it does in all categories.

> > For "bottom" in category "222" we get .005 - .004 = .001

> Why do you use the subtraction, and not the division, for that operation ?

> Your justification, with the word "relative", logically implies a division.

> You said elsewhere that you need to balance the high frequency of "the"

> but you can do it only by a division, for instance :

> For "the" in category "222" we get (.02 - .025) / 0.025 = -.2

> For "bottom" in category "222" we get (.005 - .004) / 0.004 = .25

> This operation would give a better relative representation. Do you think

> For the time being, I do not understand the purpose of your subtraction.

We've already done one division to get to frequencies. I'm not sure what

impact your idea would have. However, the best way (as I did with the SAS

tool) is to compute the individual frequencies, then treat them as

variables, and treat the overall frequency as a partial variable, and do

what is called a "partial-correlation". It takes into account the standard

deviations, as well as doing, conceptually, about the same thing I did with

the subtraction.
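
A minimal sketch of what a partial correlation computes, assuming the standard first-order formula built from three Pearson correlations. This is not the SAS procedure itself, and the frequencies below are hypothetical placeholders, not the real word counts:

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def partial_correlation(x, y, z):
    """Correlation of x and y after removing the linear effect of z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Hypothetical per-word frequencies (one entry per word), for illustration only.
overall = [0.025, 0.004, 0.010, 0.002, 0.015]   # the "partial" variable
cat_a = [0.020, 0.005, 0.012, 0.001, 0.016]     # e.g. one category
cat_b = [0.022, 0.006, 0.009, 0.003, 0.013]     # e.g. another category

print("plain correlation:  ", round(pearson(cat_a, cat_b), 3))
print("partial correlation:", round(partial_correlation(cat_a, cat_b, overall), 3))
```

The plain correlation is dominated by the shared overall frequency; the partial correlation looks only at what is left once that shared part is accounted for.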

> 2] Imagine that hypothetical simple schema :

>

> A => Mt

> A + Mt => Lk

> A + Mt + Lk => Mk

> In 222, you may find :

> - words of A that have been kept by Mt, by Lk and Mk.

> - words of Mt that have been kept by Mk and Lk

>

> In 122 you may find :

> - words of A that have been changed by Mt, but kept by Lk, and Mk.

> - words of Lk, kept by Mk.

> Only for that simple schema, your basic vectors (122, 222, etc.)

> may be constituted by composite redactional features. Particularly,

> it may produce a combined correlation effect with different material

> of gospels that separately would have produced an anti-correlative

> effect.

If I understand you correctly, I agree. I've discovered many of the results

are very difficult to untangle. (See the note I just posted.)

> In other words, your data are not a complete overview of styles of

> gospels, since they do not take into account location of occurrences.

> (exactly as I cannot describe the demography of USA just by giving a

> histogram of population by state). Your data are a snapshot

> of gospel styles, where a single apparent style profile may in fact

> hide three or four very different redaction flows.

True, they are all ground up; only the frequency of the words is being

studied. Although, I'm comparing that to what I know about the structure.

> 3] What is the method you choose for clustering ?

> I guess K-means, but can you confirm it ?

Average-linkage cluster analysis. It takes each member of each cluster and

compares it to each member of the other cluster to determine distance.
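
As a rough sketch of the average-linkage distance just described, assuming Euclidean distance between members (the actual distance measure used by the clustering tool is not stated here, and the points are made up):

```python
from itertools import product
from math import dist  # Euclidean distance, Python 3.8+

def average_linkage(cluster_a, cluster_b):
    """Mean distance over every (member of A, member of B) pair."""
    pairs = list(product(cluster_a, cluster_b))
    return sum(dist(p, q) for p, q in pairs) / len(pairs)

# Hypothetical two-dimensional points standing in for category vectors.
cluster_a = [(0.0, 0.0)]
cluster_b = [(3.0, 4.0), (6.0, 8.0)]

print(average_linkage(cluster_a, cluster_b))
```

At each step the two clusters with the smallest such average distance are merged, which is what distinguishes this from K-means (no fixed number of clusters and no centroids).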

>

I'm not sure I got the first part. I agree cluster A is a bit surprising.

> Whatever the case, since it is possible to produce a correlative

> effect with a combination of anti-correlated pairs of vectors,

> your clustering is more negative than positive : you may not

> warrant that a cluster is a single style element, even if the

> Particularly, your cluster A cannot be considered as a natural

> group of close elements.

I think I now understand the 202-222 result as both being the result of

being kept by both Matthew and Luke. The 200-222 connection is not at a high

confidence level, and from a scatter plot, it's hardly there at all. I think

the 202-200 connection may be the one real significant result of all this,

however.

See my previous note about this.

> 4] Conversely, the anti-correlative phenomenon looks like

> an unexpected effect. Can you assess how significant it is ?

> And is it possible to imagine a single writer producing a

> pattern of anti-correlative elements ?

An example of what seems to happen:

22x and 12x have a negative correlation.

If Matthew dislikes a word, he will end up lowering the 22x frequency and

raising the 12x, tending to make them anti-correlate.
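
A toy sketch of that mechanism, with made-up counts and only two categories (words Matthew kept, "22x", and words he changed, "12x"). With just two categories the deviations are exactly anti-correlated; with the real set of categories the effect would be weaker:

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical occurrence counts for four words (not real data).
# A word Matthew dislikes (the second one) is mostly changed, rarely kept.
kept_22x = [90, 8, 42, 10]
changed_12x = [10, 32, 18, 10]

total_kept, total_changed = sum(kept_22x), sum(changed_12x)
total = total_kept + total_changed

# Frequency of each word within each category, and over everything.
freq_22x = [k / total_kept for k in kept_22x]
freq_12x = [c / total_changed for c in changed_12x]
freq_all = [(k + c) / total for k, c in zip(kept_22x, changed_12x)]

# Subtract the overall frequency, as in the method discussed above.
dev_22x = [f - o for f, o in zip(freq_22x, freq_all)]
dev_12x = [f - o for f, o in zip(freq_12x, freq_all)]

print(round(pearson(dev_22x, dev_12x), 3))
```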

I thought about your first suggestion here, a little more.

The current method gives more weight to common words. Your system would give

each word an equal weight. I think that might lead to more noise, since low

frequency words might not be very well distributed.
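
To make the weighting point concrete, here is a minimal sketch using the two words and frequencies from the example quoted earlier. The function names are only illustrative:

```python
def difference(category_freq, overall_freq):
    """Current method: plain subtraction of the overall frequency."""
    return category_freq - overall_freq

def relative_difference(category_freq, overall_freq):
    """Suggested alternative: the same difference, divided by the overall frequency."""
    return (category_freq - overall_freq) / overall_freq

# (frequency in category "222", frequency over all categories)
words = {"the": (0.02, 0.025), "bottom": (0.005, 0.004)}

for word, (cat, overall) in words.items():
    print(word,
          round(difference(cat, overall), 4),
          round(relative_difference(cat, overall), 4))
```

With subtraction, "the" moves five times as far as "bottom" (.005 against .001 in size); with the division, "bottom" moves further (.25 against .2), even though it rests on far fewer occurrences.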

OK. But in that case, what is the purpose of this subtraction ?

I still do not understand it.

Without the subtraction we'd be asking:

"Do these documents have similar word frequencies?"

They do, because they are both samples of Greek language.

With the subtraction, we are asking:

"Do these documents depart from the average Greek language frequency, in a

similar way?"

Does that help any?
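
A small sketch of the two questions, using hypothetical numbers: two documents built from the same baseline frequencies but departing from it in opposite directions. The baseline and the departures are invented for illustration:

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical "average Greek" frequencies for four words.
baseline = [0.050, 0.020, 0.010, 0.005]
# Hypothetical departures: document B departs in the opposite way to A.
shift = [0.002, -0.002, 0.001, -0.001]

doc_a = [b + s for b, s in zip(baseline, shift)]
doc_b = [b - s for b, s in zip(baseline, shift)]

# Without the subtraction: "do these documents have similar word frequencies?"
raw = pearson(doc_a, doc_b)

# With the subtraction: "do they depart from the average in a similar way?"
dev_a = [f - b for f, b in zip(doc_a, baseline)]
dev_b = [f - b for f, b in zip(doc_b, baseline)]
departures = pearson(dev_a, dev_b)

print(round(raw, 3), round(departures, 3))
```

The raw frequencies correlate strongly simply because both documents share the baseline, while the departures show that the two documents actually differ.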

