All the details (for my investigation) can be found here:
Dave I. is using the same data and categories, and is experimenting with different subsets of the vocabulary items, different "normalizations" of the data, and a simpler statistical technique, which at least has the virtue of being easier to explain than what I did.
The study was done 7 or 8 years ago, and I've made no serious effort to publish it. NT studies is not an area in which I'm formally trained, which makes things difficult. I do, however, have a number of quantitative degrees and am a statistician by profession, and I still consider the statistical methods used there to be up to snuff.
Again, please visit the site for details, but I'll give some quick answers here:
a) The categories are described here:
and in detail here:
b) I used 800 or so vocabulary items, essentially the 800 most common. One thing Dave I. has been looking at is using a much smaller subset of very high-frequency words, based on the idea that these are function words. If I run my method with both sets of data, the results are very similar. And while I see the point of focusing on function words, I think content words might also help us tell authors apart: authors with different agendas might tend to focus more on certain content words than other authors would.
Some technical details which can be safely skipped:
Dave I. is using correlation, which assumes normally distributed data. My study makes things more complicated, and a lot more computer intensive, by using Poisson distributions. At high frequencies a Poisson and a Normal distribution look much the same, but they diverge at low frequencies. A Normal distribution can take non-integer and negative values; a Poisson distribution describes counts, which must be non-negative integers. So the difference matters most when dealing with low counts of, say, zero and one.
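To make the divergence concrete, here is a small sketch (not the study's actual code) comparing the Poisson probability of a count with the matching Normal approximation, at a low mean and a high mean. All numbers here are illustrative, not taken from the texts.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def normal_pdf(x, mean, var):
    """Density of a Normal distribution with the given mean and variance."""
    return math.exp(-(x - mean)**2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Low mean (rare word): the two distributions clearly disagree at counts 0-2.
# High mean (common word): near the mean they are almost indistinguishable.
for lam, counts in ((0.5, [0, 1, 2]), (50.0, [48, 50, 52])):
    print(f"mean = {lam}")
    for k in counts:
        p = poisson_pmf(k, lam)
        n = normal_pdf(k, lam, lam)  # Normal with matched mean and variance
        print(f"  count {k}: Poisson {p:.4f}  Normal approx {n:.4f}")
```

At mean 0.5 the Poisson puts about 0.61 of its probability on a count of zero, while the Normal approximation gives a noticeably different value there; at mean 50 the two agree to about three decimal places near the mean, which is the point made above.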
Dave I.'s results do change when moving from the full sample to the high-frequency sample. His results for the high-frequency words are similar to mine.
c) Yes, there is a significance test. And I've stuck with a cutoff value suggested by Stephen Carlson a number of years ago on this list.
d) I'm not familiar with the reference to 'Stepdisc'. But I think the i's are dotted and the t's crossed, statistically speaking. For the equivalent of a significance test, I use a "likelihood ratio test" with one adjustable parameter.
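For readers unfamiliar with the term, here is a minimal sketch of a likelihood ratio test with one adjustable parameter, applied to Poisson counts. This is an illustration of the general technique, not the study's own code, and the counts below are made up: the null model fits one shared rate to both texts, the alternative fits each text its own rate, and twice the log-likelihood gap is compared to a chi-square cutoff with one degree of freedom (3.841 at the 5% level).

```python
import math

def poisson_loglik(counts, lam):
    """Log-likelihood of a list of Poisson counts at rate lam."""
    return sum(-lam + k * math.log(lam) - math.log(math.factorial(k))
               for k in counts)

# Hypothetical counts of one vocabulary item per block in two texts.
text_a = [3, 1, 2, 4, 2]
text_b = [0, 1, 0, 1, 0]

# Null model: one shared rate. Alternative: one extra adjustable
# parameter, letting each text have its own rate.
shared = sum(text_a + text_b) / len(text_a + text_b)
rate_a = sum(text_a) / len(text_a)
rate_b = sum(text_b) / len(text_b)

ll_null = poisson_loglik(text_a + text_b, shared)
ll_alt = poisson_loglik(text_a, rate_a) + poisson_loglik(text_b, rate_b)

# The statistic is approximately chi-square with 1 degree of freedom.
lr_stat = 2 * (ll_alt - ll_null)
print(f"LR statistic = {lr_stat:.2f}  (reject a shared rate at 5% if > 3.841)")
```

With these made-up counts the statistic comes out near 7.9, above the 5% cutoff, so the test would judge the two texts' rates for this word to differ.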
--- In Synoptic@yahoogroups.com, David Mealand <D.Mealand@...> wrote:
> I can understand that the authors of the study
> may well wish to wait till the whole study is published
> before releasing specific details, but it would
> be helpful to know one or two things about the
> statistical method, without prejudicing release of
> specific details.
> a) Were the data partitioned?
> In other words were each of the "blocks" divided
> into samples, so that "within group" variance could
> be compared with "between group" variance?
> b) If the tests were based on "vocabulary" did this
> focus on high frequency function words or did it
> include low frequency "content words"? (The latter
> have the disadvantages of being subject related,
> and also of having very low counts per thousand words.)
> c) Did the method include some kind of significance test
> for a p value or some equivalent?
> d) What precautions were taken against the kinds
> of bias that methods using prior variable selection
> (e.g. Stepdisc) might suffer?
> I hope this does not sound too suspicious, it is really
> only an attempt to encourage a little information
> which might give some indication of the robustness
> of the method. Many NT studies have been based
> on vocabulary counts which have very slight statistical
> underpinning, and it would be good if there were more
> which were in line with literary stats elsewhere.
> David M.
> David Mealand, University of Edinburgh
> The University of Edinburgh is a charitable body, registered in
> Scotland, with registration number SC005336.