- Dear All,

One week ago I posted a question about large n and the normal distribution, and have received several good replies from Isobel Clark, Ned Levine, Ruben Roa Ureta, Thies Dose, Chris Hlavka, Donald Myers and Jeffrey Blume. Jeffrey is perhaps not on the list, but I assume he has no objections if I copy his message to the list.

Generally speaking, when n is too large, e.g., n>1,000, which is very common in geochemistry nowadays, statistical (goodness-of-fit) tests become too powerful, and the p-values become less informative. Therefore, users need to be very careful when using these tests with a large n. Suggestions to address this problem include: (1) use graphical methods; (2) develop methods which are suitable for large n; (3) use methods which are not sensitive to n.
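As a quick illustration of the power problem (a Python sketch of my own, on made-up data, not from any of the replies): a mild, fixed departure from normality that a test barely notices at n = 200 is rejected overwhelmingly at n = 200,000.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# A mild, fixed departure from normality: Student's t with 30 df.
p_small = stats.normaltest(rng.standard_t(df=30, size=200)).pvalue
p_large = stats.normaltest(rng.standard_t(df=30, size=200_000)).pvalue

print("n = 200:      p =", p_small)
print("n = 200,000:  p =", p_large)   # essentially zero
```

The departure is identical in both cases; only the sample size changes, yet the p-value collapses.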

Well, the solutions may not be very satisfactory, but I do hope statisticians pay more attention to large n, as they have been paying too much attention to small ones. More personal discussions are welcome. If you need some data sets to play with, please feel free to get in touch with me.

Please find below the original question and the replies. I would like to express my sincere thanks to all those who replied to me (I hope nobody is missing from the above list).

Cheers,

Chaosheng

--------------------------------------------------------------------------

Dr. Chaosheng Zhang

Lecturer in GIS

Department of Geography

National University of Ireland, Galway

IRELAND

Tel: +353-91-524411 x 2375

Fax: +353-91-525700

E-mail: Chaosheng.Zhang@...

Web 1: www.nuigalway.ie/geography/zhang.html

Web 2: www.nuigalway.ie/geography/gis/index.htm

----------------------------------------------------------------------------

----- Original Message -----

> Dear list,

>

> I'm wondering if anyone out there has the experience of dealing with the

> probability distribution of data sets of a large sample size, e.g.,

> n>10,000. I am studying the probability features of chemical element

> concentrations in a USGS sediment database with the sample number of around

> 50,000, and have found that it is virtually impossible for any real data set

> to pass tests for normality as the tests become too powerful with the

> increase of sample size. It is widely observed that geochemical data do not

> follow a normal or even a lognormal distribution. However, I feel that the

> large sample size is also making trouble.

>

> I am looking for references on this topic. Any references or comments are

> welcome.

>

> Cheers,

>

> Chaosheng

-----------------------

Chaosheng

Your problem may be 'non-stationarity' rather than the

large sample size. If you have so many samples, you

are probably sampling more than one 'population'.

We have had success in fitting lognormals to mining

data sets of up to half a million, where these are all

within the same geological environment and primary

mineralisation.

We have also had a lot of success in reasonably large

data sets (up to 100,000) with fitting mixtures of

two, three or four lognormals (or Normals) to

characterise different populations. See, for example,

the paper given at the Australian Mining Geology

conference in 1993 on my page at

http://drisobelclark.ontheweb.com/resume/Publications.html
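A minimal sketch of the mixture idea (my own Python EM on synthetic data, not the method of the paper above): a mixture of lognormals becomes a mixture of normals after taking logs, so a plain two-component EM on the log scale can recover the two "populations".

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Two synthetic "populations" of concentrations, lognormal each.
x = np.concatenate([rng.lognormal(0.0, 0.3, 12_000),
                    rng.lognormal(2.0, 0.4, 8_000)])
z = np.log(x)   # mixture of lognormals -> mixture of normals

# Plain EM for a two-component normal mixture on the log scale.
w = np.array([0.5, 0.5])
mu = np.array([z.min(), z.max()])
sd = np.array([1.0, 1.0])
for _ in range(100):
    # E-step: responsibility of each component for each point.
    dens = w * stats.norm.pdf(z[:, None], mu, sd)
    r = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate weights, means and sds.
    nk = r.sum(axis=0)
    w = nk / len(z)
    mu = (r * z[:, None]).sum(axis=0) / nk
    sd = np.sqrt((r * (z[:, None] - mu) ** 2).sum(axis=0) / nk)

print("weights:  ", np.round(w, 2))
print("log-means:", np.round(mu, 2))
print("log-sds:  ", np.round(sd, 2))
```

With well-separated components this recovers the two log-means (here 0 and 2) closely; real geochemical mixtures are of course messier.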

Isobel

http://ecosse.ontheweb.com

------------------

Chaosheng,

Can't you do a Monte Carlo simulation for the distribution? In S-Plus, you can create confidence intervals from an MC simulation with a sample size as large as yours. That is, you draw 50,000 or so points from a normal distribution and compute the empirical distribution. You then re-run this a number of times (e.g., 1,000) to establish approximate confidence intervals. You can then check what proportion of your data points fall outside the approximate confidence intervals; you would expect no more than 5% or so of the data points to fall outside the intervals if your distribution is normal. If more than 5% fall outside, then you really don't have a normal distribution (since a normal distribution is essentially a random distribution, I would doubt that any real data set would be truly normal - the sampling distribution is another issue).
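Ned's recipe, sketched in Python rather than S-Plus (the sample sizes and replication counts here are illustrative choices, not his):

```python
import numpy as np

rng = np.random.default_rng(1)
n, nsim = 2_000, 500

# Null reference: many sorted standard-normal samples of size n.
sims = np.sort(rng.standard_normal((nsim, n)), axis=1)
lo, hi = np.percentile(sims, [2.5, 97.5], axis=0)

def frac_outside(x):
    """Standardize x, sort it, and count the fraction of order
    statistics outside the simulated pointwise 95% envelope."""
    z = np.sort((x - x.mean()) / x.std())
    return float(np.mean((z < lo) | (z > hi)))

f_norm = frac_outside(rng.standard_normal(n))   # should be small
f_exp = frac_outside(rng.exponential(size=n))   # should be large
print("fraction outside, normal data:     ", f_norm)
print("fraction outside, exponential data:", f_exp)
```

Genuinely normal data stays almost entirely inside the envelope; clearly skewed data falls outside it over much of its range.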

Anyway, just some thoughts. Hope everything is well with you.

Regards,

Ned

---------------

I presume your null hypothesis is that the data come from the given

distribution as is usual in goodness of fit tests. If such is the case

your sample size will almost surely lead to rejection. The well-known

logical inconsistencies of the standard test of hypothesis based on the

p-value are magnified under large n.

You have these options at least:

1) Find some authority that says that for large sample sizes the p-value

is less informative; e.g. Lindley and Scott. 1984. New Cambridge

Elementary Statistical Tables. Cambridge Univ Press; and then you can

throw away your goodness-of-fit test. But be warned that equally important

authorities have said exactly the contrary thing, that the force of the

p-value is stronger for large sample sizes (Peto et al. 1976. British

Medical Journal 34:585-612). To make matters even worse, certainly other

equally important authorities have said that the sample size doesn't

matter (Cornfield 1966, American Statistician 29:18-23).

2) Do a more reasonable analysis than the standard goodness-of-fit test.

I suggest you plot the likelihood function under normal and lognormal

models and derive the probabilistic features of your data by direct

inspection of the function. Also you can test for different location or

scale parameters using the likelihood ratio (its direct value, not its

derived asymptotic distribution in the sample space) for any two well

defined hypotheses.
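Option (2) can be sketched as follows (a Python sketch on synthetic data; the maximized log-likelihoods give the likelihood ratio directly, with no asymptotic distribution in the sample space involved):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
x = rng.lognormal(mean=1.0, sigma=0.5, size=10_000)   # skewed synthetic data

# Maximized log-likelihood under a normal model (MLEs: sample mean and sd).
ll_norm = stats.norm.logpdf(x, x.mean(), x.std()).sum()

# Under a lognormal model: a normal fit to log(x), plus the Jacobian
# term -sum(log x) to express the density on the original scale.
z = np.log(x)
ll_lognorm = stats.norm.logpdf(z, z.mean(), z.std()).sum() - z.sum()

print("max log-likelihood, normal:   ", round(float(ll_norm), 1))
print("max log-likelihood, lognormal:", round(float(ll_lognorm), 1))
print("log likelihood ratio (lognormal over normal):",
      round(float(ll_lognorm - ll_norm), 1))
```

For genuinely lognormal data the log likelihood ratio is large and positive; inspecting these values directly avoids the p-value machinery entirely.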

Ruben

--------------

Dear Chaosheng,

this will not answer your question directly, but I hope that it will be

helpful anyway:

1.) Independence of values

I am not quite sure whether tests for normality (chi-square, Shapiro-Wilk,

Kolmogorov-Smirnov) require independence of the samples, but I have a strong

feeling that they do. Most likely your data samples are not statistically

independent of each other, because if they were, you could save

your time on the spatial analysis and work with the global mean or a

transformed random number generator as local estimator instead. So in

general this kind of test might not be appropriate.

In addition, in case of clustered data in your data set, the clustering will

certainly lead to biased results, and any results from statistical tests

would be quite doubtful.

2.) rank transform

I would try to do a spatial analysis on the rank transform of your

variables, provided you can deal with the ties in the data set. For

such a large number of samples, this will probably provide a robust approach.

In addition, a multigaussian approach has been discussed widely, and could

be a useful alternative.
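Both suggestions can be sketched together (a Python sketch on synthetic data; `rankdata` with the "average" method handles ties, and the normal-score transform is one simple multigaussian-style step built on the ranks):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = np.round(rng.lognormal(1.0, 0.8, 5_000), 1)   # rounding creates ties

ranks = stats.rankdata(x, method="average")        # ties get the average rank
nscore = stats.norm.ppf((ranks - 0.5) / len(x))    # ranks -> (0,1) -> normal scores

print("skewness before:", round(float(stats.skew(x)), 2))
print("skewness after: ", round(float(stats.skew(nscore)), 2))
```

The transform is monotone, so spatial structure in the ranks is preserved while the strong skew of the raw values disappears.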

Happy evaluations,

Thies

---------------------

Chaosheng - Other approaches to your problem are:

- Randomly select a few smaller samples and apply the goodness-of-fit test.

- Test fit to normal and lognormal distributions with probability plots.
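Both of Chris's suggestions, sketched in Python on synthetic data (the subsample size and the number of subsamples are arbitrary choices of mine):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
big = rng.normal(10.0, 2.0, 50_000)   # stand-in for a large survey variable

# (1) Goodness-of-fit on a few small random subsamples.
pvals = [stats.shapiro(rng.choice(big, 200, replace=False)).pvalue
         for _ in range(20)]
print("median Shapiro-Wilk p over 20 subsamples:",
      round(float(np.median(pvals)), 3))

# (2) Normal probability plot: r close to 1 means a good straight-line fit.
(osm, osr), (slope, intercept, r) = stats.probplot(big, dist="norm")
print("probability-plot correlation r:", round(float(r), 4))
```

At a subsample size of 200 the test is no longer overwhelmingly powerful, and the probability-plot correlation summarizes the fit without any p-value at all.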

-- Chris

-------------------

A couple of observations about your question/problem

1. Almost any statistical test will have an underlying assumption of

random sampling (or perhaps a modification of random sampling such as

stratified). It is very unlikely that the data will have been generated

in that way (random sampling in this context refers to sampling from the

"distribution" and not to sampling from a region or space). Generally

speaking, random site selection for sampling is not the same thing as

random sampling from the distribution. It is highly unlikely that you

can really use statistical tests with your data because the underlying

assumptions are not satisfied. The results may be useful to look at,

but don't take them as really hard evidence.

2. As a further point, the sampling in this case is obviously "without

replacement", i.e., you can't generate two samples from the (exact)

same location. For smaller sample sizes the difference between "with

replacement" and "without replacement" is probably negligible, but not

for larger sample sizes. You may be seeing this.

Suppose that the "population" size is M (M very large). Random

sampling WITH replacement means that each possible value will be

chosen with probability 1/M, so a particular sample of size n has

probability (1/M)^n. If the sampling is WITHOUT replacement, then each

sample of size n has a probability of 1/[M!/(n!(M-n)!)]. For M = 1000

and n = 5 the numerical difference between these two probabilities is

very, very small. But if n > 50 (as an example) then the difference is

significant.
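The arithmetic in point 2 can be checked directly (a short Python sketch; the "probabilities" are for one particular sample, as described above):

```python
from math import comb

def p_with(M, n):
    """Probability of one particular ordered sample, drawing WITH replacement."""
    return (1 / M) ** n

def p_without(M, n):
    """Probability of one particular unordered sample, WITHOUT replacement."""
    return 1 / comb(M, n)

for n in (5, 50):
    ratio = p_without(1000, n) / p_with(1000, n)
    print(f"M=1000, n={n}: both probabilities are tiny, "
          f"but their ratio is {ratio:.3g}")
```

Both probabilities are minute in absolute terms, but their ratio grows enormously with n, which is the point being made.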

3. Finally, what is the "support" of the samples? Generally speaking,

the probability distribution changes as the support changes. (In the

Geography literature this is referred to as the "Modifiable Areal Unit

Problem".)

I don't remember having seen this discussed, but you might want to look

at the literature pertaining to Pierre Gy's work on sampling (in fact

there was, or is to be, a conference somewhere in Scandinavia recently on

his work).

Donald Myers

http://www.u.arizona.edu/~donaldm

-------------------

Chaosheng,

Probably the best approach is to take a different tack and try estimating an important quantity rather than testing to see if the normal distribution fits your data. With such a large sample size, almost any goodness-of-fit test will reject.

Also, as long as the distributions are symmetric, you can assume normality without losing too much (even if the test rejects normality). I'm not sure the articles will help you in this matter, because they are more concerned with demonstrating that two equal p-values do not represent the same amount of evidence unless the sample sizes are equal. Which sample has the stronger amount of evidence is still debatable (as you'll see).

You might try an altogether different approach: look at the likelihood function. I have attached a tutorial that explains how to do this.

Good Luck.

Jeffrey

[Non-text portions of this message have been removed]

-------------------

Hi,

I'm not sure I agree with the idea that a test can be too powerful. This is a

common argument in simulation experiments, that because you can do an infinite

number of replicate simulations, somehow the differences detected are not

real. In fact, the differences are real. They may not be biologically (or

geologically or whatever field you are in) significant, but they are still

real. That is why it is better to decide first on the magnitude of difference

that you consider significant. Now, in the case of deviation from normality,

I suppose you wouldn't have much intuition about what is significant, but the

relevant question is what is the effect of small deviations from normality on

your test or conclusions of your analysis? These kinds of studies are out

there in the statistical literature for many tests (t-tests etc.) -- I'm not

sure how much has been done to look at the robustness of geostatistical

analyses, but there are probably some studies (does anyone know?). I would not

opt for a less-powerful test just to justify an assumption - that's, like,

unethical or something.

Yetta

--

* To post a message to the list, send it to ai-geostats@...

* As a general service to the users, please remember to post a summary of any useful responses to your questions.

* To unsubscribe, send an email to majordomo@... with no subject and "unsubscribe ai-geostats" followed by "end" on the next line in the message body. DO NOT SEND Subscribe/Unsubscribe requests to the list

* Support to the list is provided at http://www.ai-geostats.org

-------------------

Dear Yetta,

Thanks for the comments, and I agree with you. I think there is a functional relationship between sample size and statistical power: the power increases as n increases. It's true that it is hard to define how powerful is "too powerful". Some people suggest using a lower significance level for large n. However, it is then also a problem how low (e.g., 0.0000001) is low enough. Some people suggest not using the p-value at all, as mentioned in the summary.

It is also a question how serious it is if the data set does not follow a normal distribution. Statisticians may provide us with artificial examples showing how serious it can be, but it may not be so serious in the real world if it's only a minor departure. Some people even say that statistical methods cannot be used at all because our samples are not independent, owing to spatial autocorrelation. Well, perhaps I have gone too far, but it is an interesting topic. (Geo)statisticians may have better comments.

By the way, I may not summarize again. If anyone would like to share your ideas with the list, please copy to it.

Cheers,

Chaosheng

----- Original Message -----

From: "zij" <zij@...>

To: "ai-geostats" <ai-geostats@...>; "Chaosheng Zhang" <Chaosheng.Zhang@...>

Sent: Monday, August 11, 2003 7:13 PM

Subject: RE: AI-GEOSTATS: Summary: Large sample size and normal distribution

> Hi,

>

> I'm not sure I agree with the idea that a test can be too powerful. This is a

> common argument in simulation experiments, that because you can do an infinite

> number of replicate simulations, somehow the differences detected are not

> real. In fact, the differences are real. They may not be biologically (or

> geologically or whatever field you are in) significant, but they are still

> real. That is why it is better to decide first on the magnitude of difference

> that you consider significant. Now, in the case of deviation from normality,

> I suppose you wouldn't have much intuition about what is significant, but the

> relevant question is what is the effect of small deviations from normality on

> your test or conclusions of your analysis? These kinds of studies are out

> there in the statistical literature for many tests (t-tests etc.) -- I'm not

> sure how much has been done to look at the robustness of geostatistical

> analyses, but there are probably some studies (does anyone know?). I would not

> opt for a less-powerful test just to justify an assumption - that's, like,

> unethical or something.

>

> Yetta

>

>

>

-------------------

> Hi,

>

> I'm not sure I agree with the idea that a test can be too powerful. This

> is a common argument in simulation experiments, that because you can do an

> infinite number of replicate simulations, somehow the differences

> detected are not real. In fact, the differences are real. They may not

> be biologically (or geologically or whatever field you are in)

> significant, but they are still real. That is why it is better to decide

> first on the magnitude of difference that you consider significant.

The null hypothesis is always false, although it might be false by a very

small quantity; that is the trivial fact that a very large sample size

exposes in the common test of significance. The conclusion to be drawn

from this is not that we must set in advance the amount of difference that

we would find significant (a rather restrictive strategy which will be

violated very often because it is nonsensical), but rather that the only

sensible strategy is to compare hypotheses one against another. This can

be done on an evidential basis by evaluating the likelihood ratio, the

likelihood of the data under one hypothesis divided by the likelihood of

the data under another hypothesis. By constructing the whole likelihood

function (in the case of a single parameter) any pair of hypotheses can be

tested for the value of the likelihood ratio.
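A sketch of that likelihood-ratio comparison for two simple hypotheses about a normal mean (Python, synthetic data, with sigma treated as known for simplicity):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
x = rng.normal(10.0, 2.0, 1_000)   # synthetic data; true mean is 10

def log_lik(mu, sigma=2.0):
    """Log-likelihood of the data under N(mu, sigma^2)."""
    return stats.norm.logpdf(x, mu, sigma).sum()

# Evidence for mu = 10 versus mu = 10.5, as a direct ratio:
log_lr = log_lik(10.0) - log_lik(10.5)
print("log likelihood ratio, mu=10 vs mu=10.5:", round(float(log_lr), 1))
```

Evaluating `log_lik` over a grid of mu values gives the whole likelihood function, from which any pair of hypotheses can be compared in the same way.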

> Now, in the case of deviation from normality, I suppose you wouldn't

have much intuition about what is significant, but the relevant question

is what is the effect of small deviations from normality on your test or

conclusions of your analysis?

Perhaps a better question is what the data say about a given hypothesis

for the mean versus another value for the mean, assuming the normal

distribution is true. If the variance is unknown there is a simple

solution only for the normal and a few other cases, by orthogonalization,

and then the two parameters can be assessed separately. For comparing two

different models, say normal versus lognormal, a likelihood based

approach, the Akaike Information Criterion, is available, although I am not

sure that Akaike's approach is fully in agreement with the likelihood

principle.
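The AIC comparison of normal versus lognormal can be sketched as follows (Python, synthetic data; both models have two fitted parameters, so AIC = 4 - 2 ln L for each, and the lower value wins):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(13)
x = rng.lognormal(0.5, 0.6, 20_000)   # skewed synthetic "concentrations"

# Normal model: MLEs are the sample mean and sd.
ll_norm = stats.norm.logpdf(x, x.mean(), x.std()).sum()

# Lognormal model: fit with location fixed at zero (2 free parameters).
s, loc, scale = stats.lognorm.fit(x, floc=0)
ll_lognorm = stats.lognorm.logpdf(x, s, loc, scale).sum()

# AIC = 2k - 2 ln L, with k = 2 for both models.
aic_norm = 2 * 2 - 2 * ll_norm
aic_lognorm = 2 * 2 - 2 * ll_lognorm
print("AIC normal:   ", round(float(aic_norm), 1))
print("AIC lognormal:", round(float(aic_lognorm), 1))
```

Because both models have the same number of parameters here, the AIC comparison reduces to the likelihood ratio itself.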

Ruben
