AI-GEOSTATS: Summary: Large sample size and normal distribution

  • Chaosheng Zhang
    Message 1 of 5, Aug 9, 2003
      Dear All,

      One week ago I posted a question about large n and normal distribution, and I have received several good replies from Isobel Clark, Ned Levine, Ruben Roa Ureta, Thies Dose, Chris Hlavka, Donald Myers and Jeffrey Blume. Jeffrey is perhaps not on the list, but I assume he has no objection if I copy his message to the list.

      Generally speaking, when n is too large, e.g., n>1,000, which is very common in geochemistry nowadays, statistical (goodness-of-fit) tests become too powerful and p-values become less informative. Therefore, users need to be very careful when using these tests with a large n. Suggestions for dealing with this problem include: (1) using graphical methods; (2) developing methods suitable for large n; (3) using methods that are not sensitive to n.
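
      (For illustration, a minimal Python sketch, not from any of the replies, showing how the same small departure from normality gets flagged once n grows; the contamination level, seed and choice of KS test are arbitrary assumptions of mine.)

      import numpy as np
      from scipy import stats

      rng = np.random.default_rng(42)
      for n in (100, 1_000, 10_000, 100_000):
          # 95% standard normal + 5% slightly shifted normal: the same
          # small, fixed contamination at every n.
          k = int(0.95 * n)
          x = np.concatenate([rng.normal(0.0, 1.0, k),
                              rng.normal(0.5, 1.0, n - k)])
          # KS test against a normal with fitted parameters (fitting makes
          # the nominal p-value approximate, the Lilliefors issue).
          stat, p = stats.kstest(x, 'norm', args=(x.mean(), x.std(ddof=1)))
          print(f"n={n:>7}  KS p = {p:.3g}")
      # The departure is identical at every n; only the test's power grows.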

      Well, the solutions may not be very satisfactory, but I do hope statisticians pay more attention to large n, as they have been paying too much attention to small samples. More personal discussion is welcome. If you need some data sets to play with, please feel free to get in touch with me.

      Please find below the original question and the replies. I would like to express my sincere thanks to all those who replied to me (I hope nobody is missing from the above list).

      Cheers,

      Chaosheng
      --------------------------------------------------------------------------
      Dr. Chaosheng Zhang
      Lecturer in GIS
      Department of Geography
      National University of Ireland, Galway
      IRELAND
      Tel: +353-91-524411 x 2375
      Fax: +353-91-525700
      E-mail: Chaosheng.Zhang@...
      Web 1: www.nuigalway.ie/geography/zhang.html
      Web 2: www.nuigalway.ie/geography/gis/index.htm
      ----------------------------------------------------------------------------

      ----- Original Message -----

      > Dear list,
      >
      > I'm wondering if anyone out there has experience dealing with the
      > probability distribution of data sets of large sample size, e.g.,
      > n>10,000. I am studying the probability features of chemical element
      > concentrations in a USGS sediment database with around 50,000 samples,
      > and have found that it is virtually impossible for any real data set
      > to pass tests for normality, as the tests become too powerful with
      > increasing sample size. It is widely observed that geochemical data do not
      > follow a normal or even a lognormal distribution. However, I feel that the
      > large sample size is also part of the trouble.
      >
      > I am looking for references on this topic. Any references or comments are
      > welcome.
      >
      > Cheers,
      >
      > Chaosheng

      -----------------------
      Chaosheng

      Your problem may be 'non-stationarity' rather than the
      large sample size. If you have so many samples, you
      are probably sampling more than one 'population'.

      We have had success in fitting lognormals to mining
      data sets of up to half a million samples, where these are all
      within the same geological environment and primary
      mineralisation.

      We have also had a lot of success in reasonably large
      data sets (up to 100,000) with fitting mixtures of
      two, three or four lognormals (or Normals) to
      characterise different populations. See, for example,
      the paper given at the Australian Mining Geology
      conference in 1993 on my page at
      http://drisobelclark.ontheweb.com/resume/Publications.html

      Isobel
      http://ecosse.ontheweb.com

      ------------------
      Chaosheng,

      Can't you do a Monte Carlo simulation for the distribution? In S-Plus, you can create confidence intervals from an MC simulation with a sample size as large as yours. That is, you draw 50,000 or so points from a normal distribution and calculate the distribution. You then re-run this a number of times (e.g., 1,000) to establish approximate confidence intervals. You can then check what proportion of your data points fall outside the approximate confidence intervals; you would expect no more than 5% or so of the data points to fall outside the intervals if your distribution is normal. If more than 5% fall outside, then you really don't have a normal distribution. (Since a normal distribution is essentially a random distribution, I would doubt that any real data set would be truly normal - the sampling distribution is another issue.)
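
      (A rough Python sketch of this envelope idea, standing in for S-Plus; the placeholder data, the 1,000 replicates and the pointwise 95% limits are my own assumptions.)

      import numpy as np

      rng = np.random.default_rng(0)
      data = np.sort(rng.normal(10.0, 2.0, 10_000))  # placeholder for real data
      n, reps = data.size, 1_000

      # Draw `reps` normal samples matched to the data's mean and sd, and
      # sort each one so that columns hold simulated order statistics.
      sims = np.sort(rng.normal(data.mean(), data.std(ddof=1), (reps, n)), axis=1)

      # Pointwise 95% interval for each order statistic.
      lo, hi = np.percentile(sims, [2.5, 97.5], axis=0)

      outside = np.mean((data < lo) | (data > hi))
      print(f"fraction outside the 95% envelope: {outside:.1%}")
      # Rule of thumb above: much more than ~5% outside casts doubt on normality.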

      Anyway, just some thoughts. Hope everything is well with you.

      Regards,

      Ned
      ---------------
      I presume your null hypothesis is that the data come from the given
      distribution, as is usual in goodness-of-fit tests. If that is the case,
      your sample size will almost surely lead to rejection. The well-known
      logical inconsistencies of the standard hypothesis test based on the
      p-value are magnified under large n.
      You have at least these options:
      1) Find some authority that says that for large sample sizes the p-value
      is less informative, e.g. Lindley and Scott. 1984. New Cambridge
      Elementary Statistical Tables. Cambridge Univ Press; then you can
      throw away your goodness-of-fit test. But be warned that equally important
      authorities have said exactly the contrary, that the force of the
      p-value is stronger for large sample sizes (Peto et al. 1976. British
      Journal of Cancer 34:585-612). To make matters even worse, other
      equally important authorities have said that the sample size doesn't
      matter (Cornfield 1966. American Statistician 20:18-23).
      2) Do a more reasonable analysis than the standard goodness-of-fit test.
      I suggest you plot the likelihood function under normal and lognormal
      models and derive the probabilistic features of your data by direct
      inspection of the function. Also you can test for different location or
      scale parameters using the likelihood ratio (its direct value, not its
      derived asymptotic distribution in the sample space) for any two well
      defined hypotheses.
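
      (A minimal Python sketch of option 2, not Ruben's own code; the placeholder data, the grid of means and the two hypothesized values are mine.)

      import numpy as np
      from scipy import stats

      rng = np.random.default_rng(1)
      x = rng.normal(5.0, 2.0, 50_000)          # placeholder for real data

      def loglik(mu):
          # Normal log-likelihood with sigma held at its MLE (a profile sketch).
          return stats.norm.logpdf(x, mu, x.std()).sum()

      # The curve to plot: relative likelihood over a grid of means.
      mus = np.linspace(x.mean() - 0.1, x.mean() + 0.1, 201)
      rel = np.exp([loglik(m) - loglik(x.mean()) for m in mus])

      # Direct value of the likelihood ratio for two specific hypotheses,
      # with no appeal to its asymptotic sampling distribution.
      print(f"L(mu=5.00)/L(mu=5.05) = {np.exp(loglik(5.00) - loglik(5.05)):.3g}")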
      Ruben

      --------------
      Dear Chaosheng,

      this will not answer your question directly, but I hope that it will be
      helpful anyway:

      1.) Independence of values
      I am not quite sure whether tests for normality (chi-square, Shapiro-Wilk,
      Kolmogorov-Smirnov) require independence of the samples, but I have a strong
      feeling that they do. Most likely your data samples are not statistically
      independent of each other, because if they were, you could save
      your time on the spatial analysis and work with the global mean or a
      transformed random number generator as local estimator instead. So in
      general this kind of test might not be appropriate.
      In addition, in case of clustered data in your data set, the clustering
      will surely lead to biased results, and any results from statistical tests
      would be quite doubtful.

      2.) Rank transform
      I would try to do a spatial analysis on the rank transform of your
      variables, provided you can deal with the ties in the data set. For
      such a large number of samples, this will probably provide a robust
      approach. In addition, a multigaussian approach has been discussed widely
      and could be a useful alternative.
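
      (A minimal Python sketch of the rank/normal-scores idea, with ties averaged; the toy data are mine, not Thies's.)

      import numpy as np
      from scipy import stats

      x = np.array([3.2, 1.1, 4.8, 1.1, 2.7])     # toy data with one tie

      ranks = stats.rankdata(x, method='average')  # ties get the average rank
      u = (ranks - 0.5) / len(x)                   # map ranks into (0, 1)
      scores = stats.norm.ppf(u)                   # normal scores (multigaussian)

      print(ranks)    # [4.  1.5 5.  1.5 3. ]
      print(scores)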

      Happy evaluations,
      Thies

      ---------------------
      Chaosheng - Other approaches to your problem are:
      - Randomly select a few smaller samples and apply the goodness-of-fit test.
      - Test fit to normal and lognormal distributions with probability plots.
      -- Chris
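
      (A minimal Python sketch of the first suggestion; the subsample size of 200 and the 100 repetitions are arbitrary choices of mine.)

      import numpy as np
      from scipy import stats

      rng = np.random.default_rng(2)
      data = rng.lognormal(0.0, 0.3, 50_000)     # placeholder for real data

      # Apply the test to many small random subsamples instead of all 50,000.
      pvals = [stats.shapiro(rng.choice(data, 200, replace=False)).pvalue
               for _ in range(100)]
      print(f"median Shapiro-Wilk p over subsamples: {np.median(pvals):.3f}")

      # A probability (Q-Q) plot covers the second suggestion, e.g.:
      #   stats.probplot(data, dist='norm', plot=matplotlib.pyplot)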

      -------------------
      A couple of observations about your question/problem

      1. Almost any statistical test will have an underlying assumption of
      random sampling (or perhaps a modification of random sampling, such as
      stratified sampling). It is very unlikely that the data will have been
      generated in that way (random sampling in this context refers to sampling
      from the "distribution", not to sampling from a region or space). Generally
      speaking, random site selection for sampling is not the same thing as
      random sampling from the distribution. It is highly unlikely that you
      can really use statistical tests with your data, because the underlying
      assumptions are not satisfied. The results may be useful to look at,
      but don't take them as really hard evidence.

      2. As a further point, the sampling in this case is obviously "without
      replacement", i.e., you can't generate two samples from the (exact)
      same location. For smaller sample sizes the difference between "with
      replacement" and "without replacement" is probably negligible, but not
      for larger sample sizes. You may be seeing this.

      Suppose that the "population" size is M (M very large) is,
      random sampling WITH replacement means that each possible value will be
      chosen with probability 1/M. For a sample of size n then the
      probability will be this raised to the power n. If the sampling is
      WITHOUT replacement then each sample of size n has a probability of 1/[
      M!/(n! (M-n)!)] For M = 1000 and n = 5 the numerical difference in
      these two probabilities is very very small. But if n > 50 (as an
      example) then the difference is significant.
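
      (A quick numerical check of this point in Python; M = 1000 and the n values follow the example above.)

      from math import exp, lgamma, log

      def ratio(M, n):
          # Ordered sample: with replacement P = 1/M**n; without replacement
          # P = 1/(M*(M-1)*...*(M-n+1)).  This returns P_with / P_without.
          return exp((lgamma(M + 1) - lgamma(M - n + 1)) - n * log(M))

      M = 1000
      for n in (5, 50, 200):
          print(f"n={n:>4}: P_with / P_without = {ratio(M, n):.4g}")
      # n=5 gives ~0.99 (negligible); n=50 gives ~0.29 (no longer negligible).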

      3. Finally, what is the "support" of the samples? Generally speaking
      the probability distribution changes as the support changes. (In the
      Geography literature this is referred to as the "Modifiable Areal Unit
      Problem".)

      I don't remember having seen this discussed, but you might want to look
      at the literature pertaining to Pierre Gy's work on sampling (in fact
      there recently was, or is soon to be, a conference somewhere in
      Scandinavia on his work).

      Donald Myers
      http://www.u.arizona.edu/~donaldm
      -------------------
      Chaosheng,

      Probably the best approach is to take a different tack and try estimating an important quantity rather than testing to see if the normal distribution fits your data. With such a large sample size almost any goodness-of-fit test will reject.

      Also, as long as the distributions are symmetric, you can assume normality without losing too much (even if the test rejects normality). I'm not sure the articles will help you in this matter, because they are more concerned with demonstrating that two equal p-values do not represent the same amount of evidence unless the sample sizes are equal. Which sample provides the stronger evidence is still debatable (as you'll see).

      You might try an altogether different approach: look at the likelihood function. I have attached a tutorial that explains how to do this.

      Good Luck.
      Jeffrey

    • zij
      Message 2 of 5, Aug 11, 2003
        Hi,

        I'm not sure I agree with the idea that a test can be too powerful. This is a
        common argument in simulation experiments, that because you can do an infinite
        number of replicate simulations, somehow the differences detected are not
        real. In fact, the differences are real. They may not be biologically (or
        geologically or whatever field you are in) significant, but they are still
        real. That is why it is better to decide first on the magnitude of difference
        that you consider significant. Now, in the case of deviation from normality,
        I suppose you wouldn't have much intuition about what is significant, but the
        relevant question is what is the effect of small deviations from normality on
        your test or conclusions of your analysis? These kinds of studies are out
        there in the statistical literature for many tests (t-tests, etc.). I'm not
        sure how much has been done to look at the robustness of geostatistical
        analyses, but there are probably some studies (does anyone know?). I would not
        opt for a less-powerful test just to justify an assumption - that's, like,
        unethical or something.

        Yetta



      • Chaosheng Zhang
        Message 3 of 5, Aug 11, 2003
          Dear Yetta,

          Thanks for the comments, and I agree with you. There is a relationship between sample size and statistical power: the power increases as n increases. It's true that it is hard to define how powerful is "too powerful". Some people suggest using a lower significance level for large n. However, it is then a question how low (e.g., 0.0000001) is low enough. Some people suggest not using the p-value at all, as mentioned in the summary.

          It is also a question how serious it is if the data set does not follow a normal distribution. Statisticians may provide us artificial examples showing how serious it can be, but this may not be so serious in the real world if it's only a minor departure. Some people even say that statistical methods cannot be used at all, because our samples are not independent due to spatial autocorrelation. Well, perhaps I have gone too far, but it is an interesting topic. (Geo)statisticians may have better comments.

          By the way, I may not post another summary. If anyone would like to share ideas with the list, please copy your reply to it.

          Cheers,

          Chaosheng


          ----- Original Message -----
          From: "zij" <zij@...>
          To: "ai-geostats" <ai-geostats@...>; "Chaosheng Zhang" <Chaosheng.Zhang@...>
          Sent: Monday, August 11, 2003 7:13 PM
          Subject: RE: AI-GEOSTATS: Summary: Large sample size and normal distribution


        • Ruben Roa Ureta
          Message 4 of 5, Aug 11, 2003
            > Hi,
            >
            > I'm not sure I agree with the idea that a test can be too powerful. This
            > is a common argument in simulation experiments, that because you can do
            > an infinite number of replicate simulations, somehow the differences
            > detected are not real. In fact, the differences are real. They may not
            > be biologically (or geologically or whatever field you are in)
            > significant, but they are still real. That is why it is better to decide
            > first on the magnitude of difference that you consider significant.

            The null hypothesis is always false, although it might be false by a very
            small quantity; that is the trivial fact that a very large sample size
            exposes in the common test of significance. The conclusion to be drawn
            from this is not that we must set in advance the amount of difference that
            we would find significant (a rather restrictive strategy which will be
            violated very often because it is nonsensical), but rather that the only
            sensible strategy is to compare hypotheses one against another. This can
            be done on an evidential basis by evaluating the likelihood ratio: the
            likelihood of the data under one hypothesis divided by the likelihood of
            the data under another hypothesis. By constructing the whole likelihood
            function (in the case of a single parameter), any pair of hypotheses can
            be compared via the value of the likelihood ratio.

            > Now, in the case of deviation from normality, I suppose you wouldn't
            > have much intuition about what is significant, but the relevant question
            > is what is the effect of small deviations from normality on your test or
            > conclusions of your analysis?

            Perhaps a better question is: what do the data say about a given
            hypothesis for the mean versus another value for the mean, assuming the
            normal distribution is true? If the variance is unknown, there is a simple
            solution only for the normal and a few other cases, by orthogonalization,
            and then the two parameters can be assessed separately. For comparing two
            different models, say normal versus lognormal, a likelihood-based
            approach, the Akaike Information Criterion, is available, although I am
            not sure that Akaike's approach is fully in agreement with the likelihood
            principle.
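
            (A minimal Python sketch of such an AIC comparison; both models have two parameters, and the placeholder data are mine.)

            import numpy as np
            from scipy import stats

            rng = np.random.default_rng(3)
            x = rng.lognormal(1.0, 0.4, 50_000)   # placeholder for real data

            # Maximized log-likelihoods with the MLEs plugged in.
            ll_norm = stats.norm.logpdf(x, x.mean(), x.std()).sum()
            mu, sig = np.log(x).mean(), np.log(x).std()
            ll_logn = stats.lognorm.logpdf(x, s=sig, scale=np.exp(mu)).sum()

            aic = lambda ll, k: 2 * k - 2 * ll    # k = number of parameters
            print(f"AIC normal    = {aic(ll_norm, 2):,.0f}")
            print(f"AIC lognormal = {aic(ll_logn, 2):,.0f}")  # smaller is better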

            Ruben
