
re:GEOSTATS: data distribution impact on kriging?

  • Swantje Lindner
    Message 1 of 4 , Nov 30, 2002
      Laura wrote:
      > 1) can you suggest a reference that explicitly addresses the issue of how
      > using a non-normal data set will influence kriging values and associated
      > errors?

      I cannot suggest a reference, but here are some ideas:

      Above all, a non-normal (strongly skewed) data set will influence the
      variogram or covariance function. The experimental variogram tends to lose
      its structure and come close to a pure nugget effect. Kriging with a
      variogram of type "pure nugget" (i.e. for data without spatial
      correlation) doesn't improve the estimation (in terms of estimation
      variance) compared to other estimation methods such as Inverse Distance
      Weighting or Moving Average. So if you don't transform highly skewed
      data, kriging becomes rather pointless: you could use a simpler
      estimation method instead, without the time-consuming variography.

      The loss of structure of the experimental variogram can be explained
      as follows. Calculating the variogram means calculating, for every lag
      class h, half the mean squared difference between the variable values
      of the pairs in that class:

      gamma(h) = 1/(2 n_h) * sum_{i=1}^{n_h} [z(x_i) - z(x_i + h)]^2

      n_h being the number of pairs in the lag class h.
      The majority of the z-values are small and only a few are very high.
      So the majority of the differences will be small (differences between
      small values) and only a very small part of the differences will be
      large (differences between small and large values). This is, I think,
      the instability of the variances of untransformed variables mentioned
      by McBratney et al. 1982. As the large differences can be several orders
      of magnitude bigger, they have a great influence on the mean variogram
      value of the lag class. But these large differences are not tied to a
      particular lag class: every lag class contains the same mix of a few
      large differences and a lot of small ones. So with skewed distributions
      we get relatively high variogram values for every lag class, even for
      the short lags near the origin, the result being an unstructured
      variogram.
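
To try this on your own data, the following is a minimal Python sketch of the experimental variogram, assuming the semivariogram convention (half the mean squared difference per lag class) and lag classes [k*w, (k+1)*w). The function name, the lag binning and the synthetic transect in the demo are illustrative assumptions, not taken from any geostatistical package.

```python
import math
import random


def experimental_variogram(coords, values, lag_width, n_lags):
    """Half the mean squared difference of all pairs, per distance class."""
    sums = [0.0] * n_lags
    counts = [0] * n_lags
    n = len(values)
    for i in range(n):
        for j in range(i + 1, n):
            dist = math.dist(coords[i], coords[j])
            k = int(dist // lag_width)          # index of the lag class
            if k < n_lags:
                sums[k] += (values[i] - values[j]) ** 2
                counts[k] += 1
    # Empty lag classes are reported as NaN.
    return [s / (2 * c) if c else float("nan") for s, c in zip(sums, counts)]


if __name__ == "__main__":
    random.seed(0)
    noise = [random.gauss(0, 1) for _ in range(220)]
    # A moving average of white noise gives a spatially correlated series.
    y = [sum(noise[i:i + 20]) / math.sqrt(20) for i in range(200)]
    coords = [(float(i),) for i in range(200)]
    z = [math.exp(2 * v) for v in y]            # strongly right-skewed variable
    print("raw:", experimental_variogram(coords, z, 5.0, 8))
    print("log:", experimental_variogram(coords, [math.log(v) for v in z], 5.0, 8))
```

Printing the two variograms side by side lets you judge how much of the short-lag structure survives on the raw scale compared with the log scale.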

      In Journel & Froidevaux (1982, Math.Geol., 14, pp.217-239) the variograms
      of a highly skewed variable and its log transform are compared and the
      coefficient of variation is given as a kind of indicator for the goodness
      of the variograms.

      So much for the problems that arise if we don't take account of the non-normality.

      Unfortunately, transforming the data to normality (or, better said, reducing
      the skewness of the data distribution) by taking logarithms results in
      other problems:

      - the sensitivity of the back-transformed values to the kriging
      errors of the transformed variable (see the remarks of D. Myers and O. Costello)

      - the problem of back-transforming the kriging error (although the formulae exist
      for both simple and ordinary kriging, I didn't find an implementation
      in a geostatistical program, so one would have to write one's own program).

      If you are interested in the kriging values of your variable on the scale of the
      raw data, the lognormal transformation does not seem to be a good way to get them.
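
The sensitivity can be illustrated with the univariate lognormal case, as a hedged sketch only: exponentiating a mean on the log scale gives the geometric mean, and the classical lognormal correction adds half the log-scale variance inside the exponent, so any error in that variance is amplified exponentially. The function names are hypothetical, and the kriging-specific formulae (e.g. the extra terms for ordinary kriging) are deliberately not reproduced here.

```python
import math
import random
import statistics


def back_transform_naive(mean_log):
    """exp() of a log-scale mean: the geometric mean on the raw scale."""
    return math.exp(mean_log)


def back_transform_lognormal(mean_log, var_log):
    """Lognormal mean: exp(mean + variance / 2), both on the log scale."""
    return math.exp(mean_log + 0.5 * var_log)


if __name__ == "__main__":
    random.seed(1)
    z = [random.lognormvariate(0.0, 1.5) for _ in range(5000)]
    logs = [math.log(v) for v in z]
    m, v = statistics.mean(logs), statistics.variance(logs)
    print("arithmetic mean of raw data :", statistics.mean(z))
    print("naive back-transform        :", back_transform_naive(m))
    print("lognormal back-transform    :", back_transform_lognormal(m, v))
    # Same log-scale mean, variance misjudged by +1: result scales by exp(0.5).
    print("with variance + 1           :", back_transform_lognormal(m, v + 1.0))
```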

      (It's another matter if you are interested in a univariate or multivariate
      statistical analysis of your data; there it may be sufficient to transform the
      lognormal variable, or more precisely the non-normally distributed variable, without
      back-transforming. In this case the transformations mentioned by A. Prasad and
      S. Low Choy could give better results than the simple log transform. As to the
      application of the Box-Cox method to geochemical data, you can refer to Howarth
      and Earle (1979, Math.Geol., 11, pp.45-62).)

      An alternative way of handling data with a skewed distribution geostatistically
      is to perform a normal score transform (see Journel & Huijbregts 1978, pp.566, 567
      and also the figure on p. 478). After kriging the transformed data, the
      kriged values can easily be back-transformed by inverting the normal score transform.
      There is a routine in GSLIB for it. The kriging variances cannot be
      back-transformed as simply as the values. To get an expression for the estimation
      variance on the scale of the raw data, one can run many simulations
      of the transformed data and then build an error interval from the back-transformed
      simulated values. (I haven't tried it myself, but it seems reasonable and
      I plan to do it with GSLIB when I have a free minute.) In this way one avoids the
      high values which are due to the high kriging errors in sparsely sampled regions.
      But, of course, in such regions the quality of the estimation is always
      rather poor.
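
As a rough illustration of the idea (not the GSLIB routine), a normal score transform and its inverse can be sketched in Python as below. The plotting position (rank + 0.5)/n, the linear interpolation in the lookup table and the clamping of the tails to the data minimum and maximum are assumptions made for this sketch; the function names are hypothetical.

```python
import bisect
from statistics import NormalDist


def normal_score_transform(values):
    """Map each value to the standard normal quantile of its rank.

    Returns the scores (in input order) and a sorted lookup table
    (raw values, scores) for the back-transform.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    std_normal = NormalDist()
    scores = [0.0] * n
    for rank, i in enumerate(order):
        p = (rank + 0.5) / n                 # strictly inside (0, 1)
        scores[i] = std_normal.inv_cdf(p)
    table_raw = [values[i] for i in order]
    table_score = [scores[i] for i in order]
    return scores, table_raw, table_score


def back_transform(score, table_raw, table_score):
    """Invert the transform by linear interpolation in the lookup table."""
    if score <= table_score[0]:
        return table_raw[0]                  # clamp lower tail to data minimum
    if score >= table_score[-1]:
        return table_raw[-1]                 # clamp upper tail to data maximum
    k = bisect.bisect_right(table_score, score)
    s0, s1 = table_score[k - 1], table_score[k]
    r0, r1 = table_raw[k - 1], table_raw[k]
    return r0 + (r1 - r0) * (score - s0) / (s1 - s0)


if __name__ == "__main__":
    data = [10.0, 1.0, 100.0, 2.0, 5.0]
    scores, raw, sc = normal_score_transform(data)
    print(scores)
    print([back_transform(s, raw, sc) for s in scores])
```

For the error interval described above, the same `back_transform` could be applied to every simulated value at a location, and percentiles of the back-transformed values taken as interval limits.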

      [Concerning the mail of O.Costello - he wrote: ...I end up with
      relatively high concentration estimates for nodes in areas I know
      are clean .... -
      did the kriging know too, that these areas are clean, or was
      it a supplementary "soft information", not included into the
      data set for kriging?]


      Laura wrote:
      > I did check the ai-geostat web site and I didn't find anything on this
      > question, but I didn't go through all the archives....

      There have been two discussions about lognormal issues on the ai-geostats list,
      one in December 1996 (sampling for the 90th percentile of the lognormal
      distribution) and one in March 1997 (the lognormal in geology). There I found
      a good discussion of why lognormal distributions appear in
      geology (the article by Allegre and Lewin mentioned there, about mixing
      and fractionation processes, is really nice to read and comprehensible for a
      non-mathematician), though it doesn't give a solution for the spatial
      estimation of variables with skewed distributions.

      Regards

      Swantje (lindner@...-freiberg.de)
      --
      *To post a message to the list, send it to ai-geostats@....
      *As a general service to list users, please remember to post a summary
      of any useful responses to your questions.
      *To unsubscribe, send email to majordomo@... with no subject and
      "unsubscribe ai-geostats" in the message body.
      DO NOT SEND Subscribe/Unsubscribe requests to the list!
    • Samantha (Sama) Low Choy
      Message 2 of 4 , Jun 26, 1997
        Hi Laura,

        As a statistician, I can help you with your question 2 on tests for
        normality. I wonder if any geostatisticians currently find these things
        useful?

        A. Some visual aids for diagnosing normality include:

        1. comparing the density of your data and the theoretical density
        * compute the mean and standard deviation for your data
        * plot a normal density function with this mean and deviation
        * overlay a fitted density (S-PLUS does this easily with the density
        function) or a histogram of the data to see how well they fit

        2. a q-q plot (or quantile-quantile plot) will compare the distribution of
        your data to any other distribution (or dataset), e.g. a normal. S-PLUS
        does this easily with the qqnorm function. Other statistical packages
        also do this, but sometimes need a bit more fiddling. If you can't find
        a nice package to work it out for you (or if you want to know how it's
        worked out), then do this:
        * work out the quantiles of the matching theoretical normal
        distribution (i.e. one with the same mean and stdev as your data). E.g. work out
        the 1st, 2nd, ..., 99th percentiles of the normal using tables
        or a stats package (leave out the 0th and 100th, which are infinite for a normal).
        * sort your data from lowest to highest and find out where the percentiles
        are. E.g. with 280 datapoints, the 1st percentile will be the 2.8th
        datapoint, i.e. .2 * the 2nd + .8 * the 3rd, etc.
        * plot the theoretical percentiles against the data percentiles.
        This should give a straight line if they match up.
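
The hand calculation for the q-q plot can be sketched in Python as follows, using only the standard library. The percentile rule is the one described above (position = percent/100 * n, counted from 1, with linear interpolation); the function names are hypothetical, and plotting the resulting pairs is left to whatever package you have.

```python
from statistics import NormalDist, mean, stdev


def data_percentile(sorted_data, pct):
    """Percentile of already sorted data by linear interpolation."""
    n = len(sorted_data)
    pos = pct / 100 * n                  # 1-based fractional position, e.g. 2.8
    lower = int(pos)                     # datapoint just below that position
    frac = pos - lower
    if lower < 1:
        return sorted_data[0]
    if lower >= n:
        return sorted_data[-1]
    return (1 - frac) * sorted_data[lower - 1] + frac * sorted_data[lower]


def qq_points(data):
    """Pairs (theoretical normal percentile, data percentile) for 1..99 %."""
    x = sorted(data)
    fitted = NormalDist(mean(x), stdev(x))   # same mean and stdev as the data
    return [(fitted.inv_cdf(p / 100), data_percentile(x, p))
            for p in range(1, 100)]


if __name__ == "__main__":
    sample = [float(k) for k in range(1, 281)]
    for theoretical, observed in qq_points(sample)[:5]:
        print(round(theoretical, 2), round(observed, 2))
```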

        B. Statistical tests for normality:
        * You could also use the Kolmogorov-Smirnov test to check whether the
        cumulative distribution function of your data equals that of a normal
        (similar to the q-q plot; also easy in S-PLUS).
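
For reference, the Kolmogorov-Smirnov statistic itself is easy to compute by hand: it is the largest vertical gap between the empirical cumulative distribution and the fitted normal one. The sketch below only returns that statistic D; the function name is hypothetical. One caveat worth stating: when the mean and standard deviation are estimated from the same data, the standard K-S critical values are not strictly valid (the Lilliefors variant addresses this), so treat D here as a descriptive measure.

```python
from statistics import NormalDist, mean, stdev


def ks_statistic_normal(data):
    """Largest gap between the empirical CDF and a fitted normal CDF."""
    x = sorted(data)
    n = len(x)
    fitted = NormalDist(mean(x), stdev(x))
    d = 0.0
    for i, v in enumerate(x):
        f = fitted.cdf(v)
        # The empirical CDF jumps from i/n to (i+1)/n at v; check both sides.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


if __name__ == "__main__":
    import math
    near_normal = [NormalDist().inv_cdf((i + 0.5) / 100) for i in range(100)]
    skewed = [math.exp(3 * v) for v in near_normal]
    print("near-normal sample:", ks_statistic_normal(near_normal))
    print("skewed sample     :", ks_statistic_normal(skewed))
```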

        C. There is another distribution gaining popularity in
        Long-Range-Dependence modelling, since it allows very heavy tails.
        Unfortunately, I can't remember its name right now... And I'm doubtful
        that it has been incorporated into the kriging literature.

        D. The comment about transforming the data to stabilize the variance is
        what statisticians used to do with linear models (LMs) fitting BEFORE
        generalized linear models (GLMs) came on to the scene and BEFORE software
        became available to do this easily. So instead of
        fitting a true log-linear model with Poisson error (a GLM), statisticians
        used to simply take log(Y), with Y being the response variable, and fit a
        linear model with Normal errors *on the log scale*, and then
        back-transform. Other transformations are: the square-root transformation
        for count data, the logistic/probit/gompertz transforms for binary data,
        etc.

        E. If your data doesn't follow the distribution assumed with *any*
        statistical modelling, then your estimates can be biased, and/or the
        precision of your estimates can be incorrectly estimated. So it is
        important! There is usually a good discussion of this in books on linear
        modelling. I don't know about kriging...

        F. You say you have 280 samples, yet somebody commented "you didn't have
        enough data" to say whether the data was normal or not... I find this hard
        to believe. Usually with 30-50 datapoints you can get a pretty good idea,
        so 280 should be ample? Depending on whether they cover the full range of
        possible values...

        Hope this is helpful.
        Sama

        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        Sama Low Choy s.lowchoy@...
        Senior Research Assistant ph: +61 07 3864 1750
        Australian Housing & Urban Research Institute fax: +61 07 3864 1827
        Queensland University of Technology, Brisbane, Australia
        *and*
        PhD student in statistics ph: +61 07 3864 1114
        School of Mathematics, QUT fax: +61 07 3864 2310
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

        On Thu, 26 Jun 1997, Laura Lengnick wrote:

        > Date: Thu, 26 Jun 1997 01:40:09 +0000
        > From: Laura Lengnick <llengnic@...>
        > To: ai-geostats@...
        > Subject: GEOSTATS: data distribution impact on kriging?
        >
        > I'm currently learning some spatial statistics and kriging techniques as
        > part of a project to characterize some agricultural land.
        >
        > I'm reading lots of papers and (thankfully) found "An introduction to
        > Applied Geostatistics" but I'm having trouble finding any literature to
        > explain the effect of a non-normal distribution of the sample data on the
        > rest of the analysis required to create kriged maps of soil and crop
        > variables.
        >
        > I've read lots of papers that ignore the normality issue altogether, some
        > that transform log normal data without comment, still others that krige
        > obviously non-normal data without comment....and nothing in any of the
        > papers or books that I've read that lays out this problem and what the
        > consequences of using non-normal data might be (except I think maybe
        > Cressie addresses this issue in his book....but I cannot understand his
        > book, so please don't suggest it as a resource without an accompanying
        > non-mathematical translation!).
        >
        > The best I've found in the literature is one comment in a McBratney et
        > al. paper (1982 Agronomie) that they "transformed the data to stabilize
        > their variances for later analysis and interpretation."
        >
        > I did check the ai-geostat web site and I didn't find anything on this
        > question, but I didn't go through all the archives....
        >
        > My data set has 280 samples, 20 variables. 4 are normally distributed, 2
        > are log-normal, the rest look sort of normal, but with heavy tails and
        > often skewed right. So far, I've used two tests for normality, the
        > Shapiro-Wilk test (in SAS) and a test for significance of skewness and
        > kurtosis that I found in a Snedecor and Cochran text that involved
        > comparing your data's skew and kurtosis to tables of significant values. I
        > have SAS and GS+ v. 2.3 available to do this work.
        >
        > Three questions:
        >
        > 1) can you suggest a reference that explicitly addresses the issue of how
        > using a non-normal data set will influence kriging values and associated
        > errors?
        >
        > 2) can you suggest other, more liberal tests for determining normality?
        > I've spoken with one statistician about this. He looked at the
        > distributions and said, "Oh, they are pretty close to normal, you could
        > probably just use them as is. They don't look like some other kind of
        > well-known distribution. And besides, you don't have enough data points to
        > really be able to test for normality." Hum, easy for him to say!
        >
        > 3) how would you approach an analysis of this data set?