re:GEOSTATS: data distribution impact on kriging?

Message 1 of 4 , Nov 30, 2002
Laura wrote:
> 1) can you suggest a reference that explicitly addresses the issue of how
> using a non-normal data set will influence kriging values and associated
> errors?

I cannot suggest a reference, but here are some ideas:

Above all, a non-normal data set will influence the variogram or covariance
function. With a highly skewed distribution, the experimental variogram tends
to lose its structure and approach a pure nugget effect.
Kriging with variograms of type "pure nugget" (i.e. for data without
spatial correlation) doesn't improve the estimation (in terms of
estimation variance) compared to simpler estimation methods such as
Inverse Distance Weighting or Moving Average. So not transforming
highly skewed data makes kriging rather pointless, as you
could instead use a simpler estimation method without
time-consuming variography.

Concerning the loss of structure of the experimental variogram,
it can be explained as follows. Calculating the variogram means
calculating, for every lag class h, half the mean of the squared
differences between the variable values:

gamma(h) = 1/(2 n_h) * sum from (i=1) to (n_h) of [z(x_i) - z(x_i + h)]^2

n_h being the number of pairs in the lag class h.
With a right-skewed distribution, most of the z-values are small and only
a few are very high.
So most of the differences will be small (differences between
small values) and only a very small part of the differences will
be large (differences between small and large values). This is, I think,
the instability of the variances of untransformed variables mentioned
by McBratney et al. 1982. As the large differences can be some orders of
magnitude bigger, and are squared as well, they have a great influence on
the mean variogram value of the lag class. But these large differences are
not associated with any particular lag class: we will have the same mix of
a few large differences and many small differences in every lag class.
So with skewed distributions
we will get relatively high variogram values for every lag class, even
for the short lags near the origin - the result being an unstructured
variogram.
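
The argument above can be tried out numerically. The following is only a
minimal sketch, not anything from the cited papers: it assumes a regularly
spaced 1-D transect, uses a Gaussian AR(1) series as a stand-in for a
spatially correlated variable, and uses the usual semivariogram convention
with the factor 1/2. The function names, the correlation 0.9 and the log-scale
standard deviation 2 are all made up for illustration.

```python
import math
import random


def semivariogram(z, max_lag):
    """Experimental semivariogram of a regularly spaced 1-D series.

    Returns a list whose entry h-1 is 1/(2 n_h) * sum (z[i] - z[i+h])^2,
    for h = 1..max_lag.
    """
    gammas = []
    for h in range(1, max_lag + 1):
        sq_diffs = [(z[i] - z[i + h]) ** 2 for i in range(len(z) - h)]
        gammas.append(sum(sq_diffs) / (2 * len(sq_diffs)))
    return gammas


def simulate_ar1(n, rho, seed):
    """Gaussian AR(1) series with unit variance (an assumed toy model of a
    correlated variable along a transect)."""
    rng = random.Random(seed)
    y = [rng.gauss(0.0, 1.0)]
    for _ in range(n - 1):
        y.append(rho * y[-1] + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0))
    return y


def variance(values):
    """Population variance, used here as a rough stand-in for the sill."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


if __name__ == "__main__":
    log_z = [2.0 * y for y in simulate_ar1(2000, 0.9, seed=1)]  # log scale
    raw_z = [math.exp(v) for v in log_z]                        # highly skewed
    for name, series in (("raw", raw_z), ("log", log_z)):
        sill = variance(series)
        ratios = [g / sill for g in semivariogram(series, 10)]
        print(name, [round(r, 2) for r in ratios])
```

Each printed list is the semivariogram divided by the sample variance for
lags 1 to 10. If the reasoning above holds, the "raw" ratios should already be
a large fraction of the variance at the shortest lags, while the "log" ratios
should start low and rise with the lag.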

In Journel & Froidevaux (1982, Math.Geol., 14, pp.217-239) the variograms
of a highly skewed variable and its log transform are compared and the
coefficient of variation is given as a kind of indicator for the goodness
of the variograms.

So much for the problems that arise if we ignore the non-normality.

Unfortunately, transforming the data to normality (or rather, reducing
the skewness of the data distribution) by taking logarithms will result in
other problems:

-the problem of sensitivity of the back-transformed values to the kriging
errors of the transformed variable (see the remarks of D. Myers and O. Costello)

-the problem of back-transforming the kriging error (although the formulae exist
for both simple and ordinary kriging, I didn't find an implementation
in a geostatistical program, so one has to write one's own program).

If you are interested in the kriging values of your variable on the scale of the
raw data, the lognormal transformation does not seem to be a good way to get them.
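
The sensitivity mentioned above can be illustrated with the textbook
back-transform for simple lognormal kriging, Z* = exp(Y* + sigma_SK^2 / 2),
where Y* is the kriged value on the log scale and sigma_SK^2 its simple
kriging variance. This is only a sketch: the function name and the numbers are
hypothetical, and the ordinary kriging case (which carries an additional
Lagrange multiplier term) is not covered.

```python
import math


def lognormal_sk_backtransform(y_est, sk_variance):
    """Back-transform a simple-kriging estimate made on the log scale.

    Implements Z* = exp(Y* + sigma_SK^2 / 2). Ordinary kriging needs an
    extra Lagrange-multiplier term that is not handled here.
    """
    return math.exp(y_est + 0.5 * sk_variance)


if __name__ == "__main__":
    y_est = 1.0  # hypothetical kriged value of log(z)
    for sk_variance in (0.5, 1.0, 1.5, 2.0):
        z_est = lognormal_sk_backtransform(y_est, sk_variance)
        print(f"kriging variance {sk_variance:.1f} -> z estimate {z_est:.2f}")
```

Because the kriging variance sits inside the exponential, the same log-scale
estimate is multiplied by a larger and larger factor as the variance grows,
which is one way to end up with high back-transformed values in sparsely
sampled areas.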

(It's another matter if you are interested in a univariate or
multivariate statistical analysis of your data; there it may be sufficient to
transform the lognormal variable, or more precisely the non-normally distributed
variable, without a back transform. In this case the transformations mentioned
by A. Prasad and S. Low Choy could give better results than the simple log
transform. As to the application of the Box-Cox method to geochemical data, you
can refer to Howarth and Earle (1979, Math.Geol., 11, pp.45-62).)

An alternative way of handling data with a skewed distribution geostatistically
is to perform a normal score transform (see Journel & Huijbregts 1978, pp. 566-567,
and also the figure on p. 478). After kriging the transformed data, the
kriged values can easily be back-transformed by inverting the normal score transform.
There is a routine in GSLIB for it. The kriging variances cannot be back-transformed
as simply as the values. To get an expression for the estimation
variance on the scale of the raw data, one can perform a large number of simulations
of the transformed data and then build an error interval from the back-transformed
simulated values. (I haven't tried it myself, but it seems reasonable and
I plan to do it with GSLIB when I have a free minute.) In this way one avoids the
high values that are due to the high kriging errors in sparsely sampled regions.
But, of course, in such regions the quality of the estimation is never
very good.
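
To make the idea concrete, here is a minimal sketch of a normal score
transform, its back transform, and an interval built from simulated values.
It is not the GSLIB routine: the plotting position (k - 0.5)/n, the clamping
of the tails, and all function names are assumptions made for illustration,
and the simulated normal scores are assumed to come from some other program.

```python
from statistics import NormalDist


def normal_score_table(values):
    """Return sorted (raw value, normal score) pairs.

    The k-th smallest of n values gets the standard normal quantile of
    (k - 0.5) / n. Ties are not treated specially in this sketch.
    """
    n = len(values)
    std_normal = NormalDist()
    return [(v, std_normal.inv_cdf((k - 0.5) / n))
            for k, v in enumerate(sorted(values), start=1)]


def forward(value, table):
    """Normal score of a raw value that is present in the table."""
    for raw, score in table:
        if raw == value:
            return score
    raise ValueError("value not in table")


def back_transform(score, table):
    """Map a normal score back to the raw scale by linear interpolation.

    Scores outside the table are clamped to the smallest or largest raw
    value (a simplification: no tail extrapolation).
    """
    if score <= table[0][1]:
        return table[0][0]
    if score >= table[-1][1]:
        return table[-1][0]
    for (raw_lo, s_lo), (raw_hi, s_hi) in zip(table, table[1:]):
        if s_lo <= score <= s_hi:
            t = (score - s_lo) / (s_hi - s_lo)
            return raw_lo + t * (raw_hi - raw_lo)


def interval_from_simulations(simulated_scores, table, coverage=0.9):
    """Empirical interval on the raw scale from simulated normal scores at
    one location."""
    raw = sorted(back_transform(s, table) for s in simulated_scores)
    tail = (1.0 - coverage) / 2.0
    lower = raw[int(tail * (len(raw) - 1))]
    upper = raw[int(round((1.0 - tail) * (len(raw) - 1)))]
    return lower, upper
```

Since the back transform is monotone, quantiles of the simulated scores map
directly to quantiles on the raw scale, which is why an interval can be read
off the back-transformed simulations even though the kriging variance itself
cannot be converted so simply.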

[Concerning the mail of O. Costello - he wrote: ...I end up with
relatively high concentration estimates for nodes in areas I know
are clean .... -
did the kriging also know that these areas are clean, or was
this supplementary "soft information" that was not included in the
data set for kriging?]

Laura wrote:
> I did check the ai-geostat web site and I didn't find anything on this
> question, but I didn't go through all the archives....

There have been two discussions about lognormal issues on the ai-geostat list,
one in December 1996 (sampling for the 90th percentile of the lognormal
distribution) and one in March 1997 (the lognormal in geology). I found
there a good discussion of why lognormal distributions appear in
geology (the article by Allegre and Lewin mentioned there, about mixing
and fractionation processes, is really nice to read and comprehensible for a
non-mathematician), though it doesn't give a solution for the spatial
estimation of variables with skewed distributions.

Regards

Swantje (lindner@...-freiberg.de)
--
*To post a message to the list, send it to ai-geostats@....
*As a general service to list users, please remember to post a summary
of any useful responses to your questions.
*To unsubscribe, send email to majordomo@... with no subject and
"unsubscribe ai-geostats" in the message body.
DO NOT SEND Subscribe/Unsubscribe requests to the list!
Message 2 of 4 , Jun 26, 1997
Hi Laura,

As a statistician, I can help you with your question 2 on tests for
normality. I wonder if any geostatisticians currently find these things
useful?

A. Some visual aids for diagnosing normality include:

1. comparing the density of your data and the theoretical density
* compute the mean and standard deviation for your data
* plot a normal density function with this mean and deviation
* overlay a fitted density (splus does this easily with the density
function) or a histogram of the data to see how well they fit

2. a q-q plot (or quantile-quantile plot) will compare the distribution of
your data to any other distribution (or dataset), e.g. a normal. splus
does this easily with the qqnorm function. other statistical packages
also do this, but sometimes need a bit more fiddling. if you can't find
a nice package to work it out for you (or if you want to know how it's
worked out), then do this:
* work out the quantiles of the matching theoretical normal
distribution (ie has same mean and stdev as your data). E.g. work out
the 1st, 2nd, ..., 98th, 99th percentiles of the normal using tables
or a stats package. (stop at the 99th: the 100th percentile of a normal
is infinite.)
* sort your data from lowest to highest. find out where the percentiles
are. eg with 280 datapoints, the 1st percentile will be the 2.8th
datapoint, ie .2 * the 2nd + .8 * the 3rd. etc.
* plot the theoretical percentiles against the data percentiles.
this should give a straight line if they match up.
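
The by-hand q-q recipe above can be sketched in a few lines. This is only an
illustration using the Python standard library rather than splus; the
function names are made up, and the percentile rule is the simple one
described above (the p/100 * n-th datapoint, interpolated), which is one of
several conventions in use.

```python
import math
from statistics import NormalDist, mean, stdev


def data_percentile(sorted_data, p):
    """p-th percentile of sorted data: take the (p/100 * n)-th datapoint,
    interpolating linearly between neighbours (positions are 1-based)."""
    n = len(sorted_data)
    pos = min(max(p / 100.0 * n, 1.0), float(n))  # clamp to [1, n]
    k = int(math.floor(pos))
    frac = pos - k
    if k >= n:
        return sorted_data[-1]
    return (1.0 - frac) * sorted_data[k - 1] + frac * sorted_data[k]


def qq_points(data, percents=range(1, 100)):
    """Pairs (theoretical normal percentile, data percentile) for a q-q plot,
    using a normal with the same mean and standard deviation as the data."""
    ordered = sorted(data)
    fitted = NormalDist(mean(ordered), stdev(ordered))
    return [(fitted.inv_cdf(p / 100.0), data_percentile(ordered, p))
            for p in percents]
```

Plotting the first element of each pair against the second with any plotting
tool gives the q-q plot; points far off the straight line, typically at the
ends, show where the tails of the data differ from the normal.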

B. Statistical tests for normality:
* You could also use Kolmogorov-Smirnov to test that the cumulative
distribution functions are equal (similar to the qqplot).(also easy in
splus).
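
For readers without splus, the Kolmogorov-Smirnov statistic itself is easy to
compute. The sketch below uses hypothetical function names and the common
large-sample 5% critical value 1.36 / sqrt(n). One caveat worth stating: that
critical value assumes the normal's mean and standard deviation were fixed in
advance. If they are estimated from the same data, the test becomes too
lenient and a corrected version (the Lilliefors variant) is the usual remedy.

```python
from statistics import NormalDist


def ks_statistic(data, dist):
    """Largest vertical distance between the empirical CDF of `data` and the
    CDF of `dist` (a statistics.NormalDist)."""
    ordered = sorted(data)
    n = len(ordered)
    d = 0.0
    for i, x in enumerate(ordered, start=1):
        f = dist.cdf(x)
        # The empirical CDF jumps from (i - 1)/n to i/n at x: check both sides.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


def ks_critical_5pct(n):
    """Large-sample approximation of the 5% critical value, valid when the
    reference distribution is fully specified in advance."""
    return 1.36 / n ** 0.5
```

A statistic larger than the critical value suggests the data do not follow
the reference normal; with 280 samples the approximate 5% threshold is about
0.08.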

C. There is another distribution gaining popularity in
Long-Range-Dependence modelling, since it allows very heavy tails.
Unfortunately, I can't remember its name right now... And I'm doubtful
that it has been incorporated into the kriging literature.

D. The comment about transforming the data to stabilize the variance is
what statisticians used to do when fitting linear models (LMs) BEFORE
generalized linear models (GLMs) came on to the scene and BEFORE software
became available to do this easily. So instead of
fitting a true log-linear model with Poisson error (a GLM), statisticians
used to simply take log(Y), with Y being the response variable, and fit a
linear model with Normal errors *on the log scale*, and then
back-transform. Other transformations are: the square-root transformation
for count data, the logistic/probit/gompertz transforms for binary data,
etc.

statistical modelling, then your estimates can be biased, and/or the
precision of your estimates can be incorrectly estimated. So it is
important! There is usually a good discussion of this in books on linear
modelling. I don't know about kriging...

F. You say you have 280 samples, yet somebody commented "you didn't have
enough data" to say whether the data was normal or not... I find this hard
to believe. Usually with 30-50 datapoints you can get a pretty good idea,
so 280 should be ample? Depending on whether they cover the full range of
possible values...

Sama

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Sama Low Choy s.lowchoy@...
Senior Research Assistant ph: +61 07 3864 1750
Australian Housing & Urban Research Institute fax: +61 07 3864 1827
Queensland University of Technology, Brisbane, Australia
*and*
PhD student in statistics ph: +61 07 3864 1114
School of Mathematics, QUT fax: +61 07 3864 2310
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

On Thu, 26 Jun 1997, Laura Lengnick wrote:

> Date: Thu, 26 Jun 1997 01:40:09 +0000
> From: Laura Lengnick <llengnic@...>
> To: ai-geostats@...
> Subject: GEOSTATS: data distribution impact on kriging?
>
> I'm currently learning some spatial statistics and kriging techniques as
> part of a project to characterize some agricultural land.
>
> I'm reading lots of papers and (thankfully) found "An introduction to
> Applied Geostatistics" but I'm having trouble finding any literature to
> explain the effect of a non-normal distribution of the sample data on the
> rest of the analysis required to create kriged maps of soil and crop
> variables.
>
> I've read lots of papers that ignore the normality issue all together, some
> that transform log normal data without comment, still others that krig
> obviously non-normal data without comment....and nothing in any of the
> papers or books that I've read that lays out this problem and what the
> consequences of using non-normal data might be (except I think maybe
> Cressie addresses this issue in his book....but I cannot understand his
> book, so please don't suggest it as a resource without an accompanying
> non-mathematical translation!).
>
> The best I've found in the literature is one comment in a McBratney, et.
> al. paper (1982 Agronomie) that they " transformed the data to stabilize
> their variances for later analysis and interpretation."
>
> I did check the ai-geostat web site and I didn't find anything on this
> question, but I didn't go through all the archives....
>
> My data set has 280 samples, 20 variables. 4 are normally distributed, 2
> are log-normal, the rest look sort of normal, but with heavy tails and
> often skewed right. So far, I've used two tests for normality, the
> Shapiro-Wilk test (in SAS) and a test for significance of skewness and
> kurtosis that I found in a Snedecor and Cochran text that involved
> comparing your data's skew and kurtosis to tables of significant values. I
> have SAS and GS+ v. 2.3 available to do this work.
>
> Three questions:
>
> 1) can you suggest a reference that explicitly addresses the issue of how
> using a non-normal data set will influence kriging values and associated
> errors?
>
> 2) can you suggest other, more liberal tests for determining normality?
> distributions and said, "Oh, they are pretty close to normal, you could
> probably just use them as is. They don't look like some other kind of
> well-known distribution. And besides, you don't have enough data points to
> really be able to test for normality." Hum, easy for him to say!
>
> 3) how would you approach an analysis of this data set?