- Laura wrote:
> 1) can you suggest a reference that explicitly addresses the issue of how
> using a non-normal data set will influence kriging values and associated
> errors?

I cannot suggest a reference, but here are some ideas:

Above all, a non-normal (strongly skewed) data set will influence the
variogram or covariance function. The experimental variogram tends to lose
its structure and come close to a pure nugget effect. Kriging with a
variogram of type "pure nugget" (i.e. for data without spatial correlation)
does not improve the estimation, in terms of estimation variance, compared
to other estimation methods such as Inverse Distance Weighting or Moving
Average. So kriging highly skewed data without transforming them becomes
rather pointless: you could use a simpler estimation method instead and
skip the time-consuming variography.

The loss of structure of the experimental variogram can be explained as
follows. Calculating the variogram means calculating, for every lag class h,
half the mean squared difference between the variable values of all pairs
in that class:

gamma(h) = 1/(2 n_h) * sum from (i=1) to (n_h) {[z(x_i) - z(x_i + h)]^2}

n_h being the number of pairs in the lag class h.

With a highly skewed distribution, most of the z-values are small and only
a few are very high. So most of the differences will be small (differences
between small values) and only a very small part of them will be large
(differences between small and large values). This is, I think, the
instability of the variances of untransformed variables mentioned by
McBratney et al. (1982). As the large differences can be several orders of
magnitude bigger, and are squared, they have a great influence on the mean
variogram value of the lag class. But these large differences are not tied
to any particular lag class: every lag class contains the same mix of a few
large differences and many small ones. So with skewed distributions we get
relatively high variogram values for every lag class, even for the short
lags near the origin, and the result is an unstructured variogram.
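A minimal numerical sketch of this argument, under assumptions that are not
from the discussion itself: the "spatial" variable is a made-up 1-D series
(an AR(1) process with correlation 0.9 between neighbours), the skewed raw
values are obtained by exponentiating it, and the semivariogram uses the
conventional factor 1/2. Dividing each variogram value by the sample variance
shows how close it already is to the sill at short lags.

```python
import math
import random

def semivariogram(z, lag):
    """Experimental semivariogram of a regularly spaced 1-D series at one lag."""
    sq_diffs = [(z[i] - z[i + lag]) ** 2 for i in range(len(z) - lag)]
    return 0.5 * sum(sq_diffs) / len(sq_diffs)

def variance(z):
    m = sum(z) / len(z)
    return sum((v - m) ** 2 for v in z) / len(z)

def simulate_correlated_normal(n, phi, rng):
    """AR(1) series with unit variance: a stand-in for a spatially correlated variable."""
    y = [rng.gauss(0.0, 1.0)]
    for _ in range(n - 1):
        y.append(phi * y[-1] + math.sqrt(1.0 - phi ** 2) * rng.gauss(0.0, 1.0))
    return y

def relative_variogram(z, lags):
    """Semivariogram divided by the sample variance (1.0 means 'at the sill')."""
    s2 = variance(z)
    return [semivariogram(z, h) / s2 for h in lags]

rng = random.Random(42)
log_values = simulate_correlated_normal(5000, 0.9, rng)  # structured on the log scale
raw_values = [math.exp(2.0 * y) for y in log_values]     # strongly right-skewed raw scale

lags = [1, 2, 5, 10, 20]
rel_log = relative_variogram(log_values, lags)
rel_raw = relative_variogram(raw_values, lags)
for h, g_log, g_raw in zip(lags, rel_log, rel_raw):
    print(f"lag {h:2d}: log scale {g_log:.2f}   raw scale {g_raw:.2f}")
```

On the raw scale the relative variogram should start much higher at the
shortest lag than on the log scale, which is the apparent nugget effect
described above.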

In Journel & Froidevaux (1982, Math.Geol., 14, pp.217-239) the variograms

of a highly skewed variable and its log transform are compared and the

coefficient of variation is given as a kind of indicator for the goodness

of the variograms.

So much for the problems that arise if we do not take account of the
non-normality. Unfortunately, transforming the data towards normality (or,
better said, reducing the skewness of the data distribution) by taking
logarithms results in other problems:

- the sensitivity of the back-transformed values to the kriging errors of
  the transformed variable (see the remarks of D. Myers and O. Costello)

- the problem of back-transforming the kriging error (although the formulae
  exist for both simple and ordinary kriging, I did not find an
  implementation in any geostatistical program, so one has to write one's
  own program).

If you are interested in the kriging values of your variable on the scale of
the raw data, the lognormal transformation does not seem to be a good way to
get them.
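The sensitivity named in the first problem can be sketched with the usual
back-transform for simple lognormal kriging, exp(y + variance/2). This is
only an illustration: it assumes the log-transformed variable is Gaussian,
and the kriged value and kriging variances below are hypothetical numbers,
not results from any data set.

```python
import math

def lognormal_backtransform(y_sk, var_sk):
    """Back-transform a simple-kriging estimate made on the log scale.

    y_sk is the kriged log value and var_sk its simple kriging variance;
    the log-transformed variable is assumed to be Gaussian.
    """
    return math.exp(y_sk + var_sk / 2.0)

y_sk = 1.0  # hypothetical kriged value on the log scale
for var_sk in (0.1, 0.5, 1.0, 2.0):  # hypothetical kriging variances
    estimate = lognormal_backtransform(y_sk, var_sk)
    print(f"kriging variance {var_sk:.1f} -> back-transformed estimate {estimate:.2f}")
```

Because the kriging variance sits in the exponent, the same kriged log value
gives a much larger raw-scale estimate wherever the kriging variance is
large, for example in sparsely sampled areas.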

(It is another matter if you are interested in a univariate or multivariate
statistical analysis of your data. There it may be sufficient to transform
the lognormal variable, or more precisely the non-normally distributed
variable, without any back transform. In this case the transformations
mentioned by A. Prasad and S. Low Choy could give better results than the
simple log transform. For the application of the Box-Cox method to
geochemical data, see Howarth and Earle (1979, Math.Geol., 11, pp.45-62).)
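For readers who have not met the Box-Cox method, a small sketch of the
transform follows. The data values are made up, and the lambda values are
simply tried by hand; in practice lambda is usually chosen by maximum
likelihood, which is not shown here.

```python
import math

def box_cox(x, lam):
    """Box-Cox power transform of a positive value x."""
    if x <= 0:
        raise ValueError("Box-Cox requires positive data")
    if lam == 0:
        return math.log(x)  # limiting case of the power transform
    return (x ** lam - 1.0) / lam

def skewness(values):
    """Moment coefficient of skewness (0 for a symmetric sample)."""
    n = len(values)
    m = sum(values) / n
    s2 = sum((v - m) ** 2 for v in values) / n
    return sum((v - m) ** 3 for v in values) / n / s2 ** 1.5

data = [0.5, 0.8, 1.0, 1.2, 1.5, 2.0, 3.0, 5.0, 12.0, 40.0]  # made-up skewed values
for lam in (1.0, 0.5, 0.0, -0.5):
    transformed = [box_cox(x, lam) for x in data]
    print(f"lambda {lam:4.1f}: skewness {skewness(transformed):.2f}")
```

The log transform is the special case lambda = 0, so Box-Cox can be seen as
a family of transforms from which the one that best reduces the skewness is
picked.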

An alternative way of handling data with a skewed distribution
geostatistically is to perform a normal score transform (see Journel &
Huijbregts 1978, pp. 566, 567 and also the figure on p. 478). After kriging
the transformed data, the kriged values can easily be back-transformed by
inverting the normal score transform. There is a routine in GSLIB for it.
The kriging variances cannot be back-transformed as simply as the values.
To get an expression for the estimation variance on the scale of the raw
data, one can run a large number of simulations of the transformed data and
then build an error interval from the back-transformed simulated values.
(I have not tried it myself, but it seems reasonable and I plan to do it
with GSLIB in a free minute.) In this way one avoids the high values that
are due to the high kriging errors in sparsely sampled regions. But, of
course, in such regions the quality of the estimation is never very good.
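A rough sketch of the normal score transform and its back-transform, to make
the idea concrete. It is not the GSLIB routine: the plotting position
(rank - 0.5)/n, the linear interpolation, the refusal to extrapolate beyond
the data, and all numbers are assumptions chosen for illustration, and tied
values are not handled.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1

def normal_score_table(data):
    """Pair each sorted data value with the standard normal quantile of its rank."""
    ordered = sorted(data)
    n = len(ordered)
    scores = [STD_NORMAL.inv_cdf((rank - 0.5) / n) for rank in range(1, n + 1)]
    return ordered, scores

def back_transform(score, ordered, scores):
    """Map a normal score back to the raw scale by linear interpolation in the table."""
    if score <= scores[0]:
        return ordered[0]   # no extrapolation below the smallest datum
    if score >= scores[-1]:
        return ordered[-1]  # no extrapolation above the largest datum
    for k in range(1, len(scores)):
        if score <= scores[k]:
            t = (score - scores[k - 1]) / (scores[k] - scores[k - 1])
            return ordered[k - 1] + t * (ordered[k] - ordered[k - 1])

raw = [0.3, 0.4, 0.6, 0.7, 1.1, 1.6, 2.4, 5.0, 9.5, 31.0]  # made-up skewed sample
ordered, scores = normal_score_table(raw)

# Hypothetical simulated values at one location, on the normal-score scale.
simulated_scores = [-0.9, -0.2, 0.1, 0.4, 0.8, 1.3]
simulated_raw = sorted(back_transform(s, ordered, scores) for s in simulated_scores)
print("back-transformed simulations:", [round(v, 2) for v in simulated_raw])
print("range of simulated values:", simulated_raw[0], "to", simulated_raw[-1])
```

With many simulations per location, percentiles of the back-transformed
values would give the error interval on the raw scale suggested above.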

[Concerning the mail of O. Costello, who wrote: "...I end up with
relatively high concentration estimates for nodes in areas I know
are clean ....": did the kriging also know that these areas are clean,
or was this supplementary "soft information" that was not included in
the data set used for kriging?]

Laura wrote:
> I did check the ai-geostat web site and I didn't find anything on this
> question, but I didn't go through all the archives....

There have been two discussions about lognormal issues on the ai-geostat
list, one in December 1996 (sampling for the 90th percentile of the
lognormal distribution) and one in March 1997 (the lognormal in geology).
I found there a good discussion of why lognormal distributions appear in
geology (the article by Allegre and Lewin on mixing and fractionation
processes that was mentioned there is really nice to read and
comprehensible for a non-mathematician), though it does not give a solution
for the spatial estimation of variables with skewed distributions.

Regards

Swantje (lindner@...-freiberg.de)

--

*To post a message to the list, send it to ai-geostats@....

*As a general service to list users, please remember to post a summary

of any useful responses to your questions.

*To unsubscribe, send email to majordomo@... with no subject and

"unsubscribe ai-geostats" in the message body.

DO NOT SEND Subscribe/Unsubscribe requests to the list!

- Hi Laura,

As a statistician, I can help you with your question 2 on tests for

normality. I wonder if any geostatisticians currently find these things

useful?

A. Some visual aids for diagnosing normality include:

1. Comparing the density of your data with the theoretical density:
   * compute the mean and standard deviation of your data
   * plot a normal density function with this mean and standard deviation
   * overlay a fitted density (S-PLUS does this easily with the density
     function) or a histogram of the data to see how well they fit

2. A q-q plot (or quantile-quantile plot) compares the distribution of
   your data with any other distribution (or dataset), e.g. a normal.
   S-PLUS does this easily with the qqnorm function. Other statistical
   packages also do this, but sometimes need a bit more fiddling. If you
   can't find a nice package to work it out for you (or if you want to
   know how it's worked out), then do this:
   * work out the quantiles of the matching theoretical normal
     distribution (i.e. one with the same mean and stdev as your data),
     e.g. the 1st, 2nd, ..., 99th percentiles of the normal, using tables
     or a stats package. (Stop at the 99th: the 100th percentile of a
     normal is infinite.)
   * sort your data from lowest to highest and find where the same
     percentiles fall. E.g. with 280 datapoints, the 1st percentile will
     be the 2.8th datapoint, i.e. .2 * the 2nd + .8 * the 3rd, etc.
   * plot the theoretical percentiles against the data percentiles.
     This should give a straight line if they match up.
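The by-hand q-q recipe in item 2 can be sketched as follows. The sample is
made-up lognormal data rather than anyone's real measurements, the function
names are invented for this example, and the plotting itself is left out;
the code only produces the pairs of percentiles to plot.

```python
import math
import random
from statistics import NormalDist, mean, stdev

def data_percentile(sorted_data, p):
    """Percentile by the interpolation rule above: position p*n, counted from 1."""
    n = len(sorted_data)
    pos = p * n
    k = math.floor(pos)  # datapoint just below the position (1-based)
    frac = pos - k
    if k < 1:
        return sorted_data[0]
    if k >= n:
        return sorted_data[-1]
    return (1.0 - frac) * sorted_data[k - 1] + frac * sorted_data[k]

def qq_points(data):
    """Pairs (theoretical normal percentile, data percentile), 1st to 99th."""
    ordered = sorted(data)
    fitted = NormalDist(mean(ordered), stdev(ordered))
    return [(fitted.inv_cdf(q / 100.0), data_percentile(ordered, q / 100.0))
            for q in range(1, 100)]

rng = random.Random(1)
sample = [rng.lognormvariate(0.0, 1.0) for _ in range(280)]  # made-up skewed data
for theo, obs in qq_points(sample)[::14]:
    print(f"theoretical {theo:8.2f}   data {obs:8.2f}")
```

For data that really are normal the two columns should agree closely; for a
right-skewed sample like this one, the data percentiles should fall away
from the theoretical ones in the tails, which shows up as curvature in the
plot.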

B. Statistical tests for normality:

* You could also use the Kolmogorov-Smirnov test to check whether the
  cumulative distribution function of your data equals that of the normal
  (similar to the q-q plot; also easy in S-PLUS).
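As a sketch of what the Kolmogorov-Smirnov test measures, the code below
computes only the test statistic, i.e. the largest gap between the empirical
cumulative distribution and a reference one. The sample is made up. No
p-value is computed, since the standard tables assume the reference
distribution was fixed in advance, and fitting the mean and standard
deviation from the same data calls for adjusted critical values (the
Lilliefors variant of the test).

```python
from statistics import NormalDist, mean, stdev

def ks_statistic(data, cdf):
    """Largest vertical distance between the empirical CDF of data and cdf."""
    ordered = sorted(data)
    n = len(ordered)
    d = 0.0
    for i, x in enumerate(ordered, start=1):
        f = cdf(x)
        # The empirical CDF jumps from (i-1)/n to i/n at x; check both sides.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d

sample = [0.3, 0.4, 0.6, 0.7, 1.1, 1.6, 2.4, 5.0, 9.5, 31.0]  # made-up skewed sample
fitted = NormalDist(mean(sample), stdev(sample))
print("K-S distance to the fitted normal:", round(ks_statistic(sample, fitted.cdf), 3))
```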

C. There is another distribution gaining popularity in

Long-Range-Dependence modelling, since it allows very heavy tails.

Unfortunately, I can't remember its name right now... And I'm doubtful

that it has been incorporated into the kriging literature.

D. The comment about transforming the data to stabilize the variance
describes what statisticians used to do when fitting linear models (LMs),
BEFORE generalized linear models (GLMs) came on to the scene and BEFORE
software became available to fit them easily. So instead of fitting a true
log-linear model with Poisson error (a GLM), statisticians used to simply
take log(Y), with Y being the response variable, fit a linear model with
Normal errors *on the log scale*, and then back-transform. Other
transformations are: the square-root transformation for count data, the
logistic/probit/gompertz transforms for binary data, etc.

E. If your data don't follow the distribution assumed by *any* statistical
model, then your estimates can be biased, and/or the precision of your
estimates can be incorrectly estimated. So it is important! There is
usually a good discussion of this in books on linear modelling. I don't
know about kriging...

F. You say you have 280 samples, yet somebody commented "you didn't have

enough data" to say whether the data was normal or not... I find this hard

to believe. Usually with 30-50 datapoints you can get a pretty good idea,

so 280 should be ample? Depending on whether they cover the full range of

possible values...

Hope this is helpful.

Sama

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Sama Low Choy s.lowchoy@...

Senior Research Assistant ph: +61 07 3864 1750

Australian Housing & Urban Research Institute fax: +61 07 3864 1827

Queensland University of Technology, Brisbane, Australia

*and*

PhD student in statistics ph: +61 07 3864 1114

School of Mathematics, QUT fax: +61 07 3864 2310

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

On Thu, 26 Jun 1997, Laura Lengnick wrote:

> Date: Thu, 26 Jun 1997 01:40:09 +0000

> From: Laura Lengnick <llengnic@...>

> To: ai-geostats@...

> Subject: GEOSTATS: data distribution impact on kriging?

>

> I'm currently learning some spatial statistics and kriging techniques as

> part of a project to characterize some agricultural land.

>

> I'm reading lots of papers and (thankfully) found "An introduction to

> Applied Geostatistics" but I'm having trouble finding any literature to

> explain the effect of a non-normal distribution of the sample data on the

> rest of the analysis required to create kriged maps of soil and crop

> variables.

>

> I've read lots of papers that ignore the normality issue altogether, some

> that transform log normal data without comment, still others that krige

> obviously non-normal data without comment....and nothing in any of the

> papers or books that I've read that lays out this problem and what the

> consequences of using non-normal data might be (except I think maybe

> Cressie addresses this issue in his book....but I cannot understand his

> book, so please don't suggest it as a resource without an accompanying

> non-mathematical translation!).

>

> The best I've found in the literature is one comment in a McBratney et

> al. paper (1982 Agronomie) that they "transformed the data to stabilize

> their variances for later analysis and interpretation."

>

> I did check the ai-geostat web site and I didn't find anything on this

> question, but I didn't go through all the archives....

>

> My data set has 280 samples, 20 variables. 4 are normally distributed, 2

> are log-normal, the rest look sort of normal, but with heavy tails and

> often skewed right. So far, I've used two tests for normality, the

> Shapiro-Wilk test (in SAS) and a test for significance of skewness and

> kurtosis that I found in a Snedecor and Cochran text that involved

> comparing your data's skew and kurtosis to tables of significant values. I

> have SAS and GS+ v. 2.3 available to do this work.

>

> Three questions:

>

> 1) can you suggest a reference that explicitly addresses the issue of how

> using a non-normal data set will influence kriging values and associated

> errors?

>

> 2) can you suggest other, more liberal tests for determining normality?

> I've spoken with one statistician about this. He looked at the

> distributions and said, "Oh, they are pretty close to normal, you could

> probably just use them as is. They don't look like some other kind of

> well-known distribution. And besides, you don't have enough data points to

> really be able to test for normality." Hum, easy for him to say!

>

> 3) how would you approach an analysis of this data set?
