Mar 4, 2001

Dear all,

I'm sorry for being so late with the summary of the replies I got to the

following questions:

- What are the drawbacks of the normal score transformation?

- What are the latest developments for properly handling data sets that have

a log normal distribution?

I have cut and pasted below bits and pieces of the many replies I

received. Thanks a lot to:

Andrew, Joao Felipe, Isobel Clark, Paulo Justiniano Ribeiro Jr, Warr Benjamin,

Hirotaka Saito, Nelleke Swager, Syed Abdul Rahman Shibli, Raymond J. O'Connor

I also received a two-page reply from Donald Myers. I have put the full

text in the archives of AI-GEOSTATS.

-----------------------------------------------------------------------

A. Comments on skewed data sets

The skewness of a data set can have many different origins and its

interpretation is of course highly subjective. Many assumptions have therefore

to be made.

Most of geostatistics is "distribution free", i.e., the derivation of the

simple kriging, ordinary kriging and universal kriging equations do not depend

on a distributional assumption (contrary to what is sometimes claimed).

However if a distributional assumption is to be useful it should be

multivariate rather than just univariate. Essentially none of the

transformations that are used in geostatistics can really preserve or produce

multivariate distributional properties; they are only univariate

transformations. For example, a histogram might appear lognormal and the

log-transformed data might then appear normal, but this does not imply anything

about multivariate lognormality.

When handling skewed data sets, one can

1) remove the long tail and dismiss it as another population, i.e. work with

the main subset

2) dismiss the long tail as a set of "erroneous" data (this might be difficult

to justify)

3) use the data "as is" and use more robust measures, e.g. madogram, and do

not work with squared differences which are quite sensitive to long tails. The

choice of the sill becomes a problem in such a case. In the case of

multivariate lognormality, one can compute the relationship between the

variogram/covariance of the original and the variogram/covariance of the

transformed. This relationship is

essentially unknown in all other cases because it again requires knowing the

multivariate distribution in analytic form (and being able to carry out

certain complicated multiple integrations). The multivariate

transform must be known in analytic form and have a unique inverse. There are

examples in the literature of using power series approximations for the

transformation but too often the approximation is reduced to a linear one.

4) use a transformation and work in the transformed domain before

backtransforming (watch out for possible biases, where applicable).

5) use an indicator transform for different thresholds and regard the

connectivity of extreme values foremost on your agenda. This might be

difficult to implement in practice, particularly with sparse datasets

and the deterioration of the number of "pairs" at extreme thresholds where you

would normally want the best "resolution" anyway (median indicator kriging is

a possible workaround).

From the replies I have received, the last seems to be the most frequently

chosen option.
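Option 3 above can be illustrated with a minimal sketch. It assumes a regularly spaced 1-D transect (chosen here only for brevity), and the function names and data are hypothetical. It compares the classical semivariogram, based on squared differences, with the madogram, based on absolute differences, when a single extreme value is present.

```python
def semivariogram(values, lag):
    """Half the mean squared difference between points `lag` steps apart."""
    pairs = [(values[i], values[i + lag]) for i in range(len(values) - lag)]
    return 0.5 * sum((a - b) ** 2 for a, b in pairs) / len(pairs)


def madogram(values, lag):
    """Half the mean absolute difference between points `lag` steps apart."""
    pairs = [(values[i], values[i + lag]) for i in range(len(values) - lag)]
    return 0.5 * sum(abs(a - b) for a, b in pairs) / len(pairs)


# Hypothetical transect: a smooth background plus one extreme value.
background = [1.0, 1.2, 0.9, 1.1, 1.0, 1.3, 0.8, 1.0]
with_outlier = background[:4] + [50.0] + background[5:]

for name, fn in (("semivariogram", semivariogram), ("madogram", madogram)):
    ratio = fn(with_outlier, 1) / fn(background, 1)
    print(f"{name}: lag-1 value inflated by a factor of {ratio:.0f}")
```

In this toy case the single extreme value inflates the squared-difference measure far more than the absolute-difference one, which is the sensitivity to long tails mentioned in option 3.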

B. Problems with Normal Score Transformation (NST)

The NST is useful for revealing the spatial correlation of highly skewed data sets.

Nevertheless, when a transformation is made prior to the estimation, several

problems remain. First, one has introduced an element of ranking rather

than interval or ratio data for the original. Although one uses the NST data

as satisfying the requirements of normality, the back transformation process

can only recover the point estimates (e.g., for confidence limits) within the

resolution afforded by the original data at that point. If you have sparsely

distributed data there, the limit estimate has an uncertainty reflecting the

corresponding coarse steps (more a measurement error than an estimation

error).

Second, if one has ties in the original data, the NST assigns them to the

corresponding block of contiguous normal scores. Thus extra variance is

introduced as a result of handling the ties.
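The effect of ties can be seen in a minimal sketch of a rank-based normal score transform. It assumes plotting positions (rank - 0.5)/n and uses Python's `statistics.NormalDist`; the function name and the sample data are hypothetical. Tied values are ranked in order of appearance, so identical data receive different normal scores.

```python
from statistics import NormalDist


def normal_scores(values):
    """Rank-based normal scores; ties are broken by order of appearance."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])  # stable sort
    scores = [0.0] * n
    for rank, i in enumerate(order, start=1):
        # Plotting position (rank - 0.5) / n keeps probabilities inside (0, 1).
        scores[i] = NormalDist().inv_cdf((rank - 0.5) / n)
    return scores


# Hypothetical data set with many zeros, e.g. values below a detection limit.
data = [0.0, 0.0, 0.0, 0.0, 0.3, 1.2, 4.5, 20.0]
scores = normal_scores(data)

for z, y in zip(data, scores):
    print(f"{z:6.1f} -> {y:+.3f}")
```

The four zeros are spread over four distinct scores, which is the extra variance described above. Breaking ties at random or averaging the scores within a tied group are possible alternatives, each with its own drawbacks.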

There are two types of nscore transformation:

1) a frequency based NST: data are transformed in order to get a histogram

showing a normal distribution.

Drawback: the ordering of the tied values introduces a bias when doing a

back-transformation, especially if there are many zero values.

2) an empirically based NST: the transformation uses the cumulative

distribution and assigns the equivalent in the Gaussian space. When performing

a back-transformation, one gets the original value.

Drawback: the histogram of the transformed data is often not normal.

Nevertheless, the results after kriging and simulation appear to be relatively

robust.
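A back-transformation can be sketched as a table lookup with linear interpolation between (normal score, original value) pairs. This is only a sketch under stated assumptions: the table is hypothetical, its scores are strictly increasing, values outside the table are clamped to its extremes (tail extrapolation is a separate modelling choice not covered here), and `back_transform` is an illustrative name.

```python
from bisect import bisect_left


def back_transform(y, table):
    """Map a normal score `y` to the original scale.

    `table` is a list of (score, value) pairs with strictly increasing scores.
    """
    scores = [s for s, _ in table]
    if y <= scores[0]:
        return table[0][1]       # clamp below the lowest score
    if y >= scores[-1]:
        return table[-1][1]      # clamp above the highest score
    j = bisect_left(scores, y)   # scores[j - 1] < y <= scores[j]
    (s0, v0), (s1, v1) = table[j - 1], table[j]
    return v0 + (v1 - v0) * (y - s0) / (s1 - s0)


# Hypothetical transformation table built from five data values.
table = [(-1.28, 0.1), (-0.52, 0.4), (0.0, 1.0), (0.52, 3.0), (1.28, 12.0)]

print(back_transform(0.0, table))    # a tabulated score returns its datum
print(back_transform(0.26, table))   # interpolated between 1.0 and 3.0
print(back_transform(2.5, table))    # clamped to the largest datum
```

With sparse data the gaps between tabulated values are wide, so the back-transformed estimate can only be as fine as the interpolation between neighbouring data allows, which is the resolution issue raised above.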

C. Performing kriging with log normal data sets

Most of the replies underlined the frequent use of an indicator approach.

Although lognormal kriging seems to be the solution for log normal data sets, it is

based on the strict assumption that the data set is log normal, an assumption

which is almost impossible to verify unless one has an extensive knowledge of

the data set.
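The indicator approach mentioned above starts from a simple coding of the data, which can be sketched as follows. The data, thresholds and function name are hypothetical, and the kriging of the resulting indicators is not shown.

```python
def indicator_transform(values, thresholds):
    """Return, per threshold, a 0/1 list marking values at or below it."""
    return {t: [1 if z <= t else 0 for z in values] for t in thresholds}


# Hypothetical skewed sample and thresholds.
data = [0.2, 0.5, 0.7, 1.1, 1.8, 3.0, 7.5, 40.0]
thresholds = [0.7, 1.8, 7.5]

indicators = indicator_transform(data, thresholds)
for t, ind in indicators.items():
    # The mean of an indicator estimates the cumulative proportion P(Z <= t).
    print(f"z <= {t}: {ind}  proportion = {sum(ind) / len(ind):.3f}")
```

Because the largest value (40.0) only ever enters as 0 or 1, it cannot dominate the statistics the way it would with squared differences of raw values, and no distributional assumption is needed for the coding itself.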

If one is willing to assume multi-variate lognormality (univariate is not

really sufficient) then the transformation is theoretically known and has a

unique inverse that is also known. Even in this case there

is the problem of a bias in the re-transformed estimates. A number of authors

have written on this, Journel and Dowd being two of them (see various papers in

Math. Geology). As pointed out in those papers, the correction in the case of

Simple Kriging (punctual) is essentially solved, and a good approximation is

available in the case of Ordinary Kriging (punctual). There are some

theoretical problems in the case of block kriging that are usually handled in

an almost ad-hoc way, e.g., if the point values are multi-variate lognormal

then the block values theoretically should not be either univariate or

multivariate lognormal. There seems to be little in the literature pertaining

to a mixing of lognormality and non-constant drift (mean). If the non-constant

mean is not first removed then the complications resulting from a non-linear

transformation are much worse since the non-constant mean and the mean zero

random component are not separately transformed.
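The bias in the re-transformed estimates can be illustrated for punctual simple kriging. Assuming Y = ln(Z) is multivariate Gaussian, Z given the data is lognormal, so the mean-preserving back-transform adds half the simple kriging variance before exponentiating. The sketch below uses hypothetical numbers and a hypothetical function name; the ordinary kriging correction is not shown.

```python
import math


def lognormal_sk_backtransform(y_est, sk_variance):
    """Back-transform a simple kriging estimate of Y = ln(Z).

    Assumes multivariate lognormality, so that Z given the data is lognormal
    with log-mean `y_est` and log-variance `sk_variance`.
    """
    return math.exp(y_est + 0.5 * sk_variance)


# Hypothetical estimate and kriging variance in log space.
y_est, sk_variance = 1.0, 0.8

naive = math.exp(y_est)
corrected = lognormal_sk_backtransform(y_est, sk_variance)
print(f"naive exp(y*)         = {naive:.3f}")
print(f"with variance term    = {corrected:.3f}")
print(f"ratio corrected/naive = {corrected / naive:.3f}")
```

The naive exponential behaves like a median rather than a mean, so it systematically underestimates the expected value, and the gap grows with the kriging variance.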

For other non-linear transforms (other than the log in the case of

multivariate lognormality), even knowing the inverse transform in analytic

form is not sufficient to allow computing the bias adjustment unless

one also knows the MULTIVARIATE distribution in analytic form. Even then, the

actual mechanics of doing so can be very tedious or complicated. That is,

while there is a nice theorem on change of variables in a multiple integral,

the actual step of applying it to a specific problem can be very tedious and

complicated. Moreover the theorem has moderately strong assumptions which are

not always satisfied.

In the case of multivariate lognormality, one can also determine the

adjustment needed in the kriging variances. This aspect seems to have

attracted little attention in the case of other non-linear transforms and it

is at least as difficult a problem.

Apparently, lognormal kriging and indicator kriging produce very similar

results.

D. Recent developments:

The literature seems to be quite poor in publications on non-parametric

geostatistics.

The Box-Cox family of transformations, which has the log transformation as a

particular case, has recently been proposed.
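For reference, the Box-Cox transform and its inverse can be sketched as below, with the logarithm as the case lambda = 0. The function names and the sample value are hypothetical, and choosing lambda (for example by maximum likelihood) is not covered.

```python
import math


def box_cox(z, lam):
    """Box-Cox transform of a positive value; lam = 0 gives the logarithm."""
    if z <= 0:
        raise ValueError("Box-Cox requires strictly positive data")
    return math.log(z) if lam == 0 else (z ** lam - 1.0) / lam


def inverse_box_cox(y, lam):
    """Inverse of `box_cox` (valid where lam * y + 1 > 0)."""
    return math.exp(y) if lam == 0 else (lam * y + 1.0) ** (1.0 / lam)


z = 7.5  # hypothetical positive measurement
for lam in (1.0, 0.5, 0.01, 0.0):
    y = box_cox(z, lam)
    print(f"lambda = {lam:<4}: y = {y:.4f}, back = {inverse_box_cox(y, lam):.4f}")
```

As lambda approaches zero the transformed value approaches ln(z), which is why the log transformation appears as a particular member of the family.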

SUGGESTED READING

CHRISTENSEN, O.F., DIGGLE, P.J. AND RIBEIRO JR, P.J. (2001). Analysing

positive-valued spatial data: the transformed Gaussian model. In GeoENV III -

Geostatistics for environmental applications, Quantitative Geology and

Geostatistics, Kluwer Series (to appear)

CLARK I. 1996 "Lognormal kriging applied to non-lognormal deposits: two case

studies",

5th International Geostatistics Congress, Wollongong Australia, 22--27

September

CLARK I. 1997. "Geostatistics applied to skewed data", Conference of the

International Section on Mathematical Methods in Geology (Mining Příbram

Symposia) of the International Association for Mathematical Geology, Prague,

6--10 October, Matematicke Metody V Geologii: Příbram Scientiae Rerum

Montanarum

CLARK I. 1998. "Geostatistical estimation and the lognormal distribution",

Geocongress, Pretoria RSA, June

SAITO, H. and P. GOOVAERTS. 2000. Geostatistical interpolation of positively

skewed and censored data in a dioxin contaminated site. Environmental Science

& Technology, vol.34, No.19: 4228-4235.

Gregoire Dubois

Institute of Mineralogy and Petrography

Dept. of Earth Sciences

University of Lausanne

Switzerland

http://www.ai-geostats.org
