Sometimes people use terms with different meanings, or confuse the terms
or the ideas behind them.

1. There is a difference between saying that the "data" is normal
(Gaussian) and saying that the random function is multivariate normal. If
the data is viewed as a sample from one realization of the random function,
then the sample histogram is an estimator of the SPATIAL distribution, but
this is not the same as the ensemble distribution.

It is not quite correct to say that the sample histogram is normal (or
non-normal). The histogram is based on putting the data into classes or
bins and hence it can NOT be normal, since the data set is discrete (and in
fact finite) whereas the normal distribution is continuous. Normality is
not a "decision"; one may decide to make the ASSUMPTION of normality, i.e.,
to use a hypothesis of normality.
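
The distinction between the spatial and the ensemble distribution can be
illustrated with a small simulation. What follows is only a sketch under
assumed settings: a 1-D grid, an exponential covariance, and the function
name simulate_realizations are all choices made for illustration and are
not taken from any particular data set or package.

```python
import numpy as np

def simulate_realizations(n_points=200, n_real=500, corr_range=50.0, seed=0):
    """Draw realizations of a zero-mean, unit-variance Gaussian random
    function on a 1-D grid with an (assumed) exponential covariance."""
    rng = np.random.default_rng(seed)
    x = np.arange(n_points, dtype=float)
    h = np.abs(x[:, None] - x[None, :])           # pairwise separations
    cov = np.exp(-h / corr_range)                 # exponential covariance
    chol = np.linalg.cholesky(cov + 1e-10 * np.eye(n_points))
    # Each row of the result is one realization over all grid locations.
    return (chol @ rng.standard_normal((n_points, n_real))).T

z = simulate_realizations()

# Spatial statistic: average over locations WITHIN each single realization.
spatial_means = z.mean(axis=1)

# Ensemble statistic: average over realizations AT one fixed location.
ensemble_mean_at_site = z[:, 0].mean()

print("spread of the spatial means across realizations:", spatial_means.std())
print("ensemble mean at one location:", ensemble_mean_at_site)
```

With a correlation range that is not small relative to the domain, the
spatial mean of any one realization can sit well away from the ensemble
mean of zero, so a histogram built from a single realization describes
that realization and not the ensemble.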

2. There are at least six different statistics that can be computed from
cross-validation: (i) the average error, (ii) the average squared error,
(iii) the average standardized squared error, (iv) the sample correlation
between the estimated and the observed data values, (v) the sample
correlation between the estimated values and the "errors" (which could be
standardized or not), and (vi) the histogram of the errors (standardized or
not). The expected value of (i) is zero since the kriging estimator is
unbiased, the expected value of (ii) is the average of the kriging
variances, and the expected value of (iii) is one. In the case of simple
kriging the expected value of (iv) is one (but not in the case of ordinary
kriging unless the Lagrange multipliers are all zero), and the expected
value of (v) is zero in the case of simple kriging.

Unfortunately these statistics are not equally sensitive to changes in the
variogram model (or its parameters) or to the choice of the search
neighborhood. Cross-validation as a tool for evaluating possible variogram
models is neither perfect nor uniformly imperfect. It is foolish to think
that with the use of cross-validation (or MLE or least squares fitting of
the variogram or some other favorite method) we can find the "right"
variogram model. In some cases cross-validation may be more useful for
identifying "unusual" data locations than for evaluating the variogram
model. Cross-validation can be useful for comparing one variogram model
against another (taking into account the effect of the search
neighborhood).
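
As a rough sketch, the six statistics listed in item 2 could be computed
as follows, assuming the cross-validation has already produced arrays of
observed values, estimated values and kriging variances. The function name
cross_validation_summary, the dictionary keys and the choice of 10 bins
are hypothetical and used only for illustration.

```python
import numpy as np

def cross_validation_summary(observed, estimated, kriging_var, bins=10):
    """Summarize leave-one-out cross-validation results.

    All three inputs are 1-D sequences of equal length, one entry per
    data location; kriging_var must be strictly positive.
    """
    observed = np.asarray(observed, dtype=float)
    estimated = np.asarray(estimated, dtype=float)
    kriging_var = np.asarray(kriging_var, dtype=float)

    err = estimated - observed                 # raw errors
    std_err = err / np.sqrt(kriging_var)       # standardized errors

    return {
        "mean_error": err.mean(),                                 # (i)
        "mean_squared_error": (err ** 2).mean(),                  # (ii)
        "mean_std_squared_error": (std_err ** 2).mean(),          # (iii)
        "corr_est_obs": np.corrcoef(estimated, observed)[0, 1],   # (iv)
        "corr_est_err": np.corrcoef(estimated, err)[0, 1],        # (v)
        "error_histogram": np.histogram(err, bins=bins),          # (vi)
        # Reference value to compare against (ii).
        "mean_kriging_variance": kriging_var.mean(),
    }

# Tiny made-up example, only to show the call.
summary = cross_validation_summary(
    observed=[1.0, 2.0, 3.0, 4.0],
    estimated=[1.5, 1.5, 3.5, 3.5],
    kriging_var=[0.25, 0.25, 0.25, 0.25],
)
print(summary["mean_error"], summary["mean_squared_error"])
```

Comparing two variogram models then amounts to running the
cross-validation under each model, with the same search neighborhood, and
looking at how each of these entries moves, not at any single one of them.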

3. Authors frequently say that the kriging estimator is "robust" but one
should ask exactly what that means. It is true that small changes in the
variogram parameters and/or small changes in the data values do not produce
large changes in the kriged estimates. In order to quantify this
"robustness", however, one would need to be a bit more specific. As is
well known, the kriging estimator is in two parts: the weights (obtained
from the kriging equations, which do not explicitly depend on the data) and
the data. The weights then depend on only two things, the variogram model
and the search neighborhood. To detect change in the kriging weights (which
form a vector, not a scalar), one must first quantify change in the
variogram model. At least two definitions of "neighborhood" for variograms
have been given (see Armstrong and Diamond). One is simply the maximum
absolute difference between two variograms, another is a ratio (both are
sensitive to the maximum lag considered), and a third is essentially
differentiability. Unfortunately none of these is best for all
circumstances, nor are they equivalent or ranked (i.e., one implying the
other(s)). In trying to quantify the "size" of the change vector (the
change in the weights vector) the two commonest measures would be the
maximum absolute value (of an entry) and the sum of the squares of the
entries. The second is easier to work with but has other disadvantages.
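
The quantities in item 3 can be made concrete with a small numerical
sketch. Everything specific here is an assumption made for illustration:
the five data locations, the two spherical variogram models, the lag grid
up to a maximum lag of 20, and the simple ratio measure max |g2/g1 - 1|,
which is only one possible reading of a "ratio" neighborhood and is not
claimed to be the Armstrong and Diamond definition.

```python
import numpy as np

def spherical_variogram(h, sill=1.0, a_range=10.0, nugget=0.0):
    """Spherical variogram with optional nugget; gamma(0) = 0."""
    h = np.asarray(h, dtype=float)
    r = np.minimum(h / a_range, 1.0)
    gamma = nugget + sill * (1.5 * r - 0.5 * r ** 3)
    return np.where(h > 0, gamma, 0.0)

def ordinary_kriging_weights(coords, target, variogram):
    """Solve the ordinary kriging system for the weights at one target.

    Only the locations and the variogram enter; no data values are used.
    """
    coords = np.asarray(coords, dtype=float)
    n = len(coords)
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    d0 = np.linalg.norm(coords - np.asarray(target, dtype=float), axis=-1)
    lhs = np.ones((n + 1, n + 1))
    lhs[:n, :n] = variogram(d)
    lhs[n, n] = 0.0                      # unbiasedness constraint block
    rhs = np.append(variogram(d0), 1.0)
    solution = np.linalg.solve(lhs, rhs)
    return solution[:n]                  # drop the Lagrange multiplier

coords = [[0, 0], [10, 0], [0, 10], [10, 10], [4, 6]]
target = [5, 5]

def g1(h):
    return spherical_variogram(h, sill=1.0, a_range=15.0)

def g2(h):
    return spherical_variogram(h, sill=1.0, a_range=18.0, nugget=0.1)

# "Distance" between the two variogram models, up to a chosen maximum lag.
lags = np.linspace(0.5, 20.0, 40)
max_abs_variogram_diff = np.max(np.abs(g1(lags) - g2(lags)))
max_ratio_deviation = np.max(np.abs(g2(lags) / g1(lags) - 1.0))

# "Size" of the resulting change in the weight vector.
w1 = ordinary_kriging_weights(coords, target, g1)
w2 = ordinary_kriging_weights(coords, target, g2)
delta = w2 - w1
max_abs_change = np.max(np.abs(delta))
sum_sq_change = np.sum(delta ** 2)

print(max_abs_variogram_diff, max_ratio_deviation)
print(max_abs_change, sum_sq_change)
```

Changing the maximum lag in the lag grid changes both variogram measures,
which is the sensitivity noted above, while the weights themselves are
unaffected by it.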

THE REAL PROBLEM is that in most (potential) applications of geostatistics
there are no state equations from which one could derive the model
parameters or verify the underlying assumptions. Hence the variogram and
the assumption of second-order or intrinsic stationarity (or multivariate
normality) are just that: assumptions. As yet all purported inference
tests depend on random sampling (random site selection is not random
sampling) or on strong distributional assumptions such as multivariate
normality. The problem is ILL-POSED, i.e., we do not have enough
information to make a unique inference about the values at non-data
locations. To do so we must make model assumptions, and then the question
is how strongly our results depend on the model assumptions and how
strongly on the data.

LANGUAGE AND WORDS

The data is an inanimate object, hence it does NOT "intervene" in anything,
nor does it "honor" anything.

****************************************************************************

There is no substitute for having some understanding of the particular
phenomenon to which geostatistical tools are being applied; geostatistics
should not be used in a vacuum. The bottom line is whether it produces
useful results, and that is usually not a statistical or a mathematical
question.

Donald E. Myers
Department of Mathematics
University of Arizona
Tucson, AZ 85721
myers@...
http://www.u.arizona.edu/~donaldm
