GEOSTATS: Normality, cross-validation, etc
- Some observations about some of the comments that have been appearing.
Sometimes people are using terms with different meanings or are confusing
the terms or ideas.
1. There is a difference between saying that the "data" is normal
(gaussian) and saying that the random function is multi-variate normal. If
the data is viewed as a sample from one realization of the random function
then the sample histogram is an estimator of the SPATIAL distribution but
this is not the same as the ensemble distribution.
It is not quite correct to say that the sample histogram is normal (or
non-normal). The histogram is based on putting the data into classes or
bins and hence it can NOT be normal, since the data set is discrete (and in
fact finite) whereas the normal distribution is continuous. Normality is
not a "decision"; one may decide to make the ASSUMPTION of normality, i.e.,
to use a hypothesis of normality.
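The distinction between the spatial distribution (many locations, one
realization) and the ensemble distribution (one location, many
realizations) can be illustrated with a small simulation. This is only a
sketch under stated assumptions: the random function is taken to be a
stationary Gaussian AR(1) sequence on a line, which is multi-variate normal
with N(0, 1) marginals, and the names simulate_realization and
compare_spatial_and_ensemble are hypothetical, chosen for this example.

```python
import random


def variance(values):
    """Population variance (divides by n) of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def simulate_realization(n, rho, rng):
    """One realization of a stationary Gaussian AR(1) sequence.

    ASSUMPTION: AR(1) stands in for a multi-variate normal random function;
    every marginal is N(0, 1) and the covariance at lag k is rho**k.
    """
    z = [rng.gauss(0.0, 1.0)]
    innovation_sd = (1.0 - rho * rho) ** 0.5
    for _ in range(n - 1):
        z.append(rho * z[-1] + innovation_sd * rng.gauss(0.0, 1.0))
    return z


def compare_spatial_and_ensemble(n=50, rho=0.98, n_realizations=2000, seed=0):
    """Return (ensemble variance, average within-realization variance)."""
    rng = random.Random(seed)
    realizations = [simulate_realization(n, rho, rng)
                    for _ in range(n_realizations)]
    # Ensemble view: one fixed location observed across all realizations.
    ensemble_var = variance([r[n // 2] for r in realizations])
    # Spatial view: all locations of a single realization, then averaged
    # over realizations to show the typical value.
    spatial_var = sum(variance(r) for r in realizations) / n_realizations
    return ensemble_var, spatial_var


if __name__ == "__main__":
    ens, spa = compare_spatial_and_ensemble()
    print("ensemble variance at one location:", round(ens, 3))
    print("average variance within one realization:", round(spa, 3))
```

With strong spatial correlation the spread seen within a single realization
is typically much smaller than the ensemble spread, even though the random
function is exactly multi-variate normal by construction.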
2. There are at least six different statistics that can be computed from
cross-validation: (i) the average error, (ii) the average squared error,
(iii) the average standardized squared error, (iv) the sample correlation
between the estimated and the observed data values, (v) the sample
correlation between the estimated values and the "errors" (which could be
standardized or not), and (vi) the histogram of the errors (standardized or
not). The expected value of (i) is zero since the kriging estimator is
unbiased, the expected value of (ii) is the average of the kriging
variances, and the expected value of (iii) is one. In the case of simple
kriging the expected value of (iv) is one (but not in the case of ordinary
kriging unless the Lagrange multipliers are all zero), the expected value
of (v) is zero in the case of simple kriging. Unfortunately these
statistics are not equally sensitive to changes in the variogram model (or
its parameters) or to the choice of the search neighborhood. Cross-validation as
a tool for evaluating possible variogram models is neither perfect nor
uniformly imperfect. It is foolish to think that, with the use of
cross-validation (or MLE or least squares fitting of the variogram or some
other favorite method), we can find the "right" variogram model. In
some cases cross-validation may be more useful for identifying "unusual"
data locations than for evaluating the variogram model. Cross-validation
can be useful for comparing one variogram model against another (taking
into account the effect of the search neighborhood).
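Statistics (i) through (iii) can be sketched with a small leave-one-out
loop. This is a minimal illustration, not a reference implementation. It
assumes ordinary kriging in one dimension, an exponential covariance model,
and a search neighborhood consisting of all remaining data; the function
names and default parameters are hypothetical.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with
    partial pivoting (pivoting is needed: the kriging matrix has a zero
    in its last diagonal entry)."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def exponential_covariance(h, sill=1.0, corr_range=10.0):
    """ASSUMED covariance model; sill and corr_range are illustrative."""
    return sill * math.exp(-abs(h) / corr_range)


def ordinary_kriging(xs, zs, x0, cov_fn=exponential_covariance):
    """Return (estimate, kriging variance) at x0 using all data in xs."""
    n = len(xs)
    a = [[cov_fn(xs[i] - xs[j]) for j in range(n)] + [1.0] for i in range(n)]
    a.append([1.0] * n + [0.0])
    b = [cov_fn(x - x0) for x in xs] + [1.0]
    sol = solve(a, b)
    weights, mu = sol[:n], sol[n]
    estimate = sum(w * z for w, z in zip(weights, zs))
    kriging_var = cov_fn(0.0) - sum(w * c for w, c in zip(weights, b[:n])) - mu
    return estimate, kriging_var


def cross_validate(xs, zs, cov_fn=exponential_covariance):
    """Leave-one-out cross-validation statistics (i), (ii) and (iii)."""
    errors, variances = [], []
    for i in range(len(xs)):
        est, var = ordinary_kriging(xs[:i] + xs[i + 1:],
                                    zs[:i] + zs[i + 1:], xs[i], cov_fn)
        errors.append(est - zs[i])
        variances.append(var)
    n = len(errors)
    return {
        "mean_error": sum(errors) / n,
        "mean_squared_error": sum(e * e for e in errors) / n,
        "mean_standardized_squared_error":
            sum(e * e / v for e, v in zip(errors, variances)) / n,
        "mean_kriging_variance": sum(variances) / n,
    }
```

Running cross_validate twice on the same data with two different values of
corr_range gives a direct, if imperfect, comparison of one model against
another; the comparison would change if a restricted search neighborhood
were used instead of all remaining data.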
3. Authors frequently say that the kriging estimator is "robust" but one
should ask exactly what that means. It is true that small changes in the
variogram parameters and/or small changes in the data values do not produce
large changes in the kriged estimates. In order to quantify this
"robustness", however, one would need to be a bit more specific. As is
well known, the kriging estimator is in two parts: the weights (obtained
from the kriging equations, which do not explicitly depend on the data) and
the data. The weights then depend on only two things, the variogram model
and the search neighborhood. To relate change in the kriging weights (which
form a vector, not a scalar) to change in the variogram model, one must
first quantify change in the variogram model. At least two definitions of
"neighborhood" for variograms have been given (see Armstrong and Diamond).
One is simply the maximum absolute difference between two variograms,
another is a ratio (both are sensitive to the maximum lag considered), and
a third is essentially differentiability.
Unfortunately none of these is best for all circumstances nor are they
equivalent or ranked (i.e., one implying the other(s)). In trying to
quantify the "size" of the change vector (change in the weights vector) the
two most common measures would be the maximum absolute value (of an entry)
and the sum of the squares of the entries. The second is easier to work
with but has other disadvantages.
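The two measures of change in the weights vector, together with the
maximum-absolute-difference distance between two variograms, can be
sketched as follows. The setup is an assumption made for illustration
only: ordinary kriging in one dimension written in variogram form, an
exponential variogram, all data used as the neighborhood, and the variogram
distance approximated on a finite grid of lags. All function names are
hypothetical. Note that ok_weights takes no data values, only locations
and a variogram.

```python
import math


def solve(a, b):
    """Gaussian elimination with partial pivoting for a x = b."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def exponential_variogram(h, sill=1.0, corr_range=10.0):
    """ASSUMED variogram model; sill and corr_range are illustrative."""
    return sill * (1.0 - math.exp(-abs(h) / corr_range))


def ok_weights(xs, x0, gamma):
    """Ordinary kriging weights at x0; depends on locations and the
    variogram only, never on the data values."""
    n = len(xs)
    a = [[gamma(xs[i] - xs[j]) for j in range(n)] + [1.0] for i in range(n)]
    a.append([1.0] * n + [0.0])
    b = [gamma(x - x0) for x in xs] + [1.0]
    return solve(a, b)[:n]  # drop the Lagrange multiplier


def weight_change(w_old, w_new):
    """Return (maximum absolute entry, sum of squared entries) of the
    change vector w_new - w_old."""
    diff = [new - old for old, new in zip(w_old, w_new)]
    return max(abs(d) for d in diff), sum(d * d for d in diff)


def max_abs_variogram_difference(g1, g2, max_lag, steps=200):
    """Maximum absolute difference between two variograms, approximated
    on an evenly spaced grid of lags from 0 to max_lag."""
    lags = [k * max_lag / steps for k in range(steps + 1)]
    return max(abs(g1(h) - g2(h)) for h in lags)


if __name__ == "__main__":
    xs = [0.0, 1.0, 3.0, 6.0]
    g_a = lambda h: exponential_variogram(h, corr_range=10.0)
    g_b = lambda h: exponential_variogram(h, corr_range=12.0)
    print("variogram distance:", max_abs_variogram_difference(g_a, g_b, 10.0))
    print("weight change:", weight_change(ok_weights(xs, 2.0, g_a),
                                          ok_weights(xs, 2.0, g_b)))
```

Changing max_lag changes the reported variogram distance while leaving the
weights untouched, which is one way to see why the choice of distance
matters when speaking of "small" changes in the variogram.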
THE REAL PROBLEM is that in most (potential) applications of geostatistics
there are no state equations from which one could derive the model
parameters or verify the underlying assumptions. Hence the variogram and
the assumption of second order or intrinsic stationarity (or multi-variate
normality) are just that: assumptions. As yet all purported
inference tests depend on random sampling (random site selection is not
random sampling) or strong distributional assumptions such as multi-variate
normality. The problem is ILL-POSED, i.e., we do not have enough
information to make a unique inference about the values at non-data
locations. To do so we must make model assumptions and then the question is
how strongly do our results depend on the model assumptions and how
strongly on the data?
LANGUAGE AND WORDS
The data is an inanimate object; hence it does NOT "intervene" in anything,
nor does it "honor" anything.
There is no substitute for having some understanding of the particular
phenomenon to which geostatistical tools are being applied; geostatistics
should not be used in a vacuum. The bottom line is whether it produces
useful results, and that is usually not a statistical or a mathematical
question.
Donald E. Myers
Department of Mathematics
University of Arizona
Tucson, AZ 85721