Re: AI-GEOSTATS: estimation with biased data- SUMMARY
Thank you for the prompt responses to my question. I have posted a summary
of the responses below.
Donald E. Myers wrote
There are a couple of underlying assumptions that are critical; you will then
have to ask how your problem/application relates to those.
1. The data is considered to be a non-random sample from one realization of a
random function. Hence "probability basis" as it relates to the design of a
sample pattern is not relevant. "Pattern" in this case pertains to the data
locations, not the distribution of values.
2. The random function must satisfy certain stationarity assumptions
a. if you use a covariance and Simple Kriging then you need second order
stationarity
b. if you use a variogram and Ordinary Kriging then you need intrinsic
stationarity
c. In the case that the mean function (of the random function, this is
theoretically the same as a trend surface but is sometimes estimated
by a trend surface) is a polynomial function of the position
coordinates then you need either second order or intrinsic
stationarity of the residuals. You can either use Universal Kriging
or one of the above (the latter on the residuals)
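For reference, the stationarity conditions named in item 2 can be written out
as below. This is the usual textbook formulation, not a quotation from the
message; the symbols Z, m, C, gamma, h, f_k, beta_k and R are notation assumed
here for illustration.

```latex
% Second-order stationarity (covariance, Simple Kriging):
E[Z(x)] = m, \qquad \operatorname{Cov}\bigl(Z(x), Z(x+h)\bigr) = C(h)

% Intrinsic stationarity (variogram, Ordinary Kriging):
E[Z(x+h) - Z(x)] = 0, \qquad
\tfrac{1}{2}\operatorname{Var}\bigl(Z(x+h) - Z(x)\bigr) = \gamma(h)

% Polynomial mean function (Universal Kriging), with the residual R(x)
% satisfying one of the two conditions above:
Z(x) = \sum_{k=0}^{K} \beta_k f_k(x) + R(x)
```

Under second-order stationarity the two descriptions are linked by
gamma(h) = C(0) - C(h), so second-order stationarity implies intrinsic
stationarity but not the other way round.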
Now some practical as well as theoretical questions and problems
A. What do you mean by "biased" data? In general in statistics, bias is a
property of an estimator, i.e., when the expected value of the estimator is not
the same as the quantity being estimated (estimator, not "estimate"). Authors
will sometimes use the word in an intuitive sense but this is not very precise
and is hard to either check or utilize.
B. The Kriging estimator (any of the above three types) already compensates
somewhat for clustering in the data locations. Unlike inverse distance
weighting, when there are two data locations close together the weights are
decreased on each location.
C. Now there are aspects of the frequency distribution of the data that can
have an effect. The sample variogram is an average of squared differences, hence a
skewed distribution can distort the sample variogram. Likewise the Kriging
estimator is a weighted average and averages in general are sensitive to a few
"outliers". This is why it is sometimes useful or necessary to use a
transform such as the logarithm. That is a big discussion in itself.
There are no distributional assumptions implicit in the derivation of
the kriging equations.
D. The Kriging estimator is always unbiased (separately at each location where
you want an estimate). That is, the equations for the coefficients in the kriging
estimator are derived under the constraint of unbiasedness. This is not
the same as an intuitive idea of unbiasedness.
E. Any valid choice of the variogram/covariance function will result in a
solution for the kriging equations (valid means that the variogram is
conditionally negative definite or that the covariance function is positive
definite). However the solution and hence the estimated values are affected by
the choice of the variogram/covariance function, hence it is important to fit
the model well. In practice you will use a search neighborhood and the results
can be sensitive to the search neighborhood parameters.
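Point B can be checked numerically. The sketch below compares ordinary
kriging weights with inverse distance weights for one isolated data location
and two nearly coincident ones, all at about the same distance from the
target. The exponential variogram, its range parameter, the point layout and
the helper names are all assumptions chosen for illustration, not anything
specified in the message.

```python
import math

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def variogram(h, sill=1.0, a=2.0):
    """Exponential variogram (an assumed model, for illustration only)."""
    return sill * (1.0 - math.exp(-h / a))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def ok_weights(points, target):
    """Ordinary kriging weights from the variogram form of the system:
    sum_j w_j * gamma(x_i - x_j) + mu = gamma(x_i - x_0), sum_j w_j = 1."""
    n = len(points)
    A = [[variogram(dist(p, q)) for q in points] + [1.0] for p in points]
    A.append([1.0] * n + [0.0])          # unbiasedness constraint row
    b = [variogram(dist(p, target)) for p in points] + [1.0]
    return solve(A, b)[:n]               # drop the Lagrange multiplier

def idw_weights(points, target, power=2.0):
    raw = [1.0 / dist(p, target) ** power for p in points]
    total = sum(raw)
    return [w / total for w in raw]

# One isolated point on the left, two clustered points on the right.
points = [(-1.0, 0.0), (1.0, 0.0), (1.0, 0.05)]
target = (0.0, 0.0)
w_ok = ok_weights(points, target)
w_idw = idw_weights(points, target)
print("ordinary kriging:", [round(w, 3) for w in w_ok])
print("inverse distance:", [round(w, 3) for w in w_idw])
```

With this layout inverse distance weighting gives the clustered pair roughly
two thirds of the total weight, while ordinary kriging splits the weight
about evenly between the isolated point and the pair as a whole.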
The problem alluded to by Cressie is related to some tendency to change the
sampling design based on the data collected, i.e., when they found high grades
they tended to drill more exploratory holes nearby and when they found low
grades they tended to not drill more exploratory holes nearby. Thus they
distorted the distribution of the grades by the sampling plan.
Finally, note that a "good" sampling plan for kriging is not the same as a
"good" sampling plan for estimating and modeling the variogram or covariance
function. There are quite a number of papers in the literature on both of
these issues but no absolute solution.
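To make the sample variogram mentioned in point C concrete, here is a minimal
sketch of the classical (Matheron) estimator, which takes half the mean
squared difference over all pairs in each lag bin. The transect data, the lag
binning rule and the function name are assumptions made up for illustration.
The second data set differs from the first by a single extreme value, which
shows how strongly one "outlier" can inflate the estimate.

```python
import math

def sample_variogram(coords, values, lag_width, n_lags):
    """Classical estimator: half the mean squared difference of all pairs
    whose separation distance falls nearest to each lag k * lag_width."""
    sq_sums = [0.0] * (n_lags + 1)
    counts = [0] * (n_lags + 1)
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            h = math.hypot(coords[i][0] - coords[j][0],
                           coords[i][1] - coords[j][1])
            k = round(h / lag_width)          # index of the nearest lag bin
            if 1 <= k <= n_lags:
                sq_sums[k] += (values[i] - values[j]) ** 2
                counts[k] += 1
    # One (lag distance, gamma, number of pairs) tuple per non-empty bin.
    return [(k * lag_width, sq_sums[k] / (2 * counts[k]), counts[k])
            for k in range(1, n_lags + 1) if counts[k] > 0]

# Six samples on a regular transect with unit spacing (hypothetical data).
coords = [(float(i), 0.0) for i in range(6)]
clean = [2.0, 3.0, 2.5, 3.5, 3.0, 4.0]
spiked = [2.0, 3.0, 25.0, 3.5, 3.0, 4.0]     # same data with one extreme value

for name, z in (("clean", clean), ("spiked", spiked)):
    for lag, gamma, pairs in sample_variogram(coords, z, 1.0, 3):
        print(f"{name:7s} lag={lag:.0f} pairs={pairs} gamma={gamma:.2f}")
```

Note also that the number of pairs per lag depends entirely on the data
locations, which is one reason a sampling plan that suits kriging need not
suit variogram estimation.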
Donald E. Myers
Isobel Clark wrote
When I saw the title of your email, I thought you
would be talking about data which was incorrectly
measured -- that is what we generally understand as
'bias'. For example, in the gold mines, the method of
determining how much gold is in a sample can be
consistently lower than the real value (or higher!).
Your problem seems to be in non-uniform (or
non-random) sampling with respect to both location and
value. Clustered/preferential sampling is not a
problem with ordinary geostatistics but can become one
if you use one of the mechanical transformation
methods such as 'normal score' or rank order transform
since these rely on 'random' sampling with respect
to value in order to get a representative histogram.
Using a lognormal or other parametric transform is not
affected by these problems unless the preferential
sampling is excessive.
Kriging estimates deal with the clustering and
preferential sampling provided you have either used a
parametric transform or have declustered before your
score or rank transform. So you should get unbiassed
answers for your overall parameters.
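As an illustration of declustering before a score or rank transform, here is
a minimal sketch of cell declustering, one common approach. The message does
not say which declustering method is meant, so the method, the cell size, the
grid origin and the sample data below are all assumptions. Each sample is
weighted by the inverse of the number of samples sharing its grid cell, so a
dense cluster counts about as much as a single isolated sample.

```python
import math
from collections import Counter

def cell_decluster_weights(coords, cell_size):
    """Weight each sample by 1 / (samples in its cell * occupied cells),
    so that the weights sum to one and every occupied cell counts equally."""
    cells = [(math.floor(x / cell_size), math.floor(y / cell_size))
             for x, y in coords]
    per_cell = Counter(cells)
    n_occupied = len(per_cell)
    return [1.0 / (per_cell[c] * n_occupied) for c in cells]

# Four high values clustered in one cell, three low values spread out
# (hypothetical data mimicking preferential sampling of high grades).
coords = [(0.1, 0.1), (0.2, 0.3), (0.3, 0.2), (0.4, 0.4),
          (1.5, 0.5), (0.5, 1.5), (1.5, 1.5)]
values = [10.0, 11.0, 12.0, 9.0, 2.0, 3.0, 1.0]

weights = cell_decluster_weights(coords, cell_size=1.0)
naive_mean = sum(values) / len(values)
declustered_mean = sum(w * z for w, z in zip(weights, values))
print("naive mean:      ", round(naive_mean, 3))
print("declustered mean:", round(declustered_mean, 3))
```

The same weights would be used to build the weighted histogram that a normal
score or rank transform is based on. In practice the result depends on the
cell size, so several sizes are usually tried.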
Hope this helps some
Ruben Roa Ureta wrote:
> > Dear list members,
> > I am wrestling with particular dilemma regarding how to incorporate data
> > collected without a design or probability basis into kriging estimators.
>Kriging estimators of interpolated values on a grid coming from intrinsic
>geostatistics do not depend on a sampling design, i.e. they are the same
>for all sampling designs. In transitive geostatistics they do depend on
>the sampling design. Transitive and intrinsic geostatistics represent the
>same divide as design- and model-based statistics in general. Estimates of
>the estimation variance of the mean or the total across the grid do depend
>on the sampling design both in intrinsic and transitive geostatistics,
>though in essentially different manners.
> > In particular I am dealing with data that has clustered and uneven
> > sampling as well as some bias towards higher data values. Is it
> > appropriate to use geostatistics to obtain means and variances in this
> > situation?
>Your language is a little imprecise. The bias is defined for estimators
>and not for values so it is rather strange to read "bias towards higher
>data values". I guess you mean that the people collecting the samples had
>an intention to collect more samples where the variable yielded higher
>values. If that is the case, geostatistics can be applied to those samples
>because contrary to design-based inference, the intrinsic geostatistical
>estimator of the mean or the total does not depend on the intentions of the
>people collecting the samples.
>Also, "to obtain mean and variances" is imprecise. In intrinsic
>least-squares geostatistics you have the 'kriging variance', which
>fulfills an analytical role in optimizing interpolation, and 'estimation
>variance', which is the second order statement about the quality of the
>estimate of the mean or the total. If your question refers to the
>estimation variance, then you can use intrinsic geostatistics to estimate
>the estimation variance because this estimation does not depend on the
>intentions of the people doing the sampling, though it may depend on the
>geometry of the actual sampling. In fact, it is convenient to perform some
>form of systematic sampling. The latest I have seen on estimation variance is:
>Aubry and Debouzie. 2000. Geostatistical estimation variance for the
>spatial mean in two-dimensional systematic sampling. Ecology 81:543-553.
>And there is a program, called EVA, written by Lafont and Petitgas. You
>can ask Pierre Petitgas for a copy of the program.
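The two variances distinguished in the reply above can be written out as
follows. These are the usual textbook forms, not taken from the message; the
weights lambda_i, the Lagrange multiplier mu, the variogram gamma, the domain
V and the n sample locations x_i are notation assumed here. The estimation
variance is shown for the simple case where the spatial mean over V is
estimated by the arithmetic mean of the samples.

```latex
% Ordinary kriging variance at a target location x_0, where the weights
% \lambda_i and the multiplier \mu solve
%   \sum_j \lambda_j \gamma(x_i - x_j) + \mu = \gamma(x_i - x_0),
%   \sum_j \lambda_j = 1
\sigma^2_{OK}(x_0) = \sum_{i=1}^{n} \lambda_i \, \gamma(x_i - x_0) + \mu

% Estimation variance of the sample mean as an estimator of the mean over V
\sigma^2_E = \frac{2}{n} \sum_{i=1}^{n} \bar{\gamma}(x_i, V)
           - \bar{\gamma}(V, V)
           - \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \gamma(x_i - x_j)

% with the average variogram values
\bar{\gamma}(x_i, V) = \frac{1}{|V|} \int_V \gamma(x_i - u) \, du, \qquad
\bar{\gamma}(V, V) = \frac{1}{|V|^2} \int_V \int_V \gamma(u - v) \, du \, dv
```

Both expressions depend on the data only through the sample locations and the
variogram model, not through the observed values, which is why the geometry
of the sampling matters while the intentions behind it do not.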
> > I understand that the use of biased data was part of the original dilemma
> > and impetus for the development of geostatistics in the gold mining
> > industry (Cressie, 1990. J Math. Geol. 22:239-252) but I cannot find a
> > satisfactory answer to the question of whether you can use biased data in
> > geostatistical estimation.
>Please see above.
> > Based on kriged estimates obtained from biased samples of simulated
> > spatially autocorrelated data sets with known parameters, I find that
> > kriging means are, on average, less biased than the corresponding
> > arithmetic sample mean. Is this a case where, in practice, the
> > differential spatial weighting of sample data provided by kriging,
> > results in less biased means but with little theoretical basis?
>The theoretical basis of model-based estimation in general is sound. I
>guess that is why most of statistics is model-based, i.e. in most of
>statistics expectations for the estimators are computed with reference to
>a model for a random variable rather than with reference to the
>probability of the sample under a sampling design.
> > Secondarily are the
> > geostatistical variance estimates obtained from biased data theoretically
> > valid? I guess that you could interpret them in the sense that "if one was
> > to sample the same random process with the same set of biased sample
> > locations, the geostatistical variance is the prediction error that one
> > would observe". The problem lies, I think, in how "representative" the
> > biased samples are of the random process and, with no design basis to the
> > sampling, one is left with the inherent logical confound of model-based
> > estimation methods: that estimates are model-unbiased, provided the model
> > is correct, but I will never know if the model is correct.
>When you work with models you are forced to try to understand the physics
>of the problem, how variables relate to each other in reality. You don't
>know if your model is correct for certain, but you can defend it by
>understanding the nature of the problem. On the other hand, when you base
>your judgement on blind random sampling, you never know if the sample you
>actually obtained shares the properties of all the possible samples that
>could have been obtained under the sampling design, though all your
>computations depend on this real and unique sample being replaced by all
>the possible samples that could have been obtained.
> > So does
> > geostatistics provide a "better" model for estimation with biased data in
> > practice in certain situations because of the spatial weighting of samples
> > or is this theoretically unsound?
>Samples are not biased. Bias is a property of estimators. When you say
>"biased samples" you seem to mean "non random, or intentional samples".
>There is no special problem with intentional samples and they are very
>good in some conditions.
> > I have searched the literature with limited definitive answers but wanted
> > to engage the group in this discussion and ask for any references on the
> > subject.
>We have discussed this issue a few times on this mailing list. See the
>archives at the AI-GEOSTATS website.