- ---------------------------- Original Message ----------------------------

Subject: Re: AI-GEOSTATS: Detecting spatial autocorrelation in highly nonnormal data

From: "Ruben Roa Ureta" <rroa@...>

Date: Wed, 26 November 2003, 7:32 pm

To: "Yetta Jager" <jagerhi@...>

Cc: ai-geostat@...

--------------------------------------------------------------------------

Dear Ruben:

>I'm still very interested in discussion of the issue of probability

sampling vs. model-based sampling, but have been waiting until reading and

digesting the Godambe and Hansen et al papers. I struggled years ago with

how/whether to incorporate inclusion probabilities in regional estimation

and ended up just using separate models (variances) by strata.

We demonstrated the use of a cokriging model to "find" low alkalinity

streams at high elevations in the Southern Blue Ridge that were missed by

the sample, but this depended in part on extrapolating a relationship

between elevation and alkalinity.

Hi Yetta: the above is an example of model-based inference.

>Say we have a finite population (i.e., lakes) and we need to estimate the

total of some attribute. Both approaches are attempting to quantify

uncertainty in the attributes of unmeasured lakes. In the case of

geostatistics, the kriging variances reflect uncertainty due to

interpolating to unmeasured locations. Because of the strong assumed

superpopulation model, kriging estimates of variance are (in my

experience) too small, in the sense that a new sample would show larger

actual MSEs than kriging MSEs.

I see that there are two sources of variance from a geostatistical model:

the variance due to partial observation of the spatial process and the

variance due to incomplete specification of the model. These are accounted

for in the model, at least when it is formulated as a stochastic process.

Thus variance estimates of total estimates (or other functions or

parameters) from the geostatistical model are higher than variance

estimates from a pure random sampling approach, which only takes into

account the variance due to incomplete observation of the population.

>For simplicity, let's assume there is no spatial autocorrelation, a

stationary model, and the joint probabilities of inclusion are zero. Is

there a difference in the uncertainty represented by the design-based and

model-based variance estimates?

Yes, the model-based estimate of variance is generally higher than the

design-based one. This is because the model-based variance is composed at

least of a term for model specification plus a term due to incomplete

observation (i.e. sampling). This can be shown after some algebra for the

simplest example of the expansion estimator for finite populations. The

simple regression estimator of the total has a variance estimated by a

formula which includes a term due to the model plus a term due to the

sampling, while the variance of the equivalent design-based estimator only

has the term coming from the sampling. This shows that in the simplest of

cases, the model-based estimator actually has higher variance, i.e. is

more conservative, than the corresponding design-based estimator, contrary

to popular belief.

>If there is no SA, the model-based pop. variance is the number of lakes

not sampled (N-n) times the overall sample variance, V(n).

I think there should be another term for the model, whatever this model

might be, probably a relation with another, predictor variable, since the

coordinates are irrelevant (no spatial autocorrelation). If there is no

term for a model, then we do not have a model-based estimation. Perhaps I

am missing something in your scenario.

>As sample size, n, increases, uncertainty decreases only because N-n

decreases. Note that the actual observed values play no part in the

variance, which depends only on the distances involved (and with no SA,

not even that). Implicitly, the total, T, is estimated as though the

sample is equal-probability, by summing the values obtained in the sample,

and the kriged estimates [here, =sample mean x (N-n)].

I think you are talking of the kriging estimator of the mean when the

variogram is flat. In that case I guess the kriging estimator should be no

different from the simple expansion estimator of design-based inference

and we do not have a case of model-based estimation.
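
A minimal sketch of this flat-variogram case, in Python. It assumes simple random sampling without replacement and uses the textbook variance formula for that design; the function names and the numbers are hypothetical and do not come from the thread. It shows that the total written as "observed sum plus sample mean x (N-n)" is the same number as the simple expansion estimate.

```python
from statistics import mean, variance


def expansion_total(sample, N):
    """Expansion estimate of a finite-population total: N times the sample mean."""
    return N * mean(sample)


def split_total(sample, N):
    """The same total written as in the quoted text: the observed sum plus
    the sample mean predicted for each of the N - n unmeasured units."""
    n = len(sample)
    return sum(sample) + (N - n) * mean(sample)


def srs_variance_of_total(sample, N):
    """Estimated variance of the expansion total under simple random sampling
    without replacement: N^2 * (1 - n/N) * s^2 / n (an assumed design)."""
    n = len(sample)
    return N**2 * (1 - n / N) * variance(sample) / n


# Hypothetical values for n = 5 sampled lakes out of N = 20.
sample = [12.0, 15.0, 9.0, 14.0, 10.0]
N = 20

print(expansion_total(sample, N))
print(split_total(sample, N))
print(srs_variance_of_total(sample, N))
```

Unlike the (N-n)-only description quoted above, this design-based variance uses the sample variance s^2, so it does depend on the observed values.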

>If there is SA, one way of thinking about it is that higher weight is

assigned to measured values of lakes that are near many unmeasured lakes.

The variance of a Horvitz-Thompson estimator will also decrease as

sample size increases as the probabilities of inclusion n/N increase, but

unlike the kriging-model-based estimator, it depends on the values

observed in the sample. What are the implications of this?

I understand that the variance of the kriging model-based estimator of the

total depends on the observations as a consequence of the use of the sill

parameter in the computation of the estimation variance.
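
For reference, a minimal sketch of the Horvitz-Thompson estimate of a total, in which each observed value is weighted by the inverse of its inclusion probability. The function name, the values, and the inclusion probabilities are hypothetical; only the form of the estimator is standard. With equal probabilities n/N it reduces to N times the sample mean.

```python
def horvitz_thompson_total(values, inclusion_probs):
    """Horvitz-Thompson estimate of a total: sum of y_i / pi_i over the sample."""
    if len(values) != len(inclusion_probs):
        raise ValueError("values and inclusion_probs must have the same length")
    return sum(y / p for y, p in zip(values, inclusion_probs))


# Hypothetical sample of n = 5 lakes from a population of N = 20.
sample = [12.0, 15.0, 9.0, 14.0, 10.0]
N = 20

# Equal inclusion probabilities n/N: each sampled lake stands for N/n lakes.
equal_probs = [len(sample) / N] * len(sample)
ht_equal = horvitz_thompson_total(sample, equal_probs)

# Unequal (made-up) inclusion probabilities: rarely included lakes get more weight.
unequal_probs = [0.5, 0.5, 0.25, 0.25, 0.125]
ht_unequal = horvitz_thompson_total(sample, unequal_probs)

print(ht_equal, ht_unequal)
```

The weights 1/pi_i are set by the design alone and do not use the locations of the lakes.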

>Each sampled lake represents a number of others, but we don't assume

anything about their location and the "number of others" cloned from each

sampled lake is determined by the inclusion probability. If these are

equal, then each sample lake is given equal weight regardless of how many

close neighbors it has. Using a list frame would therefore give more

similar results to kriging, whereas an equal-area design would yield less

similar results. I think, in general, the underestimation of variance

increases as the semivariogram model gets away from the reality of the

sample, in terms of requiring a sill that relates to sample variance. I

would like to see studies comparing the two approaches for different

situations.

The particular case of predicting over spatial processes that are

spatially separate (like predicting with some observed lakes for other

unobserved lakes) seems to me rather difficult. However, regarding the

underestimation of variance, in general I expect higher estimation

variances from model-based as compared with design-based estimators.

>I have digested the Hansen et al paper and discussion thereafter, but am

still struggling with Godambe. I don't see how the Hansen et al. paper

supports the conclusion that probabilistic sampling design has been called

into question as a basis for inference.

It was not Hansen et al. who questioned randomization-based

inference in finite populations (rather, they are among the main

proponents of that approach) but the discussion around that paper.

>Hansen's point is that model-based inference requires an assumption that

the superpopulation model is true, and can lead to bad results if this is

not the case. (As I recall, this discussion started from

the observation that the model is always wrong.)

It is true that models are always wrong in some sense but,

1) the construction of model-based estimators forces you to think about how

the system under study works, what the mechanistic relations among

observable variables are, how the known laws of physics (or biology, etc.)

affect your spatial process or finite population, etc. Building models

has allowed the development of science as we know it.

2) it is also true that sampling is never truly random, because

practitioners very often violate the dictates of random sampling theory;

this is simply common sense, since samples derived from taking

numbers from a hat are usually very inconvenient or misleading.

3) models can be made robust by balancing on predictor variables.

>From the back-and-forth following the Hansen et al. paper (and reading

about adaptive design), I get the sense that few discount the importance

of beginning with a sample drawn according to a probabilistic design.

This is important to emphasize for the geostatistics community because

many practitioners are in the habit of beginning with a "found" sample

with unknown relationship to the population. Where the statisticians

diverge is in the use of a model vs. design to draw inferences.

True. It is convenient to design a sampling program, to plan in advance

how to take samples, but it is not completely clear that the sampling

shall be probabilistic. For example if I know of a certain nuisance

parameter I want to get rid of at the inference stage then I sample in

such a way, a deterministic way, that the nuisance parameter is eliminated

by simple algebraic operations. For example pairing usually allows

computing a difference which deterministically eliminates a nuisance

parameter. Probabilistic sampling enters into the picture when I suspect

there are hidden, latent parameters of which I am not aware, and then

I apply a randomization procedure in order to average over, i.e. in

expectation, those hidden, latent parameters. However, after I have

obtained my sample and made my deterministic computations to eliminate

the nuisance parameter, then I forget about the sampling procedure and make

inference based on my model, and conditioning on the observed sample.
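
The pairing argument can be sketched with a toy example. It assumes a purely additive nuisance term shared by both members of each pair; the site effects, the baseline, and the effect size are made up for illustration.

```python
def paired_differences(treated, control):
    """Difference within each pair; any additive term shared by the pair cancels."""
    return [t - c for t, c in zip(treated, control)]


# Hypothetical setup: every pair shares an unknown additive site effect.
site_effects = [5.0, -2.0, 8.5]   # nuisance parameters, one per pair
baseline = 10.0
effect = 1.5                      # the quantity of interest

control = [baseline + s for s in site_effects]
treated = [baseline + effect + s for s in site_effects]

diffs = paired_differences(treated, control)
print(diffs)  # the site effects no longer appear in the differences
```

No randomization is involved here: the nuisance term is removed by the subtraction itself.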

>I tried to read Godambe, but it's too godambe hard to follow - and the

editors/reviewers let him get away with not defining his terms. It's

interesting to think of the value of a probabilistic design as a means of

removing nuisance parameters (such as spatial autocorrelation), but I

confess I can't follow his ancillary principle. If you have time and can

explain it to me/us in English, I'd sure appreciate it.

After the time I have put into this reply, I guess there is no point in

avoiding this other topic. Right away, I believe Godambe's paradox paper

is a landmark in statistics. In a nutshell, Godambe shows that when a

purely randomization-based inferential approach is used along with

theoretically sound pivotal methods for a parameter of a finite

population, you arrive at an unavoidable contradiction. This is that the

procedure gives a correct probability coverage for an estimated value of

the parameter of interest, say Theta1, and also a correct probability

coverage for ANY other parameter value, say Theta2, in the parameter

space.

I have a more detailed discussion and an analysis of the meaning of the

paradox in my thesis. Also, you can use Google groups, and check

sci.stat.math, and then search for Godambe. The thread is entitled

Godambe's paradox.

Note that Godambe defends the randomization theory, which is ironic.

>I disagree with Royall's assessment that a particular random sample

is biased just because its mean is not that of the population -- the

expectation of the mean of a sample is still unbiased and it is

unreasonable to expect to draw a "balanced" sample without first

enumerating the population. That's why the standard error is of

interest. Sure, if you can stratify or use ancillary variables to improve

balance, fine.

I think Royall is right. You know that *the expectation of the mean*

equals the population mean, but you don't know if *your particular mean*,

the one derived from the actual sample you obtained, approximates the true

mean. In contrast, in model-based inference you condition on the observed

sample and calculate expectations based on repetitions of the model,

rather than of the sampling (or you use likelihood-based inference).
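
The distinction between *the expectation of the mean* and *your particular mean* can be illustrated by enumerating every possible sample from a tiny population. This is a sketch with a made-up five-unit population and equal-probability samples of size 2; the names are hypothetical.

```python
from itertools import combinations
from statistics import mean


def all_sample_means(population, n):
    """Mean of every possible sample of size n drawn without replacement."""
    return [mean(s) for s in combinations(population, n)]


# Hypothetical population of five units, one of them much larger than the rest.
population = [2.0, 4.0, 6.0, 8.0, 30.0]
sample_means = all_sample_means(population, 2)

print(mean(population))                       # population mean
print(mean(sample_means))                     # average over all possible samples
print(min(sample_means), max(sample_means))   # range of individual sample means
```

Averaged over all ten possible samples the mean is unbiased, yet no single sample in this toy population gives a mean equal to the population mean.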

>Sorry this has gotten so long. Sometime (when I retire?), I'd like to

write a paper arguing that worries about spatial autocorrelation, except

in the case of regression, are misplaced. As far as I'm concerned, it's

perfectly reasonable to guarantee an unbiased estimate of the proportion

of black/white marbles in a hat by shaking them up before drawing a

sample;

it's a hell of a lot easier than mapping out their spatial positions in the

hat and fitting some variogram model. :-)

Yes, that is a nice comment. However, you cannot shake natural populations

to destroy the mechanisms that determine their functioning and make them

random. I gotta go now, or else I miss the soccer game!

R.

--

* To post a message to the list, send it to ai-geostats@...

* As a general service to the users, please remember to post a summary of any useful responses to your questions.

* To unsubscribe, send an email to majordomo@... with no subject and "unsubscribe ai-geostats" followed by "end" on the next line in the message body. DO NOT SEND Subscribe/Unsubscribe requests to the list

* Support to the list is provided at http://www.ai-geostats.org

Hello list,

I need your help to interpret this. I am working with contamination

data in soil. I think the dataset has two populations, one

representing a diffusive process (the majority of the data) and a

point source process which generates the outliers, or at least part of

them. I used the Moran scatterplot to look at the outliers,

and these are my results:

If I use the data from one depth layer, I get a global

Moran's I of about 0.02, so almost no spatial autocorrelation. If I

eliminate all the outliers I identified with the box-plot, I get a global

Moran's I of about 0.36, which is much, much better. But if I eliminate only

part of the outliers, and not all of them, I get a global Moran's I of

0.49, which is extremely good for spatial autocorrelation. I am not sure if I

am right, but I would interpret it like this: some outliers (probably

the lowest values of my upper outliers; there are no lower

outliers, at least none detected by the box-plot) belong to the diffusive

contamination process, which should have good spatial

autocorrelation, while the rest should belong to the point source

process.

Do you think my interpretation is correct? How important is this

finding, in your opinion, from a statistical point of view?

Thank you so much for any input on that,

Monica
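
The comparison described in this message (global Moran's I before and after dropping box-plot outliers) can be sketched as follows. This is only an illustration under stated assumptions: binary distance-band weights, the usual 1.5 x IQR upper fence, and a synthetic 5 x 5 grid with a smooth trend plus two artificial spikes. The function names and the data are hypothetical and are not the soil dataset from the message.

```python
from math import hypot
from statistics import mean, quantiles


def morans_i(coords, values, bandwidth):
    """Global Moran's I with binary weights: w_ij = 1 if 0 < d_ij <= bandwidth."""
    n = len(values)
    zbar = mean(values)
    z = [v - zbar for v in values]
    num = 0.0
    w_sum = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d = hypot(coords[i][0] - coords[j][0], coords[i][1] - coords[j][1])
            if d <= bandwidth:
                num += z[i] * z[j]
                w_sum += 1.0
    denom = sum(zi * zi for zi in z)
    return (n / w_sum) * (num / denom)


def upper_boxplot_fence(values):
    """Upper box-plot fence: third quartile plus 1.5 times the interquartile range."""
    q1, _, q3 = quantiles(values, n=4)
    return q3 + 1.5 * (q3 - q1)


# Synthetic data: smooth trend (diffuse process) plus two spikes (point sources).
coords = [(float(x), float(y)) for x in range(5) for y in range(5)]
values = [x + y for x, y in coords]
values[coords.index((1.0, 3.0))] = 50.0
values[coords.index((3.0, 1.0))] = 60.0

i_raw = morans_i(coords, values, bandwidth=1.0)

# Drop the values above the upper fence and recompute.
fence = upper_boxplot_fence(values)
keep = [k for k, v in enumerate(values) if v <= fence]
clean_coords = [coords[k] for k in keep]
clean_values = [values[k] for k in keep]
i_clean = morans_i(clean_coords, clean_values, bandwidth=1.0)

print(i_raw, i_clean)
```

In this synthetic case a few extreme values are enough to mask the autocorrelation of the smooth component. Whether the same explanation holds for real data depends on the weights used and on how the outliers are defined.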

--

Ruben Roa Ureta wrote:

>Yes, that is a nice comment. However, you cannot shake natural populations

>to destroy the mechanisms that determine their functioning and make them

>random. I gotta go now, or else I miss the soccer game!

>

>R.

>

You can shake the (imaginary) bottle with all sample locations, from which you

are going to randomly pick the ones for your sample. This makes the samples

perfectly independent (although not in the geostatistical, model-based

sense).

See also:

Model-free estimation from spatial samples: a reappraisal of classical

sampling theory; J. de Gruijter and C. ter Braak, Math. Geol. 22(4), 407-415,

and follow-up articles by D. Brus et al. in Math.Geol. or Environmetrics.

--

Edzer
