
Re: AI-GEOSTATS: Detecting spatial autocorrelation in highly non-normal data

  • Ruben Roa Ureta
    Message 1 of 3, Nov 26, 2003
      ---------------------------- Original Message ---------------------------
      Subject: Re: AI-GEOSTATS: Detecting spatial autocorrelation in highly
      non-normal data
      From: "Ruben Roa Ureta" <rroa@...>
      Date: Wed, 26 November 2003, 7:32 pm
      To: "Yetta Jager" <jagerhi@...>
      Cc: ai-geostat@...
      --------------------------------------------------------------------------

      >Dear Ruben:

      >I'm still very interested in discussion of the issue of probability
      sampling vs. model-based sampling, but have been waiting until reading and
      digesting the Godambe and Hansen et al papers. I struggled years ago with
      how/whether to incorporate inclusion probabilities in regional estimation
      and ended up just using separate models (variances) by strata.
      We demonstrated the use of a cokriging model to "find" low alkalinity
      streams at high elevations in the Southern Blue Ridge that were missed by
      the sample, but this depended in part on extrapolating a relationship
      between elevation and alkalinity.

      Hi Yetta: the above is an example of model-based inference.

      >Say we have a finite population (i.e., lakes) and we need to estimate the
      total of some attribute. Both approaches are attempting to quantify
      uncertainty in the attributes of unmeasured lakes. In the case of
      geostatistics, the kriging variances reflect uncertainty due to
      interpolating to unmeasured locations. Because of the strong assumed
      superpopulation model, kriging estimates of variance are (in my
      experience) too small, in the sense that a new sample would show larger
      actual MSEs than kriging MSEs.

      I see two sources of variance in a geostatistical model: the variance
      due to partial observation of the spatial process and the variance due
      to incomplete specification of the model. Both are accounted for in
      the model, at least when it is formulated as a stochastic process.
      Thus variance estimates for estimated totals (or for other functions
      of the parameters) from the geostatistical model are higher than
      variance estimates from a pure random-sampling approach, which only
      takes into account the variance due to incomplete observation of the
      population.

      >For simplicity, let's assume there is no spatial autocorrelation, a
      stationary model, and the joint probabilities of inclusion are zero. Is
      there a difference in the uncertainty represented by the design-based and
      model-based variance estimates?

      Yes, the model-based estimate of variance is generally higher than the
      design-based one. This is because the model-based variance comprises
      at least a term for model specification plus a term due to incomplete
      observation (i.e. sampling). This can be shown, after some algebra,
      for the simplest example, the expansion estimator for finite
      populations. The simple regression estimator of the total has a
      variance estimated by a formula that includes a term due to the model
      plus a term due to the sampling, while the variance of the equivalent
      design-based estimator has only the term coming from the sampling. So
      in the simplest of cases the model-based estimator actually has higher
      variance, i.e. is more conservative, than the corresponding
      design-based estimator, contrary to popular belief.
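
      In symbols (a minimal sketch in my own notation, assuming a working
      model y_i = beta*x_i + e_i with independent, homoscedastic errors,
      chosen just for illustration), with s the sampled and r the unsampled
      units, the regression estimator of the total is

        \hat{T} = \sum_{i \in s} y_i + \hat{\beta} \sum_{i \in r} x_i

      and its model-based prediction variance splits as

        E(\hat{T} - T)^2 = \Big(\sum_{i \in r} x_i\Big)^2
                           \mathrm{Var}(\hat{\beta}) + (N - n)\sigma^2,

      where the first term comes from estimating the model and the second
      from incomplete observation of the population.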

      >If there is no SA, the model-based pop. variance is the number of lakes
      not sampled (N-n) times the overall sample variance, V(n).

      I think there should be another term for the model, whatever that
      model might be, probably a relation with a predictor variable, since
      the coordinates are irrelevant (no spatial autocorrelation). If there
      is no term for a model, then we do not have model-based estimation.
      Perhaps I am missing something in your scenario.

      >As sample size, n, increases, uncertainty decreases only because N-n
      decreases. Note that the actual observed values play no part in the
      variance, which depends only on the distances involved (and with no SA,
      not even that). Implicitly, the total, T, is estimated as though the
      sample is equal-probability, by summing the values obtained in the sample,
      and the kriged estimates [here, =sample mean x (N-n)].

      I think you are talking about the kriging estimator of the mean when
      the variogram is flat. In that case I guess the kriging estimator
      should be no different from the simple expansion estimator of
      design-based inference, and we do not have a case of model-based
      estimation.
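
      To see this numerically, here is a minimal numpy sketch (the nugget
      value and system size are made up for illustration): with a flat,
      pure-nugget variogram the ordinary kriging weights come out equal to
      1/n, so the kriging estimate is just the sample mean.

        import numpy as np

        # Pure-nugget (flat variogram) covariance: no correlation between
        # distinct locations, so C is the nugget times the identity.
        n = 5
        nugget = 1.0
        C = nugget * np.eye(n)

        # Ordinary kriging system with a Lagrange multiplier mu for the
        # unbiasedness constraint sum(w) = 1:
        #   [ C  1 ] [ w  ]   [ c0 ]
        #   [ 1' 0 ] [ mu ] = [ 1  ]
        # where c0, the covariance with the prediction point, is zero here.
        A = np.zeros((n + 1, n + 1))
        A[:n, :n] = C
        A[:n, n] = 1.0
        A[n, :n] = 1.0
        b = np.zeros(n + 1)
        b[n] = 1.0

        w = np.linalg.solve(A, b)[:n]
        print(w)  # all weights equal 1/n: the estimate is the sample mean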

      >If there is SA, one way of thinking about it is that higher weight is
      assigned to measured values of lakes that are near many unmeasured lakes.
      The variance of a Horvitz-Thompson estimator will also decrease as
      sample size increases (as the probabilities of inclusion n/N
      increase), but
      unlike the kriging-model-based estimator, it depends on the values
      observed in the sample. What are the implications of this?

      I understand that the variance of the kriging model-based estimator of
      the total depends on the observations through the sill parameter,
      which is estimated from the data and enters the computation of the
      estimation variance.
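
      In standard notation (not from the thread), the ordinary kriging
      variance at a prediction point x_0 is

        \sigma^2_{OK}(x_0) = C(0) - \sum_i \lambda_i C(x_i, x_0) - \mu

      (the sign of the Lagrange multiplier \mu depends on the convention
      used), and C(0), the sill, is fitted from the sample, which is how the
      data enter the estimation variance.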

      >Each sampled lake represents a number of others, but we don't assume
      anything about their location and the "number of others" cloned from each
      sampled lake is determined by the inclusion probability. If these are
      equal, then each sample lake is given equal weight regardless of how many
      close neighbors it has. Using a list frame would therefore give more
      similar results to kriging, whereas an equal-area design would yield less
      similar results. I think, in general, the underestimation of variance
      increases as the semivariogram model gets away from the reality of the
      sample, in terms of requiring a sill that relates to sample variance. I
      would like to see studies comparing the two approaches for different
      situations.

      The particular case of predicting over spatial processes that are
      spatially separate (like predicting for unobserved lakes from other,
      observed lakes) seems to me rather difficult. However, regarding the
      underestimation of variance, in general I expect higher estimation
      variances from model-based than from design-based estimators.

      >I have digested the Hansen et al paper and discussion thereafter, but am
      still struggling with Godambe. I don't see how the Hansen et al. paper
      supports the conclusion that probabilistic sampling design has been called
      into question as a basis for inference.

      It was not Hansen et al. who questioned randomization-based inference
      in finite populations (rather, they are among the main proponents of
      that approach); it was the discussion around their paper.

      >Hansen's point is that model-based inference requires an assumption that
      the superpopulation model is true, and can lead to bad results if this is
      not the case. (As I recall, this discussion started from the
      observation that the model is always wrong.)

      It is true that models are always wrong in some sense, but:
      1) the construction of model-based estimators forces you to think
      about how the system under study works, what the mechanistic relations
      among observable variables are, how the known laws of physics (or
      biology, etc.) affect your spatial process or finite population, and
      so on. Building models has allowed the development of science as we
      know it;
      2) sampling is never truly random either: practitioners very often
      violate the dictates of random-sampling theory, and usually out of
      common sense, since samples drawn by taking numbers from a hat tend to
      be very inconvenient or misleading;
      3) models can be made robust by balancing on predictor variables.

      >From the back-and-forth following the Hansen et al. paper (and reading
      about adaptive design), I get the sense that few discount the importance
      of beginning with a sample drawn according to a probabilistic design.
      This is important to emphasize for the geostatistics community because
      many practitioners are in the habit of beginning with a "found" sample
      with unknown relationship to the population. Where the statisticians
      diverge is in the use of a model vs. design to draw inferences.

      True. It is convenient to design a sampling program, to plan in
      advance how to take samples, but it is not completely clear that the
      sampling must be probabilistic. For example, if I know of a certain
      nuisance parameter that I want to get rid of at the inference stage,
      then I sample in such a way, a deterministic way, that the nuisance
      parameter is eliminated by simple algebraic operations. Pairing, for
      example, usually allows computing a difference that deterministically
      eliminates a nuisance parameter. Probabilistic sampling enters the
      picture when I suspect there are hidden, latent parameters of which I
      am not aware; then I apply a randomization procedure in order to
      average over those hidden parameters, i.e. in expectation. However,
      after I have obtained my sample and made my deterministic computations
      to eliminate the nuisance parameter, I forget about the sampling
      procedure and make inference based on my model, conditioning on the
      observed sample.
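
      To make the pairing example concrete (my notation, for illustration
      only): if paired measurements follow

        y_{1j} = \mu_j + \tau + \varepsilon_{1j}, \qquad
        y_{2j} = \mu_j + \varepsilon_{2j},

      then the differences

        d_j = y_{1j} - y_{2j} = \tau + (\varepsilon_{1j} - \varepsilon_{2j})

      no longer contain the site effects \mu_j: the nuisance parameters are
      removed algebraically, with no randomization needed.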

      >I tried to read Godambe, but it's too godambe hard to follow - and
      the editors/reviewers let him get away with not defining his terms.
      It's interesting to think of the value of a probabilistic design as a
      means of
      removing nuisance parameters (such as spatial autocorrelation), but I
      confess I can't follow his ancillary principle. If you have time and can
      explain it to me/us in English, I'd sure appreciate it.

      After the time I have put into this reply, I guess there is no point in
      avoiding this other topic. Right away, I believe Godambe’s paradox paper
      is a landmark in statistics. In a nutshell, Godambe shows that when a
      purely randomization-based inferential approach is used along with
      theoretically sound pivotal methods for a parameter of a finite
      population, you arrive at an unavoidable contradiction: the
      procedure gives a correct probability coverage for an estimated value of
      the parameter of interest, say Theta1, and also a correct probability
      coverage for ANY other parameter value, say Theta2, in the parameter
      space.

      I have a more detailed discussion and an analysis of the meaning of
      the paradox in my thesis. You can also use Google Groups: check
      sci.stat.math and search for Godambe. The thread is entitled
      "Godambe's paradox".

      Note that Godambe defends the randomization theory, which is ironic.

      >I disagree with Royall's assessment that a particular random sample
      is biased just because its mean is not that of the population -- the
      expectation of the mean of a sample is still unbiased and it is
      unreasonable to expect to draw a "balanced" sample without first
      enumerating the population. That's why the standard error is of
      interest. Sure, if you can stratify or use ancillary variables to improve
      balance, fine.

      I think Royall is right. You know that *the expectation of the mean*
      equals the population mean, but you don't know whether *your
      particular mean*, the one derived from the actual sample you obtained,
      approximates the true mean. In contrast, in model-based inference you
      condition on the observed sample and calculate expectations based on
      repetitions of the model rather than of the sampling (or you use
      likelihood-based inference).
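
      A minimal simulation makes the distinction vivid (the population below
      is made up for illustration): one particular sample mean can be far
      from the truth even though the estimator is unbiased over repeated
      draws.

        import numpy as np

        rng = np.random.default_rng(42)
        population = rng.lognormal(mean=0.0, sigma=1.0, size=1000)
        true_mean = population.mean()

        n = 25
        # One particular sample: its mean may be well off the true mean...
        one_sample = rng.choice(population, size=n, replace=False)
        print(one_sample.mean(), true_mean)

        # ...even though over repeated draws the sample mean is unbiased:
        means = [rng.choice(population, size=n, replace=False).mean()
                 for _ in range(10000)]
        print(np.mean(means), true_mean)  # these two agree closely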

      >Sorry this has gotten so long. Sometime (when I retire?), I'd like to
      write a paper arguing that worries about spatial autocorrelation,
      except in the case of regression, are misplaced. As far as I'm
      concerned, it's perfectly reasonable to guarantee an unbiased estimate
      of the proportion of black/white marbles in a hat by shaking them up
      before drawing a sample; it's a hell of a lot easier than mapping out
      their spatial positions in the hat and fitting some variogram model.
      :-)

      Yes, that is a nice comment. However, you cannot shake natural
      populations to destroy the mechanisms that determine their functioning
      and make them random. I gotta go now, or else I'll miss the soccer
      game!

      R.




    • Monica Palaseanu-Lovejoy
      Message 2 of 3, Nov 27, 2003
        Hello list,
        
        I need your help to interpret this. I am working with contamination
        data in soil. I think the dataset has two populations, one
        representing a diffusive process (the majority of the data) and a
        point-source process which generates the outliers, or at least part
        of them. I used the Moran scatterplot to look at outliers and such,
        and these are my results:
        
        If I use the data from one depth layer, I get a global Moran of
        about 0.02, so almost no spatial correlation. If I eliminate all the
        outliers I identified with the box-plot, I get a global Moran of
        about 0.36, much, much better. But if I eliminate only part of the
        outliers, not all of them, I get a global Moran of 0.49, extremely
        good for spatial autocorrelation. I am not sure if I am right, but I
        would interpret it like this: some outliers (probably the lowest
        values of my upper outliers; there are no lower outliers, at least
        none detected by the box-plot) belong to the diffusive contamination
        process, which should have good spatial autocorrelation, while the
        rest should belong to the point-source process.
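
        For reference, this is essentially the computation I am doing; a
        minimal plain-numpy sketch of the global Moran's I (the toy values
        and contiguity weights are made up, not my actual data):

          import numpy as np

          def morans_i(y, W):
              # Global Moran's I: (n / S0) * z'Wz / z'z, with z the
              # deviations from the mean and S0 the sum of all weights.
              z = np.asarray(y, dtype=float) - np.mean(y)
              return (len(z) / W.sum()) * (z @ W @ z) / (z @ z)

          # Toy 1-D transect with binary nearest-neighbour weights;
          # the last value plays the role of an outlier.
          y = np.array([1.0, 2.0, 2.5, 3.0, 10.0])
          W = np.zeros((5, 5))
          for i in range(4):
              W[i, i + 1] = W[i + 1, i] = 1.0

          print(morans_i(y, W))                 # with the outlier
          print(morans_i(y[:-1], W[:4, :4]))    # outlier removed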
        Do you think my interpretation is correct? How important is this
        finding, in your opinion, from a statistical point of view?
        
        Thank you so much for any input on that,
        
        Monica


      • Edzer J. Pebesma
        Message 3 of 3, Nov 28, 2003
          Ruben Roa Ureta wrote:

          >Yes, that is a nice comment. However, you cannot shake natural population
          >to destroy the mechanisms that determine their functioning and make them
          >random. I gotta go now, or else I miss the soccer game!
          >
          >R.
          >
          You can shake the (imaginary) bottle with all sample locations,
          from which you are going to randomly pick the ones for your
          sample. This makes the samples perfectly independent (although not
          in the geostatistical, model-based sense).
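
          In code terms (a minimal sketch; the candidate coordinates are
          made up), "shaking the bottle" just means drawing the sample
          locations at random from the full list of candidate sites:

            import numpy as np

            rng = np.random.default_rng(7)
            # All possible sample locations in the region of interest.
            candidates = rng.uniform(0, 100, size=(500, 2))

            # Simple random sampling without replacement: every site has
            # the same inclusion probability n/N, which is what justifies
            # the design-based (model-free) inference.
            n = 30
            idx = rng.choice(len(candidates), size=n, replace=False)
            sample_sites = candidates[idx]
            print(sample_sites[:5])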

          See also:
          de Gruijter, J.J., and ter Braak, C.J.F., 1990. Model-free
          estimation from spatial samples: a reappraisal of classical
          sampling theory. Mathematical Geology 22(4), 407-415;
          and follow-up articles by D. Brus et al. in Mathematical Geology
          and Environmetrics.
          --
          Edzer

