
AI-GEOSTATS: Summary: data transformation and variograms

  • Juliann Aukema
    Message 1 of 1 , May 9, 2001
      I apologize for the delay in posting a summary to my question on data
      transformation. Here it is, better late than never, I hope. Thanks a lot to
      all those who responded.


      The key question about transformations and geostatistics is whether one
      needs to re-transform. For example, if one uses a log transform (not logit)
      then usually one wants to re-transform to the original form
      whereas in the case of the indicator transform one does not re-transform.

      The two difficulties and problems that arise are (1) how the variogram of
      the original and the variogram of the transformed variable are related, (2)
      in the case of a re-transformation how to compute the bias.
      Problem (1) probably does not matter if you are not going to re-transform.
      To actually compute the relationship, however, one would need to know the
      multivariate density function (and even then it may be difficult), and
      that is very unlikely to be known in most geostatistical applications.

      Donald Meyers
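
The re-transformation bias in point (2) can be illustrated with a minimal sketch. It assumes exactly lognormal data and looks only at a plain sample mean, not a kriged estimate; the kriging version of the correction also involves the kriging variance and is not shown. The parameters `mu` and `sigma` and the sample size are invented for illustration.

```python
import math
import random
import statistics

random.seed(0)

# Hypothetical lognormal "grades": log(z) ~ Normal(mu, sigma^2)
mu, sigma = 1.0, 0.8
z = [random.lognormvariate(mu, sigma) for _ in range(20000)]
y = [math.log(v) for v in z]  # log-transformed values

mean_z = statistics.fmean(z)   # mean on the original scale
mean_y = statistics.fmean(y)   # mean on the log scale
var_y = statistics.variance(y)

# Exponentiating the log-scale mean recovers roughly the median, not the mean.
naive = math.exp(mean_y)

# Under the lognormal assumption, E[z] = exp(mu + sigma^2 / 2).
corrected = math.exp(mean_y + var_y / 2.0)

print(f"original-scale mean: {mean_z:.3f}")
print(f"naive back-transform: {naive:.3f}")
print(f"bias-corrected:       {corrected:.3f}")
```

The naive value falls well below the original-scale mean, which is the bias referred to above. An indicator transform has no such step because the indicator estimates are interpreted directly as probabilities.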

      It is always better to use untransformed data if you

      Every complexity you add to your modelling increases
      your chances of things going wrong exponentially.

      Prime rule: simpler is always better

      What I do every time I get a new set of data is the

      (1) calculate semi-variograms and look at histograms.
      If semi-variogram nice, model and continue. If not:

      (2) take logarithms and repeat. If still not nice:

      (3) try indicators (lots of) to see if you have mixed
      distributions or something similar. If still not nice:

      (4) try a rank order (uniform) transform. If you still
      don't get nice semi-variograms there is something
      BADLY WRONG with your data. Re-assess your basic

      (a) precise reproducible data?
      (b) accurate representative data?
      (c) homogeneous sampling zones (single populations)?
      (d) trend?

      Isobel Clark
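
The sequence of steps (1) to (4) can be sketched roughly as follows. This is an illustrative outline, not the workflow of any particular package: the function names, the lag binning, and the tie handling in the rank transform are all assumptions. The semi-variogram uses the classical estimator, half the mean squared difference of pairs in each lag bin.

```python
import math
from collections import defaultdict

def empirical_semivariogram(coords, values, lag_width, n_lags):
    """Classical estimator: gamma(h) = sum((z_i - z_j)^2) / (2 * N(h)) per lag bin.

    Returns a list of (bin midpoint, gamma, pair count) for non-empty bins.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    n = len(values)
    for i in range(n):
        for j in range(i + 1, n):
            h = math.dist(coords[i], coords[j])
            k = int(h // lag_width)  # index of the lag bin
            if k < n_lags:
                sums[k] += (values[i] - values[j]) ** 2
                counts[k] += 1
    return [((k + 0.5) * lag_width, sums[k] / (2 * counts[k]), counts[k])
            for k in range(n_lags) if counts[k] > 0]

def log_transform(values):
    # Step (2): requires strictly positive values.
    return [math.log(v) for v in values]

def indicator_transform(values, cutoff):
    # Step (3): 1 if the value is at or below the cutoff, else 0.
    # Repeat for several cutoffs to look for mixed populations.
    return [1.0 if v <= cutoff else 0.0 for v in values]

def rank_uniform_transform(values):
    # Step (4): replace each value by its rank, scaled into (0, 1).
    # Ties are broken by order of appearance (an arbitrary choice).
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    for rank, i in enumerate(order):
        out[i] = (rank + 0.5) / len(values)
    return out

# Tiny made-up transect, only to show the calls.
coords = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
values = [1.0, 2.0, 4.0, 7.0]
raw = empirical_semivariogram(coords, values, lag_width=1.0, n_lags=3)
logged = empirical_semivariogram(coords, log_transform(values), 1.0, 3)
```

Whether a semi-variogram counts as "nice" is still judged by eye from a plot of gamma against lag; the code only produces the points to plot.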

      Handling correlation on the link scale vs handling it on the unadjusted
      scale is apparently "a topic of discussion in statistics." However, the
      may help: if you handle covariance on the link scale you are working with
      a subject-specific model while a population averaged model refers to
      modeling the covariance in the error term. I'd recommend getting a
      copy of Wolfinger R and M O'Connell 1993 on generalized linear mixed models.

      Fundamentally, your approach may depend on your goals. Are you really
      trying to explain outcomes using predictor variables? Are you
      fundamentally interested in the covariance from an ecological
      perspective? Or, are you trying to predict the number of trees per given

      If your goal falls into the former two categories and if you have a
      nonignorable source of nonstationarity, then you can adjust for that
      nonstationarity using binary or binomial regression. If you have covariates
      at the tree level, then you might want to use the binary route. You'll
      need to pick a link but you might find that a logit link might get you
      started. After modeling the mean using logistic regression, you can
      assess the spatial structure of the residuals by building semivariograms
      from the Pearson or deviance residuals. If you observe structure, you can
      model both the nonstationarity *and* the covariance using
      generalized linear mixed models. If you get this far, you should probably
      have read the papers below (or their equivalent). You can model spatial
      variability as either a random effect or as correlated errors. All
      this can be done in SAS using PROC LOGISTIC, PROC SEMIVARIOGRAM and the
      GLIMMIX macro, respectively.

      Brian
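
As a rough illustration of the residual step, the sketch below computes Pearson residuals for binomial counts given fitted probabilities from a logit-link model. It is not the SAS procedure and it does not fit the model: the coefficients `b0` and `b1`, the elevation covariate, and the counts are hypothetical placeholders.

```python
import math

def inv_logit(eta):
    """Inverse of the logit link: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

def pearson_residuals(infected, sampled, fitted_p):
    """Pearson residual for binomial counts: (y - n*p) / sqrt(n*p*(1-p))."""
    res = []
    for y, n, p in zip(infected, sampled, fitted_p):
        res.append((y - n * p) / math.sqrt(n * p * (1.0 - p)))
    return res

# Hypothetical fitted coefficients (intercept and elevation slope); in practice
# these would come from a logistic regression fit.
b0, b1 = -2.0, 0.001

# Made-up plots: elevation in meters, trees sampled, trees infected.
elevation = [500.0, 1500.0, 2500.0]
sampled = [20, 12, 30]
infected = [4, 5, 20]

fitted_p = [inv_logit(b0 + b1 * e) for e in elevation]
residuals = pearson_residuals(infected, sampled, fitted_p)
```

These residuals, paired with the plot coordinates, are what a semivariogram of the residuals would be built from. Dividing by the binomial standard deviation puts plots with different sample sizes on a comparable scale.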

      Gotway, CA and WW Stroup. 1997. A generalized linear model approach to
      spatial data analysis and prediction. JABES 2: 157-178.
      Gumpertz, ML, C Wu and JM Pye. 2000. Logistic regression for Southern
      Pine Beetle outbreaks with spatial and temporal correlation. Forest Science
      46: 95-107.
      Wolfinger, R. 1993. Covariance structure selection in general mixed
      models. Communications in Statistics–Simulations 22: 1079-1106.
      Wolfinger, R. and M O'Connell. 1993. Generalized Linear Mixed Models: A
      Pseudo-Likelihood Approach. Journal of Statistical Computation and
      Simulation 48: 233-243.

      Brian Gray

      I think the problem might be even more subtle. Essentially you are looking
      at a marked point process, and trying to apply methods designed
      principally for data that is continuous throughout the sampling domain.

      I would suggest looking at the following paper:
      Stoyan and Waelder 2000. On variograms in point process statistics
      II. Models of markings and ecological interpretation. Biometrical journal

      Another approach you might think about is spatial cdf estimation. Take a
      look at the work of Cressie and friends.

      Nicholas Lewin-Koh

      > >Juliann Aukema wrote:
      > >
      > >> Hi. I have a question about transforming data.
      > >>
      > >> I have infection prevalence data for many points- a proportion of
      > >> trees infected. Numbers are between 0 and 1. Sample size varies for the
      > >> different points (because density of trees varies). When I plot a variogram
      > >> of the prevalence data, I get a nice sill for about 4000 meters and then a
      > >> rise in the variogram. If I take the residuals of prevalence against
      > >> elevation the second rise goes away. Biologically this all makes sense and
      > >> makes a nice story.
      > >> However for some other analyses that I also did with this data, I
      > >> was advised to logit transform the prevalence data because it is a
      > >> proportion and should be binomially distributed.
      > >> If I plot the variogram of the logit transformed prevalence, the
      > >> first sill is much less distinct if it is there at all - this seems to be
      > >> mostly due to one point, the last point before the rise, which now goes up
      > >> instead of being about even with the previous point. (I guess this
      > >> difference is due to the stretching of zero prevalence values that occurs
      > >> with the logit transformation.) And if I look at smaller lags, it looks
      > >> like a power function with no sill. Biologically, that is harder to
      > >> explain. If I plot the residuals of the (logit transformed prevalence)
      > >> against (elevation), the variogram has a nice sill and is similar, even
      > >> prettier than the analysis of the untransformed data (but based on the
      > >> previous variogram, I don't have a very good reason for plotting the
      > >> residuals).
      > >> My question, then is whether the logit transformation is necessary
      > >> and/or appropriate for the geostatistical analysis. Does it make sense to
      > >> use the transformed data for both variograms, for just the residuals
      > >> (because the residuals are based on regression for which the transformation
      > >> ought to be done) or for neither?
      > >> Thank you very much.
      > >>
      > >> Juliann
      > >> jaukema@...
      > >>
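
The stretching of zero prevalence values mentioned in the question can be seen numerically. The original post does not say how zeros were handled before taking logits, so the sketch below assumes an empirical logit with 0.5 added to both counts; that constant, the function name, and the example counts are assumptions.

```python
import math

def empirical_logit(infected, sampled, c=0.5):
    """Logit of a proportion, with c added to both counts so 0 and 1 stay finite."""
    return math.log((infected + c) / (sampled - infected + c))

# Made-up plots of 20 trees each: (infected, sampled).
plots = [(0, 20), (1, 20), (2, 20), (10, 20), (11, 20)]
for y, n in plots:
    print(f"prevalence {y / n:.2f} -> empirical logit {empirical_logit(y, n):+.3f}")

# One extra infected tree changes the logit far more near 0 than near 0.5.
step_near_zero = empirical_logit(1, 20) - empirical_logit(0, 20)
step_near_half = empirical_logit(11, 20) - empirical_logit(10, 20)

# The same zero prevalence maps to different logits for different sample sizes.
zero_small = empirical_logit(0, 5)
zero_large = empirical_logit(0, 50)
```

Because squared differences between pairs drive the variogram, a few plots with zero prevalence can dominate individual lag bins after this kind of transform, and with varying sample sizes the zeros do not even map to a single value.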

      * To post a message to the list, send it to ai-geostats@...
      * As a general service to the users, please remember to post a summary of any useful responses to your questions.
      * To unsubscribe, send an email to majordomo@... with no subject and "unsubscribe ai-geostats" followed by "end" on the next line in the message body. DO NOT SEND Subscribe/Unsubscribe requests to the list
      * Support to the list is provided at http://www.ai-geostats.org