AI-GEOSTATS: Summary: data transformation and variograms
I apologize for the delay in posting a summary to my question on data
transformation. Here it is, better late than never, I hope. Thanks a lot to
all those who responded.
The key question about transformations and geostatistics is whether one
needs to re-transform. For example, if one uses a log transform (not logit)
then usually one wants to re-transform to the original form
whereas in the case of the indicator transform one does not re-transform.
Two difficulties arise: (1) how the variogram of the original variable and
the variogram of the transformed variable are related, and (2) in the case
of a re-transformation, how to compute the bias.
(1) is probably not a problem if you are not going to re-transform. To
actually compute the relationship, however, one would need to know the
multivariate distribution (density function) of the data, and even then it
may be difficult. That distribution is very unlikely to be known in most
geostatistical applications.
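To illustrate point (2) for the log transform, here is a minimal, non-spatial sketch in Python. It assumes the logged values are roughly normal; in that case exponentiating the mean of the logs gives the geometric mean, which underestimates the arithmetic mean, and the usual lognormal correction adds half the log-scale variance before exponentiating. The data are simulated and the seed, parameters and variable names are illustrative only, not taken from any respondent. The back-transform in lognormal kriging involves further terms and is not shown.

```python
import math
import random
import statistics

random.seed(0)  # arbitrary seed, for repeatability only

# Simulated lognormal sample: the logs are N(mu, sigma^2) (illustrative parameters)
mu, sigma = 1.0, 0.8
z = [random.lognormvariate(mu, sigma) for _ in range(20000)]
y = [math.log(v) for v in z]

mean_z = statistics.fmean(z)            # arithmetic mean on the original scale
naive = math.exp(statistics.fmean(y))   # back-transformed mean of logs (geometric mean)
# Lognormal correction: add half the sample variance of the logs before exponentiating
corrected = math.exp(statistics.fmean(y) + statistics.variance(y) / 2)

print(f"sample mean         : {mean_z:.3f}")
print(f"naive exp(mean log) : {naive:.3f}")
print(f"with variance term  : {corrected:.3f}")
```

The gap between the first two printed numbers is the re-transformation bias; the third number shows how far the simple correction closes it when the lognormal assumption holds.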
It is always better to use untransformed data if you
Every complexity you add to your modelling increases
your chances of things going wrong exponentially.
Prime rule: simpler is always better
What I do every time I get a new set of data is the
(1) calculate semi-variograms and look at histograms.
If semi-variogram nice, model and continue. If not:
(2) take logarithms and repeat. If still not nice:
(3) try indicators (lots of) to see if you have mixed
distributions or something similar. If still not nice:
(4) try a rank order (uniform) transform. If you still
don't get nice semi-variograms there is something
BADLY WRONG with your data. Re-assess your basic
(a) precise reproducible data?
(b) accurate representative data?
(c) homogeneous sampling zones (single populations)?
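The raw / log / indicator / rank sequence described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the estimator is the classical (Matheron) one with equal-width, omnidirectional lag bins, the indicator is coded as 1 when the value is at or below the cutoff, and the small transect data are invented.

```python
import math

def empirical_semivariogram(coords, values, lag_width, n_lags):
    """Classical estimator: half the mean squared difference of pairs in each lag bin."""
    sums = [0.0] * n_lags
    counts = [0] * n_lags
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            h = math.dist(coords[i], coords[j])
            k = int(h // lag_width)  # bin k covers [k * lag_width, (k + 1) * lag_width)
            if k < n_lags:
                sums[k] += (values[i] - values[j]) ** 2
                counts[k] += 1
    # One tuple per bin: (bin midpoint, gamma, number of pairs); gamma is None if no pairs
    return [((k + 0.5) * lag_width,
             sums[k] / (2 * counts[k]) if counts[k] else None,
             counts[k]) for k in range(n_lags)]

def log_transform(values):
    return [math.log(v) for v in values]  # requires strictly positive data

def indicator_transform(values, cutoff):
    return [1.0 if v <= cutoff else 0.0 for v in values]

def rank_uniform_transform(values):
    """Ranks rescaled into (0, 1); ties are broken by order of appearance (a simplification)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    for rank, i in enumerate(order, start=1):
        out[i] = rank / (len(values) + 1)
    return out

# Invented data on a short transect, just to show the calls
coords = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
values = [1.2, 1.9, 4.1, 3.8, 7.5]
for name, data in [("raw", values),
                   ("log", log_transform(values)),
                   ("indicator", indicator_transform(values, 3.8)),
                   ("rank", rank_uniform_transform(values))]:
    print(name, empirical_semivariogram(coords, data, 1.0, 4))
```

For the "lots of indicators" step one would repeat the indicator call over several cutoffs, for example a set of sample quantiles, and compare the resulting semi-variograms.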
Handling correlation on the link scale vs handling it on the unadjusted
scale is apparently "a topic of discussion in statistics." However, the
may help: if you handle covariance on the link scale you are working with
a subject-specific model while a population averaged model refers to
modeling the covariance in the error term. I'd recommend getting a
copy of Wolfinger R and M O'Connell 1993 on generalized linear mixed models.
Fundamentally, your approach may depend on your goals. Are you really
trying to explain outcomes using predictor variables? Are you
fundamentally interested in the covariance from an ecological
perspective? Or, are you trying to predict the number of trees per given
If your goal falls into the former two categories and if you have a
nonignorable source of nonstationarity, then you can adjust for that
nonstationarity using binary or binomial regression. If you have covariates
at the tree level, then you might want to use the binary route. You'll
need to pick a link but you might find that a logit link might get you
started. After modeling the mean using logistic regression, you can
assess the spatial structure of the residuals by building semivariograms
from the Pearson or deviance residuals. If you observe structure, you can
model both the nonstationarity *and* the covariance using
generalized linear mixed models. If you get this far, you should probably
have read the papers below (or their equivalent). You can model spatial
variability as either a random effect or as correlated errors. All
this can be done in SAS using PROC LOGISTIC, PROC VARIOGRAM and the
GLIMMIX macro, respectively. Brian
Gotway, CA and WW Stroup. 1997. A generalized linear model approach to
spatial data analysis and prediction. JABES 2: 157-178.
Gumpertz, ML, C Wu and JM Pye. 2000. Logistic regression for Southern
Pine Beetle outbreaks with spatial and temporal correlation. Forest Science
Wolfinger, R. 1993. Covariance structure selection in general mixed
models. Communications in Statistics - Simulation and Computation 22: 1079-1106.
Wolfinger, R. and M O'Connell. 1993. Generalized Linear Mixed Models: A
Pseudo-Likelihood Approach. Journal of Statistical Computation and
Simulation 48: 233-243
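The first two steps of the route Brian describes (model the mean with logistic regression, then examine the Pearson residuals for spatial structure) can be sketched outside SAS as follows. This is a plain-Python stand-in and not the SAS procedures he names. It assumes a single standardized covariate (for example elevation), binomial counts per sampling point, and a Newton-Raphson fit; the function names and the data are hypothetical.

```python
import math

def fit_binomial_logit(x, y, n, n_iter=25):
    """Fit logit(p_i) = b0 + b1 * x_i to counts y_i out of n_i by Newton-Raphson."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        # Score vector (g0, g1) and information matrix [[h00, h01], [h01, h11]]
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi, ni in zip(x, y, n):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = ni * p * (1.0 - p)   # binomial variance at the current fit
            r = yi - ni * p          # raw residual
            g0 += r
            g1 += r * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (information matrix) * delta = score
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

def pearson_residuals(x, y, n, b0, b1):
    """(observed - expected) / sqrt(binomial variance) for each sampling point."""
    res = []
    for xi, yi, ni in zip(x, y, n):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
        res.append((yi - ni * p) / math.sqrt(ni * p * (1.0 - p)))
    return res

# Invented example: standardized elevation, trees sampled, trees infected
x = [-1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5]
n = [10, 12, 8, 15, 9, 11, 10]
y = [1, 2, 2, 6, 5, 8, 8]

b0, b1 = fit_binomial_logit(x, y, n)
res = pearson_residuals(x, y, n, b0, b1)
print("intercept, slope:", b0, b1)
print("Pearson residuals:", res)
```

The residuals, paired with the point coordinates, would then go into whatever empirical semivariogram routine is at hand. Because the residuals are scaled by the binomial variance, points with few trees and points with many trees are put on a comparable footing, which is one reason to prefer them to raw residuals here.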
I think the problem might be even more subtle. Essentially you are looking
at a marked point process, and trying to apply methods designed
principally for data that is continuous throughout the sampling domain.
I would suggest looking at the following paper:
Stoyan and Waelder 2000. On variograms in point process statistics
II. Models of markings and ecological interpretation. Biometrical journal
Another approach you might think about is spatial CDF estimation. Take a
look at the work of Cressie and friends.
> >Juliann Aukema wrote:--
> >> Hi. I have a question about transforming data.
> >> I have infection prevalence data for many points- a proportion of
> >> trees infected. Numbers are between 0 and 1. Sample size varies for the
> >> different points (because density of trees varies). When I plot a
> >> of the prevalence data, I get a nice sill for about 4000 meters and
> >> rise in the variogram. If I take the residuals of prevalence against
> >> elevation the second rise goes away. Biologically this all makes sense and
> >> makes a nice story.
> >> However for some other analyses that I also did with this data, I
> >> was advised to logit transform the prevalence data because it is a
> >> proportion and should be binomially distributed.
> >> If I plot the variogram of the logit transformed prevalence, the
> >> first sill is much less distinct if it is there at all - this seems to be
> >> mostly due to one point, the last point before the rise, which now goes up
> >> instead of being about even with the previous point. ( I guess this
> >> difference is due to the stretching of zero prevalence values that occurs
> >> with the logit transformation.) And if I look at smaller lags, it looks
> >> like a power function with no sill. Biologically, that is harder to
> >> explain. If I plot the residuals of the (logit transformed prevalence)
> >> against ( elevation), the variogram has a nice sill and is similar, even
> >> prettier than the analysis of the untransformed data (but based on the
> >> previous variogram, I don't have a very good reason for plotting the
> >> residuals).
> >> My question, then is whether the logit transformation is necessary
> >> and/or appropriate for the geostatistical analysis. Does it make sense to
> >> use the transformed data for both variograms, for just the residuals
> >> (because the residuals are based on regression for which the
> >> ought to be done) or for neither?
> >> Thank you very much.
> >> Juliann
> >> jaukema@...
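A small numerical sketch of the "stretching" mentioned in the question may help. The question does not say how zero prevalences were handled before taking logits (the plain logit of 0 is undefined), so the 0.5 continuity adjustment below, often called the empirical logit, is an assumption, as are the function names and the numbers.

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def empirical_logit(infected, total):
    """Logit with a 0.5 continuity adjustment so that 0 and 1 prevalences stay finite."""
    return math.log((infected + 0.5) / (total - infected + 0.5))

# Equal 0.05 steps in prevalence are far from equal steps on the logit scale
for p in (0.05, 0.10, 0.45, 0.50):
    print(f"p = {p:.2f}  logit = {logit(p):+.3f}")

# The same zero prevalence maps to different values depending on how many trees were sampled
for total in (5, 20, 80):
    print(f"0 of {total:2d} infected  empirical logit = {empirical_logit(0, total):+.3f}")
```

Differences between low-prevalence points are magnified on the logit scale, and with an adjustment of this kind the zeros also pick up variation that comes from sample size alone. Either effect could plausibly change the shape of the variogram at particular lags.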