GEOSTATS: re: heteroscedasticity of residuals
Here's a summary of the many helpful responses I received to a question about
heteroscedasticity of my residuals from a regression analysis. What I tried
was giving all of my zero values a very tiny, near-zero number proportional
to the mean value of all of the points in the neighborhood of the zero
value. I then normal score transformed the data, performed my regression
and the variography and kriging of the residuals, and then backtransformed
my final surface of normal scores. My residual plots then did not indicate
any bias, interestingly, and I could not find any evidence that my
results performed poorly, as judged by a cross-validation procedure.
I'm currently trying to find some information about potential theoretical
problems with using normal scored data in this sort of analysis. It seems
that normal scored data are used in sequential gaussian conditional
simulation in GSLIB. It also seems to me that normal scoring is kind of like
a ranking procedure, not like the lognormal transformation, and thus the
same problems associated with kriging of lognormal data do not necessarily
exist. However, there are probably some other theoretical problems with
using normal scored data that I am not aware of. If anyone has any pointers
r.e. this, I'd be happy to hear them!
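For what it's worth, the rank-based nature of the normal score transform can be sketched in a few lines. This is a minimal illustration assuming no ties in the data; `normal_scores` and `back_transform` are made-up helper names, not GSLIB routines:

```python
import numpy as np
from statistics import NormalDist

def normal_scores(x):
    """Rank-based normal score transform: map each value to the standard
    normal quantile of its plotting position (rank - 0.5) / n."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    ranks = x.argsort().argsort() + 1          # ranks 1..n
    p = (ranks - 0.5) / n                      # plotting positions in (0, 1)
    nd = NormalDist()
    return np.array([nd.inv_cdf(pi) for pi in p])

def back_transform(scores, original):
    """Invert the transform by interpolating scores back onto the
    empirical quantiles of the original data."""
    src = np.sort(normal_scores(original))
    tgt = np.sort(np.asarray(original, dtype=float))
    return np.interp(scores, src, tgt)
```

Because only the ranks enter the transform, it is invariant to any monotone rescaling of the data, which is exactly why it behaves differently from a log transform.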
Again, thank you so much for your help! Original message first, then responses:
>I'm wondering if anyone might have some advice on this one: I'm basically
>trying to model the spatial distribution of the importance of a certain
>forest type using a combination of multiple linear regression and kriging
>(~universal kriging). I have wall-to-wall coverages of the independent
>variables: precipitation, slope, elevation, aspect, road density, distance
>to roads, and various satellite variables (vegetation indices); I have ~700
>spatially referenced field samples where there is a measurement of the
>dependent variable: amount of spruce-fir forest type. I transformed all
>the variables, trying to make them as normal as possible.
>I've found, using stepwise linear regression, that the best model uses 4 of
>the independent variables, and has an r^2 of about .45 (which I found kind of
>exciting, given the amount of noise in this sort of thing). My intention is
>to then krige the residuals of the estimates from this model, assuming they
>exhibit spatial autocorrelation, which they do. Adding the estimates from
>both procedures will hopefully yield me a "better" set of estimates than
>either procedure alone.
>My worry, however, is that when I examine the residuals from my multiple
>linear regression, I find that the plot of the residuals (y axis) vs. the
>fitted value (x axis) indicate heteroscedasticity (they are more
>concentrated around 0 at low values of x, and spread out as x increases (a
>megaphone form)). They are normally distributed around 0, however, and do
>not show any spatial pattern.
>I have transformed the heck out of everything, and I have tried in a rather
>clumsy way to implement weighted least squares regression (in Minitab--the
>online help is very weak on this!), with poor results (the residual plot
>remains very much the same). I also removed outliers to the point where I
>felt a little guilty, but without much impact (although Minitab still tells
>me that there are lots of "unusual observations").
>One clue: the dependent variable has all sorts of zero values--there are
>about 200 of the 700 that have a measurement of "0 spruce-fir" found at the
>plot. I removed the zero values, then ran the regression on the remaining
>values, yielding a nearly normal distribution, but the residual plot did not
>change much (still the megaphone shape). I looked at scatterplots of all
>the independents vs. the dependent, and saw a little evidence of
>non-constant variance in x across values of y, but it didn't seem dramatic. I also
>plotted the absolute values of the residuals vs. the independents, and
>didn't see any crazy relationship in terms of non-homogeneous variance.
>The correlation coefficients of the independents vs. the dependent are all
>between .3 and .5; the scatterplots are pretty fat...
>My next course of action is to try doing a principal components analysis on
>the independent variables, and using a PC in the regression analysis. I'm
>also going to look into some sort of nonparametric regression. I'd really
>like to just stick with the model I came up with, however....
>Does anyone have any good ideas? Should I worry about the
>heteroscedasticity of the data, given my goals (from what I read, it seems
>like one mostly worries about heteroscedasticity when considering
>intervals...However, I'd like my predictions not to be biased...)
>Sorry if this is an inane question!
>I'll post any responses...
Andrew: Let me recommend generalized linear mixed modeling (GLiMM). See
and Stroup (1997), JABES 2: 157-178. Briefly, this may allow you to model a
binomial outcome (or other members of the exponential family) that exhibits
spatial covariance. I say "may" because you may need to face a few caveats
and go through a little trial and error. Allison (1999), in SAS's Logistic
Regression text, covers some concerns--as does the macro documentation (see
below). The big deal, however, is that you should be able to avoid the
outlier and transformation approaches you've used to date. If you have
access to SAS, GLiMM can be performed using the GLIMMIX macro. The
documentation is provided in the macro and in the PROC MIXED documentation
(which GLIMMIX calls). The macro is available from www.sas.com.
to me the problem looks like functional misspecification and not like a
distributional problem. have you tried squares or square roots of the driver
variables and interaction terms? you can either try these things or do a
test for functional specification or analyse the residuals versus the driver
variables. there is a very simple test for functional misspecification: the
Ramsey test. you calculate the residuals, then square and cube them. add
the squared and cubed residuals to the regression model and see whether the r
square increases significantly (f-test if you want). i typically do all
these things iteratively until i am satisfied.
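For what it's worth, the textbook RESET variant augments the model with powers of the *fitted values* rather than of the residuals; a minimal numpy sketch of that variant follows (`reset_F` and `ols_rss` are made-up names, not library calls):

```python
import numpy as np

def ols_rss(X, y):
    """OLS fit; return the residual sum of squares."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return r @ r

def reset_F(X, y):
    """RESET-style F statistic: does adding yhat^2 and yhat^3 to the
    design matrix significantly reduce the residual sum of squares?"""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    yhat = X @ beta
    rss0 = ols_rss(X, y)
    Xa = np.column_stack([X, yhat**2, yhat**3])
    rss1 = ols_rss(Xa, y)
    q = 2                                    # number of added regressors
    return ((rss0 - rss1) / q) / (rss1 / (n - k - q))
```

Compare the statistic against an F(2, n-k-2) critical value; a large value signals that the conditional mean, not the error distribution, is mis-specified.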
i would not worry about weighted regression and stuff like that but i would
develop the model manually instead of stepwise regression. typically one
needs many runs (model respecifications) to get an understanding of the data.
i find graphs or tables of left-hand side variables versus drivers always
very useful as a first step.
A few observations
1. You mention having transformed "everything" to obtain "normality". I
would assume that ultimately you want to obtain interpolates for the
original data (not the transformed data nor the residual data). Simply
exponentiating will lead to problems. Consider the simpler problem of a
constant mean (no auxiliary variables) and a log transformation for the
variable of interest. If you model a variogram from the transformed data,
then use the transformed data and this variogram to krige, you will have a
multiplicative bias if you simply exponentiate the kriged values. Under an
assumption of multivariate lognormality for the original variable you can compute
the bias adjustment (see for example Journel, Math. Geology, circa 1980).
For any other transformation than logarithmic and without the lognormality
assumption it is not possible to compute the bias adjustment (even knowing
the multivariate distribution type and the analytic form of the transform
is usually not enough to actually do it).
As an ad-hoc solution (described in Journel and Huijbregts, 1978) for the
log transform case, compute the mean of the original untransformed data and the
mean of the back-transformed kriged values. Take the quotient; this is your
multiplicative correction factor.
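As a toy illustration of this multiplicative bias (a sketch with simulated data, not kriging proper): for lognormal data, exponentiating a log-space mean estimates the median, which understates the mean by the factor exp(sigma^2/2) that the quotient correction is trying to recover.

```python
import numpy as np

rng = np.random.default_rng(42)
mu, sigma = 1.0, 0.8
z = rng.normal(mu, sigma, 200_000)     # estimates in log space
y = np.exp(z)                          # the "original" lognormal variable

naive = np.exp(z.mean())               # simply exponentiating the log-space mean
adhoc = y.mean() / naive               # quotient of data mean to naive back-transform
theory = np.exp(sigma**2 / 2)          # the analytic lognormal bias factor
# adhoc and theory should roughly agree; naive understates the true mean
```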
Did you transform the original dependent variable data or the residuals? If
the former then you have real problems; if the latter then your problem is
at least close to that described above.
You mention universal kriging but don't seem to have used it. Having
identified the 4 critical auxiliary variables you might try UK using
"external drift". This avoids having to go back and compute the value of
the regression term at non-data locations.
Although there are a lot of similarities between regression (multivariate)
and kriging, the underlying assumptions are quite different. One should be
a little careful mixing them.
I agree with the post that suggested that you try other (nonlinear) forms
of regression if your data is not naturally continuous and/or linear.
Linearizing transforms are often the culprits when residuals are
heteroskedastic. If you are working with continuous data, which you have
transformed to obtain normal distributions, then the advice below may apply.
Usually, parametric statistical tests like multiple regression are
much more robust to non-normality than they are to heterogeneity of variance
problems, so transforms are generally geared toward creating homoskedastic
residuals rather than normal data. The fan-shaped or
megaphone-shaped residual plot you describe suggests that you need to
transform the data one more time using a log or square root transform.
(Most stats books on linear regression describe this problem and suggest
appropriate transforms based on shape of the residual plot.)
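A quick numerical illustration of why a square-root transform is the usual prescription for a fan-shaped residual plot. This sketch assumes count-like, Poisson data, for which the variance equals the mean:

```python
import numpy as np

rng = np.random.default_rng(0)
means = [2.0, 8.0, 32.0]                       # three groups of fitted values
raw_var  = [rng.poisson(m, 50_000).var() for m in means]
sqrt_var = [np.sqrt(rng.poisson(m, 50_000)).var() for m in means]

# raw variances grow roughly in proportion to the mean (the megaphone),
# while on the square-root scale the variances are all close to 1/4
```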
Additionally, multiple regression tests lose power when independent
variables are correlated, which is why PCA is used to remove correlations
before doing multiple regression or alternatively, one variable from a set
of correlated variables is used for the regression test. If the variables
you are using are correlated, your r^2 may be even better than .45.
It's also important to remember that step-wise regression may produce a
"best fit" set of variables that is unique to the particular data set on
which it is used, but the "best fit" set may not generalize to new data
sets. Therefore it should generally be used only to give you a feel for
the data sets and the variables, but you should try to base your published
work on either established theory or personal experience (your own
theories) about the variables involved (i.e., try to test a priori models
rather than post hoc models).
I would suggest assuming Poisson-like distributions for the regression
residuals, which would lead you to fit Generalized Linear (or Additive)
Models for the trend. This deals well with variances that are
proportional to the mean value.
Residuals from such a regression are non-stationary by nature. Kriging
them is possible if you assume a non-stationary (mean-dependent)
covariogram, derived from a stationary correlogram. I wrote a paper about
this for the geostats conference in Cape Town, which is available in PDF on
the web. Follow Conference paper Nr. 4 (Pebesma/Bio/Duin) in my publications
list.
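A bare-bones sketch of fitting such a Poisson GLM with a log link by iteratively reweighted least squares (plain numpy; in practice one would use the GLM routine of a statistics package, and `poisson_glm` is a made-up name):

```python
import numpy as np

def poisson_glm(X, y, n_iter=25):
    """Poisson regression with log link via IRLS: at each step solve a
    weighted least-squares problem with weights equal to the current mean."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta
        mu = np.exp(eta)
        z = eta + (y - mu) / mu          # working response
        W = mu                           # Poisson: Var(y) equals the mean
        XtW = X.T * W
        beta = np.linalg.solve(XtW @ X, XtW @ z)
    return beta
```

The weights `W = mu` are exactly where the "variance proportional to the mean" assumption enters, which is why this handles the megaphone-shaped residual plot without any transformation.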
I've done something similar and used blocking or stratification of the
residuals to deal with heteroscedasticity instead of transformation.
Transformation is really not advisable because of having to then add the
kriging variances in the back-transformation. See the pdf file on my web page.
In OLS regression, the assumption is that you have independent observations.
Your description gives me the sense that your observations are highly
autocorrelated. Your correlations are thus invalid. Consider strategies to
get rid of the spatial autocorrelation. The PCA won't do it since your
observations are not of equal worth.
I understand from your email that you are trying to map the
"amount of spruce-fir forest type" using wall-to-wall coverages of
a number of predictor variables.
Instead of multiple regression, you might find the predictions
provided by classification and regression trees (CART) more
useful. For an example of CART and kriging applied to a
geological data set, see http://www.isaaks.com, and click on
EDA (exploratory data analysis). Here, the response variable is
the amount of copper while the predictors are an assortment of
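The core idea behind CART, for readers without a tree package to hand, is recursive partitioning by squared-error reduction; a toy one-variable, one-split version (a regression "stump", with made-up names) looks like:

```python
import numpy as np

def best_split(x, y):
    """Threshold on x that minimizes the summed squared error
    of the two resulting leaf means."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    best_sse, best_t = np.inf, None
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue                     # cannot split between equal x values
        left, right = ys[:i], ys[i:]
        sse = ((left - left.mean())**2).sum() + ((right - right.mean())**2).sum()
        if sse < best_sse:
            best_sse, best_t = sse, (xs[i - 1] + xs[i]) / 2
    return best_t

def stump_predict(x, y, x_new):
    """Fit one split and predict with the two leaf means. Full CART
    recurses this step over all predictors; this sketch stops at depth 1."""
    t = best_split(x, y)
    lo, hi = y[x <= t].mean(), y[x > t].mean()
    return np.where(np.asarray(x_new) <= t, lo, hi)
```

Because predictions are piecewise-constant leaf means, trees impose no linearity or distributional assumptions on the response, which is the appeal for data with many zeros.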