GEOSTATS: heteroscedasticity in multiple linear regression
Hello list!
I'm wondering if anyone might have some advice on this one: I'm basically
trying to model the spatial distribution of the importance of a certain
forest type using a combination of multiple linear regression and kriging
(~universal kriging). I have wall-to-wall coverages of the independent
variables: precipitation, slope, elevation, aspect, road density, distance
to roads, and various satellite variables (vegetation indices); I have ~700
spatially referenced field samples where there is a measurement of the
dependent variable: amount of spruce-fir forest type. I transformed all of
the variables, trying to make them as normal as possible.
I've found, using stepwise linear regression, that the best model uses 4 of
the independent variables, and has an r^2 of about .45 (which I found kind of
exciting, given the amount of noise in this sort of thing). My intention is
to then krig the residuals of the estimates from this model, assuming they
exhibit spatial autocorrelation, which they do. Adding the estimates from
both procedures will hopefully yield me a "better" set of estimates than
either procedure alone.
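
To make the two-step procedure concrete, here is a minimal sketch, assuming synthetic data and an exponential covariance whose sill and range are simply picked by hand rather than fitted to an empirical variogram. All names (fit_ols, exp_cov, ordinary_krige) are hypothetical helpers written for this illustration, not part of any package, and the OLS trend is only an approximation to the GLS trend that true universal kriging would use.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares coefficients for y ~ X (X already holds an intercept column)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def exp_cov(h, sill, corr_range):
    """Exponential covariance; sill and corr_range are assumed values, not fitted."""
    return sill * np.exp(-h / corr_range)

def ordinary_krige(coords, values, targets, sill=1.0, corr_range=10.0):
    """Ordinary kriging of `values` at `targets`, no nugget (exact interpolator)."""
    n = len(coords)
    d_obs = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    d_new = np.linalg.norm(coords[:, None, :] - targets[None, :, :], axis=2)
    # Kriging system: covariances bordered by the unbiasedness constraint.
    A = np.ones((n + 1, n + 1))
    A[:n, :n] = exp_cov(d_obs, sill, corr_range)
    A[n, n] = 0.0
    b = np.ones((n + 1, targets.shape[0]))
    b[:n, :] = exp_cov(d_new, sill, corr_range)
    weights = np.linalg.solve(A, b)[:n, :]   # drop the Lagrange multiplier row
    return weights.T @ values

# Synthetic stand-in for the field plots (purely illustrative).
rng = np.random.default_rng(0)
n = 60
coords = rng.uniform(0, 100, size=(n, 2))
covariates = rng.normal(size=(n, 2))
X = np.column_stack([np.ones(n), covariates])
y = X @ np.array([2.0, 1.0, -0.5]) + np.sin(coords[:, 0] / 15.0)

# Step 1: regression trend. Step 2: krige the residuals. Step 3: add them.
beta = fit_ols(X, y)
resid = y - X @ beta

targets = rng.uniform(0, 100, size=(5, 2))
X_new = np.column_stack([np.ones(5), rng.normal(size=(5, 2))])
prediction = X_new @ beta + ordinary_krige(coords, resid, targets)
print(prediction)
```

With no nugget the kriged residual surface passes through the observed residuals, so the combined prediction reproduces the data at the sample plots; with a nugget it would smooth instead.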
My worry, however, is that when I examine the residuals from my multiple
linear regression, I find that the plot of the residuals (y axis) vs. the
fitted values (x axis) indicates heteroscedasticity: the residuals are
tightly concentrated around 0 at low fitted values and spread out as the
fitted values increase (a megaphone shape). They are normally distributed
around 0, however, and do
not show any spatial pattern.
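
The visual megaphone impression can be backed by a number. What follows is a minimal sketch, assuming synthetic data whose noise grows with the predictor, of a Breusch-Pagan-style check in Koenker's n * R^2 form: regress the squared residuals on the fitted values and compare n * R^2 with the chi-square critical value for 1 degree of freedom (3.84 at the 5% level). The helper names (ols, bp_statistic) are hypothetical, and using the fitted values as the only auxiliary regressor is a simplification chosen to mirror the residual-vs-fitted plot.

```python
import numpy as np

def ols(X, y):
    """Least-squares coefficients for y ~ X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def bp_statistic(fitted, resid):
    """n * R^2 from regressing squared residuals on the fitted values."""
    n = len(resid)
    Z = np.column_stack([np.ones(n), fitted])
    u = resid ** 2
    u_hat = Z @ ols(Z, u)
    r2 = 1.0 - np.sum((u - u_hat) ** 2) / np.sum((u - u.mean()) ** 2)
    return n * r2

# Synthetic megaphone: the noise standard deviation grows with x.
rng = np.random.default_rng(42)
n = 700
x = rng.uniform(0, 10, n)
y = 1.0 + 0.5 * x + rng.normal(0, 0.5 + 0.3 * x)

X = np.column_stack([np.ones(n), x])
beta = ols(X, y)
fitted = X @ beta
resid = y - fitted

lm = bp_statistic(fitted, resid)
print("LM statistic:", lm, "(5% critical value, 1 df: 3.84)")
```

A statistic far above 3.84 would say the spread of the residuals really does depend on the fitted value, rather than being an artifact of how the plot is read.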
I have transformed the heck out of everything, and I have tried in a rather
clumsy way to implement weighted least squares regression (in Minitab--the
online help is very weak on this!), with poor results (the residual plot
remains very much the same). I also removed outliers to the point where I
felt a little guilty, but without much impact (although Minitab still tells
me that there are lots of "unusual observations").
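
For reference, here is a minimal sketch of one common way to build the weights (feasible weighted least squares), assuming synthetic data and assuming the residual spread is roughly linear in the fitted value. The helpers (wls, spread_ratio) are hypothetical names for this illustration and say nothing about how Minitab does it internally. One point worth noting: the raw residuals from a weighted fit still fan out when plotted against the fitted values, because weighting does not change the noise in the data; it is the weighted residuals (raw residual times the square root of the weight) that should look homogeneous.

```python
import numpy as np

def wls(X, y, w):
    """Weighted least squares by rescaling each row with sqrt(weight)."""
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta

def spread_ratio(fitted, resid):
    """Std of residuals in the upper half of fitted values over the lower half."""
    upper = fitted > np.median(fitted)
    return resid[upper].std() / resid[~upper].std()

# Synthetic megaphone data (purely illustrative).
rng = np.random.default_rng(7)
n = 700
x = rng.uniform(0, 10, n)
y = 1.0 + 0.5 * x + rng.normal(0, 0.5 + 0.3 * x)
X = np.column_stack([np.ones(n), x])

# Step 1: ordinary least squares (all weights equal).
beta_ols = wls(X, y, np.ones(n))
fitted = X @ beta_ols
resid = y - fitted

# Step 2: model the residual spread as linear in the fitted value.
# |residual| is proportional to the standard deviation; the constant
# factor does not matter because weights are only defined up to scale.
Z = np.column_stack([np.ones(n), fitted])
gamma = wls(Z, np.abs(resid), np.ones(n))
sd_hat = np.clip(Z @ gamma, 0.05, None)   # floor keeps the weights finite

# Step 3: refit with weights proportional to 1 / variance.
w = 1.0 / sd_hat ** 2
beta_wls = wls(X, y, w)
raw_resid = y - X @ beta_wls
weighted_resid = raw_resid * np.sqrt(w)

raw_ratio = spread_ratio(fitted, raw_resid)
weighted_ratio = spread_ratio(fitted, weighted_resid)
print("raw residual spread ratio:     ", raw_ratio)
print("weighted residual spread ratio:", weighted_ratio)
```

A raw ratio well above 1 alongside a weighted ratio near 1 would mean the weights are doing their job even though the ordinary residual plot looks unchanged.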
One clue: the dependent variable has a lot of zero values: about 200 of
the 700 plots have a measurement of "0 spruce-fir". I removed the zero
values, then ran the regression on the remaining
values, yielding a nearly normal distribution, but the residual plot did not
change much (still the megaphone shape). I looked at scatterplots of all of
the independents vs. the dependent, and saw a little evidence of nonconstant
variance in the x across values of y, but it didn't seem dramatic. I also
plotted the absolute values of the residuals vs. the independents, and
didn't see any strong relationship in terms of non-homogeneous variance.
The correlation coefficients of the independents vs. the dependent are all
between .3 and .5; the scatterplots are pretty fat...
My next course of action is to try doing a principal components analysis on
the independent variables, and using a PC in the regression analysis. I was
also going to look into some sort of nonparametric regression. I'd really
like to just stick with the model I came up with, however.
Does anyone have any good ideas? Should I worry about the
heteroscedasticity of the data, given my goals? From what I read, it seems
like one mostly worries about heteroscedasticity when considering confidence
intervals. However, I'd like my predictions not to be biased.
Sorry if this is an inane question!
I'll post any responses...