Will's questions/comments on GLM & transformations
Hello All. Some interesting dialogues as of late... I wanted to touch on just
one for now.
Will, regarding your questions/comments on the GLM and data transformations, I
have a few thoughts.
Will stated, "There is also the general issue of whether log transformation,
the standard default for frequencies in Genmod when a relative risk is the
appropriate outcome statistic for the design (as opposed to the odds ratio in
case-control designs), is appropriate for frequencies. I have my doubts. Root
transformation is theoretically better, surely? Yet everyone who uses complex
modeling of frequencies in stats packages uses log transformation."
I don't see how this last statement could be true given the variety of
statistical methods and packages available. My understanding of the GLM is
that several choices for the link function g(μ) (including the
aforementioned logarithmic link, log(μ)) are acceptable (see Upton & Cook).
However, one's choice of function will affect other considerations.
For example, the logarithmic link is the canonical link for the Poisson
distribution, and Poisson variables have equal means and variances (see
Evans et al.).
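To make the "equal means and variances" property concrete, here is a minimal sketch that computes the mean and variance of a Poisson variable directly from its probability mass function. The helper names (poisson_pmf, poisson_mean_var) and the rate values are my own illustrative choices, not anything taken from Genmod or the references above.

```python
import math

def poisson_pmf(lam, k_max):
    """Return [P(Y=0), ..., P(Y=k_max)] for Y ~ Poisson(lam)."""
    probs = [math.exp(-lam)]
    for k in range(1, k_max + 1):
        # Recurrence: P(Y=k) = P(Y=k-1) * lam / k
        probs.append(probs[-1] * lam / k)
    return probs

def poisson_mean_var(lam):
    """Mean and variance of Poisson(lam), summed over a wide truncated range."""
    k_max = int(lam + 20 * math.sqrt(lam) + 50)  # far into the upper tail
    probs = poisson_pmf(lam, k_max)
    mean = sum(k * p for k, p in enumerate(probs))
    var = sum((k - mean) ** 2 * p for k, p in enumerate(probs))
    return mean, var

for lam in (0.5, 4.0, 25.0):  # hypothetical rates
    mean, var = poisson_mean_var(lam)
    print(f"lambda={lam:5.1f}  mean={mean:.4f}  variance={var:.4f}")
```

In each case the mean and the variance both come out equal to the rate, which is the property that real count data often violate (overdispersion).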
As for the comment that "Root transformation is theoretically better,
surely?", it seems that no transformation is generally more or less
effective than another; rather, the right choice depends on the particular
attribute of the data in question. For example, it is not enough to say that
a dataset is heterogeneous with regard to variance; one must ask how so.
Rafter et al. (2003) provide the following accessible guidelines:
- use a square root transformation when the variance is proportional to the
  mean (large mean w/ large variance);
- use a logarithm transformation when the standard deviation is proportional
  to the mean (coefficient of variation is constant);
- use a reciprocal transformation when the standard deviation is proportional
  to the square of the mean;
- use a square transformation when the standard deviation is inversely
  proportional to the mean (large mean w/ small standard deviation).
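The first guideline can be checked numerically for the Poisson case, where the variance equals the mean. The sketch below (helper names and rate values are my own assumptions) computes the variance of Y and of sqrt(Y) exactly from the Poisson probabilities. The delta-method approximation Var(g(Y)) ≈ g'(μ)^2 Var(Y) suggests that the variance of sqrt(Y) should sit near 1/4 regardless of the mean.

```python
import math

def poisson_pmf(lam, k_max):
    """Return [P(Y=0), ..., P(Y=k_max)] for Y ~ Poisson(lam)."""
    probs = [math.exp(-lam)]
    for k in range(1, k_max + 1):
        probs.append(probs[-1] * lam / k)
    return probs

def variance_of(transform, lam):
    """Variance of transform(Y) for Y ~ Poisson(lam), from the pmf."""
    k_max = int(lam + 20 * math.sqrt(lam) + 50)  # far into the upper tail
    probs = poisson_pmf(lam, k_max)
    mean = sum(transform(k) * p for k, p in enumerate(probs))
    return sum((transform(k) - mean) ** 2 * p for k, p in enumerate(probs))

for lam in (10.0, 40.0, 160.0):  # hypothetical means
    raw = variance_of(lambda y: y, lam)   # grows with the mean
    root = variance_of(math.sqrt, lam)    # stays roughly constant
    print(f"mean={lam:6.1f}  var(Y)={raw:8.3f}  var(sqrt(Y))={root:.4f}")
```

The raw variance grows sixteen-fold across these means while the variance on the square-root scale barely moves, which is what "variance stabilizing" means here. This says nothing about whether the root scale gives the effect statistic you actually want to report.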
Similarly, when the Normality assumption is in question and the problem seems
to be a matter of skewness, log transforms are supposed to be better for
positively skewed data and square transforms are more likely to improve
negatively skewed data.
Will stated, "I've known for a long time that back transformation of a mean
does not in general result in the same mean as that of the original raw
variable. I came to terms with this apparent conflict when I realized that the
back-transformed mean of a transformed variable is a kind of superduper or
parametric median, and that therefore there is no need in general to adjust
it. You just treat it as the best measure of centrality of the data, and yes,
it is a median. Why? Because the transformation is intended to make the
distribution of raw values symmetrical, or even normal (Gaussian). The mean is
therefore the median, but as I say, it's a median that uses all the data fully
or parametrically. Back transform and you still have the median. Does this
interpretation make sense?"
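A toy illustration of the back-transformation described in the quoted passage may be useful. The numbers below are invented for the purpose and are not anyone's data; they are chosen so that the logged values are exactly symmetric.

```python
import math
import statistics

# Hypothetical right-skewed values whose logarithms are evenly spaced.
values = [1.0, 10.0, 100.0, 1000.0, 10000.0]

logs = [math.log(v) for v in values]
back_transformed = math.exp(statistics.mean(logs))  # this is the geometric mean
arithmetic = statistics.mean(values)
median = statistics.median(values)

print("back-transformed mean of logs:", back_transformed)
print("sample median:                ", median)
print("arithmetic mean:              ", arithmetic)
```

Here the back-transformed mean agrees with the median (up to floating-point rounding) and sits far below the arithmetic mean. The agreement depends on the logged values being symmetric; with real data it would hold only approximately, if at all.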
No, some of this does not make sense (e.g., '...parametric median...uses all
the data fully or parametrically...'), but...I'm not sure that it matters.
For example, if your goal is to provide *a* "description" of the data (e.g.,
an estimate of the center of the sample distribution) then any of the
statistics that you mentioned (various means, medians, etc.) should accomplish
that task. Anyway, if you like the properties of this "median" estimator
('the best measure of centrality of the data') then why not use it? [Possible
arguments for *particular* estimators of location parameters are too numerous
Potentially, I think that the bigger issue will be how to develop or construct
population "inferences" based on the data. For example, how much do we really
know about Poisson variables? Which properties are preserved following
transformation and which are not? Given all the transforming and back
transforming that seems to be going on, this is shaping into a rather slippery
Will stated, "There may be situations when you want to use the true mean
rather than the back-transformed mean, to work out costs, for example. I'm not
sure whether adjusting the back-transformed mean in some manner, as suggested
in the paper, can achieve that when you control for other effects in the
model. The controlling works on the transformed variable. Does that mean you
can adjust the controlled back-transformed mean to make it more like a raw
mean? I don't know."
I can't/shouldn't comment on methods of "adjustment" as I haven't read the
paper. I would suggest having a closer look at the code for Proc Genmod...I'm
not sure how this "controlling" is accomplished.
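For what it is worth, one textbook adjustment for the log case can be sketched as follows. This is not necessarily the method in the paper (which I have not read), and it does not address covariates. It assumes the logged values are Normal with mean mu and variance sigma^2, in which case the raw-scale mean is exp(mu + sigma^2/2) rather than exp(mu). The parameter values and seed are arbitrary.

```python
import math
import random
import statistics

random.seed(0)
MU, SIGMA = 2.0, 1.0  # hypothetical log-scale mean and SD

# Simulate lognormal data: the log of each value is Normal(MU, SIGMA).
raw = [math.exp(random.gauss(MU, SIGMA)) for _ in range(50_000)]
logs = [math.log(y) for y in raw]

mean_log = statistics.fmean(logs)
var_log = statistics.variance(logs)

raw_mean = statistics.fmean(raw)                 # the "true mean" wanted for costs
back_transformed = math.exp(mean_log)            # estimates the median, exp(MU)
adjusted = math.exp(mean_log + var_log / 2)      # lognormal mean correction

print(f"raw mean:              {raw_mean:.3f}")
print(f"back-transformed mean: {back_transformed:.3f}")
print(f"adjusted estimate:     {adjusted:.3f}")
```

Under these assumptions the adjusted value lands close to the raw mean while the plain back-transformed mean falls well short. Whether the same correction is legitimate after controlling for other effects in a model is exactly the open question above; in that setting sigma^2 would presumably be the residual variance on the log scale, but that is an assumption on my part.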
Dwight J. Thé
Exercise Science & Science Education