one for now.

Will, regarding your questions/comments on the GLM and data transformations, I

have a few thoughts.

Will stated, "There is also the general issue of whether log transformation,

the standard default for frequencies in Genmod when a relative risk is the

appropriate outcome statistic for the design (as opposed to the odds ratio in

case-control designs), is appropriate for frequencies. I have my doubts. Root

transformation is theoretically better, surely? Yet everyone who uses complex

modeling of frequencies in stats packages uses log transformation."

I don't see how this last statement could be true given the variety of

statistical methods and packages available. My understanding of the GLM is

that several choices for the link function g(u) (including the

aforementioned logarithmic link, log(u)) are acceptable (see Upton & Cook

[2002]). However, one's choice of function will affect other considerations.

For example, the logarithmic link is usually paired with an underlying Poisson

distribution, and variables of this type have equal means and variances (see

Evans et al. [2000]).
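
As a quick check of the equal mean and variance property, here is a minimal Python sketch that sums over the Poisson probability mass function directly. The function names and the truncation point k_max are my own illustrative choices, not anything taken from Genmod.

```python
import math


def poisson_pmf(k, lam):
    """P(X = k) for a Poisson variable with rate lam, computed on the log scale."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))


def poisson_moments(lam, k_max=400):
    """Mean and variance obtained by summing over the pmf, truncated at k_max."""
    probs = [poisson_pmf(k, lam) for k in range(k_max + 1)]
    mean = sum(k * p for k, p in enumerate(probs))
    var = sum((k - mean) ** 2 * p for k, p in enumerate(probs))
    return mean, var


for lam in (0.5, 4.0, 20.0):
    mean, var = poisson_moments(lam)
    print(f"rate={lam}: mean={mean:.4f}, variance={var:.4f}")
```

For each rate the two printed values should agree to within rounding. Real count data often show more variance than this (overdispersion), which is one reason the distributional choice matters.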

As for the comment that "Root transformation is theoretically better,

surely?", it seems that no transformation is generally more or

less effective than another; rather, the choice depends upon the particular

attribute of the data in question. For example, it is not enough to say that

a particular dataset is heterogeneous with regard to variance; one must ask how so.

Rafter et al. (2003) provide the following accessible guidelines:

- use a square root transformation when the variance is proportional to the mean (large mean with large variance);

- use a logarithm transformation when the standard deviation is proportional to the mean (coefficient of variation is constant);

- use a reciprocal transformation when the standard deviation is proportional to the square of the mean;

- use a square transformation when the standard deviation is inversely proportional to the mean (large mean with small standard deviation).
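
The first guideline can be checked on Poisson counts, where the variance equals the mean. By the usual delta-method approximation, Var(g(X)) is roughly g'(mean)^2 * Var(X), so with g = sqrt the variance should settle near 1/4 whatever the mean. The sketch below is only an illustration under that Poisson assumption; the helper names and the truncation point are hypothetical choices of mine.

```python
import math


def poisson_pmf(k, lam):
    """P(X = k) for a Poisson variable with rate lam, computed on the log scale."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))


def variance_after(transform, lam, k_max=600):
    """Variance of transform(X) for X ~ Poisson(lam), summing the pmf up to k_max."""
    probs = [poisson_pmf(k, lam) for k in range(k_max + 1)]
    mean = sum(transform(k) * p for k, p in enumerate(probs))
    return sum((transform(k) - mean) ** 2 * p for k, p in enumerate(probs))


for lam in (5.0, 20.0, 80.0):
    raw = variance_after(lambda k: k, lam)
    root = variance_after(math.sqrt, lam)
    print(f"mean={lam}: raw variance={raw:.2f}, variance of sqrt={root:.3f}")
```

The raw variance grows with the mean while the variance of the square roots stays close to 0.25, more closely so at larger means. The other three guidelines follow from the same approximation with a different g.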

Similarly, when the Normality assumption is in question and the problem seems

to be a matter of skewness, log transforms are supposed to be better for

positively skewed data and square transforms are more likely to improve

negatively skewed data.
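
A small simulation can illustrate the skewness rule of thumb. The two distributions below are assumptions chosen purely for convenience: lognormal draws for the positively skewed case, and square roots of uniform draws for the negatively skewed case (so that squaring returns a symmetric uniform variable). The seed and sample size are arbitrary.

```python
import math
import random


def skewness(values):
    """Moment-based sample skewness: m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 20000

# Positively skewed: lognormal draws (illustrative parameters).
pos = [rng.lognormvariate(0.0, 0.75) for _ in range(n)]
# Negatively skewed: square roots of uniform draws, so squaring recovers a uniform.
neg = [math.sqrt(rng.random()) for _ in range(n)]

log_pos = [math.log(v) for v in pos]
sq_neg = [v * v for v in neg]

print(f"positive skew: raw={skewness(pos):.2f}, after log={skewness(log_pos):.2f}")
print(f"negative skew: raw={skewness(neg):.2f}, after square={skewness(sq_neg):.2f}")
```

In both cases the transformed skewness should land near zero. Real data will rarely be this cooperative, so the rule is a starting point rather than a guarantee.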

Will stated, "I've known for a long time that back transformation of a mean

does not in general result in the same mean as that of the original raw

variable. I came to terms with this apparent conflict when I realized that the

back-transformed mean of a transformed variable is a kind of superduper or

parametric median, and that therefore there is no need in general to adjust

it. You just treat it as the best measure of centrality of the data, and yes,

it is a median. Why? Because the transformation is intended to make the

distribution of raw values symmetrical, or even normal (Gaussian). The mean is

therefore the median, but as I say, it's a median that uses all the data fully

or parametrically. Back transform and you still have the median. Does this

interpretation make sense?"

No, some of this does not make sense (e.g., '...parametric median...uses all

the data fully or parametrically...'), but...I'm not sure that it matters.

For example, if your goal is to provide *a* "description" of the data (e.g.,

an estimate of the center of the sample distribution) then any of the

statistics that you mentioned (various means, medians, etc.) should accomplish

that task. Anyway, if you like the properties of this "median" estimator

('the best measure of centrality of the data'), then why not use it? [Possible

arguments for *particular* estimators of location parameters are too numerous

to mention.]
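
For the log transformation specifically, how the back-transformed mean relates to the median and the raw mean can be checked numerically. The sketch below assumes lognormal data (so the logged values are symmetric), with parameters, seed, and sample size picked arbitrarily. It says nothing about root transformations or about means adjusted for other effects in a model.

```python
import math
import random
import statistics

rng = random.Random(1)  # fixed seed so the run is repeatable
raw = [rng.lognormvariate(1.0, 0.8) for _ in range(20000)]

arithmetic_mean = statistics.fmean(raw)
median = statistics.median(raw)

# Mean on the log scale, then back-transformed (this is the geometric mean).
logged = [math.log(v) for v in raw]
back_transformed = math.exp(statistics.fmean(logged))

print(f"arithmetic mean       = {arithmetic_mean:.3f}")
print(f"sample median         = {median:.3f}")
print(f"back-transformed mean = {back_transformed:.3f}")
```

Under these assumptions the back-transformed mean sits close to the sample median and below the arithmetic mean. If the logged values were not symmetric, the agreement with the median would not be expected to hold.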

Potentially, I think that the bigger issue will be how to develop or construct

population "inferences" based on the data. For example, how much do we really

know about Poisson variables? Which properties are preserved following

transformation and which are not? Given all the transforming and back

transforming that seems to be going on, this is shaping into a rather slippery

case indeed.

Will stated, "There may be situations when you want to use the true mean

rather than the back-transformed mean, to work out costs, for example. I'm not

sure whether adjusting the back-transformed mean in some manner, as suggested

in the paper, can achieve that when you control for other effects in the

model. The controlling works on the transformed variable. Does that mean you

can adjust the controlled back-transformed mean to make it more like a raw

mean? I don't know."

I can't/shouldn't comment on methods of "adjustment" as I haven't read the

paper. I would suggest having a closer look at the code for Proc Genmod...I'm

not sure how this "controlling" is accomplished.

Kind regards,

Dwight

Dwight J. Thé

Exercise Science & Science Education

Syracuse University