
Will's questions/comments on GLM & transformations

  • dthe
    Message 1 of 1 , Jun 30, 2004
      Hello All. Some interesting dialogues as of late...wanted to touch on just
      one for now.

      Will, regarding your questions/comments on the GLM and data transformations, I
      have a few thoughts.

      Will stated, "There is also the general issue of whether log transformation,
      the standard default for frequencies in Genmod when a relative risk is the
      appropriate outcome statistic for the design (as opposed to the odds ratio in
      case-control designs), is appropriate for frequencies. I have my doubts. Root
      transformation is theoretically better, surely? Yet everyone who uses complex
      modeling of frequencies in stats packages uses log transformation."

      I don't see how this last statement could be true given the variety of
      statistical methods and packages available. My understanding of the GLM is
      that several choices for the link function g(μ), including the
      aforementioned logarithmic link log(μ), are acceptable (see Upton & Cook
      [2002]). However, one's choice of function will affect other considerations.
      For example, the logarithmic link is the canonical link for the Poisson
      distribution, and Poisson variables have equal means and variances (see
      Evans et al. [2000]).
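
      The equal mean and variance property is easy to check by simulation. The
      following is only a minimal sketch using the Python standard library; the
      function names, the sample size, and the seed are my own assumptions, and
      the sampler is Knuth's textbook multiplication method rather than anything
      taken from Genmod.

```python
import math
import random
from statistics import fmean, pvariance


def poisson_draw(rng, lam):
    """Draw one Poisson(lam) value using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    # Multiply uniforms until the running product drops to exp(-lam) or below.
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def poisson_mean_and_variance(lam, n=20000, seed=0):
    """Simulate n Poisson(lam) counts and return (sample mean, sample variance)."""
    rng = random.Random(seed)
    counts = [poisson_draw(rng, lam) for _ in range(n)]
    return fmean(counts), pvariance(counts)


if __name__ == "__main__":
    for lam in (1.0, 4.0, 9.0):
        mean, var = poisson_mean_and_variance(lam)
        print(f"lambda={lam}: mean={mean:.2f}, variance={var:.2f}")
```

      For each rate the two printed numbers should be close to each other and to
      lambda. Real frequency data often show more variance than this, which is one
      reason the choice of distribution deserves as much attention as the link.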

      As for the comment that "Root transformation is theoretically better,
      surely?," it seems that no transformation is generally more or less
      effective than another; rather, the choice depends upon the particular
      attribute of the data in question. For example, it is not enough to say that
      a particular dataset is heterogeneous with regard to variance; one has to ask
      in what way. Rafter et al. (2003) provide the following accessible guidelines:

      • use a square root transformation when the variance is proportional to the
      mean (large mean with large variance);

      • use a logarithm transformation when the standard deviation is proportional
      to the mean (coefficient of variation is constant);

      • use a reciprocal transformation when the standard deviation is proportional
      to the square of the mean;

      • use a square transformation when the standard deviation is inversely
      proportional to the mean (large mean with small standard deviation).
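
      The four guidelines above share a pattern: if the standard deviation scales
      roughly as mean**b, then b = 0.5, 1, 2, and -1 correspond to the square root,
      logarithm, reciprocal, and square transformations respectively. A rough way
      to see which case applies is to regress log(sd) on log(mean) across groups.
      The sketch below does that; the function names and the tolerance of 0.25 are
      illustrative assumptions of mine, not part of Rafter et al.'s guidelines.

```python
import math


def sd_mean_exponent(groups):
    """Least-squares slope b of log(sd) on log(mean), i.e. sd ~ mean**b."""
    xs, ys = [], []
    for values in groups:
        n = len(values)
        mean = sum(values) / n
        sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
        xs.append(math.log(mean))
        ys.append(math.log(sd))
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx


def suggest_transformation(b, tolerance=0.25):
    """Map the exponent b to the nearest of the four guidelines, if any is close."""
    guidelines = {
        0.5: "square root",   # variance proportional to the mean
        1.0: "logarithm",     # sd proportional to the mean
        2.0: "reciprocal",    # sd proportional to the squared mean
        -1.0: "square",       # sd inversely proportional to the mean
    }
    nearest = min(guidelines, key=lambda target: abs(target - b))
    if abs(nearest - b) <= tolerance:
        return guidelines[nearest]
    return "none of the four guidelines"


if __name__ == "__main__":
    # Hypothetical groups whose sd is exactly 20% of the mean.
    groups = [[0.8 * m, m, 1.2 * m] for m in (5.0, 10.0, 20.0, 40.0)]
    b = sd_mean_exponent(groups)
    print(f"estimated exponent b = {b:.2f}: {suggest_transformation(b)}")
```

      The groups need positive means and non-zero spread for the logs to exist,
      and with only a handful of groups the estimated slope will be noisy, so this
      is a diagnostic aid rather than a decision rule.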

      Similarly, when the Normality assumption is in question and the problem seems
      to be a matter of skewness, log transforms are supposed to be better for
      positively skewed data and square transforms are more likely to improve
      negatively skewed data.
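
      The skewness rule of thumb can be illustrated on two small made-up samples.
      This is only a sketch: the data are hypothetical, and the skewness measure
      is the plain moment coefficient, which is one of several in use.

```python
import math


def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (population moments)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical samples: one with a long right tail, one with a long left tail.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 5, 8, 13, 40]
left_skewed = [1, 5, 7, 8, 8.5, 9, 9.2, 9.5, 9.7, 10]

logged = [math.log(v) for v in right_skewed]
squared = [v ** 2 for v in left_skewed]

if __name__ == "__main__":
    print(f"right-skewed: raw {skewness(right_skewed):.2f}, "
          f"after log {skewness(logged):.2f}")
    print(f"left-skewed:  raw {skewness(left_skewed):.2f}, "
          f"after square {skewness(squared):.2f}")
```

      In both cases the transformed sample should show skewness closer to zero
      than the raw one, although neither transformation is guaranteed to remove
      the skew completely.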

      Will stated, "I've known for a long time that back transformation of a mean
      does not in general result in the same mean as that of the original raw
      variable. I came to terms with this apparent conflict when I realized that the
      back-transformed mean of a transformed variable is a kind of superduper or
      parametric median, and that therefore there is no need in general to adjust
      it. You just treat it as the best measure of centrality of the data, and yes,
      it is a median. Why? Because the transformation is intended to make the
      distribution of raw values symmetrical, or even normal (Gaussian). The mean is
      therefore the median, but as I say, it's a median that uses all the data fully
      or parametrically. Back transform and you still have the median. Does this
      interpretation make sense?"

      No, some of this does not make sense (e.g., '...parametric median...uses all
      the data fully or parametrically...'), but...I'm not sure that it matters.
      For example, if your goal is to provide *a* "description" of the data (e.g.,
      an estimate of the center of the sample distribution) then any of the
      statistics that you mentioned (various means, medians, etc.) should accomplish
      that task. Anyway, if you like the properties of this "median" estimator
      ('the best measure of centrality of the data') then why not use it? [Possible
      arguments for *particular* estimators of location parameters are too numerous
      to mention.]
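
      For the log transformation specifically, the back-transformed mean is the
      geometric mean of the raw values, and it coincides with the median when the
      logged values are symmetric. The sketch below uses hypothetical data chosen
      so that the logs are exactly symmetric; the function name is my own.

```python
import math
from statistics import fmean, median


def back_transformed_log_mean(values):
    """exp(mean(log y)), which is the geometric mean of the raw values."""
    logs = [math.log(v) for v in values]
    return math.exp(fmean(logs))


# Hypothetical raw data: the logs are symmetric around log(10).
raw = [1.0, 2.0, 10.0, 50.0, 100.0]

if __name__ == "__main__":
    print(f"arithmetic mean:       {fmean(raw):.2f}")
    print(f"median:                {median(raw):.2f}")
    print(f"back-transformed mean: {back_transformed_log_mean(raw):.2f}")
```

      With real data the logs are never exactly symmetric, so the back-transformed
      mean and the sample median will agree only approximately.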

      Potentially, I think that the bigger issue will be how to develop or construct
      population "inferences" based on the data. For example, how much do we really
      know about Poisson variables? Which properties are preserved following
      transformation and which are not? Given all the transforming and back
      transforming that seems to be going on, this is shaping into a rather slippery
      case indeed.

      Will stated, "There may be situations when you want to use the true mean
      rather than the back-transformed mean, to work out costs, for example. I'm not
      sure whether adjusting the back-transformed mean in some manner, as suggested
      in the paper, can achieve that when you control for other effects in the
      model. The controlling works on the transformed variable. Does that mean you
      can adjust the controlled back-transformed mean to make it more like a raw
      mean? I don't know."

      I can't/shouldn't comment on methods of "adjustment" as I haven't read the
      paper. I would suggest having a closer look at the code for Proc Genmod...I'm
      not sure how this "controlling" is accomplished.
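
      Without knowing what the paper proposes, one textbook adjustment can still
      be sketched for the simplest case. If the logged values are normal with
      mean m and variance s2, the mean of the raw values is exp(m + s2/2). The
      code below checks this on simulated log-normal data with no covariates; the
      function names, seed, and parameters are assumptions of mine, and whether
      the same correction is valid for a controlled mean from a model is exactly
      the open question.

```python
import math
import random
from statistics import fmean, variance


def back_transformed_mean(values):
    """exp(mean(log y)): estimates the median of a log-normal variable."""
    return math.exp(fmean(math.log(v) for v in values))


def lognormal_adjusted_mean(values):
    """exp(m + s2/2): estimates the raw mean, assuming log y is normal."""
    logs = [math.log(v) for v in values]
    return math.exp(fmean(logs) + variance(logs) / 2)


def simulate_lognormal(n=20000, seed=1):
    """Simulated raw values whose logs are standard normal."""
    rng = random.Random(seed)
    return [rng.lognormvariate(0.0, 1.0) for _ in range(n)]


if __name__ == "__main__":
    data = simulate_lognormal()
    print(f"raw mean:              {fmean(data):.3f}")
    print(f"back-transformed mean: {back_transformed_mean(data):.3f}")
    print(f"adjusted mean:         {lognormal_adjusted_mean(data):.3f}")
```

      The adjusted value should land near the raw mean while the unadjusted
      back-transformed mean sits well below it. The correction leans entirely on
      the normality of the logged values, so it can mislead when that assumption
      fails.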

      Kind regards,

      Dwight J. Thé
      Exercise Science & Science Education
      Syracuse University