AI-GEOSTATS: SUMMARY: Normal distributions

  • Chaosheng Zhang
    Message 1 of 1 , May 24, 2002
      Dear all,

      One week ago, I asked a question about the requirement of normal
      distribution in statistics. The problem is that in most cases, the data sets
      we are dealing with do not follow the normal distribution. If normality is
      required, data transformation needs to be carried out prior to (parametric)
      statistical analyses. On the other hand, data transformation may cause some
      other problems.

      Thanks to Isobel Clark, Brian Gray and Ruben Roa for their replies and
      comments. Please find the original question and the replies below.

      ---------------------

      Dear (geo_)statisticians in the list:

      I'm quite often confused by the requirement of "normal distribution" in
      (geo)statistics. My question is: when MUST the normal distribution
      requirement be satisfied? Specifically, in which of the following analyses
      MUST the variables follow the normal distribution? If they do not, what
      would happen?

      Uni-variate analyses:
      Outlier detection, Mean calculation, etc.?
      Bi-variate analyses:
      Correlation; Regression; etc.?
      Multi-variate analyses:
      Principal component, Cluster, Regression, Factor,
      Discriminant, etc.?
      Spatial statistics:
      Spatial autocorrelation, Spatial outlier, etc.?
      Geostatistics:
      Variogram, Kriging, Simulation, etc.?

      Regards,

      Chaosheng Zhang
      ---------------------------------

      Short answer is yes to everything.

      middle length answer is that Normality is not required
      for anything except where it is a basic assumption -
      such as in simulation. It is necessary that the
      distribution be well behaved (not skewed) and conform
      to the Central Limit Theorem.

      Having said that, I can't tell you where the join is
      between 'well behaved' and not. It is usually fairly
      obvious from probability plots and semi-variograms
      when things start to get hairy.

      Isobel Clark

      http://geoecosse.bizland.com/BYOGeostats.htm

      -----------------------------
      Isobel,

      Thanks for the reply. I feel this problem deserves more discussion.

      I have found the message from: Gregoire Dubois, Date: Mon Mar 5, 2001,
      Subject: AI-GEOSTATS: SUMMARY: Nscore transform & kriging of log normal
      data sets (The original author couldn't be found, and s/he should be in the
      list): "Most of geostatistics is "distribution free", i.e., the derivation
      of the simple kriging, ordinary kriging and universal kriging equations do
      not depend on a distributional assumption (contrary to what is sometimes
      claimed)."

      An example may be the "Indicator Kriging": It is impossible for the "0"s and
      "1"s to follow the normal distribution.

      The reason why I care about this issue is that there are at least two
      problems related to data transformation (in order to follow the normal
      distribution):

      (1) The measurement scale is reduced. The original ratio/interval scale may
      be reduced to the lower level of ordinal, even close to nominal, which
      results in loss of raw information.

      (2) Artificial relationship is introduced. We know that the lognormal
      distribution is widely accepted. In correlation analysis, if the
      log-transformed data are used, the correlation becomes the "log-log"
      relationship, not the original linear relationship. In bivariate regression
      analysis, the original function is:
      y = a x + b
      However, for the log-transformed data, the function becomes:
      log(y) = a log(x) + b
      or y = exp (a log(x) + b)
      In many cases, it is not clear if the relationship should be linear or
      "log-linear". However, the artificially introduced "log-linear" relationship
      needs to be proved.
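
The difference between the two functional forms can be sketched numerically. The example below is only an illustration under stated assumptions: the data are hypothetical and generated exactly from a power law, and `fit_line` and `predict_loglog` are helper names invented here. It shows that a straight-line fit on the log-log scale corresponds to y = exp(b) * x^a on the original scale, which is a different model from y = a x + b.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = my - a * mx
    return a, b

# Hypothetical data generated exactly from the power law y = 2 * x**1.5
xs = [1.0, 2.0, 4.0, 8.0, 16.0]
ys = [2.0 * x ** 1.5 for x in xs]

# Fit on the raw scale: y = a*x + b
a_lin, b_lin = fit_line(xs, ys)

# Fit on the log scale: log(y) = a*log(x) + b
a_log, b_log = fit_line([math.log(x) for x in xs],
                        [math.log(y) for y in ys])

def predict_loglog(x):
    """Back-transformed log-log model: y = exp(b) * x**a (a power law)."""
    return math.exp(b_log) * x ** a_log

print("linear fit:  a =", a_lin, " b =", b_lin)
print("log-log fit: a =", a_log, " exp(b) =", math.exp(b_log))
```

Which of the two forms is appropriate still has to be argued from the process that generated the data; the fit alone does not decide it.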

      The most difficult situation is that if scientifically the relationship
      between x and y is linear, should the data transformation still be carried
      out (just to satisfy the statistical requirement)?

      Cheers,

      Chaosheng

      ---------------------
      nice point. classical statisticians are slowly eating away at how to
      estimate variance structures under a marginal binary assumption using
      relatively simple generalized mixed models. no, I don't know how they will
      handle (if ever) the problem associated with possible underestimation of
      spatial variance components after--or at the same time
      as/iteratively--estimating mean components. and, of course, the focus is
      typically estimation rather than prediction. regardless, the variance
      component estimation question *is* approached under a marginal binary
      assumption--and spatial trend in the mean or equivalent is typically the
      primary part of such models. of course, with trend in the mean we face
      another interesting problem--namely, that the spatial variance components
      under a marginal binary assumption are a function of the mean/go to zero as
      the mean goes to 0 or 1. cheers, brian
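
The dependence of the variance on the mean for 0/1 data can be illustrated with a small sketch. This covers only the marginal (Bernoulli) variance p(1-p), not the spatial variance components of a generalized mixed model; the data, the cutoff and the function names `indicator` and `bernoulli_variance` are assumptions made for the example.

```python
def indicator(values, cutoff):
    """Code each value as 1 if it is at or below the cutoff, else 0."""
    return [1 if v <= cutoff else 0 for v in values]

def bernoulli_variance(p):
    """Marginal variance of a 0/1 variable whose mean is p."""
    return p * (1.0 - p)

# Hypothetical measurements and cutoff
z = [0.3, 1.7, 0.9, 4.2, 2.5, 0.1]
ind = indicator(z, 1.0)

# Mean of the indicators = proportion of values at or below the cutoff
p_hat = sum(ind) / len(ind)

# Population variance of the indicators, computed directly
pop_var = sum((i - p_hat) ** 2 for i in ind) / len(ind)

print("indicators:", ind)
print("mean:", p_hat, " variance:", pop_var)

# The variance shrinks toward zero as the mean approaches 0 or 1
for p in (0.01, 0.1, 0.5, 0.9, 0.99):
    print(p, bernoulli_variance(p))
```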

      ****************************************************************
      Brian Gray
      USGS Upper Midwest Environmental Sciences Center
      575 Lester Avenue, Onalaska, WI 54650
      ph 608-783-7550 ext 19, FAX 608-783-8058
      brgray@...
      *****************************************************************

      Hi Chaosheng:

      A few points about log transforms. See below.

      > The reason why I care about this issue is that there are at least two
      > problems related to data transformation (in order to follow the normal
      > distribution):
      >
      > (1) The measurement scale is reduced. The original ratio/interval scale may
      > be reduced to the lower level of ordinal, even close to nominal, which
      > results in loss of raw information.

      There shouldn't be any loss of information since the log transformation
      is a one-to-one mapping. The sample variance is much smaller in the log
      scale but the log transform is often used precisely for that purpose.
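
A quick way to check the one-to-one argument is to transform and back-transform a data set. This is a minimal sketch with hypothetical positive, skewed values; the variance comparison holds for this data set and is not a general guarantee.

```python
import math
import statistics

# Hypothetical positive, right-skewed data
values = [0.5, 1.2, 3.4, 10.0, 250.0]

logged = [math.log(v) for v in values]
recovered = [math.exp(t) for t in logged]

# The round trip recovers the original values up to floating-point error
max_err = max(abs(r - v) for r, v in zip(recovered, values))

# The ranking of the observations is unchanged by the transform
rank_raw = sorted(range(len(values)), key=values.__getitem__)
rank_log = sorted(range(len(logged)), key=logged.__getitem__)
same_order = rank_raw == rank_log

# Sample variances on the two scales
var_raw = statistics.variance(values)
var_log = statistics.variance(logged)

print("max round-trip error:", max_err)
print("order preserved:", same_order)
print("variance raw:", var_raw, " variance log:", var_log)
```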

      > (2) Artificial relationship is introduced. We know that the lognormal
      > distribution is widely accepted. In correlation analysis, if the
      > log-transformed data are used, the correlation becomes the "log-log"
      > relationship, not the original linear relationship. In bivariate regression
      > analysis, the original function is:
      > y = a x + b
      > However, for the log-transformed data, the function becomes:
      > log(y) = a log(x) + b
      > or y = exp (a log(x) + b)
      > In many cases, it is not clear if the relationship should be linear or
      > "log-linear". However, the artificially introduced "log-linear"
      > relationship needs to be proved.

      Rather, when you apply the log transform you a-priori assume the
      existence of what you call the 'artificial relation', and this
      assumption refers to the algebraic form of the error term rather than to
      the relation between y and x. Say E(y)=f(x) is the model for y versus x,
      where E is the expectation operator. If you assume an additive error
      structure y_i=f(x_i)+e_i, where i indexes observation, and you consider
      the e_i's as iid normal random variates, then there is no reason to
      apply the log transform. On the other hand if you assume a multiplicative
      error structure such as y_i=f(x_i)*e_i, and you assume that the e_i are
      iid lognormal random variates, then the log transform yields
      ln(y_i)=ln(f(x_i))+ln(e_i) and now the ln(e_i) are iid normal random
      variates (with a much smaller variance than the e_i). The theory of the
      lognormal is well developed so that there isn't actually much need to
      transform the data to make it normal (e.g. Crow and Shimizu, 1988,
      Lognormal distributions, Dekker Inc, NY).
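
The multiplicative case can be sketched by simulation. Everything specific here is an assumption for illustration: the model `f`, the value of `sigma`, the range of x and the sample size. The point is only that ln(y_i) - ln(f(x_i)) equals ln(e_i), which is normal when the e_i are lognormal.

```python
import math
import random
import statistics

random.seed(0)

def f(x):
    """Hypothetical model curve (a power law) used only for illustration."""
    return 3.0 * x ** 0.5

sigma = 0.4  # assumed standard deviation of ln(e_i)
xs = [random.uniform(1.0, 100.0) for _ in range(5000)]

# lognormvariate(mu, sigma) returns exp(N(mu, sigma^2)); with mu = 0 the
# median of e_i is 1
errors = [random.lognormvariate(0.0, sigma) for _ in xs]

# Multiplicative error structure: y_i = f(x_i) * e_i
ys = [f(x) * e for x, e in zip(xs, errors)]

# On the log scale the error becomes additive: ln(y_i) - ln(f(x_i)) = ln(e_i)
log_resid = [math.log(y) - math.log(f(x)) for x, y in zip(xs, ys)]

mean_lr = statistics.mean(log_resid)
sd_lr = statistics.stdev(log_resid)

print("mean of log residuals:", mean_lr)  # close to 0
print("sd of log residuals:", sd_lr)      # close to sigma
```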

      > The most difficult situation is that if scientifically the relationship
      > between x and y is linear, should the data transformation still be carried
      > out (just to satisfy the statistical requirement)?

      When the relation between x and y is linear such as in E(y)=a*x+b, the
      errors are assumed to be additive (people usually do not believe that
      y_i=(a*x_i+b)*e_i but rather y_i=(a*x_i+b)+e_i), so applying a log
      transform to such a case does not satisfy statistical requirements. On the
      contrary, it goes against statistical advice.
      The multiplicative error structure arises in models of the form
      E(y)=a*x^b or E(y)=a*exp(b*x), or in general, in all multiplicative
      processes.
      As there is a central limit theorem for additive processes leading to
      the normal, there is a central limit theorem for multiplicative
      processes leading to the lognormal.
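
The multiplicative version of the central limit theorem can be checked with a rough simulation. The choice of uniform factors on (0.5, 1.5), 50 factors per product and 4000 replicates is arbitrary, and `product_of_factors` and `skewness` are names made up for this sketch. The products are strongly right-skewed, while their logarithms (sums of iid terms) are close to symmetric.

```python
import math
import random
import statistics

random.seed(1)

def product_of_factors(k):
    """Product of k independent positive random factors."""
    p = 1.0
    for _ in range(k):
        p *= random.uniform(0.5, 1.5)
    return p

def skewness(data):
    """Moment-based sample skewness (third standardized moment)."""
    m = statistics.mean(data)
    s = statistics.pstdev(data)
    return sum(((d - m) / s) ** 3 for d in data) / len(data)

products = [product_of_factors(50) for _ in range(4000)]
logs = [math.log(p) for p in products]

skew_raw = skewness(products)  # strongly positive: long right tail
skew_log = skewness(logs)      # near zero: roughly symmetric

print("skewness of products:", skew_raw)
print("skewness of log(products):", skew_log)
```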

      In the case of geostatistics, as pointed out by other people, the
      kriging equations do not require distributional assumptions (though the
      fitting of the model variogram to the moment-based Matheron variogram
      does). If the frequency distribution of the regionalised variable looks
      lognormal, it means that there is an underlying mechanism which is
      multiplicative, but still I don't see why the variable should be
      transformed for geostatistical analysis, except perhaps for fitting the
      variogram.
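
For reference, the moment-based (Matheron) estimator mentioned above is gamma(h) = sum of (z(x_i) - z(x_i + h))^2 over the N(h) pairs at lag h, divided by 2 N(h). Below is a minimal sketch for regularly spaced one-dimensional data; the transect values and the function name `matheron_variogram` are hypothetical, and real data would need lag binning and tolerances in two or three dimensions.

```python
def matheron_variogram(z, max_lag):
    """Empirical semivariogram for regularly spaced 1-D data.

    Returns a dict mapping lag h (in sample spacings) to
    gamma(h) = sum((z[i] - z[i + h])**2) / (2 * N(h)).
    """
    gamma = {}
    for h in range(1, max_lag + 1):
        pairs = [(z[i], z[i + h]) for i in range(len(z) - h)]
        gamma[h] = sum((a - b) ** 2 for a, b in pairs) / (2 * len(pairs))
    return gamma

# Hypothetical transect of six equally spaced measurements
z = [1.0, 2.0, 4.0, 3.0, 5.0, 7.0]
gamma = matheron_variogram(z, 2)

for h, g in gamma.items():
    print("lag", h, "gamma", g)
```

Because the estimator averages squared differences, a few extreme values in a skewed data set can dominate it, which is one reason a transform is sometimes considered for the variogram step only.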

      Cheers
      Ruben

      =================================================
      Dr. Chaosheng Zhang
      Lecturer in GIS
      Department of Geography
      National University of Ireland
      Galway
      IRELAND

      Tel: +353-91-524411 ext. 2375
      Fax: +353-91-525700
      Email: Chaosheng.Zhang@...
      ChaoshengZhang@...
      Web: http://www.nuigalway.ie/geography/zhang.html
      =================================================




      --
      * To post a message to the list, send it to ai-geostats@...
      * As a general service to the users, please remember to post a summary of any useful responses to your questions.
      * To unsubscribe, send an email to majordomo@... with no subject and "unsubscribe ai-geostats" followed by "end" on the next line in the message body. DO NOT SEND Subscribe/Unsubscribe requests to the list
      * Support to the list is provided at http://www.ai-geostats.org