
Reply from D. Myers on the analysis of skewed data sets

  • Gregoire Dubois
    Message 1 of 2, Mar 4 3:03 PM
      This is not a complete answer to your question.

      The principal problems with using a non-linear transformation of the data in
      geostatistics are

      1. Analytically determining the back-transform
      It is relatively easy to transform a data set to "normal" by using a
      basic property of the distribution function, in particular of the normal
      distribution function. One uses the empirical distribution function of the data and the theoretical distribution function of the normal.
      Intuitively this corresponds to matching points on the two distribution
      function graphs via the probability value: plot the two on the same graph, and
      for any point on the empirical curve draw the horizontal line and find its
      intersection with the graph of the normal distribution function. This
      uniquely determines a point on the horizontal axis.

      The empirical distribution function is discrete (a step function), but in the
      case of the forward transform this does not cause a problem. The difficulty
      is in the reverse transform: if one uses only the empirical
      distribution function then the inverse value is not uniquely determined,
      i.e., the horizontal line determined by a point on the graph of the normal may not intersect the graph of the empirical at all, or if it does the intersection may be a line segment (depending on the choice of the classes used to construct the empirical distribution function graph).

      One solution, which appears in some of the software, is to first fit an analytic form to the empirical distribution function so that one has a "continuous" curve. The fitting process, however, is not unique, and is especially non-unique at the ends. Does one assume that the maximum data value is really the maximum of the distribution, and if not, what is the choice of the maximum? The same problem arises at the left-hand end.
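The graphical matching described above can be sketched in a few lines of Python. This is only the forward (normal-score) transform; the function name and the (rank + 0.5)/n plotting positions are choices of mine, not from the original message, and the problematic inverse is deliberately not attempted here.

```python
from statistics import NormalDist

def normal_score_transform(data):
    """Forward normal-score transform.

    Each value gets the empirical probability (rank + 0.5) / n, which is
    pushed through the inverse of the standard normal distribution
    function -- the horizontal-line matching described in the text.
    The 0.5 offset avoids probabilities of exactly 0 or 1.
    """
    n = len(data)
    order = sorted(range(n), key=lambda i: data[i])
    nd = NormalDist()  # standard normal
    scores = [0.0] * n
    for rank, i in enumerate(order):
        scores[i] = nd.inv_cdf((rank + 0.5) / n)
    return scores
```

The transform is rank-preserving, so the transformed values have the target "normal shape" no matter how skewed the original data are; all the difficulty sits in going back the other way.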

      The re-transformation problem does not occur in the case of the Indicator
      since then one does not, in fact, re-transform.
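For contrast, the indicator transform is trivial to state (the cutoff value below is illustrative): the kriged indicators estimate Pr(Z <= cutoff) directly, which is exactly why no back-transform arises.

```python
def indicator_transform(data, cutoff):
    """Indicator transform: code each value as 1 if it is at or below
    the cutoff, 0 otherwise.  Kriging the indicators estimates
    Pr(Z <= cutoff) directly, so the estimate is used as-is and no
    re-transformation is needed."""
    return [1 if z <= cutoff else 0 for z in data]
```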

      2. Multivariate vs univariate distributional assumptions
      Most of geostatistics is "distribution free", i.e., the derivation of the
      simple kriging, ordinary kriging and universal kriging equations do not
      depend on a distributional assumption (contrary to what is sometimes claimed). However, if a distributional assumption is to be useful it should be multivariate rather than just univariate. Essentially none of the transformations used in geostatistics can really preserve or produce multivariate distributional properties; they are only univariate transformations. For example, a histogram might appear lognormal and the log-transformed data might then appear normal, but this does not imply anything about multivariate lognormality. One simply has to make the assumption (or not make it and hence not use it).
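A small simulation makes the univariate/multivariate distinction concrete. The construction below is a standard counterexample (the threshold 1.5 and the sample size are arbitrary choices of mine): both marginals are exactly standard normal, yet the pair cannot be bivariate normal because the sum has a point mass at zero.

```python
import random

random.seed(0)
c = 1.5
xs, ys = [], []
for _ in range(10000):
    x = random.gauss(0.0, 1.0)
    # Reflect x outside [-c, c]; by symmetry y is also exactly N(0, 1).
    y = x if abs(x) <= c else -x
    xs.append(x)
    ys.append(y)

# x + y equals 2x when |x| <= c and exactly 0 otherwise, so the sum has
# an atom at zero -- impossible for any nondegenerate normal.  Hence
# (x, y) is not bivariate normal even though each marginal is.
zeros = sum(1 for x, y in zip(xs, ys) if x + y == 0.0)
```

The same logic applies to transformed data: a normal-looking histogram after a log transform says nothing about the joint behaviour of the values at different locations.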

      3. Bias in the kriging estimator and effect on the kriging variance.

      If one is willing to assume multi-variate lognormality (univariate is not
      really sufficient) then the transformation is theoretically known and has a
      unique inverse that is also known. Even in this case there is the problem of a bias in the re-transformed estimates. A number of authors have written on this, Journel and Dowd being two of them (see various papers in Math. Geology). As pointed out in those papers, the correction in the case of Simple Kriging (punctual) is essentially solved, and a good approximation is available in the case of Ordinary Kriging (punctual).
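Under the multivariate lognormal assumption the bias is explicit: if the kriged value Y* in log space is Gaussian with simple-kriging variance sigma_SK^2, the naive back-transform exp(Y*) underestimates the conditional mean, and the corrected estimate is exp(Y* + sigma_SK^2 / 2). A minimal numerical check of the underlying identity (function name and parameter values are illustrative):

```python
import math
import random

def lognormal_backtransform(y_est, sk_var):
    """Bias-corrected back-transform for simple (punctual) lognormal
    kriging: Z* = exp(Y* + sigma_SK^2 / 2)."""
    return math.exp(y_est + 0.5 * sk_var)

# Check the identity E[exp(Y)] = exp(mu + s^2 / 2) by simulation.
random.seed(1)
mu, s, n = 0.0, 1.0, 200000
sim_mean = sum(math.exp(random.gauss(mu, s)) for _ in range(n)) / n
naive = math.exp(mu)                           # biased: drops the variance term
corrected = lognormal_backtransform(mu, s * s)
```

The gap between `naive` and `corrected` is exactly the multiplicative bias the papers cited above set out to remove.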

      There are some theoretical problems in the case of block kriging that are usually handled in an almost ad hoc way; e.g., if the point values are multivariate lognormal then the block values theoretically will not be either univariately or multivariately lognormal. There seems to be little in the literature pertaining to a mixing of lognormality and a non-constant drift (mean). If the non-constant mean is not first removed then the complications resulting from a non-linear transformation are much worse, since the non-constant mean and the mean-zero random component are not separately transformed.

      For other non-linear transforms (other than the log in the case of
      multivariate lognormality), even knowing the inverse transform in analytic form is not sufficient to allow computing the bias adjustment unless
      one also knows the MULTIVARIATE distribution in analytic form. Even then, the actual mechanics of doing so can be very tedious or complicated (Mathematica
      may be of assistance here). That is, while there is a nice theorem on change of variables in a multiple integral, the actual step of applying it to a specific problem can be very tedious and complicated. Moreover the theorem has moderately strong assumptions which are not always satisfied.

      In the case of multivariate lognormality, one can also determine the
      adjustment needed in the kriging variances. This aspect seems to have attracted little attention in the case of other non-linear transforms and it is at least as difficult a problem.

      4. Transforms and variograms/covariances
      Again, in the case of multivariate lognormality, one can compute the relationship between the variogram/covariance of the original and the variogram/covariance of the transformed. This relationship is essentially unknown in all other cases because it again requires knowing the multivariate distribution in analytic form (and being able to carry out certain complicated multiple integrations). The multivariate transform must be known in analytic form and have a unique inverse. There are examples in the literature of using power series approximations for the transformation, but too often the approximation is reduced to a linear one.
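For the lognormal case the relationship just mentioned is explicit: if Z = exp(Y) with Y Gaussian of mean mu and variance s^2, then C_Z(h) = m_Z^2 * (exp(C_Y(h)) - 1), where m_Z = exp(mu + s^2 / 2). A simulation check of that formula (all parameter values are arbitrary choices for the sketch):

```python
import math
import random

def lognormal_cov(c_y, mu=0.0, var_y=1.0):
    """Covariance of Z = exp(Y) implied by the covariance C_Y of Y:
    C_Z = m_Z^2 * (exp(C_Y) - 1), with m_Z = exp(mu + var_y / 2)."""
    m_z = math.exp(mu + 0.5 * var_y)
    return m_z * m_z * (math.exp(c_y) - 1.0)

# Simulate a pair (Y1, Y2) of standard normals with correlation rho and
# compare the sample covariance of (exp(Y1), exp(Y2)) to the formula.
random.seed(2)
rho, n = 0.5, 300000
z1, z2 = [], []
for _ in range(n):
    a, b = random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)
    z1.append(math.exp(a))
    z2.append(math.exp(rho * a + math.sqrt(1.0 - rho * rho) * b))
m1, m2 = sum(z1) / n, sum(z2) / n
sample_cov = sum((u - m1) * (v - m2) for u, v in zip(z1, z2)) / n
```

Note the relationship is nonlinear in C_Y, which is why a variogram fitted to the log data cannot simply be carried over to the original scale, and why no comparably clean formula exists for other transforms.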

      There are two bottom lines perhaps:

      A. Given only a finite amount of data and a lack of a derived theoretical
      model (derived from first principles such as in physics), the problem is always
      ill-defined. One must impose some additional assumptions; the question is, which ones? Which ones are reasonable, which ones are most useful, and which ones are so strong that the conclusions/results depend more on those assumptions than on the data?

      B. Geostatistics has always been strongly associated with applications, hence
      one might then ask: do the methods produce useful results? "Useful" obviously
      has to be interpreted in the context of the problem at hand. Also, obviously, one would like the results to be as strongly theoretically based as possible, so that one has an objective analysis rather than just an "expert opinion" (not to say that expert opinions are not useful, but experts sometimes differ and of course sometimes they are wrong).

      Donald E. Myers
      Department of Mathematics
      University of Arizona
      Tucson, Arizona 85721