AI-GEOSTATS: Summary: Extreme Values?

  • Chaosheng Zhang
    Message 1 of 5 , Dec 22, 2001
      Dear list,

      Happy Christmas! Many thanks to all those who replied to my question about extreme values, especially Isobel Clark, Marcel Vallée, Benjamin Warr, Claudio Cocheo, Martin Roseveare, Pierre Goovaerts, Jeff Myers.

      Until now, this problem has not been satisfactorily solved, but all the comments and suggestions are quite helpful. As mentioned in several replies, this problem is also related to sampling design, Kriging grid system design, and spatial variation.

      Please find the original question and the replies below.

      Cheers,

      Chaosheng Zhang
      ===================================
      Dr. Chaosheng Zhang
      Department of Geography
      National University of Ireland, Galway
      IRELAND

      Tel: +353-91-524411 ext. 2375
      Fax: +353-91-525700
      Email: Chaosheng.Zhang@...
      ===================================

      ===================================================

      Original Question:

      13 December 2001 13:01

      Dear all,

      My question is: How to deal with the extreme/outlying values in a data set?

      I am dealing with heavy metal concentrations in soils from a mine area. There are 223 samples, evenly distributed in space at a sampling interval of 400 metres. Several samples have extremely high values, which makes me feel uncomfortable. The percentiles of the dataset are as follows (in mg/kg):

                Zn      Cu      Pb      Cd      As
      Min        4       1      25     0.0       2
      5%        35       6      35     0.1       6
      10%       40       7      41     0.2       7
      25%       65      13      62     0.3       9
      50%      122      18     168     0.6      15
      75%      338      27     821     1.5      28
      90%      907      56    2799     2.8      58
      95%     1986     116    4490     4.2      80
      96%     2462     151    4698     4.9      82
      97%     3493     178    5413     6.2      91
      98%     4697     207    7609     8.3     111
      99%     6712     247   11750    12.4     184
      Max    11473    1293   16305    48.5    1060

      When doing geostatistical and statistical analyses, we need some confidence in dealing with these very high extreme values, which account for less than 2% of the total sample number.

      Any suggestions?

      Cheers,

      Chaosheng Zhang
      ==================================================

      Replies:

      > My question is: How to deal with the
      > extreme/outlying values in a data set?
      The real priority is to establish why you have extreme
      highs. For example:

      (1) is there a high imprecision in measuring the
      values, so that the sample observations are actually
      inaccurate? If so, is it relative to the value or a
      flat error?

      (2) do you have a skewed distribution of values?

      (3) do you have two (or more) populations, only one of
      which gives the high values?

      and there may be others. Once you determine the reason
      for extreme values, then you can more objectively know
      how to deal with them.

      For example, if you think (2) is most likely then look
      at transformations or distribution-free approaches to
      geostatistics. You can find some of my papers
      dealing with positively skewed distributions at:

      http://uk.geocities.com/drisobelclark/resume/Publications.html

      If (3) is more likely - as may be probable if you are
      looking at an area where samples may be 'background'
      or 'contaminated' - you really need to identify the
      populations first. Then you may be able to apply a
      mixture model together with indicator geostatistical
      approaches.

      If (1) is your problem, then you may be able to use a
      rough non-parametric approach to get to cross
      validation. The 'error statistics' in a cross
      validation exercise will often assist in identifying
      erroneous sample measurements.

      Hope this helps
      Isobel Clark
      ---------------------

      Dear Isobel,

      Thanks for your quick and helpful reply!

      (1) I would like to trust both the accuracy and precision of the dataset,
      and the real problem is how we "play the computer game". The extreme values
      may be from samples which by chance contain many minerals.

      (2) From the percentiles I provided in the message, you can see that the
      dataset is indeed heavily skewed. Logarithmic transformation can make some
      of the variables follow the "normal distribution", but not all. However,
      the extreme values still look extreme in the transformed dataset.

      (3) There may be two populations: "background" and "mineralised". However,
      there is really no way to "dichotomise" the two populations. Geographically
      or mathematically? Geographically, there are three areas of high values.
      Mathematically, we need some proof. Even though we could properly separate
      the datasets into two "populations", the extreme values may still be extreme
      in the "mineralised" population.

      Since the really "bad" values are only <2% of the total number (such as 4 or
      5 values out of the total number of 223, which can also be seen from the
      percentiles), I am unwilling to use nonparametric methods unless we cannot
      find a way to use the parametric methods.

      Another problem is when we carry out spatial interpolation, these values may
      produce artificial contour lines around these sampling locations, even
      though they can be smoothed. I don't think this is the realistic situation
      in the field.

      Well, I am still not very confident what the best way should be ... I know
      the worst way is to discard these "outlying" values, and the second worst
      way is to use non-parametric methods.

      Cheers,

      Chaosheng Zhang

      -----------------------------
      Dear Chaosheng Zhang

      The sampling interval is so wide that the high values could easily be related to "hot spots" of higher grade contamination, i.e. dumping areas for particular kinds of slags, mineralized waste, etc. A property map might help.

      Have you contoured the data? Even if so, the sampling interval is so wide that real hot spots of environmental significance might not show their 2D distribution on such a grid.

      Regards

      Marcel Vallée, Eng., Geo.
      Geoconseil Marcel Vallée Inc.
      706 Routhier Ave
      Québec, Québec G1X 3J9
      Canada
      Tel: (1) 418 652 3497
      Fax: (1) 418 652 9148
      Email: vallee.marcel@...
      --------------------------------------------

      Dear Marcel Vallée,

      Thanks. I think the sampling density is good enough to reveal the spatial
      structure, and the extreme samples are located within the "hot spots". The
      problem is that the few values are still extremely high within the "hot
      spots". This may be what the "nugget effect" means.

      I'm just wondering if these few extreme values should really be "discarded"/
      "censored" or replaced. However, this could get some criticism as they may
      be "real".

      If it is hard to find the best way, I will have to "replace" all the extreme
      values with 99% or 98% percentiles. But I'm not sure if it is appropriate to
      do so.
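
The replacement idea above, capping values at an upper percentile, is often called winsorizing. A minimal sketch of what it does, assuming made-up lognormal values in place of the 223 measurements; the helper names `percentile` and `cap_at_percentile` are hypothetical, and the sketch does not settle whether capping is appropriate, which is the open question in this thread.

```python
import random

def percentile(values, pct):
    """Linear-interpolation percentile (pct in 0..100) of a non-empty list."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * pct / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

def cap_at_percentile(values, pct=98.0):
    """Replace every value above the pct-th percentile with that percentile."""
    cap = percentile(values, pct)
    return [min(v, cap) for v in values], cap

# Made-up, positively skewed data; NOT the survey data from this thread.
random.seed(0)
sample = [random.lognormvariate(5.0, 1.5) for _ in range(223)]

capped, cap = cap_at_percentile(sample, pct=98.0)
n_changed = sum(v > cap for v in sample)
mean_before = sum(sample) / len(sample)
mean_after = sum(capped) / len(capped)
print(f"cap = {cap:.1f}, values replaced = {n_changed}")
print(f"mean before = {mean_before:.1f}, mean after = {mean_after:.1f}")
```

Comparing the mean before and after capping shows how much of the average is carried by the few replaced values.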

      Cheers,

      Chaosheng Zhang

      ------------------------------------
      Hi Chaosheng,

      have a look at this paper
      Saito, H. and P. Goovaerts. (2000). Geostatistical interpolation of positively skewed and censored data in a dioxin contaminated site. Environmental Science & Technology, vol.34, No.19: 4228-4235.

      Ben

      Benjamin Warr

      Research Associate
      Centre for the Management of Environmental Resource(CMER)
      INSEAD
      Boulevard de Constance,
      77305 Fontainebleau Cedex,
      France

      Tel: 33 (0)1 60 72 4456
      Fax: 33 (0)1 60 74 55 64
      e-mail: benjamin.warr@...
      http://www.insead.fr/CMER

      --------------------------------

      Is it possible, in your opinion, to model your variogram excluding those few
      extreme data and afterwards to krige all data, including the extreme values?
      In this way you probably lose some spatial information concerning the
      variability of your data, but you could obtain a more reliable picture of the
      "background" values. It depends on what you are asking of your data.
      What do you, or somebody else, think about it?
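
The first half of this suggestion, computing the experimental variogram with and without the few extremes, can be sketched as follows. This is only an illustration under stated assumptions: the data are made up (a 15 x 15 grid at 400 m spacing with spatially uncorrelated lognormal values), the cutoff rule and the function name `semivariogram` are hypothetical, the estimator is the classical average of half squared differences per distance bin, and model fitting and kriging are not shown.

```python
import math
import random

def semivariogram(points, lag_width, n_lags):
    """Classical estimator: mean of 0.5 * (z_i - z_j)^2 per distance bin.

    points: list of (x, y, z). Bin k covers [k*lag_width, (k+1)*lag_width).
    Returns (mean_distance, gamma, n_pairs) for every non-empty bin.
    """
    sums = [0.0] * n_lags
    dists = [0.0] * n_lags
    counts = [0] * n_lags
    for i in range(len(points)):
        xi, yi, zi = points[i]
        for j in range(i + 1, len(points)):
            xj, yj, zj = points[j]
            h = math.hypot(xi - xj, yi - yj)
            k = int(h // lag_width)
            if k < n_lags:
                sums[k] += 0.5 * (zi - zj) ** 2
                dists[k] += h
                counts[k] += 1
    return [(dists[k] / counts[k], sums[k] / counts[k], counts[k])
            for k in range(n_lags) if counts[k] > 0]

# Made-up data on a regular 400 m grid; NOT the survey data from this thread.
random.seed(1)
points = [(400.0 * ix, 400.0 * iy, random.lognormvariate(5.0, 1.5))
          for ix in range(15) for iy in range(15)]

# Hypothetical cutoff: drop roughly the top 2% of values for the variogram only.
cutoff = sorted(z for _, _, z in points)[int(0.98 * len(points))]
trimmed = [p for p in points if p[2] < cutoff]

all_vg = semivariogram(points, 400.0, 6)
trim_vg = semivariogram(trimmed, 400.0, 6)
for (h, g_all, _), (_, g_trim, _) in zip(all_vg, trim_vg):
    print(f"h ~ {h:7.1f} m  gamma(all) = {g_all:12.1f}  gamma(trimmed) = {g_trim:12.1f}")
```

The trimmed variogram would be used only to choose the model; the excluded samples would still be passed to the kriging step as data, as proposed above.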

      regards
      Claudio

      Claudio Cocheo
      Fondazione Salvatore Maugeri - IRCCS
      Centro di Ricerche Ambientali
      via Svizzera, 16
      I 35127 - Padova
      ph. (39) 0498064511
      fax (39) 0498064555
      mailto:ccocheo@...
      website: http://www.fsm.it
      --------------------------

      Dear Chaosheng Zhang

      This problem can be looked at from various perspectives. You have to fit the data into the broader picture and objectives.

      First, what do your soil samples represent? How were they collected, and what was their size? Are they spot samples, or multiple takes in a cross pattern with x metres between takes up to y metres away from the centre? Etc.?

      A significant part of the nugget effect when dealing with rock or soil materials may be generated by sampling and sample preparation. If these samples were assayed by AA, what was the size of the portion used? A one-gram portion is much more liable to generate a nugget effect than 5 or 10 grams whenever the pulverisation was not fine and uniform enough.

      Second, what is the purpose of your study? Academic work? Detection, remediation-restoration, etc.? The high values might have physical significance in the latter perspective, and smoothing them may not be the ideal solution. Lead and arsenic contamination cannot be neglected or minimized.

      From an industry or regulatory perspective, the recommendation in that case might be to carry out additional sampling around the hot spots to delineate them better, say samples at 100 m spacing, as well as checking the original hot spots, with a sampling method designed to be representative. I am afraid I may not be easing you out of your problem, but such is physical reality.

      Chapter 8 in Jeff Myers's book "Geostatistical Error Management" deals with sampling, and Chapter 16 with sampling strategy. I published a text on "Sampling Quality Control" from a mineral exploration and development perspective in Exploration and Mining Geology, Vol 7, No 1-2, p. 107-116 (1998). This issue has several other papers on sampling. If it is not available to you, I could send you a file copy of my paper.

      Cheers

      Marcel Vallée

      Geoconseil Marcel Vallée Inc.
      706 Routhier Ave
      Québec, Québec G1X 3J9
      Canada
      Tel: (1) 418 652 3497
      Fax: (1) 418 652 9148
      Email: vallee.marcel@...
      --------------------------------

      "Another problem is when we carry out spatial interpolation, these values
      may produce artificial contour lines around these sampling locations, even
      though they can be smoothed. I don't think this is the realistic situation
      in the field."

      This sounds like the crux of the problem. You sampled data and within it you
      have discrete large values. You have confidence in the integrity of the data
      but don't accept that for these values to be genuine you must have all these
      'artificial' contour lines. This suggests to me that you are expecting the
      data to behave so that these large values don't exist, yet you are saying
      they should be regarded as valid. Is your sampling at a high enough spatial
      resolution?

      If you were to sample another point right next to one of these large values
      would you expect another large value or a more 'normal' one? If you know the
      answer to that then you should be able to decide whether the large values
      are truly errors or simply unexpected but valid data. I would suggest the
      problem here lies with understanding the underlying spatial variation of the
      data set from which the samples were taken, rather than a problem of which
      process to apply to the sampled data.

      Just another way of looking at it!

      regards,

      Martin

      ______________________________________

      ArchaeoPhysica Ltd.
      Reconnaissance & Geophysics for Archaeology

      Telephone: +44 (0) 7050 369789
      E-mail: mail@...
      Website: http://www.archaeophysica.co.uk
      ______________________________________

      Hello,

      The crux of the problem is the smoothing effect of kriging.
      If you don't want to get artificial contour lines in your
      map, you have 2 choices:
      1. use stochastic simulation which generates maps that
      are consistent with (reproduce) the variability of your data.
      2. use a non-exact interpolator, that is filter the
      noise at data locations. An alternative is to slightly
      shift the interpolation grid so that no interpolation
      grid node coincides with a sampled location.

      Pierre

      Pierre Goovaerts
      Assistant professor
      Dept of Civil & Environmental Engineering
      The University of Michigan
      EWRE Building, Room 117
      Ann Arbor, Michigan, 48109-2125, U.S.A
      E-mail: goovaert@...
      Phone: (734) 936-0141
      Fax: (734) 763-2275
      http://www-personal.engin.umich.edu/~goovaert/
      --------------------------------


      Chaosheng Zhang -

      I think Marcel Vallée is headed in the right direction on your problem. There is a good chance that the problem is one of sample and/or subsample support. As mentioned, if you sampled within a foot or two of a location that displays an extreme or "outlier" value, you may find values an order of magnitude or more below the outlier. Similarly, you may also have "inliers", where a sample near a location with a low concentration may contain a significantly higher value. Of course, no one gets excited about the inliers that may be unrepresentative, but we get very excited about the outliers!

      The possibility of extreme values should be planned for in the initial stage of the sampling program. Pierre Gy's work has revealed that the physical size, volume, and orientation of a sample and subsample (i.e. the support) are crucial to the concentration estimate obtained. You are asking a lot to have a 10-g sample represent 400 meters between sample locations in any case. Unless the support of the original sample and all subsampling stages was sufficient, there is little chance that the samples are highly representative of the true concentration. Mine areas typically are very heterogeneous, and proper sample support is essential. Perhaps you can provide some details. If the underlying data are not representative due to improper support, you are trying to "contour an illusion", and typically the results are not pleasing.

      The way in which the data are used in decision-making is also important. For instance, if your purpose is to delineate hot spots for risk assessment, extreme values do not pose a problem as they will be addressed. You may, however, be very interested in getting your best information at an economic cutoff value or risk threshold, since the decision for treatment of values high above or way below the action level is easy.

      Jeff Myers
      Westinghouse Safety Management Solutions
      2131 S. Centennial Ave., SE
      Aiken, SC 29803
      803.502.9747 (direct)
      803.502.9767 (main)
      803.502.2747 (fax)
      jeff.myers@...
      http://www.gemdqos.com




    • Ted Harding
      Message 2 of 5 , Dec 22, 2001
        On 22-Dec-01 Chaosheng Zhang wrote:
        > Dear list,
        >
        > Happy Christmas! Many thanks to all those who replied my question about
        > extreme values, especially Isobel Clark, Marcel Vallée, Benjamin Warr,
        > Claudio Cocheo, Martin Roseveare, Pierre Goovaerts, Jeff Myers.

        Dear Chaosheng Zhang,
        Thank you for your comprehensive summary (which now enables
        me to delete all the interesting individual replies by others!).

        I'd like to add one consideration which seems not to
        have been mentioned by others.

        Especially in a regulatory ("clean-up") context, the
        regulator may want to have a determination of the total
        quantity of contaminant on a site.

        Your high sample values are typical of "hot-spot" values.

        It is a useful formula (proof by integration by parts)
        that

        integral from 0 to inf (1 - F(x)) dx = expectation of X

        where F(x) is the cumulative distribution function of X.
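
For readers who want the integration-by-parts step spelled out, here is a short sketch. It assumes that X is non-negative with density f and has a finite mean; these conditions are not stated in the message but are what makes the boundary term vanish.

```latex
\begin{aligned}
E[X] &= \int_0^\infty x\, f(x)\, dx
      = \Bigl[-x\,\bigl(1 - F(x)\bigr)\Bigr]_0^\infty
        + \int_0^\infty \bigl(1 - F(x)\bigr)\, dx \\
     &= \int_0^\infty \bigl(1 - F(x)\bigr)\, dx ,
\end{aligned}
```

using u = x and v = -(1 - F(x)), and noting that x(1 - F(x)) tends to 0 as x grows whenever the mean is finite.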

        Applying this (somewhat crudely, numerically speaking)
        to your data for Lead shows that the top second percentile
        (98-100%) accounts for about 1/3 of the total content,
        while the top 5th percentile (95-100%) accounts for well
        over half the total.
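
One crude way to carry out this kind of calculation from the percentile table alone is sketched below. It assumes the Pb column from the original question and integrates the quantile curve Q(p) over p with the trapezoidal rule; for a non-negative variable this is the same area as the integral of 1 - F(x). The function name `partial_mean` is hypothetical, and the shares obtained depend on how the curve is interpolated between the listed percentiles, so they need not reproduce the rough figures quoted above.

```python
# Percentiles of Pb (mg/kg) as listed in the original question.
probs = [0.00, 0.05, 0.10, 0.25, 0.50, 0.75, 0.90,
         0.95, 0.96, 0.97, 0.98, 0.99, 1.00]
pb = [25, 35, 41, 62, 168, 821, 2799,
      4490, 4698, 5413, 7609, 11750, 16305]

def partial_mean(p_from, p_to=1.0):
    """Trapezoidal integral of Q(p) over the table segments inside [p_from, p_to].

    p_from and p_to must be probabilities that appear in `probs`.
    """
    total = 0.0
    for k in range(len(probs) - 1):
        p0, p1 = probs[k], probs[k + 1]
        if p0 >= p_from and p1 <= p_to:
            total += 0.5 * (pb[k] + pb[k + 1]) * (p1 - p0)
    return total

mean = partial_mean(0.0)                 # approximate mean concentration
share_top2 = partial_mean(0.98) / mean   # contribution of the 98-100% range
share_top5 = partial_mean(0.95) / mean   # contribution of the 95-100% range

print(f"approximate mean: {mean:.1f} mg/kg")
print(f"share of total from top 2%: {share_top2:.2f}")
print(f"share of total from top 5%: {share_top5:.2f}")
```

Swapping the trapezoid for a step rule (left or right endpoint of each segment) is a one-line change and gives a feel for how sensitive the shares are to the handful of upper percentiles.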

        It is therefore essential, for purposes such as the
        above, both to take these extreme values very seriously,
        and also to try to get estimates of the percentiles which
        are as accurate as possible. The latter is not at all easy
        (in fact I do not know of a satisfactory solution in
        the context of "grid sampling" of a contaminated site,
        where the amount and density of sampling which is feasible
        in practice is usually quite insufficient -- how do you
        know, for instance, that there are not much larger
        extremes still, somewhere, remaining unobserved? Their
        probabilities of being sampled may be very small; but if
        their values are very high they could dominate everything
        else.)

        However, as far as the data are concerned, you have
        what you have got and you _must_ respect what it tells
        you. From the above, for purposes of estimating the
        total, rather than trying to ignore the high percentiles
        you would even do better to ignore the lower percentiles,
        since they contribute very little!


        And, that being said,

        Happy Christmas and Prosperous New Year to All!
        Ted.

        --------------------------------------------------------------------
        E-Mail: (Ted Harding) <Ted.Harding@...>
        Fax-to-email: +44 (0)870 167 1972
        Date: 22-Dec-01 Time: 21:38:33
        ------------------------------ XFMail ------------------------------

        --
        * To post a message to the list, send it to ai-geostats@...
        * As a general service to the users, please remember to post a summary of any useful responses to your questions.
        * To unsubscribe, send an email to majordomo@... with no subject and "unsubscribe ai-geostats" followed by "end" on the next line in the message body. DO NOT SEND Subscribe/Unsubscribe requests to the list
        * Support to the list is provided at http://www.ai-geostats.org
      • Myers, Jeff
        Message 3 of 5 , Dec 29, 2001
          Ted's comments on the regulatory perspective bring up some interesting
          issues, assuming this were an hazardous waste site. First, since the upper
          5 percent accounts for such a high percentage of the mass of lead, a
          surgical cleanup targeting the "hot spots" might be sufficient.
          Environmental remediation is focused on reducing risk to human health and
          the environment, and if removing the high zones brings the average lead
          concentration below the risk-based threshold, then the remediation is
          successful.

          Next, in remediation, the remediation decision support unit is as important
          as the sample and subsample support. A few extreme values can be "diluted"
          out if a large decision unit is selected.
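
The "dilution" effect can be illustrated with a small sketch. Everything in it is made up for illustration: a 12 x 12 grid of background values with a single extreme sample, a hypothetical action level of 1000, and a hypothetical helper `unit_means` that averages the samples falling in square decision units of different sizes.

```python
import random

def unit_means(grid, unit):
    """Average a square grid of sample values over unit x unit blocks of cells."""
    n = len(grid)
    means = []
    for r in range(0, n, unit):
        for c in range(0, n, unit):
            block = [grid[i][j]
                     for i in range(r, min(r + unit, n))
                     for j in range(c, min(c + unit, n))]
            means.append(sum(block) / len(block))
    return means

# Made-up background values with one extreme sample; NOT the thread's data.
random.seed(2)
grid = [[random.uniform(30.0, 200.0) for _ in range(12)] for _ in range(12)]
grid[4][7] = 16000.0
threshold = 1000.0  # hypothetical action level

for unit in (1, 2, 4, 12):
    means = unit_means(grid, unit)
    exceed = sum(m > threshold for m in means)
    print(f"unit = {unit:2d} cells: max unit mean = {max(means):8.1f}, "
          f"units over threshold = {exceed}")
```

With the whole grid taken as one decision unit, the single extreme sample no longer pushes the unit mean over the assumed threshold, which is the point being made about the choice of decision unit support.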

          Furthermore, a "hot spot" must be defined with relation to its size and
          concentration (at a minimum). Then the impact of the size/support of the
          "hot spot" can be determined in relation to the decision unit support. So
          basically, the only things you need to remember in environmental
          characterization and decision-making are support, support, and support
          (sample, subsample, decision unit).

          As far as contouring, issues still remain. It's hard to contour yourself
          out of a situation you sampled yourself into.

          Happy Holidays and a Prosperous New Year!

          Jeff Myers
          Westinghouse Safety Management Solutions
          2131 S. Centennial Dr., SE
          Aiken, SC 29803
          jeff.myers@...
          http://www.gemdqos.com

        • Ted Harding
          Message 4 of 5 , Dec 29, 2001
            On 29-Dec-01 Myers, Jeff wrote:
            > Ted's comments on the regulatory perspective bring up some
            > interesting issues, assuming this were an hazardous waste site.

            Jeff, Thanks for your comments which are very much to the point.

            > It's hard to contour yourself out of a situation you sampled
            > yourself into.

            With your permission (which I assume will not be unreasonably
            withheld) I propose to trot out this delightful maxim on
            suitable occasions!

            Thanks for this too -- just in time to set me smiling for
            the New Year.

            Best wishes to all,
            Ted.

            --------------------------------------------------------------------
            E-Mail: (Ted Harding) <Ted.Harding@...>
            Fax-to-email: +44 (0)870 167 1972
            Date: 29-Dec-01 Time: 19:22:06
            ------------------------------ XFMail ------------------------------

          • Myers, Jeff
            Message 5 of 5 , Jan 1, 2002
              Permission granted. And a Happy New Year to all!

              Jeff
