[ai-geostats] Re: Sill versus least-squares classical variance estimate

  • Digby Millikan
    Message 1 of 18 , Dec 8, 2004
      Meng,

      You want an evenly spaced sample pattern for your estimation
      of the variance. If you use samples within range of each other, these
      are clusters of samples which will overweight that area; hence, by
      removing samples spaced closer than the range, you remove "clusters"
      of samples. A common method of performing statistics on spatial data
      is first to perform data declustering, then calculate your statistics;
      however, as Isobel points out, a fast way to do this is to remove
      samples spaced closer than the range.
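
As a rough illustration of the "remove samples closer than the range" shortcut described above, here is a minimal Python sketch. The function name thin_by_range, the (x, y, value) tuple layout, the toy coordinates, and the greedy keep-the-first rule are all assumptions made for illustration; the thread does not specify how the thinning would be done.

```python
from math import hypot


def thin_by_range(points, min_dist):
    """Greedy thinning: keep a point only if it lies at least min_dist
    away from every point that has already been kept."""
    kept = []
    for x, y, value in points:
        if all(hypot(x - kx, y - ky) >= min_dist for kx, ky, _ in kept):
            kept.append((x, y, value))
    return kept


# Hypothetical samples: (x, y, grade). Two tight pairs and one isolated point.
samples = [
    (0.0, 0.0, 1.2),
    (0.5, 0.0, 1.3),   # closer than 3.0 to the first sample, so dropped
    (10.0, 0.0, 0.4),
    (10.0, 0.4, 0.5),  # closer than 3.0 to the third sample, so dropped
    (20.0, 5.0, 2.1),
]

thinned = thin_by_range(samples, min_dist=3.0)
print(len(samples), "samples before thinning,", len(thinned), "after")
```

Which member of a close pair survives depends only on input order here, so a different ordering can give a different thinned set.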

      Digby
    • Digby Millikan
      Message 2 of 18 , Dec 8, 2004
        Meng,

        If your sample grid spacing is regular, I assume it wouldn't make
        much difference. However, mining drilling campaigns commonly
        have a high degree of clustering of drillhole data in high-grade
        and anomalous areas, and the same goes for grade control and other
        forms of sampling.

        Digby
      • Meng-Ying Li
        Message 3 of 18 , Dec 8, 2004
          Thanks Digby,

          Your answer speaks more directly to the question I asked. In this
          case I assume that you define the overall variance of a random field
          to be the variance of data spaced beyond the variogram range, which I
          can accept, but I am not sure this definition is practical in all
          cases, and that is why I asked the question about estimating the
          variance in the first place. From my point of view, the expected
          variance for samples with CSR would be a better definition of the
          overall variance. That is a personal preference, however.

          And since you mentioned declustering: I do know a few declustering
          approaches that address the problem of data clusters, but it is
          doubtful whether these approaches remove all effects of correlation
          between point data.

          I'm sure I understand all the points in the replies to my question.
          I think I'm just trying to make sure the definition of variance
          applies in all cases of application.

          Meng-ying

        • Digby Millikan
          Message 4 of 18 , Dec 8, 2004
            Dear Meng-Ying,

            It's not that you are defining the variance to be the variance of
            data beyond the range of the variogram. Say you have a panel made
            up of 1 million samples which cover the entire panel, and you
            select 1000 samples to estimate the variance. If two samples of the
            thousand are within range of each other (close and of similar
            value), then you are effectively doubling up on one of the samples.
            To give a better representation of the 1 million samples you are
            better off removing the doubled-up sample, leaving 999 samples to
            estimate the variance of the 1 million. This will give a better
            estimate of the variance you could calculate from the million by
            the classical least-squares method, which is what Isobel was
            saying.

            Regards Digby
          • Digby Millikan
            Message 5 of 18 , Dec 8, 2004
              Dear Meng-Ying,

              If you imagine the 1 million samples (total dataset and area)
              overlying a pattern of 1000 low- and high-grade regions, then in
              your 1000-sample set you would want only one sample from each
              low-grade and each high-grade region. If you had two samples in
              one low-grade region, this region would be overweighted
              (introducing bias into your estimate), so you would want to
              remove the extra sample. Two samples in the same low-grade patch
              will be within range of each other.

              Digby
            • Isobel Clark
              Message 6 of 18 , Dec 8, 2004
                Meng-Ying

                No, I do not think we are communicating.

                The variance of data values is not affected by
                correlation between the sample values.

                The estimated variance for the population IS affected
                by correlation between the sample values. Statistical
                inference about the population is based on the
                assumption that samples were taken randomly and
                independently from that population.

                It is the process of estimation of unknown parameters
                by classical statistical theory which requires these
                assumptions.

                Geostatistical inference does not require absence of
                correlation, quite the contrary. The semi-variogram
                graph is constructed on the assumption that there is a
                correlation between samples and that this depends on
                distance and direction between the pair of samples.

                If we have a stationary situation, where the mean and
                variance are constant over the study area, the
                semi-variogram generally reaches a sill value. The
                distance at which this happens is interpreted as that
                distance beyond which the correlation is zero. Sample
                pairs at this distance or greater can be used to
                estimate the variance, since the statistical
                assumptions are now satisfied.
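
The sill argument can be written out in a few lines. This is a sketch under the assumption of second-order stationarity; the symbols \sigma^2 (population variance), C(h) (covariance at separation h) and \rho_{ij} (correlation between samples i and j) are notation introduced here, not used in the thread. The last line gives the expected value of the classical estimator s^2 with divisor n-1, which falls below \sigma^2 when the pairwise correlations are positive on average.

```latex
\begin{align*}
\gamma(h) &= \tfrac{1}{2}\,\operatorname{Var}\!\left[Z(x) - Z(x+h)\right] \\
          &= \tfrac{1}{2}\left(\sigma^2 + \sigma^2 - 2\,C(h)\right)
           = \sigma^2 - C(h), \\
\gamma(h) &= \sigma^2 \quad \text{when } C(h) = 0
           \ \text{(separation } h \text{ beyond the range)}, \\
\operatorname{E}\!\left[s^2\right]
          &= \sigma^2\left(1 - \frac{1}{n(n-1)}\sum_{i \ne j}\rho_{ij}\right).
\end{align*}
```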

                Isobel
                http://geoecosse.bizland.com/whatsnew.htm




                --- Meng-Ying Li <mengyl@...> wrote:
                > Hi Isobel,
                >
                > I understand all the points you made, but I'm not sure why the
                > variance should be defined using data that are NOT SPATIALLY
                > CORRELATED when they may or may not be correlated.
                >
                > Thanks for the clarification, though. I don't think I'd be able
                > to clarify the things you clarified. You're good.
                >
                > Meng-ying
                >
                > On Wed, 8 Dec 2004, Isobel Clark wrote:
                >
                > > Meng-Ying
                > >
                > > I don't know how to say this any other way. At distances
                > > larger than the range of influence, samples are NOT
                > > SPATIALLY CORRELATED.
                > >
                > > The variance of the difference between two uncorrelated
                > > samples is twice the variance of one sample around the mean.
                > >
                > > The semi-variogram is one-half of the variance of the
                > > difference.
                > >
                > > Hence the sill is (theoretically) equal to the variance. The
                > > sill is based on all pairs of samples found at a distance
                > > greater than the range of influence.
                > >
                > > The classical statistical estimator of the variance is only
                > > unbiassed if the correct degrees of freedom are used. If the
                > > samples are correlated, n-1 is NOT the correct degrees of
                > > freedom.
                > >
                > > All explained in immense detail in Practical Geostatistics
                > > 2000, Clark and Harper, http://geoecosse.hypermart.net
                > >
                > > Did I get it clear this time?
                > > Isobel
                > >
                > > --- Meng-Ying Li <mengyl@...> wrote:
                > > > I understand why it is not appropriate to force the sill so
                > > > it matches the sample variance. My question is, why estimate
                > > > the overall variance by the sill value when data are actually
                > > > correlated?
                > > >
                > > >
                > > > Meng-ying
                > > >
                > > > On Tue, 7 Dec 2004, Isobel Clark wrote:
                > > >
                > > > > Meng-Ying
                > > > >
                > > > > We are talking about estimating the variance of a set of
                > > > > samples where spatial dependence exists.
                > > > >
                > > > > The classical statistical unbiassed estimator of the
                > > > > population variance is s-squared, which is the sum of the
                > > > > squared deviations from the mean divided by the relevant
                > > > > degrees of freedom. If the samples are not inter-correlated,
                > > > > the relevant degrees of freedom are (n-1). This gives the
                > > > > formula you find in any introductory statistics book or
                > > > > course.
                > > > >
                > > > > If samples are not independent of one another, the degrees
                > > > > of freedom issue becomes a problem and the classical
                > > > > estimator will be biassed (generally too small on average).
                > > > >
                > > > > In theory, pairs of samples beyond the range of influence on
                > > > > a semi-variogram graph are independent of one another. In
                > > > > theory, the variance of the difference between two values
                > > > > which are uncorrelated is twice the variance of one sample
                > > > > around the population mean. This is thought to be why
                > > > > Matheron defined the semi-variogram (one-half the squared
                > > > > difference) so that the final sill would be (theoretically)
                > > > > equal to the population variance.
                > > > >
                > > > > There are computer software packages which will draw a line
                > > > > on your experimental semi-variogram at the height equivalent
                > > > > to the classically calculated sample variance. Some people
                > > > > try to force their semi-variogram models to go through this
                > > > > line. This is dumb, as the experimental sill is a better
                > > > > estimate because it does have the degrees of freedom it is
                > > > > supposed to have.
                > > > >
                > > > > I am not sure whether this is clear enough. If you email me
                > > > > off the list, I can recommend publications which might help
                > > > > you out.
                > > > >
                > > > > Isobel
                > > > > http://geoecosse.bizland.com/books.htm
                > > > >
                > > > > --- Meng-Ying Li <mengyl@...> wrote:
                > > > > > Hi Isobel,
                > > > > >
                > > > > > Could you explain why it would be a better estimate of the
                > > > > > variance when independence is considered? I'd rather think
                > > > > > that we consider the dependence when the overall variance is
                > > > > > to be estimated, if there actually is dependence between
                > > > > > values.
                > > > > >
                > > > > > Or are you talking about modeling the sill value by the
                > > > > > stabilizing tail on the experimental variogram, instead of
                > > > > > modeling by the calculated overall variance?
                > > > > >
                > > > > > Or are we talking about variances of different definitions?
                > > > > > I'd be concerned if I missed some point of the original
                > > > > > definition of variance, for example that the variance should
                > > > > > be defined with no dependence between values, or something
                > > > > > like that. Frankly, I don't think I took the definition of
                > > > > > variance too seriously when I was learning stats.
                > > > > >
                > > > > >
                > > > > > Meng-ying
                > > > > >
                > > > > > > Digby
                > > > > > >
                > > > > > > I see where you are coming from on this, but in fact
                > > > > > > the sill is composed of those pairs of samples which
                > > > > > > are independent of one another - or, at least, have
                >
                === message truncated ===
              • Meng-Ying Li
                Message 7 of 18 , Dec 8, 2004
                  Hi Digby and All,

                  I did a little experiment on the idea that Digby mentioned, that
                  the sill will estimate the population variance, but found that it
                  did not hold in my experiment:

                  1. I generated a set of one-dimensional data with 27 points at
                  regular unit spacing, which I'd like to take as the true, or
                  population, values. On purpose, I generated the data so that they
                  have a range of influence of three length units.
                  2. I calculated the experimental variogram. Notice that this
                  variogram is the population variogram. The sill value is around 2.8.
                  3. But the population variance is 2.39, lower than the sill value.

                  This confirms my doubt about using the sill value as the estimate
                  of the population variance, since I calculated the variogram and
                  the variance from all data points. Please tell me what you think.
                  The data I generated are as follows:
                  as follows:

                  0.056970748
                  0.14520424
                  0.849710204
                  1.650514605
                  1.101666385
                  1.015177986
                  2.150259206
                  2.830780659
                  0.223495817
                  -2.47615958
                  -3.372697392
                  -0.530685611
                  0.786582177
                  0.970673
                  0.674755256
                  0.338461632
                  1.020874834
                  0.410936991
                  1.702892405
                  2.649748012
                  4.290179731
                  3.442015668
                  1.488818953
                  0.862788738
                  0.728709892
                  2.398182914
                  1.522546427
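
For anyone who wants to repeat the calculation on the 27 values above, a minimal Python sketch follows. It assumes the usual experimental semi-variogram (half the mean squared difference of all pairs at each lag) and a population variance with divisor n; the function names and the choice of nine lags are assumptions, and the thread does not say exactly how the sill of about 2.8 was read off.

```python
data = [
    0.056970748, 0.14520424, 0.849710204, 1.650514605, 1.101666385,
    1.015177986, 2.150259206, 2.830780659, 0.223495817, -2.47615958,
    -3.372697392, -0.530685611, 0.786582177, 0.970673, 0.674755256,
    0.338461632, 1.020874834, 0.410936991, 1.702892405, 2.649748012,
    4.290179731, 3.442015668, 1.488818953, 0.862788738, 0.728709892,
    2.398182914, 1.522546427,
]


def population_variance(values):
    """Variance with divisor n, treating the data as the whole population."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def experimental_semivariogram(values, max_lag):
    """Half the mean squared difference over all pairs at each integer lag
    (unit spacing assumed)."""
    gamma = {}
    for lag in range(1, max_lag + 1):
        sq_diffs = [(values[i + lag] - values[i]) ** 2
                    for i in range(len(values) - lag)]
        gamma[lag] = sum(sq_diffs) / (2 * len(sq_diffs))
    return gamma


pop_var = population_variance(data)
gamma = experimental_semivariogram(data, max_lag=9)

print(f"population variance: {pop_var:.3f}")
for lag, g in gamma.items():
    print(f"lag {lag}: gamma = {g:.3f} from {len(data) - lag} pairs")
```

Note how few pairs each lag rests on: 26 at lag 1 and only 18 at lag 9.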
                • Colin Daly
                  Message 8 of 18 , Dec 8, 2004

                    Hi Digby

                    Sorry to say, but suggesting that less data is systematically better is mistaken. This is fundamental and is covered in the opening pages of any good introduction to geostatistics. If the data are clustered, then you might have to decluster in some sense. But with an unbiased sample you will use all million samples. Please, any new users of geostats lucky enough to have a million samples: don't throw 99.9% of your data away!!

                    Declustering is about trying to remove the bias that most realistic sampling strategies have (e.g. in petroleum, you tend to drill into the best reservoir regions first...). If your data are an unbiased sample from the true histogram (i.e. what you would get by mining out the resource fully), then you will use all of it for estimating any statistic. This does not mean that the samples have to be far apart, just that they don't cluster into high or low regions.
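
One common way to decluster without discarding any samples is to weight them instead. The sketch below uses cell declustering in one dimension: each occupied cell gets equal total weight, shared among the samples inside it. The function names, the cell size and the toy positions and grades are assumptions for illustration only; nobody in the thread prescribes this particular method.

```python
from collections import Counter


def cell_declustering_weights(xs, cell_size):
    """Weight each sample by 1 / (samples in its cell * occupied cells),
    so the weights sum to 1 and crowded cells are not over-represented."""
    cells = [int(x // cell_size) for x in xs]
    counts = Counter(cells)
    n_occupied = len(counts)
    return [1.0 / (counts[c] * n_occupied) for c in cells]


def weighted_mean_and_variance(values, weights):
    """Weighted mean and weighted variance about that mean."""
    mean = sum(w * v for w, v in zip(weights, values))
    var = sum(w * (v - mean) ** 2 for w, v in zip(weights, values))
    return mean, var


# Hypothetical drillhole positions: four holes clustered in a high-grade zone.
xs = [0.5, 5.5, 5.6, 5.7, 5.8, 9.5]
grades = [1.0, 4.0, 4.2, 3.9, 4.1, 1.2]

weights = cell_declustering_weights(xs, cell_size=2.0)
naive_mean = sum(grades) / len(grades)
declustered_mean, declustered_var = weighted_mean_and_variance(grades, weights)

print(f"naive mean: {naive_mean:.3f}, declustered mean: {declustered_mean:.3f}")
```

All six samples are still used; the clustered high-grade holes simply share one cell's worth of weight.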

                    There seems to be some confusion about independence and estimates. Suppose the mean (and/or variance) is being estimated, with the provisos that (1) the sample data are unbiased and (2) the field is stationary (so that mean and variance have a meaning). Then the estimate is unbiased, irrespective of the correlation of the data. What does depend on the correlation is the error in the estimation. For a zero correlation length, the variance of the error of the mean drops as 1/n. For a non-zero correlation length it drops more slowly than 1/n, but you do not get quicker convergence by throwing away good data; in fact, virtually always you will get a strictly worse estimate!
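
The point about the error of the mean can be checked exactly, without simulation, because the variance of an equal-weight mean is the average of all pairwise covariances. The sketch below assumes unit variance, 100 points on a unit-spaced line, and a geometric correlation 0.7**d at separation d; these choices, and the function names, are illustrative assumptions and not values from the thread.

```python
def variance_of_mean(positions, sigma2, corr):
    """Exact variance of the equal-weight mean: the average of all pairwise
    covariances sigma2 * corr(|a - b|), diagonal included."""
    n = len(positions)
    total = sum(corr(abs(a - b)) for a in positions for b in positions)
    return sigma2 * total / n ** 2


def independent(d):
    """Zero correlation length: only a sample with itself is correlated."""
    return 1.0 if d == 0 else 0.0


def geometric(d, r=0.7):
    """Assumed correlation model: decays as r**d with separation d."""
    return r ** d


all_points = list(range(100))
every_third = all_points[::3]  # thinned set: 34 of the 100 points

var_indep = variance_of_mean(all_points, 1.0, independent)
var_corr_all = variance_of_mean(all_points, 1.0, geometric)
var_corr_thinned = variance_of_mean(every_third, 1.0, geometric)

print(f"independent, n=100:       {var_indep:.4f}")
print(f"correlated, all 100:      {var_corr_all:.4f}")
print(f"correlated, every third:  {var_corr_thinned:.4f}")
```

Under these assumptions the correlated case is worse than 1/n, and thinning to every third point makes the error of the mean larger again, not smaller.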

                    Regards

                    Colin


                    DISCLAIMER:
                    This message contains information that may be privileged or confidential and is the property of the Roxar Group. It is intended only for the person to whom it is addressed. If you are not the intended recipient, you are not authorised to read, print, retain, copy, disseminate, distribute, or use this message or any part thereof. If you receive this message in error, please notify the sender immediately and delete all copies of this message.
                    
                  • Colin Daly
                    Message 9 of 18 , Dec 8, 2004

                      Hi Meng-Ying

                      27 points - you can't really calculate a variogram. With a range of 3, you have about 9 correlation lengths in the field. So, as a crude approximation, even the standard deviation of the estimate of the mean would be of the order of s.d./sqrt(9) (I vaguely remember trying to get a more accurate version of this in the case of a Gaussian RF as an exercise in one of Matheron's classes...)

                      so with s.d = 2.8 (or 2.4 ---similar answers), then standard error is 2.8/3=0.9 (approx) 

                      so your confidence interval for the mean would be  [m-1.8, m+1.8]

                      -  this is the same order for both the estimate of the sill and for the direct estimate of the variance... both are bad
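
The crude calculation above can be sketched in Python. The function name and the "effective sample count = points / range" rule are taken as assumptions from the rough argument in this message. One caveat worth flagging: the thread reports 2.8 as the sill and 2.39 as the variance, so if those figures are read as variances the corresponding standard deviations would be their square roots; the sketch shows both readings without claiming which was intended.

```python
from math import sqrt


def crude_standard_error(sd, n_points, corr_range):
    """Rough standard error of the mean, treating each correlation length
    as one effectively independent sample."""
    n_effective = n_points / corr_range
    return sd / sqrt(n_effective)


# Reading 2.8 as a standard deviation, as written in the message.
se_as_quoted = crude_standard_error(2.8, n_points=27, corr_range=3)

# Alternative reading (an assumption): 2.8 is a variance, so sd = sqrt(2.8).
se_from_variance = crude_standard_error(sqrt(2.8), n_points=27, corr_range=3)

print(f"standard error, 2.8 taken as s.d.:     {se_as_quoted:.2f}")
print(f"standard error, 2.8 taken as variance: {se_from_variance:.2f}")
print(f"rough interval half-width (2 s.e.):    {2 * se_as_quoted:.2f}")
```

Either way the interval is wide compared with the gap between 2.39 and 2.8, which is the point being made.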

                      That is for the comparatively easy case of the mean. The situation for the variance is even worse, so there is no way that you can complain about the quality of the estimate.

                      I'm not sure if you are suggesting that you should get different answers, or that there is some bias involved, but to convince yourself that there is not, repeat your experiment with a length of 1,000,000 instead of 27. Then at least we would get rid of most of the statistical fluctuations, and the estimates should be similar. How are you generating the random sequence? Is it an AR process or something where the variance is known theoretically?

                      Colin 

                    • Digby Millikan
                      Message 10 of 18 , Dec 8, 2004
                        Colin,

                        You misunderstood me: the 1 million data is the total unknown
                        dataset. Say you have a volume in a mine and its volume is
                        1 million 1 metre core samples. You drill the volume and have a
                        sample set of 1000 1 m core samples. You then analyse the
                        statistics of the 1000 samples to try to estimate the variance
                        of the total volume (1 million core samples). So your estimate
                        of the variance comes from the 1000 samples. You can plot the
                        variogram of the 1000 samples and you can also calculate their
                        variance. You are trying to estimate the variance of the
                        1 million pieces of core which you do not have. So you must
                        decide whether your 1000-sample set is a true representation of
                        the 1 million. Our argument is that samples within the 1000
                        which are clustered together do not create a good
                        representation of the true dataset and will create a biased
                        estimate.

                        Digby
                        www.users.on.net/~digbym
                      • Mat (University Account)
                        Message 11 of 18 , Dec 8, 2004
                          Hi Digby,
                          Just a note: in the circumstances that you have just described,
                          the greater the level and range of autocorrelation, the more
                          precise your estimate of the mean will be.

                          If your 1000 cores were randomly sampled from the population of
                          1 million, then the fact that some (perhaps many) pairs of data
                          points lie less than the (variogram) range apart will not matter.
                          s^2 is a valid, unbiased estimate of the population variance.
                          (The population is defined here as the 1,000,000 possible cores
                          that could be taken from this area, not the process that
                          generated this realization/data.)

                          What's more, the typical simple random sample (SRS) standard
                          error (s^2/n) will perform exactly as expected.

                          If you chose to use a more sensible design, say a grid
                          (systematic sample), then your s^2/n would in fact be an
                          _overestimate_ of the standard error.

                          Mat

                          -----Original Message-----
                          From: Digby Millikan [mailto:digbym@...]
                          Sent: Thursday, 9 December 2004 8:32 a.m.
                          To: ai-geostats
                          Subject: Re: [ai-geostats] Re: Sill versus least-squares classical variance
                          estimate

                          RE: [ai-geostats] Re: Sill versus least-squares classical variance
                          estimateColin,

                          You misunderstood me: the 1 million data is the total unknown dataset. Say
                          you have a volume in a mine, and its volume is 1 million 1 metre core
                          samples. You drill the volume and have a sample set of 1000 1 m core samples.
                          You then analyse the statistics of the 1000 samples to try to estimate the
                          variance of the total volume (1 million core samples). So your estimate of
                          the variance comes from the 1000 samples. You can plot the variogram of the
                          1000 samples and you can also calculate its variance. You are trying to
                          estimate the variance of the 1 million pieces of core which you do not have.
                          So you must decide whether your 1000 sample set is a true representation of
                          the 1 million.
                          Our argument is that samples within the 1000 which are clustered together do
                          not create a good representation of the true dataset and will create a
                          biased estimate.
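
To make the clustering argument concrete, here is a minimal sketch of one common remedy, cell declustering, in one dimension. It is an illustration only: the coordinates, grades, and cell size are invented, the function names are hypothetical, and nothing in the thread says this is the exact procedure any participant used.

```python
import numpy as np

def cell_decluster_weights(x, cell_size):
    """Weight each sample by 1 / (number of samples sharing its cell)."""
    cells = np.floor(np.asarray(x, dtype=float) / cell_size).astype(int)
    _, inverse, counts = np.unique(cells, return_inverse=True, return_counts=True)
    w = 1.0 / counts[inverse]
    return w / w.sum()  # normalise so the weights sum to 1

def weighted_mean_and_variance(values, weights):
    """Mean and variance of the values under the given normalised weights."""
    values = np.asarray(values, dtype=float)
    mean = np.sum(weights * values)
    return mean, np.sum(weights * (values - mean) ** 2)

# Hypothetical drillhole positions (m): four holes clustered in a high-grade zone.
x = np.array([0.5, 10.5, 20.5, 30.5, 30.7, 30.9, 31.1, 40.5])
grade = np.array([1.0, 1.2, 0.9, 5.0, 5.2, 4.8, 5.1, 1.1])

w = cell_decluster_weights(x, cell_size=10.0)
naive_mean, naive_var = grade.mean(), grade.var()
decl_mean, decl_var = weighted_mean_and_variance(grade, w)

print(f"naive       mean={naive_mean:.3f}  variance={naive_var:.3f}")
print(f"declustered mean={decl_mean:.3f}  variance={decl_var:.3f}")
```

In this made-up example the four clustered high-grade holes share one cell, so together they carry the same weight as a single isolated hole, and the declustered mean drops below the naive mean.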

                          Digby
                          www.users.on.net/~digbym
                        • Digby Millikan
                          Mat, The point is the spatial randomness with which they were sampled. Typically in a mining situation core samples far from follow a spatially random sampling
                          Message 12 of 18 , Dec 8, 2004
                            Mat,

                            The point is the spatial randomness with which they were sampled. Typically
                            in a mining situation, core samples are far from following a spatially
                            random sampling pattern.

                            Digby
                            Geolite Mining Systems
                            www.users.on.net/~digbym
                          • Colin Daly
                            Hi DigbyYes, I agree with what you say below - if your only aim was to estimate the variance and you only could collect 1000 samples - then choose them to
                            Message 13 of 18 , Dec 8, 2004

                              Hi Digby

                               Yes, I agree with what you say below: if your only aim was to estimate the variance and you could only collect 1000 samples, then choose them to be 'maximally independent' to reduce the variance of the error. But note, as Don said yesterday, a random sample that is clustered will also give an unbiased estimate of the variance, but with a somewhat larger error of estimation. (Of course, there may be reasons not to take all the samples as far from one another as possible, for example to estimate the variogram close to the origin, which is its most important part, but that is another story.) The bit that I disagreed with in your original message was the bit that said
                              "....giving 999 samples to estimate the variance of the 1 million. This will give a better estimate of the variance you could calculate from the million by the least squares classical method, which is what Isobel was saying"
                              I understood this to say that you would do better with 1000 (or 999) points than with the full million... if that is not what you meant then, yes, I did misunderstand.

                              Colin

                              DISCLAIMER:
                              This message contains information that may be privileged or confidential and is the property of the Roxar Group. It is intended only for the person to whom it is addressed. If you are not the intended recipient, you are not authorised to read, print, retain, copy, disseminate, distribute, or use this message or any part thereof. If you receive this message in error, please notify the sender immediately and delete all copies of this message.
                              
                            • Meng-Ying Li
                              Hi Colin, What I m talking about in my example is comparing two descriptive statistics for this population which consists of 27 data points. No estimation here
                              Message 14 of 18 , Dec 8, 2004
                                Hi Colin,

                                What I'm talking about in my example is comparing two descriptive
                                statistics for this population which consists of 27 data points. No
                                estimation here is involved, so the thing about confidence interval of
                                the mean or variance is not of concern here. And it doesn't matter which
                                model I used in the generator or what parameters I used, since I
                                re-calculated the population sill and variance after the data are
                                generated.

                                Let me state this clearly:
                                (Capitalization indicates highlighting, not speaking tone :p)

                                1. I generated a POPULATION which is, believe it or not, a series of 27
                                data.
                                2. The POPULATION variance, in my example, doesn't match the POPULATION
                                sill calculated from the POPULATION variogram.
                                3. So how are we going to estimate the POPULATION variance by the sill of
                                a SAMPLE, when the sill and the variance of the POPULATION just
                                don't match?

                                And just a personal opinion: I would like to think geostatistical
                                theories apply to populations of any size, as small as 27 or as large as
                                1,000,000. If I have made an example to which geostatistics doesn't apply,
                                then there is something to be concerned about in this approach.
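
The comparison described above can be reproduced from the 27 values posted in the thread. The sketch below computes the population variance (dividing by n) and the experimental semivariogram at integer lags; the function name and the choice of 13 as the maximum lag are assumptions made for illustration. It also checks an algebraic identity that bears on the question: the population variance equals the semivariance averaged over all ordered pairs of points, short lags included, so it need not coincide with the plateau read off at long lags.

```python
import numpy as np

# The 27 values posted in the thread, on a regular unit-spaced 1-D grid.
z = np.array([
    0.056970748, 0.14520424, 0.849710204, 1.650514605, 1.101666385,
    1.015177986, 2.150259206, 2.830780659, 0.223495817, -2.47615958,
    -3.372697392, -0.530685611, 0.786582177, 0.970673, 0.674755256,
    0.338461632, 1.020874834, 0.410936991, 1.702892405, 2.649748012,
    4.290179731, 3.442015668, 1.488818953, 0.862788738, 0.728709892,
    2.398182914, 1.522546427,
])

def semivariogram(values, max_lag):
    """Experimental semivariance at integer lags 1..max_lag (regular grid)."""
    return np.array([
        0.5 * np.mean((values[h:] - values[:-h]) ** 2)
        for h in range(1, max_lag + 1)
    ])

pop_var = np.var(z)                    # divides by n: population variance
gamma = semivariogram(z, max_lag=13)   # max_lag=13 is an arbitrary choice

# Semivariance averaged over every ordered pair (i, j), including i == j.
diffs = z[:, None] - z[None, :]
mean_gamma_all_pairs = 0.5 * np.mean(diffs ** 2)

print(f"population variance      : {pop_var:.3f}")
print(f"mean gamma over all pairs: {mean_gamma_all_pairs:.3f}")
for h, g in enumerate(gamma, start=1):
    print(f"gamma({h:2d}) = {g:.3f}")
```

With a range of 3 on a domain of only 27 points, a noticeable share of the pairs lie within the range, where the semivariance is below the sill, which pulls the all-pairs average (and hence the variance) down.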


                                Meng

                                On Wed, 8 Dec 2004, Colin Daly wrote:

                                >
                                > Hi Meng-Ying
                                >
                                > 27 points - you can't really calculate a variogram. With a range of 3 -
                                > you have about 9 correlation lengths in the field. So as a crude
                                > approximation, even the standard deviation on the estimate of the mean
                                > would be of the order of s.d./sqrt(9) (I vaguely remember trying to get a
                                > more accurate version of this in the case of a Gaussian RF as an
                                > exercise in one of Matheron's classes...)
                                >
                                > so with s.d = 2.8 (or 2.4 ---similar answers), then standard error is
                                > 2.8/3=0.9 (approx)
                                >
                                > so your confidence interval for the mean would be [m-1.8, m+1.8]
                                >
                                > - this is the same order for both the estimate of the sill and for the
                                > direct estimate of the variance... both are bad
                                >
                                > That is for the comparatively easy case of the mean - the situation
                                > for the variance is even worse - so there is no way that you can
                                > complain about the quality of the estimate.
                                >
                                > I'm not sure if you are suggesting that you should get different
                                > answers, or that there is some bias involved, but to convince yourself
                                > that there is not, repeat your experiment but use a length of 1,000,000
                                > instead of 27... then at least we would get rid of most of the
                                > statistical fluctuations, and the estimates should be similar. How are
                                > you generating the random sequence - is it an AR process or something
                                > where the variance is known theoretically?
                                >
                                > Colin
                                >
                                > -----Original Message-----
                                > From: Meng-Ying Li [mailto:mengyl@...]
                                > Sent: Wed 12/8/2004 6:36 PM
                                > To: Digby Millikan
                                > Cc: ai-geostats
                                > Subject: Re: [ai-geostats] Re: Sill versus least-squares classical variance estimate
                                > Hi Digby and All,
                                >
                                > I did a little experiment on the idea that Digby mentioned, that the sill
                                > will estimate the population variance, but found it not true in my experiment:
                                >
                                > 1. I generated a set of one-dimensional data with 27 points on regular
                                > unit spacings, which I'd like to take as the true, or population,
                                > values. On purpose, I generated the data so that they have an influence
                                > range of three length units.
                                > 2. I calculated the experimental variogram. Notice that the variogram is
                                > the population variogram. The sill value is around 2.8.
                                > 3. But the population variance is 2.39, lower than the sill value.
                                >
                                > This confirms my doubt about using sill value as the estimate of
                                > population variance, since I calculate the variogram and variance based on
                                > all data points. Please tell me what you think. The data I generated are
                                > as follows:
                                >
                                > 0.056970748
                                > 0.14520424
                                > 0.849710204
                                > 1.650514605
                                > 1.101666385
                                > 1.015177986
                                > 2.150259206
                                > 2.830780659
                                > 0.223495817
                                > -2.47615958
                                > -3.372697392
                                > -0.530685611
                                > 0.786582177
                                > 0.970673
                                > 0.674755256
                                > 0.338461632
                                > 1.020874834
                                > 0.410936991
                                > 1.702892405
                                > 2.649748012
                                > 4.290179731
                                > 3.442015668
                                > 1.488818953
                                > 0.862788738
                                > 0.728709892
                                > 2.398182914
                                > 1.522546427
                                >
                              • Digby Millikan
                                Meng-Ying, You seem to have a point the theory of using the sill is not a hard and fast rule, it is just if you re conducting a mine study and you plot a
                                Message 15 of 18 , Dec 8, 2004
                                  Meng-Ying,

                                  You seem to have a point: the theory of using the sill is not a hard and
                                  fast rule. It is just that if you're conducting a mine study and you plot a
                                  variogram, you can use the sill as a better estimate when you have many
                                  samples. As you say, with only 27 samples, using the sill is a pretty rough
                                  estimate, as would be expected.
                                  What results do you get for a different 27 samples, or for 1000 samples? From
                                  a practical side, I've never seen a variogram with 27 samples that has any
                                  functional use whatsoever.

                                  Digby
                                • Colin Daly
                                  Hi Meng-Ying The calculation of the experimental variance on a finite set of data (population or sample) is simply a mathematical operation - in itself it has
                                  Message 16 of 18 , Dec 8, 2004

                                    Hi Meng-Ying

                                    The calculation of the experimental variance on a finite set of data
                                    (population or sample) is simply a mathematical operation. In itself it has
                                    no more meaning than, say, adding the square of the first value to the cube
                                    root of the second and dividing the answer by the geometric mean of the rest
                                    of them. What bestows meaning on this particular calculation is (roughly
                                    speaking) the assumption that 'each of the 27 values varies about the same
                                    mean value with the same distribution of variability at each point'. If we
                                    did not believe this, if for example each of the 27 points sampled completely
                                    different phenomena, then there would be little point in using the variance
                                    as a means of describing the data (spatial data or not). In other words, for
                                    the variance to have the sort of meaning that we usually ascribe to it, as
                                    variation about the mean value, we interpret the observations as realisations
                                    of some random variable, and then the calculated variance as estimating the
                                    mathematical idealisation that is the variance of the RV.

                                    Likewise, if we interpret the data as being spatial data, we may calculate
                                    the experimental variogram and try to interpret it as an estimate of the
                                    theoretical variogram of some idealised random function. For a standard
                                    variogram to be interpretable we need the first and second moments of the
                                    first-order increments Z(x+h)-Z(x) to be invariant under translations. Under
                                    the more stringent criterion that we may reasonably model our data as
                                    2nd-order stationary, the mean of Z(x) is constant everywhere and the
                                    variability about the mean is the same at each point, so we can calculate
                                    the variance. It is then a theorem, in the context of this model, that the
                                    sill is equal to the variance.

                                    Outside the context of the stationarity hypothesis, the variance of the data
                                    loses its meaning as variation about a mean value, so it is a meaningless
                                    calculation. So it is hardly a surprise, or a concern, that it does not agree
                                    with the sill. The variogram seems to retain its objectivity a bit longer,
                                    until the increments are no longer well modelled as stationary.
                                    For the small populations that you give, well, the two calculations (variance
                                    and sill) are just numbers. If you try to ascribe meaning to them in the
                                    context of stationarity, then their likely variations about some 'true' value
                                    come into play.
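
For readers who want the theorem spelled out, the standard short derivation is sketched below. The notation is assumed rather than taken from the thread: C(h) denotes the covariance function of the second-order stationary model and sigma^2 its variance, and the last step additionally requires that C(h) vanishes at large separations.

```latex
\begin{align}
\gamma(h) &= \tfrac{1}{2}\,\mathrm{E}\!\left[\bigl(Z(x+h) - Z(x)\bigr)^{2}\right] \\
          &= \tfrac{1}{2}\bigl(\operatorname{Var} Z(x+h) + \operatorname{Var} Z(x)\bigr)
             - \operatorname{Cov}\bigl(Z(x+h),\, Z(x)\bigr) \\
          &= C(0) - C(h), \qquad C(0) = \sigma^{2}, \\
\lim_{|h| \to \infty} \gamma(h) &= \sigma^{2}
          \quad \text{whenever } C(h) \to 0 \text{ as } |h| \to \infty .
\end{align}
```

The second line uses the constant mean, so the expected increment is zero. This is a statement about the model; for a finite data set the experimental variance and the experimental sill are both estimates and can differ.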

                                    Anyhow, I will be away for the next few days so will miss the end of this
                                    topic (much to the relief of everyone on ai-geostat no doubt!) - but it was
                                    fun - and took me back a good few years (wishing i listened a bit better in matheron's classes on his 'estimating and choosing' book)

                                    Regards

                                    Colin Daly

                                  • Digby Millikan
                                    Meng-Ying, For interests sake could you perform the same experiment for a stationary sample set of size 1000. Regards Digby * By using the ai-geostats mailing
                                    Message 17 of 18 , Dec 8, 2004
                                      Meng-Ying,

                                      For interests sake could you perform the same experiment for a stationary
                                      sample set of size 1000.

                                      Regards Digby
                                    • Meng-Ying Li
                                      ... I did that. But with this short influence range of just 3 lags in a population of size 1000 (0.3% of the domain), the correlation of data doesn t do much
                                      Message 18 of 18 , Dec 8, 2004
                                        > Meng-Ying,
                                        >
                                        > For interests sake could you perform the same experiment for a
                                        > stationary sample set of size 1000.
                                        >
                                        > Regards Digby

                                        I did that. But with this short influence range of just 3 lags in a
                                        population of size 1000 (0.3% of the domain), the correlation of the data
                                        has little influence on the population variance. That's why I looked
                                        to another data set to speak for me.
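
A rough way to see the effect of population size is sketched below. The generator is an assumption: the thread never says how the data were produced, so a three-term moving sum of white noise is used here because its theoretical variogram reaches its sill exactly at lag 3, and both the theoretical sill and the theoretical variance equal 3. The seed, the sizes, and the lag used to read off the sill are arbitrary choices.

```python
import numpy as np

def simulate_ma3(n, rng):
    """Three-term moving sum of unit white noise: range 3, variance 3."""
    e = rng.standard_normal(n + 2)
    return np.convolve(e, np.ones(3), mode="valid")  # length n

def semivariance(values, h):
    """Experimental semivariance at integer lag h on a regular 1-D grid."""
    return 0.5 * np.mean((values[h:] - values[:-h]) ** 2)

rng = np.random.default_rng(0)   # arbitrary seed
results = {}
for n in (27, 1000, 200_000):
    z = simulate_ma3(n, rng)
    # Lag 5 is beyond the range of 3, so it sits on the theoretical sill.
    results[n] = (np.var(z), semivariance(z, 5))
    print(f"n={n:>7d}  variance={results[n][0]:.3f}  gamma(5)={results[n][1]:.3f}")
```

For the largest size both numbers should sit close to the theoretical value of 3, while for n = 27 they can wander well away from it and from each other.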

                                        For people interested in this phenomenon: I used the second realization of
                                        SGSIM.OUT in the GSLIB manual as the population, added coordinates to this
                                        realization with <addcoord>, and calculated the omni-directional variogram
                                        with <gamv>. The screen output of the <gamv> calculation shows the overall
                                        variance, which doesn't fit the sill in the variogram if you set the
                                        maximum lag distance to 30.


                                        Mng-yng
