
[ai-geostats] Re: Sill versus least-squares classical variance estimate

  • Digby Millikan
    Message 1 of 18 , Dec 8, 2004
      Meng,

      You want an evenly spaced sample pattern for your estimation of the
      variance. If you use samples within range of each other, these form
      clusters of samples which will overweight that area; hence, by removing
      samples closer together than the range, you remove "clusters" of
      samples. A common method of performing statistics on spatial data is
      first to perform data declustering, then calculate your statistics.
      However, as Isobel points out, a fast way to do this is to remove
      samples closer together than the range.
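
The declustering step mentioned above can be sketched roughly as follows. This is a minimal one-dimensional cell-declustering illustration under stated assumptions, not a procedure taken from this thread: the function names, the cell size and the sample values are all made up for the example. Each sample is weighted by the inverse of the number of samples sharing its cell, so a tight cluster counts roughly as one sample.

```python
import numpy as np

def cell_decluster_weights(x, cell_size):
    """Weights summing to 1; samples sharing a cell split that cell's weight."""
    x = np.asarray(x, dtype=float)
    # Assign each sample to a cell of width `cell_size` (hypothetical choice).
    cells = np.floor((x - x.min()) / cell_size).astype(int)
    _, inverse, counts = np.unique(cells, return_inverse=True, return_counts=True)
    w = 1.0 / counts[inverse]
    return w / w.sum()

def weighted_mean_var(z, w):
    """Weighted mean and weighted variance (weights assumed to sum to 1)."""
    z = np.asarray(z, dtype=float)
    m = np.sum(w * z)
    return m, np.sum(w * (z - m) ** 2)

# Hypothetical data: four samples clustered around x = 20 in a high-valued zone.
x = np.array([0.0, 10.0, 20.0, 20.5, 21.0, 21.5, 30.0, 40.0])
z = np.array([1.0, 2.0, 8.0, 8.2, 7.9, 8.1, 3.0, 2.0])

w = cell_decluster_weights(x, cell_size=5.0)
m_decl, v_decl = weighted_mean_var(z, w)
print("naive mean:", z.mean(), "declustered mean:", m_decl)
print("naive variance:", z.var(), "declustered variance:", v_decl)
```

In this made-up example the clustered high values pull the naive mean upward, and the declustered weights reduce their influence. The result depends on the cell size, which has to be chosen for the data at hand.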

      Digby
    • Digby Millikan
      Message 2 of 18 , Dec 8, 2004
        Dear Meng-Ying,

        It's not that you are defining the variance to be the variance of
        data beyond the range of the variogram. Say you have a panel made up
        of 1 million samples which cover the entire panel, and you select
        1000 samples to estimate the variance. If two samples of the thousand
        are within range of each other (close together and similar in value),
        then you are effectively doubling up on one of the samples. To give a
        better representation of the 1 million samples, you are better off
        removing the doubled-up sample, leaving 999 samples to estimate the
        variance of the 1 million. This will give a better estimate of the
        variance you could calculate from the million by the least-squares
        classical method, which is what Isobel was saying.

        Regards Digby
      • Isobel Clark
        Message 3 of 18 , Dec 8, 2004
          Meng-Ying

          No, I do not think we are communicating.

          The variance of data values is not affected by
          correlation between the sample values.

          The estimated variance for the population IS affected
          by correlation between the sample values. Statistical
          inference about the population is based on the
          assumption that samples were taken randomly and
          independently from that population.

          It is the process of estimation of unknown parameters
          by classical statistical theory which requires these
          assumptions.
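
The effect of correlation on the classical estimator can be illustrated with a small calculation. The sketch below assumes a stationary model with constant mean, variance `sigma2` and an exponential covariance with a hypothetical length parameter; none of these choices come from the thread. Under those assumptions the expected value of the (n-1) estimator is (trace(C) - sum(C)/n) / (n-1), where C is the covariance matrix of the samples.

```python
import numpy as np

def expected_classical_variance(cov):
    """E[s^2] for the (n-1) estimator, given the n x n covariance matrix of the samples."""
    cov = np.asarray(cov, dtype=float)
    n = cov.shape[0]
    # E[sum (z_i - zbar)^2] = trace(C) - (1/n) * (sum of all entries of C)
    return (np.trace(cov) - cov.sum() / n) / (n - 1)

def exponential_cov(n, sigma2, length):
    """Covariance matrix for n unit-spaced samples (assumed exponential model)."""
    idx = np.arange(n)
    lags = np.abs(np.subtract.outer(idx, idx))
    return sigma2 * np.exp(-lags / length)

sigma2 = 1.0
independent = expected_classical_variance(sigma2 * np.eye(27))
correlated = expected_classical_variance(exponential_cov(27, sigma2, length=3.0))
print("E[s^2], independent samples:", independent)
print("E[s^2], correlated samples: ", correlated)
```

With independent samples the expectation equals the model variance; with positively correlated samples it falls below it, which is the sense in which the classical estimator is "too small on average" relative to the model variance.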

          Geostatistical inference does not require absence of
          correlation, quite the contrary. The semi-variogram
          graph is constructed on the assumption that there is a
          correlation between samples and that this depends on
          distance and direction between the pair of samples.

          If we have a stationary situation, where the mean and
          variance are constant over the study area, the
          semi-variogram generally reaches a sill value. The
          distance at which this happens is interpreted as that
          distance beyond which the correlation is zero. Sample
          pairs at this distance or greater can be used to
          estimate the variance, since the statistical
          assumptions are now satisfied.
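
The reasoning in the paragraph above can be written out as a short derivation. It assumes second-order stationarity, with variance \sigma^2, covariance C(h) and range a; these symbols are introduced here for illustration and are not in the original message.

```latex
\gamma(h) = \tfrac{1}{2}\,\operatorname{Var}\bigl[Z(x) - Z(x+h)\bigr]
          = \tfrac{1}{2}\bigl[\sigma^2 + \sigma^2 - 2\,C(h)\bigr]
          = \sigma^2 - C(h)

C(h) = 0 \ \text{for } h \ge a
\quad\Longrightarrow\quad
\gamma(h) = \sigma^2 \ \text{for } h \ge a
```

So, under these assumptions, the sill of the semi-variogram equals the variance of the model.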

          Isobel
          http://geoecosse.bizland.com/whatsnew.htm




          --- Meng-Ying Li <mengyl@...> wrote:
          > Hi Isobel,
          >
          > I understand all the points you made, but I'm not sure why the
          > variance should be defined from data that are NOT SPATIALLY
          > CORRELATED when they may or may not be correlated.
          >
          > Thanks for the clarification, though. I don't think I'd be able to
          > clarify the things you clarify. You're good.
          >
          > Meng-ying
          >
          > On Wed, 8 Dec 2004, Isobel Clark wrote:
          >
          > > Meng-Ying
          > >
          > > I don't know how to say this any other way. At distances larger
          > > than the range of influence, samples are NOT SPATIALLY CORRELATED.
          > >
          > > The variance of the difference between two uncorrelated samples is
          > > twice the variance of one sample around the mean.
          > >
          > > The semi-variogram is one-half of the variance of the difference.
          > >
          > > Hence the sill is (theoretically) equal to the variance. The sill
          > > is based on all pairs of samples found at a distance greater than
          > > the range of influence.
          > >
          > > The classical statistical estimator of the variance is only
          > > unbiassed if the correct degrees of freedom are used. If the
          > > samples are correlated, n-1 is NOT the correct degrees of freedom.
          > >
          > > All explained in immense detail in Practical Geostatistics 2000,
          > > Clark and Harper, http://geoecosse.hypermart.net
          > >
          > > Did I get it clear this time?
          > > Isobel
          > >
          > > --- Meng-Ying Li <mengyl@...> wrote:
          > > > I understand why it is not appropriate to force the sill so it
          > > > matches the sample variance. My question is, why estimate the
          > > > overall variance by the sill value when data are actually
          > > > correlated?
          > > >
          > > > Meng-ying
          > > >
          > > > On Tue, 7 Dec 2004, Isobel Clark wrote:
          > > >
          > > > > Meng-Ying
          > > > >
          > > > > We are talking about estimating the variance of a set of
          > > > > samples where spatial dependence exists.
          > > > >
          > > > > The classical statistical unbiassed estimator of the population
          > > > > variance is s-squared, which is the sum of the squared
          > > > > deviations from the mean divided by the relevant degrees of
          > > > > freedom. If the samples are not inter-correlated, the relevant
          > > > > degrees of freedom are (n-1). This gives the formula you find
          > > > > in any introductory statistics book or course.
          > > > >
          > > > > If samples are not independent of one another, the degrees of
          > > > > freedom issue becomes a problem and the classical estimator
          > > > > will be biassed (generally too small on average).
          > > > >
          > > > > In theory, pairs of samples beyond the range of influence on a
          > > > > semi-variogram graph are independent of one another. In theory,
          > > > > the variance of the difference between two values which are
          > > > > uncorrelated is twice the variance of one sample around the
          > > > > population mean. This is thought to be why Matheron defined the
          > > > > semi-variogram (one-half the squared difference) so that the
          > > > > final sill would be (theoretically) equal to the population
          > > > > variance.
          > > > >
          > > > > There are computer software packages which will draw a line on
          > > > > your experimental semi-variogram at the height equivalent to
          > > > > the classically calculated sample variance. Some people try to
          > > > > force their semi-variogram models to go through this line. This
          > > > > is dumb as the experimental sill is a better estimate because
          > > > > it does have the degrees of freedom it is supposed to have.
          > > > >
          > > > > I am not sure whether this is clear enough. If you email me off
          > > > > the list, I can recommend publications which might help you
          > > > > out.
          > > > >
          > > > > Isobel
          > > > > http://geoecosse.bizland.com/books.htm
          > > > >
          > > > > --- Meng-Ying Li <mengyl@...> wrote:
          > > > > > Hi Isobel,
          > > > > >
          > > > > > Could you explain why it would be a better estimate of the
          > > > > > variance when independence is considered? I'd rather think
          > > > > > that we consider the dependence when the overall variance is
          > > > > > to be estimated -- if there actually is dependence between
          > > > > > values.
          > > > > >
          > > > > > Or are you talking about modeling the sill value by the
          > > > > > stabilizing tail on the experimental variogram, instead of
          > > > > > modeling by the calculated overall variance?
          > > > > >
          > > > > > Or, are we talking about variance of different definitions?
          > > > > > I'd be concerned if I missed some point of the original
          > > > > > definition for variances, like, the variance should be
          > > > > > defined with no dependence between values or something like
          > > > > > that. Frankly, I don't think I took the definition of
          > > > > > variance too seriously when I was learning stats.
          > > > > >
          > > > > > Meng-ying
          > > > > >
          > > > > > > Digby
          > > > > > >
          > > > > > > I see where you are coming from on this, but in fact the
          > > > > > > sill is composed of those pairs of samples which are
          > > > > > > independent of one another - or, at least, have
          >
          === message truncated ===
        • Meng-Ying Li
          Message 4 of 18 , Dec 8, 2004
            Hi Digby and All,

            I did a little experiment on the idea that Digby mentioned, that the
            sill will estimate the population variance, but found it not to be
            true in my experiment:

            1. I generated a set of one-dimensional data with 27 points at regular
            unit spacing, which I take as the true, or population, values. On
            purpose, I generated the data so that they have an influence range of
            three length units.
            2. I calculated the experimental variogram. Notice that this variogram
            is the population variogram. The sill value is around 2.8.
            3. But the population variance is 2.39, lower than the sill value.

            This confirms my doubt about using the sill value as the estimate of
            the population variance, since I calculated the variogram and variance
            from all data points. Please tell me what you think. The data I
            generated are as follows:

            0.056970748
            0.14520424
            0.849710204
            1.650514605
            1.101666385
            1.015177986
            2.150259206
            2.830780659
            0.223495817
            -2.47615958
            -3.372697392
            -0.530685611
            0.786582177
            0.970673
            0.674755256
            0.338461632
            1.020874834
            0.410936991
            1.702892405
            2.649748012
            4.290179731
            3.442015668
            1.488818953
            0.862788738
            0.728709892
            2.398182914
            1.522546427
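
The experiment above can be reproduced with a short script. This is a sketch, not the poster's actual calculation: the function name and the choice to compute every lag from 1 to 26 are assumptions. It also checks a general identity that bears on the comparison: the pair-count-weighted average of the experimental semi-variogram over all lags equals the (n-1) variance of the data. Values at short lags lie below that average when the data are correlated, so values at longer lags must on balance lie above it.

```python
import numpy as np

z = np.array([
    0.056970748, 0.14520424, 0.849710204, 1.650514605, 1.101666385,
    1.015177986, 2.150259206, 2.830780659, 0.223495817, -2.47615958,
    -3.372697392, -0.530685611, 0.786582177, 0.970673, 0.674755256,
    0.338461632, 1.020874834, 0.410936991, 1.702892405, 2.649748012,
    4.290179731, 3.442015668, 1.488818953, 0.862788738, 0.728709892,
    2.398182914, 1.522546427,
])

def experimental_semivariogram(z, max_lag):
    """gamma(h) = mean of 0.5 * (z[i+h] - z[i])**2 over all pairs at lag h (unit spacing)."""
    z = np.asarray(z, dtype=float)
    lags = np.arange(1, max_lag + 1)
    gamma = np.array([0.5 * np.mean((z[h:] - z[:-h]) ** 2) for h in lags])
    npairs = len(z) - lags  # number of pairs available at each lag
    return lags, gamma, npairs

lags, gamma, npairs = experimental_semivariogram(z, max_lag=len(z) - 1)

pop_var = z.var()              # divisor n
sample_var = z.var(ddof=1)     # divisor n - 1
pair_weighted = np.sum(npairs * gamma) / np.sum(npairs)

for h, g, k in zip(lags[:10], gamma[:10], npairs[:10]):
    print(f"lag {h:2d}  pairs {k:2d}  gamma {g:.3f}")
print("variance (divisor n):  ", pop_var)
print("variance (divisor n-1):", sample_var)
print("pair-weighted mean of gamma over all lags:", pair_weighted)
```

With only 27 points, the values at longer lags rest on few pairs, so where the sill is read off is itself uncertain.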
          • Colin Daly
            Message 5 of 18 , Dec 8, 2004

              Hi Digby

              Sorry to say, but suggesting that less data is systematically better is mistaken. This is fundamental, and is covered in the opening pages of any good introduction to geostatistics. If the data are clustered, then you might have to decluster in some sense. But with an unbiased sample you will use all million samples. Please, any new users of geostats lucky enough to have a million samples: don't throw 99.9% of your data away!!

              Declustering is about trying to remove the bias that most realistic sampling strategies have (e.g., in petroleum, you tend to drill into the best reservoir regions first...). If your data are an unbiased sample from the true histogram (i.e. what you would get by mining out the resource fully), then you will use all of it for estimating any statistic. This does not mean that the samples have to be far apart, just that they don't cluster into high or low regions.

              There seems to be some confusion about independence and estimates. Suppose the mean (and/or variance) is being estimated, with the provisos that (1) the sample data are unbiased and (2) the situation is stationary (so that mean and variance have a meaning). Then the estimate is unbiased irrespective of the correlation of the data; what does depend on the correlation is the error in the estimation. For a zero correlation length, the variance of the error of the mean drops as 1/n. For a non-zero correlation length it drops more slowly than 1/n, but you do not get quicker convergence by throwing away good data. In fact, you will virtually always get a strictly worse estimate!
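
The point about the error of the mean can be checked numerically. The sketch below assumes an exponential covariance model on a unit-spaced one-dimensional line, with a hypothetical length parameter of 3 and 1000 sample locations; these choices are illustrative and not taken from the thread. It compares the variance of the equal-weight mean for the full set, for a thinned subset (every tenth sample), and for the idealised independent case sigma2/n.

```python
import numpy as np

def variance_of_mean(positions, sigma2, length):
    """Variance of the equal-weight mean of samples at `positions`,
    under an assumed exponential covariance sigma2 * exp(-|h| / length)."""
    p = np.asarray(positions, dtype=float)
    cov = sigma2 * np.exp(-np.abs(np.subtract.outer(p, p)) / length)
    return cov.sum() / len(p) ** 2

sigma2, length = 1.0, 3.0
full = np.arange(1000)      # all sample locations
thinned = full[::10]        # keep every tenth sample, spacing beyond the length parameter

var_full = variance_of_mean(full, sigma2, length)
var_thin = variance_of_mean(thinned, sigma2, length)
var_iid = sigma2 / len(full)  # what 1/n would give if the 1000 samples were independent

print("independent, n=1000:", var_iid)
print("correlated,  n=1000:", var_full)
print("thinned,     n=100: ", var_thin)
```

Under these assumptions the correlated full set does worse than 1/n would suggest, yet it still does better than the thinned subset.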

              Regards

              Colin


              -----Original Message-----
              From:   Digby Millikan [mailto:digbym@...]
              Sent:   Wed 12/8/2004 5:12 PM
              To:     ai-geostats; Meng-Ying  Li
              Cc:    
              Subject:        Re: [ai-geostats] Re: Sill versus least-squares classical variance estimate
              Dear Meng-Ying,

              It's not that you are defining the variance to be the variance of
              data beyond the range of the variogram. Say you have a panel made up
              of 1 million samples which cover the entire panel, and you select
              1000 samples to estimate the variance. If two samples of the thousand
              are within range of each other (close together and similar in value),
              then you are effectively doubling up on one of the samples. To give a
              better representation of the 1 million samples, you are better off
              removing the doubled-up sample, leaving 999 samples to estimate the
              variance of the 1 million. This will give a better estimate of the
              variance you could calculate from the million by the least-squares
              classical method, which is what Isobel was saying.

              Regards Digby





              DISCLAIMER:
              This message contains information that may be privileged or confidential and is the property of the Roxar Group. It is intended only for the person to whom it is addressed. If you are not the intended recipient, you are not authorised to read, print, retain, copy, disseminate, distribute, or use this message or any part thereof. If you receive this message in error, please notify the sender immediately and delete all copies of this message.
              
            • Colin Daly
              Message 6 of 18 , Dec 8, 2004

                Hi Digby

                Yes, I agree with what you say below: if your only aim was to estimate the variance and you could only collect 1000 samples, then choose them to be 'maximally independent' to reduce the variance of the error. But note, as Don said yesterday, a random sample which is clustered will also give an unbiased estimate of the variance, but with a somewhat larger error of estimation. (Of course, there may be reasons not to take all the samples as far from one another as possible, for example to estimate the variogram close to the origin, which is its most important part, but that is another story.) The bit that I disagreed with in your original message was the bit that said
                "....giving 999 samples to estimate the variance of the 1 million. This will give a better estimate of the variance you could calculate from the million by the least squares classical method, which is what Isobel was saying"
                I understood this to say that you would do better with 1000 (or 999) points than with the full million... If that is not what you meant then, yes, I did misunderstand.

                Colin

                -----Original Message-----
                From:   Digby Millikan [mailto:digbym@...]
                Sent:   Wed 12/8/2004 7:32 PM
                To:     ai-geostats
                Cc:    
                Subject:        Re: [ai-geostats] Re: Sill versus least-squares classical variance estimate
                Colin,

                You misunderstood me: the 1 million data are the total unknown
                dataset. Say you have a volume in a mine and its volume is 1 million
                1 metre core samples. You drill the volume and have a sample set of
                1000 1 m core samples. You then analyse the statistics of the 1000
                samples to try to estimate the variance of the total volume (1 million
                core samples). So your estimate of the variance comes from the 1000
                samples. You can plot the variogram of the 1000 samples and you can
                also calculate their variance. You are trying to estimate the variance
                of the 1 million pieces of core which you do not have. So you must
                decide whether your 1000-sample set is a true representation of the
                1 million. Our argument is that samples within the 1000 which are
                clustered together do not create a good representation of the true
                dataset and will create a biased estimate.

                Digby
                www.users.on.net/~digbym





              • Meng-Ying Li
                Message 7 of 18 , Dec 8, 2004
                  > Meng-Ying,
                  >
                  > For interest's sake, could you perform the same experiment for a
                  > stationary sample set of size 1000?
                  >
                  > Regards Digby

                  I did that. But with this short influence range of just 3 lags in a
                  population of size 1000 (0.3% of the domain), the correlation of the
                  data has little influence on the population variance. That's why I
                  looked for another data set to speak for me.

                  For people interested in this phenomenon: I used the second realization
                  in SGSIM.OUT from the GSLIB manual as the population, added coordinates
                  to this realization with <addcoord>, and calculated the omni-directional
                  variogram with <gamv>. The screen output of the <gamv> calculation shows
                  the overall variance, which doesn't match the sill in the variogram if
                  you set the maximum lag distance to 30.


                  Mng-yng

                  On Thu, 9 Dec 2004, Digby Millikan wrote:

                  > Meng-Ying,
                  >
                  > For interest's sake, could you perform the same experiment for a
                  > stationary sample set of size 1000?
                  >
                  > Regards Digby