## [ai-geostats] Re: Sill versus least-squares classical variance estimate

Message 1 of 18 , Dec 8, 2004
Meng,

You want an evenly spaced sample pattern for your estimation of the
variance. If you use samples within range of each other, they form
clusters of samples which will overweight that area; hence, by removing
samples closer together than the range, you remove "clusters" of samples.
A common method of performing statistics on spatial data is first to
perform data declustering, then calculate your statistics. However, as
Isobel points out, a fast way to do this is to remove samples that are
closer together than the range.
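
As a rough illustration of the declustering step mentioned above, here is a minimal sketch of cell declustering, assuming 2-D coordinates and a user-chosen cell size. The function names, the cell size, and the toy data are all hypothetical and are not taken from this thread or from any particular package.

```python
import numpy as np

def cell_decluster_weights(coords, cell_size):
    """Weight each sample by 1 / (number of samples sharing its grid cell)."""
    coords = np.asarray(coords, dtype=float)
    cells = np.floor(coords / cell_size).astype(int)
    # Identify which occupied cell each sample falls in and how crowded it is.
    _, inverse, counts = np.unique(
        cells, axis=0, return_inverse=True, return_counts=True
    )
    inverse = inverse.reshape(-1)
    weights = 1.0 / counts[inverse]
    return weights / weights.sum()  # normalise so the weights sum to 1

def weighted_mean_var(values, weights):
    """Declustered mean and variance using the normalised weights."""
    values = np.asarray(values, dtype=float)
    mean = np.sum(weights * values)
    var = np.sum(weights * (values - mean) ** 2)
    return mean, var

# Hypothetical toy data: three clustered high values and one isolated low value.
coords = [[0.1, 0.1], [0.2, 0.2], [0.3, 0.1], [5.5, 5.5]]
values = [10.0, 10.0, 10.0, 2.0]

weights = cell_decluster_weights(coords, cell_size=1.0)
mean, var = weighted_mean_var(values, weights)
print("naive mean:", np.mean(values), "declustered mean:", mean)
print("declustered variance:", var)
```

The three clustered samples share one cell and therefore share one unit of weight, so the cluster no longer dominates the statistics. The result depends on the cell size, which in practice has to be chosen with some care.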

Digby
Message 2 of 18 , Dec 8, 2004
Dear Meng-Ying,

It's not that you are defining the variance to be the variance of data
beyond the range of the variogram. Say you have a panel made up of
1 million samples which cover the entire panel, then you select 1000
samples to estimate the variance. If two samples of the thousand are
within range of each other (close and similar in value), then you are
effectively doubling up on one of the samples, so to give a better
representation of the 1 million samples you are better off removing the
doubled-up sample, giving 999 samples to estimate the variance of the
1 million. This will give a better estimate of the variance you could
calculate from the million by the least squares classical method, which is
what Isobel was saying.

Regards Digby
Message 3 of 18 , Dec 8, 2004
Meng-Ying

No, I do not think we are communicating.

The variance of data values is not affected by
correlation between the sample values.

The estimated variance for the population IS affected
by correlation between the sample values. Statistical
inference about the population is based on the
assumption that samples were taken randomly and
independently from that population.

It is the process of estimation of unknown parameters
by classical statistical theory which requires these
assumptions.
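
The effect of correlation on the classical estimator can be written out explicitly. The following is a sketch under stated assumptions: the samples Z_1, ..., Z_n share a constant mean and a common variance sigma^2, and rho_ij denotes the correlation between samples i and j. This notation is introduced here for illustration and does not appear in the thread.

```latex
\mathrm{E}\!\left[s^2\right]
  = \mathrm{E}\!\left[\frac{1}{n-1}\sum_{i=1}^{n}\left(Z_i-\bar{Z}\right)^2\right]
  = \sigma^2\left(1-\bar{\rho}\right),
\qquad
\bar{\rho} = \frac{1}{n(n-1)}\sum_{i\neq j}\rho_{ij}
```

When every rho_ij is zero this reduces to E[s^2] = sigma^2, which is the textbook result. When the samples are positively correlated on average, s^2 falls below sigma^2 in expectation.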

Geostatistical inference does not require absence of
correlation, quite the contrary. The semi-variogram
graph is constructed on the assumption that there is a
correlation between samples and that this depends on
distance and direction between the pair of samples.

If we have a stationary situation, where the mean and
variance are constant over the study area, the
semi-variogram generally reaches a sill value. The
distance at which this happens is interpreted as that
distance beyond which the correlation is zero. Sample
pairs at this distance or greater can be used to
estimate the variance, since the statistical
assumptions are now satisfied.
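
The link between the sill and the variance can be sketched in one line, assuming second-order stationarity with variance sigma^2, covariance function C(h), and range a. These symbols are introduced here for illustration.

```latex
\gamma(h)
  = \tfrac{1}{2}\,\mathrm{Var}\!\left[Z(x)-Z(x+h)\right]
  = \tfrac{1}{2}\left(\sigma^2+\sigma^2-2\,C(h)\right)
  = \sigma^2 - C(h),
\qquad
C(h)=0 \ \text{for}\ h\ge a \;\Rightarrow\; \gamma(h)=\sigma^2
```

So under these assumptions the semi-variogram levels off at sigma^2 once the pairs are far enough apart to be uncorrelated.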

Isobel
http://geoecosse.bizland.com/whatsnew.htm

--- Meng-Ying Li <mengyl@...> wrote:
> Hi Isobel,
>
> I understand all the points you made, but I'm not sure why the variance
> should be defined from data that are NOT SPATIALLY CORRELATED when they
> may or may not be correlated.
>
> Thanks for the clarification, though. I don't think I'd be able to
> clarify the things you clarified. You're good.
>
>
> Meng-ying
>
> On Wed, 8 Dec 2004, Isobel Clark wrote:
>
> > Meng-Ying
> >
> > I don't know how to say this any other way. At distances larger than
> > the range of influence, samples are NOT SPATIALLY CORRELATED.
> >
> > The variance of the difference between two uncorrelated samples is
> > twice the variance of one sample around the mean.
> >
> > The semi-variogram is one-half of the variance of the difference.
> >
> > Hence the sill is (theoretically) equal to the variance. The sill is
> > based on all pairs of samples found at a distance greater than the
> > range of influence.
> >
> > The classical statistical estimator of the variance is only unbiassed
> > if the correct degrees of freedom are used. If the samples are
> > correlated, n-1 is NOT the correct degrees of freedom.
> >
> > All explained in immense detail in Practical Geostatistics 2000,
> > Clark and Harper, http://geoecosse.hypermart.net
> >
> > Did I get it clear this time?
> > Isobel
> >
> > --- Meng-Ying Li <mengyl@...> wrote:
> > > I understand why it is not appropriate to force the sill so it
> > > matches the sample variance. My question is, why estimate the
> > > overall variance by the sill value when data are actually correlated?
> > >
> > >
> > > Meng-ying
> > >
> > > On Tue, 7 Dec 2004, Isobel Clark wrote:
> > >
> > > > Meng-Ying
> > > >
> > > > We are talking about estimating the variance of a set of samples
> > > > where spatial dependence exists.
> > > >
> > > > The classical statistical unbiassed estimator of the population
> > > > variance is s-squared, which is the sum of the squared deviations
> > > > from the mean divided by the relevant degrees of freedom. If the
> > > > samples are not inter-correlated, the relevant degrees of freedom
> > > > are (n-1). This gives the formula you find in any introductory
> > > > statistics book or course.
> > > >
> > > > If samples are not independent of one another, the degrees of
> > > > freedom issue becomes a problem and the classical estimator will
> > > > be biassed (generally too small on average).
> > > >
> > > > In theory, pairs of samples beyond the range of influence on a
> > > > semi-variogram graph are independent of one another. In theory,
> > > > the variance of the difference between two values which are
> > > > uncorrelated is twice the variance of one sample around the
> > > > population mean. This is thought to be why Matheron defined the
> > > > semi-variogram (one-half the squared difference) so that the
> > > > final sill would be (theoretically) equal to the population
> > > > variance.
> > > >
> > > > There are computer software packages which will draw a line on
> > > > your experimental semi-variogram at the height equivalent to the
> > > > classically calculated sample variance. Some people try to force
> > > > their semi-variogram models to go through this line. This is dumb
> > > > as the experimental sill is a better estimate because it does
> > > > have the degrees of freedom it is supposed to have.
> > > >
> > > > I am not sure whether this is clear enough. If you email me off
> > > > the list, I can recommend publications
> > > >
> > > > Isobel
> > > > http://geoecosse.bizland.com/books.htm
> > > >
> > > > --- Meng-Ying Li <mengyl@...> wrote:
> > > > > Hi Isobel,
> > > > >
> > > > > Could you explain why it would be a better estimate of the
> > > > > variance when independence is considered? I'd rather think that
> > > > > we consider the dependence when the overall variance is to be
> > > > > estimated -- if there actually is dependence between values.
> > > > >
> > > > > Or are you talking about modeling the sill value by the
> > > > > stabilizing tail on the experimental variogram, instead of
> > > > > modeling by the calculated overall variance?
> > > > >
> > > > > Or, are we talking about variance of different definitions? I'd
> > > > > be concerned if I missed some point of the original definition
> > > > > for variances, like, the variance should be defined with no
> > > > > dependence between values or something like that. Frankly, I
> > > > > don't think I took the definition of variance too seriously
> > > > > when I was learning stats.
> > > > >
> > > > >
> > > > > Meng-ying
> > > > >
> > > > > > > Digby
> > > > > > >
> > > > > > > I see where you are coming from on this, but in fact the
> > > > > > > sill is composed of those pairs of samples which are
> > > > > > > independent of one another - or, at least, have
>
=== message truncated ===
Message 4 of 18 , Dec 8, 2004
Hi Digby and All,

I did a little experiment on the idea that Digby mentioned, that the sill
will estimate the population variance, but found it was not true in my
experiment:

1. I generated a set of one-dimensional data with 27 points at regular
unit spacing, which I take as the true, or population, values. I
deliberately generated the data to have an influence range of three
length units.
2. I calculated the experimental variogram. Notice that this variogram is
the population variogram. The sill value is around 2.8.
3. But the population variance is 2.39, lower than the sill value.

This confirms my doubt about using the sill value as the estimate of the
population variance, since I calculated the variogram and variance from
all data points. Please tell me what you think. The data I generated are
as follows:

0.056970748
0.14520424
0.849710204
1.650514605
1.101666385
1.015177986
2.150259206
2.830780659
0.223495817
-2.47615958
-3.372697392
-0.530685611
0.786582177
0.970673
0.674755256
0.338461632
1.020874834
0.410936991
1.702892405
2.649748012
4.290179731
3.442015668
1.488818953
0.862788738
0.728709892
2.398182914
1.522546427
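
The experiment above can be reproduced with a short sketch, assuming the 27 values are equally spaced one unit apart as described. The function name and the pooled check are illustrative additions, not part of the original post. The check uses the standard identity that the population variance (divisor n) equals the average of half the squared differences over all n^2 ordered pairs, zero-distance pairs included, so the short lags where the variogram is still below the sill pull the variance below the sill.

```python
import numpy as np

# The 27 unit-spaced values listed above.
z = np.array([
    0.056970748, 0.14520424, 0.849710204, 1.650514605, 1.101666385,
    1.015177986, 2.150259206, 2.830780659, 0.223495817, -2.47615958,
    -3.372697392, -0.530685611, 0.786582177, 0.970673, 0.674755256,
    0.338461632, 1.020874834, 0.410936991, 1.702892405, 2.649748012,
    4.290179731, 3.442015668, 1.488818953, 0.862788738, 0.728709892,
    2.398182914, 1.522546427,
])

def experimental_variogram(z):
    """Return lags, pair counts N(h) and gamma(h) for unit-spaced 1-D data."""
    n = len(z)
    lags = np.arange(1, n)
    counts = n - lags  # number of pairs available at each lag
    gamma = np.array([0.5 * np.mean((z[h:] - z[:-h]) ** 2) for h in lags])
    return lags, counts, gamma

lags, counts, gamma = experimental_variogram(z)

pop_var = z.var()  # population variance, divisor n (ddof=0)
# Pair-weighted average of gamma over all n^2 ordered pairs (gamma(0) = 0).
pooled = 2.0 * np.sum(counts * gamma) / len(z) ** 2

for h, n_h, g in zip(lags[:10], counts[:10], gamma[:10]):
    print(f"lag {h:2d}  pairs {n_h:2d}  gamma {g:.3f}")
print("population variance:", pop_var)
print("pair-weighted average of gamma:", pooled)
```

The last two printed numbers agree, which shows that for a finite data set the variance computed from all points is an average of the variogram over every separation present in the domain rather than its value at long lags only.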
Message 5 of 18 , Dec 8, 2004

Hi Digby

Sorry to say - but suggesting that less data is systematically better is mistaken - this is fundamental... and is covered in the opening pages of any good intro to geostats. If the data are clustered - then you might have to decluster in some sense - but with an unbiased sample you will use all million samples. Please, any new users of geostats lucky enough to have a million samples - don't throw 99.9% of your data away!!

Declustering is about trying to remove the bias that most realistic sampling strategies have (e.g. in petroleum, you tend to drill into the best reservoir regions first...). If your data are an unbiased sample from the true histogram (i.e. what you would get by mining out the resource fully) then you will use all of it for estimating any statistic. This does not mean that the samples have to be far apart - just that they don't cluster into high or low regions.

There seems to be some confusion about independence and estimates. Suppose the mean (and/or variance) is being estimated (provisos: 1) unbiased sample data, 2) stationarity, so that mean and variance have a meaning); then the estimate is unbiased - irrespective of the correlation of the data - and what does depend on the correlation is the error in the estimation. For a zero correlation length, the variance of the error of the mean drops as 1/n. For a non-zero correlation length it drops more slowly than 1/n - but you do not get quicker convergence by throwing away good data - in fact virtually always you will get a strictly worse estimate!
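
The statement about the error of the mean can be checked numerically. This is a minimal sketch, assuming equally weighted samples on a 1-D line and an exponential correlation model exp(-d / L); the model, the correlation length, the spacing, and the function name are all assumptions chosen for illustration.

```python
import numpy as np

def variance_of_mean(positions, corr_length, sigma2=1.0):
    """Exact variance of the equally weighted mean.

    Covariance between two samples a distance d apart is assumed to be
    sigma2 * exp(-d / corr_length); corr_length = 0 means independence.
    """
    x = np.asarray(positions, dtype=float)
    d = np.abs(x[:, None] - x[None, :])
    if corr_length > 0:
        cov = sigma2 * np.exp(-d / corr_length)
    else:
        cov = sigma2 * (d == 0)
    # Var(mean) = (sum of all covariance entries) / n^2
    return cov.sum() / len(x) ** 2

all_points = np.arange(100)   # 100 samples at unit spacing
thinned = all_points[::10]    # keep every 10th sample only

independent = variance_of_mean(all_points, corr_length=0.0)
correlated_all = variance_of_mean(all_points, corr_length=3.0)
correlated_thinned = variance_of_mean(thinned, corr_length=3.0)

print("independent, n=100:      ", independent)
print("correlated, all 100:     ", correlated_all)
print("correlated, 10 retained: ", correlated_thinned)
```

In this configuration the correlated case is worse than the independent 1/n case, and discarding nine samples in ten makes the error variance of the mean larger still, not smaller.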

Regards

Colin

-----Original Message-----
From:   Digby Millikan [mailto:digbym@...]
Sent:   Wed 12/8/2004 5:12 PM
To:     ai-geostats; Meng-Ying  Li
Cc:
Subject:        Re: [ai-geostats] Re: Sill versus least-squares classical variance estimate
Dear Meng-Ying,

It's not that you are defining the variance to be the variance of data
beyond the range of the variogram. Say you have a panel made up of
1 million samples which cover the entire panel, then you select 1000
samples to estimate the variance. If two samples of the thousand are
within range of each other (close and similar in value), then you are
effectively doubling up on one of the samples, so to give a better
representation of the 1 million samples you are better off removing the
doubled-up sample, giving 999 samples to estimate the variance of the
1 million. This will give a better estimate of the variance you could
calculate from the million by the least squares classical method, which is
what Isobel was saying.

Regards Digby

 ```DISCLAIMER: This message contains information that may be privileged or confidential and is the property of the Roxar Group. It is intended only for the person to whom it is addressed. If you are not the intended recipient, you are not authorised to read, print, retain, copy, disseminate, distribute, or use this message or any part thereof. If you receive this message in error, please notify the sender immediately and delete all copies of this message. ```
Message 6 of 18 , Dec 8, 2004

Hi Digby

Yes, I agree with what you say below - if your only aim was to estimate the variance and you could only collect 1000 samples - then choose them to be 'maximally independent' to reduce the variance of the error. But note, as Don said yesterday, a random sample, which is clustered, will also give an unbiased estimate of the variance but with a somewhat larger error of estimation. (Of course, there may be reasons not to take all the samples as far from one another as possible - for example to estimate the variogram close to the origin, which is its most important part - but that is another story.) The bit that I disagreed with in your original message was the bit that said
"....giving 999 samples to estimate the variance of the 1 million. This will give a better estimate of the variance you could calculate from the million by the least squares classical method, which is what Isobel was saying"
I understood this to say that you would do better with 1000 (or 999) points than with the full million... if that is not what you meant then, yes, I did misunderstand.

Colin

-----Original Message-----
From:   Digby Millikan [mailto:digbym@...]
Sent:   Wed 12/8/2004 7:32 PM
To:     ai-geostats
Cc:
Subject:        Re: [ai-geostats] Re: Sill versus least-squares classical variance estimate
Colin,

You misunderstood me; the 1 million data are the total unknown dataset. Say
you have a volume in a mine and its volume is 1 million 1 metre core
samples. You drill the volume and have a sample set of 1000 1 m core
samples. You then analyse the statistics of the 1000 samples to try to
estimate the variance of the total volume (1 million core samples). So your
estimate of the variance comes from the 1000 samples. You can plot the
variogram of the 1000 samples and you can also calculate their variance.
You are trying to estimate the variance of the 1 million pieces of core
which you do not have. So you must decide whether your 1000 sample set is a
true representation of the 1 million. Our argument is that samples within
the 1000 which are clustered together do not create a good representation
of the true dataset and will create a biased estimate.

Digby
www.users.on.net/~digbym

Message 7 of 18 , Dec 8, 2004
> Meng-Ying,
>
> For interest's sake, could you perform the same experiment for a
> stationary sample set of size 1000.
>
> Regards Digby

I did that. But with this short influence range of just 3 lags in a
population of size 1000 (0.3% of the domain), the correlation of the data
has little influence on the population variance. That's why I looked
for another data set to make the point.
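
The size effect described here can be estimated without simulating any data. The sketch below assumes a spherical variogram with range 3, no nugget, and unit spacing; the choice of model and the function names are assumptions for illustration, since the post does not say which model was used. Under those assumptions, the expected population variance (divisor n) divided by the sill is one minus the average correlation over all pairs of locations.

```python
import numpy as np

def spherical_correlation(h, a):
    """Correlation implied by a spherical variogram with range a (zero beyond a)."""
    h = np.asarray(h, dtype=float)
    rho = 1.0 - 1.5 * (h / a) + 0.5 * (h / a) ** 3
    return np.where(h < a, rho, 0.0)

def expected_variance_ratio(n, a):
    """E[population variance of n unit-spaced values] / sill.

    Equals 1 minus the mean of the n x n correlation matrix, because
    E[(1/n) * sum((Z_i - Zbar)^2)] = sigma^2 - Var(Zbar).
    """
    idx = np.arange(n)
    d = np.abs(idx[:, None] - idx[None, :])
    return 1.0 - spherical_correlation(d, a).mean()

for n in (27, 1000):
    ratio = expected_variance_ratio(n, a=3.0)
    print(f"n = {n:4d}: expected variance / sill = {ratio:.4f}")
```

With 27 points the expected variance sits noticeably below the sill, while with 1000 points the gap nearly vanishes, which is in line with the observation above.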

For people interested in this phenomenon, I used the second realization of
SGSIM.OUT in the GSLIB manual as the population, added coordinates to this
realization with <addcoord>, and calculated the omni-directional variogram
with <gamv>. The <gamv> screen output shows the overall variance, which
doesn't match the sill in the variogram if you set the maximum lag
distance to 30.

Mng-yng

On Thu, 9 Dec 2004, Digby Millikan wrote:

> Meng-Ying,
>
> For interest's sake, could you perform the same experiment for a stationary
> sample set of size 1000.
>
> Regards Digby
>
>
>