[ai-geostats] Large samples, t tests, etc
Most of the hypothesis tests that have been mentioned recently on this listserv
are non-spatial, i.e., there is nothing in the underlying statistical
assumptions that specifically pertains to spatial data. The one common
assumption is "random sampling" or "iid" (independent, identically
distributed). In many typical (non-spatial) applications, this assumption is
ensured by the "design of the experiment", i.e., the way the data is generated
and collected. Spatial data problems more often involve "observational data",
where it is rarely possible to design the experiment in such a way as to ensure
this basic assumption.
In the case of spatial data, random site selection does not necessarily
correspond to "random sampling". In the case of the random function model
implicit in most of geostatistics, the data is a non-random sample from one
realization of the random function (in that context using random site selection
does not then make it a "random sample"). Note that not all spatial statistical
analysis methods are based on this random function model.
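
To illustrate why the "iid" assumption matters, here is a minimal simulation
sketch. It assumes a stationary AR(1) series along a transect as a stand-in for
spatially correlated data; that model, the correlation rho = 0.8, the sample
size of 50 and the function names are all my own illustrative choices, not
something taken from the discussion on the list. It applies an ordinary
one-sample t test of a true null hypothesis and counts how often the test
rejects at the nominal 5% level.

```python
import math
import random
import statistics


def t_statistic(values, mu0=0.0):
    """Ordinary one-sample t statistic, which presumes iid observations."""
    n = len(values)
    return (statistics.fmean(values) - mu0) / (statistics.stdev(values) / math.sqrt(n))


def ar1_series(n, rho, rng):
    """Stationary AR(1) series with mean 0 and unit marginal variance (assumed model)."""
    x = rng.gauss(0.0, 1.0)
    series = [x]
    innovation_sd = math.sqrt(1.0 - rho * rho)
    for _ in range(n - 1):
        x = rho * x + innovation_sd * rng.gauss(0.0, 1.0)
        series.append(x)
    return series


def rejection_rate(rho, n=50, reps=2000, t_crit=2.01, seed=1):
    """Fraction of simulated data sets in which the true null (mean 0) is rejected.

    t_crit = 2.01 is approximately the two-sided 5% critical value for 49 df.
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        if abs(t_statistic(ar1_series(n, rho, rng))) > t_crit:
            rejections += 1
    return rejections / reps


rate_independent = rejection_rate(rho=0.0)  # iid case
rate_correlated = rejection_rate(rho=0.8)   # strongly autocorrelated case
print("rejection rate, independent data:", rate_independent)
print("rejection rate, correlated data: ", rate_correlated)
```

Under these assumptions the independent case should reject close to the nominal
rate, while the correlated case should reject far more often, even though the
null hypothesis is true in both.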
Normality is another common underlying assumption in many hypothesis tests. In
the case of random sampling from a distribution with a finite moment of order
2+delta, delta > 0 (for iid data a finite variance is already enough), the
distribution of the standardized sample mean converges IN DISTRIBUTION to a
normal distribution. This means that a sequence of distribution functions is
converging to another function. It is important to note that this convergence
may be pointwise, uniform, or uniform on intervals. Pointwise convergence is
what you usually get from the Central Limit Theorem, which means that the rate
of convergence depends on where you are on the curve. The difference between
using a normal
statistic vs using a t-statistic usually is the difference between a known
variance and an unknown variance (and hence estimated). But in either case the
variance is assumed to exist and be finite. The sample variance can always be
computed from a data set but that does not ensure that the variance of the
distribution exists. For example, the quotient of two independent standard
normal random variables has a Cauchy distribution, for which the mean does not
exist and the variance is not finite. Hence the Central Limit Theorem does not
apply.
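
The Cauchy example can be checked with a short simulation. The sketch below is
only an illustration: the sample sizes, the number of replicates and the helper
names are my own choices. It draws repeated samples, computes the sample mean
of each, and reports the interquartile range of those sample means, first for
normal data and then for the ratio of two independent standard normals.

```python
import random
import statistics


def normal_draw(rng):
    """One standard normal observation."""
    return rng.gauss(0.0, 1.0)


def ratio_of_normals(rng):
    """Ratio of two independent standard normals, i.e. a standard Cauchy draw."""
    return rng.gauss(0.0, 1.0) / rng.gauss(0.0, 1.0)


def iqr_of_sample_means(draw, n, reps, rng):
    """Interquartile range of the sample mean over `reps` samples of size `n`."""
    means = [statistics.fmean(draw(rng) for _ in range(n)) for _ in range(reps)]
    q1, _, q3 = statistics.quantiles(means, n=4)
    return q3 - q1


rng = random.Random(7)
results = {}
for n in (10, 400):
    normal_iqr = iqr_of_sample_means(normal_draw, n, reps=500, rng=rng)
    cauchy_iqr = iqr_of_sample_means(ratio_of_normals, n, reps=500, rng=rng)
    results[n] = (normal_iqr, cauchy_iqr)
    print(f"n={n}: IQR of sample means, normal={normal_iqr:.3f}, Cauchy={cauchy_iqr:.3f}")
```

For the normal data the spread of the sample mean shrinks as n grows. For the
Cauchy data it should not, because the mean of n standard Cauchy variables is
again standard Cauchy. A sample variance can be computed in every replicate,
but that does not make the averaging behave.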
In the case of a non-normal distribution one really needs to know how robust the
test is to deviations from normality; increasing the sample size does not really
solve this problem.
Finally, note that most tests of hypotheses are not exactly "neutral": there is
a tendency to accept the null hypothesis UNLESS there is evidence against it.
This is one of the reasons for the emphasis on the POWER of the test. Often the
null hypothesis is the "status quo", and this logical stance for the null and
alternative hypotheses is okay, but not in all circumstances.
However in some tests for normality (which still depend on the assumption of
random sampling) the test is set up in such a way that the null hypothesis
corresponds to the conclusion of normality, e.g., chi-square tests. If you are
trying to argue that it is safe to assume normality then you want to accept the
null hypothesis, so you should want very high power for the test; you do not
want a small p-value, you want a very large one. Note that the
normal distribution is symmetric but not all symmetric distributions are normal.
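
The point about power can be made concrete with a rough simulation. The sketch
below is my own illustration and rests on several assumptions: a chi-square
goodness-of-fit test with six equiprobable bins, mean and standard deviation
estimated from the data, and a 5% critical value taken from the chi-square
distribution with 6 - 1 - 2 = 3 degrees of freedom (only approximate when the
parameters are estimated from the ungrouped data). The alternative is a uniform
distribution, which is symmetric but not normal.

```python
import random
import statistics

K_BINS = 6             # number of equiprobable bins (illustrative choice)
CHI2_CRIT_DF3 = 7.815  # upper 5% point of chi-square with 3 df (approximate here)

# Interior cut points that put probability 1/6 in each bin under N(0, 1).
Z_EDGES = [statistics.NormalDist().inv_cdf(i / K_BINS) for i in range(1, K_BINS)]


def chi2_normality_stat(values):
    """Chi-square goodness-of-fit statistic against a fitted normal distribution."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    counts = [0] * K_BINS
    for v in values:
        z = (v - mean) / sd
        # The bin index is the number of cut points lying below z.
        counts[sum(z > edge for edge in Z_EDGES)] += 1
    expected = len(values) / K_BINS
    return sum((c - expected) ** 2 / expected for c in counts)


def rejection_rate(draw, n, reps=300, seed=3):
    """Fraction of simulated samples of size n for which normality is rejected."""
    rng = random.Random(seed)
    rejected = sum(
        chi2_normality_stat([draw(rng) for _ in range(n)]) > CHI2_CRIT_DF3
        for _ in range(reps)
    )
    return rejected / reps


def uniform_draw(rng):
    """Symmetric but non-normal alternative."""
    return rng.uniform(-1.0, 1.0)


def normal_draw(rng):
    """Data for which the null hypothesis of normality is true."""
    return rng.gauss(0.0, 1.0)


power_small = rejection_rate(uniform_draw, n=30)
power_large = rejection_rate(uniform_draw, n=500)
size_small = rejection_rate(normal_draw, n=30)
print("power against uniform, n=30: ", power_small)
print("power against uniform, n=500:", power_large)
print("rejection rate under normality, n=30:", size_small)
```

Under these assumptions the test should usually fail to reject normality for
uniform data when n = 30, so a large p-value from a small sample is weak
support for the claim that the data are normal.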