
1787 [ai-geostats] Large samples, t tests, etc

  • myers@math.arizona.edu
    Dec 5, 2004
      Most of the hypothesis tests mentioned recently on this list are non-spatial,
      i.e., nothing in the underlying statistical assumptions pertains specifically
      to spatial data. The one common assumption is "random sampling" or "iid"
      (independent, identically distributed). In many typical (non-spatial)
      applications, this assumption is ensured by the "design of the experiment",
      i.e., the way the data is generated and collected. Spatial data problems more
      often involve "observational data", where it is not easy to design the
      experiment in such a way as to ensure this basic assumption.

      In the case of spatial data, random site selection does not necessarily
      correspond to "random sampling". In the case of the random function model
      implicit in most of geostatistics, the data is a non-random sample from one
      realization of the random function (in that context using random site selection
      does not then make it a "random sample"). Note that not all spatial statistical
      analysis methods are based on this random function model.
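
      To illustrate what is at stake when the "iid" assumption fails, here is a
      minimal Python sketch. It is not the random function model itself: it assumes
      a stationary AR(1) series along a transect as a simple stand-in for spatially
      correlated data, and the names ar1_series and coverage are hypothetical. It
      estimates how often the usual "mean +/- 1.96 standard errors" interval covers
      the true mean (zero here) when the values are independent and when they are
      correlated.

```python
import math
import random
import statistics


def ar1_series(n, rho, rng):
    """Stationary AR(1) series with mean 0 and marginal variance 1.

    Assumption: a 1-D transect with correlation rho between neighbours is used
    as a stand-in for spatially correlated observations.
    """
    x = rng.gauss(0.0, 1.0)  # start in the stationary distribution
    out = []
    for _ in range(n):
        out.append(x)
        # Innovation scale sqrt(1 - rho^2) keeps the marginal variance at 1.
        x = rho * x + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
    return out


def coverage(n, rho, reps, seed):
    """Fraction of nominal 95% intervals (iid formula) that contain the true mean 0."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        xs = ar1_series(n, rho, rng)
        mean = statistics.fmean(xs)
        se = statistics.stdev(xs) / math.sqrt(n)  # standard error assuming iid
        if abs(mean) <= 1.96 * se:
            hits += 1
    return hits / reps


print("independent (rho=0.0):", coverage(100, 0.0, 1000, seed=1))
print("correlated  (rho=0.8):", coverage(100, 0.8, 1000, seed=1))
```

      With rho = 0 the coverage should be close to the nominal 95%; with strong
      positive correlation the iid standard error is too small and the coverage
      drops well below the nominal level.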

      Normality is another common underlying assumption in many hypothesis tests. In
      the case of random sampling from a distribution with a finite moment of order
      2+delta, delta > 0 (in the iid case a finite variance is already enough), the
      distribution of the standardized sample mean will converge IN DISTRIBUTION to
      a normal distribution. This means that a sequence of functions (the
      distribution functions) is converging to another function. It is important to
      note that this convergence may be pointwise, uniform, or uniform on intervals.
      Pointwise is what you usually get from the Central Limit Theorem; this means
      that the rate of convergence depends on where you are on the curve. The
      difference between using a normal statistic and using a t-statistic is usually
      the difference between a known variance and an unknown (and hence estimated)
      variance. But in either case the variance is assumed to exist and be finite.
      The sample variance can always be computed from a data set, but that does not
      ensure that the variance of the distribution exists. For example, the quotient
      of two independent standard normal random variables has a Cauchy distribution,
      for which neither the mean nor the variance exists. Hence the Central Limit
      Theorem does not apply.
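
      The Cauchy case can be checked by simulation. The following Python sketch
      (the function names are hypothetical and chosen only for illustration) draws
      repeated samples, computes the sample mean of each, and reports the
      interquartile range of those means. The interquartile range is used as the
      measure of spread because, unlike the variance, it exists for the Cauchy
      distribution.

```python
import random
import statistics


def draw_normal(rng):
    """One standard normal value."""
    return rng.gauss(0.0, 1.0)


def draw_cauchy(rng):
    """Quotient of two independent standard normals: a standard Cauchy value."""
    return rng.gauss(0.0, 1.0) / rng.gauss(0.0, 1.0)


def iqr_of_means(draw, n, reps, seed):
    """Interquartile range of the sample mean over `reps` samples of size n."""
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean([draw(rng) for _ in range(n)]) for _ in range(reps)
    )
    return means[(3 * reps) // 4] - means[reps // 4]


for n in (10, 500):
    print(
        "n =", n,
        "| normal:", round(iqr_of_means(draw_normal, n, 300, seed=1), 3),
        "| Cauchy:", round(iqr_of_means(draw_cauchy, n, 300, seed=1), 3),
    )
```

      For the normal case the spread of the sample mean shrinks as n grows. For
      the Cauchy case it does not shrink, because the mean of n independent
      standard Cauchy values is again standard Cauchy, whatever the sample size.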

      In the case of a non-normal distribution one really needs to know how robust the
      test is to deviation from normality. Increasing the sample size does not really
      solve this problem.
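
      One way to get a feel for robustness is to simulate the test under a true null
      hypothesis and see how often it rejects. The Python sketch below is only an
      illustration under stated assumptions: a one-sample, two-sided t-test at the
      5% level with n = 10 (critical value 2.262 for 9 degrees of freedom), normal
      data versus exponential data as one example of a skewed distribution. The
      function names are hypothetical.

```python
import math
import random
import statistics

N = 10
T_CRIT_DF9 = 2.262  # two-sided 5% critical value of Student's t with 9 df


def draw_normal(rng):
    return rng.gauss(0.0, 1.0)  # true mean 0


def draw_exponential(rng):
    return rng.expovariate(1.0)  # true mean 1, strongly right-skewed


def rejection_rate(draw, true_mean, reps, seed):
    """Fraction of samples in which the t-test rejects a null that is true."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        xs = [draw(rng) for _ in range(N)]
        se = statistics.stdev(xs) / math.sqrt(N)
        t = (statistics.fmean(xs) - true_mean) / se
        if abs(t) > T_CRIT_DF9:
            rejections += 1
    return rejections / reps


print("normal data:     ", rejection_rate(draw_normal, 0.0, 10000, seed=1))
print("exponential data:", rejection_rate(draw_exponential, 1.0, 10000, seed=1))
```

      For normal data the rejection rate should sit near the nominal 5%. For the
      skewed data it departs from 5%, and the size of that departure is what
      "robustness" is about for this particular test and distribution.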

      Finally, note that most tests of hypotheses are not exactly "neutral": there is
      a tendency to accept the null hypothesis UNLESS there is evidence against it.
      This is one of the reasons for the emphasis on the POWER of the test. Often the
      null hypothesis is the "status quo", and this logical stance for the null and
      alternative hypotheses is okay, but not in all circumstances.

      However, in some tests for normality (which still depend on the assumption of
      random sampling) the test is set up in such a way that the null hypothesis
      corresponds to the conclusion of normality, e.g., Chi-square goodness-of-fit
      tests. If you are trying to argue that it is safe to assume normality, then
      you want to accept the null hypothesis and you should want a very high power
      for the test. You don't want a small p-value; instead you want a very large
      p-value. Note that the normal distribution is symmetric but not all symmetric
      distributions are normal.
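
      The role of power here can be illustrated with a small simulation. The sketch
      below makes several assumptions that are not part of the discussion above: it
      uses the Jarque-Bera statistic in place of a Chi-square goodness-of-fit test
      (its asymptotic reference distribution, chi-square with 2 degrees of freedom,
      has the simple tail probability exp(-x/2), so no statistics library is
      needed, although that approximation is rough for small samples), and it uses
      the Laplace distribution as an example of a symmetric, non-normal
      distribution. The function names are hypothetical.

```python
import math
import random
import statistics


def jarque_bera_pvalue(xs):
    """Asymptotic p-value of the Jarque-Bera normality test."""
    n = len(xs)
    mean = statistics.fmean(xs)
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2
    jb = n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)
    return math.exp(-jb / 2.0)  # upper tail of chi-square with 2 df


def draw_laplace(rng):
    """Difference of two independent Exp(1) values: symmetric but not normal."""
    return rng.expovariate(1.0) - rng.expovariate(1.0)


def power(n, reps, seed, alpha=0.05):
    """Estimated probability of rejecting normality for Laplace samples of size n."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        xs = [draw_laplace(rng) for _ in range(n)]
        if jarque_bera_pvalue(xs) < alpha:
            rejections += 1
    return rejections / reps


for n in (20, 500):
    print("n =", n, "estimated power:", power(n, 300, seed=1))
```

      When the estimated power is low, a large p-value is what one would often see
      even though the data are not normal, so it offers little support for
      assuming normality. Only when the power is high does failing to reject carry
      much weight.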

      Donald Myers