- Mar 9, 2004
> Hi,
>
> I am working myself with pollution data in soils and I have very high
> values very close to very low values, and highly skewed
> distribution. I am more and more concerned with doing kriging on
> transformed data. This simply means we believe the data came
> from only one population. But what if it comes from 2 different
> populations representing 2 different polluting processes? Much
> more if we do believe there are no gross error measurements. The
> fact that high values are very close to low values would tell me that
> the spatial autocorrelation is violated locally. I would try first to see
> if the outliers (local and global) represent a different population, if
> these values cluster or not, how significant is the association high-
> low values, and if the global Moran's I increases if I eliminate the
> "outliers". Maybe the majority of the data which have a higher
> spatial autocorrelation belong to a "better expressed" diffusive
> process, (maybe an older one) while the rest of the data which
> were identified as outliers before, represent a more patch-y or point
> source pollution process which didn't have time to diffuse over the
> entire study area (a younger process, maybe?).

Exploratory analysis of the frequency distribution of the data (i.e. the
aggregated, non-spatial, frequency) could reveal the existence of two (or
more) populations. To evaluate the evidence in favour of such a
hypothesis, you could compare the hypothesis that the frequency
distribution is formed by a mixture of two (or more) specified
distributions versus the hypothesis that it is formed by only one. The
general topic in statistics is called 'mixture distribution analysis' (not
to be confused with 'mixture models'). Useful references are:

Everitt & Hand, 1981, Mixture distribution analysis. Chapman & Hall

Chen & Chen, 2001, Statistics and Probability Letters 52:125

Hawkins et al., 2001, Computational Statistics & Data Analysis 38:15

http://www.math.mcmaster.ca/peter/mix/mix.html
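As a rough illustration of the one-versus-two comparison described above, here is a minimal Python sketch. It is not taken from any of the references: it assumes normal components (for skewed pollution data one might first work on a log scale), fits the two-component mixture with a plain EM loop, and uses BIC as an informal screen rather than a formal test, since the usual chi-square reference distribution does not apply to likelihood ratios for mixtures. The function names, the starting values and the simulated sample are all hypothetical.

```python
import math
import random


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std dev sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def fit_one_normal(data):
    """Maximum-likelihood fit of a single normal: (mu, sigma, loglik)."""
    n = len(data)
    mu = sum(data) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in data) / n)
    loglik = sum(math.log(max(normal_pdf(x, mu, sigma), 1e-300)) for x in data)
    return mu, sigma, loglik


def fit_two_normals(data, n_iter=200, min_sigma=1e-3):
    """EM fit of a two-normal mixture: (w, mu1, sigma1, mu2, sigma2, loglik).

    w is the weight of component 1. min_sigma is an ad hoc floor that keeps
    a component from collapsing onto a single observation.
    """
    n = len(data)
    xs = sorted(data)
    half = n // 2
    # Crude start: means of the lower and upper halves, common spread.
    w = 0.5
    mu1 = sum(xs[:half]) / half
    mu2 = sum(xs[half:]) / (n - half)
    sigma1 = sigma2 = max(fit_one_normal(data)[1], min_sigma)
    for _ in range(n_iter):
        # E-step: probability that each point came from component 1.
        resp = []
        for x in data:
            a = w * normal_pdf(x, mu1, sigma1)
            b = (1.0 - w) * normal_pdf(x, mu2, sigma2)
            resp.append(a / max(a + b, 1e-300))
        # M-step: weighted updates of weight, means and spreads.
        n1 = max(sum(resp), 1e-12)
        n2 = max(n - n1, 1e-12)
        w = n1 / n
        mu1 = sum(r * x for r, x in zip(resp, data)) / n1
        mu2 = sum((1.0 - r) * x for r, x in zip(resp, data)) / n2
        var1 = sum(r * (x - mu1) ** 2 for r, x in zip(resp, data)) / n1
        var2 = sum((1.0 - r) * (x - mu2) ** 2 for r, x in zip(resp, data)) / n2
        sigma1 = max(math.sqrt(var1), min_sigma)
        sigma2 = max(math.sqrt(var2), min_sigma)
    loglik = sum(
        math.log(max(w * normal_pdf(x, mu1, sigma1)
                     + (1.0 - w) * normal_pdf(x, mu2, sigma2), 1e-300))
        for x in data
    )
    return w, mu1, sigma1, mu2, sigma2, loglik


def bic(loglik, n_params, n_obs):
    """Bayesian information criterion; smaller is better."""
    return -2.0 * loglik + n_params * math.log(n_obs)


# Simulated example: two well-separated populations (hypothetical data).
random.seed(0)
sample = ([random.gauss(0.0, 1.0) for _ in range(200)]
          + [random.gauss(5.0, 1.0) for _ in range(100)])
_, _, ll1 = fit_one_normal(sample)
w, mu1, s1, mu2, s2, ll2 = fit_two_normals(sample)
print("BIC, one normal :", bic(ll1, 2, len(sample)))   # 2 parameters
print("BIC, two normals:", bic(ll2, 5, len(sample)))   # 5 parameters
```

A clearly lower BIC for the two-component fit would count as evidence for a mixture, but only under the normality assumption; a formal test would need the methods in the references above.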

Some robust regression methods, for example, are based on treating the
data as coming from a mixture of two distributions, the main one, and a
contaminating distribution.

If you conclude that there are two (or more) distributions, then you can
compute, for any given data point, the conditional probability that it
belongs to each of the two (or more) distributions, and classify the point
into the distribution with the maximum conditional probability. After this
exploratory analysis, you could treat the two (or more) populations
differently, if there is evidence for a mixture, and maybe even perform
separate geostatistical analyses on the separate populations.
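The classification step can be sketched in a few lines of Python. This is only an illustration under assumptions: the components are taken to be normal, and the weights, means and standard deviations below are made-up values standing in for whatever a mixture fit would give. The function names are hypothetical too.

```python
import math


def normal_logpdf(x, mu, sigma):
    """Log density of a normal distribution with mean mu and std dev sigma."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)


def posterior_probs(x, weights, mus, sigmas):
    """Conditional probability that x belongs to each mixture component."""
    logs = [math.log(w) + normal_logpdf(x, m, s)
            for w, m, s in zip(weights, mus, sigmas)]
    top = max(logs)  # subtract the maximum to avoid underflow
    unnorm = [math.exp(v - top) for v in logs]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def classify(x, weights, mus, sigmas):
    """Index of the component with the maximum conditional probability."""
    probs = posterior_probs(x, weights, mus, sigmas)
    return max(range(len(probs)), key=probs.__getitem__)


# Hypothetical fitted parameters: a main population and a smaller one.
WEIGHTS = [2.0 / 3.0, 1.0 / 3.0]
MUS = [0.0, 5.0]
SIGMAS = [1.0, 1.0]

for value in (0.1, 2.5, 4.8):
    probs = posterior_probs(value, WEIGHTS, MUS, SIGMAS)
    label = classify(value, WEIGHTS, MUS, SIGMAS)
    print(value, [round(p, 3) for p in probs], "-> component", label)
```

The labels returned by `classify` could then be used to split the data set before running separate variogram or kriging analyses. Points whose probabilities are close to each other are better flagged as uncertain than forced into a class.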

I used this general strategy in the analysis of a time series of an index
of returns from investments in financial markets. The strategy was
proposed by Hamilton, 1994, Time Series Analysis, Ch. 22, Princeton U. P.

Ruben

--

* Support to the list is provided at http://www.ai-geostats.org