- DEAR ALL,

This is a provisional summary of the help I received in response to my

question a few weeks ago on non-colocated datasets. As is customary, I

apologise to correspondents if I have failed to understand or express

important points. I still have at least as many questions as before, hence

I would be grateful if anyone has further suggestions.

Sections of this summary:

MY ORIGINAL QUESTION

CLARIFICATION

BASIC QUESTIONS

ADVICE SO FAR

REMAINING OBSTACLES

MY ORIGINAL QUESTION

> I'm an ecologist with an interest in wildlife disease epidemiology. I have
> two unique datasets representing indices of occurrence of the same disease
> in two species, both highly mobile terrestrial mammals. Visually (in
> postings) the two maps are convincingly similar - i.e. these data are
> ecologically very interesting indeed! I want to test the spatial
> correlation between the two datasets, because it's likely that one species
> is the reservoir infecting the other.
>
> My problems fall into two categories:
>
> (1) Spatial autocorrelation
>
> A logical first step would seem to be to test whether each dataset is
> spatially autocorrelated. This seems likely when one examines the postings,
> but semivariograms suggest that a sill is very quickly reached (at about 20
> km), a distance that seems improbably small as we are dealing with highly
> mobile mammals and epidemics that look to be 100-200 km across. In both
> datasets, the variogram value is thereafter highly variable with increasing
> distance, and there is some suggestion of an oscillating 'hole effect'.
> However, as distance increases the variogram is clearly being influenced by
> the shape of mainland Britain, and the timid faith I have in the variogram
> at small h falls away rapidly as h increases. For instance, a location in
> south-west Wales is very close to north Cornwall for a bird (by Euclidean
> distance), but quite far away for a terrestrial mammal that must travel by
> land around the Severn estuary.
>
> A further consideration, if I have correctly understood the meaning of
> stationarity, is that both datasets have an underlying trend, with values
> increasing from west to east and north to south. Despite having read at
> length in the AI-Geostats archives, I am still unsure how to deal with this
> in practical terms.
>
> For each species alone, the variogram is theoretically of tremendous
> interest. How local are the epidemics? How close must an epidemic be
> before a given animal is at risk? Is there genuinely a rippling effect
> surrounding a disease epicentre? Unfortunately, from the outset there seems
> to be a discrepancy between the variogram and my eye. I am a sworn disciple
> of objectivity, but I'm not yet convinced that my variogram is doing the
> right thing.
>
> (2) Sampling locations and correlation between the datasets
>
> Both datasets cover the whole UK (including some islands, which are easily
> and logically excluded), but originate from two populations of people
> (hunters and veterinarians) with necessarily different geographical
> distributions - i.e. they are not colocated. I could convert both datasets
> to a common regular grid, but this involves interpolation, a number of
> assumptions, and the creation of quite a few new grid locations that have NO
> data from one or both dataset(s). If I did convert to a common grid, I am
> then at a loss to know how to proceed further. The two datasets do not have
> similar underlying distributions. One is an incidence (count of diseased
> animals per unit effort), and is easily normalised by a log transformation.
> The other is a measure of prevalence, with many essential (meaningful) zeros
> that make transformation awkward and perhaps undesirable; these prevalence
> data can also be weighted by the sample size on which each is based.
>
> Please can anyone suggest a route forward? I have read (all the easy words
> in) quite a number of textbooks. So far as I can judge (I pull up all too
> soon), most books stop short of problems like this because no
> self-respecting miner would burden himself by collecting data so awkwardly.
> For me, this is a crude pilot study, hence a stratified sampling programme
> to test a hypothesis will be the next stage IF I can formalise the
> correlation that looks so blindingly obvious to the naked eye. So please
> don't suggest I do my sampling differently..................

CLARIFICATION

I should have explained better what the two datasets (each located by x,y

coordinates) are:

(1). Prevalence of an infectious disease among wild animals shot (randomly)

by hunters operating within small areas. These data were collected from the

hunters as a percentage (based on recollection of the preceding 12 months),

but we also know the sample size (i.e. the experience) on which this is

based - this could be used to weight values.

(2). Numbers of domestic dogs with the same disease presented at veterinary

surgeries. As we don't know the population of dogs from which these

infected dogs are drawn, these data represent 'incidence' rather than

'prevalence'. One must assume either that veterinary practices saturate the
landscape so that they all draw on similar-sized populations of dogs; or
alternatively that time constraints mean that individual vets tend to deal
with similar numbers of dogs (I have eliminated vets who deal only with
farm animals).

NB:

Neither dataset is normal. Both can be made approximately normal by

transformation, but the many zeros (dataset 1) and very low values (dataset

2) are essential features of the data, so I am loath to do this.

The two datasets are not co-located. This is inevitable because hunters and

vets occupy such different niches.

Data are not evenly spread in x,y space. The vets' data (2) in particular

are highly clumped. Combining values within cells of a superimposed grid by

averaging (1 and 2) or possibly summing (2) seems conceptually OK to me, but

conversion to a grid loses some of the spatial information present; except

on a very coarse grid it also creates a fresh problem of grid cells in which

one or both data sets are missing.

BASIC QUESTIONS:

A. How is the likelihood of disease at location x,y related to the

likelihood of disease in the same species at increasing distance h? This is

really a matter of describing the epidemiology in spatial terms.

B. How is the likelihood of disease at location x,y related to the

likelihood of disease in the other species at increasing distance h? A

convincing preliminary to this would be to show that the spatial pattern of

the disease is broadly similar in the two species. I suppose I mean by this

that regression-style modelling is attractive, but simple correlation

between the two datasets would be a huge step forward.

ADVICE SO FAR:

Softly, softly

Steve Rushton suggested that the analysis should begin with as little

modelling as possible (I approve the sentiment of this softly-softly

approach because I mistrust the strings of arbitrary choices apparently

involved in modelling), for instance through a randomisation test or a

Mantel test. The Mantel test, because it operates on matrices, may allow me

to utilise an overland distance matrix (see below) - I need to do some more

reading here. However, I'm dubious about randomisation tests, as it seems

to me that neither of my datasets fulfils the assumption of independence

(data are constrained by the mobility of the disease organism and its hosts,

and by the underlying distributions of hosts and recorders in Britain).
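To make the Mantel idea concrete, here is a minimal sketch (my own illustration in Python with invented toy data - function and variable names are mine, not from any package the correspondents mentioned). The statistic is the correlation between the upper triangles of two dissimilarity matrices, and the p-value comes from permuting the sample labels of one matrix (rows and columns together):

```python
import numpy as np

def mantel(d_geo, d_val, n_perm=999, rng=None):
    """Simple Mantel test: correlation between two symmetric
    dissimilarity matrices, with a one-sided permutation p-value
    obtained by shuffling the sample labels of one matrix."""
    rng = np.random.default_rng(rng)
    n = d_geo.shape[0]
    iu = np.triu_indices(n, k=1)                 # each pair counted once
    r_obs = np.corrcoef(d_geo[iu], d_val[iu])[0, 1]
    count = 0
    for _ in range(n_perm):
        p = rng.permutation(n)                    # permute rows AND columns
        r = np.corrcoef(d_geo[iu], d_val[p][:, p][iu])[0, 1]
        if r >= r_obs:
            count += 1
    return r_obs, (count + 1) / (n_perm + 1)

# toy data: 20 locations whose values increase with x, so the two
# distance matrices should be positively correlated
rng = np.random.default_rng(0)
xy = rng.uniform(0, 100, size=(20, 2))
z = xy[:, 0] + rng.normal(0, 5, 20)
d_geo = np.hypot(xy[:, 0, None] - xy[None, :, 0],
                 xy[:, 1, None] - xy[None, :, 1])
d_val = np.abs(z[:, None] - z[None, :])
r, p = mantel(d_geo, d_val, n_perm=499, rng=1)
```

Because only the matrices enter the test, a look-up table of overland distances could be substituted for d_geo without changing anything else - which is exactly why the matrix formulation appeals to me.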

Deriving and interpreting variograms.

Donald Myers and Klemens Barfuss favoured detrending the data by fitting an

x,y plane and working with the residuals only. Brian Gray pointed out that

some authors caution against this, arguing that small- and large-scale

trends should be modelled together, but that he personally would suggest the

detrending approach (at least in my case?). On logical grounds, I favour

detrending, because prior knowledge shows that there are underlying

large-scale trends in the distribution of hosts that one would wish to

remove as far as possible.
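For what it is worth, the plane-fitting step itself is mechanically simple. A toy numpy sketch (my own, with fabricated coordinates and values) of fitting a first-order x,y trend surface by ordinary least squares and keeping the residuals for variography:

```python
import numpy as np

# fabricated coordinates and values with a deliberate west-east /
# north-south trend plus noise
rng = np.random.default_rng(0)
x, y = rng.uniform(0, 100, 200), rng.uniform(0, 100, 200)
z = 0.5 * x - 0.3 * y + rng.normal(0, 1.0, 200)

# design matrix for a first-order plane: z = b0 + b1*x + b2*y
A = np.column_stack([np.ones_like(x), x, y])
coef, *_ = np.linalg.lstsq(A, z, rcond=None)
residuals = z - A @ coef       # detrended values for the variogram
```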

It occurred to me (confirmed by Brian) that the calculation of residuals

could proceed by Generalised Linear Modelling, which would take due account

of the non-normal distribution of each dataset. Fitting a simple

first-order Euclidean x,y plane to either data set using GLM explains a

large proportion of the variation. We then corresponded about how the

residuals would be distributed after GLM. Surely residuals of a Poisson

distributed variable would also be Poisson distributed? Can one 'unlink'

mean and variance by such a process? Should one transform the residuals?

Which residuals (natural, standardised, Pearson, deviance - there is

probably a confusion of terms here) should one use? I still don't know the

answer to all this.
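On the residuals question, the following is an illustrative sketch (Python, with toy data of my own invention - not my real counts, and not anyone's recommended methodology) of a Poisson GLM fitted by IRLS, with Pearson residuals (y - mu)/sqrt(mu) and deviance residuals computed from the fit. Dividing by sqrt(mu) is what approximately 'unlinks' mean and variance for a Poisson response:

```python
import numpy as np

def poisson_glm(X, y, n_iter=25):
    """Poisson regression with log link, fitted by iteratively
    reweighted least squares (the algorithm GLM software uses)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = np.exp(X @ beta)
        W = mu                                     # Poisson: var(y) = mu
        zwork = X @ beta + (y - mu) / mu           # working response
        XtW = (X * W[:, None]).T
        beta = np.linalg.solve(XtW @ X, XtW @ zwork)
    return beta

rng = np.random.default_rng(0)
x1, x2 = rng.uniform(0, 1, 500), rng.uniform(0, 1, 500)
X = np.column_stack([np.ones(500), x1, x2])
y = rng.poisson(np.exp(0.5 + 1.0 * x1 - 0.5 * x2))

beta = poisson_glm(X, y)
mu = np.exp(X @ beta)

pearson = (y - mu) / np.sqrt(mu)     # residual scaled by sd at its mean
with np.errstate(divide="ignore", invalid="ignore"):
    term = np.where(y > 0, y * np.log(y / mu), 0.0)
deviance = np.sign(y - mu) * np.sqrt(2 * (term - (y - mu)))
```

On toy data like this the Pearson residuals come out with mean near zero and variance near one, which is the sense in which they are comparable across locations with very different expected counts. Whether they are then legitimate input to a variogram is a separate question (see REMAINING OBSTACLES).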

Suggested literature:

Gumpertz, ML, et al. (2000) Logistic regression for Southern Pine Beetle

outbreaks with spatial and temporal autocorrelation. Forest Science

46:95-107 [THIS IS A MARVELLOUSLY CLEAR PAPER FROM WHICH I HAVE LEARNED A

GREAT DEAL - THOROUGHLY RECOMMENDED. However, it deals with a binomial

event, not with count data as in my case, so the detailed methodology is not

applicable.]

Gotway and Stroup (1997) A generalized linear model approach to spatial data
analysis and prediction. JABES 2:157-178

Discovering the limits of dry land

Steve Rushton favoured my proposal to calculate overland distances to use as

h values. Manifold v. 4.5 or later offered a simple means to do this. The

advice was: choose the origin and spacing of your analysis grid (you may
need to try several alternatives). Superimpose a grid of points, and build a

nearest neighbour network through these. Calculate the shortest path

through the network for each pair of points. Speed of calculation and

accuracy of path length estimation are conflicting aims determined by grid

spacing. In practice this procedure did not achieve what I wanted. On the

other hand, I think I can write a solver in Manifold to accomplish what I

want. (See www.manifold.net for details of the package. This company has a

genuinely helpful and FAST technical advice service by email. The Manifold

package is astoundingly cheap, and it excels in solvers and custom

programming potential. Despite being an already experienced MapInfo user, I

found Manifold very exciting as a means to prepare data for analysis.)

Steve suggested Floyd's Shortest Path algorithm, which achieves much the

same thing; but for a complex shape like mainland Britain and thousands of

data points it promised too much initial work coding the data.
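The grid-and-network idea can be shown with a deliberately tiny example (my own sketch in pure Python; the land mask and 'estuary' are invented, and real use would need the British coastline and a sensible grid spacing). A breadth-first search over a land/sea mask returns shortest-path lengths, in grid steps, that respect the coast:

```python
from collections import deque

def overland_steps(land, start):
    """Breadth-first search over a True/False land mask: returns the
    number of 4-neighbour grid steps from `start` to every reachable
    land cell (a crude stand-in for shortest overland path through a
    nearest-neighbour network; multiply by grid spacing for km)."""
    rows, cols = len(land), len(land[0])
    dist = {start: 0}
    q = deque([start])
    while q:
        r, c = q.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and land[nr][nc] and (nr, nc) not in dist):
                dist[(nr, nc)] = dist[(r, c)] + 1
                q.append((nr, nc))
    return dist

# toy 5x7 'coastline': an estuary (False cells) splits the lower rows,
# so the path from one shore to the other must go around its head
land = [
    [True, True, True, True,  True, True, True],
    [True, True, True, False, True, True, True],
    [True, True, True, False, True, True, True],
    [True, True, True, False, True, True, True],
    [True, True, True, False, True, True, True],
]
d = overland_steps(land, (4, 2))
# the cell (4, 4) is only 2 columns away as the crow flies,
# but much further by 'land'
```

Uniform step costs make plain BFS sufficient here; with varying costs one would swap in Dijkstra, and all-pairs needs are exactly where Floyd's algorithm (or repeated single-source searches) comes in.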

Using overland distances as the basis for calculating spatial statistics

appears to require me to write my own code to calculate those statistics. I

am a GenStat user, not S-Plus or SAS - but can anyone tell me of clever ways

that allow the use of distance look-up tables in any common package?
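Pending cleverer suggestions, the experimental semivariogram itself is easy enough to compute from a precomputed distance matrix. A hypothetical numpy sketch (names and toy data my own):

```python
import numpy as np

def semivariogram(dmat, z, lags):
    """Empirical semivariance per lag bin, taking distances from a
    precomputed matrix (Euclidean, overland, or anything else)
    instead of recomputing them from coordinates."""
    iu = np.triu_indices(len(z), k=1)
    d, sq = dmat[iu], 0.5 * (z[iu[0]] - z[iu[1]]) ** 2
    gamma, counts = [], []
    for lo, hi in zip(lags[:-1], lags[1:]):
        sel = (d >= lo) & (d < hi)
        gamma.append(sq[sel].mean() if sel.any() else np.nan)
        counts.append(int(sel.sum()))
    return np.array(gamma), np.array(counts)

# toy spatially structured data on invented coordinates
rng = np.random.default_rng(0)
xy = rng.uniform(0, 100, (150, 2))
dmat = np.hypot(*(xy[:, None, :] - xy[None, :, :]).transpose(2, 0, 1))
z = np.sin(xy[:, 0] / 15.0) + rng.normal(0, 0.1, 150)
gamma, counts = semivariogram(dmat, z, lags=np.arange(0, 60, 10))
```

Feeding in an overland distance matrix instead of the Euclidean one requires no change to the function - which is the sense in which a distance look-up table seems to me the natural interface.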

Analysis dendrogram

A common feature of advice seems to be to try things two or more ways and

see which works best. (Perhaps one should say, see how different the

results are.) Given the number of technical issues here (grid origin, grid

spacing, with and without detrending, different types of residual, with and

without transformation, lag distance, tolerance, etc, etc) a tree of

alternative approaches arises and objectivity seems to recede rapidly. Are

we looking to see which answer best fits our preconceptions? This was one

of my original worries about variogram modelling and kriging (see Charles T.

Kufs' posting today!), but I now see that it applies to other analytical

decisions too. Will geostatisticians in due course be able to recommend

objective routes through this branching maze?

REMAINING OBSTACLES:

I am still seriously hung up over the following, and would much appreciate

further advice:

1. How to incorporate spatial correlation within and between datasets into

GLM. Advice in practical terms if possible.

2. How to calculate semi-variogram and/or correlogram statistics using a

matrix of overland distances. [This must surely be a common problem? How

many study areas are uniform and rectangular?]

3. Is it technically correct to calculate the semi-variogram on residuals

(of any kind) from a GLM?

With gratitude to all who responded!

Jonathan Reynolds

Dr Jonathan C Reynolds

The Game Conservancy Trust

Fordingbridge

Hampshire SP6 1EF

UK

tel: +44 (0)1425 652381

FAX: +44 (0)1425 651026

email: jreynolds@...

website: www.gct.org.uk/index.html

--

*To post a message to the list, send it to ai-geostats@....

*As a general service to list users, please remember to post a summary

of any useful responses to your questions.

*To unsubscribe, send email to majordomo@... with no subject and

"unsubscribe ai-geostats" in the message body.

DO NOT SEND Subscribe/Unsubscribe requests to the list!

Jonathan,

I haven't had time to go through your extensive e-mail

in detail, but here are a couple of thoughts to be

going on with:

========================

simple correlation: if you calculate the

non-co-located semi-variogram (or covariance function)

the apparent nugget effect is a direct estimate of

what the correlation between the two variables would

be if you had them co-located. This allows for the

spatial 'auto-correlation' as well as statistical

correlation. I (personally) favour a rank transform as

this is a well established way of finding a

correlation for non-Normal data.

With a rank transform, your zeroes would be given

arbitrary (randomised) ranks and you could do a few

repeats to see if this makes a lot of difference.
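The randomised-ranks suggestion might be sketched as follows (my own illustration in Python; the toy prevalence values are invented, and this is only one way to break ties at random):

```python
import numpy as np

def random_tie_ranks(v, rng):
    """Ranks 1..n with ties broken at random - e.g. the many exact
    zeros in a prevalence dataset get arbitrary, randomised ranks."""
    n = len(v)
    # lexsort: last key is primary, so sort by value, then random key
    order = np.lexsort((rng.random(n), v))
    ranks = np.empty(n)
    ranks[order] = np.arange(1, n + 1)
    return ranks

rng = np.random.default_rng(0)
prev = np.array([0, 0, 0, 0, 0.2, 0.5, 0, 0.1, 0, 0.9])
# repeat the ranking several times; the zeros shuffle among the
# lowest ranks while the non-zero values keep fixed ranks
reps = [random_tie_ranks(prev, rng) for _ in range(5)]
```

Comparing correlations computed from a few such repeats would show how sensitive the answer is to the arbitrary ranking of the zeros, as suggested.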

References for MUCK can be found at

http://uk.geocities.com/drisobelclark/resume/publications.html

=======================

Some packages (including EcoSSe) define 'distance' in

a specified module which is used by all of the

routines. This module can be replaced to allow the use

of an algebraic function or look-up table for

distances. All routines then use that definition

instead of Euclidean distance.

========================

Technically constructing a semi-variogram or other

spatial dependence analysis on the residuals from a

GLM (trend) surface is incorrect. However, we have all

been doing it very successfully for almost 30 years,

so I wouldn't worry about it over muchly. cf.

Practical Geostatistics 2000, Chapter 12.

Isobel Clark

drisobelclark@...

http://uk.geocities.com/drisobelclark

