GEOSTATS: SUMMARY: Non-colocated disease datasets. Further help sought!
Oct 27, 2000
DEAR ALL,
This is a provisional summary of the help I received in response to my
question a few weeks ago on non-colocated datasets. As is customary, I
apologise to correspondents if I have failed to understand or express
important points. I still have at least as many questions as before, hence
I would be grateful if anyone has further suggestions.
Sections of this summary:
MY ORIGINAL QUESTION
ADVICE SO FAR
MY ORIGINAL QUESTION
> I'm an ecologist with an interest in wildlife disease epidemiology. I have
> two unique datasets representing indices of occurrence of the same disease
> in two species, both highly mobile terrestrial mammals. Visually (in
> postings) the two maps are convincingly similar - i.e. these data are
> ecologically very interesting indeed! I want to test the spatial
> correlation between the two datasets, because it's likely that one species
> is the reservoir infecting the other.
> My problems fall into two categories:
> (1) Spatial autocorrelation
> A logical first step would seem to be to test whether each dataset is
> spatially autocorrelated. This seems likely when one examines the postings,
> but semivariograms suggest that a sill is very quickly reached (at about 20
> km), a distance that seems improbably small as we are dealing with highly
> mobile mammals and epidemics that look to be 100-200 km across. In both
> datasets, the variogram value is thereafter highly variable with increasing
> distance, and there is some suggestion of an oscillating 'hole effect'.
> However, as distance increases the variogram is clearly being influenced by
> the shape of mainland Britain, and the timid faith I have in the variogram
> at small h falls away rapidly as h increases. For instance, a location in
> south-west Wales is very close to north Cornwall for a bird (by Euclidean
> distance), but quite far away for a terrestrial mammal that must travel by
> land around the Severn estuary.
> A further consideration, if I have correctly understood the meaning of
> stationarity, is that both datasets have an underlying trend, with values
> increasing from west to east and north to south. Despite having read at
> length in the AI-Geostats archives, I am still unsure how to deal with this
> in practical terms.
> For each species alone, the variogram is theoretically of tremendous
> interest. How local are the epidemics? How close must an epidemic be
> before a given animal is at risk? Is there genuinely a rippling effect
> surrounding a disease epicentre? Unfortunately, from the outset there seems
> to be a discrepancy between the variogram and my eye. I am a sworn disciple
> of objectivity, but I'm not yet convinced that my variogram is doing the
> right thing.
> (2) Sampling locations and correlation between the datasets
> Both datasets cover the whole UK (including some islands, which are easily
> and logically excluded), but originate from two populations of people
> (hunters and veterinarians) with necessarily different geographical
> distributions - i.e. they are not colocated. I could convert both datasets
> to a common regular grid, but this involves interpolation, a number of
> assumptions, and the creation of quite a few new grid locations that have NO
> data from one or both dataset(s). If I did convert to a common grid, I am
> then at a loss to know how to proceed further. The two datasets do not have
> similar underlying distributions. One is an incidence (count of diseased
> animals per unit effort), and is easily normalised by a log transformation.
> The other is a measure of prevalence, with many essential (meaningful) zeros
> that make transformation awkward and perhaps undesirable; these prevalence
> data can also be weighted by the sample size on which each is based.
> Please can anyone suggest a route forward? I have read (all the easy words
> in) quite a number of textbooks. So far as I can judge (I pull up all too
> soon), most books stop short of problems like this because no
> self-respecting miner would burden himself by collecting data so awkwardly.
> For me, this is a crude pilot study, hence a stratified sampling programme
> to test a hypothesis will be the next stage IF I can formalise the
> correlation that looks so blindingly obvious to the naked eye. So please
> don't suggest I do my sampling differently.
CLARIFICATION
I should have explained better what the two datasets (each located by x,y
coordinates) represent:
(1). Prevalence of an infectious disease among wild animals shot (randomly)
by hunters operating within small areas. These data were collected from the
hunters as a percentage (based on recollection of the preceding 12 months),
but we also know the sample size (i.e. the experience) on which this is
based - this could be used to weight values.
(2). Numbers of domestic dogs with the same disease presented at veterinary
surgeries. As we don't know the population of dogs from which these
infected dogs are drawn, these data represent 'incidence' rather than
'prevalence'. One must assume either that veterinary practices saturate the
landscape so that they all draw on similar sized populations of dogs; or
alternatively that time constraints mean that individual vets tend to deal
with similar numbers of dogs (I have eliminated vets who do deal only with
Neither dataset is normal. Both can be made approximately normal by
transformation, but the many zeros (dataset 1) and very low values (dataset
2) are essential features of the data, so I am loath to do this.
The two datasets are not co-located. This is inevitable because hunters and
vets occupy such different niches.
Data are not evenly spread in x,y space. The vets data (2) in particular
are highly clumped. Combining values within cells of a superimposed grid by
averaging (1 and 2) or possibly summing (2) seems conceptually OK to me, but
conversion to a grid loses some of the spatial information present; except
on a very coarse grid it also creates a fresh problem of grid cells in which
one or both data sets are missing.
A. How is the likelihood of disease at location x,y related to the
likelihood of disease in the same species at increasing distance h? This is
really a matter of describing the epidemiology in spatial terms.
B. How is the likelihood of disease at location x,y related to the
likelihood of disease in the other species at increasing distance h? A
convincing preliminary to this would be to show that the spatial pattern of
the disease is broadly similar in the two species. I suppose I mean by this
that regression-style modelling is attractive, but simple correlation
between the two datasets would be a huge step forward.
ADVICE SO FAR:
Steve Rushton suggested that the analysis should begin with as little
modelling as possible (I approve the sentiment of this softly-softly
approach because I mistrust the strings of arbitrary choices apparently
involved in modelling), for instance through a randomisation test or a
Mantel test. The Mantel test, because it operates on matrices, may allow me
to utilise an overland distance matrix (see below) - I need to do some more
reading here. However, I'm dubious about randomisation tests, as it seems
to me that neither of my datasets fulfills the assumption of independence
(data are constrained by the mobility of the disease organism and its hosts,
and by the underlying distributions of hosts and recorders in Britain).
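As a rough illustration of the Mantel idea (a sketch only, not the method any correspondent prescribed), the following Python code correlates two distance matrices defined on the same set of sites and judges the result by permuting site labels. It assumes one matrix holds pairwise spatial distances (Euclidean or overland) and the other holds pairwise differences in the disease index at those same sites; the function names and the toy data are hypothetical. It does not by itself address the non-colocation of the two datasets, nor the worry about independence raised above.

```python
import math
import random


def pearson(u, v):
    """Pearson correlation of two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)


def upper_triangle(matrix, order):
    """Flatten the strict upper triangle after relabelling sites by `order`."""
    n = len(order)
    return [matrix[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]


def mantel_test(dist_a, dist_b, n_perm=999, seed=0):
    """Return (r, p): Mantel correlation and a one-sided permutation p-value."""
    n = len(dist_a)
    identity = list(range(n))
    a = upper_triangle(dist_a, identity)
    r_obs = pearson(a, upper_triangle(dist_b, identity))
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        order = identity[:]
        rng.shuffle(order)  # permute rows and columns of one matrix together
        if pearson(a, upper_triangle(dist_b, order)) >= r_obs:
            hits += 1
    return r_obs, (hits + 1) / (n_perm + 1)


# Hypothetical toy data: five sites on a line, disease index rising eastwards.
sites = [0.0, 1.0, 2.0, 3.0, 4.0]
index = [0.1, 0.2, 0.4, 0.7, 0.9]
space = [[abs(p - q) for q in sites] for p in sites]
diff = [[abs(p - q) for q in index] for p in index]
r, p = mantel_test(space, diff, n_perm=199)
```

Because the test only ever sees the matrices, `space` could equally be filled from a look-up table of overland distances.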
Deriving and interpreting variograms.
Donald Myers and Klemens Barfuss favoured detrending the data by fitting an
x,y plane and working with the residuals only. Brian Gray pointed out that
some authors caution against this, arguing that small- and large-scale
trends should be modelled together, but that he personally would suggest the
detrending approach (at least in my case?). On logical grounds, I favour
detrending, because prior knowledge shows that there are underlying
large-scale trends in the distribution of hosts that one would wish to
remove so far as possible.
It occurred to me (confirmed by Brian) that the calculation of residuals
could proceed by Generalised Linear Modelling, which would take due account
of the non-normal distribution of each dataset. Fitting a simple
first-order Euclidean x,y plane to either data set using GLM explains a
large proportion of the variation. We then corresponded about how the
residuals would be distributed after GLM. Surely residuals of a Poisson
distributed variable would also be Poisson distributed? Can one 'unlink'
mean and variance by such a process? Should one transform the residuals?
Which residuals (natural, standardised, Pearson, deviance - there is
probably a confusion of terms here) should one use? I still don't know the
answer to all this.
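To make the residual types concrete, here is a minimal Python sketch, assuming a Poisson model with a log link and a first-order x,y plane as the only predictors. The function names, the small hand-rolled solver and the toy counts are all hypothetical, and a real analysis would use a statistics package. Raw residuals (observed minus fitted) can be negative, so they cannot themselves be Poisson distributed. Pearson residuals divide each raw residual by the square root of its fitted mean, which is the model's standard deviation. Deviance residuals are the signed square roots of each point's contribution to the deviance. Which of these, if any, is appropriate for a variogram is left open here, as in the discussion above.

```python
import math


def solve(a, b):
    """Solve the small linear system a x = b by Gaussian elimination."""
    n = len(b)
    m = [a[i][:] + [b[i]] for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_poisson_plane(x, y, counts, n_iter=25):
    """Fit log(mu) = b0 + b1*x + b2*y by iteratively reweighted least squares."""
    design = [[1.0, xi, yi] for xi, yi in zip(x, y)]
    beta = [math.log(sum(counts) / len(counts)), 0.0, 0.0]
    for _ in range(n_iter):
        xtwx = [[0.0] * 3 for _ in range(3)]
        xtwz = [0.0] * 3
        for row, c in zip(design, counts):
            eta = sum(b * v for b, v in zip(beta, row))
            mu = math.exp(eta)
            z = eta + (c - mu) / mu  # working response for the log link
            for p in range(3):
                xtwz[p] += row[p] * mu * z  # IRLS weight equals mu
                for q in range(3):
                    xtwx[p][q] += row[p] * mu * row[q]
        beta = solve(xtwx, xtwz)
    return beta


def residuals(x, y, counts, beta):
    """Return (raw, Pearson, deviance) residual lists for the fitted plane."""
    raw, pearson, deviance = [], [], []
    for xi, yi, c in zip(x, y, counts):
        mu = math.exp(beta[0] + beta[1] * xi + beta[2] * yi)
        raw.append(c - mu)
        pearson.append((c - mu) / math.sqrt(mu))
        term = c * math.log(c / mu) if c > 0 else 0.0  # zero counts contribute 0
        dev = 2.0 * (term - (c - mu))
        deviance.append(math.copysign(math.sqrt(max(dev, 0.0)), c - mu))
    return raw, pearson, deviance


# Hypothetical toy data on a 4 x 4 grid, coordinates already rescaled to 0..3.
gx = [0.0, 1.0, 2.0, 3.0] * 4
gy = [0.0] * 4 + [1.0] * 4 + [2.0] * 4 + [3.0] * 4
cases = [0, 2, 2, 4, 1, 3, 4, 5, 2, 3, 5, 8, 3, 4, 7, 11]
beta = fit_poisson_plane(gx, gy, cases)
raw, pearson_res, deviance_res = residuals(gx, gy, cases, beta)
```

Coordinates in metres or kilometres should be rescaled before fitting, since the exponential in the log link overflows easily with large x,y values.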
Gumpertz, ML, et al. (2000) Logistic regression for Southern Pine Beetle
outbreaks with spatial and temporal autocorrelation. Forest Science
46:95-107 [THIS IS A MARVELLOUSLY CLEAR PAPER FROM WHICH I HAVE LEARNED A
GREAT DEAL - THOROUGHLY RECOMMENDED. However, it deals with a binomial
event, not with count data as in my case, so the detailed methodology is not
Gotway and Stroup (1997) A generalized linear model approach to spatial data
prediction, JABES 2:157-178
Discovering the limits of dry land
Steve Rushton favoured my proposal to calculate overland distances to use as
h values. Manifold v. 4.5 or later offered a simple means to do this. The
advice was: choose the origin and spacing of your analysis grid (perhaps
need to try several alternatives). Superimpose a grid of points, and build a
nearest neighbour network through these. Calculate the shortest path
through the network for each pair of points. Speed of calculation and
accuracy of path length estimation are conflicting aims determined by grid
spacing. In practice this procedure did not achieve what I wanted. On the
other hand, I think I can write a solver in Manifold to accomplish what I
want. (See www.manifold.net for details of the package. This company has a
genuinely helpful and FAST technical advice service by email. The Manifold
package is astoundingly cheap, and it excels in solvers and custom
programming potential. Despite being an already experienced MapInfo user, I
found Manifold very exciting as a means to prepare data for analysis.)
Steve suggested Floyd's Shortest Path algorithm, which achieves much the
same thing; but for a complex shape like mainland Britain and thousands of
data points it promised too much initial work coding the data.
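The grid-and-network procedure described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Manifold solver: the land mask, the 8-neighbour connectivity and the function name are hypothetical, and Dijkstra's algorithm is used here in place of Floyd's because it gives distances from one starting cell at a time without building an all-pairs table.

```python
import heapq
import math


def overland_distances(land, start, spacing=1.0):
    """Shortest over-land path length from `start` to every reachable land cell.

    land    : list of rows of booleans, True where the grid cell is land
    start   : (row, col) of the starting land cell
    spacing : grid spacing in map units (e.g. km)
    """
    rows, cols = len(land), len(land[0])
    steps = [(-1, 0), (1, 0), (0, -1), (0, 1),
             (-1, -1), (-1, 1), (1, -1), (1, 1)]
    dist = {start: 0.0}
    queue = [(0.0, start)]
    while queue:
        d, (r, c) = heapq.heappop(queue)
        if d > dist[(r, c)]:
            continue  # stale queue entry, a shorter route was already found
        for dr, dc in steps:
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and land[nr][nc]:
                nd = d + spacing * math.hypot(dr, dc)
                if nd < dist.get((nr, nc), math.inf):
                    dist[(nr, nc)] = nd
                    heapq.heappush(queue, (nd, (nr, nc)))
    return dist


# Hypothetical mask: two peninsulas separated by an estuary, joined in the east.
grid = [
    "#####",
    "....#",
    "#####",
]
land = [[ch == "#" for ch in row] for row in grid]
from_nw = overland_distances(land, (0, 0), spacing=10.0)
across = from_nw[(2, 0)]  # far longer than the 20-unit straight line
```

Path lengths on such a grid overestimate true overland distance, and a diagonal step can clip the corner of a water cell, so the trade-off between grid spacing, speed and accuracy noted above applies here too.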
Using overland distances as the basis for calculating spatial statistics
appears to require me to write my own code to calculate those statistics. I
am a GenStat user, not S-Plus or SAS - but can anyone tell me of clever ways
that allow the use of distance look-up tables in any common package?
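For what it is worth, the classical empirical semivariogram needs only the pairwise distances, not the coordinates, so a distance look-up table can be used directly. The Python sketch below assumes a precomputed square matrix `dist` (for example of overland distances) and bins half the squared differences by lag; the function name, arguments and toy numbers are hypothetical. Whether standard variogram models remain valid for non-Euclidean distances is a separate question that this sketch does not address.

```python
import math


def empirical_semivariogram(values, dist, lag_width, n_lags):
    """Bin half squared differences by a precomputed pairwise distance.

    Returns a list of (lag midpoint, semivariance, number of pairs),
    omitting lag classes that contain no pairs.
    """
    sums = [0.0] * n_lags
    counts = [0] * n_lags
    n = len(values)
    for i in range(n):
        for j in range(i + 1, n):
            h = dist[i][j]
            if not math.isfinite(h):
                continue  # no land route between this pair (e.g. an island)
            k = int(h // lag_width)
            if k < n_lags:
                sums[k] += 0.5 * (values[i] - values[j]) ** 2
                counts[k] += 1
    return [((k + 0.5) * lag_width, sums[k] / counts[k], counts[k])
            for k in range(n_lags) if counts[k] > 0]


# Hypothetical toy input: three sites with a distance look-up table.
z = [1.0, 2.0, 4.0]
table = [[0.0, 1.0, 2.0],
         [1.0, 0.0, 1.0],
         [2.0, 1.0, 0.0]]
gamma = empirical_semivariogram(z, table, lag_width=1.0, n_lags=3)
```

The same function works on residuals instead of raw values, so detrended and undetrended variograms can be compared with one piece of code.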
A common feature of advice seems to be to try things two or more ways and
see which works best. (Perhaps one should say, see how different the
results are.) Given the number of technical issues here (grid origin, grid
spacing, with and without detrending, different types of residual, with and
without transformation, lag distance, tolerance, etc, etc) a tree of
alternative approaches arises and objectivity seems to recede rapidly. Are
we looking to see which answer best fits our preconceptions? This was one
of my original worries about variogram modelling and kriging (see Charles T.
Kufs' posting today!), but I now see that it applies to other analytical
decisions too. Will geostatisticians in due course be able to recommend
objective routes through this branching maze?
I am still seriously hung-up over the following, and would much appreciate
further advice:
1. How to incorporate spatial correlation within and between datasets into
GLM. Advice in practical terms if possible.
2. How to calculate semi-variogram and/or correlogram statistics using a
matrix of overland distances. [This must surely be a common problem? How
many study areas are uniform and rectangular?]
3. Is it technically correct to calculate the semi-variogram on residuals
(of any kind) from a GLM?
With gratitude to all who responded!
Dr Jonathan C Reynolds
The Game Conservancy Trust
Hampshire SP6 1EF
tel: +44 (0)1425 652381
FAX: +44 (0)1425 651026
*To post a message to the list, send it to ai-geostats@....
*As a general service to list users, please remember to post a summary
of any useful responses to your questions.
*To unsubscribe, send email to majordomo@... with no subject and
"unsubscribe ai-geostats" in the message body.
DO NOT SEND Subscribe/Unsubscribe requests to the list!