
GEOSTATS: SUMMARY: Non-colocated disease datasets. Further help sought!

  • Jonathan Reynolds
    Oct 27, 2000
      DEAR ALL,
      This is a provisional summary of the help I received in response to my
      question a few weeks ago on non-colocated datasets. As is customary, I
      apologise to correspondents if I have failed to understand or express
      important points. I still have at least as many questions as before, hence
      I would be grateful if anyone has further suggestions.

      Sections of this summary:

      > I'm an ecologist with an interest in wildlife disease epidemiology. I have
      > two unique datasets representing indices of occurrence of the same disease
      > in two species, both highly mobile terrestrial mammals. Visually (in
      > postings) the two maps are convincingly similar - i.e. these data are
      > ecologically very interesting indeed! I want to test the spatial
      > correlation between the two datasets, because it's likely that one species
      > is the reservoir infecting the other.
      > My problems fall into two categories:
      > (1) Spatial autocorrelation
      > A logical first step would seem to be to test whether each dataset is
      > spatially autocorrelated. This seems likely when one examines the maps,
      > but semivariograms suggest that a sill is very quickly reached (at about
      > km), a distance that seems improbably small as we are dealing with highly
      > mobile mammals and epidemics that look to be 100-200 km across. In both
      > datasets, the variogram value is thereafter highly variable with
      > distance, and there is some suggestion of an oscillating 'hole effect'.
      > However, as distance increases the variogram is clearly being influenced by
      > the shape of mainland Britain, and the timid faith I have in the variogram
      > at small h falls away rapidly as h increases. For instance, a location in
      > south-west Wales is very close to north Cornwall for a bird (by Euclidean
      > distance), but quite far away for a terrestrial mammal that must travel by
      > land around the Severn estuary.
      > A further consideration, if I have correctly understood the meaning of
      > stationarity, is that both datasets have an underlying trend, with values
      > increasing from west to east and north to south. Despite having read at
      > length in the AI-Geostats archives, I am still unsure how to deal with this
      > in practical terms.
      > For each species alone, the variogram is theoretically of tremendous
      > interest. How local are the epidemics? How close must an epidemic be
      > before a given animal is at risk? Is there genuinely a rippling effect
      > surrounding a disease epicentre? Unfortunately, from the outset there seems
      > to be a discrepancy between the variogram and my eye. I am a sworn
      > of objectivity, but I'm not yet convinced that my variogram is doing the
      > right thing.
      > (2) Sampling locations and correlation between the datasets
      > Both datasets cover the whole UK (including some islands, which are easily
      > and logically excluded), but originate from two populations of people
      > (hunters and veterinarians) with necessarily different geographical
      > distributions - i.e. they are not colocated. I could convert both datasets
      > to a common regular grid, but this involves interpolation, a number of
      > assumptions, and the creation of quite a few new grid locations that have no
      > data from one or both dataset(s). If I did convert to a common grid, I am
      > then at a loss to know how to proceed further. The two datasets do not have
      > similar underlying distributions. One is an incidence (count of diseased
      > animals per unit effort), and is easily normalised by a log transformation.
      > The other is a measure of prevalence, with many essential (meaningful) zeros
      > that make transformation awkward and perhaps undesirable; these prevalence
      > data can also be weighted by the sample size on which each is based.
      > Please can anyone suggest a route forward? I have read (all the easy
      > in) quite a number of textbooks. So far as I can judge (I pull up all too
      > soon), most books stop short of problems like this because no
      > self-respecting miner would burden himself by collecting data so
      > For me, this is a crude pilot study, hence a stratified sampling programme
      > to test a hypothesis will be the next stage IF I can formalise the
      > correlation that looks so blindingly obvious to the naked eye. So please
      > don't suggest I do my sampling differently..................

      I should have explained better what the two datasets (each located by x,y
      coordinates) are:

      (1). Prevalence of an infectious disease among wild animals shot (randomly)
      by hunters operating within small areas. These data were collected from the
      hunters as a percentage (based on recollection of the preceding 12 months),
      but we also know the sample size (i.e. the experience) on which this is
      based - this could be used to weight values.

      (2). Numbers of domestic dogs with the same disease presented at veterinary
      surgeries. As we don't know the population of dogs from which these
      infected dogs are drawn, these data represent 'incidence' rather than
      'prevalence'. One must assume either that veterinary practices saturate the
      landscape so that they all draw on similar sized populations of dogs; or
      alternatively that time constraints mean that individual vets tend to deal
      with similar numbers of dogs (I have eliminated vets who deal only with
      farm animals).

      Neither dataset is normal. Both can be made approximately normal by
      transformation, but the many zeros (dataset 1) and very low values (dataset
      2) are essential features of the data, so I am loath to do this.
      The two datasets are not co-located. This is inevitable because hunters and
      vets occupy such different niches.
      Data are not evenly spread in x,y space. The vets data (2) in particular
      are highly clumped. Combining values within cells of a superimposed grid by
      averaging (1 and 2) or possibly summing (2) seems conceptually OK to me, but
      conversion to a grid loses some of the spatial information present; except
      on a very coarse grid it also creates a fresh problem of grid cells in which
      one or both data sets are missing.
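
      The gridding step described above can be sketched as follows. This is
      only an illustration under stated assumptions: the coordinates, values,
      grid origin and spacing are invented, and the function names
      (grid_cell, weighted_cell_means) are hypothetical. Hunter prevalence is
      averaged with sample size as the weight; vet counts get weight 1, which
      gives a plain mean.

```python
import math
from collections import defaultdict

def grid_cell(x, y, origin, spacing):
    """Return the (column, row) index of the grid cell containing (x, y)."""
    return (math.floor((x - origin[0]) / spacing),
            math.floor((y - origin[1]) / spacing))

def weighted_cell_means(records, origin, spacing):
    """Average values per cell; each record is (x, y, value, weight)."""
    sums = defaultdict(lambda: [0.0, 0.0])  # cell -> [sum of w*value, sum of w]
    for x, y, value, weight in records:
        acc = sums[grid_cell(x, y, origin, spacing)]
        acc[0] += weight * value
        acc[1] += weight
    return {cell: wv / w for cell, (wv, w) in sums.items() if w > 0}

# Invented example records (not the real data).
hunters = [(12.0, 8.0, 20.0, 10), (14.0, 9.0, 50.0, 30), (31.0, 5.0, 0.0, 5)]
vets = [(11.0, 7.0, 3.0, 1), (13.0, 6.0, 5.0, 1), (52.0, 40.0, 1.0, 1)]

prevalence = weighted_cell_means(hunters, origin=(0.0, 0.0), spacing=10.0)
cases = weighted_cell_means(vets, origin=(0.0, 0.0), spacing=10.0)

both = sorted(set(prevalence) & set(cases))      # cells with both datasets
one_only = sorted(set(prevalence) ^ set(cases))  # cells missing one dataset
```

      Re-running with several origins and spacings shows how quickly the
      list of cells holding both datasets shrinks as the grid gets finer.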

      A. How is the likelihood of disease at location x,y related to the
      likelihood of disease in the same species at increasing distance h? This is
      really a matter of describing the epidemiology in spatial terms.

      B. How is the likelihood of disease at location x,y related to the
      likelihood of disease in the other species at increasing distance h? A
      convincing preliminary to this would be to show that the spatial pattern of
      the disease is broadly similar in the two species. I suppose I mean by this
      that regression-style modelling is attractive, but simple correlation
      between the two datasets would be a huge step forward.
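
      One descriptive route to question B that does not need co-located
      samples is a cross-correlogram: pair every location in one dataset
      with every location in the other and average the products of the
      centred values within distance classes. The sketch below assumes
      Euclidean distances and constant means (no trend); the function name
      and the toy data are hypothetical. It describes pattern only and is
      not a significance test.

```python
import math

def cross_correlogram(set1, set2, lag_width, n_lags):
    """Estimate cross-correlation by distance class between two point datasets.

    set1, set2: lists of (x, y, value); the locations need not coincide.
    Returns a list of (lag midpoint, correlation or None, number of pairs).
    """
    m1 = sum(v for _, _, v in set1) / len(set1)
    m2 = sum(v for _, _, v in set2) / len(set2)
    s1 = math.sqrt(sum((v - m1) ** 2 for _, _, v in set1) / len(set1))
    s2 = math.sqrt(sum((v - m2) ** 2 for _, _, v in set2) / len(set2))
    sums = [0.0] * n_lags
    counts = [0] * n_lags
    for x1, y1, v1 in set1:
        for x2, y2, v2 in set2:
            k = int(math.hypot(x1 - x2, y1 - y2) // lag_width)
            if k < n_lags:
                sums[k] += (v1 - m1) * (v2 - m2)
                counts[k] += 1
    out = []
    for k in range(n_lags):
        rho = sums[k] / counts[k] / (s1 * s2) if counts[k] else None
        out.append(((k + 0.5) * lag_width, rho, counts[k]))
    return out

# Toy data: two nearby pairs of sites with matching high and low values.
wild = [(0.0, 0.0, 1.0), (10.0, 0.0, 5.0)]
dogs = [(0.0, 1.0, 2.0), (10.0, 1.0, 6.0)]
result = cross_correlogram(wild, dogs, lag_width=5.0, n_lags=3)
```

      Because each distance class is scaled by the global standard
      deviations, estimates in sparsely populated classes can fall outside
      the range -1 to 1, so the pair counts should be read alongside them.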

      Softly, softly
      Steve Rushton suggested that the analysis should begin with as little
      modelling as possible (I approve the sentiment of this softly-softly
      approach because I mistrust the strings of arbitrary choices apparently
      involved in modelling), for instance through a randomisation test or a
      Mantel test. The Mantel test, because it operates on matrices, may allow me
      to utilise an overland distance matrix (see below) - I need to do some more
      reading here. However, I'm dubious about randomisation tests, as it seems
      to me that neither of my datasets fulfills the assumption of independence
      (data are constrained by the mobility of the disease organism and its hosts,
      and by the underlying distributions of hosts and recorders in Britain).
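
      For reference, a simple Mantel test can be sketched in a few lines.
      It correlates the above-diagonal entries of two distance matrices and
      judges the result by permuting the objects of one matrix. Both
      matrices must describe the same set of locations, so non-colocated
      data would first have to be brought to common units such as grid
      cells. The spatial matrix could equally hold overland distances. The
      site coordinates and values below are invented, and the function
      names are hypothetical.

```python
import math
import random

def upper_triangle(matrix, order):
    """List the above-diagonal entries of a square matrix after reordering objects."""
    n = len(order)
    return [matrix[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]

def pearson(a, b):
    """Pearson correlation between two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    sab = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    saa = sum((p - ma) ** 2 for p in a)
    sbb = sum((q - mb) ** 2 for q in b)
    return sab / math.sqrt(saa * sbb)

def mantel(dist_a, dist_b, n_perm=999, seed=0):
    """Return (observed r, one-sided permutation p-value)."""
    rng = random.Random(seed)
    n = len(dist_a)
    identity = list(range(n))
    a = upper_triangle(dist_a, identity)
    r_obs = pearson(a, upper_triangle(dist_b, identity))
    hits = 0
    for _ in range(n_perm):
        order = identity[:]
        rng.shuffle(order)  # relabel the objects of the second matrix
        if pearson(a, upper_triangle(dist_b, order)) >= r_obs:
            hits += 1
    return r_obs, (hits + 1) / (n_perm + 1)

# Invented sites: (x, y, disease index), forming two spatial clusters.
sites = [(0, 0, 1.0), (1, 0, 1.5), (0, 1, 1.2), (5, 5, 4.0), (6, 5, 4.4), (5, 6, 4.1)]
space = [[math.hypot(p[0] - q[0], p[1] - q[1]) for q in sites] for p in sites]
differ = [[abs(p[2] - q[2]) for q in sites] for p in sites]
r, p_value = mantel(space, differ, n_perm=999, seed=1)
```

      The permutation step treats the objects as exchangeable under the
      null hypothesis, which is the independence concern raised above.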

      Deriving and interpreting variograms.
      Donald Myers and Klemens Barfuss favoured detrending the data by fitting an
      x,y plane and working with the residuals only. Brian Gray pointed out that
      some authors caution against this, arguing that small- and large-scale
      trends should be modelled together, but that he personally would suggest the
      detrending approach (at least in my case?). On logical grounds, I favour
      detrending, because prior knowledge shows that there are underlying
      large-scale trends in the distribution of hosts that one would wish to
      remove so far as possible.
      It occurred to me (confirmed by Brian) that the calculation of residuals
      could proceed by Generalised Linear Modelling, which would take due account
      of the non-normal distribution of each dataset. Fitting a simple
      first-order Euclidean x,y plane to either data set using GLM explains a
      large proportion of the variation. We then corresponded about how the
      residuals would be distributed after GLM. Surely residuals of a Poisson
      distributed variable would also be Poisson distributed? Can one 'unlink'
      mean and variance by such a process? Should one transform the residuals?
      Which residuals (natural, standardised, Pearson, deviance - there is
      probably a confusion of terms here) should one use? I still don't know the
      answer to all this.
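
      To make the residual types concrete, here is a minimal sketch of a
      Poisson GLM with a log link and a first-order x,y trend, fitted by
      iteratively reweighted least squares, followed by Pearson and
      deviance residuals. It is an illustration only: the counts and
      coordinates are invented, the function names are hypothetical, and a
      statistics package would normally do the fitting. Raw residuals
      (count minus fitted mean) keep a variance that grows with the fitted
      mean; Pearson residuals divide by the square root of the fitted mean
      to put them on a roughly common scale.

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def fit_poisson_trend(xs, ys, counts, n_iter=50):
    """Fit log(mu) = b0 + b1*x + b2*y by iteratively reweighted least squares."""
    X = [[1.0, x, y] for x, y in zip(xs, ys)]
    beta = [math.log(sum(counts) / len(counts)), 0.0, 0.0]
    for _ in range(n_iter):
        eta = [sum(b * v for b, v in zip(beta, row)) for row in X]
        mu = [math.exp(e) for e in eta]
        z = [e + (c - m) / m for e, c, m in zip(eta, counts, mu)]  # working response
        A = [[sum(m * r[i] * r[j] for m, r in zip(mu, X)) for j in range(3)]
             for i in range(3)]
        b = [sum(m * r[i] * zi for m, r, zi in zip(mu, X, z)) for i in range(3)]
        beta = solve(A, b)
    mu = [math.exp(sum(b * v for b, v in zip(beta, row))) for row in X]
    return beta, mu

def pearson_residuals(counts, mu):
    """(observed - fitted) / sqrt(fitted), since Poisson variance equals the mean."""
    return [(c - m) / math.sqrt(m) for c, m in zip(counts, mu)]

def deviance_residuals(counts, mu):
    """Signed square root of each observation's contribution to the deviance."""
    out = []
    for c, m in zip(counts, mu):
        term = c * math.log(c / m) if c > 0 else 0.0
        d = 2.0 * (term - (c - m))
        out.append(math.copysign(math.sqrt(max(d, 0.0)), c - m))
    return out

# Invented counts on a 4 x 3 grid of locations, increasing with x and y.
xs = [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3]
ys = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
counts = [0, 1, 1, 3, 1, 2, 2, 5, 1, 3, 4, 9]
beta, mu = fit_poisson_trend(xs, ys, counts)
res_pearson = pearson_residuals(counts, mu)
res_deviance = deviance_residuals(counts, mu)
```

      Either set of residuals could then be passed to a variogram routine.
      Whether that is technically sound is the open question raised above.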
      Suggested literature:
      Gumpertz, ML, et al. (2000) Logistic regression for Southern Pine Beetle
      outbreaks with spatial and temporal autocorrelation. Forest Science
      GREAT DEAL - THOROUGHLY RECOMMENDED. However, it deals with a binomial
      event, not with count data as in my case, so the detailed methodology is not
      Gotway and Stroup (1997) A generalized linear model approach to spatial data
      analysis and
      prediction, JABES 2:157-178

      Discovering the limits of dry land
      Steve Rushton favoured my proposal to calculate overland distances to use as
      h values. Manifold v. 4.5 or later offered a simple means to do this. The
      advice was: choose the origin and spacing of your analysis grid (you may
      need to try several alternatives). Superimpose a grid of points, and build a
      nearest neighbour network through these. Calculate the shortest path
      through the network for each pair of points. Speed of calculation and
      accuracy of path length estimation are conflicting aims determined by grid
      spacing. In practice this procedure did not achieve what I wanted. On the
      other hand, I think I can write a solver in Manifold to accomplish what I
      want. (See www.manifold.net for details of the package. This company has a
      genuinely helpful and FAST technical advice service by email. The Manifold
      package is astoundingly cheap, and it excels in solvers and custom
      programming potential. Despite being an already experienced MapInfo user, I
      found Manifold very exciting as a means to prepare data for analysis.)
      Steve suggested Floyd's Shortest Path algorithm, which achieves much the
      same thing; but for a complex shape like mainland Britain and thousands of
      data points it promised too much initial work coding the data.
      Using overland distances as the basis for calculating spatial statistics
      appears to require me to write my own code to calculate those statistics. I
      am a GenStat user, not S-PLUS or SAS - but can anyone tell me of clever ways
      that allow the use of distance look-up tables in any common package?
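
      The grid-network idea can be sketched without any GIS package. The
      example below is a minimal illustration, not the Manifold procedure.
      It assumes a rasterised land mask, links each land cell to its eight
      neighbours, and uses Dijkstra's algorithm rather than Floyd's. The
      function name and the toy coastline are hypothetical.

```python
import heapq
import math

def overland_distances(land, start):
    """Shortest over-land path length from `start` to every reachable land cell.

    land: list of equal-length strings, '#' = land, '.' = water.
    start: (row, col) of a land cell. Distances are in grid-cell units.
    """
    steps = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1) if (dr, dc) != (0, 0)]
    dist = {start: 0.0}
    queue = [(0.0, start)]
    while queue:
        d, (r, c) = heapq.heappop(queue)
        if d > dist[(r, c)]:
            continue  # stale queue entry, a shorter route was already found
        for dr, dc in steps:
            nr, nc = r + dr, c + dc
            if 0 <= nr < len(land) and 0 <= nc < len(land[0]) and land[nr][nc] == '#':
                nd = d + math.hypot(dr, dc)  # 1 for straight moves, sqrt(2) for diagonal
                if nd < dist.get((nr, nc), math.inf):
                    dist[(nr, nc)] = nd
                    heapq.heappush(queue, (nd, (nr, nc)))
    return dist

# Toy coastline: an inlet of water ('.') separates two headlands.
land = [
    "#####",
    "#...#",
    "#...#",
    "#...#",
]
a, b = (3, 0), (3, 4)
by_land = overland_distances(land, a)[b]
straight = math.hypot(a[0] - b[0], a[1] - b[1])
```

      Here the two headlands are 4 cells apart in a straight line but
      nearly 9 cells apart by land. Running the function once per sample
      location fills a distance look-up table. Paths on an eight-neighbour
      grid are somewhat longer than true overland paths, and diagonal moves
      can clip corners of water, so the grid spacing matters.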

      Analysis dendrogram
      A common feature of advice seems to be to try things two or more ways and
      see which works best. (Perhaps one should say, see how different the
      results are.) Given the number of technical issues here (grid origin, grid
      spacing, with and without detrending, different types of residual, with and
      without transformation, lag distance, tolerance, etc, etc) a tree of
      alternative approaches arises and objectivity seems to recede rapidly. Are
      we looking to see which answer best fits our preconceptions? This was one
      of my original worries about variogram modelling and kriging (see Charles T.
      Kufs's posting today!), but I now see that it applies to other analytical
      decisions too. Will geostatisticians in due course be able to recommend
      objective routes through this branching maze?

      I am still seriously hung-up over the following, and would much appreciate
      further advice:
      1. How to incorporate spatial correlation within and between datasets into
      GLM. Advice in practical terms if possible.
      2. How to calculate semi-variogram and/or correlogram statistics using a
      matrix of overland distances. [This must surely be a common problem? How
      many study areas are uniform and rectangular?]
      3. Is it technically correct to calculate the semi-variogram on residuals
      (of any kind) from a GLM?
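
      On point 2, the empirical semi-variogram itself needs only pairwise
      distances, so it can be computed from any look-up table. The sketch
      below uses the classical estimator (half the mean squared difference
      between pairs in each lag class). The function name and the small
      example matrix are hypothetical. Note that variogram models that are
      valid for Euclidean distance are not guaranteed to remain valid with
      other distance measures, so the model-fitting step needs separate
      care.

```python
def empirical_semivariogram(values, dist, lag_width, n_lags):
    """Classical estimator using any symmetric distance matrix.

    values: one observation (or residual) per location.
    dist: dist[i][j] is the distance between locations i and j.
    Returns (lag midpoint, semivariance or None, number of pairs) per lag.
    """
    sums = [0.0] * n_lags
    counts = [0] * n_lags
    n = len(values)
    for i in range(n):
        for j in range(i + 1, n):
            k = int(dist[i][j] // lag_width)
            if k < n_lags:
                sums[k] += (values[i] - values[j]) ** 2
                counts[k] += 1
    return [((k + 0.5) * lag_width,
             sums[k] / (2 * counts[k]) if counts[k] else None,
             counts[k]) for k in range(n_lags)]

# Invented example: three locations and a table of overland distances.
values = [1.0, 2.0, 4.0]
overland = [[0.0, 1.0, 3.0],
            [1.0, 0.0, 2.5],
            [3.0, 2.5, 0.0]]
gamma = empirical_semivariogram(values, overland, lag_width=2.0, n_lags=2)
```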

      With gratitude to all who responded!

      Jonathan Reynolds

      Dr Jonathan C Reynolds
      The Game Conservancy Trust
      Hampshire SP6 1EF

      tel: +44 (0)1425 652381
      FAX: +44 (0)1425 651026
      email: jreynolds@...
      website: www.gct.org.uk/index.html

      *To post a message to the list, send it to ai-geostats@....
      *As a general service to list users, please remember to post a summary
      of any useful responses to your questions.
      *To unsubscribe, send email to majordomo@... with no subject and
      "unsubscribe ai-geostats" in the message body.
      DO NOT SEND Subscribe/Unsubscribe requests to the list!