- Greetings,

This is a question that has troubled me for a long time.

I am doing a statistical analysis of species distribution data. The
variable of interest is binary (i.e., presence/absence at sampled
locations). However, the sampling scheme is strongly influenced by the
accessibility of locations. For example, many areas have no
observations at all, some areas have dense observations of absence,
while other areas have dense observations of presence. Because of this
"non-statistically designed" sampling scheme, and because the species
tends to form colonies, the data show clusters of presences and
clusters of absences in different locations. We can easily discriminate
them visually in a two-dimensional plot. As a consequence, adding
coordinate variables to my logistic regression model is highly
significant. Quadratic terms of the coordinates improve the model even
more, to a very good discrimination (area under the ROC curve more than
90%). However, this obviously causes difficulties in interpreting the
linear and quadratic terms of the coordinate variables. (This might
make sense for large-scale data, where latitude reflects temperature
and longitude reflects distance to the sea, but it is meaningless for
small- or medium-scale data.)
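To make the situation concrete, here is a minimal sketch of the kind of model described above: a logistic regression on coordinates, once with linear terms only and once with added quadratic terms, compared by the area under the ROC curve. Everything in it is an assumption for illustration: the data are synthetic (a single colony in the middle of a unit square), the fit uses plain gradient descent rather than whatever software was actually used, and the helper names (`make_data`, `fit_scores`, `auc`) are hypothetical.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def make_data(n=300, seed=0):
    # Hypothetical data: one colony centred in the unit square, so the
    # presence probability falls off with squared distance from (0.5, 0.5).
    rng = random.Random(seed)
    xs, ys, labels = [], [], []
    for _ in range(n):
        x, y = rng.random(), rng.random()
        r2 = (x - 0.5) ** 2 + (y - 0.5) ** 2
        labels.append(1 if rng.random() < sigmoid(4.0 - 80.0 * r2) else 0)
        xs.append(x)
        ys.append(y)
    return xs, ys, labels


def standardize(col):
    m = sum(col) / len(col)
    s = math.sqrt(sum((v - m) ** 2 for v in col) / len(col))
    return [(v - m) / s for v in col]


def fit_scores(columns, labels, steps=500, lr=0.5):
    # Logistic regression by gradient descent on the mean log-loss.
    # Returns the fitted linear predictor for every observation.
    rows = list(zip(*[standardize(c) for c in columns]))
    n, p = len(rows), len(columns)
    b0, w = 0.0, [0.0] * p
    for _ in range(steps):
        g0, g = 0.0, [0.0] * p
        for row, yv in zip(rows, labels):
            err = sigmoid(b0 + sum(wj * xj for wj, xj in zip(w, row))) - yv
            g0 += err
            for j in range(p):
                g[j] += err * row[j]
        b0 -= lr * g0 / n
        w = [wj - lr * gj / n for wj, gj in zip(w, g)]
    return [b0 + sum(wj * xj for wj, xj in zip(w, row)) for row in rows]


def auc(scores, labels):
    # Area under the ROC curve as the share of (presence, absence) pairs
    # in which the presence gets the higher score (ties count one half).
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


xs, ys, labels = make_data()
u = [x - 0.5 for x in xs]  # centre the coordinates
v = [y - 0.5 for y in ys]
linear_terms = [u, v]
quadratic_terms = [u, v,
                   [a * a for a in u],
                   [a * b for a, b in zip(u, v)],
                   [b * b for b in v]]

auc_linear = auc(fit_scores(linear_terms, labels), labels)
auc_quadratic = auc(fit_scores(quadratic_terms, labels), labels)
print("AUC, linear coordinates:   ", round(auc_linear, 3))
print("AUC, quadratic coordinates:", round(auc_quadratic, 3))
```

On data like these the quadratic surface discriminates well simply because it can describe where the colony sits, which is exactly why its coefficients say little about the environment.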

So my question is: is there any method to "pre-process" these data, or
is logistic regression a suitable approach for them? Are there any
other approaches? I doubt the autologistic approach (Besag 1974,
Augustin 1993, Huffer 1997, Hoeting 2000) will help much, because there
are so many unsampled areas (more than 70% of the cells if I discretize
the region into a regular lattice).
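For reference, the unsampled share mentioned above can be computed along the following lines. This is only a sketch under assumptions: the function name `unsampled_fraction`, the bounding box, and the cell size are hypothetical choices, and the three points are made-up coordinates.

```python
import math


def unsampled_fraction(points, cell_size, bounds):
    """Share of cells in a regular lattice that contain no sampled location.

    points: iterable of (x, y) sample coordinates
    bounds: (xmin, ymin, xmax, ymax) of the study region
    """
    xmin, ymin, xmax, ymax = bounds
    ncols = max(1, math.ceil((xmax - xmin) / cell_size))
    nrows = max(1, math.ceil((ymax - ymin) / cell_size))
    occupied = set()
    for x, y in points:
        # Points on the upper boundary are assigned to the last cell.
        col = min(int((x - xmin) / cell_size), ncols - 1)
        row = min(int((y - ymin) / cell_size), nrows - 1)
        occupied.add((row, col))
    return 1.0 - len(occupied) / (nrows * ncols)


# Three made-up locations on a 4 x 4 lattice; two fall in the same cell.
samples = [(0.5, 0.5), (0.6, 0.4), (3.5, 3.5)]
fraction = unsampled_fraction(samples, cell_size=1.0, bounds=(0.0, 0.0, 4.0, 4.0))
print(fraction)
```

The result depends heavily on the chosen cell size, so the 70% figure is tied to one particular lattice resolution.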

Any suggestions will be highly appreciated!

tib - I don't know if ecological-niche factor analysis (ENFA) is
suitable for your kind of analysis, but this multivariate method is
used when presence data for a species are available and absence data
are either unavailable or unreliable:

Hirzel, A. H., J. Hausser, D. Chessel, and N. Perrin. 2002.
Ecological-niche factor analysis: how to compute habitat-suitability
maps without absence data? Ecology 83:2027-2036.

There's software to do ENFA:
http://www2.unil.ch/biomapper

or e.g. in R:
http://www.maths.lth.se/help/R/.R/library/adehabitat/html/enfa.html

Christof

On 21.02.2005, at 12:13, Tib wrote:

> [...]

- Tib,

I forwarded the posting to a colleague of mine, Falk Huetman
(fffh@...). Here is his response (you can obtain the paper he
refers to from him):

Rajive

-----

Thanks, that's a common issue with museum data (as well as with most
data collected in the wilderness and by humans). It is partly addressed
by using Ecological Niche Factor Analysis (see the BIOMAPPER website by
A. Hirzel; 'presence only' data), as well as by correction surfaces and
weighting, e.g. when considering the distribution of samples across the
range of predictors. Researchers in New Zealand and Switzerland did
that (see GAM modeling on the WWW). For some biological data in Israel,
for instance, it was shown that a road bias would be very small.

See the attached paper, which deals with these issues a little bit, too.

One can bring in the lat/lon predictors, but only if you don't want to
generalize to other areas. Usually, biologists do not do this because
they are after the environmental predictors alone, e.g. describing
pres/abs by moving from geographical space into biological space
(see papers by Townsend Peterson).

Let me know and we go from there; kind regards

F.

-----

On Mon, 21 Feb 2005 14:13:31 -0500, Tib <tibshirani@...> wrote:

> [...]
>
> * By using the ai-geostats mailing list you agree to follow its rules
> ( see http://www.ai-geostats.org/help_ai-geostats.htm )
>
> * To unsubscribe to ai-geostats, send the following in the subject or in the body (plain text format) of an email message to sympa@...
>
> Signoff ai-geostats

--

Rajive - Tib wrote:

>So my question is, is there any method to "pre-process" these data, or
>is logistic regression a suitable approach for them? Any other
>approaches? I doubt the autologistic approach (Besag 1974, Augustin
>1993, Huffer 1997, Hoeting 2000) will help much because there are so
>many unsampled areas (more than 70% if I discretize them into a
>regular lattice system).

Tib, if you consider including x and y coordinates in your
trend model, make sure that you add their interaction;
otherwise the surface you predict is not rotation
invariant. An alternative is to use a two-dimensional
(smoothing) spline in x and y. Another alternative, closely
related to the spline, is to use kriging for spatial prediction.
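The point about the interaction term can be checked numerically. The sketch below fits a quadratic trend surface by ordinary least squares, with and without the x*y term, before and after rotating the coordinate axes by 30 degrees. It is an illustration under assumptions: the data are a made-up diagonal band of presences on a 7 x 7 grid, least squares stands in for the logistic fit to keep the code short (the argument is about the space spanned by the terms, so it should carry over), and the helper names are hypothetical.

```python
import math


def solve(a, b):
    # Gaussian elimination with partial pivoting for a small square system.
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    coef = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * coef[c] for c in range(r + 1, n))
        coef[r] = (m[r][n] - s) / m[r][r]
    return coef


def fitted_values(points, z, basis):
    # Least-squares fit of z on the basis terms via the normal equations;
    # returns the fitted surface evaluated at the data points.
    X = [basis(x, y) for x, y in points]
    p = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    xtz = [sum(row[i] * zi for row, zi in zip(X, z)) for i in range(p)]
    coef = solve(xtx, xtz)
    return [sum(c * v for c, v in zip(coef, row)) for row in X]


def with_interaction(x, y):
    return [1.0, x, y, x * x, x * y, y * y]


def without_interaction(x, y):
    return [1.0, x, y, x * x, y * y]


def rotate(points, degrees):
    t = math.radians(degrees)
    c, s = math.cos(t), math.sin(t)
    return [(c * x - s * y, s * x + c * y) for x, y in points]


# Made-up data: presences form a diagonal band across a 7 x 7 grid.
grid = [i / 3.0 - 1.0 for i in range(7)]
points = [(x, y) for x in grid for y in grid]
z = [1.0 if abs(x - y) < 0.5 else 0.0 for x, y in points]
rotated = rotate(points, 30.0)


def max_change(basis):
    # Largest change in the fitted value at any site caused by the rotation.
    before = fitted_values(points, z, basis)
    after = fitted_values(rotated, z, basis)
    return max(abs(a - b) for a, b in zip(before, after))


change_full = max_change(with_interaction)
change_reduced = max_change(without_interaction)
print("with x*y term:   ", change_full)
print("without x*y term:", change_reduced)
```

With the interaction term the fitted surface should be the same up to rounding error whichever way the axes point; without it, the predictions depend on the arbitrary orientation of the coordinate system.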

--

Edzer