Re: Standardized Sampling Methodologies and a Common Database

  • Dan Kjar
    Message 1 of 28 , Aug 15, 2008
      Discoverlife's fields are whatever the submitter wants them to be.
      The only thing required is a taxonomic name and hopefully a location
      in whatever format you like.

      --- In beemonitoring@yahoogroups.com, "Matthew Sarver" <mjsarver@...>
      wrote:
      >
      > John -
      >
      > "It is extremely important to note that there are already multiple linked
      > central repositories in place."
      >
      > Thanks for pointing this out. I am obviously not as well-versed in
      > bioinformatics databases as I could be. I did not mean to suggest
      > reinventing the wheel on this, but wasn't sure how many of these existing
      > databases are flexible enough in their data input to allow us to work with
      > the specific fields that the bee community would find useful / necessary.
      > Generating a map for a species is one thing, but a fully searchable database
      > that allows one to find flower records, flight periods, etc. for a certain
      > part of the world or a certain species is another. Right now, the Discover
      > Life specimen view includes a number of very useful data fields, but there
      > are certainly many more that might be of interest, particularly in terms of
      > habitat and floral associations. As far as I know, there is no easy way to
      > search the fields in that database, other than by viewing a specimen record
      > from the mapper. Likewise, GBIF is primarily biogeographical data. I was
      > thinking about the creation of a database web portal with a design and front
      > end that would be specifically geared toward pollinator records, and the
      > associated ecological data that might not fit the mold of available broader
      > repositories.
      >
      > Such a customized portal could also be expanded to include an EBird or
      > Bugguide-like citizen science component, where photos could be posted by
      > amateurs. I agree that bugguide already serves that purpose admirably, but
      > its structure does not encourage the entry of scientifically useful data
      > along with submitted records in the way that a custom-tailored user
      > interface like Ebird does. The already useful information generated by
      > bugguide could be made even more useful by asking users for more information
      > about their sighting.
      >
      > "Local repositories can enhance centralized (global) data by providing
      > additional more particular services (e.g., customizable dynamic local maps
      > and potentially analyses based on these) "
      >
      > I guess this is more along the lines of what I am thinking. But "local" in
      > the sense of specificity of purpose or usage, rather than geography.
      > Thoughts?
      >
      > Matt
      >
    • John S. Ascher
      Message 2 of 28 , Aug 15, 2008
        Matt -

        Thanks for another thoughtful response.

        > I did not mean to suggest
        > reinventing the wheel on this, but wasn't sure how many of these existing
        > databases are flexible enough in their data input to allow us to work with
        > the specific fields that the bee community would find useful / necessary.

        As Dan already noted, Discoverlife can accommodate virtually any field as
        long as data are linked directly to a species name. Only fields with data
        appear when you pull up specimen records; blank fields are not displayed.

        > Generating a map for a species is one thing, but a fully searchable database
        > that allows one to find flower records, flight periods, etc. for a certain
        > part of the world or a certain species is another.

        There are web portals being designed specifically to fulfill precisely
        these needs, e.g.:

        http://libraryportals.com/PCDL

        Stuart Roberts in the UK is developing an excellent database optimized to
        record these data.

        > Right now, the Discover
        > Life specimen view includes a number of very useful data fields, but there
        > are certainly many more that might be of interest, particularly in terms of
        > habitat and floral associations.

        These can already be mapped. These and other fields you can dream up can
        certainly be displayed. Sam even has a field where he notes brand of
        soap!

        > As far as I know, there is no easy way to
        > search the fields in that database, other than by viewing a specimen record
        > from the mapper.

        You are correct. The search function needs improvement.

        > Likewise, GBIF is primarily biogeographical data. I was
        > thinking about the creation of a database web portal with a design and front
        > end that would be specifically geared toward pollinator records, and the
        > associated ecological data that might not fit the mold of available broader
        > repositories.

        As noted above, this may already exist:

        http://libraryportals.com/PCDL

        > Such a customized portal could also be expanded to include an EBird or
        > Bugguide-like citizen science component, where photos could be posted by
        > amateurs. I agree that bugguide already serves that purpose admirably, but
        > its structure does not encourage the entry of scientifically useful data
        > along with submitted records in the way that a custom-tailored user
        > interface like Ebird does. The already useful information generated by
        > bugguide could be made even more useful by asking users for more information
        > about their sighting.

        I would advocate an all-of-the-above solution, i.e., improving Bugguide
        itself, improving relevant tools at other sites such as Discoverlife, and
        establishing useful links between sites with complementary emphases.

        > "Local repositories can enhance centralized (global) data by providing
        > additional more particular services (e.g., customizable dynamic local maps
        > and potentially analyses based on these) "
        >
        > I guess this is more along the lines of what I am thinking. But "local" in
        > the sense of specificity of purpose or usage, rather than geography.
        > Thoughts?

        I meant both.

        In terms of geography, one example of a local site would be a global or
        regional ID guide customized for a specific site by filtering out
        extralimital taxa.

        For example, here is the eastern Bee Genera guide customized for the
        Fingerlakes region of NY:

        http://www.discoverlife.org/mp/20q?guide=Bee_genera&cl=US/NY/Fingerlakes

        In terms of specificity of purpose, a local site could highlight and
        extend a subset of data, e.g., pollinator-plant interactions, derived by
        querying one or more central repositories.

        John


        > Matt
        >
        >
        >


        --
        John S. Ascher, Ph.D.
        Bee Database Project Manager
        Division of Invertebrate Zoology
        American Museum of Natural History
        Central Park West @ 79th St.
        New York, NY 10024-5192
        work phone: 212-496-3447
        mobile phone: 917-407-0378
      • Matthew Sarver
        Message 3 of 28 , Aug 15, 2008
          Great!  I didn't know discoverlife was set up that way until Dan pointed it out.  A query interface for this database now seems like an obvious starting point.  As for PCDL - I thought they were only tackling literature, at least for now.  Do they have plans to incorporate specimen data as well?  I've certainly used it for plant/pollinator interactions a number of times already. 
           
          The "citizen science" thing for insects has great potential - as long as those who can ID the pics can keep up!  An integration of bugguide and discover life would be really cool!
           
          Matt



        • Sam Droege
          Message 4 of 28 , Aug 16, 2008
            I wasn't aware of some of those new, more flexible database features;
            it will be good to have representation at the meeting from that
            group. While one could argue that you could develop those features
            later, I think that, more and more, database functions will help
            guide the development of what gets monitored. It's also clear that
            internet functions can be built directly into monitoring schemes
            rather than having paper surveys that get entered later.

            The possibilities of expanding Bugguide.net are intriguing. It seems
            particularly good at detecting the spread of introduced
            species...and the digital libraries that are produced are going to
            become invaluable.

            sam


          • Dan Kjar
            Message 5 of 28 , Aug 16, 2008
              Here is a quick breakdown of relational vs. flat databases.

              Relational databases link tables to tables and those links allow you
              to do some very powerful queries. However, as the tables grow the
              queries slow and as the relationships become more complex the database
              gets kludgy to deal with and nearly incomprehensible to people that
              did not design it.

              Flat file databases are always meaningful to any human that
              can read text. Flat files do not allow you to do some of the more
              whiz-bang pull-it-out-of-your-*** searches that relational databases
              allow. However, if you know what people are going to search
              (genus/species/whatever), the way you make flat file databases scream
              is by indexing the information and holding the indexes in hash tables
              (at the file system/OS/Perl/C++ level). This is how Pick can put
              300,000 points on a map in just a few seconds. His database currently
              has over 1.4 million records and when he gets all of the GBIF info it
              will be over 15 million records (if I remember correctly). The
              difficult part here is that you need to predetermine what queries the
              user will be doing. The big search engines all work along the same lines.
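
For anyone who wants to see the idea in miniature, here is a rough Python sketch of a flat file plus a hash index. It is not how Discover Life is actually built; the records, the column layout, and the names `build_index` and `fetch` are all made up for illustration. The index is built once, ahead of time, for a field we already know people will search (genus), and it stores the byte offset of each matching line so a lookup can jump straight to it.

```python
import io
from collections import defaultdict

# Hypothetical tab-delimited flat file: record id, genus, species, state.
# The rows are invented sample data, not real specimen records.
FLAT_FILE = (
    "1\tBombus\timpatiens\tMinnesota\n"
    "2\tApis\tmellifera\tWisconsin\n"
    "3\tBombus\tgriseocollis\tWisconsin\n"
    "4\tOsmia\tlignaria\tWisconsin\n"
)


def build_index(handle, column):
    """Map each value found in `column` to the byte offsets of its lines."""
    index = defaultdict(list)
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break
        fields = line.rstrip(b"\n").split(b"\t")
        index[fields[column].decode()].append(offset)
    return index


def fetch(handle, offsets):
    """Jump straight to each indexed line instead of scanning the file."""
    rows = []
    for offset in offsets:
        handle.seek(offset)
        rows.append(handle.readline().decode().rstrip("\n").split("\t"))
    return rows


handle = io.BytesIO(FLAT_FILE.encode())        # stands in for a file on disk
genus_index = build_index(handle, column=1)    # built once, before any search
bombus = fetch(handle, genus_index["Bombus"])  # one hash lookup, two seeks
```

The catch is the one described above: a field that was never indexed (here, species or state) can only be searched by reading the whole file.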

              I have mostly made relational databases, including my last one for the
              Smithsonian. That database is limited to the exact number of type ant
              specimens the museum holds. I made the decision that 1200 specimens
              would not slow the searches to any appreciable level so I went with
              the ease and power of a relational database. If it were going to
              30,000 I would go with a flat file design.

              If you would like to see the difference do a search on aphaenogaster
              at this website
              http://ripley.si.edu/ent/nmnhtypdb

              and compare it to an author search on wheeler
              at this website
              http://ripley.si.edu/ent/nmnhtypedb/wlb/wlbsearch.cfm

              The first is relational and allows me to easily assign multiple
              taxonomies and specimens for a single type. The second is a flat
              file. The first has 1400 or so entries in the typetable hooked to a
              variety of other tables through relationships. The second has 10,000
              records and is not hooked to other tables.


              Dan
            • Matthew Sarver
              Message 6 of 28 , Aug 16, 2008
                Dan wrote: "the way you make flat file databases scream
                is by indexing the information and holding the indexes in hash tables
                (at the file system/OS/Perl/C++) level."

                John replied: "Clearly I need to learn more about this, at least enough to understand
                something about what the experts are doing."

                 
                The whole topic is way over my head, but maybe this will help with some very basic info about different ways of indexing a database, including hash tables (I hope the info presented in this brief article is correct):

                http://20bits.com/2008/05/13/interview-questions-database-indexes/

                So, Dan - what you're telling us is that a db of the size that could store all of the potentially-contributed bee specimen records from North America would HAVE to be a flat db (eg Discover Life), rather than relational, right?  So, the question is, is it possible to create some kind of front end web interface for a db like Discover Life that would allow queries on the basis of host plant, locality, collection method, month, etc.?  Or would the amount of indexing required to do this screw up data entry?  It doesn't seem very useful to store all this information with a specimen record, but effectively have no way to access it via a query.  Being able to sort by collection method and collection protocol would go a long way toward the goal of increasing standardization without sacrificing information.  
                 
                I didn't realize how limited relational dbs were in terms of number of records - thanks for enlightening us on all of this!
                 
                Apologies for ignorance about database design. :(
                 
                Thanks
                Matt

              • Dan Kjar
                Message 7 of 28 , Aug 16, 2008
                  There is no 'real' limit on the hashes since they can be stored in
                  various ways on filesystems. They can be loaded into memory and
                  accessed very quickly. The limit on this method is exactly what you
                  state... we need to know the searches a priori. If
                  someone suddenly wants to map all of the 5 legged male bees found in
                  southern Utah we will have a problem.

                  Relational databases get around this by caching common searches and
                  renewing the cache occasionally. Products like ColdFusion have
                  included this for years (yuck, but easy; that is what I wrote the
                  Smithsonian site in, with MySQL for the database if you are interested.
                  Now I only use Perl and MySQL. Pick uses Berkeley DB, luddite that he is).

                  Let me run down a simple search using a relational database.
                  You have three tables. One is taxonomic data, another is specimen
                  data, and another is locale data. You can have multiple specimens
                  tied to single entries in the taxonomic data table and multiple
                  specimens tied to the locale data (e.g. all the specimens of one
                  species, and all of the specimens from one site). You would do this
                  to avoid having the exact same taxonomic or locale data for all 150
                  million specimens. The more crap in the table the longer it takes to
                  search it.
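
To make the three-table layout concrete, here is a small sketch. It uses Python's built-in sqlite3 module as a stand-in for MySQL, and the table names, column names, and sample rows are all invented for illustration. Each specimen row carries only two ids, so the taxonomic and locale details are stored once and pulled in with joins at query time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# Hypothetical schema: one row per taxon, one per locale, one per specimen.
conn.executescript("""
CREATE TABLE taxon  (taxon_id  INTEGER PRIMARY KEY, genus TEXT, species TEXT);
CREATE TABLE locale (locale_id INTEGER PRIMARY KEY, state TEXT, site TEXT);
CREATE TABLE specimen (
    specimen_id INTEGER PRIMARY KEY,
    taxon_id    INTEGER REFERENCES taxon(taxon_id),
    locale_id   INTEGER REFERENCES locale(locale_id)
);
""")

# Invented sample data.
conn.executemany("INSERT INTO taxon VALUES (?, ?, ?)",
                 [(1, "Bombus", "impatiens"), (2, "Apis", "mellifera")])
conn.executemany("INSERT INTO locale VALUES (?, ?, ?)",
                 [(1, "Wisconsin", "Site A"), (2, "Minnesota", "Site B")])
conn.executemany("INSERT INTO specimen VALUES (?, ?, ?)",
                 [(1, 1, 1), (2, 1, 2), (3, 2, 1), (4, 1, 1)])


def specimens_in_state(state):
    """Join specimen to taxon and locale, keeping rows from one state."""
    cursor = conn.execute("""
        SELECT s.specimen_id, t.genus, t.species, l.site
        FROM specimen AS s
        JOIN taxon  AS t ON t.taxon_id  = s.taxon_id
        JOIN locale AS l ON l.locale_id = s.locale_id
        WHERE l.state = ?
        ORDER BY s.specimen_id
    """, (state,))
    return cursor.fetchall()


wisconsin = specimens_in_state("Wisconsin")
```

Three of the four specimens share one locale row and three share one taxon row, which is the duplication the separate tables are meant to avoid.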

                  The problem is if you search on the fly and you have 300,000 records,
                  a simple search for the bees of Wisconsin takes a very long time (but
                  not nearly as long as searching a flat file without the hash table).
                  If you have a hash table of locales all you need to do is search down
                  the locales and then grab all of the records included.

                  Example hash table based on previously searched terms:

                  key        value
                  Minnesota  1,3,5,6,9,10,23,35
                  Wisconsin  2,3,4,8,11,20,34

                  It only takes a split second to reach into the flat database and grab
                  everything in records 2, 3, etc. It takes a little longer to reach
                  into a relational database and check each specimen record to see if it
                  has a link to a locale table entry that includes Wisconsin (or vice
                  versa, but you would still need to check the taxonomic table to make
                  sure it is a bee or whatever you are interested in). Every time there
                  is a comparison statement it takes much more time. Like I said though,
                  this only really matters with very large datasets, and people at places
                  invested in relational datasets spend most of their time figuring out
                  how to make things move more quickly.
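
Here is roughly what that lookup looks like in Python, as a sketch only. The locale entries are copied from the example table; the genus index and the `lookup` function are hypothetical additions to show how two prebuilt indexes could be combined by intersecting their record numbers.

```python
# Record numbers per locale, copied from the example hash table.
locale_index = {
    "Minnesota": {1, 3, 5, 6, 9, 10, 23, 35},
    "Wisconsin": {2, 3, 4, 8, 11, 20, 34},
}

# Hypothetical second index, invented for illustration.
genus_index = {
    "Bombus": {2, 3, 9, 20},
    "Osmia": {4, 5, 34},
}


def lookup(locale=None, genus=None):
    """Return the sorted record numbers that match every criterion given."""
    hits = None
    for index, key in ((locale_index, locale), (genus_index, genus)):
        if key is None:
            continue
        ids = index.get(key, set())      # unknown key: no matching records
        hits = ids if hits is None else hits & ids
    return sorted(hits) if hits is not None else []


wisconsin_records = lookup(locale="Wisconsin")
wisconsin_bombus = lookup(locale="Wisconsin", genus="Bombus")
```

Nothing here compares individual records: each criterion is one dictionary lookup and the combination is a set intersection. The flip side is that `lookup` can only answer questions about fields somebody indexed in advance.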

                  There are many other ways to get relational datasets moving fast but
                  in the business world it is a bit easier for the consumer. If you log
                  onto your bank account they can cache all information dealing with
                  your accounts so you can have quick access to it after a short login
                  wait. However, they know you are only going to look at your own stuff
                  (hopefully). Since it takes this kind of magic to get relational
                  databases to move I have decided that I might as well skip all that
                  nonsense and move to the indexing right away and leave the data in a
                  human readable format in case I kick off.

                  The other nice thing about flat files is that anyone can write queries
                  or index it however they see fit. As soon as you decide to put it
                  into a relational setup (e.g. speciesname table, genusname table,
                  specimen table, source table, locale table, alien invasive status
                  table etc.), you are tied to that setup to create queries. Of course
                  you could write a query that would flatten it (I did this with some
                  fish data from STRI and it WAS AWFUL), but that raises the question: why
                  not just leave the data in human readable form and cut it up for
                  individual uses?
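
As a toy illustration of what "flattening" means, the sketch below joins three small normalized tables back into one tab-delimited line per specimen. The tables are plain Python dicts with invented contents, and `flatten` is a hypothetical helper; a real flattening query over many tables would be far messier.

```python
import csv
import io

# Hypothetical normalized tables, keyed by id (invented sample data).
taxa = {1: ("Bombus", "impatiens"), 2: ("Apis", "mellifera")}
locales = {1: ("Wisconsin", "Site A"), 2: ("Minnesota", "Site B")}
specimens = [(1, 1, 1), (2, 1, 2), (3, 2, 1)]  # (specimen_id, taxon_id, locale_id)


def flatten(out):
    """Write one self-contained, human-readable line per specimen."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["specimen_id", "genus", "species", "state", "site"])
    for specimen_id, taxon_id, locale_id in specimens:
        # Copy the shared taxon and locale details into every specimen row.
        writer.writerow([specimen_id, *taxa[taxon_id], *locales[locale_id]])


buffer = io.StringIO()   # stands in for a file opened for writing
flatten(buffer)
flat_text = buffer.getvalue()
```

Every output line repeats the genus, species, and locale text, which costs space but means each line can be read or re-indexed on its own.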

                  Not that any of this needs to be worried about at this point....

                  Dan


                  --- In beemonitoring@yahoogroups.com, "Matthew Sarver" <mjsarver@...>
                  wrote:
                  >
                  > Dan wrote: "the way you make flat file databases scream
                  > is by indexing the information and holding the indexes in hash tables
                  > (at the file system/OS/Perl/C++) level."
                  >
                  > John replied: "Clearly I need to learn more about this, at least enough to
                  > understand something about what the experts are doing."
                  >
                  > The whole topic is way over my head, but maybe this will help with some very
                  > basic info about different ways of indexing a database, including hash
                  > tables (I hope the info presented in this brief article is correct):
                  >
                  > http://20bits.com/2008/05/13/interview-questions-database-indexes/
                  >
                  > So, Dan - what you're telling us is that a db of the size that could store
                  > all of the potentially-contributed bee specimen records from North America
                  > would HAVE to be a flat db (eg Discover Life), rather than relational,
                  > right? So, the question is, is it possible to create some kind of front end
                  > web interface for a db like Discover Life that would allow queries on the
                  > basis of host plant, locality, collection method, month, etc.? Or would the
                  > amount of indexing required to do this screw up data entry? It doesn't seem
                  > very useful to store all this information with a specimen record, but
                  > effectively have no way to access it via a query. Being able to sort by
                  > collection method and collection protocol would go a long way toward the
                  > goal of increasing standardization without sacrificing information.
                  >
                  > I didn't realize how limited relational dbs were in terms of number of
                  > records - thanks for enlightening us on all of this!
                  >
                  > Apologies for ignorance about database design. :(
                  >
                  > Thanks
                  > Matt
                  >