Re: Standardized Sampling Methodologies and a Common Database

  • Dan Kjar
    Message 1 of 28 , Aug 16 6:24 PM
      There is no 'real' limit on the hashes since they can be stored in
      various ways on filesystems. They can be loaded into memory and
      accessed very quickly. The limit on this method is exactly what you
      state... we need to know the searches in advance of the visit. If
      someone suddenly wants to map all of the 5-legged male bees found in
      southern Utah we will have a problem.
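
      To make that trade-off concrete, here is a minimal Python sketch
      (not the code behind any real site). It builds an index on one
      field, writes it to disk, loads it back into memory, and falls back
      to a full scan when someone asks about a field nobody indexed in
      advance. The records, field names, and file name are all invented
      for illustration.

```python
import json
import os
import tempfile

# Hypothetical flat-file records: record number -> fields.
RECORDS = {
    1: {"state": "Minnesota", "sex": "female", "legs": 6},
    2: {"state": "Utah", "sex": "male", "legs": 5},
    3: {"state": "Utah", "sex": "male", "legs": 6},
}


def build_index(records, field):
    """Map each value of `field` to the sorted record numbers that hold it."""
    index = {}
    for recno, rec in records.items():
        index.setdefault(str(rec[field]), []).append(recno)
    return {key: sorted(nums) for key, nums in index.items()}


def save_index(index, path):
    """Store the index on the filesystem as JSON."""
    with open(path, "w") as fh:
        json.dump(index, fh)


def load_index(path):
    """Load a stored index back into memory."""
    with open(path) as fh:
        return json.load(fh)


def search(records, indexes, field, value):
    """Use a prebuilt index when one exists; otherwise scan every record."""
    if field in indexes:
        return indexes[field].get(str(value), []), "index"
    hits = sorted(n for n, r in records.items() if r[field] == value)
    return hits, "scan"


# Only "state" was anticipated, so only "state" gets an index.
path = os.path.join(tempfile.mkdtemp(), "state.idx.json")
save_index(build_index(RECORDS, "state"), path)
indexes = {"state": load_index(path)}
```

      Here search(RECORDS, indexes, "state", "Utah") is answered from the
      index, while a search on "legs" has to walk every record. That walk
      is the problem case once the file holds millions of records.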

      Relational databases get around this by caching common searches and
      renewing the cache occasionally. Products like ColdFusion have
      included this for years (yuck, but easy; that is what I wrote the
      Smithsonian site in, with MySQL for the database if you are
      interested. Now I only use Perl and MySQL. Pick uses Berkeley DB,
      luddite that he is).
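
      The caching idea can be sketched in a few lines of Python. This is
      a generic time-based cache, not how ColdFusion or MySQL implement
      theirs. The class name, the max_age parameter, and the run_query
      callback are assumptions made up for the example.

```python
import time


class SearchCache:
    """Keep search results and recompute them once they are older than max_age seconds."""

    def __init__(self, run_query, max_age=300.0, clock=time.monotonic):
        self.run_query = run_query  # the slow search being wrapped
        self.max_age = max_age      # how long a cached result stays fresh
        self.clock = clock          # injectable so the cache can be tested
        self._store = {}            # query -> (time stored, result)

    def get(self, query):
        now = self.clock()
        entry = self._store.get(query)
        if entry is not None and now - entry[0] < self.max_age:
            return entry[1]         # fresh enough: skip the slow search
        result = self.run_query(query)
        self._store[query] = (now, result)
        return result
```

      The first call for a given query pays the full cost. Repeat calls
      within max_age seconds come straight from memory, and the next call
      after that renews the entry.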

      Let me run down a simple search using a relational database.
      You have three tables. One holds taxonomic data, another specimen
      data, and another locale data. You can have multiple specimens
      tied to single entries in the taxonomic data table and multiple
      specimens tied to the locale data (e.g. all the specimens of one
      species, and all of the specimens from one site). You would do this
      to avoid having the exact same taxonomic or locale data for all 150
      million specimens. The more crap in the table the longer it takes to
      search it.
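
      A rough sketch of that three-table layout, using Python's built-in
      sqlite3 module so it runs without a database server (the sites
      discussed here used MySQL). Table names, column names, and rows are
      invented for illustration, and a family column stands in for "is it
      a bee".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE taxon  (taxon_id  INTEGER PRIMARY KEY, family TEXT, species TEXT);
CREATE TABLE locale (locale_id INTEGER PRIMARY KEY, state TEXT, site TEXT);
-- Each specimen row stores only links, not copies of the taxon/locale data.
CREATE TABLE specimen (
    specimen_id INTEGER PRIMARY KEY,
    taxon_id    INTEGER REFERENCES taxon(taxon_id),
    locale_id   INTEGER REFERENCES locale(locale_id)
);
INSERT INTO taxon  VALUES (1, 'Apidae', 'Bombus impatiens'),
                          (2, 'Formicidae', 'Formica subsericea');
INSERT INTO locale VALUES (1, 'Wisconsin', 'Site A'),
                          (2, 'Minnesota', 'Site B');
INSERT INTO specimen VALUES (1, 1, 1), (2, 1, 1), (3, 2, 1), (4, 1, 2);
""")


def specimens(conn, family, state):
    """Return specimen ids for one family in one state, via two joins."""
    rows = conn.execute(
        """SELECT s.specimen_id
           FROM specimen AS s
           JOIN taxon  AS t ON t.taxon_id  = s.taxon_id
           JOIN locale AS l ON l.locale_id = s.locale_id
           WHERE t.family = ? AND l.state = ?
           ORDER BY s.specimen_id""",
        (family, state),
    ).fetchall()
    return [row[0] for row in rows]
```

      Calling specimens(conn, "Apidae", "Wisconsin") has to match every
      candidate specimen against both of the other tables, which is the
      comparison work that gets expensive as the tables grow.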

      The problem is if you search on the fly and you have 300,000 records,
      a simple search for the bees of Wisconsin takes a very long time (but
      not nearly as long as searching a flat file without the hash table).
      If you have a hash table of locales all you need to do is search down
      the locales and then grab all of the records included.

      Example hash table based on previously searched terms:

      key        value
      Minnesota  1,3,5,6,9,10,23,35
      Wisconsin  2,3,4,8,11,20,34

      It only takes a split second to reach into the flat database and grab
      everything in records 2, 3, etc. It takes a little longer to reach
      into a relational database and check each specimen record to see if it
      has a link to a locale table entry that includes Wisconsin (or vice
      versa, but you would still need to check the taxonomic table to make
      sure it is a bee or whatever you are interested in). Every time there
      is a comparison statement it takes much more time. Like I said though,
      this only really matters with very large datasets, and people at places
      invested in relational databases spend most of their time figuring out
      how to make things move more quickly.
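
      The lookup described above can be sketched in Python as follows.
      The locale index is copied from the example table. The taxon index
      and the stand-in flat file are invented for illustration, and the
      sketch assumes record N is simply line N of the file.

```python
# Locale index from the example table: key -> record numbers.
LOCALE_INDEX = {
    "Minnesota": [1, 3, 5, 6, 9, 10, 23, 35],
    "Wisconsin": [2, 3, 4, 8, 11, 20, 34],
}

# Hypothetical second index, made up for this sketch.
TAXON_INDEX = {"bee": [2, 3, 8, 9, 34]}


def lookup(*postings):
    """Intersect the record-number lists returned by one or more index lookups."""
    result = set(postings[0])
    for numbers in postings[1:]:
        result &= set(numbers)
    return sorted(result)


def fetch(lines, record_numbers):
    """Pull records by number from a flat file held as a list of lines."""
    return [lines[n - 1] for n in record_numbers]


# Stand-in flat file: 35 tab-separated records, one per line.
flat_file = [f"{n}\tspecimen {n}" for n in range(1, 36)]

wisconsin_bees = lookup(LOCALE_INDEX["Wisconsin"], TAXON_INDEX["bee"])
records = fetch(flat_file, wisconsin_bees)
```

      No record is ever compared against the search term here. The work
      is two dictionary lookups, one set intersection, and a direct fetch
      of the matching lines.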

      There are many other ways to get relational databases moving fast, but
      in the business world it is a bit easier for the consumer. If you log
      onto your bank account they can cache all information dealing with
      your accounts so you can have quick access to it after a short login
      wait. However, they know you are only going to look at your own stuff
      (hopefully). Since it takes this kind of magic to get relational
      databases to move, I have decided that I might as well skip all that
      nonsense and move to the indexing right away and leave the data in a
      human-readable format in case I kick off.

      The other nice thing about flat files is that anyone can write queries
      or index them however they see fit. As soon as you decide to put the
      data into a relational setup (e.g. speciesname table, genusname table,
      specimen table, source table, locale table, alien invasive status
      table, etc.) you are tied to that setup to create queries. Of course
      you could write a query that would flatten it (I did this with some
      fish data from STRI and it WAS AWFUL), but that raises the question:
      why not just leave the data in human-readable form and cut it up for
      individual uses?
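
      As an illustration of "index it however you see fit", the Python
      sketch below reads a small tab-delimited flat file and builds an
      index on whichever column the caller names. The column names
      (recno, state, method, month) and the three rows are assumptions
      made up for the example.

```python
import csv
import io
from collections import defaultdict

# Hypothetical flat file: one header line, then one specimen per line.
FLAT = (
    "recno\tstate\tmethod\tmonth\n"
    "1\tWisconsin\tbowl trap\t6\n"
    "2\tWisconsin\tnet\t7\n"
    "3\tMinnesota\tbowl trap\t7\n"
)


def index_by(text, column):
    """Build value -> record numbers for any column of the flat file."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    index = defaultdict(list)
    for row in reader:
        index[row[column]].append(int(row["recno"]))
    return dict(index)


by_method = index_by(FLAT, "method")
by_month = index_by(FLAT, "month")
```

      Adding a new kind of query means calling index_by on another
      column. The flat file itself never changes and stays readable in
      any text editor.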

      Not that any of this needs to be worried about at this point....


      --- In beemonitoring@yahoogroups.com, "Matthew Sarver" <mjsarver@...>
      > Dan wrote: "the way you make flat file databases scream
      > is by indexing the information and holding the indexes in hash tables
      > (at the file system/OS/Perl/C++) level."
      > John replied: "Clearly I need to learn more about this, at least
      > enough to understand something about what the experts are doing."
      > The whole topic is way over my head, but maybe this will help with
      > some very basic info about different ways of indexing a database,
      > including hash tables (I hope the info presented in this brief
      > article is correct):
      > http://20bits.com/2008/05/13/interview-questions-database-indexes/
      > So, Dan - what you're telling us is that a db of the size that could
      > [hold] all of the potentially-contributed bee specimen records from
      > North [America] would HAVE to be a flat db (e.g. Discover Life),
      > rather than relational, right? So, the question is, is it possible
      > to create some kind of front end web interface for a db like
      > Discover Life that would allow queries on the basis of host plant,
      > locality, collection method, month, etc.? Or would the amount of
      > indexing required to do this screw up data entry? It doesn't seem
      > very useful to store all this information with a specimen record, but
      > effectively have no way to access it via a query. Being able to sort by
      > collection method and collection protocol would go a long way toward the
      > goal of increasing standardization without sacrificing information.
      > I didn't realize how limited relational dbs were in terms of number of
      > records - thanks for enlightening us on all of this!
      > Apologies for ignorance about database design. :(
      > Thanks
      > Matt