Re: Standardized Sampling Methodologies and a Common Database
- There is no 'real' limit on the hashes since they can be stored in
various ways on filesystems. They can be loaded into memory and
accessed very quickly. The limit on this method is exactly what you
state... we need to know the searches ahead of the visit. If someone
suddenly wants to map all of the five-legged male bees found in
southern Utah, we will have a problem.
Relational databases get around this by caching common searches and
renewing the cache occasionally. Products like ColdFusion have
included this for years (yuck, but easy; that is what I wrote the
Smithsonian site in, with MySQL for the database, if you are
interested. Now I only use Perl and MySQL. Pick uses BerkeleyDB,
luddite that he is).
Let me run down a simple search using a relational database.
You have three tables: one holds taxonomic data, another specimen
data, and a third locale data. You can have multiple specimens
tied to single entries in the taxonomic data table and multiple
specimens tied to the locale data (e.g. all the specimens of one
species, and all of the specimens from one site). You would do this
to avoid having the exact same taxonomic or locale data for all 150
million specimens. The more crap in the table the longer it takes to
The problem is that if you search on the fly and you have 300,000
records, a simple search for the bees of Wisconsin takes a very long
time (but
not nearly as long as searching a flat file without the hash table).
If you have a hash table of locales all you need to do is search down
the locales and then grab all of the records included.
For example, a hash table based on previously searched terms maps
each term straight to the matching record numbers.
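A minimal sketch of the idea in Python, assuming the flat file is one pipe-delimited, human-readable record per line (the records and the locale field are invented for illustration):

```python
# flat file: one human-readable record per line; record number = line number
flat_records = [
    "0|Bombus impatiens|Utah|pan trap",
    "1|Vespula maculifrons|Wisconsin|net",
    "2|Bombus impatiens|Wisconsin|net",
    "3|Apis mellifera|Wisconsin|pan trap",
]

# hash table over locales (built once, or cached from a previous search):
# search term -> record numbers, so a lookup is one step, not a full scan
locale_index = {}
for n, rec in enumerate(flat_records):
    locale = rec.split("|")[2]
    locale_index.setdefault(locale, []).append(n)

# "everything from Wisconsin" is one hash lookup plus direct grabs
hits = [flat_records[n] for n in locale_index.get("Wisconsin", [])]
print(locale_index["Wisconsin"])  # -> [1, 2, 3]
print(hits)
```

No comparison statements at query time: the hash hands you the record numbers and you reach straight into the flat file.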
It only takes a split second to reach into the flat database and grab
everything in records 2, 3, etc. It takes a little longer to reach
into a relational database and check each specimen record to see if it
has a link to a locale table entry that includes Wisconsin (or vice
versa, but you would still need to check the taxonomic table to make
sure it is a bee or whatever you are interested in). Every time there
is a comparison statement it takes much more time. Like I said though,
this only really matters with very large datasets and people at places
invested in relational datasets spend most of their time figuring out
how to make things move more quickly.
There are many other ways to get relational datasets moving fast but
in the business world it is a bit easier for the consumer. If you log
onto your bank account they can cache all information dealing with
your accounts so you can have quick access to it after a short login
wait. However, they know you are only going to look at your own stuff
(hopefully). Since it takes this kind of magic to get relational
databases to move I have decided that I might as well skip all that
nonsense and move to the indexing right away and leave the data in a
human readable format in case I kick off.
The other nice thing about flat files is that anyone can write queries
or index it however they see fit. As soon as you decide to put it
into a relational setup (e.g. speciesname table, genusname table,
specimen table, source table, locale table, alien invasive status
table, etc.), you are tied to that setup to create queries. Of course
you could write a query that would flatten it (I did this with some
fish data from STRI and it WAS AWFUL), but that raises the question:
why not just leave the data in human-readable form and cut it up for
indexing however you see fit?
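A toy sketch of that flattening query in Python/sqlite3 (the schema and data are invented; the real STRI fish data was of course much messier):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE taxon    (taxon_id INTEGER PRIMARY KEY, species TEXT);
CREATE TABLE locale   (locale_id INTEGER PRIMARY KEY, state TEXT);
CREATE TABLE specimen (specimen_id INTEGER PRIMARY KEY,
                       taxon_id INTEGER, locale_id INTEGER);
INSERT INTO taxon    VALUES (1, 'Bombus impatiens');
INSERT INTO locale   VALUES (1, 'Wisconsin');
INSERT INTO specimen VALUES (1, 1, 1), (2, 1, 1);
""")

# one JOIN per linked table flattens everything back to one line per specimen
flat_lines = [
    "|".join(str(v) for v in row)
    for row in con.execute("""
        SELECT s.specimen_id, t.species, l.state
        FROM specimen s
        JOIN taxon  t ON t.taxon_id  = s.taxon_id
        JOIN locale l ON l.locale_id = s.locale_id
        ORDER BY s.specimen_id
    """)
]
print(flat_lines)  # -> ['1|Bombus impatiens|Wisconsin', '2|Bombus impatiens|Wisconsin']
```

One JOIN per linked table: with a half-dozen tables the query balloons, which is the awfulness being described.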
Not that any of this needs to be worried about at this point....
--- In email@example.com, "Matthew Sarver" <mjsarver@...>
> Dan wrote: "the way you make flat file databases scream
> is by indexing the information and holding the indexes in hash tables
> (at the file system/OS/Perl/C++ level)."
> John replied: "Clearly I need to learn more about this, at least
> understand something about what the experts are doing."
> The whole topic is way over my head, but maybe this will help with
> some very basic info about different ways of indexing a database,
> including hash tables (I hope the info presented in this brief
> article is correct):
> So, Dan - what you're telling us is that a db of the size that could
> store all of the potentially-contributed bee specimen records from
> North America would HAVE to be a flat db (eg Discover Life), rather
> than relational, right? So, the question is, is it possible to create
> some kind of front end web interface for a db like Discover Life that
> would allow queries on the basis of host plant, locality, collection
> method, month, etc.? Or would the amount of indexing required to do
> this screw up data entry? It doesn't seem very useful to store all
> this information with a specimen record, but effectively have no way
> to access it via a query. Being able to sort by
> collection method and collection protocol would go a long way toward the
> goal of increasing standardization without sacrificing information.
> I didn't realize how limited relational dbs were in terms of number of
> records - thanks for enlightening us on all of this!
> Apologies for ignorance about database design. :(