What is a taxonomy (a two page memo)
Hello, I am Steve Tolkin, a principal architect at Fidelity Investments. I have a Master's in computer science and have been working in this field, and in information science, for quite a while. Here is a two page memo I have written defining a taxonomy. I would appreciate any feedback.
What is a taxonomy? A taxonomy is a particular kind of classification scheme. It is a hierarchy of nodes, where each node represents a class of individual instances. But not every hierarchical classification scheme is a taxonomy. A taxonomy must have this essential characteristic:
· If an instance is assigned to a class, it is also assigned to all of its ancestor classes. (We call this the transitive property.)
To be most useful a taxonomy should also have these characteristics:
· The hierarchy should be a tree; each node (except the root) should have exactly one parent. (We call this the tree property.)
· Every instance in the domain of the taxonomy should have exactly one path from the lowest level node where that instance is assigned to the root. (We call this the single inheritance property. If this property is not true we say the taxonomy supports multiple inheritance.)
· The child nodes of a class should partition the instances assigned to that class. (This is the partition property. It is actually a combination of two separable properties: mutually exclusive and jointly exhaustive. A partition of a set divides that set into subsets such that each element is in exactly one subset.)
These properties are very important. They allow using the taxonomy to reason about the domain, i.e. to draw valid conclusions about any individual. For example, if we know that an individual is a dog we automatically know that it is also a Carnivore, a Mammal, a Vertebrate, etc. That is due to the transitive property. The mutually exclusive property lets us conclude that if an individual is a dog then it is not a cat. In accounting, a chart of accounts is a taxonomy. The partition property means it is possible to “roll up” the numbers at any level and know that the items will correctly sum to subtotals, and subtotals to grand totals; no data will be missing and there will be no “double counting”.
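As a rough illustration of the reasoning described above, here is a minimal Python sketch. The toy hierarchy, the function names (ancestors, is_partition, roll_up) and the sample counts are all hypothetical and heavily simplified; they are not taken from any real taxonomy tool, and the sketch assumes amounts are recorded only at leaf nodes.

```python
# Hypothetical, simplified taxonomy stored as child -> parent.
# One parent per node, so the tree property holds by construction.
PARENT = {
    "Vertebrate": "Animal",
    "Mammal": "Vertebrate",
    "Carnivore": "Mammal",
    "Dog": "Carnivore",
    "Cat": "Carnivore",
}


def ancestors(node):
    """Return every ancestor class of node, nearest first (transitive property)."""
    result = []
    while node in PARENT:
        node = PARENT[node]
        result.append(node)
    return result


def is_partition(parent_set, child_sets):
    """True if the child sets are mutually exclusive and jointly exhaustive."""
    seen = set()
    for child in child_sets:
        if seen & child:          # an instance appears in two children
            return False
        seen |= child
    return seen == parent_set     # no instance is missing or extra


def roll_up(leaf_totals):
    """Add each leaf amount to all of its ancestors (assumes leaf-only input)."""
    totals = dict(leaf_totals)
    for leaf, amount in leaf_totals.items():
        for anc in ancestors(leaf):
            totals[anc] = totals.get(anc, 0) + amount
    return totals


if __name__ == "__main__":
    print(ancestors("Dog"))
    print(is_partition({"rex", "tom"}, [{"rex"}, {"tom"}]))
    print(roll_up({"Dog": 3, "Cat": 2}))
```

Because each leaf contributes to each ancestor exactly once, the subtotal at every level equals the sum of the leaves beneath it. That only works when the children of each class really do partition it, which is the point of the is_partition check.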
Some people have created taxonomies that do not have these properties, and have sometimes tried to explain that they were providing more “flexibility”. It is indeed easier to create a “taxonomy” that does not have these properties. Unfortunately the resulting taxonomy is less useful for searching.
Here are some example taxonomies. The original taxonomy comes from biology. It assigns each kind of organism its position in the “tree of life”. For example
http://animaldiversity.ummz.umich.edu/site/accounts/information/Canis_lupus_familiaris.html shows the classification for a domestic dog, as follows: Kingdom: Animalia; Phylum: Chordata; Subphylum: Vertebrata; Class: Mammalia; Order: Carnivora; Family: Canidae; Genus: Canis; Species: Canis lupus; Subspecies: Canis lupus familiaris. To see this same information using a tree control go to http://spice.sp2000.org/browse_taxa.php
Another large and widely used taxonomy is the NAICS (North American Industry Classification System). See http://www.census.gov/epcd/www/naics.html for details. (This is one of the taxonomies that UDDI has built-in support for.) The example below shows some of the taxonomy in the area of financial services.
52 Finance and Insurance
  523 Securities, Commodity Contracts, and Other Financial Investments and Related Activities
    5231 Securities and Commodity Contracts Intermediation and Brokerage
      52312 Securities Brokerage
        523120 Securities Brokerage
  525 Funds, Trusts, and Other Financial Vehicles
    5251 Insurance and Employee Benefit Funds
      52511 Pension Funds
        525110 Pension Funds
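In the codes listed above, each parent's code is a prefix of its children's codes, so the ancestors of a node can be read directly off the code. The sketch below assumes that prefix convention; it holds for these Finance and Insurance codes, but I have not checked it against every NAICS sector (some sectors span a range of two-digit codes). The dictionary holds only the nodes shown above, and the function name naics_ancestors is my own.

```python
# Only the nodes from the example above; not the full NAICS table.
NAICS = {
    "52": "Finance and Insurance",
    "523": "Securities, Commodity Contracts, and Other Financial "
           "Investments and Related Activities",
    "5231": "Securities and Commodity Contracts Intermediation and Brokerage",
    "52312": "Securities Brokerage",
    "523120": "Securities Brokerage",
    "525": "Funds, Trusts, and Other Financial Vehicles",
    "5251": "Insurance and Employee Benefit Funds",
    "52511": "Pension Funds",
    "525110": "Pension Funds",
}


def naics_ancestors(code):
    """Return the known ancestor codes of `code`, from the sector downward.

    Assumes the parent code is always a proper prefix of the child code.
    """
    return [code[:n] for n in range(2, len(code)) if code[:n] in NAICS]


if __name__ == "__main__":
    for anc in naics_ancestors("525110"):
        print(anc, NAICS[anc])
```

An item classified as 525110 is therefore also classified as 52511, 5251, 525 and 52, which is the transitive property again.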
At the other extreme a taxonomy can have just two possible values, e.g. a gender code of M or F.
Originally a taxonomy was always based on the relation “is a kind of”, and was used for classes based on concepts. Now other relations are allowed. However, the transitive property should always be borne in mind. Consider a taxonomy based on geographic regions. For example, if an individual is assigned as being in New York state, then it is also in the United States. The underlying reason this works is that the “is contained in” relation, like “is a kind of”, is transitive.
In the past few years the term taxonomy has been used quite often to refer to a taxonomy of terms (rather than of concepts). A term is a word or phrase used to help a user search, e.g. search the web, or an enterprise intranet. These term taxonomies may or may not have the properties described above. There are many vendors that will help an enterprise create its own term taxonomy. These tend to be much larger than the concept taxonomies discussed above. For example the NAICS taxonomy is relatively large for a concept-based taxonomy, having 2341 nodes. In contrast a medium size enterprise term taxonomy has 2000 to 20000 nodes. This needs a full time staff of 1 to 3 people to keep it maintained, even with automated support. (Source: http://www.stratify.com/infocenter/download/DelphiResearchReport.pdf )
In some cases “taxonomy” is used rather than “ontology”, because for some reason it has caught on more and seems less forbidding. It is used in this way by XBRL. Sometimes “taxonomy” is used where “thesaurus”, or “data model”, would be more appropriate.
Hopefully helpfully yours,
There is nothing so practical as a good theory. Comments are by me, not Fidelity Investments, its subsidiaries or affiliates.
Actually, I think it is bad to have all the information… As you say, we just want the relevant information.
Of course context and relevance are what it’s all about. The only way, in my opinion, to make things relevant is to understand your tasks and context as much as possible and try to surface appropriate information. At the Gilbane conference Tony Byrne had disagreed with me on this point, saying that it really is not possible to do this in a search context. (Tony, if I am misquoting you for the sake of brevity, forgive me.)
I would say it is not always possible to do this effectively given a body of knowledge and broad needs/interests of an audience. But we need to make an effort to do so. Part of that is not broadening an application (or a taxonomy or ontology or any system for organizing information) to try to be all things to all people, but to address the needs of the audience, community, function or task.
From: TaxoCoP@yahoogroups.com On Behalf Of Christine Connors
I agree that the waters get muddy, but is it bad to have ALL the information? The problem is selecting the relevant information. What is important for one task may be irrelevant for another, so how do you choose what to include and what not to include? I'd rather have the data and find a way to solve my relevance problem.
See Karl Fast's comments on a post of Peter Morville's at http://www.findability.org/archives/000113.php.
That is exactly correct – the relationships can describe any term in any context (even disambiguating terms).
The challenge of ontologies as you mention in the other post is they can become overly complicated based on the desire to abstract the concept to cover too many definitions, technologies, contexts, etc. If you think of an ontology in this simple definition, you can pretty much cover all of the other cases. And as you say, organizations don’t use stuff that is too complicated.
The analogous situation is when someone clever comes up with an interesting software application for a specific problem. As new groups use the technology, they add on their little pieces and make the application more ‘capable’ and more ‘flexible’ by adding on configuration modules and new functionality. The end result is an application that has a greater range of possible uses and configurations, but perhaps loses sight of its original focused purpose and becomes more unwieldy and less cost effective.
An ontology allows greater flexibility in describing knowledge and functions, but let’s not abstract too far and make them overly complex. (I once saw a demo of an “ontology application” that could include green screens from mainframes as artifacts. Talk about muddying the waters. I suppose one could make the argument that a green screen could be in an ontology if you use a broad enough interpretation, but to what purpose?)