Jan 17, 2012

A Brief History of Classification





Girginakku at Glencairn. These clay tablets were used for many purposes, including cataloging.

The earliest known means of classifying an object and keeping it in order are girginakku: ancient Mesopotamian clay tablets that were attached to scrolls and other tablets to identify their contents. Examples approximately 5,300 years old can be found in the British Museum.

The famous Library of Alexandria in Egypt housed one of the earliest forms of library catalog in the third century BCE. The library reportedly held more than 120,000 scrolls, which were stored in bins categorized by subject. Each of these bins was labeled, and the labels were indexed in the Pinakes. The taxonomy of subjects was devised by Callimachus, the second recorded librarian at Alexandria. He created a system with 11 main categories, 6 for non-fiction and 5 for fiction. These were rhetoric, law, history, medicine, mathematics, natural science, epic, tragedy, comedy, lyric poetry and miscellaneous. The influence of this system can still be seen today in systems such as the Dewey Decimal Classification.

Beginning in the 8th century CE, the Islamic library at Baghdad, the House of Wisdom, began collecting books in earnest. The knowledge of papermaking had been acquired from Chinese prisoners, and books proliferated. This is akin to the explosion of digital information we see today. These books were organized into genres, categories and sub-categories to make them easier to manage. The library was destroyed by a Mongol invasion in the mid-13th century.

The Leiden University Library in the Netherlands created the first printed institutional library catalog shortly after it opened in the late 16th century. The book was titled Nomenclator, and was a list of all authors whose books - in manuscript or print - were available in the library. The library stayed on the leading edge into the 20th century: it was among the first to use cards for its catalog, and in 1969 it began work on an automated system which was bought by OCLC in 2000. OCLC maintains WorldCat, the Worldwide Catalog, a computerized catalog for libraries large and small, private and public, worldwide.

In 1735 Carolus Linnaeus published his Systema Naturæ, more commonly known as the Linnaean or Animal Kingdom taxonomy. Most of us are familiar with this system from grade school biology - there are three kingdoms (animals, plants, minerals) which are divided into classes, orders, genera and species. This is purely hierarchical in nature, and while it is capable of greater things, it is used mostly by non-biologists as an information placement tool - akin to navigation taxonomies today. When you speak to people about taxonomy, this is often what they think of, so it is very useful to have some examples of similarity and differentiation at the ready to explain how your own taxonomy relates.

Nearly a century and a half later, in 1876, Melvil Dewey created the Dewey Decimal system, which organizes artifacts by subject into 10 main categories. This system took hold quickly in the public and school libraries in the United States. The Library of Congress created its first dictionary catalog, the Library of Congress Subject Headings, a couple of decades later in 1898. This is the basis for cataloging and classifying all of the works that are in or are sent to the Library of Record in the USA. These catalog entries are the basis for a fee-based service which generates income for the LoC. It charges other libraries for copies of its catalog cards so that the subscribing library doesn’t have to do the cataloging work itself.

In the middle of the 20th century an Indian mathematician and scholar by the name of S. R. Ranganathan created Colon Classification, a system still in use in Indian libraries today. He posited that everything could be organized under 5 key facets, combined appropriately for the resource: Personality, Matter, Energy, Space, and Time. Each of these facets holds a controlled value taken from a taxonomy or thesaurus. The delimiter between facets is a colon, and the facets are always entered in PMEST order. This type of faceted taxonomy is a more practical solution for cataloging items in a digital world. Rather than maintaining a single list of 10,000 items, one can maintain 4 lists of 10 items each, whose combinations cover the same 10,000 possibilities (10 x 10 x 10 x 10) and are much easier to manage. This is NOT a rule - it is an example. Each application has its own business requirements.
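
The facet idea can be sketched in a few lines of Python. This is only an illustration of the simplified description above (colon-delimited values in PMEST order), not Ranganathan's full notation; the vocabularies and the names `VOCAB`, `build_label` and `combinations` are hypothetical.

```python
from math import prod

# Facet order as described above: Personality, Matter, Energy, Space, Time.
PMEST = ("personality", "matter", "energy", "space", "time")

# Hypothetical controlled vocabularies (illustrative values only).
VOCAB = {
    "personality": {"medicine", "agriculture"},
    "matter": {"lungs", "rice"},
    "energy": {"treatment", "harvesting"},
    "space": {"india", "japan"},
    "time": {"1950", "1960"},
}

def build_label(**facets):
    """Join facet values with colons, always in PMEST order."""
    parts = []
    for name in PMEST:
        value = facets.get(name)
        if value is None:
            continue  # a facet may be left out when it does not apply
        if value not in VOCAB[name]:
            raise ValueError(f"{value!r} is not a controlled {name} term")
        parts.append(value)
    return ":".join(parts)

def combinations(sizes):
    """Number of distinct labels that facet lists of these sizes can express."""
    return prod(sizes)

# Keyword order does not matter; the output order is always PMEST.
label = build_label(time="1950", personality="medicine", matter="lungs",
                    energy="treatment", space="india")
print(label)                           # medicine:lungs:treatment:india:1950
print(combinations([10, 10, 10, 10]))  # 10000
```

Four short lists of ten terms each express as many combinations as one flat list of 10,000 entries, which is the arithmetic behind the example in the paragraph above.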

Taxonomies in the enterprise reach back further than one thinks, but became known to researchers in 1858 when the NY Times began its index to the newspaper. It became such a valuable tool that publishers began indexing books and periodicals and publishing those indexes - H.W. Wilson is a great publisher of indexes. The Readers’ Guide to Periodical Literature is one that most school students are introduced to. Database providers and large academic/scholarly/professional publishers added this capability early on as well. ProQuest/Gale/Cengage, Dialog, Factiva, Reuters, IEEE, ACM all have indexes. Large government organizations also have indexes organized by subject taxonomies or thesauri: NASA, DTIC, NIH, BLS, CIA, NAICS, SEC.

Taxonomies for the enterprise and the web as we know them today began as experiments in search improvements in the 1990s. Yahoo’s first release and Open Directory were clearly a librarian-like effort to organize the then-small web. Those categorization structures were re-created within the realm of Natural Language Processing - math with letters. Pattern matching is the basis for much of what occurs in these systems for rules-based categorization. In simplest terms, a rule which tags a piece of content with a term from the taxonomy is an if-then statement.
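
As a rough sketch of that if-then idea, the Python below pairs each taxonomy term with a text pattern and applies the term whenever the pattern matches. The rules, the terms and the function name `tag` are made up for illustration and do not come from any particular product.

```python
import re

# Hypothetical rules: each pairs a taxonomy term with a pattern to look for.
RULES = [
    ("Taxonomy", re.compile(r"\btaxonom(y|ies)\b", re.IGNORECASE)),
    ("Ontology", re.compile(r"\bontolog(y|ies)\b", re.IGNORECASE)),
    ("Libraries", re.compile(r"\b(library|libraries|librarian)\b", re.IGNORECASE)),
]

def tag(text):
    """Return the taxonomy terms whose rule matches the text, in rule order."""
    tags = []
    for term, pattern in RULES:
        if pattern.search(text):  # IF the pattern occurs in the content...
            tags.append(term)     # ...THEN tag the content with the term
    return tags

print(tag("A librarian-like effort to organize the web with a taxonomy."))
# ['Taxonomy', 'Libraries']
```

Real categorization engines layer weighting, proximity and exclusions on top of this, but each rule still reduces to a condition and a resulting tag.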

Efforts are underway to transform semantic systems from known-item or NLP-derived labeling into systems capable of contextual understanding. Ontologies are the means by which much of this effort will be accomplished in the short term. An ontology is more advanced than a taxonomy, as it can contain self-defined relationships beyond that of parent-child. It can also be used to infer data and reason over information. The World Wide Web Consortium is one of the key leaders in efforts for standards in this space, as a semantic space is what Tim Berners-Lee had in mind for the web from the beginning.
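
To make the difference from a parent-child taxonomy concrete, here is a minimal Python sketch, assuming a toy set of subject-relation-object statements. The relation names (`worked_at`, `located_in`), the `TRANSITIVE` set and the `infer` function are hypothetical, and real ontology tooling is far richer than this.

```python
# Toy statements with named relationships, not just parent-child links.
TRIPLES = {
    ("Callimachus", "worked_at", "Library of Alexandria"),
    ("Library of Alexandria", "located_in", "Alexandria"),
    ("Alexandria", "located_in", "Egypt"),
}

# Relations assumed to be transitive: if A rel B and B rel C, then A rel C.
TRANSITIVE = {"located_in"}

def infer(triples):
    """Add statements implied by transitive relations until nothing new appears."""
    known = set(triples)
    changed = True
    while changed:
        changed = False
        for (a, r1, b) in list(known):
            for (c, r2, d) in list(known):
                if r1 == r2 and r1 in TRANSITIVE and b == c:
                    new = (a, r1, d)
                    if new not in known:
                        known.add(new)
                        changed = True
    return known

facts = infer(TRIPLES)
# This statement was never entered directly; it is inferred.
print(("Library of Alexandria", "located_in", "Egypt") in facts)  # True
```

The inferred statement is the kind of reasoning over information that a plain hierarchy cannot express.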

This article was originally posted at http://ping.fm/N9Nzq
