Christine's blog

Semantic Link Podcast With Guest Nova Spivack

As many of you may have seen on SemanticWeb.com, we've kicked off 2012 well on the Semantic Link podcast. This month we were joined by veteran semantic entrepreneur Nova Spivack, and our conversation covered privacy, UI/UX design and whether or not we've yet found a definition for the word "semantic."

We spent some time talking about one of Nova's newest projects, Bottlenose. If you haven't checked it out yet, you can request access using the invite code "semanticlink". My fellow Linker Eric Hoffer has a great write-up of the session on his blog "Axonomics."

Listen here!

A Brief History of Classification

The earliest known means of classifying objects and keeping them in order are girginakku. These are ancient Mesopotamian clay tablets that were attached to scrolls and tablets and used to identify the contents. Examples approximately 5,300 years old can be found in the British Museum.

Girginakku at Glencairn. These clay tablets were used for many purposes, including cataloging.

The famous Library of Alexandria in Egypt housed one of the earliest forms of library catalog in the third century BCE. The library reportedly held more than 120,000 scrolls, which were stored in bins categorized by subject. Each of these bins was labeled, and the labels were indexed in the Pinakes. The taxonomy of subjects was devised by Callimachus, the second recorded librarian at Alexandria. He created a system with 11 main categories: six genres and five kinds of prose (six categories for non-fiction, five for fiction). These were rhetoric, law, history, medicine, mathematics, natural science, epic, tragedy, comedy, lyric poetry and miscellaneous. The influence of this system is still seen today in systems such as the Dewey Decimal Classification.

Beginning in the 8th century CE, the Islamic library at Baghdad, the House of Wisdom, began collecting books in earnest. The knowledge of papermaking had been acquired from Chinese prisoners, and books proliferated. This is akin to the explosion of digital information we see today. These books were organized into genres, categories and sub-categories to make them easier to manage, until the library was destroyed by a Mongol invasion in the mid-13th century.

The Leiden University Library in the Netherlands created the first printed institutional library catalog shortly after it opened in the late 16th century. The book was titled Nomenclator, and was a list of all authors whose books - in manuscript or print - were available in the library. The Library continued on the leading edge into the 20th century: it was among the first to use cards for its catalog, and in 1969 it began work on an automated system which was bought by OCLC in 2000. OCLC maintains WorldCat, the Worldwide Catalog, a machine system for libraries large and small, private and public, worldwide.

In 1735 Carolus Linnaeus published his Systema Naturæ, more commonly known as the Linnaean or Animal Kingdom taxonomy. Most of us are familiar with this system from grade school biology - there are three kingdoms (animals, plants, minerals) which are divided into classes, orders, genera and species. It is purely hierarchical in nature and, while it is capable of greater things, non-biologists use it mostly as an information placement tool - akin to navigation taxonomies today. When you speak to people about taxonomy, this is often what they think of, so it is very useful to have some examples of similarities and differences at the ready to explain how your own taxonomy relates.

Nearly a century and a half later, in 1876, Melvil Dewey created the Dewey Decimal system, which organizes artifacts by subject into 10 main categories. This system took hold quickly in public and school libraries in the United States. The Library of Congress created its first dictionary catalog a couple of decades later in 1898, the Library of Congress Subject Headings. This is the basis for cataloging and classifying all of the works that are in or are sent to the Library of Record in the USA. These catalog entries are the basis for a fee-based service which generates income for the LoC. It charges other libraries for copies of its catalog cards so that the subscribing library doesn't have to do the cataloging work itself.

In the middle of the 20th century an Indian mathematician and scholar by the name of Ranganathan created Colon Classification, a system still in use in Indian libraries today. He posited that everything could be organized under 5 key facets, combined appropriately for the resource: Personality, Matter, Energy, Space, and Time. Each of these facets has a controlled value entered which is obtained from a taxonomy or thesaurus. The delimiter between the facets is a colon, and they are always entered in the PMEST order. This type of faceted taxonomy is a more practical solution for cataloging items in a digital world. Rather than having to maintain a single list of 10k items, one can maintain 4 lists of 10 items each, which is much easier to manage: four facets of 10 values can be combined into 10 x 10 x 10 x 10 = 10,000 distinct descriptions. This is NOT a rule - it is an example. Each application has its own business requirements.
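To make the facet arithmetic concrete, here is a minimal Python sketch. It follows the simplified description in this post (colon delimiters, fixed PMEST order) and is not Ranganathan's actual notation; the facet values and the function names `class_string` and `combination_count` are made up for illustration.

```python
from math import prod

# Facet order as described in the post: always Personality, Matter,
# Energy, Space, Time.
PMEST = ("personality", "matter", "energy", "space", "time")


def class_string(values: dict[str, str]) -> str:
    """Join whichever facet values are present, in PMEST order, with colons."""
    return ":".join(values[facet] for facet in PMEST if facet in values)


def combination_count(facet_sizes: list[int]) -> int:
    """Number of distinct descriptions a set of facet lists can express."""
    return prod(facet_sizes)


# Hypothetical values; a real system would draw each one from a
# controlled vocabulary (taxonomy or thesaurus).
label = class_string({"time": "1950s", "space": "India", "personality": "medicine"})

terms_to_maintain = 4 * 10                       # four lists of ten terms
expressible = combination_count([10, 10, 10, 10])  # 10 * 10 * 10 * 10
```

Here `label` comes out as "medicine:India:1950s" regardless of the order the facets were supplied in, and 40 maintained terms cover 10,000 combinations.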

Taxonomies in the enterprise reach back further than one thinks, but became known to researchers in 1858 when the NY Times began its index to the newspaper. It became such a valuable tool that publishers began indexing books and periodicals and publishing the results - H.W. Wilson is a great publisher of indexes. The Readers' Guide to Periodical Literature is one that most school students are introduced to. Database providers and large academic/scholarly/professional publishers added this capability early on as well. Proquest/Gale/Cengage, Dialog, Factiva, Reuters, IEEE and ACM all have indexes. Large government organizations also have indexes organized by subject taxonomies or thesauri: NASA, DTIC, NIH, BLS, CIA, NAICS, SEC.

Taxonomies for the enterprise and the web as we know them today began as experiments in search improvements in the 1990s. Yahoo's first release and Open Directory were clearly a librarian-like effort to organize the then small web. Those categorization structures were re-created within the realm of Natural Language Processing - math with letters. Pattern matching is the basis for much of what occurs in these systems for rules-based categorization. In simplest terms, a rule which tags a piece of content with a term from the taxonomy is an if-then statement.
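The if-then idea can be sketched in a few lines of Python. This is only an illustration of rules-based categorization through pattern matching, not how any particular product works; the two taxonomy terms, their patterns and the function name `tag` are hypothetical.

```python
import re

# Hypothetical rules: each taxonomy term is paired with the pattern
# that triggers it. Real rule sets are far larger and more nuanced.
RULES = {
    "Libraries": re.compile(r"\b(librar(y|ies)|catalog(s|ue|ues)?)\b", re.IGNORECASE),
    "Biology": re.compile(r"\b(species|genus|kingdoms?)\b", re.IGNORECASE),
}


def tag(text: str) -> list[str]:
    """Return every taxonomy term whose rule fires on the text."""
    tags = []
    for term, pattern in RULES.items():
        if pattern.search(text):   # IF the pattern matches the content...
            tags.append(term)      # ...THEN tag it with the taxonomy term
    return tags


library_tags = tag("The library printed its first catalog.")
biology_tags = tag("A new species was described last year.")
no_tags = tag("Nothing relevant here.")
```

With these rules, the first sentence is tagged "Libraries", the second "Biology", and the third gets no tags at all.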

Efforts are underway to transform semantic systems from known-item or NLP-derived labeling into systems capable of contextual understanding. Ontologies are the means by which much of this effort will be accomplished in the short term. An ontology is more advanced than a taxonomy, as it can contain self-defined relationships beyond that of parent-child. It can also be used to infer data and reason over information. The World Wide Web Consortium is one of the key leaders in efforts for standards in this space, as a semantic space is what Tim Berners-Lee had in mind for the web from the beginning.
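As a rough illustration of what "self-defined relationships" and inference mean, the Python sketch below stores a few facts as subject-predicate-object triples and derives new ones by treating one relationship as transitive. It is a toy, not an OWL or RDF reasoner; the predicate names `worked_at` and `located_in` and the function `infer_located_in` are assumptions made for this example.

```python
# Hypothetical facts as (subject, predicate, object) triples. The
# predicates are relationships we define ourselves, not parent-child links.
TRIPLES = {
    ("Callimachus", "worked_at", "Library of Alexandria"),
    ("Library of Alexandria", "located_in", "Alexandria"),
    ("Alexandria", "located_in", "Egypt"),
}


def infer_located_in(triples: set[tuple[str, str, str]]) -> set[tuple[str, str, str]]:
    """Treat located_in as transitive and add implied triples until none are new."""
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        located = [(s, o) for s, p, o in inferred if p == "located_in"]
        for subject, middle in located:
            for start, obj in located:
                new_fact = (subject, "located_in", obj)
                if middle == start and new_fact not in inferred:
                    inferred.add(new_fact)
                    changed = True
    return inferred


facts = infer_located_in(TRIPLES)
```

The result contains the fact that the Library of Alexandria is located in Egypt, which was never stated directly. It says nothing about where Callimachus is located, because no rule connects `worked_at` to `located_in`.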

Less is More

I'm doing some file cleanup and stumbled across a copy of a "CM Briefing" from several years ago - CMb 2005-13, entitled "More Users=Simpler CMS." It was written by James Robertson of Step Two Designs, an Australian consultancy with a specialty in intranets. I've known James for years; he has a solid background in intranet design, content management, user-centered design and knowledge management.

I'm writing this quick post because this briefing opens with:

In many projects, the plan is to deploy a new content management system (CMS) across the whole organisation. In these organisation-wide deployments, an assumption is made that a “big” CMS will be needed to meet the “enterprise” needs. In practice, a better rule is that the more users that will be accessing the CMS, the simpler (and more usable) the system should be.

YES! Less is more, even in the world of linked data. For years we've seen attempts at building very large, very complicated ontologies, taxonomies and metadata schemas for public use. The big ones are fine when used for the right reasons, but those scenarios are fewer. What we've seen gain adoption on a larger scale are some relatively simple frameworks: Dublin Core & FOAF; more recently Open Graph Protocol and Schema.org.

Are there times when a large ontology is needed? Absolutely. Do you need one to get started? Heck no. Start small and simple.

First determine what you need: a simple schema with small controlled vocabularies? A lightweight ontology? That will depend on your goals for publishing data and the kinds of questions you want users to be able to ask of your data.

Next decide on the smallest number of elements you need to get the important data modeled. For example, an Address Record. You need a Street, Building Location, City, State and Zip Code (in the U.S.). Having a controlled vocabulary for the States will make your life much simpler. That's it; you're good to go. Move on to the next data problem.
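A minimal Python sketch of that Address Record might look like the following. The field names, the `Address` class and the three-entry `US_STATES` vocabulary are assumptions for illustration; a real vocabulary would list every state.

```python
from dataclasses import dataclass

# Abbreviated controlled vocabulary for illustration only; a real one
# would contain all U.S. states and territories.
US_STATES = {"CA": "California", "NY": "New York", "PA": "Pennsylvania"}


@dataclass(frozen=True)
class Address:
    street: str
    building_location: str  # e.g. suite, floor or apartment
    city: str
    state: str              # must be a code from US_STATES
    zip_code: str           # kept as text so leading zeros survive

    def __post_init__(self) -> None:
        # The controlled vocabulary rejects misspellings and free-text variants.
        if self.state not in US_STATES:
            raise ValueError(f"Unknown state code: {self.state!r}")


office = Address("1 Example Street", "Suite 2", "Albany", "NY", "12207")
```

Passing "New Yrok" or "N.Y." as the state raises a `ValueError`, which is the point of the controlled vocabulary: every record ends up with one of a known set of values.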

Finally, encode in a way that will allow it to grow, integrate with other data sets, be usable in many applications and have reasonable maintenance requirements.
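One way to do that, sketched here in Python, is to publish the record as JSON-LD using Schema.org's PostalAddress type. The property names below are the Schema.org ones; the function name `to_schema_org`, the sample values and the choice to fold the building location into `streetAddress` are assumptions for this example.

```python
import json


def to_schema_org(street: str, building_location: str, city: str,
                  state: str, zip_code: str) -> dict[str, str]:
    """Map a simple address record onto Schema.org PostalAddress properties."""
    return {
        "@context": "https://schema.org",
        "@type": "PostalAddress",
        # Assumption: the building location is appended to the street line.
        "streetAddress": f"{street}, {building_location}",
        "addressLocality": city,
        "addressRegion": state,
        "postalCode": zip_code,
    }


record = to_schema_org("1 Example Street", "Suite 2", "Albany", "NY", "12207")
encoded = json.dumps(record, indent=2)
```

Because the record uses a shared vocabulary, other properties can be added later without breaking consumers that only read the original five.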

Keep it simple, until you need more.
