The Semantic Web Acid Test
The Semantic Web Acid Test is an important “bullshit detector” to see whether a project or business will scale to meet the needs of the future. To apply the test, ask two questions:
1 – Is it semantic? Does it make everything in its world unambiguous? Typically, this involves a standard format that provides a known structure for the data. The data standards of the future are non-royalty-bearing and governed by nonprofit entities. Or it disambiguates text and speech by doing text analytics and then mapping the concepts onto a term and subject backbone. It must mean the same thing to every program, without translation. Finally, it uses a common namespace so any piece of semantic data can be found and linked in context.
2 – Is it on the web? Is it online, in the cloud, ready to be called by name by anyone or any software with access to it? Or is it locked in a database, a single web site, or on a desktop? If the data (or content) is sitting on the web where it can be seen easily by search engines or mashed up into ad-hoc applications on demand, it’s in the right place.
Put these two together and you have a semantic web of data that soon becomes an ecosystem of information that scales up to meet tomorrow’s challenges. If it’s semantic and it’s on the web, it is automatically set up to be pulled. Too many “semantic solutions” are one of these but not the other. For example, the restaurant menus at seamlessweb or the diamond finders at bluenile are wonderfully semantic, but they are not on the web. They are in data silos with their own language for fetching them, and they don’t mix well with the menus or diamond descriptions on other web sites. Owners of restaurants or diamonds must go to many such sites and enter their product descriptions in many different formats, losing the ability to syndicate their data. Instead, there are copies everywhere, and updating them is a huge task.
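The two questions can be read as a simple two-part check. Here is a minimal sketch in Python, assuming a hypothetical Project record whose field names are invented purely for illustration. It simplifies question 1 to two yes/no flags and is only a restatement of the test, not a real tool.

```python
from dataclasses import dataclass


@dataclass
class Project:
    """Hypothetical record; the field names are illustrative assumptions."""
    name: str
    uses_standard_format: bool   # question 1: known, shared structure
    uses_common_namespace: bool  # question 1: data can be found and linked by name
    addressable_on_web: bool     # question 2: callable by name, not locked in a silo


def is_semantic(p: Project) -> bool:
    # Simplified reading of question 1.
    return p.uses_standard_format and p.uses_common_namespace


def passes_acid_test(p: Project) -> bool:
    # Both questions must be answered "yes".
    return is_semantic(p) and p.addressable_on_web


# Three made-up projects, one for each outcome discussed in the text.
silo = Project("semantic data in a silo", True, True, False)
cloud_doc = Project("free text in the cloud", False, False, True)
web_of_data = Project("semantic data on the web", True, True, True)

for p in (silo, cloud_doc, web_of_data):
    print(p.name, "->", "passes" if passes_acid_test(p) else "fails")
```

Only the third project passes: the first is semantic but not on the web, and the second is on the web but not semantic.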
On the other hand, many “cloud-based” solutions are one-off productions that don’t use semantic formats and, again, don’t scale up. Examples are Google Docs and Microsoft Live documents, which are containers for text, but anything can go into the text. They are completely unsemantic and use keywords to attract advertisers. Keyword-based findability is not semantic, no matter how many business plans say it is!
On this site, we use an icon system to identify structured/unstructured and web/silo approaches to metadata:
Unstructured Silo Push is like most database-driven web sites, where the documents and data are buried in a database, and you must query the database to learn what’s in it. The meaning of the text is clear, at best, only to humans (software must guess the meaning and context). The information is not ready to be pulled into other systems. An example would be any site of text files, like Aintitcoolnews.com.
Unstructured Web Push is like most blogs, articles, PDFs, tables, spreadsheets, podcasts, videos, and other mixed text and data online. It’s on the web, but the lack of meaningful tags (structure) makes it difficult at best for software to interpret the meaning. It is very difficult to mine for content without a human being to guide the process. A good example is Wikipedia, and a better example is the millions of tweets sent by people using Twitter every day.
Unstructured Silo Pull is a scenario where data lives in a database, has very little structure, but can still be accessed from anywhere, thanks to a unique set of names. An example would be YouTube.
Unstructured Web Pull includes documents that can be pulled into other systems right from the web, even though they lack structure. The best example is RSS feeds, which let us pull unknown content from a known source (like a daily blog or a regular podcast). An open example would be Google Image search, which lets you search for images online using the keywords surrounding them on web pages. All images online can be found, but it’s difficult for a search engine or other software to know what or who is in them.
Semi-structured Web Push includes many book descriptions, restaurant reviews, real-estate listings, music descriptions, and other listings that have a bit of structure and can be found on any page online. The structure isn’t common, and it isn’t enough to be pulled into another system, but because the domain is limited, a smart search engine can figure out most of what is meant. This is also the world of proprietary data feeds, where information is online but in uncommon formats, meant for only a few to pick up. Examples would be sports scores from a particular web site or iTunes songs.
Semi-structured Silo Push is how most semi-structured content is set up: in databases that must be queried, and essentially designed for humans to interpret. The key difference is that there is some structure, so a program that looks at the page and decides which information means what can be useful. This is called scraping. It’s how some search engines find information inside of catalogs and other databases. There could be keywords associated, but there isn’t a fully semantic set of tags to identify the meaning of the data. You can find semi-structured silos all over the web, from real estate listings to car descriptions to most
Semi-structured Silo Pull is lightly formatted data that can be pulled from databases and used by other software or understood by search engines. It’s similar to the category above, but the data is meant to be pulled into other systems. Many microformats, like hResume, qualify for this category. Another example is streaming media, like the songs at Soma.fm or Lala.com – there’s enough structure to know who made the content and its name, but little to describe the actual content itself.
Semi-structured Web Pull is a category of metadata practically dominated by one format called FOAF (Friend of a Friend). FOAF is designed to represent the relationships between people, and there are millions of FOAF descriptions online. Most microformats, like hCard, hCalendar, and many others, would be considered semi-structured and can be found online and pulled into other systems. Some SearchMonkey formats are in this category as well. The best example of this category is Creative Commons licenses, which fully describe the rights to use a particular work and can be used to find and pull content into a system according to license. In many of these systems, especially for streaming media, it’s common for the metadata to describe the container accurately but not the actual content, and that’s an important part of the equation.
Semantic Silo Push is a fairly common scheme, where semantic information is kept in proprietary databases with no visibility to the web. Most semantic formats are governed by B2B consortia, and the data in these formats are still buried deep in databases for internal use. The goal of the semantic web is to surface this data, make it findable, and make it pullable into all systems as needed.
Semantic Silo Pull is the category of Virtual Private Networks, where companies set up connections specifically to pull information from one to another, and there are semantic formats or ontologies involved. To pull this kind of information, you must first set up a channel connecting two databases. The Web is not involved.
Semantic Web Push is a category that doesn’t exist. If it’s on the web and it’s in a semantic format, then it can, by definition, be pulled into and recognized by any system that wants it.
Semantic Web Pull is the Holy Grail of the pull paradigm. Meanings are precise, the data is online, and it can be pulled into any system that wants it. This is what we call “the web of data.” It could be data in standard formats or in RDF triples (ontologies), or a combination of the two. At the moment, there are few examples, yet many people are working on building the systems to make it happen. The best current example is XBRL, the business-reporting language, which will soon be in the cloud. Two important videos of how this will work are the brilliant STI video showing how Europe plans to go semantic and John Wilbanks’ excellent explanation of linked data from Science Commons.
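To make the idea of RDF triples a little more concrete, here is a rough sketch in Python that models each fact as a plain (subject, predicate, object) tuple. It does not use any RDF library, and every URI under example.org, along with the helper name objects, is a hypothetical placeholder chosen for illustration.

```python
# Each fact is a (subject, predicate, object) triple.
# All URIs below are hypothetical placeholders under example.org.
EX = "http://example.org/"

triples = [
    (EX + "restaurant/1", EX + "vocab/name", "Luigi's"),
    (EX + "restaurant/1", EX + "vocab/serves", EX + "dish/42"),
    (EX + "dish/42", EX + "vocab/name", "Margherita pizza"),
    (EX + "dish/42", EX + "vocab/priceUSD", "12.00"),
]


def objects(subject, predicate):
    """Return every object stated for the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


# Follow the links by name: restaurant -> dishes it serves -> dish names.
for dish in objects(EX + "restaurant/1", EX + "vocab/serves"):
    print(objects(dish, EX + "vocab/name"))
```

Because every subject and predicate is a globally unique name, any program that knows the names can pull the same facts without a site-specific query language, which is the point of the pull paradigm described above.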
See also: Metadata.