The Semantic Web Acid Test
January 15, 2010
It’s early days in the semantic web, and every VC is starting to use the word semantic like it’s the next social networking. It isn’t the next social networking. It’s hard. It requires cooperation. And it’s easy to confuse semantic projects with the semantic web. So let’s clear up the confusion right now. As Dan Connolly says: “The key term in ‘semantic web’ is ‘web’.” We’ll use these icons to distinguish what’s semantic and what’s not, what’s visible on the web and what’s not:
![]()
Unstructured Silo Push is like most database-driven web sites, where the documents and data are buried in a database, and you must query the database to learn what’s in it. The meaning of the text is clear, at best, only to humans (software must guess the meaning and context). The information is not ready to be pulled into other systems. An example would be any site of text files, like Aintitcoolnews.com.
![]()
Unstructured Web Push is like most blogs, articles, PDFs, tables, spreadsheets, podcasts, videos, and other mixed text and data online. It’s on the web, but the lack of meaningful tags (structure) makes it difficult at best for software to interpret the meaning. It is very difficult to mine for content without a human being to guide the process. A good example is Wikipedia, and a better example is the millions of tweets sent by people using Twitter every day.
![]()
Unstructured Silo Pull is a scenario where data lives in a database, has very little structure, but can still be accessed from anywhere, thanks to a unique set of names. An example would be YouTube.
![]()
Unstructured Web Pull includes documents that can be pulled into other systems right from the web, even though they lack structure. The best example is RSS feeds, which let us pull unknown content from a known source (like a daily blog or a regular podcast). An open example would be Google Image search, which lets you search for images online using the keywords surrounding them on web pages. All images online can be found, but it’s difficult for a search engine or other software to know what or who is in them.
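To make the pull idea concrete, here is a minimal sketch of reading an RSS feed with Python's standard library. The feed text, the example.com URLs, and the `pull_items` helper are all made up for illustration; a real reader would fetch the feed over HTTP from a known address. Notice what the software learns: a title and a link for each item, and nothing about what the items actually mean.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed content, inlined so the sketch runs without a network.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example Daily Blog</title>
    <item>
      <title>First post</title>
      <link>http://example.com/posts/1</link>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.com/posts/2</link>
    </item>
  </channel>
</rss>"""


def pull_items(rss_text):
    """Return (title, link) pairs for every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    return [
        (item.findtext("title"), item.findtext("link"))
        for item in root.iter("item")
    ]


if __name__ == "__main__":
    for title, link in pull_items(SAMPLE_RSS):
        print(title, "->", link)
```

The source is known and the container is predictable, but the content inside each item is still unstructured text.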
![]()
Semi-structured Web Push includes many book descriptions, restaurant reviews, real-estate listings, music descriptions, and other listings that have a bit of structure and can be found on any page online. The structure isn’t common, and it isn’t enough to be pulled into another system, but because the domain is limited, a smart search engine can figure out most of what is meant. This is also the world of proprietary data feeds, where information is online but in uncommon formats, meant for only a few to pick up. Examples would be sports scores from a particular web site or iTunes songs.
![]()
Semi-structured Silo Push is how most semi-structured content is set up: in databases that must be queried, and essentially designed for humans to interpret. The key difference is that there is some structure, so a program that looks at the page and decides which information means what can be useful. This is called scraping. It’s how some search engines find information inside of catalogs and other databases. There could be keywords associated, but there isn’t a fully semantic set of tags to identify the meaning of the data. You can find semi-structured silos all over the web, from real estate listings to car descriptions to most
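Scraping, as described above, can be sketched in a few lines. The page markup, the class names (`price`, `beds`, `city`), and the `ListingScraper` class are assumptions invented for this example, not any real site's format. The point is that the scraper only works because someone hard-coded a guess about which bit of the page means what.

```python
from html.parser import HTMLParser

# Hypothetical real-estate listing page; the class names are assumptions.
SAMPLE_PAGE = """
<div class="listing">
  <span class="price">$250,000</span>
  <span class="beds">3</span>
  <span class="city">Austin</span>
</div>
"""


class ListingScraper(HTMLParser):
    """Collect the text of <span> elements whose class we decided is meaningful."""

    FIELDS = {"price", "beds", "city"}

    def __init__(self):
        super().__init__()
        self.current = None  # field currently being read, if any
        self.record = {}

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "span" and css_class in self.FIELDS:
            self.current = css_class

    def handle_data(self, data):
        if self.current:
            self.record[self.current] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self.current = None


def scrape(html):
    scraper = ListingScraper()
    scraper.feed(html)
    return scraper.record


if __name__ == "__main__":
    print(scrape(SAMPLE_PAGE))
```

If the site renames a class or rearranges the page, the scraper silently breaks, which is exactly why this category falls short of semantic.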
![]()
Semi-structured Silo Pull is lightly formatted data that can be pulled from databases and used by other software or understood by search engines. It’s similar to the category above, but the data is meant to be pulled into other systems. Many microformats, like hResume, qualify for this category. Another example is streaming media, like the songs at Soma.fm or Lala.com – there’s enough structure to know who made the content and its name, but little to describe the actual content itself.
![]()
Semi-structured Web Pull is a category of metadata practically dominated by one format called FOAF. FOAF is designed to represent the relationships between people, and there are millions of FOAF descriptions online. Most microformats, like hCard, hCalendar, and many others, would be considered semi-structured and can be found online and pulled into other systems. Some SearchMonkey formats are in this category as well. The best example of this category is Creative Commons licenses, which fully describe the rights to use a particular work and can be used to find and pull content into a system according to license. In many of these systems, especially for streaming media, it’s common for the metadata to describe the container accurately but not the actual content, and that’s an important part of the equation.
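Pulling content according to license can be sketched as follows, assuming pages mark their license with a `rel="license"` link, which is the convention Creative Commons recommends. The page contents, the example.com URLs, and the helper names are hypothetical.

```python
from html.parser import HTMLParser


class LicenseFinder(HTMLParser):
    """Collect href values of <a> or <link> elements marked rel="license"."""

    def __init__(self):
        super().__init__()
        self.licenses = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        rels = (attr.get("rel") or "").split()
        if tag in ("a", "link") and "license" in rels and attr.get("href"):
            self.licenses.append(attr["href"])


def find_licenses(html):
    finder = LicenseFinder()
    finder.feed(html)
    return finder.licenses


def pages_with_license(pages, license_prefix):
    """Return the URLs of pages that declare a license starting with license_prefix."""
    return [
        url
        for url, html in pages.items()
        if any(lic.startswith(license_prefix) for lic in find_licenses(html))
    ]


# Hypothetical pages keyed by URL; a real system would crawl these.
PAGES = {
    "http://example.com/photo1": (
        '<p>Photo. <a rel="license" '
        'href="http://creativecommons.org/licenses/by/3.0/">CC BY</a></p>'
    ),
    "http://example.com/photo2": "<p>All rights reserved.</p>",
}

if __name__ == "__main__":
    print(pages_with_license(PAGES, "http://creativecommons.org/licenses/by/"))
```

The license tells software precisely what it may do with the work, yet says nothing about what the photo shows, which is the container-versus-content gap noted above.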
![]()
Semantic Silo Push is a fairly common scheme, where semantic information is kept in proprietary databases with no visibility to the web. Most semantic formats are governed by B2B consortia, and the data in these formats are still buried deep in databases for internal use. The goal of the semantic web is to surface this data, make it findable, and make it pullable into all systems as needed.
![]()
Semantic Silo Pull is the category of Virtual Private Networks, where companies set up connections specifically to pull information from one to another, using semantic formats or ontologies. To pull this kind of information, you must first set up a channel connecting two databases. The Web is not involved.
![]()
The Semantic Web Push category doesn’t exist. If it’s on the web and it’s in a semantic format, then it can, by definition, be pulled into, and recognized by, any system that wants it.
![]()
Semantic Web Pull is the Holy Grail of the pull paradigm. Meanings are precise, the data is online, and it can be pulled into any system that wants it. This is what we call “the web of data.” It could be data in standard formats or in RDF triples (ontologies), or a combination of the two. At the moment, there are few examples, yet many people are working on building the systems to make it happen. The best current example is XBRL, the business-reporting language, which will soon be on the cloud. Two important videos of how this will work are the brilliant STI video showing how Europe plans to go semantic and John Wilbanks’ excellent explanation of linked data from Science Commons.
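For a feel of what RDF triples buy you, here is a toy sketch in plain Python. It is not a real RDF library: the triples are just tuples, the example.org identifiers are invented, and `match` is a hypothetical helper. Only the FOAF property URIs (`name`, `knows`) come from the actual FOAF vocabulary.

```python
FOAF = "http://xmlns.com/foaf/0.1/"

# Each fact is a (subject, predicate, object) triple. The example.org
# identifiers are made up; the FOAF predicates are real vocabulary terms.
TRIPLES = [
    ("http://example.org/alice", FOAF + "name", "Alice"),
    ("http://example.org/bob", FOAF + "name", "Bob"),
    ("http://example.org/alice", FOAF + "knows", "http://example.org/bob"),
]


def match(triples, s=None, p=None, o=None):
    """Return triples matching the given pattern; None acts as a wildcard."""
    return [
        t
        for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def names_of_people_known_by(triples, person):
    """Follow foaf:knows links from person, then look up each foaf:name."""
    known = [o for _, _, o in match(triples, s=person, p=FOAF + "knows")]
    return [
        name
        for k in known
        for _, _, name in match(triples, s=k, p=FOAF + "name")
    ]


if __name__ == "__main__":
    print(names_of_people_known_by(TRIPLES, "http://example.org/alice"))
```

Because every predicate is a shared, published identifier, any system that pulls these triples can interpret them the same way, with no scraping rules and no private channel.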
To learn more, visit The Semantic Web Acid Test.







