Michael J. Radwin

Tales of a software engineer who keeps kosher and hates the web.

ApacheCon: TAP and the Semantic Web

Rob McCool, continuing in the spirit of the easy-to-understand but never-adopted Meta Content Framework and the standard but substantially harder-to-grok Resource Description Framework, presented TAP.

The overall problem is that there is a ton of data out there on the web, but it’s not in machine-understandable form. McCool is looking at addressing key problems of supporting a true web of data: query languages, canonical names, caching, and a system of trust (to avoid spammers).

On the query languages front, McCool believes that SQL and XQL are overkill, but HTTP GET is not specific enough. So TAP defines a GetData protocol. It follows in the spirit of DNS, where you can use a gethostbyname() function to access the service. TAP uses RDF schemas to describe graphs of data, and SOAP as the over-the-wire protocol for querying.

McCool described a module called TAPache to implement the GetData protocol. In the same way that Apache provides an htdocs directory, it provides an RDF repository. His stated goal for TAPache is to be the “BIND” application for data.

Since Amazon.com and CDnow might have different identifiers for the same album, TAP doesn’t require using globally unique identifiers. But how do you tell the difference between “Michael Jackson” the musician and “Michael Jackson” your next-door neighbor? TAP addresses this using reference by description. When you want to do a query for “Michael Jackson”, you ask for someone whose firstName="Michael" and lastName="Jackson" and profession="Musician" and who is the author of an album with title="Thriller".
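
To make reference by description concrete, here is a minimal Python sketch of the idea. It is not TAP's actual API or data model: the triples, the identifiers like person:1, and the function names are all made up for illustration. A description is a dict of properties, and a nested dict means "some related node that matches this sub-description".

```python
from collections import defaultdict

# Hypothetical data: (subject, predicate, object) triples, not real TAP data.
TRIPLES = [
    ("person:1", "firstName", "Michael"),
    ("person:1", "lastName", "Jackson"),
    ("person:1", "profession", "Musician"),
    ("person:1", "authorOf", "album:1"),
    ("album:1", "title", "Thriller"),
    ("person:2", "firstName", "Michael"),
    ("person:2", "lastName", "Jackson"),
    ("person:2", "profession", "Accountant"),
]


def build_index(triples):
    """Group triples as index[subject][predicate] -> set of objects."""
    index = defaultdict(lambda: defaultdict(set))
    for subject, predicate, obj in triples:
        index[subject][predicate].add(obj)
    return index


def matches(index, node, description):
    """True if the node satisfies every property in the description."""
    for predicate, expected in description.items():
        values = index.get(node, {}).get(predicate, set())
        if isinstance(expected, dict):
            # Nested description: at least one linked node must match it.
            if not any(matches(index, value, expected) for value in values):
                return False
        elif expected not in values:
            return False
    return True


def find_by_description(index, description):
    """Return the sorted ids of all nodes that fit the description."""
    return sorted(node for node in index if matches(index, node, description))


index = build_index(TRIPLES)
musician = {
    "firstName": "Michael",
    "lastName": "Jackson",
    "profession": "Musician",
    "authorOf": {"title": "Thriller"},
}
print(find_by_description(index, musician))
```

With only the first and last name, both the musician and the hypothetical neighbor match; adding the profession and the album narrows it to one node, without either site having to agree on an identifier.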

When asked about the problem of matching “Donald Rumsfeld” and “Donald H. Rumsfeld” or “al Qaeda” and “al-Qaida”, McCool said that there are some decent algorithms for matching names that go beyond simple string comparisons. Sounds like a substantially difficult project to me. Is my laptop an “IBM 390X” or is it “390-X by IBM”?
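
For a feel of what "beyond simple string comparisons" might look like, here is a rough sketch using Python's standard difflib module. This is my own illustration, not one of the algorithms McCool had in mind (he didn't name any), and the 0.85 threshold is an arbitrary assumption.

```python
import re
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase, turn punctuation into spaces, and trim the ends."""
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()


def similarity(a, b):
    """Similarity in [0, 1] between two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def probably_same(a, b, threshold=0.85):
    """Guess whether two strings name the same thing (threshold is arbitrary)."""
    return similarity(a, b) >= threshold


for pair in [
    ("Donald Rumsfeld", "Donald H. Rumsfeld"),
    ("al Qaeda", "al-Qaida"),
    ("IBM 390X", "390-X by IBM"),
]:
    print(pair, round(similarity(*pair), 3), probably_same(*pair))
```

The two name pairs clear the threshold, but the laptop pair does not, because the words appear in a different order. A character-level comparison alone doesn't get you there.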

An interesting sample application was a “related items” sidebar for news stories. In addition to doing simple Capitalized Words extraction from the document, you could envision something that used the RDF graphs to discover that Brett Favre was a football player and match that with eBay auctions for tickets for the Green Bay Packers.
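
The sidebar idea can be sketched in a few lines of Python. Everything here is hypothetical: the tiny GRAPH dict stands in for whatever an RDF store would return, the property names are invented, and the output is just a search string rather than a real eBay query.

```python
import re

# Hypothetical stand-in for facts that would come from an RDF graph.
GRAPH = {
    "Brett Favre": {"profession": "football player", "team": "Green Bay Packers"},
}

# Two or more consecutive Capitalized Words.
PHRASE = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+\b")


def extract_capitalized_phrases(text):
    """Return runs of two or more capitalized words, in document order."""
    return PHRASE.findall(text)


def related_ticket_queries(text):
    """Map people mentioned in the text to ticket searches for their team."""
    queries = []
    for phrase in extract_capitalized_phrases(text):
        team = GRAPH.get(phrase, {}).get("team")
        if team:
            query = f"{team} tickets"
            if query not in queries:  # avoid duplicate sidebar entries
                queries.append(query)
    return queries


story = "Brett Favre threw three touchdowns on Sunday."
print(related_ticket_queries(story))
```

The story never mentions the Packers; the link comes from the graph, which is the part plain capitalized-word extraction can't give you.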