Calling all Bloggers: Semantic Text Annotation with RDFa

Dec-9th-2011

Connected web of semantic content

The semantic web is not the “future”, it is the now. Semantic web technologies have matured in recent years, keeping pace with the rapid growth of enriched web content on the linked open data cloud. Take a look: http://richard.cyganiak.de/2007/10/lod/imagemap.html ! Whatever language you choose to adopt, you can be sure that there’ll be a feature-rich semantic web library to match. I can vouch for RDF4H [1] for Haskell and Jena [2] for Java.

Why the need for embedded semantics?

Firstly, to understand embedded RDFa, one should appreciate the value of semantics on the web. Why have many organizations (including the BBC [3] and the British Library [4]) invested a lot of time semantizing existing web content and data? Simply put, from the very beginning, the World Wide Web has been a connected network of human readable documents, largely written in the markup language HTML. We’ve been able to learn from one another, distribute information to wide audiences, and read weather reports for our respective regions of the world.

The problem

Web 1.0 and Web 2.0 are both broadly defined as a connected network of human readable resources. Given two blog posts on two separate webpages, how do I discover whether they discuss similar or identical concepts? I have to read them and check! Isn’t this all a bit cumbersome?! How do people make discoveries on the web? By searching for text in their favourite search engine? When I’m enthralled by an entertaining blog post, how do I find other posts that touch on similar issues, or are by the same author, or were posted within the last month?

The Solution? RDF + Ontology Reuse!


The fundamental building block, RDF [5], is one of the simplest data models there is, and some important principles built into the Semantic Web are crucial to its success. Paramount among them is the rule that resources on the web should be referenced using a Uniform Resource Identifier (URI) wherever possible. This is a mechanism to disambiguate terms on the web, meaning that we can unequivocally state that two separate web pages are referring to the same concept, place, person, or point in time. The result is a machine readable representation of web content. The value of reasoning over a collection of documents containing disambiguated terms is obvious, although on its own it is simple and probably not powerful enough to warrant attention and convergence to the Semantic Web.
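
To make the idea concrete, here is a minimal Python sketch of RDF statements as plain (subject, predicate, object) triples. The two page URLs are made up, and using the Dublin Core "subject" property to tag a page with a concept is my own assumption for illustration. The point is only that when two pages reference the same URI, a machine can spot the overlap without reading either page.

```python
# Each RDF statement is a (subject, predicate, object) triple of URIs.
# Assumption: pages are tagged with concepts via the Dublin Core "subject" property.
DCT_SUBJECT = "http://purl.org/dc/terms/subject"

# Hypothetical pages; the concept URIs follow DBPedia's resource naming pattern.
TRIPLES = [
    ("http://example.org/blog/post-1", DCT_SUBJECT, "http://dbpedia.org/resource/Tennis"),
    ("http://example.org/blog/post-1", DCT_SUBJECT, "http://dbpedia.org/resource/Badminton"),
    ("http://example.net/diary/entry-7", DCT_SUBJECT, "http://dbpedia.org/resource/Tennis"),
]


def concepts_of(page, triples):
    """Return the set of concept URIs that a page is tagged with."""
    return {o for s, p, o in triples if s == page and p == DCT_SUBJECT}


def shared_concepts(page_a, page_b, triples):
    """Return the concept URIs that both pages refer to."""
    return concepts_of(page_a, triples) & concepts_of(page_b, triples)


if __name__ == "__main__":
    common = shared_concepts(
        "http://example.org/blog/post-1",
        "http://example.net/diary/entry-7",
        TRIPLES,
    )
    print(common)
```

Because both pages use the identical URI for tennis, the intersection is a simple set operation rather than a guess based on matching words.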

But crucially, the RDF model allows resources to be described using ontological vocabularies. Some popular examples include DBPedia [6] (structured content derived from wikipedia.org) and FOAF [7] to describe people, and the links between them. These ontologies define concepts hierarchically, and the properties between them. So for instance, I might one day blog about my favourite sport, tennis, using my semantic web-aware blogging engine. Underneath my blog post, I have an unambiguous machine readable annotation of this term, which implicitly holds more information about “tennis” than I have mentioned in the blog post. For example, what does DBPedia tell us about tennis? http://dbpedia.org/page/Tennis . As you can see, there is a substantial amount of structured data about this resource: the equipment includes a tennis ball and a tennis racquet; it is not a contact sport; a list of tennis associations etc… The list is very extensive.

So, now we have my blog post about my favourite sport, human readable for my avid followers, and hidden beneath that – a machine readable document holding much more information about the concepts I have written about. This information even includes a broader classification for tennis: racquet sports. You can crawl up or down the resource hierarchy in DBPedia in this way. For instance, the next level up is sports, and above that are hobbies, leisure, recreation, games, entertainment, and exercise. The potential for machine based semantic web reasoning now appears a lot more exciting.
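
Crawling up the hierarchy can be sketched in a few lines of Python. The dictionary below is hand-written toy data that mirrors the tennis example in this post. It is not real DBPedia output, and the function name is my own. Each key maps a concept to its immediately broader concepts, and the function walks upwards breadth first.

```python
from collections import deque

# Toy "broader than" links, hand-written to mirror the example in this post.
# Assumption: real data would use URIs and come from DBPedia itself.
BROADER = {
    "Tennis": ["Racquet sports"],
    "Racquet sports": ["Sports"],
    "Sports": ["Hobbies", "Leisure", "Recreation", "Games", "Entertainment", "Exercise"],
}


def broader_concepts(concept, broader):
    """Return every concept above `concept`, nearest levels first."""
    found = []
    queue = deque(broader.get(concept, []))
    while queue:
        current = queue.popleft()
        if current in found:
            continue  # already visited via another path
        found.append(current)
        queue.extend(broader.get(current, []))
    return found


if __name__ == "__main__":
    for name in broader_concepts("Tennis", BROADER):
        print(name)
```

A blog engine with access to links like these could infer that a post about tennis is also a post about sports, even though the word "sports" never appears in it.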

Wow, I’m convinced! I want my blog content semantized!

Good decision! If you’re using WordPress, I can recommend rdface [8] – “It supports different views for semantic content authoring and uses existing Semantic Web APIs to facilitate the annotation and editing of RDFa contents.” In essence, it allows you to select one or more external APIs, including DBPedia Spotlight, Open Calais, and Alchemy, to inspect the text of your blog post, search for associated URIs on the cloud of Linked Open Data, and add them to the underlying HTML of your page using RDFa [9], a mechanism for embedding RDF data, and the URIs it refers to, in the attributes of HTML elements.

I’ve been playing with rdface for the last few days, and it’s a lot of fun! I’ve even run it on this blog post, so if you inspect the source for this page, you should see <span> tags scattered around, containing URIs. Remember – this markup is really only useful for machines reasoning over the content, but it’s reassuring to see it there!
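
If you’d rather not squint at the page source, a few lines of Python can pull the URIs out for you. This is only a sketch using the standard library’s html.parser. The sample markup is hypothetical, and I’m assuming the URIs sit in the RDFa "about" or "resource" attributes. The exact markup rdface emits may differ.

```python
from html.parser import HTMLParser


class RDFaURICollector(HTMLParser):
    """Collect URIs found in RDFa 'about' and 'resource' attributes."""

    URI_ATTRS = ("about", "resource")

    def __init__(self):
        super().__init__()
        self.uris = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names arrive lowercased.
        for name, value in attrs:
            if name in self.URI_ATTRS and value:
                self.uris.append(value)


def extract_rdfa_uris(html):
    """Return the annotated URIs in document order."""
    collector = RDFaURICollector()
    collector.feed(html)
    return collector.uris


# Hypothetical annotated snippet; not copied from real rdface output.
SAMPLE = (
    '<p>My favourite sport is '
    '<span class="annotation" about="http://dbpedia.org/resource/Tennis">tennis</span>'
    '.</p>'
)

if __name__ == "__main__":
    print(extract_rdfa_uris(SAMPLE))
```

A full RDFa processor does much more than this (it resolves prefixes and builds complete triples), but listing the URIs is enough to confirm that the annotations are really there.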

I’ll see you in Web 3.0. Just get those blog posts annotated first!

1 – https://github.com/amccausl/RDF4H
2 – http://jena.sourceforge.net
3 – http://www.bbc.co.uk/ontologies/programmes/2009-09-07.shtml
4 – http://openbiblio.net/2010/11/17/jisc-openbibliography-british-library-data-release
5 – http://en.wikipedia.org/wiki/Resource_Description_Framework
6 – http://dbpedia.org
7 – http://www.foaf-project.org
8 – http://wordpress.org/extend/plugins/rdface
9 – http://en.wikipedia.org/wiki/RDFa

Comments

  1. Egon Willighagen Said,

    Are you aware of tools that might help people on the Blogger platform, like me?

  2. Rob Said,

    @Egon Hmm, I’m not familiar with this platform, I’m afraid. If you discover a way, do post back!

By Rob