Wednesday, March 02, 2005

A Case for RDF & Open Source in South Australia

Interoperability between software is a major concern. If program A can't talk to program B, then you have to take everything from one and manually reenter it into the other.
More and more we can witness the crippling effect this lack of interoperability has upon public services - website A has a page about a subject, but it's hidden and not easy to find. Website B has more about the subject, but you can't share information between the two, as A & B use two different methods to store their information.

Uh... Houston... we have a problem.
A sticky one at that. If you decree that program A must work with program B, who's right? Does program B take on the extra work of talking to program A, or vice versa? The answer, in real life, is that no one bothers, because it's too hard. Or, if they do, program C comes along and all of the effort is lost.
The answer, which the W3C advanced initially, was XML. XML is a way of representing information in trees - heirachies of connected properties with no inherent meaning. All around the world, people learnt about XML and rejoiced. This was the magic glue to fix everything. You were allowed to do anything you liked in XML, so long as it validated.

The problem was, people did. It's the same old problem, just less painful - as there is a solution, it's just tedious. XSLT is a method for translating one XML document into another - say an RSS news feed into a webpage. Program A still can't talk to Program B because they both talk slightly different local dialects in XML, and though syntax can be communicated, meaning can't.

Example:



Document A
<shop>
<close>1700</close>
</shop>

Document B
<door>
<close>true</close>
</door>



If you look at the close tag, it's got either a time (int) value or a boolean value. If I publish my XML about doors and if they are closed or open, your tool might read it for shop opening and closing times - but not understand it, as it expects a time rather than a true/false value.

What's a Developer to Do?
This is where RDF solves a lot of problems. RDF is a model, a graph: think of a whole lot of information trees (like XML documents), some sharing a common branch. Connect together more "branches" and you soon can no longer visualize it without it becoming a tangle of interwoven information.
How RDF does this is very simple. It uses a URI or other identifier to represent some kind of abstract thing. Example: http://www.me.com/my/car/. It then goes on to describe the attributes of the URI, using the "rdf:about" attribute.


<Shop rdf:about="http://www.shop.com/">
<closes>1700</closes>
</Shop>

<Location rdf:about="http://www.shop.com/">
<address>#1 Bob Street.</address>
</Location>


What we see above is RDF expressed in XML, describing two seperate trees of information. One is a Shop, and one is a Location. The magic comes into play because an RDF parser knows that the Shop is the same as the Location, so it's both a Shop & a Location.

Graphed, we get this:
Example Graph

You can see how it's recognising the abstract thing at http://www.shop.com/ has 4 properties.

More Neat Tricks
If you wish to describe other things, you can do some nifty tricks too. There's the rdf:resource mechanism - the opposite of an rdf:about.


<Shop rdf:about="http://www.shop.com/">
<located rdf:resource=
"http://www.shop.com/address" />
</Shop>

<Location rdf:about="http://www.shop.com/address">
<address>#1 Bob Street.</address>
</Location>

Now, we've said that the Shop object's located property is described by http://www.shop.com/address. The Location object now describes whatever's at http://www.shop.com/address. Thus, we're saying the Shop has a Location.

The power of RDF lies in that I can describe one thing on my website, and someone else describes another on theirs, so long as we use a common URL (like an ISBN URN, IMDB URL, email address (mailto:foo@foo.com), irc schema (irc://irc.network.com/) or anything else) we can link statements to each other.
It's a lot like how real people work - you ask someone for directions to a landmark, they tell you it's near Waymouth St. The next person you ask, you ask about Waymouth St instead of the landmark, and learn where Waymouth St is. The third person you ask can tell you the landmark is located just around the corner from Waymouth St. You then in your mind connect all of this together to find both Waymouth St and the landmark.

Onwards, enough with the RDF talk
AGLS is an RDF vocabulary designed for the Australian government. It's 19 descriptive elements which help form a foundation of well understood shared data. Agency A might not understand everything Agency B is saying, but they can catch the drift of it.
There's 5 terms that are base requirements.

Five metadata elements must be present for compliance with this standard. The mandatory elements are:

Creator

Title

Date

Subject OR Function

Identifier OR Availability

Covering the Who, What, When aspects of any service, document, or webpage which contains RDF.

The what is the most important part. From the spec:

Recommended best practice is to identify the resource by means of a string or number conforming to a formal identification system. Example formal identification systems include the Uniform Resource Identifier (URI) (including the Uniform Resource Locator (URL)), the Digital Object Identifier (DOI) and the International Standard Book Number (ISBN).

So if Agency A is talking about State Supply (Software Procurement) Amendment 2003, at http://www.parliment.gov.au/statesupplyamendment2003/, so long as Agency B knows that URL (enter google) there is much improved communication and information finding.

Agency A might be concerned entirely with the cost side of things, Agency B the security - but now, A & B have equal access to each other's information about the subject. A&B can compare versions of it, search for the last modified versions, specify data that one doesn't about the other, like coverage and so forth.

On Trust
Of course, in the wild, there's an issue of trust on things like this. Asking for directions, you trust that a person isn't lying to you and letting you get yourself lost. It's less of an issue for government, as they can simply say the dept of X trusts the dept of Y explicitly, but doesn't trust the dept of Z much - and their RDF database would reflect that.
Public keys and signed documents are a step in the right direction, as well as things like Trust in the Semantic Web - but mostly, it's not a huge concern.

Enter the Open Source
What has any of this got to do with Open Source? Vendor Lockin. Microsoft has largely withdrawn from standards like RDF - their products purport to support WinFS. It's the planned new filesystem which they can't get working for Longhorn - when Longhorn ships finally, it won't be with WinFS. WinFS is almost like RDF, just different enough so that Microsoft can keep it exclusively theirs.
You won't see other companies being able to implement web ready WinFS solutions on Linux, so you end up stuck with a costly operating system on every desktop - something you didn't need to upgrade in the first place.
Open Source, however, is developing fiercing the the sphere of RDF and the Semantic Web. Many of the major RDF vocabularies are under open or free licences that are designed to let developers use them. Likewise with the software - RDF Parsing libraries for almost every language exist, few don't have friendly licenses.
Furthermore: Firefox & Mozilla support RDF right out of the box. It powers the bookmarks, the extensions, and a whole lot more. This means that you could develop applications directly in the browser, reducing costs with everything and leveraging the power of open source. No more intranets, per se, but an intraweb of facts which aren't in static webpages.

Think, interdepartment calendars, documents, planning, maps, images and task management - all connected, all working more or less together.

AGLS is being adopted at a national level. The tools to do productive and innovative things with it are mainly open source. The great ideas will be open source - because you can't keep something the same and make it better (MS Word, for instance). Open source allows rapid innovation as ideas from one project move to the next and are made different. Open source allows you to do whatever the hell you like to your software without paying a cent.
Open source is empowering. RDF is even more empowering. The two together?

Open Data

Try and tell me that's not going to be doubly empowering.

1 comment:

Dan said...

Yeah, I was shocked too! I was in #SWIG when Libby Miller was talking about it.

#SWIG [irc://irc.freenode.net/swig/] & #SWHACK [irc://irc.freenode.net/swhack/] are two places for a lot of Semweb related news.