Friday, November 02, 2012

How to put a semantically enabled autocomplete control into your applications

One of the most common application design patterns is to implement a lookup table - some piece of business data is given a description, and possibly a code or identifier.

When creating new data, a user often needs to select a code/identifier for a piece of information. This is usually done with a dropdown, or, if there are many entries, an autocomplete control.

This works well - some people will just hard-code hashes storing the key/value pairs, others will make sure they're published into their relational data store.

Where it starts to fall down is when multiple applications work together - who can agree on the meaning of a code?
Your code CASE_NIGHTMARE_GREEN is applied by a user and treated by one application as the coming of Cthulhu - but after an ETL job, CSV export or web services message, the next application treats it as something different. Users not up to date with the latest Lovecraftian spy thrillers start to misinterpret the data and apply it to anything involving green suitcases in horrible colours.

How do you fix this?
The next logical step is often to add a description, so that a UI can explain the term - but in a non-service-oriented environment, that description is trapped in your datastore.

This won't work in a multi-vendor scenario, at least not unless you want to share your DB with them.

Another approach is the Code Table service - a service focused solely on retrieving data about a given input identifier.

I've seen this done in at least one SOA, and it's not a terrible pattern - but each vendor still has to stand up their own code table services, and there's a lot of repetition.

What else can you do?
Soon it becomes obvious that you want a decent way to find a code and the related data, but you also want to support aliases - my CASE_NIGHTMARE_GREEN is your WALK_IN_THE_PARK.

This gets tricky quickly, as 1:1 mappings are difficult - either a collection of vendors pulls together and standardises on a list and its mappings, or no one really collaborates and fragile mapping code creeps in.

By this point, fear of change often sets in: the interfaces between parties are fragile, and pushing changes through a consortium of vendors becomes a nightmare of project management and communication.

If you haven't had to roll out minor enhancements to a standard with a number of other parties who just aren't quite interested, take my word for it - it's painful.

All is not lost, there is another way - and it's simple.


What's the way forward?

My recommendation here is to push your codes into a triplestore. It doesn't fix everything, but it becomes trivial to relate information to the code - aliases, for example, or descriptions.
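
What does a code look like as triples? Here's a minimal sketch in Turtle (the URI and labels are hypothetical, borrowing this post's running example):

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .

<http://example.com/codes/CASE_NIGHTMARE_GREEN>
    a skos:Concept ;
    skos:prefLabel "CASE_NIGHTMARE_GREEN" ;
    skos:definition "The coming of Cthulhu." .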

A triplestore is a database for triples of data like the above, usually fronted by a RESTful service that allows you to execute queries - if you can deal with MongoDB or MySQL, you should be able to comprehend what's going on.

Don't just take my word for it - here's one I prepared earlier: SNOMED-backed, SPARQL-powered autocomplete UI components. Pretty neat stuff.

Here's what Wikipedia has to say about SNOMED, if you haven't heard of it:
SNOMED CT concepts are representational units that categorize all the things that characterize health care processes and need to be recorded therein. In 2011, SNOMED CT includes more than 311,000 concepts, which are uniquely identified by a concept ID; e.g. the concept 22298006 refers to Myocardial infarction. All SNOMED CT concepts are organized into acyclic taxonomic (is-a) hierarchies; for example, Viral pneumonia IS-A Infectious pneumonia IS-A Pneumonia IS-A Lung disease. Concepts may have multiple parents; for example, Infectious pneumonia is also a child of Infectious disease. The taxonomic structure allows data to be recorded and later accessed at different levels of aggregation. SNOMED CT concepts are linked by approximately 1,360,000 links, called relationships.
That's one big code table, and you can see it's grown beyond just code/name pairing to include more data.
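
In triple terms, that taxonomy is just more data sitting alongside the codes. A rough sketch (simplified URIs under a hypothetical ex: prefix, not real SNOMED identifiers):

@prefix ex:   <http://example.com/concepts/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

ex:ViralPneumonia      rdfs:subClassOf ex:InfectiousPneumonia .
ex:InfectiousPneumonia rdfs:subClassOf ex:Pneumonia , ex:InfectiousDisease .
ex:Pneumonia           rdfs:subClassOf ex:LungDisease .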

One of the key things highlighted by the Freebase folks and a few other places is the common problem: given a bunch of user input, locate an object or identifier related to that term.

The moment you have an autocomplete control like these, it instantly kicks your application from "user is entering data into a text field" into "user is describing a semantic object, and I can grab all of the information about it that is relevant to my user".

Unlike standard relational-powered applications, SKOS + SPARQL make this trivial - you simply record a preferred label (skos:prefLabel) and as many alternative labels (skos:altLabel) as you need.
What does that look like? Here's a sample query showing a user searching for... ear wax.
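
Roughly, the query behind it is shaped like this (a sketch - the prefixes and exact form are assumed rather than copied from the demo):

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?concept ?label
WHERE {
  { ?concept skos:prefLabel ?label }
  UNION
  { ?concept skos:altLabel ?label }
  FILTER regex(str(?label), "ear wax", "i")
}
LIMIT 10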

Note the URIs in the demo's resultset (try clicking on them to find out more information), the preferred and aliased labels, and try retrieving the resultset as JSON.

Even if no other part of your application is aware of linked data, you can see how this graph of information can be flattened and pushed into a standard data store for later use.




How can I build myself one of these?

Installing 4store


For this exercise, let's install some of the requirements:
$ sudo apt-get install 4store

Now we'll instantiate a new store (think database):

$ sudo 4s-backend-setup reference_store
4store[5196]: backend-setup.c:185 erased files for KB reference_store
4store[5196]: backend-setup.c:310 created RDF metadata for KB reference_store

Fire up the backend service (think of it like /etc/init.d/mysql start):
$ sudo 4s-backend reference_store

Populate some data - we'll use something I've prepared earlier, in Turtle format. It helps to think of Turtle as YAML, but with URIs and a bit more magic.
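
To see the YAML analogy, here's a sketch of the kind of triples such a file contains (illustrative URIs and labels - not the exact contents of data.ttl):

# Roughly what you'd write in YAML:
#   EAR_WAX:
#     label: Ear wax
#     alias: Cerumen
# ...and the Turtle equivalent:
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .

<http://example.com/codes/EAR_WAX>
    skos:prefLabel "Ear wax" ;
    skos:altLabel  "Cerumen" .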


$ git clone git://github.com/CloCkWeRX/4store-reference-service.git
$ cd 4store-reference-service
$ 4s-import reference_store --format turtle data.ttl



We're good to go - let's put the endpoint up:
$ 4s-httpd -p 8000 reference_store



Now there's a (RESTful) endpoint living at
http://127.0.0.1:8000/sparql/

and you can run queries on it via http://127.0.0.1:8000/test/ - though until Issue #93 is solved, you probably just want to open the test-query.html page; this query will bring back both sets of data.

$ chrome test-query.html

From here, you can see the plain text, csv, JSON or XML results.
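
If you'd rather skip the browser, the endpoint also answers standard SPARQL protocol requests - a sketch with curl (assuming the server honours the usual Accept headers for content negotiation):

$ curl http://127.0.0.1:8000/sparql/ \
    -H "Accept: application/sparql-results+json" \
    --data-urlencode "query=SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5"

This POSTs the query and asks for the JSON resultset; swap the Accept header for text/csv or application/sparql-results+xml as needed.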

How do I do this in PHP, Rails, etc?

There are a lot of client libraries out there - I'd suggest a quick read through http://www.jenitennison.com/blog/node/152 for most Rails developers, or a look at the sparql-client gem.

Failing that, peruse the ClientLibraries.


Where can I learn more about SPARQL?


Step 1, learn Turtle. If you can comprehend YAML, you should feel fairly comfortable.

Step 2, I'd try SPARQL by example. There's a good chance that if there's an SQL concept you want, such as LIKE matching, there's a SPARQL equivalent (FILTER regex).
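
For example, SQL's label LIKE '%wax%' translates to a FILTER along these lines (a sketch, reusing the SKOS labels from earlier):

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?concept ?label
WHERE {
  ?concept skos:prefLabel ?label .
  FILTER regex(str(?label), "wax", "i")   # "i" = case-insensitive, like ILIKE
}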

Luckily, 95% of what you learned with Turtle is reused directly in SPARQL - it adds variables, WHERE clauses/graphs, filters, and a few other things... but that's really all that's new.

Where to from here?

If you were to deploy this internally within an organisation, your service is pretty much good to go. You may want to look at Graph Access Control to add in some security, and the related SparqlServer/Update APIs.

Was this easy enough?

In comparison to the other approaches I have seen, it's fairly good.
  • It's trivial to put a front-end on your triplestore.
    You can roll your own with a minimum of fuss, or use things like https://github.com/kurtjx/SNORQL to provide an 'expert user' ability to inspect your data.
  • Adding and removing aliases is trivial - there's no schema to migrate or anything else troublesome, and you can add extra data at the drop of a hat, even if it's unrelated to your core set (see the sketch after this list).
  • It's trivial to relate concepts to each other.
  • Your ontology (schema) is already there for code tables - http://www.w3.org/TR/skos-reference/ - you'll never have to reinvent it.
  • There are products available that let you tie your application behaviour/code tables right into Confluence or other platforms.
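
To make that alias point concrete, here's a sketch of adding one with SPARQL 1.1 Update (the concept URI is hypothetical, and this assumes your store exposes an update endpoint - recent 4store releases serve one at /update/):

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
INSERT DATA {
  <http://example.com/codes/CASE_NIGHTMARE_GREEN>
    skos:altLabel "WALK_IN_THE_PARK" .
}

No ALTER TABLE, no migration scripts - the new alias shows up in the very next autocomplete query.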
