Wednesday, January 07, 2009

Generating Nutritional Data RDF from USDA, Part 2

I had a bit of a whinge yesterday about copyright, nutrition data, and so forth.

Today, my inbox has a nice copy of the NUTTAB data I wanted, I've located SR21, and someone else has pointed me at Canadian nutritional data too.

To get the NUTTAB data you have to email Food Standards Australia, but that's not a huge deal.


So, progress:
I've made a script to import the USDA SR21 data into a database (i.e. MySQL) and render it out as RDF.
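To give a feel for the import-then-render idea, here is a minimal sketch in Python. It is not the PHP code from the repository. It assumes the SR release format, where fields are separated by carets (^) and text fields are wrapped in tildes (~). The sample line is illustrative and truncated to three fields, and FOOD_BASE is a made-up namespace, not the one the project uses.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS_NS = "http://www.w3.org/2000/01/rdf-schema#"
# Hypothetical base URI for food resources (an assumption, not the project's).
FOOD_BASE = "http://example.org/usda/food/"


def parse_sr_line(line):
    """Split one caret-delimited line and strip the ~ text qualifiers."""
    return [field.strip("~") for field in line.rstrip("\r\n").split("^")]


def food_to_rdf(ndb_no, label):
    """Render a single food as a tiny RDF/XML document with just a label."""
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("rdfs", RDFS_NS)
    root = ET.Element("{%s}RDF" % RDF_NS)
    desc = ET.SubElement(
        root,
        "{%s}Description" % RDF_NS,
        {"{%s}about" % RDF_NS: FOOD_BASE + ndb_no},
    )
    ET.SubElement(desc, "{%s}label" % RDFS_NS).text = label
    return ET.tostring(root, encoding="unicode")


# Illustrative line: food id, food group code, long description.
sample = "~01002~^~0100~^~Butter, whipped, with salt~"
ndb_no, group, label = parse_sr_line(sample)
rdf_xml = food_to_rdf(ndb_no, label)
print(rdf_xml)
```

A real renderer would also attach the nutrient values, units and sources for each food; this only shows the shape of the round trip.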

Installing it


Pretty easy stuff! It's in PHP and makes use of PEAR.
# Get the code:
$ svn checkout http://freebase-owl.googlecode.com/svn/trunk/nutrition/

# 0. Install dependencies
$ sudo apt-get install php-pear mysql-server wget unzip
$ sudo pear install -fa MDB2 XML_Beautifier

# 1. Get the SR21 data, extract it
$ wget http://www.nal.usda.gov/fnic/foodcomp/Data/SR21/dnload/sr21.zip
$ unzip sr21.zip

# 2. Make configuration
$ cp config.php.dist config.php
$ vim config.php

# 3. Create a database of your choosing, matching the settings in config.php
$ mysql -u root -p

mysql> CREATE DATABASE usda;

# 4. Run the install script. This will take a while as it imports all data. If it fails, just DROP the database and start again
$ php install.php

# 5. Give it a shot from the command line. "1002" is the USDA food id.
$ php rdfizer.php 1002

$ php rdfizer.php 1002 > 1002.rdf

# 6. Generate the whole set:
$ php generate-all.php



The basic plan: import everything, render out individual items, publish them statically on the web. Maybe later, get someone to stuff them all into a SPARQL endpoint.
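The "publish them statically" step is really just one file per food id, much like the `php rdfizer.php 1002 > 1002.rdf` redirect above. A rough sketch in Python, not the project's generate-all.php: the function name, the directory layout and the placeholder documents are all assumptions for illustration.

```python
import os
import tempfile


def publish_static(documents, out_dir):
    """Write each document to <out_dir>/<food id>.rdf and return the paths."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for food_id, rdf_xml in sorted(documents.items()):
        path = os.path.join(out_dir, "%s.rdf" % food_id)
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(rdf_xml)
        paths.append(path)
    return paths


# Placeholder bodies; in practice these come from whatever renders each item.
docs = {
    "1002": "<!-- RDF/XML for food 1002 -->",
    "1001": "<!-- RDF/XML for food 1001 -->",
}
out_dir = tempfile.mkdtemp()
written = publish_static(docs, out_dir)
print(written)
```

Point a web server at the output directory and each food gets a stable URL, with no database needed at serving time.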

Rinse and repeat with the Canadian and Australian data. Grow a common ontology for Food, Nutrients, etc.

You can view some of the output RDF; I've not generated the whole set yet, as my poor computer is far too old and creaky to do so.

Additionally, there are lots of linked data connections I want to make.

I want to link the sources with PubMed; the units with... something (side note: I couldn't find much in the way of unit and measurement ontologies!); the USDA-style names with Wikipedia/DBpedia/Freebase; and the compound names (e.g. PROCNT, protein content) with... something.

How silly is this: there's no semantic web URL for milligrams. The best I could do was a few related concepts, because someone at Wikipedia decided to merge all of the sub-articles for measurements into the single base unit (e.g. mg into g).
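As a sketch of what the name links might look like once a mapping exists, here is a small Python example that emits N-Triples with rdfs:seeAlso. Everything specific in it is an assumption: the FOOD_BASE namespace is made up, and the hand-written mapping from food id 1002 to a DBpedia page is only a guess for illustration, not something the scripts produce.

```python
RDFS_SEE_ALSO = "http://www.w3.org/2000/01/rdf-schema#seeAlso"
# Hypothetical base URI for food resources.
FOOD_BASE = "http://example.org/usda/food/"

# Illustrative, hand-written mapping from food id to external resources.
links = {
    "1002": ["http://dbpedia.org/resource/Butter"],
}


def link_triples(mapping):
    """Return one N-Triples line per (food, external resource) pair."""
    lines = []
    for food_id in sorted(mapping):
        for target in mapping[food_id]:
            lines.append(
                "<%s%s> <%s> <%s> ." % (FOOD_BASE, food_id, RDFS_SEE_ALSO, target)
            )
    return lines


triples = link_triples(links)
print("\n".join(triples))
```

rdfs:seeAlso is used rather than owl:sameAs because a row in a nutrient table and an encyclopedia topic are related, but not obviously the same thing.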



