Before going into the technical details of how the ORDF library works, perhaps it is helpful to explain some of what led us here.
The story begins, really, with the CKAN software, which provides a metadata registry for cataloguing datasets. It is a traditional Pylons application with a SQL back-end and all the usual models and controllers and such. The basic design pattern would be familiar to anyone who has ever used an MVC framework such as Pylons or Django or Ruby on Rails.
As the community expanded, others started running instances of the software. Where we started with a single site, now there is a decentralised network of sites in Canada, Germany and elsewhere. Even some governments are running their national data catalogues using it, for example http://data.gov.uk/.
Perhaps unsurprisingly it was not long before members of this community began to want slight changes to the data model to suit their local requirements. The most common request was to have additional metadata for a package. There was no particular common thread to the specific metadata each wanted so a system was devised to be able to add arbitrary key-value pairs to a package, along with a way to specify how they were to be edited and validated.
This was a less than elegant solution, but it was simple and it worked. Each site now has an extensions file that says what extra pieces of data they want.
And then came the INSPIRE Directive, which says that member states of the European Union must make certain geographic data available to the public. In order to do this well, to be able to add geographic metadata to catalogue entries and search it with bounding boxes or circles, we need a specialised index on this metadata.
Making a spatial index is not hard. There is good support in PostGIS, the GIS extensions for the PostgreSQL database that we typically use. But it means that using a table of key-value pairs is out, and it entails more extensive changes to the schema.
Worse, it risks introducing a dependency on PostGIS for sites that have no need of it – sites outside of the EU that are not concerned with geographic data.
It seemed like we were running up against the limits of the usual SQL-backed MVC approach in trying to run a federated network of data catalogues, each with their own local extensions and slightly different requirements.
The particular drawbacks can be summarised:
- SQL schemas are rigid. Making changes to the database schema means making parallel changes in the application code and making sure any running database is in sync.
- Hacks like the table of key-value pairs to avoid divergent schemas at different sites forgo efficient indexing and the expressivity otherwise possible with SQL for those fields.
- Indexing of datatypes other than strings is impossible with the key-value pair arrangement.
The problems we were encountering with the CKAN network appear because in a decentralised environment the equivalence of conceptual model and SQL table (or group of tables) starts to break down. Each node in the network has a slightly different idea of what the meaning of a model is and supporting these overlapping but different conceptions is not something that SQL is well suited to.
This is, however, not a new problem. A common problem domain example is a system to model medical patients’ complaints. There are a large number of possible complaints and any patient will only have a small number of them. It quickly becomes unwieldy to either have a very large number of tables, one for each type of complaint (not to mention what happens when a new complaint is encountered) or to have a table of patients with a very large number of mostly empty columns.
If you think of a matrix where the rows correspond to patients and the columns correspond to complaints, and a particular cell holds the value 1 if a particular complaint is present for that patient and 0 otherwise, the matrix will hold mostly zeros. This is known as a sparse matrix and is well studied in the mathematical and computer science literature.
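To make the sparse matrix idea concrete, here is a minimal sketch in Python. The patient and complaint names are invented for illustration; the point is only that we store the cells that hold a 1 and treat everything else as an implicit 0.

```python
# Hypothetical data: only complaints that are actually present are stored.
complaints_by_patient = {
    "patient-1": {"asthma", "eczema"},
    "patient-2": {"migraine"},
    "patient-3": set(),
}


def has_complaint(patient, complaint):
    """Return 1 if the cell is set, 0 otherwise; absent cells cost nothing."""
    return 1 if complaint in complaints_by_patient.get(patient, set()) else 0


def density(number_of_complaints):
    """Fraction of cells in the full patients x complaints matrix that hold 1."""
    stored = sum(len(present) for present in complaints_by_patient.values())
    return stored / (len(complaints_by_patient) * number_of_complaints)
```

With three patients and, say, one hundred possible complaints, only three of the three hundred cells are ever stored, and a newly encountered complaint needs no change to the structure at all.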
Quite clearly in the CKAN problem domain the model is not nearly as sparse as in the example. In fact most datasets in the catalogue can be expected to have most of the attributes and the sparse area is more or less limited to the key-value pair site extensions. However the more sites there are the larger this subset of metadata becomes.
A related problem that arises when we start allowing local extensions to the data model is about what to do when aggregating data. A simple answer is to just aggregate the core information but this is very probably not good enough. What happens when different sites define an extension with the same name but different meanings?
This is what led us to the Resource Description Framework. In the RDF model, the attributes of an entity (called predicate and subject, respectively, in the RDF nomenclature) are URIs. This gives us a global namespace that is locally extensible to avoid collisions.
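The following sketch illustrates the collision problem and how URIs avoid it. It uses plain Python tuples rather than an RDF library, and the site namespaces, dataset URI and "region" extension are all hypothetical.

```python
# Hypothetical namespaces for two sites that both define a "region" extension.
SITE_A = "http://catalogue-a.example.org/schema#"
SITE_B = "http://catalogue-b.example.org/schema#"
DATASET = "http://catalogue-a.example.org/dataset/42"

# With bare key names the second value silently overwrites the first.
extras = {}
extras["region"] = "Scotland"        # site A means an administrative region
extras["region"] = "55.9,-3.2,10km"  # site B means a bounding circle

# With URI predicates both statements coexist as (subject, predicate, object).
triples = {
    (DATASET, SITE_A + "region", "Scotland"),
    (DATASET, SITE_B + "region", "55.9,-3.2,10km"),
}


def values(subject, predicate):
    """Return all objects stated for a subject and predicate, sorted."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)
```

An aggregator can then keep both statements and decide for itself which predicates it understands.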
Furthermore, there are a large variety of existing RDF Vocabularies which may be useful for describing our data without having to re-invent the wheel.
It also quickly became apparent that it should be possible to build a system that can be repurposed for other projects easily. After all, if we are not constrained to put the bulk of our thinking into a SQL schema rigidly tied to Python classes, then the software becomes more decoupled from the data and hence more reusable.
In April of 2010 we started work on a new project with the University of Edinburgh IDEALab for a new type of bibliographic information creation and collaboration platform and decided that this was a good opportunity to pursue this new strategy.
Some project members had done work with RDF before, and we embarked on a project to see how the state of the art had advanced since we last looked. It was not exceptionally impressive. Without naming names, the choices for storing RDF data all have several of the following problems:
- slow when holding a lot of data
- large in terms of resource (disc, memory, software) requirements
- complicated to set up
- poor support for aggregate operations
- poor support for indices on specific values
- poor support for full-text indices
After some initial efforts using 4store we found that a colleague at the University of Oxford, Ben O’Steen, had been thinking about this very problem. While the ORDF software only uses some of his lower-level primitives, the arrangement of the RDF storage very much bears the mark of his thinking.
The basic idea is to use a variety of indices. The most basic (and most important) is simply the filesystem. Serialised RDF files are stored in a specialised directory hierarchy. They can be read and written very quickly and efficiently if you know their identifiers or filenames but they cannot be searched.
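As a rough illustration of why such a store is fast to read and write by identifier, here is a sketch of one possible layout. It is not the layout ORDF actually uses: the hashing scheme, the two-level fan-out and the function names are all assumptions made for the example.

```python
import hashlib
import tempfile
from pathlib import Path


def path_for(root, identifier):
    """Map a graph identifier to a nested path so no directory grows too large."""
    digest = hashlib.sha1(identifier.encode("utf-8")).hexdigest()
    return Path(root) / digest[:2] / digest[2:4] / (digest + ".nt")


def write_graph(root, identifier, serialised):
    """Store the serialised graph at the path derived from its identifier."""
    target = path_for(root, identifier)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(serialised.encode("utf-8"))
    return target


def read_graph(root, identifier):
    """Fetch a graph directly by identifier; no searching is involved."""
    return path_for(root, identifier).read_bytes().decode("utf-8")


# Hypothetical graph, serialised as a single N-Triples statement.
root = tempfile.mkdtemp()
uri = "http://example.org/dataset/42"
data = '<http://example.org/dataset/42> <http://purl.org/dc/terms/title> "Example" .\n'
write_graph(root, uri, data)
```

The limitation mentioned above is also visible here: given only a title, there is no way to find the file without reading every one of them, which is why the other indices are needed.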
For searching or querying we use two other indices at the moment. We have persevered with 4store. This gives us the ability to search and explore relationships between entities using the SPARQL query language. We also use Xapian for full-text indexing. The actual process of building the full-text index is somewhat application specific if it is to be done in a more advanced way than simply looking at some of the predicates that commonly contain informative string literals, but that means that applications using the ORDF software simply need to define one function to do this and a corresponding function for searching.
Once we were committed to having a primary filesystem store and a variety of associated indices, perhaps allowing indices to exist on different hosts for scalability, the question arises of how to keep them in sync and updated.
We use a strategy where a save operation on an RDF graph (a named collection of statements) first writes the new graph into the filesystem storage and then passes it on to a message queueing system – RabbitMQ in particular. Each index has a small daemon that listens for incoming messages and does whatever updates are required.
This is implemented in such a way as not to require the use of RabbitMQ. It is perfectly possible simply to iterate over the indexes and give them the graph to update instead of passing it to the messaging subsystem, but a live, production system would typically make use of the greater robustness of a message queue.
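The two modes of operation can be sketched as follows. This is not the ORDF API: the class and function names are hypothetical, the store and index are in-memory stand-ins, and a standard library queue takes the place of RabbitMQ.

```python
import queue


class MemoryStore:
    """Stand-in for the primary filesystem store."""

    def __init__(self):
        self.graphs = {}

    def write(self, identifier, triples):
        self.graphs[identifier] = set(triples)


class RecordingIndex:
    """Stand-in for an index such as a triple store or full-text index."""

    def __init__(self):
        self.seen = []

    def update(self, identifier, triples):
        self.seen.append(identifier)


def save(store, identifier, triples, indices, outbox=None):
    store.write(identifier, triples)       # the primary copy is written first
    if outbox is not None:
        outbox.put((identifier, triples))  # queued mode: indices catch up later
    else:
        for index in indices:              # direct mode: update synchronously
            index.update(identifier, triples)


def drain(outbox, indices):
    """What the per-index daemons would do, collapsed into one loop."""
    while not outbox.empty():
        identifier, triples = outbox.get()
        for index in indices:
            index.update(identifier, triples)


store, index = MemoryStore(), RecordingIndex()
save(store, "urn:example:g1", {("s", "p", "o")}, [index])
outbox = queue.Queue()
save(store, "urn:example:g2", {("s", "p", "o2")}, [index], outbox=outbox)
pending = list(index.seen)  # the second graph is stored but not yet indexed
drain(outbox, [index])
```

In the queued mode the save returns as soon as the primary copy is written, which is what leaves room for slower processing between the save and the index update.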
Use of a message queueing system in this way also opens the way to doing more expensive operations on the graph after a user may have saved it but before the indices are updated. A good example would be Production Rule inferencing using, for example, FuXi to populate the system with extra statements that are implied by the data received from the user but not explicitly stated. This means richer data and more interesting searching possibilities. The whole arrangement can start looking like an Expert System though there is much to be done first before this tantalising direction can be fully explored.
Another requirement common to most data management projects, CKAN and Bibliographica alike, is the keeping of change history data. It is not enough simply to overwrite previous versions. CKAN uses a specially implemented Versioned Domain Model which implements this for objects in a SQL database.
We decided to build on the Changesets vocabulary from Talis. We had to extend it in a few ways in order to include a notion of RDF graphs, which are the basic unit of storage in our back-end, and simultaneous changes to multiple such graphs.
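The idea of recording a change as statements removed and statements added, grouped per graph so that one change can span several graphs, can be sketched like this. The dictionary layout and function names are assumptions for illustration only; they are neither the Changesets vocabulary itself nor our extension of it.

```python
def diff_graph(old, new):
    """Describe the change to one graph as statements removed and added."""
    return {"removals": old - new, "additions": new - old}


def make_changeset(before, after, reason):
    """before and after map graph identifiers to sets of triples."""
    changes = {}
    for graph_id in sorted(set(before) | set(after)):
        delta = diff_graph(before.get(graph_id, set()), after.get(graph_id, set()))
        if delta["removals"] or delta["additions"]:
            changes[graph_id] = delta
    return {"reason": reason, "changes": changes}


def apply_changeset(before, changeset):
    """Replay a changeset on top of an earlier state of the store."""
    result = {graph_id: set(triples) for graph_id, triples in before.items()}
    for graph_id, delta in changeset["changes"].items():
        triples = result.setdefault(graph_id, set())
        triples -= delta["removals"]
        triples |= delta["additions"]
    return result


# Hypothetical store state before and after a single user action.
before = {
    "urn:example:g1": {("urn:a", "title", "Old")},
    "urn:example:g2": {("urn:b", "title", "Keep")},
}
after = {
    "urn:example:g1": {("urn:a", "title", "New")},
    "urn:example:g2": {("urn:b", "title", "Keep")},
    "urn:example:g3": {("urn:c", "title", "Created")},
}
changeset = make_changeset(before, after, "Retitle a and add c")
```

Here a single changeset covers an edit to one graph and the creation of another, while the untouched graph does not appear in it at all. Keeping the sequence of such changesets is enough to reconstruct any earlier version.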