Spiral MySQL Schema

I've been reading Benjamin Novack's writing on exploding triple stores with interest and took some time to speak with him about it at ISWC this year. It prompted me to take another look at the MySQL triplestore used by Spiral. This is a short write-up of the design of that triplestore database.

First, let me explain a little about the RDF representation used in Spiral, which might be a little different to other frameworks. Spiral is resource-centric rather than graph-centric. In concrete terms, Spiral has a Resource class which has zero or more graph nodes associated with it. A graph node can only be associated with a single resource. Without smushing or reasoning, each resource always has exactly one node. The smushing process finds nodes that denote the same resource, so merging them is simply a matter of updating these associations.

The TripleStore interface provides various methods to deal with either resources or graph nodes and their associations. For example, there's a GetResourceDenotedBy method which gives you the resource that a particular graph node denotes. Conversely, there's a GetBestDenotingNode method which gives you a graph node for a particular resource. Generally this will give you a URI if the store knows of one, otherwise you'll get a blank node (I'm ignoring literals here). You can also get a list of all nodes for a resource using GetNodesDenoting. Finally, you can build associations by using AddDenotation, which takes a graph node and the resource it denotes.
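To make the node/resource relationship concrete, here is a minimal in-memory sketch in Python. It is not Spiral's code: the method names are Python-style renderings of the four methods described above, and the Node and Resource classes, the "uri"/"blank" kinds, and the choice to create a fresh resource for an unseen node are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Node:
    kind: str   # "uri" or "blank" (literals are ignored, as in the text)
    label: str


@dataclass(eq=False)  # resources are compared by identity, not by content
class Resource:
    nodes: List[Node] = field(default_factory=list)


class TripleStore:
    def __init__(self) -> None:
        self._denotes: Dict[Node, Resource] = {}

    def get_resource_denoted_by(self, node: Node) -> Resource:
        # Assumption: an unseen node gets a fresh resource of its own.
        if node not in self._denotes:
            self.add_denotation(node, Resource())
        return self._denotes[node]

    def get_nodes_denoting(self, resource: Resource) -> List[Node]:
        return list(resource.nodes)

    def get_best_denoting_node(self, resource: Resource) -> Node:
        # Prefer a URI if one is known, otherwise fall back to a blank node.
        for node in resource.nodes:
            if node.kind == "uri":
                return node
        return resource.nodes[0]

    def add_denotation(self, node: Node, resource: Resource) -> None:
        # A node denotes exactly one resource, so drop any earlier association.
        old = self._denotes.get(node)
        if old is not None and old is not resource:
            old.nodes.remove(node)
        if node not in resource.nodes:
            resource.nodes.append(node)
        self._denotes[node] = resource


store = TripleStore()
blank = Node("blank", "b1")
uri = Node("uri", "http://example.org/alice")
person = store.get_resource_denoted_by(blank)

# Smushing has decided that both nodes denote the same resource.
store.add_denotation(uri, person)
best = store.get_best_denoting_node(person)
```

After the second association the resource has two nodes, and the best denoting node is the URI rather than the blank node.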

This is the model we wanted to follow when designing the MySQL support in Spiral. We already had a memory-based triple store working well, but we knew we wanted to support much larger persistent stores.

In Spiral a triplestore is equivalent to a named graph. Each MySQL database can store multiple triplestores. We use a Graphs table to partition the space. Each graph has a unique id. I'd like to see this changed to a URI to be more consistent with other implementations of named graphs.

The core of the MySQL triplestore is the Statements table. This table holds all of the triples for all of the graphs. Key here is that the triples are expressed in terms of the underlying resources, not the graph nodes. Each resource is represented in this table by a unique hash, currently a 32-bit integer. We may decide to increase this as an option but haven't needed to yet. Each row in the Statements table is 16 bytes. Using bigint columns instead would double this row length, with a corresponding reduction in the number of rows per disk read.

There is a separate Resources table which records which resources are known in a particular graph, with a unique index to ensure that each resource can appear in a graph only once. Resources are related to their specific nodes using a ResourceNodes table. This table associates a node with a resource in a specific graph. Every node appears in this table, and those that carry additional lexical information, such as literals and URIs, have rows in additional tables as well. Blank nodes simply exist in the ResourceNodes table.
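The tables described above might be laid out roughly as follows. This is only a sketch under several assumptions: it uses Python's sqlite3 module as a stand-in for MySQL so that it runs anywhere, every column name is hypothetical, the NodeUris and NodeLiterals table names are invented for the "additional tables", and the 16-byte row is assumed to be four 32-bit columns (graph, subject, predicate, object).

```python
import sqlite3
import struct

# Hypothetical column and table names; only Graphs, Statements, Resources
# and ResourceNodes are named in the text.
SCHEMA = """
CREATE TABLE Graphs (
    id INTEGER PRIMARY KEY
);
CREATE TABLE Statements (
    graph_id  INTEGER NOT NULL,
    subject   INTEGER NOT NULL,  -- resource hash
    predicate INTEGER NOT NULL,  -- resource hash
    object    INTEGER NOT NULL,  -- resource hash
    UNIQUE (graph_id, subject, predicate, object)
);
CREATE TABLE Resources (
    graph_id INTEGER NOT NULL,
    hash     INTEGER NOT NULL,
    UNIQUE (graph_id, hash)      -- a resource appears in a graph only once
);
CREATE TABLE ResourceNodes (
    node_id       INTEGER PRIMARY KEY,
    graph_id      INTEGER NOT NULL,
    resource_hash INTEGER NOT NULL
);
CREATE TABLE NodeUris (
    node_id INTEGER PRIMARY KEY,
    uri     TEXT NOT NULL
);
CREATE TABLE NodeLiterals (
    node_id INTEGER PRIMARY KEY,
    value   TEXT NOT NULL
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO Graphs (id) VALUES (1)")
conn.execute("INSERT INTO Resources (graph_id, hash) VALUES (1, 42)")

# The unique index rejects a second copy of the same resource in the same graph.
try:
    conn.execute("INSERT INTO Resources (graph_id, hash) VALUES (1, 42)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# Row-size arithmetic: four 32-bit integers versus four 64-bit integers.
row_bytes_int32 = struct.calcsize("<4i")
row_bytes_int64 = struct.calcsize("<4q")
```

Under the four-column assumption the arithmetic matches the figures in the text: 4 x 4 = 16 bytes per row with 32-bit hashes, and 4 x 8 = 32 bytes if every column were widened to a bigint.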

Diagram of Spiral database schema

This schema is designed to work well for querying smushed graphs. Queries are rewritten in terms of the underlying resources by looking up the graph node/resource relationships at query compilation time. The query is then evaluated against the Statements table, and the best node for each matching resource is looked up at output time. Because the Statements table contains only unique triples based on the underlying resources, it can actually get smaller after smushing, which should improve query performance.
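The three steps (rewrite nodes to resources, match against the statements, map results back to the best nodes) can be illustrated with a small in-memory sketch. This is not Spiral's query engine: the SmushedStore class, the small integers standing in for 32-bit hashes, and the example URIs are all hypothetical, and a single triple pattern stands in for a full query.

```python
from typing import Dict, List, Optional, Set, Tuple

Triple = Tuple[int, int, int]


class SmushedStore:
    def __init__(self) -> None:
        self.node_to_resource: Dict[str, int] = {}
        self.statements: Set[Triple] = set()

    def smush(self, keep: int, drop: int) -> None:
        """Merge resource `drop` into `keep`; duplicate triples collapse."""
        for node, res in list(self.node_to_resource.items()):
            if res == drop:
                self.node_to_resource[node] = keep

        def swap(r: int) -> int:
            return keep if r == drop else r

        self.statements = {(swap(s), swap(p), swap(o))
                           for s, p, o in self.statements}

    def best_node(self, resource: int) -> str:
        nodes = [n for n, r in self.node_to_resource.items() if r == resource]
        uris = [n for n in nodes if not n.startswith("_:")]
        return sorted(uris or nodes)[0]  # prefer a URI over a blank node

    def query(self, s: Optional[str], p: Optional[str],
              o: Optional[str]) -> List[Tuple[str, str, str]]:
        # Compile time: rewrite the node pattern in terms of resources.
        pattern = [None if t is None else self.node_to_resource[t]
                   for t in (s, p, o)]
        # Evaluation: match against resource-level statements only.
        hits = [t for t in self.statements
                if all(q is None or q == v for q, v in zip(pattern, t))]
        # Output time: look up the best node for each matching resource.
        return sorted((self.best_node(a), self.best_node(b), self.best_node(c))
                      for a, b, c in hits)


store = SmushedStore()
store.node_to_resource = {
    "_:b1": 1,
    "http://example.org/alice": 2,
    "http://example.org/knows": 3,
    "http://example.org/bob": 4,
}
store.statements = {(1, 3, 4), (2, 3, 4)}

before = len(store.statements)
store.smush(keep=2, drop=1)  # _:b1 and alice denote the same resource
after = len(store.statements)
rows = store.query("_:b1", None, None)
```

In this example the two statements become identical once the blank node and the URI share a resource, so the statement set shrinks from two rows to one, and a query phrased with the blank node comes back expressed with the URI.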
