RDF for Modular, Extensible Markup
Summary: XML provides extensibility through a common syntax but leaves the interpretation of the information to the developer. Increasing modularity through the use of XML namespaces further increases developers’ workload as they struggle to interpret ever more complex data models. RDF leverages the syntactic extensibility of XML and the modularity of XML namespaces and additionally provides global extensibility through a common data model.
X is for eXtensible
What does it mean for XML to be extensible?
Webster's defines extensible as capable of being extended.
In the XML world this means that XML documents can be extended by following the syntax rules laid out in the XML specification. Provided you follow these rules, you won't have to change the way you process your document. Right?
Well, not quite. Following the syntax rules means that you can use an off-the-shelf XML parser to turn the bytes in your documents and your business partners' documents into some kind of tree of elements and attributes. But that tree can be just about any shape or size, and mapping it into something that is useful for your business isn't trivial.
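To make the problem concrete, here is a minimal Python sketch using the standard library's xml.etree.ElementTree module. The two documents and their element names are invented for illustration. They carry the same customer number, but because the trees are shaped differently each one needs its own mapping code.

```python
import xml.etree.ElementTree as ET

# Two hypothetical partner documents carrying the same fact.
DOC_A = "<order><customer number='43511'/></order>"
DOC_B = "<order><customer><number>43511</number></customer></order>"


def customer_number_a(xml_text):
    """Mapping code for partner A: the number is an attribute."""
    root = ET.fromstring(xml_text)
    return root.find("customer").get("number")


def customer_number_b(xml_text):
    """Mapping code for partner B: the number is a child element."""
    root = ET.fromstring(xml_text)
    return root.find("customer/number").text


number_a = customer_number_a(DOC_A)
number_b = customer_number_b(DOC_B)
print(number_a, number_b)
```

Both documents are well-formed XML and parse without complaint, yet feeding partner B's document to partner A's mapping function finds no number at all.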
What's missing is a specification for the types of tree that the XML documents can be turned into. Is it deep or wide? What data appears as attributes or elements? One approach is to enforce the use of an XML schema or a DTD. A schema defines what order the elements can appear in the tree as well as whether or not they can have attributes or contain text.
Enforcing the use of an XML schema means that you know the type of tree that is going to be produced from the XML you receive. You only have to write the code that maps it into your business database once and then you can be confident that you can easily handle the XML documents your business partners are sending you. Your tool chain consists of an off-the-shelf XML parser, a piece of code to understand the XML tree and a piece of code to store the results in your company's database.
Here comes the fly in the ointment: one schema is never enough. Most businesses deal with lots of different types of document and maintaining one grand, unified schema soon becomes unfeasible. Instead, multiple schemas are produced - one for the product description; one for the parts list; another for customer records and so on.
Each schema requires a new piece of code to be written to understand the trees the documents produce, plus another piece of code to store that information in the database. Now your tool chain consists of the off-the-shelf XML parser, multiple pieces of code to understand the trees and multiple pieces of database interfacing code.
But at least you only have to write each piece of code once…or do you?
Modularity - extensibility's significant other
Businesses don't stand still and nor does the data they deal with every day. Documents evolve, new features are added and products get priced and packaged in ever more creative ways.
In order to counter this entropic pressure, programmers have devised, over the past forty years, increasingly sophisticated ways of improving business application maintainability and extensibility using such techniques as functions, modular code and object-oriented programming.
Similar problems exist with the data the business uses: as the data evolves and more and more special cases and uses are introduced, the document schemas end up bulging at the seams. The obvious solution is to introduce modularity, commonly by using XML namespaces to extract related groups of terms. Modularity also allows different areas of the business or even other businesses to define the structure of the information that they need.
These modules can be mixed into existing documents simply by including a namespace reference and using the elements and attributes in suitable places. There could be a 'packaging' module, for example, that contains terms that describe how products are packed onto pallets, how many are in a carton and so on. The module could then be used in shipping notes, pick lists and inventory records.
To accommodate all these modules, the document schemas start introducing integration points - elements that allow any element from other namespaces to be added as children.
Now you have to rewrite your code that understands the trees produced by the XML parser to account for the modules that have been introduced and also for the modules that might be introduced. You're almost back to square one - all the safety that having one schema brought has been lost. The only benefit left is being able to validate that the XML you're receiving conforms to the host of schemas that you now have.
What's needed is something that uses XML so off-the-shelf parsers can be used, allows the use of namespaces to keep the XML modular but at the same time provides a consistent data model that eliminates the need for bespoke mapping of the XML to business data.
There is a candidate that fits these criteria, but it's one that many XML proponents prefer to keep out in the yard: RDF, the Resource Description Framework.
At its heart RDF is a way of describing things and concepts such as people, companies or the weather in Toronto in terms of the attributes they possess. In RDF terminology the things being described are called resources and their attributes are called properties. A resource can have any number of properties and each property can have a single value that is either another resource or a piece of text called a literal. This linking together of resources via properties can be thought of as a graph in which the resources are represented as nodes and the properties as arcs linking the nodes together. In fact, this graph is the consistent data model that RDF provides.
Kicking the dog
RDF has, since its inception in 1998, garnered a reputation for being obscure, hard to learn and downright ugly. The elegance of its data model was obscured behind a thick covering of verbose XML syntax.
It's easy to kick such a mangy mutt.
Despite all that, a lot of work has been done over the past couple of years in grooming both the syntax and the official documentation to make them clearer and simpler. This effort is coming near to completion and now might be the time to reconsider the use of RDF.
RDF resides in a layer on top of XML so it benefits from all the syntax extensibility that XML provides. It uses XML namespaces throughout in a consistent manner so it's modular. However, it also provides a well defined data model based on triples.
A triple consists, as its name suggests, of three items of information: the resource, a property of that resource and the property value, which can be a resource or a literal. In the language of RDF these are called the subject, the predicate and the object. For example, one triple might be {customer, customerNumber, 43511}. Here the subject is the customer, the predicate (or property) is customerNumber and the object (or value) is 43511. Triples can be linked together: the value of one property can be a resource, which might in turn be the subject of another triple. Performing a query against a collection of triples involves the query processor following these links in order to determine the values being queried.
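The idea of following links between triples can be sketched in a few lines of Python. This is an illustration under simplifying assumptions rather than how any particular RDF query processor works: the identifiers are plain strings where real RDF would use URIs, and the helper names objects and follow are invented for this example.

```python
# Each triple is (subject, predicate, object). The identifiers are
# illustrative only; real RDF would use URIs for resources and properties.
TRIPLES = [
    ("order123456", "raisedBy", "customer43511"),
    ("order123456", "orderNumber", "123456"),
    ("customer43511", "customerNumber", "43511"),
    ("customer43511", "name", "Wild Widgets Inc."),
]


def objects(triples, subject, predicate):
    """Return every object recorded for the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def follow(triples, start, path):
    """Follow a chain of predicates outward from a starting resource."""
    current = [start]
    for predicate in path:
        current = [o for node in current for o in objects(triples, node, predicate)]
    return current


customer_names = follow(TRIPLES, "order123456", ["raisedBy", "name"])
print(customer_names)
```

The query starts at the order, follows raisedBy to reach the customer resource, then follows name to reach the literal. Nothing in follow knows anything about orders or customers, which is the point of having a fixed data model.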
Because the data model is fixed for RDF, it's possible to use off-the-shelf RDF parsers to read and interpret any RDF document that may be presented. Most, if not all, parsers provide interfaces to database-backed triple stores. Thus, the tool chain becomes an off-the-shelf XML parser hooked up to an off-the-shelf RDF parser that feeds into a triple store. The only bespoke code may be the one-off job of hooking the RDF parser up to the selected database.
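At its simplest, a database-backed triple store is a single three-column table. The sketch below assumes SQLite through Python's built-in sqlite3 module; the table name, the identifiers and the invoicedItem property are hypothetical, and a production triple store would add indexes and use URIs.

```python
import sqlite3

# One fixed table holds every kind of document: no per-document table design.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)"
)

# Illustrative triples; in practice they would come from an RDF parser.
connection.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("order1", "lineItem", "item1"),
        ("item1", "code", "TYW-65523-GB"),
        ("item1", "quantity", "15"),
        ("invoice1", "invoicedItem", "item1"),
    ],
)

# Product codes on a purchase order: a self-join follows the link
# from the order to each of its line items.
rows = connection.execute(
    """
    SELECT item.object
    FROM triples AS link
    JOIN triples AS item ON item.subject = link.object
    WHERE link.subject = ? AND link.predicate = 'lineItem'
      AND item.predicate = 'code'
    """,
    ("order1",),
).fetchall()
product_codes = [row[0] for row in rows]
print(product_codes)
```

Purchase orders and invoices share the same table, so a new document type adds rows rather than new tables or new storage code.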
RDF in practice
RDF provides rules that govern how you write XML documents. These can be restrictive but are necessary to enforce the consistent model which is the key benefit gained from using RDF over plain XML.
Here are some suggested steps to use when designing XML that conforms to the RDF model. Full RDF gives you a lot more flexibility than these steps do, but if you follow them then you're guaranteed to produce clean, valid and readable RDF.
The first step is to decide what it is you're going to be describing. It could be a purchase order, or perhaps a customer. This is your first RDF resource and you need to find or invent an XML tag to represent it just as you would when designing any other XML format.
Suppose, for example, that you're describing a purchase order. You might create a namespace for purchase order related terms, or you might reuse an existing one. Suppose the namespace is http://example.com/purchase-order-ns and you want to use the tag name PurchaseOrder to represent purchase orders. In RDF it's conventional to use tag names with an upper case first letter to represent resources. The minimal amount of RDF to represent this single resource would be:
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:po="http://example.com/purchase-order-ns">
  <po:PurchaseOrder>
  </po:PurchaseOrder>
</rdf:RDF>
The next step is to think about its properties. In RDF, properties of a resource can be added to your XML either as attributes of the resource or as child elements. The convention in RDF is to use tag or attribute names with a lower case first letter for properties. You need to decide whether each property has a simple text value or is better off having a resource as its value. A property must have one value. In XML terms this means the property element must contain either text or a single child element.
In our example we might decide that a purchase order has an order number, which is a text string:
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:po="http://example.com/purchase-order-ns">
  <po:PurchaseOrder>
    <po:orderNumber>123456</po:orderNumber>
  </po:PurchaseOrder>
</rdf:RDF>
We want to associate the customer with this purchase order so we introduce a raisedBy property. There is value to be had by providing more detail about a customer than just the name so we invent a new resource called Customer. Our RDF becomes:
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:po="http://example.com/purchase-order-ns">
  <po:PurchaseOrder>
    <po:orderNumber>123456</po:orderNumber>
    <po:raisedBy>
      <po:Customer>
      </po:Customer>
    </po:raisedBy>
  </po:PurchaseOrder>
</rdf:RDF>
Now we can think about properties that describe a customer such as name and address. Some of these might be further resources with their own properties and so on. This alternation of resource and property is characteristic of RDF and is called striping. If you stick to the convention of upper case first letters for resources and lower case first letters for their properties then it's easy to see where you are in the alternation sequence.
Adding in modules that other people have developed works in the same way. Modules will make new resources and properties available that can be added to your document in appropriate places. In our example, we might use an address module and a product module to provide additional detail in our document. Our example might end up looking like this:
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:po="http://example.com/purchase-order-ns"
         xmlns:addr="http://example.com/address-ns"
         xmlns:prod="http://example.com/product-ns">
  <po:PurchaseOrder>
    <po:orderNumber>123456</po:orderNumber>
    <po:raisedBy>
      <po:Customer>
        <po:name>Wild Widgets Inc.</po:name>
        <po:customerNumber>1447389</po:customerNumber>
      </po:Customer>
    </po:raisedBy>
    <po:customerRef>XS31444</po:customerRef>
    <po:shipTo>
      <addr:StreetAddress>
        <addr:number>1421</addr:number>
        <addr:street>Plane Avenue</addr:street>
        <addr:town>Anytown</addr:town>
      </addr:StreetAddress>
    </po:shipTo>
    <po:lineItem>
      <po:Item>
        <prod:code>TYW-65523-GB</prod:code>
        <prod:color>Blue</prod:color>
        <po:quantity>15</po:quantity>
      </po:Item>
    </po:lineItem>
  </po:PurchaseOrder>
</rdf:RDF>
That's all there is to getting started with simple RDF and it's certainly not a lot harder than writing your first plain XML document. More advanced RDF allows you to cross reference things like a customer's address and the shipping address and lets you use collections of resources to describe lists of things.
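Because documents built this way always alternate between resource and property, one generic routine can turn any of them into triples. The following minimal Python sketch uses the standard library's xml.etree.ElementTree module. It handles only the striped subset described in the steps above (no property attributes, no rdf:about identifiers); the function names and the generated labels such as _:b1 are assumptions made for illustration, and a real RDF parser does considerably more.

```python
import itertools
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

DOC = """\
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:po="http://example.com/purchase-order-ns">
  <po:PurchaseOrder>
    <po:orderNumber>123456</po:orderNumber>
    <po:raisedBy>
      <po:Customer>
        <po:name>Wild Widgets Inc.</po:name>
      </po:Customer>
    </po:raisedBy>
  </po:PurchaseOrder>
</rdf:RDF>
"""


def uri(tag):
    """Turn ElementTree's '{namespace}local' tag into namespace + local."""
    namespace, local = tag[1:].split("}", 1)
    return namespace + local


def striped_triples(xml_text):
    """Return the triples in a striped RDF/XML document (simplified subset)."""
    counter = itertools.count(1)
    triples = []

    def visit_resource(element):
        # Resources carry no identifier here, so each gets a generated label.
        node = "_:b%d" % next(counter)
        triples.append((node, RDF_NS + "type", uri(element.tag)))
        for prop in element:
            children = list(prop)
            if children:
                # The property's value is a nested resource.
                value = visit_resource(children[0])
            else:
                # The property's value is a literal.
                value = (prop.text or "").strip()
            triples.append((node, uri(prop.tag), value))
        return node

    for resource in ET.fromstring(xml_text):
        visit_resource(resource)
    return triples


triples = striped_triples(DOC)
for triple in triples:
    print(triple)
```

The same function works unchanged on a document that mixes in the address and product modules, because it relies only on the alternation of resources and properties and never on specific tag names. Note that RDF forms a property's URI by concatenating the namespace and the local name, so the example namespace yields http://example.com/purchase-order-nsorderNumber; for that reason namespaces intended for RDF usually end in # or /.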
A fable
Sam and Kerry work in the IT departments of two different companies. Each company has recently implemented electronic document exchange with its business partners. Sam's company has selected XML as a technology for marking up the various billing documents. Kerry's company has also settled on XML for its documents, but has taken the further decision to require these documents to be valid RDF as well.
Sam plans to parse the documents as they arrive and store the information in a relational database for future processing. Sam will create tables to represent the various information items contained in the documents - customers, purchase orders, invoices, line items and products.
Kerry also plans to parse each document as it arrives, but because the documents are RDF, the information can be stored in a triple store, which may be implemented in a relational database. Kerry doesn't need to define the tables to store the information items because the RDF model dictates how the data is to be represented.
Sam starts work on reading and querying purchase order documents. Someone else in the company has created an XML schema that can be used to validate the documents as they arrive. It also defines the structure of the expected documents, which helps Sam write a purchase order SAX reader. The schema tells Sam which elements can contain which other elements and what order to expect them in. A lot of Sam's time is spent writing code that stores the context of the current element in the SAX stream so the various prices can be disambiguated. Another large chunk of time is spent writing SQL to store the various information items to the correct tables in the database. Once this work has been completed, Sam can move onto writing more SQL to query the database in order to produce a list of products that the customer wants to buy.
Kerry, coincidentally, chooses to work on the purchase orders first too. Since the documents are expected to be RDF, no schema needs to be created and Kerry can get straight to work on the parsing. Kerry selects an off-the-shelf RDF parser and writes some Java to interface it to the triple store that the company has chosen to use. After a short time, Kerry starts work on the query interface. Here Kerry has to choose between several competing query languages and picks one that has a SQL-like syntax and feels comfortable to use. Kerry writes the product list query for the user interface.
The next week, Sam is presented with the XML schema for the invoices the company plans to send out. Sam reads through the schema and notes that there are small inconsistencies with the purchase order schema. For example, this schema uses attributes for the customer number whereas the previous schema used elements. Sam realises that there isn't going to be much reuse of the code that was written for parsing the purchase orders. In fact, Sam ends up writing a completely new SAX reader for the invoice format, plus another chunk of invoice-specific SQL and database code. Sam also discovers that products can have discounts for bulk orders and has to change the database schema to account for this. Sam ends up re-writing the purchase order query as well to ensure that the prices are being pulled from the new table.
Kerry, on the other hand, has also started work on processing invoices. The RDF parser is already hooked up to the triple store so there's no additional work to do to parse the document. Kerry writes a query to produce the list of invoiced items for the billing application.
On Friday afternoon, the XML schema for pick lists arrives in Sam's inbox. The email mentions that both the invoice and the purchase order schemas have been changed to use the new product module that has been developed to standardise the elements being used across the various schemas. Sam looks wistfully out of the window to the office across the street where Kerry, having already written the pick list query and changed the invoice and purchase order queries, is getting ready to hit the town with some friends from the billing department…
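The context tracking that eats so much of Sam's time can be sketched with Python's standard xml.sax module. The document and element names below are hypothetical; the point is that a SAX handler sees each price element in isolation and must keep its own stack of open elements to tell one kind of price from another.

```python
import xml.sax

# Hypothetical document in which <price> means different things
# depending on where it appears.
DOC = b"""<order>
  <item><price>9.99</price></item>
  <shipping><price>4.50</price></shipping>
</order>"""


class PriceHandler(xml.sax.ContentHandler):
    """Collect prices keyed by the element that contains them."""

    def __init__(self):
        super().__init__()
        self.stack = []    # names of the currently open elements
        self.text = []     # character data seen since the last start tag
        self.prices = {}   # parent element name -> list of prices

    def startElement(self, name, attrs):
        self.stack.append(name)
        self.text = []

    def characters(self, content):
        self.text.append(content)

    def endElement(self, name):
        self.stack.pop()
        if name == "price":
            # After the pop, the top of the stack is the price's parent.
            parent = self.stack[-1]
            self.prices.setdefault(parent, []).append("".join(self.text).strip())


handler = PriceHandler()
xml.sax.parseString(DOC, handler)
print(handler.prices)
```

Every new schema, and every change to an existing one, means revisiting handlers like this one, which is the maintenance burden the fable describes.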
Finding out more
The definitive source for information about RDF and the place to find the formal specifications is the W3C's web site: http://www.w3.org/RDF/
Dave Beckett maintains a comprehensive list of RDF resources and tools: http://www.ilrt.bris.ac.uk/discovery/rdf/resources/
About the Author
Ian Davis is a British developer, based in central England. He is a co-founder of and contributor to Semantic Planet, a semantic web advocacy website, which can be found at http://www.semanticplanet.com. Ian's weblog is at http://InternetAlchemy.org. Thanks to Danny Ayers who provided valuable feedback, although any errors are entirely my own.
Copyright
This article is copyright Ian Davis 2003. Permission is granted to reproduce this document in its entirety so long as this copyright message is preserved and a link to the original article is provided.