Introducing our pilot Tabular Data Package

I hear the term ‘open’ all the time – open source, open data, open by default. This post continues in that vein and sets out what ONS is doing around open data. To start, I am going to give you a bit of a history lesson.

Some years ago ONS made the decision to stop producing hard-copy books and start releasing electronic products that our users could download free of charge from our new website. That was the start of us producing open data at its most basic. ONS were open data pioneers, light years ahead of the pack – and we didn’t even know it. In our early years we developed online tools such as DataBank, TimeZone and StatBase, which evolved into the current Time Series Data functionality. We moved [slowly] away from PDF formats and started releasing more Excel versions of our tables, making it easier for users to handle our data. Over the years we expanded our web estate with the introduction of the NeSS and NOMIS websites, both of which offer APIs and the ability to customise and view data online.

More recently ONS has released the Open Geography Portal and the beta versions of our ONS Data Explorer tool and Open API too. We will talk about those in a future update.

In summary we’ve done a huge amount over the years and we are not about to stop there!

It has become increasingly clear to those of us involved in open data that the community thinks what organisations have done to get more and more data out there is great, but that it is being let down by its lack of structure. A few of us recently attended an Open Knowledge Festival and heard the same message from the community there: give us data in formats that we can use.

The common denominator right now seems to be the good old CSV format. That is great! I can just open my Excel file and do a Save As… CSV! Well, no, not really. What most of us think of as a CSV is not really a CSV, at least not in the sense that a developer needs. In its most basic form a CSV is a structured data file that contains a single header row and one row per observation. In essence it is a flat file, and it must contain no additional metadata.
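To make the "single header row, one row per observation" idea concrete, here is a minimal Python sketch of the kind of check a developer might run on a file. The function name `check_flat_csv` and the sample values are made up for illustration; this is not an ONS tool, and a real validator would check a good deal more.

```python
import csv
import io


def check_flat_csv(text):
    """Return a list of problems found; an empty list means the text looks flat."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ["file is empty"]

    problems = []
    header = rows[0]
    # Every column needs a non-empty, unique name in the single header row.
    if any(not name.strip() for name in header):
        problems.append("header contains a blank column name")
    if len(set(header)) != len(header):
        problems.append("header contains duplicate column names")

    # Every later row is one observation with exactly one value per column.
    for row_number, row in enumerate(rows[1:], start=2):
        if len(row) != len(header):
            problems.append(
                f"row {row_number} has {len(row)} fields, expected {len(header)}"
            )
    return problems


# Hypothetical sample values, for illustration only.
flat = "region,year,value\nA,2012,1.5\nB,2012,2.0\n"
# A typical spreadsheet export: a title line and a blank line above the header.
spreadsheet_style = "Table 1: Example title\n\nregion,year,value\nA,2012,1.5\n"

print(check_flat_csv(flat))
print(check_flat_csv(spreadsheet_style))
```

The first sample passes with no problems. The second is what a Save As… CSV from a formatted table tends to look like: the title is read as the header, so every row after it is reported as having the wrong number of fields.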

I can hear the scratching of heads here in ONS at the challenges this will bring: How do we explain what the data is? What about footnotes in tables? No background notes? Thankfully we will not have to start releasing data without metadata. There are open products out there that will help us do both. Over the last few months the open datasets team in ONS has been looking at formats that might help us bridge the gap between having structured outputs and still providing the metadata to explain them.

There are a few products that have been developed in the open data community. As these are developed by the open community they are free for us to adopt and use. We decided to have a go at using a particular open product – the Tabular Data Package. The specification is maintained by the Open Knowledge Foundation (OKFN).

So what is a data package? It is a set of files that offers users data in a true CSV form and the metadata in a technical file format. The numbers are provided in a data.csv file (there can be more than one data file), we set out the structure for all those numbers in a data.json file, and we provide the metadata in a README.html file. Why do we want to replace one data table with three files? Remember this format is for the data geeks and they prefer it this way; it makes things consistent and therefore easier to automate, so that machines do the hard work. The good news is that the work we are already doing on open datasets for our Open API and Data Explorer allows us to move into this space quite easily, so we decided to give the data package a go.
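For the data geeks, here is a rough Python sketch of why the split into files helps automation: the JSON file tells a program what type each column is, so the CSV values can be converted without any guesswork. The descriptor layout below (`resources`, `schema`, `fields`) follows the general shape of the public Tabular Data Package specification as I understand it; the package name, column names and values are hypothetical and are not the contents of our pilot package.

```python
import csv
import io
import json

# Casts for a few common field types; anything else is left as text.
CASTS = {"integer": int, "number": float, "string": str}


def load_resource(descriptor, csv_text, resource_index=0):
    """Read CSV text into dicts, casting each value using the descriptor's schema."""
    fields = descriptor["resources"][resource_index]["schema"]["fields"]
    types = {field["name"]: field.get("type", "string") for field in fields}

    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        record = {}
        for name, value in row.items():
            cast = CASTS.get(types.get(name, "string"), str)
            record[name] = cast(value)
        records.append(record)
    return records


# Hypothetical descriptor and data, not the contents of the ONS pilot package.
descriptor = json.loads("""
{
  "name": "example-package",
  "resources": [
    {
      "path": "data.csv",
      "schema": {
        "fields": [
          {"name": "region", "type": "string"},
          {"name": "year", "type": "integer"},
          {"name": "value", "type": "number"}
        ]
      }
    }
  ]
}
""")
csv_text = "region,year,value\nA,2012,1.5\nB,2012,2.0\n"

for record in load_resource(descriptor, csv_text):
    print(record)
```

The sketch assumes the CSV is already flat, with every row complete. In a real package the descriptor and the CSV would be read from the files on disk rather than from strings.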

Over the last few months we have been working with the Personal Well-being team to create a pilot Tabular Data Package. Our first package contains data from the Measuring National Well-being Programme, using Personal Well-being data collected between April 2012 and March 2013 on the Annual Population Survey. We have released this package on GitHub, which is a platform for the open source community to share ideas and code. The pilot is intended for demonstration purposes only. We want to find out whether providing data in this format can be of value to users.

Early feedback from the co-founder of the OKFN, Rufus Pollock, has been extremely encouraging and positive. We hope to get more feedback over the coming weeks before reviewing where we go from here. Is it a valuable format? Is it a format that can encourage innovators to use it and produce services or applications? Can we build these easily and with little effort? These are all questions that need to be addressed as we move forward, and we are quite prepared for the answer to all of them to be no. The beauty of what we have done so far is that we can just stop, throw it away and try something different.

I can’t remember who said “the best things that will be done with your data will be done by someone else”, but I think it’s true, and the availability of these types of machine-readable formats will facilitate better tools and services in the future.

If you are interested go check it out. There are free tools that can help you view these formats.
