What’s wrong with the way we publish data?
Most users come to our site with a question, for which they believe some data is the answer. The best way to answer this question could be written content, simple charts or tables, an interactive, or more detailed datasets. Whilst most parts of the website and our content have been improved over time, we haven’t always iterated on the way we publish the data in the same way – either to take advantage of new technologies or, more importantly, to meet our users’ changing expectations of how they want to interact with and use this data.
With a few exceptions, all the data we publish on ons.gov.uk takes the form of Excel spreadsheets. I am a big fan of Excel, and it will likely always play an important role in how we make our data available, but there are pros and cons to it being our primary, and in most cases only, publishing format.
Let’s start with the good. Excel is an almost endlessly flexible tool, allowing us to manipulate the formatting to suit the varied requirements of the (often complex) data we produce. It is also a tool that the majority of our users are familiar with, and it will often be the format they use for their own ongoing analysis.
Ok, so the bad. For the most part this is the same reason as the good: the flexibility. This causes massive inconsistencies in how we publish different data. Whilst some of these differences are there for sensible reasons, they can often be caused by different takes on the best way to solve the same problem. This can cause difficulties for users, as they have to learn different patterns depending on what data they are looking at and don’t know what to expect when they click ‘download’.
The other main ‘bad’ is again a bit of a double-edged sword. Publishing in Excel has allowed our published datasets to grow organically over time, and additional data to be easily added that is loosely ‘related’ but not part of the same data structure. Whilst individually this can sometimes help users, it makes the overall dataset difficult to describe and is ultimately detrimental to users finding the data they require.
Customise my Data (CMD)
CMD aims to address these problems and offer a digital-first approach that meets more of our user needs and allows us to do much more to get users to the data they need. It has been in public beta for a number of months, having proved that it meets user needs at the Discovery and Alpha stages.
What CMD aims to offer users
From our user research we know the number one issue faced by users is not understanding what is contained in any given dataset. CMD allows us to hold and present a lot more metadata, much of it derived from the data itself, to help users decide whether a dataset is likely to meet their needs. In particular, laying out the dimensions contained within a dataset has tested really well and meant users are much more confident they are in the right place.
Filter and download the data
Whilst there are many users that will just want to take away all the data we produce in bulk, there are many users that only want a small subset of any dataset. Many users also want this same selection of data month-on-month. The filter journey within CMD has been designed to support both these needs and allows us to apply different ways of filtering to different types of dimensions, for example time or a hierarchy.
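To make the idea of filtering by dimension concrete, here is a minimal sketch in Python. It is not how CMD is implemented; the row layout, the dimension names and the values are all made up for illustration. It shows a selection expressed as a set of allowed options per dimension, plus a time helper that picks the latest periods so the same selection can be reapplied when new data arrives.

```python
from typing import Dict, List, Set

# Hypothetical tidy data: one row per observation, one key per dimension.
# The dimension names and values are illustrative only.
OBSERVATIONS: List[dict] = [
    {"time": "2018-01", "geography": "K02000001", "aggregate": "food", "value": 104.2},
    {"time": "2018-02", "geography": "K02000001", "aggregate": "food", "value": 104.5},
    {"time": "2018-02", "geography": "K02000001", "aggregate": "transport", "value": 106.1},
    {"time": "2018-02", "geography": "E92000001", "aggregate": "food", "value": 104.4},
]


def latest_periods(rows: List[dict], count: int) -> Set[str]:
    """Return the most recent `count` time labels.

    Assumes labels such as "2018-02" that sort chronologically as text.
    """
    periods = sorted({row["time"] for row in rows})
    return set(periods[-count:])


def filter_observations(rows: List[dict], selections: Dict[str, Set[str]]) -> List[dict]:
    """Keep rows that match every selected dimension.

    A dimension that is missing from `selections` is left unrestricted.
    """
    return [
        row
        for row in rows
        if all(row[dimension] in allowed for dimension, allowed in selections.items())
    ]


# A saved selection: one geography, all aggregates, latest month only.
selection = {
    "geography": {"K02000001"},
    "time": latest_periods(OBSERVATIONS, 1),
}
subset = filter_observations(OBSERVATIONS, selection)
```

Because the time selection is resolved against whatever data is present, running the same selection after a new month is added would return the new month without the user rebuilding the filter.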
Find the data
Helping users find the right dataset when most of the information is locked in an Excel file has proven to be incredibly difficult. Whilst we can add all the information from the file into our search, there is no structure to allow us to prioritise certain types of information over others. For example, you would want to rank a match in the title higher than a match in the footnotes.
With CMD we hold the data in a structured format, which means we can be much more selective about how we index the content and the information we present to users in search results.
We are also working to add a ‘Browse by area’ journey to the site. At first this will support users in finding a location and provide all the datasets that use that location, but ultimately we will layer in geographic information, such as boundaries, as well.
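The title-versus-footnotes point can be sketched as a simple field-weighted score. This is only an illustration of the principle, not our search implementation: the field names, the weights and the sample records are all assumptions.

```python
from typing import Dict, List

# Assumed weights: a match in the title counts for far more than one in the footnotes.
FIELD_WEIGHTS: Dict[str, float] = {"title": 5.0, "description": 2.0, "footnotes": 0.5}


def score(record: Dict[str, str], query: str) -> float:
    """Sum weighted counts of query terms across the structured fields."""
    terms = query.lower().split()
    total = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        words = record.get(field, "").lower().split()
        total += weight * sum(words.count(term) for term in terms)
    return total


def search(records: List[Dict[str, str]], query: str) -> List[Dict[str, str]]:
    """Return matching records, highest score first; non-matches are dropped."""
    scored = [(score(record, query), record) for record in records]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [record for value, record in ranked if value > 0]


# Hypothetical records, written without punctuation to keep the matching simple.
RECORDS = [
    {"title": "Labour market overview",
     "footnotes": "Earnings figures deflated using consumer price inflation"},
    {"title": "Consumer price inflation", "footnotes": ""},
    {"title": "Population estimates", "footnotes": ""},
]

results = search(RECORDS, "inflation")
```

With everything in a single spreadsheet there is only one block of text to match against, so both of the first two records would look equally relevant. Holding the fields separately is what makes the weighting possible.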
Machines have needs
I can’t finish without talking about the API, as this has been my primary focus through much of CMD. For anyone not familiar with the term, an API (application programming interface) is a way of providing a view on the data that allows users easy access to the information via programmatic means.
One of the most useful elements of the CMD API is that you can get observations directly from the API without needing to sign up. This makes the API much more open and lowers the technical barrier to reuse. This simple dashboard demonstrates a small subset of what this makes possible.
If you are interested in having a play or seeing the API, check out our developer site. If you want to have a chat about the API or any other part of CMD or the ONS website, let me know @robchamberspfc or firstname.lastname@example.org.
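As a rough sketch of what asking for observations might look like from Python, the snippet below builds a request URL and fetches it using only the standard library. The base URL, the path layout, the dataset ID, the edition and the dimension names are assumptions for illustration; the developer site is the place to check the real values. The one point taken from above is that no key or sign-up step is involved.

```python
import json
from typing import Dict
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base URL and path layout; check the developer site for the real ones.
BASE_URL = "https://api.beta.ons.gov.uk/v1"


def observations_url(dataset_id: str, edition: str, version: int,
                     dimensions: Dict[str, str]) -> str:
    """Build an observations URL with one query parameter per dimension."""
    path = (f"{BASE_URL}/datasets/{dataset_id}/editions/{edition}"
            f"/versions/{version}/observations")
    return f"{path}?{urlencode(dimensions)}"


def fetch_observations(url: str) -> dict:
    """Fetch and decode the JSON response. No API key or sign-up is sent."""
    with urlopen(url) as response:
        return json.load(response)


# Hypothetical dataset ID, edition and dimension options.
url = observations_url(
    "cpih01", "time-series", 1,
    {"time": "Jan-18", "geography": "K02000001"},
)
# data = fetch_observations(url)  # uncomment to make the network request
```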
4 comments on “What’s wrong with the way we publish data?”
what is API?
Hi Jane, here are a couple of blog links that explain it better than I could. Hope these help. Rob
I’m desperate to use your API to download data to be used programmatically as I find the format of your usual spreadsheets to be somewhat counterproductive (who on earth would format this year as “2,018”?)
Disappointingly, I see the API only has a very limited number of datasets available. When is this likely to change?
I’d like to use the new household projections that you released today and it would be great if I could use the API – I’d be happy to beta test it for you.
I can appreciate your frustration. We are working to get more data added, but we mostly have to transform it from the published tables, so it is a time-consuming process. The household projections are not ones we have looked at so far, but I have passed your comments on to our data team, who are going to have a look.