Some open data publishing principles

This week I’ve started working with the Digital Publishing team at the ONS. They’re currently hard at work on the Data Discovery Alpha exploring how to better support users in finding and accessing datasets.

As our national statistics body and the UK’s largest producer of official statistics, it’s important that the ONS is seen as an exemplar of how to publish high-quality data. Open data from the ONS should be published according to current best practices. The team have asked me to help them think through how these apply to the ONS website.

This is an exciting opportunity and I’m already enjoying getting up to speed with everything that’s happening across the organisation. It’s also a big task, as the ONS publish a lot of different types of data. It’s not just statistics: there are also geographic datasets, for example.

To help frame the work that we’ll be doing I’ve drafted a few high-level principles which I thought I’d share here.

The principles provide an approach for thinking about open data publishing that focuses on outcomes: what do we want to enable users to do?

Importantly, the principles are aligned with the Data on the Web Best Practices, the recommendations in the Open Data Institute’s Open Data Certificates, and the Code of Practice for Official Statistics.

Obviously, implementing all this will also draw on the open principles enshrined in the GDS service manual, such as building on open standards.

  1. Make data discoverable

Datasets need to be discoverable on the ONS website and the team are continuing to put a great deal of effort into that.

But there are various ways in which discovery can happen, and not all of those need be on the ONS website. Users might find data via Google, or via specialised data aggregators and portals.

This means that data needs to have good quality descriptive metadata and be easily indexed by third parties.
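As a rough illustration of what indexable descriptive metadata could look like, here is a minimal sketch that builds a schema.org `Dataset` description as JSON-LD. The choice of schema.org is my assumption for the example rather than a decision the team has made, and the dataset name and `example.org` URL are hypothetical.

```python
import json


def dataset_jsonld(name, description, url, licence_url, publisher):
    """Build a schema.org Dataset description as a JSON-LD dictionary."""
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "license": licence_url,
        "publisher": {"@type": "Organization", "name": publisher},
    }


# Hypothetical dataset name and URL, for illustration only.
metadata = dataset_jsonld(
    name="Example population estimates",
    description="Illustrative mid-year population estimates by area.",
    url="https://example.org/datasets/example-population-estimates",
    licence_url="http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/",
    publisher="Office for National Statistics",
)

# The JSON output could be embedded in a dataset page for crawlers to read.
print(json.dumps(metadata, indent=2))
```

Embedding something like this in each dataset page is one way for search engines and portals to pick up the title, description and licence without scraping the page text.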

  1. Ensure reuse rights are always clear

Data published by the ONS is reusable under the Open Government Licence (OGL). But individual datasets may be derived from data provided by other organisations. This means re-users may need to include additional attribution or copyright statements when reusing the data.

While these requirements are all documented, the rights of re-users, along with any obligations, should be clear at the point of use.

And, as data may be distributed by third parties, those licensing and rights statements should also be machine-readable.
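To make that a bit more concrete, here is a minimal sketch of a rights statement that can be rendered both for people and for machines. The `RightsStatement` class, its field names and the third-party provider are all hypothetical; only the OGL URL and the standard OGL attribution sentence are real.

```python
from dataclasses import dataclass, field

OGL_V3 = "http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/"


@dataclass
class RightsStatement:
    licence_url: str
    attribution_text: str
    # Extra statements required by third-party data providers, if any.
    additional_statements: list[str] = field(default_factory=list)

    def as_dict(self):
        """Machine-readable form, suitable for serialising alongside the data."""
        return {
            "licence": self.licence_url,
            "attribution": self.attribution_text,
            "additionalStatements": list(self.additional_statements),
        }

    def as_text(self):
        """Human-readable form, suitable for showing at the point of use."""
        return "\n".join([self.attribution_text, *self.additional_statements])


rights = RightsStatement(
    licence_url=OGL_V3,
    attribution_text=(
        "Contains public sector information licensed under "
        "the Open Government Licence v3.0."
    ),
    # Hypothetical third-party provider, for illustration only.
    additional_statements=["Contains data supplied by Example Agency."],
)
print(rights.as_text())
```

Keeping both forms generated from the same record means the statement a person reads and the one a portal harvests cannot drift apart.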

  1. Help users cite their sources

Clear attribution statements and stable links can do more than help users fulfil their obligations under the OGL.

Easy ways to reference and link to datasets will encourage users to cite their sources. This provides another route for potential users to discover datasets, by following links to primary sources from analysis, visualisations and applications.

Stable links, clearly labelled citation examples, and supporting metadata can make all of this easier for re-users.
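A clearly labelled citation example could be generated directly from a dataset’s metadata. The sketch below assumes a simple citation layout of my own choosing, not an agreed ONS citation style, and the title and URL are hypothetical.

```python
def format_citation(publisher, year, title, url, accessed):
    """Return a ready-to-copy citation string for a dataset."""
    return (
        f"{publisher} ({year}). {title}. "
        f"Available at: {url} (accessed {accessed})."
    )


# Hypothetical dataset details, for illustration only.
citation = format_citation(
    publisher="Office for National Statistics",
    year=2016,
    title="Example population estimates",
    url="https://example.org/datasets/example-population-estimates",
    accessed="1 March 2016",
)
print(citation)
```

The important part is that the URL in the citation is the stable link, so that anyone following it from a piece of analysis lands on the primary source.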

  1. Always present data in context

Access to data only gets you so far. Deciding whether the data is fit for purpose and the process of turning it into insight requires access to more information.

Documentation about the contents of the dataset, notes on how it was collected and processed, and any known limitations with its quality are all important to deciding when and how a dataset might be used.

Users should be able to easily find and access this contextual information. Where possible it should be packaged with the dataset to support downloading and redistribution.
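One possible way of packaging context with the data is a single archive holding the data file, a plain-text README and a metadata file. This is only a sketch: the file names and the `build_package` helper are hypothetical, and a real implementation might follow an existing packaging convention instead.

```python
import io
import json
import zipfile


def build_package(data_csv, readme_text, metadata):
    """Bundle a dataset with its documentation into one zip archive (as bytes)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("data.csv", data_csv)
        archive.writestr("README.txt", readme_text)
        archive.writestr("metadata.json", json.dumps(metadata, indent=2))
    return buffer.getvalue()


# Illustrative content only.
package = build_package(
    data_csv="area,period,value\nAREA1,2015,100\n",
    readme_text="How the data was collected, processed, and its known limitations.",
    metadata={"title": "Example dataset", "quality_notes": "Figures are illustrative."},
)

with zipfile.ZipFile(io.BytesIO(package)) as archive:
    print(archive.namelist())
```

Because the documentation travels inside the download, it is still there when the file is passed on or republished somewhere else.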

  1. Make datasets legible

Statistical datasets can be very complex. They can include multiple dimensions and use complex hierarchical coding schemes. Terms used in the data may have specific statistical definitions that are at odds with their use in common language. Individual data points may even have annotations and notes, for example, to mark provisional or revised figures.

This information needs to be as readily accessible as the data itself. This makes it easier for re-users to understand and correctly interpret the data. Ideally definitions of standard attributes, dimensions and measures should all be independently available and accessible, especially where these are reused across datasets.
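As a small sketch of what this might mean in practice, the example below keeps the definitions of data markers separate from the observations, so that they can be looked up and reused across datasets. The `Observation` structure, the marker codes and the area codes are hypothetical and are not taken from any ONS data model.

```python
from dataclasses import dataclass
from typing import Optional

# Shared, independently published definitions for data markers (hypothetical codes).
MARKER_DEFINITIONS = {"p": "provisional", "r": "revised"}


@dataclass(frozen=True)
class Observation:
    area_code: str
    period: str
    value: float
    marker: Optional[str] = None  # key into MARKER_DEFINITIONS, if annotated


def describe(obs):
    """Render an observation with its annotation spelled out in plain words."""
    note = MARKER_DEFINITIONS.get(obs.marker)
    base = f"{obs.area_code} {obs.period}: {obs.value}"
    return f"{base} ({note})" if note else base


print(describe(Observation("AREA1", "2015", 100.0, "p")))
print(describe(Observation("AREA2", "2015", 250.0)))
```

The same pattern would apply to dimensions and measures: the data carries a short code, and the definition behind it is published once and can be found on its own.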

  1. Data should be useful for everyone

Open formats and standards ensure that data can be used by anyone, without requiring proprietary software or systems. But there is no single approach to consuming and reusing data. Treating data as infrastructure means recognising that there are a range of communities interested in that data and they have different needs.

Supporting these user needs may require presenting a choice of formats and data access options. Some users will want customised downloads while others may want to automatically access data in bulk or via APIs.

The GDS registers framework is a good example of a system that supports multiple ways to access, use and share the same core data.
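The idea of offering the same core data in more than one open format can be sketched very simply. The rows and the `to_csv` and `to_json` helpers below are hypothetical, and this is not how the registers framework itself is implemented.

```python
import csv
import io
import json

# Illustrative rows only.
ROWS = [
    {"area": "AREA1", "period": "2015", "value": 100},
    {"area": "AREA2", "period": "2015", "value": 250},
]


def to_csv(rows):
    """Serialise rows as CSV, for spreadsheet users and bulk downloads."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["area", "period", "value"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def to_json(rows):
    """Serialise the same rows as JSON, for developers and API clients."""
    return json.dumps(rows, indent=2)


print(to_csv(ROWS))
print(to_json(ROWS))
```

Both outputs come from one source of truth, so adding another format later is a matter of adding a serialiser rather than maintaining a second copy of the data.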

  1. Make data part of the web

Hopefully, as the other principles make clear, a dataset doesn’t stand alone. There’s a whole collection of supporting documentation, definitions and metadata that helps to describe it. And, surrounding that, are all of the other outputs of the ONS: the bulletins, visualisations and other commentary that threads together multiple datasets.

Regardless of the technology used to manage and publish data, everything that a user needs to refer to or share should have a place on the web.

Collectively, these principles should give us a framework that will guide the work carried out on the alpha and beyond. Over the coming weeks I’ll be turning these principles into suggestions and recommendations for how to manage and publish open data as part of the ONS website.

If you’ve got feedback or comments then I’d love to hear from you!