The trouble with statistics

Reusers of open government data want simple, standard, easy-to-use formats that give rapid access to the relevant data. So the frustration of being presented with Excel spreadsheets that are formatted for display rather than reuse is understandable.

Simple CSV downloads are preferred by many data users, although the format isn’t useful for everyone. Not all users are developers: many people would be very happy to open an Excel or OpenOffice document and immediately start working with a tidy spreadsheet.

For many datasets, creating downloads of simple tabular data is straightforward. But there are some specific challenges relating to statistical data that I want to highlight in this post.

The first thing to recognise about statistical data is that it’s rarely just a simple table or a time series. Statistics are often structured as a “data cube”.

If you’re not familiar with the idea of a statistical data cube, there’s a fantastic description, complete with diagrams, on the Scottish Statistics website.

Presenting a data cube as a tabular CSV file requires some thought.
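
To make that concrete, here’s a minimal sketch in Python of one common approach: flattening the cube so that each row is a single observation and each dimension gets its own column. The areas, offences and figures below are invented purely for illustration; they aren’t taken from any real table.

```python
# A minimal sketch of flattening a small, invented data cube into a tidy CSV:
# one row per observation, one column per dimension.
import csv

# Three dimensions (area, offence, period) and one measure (count).
observations = [
    {"area": "Norfolk",   "offence": "burglary", "period": "2016-Q3", "count": 1200},
    {"area": "Norfolk",   "offence": "robbery",  "period": "2016-Q3", "count": 150},
    {"area": "Wiltshire", "offence": "burglary", "period": "2016-Q3", "count": 900},
    {"area": "Wiltshire", "offence": "robbery",  "period": "2016-Q3", "count": 80},
]

with open("crime-cube.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["area", "offence", "period", "count"])
    writer.writeheader()
    writer.writerows(observations)
```

A display-oriented spreadsheet would instead pivot one or more of those dimensions into column headings and merged cells, which is precisely what makes it awkward to reuse.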

The second area of complexity arises when we need to start describing parts of a data cube.

Any part of a data cube might need to be annotated with notes, quality metrics or other attributes that present the context necessary to properly understand the statistics and any limitations they may have.

An example might help illustrate the different types of annotation and how they’re currently surfaced to users. Go and download the Crime in England and Wales: Police Force Area Data Tables (year ending Sept 2016) spreadsheet and take a look at Table P2. That table alone illustrates four different types of annotation:

  • dataset – the title for Table P2 has two footnotes that provide some background on the whole table
  • dimension – in the same table, the column “Total recorded crime (excluding fraud)” has a footnote discussing the current practice of reporting data against that dimension
  • observation – some values in Table P2 have been replaced with a data marker (“-”); a footnote explains that in this dataset this means “the base number is fewer than 50”
  • “slice” – footnotes associated with the Norfolk, Suffolk and Wiltshire rows in the table explain that there are quality issues associated with all of the observed values reported against those areas in this dataset

So we can see that when dealing with a statistical data cube, we might need to add context to the whole data cube, the dimensions that define the cube, individual observations or even sets of observations (“slices”).

For more detail, Full Fact has produced a list of the types of caveat that might apply to statistical data. When publishing a statistical dataset, it’s essential to provide all of this context.

It can be very difficult to collapse all of those annotations into a nice tidy CSV file. Notes and quality markers that apply to individual observations are easy to handle: they can simply be included as extra columns. But the other types of annotation require different approaches.
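
Here is a sketch of that “separate column” approach for observation-level annotations: rather than overwriting a value with a marker like “-”, the value column stays numeric and the marker goes in its own column. The column names and the suppression wording are illustrative, not copied from the ONS tables.

```python
# Observation-level annotations as extra columns: the count stays numeric (or
# blank when suppressed) and the data marker is recorded in its own column.
import csv

rows = [
    {"area": "Norfolk",   "offence": "robbery", "count": 150, "obs_status": ""},
    {"area": "Wiltshire", "offence": "robbery", "count": "",  "obs_status": "suppressed: base fewer than 50"},
]

with open("crime-annotated.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["area", "offence", "count", "obs_status"])
    writer.writeheader()
    writer.writerows(rows)
```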

One approach that helps with the remaining types is to give every element of a data cube a unique identifier, such as a URI, so that more information can be attached to it. This is why making data part of the web is one of the open data principles.
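
As a purely hypothetical illustration (the domain and path structure below are invented), identifiers for the different elements of a cube might look something like this:

```python
# Hypothetical URI patterns: every element of the cube gets an identifier that
# annotations (footnotes, quality notes, caveats) can point at.
BASE = "http://example.org/statistics/crime-2016"

dataset_uri     = BASE
dimension_uri   = f"{BASE}/dimension/total-recorded-crime-excluding-fraud"
slice_uri       = f"{BASE}/slice/area/norfolk"
observation_uri = f"{BASE}/observation/norfolk/robbery/2016-Q3"
```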

The recently published CSV on the Web (CSVW) standard also allows annotations to be attached to tables, columns and ranges of cells in a CSV file. This provides a machine-readable equivalent of the footnotes and data markers used in Excel spreadsheets.
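
To give a flavour of what that looks like, here is a rough sketch of a CSVW metadata file for the annotated CSV above, written out as JSON from Python. It shows only a table-level note and a column-level comment; the file names, note text and descriptions are illustrative rather than taken from a real publication.

```python
# A rough sketch of CSVW metadata: a table-level note plus a column-level
# comment attached to the "count" column of the tidy CSV sketched earlier.
import json

metadata = {
    "@context": "http://www.w3.org/ns/csvw",
    "url": "crime-annotated.csv",
    "notes": [
        "Figures are based on police recorded crime and exclude fraud."
    ],
    "tableSchema": {
        "columns": [
            {"name": "area", "titles": "Area"},
            {"name": "offence", "titles": "Offence"},
            {
                "name": "count",
                "titles": "Total recorded crime (excluding fraud)",
                "datatype": "integer",
                "rdfs:comment": "See the footnote on recording practice for this dimension.",
            },
            {"name": "obs_status", "titles": "Observation status"},
        ]
    },
}

with open("crime-annotated.csv-metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)
```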

Publishing statistical data presents some specific challenges, and hopefully this blog post has highlighted a few of them. The good news is that there are approaches and emerging standards that can help us meet them head on.