The Digital Publishing technology stack – Part 2: Our backend systems

In part one we looked at our hosting platform and how we build, deploy and manage services in our test and production environments.

In part two we’ll look at our backend systems – how we import, store, publish and export data.

Part 2: Our backend systems

Our backend systems are a collection of data processing pipelines and APIs – built with a mixture of Go and Java, and a little bit of Python. This is probably the stack which contains the most technical debt, and it’s also the stack which is the most difficult to improve without unpredictable or unintended consequences. But it’s high up our list of priorities to fix, which we’re doing a bit at a time as we introduce new features.

Most of our backend code – at least, the bits used for publishing written content and static datasets to the ONS website – lives in a service known as Zebedee. It’s a Java monolith which provides the content APIs used by Florence and the public-facing website, handles all user and team management, permissions, encryption, scheduling and manages most of the publishing process. It also does a load of other stuff – some of it we know about, some of it we don’t (and we always find out more at the worst possible time). It’s built on top of the same framework as Babbage, and stores everything on disk as JSON files (only audit data is written to a database).

There are two other services which form part of our publishing process:

  • The Train – the service used to publish content from our CMS to the public facing website, which receives files over HTTP and copies them around on disk
  • Project Brian – converts CSDB files (our proprietary, and undocumented, time series data format) into Excel and CSV files for publication

Much of the work we’ve done in these services since the website launched has been maintenance, bug fixing and refactoring to help us understand how and why things are failing and to improve reliability and performance. After the launch of the website we were sometimes failing to publish content for unknown reasons, or suffering service outages. Over the last two years we’ve uncovered and fixed a variety of serious design flaws and bugs in the way these services work, and both the website and publishing system are now much more robust.

Our challenges

Over the past two years the demands on our service have increased. Internally, we have more users publishing more content, and our expectations of availability and performance have risen; externally, visitor numbers have grown considerably. We’re also publishing more complex content – for example, data visualisations often include images, maps, data, lookup tables, JavaScript and CSS.

Because of the publishing model we use (encrypted files on disk, which are decrypted and copied to multiple frontend servers at publish time), we have a relatively low limit on how much content we can publish at any one time. The increase in demand has meant we’re reaching the limit of what our current design can handle.

We’re required by T3.6 of the Code of Practice for Statistics to publish statistical releases at 9:30am. A combination of factors has meant we can’t start the publishing process until 9:30:00am (at which point all of the content is still only available through our CMS, and encrypted on disk, in a different subnet to the web frontend), whilst requiring that we finish the publishing process by 9:30:59am (at which point the content must be decrypted, available on all frontend web servers, and available through search and our release calendar). This 60-second window coincides with our peak traffic for the day – just as we’re trying to publish data, all of our users are trying to download it.

What we’ve been doing

We’ve taken some steps to improve things, but ultimately we need to use a different model for storing and publishing data which doesn’t require intensive CPU and I/O activity at the same time as our inbound traffic peaks, and which doesn’t require a publishing process that scales linearly with the volume of data being published.

As part of building the “Filter a dataset” service, we’ve taken a completely different approach to publishing content. We have a set of backend services which are responsible for the import of datasets into Neo4j (soon to be AWS Neptune), and another set responsible for the filtering and export of data (to Excel and CSV and stored in Amazon S3). Metadata is stored in MongoDB, and access to the data is controlled by a set of APIs. This makes it instantly available at publish time without moving files around on disk and across the network.

The data is still encrypted before publication, so we still have some work to do as part of the publishing process – but we minimise the impact by decrypting the data as soon as possible after it’s published (which is typically started within a few seconds), and we’ll decrypt on-demand for any requests before that process has finished. This will increase CPU usage as traffic increases, but in a way we can scale horizontally and which has minimal impact on our backend databases.

We’ve also moved towards an event-driven model where anything which doesn’t need an immediate response happens asynchronously via Kafka, significantly improving our ability to handle spikes in demand on either publishing or web traffic. The entire import process and export process happens that way, and synchronous API calls are only used when absolutely necessary.
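The asynchronous hand-off at the heart of that model can be illustrated without a broker. In this sketch a buffered Go channel stands in for a Kafka topic, and the event type and dataset IDs are made up – the point is only that the producer (say, an API handler) enqueues an event and returns immediately, while a consumer does the slow work at its own pace.

```go
package main

import "fmt"

// ImportRequested is a hypothetical event; the real topic names and
// message schemas aren't documented in the post.
type ImportRequested struct {
	DatasetID string
}

// drain consumes events from the topic until it's closed, returning
// the dataset IDs handled. The slow import work would happen here.
func drain(topic <-chan ImportRequested) []string {
	var done []string
	for ev := range topic {
		done = append(done, ev.DatasetID)
	}
	return done
}

func main() {
	// Buffered, like a broker absorbing a spike in publish activity.
	topic := make(chan ImportRequested, 100)
	results := make(chan []string)

	// Consumer runs independently of the producer.
	go func() { results <- drain(topic) }()

	// Producer: publish events and move on straight away.
	for _, id := range []string{"cpih01", "mid-year-pop-est"} {
		topic <- ImportRequested{DatasetID: id}
	}
	close(topic)

	fmt.Println(<-results)
}
```

A real Kafka topic adds durability, replay and consumer groups on top of this, but the decoupling – and the ability to absorb spikes on either side – is the same idea.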

What we’ve got planned

Over the next couple of years, our goal is to replace much of the functionality in Zebedee with new services and backend processes which apply the “Filter a dataset” publishing model to the rest of our content. With the demands on our service already increasing, the 2021 Census only a few years away, and the size and complexity of the content increasing, we need to be ready to handle much higher levels of demand than we’ve been used to.

We’ll also be improving the reliability, security and performance of our CMS by rebuilding our content and publishing APIs and backend processes, storing most of our content in databases instead of JSON files on disk, and moving all static content to our CDN.

Our website search functionality has had a lot of attention over the past year. We’ve listened to your feedback, and we know it’s something we’ve been pretty bad at – you don’t always find the results you’d expect – but we’ve been working hard to improve it. We’ve had a dedicated search engineer applying machine learning and natural language processing techniques to our content and search queries to help us give you better results. Some of that work is now being deployed to production, and it’s something we’ll be building on over the next couple of years.

These services are currently owned by a small team of backend engineers who are responsible for all server and backend data pipeline services.

Some of the tools and services we use in backend engineering are:

  • Go – for a lot of our newer APIs and backend services
  • Java – for our legacy codebases and some newer services
  • Python – for our search APIs and machine learning code
  • Apache Kafka – for most communication between backend services
  • MongoDB – for storing most structured data
  • PostgreSQL – for storing audit data, though we’d like to change this
  • Neo4j – for storing structured datasets, used for filtering datasets and direct access to specific observations
  • AWS (for example S3, CloudFront, and soon Neptune) – hosts most of our web assets (CSS, JavaScript, and so on), and imported and exported datasets for the “Filter a dataset” service
  • Elasticsearch – indexes our content and powers our search functionality

To achieve our goals over the next few years, we’re also scaling up our backend engineering team. We’re about to start hiring two new backend engineers, with more roles to come early next year. To find out when we’re hiring you can sign up for Civil Service job alerts or follow us on Twitter.

Come back next week for part three where we’ll look at our frontend systems – how we build internal tooling and web services for publishing and accessing our content.


4 comments on “The Digital Publishing technology stack – Part 2: Our backend systems”

  1. Why do you encrypt at all? A fail-safe against other security provisions not working?

    Are you replacing static pages (“copied to multiple frontend servers at publish time”) with dynamic (“access to the data is controlled by a set of APIs”), or is it already dynamic? It’s not completely clear to me, although it sounds like you’re replacing static HTML publishing with dynamic rendering – presumably exchanging initial cost and long-term saving for ongoing medium-to-high cost? Why did you decide not to, for example, publish the pages to a location which was inaccessible and change permissions or symlinks at the appointed time? Or are there more specific guidelines that you need to adhere to?

    1. Hi Phil

      You’ve asked some really good questions!

      At a very high level – we encrypt because the data we hold is particularly sensitive, and access to that data needs to be strictly controlled. It’s partly a fail-safe against other security mechanisms, partly good security practice, and partly because we work from the assumption that any part of our system might be compromised and we should minimise any potential consequences of that.

      The static pages aren’t really static – the content is, but the rest of the page isn’t – so even though we have files on disk which are copied around, we still have server-side components which take that content and render the final page. This means we already have the costs associated with dynamic rendering, but it’s built in such a way that we’re not really benefiting from that.

      The type of data we’re publishing has also changed since the original architectural design so we need to be more flexible around how we store and publish data. For example, the “Filter a dataset” service means storing the data in a structured way which can be easily queried rather than files on disk.

      Moving much of the static content to a database allows us to create, store and publish our content in a much more efficient way. This should minimise the impact of the 9:30am spike and allow us to reduce our overall system capacity.

      Thanks
      Ian

  2. Interesting answers, thanks. I wonder if you’ve had any of the MongoDB issues that the Guardian team have had? https://www.theguardian.com/info/2018/nov/30/bye-bye-mongo-hello-postgres

    1. Thanks for the link! We haven’t yet, but it sounds like most of the problems the Guardian had were around maintenance and management of the cluster, rather than problems with the cluster itself. Hopefully we won’t have similar problems, but useful to be aware of the issues others have faced.
