The Digital Publishing technology stack – Part 2: Our backend systems
In part one we looked at our hosting platform and how we build, deploy and manage services in our test and production environments.
In part two we’ll look at our backend systems – how we import, store, publish and export data.
Part 2: Our backend systems
Our backend systems are a collection of data processing pipelines and APIs – built with a mixture of Go and Java, and a little bit of Python. This is probably the stack which contains the most technical debt, and it’s also the stack which is the most difficult to improve without unpredictable or unintended consequences. But it’s high up our list of priorities to fix, which we’re doing a bit at a time as we introduce new features.
Most of our backend code – at least, the bits used for publishing written content and static datasets to the ONS website – lives in a service known as Zebedee. It’s a Java monolith which provides the content APIs used by Florence and the public-facing website, handles all user and team management, permissions, encryption, scheduling and manages most of the publishing process. It also does a load of other stuff – some of it we know about, some of it we don’t (and we always find out more at the worst possible time). It’s built on top of the same framework as Babbage, and stores everything on disk as JSON files (only audit data is written to a database).
There’s two other services which are part of our publishing process:
- The Train – the service used to publish content from our CMS to the public facing website, which receives files over HTTP and copies them around on disk
- Project Brian – converts CSDB files (our proprietary, and undocumented, time series data format) into Excel and CSV files for publication
Much of the work we’ve done in these services since the website launched has been maintenance, bug fixing and refactoring to help us understand how and why things are failing and to improve reliability and performance. After the launch of the website we were sometimes failing to publish content for unknown reasons, or suffering services outages. Over the last two years we’ve uncovered and fixed a variety of serious design flaws and bugs in the way these services work, and both the website and publishing system are now much more robust.
Because of the publishing model we use (encrypted files on disk, which are decrypted and copied to multiple frontend servers at publish time), we have a relatively low limit on how much content we can publish at any one time. The increase in demand has meant we’re reaching the limit of what our current design can handle.
We’re required by T3.6 of the Code of Practice for Statistics to publish statistical releases at 9:30am. A combination of factors has meant we can’t start the publishing process until 9:30:00am (at which point all of the content is still only available through our CMS, and encrypted on disk, in a different subnet to the web frontend), whilst requiring that we finish the publishing process by 9:30:59am (at which point the content must be decrypted, available on all frontend web servers, and available through search and our release calendar). This 60-second window coincides with our peak traffic for the day – just as we’re trying to publish data, all of our users are trying to download it.
What we’ve been doing
We’ve taken some steps to improve things, but ultimately we need to use a different model for storing and publishing data which doesn’t require intensive CPU and I/O activity at the same time our inbound traffic peaks, and which doesn’t require a publishing process that scales linearly with the volume of data being published.
As part of building the “Filter a dataset” service, we’ve taken a completely different approach to publishing content. We have a set of backend services which are responsible for the import of datasets into Neo4j (soon to be AWS Neptune), and another set responsible for the filtering and export of data (to Excel and CSV and stored in Amazon S3). Metadata is stored in MongoDB, and access to the data is controlled by a set of APIs. This makes it instantly available at publish time without moving files around on disk and across the network.
The data is still encrypted before publication, so we still have some work to do as part of the publishing process – but we minimise the impact by decrypting the data as soon as possible after it’s published (which is typically started within a few seconds), and we’ll decrypt on–demand for any requests before that process has finished. This will increase CPU usage as traffic increases, but in a way we can scale horizontally and which has minimal impact on our backend databases.
We’ve also moved towards an event-driven model where anything which doesn’t need an immediate response happens asynchronously via Kafka, significantly improving our ability to handle spikes in demand on either publishing or web traffic. The entire import process and export process happens that way, and synchronous API calls are only used when absolutely necessary.
What we’ve got planned
Over the next couple of years, our goal is to replace much of the functionality in Zebedee with new services and backend processes which apply the “Filter a dataset” publishing model to the rest of our content. With the demands on our service already increasing, the 2021 Census only a few years away, and the size and complexity of the content increasing, we need to be ready to handle much higher levels of demand than we’ve been used to.
We’ll also be improving the reliability, security and performance of our CMS by rebuilding our content and publishing APIs and backend processes, storing most of our content in databases instead of JSON files on disk, and moving all static content to our CDN.
Our website search functionality has had a lot of attention over the past year. We’ve listened to your feedback so we know it’s something we’ve been pretty bad at, and you don’t always find the results you’d expect, but we’ve been working hard to improve it. We’ve had a dedicated search engineer who has been applying machine learning and natural language processing techniques to our content and search queries to help us give you better search results. Some of that work is now being deployed to production, and it’s something we’ll be building on over the next couple of years.
These services are currently owned by a small team of backend engineers who are responsible for all server and backend data pipeline services.
Some of the tools and services we use in backend engineering are:
- Go – for a lot of our newer APIs and backend services
- Java – for our legacy codebases and some newer services
- Python – for our search APIs and machine learning code
- Apache Kafka – for most communication between backend services
- MongoDB – for storing most structured data
- PostgreSQL – for storing audit data, though we’d like to change this
- Neo4j – for storing structured datasets, used for filtering datasets and direct access to specific observations
- Elasticsearch – indexes our content and powers our search functionality
To achieve our goals over the next few years, we’re also scaling up our backend engineering team. We’re about to start hiring two new backend engineers, with more roles to come early next year. To find out when we’re hiring you can sign up for Civil Service job alerts or follow us on Twitter.
Come back next week for part three where we’ll look at our frontend systems – how we build internal tooling and web services for publishing and accessing our content.