The Digital Publishing technology stack – Part 1: Our platform

In Digital Publishing we use a lot of technologies – some are industry standard, others not so much. Since we try to be open and transparent about everything we do, and we’ve just finished rebuilding our hosting platform, now seems like a good time to look at the technologies we use and how our website works.

If you’re new to the blog – Digital Publishing are the team in ONS who are responsible for the ONS website, API, developer site, content management system, social media, digital and print design, data visualisation, content and editorial work, and all sorts of other “digital” and “publishing” things!

This will be a three-part series where I’ll talk a bit about each of our main stacks – platform, backend and frontend – and I’ll try to cover some of our plans for the next few years.

I’ll only be looking at our web services – the ONS website, API, developer site and CMS. We use a lot more technologies in other areas like interaction design, print design and social media that I won’t be covering here.

If you have an interest in any of these technologies, or the challenges we’re taking on, then we’d like to hear from you – we’re scaling up our engineering teams and we’d love for you to apply for one of our roles!

Part 1: Our platform

In part one, we’ll be looking at our platform – how we run our services in test and production environments.

Everything we do in Digital Publishing is hosted on Amazon Web Services (AWS). We don’t have our own data centres or physical kit – our goal is to build modern user-centred products, not build and maintain physical infrastructure.

Our platform underpins everything else we do – every service we build, every product we’re responsible for, and a lot of the internal tooling we use is hosted on the same platform. This allows us to define security policies, access controls and audit requirements once and have them shared across everything else we do.

We occasionally use a different approach – for example we have some non-critical services running on AWS Lambda or Elastic Beanstalk rather than our own platform.

But first, a bit of background…

Our old “platform” was a bunch of Docker containers running directly on Amazon EC2 instances – deployments meant removing an instance from our load balancer, connecting with SSH and manually running bash scripts and Docker commands to update the application code, doing a bit of manual testing, then adding it back into the load balancer before moving on to the next EC2 instance.

We lived with that solution for over two years, but we couldn’t scale it beyond the capacity it was originally built with, and it relied on deprecated AWS technologies which we couldn’t easily upgrade or replace. We also hated deployments, which meant lots of unrelated code changes being bunched together for release. Our risks were increasing over time: we struggled to apply critical security patches, failures required time-consuming manual intervention, and it was difficult to work out exactly which commit or pull request had introduced a bug.

Our new platform

Our new platform is much better – we have more capacity, better redundancy and resilience against failure, and many of the manual tasks from the old platform are now fully automated (or only require the click of a button). We also have far better logging, monitoring and alerting, and better visibility into how our platform is performing.

We’ve replaced the deprecated AWS infrastructure, built a container platform using the HashiCorp stack (Nomad, Consul, Vault), introduced industry-standard tools for monitoring and alerting (the ELK stack and Prometheus), and improved our build and deployment pipelines using Concourse. We’ll cover all of that in a bit more detail in a future post.
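In the meantime, to give a flavour of how a service might consume secrets from Vault, here’s a minimal sketch using the official Go client – the secret path is hypothetical, and a real service would also handle authentication, leases and renewal:

```go
package main

import (
	"fmt"
	"log"

	vault "github.com/hashicorp/vault/api"
)

func main() {
	// DefaultConfig picks up VAULT_ADDR and VAULT_TOKEN from the environment.
	client, err := vault.NewClient(vault.DefaultConfig())
	if err != nil {
		log.Fatalf("creating Vault client: %v", err)
	}

	// The path below is hypothetical – each service would read its own
	// configuration from its own path in Vault's key/value store.
	secret, err := client.Logical().Read("secret/data/my-service/config")
	if err != nil {
		log.Fatalf("reading secret: %v", err)
	}
	if secret == nil {
		log.Fatal("no secret found at that path")
	}

	fmt.Println(secret.Data)
}
```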

But this is just the minimum we could deliver to replace our existing platform and start getting some value back from the huge amount of work that’s gone into it. There’s a story there for another time – building a container platform is hard. If you don’t want to wait, Alice Goldfuss gave an excellent talk at Lead Developer London 2018 that’s worth a look and reflects much of the pain we’ve experienced. But it’s been worth it!

Most of our platform code isn’t open to the public – we’d love it to be, but we need time to make sure we don’t accidentally share information we should be keeping to ourselves. That said, it’s just two GitHub repos out of the 100-plus that belong to Digital Publishing – I think I’ll count that as a win!

Some of the tools and services we use regularly:

  • AWS – virtualised cloud infrastructure
  • Terraform – for provisioning infrastructure on AWS
  • Ansible – for configuring our infrastructure after it’s been provisioned
  • Concourse – for our continuous integration and deployment pipelines
  • Cloudflare – DDoS mitigation and caching
  • Pingdom – external availability and response time monitoring
  • HashiCorp stack (Nomad, Consul, Vault) – orchestrates services and deployments across our container platform, and provides a secure storage backend for configuration and other sensitive data
  • ELK stack – log aggregation and analysis
  • Prometheus – metrics, monitoring and alerting (see the sketch below)
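To make that last one concrete, here’s a minimal sketch of a Go service exposing metrics for Prometheus to scrape, using the official client library – the metric name, label and port are illustrative rather than taken from our real services:

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// requestsTotal counts HTTP requests, labelled by path.
var requestsTotal = promauto.NewCounterVec(
	prometheus.CounterOpts{
		Name: "http_requests_total",
		Help: "Total HTTP requests handled, labelled by path.",
	},
	[]string{"path"},
)

func handler(w http.ResponseWriter, r *http.Request) {
	requestsTotal.WithLabelValues(r.URL.Path).Inc()
	w.Write([]byte("ok"))
}

func main() {
	http.HandleFunc("/", handler)
	// Prometheus scrapes this endpoint on whatever interval it's configured with.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

Prometheus then scrapes the /metrics endpoint on each service and evaluates alerting rules against the time series it collects.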

What we’ve got planned

Over the next couple of years we want to build on the work we’ve already done. That means filling in many of the gaps we left behind – things like improved traffic routing and management, more reliable deployments, better rollback support, and helping the platform automatically recover from even more failure scenarios without intervention.

We’ll be launching the “Filter a dataset” service early next year, which will put more demand on our platform, so we’re in the middle of building and securing our data backends (currently MongoDB, Kafka and Neo4j – though our goal is to replace Neo4j with AWS Neptune in the near future). We’re also working towards being able to publish the output of the next census (Census 2021), which is likely to place significant demand on our platform.

Beyond that we want to improve the tooling we provide to the engineers responsible for our web services – things like simpler ways to access our environments (particularly when we’re working remotely), improved automation in our pipelines (for example adding smoke and canary tests to production deployments), and making life easier for engineers when working on code locally (like opening up access to shared development databases – we currently run everything locally, and that’ll only scale so far!).
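As an illustration of the kind of smoke test we have in mind, here’s a minimal sketch in Go – the URL is a stand-in, and a real test would assert on much more than a status code:

```go
package main

import (
	"fmt"
	"net/http"
	"os"
	"time"
)

func main() {
	client := &http.Client{Timeout: 10 * time.Second}

	// The URL is a stand-in – a real pipeline would target the environment
	// it has just deployed to.
	resp, err := client.Get("https://www.example.com/healthcheck")
	if err != nil {
		fmt.Fprintf(os.Stderr, "smoke test failed: %v\n", err)
		os.Exit(1) // a non-zero exit fails the pipeline job
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		fmt.Fprintf(os.Stderr, "smoke test failed: status %d\n", resp.StatusCode)
		os.Exit(1)
	}
	fmt.Println("smoke test passed")
}
```

The idea is that the pipeline runs something like this immediately after a deployment, and rolls back or alerts if the job exits non-zero.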

To do this we need to expand our platform team – we’re currently working out what that needs to look like. Should it be a platform engineering team? Should we take inspiration from Google’s “Site Reliability Engineer” role? We don’t know yet.

Either way, we’ll need more people who have an interest in infrastructure and platform engineering, ideally with a solid background in software engineering and experience of public cloud providers.

If you’re interested in being part of that, we already have an open position which you can find on Civil Service Jobs, and there are still a few weeks left to apply.

Come back next week for part two where we’ll look at our backend systems – how we import, store, publish and export data and content.