Introducing the Big Data team
My name is Jane Naylor and I’m Head of the Big Data team at the ONS. The team was established in January 2014 and brought together staff with a mixture of statistical, methodological and IT backgrounds and a shared enthusiasm for data science and data engineering.
The key aims of the team are to demonstrate the potential for using big data within official statistics, to investigate the methodological and technological issues and other challenges and to develop skills and capability within ONS.
We adopted a dual approach to this work: undertaking hands-on pilot work with new data sources, tools and technologies, and exploring collaborative and partnership opportunities with a range of external partners.
I can’t possibly summarise everything that we have done over the past 2 years but hopefully this post will provide a high-level overview of key activities to date.
What have we been working on?
Over the last 2 years, the team have undertaken a number of pilots to demonstrate the potential benefits (reduced collection/production costs, improved quality, new types of outputs) and also tackle the challenges (statistical, technical, ethical, commercial) of the use of big data within the production of official statistics. These pilots have also allowed the team to develop new data science skills.
There are many different definitions of ‘big data’ but quite simply we have interpreted it as ‘alternative’ or ‘new forms’ of data. Official statistics are traditionally produced using survey, Census or administrative data – we focus on data sets that don’t fit within these 3 types. For example, to give you a flavour, we have undertaken research to try to answer the following questions:
- Can geo-located Twitter data provide new insights into population and mobility?
- Will data on utility usage provide a good indicator of vacant properties and hence allow us to be smarter about the way we conduct our Census or surveys, i.e. saving the taxpayer money?
- Rather than collecting price data (which feed through to our economic outputs) by manually visiting stores, would it be more efficient, and could we collect more data more frequently, if we scraped prices from supermarket websites?
- Can Oyster card data from tube travel in London be used to understand travel and commuting patterns?
- What additional intelligence about properties in a certain area can we automatically gather from housing websites such as Zoopla that will help us when we undertake a survey or Census?
- By analysing the difference between the number of electricity meters in an area and the number of addresses, can we identify areas where properties have been demolished, where there has been significant development or where there are large residential establishments?
- How can mobile phone data be used to produce statistics for the population and population mobility? We haven’t actually had access to any data here but we have learnt a lot about the challenges of trying to do so!
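As an illustration of the electricity meter question in the list above, here is a minimal sketch of how such an area-level comparison might look in Python. It is not the method used in our pilot: the function name, the 20% tolerance and the area counts are all made up for the example, and it works only with aggregate counts per area.

```python
def flag_mismatched_areas(meter_counts, address_counts, tolerance=0.2):
    """Flag areas where meter and address counts diverge.

    Returns a dict mapping each flagged area to its relative difference,
    (meters - addresses) / addresses. The tolerance is an illustrative
    assumption, not a value taken from the pilot.
    """
    flagged = {}
    for area, addresses in address_counts.items():
        meters = meter_counts.get(area, 0)
        if addresses == 0:
            # No addresses on record: any meters at all are worth a look.
            if meters > 0:
                flagged[area] = float("inf")
            continue
        relative_diff = (meters - addresses) / addresses
        if abs(relative_diff) > tolerance:
            flagged[area] = relative_diff
    return flagged


# Entirely made-up aggregate counts for three hypothetical areas.
address_counts = {"Area A": 1000, "Area B": 800, "Area C": 500}
meter_counts = {"Area A": 990, "Area B": 520, "Area C": 650}

flagged = flag_mismatched_areas(meter_counts, address_counts)
for area, diff in sorted(flagged.items()):
    print(f"{area}: {diff:+.0%} meters relative to addresses")
```

In this sketch a large shortfall of meters might point to demolition or a large residential establishment served by few meters, while a surplus might point to new development not yet in the address list. A flagged area would still need to be checked against other sources.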
It’s important to remember that in order to produce statistics using big data sources we are only interested in trends or patterns that can be observed at an aggregate level, not personal data about individuals. However, we recognise that accessing data from the private sector or from the internet may raise concerns around security and privacy. We have therefore only accessed publicly available, anonymous or aggregated data within these pilots. All of our work fully complies with legal requirements and our obligations under the Code of Practice for Official Statistics, and aspects of our work have been scrutinised by the National Statistician’s Data Ethics Committee.
As well as exploring new data sets we have also investigated and developed new methods in order to process and analyse the data. We have used machine learning, clustering algorithms, text string analysis and data visualisation methods as well as traditional statistical approaches. In addition, the team are using new (for ONS), mostly open-source technologies: we are programming in languages such as R and Python, and processing and storing data using technologies such as MongoDB, Neo4j, Spark and Cassandra.
Who have we been working with?
We recognise that data science is multi-disciplinary and multi-institutional and we have been working with a range of different external organisations to learn from their experience and expertise, to coordinate efforts, to work collaboratively and to acquire data:
- Government: We are key players (along with the Cabinet Office, Government Digital Service and Government Office for Science) in the virtual Government Data Science Partnership that was established to explore the opportunities for data science in Government and to embed a data-driven approach within professions and departments. We have also engaged with specific departments to share experiences and expertise.
- Academia: Data science is a growing discipline within academia. Recognising this, we have worked collaboratively with a number of different universities and academic bodies.
- International bodies: We are contributing to international initiatives in this area such as a UNECE Big Data Project and a Eurostat Big Data Taskforce, working, coordinating and sharing expertise with other National Statistical Institutes who are undertaking similar work to us and addressing similar challenges.
- Commercial organisations: In some cases the engagement is focused on acquiring or purchasing data for research purposes; in other cases it is to share experiences and understand how data science is affecting their business.
- Privacy groups: Many of the data sets we are exploring raise ethical and privacy issues. At ONS we are committed to protecting the confidentiality of all the information that we hold and addressing issues around ethics and privacy. We have therefore engaged with a number of privacy groups and ethical experts to seek advice and feedback.
The work of the Big Data team continues. Many of the pilots described above will be taken forward and developed further over the next year. We will also be taking on new pilots and identifying new areas where big data can make an impact on official statistics. In particular the recent announcement of more investment in data science at the ONS following the Bean Review will bring us lots of new challenges and opportunities.
A key challenge will be to move some of these pilots from research into implementation – using the new data sources, tools and techniques within the production of an official statistic.
Want to know more?
Please look out for future posts about the work of the team, or email us directly at email@example.com.