Open, reproducible analysis and reporting of data provenance for high-security health and administrative data

Year of award: 2019


  • Dr Jessica Butler

    University of Aberdeen, United Kingdom

Project summary

Many types of routinely collected data from the NHS and other government agencies are available for research in the UK. To protect privacy, data governance law requires that only the project-specific portion of the data be released for research, after it has been extracted, filtered and anonymised. Currently, researchers receive little information on the methods used to produce their data. This lack of transparency increases the risk of propagating undetected errors and leaves the resulting research difficult or impossible to evaluate and reproduce.

We will co-design, pilot and evaluate methods for recording and reporting provenance for research using high-security data. The result will be a method for reporting data provenance that maintains privacy and makes the research more findable, accessible, interoperable and reproducible. Our approach recognises that meeting the needs of both data guardians and researchers requires active cooperation. The project is a collaboration between data guardians, computing scientists specialising in provenance and trust, an expert in service evaluation methods, and a specialist in open science practice.

The project will provide a provenance ontology model for the high-security NHS Data Safe Haven, along with open-source resources for tracking and safely reporting provenance data for research. The method is designed to be scalable across the UK’s high-security environments that host a range of government data. No tools currently exist for capturing and reporting data provenance from these environments. This project lays the groundwork for a process that is required for fair use of the public’s data for healthcare research and innovation.