Using ehrQL in OpenSAFELY projects
This page describes how ehrQL fits in with a full OpenSAFELY project.
In one sentence:
Researchers develop an ehrQL query and analysis code on their own computers using dummy data, then submit it to the OpenSAFELY jobs site to run against real data in an OpenSAFELY backend.
Project workflow summary
The workflow for a single study using ehrQL is much like that for existing studies that use cohort-extractor.
In summary:
- Create a Git repository from the template repository provided and clone it on your local machine.
- Write a dataset definition in ehrQL that specifies what data you want to extract from the database. Only this step is specific to ehrQL.
- Develop analysis scripts using dummy data in R, Stata, or Python to process and analyse the dataset(s) created by ehrQL.
- Test the code by running the analysis steps specified in the project pipeline.
- Execute the analysis on the real data via OpenSAFELY's jobs site. This will generate outputs on the secure server.
- Check the outputs for disclosivity on the secure server, and redact where necessary.
- Release the outputs on the jobs site.
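For the dataset definition step, a minimal sketch might look like the following. This is an illustration rather than a definitive template: the file path, cut-off dates, and the `age` variable are assumptions, and the definition only runs inside an environment with the `ehrql` package available.

```python
# Hypothetical analysis/dataset_definition.py -- a minimal sketch only;
# the dates and variable names here are illustrative assumptions.
from ehrql import create_dataset
from ehrql.tables.core import patients

dataset = create_dataset()

# Define which patients are included in the dataset.
dataset.define_population(patients.date_of_birth.is_on_or_before("2005-01-01"))

# Add a column to the extracted dataset.
dataset.age = patients.age_on("2023-01-01")
```

When this action runs, ehrQL translates the definition into a database query in an OpenSAFELY backend, or into dummy data locally.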
Dummy data
Because OpenSAFELY doesn't allow researchers direct access to patient data, researchers must use dummy data for developing their analysis code on their own computer.
When an ehrQL action is executed on a researcher's computer (see Running ehrQL), ehrQL can generate random dummy data based on the properties of the tables used in the dataset definition. Alternatively, users can provide their own dummy data.
This allows the dataset definition to be checked for errors, and produces dummy output data that can be used to test downstream actions that depend on the ehrQL action's output.
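To illustrate the general idea (this is a generic sketch of schema-driven random generation, not ehrQL's actual dummy-data implementation), random rows matching a simple patient-table schema can be produced like this:

```python
import csv
import io
import random
from datetime import date, timedelta

random.seed(0)  # make the dummy data reproducible

# Toy schema loosely modelled on a patients table; ehrQL's real generator
# derives the schema from the dataset definition instead.
def random_patient(patient_id):
    birth = date(1940, 1, 1) + timedelta(days=random.randrange(30000))
    return {
        "patient_id": patient_id,
        "date_of_birth": birth.isoformat(),
        "sex": random.choice(["male", "female"]),
    }

def write_dummy_dataset(n_patients):
    """Return a CSV string with one random row per patient."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["patient_id", "date_of_birth", "sex"])
    writer.writeheader()
    for pid in range(1, n_patients + 1):
        writer.writerow(random_patient(pid))
    return buf.getvalue()
```

Downstream analysis scripts can then be developed and tested against output shaped like this, before any code touches real data.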
Real data
Executing a dataset definition against real data in an OpenSAFELY backend involves running the study on the OpenSAFELY jobs site. More information about the jobs site and how to run a study can be found in the OpenSAFELY documentation.