OpenSAFELY documentation

ehrQL output formats

Supported output formats

The following output formats are supported:

Recommended

.arrow — Apache Arrow format
.csv.gz — compressed CSV format

Not recommended

.csv — uncompressed CSV format

The uncompressed CSV format is not recommended because it produces much larger files than the alternative formats.
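To give a rough feel for the size difference, the following sketch serialises some made-up rows to CSV and compares the byte count before and after gzip compression. The column names and values are invented for illustration and are not taken from any real dataset; the actual ratio depends on the data.

```python
import csv
import gzip
import io


def csv_bytes(rows):
    """Serialise rows to CSV text and return the UTF-8 encoded bytes."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    # Hypothetical column names, chosen only for this example.
    writer.writerow(["patient_id", "age", "sex"])
    writer.writerows(rows)
    return buffer.getvalue().encode("utf-8")


# Made-up dummy rows, purely for illustration.
rows = [(i, 20 + i % 60, "female" if i % 2 else "male") for i in range(10_000)]

raw = csv_bytes(rows)
compressed = gzip.compress(raw)

print(f"uncompressed CSV: {len(raw)} bytes")
print(f"gzip-compressed CSV: {len(compressed)} bytes")
```

Tabular data tends to be repetitive, so it usually compresses well, which is why the compressed form is the preferred CSV variant.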

Unsupported output formats

.dta and .dta.gz — Stata formats
- Stata output support is still in development.
- There is an open ehrQL issue that discusses the work of supporting a suitable format for Stata.

Selecting an output format

You select an output format when you use the --output option to specify an output filename for ehrQL. The filename extension that you provide (for example, .arrow) determines the output file format.

If you specify a filename extension that is not supported, you will get an error telling you so.
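The idea of choosing a format from the filename extension can be sketched as follows. This is a minimal illustration and not ehrQL's actual implementation: the function name `output_format_for`, the mapping, and the error message are all hypothetical. Only the extensions themselves come from the lists above.

```python
from pathlib import Path

# Extensions from the supported list above; the mapping itself is illustrative.
SUPPORTED_FORMATS = {
    ".arrow": "Apache Arrow",
    ".csv.gz": "compressed CSV",
    ".csv": "uncompressed CSV",
}


def output_format_for(filename):
    """Return the format implied by the filename's extension (hypothetical helper)."""
    name = Path(filename).name
    # Check longer extensions first so compound ones such as ".csv.gz" are
    # matched as a whole.
    for extension in sorted(SUPPORTED_FORMATS, key=len, reverse=True):
        if name.endswith(extension):
            return SUPPORTED_FORMATS[extension]
    # Hypothetical error message; the real ehrQL wording may differ.
    raise ValueError(f"Unsupported output file extension: {filename}")


print(output_format_for("./outputs/data_extract.arrow"))
print(output_format_for("./outputs/data_extract.csv.gz"))
```

Under this sketch, passing a filename such as `./outputs/data_extract.dta` raises a `ValueError`, mirroring the behaviour described above for unsupported extensions.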

Examples with `opensafely exec`

`.arrow`

opensafely exec databuilder:v0 generate-dataset "./dataset-definition.py" --dummy-tables "example-data/" --output "./outputs/data_extract.arrow"

`.csv.gz`

opensafely exec databuilder:v0 generate-dataset "./dataset-definition.py" --dummy-tables "example-data/" --output "./outputs/data_extract.csv.gz"

Example `project.yaml`

version: "3.0"

expectations:
  population_size: 1000

actions:
  extract_data:
    run: databuilder:v0 generate-dataset "./dataset_definition.py" --output "outputs/data_extract.arrow"
    outputs:
      highly_sensitive:
        population: outputs/data_extract.arrow

The filename given for population under outputs must be identical to the filename passed to --output. Otherwise, you will see the following error when you use opensafely run to run the project actions:

$ opensafely run run_all
=> ProjectValidationError
   Invalid project:
   1 validation error for Pipeline
   __root__
     --output in run command and outputs must match (type=value_error)
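The matching rule can be approximated before running the project. The sketch below is not the OpenSAFELY validator: the helper names are hypothetical, the action is written as a plain Python dictionary rather than loaded from `project.yaml`, and the comparison is a simple exact string match, which is an assumption about how strictly the paths are compared.

```python
import shlex


def output_path_from_run(run_command):
    """Return the value that follows --output in a run command, or None."""
    tokens = shlex.split(run_command)
    for index, token in enumerate(tokens):
        if token == "--output" and index + 1 < len(tokens):
            return tokens[index + 1]
    return None


def output_matches(action):
    """Check that the --output path is listed among the action's outputs."""
    declared = {
        path
        for level in action["outputs"].values()
        for path in level.values()
    }
    return output_path_from_run(action["run"]) in declared


# The extract_data action from the example project.yaml, as a dictionary.
action = {
    "run": 'databuilder:v0 generate-dataset "./dataset_definition.py" '
           '--output "outputs/data_extract.arrow"',
    "outputs": {"highly_sensitive": {"population": "outputs/data_extract.arrow"}},
}

print(output_matches(action))  # True
```

Changing either path, for example setting population to `outputs/data_extract.csv.gz`, makes `output_matches` return `False`, which corresponds to the situation that triggers the validation error shown above.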