ehrQL output formats
Supported output formats
The following output formats are supported:
Recommended
- .arrow: Apache Arrow format
- .csv.gz: compressed CSV format
Not recommended
- .csv: uncompressed CSV format
The uncompressed CSV format is not recommended because it produces much larger files than the alternative formats.
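The size difference can be illustrated with a small sketch that uses only the Python standard library. The column names and values below are made up for the example and do not come from ehrQL. The exact sizes depend on the data, so none are quoted here.

```python
import csv
import gzip
import io


def csv_bytes(rows):
    """Serialise rows to CSV and return the encoded bytes."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue().encode("utf-8")


# Hypothetical dataset: 1000 rows with invented column names.
rows = [["patient_id", "age", "sex"]]
rows += [[i, 20 + i % 60, "female" if i % 2 else "male"] for i in range(1000)]

raw = csv_bytes(rows)            # what a .csv file would contain
compressed = gzip.compress(raw)  # what a .csv.gz file would contain

print(f"uncompressed: {len(raw)} bytes")
print(f"compressed:   {len(compressed)} bytes")
```

Tabular data tends to be repetitive, which is why gzip compression usually shrinks it considerably.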
Unsupported output formats
- .dta and .dta.gz: Stata formats
  - Stata output support is still in development.
  - There is an open ehrQL issue that discusses the work of supporting a suitable format for Stata.
Selecting an output format
You select an output format when you use the --output option to specify an output filename for ehrQL.
The filename extension that you provide (for example, .arrow) determines the format of the output file.
If you specify a filename extension that is not supported, you will get an error telling you so.
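As a rough illustration of extension-based selection, here is a minimal Python sketch. It is not ehrQL's actual implementation: the function name detect_output_format, the SUPPORTED_FORMATS mapping, and the error message are all assumptions made for this example.

```python
from pathlib import Path

# Extensions listed as supported in this document.
# The mapping itself is illustrative, not taken from ehrQL's source.
SUPPORTED_FORMATS = {
    ".arrow": "Apache Arrow",
    ".csv.gz": "compressed CSV",
    ".csv": "uncompressed CSV",
}


def detect_output_format(filename):
    """Return the supported extension that the filename ends with."""
    name = Path(filename).name
    for extension in SUPPORTED_FORMATS:
        # endswith() handles two-part extensions such as ".csv.gz".
        if name.endswith(extension):
            return extension
    # Hypothetical error; the real message from ehrQL may differ.
    raise ValueError(f"Unsupported output format for file: {name}")


print(detect_output_format("./outputs/data_extract.arrow"))
print(detect_output_format("./outputs/data_extract.csv.gz"))
```

In this sketch, a filename such as data_extract.dta raises ValueError, mirroring the error described above for unsupported extensions.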
Examples with opensafely exec
.arrow
opensafely exec databuilder:v0 generate-dataset "./dataset-definition.py" --dummy-tables "example-data/" --output "./outputs/data_extract.arrow"
.csv.gz
opensafely exec databuilder:v0 generate-dataset "./dataset-definition.py" --dummy-tables "example-data/" --output "./outputs/data_extract.csv.gz"
Example project.yaml
version: "3.0"

expectations:
  population_size: 1000

actions:
  extract_data:
    run: databuilder:v0 generate-dataset "./dataset_definition.py" --output "outputs/data_extract.arrow"
    outputs:
      highly_sensitive:
        population: outputs/data_extract.arrow
The population filename must be identical to the output filename specified by --output. Otherwise you will see the following error when you use opensafely run to run the project actions:
$ opensafely run run_all
=> ProjectValidationError
Invalid project:
1 validation error for Pipeline
__root__
--output in run command and outputs must match (type=value_error)
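To catch this mismatch before running the project, a check along the following lines may help. This is a sketch only and is not how opensafely performs its validation: the helper names are hypothetical, and the action is written as a plain Python dictionary that mirrors the project.yaml structure shown above, so that no YAML library is needed.

```python
import shlex


def output_path_from_run(run_command):
    """Return the value that follows --output in a run command."""
    tokens = shlex.split(run_command)  # shlex strips the surrounding quotes
    return tokens[tokens.index("--output") + 1]


def outputs_match(action):
    """Check that the --output path is declared under the action's outputs."""
    declared = set()
    for files in action["outputs"].values():
        declared.update(files.values())
    # Plain string comparison: "./outputs/x" and "outputs/x" count as different.
    return output_path_from_run(action["run"]) in declared


# Mirrors the extract_data action from the example project.yaml.
extract_data = {
    "run": 'databuilder:v0 generate-dataset "./dataset_definition.py" '
           '--output "outputs/data_extract.arrow"',
    "outputs": {
        "highly_sensitive": {"population": "outputs/data_extract.arrow"},
    },
}

print(outputs_match(extract_data))
```

The comparison is deliberately strict because the two filenames are required to be identical, not merely equivalent paths.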