Data acquisition

The first step in our processing pipeline is to acquire a local copy of the raw data. The raw data for this study is stored on OpenNeuro. If we browse to that website and go to the ‘Download’ tab, we can see that the data can be downloaded via the Amazon S3 storage infrastructure.

Our task now is to write Snakemake rules that will use the AWS CLI to download the anatomical and functional image data of interest for each of our participants. We could potentially aim to download everything in a single rule, but the differences between the file structures of the anatomical and functional images make it simpler to have separate rules.

Note

Ideally, the S3 storage plugin would allow the raw data to be accessed as required, rather than it needing to be downloaded explicitly as a first step. However, the plugin doesn’t currently support the public data access that this raw data requires.

When implementing a new rule, we will take the general approach of working through the key rule components in turn (where required for a given rule):

  • Outputs

  • Inputs

  • Parameters

  • Mechanism

    • Container

    • Logging

    • Command

  • Resources

See the manuscript for a detailed explanation of each of these components.

Rule for anatomical image acquisition

We will first tackle the rule to acquire the anatomical images.

Outputs

The first step is to think about the outputs that the rule will produce. Given our list of subject numbers, we want the workflow to generate the following output files:

results/sub-10159/anat/sub-10159_T1w.nii.gz
results/sub-10171/anat/sub-10171_T1w.nii.gz
results/sub-10189/anat/sub-10189_T1w.nii.gz

When creating the rule, we want to abstract away aspects of the output that are specific to a particular subject. We do that by replacing those characters with a wildcard for sub_num:

results/sub-{sub_num}/anat/sub-{sub_num}_T1w.nii.gz

As you can see, each of the required output files can be created by plugging in a particular value for sub_num.
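
The substitution can be pictured with ordinary Python string formatting. The sketch below is only an illustration of the idea, not how Snakemake resolves wildcards internally; the names sub_nums, pattern and paths are our own.

```python
# Illustration only: str.format stands in for Snakemake's wildcard handling.
sub_nums = ["10159", "10171", "10189"]
pattern = "results/sub-{sub_num}/anat/sub-{sub_num}_T1w.nii.gz"

# Plug each subject number into the pattern to get a concrete output path.
paths = [pattern.format(sub_num=sub_num) for sub_num in sub_nums]

for path in paths:
    print(path)
```

Running this prints the three output paths listed at the start of this section.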

Note

We will use ‘subject’ and ‘participant’ interchangeably in this documentation, consistent with their usage in the Brain Imaging Data Structure (BIDS).

We can now start to create the rule, called acquire_anat, that specifies a single output that we refer to as img (that decision is up to us) and that contains the sub_num wildcard:

workflow/workflow/rules/acquire_anat.smk
rule acquire_anat:
    "Acquire (download) the anatomical images"
    output:
        img="results/sub-{sub_num}/anat/sub-{sub_num}_T1w.nii.gz",

Inputs

The next step when creating a rule is to think about the inputs that are required to produce the output. Here, we don’t actually need any inputs — so we can omit an input section from the rule.

Parameters

Now we can think about any parameters that are required for the operation of the rule. For our purposes, it is useful to build the OpenNeuro URL as a parameter that can then be used by the downloading mechanism. In particular, we just need to combine the study data location (the S3 URL s3://openneuro.org/ds000030/) with the relative location of an output image (without the results/ prefix).

Because we need the output path after the sub_num wildcard has been replaced with a specific value, we specify this parameter as a function. Here, we use an anonymous (lambda) function and use the provided output argument to construct the URL:

workflow/workflow/rules/acquire_anat.smk
rule acquire_anat:
    "Acquire (download) the anatomical images"
    output:
        img="results/sub-{sub_num}/anat/sub-{sub_num}_T1w.nii.gz",
    params:
        remote_url=lambda wildcards, input, output: (
            f"s3://openneuro.org/ds000030/{output.img.removeprefix('results/')}"
        ),

When the function is called during the execution of a Snakemake job, the output.img variable will contain the string from the img item of the output directive ("results/sub-{sub_num}/anat/sub-{sub_num}_T1w.nii.gz") but with the {sub_num} wildcard replaced by the subject number that is active for the job (e.g., "results/sub-10159/anat/sub-10159_T1w.nii.gz").
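
To see what the parameter function returns, we can call the same logic outside of Snakemake. This is a minimal sketch: SimpleNamespace is a hypothetical stand-in for the object that Snakemake passes as output, and the wildcards and input arguments are unused here.

```python
from types import SimpleNamespace


def remote_url(wildcards, input, output):
    """Same logic as the lambda in the rule: swap the local prefix for the S3 location."""
    return f"s3://openneuro.org/ds000030/{output.img.removeprefix('results/')}"


# Stand-in (assumption) for the resolved `output` object of one job.
output = SimpleNamespace(img="results/sub-10159/anat/sub-10159_T1w.nii.gz")

url = remote_url(wildcards=None, input=None, output=output)
print(url)
```

The printed URL is the study location followed by the output path with results/ stripped from the front.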

Mechanism

Now we need to think about the rule’s mechanism — how the output files are produced.

Container

As outlined above, we want to use the AWS CLI to download the data from OpenNeuro. We could try to install this software locally, which can then be used when running Snakemake. However, a more reproducible approach is to use a container that, well, contains the AWS CLI in addition to the supporting operating system and dependencies.

If we search the internet for an AWS CLI container, we can see that one is hosted by Amazon on the Docker Hub. From that site, we can see its location identifier is something like amazon/aws-cli:2.32.21 (the latest version number will vary over time). We can provide this information in the rule, along with the docker:// protocol prefix, to specify that the rule should execute its job within this container.

Because we might want to use the same container in multiple rules, we can define the container sources within the common.smk file:

workflow/workflow/rules/common.smk
SUB_NUMS = ["10159", "10171", "10189"]
TASKS = ["taskswitch", "stopsignal"]

CONTAINER_SOURCES = {
    "AWS-CLI": "docker://amazon/aws-cli:2.32.21",
}

and then refer to that definition within the acquire_anat rule:

workflow/workflow/rules/acquire_anat.smk
rule acquire_anat:
    "Acquire (download) the anatomical images"
    output:
        img="results/sub-{sub_num}/anat/sub-{sub_num}_T1w.nii.gz",
    params:
        remote_url=lambda wildcards, input, output: (
            f"s3://openneuro.org/ds000030/{output.img.removeprefix('results/')}"
        ),
    container:
        CONTAINER_SOURCES["AWS-CLI"]

Note

Why does the value for the container directive not have a comma (,) at the end, but the values for the output and params directives do have a comma? This is mostly a stylistic choice, where we use commas to indicate directives that could have multiple items (in typical usage, at least). This is particularly useful for version control because adding a new item doesn’t require changing the last item (to add a , at the end), and so the diff is isolated to the new item.

Logging

Because multiple jobs can be running simultaneously, and potentially on different computers, we do not easily have a way of monitoring any output that gets printed to the terminal as a rule executes. Instead, we can redirect any such output to a log file. Here, we specify the location of this log file:

workflow/workflow/rules/acquire_anat.smk
rule acquire_anat:
    "Acquire (download) the anatomical images"
    output:
        img="results/sub-{sub_num}/anat/sub-{sub_num}_T1w.nii.gz",
    params:
        remote_url=lambda wildcards, input, output: (
            f"s3://openneuro.org/ds000030/{output.img.removeprefix('results/')}"
        ),
    container:
        CONTAINER_SOURCES["AWS-CLI"]
    log:
        "logs/acquire_anat/acquire_anat_{sub_num}.txt"

Note that this doesn’t actually do any logging or any redirection of terminal output — it just specifies the location of the log file, which can then be used by other components in the rule (as we will see).

Command

Now we need to specify the AWS CLI command that will download the data. We use the shell directive in the rule to specify the command to execute, using placeholders in curly braces for the remote URL, the output path, and the log path:

workflow/workflow/rules/acquire_anat.smk
rule acquire_anat:
    "Acquire (download) the anatomical images"
    output:
        img="results/sub-{sub_num}/anat/sub-{sub_num}_T1w.nii.gz",
    params:
        remote_url=lambda wildcards, input, output: (
            f"s3://openneuro.org/ds000030/{output.img.removeprefix('results/')}"
        ),
    container:
        CONTAINER_SOURCES["AWS-CLI"]
    log:
        "logs/acquire_anat/acquire_anat_{sub_num}.txt"
    shell:
        """
aws \
s3 \
cp \
--no-sign-request \
--no-progress \
{params.remote_url} \
{output.img} \
> {log} 2>&1
        """

The details of how this command is constructed are outside the scope of this tutorial. However, two aspects of the command reflect general principles that are worth expanding on:

  • The final > {log} 2>&1 statement is a potentially cryptic aspect of the command. It can be read in two parts: > {log} redirects standard output (stream 1) to the log file, and 2>&1 then sends standard error (stream 2) to the same destination as standard output. The net effect is that any printed output from the command goes into the log file rather than to the screen.

  • Suppressing progress output, as the --no-progress argument does here, is a common pattern in shell commands that are run within Snakemake. Without it, the command would print a progress bar to the screen and update it over time. This is useful when used interactively, but less so in the context of Snakemake, where multiple commands may be run simultaneously and on remote computing infrastructure.
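
The effect of the redirection can be checked with a small experiment. The sketch below assumes a POSIX shell is available; the command that writes one line to each stream and the file name example_log.txt are made up for the demonstration, and have nothing to do with the AWS CLI.

```python
import shlex
import subprocess
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    log_path = Path(tmp_dir) / "example_log.txt"

    # Hypothetical command: one line to standard output, one to standard error.
    command = "(echo 'to standard output'; echo 'to standard error' >&2)"

    # Append the same redirection that the rule uses: > {log} 2>&1
    completed = subprocess.run(
        f"{command} > {shlex.quote(str(log_path))} 2>&1",
        shell=True,
        check=True,
        capture_output=True,
        text=True,
    )
    log_text = log_path.read_text()

# Nothing reaches the terminal streams; both lines end up in the log file.
print(repr(completed.stdout), repr(completed.stderr))
print(log_text)
```

Both captured streams come back empty, and both lines appear in the log file.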

Note

Interactive exploration of a containerised command is aided by having a local copy of the container — obtained using the apptainer pull command in an interactive shell. For example, apptainer pull "docker://amazon/aws-cli:2.32.21" will download the container into a local file named aws-cli_2.32.21.sif. An interactive console running inside the container can then be obtained by apptainer shell aws-cli_2.32.21.sif. You could then run something like aws s3 cp help (inside the container) to see the command-line options.

Resources

We also need to consider the resources that are required by the rule. Here, there is nothing special required — so we omit any resource-related directives.

Rule for functional image acquisition

Now that we have a complete rule for the anatomical image acquisition, we can turn to the rule for the functional image acquisition.

We can first note that the process for acquiring functional images is pretty much the same as for anatomical images — just with different paths. We could start by copying the acquire_anat.smk file that we just created. However, this puts the same information in multiple places and becomes prone to inconsistencies.

Instead, we can use rule inheritance to ask Snakemake to use everything from the acquire_anat rule except for what we explicitly override. We do this via:

workflow/workflow/rules/acquire_func.smk
use rule acquire_anat as acquire_func with:

Outputs

As with the anatomical acquisition rule, we can start by thinking about all the output files that the rule will produce:

results/sub-10159/func/sub-10159_task-stopsignal_bold.nii.gz
results/sub-10159/func/sub-10159_task-taskswitch_bold.nii.gz
results/sub-10171/func/sub-10171_task-stopsignal_bold.nii.gz
results/sub-10171/func/sub-10171_task-taskswitch_bold.nii.gz
results/sub-10189/func/sub-10189_task-stopsignal_bold.nii.gz
results/sub-10189/func/sub-10189_task-taskswitch_bold.nii.gz

We can see that, in addition to the sub_num wildcard that was required for the anatomical rule, there is also a task wildcard. We can thus specify the rule output as:

workflow/workflow/rules/acquire_func.smk
use rule acquire_anat as acquire_func with:
    output:
        img="results/sub-{sub_num}/func/sub-{sub_num}_task-{task}_bold.nii.gz",

Mechanism

Because of the way we constructed the anatomical acquisition rule, it mostly also applies to the functional acquisition without modification.

Logging

We only need to override the path for the log file:

workflow/workflow/rules/acquire_func.smk
use rule acquire_anat as acquire_func with:
    output:
        img="results/sub-{sub_num}/func/sub-{sub_num}_task-{task}_bold.nii.gz",
    log:
        "logs/acquire_func/acquire_func_{sub_num}_{task}.txt"
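
The inherited remote_url parameter needs no override because it is computed from whatever output.img resolves to for the job. As a rough check outside of Snakemake, we can apply the same logic to a functional output path. As before, SimpleNamespace is only a hypothetical stand-in for the output object that Snakemake provides.

```python
from types import SimpleNamespace


def remote_url(wildcards, input, output):
    """Same logic as the params lambda defined in acquire_anat."""
    return f"s3://openneuro.org/ds000030/{output.img.removeprefix('results/')}"


# Stand-in (assumption) for the resolved output of one acquire_func job.
func_output = SimpleNamespace(
    img="results/sub-10171/func/sub-10171_task-stopsignal_bold.nii.gz"
)

func_url = remote_url(wildcards=None, input=None, output=func_output)
print(func_url)
```

The result points at the func/ location on OpenNeuro even though the function itself was written for the anatomical rule.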

Preparing for execution

At this point, we have specified the procedure for how the raw data files can be downloaded. Now, we need to tell Snakemake which output files we want the workflow to produce.

First, we need to include our newly-created rules within the Snakefile:

workflow/workflow/Snakefile
include: "rules/common.smk"
include: "rules/acquire_anat.smk"
include: "rules/acquire_func.smk"

Now we need to define a special rule, called all by convention. The input to this rule is the set of output files that are to be produced by the workflow. Because these are specific files, their paths cannot contain any wildcards.

At this point, we want the output to be the anatomical images for each of the subject numbers of interest and the functional images for each pairwise combination of the subject numbers of interest and the tasks of interest. We can use the expand helper function to insert these values into the rule output paths:

workflow/workflow/Snakefile
include: "rules/common.smk"
include: "rules/acquire_anat.smk"
include: "rules/acquire_func.smk"

rule all:
    input:
        expand(rules.acquire_anat.output.img, sub_num=SUB_NUMS),
        expand(rules.acquire_func.output.img, sub_num=SUB_NUMS, task=TASKS),
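
To get a feel for what these two expand calls produce, here is a plain-Python approximation. The expand_sketch function is a hypothetical stand-in written for this illustration, not Snakemake's implementation; it fills a pattern with every combination of the supplied values.

```python
from itertools import product

SUB_NUMS = ["10159", "10171", "10189"]
TASKS = ["taskswitch", "stopsignal"]


def expand_sketch(pattern, **values):
    """Fill `pattern` with every combination of the supplied wildcard values."""
    names = list(values)
    return [
        pattern.format(**dict(zip(names, combination)))
        for combination in product(*values.values())
    ]


anat_targets = expand_sketch(
    "results/sub-{sub_num}/anat/sub-{sub_num}_T1w.nii.gz",
    sub_num=SUB_NUMS,
)
func_targets = expand_sketch(
    "results/sub-{sub_num}/func/sub-{sub_num}_task-{task}_bold.nii.gz",
    sub_num=SUB_NUMS,
    task=TASKS,
)

print(len(anat_targets), len(func_targets))
```

Three subjects give three anatomical targets, and three subjects combined with two tasks give six functional targets.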

Executing the workflow

We now have everything in order to actually run the workflow!

However, before doing so it is typically useful to do a ‘dry run’. This shows us what Snakemake is planning to do, but does not actually execute the jobs. We can do a dry run via:

$ uv run snakemake --dry-run

This will produce a bunch of output. If we start at the top, it will look something like:

Using workflow specific profile profiles/default for setting default command line arguments.
host: djmhomepc
Building DAG of jobs...
Singularity image docker://amazon/aws-cli:2.32.21 will be pulled.
Job stats:
job             count
------------  -------
acquire_anat        3
acquire_func        6
all                 1
total              10

We can see that it has picked up our command line arguments profile, and that it has recognised that it needs to pull the container in order to run the rules. It also shows that it needs to run the acquire_anat rule 3 times and the acquire_func rule 6 times — which matches our expectation on the number of output files that will be produced.

We can also look at the details for specific jobs, such as an anatomical data acquisition:

rule acquire_anat:
    output: results/sub-10159/anat/sub-10159_T1w.nii.gz
    jobid: 1
    reason: Missing output files: results/sub-10159/anat/sub-10159_T1w.nii.gz
    wildcards: sub_num=10159
    resources: tmpdir=<TBD>
Shell command: 
aws s3 cp --no-sign-request --only-show-errors s3://openneuro.org/ds000030/sub-10159/anat/sub-10159_T1w.nii.gz results/sub-10159/anat/sub-10159_T1w.nii.gz

Note that Snakemake states its reason for running the job, which here is that the required output file is not present. It also shows the wildcard value that it will use, and the shell command that is constructed.

The description for a functional acquisition job is similar:

rule acquire_func:
    output: results/sub-10171/func/sub-10171_task-stopsignal_bold.nii.gz
    jobid: 7
    reason: Missing output files: results/sub-10171/func/sub-10171_task-stopsignal_bold.nii.gz
    wildcards: sub_num=10171, task=stopsignal
    resources: tmpdir=<TBD>
Shell command: 
aws s3 cp --no-sign-request --only-show-errors s3://openneuro.org/ds000030/sub-10171/func/sub-10171_task-stopsignal_bold.nii.gz results/sub-10171/func/sub-10171_task-stopsignal_bold.nii.gz

If all looks good, we can go ahead and actually run the workflow:

$ uv run snakemake

Snakemake will print its progress to the screen. When it completes, the required output files will be found in the results/ directory.