One Snakemake workflow for paired-end and single-end data

2023-10-24

Recently, I was thinking about how to set up a Snakemake workflow that can deal with both single-end and paired-end sequencing datasets. For Illumina, paired-end sequencing results in two FASTQ files for each sample: one file with the forward reads and one file with the reverse reads, which have R1 and R2 in the file names. Samples from single-end sequencing only have the R1 file.

While most downstream programs can process samples from either sequencing type, the syntax when calling the programs usually differs slightly when having either one or two input FASTQ files.

This needs to be accounted for when writing the Snakemake rules, ideally without adding too much boilerplate code.

Since Snakemake uses files as intermediates between rules, it seems obvious that one would need two rules for each program, as the rule inputs will differ depending on having only R1 or R1 and R2 FASTQ files.

The example below showcases how Snakemake’s rule inheritance can be used to create a rule for running Kraken2 on paired-end FASTQ files and a derived rule that deals with the single-end case. Here, the input and params section are overwritten, while all other sections are inherited from the first rule.

Each rule’s params section defines the part of the Kraken2 command that specifies the input files, which is then inserted in the shell block. As the shell block is only defined in the first rule, we don’t need to duplicate the whole Kraken2 command in both rules.

We can also use a lambda function in the rules’ params section to access their inputs, which avoids the duplication of file names. Thanks to Nick Waters for this idea!

I think this is a quite elegant solution for this problem. While we need two rules, most code is only defined in one rule and the second rule only contains the necessary adjustments for the single-end case.

To make it work, the single-end and paired-end samples need to be stored in distinct folders, here called fastq-se/ and fastq-pe/.

from glob import glob

# ----------------------------------------------------------------------------
# get list of sample IDs from fastq files, using file name prefixes up to _S
# single-end samples are in folder fastq-se
# paired-end samples are in folder fastq-pe

SAMPLES = glob_wildcards("fastq-pe/{sample}_R1.fastq.gz")[0] + glob_wildcards("fastq-se/{sample}_R1.fastq.gz")[0]

# ----------------------------------------------------------------------------
# generate list of desired output files: one kraken2 report per sample

KRAKEN_REPORT = expand("kraken2/{sample}.report", sample=SAMPLES)

# ----------------------------------------------------------------------------

rule all:
  input:
    KRAKEN_REPORT

# ----------------------------------------------------------------------------

# The default rule for Kraken2 uses paired-end reads.
# params.file_spec contains the part of the kraken2 command that specifies its input files
rule kraken:
  threads: 10
  input:
    fq1 = "fastq-pe/{sample}_R1.fastq.gz",
    fq2 = "fastq-pe/{sample}_R2.fastq.gz"
  output:
    kraken = temp("kraken2/{sample}.kraken"),
    report = "kraken2/{sample}.report"
  params:
    file_spec = lambda wc, input: f"--paired {input.fq1} {input.fq2}"
  message:
    "Kraken2 PE: {wildcards.sample}"
  log:
    "logs/kraken2/{sample}.txt"
  benchmark:
    "benchmark/kraken2/{sample}.tsv"
  shell:
    """
    kraken2 \
    --threads {threads} --memory-mapping --gzip-compressed \
    --confidence 0.05 \
    --db /path/to/kraken2/db \
    --use-names \
    --output {output.kraken} \
    --report {output.report} \
    {params.file_spec} >{log} 2>&1
    """

# Use rule inheritance to override the input and params.file_spec for the single-end case.
use rule kraken as kraken_se with:
  input:
    fq1 = "fastq-se/{sample}_R1.fastq.gz"
  params:
    file_spec = lambda wc, input: f"{input.fq1}"
  message:
    "Kraken2 SE: {wildcards.sample}"