One Snakemake workflow for paired-end and single-end data, with globbing

2023-11-09

After figuring out how to make one Snakefile that can deal with both single-end and paired-end sequencing files in one workflow, I was still missing a puzzle piece: How to deal with globbing to get the fastq files in the rule’s input section.

Usually, the sample names are only a part of the fastq file name. In the example code below, the Illumina fastq files have a substring like _S123_ before the _R1_ or _R2_, which is the index of a sample in the samplesheet.

As this is not a part of the original sample name, we exclude it when using glob_wildcards() to get a list of all sample names from the fastq files. But this also means, we need to find the original fastq files from the sample names in the rule’s input. This is usually done with the glob() function, in which we use the * wildcard, which will match the _Sxxx_ part of the file name.

After fiddling with glob() inside the rule’s input, input functions and lambdas, the only working solution that I could come up with, was to create two dictionaries containing a mapping from sample IDs to their R1 and R2 file names. Then, these dicts are used in the input to specify the actual input files.

It looks a bit like overhead compared to using glob() inside the input, but the globbing needs to be done anyways at some point, so one might as well get it over with right at the beginning.

In the end, we set a ruleorder, which guides Snakemake towards using the paired-end fastq files over the single-end files if possible. This is necessary, as for paired-end samples, Snakemake could in principle use both rules to get the desired output - it will not automagically choose the rule that has “most” input files.

from glob import glob

# ----------------------------------------------------------------------------
# get list of sample IDs from fastq files, using file name prefixes up to _S
# single-end samples are in folder fastq-se
# paired-end samples are in folder fastq-pe

SAMPLES = list(set(glob_wildcards("fastq-pe/{sample}{rest,_S\d+_.*}.fastq.gz")[0] + glob_wildcards("fastq-se/{sample}{rest,_S\d+_.*}.fastq.gz")[0]))

# ----------------------------------------------------------------------------
# generate 2 dicts, which map sample names to their R1 and R2 file names

fq_R1_names = {}
fq_R2_names = {}
fq_names = glob('fastq-*/*.fastq.gz', recursive=True)
for f in fq_names:
  n = os.path.basename(f)
  n = re.sub("_S.*", "",n)
  if '_R1_' in f:
    fq_R1_names[n] = f
  elif '_R2_' in f:
    fq_R2_names[n] = f

# ----------------------------------------------------------------------------
# generate list of desired output files: one kraken2 report per sample

KRAKEN_REPORT = expand("kraken2/{sample}.report", sample=SAMPLES)

# ----------------------------------------------------------------------------

rule all:
  input:
    KRAKEN_REPORT

# ----------------------------------------------------------------------------

# The default rule for Kraken2 uses paired-end reads.
# params.file_spec contains the part of the kraken2 command that specifies its input files
rule kraken:
  threads: 10
  input:
    fq1 = lambda wildcards: fq_R1_names[wildcards.sample],
    fq2 = lambda wildcards: fq_R2_names[wildcards.sample]
  output:
    kraken = temp("kraken2/{sample}.kraken"),
    report = "kraken2/{sample}.report"
  params:
    file_spec = lambda wc, input: f"--paired {input.fq1} {input.fq2}"
  message:
    "Kraken2 PE: {wildcards.sample}"
  log:
    "logs/kraken2/{sample}.txt"
  benchmark:
    "benchmark/kraken2/{sample}.tsv"
  shell:
    """
    kraken2 \
    --threads {threads} --memory-mapping --gzip-compressed \
    --confidence 0.05 \
    --db /path/to/kraken2/db \
    --use-names \
    --output {output.kraken} \
    --report {output.report} \
    {params.file_spec} >{log} 2>&1
    """

# Use rule inheritance to override the input and params.file_spec for the single-end case.
use rule kraken as kraken_se with:
  input:
    fq1 = lambda wildcards: fq_R1_names[wildcards.sample],
  params:
    file_spec = lambda wc, input: f"{input.fq1}"
  message:
    "Kraken2 SE: {wildcards.sample}"

# ----------------------------------------------------------------------------
# set the rule order to use paired end (if possible) over just single-end data

ruleorder: kraken > kraken_se