Peter Menzel

About debug messages in knitr

Tue, 09 Apr 2024 21:00:00 +0100

When writing markdown documents for processing with knitr, warnings and messages are usually shown in the output document.

Often, I also want to print debug messages to the console, e.g. to indicate progress.

For this purpose, I was using the code chunk option message=FALSE, which made the output of message() show up in the console rather than in the output document:

---
title: "Example document"
output:
  html_document

---

My doc does things.

```{r chunk1, echo=FALSE, message=FALSE}
message("DOING THINGS")
```

```{r chunk2, echo=FALSE, message=FALSE}
message("DOING OTHER THINGS")
```

Console output when knitting:

processing file: example.Rmd
  |..............                                                        |  20%
  ordinary text without R code

  |............................                                          |  40%
label: chunk1 (with options)
List of 2
 $ echo   : logi FALSE
 $ message: logi FALSE

DOING THINGS
  |..........................................                            |  60%
  ordinary text without R code

  |........................................................              |  80%
label: chunk2 (with options)
List of 2
 $ echo   : logi FALSE
 $ message: logi FALSE

DOING OTHER THINGS
  |......................................................................| 100%
  ordinary text without R code

At some point recently, I noticed that these messages no longer showed up in the console output and were also not visible in the output documents, and I started to wonder whether this had ever actually worked.

After some googling, I found this post by Yihui, the author of the knitr package. It turns out that, starting with version 0.19 of the evaluate package used by knitr, the meaning of the chunk options warning and message changed, so that FALSE really means no output at all.

Instead, one needs to set the option to NA to get the console output back, i.e.:

```{r chunk1, echo=FALSE, message=NA}
message("DOING THINGS")
```

Of course, these options can also be set globally for all code chunks, using:

knitr::opts_chunk$set(echo = FALSE, message = NA)

at the beginning of the document.

All chunk options and package options for knitr are shown here at knitr’s website and there are also ways for reusing options.

Paper: Snakemake workflows for long-read bacterial genome assembly and evaluation

Mon, 01 Apr 2024 20:00:00 +0100

My paper on two Snakemake workflows for bacterial genome assembly is now published in GigaByte: Snakemake workflows for long-read bacterial genome assembly and evaluation.

The first workflow, called ont-assembly-snake, is used for generating multiple genome assemblies from long-read sequencing data of a bacterial isolate. The user can freely choose from various read filtering, assembly, and polishing tools and combine them as required by creating folder names containing the order of programs to be executed.

The second workflow, called score-assemblies, is used for evaluating one or multiple genome assemblies using various programs, usually by comparison with reference assemblies, in order to help the user in deciding on a best assembly. This workflow can be run immediately after ont-assembly-snake and it outputs a report file in HTML format, which shows the key metrics from each program.

Paper: Nosocomial outbreak of Pandoraea commovens

Wed, 15 Nov 2023 21:00:00 +0100

Our paper about a nosocomial outbreak of Pandoraea spp. was recently published in Emerging Infectious Diseases: Outbreak of Pandoraea commovens Infections among Non-Cystic Fibrosis Intensive Care Patients, Germany, 2019-2021.

In total, 24 patients from 4 hospitals in Berlin were affected. Genomic analysis of the isolates revealed Pandoraea commovens as the genetically most similar species, which was only described in 2019 by Peeters et al.

One Snakemake workflow for paired-end and single-end data, with globbing

Thu, 09 Nov 2023 21:00:21 +0200

After figuring out how to make one Snakefile that can deal with both single-end and paired-end sequencing files in one workflow, I was still missing a puzzle piece: How to deal with globbing to get the fastq files in the rule’s input section.

Usually, the sample names are only a part of the fastq file name. In the example code below, the Illumina fastq files have a substring like _S123_ before the _R1_ or _R2_, which is the index of a sample in the samplesheet.

As this is not a part of the original sample name, we exclude it when using glob_wildcards() to get a list of all sample names from the fastq files. But this also means that we need to find the original fastq files from the sample names in the rule’s input. This is usually done with the glob() function, in which we use the * wildcard, which will match the _Sxxx_ part of the file name.

After fiddling with glob() inside the rule’s input, input functions and lambdas, the only working solution that I could come up with was to create two dictionaries that map sample IDs to their R1 and R2 file names. These dicts are then used in the input to specify the actual input files.

It looks a bit like overhead compared to using glob() inside the input, but the globbing needs to be done at some point anyway, so one might as well get it over with right at the beginning.

In the end, we set a ruleorder, which guides Snakemake towards using the paired-end fastq files over the single-end files if possible. This is necessary, as for paired-end samples, Snakemake could in principle use both rules to get the desired output - it will not automagically choose the rule that has “most” input files.

import os
import re
from glob import glob

# ----------------------------------------------------------------------------
# get list of sample IDs from fastq files, using file name prefixes up to _S
# single-end samples are in folder fastq-se
# paired-end samples are in folder fastq-pe

SAMPLES = list(set(glob_wildcards("fastq-pe/{sample}{rest,_S\d+_.*}.fastq.gz")[0] + glob_wildcards("fastq-se/{sample}{rest,_S\d+_.*}.fastq.gz")[0]))

# ----------------------------------------------------------------------------
# generate 2 dicts, which map sample names to their R1 and R2 file names

fq_R1_names = {}
fq_R2_names = {}
fq_names = glob('fastq-*/*.fastq.gz', recursive=True)
for f in fq_names:
  n = os.path.basename(f)
  n = re.sub("_S.*", "",n)
  if '_R1_' in f:
    fq_R1_names[n] = f
  elif '_R2_' in f:
    fq_R2_names[n] = f

# ----------------------------------------------------------------------------
# generate list of desired output files: one kraken2 report per sample

KRAKEN_REPORT = expand("kraken2/{sample}.report", sample=SAMPLES)

# ----------------------------------------------------------------------------

rule all:
  input:
    KRAKEN_REPORT

# ----------------------------------------------------------------------------

# The default rule for Kraken2 uses paired-end reads.
# params.file_spec contains the part of the kraken2 command that specifies its input files
rule kraken:
  threads: 10
  input:
    fq1 = lambda wildcards: fq_R1_names[wildcards.sample],
    fq2 = lambda wildcards: fq_R2_names[wildcards.sample]
  output:
    kraken = temp("kraken2/{sample}.kraken"),
    report = "kraken2/{sample}.report"
  params:
    file_spec = lambda wc, input: f"--paired {input.fq1} {input.fq2}"
  message:
    "Kraken2 PE: {wildcards.sample}"
  log:
    "logs/kraken2/{sample}.txt"
  benchmark:
    "benchmark/kraken2/{sample}.tsv"
  shell:
    """
    kraken2 \
    --threads {threads} --memory-mapping --gzip-compressed \
    --confidence 0.05 \
    --db /path/to/kraken2/db \
    --use-names \
    --output {output.kraken} \
    --report {output.report} \
    {params.file_spec} >{log} 2>&1
    """

# Use rule inheritance to override the input and params.file_spec for the single-end case.
use rule kraken as kraken_se with:
  input:
    fq1 = lambda wildcards: fq_R1_names[wildcards.sample],
  params:
    file_spec = lambda wc, input: f"{input.fq1}"
  message:
    "Kraken2 SE: {wildcards.sample}"

# ----------------------------------------------------------------------------
# set the rule order to use paired end (if possible) over just single-end data

ruleorder: kraken > kraken_se
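
To see what the two dictionaries end up containing, the mapping step can be tried in plain Python on a handful of made-up file names, without touching the file system. This is only a sketch: the file names and the helper name build_read_maps are hypothetical, and the regular expression mirrors the _S\d+_ constraint used in glob_wildcards() rather than the shorter _S.* pattern from the Snakefile.

```python
import os
import re


def build_read_maps(paths):
    """Map sample names to their R1 and R2 fastq files.

    The sample name is the file name up to the Illumina sample index,
    i.e. everything before the first "_S<number>_" substring.
    """
    r1, r2 = {}, {}
    for path in paths:
        name = os.path.basename(path)
        sample = re.sub(r"_S\d+_.*", "", name)
        if "_R1_" in name:
            r1[sample] = path
        elif "_R2_" in name:
            r2[sample] = path
    return r1, r2


# Hypothetical file names: one paired-end and one single-end sample.
example_paths = [
    "fastq-pe/sampleA_S12_L001_R1_001.fastq.gz",
    "fastq-pe/sampleA_S12_L001_R2_001.fastq.gz",
    "fastq-se/sampleB_S7_L001_R1_001.fastq.gz",
]

fq_R1_names, fq_R2_names = build_read_maps(example_paths)
print(fq_R1_names)
print(fq_R2_names)
```

The single-end sample sampleB only appears in the R1 dictionary, which is what distinguishes it from the paired-end sample when the input lambdas look up the files.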

One Snakemake workflow for paired-end and single-end data

Tue, 24 Oct 2023 21:00:21 +0200

Recently, I was thinking about how to set up a Snakemake workflow that can deal with both single-end and paired-end sequencing datasets. For Illumina, paired-end sequencing results in two FASTQ files for each sample: one file with the forward reads and one file with the reverse reads, which have R1 and R2 in the file names. Samples from single-end sequencing only have the R1 file.

While most downstream programs can process samples from either sequencing type, the syntax when calling the programs usually differs slightly when having either one or two input FASTQ files.

This needs to be accounted for when writing the Snakemake rules, ideally without adding too much boilerplate code.

Since Snakemake uses files as intermediates between rules, it seems obvious that one would need two rules for each program, as the rule inputs will differ depending on having only R1 or R1 and R2 FASTQ files.

The example below showcases how Snakemake’s rule inheritance can be used to create a rule for running Kraken2 on paired-end FASTQ files and a derived rule that deals with the single-end case. Here, the input, params and message sections are overridden, while all other sections are inherited from the first rule.

Each rule’s params section defines the part of the Kraken2 command that specifies the input files, which is then inserted in the shell block. As the shell block is only defined in the first rule, we don’t need to duplicate the whole Kraken2 command in both rules.

We can also use a lambda function in the rules’ params section to access their inputs, which avoids the duplication of file names. Thanks to Nick Waters for this idea!
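
The role of params.file_spec can be illustrated outside of Snakemake with a few lines of plain Python. The sketch below is only an illustration: the function name kraken_file_spec, the shortened command template and the file names are made up, and the strings it builds correspond to what the two params lambdas in this post's Snakefile produce for the paired-end and the single-end case.

```python
def kraken_file_spec(fq1, fq2=None):
    """Return the part of the kraken2 command that names the input files."""
    if fq2 is None:
        # single-end: just the one fastq file
        return f"{fq1}"
    # paired-end: both files, preceded by --paired
    return f"--paired {fq1} {fq2}"


# The rest of the command is the same in both cases (shortened here).
command_template = "kraken2 --db /path/to/kraken2/db {file_spec}"

pe_command = command_template.format(
    file_spec=kraken_file_spec("fastq-pe/s1_R1.fastq.gz", "fastq-pe/s1_R2.fastq.gz")
)
se_command = command_template.format(
    file_spec=kraken_file_spec("fastq-se/s2_R1.fastq.gz")
)
print(pe_command)
print(se_command)
```

Only the file specification differs between the two commands, which is why the shell block needs to be written just once.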

I think this is quite an elegant solution to this problem. While we need two rules, most code is only defined in one rule and the second rule only contains the necessary adjustments for the single-end case.

To make it work, the single-end and paired-end samples need to be stored in distinct folders, here called fastq-se/ and fastq-pe/.

from glob import glob

# ----------------------------------------------------------------------------
# get list of sample IDs from fastq files, using file name prefixes up to _R1
# single-end samples are in folder fastq-se
# paired-end samples are in folder fastq-pe

SAMPLES = glob_wildcards("fastq-pe/{sample}_R1.fastq.gz")[0] + glob_wildcards("fastq-se/{sample}_R1.fastq.gz")[0]

# ----------------------------------------------------------------------------
# generate list of desired output files: one kraken2 report per sample

KRAKEN_REPORT = expand("kraken2/{sample}.report", sample=SAMPLES)

# ----------------------------------------------------------------------------

rule all:
  input:
    KRAKEN_REPORT

# ----------------------------------------------------------------------------

# The default rule for Kraken2 uses paired-end reads.
# params.file_spec contains the part of the kraken2 command that specifies its input files
rule kraken:
  threads: 10
  input:
    fq1 = "fastq-pe/{sample}_R1.fastq.gz",
    fq2 = "fastq-pe/{sample}_R2.fastq.gz"
  output:
    kraken = temp("kraken2/{sample}.kraken"),
    report = "kraken2/{sample}.report"
  params:
    file_spec = lambda wc, input: f"--paired {input.fq1} {input.fq2}"
  message:
    "Kraken2 PE: {wildcards.sample}"
  log:
    "logs/kraken2/{sample}.txt"
  benchmark:
    "benchmark/kraken2/{sample}.tsv"
  shell:
    """
    kraken2 \
    --threads {threads} --memory-mapping --gzip-compressed \
    --confidence 0.05 \
    --db /path/to/kraken2/db \
    --use-names \
    --output {output.kraken} \
    --report {output.report} \
    {params.file_spec} >{log} 2>&1
    """

# Use rule inheritance to override the input and params.file_spec for the single-end case.
use rule kraken as kraken_se with:
  input:
    fq1 = "fastq-se/{sample}_R1.fastq.gz"
  params:
    file_spec = lambda wc, input: f"{input.fq1}"
  message:
    "Kraken2 SE: {wildcards.sample}"

On downloading genome assemblies from NCBI GenBank and RefSeq

Sat, 26 Aug 2023 06:11:21 +0100

Recently, I wanted to download all genome assemblies of fungi from NCBI GenBank and RefSeq for making a Kraken database. I used my download-refseq-genomes program for downloading assemblies from RefSeq, which worked just fine:

download-refseq-genomes.pl -t fna -r -g -a 4751

(The NCBI taxon ID for fungi is 4751)

Similarly for GenBank assemblies, I wanted to use download-genbank-genomes for downloading assemblies marked as representative or reference assemblies, but that would again include assemblies that I had already downloaded from RefSeq. So the problem was to keep only those assemblies that are not contained in RefSeq.

Luckily, the assembly_summary_genbank.txt file already has a column with the corresponding RefSeq accession number if it exists, so I extended download-genbank-genomes.pl with an option (-g) to skip assemblies that have a RefSeq accession:

download-genbank-genomes.pl -t fna -r -g -a 4751

There is also a solution to this problem using the datasets and dataformat tools from NCBI:

datasets summary genome taxon 4751 --as-json-lines --assembly-source genbank | \
dataformat tsv genome --fields accession,assminfo-paired-assm-accession

which will return this two-column TSV file:

Assembly Accession  Assembly Paired Assembly Accession
[...]
GCA_030573135.1
GCA_030561325.1
GCA_030574775.1
GCA_001049995.1
GCA_001189475.1     GCF_001189475.1
GCA_002759435.2
GCA_002775015.1     GCF_002775015.1
GCA_003013715.2     GCF_003013715.1
GCA_003014415.1
GCA_004287075.1
GCA_005234155.1
GCA_007168705.1
GCA_008275145.1
GCA_008729165.1
[...]

Now we can filter the lines to keep only those without a RefSeq ID in the second column and use xargs to download the desired assemblies via datasets again:

datasets summary genome taxon 4751 --as-json-lines --assembly-source genbank |  \
dataformat tsv genome --elide-header --fields accession,assminfo-paired-assm-accession | \
grep -v GCF_ | \
xargs datasets download genome accession

or even better: use the RefSeq accession if it exists, otherwise use the GenBank accession for downloading:

datasets summary genome taxon 4751 --as-json-lines --assembly-source genbank | \
dataformat tsv genome --elide-header --fields accession,assminfo-paired-assm-accession | \
perl -lane 'print $F[1] ? $F[1] : $F[0]' | \
xargs datasets download genome accession
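
For readers who prefer Python over the Perl one-liner, the same selection can be sketched as follows. It assumes tab-separated input lines as printed by dataformat with --elide-header; the function name pick_accession is made up for this example, and the sample rows are taken from the table above.

```python
def pick_accession(line):
    """Return the RefSeq accession if the line has one, else the GenBank accession."""
    fields = line.rstrip("\n").split("\t")
    genbank = fields[0].strip()
    # the second column is empty (or missing) when there is no paired RefSeq assembly
    refseq = fields[1].strip() if len(fields) > 1 else ""
    return refseq if refseq else genbank


rows = [
    "GCA_001049995.1\t",
    "GCA_001189475.1\tGCF_001189475.1",
    "GCA_003013715.2\tGCF_003013715.1",
]
accessions = [pick_accession(row) for row in rows]
print("\n".join(accessions))
```

In a pipeline, the lines would be read from sys.stdin instead of the hard-coded list.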

libGL error on RStudio + conda

Thu, 06 Jul 2023 19:11:11 +0100

Again, I encountered shenanigans when running RStudio using base R and packages installed via conda-forge!

This time, the error message is:

TTY detected. Printing informational message about logging configuration. Logging configuration loaded from '/etc/xdg/xdg-mate:/etc/xdg/rstudio/logging.conf'. Logging to '/home/ptr/.local/share/rstudio/log/rdesktop.log'.
libGL error: MESA-LOADER: failed to open swrast: /usr/lib/dri/swrast_dri.so: cannot open shared object file: No such file or directory (search paths /usr/lib/x86_64-linux-gnu/dri:\$${ORIGIN}/dri:/usr/lib/dri, suffix _dri)
libGL error: failed to load driver: swrast
WebEngineContext used before QtWebEngine::initialize() or OpenGL context creation failed.
Failed to create OpenGL context for format QSurfaceFormat(version 2.0, options QFlags(), depthBufferSize 24, redBufferSize -1, greenBufferSize -1, blueBufferSize -1, alphaBufferSize -1, stencilBufferSize 8, samples 0, swapBehavior QSurfaceFormat::DefaultSwapBehavior, swapInterval 1, colorSpace QSurfaceFormat::DefaultColorSpace, profile  QSurfaceFormat::NoProfile)

The issue is fixed by downgrading the version of libffi in the conda environment file to v3.4.2:

  - conda-forge::libffi=3.4.2

R Shiny app in Docker using a conda environment

Tue, 21 Mar 2023 12:00:00 +0100

There are many ways of running a Shiny Server in a Docker container, for example using the rocker/shiny images.

However, this approach becomes more involved when one wants to run the Shiny server and apps with base R and additional packages from a conda environment using the conda-forge channel. The advantage here is that one can easily declare the used packages with version numbers in a conda environment file.

I am still trying to figure out a good way to run a full Shiny server from a conda environment in Docker, but found a nice minimal solution for running just a single Shiny app in a Docker container, without the need for a full Shiny server installation.

Setup

We need three main components: the Dockerfile, the conda environment definition and the code of the R Shiny app.

├── app
│   ├── app.R
│   ├── server.R
│   └── ui.R
├── conda-env.yaml
└── Dockerfile

Dockerfile

The Dockerfile is based on the mambaorg/micromamba image, which is extended by copying the Shiny app, which in turn is simply run via the shiny::shinyApp() function.

FROM mambaorg/micromamba:1.3.1
COPY --chown=$MAMBA_USER:$MAMBA_USER conda-env.yaml /tmp/env.yaml
RUN micromamba install -y -n base -f /tmp/env.yaml && \
    micromamba clean --all --yes

COPY --chmod=777 ./app/*.R ./

CMD ["/bin/bash", "-c", "./app.R"]

Conda environment

The conda environment using packages from the conda-forge channel is defined in a file called conda-env.yaml, for example like this:

name: base
channels:
  - conda-forge
dependencies:
  - conda-forge::r-base=4.2.3
  - conda-forge::r-tidyverse=2.0.0
  - conda-forge::r-shiny=1.7.4
  - conda-forge::r-shinyjs=2.1.0
  - conda-forge::r-dt=0.27
  - conda-forge::r-shinycssloaders=1.0.0
  ...

Shiny app

Finally, the Shiny app is contained in app/app.R, which in turn can include other files, e.g. ui.R or server.R in this example:

#!/usr/bin/env Rscript
library(shiny)

source("ui.R")
source("server.R")

options(shiny.host = '0.0.0.0')
options(shiny.port = 3838)

# Run Shiny app
shinyApp(ui = ui, server = server)

Build image and run container

For running the container on port 8001:

docker build -t docker-micromamba-shiny:latest .
docker run -d -p 8001:3838 docker-micromamba-shiny:latest

Open the app in the web browser at http://localhost:8001/.

Notes

The example source code from above is in this GitHub repository.
In contrast to a standard Shiny server that can host multiple apps at different URLs, the Shiny app in this example runs at the root URL path.

GLIBCXX error on RStudio + conda

Tue, 07 Mar 2023 21:14:21 +0100

I usually install R and the packages required for a project through conda and the conda-forge channel and do not use a system-wide R installation through the Linux package manager. However, RStudio is not installed via conda, but using the deb-package file provided by Posit, which means it cannot be run without first activating a conda environment that at least contains the r-base package.

For some weeks now, I have frequently encountered this error when loading packages in RStudio:

> library(tidyverse)
Error: package or namespace load failed for ‘tidyverse’:
 .onAttach failed in attachNamespace() for 'tidyverse', details:
  call: NULL
  error: package or namespace load failed for ‘tidyr’ in dyn.load(file, DLLpath = DLLpath, ...):
 unable to load shared object '/home/ptr/software/miniconda3/envs/my-env/lib/R/library/dplyr/libs/dplyr.so':
  /usr/lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.29' not found (required by /home/ptr/software/miniconda3/envs/my-env/lib/R/library/dplyr/libs/dplyr.so)

The operating system is Linux Mint 20.3 (Una), which is not the newest distribution one could have in 2023, but it still does its job well enough so far.

Running strings /lib/x86_64-linux-gnu/libstdc++.so.6 | grep GLIBCXX reveals that the highest available version of the C++ standard library in Linux Mint is GLIBCXX_3.4.28, so we are just a bit behind the required GLIBCXX_3.4.29. :-(
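
The same check can be sketched in Python, which also sorts the version strings numerically. This is a rough equivalent of the strings and grep combination, not a replacement for it; the function name and the example bytes are made up, and in practice one would pass the contents of the libstdc++.so.6 file.

```python
import re


def glibcxx_versions(blob):
    """Return all GLIBCXX_x.y[.z] versions found in a bytes object, sorted numerically."""
    found = set(re.findall(rb"GLIBCXX_(\d+(?:\.\d+)+)", blob))
    versions = [tuple(int(part) for part in v.decode().split(".")) for v in found]
    return sorted(versions)


# Made-up stand-in for the file contents; for a real library one could use
# blob = open("/lib/x86_64-linux-gnu/libstdc++.so.6", "rb").read()
blob = b"\x00GLIBCXX_3.4\x00GLIBCXX_3.4.9\x00GLIBCXX_3.4.28\x00GLIBCXX_3.4.10\x00"

versions = glibcxx_versions(blob)
highest = ".".join(str(part) for part in versions[-1])
print("highest:", "GLIBCXX_" + highest)
```

Comparing the versions as tuples of integers avoids the pitfall of a plain text sort, which would place 3.4.9 after 3.4.28.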

However, the conda environment should also include this library in the appropriate version matching the r-base package; it is just not used by RStudio. Funnily enough, the error does not occur when just running R in the terminal and trying to load the library?!

A quick fix is to add the lib/ folder of the conda environment to the environment variable LD_LIBRARY_PATH and restart RStudio:

conda activate my-env
export LD_LIBRARY_PATH=$CONDA_PREFIX/lib
rstudio

Now the library can be loaded without problems!

GitHub Action for checking a conda environment for upgradable packages

Mon, 23 Jan 2023 12:00:00 +0100

When experimenting with GitHub Actions, I made a workflow called check-conda-envs for checking conda environment definition files (in YAML format) for available package upgrades. The action will create a table in the workflow summary page containing the current and latest version number for each package, and also a link to the changelog for many bioinformatics packages.

Here is an example output from a workflow run in the ont-assembly-snake repository:

The workflow will fail when a package definition does not use = or == before the version number, e.g. the snakemake>=6.15.5 in the above example.
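
The kind of check described here can be sketched in a few lines of Python. This is not the code used by check-conda-envs, only an illustration under the assumption that each dependency is a plain string as found in a conda environment file; the function name and the regular expression are made up, and build strings (a second = after the version) are not handled.

```python
import re

# package name, optionally prefixed by a channel, then = or == and a version
PINNED = re.compile(r"^(?:[\w.-]+::)?[\w.-]+={1,2}[^=<>!~\s]+$")


def is_pinned(spec):
    """Return True if the package spec gives a version with = or ==."""
    return PINNED.match(spec.strip()) is not None


specs = [
    "conda-forge::r-base=4.2.3",
    "snakemake==6.15.5",
    "snakemake>=6.15.5",
    "samtools",
]
unpinned = [spec for spec in specs if not is_pinned(spec)]
print(unpinned)
```

With these example specs, the list of offending entries contains the >= definition and the package without any version.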

It’s very easy to include the workflow in a repository, either by adding the file check-conda-envs.yml to the .github/workflows/ folder or by adding this job to an existing workflow:

  check-conda-upgrades:
    runs-on: "ubuntu-latest"

    # this defines the repository folder in which the conda environment files (*.y[a]ml) are located
    # multiple folders can be set with: TARGET: "env1 subfolder/env2"
    env:
      TARGET: "env"

    steps:
      # checkout this repository
      - uses: actions/checkout@v3

      # checkout pmenzel/gh-actions
      - uses: actions/checkout@v3
        with:
          repository: pmenzel/gh-actions
          ref: master
          path: ./external/gh-actions

      # https://github.com/marketplace/actions/setup-miniconda
      - uses: conda-incubator/setup-miniconda@v2
        with:
          channels: conda-forge,bioconda
      - run: |
          conda info

      - name: Run gh-actions/check-conda-envs/check-all-conda-envs.sh
        run: ./external/gh-actions/check-conda-envs/check-all-conda-envs.sh ${{env.TARGET}}