<?xml version="1.0" encoding="utf-8" standalone="yes"?><?xml-stylesheet href="/feed_style.xsl" type="text/xsl"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="https://www.rssboard.org/media-rss">
  <channel>
    <title>Peter Menzel</title>
    <link>https://menzel.tech/</link>
    <description>Recent content on Peter Menzel</description>
    <generator>Hugo -- gohugo.io</generator>
    <language>en</language>
    <copyright>Peter Menzel</copyright>
    <lastBuildDate>Tue, 09 Apr 2024 21:00:00 +0100</lastBuildDate><atom:link href="https://menzel.tech/index.xml" rel="self" type="application/rss+xml" /><icon>https://menzel.tech/logo.svg</icon>
    
    
    <item>
      <title>About debug messages in knitr</title>
      <link>https://menzel.tech/posts/2024-03-26-knitr-messages/</link>
      <pubDate>Tue, 09 Apr 2024 21:00:00 +0100</pubDate>
      
      <guid>https://menzel.tech/posts/2024-03-26-knitr-messages/</guid>
      <description><![CDATA[<p>When writing markdown documents for processing with <code>knitr</code>,
warnings and messages are usually shown in the output document.</p>
<p>Often, I also want to print debug messages to the console, e.g. for
indicating progress etc.</p>
<p>For this purpose, I was using the code chunk option <code>message=FALSE</code>,
which made the output of <code>message()</code> show up in the console rather than in the output document:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-fallback" data-lang="fallback"><span style="display:flex;"><span>---
</span></span><span style="display:flex;"><span>title: &#34;Example document&#34;
</span></span><span style="display:flex;"><span>output:
</span></span><span style="display:flex;"><span>  html_document
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>---
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>My doc does things.
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>```{r chunk1, echo=FALSE, message=FALSE}
</span></span><span style="display:flex;"><span>message(&#34;DOING THINGS&#34;)
</span></span><span style="display:flex;"><span>```
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>```{r chunk2, echo=FALSE, message=FALSE}
</span></span><span style="display:flex;"><span>message(&#34;DOING OTHER THINGS&#34;)
</span></span><span style="display:flex;"><span>```
</span></span></code></pre></div><p>Console output when knitting:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-fallback" data-lang="fallback"><span style="display:flex;"><span>processing file: example.Rmd
</span></span><span style="display:flex;"><span>  |..............                                                        |  20%
</span></span><span style="display:flex;"><span>  ordinary text without R code
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>  |............................                                          |  40%
</span></span><span style="display:flex;"><span>label: chunk1 (with options)
</span></span><span style="display:flex;"><span>List of 2
</span></span><span style="display:flex;"><span> $ echo   : logi FALSE
</span></span><span style="display:flex;"><span> $ message: logi FALSE
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>DOING THINGS
</span></span><span style="display:flex;"><span>  |..........................................                            |  60%
</span></span><span style="display:flex;"><span>  ordinary text without R code
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>  |........................................................              |  80%
</span></span><span style="display:flex;"><span>label: chunk2 (with options)
</span></span><span style="display:flex;"><span>List of 2
</span></span><span style="display:flex;"><span> $ echo   : logi FALSE
</span></span><span style="display:flex;"><span> $ message: logi FALSE
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>DOING OTHER THINGS
</span></span><span style="display:flex;"><span>  |......................................................................| 100%
</span></span><span style="display:flex;"><span>  ordinary text without R code
</span></span></code></pre></div><p>Recently, I noticed that these messages no longer showed up in the console output, and they were not visible in the output documents either,
so I began to question whether this had ever worked.</p>
<p>After some googling, I found <a href="https://yihui.org/en/2022/12/message-false/">this post</a> by <a href="https://github.com/yihui">yihui</a>, the author of the <code>knitr</code> package.
Starting with version 0.19 of the <code>evaluate</code> package, which <code>knitr</code> uses, the meaning of the chunk options <code>warning</code> and <code>message</code> changed,
so that <code>FALSE</code> now really means no output at all.</p>
<p>To get the console output back, one needs to set the option to <code>NA</code> instead, i.e.:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-fallback" data-lang="fallback"><span style="display:flex;"><span>```{r chunk1, echo=FALSE, message=NA}
</span></span><span style="display:flex;"><span>message(&#34;DOING THINGS&#34;)
</span></span><span style="display:flex;"><span>```
</span></span></code></pre></div><p>Of course, these options can also be set globally for all code chunks, using:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-fallback" data-lang="fallback"><span style="display:flex;"><span>knitr::opts_chunk$set(echo = FALSE, message = NA)
</span></span></code></pre></div><p>at the beginning of the document.</p>
<p>All chunk options and package options for <code>knitr</code> are listed on <a href="https://yihui.org/knitr/options/">knitr&rsquo;s website</a>
and there are also ways for <a href="https://yihui.org/en/2021/05/knitr-reuse/">reusing options</a>.</p>
]]></description>
      
    </item>
    
    
    
    <item>
      <title>Paper: Snakemake workflows for long-read bacterial genome assembly and evaluation</title>
      <link>https://menzel.tech/posts/2024-04-01-snakemake-assembly-workflows-paper/</link>
      <pubDate>Mon, 01 Apr 2024 20:00:00 +0100</pubDate>
      
      <guid>https://menzel.tech/posts/2024-04-01-snakemake-assembly-workflows-paper/</guid>
      <description><![CDATA[<p>My paper on two Snakemake workflows for bacterial genome assembly is now published in GigaByte:
<a href="https://gigabytejournal.com/articles/116">Snakemake workflows for long-read bacterial genome assembly and evaluation</a>.</p>
<p>The first workflow, called <a href="https://github.com/pmenzel/ont-assembly-snake">ont-assembly-snake</a>, is used for generating
multiple genome assemblies from long-read sequencing data of a bacterial isolate. The user can freely choose
from various read filtering, assembly, and polishing tools and combine them as required by creating folder names
containing the order of programs to be executed.</p>
<p>The second workflow, called <a href="https://github.com/pmenzel/score-assemblies">score-assemblies</a>, is used for evaluating one or multiple
genome assemblies using various programs, usually by comparison with reference assemblies, in order to
help the user decide on the best assembly. This workflow can be run immediately after ont-assembly-snake
and it outputs a report file in HTML format, which shows the key metrics from each program.</p>
]]></description>
      
    </item>
    
    
    
    <item>
      <title>Paper: Nosocomial outbreak of Pandoraea commovens</title>
      <link>https://menzel.tech/posts/2023-11-15-pandoraea-outbreak-paper/</link>
      <pubDate>Wed, 15 Nov 2023 21:00:00 +0100</pubDate>
      
      <guid>https://menzel.tech/posts/2023-11-15-pandoraea-outbreak-paper/</guid>
      <description><![CDATA[<p>Our paper about a nosocomial outbreak of Pandoraea spp. was recently published in Emerging Infectious Diseases:
<a href="https://wwwnc.cdc.gov/eid/article/29/11/23-0493_article">Outbreak of Pandoraea commovens Infections among Non-Cystic Fibrosis Intensive Care Patients, Germany, 2019-2021</a></p>
<p>In total, 24 patients from 4 hospitals in Berlin were affected. Genomic analysis of the isolates revealed <em>Pandoraea commovens</em> as the genetically most similar species,
which was only described in 2019 by <a href="https://www.frontiersin.org/articles/10.3389/fmicb.2019.02556/full">Peeters et al.</a></p>
]]></description>
      
    </item>
    
    
    
    <item>
      <title>One Snakemake workflow for paired-end and single-end data, with globbing</title>
      <link>https://menzel.tech/posts/2023-11-09-snakemake-paired-and-single-end-with-glob/</link>
      <pubDate>Thu, 09 Nov 2023 21:00:21 +0200</pubDate>
      
      <guid>https://menzel.tech/posts/2023-11-09-snakemake-paired-and-single-end-with-glob/</guid>
      <description><![CDATA[<p>After figuring out how to make <a href="https://menzel.tech/posts/2023-10-24-snakemake-paired-and-single-end/">one Snakefile that can deal with both single-end and paired-end
sequencing files</a> in one workflow, I was still missing a puzzle piece:
How to deal with globbing to get the fastq files in the rule&rsquo;s <code>input</code> section.</p>
<p>Usually, the sample names are only a part of the fastq file name.
In the example code below, the Illumina fastq files have a substring like <code>_S123_</code> before the <code>_R1_</code> or <code>_R2_</code>,
which is the index of a sample in the samplesheet.</p>
<p>As this is not a part of the original sample name, we exclude it when using <code>glob_wildcards()</code> to get a list
of all sample names from the fastq files. But this also means that we need to find the original
fastq files from the sample names in the rule&rsquo;s input. This is usually done with the <code>glob()</code>
function, in which we use the <code>*</code> wildcard, which will match the <code>_Sxxx_</code> part of the file name.</p>
<p>After fiddling with <code>glob()</code> inside the rule&rsquo;s <code>input</code>, input functions, and lambdas, the only working
solution that I could come up with was to create two dictionaries containing a mapping from
sample IDs to their R1 and R2 file names. These dicts are then used in the <code>input</code> to specify the actual input files.</p>
<p>It looks like a bit of overhead compared to using <code>glob()</code> inside the <code>input</code>, but
the globbing needs to be done at some point anyway, so one might as well get it over with right at the beginning.</p>
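<p>The mapping step can also be sketched in plain Python, outside of Snakemake. This is only a minimal illustration under assumptions: the function name <code>map_reads</code> and the example file names are hypothetical, and the sample ID is assumed to be everything before the <code>_Sxxx_</code> part of the file name.</p>

```python
import os
import re


def map_reads(paths):
    """Map sample IDs to their R1 and R2 FASTQ paths (hypothetical helper)."""
    r1, r2 = {}, {}
    for path in paths:
        name = os.path.basename(path)
        # drop the samplesheet index and everything after it,
        # e.g. "_S1_L001_R1_001.fastq.gz"
        sample = re.sub(r"_S\d+_.*", "", name)
        if "_R1_" in name:
            r1[sample] = path
        elif "_R2_" in name:
            r2[sample] = path
    return r1, r2


# made-up file names, only for illustration
files = [
    "fastq-pe/sampleA_S1_L001_R1_001.fastq.gz",
    "fastq-pe/sampleA_S1_L001_R2_001.fastq.gz",
    "fastq-se/sampleB_S2_L001_R1_001.fastq.gz",
]
r1, r2 = map_reads(files)

# samples with both files are paired-end, the rest are single-end
paired = sorted(s for s in r1 if s in r2)
single = sorted(s for s in r1 if s not in r2)
print(paired, single)
```

<p>Keeping the lookup in two plain dicts means the rule inputs only need a dictionary access per sample, and a missing R2 entry is what distinguishes a single-end sample.</p>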
<p>In the end, we set a <a href="https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#handling-ambiguous-rules">ruleorder</a>,
which guides Snakemake towards using the paired-end fastq files over the single-end files if possible.
This is necessary, as for paired-end samples, Snakemake could in principle use
both rules to get the desired output - it will not automagically choose the
rule that has &ldquo;most&rdquo; input files.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-fallback" data-lang="fallback"><span style="display:flex;"><span>import os
</span></span><span style="display:flex;"><span>import re
</span></span><span style="display:flex;"><span>from glob import glob
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span># ----------------------------------------------------------------------------
</span></span><span style="display:flex;"><span># get list of sample IDs from fastq files, using file name prefixes up to _S
</span></span><span style="display:flex;"><span># single-end samples are in folder fastq-se
</span></span><span style="display:flex;"><span># paired-end samples are in folder fastq-pe
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>SAMPLES = list(set(glob_wildcards(r&#34;fastq-pe/{sample}{rest,_S\d+_.*}.fastq.gz&#34;)[0] + glob_wildcards(r&#34;fastq-se/{sample}{rest,_S\d+_.*}.fastq.gz&#34;)[0]))
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span># ----------------------------------------------------------------------------
</span></span><span style="display:flex;"><span># generate 2 dicts, which map sample names to their R1 and R2 file names
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>fq_R1_names = {}
</span></span><span style="display:flex;"><span>fq_R2_names = {}
</span></span><span style="display:flex;"><span>fq_names = glob(&#39;fastq-*/*.fastq.gz&#39;, recursive=True)
</span></span><span style="display:flex;"><span>for f in fq_names:
</span></span><span style="display:flex;"><span>  n = os.path.basename(f)
</span></span><span style="display:flex;"><span>  n = re.sub(r&#34;_S\d+_.*&#34;, &#34;&#34;, n)
</span></span><span style="display:flex;"><span>  if &#39;_R1_&#39; in f:
</span></span><span style="display:flex;"><span>    fq_R1_names[n] = f
</span></span><span style="display:flex;"><span>  elif &#39;_R2_&#39; in f:
</span></span><span style="display:flex;"><span>    fq_R2_names[n] = f
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span># ----------------------------------------------------------------------------
</span></span><span style="display:flex;"><span># generate list of desired output files: one kraken2 report per sample
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>KRAKEN_REPORT = expand(&#34;kraken2/{sample}.report&#34;, sample=SAMPLES)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span># ----------------------------------------------------------------------------
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>rule all:
</span></span><span style="display:flex;"><span>  input:
</span></span><span style="display:flex;"><span>    KRAKEN_REPORT
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span># ----------------------------------------------------------------------------
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span># The default rule for Kraken2 uses paired-end reads.
</span></span><span style="display:flex;"><span># params.file_spec contains the part of the kraken2 command that specifies its input files
</span></span><span style="display:flex;"><span>rule kraken:
</span></span><span style="display:flex;"><span>  threads: 10
</span></span><span style="display:flex;"><span>  input:
</span></span><span style="display:flex;"><span>    fq1 = lambda wildcards: fq_R1_names[wildcards.sample],
</span></span><span style="display:flex;"><span>    fq2 = lambda wildcards: fq_R2_names[wildcards.sample]
</span></span><span style="display:flex;"><span>  output:
</span></span><span style="display:flex;"><span>    kraken = temp(&#34;kraken2/{sample}.kraken&#34;),
</span></span><span style="display:flex;"><span>    report = &#34;kraken2/{sample}.report&#34;
</span></span><span style="display:flex;"><span>  params:
</span></span><span style="display:flex;"><span>    file_spec = lambda wc, input: f&#34;--paired {input.fq1} {input.fq2}&#34;
</span></span><span style="display:flex;"><span>  message:
</span></span><span style="display:flex;"><span>    &#34;Kraken2 PE: {wildcards.sample}&#34;
</span></span><span style="display:flex;"><span>  log:
</span></span><span style="display:flex;"><span>    &#34;logs/kraken2/{sample}.txt&#34;
</span></span><span style="display:flex;"><span>  benchmark:
</span></span><span style="display:flex;"><span>    &#34;benchmark/kraken2/{sample}.tsv&#34;
</span></span><span style="display:flex;"><span>  shell:
</span></span><span style="display:flex;"><span>    &#34;&#34;&#34;
</span></span><span style="display:flex;"><span>    kraken2 \
</span></span><span style="display:flex;"><span>    --threads {threads} --memory-mapping --gzip-compressed \
</span></span><span style="display:flex;"><span>    --confidence 0.05 \
</span></span><span style="display:flex;"><span>    --db /path/to/kraken2/db \
</span></span><span style="display:flex;"><span>    --use-names \
</span></span><span style="display:flex;"><span>    --output {output.kraken} \
</span></span><span style="display:flex;"><span>    --report {output.report} \
</span></span><span style="display:flex;"><span>    {params.file_spec} &gt;{log} 2&gt;&amp;1
</span></span><span style="display:flex;"><span>    &#34;&#34;&#34;
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span># Use rule inheritance to override the input and params.file_spec for the single-end case.
</span></span><span style="display:flex;"><span>use rule kraken as kraken_se with:
</span></span><span style="display:flex;"><span>  input:
</span></span><span style="display:flex;"><span>    fq1 = lambda wildcards: fq_R1_names[wildcards.sample],
</span></span><span style="display:flex;"><span>  params:
</span></span><span style="display:flex;"><span>    file_spec = lambda wc, input: f&#34;{input.fq1}&#34;
</span></span><span style="display:flex;"><span>  message:
</span></span><span style="display:flex;"><span>    &#34;Kraken2 SE: {wildcards.sample}&#34;
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span># ----------------------------------------------------------------------------
</span></span><span style="display:flex;"><span># set the rule order to use paired end (if possible) over just single-end data
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>ruleorder: kraken &gt; kraken_se
</span></span></code></pre></div>]]></description>
      
    </item>
    
    
    
    <item>
      <title>One Snakemake workflow for paired-end and single-end data</title>
      <link>https://menzel.tech/posts/2023-10-24-snakemake-paired-and-single-end/</link>
      <pubDate>Tue, 24 Oct 2023 21:00:21 +0200</pubDate>
      
      <guid>https://menzel.tech/posts/2023-10-24-snakemake-paired-and-single-end/</guid>
      <description><![CDATA[<p>Recently, I was thinking about how to set up a Snakemake workflow that can deal with both single-end and paired-end sequencing datasets.
For Illumina, paired-end sequencing results in two FASTQ files for each sample: one file with the forward reads and one file with the reverse reads,
which have <code>R1</code> and <code>R2</code> in the file names. Samples from single-end sequencing only have the <code>R1</code> file.</p>
<p>While most downstream programs can process samples from either sequencing type,
the syntax when calling the programs usually differs slightly when having either one or two input FASTQ files.</p>
<p>This needs to be accounted for when writing the Snakemake rules, ideally without adding too much boilerplate code.</p>
<p>Since Snakemake uses files as intermediates between rules, it seems obvious that one would need two rules for each program,
as the rule inputs will differ depending on having only R1 or R1 and R2 FASTQ files.</p>
<p>The example below showcases how <a href="https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#rule-inheritance">Snakemake&rsquo;s rule inheritance</a> can be used to create a rule for running Kraken2 on paired-end
FASTQ files and a derived rule that deals with the single-end case. Here, the <code>input</code>, <code>params</code>, and <code>message</code> sections
are overridden, while all other sections are inherited from the first rule.</p>
<p>Each rule&rsquo;s <code>params</code> section defines the part of the Kraken2 command that specifies the input files, which is then inserted in the <code>shell</code> block.
As the <code>shell</code> block is only defined in the first rule, we don&rsquo;t need to duplicate the whole Kraken2 command in both rules.</p>
<p>We can also use a lambda function in the rules&rsquo; <code>params</code> section to access their inputs, which avoids the duplication of file names.
Thanks to <a href="https://github.com/nickp60">Nick Waters</a> for this idea!</p>
<p>I think this is quite an elegant solution to this problem. While we need two rules, most code is only defined in one rule and
the second rule only contains the necessary adjustments for the single-end case.</p>
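<p>The idea of isolating the input-file part of the command can be shown in plain Python as well. This is a minimal sketch, not the workflow itself: the helper names <code>file_spec</code> and <code>kraken_command</code> and the file names are hypothetical, and the Kraken2 call is shortened to the options needed for the illustration.</p>

```python
def file_spec(fq1, fq2=None):
    """Return the part of the Kraken2 call that names the input files."""
    if fq2 is None:
        # single-end: only the R1 file
        return fq1
    # paired-end: both files, flagged with --paired
    return f"--paired {fq1} {fq2}"


def kraken_command(report, spec):
    """Assemble a shortened Kraken2 command line around the file spec."""
    return f"kraken2 --report {report} {spec}"


# made-up sample names, only for illustration
pe = kraken_command(
    "kraken2/s1.report",
    file_spec("fastq-pe/s1_R1.fastq.gz", "fastq-pe/s1_R2.fastq.gz"),
)
se = kraken_command("kraken2/s2.report", file_spec("fastq-se/s2_R1.fastq.gz"))
print(pe)
print(se)
```

<p>Only the file spec differs between the two calls, which is why the rest of the command can live in a single place.</p>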
<p>To make it work, the single-end and paired-end samples need to be stored in distinct folders, here called <code>fastq-se/</code> and <code>fastq-pe/</code>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-fallback" data-lang="fallback"><span style="display:flex;"><span>from glob import glob
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span># ----------------------------------------------------------------------------
</span></span><span style="display:flex;"><span># get list of sample IDs from fastq files, using file name prefixes up to _S
</span></span><span style="display:flex;"><span># single-end samples are in folder fastq-se
</span></span><span style="display:flex;"><span># paired-end samples are in folder fastq-pe
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>SAMPLES = glob_wildcards(&#34;fastq-pe/{sample}_R1.fastq.gz&#34;)[0] + glob_wildcards(&#34;fastq-se/{sample}_R1.fastq.gz&#34;)[0]
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span># ----------------------------------------------------------------------------
</span></span><span style="display:flex;"><span># generate list of desired output files: one kraken2 report per sample
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>KRAKEN_REPORT = expand(&#34;kraken2/{sample}.report&#34;, sample=SAMPLES)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span># ----------------------------------------------------------------------------
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>rule all:
</span></span><span style="display:flex;"><span>  input:
</span></span><span style="display:flex;"><span>    KRAKEN_REPORT
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span># ----------------------------------------------------------------------------
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span># The default rule for Kraken2 uses paired-end reads.
</span></span><span style="display:flex;"><span># params.file_spec contains the part of the kraken2 command that specifies its input files
</span></span><span style="display:flex;"><span>rule kraken:
</span></span><span style="display:flex;"><span>  threads: 10
</span></span><span style="display:flex;"><span>  input:
</span></span><span style="display:flex;"><span>    fq1 = &#34;fastq-pe/{sample}_R1.fastq.gz&#34;,
</span></span><span style="display:flex;"><span>    fq2 = &#34;fastq-pe/{sample}_R2.fastq.gz&#34;
</span></span><span style="display:flex;"><span>  output:
</span></span><span style="display:flex;"><span>    kraken = temp(&#34;kraken2/{sample}.kraken&#34;),
</span></span><span style="display:flex;"><span>    report = &#34;kraken2/{sample}.report&#34;
</span></span><span style="display:flex;"><span>  params:
</span></span><span style="display:flex;"><span>    file_spec = lambda wc, input: f&#34;--paired {input.fq1} {input.fq2}&#34;
</span></span><span style="display:flex;"><span>  message:
</span></span><span style="display:flex;"><span>    &#34;Kraken2 PE: {wildcards.sample}&#34;
</span></span><span style="display:flex;"><span>  log:
</span></span><span style="display:flex;"><span>    &#34;logs/kraken2/{sample}.txt&#34;
</span></span><span style="display:flex;"><span>  benchmark:
</span></span><span style="display:flex;"><span>    &#34;benchmark/kraken2/{sample}.tsv&#34;
</span></span><span style="display:flex;"><span>  shell:
</span></span><span style="display:flex;"><span>    &#34;&#34;&#34;
</span></span><span style="display:flex;"><span>    kraken2 \
</span></span><span style="display:flex;"><span>    --threads {threads} --memory-mapping --gzip-compressed \
</span></span><span style="display:flex;"><span>    --confidence 0.05 \
</span></span><span style="display:flex;"><span>    --db /path/to/kraken2/db \
</span></span><span style="display:flex;"><span>    --use-names \
</span></span><span style="display:flex;"><span>    --output {output.kraken} \
</span></span><span style="display:flex;"><span>    --report {output.report} \
</span></span><span style="display:flex;"><span>    {params.file_spec} &gt;{log} 2&gt;&amp;1
</span></span><span style="display:flex;"><span>    &#34;&#34;&#34;
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span># Use rule inheritance to override the input and params.file_spec for the single-end case.
</span></span><span style="display:flex;"><span>use rule kraken as kraken_se with:
</span></span><span style="display:flex;"><span>  input:
</span></span><span style="display:flex;"><span>    fq1 = &#34;fastq-se/{sample}_R1.fastq.gz&#34;
</span></span><span style="display:flex;"><span>  params:
</span></span><span style="display:flex;"><span>    file_spec = lambda wc, input: f&#34;{input.fq1}&#34;
</span></span><span style="display:flex;"><span>  message:
</span></span><span style="display:flex;"><span>    &#34;Kraken2 SE: {wildcards.sample}&#34;
</span></span></code></pre></div>]]></description>
      
    </item>
    
    
    
    <item>
      <title>On downloading genome assemblies from NCBI GenBank and RefSeq</title>
      <link>https://menzel.tech/posts/2023-08-26-ncbi-assembly-download-genbank/</link>
      <pubDate>Sat, 26 Aug 2023 06:11:21 +0100</pubDate>
      
      <guid>https://menzel.tech/posts/2023-08-26-ncbi-assembly-download-genbank/</guid>
      <description><![CDATA[<p>Recently, I wanted to download all genome assemblies of fungi from NCBI GenBank and RefSeq for making a Kraken database.
I used my <a href="https://github.com/pmenzel/download-refseq-genomes/">download-refseq-genomes</a> program for downloading assemblies from RefSeq, which worked just fine:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-fallback" data-lang="fallback"><span style="display:flex;"><span>download-refseq-genomes.pl -t fna -r -g -a 4751
</span></span></code></pre></div><p>(The NCBI taxon ID for fungi is <code>4751</code>)</p>
<p>Similarly for GenBank assemblies, I wanted to use <a href="https://github.com/pmenzel/download-genbank-genomes">download-genbank-genomes</a> for downloading
assemblies marked as representative or reference assemblies, but that would again include assemblies that I had already downloaded from RefSeq.
So the problem was to keep only those assemblies that are not contained in RefSeq.</p>
<p>Luckily, the <code>assembly_summary_genbank.txt</code> file already has a column with the corresponding RefSeq accession number if it exists,
so I extended <code>download-genbank-genomes.pl</code> with an option (<code>-g</code>) to skip assemblies that have a RefSeq accession:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-fallback" data-lang="fallback"><span style="display:flex;"><span>download-genbank-genomes.pl -t fna -r -g -a 4751
</span></span></code></pre></div><p>There is also a solution to this problem using the <code>datasets</code> and <code>dataformat</code> tools from NCBI:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-fallback" data-lang="fallback"><span style="display:flex;"><span>datasets summary genome taxon 4751 --as-json-lines --assembly-source genbank | \
</span></span><span style="display:flex;"><span>dataformat tsv genome --fields accession,assminfo-paired-assm-accession
</span></span></code></pre></div><p>which will return this two-column TSV file:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-fallback" data-lang="fallback"><span style="display:flex;"><span>Assembly Accession  Assembly Paired Assembly Accession
</span></span><span style="display:flex;"><span>[...]
</span></span><span style="display:flex;"><span>GCA_030573135.1
</span></span><span style="display:flex;"><span>GCA_030561325.1
</span></span><span style="display:flex;"><span>GCA_030574775.1
</span></span><span style="display:flex;"><span>GCA_001049995.1
</span></span><span style="display:flex;"><span>GCA_001189475.1     GCF_001189475.1
</span></span><span style="display:flex;"><span>GCA_002759435.2
</span></span><span style="display:flex;"><span>GCA_002775015.1     GCF_002775015.1
</span></span><span style="display:flex;"><span>GCA_003013715.2     GCF_003013715.1
</span></span><span style="display:flex;"><span>GCA_003014415.1
</span></span><span style="display:flex;"><span>GCA_004287075.1
</span></span><span style="display:flex;"><span>GCA_005234155.1
</span></span><span style="display:flex;"><span>GCA_007168705.1
</span></span><span style="display:flex;"><span>GCA_008275145.1
</span></span><span style="display:flex;"><span>GCA_008729165.1
</span></span><span style="display:flex;"><span>[...]
</span></span></code></pre></div><p>Now we can filter the lines to keep only those without a RefSeq accession in the second column and use <code>xargs</code> to download the desired assemblies via <code>datasets</code> again:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-fallback" data-lang="fallback"><span style="display:flex;"><span>datasets summary genome taxon 4751 --as-json-lines --assembly-source genbank |  \
</span></span><span style="display:flex;"><span>dataformat tsv genome --elide-header --fields accession,assminfo-paired-assm-accession | \
</span></span><span style="display:flex;"><span>grep -v GCF_ | \
</span></span><span style="display:flex;"><span>xargs datasets download genome accession
</span></span></code></pre></div><p>or even better: use the RefSeq accession if it exists, otherwise use the GenBank accession for downloading:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-fallback" data-lang="fallback"><span style="display:flex;"><span>datasets summary genome taxon 4751 --as-json-lines --assembly-source genbank | \
</span></span><span style="display:flex;"><span>dataformat tsv genome --elide-header --fields accession,assminfo-paired-assm-accession | \
</span></span><span style="display:flex;"><span>perl -lane &#39;print $F[1] ? $F[1] : $F[0]&#39; | \
</span></span><span style="display:flex;"><span>xargs datasets download genome accession
</span></span></code></pre></div>]]></description>
      
    </item>
    
    
    
    <item>
      <title>libGL error on RStudio &#43; conda</title>
      <link>https://menzel.tech/posts/2023-07-06-rstudio-conda-again/</link>
      <pubDate>Thu, 06 Jul 2023 19:11:11 +0100</pubDate>
      
      <guid>https://menzel.tech/posts/2023-07-06-rstudio-conda-again/</guid>
      <description><![CDATA[<p>Again, I encountered shenanigans when running RStudio using base R and packages installed via conda-forge!</p>
<p>This time, the error message is:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-fallback" data-lang="fallback"><span style="display:flex;"><span>TTY detected. Printing informational message about logging configuration. Logging configuration loaded from &#39;/etc/xdg/xdg-mate:/etc/xdg/rstudio/logging.conf&#39;. Logging to &#39;/home/ptr/.local/share/rstudio/log/rdesktop.log&#39;.
</span></span><span style="display:flex;"><span>libGL error: MESA-LOADER: failed to open swrast: /usr/lib/dri/swrast_dri.so: cannot open shared object file: No such file or directory (search paths /usr/lib/x86_64-linux-gnu/dri:\$${ORIGIN}/dri:/usr/lib/dri, suffix _dri)
</span></span><span style="display:flex;"><span>libGL error: failed to load driver: swrast
</span></span><span style="display:flex;"><span>WebEngineContext used before QtWebEngine::initialize() or OpenGL context creation failed.
</span></span><span style="display:flex;"><span>Failed to create OpenGL context for format QSurfaceFormat(version 2.0, options QFlags&lt;QSurfaceFormat::FormatOption&gt;(), depthBufferSize 24, redBufferSize -1, greenBufferSize -1, blueBufferSize -1, alphaBufferSize -1, stencilBufferSize 8, samples 0, swapBehavior QSurfaceFormat::DefaultSwapBehavior, swapInterval 1, colorSpace QSurfaceFormat::DefaultColorSpace, profile  QSurfaceFormat::NoProfile)
</span></span></code></pre></div><p>The issue is fixed by downgrading the version of <code>libffi</code> in the conda environment file to v3.4.2:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-fallback" data-lang="fallback"><span style="display:flex;"><span>  - conda-forge::libffi=3.4.2
</span></span></code></pre></div>]]></description>
      
    </item>
    
    
    
    <item>
      <title>R Shiny app in Docker using a conda environment</title>
      <link>https://menzel.tech/posts/2023-03-21-conda-shiny-docker/</link>
      <pubDate>Tue, 21 Mar 2023 12:00:00 +0100</pubDate>
      
      <guid>https://menzel.tech/posts/2023-03-21-conda-shiny-docker/</guid>
      <description><![CDATA[<p>There are many ways of running a Shiny Server in a Docker container, for example using the <a href="https://rocker-project.org/images/versioned/shiny.html">rocker/shiny</a> images.</p>
<p>However, this approach becomes more involved when one wants to
run the Shiny server and apps with base R and additional packages from a conda environment using the conda-forge channel.
The advantage here is that one can easily declare the used packages with version numbers
in a <a href="https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#create-env-file-manually">conda environment file</a>.</p>
<p>I am still trying to figure out a good way to run a full Shiny server from a conda environment in Docker,
but found a nice minimal solution for running just a single Shiny app in a Docker container, without the need
for a full Shiny server installation.</p>
<h3 id="setup">Setup</h3>
<p>We need three main components: the <code>Dockerfile</code>, the conda environment definition and the code of the R Shiny app.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-fallback" data-lang="fallback"><span style="display:flex;"><span>├── app
</span></span><span style="display:flex;"><span>│   ├── app.R
</span></span><span style="display:flex;"><span>│   ├── server.R
</span></span><span style="display:flex;"><span>│   └── ui.R
</span></span><span style="display:flex;"><span>├── conda-env.yaml
</span></span><span style="display:flex;"><span>└── Dockerfile
</span></span></code></pre></div><h4 id="dockerfile">Dockerfile</h4>
<p>The <code>Dockerfile</code> is based on the <a href="https://github.com/mamba-org/micromamba-docker">mambaorg/micromamba image</a>,
which is extended by copying the Shiny app, which in turn is simply run via the <code>shiny::shinyApp()</code> function.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-fallback" data-lang="fallback"><span style="display:flex;"><span>FROM mambaorg/micromamba:1.3.1
</span></span><span style="display:flex;"><span>COPY --chown=$MAMBA_USER:$MAMBA_USER conda-env.yaml /tmp/env.yaml
</span></span><span style="display:flex;"><span>RUN micromamba install -y -n base -f /tmp/env.yaml &amp;&amp; \
</span></span><span style="display:flex;"><span>    micromamba clean --all --yes
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>COPY --chmod=777 ./app/*.R ./
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>CMD [&#34;/bin/bash&#34;, &#34;-c&#34;, &#34;./app.R&#34;]
</span></span></code></pre></div><h4 id="conda-environment">Conda environment</h4>
<p>The conda environment using packages from the conda-forge channel is defined in a file called <code>conda-env.yaml</code>,
for example like this:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-fallback" data-lang="fallback"><span style="display:flex;"><span>name: base
</span></span><span style="display:flex;"><span>channels:
</span></span><span style="display:flex;"><span>  - conda-forge
</span></span><span style="display:flex;"><span>dependencies:
</span></span><span style="display:flex;"><span>  - conda-forge::r-base=4.2.3
</span></span><span style="display:flex;"><span>  - conda-forge::r-tidyverse=2.0.0
</span></span><span style="display:flex;"><span>  - conda-forge::r-shiny=1.7.4
</span></span><span style="display:flex;"><span>  - conda-forge::r-shinyjs=2.1.0
</span></span><span style="display:flex;"><span>  - conda-forge::r-dt=0.27
</span></span><span style="display:flex;"><span>  - conda-forge::r-shinycssloaders=1.0.0
</span></span><span style="display:flex;"><span>  ...
</span></span></code></pre></div><h4 id="shiny-app">Shiny app</h4>
<p>Finally, the Shiny app is contained in <code>app/app.R</code>, which in turn can include other files, e.g. <code>ui.R</code> or <code>server.R</code> in this example:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-fallback" data-lang="fallback"><span style="display:flex;"><span>#!/usr/bin/env Rscript
</span></span><span style="display:flex;"><span>library(shiny)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>source(&#34;ui.R&#34;)
</span></span><span style="display:flex;"><span>source(&#34;server.R&#34;)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>options(shiny.host = &#39;0.0.0.0&#39;)
</span></span><span style="display:flex;"><span>options(shiny.port = 3838)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span># Run Shiny app
</span></span><span style="display:flex;"><span>shinyApp(ui = ui, server = server)
</span></span></code></pre></div><h3 id="build-image-and-run-container">Build image and run container</h3>
<p>To build the image and run the container with the app exposed on port 8001:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-fallback" data-lang="fallback"><span style="display:flex;"><span>docker build -t docker-micromamba-shiny:latest .
</span></span><span style="display:flex;"><span>docker run -d -p 8001:3838 docker-micromamba-shiny:latest
</span></span></code></pre></div><p>Open the app in the web browser at <code>http://localhost:8001/</code>.</p>
<h3 id="notes">Notes</h3>
<ul>
<li>The example source code from above is in this <a href="https://github.com/pmenzel/docker-micromamba-shiny">GitHub repository</a>.</li>
<li>In contrast to a standard Shiny server that can host multiple apps at different URLs,
the Shiny app in this example runs at the root URL path.</li>
</ul>
]]></description>
      
    </item>
    
    
    
    <item>
      <title>GLIBCXX error on RStudio &#43; conda</title>
      <link>https://menzel.tech/posts/2023-03-07-rstudio-conda-libstdc&#43;&#43;/</link>
      <pubDate>Tue, 07 Mar 2023 21:14:21 +0100</pubDate>
      
      <guid>https://menzel.tech/posts/2023-03-07-rstudio-conda-libstdc&#43;&#43;/</guid>
      <description><![CDATA[<p>I usually install R and the packages required for a project through conda and the conda-forge channel
and do not use a system-wide R installation through the Linux package manager.
However, RStudio is not installed via conda, but using the <code>deb</code>-package file <a href="https://posit.co/download/rstudio-desktop/#download">provided by Posit</a>, which means it cannot be run without first
activating a conda environment that at least contains the <code>r-base</code> package.</p>
<p>For a few weeks now, I have frequently encountered this error when loading packages in RStudio:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-go" data-lang="go"><span style="display:flex;"><span>&gt; <span style="color:#50fa7b">library</span>(tidyverse)
</span></span><span style="display:flex;"><span>Error: <span style="color:#ff79c6">package</span> or namespace load failed <span style="color:#ff79c6">for</span> ‘tidyverse’:
</span></span><span style="display:flex;"><span> .onAttach failed in <span style="color:#50fa7b">attachNamespace</span>() <span style="color:#ff79c6">for</span> &#39;tidyverse&#39;, details:
</span></span><span style="display:flex;"><span>  call: NULL
</span></span><span style="display:flex;"><span>  <span style="color:#8be9fd">error</span>: <span style="color:#ff79c6">package</span> or namespace load failed <span style="color:#ff79c6">for</span> ‘tidyr’ in dyn.<span style="color:#50fa7b">load</span>(file, DLLpath = DLLpath, <span style="color:#ff79c6">...</span>):
</span></span><span style="display:flex;"><span> unable to load shared object &#39;<span style="color:#ff79c6">/</span>home<span style="color:#ff79c6">/</span>ptr<span style="color:#ff79c6">/</span>software<span style="color:#ff79c6">/</span>miniconda3<span style="color:#ff79c6">/</span>envs<span style="color:#ff79c6">/</span>my<span style="color:#ff79c6">-</span>env<span style="color:#ff79c6">/</span>lib<span style="color:#ff79c6">/</span>R<span style="color:#ff79c6">/</span>library<span style="color:#ff79c6">/</span>dplyr<span style="color:#ff79c6">/</span>libs<span style="color:#ff79c6">/</span>dplyr.so&#39;:
</span></span><span style="display:flex;"><span>  <span style="color:#ff79c6">/</span>usr<span style="color:#ff79c6">/</span>lib<span style="color:#ff79c6">/</span>x86_64<span style="color:#ff79c6">-</span>linux<span style="color:#ff79c6">-</span>gnu<span style="color:#ff79c6">/</span>libstdc<span style="color:#ff79c6">++</span>.so<span style="color:#bd93f9">.6</span>: version `GLIBCXX_3<span style="color:#bd93f9">.4.29</span>&#39; not <span style="color:#50fa7b">found</span> (required by <span style="color:#ff79c6">/</span>home<span style="color:#ff79c6">/</span>ptr<span style="color:#ff79c6">/</span>software<span style="color:#ff79c6">/</span>miniconda3<span style="color:#ff79c6">/</span>envs<span style="color:#ff79c6">/</span>my<span style="color:#ff79c6">-</span>env<span style="color:#ff79c6">/</span>lib<span style="color:#ff79c6">/</span>R<span style="color:#ff79c6">/</span>library<span style="color:#ff79c6">/</span>dplyr<span style="color:#ff79c6">/</span>libs<span style="color:#ff79c6">/</span>dplyr.so)
</span></span></code></pre></div><p>The operating system is Linux Mint 20.3 (Una), which is not the newest distribution one could have in 2023, but it still does its job well enough so far.</p>
<p>Running <code>strings /lib/x86_64-linux-gnu/libstdc++.so.6 | grep GLIBCXX</code> reveals that
the highest symbol version provided by the system&rsquo;s C++ standard library in Linux Mint is <code>GLIBCXX_3.4.28</code>, so we are just a bit behind the required <code>GLIBCXX_3.4.29</code>. :-(</p>
<p>However, the conda environment should also include this library in the appropriate version matching the <code>r-base</code> package; it is just not used by RStudio.
Funnily enough, the error does not occur when just running <code>R</code> in the terminal and trying to load the library?!</p>
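<p>To compare the two copies of the library, something like the following sketch could be used. It assumes GNU <code>sort</code> (for the <code>-V</code> option) and an activated conda environment; the helper name <code>highest_glibcxx</code> is made up for this example:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-fallback" data-lang="fallback"><span style="display:flex;"><span># Hypothetical helper: reads the output of strings on stdin and
</span></span><span style="display:flex;"><span># prints the highest GLIBCXX_x.y.z symbol version found in it
</span></span><span style="display:flex;"><span>highest_glibcxx() {
</span></span><span style="display:flex;"><span>  grep -o 'GLIBCXX_[0-9][0-9.]*' | sort -u -V | tail -n 1
</span></span><span style="display:flex;"><span>}
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span># Check the system library and the copy in the active conda environment, if present
</span></span><span style="display:flex;"><span>for lib in /lib/x86_64-linux-gnu/libstdc++.so.6 "$CONDA_PREFIX/lib/libstdc++.so.6"; do
</span></span><span style="display:flex;"><span>  if [ -e "$lib" ]; then
</span></span><span style="display:flex;"><span>    echo "$lib: $(strings "$lib" | highest_glibcxx)"
</span></span><span style="display:flex;"><span>  fi
</span></span><span style="display:flex;"><span>done
</span></span></code></pre></div><p>If the conda copy reports a version of at least <code>GLIBCXX_3.4.29</code>, the environment itself is fine and the problem is only which copy gets loaded.</p>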
<p>A quick fix is to add the <code>lib/</code> folder of the conda environment to the environment variable <code>LD_LIBRARY_PATH</code> and restart RStudio:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-fallback" data-lang="fallback"><span style="display:flex;"><span>conda activate my-env
</span></span><span style="display:flex;"><span>export LD_LIBRARY_PATH=$CONDA_PREFIX/lib
</span></span><span style="display:flex;"><span>rstudio
</span></span></code></pre></div><p>Now the library can be loaded without problems!</p>
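<p>Note that the <code>export</code> line above replaces any value that <code>LD_LIBRARY_PATH</code> already had. If other entries need to be kept, a variant along these lines should work; the function name <code>prepend_conda_lib</code> is only an illustration:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-fallback" data-lang="fallback"><span style="display:flex;"><span># Hypothetical helper: put the lib/ folder of the active conda environment
</span></span><span style="display:flex;"><span># in front of LD_LIBRARY_PATH and keep any entries that are already set
</span></span><span style="display:flex;"><span>prepend_conda_lib() {
</span></span><span style="display:flex;"><span>  export LD_LIBRARY_PATH="$CONDA_PREFIX/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p>After <code>conda activate my-env</code>, call <code>prepend_conda_lib</code> and then start <code>rstudio</code> as before.</p>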
]]></description>
      
    </item>
    
    
    
    <item>
      <title>GitHub Action for checking a conda environment for upgradable packages</title>
      <link>https://menzel.tech/posts/2023-01-23-github-actions-check-conda-upgrades/</link>
      <pubDate>Mon, 23 Jan 2023 12:00:00 +0100</pubDate>
      
      <guid>https://menzel.tech/posts/2023-01-23-github-actions-check-conda-upgrades/</guid>
      <description><![CDATA[<p>When experimenting with GitHub Actions, I made workflow called <a href="https://github.com/pmenzel/gh-actions/tree/master/check-conda-envs">check-conda-envs</a> for
checking conda environment definition files (in YAML format) for available
package upgrades. The action will create a table in the workflow
summary page containing the current and latest version number for each package,
and also a link to the changelog for many bioinformatics packages.</p>
<p>Here is an example output from a workflow run in the <a href="https://github.com/pmenzel/ont-assembly-snake">ont-assembly-snake</a> repository:</p>
<p><img src="/post-images/2023-01-20-github-actions-example-output.png" alt="Example output table"></p>
<p>The workflow will fail when a package definition is found that does not use <code>=</code> or
<code>==</code> before the version number, e.g. the <code>snakemake&gt;=6.15.5</code> in the example above.</p>
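<p>To find such entries locally before pushing, a rough check could look like the sketch below. This is not the code used by the workflow; the function name <code>unpinned_deps</code> and the pattern for package names are assumptions made for this example:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-fallback" data-lang="fallback"><span style="display:flex;"><span># Hypothetical helper, not part of check-conda-envs: print all list entries
</span></span><span style="display:flex;"><span># after the dependencies: key that are not pinned with = or ==
</span></span><span style="display:flex;"><span>unpinned_deps() {
</span></span><span style="display:flex;"><span>  sed -n '/^dependencies:/,$p' "$1" |
</span></span><span style="display:flex;"><span>    grep -E '^ *- ' |
</span></span><span style="display:flex;"><span>    grep -E -v '^ *- *[A-Za-z0-9_.:-]+==?[0-9]'
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p>Calling <code>unpinned_deps env/my-env.yaml</code> (the file name is a placeholder) prints the offending lines; no output means every listed package is pinned.</p>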
<p>It&rsquo;s very easy to include the workflow in a repository, either by adding the file <a href="https://github.com/pmenzel/gh-actions/blob/master/check-conda-envs/check-conda-envs.yml">check-conda-envs.yml</a>
to the <code>.github/workflows/</code> folder or by adding this job to an existing workflow:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-fallback" data-lang="fallback"><span style="display:flex;"><span>  check-conda-upgrades:
</span></span><span style="display:flex;"><span>    runs-on: &#34;ubuntu-latest&#34;
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    # this defines the repository folder in which the conda environment files (*.y[a]ml) are located
</span></span><span style="display:flex;"><span>    # multiple folders can be set with: TARGET: &#34;env1 subfolder/env2&#34;
</span></span><span style="display:flex;"><span>    env:
</span></span><span style="display:flex;"><span>      TARGET: &#34;env&#34;
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    steps:
</span></span><span style="display:flex;"><span>      # checkout this repository
</span></span><span style="display:flex;"><span>      - uses: actions/checkout@v3
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>      # checkout pmenzel/gh-actions
</span></span><span style="display:flex;"><span>      - uses: actions/checkout@v3
</span></span><span style="display:flex;"><span>        with:
</span></span><span style="display:flex;"><span>          repository: pmenzel/gh-actions
</span></span><span style="display:flex;"><span>          ref: master
</span></span><span style="display:flex;"><span>          path: ./external/gh-actions
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>      # https://github.com/marketplace/actions/setup-miniconda
</span></span><span style="display:flex;"><span>      - uses: conda-incubator/setup-miniconda@v2
</span></span><span style="display:flex;"><span>        with:
</span></span><span style="display:flex;"><span>          channels: conda-forge,bioconda
</span></span><span style="display:flex;"><span>      - run: |
</span></span><span style="display:flex;"><span>          conda info
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>      - name: Run gh-actions/check-conda-envs/check-all-conda-envs.sh
</span></span><span style="display:flex;"><span>        run: ./external/gh-actions/check-conda-envs/check-all-conda-envs.sh ${{env.TARGET}}
</span></span></code></pre></div>]]></description>
      
    </item>
    
    
    
    
    
    
    
    
  </channel>
</rss>
