You can watch a demo led by Phil Ewels on our Tech Shorts channel on YouTube.
This post was contributed by Brendan Bouffler, Head of Developer Relations, HPC Engineering at AWS; Phil Ewels, Senior Product Manager for Open Source at Seqera; and Paolo Di Tommaso, CTO and co-founder of Seqera.
Life sciences is a rapidly moving area of research, and you don’t need to look too far for amazing examples where the practitioners of this field have adopted new tools and techniques to solve ever harder problems. Within this domain, bioinformatics has emerged as a crucial discipline, bridging the gap between biology and computer science. Many of the features in AWS Batch have been driven by this community, leading to a close relationship between our teams and the people working at the front lines of research.
But while biological data has become more complex and massive, it has also become increasingly personal, and inevitably regulated. The world needed researchers across borders and geographies to share insights and techniques without necessarily moving the data, or sharing the same infrastructure. The conditions were set for another uptake of new technology.
Adopting containers for packaging applications turned out to be pivotal for this shift: it revolutionized the way bioinformatics workflows are developed, deployed, and shared – increasing the reproducibility of an analysis. But while containers have unquestionably simplified the life of bioinformaticians, creating and using them is not without some friction.
Today, in conjunction with our friends at Seqera, we’re announcing a project we’re supporting called Seqera Containers. This is a freely available resource for the entire bioinformatics community (in the cloud or not) which simplifies the container experience – allowing researchers to generate a container for any combination of Conda and PyPI packages at the click of a button. Best of all, Seqera Containers are publicly accessible to everyone, at no cost.
Seqera Containers is not a traditional container registry: users are not expected to browse existing container images or push local images to a remote. Instead, Seqera Containers provides a simple form to choose a combination of packages from Conda or the Python Package Index (PyPI). Clicking “Get Container” returns a Docker or Singularity image URI which can be used immediately.
Figure 1 – the community Wave registry is easy to access, and even easier to use.
The beating heart of Seqera Containers is Wave – an open-source technology built by Seqera for next-generation container provisioning that aims to simplify the usage of containers. Instead of writing Dockerfile scripts to build images, a developer or end user can request a just-in-time container image tailored for the target execution platform using only package names and version numbers from popular software packaging tools: Conda (including Bioconda), the Python Package Index (PyPI) and, later, Spack.
Wave can also compile images on the fly to match your compute infrastructure – including x86 and Arm64-based processors (where upstream packages are available).
Wave supports both Docker and Singularity images – and thus any other container technologies that use Docker images, like Podman, Shifter, Sarus, or Charliecloud. Seqera Containers provides an OCI-compliant registry for native Singularity image builds and even allows direct .sif image download as a flat file, without needing Singularity installed locally.
As part of the build process, Wave runs a vulnerability scan using the Trivy security scanner and can generate software bill of materials (SBOM) manifests.
Building containers on demand is convenient, but for the sake of performance, reproducibility, and provenance, it’s important to use the exact same image for every run. To do this, Wave generates a checksum for the build and can push the built image to a traditional OCI registry. Subsequent requests with the same checksum will be returned directly from the registry.
So, when you request an image through the Seqera Containers web interface, most of the time the images will come from the cache – meaning virtually instant access and consistent reuse by you, or anyone in the community. That image cache doesn’t expire, so those images will still be there when you need to reproduce that analysis in a few years’ time.
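To make the caching idea concrete, here is a minimal Python sketch of a checksum-keyed image cache. It is an illustration only, not Wave's actual implementation: the hashing scheme, the `build_key` and `ImageCache` names, the `registry.example.org` URI, and the package versions are all assumptions made for the example.

```python
import hashlib
import json


def build_key(packages, platform="linux/amd64"):
    """Return a deterministic checksum for a container build request.

    Hypothetical scheme: sort the package specs so their order doesn't
    matter, serialize to canonical JSON, then hash with SHA-256.
    """
    spec = {"packages": sorted(packages), "platform": platform}
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


class ImageCache:
    """Toy stand-in for a registry that stores images by build checksum."""

    def __init__(self):
        self._images = {}  # checksum -> image URI
        self.builds = 0    # number of times an image actually had to be built

    def get_or_build(self, packages, platform="linux/amd64"):
        key = build_key(packages, platform)
        if key not in self._images:
            # Cache miss: "build" the image and record it under its checksum.
            self.builds += 1
            self._images[key] = f"registry.example.org/demo:{key}"
        # Cache hit (or freshly built): always return the stored URI.
        return self._images[key]


if __name__ == "__main__":
    cache = ImageCache()
    first = cache.get_or_build(["samtools=1.20", "bwa=0.7.18"])
    # Same packages in a different order: resolves to the same image.
    second = cache.get_or_build(["bwa=0.7.18", "samtools=1.20"])
    # Same packages for a different platform: gets its own image.
    arm = cache.get_or_build(["samtools=1.20", "bwa=0.7.18"], platform="linux/arm64")
    print(first == second)  # True
    print(first == arm)     # False
    print(cache.builds)     # 2
```

The point of keying on a checksum of the full request is that two people asking for the same packages on the same platform get the identical image back, rather than two builds that merely look alike.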
We hope that this service and these containers will be of use to the entire bioinformatics community. However, the experience of using Seqera Containers will be particularly good for Nextflow users.
Using the Nextflow Wave plugin, pipeline developers can avoid specifying container URIs in their pipeline code entirely. Instead, all you need to do is name the software packages in the conda (or, later, spack) declarations and set wave.enabled = true and wave.freeze = true. This instructs Nextflow to request an image from Wave for these packages, store it in the Seqera Containers public registry, and then use it at run time.
Wave isn’t restricted to working with Nextflow. There is also a Wave CLI, which anyone can use to generate containers, as well as an API. Both have all the same functionality as the Nextflow plugin and the Seqera Containers web interface.
The impact of container technology on the progress of life sciences research just can’t be overstated.
It addresses big challenges in bioinformatics, like reproducibility, scalability, and collaboration. It eases software management, and empowers researchers to focus on their core scientific endeavors. This has accelerated the pace of discoveries, fostered collaboration, and enhanced the overall quality and reproducibility of bioinformatics research.
We think today’s announcement pushes containers and bioinformatics one step further, by making it dramatically easier to get your hands on containers in the right shape, size, and form-factor to meet the needs of your pipelines. Batch users will definitely enjoy using this. But we’re pretty sure this will be a community resource everyone will love, and AWS is thrilled to be able to support the Seqera team to deliver this for the entire community.
If you still need convincing, try it out yourself now – head to https://seqera.io/containers/ and type in some of your favorite bioinformatics tool names before clicking “Build”. Pull your Docker image and get to work.
And if you have ideas for Seqera or AWS, don’t hesitate to reach out to us at ask-hpc@amazon.com.
Brendan Bouffler is the head of Developer Relations in HPC Engineering at AWS. He’s been responsible for designing and building hundreds of HPC systems in all kinds of environments, and joined AWS when it became clear to him that cloud would become the exceptional tool the global research & engineering community needed to bring on the discoveries that would change the world for us all. He holds a degree in Physics and an interest in testing several of its laws as they apply to bicycles. This has frequently resulted in hospitalization.
Paolo Di Tommaso is the CTO and co-founder of Seqera Labs. He is a computer scientist with a strong interest in high-throughput scientific computing, data-intensive applications, parallel programming, cloud computing, and containerization technologies. He has broad experience as a software engineer and software architect in life science and healthcare applications. He is an open-source advocate and the creator and maintainer of the Nextflow workflow system.
Phil Ewels holds a PhD in Molecular Biology from the University of Cambridge (UK) and has had a career that has spanned many disciplines: from lab work and bioinformatics research in epigenetics to software development and community engagement. He is the co-founder of the nf-core community and the author of several well-known bioinformatics tools, such as MultiQC. He works within Seqera to grow the Nextflow and nf-core communities worldwide and build the Seqera open source portfolio.