Running Bioinformatics Pipelines Cost Effectively Using MemVerge on AWS | AWS Partner Network (APN) Blog

By Jing Xie and Charlie Yu – MemVerge
By Gokhul Srinivasan and Sujaya Srinivasan – AWS

The MemVerge Memory Machine Cloud (MMCloud) solution on Amazon Web Services (AWS) makes it easy for researchers to migrate their bioinformatics pipelines from on-premises high-performance computing (HPC) to AWS. It also enables executing long-running bioinformatics pipelines cost-effectively on Amazon EC2 Spot instances through automatic checkpoint and restore.

Over the last decade, as the cost of sequencing has dropped, researchers have been generating ever-larger volumes of raw sequencing data. This raw data must be processed through complex multi-step pipelines that require significant compute resources before it is usable in any research or analysis.

Traditionally, most researchers in academic institutions have relied on HPC clusters managed by a central resource within their institutions. However, the increasing compute demands and aging infrastructure make it challenging to get the compute resources they need in a timely manner.

The cloud is an attractive alternative for getting the elastic compute that research groups need, but these groups are often blocked by a lack of the IT support and skills needed to make their pipelines cloud-ready and run them cost-effectively.

In this post, we will explore how MemVerge Memory Machine Cloud is built to run computational workflows and interactive computing applications on AWS. Designed for use by genomic researchers and bioinformaticians, MMCloud enables you to run your Nextflow pipelines, next-generation sequencing (NGS), and other genomic analyses safely and reliably on EC2 Spot instances.

MemVerge is an AWS Specialization Partner and AWS Marketplace Seller with the EC2 Spot service ready designation.

Most bioinformatics pipelines have multiple steps, with each step having different resource requirements. For example, in Sentieon’s implementation of the GATK best practices workflow, there are three phases:

To establish a baseline, the whole genome sequencing (WGS) pipeline is first run on a single r5.8xlarge instance to get CPU and memory benchmarks, as shown in Figure 1.

In the following table, you can see CPU and memory metrics for the Sentieon WGS pipeline phases on r5.8xlarge instances.

* Costs are based on us-east-1 on-demand EC2 instance prices as of the publishing date.

Based on the real-time profiling above, we can resize the instances for each step as follows, optimizing cost and run-time. The next table shows CPU and memory metrics for Sentieon WGS pipeline after right-sizing the instances for each phase based on profiling metrics from the table above.

* Costs are based on us-east-1 on-demand EC2 instance prices as of the publishing date. Note that costs could be lower by leveraging Spot instances.
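The right-sizing step described above can be sketched in a few lines of Python. This is a minimal illustration, not MMCloud's actual algorithm: the `right_size` function, the 20% headroom, the per-phase peak values, and the hourly prices are all assumptions made up for the example. Only the vCPU and memory figures follow the published EC2 instance specifications.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class InstanceType:
    name: str
    vcpus: int
    memory_gib: int
    hourly_usd: float  # placeholder price for illustration, NOT a real AWS quote


# vCPU and memory sizes match the published EC2 specifications;
# the prices are invented placeholders that only preserve a plausible ordering.
CATALOG = [
    InstanceType("c5.4xlarge", 16, 32, 0.70),
    InstanceType("m5.4xlarge", 16, 64, 0.80),
    InstanceType("r5.4xlarge", 16, 128, 1.00),
    InstanceType("r5.8xlarge", 32, 256, 2.00),
]


def right_size(peak_vcpus: float, peak_memory_gib: float,
               headroom: float = 0.2) -> Optional[InstanceType]:
    """Return the cheapest catalog entry that covers the profiled peaks
    plus a safety margin, or None if nothing in the catalog fits."""
    need_cpu = peak_vcpus * (1 + headroom)
    need_mem = peak_memory_gib * (1 + headroom)
    candidates = [i for i in CATALOG
                  if i.vcpus >= need_cpu and i.memory_gib >= need_mem]
    return min(candidates, key=lambda i: i.hourly_usd, default=None)


# Hypothetical profiled peaks (vCPUs, GiB) for three pipeline phases.
profiled_peaks = {
    "phase_1": (12.0, 20.0),
    "phase_2": (6.0, 90.0),
    "phase_3": (25.0, 40.0),
}

for phase, (cpu, mem) in profiled_peaks.items():
    choice = right_size(cpu, mem)
    print(phase, "->", choice.name if choice else "no fit")
```

With these made-up peaks, a CPU-bound phase lands on a compute-optimized instance and a memory-bound phase on a memory-optimized one, which is the effect the profiling-based resizing is after.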

For hybrid cloud organizations that have an on-premises HPC footprint but whose researchers want to burst scientific computing workloads into AWS, MMCloud offers a simple and highly cost-effective solution. It enables you to run multiple jobs at once on AWS within your own virtual private cloud (VPC) using the command-line interface (CLI), the MMCloud graphical user interface (GUI), and HPC scheduler interfaces like Slurm, PBS/QSUB, and LSF/OpenLava.

Resource provisioning and deprovisioning are automated, with quotas that help IT admins manage costs. MMCloud provides cost-effective execution via EC2 Spot instance checkpoint and recovery, combined with real-time resource management that scales up jobs that need more CPU or memory and scales down jobs that do not.
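To make the idea of utilization-driven scaling under a quota concrete, here is a small Python sketch. It is an assumption-laden illustration rather than a description of how MMCloud decides: the function names, the 90% and 30% thresholds, and the vCPU-based quota are all hypothetical.

```python
def scaling_decision(cpu_util: float, mem_util: float,
                     high: float = 0.90, low: float = 0.30) -> str:
    """Classify a running job from its recent utilization (0.0 to 1.0).

    The thresholds are illustrative defaults, not MMCloud settings.
    """
    if cpu_util >= high or mem_util >= high:
        return "scale_up"      # the job is starved on at least one resource
    if cpu_util <= low and mem_util <= low:
        return "scale_down"    # the job is over-provisioned on both resources
    return "keep"


def within_quota(vcpus_in_use: int, vcpus_requested: int, vcpu_quota: int) -> bool:
    """Allow a scale-up only if it keeps total vCPUs at or under the admin's quota."""
    return vcpus_in_use + vcpus_requested <= vcpu_quota


# A memory-starved job asks for 16 more vCPUs' worth of instance.
decision = scaling_decision(cpu_util=0.45, mem_util=0.95)
allowed = within_quota(vcpus_in_use=96, vcpus_requested=16, vcpu_quota=128)
print(decision, "allowed" if allowed else "blocked by quota")
```

Requiring both resources to be idle before scaling down, but only one to be saturated before scaling up, is a deliberately conservative choice in this sketch: shrinking an instance that is still busy on one dimension would slow the job.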

Figure 2 below shows a hybrid HPC cloud architecture, where researchers can submit jobs either to their on-premises HPC cluster or to an MMCloud-managed Spot instances queue in their own VPC.

MMCloud’s ability to checkpoint and recover from Spot instance interruptions is a critical component enabling Nextflow pipelines to run more cost efficiently on AWS. You can easily integrate MMCloud as a computing environment for Nextflow on AWS by using the nf-float plugin.
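A toy model shows why checkpointing matters on Spot instances. The Python sketch below is a simplification under stated assumptions: it ignores the time needed to write checkpoints, restore state, and acquire a new instance, and the `compute_hours` function and the interruption times are invented for illustration. It does not describe MMCloud's checkpoint mechanism.

```python
from typing import Iterable, Optional


def compute_hours(work_hours: float,
                  interruptions_at: Iterable[float],
                  checkpoint_every: Optional[float] = None) -> float:
    """Total instance-hours consumed to finish `work_hours` of work.

    interruptions_at: cumulative running hours at which the Spot
        instance is reclaimed.
    checkpoint_every: hours of progress between checkpoints; None means
        no checkpointing, so every interruption restarts the job from zero.
    """
    elapsed = 0.0  # instance-hours consumed so far
    saved = 0.0    # progress that survives an interruption
    for t in sorted(interruptions_at):
        run = t - elapsed
        if saved + run >= work_hours:
            break  # the job finishes before this interruption happens
        progress = saved + run
        if checkpoint_every:
            # Roll back to the most recent checkpoint boundary.
            saved = (progress // checkpoint_every) * checkpoint_every
        else:
            saved = 0.0
        elapsed = t
    return elapsed + (work_hours - saved)


# A 10-hour job with hypothetical interruptions after 6.5 and 13 running hours.
print(compute_hours(10, [6.5, 13]))                        # no checkpoints
print(compute_hours(10, [6.5, 13], checkpoint_every=1.0))  # hourly checkpoints
```

In this example the job without checkpoints restarts from scratch twice, while the checkpointed job loses only the half hour of work done since its last checkpoint.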

Using MMCloud, Nextflow users can easily deploy JuiceFS, a high-performance, cloud-native, POSIX-compatible distributed file system, on Amazon Simple Storage Service (Amazon S3) as a shared working directory.

For details on how to set up MMCloud with Nextflow, see the documentation.

Figure 3 shows the architecture of MMCloud running Nextflow pipelines through the nf-float plugin using JuiceFS.

At Columbia University, Dr. Gao Wang (Assistant Professor of Neurological Sciences) heads the Lab of Statistical Functional Genomics, which focuses on understanding the genetic regulation of molecular mechanisms behind complex biological traits.

A key initiative led by Dr. Wang is the FunGen-xQTL Project, a collaborative effort that involves over a dozen research institutes across the United States and focuses on studying molecular quantitative trait loci in aging brains. Understanding genetic regulation plays a crucial role in providing the Alzheimer’s disease scientific communities with valuable functional genomics data from aging cohorts, curated and processed through comprehensive multi-omics analysis.

This project had two key requirements that were challenging with on-premises infrastructure:

MMCloud on AWS enabled Dr. Wang’s lab to submit hundreds of thousands of jobs to AWS and run them cost-effectively on EC2 Spot instances. This reduced the time required from several weeks to a few days, at 50-80% lower cost than using On-Demand instances. MMCloud also simplified the provisioning and management of the Jupyter and RStudio apps used by multiple institutional collaborators, and made it more cost-efficient.

At MDI Biological Laboratory, Joel H. Graber, Ph.D., a senior staff scientist and director of the comparative genomics and data science core, leads a team focused on collaboration, analysis, and education in the computational analysis of genome-scale data. Dr. Graber leads the development of Axobase, an online resource that’s being built to provide data and tools in support of an international group of researchers across more than 40 labs who study axolotls and other salamanders.

Axolotls are interesting to study because of their amazing tissue regeneration abilities. Because of the axolotl genome's very large size (roughly 10 times that of the human genome) and its structure, analyzing axolotl sequence data can be very computationally intensive, with individual analysis runs lasting up to several days on very large computing resources.

To standardize and automate the pipelines used for genomic analysis, Dr. Graber’s team began writing their workflows using Nextflow. However, they ran into challenges with right-sizing instances and managing Spot instance terminations, resulting in high costs. By leveraging MMCloud on AWS, Dr. Graber’s team deployed a solution to both right-size and better utilize EC2 Spot instances when running Nextflow pipelines, saving 50-80% compared with On-Demand instances and reducing CPU hours per pipeline by up to 60%.

The MemVerge Memory Machine Cloud (MMCloud) solution on AWS can be a cost-effective way of running bioinformatics pipelines on Amazon EC2 Spot instances. It gives researchers an easy way to lift and shift their on-premises workloads to AWS.

By dynamically resizing EC2 instances based on actual CPU and memory usage, MMCloud makes it easy to right-size compute. It also simplifies executing long-running Nextflow pipelines on Spot instances, keeping costs under control.

MMCloud installs as a single Amazon Machine Image (AMI) with an AWS CloudFormation template in your VPC, and AWS users can easily find the solution on AWS Marketplace. Installation can take as little as 15-20 minutes, and a 30-day free trial is offered to all new customers.

For more information, see the MemVerge MMCloud offering on AWS Marketplace or contact your AWS team.

MemVerge is an AWS Specialization Partner whose cloud automation platform (MMCloud) is designed for bioinformaticians and data scientists to easily run computational workflows on AWS.