Abstract:
Throughout the COVID-19 pandemic, UC San Diego has implemented a "Return to Learn" Initiative (https://returntolearn.ucsd.edu/), which consists of mass testing and sequencing of SARS-CoV-2 samples across the county. Importantly, this large-scale effort yields massive amounts of raw viral amplicon sequencing data (~10,000 samples per day) that must be analyzed in real time, posing a significant computational-infrastructure challenge. In response to this scalability challenge, I have developed a novel tool called ViReflow (https://github.com/niemasd/ViReflow) that leverages the Reflow system of incremental data processing developed by GRAIL (https://github.com/grailbio/reflow) to implement an elastic, massively parallelizable, and massively scalable (e.g. country-wide scale) pipeline that takes raw amplicon sequence data to variant calls and consensus sequences.
Case Study Summary:
- The scientific problem we tackled:
Throughout the COVID-19 pandemic, UC San Diego has implemented a "Return to Learn" Initiative (https://returntolearn.ucsd.edu/) consisting of mass testing and sequencing of SARS-CoV-2 samples across the county.
- The computational methods we used:
This large-scale effort yields massive amounts of raw viral amplicon sequencing data (~10,000 samples per day) that must be analyzed in real time, posing a significant computational-infrastructure challenge. In response, I developed ViReflow (https://github.com/niemasd/ViReflow), which leverages the Reflow system of incremental data processing developed by GRAIL (https://github.com/grailbio/reflow) to implement an elastic, massively parallelizable, and massively scalable (e.g. country-wide scale) pipeline that takes raw amplicon sequence data to variant calls and consensus sequences.
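ViReflow itself emits a Reflow workflow file that Reflow then executes on AWS. As a rough sketch of what a single step of such a module can look like (the Docker image, resource requests, and S3 paths below are illustrative assumptions, not ViReflow's actual output):

    // Hypothetical Reflow module sketching a single alignment step.
    // The image name, resource requests, and S3 paths are illustrative only.

    // Run minimap2 + samtools inside a container; Reflow schedules the exec
    // on an EC2 instance it provisions and memoizes the resulting file.
    func Align(reads, ref file) =
        exec(image := "niemasd/minimap2_samtools:latest", mem := 4*GiB, cpu := 2) (bam file) {"
            minimap2 -t 2 -a -x sr {{ref}} {{reads}} | samtools sort -@ 2 -o {{bam}} -
        "}

    @requires(cpu := 2, mem := 4*GiB)
    val Main = Align(file("s3://example-bucket/reads/sample.fastq.gz"),
                     file("s3://example-bucket/ref/reference.fas"))

Because Reflow is incremental, steps whose inputs have not changed are served from cache rather than re-executed, which matters when thousands of samples arrive daily.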
- The cloud resources we used:
We use Amazon Web Services (AWS) Elastic Compute Cloud (EC2) to run the analyses and AWS Simple Storage Service (S3) to store the data.
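Both services are first-class in Reflow: inputs are referenced by S3 URL, the compute runs on EC2 instances that Reflow provisions and tears down itself, and outputs are copied back to S3. A minimal sketch, assuming a hypothetical bucket and a hypothetical align.rf module like the one above:

    // Hypothetical sketch: stage input from S3, run the pipeline module,
    // and persist the result back to S3.
    val files = make("$/files")
    val pipeline = make("./align.rf")  // e.g. the module sketched above

    val bam = pipeline.Align(file("s3://example-bucket/reads/sample.fastq.gz"),
                             file("s3://example-bucket/ref/reference.fas"))

    // files.Copy writes the cached result out to durable S3 storage.
    val Main = files.Copy(bam, "s3://example-bucket/results/sample.sorted.bam")

A single `reflow run` of such a file then handles EC2 provisioning, caching, and retries; no cluster needs to exist beforehand.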
- The differences we’ve observed between locally-provided and cloud-provided resources:
The key advantage of cloud-provided over locally-provided resources for our purposes is the ability to scale the computational resources we utilize to precisely match what a specific sequencing run needs. Whether we have 10 samples or 1,000, the execution remains the same: the allocated AWS resources scale dynamically, so we see reduced walltime from massively parallelized execution while maintaining low compute costs.
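Reflow makes this concrete at the language level: independent values are evaluated in parallel, so fanning the same per-sample step out over a sample list is a one-line comprehension, and the cluster sizes itself to the list. A hedged sketch (the sample URLs and module name are hypothetical):

    // Hypothetical fan-out: the same per-sample step over any number of samples.
    // Reflow evaluates independent list elements in parallel and elastically
    // provisions (and later releases) the EC2 instances the run needs.
    val pipeline = make("./align.rf")
    val ref = file("s3://example-bucket/ref/reference.fas")

    // A manifest of per-sample read files (two shown; could be thousands).
    val samples = ["s3://example-bucket/reads/sample1.fastq.gz",
                   "s3://example-bucket/reads/sample2.fastq.gz"]

    // Same program whether this list has 10 entries or 10,000.
    val Main = [pipeline.Align(file(s), ref) | s <- samples]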
Author Bio:
Niema Moshiri is an Assistant Teaching Professor in the Computer Science & Engineering Department at the University of California, San Diego (UCSD). His research interests lie in computational biology, with a focus on viral phylogenetics and epidemiology. He also places a heavy emphasis on teaching, particularly the development of online educational content, primarily Massive Adaptive Interactive Texts (MAITs).
For further information:
- https://niema.net/
- https://github.com/niemasd/ViReflow
- https://github.com/niemasd/ViReflow-Paper/blob/main/scripts/Figures.ipynb
RRoCCET21 is a conference that was held virtually by CloudBank from August 10th through 12th, 2021. Its intention was to inspire you to consider utilizing the cloud in your research by sharing the success stories of others. We hope the proceedings, of which this case study is a part, give you an idea of what is possible and act as a “recipe book” for mapping powerful computational resources onto your own field of inquiry.