Data Analysis for Life Sciences 6: High-performance Computing for Reproducible Genomics - Harvard University, edX
What you'll learn on the course
Enhanced throughput: Almost all recently manufactured laptops and desktops include multicore CPUs. With R, it is easy to obtain faster turnaround times for analyses by distributing tasks among the cores for concurrent execution. We will discuss how to use Bioconductor to simplify parallel computing for efficient, fault-tolerant, and reproducible high-performance analyses. This will be illustrated with common multicore architectures and Amazon’s EC2 infrastructure.
Enhanced interactivity: New approaches to programming with R and Bioconductor allow researchers to use the web browser as a highly dynamic interface for data interrogation and visualization. We will discuss how to create interactive reports that enable us to move beyond static tables and one-off graphics so that our analysis outputs can be transformed and explored in real time.
Enhanced reproducibility: New methods of virtualization of software environments, exemplified by the Docker ecosystem, are useful for achieving reproducible distributed analyses. The Docker Hub includes a considerable number of container images useful for important Bioconductor-based workflows, and we will illustrate how to use and extend these for sharable and reproducible analysis.
Given the diversity in the educational backgrounds of our students, we have divided the series into seven parts. You can take the entire series or individual courses that interest you. If you are a statistician, you should consider skipping the first two or three courses; similarly, if you are a biologist, you should consider skipping some of the introductory biology lectures. Note that the statistics and programming aspects of the class ramp up in difficulty relatively quickly across the first three courses. By the third course we will be teaching advanced statistical concepts such as hierarchical models, and by the fourth, advanced software engineering skills such as parallel computing and reproducible research concepts.
The courses in this series will be released sequentially each month and are self-paced:
PH525.1x: Statistics and R for the Life Sciences
PH525.2x: Introduction to Linear Models and Matrix Algebra
PH525.3x: Statistical Inference and Modeling for High-throughput Experiments
PH525.4x: High-Dimensional Data Analysis
PH525.5x: Introduction to Bioconductor: annotation and analysis of genomes and genomic assays
PH525.6x: High-performance computing for reproducible genomics
PH525.7x: Case studies in functional genomics