High Performance Computing (Georgia Tech)
Udacity, Free
Important information
Course, Online
When: Flexible
This course will give you solid foundations for developing, analyzing, and implementing parallel and locality-efficient algorithms. It is offered at Georgia Tech as CS6220.

Course programme
Approx. 4 months
Course Summary
The goal of this course is to give you solid foundations for developing, analyzing, and implementing parallel and locality-efficient algorithms. This course focuses on theoretical underpinnings. To give a practical feeling for how algorithms map to and behave on real systems, we will supplement algorithmic theory with hands-on exercises on modern HPC systems, using programming models such as Cilk Plus or OpenMP on shared-memory nodes, CUDA for graphics co-processors (GPUs), and MPI and PGAS models for distributed-memory systems.
This course is a graduate-level introduction to scalable parallel algorithms. “Scale” really refers to two things: staying efficient as the problem size grows, and staying efficient as the system size (measured in number of cores or compute nodes) grows. To scale your algorithm in both of these senses, you need to be smart about reducing asymptotic complexity, as you have done for sequential algorithms since CS 101, but you also need to think about reducing communication and data movement. This course covers the basic algorithmic techniques you’ll need to do so.
The techniques you’ll encounter cover the main algorithm design and analysis ideas for three major classes of machines: multicore and manycore shared-memory machines, via the work-span model; distributed-memory machines like clusters and supercomputers, via network models; and sequential or parallel machines with deep memory hierarchies (e.g., caches), via two-level memory (I/O) models. You will see these techniques applied to fundamental problems such as sorting, search on trees and graphs, and linear algebra, among others. The practical side of this course is implementing the algorithms and techniques you learn so that they run on real parallel and distributed systems, letting you check whether what appears to work well in theory also translates into practice. (Programming models you’ll use include Cilk Plus, OpenMP, and MPI, and possibly others.)
Prerequisites and Requirements
A “second course” in algorithms and data structures, à la Georgia Tech’s CS 3510B or Udacity’s Intro to Algorithms
For the programming assignments, programming experience in a “low-level” high-level language like C or C++
Experience using command-line interfaces in *nix environments (e.g., Unix, Linux)
Course readiness survey: you should feel comfortable answering questions like those found in the Readiness Survey Course, HPC0
See the Technology Requirements for using Udacity.
Syllabus
The course topics center on three extensions to the usual serial RAM model you encounter in CS 101. Recall that a serial RAM assumes a single sequential (serial) processor connected to a main memory.
Unit 1: The work-span or dynamic multithreading model
In this model, the idea is that there are multiple processors connected to the main memory. Since they can all “see” the same memory, the processors can coordinate and communicate via reads and writes to that “shared” memory.
Subtopics include:
** Intro to the basic algorithmic model
** Intro to OpenMP, a practical programming model
** Comparison-based sorting algorithms
** Scans and linked list algorithms
** Tree algorithms
** Graph algorithms, e.g., breadth-first search
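To make the work-span model concrete, here is a minimal Python sketch (our own illustration, not course material) that computes a divide-and-conquer sum while tallying its work W (total additions) and span D (longest chain of dependent additions). The function names `reduce_work_span` and `brent_time_bound` are hypothetical, and each addition is assumed to cost one time unit. The code runs the two halves one after the other; in Unit 1 they would be spawned in parallel with a programming model such as OpenMP or Cilk Plus.

```python
def reduce_work_span(a, lo=0, hi=None):
    """Divide-and-conquer sum of a[lo:hi], returning (total, work, span).

    The two recursive calls are independent, so in the work-span model they
    could run in parallel: their work adds, while their span is the max.
    Each '+' is counted as one unit-cost operation (an assumption).
    """
    if hi is None:
        hi = len(a)
    if hi <= lo:
        raise ValueError("need at least one element")
    if hi - lo == 1:
        return a[lo], 0, 0                      # one element: no additions needed
    mid = (lo + hi) // 2
    left_sum, left_work, left_span = reduce_work_span(a, lo, mid)      # independent of...
    right_sum, right_work, right_span = reduce_work_span(a, mid, hi)   # ...this call
    return (left_sum + right_sum,
            left_work + right_work + 1,         # work: every addition performed
            max(left_span, right_span) + 1)     # span: longest dependency chain


def brent_time_bound(work, span, p):
    """Greedy-scheduling upper bound on time with p processors: W/p + D."""
    return work / p + span


total, work, span = reduce_work_span(list(range(1024)))
print(total, work, span, work / span, brent_time_bound(work, span, 8))
```

For 1,024 elements this gives W = 1,023 and D = 10, so the parallelism W/D is about 102, and with p = 8 processors the bound W/p + D evaluates to 1023/8 + 10 = 137.875 time units.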
Unit 2: Distributed memory or network models
In this model, the idea is that there is not one serial RAM, but many serial RAMs connected by a network. Each serial RAM’s memory is private, that is, inaccessible to the other RAMs; consequently, the processors must coordinate and communicate by sending and receiving messages.
Subtopics include:
** The basic algorithmic model
** Intro to the Message Passing Interface, a practical programming model
** Reasoning about the effects of network topology
** Dense linear algebra
** Sorting
** Sparse graph algorithms
** Graph partitioning
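As a sketch of coordinating purely by messages, the following Python code simulates the distributed-memory model under stated assumptions: each thread stands in for one serial RAM, and one `queue.Queue` inbox per process stands in for the network. The function name `tree_reduce` is hypothetical. Each process reads its starting value once into a private variable and afterward exchanges data only through `put` (send) and `get` (receive), summing the values along a binary tree toward rank 0.

```python
import queue
import threading


def tree_reduce(values):
    """Sum one private value per simulated process, moving data only by messages.

    Each thread plays one serial RAM: after copying its starting value into a
    private variable, it touches only that variable and its own inbox, and the
    only way data leaves a process is a put() into another process's inbox.
    Returns the total that rank 0 reports. Assumes len(values) >= 1.
    """
    p = len(values)
    inboxes = [queue.Queue() for _ in range(p)]   # one "network port" per process
    output = queue.Queue()                        # rank 0 reports the result here

    def process(rank):
        local = values[rank]                      # private memory of this RAM
        step = 1
        while step < p:
            if rank % (2 * step) == 0:
                if rank + step < p:               # a partner sends to us this round
                    local += inboxes[rank].get()  # blocking receive
            else:
                inboxes[rank - step].put(local)   # send partial sum, then finish
                return
            step *= 2
        output.put(local)                         # only rank 0 reaches this line

    threads = [threading.Thread(target=process, args=(r,)) for r in range(p)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return output.get()


print(tree_reduce(list(range(10))))
```

Every process except rank 0 sends exactly one message, so the reduction uses p − 1 messages spread over ⌈log₂ p⌉ rounds. In MPI, the corresponding building blocks are the point-to-point calls `MPI_Send` and `MPI_Recv`, plus the collective `MPI_Reduce`; how a particular MPI library implements a reduction internally varies.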
Unit 3: Two-level memory or I/O models
In this model, we return to a serial RAM, but instead of having only a processor connected to a main memory, there is a smaller but faster scratchpad memory in between the two. The algorithmic question here is how to use the scratchpad effectively, in order to minimize costly data transfers from main memory.
Subtopics include:
** Basic models
** Efficiency metrics, including “emerging” metrics like energy and power
** I/O-aware algorithms
** Cache-oblivious algorithms
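As a sketch of why access order matters in a two-level memory, the following Python code counts block transfers between slow main memory and a fast memory of Z words, organized as blocks of L words with least-recently-used (LRU) replacement. The symbols Z and L, the LRU policy, and the helper names `count_transfers`, `row_order`, and `column_order` are assumptions chosen for illustration. The course's own model may charge transfers differently.

```python
from collections import OrderedDict


def count_transfers(addresses, Z, L):
    """Count slow-to-fast block transfers for a sequence of word addresses.

    Assumed model: fast memory holds Z words as Z // L blocks of L words,
    managed with LRU replacement; touching a word whose block is not in fast
    memory costs one transfer.
    """
    capacity = Z // L                      # number of blocks fast memory can hold
    fast = OrderedDict()                   # block id -> None, oldest first
    transfers = 0
    for addr in addresses:
        block = addr // L
        if block in fast:
            fast.move_to_end(block)        # hit: mark as most recently used
        else:
            transfers += 1                 # miss: load the block from slow memory
            fast[block] = None
            if len(fast) > capacity:
                fast.popitem(last=False)   # evict the least recently used block
    return transfers


def row_order(n):
    """Addresses of an n x n row-major matrix, visited row by row."""
    return [i * n + j for i in range(n) for j in range(n)]


def column_order(n):
    """The same row-major matrix, visited column by column."""
    return [i * n + j for j in range(n) for i in range(n)]


print(count_transfers(row_order(64), Z=256, L=8))
print(count_transfers(column_order(64), Z=256, L=8))
```

For a 64 × 64 matrix with Z = 256 and L = 8, the row-order scan needs 512 transfers (n²/L), while the column-order scan needs 4,096, one per access, because each column touches 64 distinct blocks but fast memory holds only 32. With Z = 512 (64 blocks), the column-order scan also drops to 512 transfers.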