Workshop: Machine Learning on HPC Systems (MLHPCS)


Deploying machine learning algorithms at Petaflop scale on secure HPC production systems with containers

by David Brayford from LRZ

Abstract

There is an ever-increasing need for computational power to train complex artificial intelligence (AI) and machine learning (ML) models that tackle large scientific problems. High performance computing (HPC) resources are required to compute efficiently and to scale complex models across tens of thousands of compute nodes. In this presentation, we discuss how we successfully deployed a standard machine learning framework on a secure, large-scale HPC production system to train a complex three-dimensional convolutional GAN (3DGAN) model with petaflop performance.
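The abstract does not name the specific software stack, but deployments of this kind typically combine a containerized ML framework with MPI-launched data-parallel training. The sketch below is a minimal, hypothetical illustration assuming TensorFlow/Keras with Horovod; the tiny 3D convolutional model and random data are placeholders, not the actual 3DGAN.

```python
# Hypothetical sketch: MPI-launched, data-parallel training with Horovod and
# TensorFlow/Keras (assumed stack; the talk's actual framework may differ).
import numpy as np
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()  # one rank per process, started e.g. via srun or mpirun

# Pin each rank to a single local GPU if GPUs are present on the node.
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

# Placeholder stand-in for a 3D convolutional network (not the real 3DGAN).
model = tf.keras.Sequential([
    tf.keras.layers.Conv3D(8, 3, activation="relu", input_shape=(16, 16, 16, 1)),
    tf.keras.layers.GlobalAveragePooling3D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# Scale the learning rate with the worker count and wrap the optimizer so
# gradients are allreduced across all ranks every step.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.Adam(1e-4 * hvd.size()))
model.compile(optimizer=opt, loss="binary_crossentropy")

# Random placeholder data standing in for the real training set.
x = np.random.rand(64, 16, 16, 16, 1).astype("float32")
y = np.random.randint(0, 2, size=(64, 1)).astype("float32")

model.fit(
    x, y,
    batch_size=8,
    epochs=1,
    # Broadcast initial weights from rank 0 so all workers start identically.
    callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)],
    verbose=1 if hvd.rank() == 0 else 0,
)
```

On a production system such a script would be launched inside a container image through the batch scheduler, so that the framework and its dependencies travel with the job rather than being installed on the compute nodes.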

Slides

About the Speaker

David Brayford is a Senior HPC & AI scientist working on the deployment of AI software at petascale on secure HPC systems. He previously worked in scientific and numerical computing as well as high performance computing for several years, in both commercial and academic settings. He has extensive experience in 3D computer graphics, including device driver development, physics-based photorealistic rendering, and scientific and medical visualization. Dr. Brayford received his PhD in computer science from The University of Manchester in 2006. His previous employers include PixelFusion/ClearSpeed, the Scientific Computing and Imaging (SCI) Institute at the University of Utah, BioFire Technology, GE Healthcare, and The Leibniz Supercomputing Center (LRZ).