Scalable Analytics with Dask on High Performance Clusters

Introduction

Dask is a performance oriented python package to scale up your python workloads across multiple threads, processes and computing nodes. Dask supports API support for most numpy and pandas functionality, which makes transitioning from one of the former frameworks as trivial as swapping out the module’s name to begin using dask. Official tutorials on dask are available here. In contrast, I will only quickly introduce a miminal example and show how to scale this up on high performance clusters here.

Minimal examples

INSERT NOTEBOOK

Using a distributed scheduler

Dask operations in above examples implicitly call the default Dask scheduler. This scheduler manages the different parts of the shown task graph shown above. One can manually choose a single-threaded, multi-threaded or multi-processing approach with the optional scheduler keyword on compute() calls. The default scheduler only works for the local machine. As we want to further scale up our analysis, we need to move to Dask Distributed.

Chris Byrohl
Chris Byrohl
Postdoc

My research interests include Lyman-$\alpha$ radiation to study galaxies and the large-scale structure, supernovae type Ia and high-performance computing.