Using jupyter on an HPC cluster

This post was originally written for the CBC-UCONN HC wiki

The purpose of this post is to enable cluster users to run jupyter on the cluster interactively, enabling them to conduct data analysis and visualization. There are different “flavors” of jupyter notebooks, the most appropriate are going to be pointed out at Picking a container.

Preliminaries

We assume that the user has access to ssh through a terminal. In addition, it is necessary to have SingularityCE (singularity for short) installed on a computer on which you have root privileges.

In order to use singularity on a Windows (or Mac) machine, a Linux Virtual Machine(VM) needs to be set up. Setting up a VM and installing SingularityCE os beyond the scope of this document.

In addition, singularity is assumed to be available on the HPC that you have access to. Usually, users have to run

module load singularity/<version>

before using it.

Familiarity with containers is helpful but not necessary. Loosely speaking, a container allows us to “isolate” a set of tools and software in order to guarantee code reproducibility and portability. Moreover, singularity was developed (among other reasons) to integrate these tools with HPC clusters.

Picking a container

The Jupyter Docker Stacks contains several useful docker containers that can be easily used to build singularity containers.

A list of the containers (along with their principal features) maintained by the Jupyter team can be found here. Some of these containers are detailed below

  • jupyter/r-notebook: a container containing a basic installation for Machine Learning using R.
  • jupyter/scipy-notebook: contains popular libraries for scientific computing using python.
  • jupyter/tensorflow-notebook: this is the jupyter/scipy-notebook with tensorflow installed on it.
  • jupyter/datascience-notebook: includes libraries for data analysis from the julia, python, andR` communities.

Converting a docker into a singularity container

Once you have chosen a container suitable for your needs (and have root access to a machine with singularity), a singularity container can be generated by executing the following chunk of code in the terminal.

## singularity pull <choose-a-name>.sif docker://jupyter/<preferred-notebook>
singularity pull mycontainer.sif docker://jupyter/datascience-notebook

In the example above, I choose to use the datascience-notebook. After doing so, the .sif file generated by singularity needs to be transferred to the cluster. My personal preference is to use either scp or rsync, for example

rsync mycontainer.sif <username>@<hpc-url>:<location>

Using the singularity container on the cluster

After transferring the .sif file to the cluster, follow the following steps. Firstly, set up the VPN and log-in to the cluster using ssh, then navigate to the location where you transferred the container (.sif) to. Next, you will have to start a interactive job. If the workload manager used in the HPC that you have access to is SLURM, this can be done either with srun or fisbatch. To start an interactive job with srun use

srun --partition=<partition-name> --qos=<queue-name> --mem=64G --pty bash

The same task can be achieved with fisbatch (if available) with

fisbatch --partition=<partition-name> --qos=<queue-name> --mem=64G

Either of these commands will allocate your job to a specific node. It is important to save the name of the node that your job has been allocated to. Next, load singularity on that node as follows

module load singularity/<version>

The penultimate step is to start the jupyter instance. It is done as follows

singularity exec --nv mycontainer.sif jupyter notebook --no-browser --ip='*'

After executing the last chunk of code, the terminal will be “busy” and will provide three URLs, they will look somewhat like “http://127.0.0.1:8888/” this. Copy the last address provided by the output. The last step before being able to access the notebook through the provided address is to create a ssh tunnel. To do so, open another terminal window and execute

ssh -NL localhost:8888:<node>:8888 <username>@<hpc-url>

where <node> should be replaced by the node to which the job submitted using srun (or fisbatch) was submitted to. This tunnel will keep the other terminal window busy to.

Finally, copy the address provided by the notebook (e.g., “http://127.0.0.1:8888/”) and paste it into your browser.

Avatar
Lucas Godoy
PhD Candidate / TA /GA

I’m a PhD Candidate in Stats interested in R, Open Data, and the most diverse applications of statistics.

comments powered by Disqus