User Guide

Kabré Usage Tutorial

Kabré is a word from the Ngäbe language meaning "a bunch". This fits the current cluster composition, which features multiple parallel architectures. In this tutorial you will learn how Kabré is composed, how to connect to the cluster, how to submit jobs and retrieve results, and how to use environment modules.

Requirements

To complete this tutorial you will need an SSH client. On Unix and Linux platforms, the standard terminal emulator can usually establish SSH sessions. On Windows, you should download an SSH client program, such as PuTTY or similar.

You will also need an active account on Kabré with valid credentials. If you don't have one, please contact us at cnca@cenat.ac.cr and explain your situation.

Understanding Kabré’s composition

The following image shows a network diagram of Kabré. We will discuss the major components in the diagram.

Meta node

This is a very special node: it supports many of the cluster's services. Its existence should be transparent to all users. If, for some reason, you find yourself on the Meta node, please leave and inform us, since it could indicate a problem. Running programs on the Meta node is considered bad behavior.

Login-nodes

These nodes are a shared working area. When you log into Kabré, you will be assigned to one of the login nodes. Some common tasks you execute here are:

  • Creating and editing files
  • Creating directories and moving files
  • Copying files to and from your computer
  • Compiling code
  • Submitting jobs
  • Managing your active jobs

Running parallel code or heavy tasks on the login nodes is considered bad behavior.

Machine Learning Nodes (Nukwa)

Nukwa has 6 nodes. Four of them feature an Nvidia Tesla K40 GPU each; the host is an Intel Xeon with 4 cores @ 3.2 GHz (no hyper-threading) and 16 GB of RAM. The other 2 nodes feature an Nvidia Tesla V100 GPU, 24 cores @ 2.20 GHz with 2 threads per core, and 32 GB of RAM.

Only applications that make intensive use of the GPU will get a relevant speed-up on Nukwa.

Simulation Nodes (Nu)

The workhorse of Kabré: 20 Intel Xeon Phi KNL nodes, each one with 64 cores @ 1.3 GHz and 96 GB of RAM; each core has two AVX-512 vector units.

If your application can be split into many small pieces and uses vectorization, this architecture can provide impressive speed-ups.
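
As a rough, hedged illustration (not an official Kabré recommendation), with gcc you can usually target the KNL vector units through an architecture flag; the file name my_code.c is only a placeholder:

$ gcc -O3 -march=knl -fopenmp my_code.c -o my_code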

Big Data Nodes (Andalan)

Two of the Andalan nodes feature an Intel Xeon, each one with 24 cores @ 2.20 GHz, 2 threads per core and 64 GB of RAM. A third node features an Intel Xeon with 16 cores @ 2.10 GHz, 2 threads per core, and 64 GB of RAM. A fourth node features 10 cores @ 2.20 GHz, 2 threads per core, and 32 GB of RAM. The most recent node has 24 cores @ 2.40 GHz, 2 threads per core, and 128 GB of RAM.

The Andalan nodes work quite well for sequential tools that need processing power.

Bioinformatics Nodes (Dribe)

Dribe has two nodes, one that features an Intel Xeon with 36 cores @ 3.00 GHz, 2 threads per core and 1024 GB of RAM, and the other one with 18 cores @ 3.00 GHz, 2 threads per core and 512 GB of RAM.

The Dribe nodes work quite well for tools with high memory demands.

Interacting with Kabré

In this section we will cover ssh connections, ssh keys, and how to copy files.

SSH connections and SSH Keys

To start, open a terminal emulator and start an SSH session by typing

$ ssh [user]@kabre.cenat.ac.cr

Remember to replace [user] with your user name. Type your password when prompted. You will be logged into one of the login nodes. This is a typical Linux terminal, so try out some familiar commands, like ls, cd, mkdir and so on.
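
For example, you could create and inspect a working directory for this tutorial (the directory name is just an example):

$ mkdir tutorial
$ cd tutorial
$ ls -l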

An SSH key is a file that keeps your connection secure while saving you from typing your password every time you log in. A key pair is commonly tied to one computer; normally, you should generate one on the computer you will be using to interact with Kabré.

To generate an SSH key, open a terminal on your local computer (laptop, workstation…), go to your home directory and type

$ ssh-keygen -t rsa -C "your_email@example.com"
and follow the instructions. If you chose the default options, the new key will be in ~/.ssh/. Now you have to copy the public key to Kabré; to do so, type
$ scp ~/.ssh/id_rsa.pub your_user@kabre.cenat.ac.cr:~

Now, within an ssh session in Kabré, type:

$ cat id_rsa.pub >> .ssh/authorized_keys
$ rm id_rsa.pub

Alternatively, if it is available on your workstation, you can perform this whole procedure with a single command, like this:

$ ssh-copy-id your_user@kabre.cenat.ac.cr

Now, if you open a new terminal and type ssh your_user@kabre.cenat.ac.cr, you will be logged in without being prompted for your password. This is because Kabré has your computer's public SSH key. It is also convenient for your computer to have Kabré's public key, so that connections in the other direction also work without a password; simply append it to authorized_keys on your local computer, like this:

$ scp user@kabre.cenat.ac.cr:~/.ssh/id_rsa.pub . 
$ cat id_rsa.pub >> ~/.ssh/authorized_keys
$ rm id_rsa.pub

Kabré’s file system

There are two directories: /home and /work.

Directory Quota Purpose Backup
/home/userid 10GB scripts and important data monthly
/work/userid 100GB temporary data NO

To know the amount of available space left on each directory, you can use the following command:

$ df -h /home/userid

Or:

$ df -h /work/userid

NOTES:

  • The user is responsible for managing both home and work directories. The former is meant for programs and sensitive data, while the latter is meant for massive and temporary data.
  • Capacity of the work directory can be extended upon request. A clear justification for the extra space required must be provided. Send your request to jumana@cenat.ac.cr.
  • Although the home directory is backed up monthly, we strongly recommend that users keep their scripts and source code under a version control system (e.g. git), as sketched below.
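
A minimal sketch of keeping your work under git, assuming a directory called my_scripts containing your job files (all names here are placeholders):

$ cd ~/my_scripts
$ git init
$ git add my_job.slurm my_program.c
$ git commit -m "Initial version of job script and source code"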

Copy files between your computer and Kabré

The scp command is similar to the cp command: it copies files from a source to a destination over an SSH connection. It has the following syntax:

$ scp [user]@[host]:[path]origin_file [user]@[host]:[path]destiny_file

Please note that this command copies a single file; if you need to copy a whole directory, add the -r option, like this:

$ scp -r [user]@[host]:[path]origin_directory [user]@[host]:[path]destiny_directory

Default values are

  • user: your local user
  • host: local host
  • path: current working directory

The scp command must be executed on your local machine, not on Kabré. Suppose your application generates a lot of visualization files and you want to download them to your computer. Remembering that ~ means the home directory and * matches any sequence of characters, you could combine these ideas like this:

$ scp user@kabre.cenat.ac.cr:~/application/output/*.viz ~/app/results/visualization

Or maybe you want to upload a parameters file to use in a simulation; you could do it like this:

$ scp ~/research/app/parameters.dat user@kabre.cenat.ac.cr:~/application/input

Give another user permissions over a file or directory

The first step is to find the uid (user id) of the user you want to share with. To get the uid, run the following command:

$ id username

Next, once you know the uid, you can grant permissions on a single file or directory. Here you can choose which permissions to give out: R (read), W (write) and/or X (execute).

If you would like to give all permissions to the user whose uid is "[uid]" over the directory "[/path/directory]" (remember to substitute these values for the ones you need), you can use the following command:

$ nfs4_setfacl -a A:[uid]:RWX [/path/directory]

Notice that this gives all permissions (since we used RWX) to [uid] on the directory [/path/directory] only, not on its subdirectories (that is, the folders inside /path/directory). Please be careful when giving all permissions to another user, since they will be able to delete and modify your files freely.

If you would like to include all subdirectories, this same command can be run recursively just by appending -R to it, like this:

$ nfs4_setfacl -a -R A:[uid]:RWX [/path/directory]

Lastly, if you would like to deny the RWX permissions to a user "[uid]" over the directory "[/path/directory]", you can use:

$ nfs4_setfacl -a D:[uid]:RWX [/path/directory]
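
To verify the result, you can usually list the directory's current ACL with the companion tool nfs4_getfacl (assuming it is installed alongside nfs4_setfacl):

$ nfs4_getfacl [/path/directory]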

Understanding Kabré’s queues system

Login nodes are suitable for light tasks, as mentioned before: editing files, compiling, copying files, and so on. Heavy tasks are expected to run on Nu, Nukwa, Andalan or Dribe nodes. To enforce a fair sharing of resources among users, your task must be submitted through a queue system. It is like lining up at the bank: once your task makes its way to the head of the line, it will be granted all requested resources and will run until it completes or until it consumes its time slot.

Currently, there is a separate set of queues for each component of Kabré, which means you cannot mix Nukwa nodes and Nu nodes in a single job, for example. The following table shows all available queues:

Partition (Queue) Platform Number of nodes Time slot
nu Xeon Phi KNL 1 72 h
nu-debug Xeon Phi KNL 1 8 h
nu-wide Xeon Phi KNL 12 24 h
nu-long Xeon Phi KNL 1 744 h
nukwa GPU 1 72 h
nukwa-debug GPU 1 8 h
nukwa-wide GPU 2 24 h
nukwa-long GPU 1 168 h
andalan Xeon 1 72 h
andalan-debug Xeon 1 8 h
dribe Xeon 1 72 h
dribe-debug Xeon 1 8 h

The process of submitting a job in Kabré can be divided into four steps: writing a SLURM file, queuing the job, monitoring it, and retrieving the results.

Writing a SLURM file

This configuration file tells the queue system everything it needs to know about your job, so it can be placed in the right queue and executed. Let's try it out with a minimum working example. Below is a C code that approximates the value of pi using a Monte Carlo method. Log into Kabré, copy the text to a file and save it as pi_threads.c.

#include <pthread.h>
#include <math.h>
#include <stdlib.h>
#include <time.h>
#include <stdio.h>

/* Parameters and result of one worker thread. */
typedef struct {
    long num_of_points;   /* points this thread will sample          */
    long partial_result;  /* points that fell inside the unit circle */
    unsigned int seed;    /* per-thread seed for rand_r()            */
} pi_params;

/* Sample num_of_points random points in the unit square and count
   how many fall inside the unit circle. */
void * calc_partial_pi(void * p){
    pi_params * params = p;
    long count = 0;
    double r1, r2;

    for(long i = 0; i < params->num_of_points; ++i){
        r1 = (double)rand_r(&params->seed) / RAND_MAX;
        r2 = (double)rand_r(&params->seed) / RAND_MAX;
        if(hypot(r1, r2) < 1)
            count++;
    }
    params->partial_result = count;
    pthread_exit(NULL);
}

int main(int argc, char * argv[]){

    if(argc != 3){
        printf("Usage: %s num_threads num_points\n", argv[0]);
        exit(0);
    }

    int num_threads = atoi(argv[1]);
    long num_points = atol(argv[2]);  /* long: the point count can exceed INT_MAX */
    long num_points_per_thread = num_points / num_threads;

    pthread_t threads[num_threads];
    pi_params parameters[num_threads];

    /* Launch one worker per thread, each with its own seed. */
    for(int i = 0; i < num_threads; i++){
        parameters[i].num_of_points = num_points_per_thread;
        parameters[i].seed = (unsigned int)time(NULL) + i;
        pthread_create(threads + i, NULL, calc_partial_pi, parameters + i);
    }

    /* Wait for all workers to finish. */
    for(int i = 0; i < num_threads; i++)
        pthread_join(threads[i], NULL);

    /* pi is approximately 4 * (points inside the circle) / (total points). */
    double approx_pi = 0;
    for(int i = 0; i < num_threads; i++)
        approx_pi += parameters[i].partial_result;
    approx_pi = 4.0 * approx_pi / (num_threads * (double)num_points_per_thread);

    printf("Result is %f, error %f\n", approx_pi, fabs(M_PI - approx_pi));

    return 0;
}

You are currently on a login node, so it is OK to compile the code there. Do so by typing:

$ gcc -std=gnu99 pi_threads.c -lm -lpthread -o pi_threads
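
Optionally, before submitting the job you can run a quick sanity check with a small number of points, which is light enough for a login node (the argument values here are arbitrary):

$ ./pi_threads 4 1000000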

The following is an example SLURM file. All lines starting with #SBATCH are configuration commands for the queue system. The options shown here are the most common, and possibly the only ones you will need.

Configuration Description
--job-name=<job_name> Specify the job's name
--output=<result_name> Name of the output file
--partition=<partition_name> The queue (partition) in which it should run
--ntasks=<number> Number of processes to run
--time=<HH:MM:SS> Maximum duration of the job (time limit)

The body of a SLURM file is bash code. Copy the example into a file named pi_threads.slurm

#!/bin/bash
#SBATCH --job-name=pi_threads
#SBATCH --output=result.txt
#SBATCH --partition=nu
#SBATCH --ntasks=1
#SBATCH --time=00:10:00

module load gcc/7.2.0

srun ./pi_threads 64 100000000000

Note: the command line arguments 64 and 100000000000 are the parameters pi_threads expects: the number of threads and the number of points.

Now, from the command line, submit the job to the queue system:

$ sbatch pi_threads.slurm

And that’s all! Your job will be queued and executed, in this case, on a Xeon Phi node.
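
When the submission is accepted, sbatch replies with the id assigned to your job, in a message of the form:

Submitted batch job <jobid>

Take note of this id; you will need it to monitor or cancel the job.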

Monitoring your active jobs

A passive way of monitoring your jobs is to tell SLURM to send an email when they finish. This can be configured in the SLURM file using the following options:

Configuration Description
--mail-user=<email> Where to send email alerts
--mail-type=<BEGIN|END|FAIL|REQUEUE|ALL> When to send email alerts

For example:

#SBATCH --mail-user=example@mail.com
#SBATCH --mail-type=END,FAIL

After submitting your job you can check its status with these commands:

Command Description
squeue -u <username> Check the jobs of a specific user
sinfo Display all nodes and their attributes
scontrol show job <job_id> Status of a particular job
scancel <job_id> Delete a job

To show details on the state of a job:

$ squeue -j <jobid>

To show details every <num_seconds> on the state of a job:

$ watch -n <num_seconds> squeue -j <jobid>

Valid Job States

To understand the Job State Codes that you may encounter check the following:

Code State
CA Canceled
CD Completed
CF Configuring
CG Completing
F Failed
NF Node Fail
PD Pending
R Running
TO Timeout

Retrieving results

By default, every job will generate an output file with the name given in the --output option of the SLURM file; in our example:

result.txt

You can copy this file to your local computer or run another script for post-processing the output.
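
For example, using scp from your local machine as described above (assuming the job was submitted from your home directory on Kabré, and with an example destination path):

$ scp user@kabre.cenat.ac.cr:~/result.txt ~/app/results/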

Interactive jobs

Sometimes you want direct access to a node. Using ssh directly is bad practice, because the queue system could send someone else's job to the node you are using. The polite way to ask for direct access is through an interactive job, which will put you in an interactive shell on a compute node. This allows you to experiment with different options and variables and get immediate feedback.

To request an interactive session on a Nu node, you can use:

$ salloc

Alternatively, the following command opens an interactive job on a Nukwa, Andalan or Dribe node (on Nukwa, this also gives you access to its GPU):

$ srun --partition=[node] --pty --preserve-env $SHELL

Please note in the above command that you should substitute [node] with either: nukwa-debug, andalan-debug or dribe-debug, according to your needs.
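
For instance, to get an interactive shell on a Nukwa debug node you would type:

$ srun --partition=nukwa-debug --pty --preserve-env $SHELL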

Environment modules

Different users have different needs, and sometimes those needs conflict, for example when they require different versions of the same library. These situations are solved with environment modules. A typical case is different versions of Python. To try it out, request an interactive session on a Nu node, as shown above, by typing:

$ salloc

Go ahead and type python; you should get the default Python interpreter, with a header like this:

Python 2.7.5 (default, Nov 6 2016, 00:28:07) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-11)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> 
>>>

Besides the default interpreter, you can run the Intel Distribution for Python, a specially tuned build that includes packages commonly used in scientific computing. To get Intel Python, type:

$ module load intelpython/3.5

Now type python again; you will get a different header:

Python 3.5.2 |Intel Corporation| (default, Oct 20 2016, 03:10:33) 
[GCC 4.8.2 20140120 (Red Hat 4.8.2-15)] on linux
Type "help", "copyright", "credits" or "license" for more information.
Intel(R) Distribution for Python is brought to you by Intel Corporation.
Please check out: https://software.intel.com/en-us/python-distribution
>>> 
>>>

To check which modules are already loaded, type

$ module list

To get a list of all available modules, type

$ module avail
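
To unload a module you no longer need, or to unload all currently loaded modules at once, you can usually type (the version shown is the one loaded above):

$ module unload intelpython/3.5
$ module purge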

Behind the scenes, the module command just configures paths, aliases and other environment variables, so modules are loaded only for the current shell session. You can also request specific modules in your jobs: just add "module load module_name" lines to the SLURM file body, below all the #SBATCH lines and before running your program.
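
As a minimal sketch, a SLURM file for a Python job using the Intel distribution could look like the following; the job name, output file and script name my_script.py are placeholders:

#!/bin/bash
#SBATCH --job-name=py_example
#SBATCH --output=py_result.txt
#SBATCH --partition=nu
#SBATCH --ntasks=1
#SBATCH --time=00:10:00

module load intelpython/3.5

srun python my_script.py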