How to Run Batch Operations on a Cluster Computer

If you are pursuing higher studies in machine learning, it is very likely that you will have to “deal” with big data. By “deal”, I mean you might have to extract features or, for example, find the eigenvectors and eigenvalues of large matrices. Running these kinds of simulations on real or synthetic data on a typical desktop or laptop may take days. It is better to run them on your university’s supercomputer or cluster computer. I am assuming that you have access to such a cluster.

These clusters are typically multi-core machines with huge amounts of RAM and tremendous processing power. But you cannot just run any code on them directly: you need a SLURM batch script to request and use the resources. You have access to a front-end (login) node, and you submit your batch job through that node. Below I will show how to run MATLAB code on the cluster using the sbatch command. I am assuming that you are on macOS or Linux and can use the terminal; if you are on Windows, install PuTTY to get an SSH terminal.

  • Log in to the front-end node. For example, if your username is abc123 and the front-end host name is def.edu, type the following command in the terminal:

ssh abc123@def.edu

  • Press Enter and it will ask for your password. Type the password, press Enter again, and you will be taken to the home directory of the remote host.
  • Create a directory in your home folder (optional).

mkdir TEST_MATLAB

cd TEST_MATLAB

  • Copy your necessary *.m files into the directory you just created. You can do that from the terminal with scp (see the example below), or with GUI programs such as Fugu (macOS) or WinSCP (Windows).
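For example, you could run something like the following from a terminal on your local machine (not the cluster session). This is only a sketch that reuses the abc123@def.edu login and the TEST_MATLAB directory from the earlier steps; substitute your own username, host, and paths.

scp *.m abc123@def.edu:~/TEST_MATLAB/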
  • Now create a bash script like the following and save it as matlab_batch.sh. Change the portions written in UPPERCASE according to your specific situation (job name, partition, paths, and so on). Copy this file to your remote directory as well.

#!/bin/bash

## This is an example bash script for a MATLAB cluster simulation

#SBATCH --job-name=YOUR_JOB_NAME
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --partition=SOE_main
#SBATCH --ntasks-per-node=NUMBER_OF_TASKS_PER_NODE

MYHDIR="REMOTE_HOME_DIRECTORY"        # directory containing all your necessary files
MYTMP="/tmp/$USER/$SLURM_JOB_ID"      # local scratch directory on the compute node
mkdir -p "$MYTMP"                     # create the scratch directory on the node
cp "$MYHDIR"/* "$MYTMP"               # copy all files into the scratch directory
cd "$MYTMP"                           # run the job from the scratch directory

matlab -nodisplay -nosplash -r "M_FILE_NAME, exit"

cp "$MYTMP"/* "$MYHDIR"               # copy everything back to the home directory
rm -rf "$MYTMP"                       # remove the scratch directory
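Note that M_FILE_NAME is the name of your MATLAB entry script without the .m extension. For instance, if your entry script were named my_simulation.m (a hypothetical name used only for illustration), the MATLAB line would read:

matlab -nodisplay -nosplash -r "my_simulation, exit"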

  • Now go back to your terminal on the login node and type the following commands:

module purge

module load matlab

sbatch matlab_batch.sh
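Module names vary from cluster to cluster; on some systems the MATLAB module is versioned (for example, something like matlab/R2019b). If module load matlab complains that the module cannot be found, you can list the available MATLAB modules first:

module avail matlab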

  • Your output files should be copied to your remote home directory after the simulation is finished.
  • However, it might take some time for your job to start on a compute node, depending on the node’s current load. You can check whether your job is queued or running using either of the following commands:

squeue -u USERNAME

or

squeue -j JOBID

  • You can also get an estimate of when your job will start:

squeue -u USERNAME --start

or

squeue -j JOBID --start
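Once the job is running, the slurm.out file named in the batch script should appear in the directory you ran sbatch from, so you can follow MATLAB’s console output live with tail:

tail -f slurm.out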

 

  • You can opt for email notifications when your job starts, finishes, or aborts. Add the following lines to your batch script:

#SBATCH --mail-type=ALL                # mail alert at start, end, and abort
#SBATCH --mail-user=<email address>    # send mail to this address
