Using the SGI Altix Systems at JPL
Introduction
This User's Guide for the SGI Altix supercomputers provides
the minimum amount of information needed by a new
user of these systems.
It assumes that the user is familiar with the standard
aspects of supercomputing, such as the Fortran and
C programming languages and various standard libraries (BLAS, LAPACK,
MPI, etc.).
The JPL Supercomputing facility is funded by JPL
and is available to users at JPL.
The computer system is located in building 600,
and is supported by JPL's Supercomputing and Visualization Systems Group.
Getting an account
A user account for this machine can be obtained
by completing an application at the
Account Applications page.
Getting help
User questions and support are handled online by sending e-mail to:
scconsult
The Altix Hardware at JPL
The system is composed of a front end Altix (gemini) with 64 Itanium 2
processors, and two backend Altix systems (castor, pollux) with 256
Itanium 2 processors each.
Interactive editing, compiling, and very simple debugging are
done on the front end, gemini. Production computing is done
on castor and pollux using the batch queueing system (LSF).
LSF supports both interactive and background batch jobs. See the
"Batch Scheduling" discussion below.
Operating System
The operating system is SuSE Linux 10 with SGI ProPack 5.
We assume that the user is familiar with Linux; if not,
there are many web pages available online to help a new user get started
with Linux.
Disk Details
Home directories are on gemini and NFS mounted on castor and pollux. Each user
has a /home quota of 1GB of disk space. Home directories
are backed up nightly.
Gemini:
* 91GB home
* 22 TB /workg
* nfs-mounted /workc and /workp
* 2 TB dynamic scratch
Castor:
* 44 TB /workc
* 1 TB dynamic scratch
Pollux:
* 44 TB /workp
* 1 TB dynamic scratch
Environment
We are using modules to switch between compilers.
Here are some basic module commands:
- module list
lists currently loaded modules
- module avail
lists modules available to load
- module help <name>
tells what the module is/does/loads
- module unload <name>
unloads the specified module
- module load <name>
loads the specified module
- module switch <oldname> <newname>
unloads <oldname> and loads <newname> as the compiler in its place
When users first log in, the module "latest_intel91"
is loaded automatically.
Compiling
- To compile your MPI applications, use the following commands:
| icc <filename.c> -lmpi | for Intel's C/C++ compiler |
| ifort <filename.f90> -lmpi | for Intel's Fortran90 compiler |
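As a small illustration of the compile commands above, the following POSIX shell sketch prints the compile line for a source file based on its extension. The helper name compile_cmd and the list of recognized extensions are assumptions made for this example; only icc, ifort, and -lmpi come from this guide. The helper only prints the command, so it can be tried on any machine.

```shell
# compile_cmd: hypothetical helper (not part of the Altix software).
# Prints the compile command this guide suggests for an MPI source file.
compile_cmd() {
    case "$1" in
        *.c)
            echo "icc $1 -lmpi" ;;      # Intel C/C++ compiler
        *.f|*.f90|*.F90)
            echo "ifort $1 -lmpi" ;;    # Intel Fortran compiler
        *)
            echo "compile_cmd: unrecognized source file: $1" >&2
            return 1 ;;
    esac
}

compile_cmd simpleMPI.c      # prints: icc simpleMPI.c -lmpi
compile_cmd simpleMPI.f90    # prints: ifort simpleMPI.f90 -lmpi
```

To actually compile on gemini, run the printed command directly.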
Batch Scheduling
As with any supercomputer, the fair and efficient use of CPU time
is an important concern for users. A batch queue system is meant
to address these issues. We are using Platform Computing's Load
Sharing Facility (LSF) for our batch queuing system. Jobs MUST
be run using the LSF batch system.
LSF commands
There are many commands associated with LSF. Man pages are
available for most of them. The most important commands for
a new user to learn are:
- bhist -l <jobid>
This displays the history of a job. The -a option
displays both finished and unfinished jobs.
- bjobs
This command gives status information for one or more jobs.
The -l option gives resource usage information. If detailed
information on a job is desired (particularly, if answers
to questions such as "Why hasn't my job started?" or "What resources
have been requested or used by a job?" are desired),
then use bjobs -l <jobid>.
- bkill <jobid>
This deletes one or more unfinished batch jobs.
- bpeek <jobid>
This displays the standard output and standard error of an
unfinished job.
- bqueues
This lists information about each queue: for example, its priority,
its status, the total number of jobs, the number of pending jobs,
the number of running jobs, and the number of suspended jobs in the
queue. The -l option to this command will provide a long listing
of information about each queue.
- bsub
This command submits a job for execution. Please
see below for details on bsub usage.
LSF allows for the placement of batch queue jobs based upon the
availability of a large variety of resources. A job will not be
placed in a queue, and will not be started, unless all of its stated
resource requirements are met. The most important resources
are the number of processors and the wallclock time.
Seven different queues are provided (described below), so that small
and large production jobs can be run without mutual interference
or oversubscription. Priority is given to short jobs on weekdays
and long jobs on evenings and weekends.
Please note that the queues, their properties, and the LSF batch
queue system are our first attempt at providing fair and efficient
use of CPU time for our users. Accordingly, they can and will be
changed as we measure their usage and the needs of the users.
The queues and their characteristics:
- debug
This queue allows the use of up to 32 processors for up to
60 minutes, and is available at all times. Very simple debugging
may be done interactively on gemini. However, debugging
multiprocessor jobs, MPI jobs of any type, and CPU-intensive jobs must
be done in the batch queue system. This queue can be used for
"interactive batch" jobs so that the user can interact with a
debugging tool (such as totalview).
- shortg, shortc, and shortp
These queues allow the use of up to 128 processors for up to
3 hours, and are available at all times. However, at 6pm each
night, and on weekends, the "long" queues have priority, and jobs
will be launched and executed from those queues first during their
active times. Jobs from the "short" queues will run during "long" queue
hours if there are no pending "long" queue jobs.
- longg, longc, and longp
These queues allow the use of up to 128 processors for up to
12 hours, and are fully available at all times.
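The limits listed above can be summarized in a small pre-check. The following POSIX shell sketch reports the smallest queue class whose processor and wallclock limits fit a request. The function name queue_class is an assumption made for this example, and it is not an LSF command; LSF makes the real placement decision, which may differ (for instance, the debug queue is meant for debugging rather than small production runs).

```shell
# queue_class: hypothetical pre-check, not an LSF command.
# Usage: queue_class <processors> <minutes>
# Prints the smallest queue class from this guide whose limits fit the request.
queue_class() {
    nproc=$1
    minutes=$2
    if [ "$nproc" -le 32 ] && [ "$minutes" -le 60 ]; then
        echo debug      # up to 32 processors, 60 minutes
    elif [ "$nproc" -le 128 ] && [ "$minutes" -le 180 ]; then
        echo short      # shortg/shortc/shortp: 128 processors, 3 hours
    elif [ "$nproc" -le 128 ] && [ "$minutes" -le 720 ]; then
        echo long       # longg/longc/longp: 128 processors, 12 hours
    else
        echo "queue_class: request exceeds every queue limit listed" >&2
        return 1
    fi
}

queue_class 16 30      # prints: debug
queue_class 64 120     # prints: short
queue_class 128 600    # prints: long
```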
bsub: How to submit a batch job
The basic command for submitting a batch queue job to LSF is bsub.
Although there are a multitude of options to bsub (see the man page),
there are only a few options that the average user will
commonly use:
- -e errfile
This option reroutes the standard error output to errfile.
- -Ip
This option specifies that this job is to be an interactive batch
job. Standard error, input, and output will be connected to your
terminal. This is most useful when doing interactive debugging in
the debug queue.
- -n #
This option sets the number of processors to be used on this job
and MUST BE USED on EVERY bsub command. Since this value is used by
LSF to place the job in the appropriate queue, a job without this
parameter cannot be queued.
- -o outfile
This option reroutes the standard output to outfile. If
-o is used without -e, the standard error of the job is stored in outfile.
-o and -Ip are mutually exclusive.
- -P project
Assigns the job to the specified project. This option is only
needed for users that are working on multiple projects and need
to designate which project their job is to be executed under.
- -q qname
This specifies the queue to which the job is to be submitted.
- -W time
This option sets the maximum amount of wallclock time (in minutes
or HH:MM) to be used by this job and MUST BE USED on EVERY bsub
command. Since this value is used by LSF to place
the job in the appropriate queue, a job without this parameter
cannot be queued.
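- mpirun -np # <executablefile>
The command to be run follows the bsub options. For an MPI job, launch
the executable with mpirun. In the examples below, the -np value
matches the value given to -n.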
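Because -n and -W must appear on every bsub command, it can help to check an argument list before submitting. The following POSIX shell sketch does that. The function name has_required_opts is an assumption made for this example and is not part of LSF. It only looks for the literal words -n and -W anywhere in the list, so it is a rough check rather than a full parser.

```shell
# has_required_opts: hypothetical check, not part of LSF.
# Succeeds only if both -n and -W appear among the given bsub arguments.
has_required_opts() {
    seen_n=no
    seen_w=no
    for arg in "$@"; do
        case "$arg" in
            -n) seen_n=yes ;;
            -W) seen_w=yes ;;
        esac
    done
    [ "$seen_n" = yes ] && [ "$seen_w" = yes ]
}

if has_required_opts -n 64 -W 2:35 -P xxx mpirun -np 64 ./executablefile; then
    echo "ok: -n and -W are both present"
fi
if ! has_required_opts -n 1 -Ip tcsh; then
    echo "missing -n or -W"
fi
```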
Examples of bsub:
bsub -n 32 -W 120 ./executablefile
This will launch an OpenMP job on 32 processors for 2 hours.
bsub -n 64 -W 2:35 -P xxx mpirun -np 64 ./executablefile
This will launch an MPI job on 64 processors, for 2 hours 35
minutes, and the job will run under the project code "xxx".
bsub -n 1 -W 30 -Ip tcsh
This will produce an interactive job, with a tcsh shell prompt.
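The -W values in the examples above use both accepted forms, plain minutes and HH:MM. The following POSIX shell sketch converts either form to minutes, which is convenient when comparing a request against the queue time limits. The function name wall_minutes is an assumption made for this example; it is not an LSF command.

```shell
# wall_minutes: hypothetical helper, not part of LSF.
# Converts a -W value given as minutes or HH:MM into minutes.
wall_minutes() {
    case "$1" in
        *:*) hours=${1%%:*}; mins=${1##*:} ;;   # HH:MM form
        *)   hours=0;        mins=$1 ;;         # plain minutes
    esac
    # expr reads "05" as decimal 5, whereas $(( )) would treat
    # a leading zero as the start of an octal number.
    expr "$hours" \* 60 + "$mins"
}

wall_minutes 120     # prints: 120
wall_minutes 2:35    # prints: 155
```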
It is also possible to put commands and bsub options
into a shell script file and just send that file to bsub using
"bsub < scriptfile"
IMPORTANT NOTE: Be sure to use the "<" as
shown. This will cause LSF to read and interpret the bsub options
in scriptfile before placing the job in a queue.
Here is an example of a bsub scriptfile for MPI:
#!/bin/tcsh
#BSUB -n 32 -W 90
#BSUB -o outfile -e errfile # my default stdout, stderr files
# NOTE: LSF starts in the current working directory by default.
cd /home/<userdir>/test
mpirun -np 32 ./simpleMPI
- Launch this scriptfile as follows:
bsub < scriptfile
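When many similar jobs are submitted, the MPI scriptfile above can be generated rather than edited by hand. The following POSIX shell sketch writes such a script for a given processor count, wallclock time, working directory, and executable. The function name make_mpi_script and the argument order are assumptions made for this example, and the directory and executable shown are placeholders. The submission line is left as a comment so that the sketch can be tried on a machine without LSF.

```shell
# make_mpi_script: hypothetical generator, not part of LSF.
# Usage: make_mpi_script <processors> <minutes> <workdir> <executable>
# Writes a tcsh job script like the MPI example to standard output.
make_mpi_script() {
    printf '%s\n' \
        '#!/bin/tcsh' \
        "#BSUB -n $1 -W $2" \
        '#BSUB -o outfile -e errfile' \
        "cd $3" \
        "mpirun -np $1 $4"
}

# Placeholder directory and executable; substitute your own.
make_mpi_script 32 90 "$HOME/test" ./simpleMPI > scriptfile
cat scriptfile

# On gemini, submit the generated file with the "<" redirection:
#   bsub < scriptfile
```

Passing the processor count once and reusing it for both -n and -np keeps the two values from drifting apart.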
Here is an example of a bsub script file for OpenMP:
#!/bin/csh
#BSUB -n 128 -W 90
#BSUB -o out128 -e err128 #my default stdout, stderr files
# NOTE: LSF starts in the current working directory by default.
setenv OMP_NUM_THREADS 128
cd /home/<userdir>/test
./simpleOMP