Using the Dell Xeon Cluster System at JPL
Introduction
The User's Guide for the JPL Dell Xeon Cluster Supercomputer is
intended to provide the minimum amount of information needed by a new
user of this system.
As such, it assumes that the user is familiar with many of the standard
aspects of supercomputing such as the Unix operating system, Fortran and
C programming languages, and various standard libraries (BLAS, LAPACK,
MPI, etc.).
The JPL Dell Xeon Cluster Supercomputing facility is funded by JPL
and is available to users at JPL.
The computer system is located in the Supercomputing Center,
and is supported by JPL's Supercomputing and Visualization Systems Group.
Getting an account
A user account for this machine can be obtained
by completing an application at the
Account Applications page.
Getting help
User questions and support are handled online by sending e-mail to:
scconsult
The Dell Xeon Cluster Hardware at JPL
The system has 512 nodes, with two Intel 3.2 GHz Xeon processors
on each node, for a total of 1024 processors. There is 2GB RAM per processor,
for a total of 2TB RAM. 496 nodes are available for computation; the
remaining 16 nodes are used for I/O.
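The totals above follow from simple arithmetic, which can be checked in the shell. This is only a sketch using the figures quoted in this section; the count of 992 compute processors is derived here and is not stated in the text.

```shell
#!/bin/sh
# Node, processor, and memory totals from the figures quoted above.
nodes=512
procs_per_node=2
ram_per_proc_gb=2
io_nodes=16

total_procs=$((nodes * procs_per_node))            # 1024 processors
total_ram_gb=$((total_procs * ram_per_proc_gb))    # 2048 GB, i.e. 2 TB
compute_nodes=$((nodes - io_nodes))                # 496 compute nodes
compute_procs=$((compute_nodes * procs_per_node))  # derived, not quoted above

echo "processors: $total_procs, RAM: $total_ram_gb GB"
echo "compute nodes: $compute_nodes, compute processors: $compute_procs"
```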
Interactive editing, compiling, and very simple debugging are
done on the headnode, cosmos. Production computing is done
on the 496 compute nodes using the batch queueing system (LSF).
LSF supports both interactive and background batch. See the
"Batch" discussion below.
Operating System
The operating system is Red Hat Enterprise Linux 3.
We assume that the user is familiar with Linux; if not,
there are many web pages available online to help a new user get started
with Linux.
Disk Details
Home directories are NFS mounted on every node. Each user
has a /home quota of 1GB of disk space. Home directories
are backed up nightly.
All nodes also have NFS-mounted work directories from the
IO nodes on the system. Currently /work00-/work07 are
available. These are RAID5 disks, and each has a capacity of
255GB. The work directories have a quota of 100 GB per project,
and are never backed up.
Additionally, all nodes have 53 GB of local scratch space in
/lscratch. /lscratch will be scrubbed after each batch run,
so if users wish to use /lscratch, they will need to stage their
data in and/or out of that area during their runs.
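Because /lscratch is scrubbed after each batch run, a job script has to copy its input in and its results out itself. The following is a minimal sketch of that staging pattern. The helper names, the project paths, and the assumption that users may create a per-job directory under /lscratch are all hypothetical; LSB_JOBID is the job ID variable that LSF sets inside a job.

```shell
#!/bin/sh
# Sketch: stage data through node-local scratch during a batch run.
# stage_in and stage_out are hypothetical helper names.

stage_in() {   # stage_in <source file> <scratch dir>
    mkdir -p "$2"
    cp "$1" "$2/"
}

stage_out() {  # stage_out <scratch file> <destination dir>
    mkdir -p "$2"
    cp "$1" "$2/"
}

# Possible use inside a job script (paths are placeholders):
#   RUN_DIR="/lscratch/$USER.$LSB_JOBID"
#   stage_in  /work00/myproject/input.dat "$RUN_DIR"
#   cd "$RUN_DIR" && mpirun.lsf ./executablefile
#   stage_out "$RUN_DIR/output.dat" /work00/myproject/results
```

Note that /lscratch is local to each node, so a multi-node job would need to stage data on every node it uses.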
Environment
We are using modules to switch between compilers and MPI versions.
Here are some basic module commands:
- module list
lists currently loaded modules
- module avail
lists modules available to load
- module help <name>
tells what the module is/does/loads
- module unload <name>
unloads the specified module
- module load <name>
loads the specified module
- module switch <oldname> <newname>
loads <newname> in place of <oldname>
When users first log in, the modules "latest_intel91" and
"mpich-gm-intel91" are loaded automatically. These are the paths and
environment variables for the latest Intel 9.1 compilers and
the MPICH built for the latest Intel 9.1 compilers, for use over
the Myrinet network (gm), respectively.
Modules "latest_intel71", "mpich-gm-intel71", "latest_intel81",
and "mpich-gm-intel81", as well as "latest_intel101" and "mpich-gm-intel101" are also available.
Modules are matched sets. For example, if a user chooses to use the
Intel 8.1 compilers, they will want to have the latest_intel81
module loaded along with the mpich-gm-intel81 module and so on.
The "module switch" command is best for switching between the
various modules available.
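Since modules are matched sets, both the compiler module and the MPICH module should be switched together. The sketch below only prints the two "module switch" commands for a given pair of version suffixes; the helper name switch_matched_set is hypothetical, and the commands are not executed.

```shell
#!/bin/sh
# Print the "module switch" commands that move both halves of a matched set
# from one Intel compiler version to another (for example 91 -> 81).
# Dry run only: the commands are printed, not executed.
switch_matched_set() {  # switch_matched_set <old suffix> <new suffix>
    echo "module switch latest_intel$1 latest_intel$2"
    echo "module switch mpich-gm-intel$1 mpich-gm-intel$2"
}

switch_matched_set 91 81
```

After running the printed commands in a login shell, "module list" can be used to confirm that the compiler and MPICH modules match.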
The "latest_intel71", "latest_intel80", and "latest_intel81" modules
can only be loaded one at a time.
The MPICH modules, also, can only be loaded one at a time.
The other compiler and debugger modules can be mixed and matched
at will. They should all work together well.
On the other hand, users may run into trouble if they try to mix
the separate compiler or debugger modules with the "latest_intel" modules.
Modules permit this to be done, but certain necessary variables may
be overwritten. We recommend sticking with any of the "latest_*" and
"mpich-gm-*" modules.
Compiling
- To compile your MPI applications, use the following scripts:
| mpicc <filename.c> | for Intel's icc compiler |
| mpiCC <filename.cpp> | for Intel's C++ compiler |
| mpif90 <filename.f> | for Intel's ifort Fortran90 compiler |
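As an illustration of the mpicc wrapper, the sketch below writes a minimal MPI test program and compiles it if mpicc is on the PATH. The file name hello_mpi.c and the temporary build directory are hypothetical; "-o" is the usual compiler option for naming the output file.

```shell
#!/bin/sh
# Sketch: write a minimal MPI program and compile it with the mpicc wrapper.
build_dir=$(mktemp -d)

cat > "$build_dir/hello_mpi.c" <<'EOF'
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
EOF

if command -v mpicc >/dev/null 2>&1; then
    mpicc -o "$build_dir/hello_mpi" "$build_dir/hello_mpi.c"
else
    echo "mpicc not found; load a latest_intel* and mpich-gm-* module pair first"
fi
```

The resulting executable should be run through the batch system rather than directly on the headnode.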
Batch Scheduling
As with any supercomputer, the fair and efficient use of CPU time
is an important concern for users. A batch queue system is meant
to address these issues. We are using Platform Computing's Load
Sharing Facility (LSF) for our batch queuing system. Jobs MUST
be run using the LSF batch system. No MPI jobs, even if they are
one-processor jobs, should ever be run on the headnode, cosmos.
Also, in order to lessen user impact on the two-processor headnode,
users are requested to run long compilation jobs in the batch queues.
LSF commands
There are many commands associated with LSF. Man pages are
available for most of them. The most important commands for
a new user to learn are:
- bhist -l <jobid>
This displays the history of a job. The -a option
displays both finished and unfinished jobs.
- bjobs
This command gives status information for one or more jobs.
The -l option gives resource usage information. If detailed
information on a job is desired (particularly, if answers
to questions such as "Why hasn't my job started?" or "What resources
have been requested or used by a job?" are desired),
then use bjobs -l <jobid>.
- bkill <jobid>
This deletes one or more unfinished batch jobs.
- bpeek <jobid>
This displays the standard output and standard error of an
unfinished job.
- bqueues
This lists information about each queue, for example, its priority,
its status, the total number of jobs, the number of pending jobs,
the number of running jobs, and the number of suspended jobs in the
queue. The -l option to this command will provide a long listing
of information about each queue.
- bstat
This command gives status information for one or more jobs. The
-me option shows only the user's own jobs. This command is a useful extension
to LSF provided by the SVF staff.
- bsub
This command submits a job for execution. Please
see below for details on bsub usage.
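Most of the commands above take a job ID. One way to capture it, sketched here, is to parse the confirmation line that bsub prints on submission. This assumes bsub writes its usual "Job <N> is submitted to ..." line to standard output; the helper name jobid_from_bsub and the sample line are illustrative only.

```shell
#!/bin/sh
# Extract the numeric job ID from the confirmation line printed by bsub,
# which has the form: Job <1234> is submitted to queue <short>.
jobid_from_bsub() {
    sed -n 's/^Job <\([0-9][0-9]*\)>.*/\1/p'
}

# Illustrative line only (the job number and queue are made up):
sample='Job <1234> is submitted to queue <short>.'
jobid=$(printf '%s\n' "$sample" | jobid_from_bsub)
echo "job id: $jobid"

# With a real submission the ID could then be passed to the commands above:
#   jobid=$(bsub -n 32 -W 120 -a mpich_gm mpirun.lsf ./executablefile | jobid_from_bsub)
#   bjobs -l "$jobid"    # why is it pending, what resources were requested
#   bpeek "$jobid"       # output produced so far
#   bkill "$jobid"       # remove the job if something is wrong
```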
LSF allows for the placement of batch queue jobs based upon the
availability of a large variety of resources. A job will not be
placed in a queue or will not be started unless all of the stated
resource requirements are met. The resources of most importance
are: number of processors and wallclock time. Additionally, fairshare
scheduling is being used to determine the order in which jobs are executed.
Five different queues are provided (described below), so that small
and large production jobs can be run without mutual interference
or oversubscription. Priority is given to short jobs on weekdays
and long jobs on evenings and weekends. Additionally, one of the
queues gives top priority solely to engineering projects.
Non-engineering projects may not run in the priority queue.
Engineering project jobs submitted to the regular queues will
have next-in-line priority.
Please note that the queues, their properties, and the LSF batch
queue system are our first cut at providing fair and efficient
use of CPU time for our users. Accordingly, they can and will be
changed as we measure their usage and the needs of the users.
The queues and their characteristics:
- debug
This queue allows the use of up to 128 processors for up to
60 minutes, and is available at all times. Very simple debugging
may be done on the headnode, cosmos. However, debugging multi-
processor jobs, MPI jobs of any type, and CPU intensive jobs must
be done in the batch queue system. This queue can be used for
"interactive batch" jobs so that the user can interact with a
debugging tool (such as totalview).
- short
This queue allows the use of up to 256 processors for up to
3 hours, and is available at all times. However, at 6pm each
night, and on weekends, the "long" queue has priority, and jobs
will be launched and executed from that queue first during its
active times. Jobs from this queue will run during "long" queue
hours if there are no pending "long" queue jobs.
- long
This queue allows the use of up to 256 processors for up to
12 hours, and is fully available starting at 6pm every weeknight,
and on weekends starting at 6pm on Fridays. When the queue becomes
active at 6pm, 864 processors will be available for long jobs to run
on. As the night progresses, this pool of processors gets
progressively smaller. At 12am, the pool is reduced to 512 processors,
at 4am to 256 processors, and at 8am to 128 processors. On weekends,
the reduction of available processors begins at 12am on Monday morning.
The long queue will continue to run throughout the day, but will be
limited to using a total of only 128 processors at any one time.
Available processors in the "long" queue:
| Time: | Monday-Friday | Saturday & Sunday |
| Midnight - 4:00am | 512 processors | 864 processors |
| 4:00am - 8:00am | 256 processors | 864 processors |
| 8:00am - 6:00pm | 128 processors | 864 processors |
| 6:00pm - Midnight | 864 processors | 864 processors |
- pri_day
This queue is a high priority queue. It is intended for
interactive engineering use, and is available for engineering
projects. This queue allows the use of up to 16 processors
for a maximum of 10 hours. It is available from Monday through
Friday from 8am to 6pm.
- preemptable
This queue allows the use of up to 16 processors for up to
10 hours, and is available during the same times as "pri_day." (Monday
through Friday from 8am to 6pm.)
This queue shares the processors available for the priority
queue "pri_day". Jobs submitted to this queue will run provided
that there are processors allocated to the "pri_day" queue that
aren't running jobs.
A job running on this queue WILL BE KILLED in the event
that the processors the job is using are needed by the "pri_day" queue.
Therefore, users are advised to use this queue at their own risk.
We strongly recommend that users refrain from submitting jobs
that use all 16 processors to this queue. Any job submitted to the
"pri_day" queue would kill a 16 processor job running in this queue.
We also recommend that jobs submitted to this queue
use checkpointing.
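The queue limits and the "long" queue schedule described above can be written down as two small shell functions. This is only a reading of the documented limits, not the scheduler's actual routing logic, and the function names are hypothetical. The pri_day and preemptable queues are left out because they must be requested explicitly.

```shell
#!/bin/sh
# Processors available to the "long" queue, following the table above.
# day: 1 (Monday) .. 7 (Sunday), as printed by `date +%u`; hour: 0..23.
long_queue_procs() {
    day=$1
    hour=$2
    if   [ "$day" -ge 6 ];   then echo 864   # Saturday and Sunday
    elif [ "$hour" -lt 4 ];  then echo 512   # midnight - 4:00am
    elif [ "$hour" -lt 8 ];  then echo 256   # 4:00am - 8:00am
    elif [ "$hour" -lt 18 ]; then echo 128   # 8:00am - 6:00pm
    else                          echo 864   # 6:00pm - midnight
    fi
}

# Smallest regular queue whose documented limits cover a request.
smallest_queue() {  # smallest_queue <processors> <wallclock minutes>
    if   [ "$1" -le 128 ] && [ "$2" -le 60 ];  then echo debug
    elif [ "$1" -le 256 ] && [ "$2" -le 180 ]; then echo short
    elif [ "$1" -le 256 ] && [ "$2" -le 720 ]; then echo long
    else echo none
    fi
}

long_queue_procs 3 2      # Wednesday at 2am
smallest_queue 64 150     # 64 processors for 2.5 hours
```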
bsub: How to submit a batch job
The basic command for submitting a batch queue job to LSF is bsub.
Although there are a multitude of options to bsub (see the man page),
there are only a few options that the average user will
commonly use:
- -a mpich_gm
This is required for jobs that use MPI.
- -e errfile
This option reroutes the standard error output to errfile.
- -Ip
This option specifies that this job is to be an interactive batch
job. Standard error, input, and output will be connected to your
terminal. This is most useful when doing interactive debugging in
the debug queue.
- -J "arrayName[indexList]"
This option is needed by people with embarrassingly parallel jobs.
It creates a "job array" for their job. For the detailed explanation
of this option, please see: /usr/local/doc/jobarray on cosmos.
- -n #
This option sets the number of processors to be used on this job
and MUST BE USED on EVERY bsub command. Since this value is used by
LSF to place the job in the appropriate queue, a job without this
parameter cannot be queued.
- -o outfile
This option reroutes the standard output to outfile. If
-o is used without -e, the standard error of the job is stored in outfile.
-o and -Ip are mutually exclusive.
- -P project
Assigns the job to the specified project. This option is only
needed for users that are working on multiple projects and need
to designate which project their job is to be executed under.
- -q qname
This specifies the queue to which the job is to be submitted.
The "preemptable" and "pri_day" queues require this option.
- -R span[ptile=1]
This option changes the default LSF behavior, and causes an MPI
job to be executed on only one processor on each node allocated
to the job.
When an MPI job is launched by LSF, by default, the job will be
executed on both processors on a node before moving on to the
next node. Those familiar with MPICH machine files will recognize
this behavior as "hostname:2". Therefore, the -R span[ptile=1]
option changes the default to "hostname:1".
- -x
Puts the host running your job into exclusive execution mode.
In exclusive execution mode, your job runs by itself on a host.
It is dispatched only to a host with no other jobs running, and
LSF does not send any other jobs to the host until the job completes.
Use this with the -R "span[ptile=1]" option.
- -W time
This option sets the maximum amount of wallclock time (in minutes
or HH:MM) to be used by this job and MUST BE USED on EVERY bsub
command, except for jobs submitted to the "pri_day" queue. Since this
value is used by LSF to place the job in the appropriate queue, a job
without this parameter cannot be queued.
- mpirun.lsf <executablefile>
This is required for jobs that use MPI. LSF automatically creates
the machine file. Please do not use "-np #" with this option.
MPI/MPICH users, please take special note of the -a, mpirun.lsf,
-R and -x options.
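To show how the MPI-related options fit together, the sketch below assembles, but does not run, a bsub command line. The helper name mpi_bsub_cmd and its "one_per_node" argument are hypothetical. Quoting span[ptile=1] keeps the shell from treating the square brackets as a filename pattern.

```shell
#!/bin/sh
# Assemble (but do not run) a bsub command line for an MPI job.
# Arguments: <processors> <wallclock minutes> <executable> [one_per_node]
mpi_bsub_cmd() {
    cmd="bsub -n $1 -W $2 -a mpich_gm"
    if [ "${4:-}" = "one_per_node" ]; then
        # One process per node, with each node reserved for this job.
        cmd="$cmd -R \"span[ptile=1]\" -x"
    fi
    echo "$cmd mpirun.lsf $3"
}

mpi_bsub_cmd 32 120 ./executablefile
mpi_bsub_cmd 16 60 ./executablefile one_per_node
```

The second call requests 16 processors spread over 16 nodes, which leaves the second processor on each node idle for the duration of the job.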
Examples of bsub:
bsub -n 32 -W 120 -a mpich_gm mpirun.lsf ./executablefile
This will launch an MPI job on 32 processors for 2 hours.
bsub -n 64 -W 2:35 -P xxx -a mpich_gm mpirun.lsf ./executablefile
This will launch an MPI job on 64 processors, for 2 hours 35
minutes, and the job will run under the project code "xxx"
bsub -n 128 -W 10 -a mpich_gm -o out mpirun.lsf /path/to/executable
This will launch an MPI job on 128 processors, for 10 minutes.
bsub -n 8 -W 5:30 -a mpich_gm -o out -q preemptable mpirun.lsf ./executablefile
This will launch an MPI job on 8 processors, for 5 hours 30
minutes, and the job will be forced into queue "preemptable".
bsub -n 1 -W 30 -Ip bash
This will produce an interactive job, with a bash shell prompt.
The commands below can be used for debugging interactively with TotalView:
bsub -n 8 -W 30 -Ip -a mpich_gm csh
machinefile.lsf > machines
setenv TOTALVIEW `which totalview`
mpirun -tv -machinefile ./machines -np 8 a.out
It is also possible to put commands and bsub options
into a shell script file and just send that file to bsub using
"bsub < scriptfile"
IMPORTANT NOTE: Be sure to use the "<" as
shown. This will cause LSF to read and interpret the bsub options
in scriptfile before placing the job in a queue.
Here is an example of a bsub scriptfile:
#!/bin/sh
#BSUB -n 32 -W 90
#BSUB -a mpich_gm
#BSUB -o outfile -e errfile # my default stdout, stderr files
# NOTE: LSF starts in the current working directory by default.
cd /home/<userdir>/test
mpirun.lsf ./simplempi80
- Launch this scriptfile as follows:
bsub < scriptfile
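A scriptfile can also carry the -J job array option described earlier. The sketch below writes such a scriptfile and prints it together with the submission command. The array name "sweep", the index range, the program name, and the file names are hypothetical. LSB_JOBINDEX is the variable LSF sets to the array index, and %I in the -o and -e file names expands to the same index. The site's own notes in /usr/local/doc/jobarray take precedence over this sketch.

```shell
#!/bin/sh
# Sketch: write a job-array scriptfile and show how it would be submitted.
scriptfile=$(mktemp)

cat > "$scriptfile" <<'EOF'
#!/bin/sh
#BSUB -n 1 -W 30
#BSUB -J "sweep[1-8]"
#BSUB -o out.%I -e err.%I
# LSF sets LSB_JOBINDEX to the index of this array element (1..8 here).
./serial_program "input_${LSB_JOBINDEX}.dat"
EOF

cat "$scriptfile"
echo "submit with: bsub < $scriptfile"
```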