I parallelized my program using the PARALLEL DO statements described below. I left my piecewise-linear lowest-level routine alone, put parallel directives before the nested loops that contained the call to that routine, and compiled the (Fortran, in my case) program with -reentrancy threaded so that subroutines can be safely called within parallelized DO loops. You do not need to parallelize your directional-split advection.
- Background
  - We are using shared-memory parallelism: OpenMP. To some extent, our compute platform (Stampede2 - the Skylake cluster) looks, to a programmer, like one vast pool of memory. In the real world, it is a series of nodes, each a pair of 24-core computers tied together (details).
  - We will implement only the simplest of parallel directives - loop level: for (in C) or do (in Fortran).
  - Our parallelization is among the compute cores on a single Stampede2 Skylake "SKX" compute node.
- Quick start
  - Code:
    - Add PARALLEL DO (Fortran) or #pragma omp parallel for (C) directives to loops in your code (examples below).
    - Do this one routine at a time, and then try it out before moving to the next.
    - You can skip theta advection entirely, or if you want to try it, do it last (see below).
- Makefile settings
  - Add -openmp to your makefile compile OPTIONS, e.g. as sketched below.
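    A minimal sketch of the relevant Makefile lines, assuming the Intel ifort compiler; the OPTIONS variable name matches this guide, but the program and file names are placeholders - match them to your own Makefile:

        OPTIONS = -O2 -openmp

        your_program: your_program.f
        	ifort $(OPTIONS) -o your_program your_program.f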
- Ready to compile
  - "make" (You might need to make clean, just once, before doing make. After that, just "make" as needed.)
- Running
  - Start idev, e.g. "idev -p skx-dev", which reserves a 48-core node (Skylake development queue) just for you.
  - I suggest "idev -m 120" to use it for 120 minutes.
  - Define an environment variable OMP_NUM_THREADS to be the number of compute cores to use (if set to 1, you are running single-core).
    - If you have bash as your login shell, you would type (for 8 cores):
        export OMP_NUM_THREADS=8
    - If you have tcsh as your login shell, you must type (for 8 cores):
        setenv OMP_NUM_THREADS 8
  - Now run your compiled program in the usual way.
- To confirm you are running on more than one core
  - Note your fastest run time may come with fewer than 16 cores. For low resolution, 8 cores worked best for me; for high resolution, 14.
  - Run your code first with one core, then with 8. To time it, run this way:
        time your-program-name
  - Or, put a call to OMP_GET_THREAD_NUM in your code (this caused problems for some students). To try it, search for "Parallel Region Example" on this web page; a minimal sketch also follows this list.
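    Here is a minimal sketch of such a check, in free-format Fortran; the program name and message text are illustrative:

          program thread_check
          use omp_lib                       ! provides omp_get_thread_num, omp_get_num_threads
          implicit none
          integer :: myid
    !$omp parallel private(myid)
          myid = omp_get_thread_num()       ! 0 .. OMP_NUM_THREADS-1
          print *, 'Hello from thread', myid, 'of', omp_get_num_threads()
    !$omp end parallel
          end program thread_check

    With OMP_NUM_THREADS set to 8, you should see eight "Hello" lines, in arbitrary order; with it set to 1, just one.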
- If you want to try parallelizing temperature (theta) advection
  - I parallelized my BC, IC, PGF, Diffusion, and (U/V/W) Advection routines.
  - Parallelizing theta advection has not always worked. Doing the above is enough for full parallelization credit. But if you want to try, there are a few things to know:
    - Put the parallel directive (PARALLEL DO in Fortran, #pragma in C) before each double loop (before X advection, before Y, etc.), so that parallelism takes place across the entire advection.
    - Do not parallelize the 1-D advection routine (this slows it down!).
    - Be careful with those PRIVATE directives - make sure you include (as private, within the double for-or-DO loops) all indices (e.g. i,j,k) and any 1-D arrays (whose values change) that you use to pass data to, or get it back from, your advect1d routine; e.g., in my code, PRIVATE(i,j,k,u1d,q1d_old,q1d_new). See the sketch after this list.
    - Lastly, add -reentrancy threaded to OPTIONS inside your Makefile.
    - make clean, and rebuild your program with make.
    - If it doesn't work, double-check that everything that should be private is, or skip theta parallelization entirely.
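    A sketch of what the X-advection step might look like, in free-format Fortran. The array names, bounds, and the advect1d interface are placeholders (use your own); the point is that the directive wraps the outer double loop, and the loop indices and 1-D work arrays are PRIVATE so each thread gets its own copies:

          program theta_advec_sketch
          implicit none
          integer, parameter :: nx=64, ny=64, nz=64
          real :: theta(nx,ny,nz), u(nx,ny,nz)
          real :: u1d(nx), q1d_old(nx), q1d_new(nx)
          integer :: i, j, k
          theta = 1.0
          u     = 0.5
    !$omp parallel do private(i,j,k,u1d,q1d_old,q1d_new)
          do k = 1, nz
             do j = 1, ny
                do i = 1, nx                      ! copy one X-row into the work arrays
                   u1d(i)     = u(i,j,k)
                   q1d_old(i) = theta(i,j,k)
                end do
                call advect1d(q1d_new, q1d_old, u1d, nx)  ! serial 1-D advection, NOT parallelized
                do i = 1, nx
                   theta(i,j,k) = q1d_new(i)
                end do
             end do
          end do
    !$omp end parallel do
          print *, 'done; theta(1,1,1) =', theta(1,1,1)
          contains
             subroutine advect1d(qnew, qold, u1, n)       ! stand-in for your real routine
             integer, intent(in) :: n
             real, intent(in)    :: qold(n), u1(n)
             real, intent(out)   :: qnew(n)
             qnew = qold
             end subroutine advect1d
          end program theta_advec_sketch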
- What parallel directives look like
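  - At their simplest, loop-level directives look like this (free-format Fortran; the array name and bounds are illustrative). In C, the equivalent is a single "#pragma omp parallel for" line placed before the outer for loop:

          program directive_example
          implicit none
          integer, parameter :: n = 200
          real :: a(n,n,n)
          integer :: i, j, k
    !$omp parallel do private(i,j,k)
          do k = 1, n
             do j = 1, n
                do i = 1, n
                   a(i,j,k) = real(i + j + k)
                end do
             end do
          end do
    !$omp end parallel do
          print *, 'a(1,1,1) =', a(1,1,1)
          end program directive_example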
  - My examples
    - You can copy my parallel_test.f or parallel_test.c codes from ~tg457444/502/Pgm6[/C or /Fortran] on Stampede2. These are designed to be run using one of the batch job decks found in the same Stampede directory. To test interactively, do:
        1) idev -p skx-dev     # starts up idev to give you a node on the Skylake development queue
        2) make parallel-test  # compiles my parallel-test code. You will have had to copy it from my account first.
        3) ./bench_it          # runs my shell script "bench_it", which runs the parallel test code with a varied number of cores.
    - First, in Fortran or C, you simply add the -openmp flag when compiling. For these test programs, use the Makefile in the C or Fortran directory. All you need is to type make and follow the directions given (e.g. make parallel-test). This gives you two running programs in your directory: a 1-core serial code, and a parallel (OpenMP) code. For working with your own code, you can always add compiler options (e.g. "-traceback") to the Makefile.
    - Both codes solve a simple multiple-dimension diffusion problem. In 3-D, they are basically doing
        theta(n+1) = theta(n) + K*dt/(dx*dx) * (second derivative of theta in X, Y, and/or Z)
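      To make that update concrete, here is a minimal 1-D version of the same kind of calculation, in free-format Fortran; the names, sizes, and constants are illustrative, not the actual parallel_test code:

          program diffuse_sketch
          implicit none
          integer, parameter :: n = 1000, nsteps = 100
          real, parameter :: K = 1.0, dx = 1.0, dt = 0.1  ! K*dt/dx**2 <= 0.5 for stability
          real :: theta(n), thnew(n)
          integer :: i, step
          theta = 0.0
          theta(n/2) = 1.0                                ! initial spike
          do step = 1, nsteps
    !$omp parallel do private(i)
             do i = 2, n-1
                thnew(i) = theta(i) + K*dt/(dx*dx) * (theta(i+1) - 2.0*theta(i) + theta(i-1))
             end do
    !$omp end parallel do
             theta(2:n-1) = thnew(2:n-1)
          end do
          print *, 'max theta =', maxval(theta)
          end program diffuse_sketch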
- Running my example codes interactively
  - First: the easiest way to run these is using the batch ".deck" file provided to you! (see below)
  - Yes, they can be run interactively. You get credit for running batch, of course, and I encourage you to do so. Remember, when you run a long-running program interactively, you are going up against all sorts of other people compiling, editing their files, etc. Unless your program finishes in a minute or two, I'd encourage you to use idev, or batch. But here is how you run these programs interactively:
    • Serial execution: run it interactively or, better yet, through idev. You can put the time command before it to get a timing summary:
          /usr/bin/time -p parallel_test.1CORE
      ... mine ended with:
          real 44.12
          user 43.88
          sys  0.23
      (it took 43.88 sec of user CPU time, and 44.12 sec of wallclock)
    • Parallel execution: you first tell the system how many processors you want:
          export OMP_NUM_THREADS=8
          /usr/bin/time -p parallel_test.OMP
      ... mine ended with:
          real 9.25
          user 69.23
          sys  0.27
      (it took 69.23 sec of user CPU time across all cores, and 9.25 sec of wallclock)
      The speedup here is quite good. But the domain sizes are deliberately big, to make sure the individual processes all have a fair amount of "work" to do. This matters because there is overhead associated with parallel execution.
- Running my example codes in batch
  - Stampede2, like most high-performance computers (HPCs), has a set of compute nodes set aside for (lots of) interactive work, while keeping the majority of processors dedicated to running jobs via batch. Batch means you submit a text file (a job deck) that starts with information on what compute resources you need (how many CPUs, and for how long), followed by the commands needed to run your job.
  - I will update the notes below after Stampede2 returns to service!
  - OK, here's how you run it. (a) Compile the programs as directed above. (b) Edit either the Fortran (run_batch_fortran.deck) or the C (run_batch_C.deck) batch deck file. They are simple text files; any editor (emacs, vi, or an editor on your local machine) will do. The beginning of each (I'll list the Fortran one) has a section on compute resources and other information about your job, e.g.

        # ======================= Stampede batch =================================
        #SBATCH -J atms502p6F       # Job Name
        #SBATCH -o atms502p6F.o%j   # Output and error file name (%j expands to jobID)
        #SBATCH -p development      # Queue name: serial, development, normal, large...
        #SBATCH -n 1                # Total MPI tasks ... =1 for OpenMP
        #SBATCH -t 00:30:00         # Run time (hh:mm:ss)
        #SBATCH -A TG-ASC120037     # Charge to class account
        #SBATCH --mail-user=YOUR-EMAIL-ADDRESS@illinois.edu   # **** SET to YOUR email ****
        #SBATCH --mail-type=all
        # =======================================================================
  - That's it. You are prepared to submit a job on Stampede2, run under our class account (the "-A" above), with a job name of atms502p6F, in the development queue for up to 30 minutes of wallclock time (the "-t" above; this test is fast!). To submit it, type:
        sbatch run_batch_fortran.deck
    (or run_batch_C.deck if you are using C).
  - This should work. Use the squeue/showq commands to check the job status. How do you know when it is done?
    - The patient way: wait for the email notification.
    - The less-patient way: periodically type: squeue -u $USER
    - The least-patient (my) way: run my shell script chk:
          ~tg457444/bin/chk
      This runs squeue, waits 10 seconds, and runs it again. Hold down the control key and hit "C" to break out of the loop.
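      If you'd rather not copy the script, a rough one-line equivalent (an approximation based on the description above, not the actual chk script) is:
          while true; do squeue -u $USER; sleep 10; done
      (bash syntax; control-C to quit)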
  - When done, you will find a file like this in your directory: atms502p6F.o11734
    The .o file contains output from running your program. If it wrote out any data, as your program should (to e.g. RunHistory.dat), those files will be in the directory you told it to use.
- Please send me any questions that arise, or suggestions/corrections to improve this getting-started guide.