Spring, 2019
Jewett
Running parallel and/or in batch
ATMS 502 / CSE 566
Numerical Fluid Dynamics
I parallelized my program using the PARALLEL DO statements described below.  I left my piecewise-linear lowest-level routine alone, put parallel directives before the nested loops that contained the call to that routine, and compiled the (Fortran, in my case) program with -reentrancy threaded so that subroutines can be safely called within parallelized DO loops.  You do not need to parallelize your directional-split advection.

  1. Background

    • We are using shared-memory parallelism: OpenMP.  To a programmer, our compute platform (Stampede2 - the Skylake cluster) looks, to some extent, like one vast pool of memory.  In reality it is a series of nodes, each a pair of 24-core processors tied together.  We will implement only the simplest of parallel directives - loop level: for (in C) or do (in Fortran).

    • Our parallelization is among the compute cores on a single Stampede2 Skylake "SKX" compute node.
  2. Quick start

    • Code:
      • add PARALLEL DO (Fortran) or omp parallel for (C) directives to loops in your code (examples below)
      • do this one routine at a time, and then try it out before moving to the next.
      • You can skip theta advection entirely, or if you want to try it, do it last (see below).
    • Makefile settings
      • add -openmp  to your makefile compile OPTIONS.
    • Ready to compile
      • "make"  (You might need to make clean, just once, before doing make.  After that, just "make" as needed).
    • Running
      • start idev, e.g. "idev  -p  skx-dev", which reserves a 48-core node (skylake development queue) just for you. 
      • I suggest  idev  -m  120   to use it for 120 minutes.
      • define an environment variable OMP_NUM_THREADS
        to be the number of compute cores to use (if set to 1, you are running single-core).
        • If you have bash as your login shell, you would type (for 8 cores):
               export   OMP_NUM_THREADS=8
        • If you have tcsh as your login shell, you must type (for 8 cores):
               setenv   OMP_NUM_THREADS   8
      • now run your compiled program in the usual way.
    • To confirm you are running more than one core
      • Note that your fastest run time may come with fewer than 16 cores.  At low resolution, 8 cores worked best for me; at high resolution, 14.
      • Run your code first with one core, then with 8.  To time it, run this way:
             time   your-program-name
      • or, put a call to OMP_GET_THREAD_NUM in your code
        (this caused problems for some students.  To try it, search for "Parallel Region Example" on this web page,
        or see the minimal sketch just below).
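        As a minimal illustration (this is an assumed, stand-alone example, not part of the course codes),
        the following C program prints each thread's ID using the standard OpenMP library calls
        omp_get_thread_num and omp_get_num_threads:

          #include <stdio.h>
          #include <omp.h>

          int main(void)
          {
              /* Each thread executes this block once; the count printed should
                 match the OMP_NUM_THREADS value you exported.                  */
              #pragma omp parallel
              {
                  int id       = omp_get_thread_num();   /* this thread (0..nthreads-1) */
                  int nthreads = omp_get_num_threads();  /* total threads in the region */
                  printf("Hello from thread %d of %d\n", id, nthreads);
              }
              return 0;
          }

        Compile it with the same -openmp flag; if every line reports "thread 0 of 1", the
        OMP_NUM_THREADS setting or the compile flag is not taking effect.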
    • If you want to try parallelizing temperature (theta) advection
      • I parallelized my BC, IC, PGF, Diffusion and (U/V/W) Advection routines.
      • Parallelizing theta advection has not always worked.  Doing the above is enough for full parallelization credit.
        But if you want to try, there are a few things to know:
        • put the parallel directive (PARALLEL DO in Fortran, #pragma in C) before
          each double-loop (before X advection, before Y, etc).  So parallelism takes place across the entire advection.
        • Do not parallelize the 1-D advection routine (this slows it down!)
        • Be careful with those PRIVATE directives - within each double loop, make sure you include (as private)
          all loop indices (e.g. i,j,k) and any 1-D arrays (whose values change) that you use to pass data to, or get it
          back from, your advect1d routine.  For example, in my code: PRIVATE(i,j,k,u1d,q1d_old,q1d_new).
          A sketch of this pattern is given at the end of this list.
        • Lastly, add -reentrancy threaded to OPTIONS inside your Makefile. 
        • make clean, and rebuild your program with make.
        • If it doesn't work, double-check that everything that should be private is, or skip theta parallelization entirely.
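        Here is a minimal C sketch of that pattern (the routine and array names - advect1d, u3d, q3d_old,
        q3d_new - and the dimensions are placeholders, not the actual course code; the point is where the
        directive goes and what must be private):

          #define NX 100
          #define NY 100
          #define NZ 100

          /* Stand-in for your real 1-D advection routine. */
          static void advect1d(int n, const double *u1d, const double *q_old, double *q_new)
          {
              for (int i = 0; i < n; i++)
                  q_new[i] = q_old[i];                    /* placeholder: copy only */
          }

          void x_advection(double u3d[NY][NZ][NX], double q3d_old[NY][NZ][NX], double q3d_new[NY][NZ][NX])
          {
              int i, j, k;
              double u1d[NX], q1d_old[NX], q1d_new[NX];

              /* Parallelize across the (j,k) columns; the 1-D routine itself stays serial.
                 The loop indices and the 1-D work arrays must all be private, or the
                 threads will overwrite each other's copies.                              */
              #pragma omp parallel for private(i,j,k,u1d,q1d_old,q1d_new)
              for (j = 0; j < NY; j++) {
                  for (k = 0; k < NZ; k++) {
                      for (i = 0; i < NX; i++) {          /* gather one 1-D slice    */
                          u1d[i]     = u3d[j][k][i];
                          q1d_old[i] = q3d_old[j][k][i];
                      }
                      advect1d(NX, u1d, q1d_old, q1d_new);
                      for (i = 0; i < NX; i++)            /* scatter the result back */
                          q3d_new[j][k][i] = q1d_new[i];
                  }
              }
          }

        The Fortran version is the same idea: !$OMP PARALLEL DO PRIVATE(i,j,k,u1d,q1d_old,q1d_new)
        placed immediately before the outer DO loop, and !$OMP END PARALLEL DO after it.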

  3. What parallel directives look like

    • Here's an excerpt from my example Fortran program discussed later on this page:

      !$OMP   PARALLEL DO PRIVATE(i,j,k)      
            do k = 1,nz
              do j = 2,ny-1
                do i = 2,nx-1
                  t2(i,j,k) = t1(i,j,k) + (Km*dt)/(dx*dx)*  &
                       ( t1(i+1,j,k)-2.*t1(i,j,k)+t1(i-1,j,k) )
                enddo
              enddo
            enddo
      !$OMP   END PARALLEL DO

    • Here's an excerpt from the example C program:

      #pragma omp parallel for shared(t1,t2) private(i,j,k)
            for (i=0; i<nx; i++) {
              for (j=0; j<ny; j++) {
                for (k=1; k<nz-1; k++) {
                  t2[i][j][k] = t1[i][j][k] + (Km*dt)/(dx*dx)*
                        ( t1[i][j][k+1]-2.*t1[i][j][k]+t1[i][j][k-1] );
                }
              }
            }

    • The Fortran OMP code statement looks like a comment; the C one a preprocessor directive.
      In each, we assert that:
      • we want to break up the work in these parallel sections among multiple compute cores, but
      • each processor will have its own private copy of variables i,j,k with which to do it (important!)

    • The shared directives above could be skipped.  Everything is shared by default.
    • The private directives are needed.  Particularly with nested loops, you want all loop variables to be private to each core (a short illustration is given at the end of this list).
    • Both the C and Fortran codes have the above in a time-step loop that runs for a little while.  But it won't take long to run.
    • Look at this page for more on parallelizing nested loops in Fortran and C
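    • To make the private requirement concrete, here is a small stand-alone sketch (an assumed example, not
      one of the course codes) showing two equivalent ways to handle the loop indices in C: list them in the
      private clause, or declare them inside the for statements so each thread automatically gets its own:

        #include <stdio.h>

        #define NX 64
        #define NY 64
        #define NZ 64

        static double t1[NX][NY][NZ], t2[NX][NY][NZ];

        int main(void)
        {
            /* Option 1: indices declared outside the loops are shared by default,
               so they must be listed in the private clause.                       */
            int i, j, k;
            #pragma omp parallel for private(i,j,k)
            for (i = 0; i < NX; i++)
                for (j = 0; j < NY; j++)
                    for (k = 1; k < NZ-1; k++)
                        t2[i][j][k] = t1[i][j][k];        /* placeholder update */

            /* Option 2: indices declared in the for statements (C99) are local to
               each thread, so they are automatically private.                     */
            #pragma omp parallel for
            for (int ii = 0; ii < NX; ii++)
                for (int jj = 0; jj < NY; jj++)
                    for (int kk = 1; kk < NZ-1; kk++)
                        t2[ii][jj][kk] = t1[ii][jj][kk];  /* placeholder update */

            printf("done: t2[1][1][1] = %g\n", t2[1][1][1]);
            return 0;
        }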

  4. My examples

    • You can copy my parallel_test.f or parallel_test.c codes from

           ~tg457444/502/Pgm6[/C or /Fortran]

      on Stampede2.  These are designed to be run using one of the batch job decks found in the same Stampede directory.
      To test it interactively, do
          1)  idev  -p  skx-dev         # starts up idev to give you a node on the Skylake development queue
          2)  make  parallel-test      # compiles my parallel-test code.  You will have had to copy it from my account first.
          3)  ./bench_it                   # runs my shell script "bench_it" which runs the parallel test code with a varied # of cores.

    • First, in Fortran or C, you simply add the -openmp flag when compiling.
      For these test programs, use the Makefile in the C or Fortran directory.  All you need to do is type make
      and follow the directions you are given (e.g. make parallel-test).

      This gives you two running programs in your directory: a 1-core serial code, and a parallel (OpenMP) code.
      For working with your own code, you can always add compiler options (e.g. "-traceback") to the Makefile.

      Both codes solve a simple multi-dimensional diffusion problem.  In 3-D, they are basically doing

                      theta(n+1) = theta(n) + K*dt/(dx*dx) * ( second derivative of theta in X, Y, and/or Z )
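
      Written out as code, one such forward-Euler step looks roughly like the following sketch (the function
      and array names here are placeholders and assume equal grid spacing; the actual test programs differ
      in detail):

        /* One forward-Euler diffusion step: t1 holds theta at time n, t2 at n+1. */
        void diffuse_step(int nx, int ny, int nz, double dt, double dx, double Km,
                          double t1[nx][ny][nz], double t2[nx][ny][nz])
        {
            double c = Km*dt/(dx*dx);

            #pragma omp parallel for
            for (int i = 1; i < nx-1; i++)
                for (int j = 1; j < ny-1; j++)
                    for (int k = 1; k < nz-1; k++)
                        t2[i][j][k] = t1[i][j][k] + c*(
                              t1[i+1][j][k] - 2.*t1[i][j][k] + t1[i-1][j][k]      /* d2(theta)/dx2 */
                            + t1[i][j+1][k] - 2.*t1[i][j][k] + t1[i][j-1][k]      /* d2(theta)/dy2 */
                            + t1[i][j][k+1] - 2.*t1[i][j][k] + t1[i][j][k-1] );   /* d2(theta)/dz2 */
        }
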
  5. Running my example codes interactively

    • First: the easiest way to run these is using the batch ".deck" files provided to you!! (see below)

    • Yes, they can be run interactively.  You get credit for running batch, of course, and I encourage you to do so.  Remember that when you run a long-running program interactively, you are going up against all sorts of other people compiling, editing their files, and so on.  Unless your program finishes in a minute or two, I'd encourage you to use idev or batch.

      But here is how you run these programs interactively:

      serial execution: run it interactively or, better yet, through idev.  You can put the time command before it to get a timing summary:

        /usr/bin/time  -p  parallel_test.1CORE

        ... mine ended with:

        real 44.12
        user 43.88
        sys 0.23
      (it took 43.88 sec of user cpu time, and 44.12 sec of wallclock)

      parallel execution: you first tell the system how many processors you want.

        export  OMP_NUM_THREADS=8
        /usr/bin/time  -p  parallel_test.OMP

       ... mine ended with:

        real 9.25
        user 69.23
        sys 0.27
         (it took 69.23 sec of user cpu time across all cores, and 9.25 sec of wallclock)

       The speedup here is substantial.  But the domain sizes are deliberately large, to make sure the individual
      threads all have a fair amount of "work" to do.  This is important because there is overhead associated with
      parallel execution.
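
      Using the timings above as a rough check: speedup = serial wallclock / parallel wallclock = 44.12 / 9.25 ≈ 4.8
      on 8 threads, or roughly 60% parallel efficiency.  Your own numbers will differ with resolution and core count.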

  6. Running my example codes in batch

    • Stampede2, like most high-performance computing (HPC) systems, has a set of compute nodes set aside for interactive work, while keeping the majority of processors dedicated to running jobs via batch.  Batch means you submit a text file (a job deck) that starts with information on what compute resources you need (how many CPUs, and for how long), followed by the commands needed to run your job.

    • OK, here's how you run it.  (a) Compile the programs as directed above.  (b) Edit either the Fortran (run_batch_fortran.deck) or the C (run_batch_C.deck) batch deck file.  They are simple text files; any editor (emacs, vi, or an editor on your local machine) will do.  The beginning of each (I'll list the Fortran one) has a section on compute resources and other information about your job, e.g.

      # =======================Stampede batch =================================
      #SBATCH -J atms502p6F      # Job Name
      #SBATCH -o atms502p6F.o%j # Output and error file name (%j expands to jobID)
      #SBATCH -p development      # Queue name: serial, development, normal, large...
      #SBATCH -n 1          # Total mpi tasks ... =1 for OpenMP
      #SBATCH -t 00:30:00      # Run time (hh:mm:ss)
      #SBATCH -A TG-ASC120037   # Charge to class account
      #SBATCH --mail-user=YOUR-EMAIL-ADDRESS@illinois.edu  # **** SET to YOUR email ****
      #SBATCH --mail-type=all
      # =======================================================================

    • That's it.  You are prepared to submit a job on Stampede2, run under our
      class account (the "-A" above), with a job name of atms502p6F, in the development queue, for
      up to 30 minutes of wallclock time (this test is fast!)  To submit it, type:

          sbatch   run_batch_fortran.deck  (or run_batch_C.deck if you are using C).


    • This should work.  Use squeue/showq commands to find the job status.  How do you know when it is done?

      • The patient way: wait for the email notification.
      • The less-patient way: periodically type:  squeue  -u  $USER
      • The least-patient (my) way:  run my shell script chk:

           ~tg457444/bin/chk

        This runs squeue, waits 10 seconds, and runs it again.  Hold down the control key and hit "C" to break out of the loop.

        When done, you will find a file like this in your directory:   atms502p6F.o11734
        The .o file contains output from running your program.
        If it wrote out any data, as your program should (to e.g. RunHistory.dat),
        those files will be in the directory you told it to use.

    • Please send me any questions that arise, or suggestions/corrections to improve this getting-started guide.