Parallelization of Codes

on the SGI Origins

Sherry Chang
Scientific Consultant
NASA Advanced Supercomputing Facility
NASA Ames Research Center
MS 258-6
Moffett Field, CA 94035-1000

http://www.nas.nasa.gov/~schang

schang@nas.nasa.gov


Table of Contents

Links to Examples in this Document

Glossary


Good resources:


When ?

Before parallelizing a code, a programmer should optimize his code for single CPU execution. Many optimization techniques can accelerate performance in cache-memory systems such as the SGI Origins. Many tools are available for performance profiling. For more information on issues pertaining to single CPU optimization, see Performance Profiling and Optimization on the SGI Origins.


Why ?

The purpose of introducing parallelism into a code is to reduce the elapsed time used for execution. The elapsed time is the time that elapses from when the first processor starts execution to when the last procesor completes execution. Those issues relevant to how effectively this goal can be achieved are discussed below:


General Issues


SGI Origin Specific Issues

The SGI Origins use physically distributed but logically shared memory and processors based on the cache-coherent nonuniform memory access (cc-NUMA) technology. This technology enables applications to achieve scalability easier with the Origins than with conventional shared memory machines.

In order to fully take advantage of this technology and get good parallel performance, programmers should be aware of a few issues :


What ?

What in a code should one parallelize ? Almost every code has some form of parallelism. Depending on the nature of the code, one can either focus on decomposing the data or the functions (code segments) associated with the problem:


granularity of parallelism

Granularity is defined as the ratio of the amount of communication that a processor requires to the amount of computation that it performs.



Where ?

Where in a code should one parallelize ?


How ?

Parallelization Without Code Modification


Parallelization With Code Modification

Topics in Irix Programming; PART FOUR - Models of Parallel Computation
Use dplace for non-MP library programs
- important for MPI 2.0
- not needed for MPI 3.0


How to Measure the Performance of Your Parallel Code ?


Example: false sharing

Note: This example is modified from Example 8-5 in Chapter8 of SGI's Origin 2000 and Onyx Performance Turing and Optimization Guide.

In this example, a simple program (lower left panel) is used to demonstrate the occurrence of false sharing and how to use various profiling tools to detect such occurrence. Another program (lower right panel) is also given to demonstrate one method of removing false sharing.

The programs:

false sharing no false sharing
      program false_sharing
      parameter (m=4,n=100000)
      real a(n,m),s(m)
      real*4 dtime,tarray(2)

      t1=dtime(tarray)
      do i=1,m
      do j=1,n
      a(j,i)=(i+j)/5000.0
      end do
      end do
      t2=dtime(tarray)


      do k=1,100
      call sum85(a,s,m,n)
      write (6,*) 'k= ',k
      write (6,*) s
      end do

      t3=dtime(tarray)

      print *, 'time on initialization = ', 
     &         t2
      print *, 'time on real work = ', t3
      stop
      end

      subroutine sum85 (a,s,m,n)
      integer m, n, i, j
      real    a(n,m), s(m)
!$omp parallel do private(i,j), shared(s,a)
      do i = 1, m
         s(i) = 0.0
         do j = 1, n
            s(i) = s(i) + a(j,i)
         enddo
      enddo
      return
      end
 
      program no_false_sharing
      parameter (m=4,n=100000)
      real a(n,m),s(32,m)
      real*4 dtime,tarray

      t1=dtime(tarray)
      do i=1,m
      do j=1,n
      a(j,i)=(i+j)/5000.0
      end do
      end do
      t2=dtime(tarray)

      do k=1,100
      call sum85(a,s,m,n)
      write (6,*) 'k= ',k
      do i=1,m
      write (6,*) s(1,i)
      end do
      end do
      t3=dtime(tarray)

      print *, 'time on initialization = ', 
     &          t2
      print *, 'time on real work = ', t3

      stop
      end

      subroutine sum85 (a,s,m,n)
      integer m, n, i, j
      real    a(n,m), s(32,m)
!$omp parallel do private(i,j), shared(s,a)
      do i = 1, m
         s(1,i) = 0.0
         do j = 1, n
            s(1,i) = s(1,i) + a(j,i)
         enddo
      enddo
      return
      end


compile the codes

compiler version used : MIPSpro 7.3.1.1m

f90 -O0 -mp program.f                
f90 -O1 -mp program.f
f90 -O2 -mp program.f
f90 -O3 -mp program.f

run the jobs

Running host : steger or hopper, 250 MHz R10000 O2K machine, L2 cache size = 4 MB, memory per node = 490 MB

All of the runs were performed through PBS. The nodes (cpus and memory) that were assigned to these runs were not shared with other jobs.

To run the code in parallel, specify how many cpus and threads are to be used

#PBS -l ncpus=n
setenv OMP_NUM_THREADS n  (n=1,2,4 in the runs)

Four performance tools were used :

  1. dtime

    Provides a measure of the time used for certain segments of the code

  2. /bin/time a.out > output

    Provides a measure of the total time used for the entire code and of the speedup when different number of processors are used

  3. perfex -a -x -y a.out > output (get aggregrated output for all processes)
    or
    perfex -a -x -y -mp a.out > output (get output per process)

    Provide a method of finding what the code might be suffering

  4. setenv _SPEEDSHOP_HWC_COUNTER_NUMBER 31
    setenv _SPEEDSHOP_HWC_COUNTER_OVERFLOW 99 
    ssrun -exp prof_hwc a.out > output
    prof *.prof_hwc.* > ssrun_prof.out
    

    Provide a method of finding or confirming the factor that causes the code not to perform well and where in the code this is occurring.


Results

t2 (obtained using dtime function, see programs above) for the time spent on initialization; t3 (see programs above) for the time spent on real work; /bin/time (which reports real, user and sys) for overall performance of the code.

false sharing no false sharing

        -O0      -O1       -O2      -O3

1cpu
----

t2      0.062    0.061    0.007    0.006
t3      4.820    2.576    0.459    0.222
real    4.967    2.720    0.563    0.322
user    4.873    2.628    0.485    0.248
sys     0.040    0.041    0.040    0.040

2cpu
----

t2      0.062    0.061    0.013    0.012
t3     16.329    8.400    0.243    0.126
real   16.538    8.589    0.366    0.254
user   32.683   16.833    0.490    0.256
sys     0.051    0.048    0.046    0.047

4cpu
----

t2      0.062    0.061    0.015    0.015
t3     21.696   11.640    0.144    0.076
real   21.993   11.881    0.311    0.218
user   86.767   46.604    0.572    0.282
sys     0.068    0.063    0.057    0.058
        -O0      -O1       -O2      -O3

1cpu
----

t2      0.060    0.060    0.009    0.007
t3      4.789    2.572    0.460    0.213
real    4.930    2.711    0.565    0.310
user    4.844    2.623    0.484    0.237
sys     0.038    0.039    0.047    0.042

2cpu
----

t2      0.062    0.060    0.012    0.012
t3      2.417    1.296    0.238    0.124
real    2.580    1.444    0.364    0.248
user    4.868    2.624    0.482    0.253
sys     0.047    0.046    0.045    0.046

4cpu
----

t2      0.061    0.060    0.014    0.014
t3      1.226    0.670    0.134    0.077
real    1.646    0.872    0.280    0.214
user    5.213    2.791    0.560    0.288
sys     0.054    0.053    0.051    0.050

As seen in this table, the time used for initializing array a (t2) is negligible compared to the time spent on real work (t3). The walltime (real) reported by /bin/time is very close to t3. Of the various time reported (t2, t3, real, user, sys), t3 is the most useful when the speedup is to be calculated. With -O0 and -O1, as seen in the above table, the order of t3 used by false_sharing.f with different number of CPUs is : 4 cpus >> 2 cpus >> 1 cpus. Thus, not only t3 does not decrease , it increases signigicantly as more cpus are added. The cause of this behavior is due to a phenomenon called false sharing.

Using 4 CPUs as an example, for false_sharing.f (the program at the left panel) at each stage of the calculation, all four CPUs attempt to update one element of the sum array, s(i). For a CPU to update one element of s, it needs to gain exclusive access to the cache line holding that element. But the four words of s are probably contained in a single cache line. So only one CPU at a time can update an element of s. Instead of operating in parallel, the calculation is serialized.

Actually, it's a bit worse than merely serialized. For a CPU to gain exclusive access to a cache line, it first needs to invalidate cached copies that may reside in the other caches. Then it needs to read a fresh copy of the cache line from memory, because the invalidations will have caused data in some other CPU's cache to be written back to main memory. In a sequential version of the program, the element being updated can be kept in a register, but in the parallel version, false sharing forces the value to be continually reread from memory, in addition to serializing the updates.

One method of elminiating this problem is demonstrated in no_false_sharing.f (the program on the right panel). In this program, the elements s(1,1), s(1,2), s(1,3) and s(1,4) are separated by at least 32 ´ 4 = 128 bytes, and so are guaranteed to fall in separate cache lines. Implemented this way, the code achieves perfect parallel speedup (as demonstrated by the values of t3 in the table above for runs with 1, 2 and 4 cpus using -O0, -O1, -O2 and -O3).

Note: Using -O2 and -O3, the performances of false_sharing.f and no_false_sharing.f are nearly identical. Both have perfect speedup with increasing number of CPUs. The improved performance of false_sharing.f using -O2 and -O3 indicates that the compiler automatically detects false sharing and eliminates it for this code. (Warning: Elimination of false sharing by the compiler at -O2 or -O3 is not guaranteed.) I have further verified that the improved performace was not an artifact of code elimination (thus, work elimination) by the compiler. Specifically, using ssrun -ideal, the subroutine sum85 was indeed called 100 times by each thread no matter -O0, -O1, -O2, -O3 was used.

   -O3 100  __mpdo_sum85_1 (when 1 cpu was used)
   -O3 200  __mpdo_sum85_1 (when 2 cpus were used)
   -O3 400  __mpdo_sum85_1 (when 4 cpus were used) 
   

Detect what are going on using perfex

Click the perfex outputs below for the corresponding 1cpu, 2cpus, and 4cpus runs using -O0:

1 cpu perfex output

2 cpu perfex output (aggregrated from all processes)

4 cpu perfex output (aggregrated from all processes)

The following few characteristics in the perfex outputs indicate the occurrence of false sharing:

  1. Performances do not scale when more cpus are added. Even worse, the walltime does not decrease but increase when more cpus are added.

    Compare these quantities (highlighted in purple) in the above three perfex outputs.

  2. Significant increase in memory traffic
    (memory bandwidth used, MB/s, average per process)

    Compare these quantities (highlighted in green) in the above three perfex outputs.

  3. Event counters 31 and 29 increase from ~0 to a significant number (ignore the time associated with these two counters)

    Compare these quantities (highlighted in red) in the above three perfex outputs.

  4. Significant increase in event 7
    (quadwords written back from scache)

    Compare these quantities (highlighted in blue) in the above three perfex outputs.

    Origins cache are 'write-back cache', meaning that cache will write back to memory only when it is forced to do so. For example, (1) when the cache line needs to be used by another set of data that has the same lower bit address (2) when another cpu requesting this cache line and the current cpu has just update this cache line

  5. L2 cache line reuse drops significantly as more cpus are added. L2 cache line hit rate drops (say from 0.99 - < 0.85). Significat increase in L2 cache miss (event 26).

    Compare these quantities (highlighted in yellow) in the above three perfex outputs.

    I think L2 cache line reuse should drop because if another cpu just updates the same cache line, this cpu can not reuse what it has in cache. It has to discard this cache line and request a new copy (issuing new load) and since it is loading a new copy, it is considered a cache miss.

    Note: L2 cache miss is counted when the second quadword of a cache line is being written from memory to L2 cache.

P.S. For the defintion of some of the quantities (ex: L2 cache line reuse, etc.) used above, see Some useful statistics from perfex in my document of 'Performance Profiling and Optimization on the SGI Origins.'

Confirming false sharing is happening using ssrun with event counter 31

The R10000 and R12000 event counter 31 (store or prefetch-exclusive to shared block in Scache) is the best indicator of cache contention between CPUs. The CPU that accumulates a high count of event 31 is repeatedly modifying shared data. The ssrun output (aggregrated over all CPUS) below shows a high count of event 31 within subroutine sum85, confirming false sharing is happening there.

  1. 1 cpu -O0 -mp ssrun output
  2. 2 cpu -O0 -mp ssrun output
  3. 4 cpu -O0 -mp ssrun output

Read section 4.3 of the following report from SGI analysts (http://www.supercomp.org/sc96/proceedings/SC96PROC/ZAGHA/INDEX.HTM) to learn more about this behavior.

Additional performance data obtained on a different O2K - Delilah (300MHz R12000 and 8MB of L2 cahce)
for -O0 -mp, 4cpu run

  1. 4 cpu -O0 -mp perfex output
  2. 4 cpu -O0 -mp ssrun counter 31 output
  3. 4 cpu -O0 -mp ssrun -numa output

DATA PLACEMENT ISSUES

Data placement can be an issue only for parallel programs that are memory intensive and are not cache friendly.

Many environmental variables can be set to manipulate/change data placement in memory. For more information on these variables, see Section 4.2 of IRIX Environment Variables Ready Reference.

setenv _DSM_PLACEMENT FIRST_TOUCH
setenv _DSM_PLACEMENT ROUND_ROBIN
setnev _DSM_ROUND_ROBIN
setenv _DSM_MIGRATION ON
setenv _DSM_MIGRATION ALL_ON
setenv _DSM_MIGRATION_LEVEL level (level = 0 - 100)
setenv _DSM_VERBOSE  (to check how data have been distributed)

Note: At NAS, the processors in the origin machines, steger and lomax, are grouped in clumps so that the cpus used for a job requesting less than 128 cpus will not be too far away. This should help codes to perform better and to reduce walltime variation. So, data placement may not be so bad ....


Example: memory locality

Note: This example is modified from Example 8-9 in Chapter8 of SGI's Origin 2000 and Onyx Performance Turing and Optimization Guide.

The programs:

memory locality 1 and 2 memory locality 3 and 4
      program memory_locality_1 
! and program memory_locality_2
      integer i, j, n, niters 
      parameter (n = 8*1024*1024, 
     &           ndim = n+35) 
      real a(ndim), b(ndim), 
     &     c(ndim), d(ndim), q 
      real*4 dtime,tarray(2)

      read *, niters
      print *, ' niters = ', niters

! initialization 
      t1=dtime(tarray)
!memory_locality_1 : comment out the following
!           omp parallel
!memory_locality_2 : keep the following
!           omp parallel

!$omp parallel do private(i) shared(a,b,c,d)
      do i = 1, n 
         a(i) = 1.0 - 0.5*i 
         b(i) = -10.0 + 0.01*(i*i) 
         c(i) = 2*i - 0.3
         d(i) = 0.5*i
      enddo 
      t2=dtime(tarray)

! real work 
      do it = 1, niters 
         q = 0.01*it 
!$omp parallel do private(i) shared(a,b,c,d,q)

         do i = 1, n 
            a(i) = a(i) + q*b(i) 
            c(i) = c(i) + d(i)
            a(i) = a(i) + sqrt(c(i))
         enddo 
       call sub(a,b,ndim)
      enddo

      t3=dtime(tarray)
      print *, a(1), a(n), q
      print *, 'time on initialization = ', t2
      print *, 'time on real work =',t3
      end

      subroutine sub(a,b,ndim)
      real a(ndim), b(ndim)
      return
      end
      program memory_locality_3
! and program memory_locality_4

      integer i, j, n, niters 
      parameter (n = 8*1024*1024, 
     &           ndim = n+35) 
      real a(ndim), b(ndim), 
     &     c(ndim), d(ndim), q 
      real*4 dtime,tarray(2)

      read *, niters
      print *, ' niters = ', niters

! initialization 
      t1=dtime(tarray)
!memory_locality_3 : comment out the line
!         with distribute_reshape
!$sgi distribute a(block),b(block),c(block),d(block)
!memory_locality_4 : comment out the above line
!         with distribute a(block) ... 
!$sgi distribute_reshape a(block),b(block),c(block),d(block)
      do i = 1, n 
         a(i) = 1.0 - 0.5*i 
         b(i) = -10.0 + 0.01*(i*i) 
         c(i) = 2*i - 0.3
         d(i) = 0.5*i
      enddo 
      t2=dtime(tarray)

! real work 
      do it = 1, niters 
         q = 0.01*it 
!$omp parallel do private(i) shared(a,b,c,d,q)
!$sgi+ affinity (i) = data(a(i))
         do i = 1, n 
            a(i) = a(i) + q*b(i) 
            c(i) = c(i) + d(i)
            a(i) = a(i) + sqrt(c(i))
         enddo 
       call sub(a,b,ndim)
      enddo

      t3=dtime(tarray)
      print *, a(1), a(n), q
      print *, 'time on initialization = ', t2
      print *, 'time on real work =',t3
      end

      subroutine sub(a,b,ndim)
      real a(ndim), b(ndim)
      return
      end

The program memory_locality_1 in the left panel demonstrates the effect of poor data placement (using the default first-touch placement policy) due to the initialization of data by a single CPU. Because the initialization takes a small amount of time compared with the "real work," parallelizing it doesn't reduce the sequential time of this code by much. Some programmers wouldn't bother to parallelize the first loop for a traditional shared memory computer. But for the SGI Origins that use the NUMA architecture, additional care is needed to get good performance and speedup. Under the first-touch policy, all of the program's memory ends up allocated in the node running the main thread. This causes two effetcs:

One way to improve data placement is to use the Round Robin placement policy, rather than first-touch. Under this policy, data are allocated in a round-robin fashion from all the nodes the program runs on. The data are still initialized by a single CPU, but the memory holding them will not be allocated from a single node; it will be evenly spread out among all the nodes running the program. This may not place the data in their optimal locations and thus the access times are not likely to be minimized. But there will be no bottlenecks, so scalability will not be limited. This method does not require any modification of the program or the compilation. One simply sets the following environmental variable before executing the code.

setnev _DSM_ROUND_ROBIN

However, if you are relying on the first-touch policy to ensure a good data placement, it is critical to parallelize the initialization code (as in program memory_locality_2, also in the left panel, see comments) in exactly the same way as the processing loop.

Another way to improve data placement is to use the SGI's Distribute directory (an SGI extention to the OpenMP directories for the NUMA architecture), as shown in program memory_locality_3 in the right panel. When this method is used, a single CPU will handle the distribution of data among the memory of multiple CPUs. How the data are distributed depend on the mapping used. Specifically, the Distribute directive allows you to specify a different mapping for each dimension of each distributed array. The possible mappings are these:

In program memory_locality_3, two directives are used: The first, Distribute, instructs the compiler to allocate the memory for the arrays a, b, c and d from all nodes on which the program runs. The second directive, Parallel Do, uses the clause AFFINITY (I) = DATA (A(I)), an SGI extension, to tell the compiler to distribute work to the CPUs based on how the contents of array a are distributed. Note that because the data is explicitly distributed, it is no longer necessary to parallelize the initialization loop to properly distribute the data among all CPUs using first-touch allocation (although it is still good programming practice to parallelize data initializations).

In this program, memory_locality_3, Distribute specifies a BLOCK mapping for all four arrays. BLOCK means that, when running with p CPUs, each array is to be divided into p blocks of size ceiling(n/p), with the first block assigned to the first CPU, the second block assigned to the second CPU, and so on. The intent of assigning blocks to CPUs is to allow each block to be stored in one CPUs' local memory. However, only whole pages are distributed, so when a page straddles blocks belonging to different CPUs, it is stored entirely in the memory of a single node. As a result, some CPUs will use nonlocal accesses to some of the data on one or two pages per array. An imperfect data distribution has a negligible effect on performance because, as long as a "block" comprises at least a few pages, the ratio of nonlocal to local accesses is small. When the block size is less than a page, you must live with a larger fraction of nonlocal accesses, or use the reshaped directives as shown in program memory_locality_4. (In general, you should try to arrange it so that each CPU's share of a data structure exceeds the size of a page. Data placement is rarely important for arrays smaller than a page, because if they are used heavily, they fit entirely in cache.)

In program memory_locality_4, the distribute_reshape directory is used to achieve the desired data distribution without page-granularity limitations.


Note:

If the dimension of arrays a, b, c and d is set to be 8*1024*1024 (32MB), cache trashing may occur. This is due to the combining facts that : (i) the L2 cahce size is either 4MB (hopper, steger) or 8MB (lomax); (ii) the lower bits of a(i), b(i), c(i) and d(i) are identical; (iii) a(i), b(i), c(i), d(i) will be loaded into the same cache line. With -O0, -O1 and -O2, compiler will not pad the arrays to avoid cache trashing. But at -O3, the Loop Nest Optimizer (LNO) will pad these arrays. In the programs above, cache trashing is prevented by explicitly offseting the dimension by 35.

compile the codes

compiler version used : MIPSpro 7.3.1.1m

f90 -O3 -mp program.f

run the jobs

Running host : steger or hopper, 250 MHz R10000 O2K machine, L2 cache size = 4 MB, memory per node = 490 MB

All of the runs were performed through PBS. The nodes (cpus and memory) that were assigned to these runs were not shared with other jobs.

To run the code in parallel, specify how many cpus and threads are to be used

#PBS -l ncpus=n
setenv OMP_NUM_THREADS n (n=1,2,4,...,32 in the runs)
setnev _DSM_ROUND_ROBIN (used only when Round Robin Placement is desired)

niters is read from stdin as 100 ( (i) to create a lot more real work than initialization and (ii) I guess it was intended to show the effect of 'page migration' when it was turned on).


Four performance tools were used :

  1. dtime

  2. /bin/time a.out > output

  3. perfex -a -x -y a.out > output (get aggregrated output for all processes)
    or
    perfex -a -x -y -mp a.out > output (get output per process)


Results:

t2 (obtained using dtime function, see programs above) for the time spent on initialization; t3 (see programs above) for the time spent on real work; /bin/time (which reports real, user and sys) for overall performance of the code.

memory_locality_1 memory_locality_1 memory_locality_2 memory_locality_3 memory_locality_4
first-touch round robin first-touch sgi distribute sgi distribute_reshape
1 cpu run

t2      1.856
t3     70.038
real   72.400
user   70.387
sys     1.553
 

2 cpu run t2 1.731 t3 36.266 real 38.446 user 72.844 sys 1.457
4 cpu run t2 1.825 t3 32.644 real 35.027 user 130.705 sys 1.690
6 cpu run t2 1.742 t3 34.057 real 36.345 user 204.949 sys 1.614
8 cpu run t2 1.783 t3 34.974 real 37.335 user 280.797 sys 1.688
10 cpu run t2 1.872 t3 35.547 real 38.095 user 356.904 sys 1.804
12 cpu run t2 1.739 t3 35.759 real 38.196 user 430.225 sys 1.733
14 cpu run t2 1.396 t3 35.966 real 38.125 user 504.370 sys 1.493
16 cpu run t2 1.399 t3 36.124 real 38.301 user 578.638 sys 1.530
18 cpu run t2 1.406 t3 35.542 real 37.809 user 642.900 sys 1.502
20 cpu run t2 1.353 t3 35.510 real 37.914 user 713.693 sys 1.637
22 cpu run t2 1.346 t3 35.455 real 37.779 user 784.726 sys 1.692
24 cpu run t2 1.405 t3 35.536 real 38.005 user 857.089 sys 1.819
32 cpu run t2 1.459 t3 35.740 real 38.517 user 1122.633 sys 2.168
1 cpu run

t2     1.784
t3    69.971
real  72.309
user  70.320
sys    1.485
 

2 cpu run t2 1.556 t3 36.345 real 38.378 user 73.010 sys 1.289
4 cpu run t2 1.747 t3 19.949 real 22.155 user 80.048 sys 1.499
6 cpu run t2 1.707 t3 15.527 real 17.691 user 93.641 sys 1.485
8 cpu run t2 1.845 t3 14.233 real 16.587 user 114.480 sys 1.661
10 cpu run t2 1.428 t3 10.872 real 12.772 user 109.214 sys 1.251
12 cpu run t2 1.611 t3 9.593 real 11.839 user 116.564 sys 1.460
14 cpu run t2 1.672 t3 9.026 real 11.373 user 127.921 sys 1.558
16 cpu run t2 1.623 t3 8.884 real 11.195 user 144.446 sys 1.536
18 cpu run t2 1.527 t3 7.175 real 9.400 user 131.807 sys 1.433
20 cpu run t2 1.527 t3 6.525 real 8.831 user 133.808 sys 1.466
22 cpu run t2 1.570 t3 5.659 real 8.083 user 128.730 sys 1.523
24 cpu run t2 1.551 t3 6.268 real 8.759 user 156.548 sys 1.547
32 cpu run t2 1.704 t3 7.284 real 10.254 user 240.865 sys 1.905
1 cpu run

t2     1.327
t3    77.896
real  79.746
user  78.258
sys    1.010
 

2 cpu run t2 1.313 t3 39.442 real 41.285 user 79.380 sys 2.228
4 cpu run t2 0.716 t3 19.717 real 20.978 user 79.541 sys 2.428
6 cpu run t2 0.543 t3 13.155 real 14.384 user 79.819 sys 2.564
8 cpu run t2 0.376 t3 10.034 real 11.378 user 81.064 sys 2.588
10 cpu run t2 0.330 t3 7.901 real 9.272 user 81.058 sys 2.463
12 cpu run t2 0.303 t3 6.604 real 7.963 user 81.607 sys 2.398
14 cpu run t2 0.313 t3 5.745 real 7.187 user 83.460 sys 2.437
16 cpu run t2 0.225 t3 5.050 real 6.553 user 84.645 sys 2.296
18 cpu run t2 0.227 t3 4.477 real 6.152 user 86.820 sys 2.305
20 cpu run t2 0.248 t3 4.050 real 5.721 user 86.972 sys 2.365
22 cpu run t2 0.232 t3 3.600 real 5.269 user 86.381 sys 2.272
24 cpu run t2 0.247 t3 3.379 real 5.182 user 88.941 sys 2.456
32 cpu run t2 0.247 t3 5.604 real 7.725 user 190.906 sys 2.689
1 cpu run

t2     1.741
t3    71.300
real  73.725
user  71.619
sys    1.469
 

2 cpu run t2 1.417 t3 37.517 real 39.422 user 75.309 sys 1.183
4 cpu run t2 1.604 t3 18.821 real 20.846 user 75.541 sys 1.409
6 cpu run t2 1.545 t3 12.609 real 14.552 user 76.112 sys 1.331
8 cpu run t2 1.401 t3 9.473 real 11.277 user 76.456 sys 1.197
10 cpu run t2 1.429 t3 7.652 real 9.537 user 77.361 sys 1.244
12 cpu run t2 1.446 t3 6.477 real 8.546 user 78.925 sys 1.399
14 cpu run t2 1.415 t3 5.533 real 7.545 user 79.139 sys 1.263
16 cpu run t2 1.415 t3 4.816 real 6.866 user 79.618 sys 1.285
18 cpu run t2 1.392 t3 4.311 real 6.383 user 80.439 sys 1.278
20 cpu run t2 1.533 t3 3.901 real 6.195 user 81.974 sys 1.447
22 cpu run t2 1.413 t3 3.568 real 5.822 user 83.707 sys 1.344
24 cpu run t2 1.408 t3 3.280 real 5.604 user 84.957 sys 1.381
32 cpu run t2 1.510 t3 5.332 real 7.998 user 175.450 sys 1.858
1 cpu run

t2      1.349
t3     67.738
real   69.588
user   68.084
sys     1.047
 

2 cpu run t2 1.344 t3 35.653 real 37.429 user 71.602 sys 1.077
4 cpu run t2 1.342 t3 17.847 real 19.546 user 71.685 sys 1.083
6 cpu run t2 1.359 t3 12.074 real 13.793 user 72.836 sys 1.126
8 cpu run t2 1.415 t3 9.007 real 10.766 user 72.398 sys 1.197
10 cpu run t2 1.404 t3 7.274 real 9.037 user 73.044 sys 1.197
12 cpu run t2 1.426 t3 6.130 real 7.898 user 73.742 sys 1.249
14 cpu run t2 1.438 t3 5.271 real 7.119 user 74.616 sys 1.269
16 cpu run t2 1.482 t3 4.617 real 6.549 user 74.661 sys 1.341
18 cpu run t2 1.472 t3 4.133 real 6.154 user 76.472 sys 1.344
20 cpu run t2 1.511 t3 3.613 real 5.686 user 74.425 sys 1.412
22 cpu run t2 1.567 t3 3.276 real 5.473 user 74.850 sys 1.500
24 cpu run t2 1.558 t3 3.017 real 5.252 user 75.000 sys 1.537
32 cpu run t2 1.755 t3 4.211 real 6.994 user 144.383 sys 2.043

Detect what is going on:

Unfortunately, there is no profiling technique that will tell you definitively that poor data placement is hurting the performance of your program. But, one can 'suspect' poor data placement is happening for the program memory_locality_1 since:


Example: setenv _DSM_VERBOSE

Use !$sgi distribute as an example to show the data placement, Are there 24 MLD created and mapped to each thread ?


Example: dlook

dlook is a tool for showing memory and process placement. By default, dlook shows where each process was running when it exited and how its stack and heap data was placed with page size information. If sampling is enabled, (by using - sample n), this data is also displayed every n seconds.

To obtain a dlook output, use the -out option:

dlook -out dlook_output a.out < input > output

For illustration, outputs from dlook for

  1. memory_locality_1 with first-touch policy,
  2. memory_locality_1 with round-robin policy,
  3. memory_locality_2 with first-touch policy,
  4. memory_locality_3 with sgi distribute directory,
  5. memory_locality_4 with sgi distribute_reshape directory
are shown. In this example, 12 cpus are used.

Notice that each of arrays a, b, c, and d will occupy ~2048 pages (16KB page size). If the array elements are "equally" distributed among the memory of the 6 nodes (12 cpus), each node will allocate about 341 pages for 1/6 of array a ; 341 pages for 1/6 of array b; etc.


Example: dsm_home_threadnum()

In addition to dlook, you can obtain more detailed information about data placement within the program with the dsm_home_threadnum() intrinsic. It takes an address as an argument, and returns the number of the CPU in whose local memory the page containing that address is stored. It is used in Fortran as follows:

integer dsm_home_threadnum 
numthread = dsm_home_threadnum(array(i)) 

Since two CPUs are connected to each node and they share the same memory, dsm_home_threadnum returns the lowest CPU number of the two running on the node with the data. Note that the numbering of CPUs reported by dsm_home_threadnum is relative to the program, not the number of absolute physical cpu.

Using 12 cpus as an example, the placement of arrays in memory when they are initialized is examined for programs memory_locality_2, memory_locality_3 and memory_locality_4, by inserting the function calls to dsm_home_threadnum in these programs. The results obtained are summaried below for each program. Contiguous data stored in the same node are grouped with the same color. The format of these outputs are:

column 1 : array name
column 2 : the cpu number (p) performing the initialization, 
           (applied to memory_locality_2 only, for the other two
            programs, p is simply 0)
column 3 : the lower cpu number (p') of the node that has the data
column 4 : the first array element (i)
column 5 : the last array element (i')
column 6 : the number of pages used from the first (column 4) 
           to the last array (column 5) elements

  1. data placement output for memory_locality_2 with first-touch policy,
  2. data placement output for memory_locality_3 with sgi distribute directory,
  3. data placement output for memory_locality_4 with sgi distribute_reshape directory

In addition, the access of each array element during the real work is also examined and the results are shown below. In these outputs, contiguous data accessed by the same cpu are grouped with the same color. The format of these outputs are:

column 1 : array name
column 2 : the cpu number (p) accessing the data 
column 3 : the lower cpu number (p') of the node that has the data
column 4 : the first array element (i)
column 5 : the last array element (i')

  1. data access output for memory_locality_2 with first-touch policy,
  2. data access output for memory_locality_3 with sgi distribute directory,
  3. data access output for memory_locality_4 with sgi distribute_reshape directory

When data are initialized in parallel with OpenMP directive, data are distributed in whole pages based on first-touch policy. Although each pair of the 12 CPUS is responsible for initializing 1/6 (2*8*1024*1024/12=1398102 elements = 341.4 pages) of each array, some data initialized by one CPU (cpu-p) may have to be stored in the memory non-local to this CPU. For example, a(1394413) - a(1398102) (which take a total of 0.9 page of memory) are initialized by cpu 1 but they are stored in the memory of cpu 2. In addition, a(1394413) - a(2795244) will take a total of 342 pages (16KB page) and they are stored in memory of cpu 2. Some of these elements are accessed non-locally by cpu 1 (elements 1394413-1398102), the rest are accessed locally by cpus 2 (elements 1398103 - 2097153) and 3 (elements 2097154 - 2795244).

Data initialized by cpu p
Data stored in the memory of cpu p'

When data are initialized by a single CPU with SGI Distribute directive, data are distributed into the memory of many nodes in whole pages (not based on first-touch policy). When the data are accessed, data stored in the same node may be accessed by different cpus. For example, a(1398103) - a(1399340) are stored in the memory of cpu 0 but are accessed by cpu 2.

The data access output for memory_locality_2 with first-touch policy, and the data access output for memory_locality_3 with sgi distribute directory clearly show a difference. With first-touch policy, for example, a(1398103) is the first element to be initialized by cpu 2 and thus the memory of cpu 2 will store a whole page of data adjcent to a(1398103) including a(1394413)-a(1398102) which are to be initialized pretty late by cpu 1. On the contrary, with SGI distribute directory, first-touch policy is not used. Thus, the first element, for example, a(1398103) to be accessed by cpu 2 during real work might not be stored in the memory of cpu 2.

When data are initialized by a single CPU with SGI Distribute_reshape directive, for each array, same number of data are distributed into each node. Thus, each node contains 1398108 elements of each array (8*1024*1024+35)/6=1398108, which takes up 341.33 pages.

With -O3, it appears that the compiler adds extra padding among arrays a, b, c, and d. If -O2 is used, there will be no extra padding.


Example: More about Round Robin Distribution

The dsm_home_threadnum() intrinsic is also used to examine the distribution by a single cpu (say, cpu 0) of arrays a, b, c, and d into the memory of many nodes in program memory_locality_1 when round robin policy is used. (setenv _DSM_ROUND_ROBIN). The outputs are shown here:

  1. data placement output for memory_locality_1 with round robin policy
  2. data access output for memory_locality_1 with round robin policy

It is surprising that elements of arrays a and b are stored in the memory of cpus 0, 4 and 8. On the contrary, elements of arrays c and d are stored in the memory of cpus 2, 6 and 10. This seems odd because we "expect" elements of each array to be distributed to the memory of all nodes (cpus 0, 2, 4, 6, 8, 10) in a round robin fashion.

To understand this behavior, a few things have been tested.

The programs:

round robin 1 round robin 2

      program round_robin 1
      integer i, j, n, niters 
      parameter (n = 24*1024, ndim = n) 
      real a(ndim), b(ndim), 
     &     c(ndim), d(ndim), q 
      real*4 dtime,tarray(2)
      integer dsm_home_threadnum

      read *, niters
      print *, ' niters = ', niters

! initialization 
      t1=dtime(tarray)

      do i=1,n
         a(i) = 1.0 - 0.5*i
      enddo

      do i=1,n
         b(i) = -10.0 + 0.01*(i*i)
      end do

      do i=1,n
         c(i) = 2*i - 0.3
      end do

      do i=1,n
         d(i) = 0.5*i
      end do

    
      t2=dtime(tarray)

! real work 
      do it = 1, niters 
         q = 0.01*it 
!$omp   parallel do private(i, &
!$omp&  numthread1,numthread2,& 
!$omp&  numthread3,numthread4,num_omp) &
!$omp&  shared(a,b,c,d)

         do i = 1, n 
      num_omp=omp_get_thread_num()

      numthread1 = dsm_home_threadnum(a(i))
      write (6,*) 'a', num_omp,i,numthread1

      numthread2 = dsm_home_threadnum(b(i))
      write (6,*) 'b', num_omp, i,numthread2

      numthread3 = dsm_home_threadnum(c(i))
       write (6,*) 'c', num_omp, i,numthread3

      numthread4 = dsm_home_threadnum(d(i))
      write (6,*) 'd', num_omp, i,numthread4

            a(i) = a(i) + q*b(i) 
            c(i) = c(i) + d(i)
            a(i) = a(i) + sqrt(c(i))
         enddo 
       call sub(a,b,ndim)
      enddo

      t3=dtime(tarray)
      print *, a(1), a(n), q
      print *, 'time on initialization = ', t2
      print *, 'time on real work =',t3
      end

      subroutine sub(a,b,ndim)
      real a(ndim), b(ndim)
      return
      end

      program round_robin_2
      integer i, j, n, niters 
      parameter (n = 24*1024, ndim = n) 
      real a(ndim), b(ndim), 
     &     c(ndim), d(ndim), q 
      real*4 dtime,tarray(2)
      integer dsm_home_threadnum

      read *, niters
      print *, ' niters = ', niters

! initialization 
      t1=dtime(tarray)

      do i = 1, n 
         a(i) = 1.0 - 0.5*i 
         b(i) = -10.0 + 0.01*(i*i) 
         c(i) = 2*i - 0.3
         d(i) = 0.5*i
      enddo 











    
      t2=dtime(tarray)

! real work 
      do it = 1, niters 
         q = 0.01*it 
!$omp   parallel do private(i, &
!$omp&  numthread1,numthread2,&
!$omp&  numthread3,numthread4,num_omp) &
!$omp&  shared(a,b,c,d)

         do i = 1, n 
      num_omp=omp_get_thread_num()

      numthread1 = dsm_home_threadnum(a(i))
      write (6,*) 'a', num_omp,i,numthread1

      numthread2 = dsm_home_threadnum(b(i))
      write (6,*) 'b', num_omp, i,numthread2

      numthread3 = dsm_home_threadnum(c(i))
       write (6,*) 'c', num_omp, i,numthread3

      numthread4 = dsm_home_threadnum(d(i))
      write (6,*) 'd', num_omp, i,numthread4

            a(i) = a(i) + q*b(i) 
            c(i) = c(i) + d(i)
            a(i) = a(i) + sqrt(c(i))
         enddo 
       call sub(a,b,ndim)
      enddo

      t3=dtime(tarray)
      print *, a(1), a(n), q
      print *, 'time on initialization = ', t2
      print *, 'time on real work =',t3
      end

      subroutine sub(a,b,ndim)
      real a(ndim), b(ndim)
      return
      end

The results are listed here:

For the program round_robin_1,

  1. data placement output -O0,
  2. data placement output -O1,
  3. data placement output -O2,
  4. data placement output -O3

For the program round_robin_2,

  1. data placement output -O0,
  2. data placement output -O1,
  3. data placement output -O2
  4. data placement output -O3

Conclusion:

  1. In program round_robin_1 where the initialization of each array is done in separate do loop, the whole array a is intialized first, followed by the whole array of b, etc. When -O0, -O1 or -O2 is used, the OS first allocates pages of memory for all elements of array a before it allocates for array b, etc. Thus, the elements of each array are distributed among all 6 nodes.
  2. In program round_robin_1, with -O3, the compiler does global optimization which may mix the instructions in different do loops. Thus, allocation of pages to memory is done in a new order (by the compiler) rather than the order seen in the program. In a new order, the OS may allocates 1 page for a, followed by 1 page for b, 1 page for c, etc. Thus, the pages of a may not be distributed among all 6 nodes, as seen in data placement output -O3.
  3. For the program round_robin_2, with -O0, -O1, -O2 and -O3, since the initialization of all four arrays is done in a single do loop, the OS allocates 1 page for a, followed by 1 page for b, 1 page for c, etc. The resulting distribution is that elements in each array may not be distributed among all 6 nodes.
  4. For both programs, With -O3, the compiler does additional padding between arrays.


Glossary