http://www.nas.nasa.gov/~schang
Good resources:
Before parallelizing a code, a programmer should optimize his code for single CPU execution. Many optimization techniques can accelerate performance in cache-memory systems such as the SGI Origins. Many tools are available for performance profiling. For more information on issues pertaining to single CPU optimization, see Performance Profiling and Optimization on the SGI Origins.
The purpose of introducing parallelism into a code is to reduce the elapsed time used for execution. The elapsed time is the time that elapses from when the first processor starts execution to when the last procesor completes execution. Those issues relevant to how effectively this goal can be achieved are discussed below:
Ideally, one would like to have 100% of a code executed in parallel. In reality, this is never achieved. If for some segments of a code, a dependence exists between program statements when the order of statement execution affects the results of the program, these segments must be executed in serial. Dependency is usually the cause why a code can not be well parallelized.
The fraction of time that a code spent running in parallel is represented by 'f' and is defined as follows:
ts = time spent on sequential region
ts' = time spent on parallel region when only 1 processor is used
tp = time spent on parallel region when N processors are used
= ts' / N (Ideally. Without taking into consideration of the
overhead and possible algorithm change to achieve
parallelism)
f (fraction = %) = tp /(ts + tp)
Parallel overhead is the additional amount of work that is required on the parallel implementation of a sequential algorithm arising from the use of a parallel computer. The possible sources of the overhead are :
extra work needed for starting parallel task, terminating parallel task and other software overhead imposed by the operating system, parallel compiler, libraries, and tools. For example, (1) spawning threads, (2) acquiring/releasing locks, and (3) dividing task among threads, etc.
Parallel processes typically need to exchange data. Two of the most important parameters that affect the efficiency of data communication are:
Latency is defined as the time it takes to send a zero-length message from one processor to another. Typically measured in microseconds. Lower latency is better. To reduce the communication cost due to latency, avoid sending short messages seperately. Instead, cluster them together if possible. (I.e., send the same amount of data using less messages)
Low latency is very important for fine grain parallelism.
Bandwidth is the rate of data transfer during communication. Typically measured in MB/sec. Higher bandwidth is better. To minimize communication overhead, send only necessary information.
Synchronization refers to the coordination of parallel processes in real time. This is implemented by establishing a sybchronization point where a process may not proceed further until another process(es) reaches the same logically equivalent point. Time wasted for synchronization may come from : (1) the time spent by one process waiting for another to complete, (2) the time spent waiting for a semaphore to be signaled.
Load imbalance refers to the unequal distribution of work load so that some processes complete before others and they end up waiting idle. If all processes are subject to a barrier synchronization point, the slowest process will determine the overall performance. Load imbalance can be caused by either software or hardware. For example, on the software side, it is unknown whether some portions of code will be executed due to conditional statements; on the hardware side, caches can cause unpredictable memory access times. These uncertainties will make the work load of each process unpredictable.
Below are a few possible scenarios of load imbalance:
The main reason for introducing parallelism is to reduce the elapsed time used for executing your code. One way to check if a code is well parallelized is to see if a code has a good speedup.
speedup(N) = elapsed time using 1-CPU / elapsed time using N-CPUs
Two possibilities exist for obtaining the elapsed time for 1-CPU execution:
Speedup measured as T(1opt)/T(N) highlights any algorithmic efficiencies that had to be sacrificed to achieve the parallel version. In addition, none of the parallel implementation penalties are hidden by this comparison and thus the speedup is not exaggerated.
Speedup measured as T(1)/T(N) shows how well the problem is coping with an increasing number of CPUs, and thus provides an indication as to the scalability of the parallel implementation. Without considering any possible parallel overhead, the limit to the speedup (measured with T(1)/T(N)) one can achieve is the fraction of the program that can be run in parallel. This fraction is typically less than 100%. The Amdahl's law says that the maximum theoretical speedup one can achieve is :
speedup(N,f) = T(1)/T(N) = (ts+tp)/(ts+tp/N) = 1/(1-f+f/N) ts = time spent on sequential region tp = time spent on parallel region f (fraction = %) = tp /(ts + tp)
The following table lists the speedups for various fractions (f=%) and number of CPUs (N) used. As seen in this table, unless a very high fraction of parallelization is achieved, the speedup is far from ideal when large number of CPUs are ued.
Max Theoretical Speedup on "N" CPUs with "%" Parallelism
% \N 2 3 | 4 | 5 6 7 | 8 | 12 16 32 64 1.0e6
------\------------|------|-----------------|------|--------------------------
10.0% | 1.05 1.07 | 1.08 | 1.09 1.09 1.09 | 1.10 | 1.1 1.1 1.1 1.1 1.1
25.0% | 1.14 1.20 | 1.23 | 1.25 1.26 1.27 | 1.28 | 1.3 1.3 1.3 1.3 1.3
40.0% | 1.25 1.36 | 1.43 | 1.47 1.50 1.52 | 1.54 | 1.6 1.6 1.6 1.6 1.7
| | | | |
50.0% | 1.33 1.50 | 1.60 | 1.67 1.71 1.75 | 1.78 | 1.8 1.9 1.9 2.0 2.0
55.0% | 1.38 1.58 | 1.70 | 1.79 1.85 1.89 | 1.93 | 2.0 2.1 2.1 2.2 2.2
60.0% | 1.43 1.67 | 1.82 | 1.92 2.00 2.06 | 2.11 | 2.2 2.3 2.4 2.4 2.5
| | | | |
65.0% | 1.48 1.76 | 1.95 | 2.08 2.18 2.26 | 2.32 | 2.5 2.6 2.7 2.8 2.9
70.0% | 1.54 1.88 | 2.11 | 2.27 2.40 2.50 | 2.58 | 2.8 2.9 3.1 3.2 3.3
75.0% | 1.60 2.00 | 2.29 | 2.50 2.67 2.80 | 2.91 | 3.2 3.4 3.7 3.8 4.0
| | | | |
80.0% | 1.67 2.14 | 2.50 | 2.78 3.00 3.18 | 3.33 | 3.8 4.0 4.4 4.7 5.0
85.0% | 1.74 2.31 | 2.76 | 3.13 3.43 3.68 | 3.90 | 4.5 4.9 5.7 6.1 6.7
90.0% | 1.82 2.50 | 3.08 | 3.57 4.00 4.38 | 4.71 | 5.7 6.4 7.8 8.8 10.0
| | | | |
92.5% | 1.86 2.61 | 3.27 | 3.85 4.36 4.83 | 5.25 | 6.6 7.5 9.6 11.2 13.3
95.0% | 1.90 2.73 | 3.48 | 4.17 4.80 5.38 | 5.93 | 7.7 9.1 12.5 15.4 20.0
96.0% | 1.92 2.78 | 3.57 | 4.31 5.00 5.65 | 6.25 | 8.3 10.0 14.3 18.2 25.0
| | | | |
97.0% | 1.94 2.83 | 3.67 | 4.46 5.22 5.93 | 6.61 | 9.0 11.0 16.6 22.1 33.3
97.5% | 1.95 2.86 | 3.72 | 4.55 5.33 6.09 | 6.81 | 9.4 11.6 18.0 24.9 40.0
98.0% | 1.96 2.88 | 3.77 | 4.63 5.45 6.25 | 7.02 | 9.8 12.3 19.8 28.3 50.0
| | | | |
98.5% | 1.97 2.91 | 3.83 | 4.72 5.58 6.42 | 7.24 | 10.3 13.1 21.8 32.9 66.7
99.0% | 1.98 2.94 | 3.88 | 4.81 5.71 6.60 | 7.48 | 10.8 13.9 24.4 39.3 100.0
99.5% | 1.99 2.97 | 3.94 | 4.90 5.85 6.80 | 7.73 | 11.4 14.9 27.7 48.7 200.0
% /N 2 3 | 4 | 5 6 7 | 8 | 12 16 32 64 1.0e6
The figure below shows an example of the speedup as a function of the number of processors for a fixed size problem. If the parallel fraction is 100% and no parallel overhead has incurred, an ideal (linear) speedup is achieved. If the parallel fraction is less than 100% (in this figure, less than 90%) and no parallel overhead has incurred, then the speedup follows Amdahl's law and is less than N. In reality, the speedup of most parallel applications is worst (less) than what is predicted by the Amdahl's law due to the parallel overhead.
On the other hand, for many applications as problem size increases, fraction of sequential operations decreases, so a better speedup will be seen (compared to the speedup for a smaller problem size).
Amdahl's law can be used to predict the speedup of N processors if a speedup of less processors (n < N) has been obtained. To do this, one first estimates the fraction of parallel region in a code as follows:
f = [n * (speedup(n) -1)] / [speedup(n) * (N-1)]
Amdahl's law is then used to predict speedup(N) where N > n.
efficiency = speedup(N) x 100 / number of processors
During execution, each cpu is either computing, communicating or idling. The efficiency is a useful measure as to what percentage of a processor's time is being spent in useful computation. An efficiency of 100% means that every CPU spends 100% of its time performing useful computation.
The scalability is the ability of a parallel program's performance to scale as more resources (usually CPUs) are added. There are a few common ways to measure scalability:
Usually, you expect the speedup(N) to be less than N, reflecting the fact that not all parts of a program benefit from parallel execution. However, in rare situations, one may see linear scalability or even superlinear scalability.
The performance of a parallel program improves linearly as a function of the number of CPUs. For example, when a program is 100% parallelized, the speedup of the code is: speedup(N)=N.
For a parallel program, if adding more CPUs accidentally relieves some other bottleneck, the speedup can exceed number of CPUs added, ie. speedup(N)>N.
Memory/cache effects
More processors typically also provide more memory/cache.
Total computation time decreases due to more page/cache hits.
Search anomalies
Parallel search algorithms.
Decomposition of search range and/or multiple search strategies.
One task may be "lucky" to find result early.
A superlinear speedup does not really result from parallel execution. It comes about because each CPU is
now working on a smaller set of memory. The problem data handled by any one CPU fits better in cache, so
each CPU executes faster than the single CPU could do. A superlinear speedup is welcome, but it indicates
that the sequential program was being held back by cache effects.
The SGI Origins use physically distributed but logically shared memory and processors based on the cache-coherent nonuniform memory access (cc-NUMA) technology. This technology enables applications to achieve scalability easier with the Origins than with conventional shared memory machines.
In order to fully take advantage of this technology and get good parallel performance, programmers should be aware of a few issues :
On the SGI Origins, the memory reference latency increases with each level of the memory hierarchy : registers < L1 cache < L2 cache < local memory < remote memory < disk. A miss at each level of the memory hierarchy multiplies the latency by an order of magnitude or more. Programs whose memory references are satisfied mostly from caches shall have good performances. To avoid cache misses, it is important to ensure spatial locality and temporal locality. This is true for both singal CPU and multiple CPU executations.
Each CPU in an Origin system has an independent secondary cache, organized as a set of 128-byte cache lines. The memory lines that were most recently used by the CPU are stored here for high-speed access.
When two or more CPUs access the same memory, each has an independent copy of that data. There can be as many copies of a data item as there are CPUs; and for some important tables in the IRIX kernel, this may often be the case.
Cache coherency means that the system hardware ensures that every cached copy remains a true reflection of the memory data, without software intervention. If one processor updates a location in shared memory, all the other processors know about the update.
The directory-based cache coherency mechanism uses a lot of dedicated circuitry in the hub to ensure that many CPUs can use the same memory, without race conditions, at high bandwidth. As long as memory reads far exceed memory writes (the normal situation), there is no performance cost for maintaining coherence.
However, when two or more CPUs alternately and repeatedly update the same cache line, performance suffers, because every time either CPU refers to that memory, a copy of the cache line must be obtained from the other CPU. This performance problem is generically referred to as cache contention. Two variations of it are:
Memory contention occurs because of the design of the algorithm; correcting it usually involves an algorithmic change. False sharing is contention that arises by coincidence, not design; it can usually be corrected by modifying data structures.
Part of performance tuning of parallel programs is recognizing cache coherency contention and eliminating it. The R10000 and R12000 event counter 31 (store or prefetch-exclusive to shared block in Scache) is the best indicator of cache contention between CPUs. The CPU that accumulates a high count of event 31 is repeatedly modifying shared data. Other CPUs that have a copy of the modified cache line will be sent invalidations. Another good indicator of cache contention is event 29 (external invalidation hits in Scache). The CPU that produces a high count of event 29 is being slowed because it is using shared data that is being updated by a different CPU. The CPU doing the updating generates event 31.
Memory access time is a function of distance to memory. That is, the time it takes for a CPU to access a memory location varies according to the location of the memory relative to the CPU (for example, how many hubs and routers the data must pass through to get to the requesting CPU).
The impact of NUMA on performance is important for:
For the multi-threaded programs, memory locality management (placing data in or near the local memory of the CPU that needs these data) is an important factor to achieve good parallelism (speedup and scalability). The Irix operating system is able to take care of many memory locality needs through:
For situations where memory locality management by the OS is not optimal, there are other ways to fine-tune data placement in order to achieve memory locality:
Three data placement policies are available on the Origins:
Memory is usually allocated on a "first-touch" basis; that is, it is allocated in the node where the program that first defines that page is executing. When that is not possible, the memory is allocated as close as possible (in router hops) to the CPU that first accessed the page. The IRIX scheduler maintains process affinity to CPUs based on both cache affinity (as in previous versions) and on memory affinity. When a process is ready to run it is dispatched to
This policy works well for fully parallelized programs, but serial initializations cause nonlocal accesses and bottlenecks.
_DSM_ROUND_ROBIN
Unfortunately none of the counter values reported by perfex provide a direct diagnosis of bad memory placement. You can suspect memory placement problems from a combination of circumstances:
This is a performance problem you can address for its own sake; but it demonstrates that the program depends on a high memory bandwidth.
The Origins at NAS are shared by many users. In a shared environment, performance of a parallel job varies depending on the load of the system. To provide consistent performances, cpuests have been implemented on the compute systems (hopper, steger, lomax and chapman, but not turing). At NAS, cpusets are configured such that (i) when a cpuset is created for your job, all of the processors and memory associated with that cpuset are dedicated to your job. No other jobs can share your cpus or memory. (ii) No swapping (between memory and disks) is allowed.
The Portable Batch System (PBS) used at NAS for batch jobs supports cpusets.
What in a code should one parallelize ? Almost every code has some form of parallelism. Depending on the nature of the code, one can either focus on decomposing the data or the functions (code segments) associated with the problem:
In data parallelism, one focuses first on dividing the data into small pieces and then partition the computation (functions) to be performed by associating each operation with the data. This partitioning yields a number of tasks, each comprising some data and a set of operations on that data. If an operation requires data from several tasks, communication is required to move data between tasks.
Data parallelism is commonly applied at the loop levels of a code. Thus, some code segments in a loop run concurrently on different data elements and data elements are assigned to various processors and are subjected to identical processing. When needed, communication takes place among processors to exchange data. With data parallelism, the amount of communication is usually high (fine grain parallelism).
In a data parallel environment (loop-level parallelism), data can be decomposed implicitly or explicitly:
In functional parallelism, one focuses first on dividing the computations (functions) to be performed into disjoint tasks. One then associates data with each task. When data used in different tasks overlap, communication will take place.
Functional parallelism is usually applied to coarser levels of a code, such as subroutines. Thus, multiple code segments are run concurrently. It works well if these code segments are independent of each other and little communication is needed (coarse grain parallelism). If the code segments need significant overlap of data, considerable communication is required to avoid replication of data. This is often a sign that a data parallelism approach should be considered instead.
A good example of the application of functional parallelism is in the simulation of earth climate which may comprise components such as atmosphere, ocean, etc. The simulations of these components are run concurrently. Exchange of data between components take place during computation.
Granularity is defined as the ratio of the amount of communication that a processor requires to the amount of computation that it performs.
A UNIX process consists of an address space, a large set of process state values, and one thread of execution.
In traditional UNIX practice, one process creates another with the system call fork(), which makes a duplicate of the calling process, after which the two copies execute in parallel.
IRIX also supports the system function sproc(), which creates a lightweight process. A process created with sproc() shares some of its process state values with its parent process
A thread is an independent execution state; that is, a set of machine registers, a call stack, and the ability to execute code. When IRIX creates a process, it also creates one thread to execute that process. However, you can write a program that creates many more threads to execute in the same address space.
There are three key differences between a thread and a process:
Threads exist within a process and do not have distinct copies of these UNIX state values. Threads share the single state belonging to their process.
Threads within a process always share the single address space belonging to their process.
In contrast, threads are scheduled by code that operates largely in the user address space, without kernel assistance. Thread scheduling can be faster than process scheduling
The finest level of granularity is to run individual statements in parallel. This is provided using any of three language products:
Where in a code should one parallelize ?
Parallelization Without Code Modification
Example:
% f90 -O3 -apo your_serial_program.f % setenv OMP_NUM_THREADS 4 (or setenv MP_SET_NUM_THREADS 4) % a.out
Invoking the Auto-Parallelizing Option (APO) during compilation automatically converts a sequential code into a parallel code by inserting parallel directives where it is safe and beneficial to do so. The compiler can produce a report showing which loops it could parallelize and which it could not parallelize, and why. The compiler can also produce a listing of the modified source code, after parallelizing and before loop-nest optimization. Below are a few things regarding -apo that one should be aware of:
APO does not parallelize unsafe loops:
See 9.4.1 Constructs That Inhibit Parallelization in MIPSpro 7 Fortran 90 Commands and Directives Reference Manual for more information.
Loops that inhibit APO's efficiency:
See 9.4.2 Constructs That Slow Down Parallelized Code in MIPSpro 7 Fortran 90 Commands and Directives Reference Manual for more information.
setenv OMP_NUM_THREADS 2
When a parallel program starts, the run-time support creates a pool of lightweight processes using the sproc() function. Initially the extra processes are blocked, while one process executes the opening passage of the program. When execution reaches a parallel section, the run-time library code unblocks as many processes as necessary. Each process begins to execute the same block of statements. The processes share global variables, while each allocates its own copy of variables that are local to one iteration of a loop, such as a loop index.
When a process completes its portion of the work of that parallel section, it returns to the run-time library code, where it picks up another portion of work if any work remains, or suspends until the next time it is needed. At the end of the parallel section, all extra processes are suspended and the original process continues to execute the serial code following the parallel section.
Here is a list of the files that may be useful:
Files
Content
file.list
APO listing file. To retain this,
specify the -apolist option.
-apokeep option also generates this file.
It contains information on the loops
that were executed in parallel and explains
why others were not executed in parallel
file.anl
To retain this,
specify the -apokeep option.
This file is ued by the ProDev ProMP tools.
file.m
To retain this,
specify the -apokeep option.
This file is an annotated version of your source code
that shows the insertion of multiprocessing directives.
It is similar to the file.w2f.f file. It is based on OpenMP
and mimics the behavior of the program after automatic parallelization.
This file is used by Workshop Pro MP.
file.w2f.f
Fortran transformation file. To retain this file,
specify -mplist (or -FLIST:=ON ?)
file.w2c.c
C transformation file. To retain this file,
specify -mplist (or -CLIST:=ON
Note: This option does not work with C++ ?).
file.L
Listing file containing a cross reference and a source listing. To retain this file, specify the -listing option.
file.s
Assembly language file. To retain this file,
specify the -S or -keep option.
To learn more about APO, see Chapter 9. The Auto-Parallelizing Option (APO) in MIPSpro 7 Fortran 90 Commands and Directives Reference Manual.
Parallelization With Code Modification
Example:
% f90 -O3 -mp your_program.f % setenv OMP_NUM_THREADS 4 (or setenv MP_SET_NUMTHREADS 4) % a.out % ssrun [options] a.out (for performance profiling using ssrun) % perfex -a -x -y a.out (for performance profiling using perfex)
The -mp flag enables the processing of the original SGI/PCF directives as well as the OpenMP directives. To selectively disable one or the other set of directives, add the following -MP option group flag to the -mp flag:
-MP:old_mp=off
disable processing of the original SGI/PCF directives, but retain the processing of OpenMP directives.
-MP:open_mp=off
disable processing of the OpenMP directives, but retain processing of the original SGI/PCF directives.
The MIPSpro 7 Fortran 90 compiler supports directives for performance tuning on Origin series systems. These directives are extensions to the OpenMP Fortran API.
See Chapter 5. Parallel Processing on Origin Series Systems in MIPSpro 7 Fortran 90 Commands and Directives Reference Manual for more information.
CAPO (Captools-based Automatic Parallelizer using OpenMP) is a tool developed at
NAS.
It is used to automate the insertion of compiler directives
to facilitate parallel processing on shared memory parallel (SMP) machines.
CAPO takes a fortran program (currently Fortran-77 only), performs
data dependence analysis, and generates either SGI's native or OpenMP directives
for a code.
See
CAPO (CAPTools-based Automatic Parallelizer using OpenMP) for more information.
Example:
Developed by James Taft at NAS,
MLP can be used on shared memory systems (not distributed memory systems).
MLP is a vastly simplified, and inherently more scalable alternative to MPI.
With MLP, one also needs to first think about data decomposition just as
developing an MPI code. After that, only three routines are needed
to achieve coarse grain parallelism. With each mlp process, OpenMP
with multiple threads can be used to achieve fine grain parallelism.
The three mlp routines are :
It contains
For more information, see:
% f90 -O3 your_program.f -lmpi
% mpirun -np 4 a.out
% mpirun -np 4 ssrun [options] a.out (for performance profiling using ssrun)
% mpirun -np 4 perfex -mp -o file a.out (for performance profiling using perfex)
% mpirun -np 4 dplace -place file a.out (for data placement using dplace)
% totalview mpirun -a -np 4 a.out (for debugging using totalview)
explicit control over locality of reference and interprocessor
communication
explicit control over locality of reference and interprocessor
communication
Use dplace for non-MP library programs - important for MPI 2.0 - not needed for MPI 3.0
speedup = elapsed time 1-CPU / elapsed time multiple-CPU
scalability = speedup vs. number of processors
scalability = Flops vs. number of processors
flops with -TARG:madd=OFF
Flops/s = ---------------------------
run time with -TARG:madd=ON
where the floating point operation count, flops, was measured with the perfex command.
(more accurate measure of Flops is obtained with 'perfex -e 0 -e 21 -x -y')
An example is Jim Taft's report of Overflow-MLP on 512 CPUS Lomax.
Performance of the Overflow-MLP CFD Code on the
NASA Ames 512 CPU Origin System
efficiency = speedup x 100 / number of processors
How to measure this ?
Do all threads issue a similar number of floating point operations (event 21)?
Example: false sharing
Note: This example is modified from Example 8-5 in Chapter8 of SGI's Origin 2000 and Onyx Performance Turing and Optimization Guide.
In this example, a simple program (lower left panel) is used to demonstrate the occurrence of false sharing and how to use various profiling tools to detect such occurrence. Another program (lower right panel) is also given to demonstrate one method of removing false sharing.
The programs:
compile the codes
compiler version used : MIPSpro 7.3.1.1m
run the jobs
Running host : steger or hopper,
250 MHz R10000 O2K machine, L2 cache size = 4 MB, memory per node = 490 MB
All of the runs were performed through PBS. The nodes (cpus and memory) that were
assigned to these runs were not shared with other jobs.
To run the code in parallel, specify how many cpus and threads are to be used
Four performance tools were used :
Provides a measure of the time used for certain segments of the code
Provides a measure of the total time used for the entire code and
of the speedup
when different number of processors are used
Provide a method of finding what the code might be suffering
Provide a method of finding or confirming the factor that causes the code
not to perform well and where in the code this is occurring.
Results
t2 (obtained using dtime function,
see programs above) for
the time spent on initialization; t3 (see programs above)
for the time spent on real work;
/bin/time (which reports real, user and sys)
for overall performance of the code.
As seen in this table, the time used for initializing array a (t2) is
negligible compared to the time spent on real work (t3). The walltime
(real) reported by /bin/time is very close to t3.
Of the various time reported (t2, t3, real, user, sys), t3 is the most useful
when the speedup is to be calculated.
With -O0 and -O1, as seen in the above table, the order of t3
used by false_sharing.f with different number of CPUs
is : 4 cpus >> 2 cpus >> 1 cpus. Thus, not only t3 does not decrease
, it increases signigicantly as more cpus are added. The cause of this
behavior is due to a phenomenon called false sharing.
Using 4 CPUs as an example,
for false_sharing.f (the program at the left panel)
at each stage of the calculation, all four CPUs
attempt to update one element of the sum array, s(i).
For a CPU to update one element of s, it needs to
gain exclusive access to the cache line holding
that element. But the four words of s are probably
contained in a single cache line. So only one CPU at a
time can update an element of s. Instead of
operating in parallel, the calculation is serialized.
Actually, it's a bit worse than merely serialized.
For a CPU to gain exclusive access to a cache line,
it first needs to invalidate cached copies that may
reside in the other caches. Then it needs to read
a fresh copy of the cache line from memory, because
the invalidations will have caused data in some
other CPU's cache to be written back to main memory.
In a sequential version of the program, the element
being updated can be kept in a register, but
in the parallel version, false sharing forces the
value to be continually reread from memory, in
addition to serializing the updates.
One method of elminiating this problem is demonstrated in no_false_sharing.f
(the program on the right panel). In this program,
the elements s(1,1), s(1,2), s(1,3) and s(1,4)
are separated by at least 32 ´ 4 = 128 bytes,
and so are guaranteed to fall in separate cache lines.
Implemented this way, the code achieves perfect parallel speedup
(as demonstrated by the values of t3 in the table above
for runs with 1, 2 and 4 cpus
using -O0, -O1, -O2 and -O3).
Note: Using -O2 and -O3, the performances of false_sharing.f and
no_false_sharing.f
are nearly identical. Both have perfect speedup with increasing number of CPUs.
The improved performance of false_sharing.f
using -O2 and -O3 indicates that the compiler
automatically detects false sharing and eliminates it for this code.
(Warning: Elimination of false sharing by the compiler at -O2 or -O3
is not guaranteed.)
I have further verified that the improved performace was not an artifact
of code elimination (thus, work elimination) by the compiler.
Specifically, using ssrun -ideal, the subroutine sum85 was indeed called
100 times by each thread no matter -O0, -O1, -O2, -O3 was used.
Detect what are going on using perfex
Click the perfex outputs below for the corresponding 1cpu, 2cpus, and 4cpus runs
using -O0:
2 cpu perfex output (aggregrated from all processes)
4 cpu perfex output (aggregrated from all processes)
The following few characteristics in the perfex outputs indicate the
occurrence of false sharing:
Compare these quantities (highlighted in purple) in the above three
perfex outputs.
Compare these quantities (highlighted in green) in the above three
perfex outputs.
Compare these quantities (highlighted in red) in the above three
perfex outputs.
Compare these quantities (highlighted in blue) in the above three
perfex outputs.
Origins cache are 'write-back cache', meaning that cache
will write back to memory only when it is forced to do so.
For example, (1) when the cache line needs to be used by another set
of data that has the same lower bit address
(2) when another cpu requesting this cache line and the current cpu
has just update this cache line
Compare these quantities (highlighted in yellow) in the above three
perfex outputs.
I think L2 cache line reuse should drop because if another cpu
just updates the same cache line, this cpu can not reuse what it
has in cache. It has to discard this cache line and request
a new copy (issuing new load) and since it is loading a new copy,
it is considered a cache miss.
Note: L2 cache miss is counted when the second quadword of a cache
line is being written from memory to L2 cache.
P.S. For the defintion of some of the quantities
(ex: L2 cache line reuse, etc.) used above, see
Some useful statistics from perfex in my document of
'Performance Profiling and Optimization on the SGI Origins.'
Confirming false sharing is happening using ssrun with event counter 31
The R10000 and R12000 event counter 31 (store or prefetch-exclusive to shared
block in Scache) is the best indicator of cache contention between CPUs.
The CPU that accumulates a high count of event 31 is repeatedly modifying shared data.
The ssrun output (aggregrated over all CPUS) below shows a high count of event 31
within subroutine sum85,
confirming false sharing is happening there.
Read section 4.3 of the following report from SGI analysts
(http://www.supercomp.org/sc96/proceedings/SC96PROC/ZAGHA/INDEX.HTM)
to learn more about this behavior.
Additional performance data obtained on a different O2K - Delilah (300MHz R12000 and 8MB of L2 cahce)
Data placement can be an issue only for parallel programs that are
memory intensive and are not cache friendly.
Many environmental variables can be set to manipulate/change
data placement in memory. For more information on these variables, see
Section 4.2 of IRIX Environment Variables Ready Reference.
Note: At NAS, the processors in the origin machines, steger and lomax,
are grouped in clumps so that the cpus used for a job requesting
less than 128 cpus will not be too far away.
This should help codes to perform better and to
reduce walltime variation.
So, data placement may not be so bad ....
Example: memory locality
Note: This example is modified from Example 8-9 in Chapter8 of SGI's
Origin 2000 and Onyx Performance Turing and Optimization Guide.
The program memory_locality_1 in the left panel demonstrates the
effect of poor data placement (using the default first-touch placement policy)
due to the initialization of data by a single CPU.
Because the initialization takes a small amount of time
compared with the "real work," parallelizing it doesn't
reduce the sequential time of this code by much.
Some programmers wouldn't bother to parallelize the first
loop for a traditional shared memory computer.
But for the SGI Origins that use the NUMA architecture, additional
care is needed to get good performance and speedup.
Under the first-touch policy, all of the program's memory ends up allocated
in the node running the main thread. This causes two effetcs:
One way to improve data placement is to use the Round Robin placement
policy, rather than first-touch. Under this policy, data are allocated
in a round-robin fashion from all the nodes the program runs on. The data
are still initialized by a single CPU, but the memory holding them will not
be allocated from a single node; it will be evenly spread out among all
the nodes running the program. This may not place the data in their
optimal locations and thus the access times are not likely to be minimized.
But there will be no bottlenecks, so scalability will not be limited.
This method does not require any modification of the program or the compilation.
One simply sets the following environmental variable before executing the code.
However, if you are relying on the first-touch policy to
ensure a good data placement, it is critical to
parallelize the initialization code (as in program memory_locality_2, also
in the left panel, see comments)
in exactly the same way as the
processing loop.
Another way to improve data placement is to use the SGI's Distribute
directory (an SGI extention to the OpenMP directories for the NUMA architecture),
as shown in program
memory_locality_3 in the right panel.
When this method is used, a single CPU will handle the distribution of data
among the memory of multiple CPUs. How the data are distributed depend on the
mapping used. Specifically,
the Distribute directive allows you to specify a
different mapping for each dimension of each
distributed array. The possible mappings
are these:
In program memory_locality_3, two directives are used:
The first, Distribute, instructs the compiler
to allocate the memory for the
arrays a, b, c and d from all nodes on which the program runs.
The second directive, Parallel Do, uses the
clause AFFINITY (I) = DATA (A(I)), an SGI extension,
to tell the compiler to distribute work to the
CPUs based on how the contents of array a are distributed.
Note that because the data is explicitly distributed, it is no
longer necessary to parallelize the initialization
loop to properly distribute the data among all CPUs using
first-touch allocation (although it is still good
programming practice to parallelize data initializations).
In this program, memory_locality_3,
Distribute specifies a BLOCK mapping for all four
arrays. BLOCK means that, when
running with p CPUs, each array is to be divided into
p blocks of size ceiling(n/p), with the first block
assigned to the first CPU, the second block assigned to the second CPU, and so on. The intent of
assigning blocks to CPUs is to allow each block to be stored in one CPUs' local memory.
However, only whole pages are distributed, so when a page straddles
blocks belonging to different CPUs, it is
stored entirely in the memory of a single node.
As a result, some CPUs will use nonlocal accesses to
some of the data on one or two pages per array.
An imperfect data distribution has a negligible
effect on performance because, as long as a "block"
comprises at least a few pages, the ratio of nonlocal
to local accesses is small. When the block size is less
than a page, you must live with a larger fraction
of nonlocal accesses, or use the reshaped directives as shown
in program memory_locality_4.
(In general, you should try to arrange it so that each CPU's
share of a data structure exceeds the size of a page.
Data placement is rarely important for arrays
smaller than a page, because if they are used heavily, they
fit entirely in cache.)
In program memory_locality_4,
the distribute_reshape directory is used to achieve the desired data
distribution without page-granularity limitations.
Note:
If the dimension of arrays a, b, c and d is set to be 8*1024*1024 (32MB),
cache trashing may occur. This is due to the combining facts that :
(i) the L2 cahce size is either 4MB (hopper, steger) or 8MB (lomax);
(ii) the lower bits of a(i), b(i), c(i) and d(i) are identical;
(iii) a(i), b(i), c(i), d(i) will be loaded into the same cache line.
With -O0, -O1 and -O2,
compiler will not pad the arrays to avoid cache trashing. But at -O3, the Loop
Nest Optimizer (LNO) will pad these arrays.
In the programs above, cache trashing is prevented by explicitly
offseting the dimension by 35.
compile the codes
compiler version used : MIPSpro 7.3.1.1m
f90 -O3 -mp program.f
run the jobs
Running host : steger or hopper, 250 MHz R10000 O2K machine, L2 cache size = 4 MB, memory per node = 490 MB
All of the runs were performed through PBS. The nodes (cpus and memory) that were assigned to these runs were not shared
with other jobs.
To run the code in parallel, specify how many cpus and threads are to be used
#PBS -l ncpus=n
niters is read from stdin as 100
( (i) to create a lot more real work than initialization and (ii) I guess
it was intended to show the effect of 'page migration'
when it was turned on).
t2 (obtained using dtime function, see
programs above)
for the time spent on initialization; t3 (see programs above) for the time
spent on real work; /bin/time (which reports real, user and sys) for overall performance of the code.
Unfortunately, there is no profiling technique that will tell
you definitively that poor
data placement is hurting the performance of your program.
But, one can 'suspect' poor data placement is happening for
the program memory_locality_1 since:
Use !$sgi distribute as an example to show the data placement,
Are there 24 MLD created and mapped to each thread ?
dlook is a tool for showing memory and process placement.
By default, dlook shows where each process was running when
it exited and how its stack and
heap data was placed with page size information.
If sampling is enabled, (by using - sample n),
this data
is also displayed every n seconds.
To obtain a dlook output, use the -out option:
For illustration, outputs from dlook for
Notice that each of arrays a, b, c, and d will
occupy ~2048 pages (16KB page size). If the array elements are
"equally" distributed among the memory of the 6 nodes (12 cpus), each
node will allocate about 341 pages for 1/6 of array a ; 341 pages for
1/6 of array b; etc.
In addition to dlook, you can obtain more detailed
information about data placement within the program with the
dsm_home_threadnum() intrinsic. It takes an address as an
argument, and returns the number of the
CPU in whose local memory the page containing that
address is stored. It is used in Fortran as follows:
Since two CPUs are connected to each node and they share the same memory,
dsm_home_threadnum returns the
lowest CPU number of the two running on the node with the data.
Note that the numbering of CPUs reported by dsm_home_threadnum
is relative to the program, not the number of absolute physical cpu.
Using 12 cpus as an example, the placement of arrays
in memory when they are initialized is examined for
programs memory_locality_2, memory_locality_3 and memory_locality_4,
by inserting the function calls to dsm_home_threadnum in these programs.
The results obtained are summaried below for each program. Contiguous data stored in the same
node are grouped with the same color. The format of these outputs are:
In addition, the access of each array element during the real work is also examined
and the results are shown below. In these outputs, contiguous data accessed by the same
cpu are grouped with the same color. The format of these outputs are:
When data are initialized in parallel with OpenMP directive,
data are distributed in whole pages based on first-touch policy.
Although each pair of the 12 CPUS
is responsible for initializing 1/6 (2*8*1024*1024/12=1398102 elements = 341.4 pages)
of each array, some data initialized by one CPU (cpu-p) may have to be stored
in the memory non-local to this CPU. For example, a(1394413) - a(1398102) (which take
a total of 0.9 page
of memory) are initialized
by cpu 1 but they are stored in the memory of cpu 2. In addition, a(1394413) - a(2795244)
will take a total of 342 pages (16KB page) and they are stored in memory of cpu 2.
Some of these elements are accessed non-locally by cpu 1
(elements 1394413-1398102), the rest are accessed locally
by cpus 2 (elements 1398103 - 2097153) and 3 (elements 2097154 - 2795244).
When data are initialized by a single CPU with SGI Distribute directive,
data are distributed into the memory of many nodes in whole pages (not
based on first-touch policy).
When the data are accessed, data stored in the same node may be accessed by
different cpus. For example, a(1398103) - a(1399340) are stored in the memory of cpu 0
but are accessed by cpu 2.
The
data access output for memory_locality_2 with first-touch policy, and the
data access output for memory_locality_3 with sgi distribute directory clearly show a difference. With first-touch policy, for example, a(1398103) is
the first element to be initialized by cpu 2 and thus the memory of cpu 2 will store a whole
page of data adjcent
to a(1398103) including a(1394413)-a(1398102) which are to be initialized pretty late by cpu 1.
On the contrary, with SGI distribute directory, first-touch policy is not used. Thus,
the first element, for example, a(1398103) to be accessed by cpu 2 during real work might not be stored in the memory of cpu 2.
When data are initialized by a single CPU with SGI Distribute_reshape directive,
for each array, same number of data are distributed into each node. Thus, each node
contains 1398108 elements of each array (8*1024*1024+35)/6=1398108, which takes
up 341.33 pages.
With -O3, it appears that the compiler adds extra padding among arrays
a, b, c, and d. If -O2 is used, there will be no extra padding.
Example: More about Round Robin Distribution
The dsm_home_threadnum() intrinsic is also used to examine the
distribution by a single cpu (say, cpu 0) of
arrays a, b, c, and d into the memory of many nodes
in program memory_locality_1
when round robin policy is used.
(setenv _DSM_ROUND_ROBIN). The outputs are shown here:
It is surprising that elements of arrays a and b are
stored in the memory of cpus 0, 4 and 8.
On the contrary, elements of arrays c and d are stored
in the memory of cpus 2, 6 and 10.
This seems odd because we "expect" elements of each array to be
distributed to the memory of all nodes (cpus 0, 2, 4, 6, 8, 10)
in a round robin fashion.
To understand this behavior, a few things have been tested.
The results are listed here:
For the program round_robin_1,
For the program round_robin_2,
Conclusion:
A sequence of program statements that contains no labels and no branches. A basic block can only be executed completely (because it
contains no labels, it cannot be entered other than at the top; because it contains no branches, it cannot be exited before the end), and
in sequence (no internal labels or branches). Any program can be decomposed into a sequence of basic blocks. The compiler optimizes
units of basic blocks. An ideal profile counts the use of each basic block.
Every reference to an array element results in a cache miss due to the
unfortunate alignment of arrays, i.e, they all map to the same cache location.
Formal parameters which always have a
particular constant value can be replaced by the constant, allowing
additional optimization. Global variables which are initialized to
constant values and never modified can be replaced by the constant.
Global names in shared code must normally
be referenced via addresses in a global table, in case they are
defined or preempted by another DSO (see dso(5)). If the compiler
knows it is compiling a main program and that the name is defined in
another of the source files comprising the main program, an absolute
reference can be substituted, eliminating a memory reference.
Calls to a procedure are replaced by a suitably modified
copy of the called procedure's body inline, even if the callee is in
a different source file.
A loop over an array accesses array elements from adjcent memory
addresses.
a point within an application where a task
may not proceed furhter until another task(s) reaches the same or logically
equivalent point. Synchronization can cause a parallel application's wall
clock execution time to increase
"fine grain parallelism" means individual tasks are
relatively small in terms of code size and execution time,
"coarse grain" is the opposite. You talk about the
"granularity" of the parallelism.
The smaller the granularity, the greater the potential for parallelism
and hence speedup but the greater the
overheads of synchronisation and communication.
the ability of a hardware or software to demonstrate a
proportionate increase in parallel speedup with the addition of more processors
re-use data in the cache
The synchronisation of data in multiple caches such that
reading a memory location via any cache will return the most
recent data written to that location via any (other) cache.
To ensure that all cache copies of data are true reflections of
the data in main memory.
A lock is a small software object that stands for the exclusive right to use some resource. The resource could
be the right to execute a section of code, or the right to modify a variable in memory, or the right to read or
write in a file, or any other software operation that must by performed serially, by one process at a time.
Before using a serial resource, the program claims the lock, and releases the lock when it is done with the
resource.
Usually an allocation policy gives a process certain number of
main memory pages within which to execute.
The number of pages allocated is also known as the resident set
(of pages).
Two policies for resident set allocation: fixed and variable.
When a new process is loaded into the memory, allocate a
certain number of page frames on the basis of application type,
or other criteria. Prepaging or demand paging is used to fill up
the pages.
When a page fault occurs select a page for replacement.
An alternatice to first touch allocation in which new pages are
distributed in rotation to each node. This prevents the hub chip
in each node from having to serve all
requests for memory from one program
false sharing
no false sharing
program false_sharing
parameter (m=4,n=100000)
real a(n,m),s(m)
real*4 dtime,tarray(2)
t1=dtime(tarray)
do i=1,m
do j=1,n
a(j,i)=(i+j)/5000.0
end do
end do
t2=dtime(tarray)
do k=1,100
call sum85(a,s,m,n)
write (6,*) 'k= ',k
write (6,*) s
end do
t3=dtime(tarray)
print *, 'time on initialization = ',
& t2
print *, 'time on real work = ', t3
stop
end
subroutine sum85 (a,s,m,n)
integer m, n, i, j
real a(n,m), s(m)
!$omp parallel do private(i,j), shared(s,a)
do i = 1, m
s(i) = 0.0
do j = 1, n
s(i) = s(i) + a(j,i)
enddo
enddo
return
end
program no_false_sharing
parameter (m=4,n=100000)
real a(n,m),s(32,m)
real*4 dtime,tarray
t1=dtime(tarray)
do i=1,m
do j=1,n
a(j,i)=(i+j)/5000.0
end do
end do
t2=dtime(tarray)
do k=1,100
call sum85(a,s,m,n)
write (6,*) 'k= ',k
do i=1,m
write (6,*) s(1,i)
end do
end do
t3=dtime(tarray)
print *, 'time on initialization = ',
& t2
print *, 'time on real work = ', t3
stop
end
subroutine sum85 (a,s,m,n)
integer m, n, i, j
real a(n,m), s(32,m)
!$omp parallel do private(i,j), shared(s,a)
do i = 1, m
s(1,i) = 0.0
do j = 1, n
s(1,i) = s(1,i) + a(j,i)
enddo
enddo
return
end
f90 -O0 -mp program.f
f90 -O1 -mp program.f
f90 -O2 -mp program.f
f90 -O3 -mp program.f
#PBS -l ncpus=n
setenv OMP_NUM_THREADS n (n=1,2,4 in the runs)
or
perfex -a -x -y -mp a.out > output (get output per process)
setenv _SPEEDSHOP_HWC_COUNTER_NUMBER 31
setenv _SPEEDSHOP_HWC_COUNTER_OVERFLOW 99
ssrun -exp prof_hwc a.out > output
prof *.prof_hwc.* > ssrun_prof.out
false sharing
no false sharing
-O0 -O1 -O2 -O3
1cpu
----
t2 0.062 0.061 0.007 0.006
t3 4.820 2.576 0.459 0.222
real 4.967 2.720 0.563 0.322
user 4.873 2.628 0.485 0.248
sys 0.040 0.041 0.040 0.040
2cpu
----
t2 0.062 0.061 0.013 0.012
t3 16.329 8.400 0.243 0.126
real 16.538 8.589 0.366 0.254
user 32.683 16.833 0.490 0.256
sys 0.051 0.048 0.046 0.047
4cpu
----
t2 0.062 0.061 0.015 0.015
t3 21.696 11.640 0.144 0.076
real 21.993 11.881 0.311 0.218
user 86.767 46.604 0.572 0.282
sys 0.068 0.063 0.057 0.058
-O0 -O1 -O2 -O3
1cpu
----
t2 0.060 0.060 0.009 0.007
t3 4.789 2.572 0.460 0.213
real 4.930 2.711 0.565 0.310
user 4.844 2.623 0.484 0.237
sys 0.038 0.039 0.047 0.042
2cpu
----
t2 0.062 0.060 0.012 0.012
t3 2.417 1.296 0.238 0.124
real 2.580 1.444 0.364 0.248
user 4.868 2.624 0.482 0.253
sys 0.047 0.046 0.045 0.046
4cpu
----
t2 0.061 0.060 0.014 0.014
t3 1.226 0.670 0.134 0.077
real 1.646 0.872 0.280 0.214
user 5.213 2.791 0.560 0.288
sys 0.054 0.053 0.051 0.050
-O3 100 __mpdo_sum85_1 (when 1 cpu was used)
-O3 200 __mpdo_sum85_1 (when 2 cpus were used)
-O3 400 __mpdo_sum85_1 (when 4 cpus were used)
(memory bandwidth used, MB/s, average per process)
(quadwords written back from scache)
for -O0 -mp, 4cpu run
DATA PLACEMENT ISSUES
setenv _DSM_PLACEMENT FIRST_TOUCH
setenv _DSM_PLACEMENT ROUND_ROBIN
setnev _DSM_ROUND_ROBIN
setenv _DSM_MIGRATION ON
setenv _DSM_MIGRATION ALL_ON
setenv _DSM_MIGRATION_LEVEL level (level = 0 - 100)
setenv _DSM_VERBOSE (to check how data have been distributed)
memory locality 1 and 2
memory locality 3 and 4
program memory_locality_1
! and program memory_locality_2
integer i, j, n, niters
parameter (n = 8*1024*1024,
& ndim = n+35)
real a(ndim), b(ndim),
& c(ndim), d(ndim), q
real*4 dtime,tarray(2)
read *, niters
print *, ' niters = ', niters
! initialization
t1=dtime(tarray)
!memory_locality_1 : comment out the following
! omp parallel
!memory_locality_2 : keep the following
! omp parallel
!$omp parallel do private(i) shared(a,b,c,d)
do i = 1, n
a(i) = 1.0 - 0.5*i
b(i) = -10.0 + 0.01*(i*i)
c(i) = 2*i - 0.3
d(i) = 0.5*i
enddo
t2=dtime(tarray)
! real work
do it = 1, niters
q = 0.01*it
!$omp parallel do private(i) shared(a,b,c,d,q)
do i = 1, n
a(i) = a(i) + q*b(i)
c(i) = c(i) + d(i)
a(i) = a(i) + sqrt(c(i))
enddo
call sub(a,b,ndim)
enddo
t3=dtime(tarray)
print *, a(1), a(n), q
print *, 'time on initialization = ', t2
print *, 'time on real work =',t3
end
subroutine sub(a,b,ndim)
real a(ndim), b(ndim)
return
end
program memory_locality_3
! and program memory_locality_4
integer i, j, n, niters
parameter (n = 8*1024*1024,
& ndim = n+35)
real a(ndim), b(ndim),
& c(ndim), d(ndim), q
real*4 dtime,tarray(2)
read *, niters
print *, ' niters = ', niters
! initialization
t1=dtime(tarray)
!memory_locality_3 : comment out the line
! with distribute_reshape
!$sgi distribute a(block),b(block),c(block),d(block)
!memory_locality_4 : comment out the above line
! with distribute a(block) ...
!$sgi distribute_reshape a(block),b(block),c(block),d(block)
do i = 1, n
a(i) = 1.0 - 0.5*i
b(i) = -10.0 + 0.01*(i*i)
c(i) = 2*i - 0.3
d(i) = 0.5*i
enddo
t2=dtime(tarray)
! real work
do it = 1, niters
q = 0.01*it
!$omp parallel do private(i) shared(a,b,c,d,q)
!$sgi+ affinity (i) = data(a(i))
do i = 1, n
a(i) = a(i) + q*b(i)
c(i) = c(i) + d(i)
a(i) = a(i) + sqrt(c(i))
enddo
call sub(a,b,ndim)
enddo
t3=dtime(tarray)
print *, a(1), a(n), q
print *, 'time on initialization = ', t2
print *, 'time on real work =',t3
end
subroutine sub(a,b,ndim)
real a(ndim), b(ndim)
return
end
setnev _DSM_ROUND_ROBIN
setenv OMP_NUM_THREADS n (n=1,2,4,...,32 in the runs)
setnev _DSM_ROUND_ROBIN (used only when Round Robin Placement is desired)
Four performance tools were used :
or
perfex -a -x -y -mp a.out > output (get output per process)
Results:
memory_locality_1
memory_locality_1
memory_locality_2
memory_locality_3
memory_locality_4
first-touch
round robin
first-touch
sgi distribute
sgi distribute_reshape
1 cpu run
t2 1.856
t3 70.038
real 72.400
user 70.387
sys 1.553
2 cpu run
t2 1.731
t3 36.266
real 38.446
user 72.844
sys 1.457
4 cpu run
t2 1.825
t3 32.644
real 35.027
user 130.705
sys 1.690
6 cpu run
t2 1.742
t3 34.057
real 36.345
user 204.949
sys 1.614
8 cpu run
t2 1.783
t3 34.974
real 37.335
user 280.797
sys 1.688
10 cpu run
t2 1.872
t3 35.547
real 38.095
user 356.904
sys 1.804
12 cpu run
t2 1.739
t3 35.759
real 38.196
user 430.225
sys 1.733
14 cpu run
t2 1.396
t3 35.966
real 38.125
user 504.370
sys 1.493
16 cpu run
t2 1.399
t3 36.124
real 38.301
user 578.638
sys 1.530
18 cpu run
t2 1.406
t3 35.542
real 37.809
user 642.900
sys 1.502
20 cpu run
t2 1.353
t3 35.510
real 37.914
user 713.693
sys 1.637
22 cpu run
t2 1.346
t3 35.455
real 37.779
user 784.726
sys 1.692
24 cpu run
t2 1.405
t3 35.536
real 38.005
user 857.089
sys 1.819
32 cpu run
t2 1.459
t3 35.740
real 38.517
user 1122.633
sys 2.168
1 cpu run
t2 1.784
t3 69.971
real 72.309
user 70.320
sys 1.485
2 cpu run
t2 1.556
t3 36.345
real 38.378
user 73.010
sys 1.289
4 cpu run
t2 1.747
t3 19.949
real 22.155
user 80.048
sys 1.499
6 cpu run
t2 1.707
t3 15.527
real 17.691
user 93.641
sys 1.485
8 cpu run
t2 1.845
t3 14.233
real 16.587
user 114.480
sys 1.661
10 cpu run
t2 1.428
t3 10.872
real 12.772
user 109.214
sys 1.251
12 cpu run
t2 1.611
t3 9.593
real 11.839
user 116.564
sys 1.460
14 cpu run
t2 1.672
t3 9.026
real 11.373
user 127.921
sys 1.558
16 cpu run
t2 1.623
t3 8.884
real 11.195
user 144.446
sys 1.536
18 cpu run
t2 1.527
t3 7.175
real 9.400
user 131.807
sys 1.433
20 cpu run
t2 1.527
t3 6.525
real 8.831
user 133.808
sys 1.466
22 cpu run
t2 1.570
t3 5.659
real 8.083
user 128.730
sys 1.523
24 cpu run
t2 1.551
t3 6.268
real 8.759
user 156.548
sys 1.547
32 cpu run
t2 1.704
t3 7.284
real 10.254
user 240.865
sys 1.905
1 cpu run
t2 1.327
t3 77.896
real 79.746
user 78.258
sys 1.010
2 cpu run
t2 1.313
t3 39.442
real 41.285
user 79.380
sys 2.228
4 cpu run
t2 0.716
t3 19.717
real 20.978
user 79.541
sys 2.428
6 cpu run
t2 0.543
t3 13.155
real 14.384
user 79.819
sys 2.564
8 cpu run
t2 0.376
t3 10.034
real 11.378
user 81.064
sys 2.588
10 cpu run
t2 0.330
t3 7.901
real 9.272
user 81.058
sys 2.463
12 cpu run
t2 0.303
t3 6.604
real 7.963
user 81.607
sys 2.398
14 cpu run
t2 0.313
t3 5.745
real 7.187
user 83.460
sys 2.437
16 cpu run
t2 0.225
t3 5.050
real 6.553
user 84.645
sys 2.296
18 cpu run
t2 0.227
t3 4.477
real 6.152
user 86.820
sys 2.305
20 cpu run
t2 0.248
t3 4.050
real 5.721
user 86.972
sys 2.365
22 cpu run
t2 0.232
t3 3.600
real 5.269
user 86.381
sys 2.272
24 cpu run
t2 0.247
t3 3.379
real 5.182
user 88.941
sys 2.456
32 cpu run
t2 0.247
t3 5.604
real 7.725
user 190.906
sys 2.689
1 cpu run
t2 1.741
t3 71.300
real 73.725
user 71.619
sys 1.469
2 cpu run
t2 1.417
t3 37.517
real 39.422
user 75.309
sys 1.183
4 cpu run
t2 1.604
t3 18.821
real 20.846
user 75.541
sys 1.409
6 cpu run
t2 1.545
t3 12.609
real 14.552
user 76.112
sys 1.331
8 cpu run
t2 1.401
t3 9.473
real 11.277
user 76.456
sys 1.197
10 cpu run
t2 1.429
t3 7.652
real 9.537
user 77.361
sys 1.244
12 cpu run
t2 1.446
t3 6.477
real 8.546
user 78.925
sys 1.399
14 cpu run
t2 1.415
t3 5.533
real 7.545
user 79.139
sys 1.263
16 cpu run
t2 1.415
t3 4.816
real 6.866
user 79.618
sys 1.285
18 cpu run
t2 1.392
t3 4.311
real 6.383
user 80.439
sys 1.278
20 cpu run
t2 1.533
t3 3.901
real 6.195
user 81.974
sys 1.447
22 cpu run
t2 1.413
t3 3.568
real 5.822
user 83.707
sys 1.344
24 cpu run
t2 1.408
t3 3.280
real 5.604
user 84.957
sys 1.381
32 cpu run
t2 1.510
t3 5.332
real 7.998
user 175.450
sys 1.858
1 cpu run
t2 1.349
t3 67.738
real 69.588
user 68.084
sys 1.047
2 cpu run
t2 1.344
t3 35.653
real 37.429
user 71.602
sys 1.077
4 cpu run
t2 1.342
t3 17.847
real 19.546
user 71.685
sys 1.083
6 cpu run
t2 1.359
t3 12.074
real 13.793
user 72.836
sys 1.126
8 cpu run
t2 1.415
t3 9.007
real 10.766
user 72.398
sys 1.197
10 cpu run
t2 1.404
t3 7.274
real 9.037
user 73.044
sys 1.197
12 cpu run
t2 1.426
t3 6.130
real 7.898
user 73.742
sys 1.249
14 cpu run
t2 1.438
t3 5.271
real 7.119
user 74.616
sys 1.269
16 cpu run
t2 1.482
t3 4.617
real 6.549
user 74.661
sys 1.341
18 cpu run
t2 1.472
t3 4.133
real 6.154
user 76.472
sys 1.344
20 cpu run
t2 1.511
t3 3.613
real 5.686
user 74.425
sys 1.412
22 cpu run
t2 1.567
t3 3.276
real 5.473
user 74.850
sys 1.500
24 cpu run
t2 1.558
t3 3.017
real 5.252
user 75.000
sys 1.537
32 cpu run
t2 1.755
t3 4.211
real 6.994
user 144.383
sys 2.043
Detect what is going on:
No TLB misses - count of event 23 is negligible in
perfex output
No cache contention - count of event 31 is negligible in
perfex output
Example: setenv _DSM_VERBOSE
Example: dlook
dlook -out dlook_output a.out < input > output
are shown. In this example, 12 cpus are used.
Example: dsm_home_threadnum()
integer dsm_home_threadnum
numthread = dsm_home_threadnum(array(i))
column 1 : array name
column 2 : the cpu number (p) performing the initialization,
(applied to memory_locality_2 only, for the other two
programs, p is simply 0)
column 3 : the lower cpu number (p') of the node that has the data
column 4 : the first array element (i)
column 5 : the last array element (i')
column 6 : the number of pages used from the first (column 4)
to the last array (column 5) elements
column 1 : array name
column 2 : the cpu number (p) accessing the data
column 3 : the lower cpu number (p') of the node that has the data
column 4 : the first array element (i)
column 5 : the last array element (i')
The programs:
f90 -On -mp program.f (n=0,1,2 or 3)
round robin 1
round robin 2
program round_robin 1
integer i, j, n, niters
parameter (n = 24*1024, ndim = n)
real a(ndim), b(ndim),
& c(ndim), d(ndim), q
real*4 dtime,tarray(2)
integer dsm_home_threadnum
read *, niters
print *, ' niters = ', niters
! initialization
t1=dtime(tarray)
do i=1,n
a(i) = 1.0 - 0.5*i
enddo
do i=1,n
b(i) = -10.0 + 0.01*(i*i)
end do
do i=1,n
c(i) = 2*i - 0.3
end do
do i=1,n
d(i) = 0.5*i
end do
t2=dtime(tarray)
! real work
do it = 1, niters
q = 0.01*it
!$omp parallel do private(i, &
!$omp& numthread1,numthread2,&
!$omp& numthread3,numthread4,num_omp) &
!$omp& shared(a,b,c,d)
do i = 1, n
num_omp=omp_get_thread_num()
numthread1 = dsm_home_threadnum(a(i))
write (6,*) 'a', num_omp,i,numthread1
numthread2 = dsm_home_threadnum(b(i))
write (6,*) 'b', num_omp, i,numthread2
numthread3 = dsm_home_threadnum(c(i))
write (6,*) 'c', num_omp, i,numthread3
numthread4 = dsm_home_threadnum(d(i))
write (6,*) 'd', num_omp, i,numthread4
a(i) = a(i) + q*b(i)
c(i) = c(i) + d(i)
a(i) = a(i) + sqrt(c(i))
enddo
call sub(a,b,ndim)
enddo
t3=dtime(tarray)
print *, a(1), a(n), q
print *, 'time on initialization = ', t2
print *, 'time on real work =',t3
end
subroutine sub(a,b,ndim)
real a(ndim), b(ndim)
return
end
program round_robin_2
integer i, j, n, niters
parameter (n = 24*1024, ndim = n)
real a(ndim), b(ndim),
& c(ndim), d(ndim), q
real*4 dtime,tarray(2)
integer dsm_home_threadnum
read *, niters
print *, ' niters = ', niters
! initialization
t1=dtime(tarray)
do i = 1, n
a(i) = 1.0 - 0.5*i
b(i) = -10.0 + 0.01*(i*i)
c(i) = 2*i - 0.3
d(i) = 0.5*i
enddo
t2=dtime(tarray)
! real work
do it = 1, niters
q = 0.01*it
!$omp parallel do private(i, &
!$omp& numthread1,numthread2,&
!$omp& numthread3,numthread4,num_omp) &
!$omp& shared(a,b,c,d)
do i = 1, n
num_omp=omp_get_thread_num()
numthread1 = dsm_home_threadnum(a(i))
write (6,*) 'a', num_omp,i,numthread1
numthread2 = dsm_home_threadnum(b(i))
write (6,*) 'b', num_omp, i,numthread2
numthread3 = dsm_home_threadnum(c(i))
write (6,*) 'c', num_omp, i,numthread3
numthread4 = dsm_home_threadnum(d(i))
write (6,*) 'd', num_omp, i,numthread4
a(i) = a(i) + q*b(i)
c(i) = c(i) + d(i)
a(i) = a(i) + sqrt(c(i))
enddo
call sub(a,b,ndim)
enddo
t3=dtime(tarray)
print *, a(1), a(n), q
print *, 'time on initialization = ', t2
print *, 'time on real work =',t3
end
subroutine sub(a,b,ndim)
real a(ndim), b(ndim)
return
end
Glossary