Performance Profiling and Optimization

on the SGI Origins

Sherry Chang
Scientific Consultant
Numerical Aerospace Simulation Facility
NASA Ames Research Center
MS 258-6
Moffett Field, CA 94035-1000

http://www.nas.nasa.gov/~schang


Click here for PDF format

Table of Contents

Links to Examples in this Document

Glossary


UNDERSTANDING THE ORIGINS


Distributed Memory Multi-Processors

In a multiprocessor configuration, each processor has fast access to its own local memory but no direct access to the memory of a different processor. (a procesor does not know the address of a datum that resides in the memory of a different processor.) When data located in the memory of a different processor is needed, a message is sent via the inter-processor network.

Example : Distributed memory parallel processors


Shared Memory Multi-Processors

Memory in a multiprocessor configuration, usually RAM, which can be accessed by more than one processor, usually via a shared bus or network.

Two of the memory access models for shared memory machines:

Note: Each black square in the above diagram represents a 'node'. Nodes are connected through Routers/Craylink interconnect.


The Origin 2000 System

An Origin2000 system has the following components:

Suggested Reading:

Application Programmer's I/O Guide http://techpubs.sgi.com/library/dynaweb_bin/ebt-bin/0650/nph-infosrch.cgi/infosrchtpl/SGI_Developer/AplPrg_IOG


The Origin Machines at NAS

Note: This section was updated on Dec. 27, 2001 to reflect the configuration changes.

Here is a list of resources of the origin machines available for users:

Hostname Turing Fermi Hopper Steger Lomax
Function front-end server compute compute compute
Processor
CPU R10000 R10000 R10000 R10000 R12000
CPU-Clock 250MHz 250MHz 250MHz 250MHz 400MHz
NCPUS/Node 2 2 2 2 2
# of Nodes 16 16 32 64 256
NCPUS 32 32 64 128 512
Memory
Local Memory/Node 512MB 512MB 768MB* 512MB 768MB
Total Memory 8GB 8GB 24GB* 32GB 192GB
Free Memory/Node ~490MB N/A ~700MB* ~490MB ~700MB
L1 Cache Size 32KB 32KB 32KB 32KB 32KB
L2 Cache Size 4MB 4MB 4MB 4MB 8MB
Page Size 16KB 16KB 16KB 16KB 16KB
I/O
/u/userid N/A 300MB # N/A N/A N/A
/scratch1 100GB N/A 200GB 400GB ~800GB
/scratch2 100GB N/A 200GB 400GB ~500GB

*: Memory on Hopper will be expanded to include a total of 64GB in the near future.

For further information, refer to the configuration diagram for each machine. http://in.nas.nasa.gov/Publications/ConfigDiags/


The R10000/R12000 Processors

The MIPS R10000 and R12000 processors were designed with a few characteristics to achieve optimum performance.


Instruction Latency

Load/Store Latency (in clock periods)
Load (primary cache hit, int/FP) 2/3
Integer
add, sub, logic ops, shift, branches 1
multiply (32-bits, signed) 5/6
multiply (64-bits, signed) 9/10
divide (32-bit/64-bit) 35/67
Floating Point
add, compare, convert 2
multiply 2
multiply-add (bypass/not) 2/4
divide, reciprocal (single/double) 12/19
sqrt (single/double) 18/33
reciprocal sqrt (single/double) 30/52


Memory Hierarchy

The CPU can make use only of data in registers, and it can load data into registers only from the primary cache. So data must be brought into the primary cache before it can be used in calculations. The primary cache can obtain data only from the secondary cache, so all data in the primary cache is simultaneously resident in the secondary cache. The secondary cache obtains its data from main memory.

What is "Translation Lookaside Buffer (TLB) in the above figure and its role ?


TLB, Virtual Memory and Physical Memory

The origins at NAS are so called 'Virtual Memory' machines which allow (1) the operating system to freely partition the physical memory of the system among the running processes while presenting a uniform virtual address space to all jobs and (2) processes to access address space much larger than the physical memory space.

Virtual memory is divided into pages. Thus, a page is the smallest continuous memory that the operating system can allocate to your program. For the origins at NAS, the default page size is set to be 16KB. However, this page size can be changed when necessary.

Each virtual address is composed of a virtual page number (the most significant bits) and an offset within the page (the N least significant bits, in the case of a page size 16KB, N = 14). In translating the virtual address to a physical address, the offset is left unchanged and the virtual page number is mapped to a physical page number. This is recombined with the offset to give a physical address (i.e., a location in physical memory).

For example, for 32-bit virtual address with a page size of 16KB, bits 13:0 represent the offset within a page and bits 30:14 represent a virtual page number from the 2**17 pages in the virtual segment.

The operating system translates between the virtual addresses your programs use and the physical addresses that the hardware requires. These translations are done for every memory access, so, to keep their overhead low, the 64 most recently used page addresses are cached in the translation lookaside buffer (TLB). This allows virtual-to-physical translation to be carried out with zero latency, in hardware - for those 64 pages.


Role of TLB in Memory Referencing

When a virtual memory location (either an instruction or data) is referenced, the CPU determines the page that the item resides. It searches through the Translation Lookaside Buffer for an entry containing the page number. If the page number is found in TLB, memory reference is continued by checking the L1 cache. If the CPU finds no entry for the referenced page in TLB, a TLB miss has occurred. Only then must the operating system assist in address translation. The hardware traps to a routine in the IRIX kernel. It locates the needed physical address in a memory table and loads it into one of the TLB registers. Thus, the TLB is updated and memory referencing continues.

Thus the TLB acts as a cache for frequently-used page addresses. The virtual memory of a program is usually much larger than 64 pages. The most recently used pages of memory (hundreds or thousands of them) are retained in physical memory, but depending on program activity and total system use, some virtual pages can be copied to disk and removed from memory.

A 'minor page fault' is said to occur when a program suffers a TLB miss but the referenced page is found in the physical memory by the operating system. If the operating system discovers that the referenced page is not presently in physical memory, a 'major page fault' has occurred and the program is further delayed while the page is read from disk.


L1 and L2 Caches

Cache Type Total Size Cache Line Size # of cache lines Mapping Method Replacement Policy
L1 Instruction Cache 32KB 64-byte 512 2-way set associative Least Recently Used (LRU)
L1 Data Cache 32KB 32-byte 1024 2-way set associative LRU
L2 Unified Cache 4MB ; 8MB (for lomax) 128-byte 2**15=32768 ; 2**16=65536 2-way set associative LRU

The L1 and L2 caches are divided into cache lines. For the L1 instruction cache, L1 data cache and L2 cache, the size of a cache line is 64 bytes, 32 bytes and 128 bytes, respectively. Thus, the number of cache lines for each cache is 32KB/64bytes= 512 lines for L1 instruction cache, 32KB/32bytes = 1024 lines for L1 data cache and 4MB/128 bytes= 2**15 lines for L2 cache (for lomax, there are 8MB/128bytes = 2**16 lines).

The use of cache line makes sure that a high bandwidth between the caches and memory is maintained. When a particular word is copied into cache, several nearby words are copied with it. If 1 word is 4-bytes long, one L1 data cache line can hold 8 consecutive data, and one L2 cache line can hold 32 consecutive data.


Two-Way Set Associativity

When a datum is loaded into either L1 or L2 cache, the cache line it goes to is not random. In fact, with the use of two-way set associative mapping method, there are only two specific cache lines that a datum can reside. The two cache lines that this datum can reside are determined by the middle bits of the address of this datum. In the case where the L1 cache size is 32KB with 32B per cache line and L2 cache size is 4MB with 128B per cache line, the following figure illustrates how many and which middle bits of the address are used to determine the cache lines a datum will used. An example is provided to further demonstrate how to determine the two L1 cache lines and two L2 cache lines a datum can use.


Example : Determination of the Cache Lines for a Datum

For ease of discussion, all numbers will be converted from binary to decimal in this example.

If the base address of a datum is represented by 44556676 in decimal, the two L2 cache lines this datum can reside are line number 4035 and 20419 as determined below:

44556676/(4MB/2 way) = 21 remainder 516484

516484/128B (L2 cache line size) = 4035 remainder 4

The first L2 cache line this datum can reside = 4035

The second L2 cache line this datum can reside = 4035 + (2**15 lines/2 way) = 20419

Similarly, for the two L1 data cache lines,

44556676/(32KB/2 way)=2719 remainder 8580

8580/32B (L1 data cache line size) = 268 remainder 4

The first L1 data cache line this data can reside = 268

The second L1 data cache line this data can reside = 268 + (1024 lines/2 way) = 780

To determine if a datum is in cache, one only needs to check these two specific cache lines. Once the two cache lines are determined, one checks the cache tag on each of the two lines to see if it matches the rest of the address. If it does, this cache line contains our data and we have a cache hit. If the cache tag does not match the rest of our address, we have a cache miss and the data must be loaded from the next level of the memory hierarchy.


If the program refers to two data whose middle address bits are identical, copies of both can be maintained in the cache simultaneously. When a third datum with the same middle address bits is accessed, it replaces one of the two previous data - the one that was least recently accessed.

Example : L2 Cache Trashing among 3 Data with Identical Middle Address Bits

In the following program, each one of the 3 arrays a, b and c occupy exactly 2MB of memory and thus a(i,j), b(i,j) and c(i,j) have identical low-21 bits (bits 20:0) in their base addresses. Lets also assume that the base address of a(1,1) is 44556676 in decimal. Thus a(1,1) can only reside in L2 cache line numbers 4035 or 8131. If a(1,1) is in line 4035, then b(1,1) can reside in line number 8131. When c(1,1) is needed, an L2 cache line that contains either a(1,1) or b(1,1) needs to be replaced with data containing c(1,1) because c(1,1) can only reside in either line number 4035 or 8131 as a(1,1) and b(1,1).

dimension  /abc/ a(1024,512), b(1024,512), c(1024,512)

do j=1,512
do i=1,1024
a(i,j)=a(i,j)+b(i,j)+c(i,j)
end do
end do

Quiz : If the above code is run in systems (as in lomax) whose L2 cache is two-way set associative, its size is 8MB and each L2 cache line is 128B, will the two L2 cache lines for a(1,1) be same as those two lines for b(1,1) ? What about c(1,1) ?


The Situations that Cache Latency Applies

A cache miss occurs when the desired data is not found. A delay (latency) applies when:


Principles of Good Cache Use


Memory Access Latency

Latency of memory access is best measured in CPU clock cycles. For a 250MHz CPU, 1 clock cycle is 4 nanoseconds. The latency of data access becomes greater with each cache level. A miss at each level of the memory hierarchy multiplies the latency by an order of magnitude or more. Clearly a program can sustain high performance only by achieving a very high ratio of cache hits at every level.

Memory Hierarchy Latency (in clock periods)
CPU Register 0
L1 cache hit 2/3
L1 cache miss satisfied by L2 cache hit 8/10
L2 cache miss satisfied from main memory, no TLB miss 75/250 (depending on the node where the memory resides)
TLB miss requiring only reload of the TLB to refer to a virtual page already in memory (minor page fault) ~ 2000
TLB miss requiring virtual page to load from disk (major page fault) ~ hundreds of millions


PERFORMANCE PROFILING


Available Tools


How to Proceed


SINGLE-CPU OPTIMIZATION TECHNIQUES


Sources of Performance Problems


Useful Techniques


Compiler Assisted Optimization

Profiling reveals which code to tune. The most important tool for tuning is the compiler. The Silicon Graphics compilers are flexible and offer a wide variety of compiler options to control their operation. Refer to the SGI documents listed below to find details about the compiler.

The default version of compiler used on turing, hopper, steger and lomax is MIPSpro.7.3.1.1m. Use the module commands If you need a different version.


Useful Files Generated by Compiler

By default, several files are created during processing. Some of these files are useful for finding out what the compiler has done for your code. The compiler adds a suffix to the file portion of the file name and write the files it creates to your working directory.

Files Content
file.l Assembler listing file. To retain this, specify the -LIST option.
file.L Listing file containing a cross reference and a source listing. To retain this file, specify the -listing option.
file.s Assembly language file. To retain this file, specify the -S or -keep option.
file.w2f.f Fortran transformation file. To retain this file, specify -FLIST:=ON
file.w2c.c C transformation file. To retain this file, specify -CLIST:=ON Note: This option does not work with C++.
file.list APO listing file. To retain this, specify the -apolist option.

Note:


Optimization Option Groups:

Options Group Function
-On Basic Optimization
-OPT: Miscellaneous Optimization Specification
-SWP: Software Pipelining Specification
-LNO: Loop Nest Optimization Specification
-IPA: Inter-Procedural Analysis Specification
-INLINE: Standalone Inliner Specification
-TENV: Target Environment Specification
-TARG: Target Architecture/platform Specification


  • Miscellaneous Compiler Optimizations
  • The -OPT: option group controls miscellaneous optimizations. This option overrides defaults based on the main optimization level. See opt man page for details.


    Using Tuned Libraries for Optimization

    One way to improve a program's performance is to link it with libraries already tuned for the target hardware.