The IBM Power4 Processor: Overview and Initial Experiences
M.F. Guest, STFC Daresbury Laboratory
Considerable interest, not least within the HPC(X) procurement process, has centred on IBM's new
SP systems based around the POWER4 processor chip [1]. In this article we provide a brief overview
of the chip and associated building blocks of IBM's next generation parallel systems, and describe our
own initial experiences in benchmarking applications on these systems, in both serial and parallel
mode.
The POWER4 chip contains two microprocessor cores, chip and system pervasive functions, core
interface logic, a 1.41 MB L2 cache and controls, the level-3 (L3) cache directory and controls, and
the fabric controller that controls the flow of information and control data between the L2 and L3 and
among chips. Table I provides a processor comparison with the POWER3- and RS64-based
processors from IBM, while Figure 1 gives a logical view of the chip.
Each microprocessor contains a 64 KB L1 I-cache, a 32 KB L1 D-cache, two fixed-point execution
units, two floating-point execution units, two load/store execution units, one branch execution unit,
and one execution unit to perform logical operations on the condition register.
Design features and Systems Evaluated
Three different types of IBM pSeries 690 servers were used in this study: one corresponds to the 1.3
GHz POWER4 pSeries 690 HPC (High Performance Computing) while the other two systems are
based around the 1.3 GHz pSeries 690 Turbo. For simplicity, we refer to them as the HPC system
and the Turbo systems, respectively.
The pSeries 690, which is the latest UNIX server from IBM, provides a new architecture [1,2]. At the
core of its architecture is the POWER4 Multi-Chip Module (MCM). The building blocks for the systems
used in this study are an 8-way MCM Turbo running at 1.3 GHz and a 4-way MCM HPC running at 1.3
GHz. IBM's positioning of the HPC system may be summarised as follows: "the HPC system (4-way
MCM) is optimized for data intensive applications that require larger memory bandwidth per core".
Frankly, this would appear to be an admission that in many applications the memory bandwidth on the
initial Turbo systems is inadequate when all CPUs on the MCM are involved.
| | POWER3 | RS64-III | POWER4 |
|---|---|---|---|
| Frequency [MHz] | 375 | 450 | 1,300 |
| Fixed Point Units | 3 | 2 | 2 |
| Floating Point Units | 2 | 1 | 2 |
| Load/Store Units | 2 | 1 | 2 |
| Branch/Other Units | 1 | 1 | 2 |
| Dispatch Width | 4 | 4 | 5 |
| Branch Prediction | Dynamic | Static | Dynamic |
| I-Cache Size [KB] | 32 | 128 | 64 |
| D-Cache Size [KB] | 128 | 128 | 32 |
| L2-Cache Size [MB] | 8 | 16 | 1.41 (a) |
| L3-Cache Size [MB] | N/A | N/A | 512 (b) |
| Data Prefetch | Yes | No | Yes |

Table I. Processor comparisons
(a) Shared between two cores.
(b) Shared among 32 cores.
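
As a rough illustration of what the clock frequencies and floating-point unit counts in Table I imply, the following Python sketch estimates a theoretical peak rate per core. It assumes that each floating-point unit completes one fused multiply-add (two floating-point operations) per cycle. That figure is not given in this article, and the names used are illustrative only.

```python
# Clock frequency [MHz] and floating-point unit counts, copied from Table I.
PROCESSORS = {
    "POWER3": {"mhz": 375, "fpus": 2},
    "RS64-III": {"mhz": 450, "fpus": 1},
    "POWER4": {"mhz": 1300, "fpus": 2},
}

# ASSUMPTION (not stated in the article): each floating-point unit completes
# one fused multiply-add, i.e. 2 floating-point operations, per cycle.
FLOPS_PER_FPU_PER_CYCLE = 2


def peak_mflops(mhz: int, fpus: int,
                flops_per_fpu: int = FLOPS_PER_FPU_PER_CYCLE) -> int:
    """Theoretical peak per core in Mflop/s under the assumption above."""
    return mhz * fpus * flops_per_fpu


peaks = {name: peak_mflops(p["mhz"], p["fpus"])
         for name, p in PROCESSORS.items()}

for name, value in peaks.items():
    print(f"{name:9s} {value:5d} Mflop/s peak per core")
```

Peak figures of this kind are an upper bound only; the measured timings reported later in this article are the more meaningful guide to delivered performance.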
Each POWER4 HPC processor chip contains one core, rather than two, and the L2 cache is
dedicated to that core. The Turbo system, on the other hand, has two cores per L2
cache, so that one MCM is 8-way. A full description of the POWER4 architecture is beyond the scope of
this article; for further details see references [1,2]. We merely summarise here the most important
features of this architecture. Each processor chip on the pSeries 690 consists of the following:
- Either one or two cores.
- An L2 cache that runs at the same speed as the microprocessor.
- The microprocessor interface unit, which is the interface for each microprocessor to the rest of the
system.
- The directory and cache controller for the L3 cache.
- The fabric bus controller.
- A GX bus controller that enables I/O devices to connect to the Central Electronic Complex (CEC).
Note that the L3 cache is a new component not available on the POWER3 architecture. The L3
caches are mounted on a separate module.
The two pSeries 690 Turbo-based systems used in the present evaluation were an 8-way
system (single MCM) and a 32-way system (4 MCMs, hereafter referred to as Regatta-H). The
HPC-based system comprised 16 processors (4 MCMs), hereafter referred to as Regatta-HPC.
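
The processor counts quoted above follow from simple arithmetic on the building blocks, which the Python sketch below makes explicit. Two points are assumptions rather than statements from this article: that every MCM carries four chips (inferred from an 8-way MCM having two cores per chip), and that the HPC chip keeps the full 1.41 MB L2 cache for its single core. The class and field names are hypothetical.

```python
from dataclasses import dataclass

# ASSUMPTION: four chips per MCM, inferred from "8-way MCM" with two cores
# per chip and "4-way MCM" with one core per chip.
CHIPS_PER_MCM = 4

# L2 size per chip from Table I. ASSUMPTION: the HPC chip has the same L2
# size, dedicated to its single core.
L2_MB_PER_CHIP = 1.41


@dataclass(frozen=True)
class System:
    name: str
    cores_per_chip: int  # 2 for Turbo chips, 1 for HPC chips
    mcms: int            # number of Multi-Chip Modules in the system

    @property
    def cores_per_mcm(self) -> int:
        return CHIPS_PER_MCM * self.cores_per_chip

    @property
    def total_cores(self) -> int:
        return self.mcms * self.cores_per_mcm

    @property
    def l2_mb_per_core(self) -> float:
        return L2_MB_PER_CHIP / self.cores_per_chip


systems = [
    System("8-way Turbo", cores_per_chip=2, mcms=1),
    System("Regatta-H", cores_per_chip=2, mcms=4),
    System("Regatta-HPC", cores_per_chip=1, mcms=4),
]

for s in systems:
    print(f"{s.name:12s} {s.total_cores:2d} cores, "
          f"{s.l2_mb_per_core:.3f} MB L2 per core")
```

Under these assumptions the HPC configuration offers twice the L2 capacity per core of the Turbo configuration, which is in line with IBM's positioning of it for applications needing more memory bandwidth per core.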
Single Processor Performance
Figure 2 depicts the results from running the standard DisCo Computational Chemistry Benchmarks
[3] on a single processor of an 8-way pSeries 690 Turbo system. For each benchmark we have
considered the performance on the Cray T3E/1200E, SGI Origin R12k/400 (the HPC'97 technology
refresh system, green), PSC's Compaq AlphaServer SC ES45/1000 and the 690 Turbo system. The
performance bar for a given benchmark on machine M depicts the ratio [T(Origin 3800) / T(M)] × 100, i.e.,
performance is normalised relative to that found on the Origin 3800. Note that the MATRIX-97 figures
arise from averaging across figures for the separate MATRIX benchmarks (MMO-97, QHQ-97 and
Diag-97), whilst the Chem. Kernels figure is similarly derived as the average across all four chemistry
kernels (SCF, MD, QMC and JACOBI).
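
The normalisation and averaging just described can be sketched in a few lines of Python. The timings below are hypothetical placeholders, not measured values from this study, and the use of an arithmetic mean over the individual ratios is an assumption about how the averaging was carried out.

```python
from statistics import mean


def relative_performance(t_reference: float, t_machine: float) -> float:
    """Ratio [T(reference) / T(M)] x 100; values above 100 mean M is faster."""
    return 100.0 * t_reference / t_machine


def group_figure(reference_times: dict, machine_times: dict) -> float:
    """Average the per-benchmark ratios of a group (e.g. MATRIX-97).

    ASSUMPTION: an arithmetic mean of the individual ratios.
    """
    return mean(
        relative_performance(reference_times[name], machine_times[name])
        for name in reference_times
    )


# HYPOTHETICAL timings in seconds, for illustration only.
reference = {"MMO-97": 10.0, "QHQ-97": 20.0, "Diag-97": 30.0}
machine = {"MMO-97": 5.0, "QHQ-97": 10.0, "Diag-97": 10.0}

matrix_97 = group_figure(reference, machine)
print(f"MATRIX-97 relative performance: {matrix_97:.1f}")
```

With these placeholder numbers the three ratios are 200, 200 and 300, so the group figure is their mean.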
These results suggest that a single processor of the pSeries 690 Turbo, averaged over
all 11 benchmarks, outperforms the Cray T3E/1200E by a factor of 6.8, and the
SGI Origin R12k/400 (green at CSAR) by a factor of ca. 3.0. Allowing for the early status of IBM's
compilers in optimising and tuning for the POWER4 architecture, these results are encouraging. They
suggest a significant improvement over the AlphaServer SC ES45/1000 in all but two of the
benchmarks, the SCF computational chemistry kernel and the DL_POLY code itself. The latter is of
some concern, although this effect is not peculiar to the POWER4 processor, for it has been evident in
all previous POWER-based CPUs from IBM.

Figure 2. Relative Serial Performance of the DisCo Computational Chemistry Benchmarks on the
Cray T3E/1200E, SGI Origin 3800/R14k-500, Compaq AlphaServer SC ES45/1000 and IBM pSeries
690 Turbo system.
Parallel Regatta-HPC Performance
Detailed comparisons of our initial experiences running a variety of applications on up to 32
processors of the Regatta-H system are presented in the application-based articles throughout this
Annex. For present purposes we restrict attention to the most recent exercise, which involved
access to a Regatta-HPC system at Montpellier. In Table 2 below we present the timings for a variety
of 8-processor parallel jobs obtained on this system, together with those obtained on the Compaq
AlphaServer SC ES45/1000 at Pittsburgh Supercomputing Centre. The 8-processor metric is used
here as this represents the fundamental building block of the likely high-end systems from IBM, given
the short-term requirement to LPAR the 32-way Regatta node (so as to overcome the limitations
associated with IBM's existing Colony switch). We provide here only a brief statement of the
benchmark codes and the corresponding data sets; more details can be found in the subsequent
articles of this Annex.
The timings of Table 2 suggest that in many cases the AlphaServer SC and
Regatta-HPC systems demonstrate comparable times to solution. Of the 14 benchmarks presented,
covering 6 different application codes, the 8-way HPC node outperforms the
AlphaServer SC in 12, albeit by only small amounts in 6 of these 12. The most disturbing
case of underperformance by the HPC node is found in the macromolecular DL_POLY benchmark
(Bench 7), where only 67% of AlphaServer SC performance is evident. This comes as no surprise given the
serial performance of the code outlined above.
| Code | Data Set | Elapsed Time (seconds), ES45/1000 | Elapsed Time (seconds), Regatta-HPC | [T(ES45/1000) / T(Regatta-HPC)] × 100% |
|---|---|---|---|---|
| DL_POLY | Bench 4 | 155 | 145 | 107% |
| | Bench 5 | 120 | 119 | 101% |
| | Bench 7 | 198 | 296 | 67% |
| NWChem‡ | DFT J-fit, SiOSi3 | 74 | 71 | 104% |
| | DFT J-fit, SiOSi4 | 207 | 149 | 139% |
| GAMESS-UK‡ | DFT Morphine 6-31G**, explicit J | 338 | 308 | 110% |
| | DFT Cyclosporin 6-31G, explicit J | 2293 | 2246 | 100% |
| | SCF Morphine, 6-31G** | 208 | 179 | 116% |
| | DFT Morphine, DZVP/A2, explicit J | 879 | 867 | 101% |
| | DFT Morphine, J-fit | 195 | 214 | 91% |
| | Furan TZVP, SCF 2nd derivs | 357 | 222 | 161% |
| ANGUS | (144**3) ILU Grid, 100 iterations | 1213 | 639 | 190% |
| CASTEP | Chabazite energy | 337 | 163 | 207% |
| CPMD | (H2O)32 cluster energy | 165 | 138 | 120% |

Table 2. Time in wall clock seconds for a number of applications and data sets using 8 processors
of the IBM Regatta-HPC and Compaq AlphaServer SC ES45/1000.
‡ Both GAMESS-UK and NWChem required a 32-bit kernel when using LAPI. This led to a non-optimal memory
configuration on the Regatta-HPC system.
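
The ratio column of Table 2 can be rechecked directly from the elapsed times, as in the Python sketch below. The timings and tabulated ratios are copied from the table; the variable names and the rounding to the nearest integer are our own choices. On this basis only the Cyclosporin entry differs from the printed figure (102 against 100), which may reflect a transcription slip in either the timings or the ratio.

```python
# (code, data set, T(ES45/1000) [s], T(Regatta-HPC) [s], tabulated ratio [%])
TABLE_2 = [
    ("DL_POLY", "Bench 4", 155, 145, 107),
    ("DL_POLY", "Bench 5", 120, 119, 101),
    ("DL_POLY", "Bench 7", 198, 296, 67),
    ("NWChem", "DFT J-fit, SiOSi3", 74, 71, 104),
    ("NWChem", "DFT J-fit, SiOSi4", 207, 149, 139),
    ("GAMESS-UK", "DFT Morphine 6-31G**, explicit J", 338, 308, 110),
    ("GAMESS-UK", "DFT Cyclosporin 6-31G, explicit J", 2293, 2246, 100),
    ("GAMESS-UK", "SCF Morphine, 6-31G**", 208, 179, 116),
    ("GAMESS-UK", "DFT Morphine, DZVP/A2, explicit J", 879, 867, 101),
    ("GAMESS-UK", "DFT Morphine, J-fit", 195, 214, 91),
    ("GAMESS-UK", "Furan TZVP, SCF 2nd derivs", 357, 222, 161),
    ("ANGUS", "(144**3) ILU Grid, 100 iterations", 1213, 639, 190),
    ("CASTEP", "Chabazite energy", 337, 163, 207),
    ("CPMD", "(H2O)32 cluster energy", 165, 138, 120),
]


def ratio_percent(t_es45: float, t_regatta: float) -> float:
    """[T(ES45/1000) / T(Regatta-HPC)] x 100; above 100 favours Regatta-HPC."""
    return 100.0 * t_es45 / t_regatta


# Recompute each ratio, rounded to the nearest integer as in the table.
recomputed = [round(ratio_percent(es45, hpc)) for _, _, es45, hpc, _ in TABLE_2]

# Benchmarks where the Regatta-HPC elapsed time is the shorter of the two.
hpc_faster = sum(1 for _, _, es45, hpc, _ in TABLE_2 if hpc < es45)

# Rows whose recomputed ratio differs from the tabulated one.
mismatches = [
    (row[0], row[1], new, row[4])
    for row, new in zip(TABLE_2, recomputed)
    if new != row[4]
]

print(f"Regatta-HPC faster in {hpc_faster} of {len(TABLE_2)} benchmarks")
for code, data_set, new, printed in mismatches:
    print(f"{code} / {data_set}: recomputed {new}%, tabulated {printed}%")
```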
References
[1] S. Behling, R. Bell, P. Farrell, A. Holthoff, F. O'Connell, and W. Weir, The POWER4 Processor
Introduction and Tuning Guide, IBM Corporation, International Technical Support Organization, Austin,
TX, 2001; and references therein.
[2] H. M. Matis, J. D. McCalpin, M.-C. Chiang, F. P. O'Connell, and P. Buckland, IBM pSeries 690
Configuring for Performance, IBM Corporation, Austin, TX, 2001.
[3] M.F. Guest, Performance of Various Computers in Computational Chemistry, in Proceedings of the
Daresbury Machine Evaluation Workshop, STFC Daresbury Laboratory, November 2001. The
associated MS PowerPoint presentation is also available.