The IBM Power4 Processor:
Overview and Initial Experiences

M.F. Guest, STFC Daresbury Laboratory

Considerable interest, not least within the HPC(X) procurement process, has centred on IBM's new SP systems based around the POWER4 processor chip [1]. In this article we provide a brief overview of the chip and associated building blocks of IBM's next generation parallel systems, and describe our own initial experiences in benchmarking applications on these systems, in both serial and parallel mode.

The POWER4 chip contains two microprocessor cores, chip and system pervasive functions, core interface logic, a 1.41 MB L2 cache and controls, the level-3 (L3) cache directory and controls, and the fabric controller that manages the flow of data and control information between the L2 and L3 caches and among chips. Table I provides a processor comparison with the POWER3- and RS64-based processors from IBM, while Figure 1 gives a logical view of the chip.

Each microprocessor core contains a 64 KB L1 I-cache, a 32 KB L1 D-cache, two fixed-point execution units, two floating-point execution units, two load/store execution units, one branch execution unit, and one execution unit that performs logical operations on the condition register.

Design features and Systems Evaluated

Three different types of IBM pSeries 690 servers were used in this study: one corresponds to the 1.3 GHz POWER4 pSeries 690 HPC (High Performance Computing) while the other two systems are based around the 1.3 GHz pSeries 690 Turbo. For simplicity, we refer to them as the HPC system and the Turbo systems, respectively.

The pSeries 690, the latest UNIX server from IBM, provides a new architecture [1,2]. At the core of this architecture is the POWER4 Multi-Chip Module (MCM). The building blocks for the systems used in this study are an 8-way MCM Turbo running at 1.3 GHz and a 4-way MCM HPC running at 1.3 GHz. IBM's positioning of the HPC system may be summarised as follows: "the HPC system (4-way MCM) is optimized for data intensive applications that require larger memory bandwidth per core". Frankly, this would appear to be an admission that in many applications the memory bandwidth on the initial Turbo systems is inadequate when all CPUs on the MCM are involved.

                        POWER3    RS64-III   POWER4
Frequency [MHz]         375       450        1,300
Fixed Point Units       3         2          2
Floating Point Units    2         1          2
Load/Store Units        2         1          2
Branch/Other Units      1         1          2
Dispatch Width          4         4          5
Branch Prediction       Dynamic   Static     Dynamic
I-Cache Size [KB]       32        128        64
D-Cache Size [KB]       128       128        32
L2-Cache Size [MB]      8         16         1.41 (a)
L3-Cache Size [MB]      N/A       N/A        512 (b)
Data Prefetch           Yes       No         Yes

Table I. Processor comparisons
(a) Shared between two cores.
(b) Shared among 32 cores.
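
The clock rates and floating-point unit counts of Table I translate into a rough theoretical peak per core. The sketch below is illustrative only: it assumes that each floating-point unit completes one fused multiply-add (two floating-point operations) per clock cycle, a figure that is not stated in this article, and it covers only the POWER3 and POWER4 columns. The function name `peak_gflops` is our own.

```python
# Theoretical peak per core implied by Table I.
# ASSUMPTION (not stated in the article): each floating-point unit
# completes one fused multiply-add, i.e. 2 flops, per clock cycle.
FLOPS_PER_FPU_PER_CYCLE = 2

# (clock in MHz, number of floating-point units), taken from Table I
PROCESSORS = {
    "POWER3": (375, 2),
    "POWER4": (1300, 2),
}


def peak_gflops(mhz: float, fpus: int) -> float:
    """Return the theoretical peak of one core in Gflop/s."""
    return mhz * 1e6 * fpus * FLOPS_PER_FPU_PER_CYCLE / 1e9


for name, (mhz, fpus) in PROCESSORS.items():
    print(f"{name}: {peak_gflops(mhz, fpus):.1f} Gflop/s peak per core")
```

Under this assumption the clock rate alone accounts for the difference between the two processors, since both have two floating-point units.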

Each POWER4 HPC processor chip contains one core, rather than two, and the L2 cache is dedicated to that core. The Turbo system (8-way MCM), on the other hand, has two cores per L2 cache, i.e., one MCM is 8-way. A full description of the POWER4 architecture is beyond the scope of this article; for further details see references [1,2]. We merely summarise here the most important features of this architecture. Each processor chip on the pSeries 690 consists of the following:

  • Either one or two cores.

  • An L2 cache that runs at the same speed as the microprocessor.

  • The microprocessor interface unit, which is the interface for each microprocessor to the rest of the system.

  • The directory and cache controller for the L3 cache.

  • The fabric bus controller.

  • A GX bus controller that enables I/O devices to connect to the Central Electronic Complex (CEC).

Note that the L3 cache is a new component not available on the POWER3 architecture. The L3 caches are mounted on a separate module.

The two pSeries 690 Turbo-based systems used in the present evaluation were an 8-way system (single MCM) and a 32-way system (4 MCMs, hereafter referred to as Regatta-H). The HPC-based system comprised 16 processors (4 MCMs), hereafter referred to as Regatta-HPC.
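
The processor counts quoted above follow from the MCM layout, and the same arithmetic shows how much L2 cache each core can call on. The following is a minimal sketch, not an IBM tool: the `System` class and its field names are hypothetical, and the value of four chips per MCM is inferred from the text (an 8-way MCM with two cores per chip, a 4-way MCM with one).

```python
from dataclasses import dataclass

CHIPS_PER_MCM = 4      # inferred from the text: 8-way MCM, two cores per chip
L2_PER_CHIP_MB = 1.41  # from Table I


@dataclass(frozen=True)
class System:
    """Hypothetical description of one evaluated pSeries 690 system."""
    name: str
    mcms: int
    cores_per_chip: int  # 2 on Turbo chips, 1 on HPC chips

    @property
    def cores(self) -> int:
        return self.mcms * CHIPS_PER_MCM * self.cores_per_chip

    @property
    def l2_per_core_mb(self) -> float:
        # The L2 cache is shared by all cores on a chip.
        return L2_PER_CHIP_MB / self.cores_per_chip


SYSTEMS = [
    System("8-way Turbo", mcms=1, cores_per_chip=2),
    System("Regatta-H", mcms=4, cores_per_chip=2),
    System("Regatta-HPC", mcms=4, cores_per_chip=1),
]

for s in SYSTEMS:
    print(f"{s.name}: {s.cores} cores, {s.l2_per_core_mb:.3f} MB L2 per core")
```

Each core of the HPC system has twice the L2 capacity of a Turbo core, which is consistent with IBM's positioning of the HPC system for data intensive applications.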

Single Processor Performance

Figure 2 depicts the results from running the standard DisCo Computational Chemistry Benchmarks [3] on a single processor of an 8-way pSeries 690 Turbo system. For each benchmark we have considered the performance on the Cray T3E/1200E, the SGI Origin R12k/400 (the HPC'97 technology refresh system, green), PSC's Compaq AlphaServer SC ES45/1000 and the 690 Turbo system. The performance bar for a given benchmark on machine M depicts the ratio [T(Origin 3800) / T(M)] x 100, i.e., performance is normalised relative to that found on the Origin 3800. Note that the MATRIX-97 figures arise from averaging across figures for the separate MATRIX benchmarks (MMO-97, QHQ-97 and Diag-97), whilst the Chem. Kernels figure is similarly derived as the average across all four chemistry kernels (SCF, MD, QMC and JACOBI).

These results suggest that a single processor of the pSeries 690 Turbo, averaged over all 11 benchmarks, outperforms the Cray T3E/1200E by a factor of 6.8, and the SGI Origin R12k/400 (green at CSAR) by a factor of ca. 3.0. Allowing for the early status of IBM's compilers in optimising and tuning for the POWER4 architecture, these results are encouraging. They suggest a significant improvement over the AlphaServer SC ES45/1000 in all but two of the benchmarks, the SCF computational chemistry kernel and the DL_POLY code itself. The latter is of some concern, although this effect is not peculiar to the POWER4 processor, for it has been evident in all previous POWER-based CPUs from IBM.
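
The normalisation used for the performance bars can be sketched as follows. The timings below are hypothetical placeholders, not measured values from this study, and we assume that the composite MATRIX-97 figure is the arithmetic mean of the individual normalised figures; the function name `relative_performance` is our own.

```python
from statistics import mean


def relative_performance(t_reference: float, t_machine: float) -> float:
    """Return [T(reference) / T(machine)] x 100; values above 100 are faster."""
    return 100.0 * t_reference / t_machine


# HYPOTHETICAL timings in seconds: (reference machine, machine M).
matrix_times = {
    "MMO-97": (40.0, 10.0),
    "QHQ-97": (30.0, 15.0),
    "Diag-97": (60.0, 20.0),
}

per_benchmark = {
    name: relative_performance(t_ref, t_m)
    for name, (t_ref, t_m) in matrix_times.items()
}

# ASSUMPTION: the composite figure is the arithmetic mean of the ratios.
matrix_97 = mean(per_benchmark.values())

for name, value in per_benchmark.items():
    print(f"{name}: {value:.0f}")
print(f"MATRIX-97 (average): {matrix_97:.0f}")
```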

Figure 2. Relative Serial Performance of the DisCo Computational Chemistry Benchmarks on the Cray T3E/1200E, SGI Origin 3800/R14k-500, Compaq AlphaServer SC ES45/1000 and IBM pSeries 690 Turbo System.

Parallel Regatta-HPC Performance

Detailed comparisons of our initial experiences running a variety of applications on up to 32 processors of the Regatta-H system are presented in the application-based articles throughout this Annex. For the present purposes we restrict attention to the most recent exercise, which involved access to a Regatta-HPC system at Montpellier. In Table 2 below we present the timings for a variety of 8-processor parallel jobs obtained on this system, together with those obtained on the Compaq AlphaServer SC ES45/1000 at Pittsburgh Supercomputing Centre. The 8-processor metric is used here as this represents the fundamental building block of the likely high-end systems from IBM, given the short-term requirement to LPAR the 32-way Regatta node (so as to overcome the limitations associated with IBM's existing Colony switch). We provide here only a brief statement of the benchmark codes and the corresponding data sets. More details can be found in the subsequent articles of this Annex.

A brief overview of the timings of Table 2 suggests that in many cases the AlphaServer SC and Regatta-HPC systems demonstrate comparable times to solution. Of the 14 benchmarks presented, covering 6 different application codes, we find that the 8-way HPC node outperforms the AlphaServer SC in 12 of the 14, albeit by only small amounts in 6 of these 12. The most disturbing case of underperformance by the HPC node is found in the macromolecular DL_POLY benchmark, where only 67% of AlphaServer SC performance is evident. This comes as no surprise given the serial performance of the code outlined above.

                                                 Elapsed Time (seconds)     [T(ES45/1000) /
Code        Data Set                             ES45/1000   Regatta-HPC    T(Regatta-HPC)] x 100%
DL_POLY     Bench 4                                    155           145    107%
            Bench 5                                    120           119    101%
            Bench 7                                    198           296     67%
NWChem      DFT J-fit, SiOSi3                           74            71    104%
            DFT J-fit, SiOSi4                          207           149    139%
GAMESS-UK   DFT Morphine 6-31G**, explicit J           338           308    110%
            DFT Cyclosporin 6-31G, explicit J         2293          2246    100%
            SCF Morphine, 6-31G**                      208           179    116%
            DFT Morphine, DZVP/A2, explicit J          879           867    101%
            DFT Morphine, J-fit                        195           214     91%
            Furan TZVP, SCF 2nd derivs                 357           222    161%
ANGUS       (144**3) ILU Grid, 100 iterations         1213           639    190%
CASTEP      Chabazite energy                           337           163    207%
CPMD        (H2O)32 cluster energy                     165           138    120%

Table 2. Time in Wall Clock Seconds for a number of applications and data sets using 8 processors of the IBM Regatta-HPC and Compaq AlphaServer SC ES45/1000.
‡ Both GAMESS-UK and NWChem required a 32-bit kernel when using LAPI. This led to a non-optimal memory configuration on the Regatta HPC system.
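
The final column of Table 2 can be reproduced directly from the elapsed times. The sketch below does so for a subset of the rows (the three DL_POLY benchmarks, ANGUS and CASTEP); the helper name `ratio_percent` is our own, and rounding to the nearest whole percent is an assumption about how the published figures were derived.

```python
# (code, data set, T(ES45/1000) in s, T(Regatta-HPC) in s), from Table 2
ROWS = [
    ("DL_POLY", "Bench 4", 155, 145),
    ("DL_POLY", "Bench 5", 120, 119),
    ("DL_POLY", "Bench 7", 198, 296),
    ("ANGUS", "(144**3) ILU Grid, 100 iterations", 1213, 639),
    ("CASTEP", "Chabazite energy", 337, 163),
]


def ratio_percent(t_es45: float, t_hpc: float) -> int:
    """Return [T(ES45/1000) / T(Regatta-HPC)] x 100, rounded (assumed)."""
    return round(100.0 * t_es45 / t_hpc)


# A ratio above 100% means the Regatta-HPC node finished first.
hpc_faster = 0
for code, data_set, t_es45, t_hpc in ROWS:
    pct = ratio_percent(t_es45, t_hpc)
    hpc_faster += pct > 100
    print(f"{code:8s} {data_set:36s} {pct:4d}%")

print(f"Regatta-HPC faster in {hpc_faster} of {len(ROWS)} rows shown")
```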

References

[1] S. Behling, R. Bell, P. Farrell, A. Holthoff, F. O'Connell, and W. Weir, The POWER4 Processor Introduction and Tuning Guide, IBM Corporation, International Technical Support Organization, Austin TX, 2001; and references therein.

[2] H. M. Matis, J. D. McCalpin, M-C. Chiang, F. P. O'Connell, and P. Buckland, IBM pSeries 690 Configuring for Performance, IBM Corporation, Austin TX, 2001.

[3] M.F. Guest, Performance of Various Computers in Computational Chemistry, in Proceedings of the Daresbury Machine Evaluation Workshop, STFC Daresbury Laboratory, November 2001. The associated MS PowerPoint presentation is also available.

 
 
 
