This report compares the performance of a number of different computer systems using the DL_POLY software. The comparison involves twenty seven computers, including scientific workstations from IBM, Sun, Hewlett Packard, Digital and Silicon Graphics, and Pentium-based PCs. The benchmark suite consists of a set of six typical MD simulations detailed below.
Workstations that have been benchmarked include those from
We should stress from the outset that our access to much of the hardware evaluated herein has been at best short-lived, and has often involved the temporary loan or donation of machines as part of one of the hardware evaluation exercises run at the Daresbury Laboratory. In many cases these machines were not optimally configured in terms of either memory or high-speed disk, and the results presented here should be viewed in that light.
Following an introductory evaluation of hardware based on the SPEC Benchmarks (Section 3), we present in Sections 4 and 5 results obtained using the DL_POLY simulation program [1].
Note that the present results are taken from a more detailed report on computational chemistry benchmarks; the associated MS PowerPoint presentation is also available.
One of the most useful indicators of CPU performance is provided by the SPEC ("Standard Performance Evaluation Corporation") benchmarks. This benchmark suite contains non-tuned application-based code to measure processor speed for both integer (SPECint) and floating point (SPECfp) arithmetic. While earlier versions of the suite (e.g. SPECmark89) had certain well-advertised flaws, the more recent offerings, SPECfp95 and SPECint95, have become industry standards for measuring primarily the performance of a system's processor, memory architecture, operating system and compiler.
SPECfp95 is derived from the results of ten floating-point benchmarks compiled with aggressive optimization. It is the geometric mean of ten normalized ratios (one for each floating-point benchmark). SPECint95 is derived from the results of eight integer benchmarks compiled with aggressive optimization. It is the geometric mean of eight normalized ratios (one for each integer benchmark). Note that the level of optimization is not mandated. While highly aggressive optimization is permitted, results derived from benchmarks compiled with conservative optimization (as in SPECbase) can be submitted.
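The composite described above can be sketched in a few lines of Python. This is a minimal illustration of a geometric mean over per-benchmark ratios, not SPEC's own tooling; the function name `spec_composite` and the ten example ratios are hypothetical and are not real SPEC results.

```python
import math

def spec_composite(ratios):
    """Geometric mean of per-benchmark normalized ratios."""
    if not ratios or any(r <= 0 for r in ratios):
        raise ValueError("ratios must be a non-empty list of positive numbers")
    # Average the logarithms, then exponentiate: equivalent to the n-th root
    # of the product, but less prone to overflow for long lists.
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# Hypothetical ratios for ten floating-point benchmarks (illustration only).
example_ratios = [20.0, 25.0, 30.0, 22.0, 28.0, 35.0, 18.0, 24.0, 26.0, 32.0]
print(f"composite rating = {spec_composite(example_ratios):.1f}")
```

Because a geometric mean is used, doubling the ratio of any single benchmark raises the composite by the same factor regardless of which benchmark it is, so no single code dominates the rating.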
SPECfp95 and SPECint95 results for many of the CPUs discussed in this paper are given in Table 1. It is clear that no single CPU dominated the SPECfp95 ratings over the recent past until the arrival of the EV6/21264 from Digital. Thus until July 1998, the P2SC/160 CPU in the IBM RS/6000-397 exhibited the most impressive SPECfp95 rating; with a value of 26.6, the P2SC was marginally faster than the HP PA-9000/C240, 1.2 times faster than the HP PA-9000/C200 and SUN Enterprise HPC4500/336, 1.3 times faster than the DEC Alpha 8400/5-625 and SUN Ultra-2/300, and 1.4 times faster than the R10k-based SGI Origin2000/195. This picture has changed quite drastically with the arrival of the EV6. With SPECfp95 ratings of 47.7 (in the 8400/6-575) and 58.7 (in the Compaq DS20), the EV6 Alpha is seen to be more than twice as fast as the other leading processors of Table 1. The following points should be noted regarding the values specified in this Table;
Using the Compaq AlphaServer DS20 value of 58.7 to normalise the SPECfp ratings, we would expect the DS20 and ES40 to be somewhat ahead of the other EV6-based machines (the Compaq Alpha GS140 and Alpha 8400/6-575), given the more optimal memory subsystem involved. These four machines appear far superior to the remainder; based on this performance metric, the DS20 is seen to be 1.95 times the speed of the Power3-based 200 MHz IBM RS/6000-43P/260, 2.3 times the speed of the HP PA-9000/C240 and 2.4 times the speed of the 250 MHz R10k-based SGI Origin2000. All other CPUs are projected to be significantly less than half the speed.
While the EV6-based machines from Compaq/DEC are also seen to dominate the SPECint95 ratings, a quite different ordering of the processors from IBM (Power3, Power2SC and Power2) and SGI (both R10k and R8k) is seen compared to the SPECfp95 ratings; both are now slower than those from DEC (EV5-based) and SUN. We also note that the SPECint95 ratings suggest that the Pentium II/400 is 1.7 times slower than the Compaq AlphaServer DS20, while the SPECfp95 ratings point to the Pentium being a factor of 4.5 times slower.
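The normalisation used in this comparison is simple enough to reproduce. The sketch below takes a handful of SPECfp95 and SPECint95 values copied from Table 1 and expresses each relative to the DS20; the helper name `relative_ratings` is an illustrative choice, not part of any benchmark tooling.

```python
# (SPECfp95, SPECint95) values copied from Table 1 for a few machines.
ratings = {
    "Compaq Alpha DS20": (58.70, 27.70),
    "IBM RS/6000-43P": (30.10, 13.10),
    "HP PA-9000/C240": (25.40, 17.30),
    "SGI Origin2000/250": (24.50, 14.70),
    "Pentium II/400": (13.00, 16.00),
}

def relative_ratings(table, reference):
    """Return (fp, int) ratings as fractions of the reference machine's."""
    ref_fp, ref_int = table[reference]
    return {name: (fp / ref_fp, integer / ref_int)
            for name, (fp, integer) in table.items()}

rel = relative_ratings(ratings, "Compaq Alpha DS20")
for name, (fp, integer) in rel.items():
    # 1/fp is the factor by which the reference machine is faster.
    print(f"{name:20s} SPECfp95 {fp:4.0%}  SPECint95 {integer:4.0%}  "
          f"DS20 faster by {1 / fp:.2f}x (fp), {1 / integer:.2f}x (int)")
```

The inverse of each fraction gives the "times slower" factors quoted in the text, for example the Pentium II/400 on SPECfp95 versus SPECint95.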
When considering the present benchmarking results, there are several factors we wish to consider in assessing the usefulness of the SPEC ratings:
i. Do the SPECfp95 values provide a reliable metric for evaluating the capabilities of hardware in computational chemistry? If so, we would expect to find a close mapping of the ratios for the various chemistry benchmarks onto the SPECfp ratios;
ii. Does any particular CPU consistently "underperform" based on the SPECfp criteria? This would manifest itself as the ratios from the chemistry benchmarks falling below the SPECfp ratios. In particular we shall look for indicators of the memory problems of the SGI O2-R10k impacting on the benchmarks.
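One way to make these two questions concrete is to compare, machine by machine, the relative DL_POLY performance with the relative SPECfp95 rating. The sketch below uses percentages copied from Tables 1 and 2 (both normalised to the DS20); the 10% tolerance and the labels returned by `classify` are assumptions made for illustration, not criteria taken from the report.

```python
# (relative SPECfp95 %, relative DL_POLY %) copied from Tables 1 and 2.
observed = {
    "DEC Alpha PW/600AU": (36, 71),
    "SGI Origin2000/250": (42, 66),
    "HP PA-9000/C240": (43, 40),
    "IBM RS/6000-43P": (51, 40),
    "SUN HPC4500/336": (37, 23),
    "Pentium II/400": (22, 28),
}

TOLERANCE = 0.10  # assumed band within which the two metrics "agree"

def classify(specfp_pct, bench_pct, tol=TOLERANCE):
    """Compare a benchmark's relative performance with the SPECfp95 prediction."""
    ratio = bench_pct / specfp_pct
    if ratio < 1 - tol:
        return "underperforms"
    if ratio > 1 + tol:
        return "outperforms"
    return "in line"

for machine, (specfp, dlpoly) in observed.items():
    print(f"{machine:20s} SPECfp95 {specfp:3d}%  DL_POLY {dlpoly:3d}%  "
          f"-> {classify(specfp, dlpoly)}")
```

A machine labelled "underperforms" here is one whose chemistry benchmark ratio falls below its SPECfp ratio, which is exactly the signature described in point ii.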
We will attempt to address these issues below. Finally, we note that a SPEC FAQ describing the SPEC benchmark suite and the SPEC consortium is periodically posted to comp.benchmarks, and can be found on the WWW at
http://www.specbench.org/spec/faq
An excellent summary of the SPEC benchmarks that is periodically updated is available via anonymous ftp from ftp.cs.toronto.edu in the file /pub/spectable. More SPEC-related information is available at the SPEC WWW site,
and at the Performance Database Web site,
http://performance.netlib.org/performance/html/spec.html#specsite.
The benchmark summarised below is designed to reflect the typical range of simulations undertaken by the molecular dynamicist. It includes 6 calculations carried out using the DL_POLY molecular dynamics code, and includes the following functionality;
The data presented in Table 2 were collected under control of the UNIX time command where available, and include CPU time (both user and system), total elapsed time and efficiency, measured as the ratio of CPU time to elapsed time. The total user CPU timings of Table 2 refer to the summed user CPU timings over all 6 calculations of the benchmark. Note that in contrast to the QC benchmark, little I/O is performed by the DL_POLY calculations, so efficiency should always be high, assuming the benchmarks were conducted on a dedicated resource.
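The efficiency measure can be illustrated with a short calculation. The sketch below assumes efficiency is (user + system) CPU time divided by elapsed time, which is one reading of "CPU versus elapsed"; the timings are copied from Table 2, and the function name `efficiency` is an illustrative choice.

```python
def efficiency(user_min, system_min, elapsed_min):
    """Fraction of elapsed (wall-clock) time spent executing on the CPU."""
    if elapsed_min <= 0:
        raise ValueError("elapsed time must be positive")
    return (user_min + system_min) / elapsed_min

# (user, system, elapsed) in minutes, copied from Table 2.
timings = {
    "Compaq Alpha GS140": (13.9, 0.0, 13.9),
    "DEC Alpha PW/433AU": (28.1, 0.0, 28.8),
    "Cray T3E/1200": (41.2, 0.6, 42.6),
    "SGI O2/R5k-SC": (80.5, 0.2, 83.5),
}

for machine, (user, system, elapsed) in timings.items():
    print(f"{machine:20s} efficiency = {efficiency(user, system, elapsed):.1%}")
```

Since the tabulated times are rounded to 0.1 minutes, a value marginally above 100% can arise from rounding alone and should not be read as a measurement error.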
The total CPU timings of Table 2 suggest that the Digital/Compaq Alpha CPU and, to a lesser extent, the SGI 250 MHz R10k, are dominant. The Compaq AlphaServer GS140 is the fastest machine (13.9 minutes), slightly faster than the AlphaServer DS20 and ES40 (14.3 and 14.5 minutes respectively) and the Compaq XP1000 6/450 (15.6 minutes). The EV6-based GS140 outperforms the EV5-based DEC Alpha 8400/5-625 (19.8 mins.) and the Alpha PW/600AU (20.2 mins.) by a factor of 1.45, and the SGI Origin2000/250 (21.6 mins.) by a factor of 1.55. These are followed by the SGI Octane/250 (24.5 mins.), the DEC Alpha PW/433AU (28.1 mins.), SGI Origin2000/195 (29.9 mins.) and SGI PChall-R10k/195 (33.1 mins.). The leading 11 CPUs are from either Digital/Compaq or Silicon Graphics. Note that the incorporation of DL_POLY into the benchmark suite came after the availability of the Alpha 8400/6-575.
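The speed factors quoted above follow directly from the total (user plus system) CPU times. The sketch below recomputes them from Table 2 and also derives a relative performance percentage against the DS20; taking that column to be 100 x (DS20 CPU time / machine CPU time) is an assumption that appears consistent with the tabulated values rather than a stated definition.

```python
# (user, system) CPU minutes copied from Table 2.
cpu_minutes = {
    "Compaq Alpha GS140": (13.9, 0.0),
    "Compaq Alpha DS20": (14.3, 0.0),
    "DEC Alpha 8400/5-625": (19.7, 0.1),
    "DEC Alpha PW/600AU": (20.2, 0.0),
    "SGI Origin2000/250": (21.6, 0.0),
}

def total_cpu(machine):
    user, system = cpu_minutes[machine]
    return user + system

def speed_factor(fast, slow):
    """How many times faster `fast` completes the benchmark than `slow`."""
    return total_cpu(slow) / total_cpu(fast)

def relative_performance(machine, reference="Compaq Alpha DS20"):
    """Performance as a percentage of the reference machine (assumed formula)."""
    return 100.0 * total_cpu(reference) / total_cpu(machine)

for machine in cpu_minutes:
    print(f"{machine:22s} {total_cpu(machine):5.1f} min  "
          f"GS140 faster by {speed_factor('Compaq Alpha GS140', machine):.2f}x  "
          f"relative {relative_performance(machine):.0f}%")
```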
When considering the performance of the CPUs from SUN, IBM and Hewlett Packard, we would note the following:
Based on the published SPECfp95 ratings, and normalising with respect to the Compaq AlphaServer DS20 value of 58.7, we would expect (see section 1) the Alpha DS20 and ES40 to be somewhat ahead of the other EV6-based machines (the Compaq Alpha GS140 and Alpha 8400/6-575), with these four machines appearing far superior to the remainder. The DS20 is seen to be 1.95 times the speed of the Power3-based 200 MHz IBM RS/6000-43P/260, 2.3 times the speed of the HP PA-9000/C240 and 2.4 times the speed of the 250 MHz R10k-based SGI Origin2000. All other CPUs are projected to be significantly less than half the speed.
While the EV6-based machines from Compaq/DEC are also seen to dominate the SPECint95 ratings, a quite different ordering of the processors from IBM (Power3, Power2SC and Power2) and SGI (both R10k and R8k) is seen compared to the SPECfp95 ratings; both are now slower than those from DEC (EV5-based) and SUN. We also note that the SPECint95 ratings suggest that the Pentium II/400 is 1.7 times slower than the Compaq AlphaServer DS20, while the SPECfp95 ratings point to the Pentium being a factor of 4.5 times slower.
When analysing the results of the present evaluation exercise, we wish to consider three questions: (i) do the SPECfp95 values provide a reliable metric for evaluating the capabilities of hardware in computational chemistry? If so, we would expect to find a close mapping of the ratios for the various benchmarks onto the SPECfp95 ratios; (ii) does any particular CPU consistently "underperform" based on the SPECfp criteria? This would manifest itself as the ratios from the benchmarks falling below the SPECfp ratios; and (iii) do the "simple" Matrix and Chemistry Kernel benchmarks lead to the same conclusions as the GAMESS-UK and DL_POLY benchmarks? To these ends an approximate Performance Index (PI) has been devised for each machine, based on an average value of the Matrix-97, Chemistry Kernels and DL_POLY benchmarks. A full discussion is to be found in the paper on computational chemistry benchmarks. Here we present the conclusions only.
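As a rough illustration of how such an index might be formed, the sketch below averages the relative performance of a machine over the three benchmarks. The exact definition of the PI is given only in the fuller report, so the plain arithmetic mean used here is an assumption; the Matrix-97 and Chemistry Kernels figures are hypothetical placeholders, and only the DL_POLY and SPECfp95 values come from Tables 1 and 2.

```python
from statistics import mean

def performance_index(relative_scores):
    """Arithmetic mean of the available relative-performance percentages.

    ASSUMPTION: the report says only that the PI is "an average value";
    an unweighted arithmetic mean is used here for illustration.
    """
    values = [v for v in relative_scores.values() if v is not None]
    if not values:
        raise ValueError("no benchmark results available")
    return mean(values)

# SGI Origin2000/250: the DL_POLY figure is from Table 2; the other two
# entries are HYPOTHETICAL placeholders, not measured results.
scores = {"Matrix-97": 60.0, "Chemistry Kernels": 63.0, "DL_POLY": 66.0}
specfp_relative = 42.0  # relative SPECfp95 for the same machine, Table 1

pi = performance_index(scores)
print(f"PI = {pi:.0f}%  vs  SPECfp95 prediction = {specfp_relative:.0f}%")
```

Benchmarks that could not be run on a given machine can be passed as `None` and are simply left out of the average.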
In terms of relative speed, we find that the PI values and all the chemistry benchmarks are broadly in line with the SPECfp predictions, the only notable exceptions being summarised below:
The CRIME chip, which acts as the interface between the memory and the three drains on it - the CPU (800 MByte/second), I/O engine (500 MByte/second) and the monitor display (700 MByte/second) - is probably the main bottleneck. This chip was designed to work as a built-in memory controller, but the design was biased toward the R5k; it cannot work directly with the R10k because the R5k expects 32 byte cache refills while the R10k wants to have 64 or 128 byte refills. Therefore SGI supply a custom ASIC with the R10k daughter board. This interfaces the R10k's level 2 cache with the CRIME chip. Performance problems are caused by the ASIC having to break each 128 byte cache refill operation into four 32 byte refills. The net impact of this effect is that the O2 R10k will only work well with problems that fit into the L2 cache (1 MByte). Not surprisingly, the memory intensive SPECfp95 figures are badly affected, although the impact on less memory intensive applications is not so severe. It should be noted that this type of incident is very rare; chips often fail to deliver, but system architectures designed around existing chips seldom do.
Table 1: SPECfp95 and SPECint95 ratings, with values relative to the Compaq Alpha DS20.

| Machine | SPECfp95 | SPECint95 | Relative SPECfp95 (%) | Relative SPECint95 (%) |
|---|---|---|---|---|
| Compaq Alpha DS20 | 58.70 | 27.70 | 100% | 100% |
| Compaq Alpha ES40 | 57.70 | 27.30 | 98% | 99% |
| Compaq XP1000 6/450 | 52.80 | 24.90 | 90% | 90% |
| DEC Alpha 8400/6-575 | 47.70 | 30.30 | 81% | 109% |
| Compaq Alpha GS140 | 45.20 | 27.80 | 77% | 100% |
| IBM RS/6000-43P | 30.10 | 13.10 | 51% | 47% |
| IBM RS/6000-397 | 26.60 | 8.61 | 45% | 31% |
| HP PA-9000/C240 | 25.40 | 17.30 | 43% | 62% |
| SGI Onyx2 IR2/250 | 24.50 | 14.70 | 42% | 53% |
| SGI Origin2000/250 | 24.50 | 14.70 | 42% | 53% |
| SUN HPC4500/336 | 21.90 | 15.00 | 37% | 54% |
| HP PA-9000/C200 | 21.40 | 14.20 | 36% | 51% |
| DEC Alpha PW/600AU | 21.30 | 16.30 | 36% | 59% |
| DEC Alpha 8400/5-625 | 20.80 | 18.40 | 35% | 66% |
| DEC Alpha 500/5-500 | 20.40 | 15.00 | 35% | 54% |
| SGI Octane/250 | 20.30 | 13.60 | 35% | 49% |
| SUN Ultra-2/300 | 20.20 | 12.30 | 34% | 44% |
| SGI Origin2000/195 | 19.00 | 9.48 | 32% | 34% |
| SUN Ultra30/300 | 18.30 | 12.10 | 31% | 44% |
| DEC Alpha PW/433AU | 18.10 | 13.90 | 31% | 50% |
| IBM RS/6000-595 | 17.60 | 6.17 | 31% | 22% |
| SGI Octane/195 | 17.40 | 9.40 | 30% | 34% |
| HP PA-9000/C160 | 16.30 | 10.40 | 28% | 38% |
| SGI Origin200 | 15.60 | 8.59 | 27% | 31% |
| SGI Octane/175 | 15.50 | 8.40 | 26% | 30% |
| DEC Alpha 500/5-400 | 14.10 | 12.30 | 24% | 44% |
| SGI PChall-R10k/195 | 13.80 | 8.85 | 24% | 32% |
| DEC Alpha 600/5-333 | 13.20 | 9.23 | 22% | 33% |
| Pentium II/400 | 13.00 | 16.00 | 22% | 58% |
| DEC Alpha 8400/5-300 | 12.40 | 7.43 | 21% | 27% |
| DEC Alpha 600/5-266 | 11.80 | 7.91 | 20% | 29% |
| SUN Ultra-2/200 | 11.10 | 7.67 | 19% | 28% |
| IBM RS/6000-590 | 10.40 | 3.33 | 18% | 12% |
| IBM RS/6000-3CT | 10.20 | 3.42 | 17% | 12% |
| SUN Ultra-1/170 | 9.06 | 5.56 | 15% | 20% |
| DEC Alpha 2100/5-250 | 8.39 | 5.96 | 14% | 22% |
| Pentium II/300 | 8.15 | 11.60 | 14% | 42% |
| SUN Ultra-1/140 | 7.90 | 4.66 | 13% | 17% |
| SGI O2/R10k-SC | 7.83 | 9.02 | 13% | 33% |
| Pentium II/266 | 7.68 | 10.80 | 13% | 39% |
| IBM RS/6000-3BT | 7.50 | 3.14 | 13% | 11% |
| Pentium Pro/200 | 6.75 | 8.09 | 11% | 29% |
| HP PA-9000/J200 | 6.32 | 3.52 | 11% | 13% |
| DEC Alpha 250/4-266 | 6.27 | 5.18 | 11% | 19% |
| DEC AXP/3000-700 | 5.71 | 3.66 | 10% | 13% |
| SGI O2/R5k-SC | 5.42 | 4.82 | 9% | 17% |
| Pentium 233 MMX | 5.21 | - | 9% | - |
| SGI Indy-R5k | 4.78 | 4.32 | 8% | 16% |
| HP PA-9000/735-125 | 4.61 | 3.97 | 8% | 14% |
| HP PA-9000/735 | 4.06 | 3.22 | 7% | 12% |
| DEC AXP/3000-500 | 3.65 | 2.15 | 6% | 8% |
| HP PA-9000/715-100 | 3.47 | 2.89 | 6% | 10% |
| IBM PowerPC-43P | 3.20 | 3.59 | 5% | 13% |
| IBM PowerPC-250 | 2.32 | 1.82 | 4% | 7% |
| SUN SPARC 10/41 | 1.38 | 1.13 | 2% | 4% |
| MPP node | | | | |
| IBM SP2/160Thin | 25.80 | 8.61 | 44% | 31% |
| HP PA-9000/V2200 | 22.10 | 13.80 | 38% | 50% |
| Cray T3E/1200 | 21.30 | 18.40 | 36% | 66% |
| Cray T3E/900 | 17.25 | 13.60 | 29% | 49% |
| IBM SP2/120Thin | 16.60 | 5.61 | 28% | 20% |
| IBM SP2/66Thin | 9.35 | 3.31 | 16% | 12% |
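
Table 2: DL_POLY benchmark timings (+), with performance relative to the Compaq Alpha DS20.

| Machine | User CPU Time (min) | System CPU Time (min) | Elapsed Time (min) | Relative Performance (%) |
|---|---|---|---|---|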
| Compaq Alpha GS140 | 13.9 | 0.0 | 13.9 | 103% |
| Compaq Alpha DS20 | 14.3 | 0.0 | 14.3 | 100% |
| Compaq Alpha ES40 | 14.5 | 0.0 | 14.5 | 98% |
| Compaq XP1000 6/450 | 15.6 | 0.0 | 15.6 | 92% |
| DEC Alpha 8400/5-625 | 19.7 | 0.1 | 19.9 | 72% |
| DEC Alpha PW/600AU | 20.2 | 0.0 | 20.2 | 71% |
| SGI Origin2000/250 | 21.6 | 0.0 | 21.7 | 66% |
| SGI Octane/250 | 24.5 | 0.0 | 24.5 | 59% |
| DEC Alpha PW/433AU | 28.1 | 0.0 | 28.8 | 51% |
| SGI Origin2000/195 | 29.9 | 0.0 | 30.8 | 48% |
| SGI PChall-R10k/195 | 33.0 | 0.1 | 33.2 | 43% |
| HP PA-9000/V2200 | 33.5 | 0.1 | 33.6 | 43% |
| IBM RS/6000-43P | 35.8 | 0.0 | 35.7 | 40% |
| HP PA-9000/C240 | 36.2 | 0.0 | 36.5 | 40% |
| DEC Alpha 8400/5-300 | 37.2 | 0.1 | 37.2 | 38% |
| SGI Octane/175 | 40.4 | 0.1 | 40.5 | 35% |
| Cray T3E/1200 | 41.2 | 0.6 | 42.6 | 34% |
| Pentium II/400 (pgi) | 50.4 | 0.0 | 50.5 | 28% |
| Cray T3E/900 | 51.0 | 0.7 | 52.3 | 28% |
| SUN HPC4500/336 | 62.6 | 0.0 | 62.7 | 23% |
| Pentium II/300 (pgi) | 65.6 | 0.0 | 65.7 | 22% |
| IBM SP2/120Thin | 67.5 | 0.0 | 68.1 | 21% |
| Pentium II/300 (abs) | 72.1 | 0.0 | 72.1 | 20% |
| Pentium II/266 (pgi) | 76.4 | 0.0 | 76.5 | 19% |
| SGI O2/R5k-SC | 80.5 | 0.2 | 83.5 | 18% |
| Pentium II/266 (abs) | 83.8 | 0.0 | 83.8 | 17% |
| IBM RS/6000-59H | 107.7 | 0.0 | 108.3 | 13% |
(+) Version 2.11 of the DL_POLY Code
M.F. Guest / W. Smith