Applications Performance: Release 4.1 of NWChem
M.F. Guest, STFC Daresbury Laboratory
In a previous report we have described the functionality within the NWChem package [1]. Release 4.1
of the code continues to extend the implementation of many of the standard electronic structure
methods currently used to compute the properties of molecules and periodic solids. In addition,
NWChem has the ability to perform classical molecular dynamics and free energy simulations with the
forces for these simulations being obtainable from a variety of sources, including ab initio calculations.
Examples of the capabilities of NWChem include:
- Direct, semi-direct, and conventional Hartree-Fock (RHF, UHF, ROHF) calculations using up to
10,000 Gaussian basis functions; analytic first and second derivatives of the HF energy.
- Direct, semi-direct, and conventional density functional theory (DFT) calculations with a wide
variety of local and non-local exchange-correlation potentials, using up to 10,000 basis functions;
analytic first and second derivatives of the DFT energy.
- Complete active space self-consistent-field (CASSCF) calculations; analytic first and numerical
second derivatives of the CASSCF energy.
- Semi-direct and RI (Resolution-of-the-Identity)-based second-order perturbation theory (MP2)
calculations for RHF and UHF wave functions using up to 3,000 basis functions; fully direct
calculations based on RHF wave functions; analytic first derivatives and numerical second
derivatives of the MP2 energy.
- Coupled cluster, CCSD and CCSD(T), calculations based on RHF wave functions using up to
3,000 basis functions; numerical first and second derivatives of the coupled cluster energy.
One of the standard NWChem benchmarks is designed to illustrate the scaling of the DFT module.
It treats a number of fragment models of a zeolite cluster, ranging in size from Si8O7H18, with 347
basis functions and 832 CD fitting functions, to Si28O67H30, with 1687 basis functions and 3928
fitting functions.
With the assistance of PNNL's Jarek Nieplocha and Edoardo Aprà, we have continued to port the
code to a variety of Pentium (CS1, CS6 and CS9), Athlon (CS7) and Alpha-based (CS2) clusters, and
have benchmarked these systems using the same zeolite fragments. Total elapsed times on the Compaq
AlphaServer SC ES45/1000, the SGI Origin 3800/R14k-500 and the Cray T3E/1200E, together with those
on a number of these clusters, are given in Tables 1 and 2.
The super-linear speed-ups observed at high processor counts for the larger fragments on both the
Cray T3E/1200E and SGI Origin 3800/R14k-500 arise from the increased availability of memory. In
such cases, the 3c2e-integrals are held entirely in core; at lower processor counts a fraction of
these integrals must be re-computed on each iterative cycle of the SCF. Considering the total times
to solution on 32 CPUs of the high-end systems for the smaller fragments (Si8O7H18 and Si8O25H18),
we see that while the Compaq AlphaServer SC ES45/1000 is the fastest machine, it outperforms the
Origin 3800 by only the modest factor of ca. 1.2 (and the Cray T3E/1200E by factors of 2.2 and
2.5). In these calculations all integrals are held in memory. This factor increases significantly
in calculations on the larger fragments, where integral re-computation is required; now the
AlphaServer SC is 2.7 (Si26O37H36) and 3.5 (Si28O67H30) times faster than the Origin 3800. Lower
factors are again apparent at higher node counts, even for the larger zeolite fragments, when all
integrals are in memory. Thus the 128 CPU timings suggest that the AlphaServer SC is only a factor
of 1.3 times faster than the Origin 3800, and a factor of 2.2 times faster than the Cray T3E/1200E.
| Zeolite | Basis (AOs/CD) | CPUs | Cray T3E/1200E | Compaq AlphaServer SC ES45/1000 | SGI Origin 3800/R14k-500 |
|---|---|---|---|---|---|
| Si8O7H18 | 347/832 | 8 | 357 | 74 | 159 |
| | | 16 | 163 | 55 | 73 |
| | | 32 | 107 | 42 | 52 |
| | | 64 | 88 | | 51 |
| | | 128 | 76 | | |
| Si8O25H18 | 617/1444 | 16 | 411 | 159 | 225 |
| | | 32 | 257 | 119 | 140 |
| | | 64 | 174 | | 121 |
| | | 128 | 155 | | |
| Si26O37H36 | 1199/2818 | 16 | | | 4837 |
| | | 32 | 5169 | 907 | 2414 |
| | | 64 | 798 | 404 | 502 |
| | | 128 | 632 | 303 | |
| Si28O67H30 | 1687/3928 | 16 | | | |
| | | 32 | | 1580 | 5507 |
| | | 64 | 6090 | 1182 | 3050 |
| | | 128 | 1360 | 611 | 880 |
Table 1. Total Elapsed times (secs) using the NWChem DFT module in calculations on a variety of
Zeolite fragments on the Compaq AlphaServer SC ES45/1000, SGI Origin 3800/R14k-500 and Cray
T3E/1200E.
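
The super-linear behaviour discussed above can be checked directly against the timings of Table 1. The following is a minimal Python sketch, not part of NWChem; the helper names `doubling_speedups` and `is_superlinear` are illustrative assumptions. It computes the speed-up obtained each time the processor count is doubled and flags any case where that speed-up exceeds the ideal factor of 2.

```python
# Elapsed times (secs) for Si26O37H36, copied from Table 1, keyed by CPU count.
cray_si26 = {32: 5169, 64: 798, 128: 632}
origin_si26 = {16: 4837, 32: 2414, 64: 502}


def doubling_speedups(times):
    """Return {(p, 2p): t(p) / t(2p)} for every CPU count p whose double is also present."""
    return {
        (p, 2 * p): times[p] / times[2 * p]
        for p in sorted(times)
        if 2 * p in times
    }


def is_superlinear(speedup, cpu_ratio=2.0):
    """A speed-up larger than the ratio of CPU counts is super-linear."""
    return speedup > cpu_ratio


for name, times in (("Cray T3E/1200E", cray_si26), ("SGI Origin 3800", origin_si26)):
    for (p, q), s in doubling_speedups(times).items():
        tag = "super-linear" if is_superlinear(s) else "sub-linear or linear"
        print(f"{name}: {p} -> {q} CPUs, speed-up {s:.2f} ({tag})")
```

On these figures the Cray T3E/1200E gains a factor of roughly 6.5 on going from 32 to 64 CPUs, the point at which the integrals first fit in memory, while the subsequent step to 128 CPUs yields well under a factor of 2.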
Considering the total times to solution on 32 CPUs of the commodity clusters (Table 2), we see that
the PentiumIII/800 CS6 cluster is delivering 61% (Si8O7H18) and 50% (Si8O25H18) of the Cray
T3E/1200E in the DFT calculations. A noticeable feature not immediately apparent from the results of
Table 2 is the effective failure of current clusters with just a Fast Ethernet interconnect to "cope
with" truly distributed data applications such as NWChem. The interconnect is simply not able to
support the levels of network traffic demanded; in the present case it proved impossible even to run
the larger zeolites, Si26O37H36 and Si28O67H30, with the CS6 cluster often hanging under congested
network traffic. It seems likely that Gigabit Ethernet would show similar behaviour. The only
solution here would be a replicated data exchange correlation (XC) Fock build, since at present
ga_access (used by the distributed data XC build) is extremely latency sensitive; PNNL plan to
implement a replicated data XC build in the near future. Significantly better performance is found
when using Giganet, and to a lesser extent Myrinet, since their latency is much better than that of
Fast Ethernet. A related effect is found on machines featuring interconnects for which the Global
Array Tools have not been optimised. A good example is provided by the AMD-based CS7 cluster that
features the SCALI/SCI interconnect. In spite of repeated efforts, it has not been possible to
engage Dolphin in delivering a tuned GA implementation based on the SCAMPI environment. The impact
of running with a GA implementation in which the 1-sided communications are not optimised is only
too clear from the results of Table 2. The performance in those cases where all integrals are
resident in memory is extremely poor, with the delivered performance little better than that of
clusters with just Fast Ethernet.
As expected, the more powerful CPUs of both the Alpha Linux and Pentium/4 Clusters lead to much
higher percentage delivery, with the CS2 Alpha Cluster somewhat faster than the CS9 Myrinet-based
cluster in every case. Again, it appears that the latency of QsNet vs. Myrinet accounts for this
ordering. The 32-CPU Alpha cluster delivers 220% (Si8O7H18) and 215% (Si8O25H18) of the Cray T3E.
The much higher figure (464%) for Si26O37H36 is attributed to the increased memory available on the
cluster (see above). On 32 CPUs, the CS2 Alpha Linux and CS9 Pentium/4 Clusters are seen to
deliver between 67-98% (CS2) and 65-88% (CS9) of the Compaq AlphaServer SC ES45/1000.
Corresponding delivery figures against the SGI Origin 3800/R14k-500 are 113-234% (CS2) and
95-227% (CS9). It is of some interest to compare the present Alpha timings for the larger fragments
with those originally reported on the IBM SP/P2SC-120 [2]. The timings of Table 2 suggest that the
32-CPU Alpha Linux Cluster is, for the larger fragments, delivering ca. 50% of the performance of
the 256 node IBM/P2SC-120 at the EMSL's Molecular Sciences Computing Facility (MSCF). The 64-CPU
cluster comfortably outperforms the IBM SP, a creditable level of performance even allowing for the
dated nature of the IBM hardware.
| Zeolite | Basis (AOs/CD) | CPUs | SGI Origin 3800/R14k-500 | CS6 PIII/800 + FE (LAM/MPI) | CS7 dual AMD K7/1000 Cluster (SCAMPI) | CS9 dual P4/2000 Cluster (Myrinet 2k) | CS2 Alpha Linux Cluster (QsNet) |
|---|---|---|---|---|---|---|---|
| Si8O7H18 | 347/832 | 8 | 159 | 356 | 257 | 110 | 103 |
| | | 16 | 73 | 227 | 223 | 68 | 61 |
| | | 32 | 52 | 177 | 249 | 56 | 43 |
| | | 64 | 51 | | | | |
| Si8O25H18 | 617/1444 | 16 | 225 | 596 | 511 | 182 | 177 |
| | | 32 | 140 | 514 | 404 | 135 | 124 |
| | | 48 | | | | 128 | 104 |
| | | 64 | 121 | | 427 | 129 | 94‡ |
| | | 128 | | | | | |
| Si26O37H36 | 1199/2818 | 16 | 4837 | | | 2065 | 1978 |
| | | 32 | 2414 | | 2388 | 1147 | 1103 |
| | | 48 | | | | 530 | 489 |
| | | 64 | 502 | | 1271 | 517 | 438‡ |
| | | 128 | | | | | |
| Si28O67H30 | 1687/3928 | 32 | 5507 | | 4682 | 2424 | 2351 |
| | | 48 | | | | 1822 | 1770 |
| | | 64 | 3050 | | 3008 | 1617 | 1487‡ |
| | | 128 | 880 | | | | |
Table 2. Total Elapsed times (secs.) using the NWChem DFT module in calculations on a variety of
Zeolite fragments on a number of commodity-based systems. The SGI Origin 3800/R14k-500 is also
included as a reference.
‡ 62 processors
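
The "percentage delivery" figures quoted in the text can be read as a ratio of elapsed times. The sketch below assumes that reading (100 × reference time / cluster time), which is not stated explicitly in the report; the function name `delivery_percent` is hypothetical. It uses the 32-CPU timings for the Cray T3E/1200E (Table 1) and the CS6 cluster (Table 2).

```python
def delivery_percent(reference_secs, cluster_secs):
    """Percentage of the reference machine's performance delivered by the cluster.

    Assumes performance is inversely proportional to elapsed time.
    """
    return 100.0 * reference_secs / cluster_secs


# 32-CPU elapsed times (secs) for the two smaller fragments.
cray_32 = {"Si8O7H18": 107, "Si8O25H18": 257}  # Table 1, Cray T3E/1200E
cs6_32 = {"Si8O7H18": 177, "Si8O25H18": 514}   # Table 2, CS6 PIII/800 + FE

for fragment in cray_32:
    pct = delivery_percent(cray_32[fragment], cs6_32[fragment])
    print(f"{fragment}: CS6 delivers {pct:.1f}% of the Cray T3E/1200E")
```

Under this assumption the CS6 values come out at about 60% and 50%, in line with the figures quoted for the PentiumIII/800 cluster.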
References
[1] Additional information on NWChem and the Environmental Molecular Sciences Laboratory, Pacific
Northwest National Laboratory, Battelle Memorial Institute, and the U.S. Department of Energy is
available via the Pacific Northwest National Laboratory Home Page
[2] D.A. Dixon et al., Computational Chemistry in the Environmental Sciences Laboratory, High
Performance Computing, Eds. R.J. Allan, M.F. Guest, A.D. Simpson, D.S. Henty and D.A. Nicole,
Kluwer Academic, 1998, pp. 215-228.