Computational Science & Engineering Department


 

Applications Performance:
Release 4.1 of NWChem

M.F. Guest, STFC Daresbury Laboratory

In a previous report we have described the functionality within the NWChem package [1]. Release 4.1 of the code continues to extend the implementation of many of the standard electronic structure methods currently used to compute the properties of molecules and periodic solids. In addition, NWChem has the ability to perform classical molecular dynamics and free energy simulations with the forces for these simulations being obtainable from a variety of sources, including ab initio calculations. Examples of the capabilities of NWChem include:

  • Direct, semi-direct, and conventional Hartree-Fock (RHF, UHF, ROHF) calculations using up to 10,000 Gaussian basis functions; analytic first and second derivatives of the HF energy.

  • Direct, semi-direct, and conventional density functional theory (DFT) calculations with a wide variety of local and non-local exchange-correlation potentials, using up to 10,000 basis functions; analytic first and second derivatives of the DFT energy.

  • Complete active space self-consistent-field (CASSCF) calculations; analytic first and numerical second derivatives of the CASSCF energy.

  • Semi-direct and RI (Resolution-of-the-Identity)-based second-order perturbation theory (MP2) calculations for RHF and UHF wave functions using up to 3,000 basis functions; fully direct calculations based on RHF wave functions; analytic first derivatives and numerical second derivatives of the MP2 energy.

  • Coupled cluster, CCSD and CCSD(T), calculations based on RHF wave functions using up to 3,000 basis functions; numerical first and second derivatives of the coupled cluster energy.

One of the standard NWChem benchmarks is designed to illustrate the scaling of the DFT module. It treats a number of fragment models of a zeolite cluster, ranging in size from Si8O7H18, with 347 basis functions and 832 charge-density (CD) fitting functions, to Si28O67H30, with 1687 basis functions and 3928 fitting functions.

With the assistance of PNNL's Jarek Nieplocha and Edoardo Aprà, we have continued to port the code to a variety of Pentium (CS1, CS6 and CS9), Athlon (CS7) and Alpha-based (CS2) clusters, and have benchmarked these systems using the same zeolite fragments. Total elapsed times on the Compaq AlphaServer SC ES45/1000, the SGI Origin 3800/R14k-500 and the Cray T3E/1200E, together with those on a number of these clusters, are given in Tables 1 and 2.

The super-linear speed-ups observed at high processor counts for the larger fragments on both the Cray T3E/1200E and SGI Origin 3800/R14k-500 arise from the increased availability of memory. In such cases the three-centre two-electron (3c2e) integrals are held entirely in core; at lower processor counts a fraction of these integrals must be re-computed on each iterative cycle of the SCF. Considering the total times to solution on 32 CPUs of the high-end systems for the smaller fragments (Si8O7H18 and Si8O25H18), we see that while the Compaq AlphaServer SC ES45/1000 is the fastest machine, it outperforms the Origin 3800 by only the modest factor of ca. 1.2 (and the Cray T3E/1200E by factors of 2.5 and 2.2, respectively). In these calculations all integrals are held in memory. This factor increases significantly in calculations on the larger fragments when integral re-computation is required; on 32 CPUs the AlphaServer SC is now 2.7 (Si26O37H36) and 3.5 (Si28O67H30) times faster than the Origin 3800. Lower factors are again apparent at higher processor counts, even for the larger zeolite fragments, once all integrals are in memory. Thus the 128 CPU timings suggest that the AlphaServer SC is only a factor of 1.3 times faster than the Origin 3800, and a factor of 2.2 times faster than the Cray T3E/1200E.
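To make the memory argument concrete, the following sketch estimates an upper bound on the storage needed to hold the 3c2e integrals in core, and how the share per processor shrinks as processors are added. It is an illustration only: it assumes 8-byte reals, exploits only the i ≤ j permutational symmetry, and ignores integral screening, so the storage NWChem actually requires may be considerably smaller. The function names are hypothetical and are not part of NWChem.

```python
# Rough upper bound on in-core storage for three-centre two-electron (3c2e)
# integrals (ij|k): i, j run over AO basis functions, k over CD fitting functions.
# Assumptions (not from the report): 8-byte reals, i <= j symmetry only,
# no screening of negligible integrals, even distribution across processors.

BYTES_PER_INTEGRAL = 8

# (AO basis functions, CD fitting functions) for the smallest and largest fragments
FRAGMENTS = {
    "Si8O7H18": (347, 832),
    "Si28O67H30": (1687, 3928),
}


def n_3c2e(n_ao: int, n_cd: int) -> int:
    """Number of unique (ij|k) integrals using only the i <= j symmetry."""
    return n_ao * (n_ao + 1) // 2 * n_cd


def gib_per_cpu(n_ao: int, n_cd: int, n_cpus: int) -> float:
    """Storage per processor in GiB if the integrals are spread evenly."""
    return n_3c2e(n_ao, n_cd) * BYTES_PER_INTEGRAL / n_cpus / 2**30


for name, (n_ao, n_cd) in FRAGMENTS.items():
    for n_cpus in (32, 128):
        share = gib_per_cpu(n_ao, n_cd, n_cpus)
        print(f"{name:12s} {n_cpus:4d} CPUs: {share:8.3f} GiB per CPU")
```

Under these assumptions the largest fragment needs tens of GiB in aggregate, whereas the smallest needs well under 1 GiB, which is consistent with the small fragments fitting in memory at every processor count while the large ones do so only at high counts.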

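| Zeolite | Basis (AOs/CD) | CPUs | Cray T3E/1200E | Compaq AlphaServer SC ES45/1000 | SGI Origin 3800/R14k-500 |
|---|---|---|---|---|---|
| Si8O7H18 | 347/832 | 8 | 357 | 74 | 159 |
| | | 16 | 163 | 55 | 73 |
| | | 32 | 107 | 42 | 52 |
| | | 64 | 88 | | 51 |
| | | 128 | 76 | | |
| Si8O25H18 | 617/1444 | 16 | 411 | 159 | 225 |
| | | 32 | 257 | 119 | 140 |
| | | 64 | 174 | | 121 |
| | | 128 | 155 | | |
| Si26O37H36 | 1199/2818 | 16 | | | 4837 |
| | | 32 | 5169 | 907 | 2414 |
| | | 64 | 798 | 404 | 502 |
| | | 128 | 632 | 303 | |
| Si28O67H30 | 1687/3928 | 16 | | | |
| | | 32 | | 1580 | 5507 |
| | | 64 | 6090 | 1182 | 3050 |
| | | 128 | 1360 | 611 | 880 |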

Table 1. Total Elapsed times (secs) using the NWChem DFT module in calculations on a variety of Zeolite fragments on the Compaq AlphaServer SC ES45/1000, SGI Origin 3800/R14k-500 and Cray T3E/1200E.
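The speed-up and machine-to-machine factors discussed in the text can be reproduced directly from the Table 1 timings. The sketch below is a minimal illustration using only the two larger fragments; the helper names (`speedup`, `is_superlinear`, `machine_ratio`) are hypothetical and introduced here for convenience. A step is flagged as super-linear when the elapsed time falls by more than the factor by which the processor count was increased.

```python
# Elapsed times in seconds copied from Table 1, keyed by machine,
# zeolite fragment and CPU count. Only the two larger fragments are included.
TABLE1 = {
    "Cray T3E/1200E": {
        "Si26O37H36": {32: 5169, 64: 798, 128: 632},
        "Si28O67H30": {64: 6090, 128: 1360},
    },
    "Compaq AlphaServer SC ES45/1000": {
        "Si26O37H36": {32: 907, 64: 404, 128: 303},
        "Si28O67H30": {32: 1580, 64: 1182, 128: 611},
    },
    "SGI Origin 3800/R14k-500": {
        "Si26O37H36": {16: 4837, 32: 2414, 64: 502},
        "Si28O67H30": {32: 5507, 64: 3050, 128: 880},
    },
}


def speedup(times, p_from, p_to):
    """Ratio of elapsed times when going from p_from to p_to processors."""
    return times[p_from] / times[p_to]


def is_superlinear(times, p_from, p_to):
    """True if the speed-up exceeds the increase in processor count."""
    return speedup(times, p_from, p_to) > p_to / p_from


def machine_ratio(slow, fast, fragment, cpus):
    """How many times faster `fast` is than `slow` for one fragment and CPU count."""
    return TABLE1[slow][fragment][cpus] / TABLE1[fast][fragment][cpus]


# Report the speed-up for each consecutive pair of CPU counts in the table.
for machine, fragments in TABLE1.items():
    for fragment, times in fragments.items():
        counts = sorted(times)
        for p_from, p_to in zip(counts, counts[1:]):
            s = speedup(times, p_from, p_to)
            flag = "super-linear" if is_superlinear(times, p_from, p_to) else ""
            print(f"{machine:32s} {fragment:11s} {p_from:3d}->{p_to:3d}: {s:5.2f}x {flag}")
```

For example, `machine_ratio("SGI Origin 3800/R14k-500", "Compaq AlphaServer SC ES45/1000", "Si26O37H36", 32)` gives the 32-CPU factor quoted for Si26O37H36.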

Considering the total times to solution on 32 CPUs of the commodity clusters (Table 2), we see that the PentiumIII/800 CS6 cluster is delivering 61% (Si8O7H18) and 50% (Si8O25H18) of the performance of the Cray T3E/1200E in the DFT calculations. A noticeable feature not immediately apparent from the results of Table 2 is the effective failure of current clusters with just a Fast Ethernet interconnect to cope with truly distributed-data applications such as NWChem. The interconnect is simply unable to support the levels of network traffic demanded; in the present case it proved impossible even to run the larger zeolites, Si26O37H36 and Si28O67H30, with the CS6 cluster often hanging under congested network traffic. It seems likely that Gigabit Ethernet would show similar behaviour. The only solution here would be a replicated-data exchange-correlation (XC) Fock build, since at present ga_access (used by the distributed-data XC build) is extremely latency sensitive; PNNL plan to implement a replicated-data XC build in the near future. Significantly better performance is found when using Giganet, and to a lesser extent Myrinet, since their latency is much better than that of Fast Ethernet.

A related effect is found on machines featuring interconnects for which the Global Array (GA) Tools have not been optimised. A good example is provided by the AMD-based CS7 cluster, which features the SCALI/SCI interconnect. In spite of repeated efforts, it has not been possible to engage Dolphin in delivering a tuned GA implementation based on the SCAMPI environment. The impact of running with a GA implementation in which the one-sided communications are not optimised is only too clear from the results of Table 2. The performance in those cases where all integrals are resident in memory is extremely poor, with the delivered performance little better than that of clusters with just Fast Ethernet.

As expected, the more powerful CPUs of both the Alpha Linux and Pentium 4 clusters lead to much higher percentage delivery, with the CS2 Alpha cluster somewhat faster than the CS9 Myrinet-based cluster in every case. Again, it appears that the latency of QsNet vs. Myrinet accounts for this ordering. The 32-CPU Alpha cluster delivers 220% (Si8O7H18) and 215% (Si8O25H18) of the performance of the Cray T3E. The much higher figure (464%) for Si26O37H36 is attributed to the increased memory available on the cluster (see above). On 32 CPUs, the CS2 Alpha Linux and CS9 Pentium 4 clusters are seen to deliver between 67-98% (CS2) and 65-88% (CS9) of the performance of the Compaq AlphaServer SC ES45/1000. Corresponding delivery figures against the SGI Origin 3800/R14k-500 are 113-234% (CS2) and 95-227% (CS9). It is of some interest to compare the present Alpha timings for the larger fragments with those originally reported on the IBM SP/P2SC-120 [2]. The timings of Table 2 suggest that the 32-CPU Alpha Linux cluster is, for the larger fragments, delivering ca. 50% of the performance of the 256-node IBM SP/P2SC-120 at the EMSL's Molecular Sciences Computing Facility (MSCF). The 64-CPU cluster comfortably outperforms the IBM SP, a creditable level of performance even allowing for the dated nature of the IBM hardware.

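| Zeolite | Basis (AOs/CD) | CPUs | SGI Origin 3800/R14k-500 | CS6 PIII/800 + FE (LAM/MPI) | CS7 dual AMD K7/1000 Cluster (SCAMPI) | CS9 dual P4/2000 Cluster (Myrinet 2k) | CS2 Alpha Linux Cluster (QsNet) |
|---|---|---|---|---|---|---|---|
| Si8O7H18 | 347/832 | 8 | 159 | 356 | 257 | 110 | 103 |
| | | 16 | 73 | 227 | 223 | 68 | 61 |
| | | 32 | 52 | 177 | 249 | 56 | 43 |
| | | 64 | 51 | | | | |
| Si8O25H18 | 617/1444 | 16 | 225 | 596 | 511 | 182 | 177 |
| | | 32 | 140 | 514 | 404 | 135 | 124 |
| | | 48 | | | | 128 | 104 |
| | | 64 | 121 | | 427 | 129 | 94‡ |
| | | 128 | | | | | |
| Si26O37H36 | 1199/2818 | 16 | 4837 | | | 2065 | 1978 |
| | | 32 | 2414 | | 2388 | 1147 | 1103 |
| | | 48 | | | | 530 | 489 |
| | | 64 | 502 | | 1271 | 517 | 438‡ |
| | | 128 | | | | | |
| Si28O67H30 | 1687/3928 | 32 | 5507 | | 4682 | 2424 | 2351 |
| | | 48 | | | | 1822 | 1770 |
| | | 64 | 3050 | | 3008 | 1617 | 1487‡ |
| | | 128 | 880 | | | | |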

Table 2. Total Elapsed times (secs.) using the NWChem DFT module in calculations on a variety of Zeolite fragments on a number of commodity-based systems. The SGI Origin 3800/R14k-500 is also included as a reference.
‡ 62 processors
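The "percentage delivery" figures quoted in the text compare a cluster with a reference machine at the same processor count. A minimal sketch of that arithmetic follows, assuming delivery is defined as 100 × (reference elapsed time) / (cluster elapsed time); the function names are hypothetical. The 32-CPU timings are taken from Tables 1 and 2.

```python
# 32-CPU elapsed times in seconds, taken from Tables 1 and 2.
SC_32 = {"Si8O7H18": 42, "Si8O25H18": 119, "Si26O37H36": 907, "Si28O67H30": 1580}
CS2_32 = {"Si8O7H18": 43, "Si8O25H18": 124, "Si26O37H36": 1103, "Si28O67H30": 2351}
CS9_32 = {"Si8O7H18": 56, "Si8O25H18": 135, "Si26O37H36": 1147, "Si28O67H30": 2424}


def delivery(t_reference, t_cluster):
    """Cluster performance as a percentage of the reference machine.

    Assumed definition: a cluster that takes twice as long delivers 50%.
    """
    return 100.0 * t_reference / t_cluster


def delivery_range(reference, cluster):
    """Smallest and largest delivery over the fragments both systems ran."""
    values = [delivery(reference[f], cluster[f]) for f in reference if f in cluster]
    return min(values), max(values)


for label, cluster in (("CS2 Alpha Linux", CS2_32), ("CS9 Pentium 4", CS9_32)):
    low, high = delivery_range(SC_32, cluster)
    print(f"{label:16s} delivers {low:.0f}-{high:.0f}% of the AlphaServer SC on 32 CPUs")
```

The same two functions can be applied with the SGI Origin 3800 or Cray T3E/1200E columns as the reference to examine the other comparisons made in the text.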

References

[1] Additional information on NWChem and the Environmental Molecular Sciences Laboratory, Pacific Northwest National Laboratory, Battelle Memorial Institute, and the U.S. Department of Energy is available via the Pacific Northwest National Laboratory Home Page

[2] D.A. Dixon et al., Computational Chemistry in the Environmental Sciences Laboratory, High Performance Computing, Eds. R.J. Allan, M.F. Guest, A.D. Simpson, D.S. Henty and D.A. Nicole, Kluwer Academic, 1998, pp. 215-228.

 
 
   
 

For more information about the Advanced Research Computing Group please contact Dr Mike Ashworth.
 