| |
A Parallel Implementation of SPME for DL_POLY 3
I.J. Bush and W. Smith
STFC Daresbury Laboratory
In DL_POLY [1] the evaluation of the Coulomb potential and forces is performed using the smooth
particle mesh Ewald (SPME) algorithm [2]. As in all Ewald [3] methods this splits the calculation into
two parts, one performed in real space and one in Fourier space. The former only requires evaluation
of short ranged functions, which fits in well with the domain decomposition used by DL_POLY 3, and
so scales well with increasing processor count. However the Fourier component requires 3
dimensional FFTs to be performed. These are global operations and so a different strategy is required
if good scaling is to be achieved.
The original implementation involved replicating the whole FFT grid on all processors and performing
the FFTs in serial after which each processor could evaluate the appropriate terms for the atoms that
it held. This method has a number of drawbacks
- The FFT grid may be very large. A 256x256x256 grid would require 256 Mbytes of data to be
held on each processor.
- The replication of such large grids is an expensive operation in itself, the cost of which
increases with processor count
- Doing the FFT in serial is potentially a bottleneck, especially at large numbers of processors
While both open source 3D parallel FFTs (such as FFTW [4]) and proprietary ones (such as Cray's
PCCFFT) exist, unfortunately they do not quite address all the points above. The problem is that they
impose a data distribution, typically planes of points, that is incompatible with DL_POLY's spatial
domain decomposition, so while a complete replication of the data is not required, it is still necessary
to perform extensive data redistribution which will limit the scaling of the method.
To address these limitations, a parallel 3D FFT has been written by Ian Bush which maps directly
onto DL_POLY's data distribution. This involved parallelizing the individual 1D FFTs in an efficient
manner, and while the method will be slower than the proprietary routines for small processor counts,
at large numbers it is an attractive method, since:
- The method, while moving more data in total, requires very much fewer messages. Thus in
the latency dominated regime it should perform better.
- Global operations, such as the all to all operations used in both FFTW and PCCFFT, are
totally avoided.
More generally the method is extremely flexible, allowing a much more general data distribution than
those of other FFTs, and as such should be useful in other codes which do not map directly onto a
by planes distribution.
In Table 1 we compare the times for DL_POLY 3 using both the new method and the original. The
system comprises 27,000 atoms of Sodium Chloride run for 200 time steps, where the FFT grid size
is 64x64x64. The timings (seconds) are for the Cray T3E/1200E at Manchester; it can be seen that
the new method, while not scaling perfectly, is a vast improvement on the original.
| Processors |
Original Method |
Parallel FFT |
| 4 |
1851.1 |
554.8 |
| 8 |
911.8 |
318.9 |
| 16 |
558.6 |
116.9 |
| 32 |
351.2 |
86.1 |
| 64 |
271.7 |
54.4 |
| 128 |
232.3 |
33.7 |
| 256 |
216.6 |
28.1 |
Table 1.
Table 2 shows the scaling of the new method for a larger NaCl system, this time 216,000 atoms for
10 time steps with a 128x128x128 FFT grid, again on the Cray T3E. For this example, which is more
typical of a production run, the scaling can be seen to be excellent.
| Processors |
Time/s |
| 32 |
53.32 |
| 64 |
24.38 |
| 128 |
12.46 |
| 256 |
6.23 |
Table 2.
References
[1] W. Smith and T. Forester, J. Molec. Graphics, 14 136 (1996)
[2] Darden et al., J. Chem.Phys. 103 19 (1995)
[3] P.P. Ewald, Ann. Physik. 64 253 (1921)
[4] Frigo and Johnson, ICASSP Conference Proceedings (1998)
|
|