A Parallel Implementation of SPME for DL-POLY3
I.J.Bush and W.Smith, STFC Daresbury Laboratory
In DLPOLY [1] the evaluation of the Coulomb potential and forces is performed using the smooth
particle mesh Ewald (SPME) algorithm [2]. As in all Ewald [3] methods this splits the calculation into
two parts, one performed in real space and one in Fourier space. The former only requires evaluation
of short ranged functions, which fits in well with the domain decomposition used by DLPOLY 3, and
so scales well with increasing processor count. However the Fourier component requires 3
dimensional FFTs to be performed. These are global operations and so a different strategy is required
if good scaling is to be achieved.
The original implementation involved replicating the whole FFT grid on all processors and performing
the FFTs in serial after which each processor could evaluate the appropriate terms for the atoms that
it held. This method has a number of drawbacks
- The FFT grid may be very large. A 256x256x256 grid would require 256 Mbytes of data to be held
on each processor.
- The replication of such large grids is an expensive operation in itself, the cost of which increases
with processor count
- Doing the FFT in serial is potentially a bottleneck, especially at large numbers of processors
While both open source 3D parallel FFTs (such as FFTW [4]) and proprietary ones (such as Cray's
PCCFFT) exist, unfortunately they do not quite address all the points above. The problem is that they
impose a data distribution, typically planes of points, that is incompatible with DLPOLY's spatial
domain decomposition, so while a complete replication of the data is not required, it is still necessary
to perform extensive data redistributions which will limit the scaling of the method.
Instead a parallel 3D FFT has been written by Ian Bush at DL that maps directly onto DLPOLY's data
distribution. This involved parallelizing the individual 1D FFTs in an efficient manner, and while the
method will be slower than the proprietary routines for small processor counts, at large numbers it is
an attractive method for a couple of reasons:
- The method, while moving more data in total, requires very much fewer messages. Thus in the
latency dominated regime it should perform better.
- Global operations, such as the all to all operations used in both FFTW and PCCFFT, are totally
avoided.
More generally the method is extremely flexible, allowing a much more general data distribution than
those of other FFTs, and as such should be useful in other codes which do not map directly onto a by
planes distribution.
In figure 1 we compare the times for DL_POLY 3 using both the new method and the original. The
system is 27000 atoms of Sodium Chloride run for 200 time steps, and the FFT grid size is 64x64x64
The numbers are for a T3E 1200, the times quoted are in seconds. It can be seen that the
new method, while not scaling perfectly, is a vast improvement on the original.

Figure 1: Timings on the Cray T3E 1200 for 27000 atoms of NaCl
showing the new parallel FFT compared to the original method
Figure 2 shows the elapsed times for a larger system on a variety of
architectures. The system is again Sodium Chloride , but this time
consists of 216000 ions and the FFT grid size id 128x128x128. Again
the times are for 200 time steps. It can be seen that the calculation
scales well on all machines to 128 processors, and is still speeding
up substantially on 256.

Figure 2: Timings for 216000 ions of NaCl on a range of parallel systems
A discussion of the scaling properties of DL-POLY
can be found here.
References
[1] W.Smith and T.Forester, J.Molec. Graphics, 14 136 (1996)
[2] Darden et al., J.Chem.Phys. 103 19 (1995)
[3] P.P.Ewald, Ann. Physik. 64 253 (1921)
[4] Frigo and Johnson, ICASSP Conference Proceedings (1998)
|