A New Implementation of a Parallel Jacobi Diagonalizer
I.J.Bush, STFC Daresbury Laboratory
Diagonalization of dense real symmetric and Hermitian matrices is an important step in many
packages that are run on high performance computer systems. Experience has shown that while this
step is often a comparatively small percentage of serial runs, at large numbers of processors and for
large matrices the cost as a percentage of the total run time is much more significant, and in some
cases it may dominate. It is therefore necessary to investigate different algorithms and
implementations to enable the best exploitation possible of high end systems.
The fastest serial algorithm is that based upon Householder reduction to tridiagonal form followed by
diagonalization [1]. A number of parallel implementations of this algorithm exist, for existence
ScaLAPACK and the PeIGS package from PNNL, but they all suffer from a number of drawbacks.
The most obvious is that even for quite large matrices the scaling is mediocre at large processor
counts (see A.G.Sunderland's report on
"Parallel Eigensolver Performance").
However on closer investigation further
problems can be seen. Both ScaLAPACK and PeIGS require a rather indeterminate amount of
workspace to operate, and this workspace is replicated across all the processors; in ScaLAPACK in
the worst case scenario the size of this workspace can be the size of the whole matrix. Obviously for
the very large matrices that one wishes to diagonalize on high-end systems this is highly undesirable,
and may even be impossible.
Further many algorithms can easily generate a good guess at the eigenvectors, and it would be very
useful if this guess could be used to speed up the diagonalization. Unfortunately this is difficult within
the Householder methodology.
An alternative algorithm is Jacobi's. Though slower in serial, typically by a factor of three to five, it has
a number of attractive properties for practical parallel codes; the scaling is good (see below), the
memory requirement scales strictly with the inverse of the number of processors and is known a
priori, and guesses at the eigenvectors, provided they are mutually orthogonal, can be used very
effectively to speed up the diagonalization. At DL I have written a package, named BFG, that
implements a version of this algorithm as described by Littlefield and Maschoff [2]. This is a rewrite of
a previous version which has been used in both the CRYSTAL and VASP packages. The new version
offers extended functionality, better numerical stability, increased single processor performance and
better scaling. It can diagonalize both real symmetric and Hermitian matrices, and has interfaces for
matrices distributed either in a block-cyclic (like ScaLAPACK) fashion, or by columns (PeIGS). It is
written in standard conforming Fortran and uses MPI for message passing, and so it is portable to a
wide range of parallel machines. Further extensive use of MPI communicators make it flexible enough
that matrices may be diagonalized across just a subset of the processors, thus allowing a number of
diagonalizations to be carried out at once. It is now being used in CRYSTAL, and has been distributed
to the authors of Siesta.
Table 1 shows the speed-up of the new package for a dense real symmetric matrix on a Cray
T3E-1200 and an SGI Origin 3000. The matrix is of order 2000 and was generated by using
random numbers, symmetrizing it and then making it diagonally dominant by dividing the off-diagonal
terms by 100. This last step was taken because matrices used in real applications are often
diagonally dominant by roughly this factor. The speed up on turing can be seen to be excellent, but
the situation is not so clear for green. The large super-linear effect is almost certainly due to the large
8 MB secondary cache on that machine. The matrix occupies 32 MB, so at less than four processors
most of the memory accesses must come from main memory, which is slow, while at four processors
or more all the data can be held in secondary cache. Hence the marked super-linear behaviour for
processor counts of four or more.
Number of Processors |
Cray T3E speed-up |
SGI Origin 3000 speed-up |
| 1 |
|
1.00 |
| 2 |
|
1.97 |
| 4 |
4.00 |
6.02 |
| 8 |
8.03 |
13.60 |
| 16 |
15.43 |
30.93 |
| 32 |
30.60 |
58.32 |
| 64 |
57.11 |
100.00 |
| 128 |
100.62 |
141.60 |
Table 1: Speed up for an order 2000 matrix on a Cray T3E and an SGI Origin 3000
Table 2 shows the speed up for an order 2000 Hermitian matrix on an SGI Origin 3000. It has been generated in a
similar manner to the real symmetric one. Again there are super-linear effects, but the larger size of
the matrix results in them being less pronounced. Again the speed up can be seen to be good.
Number of Processors |
SGI Origin 3000 speed-up |
| 1 |
1.00 |
| 2 |
1.57 |
| 4 |
4.04 |
| 8 |
9.63 |
| 16 |
21.04 |
| 32 |
38.65 |
| 64 |
68.79 |
| 128 |
105.11 |
Table 2: Speed up for an order 2000 Hermitian matrix on an SGI Origin 3000
There is an outstanding question: Though the scaling of this algorithm is good, is it good enough to
overcome the poor serial speed of Jacobi's algorithm relative to Householder's? This is a difficult
question to answer, as it will vary from case to case, though
some investigation has been carried out
at least for CRYSTAL. For this code it is felt that this method is
superior because the memory usage on high processor counts is typically much smaller than for the
Householder methods so allowing larger calculations to be performed.
References
[1] Wilkinson and Reinsch, Linear Algebra, Vol. 2 of Handbook for Automatic Computation (1971)
[2] Littlefield and Maschoff, Theor. Chim. Acta. 84 457 (1993)
|