STFC Home Page STFC Home Page CSE Home Page CSE Home Page Computational Science & Engineering Department  

 22:26:50 GMT
 Friday
 12 March 2010

 Search the CSE web:
 Enter text and press return

 
  Home
  Support and services
  Research and development
  Advanced research computing
  Atomic and molecular physics
  Band theory
  CCP4 group
  Computational biology
  Computational chemistry
  Computational engineering
  Computational material science
  Numerical analysis
  Software engineering
  Visualization
  Online resources
  Events calendar
  Newsroom
  Site map / index
   

Valid HTML 4.01

Valid CSS!

 

A New Implementation of a Parallel Jacobi Diagonalizer

I.J.Bush, STFC Daresbury Laboratory

Diagonalization of dense real symmetric and Hermitian matrices is an important step in many packages that are run on high performance computer systems. Experience has shown that while this step is often a comparatively small percentage of serial runs, at large numbers of processors and for large matrices the cost as a percentage of the total run time is much more significant, and in some cases it may dominate. It is therefore necessary to investigate different algorithms and implementations to enable the best exploitation possible of high end systems.

The fastest serial algorithm is that based upon Householder reduction to tridiagonal form followed by diagonalization [1]. A number of parallel implementations of this algorithm exist, for existence ScaLAPACK and the PeIGS package from PNNL, but they all suffer from a number of drawbacks. The most obvious is that even for quite large matrices the scaling is mediocre at large processor counts (see A.G.Sunderland's report on "Parallel Eigensolver Performance"). However on closer investigation further problems can be seen. Both ScaLAPACK and PeIGS require a rather indeterminate amount of workspace to operate, and this workspace is replicated across all the processors; in ScaLAPACK in the worst case scenario the size of this workspace can be the size of the whole matrix. Obviously for the very large matrices that one wishes to diagonalize on high-end systems this is highly undesirable, and may even be impossible.

Further many algorithms can easily generate a good guess at the eigenvectors, and it would be very useful if this guess could be used to speed up the diagonalization. Unfortunately this is difficult within the Householder methodology.

An alternative algorithm is Jacobi's. Though slower in serial, typically by a factor of three to five, it has a number of attractive properties for practical parallel codes; the scaling is good (see below), the memory requirement scales strictly with the inverse of the number of processors and is known a priori, and guesses at the eigenvectors, provided they are mutually orthogonal, can be used very effectively to speed up the diagonalization. At DL I have written a package, named BFG, that implements a version of this algorithm as described by Littlefield and Maschoff [2]. This is a rewrite of a previous version which has been used in both the CRYSTAL and VASP packages. The new version offers extended functionality, better numerical stability, increased single processor performance and better scaling. It can diagonalize both real symmetric and Hermitian matrices, and has interfaces for matrices distributed either in a block-cyclic (like ScaLAPACK) fashion, or by columns (PeIGS). It is written in standard conforming Fortran and uses MPI for message passing, and so it is portable to a wide range of parallel machines. Further extensive use of MPI communicators make it flexible enough that matrices may be diagonalized across just a subset of the processors, thus allowing a number of diagonalizations to be carried out at once. It is now being used in CRYSTAL, and has been distributed to the authors of Siesta.

Table 1 shows the speed-up of the new package for a dense real symmetric matrix on a Cray T3E-1200 and an SGI Origin 3000. The matrix is of order 2000 and was generated by using random numbers, symmetrizing it and then making it diagonally dominant by dividing the off-diagonal terms by 100. This last step was taken because matrices used in real applications are often diagonally dominant by roughly this factor. The speed up on turing can be seen to be excellent, but the situation is not so clear for green. The large super-linear effect is almost certainly due to the large 8 MB secondary cache on that machine. The matrix occupies 32 MB, so at less than four processors most of the memory accesses must come from main memory, which is slow, while at four processors or more all the data can be held in secondary cache. Hence the marked super-linear behaviour for processor counts of four or more.

Number of
Processors
Cray T3E
speed-up
SGI Origin 3000
speed-up
1   1.00
2   1.97
4 4.00 6.02
8 8.03 13.60
16 15.43 30.93
32 30.60 58.32
64 57.11 100.00
128 100.62 141.60

Table 1: Speed up for an order 2000 matrix on a Cray T3E and an SGI Origin 3000

Table 2 shows the speed up for an order 2000 Hermitian matrix on an SGI Origin 3000. It has been generated in a similar manner to the real symmetric one. Again there are super-linear effects, but the larger size of the matrix results in them being less pronounced. Again the speed up can be seen to be good.

Number of
Processors
SGI Origin 3000
speed-up
1 1.00
2 1.57
4 4.04
8 9.63
16 21.04
32 38.65
64 68.79
128 105.11

Table 2: Speed up for an order 2000 Hermitian matrix on an SGI Origin 3000

There is an outstanding question: Though the scaling of this algorithm is good, is it good enough to overcome the poor serial speed of Jacobi's algorithm relative to Householder's? This is a difficult question to answer, as it will vary from case to case, though some investigation has been carried out at least for CRYSTAL. For this code it is felt that this method is superior because the memory usage on high processor counts is typically much smaller than for the Householder methods so allowing larger calculations to be performed.

References

[1] Wilkinson and Reinsch, Linear Algebra, Vol. 2 of Handbook for Automatic Computation (1971)

[2] Littlefield and Maschoff, Theor. Chim. Acta. 84 457 (1993)

 
 
For more information about the Advanced Research Computing Group please contact Dr Mike Ashworth.
 
back to top
 
 ARC Quick links
Link ARC Home Page
Applications:
Link Castep
Link DL-POLY
Link FLITE3D
Link PDNS3D
Link POLCOMS
Link PRMAT
Link SIC-LMTO
Link THOR
Algorithms:
Link BFG
Link CLIPS
Link FFT
Link Eigensolvers
Benchmarking:
Link NWChem
Link JASPA
Link OCCOMM
Link DL-POLY
Languages:
Link Fortran 90
Link Inter-comparison
Link PGAS Languages
Link HPCS Languages
Tools etc.:
Link Vampir
Link Toolkits
Link QA software
Link GUI
People:
Link Mike Ashworth
Link Rob Allan
Link Stephen Pickles
Link Martin Plummer
Link Andrew Porter
Link Andrew Sunderland
Link Ilian Todorov
Past projects:
Link UKHEC Home Page