SIC-LMTO - Benchmarking an Electronic Structure Code
John Ashby, STFC Rutherford Appleton Laboratory
We have investigated the performance, across a wide range
of parallel systems, of the SIC-LMTO code,
a first-principles electronic band structure code from the Band Theory
Group at Daresbury Laboratory.
Many of the interesting physical properties of materials
are governed by the arrangement of electrons in the atoms of which they
are composed. In crystalline materials these can include the
crystalline structure, magnetic and chemical properties, electrical and
thermal conductivity and many others. Understanding the interplay
between electronic structure and physical properties informs the search
for better and more useful materials. The computational solution of the
underlying equations, the many-body Schrödinger equation for the
electrons, is formidable, but computational and theoretical advances
have made feasible the use of better and better approximations.

Figure 1: Performance of the SIC-LMTO NiFe2O4inv benchmark
on HPCx, the SGI Altix, the Cray XD1 and SCARF (AMD Opteron cluster) systems
The SIC-LMTO code of Temmerman and Szotek performs a
self-consistent, spin-polarised calculation of the electronic band
structure of a crystalline material. It uses the linear muffin-tin
orbitals approach with a self-interaction correction and is written
mostly in Fortran 95, although some legacy code is still in
Fortran 77. We had datasets available for a small problem (silver) and a
large problem, the magnetic half metal NiFe2O4 in an inverse spinel
structure. In this latter case the program treats 26 “atoms” (2 types
of Fe, 1 Ni, 2 O and 4 empty spheres) with 98 bands, leading to a
Hamiltonian matrix of dimension 234. This is diagonalised at 512 points
within the Brillouin zone.
The essence of the program is the solution of the
eigenproblem $H(k)\,\psi_k = E_k\,\psi_k$. The
Hamiltonian $H(k)$ depends on all the $\psi_k$ through the electron
density $n(r) = \sum_{\text{all occupied states}} |\psi_k(r)|^2$. Initially a guess is made at an
electron density, the eigenproblem is solved and a new electron density
generated. This is then fed back until self-consistency is reached.
Within this self-consistency loop each k-value can be solved for
independently. The program is parallelised by farming out the k-points
among the available processors and then performing a global broadcast
of the results so that each processor can then calculate the electron
density to use for the next iteration.
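
The loop described above can be sketched in a few lines of Python. This is a toy illustration only, not the SIC-LMTO implementation: the function names, the round-robin distribution of k-points, the one-band model energy and the linear mixing of densities are all assumptions made for the example, and the "processors" are simulated serially rather than with MPI.

```python
import math


def partition_kpoints(n_kpoints, n_procs):
    """Assign k-point indices to processor ranks round-robin (assumed scheme)."""
    return [list(range(rank, n_kpoints, n_procs)) for rank in range(n_procs)]


def solve_kpoint(k, n_kpoints, density):
    """Toy stand-in for diagonalising H(k) at one k-point.

    Returns this k-point's contribution to the new density. The model
    energy depends on the current density, mimicking the way H(k)
    depends on n(r).
    """
    energy = math.cos(2.0 * math.pi * k / n_kpoints) + density
    occupation = 1.0 / (1.0 + math.exp(energy / 0.1))  # Fermi-like filling
    return occupation / n_kpoints


def run_scf(n_kpoints=512, n_procs=8, mixing=0.3, tol=1e-10, max_iter=500):
    """Iterate the density to self-consistency; return (density, iterations)."""
    shares = partition_kpoints(n_kpoints, n_procs)
    density = 0.5  # initial guess
    for iteration in range(1, max_iter + 1):
        # Each simulated processor handles its own k-points independently.
        partial_sums = [
            sum(solve_kpoint(k, n_kpoints, density) for k in share)
            for share in shares
        ]
        # Stand-in for the global communication step: combine all results.
        new_density = sum(partial_sums)
        if abs(new_density - density) < tol:
            return new_density, iteration
        # Feed a mixture of old and new densities into the next iteration.
        density = (1.0 - mixing) * density + mixing * new_density
    raise RuntimeError("self-consistency not reached")
```

Calling `run_scf(n_procs=1)` and `run_scf(n_procs=8)` should give the same density to within rounding, since the k-points are independent inside each iteration and only the final combination step couples them.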
Figure 1 shows performance results for the
NiFe2O4inv benchmark data case on several
systems: HPCx (an IBM P690 Regatta system), an SGI Altix, and two
similar AMD Opteron clusters, namely a Cray XD1 and SCARF, a cluster
supplied by Streamline that uses Myrinet interconnect technology. The
Altix, XD1 and SCARF have similar scaling behaviour at low processor
numbers (the poor performance of the XD1 on 48 processors is anomalous
but repeatable). In contrast, HPCx displays poor scaling, and its
performance deteriorates at processor numbers above 128. At best, going
from 32 to 128 processors, a factor of 4, only doubles the speed. This
can be traced
back to the communication strategy employed. The global sums over
k-points are performed in at best O(N ln N) messages. The message size is
O(1/N) (the 512 k-points are divided up between the N processors), so
even if there were no start-up costs for a message, the communication
cost, O(N ln N) messages each of size O(1/N), would grow as ln N. The
computational cost decreases as O(1/N), and eventually the increase in
the communication cost overtakes the decrease in the computation. At
128 processors each processor is dealing with only 4 k-points before
the communication phase.
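
The scaling argument above can be made concrete with a simple cost model, T(N) = a/N + b ln N, whose minimum lies at N = a/b (set dT/dN = -a/N^2 + b/N to zero). The sketch below is illustrative only: the coefficients `compute_cost` (a) and `comm_cost` (b) are hypothetical parameters, and the values used in the usage note are arbitrary rather than fitted to the measurements in Figure 1.

```python
import math


def kpoints_per_processor(n_kpoints, n_procs):
    """Number of k-points each processor handles when they divide evenly."""
    return n_kpoints // n_procs


def model_time(n_procs, compute_cost, comm_cost):
    """Model run time: computation falls as 1/N, communication grows as ln N."""
    return compute_cost / n_procs + comm_cost * math.log(n_procs)


def optimal_processor_count(compute_cost, comm_cost):
    """Processor count minimising model_time, from dT/dN = 0."""
    return compute_cost / comm_cost
```

For example, with the arbitrary choice `compute_cost=100.0` and `comm_cost=1.0` the model time is smallest at 100 processors and rises again beyond that, which is the qualitative behaviour described for HPCx. `kpoints_per_processor(512, 128)` reproduces the figure of 4 k-points per processor quoted above.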
The global communication strategy exacerbates the impact of load
imbalance. The global sums have an implicit synchronisation point,
since they require all data to be available. Thus if one processor
takes longer than the others, the whole program is forced to go at the
speed of the slowest. We show this happening in Figure 2. Here we have
used Vampir to produce a plot of the time spent by SIC-LMTO in one of
its major routines and its subsidiaries, shown in grey, and in MPI
calls, shown in red. It is clear that some processes are spending 50%
more time in MPI calls than others, not because they are sending more
information but because they are waiting idle for the computationally
slower processes. The computational load imbalance is shown by the grey
histogram, and it is noticeable that this almost exactly mirrors the
MPI imbalance. The double structure is an artefact of the use of two
frames, or LPARs, in this 64-node run: the same load imbalance is
repeated on each LPAR.
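
The mirror-image relationship between compute time and MPI time follows directly from the synchronisation: every process waits until the slowest one has finished. A minimal sketch, assuming the time in MPI calls is dominated by this waiting and using made-up timings rather than the measured ones:

```python
def idle_times(compute_times):
    """Time each process waits at the synchronisation point.

    Every process must wait for the slowest one, so its idle time is the
    gap between its own compute time and the maximum.
    """
    slowest = max(compute_times)
    return [slowest - t for t in compute_times]


def parallel_efficiency(compute_times):
    """Fraction of the total processor time spent on useful computation."""
    return sum(compute_times) / (len(compute_times) * max(compute_times))
```

With the hypothetical per-process compute times `[10.0, 12.0, 15.0, 11.0]`, `idle_times` returns `[5.0, 3.0, 0.0, 4.0]`: the process that computes least waits longest, and the efficiency is 48/60 = 0.8.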

Figure 2: Vampir plots of time SIC-LMTO spent in subroutine bands (grey)
and in MPI calls (red).