FLITE3D - Benchmarking an Unstructured Multigrid CFD Code
John Ashby, STFC Rutherford Appleton Laboratory
We have investigated the performance of the FLITE3D code across a wide range
of parallel systems. FLITE3D is a Navier-Stokes solver from the Sowerby
Research Centre of British Aerospace and the Computational Engineering
Group at Daresbury Laboratory.
Aeronautical design has benefited greatly from the use
of computational modelling approaches over the years. Starting with
simple wing shape optimisation for modest flow speeds, improvements in
algorithms and computers have made it possible to study wing-body
interactions and full aircraft at sub-, trans- and super-sonic speeds.

Figure 1: Performance of the FLITE3D wing-body benchmark on HPCx, the
SGI Origin3000, the Cray XD1 and SCARF (AMD Opteron cluster) systems
FLITE3D is a three-dimensional CFD code which solves
the Navier-Stokes equations using an unstructured multigrid method. The
space around an object in which fluid flows is divided into a set of
tetrahedral cells. These cells and the points which make up their
vertices form a mesh, and the continuous equations are transformed to a
discrete form on this mesh so that the pressure and velocity of the
fluid are found at each of the points. The Navier-Stokes equations
define the density, pressure and velocity of the fluid. When
discretised and linearised, these equations generate a large sparse
linear system in the variables $U_i(r_j)$ and $P(r_j)$, the velocity
components and pressure at each mesh point $r_j$, noting that the
pressure implies the density through the equation of state.
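To illustrate why the linear system is sparse, the following minimal sketch builds the point-to-point coupling pattern of a tiny tetrahedral mesh. It assumes a vertex-based discretisation in which a point is coupled only to points that share a cell edge with it; the source does not state FLITE3D's actual stencil, and the function name `point_adjacency` and the point indices are hypothetical.

```python
from itertools import combinations

def point_adjacency(tetrahedra):
    """Map each point index to the set of points sharing a cell edge with it."""
    neighbours = {}
    for cell in tetrahedra:
        # Every pair of vertices of a tetrahedron is joined by an edge.
        for a, b in combinations(cell, 2):
            neighbours.setdefault(a, set()).add(b)
            neighbours.setdefault(b, set()).add(a)
    return neighbours

# Two tetrahedra sharing the face (1, 2, 3); indices are illustrative only.
cells = [(0, 1, 2, 3), (1, 2, 3, 4)]
adjacency = point_adjacency(cells)

# Assumed stencil: one diagonal entry plus one entry per neighbouring point.
nonzeros_per_row = {p: len(n) + 1 for p, n in sorted(adjacency.items())}
print(nonzeros_per_row)
```

Points 0 and 4 never appear in the same cell, so under this assumption the matrix entry coupling them is zero. On a large mesh almost every pair of points is uncoupled in this way, which is what makes the system sparse.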
In the multigrid method several different grids of
varying degrees of fineness are overlaid. The problem is solved
approximately on one grid, then transferred to the next grid by a
process known as prolongation (moving from coarse to fine) or
restriction (moving from fine to coarse). The problem is solved at each
mesh level; on the coarse grids the solve yields corrections to the
solution on the finer grid. There are many different approaches to
moving between grids; FLITE3D uses the V-cycle, in which the grid
levels are fully traversed from fine to coarse and then back again.
Parallelisation is achieved by domain
decomposition. The meshes are divided into as many sections as there
are processors available and the discrete problem is solved on each
individual section or partition. Then the values at the interface
between partitions are exchanged and provide new boundary conditions
for the next stage of the solution process. This requires inter-process
communication where each partition sends its boundary values to each of
its neighbouring partitions and receives boundary values from each of
them.
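The boundary exchange described above can be sketched on a one-dimensional model problem. This is an illustration under stated assumptions, not FLITE3D's implementation: the partitions are plain Python lists inside one process, the solver is a Jacobi sweep for the 1D Laplace equation, and the names `split`, `exchange_halos` and `jacobi_sweep` are hypothetical. In a parallel code the copy performed by `exchange_halos` would be a pair of messages between neighbouring processes.

```python
def jacobi_sweep(u):
    """One Jacobi sweep for the 1D Laplace equation; the two end values are held fixed."""
    inner = [0.5 * (u[i - 1] + u[i + 1]) for i in range(1, len(u) - 1)]
    return [u[0]] + inner + [u[-1]]

def split(u, n_parts):
    """Split the interior points into contiguous partitions, each padded with two halo cells."""
    interior = u[1:-1]
    size = len(interior) // n_parts
    parts = []
    for p in range(n_parts):
        start = p * size
        stop = len(interior) if p == n_parts - 1 else start + size
        # u[start] and u[stop + 1] are the values just outside the owned range.
        parts.append([u[start]] + interior[start:stop] + [u[stop + 1]])
    return parts

def exchange_halos(parts):
    """Copy each partition's outermost owned value into its neighbour's halo cell."""
    for left, right in zip(parts, parts[1:]):
        left[-1], right[0] = right[1], left[-2]

# Model problem: 12 points, boundary values 1.0 and 0.0, zero initial guess.
u = [1.0] + [0.0] * 10 + [0.0]
parts = split(u, 3)

for _ in range(50):
    u = jacobi_sweep(u)                       # single-domain reference
    parts = [jacobi_sweep(p) for p in parts]  # independent solve on each partition
    exchange_halos(parts)                     # new boundary conditions for the next sweep

gathered = [parts[0][0]] + [v for p in parts for v in p[1:-1]] + [parts[-1][-1]]
max_difference = max(abs(a - b) for a, b in zip(gathered, u))
```

Because the halo cells are refreshed after every sweep, the partitioned iteration reproduces the single-domain iteration. The cost is one exchange per sweep per partition interface, which is the communication whose scaling the benchmarks measure.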
Two datasets were available to us for FLITE3D: a wing-body assembly
whose mesh consisted of 51737 points and 302079 tetrahedra, and an
F18 meshed using 585792 points and 3663559 tetrahedra. The former is of
modest size while the latter is large and imposes quite high memory
requirements. In both cases 1000 V-cycles were carried out to give runs
of a reasonable length of time and to average out any stochastic
behaviour of the communications.
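For readers unfamiliar with what a single V-cycle involves, the sketch below applies one to a one-dimensional Poisson problem, -u'' = f with zero boundary values. It is a minimal illustration under assumptions that differ from FLITE3D: the grids are structured and nested rather than unstructured tetrahedral meshes, the equation is linear, and the weighted-Jacobi smoother, full-weighting restriction and linear-interpolation prolongation are choices made here for simplicity, not taken from the source.

```python
def smooth(u, f, h, sweeps=3, omega=2.0 / 3.0):
    """Weighted Jacobi sweeps for -u'' = f; end values are held fixed."""
    for _ in range(sweeps):
        new = u[:]
        for i in range(1, len(u) - 1):
            jacobi = 0.5 * (u[i - 1] + u[i + 1] + h * h * f[i])
            new[i] = (1.0 - omega) * u[i] + omega * jacobi
        u = new
    return u

def residual(u, f, h):
    """Residual f - A u of the standard three-point discretisation."""
    r = [0.0] * len(u)
    for i in range(1, len(u) - 1):
        r[i] = f[i] - (2.0 * u[i] - u[i - 1] - u[i + 1]) / (h * h)
    return r

def restrict(fine):
    """Full weighting: transfer from a fine grid to the next coarser grid."""
    n_coarse = (len(fine) - 1) // 2
    coarse = [0.0] * (n_coarse + 1)
    for j in range(1, n_coarse):
        coarse[j] = 0.25 * fine[2 * j - 1] + 0.5 * fine[2 * j] + 0.25 * fine[2 * j + 1]
    return coarse

def prolong(coarse):
    """Linear interpolation: transfer from a coarse grid to the next finer grid."""
    fine = [0.0] * (2 * (len(coarse) - 1) + 1)
    for j, value in enumerate(coarse):
        fine[2 * j] = value
    for i in range(1, len(fine), 2):
        fine[i] = 0.5 * (fine[i - 1] + fine[i + 1])
    return fine

def v_cycle(u, f, h):
    """One V-cycle: descend from fine to coarse, then return to the fine grid."""
    if len(u) <= 3:
        # Coarsest grid has a single interior point, so solve it exactly.
        u = u[:]
        u[1] = 0.5 * (u[0] + u[2] + h * h * f[1])
        return u
    u = smooth(u, f, h)                                 # pre-smoothing
    coarse_rhs = restrict(residual(u, f, h))            # restriction
    correction = v_cycle([0.0] * len(coarse_rhs), coarse_rhs, 2.0 * h)
    u = [a + b for a, b in zip(u, prolong(correction))] # prolongation
    return smooth(u, f, h)                              # post-smoothing

# Test problem: f = 1, whose exact solution is u(x) = x (1 - x) / 2.
n = 64
h = 1.0 / n
f = [1.0] * (n + 1)
u = [0.0] * (n + 1)
for _ in range(10):
    u = v_cycle(u, f, h)
max_error = max(abs(u[i] - 0.5 * (i * h) * (1.0 - i * h)) for i in range(n + 1))
```

Each coarse-grid solve here computes a correction to the finer-grid solution, matching the description of the method given earlier. A production run such as the 1000 V-cycles used in the benchmarks repeats this traversal on every partition, with boundary exchanges at each grid level.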
Figure 1 shows performance results for the
wing-body benchmark on several systems: HPCx (an IBM p690 Regatta
system), an SGI Origin 3000, and two similar AMD Opteron clusters,
namely a Cray XD1 and SCARF, a cluster supplied by Streamline that
uses Myrinet interconnect technology.
The size of the last two machines limited experiments to
fewer than 32 processors. All machines demonstrate good scaling up to
32 processors, but above this the two larger machines drop away, with
HPCx falling faster than the Origin. Investigations with Vampir show
that at 32 processors the communications become saturated and many
processors spend significant time idle waiting for messages. The
configuration of HPCx with LPARs of 32 processors leads to different
communication profiles between and within LPARs, whereas the Origin3000
has a uniform profile. This is thought to be the reason the Origin
performs relatively better for high numbers of processors than HPCx.

Figure 2: Performance of the FLITE3D F18 benchmark on HPCx, the
SGI Origin3000, the Cray XD1 and SCARF (AMD Opteron cluster) systems
In Figure 2 we show the performance results for the
larger F18 benchmark. The relative performance of the four machines
is similar to that in the smaller benchmark case, although the XD1 and
SCARF do better. The speed-up on the Origin is not particularly
good, but there is no sign of the turnover seen in the wing-body
benchmark; this is due to the increased computational load.
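The difference in computational load can be made concrete with a little arithmetic on the mesh sizes quoted earlier. The sketch below assumes perfectly balanced partitions and that the work per partition scales with its number of points; both are simplifications, and the helper name `points_per_partition` is hypothetical.

```python
def points_per_partition(total_points, processors):
    """Mesh points owned by each processor, assuming perfectly balanced partitions."""
    return total_points // processors

# Mesh sizes quoted for the two benchmark datasets.
mesh_points = {"wing-body": 51737, "F18": 585792}

at_32 = {name: points_per_partition(n, 32) for name, n in mesh_points.items()}
load_ratio = mesh_points["F18"] / mesh_points["wing-body"]

print(at_32)                  # points per processor at 32 processors
print(round(load_ratio, 1))   # F18 load relative to wing-body
```

At 32 processors each partition of the wing-body mesh holds about 1600 points, while each F18 partition holds about 18300, roughly eleven times as many. Under these assumptions each processor in the F18 case has far more computation to do between boundary exchanges, which is consistent with communication becoming the limiting factor later for the larger mesh.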