Parallelization
WHY PARALLELIZE ?
To ameliorate the execution time of program, there are many solutions :
Because of the way the code is organized,
the parallelization is a space parallelization instead of a time
parallelization (we discuss this decision later). Nicholas Kevlahan put
forward the idea that the Fourier transform takes the most calculation
time. That's why he decides to use the FFTW parallelized library instead
of the serial Singleton
library. To manage the communication between processors we used MPI.
FFTW
You can see the documentation of FFTW on http://www.fftw.org. This documentation is useful for understanding why we should use this library. But it's not as easy as it seems on SHARCNET machine Idra : implementing FFTW took most of our time !
We should use the real to complex FFTW routine to save storage. Indeed, since the variables like velocity are real functions, their Fourier transform is a complex array where only half dimension is needed (the rest is obtained by the conjugate).
We tried using the serial FFTW library for the computation of simple derivatives of periodic function like sin(x). After much efforts, we were finally able to calculate the derivatives of :
Then we tried with the FFTW
MPI wrapper for Fortran 90 only in 3D. With complex functions, we
had easily obtained good results ; but with real functions, there
were error messages like "Segmentation Fault", "Floating point exception",
or we had incorrect results.
Thanks to James Wadsley, who found that MPI FFTW is just experimental function and that doesn't work correctly with the Fortran wrapper. That's why we are using now a C wrapper which is linked to the F90 code.
We propose to see our test programs
on FFTW : Workshop
HOW TO PARALLELIZE ?
|
The FFTW documentation
suggests to cut the space according to the z direction what we did. You
can see it on this drawing.
On each processor, you have a few X-Y planes where you can apply the same resolution method. Moreover, after studding Krylov time scheme for the Navier-Stokes equation, it seems that all operations are local calculations except the values of dot product, drag and lift. So we just exchanged these values between processors. Each processor generates its data files output separately. |
To obtain the local index in the z
direction, we used the FFTW function "get_local_size". So we just allocated
the local real array like Array (X, Y, local Z start : local
Z start + number of plan X-Y on each processor -1). The allocation
of the complex (result of the Fourier transform) array is Array (X/2,
Y, local Z start : local Z start + number of plan X-Y on each processor
-1) so we saved half storage space.
We used the MPI function : MPI_REDUCE
(or ALLREDUCE) with the flag MPI_SUM to calculate dot product, drag and
lift, and MPI_GATHER to send the parameters to all the processors. That's
all for MPI
As we said before, the major problem
or difficulty of this project is to calculate the Fourier transform with
FFTW. The Krylov method isn't easy to understand but, you can easily see
what all operations are locals i.e. the calculations depends only local
dimension and the dimension of Krylov subspace.
To sum up, the code was changed gradually
from a code which used Singleton to calculate the Fourier transform to
FFTW with MPI permitting to save time and to increase the maximum resolution
of the calculation.
Abstract | Result of Parallelization |