Parallelisation

Parallelization

WHY PARALLELIZE ?

To ameliorate the execution time of program, there are many solutions :

optimization of the program

algorithm modification

parallelization of the calculation.

Parallelisation allows us to distribute the computation over many processors, although communication problems appear. It's like "human resource management".

Because of the way the code is organized, the parallelization is a space parallelization instead of a time parallelization (we discuss this decision later). Nicholas Kevlahan put forward the idea that the Fourier transform takes the most calculation time. That's why he decides to use the FFTW parallelized library instead of the serial Singleton library. To manage the communication between processors we used MPI.

FFTW

You can see the documentation of FFTW on http://www.fftw.org. This documentation is useful for understanding why we should use this library. But it's not as easy as it seems on SHARCNET machine Idra : implementing FFTW took most of our time !

We should use the real to complex FFTW routine to save storage. Indeed, since the variables like velocity are real functions, their Fourier transform is a complex array where only half dimension is needed (the rest is obtained by the conjugate).

We tried using the serial FFTW library for the computation of simple derivatives of periodic function like sin(x). After much efforts, we were finally able to calculate the derivatives of :

1D real periodic functions
1D complex periodic functions
2D real periodic functions
2D complex periodic functions
3D complex periodic functions

Then we tried with the FFTW MPI wrapper for Fortran 90 only in 3D. With complex functions, we had easily obtained good results ; but with real functions, there were error messages like "Segmentation Fault", "Floating point exception", or we had incorrect results.

Thanks to James Wadsley, who found that MPI FFTW is just experimental function and that doesn't work correctly with the Fortran wrapper. That's why we are using now a C wrapper which is linked to the F90 code.

We propose to see our test programs on FFTW : Workshop

HOW TO PARALLELIZE ?

       The FFTW documentation suggests to cut the space according to the z direction what we did. You can see it on this drawing.
     On each processor, you have a few X-Y planes where you can apply the same resolution method.
    Moreover, after studding Krylov time scheme for the Navier-Stokes equation, it seems that all operations are local calculations except the values of dot product, drag and lift. So we just exchanged these values between processors. Each processor generates its data files output separately.

    To obtain the local index in the z direction, we used the FFTW function "get_local_size". So we just allocated the local real array like Array (X, Y, local Z start : local Z start + number of plan X-Y on each processor -1). The allocation of the complex (result of the Fourier transform) array is Array (X/2, Y, local Z start : local Z start + number of plan X-Y on each processor -1) so we saved half storage space.
    We used the MPI function : MPI_REDUCE (or ALLREDUCE) with the flag MPI_SUM to calculate dot product, drag and lift, and MPI_GATHER to send the parameters to all the processors. That's all for MPI
    As we said before, the major problem or difficulty of this project is to calculate the Fourier transform with FFTW. The Krylov method isn't easy to understand but, you can easily see what all operations are locals i.e. the calculations depends only local dimension and the dimension of Krylov subspace.

To sum up, the code was changed gradually from a code which used Singleton to calculate the Fourier transform to FFTW with MPI permitting to save time and to increase the maximum resolution of the calculation.

Abstract

Result of Parallelization