inputs/dev/openMP.tex

%%%%%%%%%%%%%%%%%%%
% XLiFE++ is an extended library of finite elements written in C++
%     Copyright (C) 2014  Lunéville, Eric; Kielbasiewicz, Nicolas; Lafranche, Yvon; Nguyen, Manh-Ha; Chambeyron, Colin
%
%     This program is free software: you can redistribute it and/or modify
%     it under the terms of the GNU General Public License as published by
%     the Free Software Foundation, either version 3 of the License, or
%     (at your option) any later version.
%     This program is distributed in the hope that it will be useful,
%     but WITHOUT ANY WARRANTY; without even the implied warranty of
%     MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
%     GNU General Public License for more details.
%     You should have received a copy of the GNU General Public License
%     along with this program.  If not, see <http://www.gnu.org/licenses/>.
%%%%%%%%%%%%%%%%%%%

\section{Sparse matrix-vector multiplication}

In a Finite Element package, after repeatedly setting up the stiffness matrices, it needs to a solve linear equation system, which can be done either in two ways (1) by direct methods or (2) iterative methods. In direct methods, the factorization of matrix costs nearly most of the runtime, meanwhile, in iterative methods, matrix-vector multiplication plays an important role.
Since the matrices arising from the discretisation are sparse, an appropriate matrix storage format, sparse format is used. \xlifepp provides a plenty types of sparse matrix, which can serve for different purposes. Although all the sparse matrix can be taken for parallelized sparse-matrix-vector-multiplication (SpMV); among them, CSR (compressed storage row) seems to be the best candidate. \\
The form of SpMV is $y=Ax$, where $A$ is a sparse matrix, $x$ and $y$ are dense vectors. $x$ is source vector and $y$ is destination vector.\\
In the following, the algorithms of parallelization of matrix-vector multiplication for CSR and CSC (compressed storage column) are mentioned, on which parallelized matrix-vector multiplication of other sparse format (CSDual, CSSym, SkylineDual, SkylineSym) are based.

\subsection {SpMV of CSR}

Because CSR is a well-known format, only a small recall about its structure is presented. \\
A matrix $A$ is an $m\times n$ matrix, and the number of non-zero elements is {\itshape nz}, CSR format needs to store three arrays:
\begin{itemize}
\item values[nz] store the value of each non-zero element in matrix A
\item colIndex[nz] stores the column index of each element in {\itshape val[nz]} array
\item rowPointer[m+1] stores the index of the first non-zero element of each row and {\itshape rowPointer[m] = nz}
\end{itemize}
For instance, the following $5\times 6$ matrix
\[
A=\left[
\begin{array}{cccccc}
11 & 12 & 0 & 14 & 0 & 16\\
0 & 22 & 23 & 0 & 0 & 26 \\
31 & 32 & 33 & 0 & 0 & 0 \\
0 & 0 & 43 & 44 & 0 & 0 \\
51 & 0 & 53 & 54 & 55 & 0
\end{array}
\right]
\]
is stored as follow in \xlifepp by the three following {\itshape std::vector}: \\
\mbox{}\hspace{10 mm}values     = ( 0 11 12 14 16\ \ 22 23 26\ \ 31 32 33\ \ 43 44\ \ 51 53 54 55 ) \\
\mbox{}\hspace{10 mm}colIndex   = ( 0 1 3 5\ \ 1 2 5\ \ 0 1 2\ \ 2 3\ \ 0 2 3 4 )\\
\mbox{}\hspace{10 mm}rowPointer = ( 0 4 7 10 12 16) \\
In \xlifepp, different from "standard", the array {\itshape values} of CSR format contains one more element $0$ at the first position. \\
Since it must store the location information explicitly as well as the value of each non-zero element, extra communication time is need to access these location data.
As a matrix of CSR format is store row by row, the simplest way to parallelize the matrix-vector multiplication using OpenMP is to assign each thread of parallel region to work with one row. However, considering the problem of load balancing, scheduling overhead and synchronization overheads, it's bettr to apply the {\itshape row partitioning} scheme. The matrix will be partitioned into blocks of row by the number of threads.
\vspace{.1cm}
\begin{lstlisting}[]
// Assign each thread an approximately equal number of non-zero
Number nnzEachThread = std::floor(nnz/numThread);

// Index of non-zero in each range for each thread
Number nnzRangeIdx = nnzEachThread;

// Lower and upper bound position for each thread
std::vector<std::vector<Number>::const_iterator> itThreadLower(numThread);
std::vector<std::vector<Number>::const_iterator> itThreadUpper(numThread);

itp = itpb;
itThreadLower[0] = itpb;
for (int i = 0; i < numThread; i++) {
  itThreadLower[i] = itp;
  nnzRangeIdx = *itp + nnzEachThread;
  itpLower = std::lower_bound(itp, itpe, nnzRangeIdx);
  itpUpper = std::upper_bound(itpLower, itpe,nnzRangeIdx);
  itp = ((nnzRangeIdx - *(itpUpper -1))< (*itpUpper-nnzRangeIdx)) ? itpUpper-1 : itpUpper;
  itThreadUpper[i] = itp;
}
itThreadUpper[numThread-1] = itpe;
\end{lstlisting}
\vspace{.1cm}
Each block is assigned to a thread and has an approximately equal number of non-zero elements. By partitioning this way, each thread operates on its own part of the {\itshape values, colIndex and rowPointer}. All threads access element of the source vector $x$. Since accesses on $x$ are read-only, there is no invalidation traffic. One advantage of {\itshape row partitioning} is that each thread works on its own part of destination vector $y$, therefore, there is no  need of synchronization among threads. \\
\vspace{.1cm}
\begin{lstlisting}[]
#pragma omp parallel for default(none)\
                         private(i, itr, iti, itie, itim, itpLower, itpUpper) \
                         shared(numThread, itbLower, itbUpper, itpb, itib, itm, itvb, itrb) \
                         schedule(dynamic,1)
for(i = 0; i < numThread; i++) {
  itpLower = *(itbLower+i);
  itpUpper = *(itbUpper+i);
  for (;itpLower != itpUpper;itpLower++) {
    itr = itrb + (itpLower - itpb);
    *itr *=0;
    iti  = itib + *(itpLower);
    itie = itib + *(itpLower + 1);
    itim = itm + *(itpLower);

    while(iti != itie) {
      *itr += *(itim) * *(itvb + *iti);iti++; itim++;
    }
  }
}
\end{lstlisting}
\vspace{.1cm}
However,  the coarse-grained approach sometimes can not assure the load balancing of SpMV of a very large matrix (e.g with million non-zero elements), and a {\bfseries granularity factor} is a solution to make it more fine-grained. Each block of row of a thread is divided into smaller pieces, which not only make SpMv more fine-grained but also allow a thread to take work from each other.
\vspace{.1cm}
\begin{lstlisting}[]
const Number GRANULARITY = 16;
numThread *= GRANULARITY;
\end{lstlisting}
\vspace{.1cm}

\vspace{.1cm}
\displayInfos{library=largeMatrix, header=RowCsStorage.hpp, implementation=RowCsStorage.cpp, test={test\_OpenMP.cpp}, header dep={CsStorage.hpp, config.h, utils.h}}


\subsection {SpMV of CSC}

A matrix $A$ is an $m\times n$ matrix, and the number of non-zero elements is {\itshape nz}, CSC format needs to store three arrays:
\begin{itemize}
\item values[nz] store the value of each non-zero element in matrix A
\item rowIndex[nz] stores the row index of each element in {\itshape val[nz]} array
\item colPointer[m+1] stores the index of the first non-zero element of each column and {\itshape colPointer[m] = nz}
\end{itemize}
For instance, the following $5\times 6$ matrix
\[
A=\left[
\begin{array}{cccccc}
11 & 12 & 0 & 14 & 0 & 16\\
0 & 22 & 23 & 0 & 0 & 26 \\
31 & 32 & 33 & 0 & 0 & 0 \\
0 & 0 & 43 & 44 & 0 & 0 \\
51 & 0 & 53 & 54 & 55 & 0
\end{array}
\right]
\]
is stored as follow in \xlifepp by the three following {\itshape std::vector}: \\
\mbox{}\hspace{10 mm}values     = ( 0 11 31 51 12  22 32 23 33 43 53 14 44 54 55 16 26 ) \\
\mbox{}\hspace{10 mm}rowIndex   = ( 0 2  4 0 1 2 1 2 3 4 0 3 4 4 0 1)\\
\mbox{}\hspace{10 mm}colPointer = ( 0 3 6 10 13 14 16) \\
In \xlifepp, different from "standard", the array {\itshape values} of CSC format contains one more element $0$ at the first position \\
Because the "column-like" storage of CSC is contrast to the natural manner of "row-processing" matrix-vector multiplication, the approach using OpenMP to parallelize matrix-vector multiplication of CSC is to limit the affect of this "column-like" characteristic. \\
Since a CSC matrix is stored in column by column, the simplest way to parallelize the matrix-vector multiplication using OpenMP is to use each thread to process a column, one by one; then the result of each thread is contributed to the destination vector by a "reduction-like" operation. Obviously, this approach is not efficient. Not only each thread processes non-adjacent column, which causes cache misses in cases of  many threads; but also the problem of load-balancing can happen because of the difference in the number of non-zero elements in each column. An improved approach is to divide the matrix into groups of column, each of which has an approximately equal number of non-zero elements. Because all the columns in each group are adjacent, it limits the effect of cache miss. Moreover, load-balancing hardly happens thanks to equal number of non-zero elements processed by each thread. \\
The following code implements the approach described above.
\vspace{.1cm}
\begin{lstlisting}[]
// Lower and upper bound position for each thread
std::vector<std::vector<Number>::const_iterator> itThreadLower(numThread);
std::vector<std::vector<Number>::const_iterator> itThreadUpper(numThread);

std::vector<Number>::const_iterator itpLower, itpUpper;
// Find the smallest and largest rowIndexs corresponding to each thread
std::vector<Number> sRowIdx(numThread);
std::vector<Number> lRowIdx(numThread);

itp = itpb;
itThreadLower[0] = itpb;
for (int i = 0; i < numThread; i++) {
  itThreadLower[i] = itp;
  nnzRangeIdx = *itp + nnzEachThread;
  itpLower = std::lower_bound(itp, itpe, nnzRangeIdx);
  itpUpper = std::upper_bound(itpLower, itpe,nnzRangeIdx);
  itp = ((nnzRangeIdx - *(itpUpper -1))< (*itpUpper-nnzRangeIdx)) ? itpUpper-1 : itpUpper;
  itThreadUpper[i] = itp;
}
itThreadUpper[numThread-1] = itpe;

std::vector<std::vector<Number>::const_iterator>::const_iterator itbLower, itbUpper;
itbLower = itThreadLower.begin();
itbUpper = itThreadUpper.begin();
\end{lstlisting}
\vspace{.1cm}
The serial codes above split a CSC matrix into blocks of column, each of which has nearly the same number of non-zero elements. After that, each of thread will process a block of column, store the result into its temporary vector and write this vector into the destination vector by a "reduction-like" operation. One remark is that temporary vector of each thread have necessarily the same size as the destination vector. It's is the trade-off of this method. \\
After finishing its own matrix-vector multiplication, each thread writes its result into the destination vector. Since array reduction isn't supported in OpenMP C++ (some projects have tried to make this feature available but it seems not to be ready in the near future!!). The only way, at the moment, is to use critical section. And we can easily realize that it is coarse-grained.
\vspace{.1cm}
\begin{lstlisting}[]
typedef typename IterationVectorTrait<ResIterator>::Type ResType;
#pragma omp parallel  \
                      private(i, tid, itr, iti, itie, itim, itpLower, itpUpper, itv) \
                      shared(numThread, itbLower, itbUpper, itpb, itib, itm, itvb, itrb, nRows)
{
    tid = omp_get_thread_num();
    std::vector<ResType> resTemp(nRows);
    typename std::vector<ResType>::iterator itbResTemp = resTemp.begin(), iteResTemp = resTemp.end(), itResTemp = itbResTemp;

    #pragma omp for nowait schedule(dynamic,1)
    for(i = 0; i < numThread; i++) {
         itpLower = *(itbLower+i);
         itpUpper = *(itbUpper+i);
         for (;itpLower != itpUpper;itpLower++) {
             itv = itvb + (itpLower - itpb);
             iti  = itib + *(itpLower);
             itie = itib + *(itpLower + 1);
             itim = itm + *(itpLower);

             while(iti != itie) {
                 itResTemp   = itbResTemp + (*iti);
                 *itResTemp += *itim * *(itv);
                 iti++; itim++;
             }
         }
    }
    #pragma omp critical (updateResult)
    {
        for (itResTemp = itbResTemp;itResTemp != iteResTemp; itResTemp++) {
            itr = (itrb + (itResTemp - itbResTemp));
            *itr += *itResTemp;
        }
    }
}
\end{lstlisting}
\vspace{.1cm}
\displayInfos{library=largeMatrix, header=ColCsStorage.hpp, implementation=ColCsStorage.cpp, test={test\_OpenMP.cpp}, header dep={CsStorage.hpp, config.h, utils.h}}

\section{Sparse matrix factorization}
As mentioned above, one way to solve a linear equation system is to use direct methods, which heavily depends on matrix factorization. Firstly, the costly runtime factorization is done, secondly, the factorized matrix is used to solve the linear equation system $Ax=b$. Till now, \xlifepp has supported some popular factorization methods: LU, LDLt and LDL*; all of them have a parallelized version with OpenMP. Because of factorization algorithm, only skyline storage is suitable to be factorized. \\
Because LDLt can be considered to be a special case of LU, the following only mentions the parallelized LU factorization. \\
The idea behind multi-threaded factorization is simple: Instead of implementing the factorization element by element as in the serial version, we make the algorithm work on block by block. For each iteration, block on diagonal is processed then block on the row and column corresponding to this diagonal block. \\
First of all, the diagonal block is calculated
\vspace{.1cm}
\begin{lstlisting}[]
#pragma omp single
      diagBlockSolverParallel(k*blockSize, blockSizeRow[k], it_rownx,
                              k*blockSize, blockSizeCol[k], it_colnx,
                              it_fd, it_fl, it_fu,
                              it_md, it_ml, it_mu);
\end{lstlisting}
\vspace{.1cm}
After that, the block of row and column can be computed in the same time
\vspace{.1cm}
\begin{lstlisting}[]
#pragma omp for nowait
      for (jj = k+1; jj < nbBlockCol; jj++) {
          #pragma omp task untied firstprivate(k, jj) \
                      shared(blockSize, blockSizeRow, blockSizeCol, it_rownx, it_colnx, it_fl, it_fu, it_mu)
          upperBlockSolverParallel(k*blockSize, blockSizeRow[k],it_rownx,
                              jj*blockSize, blockSizeCol[jj], it_colnx,
                              it_fl, it_fu, it_mu);
      }

#pragma omp for
      for (ii = k+1; ii < nbBlockRow; ii++) {
          #pragma omp task untied firstprivate(k, ii) \
                      shared(blockSize, blockSizeRow, blockSizeCol, it_rownx, it_colnx, it_fd, it_fl, it_fu, it_ml)
          lowerBlockSolverParallel(ii*blockSize, blockSizeRow[ii], it_rownx,
                              k*blockSize, blockSizeCol[k], it_colnx,
                              it_fd, it_fl, it_fu, it_ml);
      }
  }
\end{lstlisting}
\vspace{.1cm}
In these code, we use one of the latest feature of OpenMP, directive task, which has been only supported since OpenMP 3.0. By using this directive, we can take advantage of taskification of the code and prevent the potential of load balancing problem.  \\
One common problem on working with blocking-algorithm is the block size. It is impossible to find out a optimal value for the blockSize that is able to change the performance of program largely. After some experiments with typical sparse matrices of \xlifepp, one simple formula is provided to specify blockSize. \\
\vspace{.1cm}
\begin{lstlisting}[]
  const Number BLOCKFACTOR = 0.05*std::min(nbOfRows(),nbOfColumns());
  Number numBlockMin = BLOCKFACTOR;
  Number blockSize  = std::floor(diagonalSize()/numBlockMin);
  Number nbBlockRow = std::ceil(nbOfRows()/blockSize);
  Number nbBlockCol = std::ceil(nbOfColumns()/blockSize);
  Number numBlock = std::min(nbBlockRow, nbBlockCol);

  std::vector<Number> blockSizeRow(nbBlockRow, blockSize), blockSizeCol(nbBlockCol, blockSize);
  blockSizeRow[nbBlockRow-1] = nbRows_ - (nbBlockRow-1)*blockSize;
  blockSizeCol[nbBlockCol-1] = nbCols_ - (nbBlockCol-1)*blockSize;
\end{lstlisting}
\vspace{.1cm}
Again, this value is only "good" for a certain group of sparse matrices. It maybe needs changing in the future.