1%%%%%%%%%%%%%%%%%%%
2% XLiFE++ is an extended library of finite elements written in C++
3%     Copyright (C) 2014  Lunéville, Eric; Kielbasiewicz, Nicolas; Lafranche, Yvon; Nguyen, Manh-Ha; Chambeyron, Colin
4%
5%     This program is free software: you can redistribute it and/or modify
6%     it under the terms of the GNU General Public License as published by
7%     the Free Software Foundation, either version 3 of the License, or
8%     (at your option) any later version.
9%     This program is distributed in the hope that it will be useful,
10%     but WITHOUT ANY WARRANTY; without even the implied warranty of
11%     MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
12%     GNU General Public License for more details.
13%     You should have received a copy of the GNU General Public License
14%     along with this program.  If not, see <http://www.gnu.org/licenses/>.
15%%%%%%%%%%%%%%%%%%%
16
17\section{Sparse matrix-vector multiplication}
18
In a finite element package, after setting up the stiffness matrices, linear systems have to be solved repeatedly, which can be done in two ways: (1) by direct methods or (2) by iterative methods. In direct methods, the factorization of the matrix accounts for most of the runtime, whereas in iterative methods, matrix-vector multiplication plays the dominant role.
Since the matrices arising from the discretization are sparse, appropriate sparse storage formats are used. \xlifepp provides many types of sparse matrices, which serve different purposes. Although all of them can be used for parallelized sparse matrix-vector multiplication (SpMV), CSR (compressed storage row) seems to be the best candidate. \\
SpMV has the form $y=Ax$, where $A$ is a sparse matrix and $x$ and $y$ are dense vectors; $x$ is the source vector and $y$ the destination vector.\\
In the following, the parallelization algorithms of the matrix-vector multiplication for CSR and CSC (compressed storage column) are described; the parallelized matrix-vector multiplications of the other sparse formats (CSDual, CSSym, SkylineDual, SkylineSym) are based on them.
23
24\subsection {SpMV of CSR}
25
Because CSR is a well-known format, only a short reminder of its structure is given. \\
For an $m\times n$ matrix $A$ with {\itshape nz} non-zero elements, the CSR format stores three arrays:
28\begin{itemize}
\item values[nz] stores the value of each non-zero element of the matrix $A$
\item colIndex[nz] stores the column index of each element of the {\itshape values[nz]} array
\item rowPointer[m+1] stores the index of the first non-zero element of each row, with {\itshape rowPointer[m] = nz}
32\end{itemize}
33For instance, the following $5\times 6$ matrix
34\[
35A=\left[
36\begin{array}{cccccc}
3711 & 12 & 0 & 14 & 0 & 16\\
380 & 22 & 23 & 0 & 0 & 26 \\
3931 & 32 & 33 & 0 & 0 & 0 \\
400 & 0 & 43 & 44 & 0 & 0 \\
4151 & 0 & 53 & 54 & 55 & 0
42\end{array}
43\right]
44\]
is stored as follows in \xlifepp in the three following {\itshape std::vector}: \\
46\mbox{}\hspace{10 mm}values     = ( 0 11 12 14 16\ \ 22 23 26\ \ 31 32 33\ \ 43 44\ \ 51 53 54 55 ) \\
47\mbox{}\hspace{10 mm}colIndex   = ( 0 1 3 5\ \ 1 2 5\ \ 0 1 2\ \ 2 3\ \ 0 2 3 4 )\\
48\mbox{}\hspace{10 mm}rowPointer = ( 0 4 7 10 12 16) \\
In \xlifepp, unlike the "standard" format, the array {\itshape values} of the CSR format contains one extra element $0$ at the first position. \\
Since the format stores the location information explicitly as well as the value of each non-zero element, extra time is needed to access these location data.
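To fix ideas, the following minimal serial CSR product $y=Ax$ shows how the three arrays are traversed; it is only an illustrative sketch written with the usual zero-based layout (i.e. without the extra leading $0$ of \xlifepp), and all names are hypothetical.
\vspace{.1cm}
\begin{lstlisting}[]
#include <vector>

// Serial y = A*x for an m x n CSR matrix (standard layout, no leading 0)
void csrMultVector(const std::vector<double>& values,
                   const std::vector<unsigned>& colIndex,
                   const std::vector<unsigned>& rowPointer,
                   const std::vector<double>& x, std::vector<double>& y)
{
  unsigned m = rowPointer.size() - 1;
  for (unsigned i = 0; i < m; i++) {
    double s = 0.;
    // only the non-zero entries of row i are visited
    for (unsigned p = rowPointer[i]; p < rowPointer[i+1]; p++)
      s += values[p] * x[colIndex[p]];
    y[i] = s;
  }
}
\end{lstlisting}
\vspace{.1cm}
Each row only reads from $x$ and writes a single entry of $y$, which is what makes the row-wise parallelization below straightforward.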
As a CSR matrix is stored row by row, the simplest way to parallelize the matrix-vector multiplication with OpenMP is to assign each thread of the parallel region one row. However, considering load balancing, scheduling and synchronization overheads, it is better to apply a {\itshape row partitioning} scheme: the matrix is partitioned into blocks of rows according to the number of threads.
52\vspace{.1cm}
53\begin{lstlisting}[]
// Assign each thread an approximately equal number of non-zero elements
55Number nnzEachThread = std::floor(nnz/numThread);
56
57// Index of non-zero in each range for each thread
58Number nnzRangeIdx = nnzEachThread;
59
60// Lower and upper bound position for each thread
61std::vector<std::vector<Number>::const_iterator> itThreadLower(numThread);
62std::vector<std::vector<Number>::const_iterator> itThreadUpper(numThread);
63
64itp = itpb;
65itThreadLower[0] = itpb;
66for (int i = 0; i < numThread; i++) {
67  itThreadLower[i] = itp;
68  nnzRangeIdx = *itp + nnzEachThread;
69  itpLower = std::lower_bound(itp, itpe, nnzRangeIdx);
70  itpUpper = std::upper_bound(itpLower, itpe,nnzRangeIdx);
71  itp = ((nnzRangeIdx - *(itpUpper -1))< (*itpUpper-nnzRangeIdx)) ? itpUpper-1 : itpUpper;
72  itThreadUpper[i] = itp;
73}
74itThreadUpper[numThread-1] = itpe;
75\end{lstlisting}
76\vspace{.1cm}
Each block is assigned to a thread and contains an approximately equal number of non-zero elements. With this partitioning, each thread operates on its own part of {\itshape values, colIndex and rowPointer}. All threads access elements of the source vector $x$; since these accesses are read-only, there is no invalidation traffic. One advantage of {\itshape row partitioning} is that each thread writes its own part of the destination vector $y$, so no synchronization among threads is needed. \\
78\vspace{.1cm}
79\begin{lstlisting}[]
80#pragma omp parallel for default(none)\
81                         private(i, itr, iti, itie, itim, itpLower, itpUpper) \
82                         shared(numThread, itbLower, itbUpper, itpb, itib, itm, itvb, itrb) \
83                         schedule(dynamic,1)
for(i = 0; i < numThread; i++) {
  // range of rowPointer entries (block of rows) assigned to block i
  itpLower = *(itbLower+i);
  itpUpper = *(itbUpper+i);
  for (;itpLower != itpUpper; itpLower++) {
    itr = itrb + (itpLower - itpb);  // destination entry of the current row
    *itr *= 0;                       // reset it before accumulating
    iti  = itib + *(itpLower);       // first column index of the row
    itie = itib + *(itpLower + 1);   // past-the-end column index of the row
    itim = itm + *(itpLower);        // first value of the row

    while(iti != itie) {
      *itr += *(itim) * *(itvb + *iti); iti++; itim++;
    }
  }
}
99\end{lstlisting}
100\vspace{.1cm}
However, this coarse-grained approach sometimes cannot ensure the load balancing of SpMV for a very large matrix (e.g. with millions of non-zero elements); a {\bfseries granularity factor} is a way to make it more fine-grained. Each row block of a thread is divided into smaller pieces, which not only makes SpMV more fine-grained but also allows threads to take work from each other.
102\vspace{.1cm}
103\begin{lstlisting}[]
104const Number GRANULARITY = 16;
105numThread *= GRANULARITY;
106\end{lstlisting}
107\vspace{.1cm}
108
109\vspace{.1cm}
110\displayInfos{library=largeMatrix, header=RowCsStorage.hpp, implementation=RowCsStorage.cpp, test={test\_OpenMP.cpp}, header dep={CsStorage.hpp, config.h, utils.h}}
111
112
113\subsection {SpMV of CSC}
114
For an $m\times n$ matrix $A$ with {\itshape nz} non-zero elements, the CSC format stores three arrays:
116\begin{itemize}
\item values[nz] stores the value of each non-zero element of the matrix $A$
\item rowIndex[nz] stores the row index of each element of the {\itshape values[nz]} array
\item colPointer[n+1] stores the index of the first non-zero element of each column, with {\itshape colPointer[n] = nz}
120\end{itemize}
121For instance, the following $5\times 6$ matrix
122\[
123A=\left[
124\begin{array}{cccccc}
12511 & 12 & 0 & 14 & 0 & 16\\
1260 & 22 & 23 & 0 & 0 & 26 \\
12731 & 32 & 33 & 0 & 0 & 0 \\
1280 & 0 & 43 & 44 & 0 & 0 \\
12951 & 0 & 53 & 54 & 55 & 0
130\end{array}
131\right]
132\]
is stored as follows in \xlifepp in the three following {\itshape std::vector}: \\
\mbox{}\hspace{10 mm}values     = ( 0 11 31 51\ \ 12 22 32\ \ 23 33 43 53\ \ 14 44 54\ \ 55\ \ 16 26 ) \\
\mbox{}\hspace{10 mm}rowIndex   = ( 0 2 4\ \ 0 1 2\ \ 1 2 3 4\ \ 0 3 4\ \ 4\ \ 0 1 )\\
\mbox{}\hspace{10 mm}colPointer = ( 0 3 6 10 13 14 16) \\
In \xlifepp, unlike the "standard" format, the array {\itshape values} of the CSC format contains one extra element $0$ at the first position. \\
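The following minimal serial CSC product $y=Ax$ (again an illustrative sketch with the standard layout, i.e. without the leading $0$, and hypothetical names) shows that each column scatters its contributions into the destination vector $y$:
\vspace{.1cm}
\begin{lstlisting}[]
#include <algorithm>
#include <vector>

// Serial y = A*x for an m x n CSC matrix (standard layout, no leading 0)
void cscMultVector(const std::vector<double>& values,
                   const std::vector<unsigned>& rowIndex,
                   const std::vector<unsigned>& colPointer,
                   const std::vector<double>& x, std::vector<double>& y)
{
  unsigned n = colPointer.size() - 1;
  std::fill(y.begin(), y.end(), 0.);
  for (unsigned j = 0; j < n; j++)
    // every non-zero of column j updates a different entry of y
    for (unsigned p = colPointer[j]; p < colPointer[j+1]; p++)
      y[rowIndex[p]] += values[p] * x[j];
}
\end{lstlisting}
\vspace{.1cm}
Two threads working on different columns may therefore update the same entry of $y$, which is the main difficulty addressed below.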
Because the "column-like" storage of CSC conflicts with the natural "row-processing" manner of the matrix-vector multiplication, the OpenMP parallelization of the matrix-vector multiplication for CSC aims at limiting the effect of this "column-like" characteristic. \\
Since a CSC matrix is stored column by column, the simplest way to parallelize the matrix-vector multiplication with OpenMP is to have each thread process one column at a time and then contribute its result to the destination vector by a "reduction-like" operation. Obviously, this approach is not efficient: each thread processes non-adjacent columns, which causes cache misses when many threads are used, and load imbalance can occur because the number of non-zero elements differs from one column to another. An improved approach is to divide the matrix into groups of columns, each of which has an approximately equal number of non-zero elements. Because all the columns in a group are adjacent, the effect of cache misses is limited; moreover, load imbalance hardly occurs thanks to the equal number of non-zero elements processed by each thread. \\
140The following code implements the approach described above.
141\vspace{.1cm}
142\begin{lstlisting}[]
143// Lower and upper bound position for each thread
144std::vector<std::vector<Number>::const_iterator> itThreadLower(numThread);
145std::vector<std::vector<Number>::const_iterator> itThreadUpper(numThread);
146
147std::vector<Number>::const_iterator itpLower, itpUpper;
148// Find the smallest and largest rowIndexs corresponding to each thread
149std::vector<Number> sRowIdx(numThread);
150std::vector<Number> lRowIdx(numThread);
151
152itp = itpb;
153itThreadLower[0] = itpb;
154for (int i = 0; i < numThread; i++) {
155  itThreadLower[i] = itp;
156  nnzRangeIdx = *itp + nnzEachThread;
157  itpLower = std::lower_bound(itp, itpe, nnzRangeIdx);
158  itpUpper = std::upper_bound(itpLower, itpe,nnzRangeIdx);
159  itp = ((nnzRangeIdx - *(itpUpper -1))< (*itpUpper-nnzRangeIdx)) ? itpUpper-1 : itpUpper;
160  itThreadUpper[i] = itp;
161}
162itThreadUpper[numThread-1] = itpe;
163
164std::vector<std::vector<Number>::const_iterator>::const_iterator itbLower, itbUpper;
165itbLower = itThreadLower.begin();
166itbUpper = itThreadUpper.begin();
167\end{lstlisting}
168\vspace{.1cm}
The serial code above splits a CSC matrix into blocks of columns, each of which has nearly the same number of non-zero elements. After that, each thread processes a block of columns, stores the result in its own temporary vector and writes this vector into the destination vector by a "reduction-like" operation. Note that the temporary vector of each thread necessarily has the same size as the destination vector; this is the trade-off of this method. \\
After finishing its own matrix-vector multiplication, each thread writes its result into the destination vector. Since array reduction is not supported in OpenMP for C++ (some projects have tried to make this feature available, but it does not seem to be ready in the near future), the only way, at the moment, is to use a critical section, which is clearly coarse-grained.
171\vspace{.1cm}
172\begin{lstlisting}[]
173typedef typename IterationVectorTrait<ResIterator>::Type ResType;
174#pragma omp parallel  \
175                      private(i, tid, itr, iti, itie, itim, itpLower, itpUpper, itv) \
176                      shared(numThread, itbLower, itbUpper, itpb, itib, itm, itvb, itrb, nRows)
177{
178    tid = omp_get_thread_num();
    std::vector<ResType> resTemp(nRows);   // per-thread temporary result, same size as the destination vector
180    typename std::vector<ResType>::iterator itbResTemp = resTemp.begin(), iteResTemp = resTemp.end(), itResTemp = itbResTemp;
181
182    #pragma omp for nowait schedule(dynamic,1)
183    for(i = 0; i < numThread; i++) {
184         itpLower = *(itbLower+i);
185         itpUpper = *(itbUpper+i);
186         for (;itpLower != itpUpper;itpLower++) {
187             itv = itvb + (itpLower - itpb);
188             iti  = itib + *(itpLower);
189             itie = itib + *(itpLower + 1);
190             itim = itm + *(itpLower);
191
192             while(iti != itie) {
193                 itResTemp   = itbResTemp + (*iti);
194                 *itResTemp += *itim * *(itv);
195                 iti++; itim++;
196             }
197         }
198    }
199    #pragma omp critical (updateResult)
200    {
201        for (itResTemp = itbResTemp;itResTemp != iteResTemp; itResTemp++) {
202            itr = (itrb + (itResTemp - itbResTemp));
203            *itr += *itResTemp;
204        }
205    }
206}
207\end{lstlisting}
208\vspace{.1cm}
209\displayInfos{library=largeMatrix, header=ColCsStorage.hpp, implementation=ColCsStorage.cpp, test={test\_OpenMP.cpp}, header dep={CsStorage.hpp, config.h, utils.h}}
210
211\section{Sparse matrix factorization}
As mentioned above, one way to solve a linear system is to use direct methods, which rely heavily on matrix factorization. First, the costly factorization is performed; then the factorized matrix is used to solve the linear system $Ax=b$. So far, \xlifepp supports some popular factorization methods: LU, LDLt and LDL*; all of them have a parallelized version with OpenMP. Because of the factorization algorithm, only the skyline storage is suitable for factorization. \\
Because LDLt can be considered a special case of LU, the following only describes the parallelized LU factorization. \\
The idea behind the multi-threaded factorization is simple: instead of carrying out the factorization element by element as in the serial version, the algorithm works block by block. At each iteration, the diagonal block is processed first, then the blocks of the row and of the column corresponding to this diagonal block. \\
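To make the block-by-block idea concrete, here is a generic right-looking blocked LU factorization (without pivoting) of a dense matrix; it is only an illustrative sketch with hypothetical names, not the \xlifepp skyline implementation.
\vspace{.1cm}
\begin{lstlisting}[]
#include <algorithm>
#include <cstddef>
#include <vector>

// Generic blocked LU (no pivoting) of a dense n x n matrix stored row-major in a[i*n+j]
void blockedLU(std::vector<double>& a, std::size_t n, std::size_t blockSize)
{
  for (std::size_t k = 0; k < n; k += blockSize) {
    std::size_t kEnd = std::min(k + blockSize, n);
    // 1. unblocked LU of the diagonal block A(k:kEnd, k:kEnd)
    for (std::size_t p = k; p < kEnd; p++)
      for (std::size_t i = p + 1; i < kEnd; i++) {
        a[i*n+p] /= a[p*n+p];
        for (std::size_t j = p + 1; j < kEnd; j++) a[i*n+j] -= a[i*n+p] * a[p*n+j];
      }
    // 2. row block: triangular solve with the unit lower part of the diagonal block,
    //    applied to the columns on the right of the diagonal block
    for (std::size_t p = k; p < kEnd; p++)
      for (std::size_t i = p + 1; i < kEnd; i++)
        for (std::size_t j = kEnd; j < n; j++) a[i*n+j] -= a[i*n+p] * a[p*n+j];
    // 3. column block: triangular solve with the upper part of the diagonal block,
    //    applied to the rows below the diagonal block
    for (std::size_t p = k; p < kEnd; p++)
      for (std::size_t i = kEnd; i < n; i++) {
        a[i*n+p] /= a[p*n+p];
        for (std::size_t j = p + 1; j < kEnd; j++) a[i*n+j] -= a[i*n+p] * a[p*n+j];
      }
    // 4. trailing submatrix update with the freshly computed L and U blocks
    for (std::size_t i = kEnd; i < n; i++)
      for (std::size_t p = k; p < kEnd; p++)
        for (std::size_t j = kEnd; j < n; j++) a[i*n+j] -= a[i*n+p] * a[p*n+j];
  }
}
\end{lstlisting}
\vspace{.1cm}
Steps 2 and 3 are independent of each other, which is precisely what the OpenMP version below exploits by computing the row and column blocks concurrently.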
215First of all, the diagonal block is calculated
216\vspace{.1cm}
217\begin{lstlisting}[]
218#pragma omp single
219      diagBlockSolverParallel(k*blockSize, blockSizeRow[k], it_rownx,
220                              k*blockSize, blockSizeCol[k], it_colnx,
221                              it_fd, it_fl, it_fu,
222                              it_md, it_ml, it_mu);
223\end{lstlisting}
224\vspace{.1cm}
After that, the row and column blocks can be computed at the same time
226\vspace{.1cm}
227\begin{lstlisting}[]
228#pragma omp for nowait
229      for (jj = k+1; jj < nbBlockCol; jj++) {
230          #pragma omp task untied firstprivate(k, jj) \
231                      shared(blockSize, blockSizeRow, blockSizeCol, it_rownx, it_colnx, it_fl, it_fu, it_mu)
232          upperBlockSolverParallel(k*blockSize, blockSizeRow[k],it_rownx,
233                              jj*blockSize, blockSizeCol[jj], it_colnx,
234                              it_fl, it_fu, it_mu);
235      }
236
237#pragma omp for
238      for (ii = k+1; ii < nbBlockRow; ii++) {
239          #pragma omp task untied firstprivate(k, ii) \
240                      shared(blockSize, blockSizeRow, blockSizeCol, it_rownx, it_colnx, it_fd, it_fl, it_fu, it_ml)
241          lowerBlockSolverParallel(ii*blockSize, blockSizeRow[ii], it_rownx,
242                              k*blockSize, blockSizeCol[k], it_colnx,
243                              it_fd, it_fl, it_fu, it_ml);
244      }
245  }
246\end{lstlisting}
247\vspace{.1cm}
In this code, we use one of the more recent features of OpenMP, the task directive, which is only supported since OpenMP 3.0. By using this directive, we can take advantage of taskifying the code and avoid potential load-balancing problems.  \\
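For readers unfamiliar with this directive, a minimal self-contained example (independent of \xlifepp) could look like the following: one thread creates the tasks and any idle thread of the team executes them.
\vspace{.1cm}
\begin{lstlisting}[]
#include <cstdio>
#include <omp.h>

int main()
{
  #pragma omp parallel
  {
    #pragma omp single nowait
    for (int b = 0; b < 8; b++) {
      // each iteration becomes a deferred task picked up by any thread of the team
      #pragma omp task firstprivate(b)
      std::printf("block %d processed by thread %d\n", b, omp_get_thread_num());
    }
  } // implicit barrier: all tasks have completed here
  return 0;
}
\end{lstlisting}
\vspace{.1cm}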
One common issue when working with a blocking algorithm is the choice of the block size, which can change the performance of the program significantly; it is impossible to find a value of blockSize that is optimal in all cases. After some experiments with typical sparse matrices of \xlifepp, a simple formula is used to specify blockSize. \\
250\vspace{.1cm}
251\begin{lstlisting}[]
252  const Number BLOCKFACTOR = 0.05*std::min(nbOfRows(),nbOfColumns());
253  Number numBlockMin = BLOCKFACTOR;
254  Number blockSize  = std::floor(diagonalSize()/numBlockMin);
255  Number nbBlockRow = std::ceil(nbOfRows()/blockSize);
256  Number nbBlockCol = std::ceil(nbOfColumns()/blockSize);
257  Number numBlock = std::min(nbBlockRow, nbBlockCol);
258
259  std::vector<Number> blockSizeRow(nbBlockRow, blockSize), blockSizeCol(nbBlockCol, blockSize);
260  blockSizeRow[nbBlockRow-1] = nbRows_ - (nbBlockRow-1)*blockSize;
261  blockSizeCol[nbBlockCol-1] = nbCols_ - (nbBlockCol-1)*blockSize;
262\end{lstlisting}
263\vspace{.1cm}
Again, this value is only "good" for a certain group of sparse matrices; it may need to be changed in the future.
265