1%%%%%%%%%%%%%%%%%%% 2% XLiFE++ is an extended library of finite elements written in C++ 3% Copyright (C) 2014 Lunéville, Eric; Kielbasiewicz, Nicolas; Lafranche, Yvon; Nguyen, Manh-Ha; Chambeyron, Colin 4% 5% This program is free software: you can redistribute it and/or modify 6% it under the terms of the GNU General Public License as published by 7% the Free Software Foundation, either version 3 of the License, or 8% (at your option) any later version. 9% This program is distributed in the hope that it will be useful, 10% but WITHOUT ANY WARRANTY; without even the implied warranty of 11% MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 12% GNU General Public License for more details. 13% You should have received a copy of the GNU General Public License 14% along with this program. If not, see <http://www.gnu.org/licenses/>. 15%%%%%%%%%%%%%%%%%%% 16 17\section{Sparse matrix-vector multiplication} 18 19In a Finite Element package, after repeatedly setting up the stiffness matrices, it needs to a solve linear equation system, which can be done either in two ways (1) by direct methods or (2) iterative methods. In direct methods, the factorization of matrix costs nearly most of the runtime, meanwhile, in iterative methods, matrix-vector multiplication plays an important role. 20Since the matrices arising from the discretisation are sparse, an appropriate matrix storage format, sparse format is used. \xlifepp provides a plenty types of sparse matrix, which can serve for different purposes. Although all the sparse matrix can be taken for parallelized sparse-matrix-vector-multiplication (SpMV); among them, CSR (compressed storage row) seems to be the best candidate. \\ 21The form of SpMV is $y=Ax$, where $A$ is a sparse matrix, $x$ and $y$ are dense vectors. $x$ is source vector and $y$ is destination vector.\\ 22In the following, the algorithms of parallelization of matrix-vector multiplication for CSR and CSC (compressed storage column) are mentioned, on which parallelized matrix-vector multiplication of other sparse format (CSDual, CSSym, SkylineDual, SkylineSym) are based. 23 24\subsection {SpMV of CSR} 25 26Because CSR is a well-known format, only a small recall about its structure is presented. \\ 27A matrix $A$ is an $m\times n$ matrix, and the number of non-zero elements is {\itshape nz}, CSR format needs to store three arrays: 28\begin{itemize} 29\item values[nz] store the value of each non-zero element in matrix A 30\item colIndex[nz] stores the column index of each element in {\itshape val[nz]} array 31\item rowPointer[m+1] stores the index of the first non-zero element of each row and {\itshape rowPointer[m] = nz} 32\end{itemize} 33For instance, the following $5\times 6$ matrix 34\[ 35A=\left[ 36\begin{array}{cccccc} 3711 & 12 & 0 & 14 & 0 & 16\\ 380 & 22 & 23 & 0 & 0 & 26 \\ 3931 & 32 & 33 & 0 & 0 & 0 \\ 400 & 0 & 43 & 44 & 0 & 0 \\ 4151 & 0 & 53 & 54 & 55 & 0 42\end{array} 43\right] 44\] 45is stored as follow in \xlifepp by the three following {\itshape std::vector}: \\ 46\mbox{}\hspace{10 mm}values = ( 0 11 12 14 16\ \ 22 23 26\ \ 31 32 33\ \ 43 44\ \ 51 53 54 55 ) \\ 47\mbox{}\hspace{10 mm}colIndex = ( 0 1 3 5\ \ 1 2 5\ \ 0 1 2\ \ 2 3\ \ 0 2 3 4 )\\ 48\mbox{}\hspace{10 mm}rowPointer = ( 0 4 7 10 12 16) \\ 49In \xlifepp, different from "standard", the array {\itshape values} of CSR format contains one more element $0$ at the first position. \\ 50Since it must store the location information explicitly as well as the value of each non-zero element, extra communication time is need to access these location data. 51As a matrix of CSR format is store row by row, the simplest way to parallelize the matrix-vector multiplication using OpenMP is to assign each thread of parallel region to work with one row. However, considering the problem of load balancing, scheduling overhead and synchronization overheads, it's bettr to apply the {\itshape row partitioning} scheme. The matrix will be partitioned into blocks of row by the number of threads. 52\vspace{.1cm} 53\begin{lstlisting}[] 54// Assign each thread an approximately equal number of non-zero 55Number nnzEachThread = std::floor(nnz/numThread); 56 57// Index of non-zero in each range for each thread 58Number nnzRangeIdx = nnzEachThread; 59 60// Lower and upper bound position for each thread 61std::vector<std::vector<Number>::const_iterator> itThreadLower(numThread); 62std::vector<std::vector<Number>::const_iterator> itThreadUpper(numThread); 63 64itp = itpb; 65itThreadLower[0] = itpb; 66for (int i = 0; i < numThread; i++) { 67 itThreadLower[i] = itp; 68 nnzRangeIdx = *itp + nnzEachThread; 69 itpLower = std::lower_bound(itp, itpe, nnzRangeIdx); 70 itpUpper = std::upper_bound(itpLower, itpe,nnzRangeIdx); 71 itp = ((nnzRangeIdx - *(itpUpper -1))< (*itpUpper-nnzRangeIdx)) ? itpUpper-1 : itpUpper; 72 itThreadUpper[i] = itp; 73} 74itThreadUpper[numThread-1] = itpe; 75\end{lstlisting} 76\vspace{.1cm} 77Each block is assigned to a thread and has an approximately equal number of non-zero elements. By partitioning this way, each thread operates on its own part of the {\itshape values, colIndex and rowPointer}. All threads access element of the source vector $x$. Since accesses on $x$ are read-only, there is no invalidation traffic. One advantage of {\itshape row partitioning} is that each thread works on its own part of destination vector $y$, therefore, there is no need of synchronization among threads. \\ 78\vspace{.1cm} 79\begin{lstlisting}[] 80#pragma omp parallel for default(none)\ 81 private(i, itr, iti, itie, itim, itpLower, itpUpper) \ 82 shared(numThread, itbLower, itbUpper, itpb, itib, itm, itvb, itrb) \ 83 schedule(dynamic,1) 84for(i = 0; i < numThread; i++) { 85 itpLower = *(itbLower+i); 86 itpUpper = *(itbUpper+i); 87 for (;itpLower != itpUpper;itpLower++) { 88 itr = itrb + (itpLower - itpb); 89 *itr *=0; 90 iti = itib + *(itpLower); 91 itie = itib + *(itpLower + 1); 92 itim = itm + *(itpLower); 93 94 while(iti != itie) { 95 *itr += *(itim) * *(itvb + *iti);iti++; itim++; 96 } 97 } 98} 99\end{lstlisting} 100\vspace{.1cm} 101However, the coarse-grained approach sometimes can not assure the load balancing of SpMV of a very large matrix (e.g with million non-zero elements), and a {\bfseries granularity factor} is a solution to make it more fine-grained. Each block of row of a thread is divided into smaller pieces, which not only make SpMv more fine-grained but also allow a thread to take work from each other. 102\vspace{.1cm} 103\begin{lstlisting}[] 104const Number GRANULARITY = 16; 105numThread *= GRANULARITY; 106\end{lstlisting} 107\vspace{.1cm} 108 109\vspace{.1cm} 110\displayInfos{library=largeMatrix, header=RowCsStorage.hpp, implementation=RowCsStorage.cpp, test={test\_OpenMP.cpp}, header dep={CsStorage.hpp, config.h, utils.h}} 111 112 113\subsection {SpMV of CSC} 114 115A matrix $A$ is an $m\times n$ matrix, and the number of non-zero elements is {\itshape nz}, CSC format needs to store three arrays: 116\begin{itemize} 117\item values[nz] store the value of each non-zero element in matrix A 118\item rowIndex[nz] stores the row index of each element in {\itshape val[nz]} array 119\item colPointer[m+1] stores the index of the first non-zero element of each column and {\itshape colPointer[m] = nz} 120\end{itemize} 121For instance, the following $5\times 6$ matrix 122\[ 123A=\left[ 124\begin{array}{cccccc} 12511 & 12 & 0 & 14 & 0 & 16\\ 1260 & 22 & 23 & 0 & 0 & 26 \\ 12731 & 32 & 33 & 0 & 0 & 0 \\ 1280 & 0 & 43 & 44 & 0 & 0 \\ 12951 & 0 & 53 & 54 & 55 & 0 130\end{array} 131\right] 132\] 133is stored as follow in \xlifepp by the three following {\itshape std::vector}: \\ 134\mbox{}\hspace{10 mm}values = ( 0 11 31 51 12 22 32 23 33 43 53 14 44 54 55 16 26 ) \\ 135\mbox{}\hspace{10 mm}rowIndex = ( 0 2 4 0 1 2 1 2 3 4 0 3 4 4 0 1)\\ 136\mbox{}\hspace{10 mm}colPointer = ( 0 3 6 10 13 14 16) \\ 137In \xlifepp, different from "standard", the array {\itshape values} of CSC format contains one more element $0$ at the first position \\ 138Because the "column-like" storage of CSC is contrast to the natural manner of "row-processing" matrix-vector multiplication, the approach using OpenMP to parallelize matrix-vector multiplication of CSC is to limit the affect of this "column-like" characteristic. \\ 139Since a CSC matrix is stored in column by column, the simplest way to parallelize the matrix-vector multiplication using OpenMP is to use each thread to process a column, one by one; then the result of each thread is contributed to the destination vector by a "reduction-like" operation. Obviously, this approach is not efficient. Not only each thread processes non-adjacent column, which causes cache misses in cases of many threads; but also the problem of load-balancing can happen because of the difference in the number of non-zero elements in each column. An improved approach is to divide the matrix into groups of column, each of which has an approximately equal number of non-zero elements. Because all the columns in each group are adjacent, it limits the effect of cache miss. Moreover, load-balancing hardly happens thanks to equal number of non-zero elements processed by each thread. \\ 140The following code implements the approach described above. 141\vspace{.1cm} 142\begin{lstlisting}[] 143// Lower and upper bound position for each thread 144std::vector<std::vector<Number>::const_iterator> itThreadLower(numThread); 145std::vector<std::vector<Number>::const_iterator> itThreadUpper(numThread); 146 147std::vector<Number>::const_iterator itpLower, itpUpper; 148// Find the smallest and largest rowIndexs corresponding to each thread 149std::vector<Number> sRowIdx(numThread); 150std::vector<Number> lRowIdx(numThread); 151 152itp = itpb; 153itThreadLower[0] = itpb; 154for (int i = 0; i < numThread; i++) { 155 itThreadLower[i] = itp; 156 nnzRangeIdx = *itp + nnzEachThread; 157 itpLower = std::lower_bound(itp, itpe, nnzRangeIdx); 158 itpUpper = std::upper_bound(itpLower, itpe,nnzRangeIdx); 159 itp = ((nnzRangeIdx - *(itpUpper -1))< (*itpUpper-nnzRangeIdx)) ? itpUpper-1 : itpUpper; 160 itThreadUpper[i] = itp; 161} 162itThreadUpper[numThread-1] = itpe; 163 164std::vector<std::vector<Number>::const_iterator>::const_iterator itbLower, itbUpper; 165itbLower = itThreadLower.begin(); 166itbUpper = itThreadUpper.begin(); 167\end{lstlisting} 168\vspace{.1cm} 169The serial codes above split a CSC matrix into blocks of column, each of which has nearly the same number of non-zero elements. After that, each of thread will process a block of column, store the result into its temporary vector and write this vector into the destination vector by a "reduction-like" operation. One remark is that temporary vector of each thread have necessarily the same size as the destination vector. It's is the trade-off of this method. \\ 170After finishing its own matrix-vector multiplication, each thread writes its result into the destination vector. Since array reduction isn't supported in OpenMP C++ (some projects have tried to make this feature available but it seems not to be ready in the near future!!). The only way, at the moment, is to use critical section. And we can easily realize that it is coarse-grained. 171\vspace{.1cm} 172\begin{lstlisting}[] 173typedef typename IterationVectorTrait<ResIterator>::Type ResType; 174#pragma omp parallel \ 175 private(i, tid, itr, iti, itie, itim, itpLower, itpUpper, itv) \ 176 shared(numThread, itbLower, itbUpper, itpb, itib, itm, itvb, itrb, nRows) 177{ 178 tid = omp_get_thread_num(); 179 std::vector<ResType> resTemp(nRows); 180 typename std::vector<ResType>::iterator itbResTemp = resTemp.begin(), iteResTemp = resTemp.end(), itResTemp = itbResTemp; 181 182 #pragma omp for nowait schedule(dynamic,1) 183 for(i = 0; i < numThread; i++) { 184 itpLower = *(itbLower+i); 185 itpUpper = *(itbUpper+i); 186 for (;itpLower != itpUpper;itpLower++) { 187 itv = itvb + (itpLower - itpb); 188 iti = itib + *(itpLower); 189 itie = itib + *(itpLower + 1); 190 itim = itm + *(itpLower); 191 192 while(iti != itie) { 193 itResTemp = itbResTemp + (*iti); 194 *itResTemp += *itim * *(itv); 195 iti++; itim++; 196 } 197 } 198 } 199 #pragma omp critical (updateResult) 200 { 201 for (itResTemp = itbResTemp;itResTemp != iteResTemp; itResTemp++) { 202 itr = (itrb + (itResTemp - itbResTemp)); 203 *itr += *itResTemp; 204 } 205 } 206} 207\end{lstlisting} 208\vspace{.1cm} 209\displayInfos{library=largeMatrix, header=ColCsStorage.hpp, implementation=ColCsStorage.cpp, test={test\_OpenMP.cpp}, header dep={CsStorage.hpp, config.h, utils.h}} 210 211\section{Sparse matrix factorization} 212As mentioned above, one way to solve a linear equation system is to use direct methods, which heavily depends on matrix factorization. Firstly, the costly runtime factorization is done, secondly, the factorized matrix is used to solve the linear equation system $Ax=b$. Till now, \xlifepp has supported some popular factorization methods: LU, LDLt and LDL*; all of them have a parallelized version with OpenMP. Because of factorization algorithm, only skyline storage is suitable to be factorized. \\ 213Because LDLt can be considered to be a special case of LU, the following only mentions the parallelized LU factorization. \\ 214The idea behind multi-threaded factorization is simple: Instead of implementing the factorization element by element as in the serial version, we make the algorithm work on block by block. For each iteration, block on diagonal is processed then block on the row and column corresponding to this diagonal block. \\ 215First of all, the diagonal block is calculated 216\vspace{.1cm} 217\begin{lstlisting}[] 218#pragma omp single 219 diagBlockSolverParallel(k*blockSize, blockSizeRow[k], it_rownx, 220 k*blockSize, blockSizeCol[k], it_colnx, 221 it_fd, it_fl, it_fu, 222 it_md, it_ml, it_mu); 223\end{lstlisting} 224\vspace{.1cm} 225After that, the block of row and column can be computed in the same time 226\vspace{.1cm} 227\begin{lstlisting}[] 228#pragma omp for nowait 229 for (jj = k+1; jj < nbBlockCol; jj++) { 230 #pragma omp task untied firstprivate(k, jj) \ 231 shared(blockSize, blockSizeRow, blockSizeCol, it_rownx, it_colnx, it_fl, it_fu, it_mu) 232 upperBlockSolverParallel(k*blockSize, blockSizeRow[k],it_rownx, 233 jj*blockSize, blockSizeCol[jj], it_colnx, 234 it_fl, it_fu, it_mu); 235 } 236 237#pragma omp for 238 for (ii = k+1; ii < nbBlockRow; ii++) { 239 #pragma omp task untied firstprivate(k, ii) \ 240 shared(blockSize, blockSizeRow, blockSizeCol, it_rownx, it_colnx, it_fd, it_fl, it_fu, it_ml) 241 lowerBlockSolverParallel(ii*blockSize, blockSizeRow[ii], it_rownx, 242 k*blockSize, blockSizeCol[k], it_colnx, 243 it_fd, it_fl, it_fu, it_ml); 244 } 245 } 246\end{lstlisting} 247\vspace{.1cm} 248In these code, we use one of the latest feature of OpenMP, directive task, which has been only supported since OpenMP 3.0. By using this directive, we can take advantage of taskification of the code and prevent the potential of load balancing problem. \\ 249One common problem on working with blocking-algorithm is the block size. It is impossible to find out a optimal value for the blockSize that is able to change the performance of program largely. After some experiments with typical sparse matrices of \xlifepp, one simple formula is provided to specify blockSize. \\ 250\vspace{.1cm} 251\begin{lstlisting}[] 252 const Number BLOCKFACTOR = 0.05*std::min(nbOfRows(),nbOfColumns()); 253 Number numBlockMin = BLOCKFACTOR; 254 Number blockSize = std::floor(diagonalSize()/numBlockMin); 255 Number nbBlockRow = std::ceil(nbOfRows()/blockSize); 256 Number nbBlockCol = std::ceil(nbOfColumns()/blockSize); 257 Number numBlock = std::min(nbBlockRow, nbBlockCol); 258 259 std::vector<Number> blockSizeRow(nbBlockRow, blockSize), blockSizeCol(nbBlockCol, blockSize); 260 blockSizeRow[nbBlockRow-1] = nbRows_ - (nbBlockRow-1)*blockSize; 261 blockSizeCol[nbBlockCol-1] = nbCols_ - (nbBlockCol-1)*blockSize; 262\end{lstlisting} 263\vspace{.1cm} 264Again, this value is only "good" for a certain group of sparse matrices. It maybe needs changing in the future. 265