\par
\vfill \eject
\section{MPI Solution of $A X = Y$ using an $LU$ factorization}
\label{section:LU-MPI}
\par
Unlike the serial and multithreaded environments, where the data
structures are global and exist under one address space,
in the MPI environment data is local: each process or processor
has its own distinct address space.
The MPI step-by-step process to solve a linear system is exactly
the same as in the multithreaded case, with the additional
complication that the data structures are distributed and must be
redistributed as the computation proceeds.
\par
The ownership of the factor matrices during the factorization and
solves is exactly the same as for the multithreaded case -- the
map from fronts to processors and the map from submatrices to
processors are identical to their counterparts in the multithreaded
program.
What is different is the explicit message passing of data
structures between processors.
Fortunately, most of this is hidden from the user code.
\par
We will now work our way through the program
found in Section~\ref{section:LU-MPI-driver}
to illustrate the use of {\bf SPOOLES} to solve a system
of linear equations in the MPI environment.
\par
\subsection{Reading the input parameters}
\label{subsection:MPI:input-data}
\par
This step is identical to the serial code, as described in
Section~\ref{subsection:serial:input-data}, with the exception
that the file names for $A$ and $Y$ are hardcoded in the driver,
and so are not part of the input parameters.
\par
\subsection{Communicating the data for the problem}
\label{subsection:MPI:communicating-data}
\par
This step is similar to the serial code, as described in
Section~\ref{subsection:serial:communicating-data}.
In the serial and multithreaded codes, the entire matrix $A$ was
read in from one file and placed into one {\tt InpMtx} object.
In the MPI environment, it need not be the case that one
processor holds the entire matrix $A$.
(In fact, $A$ must be distributed across the processors during the
factorization.)
\par
Each processor opens a matrix file, (possibly) reads in matrix
entries, and creates its {\it local} {\tt InpMtx} object that holds
the matrix entries it has read in.
We have hardcoded the file names: processor $q$ reads
its matrix entries from file {\tt matrix.}$q${\tt .input}
and
its right hand side entries from file {\tt rhs.}$q${\tt .input}.
The file formats are the same as for the serial and multithreaded
drivers.
\par
The entries need not be partitioned over the files.
For example, each processor could read in entries for disjoint sets
of finite elements.
Naturally, some degrees of freedom will have support on elements
that are found on different processors.
When the entries in $A$ and $Y$ are mapped to processors, an
assembly of the matrix entries is done automatically.
\par
It could be the case that the matrix $A$ and right hand side $Y$
are read in by one processor.
(This was the approach we took with the {\tt LinSol} wrapper
objects.)
There must still be input files for the other processors,
with zeroes on their first (and only) line,
to specify that no entries are to be read.
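\par
To make the file handling concrete, the sketch below shows how a
processor might read its share of $A$.
The helper {\tt readLocalMatrix()} is our own illustration, not
part of the library or of the driver; it assumes the {\tt InpMtx}
input methods used by the serial driver and omits all error
checking.
\begin{verbatim}
#include <stdio.h>
#include "InpMtx.h"

/*
   hypothetical helper : read this processor's share of A from
   the file matrix.<myid>.input ; a zero entry count on the
   first line means this processor contributes no entries
*/
InpMtx * readLocalMatrix ( int myid ) {
   char     fileName[64] ;
   double   value ;
   FILE     *fp ;
   InpMtx   *mtxA ;
   int      col, ient, ncol, nent, nrow, row ;

   sprintf(fileName, "matrix.%d.input", myid) ;
   fp = fopen(fileName, "r") ;
   fscanf(fp, "%d %d %d", &nrow, &ncol, &nent) ;
   mtxA = InpMtx_new() ;
   InpMtx_init(mtxA, INPMTX_BY_ROWS, SPOOLES_REAL, nent, 0) ;
   for ( ient = 0 ; ient < nent ; ient++ ) {
      fscanf(fp, "%d %d %lf", &row, &col, &value) ;
      InpMtx_inputRealEntry(mtxA, row, col, value) ;
   }
   fclose(fp) ;
   return(mtxA) ;
}
\end{verbatim}
The same pattern applies to the {\tt rhs.}$q${\tt .input} files,
whose lines hold a row id followed by the right hand side entries
for that row.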
\par
\subsection{Reordering the linear system}
\label{subsection:MPI:reordering}
\par
The first part is very similar to the serial code, as described in
Section~\ref{subsection:serial:reordering}.
\begin{verbatim}
graph  = Graph_new() ;
adjIVL = InpMtx_MPI_fullAdjacency(mtxA, stats, msglvl, msgFile,
                                  MPI_COMM_WORLD) ;
nedges = IVL_tsize(adjIVL) ;
Graph_init2(graph, 0, neqns, 0, nedges, neqns, nedges,
            adjIVL, NULL, NULL) ;
frontETree = orderViaMMD(graph, seed + myid, msglvl, msgFile) ;
\end{verbatim}
While the data and computations are distributed across the
processors, the ordering process is not.
Therefore we need a global graph on each processor.
Since the matrix $A$ is distributed across the processors,
we use the distributed {\tt InpMtx\_MPI\_fullAdjacency()} method
to construct the {\tt IVL} object of the graph of $A + A^T$.
\par
At this point, each processor has computed its own minimum degree
ordering and created a front tree object.
The orderings will likely differ, because each processor
inputs a different random number seed to the ordering method.
Only one ordering can be used for the factorization, so the
processors collectively determine which of the orderings is best,
which is then broadcast to all the processors, as the code fragment
below illustrates.
\begin{verbatim}
opcounts = DVinit(nproc, 0.0) ;
opcounts[myid] = ETree_nFactorOps(frontETree, type, symmetryflag) ;
MPI_Allgather((void *) &opcounts[myid], 1, MPI_DOUBLE,
              (void *) opcounts, 1, MPI_DOUBLE, MPI_COMM_WORLD) ;
minops = DVmin(nproc, opcounts, &root) ;
DVfree(opcounts) ;
frontETree = ETree_MPI_Bcast(frontETree, root, msglvl, msgFile,
                             MPI_COMM_WORLD) ;
\end{verbatim}
\par
\subsection{Non-numeric work}
\label{subsection:MPI:non-numeric}
\par
Once the front tree is replicated across the processors, we obtain
the permutation vectors and permute the vertices in the front tree.
The local matrices for $A$ and $Y$ are also permuted.
These steps are identical to the serial and multithreaded drivers,
except that the local instead of the global $A$ and $Y$ matrices
are permuted.
\begin{verbatim}
oldToNewIV = ETree_oldToNewVtxPerm(frontETree) ;
newToOldIV = ETree_newToOldVtxPerm(frontETree) ;
ETree_permuteVertices(frontETree, oldToNewIV) ;
InpMtx_permute(mtxA, IV_entries(oldToNewIV),
               IV_entries(oldToNewIV)) ;
if (  symmetryflag == SPOOLES_SYMMETRIC
   || symmetryflag == SPOOLES_HERMITIAN ) {
   InpMtx_mapToUpperTriangle(mtxA) ;
}
InpMtx_changeCoordType(mtxA, INPMTX_BY_CHEVRONS) ;
InpMtx_changeStorageMode(mtxA, INPMTX_BY_VECTORS) ;
DenseMtx_permuteRows(mtxY, oldToNewIV) ;
\end{verbatim}
\par
The next step is to obtain the map from fronts to processors,
just as was done in the multithreaded driver.
In addition, we need a map from vertices to processors to be able
to distribute the matrix $A$ and right hand side $Y$ as necessary.
Since we have the map from vertices to fronts inside the front tree
object, the vertex map is easy to determine: a vertex is owned by
the processor that owns its front.
\begin{verbatim}
cutoff   = 1./(2*nproc) ;
cumopsDV = DV_new() ;
DV_init(cumopsDV, nproc, NULL) ;
ownersIV = ETree_ddMap(frontETree, type, symmetryflag,
                       cumopsDV, cutoff) ;
DV_free(cumopsDV) ;
vtxmapIV = IV_new() ;
IV_init(vtxmapIV, neqns, NULL) ;
IVgather(neqns, IV_entries(vtxmapIV), IV_entries(ownersIV),
         ETree_vtxToFront(frontETree)) ;
\end{verbatim}
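\par
The collective selection of the best ordering is worth isolating.
The standalone program below is our own illustration of the
pattern, not part of the driver: each process proposes a cost (a
stand-in for the operation count returned by
{\tt ETree\_nFactorOps()}), every process computes the same
winning rank, and that rank would then broadcast its front tree.
\begin{verbatim}
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main ( int argc, char *argv[] ) {
   double   mycost, *costs ;
   int      myid, nproc, q, root ;

   MPI_Init(&argc, &argv) ;
   MPI_Comm_rank(MPI_COMM_WORLD, &myid) ;
   MPI_Comm_size(MPI_COMM_WORLD, &nproc) ;
/*
   fabricate a local cost ; send and receive buffers are kept
   distinct, as the MPI standard requires
*/
   mycost = (double) ((7*myid + 3) % 11) ;
   costs  = (double *) malloc(nproc*sizeof(double)) ;
   MPI_Allgather((void *) &mycost, 1, MPI_DOUBLE,
                 (void *) costs,  1, MPI_DOUBLE, MPI_COMM_WORLD) ;
/*
   every process scans the same cost vector,
   so all agree on the same root
*/
   for ( root = 0, q = 1 ; q < nproc ; q++ ) {
      if ( costs[q] < costs[root] ) {
         root = q ;
      }
   }
   printf("process %d : best ordering is on process %d\n",
          myid, root) ;
   free(costs) ;
   MPI_Finalize() ;
   return(0) ;
}
\end{verbatim}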
At this point we are ready to assemble and distribute the entries
of $A$ and $Y$.
\begin{verbatim}
firsttag = 0 ;
newA = InpMtx_MPI_split(mtxA, vtxmapIV, stats, msglvl, msgFile,
                        firsttag, MPI_COMM_WORLD) ;
InpMtx_free(mtxA) ;
mtxA = newA ;
InpMtx_changeStorageMode(mtxA, INPMTX_BY_VECTORS) ;
newY = DenseMtx_MPI_splitByRows(mtxY, vtxmapIV, stats, msglvl,
                                msgFile, firsttag, MPI_COMM_WORLD) ;
DenseMtx_free(mtxY) ;
mtxY = newY ;
\end{verbatim}
The {\tt InpMtx\_MPI\_split()} method assembles and redistributes
the matrix entries by the vectors of the local matrix.
Recall above that the coordinate type was set to chevrons, as is
needed for the assembly of the entries into the front matrices.
The method returns a new {\tt InpMtx} object that contains the part
of $A$ that is needed by the processor.
The old {\tt InpMtx} object is freed and the new one takes its
place.
\par
Now we are ready to compute the symbolic factorization, but it too
must be done in a distributed manner.
\begin{verbatim}
symbfacIVL = SymbFac_MPI_initFromInpMtx(frontETree, ownersIV, mtxA,
                stats, msglvl, msgFile, firsttag, MPI_COMM_WORLD) ;
\end{verbatim}
The {\tt symbfacIVL} object on a particular processor is only a
subset of the global symbolic factorization, containing only what
the processor needs to know to compute its part of the
factorization.
\par
\subsection{The Matrix Factorization}
\label{subsection:MPI:factor}
\par
In contrast to the multithreaded environment, data structures are
local to a processor, and so locks are not needed to manage access
to critical regions of code.
The initialization of the front matrix and submatrix manager
objects is much like the serial case, with one exception.
\par
\begin{verbatim}
mtxmanager = SubMtxManager_new() ;
SubMtxManager_init(mtxmanager, NO_LOCK, 0) ;
frontmtx = FrontMtx_new() ;
FrontMtx_init(frontmtx, frontETree, symbfacIVL, type, symmetryflag,
              FRONTMTX_DENSE_FRONTS, pivotingflag, NO_LOCK, myid,
              ownersIV, mtxmanager, msglvl, msgFile) ;
\end{verbatim}
Note that the ninth and tenth arguments are {\tt myid} and
{\tt ownersIV}, not {\tt 0} and {\tt NULL} as for the serial and
multithreaded drivers.
These arguments tell the front matrix object
that it needs to initialize only
those parts of the factor matrices that it ``owns'',
which are given by the map from fronts to processors
and the processor id.
\par
The numeric factorization is performed by the
{\tt FrontMtx\_MPI\_factorInpMtx()} method.
The code segment from the sample program for the numerical
factorization step is found below.
\begin{verbatim}
chvmanager = ChvManager_new() ;
ChvManager_init(chvmanager, NO_LOCK, 0) ;
rootchv = FrontMtx_MPI_factorInpMtx(frontmtx, mtxA, tau, droptol,
             chvmanager, ownersIV, lookahead, &error, cpus,
             stats, msglvl, msgFile, firsttag, MPI_COMM_WORLD) ;
ChvManager_free(chvmanager) ;
\end{verbatim}
Note that the {\tt ChvManager} is not locked.
The calling sequence is identical to that of
the multithreaded factorization except for the addition of the
{\tt firsttag} and MPI communicator at the end.
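\par
The {\tt error} argument deserves a check before the solve
proceeds.
The fragment below is a sketch, assuming the driver convention
that {\tt error} holds the id of a front where the factorization
failed and is negative on success; since a failure may occur on
only some processors, an {\tt MPI\_Allreduce()} lets every process
learn of it and shut down cleanly.
\begin{verbatim}
int   globalerror ;
/*
   take the maximum of the local error codes so that
   every process sees a failure on any process
*/
MPI_Allreduce((void *) &error, (void *) &globalerror, 1, MPI_INT,
              MPI_MAX, MPI_COMM_WORLD) ;
if ( globalerror >= 0 ) {
   fprintf(msgFile, "\n fatal error at front %d\n", globalerror) ;
   MPI_Finalize() ;
   exit(-1) ;
}
\end{verbatim}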
\par
The post-processing of the factorization is the same in principle
as in the serial code but differs in that it uses the distributed
data structures.
\begin{verbatim}
FrontMtx_MPI_postProcess(frontmtx, ownersIV, stats, msglvl,
                         msgFile, firsttag, MPI_COMM_WORLD) ;
\end{verbatim}
After the post-processing step, each local {\tt FrontMtx} object
contains the $L_{J,I}$, $D_{I,I}$ and $U_{I,J}$ submatrices
for the fronts that were owned by the particular processor.
However, the parallel solve is based on the submatrices being
distributed across the processors, not just the fronts.
\par
We must specify which processors own which submatrices,
and so perform computations with them.
This is done by constructing a {\it ``solve--map''} object,
as we see below.
\begin{verbatim}
solvemap = SolveMap_new() ;
SolveMap_ddMap(solvemap, symmetryflag,
               FrontMtx_upperBlockIVL(frontmtx),
               FrontMtx_lowerBlockIVL(frontmtx), nproc, ownersIV,
               FrontMtx_frontTree(frontmtx), seed,
               msglvl, msgFile) ;
\end{verbatim}
This object also uses a domain decomposition map, the only solve
map presently found in the {\bf SPOOLES} library.
\par
Once the solve map has been created (note that it is identical
across all the processors), we redistribute the submatrices
with the following code fragment.
\begin{verbatim}
FrontMtx_MPI_split(frontmtx, solvemap, stats, msglvl, msgFile,
                   firsttag, MPI_COMM_WORLD) ;
\end{verbatim}
At this point,
the submatrices that a processor owns are local to that processor.
\par
\subsection{The Forward and Backsolves}
\label{subsection:MPI:solve}
\par
If pivoting has been performed for numerical stability, then the
rows of $PY$ may not be located on the processor that needs them.
We must perform an additional redistribution of the local
{\tt DenseMtx} objects that hold $PY$, as the code fragment below
illustrates.
\begin{verbatim}
if ( FRONTMTX_IS_PIVOTING(frontmtx) ) {
   IV   *rowmapIV ;
/*
   ----------------------------------------------------------
   pivoting has taken place, redistribute the right hand side
   to match the final rows and columns in the fronts
   ----------------------------------------------------------
*/
   rowmapIV = FrontMtx_MPI_rowmapIV(frontmtx, ownersIV, msglvl,
                                    msgFile, MPI_COMM_WORLD) ;
   newY = DenseMtx_MPI_splitByRows(mtxY, rowmapIV, stats, msglvl,
                                   msgFile, firsttag,
                                   MPI_COMM_WORLD) ;
   DenseMtx_free(mtxY) ;
   mtxY = newY ;
   IV_free(rowmapIV) ;
}
\end{verbatim}
\par
Each processor now must create a local {\tt DenseMtx} object
to hold the rows of $PX$ that it owns.
\begin{verbatim}
ownedColumnsIV = FrontMtx_ownedColumnsIV(frontmtx, myid, ownersIV,
                                         msglvl, msgFile) ;
nmycol = IV_size(ownedColumnsIV) ;
mtxX = DenseMtx_new() ;
if ( nmycol > 0 ) {
   DenseMtx_init(mtxX, type, 0, 0, nmycol, nrhs, 1, nmycol) ;
   DenseMtx_rowIndices(mtxX, &nrow, &rowind) ;
   IVcopy(nmycol, rowind, IV_entries(ownedColumnsIV)) ;
}
\end{verbatim}
If $A$ is symmetric, or if pivoting for stability was not used,
then {\tt mtxX} can just be a pointer to {\tt mtxY}, i.e.,
$PX$ could overwrite $PY$.
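\par
For example, a driver that never enables pivoting could skip the
separate solution matrix entirely.
The fragment below is our own sketch of that shortcut, not part of
the sample program.
\begin{verbatim}
if ( FRONTMTX_IS_PIVOTING(frontmtx) ) {
   /* pivoting may have moved rows between processors :
      build a separate mtxX from the owned columns, as above */
} else {
   /* without pivoting the rows of PX and PY coincide,
      so the solve may overwrite the right hand side */
   mtxX = mtxY ;
}
\end{verbatim}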
\par
The parallel solve is remarkably similar to the serial solve,
as we see with the code fragment below.
\begin{verbatim}
solvemanager = SubMtxManager_new() ;
SubMtxManager_init(solvemanager, NO_LOCK, 0) ;
FrontMtx_MPI_solve(frontmtx, mtxX, mtxY, solvemanager, solvemap,
                   cpus, stats, msglvl, msgFile, firsttag,
                   MPI_COMM_WORLD) ;
SubMtxManager_free(solvemanager) ;
\end{verbatim}
The only difference between the multithreaded and MPI solve
methods is the presence of the {\tt firsttag} and MPI communicator
in the latter.
\par
The last step is to permute the rows of the local solution matrix
back into the original matrix ordering.
We also gather all the solution entries into one {\tt DenseMtx}
object on processor zero:
filling {\tt vtxmapIV} with zeroes maps every row to processor
zero, so the split operation collects the entire solution there.
\begin{verbatim}
DenseMtx_permuteRows(mtxX, newToOldIV) ;
IV_fill(vtxmapIV, 0) ;
firsttag++ ;
mtxX = DenseMtx_MPI_splitByRows(mtxX, vtxmapIV, stats, msglvl,
                                msgFile, firsttag, MPI_COMM_WORLD) ;
\end{verbatim}
\par
\subsection{Sample Matrix and Right Hand Side Files}
\label{subsection:MPI:input-files}
\par
The files below distribute the matrix and right hand side of a
small example over four processors.
Each matrix file begins with the global dimensions and the number
of local entries; only upper triangular entries are stored because
the matrix is symmetric.
Each right hand side file begins with the number of local rows and
the number of right hand sides.
\begin{center}
\begin{tabular}{|l||l||l||l||}
\hline
{\tt matrix.0.input} &
{\tt matrix.1.input} &
{\tt matrix.2.input} &
{\tt matrix.3.input} \\
\begin{minipage}[t]{1 in}
\begin{verbatim}
9 9 6
0 0 4.0
0 1 -1.0
0 3 -1.0
1 1 4.0
1 2 -1.0
1 4 -1.0
\end{verbatim}
\end{minipage}
&
\begin{minipage}[t]{1 in}
\begin{verbatim}
9 9 5
2 2 4.0
2 5 -1.0
3 3 4.0
3 4 -1.0
3 6 -1.0
\end{verbatim}
\end{minipage}
&
\begin{minipage}[t]{1 in}
\begin{verbatim}
9 9 7
4 4 4.0
4 5 -1.0
4 7 -1.0
5 5 4.0
5 8 -1.0
6 6 4.0
6 7 -1.0
\end{verbatim}
\end{minipage}
&
\begin{minipage}[t]{1 in}
\begin{verbatim}
9 9 3
7 7 4.0
7 8 -1.0
8 8 4.0
\end{verbatim}
\end{minipage}
\\
\hline
\hline
{\tt rhs.0.input} &
{\tt rhs.1.input} &
{\tt rhs.2.input} &
{\tt rhs.3.input} \\
\begin{minipage}[t]{1 in}
\begin{verbatim}
2 1
0 0.0
1 0.0
\end{verbatim}
\end{minipage}
&
\begin{minipage}[t]{1 in}
\begin{verbatim}
2 1
2 0.0
3 0.0
\end{verbatim}
\end{minipage}
&
\begin{minipage}[t]{1 in}
\begin{verbatim}
2 1
4 1.0
5 0.0
\end{verbatim}
\end{minipage}
&
\begin{minipage}[t]{1 in}
\begin{verbatim}
3 1
6 0.0
7 0.0
8 0.0
\end{verbatim}
\end{minipage}
\\
\hline
\end{tabular}
\end{center}
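\par
These files correspond to a run on four processors.
The exact launch command depends on the local MPI installation and
on the input parameters of Section~\ref{subsection:MPI:input-data};
with a hypothetical executable name {\tt testLU}, a typical
invocation would look like the following.
\begin{verbatim}
mpirun -np 4 testLU <input parameters>
\end{verbatim}
Processor $q$ then opens {\tt matrix.}$q${\tt .input} and
{\tt rhs.}$q${\tt .input}, and any entries that appear in more
than one file are summed when the entries are assembled during
{\tt InpMtx\_MPI\_split()}.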