\par
\vfill \eject
\section{MPI Solution of $A X = Y$ using an $LU$ factorization}
\label{section:LU-MPI}
\par
Unlike the serial and multithreaded environments, where the data
structures are global and exist under one address space,
in the MPI environment data is local: each process or processor
has its own distinct address space.
The step-by-step process to solve a linear system with MPI is
the same as in the multithreaded case, with the additional
complication that the data structures are distributed and must be
redistributed as the computation proceeds.
\par
The ownership of the factor matrices during the factorization and
solves is exactly the same as for the multithreaded case -- the
map from fronts to processors and the map from submatrices to
processors are identical to their counterparts in the multithreaded
program.
What is different is the explicit message passing of data
structures between processors.
Luckily, most of this is hidden from the user code.
\par
We will now begin to work our way through the program
found in Section~\ref{section:LU-MPI-driver}
to illustrate the use of {\bf SPOOLES} to solve a system
of linear equations in the MPI environment.
\par
\subsection{Reading the input parameters}
\label{subsection:MPI:input-data}
\par
This step is identical to the serial code, as described in
Section~\ref{subsection:serial:input-data}, with the exception
that the file names for $A$ and $Y$ are hardcoded in the driver,
and so are not part of the input parameters.
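\par
The code fragments in the rest of this section use the process rank
{\tt myid} and the number of processes {\tt nproc}.
These come from the usual MPI start-up calls; a minimal sketch is
shown below (the variable names are ours and need not match the
driver exactly).
\begin{verbatim}
MPI_Init(&argc, &argv) ;
MPI_Comm_rank(MPI_COMM_WORLD, &myid) ;
MPI_Comm_size(MPI_COMM_WORLD, &nproc) ;
\end{verbatim}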
\par
\subsection{Communicating the data for the problem}
\label{subsection:MPI:communicating-data}
\par
This step is similar to the serial code described in
Section~\ref{subsection:serial:communicating-data}.
In the serial and multithreaded codes, the entire matrix $A$ was
read in from one file and placed into one {\tt InpMtx} object.
In the MPI environment, it need not be the case that one
processor holds the entire matrix $A$.
(In fact, $A$ must be distributed across the processors during the
factorization.)
\par
Each processor opens a matrix file, (possibly) reads in matrix
entries, and creates its {\it local} {\tt InpMtx} object that holds
the matrix entries it has read in.
We have hardcoded the file names: processor $q$ reads
its matrix entries from file {\tt matrix.}$q${\tt .input}
and
its right hand side entries from file {\tt rhs.}$q${\tt .input}.
The file formats are the same as for the serial and multithreaded
drivers.
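\par
For illustration, the fragment below sketches how processor {\tt myid}
might read its local matrix file.
It assumes real entries and the entry-by-entry input method
{\tt InpMtx\_inputRealEntry()}; the variable names are ours, so
consult the driver in Section~\ref{section:LU-MPI-driver} for the
exact code.
\begin{verbatim}
sprintf(fileName, "matrix.%d.input", myid) ;
inputFile = fopen(fileName, "r") ;
fscanf(inputFile, "%d %d %d", &nrow, &ncol, &nent) ;
mtxA = InpMtx_new() ;
InpMtx_init(mtxA, INPMTX_BY_ROWS, type, nent, 0) ;
for ( ient = 0 ; ient < nent ; ient++ ) {
   fscanf(inputFile, "%d %d %lf", &irow, &jcol, &value) ;
   InpMtx_inputRealEntry(mtxA, irow, jcol, value) ;
}
fclose(inputFile) ;
\end{verbatim}
The right hand side file {\tt rhs.}$q${\tt .input} is read in an
analogous way into the local {\tt DenseMtx} object {\tt mtxY}.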
\par
The entries need not be partitioned across the files.
For example, each processor could read in entries for disjoint sets
of finite elements.
Naturally, some degrees of freedom will have support on elements
that are found on different processors.
When the entries in $A$ and $Y$ are mapped to processors, the
matrix entries will be assembled automatically.
\par
It could be the case that the matrix $A$ and right hand side $Y$
are read in by one processor.
(This was the approach we took with the {\tt LinSol} wrapper objects.)
There must still be input files for the other processors,
each with zeroes on its first (and only) line,
to specify that no entries are to be read.
\par
\subsection{Reordering the linear system}
\label{subsection:MPI:reordering}
\par
The first part is very similar to the serial code, as described in
Section~\ref{subsection:serial:reordering}.
\begin{verbatim}
graph = Graph_new() ;
adjIVL = InpMtx_MPI_fullAdjacency(mtxA, stats, msglvl, msgFile,
                                  MPI_COMM_WORLD) ;
nedges = IVL_tsize(adjIVL) ;
Graph_init2(graph, 0, neqns, 0, nedges, neqns, nedges, adjIVL, NULL, NULL) ;
frontETree = orderViaMMD(graph, seed + myid, msglvl, msgFile) ;
\end{verbatim}
While the data and computations are distributed across the
processors, the ordering process is not.
Therefore we need a global graph on each processor.
Since the matrix $A$ is distributed across the processors,
we use the distributed {\tt InpMtx\_MPI\_fullAdjacency()} method
to construct the {\tt IVL} object of the graph of $A + A^T$.
\par
At this point, each processor has computed its own minimum degree
ordering and created a front tree object.
The orderings will likely differ, because each processor
passed a different random number seed to the ordering method.
Only one ordering can be used for the factorization, so the
processors collectively determine which of the orderings is best
(the one with the fewest factor operations), and that ordering is
broadcast to all the processors, as the code fragment below
illustrates.
\begin{verbatim}
opcounts = DVinit(nproc, 0.0) ;
opcounts[myid] = ETree_nFactorOps(frontETree, type, symmetryflag) ;
MPI_Allgather((void *) &opcounts[myid], 1, MPI_DOUBLE,
              (void *) opcounts, 1, MPI_DOUBLE, MPI_COMM_WORLD) ;
minops = DVmin(nproc, opcounts, &root) ;
DVfree(opcounts) ;
frontETree = ETree_MPI_Bcast(frontETree, root, msglvl, msgFile,
                             MPI_COMM_WORLD) ;
\end{verbatim}
\par
\subsection{Non-numeric work}
\label{subsection:MPI:non-numeric}
\par
Once the front tree is replicated across the processors, we obtain
the permutation vectors and permute the vertices in the front tree.
The local matrices for $A$ and $Y$ are also permuted.
These steps are identical to the serial and multithreaded drivers,
except that the local, rather than global, $A$ and $Y$ matrices
are permuted.
\begin{verbatim}
oldToNewIV = ETree_oldToNewVtxPerm(frontETree) ;
newToOldIV = ETree_newToOldVtxPerm(frontETree) ;
ETree_permuteVertices(frontETree, oldToNewIV) ;
InpMtx_permute(mtxA, IV_entries(oldToNewIV),
               IV_entries(oldToNewIV)) ;
if ( symmetryflag == SPOOLES_SYMMETRIC || symmetryflag == SPOOLES_HERMITIAN ) {
   InpMtx_mapToUpperTriangle(mtxA) ;
}
InpMtx_changeCoordType(mtxA, INPMTX_BY_CHEVRONS) ;
InpMtx_changeStorageMode(mtxA, INPMTX_BY_VECTORS) ;
DenseMtx_permuteRows(mtxY, oldToNewIV) ;
\end{verbatim}
\par
The next step is to obtain the map from fronts to processors,
just as was done in the multithreaded driver.
In addition, we need a map from vertices to processors to be able
to distribute the matrix $A$ and right hand side $Y$ as necessary.
Since we have the map from vertices to fronts inside the front tree
object, the vertex map is easy to determine.
\begin{verbatim}
cutoff   = 1./(2*nproc) ;
cumopsDV = DV_new() ;
DV_init(cumopsDV, nproc, NULL) ;
ownersIV = ETree_ddMap(frontETree, type, symmetryflag, cumopsDV, cutoff) ;
DV_free(cumopsDV) ;
vtxmapIV = IV_new() ;
IV_init(vtxmapIV, neqns, NULL) ;
IVgather(neqns, IV_entries(vtxmapIV), IV_entries(ownersIV),
         ETree_vtxToFront(frontETree)) ;
\end{verbatim}
At this point we are ready to assemble and distribute the entries
of $A$ and $Y$.
\begin{verbatim}
firsttag = 0 ;
newA = InpMtx_MPI_split(mtxA, vtxmapIV, stats, msglvl, msgFile, firsttag,
                        MPI_COMM_WORLD) ;
InpMtx_free(mtxA) ;
mtxA = newA ;
InpMtx_changeStorageMode(mtxA, INPMTX_BY_VECTORS) ;
newY = DenseMtx_MPI_splitByRows(mtxY, vtxmapIV, stats, msglvl,
                                msgFile, firsttag, MPI_COMM_WORLD) ;
DenseMtx_free(mtxY) ;
mtxY = newY ;
\end{verbatim}
The {\tt InpMtx\_MPI\_split()} method assembles and redistributes
the matrix entries by the vectors of the local matrix.
Recall that the coordinate type was set to chevrons above, as is
needed for the assembly of the entries into the front matrices.
The method returns a new {\tt InpMtx} object that contains the part
of $A$ that is needed by the processor.
The old {\tt InpMtx} object is freed and the new one takes its place.
\par
Now we are ready to compute the symbolic factorization, but it too
must be done in a distributed manner.
\begin{verbatim}
symbfacIVL = SymbFac_MPI_initFromInpMtx(frontETree, ownersIV, mtxA,
                    stats, msglvl, msgFile, firsttag, MPI_COMM_WORLD) ;
\end{verbatim}
The {\tt symbfacIVL} object on a particular processor is only a
subset of the global symbolic factorization, containing only what
that processor needs to compute its part of the factorization.
\par
\subsection{The Matrix Factorization}
\label{subsection:MPI:factor}
\par
In contrast to the multithreaded environment, data structures are
local to a processor, and so locks are not needed to manage access
to critical regions of code.
The initialization of the front matrix and submatrix manager
objects is much like the serial case, with one exception.
\par
\begin{verbatim}
mtxmanager = SubMtxManager_new() ;
SubMtxManager_init(mtxmanager, NO_LOCK, 0) ;
frontmtx = FrontMtx_new() ;
FrontMtx_init(frontmtx, frontETree, symbfacIVL, type, symmetryflag,
              FRONTMTX_DENSE_FRONTS, pivotingflag, NO_LOCK, myid,
              ownersIV, mtxmanager, msglvl, msgFile) ;
\end{verbatim}
Note that the ninth and tenth arguments are {\tt myid} and {\tt
ownersIV}, not {\tt 0} and {\tt NULL} as for the serial and
multithreaded drivers.
These arguments tell the front matrix object
that it needs to initialize only
those parts of the factor matrices that it ``owns'',
which are given by the map from fronts to processors
and the processor id.
\par
The numeric factorization is performed by the
{\tt FrontMtx\_MPI\_factorInpMtx()} method.
The code segment from the sample program for the numerical
factorization step is found below.
\begin{verbatim}
chvmanager = ChvManager_new() ;
ChvManager_init(chvmanager, NO_LOCK, 0) ;
rootchv = FrontMtx_MPI_factorInpMtx(frontmtx, mtxA, tau, droptol,
                     chvmanager, ownersIV, lookahead, &error, cpus,
                     stats, msglvl, msgFile, firsttag, MPI_COMM_WORLD) ;
ChvManager_free(chvmanager) ;
\end{verbatim}
Note that the {\tt ChvManager} is not locked.
The calling sequence is identical to that of
the multithreaded factorization except for the addition of the {\tt
firsttag} and MPI communicator at the end.
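\par
Whether the factorization succeeded should be checked before
proceeding.
The sketch below follows the pattern of the serial driver and is our
assumption about the error handling, not part of the fragment above:
a non-{\tt NULL} {\tt rootchv} signals that the matrix was found to be
singular, and a non-negative {\tt error} gives the front at which the
factorization failed.
\begin{verbatim}
if ( rootchv != NULL ) {
   fprintf(msgFile, "\n\n matrix found to be singular\n") ;
   MPI_Finalize() ;
   exit(-1) ;
}
if ( error >= 0 ) {
   fprintf(msgFile, "\n\n fatal error at front %d\n", error) ;
   MPI_Finalize() ;
   exit(-1) ;
}
\end{verbatim}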
\par
The post-processing of the factorization is the same in principle
as in the serial code, but differs in that it uses the distributed
data structures.
\begin{verbatim}
FrontMtx_MPI_postProcess(frontmtx, ownersIV, stats, msglvl,
                         msgFile, firsttag, MPI_COMM_WORLD) ;
\end{verbatim}
After the post-processing step, each local {\tt FrontMtx} object
contains the $L_{J,I}$, $D_{I,I}$ and $U_{I,J}$ submatrices
for the fronts owned by that particular processor.
However, the parallel solve is based on the submatrices being
distributed across the processors, not just the fronts.
\par
We must specify which processors own which submatrices,
and so perform computations with them.
This is done by constructing a {\it ``solve map''} object,
as we see below.
\begin{verbatim}
solvemap = SolveMap_new() ;
SolveMap_ddMap(solvemap, symmetryflag, FrontMtx_upperBlockIVL(frontmtx),
               FrontMtx_lowerBlockIVL(frontmtx), nproc, ownersIV,
               FrontMtx_frontTree(frontmtx), seed, msglvl, msgFile) ;
\end{verbatim}
This object also uses a domain decomposition map, the only solve map
presently found in the {\bf SPOOLES} library.
\par
Once the solve map has been created (note that it is identical
across all the processors), we redistribute the submatrices
with the following code fragment.
\begin{verbatim}
FrontMtx_MPI_split(frontmtx, solvemap, stats, msglvl, msgFile,
                   firsttag, MPI_COMM_WORLD) ;
\end{verbatim}
At this point,
the submatrices that a processor owns are local to that processor.
\par
\subsection{The Forward and Backsolves}
\label{subsection:MPI:solve}
\par
If pivoting has been performed for numerical stability, then the
rows of $PY$ may not be located on the processor that needs them.
We must perform an additional redistribution of the local
{\tt DenseMtx} objects that hold $PY$, as the code fragment below
illustrates.
\begin{verbatim}
if ( FRONTMTX_IS_PIVOTING(frontmtx) ) {
   IV   *rowmapIV ;
/*
   ----------------------------------------------------------
   pivoting has taken place, redistribute the right hand side
   to match the final rows and columns in the fronts
   ----------------------------------------------------------
*/
   rowmapIV = FrontMtx_MPI_rowmapIV(frontmtx, ownersIV, msglvl,
                                    msgFile, MPI_COMM_WORLD) ;
   newY = DenseMtx_MPI_splitByRows(mtxY, rowmapIV, stats, msglvl,
                                   msgFile, firsttag, MPI_COMM_WORLD) ;
   DenseMtx_free(mtxY) ;
   mtxY = newY ;
   IV_free(rowmapIV) ;
}
\end{verbatim}
\par
Each processor now must create a local {\tt DenseMtx} object
to hold the rows of $PX$ that it owns.
\begin{verbatim}
ownedColumnsIV = FrontMtx_ownedColumnsIV(frontmtx, myid, ownersIV,
                                         msglvl, msgFile) ;
nmycol = IV_size(ownedColumnsIV) ;
mtxX = DenseMtx_new() ;
if ( nmycol > 0 ) {
   DenseMtx_init(mtxX, type, 0, 0, nmycol, nrhs, 1, nmycol) ;
   DenseMtx_rowIndices(mtxX, &nrow, &rowind) ;
   IVcopy(nmycol, rowind, IV_entries(ownedColumnsIV)) ;
}
\end{verbatim}
If $A$ is symmetric, or if pivoting for stability was not used,
then {\tt mtxX} can just be a pointer to {\tt mtxY}, i.e.,
$PX$ could overwrite $PY$.
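\par
For example, when pivoting is not enabled one could skip the creation
of a separate {\tt mtxX} above and simply alias the two objects.
The sketch below is a hypothetical variation on the driver, not code
it actually contains, and for simplicity it keeps a separate
{\tt mtxX} whenever pivoting is enabled, even in the symmetric case.
\begin{verbatim}
if ( FRONTMTX_IS_PIVOTING(frontmtx) ) {
   /* rows of PX and PY may differ: keep the separate mtxX */
} else {
   /* no pivoting: let the solution overwrite the right hand side */
   mtxX = mtxY ;
}
\end{verbatim}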
\par
The parallel solve is remarkably similar to the serial solve,
as we see with the code fragment below.
\begin{verbatim}
solvemanager = SubMtxManager_new() ;
SubMtxManager_init(solvemanager, NO_LOCK, 0) ;
FrontMtx_MPI_solve(frontmtx, mtxX, mtxY, solvemanager, solvemap, cpus,
                   stats, msglvl, msgFile, firsttag, MPI_COMM_WORLD) ;
SubMtxManager_free(solvemanager) ;
\end{verbatim}
The only difference between the multithreaded and MPI solve
methods is the presence of the first tag and MPI communicator
in the latter.
\par
The last step is to permute the rows of the local solution matrix
into the original matrix ordering.
We also gather all the solution entries into one {\tt DenseMtx}
object on processor zero.
\begin{verbatim}
DenseMtx_permuteRows(mtxX, newToOldIV) ;
IV_fill(vtxmapIV, 0) ;
firsttag++ ;
mtxX = DenseMtx_MPI_splitByRows(mtxX, vtxmapIV, stats, msglvl, msgFile,
                                firsttag, MPI_COMM_WORLD) ;
\end{verbatim}
\par
\subsection{Sample Matrix and Right Hand Side Files}
\label{subsection:MPI:input-files}
\par
\begin{center}
\begin{tabular}{|l||l||l||l||}
\hline
{\tt matrix.0.input} &
{\tt matrix.1.input} &
{\tt matrix.2.input} &
{\tt matrix.3.input} \\
\begin{minipage}[t]{1 in}
\begin{verbatim}
9 9 6
0 0  4.0
0 1 -1.0
0 3 -1.0
1 1  4.0
1 2 -1.0
1 4 -1.0
\end{verbatim}
\end{minipage}
&
\begin{minipage}[t]{1 in}
\begin{verbatim}
9 9 5
2 2  4.0
2 5 -1.0
3 3  4.0
3 4 -1.0
3 6 -1.0
\end{verbatim}
\end{minipage}
&
\begin{minipage}[t]{1 in}
\begin{verbatim}
9 9 7
4 4  4.0
4 5 -1.0
4 7 -1.0
5 5  4.0
5 8 -1.0
6 6  4.0
6 7 -1.0

\end{verbatim}
\end{minipage}
&
\begin{minipage}[t]{1 in}
\begin{verbatim}
9 9 3
7 7  4.0
7 8 -1.0
8 8  4.0
\end{verbatim}
\end{minipage}
\\
\hline
\hline
{\tt rhs.0.input} &
{\tt rhs.1.input} &
{\tt rhs.2.input} &
{\tt rhs.3.input} \\
\begin{minipage}[t]{1 in}
\begin{verbatim}
2 1
0 0.0
1 0.0
\end{verbatim}
\end{minipage}
&
\begin{minipage}[t]{1 in}
\begin{verbatim}
2 1
2 0.0
3 0.0
\end{verbatim}
\end{minipage}
&
\begin{minipage}[t]{1 in}
\begin{verbatim}
2 1
4  1.0
5  0.0
\end{verbatim}
\end{minipage}
&
\begin{minipage}[t]{1 in}
\begin{verbatim}
3 1
6  0.0
7  0.0
8  0.0

\end{verbatim}
\end{minipage}
\\
\hline
\end{tabular}
\end{center}