\chapter{{\tt MT} directory}
\label{chapter:MT}
\par
All methods that use multithreaded function calls
are found in this directory.
Three functionalities are presently supported:
matrix-matrix multiplies, sparse factorizations, and solves.
\par
The multithreaded methods to compute $Y := Y + \alpha A X$,
$Y := Y + \alpha A^T X$ and $Y := Y + \alpha A^H X$
are simple.
Their calling sequences are almost identical to their
serial counterparts: global data structures for $Y$,
$\alpha$, $A$ and $X$ are followed by the number of threads,
a message level and a message file.
Thread $q$ accesses part of $A$, part of $X$, and computes its own
$Y^q = \alpha A X$ using those entries of $A$ that it is
responsible for.
This work is done independently by all threads.
The global summation $Y := Y + \sum_q Y^q$ is done in serial mode
by the calling process.
\par
This approach is not scalable.
A better approach would be to explicitly partition $A$ into local
$A^q$ matrices, and use local $X^q$ and $Y^q$ to hold rows of $X$
and $Y$ that have support with $A^q$, as is done with the
distributed MPI matrix-matrix multiplies.
(With MPI there is added complexity since $X$ and $Y$ are distributed
among processors.)
\par
A matrix-matrix multiply does not exist in isolation.
For example, a block shifted eigensolver requires factorizations of
$A - \sigma B$ and multiplies using $A$ or $B$.
The data structure for the matrix that takes part in the multiply
needs to toggle back and forth between its forms for the factor and
multiply.
Managing this in a distributed environment is actually easier than
in a multithreaded environment, for $A$ and $B$ are already
distributed.
Our multithreaded factorization expects $A$ and $B$ in global form.
Insisting that $A$ and $B$ be partitioned as $A^q$ and $B^q$
matrices is too great a burden for a user who has no need for a
multithreaded matrix-matrix multiply.
Allowing the $A^q$ matrices to overlap or point into the global $A$
matrix in a persistent fashion is not cleanly possible; it would
require changes to the {\tt InpMtx} object.
\par
In the future we intend to provide a scalable multithreaded
matrix-matrix multiply.
It requires a more in-depth consideration of the issues involved
than we are able to give it at the present time.
\par
The multithreaded factorizations $A = LU$ and $A = QR$
are very similar to the serial factorizations,
in both the calling sequence visible to the user and
in the underlying code structure.
The only additional parameters in the calling sequence are a map
from the fronts to the threads that defines who does what
computation, and a {\it lookahead} parameter that allows some
ability to control and reduce the idle time during the factorization.
Inside the code, the deterministic post-order traversal of the
serial factorization is replaced by independent topological
traversals of the front tree.
It is the list and working storage data structures
(the {\tt ChvList}, {\tt ChvManager} and {\tt SubMtxManager} objects)
that have locks.
{\it What} is done is common code between the serial and
multithreaded environments; it is the choreography, i.e.,
{\it who} does what, that differs.
\par
Most of these same comments apply to the multithreaded solve methods.
The calling sequences of the serial and multithreaded solves
differ by one parameter, a {\tt SolveMap} object that maps the
submatrices of the factor matrix to the threads that will compute
with them.
76