\chapter{{\tt MT} directory}
\label{chapter:MT}
\par
All methods that use multithreaded function calls
are found in this directory.
Three functionalities are presently supported:
matrix-matrix multiplies, sparse factorizations, and solves.
\par
The multithreaded methods to compute $Y := Y + \alpha A X$,
$Y := Y + \alpha A^T X$ and $Y := Y + \alpha A^H X$
are simple.
Their calling sequences are almost identical to those of their
serial counterparts: the global data structures for $Y$,
$\alpha$, $A$ and $X$ are followed by the number of threads,
a message level and a message file.
Thread $q$ accesses part of $A$, part of $X$, and computes its own
$Y^q = \alpha A X$ using those entries of $A$ for which it is
responsible.
This work is done independently by all threads.
The global summation $Y := Y + \sum_q Y^q$ is done in serial mode
by the calling process.
\par
This approach is not scalable.
A better approach would be to explicitly partition $A$ into local
$A^q$ matrices, and to use local $X^q$ and $Y^q$ matrices to hold
the rows of $X$ and $Y$ that have support with $A^q$, as is done
with the distributed MPI matrix-matrix multiplies.
(With MPI there is added complexity, since $X$ and $Y$ are
distributed among the processors.)
\par
A matrix-matrix multiply does not exist in isolation.
For example, a block shifted eigensolver requires factorizations of
$A - \sigma B$ and multiplies using $A$ or $B$.
The data structure for the matrix that takes part in the multiply
needs to toggle back and forth between its forms for the factorization
and the multiply.
Managing this in a distributed environment is actually easier than
in a multithreaded environment, for $A$ and $B$ are already
distributed.
Our multithreaded factorization expects $A$ and $B$ in global form.
Insisting that $A$ and $B$ be partitioned into $A^q$ and $B^q$
matrices is too great a burden for a user who has no need for a
multithreaded matrix-matrix multiply.
Allowing the $A^q$ matrices to overlap or to point into the global
$A$ matrix in a persistent fashion is not cleanly possible without
changes to the {\tt InpMtx} object.
\par
In the future we intend to provide a scalable multithreaded
matrix-matrix multiply.
It requires a more in-depth consideration of the issues involved
than we are able to give it at the present time.
\par
The multithreaded factorizations $A = LU$ and $A = QR$
are very similar to the serial factorizations,
both in the calling sequence visible to the user and
in the underlying code structure.
The only additional parameters in the calling sequence are a map
from the fronts to the threads, which defines who does what
computation, and a {\it lookahead} parameter that provides some
ability to control and reduce the idle time during the factorization.
Inside the code, the deterministic post-order traversal of the
serial factorization is replaced by independent topological
traversals of the front tree.
It is the list and working storage data structures
(the {\tt ChvList}, {\tt ChvManager} and {\tt SubMtxManager} objects)
that have locks.
{\it What} is done is code common to the serial and
multithreaded environments; it is the choreography, i.e.,
{\it who} does what, that differs.
\par
Most of these same comments apply to the multithreaded solve methods.
The calling sequences of the serial and multithreaded solves
differ by one parameter: a {\tt SolveMap} object that maps the
submatrices of the factor matrix to the threads that will compute
with them.
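The thread-local accumulation scheme described above for
$Y := Y + \alpha A X$ --- each thread forms its own partial product
$Y^q$ from the rows of $A$ it owns, and the calling thread performs
the summation $Y := Y + \sum_q Y^q$ in serial --- can be sketched as
follows.
This is a minimal illustrative sketch using POSIX threads and a dense
$A$ for simplicity; it is not the {\tt SPOOLES} code, and the names
{\tt Task}, {\tt partialProduct} and {\tt mt\_mmm} are hypothetical.

```c
#include <pthread.h>
#include <stdlib.h>

/* Sketch only (not the SPOOLES API): each thread q computes its own
   partial product Y^q = alpha * A^q * X from the rows of A that it
   owns; the calling thread then accumulates Y := Y + sum_q Y^q in
   serial, mirroring the scheme described in the text.              */

#define N       6      /* order of the (dense, for simplicity) matrix A */
#define NTHREAD 3      /* number of worker threads                      */

typedef struct {
  const double *A, *X;  /* global A (N x N) and X (N x 1)   */
  double alpha;
  double Yq[N];         /* thread-local partial product Y^q */
  int rowStart, rowEnd; /* rows of A this thread owns       */
} Task;

static void *partialProduct(void *arg) {
  Task *t = (Task *)arg;
  for (int i = 0; i < N; i++) t->Yq[i] = 0.0;
  /* only the owned rows contribute to this thread's Y^q */
  for (int i = t->rowStart; i < t->rowEnd; i++)
    for (int j = 0; j < N; j++)
      t->Yq[i] += t->alpha * t->A[i*N + j] * t->X[j];
  return NULL;
}

/* Y := Y + alpha * A * X via NTHREAD independent partial products */
void mt_mmm(double Y[N], const double A[N*N], double alpha,
            const double X[N]) {
  pthread_t tid[NTHREAD];
  Task task[NTHREAD];
  int chunk = (N + NTHREAD - 1) / NTHREAD;
  for (int q = 0; q < NTHREAD; q++) {
    task[q].A = A;  task[q].X = X;  task[q].alpha = alpha;
    task[q].rowStart = q * chunk;
    task[q].rowEnd   = (q + 1) * chunk < N ? (q + 1) * chunk : N;
    pthread_create(&tid[q], NULL, partialProduct, &task[q]);
  }
  /* serial summation Y := Y + sum_q Y^q by the calling thread */
  for (int q = 0; q < NTHREAD; q++) {
    pthread_join(tid[q], NULL);
    for (int i = 0; i < N; i++) Y[i] += task[q].Yq[i];
  }
}
```

The serial reduction at the end is what makes this approach simple
but not scalable: the partial products require no locks, but the
accumulation cost is independent of the number of threads.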
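The replacement of the serial post-order traversal by independent
topological traversals of the front tree, with locks confined to the
shared bookkeeping structures, can also be sketched.
In the sketch below, a shared array of remaining-children counts
stands in for the locked list objects ({\tt ChvList} and friends):
it is the only state guarded by a mutex, while the "numerical" work
on each front needs no locks.
The tree, the owner map and all function names are hypothetical, and
the busy-wait loop is a simplification of the real scheduling.

```c
#include <pthread.h>

/* Sketch only (not SPOOLES code): NTHREAD threads each traverse the
   front tree independently, factoring a front J only when owner[J]
   matches their id and all children of J have been factored.  The
   shared nchildLeft[] array plays the role of the locked list and
   manager objects: it is the only mutex-protected state.           */

#define NFRONT  7
#define NTHREAD 3

static const int par[NFRONT]   = {2, 2, 6, 5, 5, 6, -1}; /* parent of each front */
static const int owner[NFRONT] = {0, 1, 0, 2, 1, 2, 0};  /* front-to-thread map  */

static int nchildLeft[NFRONT];         /* children not yet factored    */
static int order[NFRONT], norder = 0;  /* global completion order      */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *factorFronts(void *arg) {
  int myid = *(int *)arg, ndone = 0, nowned = 0, J;
  for (J = 0; J < NFRONT; J++) if (owner[J] == myid) nowned++;
  while (ndone < nowned) {              /* keep scanning until all owned */
    for (J = 0; J < NFRONT; J++) {      /* fronts have been factored     */
      int ready = 0;
      pthread_mutex_lock(&lock);
      if (owner[J] == myid && nchildLeft[J] == 0) {
        nchildLeft[J] = -1;             /* claim front J                 */
        ready = 1;
      }
      pthread_mutex_unlock(&lock);
      if (ready) {
        /* ... numerical factorization of front J would happen here ... */
        pthread_mutex_lock(&lock);
        order[norder++] = J;
        if (par[J] >= 0) nchildLeft[par[J]]--;  /* release the parent   */
        pthread_mutex_unlock(&lock);
        ndone++;
      }
    }
  }
  return NULL;
}
```

As in the text, {\it what} is done per front is identical to the
serial code; only the choreography --- which thread factors which
front, and when --- differs.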