\chapter{Using \libflame}
\label{chapter:using}

This chapter contains code examples that illustrate how to use \libflame in
an application.

\section{FLAME/C examples}
\label{sec:flamec-examples}

Let us begin with a small program that uses LAPACK.
Figure \ref{fig:fla-chol-orig} contains a C program that acquires a
matrix buffer along with its dimensions and strides, performs a Cholesky
factorization on the matrix, and then frees the memory associated with the
matrix buffer.

\begin{figure}[h]
\begin{Verbatim}[frame=single,framesep=2.5mm,xleftmargin=5mm,fontsize=\footnotesize]
int main( void )
{
    double* buffer;
    int     m, rs, cs;
    int     info;
    char    uplo = 'L';

    // Get the matrix buffer address, size, and row and column strides.
    get_matrix_info( &buffer, &m, &rs, &cs );

    // Compute the Cholesky factorization of the matrix, reading from and
    // updating the lower triangle. LAPACK assumes column-major storage
    // (rs == 1), so the column stride cs serves as the leading dimension.
    dpotrf_( &uplo, &m, buffer, &cs, &info );

    // Free the matrix buffer.
    free_matrix( buffer );

    return 0;
}
\end{Verbatim}
\caption{
A simple program that calls {\tt dpotrf()} from LAPACK.
}
\label{fig:fla-chol-orig}
\end{figure}

\noindent
The program is trivial in that it does nothing with the factored matrix
before exiting.
Furthermore, in most real-world programs the corresponding code would likely
appear inside a loop or a larger computation.
However, we keep things simple here to better illustrate the use
of \libflame functions.
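
The figures in this chapter call two application routines,
{\tt get\_matrix\_info()} and {\tt free\_matrix()}, whose definitions are not
shown because they depend on the application.
Readers who wish to compile and run the examples might start from a sketch
like the one below.
It is only an illustration under our own assumptions: the matrix order (the
hypothetical constant {\tt TEST\_M}), the test matrix, and column-major
storage ({\tt rs} $=1$, {\tt cs} $=m$) are all choices made for this sketch.
The matrix has $ m+1 $ on the diagonal and $ 1 $ elsewhere; it is symmetric
and strictly diagonally dominant with a positive diagonal, and therefore
positive definite, as a Cholesky factorization requires.

\begin{Verbatim}[frame=single,framesep=2.5mm,xleftmargin=5mm,fontsize=\footnotesize]
#include <assert.h>
#include <stdlib.h>

// Hypothetical matrix order used only for this sketch.
#define TEST_M 4

// Allocate an m x m symmetric positive definite test matrix in column-major
// order: m + 1 on the diagonal and 1 elsewhere. Strict diagonal dominance
// with a positive diagonal guarantees positive definiteness.
void get_matrix_info( double** buffer, int* m, int* rs, int* cs )
{
    int     i, j;
    int     n = TEST_M;
    double* a;

    assert( buffer != NULL && m != NULL && rs != NULL && cs != NULL );

    a = malloc( ( size_t ) n * n * sizeof( double ) );
    if ( a == NULL )
    {
        *buffer = NULL;
        *m = *rs = *cs = 0;
        return;
    }

    for ( j = 0; j < n; ++j )
        for ( i = 0; i < n; ++i )
            a[ i + j * n ] = ( i == j ? n + 1.0 : 1.0 );

    *buffer = a;
    *m      = n;
    *rs     = 1;  // elements within a column are adjacent
    *cs     = n;  // consecutive columns are n elements apart
}

// Release a buffer obtained from get_matrix_info().
void free_matrix( double* buffer )
{
    free( buffer );
}
\end{Verbatim}

The later figures that also request a blocksize would need a variant of
{\tt get\_matrix\_info()} with an additional {\tt int*} output argument; a
small fixed value would suffice for experimentation.
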

Now suppose we wish to modify the previous program to use the FLAME/C API
within \libflamens.
There are two general methods.
\begin{itemize}
\item
Create a \libflame object without a buffer and then attach the conventional
row- or column-major matrix buffer to the bufferless \libflame object.
This method almost always requires the fewest code changes in the
application.
\item
Modify the application so that the matrix is created natively along with
the \libflame object.
This requires the user to interface the application to the matrix data
within the object using various query routines.
This method often involves more work because many applications are written
to access matrix buffers directly, without any abstraction.
There are two strategies for implementing this method, and
depending on the nature of the application, one strategy may be more
appropriate than the other:
\begin{itemize}
\item
The matrix may be created and fully initialized, and then copied into a
\libflame object.
\item
The matrix may be created and initialized piecemeal, perhaps one block at
a time.
\end{itemize}
Regardless of whether the matrix is initialized in full or one submatrix at
a time, the user may use \flacopybuffertoobject to copy the data from
conventional column-major matrix arrays into \libflame objects.
\end{itemize}


\begin{figure}[t]
\begin{Verbatim}[frame=single,framesep=2.5mm,xleftmargin=5mm,commandchars=\\\{\},fontsize=\footnotesize]
\textcolor{red}{#include "FLAME.h"}

int main( void )
\{
    double* buffer;
    int     m, rs, cs;
    \textcolor{red}{FLA_Obj A;}

    // Initialize libflame.
    \textcolor{red}{FLA_Init();}

    // Get the matrix buffer address, size, and row and column strides.
    get_matrix_info( &buffer, &m, &rs, &cs );

    // Create an m x m double-precision libflame object without a buffer,
    // and then attach the matrix buffer to the object.
    \textcolor{red}{FLA_Obj_create_without_buffer( FLA_DOUBLE, m, m, &A );}
    \textcolor{red}{FLA_Obj_attach_buffer( buffer, rs, cs, &A );}

    // Compute the Cholesky factorization, storing to the lower triangle.
    \textcolor{red}{FLA_Chol( FLA_LOWER_TRIANGULAR, A );}

    // Free the object without freeing the matrix buffer.
    \textcolor{red}{FLA_Obj_free_without_buffer( &A );}

    // Free the matrix buffer.
    free_matrix( buffer );

    // Finalize libflame.
    \textcolor{red}{FLA_Finalize();}

    return 0;
\}
\end{Verbatim}
\caption{
The program from Figure \ref{fig:fla-chol-orig} modified to use \libflame
objects.
This example code illustrates the minimal changes needed to use the FLAME/C
API in a program that was originally designed to use the BLAS or LAPACK.
}
\label{fig:fla-chol-attach}
\end{figure}

The program in Figure \ref{fig:fla-chol-attach} uses the first method to
integrate \libflamens.
Changes relative to the original example are highlighted in red.
We start by inserting a {\tt \#include} directive for the \libflame header
file, {\tt FLAME.h}.
Before calling any other \libflame functions, we must first invoke \flainitns.
Next, we replace the invocation of {\tt dpotrf()} with four lines of
\libflame code.
First, an $ m \by m $ object {\tt A} of datatype \fladouble is created without
a buffer.
Then the matrix buffer \buffer is attached to the \libflame object, with
row and column strides \rs and \csns.
The Cholesky factorization is invoked on {\tt A} with \flacholns.
Finally, the object is released with \flaobjfreewithoutbufferns, which frees
the object but leaves the attached matrix buffer intact so that the
application can still free it.
The library is finalized with a call to \flafinalizens.
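
Because the buffer is attached rather than copied, {\tt FLA\_Chol()}
overwrites the lower triangle of the application's own array with the
Cholesky factor $ L $, so the factor can be read from \buffer after the call
and before the buffer is freed.
A simple sanity check is to compare $ L L^T $ with a copy of the original
matrix saved before the factorization.
The sketch below does this in plain C without any \libflame calls; the
function {\tt chol\_residual()} is hypothetical, assumes double-precision data
addressed as {\tt x[ i*rs + j*cs ]}, and reads only lower triangles, since
the factorization references only that part of the matrix.

\begin{Verbatim}[frame=single,framesep=2.5mm,xleftmargin=5mm,fontsize=\footnotesize]
#include <assert.h>

// Return the largest |(L*L^T)(i,j) - A(i,j)| over the lower triangle
// (i >= j) of two m x m matrices addressed as x[ i*rs + j*cs ].
// L is read from the lower triangle of l; entries above its diagonal
// are ignored, as is the strictly upper triangle of a.
double chol_residual( int m, const double* l, const double* a, int rs, int cs )
{
    double max_err = 0.0;
    int    i, j, k;

    assert( m >= 0 );
    for ( j = 0; j < m; ++j )
    {
        for ( i = j; i < m; ++i )
        {
            // (L*L^T)(i,j) is the sum of L(i,k) * L(j,k) for k = 0..j,
            // because L is lower triangular and j <= i.
            double sum = 0.0;
            double err;

            for ( k = 0; k <= j; ++k )
                sum += l[ i*rs + k*cs ] * l[ j*rs + k*cs ];

            err = sum - a[ i*rs + j*cs ];
            if ( err < 0.0 ) err = -err;
            if ( err > max_err ) max_err = err;
        }
    }
    return max_err;
}
\end{Verbatim}

For instance, after saving the lower triangle of \buffer into a second array
{\tt saved} with the same strides, one could call
{\tt chol\_residual( m, buffer, saved, rs, cs )} once {\tt FLA\_Chol()}
returns; a result that is small relative to the magnitude of the matrix
entries indicates a successful factorization.
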


\begin{figure}[t]
\begin{Verbatim}[frame=single,framesep=2.5mm,xleftmargin=5mm,commandchars=\\\{\},fontsize=\footnotesize]
\textcolor{red}{#include "FLAME.h"}

int main( void )
\{
    double* buffer;
    int     m, rs, cs;
    \textcolor{red}{FLA_Obj A;}

    // Initialize libflame.
    \textcolor{red}{FLA_Init();}

    // Get the matrix buffer address, size, and row and column strides.
    get_matrix_info( &buffer, &m, &rs, &cs );

    // Create an m x m double-precision libflame object.
    \textcolor{red}{FLA_Obj_create( FLA_DOUBLE, m, m, rs, cs, &A );}

    // Copy the contents of the conventional matrix into the libflame object.
    \textcolor{red}{FLA_Copy_buffer_to_object( FLA_NO_TRANSPOSE, m, m, buffer, rs, cs, 0, 0, A );}

    // Compute the Cholesky factorization, storing to the lower triangle.
    \textcolor{red}{FLA_Chol( FLA_LOWER_TRIANGULAR, A );}

    // Free the object.
    \textcolor{red}{FLA_Obj_free( &A );}

    // Free the matrix buffer.
    free_matrix( buffer );

    // Finalize libflame.
    \textcolor{red}{FLA_Finalize();}

    return 0;
\}
\end{Verbatim}
\caption{
The program from Figure \ref{fig:fla-chol-orig} modified to use
\libflame objects natively.
Rather than attaching the conventional matrix buffer to a bufferless
object, this code copies the matrix contents into the object using
\flacopybuffertoobjectns.
Note that the matrix is copied all at once, and thus here we assume that
the original matrix is fully initialized by {\tt get\_matrix\_info()}.
}
\label{fig:fla-chol-native1}
\end{figure}

The second method requires somewhat more extensive modifications to the
original program.
In Figure \ref{fig:fla-chol-native1}, we revise and extend the previous
example.
This program obtains the matrix buffer as before, but then creates a
\libflame object natively (with its own internal buffer) and copies the
contents of the conventional matrix into the \libflame object all at once.
Since the object has its own storage, the factorization overwrites the
object's copy of the matrix and leaves \buffer unchanged.
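
Conceptually, a call to \flacopybuffertoobject with {\tt FLA\_NO\_TRANSPOSE}
reads an $ m \by n $ block from a strided source buffer and writes it into
the object's storage so that the block's top-left element lands at element
$ (i,j) $ of the object.
The sketch below models that behavior in plain C, using a second strided
array as a stand-in for the object's internal buffer.
The name {\tt copy\_block()} and its argument order are our own; the sketch
illustrates the semantics only and is not the \libflame implementation.

\begin{Verbatim}[frame=single,framesep=2.5mm,xleftmargin=5mm,fontsize=\footnotesize]
#include <assert.h>

// Copy the m x n block at src (strides rs_s, cs_s) into dst (strides
// rs_d, cs_d) so that src(0,0) lands at dst(i0,j0). No transposition.
void copy_block( int m, int n,
                 const double* src, int rs_s, int cs_s,
                 double* dst, int rs_d, int cs_d,
                 int i0, int j0 )
{
    int i, j;

    assert( m >= 0 && n >= 0 && i0 >= 0 && j0 >= 0 );
    for ( j = 0; j < n; ++j )
        for ( i = 0; i < m; ++i )
            dst[ ( i0 + i ) * rs_d + ( j0 + j ) * cs_d ] =
                src[ i * rs_s + j * cs_s ];
}
\end{Verbatim}

Since the source and destination strides are independent, the same loop
also copies, for example, a row-major source into column-major storage.
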


\begin{figure}[t]
\begin{Verbatim}[frame=single,framesep=2.5mm,xleftmargin=5mm,commandchars=\\\{\},fontsize=\footnotesize]
\textcolor{red}{#include "FLAME.h"}

int main( void )
\{
    double* buffer;
    int     m, rs, cs\textcolor{red}{, b};
    \textcolor{red}{int     i, j;}
    \textcolor{red}{FLA_Obj A;}

    // Initialize libflame.
    \textcolor{red}{FLA_Init();}

    // Get the matrix buffer address, size, row and column strides, and block size.
    get_matrix_info( &buffer, &m, &rs, &cs\textcolor{red}{, &b} );

    // Create an m x m double-precision libflame object.
    \textcolor{red}{FLA_Obj_create( FLA_DOUBLE, m, m, rs, cs, &A );}

    // Acquire the conventional matrix one block at a time and copy these
    // blocks into the appropriate location within the libflame object.
    \textcolor{red}{for( j = 0; j < m; j += b )}
    \textcolor{red}{\{}
        \textcolor{red}{for( i = 0; i < m; i += b )}
        \textcolor{red}{\{}
            \textcolor{red}{double* ij_ptr;}
            \textcolor{red}{int     b_m, b_n;}

            // Compute the block dimensions, which are smaller than b for blocks
            // along the bottom and/or right edges of the overall matrix.
            \textcolor{red}{b_m = ( m - i < b ? m - i : b );}
            \textcolor{red}{b_n = ( m - j < b ? m - j : b );}

            // Get a pointer to the b_m x b_n block that starts at element (i,j).
            \textcolor{red}{ij_ptr = FLA_Submatrix_at( FLA_DOUBLE, buffer, i, j, rs, cs );}

            // Copy the current block into the correct location within the libflame object.
            \textcolor{red}{FLA_Copy_buffer_to_object( FLA_NO_TRANSPOSE, b_m, b_n, ij_ptr, rs, cs, i, j, A );}
        \textcolor{red}{\}}
    \textcolor{red}{\}}

    // Compute the Cholesky factorization, storing to the lower triangle.
    \textcolor{red}{FLA_Chol( FLA_LOWER_TRIANGULAR, A );}

    // Free the object.
    \textcolor{red}{FLA_Obj_free( &A );}

    // Free the matrix buffer.
    free_matrix( buffer );

    // Finalize libflame.
    \textcolor{red}{FLA_Finalize();}

    return 0;
\}
\end{Verbatim}
\caption{
The program from Figure \ref{fig:fla-chol-orig} modified to use FLAME/C
in a way that initializes a \libflame object incrementally, one block at a
time.
}
\label{fig:fla-chol-native2}
\end{figure}

Finally, Figure \ref{fig:fla-chol-native2} shows what a program might look
like if it were to use a native \libflame object but copy the data
one block at a time.
Here, we place \flacopybuffertoobject in a loop that copies a single
submatrix per iteration.
We use \flasubmatrixat to compute the starting address of the submatrix
whose top-left element is the $ (i,j) $ element of the overall matrix
stored in \bufferns.
When $ b $ does not evenly divide $ m $, the blocks along the bottom and
right edges of the matrix are smaller than $ b \by b $, which is why the
block dimensions {\tt b\_m} and {\tt b\_n} are recomputed on each iteration.
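
The blocked traversal in Figure \ref{fig:fla-chol-native2} can be checked
without \libflamens.
The sketch below replays the same loops over an $ m \by m $ matrix with
blocksize $ b $, sizes each block the same way, and forms the block's
starting address as {\tt buffer + i*rs + j*cs}, which is what we assume
\flasubmatrixat computes for double-precision data.
Instead of copying, the hypothetical function {\tt count\_block\_visits()}
increments every element it touches, so after one pass each element should
have been visited exactly once.

\begin{Verbatim}[frame=single,framesep=2.5mm,xleftmargin=5mm,fontsize=\footnotesize]
#include <assert.h>

// Replay the blocked traversal of an m x m matrix stored at buffer with
// strides rs and cs, adding 1 to every element of each b_m x b_n block.
// Afterwards each entry has grown by the number of blocks that covered it.
void count_block_visits( double* buffer, int m, int rs, int cs, int b )
{
    int i, j, ii, jj;

    assert( b > 0 );
    for ( j = 0; j < m; j += b )
    {
        for ( i = 0; i < m; i += b )
        {
            // Edge blocks are smaller when b does not divide m.
            int b_m = ( m - i < b ? m - i : b );
            int b_n = ( m - j < b ? m - j : b );

            // Address of element (i,j), the block's top-left corner.
            double* ij_ptr = buffer + i * rs + j * cs;

            // Within the block, element (ii,jj) uses the same strides.
            for ( jj = 0; jj < b_n; ++jj )
                for ( ii = 0; ii < b_m; ++ii )
                    ij_ptr[ ii * rs + jj * cs ] += 1.0;
        }
    }
}
\end{Verbatim}

Running it with $ m=7 $ and $ b=3 $, for example, exercises edge blocks of
sizes $ 3 \by 1 $, $ 1 \by 3 $, and $ 1 \by 1 $.
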

Note that \flacopybuffertoobject may also be used to copy one
row or column at a time.
Copying a single row or column is simply a special case of copying a
rectangular block: a row of the matrix is a $ 1 \by m $ block, and a column
is an $ m \by 1 $ block.
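
In terms of the loop in Figure \ref{fig:fla-chol-native2}, copying row $ i $
corresponds to passing dimensions $ 1 $ and $ m $ with offsets $ (i,0) $, and
copying column $ j $ to dimensions $ m $ and $ 1 $ with offsets $ (0,j) $.
The plain C sketch below, with hypothetical functions {\tt copy\_row()} and
{\tt copy\_col()}, gathers such a row or column from a strided buffer into a
contiguous array.
It highlights that with column-major storage the elements of a row are
{\tt cs} apart, while those of a column are adjacent.

\begin{Verbatim}[frame=single,framesep=2.5mm,xleftmargin=5mm,fontsize=\footnotesize]
#include <assert.h>

// Gather row i of a matrix with n columns (strides rs, cs) into out[0..n-1]:
// the 1 x n block whose top-left element is (i,0).
void copy_row( int n, const double* a, int rs, int cs, int i, double* out )
{
    int j;

    assert( i >= 0 && n >= 0 );
    for ( j = 0; j < n; ++j )
        out[ j ] = a[ i * rs + j * cs ];
}

// Gather column j of a matrix with m rows (strides rs, cs) into out[0..m-1]:
// the m x 1 block whose top-left element is (0,j).
void copy_col( int m, const double* a, int rs, int cs, int j, double* out )
{
    int i;

    assert( j >= 0 && m >= 0 );
    for ( i = 0; i < m; ++i )
        out[ i ] = a[ i * rs + j * cs ];
}
\end{Verbatim}
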




\section{FLASH examples}
\label{sec:flash-examples}

Let us now discuss how the \libflame programs in
Section \ref{sec:flamec-examples} can be converted to use the FLASH API.
Please see Section \ref{sec:flash} for a full discussion of FLASH, including
the motivation behind hierarchical objects and a summary of related
terminology.

%When using hierarchical objects, the user must consider how many levels
%to build into the object hierarchies.

In the previous section, we reviewed a program (Figure \ref{fig:fla-chol-attach})
that uses \libflame functions with an existing matrix buffer.
Figure \ref{fig:flash-chol-attach} shows what this program would look like
if we wished to use hierarchical objects.

\begin{figure}[h]
\begin{Verbatim}[frame=single,framesep=2.5mm,xleftmargin=5mm,commandchars=\\\{\},fontsize=\footnotesize]
#include "FLAME.h"

int main( void )
\{
    double* buffer;
    int     m, rs, cs\textcolor{red}{, b};
    FLA_Obj A;

    // Initialize libflame.
    FLA_Init();

    // Get the matrix buffer address, size, row and column strides, and blocksize.
    get_matrix_info( &buffer, &m, &rs, &cs\textcolor{red}{, &b} );

    // Create an m x m double-precision hierarchical object without a buffer,
    // of depth 1 and blocksize b, and then attach the matrix buffer to the object.
    FLA\textcolor{red}{SH}_Obj_create_without_buffer( FLA_DOUBLE, m, m, \textcolor{red}{1, &b,} &A );
    FLA\textcolor{red}{SH}_Obj_attach_buffer( buffer, rs, cs, &A );

    // Compute the Cholesky factorization, storing to the lower triangle.
    FLA\textcolor{red}{SH}_Chol( FLA_LOWER_TRIANGULAR, A );

    // Free the object without freeing the matrix buffer.
    FLA\textcolor{red}{SH}_Obj_free_without_buffer( &A );

    // Free the matrix buffer.
    free_matrix( buffer );

    // Finalize libflame.
    FLA_Finalize();

    return 0;
\}
\end{Verbatim}
\caption{
The program from Figure \ref{fig:fla-chol-attach} modified to use the
FLASH API.
}
\label{fig:flash-chol-attach}
\end{figure}

\noindent
As before, the changes from the corresponding FLAME/C code are highlighted
in red.
The only application-specific change is that a blocksize value is now
obtained from {\tt get\_matrix\_info()} for use in creating the
hierarchical object {\tt A}.
Apart from {\tt FLA\_Init()} and {\tt FLA\_Finalize()}, which are unchanged,
the \libflame function names are the same as in Figure
\ref{fig:fla-chol-attach} except that the prefix has changed from
{\tt FLA\_} to {\tt FLASH\_}.
Additionally, all of the function type signatures are the same, except
for that of \flashobjcreatewithoutbufferns.
This function takes two additional arguments: a depth, and an array of
blocksizes.\footnote{Since the depth is 1 in this example, we choose to
simply pass the address of the integer {\tt b} rather than create a separate
single-element array.}
The depth and the blocksize array together determine the details of the
object hierarchy.
Also note that since a conventional matrix buffer is being attached, the
submatrix blocks of the hierarchical object {\tt A} are not contiguous
in memory: each block retains the row and column strides of the original
buffer.
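
To see why, consider a depth-1 hierarchy with blocksize $ b $ over a
column-major buffer ({\tt rs} $=1$, {\tt cs} $=m$).
Leaf block $ (I,J) $ starts at element $ (Ib, Jb) $, that is, at offset
$ Ib \cdot \mathtt{rs} + Jb \cdot \mathtt{cs} $ from the start of \bufferns,
and within each leaf consecutive columns remain {\tt cs} elements apart.
The plain C sketch below, with hypothetical helpers {\tt leaf\_offset()} and
{\tt block\_is\_contiguous()}, computes these offsets and tests whether a
block addressed with given strides occupies a single contiguous range of
memory; to keep it short, it handles only column-major and row-major layouts.

\begin{Verbatim}[frame=single,framesep=2.5mm,xleftmargin=5mm,fontsize=\footnotesize]
#include <assert.h>

// Offset (in elements) of the top-left element of leaf block (I,J) in a
// depth-1 hierarchy with square blocksize b over a buffer with strides
// rs and cs.
long leaf_offset( int I, int J, int b, int rs, int cs )
{
    assert( I >= 0 && J >= 0 && b > 0 );
    return ( long ) I * b * rs + ( long ) J * b * cs;
}

// Return 1 if a b_m x b_n block addressed with strides rs and cs fills
// b_m * b_n consecutive elements, and 0 otherwise. Only column-major
// (rs == 1) and row-major (cs == 1) layouts are handled.
int block_is_contiguous( int b_m, int b_n, int rs, int cs )
{
    assert( rs == 1 || cs == 1 );
    if ( rs == 1 )                    // columns of the block are cs apart
        return b_n == 1 || cs == b_m;
    else                              // rows of the block are rs apart
        return b_m == 1 || rs == b_n;
}
\end{Verbatim}

With $ m=8 $ and $ b=4 $, a $ 4 \by 4 $ leaf with column stride 8 is not
contiguous, whereas the same block stored with column stride 4 is.
By contrast with the attached case, hierarchical objects created with
{\tt FLASH\_Obj\_create()} are, as we understand it, intended to store each
leaf block in its own buffer, which corresponds to the latter case.
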


\begin{figure}[t]
\begin{Verbatim}[frame=single,framesep=2.5mm,xleftmargin=5mm,commandchars=\\\{\},fontsize=\footnotesize]
#include "FLAME.h"

int main( void )
\{
    double* buffer;
    int     m, rs, cs\textcolor{red}{, b};
    FLA_Obj A;

    // Initialize libflame.
    FLA_Init();

    // Get the matrix buffer address, size, row and column strides, and blocksize.
    get_matrix_info( &buffer, &m, &rs, &cs\textcolor{red}{, &b} );

    // Create an m x m double-precision hierarchical object of depth 1 and blocksize b.
    FLA\textcolor{red}{SH}_Obj_create( FLA_DOUBLE, m, m, \textcolor{red}{1, &b,} &A );

    // Copy the contents of the conventional matrix into the hierarchical object.
    FLA\textcolor{red}{SH}_Copy_buffer_to_\textcolor{red}{hier}( m, m, buffer, rs, cs, 0, 0, A );

    // Compute the Cholesky factorization, storing to the lower triangle.
    FLA\textcolor{red}{SH}_Chol( FLA_LOWER_TRIANGULAR, A );

    // Free the object.
    FLA\textcolor{red}{SH}_Obj_free( &A );

    // Free the matrix buffer.
    free_matrix( buffer );

    // Finalize libflame.
    FLA_Finalize();

    return 0;
\}
\end{Verbatim}
\caption{
The program from Figure \ref{fig:fla-chol-native1} modified to use the
FLASH API.
}
\label{fig:flash-chol-native1}
\end{figure}

In similar fashion, we have modified the code in Figure \ref{fig:fla-chol-native1}
to use hierarchical objects, as shown in Figure \ref{fig:flash-chol-native1}.
The changes in this code are similar to those discussed for the previous example.
Note that {\tt FLASH\_Obj\_create()} takes a depth and an array of blocksizes
in place of the row and column strides accepted by {\tt FLA\_Obj\_create()}.
Also, while \flacopybuffertoobject accepts a transposition argument,
{\tt FLASH\_Copy\_buffer\_to\_hier()} does not, and thus we omit this
argument from the invocation of the latter function.
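
If an application does need to copy the transpose of a conventional matrix
into a hierarchical object, one possible workaround, assuming
{\tt FLASH\_Copy\_buffer\_to\_hier()} honors arbitrary row and column strides
as its {\tt rs} and {\tt cs} arguments suggest, is to swap the strides: an
$ m \by n $ column-major buffer ({\tt rs} $=1$, {\tt cs} $=m$) read as an
$ n \by m $ matrix with {\tt rs} $=m$ and {\tt cs} $=1$ is exactly its
transpose.
We have not shown this in the figures; the plain C sketch below, with
hypothetical helpers {\tt element\_at()} and
{\tt swapped\_strides\_give\_transpose()}, only demonstrates the stride
identity itself.

\begin{Verbatim}[frame=single,framesep=2.5mm,xleftmargin=5mm,fontsize=\footnotesize]
#include <assert.h>

// Element (i,j) of a matrix stored at buf with row stride rs and
// column stride cs.
double element_at( const double* buf, int i, int j, int rs, int cs )
{
    assert( i >= 0 && j >= 0 );
    return buf[ i * rs + j * cs ];
}

// Return 1 if viewing an m x n column-major buffer as an n x m matrix
// with swapped strides yields its transpose, i.e. B(j,i) == A(i,j).
int swapped_strides_give_transpose( const double* a, int m, int n )
{
    int i, j;

    for ( j = 0; j < n; ++j )
        for ( i = 0; i < m; ++i )
            if ( element_at( a, j, i, m, 1 ) != element_at( a, i, j, 1, m ) )
                return 0;
    return 1;
}
\end{Verbatim}

For the square matrix of Figure \ref{fig:flash-chol-native1}, this would
amount to passing {\tt cs} and {\tt rs} in place of {\tt rs} and {\tt cs};
since that matrix is symmetric, however, transposition makes no difference
there.
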

\begin{figure}[t]
\begin{Verbatim}[frame=single,framesep=2.5mm,xleftmargin=5mm,commandchars=\\\{\},fontsize=\footnotesize]
#include "FLAME.h"

int main( void )
\{
    double* buffer;
    int     m, rs, cs, b;
    int     i, j;
    FLA_Obj A;

    // Initialize libflame.
    FLA_Init();

    // Get the matrix buffer address, size, row and column strides, and blocksize.
    get_matrix_info( &buffer, &m, &rs, &cs, &b );

    // Create an m x m double-precision hierarchical object of depth 1 and blocksize b.
    FLA\textcolor{red}{SH}_Obj_create( FLA_DOUBLE, m, m, \textcolor{red}{1, &b,} &A );

    // Acquire the conventional matrix one block at a time and copy these
    // blocks into the appropriate location within the libflame object.
    for( j = 0; j < m; j += b )
    \{
        for( i = 0; i < m; i += b )
        \{
            double* ij_ptr;
            int     b_m, b_n;

            // Compute the block dimensions, which are smaller than b for blocks
            // along the bottom and/or right edges of the overall matrix.
            b_m = ( m - i < b ? m - i : b );
            b_n = ( m - j < b ? m - j : b );

            // Get a pointer to the b_m x b_n block that starts at element (i,j).
            ij_ptr = FLA_Submatrix_at( FLA_DOUBLE, buffer, i, j, rs, cs );

            // Copy the current block into the correct location within the libflame object.
            FLA\textcolor{red}{SH}_Copy_buffer_to_\textcolor{red}{hier}( b_m, b_n, ij_ptr, rs, cs, i, j, A );
        \}
    \}

    // Compute the Cholesky factorization, storing to the lower triangle.
    FLA\textcolor{red}{SH}_Chol( FLA_LOWER_TRIANGULAR, A );

    // Free the object.
    FLA\textcolor{red}{SH}_Obj_free( &A );

    // Free the matrix buffer.
    free_matrix( buffer );

    // Finalize libflame.
    FLA_Finalize();

    return 0;
\}
\end{Verbatim}
\caption{
The program from Figure \ref{fig:fla-chol-native2} modified to use
the FLASH API.
}
\label{fig:flash-chol-native2}
\end{figure}

In Figure \ref{fig:flash-chol-native2}, we show the code from Figure
\ref{fig:fla-chol-native2} modified to use hierarchical objects.
Once again, most of the differences are limited to changing the
function prefixes.
One other change deserves attention, though: the blocksize {\tt b} is now
also used when creating the object.
In the previous code, the blocksize was used only to determine the
sizes of the submatrices that were individually acquired and copied
into {\tt A}.
This code still uses the blocksize in this manner.
However, it also uses the same value to establish the size of the
leaf-level blocks in the hierarchical object.
It should be emphasized that {\tt FLASH\_Copy\_buffer\_to\_hier()} allows
the user to copy submatrices into the object whose sizes differ from
those of the underlying leaf-level blocks.
That is, the function is capable of handling copies that span multiple
block boundaries.
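
To illustrate what routing a copy across leaf-block boundaries involves, the
sketch below copies an arbitrary $ b_m \by b_n $ submatrix of a conventional
buffer into a simplified stand-in for a depth-1 hierarchical object: an array
of $ b \by b $ leaf blocks, each stored contiguously in column-major order,
with the blocks ordered by block column.
This layout, the requirement that $ b $ divide $ m $, and the names
{\tt blocked\_index()} and {\tt copy\_to\_blocked()} are simplifications of
our own and do not describe the internal data structures of \libflamens.
Each element is sent to whichever leaf block contains it, so the source
region may straddle any number of block boundaries.

\begin{Verbatim}[frame=single,framesep=2.5mm,xleftmargin=5mm,fontsize=\footnotesize]
#include <assert.h>

// Index of global element (i,j) of an m x m matrix stored as (m/b) x (m/b)
// contiguous column-major b x b leaf blocks, ordered by block column.
int blocked_index( int i, int j, int m, int b )
{
    int nb, I, J, ii, jj;

    assert( b > 0 && m % b == 0 );
    nb = m / b;   // leaf blocks per block row and per block column
    I  = i / b;   // block row of the leaf holding (i,j)
    J  = j / b;   // block column of that leaf
    ii = i % b;   // row within the leaf
    jj = j % b;   // column within the leaf
    return ( J * nb + I ) * b * b + ii + jj * b;
}

// Copy the b_m x b_n submatrix at src (strides rs, cs) into the blocked
// array so that src(0,0) lands at global element (i0,j0).
void copy_to_blocked( int b_m, int b_n, const double* src, int rs, int cs,
                      int i0, int j0, double* blocked, int m, int b )
{
    int i, j;

    for ( j = 0; j < b_n; ++j )
        for ( i = 0; i < b_m; ++i )
            blocked[ blocked_index( i0 + i, j0 + j, m, b ) ] =
                src[ i * rs + j * cs ];
}
\end{Verbatim}

For example, with $ m=6 $ and $ b=2 $, copying rows 0 through 2 and then rows
3 through 5 as two $ 3 \by 6 $ pieces splits the second block row between the
two copies, yet every element still ends up in the correct leaf block.
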

The key point we hope these simple examples convey is that the FLASH API
(1) provides an easy interface for creating and manipulating hierarchical
objects, and (2) is strikingly similar to the original FLAME/C API wherever
possible.



\section{SuperMatrix examples}
\label{sec:sm-examples}



