Name          Date         Size       #Lines   LOC
config/       03-Nov-2008  -          3,399    2,942
doc/          07-May-2022  -
m4/           03-May-2022  -          47,371   41,942
src/          03-May-2022  -          739,967  538,593
testing/      21-Nov-2008  -          358,761  245,617
LICENSE       22-Apr-2009  1.6 KiB    37       27
Makefile      03-May-2022  20.6 KiB   890      547
Makefile.m4   03-Nov-2008  5.8 KiB    238      193
README        03-Nov-2008  15.2 KiB   367      290
README.devel  03-Nov-2008  983        28       20
configure     21-Nov-2008  172.7 KiB  6,272    5,289
configure.ac  03-Nov-2008  4.1 KiB    154      135
make.conf     03-Nov-2008  196        9        2
make.cray     03-Nov-2008  1,018      43       39
make.gnu      03-Nov-2008  785        34       30
make.hp       03-Nov-2008  701        31       27
make.ia64     03-Nov-2008  673        31       27
make.ibm      03-Nov-2008  755        33       29
make.inc      03-May-2022  681        26       17
make.inc.in   03-Nov-2008  856        34       26
make.intel    03-Nov-2008  631        30       26
make.sun      03-Nov-2008  669        31       27

README

Table of Contents
  1. Introduction
  2. Content
  3. Installation
  4. Code Generation with M4 macro processor
  5. Testing
  6. Feedback

1. Introduction

   This library of routines is part of a reference implementation for
   the Dense and Banded BLAS routines, along with their Extended and
   Mixed Precision versions, as documented in Chapters 2 and 4 of the
   new BLAS Standard, which is available from:

	http://www.netlib.org/blas/blast-forum/

   EXTENDED PRECISION is only used internally; the input and output
       arguments remain the same as in the existing BLAS.  At present, we
       only allow Single, Double, or Extra internal precision.  Extra
       precision is implemented as double-double precision (128-bit total,
       106-bit significand).  The routines for the double-double precision
       basic arithmetic operations +, -, *, / were developed by David
       Bailey.

       We have designed all our routines assuming that single precision
       arithmetic is actually done in IEEE single precision (32 bits)
       and that double precision arithmetic is actually done in
       IEEE double precision (64 bits).  The routines also pass our
       tests on an Intel machine with 80-bit floating point registers.

   MIXED PRECISION permits some input/output arguments to be of
       different types (mixing real and complex) or precisions (mixing
       single and double).

   This implementation is a proof of concept, showing that the
   considerable complexity of the specification is actually
   implementable and testable with a manageable amount of software.
   We have not attempted to optimize performance, but our code should
   be as good as straightforward but careful code written by hand.


2. Content

   The BLAS Standard defines language bindings for Fortran 95, Fortran
   77, and C.  Here, we have only implemented the C version and
   provided a method for binding to one Fortran 77 ABI.

   In this initial release, we provide the following 17 routines:

    Level 1
	DOT (Inner product)
	SUM (Sum)
	AXPBY (Scaled vector accumulation)
	WAXPBY (Scaled vector addition)
    Level 2
	GEMV (General matrix vector product)
	GBMV (Banded matrix vector product)
	SYMV (Symmetric matrix vector product)
	SBMV (Symmetric banded matrix vector product)
	SPMV (Symmetric matrix vector product, packed format)
	HEMV (Hermitian matrix vector product)
	HBMV (Hermitian banded matrix vector product)
	HPMV (Hermitian matrix vector product, packed format)
	GE_SUM_MV (Summed matrix-vector product)
	TRSV (Triangular solve)
    Level 3
	GEMM (General matrix matrix product)
	SYMM (Symmetric matrix matrix product)
	HEMM (Hermitian matrix matrix product)

   All have passed our systematic testing of all possible combinations
   of mixed and extended precision.  We will eventually include
   everything in the intersection of Chapter 2 and Chapter 4, with
   systematic testing.

   2.1 Directory Structure

       This release contains the following directories:

       doc/          - technical report

       m4/           - Directory where M4 macro files and support
                       routines are stored
       m4/dot       \
       m4/sum        } Directories for each function
       m4/...       /
       m4/test-dot  \
       m4/test-sum   } Directories for each test function
       m4/...       /

       src/          - Directory where C code is stored
       src/dot      \
       src/sum       } Target directories for C code
       src/...      /

       testing/      - Directory where C code for testing is stored
       testing/test-dot  - DOT test code and results
       testing/test-sum  - SUM test code and results
       testing/...

3. Installation

   The reference XBLAS is built similarly to the reference BLAS and
   LAPACK.  The current build system produces a static libxblas.a.

   You need to provide a make.inc file in the source directory that
   defines the compiler, optimization flags, and options for building
   a Fortran->C bridge.  Some examples are provided.

   Alternatively, you can use the configure script to attempt to
   produce a make.inc that is appropriate for your system with your C
   (and, optionally, Fortran) compiler.  To do so, issue ./configure
   in the top-most directory.  This will produce a make.inc file that
   you should check before proceeding.  To specify particular
   compilers, do something like:

     CC=my_c_compiler FC=my_fortran_compiler ./configure

   M4 is not necessary for the distributed archive; it is only
   necessary if you modify the generator sources under m4/.  See
   README.devel for more information.

   The Fortran->C bridge uses details of a specific toolchain's binary
   interface, in particular how the Fortran compiler mangles names.
   See src/f2c-bridge.h for the available options.  Most compilers
   support several name mangling schemes; be sure to use the *SAME*
   naming options for all your Fortran code.

   The Fortran->C bridge is included in libxblas.a.  Each of the
   bridge's object files matches *-f2c.o, so you can extract them with
   ar if you need to share one libxblas.a between multiple Fortran
   compilers.  Example steps to strip the Fortran->C routines from
   libxblas.a and place them in a separate libxblas-myfortran.a are as
   follows:

     ar t libxblas.a | fgrep -- -f2c.o | xargs ar x libxblas.a
     ar ru libxblas-myfortran.a *-f2c.o
     ar d libxblas.a *-f2c.o
     rm *-f2c.o

4. Code Generation with M4 macro processor

   In the existing BLAS, there are usually 4 routines associated with
   each operation: all input, output, and internal variables are
   single or double precision, and real or complex.  But under the new
   extended and mixed precision rules (see Chapter 4 for details), the
   input, output, and internal variables may have different precisions
   and types.  The combination of all these types therefore results in
   many more routines associated with each operation.  For example,
   DOT has 32 routines altogether: 4 "standard" versions (from
   Chapter 2) and 28 mixed and extended precision versions (from
   Chapter 4).  In addition, the 16 versions with extended precision
   support up to three internal precisions that can be chosen at
   runtime.  We have automated the code and test code generation as
   much as possible, using the M4 macro processor to facilitate this
   task.

   The idea is to define a macro for each fundamental operation.  The
   macro's argument list contains the variables, accompanied by their
   types and precisions.  For example, for the operation c <- a + b,
   we define the following macro:

   	ADD(c, c_type, a, a_type, b, b_type)

   where x_type can be one of:

	real_single
	real_double
	real_extra
	complex_single
	complex_double
	complex_extra

   Inside the macro body, we use an "if-test" on c_type, a_type, and
   b_type to generate the appropriate code segment for "+".  (This is
   similar to operator overloading in C++, but done manually.)  All
   these if-tests are evaluated at macro-expansion time and do not
   appear in the executable code.  Indeed, our goal was to produce
   efficient C code, which means minimizing branches in inner loops.

   Other macros include SUB, MUL, DIV, DECLARE (variable declaration),
   ASSIGN, etc.

   Since these macros are shared among all the BLAS routines, we put
   them in a common header file named cblas.m4.h.  Each BLAS routine
   also has its own macro file, such as dot.m4, spmv.m4, and gbmv.m4,
   to generate the specific functions.  All the macro files are
   located in the m4/ subdirectory.

   For example, the inner loop of the M4 macro for the dot product is
   simply as follows (the M4 parameters $2, $3, and $4 are types):

        for (i = 0; i < n; ++i) {
            GET_VECTOR_ELEMENT(x_ii, x_i, ix, $2)
                  /* put ix-th element of vector x into x_ii */
            GET_VECTOR_ELEMENT(y_ii, y_i, iy, $3)
                  /* put iy-th element of vector y into y_ii */
            MUL(prod, $4, x_ii, $2, y_ii, $3) /* prod = x[i]*y[i] */
            ADD(sum, $4, sum, $4, prod, $4) /* sum = sum+prod */
            ix += incx;
            iy += incy;
        } /* endfor */

   The motivation for this macro-based approach is software-engineering
   simplicity.  For example, the file dot.m4 of M4 macros for the dot
   product is 401 lines long (245 non-comment lines) but expands into
   11,145 lines in 32 C subroutines implementing different versions of
   DOT.  Similarly, the macros for TRSV expand from 732 lines (454
   non-comment lines) to 37,099 lines in 32 C subroutines.  (Neither
   count includes the shared M4 macros in the file cblas.m4.h.)


5. Testing

   The goal of the testing code is to validate the underlying
   implementation.  The challenges are twofold.  First, we must
   thoroughly test routines claiming to use extra precision
   internally, while the test code is not allowed to declare any extra
   precision variables or use any other extra precision facilities not
   available to the code being tested.  This requires great care in
   generating test data.  Second, we must use M4 to automatically
   generate the many versions of test code needed for the many
   versions of the code being tested.

   For each BLAS routine, we perform the following steps in the test code:
     1) Generate input scalars, vectors, and arrays, according to the
        routine's specification, so that the result exposes the
        internal precision actually used.
     2) Call the BLAS routine.
     3) For each output, compute a "test ratio" of the computed
	error to the theoretical error bound, i.e.,

		| Computed_value - "True_value" | / Error_Bound

   By design, the test ratio should be at most 1.  A larger ratio
   indicates that the computed result is either completely wrong or
   not as accurate as claimed in the specification.

   The following sections discuss how we generate "good" inputs in
   order to reveal the internal precisions actually used.
   For details, see the paper in file doc/report.ps.

   5.1 Testing DOT

       DOT performs the following function:

		r <- beta * r_in + alpha * (SUM_{i=1,n} x_i*y_i)

       Assume that the result r_computed is computed with
          precision eps_int internally,
          precision eps_out when the final result is rounded on output,
          underflow threshold UN_int internally, and
          underflow threshold UN_out on output,
       and that additionally we compute a very accurate approximation
       r_acc with
          precision eps_acc = 2^(-106)  (double-double format) and
          underflow threshold UN_acc = 2^(-968).
       Then the error bound satisfies

(*)	|r_computed - r_acc| <= (n+2)*(eps_int + eps_acc)*S + U + eps_out*|r_acc|
                              = Scale
       where

       	   S = |alpha| * (SUM_{i=1,n} |x_i|*|y_i|) + |beta|*|r_in|
           U = (|alpha|*n+2)*(UN_int + UN_acc) + UN_out

       Thus, we can confirm that r_computed has been computed as
       accurately as claimed (i.e. with internal precision defined by
       eps_int and UN_int) by testing whether the

          test ratio = |r_computed - r_acc| / Scale

       is at most 1.  Suppose that no underflow occurs, and that

          eps_intX >> eps_int >= eps_acc,

       where eps_intX is the internal precision actually used in some
       buggy version of DOT that we want to test.  Then we can expect
       the test ratio to be as large as

                        (n+2)*(eps_intX + eps_acc)*S + eps_out*|r_acc|
          test ratio ~  ----------------------------------------------
                        (n+2)*(eps_int  + eps_acc)*S + eps_out*|r_acc|

       If we can make |r_acc| very small, then this ratio will be
       roughly

          test ratio = eps_intX / eps_int >> 1,

       which means the bug in DOT will be detected; in fact, the test
       ratio tells us how much internal precision was effectively
       used.

       Thus our goal is to pick test data alpha, x(1:n), y(1:n), beta,
       and r_in to make |r_acc| as small as possible, MUCH smaller
       than S, so that the eps_int term dominates on the right of the
       inequality (*).  Otherwise eps_int will be invisible in the
       error bound, and we cannot tell what precision is used
       internally.

       In our test generator, we choose the input data alpha, beta,
       r_in, x(1:n), and y(1:n) judiciously in order to cause as much
       cancellation in r as possible.

   5.2 Choosing input data and computing r_acc

       The general approach is to choose some of the input values x(i)
       and y(i) so that the exact (partial) dot product of these
       values has a lot of nonzero fraction bits, preferably at least
       106.  The remaining values of x(i) and y(i) are then chosen to
       cancel the previous bits as much as possible.  This latter
       computation seems to require high precision.

       One possibility is to use an arbitrary precision package, such
       as MPFUN, but our goal is to make this test code self-contained
       and use no extra precision facilities not available to the code
       being tested.  Since most of our routines can be reduced to a
       series of dot products, testing can be based on DOT.  We only
       need a TRUSTED dot routine.  Currently, we are using our own
       dot routine with double-double internal precision to compute
       r_truth, which is accurate to 106 bits.  This means that any
       internal precision higher than double-double cannot be
       detected, and may result in a tiny test ratio.  A very tiny
       test ratio (such as zero) may also occur if the result happens
       to be computed exactly.

       This raises the possibility that there are "matching" bugs in
       our trusted DOT and the DOT under test, so that bugs are
       missed.  To avoid this possibility we also generate some test
       examples where the cancellation is done mathematically (and so
       exactly) rather than depending on a computation.  The idea is
       simple: for example, choose x(1:3) and y(1:3) so that

            x(1)*y(1) = -x(3)*y(3) >> x(2)*y(2)

       so that SUM_{i=1,3} x(i)*y(i) = x(2)*y(2) exactly.

   5.3 Testing SPMV and GBMV

       SPMV, GBMV, and many other BLAS2 routines perform the following
       function:

		y <- beta * y + alpha * A * x

       Testing it is no more difficult than testing DOT, because each
       component of the computed y vector is a dot product and
       satisfies the error bound (*).  So we simply use the same test
       generator as for DOT, and compute a test ratio for each
       component of the y vector.  The only tricky part is that some
       entries in each dot product may be fixed.  For example, the
       first row of A and the vector x can be chosen freely, but after
       that x is fixed and, if A is symmetric, the first entry of each
       subsequent row is fixed.  Our dot-product test generator
       handles all these cases.

       This approach can be generalized to most other Level 2 and 3
       BLAS routines.

6. Feedback

   Please send any comments or bug reports to extended_blas@cs.berkeley.edu.
   This code was developed by

   Xiaoye Li,
   Jim Demmel,
   David Bailey,
   Yozo Hida,
   Jimmy Iskandar,
   Anil Kapur,
   Michael Martin,
   Brandon Thompson,
   Teresa Tung,
   Daniel Yoo

   with help from Ben Wanzo, Berkat Tung, Weihua Shen, and Jason Riedy.

README.devel

This document describes how to generate the C files from the M4
templates; it is mostly intended for developers.

- An appropriate make.inc definition must first be supplied.
  Several sample make.inc files for various platforms are
  provided.  Alternatively, you can have autoconf guess the
  settings for you; to do this, run

    autoconf
    ./configure

  This will create a make.inc appropriate for the compiler(s)
  found.  It will also create a Makefile from Makefile.m4.
  You can specify which compilers to use via the CC and FC
  environment variables.  For details, see "./configure --help".

- If make.inc was configured manually, the top-level Makefile
  must be generated from Makefile.m4.  Just do
  "m4 Makefile.m4 >Makefile".

- Then run "make makefiles".  This will generate all the
  makefiles in the various directories from their m4 templates.

- Then run "make sources test-sources".  This will create all
  the C source files for both the XBLAS library code and the
  testing code.