1Table of Contents
2 1. Introduction
3 2. Content
4 3. Installation
5 4. Code Generation with M4 macro processor
6 5. Testing
7 6. Feedback
8
91. Introduction
10
11 This library of routines is part of a reference implementation for
12 the Dense and Banded BLAS routines, along with their
13 Extended and Mixed Precision versions, as documented in
14 Chapters 2 and 4 of the new BLAS Standard, which is available from:
15
16 http://www.netlib.org/blas/blast-forum/
17
18 EXTENDED PRECISION is only used internally; the input and output
19 arguments remain the same as in the existing BLAS. At present, we
20 only allow Single, Double, or Extra internal precision. Extra
21 precision is implemented as double-double precision (128-bit total,
22 106-bit significand). The routines for the double-double precision
23 basic arithmetic operations +, -, *, / were developed by David
24 Bailey.
25
26 We have designed all our routines assuming that single precision
27 arithmetic is actually done in IEEE single precision (32 bits)
28 and that double precision arithmetic is actually done in
29 IEEE double precision (64 bits). The routines also pass our tests
30 on an Intel machine with 80-bit floating point registers.
31
32 MIXED PRECISION permits some input/output arguments to be of
33 different types (mixing real and complex) or precisions (mixing
34 single and double).
35
36 The purpose of this implementation is to do a proof of
37 concept implementation, showing that the considerable complexity
38 of the specification is actually implementable and testable
39 with a manageable amount of software. We have not attempted to
40 optimize performance, but our code should be as good as
41 straightforward but careful code written by hand.
42
43
442. Content
45
46 The BLAS Standard defines language bindings for Fortran 95, Fortran
47 77, and C. Here, we have only implemented the C version and
48 provided a method for binding to one Fortran 77 ABI.
49
50 In this initial release, we provide the following 11 routines:
51
52 Level 1
53 DOT (Inner product)
54 SUM (Sum)
55 AXPBY (Scaled vector accumulation)
56 WAXPBY (Scaled vector addition)
57 Level 2
58 GEMV (General matrix vector product)
59 GBMV (Banded matrix vector product)
60 SYMV (Symmetric matrix vector product)
61 SBMV (Symmetric banded matrix vector product)
62 SPMV (Symmetric matrix vector product, packed format)
63 HEMV (Hermitian matrix vector product)
64 HBMV (Hermitian banded matrix vector product)
65 HPMV (Hermitian matrix vector product, packed format)
66 GE_SUM_MV (Summed matrix-vector product)
67 TRSV (Triangular solve)
68 Level 3
69 GEMM (General matrix matrix product)
70 SYMM (Symmetric matrix matrix product)
71 HEMM (Hermitian matrix matrix product)
72
73 All have passed our systematic testing of all
74 possible combinations of mixed and extended precision.
75 We will eventually include everything in the intersection of
76 Chapter 2 and Chapter 4 with systematic testing.
77
78 2.1 Directory Structure
79
80 This release contains the following directories:
81
82 doc/ - technical report
83
84 m4/ - Directory where m4 macro files and support
85 routines are stored
86 m4/dot \
87 m4/sum } Directories for each function
88 m4/,.. /
89 m4/test-dot \
90 m4/test-sum } Directories for each test function
91 m4/... /
92
93 src/ - Directory where C code is stored
94 src/dot \
95 src/sum } Target directories for C code
96 src/... /
97
98 testing/ - Directory where C code for testing is stored
99 testing/test-dot - DOT test code and results
100 testing/test-sum - SUM test code and results
101 testing/...
102
1033. Installation
104
105 The reference XBLAS are built similarly to the reference BLAS and
106 LAPACK. The current build system produces a static libxblas.a.
107
108 You need to provide a make.inc file in the source directory that
109 defines the compiler, optimization flags, and options for building
110 a Fortran->C bridge. Some examples are provided.
111
112 Alternatively, you can use the configure script to attempt to produce
113 a make.inc that is appropriate for your system with your C (and optionally,
114 Fortran) compiler. You need to issue ./configure in the top-most
115 directory. This will produce a make.inc file that you should check
116 before proceeding. To specify a specific compiler(s) to use, you will
117 need to do something like:
118
119 CC=my_c_compiler FC=my_fortran_compiler ./configure
120
121 M4 is not necessary for the distributed archive; it is only
122 necessary if you modify the generator sources under m4/. See
123 README.devel for more information.
124
125 The Fortran->C bridge uses details of a specific toolchain's binary
126 interface, in particular how the Fortran compiler mangles names.
127 See src/f2c-bridge.h for the available options. Most compilers can
128 support different name mangling schemes; be sure to use the *SAME*
129 naming options for all your Fortran code.
130
131 The Fortran->C bridge is included in libxblas.a. Each of the
132 bridge's object files matches *-f2c.o, so you can extract them with
133 ar if you need to share one libxblas.a between multiple Fortran
134 compilers. Example steps to strip the Fortran->C routines from
135 libxblas.a and place them in a separate libxblas-myfortran.a are as
136 follows:
137
138 ar t libxblas.a |fgrep -- -f2c.o | xargs ar x libxblas.a
139 ar ru libxblas-myfortran.a *-f2c.o
140 ar x libxblas.a *-f2c.o
141 rm *-f2c.o
142
1434. Code Generation with M4 macro processor
144
145 In the existing BLAS, there are usually 4 routines associated with
146 each operation. All input, output, and internal variables are
147 single or double precision and real or complex. But under the new
148 extended and mixed precision rules (see Chapter 4 for details), the
149 input, output and internal variables may have different precisions
150 and types. Therefore, the combination of all these types results in
151 many more routines associated with each operation. For
152 example, DOT will have 32 routines altogether, 4 "standard" versions
153 (from Chapter 2) and 28 mixed and extended precision versions
154 (from Chapter 4). In addition, the 16 versions with extended precision
155 support up to three internal precisions that can be chosen at runtime.
156 We have automated the code and test code generation as much as possible.
157 We use the M4 macro processor to facilitate this task.
158
159 The idea is to define a macro for each fundamental operation. The
160 macro's argument list contains the variables, accompanied by their
161 types and precisions. For example, for the operation c <- a + b, we
162 define the following macro:
163
164 ADD(c, c_type, a, a_type, b, b_type)
165
166 where, x_type can be one of:
167
168 real_single
169 real_double
170 real_extra
171 complex_single
172 complex_double
173 complex_extra
174
175 Inside the macro body, we use an "if-test" on c_type, a_type and
176 b_type, to generate the appropriate code segment for "+".
177 (This is similar to operator overloading in C++; but we do it
178 manually.) All these if-tests are evaluated at macro-evaluation
179 time, and do not appear in the executable code. Indeed, our goal
180 was to produce efficient C code, which means minimizing branches
181 in inner loops.
182
183 Other macros include SUB, MUL, DIV, DECLARE (variable declaration),
184 ASSIGN, etc.
185
186 Since these macros are shared among all the BLAS routines, we put
187 them in a common header file, named cblas.m4.h. Each BLAS routine
188 also has its own macro file, such as dot.m4, spmv.m4 and gbmv.m4,
189 to generate the specific functions. All the macro files are
190 located in the m4/ subdirectory.
191
192 For example, the inner loop of the M4 macro for the dot product is
193 simply as follows (the M4 parameters $2, $3, and $4 are types):
194
195 for (i = 0; i < n; ++i) {
196 GET_VECTOR_ELEMENT(x_ii, x_i, ix, $2)
197 /* put ix-th element of vector x into x_ii */
198 GET_VECTOR_ELEMENT(y_ii, y_i, iy, $3)
199 /* put iy-th element of vector y into y_ii */
200 MUL(prod, $4, x_ii, $2, y_ii, $3) /* prod = x[i]*y[i] */
201 ADD(sum, $4, sum, $4, prod, $4) /* sum = sum+prod */
202 ix += incx;
203 iy += incy;
204 } /* endfor */
205
206 The motivation for this macro-based approach is simplifying
207 software engineering. For example, the file {\tt dot.m4}
208 of M4 macros for the dot product is 401 lines long (245 non-comment lines)
209 but expands into 11145 lines in 32 C subroutines implementing
210 different versions of DOT. Similarly the macros for TRSV expand from
211 732 lines (454 non-comment lines) to 37099 lines in 32 C subroutines.
212 (This does not count the shared M4 macros in the file cblas.m4.h.)
213
214
2155. Testing
216
217 The goal of the testing code is to validate the underlying implementation.
218 The challenges are twofold: First, we must thoroughly test routines
219 claiming to use extra precision internally, where the test code is not
220 allowed to declare any extra precision variables or use any other extra
221 precision facilities not available to the code being tested. This requires
222 great care in generating test data. Second, we must use M4 to
223 automatically generate the many versions of test code needed for the many
224 versions of the code being tested.
225
226 For each BLAS routine, we perform the following steps in the test code:
227 1) Generate input scalars, vectors and arrays, according to the
228 routine's specification, so that the result exposes the internal
229 precision actually used.
230 2) Call the BLAS routine
231 3) For each output, compute a "test ratio" of the computed
232 error to the theoretical error bound, i.e.,
233 | Computed_value - "True_value" | / Error_Bound
234
235 By design, the test ratio should be at most 1. A larger ratio indicates
236 that the computed result is either completely wrong, or not as
237 accurate as claimed in the specification.
238
239 The following section will discuss how we generate "good" inputs in
240 order to reveal the internal precisions actually used.
241 For details, see the paper in file doc/report.ps.
242
243 5.1 Testing DOT
244
245 DOT performs the following function:
246
247 r <- beta * r_in + alpha * (SUM_{i=1,n} x_i*y_i)
248
249 Assume that the result r_computed is computed as follows
250 precision eps_int internally,
251 precision eps_out when the final result is rounded on output
252 underflow threshold UN_int internally,
253 underflow threshold UN_out on output
254 and that additionally we compute a very accurate approximation r_acc with
255 precision eps_acc = 2^(-106) (double-double format)
256 underflow threshold UN_acc = 2^(-968)
257 Then the error bound satisfies
258
259(*) |r_computed-r_acc| <= (n+2)(eps_int + eps_acc)*S + U + eps_out*|r_acc|
260 = Scale
261 where
262
263 S = |alpha| * (SUM_{i=1,n} |x_i|*|y_i|) + |beta|*|r_in|
264 U = (|alpha|*n+2)*(UN_int + UN_acc) + UN_out
265
266 Thus, we can confirm that r_computed has been computed as accurately
267 as claimed (i.e. with internal precision defined by eps_int and UN_int)
268 by testing whether the
269
270 test ratio = |r_computed - r_acc| / Scale
271
272 is at most 1. Suppose that no underflow occurs, and that
273
274 eps_intX >> eps_int >= eps_acc.
275
276 where eps_intX is the internal precision actually used in some
277 buggy version of DOT that we want to test. Then we can expect
278 the test ratio to be as large as
279
280 (n+2)*(eps_intX + eps_acc)*S + eps_out*|r_acc|
281 test ratio ~ ----------------------------------------------
282 (n+2)*(eps_int + eps_acc)*S + eps_out*|r_acc|
283
284
285 If we can make r_acc very small, then this ratio will be roughly
286
287 test ratio = eps_intX / eps_int >> 1
288
289 which means the bug in DOT will be detected, and in fact the test ratio
290 will actually tell us how much internal precision we effectively used.
291
292 Thus our goal is to pick test data alpha, x(1:n), y(1:n), beta and r_in
293 to make |r_acc| as small as possible, MUCH smaller than S, so that
294 eps_int term dominates on the right of the inequality (*).
295 Otherwise eps_int will be invisible in the error bound, then
296 we cannot tell what precision is used internally.
297
298 In our test generator, we choose input data alpha, beta, r_in, x(1:n)
299 and y(1:n) judiciously in order to cause as much cancellation in r as
300 possible.
301
302 5.2 Choosing input data and computing r_acc
303
304 The general approach is to choose some of the input values of x(i) and y(i)
305 so that the exact (partial) dot product of these values has a lot of
306 nonzero fraction bits, preferably at least 106. Then the remaining values of
307 x(i) and y(i) are chosen to cancel the previous bits as much as possible.
308 This latter computation seems to require high precision.
309
310 One possibility to use an arbitrary precision package, such as
311 MPFUN, but our goal is to make this test code self contained,
312 and use no extra precision facilities not available to the code
313 being tested. Since most of our routines can be reduced to a
314 series of dot products, testing can be based on DOT. We only
315 need a TRUSTED dot routine. Currently, we are using our own dot
316 routine with double-double internal precision to compute
317 r_truth, which is accurate to 106 bits.
318 This means that any internal precision higher than double-double
319 cannot be detected, and may result in a tiny test ratio.
320 A very tiny test ratio (such as zero) may also occur if the result
321 happens to be computed exactly.
322
323 This raises the possibility that there are "matching" bugs in
324 our trusted DOT and the DOT under test, so that bugs are missed.
325 To avoid this possibility we also generate some test examples
326 where the cancellation is done mathematically (and so exactly)
327 rather than depending on a computation. The idea is simple:
328 For example, choose x(1:3) and y(1:3) so that
329 x(1)*y(1) = -x(3)*y(3) >> x(2)*y(2)
330 so that SUM_{i=1,3} x(i)*y(i) = x(2)*y(2) exactly.
331
332 5.3 Testing SPMV and GBMV
333
334 SPMV, GBMV, and many other BLAS2 routines perform the following function:
335 y <- beta * y + alpha * A * x
336
337 Testing it is no more difficult than testing DOT, because each
338 component of the computed y vector is a dot product, and
339 satisfies the error bound (*). So we simply use the same test
340 generator as DOT, and compute a test ratio for each component
341 of the y vector. The only tricky part is that some entries
342 in each dot product may be fixed. For example, the first row
343 of A and the vector x can be chosen freely, but after that
344 x is fixed and, if A is symmetric, the first entry of each
345 subsequent row is fixed. Our dot-product test generator handles
346 all these cases.
347
348 This approach can be generalized to most other Level 2 and 3 BLAS.
349
3506. Feedback
351
352 Please send any comments or bug reports to extended_blas@cs.berkeley.edu.
353 This code was developed by
354
355 Xiaoye Li,
356 Jim Demmel,
357 David Bailey,
358 Yozo Hida,
359 Jimmy Iskandar,
360 Anil Kapur,
361 Michael Martin,
362 Brandon Thompson,
363 Teresa Tung,
364 Daniel Yoo
365
366 with help from Ben Wanzo, Berkat Tung, Weihua Shen, and Jason Riedy.
367