Table of Contents

1. Introduction
2. Content
3. Installation
4. Code Generation with M4 macro processor
5. Testing
6. Feedback

1. Introduction

This library of routines is part of a reference implementation for
the Dense and Banded BLAS routines, along with their
Extended and Mixed Precision versions, as documented in
Chapters 2 and 4 of the new BLAS Standard, which is available from:

        http://www.netlib.org/blas/blast-forum/

EXTENDED PRECISION is only used internally; the input and output
arguments remain the same as in the existing BLAS. At present, we
only allow Single, Double, or Extra internal precision. Extra
precision is implemented as double-double precision (128-bit total,
106-bit significand). The routines for the double-double precision
basic arithmetic operations +, -, *, / were developed by David
Bailey.

We have designed all our routines assuming that single precision
arithmetic is actually done in IEEE single precision (32 bits)
and that double precision arithmetic is actually done in
IEEE double precision (64 bits). The routines also pass our tests
on an Intel machine with 80-bit floating point registers.

MIXED PRECISION permits some input/output arguments to be of
different types (mixing real and complex) or precisions (mixing
single and double).

The purpose of this implementation is to serve as a proof of
concept, showing that the considerable complexity of the
specification is actually implementable and testable with a
manageable amount of software. We have not attempted to optimize
performance, but our code should be as good as straightforward but
careful code written by hand.

2. Content

The BLAS Standard defines language bindings for Fortran 95, Fortran
77, and C. Here, we have only implemented the C version and
provided a method for binding to one Fortran 77 ABI.
In this initial release, we provide the following 17 routines:

Level 1
        DOT       (Inner product)
        SUM       (Sum)
        AXPBY     (Scaled vector accumulation)
        WAXPBY    (Scaled vector addition)
Level 2
        GEMV      (General matrix vector product)
        GBMV      (Banded matrix vector product)
        SYMV      (Symmetric matrix vector product)
        SBMV      (Symmetric banded matrix vector product)
        SPMV      (Symmetric matrix vector product, packed format)
        HEMV      (Hermitian matrix vector product)
        HBMV      (Hermitian banded matrix vector product)
        HPMV      (Hermitian matrix vector product, packed format)
        GE_SUM_MV (Summed matrix-vector product)
        TRSV      (Triangular solve)
Level 3
        GEMM      (General matrix matrix product)
        SYMM      (Symmetric matrix matrix product)
        HEMM      (Hermitian matrix matrix product)

All have passed our systematic testing of all
possible combinations of mixed and extended precision.
We will eventually include everything in the intersection of
Chapter 2 and Chapter 4 with systematic testing.

2.1 Directory Structure

This release contains the following directories:

doc/       - technical report

m4/        - Directory where m4 macro files and support
             routines are stored
m4/dot      \
m4/sum       } Directories for each function
m4/...      /
m4/test-dot \
m4/test-sum  } Directories for each test function
m4/...      /

src/       - Directory where C code is stored
src/dot     \
src/sum      } Target directories for C code
src/...     /

testing/   - Directory where C code for testing is stored
testing/test-dot - DOT test code and results
testing/test-sum - SUM test code and results
testing/...

3. Installation

The reference XBLAS are built similarly to the reference BLAS and
LAPACK. The current build system produces a static libxblas.a.

You need to provide a make.inc file in the source directory that
defines the compiler, optimization flags, and options for building
a Fortran->C bridge.
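Purely as an illustration, such a make.inc might look like the following sketch. The compiler choices and the bridge option name here are placeholders of ours; consult the examples shipped with the distribution for the variables the Makefiles actually expect.

```make
# Hypothetical make.inc sketch -- all names below are placeholders;
# see the distributed example make.inc files for the real variables.
CC      = gcc          # C compiler
CFLAGS  = -O2          # optimization flags
FC      = gfortran     # Fortran compiler (only needed for the bridge)
# Name-mangling selection for the Fortran->C bridge; the actual
# option names are listed in src/f2c-bridge.h.
BRIDGE_FLAGS = -DHYPOTHETICAL_FC_UNDERSCORE
```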
Some examples are provided.

Alternatively, you can use the configure script to attempt to produce
a make.inc that is appropriate for your system with your C (and optionally,
Fortran) compiler. You need to issue ./configure in the top-most
directory. This will produce a make.inc file that you should check
before proceeding. To specify the specific compiler(s) to use, run
something like:

        CC=my_c_compiler FC=my_fortran_compiler ./configure

M4 is not necessary for the distributed archive; it is only
necessary if you modify the generator sources under m4/. See
README.devel for more information.

The Fortran->C bridge uses details of a specific toolchain's binary
interface, in particular how the Fortran compiler mangles names.
See src/f2c-bridge.h for the available options. Most compilers can
support different name mangling schemes; be sure to use the *SAME*
naming options for all your Fortran code.

The Fortran->C bridge is included in libxblas.a. Each of the
bridge's object files matches *-f2c.o, so you can extract them with
ar if you need to share one libxblas.a between multiple Fortran
compilers. Example steps to strip the Fortran->C routines from
libxblas.a and place them in a separate libxblas-myfortran.a are as
follows:

        ar t libxblas.a | fgrep -- -f2c.o | xargs ar x libxblas.a
        ar ru libxblas-myfortran.a *-f2c.o
        ar d libxblas.a *-f2c.o
        rm *-f2c.o

4. Code Generation with M4 macro processor

In the existing BLAS, there are usually 4 routines associated with
each operation. All input, output, and internal variables are
single or double precision and real or complex. But under the new
extended and mixed precision rules (see Chapter 4 for details), the
input, output, and internal variables may have different precisions
and types.
Therefore, the combination of all these types results in
many more routines associated with each operation. For
example, DOT will have 32 routines altogether: 4 "standard" versions
(from Chapter 2) and 28 mixed and extended precision versions
(from Chapter 4). In addition, the 16 versions with extended precision
support up to three internal precisions that can be chosen at runtime.
We have automated the code and test code generation as much as possible.
We use the M4 macro processor to facilitate this task.

The idea is to define a macro for each fundamental operation. The
macro's argument list contains the variables, accompanied by their
types and precisions. For example, for the operation c <- a + b, we
define the following macro:

        ADD(c, c_type, a, a_type, b, b_type)

where x_type can be one of:

        real_single
        real_double
        real_extra
        complex_single
        complex_double
        complex_extra

Inside the macro body, we use an "if-test" on c_type, a_type, and
b_type to generate the appropriate code segment for "+".
(This is similar to operator overloading in C++, but we do it
manually.) All these if-tests are evaluated at macro-evaluation
time, and do not appear in the executable code. Indeed, our goal
was to produce efficient C code, which means minimizing branches
in inner loops.

Other macros include SUB, MUL, DIV, DECLARE (variable declaration),
ASSIGN, etc.

Since these macros are shared among all the BLAS routines, we put
them in a common header file, named cblas.m4.h. Each BLAS routine
also has its own macro file, such as dot.m4, spmv.m4, and gbmv.m4,
to generate the specific functions. All the macro files are
located in the m4/ subdirectory.
For example, the inner loop of the M4 macro for the dot product is
simply as follows (the M4 parameters $2, $3, and $4 are types):

        for (i = 0; i < n; ++i) {
          GET_VECTOR_ELEMENT(x_ii, x_i, ix, $2)
                        /* put ix-th element of vector x into x_ii */
          GET_VECTOR_ELEMENT(y_ii, y_i, iy, $3)
                        /* put iy-th element of vector y into y_ii */
          MUL(prod, $4, x_ii, $2, y_ii, $3)   /* prod = x[i]*y[i] */
          ADD(sum, $4, sum, $4, prod, $4)     /* sum = sum+prod */
          ix += incx;
          iy += incy;
        } /* endfor */

The motivation for this macro-based approach is simplifying
software engineering. For example, the file dot.m4
of M4 macros for the dot product is 401 lines long (245 non-comment lines)
but expands into 11145 lines in 32 C subroutines implementing
different versions of DOT. Similarly, the macros for TRSV expand from
732 lines (454 non-comment lines) to 37099 lines in 32 C subroutines.
(This does not count the shared M4 macros in the file cblas.m4.h.)

5. Testing

The goal of the testing code is to validate the underlying implementation.
The challenges are twofold: First, we must thoroughly test routines
claiming to use extra precision internally, where the test code is not
allowed to declare any extra precision variables or use any other extra
precision facilities not available to the code being tested. This requires
great care in generating test data. Second, we must use M4 to
automatically generate the many versions of test code needed for the many
versions of the code being tested.

For each BLAS routine, we perform the following steps in the test code:
1) Generate input scalars, vectors, and arrays, according to the
   routine's specification, so that the result exposes the internal
   precision actually used.
2) Call the BLAS routine.
3) For each output, compute a "test ratio" of the computed
   error to the theoretical error bound, i.e.,

        | Computed_value - "True_value" | / Error_Bound

By design, the test ratio should be at most 1. A larger ratio indicates
that the computed result is either completely wrong, or not as
accurate as claimed in the specification.

The following section will discuss how we generate "good" inputs in
order to reveal the internal precisions actually used.
For details, see the paper in file doc/report.ps.

5.1 Testing DOT

DOT performs the following function:

        r <- beta * r_in + alpha * (SUM_{i=1,n} x_i*y_i)

Assume that the result r_computed is computed with
        precision eps_int internally,
        precision eps_out when the final result is rounded on output,
        underflow threshold UN_int internally, and
        underflow threshold UN_out on output,
and that additionally we compute a very accurate approximation r_acc with
        precision eps_acc = 2^(-106) (double-double format) and
        underflow threshold UN_acc = 2^(-968).
Then the error bound satisfies

(*)     |r_computed - r_acc| <= (n+2)*(eps_int + eps_acc)*S + U + eps_out*|r_acc|
                              = Scale
where

        S = |alpha| * (SUM_{i=1,n} |x_i|*|y_i|) + |beta|*|r_in|
        U = (|alpha|*n+2)*(UN_int + UN_acc) + UN_out

Thus, we can confirm that r_computed has been computed as accurately
as claimed (i.e., with internal precision defined by eps_int and UN_int)
by testing whether the

        test ratio = |r_computed - r_acc| / Scale

is at most 1. Suppose that no underflow occurs, and that

        eps_intX >> eps_int >= eps_acc,

where eps_intX is the internal precision actually used in some
buggy version of DOT that we want to test.
Then we can expect
the test ratio to be as large as

                     (n+2)*(eps_intX + eps_acc)*S + eps_out*|r_acc|
        test ratio ~ ----------------------------------------------
                     (n+2)*(eps_int + eps_acc)*S + eps_out*|r_acc|

If we can make r_acc very small, then this ratio will be roughly

        test ratio ~ eps_intX / eps_int >> 1

which means the bug in DOT will be detected, and in fact the test ratio
will actually tell us how much internal precision we effectively used.

Thus our goal is to pick test data alpha, x(1:n), y(1:n), beta, and r_in
to make |r_acc| as small as possible, MUCH smaller than S, so that
the eps_int term dominates on the right of the inequality (*).
Otherwise eps_int will be invisible in the error bound, and
we cannot tell what precision is used internally.

In our test generator, we choose the input data alpha, beta, r_in, x(1:n),
and y(1:n) judiciously in order to cause as much cancellation in r as
possible.

5.2 Choosing input data and computing r_acc

The general approach is to choose some of the input values of x(i) and y(i)
so that the exact (partial) dot product of these values has a lot of
nonzero fraction bits, preferably at least 106. Then the remaining values of
x(i) and y(i) are chosen to cancel the previous bits as much as possible.
This latter computation seems to require high precision.

One possibility is to use an arbitrary precision package, such as
MPFUN, but our goal is to make this test code self-contained,
and use no extra precision facilities not available to the code
being tested. Since most of our routines can be reduced to a
series of dot products, testing can be based on DOT. We only
need a TRUSTED dot routine. Currently, we are using our own dot
routine with double-double internal precision to compute
r_truth, which is accurate to 106 bits.
This means that any internal precision higher than double-double
cannot be detected, and may result in a tiny test ratio.
A very tiny test ratio (such as zero) may also occur if the result
happens to be computed exactly.

This raises the possibility that there are "matching" bugs in
our trusted DOT and the DOT under test, so that bugs are missed.
To avoid this possibility, we also generate some test examples
where the cancellation is done mathematically (and so exactly)
rather than depending on a computation. The idea is simple:
for example, choose x(1:3) and y(1:3) so that

        x(1)*y(1) = -x(3)*y(3)  and  |x(1)*y(1)| >> |x(2)*y(2)|,

so that SUM_{i=1,3} x(i)*y(i) = x(2)*y(2) exactly.

5.3 Testing SPMV and GBMV

SPMV, GBMV, and many other BLAS2 routines perform the following function:

        y <- beta * y + alpha * A * x

Testing it is no more difficult than testing DOT, because each
component of the computed y vector is a dot product, and
satisfies the error bound (*). So we simply use the same test
generator as DOT, and compute a test ratio for each component
of the y vector. The only tricky part is that some entries
in each dot product may be fixed. For example, the first row
of A and the vector x can be chosen freely, but after that
x is fixed and, if A is symmetric, the first entry of each
subsequent row is fixed. Our dot-product test generator handles
all these cases.

This approach can be generalized to most other Level 2 and 3 BLAS.

6. Feedback

Please send any comments or bug reports to extended_blas@cs.berkeley.edu.
This code was developed by

        Xiaoye Li,
        Jim Demmel,
        David Bailey,
        Yozo Hida,
        Jimmy Iskandar,
        Anil Kapur,
        Michael Martin,
        Brandon Thompson,
        Teresa Tung,
        Daniel Yoo

with help from Ben Wanzo, Berkat Tung, Weihua Shen, and Jason Riedy.