## Build Status
| Build branch | master | develop |
|-----|-----|-----|
| GCC/Clang x64 | [![Build Status](https://travis-ci.org/clMathLibraries/clBLAS.svg?branch=master)](https://travis-ci.org/clMathLibraries/clBLAS/branches) | [![Build Status](https://travis-ci.org/clMathLibraries/clBLAS.svg?branch=develop)](https://travis-ci.org/clMathLibraries/clBLAS/branches) |
| Visual Studio x64 | [![Build status](https://ci.appveyor.com/api/projects/status/v384bi6e8xv8nxjm/branch/master?svg=true)](https://ci.appveyor.com/project/kknox/clblas-5ph9i/branch/master) | [![Build status](https://ci.appveyor.com/api/projects/status/v384bi6e8xv8nxjm/branch/develop?svg=true)](https://ci.appveyor.com/project/kknox/clblas-5ph9i/branch/develop) |

clBLAS
=====

This repository houses the code for the OpenCL™ BLAS portion of clMath.
The complete set of BLAS level 1, 2 & 3 routines is implemented; please
see Netlib BLAS for the list of supported routines. In addition to GPU
devices, the library also supports running on CPU devices to facilitate
debugging and multicore programming. APPML 1.10 is the most recent
generally available pre-packaged binary version of the library,
downloadable for both Linux and Windows platforms.

The primary goal of clBLAS is to make it easier for developers to
utilize the inherent performance and power-efficiency benefits of
heterogeneous computing. The clBLAS interfaces do not hide or wrap OpenCL
interfaces, but rather leave OpenCL state management under the control of
the user to allow for maximum performance and flexibility. The clBLAS
library does generate and enqueue optimized OpenCL kernels, relieving
the user from the task of writing, optimizing, and maintaining kernel
code themselves.

## clBLAS update notes 09/2015

- Introducing [AutoGemm](http://github.com/clMathLibraries/clBLAS/wiki/AutoGemm)
  - clBLAS's Gemm implementation has been comprehensively overhauled to use AutoGemm.
    AutoGemm is a suite of Python scripts that generates optimized kernels and kernel-selection logic for all precisions, transposes, tile sizes, and so on.
  - CMake is configured to use AutoGemm for clBLAS, so the build and usage experience of Gemm remains unchanged (only performance and maintainability have been improved). Kernel sources are generated at build time (not runtime) and can be configured within CMake to be pre-compiled at build time.
  - clBLAS users with unique Gemm requirements can customize AutoGemm to their needs (such as non-default tile sizes for very small or very skinny matrices); see the [AutoGemm](http://github.com/clMathLibraries/clBLAS/wiki/AutoGemm) documentation for details.

## clBLAS library user documentation

[Library and API documentation][] for developers is available online as
a GitHub Pages website.

## Google Groups

Two mailing lists have been created for the clMath projects:

- [clmath@googlegroups.com][] - a group whose focus is answering
  questions on using the library and reporting issues

- [clmath-developers@googlegroups.com][] - a group for
  developers interested in contributing to the library code itself

## clBLAS Wiki

The [project wiki][] contains helpful documentation, including a [build
primer][].

## Contributing code

Please refer to and read the [Contributing][] document for guidelines on
how to contribute code to this open source project. The code in the
/master branch is considered to be stable, and all pull requests should
be made against the /develop branch.

## License

The source for clBLAS is licensed under the [Apache License, Version 2.0](http://www.apache.org/licenses/LICENSE-2.0).

## Example

The simple example below shows how to use clBLAS to compute an OpenCL-accelerated SGEMM.

```c
#include <sys/types.h>
#include <stdio.h>

/* Include the clBLAS header.
 * It includes the appropriate OpenCL headers. */
#include <clBLAS.h>

/* This example uses predefined matrices and their characteristics
 * for the sake of simplicity.
 */

#define M 4
#define N 3
#define K 5

static const cl_float alpha = 10;

static const cl_float A[M*K] = {
    11, 12, 13, 14, 15,
    21, 22, 23, 24, 25,
    31, 32, 33, 34, 35,
    41, 42, 43, 44, 45,
};
static const size_t lda = K;    /* A is M x K, stored row-major */

static const cl_float B[K*N] = {
    11, 12, 13,
    21, 22, 23,
    31, 32, 33,
    41, 42, 43,
    51, 52, 53,
};
static const size_t ldb = N;    /* B is K x N, stored row-major */

static const cl_float beta = 20;

static cl_float C[M*N] = {
    11, 12, 13,
    21, 22, 23,
    31, 32, 33,
    41, 42, 43,
};
static const size_t ldc = N;    /* C is M x N, stored row-major */

static cl_float result[M*N];

int main( void )
{
    cl_int err;
    cl_platform_id platform = 0;
    cl_device_id device = 0;
    cl_context_properties props[3] = { CL_CONTEXT_PLATFORM, 0, 0 };
    cl_context ctx = 0;
    cl_command_queue queue = 0;
    cl_mem bufA, bufB, bufC;
    cl_event event = NULL;
    int ret = 0;

    /* Setup OpenCL environment. */
    err = clGetPlatformIDs( 1, &platform, NULL );
    err = clGetDeviceIDs( platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL );

    props[1] = (cl_context_properties)platform;
    ctx = clCreateContext( props, 1, &device, NULL, NULL, &err );
    queue = clCreateCommandQueue( ctx, device, 0, &err );

    /* Setup clBLAS. */
    err = clblasSetup( );

    /* Prepare OpenCL memory objects and place matrices inside them.
     */
    bufA = clCreateBuffer( ctx, CL_MEM_READ_ONLY, M * K * sizeof(*A),
                           NULL, &err );
    bufB = clCreateBuffer( ctx, CL_MEM_READ_ONLY, K * N * sizeof(*B),
                           NULL, &err );
    bufC = clCreateBuffer( ctx, CL_MEM_READ_WRITE, M * N * sizeof(*C),
                           NULL, &err );

    err = clEnqueueWriteBuffer( queue, bufA, CL_TRUE, 0,
                                M * K * sizeof( *A ), A, 0, NULL, NULL );
    err = clEnqueueWriteBuffer( queue, bufB, CL_TRUE, 0,
                                K * N * sizeof( *B ), B, 0, NULL, NULL );
    err = clEnqueueWriteBuffer( queue, bufC, CL_TRUE, 0,
                                M * N * sizeof( *C ), C, 0, NULL, NULL );

    /* Call the clBLAS SGEMM function: C = alpha * A * B + beta * C. */
    err = clblasSgemm( clblasRowMajor, clblasNoTrans, clblasNoTrans,
                       M, N, K,
                       alpha, bufA, 0, lda,
                       bufB, 0, ldb, beta,
                       bufC, 0, ldc,
                       1, &queue, 0, NULL, &event );

    /* Wait for the calculation to finish. */
    err = clWaitForEvents( 1, &event );

    /* Fetch the results of the calculation from GPU memory. */
    err = clEnqueueReadBuffer( queue, bufC, CL_TRUE, 0,
                               M * N * sizeof(*result),
                               result, 0, NULL, NULL );

    /* Release OpenCL memory objects. */
    clReleaseMemObject( bufC );
    clReleaseMemObject( bufB );
    clReleaseMemObject( bufA );

    /* Finalize work with clBLAS. */
    clblasTeardown( );

    /* Release OpenCL working objects.
     */
    clReleaseEvent( event );
    clReleaseCommandQueue( queue );
    clReleaseContext( ctx );

    return ret;
}
```

## Build dependencies
### Library for Windows
* Windows® 7/8
* Visual Studio 2010 SP1 or 2012
* An OpenCL SDK, such as APP SDK 2.8
* Latest CMake

### Library for Linux
* GCC 4.6 and onwards
* An OpenCL SDK, such as APP SDK 2.9
* Latest CMake

### Library for Mac OS X
* Recommended to generate Unix makefiles with CMake

### Test infrastructure
* Googletest v1.6
* ACML on Windows/Linux; Accelerate on Mac OS X
* Latest Boost

### Performance infrastructure
* Python

[Library and API documentation]: http://clmathlibraries.github.io/clBLAS/
[clmath@googlegroups.com]: https://groups.google.com/forum/#!forum/clmath
[clmath-developers@googlegroups.com]: https://groups.google.com/forum/#!forum/clmath-developers
[project wiki]: https://github.com/clMathLibraries/clBLAS/wiki
[build primer]: https://github.com/clMathLibraries/clBLAS/wiki/Build
[Contributing]: CONTRIBUTING.md
[Apache License, Version 2.0]: http://www.apache.org/licenses/LICENSE-2.0