# CUB 1.10.0 (NVIDIA HPC SDK 20.9)

## Summary

CUB 1.10.0 is the major release accompanying the NVIDIA HPC SDK 20.9 release.
It drops support for C++03, GCC < 5, Clang < 6, and MSVC < 2017.
It also overhauls CMake support.
Finally, we now have a Code of Conduct for contributors:
https://github.com/NVIDIA/cub/blob/main/CODE_OF_CONDUCT.md

## Breaking Changes

- C++03 is no longer supported.
- GCC < 5, Clang < 6, and MSVC < 2017 are no longer supported.
- C++11 is deprecated.
  Using this dialect will generate a compile-time warning.
  These warnings can be suppressed by defining
  `CUB_IGNORE_DEPRECATED_CPP_DIALECT` or `CUB_IGNORE_DEPRECATED_CPP_11`.
  Suppression is only a short-term solution.
  We will be dropping support for C++11 in the near future.
- CMake < 3.15 is no longer supported.
- The default branch on GitHub is now called `main`.

## Other Enhancements

- Added install targets to CMake builds.
- C++17 support.

## Bug Fixes

- NVIDIA/thrust#1244: Check for macro collisions with system headers during
  header testing.
- NVIDIA/thrust#1153: Switch to placement new instead of assignment to
  construct items in uninitialized memory.
  Thanks to Hugh Winkler for this contribution.
- NVIDIA/cub#38: Fix `cub::DeviceHistogram` for `size_t` `OffsetT`s.
  Thanks to Leo Fang for this contribution.
- NVIDIA/cub#35: Fix GCC-5 maybe-uninitialized warning.
  Thanks to Rong Ou for this contribution.
- NVIDIA/cub#36: Qualify namespace for `va_printf` in `_CubLog`.
  Thanks to Andrei Tchouprakov for this contribution.

# CUB 1.9.10-1 (NVIDIA HPC SDK 20.7, CUDA Toolkit 11.1)

## Summary

CUB 1.9.10-1 is the minor release accompanying the NVIDIA HPC SDK 20.7 release
and the CUDA Toolkit 11.1 release.

## Bug Fixes

- NVIDIA/thrust#1217: Move the static local in `cub::DeviceCount` to a separate
  host-only function, because NVC++ doesn't support static locals in
  host-device functions.

# CUB 1.9.10 (NVIDIA HPC SDK 20.5)

## Summary

CUB 1.9.10 is the release accompanying the NVIDIA HPC SDK 20.5 release.
It adds CMake `find_package` support.
C++03, C++11, GCC < 5, Clang < 6, and MSVC < 2017 are now deprecated.
Starting with the upcoming 1.10.0 release, C++03 support will be dropped
entirely.

## Breaking Changes

- Thrust now checks that it is compatible with the version of CUB found
  in your include path, generating an error if it is not.
  If you are using your own version of CUB, it may be too old.
  It is recommended to simply delete your own version of CUB and use the
  version of CUB that comes with Thrust.
- C++03 and C++11 are deprecated.
  Using these dialects will generate a compile-time warning.
  These warnings can be suppressed by defining
  `CUB_IGNORE_DEPRECATED_CPP_DIALECT` (to suppress C++03 and C++11
  deprecation warnings) or `CUB_IGNORE_DEPRECATED_CPP_11` (to suppress C++11
  deprecation warnings).
  Suppression is only a short-term solution.
  We will be dropping support for C++03 in the 1.10.0 release and C++11 in the
  near future.
- GCC < 5, Clang < 6, and MSVC < 2017 are deprecated.
  Using these compilers will generate a compile-time warning.
  These warnings can be suppressed by defining
  `CUB_IGNORE_DEPRECATED_COMPILER`.
  Suppression is only a short-term solution.
  We will be dropping support for these compilers in the near future.

## New Features

- CMake `find_package` support.
  Just point CMake at the `cmake` folder in your CUB include directory
  (ex: `cmake -DCUB_DIR=/usr/local/cuda/include/cub/cmake/ .`) and then you
  can add CUB to your CMake project with `find_package(CUB REQUIRED CONFIG)`.

# CUB 1.9.9 (CUDA 11.0)

## Summary

CUB 1.9.9 is the release accompanying the CUDA Toolkit 11.0 release.
It introduces CMake support, version macros, platform detection machinery,
and support for NVC++, which uses Thrust (and thus CUB) to implement
GPU-accelerated C++17 Parallel Algorithms.
Additionally, the scan dispatch layer was refactored and modernized.
C++03, C++11, GCC < 5, Clang < 6, and MSVC < 2017 are now deprecated.
Starting with the upcoming 1.10.0 release, C++03 support will be dropped
entirely.

## Breaking Changes

- Thrust now checks that it is compatible with the version of CUB found
  in your include path, generating an error if it is not.
  If you are using your own version of CUB, it may be too old.
  It is recommended to simply delete your own version of CUB and use the
  version of CUB that comes with Thrust.
- C++03 and C++11 are deprecated.
  Using these dialects will generate a compile-time warning.
  These warnings can be suppressed by defining
  `CUB_IGNORE_DEPRECATED_CPP_DIALECT` (to suppress C++03 and C++11
  deprecation warnings) or `CUB_IGNORE_DEPRECATED_CPP_11` (to suppress C++11
  deprecation warnings).
  Suppression is only a short-term solution.
  We will be dropping support for C++03 in the 1.10.0 release and C++11 in the
  near future.
- GCC < 5, Clang < 6, and MSVC < 2017 are deprecated.
  Using these compilers will generate a compile-time warning.
  These warnings can be suppressed by defining
  `CUB_IGNORE_DEPRECATED_COMPILER`.
  Suppression is only a short-term solution.
  We will be dropping support for these compilers in the near future.

## New Features

- CMake support.
  Thanks to Francis Lemaire for this contribution.
- Refactored and modernized the scan dispatch layer.
  Thanks to Francis Lemaire for this contribution.
- Policy hooks for device-wide reduce, scan, and radix sort facilities
  to simplify tuning and allow users to provide custom policies.
  Thanks to Francis Lemaire for this contribution.
- `<cub/version.cuh>`: `CUB_VERSION`, `CUB_VERSION_MAJOR`, `CUB_VERSION_MINOR`,
  `CUB_VERSION_SUBMINOR`, and `CUB_PATCH_NUMBER`.
- Platform detection machinery:
  - `<cub/util_cpp_dialect.cuh>`: Detects the C++ standard dialect.
  - `<cub/util_compiler.cuh>`: Host and device compiler detection.
  - `<cub/util_deprecated.cuh>`: `CUB_DEPRECATED`.
  - `<cub/config.cuh>`: Includes `<cub/util_arch.cuh>`,
    `<cub/util_compiler.cuh>`, `<cub/util_cpp_dialect.cuh>`,
    `<cub/util_deprecated.cuh>`, `<cub/util_macro.cuh>`, and
    `<cub/util_namespace.cuh>`.
- `cub::DeviceCount` and `cub::DeviceCountUncached`, caching abstractions for
  `cudaGetDeviceCount`.

## Other Enhancements

- Lazily initialize the per-device CUDA attribute caches, because CUDA context
  creation is expensive and adds up with large CUDA binaries on machines with
  many GPUs.
  Thanks to the NVIDIA PyTorch team for bringing this to our attention.
- Make `cub::SwitchDevice` avoid setting/resetting the device if the current
  device is the same as the target device.

## Bug Fixes

- Add an explicit failure parameter to the CAS in the CUB attribute cache to
  work around a GCC 4.8 bug.
- Revert a change in reductions that changed the signedness of the `lane_id`
  variable to suppress a warning, as this introduced a bug in optimized device
  code.
- Fix initialization in `cub::ExclusiveSum`.
  Thanks to Conor Hoekstra for this contribution.
- Fix initialization of the `std::array` in the CUB attribute cache.
- Fix `-Wsign-compare` warnings.
  Thanks to Elias Stehle for this contribution.
- Fix `test_block_reduce.cu` to build without parameters.
  Thanks to Francis Lemaire for this contribution.
- Add missing includes to `grid_even_share.cuh`.
  Thanks to Francis Lemaire for this contribution.
- Add missing includes to `thread_search.cuh`.
  Thanks to Francis Lemaire for this contribution.
- Add missing includes to `cub.cuh`.
  Thanks to Felix Kallenborn for this contribution.

# CUB 1.9.8-1 (NVIDIA HPC SDK 20.3)

## Summary

CUB 1.9.8-1 is a variant of 1.9.8 accompanying the NVIDIA HPC SDK 20.3 release.
It contains modifications necessary to serve as the implementation of NVC++'s
GPU-accelerated C++17 Parallel Algorithms.

# CUB 1.9.8 (CUDA 11.0 Early Access)

## Summary

CUB 1.9.8 is the first release of CUB to be officially supported and included
in the CUDA Toolkit.
When compiling CUB in C++11 mode, CUB now caches calls to CUDA attribute query
APIs, which improves performance of these queries by 20x to 50x when they
are called concurrently by multiple host threads.

## Enhancements

- (C++11 or later) Cache calls to `cudaFuncGetAttributes` and
  `cudaDeviceGetAttribute` within `cub::PtxVersion` and `cub::SmVersion`.
  These CUDA APIs acquire locks on the CUDA driver/runtime mutexes and perform
  poorly under contention; with the caching, they are 20x to 50x faster when
  called concurrently.
  Thanks to Bilge Acun for bringing this issue to our attention.
- `DispatchReduce` now takes an `OutputT` template parameter so that users can
  specify the intermediate type explicitly.
- Radix sort tuning policy updates to fix performance issues for element
  types smaller than 4 bytes.

## Bug Fixes

- Change initialization style from copy initialization to direct initialization
  (which is more permissive) in `AgentReduce` to allow a wider range of types
  to be used with it.
- Fix bad signed/unsigned comparisons in `WarpReduce`.
- Fix computation of valid lanes in the warp-level reduction primitive to
  correctly handle the case where there are 0 input items per warp.

# CUB 1.8.0

## Summary

CUB 1.8.0 introduces changes to the `cub::Shuffle*` interfaces.

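The logical-warp behavior at the heart of this change can be illustrated with the raw CUDA shuffle intrinsic (a sketch of the underlying mechanism, not CUB's `cub::ShuffleDown` API itself):

```cpp
// Raw-intrinsic illustration (requires a CUDA-capable build): the `width`
// argument of __shfl_down_sync partitions the 32-thread warp into independent
// logical warps, which is what the SHFL control constant encodes -- the
// computation CUB 1.8.0's interface change improves for widths below 32.
__global__ void logical_warp_shuffle(int *out)
{
    const int logical_warp_size = 8; // four logical warps of 8 threads each
    int value = threadIdx.x;
    // Shift values down by 1 within each 8-thread logical warp; lanes never
    // read across a logical-warp boundary.
    int shifted = __shfl_down_sync(0xFFFFFFFFu, value, 1, logical_warp_size);
    out[threadIdx.x] = shifted;
}
```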
## Breaking Changes

- The interfaces of `cub::ShuffleIndex`, `cub::ShuffleUp`, and
  `cub::ShuffleDown` have been changed to allow for better computation of the
  PTX SHFL control constant for logical warps smaller than 32 threads.

## Bug Fixes

- #112: Fix `cub::WarpScan`'s broadcast of the warp-wide aggregate for logical
  warps smaller than 32 threads.

# CUB 1.7.5

## Summary

CUB 1.7.5 adds support for radix sorting `__half` keys and improved sorting
performance for 1-byte keys.
It was incorporated into Thrust 1.9.2.

## Enhancements

- Radix sort support for `__half` keys.
- Radix sort tuning policy updates to improve 1-byte key performance.

## Bug Fixes

- Syntax tweaks to mollify Clang.
- #127: `cub::DeviceRunLengthEncode::Encode` returns incorrect results.
- #128: 7-bit sorting passes fail for SM61 with large values.

# CUB 1.7.4

## Summary

CUB 1.7.4 is a minor release that was incorporated into Thrust 1.9.1-2.

## Bug Fixes

- #114: Can't pair non-trivially-constructible values in radix sort.
- #115: `cub::WarpReduce` segmented reduction is broken in CUDA 9 for logical
  warp sizes smaller than 32.

# CUB 1.7.3

## Summary

CUB 1.7.3 is a minor release.

## Bug Fixes

- #110: `cub::DeviceHistogram` null-pointer exception bug for iterator inputs.

# CUB 1.7.2

## Summary

CUB 1.7.2 is a minor release.

## Bug Fixes

- #104: Device-wide reduction is now "run-to-run" deterministic for
  pseudo-associative reduction operators (like floating-point addition).

# CUB 1.7.1

## Summary

CUB 1.7.1 delivers improved radix sort performance on SM7x (Volta) GPUs and a
number of bug fixes.

## Enhancements

- Radix sort tuning policies updated for SM7x (Volta).
## Bug Fixes

- #104: `uint64_t` `cub::WarpReduce` broken for CUB 1.7.0 on CUDA 8 and older.
- #103: Can't mix Thrust from CUDA 9.0 and CUB.
- #102: CUB pulls in `windows.h`, which defines `min`/`max` macros that
  conflict with `std::min`/`std::max`.
- #99: Radix sorting crashes NVCC on Windows 10 for SM52.
- #98: cuda-memcheck: `--tool initcheck` failed with lineOfSight.
- #94: Git clone size.
- #93: Accept iterators for segment offsets.
- #87: CUB uses anonymous unions, which is not valid C++.
- #44: Check for C++11 is incorrect for Visual Studio 2013.

# CUB 1.7.0

## Summary

CUB 1.7.0 brings support for CUDA 9.0 and SM7x (Volta) GPUs.
It is compatible with independent thread scheduling.
It was incorporated into Thrust 1.9.0-5.

## Breaking Changes

- Remove `cub::WarpAll` and `cub::WarpAny`.
  These functions served to emulate `__all` and `__any` functionality for
  SM1x devices, which did not have those operations.
  However, SM1x devices are now deprecated in CUDA, and the interfaces of these
  two functions are lacking the lane-mask needed for collectives to run on
  SM7x and newer GPUs, which have independent thread scheduling.

## Other Enhancements

- Remove any assumptions of implicit warp synchronization to be compatible with
  SM7x's (Volta) independent thread scheduling.

## Bug Fixes

- #86: Incorrect results with reduce-by-key.

# CUB 1.6.4

## Summary

CUB 1.6.4 improves radix sorting performance for SM5x (Maxwell) and SM6x
(Pascal) GPUs.

## Enhancements

- Radix sort tuning policies updated for SM5x (Maxwell) and SM6x (Pascal) -
  3.5B and 3.4B 32-bit keys/s on Titan X and GTX 1080, respectively.

## Bug Fixes

- Restore fence work-around for scan (reduce-by-key, etc.) hangs in CUDA 8.5.
- #65: `cub::DeviceSegmentedRadixSort` should allow inputs to have
  pointer-to-const type.
- Mollify Clang device-side warnings.
- Remove outdated MSVC project files.

# CUB 1.6.3

## Summary

CUB 1.6.3 improves support for Windows, changes the
`cub::BlockLoad`/`cub::BlockStore` interface to take the local data type,
and enhances radix sort performance for SM6x (Pascal) GPUs.

## Breaking Changes

- `cub::BlockLoad` and `cub::BlockStore` are now templated by the local data
  type, instead of the `Iterator` type.
  This allows for output iterators having `void` as their `value_type` (e.g.
  discard iterators).

## Other Enhancements

- Radix sort tuning policies updated for SM6x (Pascal) GPUs - 6.2B 4-byte
  keys/s on GP100.
- Improved support for Windows (warnings, alignment, etc.).

## Bug Fixes

- #74: `cub::WarpReduce` executes the reduction operator for out-of-bounds
  items.
- #72: `cub::InequalityWrapper::operator()` should be non-const.
- #71: `cub::KeyValuePair` won't work if `Key` has a non-trivial constructor.
- #69: `cub::BlockStore::Store` doesn't compile if `OutputIteratorT::value_type`
  isn't `T`.
- #68: `cub::TilePrefixCallbackOp::WarpReduce` doesn't permit PTX arch
  specialization.

# CUB 1.6.2 (previously 1.5.5)

## Summary

CUB 1.6.2 (previously 1.5.5) improves radix sort performance for SM6x (Pascal)
GPUs.

## Enhancements

- Radix sort tuning policies updated for SM6x (Pascal) GPUs.

## Bug Fixes

- Fix AArch64 compilation of `cub::CachingDeviceAllocator`.

# CUB 1.6.1 (previously 1.5.4)

## Summary

CUB 1.6.1 (previously 1.5.4) is a minor release.

## Bug Fixes

- Fix radix sorting bug introduced by the scan refactorization.

# CUB 1.6.0 (previously 1.5.3)

## Summary

CUB 1.6.0 changes the scan and reduce interfaces.
Exclusive scans now accept an "initial value" instead of an "identity value".
Scans and reductions now support differing input and output sequence types.
Additionally, many bugs have been fixed.

## Breaking Changes

- Device/block/warp-wide exclusive scans have been revised to now accept an
  "initial value" (instead of an "identity value") for seeding the computation
  with an arbitrary prefix.
- Device-wide reductions and scans can now have input sequence types that are
  different from output sequence types (as long as they are convertible).

## Other Enhancements

- Reduce repository size by moving the doxygen binary to the doc repository.
- Minor reduction in `cub::BlockScan` instruction counts.

## Bug Fixes

- #55: Warning in `cub/device/dispatch/dispatch_reduce_by_key.cuh`.
- #59: `cub::DeviceScan::ExclusiveSum` can't compute the prefix sum of `float`
  into `double`.
- #58: Infinite loop in `cub::CachingDeviceAllocator::NearestPowerOf`.
- #47: `cub::CachingDeviceAllocator` needs to clean up the CUDA global error
  state upon successful retry.
- #46: Very high amount of memory needed by
  `cub::DeviceHistogram::HistogramEven`.
- #45: `cub::CachingDeviceAllocator` fails with debug output enabled.

# CUB 1.5.2

## Summary

CUB 1.5.2 enhances `cub::CachingDeviceAllocator` and improves scan performance
for SM5x (Maxwell).

## Enhancements

- Improved medium-size scan performance on SM5x (Maxwell).
- Refactored `cub::CachingDeviceAllocator`:
  - Now spends less time locked.
  - Uses C++11's `std::mutex` when available.
  - Failure to allocate a block from the runtime will retry once after
    freeing cached allocations.
  - Now respects max-bin, fixing an issue where blocks in excess of max-bin
    were still being retained in the free cache.

## Bug Fixes

- Fix for generic-type reduce-by-key `cub::WarpScan` for SM3x and newer GPUs.

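The allocator behavior described above can be exercised with a short host-side sketch (a hedged example, not CUB's documentation; it requires a CUDA-capable build and omits error handling):

```cpp
#include <cub/util_allocator.cuh>

// Sketch of the cub::CachingDeviceAllocator usage pattern the refactor
// above targets.
void allocator_demo(cudaStream_t stream)
{
    cub::CachingDeviceAllocator allocator; // caches freed blocks by bin size

    void *d_buf = nullptr;
    // The first allocation of this bin size hits the CUDA runtime; later
    // allocations of the same size are served from the free cache.
    allocator.DeviceAllocate(&d_buf, 1024 * 1024, stream);

    // ... launch kernels using d_buf on `stream` ...

    // Returns the block to the allocator's free cache (up to max-bin),
    // rather than immediately calling cudaFree.
    allocator.DeviceFree(d_buf);
}
```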
# CUB 1.5.1

## Summary

CUB 1.5.1 is a minor release.

## Bug Fixes

- Fix for incorrect `cub::DeviceRadixSort` output for some small problems on
  SM52 (Maxwell) GPUs.
- Fix for macro redefinition warnings when compiling `thrust::sort`.

# CUB 1.5.0

## Summary

CUB 1.5.0 introduces segmented sort and reduction primitives.

## New Features

- Segmented device-wide operations for device-wide sort and reduction
  primitives.

## Bug Fixes

- #36: `cub::ThreadLoad` generates compiler errors when loading from
  pointer-to-const.
- #29: `cub::DeviceRadixSort::SortKeys<bool>` yields compiler errors.
- #26: Misaligned address after `cub::DeviceRadixSort::SortKeys`.
- #25: Fix for incorrect results and crashes when radix sorting 0-length
  problems.
- Fix CUDA 7.5 issues on SM52 GPUs with SHFL-based warp-scan and
  warp-reduction on non-primitive data types (e.g. user-defined structs).
- Fix small radix sorting problems where 0 temporary bytes were required and
  user code was invoking `malloc(0)` on some systems where that returns
  `NULL`.
  CUB assumed the user was asking for the size again and not running the sort.

# CUB 1.4.1

## Summary

CUB 1.4.1 is a minor release.

## Enhancements

- Allow `cub::DeviceRadixSort` and `cub::BlockRadixSort` on bool types.

## Bug Fixes

- Fix minor CUDA 7.0 performance regressions in `cub::DeviceScan` and
  `cub::DeviceReduceByKey`.
- Remove the requirement for callers to define the `CUB_CDP` macro
  when invoking CUB device-wide routines using CUDA dynamic parallelism.
- Fix headers not being included in the proper order (or missing includes)
  for some block-wide functions.

# CUB 1.4.0

## Summary

CUB 1.4.0 adds `cub::DeviceSpmv` and `cub::DeviceRunLengthEncode::NonTrivialRuns`,
improves `cub::DeviceHistogram`, and introduces support for SM5x (Maxwell)
GPUs.

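The improved `cub::DeviceHistogram` mentioned above follows CUB's standard two-phase temp-storage idiom. A hedged sketch of the `HistogramEven` call pattern (requires a CUDA-capable build; the sizes and buffers are illustrative assumptions):

```cpp
#include <cub/device/device_histogram.cuh>

// Sketch: build an evenly-binned histogram of device-resident samples.
void histogram_demo(const float *d_samples, int num_samples, int *d_histogram)
{
    int num_levels = 7;        // 7 bin boundaries -> 6 bins
    float lower_level = 0.0f;  // sample value mapping to bin 0
    float upper_level = 12.0f; // sample value mapping past the last bin

    // Phase 1: pass a null temp-storage pointer to query the required size.
    void *d_temp_storage = nullptr;
    size_t temp_storage_bytes = 0;
    cub::DeviceHistogram::HistogramEven(d_temp_storage, temp_storage_bytes,
                                        d_samples, d_histogram, num_levels,
                                        lower_level, upper_level, num_samples);

    // Phase 2: allocate the temporary storage and run the histogram.
    cudaMalloc(&d_temp_storage, temp_storage_bytes);
    cub::DeviceHistogram::HistogramEven(d_temp_storage, temp_storage_bytes,
                                        d_samples, d_histogram, num_levels,
                                        lower_level, upper_level, num_samples);
    cudaFree(d_temp_storage);
}
```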
## New Features

- `cub::DeviceSpmv` methods for multiplying sparse matrices by
  dense vectors, load-balanced using a merge-based parallel decomposition.
- `cub::DeviceRadixSort` sorting entry-points that always return
  the sorted output into the specified buffer, as opposed to the
  `cub::DoubleBuffer` in which it could end up in either buffer.
- `cub::DeviceRunLengthEncode::NonTrivialRuns` for finding the starting
  offsets and lengths of all non-trivial runs (i.e., length > 1) of keys in
  a given sequence.
  Useful for top-down partitioning algorithms like MSD sorting of very large
  keys.

## Other Enhancements

- Support and performance tuning for SM5x (Maxwell) GPUs.
- Updated `cub::DeviceHistogram` implementation that provides the same
  "histogram-even" and "histogram-range" functionality as IPP/NPP.
  Provides extremely fast and, perhaps more importantly, very uniform
  performance response across diverse real-world datasets, including
  pathological (homogeneous) sample distributions.

# CUB 1.3.2

## Summary

CUB 1.3.2 is a minor release.

## Bug Fixes

- Fix `cub::DeviceReduce` where reductions of small problems (small enough to
  only dispatch a single thread block) would run in the default stream (stream
  zero) regardless of whether an alternate stream was specified.

# CUB 1.3.1

## Summary

CUB 1.3.1 is a minor release.

## Bug Fixes

- Workaround for a benign WAW race warning reported by cuda-memcheck
  in `cub::BlockScan` specialized for the `BLOCK_SCAN_WARP_SCANS` algorithm.
- Fix bug in `cub::DeviceRadixSort` where the algorithm may sort more
  key bits than the caller specified (up to the nearest radix digit).
- Fix for an ~3% `cub::DeviceRadixSort` performance regression on SM2x (Fermi)
  and SM3x (Kepler) GPUs.
# CUB 1.3.0

## Summary

CUB 1.3.0 improves how thread blocks are expressed in block- and warp-wide
primitives and adds an enhanced version of `cub::WarpScan`.

## Breaking Changes

- CUB's collective (block-wide, warp-wide) primitives underwent a minor
  interface refactoring:
  - To provide the appropriate support for multidimensional thread blocks,
    the interfaces for collective classes are now template-parameterized by
    X, Y, and Z block dimensions (with `BLOCK_DIM_Y` and `BLOCK_DIM_Z` being
    optional, and `BLOCK_DIM_X` replacing `BLOCK_THREADS`).
    Furthermore, the constructors that accept remapped linear
    thread-identifiers have been removed: all primitives now assume a
    row-major thread ranking for multidimensional thread blocks.
  - To allow the host program (compiled by the host-pass) to accurately
    determine the device-specific storage requirements for a given collective
    (compiled for each device-pass), the interfaces for collective classes
    are now (optionally) template-parameterized by the desired PTX compute
    capability.
    This is useful when aliasing collective storage to shared memory that has
    been allocated dynamically by the host at the kernel call site.
  - Most CUB programs having typical 1D usage should not require any
    changes to accommodate these updates.

## New Features

- Added "combination" `cub::WarpScan` methods for efficiently computing
  both inclusive and exclusive prefix scans (and sums).

## Bug Fixes

- Fix for a bug in `cub::WarpScan` (which affected `cub::BlockScan` and
  `cub::DeviceScan`) where incorrect results (e.g., NaN) would often be
  returned when parameterized for floating-point types (fp32, fp64).
- Workaround for a ptxas error when compiling with the -G flag on Linux (for
  debug instrumentation).
- Fixes for certain scan scenarios using custom scan operators where code
  compiled for SM1x is run on newer GPUs of higher compute capability: the
  compiler could not tell which memory space was being used by collective
  operations and was mistakenly using global ops instead of shared ops.

# CUB 1.2.3

## Summary

CUB 1.2.3 is a minor release.

## Bug Fixes

- Fixed access violation bug in `cub::DeviceReduce::ReduceByKey` for
  non-primitive value types.
- Fixed code-snippet bug in the `ArgIndexInputIteratorT` documentation.

# CUB 1.2.2

## Summary

CUB 1.2.2 adds a new variant of `cub::BlockReduce` and MSVC project solutions
for examples.

## New Features

- MSVC project solutions for device-wide and block-wide examples.
- New algorithmic variant of `cub::BlockReduce` for improved performance
  when using commutative operators (e.g., numeric addition).

## Bug Fixes

- Inclusion of Thrust headers in a certain order prevented CUB device-wide
  primitives from working properly.

# CUB 1.2.0

## Summary

CUB 1.2.0 adds `cub::DeviceReduce::ReduceByKey` and
`cub::DeviceReduce::RunLengthEncode` and support for CUDA 6.0.

## New Features

- `cub::DeviceReduce::ReduceByKey`.
- `cub::DeviceReduce::RunLengthEncode`.

## Other Enhancements

- Improved `cub::DeviceScan`, `cub::DeviceSelect`, and `cub::DevicePartition`
  performance.
- Documentation and testing:
  - Added performance-portability plots for many device-wide primitives.
  - Explained iterator (in)compatibilities with CUDA 5.0 (and older) and
    Thrust 1.6 (and older).
- Revised the operation of temporary tile-status bookkeeping for
  `cub::DeviceScan` (and similar) to be safe for current code run on future
  platforms (now uses proper fences).

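As an illustration of this release's headline `cub::DeviceReduce::ReduceByKey` addition, a hedged sketch of the call pattern using the modern CUB signature (requires a CUDA-capable build; device buffers are assumed pre-allocated, and the signature shown may differ from the 1.2.0-era API):

```cpp
#include <cub/device/device_reduce.cuh>

// Sketch: sum runs of values that share equal consecutive keys.
void reduce_by_key_demo(const int *d_keys_in, const float *d_values_in,
                        int *d_unique_out, float *d_aggregates_out,
                        int *d_num_runs_out, int num_items)
{
    cub::Sum reduction_op; // aggregate values within each run of equal keys

    // Phase 1: query the required temporary storage size.
    void *d_temp_storage = nullptr;
    size_t temp_storage_bytes = 0;
    cub::DeviceReduce::ReduceByKey(d_temp_storage, temp_storage_bytes,
                                   d_keys_in, d_unique_out, d_values_in,
                                   d_aggregates_out, d_num_runs_out,
                                   reduction_op, num_items);

    // Phase 2: allocate and run the reduction.
    cudaMalloc(&d_temp_storage, temp_storage_bytes);
    cub::DeviceReduce::ReduceByKey(d_temp_storage, temp_storage_bytes,
                                   d_keys_in, d_unique_out, d_values_in,
                                   d_aggregates_out, d_num_runs_out,
                                   reduction_op, num_items);
    cudaFree(d_temp_storage);
}
```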
## Bug Fixes

- Fix `cub::DeviceScan` bug where Windows alignment disagreements between host
  and device regarding user-defined data types would corrupt tile status.
- Fix `cub::BlockScan` bug where certain exclusive scans on custom data types
  for the `BLOCK_SCAN_WARP_SCANS` variant would return incorrect results for
  the first thread in the block.
- Added workaround to make `cub::TexRefInputIteratorT` work with CUDA 6.0.

# CUB 1.1.1

## Summary

CUB 1.1.1 introduces texture and cache-modifier iterators, descending sorting,
`cub::DeviceSelect`, `cub::DevicePartition`, `cub::Shuffle*`, and
`cub::MaxSmOccupancy`.
Additionally, scan and sort performance for older GPUs has been improved and
many bugs have been fixed.

## Breaking Changes

- Refactored block-wide I/O (`cub::BlockLoad` and `cub::BlockStore`), removing
  cache-modifiers from their interfaces.
  `cub::CacheModifiedInputIterator` and `cub::CacheModifiedOutputIterator`
  should now be used with `cub::BlockLoad` and `cub::BlockStore` to effect that
  behavior.

## New Features

- `cub::TexObjInputIterator`, `cub::TexRefInputIterator`,
  `cub::CacheModifiedInputIterator`, and `cub::CacheModifiedOutputIterator`
  types for loading & storing arbitrary types through the cache hierarchy.
  They are compatible with Thrust.
- Descending sorting for `cub::DeviceRadixSort` and `cub::BlockRadixSort`.
- Min, max, arg-min, and arg-max operators for `cub::DeviceReduce`.
- `cub::DeviceSelect` (select-unique, select-if, and select-flagged).
- `cub::DevicePartition` (partition-if, partition-flagged).
- Generic `cub::ShuffleUp`, `cub::ShuffleDown`, and `cub::ShuffleIndex` for
  warp-wide communication of arbitrary data types (SM3x and up).
- `cub::MaxSmOccupancy` for accurately determining SM occupancy for any given
  kernel function pointer.
## Other Enhancements

- Improved `cub::DeviceScan` and `cub::DeviceRadixSort` performance for older
  GPUs (SM1x to SM3x).
- Renamed the device-wide `stream_synchronous` param to `debug_synchronous` to
  avoid confusion about usage.
- Documentation improvements:
  - Added simple examples of device-wide methods.
  - Improved doxygen documentation and example snippets.
- Improved test coverage to include up to 21,000 kernel variants and 851,000
  unit tests (per architecture, per platform).

## Bug Fixes

- Fix miscellaneous `cub::DeviceScan`, `cub::BlockScan`, `cub::DeviceReduce`,
  and `cub::BlockReduce` bugs when operating on non-primitive types for older
  architectures (SM1x).
- SHFL-based scans and reductions produced incorrect results for multi-word
  types (size > 4B) on Linux.
- For `cub::WarpScan`-based scans, not all threads in the first warp were
  entering the prefix callback functor.
- `cub::DeviceRadixSort` had a race condition with key-value pairs for pre-SM35
  architectures.
- `cub::DeviceRadixSort` bitfield-extract behavior with long keys on 64-bit
  Linux was incorrect.
- `cub::BlockDiscontinuity` failed to compile for types other than
  `int32_t`/`uint32_t`.
- CUDA Dynamic Parallelism (CDP, e.g. device-callable) versions of device-wide
  methods now report the same temporary storage allocation size requirement as
  their host-callable counterparts.

# CUB 1.0.2

## Summary

CUB 1.0.2 is a minor release.

## Bug Fixes

- Corrections to code snippet examples for `cub::BlockLoad`, `cub::BlockStore`,
  and `cub::BlockDiscontinuity`.
- Cleaned up unnecessary/missing header includes.
  You can now safely include a specific .cuh (instead of `cub.cuh`).
- Bug/compilation fixes for `cub::BlockHistogram`.

# CUB 1.0.1

## Summary

CUB 1.0.1 adds `cub::DeviceRadixSort` and `cub::DeviceScan`.
Numerous other performance and correctness fixes are included.
## Breaking Changes

- New collective interface idiom (specialize/construct/invoke).

## New Features

- `cub::DeviceRadixSort`.
  Implements short-circuiting for homogeneous digit passes.
- `cub::DeviceScan`.
  Implements a single-pass "adaptive-lookback" strategy.

## Other Enhancements

- Significantly improved documentation (with example code snippets).
- More extensive regression test suite for aggressively testing collective
  variants.
- Allow non-trivially-constructed types (previously unions had prevented
  aliasing temporary storage of those types).
- Improved support for SM3x SHFL (collective ops now use SHFL for types larger
  than 32 bits).
- Better code generation for 64-bit addressing within
  `cub::BlockLoad`/`cub::BlockStore`.
- `cub::DeviceHistogram` now supports histograms of arbitrary bins.
- Updates to accommodate CUDA 5.5 dynamic parallelism.

## Bug Fixes

- Workarounds for SM10 codegen issues in uncommonly-used
  `cub::WarpScan`/`cub::WarpReduce` specializations.

# CUB 0.9.4

## Summary

CUB 0.9.4 is a minor release.

## Enhancements

- Various documentation updates and corrections.

## Bug Fixes

- Fixed compilation errors for SM1x.
- Fixed compilation errors for some `cub::WarpScan` entrypoints on SM3x and up.

# CUB 0.9.3

## Summary

CUB 0.9.3 adds histogram algorithms and work management utility descriptors.

## New Features

- `cub::DeviceHistogram256`.
- `cub::BlockHistogram256`.
- `cub::BlockScan` algorithm variant `BLOCK_SCAN_RAKING_MEMOIZE`, which
  trades more register consumption for less shared memory I/O.
- `cub::GridQueue` and `cub::GridEvenShare`, work management utility
  descriptors.

## Other Enhancements

- Updates to `cub::BlockRadixRank` to use `cub::BlockScan`, which improves
  performance on SM3x by using SHFL.
- Allow types other than builtin types to be used in `cub::WarpScan::*Sum`
  methods if they only have `operator+` overloaded.
  Previously they were also required to support assignment from `int(0)`.
- Update `cub::BlockReduce`'s `BLOCK_REDUCE_WARP_REDUCTIONS` algorithm to work
  even when the block size is not an even multiple of the warp size.
- Refactoring of the `cub::DeviceAllocator` interface and the
  `cub::CachingDeviceAllocator` implementation.

# CUB 0.9.2

## Summary

CUB 0.9.2 adds `cub::WarpReduce`.

## New Features

- `cub::WarpReduce`, which uses the SHFL instruction when applicable.
  `cub::BlockReduce` now uses this `cub::WarpReduce` instead of implementing
  its own.

## Enhancements

- Documentation updates and corrections.

## Bug Fixes

- Fixes for 64-bit Linux compilation warnings and errors.

# CUB 0.9.1

## Summary

CUB 0.9.1 is a minor release.

## Bug Fixes

- Fix for ambiguity in `cub::BlockScan::Reduce` between generic reduction and
  summation.
  Summation entrypoints are now called `::Sum()`, similar to the
  convention in `cub::BlockScan`.
- Small edits to documentation and download tracking.

# CUB 0.9.0

## Summary

Initial preview release.
CUB is the first durable, high-performance library of cooperative block-level,
warp-level, and thread-level primitives for CUDA kernel programming.