# CUB 1.10.0 (NVIDIA HPC SDK 20.9)

## Summary

CUB 1.10.0 is the major release accompanying the NVIDIA HPC SDK 20.9 release.
It drops support for C++03, GCC < 5, Clang < 6, and MSVC < 2017.
It also overhauls CMake support.
Finally, we now have a Code of Conduct for contributors:
https://github.com/NVIDIA/cub/blob/main/CODE_OF_CONDUCT.md

## Breaking Changes

- C++03 is no longer supported.
- GCC < 5, Clang < 6, and MSVC < 2017 are no longer supported.
- C++11 is deprecated.
  Using this dialect will generate a compile-time warning.
  These warnings can be suppressed by defining
    `CUB_IGNORE_DEPRECATED_CPP_DIALECT` or `CUB_IGNORE_DEPRECATED_CPP_11`.
  Suppression is only a short-term solution.
  We will be dropping support for C++11 in the near future.
- CMake < 3.15 is no longer supported.
- The default branch on GitHub is now called `main`.

## Other Enhancements

- Added install targets to CMake builds.
- C++17 support.

## Bug Fixes

- NVIDIA/thrust#1244: Check for macro collisions with system headers during
    header testing.
- NVIDIA/thrust#1153: Switch to placement new instead of assignment to
    construct items in uninitialized memory.
  Thanks to Hugh Winkler for this contribution.
- NVIDIA/cub#38: Fix `cub::DeviceHistogram` for `size_t` `OffsetT`s.
  Thanks to Leo Fang for this contribution.
- NVIDIA/cub#35: Fix GCC-5 maybe-uninitialized warning.
  Thanks to Rong Ou for this contribution.
- NVIDIA/cub#36: Qualify namespace for `va_printf` in `_CubLog`.
  Thanks to Andrei Tchouprakov for this contribution.

# CUB 1.9.10-1 (NVIDIA HPC SDK 20.7, CUDA Toolkit 11.1)

## Summary

CUB 1.9.10-1 is the minor release accompanying the NVIDIA HPC SDK 20.7 release
  and the CUDA Toolkit 11.1 release.

## Bug Fixes

- NVIDIA/thrust#1217: Move static local in `cub::DeviceCount` to a separate
  host-only function because NVC++ doesn't support static locals in host-device
  functions.

# CUB 1.9.10 (NVIDIA HPC SDK 20.5)

## Summary

CUB 1.9.10 is the release accompanying the NVIDIA HPC SDK 20.5 release.
It adds CMake `find_package` support.
C++03, C++11, GCC < 5, Clang < 6, and MSVC < 2017 are now deprecated.
Starting with the upcoming 1.10.0 release, C++03 support will be dropped
  entirely.

## Breaking Changes

- Thrust now checks that it is compatible with the version of CUB found
    in your include path, generating an error if it is not.
  If you are using your own version of CUB, it may be too old.
  It is recommended to simply delete your own version of CUB and use the
    version of CUB that comes with Thrust.
- C++03 and C++11 are deprecated.
  Using these dialects will generate a compile-time warning.
  These warnings can be suppressed by defining
    `CUB_IGNORE_DEPRECATED_CPP_DIALECT` (to suppress C++03 and C++11
    deprecation warnings) or `CUB_IGNORE_DEPRECATED_CPP_11` (to suppress C++11
    deprecation warnings).
  Suppression is only a short-term solution.
  We will be dropping support for C++03 in the 1.10.0 release and C++11 in the
    near future.
- GCC < 5, Clang < 6, and MSVC < 2017 are deprecated.
  Using these compilers will generate a compile-time warning.
  These warnings can be suppressed by defining
    `CUB_IGNORE_DEPRECATED_COMPILER`.
  Suppression is only a short-term solution.
  We will be dropping support for these compilers in the near future.

## New Features

- CMake `find_package` support.
  Just point CMake at the `cmake` folder in your CUB include directory
    (ex: `cmake -DCUB_DIR=/usr/local/cuda/include/cub/cmake/ .`) and then you
    can add CUB to your CMake project with `find_package(CUB REQUIRED CONFIG)`.
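
For example, a consuming project's `CMakeLists.txt` might look like the following sketch. The project and target names are illustrative, and `CUB::CUB` is assumed to be the imported target exported by CUB's package config:

```cmake
cmake_minimum_required(VERSION 3.15)
project(cub_consumer CXX CUDA)

# Either pass -DCUB_DIR=/usr/local/cuda/include/cub/cmake/ on the command
# line, or set CUB_DIR here before calling find_package.
find_package(CUB REQUIRED CONFIG)

add_executable(app main.cu)
target_link_libraries(app PRIVATE CUB::CUB)
```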

# CUB 1.9.9 (CUDA 11.0)

## Summary

CUB 1.9.9 is the release accompanying the CUDA Toolkit 11.0 release.
It introduces CMake support, version macros, platform detection machinery,
  and support for NVC++, which uses Thrust (and thus CUB) to implement
  GPU-accelerated C++17 Parallel Algorithms.
Additionally, the scan dispatch layer was refactored and modernized.
C++03, C++11, GCC < 5, Clang < 6, and MSVC < 2017 are now deprecated.
Starting with the upcoming 1.10.0 release, C++03 support will be dropped
  entirely.

## Breaking Changes

- Thrust now checks that it is compatible with the version of CUB found
    in your include path, generating an error if it is not.
  If you are using your own version of CUB, it may be too old.
  It is recommended to simply delete your own version of CUB and use the
    version of CUB that comes with Thrust.
- C++03 and C++11 are deprecated.
  Using these dialects will generate a compile-time warning.
  These warnings can be suppressed by defining
    `CUB_IGNORE_DEPRECATED_CPP_DIALECT` (to suppress C++03 and C++11
    deprecation warnings) or `CUB_IGNORE_DEPRECATED_CPP11` (to suppress C++11
    deprecation warnings).
  Suppression is only a short-term solution.
  We will be dropping support for C++03 in the 1.10.0 release and C++11 in the
    near future.
- GCC < 5, Clang < 6, and MSVC < 2017 are deprecated.
  Using these compilers will generate a compile-time warning.
  These warnings can be suppressed by defining
    `CUB_IGNORE_DEPRECATED_COMPILER`.
  Suppression is only a short-term solution.
  We will be dropping support for these compilers in the near future.

## New Features

- CMake support.
  Thanks to Francis Lemaire for this contribution.
- Refactored and modernized the scan dispatch layer.
  Thanks to Francis Lemaire for this contribution.
- Policy hooks for device-wide reduce, scan, and radix sort facilities
    to simplify tuning and allow users to provide custom policies.
  Thanks to Francis Lemaire for this contribution.
- `<cub/version.cuh>`: `CUB_VERSION`, `CUB_VERSION_MAJOR`, `CUB_VERSION_MINOR`,
    `CUB_VERSION_SUBMINOR`, and `CUB_PATCH_NUMBER`.
- Platform detection machinery:
  - `<cub/util_cpp_dialect.cuh>`: Detects the C++ standard dialect.
  - `<cub/util_compiler.cuh>`: Host and device compiler detection.
  - `<cub/util_deprecated.cuh>`: `CUB_DEPRECATED`.
  - `<cub/config.cuh>`: Includes `<cub/util_arch.cuh>`,
      `<cub/util_compiler.cuh>`, `<cub/util_cpp_dialect.cuh>`,
      `<cub/util_deprecated.cuh>`, `<cub/util_macro.cuh>`, and
      `<cub/util_namespace.cuh>`.
- `cub::DeviceCount` and `cub::DeviceCountUncached`, caching abstractions for
    `cudaGetDeviceCount`.

## Other Enhancements

- Lazily initialize the per-device CUDA attribute caches, because CUDA context
    creation is expensive and adds up with large CUDA binaries on machines with
    many GPUs.
  Thanks to the NVIDIA PyTorch team for bringing this to our attention.
- Make `cub::SwitchDevice` avoid setting/resetting the device if the current
    device is the same as the target device.

## Bug Fixes

- Add explicit failure parameter to CAS in the CUB attribute cache to work
    around a GCC 4.8 bug.
- Revert a change in reductions that changed the signedness of the `lane_id`
    variable to suppress a warning, as this introduces a bug in optimized device
    code.
- Fix initialization in `cub::ExclusiveSum`.
  Thanks to Conor Hoekstra for this contribution.
- Fix initialization of the `std::array` in the CUB attribute cache.
- Fix `-Wsign-compare` warnings.
  Thanks to Elias Stehle for this contribution.
- Fix `test_block_reduce.cu` to build without parameters.
  Thanks to Francis Lemaire for this contribution.
- Add missing includes to `grid_even_share.cuh`.
  Thanks to Francis Lemaire for this contribution.
- Add missing includes to `thread_search.cuh`.
  Thanks to Francis Lemaire for this contribution.
- Add missing includes to `cub.cuh`.
  Thanks to Felix Kallenborn for this contribution.

# CUB 1.9.8-1 (NVIDIA HPC SDK 20.3)

## Summary

CUB 1.9.8-1 is a variant of 1.9.8 accompanying the NVIDIA HPC SDK 20.3 release.
It contains modifications necessary to serve as the implementation of NVC++'s
  GPU-accelerated C++17 Parallel Algorithms.

# CUB 1.9.8 (CUDA 11.0 Early Access)

## Summary

CUB 1.9.8 is the first release of CUB to be officially supported and included
  in the CUDA Toolkit.
When compiling CUB in C++11 mode, CUB now caches calls to CUDA attribute query
  APIs, which improves performance of these queries by 20x to 50x when they
  are called concurrently by multiple host threads.

## Enhancements

- (C++11 or later) Cache calls to `cudaFuncGetAttributes` and
    `cudaDeviceGetAttribute` within `cub::PtxVersion` and `cub::SmVersion`.
    These CUDA APIs acquire locks on a CUDA driver/runtime mutex and perform
    poorly under contention; with the caching, they are 20x to 50x faster when
    called concurrently.
  Thanks to Bilge Acun for bringing this issue to our attention.
- `DispatchReduce` now takes an `OutputT` template parameter so that users can
    specify the intermediate type explicitly.
- Radix sort tuning policy updates to fix performance issues for element
    types smaller than 4 bytes.

## Bug Fixes

- Change initialization style from copy initialization to direct initialization
    (which is more permissive) in `AgentReduce` to allow a wider range of types
    to be used with it.
- Fix bad signed/unsigned comparisons in `WarpReduce`.
- Fix computation of valid lanes in the warp-level reduction primitive to
    correctly handle the case where there are 0 input items per warp.

# CUB 1.8.0

## Summary

CUB 1.8.0 introduces changes to the `cub::Shuffle*` interfaces.

## Breaking Changes

- The interfaces of `cub::ShuffleIndex`, `cub::ShuffleUp`, and
    `cub::ShuffleDown` have been changed to allow for better computation of the
    PTX SHFL control constant for logical warps smaller than 32 threads.

## Bug Fixes

- #112: Fix `cub::WarpScan`'s broadcast of the warp-wide aggregate for logical
    warps smaller than 32 threads.

# CUB 1.7.5

## Summary

CUB 1.7.5 adds support for radix sorting `__half` keys and improves sorting
  performance for 1-byte keys.
It was incorporated into Thrust 1.9.2.

## Enhancements

- Radix sort support for `__half` keys.
- Radix sort tuning policy updates to improve 1-byte key performance.

## Bug Fixes

- Syntax tweaks to mollify Clang.
- #127: `cub::DeviceRunLengthEncode::Encode` returns incorrect results.
- #128: 7-bit sorting passes fail for SM61 with large values.

# CUB 1.7.4

## Summary

CUB 1.7.4 is a minor release that was incorporated into Thrust 1.9.1-2.

## Bug Fixes

- #114: Can't pair non-trivially-constructible values in radix sort.
- #115: `cub::WarpReduce` segmented reduction is broken in CUDA 9 for logical
    warp sizes smaller than 32.

# CUB 1.7.3

## Summary

CUB 1.7.3 is a minor release.

## Bug Fixes

- #110: `cub::DeviceHistogram` null-pointer exception bug for iterator inputs.

# CUB 1.7.2

## Summary

CUB 1.7.2 is a minor release.

## Bug Fixes

- #104: Device-wide reduction is now "run-to-run" deterministic for
    pseudo-associative reduction operators (like floating point addition).

# CUB 1.7.1

## Summary

CUB 1.7.1 delivers improved radix sort performance on SM7x (Volta) GPUs and a
  number of bug fixes.

## Enhancements

- Radix sort tuning policies updated for SM7x (Volta).

## Bug Fixes

- #104: `uint64_t` `cub::WarpReduce` broken for CUB 1.7.0 on CUDA 8 and older.
- #103: Can't mix Thrust from CUDA 9.0 and CUB.
- #102: CUB pulls in `windows.h`, which defines `min`/`max` macros that conflict
    with `std::min`/`std::max`.
- #99: Radix sorting crashes NVCC on Windows 10 for SM52.
- #98: `cuda-memcheck --tool initcheck` failed with lineOfSight.
- #94: Git clone size.
- #93: Accept iterators for segment offsets.
- #87: CUB uses anonymous unions, which is not valid C++.
- #44: Check for C++11 is incorrect for Visual Studio 2013.

# CUB 1.7.0

## Summary

CUB 1.7.0 brings support for CUDA 9.0 and SM7x (Volta) GPUs.
It is compatible with independent thread scheduling.
It was incorporated into Thrust 1.9.0-5.

## Breaking Changes

- Remove `cub::WarpAll` and `cub::WarpAny`.
  These functions served to emulate `__all` and `__any` functionality for
    SM1x devices, which did not have those operations.
  However, SM1x devices are now deprecated in CUDA, and the interfaces of these
    two functions lack the lane-mask needed for collectives to run on
    SM7x and newer GPUs, which have independent thread scheduling.

## Other Enhancements

- Remove any assumptions of implicit warp synchronization to be compatible with
    SM7x's (Volta) independent thread scheduling.

## Bug Fixes

- #86: Incorrect results with reduce-by-key.

# CUB 1.6.4

## Summary

CUB 1.6.4 improves radix sorting performance for SM5x (Maxwell) and SM6x
  (Pascal) GPUs.

## Enhancements

- Radix sort tuning policies updated for SM5x (Maxwell) and SM6x (Pascal) -
    3.5B and 3.4B 32-bit keys/s on TitanX and GTX 1080, respectively.

## Bug Fixes

- Restore fence work-around for scan (reduce-by-key, etc.) hangs in CUDA 8.5.
- #65: `cub::DeviceSegmentedRadixSort` should allow inputs to have
    pointer-to-const type.
- Mollify Clang device-side warnings.
- Remove outdated MSVC project files.

# CUB 1.6.3

## Summary

CUB 1.6.3 improves support for Windows, changes the
  `cub::BlockLoad`/`cub::BlockStore` interface to take the local data type,
  and enhances radix sort performance for SM6x (Pascal) GPUs.

## Breaking Changes

- `cub::BlockLoad` and `cub::BlockStore` are now templated by the local data
    type, instead of the `Iterator` type.
  This allows for output iterators having `void` as their `value_type` (e.g.
    discard iterators).

## Other Enhancements

- Radix sort tuning policies updated for SM6x (Pascal) GPUs - 6.2B 4-byte
    keys/s on GP100.
- Improved support for Windows (warnings, alignment, etc.).

## Bug Fixes

- #74: `cub::WarpReduce` executes the reduction operator for out-of-bounds
    items.
- #72: `cub::InequalityWrapper::operator()` should be non-const.
- #71: `cub::KeyValuePair` won't work if `Key` has a non-trivial constructor.
- #69: `cub::BlockStore::Store` doesn't compile if `OutputIteratorT::value_type`
    isn't `T`.
- #68: `cub::TilePrefixCallbackOp::WarpReduce` doesn't permit PTX arch
    specialization.

# CUB 1.6.2 (previously 1.5.5)

## Summary

CUB 1.6.2 (previously 1.5.5) improves radix sort performance for SM6x (Pascal)
  GPUs.

## Enhancements

- Radix sort tuning policies updated for SM6x (Pascal) GPUs.

## Bug Fixes

- Fix AArch64 compilation of `cub::CachingDeviceAllocator`.

# CUB 1.6.1 (previously 1.5.4)

## Summary

CUB 1.6.1 (previously 1.5.4) is a minor release.

## Bug Fixes

- Fix radix sorting bug introduced by the scan refactoring.

# CUB 1.6.0 (previously 1.5.3)

## Summary

CUB 1.6.0 changes the scan and reduce interfaces.
Exclusive scans now accept an "initial value" instead of an "identity value".
Scans and reductions now support differing input and output sequence types.
Additionally, many bugs have been fixed.

## Breaking Changes

- Device/block/warp-wide exclusive scans have been revised to now accept an
    "initial value" (instead of an "identity value") for seeding the computation
    with an arbitrary prefix.
- Device-wide reductions and scans can now have input sequence types that are
    different from output sequence types (as long as they are convertible).

## Other Enhancements

- Reduce repository size by moving the doxygen binary to the doc repository.
- Minor reduction in `cub::BlockScan` instruction counts.

## Bug Fixes

- Issue #55: Warning in `cub/device/dispatch/dispatch_reduce_by_key.cuh`.
- Issue #59: `cub::DeviceScan::ExclusiveSum` can't prefix sum `float` inputs
    into `double` outputs.
- Issue #58: Infinite loop in `cub::CachingDeviceAllocator::NearestPowerOf`.
- Issue #47: `cub::CachingDeviceAllocator` needs to clean up CUDA global error
    state upon successful retry.
- Issue #46: Very high amount of memory needed by
    `cub::DeviceHistogram::HistogramEven`.
- Issue #45: `cub::CachingDeviceAllocator` fails with debug output enabled.

# CUB 1.5.2

## Summary

CUB 1.5.2 enhances `cub::CachingDeviceAllocator` and improves scan performance
  for SM5x (Maxwell).

## Enhancements

- Improved medium-size scan performance on SM5x (Maxwell).
- Refactored `cub::CachingDeviceAllocator`:
  - Now spends less time locked.
  - Uses C++11's `std::mutex` when available.
  - Failure to allocate a block from the runtime will retry once after
      freeing cached allocations.
  - Now respects max-bin, fixing an issue where blocks in excess of max-bin
      were still being retained in the free cache.

## Bug Fixes

- Fix for generic-type reduce-by-key `cub::WarpScan` for SM3x and newer GPUs.

# CUB 1.5.1

## Summary

CUB 1.5.1 is a minor release.

## Bug Fixes

- Fix for incorrect `cub::DeviceRadixSort` output for some small problems on
    SM52 (Maxwell) GPUs.
- Fix for macro redefinition warnings when compiling `thrust::sort`.

# CUB 1.5.0

## Summary

CUB 1.5.0 introduces segmented sort and reduction primitives.

## New Features

- Segmented device-wide operations for device-wide sort and reduction
    primitives.

## Bug Fixes

- #36: `cub::ThreadLoad` generates compiler errors when loading from
    pointer-to-const.
- #29: `cub::DeviceRadixSort::SortKeys<bool>` yields compiler errors.
- #26: Misaligned address after `cub::DeviceRadixSort::SortKeys`.
- #25: Fix for incorrect results and crashes when radix sorting 0-length
    problems.
- Fix CUDA 7.5 issues on SM52 GPUs with SHFL-based warp-scan and
    warp-reduction on non-primitive data types (e.g. user-defined structs).
- Fix small radix sorting problems where 0 temporary bytes were required and
    user code was invoking `malloc(0)` on some systems where that returns
    `NULL`.
  CUB assumed the user was asking for the size again and not running the sort.

# CUB 1.4.1

## Summary

CUB 1.4.1 is a minor release.

## Enhancements

- Allow `cub::DeviceRadixSort` and `cub::BlockRadixSort` on `bool` types.

## Bug Fixes

- Fix minor CUDA 7.0 performance regressions in `cub::DeviceScan` and
    `cub::DeviceReduceByKey`.
- Remove requirement for callers to define the `CUB_CDP` macro
    when invoking CUB device-wide routines using CUDA dynamic parallelism.
- Fix headers not being included in the proper order (or missing includes)
    for some block-wide functions.

# CUB 1.4.0

## Summary

CUB 1.4.0 adds `cub::DeviceSpmv` and
  `cub::DeviceRunLengthEncode::NonTrivialRuns`,
  improves `cub::DeviceHistogram`, and introduces support for SM5x (Maxwell)
  GPUs.

## New Features

- `cub::DeviceSpmv` methods for multiplying sparse matrices by
    dense vectors, load-balanced using a merge-based parallel decomposition.
- `cub::DeviceRadixSort` sorting entry-points that always return
    the sorted output into the specified buffer, as opposed to the
    `cub::DoubleBuffer` in which it could end up in either buffer.
- `cub::DeviceRunLengthEncode::NonTrivialRuns` for finding the starting
    offsets and lengths of all non-trivial runs (i.e., length > 1) of keys in
    a given sequence.
  Useful for top-down partitioning algorithms like MSD sorting of very-large
    keys.

## Other Enhancements

- Support and performance tuning for SM5x (Maxwell) GPUs.
- Updated `cub::DeviceHistogram` implementation that provides the same
    "histogram-even" and "histogram-range" functionality as IPP/NPP.
  Provides extremely fast and, perhaps more importantly, very uniform
    performance response across diverse real-world datasets, including
    pathological (homogeneous) sample distributions.

# CUB 1.3.2

## Summary

CUB 1.3.2 is a minor release.

## Bug Fixes

- Fix `cub::DeviceReduce` where reductions of small problems (small enough to
    only dispatch a single thread block) would run in the default stream (stream
    zero) regardless of whether an alternate stream was specified.

# CUB 1.3.1

## Summary

CUB 1.3.1 is a minor release.

## Bug Fixes

- Workaround for a benign WAW race warning reported by cuda-memcheck
    in `cub::BlockScan` specialized for the `BLOCK_SCAN_WARP_SCANS` algorithm.
- Fix bug in `cub::DeviceRadixSort` where the algorithm may sort more
    key bits than the caller specified (up to the nearest radix digit).
- Fix for ~3% `cub::DeviceRadixSort` performance regression on SM2x (Fermi) and
    SM3x (Kepler) GPUs.

# CUB 1.3.0

## Summary

CUB 1.3.0 improves how thread blocks are expressed in block- and warp-wide
  primitives and adds an enhanced version of `cub::WarpScan`.

## Breaking Changes

- CUB's collective (block-wide, warp-wide) primitives underwent a minor
    interface refactoring:
  - To provide the appropriate support for multidimensional thread blocks,
      the interfaces for collective classes are now template-parameterized by
      X, Y, and Z block dimensions (with `BLOCK_DIM_Y` and `BLOCK_DIM_Z` being
      optional, and `BLOCK_DIM_X` replacing `BLOCK_THREADS`).
    Furthermore, the constructors that accept remapped linear
      thread-identifiers have been removed: all primitives now assume a
      row-major thread-ranking for multidimensional thread blocks.
  - To allow the host program (compiled by the host-pass) to accurately
      determine the device-specific storage requirements for a given collective
      (compiled for each device-pass), the interfaces for collective classes
      are now (optionally) template-parameterized by the desired PTX compute
      capability.
    This is useful when aliasing collective storage to shared memory that has
      been allocated dynamically by the host at the kernel call site.
  - Most CUB programs having typical 1D usage should not require any
      changes to accommodate these updates.

## New Features

- Added "combination" `cub::WarpScan` methods for efficiently computing
    both inclusive and exclusive prefix scans (and sums).

## Bug Fixes

- Fix for a bug in `cub::WarpScan` (which affected `cub::BlockScan` and
    `cub::DeviceScan`) where incorrect results (e.g., NaN) would often be
    returned when parameterized for floating-point types (fp32, fp64).
- Workaround for a ptxas error when compiling with the -G flag on Linux (for
    debug instrumentation).
- Fixes for certain scan scenarios using custom scan operators where code
    compiled for SM1x is run on newer GPUs of higher compute capability: the
    compiler could not tell which memory space was being used by collective
    operations and was mistakenly using global ops instead of shared ops.

# CUB 1.2.3

## Summary

CUB 1.2.3 is a minor release.

## Bug Fixes

- Fixed access violation bug in `cub::DeviceReduce::ReduceByKey` for
    non-primitive value types.
- Fixed code-snippet bug in `ArgIndexInputIteratorT` documentation.

# CUB 1.2.2

## Summary

CUB 1.2.2 adds a new variant of `cub::BlockReduce` and MSVC project solutions
  for examples.

## New Features

- MSVC project solutions for device-wide and block-wide examples.
- New algorithmic variant of `cub::BlockReduce` for improved performance
    when using commutative operators (e.g., numeric addition).

## Bug Fixes

- Inclusion of Thrust headers in a certain order prevented CUB device-wide
    primitives from working properly.

# CUB 1.2.0

## Summary

CUB 1.2.0 adds `cub::DeviceReduce::ReduceByKey` and
  `cub::DeviceReduce::RunLengthEncode` and support for CUDA 6.0.

## New Features

- `cub::DeviceReduce::ReduceByKey`.
- `cub::DeviceReduce::RunLengthEncode`.

## Other Enhancements

- Improved `cub::DeviceScan`, `cub::DeviceSelect`, and `cub::DevicePartition`
    performance.
- Documentation and testing:
  - Added performance-portability plots for many device-wide primitives.
  - Explained iterator (in)compatibilities with CUDA 5.0 (and older) and
      Thrust 1.6 (and older).
- Revised the operation of temporary tile status bookkeeping for
    `cub::DeviceScan` (and similar) to be safe for current code run on future
    platforms (now uses proper fences).

## Bug Fixes

- Fix `cub::DeviceScan` bug where Windows alignment disagreements between host
    and device regarding user-defined data types would corrupt tile status.
- Fix `cub::BlockScan` bug where certain exclusive scans on custom data types
    for the `BLOCK_SCAN_WARP_SCANS` variant would return incorrect results for
    the first thread in the block.
- Added workaround to make `cub::TexRefInputIteratorT` work with CUDA 6.0.

# CUB 1.1.1

## Summary

CUB 1.1.1 introduces texture and cache modifier iterators, descending sorting,
  `cub::DeviceSelect`, `cub::DevicePartition`, `cub::Shuffle*`, and
  `cub::MaxSmOccupancy`.
Additionally, scan and sort performance for older GPUs has been improved and
  many bugs have been fixed.

## Breaking Changes

- Refactored block-wide I/O (`cub::BlockLoad` and `cub::BlockStore`), removing
    cache-modifiers from their interfaces.
  `cub::CacheModifiedInputIterator` and `cub::CacheModifiedOutputIterator`
    should now be used with `cub::BlockLoad` and `cub::BlockStore` to effect that
    behavior.

## New Features

- `cub::TexObjInputIterator`, `cub::TexRefInputIterator`,
    `cub::CacheModifiedInputIterator`, and `cub::CacheModifiedOutputIterator`
    types for loading & storing arbitrary types through the cache hierarchy.
  They are compatible with Thrust.
- Descending sorting for `cub::DeviceRadixSort` and `cub::BlockRadixSort`.
- Min, max, arg-min, and arg-max operators for `cub::DeviceReduce`.
- `cub::DeviceSelect` (select-unique, select-if, and select-flagged).
- `cub::DevicePartition` (partition-if, partition-flagged).
- Generic `cub::ShuffleUp`, `cub::ShuffleDown`, and `cub::ShuffleIndex` for
    warp-wide communication of arbitrary data types (SM3x and up).
- `cub::MaxSmOccupancy` for accurately determining SM occupancy for any given
    kernel function pointer.

## Other Enhancements

- Improved `cub::DeviceScan` and `cub::DeviceRadixSort` performance for older
    GPUs (SM1x to SM3x).
- Renamed device-wide `stream_synchronous` param to `debug_synchronous` to
    avoid confusion about usage.
- Documentation improvements:
  - Added simple examples of device-wide methods.
  - Improved doxygen documentation and example snippets.
- Improved test coverage to include up to 21,000 kernel variants and 851,000
    unit tests (per architecture, per platform).

## Bug Fixes

- Fix miscellaneous `cub::DeviceScan`, `cub::BlockScan`, `cub::DeviceReduce`,
    and `cub::BlockReduce` bugs when operating on non-primitive types on older
    SM1x architectures.
- SHFL-based scans and reductions produced incorrect results for multi-word
    types (size > 4B) on Linux.
- For `cub::WarpScan`-based scans, not all threads in the first warp were
    entering the prefix callback functor.
- `cub::DeviceRadixSort` had a race condition with key-value pairs for pre-SM35
    architectures.
- `cub::DeviceRadixSort` bitfield-extract behavior with long keys on 64-bit
    Linux was incorrect.
- `cub::BlockDiscontinuity` failed to compile for types other than
    `int32_t`/`uint32_t`.
- CUDA Dynamic Parallelism (CDP, e.g. device-callable) versions of device-wide
    methods now report the same temporary storage allocation size requirement as
    their host-callable counterparts.

# CUB 1.0.2

## Summary

CUB 1.0.2 is a minor release.

## Bug Fixes

- Corrections to code snippet examples for `cub::BlockLoad`, `cub::BlockStore`,
    and `cub::BlockDiscontinuity`.
- Cleaned up unnecessary/missing header includes.
  You can now safely include a specific .cuh (instead of `cub.cuh`).
- Bug/compilation fixes for `cub::BlockHistogram`.

# CUB 1.0.1

## Summary

CUB 1.0.1 adds `cub::DeviceRadixSort` and `cub::DeviceScan`.
Numerous other performance and correctness fixes are included.

## Breaking Changes

- New collective interface idiom (specialize/construct/invoke).

## New Features

- `cub::DeviceRadixSort`.
  Implements short-circuiting for homogeneous digit passes.
- `cub::DeviceScan`.
  Implements single-pass "adaptive-lookback" strategy.

## Other Enhancements

- Significantly improved documentation (with example code snippets).
- More extensive regression test suite for aggressively testing collective
    variants.
- Allow non-trivially-constructed types (previously unions had prevented
    aliasing temporary storage of those types).
- Improved support for SM3x SHFL (collective ops now use SHFL for types larger
    than 32 bits).
- Better code generation for 64-bit addressing within
    `cub::BlockLoad`/`cub::BlockStore`.
- `cub::DeviceHistogram` now supports histograms of arbitrary bins.
- Updates to accommodate CUDA 5.5 dynamic parallelism.

## Bug Fixes

- Workarounds for SM10 codegen issues in uncommonly-used
    `cub::WarpScan`/`cub::WarpReduce` specializations.

# CUB 0.9.4

## Summary

CUB 0.9.4 is a minor release.

## Enhancements

- Various documentation updates and corrections.

## Bug Fixes

- Fixed compilation errors for SM1x.
- Fixed compilation errors for some `WarpScan` entrypoints on SM3x and up.

# CUB 0.9.3

## Summary

CUB 0.9.3 adds histogram algorithms and work management utility descriptors.

## New Features

- `cub::DeviceHistogram256`.
- `cub::BlockHistogram256`.
- `cub::BlockScan` algorithm variant `BLOCK_SCAN_RAKING_MEMOIZE`, which
    trades more register consumption for less shared memory I/O.
- `cub::GridQueue` and `cub::GridEvenShare`, work management utility
    descriptors.

## Other Enhancements

- Updates to `cub::BlockRadixRank` to use `cub::BlockScan`, which improves
    performance on SM3x by using SHFL.
- Allow types other than builtin types to be used in `cub::WarpScan::*Sum`
    methods if they only have `operator+` overloaded.
  Previously they were also required to support assignment from `int(0)`.
- Update `cub::BlockReduce`'s `BLOCK_REDUCE_WARP_REDUCTIONS` algorithm to work
    even when block size is not an even multiple of warp size.
- Refactoring of `cub::DeviceAllocator` interface and
    `cub::CachingDeviceAllocator` implementation.

# CUB 0.9.2

## Summary

CUB 0.9.2 adds `cub::WarpReduce`.

## New Features

- `cub::WarpReduce`, which uses the SHFL instruction when applicable.
  `cub::BlockReduce` now uses this `cub::WarpReduce` instead of implementing
    its own.

## Enhancements

- Documentation updates and corrections.

## Bug Fixes

- Fixes for 64-bit Linux compilation warnings and errors.

# CUB 0.9.1

## Summary

CUB 0.9.1 is a minor release.

## Bug Fixes

- Fix for ambiguity in `cub::BlockScan::Reduce` between generic reduction and
    summation.
  Summation entrypoints are now called `::Sum()`, similar to the
    convention in `cub::BlockScan`.
- Small edits to documentation and download tracking.

# CUB 0.9.0

## Summary

Initial preview release.
CUB is the first durable, high-performance library of cooperative block-level,
  warp-level, and thread-level primitives for CUDA kernel programming.
