# Upsampling

[TOC]

<!--*
# Document freshness: For more information, see go/fresh-source.
freshness: { owner: 'janwas' reviewed: '2019-02-22' }
*-->

jxl/resample.h provides `Upsampler8` for fast and high-quality 8x8 upsampling
by 4x4 (separable/non-separable) or 6x6 (non-separable) floating-point kernels.
This was previously used for the "smooth predictor", which has been removed. It
would still be useful for a progressive mode that upsamples DC for a preview,
though this code is not yet used for that purpose.

See the 'Separability' section below for the surprising result that
non-separable can be faster than separable, and possibly better.

## Performance evaluation

### 4x4 separable/non-separable

__Single-core__: 5.1 GB/s (single-channel, floating-point, 160x96 input)

*   40x speedup vs. an unoptimized Nehab/Hoppe bicubic upsampler
*   25x or 6x speedup vs. [Pillow 4.3](http://python-pillow.org/pillow-perf/)
    bicubic (AVX2: 66M RGB 8-bit per second) in terms of bytes or samples
*   2.6x AVX2 speedup vs. our SSE4 version (similar algorithm)

__12 independent instances on 12 cores__: same speed for each instance (linear
scalability: not limited by memory bandwidth).

__Multicore__: 15-18 GB/s (single-channel, floating-point, 320x192 input)

*   9-11x or 2.3-2.8x speedup vs. a parallel Halide bicubic (1.6G
    single-channel 8-bit per second, 320x192 input) in terms of bytes or
    samples.

### 6x6 non-separable

Note that a separable (outer-product) kernel only requires 6+6 (12)
multiplications per output pixel/vector. However, we do not assume anything
about the kernel rank (separability), and thus require 6x6 (36) multiplications.

__Single-core__: 2.8 GB/s (single-channel, floating-point, 320x192 input)

*   5x speedup vs. an optimized (outer product of two 1D) Lanczos without SIMD

__Multicore__: 9-10 GB/s (single-channel, floating-point, 320x192 input)

*   7-8x or 2x speedup vs.
    a parallel Halide 6-tap tensor-product Lanczos (1.3G single-channel 8-bit
    per second, 320x192 input) in terms of bytes or samples.

## Implementation details

### Data type

Our input pixels are 16-bit, so 8-bit integer multiplications are insufficient.
16-bit fixed-point arithmetic is fast but risks overflow or loss of precision
unless the value intervals are carefully managed. 32-bit integers reduce this
concern, but 32-bit multiplies are slow on Haswell (one mullo32 per cycle with
10+1 cycle latency). We instead use single-precision float, which enables two
FMAs per cycle with 5-cycle latency. The extra bandwidth vs. 16-bit types may
be a concern, but we can run 12 instances (on a 12-core socket) with zero
slowdown, indicating that memory bandwidth is not currently the bottleneck.

### Pixel layout

We require planar inputs, e.g. all red-channel samples in a 2D matrix. This
allows better utilization of 8-lane SIMD than dedicating four SIMD lanes to
R,G,B,A samples. If upsampling is the only operation, this requires an extra
deinterleaving step, but our application involves a larger processing pipeline.

### No prefilter

Nehab and Hoppe (http://hhoppe.com/filtering.pdf) advocate generalized sampling
with an additional digital-filter step. They claim "significantly higher
quality" for upsampling operations, which is contradicted by the minimal MSSIM
differences between cardinal and ordinary cubic interpolators in their
experiment on page 65. We also see only minor Butteraugli differences between
Catmull-Rom and B-spline3 or OMOMS3 when upsampling 8x. As they note on page
29, a prefilter is often omitted when upsampling because the reconstruction
filter's frequency cutoff is lower than that of the prefilter. The prefilter is
claimed to be efficient (relative to normal separable convolutions), but still
involves separate horizontal and vertical passes.
Avoiding cache thrashing would then require a separate ring buffer of rows,
which may be less efficient than our single-pass algorithm that only writes
final outputs.

### 8x upsampling

Our code is currently specific to 8x upsampling because that is what is
required for our (DCT prediction) application. This happens to be a
particularly good fit for both 4-lane and 8-lane (AVX2) SIMD. It would be
relatively easy to adapt the code to 4x upsampling.

### Kernel support

Even (asymmetric) kernels are often used to reduce computation. Many upsampling
applications use 4-tap cubic interpolation kernels, but at such extreme
magnifications (8x) we find 6x6 to be better.

### Separability

Separable kernels can be expressed as the (outer) product of two 1D kernels.
For n x n kernels, this requires n + n multiplications per pixel rather than
n x n. However, the Fourier transform of such kernels (except the Gaussian) has
a square rather than disc shape.

We instead allow arbitrary (possibly non-separable) kernels. This can avoid
structural dependencies between output pixels within the same 8x8 block.

Surprisingly, the non-separable 4x4 version is actually faster than the
separable one due to better utilization of FMA units. For 6x6, we only
implement the non-separable version because our application benefits from such
kernels.

### Dual grid

A primal grid with `n` grid points at coordinates `k/n` would be convenient for
index computations, but is asymmetric at the borders (0 and `(n-1)/n` != 1) and
does not compensate for the asymmetric (even) kernel support. We instead use a
dual grid with coordinates offset by half the sample spacing, which only adds
an integer shift term to the index computations. Note that the offset still
leads to four identical input pixels in 8x upsampling, which is convenient for
4-lane and even 8-lane SIMD.
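As a concrete illustration of the dual-grid index computation, here is a
hypothetical scalar sketch (the function name and signature are invented for
this example and are not the actual resample.h code):

```cpp
#include <cassert>
#include <cmath>

// On the dual grid, input sample k is centered at k + 0.5 pixel units and
// output sample i is centered at (i + 0.5) / upsampling. This maps an output
// coordinate to its base input index and fractional phase.
void DualGridMap(int out_x, int upsampling, int* base, double* frac) {
  // Output center relative to input sample centers; the half-sample offset
  // is a constant shift folded into the floor.
  const double pos = (out_x + 0.5) / upsampling - 0.5;
  *base = static_cast<int>(std::floor(pos));  // -1 at the left border
  *frac = pos - *base;                        // fractional phase in [0, 1)
}
```

For 8x upsampling this yields base -1 for outputs 0..3 and base 0 for outputs
4..7: runs of four consecutive outputs read the same input pixels, which is the
four-identical-pixels property noted above.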
### Kernel

We use Catmull-Rom splines for convenience. Computational cost does not matter
because the weights are precomputed. Catmull-Rom splines also pass through the
control points, though that does not matter in this application. B-splines or
other kernels (arbitrary coefficients, even non-separable) can also be used.

### Single pixel loads

For SIMD convolution, we can choose whether to broadcast inputs or weights. To
ensure we write a unit-stride vector of output pixels, we need to broadcast the
input pixels and vary the weights. This has the additional benefit of avoiding
complexity at the borders, where it is not safe to load an entire vector.
Instead, we only load and broadcast single pixels, with bounds checking at the
borders but not in the interior.

### Single pass

Separable 2D convolutions are often implemented as two separate 1D passes.
However, this leads to cache thrashing in the vertical pass, assuming the image
does not fit into the L2 cache. We instead use a single-pass algorithm that
generates four (see "Kernel support" above) horizontal convolution results and
immediately convolves them in the vertical direction to produce a final output.
This is simpler than sliding windows and has a smaller working set, thus
enabling the major speedups reported above.
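The pieces above can be combined into a minimal scalar sketch, assuming a
separable 4x4 Catmull-Rom kernel and clamp-to-edge borders (function names are
invented for illustration; the real `Upsampler8` is vectorized and handles
borders via single-pixel loads): per output pixel, four horizontal convolution
results are formed and immediately combined vertically, writing only final
outputs.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Catmull-Rom weights for the four taps at offsets -1..2 around the base
// sample, evaluated at fractional phase t in [0, 1). The weights sum to 1.
void CatmullRomWeights(double t, double w[4]) {
  const double t2 = t * t, t3 = t2 * t;
  w[0] = 0.5 * (-t + 2.0 * t2 - t3);
  w[1] = 0.5 * (2.0 - 5.0 * t2 + 3.0 * t3);
  w[2] = 0.5 * (t + 4.0 * t2 - 3.0 * t3);
  w[3] = 0.5 * (-t2 + t3);
}

int ClampIndex(int x, int hi) { return x < 0 ? 0 : (x > hi ? hi : x); }

// Single-pass 8x upsampling: for each output pixel, four horizontal
// convolution results are formed and immediately combined vertically,
// so only final outputs are written (no intermediate row pass).
void Upsample8x(const std::vector<double>& in, int w, int h,
                std::vector<double>* out) {
  constexpr int N = 8;  // upsampling factor
  // Per-phase weights, precomputed once; the phase repeats every N outputs.
  double wt[N][4];
  for (int p = 0; p < N; ++p) {
    const double frac = std::fmod((p + 0.5) / N - 0.5 + 1.0, 1.0);
    CatmullRomWeights(frac, wt[p]);
  }
  out->assign(static_cast<size_t>(N) * w * N * h, 0.0);
  for (int oy = 0; oy < N * h; ++oy) {
    const int by = static_cast<int>(std::floor((oy + 0.5) / N - 0.5));
    for (int ox = 0; ox < N * w; ++ox) {
      const int bx = static_cast<int>(std::floor((ox + 0.5) / N - 0.5));
      double sum = 0.0;
      for (int ky = 0; ky < 4; ++ky) {
        const int y = ClampIndex(by - 1 + ky, h - 1);  // clamp-to-edge border
        double hsum = 0.0;  // horizontal convolution for one kernel row
        for (int kx = 0; kx < 4; ++kx) {
          const int x = ClampIndex(bx - 1 + kx, w - 1);
          hsum += wt[ox % N][kx] * in[static_cast<size_t>(y) * w + x];
        }
        sum += wt[oy % N][ky] * hsum;  // immediate vertical combination
      }
      (*out)[static_cast<size_t>(oy) * N * w + ox] = sum;
    }
  }
}
```

Because each phase's weights sum to 1, a constant image is reproduced exactly;
the working set per output row is only four input rows, which is what enables
the single-pass speedups reported above.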