# Upsampling

[TOC]

<!--*
# Document freshness: For more information, see go/fresh-source.
freshness: { owner: 'janwas' reviewed: '2019-02-22' }
*-->

jxl/resample.h provides `Upsampler8` for fast and high-quality 8x8 upsampling by
4x4 (separable/non-separable) or 6x6 (non-separable) floating-point kernels.
This was previously used for the "smooth predictor", which has been removed. It
would still be useful for a progressive mode that upsamples DC for a preview,
though this code is not yet used for that purpose.

See the 'Separability' section below for the perhaps surprising result that
non-separable kernels can be faster than separable ones, and possibly better.

## Performance evaluation

### 4x4 separable/non-separable

__Single-core__: 5.1 GB/s (single-channel, floating-point, 160x96 input)

*   40x speedup vs. an unoptimized Nehab/Hoppe bicubic upsampler
*   25x or 6x speedup vs. [Pillow 4.3](http://python-pillow.org/pillow-perf/)
    bicubic (AVX2: 66M RGB 8-bit per second) in terms of bytes or samples
*   2.6x AVX2 speedup vs. our SSE4 version (similar algorithm)

__12 independent instances on 12 cores__: same speed for each instance (linear
scalability: not limited by memory bandwidth).

__Multicore__: 15-18 GB/s (single-channel, floating-point, 320x192 input)

*   9-11x or 2.3-2.8x speedup vs. a parallel Halide bicubic (1.6G single-channel
    8-bit per second, 320x192 input) in terms of bytes or samples.

### 6x6 non-separable

Note that a separable (outer-product) kernel only requires 6+6 (12)
multiplications per output pixel/vector. However, we do not assume anything
about the kernel rank (separability), and thus require 6x6 (36) multiplications.

__Single-core__: 2.8 GB/s (single-channel, floating-point, 320x192 input)

*   5x speedup vs. optimized (outer-product of two 1D) Lanczos without SIMD

__Multicore__: 9-10 GB/s (single-channel, floating-point, 320x192 input)

*   7-8x or 2x speedup vs. a parallel Halide 6-tap tensor-product Lanczos (1.3G
    single-channel 8-bit per second, 320x192 input) in terms of bytes or
    samples.

## Implementation details

### Data type

Our input pixels are 16-bit, so 8-bit integer multiplications are insufficient.
16-bit fixed-point arithmetic is fast but risks overflow or loss of precision
unless the value intervals are carefully managed. 32-bit integers reduce this
concern, but 32-bit multiplies are slow on Haswell (one mullo32 per cycle with
10+1 cycle latency). We instead use single-precision float, which enables 2 FMA
per cycle with 5 cycle latency. The extra bandwidth vs. 16-bit types may be a
concern, but we can run 12 instances (on a 12-core socket) with zero slowdown,
indicating that memory bandwidth is not the bottleneck at the moment.

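As a rough illustration of the overflow risk, consider a hypothetical Q8.8
fixed-point format (8 integer bits, 8 fractional bits; this is an illustrative
sketch, not the codec's actual arithmetic):

```python
def to_q8_8(x):
    """Encode a float in a hypothetical Q8.8 fixed-point format."""
    return int(round(x * 256))

def q8_8_mul_16bit(a, b):
    """Multiply keeping only the low 16 product bits, as a 16-bit mullo would."""
    return ((a * b) & 0xFFFF) >> 8

def q8_8_mul_32bit(a, b):
    """Multiply with a full 32-bit intermediate product."""
    return (a * b) >> 8

a, b = to_q8_8(100.0), to_q8_8(2.0)
print(q8_8_mul_32bit(a, b) / 256)  # 200.0: correct with a wide intermediate
print(q8_8_mul_16bit(a, b) / 256)  # 0.0: the high product bits were lost
```

Keeping such intermediates wide enough is exactly the kind of interval
management that float32 makes unnecessary.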
### Pixel layout

We require planar inputs, e.g. all red channel samples in a 2D matrix. This
allows better utilization of 8-lane SIMD compared to dedicating four SIMD lanes
to R,G,B,A samples. If upsampling is the only operation, this requires an extra
deinterleaving step, but our application involves a larger processing pipeline.

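The deinterleaving step can be sketched as follows (a hypothetical helper, not
the library's API):

```python
def deinterleave_rgb(samples):
    """Split interleaved [R, G, B, R, G, B, ...] samples into three planes."""
    return samples[0::3], samples[1::3], samples[2::3]

r, g, b = deinterleave_rgb([1, 2, 3, 4, 5, 6, 7, 8, 9])  # three RGB pixels
print(r)  # [1, 4, 7]: all red samples are now contiguous
```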
### No prefilter

Nehab and Hoppe (http://hhoppe.com/filtering.pdf) advocate generalized sampling
with an additional digital filter step. They claim "significantly higher
quality" for upsampling operations, which is contradicted by the minimal MSSIM
differences between cardinal and ordinary cubic interpolators in their
experiment on page 65. We also see only minor Butteraugli differences between
Catmull-Rom and B-spline3 or OMOMS3 when upsampling 8x. As they note on page 29,
a prefilter is often omitted when upsampling because the reconstruction filter's
frequency cutoff is lower than that of the prefilter. The prefilter is claimed
to be efficient (relative to normal separable convolutions), but still involves
separate horizontal and vertical passes. Avoiding cache thrashing there would
require a separate ring buffer of rows, which may be less efficient than our
single-pass algorithm, which only writes final outputs.

### 8x upsampling

Our code is currently specific to 8x upsampling because that is what is required
for our (DCT prediction) application. This happens to be a particularly good fit
for both 4-lane and 8-lane (AVX2) SIMD. It would be relatively easy to adapt the
code to 4x upsampling.

### Kernel support

Kernels with even (and thus asymmetric) support are often used to reduce
computation. Many upsampling applications use 4-tap cubic interpolation kernels,
but at such extreme magnifications (8x) we find 6x6 kernels to be better.

### Separability

Separable kernels can be expressed as the (outer) product of two 1D kernels. For
n x n kernels, this requires n + n multiplications per pixel rather than n x n.
However, the Fourier transform of such kernels (except the Gaussian) has a
square rather than disc shape.

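To make the n + n vs. n x n distinction concrete, here is a minimal sketch
(plain Python with hypothetical kernels) showing that a horizontal pass followed
by a vertical dot product reproduces the full 2D convolution whenever the kernel
is an outer product:

```python
import math

def outer(u, v):
    """Build a 2D kernel as the outer product of two 1D kernels (rank 1)."""
    return [[ui * vj for vj in v] for ui in u]

def convolve_full(img, kernel, y, x):
    """Non-separable evaluation: n x n multiplications."""
    n = len(kernel)
    return sum(kernel[i][j] * img[y + i][x + j]
               for i in range(n) for j in range(n))

def convolve_separable(img, u, v, y, x):
    """Separable evaluation: n multiplications per row, then n vertically."""
    n = len(u)
    rows = [sum(v[j] * img[y + i][x + j] for j in range(n)) for i in range(n)]
    return sum(u[i] * rows[i] for i in range(n))

u, v = [0.25, 0.5, 0.25], [0.1, 0.8, 0.1]
img = [[float(r * 10 + c) for c in range(5)] for r in range(5)]
a = convolve_full(img, outer(u, v), 1, 1)
b = convolve_separable(img, u, v, 1, 1)
print(math.isclose(a, b))  # True: both evaluate the same rank-1 kernel
```

A general (full-rank) kernel has no such factorization, which is why it costs
n x n multiplications.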
We instead allow arbitrary (possibly non-separable) kernels. This can avoid
structural dependencies between output pixels within the same 8x8 block.

Surprisingly, 4x4 non-separable is actually faster than the separable version
due to better utilization of FMA units. For 6x6, we only implement the
non-separable version because our application benefits from such kernels.

### Dual grid

A primal grid with `n` grid points at coordinates `k/n` would be convenient for
index computations, but is asymmetric at the borders (0 and `(n-1)/n` != 1) and
does not compensate for the asymmetric (even) kernel support. We instead use a
dual grid with coordinates offset by half the sample spacing, which only adds an
integer shift term to the index computations. Note that the offset still leads
to four identical input pixels in 8x upsampling, which is convenient for 4 and
even 8-lane SIMD.

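The dual-grid index computation can be sketched as follows (plain Python; the
constants follow the text above, not necessarily the library's exact code):

```python
import math

SCALE = 8  # 8x upsampling

def source_coord(out_x):
    """Map an output sample center (out_x + 0.5) / SCALE into input
    coordinates whose sample centers sit at integer positions."""
    return (out_x + 0.5) / SCALE - 0.5

for out_x in range(10):
    c = source_coord(out_x)
    base = math.floor(c)  # the integer shift term used for indexing
    print(out_x, base, round(c - base, 4))
# The fractional offsets repeat with period 8, so the weight vectors for
# one 8-wide output block can be precomputed once.
```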
### Kernel

We use Catmull-Rom splines for convenience. Computational cost does not matter
because the weights are precomputed. Also, Catmull-Rom splines pass through the
control points, but that does not matter in this application. B-splines or other
kernels (arbitrary coefficients, even non-separable) can also be used.

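For reference, the standard 4-tap Catmull-Rom weights for a fractional offset
`t`, which can be precomputed for the eight offsets arising in 8x upsampling:

```python
def catmull_rom_weights(t):
    """Standard 4-tap Catmull-Rom weights for fractional offset t in [0, 1)."""
    t2, t3 = t * t, t * t * t
    return (-0.5 * t3 + t2 - 0.5 * t,
            1.5 * t3 - 2.5 * t2 + 1.0,
            -1.5 * t3 + 2.0 * t2 + 0.5 * t,
            0.5 * t3 - 0.5 * t2)

print(catmull_rom_weights(0.0))       # (0.0, 1.0, 0.0, 0.0): interpolating
print(sum(catmull_rom_weights(0.5)))  # 1.0: weights form a partition of unity
```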
### Single pixel loads

For SIMD convolution, we have the choice of whether to broadcast inputs or
weights. To ensure we write a unit-stride vector of output pixels, we need to
broadcast the input pixels and vary the weights. This has the additional benefit
of avoiding complexity at the borders, where it is not safe to load an entire
vector. Instead, we only load and broadcast single pixels, with bounds checking
at the borders but not in the interior.

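In scalar Python, the broadcast-input strategy looks roughly like this
(a hypothetical 4-tap horizontal pass with 8 emulated lanes; the real code
operates on actual SIMD vectors):

```python
LANES = 8  # e.g. one AVX2 vector of floats

def horizontal_8x(row, weights):
    """For each input position, broadcast one pixel at a time across all lanes
    and accumulate with per-lane weights, yielding 8 output pixels per input.
    `weights[tap][lane]` is a hypothetical precomputed 4-tap weight table."""
    out = []
    n = len(row)
    for x in range(n):
        acc = [0.0] * LANES
        for tap in range(4):
            # Clamp indices at the borders instead of loading out of bounds.
            px = row[min(max(x - 1 + tap, 0), n - 1)]  # broadcast single pixel
            for lane in range(LANES):
                acc[lane] += px * weights[tap][lane]
        out.extend(acc)
    return out

# Trivial weight table that replicates each input pixel 8 times:
replicate = [[0.0] * LANES, [1.0] * LANES, [0.0] * LANES, [0.0] * LANES]
print(horizontal_8x([1.0, 2.0], replicate))  # eight 1.0s, then eight 2.0s
```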
### Single pass

Separable 2D convolutions are often implemented as two separate 1D passes.
However, this leads to cache thrashing in the vertical pass, assuming the image
does not fit into L2 caches. We instead use a single-pass algorithm that
generates four (see "Kernel support" above) horizontal convolution results and
immediately convolves them in the vertical direction to produce a final output.
This is simpler than sliding windows and has a smaller working set, thus
enabling the major speedups reported above.
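
The single-pass structure can be sketched in plain Python with a hypothetical
4-tap kernel (for clarity this recomputes horizontal results for every output
row; the real code reuses them across rows and vectorizes everything):

```python
def convolve1d(row, kernel, x):
    """4-tap horizontal dot product at position x, clamped at the borders."""
    n = len(row)
    return sum(kernel[k] * row[min(max(x - 1 + k, 0), n - 1)] for k in range(4))

def single_pass_2d(img, kx, ky):
    """Compute four horizontal convolution results per output pixel and
    immediately combine them vertically; no intermediate image is stored."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            horiz = [convolve1d(img[min(max(y - 1 + j, 0), h - 1)], kx, x)
                     for j in range(4)]
            out[y][x] = sum(ky[j] * horiz[j] for j in range(4))
    return out

identity = [0.0, 1.0, 0.0, 0.0]  # picks the center sample
img = [[1.0, 2.0], [3.0, 4.0]]
print(single_pass_2d(img, identity, identity) == img)  # True
```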