*********
Threading
*********

.. _pools:

Thread Pools
============

x265 creates one or more thread pools per encoder, one pool per NUMA
node (typically a CPU socket). :option:`--pools` specifies the number of
pools and the number of threads per pool the encoder will allocate. By
default x265 allocates one thread per (hyperthreaded) CPU core on each
NUMA node.

If you are running multiple encoders on a system with multiple NUMA
nodes, it is recommended to isolate each of them to a single node in
order to avoid the NUMA overhead of remote memory access.

Work distribution is job based. Idle worker threads scan the job
providers assigned to their thread pool for jobs to perform. When no
jobs are available, the idle worker threads block and consume no CPU
cycles.

Objects which distribute work to worker threads are known as job
providers (they derive from the JobProvider class). The thread pool
has a method to **poke** awake a blocked idle thread, and job providers
are expected to call this method when they make new jobs available.

Worker jobs are not allowed to block except when absolutely necessary
for data locking. If a job becomes blocked, the work function is
expected to drop that job so the worker thread may return to the pool
and find more work.

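The idle-worker/poke pattern above can be sketched as follows. This is an
illustrative model only, not x265's actual classes: the names `TinyPool`
and `poke` are hypothetical, and a `queue.Queue` stands in for the pool's
condition-variable wakeup.

```python
# Sketch of the job-provider model: idle workers block (burning no CPU)
# until a provider publishes a job and pokes one of them awake.
import queue
import threading

class TinyPool:
    def __init__(self, workers=2):
        self.jobs = queue.Queue()
        self.threads = [threading.Thread(target=self._worker, daemon=True)
                        for _ in range(workers)]
        for t in self.threads:
            t.start()

    def _worker(self):
        while True:
            job = self.jobs.get()   # blocks while idle, consuming no CPU
            if job is None:         # sentinel: shut down
                return
            job()

    def poke(self, job):
        # Queue.put wakes exactly one blocked worker, analogous to the
        # pool poking awake an idle thread when a new job appears.
        self.jobs.put(job)

    def stop(self):
        for _ in self.threads:
            self.jobs.put(None)
        for t in self.threads:
            t.join()

done = threading.Event()
pool = TinyPool()
pool.poke(done.set)      # a provider publishes a job and pokes a worker
done.wait(timeout=5)
pool.stop()
print(done.is_set())     # True
```
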
On Windows, the native APIs offer sufficient functionality to discover
the NUMA topology and enforce the thread affinity that libx265 needs (so
long as you have not chosen to target XP or Vista), but on POSIX systems
it relies on libnuma for this functionality. If your target POSIX system
is single socket, then building without libnuma is a perfectly
reasonable option, as it will have no effect on the runtime behavior. On
a multiple-socket system, a POSIX build of libx265 without libnuma will
be less work efficient, but will still function correctly. You lose the
work isolation effect that keeps each frame encoder using only the
threads of a single socket, and so you incur a heavier context switching
cost.

Wavefront Parallel Processing
=============================

New with HEVC, Wavefront Parallel Processing allows each row of CTUs to
be encoded in parallel, so long as each row stays at least two CTUs
behind the row above it, to ensure the intra references and other data
of the blocks above and above-right are available. WPP has almost no
effect on the analysis and compression of each CTU and so it has a very
small impact on compression efficiency relative to slices or tiles. The
compression loss from WPP has been found to be less than 1% in most of
our tests.

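The two-CTU-lag rule above can be written down as a simple dependency
check. This is a sketch of the rule as stated, not code from x265; the
function name and the `progress` representation are my own.

```python
def ctu_ready(row, col, progress):
    """Return True if CTU (row, col) may be encoded under WPP.

    progress[r] = number of CTUs already encoded in row r.
    """
    if col > 0 and progress[row] < col:          # left neighbour not done
        return False
    if row > 0 and progress[row - 1] < col + 2:  # above-right not done
        return False
    return True

# Row 1 may not start until row 0 has completed two CTUs.
progress = [1, 0, 0]               # row 0 has finished only one CTU
print(ctu_ready(1, 0, progress))   # False
progress[0] = 2
print(ctu_ready(1, 0, progress))   # True
```
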
WPP has three effects which can impact efficiency. The first is that row
starts must be signaled in the slice header, the second is that each row
must be padded to an even byte in length, and the third is that the
state of the entropy coder is transferred from the second CTU of each
row to the first CTU of the row below it. In some conditions this
transfer of state actually improves compression, since the above-right
state may have better locality than the end of the previous row.

Parabola Research have published an excellent HEVC
`animation <http://www.parabolaresearch.com/blog/2013-12-01-hevc-wavefront-animation.html>`_
which visualizes WPP very well. It even correctly visualizes some of
WPP's key drawbacks, such as:

1. the low thread utilization at the start and end of each frame
2. a difficult block may stall the wave-front, and it can take a while
   for the wave-front to recover
3. 64x64 CTUs are big! There are far fewer rows than with H.264 and
   similar codecs

Because of these stall issues you rarely get the full parallelisation
benefit one would expect from row threading. 30% to 50% of the
theoretical perfect threading is typical.

In x265 WPP is enabled by default, since it not only improves
performance at encode time but also makes it possible for the decoder to
be threaded.

If WPP is disabled by :option:`--no-wpp` the frame will be encoded in
scan order and the entropy overheads will be avoided. If frame
threading is not disabled, the encoder will raise the default frame
thread count above what it would be with WPP enabled. The exact
formulas are described in the next section.

Bonded Task Groups
==================

If a worker thread job has work which can be performed in parallel by
many threads, it may allocate a bonded task group and enlist the help of
other idle worker threads from the same thread pool. Those threads will
cooperate to complete the work of the bonded task group and then return
to their idle states. The larger and more uniform those tasks are, the
better the bonded task group will perform.

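The cooperation pattern can be sketched as helpers atomically claiming
task indices until the group's work runs out. The class and method names
here are illustrative, not x265's API; a lock stands in for the atomic
increment a real implementation would use.

```python
import threading

class BondedTaskGroup:
    """Helpers claim one task index at a time until none remain."""

    def __init__(self, tasks):
        self.tasks = tasks
        self.next = 0
        self.lock = threading.Lock()

    def process_tasks(self, results):
        while True:
            with self.lock:          # stands in for an atomic increment
                i = self.next
                self.next += 1
            if i >= len(self.tasks):
                return               # group exhausted; helper goes idle
            results[i] = self.tasks[i]()

tasks = [lambda v=v: v * v for v in range(8)]
group = BondedTaskGroup(tasks)
results = [None] * len(tasks)
helpers = [threading.Thread(target=group.process_tasks, args=(results,))
           for _ in range(3)]
for t in helpers:
    t.start()
for t in helpers:
    t.join()
print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
```

Note how the larger and more uniform the tasks, the less the claiming
overhead matters, matching the observation above.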
Parallel Mode Analysis
~~~~~~~~~~~~~~~~~~~~~~

When :option:`--pmode` is enabled, each CU (at all depths from 64x64 to
8x8) will distribute its analysis work to the thread pool via a bonded
task group. Each analysis job will measure the cost of one prediction
for the CU: merge, skip, intra, inter (2Nx2N, Nx2N, 2NxN, and AMP).

At slower presets, the amount of increased parallelism from pmode is
often enough to allow frame parallelism to be reduced or disabled while
achieving the same overall CPU utilization. Reducing frame threads is
often beneficial to ABR and VBV rate control.

Parallel Motion Estimation
~~~~~~~~~~~~~~~~~~~~~~~~~~

When :option:`--pme` is enabled, all of the analysis functions which
perform motion searches to reference frames will distribute those motion
searches to other worker threads via a bonded task group (if more than
two motion searches are required).

Frame Threading
===============

Frame threading is the act of encoding multiple frames at the same time.
It is a challenge because each frame will generally use one or more of
the previously encoded frames as motion references, and those frames may
still be in the process of being encoded themselves.

Previous encoders such as x264 worked around this problem by limiting
the motion search region within these reference frames to just one
macroblock row below the coincident row being encoded. Thus a frame
could be encoded at the same time as its reference frames so long as it
stayed one row behind the encode progress of its references (glossing
over a few details).

x265 has the same frame threading mechanism, but we generally have much
less frame parallelism to exploit than x264 because of the size of our
CTU rows. For instance, with 1080p video x264 has 68 16x16 macroblock
rows available each frame while x265 only has 17 64x64 CTU rows.

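The row counts above follow directly from the picture height: the coded
height is padded up to a whole number of blocks, so 1080 luma rows yield
68 rows of 16x16 macroblocks but only 17 rows of 64x64 CTUs.

```python
import math

def block_rows(height, block_size):
    # The coded picture height is padded up to a whole number of blocks.
    return math.ceil(height / block_size)

print(block_rows(1080, 16))  # 68 macroblock rows (x264)
print(block_rows(1080, 64))  # 17 CTU rows (x265)
```
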
The second extenuating circumstance is the loop filters. The pixels used
for motion reference must be processed by the loop filters, and the loop
filters cannot run until a full row has been encoded; they must also run
a full row behind the encode process so that the pixels below the row
being filtered are available. On top of this, HEVC has two loop filters,
deblocking and SAO, which must be run in series with a row lag between
them. When you add up all the row lags, each frame ends up being 3 CTU
rows behind its reference frames (the equivalent of 12 macroblock rows
for x264). And keep in mind the wave-front progression pattern; by the
time the reference frame finishes the third row of CTUs, nearly half of
the CTUs in the frame may be compressed (depending on the display aspect
ratio).

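Tallying the per-stage lags makes the arithmetic explicit. This is my
own illustrative breakdown of the 3-row figure stated above (one row per
pipeline stage), not code or exact bookkeeping from x265.

```python
# Each filtering stage trails the previous stage by one CTU row.
LAGS = {
    "deblock behind encode": 1,
    "SAO behind deblock": 1,
    "reference rows behind SAO": 1,
}
total_ctu_rows = sum(LAGS.values())
print(total_ctu_rows)             # 3 CTU rows of reference lag
print(total_ctu_rows * 64 // 16)  # 12 equivalent 16x16 macroblock rows
```
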
The third extenuating circumstance is that when a frame being encoded
becomes blocked waiting on a reference frame row, that frame's
wave-front becomes completely stalled, and when the row becomes
available again it can take quite some time for the wave to be
restarted, if it ever is. This makes WPP less effective when frame
parallelism is in use.

:option:`--merange` can have a negative impact on frame parallelism. If
the range is too large, more rows of CTU lag must be added to ensure
those pixels are available in the reference frames.

.. note::

	Even though the merange is used to determine the amount of reference
	pixels that must be available in the reference frames, the actual
	motion search is not necessarily centered around the coincident
	block. The motion search is actually centered around the motion
	predictor, but the available pixel area (mvmin, mvmax) is determined
	by merange and the interpolation filter half-heights.

When frame threading is disabled, the entirety of all reference frames
is always fully available (by definition) and thus the available pixel
area is not restricted at all, which can sometimes improve compression
efficiency. Because of this, the output of encodes with frame
parallelism disabled will not match the output of encodes with frame
parallelism enabled; but when enabled, the number of frame threads
should have no effect on the output bitstream except when using ABR or
VBV rate control or noise reduction.

When :option:`--nr` is enabled, the output for each number of frame
threads will be deterministic, but none of them will match because each
frame encoder maintains a cumulative noise reduction state.

VBV introduces non-determinism in the encoder, at this point in time,
regardless of the amount of frame parallelism.

By default frame parallelism and WPP are enabled together. The number of
frame threads used is auto-detected from the (hyperthreaded) CPU core
count, but may be manually specified via :option:`--frame-threads`.

	+-------+--------+
	| Cores | Frames |
	+=======+========+
	|  > 32 |  6..8  |
	+-------+--------+
	| >= 16 |   5    |
	+-------+--------+
	| >= 8  |   3    |
	+-------+--------+
	| >= 4  |   2    |
	+-------+--------+

If WPP is disabled, then the frame thread count defaults to **min(cpuCount, ctuRows / 2)**.

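The table and formula can be combined into one selection function. This
is a sketch of the defaults as documented, not x265 source; the function
name is mine, the ">32" case returns the low end of the stated 6..8
range, and the fallback of one frame thread for fewer than 4 cores is an
assumption.

```python
def default_frame_threads(cpu_count, ctu_rows, wpp=True):
    if not wpp:
        # Without WPP: min(cpuCount, ctuRows / 2), per the formula above.
        return min(cpu_count, ctu_rows // 2)
    if cpu_count > 32:
        return 6    # docs give a range of 6..8 here; low end assumed
    if cpu_count >= 16:
        return 5
    if cpu_count >= 8:
        return 3
    if cpu_count >= 4:
        return 2
    return 1        # assumption: single frame thread below 4 cores

# 1080p (17 CTU rows) on a 16-core machine:
print(default_frame_threads(16, 17))             # 5
print(default_frame_threads(16, 17, wpp=False))  # 8 = min(16, 17 // 2)
```
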
Over-allocating frame threads can be very counter-productive. They each
allocate a large amount of memory, and because of the limited number of
CTU rows and the reference lag, you generally get limited benefit from
adding frame encoders beyond the auto-detected count; often the extra
frame encoders reduce performance.

Given these considerations, you can understand why the faster presets
lower the max CTU size to 32x32 (making twice as many CTU rows available
for WPP and for finer grained frame parallelism) and reduce
:option:`--merange`.

Each frame encoder runs in its own thread (allocated separately from the
worker pool). This frame thread has some pre-processing and some
post-processing responsibilities for each frame, but it spends the bulk
of its time managing the wave-front processing by making CTU rows
available to the worker threads when their dependencies are resolved.
The frame encoder threads spend nearly all of their time blocked in one
of four possible locations:

1. blocked, waiting for a frame to process
2. blocked on a reference frame, waiting for a CTU row of reconstructed
   and loop-filtered reference pixels to become available
3. blocked waiting for wave-front completion
4. blocked waiting for the main thread to consume an encoded frame

Lookahead
=========

The lookahead module of x265 (the lowres pre-encode which determines
scene cuts and slice types) uses the thread pool to distribute the
lowres cost analysis to worker threads. It will use bonded task groups
to perform batches of frame cost estimates, and it may optionally use
bonded task groups to measure single frame cost estimates using slices
(see :option:`--lookahead-slices`).

The main slicetypeDecide() function itself is also performed by a worker
thread if your encoder has a thread pool; otherwise it runs within the
context of the thread which calls x265_encoder_encode().

SAO
===

The Sample Adaptive Offset loop filter has a large effect on encode
performance because of the peculiar way it must be analyzed and coded.

SAO flags and data are encoded at the CTU level before the CTU itself is
coded, but SAO analysis (deciding whether to enable SAO and with what
parameters) cannot be performed until that CTU is completely analyzed
(reconstructed pixels are available), as well as the CTUs to the right
and below. So in effect the encoder must perform SAO analysis in a
wavefront at least a full row behind the CTU compression wavefront.

This extra latency forces the encoder to save the encode data of every
CTU until the entire frame has been analyzed, at which point a function
can code the final slice bitstream with the decided SAO flags and data
interleaved between each CTU. This second pass over the CTUs can be
expensive, particularly at large resolutions and high bitrates.