*********
Threading
*********

.. _pools:

Thread Pools
============

x265 creates one or more thread pools per encoder, one pool per NUMA
node (typically a CPU socket). :option:`--pools` specifies the number of
pools and the number of threads per pool the encoder will allocate. By
default x265 allocates one thread per (hyperthreaded) CPU core on each
NUMA node.

If you are running multiple encoders on a system with multiple NUMA
nodes, it is recommended to isolate each of them to a single node in
order to avoid the NUMA overhead of remote memory access.

Work distribution is job based. Idle worker threads scan the job
providers assigned to their thread pool for jobs to perform. When no
jobs are available, the idle worker threads block and consume no CPU
cycles.

Objects which desire to distribute work to worker threads are known as
job providers (and they derive from the JobProvider class). The thread
pool has a method to **poke** awake a blocked idle thread, and job
providers are recommended to call this method when they make new jobs
available.

Worker jobs are not allowed to block except when absolutely necessary
for data locking. If a job becomes blocked, the work function is
expected to drop that job so the worker thread may go back to the pool
and find more work.

On Windows, the native APIs offer sufficient functionality to discover
the NUMA topology and enforce the thread affinity that libx265 needs (so
long as you have not chosen to target XP or Vista), but on POSIX systems
it relies on libnuma for this functionality. If your target POSIX system
is single socket, then building without libnuma is a perfectly
reasonable option, as it will have no effect on the runtime behavior. On
a multiple-socket system, a POSIX build of libx265 without libnuma will
be less work efficient, but will still function correctly.
You lose the
work isolation effect that restricts each frame encoder to the threads
of a single socket, and so you incur a heavier context switching cost.

Wavefront Parallel Processing
=============================

New with HEVC, Wavefront Parallel Processing allows each row of CTUs to
be encoded in parallel, so long as each row stays at least two CTUs
behind the row above it, to ensure the intra references and other data
of the blocks above and above-right are available. WPP has almost no
effect on the analysis and compression of each CTU and so it has a very
small impact on compression efficiency relative to slices or tiles. The
compression loss from WPP has been found to be less than 1% in most of
our tests.

WPP has three effects which can impact efficiency. The first is that the
row starts must be signaled in the slice header, the second is that each
row must be padded to an even byte length, and the third is that the
state of the entropy coder is transferred from the second CTU of each
row to the first CTU of the row below it. In some conditions this
transfer of state actually improves compression, since the above-right
state may have better locality than the end of the previous row.

Parabola Research have published an excellent HEVC
`animation <http://www.parabolaresearch.com/blog/2013-12-01-hevc-wavefront-animation.html>`_
which visualizes WPP very well. It even correctly visualizes some of
WPP's key drawbacks, such as:

1. the low thread utilization at the start and end of each frame
2. a difficult block may stall the wave-front and it takes a while for
   the wave-front to recover
3. 64x64 CTUs are big! There are far fewer rows than with H.264 and
   similar codecs

Because of these stall issues you rarely get the full parallelisation
benefit one would expect from row threading. 30% to 50% of the
theoretical perfect threading is typical.
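The wavefront dependency and its utilization ceiling can be modeled in a
few lines. This is a toy, uniform-cost sketch (one thread per row, one
time step per CTU), not x265's scheduler; under it a frame of R rows and
C columns needs C + 2(R-1) steps, which for a 1080p frame of 17 rows and
30 CTU columns works out to roughly 48% of perfect scaling, consistent
with the 30% to 50% figure quoted above.

```python
# Toy model of the WPP dependency: CTU (row, col) may start only after
# (row, col-1) and (row-1, col+1) are done. Uniform per-CTU cost is an
# assumption; real CTU costs vary widely, which is what causes stalls.

def wavefront_steps(rows, cols):
    done = [0] * rows                   # CTUs finished so far in each row
    steps = 0
    while done[-1] < cols:
        prev = done[:]                  # all rows advance off one snapshot
        for r in range(rows):
            ready = prev[r] < cols and (
                r == 0 or prev[r - 1] >= min(prev[r] + 2, cols))
            if ready:
                done[r] = prev[r] + 1
        steps += 1
    return steps

# 1080p with 64x64 CTUs: 17 rows of 30 CTUs each
steps = wavefront_steps(17, 30)         # 30 + 2*16 = 62 steps
utilization = (17 * 30) / (17 * steps)  # ~0.48 of perfect scaling
```

Because each row trails the one above by two CTUs, the last row cannot
even start until well into the frame, which is the ramp-up/ramp-down
loss the animation above visualizes.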
In x265 WPP is enabled by default since it not only improves performance
at encode but it also makes it possible for the decoder to be threaded.

If WPP is disabled by :option:`--no-wpp` the frame will be encoded in
scan order and the entropy overheads will be avoided. If frame
threading is not disabled, the encoder will change the default frame
thread count to be higher than if WPP were enabled. The exact formulas
are described in the next section.

Bonded Task Groups
==================

If a worker thread job has work which can be performed in parallel by
many threads, it may allocate a bonded task group and enlist the help of
other idle worker threads from the same thread pool. Those threads will
cooperate to complete the work of the bonded task group and then return
to their idle states. The larger and more uniform those tasks are, the
better the bonded task group will perform.

Parallel Mode Analysis
~~~~~~~~~~~~~~~~~~~~~~

When :option:`--pmode` is enabled, each CU (at all depths from 64x64 to
8x8) will distribute its analysis work to the thread pool via a bonded
task group. Each analysis job will measure the cost of one prediction
for the CU: merge, skip, intra, inter (2Nx2N, Nx2N, 2NxN, and AMP).

At slower presets, the amount of increased parallelism from pmode is
often enough to be able to reduce or disable frame parallelism while
achieving the same overall CPU utilization. Reducing frame threads is
often beneficial to ABR and VBV rate control.

Parallel Motion Estimation
~~~~~~~~~~~~~~~~~~~~~~~~~~

When :option:`--pme` is enabled, all of the analysis functions which
perform motion searches to reference frames will distribute those motion
searches to other worker threads via a bonded task group (if more than
two motion searches are required).
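The cooperation scheme of a bonded task group can be sketched as
follows. This is a hypothetical Python illustration, not x265's C++
implementation (`run_bonded_group` and its helpers are invented names;
x265 uses its own thread pool and atomic task counters): a provider
publishes a batch of independent tasks, helper threads claim task
indices from a shared counter, and the provider works alongside them
until the batch is drained, after which the helpers return to idle.

```python
# Hypothetical bonded-task-group sketch (illustrative names, not x265
# code): helpers and the provider pull task indices from one shared
# counter until every task in the batch has been claimed and run.
import itertools
import threading

def run_bonded_group(tasks, helpers=3):
    next_task = itertools.count()        # stands in for an atomic index
    results = [None] * len(tasks)

    def worker():
        while True:
            i = next(next_task)          # claim the next unclaimed task
            if i >= len(tasks):
                return                   # batch drained: back to idle
            results[i] = tasks[i]()

    bonded = [threading.Thread(target=worker) for _ in range(helpers)]
    for t in bonded:
        t.start()
    worker()                             # the provider participates too
    for t in bonded:
        t.join()
    return results
```

The group finishes when its slowest task does, which is why large,
uniform tasks (as noted above) make a bonded task group perform best.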
Frame Threading
===============

Frame threading is the act of encoding multiple frames at the same time.
It is a challenge because each frame will generally use one or more of
the previously encoded frames as motion references and those frames may
still be in the process of being encoded themselves.

Previous encoders such as x264 worked around this problem by limiting
the motion search region within these reference frames to just one
macroblock row below the coincident row being encoded. Thus a frame
could be encoded at the same time as its reference frames so long as it
stayed one row behind the encode progress of its references (glossing
over a few details).

x265 has the same frame threading mechanism, but we generally have much
less frame parallelism to exploit than x264 because of the size of our
CTU rows. For instance, with 1080p video x264 has 68 16x16 macroblock
rows available each frame while x265 only has 17 64x64 CTU rows.

The second extenuating circumstance is the loop filters. The pixels used
for motion reference must be processed by the loop filters, and the loop
filters cannot run until a full row has been encoded; they must run a
full row behind the encode process so that the pixels below the row
being filtered are available. On top of this, HEVC has two loop filters,
deblocking and SAO, which must be run in series with a row lag between
them. When you add up all the row lags, each frame ends up being 3 CTU
rows behind its reference frames (the equivalent of 12 macroblock rows
for x264). And keep in mind the wave-front progression pattern; by the
time the reference frame finishes the third row of CTUs, nearly half of
the CTUs in the frame may be compressed (depending on the display aspect
ratio).
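The row counts and lag figures quoted above follow directly from the
block sizes; a quick arithmetic check:

```python
import math

# 1080 luma rows divided by the block height, rounded up
assert math.ceil(1080 / 16) == 68    # x264: 16x16 macroblock rows
assert math.ceil(1080 / 64) == 17    # x265: 64x64 CTU rows

# a 3-CTU-row reference lag spans 3 * (64 / 16) = 12 macroblock rows
assert 3 * (64 // 16) == 12
```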
The third extenuating circumstance is that when a frame being encoded
becomes blocked waiting on a reference frame row to become available,
that frame's wave-front becomes completely stalled, and when the row
does become available it can take quite some time for the wave to be
restarted, if it ever is. This makes WPP less effective when frame
parallelism is in use.

:option:`--merange` can have a negative impact on frame parallelism. If
the range is too large, more rows of CTU lag must be added to ensure
those pixels are available in the reference frames.

.. note::

   Even though the merange is used to determine the amount of reference
   pixels that must be available in the reference frames, the actual
   motion search is not necessarily centered around the coincident
   block. The motion search is actually centered around the motion
   predictor, but the available pixel area (mvmin, mvmax) is determined
   by merange and the interpolation filter half-heights.

When frame threading is disabled, the entirety of all reference frames
is always fully available (by definition) and thus the available pixel
area is not restricted at all, and this can sometimes improve
compression efficiency. Because of this, the output of encodes with
frame parallelism disabled will not match the output of encodes with
frame parallelism enabled; but when enabled, the number of frame threads
should have no effect on the output bitstream except when using ABR or
VBV rate control or noise reduction.

When :option:`--nr` is enabled, the output for each frame thread count
will be deterministic but none of them will match, because each frame
encoder maintains a cumulative noise reduction state.

VBV introduces non-determinism in the encoder, at this point in time,
regardless of the amount of frame parallelism.

By default frame parallelism and WPP are enabled together.
The number of
frame threads used is auto-detected from the (hyperthreaded) CPU core
count, but may be manually specified via :option:`--frame-threads`.

+--------+--------+
| Cores  | Frames |
+========+========+
| > 32   | 6..8   |
+--------+--------+
| >= 16  | 5      |
+--------+--------+
| >= 8   | 3      |
+--------+--------+
| >= 4   | 2      |
+--------+--------+

If WPP is disabled, then the frame thread count defaults to
**min(cpuCount, ctuRows / 2)**.

Over-allocating frame threads can be very counter-productive. They
each allocate a large amount of memory, and because of the limited
number of CTU rows and the reference lag, you generally get limited
benefit from adding frame encoders beyond the auto-detected count;
often the extra frame encoders reduce performance.

Given these considerations, you can understand why the faster presets
lower the max CTU size to 32x32 (making twice as many CTU rows available
for WPP and for finer-grained frame parallelism) and reduce
:option:`--merange`.

Each frame encoder runs in its own thread (allocated separately from the
worker pool). This frame thread has some pre-processing and some
post-processing responsibilities for each frame, but it spends the bulk
of its time managing the wave-front processing by making CTU rows
available to the worker threads when their dependencies are resolved.
The frame encoder threads spend nearly all of their time blocked in one
of four possible locations:

1. blocked, waiting for a frame to process
2. blocked on a reference frame, waiting for a CTU row of reconstructed
   and loop-filtered reference pixels to become available
3. blocked waiting for wave-front completion
4. blocked waiting for the main thread to consume an encoded frame

Lookahead
=========

The lookahead module of x265 (the lowres pre-encode which determines
scene cuts and slice types) uses the thread pool to distribute the
lowres cost analysis to worker threads. It will use bonded task groups
to perform batches of frame cost estimates, and it may optionally use
bonded task groups to measure single frame cost estimates using slices
(see :option:`--lookahead-slices`).

The main slicetypeDecide() function itself is also performed by a worker
thread if your encoder has a thread pool, else it runs within the
context of the thread which calls x265_encoder_encode().

SAO
===

The Sample Adaptive Offset loop filter has a large effect on encode
performance because of the peculiar way it must be analyzed and coded.

SAO flags and data are encoded at the CTU level before the CTU itself is
coded, but SAO analysis (deciding whether to enable SAO and with what
parameters) cannot be performed until that CTU is completely analyzed
(reconstructed pixels are available) as well as the CTUs to the right
and below. So in effect the encoder must perform SAO analysis in a
wavefront at least a full row behind the CTU compression wavefront.

This extra latency forces the encoder to save the encode data of every
CTU until the entire frame has been analyzed, at which point a function
can code the final slice bitstream with the decided SAO flags and data
interleaved between each CTU. This second pass over the CTUs can be
expensive, particularly at large resolutions and high bitrates.
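The SAO analysis dependency described above can be expressed as a simple
readiness predicate. This is an illustrative model only (`sao_ready` is
an invented name and real x265 has corner cases this ignores): analysis
of CTU (r, c) waits for reconstructed pixels from that CTU, its right
neighbour, and the CTU below, which is what pushes the SAO wavefront at
least a full row behind the compression wavefront.

```python
# Illustrative SAO readiness rule (not x265 code): CTU (r, c) can be
# analyzed once it and its right/below neighbours are reconstructed.
def sao_ready(r, c, rows, cols, reconstructed):
    needed = {(r, c)}
    if c + 1 < cols:
        needed.add((r, c + 1))      # CTU to the right
    if r + 1 < rows:
        needed.add((r + 1, c))      # CTU below
    return needed <= reconstructed
```

Note that even with the entire first CTU row reconstructed, no CTU in
that row is SAO-ready until the row below starts producing reconstructed
pixels, which is the full-row lag described above.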