This document summarizes the common approaches for performance fine-tuning with
jemalloc (as of 5.1.0).  The default configuration of jemalloc tends to work
reasonably well in practice, and most applications should not have to tune any
options.  However, in order to cover a wide range of applications and avoid
pathological cases, the default settings are sometimes kept conservative and
suboptimal, even for many common workloads.  When jemalloc is properly tuned for
a specific application / workload, it is common to improve system-level metrics
by a few percent, or to make favorable trade-offs.


## Notable runtime options for performance tuning

Runtime options can be set via
[malloc_conf](http://jemalloc.net/jemalloc.3.html#tuning).
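
For example, options can be supplied through the `MALLOC_CONF` environment
variable (e.g. `MALLOC_CONF="background_thread:true,metadata_thp:auto"`), or
compiled into the application by defining the `malloc_conf` global variable.  A
minimal sketch of the latter, assuming an unprefixed jemalloc build (with a
prefixed build the variable is `je_malloc_conf`); the option string is only
illustrative:

```c
#include <stdlib.h>

/* Read by jemalloc before the first allocation; the options listed here are
 * an example, not a recommendation. */
const char *malloc_conf = "background_thread:true,metadata_thp:auto";

int main(void) {
    void *p = malloc(64);   /* served using the options above */
    free(p);
    return 0;
}
```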

* [background_thread](http://jemalloc.net/jemalloc.3.html#background_thread)

    Enabling jemalloc background threads generally improves the tail latency
    for application threads, since unused memory purging is shifted to the
    dedicated background threads.  In addition, background threads avoid the
    unintended purging delays caused by application inactivity.

    Suggested: `background_thread:true` when jemalloc-managed threads can be
    allowed.  Background threads can also be toggled at run time via
    `mallctl()`; see the sketch after this list.

* [metadata_thp](http://jemalloc.net/jemalloc.3.html#opt.metadata_thp)

    Allowing jemalloc to utilize transparent huge pages for its internal
    metadata usually reduces TLB misses significantly, especially for programs
    with large memory footprint and frequent allocation / deallocation
    activities.  Metadata memory usage may increase due to the use of huge
    pages.

    Suggested for allocation intensive programs: `metadata_thp:auto` or
    `metadata_thp:always`, which is expected to improve CPU utilization at a
    small memory cost.

* [dirty_decay_ms](http://jemalloc.net/jemalloc.3.html#opt.dirty_decay_ms) and
  [muzzy_decay_ms](http://jemalloc.net/jemalloc.3.html#opt.muzzy_decay_ms)

    Decay time determines how fast jemalloc returns unused pages back to the
    operating system, and therefore provides a fairly straightforward trade-off
    between CPU and memory usage.  A shorter decay time purges unused pages
    faster to reduce memory usage (usually at the cost of more CPU cycles spent
    on purging), and vice versa.

    Suggested: tune the values based on the desired trade-offs.

* [narenas](http://jemalloc.net/jemalloc.3.html#opt.narenas)

    By default jemalloc uses multiple arenas to reduce internal lock contention.
    However, a high arena count may also increase overall memory fragmentation,
    since arenas manage memory independently.  When a high degree of parallelism
    is not expected at the allocator level, a lower arena count often improves
    memory usage.

    Suggested: if low parallelism is expected, try a lower arena count while
    monitoring CPU and memory usage.

* [percpu_arena](http://jemalloc.net/jemalloc.3.html#opt.percpu_arena)

    Enables dynamic thread-to-arena association based on the CPU the thread is
    running on.  This has the potential to improve locality, e.g. when
    thread-to-CPU affinity is present.

    Suggested: try `percpu_arena:percpu` or `percpu_arena:phycpu` if thread
    migration between processors is expected to be infrequent.

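Most of the options above are fixed once the process starts, but background
threads can also be enabled or disabled while the process is running, through
the `background_thread` mallctl.  A minimal sketch, assuming an unprefixed
jemalloc build:

```c
#include <stdbool.h>
#include <jemalloc/jemalloc.h>

/* Enable jemalloc background threads at run time.  Returns 0 on success,
 * an errno-style value otherwise. */
static int enable_background_threads(void) {
    bool enable = true;
    return mallctl("background_thread", NULL, NULL, &enable, sizeof(enable));
}
```
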
Examples:

* High resource consumption application, prioritizing CPU utilization:

    `background_thread:true,metadata_thp:auto` combined with relaxed decay time
    (increased `dirty_decay_ms` and / or `muzzy_decay_ms`,
    e.g. `dirty_decay_ms:30000,muzzy_decay_ms:30000`).

* High resource consumption application, prioritizing memory usage:

    `background_thread:true` combined with shorter decay time (decreased
    `dirty_decay_ms` and / or `muzzy_decay_ms`,
    e.g. `dirty_decay_ms:5000,muzzy_decay_ms:5000`), and lower arena count
    (e.g. number of CPUs).

* Low resource consumption application:

    `narenas:1,lg_tcache_max:13` combined with shorter decay time (decreased
    `dirty_decay_ms` and / or `muzzy_decay_ms`,
    e.g. `dirty_decay_ms:1000,muzzy_decay_ms:0`).

* Extremely conservative -- minimize memory usage at all costs, only suitable
  when allocation activity is very rare:

    `narenas:1,tcache:false,dirty_decay_ms:0,muzzy_decay_ms:0`

Note that it is recommended to combine these options with `abort_conf:true`,
which aborts immediately on invalid options.
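
To confirm which options actually took effect, the current values can be read
back through the read-only `opt.*` mallctl namespace, or the allocator state
can be dumped with `malloc_stats_print()`.  A small sketch, assuming an
unprefixed jemalloc build:

```c
#include <stdbool.h>
#include <stdio.h>
#include <sys/types.h>
#include <jemalloc/jemalloc.h>

int main(void) {
    bool bg;
    ssize_t dirty_ms;
    size_t sz;

    /* Read back two of the options discussed above. */
    sz = sizeof(bg);
    mallctl("opt.background_thread", &bg, &sz, NULL, 0);
    sz = sizeof(dirty_ms);
    mallctl("opt.dirty_decay_ms", &dirty_ms, &sz, NULL, 0);
    printf("background_thread: %d, dirty_decay_ms: %zd\n", (int)bg, dirty_ms);

    /* Dump full configuration and usage statistics (to stderr by default). */
    malloc_stats_print(NULL, NULL, NULL);
    return 0;
}
```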

## Beyond runtime options

In addition to the runtime options, there are a number of programmatic ways to
improve application performance with jemalloc.

* [Explicit arenas](http://jemalloc.net/jemalloc.3.html#arenas.create)

    Manually created arenas can help performance in various ways, e.g. by
    managing locality and contention for specific usages.  For example,
    applications can explicitly allocate frequently accessed objects from a
    dedicated arena with
    [mallocx()](http://jemalloc.net/jemalloc.3.html#MALLOCX_ARENA) to improve
    locality.  In addition, explicit arenas often benefit from individually
    tuned options, e.g. relaxed [decay
    time](http://jemalloc.net/jemalloc.3.html#arena.i.dirty_decay_ms) if
    frequent reuse is expected.  A combined sketch is given after this list.

* [Extent hooks](http://jemalloc.net/jemalloc.3.html#arena.i.extent_hooks)

    Extent hooks allow customization for managing underlying memory.  One use
    case for performance purposes is to utilize huge pages -- for example,
    [HHVM](https://github.com/facebook/hhvm/blob/master/hphp/util/alloc.cpp)
    uses explicit arenas with customized extent hooks to manage 1GB huge pages
    for frequently accessed data, which reduces TLB misses significantly.

* [Explicit thread-to-arena
  binding](http://jemalloc.net/jemalloc.3.html#thread.arena)

    It is common for some threads in an application to have different memory
    access / allocation patterns.  Threads with heavy workloads often benefit
    from explicit binding, e.g. binding very active threads to dedicated arenas
    may reduce contention at the allocator level (see the sketch below).
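
The sketch below pulls these pieces together: it creates a dedicated arena,
relaxes that arena's dirty decay time, binds the calling thread to it, and
allocates from it explicitly with `mallocx()`.  It assumes an unprefixed
jemalloc build, uses purely illustrative values, and omits most error handling:

```c
#include <stdio.h>
#include <sys/types.h>
#include <jemalloc/jemalloc.h>

int main(void) {
    unsigned arena_ind;
    size_t sz = sizeof(arena_ind);
    char cmd[64];

    /* Create a new arena (with default extent hooks) and get its index. */
    if (mallctl("arenas.create", &arena_ind, &sz, NULL, 0) != 0) {
        return 1;
    }

    /* Relax this arena's dirty decay time, e.g. when frequent reuse is
     * expected (the 30 second value is purely illustrative). */
    ssize_t decay_ms = 30000;
    snprintf(cmd, sizeof(cmd), "arena.%u.dirty_decay_ms", arena_ind);
    mallctl(cmd, NULL, NULL, &decay_ms, sizeof(decay_ms));

    /* Option 1: bind the calling thread to the arena, so that plain
     * malloc() / free() on this thread are served from it. */
    mallctl("thread.arena", NULL, NULL, &arena_ind, sizeof(arena_ind));

    /* Option 2: allocate from the arena explicitly, regardless of the
     * calling thread's binding; bypassing the thread cache makes the
     * placement deterministic. */
    void *p = mallocx(4096, MALLOCX_ARENA(arena_ind) | MALLOCX_TCACHE_NONE);
    if (p != NULL) {
        dallocx(p, MALLOCX_TCACHE_NONE);
    }
    return 0;
}
```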