.. SPDX-License-Identifier: GPL-2.0

===============
Physical Memory
===============

Linux is available for a wide range of architectures so there is a need for an
architecture-independent abstraction to represent the physical memory. This
chapter describes the structures used to manage physical memory in a running
system.

The first principal concept prevalent in the memory management is
`Non-Uniform Memory Access (NUMA)
<https://en.wikipedia.org/wiki/Non-uniform_memory_access>`_.
With multi-core and multi-socket machines, memory may be arranged into banks
that incur a different cost to access depending on the “distance” from the
processor. For example, there might be a bank of memory assigned to each CPU or
a bank of memory very suitable for DMA near peripheral devices.

Each bank is called a node and the concept is represented under Linux by a
``struct pglist_data`` even if the architecture is UMA. This structure is
always referenced by its typedef ``pg_data_t``. The ``pg_data_t`` structure
for a particular node can be referenced by the ``NODE_DATA(nid)`` macro where
``nid`` is the ID of that node.

For NUMA architectures, the node structures are allocated by the architecture
specific code early during boot. Usually, these structures are allocated
locally on the memory bank they represent. For UMA architectures, only one
static ``pg_data_t`` structure called ``contig_page_data`` is used. Nodes will
be discussed further in Section :ref:`Nodes <nodes>`.

The entire physical address space is partitioned into one or more blocks
called zones which represent ranges within memory. These ranges are usually
determined by architectural constraints for accessing the physical memory.
The memory range within a node that corresponds to a particular zone is
described by a ``struct zone``, typedeffed to ``zone_t``. Each zone has
one of the types described below.

* ``ZONE_DMA`` and ``ZONE_DMA32`` historically represented memory suitable for
  DMA by peripheral devices that cannot access all of the addressable
  memory. For many years there have been better and more robust interfaces to
  get memory with DMA specific requirements
  (Documentation/core-api/dma-api.rst), but ``ZONE_DMA`` and ``ZONE_DMA32``
  still represent memory ranges that have restrictions on how they can be
  accessed.
  Depending on the architecture, either or both of these zone types can be
  disabled at build time using the ``CONFIG_ZONE_DMA`` and
  ``CONFIG_ZONE_DMA32`` configuration options. Some 64-bit platforms may need
  both zones as they support peripherals with different DMA addressing
  limitations.

* ``ZONE_NORMAL`` is for normal memory that can be accessed by the kernel all
  the time. DMA operations can be performed on pages in this zone if the DMA
  devices support transfers to all addressable memory. ``ZONE_NORMAL`` is
  always enabled.
* ``ZONE_HIGHMEM`` is the part of the physical memory that is not covered by a
  permanent mapping in the kernel page tables. The memory in this zone is only
  accessible to the kernel using temporary mappings. This zone is available
  only on some 32-bit architectures and is enabled with ``CONFIG_HIGHMEM``.

* ``ZONE_MOVABLE`` is for normal accessible memory, just like ``ZONE_NORMAL``.
  The difference is that the contents of most pages in ``ZONE_MOVABLE`` are
  movable. That means that while virtual addresses of these pages do not
  change, their content may move between different physical pages. Often
  ``ZONE_MOVABLE`` is populated during memory hotplug, but it may also be
  populated on boot using one of the ``kernelcore``, ``movablecore`` and
  ``movable_node`` kernel command line parameters. See
  Documentation/mm/page_migration.rst and
  Documentation/admin-guide/mm/memory_hotplug.rst for additional details.

* ``ZONE_DEVICE`` represents memory residing on devices such as PMEM and GPU.
  It has different characteristics than RAM zone types and it exists to provide
  :ref:`struct page <Pages>` and memory map services for device driver
  identified physical address ranges. ``ZONE_DEVICE`` is enabled with the
  configuration option ``CONFIG_ZONE_DEVICE``.

It is important to note that many kernel operations can only take place using
``ZONE_NORMAL`` so it is the most performance critical zone. Zones are
discussed further in Section :ref:`Zones <zones>`.

The relation between node and zone extents is determined by the physical memory
map reported by the firmware, architectural constraints for memory addressing
and certain parameters in the kernel command line.

For example, with a 32-bit kernel on an x86 UMA machine with 2 Gbytes of RAM
the entire memory will be on node 0 and there will be three zones:
``ZONE_DMA``, ``ZONE_NORMAL`` and ``ZONE_HIGHMEM``::

  0                                                            2G
  +-------------------------------------------------------------+
  |                            node 0                           |
  +-------------------------------------------------------------+

  0         16M                    896M                        2G
  +----------+-----------------------+--------------------------+
  | ZONE_DMA |      ZONE_NORMAL      |       ZONE_HIGHMEM       |
  +----------+-----------------------+--------------------------+


With a kernel built with ``ZONE_DMA`` disabled and ``ZONE_DMA32`` enabled and
booted with the ``movablecore=80%`` parameter on an arm64 machine with 16
Gbytes of RAM equally split between two nodes, there will be ``ZONE_DMA32``,
``ZONE_NORMAL`` and ``ZONE_MOVABLE`` on node 0, and ``ZONE_NORMAL`` and
``ZONE_MOVABLE`` on node 1::

  1G                                9G                         17G
  +--------------------------------+ +--------------------------+
  |              node 0            | |          node 1          |
  +--------------------------------+ +--------------------------+

  1G       4G        4200M          9G          9320M          17G
  +---------+----------+-----------+ +------------+-------------+
  |  DMA32  |  NORMAL  |  MOVABLE  | |   NORMAL   |   MOVABLE   |
  +---------+----------+-----------+ +------------+-------------+
.. _nodes:

Nodes
=====

As we have mentioned, each node in memory is described by a ``pg_data_t`` which
is a typedef for a ``struct pglist_data``. When allocating a page, by default
Linux uses a node-local allocation policy to allocate memory from the node
closest to the running CPU. As processes tend to run on the same CPU, it is
likely the memory from the current node will be used. The allocation policy can
be controlled by users as described in
Documentation/admin-guide/mm/numa_memory_policy.rst.

Most NUMA architectures maintain an array of pointers to the node
structures. The actual structures are allocated early during boot when
architecture specific code parses the physical memory map reported by the
firmware. The bulk of the node initialization happens slightly later in the
boot process in the ``free_area_init()`` function, described later in Section
:ref:`Initialization <initialization>`.


Along with the node structures, the kernel maintains an array of ``nodemask_t``
bitmasks called ``node_states``. Each bitmask in this array represents a set of
nodes with particular properties as defined by ``enum node_states``:

``N_POSSIBLE``
  The node could become online at some point.
``N_ONLINE``
  The node is online.
``N_NORMAL_MEMORY``
  The node has regular memory.
``N_HIGH_MEMORY``
  The node has regular or high memory. When ``CONFIG_HIGHMEM`` is disabled it
  is aliased to ``N_NORMAL_MEMORY``.
``N_MEMORY``
  The node has memory (regular, high, movable).
``N_CPU``
  The node has one or more CPUs.

For each node that has a property described above, the bit corresponding to the
node ID in the ``node_states[<property>]`` bitmask is set.

For example, for node 2 with normal memory and CPUs, bit 2 will be set in ::

  node_states[N_POSSIBLE]
  node_states[N_ONLINE]
  node_states[N_NORMAL_MEMORY]
  node_states[N_HIGH_MEMORY]
  node_states[N_MEMORY]
  node_states[N_CPU]

For various operations possible with nodemasks please refer to
``include/linux/nodemask.h``.

Among other things, nodemasks are used to provide macros for node traversal,
namely ``for_each_node()`` and ``for_each_online_node()``.

For instance, to call a function foo() for each online node::

	for_each_online_node(nid) {
		pg_data_t *pgdat = NODE_DATA(nid);

		foo(pgdat);
	}

Node structure
--------------

The node structure ``struct pglist_data`` is declared in
``include/linux/mmzone.h``. Here we briefly describe fields of this
structure:

General
~~~~~~~

``node_zones``
  The zones for this node. Not all of the zones may be populated, but it is
  the full list. It is referenced by this node's ``node_zonelists`` as well
  as other nodes' ``node_zonelists``.

``node_zonelists``
  The list of all zones in all nodes. This list defines the order of zones
  that allocations are preferred from. The ``node_zonelists`` is set up by
  ``build_zonelists()`` in ``mm/page_alloc.c`` during the initialization of
  core memory management structures.

``nr_zones``
  Number of populated zones in this node.

``node_mem_map``
  For UMA systems that use the FLATMEM memory model, node 0's
  ``node_mem_map`` is an array of struct pages representing each physical
  frame.

``node_page_ext``
  For UMA systems that use the FLATMEM memory model, node 0's
  ``node_page_ext`` is an array of extensions of struct pages. Available only
  in kernels built with ``CONFIG_PAGE_EXTENSION`` enabled.

``node_start_pfn``
  The page frame number of the starting page frame in this node.

``node_present_pages``
  Total number of physical pages present in this node.

``node_spanned_pages``
  Total size of physical page range, including holes.

``node_size_lock``
  A lock that protects the fields defining the node extents. Only defined when
  at least one of ``CONFIG_MEMORY_HOTPLUG`` or
  ``CONFIG_DEFERRED_STRUCT_PAGE_INIT`` configuration options is enabled.
  ``pgdat_resize_lock()`` and ``pgdat_resize_unlock()`` are provided to
  manipulate ``node_size_lock`` without checking for ``CONFIG_MEMORY_HOTPLUG``
  or ``CONFIG_DEFERRED_STRUCT_PAGE_INIT``.

``node_id``
  The Node ID (NID) of the node, starts at 0.

``totalreserve_pages``
  This is a per-node reserve of pages that are not available to userspace
  allocations.

``first_deferred_pfn``
  If memory initialization on large machines is deferred then this is the
  first PFN that needs to be initialized. Defined only when
  ``CONFIG_DEFERRED_STRUCT_PAGE_INIT`` is enabled.

``deferred_split_queue``
  Per-node queue of huge pages whose split was deferred. Defined only when
  ``CONFIG_TRANSPARENT_HUGEPAGE`` is enabled.

``__lruvec``
  Per-node lruvec holding LRU lists and related parameters. Used only when
  memory cgroups are disabled. It should not be accessed directly, use
  ``mem_cgroup_lruvec()`` to look up lruvecs instead.

Reclaim control
~~~~~~~~~~~~~~~

See also Documentation/mm/page_reclaim.rst.

``kswapd``
  Per-node instance of the kswapd kernel thread.

``kswapd_wait``, ``pfmemalloc_wait``, ``reclaim_wait``
  Workqueues used to synchronize memory reclaim tasks.

``nr_writeback_throttled``
  Number of tasks that are throttled waiting on dirty pages to be cleaned.

``nr_reclaim_start``
  Number of pages written while reclaim is throttled waiting for writeback.

``kswapd_order``
  Controls the order kswapd tries to reclaim.

``kswapd_highest_zoneidx``
  The highest zone index to be reclaimed by kswapd.

``kswapd_failures``
  Number of runs in which kswapd was unable to reclaim any pages.

``min_unmapped_pages``
  Minimal number of unmapped file backed pages that cannot be reclaimed.
  Determined by the ``vm.min_unmapped_ratio`` sysctl. Only defined when
  ``CONFIG_NUMA`` is enabled.

``min_slab_pages``
  Minimal number of SLAB pages that cannot be reclaimed. Determined by the
  ``vm.min_slab_ratio`` sysctl. Only defined when ``CONFIG_NUMA`` is enabled.

``flags``
  Flags controlling reclaim behavior.

Compaction control
~~~~~~~~~~~~~~~~~~

``kcompactd_max_order``
  Page order that kcompactd should try to achieve.

``kcompactd_highest_zoneidx``
  The highest zone index to be compacted by kcompactd.

``kcompactd_wait``
  Workqueue used to synchronize memory compaction tasks.

``kcompactd``
  Per-node instance of the kcompactd kernel thread.

``proactive_compact_trigger``
  Determines if proactive compaction is enabled. Controlled by the
  ``vm.compaction_proactiveness`` sysctl.

Statistics
~~~~~~~~~~

``per_cpu_nodestats``
  Per-CPU VM statistics for the node.

``vm_stat``
  VM statistics for the node.

.. _zones:

Zones
=====

.. admonition:: Stub

   This section is incomplete. Please list and describe the appropriate fields.

.. _pages:

Pages
=====

.. admonition:: Stub

   This section is incomplete. Please list and describe the appropriate fields.

.. _folios:

Folios
======

.. admonition:: Stub

   This section is incomplete. Please list and describe the appropriate fields.

.. _initialization:

Initialization
==============

.. admonition:: Stub

   This section is incomplete. Please list and describe the appropriate fields.