.. SPDX-License-Identifier: GPL-2.0
.. _iomap_design:

..
        Dumb style notes to maintain the author's sanity:
        Please try to start sentences on separate lines so that
        sentence changes don't bleed colors in diff.
        Heading decorations are documented in sphinx.rst.

==============
Library Design
==============

.. contents:: Table of Contents
   :local:

Introduction
============

iomap is a filesystem library for handling common file operations.
The library has two layers:

 1. A lower layer that provides an iterator over ranges of file offsets.
    This layer tries to obtain mappings of each file range to storage
    from the filesystem, but the storage information is not necessarily
    required.

 2. An upper layer that acts upon the space mappings provided by the
    lower layer iterator.

The iteration can involve mappings of a file's logical offset ranges to
physical extents, but the storage layer information is not necessarily
required, e.g. for walking cached file information.
The library exports various APIs for implementing file operations such
as:

 * Pagecache reads and writes
 * Folio write faults to the pagecache
 * Writeback of dirty folios
 * Direct I/O reads and writes
 * fsdax I/O reads, writes, loads, and stores
 * FIEMAP
 * lseek ``SEEK_DATA`` and ``SEEK_HOLE``
 * swapfile activation

The origin of this library is the file I/O path that XFS once used; it
has now been extended to cover several other operations.

Who Should Read This?
=====================

The target audience for this document is filesystem, storage, and
pagecache programmers and code reviewers.

If you are working on PCI, machine architectures, or device drivers, you
are most likely in the wrong place.

How Is This Better?
===================

Unlike the classic Linux I/O model which breaks file I/O into small
units (generally memory pages or blocks) and looks up space mappings on
the basis of that unit, the iomap model asks the filesystem for the
largest space mappings that it can create for a given file operation and
initiates operations on that basis.
This strategy improves the filesystem's visibility into the size of the
operation being performed, which enables it to combat fragmentation with
larger space allocations when possible.
Larger space mappings improve runtime performance by amortizing the cost
of mapping function calls into the filesystem across a larger amount of
data.

At a high level, an iomap operation `looks like this
<https://lore.kernel.org/all/ZGbVaewzcCysclPt@dread.disaster.area/>`_:

1. For each byte in the operation range...

   1. Obtain a space mapping via ``->iomap_begin``

   2. For each sub-unit of work...

      1. Revalidate the mapping and go back to (1) above, if necessary.
         So far only the pagecache operations need to do this.

      2. Do the work

   3. Increment operation cursor

   4. Release the mapping via ``->iomap_end``, if necessary
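
The loop above can be modeled in a few lines of userspace C.
This is a minimal sketch, not the kernel iterator: ``toy_map``,
``toy_begin``, ``toy_end``, and ``toy_iterate`` are hypothetical names,
and the toy filesystem is assumed to hand out mappings that never cross
a ``TOY_EXTENT``-byte boundary.
Revalidation is left out for brevity.

.. code-block:: c

 #include <stdint.h>

 #define TOY_EXTENT 4096u   /* assumed extent size of the toy filesystem */

 struct toy_map {
     uint64_t offset;   /* file offset where the mapping starts */
     uint64_t length;   /* number of bytes covered by the mapping */
 };

 /* Stand-in for ->iomap_begin: map from pos to the end of its extent. */
 static int toy_begin(uint64_t pos, uint64_t len, struct toy_map *map)
 {
     uint64_t extent_end = (pos / TOY_EXTENT + 1) * TOY_EXTENT;

     map->offset = pos;
     map->length = extent_end - pos;
     if (map->length > len)
         map->length = len;
     return 0;
 }

 /* Stand-in for ->iomap_end: this model has nothing to release. */
 static int toy_end(uint64_t pos, uint64_t len, uint64_t done,
                    struct toy_map *map)
 {
     (void)pos; (void)len; (void)done; (void)map;
     return 0;
 }

 /*
  * Walk [pos, pos + len) one mapping at a time.
  * Returns the number of mappings obtained, or a negative value on error.
  */
 static int toy_iterate(uint64_t pos, uint64_t len)
 {
     int nr_mappings = 0;

     while (len > 0) {
         struct toy_map map;
         uint64_t start = pos;
         int ret = toy_begin(pos, len, &map);

         if (ret)
             return ret;
         /* "Do the work" on map.length bytes would happen here. */
         nr_mappings++;
         /* Increment the operation cursor. */
         pos += map.length;
         len -= map.length;
         /* Release the mapping. */
         ret = toy_end(start, map.length, map.length, &map);
         if (ret)
             return ret;
     }
     return nr_mappings;
 }

In this model, an 8192-byte operation starting at offset 100 needs three
mappings, because the range touches three extents.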

Each iomap operation will be covered in more detail below.
This library was covered previously by an `LWN article
<https://lwn.net/Articles/935934/>`_ and a `KernelNewbies page
<https://kernelnewbies.org/KernelProjects/iomap>`_.

The goal of this document is to provide a brief discussion of the
design and capabilities of iomap, followed by a more detailed catalog
of the interfaces presented by iomap.
If you change iomap, please update this design document.

File Range Iterator
===================

Definitions
-----------

 * **buffer head**: Shattered remnants of the old buffer cache.

 * ``fsblock``: The block size of a file, also known as ``i_blocksize``.

 * ``i_rwsem``: The VFS ``struct inode`` rwsemaphore.
   Processes hold this in shared mode to read file state and contents.
   Some filesystems may allow shared mode for writes.
   Processes often hold this in exclusive mode to change file state and
   contents.

 * ``invalidate_lock``: The pagecache ``struct address_space``
   rwsemaphore that protects against folio insertion and removal for
   filesystems that support punching out folios below EOF.
   Processes wishing to insert folios must hold this lock in shared
   mode to prevent removal, though concurrent insertion is allowed.
   Processes wishing to remove folios must hold this lock in exclusive
   mode to prevent insertions.
   Concurrent removals are not allowed.

 * ``dax_read_lock``: The RCU read lock that dax takes to prevent a
   device pre-shutdown hook from returning before other threads have
   released resources.

 * **filesystem mapping lock**: This synchronization primitive is
   internal to the filesystem and must protect the file mapping data
   from updates while a mapping is being sampled.
   The filesystem author must determine how this coordination should
   happen; it does not need to be an actual lock.

 * **iomap internal operation lock**: This is a general term for
   synchronization primitives that iomap functions take while holding a
   mapping.
   A specific example would be taking the folio lock while reading or
   writing the pagecache.

 * **pure overwrite**: A write operation that does not require any
   metadata or zeroing operations during either submission or
   completion.
   This implies that the filesystem must have already allocated space
   on disk as ``IOMAP_MAPPED`` and the filesystem must not place any
   constraints on I/O alignment or size.
   The only constraints on I/O alignment are device level (minimum I/O
   size and alignment, typically sector size).

``struct iomap``
----------------

The filesystem communicates to the iomap iterator the mapping of
byte ranges of a file to byte ranges of a storage device with the
structure below:

.. code-block:: c

 struct iomap {
     u64                 addr;
     loff_t              offset;
     u64                 length;
     u16                 type;
     u16                 flags;
     struct block_device *bdev;
     struct dax_device   *dax_dev;
     void                *inline_data;
     void                *private;
     const struct iomap_folio_ops *folio_ops;
     u64                 validity_cookie;
 };

The fields are as follows:

 * ``offset`` and ``length`` describe the range of file offsets, in
   bytes, covered by this mapping.
   These fields must always be set by the filesystem.

 * ``type`` describes the type of the space mapping:

   * **IOMAP_HOLE**: No storage has been allocated.
     This type must never be returned in response to an ``IOMAP_WRITE``
     operation because writes must allocate and map space, and return
     the mapping.
     The ``addr`` field must be set to ``IOMAP_NULL_ADDR``.
     iomap does not support writing (whether via pagecache or direct
     I/O) to a hole.

   * **IOMAP_DELALLOC**: A promise to allocate space at a later time
     ("delayed allocation").
     If the filesystem returns ``IOMAP_F_NEW`` here and the write fails,
     the ``->iomap_end`` function must delete the reservation.
     The ``addr`` field must be set to ``IOMAP_NULL_ADDR``.

   * **IOMAP_MAPPED**: The file range maps to specific space on the
     storage device.
     The device is returned in ``bdev`` or ``dax_dev``.
     The device address, in bytes, is returned via ``addr``.

   * **IOMAP_UNWRITTEN**: The file range maps to specific space on the
     storage device, but the space has not yet been initialized.
     The device is returned in ``bdev`` or ``dax_dev``.
     The device address, in bytes, is returned via ``addr``.
     Reads from this type of mapping will return zeroes to the caller.
     For a write or writeback operation, the ioend should update the
     mapping to ``IOMAP_MAPPED``.
     Refer to the sections about ioends for more details.

   * **IOMAP_INLINE**: The file range maps to the memory buffer
     specified by ``inline_data``.
     For write operations, the ``->iomap_end`` function presumably
     handles persisting the data.
     The ``addr`` field must be set to ``IOMAP_NULL_ADDR``.

 * ``flags`` describe the status of the space mapping.
   These flags should be set by the filesystem in ``->iomap_begin``:

   * **IOMAP_F_NEW**: The space under the mapping is newly allocated.
     Areas that will not be written to must be zeroed.
     If a write fails and the mapping is a space reservation, the
     reservation must be deleted.

   * **IOMAP_F_DIRTY**: The inode will have uncommitted metadata needed
     to access any data written.
     fdatasync is required to commit these changes to persistent
     storage.
     This needs to take into account metadata changes that *may* be made
     at I/O completion, such as file size updates from direct I/O.

   * **IOMAP_F_SHARED**: The space under the mapping is shared.
     Copy on write is necessary to avoid corrupting other file data.

   * **IOMAP_F_BUFFER_HEAD**: This mapping requires the use of buffer
     heads for pagecache operations.
     Do not add more uses of this.

   * **IOMAP_F_MERGED**: Multiple contiguous block mappings were
     coalesced into this single mapping.
     This is only useful for FIEMAP.

   * **IOMAP_F_XATTR**: The mapping is for extended attribute data, not
     regular file data.
     This is only useful for FIEMAP.

   * **IOMAP_F_PRIVATE**: Starting with this value, the upper bits can
     be set by the filesystem for its own purposes.

   These flags can be set by iomap itself during file operations.
   The filesystem should supply an ``->iomap_end`` function if it needs
   to observe these flags:

   * **IOMAP_F_SIZE_CHANGED**: The file size has changed as a result of
     using this mapping.

   * **IOMAP_F_STALE**: The mapping was found to be stale.
     iomap will call ``->iomap_end`` on this mapping and then
     ``->iomap_begin`` to obtain a new mapping.

   Currently, these flags are only set by pagecache operations.

 * ``addr`` describes the device address, in bytes.

 * ``bdev`` describes the block device for this mapping.
   This only needs to be set for mapped or unwritten operations.

 * ``dax_dev`` describes the DAX device for this mapping.
   This only needs to be set for mapped or unwritten operations, and
   only for a fsdax operation.

 * ``inline_data`` points to a memory buffer for I/O involving
   ``IOMAP_INLINE`` mappings.
   This value is ignored for all other mapping types.

 * ``private`` is a pointer to `filesystem-private information
   <https://lore.kernel.org/all/20180619164137.13720-7-hch@lst.de/>`_.
   This value will be passed unchanged to ``->iomap_end``.

 * ``folio_ops`` will be covered in the section on pagecache operations.

 * ``validity_cookie`` is a magic freshness value set by the filesystem
   that should be used to detect stale mappings.
   For pagecache operations this is critical for correct operation
   because page faults can occur, which implies that filesystem locks
   should not be held between ``->iomap_begin`` and ``->iomap_end``.
   Filesystems with completely static mappings need not set this value.
   Only pagecache operations revalidate mappings; see the section about
   ``iomap_valid`` for details.
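
One way to produce a freshness value is a per-file sequence counter that
the filesystem bumps whenever it changes the file's mappings.
The userspace sketch below models that idea.
The counter scheme is an assumption made for illustration, not a
requirement of iomap, and ``toy_inode``, ``toy_mapping``,
``toy_sample_cookie``, ``toy_change_mapping``, and ``toy_mapping_valid``
are hypothetical names.

.. code-block:: c

 #include <stdbool.h>
 #include <stdint.h>

 struct toy_inode {
     uint64_t map_seq;          /* bumped on every mapping change */
 };

 struct toy_mapping {
     uint64_t validity_cookie;  /* value of map_seq when sampled */
 };

 /* Record the counter while sampling a mapping (mapping lock held). */
 static void toy_sample_cookie(const struct toy_inode *ip,
                               struct toy_mapping *map)
 {
     map->validity_cookie = ip->map_seq;
 }

 /* The filesystem calls this whenever it alters the file's mappings. */
 static void toy_change_mapping(struct toy_inode *ip)
 {
     ip->map_seq++;
 }

 /* A mapping is stale once the counter no longer matches the cookie. */
 static bool toy_mapping_valid(const struct toy_inode *ip,
                               const struct toy_mapping *map)
 {
     return map->validity_cookie == ip->map_seq;
 }

A caller that finds the mapping invalid would discard it and sample a
new one, which mirrors the ``IOMAP_F_STALE`` behavior described above.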

``struct iomap_ops``
--------------------

Every iomap function requires the filesystem to pass an operations
structure to obtain a mapping and (optionally) to release the mapping:

.. code-block:: c

 struct iomap_ops {
     int (*iomap_begin)(struct inode *inode, loff_t pos, loff_t length,
                        unsigned flags, struct iomap *iomap,
                        struct iomap *srcmap);

     int (*iomap_end)(struct inode *inode, loff_t pos, loff_t length,
                      ssize_t written, unsigned flags,
                      struct iomap *iomap);
 };

``->iomap_begin``
~~~~~~~~~~~~~~~~~

iomap operations call ``->iomap_begin`` to obtain one file mapping for
the range of bytes specified by ``pos`` and ``length`` for the file
``inode``.
This mapping should be returned through the ``iomap`` pointer.
The mapping must cover at least the first byte of the supplied file
range, but it does not need to cover the entire requested range.
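
As an illustration of these rules, the sketch below fills in a
simplified mapping for a toy file that has a single allocated extent.
It is userspace C, not kernel code: ``struct toy_iomap`` carries only a
few of the fields described above, ``TOY_HOLE``, ``TOY_MAPPED``, and
``TOY_NULL_ADDR`` are local stand-ins for ``IOMAP_HOLE``,
``IOMAP_MAPPED``, and ``IOMAP_NULL_ADDR``, and the extent layout
(``EXT_START``, ``EXT_LEN``, ``EXT_ADDR``) is invented for the example.

.. code-block:: c

 #include <stdint.h>

 #define TOY_NULL_ADDR UINT64_MAX   /* stand-in for IOMAP_NULL_ADDR */

 enum toy_type { TOY_HOLE, TOY_MAPPED };

 struct toy_iomap {
     uint64_t      addr;     /* device address, in bytes */
     uint64_t      offset;   /* file offset, in bytes */
     uint64_t      length;   /* bytes covered by this mapping */
     enum toy_type type;
 };

 /* Invented layout: file bytes [8192, 16384) live at device byte 1 MiB. */
 #define EXT_START 8192u
 #define EXT_LEN   8192u
 #define EXT_ADDR  (1u << 20)

 static int toy_iomap_begin(uint64_t pos, uint64_t length,
                            struct toy_iomap *iomap)
 {
     uint64_t ext_end = EXT_START + EXT_LEN;
     uint64_t map_end;

     /* The mapping always starts at pos, so it covers the first byte. */
     iomap->offset = pos;
     if (pos < EXT_START) {
         /* Hole in front of the extent. */
         iomap->type = TOY_HOLE;
         iomap->addr = TOY_NULL_ADDR;
         map_end = EXT_START;
     } else if (pos < ext_end) {
         /* Inside the extent: translate the file offset. */
         iomap->type = TOY_MAPPED;
         iomap->addr = EXT_ADDR + (pos - EXT_START);
         map_end = ext_end;
     } else {
         /* Hole after the extent, reported up to the end of the request. */
         iomap->type = TOY_HOLE;
         iomap->addr = TOY_NULL_ADDR;
         map_end = pos + length;
     }
     /* The mapping may be shorter than the request, never longer here. */
     if (map_end > pos + length)
         map_end = pos + length;
     iomap->length = map_end - pos;
     return 0;
 }

A request that starts in the hole and runs into the extent gets back
only the hole; the caller is expected to ask again for the remainder.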

Each iomap operation describes the requested operation through the
``flags`` argument.
The exact value of ``flags`` will be documented in the
operation-specific sections below.
These flags can, at least in principle, apply generally to iomap
operations:

 * ``IOMAP_DIRECT`` is set when the caller wishes to issue file I/O to
   block storage.

 * ``IOMAP_DAX`` is set when the caller wishes to issue file I/O to
   memory-like storage.

 * ``IOMAP_NOWAIT`` is set when the caller wishes to make a best-effort
   attempt to avoid any operation that would result in blocking the
   submitting task.
   This is similar in intent to ``O_NONBLOCK`` for network APIs; it is
   intended for asynchronous applications to keep doing other work
   instead of waiting for the specific unavailable filesystem resource
   to become available.
   Filesystems implementing ``IOMAP_NOWAIT`` semantics need to use
   trylock algorithms.
   They need to be able to satisfy the entire I/O request range with a
   single iomap mapping.
   They need to avoid reading or writing metadata synchronously.
   They need to avoid blocking memory allocations.
   They need to avoid waiting on transaction reservations to allow
   modifications to take place.
   They probably should not be allocating new space.
   And so on.
   If there is any doubt in the filesystem developer's mind as to
   whether any specific ``IOMAP_NOWAIT`` operation may end up blocking,
   then they should return ``-EAGAIN`` as early as possible rather than
   start the operation and force the submitting task to block.
   ``IOMAP_NOWAIT`` is often set on behalf of ``IOCB_NOWAIT`` or
   ``RWF_NOWAIT``.
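
The trylock requirement can be sketched in userspace C, with a C11
``atomic_flag`` standing in for the filesystem mapping lock.
``TOY_NOWAIT``, ``toy_lock_mapping``, and ``toy_unlock_mapping`` are
hypothetical; a real filesystem would use its own lock and test the real
``IOMAP_NOWAIT`` flag.

.. code-block:: c

 #include <errno.h>
 #include <stdatomic.h>

 #define TOY_NOWAIT 0x1u   /* stand-in for IOMAP_NOWAIT */

 static atomic_flag toy_map_lock = ATOMIC_FLAG_INIT;

 /* Take the mapping lock, honoring the caller's no-wait request. */
 static int toy_lock_mapping(unsigned int flags)
 {
     if (flags & TOY_NOWAIT) {
         /* One attempt only: report -EAGAIN instead of waiting. */
         if (atomic_flag_test_and_set(&toy_map_lock))
             return -EAGAIN;
         return 0;
     }
     /* Callers that may block simply wait for the lock. */
     while (atomic_flag_test_and_set(&toy_map_lock))
         ;
     return 0;
 }

 static void toy_unlock_mapping(void)
 {
     atomic_flag_clear(&toy_map_lock);
 }

The check happens before any other work is started, so a no-wait caller
that loses the race gets ``-EAGAIN`` without having to undo anything.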

If it is necessary to read existing file contents from a `different
<https://lore.kernel.org/all/20191008071527.29304-9-hch@lst.de/>`_
device or address range on a device, the filesystem should return that
information via ``srcmap``.
Only pagecache and fsdax operations support reading from one mapping and
writing to another.

``->iomap_end``
~~~~~~~~~~~~~~~

After the operation completes, the ``->iomap_end`` function, if present,
is called to signal that iomap is finished with a mapping.
Typically, implementations will use this function to tear down any
context that was set up in ``->iomap_begin``.
For example, a write might wish to commit the reservations for the bytes
that were operated upon and unreserve any space that was not operated
upon.
``written`` might be zero if no bytes were touched.
``flags`` will contain the same value passed to ``->iomap_begin``.
iomap ops for reads are not likely to need to supply this function.

Both functions should return a negative errno code on error, or zero on
success.
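
The write example above (commit what was written, unreserve the rest)
can be modeled as follows.
This is a userspace sketch under stated assumptions: ``toy_resv`` and
``toy_write_end`` are hypothetical, and a reservation is tracked as
plain byte counts rather than filesystem metadata.

.. code-block:: c

 #include <errno.h>
 #include <stdint.h>

 struct toy_resv {
     uint64_t reserved;    /* bytes still reserved from the begin step */
     uint64_t committed;   /* bytes that now back written data */
     uint64_t released;    /* bytes handed back unused */
 };

 /*
  * Model of an ->iomap_end for writes over a range of length bytes,
  * of which written bytes were actually operated upon.
  */
 static int toy_write_end(struct toy_resv *resv, uint64_t length,
                          int64_t written)
 {
     uint64_t done;

     if (written < 0 || (uint64_t)written > length ||
         length > resv->reserved)
         return -EINVAL;

     done = (uint64_t)written;
     resv->committed += done;           /* keep what was written */
     resv->released += length - done;   /* unreserve the remainder */
     resv->reserved -= length;          /* the range is fully settled */
     return 0;
 }

When ``written`` is zero, the whole range is released and nothing is
committed.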

Preparing for File Operations
=============================

iomap only handles mapping and I/O.
Filesystems must still call out to the VFS to check input parameters
and file state before initiating an I/O operation.
iomap does not handle obtaining filesystem freeze protection, updating
of timestamps, stripping privileges, or access control.

Locking Hierarchy
=================

iomap requires that filesystems supply their own locking model.
There are three categories of synchronization primitives, as far as
iomap is concerned:

 * The **upper** level primitive is provided by the filesystem to
   coordinate access to different iomap operations.
   The exact primitive is specific to the filesystem and operation,
   but is often a VFS inode, pagecache invalidation, or folio lock.
   For example, a filesystem might take ``i_rwsem`` before calling
   ``iomap_file_buffered_write`` and ``iomap_file_unshare`` to prevent
   these two file operations from clobbering each other.
   Pagecache writeback may lock a folio to prevent other threads from
   accessing the folio until writeback is underway.

 * The **lower** level primitive is taken by the filesystem in the
   ``->iomap_begin`` and ``->iomap_end`` functions to coordinate
   access to the file space mapping information.
   The fields of the iomap object should be filled out while holding
   this primitive.
   The upper level synchronization primitive, if any, remains held
   while acquiring the lower level synchronization primitive.
   For example, XFS takes ``ILOCK_EXCL`` and ext4 takes ``i_data_sem``
   while sampling mappings.
   Filesystems with immutable mapping information may not require
   synchronization here.

 * The **operation** primitive is taken by an iomap operation to
   coordinate access to its own internal data structures.
   The upper level synchronization primitive, if any, remains held
   while acquiring this primitive.
   The lower level primitive is not held while acquiring this
   primitive.
   For example, pagecache write operations will obtain a file mapping,
   then grab and lock a folio to copy new contents.
   It may also lock an internal folio state object to update metadata.

The exact locking requirements are specific to the filesystem; for
certain operations, some of these locks can be elided.
All further mentions of locking are *recommendations*, not mandates.
Each filesystem author must figure out the locking for themselves.

Bugs and Limitations
====================

 * No support for fscrypt.
 * No support for compression.
 * No support for fsverity yet.
 * Strong assumptions that IO should work the way it does on XFS.
 * Does iomap *actually* work for non-regular file data?

Patches welcome!
442