131e6b01fSNick PigginPath walking and name lookup locking
231e6b01fSNick Piggin====================================
331e6b01fSNick Piggin
431e6b01fSNick PigginPath resolution is the finding a dentry corresponding to a path name string, by
531e6b01fSNick Pigginperforming a path walk. Typically, for every open(), stat() etc., the path name
631e6b01fSNick Pigginwill be resolved. Paths are resolved by walking the namespace tree, starting
731e6b01fSNick Pigginwith the first component of the pathname (eg. root or cwd) with a known dentry,
831e6b01fSNick Pigginthen finding the child of that dentry, which is named the next component in the
931e6b01fSNick Pigginpath string. Then repeating the lookup from the child dentry and finding its
1031e6b01fSNick Pigginchild with the next element, and so on.
1131e6b01fSNick Piggin
1231e6b01fSNick PigginSince it is a frequent operation for workloads like multiuser environments and
1331e6b01fSNick Pigginweb servers, it is important to optimize this code.
1431e6b01fSNick Piggin
1531e6b01fSNick PigginPath walking synchronisation history:
1631e6b01fSNick PigginPrior to 2.5.10, dcache_lock was acquired in d_lookup (dcache hash lookup) and
1731e6b01fSNick Pigginthus in every component during path look-up. Since 2.5.10 onwards, fast-walk
1831e6b01fSNick Pigginalgorithm changed this by holding the dcache_lock at the beginning and walking
1931e6b01fSNick Pigginas many cached path component dentries as possible. This significantly
2031e6b01fSNick Piggindecreases the number of acquisition of dcache_lock. However it also increases
2131e6b01fSNick Pigginthe lock hold time significantly and affects performance in large SMP machines.
2231e6b01fSNick PigginSince 2.5.62 kernel, dcache has been using a new locking model that uses RCU to
2331e6b01fSNick Pigginmake dcache look-up lock-free.
2431e6b01fSNick Piggin
2531e6b01fSNick PigginAll the above algorithms required taking a lock and reference count on the
2631e6b01fSNick Piggindentry that was looked up, so that may be used as the basis for walking the
2731e6b01fSNick Pigginnext path element. This is inefficient and unscalable. It is inefficient
2831e6b01fSNick Pigginbecause of the locks and atomic operations required for every dentry element
2931e6b01fSNick Pigginslows things down. It is not scalable because many parallel applications that
3031e6b01fSNick Pigginare path-walk intensive tend to do path lookups starting from a common dentry
3131e6b01fSNick Piggin(usually, the root "/" or current working directory). So contention on these
3231e6b01fSNick Piggincommon path elements causes lock and cacheline queueing.
3331e6b01fSNick Piggin
3431e6b01fSNick PigginSince 2.6.38, RCU is used to make a significant part of the entire path walk
3531e6b01fSNick Piggin(including dcache look-up) completely "store-free" (so, no locks, atomics, or
3631e6b01fSNick Piggineven stores into cachelines of common dentries). This is known as "rcu-walk"
3731e6b01fSNick Pigginpath walking.
3831e6b01fSNick Piggin
3931e6b01fSNick PigginPath walking overview
4031e6b01fSNick Piggin=====================
4131e6b01fSNick Piggin
4231e6b01fSNick PigginA name string specifies a start (root directory, cwd, fd-relative) and a
4331e6b01fSNick Pigginsequence of elements (directory entry names), which together refer to a path in
4431e6b01fSNick Pigginthe namespace. A path is represented as a (dentry, vfsmount) tuple. The name
4525985edcSLucas De Marchielements are sub-strings, separated by '/'.
4631e6b01fSNick Piggin
4731e6b01fSNick PigginName lookups will want to find a particular path that a name string refers to
4831e6b01fSNick Piggin(usually the final element, or parent of final element). This is done by taking
4931e6b01fSNick Pigginthe path given by the name's starting point (which we know in advance -- eg.
5031e6b01fSNick Piggincurrent->fs->cwd or current->fs->root) as the first parent of the lookup. Then
5131e6b01fSNick Pigginiteratively for each subsequent name element, look up the child of the current
5231e6b01fSNick Pigginparent with the given name and if it is not the desired entry, make it the
5331e6b01fSNick Pigginparent for the next lookup.
5431e6b01fSNick Piggin
5531e6b01fSNick PigginA parent, of course, must be a directory, and we must have appropriate
5631e6b01fSNick Pigginpermissions on the parent inode to be able to walk into it.
5731e6b01fSNick Piggin
5831e6b01fSNick PigginTurning the child into a parent for the next lookup requires more checks and
5931e6b01fSNick Pigginprocedures. Symlinks essentially substitute the symlink name for the target
6031e6b01fSNick Pigginname in the name string, and require some recursive path walking.  Mount points
6131e6b01fSNick Pigginmust be followed into (thus changing the vfsmount that subsequent path elements
6231e6b01fSNick Pigginrefer to), switching from the mount point path to the root of the particular
6331e6b01fSNick Pigginmounted vfsmount. These behaviours are variously modified depending on the
6431e6b01fSNick Pigginexact path walking flags.
6531e6b01fSNick Piggin
6631e6b01fSNick PigginPath walking then must, broadly, do several particular things:
6731e6b01fSNick Piggin- find the start point of the walk;
6831e6b01fSNick Piggin- perform permissions and validity checks on inodes;
6931e6b01fSNick Piggin- perform dcache hash name lookups on (parent, name element) tuples;
7031e6b01fSNick Piggin- traverse mount points;
7131e6b01fSNick Piggin- traverse symlinks;
7231e6b01fSNick Piggin- lookup and create missing parts of the path on demand.
7331e6b01fSNick Piggin
7431e6b01fSNick PigginSafe store-free look-up of dcache hash table
7531e6b01fSNick Piggin============================================
7631e6b01fSNick Piggin
7731e6b01fSNick PigginDcache name lookup
7831e6b01fSNick Piggin------------------
7931e6b01fSNick PigginIn order to lookup a dcache (parent, name) tuple, we take a hash on the tuple
8031e6b01fSNick Pigginand use that to select a bucket in the dcache-hash table. The list of entries
8131e6b01fSNick Pigginin that bucket is then walked, and we do a full comparison of each entry
8231e6b01fSNick Pigginagainst our (parent, name) tuple.
8331e6b01fSNick Piggin
8431e6b01fSNick PigginThe hash lists are RCU protected, so list walking is not serialised with
8531e6b01fSNick Pigginconcurrent updates (insertion, deletion from the hash). This is a standard RCU
8631e6b01fSNick Pigginlist application with the exception of renames, which will be covered below.
8731e6b01fSNick Piggin
8831e6b01fSNick PigginParent and name members of a dentry, as well as its membership in the dcache
8931e6b01fSNick Pigginhash, and its inode are protected by the per-dentry d_lock spinlock. A
9031e6b01fSNick Pigginreference is taken on the dentry (while the fields are verified under d_lock),
9131e6b01fSNick Pigginand this stabilises its d_inode pointer and actual inode. This gives a stable
9231e6b01fSNick Pigginpoint to perform the next step of our path walk against.
9331e6b01fSNick Piggin
9431e6b01fSNick PigginThese members are also protected by d_seq seqlock, although this offers
9531e6b01fSNick Pigginread-only protection and no durability of results, so care must be taken when
9631e6b01fSNick Pigginusing d_seq for synchronisation (see seqcount based lookups, below).
9731e6b01fSNick Piggin
9831e6b01fSNick PigginRenames
9931e6b01fSNick Piggin-------
10031e6b01fSNick PigginBack to the rename case. In usual RCU protected lists, the only operations that
10131e6b01fSNick Pigginwill happen to an object is insertion, and then eventually removal from the
10231e6b01fSNick Pigginlist. The object will not be reused until an RCU grace period is complete.
10331e6b01fSNick PigginThis ensures the RCU list traversal primitives can run over the object without
10431e6b01fSNick Pigginproblems (see RCU documentation for how this works).
10531e6b01fSNick Piggin
10631e6b01fSNick PigginHowever when a dentry is renamed, its hash value can change, requiring it to be
10731e6b01fSNick Pigginmoved to a new hash list. Allocating and inserting a new alias would be
10831e6b01fSNick Pigginexpensive and also problematic for directory dentries. Latency would be far to
10931e6b01fSNick Pigginhigh to wait for a grace period after removing the dentry and before inserting
11031e6b01fSNick Pigginit in the new hash bucket. So what is done is to insert the dentry into the
11131e6b01fSNick Pigginnew list immediately.
11231e6b01fSNick Piggin
11331e6b01fSNick PigginHowever, when the dentry's list pointers are updated to point to objects in the
11431e6b01fSNick Pigginnew list before waiting for a grace period, this can result in a concurrent RCU
11531e6b01fSNick Pigginlookup of the old list veering off into the new (incorrect) list and missing
11631e6b01fSNick Pigginthe remaining dentries on the list.
11731e6b01fSNick Piggin
11831e6b01fSNick PigginThere is no fundamental problem with walking down the wrong list, because the
11931e6b01fSNick Piggindentry comparisons will never match. However it is fatal to miss a matching
12031e6b01fSNick Piggindentry. So a seqlock is used to detect when a rename has occurred, and so the
12131e6b01fSNick Pigginlookup can be retried.
12231e6b01fSNick Piggin
12331e6b01fSNick Piggin         1      2      3
12431e6b01fSNick Piggin        +---+  +---+  +---+
12531e6b01fSNick Pigginhlist-->| N-+->| N-+->| N-+->
12631e6b01fSNick Pigginhead <--+-P |<-+-P |<-+-P |
12731e6b01fSNick Piggin        +---+  +---+  +---+
12831e6b01fSNick Piggin
12931e6b01fSNick PigginRename of dentry 2 may require it deleted from the above list, and inserted
13031e6b01fSNick Piggininto a new list. Deleting 2 gives the following list.
13131e6b01fSNick Piggin
13231e6b01fSNick Piggin         1             3
13331e6b01fSNick Piggin        +---+         +---+     (don't worry, the longer pointers do not
13431e6b01fSNick Pigginhlist-->| N-+-------->| N-+->    impose a measurable performance overhead
13531e6b01fSNick Pigginhead <--+-P |<--------+-P |      on modern CPUs)
13631e6b01fSNick Piggin        +---+         +---+
13731e6b01fSNick Piggin          ^      2      ^
13831e6b01fSNick Piggin          |    +---+    |
13931e6b01fSNick Piggin          |    | N-+----+
14031e6b01fSNick Piggin          +----+-P |
14131e6b01fSNick Piggin               +---+
14231e6b01fSNick Piggin
14331e6b01fSNick PigginThis is a standard RCU-list deletion, which leaves the deleted object's
14431e6b01fSNick Pigginpointers intact, so a concurrent list walker that is currently looking at
14531e6b01fSNick Pigginobject 2 will correctly continue to object 3 when it is time to traverse the
14631e6b01fSNick Pigginnext object.
14731e6b01fSNick Piggin
14831e6b01fSNick PigginHowever, when inserting object 2 onto a new list, we end up with this:
14931e6b01fSNick Piggin
15031e6b01fSNick Piggin         1             3
15131e6b01fSNick Piggin        +---+         +---+
15231e6b01fSNick Pigginhlist-->| N-+-------->| N-+->
15331e6b01fSNick Pigginhead <--+-P |<--------+-P |
15431e6b01fSNick Piggin        +---+         +---+
15531e6b01fSNick Piggin                 2
15631e6b01fSNick Piggin               +---+
15731e6b01fSNick Piggin               | N-+---->
15831e6b01fSNick Piggin          <----+-P |
15931e6b01fSNick Piggin               +---+
16031e6b01fSNick Piggin
16131e6b01fSNick PigginBecause we didn't wait for a grace period, there may be a concurrent lookup
16231e6b01fSNick Pigginstill at 2. Now when it follows 2's 'next' pointer, it will walk off into
16331e6b01fSNick Pigginanother list without ever having checked object 3.
16431e6b01fSNick Piggin
16531e6b01fSNick PigginA related, but distinctly different, issue is that of rename atomicity versus
16631e6b01fSNick Pigginlookup operations. If a file is renamed from 'A' to 'B', a lookup must only
16731e6b01fSNick Pigginfind either 'A' or 'B'. So if a lookup of 'A' returns NULL, a subsequent lookup
16831e6b01fSNick Pigginof 'B' must succeed (note the reverse is not true).
16931e6b01fSNick Piggin
17031e6b01fSNick PigginBetween deleting the dentry from the old hash list, and inserting it on the new
17131e6b01fSNick Pigginhash list, a lookup may find neither 'A' nor 'B' matching the dentry. The same
17231e6b01fSNick Pigginrename seqlock is also used to cover this race in much the same way, by
17331e6b01fSNick Pigginretrying a negative lookup result if a rename was in progress.
17431e6b01fSNick Piggin
17531e6b01fSNick PigginSeqcount based lookups
17631e6b01fSNick Piggin----------------------
17731e6b01fSNick PigginIn refcount based dcache lookups, d_lock is used to serialise access to
17831e6b01fSNick Pigginthe dentry, stabilising it while comparing its name and parent and then
17931e6b01fSNick Piggintaking a reference count (the reference count then gives a stable place to
18031e6b01fSNick Pigginstart the next part of the path walk from).
18131e6b01fSNick Piggin
18231e6b01fSNick PigginAs explained above, we would like to do path walking without taking locks or
18331e6b01fSNick Pigginreference counts on intermediate dentries along the path. To do this, a per
18431e6b01fSNick Piggindentry seqlock (d_seq) is used to take a "coherent snapshot" of what the dentry
18531e6b01fSNick Pigginlooks like (its name, parent, and inode). That snapshot is then used to start
18631e6b01fSNick Pigginthe next part of the path walk. When loading the coherent snapshot under d_seq,
18731e6b01fSNick Piggincare must be taken to load the members up-front, and use those pointers rather
18831e6b01fSNick Pigginthan reloading from the dentry later on (otherwise we'd have interesting things
18931e6b01fSNick Pigginlike d_inode going NULL underneath us, if the name was unlinked).
19031e6b01fSNick Piggin
19131e6b01fSNick PigginAlso important is to avoid performing any destructive operations (pretty much:
19231e6b01fSNick Pigginno non-atomic stores to shared data), and to recheck the seqcount when we are
19331e6b01fSNick Piggin"done" with the operation. Retry or abort if the seqcount does not match.
19431e6b01fSNick PigginAvoiding destructive or changing operations means we can easily unwind from
19531e6b01fSNick Pigginfailure.
19631e6b01fSNick Piggin
19731e6b01fSNick PigginWhat this means is that a caller, provided they are holding RCU lock to
19831e6b01fSNick Pigginprotect the dentry object from disappearing, can perform a seqcount based
19931e6b01fSNick Pigginlookup which does not increment the refcount on the dentry or write to
20031e6b01fSNick Pigginit in any way. This returned dentry can be used for subsequent operations,
20131e6b01fSNick Pigginprovided that d_seq is rechecked after that operation is complete.
20231e6b01fSNick Piggin
20331e6b01fSNick PigginInodes are also rcu freed, so the seqcount lookup dentry's inode may also be
20431e6b01fSNick Pigginqueried for permissions.
20531e6b01fSNick Piggin
20631e6b01fSNick PigginWith this two parts of the puzzle, we can do path lookups without taking
20731e6b01fSNick Pigginlocks or refcounts on dentry elements.
20831e6b01fSNick Piggin
20931e6b01fSNick PigginRCU-walk path walking design
21031e6b01fSNick Piggin============================
21131e6b01fSNick Piggin
21231e6b01fSNick PigginPath walking code now has two distinct modes, ref-walk and rcu-walk. ref-walk
21331e6b01fSNick Pigginis the traditional[*] way of performing dcache lookups using d_lock to
21431e6b01fSNick Pigginserialise concurrent modifications to the dentry and take a reference count on
21531e6b01fSNick Pigginit. ref-walk is simple and obvious, and may sleep, take locks, etc while path
21631e6b01fSNick Pigginwalking is operating on each dentry. rcu-walk uses seqcount based dentry
21731e6b01fSNick Pigginlookups, and can perform lookup of intermediate elements without any stores to
21831e6b01fSNick Pigginshared data in the dentry or inode. rcu-walk can not be applied to all cases,
21931e6b01fSNick Piggineg. if the filesystem must sleep or perform non trivial operations, rcu-walk
22031e6b01fSNick Pigginmust be switched to ref-walk mode.
22131e6b01fSNick Piggin
22231e6b01fSNick Piggin[*] RCU is still used for the dentry hash lookup in ref-walk, but not the full
22331e6b01fSNick Piggin    path walk.
22431e6b01fSNick Piggin
22531e6b01fSNick PigginWhere ref-walk uses a stable, refcounted ``parent'' to walk the remaining
22631e6b01fSNick Pigginpath string, rcu-walk uses a d_seq protected snapshot. When looking up a
22731e6b01fSNick Pigginchild of this parent snapshot, we open d_seq critical section on the child
22831e6b01fSNick Pigginbefore closing d_seq critical section on the parent. This gives an interlocking
22931e6b01fSNick Pigginladder of snapshots to walk down.
23031e6b01fSNick Piggin
23131e6b01fSNick Piggin
23231e6b01fSNick Piggin     proc 101
23331e6b01fSNick Piggin      /----------------\
23431e6b01fSNick Piggin     / comm:    "vi"    \
23531e6b01fSNick Piggin    /  fs.root: dentry0  \
23631e6b01fSNick Piggin    \  fs.cwd:  dentry2  /
23731e6b01fSNick Piggin     \                  /
23831e6b01fSNick Piggin      \----------------/
23931e6b01fSNick Piggin
24031e6b01fSNick PigginSo when vi wants to open("/home/npiggin/test.c", O_RDWR), then it will
24131e6b01fSNick Pigginstart from current->fs->root, which is a pinned dentry. Alternatively,
24231e6b01fSNick Piggin"./test.c" would start from cwd; both names refer to the same path in
24331e6b01fSNick Pigginthe context of proc101.
24431e6b01fSNick Piggin
24531e6b01fSNick Piggin     dentry 0
24631e6b01fSNick Piggin    +---------------------+   rcu-walk begins here, we note d_seq, check the
24731e6b01fSNick Piggin    | name:    "/"        |   inode's permission, and then look up the next
24831e6b01fSNick Piggin    | inode:   10         |   path element which is "home"...
24931e6b01fSNick Piggin    | children:"home", ...|
25031e6b01fSNick Piggin    +---------------------+
25131e6b01fSNick Piggin              |
25231e6b01fSNick Piggin     dentry 1 V
25331e6b01fSNick Piggin    +---------------------+   ... which brings us here. We find dentry1 via
25431e6b01fSNick Piggin    | name:    "home"     |   hash lookup, then note d_seq and compare name
25531e6b01fSNick Piggin    | inode:   678        |   string and parent pointer. When we have a match,
25631e6b01fSNick Piggin    | children:"npiggin"  |   we now recheck the d_seq of dentry0. Then we
25731e6b01fSNick Piggin    +---------------------+   check inode and look up the next element.
25831e6b01fSNick Piggin              |
25931e6b01fSNick Piggin     dentry2  V
26031e6b01fSNick Piggin    +---------------------+   Note: if dentry0 is now modified, lookup is
26131e6b01fSNick Piggin    | name:    "npiggin"  |   not necessarily invalid, so we need only keep a
26231e6b01fSNick Piggin    | inode:   543        |   parent for d_seq verification, and grandparents
26331e6b01fSNick Piggin    | children:"a.c", ... |   can be forgotten.
26431e6b01fSNick Piggin    +---------------------+
26531e6b01fSNick Piggin              |
26631e6b01fSNick Piggin     dentry3  V
26731e6b01fSNick Piggin    +---------------------+   At this point we have our destination dentry.
26831e6b01fSNick Piggin    | name:    "a.c"      |   We now take its d_lock, verify d_seq of this
26931e6b01fSNick Piggin    | inode:   14221      |   dentry. If that checks out, we can increment
27031e6b01fSNick Piggin    | children:NULL       |   its refcount because we're holding d_lock.
27131e6b01fSNick Piggin    +---------------------+
27231e6b01fSNick Piggin
27331e6b01fSNick PigginTaking a refcount on a dentry from rcu-walk mode, by taking its d_lock,
27431e6b01fSNick Pigginre-checking its d_seq, and then incrementing its refcount is called
27531e6b01fSNick Piggin"dropping rcu" or dropping from rcu-walk into ref-walk mode.
27631e6b01fSNick Piggin
27731e6b01fSNick PigginIt is, in some sense, a bit of a house of cards. If the seqcount check of the
27831e6b01fSNick Pigginparent snapshot fails, the house comes down, because we had closed the d_seq
27931e6b01fSNick Pigginsection on the grandparent, so we have nothing left to stand on. In that case,
28031e6b01fSNick Pigginthe path walk must be fully restarted (which we do in ref-walk mode, to avoid
28131e6b01fSNick Pigginlive locks). It is costly to have a full restart, but fortunately they are
28231e6b01fSNick Pigginquite rare.
28331e6b01fSNick Piggin
28431e6b01fSNick PigginWhen we reach a point where sleeping is required, or a filesystem callout
28531e6b01fSNick Pigginrequires ref-walk, then instead of restarting the walk, we attempt to drop rcu
28631e6b01fSNick Pigginat the last known good dentry we have. Avoiding a full restart in ref-walk in
28731e6b01fSNick Pigginthese cases is fundamental for performance and scalability because blocking
28831e6b01fSNick Pigginoperations such as creates and unlinks are not uncommon.
28931e6b01fSNick Piggin
29031e6b01fSNick PigginThe detailed design for rcu-walk is like this:
29131e6b01fSNick Piggin* LOOKUP_RCU is set in nd->flags, which distinguishes rcu-walk from ref-walk.
29231e6b01fSNick Piggin* Take the RCU lock for the entire path walk, starting with the acquiring
29331e6b01fSNick Piggin  of the starting path (eg. root/cwd/fd-path). So now dentry refcounts are
29431e6b01fSNick Piggin  not required for dentry persistence.
29531e6b01fSNick Piggin* synchronize_rcu is called when unregistering a filesystem, so we can
29631e6b01fSNick Piggin  access d_ops and i_ops during rcu-walk.
29731e6b01fSNick Piggin* Similarly take the vfsmount lock for the entire path walk. So now mnt
29831e6b01fSNick Piggin  refcounts are not required for persistence. Also we are free to perform mount
29931e6b01fSNick Piggin  lookups, and to assume dentry mount points and mount roots are stable up and
30031e6b01fSNick Piggin  down the path.
30131e6b01fSNick Piggin* Have a per-dentry seqlock to protect the dentry name, parent, and inode,
30231e6b01fSNick Piggin  so we can load this tuple atomically, and also check whether any of its
30331e6b01fSNick Piggin  members have changed.
30431e6b01fSNick Piggin* Dentry lookups (based on parent, candidate string tuple) recheck the parent
30531e6b01fSNick Piggin  sequence after the child is found in case anything changed in the parent
30631e6b01fSNick Piggin  during the path walk.
30731e6b01fSNick Piggin* inode is also RCU protected so we can load d_inode and use the inode for
30831e6b01fSNick Piggin  limited things.
30931e6b01fSNick Piggin* i_mode, i_uid, i_gid can be tested for exec permissions during path walk.
31031e6b01fSNick Piggin* i_op can be loaded.
31131e6b01fSNick Piggin* When the destination dentry is reached, drop rcu there (ie. take d_lock,
31231e6b01fSNick Piggin  verify d_seq, increment refcount).
31331e6b01fSNick Piggin* If seqlock verification fails anywhere along the path, do a full restart
31431e6b01fSNick Piggin  of the path lookup in ref-walk mode. -ECHILD tends to be used (for want of
31531e6b01fSNick Piggin  a better errno) to signal an rcu-walk failure.
31631e6b01fSNick Piggin
31731e6b01fSNick PigginThe cases where rcu-walk cannot continue are:
31831e6b01fSNick Piggin* NULL dentry (ie. any uncached path element)
31931e6b01fSNick Piggin* Following links
32031e6b01fSNick Piggin
321b74c79e9SNick PigginIt may be possible eventually to make following links rcu-walk aware.
32231e6b01fSNick Piggin
32331e6b01fSNick PigginUncached path elements will always require dropping to ref-walk mode, at the
32431e6b01fSNick Pigginvery least because i_mutex needs to be grabbed, and objects allocated.
32531e6b01fSNick Piggin
32631e6b01fSNick PigginFinal note:
32731e6b01fSNick Piggin"store-free" path walking is not strictly store free. We take vfsmount lock
32831e6b01fSNick Pigginand refcounts (both of which can be made per-cpu), and we also store to the
32931e6b01fSNick Pigginstack (which is essentially CPU-local), and we also have to take locks and
33031e6b01fSNick Pigginrefcount on final dentry.
33131e6b01fSNick Piggin
33231e6b01fSNick PigginThe point is that shared data, where practically possible, is not locked
33331e6b01fSNick Pigginor stored into. The result is massive improvements in performance and
33431e6b01fSNick Pigginscalability of path resolution.
33531e6b01fSNick Piggin
33631e6b01fSNick Piggin
337b74c79e9SNick PigginInteresting statistics
338b74c79e9SNick Piggin======================
339b74c79e9SNick Piggin
340b74c79e9SNick PigginThe following table gives rcu lookup statistics for a few simple workloads
341b74c79e9SNick Piggin(2s12c24t Westmere, debian non-graphical system). Ungraceful are attempts to
342b74c79e9SNick Piggindrop rcu that fail due to d_seq failure and requiring the entire path lookup
343b74c79e9SNick Pigginagain. Other cases are successful rcu-drops that are required before the final
344b74c79e9SNick Pigginelement, nodentry for missing dentry, revalidate for filesystem revalidate
345b74c79e9SNick Pigginroutine requiring rcu drop, permission for permission check requiring drop,
346b74c79e9SNick Pigginand link for symlink traversal requiring drop.
347b74c79e9SNick Piggin
348b74c79e9SNick Piggin     rcu-lookups     restart  nodentry          link  revalidate  permission
349b74c79e9SNick Pigginbootup     47121           0      4624          1010       10283        7852
350b74c79e9SNick Piggindbench  25386793           0   6778659(26.7%)     55         549        1156
351b74c79e9SNick Pigginkbuild   2696672          10     64442(2.3%)  108764(4.0%)     1        1590
352b74c79e9SNick Piggingit diff   39605           0        28             2           0         106
353b74c79e9SNick Pigginvfstest 24185492        4945    708725(2.9%) 1076136(4.4%)     0        2651
354b74c79e9SNick Piggin
355b74c79e9SNick PigginWhat this shows is that failed rcu-walk lookups, ie. ones that are restarted
356b74c79e9SNick Pigginentirely with ref-walk, are quite rare. Even the "vfstest" case which
35725985edcSLucas De Marchispecifically has concurrent renames/mkdir/rmdir/ creat/unlink/etc to exercise
358b74c79e9SNick Pigginsuch races is not showing a huge amount of restarts.
359b74c79e9SNick Piggin
360b74c79e9SNick PigginDropping from rcu-walk to ref-walk mean that we have encountered a dentry where
361b74c79e9SNick Pigginthe reference count needs to be taken for some reason. This is either because
362b74c79e9SNick Pigginwe have reached the target of the path walk, or because we have encountered a
363b74c79e9SNick Piggincondition that can't be resolved in rcu-walk mode.  Ideally, we drop rcu-walk
364b74c79e9SNick Pigginonly when we have reached the target dentry, so the other statistics show where
365b74c79e9SNick Pigginthis does not happen.
366b74c79e9SNick Piggin
367b74c79e9SNick PigginNote that a graceful drop from rcu-walk mode due to something such as the
368b74c79e9SNick Piggindentry not existing (which can be common) is not necessarily a failure of
369b74c79e9SNick Pigginrcu-walk scheme, because some elements of the path may have been walked in
370b74c79e9SNick Pigginrcu-walk mode. The further we get from common path elements (such as cwd or
371b74c79e9SNick Pigginroot), the less contended the dentry is likely to be. The closer we are to
372b74c79e9SNick Piggincommon path elements, the more likely they will exist in dentry cache.
373b74c79e9SNick Piggin
374b74c79e9SNick Piggin
37531e6b01fSNick PigginPapers and other documentation on dcache locking
37631e6b01fSNick Piggin================================================
37731e6b01fSNick Piggin
378*93431e06SAlexander A. Klimov1. Scaling dcache with RCU (https://linuxjournal.com/article.php?sid=7124).
37931e6b01fSNick Piggin
38031e6b01fSNick Piggin2. http://lse.sourceforge.net/locking/dcache/dcache.html
381b74c79e9SNick Piggin
3823ce96239SNeil Brown3. path-lookup.md in this directory.
383