# $FreeBSD: src/sys/ufs/ffs/README,v 1.4 1999/12/03 00:34:26 billf Exp $
# $DragonFly: src/sys/vfs/ufs/README,v 1.4 2004/07/18 19:43:48 drhodus Exp $

Introduction

This package constitutes the alpha distribution of the soft updates
code for the fast filesystem.

For more information on what Soft Updates is, see:
http://www.ece.cmu.edu/~ganger/papers/CSE-TR-254-95/

Status

My `filesystem torture tests' (described below) run for days without
a hitch (no panics, hangs, filesystem corruption, or memory leaks).
However, I have had several panics reported to me by folks that
are field testing the code which I have not yet been able to
reproduce or fix. Although these panics are rare and do not cause
filesystem corruption, the code should only be put into production
on systems where the system administrator is aware that it is being
run, and knows how to turn it off if problems arise. Thus, you may
hand out this code to others, but please ensure that this status
message is included with any distributions. Please also include
the file ffs_softdep.stub.c in any distributions so that folks that
cannot abide by the need to redistribute source will not be left
with a kernel that will not link. It resolves all the calls
into the soft update code and simply ignores the request to enable
them. Thus you will be able to ensure that your other hooks have
not broken anything and that your kernel is softdep-ready for those
that wish to use them. Please report problems back to me, with
kernel backtraces of panics if possible. This is massively complex
code, and people only have to have their filesystems hosed once or
twice to avoid future changes like the plague. I want to find and
fix as many bugs as soon as possible so as to get the code rock
solid before it gets widely released. Please report any bugs that
you uncover to mckusick@mckusick.com.

Performance

Running the Andrew Benchmarks yields the following raw data:

	Phase	Normal	Softdep	    What it does
	  1	  3s	  <1s	    Creating directories
	  2	  8s	   4s	    Copying files
	  3	  6s	   6s	    Recursive directory stats
	  4	  8s	   9s	    Scanning each file
	  5	 25s	  25s	    Compilation

	Normal:  19.9u 29.2s 0:52.8 135+630io
	Softdep: 20.3u 28.5s 0:47.8 103+363io

Another interesting data point is my `filesystem torture tests'.
They consist of 1000 runs of the Andrew benchmarks, 1000 copies and
removals of /etc (with randomly selected pauses of 0-60 seconds
between each copy and remove), and 500 finds from / (with randomly
selected pauses of 100 seconds between each run). The run of the
torture test compares as follows:

With soft updates: writes: 6 sync, 1,113,686 async; run time 19hr, 50min
Normal filesystem: writes: 1,459,147 sync, 487,031 async; run time 27hr, 15min

The upshot is 42% less I/O and 28% shorter running time.
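
The percentages can be cross-checked from the raw figures. The
following Python calculation is only a sanity check; the variable
names are mine and not part of the distribution. It comes out at
just under 43% fewer writes and a little over 27% less run time,
close to the rounded figures quoted above.

```python
# Write counts and run times quoted in the torture-test comparison.
softdep_writes = 6 + 1_113_686         # sync + async
normal_writes = 1_459_147 + 487_031    # sync + async

softdep_minutes = 19 * 60 + 50         # 19hr, 50min
normal_minutes = 27 * 60 + 15          # 27hr, 15min

# Fraction saved relative to the normal filesystem.
io_reduction = 1 - softdep_writes / normal_writes
time_reduction = 1 - softdep_minutes / normal_minutes

print(f"I/O reduction:  {io_reduction:.1%}")
print(f"time reduction: {time_reduction:.1%}")
```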

Another interesting test point is a full MAKEDEV. Because it runs
as a shell script, it becomes mostly limited by the execution speed
of the machine on which it runs. Here are the numbers:

With soft updates:

	labrat# time ./MAKEDEV std
	2.2u 32.6s 0:34.82 100.0% 0+0k 11+36io 0pf+0w

	labrat# ls | wc
	     522     522    3317

Without soft updates:

	labrat# time ./MAKEDEV std
	2.0u 40.5s 0:42.53 100.0% 0+0k 11+1221io 0pf+0w

	labrat# ls | wc
	     522     522    3317

Of course, some of the system time is being pushed to the syncer
process, but that is a different story.

To show a benchmark designed to highlight the soft update code,
consider a tar of zero-sized files and an rm -rf of a directory tree
that has at least 50 files or so at each level. Running a test with
a directory tree containing 28 directories holding 202 empty files
produces the following numbers:

With soft updates:
tar: 0.0u 0.5s 0:00.65 76.9% 0+0k 0+44io 0pf+0w (0 sync, 33 async writes)
rm:  0.0u 0.2s 0:00.20 100.0% 0+0k 0+37io 0pf+0w (0 sync, 72 async writes)

Normal filesystem:
tar: 0.0u 1.1s 0:07.27 16.5% 0+0k 60+586io 0pf+0w (523 sync, 0 async writes)
rm:  0.0u 0.5s 0:01.84 29.3% 0+0k 0+318io 0pf+0w (258 sync, 65 async writes)

The large reduction in writes is because inodes are clustered, so
most of a block gets allocated, then the whole block is written
out once rather than having the same block written once for each
inode allocated from it. Similarly, each directory block is written
once rather than once for each new directory entry. Effectively,
what the soft update code is doing is allocating a bunch of inodes
and directory entries without writing anything, then ensuring that
the block containing the inodes is written first, followed by the
directory block that references them. If there were data in the
files, it would further ensure that the data blocks were written
before their inodes claimed them.
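
The batching and ordering described above can be illustrated with
a toy model. This is only a sketch: the block names and the update
list are invented for illustration, and a topological sort from
Python's graphlib (3.9 or later) stands in for the dependency
tracking that the real soft update code performs in the kernel.

```python
from graphlib import TopologicalSorter

# Hypothetical workload: four new files whose inodes share one inode
# block and whose names share one directory block.
updates = []
for i in range(4):
    updates.append(("inode block", f"initialize inode {i}"))
    updates.append(("directory block", f"add entry file{i}"))

# Writing synchronously costs one write per metadata update.
sync_writes = len(updates)

# Deferring the writes lets each dirty block go to disk once.
dirty_blocks = {block for block, _ in updates}
batched_writes = len(dirty_blocks)

# Each key must be written after the blocks in its set: the directory
# block may not reach disk before the inodes that it names.
must_follow = {"directory block": {"inode block"}}
write_order = list(TopologicalSorter(must_follow).static_order())

print(sync_writes, batched_writes, write_order)
```

In this toy case eight synchronous writes collapse into two, with
the inode block ordered ahead of the directory block.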

Copyright Restrictions

Please familiarize yourself with the copyright restrictions
contained at the top of either the sys/ufs/ffs/softdep.h or
sys/ufs/ffs/ffs_softdep.c file. The key provision is similar
to the one used by the DB 2.0 package and goes as follows:

    Redistributions in any form must be accompanied by information
    on how to obtain complete source code for any accompanying
    software that uses this software. This source code must
    either be included in the distribution or be available for
    no more than the cost of distribution plus a nominal fee,
    and must be freely redistributable under reasonable
    conditions. For an executable file, complete source code
    means the source code for all modules it contains. It does
    not mean source code for modules or files that typically
    accompany the operating system on which the executable file
    runs, e.g., standard library modules or system header files.

The idea is to allow those of you freely redistributing your source
to use it while retaining for myself the right to peddle it for
money to the commercial UNIX vendors. Note that I have included a
stub file ffs_softdep.c.stub that is freely redistributable so that
you can put in all the necessary hooks to run the full soft updates
code, but still allow vendors that want to maintain proprietary
source to have a working system. I do plan to release the code with
a `Berkeley style' copyright once I have peddled it around to the
commercial vendors. If you have concerns about this copyright,
feel free to contact me with them and we can try to resolve any
difficulties.

Soft Dependency Operation

The soft update implementation does NOT require ANY changes
to the on-disk format of your filesystems. Furthermore, it is
not used by default for any filesystems. It must be enabled on
a filesystem-by-filesystem basis by running tunefs to set a
bit in the superblock indicating that the filesystem should be
managed using soft updates. If you wish to stop using
soft updates for performance or reliability reasons,
you can simply run tunefs on the filesystem again to turn off
the bit and revert to normal operation. The additional dynamic
memory load placed on the kernel malloc arena is approximately
equal to the amount of memory used by vnodes plus inodes (for a
system with 1000 vnodes, the additional peak memory load is about
300K).

Kernel Changes

There are two changes to the kernel functionality that are not
contained in the soft update files. The first is a `trickle
sync' facility running in the kernel as process 3. This trickle
sync process replaces the traditional `update' program (which should
be commented out of the /etc/rc startup script). When a vnode is
first written it is placed 30 seconds down on the trickle sync
queue. If it still exists and has dirty data when it reaches the
top of the queue, it is synced. This approach evens out the load
on the underlying I/O system and avoids writing short-lived files.
The papers on trickle-sync tend to favor aging based on buffers
rather than files. However, I sync on file age rather than buffer
age because the data structures are much smaller, as there are
typically far fewer files than buffers. Although this can make the
I/O spiky when a big file times out, it is still much better than
the wholesale syncs that were happening before. It also adapts
much better to the soft update code, where I want to control
aging to improve performance (inodes age in 10 seconds, directories
in 15 seconds, files in 30 seconds). This ensures that most
dependencies are gone (e.g., inodes are written when directory
entries want to go to disk), reducing the amount of rollback that
is needed.
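
The aging queue can be pictured as a wheel of one-second slots.
The sketch below is a user-space toy, not the kernel code: the
class name, its methods, and the vnode labels are all invented
for illustration, and only the 15 and 30 second delays come from
the description above.

```python
class TrickleSyncer:
    """Toy delay wheel: one slot per second of remaining age."""

    def __init__(self, max_delay=30):
        self.slots = [set() for _ in range(max_delay + 1)]
        self.now = 0  # index of the slot most recently processed

    def schedule(self, vnode, delay):
        """Place a newly dirtied vnode `delay` seconds down the queue."""
        slot = (self.now + delay) % len(self.slots)
        self.slots[slot].add(vnode)

    def tick(self, is_dirty):
        """Advance one second; return the vnodes synced during it."""
        self.now = (self.now + 1) % len(self.slots)
        due, self.slots[self.now] = self.slots[self.now], set()
        # Vnodes that vanished or were cleaned in the meantime cost nothing.
        return sorted(v for v in due if is_dirty(v))


syncer = TrickleSyncer()
dirty = {"dir", "file", "tmpfile"}
syncer.schedule("dir", 15)       # directories age in 15 seconds
syncer.schedule("file", 30)      # files age in 30 seconds
syncer.schedule("tmpfile", 30)
dirty.discard("tmpfile")         # short-lived file removed before its turn

synced_at = {}
for second in range(1, 31):
    for vnode in syncer.tick(lambda v: v in dirty):
        synced_at[vnode] = second

print(synced_at)
```

The short-lived file never reaches the disk, and the remaining
writes are spread over time instead of arriving in one burst.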

The other main kernel change is to split the vnode freelist into
two separate lists: one for vnodes that are still being used to
identify buffers and the other for vnodes no longer identifying
any buffers. The latter list is used by getnewvnode in preference
to the former.
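
A minimal sketch of that preference, in Python rather than kernel
C. Only the name getnewvnode comes from the text; the class, the
release method, and the behavior when both lists are empty are
assumptions made for illustration.

```python
from collections import deque


class VnodeFreelists:
    """Toy model of the split freelist."""

    def __init__(self):
        self.without_buffers = deque()  # preferred for reuse
        self.with_buffers = deque()     # still identify cached buffers

    def release(self, vnode, has_buffers):
        """Put an unused vnode on the list matching its buffer state."""
        target = self.with_buffers if has_buffers else self.without_buffers
        target.append(vnode)

    def getnewvnode(self):
        """Reuse a bufferless vnode first, then one that still has buffers."""
        for freelist in (self.without_buffers, self.with_buffers):
            if freelist:
                return freelist.popleft()
        return None  # assumption: the caller would allocate a fresh vnode


lists = VnodeFreelists()
lists.release("a", has_buffers=True)
lists.release("b", has_buffers=False)
lists.release("c", has_buffers=True)
reused = [lists.getnewvnode() for _ in range(4)]
print(reused)
```

Taking from the bufferless list first leaves the buffers of
recently used files in the cache for as long as possible.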

Packaging of Kernel Changes

The sys subdirectory contains the changes and additions to the
kernel. My goal in writing this code was to minimize the changes
that need to be made to the kernel. Thus, most of the new code
is contained in the two new files softdep.h and ffs_softdep.c.
The rest of the kernel changes are simply inserting hooks to
call into these two new files. Although there has been some
structural reorganization of the filesystem code to accommodate
gathering the information required by the soft update code,
the actual ordering of filesystem operations when soft updates
are disabled is unchanged.

The kernel changes are packaged as a set of diffs. As I am
doing my development in BSD/OS, the diffs are relative to the
BSD/OS versions of the files. Because BSD/OS recently had
4.4BSD-Lite2 merged into it, the Lite2 files are a good starting
point for figuring out the changes. There are 40 files that
require change plus the two new files. Most of these files have
only a few lines of changes in them. However, four files have
fairly extensive changes: kern/vfs_subr.c, vfs/ufs/ufs_lookup.c,
vfs/ufs/ufs_vnops.c, and vfs/ffs/ffs_alloc.c. For these four
files, I have provided the original Lite2 version, the Lite2
version with the diffs merged in, and the diffs between the
BSD/OS and merged version. Even so, I expect that there will
be some difficulty in doing the merge; I am certainly willing
to assist in getting the code merged into your system.

Packaging of Utility Changes

The utilities subdirectory contains the changes and additions
to the utilities. There are diffs to three utilities enclosed:

    tunefs - add a flag to enable and disable soft updates

    mount - print out whether soft updates are enabled and
            also statistics on number of sync and async writes

    fsck - tighter checks on acceptable errors and a slightly
           different policy for what to put in lost+found on
           filesystems using soft updates

In addition, you should recompile vmstat so as to get reports
on the 13 new memory types used by the soft update code.
It is not necessary to use the new version of fsck; however, it
would aid in my debugging if you do. Also, because of the time
lag between deleting a directory entry and the inode it
references, you will find a lot more files showing up in your
lost+found if you do not use the new version. Note that the
new version checks for the soft update flag in the superblock
and only uses the new algorithms if it is set. So, it will run
unchanged on the filesystems that are not using soft updates.

Operation

Once you have booted a kernel that incorporates the soft update
code and installed the updated utilities, do the following:

1) Comment out the update program in /etc/rc.

2) Run `tunefs -n enable' on one or more test filesystems.

3) Mount these filesystems and then type `mount' to ensure that
   they have been enabled for soft updates.

4) Copy the test directory to a softdep filesystem, chdir into
   it and run `./doit'. You may want to check out each of the
   three subtests individually first: doit1 - andrew benchmarks,
   doit2 - copy and removal of /etc, doit3 - find from /.

====
Additional notes from Feb 13

When removing huge directories of files, it is possible to get
the incore state arbitrarily far ahead of the disk. Maintaining
all the associated dependency information can exhaust the kernel
malloc arena. To avoid this scenario, I have put some limits on
the soft update code so that it will not be allowed to rampage
through all of the kernel memory. I enclose below the relevant
patches to vnode.h and vfs_subr.c (which allow the soft update
code to speed up the filesystem syncer process). I have also
included the diffs for ffs_softdep.c. I hope to make a pass over
ffs_softdep.c to isolate the differences with my standard version
so that these diffs are less painful to incorporate.

Since I know you like to play with tuning, I have put the relevant
knobs on sysctl debug variables. The tuning knobs can be viewed
with `sysctl debug' and set with `sysctl -w debug.<name>=value'.
The knobs are as follows:

        debug.max_softdeps - limit on any given resource
        debug.tickdelay - ticks to delay before allocating
        debug.max_limit_hit - number of times tickdelay imposed
        debug.rush_requests - number of rush requests to filesystem syncer

The max_softdeps limit is derived from vnodesdesired which in
turn is sized based on the amount of memory on the machine.
When the limit is hit, a process requesting a resource first
tries to speed up the filesystem syncer process. Such a
request is recorded as a rush_request. After syncdelay / 2
unserviced rush requests (typically 15) are in the filesystem
syncer's queue (i.e., it is more than 15 seconds behind in its
work), the process requesting the memory is put to sleep for
tickdelay ticks. Such a delay is recorded in max_limit_hit.
Following this delay it is granted its memory without further
delay. I have tried the following experiments in which I
delete an MH directory containing 16,703 files:

Run #                   1               2               3

max_softdeps         4496            4496            4496
tickdelay        100 == 1 sec   20 == 0.2 sec   2 == 0.02 sec
max_limit_hit    16 == 16 sec   27 == 5.4 sec   203 == 4.1 sec
rush_requests         147             102              93
run time             57 sec          46 sec          45 sec
I/Os                  781             859             936
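
The throttling rule described before the table can be sketched as
follows. This is a simplified model for illustration, not the code
in ffs_softdep.c: the class, the request method, and its arguments
are hypothetical, and syncdelay is assumed to be 30 so that the
threshold matches the "typically 15" quoted above.

```python
from dataclasses import dataclass


@dataclass
class SoftdepThrottle:
    max_softdeps: int
    tickdelay: int            # ticks to sleep once the syncer is saturated
    syncdelay: int = 30       # assumption: gives the "typically 15" threshold
    rush_requests: int = 0    # counter, as in debug.rush_requests
    max_limit_hit: int = 0    # counter, as in debug.max_limit_hit

    def request(self, in_use, unserviced_rushes):
        """Return how many ticks the caller sleeps before being granted."""
        if in_use < self.max_softdeps:
            return 0                      # under the limit: no throttling
        if unserviced_rushes < self.syncdelay // 2:
            self.rush_requests += 1       # ask the syncer to hurry instead
            return 0
        self.max_limit_hit += 1           # syncer too far behind: sleep
        return self.tickdelay


throttle = SoftdepThrottle(max_softdeps=4496, tickdelay=20)
under_limit = throttle.request(in_use=100, unserviced_rushes=0)
rushed = throttle.request(in_use=4496, unserviced_rushes=3)
delayed = throttle.request(in_use=4496, unserviced_rushes=15)
print(under_limit, rushed, delayed, throttle)
```

In every case the request is eventually granted; the only question
is whether the caller first nudges the syncer or sleeps.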

When run with no limits, the deletion completes in 40 seconds. So,
the time spent in delay is directly added to the bottom line.
Shortening the tick delay does cut down the total running time,
but at the expense of generating more total I/O operations
due to the rush orders being sent to the filesystem syncer.
Although the number of rush orders decreases with a shorter
tick delay, there are more requests in each order, hence the
increase in I/O count. Also, although the I/O count does rise
with a shorter delay, it is still at least an order of magnitude
less than without soft updates. Anyway, you may want to play
around with these values to see what works best and to see if
you can get an insight into how best to tune them. If you get
out-of-memory panics, then you have max_softdeps set too high.
The max_limit_hit and rush_requests counters should be reset to
zero before each run. The minimum legal value for tickdelay is 2
(if you set it below that, the code will use 2).
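
The figures in the table can be cross-checked with a little
arithmetic. The sketch below assumes 100 ticks per second, which
is what the "100 == 1 sec" entry implies; the helper name and the
variable names are mine.

```python
HZ = 100        # assumption: ticks per second, implied by "100 == 1 sec"
BASELINE = 40   # seconds, the run with no limits

# (tickdelay in ticks, max_limit_hit, reported run time in seconds)
runs = [
    (100, 16, 57),
    (20, 27, 46),
    (2, 203, 45),
]


def effective_tickdelay(value, minimum=2):
    """Values below the minimum legal tickdelay are treated as 2."""
    return max(value, minimum)


delays = []
for tickdelay, hits, reported in runs:
    delay = hits * effective_tickdelay(tickdelay) / HZ
    delays.append(delay)
    print(f"tickdelay={tickdelay:3d}: slept {delay:5.2f}s, "
          f"baseline + sleep = {BASELINE + delay:5.2f}s, "
          f"reported {reported}s")
```

Each computed total lands within a second of the reported run
time, which is consistent with the delay being added directly to
the bottom line.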
322