# $FreeBSD: src/sys/ufs/ffs/README,v 1.4 1999/12/03 00:34:26 billf Exp $
# $DragonFly: src/sys/vfs/ufs/README,v 1.3 2003/08/07 21:17:44 dillon Exp $

Introduction

This package constitutes the alpha distribution of the soft update
code updates for the fast filesystem.

For more information on what Soft Updates is, see:
http://www.ece.cmu.edu/~ganger/papers/CSE-TR-254-95/

Status

My `filesystem torture tests' (described below) run for days without
a hitch (no panics, hangs, filesystem corruption, or memory leaks).
However, I have had several panics reported to me by folks who
are field testing the code which I have not yet been able to
reproduce or fix. Although these panics are rare and do not cause
filesystem corruption, the code should only be put into production
on systems where the system administrator is aware that it is being
run, and knows how to turn it off if problems arise. Thus, you may
hand out this code to others, but please ensure that this status
message is included with any distributions. Please also include
the file ffs_softdep.stub.c in any distributions so that folks who
cannot abide by the need to redistribute source will not be left
with a kernel that will not link. It will resolve all the calls
into the soft update code and simply ignore the request to enable
them. Thus you will be able to ensure that your other hooks have
not broken anything and that your kernel is softdep-ready for those
who wish to use them. Please report problems back to me with
kernel backtraces of panics if possible. This is massively complex
code, and people only have to have their filesystems hosed once or
twice to avoid future changes like the plague. I want to find and
fix as many bugs as soon as possible so as to get the code rock
solid before it gets widely released. Please report any bugs that
you uncover to mckusick@mckusick.com.

Performance

Running the Andrew Benchmarks yields the following raw data:

    Phase   Normal   Softdep   What it does
      1       3s       <1s     Creating directories
      2       8s        4s     Copying files
      3       6s        6s     Recursive directory stats
      4       8s        9s     Scanning each file
      5      25s       25s     Compilation

    Normal:  19.9u 29.2s 0:52.8 135+630io
    Softdep: 20.3u 28.5s 0:47.8 103+363io

Another interesting datapoint is my `filesystem torture tests'.
They consist of 1000 runs of the Andrew benchmarks, 1000 copy and
removes of /etc with randomly selected pauses of 0-60 seconds
between each copy and remove, and 500 finds from / with randomly
selected pauses of 100 seconds between each run. The run of the
torture test compares as follows:

With soft updates: writes: 6 sync, 1,113,686 async; run time 19hr, 50min
Normal filesystem: writes: 1,459,147 sync, 487,031 async; run time 27hr, 15min

The upshot is 42% less I/O and 28% shorter running time.

Another interesting test point is a full MAKEDEV. Because it runs
as a shell script, it becomes mostly limited by the execution speed
of the machine on which it runs. Here are the numbers:

With soft updates:

    labrat# time ./MAKEDEV std
    2.2u 32.6s 0:34.82 100.0% 0+0k 11+36io 0pf+0w

    labrat# ls | wc
         522     522    3317

Without soft updates:

    labrat# time ./MAKEDEV std
    2.0u 40.5s 0:42.53 100.0% 0+0k 11+1221io 0pf+0w

    labrat# ls | wc
         522     522    3317

Of course, some of the system time is being pushed
to the syncer process, but that is a different story.

To show a benchmark designed to highlight the soft update code,
consider a tar of zero-sized files and an rm -rf of a directory tree
that has at least 50 files or so at each level.
Running a test with a directory tree containing 28 directories
holding 202 empty files produces the following numbers:

With soft updates:
tar: 0.0u 0.5s 0:00.65 76.9% 0+0k 0+44io 0pf+0w (0 sync, 33 async writes)
rm: 0.0u 0.2s 0:00.20 100.0% 0+0k 0+37io 0pf+0w (0 sync, 72 async writes)

Normal filesystem:
tar: 0.0u 1.1s 0:07.27 16.5% 0+0k 60+586io 0pf+0w (523 sync, 0 async writes)
rm: 0.0u 0.5s 0:01.84 29.3% 0+0k 0+318io 0pf+0w (258 sync, 65 async writes)

The large reduction in writes is because inodes are clustered, so
most of a block gets allocated, then the whole block is written
out once rather than having the same block written once for each
inode allocated from it. Similarly, each directory block is written
once rather than once for each new directory entry. Effectively,
what the update code is doing is allocating a bunch of inodes
and directory entries without writing anything, then ensuring that
the block containing the inodes is written first, followed by the
directory block that references them. If there were data in the
files, it would further ensure that the data blocks were written
before their inodes claimed them.

Copyright Restrictions

Please familiarize yourself with the copyright restrictions
contained at the top of either the sys/ufs/ffs/softdep.h or
sys/ufs/ffs/ffs_softdep.c file. The key provision is similar
to the one used by the DB 2.0 package and goes as follows:

    Redistributions in any form must be accompanied by information
    on how to obtain complete source code for any accompanying
    software that uses this software. This source code must
    either be included in the distribution or be available for
    no more than the cost of distribution plus a nominal fee,
    and must be freely redistributable under reasonable
    conditions. For an executable file, complete source code
    means the source code for all modules it contains.
    It does not mean source code for modules or files that
    typically accompany the operating system on which the
    executable file runs, e.g., standard library modules or
    system header files.

The idea is to allow those of you freely redistributing your source
to use it while retaining for myself the right to peddle it for
money to the commercial UNIX vendors. Note that I have included a
stub file ffs_softdep.c.stub that is freely redistributable so that
you can put in all the necessary hooks to run the full soft updates
code, but still allow vendors that want to maintain proprietary
source to have a working system. I do plan to release the code with
a `Berkeley style' copyright once I have peddled it around to the
commercial vendors. If you have concerns about this copyright,
feel free to contact me with them and we can try to resolve any
difficulties.

Soft Dependency Operation

The soft update implementation does NOT require ANY changes
to the on-disk format of your filesystems. Furthermore, it is
not used by default for any filesystems. It must be enabled on
a filesystem-by-filesystem basis by running tunefs to set a
bit in the superblock indicating that the filesystem should be
managed using soft updates. If you wish to stop using
soft updates for performance or reliability reasons,
you can simply run tunefs on the filesystem again to turn off
the bit and revert to normal operation. The additional dynamic
memory load placed on the kernel malloc arena is approximately
equal to the amount of memory used by vnodes plus inodes (for a
system with 1000 vnodes, the additional peak memory load is about
300K).

Kernel Changes

There are two changes to the kernel functionality that are not
contained in the soft update files. The first is a `trickle
sync' facility running in the kernel as process 3.
This trickle sync process replaces the traditional `update'
program (which should be commented out of the /etc/rc startup
script). When a vnode is first written, it is placed 30 seconds
down on the trickle sync queue. If it still exists and has dirty
data when it reaches the top of the queue, it is synced. This
approach evens out the load on the underlying I/O system and
avoids writing short-lived files.
The papers on trickle-sync tend to favor aging based on buffers
rather than files. However, I sync on file age rather than buffer
age because the data structures are much smaller, as there are
typically far fewer files than buffers. Although this can make the
I/O spiky when a big file times out, it is still much better than
the wholesale syncs that were happening before. It also adapts
much better to the soft update code, where I want to control
aging to improve performance (inodes age in 10 seconds, directories
in 15 seconds, files in 30 seconds). This ensures that most
dependencies are gone (e.g., inodes are written when directory
entries want to go to disk), reducing the amount of rollback that
is needed.

The other main kernel change is to split the vnode freelist into
two separate lists: one for vnodes that are still being used to
identify buffers and the other for those vnodes no longer identifying
any buffers. The latter list is used by getnewvnode in preference
to the former.

Packaging of Kernel Changes

The sys subdirectory contains the changes and additions to the
kernel. My goal in writing this code was to minimize the changes
that need to be made to the kernel. Thus, most of the new code
is contained in the two new files softdep.h and ffs_softdep.c.
The rest of the kernel changes are simply inserting hooks to
call into these two new files.
Although there has been some structural reorganization of the
filesystem code to accommodate gathering the information required
by the soft update code, the actual ordering of filesystem
operations when soft updates are disabled is unchanged.

The kernel changes are packaged as a set of diffs. As I am
doing my development in BSD/OS, the diffs are relative to the
BSD/OS versions of the files. Because BSD/OS recently had
4.4BSD-Lite2 merged into it, the Lite2 files are a good starting
point for figuring out the changes. There are 40 files that
require change plus the two new files. Most of these files have
only a few lines of changes in them. However, four files have
fairly extensive changes: kern/vfs_subr.c, vfs/ufs/ufs_lookup.c,
vfs/ufs/ufs_vnops.c, and vfs/ffs/ffs_alloc.c. For these four
files, I have provided the original Lite2 version, the Lite2
version with the diffs merged in, and the diffs between the
BSD/OS and merged versions. Even so, I expect that there will
be some difficulty in doing the merge; I am certainly willing
to assist in getting the code merged into your system.

Packaging of Utility Changes

The utilities subdirectory contains the changes and additions
to the utilities. There are diffs to three utilities enclosed:

    tunefs - add a flag to enable and disable soft updates

    mount - print out whether soft updates are enabled and
	    also statistics on number of sync and async writes

    fsck - tighter checks on acceptable errors and a slightly
	   different policy for what to put in lost+found on
	   filesystems using soft updates

In addition, you should recompile vmstat so as to get reports
on the 13 new memory types used by the soft update code.
It is not necessary to use the new version of fsck; however, it
would aid in my debugging if you do.
Also, because of the time lag between deleting a directory entry
and the inode it references, you will find a lot more files showing
up in your lost+found if you do not use the new version. Note that
the new version checks for the soft update flag in the superblock
and only uses the new algorithms if it is set. So, it will run
unchanged on the filesystems that are not using soft updates.

Operation

Once you have booted a kernel that incorporates the soft update
code and installed the updated utilities, do the following:

1) Comment out the update program in /etc/rc.

2) Run `tunefs -n enable' on one or more test filesystems.

3) Mount these filesystems and then type `mount' to ensure that
   they have been enabled for soft updates.

4) Copy the test directory to a softdep filesystem, chdir into
   it and run `./doit'. You may want to check out each of the
   three subtests individually first: doit1 - Andrew benchmarks,
   doit2 - copy and removal of /etc, doit3 - find from /.

====
Additional notes from Feb 13

When removing huge directories of files, it is possible to get
the incore state arbitrarily far ahead of the disk. Maintaining
all the associated dependency information can exhaust the kernel
malloc arena. To avoid this scenario, I have put some limits on
the soft update code so that it will not be allowed to rampage
through all of the kernel memory. I enclose below the relevant
patches to vnode.h and vfs_subr.c (which allow the soft update
code to speed up the filesystem syncer process). I have also
included the diffs for ffs_softdep.c. I hope to make a pass over
ffs_softdep.c to isolate the differences with my standard version
so that these diffs are less painful to incorporate.

Since I know you like to play with tuning, I have put the relevant
knobs on sysctl debug variables.
The tuning knobs can be viewed with `sysctl debug' and set with
`sysctl -w debug.<name>=value'.
The knobs are as follows:

    debug.max_softdeps - limit on any given resource
    debug.tickdelay - ticks to delay before allocating
    debug.max_limit_hit - number of times tickdelay imposed
    debug.rush_requests - number of rush requests to filesystem syncer

The max_softdeps limit is derived from vnodesdesired which in
turn is sized based on the amount of memory on the machine.
When the limit is hit, a process requesting a resource first
tries to speed up the filesystem syncer process. Such a
request is recorded as a rush_request. After syncdelay / 2
unserviced rush requests (typically 15) are in the filesystem
syncer's queue (i.e., it is more than 15 seconds behind in its
work), the process requesting the memory is put to sleep for
tickdelay ticks. Such a delay is recorded in max_limit_hit.
Following this delay, it is granted its memory without further
delay. I have tried the following experiments in which I
delete an MH directory containing 16,703 files:

Run #            1               2                3

max_softdeps     4496            4496             4496
tickdelay        100 == 1 sec    20 == 0.2 sec    2 == 0.02 sec
max_limit_hit    16 == 16 sec    27 == 5.4 sec    203 == 4.1 sec
rush_requests    147             102              93
run time         57 sec          46 sec           45 sec
I/O's            781             859              936

When run with no limits, it completes in 40 seconds. So, the
time spent in delay is directly added to the bottom line.
Shortening the tick delay does cut down the total running time,
but at the expense of generating more total I/O operations
due to the rush orders being sent to the filesystem syncer.
Although the number of rush orders decreases with a shorter
tick delay, there are more requests in each order, hence the
increase in I/O count. Also, although the I/O count does rise
with a shorter delay, it is still at least an order of magnitude
less than without soft updates.
Anyway, you may want to play around with these values to see what
works best and to see if you can get an insight into how best to
tune them. If you get out-of-memory panics, then you have
max_softdeps set too high.
The max_limit_hit and rush_requests counters should be reset to
zero before each run. The minimum legal value for tickdelay is 2
(if you set it below that, the code will use 2).
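
As a rough check on the experiment table in these notes, the delay
figures can be reproduced with a few lines of arithmetic. The sketch
below is illustrative only and is not part of the soft update code.
It assumes a clock rate of 100 ticks per second (implied by the
`100 == 1 sec' entry in the table), and the names HZ,
effective_tickdelay, and total_delay_seconds are hypothetical
helpers, not kernel symbols.

```python
# Assumption: 100 clock ticks per second, as implied by "100 == 1 sec".
HZ = 100
# Minimum legal tickdelay stated in the notes; lower values are raised to 2.
MIN_TICKDELAY = 2
# Run time reported for the same deletion with no limits imposed.
UNLIMITED_RUN_SECONDS = 40


def effective_tickdelay(requested):
    """Return the tickdelay actually used, clamped to the stated minimum."""
    return max(requested, MIN_TICKDELAY)


def total_delay_seconds(tickdelay, max_limit_hit, hz=HZ):
    """Total time spent sleeping: one tickdelay per recorded limit hit."""
    return effective_tickdelay(tickdelay) * max_limit_hit / hz


# (tickdelay, max_limit_hit, measured run time in seconds) from the table.
RUNS = [(100, 16, 57), (20, 27, 46), (2, 203, 45)]

for tickdelay, hits, measured in RUNS:
    delay = total_delay_seconds(tickdelay, hits)
    predicted = UNLIMITED_RUN_SECONDS + delay
    print(f"tickdelay={tickdelay:3d} hits={hits:3d} "
          f"delay={delay:5.2f}s predicted={predicted:5.2f}s "
          f"measured={measured}s")
```

Under that assumption, the computed delays are 16, 5.4, and 4.06
seconds (the last appears as 4.1 in the table), and adding each to
the 40 second unlimited run time lands within about a second of the
measured 57, 46, and 45 seconds, which is consistent with the
observation that time spent in delay is added to the bottom line.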