.\"
.\" swapcache - Cache clean filesystem data & meta-data on SSD-based swap
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.Dd February 7, 2010
.Dt SWAPCACHE 8
.Os
.Sh NAME
.Nm swapcache
.Nd a mechanism to use fast swap to cache filesystem data and meta-data
.Sh SYNOPSIS
.Cd sysctl vm.swapcache.accrate=100000
.Cd sysctl vm.swapcache.maxfilesize=0
.Cd sysctl vm.swapcache.maxburst=2000000000
.Cd sysctl vm.swapcache.curburst=4000000000
.Cd sysctl vm.swapcache.minburst=10000000
.Cd sysctl vm.swapcache.read_enable=0
.Cd sysctl vm.swapcache.meta_enable=0
.Cd sysctl vm.swapcache.data_enable=0
.Cd sysctl vm.swapcache.use_chflags=1
.Cd sysctl vm.swapcache.maxlaunder=256
.Cd sysctl vm.swapcache.hysteresis=(vm.stats.vm.v_inactive_target/2)
.Sh DESCRIPTION
.Nm
is a system capability which allows a solid state disk (SSD) in a swap
space configuration to be used to cache clean filesystem data and meta-data
in addition to its normal function of backing anonymous memory.
.Pp
Sysctls are used to manage operational parameters and can be adjusted at
any time.
Typically a large initial burst is desired after system boot,
controlled by the initial
.Va vm.swapcache.curburst
parameter.
This parameter is reduced as data is written to swap by the swapcache
and increased at a rate specified by
.Va vm.swapcache.accrate .
Once this parameter reaches zero, write activity ceases until it has
recovered sufficiently for write activity to resume.
.Pp
.Va vm.swapcache.meta_enable
enables the writing of filesystem meta-data to the swapcache.
Filesystem
meta-data is any data which the filesystem accesses via the disk device
using the buffer cache.
Meta-data is cached globally regardless of file or directory flags.
.Pp
.Va vm.swapcache.data_enable
enables the writing of clean filesystem file-data to the swapcache.
Filesystem file-data is any data which the filesystem accesses via a
regular file.
In technical terms, this is data accessed through the buffer cache
via a regular file's vnode.
Please do not blindly turn on this option; see the
.Sx PERFORMANCE TUNING
section for more information.
.Pp
.Va vm.swapcache.use_chflags
enables the use of the
.Va cache
and
.Va noscache
.Xr chflags 1
flags to control which files will be data-cached.
If this sysctl is disabled and
.Va data_enable
is enabled, the system will ignore file flags and attempt to
swapcache all regular files.
.Pp
.Va vm.swapcache.read_enable
enables reading from the swapcache and should be set to 1 for normal
operation.
.Pp
.Va vm.swapcache.maxfilesize
controls which files are to be cached based on their size.
If set to non-zero, only files smaller than the specified size
will be cached.
Larger files will not be cached.
.Pp
.Va vm.swapcache.maxlaunder
controls the maximum number of clean VM pages which will be added to
the swap cache and written out to swap on each poll.
Swapcache polls ten times a second.
.Pp
.Va vm.swapcache.hysteresis
controls how many pages swapcache waits to be added to the inactive page
queue before continuing its scan.
Once it decides to scan it continues subject to the above limitations
until it reaches the end of the inactive page queue.
This parameter is designed to make swapcache generate more bulky bursts
to swap which helps SSDs reduce write amplification effects.
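.Pp
As a minimal sketch of the settings described above, meta-data caching
might be enabled with the following commands.
The choice of which features to turn on is an assumption made for
illustration only; data caching is deliberately left off here.
.Bd -literal -offset indent
# Allow cached pages to be read back from the SSD.
sysctl vm.swapcache.read_enable=1
# Cache filesystem meta-data only.
sysctl vm.swapcache.meta_enable=1
sysctl vm.swapcache.data_enable=0
.Ed
.Pp
To apply the same settings at boot, the equivalent
.Va name=value
lines can be placed in
.Pa /etc/sysctl.conf .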
.Sh PERFORMANCE TUNING
Best operation is achieved when the active data set fits within the
swapcache.
.Pp
.Bl -tag -width 4n -compact
.It Va vm.swapcache.accrate
This specifies the burst accumulation rate in bytes per second and
ultimately controls the write bandwidth to swap averaged over a long
period of time.
This parameter must be carefully chosen to manage the write endurance of
the SSD in order to avoid wearing it out too quickly.
Even though SSDs have limited write endurance, there is a massive
cost/performance benefit to using one in a swapcache configuration.
.Pp
Let's use the Intel X25V 40GB MLC SATA SSD as an example.
This device has a write endurance of approximately
40TB (40 terabytes), although this is better treated as a minimum
value (see the
.Sx WARNINGS
section).
Limiting the long term average bandwidth to 100KB/sec leads to no more
than ~9GB/day of writing, which works out to an endurance of
approximately 12 years.
Endurance scales linearly with size.
The 80GB version of this SSD
will have a write endurance of approximately 80TB.
.Pp
MLC flash cells endure roughly 1,000-10,000 write/erase cycles, while
the lower density higher-cost SLC flash cells endure approximately
10,000-100,000 cycles.
MLC SSDs can be used for the swapcache (and swap) as long as the system
manager is cognizant of their limitations.
.Pp
.It Va vm.swapcache.meta_enable
Turning on just
.Va meta_enable
causes only filesystem meta-data to be cached and will result
in very fast directory operations even over millions of inodes
and even in the face of other invasive operations being run
by other processes.
.Pp
For
.Nm HAMMER
filesystems meta-data includes the B-Tree, directory entries,
and data related to tiny files.
Approximately 6 GB of swapcache is needed
for every 14 million or so inodes cached, effectively giving one the
ability to cache all the meta-data in a multi-terabyte filesystem using
a fairly small SSD.
.Pp
.It Va vm.swapcache.data_enable
Turning on
.Va data_enable
(with or without other features) allows bulk file data to be cached.
This feature is very useful for web server operation when the
operational data set fits in swap.
The usefulness is somewhat mitigated by the maximum number
of vnodes supported by the system via
.Va kern.maxvnodes ,
because the bulk data in the cache is lost when the related
vnode is recycled.
In this case it might be desirable to
take the plunge into running a 64-bit kernel which can support
far more vnodes.
32-bit kernels have limited kernel virtual
memory (KVM) and cannot reliably support more than around
100,000 active vnodes.
64-bit kernels can support 300,000+ active vnodes.
.Pp
Data caching is definitely more wasteful of the SSD's write durability
than meta-data caching.
The swapcache may exhaust its burst and smack against the long term
average bandwidth limit, causing the SSD to wear out at the maximum rate
you programmed.
Data caching is far less wasteful and more efficient
if (on a 64-bit system only) you provide a sufficiently large SSD and
increase
.Va kern.maxvnodes
to cover the entire directory topology being served.
Each vnode requires about 1KB of physical RAM.
.Pp
Due to the higher SSD write rate you may want to use a
medium-sized SSD with good write performance to reduce interference
between reading and writing.
Write durability also scales with larger SSDs.
For example, an Intel X25-V only has 40MB/s of write performance
and burst writing by swapcache will seriously interfere with
concurrent read operation on the SSD.
The 80GB X25-M on the other hand has double the write performance.
.Pp
When data caching is turned on you generally want to use
.Xr chflags 1
with the
.Va cache
flag to enable data caching on a directory.
This flag is tracked by the namecache and does not need to be
recursively set in the directory tree.
Simply setting the flag in a top level directory or mount point
is usually sufficient.
However, the flag does not track across mount points.
A typical setup is something like this:
.Pp
.Dl chflags cache /etc /sbin /bin /usr /home
.Dl chflags noscache /usr/obj
.Pp
If that doesn't work you can turn off
.Va vm.swapcache.use_chflags
entirely and not bother with any
.Nm chflag Ns 'ing .
.Pp
Filesystems such as NFS which do not support flags generally
have a
.Va cache
mount option which enables swapcache operation on the mount.
.Pp
.It Va vm.swapcache.maxfilesize
This may be used to reduce cache thrashing when a focus on a small
potentially fragmented filespace is desired, leaving the
larger files alone.
.Pp
.It Va vm.swapcache.minburst
This controls hysteresis and prevents nickel-and-dime write bursting.
Once
.Va curburst
drops to zero, writing to the swapcache ceases until it has recovered past
.Va minburst .
The idea here is to avoid creating a heavily fragmented swapcache where
reading data from a file must alternate between the cache and the primary
filesystem.
Doing so does not save disk seeks on the primary filesystem
so we want to avoid doing small bursts.
This parameter allows us to do larger bursts.
The larger bursts also tend to improve SSD performance as the SSD itself
can do a better job write-combining and erasing blocks.
.Pp
.It Va vm.swapcache.maxswappct
This controls the maximum amount of swapspace
.Nm
may use, in percentage terms.
.El
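.Pp
The data caching advice above could be combined into a setup along the
following lines.
This is only a sketch: the
.Va maxfilesize
and
.Va kern.maxvnodes
values and the choice of
.Pa /home
are hypothetical, the file size is assumed to be given in bytes, and
all of them must be chosen to suit the workload and the available RAM.
.Bd -literal -offset indent
# Honor the chflags cache/noscache flags and cache file data.
sysctl vm.swapcache.use_chflags=1
sysctl vm.swapcache.data_enable=1
# Hypothetical limit: only cache files smaller than 1MB.
sysctl vm.swapcache.maxfilesize=1048576
# Hypothetical 64-bit setting; roughly 1GB of RAM at 1KB per vnode.
sysctl kern.maxvnodes=1000000
# Mark the directory tree whose files should be data-cached.
chflags cache /home
.Ed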
.Pp
It is important to note that you should always use
.Xr disklabel64 8
to label your SSD.
Disklabel64 will properly align the base of the
partition space relative to the physical drive regardless of how badly
aligned the fdisk slice is.
This will significantly reduce write amplification and write combining
inefficiencies on the SSD.
.Pp
Finally, interleaved swap (multiple SSDs) may be used to increase
performance even further.
A single SATA SSD is typically capable of reading 120-220MB/sec.
Configuring two SSDs for your swap will
improve aggregate swapcache read performance by 1.5x to 1.8x.
In tests with two Intel 40GB SSDs, 300MB/sec was easily achieved.
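.Pp
For illustration, two SSD swap partitions might be listed in
.Xr fstab 5
as sketched here so that both are activated at boot.
The device names are hypothetical.
.Bd -literal -offset indent
# Device        Mountpoint  FStype  Options  Dump  Pass#
/dev/da1s1b     none        swap    sw       0     0
/dev/da2s1b     none        swap    sw       0     0
.Ed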
.Pp
At this point you will be configuring more swap space than a 32-bit
.Dx
kernel can handle (due to KVM limitations).
By default, 32-bit
.Dx
systems only support 32GB of configured swap and while this limit
can be increased somewhat in
.Pa /boot/loader.conf
you should really be using a 64-bit
.Dx
kernel instead.
64-bit systems support up to 512GB of swap by default
and can be boosted to up to 8TB if you are really crazy and have enough RAM.
Each 1GB of swap requires around 1MB of physical memory to manage it so
the practical limit is closer to 1TB of swap.
.Pp
Of course, a 1TB SSD is something on the order of $3000+ as of this writing.
Even though a 1TB configuration might not be cost effective, storage levels
more in the 100-200GB range certainly are.
If the machine has only a 1GigE
ethernet (100MB/s) there's no point configuring it for more SSD bandwidth.
A single SSD of the desired size would be sufficient.
.Sh INITIAL BURSTING & REPEATED BURSTING
Even though the average write bandwidth is limited it is desirable
to have a large initial burst after boot to load the cache.
.Va curburst
is initialized to 4GB by default and you can force rebursting
by adjusting it with a sysctl.
Remember that
.Va curburst
dynamically tracks burst and will go up and down depending on write
activity.
.Pp
In addition there will be periods of time where the system is in
steady state and not writing to the swapcache.
During these periods
.Va curburst
will inch back up but will not exceed
.Va maxburst .
Thus the
.Va maxburst
value controls how large a repeated burst can be.
.Pp
A second bursting parameter called
.Va vm.swapcache.minburst
controls bursting when the maximum write bandwidth has been reached.
When
.Va curburst
reaches zero, write activity ceases and
.Va curburst
is allowed to recover up to
.Va minburst
before write activity resumes.
The recommended range for the
.Va minburst
parameter is 1MB to 50MB.
This parameter has a relationship to
how fragmented the swapcache gets when not in a steady state.
Large bursts reduce fragmentation and reduce incidences of
excessive seeking on the hard drive.
If set too low the
swapcache will become fragmented within a single regular file
and the constant back-and-forth between the swapcache and the
hard drive will result in excessive seeking on the hard drive.
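.Pp
For example, a fresh burst could be forced and the burst parameters
adjusted as sketched below.
The 4GB figure is the default initial value of
.Va curburst ;
the
.Va minburst
value is an arbitrary pick from the recommended 1MB to 50MB range.
.Bd -literal -offset indent
# Inspect the current burst state.
sysctl vm.swapcache.curburst vm.swapcache.maxburst vm.swapcache.minburst
# Force a new burst of up to 4GB.
sysctl vm.swapcache.curburst=4000000000
# Require 20MB of accumulated burst before writing resumes.
sysctl vm.swapcache.minburst=20000000
.Ed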
.Sh SWAPCACHE SIZE & MANAGEMENT
The swapcache feature will use up to 75% of configured swap space
by default.
The remaining 25% is reserved for normal paging operation.
The system operator should configure swap space of at least 4 times
the size of main memory and no less than 8GB.
If a 40GB SSD is used the recommendation is to configure 16GB to 32GB of
swap (note: 32-bit is limited to 32GB of swap by default, for 64-bit
it is 512GB of swap), and to leave the remainder unwritten and unused.
.Pp
The
.Va vm.swapcache.maxswappct
sysctl may be used to change the default.
You may have to change this default if you also use
.Xr tmpfs 5 ,
.Xr vn 4 ,
or if you have not allocated enough swap for reasonable normal paging
activity to occur (in which case you probably shouldn't be using
.Nm
anyway).
.Pp
If swapcache reaches the 75% limit it will begin tearing down swap
in linear bursts by iterating through available VM objects, until
swap space use drops to 70%.
The tear-down is limited by the rate at
which new data is written and this rate in turn is often limited by
.Va vm.swapcache.accrate ,
resulting in an orderly replacement of cached data and meta-data.
The limit is typically only reached when doing full data+meta-data
caching with no file size limitations and serving primarily large
files, or (on a 64-bit system) bumping
.Va kern.maxvnodes
up to very high values.
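.Pp
As a worked illustration of the limits above, assume 32GB of configured
swap.
Both the 32GB figure and the replacement percentage are hypothetical,
and the percentage is assumed to be given as a plain integer.
.Bd -literal -offset indent
# 32GB of swap at the default 75% limit:
#   swapcache may use   32GB * 75% = 24GB
#   left for paging     32GB * 25% =  8GB
#   tear-down stops at  32GB * 70% = 22.4GB
# Hypothetical: leave half of swap to paging, tmpfs and vn instead.
sysctl vm.swapcache.maxswappct=50
.Ed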
.Sh NORMAL SWAP PAGING ACTIVITY WITH SSD SWAP
This is not a function of
.Nm
per se but instead a normal function of the system.
Most systems have
sufficient memory that they do not need to page memory to swap.
These types of systems are the ones best suited for MLC SSD
configured swap running with a
.Nm
configuration.
Systems which modestly page to swap, in the range of a few hundred
megabytes a day worth of writing, are also well suited for MLC SSD
configured swap.
Desktops usually fall into this category even if they
page out a bit more because swap activity is governed by the actions of
a single person.
.Pp
Systems which page anonymous memory heavily when
.Nm
would otherwise be turned off are not usually well suited for MLC SSD
configured swap.
Heavy paging activity is not governed by
.Nm
bandwidth control parameters and can lead to excessive uncontrolled
writing to the MLC SSD, causing premature wearout.
You would have to use the lower density, more expensive SLC SSD
technology (which has 10x the durability).
This isn't to say that
.Nm
would be ineffective, just that the aggregate write bandwidth required
to support the system would be too large for MLC flash technologies.
.Pp
With this caveat in mind, SSD based paging on systems with insufficient
RAM can be extremely effective in extending the useful life of the system.
For example, a system with a measly 192MB of RAM and SSD swap can run
a -j 8 parallel build world in a little less than twice the time it
would take if the system had 2GB of RAM, whereas it would take 5x to 10x
as long with normal HD based swap.
.Sh WARNINGS
I am going to repeat and expand a bit on SSD wear.
Wear on SSDs is a function of the write durability of the cells,
whether the SSD implements static or dynamic wear leveling, and
write amplification effects based on the type of write activity.
Write amplification occurs due to wasted space when the SSD must
erase and rewrite the underlying flash blocks.
E.g.\& MLC flash uses 128KB erase/write blocks.
.Pp
.Nm
parameters should be carefully chosen to avoid early wearout.
For example, the Intel X25V 40GB SSD has a minimum write durability
of 40TB and an actual durability that can be quite a bit higher.
Generally speaking, you want to select parameters that will give you
at least 10 years of service life.
The most important parameter to control this is
.Va vm.swapcache.accrate .
.Nm
uses a very conservative 100KB/sec default but even a small X25V
can probably handle 300KB/sec of continuous writing and still last 10 years,
provided its actual durability is well above the 40TB minimum.
.Pp
Depending on the wear leveling algorithm the drive uses, durability
and performance can sometimes be improved by configuring less
space (in a manufacturer-fresh drive) than the drive's probed capacity.
For example, one might use only 32GB of a 40GB SSD.
SSDs typically implement 10% more storage than advertised and
use this storage to improve wear leveling.
As cells begin to fail
this overallotment slowly becomes part of the primary storage
until it has been exhausted.
After that the SSD has basically failed.
Keep in mind that if you use a larger portion of the SSD's advertised
storage the SSD will not know if/when you decide to use less unless
appropriate TRIM commands are sent (if supported), or a low level
factory erase is issued.
.Pp
The swapcache is designed for use with SSDs configured as swap and
will generally not improve performance when a normal hard drive is used
for swap.
.Pp
.Nm smartctl
(from pkgsrc's sysutils/smartmontools) may be used to retrieve
the wear indicator from the drive.
One usually runs something like
.Ql smartctl -d sat -a /dev/daXX
(for AHCI/SILI/SCSI), or
.Ql smartctl -a /dev/adXX
for NATA.
Some SSDs
(particularly the Intels) will brick the SATA port when smart operations
are done while the drive is busy with normal activity, so the tool should
only be run when the SSD is idle.
.Pp
ID 232 (0xe8) in the SMART data dump indicates available reserved
space and ID 233 (0xe9) is the wear-out meter.
Reserved space
typically starts at 100 and decrements to 10, after which the SSD
is considered to operate in a degraded mode.
The wear-out meter typically starts at 99 and decrements to 0,
after which the SSD has failed.
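.Pp
The two attributes can be picked out of the dump with something like
the following, run only while the SSD is idle.
The device name is hypothetical, and the pattern assumes that
smartctl prints the attribute ID in the first column of each line.
.Bd -literal -offset indent
smartctl -d sat -a /dev/da1 | grep -E '^ *(232|233) '
.Ed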
.Pp
.Nm
tends to use large 64KB writes and tends to cluster multiple writes
linearly.
The SSD is able to take significant advantage of this
and write amplification effects are greatly reduced.
If we take a 40GB Intel X25V as an example the vendor specifies a write
durability of approximately 40TB, but
.Nm
should be able to squeeze out upwards of 200TB due to the fairly optimal
write clustering it does.
The theoretical limit for the Intel X25V is 400TB (10,000 erase cycles
per MLC cell, 40GB drive), but the firmware doesn't do perfect static
wear leveling so the actual durability is less.
In tests over several hundred days we have validated a write endurance
greater than 200TB on the 40G Intel X25V using
.Nm .
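.Pp
The endurance figures quoted in this manual page can be reproduced with
simple arithmetic.
The following is a worked sketch assuming decimal units and a 365-day
year; pairing each write rate with both the 40TB and 200TB figures is
done purely for illustration.
.Bd -literal -offset indent
100KB/sec * 86400 sec/day  = 8.64GB/day
40TB  / 8.64GB/day         = ~4630 days  = ~12.7 years
200TB / 8.64GB/day         = ~23148 days = ~63 years

300KB/sec * 86400 sec/day  = 25.92GB/day
40TB  / 25.92GB/day        = ~1543 days  = ~4.2 years
200TB / 25.92GB/day        = ~7716 days  = ~21 years

10,000 cycles * 40GB       = 400TB (theoretical limit)
.Ed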
.Pp
In contrast, most filesystems directly stored on an SSD have
fairly severe write amplification effects and will have durabilities
ranging closer to the vendor-specified limit.
.Pp
Power-on hours, power cycles, and read operations do not really affect wear.
There is something called read-disturb but it is unclear what sort of
ratio would be needed.
Since the data is cached in RAM and thus not
re-read at a high rate there is no expectation of a practical effect.
For all intents and purposes only write operations affect wear.
.Pp
SSDs with MLC-based flash technology are high-density, low-cost solutions
with limited write durability.
SLC-based flash technology is a low-density,
higher-cost solution with 10x the write durability of MLC.
The durability also scales with the amount of flash storage.
SLC based flash is typically
twice as expensive per gigabyte.
From a cost perspective, SLC based flash
is at least 5x more cost effective in situations where high write
bandwidths are required (because it lasts 10x longer).
MLC is at least 2x more cost effective in situations where high
write bandwidth is not required.
When wear calculations are in years, these differences become huge, but
often the quantity of storage needed trumps the wear life so we expect most
people will be using MLC.
.Nm
is usable with both technologies.
.Sh SEE ALSO
.Xr chflags 1 ,
.Xr fstab 5 ,
.Xr disklabel64 8 ,
.Xr swapon 8
.Sh HISTORY
.Nm
first appeared in
.Dx 2.5 .
.Sh AUTHORS
.An Matthew Dillon