1.\"
2.\" swapcache - Cache clean filesystem data & meta-data on SSD-based swap
3.\"
4.\" Redistribution and use in source and binary forms, with or without
5.\" modification, are permitted provided that the following conditions
6.\" are met:
7.\" 1. Redistributions of source code must retain the above copyright
8.\"    notice, this list of conditions and the following disclaimer.
9.\" 2. Redistributions in binary form must reproduce the above copyright
10.\"    notice, this list of conditions and the following disclaimer in the
11.\"    documentation and/or other materials provided with the distribution.
12.Dd February 7, 2010
13.Dt SWAPCACHE 8
14.Os
15.Sh NAME
16.Nm swapcache
17.Nd a mechanism to use fast swap to cache filesystem data and meta-data
18.Sh SYNOPSIS
19.Cd sysctl vm.swapcache.accrate=100000
20.Cd sysctl vm.swapcache.maxfilesize=0
21.Cd sysctl vm.swapcache.maxburst=2000000000
22.Cd sysctl vm.swapcache.curburst=4000000000
23.Cd sysctl vm.swapcache.minburst=10000000
24.Cd sysctl vm.swapcache.read_enable=0
25.Cd sysctl vm.swapcache.meta_enable=0
26.Cd sysctl vm.swapcache.data_enable=0
27.Cd sysctl vm.swapcache.use_chflags=1
28.Cd sysctl vm.swapcache.maxlaunder=256
29.Cd sysctl vm.swapcache.hysteresis=(vm.stats.vm.v_inactive_target/2)
30.Sh DESCRIPTION
.Nm
is a system capability which allows a solid state disk (SSD) in a swap
space configuration to be used to cache clean filesystem data and meta-data
in addition to its normal function of backing anonymous memory.
.Pp
Sysctls are used to manage operational parameters and can be adjusted at
any time.
Typically a large initial burst is desired after system boot,
controlled by the initial
.Va vm.swapcache.curburst
parameter.
This parameter is reduced as data is written to swap by the swapcache
and increased at a rate specified by
.Va vm.swapcache.accrate .
Once this parameter reaches zero, write activity ceases until it has
recovered sufficiently for writing to resume.
.Pp
.Va vm.swapcache.meta_enable
enables the writing of filesystem meta-data to the swapcache.
Filesystem
meta-data is any data which the filesystem accesses via the disk device
using the buffer cache.
Meta-data is cached globally regardless of file or directory flags.
.Pp
.Va vm.swapcache.data_enable
enables the writing of clean filesystem file-data to the swapcache.
Filesystem file-data is any data which the filesystem accesses via a
regular file, i.e. when the buffer cache is used to access a regular
file through its vnode.
Do not blindly turn on this option; see the
.Sx PERFORMANCE TUNING
section for more information.
.Pp
.Va vm.swapcache.use_chflags
enables the use of the
.Va cache
and
.Va noscache
.Xr chflags 1
flags to control which files will be data-cached.
If this sysctl is disabled and
.Va data_enable
is enabled, the system will ignore file flags and attempt to
swapcache all regular files.
.Pp
.Va vm.swapcache.read_enable
enables reading from the swapcache and should be set to 1 for normal
operation.
.Pp
.Va vm.swapcache.maxfilesize
controls which files are to be cached based on their size.
If set to a non-zero value, only files smaller than the specified size
will be cached; larger files will not be.
.Pp
.Va vm.swapcache.maxlaunder
controls the maximum number of clean VM pages which will be added to
the swap cache and written out to swap on each poll.
Swapcache polls ten times a second.
.Pp
.Va vm.swapcache.hysteresis
controls how many pages must accumulate in the inactive page queue
before swapcache continues its scan.
Once it decides to scan it continues subject to the above limitations
until it reaches the end of the inactive page queue.
This parameter is designed to make swapcache generate more bulky bursts
to swap which helps SSDs reduce write amplification effects.
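.Pp
As a simple illustration, a conservative configuration which caches only
meta-data might be enabled by placing something like the following in
.Pa /etc/sysctl.conf
(the values shown are examples, not recommendations):
.Pp
.Dl vm.swapcache.read_enable=1
.Dl vm.swapcache.meta_enable=1
.Dl vm.swapcache.accrate=100000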
.Sh PERFORMANCE TUNING
Best operation is achieved when the active data set fits within the
swapcache.
.Pp
.Bl -tag -width 4n -compact
.It Va vm.swapcache.accrate
This specifies the burst accumulation rate in bytes per second and
ultimately controls the write bandwidth to swap averaged over a long
period of time.
This parameter must be carefully chosen to manage the write endurance of
the SSD in order to avoid wearing it out too quickly.
Even though SSDs have limited write endurance, there is a massive
cost/performance benefit to using one in a swapcache configuration.
.Pp
Let's use the old Intel X25V 40GB MLC SATA SSD as an example.
This device has approximately a
40TB (40 terabyte) write endurance, but see the later
notes on this; it is more of a minimum value.
Limiting the long term average bandwidth to 100KB/sec leads to no more
than ~9GB/day of writing, which works out to approximately a 12 year
endurance.
Endurance scales linearly with size.
The 80GB version of this SSD
will have a write endurance of approximately 80TB.
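.Pp
The arithmetic behind these figures is straightforward:
.Pp
.Dl 100 KB/sec * 86400 sec/day ~= 8.6 GB/day
.Dl 40 TB / 8.6 GB/day ~= 4650 days ~= 12.7 years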
.Pp
MLC SSDs have an approximate write endurance of 1000-10000 erase/rewrite
cycles per cell, while the lower density, higher-cost SLC SSDs are closer
to 10000-100000 cycles.
MLC SSDs can be used for the swapcache (and swap) as long as the system
manager is cognizant of its limitations.
However, over the years tests have shown that SLC SSDs do not really live
up to their hype and are no more reliable than MLC SSDs.
Instead of worrying about SLC vs MLC, just use MLC (or TLC or whatever),
leave more space unpartitioned which the SSD can utilize to improve
durability, and be cognizant of the SSD's rate of wear.
.Pp
.It Va vm.swapcache.meta_enable
Turning on just
.Va meta_enable
causes only filesystem meta-data to be cached and will result
in very fast directory operations even over millions of inodes
and even in the face of other invasive operations being run
by other processes.
.Pp
For
.Nm HAMMER
filesystems, meta-data includes the B-Tree, directory entries,
and data related to tiny files.
Approximately 6 GB of swapcache is needed
for every 14 million or so inodes cached, effectively giving one the
ability to cache all the meta-data in a multi-terabyte filesystem using
a fairly small SSD.
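.Pp
As a rough worked example of this sizing rule (the inode count here is
arbitrary), caching the meta-data for a filesystem holding 50 million
inodes would require roughly:
.Pp
.Dl 50M inodes / 14M inodes * 6 GB ~= 21 GB of swapcache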
.Pp
.It Va vm.swapcache.data_enable
Turning on
.Va data_enable
(with or without other features) allows bulk file data to be cached.
This feature is very useful for web server operation when the
operational data set fits in swap.
However, care must be taken to avoid thrashing the swapcache.
In almost all cases you will want to leave chflags mode enabled
and use
.Ql chflags cache
on governing directories to control which
directory subtrees file data should be cached for.
.Pp
DragonFly uses generously large
.Va kern.maxvnodes
values, typically in excess of 400K vnodes, but large numbers
of small files can still cause problems for swapcache.
When operating on a filesystem containing a large number of
small files, vnode recycling by the kernel will cause related
swapcache data to be lost and also cause the swapcache to
potentially thrash.
Cache thrashing due to vnode recycling can occur whether chflags
mode is used or not.
.Pp
To solve the thrashing problem you can turn on HAMMER's
double buffering feature via
.Va vfs.hammer.double_buffer .
This causes HAMMER to cache file data via its block device.
HAMMER cannot avoid also caching file data via individual vnodes
but will try to expire the second copy more quickly (hence
the name double buffer mode).
The key point here is that
.Nm
will only cache the data blocks via the block device when
double_buffer mode is used, and since the block device is associated
with the mount, vnode recycling will not interfere with it.
This allows the data for any number (potentially millions) of files to
be swapcached.
You should still use chflags mode to keep the size of the cached
dataset under 75% of configured swap space.
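.Pp
For example, double buffering might be enabled at runtime with:
.Pp
.Dl sysctl vfs.hammer.double_buffer=1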
.Pp
Data caching is definitely more wasteful of the SSD's write durability
than meta-data caching.
If not carefully managed the swapcache may exhaust its burst and smack
against the long term average bandwidth limit, causing the SSD to wear
out at the maximum rate you programmed.
Data caching is far less wasteful and more efficient
if you provide a sufficiently large SSD.
.Pp
When caching large data sets you may want to use a medium-sized SSD
with good write performance instead of a small SSD to accommodate
the higher burst write rate data caching incurs and to reduce
interference between reading and writing.
Write durability also tends to scale with larger SSDs, but keep in mind
that newer flash technologies use smaller feature sizes on-chip
which reduce the write durability of the chips, so pay careful attention
to the type of flash employed by the SSD when making durability
assumptions.
For example, an Intel X25-V only has 40MB/s in write performance
and burst writing by swapcache will seriously interfere with
concurrent read operation on the SSD.
The 80GB X25-M on the other hand has double the write performance.
Higher-capacity and larger form-factor SSDs tend to have better
write performance.
However, the Intel 310 series SSDs use flash chips with a smaller feature
size, so an 80G 310 series SSD will wind up with a durability roughly
comparable to the older 40G X25-V.
.Pp
When data caching is turned on you can fine-tune what gets swapcached
by also turning on swapcache's chflags mode and using
.Xr chflags 1
with the
.Va cache
flag to enable data caching on a directory-tree (recursive) basis.
This flag is tracked by the namecache and does not need to be
recursively set in the directory tree.
Simply setting the flag in a top level directory or mount point
is usually sufficient.
However, the flag does not track across mount points.
A typical setup is something like this:
.Pp
.Dl chflags cache /etc /sbin /bin /usr /home
.Dl chflags noscache /usr/obj
.Pp
It is possible to tell
.Nm
to ignore the cache flag by leaving
.Va vm.swapcache.use_chflags
set to zero.
In many situations it is convenient to simply not use chflags mode, but
if you have numerous mixed SSDs and HDDs you may want to use this flag
to enable swapcache on the HDDs and disable it on the SSDs even if
you do not care about fine-grained control.
.Pp
Filesystems such as NFS which do not support flags generally
have a
.Va cache
mount option which enables swapcache operation on the mount.
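.Pp
For example, an NFS mount might enable swapcache via an
.Xr fstab 5
entry along these lines (server path and mount point are placeholders):
.Pp
.Dl server:/export /data nfs rw,cache 0 0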
.Pp
.It Va vm.swapcache.maxfilesize
This may be used to reduce cache thrashing when the focus is on a small,
potentially fragmented filespace, leaving the
larger (more linearly accessed) files alone.
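.Pp
For example, to restrict data caching to files smaller than roughly 1MB
(assuming a value in bytes, as with the other size-related parameters;
the number is illustrative only):
.Pp
.Dl sysctl vm.swapcache.maxfilesize=1048576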
.Pp
.It Va vm.swapcache.minburst
This controls hysteresis and prevents nickel-and-dime write bursting.
Once
.Va curburst
drops to zero, writing to the swapcache ceases until it has recovered past
.Va minburst .
The idea here is to avoid creating a heavily fragmented swapcache where
reading data from a file must alternate between the cache and the primary
filesystem.
Doing so does not save disk seeks on the primary filesystem
so we want to avoid doing small bursts.
This parameter allows us to do larger bursts.
The larger bursts also tend to improve SSD performance as the SSD itself
can do a better job write-combining and erasing blocks.
.Pp
.It Va vm.swapcache.maxswappct
This controls the maximum amount of swap space
.Nm
may use, in percentage terms.
The default is 75%, leaving the remaining 25% of swap available for normal
paging operations.
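.Pp
For example, to reduce the swapcache share of swap to 60%, leaving 40%
for normal paging:
.Pp
.Dl sysctl vm.swapcache.maxswappct=60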
.El
.Pp
It is important to ensure that your swap partition is nicely aligned.
The standard DragonFly
.Xr disklabel 8
program guarantees high alignment (~1MB) automatically.
Swap-on HDDs benefit because HDDs tend to use a larger physical sector size
than 512 bytes, and proper alignment for SSDs will reduce write amplification
and write-combining inefficiencies.
.Pp
Finally, interleaved swap (multiple SSDs) may be used to increase
swap and swapcache performance even further.
A single SATA-II SSD is typically capable of reading 120-220MB/sec.
Configuring two SSDs for your swap will
improve aggregate swapcache read performance by 1.5x to 1.8x.
In tests with two Intel 40GB SSDs 300MB/sec was easily achieved.
With two SATA-III SSDs it is possible to achieve 600MB/sec or better
and well over 400MB/sec random-read performance (versus the ~3MB/sec
random read performance a hard drive gives you).
Faster SATA interfaces or newer NVMe technologies have significantly
more read bandwidth (3GB/sec+ for NVMe), but may still lag on the
write bandwidth.
With newer technologies, one swap device is usually plenty.
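.Pp
A two-device interleaved swap configuration might look like this in
.Xr fstab 5
(device names are examples only):
.Pp
.Dl /dev/da1s1b none swap sw 0 0
.Dl /dev/da2s1b none swap sw 0 0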
.Pp
.Dx
defaults to a maximum of 512G of configured swap.
Keep in mind that each 1GB of actually configured swap requires
approximately 1MB of wired RAM to manage.
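.Pp
At that maximum the bookkeeping overhead works out to roughly:
.Pp
.Dl 512 GB swap * 1 MB per GB ~= 512 MB of wired RAM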
.Pp
In addition there will be periods of time where the system is in
steady state and not writing to the swapcache.
During these periods
.Va curburst
will inch back up but will not exceed
.Va maxburst .
Thus the
.Va maxburst
value controls how large a repeated burst can be.
Remember that
.Va curburst
dynamically tracks the burst and will go up and down depending on
write activity.
.Pp
A second bursting parameter called
.Va vm.swapcache.minburst
controls bursting when the maximum write bandwidth has been reached.
When
.Va curburst
reaches zero, write activity ceases and
.Va curburst
is allowed to recover up to
.Va minburst
before write activity resumes.
The recommended range for the
.Va minburst
parameter is 1MB to 50MB.
This parameter has a relationship to
how fragmented the swapcache gets when not in a steady state.
Large bursts reduce fragmentation and reduce incidences of
excessive seeking on the hard drive.
If set too low the
swapcache will become fragmented within a single regular file
and the constant back-and-forth between the swapcache and the
hard drive will result in excessive seeking on the hard drive.
.Sh SWAPCACHE SIZE & MANAGEMENT
The swapcache feature will use up to 75% of configured swap space
by default.
The remaining 25% is reserved for normal paging operations.
The system operator should configure at least 4 times the swap space
versus main memory and no less than 8GB of swap space.
A typical 128GB SSD might use 64GB for boot + base and 56GB for
swap, with 8GB left unpartitioned.
The system might then have a large additional hard drive for bulk data.
Even with many packages installed, 64GB is comfortable for
boot + base.
.Pp
When configuring a SSD that will be used for swap or swapcache
it is a good idea to leave around 10% unpartitioned to improve
the SSD's durability.
.Pp
You do not need to use swapcache if you have no hard drives in the
system, though in fact swapcache can help if you use NFS heavily
as a client.
.Pp
The
.Va vm.swapcache.maxswappct
sysctl may be used to change the default.
You may have to change this default if you also use
.Xr tmpfs 5 ,
.Xr vn 4 ,
or if you have not allocated enough swap for reasonable normal paging
activity to occur (in which case you probably shouldn't be using
.Nm
anyway).
.Pp
If swapcache reaches the 75% limit it will begin tearing down swap
in linear bursts by iterating through available VM objects, until
swap space use drops to 70%.
The tear-down is limited by the rate at
which new data is written and this rate in turn is often limited by
.Va vm.swapcache.accrate ,
resulting in an orderly replacement of cached data and meta-data.
The limit is typically only reached when doing full data+meta-data
caching with no file size limitations and serving primarily large
files, or bumping
.Va kern.maxvnodes
up to very high values.
.Sh NORMAL SWAP PAGING ACTIVITY WITH SSD SWAP
This is not a function of
.Nm
per se but instead a normal function of the system.
Most systems have
sufficient memory that they do not need to page memory to swap.
These types of systems are the ones best suited for MLC SSD
configured swap running with a
.Nm
configuration.
Systems which modestly page to swap, in the range of a few hundred
megabytes a day worth of writing, are also well suited for MLC SSD
configured swap.
Desktops usually fall into this category even if they
page out a bit more because swap activity is governed by the actions of
a single person.
.Pp
Systems which page anonymous memory heavily when
.Nm
would otherwise be turned off are not usually well suited for MLC SSD
configured swap.
Heavy paging activity is not governed by
.Nm
bandwidth control parameters and can lead to excessive uncontrolled
writing to the SSD, causing premature wearout.
This isn't to say that
.Nm
would be ineffective, just that the aggregate write bandwidth required
to support the system might be too large to be cost-effective for a SSD.
.Pp
With this caveat in mind, SSD based paging on systems with insufficient
RAM can be extremely effective in extending the useful life of the system.
For example, a system with a measly 192MB of RAM and SSD swap can run
a -j 8 parallel build world in a little less than twice the time it
would take if the system had 2GB of RAM, whereas it would take 5x to 10x
as long with normal HDD based swap.
.Sh USING SWAPCACHE WITH NORMAL HARD DRIVES
Although
.Nm
is designed to work with SSD-based storage it can also be used with
HD-based storage as an aid for offloading the primary storage system.
Here we need to make a distinction between using RAID for fanning out
storage versus using RAID for redundancy.
There are numerous situations where RAID-based redundancy does not
make sense.
.Pp
A good example would be in an environment where the servers themselves
are redundant and can suffer a total failure without affecting
ongoing operations.
When the primary storage requirements easily fit onto
a single large-capacity drive it doesn't make a whole lot of sense to
use RAID if your only desire is to improve performance.
If you had a farm of, say, 20 servers supporting the same facility,
adding RAID to each one would not accomplish anything other than to
bloat your deployment and maintenance costs.
.Pp
In these sorts of situations it may be desirable and convenient to have
the primary filesystem for each machine on a single large drive and then
use the
.Nm
facility to offload the drive and make the machine more effective without
actually distributing the filesystem itself across multiple drives.
For the purposes of offloading, while a SSD would be the most effective
from a performance standpoint, a second medium-sized HD with its much
lower cost and higher capacity might actually be more cost effective.
.Sh EXPLANATION OF STATIC VS DYNAMIC WEAR LEVELING, AND WRITE-COMBINING
Modern SSDs keep track of space that has never been written to.
This would also include space freed up via TRIM, but simply not
touching a bit of storage in a factory fresh SSD works just as well.
Once you touch (write to) the storage all bets are off, even if
you reformat/repartition later.
It takes sending the SSD a whole-device TRIM command or special format
command to take it back to its factory-fresh condition (sans wear
already present).
.Pp
SSDs have wear leveling algorithms which are responsible for trying
to even out the erase/write cycles across all flash cells in the
storage.
The better a job the SSD can do the longer the SSD will remain usable.
.Pp
The more unused storage there is from the SSD's point of view the
easier a time the SSD has running its wear leveling algorithms.
Basically the wear leveling algorithm in a modern SSD (say Intel or OCZ)
uses a combination of static and dynamic leveling.
Static is the best, allowing the SSD to reuse flash cells that have not
been erased very much by moving static (unchanging) data out of them and
into other cells that have more wear.
Dynamic wear leveling involves writing data to available flash cells
and then marking the cells containing the previous copy of the data as
being free/reusable.
Dynamic wear leveling is the worst kind but the easiest to implement.
Modern SSDs use a combination of both algorithms plus also do
write-combining.
.Pp
USB sticks often use only dynamic wear leveling and have short life spans
because of that.
.Pp
In any case, any unused space in the SSD effectively makes its dynamic
wear leveling more efficient by giving the SSD more 'unused' space,
above and beyond the physical space it reserves beyond its stated
storage capacity, to cycle data through, so the SSD lasts longer in
theory.
.Pp
Write-combining is a feature whereby the SSD is able to reduce write
amplification effects by combining OS writes of smaller, discrete,
non-contiguous logical sectors into a single contiguous 128KB physical
flash block.
.Pp
On the flip side write-combining also results in more complex lookup tables
which can become fragmented over time and reduce the SSD's read performance.
Fragmentation can also occur when write-combined blocks are rewritten
piecemeal.
Modern SSDs can regain the lost performance by de-combining previously
write-combined areas as part of their static wear leveling algorithm, but
at the cost of extra write/erase cycles which slightly increase write
amplification effects.
Operating systems can also help maintain the SSD's performance by utilizing
larger blocks.
Write-combining results in a net-reduction
of write-amplification effects but due to having to de-combine later and
other fragmentary effects it isn't 100%.
From testing with Intel devices write-amplification can be well controlled
in the 2x-4x range with the OS doing 16K writes, versus a worst-case
8x write-amplification with 16K blocks, 32x with 4K blocks, and a truly
horrid worst-case with 512 byte blocks.
.Pp
The
.Dx
.Nm
feature utilizes 64K-128K writes and is specifically designed to minimize
write amplification and write-combining stresses.
In terms of placing an actual filesystem on the SSD, the
.Dx
.Xr hammer 8
filesystem utilizes 16K blocks and is well behaved as long as you limit
reblocking operations.
For UFS you should create the filesystem with at least a 4K fragment
size, versus the default 2K.
Modern Windows filesystems use 4K clusters but it is unclear how SSD-friendly
NTFS is.
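.Pp
For example, a UFS filesystem intended for SSD use might be created with
a 4K fragment size using something like (the device name is a placeholder):
.Pp
.Dl newfs -f 4096 /dev/da1s1d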
.Sh EXPLANATION OF FLASH CHIP FEATURE SIZE VS ERASE/REWRITE CYCLE DURABILITY
Manufacturers continue to produce flash chips with smaller feature sizes.
Smaller flash cells mean reduced erase/rewrite cycle durability which in
turn reduces the durability of the SSD.
.Pp
The older 34nm flash typically had a 10,000 cell durability while the newer
25nm flash is closer to 1000.
The newer flash uses larger ECCs and more sensitive voltage comparators
on-chip to increase the durability closer to 3000 cycles.
Generally speaking you should assume a durability of around 1/3 for the
same storage capacity using the new chips versus the older chips.
If you can squeeze out a 400TB durability from an older 40GB X25-V
using 34nm technology then you should assume around a 400TB durability
from a newer 120GB 310 series SSD using 25nm technology.
.Sh WARNINGS
I am going to repeat and expand a bit on SSD wear.
Wear on SSDs is a function of the write durability of the cells,
whether the SSD implements static or dynamic wear leveling (or both),
write amplification effects when the OS does not issue write-aligned 128KB
ops or when the SSD is unable to write-combine adjacent logical sectors,
or if the SSD has a poor write-combining algorithm for non-adjacent sectors.
In addition, some extra erase/rewrite activity occurs from cleanup
operations the SSD performs as part of its static wear leveling algorithms
and its write-decombining algorithms (necessary to maintain performance
over time).
MLC flash uses 128KB physical write/erase blocks while SLC flash
typically uses 64KB physical write/erase blocks.
.Pp
The algorithms the SSD implements in its firmware are probably the most
important part of the device and a major differentiator between e.g. SATA
and USB-based SSDs.
SATA form factor drives will universally be far superior to USB storage
sticks.
SSDs can also have wildly different wearout rates and wildly different
performance curves over time.
For example the performance of a SSD which does not implement
write-decombining can seriously degrade over time as its lookup
tables become severely fragmented.
For the purposes of this manual page we are primarily using Intel and OCZ
drives when describing performance and wear issues.
.Pp
.Nm
parameters should be carefully chosen to avoid early wearout.
For example, the Intel X25V 40GB SSD has a minimum write durability
of 40TB and an actual durability that can be quite a bit higher.
Generally speaking, you want to select parameters that will give you
at least 10 years of service life.
The most important parameter to control this is
.Va vm.swapcache.accrate .
.Nm
uses a very conservative 100KB/sec default but even a small X25V
can probably handle 300KB/sec of continuous writing and still last 10 years.
.Pp
Depending on the wear leveling algorithm the drive uses, durability
and performance can sometimes be improved by configuring less
space (in a manufacturer-fresh drive) than the drive's probed capacity.
For example, by only using 32GB of a 40GB SSD.
SSDs typically implement 10% more storage than advertised and
use this storage to improve wear leveling.
As cells begin to fail
this overallotment slowly becomes part of the primary storage
until it has been exhausted.
After that the SSD has basically failed.
Keep in mind that if you use a larger portion of the SSD's advertised
storage the SSD will not know if/when you decide to use less unless
appropriate TRIM commands are sent (if supported), or a low level
factory erase is issued.
.Pp
.Nm smartctl
(from
.Xr dports 7 Ap s
.Pa sysutils/smartmontools )
may be used to retrieve the wear indicator from the drive.
One usually runs something like
.Ql smartctl -d sat -a /dev/daXX
(for AHCI/SILI/SCSI), or
.Ql smartctl -a /dev/adXX
for NATA.
Some SSDs
(particularly the Intels) will brick the SATA port when smart operations
are done while the drive is busy with normal activity, so the tool should
only be run when the SSD is idle.
.Pp
ID 232 (0xe8) in the SMART data dump indicates available reserved
space and ID 233 (0xe9) is the wear-out meter.
Reserved space
typically starts at 100 and decrements to 10, after which the SSD
is considered to operate in a degraded mode.
The wear-out meter typically starts at 99 and decrements to 0,
after which the SSD has failed.
.Pp
.Nm
tends to use large 64KB writes and tends to cluster multiple writes
linearly.
The SSD is able to take significant advantage of this
and write amplification effects are greatly reduced.
If we take a 40GB Intel X25V as an example the vendor specifies a write
durability of approximately 40TB, but
.Nm
should be able to squeeze out upwards of 200TB due to the fairly optimal
write clustering it does.
The theoretical limit for the Intel X25V is 400TB (10,000 erase cycles
per MLC cell, 40GB drive, with 34nm technology), but the firmware doesn't
do perfect static wear leveling so the actual durability is less.
In tests over several hundred days we have validated a write endurance
greater than 200TB on the 40G Intel X25V using
.Nm .
.Pp
In contrast, filesystems directly stored on a SSD could have
fairly severe write amplification effects and will have durabilities
ranging closer to the vendor-specified limit.
.Pp
Tests have shown that power cycling (with proper shutdown) and read
operations do not adversely affect a SSD.
Writing within the wearout constraints provided by the vendor also does
not make a powered SSD any less reliable over time.
Time itself seems to be a factor as the SSD encounters defects and weak
cells in the flash chips.
Writes to a SSD will affect cold durability (a typical flash chip has
10 years of cold data retention when fresh and less than 1 year of cold
data retention near the end of its wear life).
Keeping a SSD cool improves its data retention.
.Pp
Beware the standard comparison between SLC, MLC, and TLC-based flash
in terms of wearout and durability.
Over the years, tests have shown that SLC is not actually any more
reliable than MLC, despite having a significantly larger theoretical
durability.
Cell and chip failures seem to trump theoretical wear limitations in
terms of device reliability.
With that in mind, we do not recommend using SLC for anything any more.
Instead we recommend that the flash simply be over-provisioned to provide
the needed durability.
This is already done in numerous NVMe solutions for the vendor to be able
to provide certain minimum wear guarantees.
Durability scales with the amount of flash storage (but the fab process
typically scales the opposite... smaller feature sizes for flash cells
greatly reduce their durability).
When wear calculations are in years, these differences become huge, but
often the quantity of storage needed trumps the wear life so we expect most
people will be using MLC.
.Pp
Beware the huge difference between larger (e.g. 2.5") form-factor SSDs
and smaller SSDs such as USB sticks or very small M.2 storage.
Smaller form-factor devices have fewer flash chips, much lower write
bandwidths, and less RAM for caching and write-combining, and USB sticks
in particular will usually have unsophisticated wear-leveling algorithms
compared to a 2.5" SSD.
It is generally not a good idea to make a USB stick your primary storage.
Long-form-factor NGFF/M.2 devices will be better, and 2.5" form factor
devices even better.
The read bandwidth for a SATA SSD caps out more quickly than the read
bandwidth for a NVMe SSD, but a 2.5" SATA SSD, with its larger form factor,
will often have superior write performance to a NGFF NVMe device.
There are 2.5" NVMe devices as well, requiring a special connector or
PCIe adapter, which give you the best of both worlds.
.Sh SEE ALSO
.Xr chflags 1 ,
.Xr fstab 5 ,
.Xr disklabel64 8 ,
.Xr hammer 8 ,
.Xr swapon 8
.Sh HISTORY
.Nm
first appeared in
.Dx 2.5 .
.Sh AUTHORS
.An Matthew Dillon