.\"
.\" swapcache - Cache clean filesystem data & meta-data on SSD-based swap
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.Dd February 7, 2010
.Dt SWAPCACHE 8
.Os
.Sh NAME
.Nm swapcache
.Nd a mechanism to use fast swap to cache filesystem data and meta-data
.Sh SYNOPSIS
.Cd sysctl vm.swapcache.accrate=100000
.Cd sysctl vm.swapcache.maxfilesize=0
.Cd sysctl vm.swapcache.maxburst=2000000000
.Cd sysctl vm.swapcache.curburst=4000000000
.Cd sysctl vm.swapcache.minburst=10000000
.Cd sysctl vm.swapcache.read_enable=0
.Cd sysctl vm.swapcache.meta_enable=0
.Cd sysctl vm.swapcache.data_enable=0
.Cd sysctl vm.swapcache.use_chflags=1
.Cd sysctl vm.swapcache.maxlaunder=256
.Cd sysctl vm.swapcache.hysteresis=(vm.stats.vm.v_inactive_target/2)
.Sh DESCRIPTION
.Nm
is a system capability which allows a solid state disk (SSD) in a swap
space configuration to be used to cache clean filesystem data and meta-data
in addition to its normal function of backing anonymous memory.
.Pp
Sysctls are used to manage operational parameters and can be adjusted at
any time.
Typically a large initial burst is desired after system boot,
controlled by the initial
.Va vm.swapcache.curburst
parameter.
This parameter is reduced as data is written to swap by the swapcache
and increased at a rate specified by
.Va vm.swapcache.accrate .
Once this parameter reaches zero, write activity ceases until it has
recovered sufficiently for writing to resume.
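The parameters above can be adjusted at runtime with sysctl.
A minimal enablement sketch follows; the values shown are illustrative
assumptions, not recommendations for any particular system:

```shell
# Illustrative swapcache enablement (values are examples, not recommendations).
# meta_enable plus read_enable is the conservative starting point;
# data_enable should only be turned on after reading PERFORMANCE TUNING.
sysctl vm.swapcache.meta_enable=1
sysctl vm.swapcache.read_enable=1
sysctl vm.swapcache.accrate=100000      # long-term write cap, bytes/sec
```

Settings made this way do not survive a reboot; persistent values belong
in the system's sysctl configuration file.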
.Pp
.Va vm.swapcache.meta_enable
enables the writing of filesystem meta-data to the swapcache.
Filesystem meta-data is any data which the filesystem accesses via the
disk device using the buffer cache.
Meta-data is cached globally regardless of file or directory flags.
.Pp
.Va vm.swapcache.data_enable
enables the writing of clean filesystem file data to the swapcache.
Filesystem file data is any data which the filesystem accesses via a
regular file, in technical terms, when the buffer cache is used to access
a regular file through its vnode.
Please do not blindly turn on this option; see the
.Sx PERFORMANCE TUNING
section for more information.
.Pp
.Va vm.swapcache.use_chflags
enables the use of the
.Va cache
and
.Va noscache
.Xr chflags 1
flags to control which files will be data-cached.
If this sysctl is disabled and
.Va data_enable
is enabled, the system will ignore file flags and attempt to
swapcache all regular files.
.Pp
.Va vm.swapcache.read_enable
enables reading from the swapcache and should be set to 1 for normal
operation.
.Pp
.Va vm.swapcache.maxfilesize
controls which files are to be cached based on their size.
If set to non-zero, only files smaller than the specified size
will be cached.
Larger files will not be cached.
.Pp
.Va vm.swapcache.maxlaunder
controls the maximum number of clean VM pages which will be added to
the swap cache and written out to swap on each poll.
Swapcache polls ten times a second.
.Pp
.Va vm.swapcache.hysteresis
controls how many pages swapcache waits to be added to the inactive page
queue before continuing its scan.
Once it decides to scan it continues subject to the above limitations
until it reaches the end of the inactive page queue.
This parameter is designed to make swapcache generate bulkier bursts
to swap, which helps SSDs reduce write amplification effects.
.Sh PERFORMANCE TUNING
Best operation is achieved when the active data set fits within the
swapcache.
.Pp
.Bl -tag -width 4n -compact
.It Va vm.swapcache.accrate
This specifies the burst accumulation rate in bytes per second and
ultimately controls the write bandwidth to swap averaged over a long
period of time.
This parameter must be carefully chosen to manage the write endurance of
the SSD in order to avoid wearing it out too quickly.
Even though SSDs have limited write endurance, there is a massive
cost/performance benefit to using one in a swapcache configuration.
.Pp
Let's use the old Intel X25V 40GB MLC SATA SSD as an example.
This device has approximately a
40TB (40 terabyte) write endurance, but see later
notes on this; it is more of a minimum value.
Limiting the long term average bandwidth to 100KB/sec leads to no more
than ~9GB/day of writing, which works out to approximately a 12 year endurance.
Endurance scales linearly with size.
The 80GB version of this SSD
will have a write endurance of approximately 80TB.
.Pp
MLC SSDs have a 1000-10000x write endurance, while the lower density
higher-cost SLC SSDs have a 10000-100000x write endurance, approximately.
MLC SSDs can be used for the swapcache (and swap) as long as the system
manager is cognizant of their limitations.
However, over the years tests have shown that SLC SSDs do not really live
up to their hype and are no more reliable than MLC SSDs.
Instead of worrying about SLC vs MLC, just use MLC (or TLC or whatever),
leave more space unpartitioned which the SSD can utilize to improve
durability, and be cognizant of the SSD's rate of wear.
.Pp
.It Va vm.swapcache.meta_enable
Turning on just
.Va meta_enable
causes only filesystem meta-data to be cached and will result
in very fast directory operations even over millions of inodes
and even in the face of other invasive operations being run
by other processes.
.Pp
For
.Nm HAMMER
filesystems meta-data includes the B-Tree, directory entries,
and data related to tiny files.
Approximately 6 GB of swapcache is needed
for every 14 million or so inodes cached, effectively giving one the
ability to cache all the meta-data in a multi-terabyte filesystem using
a fairly small SSD.
.Pp
.It Va vm.swapcache.data_enable
Turning on
.Va data_enable
(with or without other features) allows bulk file data to be cached.
This feature is very useful for web server operation when the
operational data set fits in swap.
However, care must be taken to avoid thrashing the swapcache.
In almost all cases you will want to leave chflags mode enabled
and use
.Ql chflags cache
on governing directories to control which
directory subtrees file data should be cached for.
.Pp
.Dx
uses generously large kern.maxvnodes values,
typically in excess of 400K vnodes, but large numbers
of small files can still cause problems for swapcache.
When operating on a filesystem containing a large number of
small files, vnode recycling by the kernel will cause related
swapcache data to be lost and also cause the swapcache to
potentially thrash.
Cache thrashing due to vnode recycling can occur whether chflags
mode is used or not.
.Pp
To solve the thrashing problem you can turn on HAMMER's
double buffering feature via
.Va vfs.hammer.double_buffer .
This causes HAMMER to cache file data via its block device.
HAMMER cannot avoid also caching file data via individual vnodes
but will try to expire the second copy more quickly (hence
why it is called double buffer mode), but the key point here is
that
.Nm
will only cache the data blocks via the block device when
double_buffer mode is used, and since the block device is associated
with the mount, vnode recycling will not interfere with it.
This allows the data for any number (potentially millions) of files to
be swapcached.
You should still use chflags mode to control the size of the dataset
being cached, to remain under 75% of configured swap space.
.Pp
Data caching is definitely more wasteful of the SSD's write durability
than meta-data caching.
If not carefully managed the swapcache may exhaust its burst and smack
against the long term average bandwidth limit, causing the SSD to wear
out at the maximum rate you programmed.
Data caching is far less wasteful and more efficient
if you provide a sufficiently large SSD.
.Pp
When caching large data sets you may want to use a medium-sized SSD
with good write performance instead of a small SSD to accommodate
the higher burst write rate data caching incurs and to reduce
interference between reading and writing.
Write durability also tends to scale with larger SSDs, but keep in mind
that newer flash technologies use smaller feature sizes on-chip
which reduce the write durability of the chips, so pay careful attention
to the type of flash employed by the SSD when making durability
assumptions.
For example, an Intel X25-V only has 40MB/s of write performance
and burst writing by swapcache will seriously interfere with
concurrent read operations on the SSD.
The 80GB X25-M on the other hand has double the write performance.
Higher-capacity and larger form-factor SSDs tend to have better
write performance.
But the Intel 310 series SSDs use flash chips with a smaller feature
size, so an 80G 310 series SSD will wind up with a durability relatively
close to the older 40G X25-V.
.Pp
When data caching is turned on you can fine-tune what gets swapcached
by also turning on swapcache's chflags mode and using
.Xr chflags 1
with the
.Va cache
flag to enable data caching on a directory-tree (recursive) basis.
This flag is tracked by the namecache and does not need to be
recursively set in the directory tree.
Simply setting the flag in a top level directory or mount point
is usually sufficient.
However, the flag does not track across mount points.
A typical setup is something like this:
.Pp
.Dl chflags cache /etc /sbin /bin /usr /home
.Dl chflags noscache /usr/obj
.Pp
It is possible to tell
.Nm
to ignore the cache flag by leaving
.Va vm.swapcache.use_chflags
set to zero.
In many situations it is convenient to simply not use chflags mode, but
if you have numerous mixed SSDs and HDDs you may want to use this flag
to enable swapcache on the HDDs and disable it on the SSDs even if
you do not care about fine-grained control.
.Pp
Filesystems such as NFS which do not support flags generally
have a
.Va cache
mount option which enables swapcache operation on the mount.
.Pp
.It Va vm.swapcache.maxfilesize
This may be used to reduce cache thrashing when a focus on a small
potentially fragmented filespace is desired, leaving the
larger (more linearly accessed) files alone.
.Pp
.It Va vm.swapcache.minburst
This controls hysteresis and prevents nickel-and-dime write bursting.
Once
.Va curburst
drops to zero, writing to the swapcache ceases until it has recovered past
.Va minburst .
The idea here is to avoid creating a heavily fragmented swapcache where
reading data from a file must alternate between the cache and the primary
filesystem.
Doing so does not save disk seeks on the primary filesystem
so we want to avoid doing small bursts.
This parameter allows us to do larger bursts.
The larger bursts also tend to improve SSD performance as the SSD itself
can do a better job write-combining and erasing blocks.
.Pp
.It Va vm.swapcache.maxswappct
This controls the maximum amount of swap space
.Nm
may use, in percentage terms.
The default is 75%, leaving the remaining 25% of swap available for normal
paging operations.
.El
.Pp
It is important to ensure that your swap partition is nicely aligned.
The standard
.Dx
.Xr disklabel 8
program guarantees high alignment (~1MB) automatically.
Swap on HDDs benefits because HDDs tend to use a larger physical sector
size than 512 bytes, and proper alignment for SSDs will reduce write
amplification and write-combining inefficiencies.
.Pp
Finally, interleaved swap (multiple SSDs) may be used to increase
swap and swapcache performance even further.
A single SATA-II SSD is typically capable of reading 120-220MB/sec.
Configuring two SSDs for your swap will
improve aggregate swapcache read performance by 1.5x to 1.8x.
In tests with two Intel 40GB SSDs 300MB/sec was easily achieved.
With two SATA-III SSDs it is possible to achieve 600MB/sec or better,
and well over 400MB/sec random-read performance (versus the ~3MB/sec
random read performance a hard drive gives you).
Faster SATA interfaces or newer NVMe technologies have significantly
more read bandwidth (3GB/sec+ for NVMe), but may still lag on
write bandwidth.
With newer technologies, one swap device is usually plenty.
.Pp
.Dx
defaults to a maximum of 512G of configured swap.
Keep in mind that each 1GB of actually configured swap requires
approximately 1MB of wired RAM to manage.
.Pp
In addition there will be periods of time where the system is in
a steady state and not writing to the swapcache.
During these periods
.Va curburst
will inch back up but will not exceed
.Va maxburst .
Thus the
.Va maxburst
value controls how large a repeated burst can be.
Remember that
.Va curburst
dynamically tracks the burst and will go up and down depending on
write activity.
.Pp
A second bursting parameter called
.Va vm.swapcache.minburst
controls bursting when the maximum write bandwidth has been reached.
When
.Va curburst
reaches zero write activity ceases and
.Va curburst
is allowed to recover up to
.Va minburst
before write activity resumes.
The recommended range for the
.Va minburst
parameter is 1MB to 50MB.
This parameter has a relationship to
how fragmented the swapcache gets when not in a steady state.
Large bursts reduce fragmentation and reduce incidences of
excessive seeking on the hard drive.
If set too low the
swapcache will become fragmented within a single regular file
and the constant back-and-forth between the swapcache and the
hard drive will result in excessive seeking on the hard drive.
.Sh SWAPCACHE SIZE & MANAGEMENT
The swapcache feature will use up to 75% of configured swap space
by default.
The remaining 25% is reserved for normal paging operations.
The system operator should configure at least 4 times the swap space
versus main memory and no less than 8GB of swap space.
A typical 128GB SSD might use 64GB for boot + base and 56GB for
swap, with 8GB left unpartitioned.
The system might then have a large
additional hard drive for bulk data.
Even with many packages installed, 64GB is comfortable for
boot + base.
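The sizing rules above (at least 4x main memory as swap, roughly 1MB of
wired RAM per 1GB of configured swap) can be sketched as a quick
calculation; the 16GB RAM figure is an illustrative assumption:

```shell
# Swap sizing sketch from the rules in the text: configure at least 4x main
# memory as swap, and budget ~1MB of wired RAM per 1GB of configured swap.
# ram_gb is an illustrative figure, not a probed value.
ram_gb=16
swap_gb=$((ram_gb * 4))             # at least 4x main memory
wired_mb=$((swap_gb * 1))           # ~1MB wired RAM per 1GB of swap
echo "${swap_gb}GB swap, ~${wired_mb}MB wired RAM overhead"
```

At the 512G swap maximum this overhead is still only about half a
gigabyte of wired RAM.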
.Pp
When configuring an SSD that will be used for swap or swapcache
it is a good idea to leave around 10% unpartitioned to improve
the SSD's durability.
.Pp
You do not need to use swapcache if you have no hard drives in the
system, though in fact swapcache can help if you use NFS heavily
as a client.
.Pp
The
.Va vm.swapcache.maxswappct
sysctl may be used to change the default.
You may have to change this default if you also use
.Xr tmpfs 5 ,
.Xr vn 4 ,
or if you have not allocated enough swap for reasonable normal paging
activity to occur (in which case you probably shouldn't be using
.Nm
anyway).
.Pp
If swapcache reaches the 75% limit it will begin tearing down swap
in linear bursts by iterating through available VM objects, until
swap space use drops to 70%.
The tear-down is limited by the rate at
which new data is written, and this rate in turn is often limited by
.Va vm.swapcache.accrate ,
resulting in an orderly replacement of cached data and meta-data.
The limit is typically only reached when doing full data+meta-data
caching with no file size limitations and serving primarily large
files, or when bumping
.Va kern.maxvnodes
up to very high values.
.Sh NORMAL SWAP PAGING ACTIVITY WITH SSD SWAP
This is not a function of
.Nm
per se but instead a normal function of the system.
Most systems have
sufficient memory that they do not need to page memory to swap.
These types of systems are the ones best suited for MLC SSD
configured swap running with a
.Nm
configuration.
Systems which modestly page to swap, in the range of a few hundred
megabytes a day worth of writing, are also well suited for MLC SSD
configured swap.
Desktops usually fall into this category even if they
page out a bit more, because swap activity is governed by the actions of
a single person.
.Pp
Systems which page anonymous memory heavily when
.Nm
would otherwise be turned off are not usually well suited for MLC SSD
configured swap.
Heavy paging activity is not governed by
.Nm
bandwidth control parameters and can lead to excessive uncontrolled
writing to the SSD, causing premature wearout.
This isn't to say that
.Nm
would be ineffective, just that the aggregate write bandwidth required
to support the system might be too large to be cost-effective for an SSD.
.Pp
With this caveat in mind, SSD based paging on systems with insufficient
RAM can be extremely effective in extending the useful life of the system.
For example, a system with a measly 192MB of RAM and SSD swap can run
a -j 8 parallel build world in a little less than twice the time it
would take if the system had 2GB of RAM, whereas it would take 5x to 10x
as long with normal HDD based swap.
.Sh USING SWAPCACHE WITH NORMAL HARD DRIVES
Although
.Nm
is designed to work with SSD-based storage it can also be used with
HD-based storage as an aid for offloading the primary storage system.
Here we need to make a distinction between using RAID for fanning out
storage versus using RAID for redundancy.
There are numerous situations where RAID-based redundancy does not
make sense.
.Pp
A good example would be in an environment where the servers themselves
are redundant and can suffer a total failure without affecting
ongoing operations.
When the primary storage requirements easily fit onto
a single large-capacity drive it doesn't make a whole lot of sense to
use RAID if your only desire is to improve performance.
If you had a farm of, say, 20 servers supporting the same facility,
adding RAID to each one would not accomplish anything other than to
bloat your deployment and maintenance costs.
.Pp
In these sorts of situations it may be desirable and convenient to have
the primary filesystem for each machine on a single large drive and then
use the
.Nm
facility to offload the drive and make the machine more effective without
actually distributing the filesystem itself across multiple drives.
For the purposes of offloading, while an SSD would be the most effective
from a performance standpoint, a second medium-sized HD with its much
lower cost and higher capacity might actually be more cost effective.
.Sh EXPLANATION OF STATIC VS DYNAMIC WEAR LEVELING, AND WRITE-COMBINING
Modern SSDs keep track of space that has never been written to.
This would also include space freed up via TRIM, but simply not
touching a bit of storage in a factory fresh SSD works just as well.
Once you touch (write to) the storage all bets are off, even if
you reformat/repartition later.
It takes sending the SSD a whole-device TRIM command or a special
format command to take it back to its factory-fresh condition
(sans wear already present).
.Pp
SSDs have wear leveling algorithms which are responsible for trying
to even out the erase/write cycles across all flash cells in the
storage.
The better a job the SSD can do the longer the SSD will remain usable.
.Pp
The more unused storage there is from the SSD's point of view the
easier a time the SSD has running its wear leveling algorithms.
Basically the wear leveling algorithm in a modern SSD (say Intel or OCZ)
uses a combination of static and dynamic leveling.
Static is the best, allowing the SSD to reuse flash cells that have not
been erased very much by moving static (unchanging) data out of them and
into other cells that have more wear.
Dynamic wear leveling involves writing data to available flash cells
and then marking the cells containing the previous copy of the data
as being free/reusable.
Dynamic wear leveling is the worst kind but the easiest to implement.
Modern SSDs use a combination of both algorithms plus also do
write-combining.
.Pp
USB sticks often use only dynamic wear leveling and have short life spans
because of that.
.Pp
In any case, any unused space in the SSD effectively makes the dynamic
wear leveling the SSD does more efficient by giving the SSD more 'unused'
space above and beyond the physical space it reserves beyond its stated
storage capacity to cycle data through, so the SSD lasts longer in theory.
.Pp
Write-combining is a feature whereby the SSD is able to reduce write
amplification effects by combining OS writes of smaller, discrete,
non-contiguous logical sectors into a single contiguous 128KB physical
flash block.
.Pp
On the flip side, write-combining also results in more complex lookup
tables which can become fragmented over time and reduce the SSD's read
performance.
Fragmentation can also occur when write-combined blocks are rewritten
piecemeal.
Modern SSDs can regain the lost performance by de-combining previously
write-combined areas as part of their static wear leveling algorithm, but
at the cost of extra write/erase cycles which slightly increase write
amplification effects.
Operating systems can also help maintain the SSD's performance by
utilizing larger blocks.
Write-combining results in a net reduction
of write-amplification effects, but due to having to de-combine later and
other fragmentary effects it isn't 100%.
From testing with Intel devices write-amplification can be well controlled
in the 2x-4x range with the OS doing 16K writes, versus a worst-case
8x write-amplification with 16K blocks, 32x with 4K blocks, and a truly
horrid worst-case with 512 byte blocks.
.Pp
The
.Dx
.Nm
feature utilizes 64K-128K writes and is specifically designed to minimize
write amplification and write-combining stresses.
In terms of placing an actual filesystem on the SSD, the
.Dx
.Xr hammer 8
filesystem utilizes 16K blocks and is well behaved as long as you limit
reblocking operations.
For UFS you should create the filesystem with at least a 4K fragment
size, versus the default 2K.
Modern Windows filesystems use 4K clusters, but it is unclear how
SSD-friendly NTFS is.
.Sh EXPLANATION OF FLASH CHIP FEATURE SIZE VS ERASE/REWRITE CYCLE DURABILITY
Manufacturers continue to produce flash chips with smaller feature sizes.
Smaller flash cells mean reduced erase/rewrite cycle durability, which in
turn reduces the durability of the SSD.
.Pp
The older 34nm flash typically had a 10,000 cell durability while the
newer 25nm flash is closer to 1000.
The newer flash uses larger ECCs and more sensitive voltage comparators
on-chip to increase the durability closer to 3000 cycles.
Generally speaking you should assume a durability of around
1/3 for the same storage capacity using the new chips versus the older
chips.
If you can squeeze out a 400TB durability from an older 40GB X25-V
using 34nm technology then you should assume around a 400TB durability
from a newer 120GB 310 series SSD using 25nm technology.
.Sh WARNINGS
I am going to repeat and expand a bit on SSD wear.
Wear on SSDs is a function of the write durability of the cells,
whether the SSD implements static or dynamic wear leveling (or both),
write amplification effects when the OS does not issue write-aligned 128KB
ops or when the SSD is unable to write-combine adjacent logical sectors,
or if the SSD has a poor write-combining algorithm for non-adjacent
sectors.
In addition some additional erase/rewrite activity occurs from cleanup
operations the SSD performs as part of its static wear leveling algorithms
and its write-decombining algorithms (necessary to maintain performance
over time).
MLC flash uses 128KB physical write/erase blocks while SLC flash
typically uses 64KB physical write/erase blocks.
.Pp
The algorithms the SSD implements in its firmware are probably the most
important part of the device and a major differentiator between e.g. SATA
and USB-based SSDs.
SATA form factor drives will universally be far superior to USB storage
sticks.
SSDs can also have wildly different wearout rates and wildly different
performance curves over time.
For example the performance of an SSD which does not implement
write-decombining can seriously degrade over time as its lookup
tables become severely fragmented.
For the purposes of this manual page we are primarily using Intel and OCZ
drives when describing performance and wear issues.
.Pp
.Nm
parameters should be carefully chosen to avoid early wearout.
For example, the Intel X25V 40GB SSD has a minimum write durability
of 40TB and an actual durability that can be quite a bit higher.
Generally speaking, you want to select parameters that will give you
at least 10 years of service life.
The most important parameter to control this is
.Va vm.swapcache.accrate .
.Nm
uses a very conservative 100KB/sec default, but even a small X25V
can probably handle 300KB/sec of continuous writing and still last
10 years.
.Pp
Depending on the wear leveling algorithm the drive uses, durability
and performance can sometimes be improved by configuring less
space (in a manufacturer-fresh drive) than the drive's probed capacity.
For example, by only using 32GB of a 40GB SSD.
SSDs typically implement 10% more storage than advertised and
use this storage to improve wear leveling.
As cells begin to fail
this overallotment slowly becomes part of the primary storage
until it has been exhausted.
After that the SSD has basically failed.
Keep in mind that if you use a larger portion of the SSD's advertised
storage the SSD will not know if/when you decide to use less unless
appropriate TRIM commands are sent (if supported), or a low level
factory erase is issued.
.Pp
.Nm smartctl
(from
.Xr dports 7 Ap s
.Pa sysutils/smartmontools )
may be used to retrieve the wear indicator from the drive.
One usually runs something like
.Ql smartctl -d sat -a /dev/daXX
(for AHCI/SILI/SCSI), or
.Ql smartctl -a /dev/adXX
for NATA.
Some SSDs
(particularly the Intels) will brick the SATA port when SMART operations
are done while the drive is busy with normal activity, so the tool should
only be run when the SSD is idle.
.Pp
ID 232 (0xe8) in the SMART data dump indicates available reserved
space and ID 233 (0xe9) is the wear-out meter.
Reserved space
typically starts at 100 and decrements to 10, after which the SSD
is considered to operate in a degraded mode.
The wear-out meter typically starts at 99 and decrements to 0,
after which the SSD has failed.
.Pp
.Nm
tends to use large 64KB writes and tends to cluster multiple writes
linearly.
The SSD is able to take significant advantage of this
and write amplification effects are greatly reduced.
If we take a 40GB Intel X25V as an example the vendor specifies a write
durability of approximately 40TB, but
.Nm
should be able to squeeze out upwards of 200TB due to the fairly optimal
write clustering it does.
The theoretical limit for the Intel X25V is 400TB (10,000 erase cycles
per MLC cell, 40GB drive, with 34nm technology), but the firmware doesn't
do perfect static wear leveling so the actual durability is less.
In tests over several hundred days we have validated a write endurance
greater than 200TB on the 40G Intel X25V using
.Nm .
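The endurance figures in the preceding paragraph can be sanity-checked
with a quick shell calculation; the numbers are the X25V figures quoted
above, used here purely for illustration:

```shell
# Sanity-check of the X25V endurance figures quoted in the text.
drive_gb=40
erase_cycles=10000                    # per MLC cell, 34nm flash
accrate=100000                        # bytes/sec long-term write cap
# Theoretical limit assuming perfect static wear leveling:
theoretical_tb=$((drive_gb * erase_cycles / 1000))
# Service life against the 40TB vendor-minimum write endurance:
bytes_per_day=$((accrate * 86400))
years=$((40 * 1000000000000 / bytes_per_day / 365))
echo "theoretical ${theoretical_tb}TB, ~${years} years at 100KB/sec"
```

This reproduces the 400TB theoretical limit and the roughly 12 year
service life cited for the default 100KB/sec accrate.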
.Pp
In contrast, filesystems directly stored on an SSD could have
fairly severe write amplification effects and will have durabilities
ranging closer to the vendor-specified limit.
.Pp
Tests have shown that power cycling (with proper shutdown) and read
operations do not adversely affect an SSD.
Writing within the wearout constraints provided by the vendor also
does not make a powered SSD any less reliable over time.
Time itself seems to be a factor as the SSD encounters defects and
weak cells in the flash chips.
Writes to an SSD will affect cold durability (a typical flash chip has
10 years of cold data retention when fresh and less than 1 year of cold
data retention near the end of its wear life).
Keeping an SSD cool improves its data retention.
.Pp
Beware the standard comparison between SLC, MLC, and TLC-based flash
in terms of wearout and durability.
Over the years, tests have shown that SLC is not actually any more
reliable than MLC, despite having a significantly larger theoretical
durability.
Cell and chip failures seem to trump theoretical wear limitations
in terms of device reliability.
With that in mind, we do not recommend using SLC for anything any more.
Instead we recommend that the flash simply be over-provisioned to provide
the needed durability.
This is already done in numerous NVMe solutions in order for the vendor
to be able to provide certain minimum wear guarantees.
Durability scales with the amount of flash storage (but the fab process
typically scales the opposite way... smaller feature sizes for flash
cells greatly reduce their durability).
When wear calculations are in years, these differences become huge, but
often the quantity of storage needed trumps the wear life, so we expect
most people will be using MLC.
.Pp
Beware the huge difference between larger (e.g. 2.5") form-factor SSDs
and smaller SSDs such as USB sticks and very small M.2 storage.
Smaller form-factor devices have fewer flash chips, much lower write
bandwidths, and less RAM for caching and write-combining, and USB sticks
in particular will usually have unsophisticated wear-leveling algorithms
compared to a 2.5" SSD.
It is generally not a good idea to make a USB stick your primary storage.
Long-form-factor NGFF/M.2 devices will be better, and 2.5"
form factor devices even better.
The read bandwidth for a SATA SSD caps out more quickly than the read
bandwidth for an NVMe SSD, but the larger form factor of a 2.5" SATA SSD
will often have superior write performance to an NGFF NVMe device.
There are 2.5" NVMe devices as well, requiring a special connector or
PCIe adapter, which give you the best of both worlds.
.Sh SEE ALSO
.Xr chflags 1 ,
.Xr fstab 5 ,
.Xr disklabel64 8 ,
.Xr hammer 8 ,
.Xr swapon 8
.Sh HISTORY
.Nm
first appeared in
.Dx 2.5 .
.Sh AUTHORS
.An Matthew Dillon