1.\" Copyright (c) 2001 Matthew Dillon. Terms and conditions are those of 2.\" the BSD Copyright as specified in the file "/usr/src/COPYRIGHT" in 3.\" the source tree. 4.\" 5.\" $DragonFly: src/share/man/man7/tuning.7,v 1.21 2008/10/17 11:30:24 swildner Exp $ 6.\" 7.Dd October 24, 2010 8.Dt TUNING 7 9.Os 10.Sh NAME 11.Nm tuning 12.Nd performance tuning under DragonFly 13.Sh SYSTEM SETUP - DISKLABEL, NEWFS, TUNEFS, SWAP 14Modern 15.Dx 16systems typically have just three partitions on the main drive. 17In order, a UFS 18.Pa /boot , 19.Pa swap , 20and a HAMMER 21.Pa / . 22The installer usually creates a multitude of PFSs (pseudo filesystems) on 23the HAMMER partition for /var, /tmp, and numerous other sub-trees. 24These PFSs exist to ease the management of snapshots and backups. 25.Pp 26Generally speaking the 27.Pa /boot 28partition should be around 768MB in size. The minimum recommended 29is around 350MB, giving you room for backup kernels and alternative 30boot schemes. 31.Pp 32In the old days we recommended that swap be sized to at least 2x main 33memory. These days swap is often used for other activities, including 34.Xr tmpfs 5 . 35We recommend that swap be sized to the larger of 2x main memory or 361GB if you have a fairly small disk and up to 16GB if you have a 37moderately endowed system and a large drive. 38If you are on a minimally configured machine you may, of course, 39configure far less swap or no swap at all but we recommend at least 40some swap. 41The kernel's VM paging algorithms are tuned to perform best when there is 42at least 2x swap versus main memory. 43Configuring too little swap can lead to inefficiencies in the VM 44page scanning code as well as create issues later on if you add 45more memory to your machine. 46Swap is a good idea even if you don't think you will ever need it as it allows the 47machine to page out completely unused data from idle programs (like getty), 48maximizing the ram available for your activities. 49.Pp 50If you intend to use the 51.Xr swapcache 8 52facility with a SSD we recommend the SSD be configured with at 53least a 32G swap partition (the maximum default for i386). 54If you are on a moderately well configured 64-bit system you can 55size swap even larger. 56.Pp 57Finally, on larger systems with multiple drives, if the use 58of SSD swap is not in the cards, we recommend that you 59configure swap on each drive (up to four drives). 60The swap partitions on the drives should be approximately the same size. 61The kernel can handle arbitrary sizes but 62internal data structures scale to 4 times the largest swap partition. 63Keeping 64the swap partitions near the same size will allow the kernel to optimally 65stripe swap space across the N disks. 66Do not worry about overdoing it a 67little, swap space is the saving grace of 68.Ux 69and even if you do not normally use much swap, it can give you more time to 70recover from a runaway program before being forced to reboot. 71.Pp 72Most 73.Dx 74systems have a single HAMMER root and use PFSs to break it up into 75various administrative domains. 76All the PFSs share the same allocation layer so there is no longer a need 77to size each individual mount. 78Instead you should review the 79.Xr hammer 8 80manual page and use the 'hammer viconfig' facility to adjust snapshot 81retention and other parameters. 82By default 83HAMMER keeps 60 days worth of snapshots. 84Usually snapshots are not desired on PFSs such as 85.Pa /usr/obj 86or 87.Pa /tmp 88since data on these partitions cycles a lot. 
.Pp
If a very large work area is desired it is often beneficial to
configure it as a separate HAMMER mount.
If it is integrated into the root mount it should at least be its own
HAMMER PFS.
We recommend naming the large work area
.Pa /build .
Similarly, if a machine is going to have a large number of users
you might want to separate your
.Pa /home
out as well.
.Pp
A number of run-time
.Xr mount 8
options exist that can help you tune the system.
The most obvious and most dangerous one is
.Cm async .
Do not ever use it; it is far too dangerous.
A less dangerous and more useful
.Xr mount 8
option is called
.Cm noatime .
.Ux
filesystems normally update the last-accessed time of a file or
directory whenever it is accessed.
This operation is handled in
.Dx
with a delayed write and normally does not create a burden on the system.
However, if your system is accessing a huge number of files on a continuing
basis the buffer cache can wind up getting polluted with atime updates,
creating a burden on the system.
For example, if you are running a heavily loaded web site, or a news
server with lots of readers, you might want to consider turning off atime
updates on your larger partitions with this
.Xr mount 8
option.
You should not gratuitously turn off atime updates everywhere.
For example, the
.Pa /var
filesystem customarily holds mailboxes, and atime (in combination with
mtime) is used to determine whether a mailbox has new mail.
.Sh STRIPING DISKS
In larger systems you can stripe partitions from several drives together
to create a much larger overall partition.
Striping can also improve the performance of a filesystem by splitting
I/O operations across two or more disks.
The
.Xr vinum 8 ,
.Xr lvm 8 ,
and
.Xr dm 8
subsystems may be used to create simple striped filesystems.
We have deprecated
.Xr ccd 4 .
Generally speaking, striping smaller partitions such as the root and
.Pa /var/tmp ,
or essentially read-only partitions such as
.Pa /usr ,
is a complete waste of time.
You should only stripe partitions that require serious I/O performance.
We recommend that such partitions be completely separate mounts
and not use the same storage media as your root mount.
.Pp
To reiterate that last point: do not stripe your boot and root partitions.
Just about everything in there will be cached in system memory anyway.
Nor do we recommend RAIDing your root.
If robustness is needed, placing your boot, swap, and root on an SSD
(which has about the same MTBF as your motherboard) and then RAIDing
everything which requires significant amounts of storage should be
sufficient.
There is not much point making the boot/swap/root storage even more
redundant when the motherboard itself has no redundancy.
When a high level of total system redundancy is required you need to be
thinking more about having multiple physical machines that back each
other up.
.Pp
When striping multiple disks always partition in multiples of at
least 8 megabytes and use at least a 128KB stripe.
A 256KB stripe is probably even better.
This will avoid mis-aligning HAMMER big-blocks (which are 8MB) and
prevent a single I/O cluster from crossing a stripe boundary.
.Dx
will issue a significant amount of read-ahead, upwards of a megabyte
or more if it determines accesses are linear enough, which is
sufficient to issue concurrent I/O across multiple stripes.
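.Pp
As an illustrative sketch only (the device names, volume names, and sizes
are assumptions), a two-disk stripe honoring these guidelines might be
created with
.Xr lvm 8 :
.Bd -literal -offset indent
# pvcreate /dev/da1s1a /dev/da2s1a
# vgcreate vg0 /dev/da1s1a /dev/da2s1a
# lvcreate -i 2 -I 256 -L 200G -n build vg0
.Ed
.Pp
Here
.Fl i
selects the number of stripes and
.Fl I
the stripe size in kilobytes (256KB above).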
.Sh SYSCTL TUNING
.Xr sysctl 8
variables permit system behavior to be monitored and controlled at
run-time.
Some sysctls simply report on the behavior of the system; others allow
the system behavior to be modified;
some may be set at boot time using
.Xr rc.conf 5 ,
but most will be set via
.Xr sysctl.conf 5 .
There are several hundred sysctls in the system, including many that appear
to be candidates for tuning but actually are not.
In this document we will only cover the ones that have the greatest effect
on the system.
.Pp
The
.Va kern.ipc.shm_use_phys
sysctl defaults to 1 (on) and may be set to 0 (off) or 1 (on).
Setting this parameter to 1 will cause all System V shared memory
segments to be mapped to unpageable physical RAM.
This feature only has an effect if you are either (A) mapping small
amounts of shared memory across many (hundreds) of processes, or (B)
mapping large amounts of shared memory across any number of processes.
This feature allows the kernel to remove a great deal of internal memory
management page-tracking overhead at the cost of wiring the shared memory
into core, making it unswappable.
.Pp
The
.Va vfs.write_behind
sysctl defaults to 1 (on).
This tells the filesystem to issue media writes as full clusters are
collected, which typically occurs when writing large sequential files.
The idea is to avoid saturating the buffer cache with dirty buffers when
it would not benefit I/O performance.
However, this may stall processes and under certain circumstances you may
wish to turn it off.
.Pp
The
.Va vfs.hirunningspace
sysctl determines how much outstanding write I/O may be queued to
disk controllers system-wide at any given instant.
The default is usually sufficient but on machines with lots of disks you
may want to bump it up to four or five megabytes.
Note that setting too high a value (exceeding the buffer cache's write
threshold) can lead to extremely bad clustering performance.
Do not set this value arbitrarily high!
Also, higher write queueing values may add latency to reads occurring at
the same time.
.Pp
There are various other buffer-cache and VM page cache related sysctls.
We do not recommend modifying these values.
As of
.Fx 4.3 ,
the VM system does an extremely good job tuning itself.
.Pp
The
.Va net.inet.tcp.sendspace
and
.Va net.inet.tcp.recvspace
sysctls are of particular interest if you are running network-intensive
applications.
They control the amount of send and receive buffer space
allowed for any given TCP connection.
The default sending buffer is 32K; the default receiving buffer
is 64K.
You can often
improve bandwidth utilization by increasing the default at the cost of
eating up more kernel memory for each connection.
We do not recommend
increasing the defaults if you are serving hundreds or thousands of
simultaneous connections because it is possible to quickly run the system
out of memory due to stalled connections building up.
But if you need
high bandwidth over a smaller number of connections, especially if you have
gigabit Ethernet, increasing these defaults can make a huge difference.
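.Pp
For example, to raise both defaults you might place something like the
following (the values shown are illustrative, not recommendations) in
.Xr sysctl.conf 5 :
.Bd -literal -offset indent
net.inet.tcp.sendspace=65536
net.inet.tcp.recvspace=131072
.Ed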
You can adjust the buffer size for incoming and outgoing data separately.
For example, if your machine is primarily doing web serving you may want
to decrease the recvspace in order to be able to increase the
sendspace without eating too much kernel memory.
Note that the routing table (see
.Xr route 8 )
can be used to introduce route-specific send and receive buffer size
defaults.
.Pp
As an additional management tool you can use pipes in your
firewall rules (see
.Xr ipfw 8 )
to limit the bandwidth going to or from particular IP blocks or ports.
For example, if you have a T1 you might want to limit your web traffic
to 70% of the T1's bandwidth in order to leave the remainder available
for mail and interactive use.
Normally a heavily loaded web server
will not introduce significant latencies into other services even if
the network link is maxed out, but enforcing a limit can smooth things
out and lead to longer term stability.
Many people also enforce artificial
bandwidth limitations in order to ensure that they are not charged for
using too much bandwidth.
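.Pp
A minimal sketch of such a pipe (the rule number and the bandwidth figure,
roughly 70% of a T1, are illustrative assumptions):
.Bd -literal -offset indent
# create a dummynet pipe capped at ~70% of a T1
ipfw pipe 1 config bw 1080Kbit/s
# push outbound web traffic through the pipe
ipfw add 1000 pipe 1 tcp from any 80 to any out
.Ed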
.Pp
Setting the send or receive TCP buffer to values larger than 65535 will
result in only a marginal performance improvement unless both hosts
support the window scaling extension of the TCP protocol, which is
controlled by the
.Va net.inet.tcp.rfc1323
sysctl.
These extensions should be enabled and the TCP buffer size should be set
to a value larger than 65536 in order to obtain good performance from
certain types of network links; specifically, gigabit WAN links and
high-latency satellite links.
RFC 1323 support is enabled by default.
.Pp
The
.Va net.inet.tcp.always_keepalive
sysctl determines whether or not the TCP implementation should attempt
to detect dead TCP connections by intermittently delivering
.Dq keepalives
on the connection.
By default, this is disabled for all applications; only applications
that specifically request keepalives will use them.
In most environments, TCP keepalives will improve the management of
system state by expiring dead TCP connections, particularly for
systems serving dialup users who may not always terminate individual
TCP connections before disconnecting from the network.
However, in some environments, temporary network outages may be
incorrectly identified as dead sessions, resulting in unexpectedly
terminated TCP connections.
In such environments, setting the sysctl to 0 may reduce the occurrence of
TCP session disconnections.
.Pp
The
.Va net.inet.tcp.delayed_ack
TCP feature is largely misunderstood.
Historically speaking, this feature was designed to allow the
acknowledgement of transmitted data to be returned along with the
response.
For example, when you type over a remote shell the acknowledgement of the
character you send can be returned along with the data representing the
echo of the character.
With delayed acks turned off the acknowledgement may be sent in its own
packet before the remote service has a chance to echo the data it just
received.
This same concept also applies to any interactive protocol (e.g.\& SMTP,
WWW, POP3) and can cut the number of tiny packets flowing across the
network in half.
The
.Dx
delayed-ack implementation also follows the TCP protocol rule that
at least every other packet be acknowledged even if the standard 100ms
timeout has not yet passed.
Normally the worst a delayed ack can do is
slightly delay the teardown of a connection, or slightly delay the ramp-up
of a slow-start TCP connection.
Though we are not certain, we believe this slow-start issue is what the
various FAQs for packages such as SAMBA and SQUID are referring to when
they advise turning off delayed acks.
.Pp
The
.Va net.inet.tcp.inflight_enable
sysctl turns on bandwidth delay product limiting for all TCP connections.
The system will attempt to calculate the bandwidth delay product for each
connection and limit the amount of data queued to the network to just the
amount required to maintain optimum throughput.
This feature is useful if you are serving data over modems, GigE, or high
speed WAN links (or any other link with a high bandwidth*delay product),
especially if you are also using window scaling or have configured a
large send window.
If you enable this option you should also be sure to set
.Va net.inet.tcp.inflight_debug
to 0 (disable debugging), and for production use setting
.Va net.inet.tcp.inflight_min
to at least 6144 may be beneficial.
Note, however, that setting high minimums may effectively disable
bandwidth limiting depending on the link.
The limiting feature reduces the amount of data built up in intermediate
router and switch packet queues as well as reducing the amount of data
built up in the local host's interface queue.
With fewer packets queued up, interactive connections, especially over
slow modems, will also be able to operate with lower round trip times.
However, note that this feature only affects data transmission
(uploading / server-side).
It does not affect data reception (downloading).
.Pp
Adjusting
.Va net.inet.tcp.inflight_stab
is not recommended.
This parameter defaults to 20, representing 2 maximal packets added
to the bandwidth delay product window calculation.
The additional window is required to stabilize the algorithm and improve
responsiveness to changing conditions, but it can also result in higher
ping times over slow links (though still much lower than you would get
without the inflight algorithm).
In such cases you may wish to try reducing this parameter to 15, 10, or 5,
and you may also have to reduce
.Va net.inet.tcp.inflight_min
(for example, to 3500) to get the desired effect.
Reducing these parameters should be done as a last resort only.
.Pp
The
.Va net.inet.ip.portrange.*
sysctls control the port number ranges automatically bound to TCP and UDP
sockets.
There are three ranges: a low range, a default range, and a
high range, selectable via an
.Dv IP_PORTRANGE
.Fn setsockopt
call.
Most network programs use the default range which is controlled by
.Va net.inet.ip.portrange.first
and
.Va net.inet.ip.portrange.last ,
which default to 1024 and 5000, respectively.
Bound port ranges are used for outgoing connections and it is possible to
run the system out of ports under certain circumstances.
This most commonly occurs when you are running a heavily loaded web proxy.
The port range is not an issue when running servers which handle mainly
incoming connections, such as a normal web server, or which have a limited
number of outgoing connections, such as a mail relay.
For situations where you may run out of ports we recommend increasing
.Va net.inet.ip.portrange.last
modestly.
A value of 10000, 20000, or 30000 may be reasonable.
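.Pp
For example (the value is illustrative), at run-time:
.Bd -literal -offset indent
# sysctl net.inet.ip.portrange.last=20000
.Ed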
You should also consider firewall effects when changing the port range.
Some firewalls may block large ranges of ports (usually low-numbered
ports) and expect systems to use higher ranges of ports for outgoing
connections.
For this reason we do not recommend that
.Va net.inet.ip.portrange.first
be lowered.
.Pp
The
.Va kern.ipc.somaxconn
sysctl limits the size of the listen queue for accepting new TCP
connections.
The default value of 128 is typically too low for robust handling of new
connections in a heavily loaded web server environment.
For such environments, we recommend increasing this value to 1024 or
higher.
The service daemon may itself limit the listen queue size (e.g.\&
.Xr sendmail 8 ,
apache) but will often have a directive in its configuration file to
adjust the queue size.
Larger listen queues also do a better job of fending off denial of
service attacks.
.Pp
The
.Va kern.maxfiles
sysctl determines how many open files the system supports.
The default is typically a few thousand but you may need to bump this up
to ten or twenty thousand if you are running databases or large
descriptor-heavy daemons.
The read-only
.Va kern.openfiles
sysctl may be interrogated to determine the current number of open files
on the system.
.Pp
The
.Va vm.swap_idle_enabled
sysctl is useful in large multi-user systems where you have lots of users
entering and leaving the system and lots of idle processes.
Such systems tend to generate a great deal of continuous pressure on free
memory reserves.
Turning this feature on and adjusting the swapout hysteresis (in idle
seconds) via
.Va vm.swap_idle_threshold1
and
.Va vm.swap_idle_threshold2
allows you to depress the priority of pages associated with idle processes
more quickly than the normal pageout algorithm.
This gives a helping hand to the pageout daemon.
Do not turn this option on unless you need it,
because the tradeoff you are making is to essentially pre-page memory
sooner rather than later, eating more swap and disk bandwidth.
In a small system this option will have a detrimental effect but in a
large system that is already doing moderate paging this option allows the
VM system to stage whole processes into and out of memory more easily.
.Sh LOADER TUNABLES
Some aspects of the system behavior may not be tunable at runtime because
memory allocations they perform must occur early in the boot process.
To change loader tunables, you must set their values in
.Xr loader.conf 5
and reboot the system.
.Pp
.Va kern.maxusers
controls the scaling of a number of static system tables, including
defaults for the maximum number of open files, sizing of network memory
resources, etc.
On
.Dx ,
.Va kern.maxusers
is automatically sized at boot based on the amount of memory available in
the system, and may be determined at run-time by inspecting the value of
the read-only
.Va kern.maxusers
sysctl.
Some sites will require larger or smaller values of
.Va kern.maxusers
and may set it as a loader tunable; values of 64, 128, and 256 are not
uncommon.
We do not recommend going above 256 unless you need a huge number
of file descriptors; many of the tunable values set to their defaults by
.Va kern.maxusers
may be individually overridden at boot-time or run-time as described
elsewhere in this document.
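.Pp
For example, to pin the value (256 here is illustrative) you might place
the following in
.Xr loader.conf 5 :
.Bd -literal -offset indent
kern.maxusers="256"
.Ed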
.Pp
The
.Va kern.dfldsiz
and
.Va kern.dflssiz
tunables set the default soft limits for process data and stack size
respectively.
Processes may increase these up to the hard limits by calling
.Xr setrlimit 2 .
The
.Va kern.maxdsiz ,
.Va kern.maxssiz ,
and
.Va kern.maxtsiz
tunables set the hard limits for process data, stack, and text size
respectively; processes may not exceed these limits.
The
.Va kern.sgrowsiz
tunable controls how much the stack segment will grow when a process
needs to allocate more stack.
.Pp
.Va kern.ipc.nmbclusters
may be adjusted to increase the number of network mbufs the system is
willing to allocate.
Each cluster represents approximately 2K of memory,
so a value of 1024 represents 2M of kernel memory reserved for network
buffers.
You can do a simple calculation to figure out how many you need.
If you have a web server which maxes out at 1000 simultaneous connections,
and each connection eats a 16K receive and 16K send buffer, you need
approximately 32MB worth of network buffers to deal with it.
A good rule of thumb is to multiply by 2, so 2 x 32MB = 64MB, and
64MB / 2K = 32768 clusters.
So for this case you would want to set
.Va kern.ipc.nmbclusters
to 32768.
We recommend values between 1024 and 4096 for machines with moderate
amounts of memory, and between 4096 and 32768 for machines with greater
amounts of memory.
Under no circumstances should you specify an arbitrarily high value for
this parameter as it could lead to a boot-time crash.
The
.Fl m
option to
.Xr netstat 1
may be used to observe network cluster use.
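.Pp
A minimal sketch for the web server example above, set in
.Xr loader.conf 5 :
.Bd -literal -offset indent
kern.ipc.nmbclusters="32768"
.Ed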
.Sh KERNEL CONFIG TUNING
There are a number of kernel options that you may have to fiddle with in
a large-scale system.
In order to change these options you need to be
able to compile a new kernel from source.
The
.Xr config 8
manual page and the handbook are good starting points for learning how to
do this.
Generally the first thing you do when creating your own custom
kernel is to strip out all the drivers and services you do not use.
Removing things like
.Dv INET6
and drivers you do not have will reduce the size of your kernel, sometimes
by a megabyte or more, leaving more memory available for applications.
.Pp
If your motherboard is AHCI-capable then we strongly recommend turning
on AHCI mode.
.Sh CPU, MEMORY, DISK, NETWORK
The type of tuning you do depends heavily on where your system begins to
bottleneck as load increases.
If your system runs out of CPU (idle times are perpetually 0%) then you
need to consider upgrading the CPU or moving to an SMP motherboard
(multiple CPUs), or perhaps you need to revisit the programs that are
causing the load and try to optimize them.
If your system is paging to swap a lot you need to consider adding more
memory.
If your system is saturating the disk you will typically see high CPU
idle times combined with total disk saturation.
.Xr systat 1
can be used to monitor this.
There are many solutions to saturated disks:
increasing memory for caching, mirroring disks, distributing operations
across several machines, and so forth.
If disk performance is an issue and you are using IDE drives, switching
to SCSI can help a great deal.
While modern IDE drives compare with SCSI in raw sequential bandwidth,
the moment you start seeking around the disk SCSI drives usually win.
.Pp
Finally, you might run out of network suds.
The first line of defense for improving network performance is to make
sure you are using switches instead of hubs, especially these days when
switches are almost as cheap as hubs.
Hubs have severe problems under heavy loads due to collision backoff and
one bad host can severely degrade the entire LAN.
Second, optimize the network path as much as possible.
For example, in
.Xr firewall 7
we describe a firewall protecting internal hosts with a topology where
the externally visible hosts are not routed through it.
Use 100BaseT rather than 10BaseT, or use 1000BaseT rather than 100BaseT,
depending on your needs.
Most bottlenecks occur at the WAN link (e.g.\&
modem, T1, DSL, whatever).
If expanding the link is not an option it may be possible to use the
.Xr dummynet 4
feature to implement peak shaving or other forms of traffic shaping to
prevent the overloaded service (such as web services) from affecting other
services (such as email), or vice versa.
In home installations this could be used to give interactive traffic
(your browser,
.Xr ssh 1
logins) priority over services you export from your box (web services,
email).
.Sh SEE ALSO
.Xr netstat 1 ,
.Xr systat 1 ,
.Xr dummynet 4 ,
.Xr nata 4 ,
.Xr login.conf 5 ,
.Xr rc.conf 5 ,
.Xr sysctl.conf 5 ,
.Xr firewall 7 ,
.Xr hier 7 ,
.Xr boot 8 ,
.Xr ccdconfig 8 ,
.Xr config 8 ,
.Xr disklabel 8 ,
.Xr fsck 8 ,
.Xr ifconfig 8 ,
.Xr ipfw 8 ,
.Xr loader 8 ,
.Xr mount 8 ,
.Xr newfs 8 ,
.Xr route 8 ,
.Xr sysctl 8 ,
.Xr tunefs 8 ,
.Xr vinum 8
.Sh HISTORY
The
.Nm
manual page was originally written by
.An Matthew Dillon
and first appeared
in
.Fx 4.3 ,
May 2001.