1.\" Copyright (c) 2001 Matthew Dillon.  Terms and conditions are those of
2.\" the BSD Copyright as specified in the file "/usr/src/COPYRIGHT" in
3.\" the source tree.
4.\"
5.\" $DragonFly: src/share/man/man7/tuning.7,v 1.21 2008/10/17 11:30:24 swildner Exp $
6.\"
7.Dd October 24, 2010
8.Dt TUNING 7
9.Os
10.Sh NAME
11.Nm tuning
12.Nd performance tuning under DragonFly
13.Sh SYSTEM SETUP - DISKLABEL, NEWFS, TUNEFS, SWAP
Modern
.Dx
systems typically have just three partitions on the main drive.
In order, a UFS
.Pa /boot ,
.Pa swap ,
and a HAMMER
.Pa / .
The installer usually creates a multitude of PFSs (pseudo filesystems) on
the HAMMER partition for /var, /tmp, and numerous other sub-trees.
These PFSs exist to ease the management of snapshots and backups.
.Pp
Generally speaking the
.Pa /boot
partition should be around 768MB in size.  The minimum recommended
is around 350MB, giving you room for backup kernels and alternative
boot schemes.
.Pp
In the old days we recommended that swap be sized to at least 2x main
memory.  These days swap is often used for other activities, including
.Xr tmpfs 5 .
We recommend that swap be sized to the larger of 2x main memory or
1GB if you have a fairly small disk and up to 16GB if you have a
moderately endowed system and a large drive.
If you are on a minimally configured machine you may, of course,
configure far less swap or no swap at all but we recommend at least
some swap.
The kernel's VM paging algorithms are tuned to perform best when there is
at least 2x swap versus main memory.
Configuring too little swap can lead to inefficiencies in the VM
page scanning code as well as create issues later on if you add
more memory to your machine.
Swap is a good idea even if you don't think you will ever need it, as it allows the
machine to page out completely unused data from idle programs (like getty),
maximizing the RAM available for your activities.
.Pp
If you intend to use the
.Xr swapcache 8
facility with an SSD we recommend the SSD be configured with at
least a 32G swap partition (the maximum default for i386).
If you are on a moderately well configured 64-bit system you can
size swap even larger.
.Pp
Finally, on larger systems with multiple drives, if the use
of SSD swap is not in the cards, we recommend that you
configure swap on each drive (up to four drives).
The swap partitions on the drives should be approximately the same size.
The kernel can handle arbitrary sizes but
internal data structures scale to 4 times the largest swap partition.
Keeping
the swap partitions near the same size will allow the kernel to optimally
stripe swap space across the N disks.
Do not worry about overdoing it a
little; swap space is the saving grace of
.Ux
and even if you do not normally use much swap, it can give you more time to
recover from a runaway program before being forced to reboot.
.Pp
Most
.Dx
systems have a single HAMMER root and use PFSs to break it up into
various administrative domains.
All the PFSs share the same allocation layer so there is no longer a need
to size each individual mount.
Instead you should review the
.Xr hammer 8
manual page and use the 'hammer viconfig' facility to adjust snapshot
retention and other parameters.
By default
HAMMER keeps 60 days worth of snapshots.
Usually snapshots are not desired on PFSs such as
.Pa /usr/obj
or
.Pa /tmp
since data on these partitions cycles a lot.
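.Pp
For example, to reduce snapshot retention on a particular PFS you might run
'hammer viconfig' on that PFS and change its snapshots line to something
like the following (the values are purely illustrative; the default
retention is 60 days):
.Bd -literal -offset indent
snapshots 1d 30d
.Ed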
.Pp
If a very large work area is desired it is often beneficial to
configure it as a separate HAMMER mount.  If it is integrated into
the root mount it should at least be its own HAMMER PFS.
We recommend naming the large work area
.Pa /build .
Similarly if a machine is going to have a large number of users
you might want to separate your
.Pa /home
out as well.
.Pp
A number of run-time
.Xr mount 8
options exist that can help you tune the system.
The most obvious and most dangerous one is
.Cm async .
Do not ever use it; it is far too dangerous.
A less dangerous and more
useful
.Xr mount 8
option is called
.Cm noatime .
.Ux
filesystems normally update the last-accessed time of a file or
directory whenever it is accessed.
This operation is handled in
.Dx
with a delayed write and normally does not create a burden on the system.
However, if your system is accessing a huge number of files on a continuing
basis the buffer cache can wind up getting polluted with atime updates,
creating a burden on the system.
For example, if you are running a heavily
loaded web site, or a news server with lots of readers, you might want to
consider turning off atime updates on your larger partitions with this
.Xr mount 8
option.
You should not gratuitously turn off atime updates everywhere.
For example, the
.Pa /var
filesystem customarily
holds mailboxes, and atime (in combination with mtime) is used to
determine whether a mailbox has new mail.
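.Pp
For example, atime updates could be disabled on an already-mounted work
area at run-time with a command such as the following (the mount point is
illustrative); adding
.Cm noatime
to the corresponding
.Xr fstab 5
entry makes the change permanent:
.Bd -literal -offset indent
mount -u -o noatime /build
.Ed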
.Sh STRIPING DISKS
In larger systems you can stripe partitions from several drives together
to create a much larger overall partition.
Striping can also improve
the performance of a filesystem by splitting I/O operations across two
or more disks.
The
.Xr vinum 8 ,
.Xr lvm 8 ,
and
.Xr dm 8
subsystems may be used to create simple striped filesystems.
We have deprecated
.Xr ccd 4 .
Generally
speaking, striping smaller partitions such as the root and
.Pa /var/tmp ,
or essentially read-only partitions such as
.Pa /usr
is a complete waste of time.
You should only stripe partitions that require serious I/O performance.
We recommend that such partitions be completely separate mounts
and not use the same storage media as your root mount.
.Pp
To reiterate that last comment: do not stripe your boot+root.
Just about everything in there will be cached in system memory anyway.
Neither do we recommend RAIDing your root.
If robustness is needed, placing your boot, swap, and root on an SSD
(which has about the same MTBF as your motherboard) and RAIDing everything
which requires significant amounts of storage should be sufficient.
There isn't much point making the boot/swap/root storage even more redundant
when the motherboard itself has no redundancy.
When a high level of total system redundancy is required you need to be
thinking more about having multiple physical machines that back each
other up.
.Pp
When striping multiple disks, always partition in multiples of at
least 8 megabytes and use at least a 128KB stripe.
A 256KB stripe is probably even better.
This will avoid mis-aligning HAMMER big-blocks (which are 8MB)
and prevent a single I/O cluster from crossing a stripe boundary.
.Dx
will issue a significant amount of read-ahead, upwards of a megabyte
or more if it determines accesses are linear enough, which is
sufficient to issue concurrent I/O across multiple stripes.
.Sh SYSCTL TUNING
.Xr sysctl 8
variables permit system behavior to be monitored and controlled at
run-time.
Some sysctls simply report on the behavior of the system; others allow
the system behavior to be modified;
some may be set at boot time using
.Xr rc.conf 5 ,
but most will be set via
.Xr sysctl.conf 5 .
There are several hundred sysctls in the system, including many that appear
to be candidates for tuning but actually are not.
In this document we will only cover the ones that have the greatest effect
on the system.
.Pp
The
.Va kern.ipc.shm_use_phys
sysctl defaults to 1 (on) and may be set to 0 (off) or 1 (on).
Setting
this parameter to 1 will cause all System V shared memory segments to be
mapped to unpageable physical RAM.
This feature only has an effect if you
are either (A) mapping small amounts of shared memory across many (hundreds of)
processes, or (B) mapping large amounts of shared memory across any
number of processes.
This feature allows the kernel to remove a great deal
of internal memory management page-tracking overhead at the cost of wiring
the shared memory into core, making it unswappable.
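.Pp
Settings like this are normally made permanent in
.Xr sysctl.conf 5 .
For example, on a machine that is short on memory you might disable the
feature with an entry such as the following (illustrative only):
.Bd -literal -offset indent
kern.ipc.shm_use_phys=0
.Ed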
.Pp
The
.Va vfs.write_behind
sysctl defaults to 1 (on).  This tells the filesystem to issue media
writes as full clusters are collected, which typically occurs when writing
large sequential files.  The idea is to avoid saturating the buffer
cache with dirty buffers when it would not benefit I/O performance.  However,
this may stall processes and under certain circumstances you may wish to turn
it off.
.Pp
The
.Va vfs.hirunningspace
sysctl determines how much outstanding write I/O may be queued to
disk controllers system wide at any given instant.  The default is
usually sufficient but on machines with lots of disks you may want to bump
it up to four or five megabytes.  Note that setting too high a value
(exceeding the buffer cache's write threshold) can lead to extremely
bad clustering performance.  Do not set this value arbitrarily high!  Also,
higher write queueing values may add latency to reads occurring at the same
time.
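.Pp
The value can be changed at run-time with
.Xr sysctl 8 ;
for example, to raise it to roughly five megabytes (an illustrative figure
for a machine with many disks):
.Bd -literal -offset indent
sysctl vfs.hirunningspace=5242880
.Ed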
.Pp
There are various other buffer-cache and VM page cache related sysctls.
We do not recommend modifying these values.
As of
.Fx 4.3 ,
the VM system does an extremely good job tuning itself.
.Pp
The
.Va net.inet.tcp.sendspace
and
.Va net.inet.tcp.recvspace
sysctls are of particular interest if you are running network intensive
applications.
They control the amount of send and receive buffer space
allowed for any given TCP connection.
The default sending buffer is 32K; the default receiving buffer
is 64K.
You can often
improve bandwidth utilization by increasing the default at the cost of
eating up more kernel memory for each connection.
We do not recommend
increasing the defaults if you are serving hundreds or thousands of
simultaneous connections because it is possible to quickly run the system
out of memory due to stalled connections building up.
But if you need
high bandwidth over a smaller number of connections, especially if you have
gigabit Ethernet, increasing these defaults can make a huge difference.
You can adjust the buffer size for incoming and outgoing data separately.
For example, if your machine is primarily doing web serving you may want
to decrease the recvspace in order to be able to increase the
sendspace without eating too much kernel memory.
Note that the routing table (see
.Xr route 8 )
can be used to introduce route-specific send and receive buffer size
defaults.
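.Pp
For example, a host pushing bulk data over a small number of high-bandwidth
connections might carry entries such as the following in
.Xr sysctl.conf 5
(the sizes are illustrative only):
.Bd -literal -offset indent
net.inet.tcp.sendspace=65536
net.inet.tcp.recvspace=65536
.Ed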
.Pp
As an additional management tool you can use pipes in your
firewall rules (see
.Xr ipfw 8 )
to limit the bandwidth going to or from particular IP blocks or ports.
For example, if you have a T1 you might want to limit your web traffic
to 70% of the T1's bandwidth in order to leave the remainder available
for mail and interactive use.
Normally a heavily loaded web server
will not introduce significant latencies into other services even if
the network link is maxed out, but enforcing a limit can smooth things
out and lead to longer term stability.
Many people also enforce artificial
bandwidth limitations in order to ensure that they are not charged for
using too much bandwidth.
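.Pp
A minimal sketch of such a setup, assuming the
.Xr dummynet 4
pipe feature is available in your
.Xr ipfw 8
configuration (the rule number and bandwidth figure are illustrative,
roughly 70% of a T1):
.Bd -literal -offset indent
ipfw pipe 1 config bw 1080Kbit/s
ipfw add 1000 pipe 1 tcp from any 80 to any out
.Ed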
.Pp
Setting the send or receive TCP buffer to values larger than 65535 will result
in only a marginal performance improvement unless both hosts support the window
scaling extension of the TCP protocol, which is controlled by the
.Va net.inet.tcp.rfc1323
sysctl.
These extensions should be enabled and the TCP buffer size should be set
to a value larger than 65536 in order to obtain good performance from
certain types of network links; specifically, gigabit WAN links and
high-latency satellite links.
RFC 1323 support is enabled by default.
.Pp
The
.Va net.inet.tcp.always_keepalive
sysctl determines whether or not the TCP implementation should attempt
to detect dead TCP connections by intermittently delivering
.Dq keepalives
on the connection.
By default, this is disabled for all applications; only applications
that specifically request keepalives will use them.
In most environments, TCP keepalives will improve the management of
system state by expiring dead TCP connections, particularly for
systems serving dialup users who may not always terminate individual
TCP connections before disconnecting from the network.
However, in some environments, temporary network outages may be
incorrectly identified as dead sessions, resulting in unexpectedly
terminated TCP connections.
In such environments, setting the sysctl to 0 may reduce the occurrence of
TCP session disconnections.
.Pp
The
.Va net.inet.tcp.delayed_ack
TCP feature is largely misunderstood.  Historically speaking this feature
was designed to allow the acknowledgement for transmitted data to be returned
along with the response.  For example, when you type over a remote shell
the acknowledgement for the character you send can be returned along with the
data representing the echo of the character.  With delayed acks turned off
the acknowledgement may be sent in its own packet before the remote service
has a chance to echo the data it just received.  This same concept also
applies to any interactive protocol (e.g. SMTP, WWW, POP3) and can cut the
number of tiny packets flowing across the network in half.  The
.Dx
delayed-ack implementation also follows the TCP protocol rule that
at least every other packet be acknowledged even if the standard 100ms
timeout has not yet passed.  Normally the worst a delayed ack can do is
slightly delay the teardown of a connection, or slightly delay the ramp-up
of a slow-start TCP connection.  While we aren't sure, we believe that
the several FAQs related to packages such as SAMBA and SQUID which advise
turning off delayed acks may be referring to the slow-start issue.
.Pp
The
.Va net.inet.tcp.inflight_enable
sysctl turns on bandwidth delay product limiting for all TCP connections.
The system will attempt to calculate the bandwidth delay product for each
connection and limit the amount of data queued to the network to just the
amount required to maintain optimum throughput.  This feature is useful
if you are serving data over modems, GigE, or high speed WAN links (or
any other link with a high bandwidth*delay product), especially if you are
also using window scaling or have configured a large send window.  If
you enable this option you should also be sure to set
.Va net.inet.tcp.inflight_debug
to 0 (disable debugging), and for production use setting
.Va net.inet.tcp.inflight_min
to at least 6144 may be beneficial.  Note, however, that setting high
minimums may effectively disable bandwidth limiting depending on the link.
The limiting feature reduces the amount of data built up in intermediate
router and switch packet queues as well as the amount of data built
up in the local host's interface queue.  With fewer packets queued up,
interactive connections, especially over slow modems, will also be able
to operate with lower round trip times.  However, note that this feature
only affects data transmission (uploading / server-side).  It does not
affect data reception (downloading).
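.Pp
A typical production configuration using the values suggested above
(illustrative; tune them for your links) would go in
.Xr sysctl.conf 5 :
.Bd -literal -offset indent
net.inet.tcp.inflight_enable=1
net.inet.tcp.inflight_debug=0
net.inet.tcp.inflight_min=6144
.Ed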
.Pp
Adjusting
.Va net.inet.tcp.inflight_stab
is not recommended.
This parameter defaults to 50, representing +5% fudge when calculating the
bwnd from the bw.  This fudge is on top of an additional fixed +2*maxseg
added to bwnd.  The fudge factor is required to stabilize the algorithm
at very high speeds while the fixed 2*maxseg stabilizes the algorithm at
low speeds.  If you increase this value, excessive packet buffering may occur.
.Pp
The
.Va net.inet.ip.portrange.*
sysctls control the port number ranges automatically bound to TCP and UDP
sockets.  There are three ranges: a low range, a default range, and a
high range, selectable via an IP_PORTRANGE
.Fn setsockopt
call.
Most network programs use the default range which is controlled by
.Va net.inet.ip.portrange.first
and
.Va net.inet.ip.portrange.last ,
which default to 1024 and 5000, respectively.  Bound port ranges are
used for outgoing connections and it is possible to run the system out
of ports under certain circumstances.  This most commonly occurs when you are
running a heavily loaded web proxy.  The port range is not an issue
when running servers which handle mainly incoming connections, such as a
normal web server, or which have a limited number of outgoing connections, such
as a mail relay.  For situations where you may run yourself out of
ports we recommend increasing
.Va net.inet.ip.portrange.last
modestly.  A value of 10000 or 20000 or 30000 may be reasonable.  You should
also consider firewall effects when changing the port range.  Some firewalls
may block large ranges of ports (usually low-numbered ports) and expect systems
to use higher ranges of ports for outgoing connections.  For this reason
we do not recommend that
.Va net.inet.ip.portrange.first
be lowered.
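.Pp
For example, a busy web proxy that is running out of ports might raise the
top of the default range with an entry such as the following in
.Xr sysctl.conf 5
(the value is illustrative):
.Bd -literal -offset indent
net.inet.ip.portrange.last=20000
.Ed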
.Pp
The
.Va kern.ipc.somaxconn
sysctl limits the size of the listen queue for accepting new TCP connections.
The default value of 128 is typically too low for robust handling of new
connections in a heavily loaded web server environment.
For such environments,
we recommend increasing this value to 1024 or higher.
The service daemon
may itself limit the listen queue size (e.g.\&
.Xr sendmail 8 ,
apache) but will
often have a directive in its configuration file to adjust the queue size up.
Larger listen queues also do a better job of fending off denial of service
attacks.
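.Pp
For a heavily loaded web server this could be set at boot via
.Xr sysctl.conf 5
(1024 is the lower end of the recommended range):
.Bd -literal -offset indent
kern.ipc.somaxconn=1024
.Ed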
.Pp
The
.Va kern.maxfiles
sysctl determines how many open files the system supports.
The default is
typically a few thousand but you may need to bump this up to ten or twenty
thousand if you are running databases or large descriptor-heavy daemons.
The read-only
.Va kern.openfiles
sysctl may be interrogated to determine the current number of open files
on the system.
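.Pp
For example, to compare current usage against the limit and then raise the
limit for a descriptor-heavy workload (the new limit is illustrative):
.Bd -literal -offset indent
sysctl kern.openfiles kern.maxfiles
sysctl kern.maxfiles=20000
.Ed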
.Pp
The
.Va vm.swap_idle_enabled
sysctl is useful in large multi-user systems where you have lots of users
entering and leaving the system and lots of idle processes.
Such systems
tend to generate a great deal of continuous pressure on free memory reserves.
Turning this feature on and adjusting the swapout hysteresis (in idle
seconds) via
.Va vm.swap_idle_threshold1
and
.Va vm.swap_idle_threshold2
allows you to depress the priority of pages associated with idle processes
more quickly than the normal pageout algorithm.
This gives a helping hand
to the pageout daemon.
Do not turn this option on unless you need it,
because the tradeoff you are making is to essentially pre-page memory sooner
rather than later, eating more swap and disk bandwidth.
In a small system
this option will have a detrimental effect but in a large system that is
already doing moderate paging this option allows the VM system to stage
whole processes into and out of memory more easily.
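.Pp
If you do need it, a sketch of the relevant
.Xr sysctl.conf 5
entries might look like this; the idle-second thresholds are illustrative
and should be tuned to your workload:
.Bd -literal -offset indent
vm.swap_idle_enabled=1
vm.swap_idle_threshold1=2
vm.swap_idle_threshold2=10
.Ed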
.Sh LOADER TUNABLES
Some aspects of the system behavior may not be tunable at runtime because
memory allocations they perform must occur early in the boot process.
To change loader tunables, you must set their values in
.Xr loader.conf 5
and reboot the system.
.Pp
.Va kern.maxusers
controls the scaling of a number of static system tables, including defaults
for the maximum number of open files, sizing of network memory resources, etc.
On
.Dx ,
.Va kern.maxusers
is automatically sized at boot based on the amount of memory available in
the system, and may be determined at run-time by inspecting the value of the
read-only
.Va kern.maxusers
sysctl.
Some sites will require larger or smaller values of
.Va kern.maxusers
and may set it as a loader tunable; values of 64, 128, and 256 are not
uncommon.
We do not recommend going above 256 unless you need a huge number
of file descriptors; many of the tunable values set to their defaults by
.Va kern.maxusers
may be individually overridden at boot-time or run-time as described
elsewhere in this document.
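.Pp
For example, to pin the value at boot a site might place the following in
.Xr loader.conf 5
(the value shown is illustrative):
.Bd -literal -offset indent
kern.maxusers="256"
.Ed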
.Pp
The
.Va kern.dfldsiz
and
.Va kern.dflssiz
tunables set the default soft limits for process data and stack size
respectively.
Processes may increase these up to the hard limits by calling
.Xr setrlimit 2 .
The
.Va kern.maxdsiz ,
.Va kern.maxssiz ,
and
.Va kern.maxtsiz
tunables set the hard limits for process data, stack, and text size
respectively; processes may not exceed these limits.
The
.Va kern.sgrowsiz
tunable controls how much the stack segment will grow when a process
needs to allocate more stack.
.Pp
.Va kern.ipc.nmbclusters
may be adjusted to increase the number of network mbufs the system is
willing to allocate.
Each cluster represents approximately 2K of memory,
so a value of 1024 represents 2M of kernel memory reserved for network
buffers.
You can do a simple calculation to figure out how many you need.
If you have a web server which maxes out at 1000 simultaneous connections,
and each connection eats a 16K receive and 16K send buffer, you need
approximately 32MB worth of network buffers to deal with it.
A good rule of
thumb is to multiply by 2, so 32MBx2 = 64MB/2K = 32768.
So for this case
you would want to set
.Va kern.ipc.nmbclusters
to 32768.
We recommend values between
1024 and 4096 for machines with moderate amounts of memory, and between 4096
and 32768 for machines with greater amounts of memory.
Under no circumstances
should you specify an arbitrarily high value for this parameter; it could
lead to a boot-time crash.
The
.Fl m
option to
.Xr netstat 1
may be used to observe network cluster use.
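.Pp
Putting the example calculation above into practice, the tunable would be
set in
.Xr loader.conf 5
(the value is the illustrative figure computed above):
.Bd -literal -offset indent
kern.ipc.nmbclusters="32768"
.Ed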
.Sh KERNEL CONFIG TUNING
There are a number of kernel options that you may have to fiddle with in
a large-scale system.
In order to change these options you need to be
able to compile a new kernel from source.
The
.Xr config 8
manual page and the handbook are good starting points for learning how to
do this.
Generally the first thing you do when creating your own custom
kernel is to strip out all the drivers and services you do not use.
Removing things like
.Dv INET6
and drivers you do not have will reduce the size of your kernel, sometimes
by a megabyte or more, leaving more memory available for applications.
.Pp
If your motherboard is AHCI-capable then we strongly recommend turning
on AHCI mode.
.Sh CPU, MEMORY, DISK, NETWORK
The type of tuning you do depends heavily on where your system begins to
bottleneck as load increases.
If your system runs out of CPU (idle times
are perpetually 0%) then you need to consider upgrading the CPU or moving to
an SMP motherboard (multiple CPUs), or perhaps you need to revisit the
programs that are causing the load and try to optimize them.
If your system
is paging to swap a lot you need to consider adding more memory.
If your
system is saturating the disk you typically see high CPU idle times and
total disk saturation.
.Xr systat 1
can be used to monitor this.
There are many solutions to saturated disks:
increasing memory for caching, mirroring disks, distributing operations across
several machines, and so forth.
If disk performance is an issue and you
are using IDE drives, switching to SCSI can help a great deal.
While modern
IDE drives compare with SCSI in raw sequential bandwidth, the moment you
start seeking around the disk SCSI drives usually win.
.Pp
Finally, you might run out of network suds.
The first line of defense for
improving network performance is to make sure you are using switches instead
of hubs, especially these days when switches are almost as cheap.
Hubs
have severe problems under heavy loads due to collision backoff and one bad
host can severely degrade the entire LAN.
Second, optimize the network path
as much as possible.
For example, in
.Xr firewall 7
we describe a firewall protecting internal hosts with a topology where
the externally visible hosts are not routed through it.
Use 100BaseT rather
than 10BaseT, or use 1000BaseT rather than 100BaseT, depending on your needs.
Most bottlenecks occur at the WAN link (e.g.\&
modem, T1, DSL, whatever).
If expanding the link is not an option it may be possible to use the
.Xr dummynet 4
feature to implement peak shaving or other forms of traffic shaping to
prevent the overloaded service (such as web services) from affecting other
services (such as email), or vice versa.
In home installations this could
be used to give interactive traffic (your browser,
.Xr ssh 1
logins) priority
over services you export from your box (web services, email).
.Sh SEE ALSO
.Xr netstat 1 ,
.Xr systat 1 ,
.Xr dummynet 4 ,
.Xr nata 4 ,
.Xr login.conf 5 ,
.Xr rc.conf 5 ,
.Xr sysctl.conf 5 ,
.Xr firewall 7 ,
.Xr hier 7 ,
.Xr boot 8 ,
.Xr ccdconfig 8 ,
.Xr config 8 ,
.Xr disklabel 8 ,
.Xr fsck 8 ,
.Xr ifconfig 8 ,
.Xr ipfw 8 ,
.Xr loader 8 ,
.Xr mount 8 ,
.Xr newfs 8 ,
.Xr route 8 ,
.Xr sysctl 8 ,
.Xr tunefs 8 ,
.Xr vinum 8
.Sh HISTORY
The
.Nm
manual page was originally written by
.An Matthew Dillon
and first appeared
in
.Fx 4.3 ,
May 2001.