1.\" Copyright (c) 2001 Matthew Dillon.  Terms and conditions are those of
2.\" the BSD Copyright as specified in the file "/usr/src/COPYRIGHT" in
3.\" the source tree.
4.\"
5.\" $DragonFly: src/share/man/man7/tuning.7,v 1.21 2008/10/17 11:30:24 swildner Exp $
6.\"
7.Dd October 24, 2010
8.Dt TUNING 7
9.Os
10.Sh NAME
11.Nm tuning
12.Nd performance tuning under DragonFly
13.Sh SYSTEM SETUP - DISKLABEL, NEWFS, TUNEFS, SWAP
Modern
.Dx
systems typically have just three partitions on the main drive.
In order, a UFS
.Pa /boot ,
.Pa swap ,
and a HAMMER
.Pa / .
The installer usually creates a multitude of PFSs (pseudo filesystems) on
the HAMMER partition for /var, /tmp, and numerous other sub-trees.
These PFSs exist to ease the management of snapshots and backups.
.Pp
Generally speaking the
.Pa /boot
partition should be around 768MB in size.  The minimum recommended
is around 350MB, giving you room for backup kernels and alternative
boot schemes.
.Pp
In the old days we recommended that swap be sized to at least 2x main
memory.  These days swap is often used for other activities, including
.Xr tmpfs 5 .
We recommend that swap be sized to the larger of 2x main memory or
1GB if you have a fairly small disk and up to 16GB if you have a
moderately endowed system and a large drive.
If you are on a minimally configured machine you may, of course,
configure far less swap or no swap at all but we recommend at least
some swap.
The kernel's VM paging algorithms are tuned to perform best when there is
at least 2x swap versus main memory.
Configuring too little swap can lead to inefficiencies in the VM
page scanning code as well as create issues later on if you add
more memory to your machine.
Swap is a good idea even if you don't think you will ever need it, as it allows the
machine to page out completely unused data from idle programs (like getty),
maximizing the RAM available for your activities.
.Pp
If you intend to use the
.Xr swapcache 8
facility with an SSD we recommend the SSD be configured with at
least a 32G swap partition (the maximum default for i386).
If you are on a moderately well configured 64-bit system you can
size swap even larger.
.Pp
Finally, on larger systems with multiple drives, if the use
of SSD swap is not in the cards, we recommend that you
configure swap on each drive (up to four drives).
The swap partitions on the drives should be approximately the same size.
The kernel can handle arbitrary sizes but
internal data structures scale to 4 times the largest swap partition.
Keeping
the swap partitions near the same size will allow the kernel to optimally
stripe swap space across the N disks.
Do not worry about overdoing it a
little; swap space is the saving grace of
.Ux
and even if you do not normally use much swap, it can give you more time to
recover from a runaway program before being forced to reboot.
.Pp
Most
.Dx
systems have a single HAMMER root and use PFSs to break it up into
various administrative domains.
All the PFSs share the same allocation layer so there is no longer a need
to size each individual mount.
Instead you should review the
.Xr hammer 8
manual page and use the 'hammer viconfig' facility to adjust snapshot
retention and other parameters.
By default
HAMMER keeps 60 days worth of snapshots.
Usually snapshots are not desired on PFSs such as
.Pa /usr/obj
or
.Pa /tmp
since data on these partitions cycles a lot.
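For reference, the configuration edited via 'hammer viconfig' is a set of
simple directive lines; the sketch below shows approximately what a default
policy (including the 60 day snapshot retention mentioned above) looks like,
though the exact directives present on your system may differ:
.Bd -literal -offset indent
snapshots 1d 60d
prune     1d 5m
rebalance 1d 5m
reblock   1d 5m
recopy    30d 10m
.Ed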
.Pp
If a very large work area is desired it is often beneficial to
configure it as a separate HAMMER mount.  If it is integrated into
the root mount it should at least be its own HAMMER PFS.
We recommend naming the large work area
.Pa /build .
Similarly if a machine is going to have a large number of users
you might want to separate your
.Pa /home
out as well.
.Pp
A number of run-time
.Xr mount 8
options exist that can help you tune the system.
The most obvious and most dangerous one is
.Cm async .
Do not ever use it; it is far too dangerous.
A less dangerous and more
useful
.Xr mount 8
option is called
.Cm noatime .
.Ux
filesystems normally update the last-accessed time of a file or
directory whenever it is accessed.
This operation is handled in
.Dx
with a delayed write and normally does not create a burden on the system.
However, if your system is accessing a huge number of files on a continuing
basis the buffer cache can wind up getting polluted with atime updates,
creating a burden on the system.
For example, if you are running a heavily
loaded web site, or a news server with lots of readers, you might want to
consider turning off atime updates on your larger partitions with this
.Xr mount 8
option.
You should not gratuitously turn off atime updates everywhere.
For example, the
.Pa /var
filesystem customarily
holds mailboxes, and atime (in combination with mtime) is used to
determine whether a mailbox has new mail.
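For example, to remount a busy work area without atime updates you might
run something like the following (the mount point is illustrative):
.Bd -literal -offset indent
mount -u -o noatime /build
.Ed
The
.Cm noatime
keyword may also be added to the options field of the filesystem's
.Xr fstab 5
entry to make the change permanent.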
.Sh STRIPING DISKS
In larger systems you can stripe partitions from several drives together
to create a much larger overall partition.
Striping can also improve
the performance of a filesystem by splitting I/O operations across two
or more disks.
The
.Xr vinum 8 ,
.Xr lvm 8 ,
and
.Xr dm 8
subsystems may be used to create simple striped filesystems.
We have deprecated
.Xr ccd 4 .
Generally
speaking, striping smaller partitions such as the root and
.Pa /var/tmp ,
or essentially read-only partitions such as
.Pa /usr
is a complete waste of time.
You should only stripe partitions that require serious I/O performance.
We recommend that such partitions be completely separate mounts
and not use the same storage media as your root mount.
.Pp
To reiterate that last point: do not stripe your boot+root.
Just about everything in there will be cached in system memory anyway.
Neither do we recommend RAIDing your root.
If robustness is needed, placing your boot, swap, and root on an SSD
(which has about the same MTBF as your motherboard) and then RAIDing
everything which requires significant amounts of storage should be
sufficient.
There isn't much point making the boot/swap/root storage even more redundant
when the motherboard itself has no redundancy.
When a high level of total system redundancy is required you need to be
thinking more about having multiple physical machines that back each
other up.
.Pp
When striping multiple disks always partition in multiples of at
least 8 megabytes and use at least a 128KB stripe.
A 256KB stripe is probably even better.
This will avoid mis-aligning HAMMER big-blocks (which are 8MB)
and prevent a single I/O cluster from crossing a stripe boundary.
.Dx
will issue a significant amount of read-ahead, upwards of a megabyte
or more if it determines accesses are linear enough, which is
sufficient to issue concurrent I/O across multiple stripes.
.Sh SYSCTL TUNING
.Xr sysctl 8
variables permit system behavior to be monitored and controlled at
run-time.
Some sysctls simply report on the behavior of the system; others allow
the system behavior to be modified;
some may be set at boot time using
.Xr rc.conf 5 ,
but most will be set via
.Xr sysctl.conf 5 .
There are several hundred sysctls in the system, including many that appear
to be candidates for tuning but actually are not.
In this document we will only cover the ones that have the greatest effect
on the system.
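To illustrate the mechanics (the variable used here is just an example),
a sysctl may be inspected and changed at run-time with
.Xr sysctl 8 ,
and made persistent across reboots with an entry in
.Xr sysctl.conf 5 :
.Bd -literal -offset indent
# report the current value
sysctl vfs.write_behind

# change it for the running system
sysctl vfs.write_behind=0

# /etc/sysctl.conf entry, applied at boot
vfs.write_behind=0
.Ed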
.Pp
The
.Va kern.ipc.shm_use_phys
sysctl defaults to 1 (on) and may be set to 0 (off) or 1 (on).
Setting
this parameter to 1 will cause all System V shared memory segments to be
mapped to unpageable physical RAM.
This feature only has an effect if you
are either (A) mapping small amounts of shared memory across many (hundreds)
of processes, or (B) mapping large amounts of shared memory across any
number of processes.
This feature allows the kernel to remove a great deal
of internal memory management page-tracking overhead at the cost of wiring
the shared memory into core, making it unswappable.
.Pp
The
.Va vfs.write_behind
sysctl defaults to 1 (on).  This tells the filesystem to issue media
writes as full clusters are collected, which typically occurs when writing
large sequential files.  The idea is to avoid saturating the buffer
cache with dirty buffers when it would not benefit I/O performance.  However,
this may stall processes and under certain circumstances you may wish to turn
it off.
.Pp
The
.Va vfs.hirunningspace
sysctl determines how much outstanding write I/O may be queued to
disk controllers system wide at any given instant.  The default is
usually sufficient but on machines with lots of disks you may want to bump
it up to four or five megabytes.  Note that setting too high a value
(exceeding the buffer cache's write threshold) can lead to extremely
bad clustering performance.  Do not set this value arbitrarily high!  Also,
higher write queueing values may add latency to reads occurring at the same
time.
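For example, on a machine with many disks you might raise this limit to
roughly five megabytes by placing the following in
.Xr sysctl.conf 5
(the value shown assumes the limit is specified in bytes and is
illustrative only):
.Bd -literal -offset indent
vfs.hirunningspace=5242880
.Ed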
.Pp
There are various other buffer-cache and VM page cache related sysctls.
We do not recommend modifying these values.
As of
.Fx 4.3 ,
the VM system does an extremely good job tuning itself.
.Pp
The
.Va net.inet.tcp.sendspace
and
.Va net.inet.tcp.recvspace
sysctls are of particular interest if you are running network intensive
applications.
They control the amount of send and receive buffer space
allowed for any given TCP connection.
The default sending buffer is 32K; the default receiving buffer
is 64K.
You can often
improve bandwidth utilization by increasing the default at the cost of
eating up more kernel memory for each connection.
We do not recommend
increasing the defaults if you are serving hundreds or thousands of
simultaneous connections because it is possible to quickly run the system
out of memory due to stalled connections building up.
But if you need
high bandwidth over a small number of connections, especially if you have
gigabit Ethernet, increasing these defaults can make a huge difference.
You can adjust the buffer size for incoming and outgoing data separately.
For example, if your machine is primarily doing web serving you may want
to decrease the recvspace in order to be able to increase the
sendspace without eating too much kernel memory.
Note that the routing table (see
.Xr route 8 )
can be used to introduce route-specific send and receive buffer size
defaults.
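For example, a host pushing a lot of data over a small number of fast
connections might raise both buffers to 64K (the values are illustrative,
not a recommendation) via
.Xr sysctl.conf 5 :
.Bd -literal -offset indent
net.inet.tcp.sendspace=65536
net.inet.tcp.recvspace=65536
.Ed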
.Pp
As an additional management tool you can use pipes in your
firewall rules (see
.Xr ipfw 8 )
to limit the bandwidth going to or from particular IP blocks or ports.
For example, if you have a T1 you might want to limit your web traffic
to 70% of the T1's bandwidth in order to leave the remainder available
for mail and interactive use.
Normally a heavily loaded web server
will not introduce significant latencies into other services even if
the network link is maxed out, but enforcing a limit can smooth things
out and lead to longer term stability.
Many people also enforce artificial
bandwidth limitations in order to ensure that they are not charged for
using too much bandwidth.
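A minimal sketch of such a setup is shown below; the rule number, the
1100Kbit/s figure (roughly 70% of a T1), and the match pattern are all
illustrative, so consult
.Xr ipfw 8
and
.Xr dummynet 4
for the exact syntax your system supports:
.Bd -literal -offset indent
# create a pipe limited to about 70% of a T1
ipfw pipe 1 config bw 1100Kbit/s

# push outbound web replies through the pipe
ipfw add 1000 pipe 1 tcp from any 80 to any out
.Ed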
.Pp
Setting the send or receive TCP buffer to values larger than 65535 will result
in a marginal performance improvement unless both hosts support the window
scaling extension of the TCP protocol, which is controlled by the
.Va net.inet.tcp.rfc1323
sysctl.
These extensions should be enabled and the TCP buffer size should be set
to a value larger than 65536 in order to obtain good performance from
certain types of network links; specifically, gigabit WAN links and
high-latency satellite links.
RFC 1323 support is enabled by default.
.Pp
The
.Va net.inet.tcp.always_keepalive
sysctl determines whether or not the TCP implementation should attempt
to detect dead TCP connections by intermittently delivering
.Dq keepalives
on the connection.
By default, this is enabled for all applications; if it is set to 0, only
applications that specifically request keepalives will use them.
In most environments, TCP keepalives will improve the management of
system state by expiring dead TCP connections, particularly for
systems serving dialup users who may not always terminate individual
TCP connections before disconnecting from the network.
However, in some environments, temporary network outages may be
incorrectly identified as dead sessions, resulting in unexpectedly
terminated TCP connections.
In such environments, setting the sysctl to 0 may reduce the occurrence of
TCP session disconnections.
.Pp
The
.Va net.inet.tcp.delayed_ack
TCP feature is largely misunderstood.  Historically speaking this feature
was designed to allow the acknowledgement of transmitted data to be returned
along with the response.  For example, when you type over a remote shell
the acknowledgement of the character you send can be returned along with the
data representing the echo of the character.   With delayed acks turned off
the acknowledgement may be sent in its own packet before the remote service
has a chance to echo the data it just received.  This same concept also
applies to any interactive protocol (e.g. SMTP, WWW, POP3) and can cut the
number of tiny packets flowing across the network in half.   The
.Dx
delayed-ack implementation also follows the TCP protocol rule that
at least every other packet be acknowledged even if the standard 100ms
timeout has not yet passed.  Normally the worst a delayed ack can do is
slightly delay the teardown of a connection, or slightly delay the ramp-up
of a slow-start TCP connection.  While we are not certain, we believe that
the several FAQs related to packages such as SAMBA and SQUID which advise
turning off delayed acks are referring to the slow-start issue.
.Pp
The
.Va net.inet.tcp.inflight_enable
sysctl turns on bandwidth delay product limiting for all TCP connections.
The system will attempt to calculate the bandwidth delay product for each
connection and limit the amount of data queued to the network to just the
amount required to maintain optimum throughput.  This feature is useful
if you are serving data over modems, GigE, or high speed WAN links (or
any other link with a high bandwidth*delay product), especially if you are
also using window scaling or have configured a large send window.  If
you enable this option you should also be sure to set
.Va net.inet.tcp.inflight_debug
to 0 (disable debugging), and for production use setting
.Va net.inet.tcp.inflight_min
to at least 6144 may be beneficial.  Note, however, that setting high
minimums may effectively disable bandwidth limiting depending on the link.
The limiting feature reduces the amount of data built up in intermediate
router and switch packet queues as well as reduces the amount of data built
up in the local host's interface queue.  With fewer packets queued up,
interactive connections, especially over slow modems, will also be able
to operate with lower round trip times.  However, note that this feature
only affects data transmission (uploading / server-side).  It does not
affect data reception (downloading).
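For example, to enable the feature for production use with the settings
suggested above (still only a starting point; tune for your own links),
the following could be placed in
.Xr sysctl.conf 5 :
.Bd -literal -offset indent
net.inet.tcp.inflight_enable=1
net.inet.tcp.inflight_debug=0
net.inet.tcp.inflight_min=6144
.Ed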
.Pp
Adjusting
.Va net.inet.tcp.inflight_stab
is not recommended.
This parameter defaults to 20, representing 2 maximal packets added
to the bandwidth delay product window calculation.  The additional
window is required to stabilize the algorithm and improve responsiveness
to changing conditions, but it can also result in higher ping times
over slow links (though still much lower than you would get without
the inflight algorithm).  In such cases you may
wish to try reducing this parameter to 15, 10, or 5, and you may also
have to reduce
.Va net.inet.tcp.inflight_min
(for example, to 3500) to get the desired effect.  Reducing these parameters
should be done as a last resort only.
.Pp
The
.Va net.inet.ip.portrange.*
sysctls control the port number ranges automatically bound to TCP and UDP
sockets.  There are three ranges: a low range, a default range, and a
high range, selectable via an IP_PORTRANGE
.Fn setsockopt
call.
Most network programs use the default range which is controlled by
.Va net.inet.ip.portrange.first
and
.Va net.inet.ip.portrange.last ,
which default to 1024 and 5000, respectively.  Bound port ranges are
used for outgoing connections and it is possible to run the system out
of ports under certain circumstances.  This most commonly occurs when you are
running a heavily loaded web proxy.  The port range is not an issue
when running servers which handle mainly incoming connections, such as a
normal web server, or which have a limited number of outgoing connections,
such as a mail relay.  For situations where you may run yourself out of
ports we recommend increasing
.Va net.inet.ip.portrange.last
modestly.  A value of 10000 or 20000 or 30000 may be reasonable.  You should
also consider firewall effects when changing the port range.  Some firewalls
may block large ranges of ports (usually low-numbered ports) and expect systems
to use higher ranges of ports for outgoing connections.  For this reason
we do not recommend that
.Va net.inet.ip.portrange.first
be lowered.
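For example, a heavily loaded web proxy that keeps running out of ephemeral
ports might raise the upper bound (the value shown is illustrative) in
.Xr sysctl.conf 5 :
.Bd -literal -offset indent
net.inet.ip.portrange.last=20000
.Ed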
.Pp
The
.Va kern.ipc.somaxconn
sysctl limits the size of the listen queue for accepting new TCP connections.
The default value of 128 is typically too low for robust handling of new
connections in a heavily loaded web server environment.
For such environments,
we recommend increasing this value to 1024 or higher.
The service daemon
may itself limit the listen queue size (e.g.\&
.Xr sendmail 8 ,
apache) but will
often have a directive in its configuration file to adjust the queue size up.
Larger listen queues also do a better job of fending off denial of service
attacks.
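A busy web server might therefore carry an entry such as the following in
.Xr sysctl.conf 5
(1024 is the figure suggested above; adjust as needed):
.Bd -literal -offset indent
kern.ipc.somaxconn=1024
.Ed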
.Pp
The
.Va kern.maxfiles
sysctl determines how many open files the system supports.
The default is
typically a few thousand but you may need to bump this up to ten or twenty
thousand if you are running databases or large descriptor-heavy daemons.
The read-only
.Va kern.openfiles
sysctl may be interrogated to determine the current number of open files
on the system.
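For example, to compare current usage against the limit and then raise the
limit for a descriptor-heavy workload (20000 is merely an illustrative
figure):
.Bd -literal -offset indent
# current usage versus the configured limit
sysctl kern.openfiles kern.maxfiles

# /etc/sysctl.conf entry raising the limit
kern.maxfiles=20000
.Ed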
.Pp
The
.Va vm.swap_idle_enabled
sysctl is useful in large multi-user systems where you have lots of users
entering and leaving the system and lots of idle processes.
Such systems
tend to generate a great deal of continuous pressure on free memory reserves.
Turning this feature on and adjusting the swapout hysteresis (in idle
seconds) via
.Va vm.swap_idle_threshold1
and
.Va vm.swap_idle_threshold2
allows you to depress the priority of pages associated with idle processes
more quickly than the normal pageout algorithm.
This gives a helping hand
to the pageout daemon.
Do not turn this option on unless you need it,
because the tradeoff you are making is to essentially pre-page memory sooner
rather than later, eating more swap and disk bandwidth.
In a small system
this option will have a detrimental effect but in a large system that is
already doing moderate paging this option allows the VM system to stage
whole processes into and out of memory more easily.
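If you do decide you need it, the relevant
.Xr sysctl.conf 5
entries might look roughly like the following (the thresholds are in idle
seconds and the values shown are illustrative only):
.Bd -literal -offset indent
vm.swap_idle_enabled=1
vm.swap_idle_threshold1=2
vm.swap_idle_threshold2=10
.Ed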
.Sh LOADER TUNABLES
Some aspects of the system behavior may not be tunable at runtime because
memory allocations they perform must occur early in the boot process.
To change loader tunables, you must set their values in
.Xr loader.conf 5
and reboot the system.
.Pp
.Va kern.maxusers
controls the scaling of a number of static system tables, including defaults
for the maximum number of open files, sizing of network memory resources, etc.
On
.Dx ,
.Va kern.maxusers
is automatically sized at boot based on the amount of memory available in
the system, and may be determined at run-time by inspecting the value of the
read-only
.Va kern.maxusers
sysctl.
Some sites will require larger or smaller values of
.Va kern.maxusers
and may set it as a loader tunable; values of 64, 128, and 256 are not
uncommon.
We do not recommend going above 256 unless you need a huge number
of file descriptors; many of the tunable values set to their defaults by
.Va kern.maxusers
may be individually overridden at boot-time or run-time as described
elsewhere in this document.
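For example, a site that has decided it needs the larger tables could add
the following to
.Xr loader.conf 5
(256 is simply one of the common values mentioned above):
.Bd -literal -offset indent
kern.maxusers="256"
.Ed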
.Pp
The
.Va kern.dfldsiz
and
.Va kern.dflssiz
tunables set the default soft limits for process data and stack size
respectively.
Processes may increase these up to the hard limits by calling
.Xr setrlimit 2 .
The
.Va kern.maxdsiz ,
.Va kern.maxssiz ,
and
.Va kern.maxtsiz
tunables set the hard limits for process data, stack, and text size
respectively; processes may not exceed these limits.
The
.Va kern.sgrowsiz
tunable controls how much the stack segment will grow when a process
needs to allocate more stack.
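As an illustration only (the sizes are specified in bytes and are not
recommendations), raising the default and maximum data segment sizes might
look like this in
.Xr loader.conf 5 :
.Bd -literal -offset indent
# 256MB default data segment soft limit (illustrative)
kern.dfldsiz="268435456"
# 1GB data segment hard limit (illustrative)
kern.maxdsiz="1073741824"
.Ed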
.Pp
.Va kern.ipc.nmbclusters
may be adjusted to increase the number of network mbufs the system is
willing to allocate.
Each cluster represents approximately 2K of memory,
so a value of 1024 represents 2M of kernel memory reserved for network
buffers.
You can do a simple calculation to figure out how many you need.
If you have a web server which maxes out at 1000 simultaneous connections,
and each connection eats a 16K receive and 16K send buffer, you need
approximately 32MB worth of network buffers to deal with it.
A good rule of
thumb is to multiply by 2, so 32MB x 2 = 64MB, and 64MB / 2K = 32768 clusters.
So for this case
you would want to set
.Va kern.ipc.nmbclusters
to 32768.
We recommend values between
1024 and 4096 for machines with moderate amounts of memory, and between 4096
and 32768 for machines with greater amounts of memory.
Under no circumstances
should you specify an arbitrarily high value for this parameter, as it could
lead to a boot-time crash.
The
.Fl m
option to
.Xr netstat 1
may be used to observe network cluster use.
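Continuing the example above, the calculated value would be set in
.Xr loader.conf 5
(32768 comes from the worked example, not a blanket recommendation):
.Bd -literal -offset indent
kern.ipc.nmbclusters="32768"
.Ed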
.Sh KERNEL CONFIG TUNING
There are a number of kernel options that you may have to fiddle with in
a large-scale system.
In order to change these options you need to be
able to compile a new kernel from source.
The
.Xr config 8
manual page and the handbook are good starting points for learning how to
do this.
Generally the first thing you do when creating your own custom
kernel is to strip out all the drivers and services you do not use.
Removing things like
.Dv INET6
and drivers you do not have will reduce the size of your kernel, sometimes
by a megabyte or more, leaving more memory available for applications.
.Pp
If your motherboard is AHCI-capable then we strongly recommend turning
on AHCI mode.
.Sh CPU, MEMORY, DISK, NETWORK
The type of tuning you do depends heavily on where your system begins to
bottleneck as load increases.
If your system runs out of CPU (idle times
are perpetually 0%) then you need to consider upgrading the CPU or moving to
an SMP motherboard (multiple CPUs), or perhaps you need to revisit the
programs that are causing the load and try to optimize them.
If your system
is paging to swap a lot you need to consider adding more memory.
If your
system is saturating the disk you typically see high CPU idle times and
total disk saturation.
.Xr systat 1
can be used to monitor this.
There are many solutions to saturated disks:
increasing memory for caching, mirroring disks, distributing operations across
several machines, and so forth.
If disk performance is an issue and you
are using IDE drives, switching to SCSI can help a great deal.
While modern
IDE drives compare with SCSI in raw sequential bandwidth, the moment you
start seeking around the disk SCSI drives usually win.
.Pp
Finally, you might run out of network suds.
The first line of defense for
improving network performance is to make sure you are using switches instead
of hubs, especially these days when switches are almost as cheap.
Hubs
have severe problems under heavy loads due to collision backoff and one bad
host can severely degrade the entire LAN.
Second, optimize the network path
as much as possible.
For example, in
.Xr firewall 7
we describe a firewall protecting internal hosts with a topology where
the externally visible hosts are not routed through it.
Use 100BaseT rather
than 10BaseT, or use 1000BaseT rather than 100BaseT, depending on your needs.
Most bottlenecks occur at the WAN link (e.g.\&
modem, T1, DSL, whatever).
If expanding the link is not an option it may be possible to use the
.Xr dummynet 4
feature to implement peak shaving or other forms of traffic shaping to
prevent the overloaded service (such as web services) from affecting other
services (such as email), or vice versa.
In home installations this could
be used to give interactive traffic (your browser,
.Xr ssh 1
logins) priority
over services you export from your box (web services, email).
.Sh SEE ALSO
.Xr netstat 1 ,
.Xr systat 1 ,
.Xr dummynet 4 ,
.Xr nata 4 ,
.Xr login.conf 5 ,
.Xr rc.conf 5 ,
.Xr sysctl.conf 5 ,
.Xr firewall 7 ,
.Xr hier 7 ,
.Xr boot 8 ,
.Xr ccdconfig 8 ,
.Xr config 8 ,
.Xr disklabel 8 ,
.Xr fsck 8 ,
.Xr ifconfig 8 ,
.Xr ipfw 8 ,
.Xr loader 8 ,
.Xr mount 8 ,
.Xr newfs 8 ,
.Xr route 8 ,
.Xr sysctl 8 ,
.Xr tunefs 8 ,
.Xr vinum 8
.Sh HISTORY
The
.Nm
manual page was originally written by
.An Matthew Dillon
and first appeared
in
.Fx 4.3 ,
May 2001.