.\"	$OpenBSD: sort.1,v 1.54 2015/04/05 14:20:22 millert Exp $
.\"
.\" Copyright (c) 1991, 1993
.\"	The Regents of the University of California.  All rights reserved.
.\"
.\" This code is derived from software contributed to Berkeley by
.\" the Institute of Electrical and Electronics Engineers, Inc.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\" 3. Neither the name of the University nor the names of its contributors
.\"    may be used to endorse or promote products derived from this software
.\"    without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\"     @(#)sort.1	8.1 (Berkeley) 6/6/93
.\"
.Dd $Mdocdate: April 5 2015 $
.Dt SORT 1
.Os
.Sh NAME
.Nm sort
.Nd sort, merge, or sequence check text and binary files
.Sh SYNOPSIS
.Nm sort
.Op Fl bCcdfgHhiMmnRrsuVz
.Op Fl k Ar field1 Ns Op , Ns Ar field2
.Op Fl o Ar output
.Op Fl S Ar size
.Op Fl T Ar dir
.Op Fl t Ar char
.Op Ar
.Sh DESCRIPTION
The
.Nm
utility sorts text and binary files by lines.
A line is a record separated from the subsequent record by a
newline (default) or NUL \'\\0\' character (-z option).
A record can contain any printable or unprintable characters.
Comparisons are based on one or more sort keys extracted from
each line of input, and are performed lexicographically,
according to the current locale's collating rules and the
specified command-line options that can tune the actual
sorting behavior.
By default, if keys are not given,
.Nm
uses entire lines for comparison.
.Pp
If no
.Ar file
is specified, or if
.Ar file
is
.Sq - ,
the standard input is used.
.Pp
The options are as follows:
.Bl -tag -width Ds
.It Fl C , Fl Fl check Ns = Ns Cm silent | quiet
Check that the single input file is sorted.
If it is, exit 0; if it's not, exit 1.
In either case, produce no output.
.It Fl c , Fl Fl check
Like
.Fl C ,
but additionally write a message to
.Em stderr
if the input file is not sorted.
.It Fl m , Fl Fl merge
Merge only; the input files are assumed to be pre-sorted.
If they are not sorted, the output order is undefined.
.It Fl o Ar output , Fl Fl output Ns = Ns Ar output
Write the output to the
.Ar output
file instead of the standard output.
This file can be the same as one of the input files.
.It Fl S Ar size , Fl Fl buffer-size Ns = Ns Ar size
Use a memory buffer no larger than
.Ar size .
The modifiers %, b, K, M, G, T, P, E, Z, and Y can be used.
If no memory limit is specified,
.Nm
may use up to about 90% of available memory.
If the input is too big to fit into the memory buffer,
temporary files are used.
.It Fl s
Stable sort; maintains the original record order of records that have
an equal key.
This is a non-standard feature, but it is widely accepted and used.
.It Fl T Ar dir , Fl Fl temporary-directory Ns = Ns Ar dir
Store temporary files in the directory
.Ar dir .
The default path is the value of the environment variable
.Ev TMPDIR
or
.Pa /var/tmp
if
.Ev TMPDIR
is not defined.
.It Fl u , Fl Fl unique
Unique: suppress all but one in each set of lines having equal keys.
This option implies a stable sort (see
.Fl s ) .
If used with
.Fl C
or
.Fl c ,
.Nm
also checks that there are no lines with duplicate keys.
.El
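.Pp
As a minimal sketch of how some of the options above combine,
assuming a scratch file named
.Pa list.txt
(a placeholder chosen for this illustration):
.Bd -literal -offset indent
# list.txt is a hypothetical scratch file.
cat > list.txt <<EOF
pear
apple
pear
EOF
# -c reports the disorder on stderr and exits 1.
sort -c list.txt || echo "not sorted"
# -u drops the duplicate; -o may name an input file.
sort -u -o list.txt list.txt
# Succeeds silently: sorted, with no duplicate keys.
sort -C -u list.txt && echo "sorted and unique"
.Ed
.Pp
Writing back to an input file with
.Fl o
is safe, whereas redirecting the standard output onto an input file
typically truncates that file before
.Nm
reads it.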
.Pp
The following options override the default ordering rules.
If ordering options appear before the first
.Fl k
option, they apply globally to all sort keys.
When attached to a specific key (see
.Fl k ) ,
the ordering options override all global ordering options for that key.
Note that the ordering options intended to apply globally should not
appear after
.Fl k
or results may be unexpected.
.Bl -tag -width indent
.It Fl d , Fl Fl dictionary-order
Consider only blank spaces and alphanumeric characters in comparisons.
.It Fl f , Fl Fl ignore-case
Consider all lowercase characters that have uppercase
equivalents to be the same for purposes of comparison.
.It Fl g , Fl Fl general-numeric-sort , Fl Fl sort Ns = Ns Cm general-numeric
Sort by general numerical value.
As opposed to
.Fl n ,
this option handles general floating-point numbers.
It has a more
permissive format than that allowed by
.Fl n
but it has a significant performance drawback.
.It Fl h , Fl Fl human-numeric-sort , Fl Fl sort Ns = Ns Cm human-numeric
Sort by numerical value, but take into account the SI suffix,
if present.
Sorts first by numeric sign (negative, zero, or
positive); then by SI suffix (either empty, or `k' or `K', or one
of `MGTPEZY', in that order); and finally by numeric value.
The SI suffix must immediately follow the number.
For example,
.Sq 12345K
sorts before
.Sq 1M ,
because M is
.Dq larger
than K.
This sort option is useful for sorting the output of a single invocation
of the
.Xr df 1
command with the
.Fl h
or
.Fl H
options (human-readable).
.It Fl i , Fl Fl ignore-nonprinting
Ignore all non-printable characters.
.It Fl M , Fl Fl month-sort , Fl Fl sort Ns = Ns Cm month
Sort by month abbreviations.
Unknown strings are considered smaller than valid month names.
.It Fl n , Fl Fl numeric-sort , Fl Fl sort Ns = Ns Cm numeric
An initial numeric string, consisting of optional blank space, optional
minus sign, and zero or more digits (including decimal point)
.\" with
.\" optional radix character and thousands
.\" separator
.\" (as defined in the current locale),
is sorted by arithmetic value.
Leading blank characters are ignored.
.It Fl R , Fl Fl random-sort , Fl Fl sort Ns = Ns Cm random
Sort lines in random order.
This is a random permutation of the inputs with the exception that
equal keys sort together.
It is implemented by hashing the input keys and sorting the hash values.
The hash function is randomized with data from
.Xr arc4random_buf 3 ,
or by file content if one is specified via
.Fl Fl random-source .
If multiple sort fields are specified,
the same random hash function is used for all of them.
.It Fl r , Fl Fl reverse
Sort in reverse order.
.It Fl V , Fl Fl version-sort
Sort version numbers.
The input lines are treated as file names in the form
PREFIX VERSION SUFFIX, where SUFFIX matches the regular expression
"(\e.([A-Za-z~][A-Za-z0-9~]*)?)*".
The files are compared by their prefixes and versions (leading
zeros are ignored in version numbers, see example below).
If an input string does not match the pattern, then it is compared
using the byte compare function.
All string comparisons are performed in the C locale.
.Pp
For example:
.Bd -literal -offset indent
$ ls sort* | sort -V
sort-1.022.tgz
sort-1.23.tgz
sort-1.23.1.tgz
sort-1.024.tgz
sort-1.024.003.
sort-1.024.003.tgz
sort-1.024.07.tgz
sort-1.024.009.tgz
.Ed
.El
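.Pp
The difference between the default ordering and two of the numeric
options above can be sketched as follows; the input values are made up
for illustration:
.Bd -literal -offset indent
# Default: compared character by character, giving 10, 2, 9.
sort <<EOF
9
10
2
EOF
# -n: compared by arithmetic value, giving 2, 9, 10.
sort -n <<EOF
9
10
2
EOF
# -h: SI suffix first, then value, giving 12345K, 1M, 2G.
sort -h <<EOF
2G
1M
12345K
EOF
.Ed
.Pp
The last result follows from the rule that
.Fl h
compares the suffix before the numeric value.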
.Pp
The treatment of field separators can be altered using these options:
.Bl -tag -width indent
.It Fl b , Fl Fl ignore-leading-blanks
Ignore leading blank space when determining the start
and end of a restricted sort key (see
.Fl k ) .
If
.Fl b
is specified before the first
.Fl k
option, it applies globally to all key specifications.
Otherwise,
.Fl b
can be attached independently to each
.Ar field
argument of the key specifications.
Note that
.Fl b
should not appear after
.Fl k ,
and that it has no effect unless key fields are specified.
.It Xo
.Fl k Ar field1 Ns Op , Ns Ar field2 ,
.Fl Fl key Ns = Ns Ar field1 Ns Op , Ns Ar field2
.Xc
Define a restricted sort key that has the starting position
.Ar field1 ,
and optional ending position
.Ar field2
of a key field.
The
.Fl k
option may be specified multiple times,
in which case subsequent keys are compared after earlier keys compare equal.
The
.Fl k
option replaces the obsolete options
.Cm \(pl Ns Ar pos1
and
.Fl Ns Ar pos2 ,
but the old notation is also supported.
.It Fl t Ar char , Fl Fl field-separator Ns = Ns Ar char
Use
.Ar char
as the field separator character.
The initial
.Ar char
is not considered to be part of a field when determining key offsets.
Each occurrence of
.Ar char
is significant (for example,
.Dq Ar charchar
delimits an empty field).
If
.Fl t
is not specified, the default field separator is a sequence of
blank-space characters, and consecutive blank spaces do
.Em not
delimit an empty field; further, the initial blank space
.Em is
considered part of a field when determining key offsets.
To use NUL as the field separator, use
.Fl t
\'\\0\'.
.It Fl z , Fl Fl zero-terminated
Use NUL as the record separator.
By default, records in the files are expected to be separated by
newline characters.
With this option, NUL (\'\\0\') is used as the record separator character.
.El
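.Pp
A short sketch of restricted sort keys with an explicit field separator,
using made-up colon-separated records:
.Bd -literal -offset indent
# Numeric sort on the third field only: alice, bob, carol.
sort -t : -k 3,3n <<EOF
carol:x:1002
alice:x:1000
bob:x:1001
EOF
# Field 2 ties everywhere, so the second key decides,
# in reverse order: carol, bob, alice.
sort -t : -k 2,2 -k 1,1r <<EOF
alice:x:1000
carol:x:1002
bob:x:1001
EOF
.Ed
.Pp
Giving an explicit end position, as in
.Ql -k 3,3 ,
keeps the remainder of the line out of the key.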
.Pp
Other options:
.Bl -tag -width indent
.It Fl Fl batch-size Ns = Ns Ar num
Specify the maximum number of files that can be opened by
.Nm
at once.
This option affects behavior when there are many input files or when
temporary files are used.
The minimum value is 2.
The default value is 16.
.It Fl Fl compress-program Ns = Ns Ar program
Use
.Ar program
to compress temporary files.
When invoked with no arguments,
.Ar program
must compress standard input to standard output.
When called with the
.Fl d
option, it must decompress standard input to standard output.
If
.Ar program
fails,
.Nm
will exit with an error.
The
.Xr compress 1
and
.Xr gzip 1
utilities meet these requirements.
.It Fl Fl debug
Print some extra information about the sorting process to the
standard output.
.It Fl Fl files0-from Ns = Ns Ar filename
Take the input file list from the file
.Ar filename .
The file names must be separated by NUL
(like the output produced by the command
.Dq find ... -print0 ) .
.It Fl Fl heapsort
Try to use heap sort, if the sort specifications allow.
This sort algorithm cannot be used with
.Fl u
or
.Fl s .
.It Fl Fl help
Print the help text and exit.
.It Fl Fl mergesort , Fl H
Use mergesort.
This is a universal algorithm that can always be used,
but it is not always the fastest.
.It Fl Fl mmap
Try to use the file memory mapping system call.
It may increase speed in some cases.
.It Fl Fl qsort
Try to use quick sort, if the sort specifications allow.
This sort algorithm cannot be used with
.Fl u
or
.Fl s .
.It Fl Fl radixsort
Try to use radix sort, if the sort specifications allow.
The radix sort can only be used for trivial locales (C and POSIX),
and it cannot be used for numeric or month sort.
Radix sort is very fast and stable.
.It Fl Fl random-source Ns = Ns Ar filename
For random sort, the contents of
.Ar filename
are used as the source of the
.Sq seed
data for the hash function.
Two invocations of random sort with the same seed data will
produce the same result if the input is also identical.
By default, the
.Xr arc4random_buf 3
function is used instead.
.It Fl Fl version
Print the version and exit.
.El
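.Pp
As a sketch of
.Fl Fl files0-from ,
assuming two made-up input files and a list file named
.Pa names.list
(all three names are placeholders):
.Bd -literal -offset indent
# Hypothetical inputs; one name contains a blank.
echo pear > notes.txt
echo apple > "to do.txt"
# NUL-separated names survive blanks in file names.
find . -name "*.txt" -print0 > names.list
# Sorts the combined contents: apple, pear.
sort --files0-from=names.list
.Ed
.Pp
This avoids passing file names through the shell's word splitting,
which would break the second name in two.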
.Pp
A field is defined as a maximal sequence of characters other than the
field separator and record separator
.Pq newline by default .
Initial blank spaces are included in the field unless
.Fl b
has been specified;
the first blank space of a sequence of blank spaces acts as the field
separator and is included in the field (unless
.Fl t
is specified).
For example, by default all blank spaces at the beginning of a line are
considered to be part of the first field.
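.Pp
The effect of leading blanks on a key can be sketched as follows,
assuming the C locale, where a blank sorts before any letter
(the records are made up for illustration):
.Bd -literal -offset indent
# Field 2 keeps its leading blanks, so the record with
# two blanks before b sorts ahead of the one with a.
LC_ALL=C sort -k 2 <<EOF
x a
x  b
EOF
# With the b modifier the blanks are skipped: a, then b.
LC_ALL=C sort -k 2b <<EOF
x  b
x a
EOF
.Ed
.Pp
Keys built from columns padded with a varying number of blanks
therefore usually want the
.Cm b
modifier.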
.Pp
Fields are specified by the
.Fl k Ar field1 Ns Op , Ns Ar field2
option.
If
.Ar field2
is missing, the end of the key defaults to the end of the line.
.Pp
The arguments
.Ar field1
and
.Ar field2
have the form
.Em m.n
.Em (m,n > 0)
and can be followed by one or more of the modifiers
.Cm b , d , f , i ,
.Cm n , g , M
and
.Cm r ,
which correspond to the options discussed above.
When
.Cm b
is specified it applies only to
.Ar field1
or
.Ar field2
where it is specified while the rest of the modifiers
apply to the whole key field regardless of whether they are
specified only with
.Ar field1
or
.Ar field2
or both.
A
.Ar field1
position specified by
.Em m.n
is interpreted as the
.Em n Ns th
character from the beginning of the
.Em m Ns th
field.
A missing
.Em \&.n
in
.Ar field1
means
.Ql \&.1 ,
indicating the first character of the
.Em m Ns th
field; if the
.Fl b
option is in effect,
.Em n
is counted from the first non-blank character in the
.Em m Ns th
field;
.Em m Ns \&.1b
refers to the first non-blank character in the
.Em m Ns th
field.
.No 1\&. Ns Em n
refers to the
.Em n Ns th
character from the beginning of the line;
if
.Em n
is greater than the length of the line, the field is taken to be empty.
.Pp
.Em n Ns th
positions are always counted from the field beginning, even if the field
is shorter than the number of specified positions.
Thus, the key can actually start from a position in a subsequent field.
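.Pp
A sketch of character offsets within a field, using made-up
four-character records:
.Bd -literal -offset indent
# The key is characters 3 through 4 of field 1, that is
# 12, 03 and 07, so the output is cd03, ef07, ab12.
sort -k 1.3,1.4 <<EOF
ab12
cd03
ef07
EOF
.Ed
.Pp
Because the key ends at position 1.4, nothing after the fourth
character takes part in the key comparison.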
457.Pp
458A
459.Ar field2
460position specified by
461.Em m.n
462is interpreted as the
463.Em n Ns th
464character (including separators) from the beginning of the
465.Em m Ns th
466field.
467A missing
468.Em \&.n
469indicates the last character of the
470.Em m Ns th
471field;
472.Em m
473= \&0
474designates the end of a line.
475Thus the option
476.Fl k Ar v.x,w.y
477is synonymous with the obsolete option
478.Cm \(pl Ns Ar v-\&1.x-\&1
479.Fl Ns Ar w-\&1.y ;
480when
481.Em y
482is omitted,
483.Fl k Ar v.x,w
484is synonymous with
485.Cm \(pl Ns Ar v-\&1.x-\&1
486.Fl Ns Ar w\&.0 .
487The obsolete
488.Cm \(pl Ns Ar pos1
489.Fl Ns Ar pos2
490option is still supported, except for
491.Fl Ns Ar w\&.0b ,
492which has no
493.Fl k
494equivalent.
.Sh ENVIRONMENT
.Bl -tag -width Fl
.It Ev GNUSORT_NUMERIC_COMPATIBILITY
If defined,
.Fl t
will not override the locale numeric symbols, that is, thousand
separators and decimal separators.
By default, if
.Fl t
is specified with the same symbol as the thousand separator or decimal point,
the symbol will be treated as the field separator.
Older behavior was less definite: the symbol was treated as both field
separator and numeric separator, simultaneously.
This environment variable enables the old behavior.
.It Ev LANG
Used as a last resort to determine different kinds of locale-specific
behavior if neither the respective environment variable nor
.Ev LC_ALL
is set.
.It Ev LC_ALL
Locale settings that override all of the other locale settings.
This environment variable can be used to set all these settings
to the same value at once.
.It Ev LC_COLLATE
Locale settings to be used to determine the collation for
sorting records.
.It Ev LC_CTYPE
Locale settings to be used for case conversion and classification
of characters, that is, which characters are considered
whitespace, etc.
.It Ev LC_MESSAGES
Locale settings that determine the language of output messages
that
.Nm
prints out.
.It Ev LC_NUMERIC
Locale settings that determine the number format used in numeric sort.
.It Ev LC_TIME
Locale settings that determine the month format used in month sort.
.It Ev TMPDIR
Path to the directory in which temporary files will be stored.
Note that
.Ev TMPDIR
may be overridden by the
.Fl T
option.
.El
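.Pp
As a sketch of how the locale variables affect collation, the following
forces the C locale for a single invocation; the input is made up:
.Bd -literal -offset indent
# C locale: byte order, uppercase first, giving B, a, c.
LC_ALL=C sort <<EOF
a
B
c
EOF
# -f folds case before comparing, giving a, B, c.
LC_ALL=C sort -f <<EOF
a
B
c
EOF
.Ed
.Pp
Setting
.Ev LC_ALL
on the command line affects only that invocation and leaves the
rest of the shell session unchanged.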
.Sh FILES
.Bl -tag -width Pa -compact
.It Pa /var/tmp/.bsdsort.PID.*
Temporary files.
.El
.Sh EXIT STATUS
The
.Nm
utility exits with one of the following values:
.Pp
.Bl -tag -width Ds -offset indent -compact
.It 0
The input files were successfully sorted or, if used with
.Fl C
or
.Fl c ,
the input file already met the sorting criteria.
.It 1
On disorder (or non-uniqueness) with the
.Fl C
or
.Fl c
options.
.It 2
An error occurred.
.El
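.Pp
A script can tell these three outcomes apart as sketched below;
.Pa data.txt
is a hypothetical input file, created here out of order:
.Bd -literal -offset indent
cat > data.txt <<EOF
b
a
EOF
# Capture the status without aborting under "set -e".
rc=0
sort -C data.txt || rc=$?
case $rc in
0) echo "already sorted" ;;
1) echo "out of order" ;;
*) echo "sort failed" ;;
esac
.Ed
.Pp
With the input shown, the second branch is taken.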
.Sh SEE ALSO
.Xr comm 1 ,
.Xr join 1 ,
.Xr uniq 1
.Sh STANDARDS
The
.Nm
utility is compliant with the
.St -p1003.1-2008
specification.
.Pp
The flags
.Op Fl gHhiMRSsTVz
are extensions to that specification.
.Pp
All long options are extensions to the specification.
Some are provided for compatibility with GNU
.Nm ;
others are specific to this implementation.
.Pp
Some implementations of
.Nm
honor the
.Fl b
option even when no key fields are specified.
This implementation follows historic practice and
.St -p1003.1-2008
in only honoring
.Fl b
when it precedes a key field.
.Pp
The historic practice of allowing the
.Fl o
option to appear after the
.Ar file
is supported for compatibility with older versions of
.Nm .
.Pp
The historic key notations
.Cm \(pl Ns Ar pos1
and
.Fl Ns Ar pos2
are supported for compatibility with older versions of
.Nm
but their use is highly discouraged.
.Sh HISTORY
A
.Nm
command appeared in
.At v3 .
.Sh AUTHORS
.An Gabor Kovesdan Aq Mt gabor@FreeBSD.org
.An Oleg Moskalenko Aq Mt mom040267@gmail.com
.Sh CAVEATS
This implementation of
.Nm
has no limits on input line length (other than those imposed by available
memory) or any restrictions on bytes allowed within lines.
.Pp
Performance depends heavily on locale settings,
efficient choice of sort keys and key complexity.
The fastest sort is with the C locale, on whole lines, with option
.Fl s .
In general, the C locale is the fastest, followed by single-byte
locales with multi-byte locales being the slowest.
The correct collation order is respected in all cases.
For key specifications, the simpler the lines are to process,
the faster the sort will be.
.Pp
When sorting by arithmetic value, using
.Fl n
results in much better performance than
.Fl g
so its use is encouraged whenever possible.
641so its use is encouraged whenever possible.
642