.\"	$OpenBSD: sort.1,v 1.45 2015/03/19 13:51:10 jmc Exp $
.\"	$FreeBSD: head/usr.bin/sort/sort.1.in 281123 2015-04-05 22:22:43Z pfg $
.\"
.\" Copyright (c) 1991, 1993
.\"	The Regents of the University of California.  All rights reserved.
.\"
.\" This code is derived from software contributed to Berkeley by
.\" the Institute of Electrical and Electronics Engineers, Inc.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\" 3. Neither the name of the University nor the names of its contributors
.\"    may be used to endorse or promote products derived from this software
.\"    without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\"     @(#)sort.1	8.1 (Berkeley) 6/6/93
.\"
.Dd December 2, 2019
.Dt SORT 1
.Os
.Sh NAME
.Nm sort
.Nd sort or merge records (lines) of text and binary files
.Sh SYNOPSIS
.Nm
.Bk -words
.Op Fl bcCdfghiRMmnrsuVz
.Sm off
.Op Fl k\ \& Ar field1 Op , Ar field2
.Sm on
.Op Fl S Ar memsize
.Ek
.Op Fl T Ar dir
.Op Fl t Ar char
.Op Fl o Ar output
.Op Ar file ...
.Nm
.Fl Fl help
.Nm
.Fl Fl version
.Sh DESCRIPTION
The
.Nm
utility sorts text and binary files by lines.
A line is a record separated from the subsequent record by a
newline (default) or NUL \'\\0\' character (-z option).
A record can contain any printable or unprintable characters.
Comparisons are based on one or more sort keys extracted from
each line of input, and are performed lexicographically,
according to the current locale's collating rules and the
specified command-line options that can tune the actual
sorting behavior.
By default, if keys are not given,
.Nm
uses entire lines for comparison.
.Pp
The command line options are as follows:
.Bl -tag -width Ds
.It Fl c , Fl Fl check , Fl C , Fl Fl check=silent|quiet
Check that the single input file is sorted.
If the file is not sorted,
.Nm
produces the appropriate error messages and exits with code 1;
otherwise it exits with code 0.
If
.Fl C
or
.Fl Fl check=silent
is specified,
.Nm
produces no output.
This is a "silent" version of
.Fl c .
.It Fl m , Fl Fl merge
Merge only.
The input files are assumed to be pre-sorted.
If they are not sorted, the output order is undefined.
.It Fl o Ar output , Fl Fl output Ns = Ns Ar output
Print the output to the
.Ar output
file instead of the standard output.
This file can be the same as one of the input files.
.It Fl S Ar size , Fl Fl buffer-size Ns = Ns Ar size
Use
.Ar size
for the maximum size of the memory buffer.
Size modifiers %,b,K,M,G,T,P,E,Z,Y can be used.
If a memory limit is not explicitly specified,
.Nm
takes up to about 90% of available memory.
If the file size is too big to fit into the memory buffer,
temporary disk files are used to perform the sorting.
.It Fl T Ar dir , Fl Fl temporary-directory Ns = Ns Ar dir
Store temporary files in the directory
.Ar dir .
The default path is the value of the environment variable
.Ev TMPDIR
or
.Pa /var/tmp
if
.Ev TMPDIR
is not defined.
.It Fl u , Fl Fl unique
Unique keys.
Suppress all lines that have a key that is equal to an already
processed one.
This option, similarly to
.Fl s ,
implies a stable sort.
If used with
.Fl c
or
.Fl C ,
.Nm
also checks that there are no lines with duplicate keys.
.It Fl s
Stable sort.
This option maintains the original record order of records that have
an equal key.
This is a non-standard feature, but it is widely accepted and used.
.It Fl Fl version
Print the version and exit silently.
.It Fl Fl help
Print the help text and exit silently.
.El
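.Pp
As a minimal sketch of
.Fl u
combined with
.Fl o
(the file name
.Pa fruits.txt
and its contents are invented for illustration), the following
removes duplicate lines and writes the result back over the input file:
.Bd -literal -offset indent
cat > fruits.txt <<'EOF'
pear
apple
pear
fig
EOF
sort -u -o fruits.txt fruits.txt
cat fruits.txt
.Ed
.Pp
The final
.Xr cat 1
should print the lines apple, fig and pear, in that order.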
.Pp
The following options override the default ordering rules.
When ordering options appear independently of key field
specifications, they apply globally to all sort keys.
When attached to a specific key (see
.Fl k ) ,
the ordering options override all global ordering options for
the key they are attached to.
.Bl -tag -width indent
.It Fl b , Fl Fl ignore-leading-blanks
Ignore leading blank characters when comparing lines.
.It Fl d , Fl Fl dictionary-order
Consider only blank spaces and alphanumeric characters in comparisons.
.It Fl f , Fl Fl ignore-case
Convert all lowercase characters to their uppercase equivalent
before comparison, that is, perform case-independent sorting.
.It Fl g , Fl Fl general-numeric-sort , Fl Fl sort=general-numeric
Sort by general numerical value.
As opposed to
.Fl n ,
this option handles general floating points.
It has a more
permissive format than that allowed by
.Fl n
but it has a significant performance drawback.
.It Fl h , Fl Fl human-numeric-sort , Fl Fl sort=human-numeric
Sort by numerical value, but take into account the SI suffix,
if present.
Sort first by numeric sign (negative, zero, or
positive); then by SI suffix (either empty, or `k' or `K', or one
of `MGTPEZY', in that order); and finally by numeric value.
The SI suffix must immediately follow the number.
For example, '12345K' sorts before '1M', because M is "larger" than K.
This sort option is useful for sorting the output of a single invocation
of the
.Xr df 1
command with the
.Fl h
or
.Fl H
options (human-readable).
.It Fl i , Fl Fl ignore-nonprinting
Ignore all non-printable characters.
.It Fl M , Fl Fl month-sort , Fl Fl sort=month
Sort by month abbreviations.
Unknown strings are considered smaller than the month names.
.It Fl n , Fl Fl numeric-sort , Fl Fl sort=numeric
Sort fields numerically by arithmetic value.
Fields are expected to have optional leading blanks, an
optional minus sign, and zero or more digits (including a decimal point and
possible thousand separators).
.It Fl R , Fl Fl random-sort , Fl Fl sort=random
Sort by a random order.
This is a random permutation of the inputs except that
equal keys sort together.
It is implemented by hashing the input keys and sorting
the hash values.
The hash function is chosen randomly.
The hash function is randomized by
.Cm /dev/random
content, or by file content if it is specified by
.Fl Fl random-source .
Even if multiple sort fields are specified,
the same random hash function is used for all of them.
.It Fl r , Fl Fl reverse
Sort in reverse order.
.It Fl V , Fl Fl version-sort
Sort version numbers.
The input lines are treated as file names in the form
PREFIX VERSION SUFFIX, where SUFFIX matches the regular expression
"(\.([A-Za-z~][A-Za-z0-9~]*)?)*".
The files are compared by their prefixes and versions (leading
zeros are ignored in version numbers, see example below).
If an input string does not match the pattern, then it is compared
using the byte compare function.
All string comparisons are performed in the C locale; the locale
environment setting is ignored.
.Pp
Example:
.Bd -literal -offset indent
$ ls sort* | sort -V
sort-1.022.tgz
sort-1.23.tgz
sort-1.23.1.tgz
sort-1.024.tgz
sort-1.024.003.
sort-1.024.003.tgz
sort-1.024.07.tgz
sort-1.024.009.tgz
.Ed
.El
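.Pp
To illustrate the difference between
.Fl n
and
.Fl h ,
here is a small sketch using made-up input supplied as here-documents:
.Bd -literal -offset indent
# Plain numeric sort: compares arithmetic values, not strings.
sort -n <<'EOF'
10
9
100
EOF

# Human-numeric sort: the suffix is compared before the value.
sort -h <<'EOF'
1M
12345K
512
EOF
.Ed
.Pp
The first command should print 9, 10 and 100; the second should print
512, 12345K and 1M, since an empty suffix sorts before K and K before M.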
.Pp
The treatment of field separators can be altered using these options:
.Bl -tag -width indent
.It Fl b , Fl Fl ignore-leading-blanks
Ignore leading blank space when determining the start
and end of a restricted sort key (see
.Fl k ) .
If
.Fl b
is specified before the first
.Fl k
option, it applies globally to all key specifications.
Otherwise,
.Fl b
can be attached independently to each
.Ar field
argument of the key specifications.
.It Xo
.Fl k Ar field1 Ns Op , Ns Ar field2 ,
.Fl Fl key Ns = Ns Ar field1 Ns Op , Ns Ar field2
.Xc
Define a restricted sort key that has the starting position
.Ar field1 ,
and optional ending position
.Ar field2
of a key field.
The
.Fl k
option may be specified multiple times,
in which case subsequent keys are compared when earlier keys compare equal.
The
.Fl k
option replaces the obsolete options
.Cm \(pl Ns Ar pos1
and
.Fl Ns Ar pos2 ,
but the old notation is also supported.
.It Fl t Ar char , Fl Fl field-separator Ns = Ns Ar char
Use
.Ar char
as a field separator character.
The initial
.Ar char
is not considered to be part of a field when determining key offsets.
Each occurrence of
.Ar char
is significant (for example,
.Dq Ar charchar
delimits an empty field).
If
.Fl t
is not specified, the default field separator is a sequence of
blank space characters, and consecutive blank spaces do
.Em not
delimit an empty field; however, the initial blank space
.Em is
considered part of a field when determining key offsets.
To use NUL as field separator, use
.Fl t
\'\\0\'.
.It Fl z , Fl Fl zero-terminated
Use NUL as record separator.
By default, records in the files are expected to be separated by
newline characters.
With this option, NUL (\'\\0\') is used as a record separator character.
.El
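.Pp
As a rough sketch of
.Fl t
together with
.Fl k
(the colon-separated records are invented for illustration), the
following sorts numerically on the third field only:
.Bd -literal -offset indent
sort -t : -k 3,3n <<'EOF'
carol:x:1002
alice:x:1000
bob:x:1001
EOF
.Ed
.Pp
This should print the alice, bob and carol records, in that order.
Writing the key as
.Ql 3,3
rather than
.Ql 3
stops the key at the end of the third field instead of letting it run
to the end of the line.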
.Pp
Other options:
.Bl -tag -width indent
.It Fl Fl batch-size Ns = Ns Ar num
Specify the maximum number of files that can be opened by
.Nm
at once.
This option affects behavior when having many input files or using
temporary files.
The default value is 16.
.It Fl Fl compress-program Ns = Ns Ar PROGRAM
Use PROGRAM to compress temporary files.
PROGRAM must compress standard input to standard output, when called
without arguments.
When called with argument
.Fl d
it must decompress standard input to standard output.
If PROGRAM fails,
.Nm
exits with an error.
An example of PROGRAM that can be used here is bzip2.
.It Fl Fl random-source Ns = Ns Ar filename
In random sort, the file content is used as the source of the 'seed' data
for the hash function choice.
Two invocations of random sort with the same seed data will use
the same hash function and will produce the same result if the input is
also identical.
By default, file
.Cm /dev/random
is used.
.It Fl Fl debug
Print some extra information about the sorting process to the
standard output.
.It Fl Fl parallel
Set the maximum number of execution threads.
The default equals the number of CPUs.
.It Fl Fl files0-from Ns = Ns Ar filename
Take the input file list from the file
.Ar filename .
The file names must be separated by NUL
(like the output produced by the command "find ... -print0").
.It Fl Fl radixsort
Try to use radix sort, if the sort specifications allow.
The radix sort can only be used for trivial locales (C and POSIX),
and it cannot be used for numeric or month sort.
Radix sort is very fast and stable.
.It Fl Fl mergesort
Use mergesort.
This is a universal algorithm that can always be used,
but it is not always the fastest.
.It Fl Fl qsort
Try to use quick sort, if the sort specifications allow.
This sort algorithm cannot be used with
.Fl u
and
.Fl s .
.It Fl Fl heapsort
Try to use heap sort, if the sort specifications allow.
This sort algorithm cannot be used with
.Fl u
and
.Fl s .
.It Fl Fl mmap
Try to use the file memory mapping system call.
It may increase speed in some cases.
.El
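.Pp
A possible use of
.Fl Fl files0-from ,
sketched under the assumption that the current directory holds some
.Pa *.txt
files to sort together (the list file name
.Pa files.list
is hypothetical):
.Bd -literal -offset indent
# Write a NUL-separated list of input files, then sort them as one input.
find . -name '*.txt' -print0 > files.list
sort --files0-from=files.list
.Ed
.Pp
Because the names are NUL-separated, file names containing blanks or
newlines are passed through intact.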
.Pp
The following operands are available:
.Bl -tag -width indent
.It Ar file
The pathname of a file to be sorted, merged, or checked.
If no
.Ar file
operands are specified, or if a
.Ar file
operand is
.Fl ,
the standard input is used.
.El
.Pp
A field is defined as a maximal sequence of characters other than the
field separator and record separator (newline by default).
Initial blank spaces are included in the field unless
.Fl b
has been specified;
the first blank space of a sequence of blank spaces acts as the field
separator and is included in the field (unless
.Fl t
is specified).
For example, all blank spaces at the beginning of a line are
considered to be part of the first field.
.Pp
Fields are specified by the
.Sm off
.Fl k\ \& Ar field1 Op , Ar field2
.Sm on
command-line option.
If
.Ar field2
is missing, the end of the key defaults to the end of the line.
.Pp
The arguments
.Ar field1
and
.Ar field2
have the form
.Em m.n
.Em (m,n > 0)
and can be followed by one or more of the modifiers
.Cm b , d , f , i ,
.Cm n , g , M
and
.Cm r ,
which correspond to the options discussed above.
When
.Cm b
is specified it applies only to
.Ar field1
or
.Ar field2
where it is specified while the rest of the modifiers
apply to the whole key field regardless of whether they are
specified only with
.Ar field1
or
.Ar field2
or both.
A
.Ar field1
position specified by
.Em m.n
is interpreted as the
.Em n Ns th
character from the beginning of the
.Em m Ns th
field.
A missing
.Em \&.n
in
.Ar field1
means
.Ql \&.1 ,
indicating the first character of the
.Em m Ns th
field; if the
.Fl b
option is in effect,
.Em n
is counted from the first non-blank character in the
.Em m Ns th
field;
.Em m Ns \&.1b
refers to the first non-blank character in the
.Em m Ns th
field.
.No 1\&. Ns Em n
refers to the
.Em n Ns th
character from the beginning of the line;
if
.Em n
is greater than the length of the line, the field is taken to be empty.
.Pp
.Em n Ns th
positions are always counted from the field beginning, even if the field
is shorter than the number of specified positions.
Thus, the key can actually start from a position in a subsequent field.
.Pp
A
.Ar field2
position specified by
.Em m.n
is interpreted as the
.Em n Ns th
character (including separators) from the beginning of the
.Em m Ns th
field.
A missing
.Em \&.n
indicates the last character of the
.Em m Ns th
field;
.Em m
= \&0
designates the end of a line.
Thus the option
.Fl k Ar v.x,w.y
is synonymous with the obsolete option
.Cm \(pl Ns Ar v-\&1.x-\&1
.Fl Ns Ar w-\&1.y ;
when
.Em y
is omitted,
.Fl k Ar v.x,w
is synonymous with
.Cm \(pl Ns Ar v-\&1.x-\&1
.Fl Ns Ar w\&.0 .
The obsolete
.Cm \(pl Ns Ar pos1
.Fl Ns Ar pos2
option is still supported, except for
.Fl Ns Ar w\&.0b ,
which has no
.Fl k
equivalent.
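.Pp
The
.Em m.n
notation can be sketched with a small example (the input lines are
made up for illustration).
Here the key is characters 6 through 7 of the first field, that is,
the two-digit month of a leading YYYY-MM stamp:
.Bd -literal -offset indent
sort -k 1.6,1.7 <<'EOF'
2019-12 foo
2020-01 bar
2018-07 baz
EOF
.Ed
.Pp
The lines should come out ordered by month (01, 07, 12) rather than
by year: 2020-01 bar, 2018-07 baz, 2019-12 foo.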
.Sh ENVIRONMENT
.Bl -tag -width Fl
.It Ev LC_COLLATE
Locale settings to be used to determine the collation for
sorting records.
.It Ev LC_CTYPE
Locale settings to be used for case conversion and classification
of characters, that is, which characters are considered
whitespace, etc.
.It Ev LC_MESSAGES
Locale settings that determine the language of output messages
that
.Nm
prints out.
.It Ev LC_NUMERIC
Locale settings that determine the number format used in numeric sort.
.It Ev LC_TIME
Locale settings that determine the month format used in month sort.
.It Ev LC_ALL
Locale settings that override all of the above locale settings.
This environment variable can be used to set all these settings
to the same value at once.
.It Ev LANG
Used as a last resort to determine different kinds of locale-specific
behavior if neither the respective environment variable nor
.Ev LC_ALL
is set.
.\"%%NLS%%.It Ev NLSPATH
.\"%%NLS%%Path to NLS catalogs.
.It Ev TMPDIR
Path to the directory in which temporary files will be stored.
Note that
.Ev TMPDIR
may be overridden by the
.Fl T
option.
.It Ev GNUSORT_NUMERIC_COMPATIBILITY
If defined,
.Fl t
will not override the locale numeric symbols, that is, thousand
separators and decimal separators.
By default, if we specify
.Fl t
with the same symbol as the thousand separator or decimal point,
the symbol will be treated as the field separator.
Older behavior was less definite; the symbol was treated as both field
separator and numeric separator, simultaneously.
This environment variable enables the old behavior.
.El
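.Pp
As an illustration of how the locale variables affect ordering, the
following sketch forces the C locale for a single invocation (the
input is made up):
.Bd -literal -offset indent
LC_ALL=C sort <<'EOF'
b
A
a
B
EOF
.Ed
.Pp
In the C locale, comparison is by byte value, so the uppercase letters
should come first: A, B, a, b.
Other locales may interleave upper and lower case according to their
collation rules.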
.Sh FILES
.Bl -tag -width Pa -compact
.It Pa /var/tmp/.bsdsort.PID.*
Temporary files.
.It Pa /dev/random
Default seed file for the random sort.
.El
.Sh EXIT STATUS
The
.Nm
utility shall exit with one of the following values:
.Pp
.Bl -tag -width flag -compact
.It 0
Successfully sorted the input files or if used with
.Fl c
or
.Fl C ,
the input file already met the sorting criteria.
.It 1
On disorder (or non-uniqueness) with the
.Fl c
or
.Fl C
options.
.It 2
An error occurred.
.El
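.Pp
A script might distinguish these values roughly as follows (a sketch;
the file name
.Pa data.txt
is hypothetical):
.Bd -literal -offset indent
sort -C data.txt
case $? in
0) echo "already sorted" ;;
1) echo "out of order" ;;
*) echo "error" ;;
esac
.Ed
.Pp
Using
.Fl C
rather than
.Fl c
keeps the check quiet, so only the exit status carries the result.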
.Sh SEE ALSO
.Xr comm 1 ,
.Xr join 1 ,
.Xr uniq 1
.Sh STANDARDS
The
.Nm
utility is compliant with the
.St -p1003.1-2008
specification.
.Pp
The flags
.Op Fl ghRMSsTVz
are extensions to the POSIX specification.
.Pp
All long options are extensions to the specification; some of them are
provided for compatibility with GNU versions and some of them are
specific to this implementation.
.Pp
The old key notations
.Cm \(pl Ns Ar pos1
and
.Fl Ns Ar pos2
come from older versions of
.Nm
and are still supported but their use is highly discouraged.
.Sh HISTORY
A
.Nm
command first appeared in
.At v3 .
.Sh AUTHORS
.An Gabor Kovesdan Aq Mt gabor@FreeBSD.org ,
.Pp
.An Oleg Moskalenko Aq Mt mom040267@gmail.com
.Sh NOTES
This implementation of
.Nm
has no limits on input line length (other than those imposed by available
memory) or any restrictions on bytes allowed within lines.
.Pp
The performance depends highly on locale settings,
efficient choice of sort keys and key complexity.
The fastest sort is with locale C, on whole lines,
with option
.Fl s .
In general, locale C is the fastest, followed by single-byte
locales, with multi-byte locales the slowest;
the correct collation order is always respected.
As for the key specification, the simpler the lines are to
process, the faster the sort will be.
.Pp
When sorting by arithmetic value, using
.Fl n
results in much better performance than
.Fl g
so its use is encouraged
whenever possible.
639