1\input texinfo   @c -*-texinfo-*-
2@c %**start of header (This is for running Texinfo on a region.)
3@setfilename gawk.info
4@settitle The GNU Awk User's Guide
5@c %**end of header (This is for running Texinfo on a region.)
6
7@c inside ifinfo for older versions of texinfo.tex
8@ifinfo
9@c I hope this is the right category
10@dircategory Programming Languages
11@direntry
12* Gawk: (gawk).           A Text Scanning and Processing Language.
13@end direntry
14@end ifinfo
15
16@c @set xref-automatic-section-title
17@c @set DRAFT
18
19@c The following information should be updated here only!
20@c This sets the edition of the document, the version of gawk it
21@c applies to, and when the document was updated.
22@set TITLE Effective AWK Programming
23@set SUBTITLE A User's Guide for GNU Awk
24@set PATCHLEVEL 6
25@set EDITION 1.0.@value{PATCHLEVEL}
26@set VERSION 3.0
27@set UPDATE-MONTH July, 2000
28@iftex
29@set DOCUMENT book
30@end iftex
31@ifinfo
32@set DOCUMENT Info file
33@end ifinfo
34
35@ignore
36Some comments on the layout for TeX.
371. Use at least texinfo.tex 2.159. It contains fixes that
38   are needed to get the footings for draft mode to not appear.
392. I have done A LOT of work to make this look good. There are  `@page' commands
40   and use of `@group ... @end group' in a number of places. If you muck
41   with anything, it's your responsibility not to break the layout.
42@end ignore
43
44@c merge the function and variable indexes into the concept index
45@ifinfo
46@synindex fn cp
47@synindex vr cp
48@end ifinfo
49@iftex
50@syncodeindex fn cp
51@syncodeindex vr cp
52@end iftex
53
54@c If "finalout" is commented out, the printed output will show
55@c black boxes that mark lines that are too long.  Thus, it is
56@c unwise to comment it out when running a master in case there are
57@c overfulls which are deemed okay.
58
59@ifclear DRAFT
60@iftex
61@finalout
62@end iftex
63@end ifclear
64
65@smallbook
66@iftex
67@c @cropmarks
68@end iftex
69
70@ifinfo
71This file documents @code{awk}, a program that you can use to select
72particular records in a file and perform operations upon them.
73
74This is Edition @value{EDITION} of @cite{@value{TITLE}},
75for the @value{VERSION}.@value{PATCHLEVEL} version of the GNU implementation of AWK.
76
77Copyright (C) 1989, 1991, 1992, 1993, 1996-2000 Free Software Foundation, Inc.
78
79Permission is granted to make and distribute verbatim copies of
80this manual provided the copyright notice and this permission notice
81are preserved on all copies.
82
83@ignore
84Permission is granted to process this file through TeX and print the
85results, provided the printed document carries copying permission
86notice identical to this one except for the removal of this paragraph
87(this paragraph not being relevant to the printed manual).
88
89@end ignore
90Permission is granted to copy and distribute modified versions of this
91manual under the conditions for verbatim copying, provided that the entire
92resulting derived work is distributed under the terms of a permission
93notice identical to this one.
94
95Permission is granted to copy and distribute translations of this manual
96into another language, under the above conditions for modified versions,
97except that this permission notice may be stated in a translation approved
98by the Foundation.
99@end ifinfo
100
101@setchapternewpage odd
102
103@titlepage
104@title @value{TITLE}
105@subtitle @value{SUBTITLE}
106@subtitle Edition @value{EDITION}
107@subtitle @value{UPDATE-MONTH}
108@author Arnold D. Robbins
109@ignore
110@sp 1
111@author Based on @cite{The GAWK Manual},
112@author by Robbins, Close, Rubin, and Stallman
113@end ignore
114
115@c Include the Distribution inside the titlepage environment so
116@c that headings are turned off.  Headings on and off do not work.
117
118@page
119@vskip 0pt plus 1filll
120@ifset LEGALJUNK
121The programs and applications presented in this book have been
122included for their instructional value.  They have been tested with care,
123but are not guaranteed for any particular purpose.  The publisher does not
124offer any warranties or representations, nor does it accept any
125liabilities with respect to the programs or applications.
126So there.
127@sp 2
128UNIX is a registered trademark of X/Open, Ltd. @*
129Microsoft, MS, and MS-DOS are registered trademarks, and Windows is a
130trademark of Microsoft Corporation in the United States and other
131countries. @*
132Atari, 520ST, 1040ST, TT, STE, Mega, and Falcon are registered trademarks
133or trademarks of Atari Corporation. @*
134DEC, Digital, OpenVMS, ULTRIX, and VMS, are trademarks of Digital Equipment
135Corporation. @*
136@end ifset
137``To boldly go where no man has gone before'' is a
138Registered Trademark of Paramount Pictures Corporation. @*
139@c sorry, i couldn't resist
140@sp 3
141Copyright @copyright{} 1989, 1991, 1992, 1993, 1996-2000 Free Software Foundation, Inc.
142@sp 2
143
144This is Edition @value{EDITION} of @cite{@value{TITLE}}, @*
145for the @value{VERSION}.@value{PATCHLEVEL} (or later) version of the GNU implementation of AWK.
146
147@sp 2
148Published by:
149
150Free Software Foundation @*
15159 Temple Place --- Suite 330 @*
152Boston, MA  02111-1307 USA @*
153Phone: +1-617-542-5942 @*
154Fax: +1-617-542-2652 @*
155Email: @code{gnu@@gnu.org} @*
156URL: @code{http://www.gnu.org/} @*
157
158@sp 1
159@c this ISBN can change!
160@c This one is correct for gawk 3.0 and edition 1.0 from the FSF
161ISBN 1-882114-26-4 @*
162
163Permission is granted to make and distribute verbatim copies of
164this manual provided the copyright notice and this permission notice
165are preserved on all copies.
166
167Permission is granted to copy and distribute modified versions of this
168manual under the conditions for verbatim copying, provided that the entire
169resulting derived work is distributed under the terms of a permission
170notice identical to this one.
171
172Permission is granted to copy and distribute translations of this manual
173into another language, under the above conditions for modified versions,
174except that this permission notice may be stated in a translation approved
175by the Foundation.
176@sp 2
177Cover art by Etienne Suvasa.
178@end titlepage
179
180@c Thanks to Bob Chassell for directions on doing dedications.
181@iftex
182@headings off
183@page
184@w{ }
185@sp 9
186@center @i{To Miriam, for making me complete.}
187@sp 1
188@center @i{To Chana, for the joy you bring us.}
189@sp 1
190@center @i{To Rivka, for the exponential increase.}
191@sp 1
192@center @i{To Nachum, for the added dimension.}
193@sp 1
194@center @i{To Malka, for the new beginning.}
195@page
196@w{ }
197@page
198@headings on
199@end iftex
200
201@iftex
202@headings off
203@evenheading @thispage@ @ @ @strong{@value{TITLE}} @| @|
204@oddheading  @| @| @strong{@thischapter}@ @ @ @thispage
205@ifset DRAFT
206@evenfooting @today{} @| @emph{DRAFT!} @| Please Do Not Redistribute
207@oddfooting Please Do Not Redistribute @| @emph{DRAFT!} @| @today{}
208@end ifset
209@end iftex
210
211@ifinfo
212@node Top, Preface, (dir), (dir)
213@top General Introduction
214@c Preface or Licensing nodes should come right after the Top
215@c node, in `unnumbered' sections, then the chapter, `What is gawk'.
216
217This file documents @code{awk}, a program that you can use to select
218particular records in a file and perform operations upon them.
219
220This is Edition @value{EDITION} of @cite{@value{TITLE}}, @*
221for the @value{VERSION}.@value{PATCHLEVEL} version of the GNU implementation @*
222of AWK.
223
224@end ifinfo
225
226@menu
227* Preface::                     What this @value{DOCUMENT} is about; brief
228                                history and acknowledgements.
229* What Is Awk::                 What is the @code{awk} language; using this
230                                @value{DOCUMENT}.
231* Getting Started::             A basic introduction to using @code{awk}. How
232                                to run an @code{awk} program. Command line
233                                syntax.
234* One-liners::                  Short, sample @code{awk} programs.
235* Regexp::                      All about matching things using regular
236                                expressions.
237* Reading Files::               How to read files and manipulate fields.
238* Printing::                    How to print using @code{awk}.  Describes the
239                                @code{print} and @code{printf} statements.
240                                Also describes redirection of output.
241* Expressions::                 Expressions are the basic building blocks of
242                                statements.
243* Patterns and Actions::        Overviews of patterns and actions.
244* Statements::                  The various control statements are described
245                                in detail.
246* Built-in Variables::          Built-in Variables
247* Arrays::                      The description and use of arrays. Also
248                                includes array-oriented control statements.
249* Built-in::                    The built-in functions are summarized here.
250* User-defined::                User-defined functions are described in
251                                detail.
252* Invoking Gawk::               How to run @code{gawk}.
253* Library Functions::           A Library of @code{awk} Functions.
254* Sample Programs::             Many @code{awk} programs with complete
255                                explanations.
256* Language History::            The evolution of the @code{awk} language.
257* Gawk Summary::                @code{gawk} Options and Language Summary.
258* Installation::                Installing @code{gawk} under various operating
259                                systems.
260* Notes::                       Something about the implementation of
261                                @code{gawk}.
262* Glossary::                    An explanation of some unfamiliar terms.
263* Copying::                     Your right to copy and distribute @code{gawk}.
264* Index::                       Concept and Variable Index.
265
266* History::                     The history of @code{gawk} and @code{awk}.
267* Manual History::              Brief history of the GNU project and this
268                                @value{DOCUMENT}.
269* Acknowledgements::            Acknowledgements.
270* This Manual::                 Using this @value{DOCUMENT}. Includes sample
271                                input files that you can use.
272* Conventions::                 Typographical Conventions.
273* Sample Data Files::           Sample data files for use in the @code{awk}
274                                programs illustrated in this @value{DOCUMENT}.
275* Names::                       What name to use to find @code{awk}.
276* Running gawk::                How to run @code{gawk} programs; includes
277                                command line syntax.
278* One-shot::                    Running a short throw-away @code{awk} program.
279* Read Terminal::               Using no input files (input from terminal
280                                instead).
281* Long::                        Putting permanent @code{awk} programs in
282                                files.
283* Executable Scripts::          Making self-contained @code{awk} programs.
284* Comments::                    Adding documentation to @code{gawk} programs.
285* Very Simple::                 A very simple example.
286* Two Rules::                   A less simple one-line example with two rules.
287* More Complex::                A more complex example.
288* Statements/Lines::            Subdividing or combining statements into
289                                lines.
290* Other Features::              Other Features of @code{awk}.
291* When::                        When to use @code{gawk} and when to use other
292                                things.
293* Regexp Usage::                How to Use Regular Expressions.
294* Escape Sequences::            How to write non-printing characters.
295* Regexp Operators::            Regular Expression Operators.
296* GNU Regexp Operators::        Operators specific to GNU software.
297* Case-sensitivity::            How to do case-insensitive matching.
298* Leftmost Longest::            How much text matches.
299* Computed Regexps::            Using Dynamic Regexps.
300* Records::                     Controlling how data is split into records.
301* Fields::                      An introduction to fields.
302* Non-Constant Fields::         Non-constant Field Numbers.
303* Changing Fields::             Changing the Contents of a Field.
304* Field Separators::            The field separator and how to change it.
305* Basic Field Splitting::       How fields are split with single characters or
306                                simple strings.
307* Regexp Field Splitting::      Using regexps as the field separator.
308* Single Character Fields::     Making each character a separate field.
309* Command Line Field Separator:: Setting @code{FS} from the command line.
310* Field Splitting Summary::     Some final points and a summary table.
311* Constant Size::               Reading constant width data.
312* Multiple Line::               Reading multi-line records.
313* Getline::                     Reading files under explicit program control
314                                using the @code{getline} function.
315* Getline Intro::               Introduction to the @code{getline} function.
316* Plain Getline::               Using @code{getline} with no arguments.
317* Getline/Variable::            Using @code{getline} into a variable.
318* Getline/File::                Using @code{getline} from a file.
319* Getline/Variable/File::       Using @code{getline} into a variable from a
320                                file.
321* Getline/Pipe::                Using @code{getline} from a pipe.
322* Getline/Variable/Pipe::       Using @code{getline} into a variable from a
323                                pipe.
324* Getline Summary::             Summary Of @code{getline} Variants.
325* Print::                       The @code{print} statement.
326* Print Examples::              Simple examples of @code{print} statements.
327* Output Separators::           The output separators and how to change them.
328* OFMT::                        Controlling Numeric Output With @code{print}.
329* Printf::                      The @code{printf} statement.
330* Basic Printf::                Syntax of the @code{printf} statement.
331* Control Letters::             Format-control letters.
332* Format Modifiers::            Format-specification modifiers.
333* Printf Examples::             Several examples.
334* Redirection::                 How to redirect output to multiple files and
335                                pipes.
336* Special Files::               File name interpretation in @code{gawk}.
337                                @code{gawk} allows access to inherited file
338                                descriptors.
339* Close Files And Pipes::       Closing Input and Output Files and Pipes.
340* Constants::                   String, numeric, and regexp constants.
341* Scalar Constants::            Numeric and string constants.
342* Regexp Constants::            Regular Expression constants.
343* Using Constant Regexps::      When and how to use a regexp constant.
344* Variables::                   Variables give names to values for later use.
345* Using Variables::             Using variables in your programs.
346* Assignment Options::          Setting variables on the command line and a
347                                summary of command line syntax. This is an
348                                advanced method of input.
349* Conversion::                  The conversion of strings to numbers and vice
350                                versa.
351* Arithmetic Ops::              Arithmetic operations (@samp{+}, @samp{-},
352                                etc.)
353* Concatenation::               Concatenating strings.
354* Assignment Ops::              Changing the value of a variable or a field.
355* Increment Ops::               Incrementing the numeric value of a variable.
356* Truth Values::                What is ``true'' and what is ``false''.
357* Typing and Comparison::       How variables acquire types, and how this
358                                affects comparison of numbers and strings with
359                                @samp{<}, etc.
360* Boolean Ops::                 Combining comparison expressions using boolean
361                                operators @samp{||} (``or''), @samp{&&}
362                                (``and'') and @samp{!} (``not'').
363* Conditional Exp::             Conditional expressions select between two
364                                subexpressions under control of a third
365                                subexpression.
366* Function Calls::              A function call is an expression.
367* Precedence::                  How various operators nest.
368* Pattern Overview::            What goes into a pattern.
369* Kinds of Patterns::           A list of all kinds of patterns.
370* Regexp Patterns::             Using regexps as patterns.
371* Expression Patterns::         Any expression can be used as a pattern.
372* Ranges::                      Pairs of patterns specify record ranges.
373* BEGIN/END::                   Specifying initialization and cleanup rules.
374* Using BEGIN/END::             How and why to use BEGIN/END rules.
375* I/O And BEGIN/END::           I/O issues in BEGIN/END rules.
376* Empty::                       The empty pattern, which matches every record.
377* Action Overview::             What goes into an action.
378* If Statement::                Conditionally execute some @code{awk}
379                                statements.
380* While Statement::             Loop until some condition is satisfied.
381* Do Statement::                Do specified action while looping until some
382                                condition is satisfied.
383* For Statement::               Another looping statement, that provides
384                                initialization and increment clauses.
385* Break Statement::             Immediately exit the innermost enclosing loop.
386* Continue Statement::          Skip to the end of the innermost enclosing
387                                loop.
388* Next Statement::              Stop processing the current input record.
389* Nextfile Statement::          Stop processing the current file.
390* Exit Statement::              Stop execution of @code{awk}.
391* User-modified::               Built-in variables that you change to control
392                                @code{awk}.
393* Auto-set::                    Built-in variables where @code{awk} gives you
394                                information.
395* ARGC and ARGV::               Ways to use @code{ARGC} and @code{ARGV}.
396* Array Intro::                 Introduction to Arrays
397* Reference to Elements::       How to examine one element of an array.
398* Assigning Elements::          How to change an element of an array.
399* Array Example::               Basic Example of an Array
400* Scanning an Array::           A variation of the @code{for} statement. It
401                                loops through the indices of an array's
402                                existing elements.
403* Delete::                      The @code{delete} statement removes an element
404                                from an array.
405* Numeric Array Subscripts::    How to use numbers as subscripts in
406                                @code{awk}.
407* Uninitialized Subscripts::    Using Uninitialized variables as subscripts.
408* Multi-dimensional::           Emulating multi-dimensional arrays in
409                                @code{awk}.
410* Multi-scanning::              Scanning multi-dimensional arrays.
411* Calling Built-in::            How to call built-in functions.
412* Numeric Functions::           Functions that work with numbers, including
413                                @code{int}, @code{sin} and @code{rand}.
414* String Functions::            Functions for string manipulation, such as
415                                @code{split}, @code{match}, and
416                                @code{sprintf}.
417* I/O Functions::               Functions for files and shell commands.
418* Time Functions::              Functions for dealing with time stamps.
419* Definition Syntax::           How to write definitions and what they mean.
420* Function Example::            An example function definition and what it
421                                does.
422* Function Caveats::            Things to watch out for.
423* Return Statement::            Specifying the value a function returns.
424* Options::                     Command line options and their meanings.
425* Other Arguments::             Input file names and variable assignments.
426* AWKPATH Variable::            Searching directories for @code{awk} programs.
427* Obsolete::                    Obsolete Options and/or features.
428* Undocumented::                Undocumented Options and Features.
429* Known Bugs::                  Known Bugs in @code{gawk}.
430* Portability Notes::           What to do if you don't have @code{gawk}.
431* Nextfile Function::           Two implementations of a @code{nextfile}
432                                function.
433* Assert Function::             A function for assertions in @code{awk}
434                                programs.
435* Round Function::              A function for rounding if @code{sprintf} does
436                                not do it correctly.
437* Ordinal Functions::           Functions for using characters as numbers and
438                                vice versa.
439* Join Function::               A function to join an array into a string.
440* Mktime Function::             A function to turn a date into a timestamp.
441* Gettimeofday Function::       A function to get formatted times.
442* Filetrans Function::          A function for handling data file transitions.
443* Getopt Function::             A function for processing command line
444                                arguments.
445* Passwd Functions::            Functions for getting user information.
446* Group Functions::             Functions for getting group information.
447* Library Names::               How to best name private global variables in
448                                library functions.
449* Clones::                      Clones of common utilities.
450* Cut Program::                 The @code{cut} utility.
451* Egrep Program::               The @code{egrep} utility.
452* Id Program::                  The @code{id} utility.
453* Split Program::               The @code{split} utility.
454* Tee Program::                 The @code{tee} utility.
455* Uniq Program::                The @code{uniq} utility.
456* Wc Program::                  The @code{wc} utility.
457* Miscellaneous Programs::      Some interesting @code{awk} programs.
458* Dupword Program::             Finding duplicated words in a document.
459* Alarm Program::               An alarm clock.
460* Translate Program::           A program similar to the @code{tr} utility.
461* Labels Program::              Printing mailing labels.
462* Word Sorting::                A program to produce a word usage count.
463* History Sorting::             Eliminating duplicate entries from a history
464                                file.
465* Extract Program::             Pulling out programs from Texinfo source
466                                files.
467* Simple Sed::                  A Simple Stream Editor.
468* Igawk Program::               A wrapper for @code{awk} that includes files.
469* V7/SVR3.1::                   The major changes between V7 and System V
470                                Release 3.1.
471* SVR4::                        Minor changes between System V Releases 3.1
472                                and 4.
473* POSIX::                       New features from the POSIX standard.
474* BTL::                         New features from the Bell Laboratories
475                                version of @code{awk}.
476* POSIX/GNU::                   The extensions in @code{gawk} not in POSIX
477                                @code{awk}.
478* Command Line Summary::        Recapitulation of the command line.
479* Language Summary::            A terse review of the language.
480* Variables/Fields::            Variables, fields, and arrays.
481* Fields Summary::              Input field splitting.
482* Built-in Summary::            @code{awk}'s built-in variables.
483* Arrays Summary::              Using arrays.
484* Data Type Summary::           Values in @code{awk} are numbers or strings.
485* Rules Summary::               Patterns and Actions, and their component
486                                parts.
487* Pattern Summary::             Quick overview of patterns.
488* Regexp Summary::              Quick overview of regular expressions.
489* Actions Summary::             Quick overview of actions.
490* Operator Summary::            @code{awk} operators.
491* Control Flow Summary::        The control statements.
492* I/O Summary::                 The I/O statements.
493* Printf Summary::              A summary of @code{printf}.
494* Special File Summary::        Special file names interpreted internally.
495* Built-in Functions Summary::  Built-in numeric and string functions.
496* Time Functions Summary::      Built-in time functions.
497* String Constants Summary::    Escape sequences in strings.
498* Functions Summary::           Defining and calling functions.
499* Historical Features::         Some undocumented but supported ``features''.
500* Gawk Distribution::           What is in the @code{gawk} distribution.
501* Getting::                     How to get the distribution.
502* Extracting::                  How to extract the distribution.
503* Distribution contents::       What is in the distribution.
504* Unix Installation::           Installing @code{gawk} under various versions
505                                of Unix.
506* Quick Installation::          Compiling @code{gawk} under Unix.
507* Configuration Philosophy::    How it's all supposed to work.
508* VMS Installation::            Installing @code{gawk} on VMS.
509* VMS Compilation::             How to compile @code{gawk} under VMS.
510* VMS Installation Details::    How to install @code{gawk} under VMS.
511* VMS Running::                 How to run @code{gawk} under VMS.
512* VMS POSIX::                   Alternate instructions for VMS POSIX.
513* PC Installation::             Installing and Compiling @code{gawk} on MS-DOS
514                                and OS/2
515* Atari Installation::          Installing @code{gawk} on the Atari ST.
516* Atari Compiling::             Compiling @code{gawk} on Atari
517* Atari Using::                 Running @code{gawk} on Atari
518* Amiga Installation::          Installing @code{gawk} on an Amiga.
519* Bugs::                        Reporting Problems and Bugs.
520* Other Versions::              Other freely available @code{awk}
521                                implementations.
522* Compatibility Mode::          How to disable certain @code{gawk} extensions.
523* Additions::                   Making Additions To @code{gawk}.
524* Adding Code::                 Adding code to the main body of @code{gawk}.
525* New Ports::                   Porting @code{gawk} to a new operating system.
526* Future Extensions::           New features that may be implemented one day.
527* Improvements::                Suggestions for improvements by volunteers.
528
529@end menu
530
531@c dedication for Info file
532@ifinfo
533@center To Miriam, for making me complete.
534@sp 1
535@center To Chana, for the joy you bring us.
536@sp 1
537@center To Rivka, for the exponential increase.
538@sp 1
539@center To Nachum, for the added dimension.
540@sp 1
541@center To Malka, for the new beginning.
542@end ifinfo
543
544@node Preface, What Is Awk, Top, Top
545@unnumbered Preface
546
547@c I saw a comment somewhere that the preface should describe the book itself,
548@c and the introduction should describe what the book covers.
549
550This @value{DOCUMENT} teaches you about the @code{awk} language and
551how you can use it effectively.  You should already be familiar with basic
552system commands, such as @code{cat} and @code{ls},@footnote{These commands
553are available on POSIX compliant systems, as well as on traditional Unix
554based systems. If you are using some other operating system, you still need to
555be familiar with the ideas of I/O redirection and pipes.} and basic shell
556facilities, such as Input/Output (I/O) redirection and pipes.
557
558Implementations of the @code{awk} language are available for many different
559computing environments.  This @value{DOCUMENT}, while describing the @code{awk} language
560in general, also describes a particular implementation of @code{awk} called
561@code{gawk} (which stands for ``GNU Awk'').  @code{gawk} runs on a broad range
562of Unix systems, ranging from 80386 PC-based computers, up through large scale
563systems, such as Crays. @code{gawk} has also been ported to MS-DOS and
564OS/2 PC's, Atari and Amiga micro-computers, and VMS.
565
566@menu
567* History::                     The history of @code{gawk} and @code{awk}.
568* Manual History::              Brief history of the GNU project and this
569                                @value{DOCUMENT}.
570* Acknowledgements::            Acknowledgements.
571@end menu
572
573@node History, Manual History, Preface, Preface
574@unnumberedsec History of @code{awk} and @code{gawk}
575
576@cindex acronym
577@cindex history of @code{awk}
578@cindex Aho, Alfred
579@cindex Weinberger, Peter
580@cindex Kernighan, Brian
581@cindex old @code{awk}
582@cindex new @code{awk}
583The name @code{awk} comes from the initials of its designers: Alfred V.@:
584Aho, Peter J.@: Weinberger, and Brian W.@: Kernighan.  The original version of
585@code{awk} was written in 1977 at AT&T Bell Laboratories.
586In 1985 a new version made the programming
587language more powerful, introducing user-defined functions, multiple input
588streams, and computed regular expressions.
589This new version became generally available with Unix System V Release 3.1.
590The version in System V Release 4 added some new features and also cleaned
591up the behavior in some of the ``dark corners'' of the language.
592The specification for @code{awk} in the POSIX Command Language
593and Utilities standard further clarified the language based on feedback
594from both the @code{gawk} designers, and the original Bell Labs @code{awk}
595designers.
596
597The GNU implementation, @code{gawk}, was written in 1986 by Paul Rubin
598and Jay Fenlason, with advice from Richard Stallman.  John Woods
599contributed parts of the code as well.  In 1988 and 1989, David Trueman, with
600help from Arnold Robbins, thoroughly reworked @code{gawk} for compatibility
601with the newer @code{awk}.  Current development focuses on bug fixes,
602performance improvements, standards compliance, and occasionally, new features.
603
604@node Manual History, Acknowledgements, History, Preface
605@unnumberedsec The GNU Project and This Book
606
607@cindex Free Software Foundation
608@cindex Stallman, Richard
609The Free Software Foundation (FSF) is a non-profit organization dedicated
610to the production and distribution of freely distributable software.
611It was founded by Richard M.@: Stallman, the author of the original
612Emacs editor.  GNU Emacs is the most widely used version of Emacs today.
613
614@cindex GNU Project
615The GNU project is an on-going effort on the part of the Free Software
616Foundation to create a complete, freely distributable, POSIX compliant
617computing environment.  (GNU stands for ``GNU's not Unix''.)
618The FSF uses the ``GNU General Public License'' (or GPL) to ensure that
619source code for their software is always available to the end user. A
620copy of the GPL is included for your reference
621(@pxref{Copying, ,GNU GENERAL PUBLIC LICENSE}).
622The GPL applies to the C language source code for @code{gawk}.
623
624A shell, an editor (Emacs), highly portable optimizing C, C++, and
625Objective-C compilers, a symbolic debugger, and dozens of large and
626small utilities (such as @code{gawk}), have all been completed and are
627freely available.  As of this writing (early 1997), the GNU operating
628system kernel (the HURD), has been released, but is still in an early
629stage of development.
630
631@cindex Linux
632@cindex NetBSD
633@cindex FreeBSD
634Until the GNU operating system is more fully developed, you should
635consider using Linux, a freely distributable, Unix-like operating
636system for 80386, DEC Alpha, Sun SPARC and other systems.  There are
637many books on Linux. One freely available one is @cite{Linux
638Installation and Getting Started}, by Matt Welsh.
639Many Linux distributions are available, often in computer stores or
640bundled on CD-ROM with books about Linux.
641(There are three other freely available, Unix-like operating systems for
64280386 and other systems, NetBSD, FreeBSD,and OpenBSD. All are based on the
6434.4-Lite Berkeley Software Distribution, and they use recent versions
644of @code{gawk} for their versions of @code{awk}.)
645
646@iftex
647This @value{DOCUMENT} you are reading now is actually free.  The
648information in it is freely available to anyone, the machine readable
649source code for the @value{DOCUMENT} comes with @code{gawk}, and anyone
650may take this @value{DOCUMENT} to a copying machine and make as many
651copies of it as they like.  (Take a moment to check the copying
652permissions on the Copyright page.)
653
654If you paid money for this @value{DOCUMENT}, what you actually paid for
655was the @value{DOCUMENT}'s nice printing and binding, and the
656publisher's associated costs to produce it.  We have made an effort to
657keep these costs reasonable; most people would prefer a bound book to
658over 330 pages of photo-copied text that would then have to be held in
659a loose-leaf binder (not to mention the time and labor involved in
660doing the copying).  The same is true of producing this
661@value{DOCUMENT} from the machine readable source; the retail price is
662only slightly more than the cost per page of printing it
663on a laser printer.
664@end iftex
665
666This @value{DOCUMENT} itself has gone through several previous,
667preliminary editions.  I started working on a preliminary draft of
668@cite{The GAWK Manual}, by Diane Close, Paul Rubin, and Richard
669Stallman in the fall of 1988.
670It was around 90 pages long, and barely described the original, ``old''
671version of @code{awk}. After substantial revision, the first version of
672the @cite{The GAWK Manual} to be released was Edition 0.11 Beta in
673October of 1989.  The manual then underwent more substantial revision
674for Edition 0.13 of December 1991.
675David Trueman, Pat Rankin, and Michal Jaegermann contributed sections
676of the manual for Edition 0.13.
677That edition was published by the
678FSF as a bound book early in 1992.  Since then there have been several
679minor revisions, notably Edition 0.14 of November 1992 that was published
680by the FSF in January of 1993, and Edition 0.16 of August 1993.
681
682Edition 1.0 of @cite{@value{TITLE}} represents a significant re-working
683of @cite{The GAWK Manual}, with much additional material.
684The FSF and I agree that I am now the primary author.
685I also felt that it needed a more descriptive title.
686
687@cite{@value{TITLE}} will undoubtedly continue to evolve.
688An electronic version
689comes with the @code{gawk} distribution from the FSF.
690If you find an error in this @value{DOCUMENT}, please report it!
691@xref{Bugs, ,Reporting Problems and Bugs}, for information on submitting
692problem reports electronically, or write to me in care of the FSF.
693
694@node Acknowledgements, , Manual History, Preface
695@unnumberedsec Acknowledgements
696
697@cindex Stallman, Richard
698I would like to acknowledge Richard M.@: Stallman, for his vision of a
699better world, and for his courage in founding the FSF and starting the
700GNU project.
701
702The initial draft of @cite{The GAWK Manual} had the following acknowledgements:
703
704@quotation
705Many people need to be thanked for their assistance in producing this
706manual.  Jay Fenlason contributed many ideas and sample programs.  Richard
707Mlynarik and Robert Chassell gave helpful comments on drafts of this
708manual.  The paper @cite{A Supplemental Document for @code{awk}} by John W.@:
709Pierce of the Chemistry Department at UC San Diego, pinpointed several
710issues relevant both to @code{awk} implementation and to this manual, that
711would otherwise have escaped us.
712@end quotation
713
714The following people provided many helpful comments on Edition 0.13 of
715@cite{The GAWK Manual}: Rick Adams, Michael Brennan, Rich Burridge, Diane Close,
716Christopher (``Topher'') Eliot, Michael Lijewski, Pat Rankin, Miriam Robbins,
717and Michal Jaegermann.
718
719The following people provided many helpful comments for Edition 1.0 of
720@cite{@value{TITLE}}: Karl Berry, Michael Brennan, Darrel
721Hankerson, Michal Jaegermann, Michael Lijewski, and Miriam Robbins.
722Pat Rankin, Michal Jaegermann, Darrel Hankerson and Scott Deifik
723updated their respective sections for Edition 1.0.
724
725Robert J.@: Chassell provided much valuable advice on
726the use of Texinfo.  He also deserves special thanks for
727convincing me @emph{not} to title this @value{DOCUMENT}
728@cite{How To Gawk Politely}.
729Karl Berry helped significantly with the @TeX{} part of Texinfo.
730
731@cindex Trueman, David
732David Trueman deserves special credit; he has done a yeoman job
733of evolving @code{gawk} so that it performs well, and without bugs.
734Although he is no longer involved with @code{gawk},
735working with him on this project was a significant pleasure.
736
737@cindex Deifik, Scott
738@cindex Hankerson, Darrel
739@cindex Rommel, Kai Uwe
740@cindex Rankin, Pat
741@cindex Jaegermann, Michal
742Scott Deifik, Darrel Hankerson, Kai Uwe Rommel, Pat Rankin, and Michal
743Jaegermann (in no particular order) are long time members of the
744@code{gawk} ``crack portability team.''  Without their hard work and
745help, @code{gawk} would not be nearly the fine program it is today.  It
746has been and continues to be a pleasure working with this team of fine
747people.
748
749@cindex Friedl, Jeffrey
750Jeffrey Friedl provided invaluable help in tracking down a number
751of last minute problems with regular expressions in @code{gawk} 3.0.
752
753@cindex Kernighan, Brian
754David and I would like to thank Brian Kernighan of Bell Labs for
755invaluable assistance during the testing and debugging of @code{gawk}, and for
756help in clarifying numerous points about the language.  We could not have
757done nearly as good a job on either @code{gawk} or its documentation without
758his help.
759
760@cindex Hughes, Phil
761I would like to thank Marshall and Elaine Hartholz of Seattle, and Dr.@:
762Bert and Rita Schreiber of Detroit for large amounts of quiet vacation
763time in their homes, which allowed me to make significant progress on
764this @value{DOCUMENT} and on @code{gawk} itself.  Phil Hughes of SSC
765contributed in a very important way by loaning me his laptop Linux
766system, not once, but twice, allowing me to do a lot of work while
767away from home.
768
769@cindex Robbins, Miriam
770Finally, I must thank my wonderful wife, Miriam, for her patience through
771the many versions of this project, for her proof-reading,
772and for sharing me with the computer.
773I would like to thank my parents for their love, and for the grace with
774which they raised and educated me.
775I also must acknowledge my gratitude to G-d, for the many opportunities
776He has sent my way, as well as for the gifts He has given me with which to
777take advantage of those opportunities.
778@sp 2
779@noindent
780Arnold Robbins @*
781Atlanta, Georgia @*
782February, 1997
783
784@ignore
785Stuff still not covered anywhere:
786BASICS:
787   Integer vs. floating point
788   Hex vs. octal vs. decimal
789   Interpreter vs compiler
790   input/output
791@end ignore
792
793@node What Is Awk, Getting Started, Preface, Top
794@chapter Introduction
795
796If you are like many computer users, you would frequently like to make
797changes in various text files wherever certain patterns appear, or
798extract data from parts of certain lines while discarding the rest.  To
799write a program to do this in a language such as C or Pascal is a
800time-consuming inconvenience that may take many lines of code.  The job
801may be easier with @code{awk}.
802
803The @code{awk} utility interprets a special-purpose programming language
804that makes it possible to handle simple data-reformatting jobs
805with just a few lines of code.
806
807The GNU implementation of @code{awk} is called @code{gawk}; it is fully
808upward compatible with the System V Release 4 version of
809@code{awk}.  @code{gawk} is also upward compatible with the POSIX
810specification of the @code{awk} language.  This means that all
811properly written @code{awk} programs should work with @code{gawk}.
812Thus, we usually don't distinguish between @code{gawk} and other @code{awk}
813implementations.
814
815@cindex uses of @code{awk}
816Using @code{awk} you can:
817
818@itemize @bullet
819@item
820manage small, personal databases
821
822@item
823generate reports
824
825@item
826validate data
827
828@item
829produce indexes, and perform other document preparation tasks
830
831@item
832even experiment with algorithms that can be adapted later to other computer
833languages
834@end itemize
835
836@menu
837* This Manual::                 Using this @value{DOCUMENT}. Includes sample
838                                input files that you can use.
839* Conventions::                 Typographical Conventions.
840* Sample Data Files::           Sample data files for use in the @code{awk}
841                                programs illustrated in this @value{DOCUMENT}.
842@end menu
843
844@node This Manual, Conventions, What Is Awk, What Is Awk
845@section Using This Book
846@cindex book, using this
847@cindex using this book
848@cindex language, @code{awk}
849@cindex program, @code{awk}
850@ignore
851@cindex @code{awk} language
852@cindex @code{awk} program
853@end ignore
854
855The term @code{awk} refers to a particular program, and to the language you
856use to tell this program what to do.  When we need to be careful, we call
857the program ``the @code{awk} utility'' and the language ``the @code{awk}
858language.''  The term @code{gawk} refers to a version of @code{awk} developed
859as part the GNU project.  The purpose of this @value{DOCUMENT} is to explain
860both the @code{awk} language and how to run the @code{awk} utility.
861
862The main purpose of the @value{DOCUMENT} is to explain the features
863of @code{awk}, as defined in the POSIX standard.  It does so in the context
864of one particular implementation, @code{gawk}. While doing so, it will also
865attempt to describe important differences between @code{gawk} and other
866@code{awk} implementations.  Finally, any @code{gawk} features that
867are not in the POSIX standard for @code{awk} will be noted.
868
869@iftex
870This @value{DOCUMENT} has the difficult task of being both tutorial and reference.
871If you are a novice, feel free to skip over details that seem too complex.
872You should also ignore the many cross references; they are for the
873expert user, and for the on-line Info version of the document.
874@end iftex
875
876The term @dfn{@code{awk} program} refers to a program written by you in
877the @code{awk} programming language.
878
879@xref{Getting Started, ,Getting Started with @code{awk}}, for the bare
880essentials you need to know to start using @code{awk}.
881
882Some useful ``one-liners'' are included to give you a feel for the
883@code{awk} language (@pxref{One-liners, ,Useful One Line Programs}).
884
885Many sample @code{awk} programs have been provided for you
886(@pxref{Library Functions, ,A Library of @code{awk} Functions}; also
887@pxref{Sample Programs, ,Practical @code{awk} Programs}).
888
889The entire @code{awk} language is summarized for quick reference in
890@ref{Gawk Summary, ,@code{gawk} Summary}.  Look there if you just need
891to refresh your memory about a particular feature.
892
893If you find terms that you aren't familiar with, try looking them
894up in the glossary (@pxref{Glossary}).
895
896Most of the time complete @code{awk} programs are used as examples, but in
897some of the more advanced sections, only the part of the @code{awk} program
898that illustrates the concept being described is shown.
899
900While this @value{DOCUMENT} is aimed principally at people who have not been
901exposed
902to @code{awk}, there is a lot of information here that even the @code{awk}
903expert should find useful.  In particular, the description of POSIX
904@code{awk}, and the example programs in
905@ref{Library Functions, ,A Library of @code{awk} Functions}, and
906@ref{Sample Programs, ,Practical @code{awk} Programs},
907should be of interest.
908
909@c fakenode --- for prepinfo
910@unnumberedsubsec Dark Corners
911@display
912@i{Who opened that window shade?!?}
913Count Dracula
914@end display
915@sp 1
916
917@cindex d.c., see ``dark corner''
918@cindex dark corner
919Until the POSIX standard (and @cite{The Gawk Manual}),
920many features of @code{awk} were either poorly documented, or not
921documented at all.  Descriptions of such features
922(often called ``dark corners'') are noted in this @value{DOCUMENT} with
923``(d.c.)''.
924They also appear in the index under the heading ``dark corner.''
925
926@node Conventions, Sample Data Files, This Manual, What Is Awk
927@section Typographical Conventions
928
929This @value{DOCUMENT} is written using Texinfo, the GNU documentation formatting language.
930A single Texinfo source file is used to produce both the printed and on-line
931versions of the documentation.
932@iftex
933Because of this, the typographical conventions
934are slightly different than in other books you may have read.
935@end iftex
936@ifinfo
937This section briefly documents the typographical conventions used in Texinfo.
938@end ifinfo
939
940Examples you would type at the command line are preceded by the common
941shell primary and secondary prompts, @samp{$} and @samp{>}.
942Output from the command is preceded by the glyph ``@print{}''.
943This typically represents the command's standard output.
944Error messages, and other output on the command's standard error, are preceded
945by the glyph ``@error{}''.  For example:
946
947@example
948@group
949$ echo hi on stdout
950@print{} hi on stdout
951$ echo hello on stderr 1>&2
952@error{} hello on stderr
953@end group
954@end example
955
956@iftex
957In the text, command names appear in @code{this font}, while code segments
958appear in the same font and quoted, @samp{like this}.  Some things will
959be emphasized @emph{like this}, and if a point needs to be made
960strongly, it will be done @strong{like this}.  The first occurrence of
961a new term is usually its @dfn{definition}, and appears in the same
962font as the previous occurrence of ``definition'' in this sentence.
963File names are indicated like this: @file{/path/to/ourfile}.
964@end iftex
965
966Characters that you type at the keyboard look @kbd{like this}.  In particular,
967there are special characters called ``control characters.''  These are
968characters that you type by holding down both the @kbd{CONTROL} key and
969another key, at the same time.  For example, a @kbd{Control-d} is typed
970by first pressing and holding the @kbd{CONTROL} key, next
971pressing the @kbd{d} key, and finally releasing both keys.
972
973@node Sample Data Files,  , Conventions, What Is Awk
974@section Data Files for the Examples
975
976@cindex input file, sample
977@cindex sample input file
978@cindex @file{BBS-list} file
979Many of the examples in this @value{DOCUMENT} take their input from two sample
980data files.  The first, called @file{BBS-list}, represents a list of
981computer bulletin board systems together with information about those systems.
982The second data file, called @file{inventory-shipped}, contains
983information about shipments on a monthly basis.  In both files,
984each line is considered to be one @dfn{record}.
985
986In the file @file{BBS-list}, each record contains the name of a computer
987bulletin board, its phone number, the board's baud rate(s), and a code for
988the number of hours it is operational.  An @samp{A} in the last column
989means the board operates 24 hours a day.  A @samp{B} in the last
990column means the board operates evening and weekend hours, only.  A
991@samp{C} means the board operates only on weekends.
992
993@c 2e: Update the baud rates to reflect today's faster modems
994@example
995@c system mkdir eg
996@c system mkdir eg/lib
997@c system mkdir eg/data
998@c system mkdir eg/prog
999@c system mkdir eg/misc
1000@c file eg/data/BBS-list
1001aardvark     555-5553     1200/300          B
1002alpo-net     555-3412     2400/1200/300     A
1003barfly       555-7685     1200/300          A
1004bites        555-1675     2400/1200/300     A
1005camelot      555-0542     300               C
1006core         555-2912     1200/300          C
1007fooey        555-1234     2400/1200/300     B
1008foot         555-6699     1200/300          B
1009macfoo       555-6480     1200/300          A
1010sdace        555-3430     2400/1200/300     A
1011sabafoo      555-2127     1200/300          C
1012@c endfile
1013@end example
1014
1015@cindex @file{inventory-shipped} file
1016The second data file, called @file{inventory-shipped}, represents
1017information about shipments during the year.
1018Each record contains the month of the year, the number
1019of green crates shipped, the number of red boxes shipped, the number of
1020orange bags shipped, and the number of blue packages shipped,
1021respectively.  There are 16 entries, covering the 12 months of one year
1022and four months of the next year.
1023
1024@example
1025@c file eg/data/inventory-shipped
1026Jan  13  25  15 115
1027Feb  15  32  24 226
1028Mar  15  24  34 228
1029Apr  31  52  63 420
1030May  16  34  29 208
1031Jun  31  42  75 492
1032Jul  24  34  67 436
1033Aug  15  34  47 316
1034Sep  13  55  37 277
1035Oct  29  54  68 525
1036Nov  20  87  82 577
1037Dec  17  35  61 401
1038
1039Jan  21  36  64 620
1040Feb  26  58  80 652
1041Mar  24  75  70 495
1042Apr  21  70  74 514
1043@c endfile
1044@end example
1045
1046@ifinfo
1047If you are reading this in GNU Emacs using Info, you can copy the regions
1048of text showing these sample files into your own test files.  This way you
1049can try out the examples shown in the remainder of this document.  You do
1050this by using the command @kbd{M-x write-region} to copy text from the Info
1051file into a file for use with @code{awk}
1052(@xref{Misc File Ops, , Miscellaneous File Operations, emacs, GNU Emacs Manual},
1053for more information).  Using this information, create your own
1054@file{BBS-list} and @file{inventory-shipped} files, and practice what you
1055learn in this @value{DOCUMENT}.
1056
1057If you are using the stand-alone version of Info,
1058see @ref{Extract Program, ,Extracting Programs from Texinfo Source Files},
1059for an @code{awk} program that will extract these data files from
1060@file{gawk.texi}, the Texinfo source file for this Info file.
1061@end ifinfo
1062
1063@node Getting Started, One-liners, What Is Awk, Top
1064@chapter Getting Started with @code{awk}
1065@cindex script, definition of
1066@cindex rule, definition of
1067@cindex program, definition of
1068@cindex basic function of @code{awk}
1069
1070The basic function of @code{awk} is to search files for lines (or other
1071units of text) that contain certain patterns.  When a line matches one
1072of the patterns, @code{awk} performs specified actions on that line.
1073@code{awk} keeps processing input lines in this way until the end of the
1074input files are reached.
1075
1076@cindex data-driven languages
1077@cindex procedural languages
1078@cindex language, data-driven
1079@cindex language, procedural
1080Programs in @code{awk} are different from programs in most other languages,
1081because @code{awk} programs are @dfn{data-driven}; that is, you describe
1082the data you wish to work with, and then what to do when you find it.
1083Most other languages are @dfn{procedural}; you have to describe, in great
1084detail, every step the program is to take.  When working with procedural
1085languages, it is usually much
1086harder to clearly describe the data your program will process.
1087For this reason, @code{awk} programs are often refreshingly easy to both
1088write and read.
1089
1090@cindex program, definition of
1091@cindex rule, definition of
1092When you run @code{awk}, you specify an @code{awk} @dfn{program} that
1093tells @code{awk} what to do.  The program consists of a series of
1094@dfn{rules}.  (It may also contain @dfn{function definitions},
1095an advanced feature which we will ignore for now.
1096@xref{User-defined, ,User-defined Functions}.)  Each rule specifies one
1097pattern to search for, and one action to perform when that pattern is found.
1098
1099Syntactically, a rule consists of a pattern followed by an action.  The
1100action is enclosed in curly braces to separate it from the pattern.
1101Rules are usually separated by newlines.  Therefore, an @code{awk}
1102program looks like this:
1103
1104@example
1105@var{pattern} @{ @var{action} @}
1106@var{pattern} @{ @var{action} @}
1107@dots{}
1108@end example
1109
1110@menu
1111* Names::                       What name to use to find @code{awk}.
1112* Running gawk::                How to run @code{gawk} programs; includes
1113                                command line syntax.
1114* Very Simple::                 A very simple example.
1115* Two Rules::                   A less simple one-line example with two rules.
1116* More Complex::                A more complex example.
1117* Statements/Lines::            Subdividing or combining statements into
1118                                lines.
1119* Other Features::              Other Features of @code{awk}.
1120* When::                        When to use @code{gawk} and when to use other
1121                                things.
1122@end menu
1123
1124@node Names, Running gawk , Getting Started, Getting Started
1125@section A Rose By Any Other Name
1126
1127@cindex old @code{awk} vs. new @code{awk}
1128@cindex new @code{awk} vs. old @code{awk}
1129The @code{awk} language has evolved over the years. Full details are
1130provided in @ref{Language History, ,The Evolution of the @code{awk} Language}.
1131The language described in this @value{DOCUMENT}
1132is often referred to as ``new @code{awk}.''
1133
1134Because of this, many systems have multiple
1135versions of @code{awk}.
1136Some systems have an @code{awk} utility that implements the
1137original version of the @code{awk} language, and a @code{nawk} utility
1138for the new version.  Others have an @code{oawk} for the ``old @code{awk}''
1139language, and plain @code{awk} for the new one.  Still others only
1140have one version, usually the new one.@footnote{Often, these systems
1141use @code{gawk} for their @code{awk} implementation!}
1142
1143All in all, this makes it difficult for you to know which version of
1144@code{awk} you should run when writing your programs.  The best advice
1145we can give here is to check your local documentation. Look for @code{awk},
1146@code{oawk}, and @code{nawk}, as well as for @code{gawk}. Chances are, you
1147will have some version of new @code{awk} on your system, and that is what
1148you should use when running your programs.  (Of course, if you're reading
1149this @value{DOCUMENT}, chances are good that you have @code{gawk}!)
1150
1151Throughout this @value{DOCUMENT}, whenever we refer to a language feature
1152that should be available in any complete implementation of POSIX @code{awk},
1153we simply use the term @code{awk}.  When referring to a feature that is
1154specific to the GNU implementation, we use the term @code{gawk}.
1155
1156@node Running gawk, Very Simple, Names, Getting Started
1157@section How to Run @code{awk} Programs
1158
1159@cindex command line formats
1160@cindex running @code{awk} programs
1161There are several ways to run an @code{awk} program.  If the program is
1162short, it is easiest to include it in the command that runs @code{awk},
1163like this:
1164
1165@example
1166awk '@var{program}' @var{input-file1} @var{input-file2} @dots{}
1167@end example
1168
1169@noindent
1170where @var{program} consists of a series of patterns and actions, as
1171described earlier.
1172(The reason for the single quotes is described below, in
1173@ref{One-shot, ,One-shot Throw-away @code{awk} Programs}.)
1174
1175When the program is long, it is usually more convenient to put it in a file
1176and run it with a command like this:
1177
1178@example
1179awk -f @var{program-file} @var{input-file1} @var{input-file2} @dots{}
1180@end example
1181
1182@menu
1183* One-shot::                    Running a short throw-away @code{awk} program.
1184* Read Terminal::               Using no input files (input from terminal
1185                                instead).
1186* Long::                        Putting permanent @code{awk} programs in
1187                                files.
1188* Executable Scripts::          Making self-contained @code{awk} programs.
1189* Comments::                    Adding documentation to @code{gawk} programs.
1190@end menu
1191
1192@node One-shot, Read Terminal, Running gawk, Running gawk
1193@subsection One-shot Throw-away @code{awk} Programs
1194
1195Once you are familiar with @code{awk}, you will often type in simple
1196programs the moment you want to use them.  Then you can write the
1197program as the first argument of the @code{awk} command, like this:
1198
1199@example
1200awk '@var{program}' @var{input-file1} @var{input-file2} @dots{}
1201@end example
1202
1203@noindent
1204where @var{program} consists of a series of @var{patterns} and
1205@var{actions}, as described earlier.
1206
1207@cindex single quotes, why needed
1208This command format instructs the @dfn{shell}, or command interpreter,
1209to start @code{awk} and use the @var{program} to process records in the
1210input file(s).  There are single quotes around @var{program} so that
1211the shell doesn't interpret any @code{awk} characters as special shell
1212characters.  They also cause the shell to treat all of @var{program} as
1213a single argument for @code{awk} and allow @var{program} to be more
1214than one line long.
1215
1216This format is also useful for running short or medium-sized @code{awk}
1217programs from shell scripts, because it avoids the need for a separate
1218file for the @code{awk} program.  A self-contained shell script is more
1219reliable since there are no other files to misplace.
1220
1221@ref{One-liners, , Useful One Line Programs}, presents several short,
1222self-contained programs.
1223
1224As an interesting side point, the command
1225
1226@example
1227awk '/foo/' @var{files} @dots{}
1228@end example
1229
1230@noindent
1231is essentially the same as
1232
1233@cindex @code{egrep}
1234@example
1235egrep foo @var{files} @dots{}
1236@end example
1237
1238@node Read Terminal, Long, One-shot, Running gawk
1239@subsection Running @code{awk} without Input Files
1240
1241@cindex standard input
1242@cindex input, standard
1243You can also run @code{awk} without any input files.  If you type the
1244command line:
1245
1246@example
1247awk '@var{program}'
1248@end example
1249
1250@noindent
1251then @code{awk} applies the @var{program} to the @dfn{standard input},
1252which usually means whatever you type on the terminal.  This continues
1253until you indicate end-of-file by typing @kbd{Control-d}.
1254(On other operating systems, the end-of-file character may be different.
1255For example, on OS/2 and MS-DOS, it is @kbd{Control-z}.)
1256
1257For example, the following program prints a friendly piece of advice
1258(from Douglas Adams' @cite{The Hitchhiker's Guide to the Galaxy}),
1259to keep you from worrying about the complexities of computer programming
1260(@samp{BEGIN} is a feature we haven't discussed yet).
1261
1262@example
1263$ awk "BEGIN @{ print \"Don't Panic!\" @}"
1264@print{} Don't Panic!
1265@end example
1266
1267@cindex quoting, shell
1268@cindex shell quoting
1269This program does not read any input.  The @samp{\} before each of the
1270inner double quotes is necessary because of the shell's quoting rules,
1271in particular because it mixes both single quotes and double quotes.
1272
1273This next simple @code{awk} program
1274emulates the @code{cat} utility; it copies whatever you type at the
1275keyboard to its standard output. (Why this works is explained shortly.)
1276
1277@example
1278$ awk '@{ print @}'
1279Now is the time for all good men
1280@print{} Now is the time for all good men
1281to come to the aid of their country.
1282@print{} to come to the aid of their country.
1283Four score and seven years ago, ...
1284@print{} Four score and seven years ago, ...
1285What, me worry?
1286@print{} What, me worry?
1287@kbd{Control-d}
1288@end example
1289
1290@node Long, Executable Scripts, Read Terminal, Running gawk
1291@subsection Running Long Programs
1292
1293@cindex running long programs
1294@cindex @code{-f} option
1295@cindex program file
1296@cindex file, @code{awk} program
1297Sometimes your @code{awk} programs can be very long.  In this case it is
1298more convenient to put the program into a separate file.  To tell
1299@code{awk} to use that file for its program, you type:
1300
1301@example
1302awk -f @var{source-file} @var{input-file1} @var{input-file2} @dots{}
1303@end example
1304
1305The @samp{-f} instructs the @code{awk} utility to get the @code{awk} program
1306from the file @var{source-file}.  Any file name can be used for
1307@var{source-file}.  For example, you could put the program:
1308
1309@example
1310BEGIN @{ print "Don't Panic!" @}
1311@end example
1312
1313@noindent
1314into the file @file{advice}.  Then this command:
1315
1316@example
1317awk -f advice
1318@end example
1319
1320@noindent
1321does the same thing as this one:
1322
1323@example
1324awk "BEGIN @{ print \"Don't Panic!\" @}"
1325@end example
1326
1327@cindex quoting, shell
1328@cindex shell quoting
1329@noindent
1330which was explained earlier (@pxref{Read Terminal, ,Running @code{awk} without Input Files}).
1331Note that you don't usually need single quotes around the file name that you
1332specify with @samp{-f}, because most file names don't contain any of the shell's
1333special characters.  Notice that in @file{advice}, the @code{awk}
1334program did not have single quotes around it.  The quotes are only needed
1335for programs that are provided on the @code{awk} command line.
1336
1337If you want to identify your @code{awk} program files clearly as such,
1338you can add the extension @file{.awk} to the file name.  This doesn't
1339affect the execution of the @code{awk} program, but it does make
1340``housekeeping'' easier.
1341
1342@node Executable Scripts, Comments, Long, Running gawk
1343@subsection Executable @code{awk} Programs
1344@cindex executable scripts
1345@cindex scripts, executable
1346@cindex self contained programs
1347@cindex program, self contained
1348@cindex @code{#!} (executable scripts)
1349
1350Once you have learned @code{awk}, you may want to write self-contained
1351@code{awk} scripts, using the @samp{#!} script mechanism.  You can do
1352this on many Unix systems@footnote{The @samp{#!} mechanism works on
1353Linux systems,
1354Unix systems derived from Berkeley Unix, System V Release 4, and some System
1355V Release 3 systems.} (and someday on the GNU system).
1356
1357For example, you could update the file @file{advice} to look like this:
1358
1359@example
1360#! /bin/awk -f
1361
1362BEGIN    @{ print "Don't Panic!" @}
1363@end example
1364
1365@noindent
1366After making this file executable (with the @code{chmod} utility), you
1367can simply type @samp{advice}
1368at the shell, and the system will arrange to run @code{awk}@footnote{The
1369line beginning with @samp{#!} lists the full file name of an interpreter
1370to be run, and an optional initial command line argument to pass to that
1371interpreter.  The operating system then runs the interpreter with the given
1372argument and the full argument list of the executed program.  The first argument
1373in the list is the full file name of the @code{awk} program.  The rest of the
1374argument list will either be options to @code{awk}, or data files,
1375or both.} as if you had typed @samp{awk -f advice}.
1376
1377@example
1378@group
1379$ advice
1380@print{} Don't Panic!
1381@end group
1382@end example
1383
1384@noindent
1385Self-contained @code{awk} scripts are useful when you want to write a
1386program which users can invoke without their having to know that the program is
1387written in @code{awk}.
1388
1389@strong{Caution:} You should not put more than one argument on the @samp{#!}
1390line after the path to @code{awk}. This will not work. The operating system
1391treats the rest of the line as a single agument, and passes it to @code{awk}.
1392Doing this will lead to confusing behavior: most likely a usage diagnostic
1393of some sort from @code{awk}.
1394
1395@cindex shell scripts
1396@cindex scripts, shell
1397Some older systems do not support the @samp{#!} mechanism. You can get a
1398similar effect using a regular shell script.  It would look something
1399like this:
1400
1401@example
1402: The colon ensures execution by the standard shell.
1403awk '@var{program}' "$@@"
1404@end example
1405
1406Using this technique, it is @emph{vital} to enclose the @var{program} in
1407single quotes to protect it from interpretation by the shell.  If you
1408omit the quotes, only a shell wizard can predict the results.
1409
1410The @code{"$@@"} causes the shell to forward all the command line
1411arguments to the @code{awk} program, without interpretation.  The first
1412line, which starts with a colon, is used so that this shell script will
1413work even if invoked by a user who uses the C shell.  (Not all older systems
1414obey this convention, but many do.)
1415@c 2e:
1416@c Someday: (See @cite{The Bourne Again Shell}, by ??.)
1417
1418@node Comments,  , Executable Scripts, Running gawk
1419@subsection Comments in @code{awk} Programs
1420@cindex @code{#} (comment)
1421@cindex comments
1422@cindex use of comments
1423@cindex documenting @code{awk} programs
1424@cindex programs, documenting
1425
1426A @dfn{comment} is some text that is included in a program for the sake
1427of human readers; it is not really part of the program.  Comments
1428can explain what the program does, and how it works.  Nearly all
1429programming languages have provisions for comments, because programs are
1430typically hard to understand without their extra help.
1431
1432In the @code{awk} language, a comment starts with the sharp sign
1433character, @samp{#}, and continues to the end of the line.
1434The @samp{#} does not have to be the first character on the line. The
1435@code{awk} language ignores the rest of a line following a sharp sign.
1436For example, we could have put the following into @file{advice}:
1437
1438@example
1439# This program prints a nice friendly message.  It helps
1440# keep novice users from being afraid of the computer.
1441BEGIN    @{ print "Don't Panic!" @}
1442@end example
1443
1444You can put comment lines into keyboard-composed throw-away @code{awk}
1445programs also, but this usually isn't very useful; the purpose of a
1446comment is to help you or another person understand the program at
1447a later time.
1448
1449@strong{Caution:} As mentioned in
1450@ref{One-shot, ,One-shot Throw-away @code{awk} Programs},
1451you can enclose small to medium programs in single quotes, in order to keep
1452your shell scripts self-contained.  When doing so, @emph{don't} put
1453an apostrophe (i.e., a single quote) into a comment (or anywhere else
1454in your program). The shell will interpret the quote as the closing
1455quote for the entire program. As a result, usually the shell will
1456print a message about mismatched quotes, and if @code{awk} actually
1457runs, it will probably print strange messages about syntax errors.
1458For example:
1459
1460@example
1461awk 'BEGIN @{ print "hello" @} # let's be cute'
1462@end example
1463
1464@node Very Simple, Two Rules, Running gawk, Getting Started
1465@section A Very Simple Example
1466
1467The following command runs a simple @code{awk} program that searches the
1468input file @file{BBS-list} for the string of characters: @samp{foo}.  (A
1469string of characters is usually called a @dfn{string}.
1470The term @dfn{string} is perhaps based on similar usage in English, such
1471as ``a string of pearls,'' or, ``a string of cars in a train.'')
1472
1473@example
1474awk '/foo/ @{ print $0 @}' BBS-list
1475@end example
1476
1477@noindent
1478When lines containing @samp{foo} are found, they are printed, because
1479@w{@samp{print $0}} means print the current line.  (Just @samp{print} by
1480itself means the same thing, so we could have written that
1481instead.)
1482
1483You will notice that slashes, @samp{/}, surround the string @samp{foo}
1484in the @code{awk} program.  The slashes indicate that @samp{foo}
1485is a pattern to search for.  This type of pattern is called a
1486@dfn{regular expression}, and is covered in more detail later
1487(@pxref{Regexp, ,Regular Expressions}).
1488The pattern is allowed to match parts of words.
1489There are
1490single-quotes around the @code{awk} program so that the shell won't
1491interpret any of it as special shell characters.
1492
1493Here is what this program prints:
1494
1495@example
1496@group
1497$ awk '/foo/ @{ print $0 @}' BBS-list
1498@print{} fooey        555-1234     2400/1200/300     B
1499@print{} foot         555-6699     1200/300          B
1500@print{} macfoo       555-6480     1200/300          A
1501@print{} sabafoo      555-2127     1200/300          C
1502@end group
1503@end example
1504
1505@cindex action, default
1506@cindex pattern, default
1507@cindex default action
1508@cindex default pattern
1509In an @code{awk} rule, either the pattern or the action can be omitted,
1510but not both.  If the pattern is omitted, then the action is performed
1511for @emph{every} input line.  If the action is omitted, the default
1512action is to print all lines that match the pattern.
1513
1514@cindex empty action
1515@cindex action, empty
1516Thus, we could leave out the action (the @code{print} statement and the curly
1517braces) in the above example, and the result would be the same: all
1518lines matching the pattern @samp{foo} would be printed.  By comparison,
1519omitting the @code{print} statement but retaining the curly braces makes an
1520empty action that does nothing; then no lines would be printed.
1521
1522@node Two Rules, More Complex, Very Simple, Getting Started
1523@section An Example with Two Rules
1524@cindex how @code{awk} works
1525
1526The @code{awk} utility reads the input files one line at a
1527time.  For each line, @code{awk} tries the patterns of each of the rules.
1528If several patterns match then several actions are run, in the order in
1529which they appear in the @code{awk} program.  If no patterns match, then
1530no actions are run.
1531
1532After processing all the rules (perhaps none) that match the line,
1533@code{awk} reads the next line (however,
1534@pxref{Next Statement, ,The @code{next} Statement},
1535and also @pxref{Nextfile Statement, ,The @code{nextfile} Statement}).
1536This continues until the end of the file is reached.
1537
1538For example, the @code{awk} program:
1539
1540@example
1541/12/  @{ print $0 @}
1542/21/  @{ print $0 @}
1543@end example
1544
1545@noindent
1546contains two rules.  The first rule has the string @samp{12} as the
1547pattern and @samp{print $0} as the action.  The second rule has the
1548string @samp{21} as the pattern and also has @samp{print $0} as the
1549action.  Each rule's action is enclosed in its own pair of braces.
1550
1551This @code{awk} program prints every line that contains the string
1552@samp{12} @emph{or} the string @samp{21}.  If a line contains both
1553strings, it is printed twice, once by each rule.
1554
1555This is what happens if we run this program on our two sample data files,
1556@file{BBS-list} and @file{inventory-shipped}, as shown here:
1557
1558@example
1559$ awk '/12/ @{ print $0 @}
1560>      /21/ @{ print $0 @}' BBS-list inventory-shipped
1561@print{} aardvark     555-5553     1200/300          B
1562@print{} alpo-net     555-3412     2400/1200/300     A
1563@print{} barfly       555-7685     1200/300          A
1564@print{} bites        555-1675     2400/1200/300     A
1565@print{} core         555-2912     1200/300          C
1566@print{} fooey        555-1234     2400/1200/300     B
1567@print{} foot         555-6699     1200/300          B
1568@print{} macfoo       555-6480     1200/300          A
1569@print{} sdace        555-3430     2400/1200/300     A
1570@print{} sabafoo      555-2127     1200/300          C
1571@print{} sabafoo      555-2127     1200/300          C
1572@print{} Jan  21  36  64 620
1573@print{} Apr  21  70  74 514
1574@end example
1575
1576@noindent
1577Note how the line in @file{BBS-list} beginning with @samp{sabafoo}
1578was printed twice, once for each rule.
1579
1580@node More Complex, Statements/Lines, Two Rules, Getting Started
1581@section A More Complex Example
1582
1583@ignore
1584We have to use ls -lg here to get portable output across Unix systems.
1585The POSIX ls matches this behavior too. Sigh.
1586@end ignore
1587Here is an example to give you an idea of what typical @code{awk}
1588programs do.  This example shows how @code{awk} can be used to
1589summarize, select, and rearrange the output of another utility.  It uses
1590features that haven't been covered yet, so don't worry if you don't
1591understand all the details.
1592
1593@example
1594ls -lg | awk '$6 == "Nov" @{ sum += $5 @}
1595             END @{ print sum @}'
1596@end example
1597
1598@cindex @code{csh}, backslash continuation
1599@cindex backslash continuation in @code{csh}
1600This command prints the total number of bytes in all the files in the
1601current directory that were last modified in November (of any year).
1602(In the C shell you would need to type a semicolon and then a backslash
1603at the end of the first line; in a POSIX-compliant shell, such as the
1604Bourne shell or Bash, the GNU Bourne-Again shell, you can type the example
1605as shown.)
1606@ignore
1607FIXME:  how can users tell what shell they are running?  Need a footnote
1608or something, but getting into this is a distraction.
1609@end ignore
1610
1611The @w{@samp{ls -lg}} part of this example is a system command that gives
1612you a listing of the files in a directory, including file size and the date
1613the file was last modified. Its output looks like this:
1614
1615@example
1616-rw-r--r--  1 arnold   user   1933 Nov  7 13:05 Makefile
1617-rw-r--r--  1 arnold   user  10809 Nov  7 13:03 gawk.h
1618-rw-r--r--  1 arnold   user    983 Apr 13 12:14 gawk.tab.h
1619-rw-r--r--  1 arnold   user  31869 Jun 15 12:20 gawk.y
1620-rw-r--r--  1 arnold   user  22414 Nov  7 13:03 gawk1.c
1621-rw-r--r--  1 arnold   user  37455 Nov  7 13:03 gawk2.c
1622-rw-r--r--  1 arnold   user  27511 Dec  9 13:07 gawk3.c
1623-rw-r--r--  1 arnold   user   7989 Nov  7 13:03 gawk4.c
1624@end example
1625
1626@noindent
1627The first field contains read-write permissions, the second field contains
1628the number of links to the file, and the third field identifies the owner of
1629the file. The fourth field identifies the group of the file.
1630The fifth field contains the size of the file in bytes.  The
1631sixth, seventh and eighth fields contain the month, day, and time,
1632respectively, that the file was last modified.  Finally, the ninth field
1633contains the name of the file.
1634
1635@cindex automatic initialization
1636@cindex initialization, automatic
1637The @samp{$6 == "Nov"} in our @code{awk} program is an expression that
1638tests whether the sixth field of the output from @w{@samp{ls -lg}}
1639matches the string @samp{Nov}.  Each time a line has the string
1640@samp{Nov} for its sixth field, the action @samp{sum += $5} is
1641performed.  This adds the fifth field (the file size) to the variable
1642@code{sum}.  As a result, when @code{awk} has finished reading all the
1643input lines, @code{sum} is the sum of the sizes of files whose
1644lines matched the pattern.  (This works because @code{awk} variables
1645are automatically initialized to zero.)
1646
1647After the last line of output from @code{ls} has been processed, the
1648@code{END} rule is executed, and the value of @code{sum} is
1649printed.  In this example, the value of @code{sum} would be 80600.
1650
1651These more advanced @code{awk} techniques are covered in later sections
1652(@pxref{Action Overview, ,Overview of Actions}).  Before you can move on to more
1653advanced @code{awk} programming, you have to know how @code{awk} interprets
1654your input and displays your output.  By manipulating fields and using
1655@code{print} statements, you can produce some very useful and impressive
1656looking reports.
1657
1658@node Statements/Lines, Other Features, More Complex, Getting Started
1659@section @code{awk} Statements Versus Lines
1660@cindex line break
1661@cindex newline
1662
1663Most often, each line in an @code{awk} program is a separate statement or
1664separate rule, like this:
1665
1666@example
1667awk '/12/  @{ print $0 @}
1668     /21/  @{ print $0 @}' BBS-list inventory-shipped
1669@end example
1670
1671However, @code{gawk} will ignore newlines after any of the following:
1672
1673@example
1674,    @{    ?    :    ||    &&    do    else
1675@end example
1676
1677@noindent
1678A newline at any other point is considered the end of the statement.
1679(Splitting lines after @samp{?} and @samp{:} is a minor @code{gawk}
1680extension.  The @samp{?} and @samp{:} referred to here is the
1681three operand conditional expression described in
1682@ref{Conditional Exp, ,Conditional Expressions}.)
1683
1684@cindex backslash continuation
1685@cindex continuation of lines
1686@cindex line continuation
1687If you would like to split a single statement into two lines at a point
1688where a newline would terminate it, you can @dfn{continue} it by ending the
1689first line with a backslash character, @samp{\}.  The backslash must be
1690the final character on the line to be recognized as a continuation
1691character.  This is allowed absolutely anywhere in the statement, even
1692in the middle of a string or regular expression.  For example:
1693
1694@example
1695awk '/This regular expression is too long, so continue it\
1696 on the next line/ @{ print $1 @}'
1697@end example
1698
1699@noindent
1700@cindex portability issues
1701We have generally not used backslash continuation in the sample programs
1702in this @value{DOCUMENT}.  Since in @code{gawk} there is no limit on the
1703length of a line, it is never strictly necessary; it just makes programs
1704more readable.  For this same reason, as well as for clarity, we have
1705kept most statements short in the sample programs presented throughout
1706the @value{DOCUMENT}.  Backslash continuation is most useful when your
1707@code{awk} program is in a separate source file, instead of typed in on
1708the command line.  You should also note that many @code{awk}
1709implementations are more particular about where you may use backslash
1710continuation. For example, they may not allow you to split a string
1711constant using backslash continuation.  Thus, for maximal portability of
1712your @code{awk} programs, it is best not to split your lines in the
1713middle of a regular expression or a string.
1714
1715@cindex @code{csh}, backslash continuation
1716@cindex backslash continuation in @code{csh}
1717@strong{Caution: backslash continuation does not work as described above
1718with the C shell.}  Continuation with backslash works for @code{awk}
1719programs in files, and also for one-shot programs @emph{provided} you
1720are using a POSIX-compliant shell, such as the Bourne shell or Bash, the
1721GNU Bourne-Again shell.  But the C shell (@code{csh}) behaves
1722differently!  There, you must use two backslashes in a row, followed by
1723a newline.  Note also that when using the C shell, @emph{every} newline
1724in your awk program must be escaped with a backslash. To illustrate:
1725
1726@example
1727% awk 'BEGIN @{ \
1728?   print \\
1729?       "hello, world" \
1730? @}'
1731@print{} hello, world
1732@end example
1733
1734@noindent
1735Here, the @samp{%} and @samp{?} are the C shell's primary and secondary
1736prompts, analogous to the standard shell's @samp{$} and @samp{>}.
1737
1738@code{awk} is a line-oriented language.  Each rule's action has to
1739begin on the same line as the pattern.  To have the pattern and action
1740on separate lines, you @emph{must} use backslash continuation---there
1741is no other way.
1742
1743@cindex backslash continuation and comments
1744@cindex comments and backslash continuation
1745Note that backslash continuation and comments do not mix. As soon
1746as @code{awk} sees the @samp{#} that starts a comment, it ignores
1747@emph{everything} on the rest of the line. For example:
1748
1749@example
1750@group
1751$ gawk 'BEGIN @{ print "dont panic" # a friendly \
1752>                                    BEGIN rule
1753> @}'
1754@error{} gawk: cmd. line:2:                BEGIN rule
1755@error{} gawk: cmd. line:2:                ^ parse error
1756@end group
1757@end example
1758
1759@noindent
1760Here, it looks like the backslash would continue the comment onto the
1761next line. However, the backslash-newline combination is never even
1762noticed, since it is ``hidden'' inside the comment. Thus, the
1763@samp{BEGIN} is noted as a syntax error.
1764
1765@cindex multiple statements on one line
1766When @code{awk} statements within one rule are short, you might want to put
1767more than one of them on a line.  You do this by separating the statements
1768with a semicolon, @samp{;}.
1769
1770This also applies to the rules themselves.
1771Thus, the previous program could have been written:
1772
1773@example
1774/12/ @{ print $0 @} ; /21/ @{ print $0 @}
1775@end example
1776
1777@noindent
1778@strong{Note:} the requirement that rules on the same line must be
1779separated with a semicolon was not in the original @code{awk}
1780language; it was added for consistency with the treatment of statements
1781within an action.
1782
1783@node Other Features, When, Statements/Lines, Getting Started
1784@section Other Features of @code{awk}
1785
1786The @code{awk} language provides a number of predefined, or built-in variables, which
1787your programs can use to get information from @code{awk}.  There are other
1788variables your program can set to control how @code{awk} processes your
1789data.
1790
1791In addition, @code{awk} provides a number of built-in functions for doing
1792common computational and string related operations.
1793
1794As we develop our presentation of the @code{awk} language, we introduce
1795most of the variables and many of the functions. They are defined
1796systematically in @ref{Built-in Variables}, and
1797@ref{Built-in, ,Built-in Functions}.
1798
1799@node When,  , Other Features, Getting Started
1800@section When to Use @code{awk}
1801
1802@cindex when to use @code{awk}
1803@cindex applications of @code{awk}
1804You might wonder how @code{awk} might be useful for you.  Using
1805utility programs, advanced patterns, field separators, arithmetic
1806statements, and other selection criteria, you can produce much more
1807complex output.  The @code{awk} language is very useful for producing
1808reports from large amounts of raw data, such as summarizing information
1809from the output of other utility programs like @code{ls}.
1810(@xref{More Complex, ,A More Complex Example}.)
1811
1812Programs written with @code{awk} are usually much smaller than they would
1813be in other languages.  This makes @code{awk} programs easy to compose and
1814use.  Often, @code{awk} programs can be quickly composed at your terminal,
1815used once, and thrown away.  Since @code{awk} programs are interpreted, you
1816can avoid the (usually lengthy) compilation part of the typical
1817edit-compile-test-debug cycle of software development.
1818
1819Complex programs have been written in @code{awk}, including a complete
1820retargetable assembler for eight-bit microprocessors (@pxref{Glossary}, for
1821more information) and a microcode assembler for a special purpose Prolog
1822computer.  However, @code{awk}'s capabilities are strained by tasks of
1823such complexity.
1824
1825If you find yourself writing @code{awk} scripts of more than, say, a few
1826hundred lines, you might consider using a different programming
1827language.  Emacs Lisp is a good choice if you need sophisticated string
1828or pattern matching capabilities.  The shell is also good at string and
1829pattern matching; in addition, it allows powerful use of the system
1830utilities.  More conventional languages, such as C, C++, and Lisp, offer
1831better facilities for system programming and for managing the complexity
1832of large programs.  Programs in these languages may require more lines
1833of source code than the equivalent @code{awk} programs, but they are
1834easier to maintain and usually run more efficiently.
1835
1836@node One-liners, Regexp, Getting Started, Top
1837@chapter Useful One Line Programs
1838
1839@cindex one-liners
1840Many useful @code{awk} programs are short, just a line or two.  Here is a
1841collection of useful, short programs to get you started.  Some of these
1842programs contain constructs that haven't been covered yet.  The description
1843of the program will give you a good idea of what is going on, but please
1844read the rest of the @value{DOCUMENT} to become an @code{awk} expert!
1845
1846Most of the examples use a data file named @file{data}.  This is just a
1847placeholder; if you were to use these programs yourself, you would substitute
1848your own file names for @file{data}.
1849
1850@ifinfo
1851Since you are reading this in Info, each line of the example code is
1852enclosed in quotes, to represent text that you would type literally.
1853The examples themselves represent shell commands that use single quotes
1854to keep the shell from interpreting the contents of the program.
1855When reading the examples, focus on the text between the open and close
1856quotes.
1857@end ifinfo
1858
1859@table @code
1860@item awk '@{ if (length($0) > max) max = length($0) @}
1861@itemx @ @ @ @ @ END @{ print max @}' data
1862This program prints the length of the longest input line.
1863
1864@item awk 'length($0) > 80' data
1865This program prints every line that is longer than 80 characters.  The sole
1866rule has a relational expression as its pattern, and has no action (so the
1867default action, printing the record, is used).
1868
1869@item expand@ data@ |@ awk@ '@{ if (x < length()) x = length() @}
1870@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ END @{ print "maximum line length is " x @}'
1871This program prints the length of the longest line in @file{data}.  The input
1872is processed by the @code{expand} program to change tabs into spaces,
1873so the widths compared are actually the right-margin columns.
1874
1875@item awk 'NF > 0' data
1876This program prints every line that has at least one field.  This is an
1877easy way to delete blank lines from a file (or rather, to create a new
1878file similar to the old file but from which the blank lines have been
1879deleted).
1880
1881@c Karl Berry points out that new users probably don't want to see
1882@c multiple ways to do things, just the `best' way.  He's probably
1883@c right.  At some point it might be worth adding something about there
1884@c often being multiple ways to do things in awk, but for now we'll
1885@c just take this one out.
1886@ignore
1887@item awk '@{ if (NF > 0) print @}' data
1888This program also prints every line that has at least one field.  Here we
1889allow the rule to match every line, and then decide in the action whether
1890to print.
1891@end ignore
1892
1893@item awk@ 'BEGIN@ @{@ for (i = 1; i <= 7; i++)
1894@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ print int(101 * rand()) @}'
1895This program prints seven random numbers from zero to 100, inclusive.
1896
1897@item ls -lg @var{files} | awk '@{ x += $5 @} ; END @{ print "total bytes: " x @}'
1898This program prints the total number of bytes used by @var{files}.
1899
1900@item ls -lg @var{files} | awk '@{ x += $5 @}
1901@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ END @{ print "total K-bytes: " (x + 1023)/1024 @}'
1902This program prints the total number of kilobytes used by @var{files}.
1903
1904@item awk -F: '@{ print $1 @}' /etc/passwd | sort
1905This program prints a sorted list of the login names of all users.
1906
1907@item awk 'END @{ print NR @}' data
1908This program counts lines in a file.
1909
1910@item awk 'NR % 2 == 0' data
1911This program prints the even numbered lines in the data file.
1912If you were to use the expression @samp{NR % 2 == 1} instead,
1913it would print the odd numbered lines.
1914@end table
1915
1916@node Regexp, Reading Files, One-liners, Top
1917@chapter Regular Expressions
1918@cindex pattern, regular expressions
1919@cindex regexp
1920@cindex regular expression
1921@cindex regular expressions as patterns
1922
1923A @dfn{regular expression}, or @dfn{regexp}, is a way of describing a
1924set of strings.
1925Because regular expressions are such a fundamental part of @code{awk}
1926programming, their format and use deserve a separate chapter.
1927
1928A regular expression enclosed in slashes (@samp{/})
1929is an @code{awk} pattern that matches every input record whose text
1930belongs to that set.
1931
1932The simplest regular expression is a sequence of letters, numbers, or
1933both.  Such a regexp matches any string that contains that sequence.
1934Thus, the regexp @samp{foo} matches any string containing @samp{foo}.
1935Therefore, the pattern @code{/foo/} matches any input record containing
1936the three characters @samp{foo}, @emph{anywhere} in the record.  Other
1937kinds of regexps let you specify more complicated classes of strings.
1938
1939@iftex
1940Initially, the examples will be simple. As we explain more about how
1941regular expressions work, we will present more complicated examples.
1942@end iftex
1943
1944@menu
1945* Regexp Usage::                How to Use Regular Expressions.
1946* Escape Sequences::            How to write non-printing characters.
1947* Regexp Operators::            Regular Expression Operators.
1948* GNU Regexp Operators::        Operators specific to GNU software.
1949* Case-sensitivity::            How to do case-insensitive matching.
1950* Leftmost Longest::            How much text matches.
1951* Computed Regexps::            Using Dynamic Regexps.
1952@end menu
1953
1954@node Regexp Usage, Escape Sequences, Regexp, Regexp
1955@section How to Use Regular Expressions
1956
1957A regular expression can be used as a pattern by enclosing it in
1958slashes.  Then the regular expression is tested against the
1959entire text of each record.  (Normally, it only needs
1960to match some part of the text in order to succeed.)  For example, this
1961prints the second field of each record that contains the three
1962characters @samp{foo} anywhere in it:
1963
1964@example
1965@group
1966$ awk '/foo/ @{ print $2 @}' BBS-list
1967@print{} 555-1234
1968@print{} 555-6699
1969@print{} 555-6480
1970@print{} 555-2127
1971@end group
1972@end example
1973
1974@cindex regexp matching operators
1975@cindex string-matching operators
1976@cindex operators, string-matching
1977@cindex operators, regexp matching
1978@cindex regexp match/non-match operators
1979@cindex @code{~} operator
1980@cindex @code{!~} operator
1981Regular expressions can also be used in matching expressions.  These
1982expressions allow you to specify the string to match against; it need
1983not be the entire current input record.  The two operators, @samp{~}
1984and @samp{!~}, perform regular expression comparisons.  Expressions
1985using these operators can be used as patterns or in @code{if},
1986@code{while}, @code{for}, and @code{do} statements.
1987@ifinfo
1988@c adding this xref in TeX screws up the formatting too much
1989(@xref{Statements, ,Control Statements in Actions}.)
1990@end ifinfo
1991
1992@table @code
1993@item @var{exp} ~ /@var{regexp}/
1994This is true if the expression @var{exp} (taken as a string)
1995is matched by @var{regexp}.  The following example matches, or selects,
1996all input records with the upper-case letter @samp{J} somewhere in the
1997first field:
1998
1999@example
2000@group
2001$ awk '$1 ~ /J/' inventory-shipped
2002@print{} Jan  13  25  15 115
2003@print{} Jun  31  42  75 492
2004@print{} Jul  24  34  67 436
2005@print{} Jan  21  36  64 620
2006@end group
2007@end example
2008
2009So does this:
2010
2011@example
2012awk '@{ if ($1 ~ /J/) print @}' inventory-shipped
2013@end example
2014
2015@item @var{exp} !~ /@var{regexp}/
2016This is true if the expression @var{exp} (taken as a character string)
2017is @emph{not} matched by @var{regexp}.  The following example matches,
2018or selects, all input records whose first field @emph{does not} contain
2019the upper-case letter @samp{J}:
2020
2021@example
2022@group
2023$ awk '$1 !~ /J/' inventory-shipped
2024@print{} Feb  15  32  24 226
2025@print{} Mar  15  24  34 228
2026@print{} Apr  31  52  63 420
2027@print{} May  16  34  29 208
2028@dots{}
2029@end group
2030@end example
2031@end table
2032
2033@cindex regexp constant
2034When a regexp is written enclosed in slashes, like @code{/foo/}, we call it
2035a @dfn{regexp constant}, much like @code{5.27} is a numeric constant, and
2036@code{"foo"} is a string constant.
2037
2038@node Escape Sequences, Regexp Operators, Regexp Usage, Regexp
2039@section Escape Sequences
2040
2041@cindex escape sequence notation
2042Some characters cannot be included literally in string constants
2043(@code{"foo"}) or regexp constants (@code{/foo/}).  You represent them
2044instead with @dfn{escape sequences}, which are character sequences
2045beginning with a backslash (@samp{\}).
2046
2047One use of an escape sequence is to include a double-quote character in
2048a string constant.  Since a plain double-quote would end the string, you
2049must use @samp{\"} to represent an actual double-quote character as a
2050part of the string.  For example:
2051
2052@example
2053$ awk 'BEGIN @{ print "He said \"hi!\" to her." @}'
2054@print{} He said "hi!" to her.
2055@end example
2056
2057The  backslash character itself is another character that cannot be
2058included normally; you write @samp{\\} to put one backslash in the
2059string or regexp.  Thus, the string whose contents are the two characters
2060@samp{"} and @samp{\} must be written @code{"\"\\"}.
2061
2062Another use of backslash is to represent unprintable characters
2063such as tab or newline.  While there is nothing to stop you from entering most
2064unprintable characters directly in a string constant or regexp constant,
2065they may look ugly.
2066
2067Here is a table of all the escape sequences used in @code{awk}, and
2068what they represent. Unless noted otherwise, all of these escape
2069sequences apply to both string constants and regexp constants.
2070
2071@c @cartouche
2072@table @code
2073@item \\
2074A literal backslash, @samp{\}.
2075
2076@cindex @code{awk} language, V.4 version
2077@item \a
2078The ``alert'' character, @kbd{Control-g}, ASCII code 7 (BEL).
2079
2080@item \b
2081Backspace, @kbd{Control-h}, ASCII code 8 (BS).
2082
2083@item \f
2084Formfeed, @kbd{Control-l}, ASCII code 12 (FF).
2085
2086@item \n
2087Newline, @kbd{Control-j}, ASCII code 10 (LF).
2088
2089@item \r
2090Carriage return, @kbd{Control-m}, ASCII code 13 (CR).
2091
2092@item \t
2093Horizontal tab, @kbd{Control-i}, ASCII code 9 (HT).
2094
2095@cindex @code{awk} language, V.4 version
2096@item \v
2097Vertical tab, @kbd{Control-k}, ASCII code 11 (VT).
2098
2099@item \@var{nnn}
2100The octal value @var{nnn}, where @var{nnn} are one to three digits
2101between @samp{0} and @samp{7}.  For example, the code for the ASCII ESC
2102(escape) character is @samp{\033}.
2103
2104@cindex @code{awk} language, V.4 version
2105@cindex @code{awk} language, POSIX version
2106@cindex POSIX @code{awk}
2107@item \x@var{hh}@dots{}
2108The hexadecimal value @var{hh}, where @var{hh} are hexadecimal
2109digits (@samp{0} through @samp{9} and either @samp{A} through @samp{F} or
2110@samp{a} through @samp{f}).  Like the same construct in ANSI C, the escape
2111sequence continues until the first non-hexadecimal digit is seen.  However,
2112using more than two hexadecimal digits produces undefined results. (The
2113@samp{\x} escape sequence is not allowed in POSIX @code{awk}.)
2114
2115@item \/
2116A literal slash (necessary for regexp constants only).
2117You use this when you wish to write a regexp
2118constant that contains a slash. Since the regexp is delimited by
2119slashes, you need to escape the slash that is part of the pattern,
2120in order to tell @code{awk} to keep processing the rest of the regexp.
2121
2122@item \"
2123A literal double-quote (necessary for string constants only).
2124You use this when you wish to write a string
2125constant that contains a double-quote. Since the string is delimited by
2126double-quotes, you need to escape the quote that is part of the string,
2127in order to tell @code{awk} to keep processing the rest of the string.
2128@end table
2129@c @end cartouche
2130
2131In @code{gawk}, there are additional two character sequences that begin
2132with backslash that have special meaning in regexps.
2133@xref{GNU Regexp Operators, ,Additional Regexp Operators Only in @code{gawk}}.
2134
2135In a string constant,
2136what happens if you place a backslash before something that is not one of
2137the characters listed above?  POSIX @code{awk} purposely leaves this case
2138undefined.  There are two choices.
2139
2140@itemize @bullet
2141@item
2142Strip the backslash out.  This is what Unix @code{awk} and @code{gawk} both do.
2143For example, @code{"a\qc"} is the same as @code{"aqc"}.
2144
2145@item
2146Leave the backslash alone.  Some other @code{awk} implementations do this.
2147In such implementations, @code{"a\qc"} is the same as if you had typed
2148@code{"a\\qc"}.
2149@end itemize
2150
2151In a regexp, a backslash before any character that is not in the above table,
2152and not listed in
2153@ref{GNU Regexp Operators, ,Additional Regexp Operators Only in @code{gawk}},
2154means that the next character should be taken literally, even if it would
2155normally be a regexp operator. E.g., @code{/a\+b/} matches the three
2156characters @samp{a+b}.
2157
2158@cindex portability issues
2159For complete portability, do not use a backslash before any character not
2160listed in the table above.
2161
2162Another interesting question arises. Suppose you use an octal or hexadecimal
2163escape to represent a regexp metacharacter
2164(@pxref{Regexp Operators, ,  Regular Expression Operators}).
2165Does @code{awk} treat the character as a literal character, or as a regexp
2166operator?
2167
2168@cindex dark corner
2169It turns out that historically, such characters were taken literally (d.c.).
2170However, the POSIX standard indicates that they should be treated
2171as real metacharacters, and this is what @code{gawk} does.
2172However, in compatibility mode (@pxref{Options, ,Command Line Options}),
2173@code{gawk} treats the characters represented by octal and hexadecimal
2174escape sequences literally when used in regexp constants. Thus,
2175@code{/a\52b/} is equivalent to @code{/a\*b/}.
2176
2177To summarize:
2178
2179@enumerate 1
2180@item
2181The escape sequences in the table above are always processed first,
2182for both string constants and regexp constants. This happens very early,
2183as soon as @code{awk} reads your program.
2184
2185@item
2186@code{gawk} processes both regexp constants and dynamic regexps
2187(@pxref{Computed Regexps, ,Using Dynamic Regexps}),
2188for the special operators listed in
2189@ref{GNU Regexp Operators, ,Additional Regexp Operators Only in @code{gawk}}.
2190
2191@item
2192A backslash before any other character means to treat that character
2193literally.
2194@end enumerate
2195
2196@node Regexp Operators, GNU Regexp Operators, Escape Sequences, Regexp
2197@section Regular Expression Operators
2198@cindex metacharacters
2199@cindex regular expression metacharacters
2200@cindex regexp operators
2201
2202You can combine regular expressions with the following characters,
2203called @dfn{regular expression operators}, or @dfn{metacharacters}, to
2204increase the power and versatility of regular expressions.
2205
2206The escape sequences described
2207@iftex
2208above
2209@end iftex
2210in @ref{Escape Sequences},
2211are valid inside a regexp.  They are introduced by a @samp{\}.  They
2212are recognized and converted into the corresponding real characters as
2213the very first step in processing regexps.
2214
2215Here is a table of metacharacters.  All characters that are not escape
2216sequences and that are not listed in the table stand for themselves.
2217
2218@table @code
2219@item \
2220This is used to suppress the special meaning of a character when
2221matching.  For example:
2222
2223@example
2224\$
2225@end example
2226
2227@noindent
2228matches the character @samp{$}.
2229
2230@c NEEDED
2231@page
2232@cindex anchors in regexps
2233@cindex regexp, anchors
2234@item ^
2235This matches the beginning of a string.  For example:
2236
2237@example
2238^@@chapter
2239@end example
2240
2241@noindent
2242matches the @samp{@@chapter} at the beginning of a string, and can be used
2243to identify chapter beginnings in Texinfo source files.
2244The @samp{^} is known as an @dfn{anchor}, since it anchors the pattern to
2245matching only at the beginning of the string.
2246
2247It is important to realize that @samp{^} does not match the beginning of
2248a line embedded in a string.  In this example the condition is not true:
2249
2250@example
2251if ("line1\nLINE 2" ~ /^L/) @dots{}
2252@end example
2253
2254@item $
2255This is similar to @samp{^}, but it matches only at the end of a string.
2256For example:
2257
2258@example
2259p$
2260@end example
2261
2262@noindent
2263matches a record that ends with a @samp{p}.  The @samp{$} is also an anchor,
2264and also does not match the end of a line embedded in a string.  In this
2265example the condition is not true:
2266
2267@example
2268if ("line1\nLINE 2" ~ /1$/) @dots{}
2269@end example
2270
2271@item .
2272The period, or dot, matches any single character,
2273@emph{including} the newline character.  For example:
2274
2275@example
2276.P
2277@end example
2278
2279@noindent
2280matches any single character followed by a @samp{P} in a string.  Using
2281concatenation we can make a regular expression like @samp{U.A}, which
2282matches any three-character sequence that begins with @samp{U} and ends
2283with @samp{A}.
2284
2285@cindex @code{awk} language, POSIX version
2286@cindex POSIX @code{awk}
2287In strict POSIX mode (@pxref{Options, ,Command Line Options}),
2288@samp{.} does not match the @sc{nul}
2289character, which is a character with all bits equal to zero.
2290Otherwise, @sc{nul} is just another character. Other versions of @code{awk}
2291may not be able to match the @sc{nul} character.
2292
2293@ignore
22942e: Add stuff that character list is the POSIX terminology. In other
2295    literature known as character set or character class.
2296@end ignore
2297
2298@cindex character list
2299@item [@dots{}]
2300This is called a @dfn{character list}.  It matches any @emph{one} of the
2301characters that are enclosed in the square brackets.  For example:
2302
2303@example
2304[MVX]
2305@end example
2306
2307@noindent
2308matches any one of the characters @samp{M}, @samp{V}, or @samp{X} in a
2309string.
2310
2311Ranges of characters are indicated by using a hyphen between the beginning
2312and ending characters, and enclosing the whole thing in brackets.  For
2313example:
2314
2315@example
2316[0-9]
2317@end example
2318
2319@noindent
2320matches any digit.
2321Multiple ranges are allowed. E.g., the list @code{@w{[A-Za-z0-9]}} is a
2322common way to express the idea of ``all alphanumeric characters.''
2323
2324To include one of the characters @samp{\}, @samp{]}, @samp{-} or @samp{^} in a
2325character list, put a @samp{\} in front of it.  For example:
2326
2327@example
2328[d\]]
2329@end example
2330
2331@noindent
2332matches either @samp{d}, or @samp{]}.
2333
2334@cindex @code{egrep}
2335This treatment of @samp{\} in character lists
2336is compatible with other @code{awk}
2337implementations, and is also mandated by POSIX.
2338The regular expressions in @code{awk} are a superset
2339of the POSIX specification for Extended Regular Expressions (EREs).
2340POSIX EREs are based on the regular expressions accepted by the
2341traditional @code{egrep} utility.
2342
2343@cindex character classes
2344@cindex @code{awk} language, POSIX version
2345@cindex POSIX @code{awk}
2346@dfn{Character classes} are a new feature introduced in the POSIX standard.
2347A character class is a special notation for describing
2348lists of characters that have a specific attribute, but where the
2349actual characters themselves can vary from country to country and/or
2350from character set to character set.  For example, the notion of what
2351is an alphabetic character differs in the USA and in France.
2352
2353A character class is only valid in a regexp @emph{inside} the
2354brackets of a character list.  Character classes consist of @samp{[:},
2355a keyword denoting the class, and @samp{:]}.  Here are the character
2356classes defined by the POSIX standard.
2357
2358@table @code
2359@item [:alnum:]
2360Alphanumeric characters.
2361
2362@item [:alpha:]
2363Alphabetic characters.
2364
2365@item [:blank:]
2366Space and tab characters.
2367
2368@item [:cntrl:]
2369Control characters.
2370
2371@item [:digit:]
2372Numeric characters.
2373
2374@item [:graph:]
2375Characters that are printable and are also visible.
2376(A space is printable, but not visible, while an @samp{a} is both.)
2377
2378@item [:lower:]
2379Lower-case alphabetic characters.
2380
2381@item [:print:]
2382Printable characters (characters that are not control characters.)
2383
2384@item [:punct:]
2385Punctuation characters (characters that are not letter, digits,
2386control characters, or space characters).
2387
2388@item [:space:]
2389Space characters (such as space, tab, and formfeed, to name a few).
2390
2391@item [:upper:]
2392Upper-case alphabetic characters.
2393
2394@item [:xdigit:]
2395Characters that are hexadecimal digits.
2396@end table
2397
2398For example, before the POSIX standard, to match alphanumeric
2399characters, you had to write @code{/[A-Za-z0-9]/}.  If your
2400character set had other alphabetic characters in it, this would not
2401match them.  With the POSIX character classes, you can write
2402@code{/[[:alnum:]]/}, and this will match @emph{all} the alphabetic
2403and numeric characters in your character set.
2404
2405@cindex collating elements
2406Two additional special sequences can appear in character lists.
2407These apply to non-ASCII character sets, which can have single symbols
2408(called @dfn{collating elements}) that are represented with more than one
2409character, as well as several characters that are equivalent for
2410@dfn{collating}, or sorting, purposes.  (E.g., in French, a plain ``e''
2411and a grave-accented ``@`e'' are equivalent.)
2412
2413@table @asis
2414@cindex collating symbols
2415@item Collating Symbols
2416A @dfn{collating symbol} is a multi-character collating element enclosed in
2417@samp{[.} and @samp{.]}.  For example, if @samp{ch} is a collating element,
2418then @code{[[.ch.]]} is a regexp that matches this collating element, while
2419@code{[ch]} is a regexp that matches either @samp{c} or @samp{h}.
2420
2421@cindex equivalence classes
2422@item Equivalence Classes
2423An @dfn{equivalence class} is a locale-specific name for a list of
2424characters that are equivalent. The name is enclosed in
2425@samp{[=} and @samp{=]}.
2426For example, the name @samp{e} might be used to represent all of
2427``e,'' ``@`e,'' and ``@'e.'' In this case, @code{[[=e]]} is a regexp
2428that matches any of @samp{e}, @samp{@'e},  or @samp{@`e}.
2429@end table
2430
2431These features are very valuable in non-English speaking locales.
2432
2433@strong{Caution:} The library functions that @code{gawk} uses for regular
2434expression matching currently only recognize POSIX character classes;
2435they do not recognize collating symbols or equivalence classes.
2436@c maybe one day ...
2437
2438@cindex complemented character list
2439@cindex character list, complemented
2440@item [^ @dots{}]
2441This is a @dfn{complemented character list}.  The first character after
2442the @samp{[} @emph{must} be a @samp{^}.  It matches any characters
2443@emph{except} those in the square brackets.  For example:
2444
2445@example
2446[^0-9]
2447@end example
2448
2449@noindent
2450matches any character that is not a digit.
2451
2452@item |
2453This is the @dfn{alternation operator}, and it is used to specify
2454alternatives.  For example:
2455
2456@example
2457^P|[0-9]
2458@end example
2459
2460@noindent
2461matches any string that matches either @samp{^P} or @samp{[0-9]}.  This
2462means it matches any string that starts with @samp{P} or contains a digit.
2463
2464The alternation applies to the largest possible regexps on either side.
2465In other words, @samp{|} has the lowest precedence of all the regular
2466expression operators.
2467
2468@item (@dots{})
2469Parentheses are used for grouping in regular expressions as in
2470arithmetic.  They can be used to concatenate regular expressions
2471containing the alternation operator, @samp{|}.  For example,
2472@samp{@@(samp|code)\@{[^@}]+\@}} matches both @samp{@@code@{foo@}} and
2473@samp{@@samp@{bar@}}. (These are Texinfo formatting control sequences.)
2474
2475@item *
2476This symbol means that the preceding regular expression is to be
2477repeated as many times as necessary to find a match.  For example:
2478
2479@example
2480ph*
2481@end example
2482
2483@noindent
2484applies the @samp{*} symbol to the preceding @samp{h} and looks for matches
2485of one @samp{p} followed by any number of @samp{h}s.  This will also match
2486just @samp{p} if no @samp{h}s are present.
2487
2488The @samp{*} repeats the @emph{smallest} possible preceding expression.
2489(Use parentheses if you wish to repeat a larger expression.)  It finds
2490as many repetitions as possible.  For example:
2491
2492@example
2493awk '/\(c[ad][ad]*r x\)/ @{ print @}' sample
2494@end example
2495
2496@noindent
2497prints every record in @file{sample} containing a string of the form
2498@samp{(car x)}, @samp{(cdr x)}, @samp{(cadr x)}, and so on.
2499Notice the escaping of the parentheses by preceding them
2500with backslashes.
2501
2502@item +
2503This symbol is similar to @samp{*}, but the preceding expression must be
2504matched at least once.  This means that:
2505
2506@example
2507wh+y
2508@end example
2509
2510@noindent
2511would match @samp{why} and @samp{whhy} but not @samp{wy}, whereas
2512@samp{wh*y} would match all three of these strings.  This is a simpler
2513way of writing the last @samp{*} example:
2514
2515@example
2516awk '/\(c[ad]+r x\)/ @{ print @}' sample
2517@end example
2518
2519@item ?
2520This symbol is similar to @samp{*}, but the preceding expression can be
2521matched either once or not at all.  For example:
2522
2523@example
2524fe?d
2525@end example
2526
2527@noindent
2528will match @samp{fed} and @samp{fd}, but nothing else.
2529
2530@cindex @code{awk} language, POSIX version
2531@cindex POSIX @code{awk}
2532@cindex interval expressions
2533@item @{@var{n}@}
2534@itemx @{@var{n},@}
2535@itemx @{@var{n},@var{m}@}
2536One or two numbers inside braces denote an @dfn{interval expression}.
2537If there is one number in the braces, the preceding regexp is repeated
2538@var{n} times.
2539If there are two numbers separated by a comma, the preceding regexp is
2540repeated @var{n} to @var{m} times.
2541If there is one number followed by a comma, then the preceding regexp
2542is repeated at least @var{n} times.
2543
2544@table @code
2545@item wh@{3@}y
2546matches @samp{whhhy} but not @samp{why} or @samp{whhhhy}.
2547
2548@item wh@{3,5@}y
2549matches @samp{whhhy} or @samp{whhhhy} or @samp{whhhhhy}, only.
2550
2551@item wh@{2,@}y
2552matches @samp{whhy} or @samp{whhhy}, and so on.
2553@end table
2554
2555Interval expressions were not traditionally available in @code{awk}.
2556As part of the POSIX standard they were added, to make @code{awk}
2557and @code{egrep} consistent with each other.
2558
2559However, since old programs may use @samp{@{} and @samp{@}} in regexp
2560constants, by default @code{gawk} does @emph{not} match interval expressions
2561in regexps.  If either @samp{--posix} or @samp{--re-interval} are specified
2562(@pxref{Options, , Command Line Options}), then interval expressions
2563are allowed in regexps.
2564@end table
2565
2566@cindex precedence, regexp operators
2567@cindex regexp operators, precedence of
2568In regular expressions, the @samp{*}, @samp{+}, and @samp{?} operators,
2569as well as the braces @samp{@{} and @samp{@}},
2570have
2571the highest precedence, followed by concatenation, and finally by @samp{|}.
2572As in arithmetic, parentheses can change how operators are grouped.
2573
2574If @code{gawk} is in compatibility mode
2575(@pxref{Options, ,Command Line Options}),
2576character classes and interval expressions are not available in
2577regular expressions.
2578
2579The next
2580@ifinfo
2581node
2582@end ifinfo
2583@iftex
2584section
2585@end iftex
2586discusses the GNU-specific regexp operators, and provides
2587more detail concerning how command line options affect the way @code{gawk}
2588interprets the characters in regular expressions.
2589
2590@node GNU Regexp Operators, Case-sensitivity, Regexp Operators, Regexp
2591@section Additional Regexp Operators Only in @code{gawk}
2592
2593@c This section adapted from the regex-0.12 manual
2594
2595@cindex regexp operators, GNU specific
2596GNU software that deals with regular expressions provides a number of
2597additional regexp operators.  These operators are described in this
2598section, and are specific to @code{gawk}; they are not available in other
2599@code{awk} implementations.
2600
2601@cindex word, regexp definition of
2602Most of the additional operators are for dealing with word matching.
2603For our purposes, a @dfn{word} is a sequence of one or more letters, digits,
2604or underscores (@samp{_}).
2605
2606@table @code
2607@cindex @code{\w} regexp operator
2608@item \w
2609This operator matches any word-constituent character, i.e.@: any
2610letter, digit, or underscore. Think of it as a short-hand for
2611@c @w{@code{[A-Za-z0-9_]}} or
2612@w{@code{[[:alnum:]_]}}.
2613
2614@cindex @code{\W} regexp operator
2615@item \W
2616This operator matches any character that is not word-constituent.
2617Think of it as a short-hand for
2618@c @w{@code{[^A-Za-z0-9_]}} or
2619@w{@code{[^[:alnum:]_]}}.
2620
2621@cindex @code{\<} regexp operator
2622@item \<
2623This operator matches the empty string at the beginning of a word.
2624For example, @code{/\<away/} matches @samp{away}, but not
2625@samp{stowaway}.
2626
2627@cindex @code{\>} regexp operator
2628@item \>
2629This operator matches the empty string at the end of a word.
2630For example, @code{/stow\>/} matches @samp{stow}, but not @samp{stowaway}.
2631
2632@cindex @code{\y} regexp operator
2633@cindex word boundaries, matching
2634@item \y
2635This operator matches the empty string at either the beginning or the
2636end of a word (the word boundar@strong{y}).  For example, @samp{\yballs?\y}
2637matches either @samp{ball} or @samp{balls} as a separate word.
2638
2639@cindex @code{\B} regexp operator
2640@item \B
2641This operator matches the empty string within a word. In other words,
2642@samp{\B} matches the empty string that occurs between two
2643word-constituent characters. For example,
2644@code{/\Brat\B/} matches @samp{crate}, but it does not match @samp{dirty rat}.
2645@samp{\B} is essentially the opposite of @samp{\y}.
2646@end table
2647
2648There are two other operators that work on buffers.  In Emacs, a
2649@dfn{buffer} is, naturally, an Emacs buffer.  For other programs, the
2650regexp library routines that @code{gawk} uses consider the entire
2651string to be matched as the buffer.
2652
2653For @code{awk}, since @samp{^} and @samp{$} always work in terms
2654of the beginning and end of strings, these operators don't add any
2655new capabilities.  They are provided for compatibility with other GNU
2656software.
2657
2658@cindex buffer matching operators
2659@table @code
2660@cindex @code{\`} regexp operator
2661@item \`
2662This operator matches the empty string at the
2663beginning of the buffer.
2664
2665@cindex @code{\'} regexp operator
2666@item \'
2667This operator matches the empty string at the
2668end of the buffer.
2669@end table
2670
2671In other GNU software, the word boundary operator is @samp{\b}. However,
2672that conflicts with the @code{awk} language's definition of @samp{\b}
2673as backspace, so @code{gawk} uses a different letter.
2674
2675An alternative method would have been to require two backslashes in the
2676GNU operators, but this was deemed to be too confusing, and the current
2677method of using @samp{\y} for the GNU @samp{\b} appears to be the
2678lesser of two evils.
2679
2680@c NOTE!!! Keep this in sync with the same table in the summary appendix!
2681@cindex regexp, effect of command line options
2682The various command line options
2683(@pxref{Options, ,Command Line Options})
2684control how @code{gawk} interprets characters in regexps.
2685
2686@table @asis
2687@item No options
2688In the default case, @code{gawk} provides all the facilities of
2689POSIX regexps and the GNU regexp operators described
2690@iftex
2691above.
2692@end iftex
2693@ifinfo
2694in @ref{Regexp Operators, ,Regular Expression Operators}.
2695@end ifinfo
2696However, interval expressions are not supported.
2697
2698@item @code{--posix}
2699Only POSIX regexps are supported, the GNU operators are not special
2700(e.g., @samp{\w} matches a literal @samp{w}).  Interval expressions
2701are allowed.
2702
2703@item @code{--traditional}
2704Traditional Unix @code{awk} regexps are matched. The GNU operators
2705are not special, interval expressions are not available, and neither
2706are the POSIX character classes (@code{[[:alnum:]]} and so on).
2707Characters described by octal and hexadecimal escape sequences are
2708treated literally, even if they represent regexp metacharacters.
2709
2710@item @code{--re-interval}
2711Allow interval expressions in regexps, even if @samp{--traditional}
2712has been provided.
2713@end table
2714
2715@node Case-sensitivity, Leftmost Longest, GNU Regexp Operators, Regexp
2716@section Case-sensitivity in Matching
2717
2718@cindex case sensitivity
2719@cindex ignoring case
2720Case is normally significant in regular expressions, both when matching
2721ordinary characters (i.e.@: not metacharacters), and inside character
2722sets.  Thus a @samp{w} in a regular expression matches only a lower-case
2723@samp{w} and not an upper-case @samp{W}.
2724
2725The simplest way to do a case-independent match is to use a character
2726list: @samp{[Ww]}.  However, this can be cumbersome if you need to use it
2727often; and it can make the regular expressions harder to
2728read.  There are two alternatives that you might prefer.
2729
2730One way to do a case-insensitive match at a particular point in the
2731program is to convert the data to a single case, using the
2732@code{tolower} or @code{toupper} built-in string functions (which we
2733haven't discussed yet;
2734@pxref{String Functions, ,Built-in Functions for String Manipulation}).
2735For example:
2736
2737@example
2738tolower($1) ~ /foo/  @{ @dots{} @}
2739@end example
2740
2741@noindent
2742converts the first field to lower-case before matching against it.
2743This will work in any POSIX-compliant implementation of @code{awk}.
2744
2745@cindex differences between @code{gawk} and @code{awk}
2746@cindex @code{~} operator
2747@cindex @code{!~} operator
2748@vindex IGNORECASE
2749Another method, specific to @code{gawk}, is to set the variable
2750@code{IGNORECASE} to a non-zero value (@pxref{Built-in Variables}).
2751When @code{IGNORECASE} is not zero, @emph{all} regexp and string
2752operations ignore case.  Changing the value of
2753@code{IGNORECASE} dynamically controls the case sensitivity of your
2754program as it runs.  Case is significant by default because
2755@code{IGNORECASE} (like most variables) is initialized to zero.
2756
2757@example
2758@group
2759x = "aB"
2760if (x ~ /ab/) @dots{}   # this test will fail
2761@end group
2762
2763@group
2764IGNORECASE = 1
2765if (x ~ /ab/) @dots{}   # now it will succeed
2766@end group
2767@end example
2768
2769In general, you cannot use @code{IGNORECASE} to make certain rules
2770case-insensitive and other rules case-sensitive, because there is no way
2771to set @code{IGNORECASE} just for the pattern of a particular rule.
2772@ignore
2773This isn't quite true. Consider:
2774
2775	IGNORECASE=1 && /foObAr/ { .... }
2776	IGNORECASE=0 || /foobar/ { .... }
2777
2778But that's pretty bad style and I don't want to get into it at this
2779late date.
2780@end ignore
2781To do this, you must use character lists or @code{tolower}.  However, one
2782thing you can do only with @code{IGNORECASE} is turn case-sensitivity on
2783or off dynamically for all the rules at once.
2784
2785@code{IGNORECASE} can be set on the command line, or in a @code{BEGIN} rule
2786(@pxref{Other Arguments, ,Other Command Line Arguments}; also
2787@pxref{Using BEGIN/END, ,Startup and Cleanup Actions}).
2788Setting @code{IGNORECASE} from the command line is a way to make
2789a program case-insensitive without having to edit it.
2790
2791Prior to version 3.0 of @code{gawk}, the value of @code{IGNORECASE}
2792only affected regexp operations. It did not affect string comparison
2793with @samp{==}, @samp{!=}, and so on.
2794Beginning with version 3.0, both regexp and string comparison
2795operations are affected by @code{IGNORECASE}.
2796
2797@cindex ISO 8859-1
2798@cindex ISO Latin-1
2799Beginning with version 3.0 of @code{gawk}, the equivalences between upper-case
2800and lower-case characters are based on the ISO-8859-1 (ISO Latin-1)
2801character set. This character set is a superset of the traditional 128
2802ASCII characters, that also provides a number of characters suitable
2803for use with European languages.
2804@ignore
2805A pure ASCII character set can be used instead if @code{gawk} is compiled
2806with @samp{-DUSE_PURE_ASCII}.
2807@end ignore
2808
2809The value of @code{IGNORECASE} has no effect if @code{gawk} is in
2810compatibility mode (@pxref{Options, ,Command Line Options}).
2811Case is always significant in compatibility mode.
2812
2813@node Leftmost Longest, Computed Regexps, Case-sensitivity, Regexp
2814@section How Much Text Matches?
2815
2816@cindex leftmost longest match
2817@cindex matching, leftmost longest
2818Consider the following example:
2819
2820@example
2821echo aaaabcd | awk '@{ sub(/a+/, "<A>"); print @}'
2822@end example
2823
2824This example uses the @code{sub} function (which we haven't discussed yet,
2825@pxref{String Functions, ,Built-in Functions for String Manipulation})
2826to make a change to the input record. Here, the regexp @code{/a+/}
2827indicates ``one or more @samp{a} characters,'' and the replacement
2828text is @samp{<A>}.
2829
2830The input contains four @samp{a} characters.  What will the output be?
2831In other words, how many is ``one or more''---will @code{awk} match two,
2832three, or all four @samp{a} characters?
2833
2834The answer is, @code{awk} (and POSIX) regular expressions always match
2835the leftmost, @emph{longest} sequence of input characters that can
2836match.  Thus, in this example, all four @samp{a} characters are
2837replaced with @samp{<A>}.
2838
2839@example
2840$ echo aaaabcd | awk '@{ sub(/a+/, "<A>"); print @}'
2841@print{} <A>bcd
2842@end example
2843
2844For simple match/no-match tests, this is not so important. But when doing
2845text matching and substitutions with the @code{match}, @code{sub}, @code{gsub},
2846and @code{gensub} functions, it is very important.
2847@ifinfo
2848@xref{String Functions, ,Built-in Functions for String Manipulation},
2849for more information on these functions.
2850@end ifinfo
2851Understanding this principle is also important for regexp-based record
2852and field splitting (@pxref{Records, ,How Input is Split into Records},
2853and also @pxref{Field Separators, ,Specifying How Fields are Separated}).
2854
2855@node Computed Regexps, , Leftmost Longest, Regexp
2856@section Using Dynamic Regexps
2857
2858@cindex computed regular expressions
2859@cindex regular expressions, computed
2860@cindex dynamic regular expressions
2861@cindex regexp, dynamic
2862@cindex @code{~} operator
2863@cindex @code{!~} operator
2864The right hand side of a @samp{~} or @samp{!~} operator need not be a
2865regexp constant (i.e.@: a string of characters between slashes).  It may
2866be any expression.  The expression is evaluated, and converted if
2867necessary to a string; the contents of the string are used as the
2868regexp.  A regexp that is computed in this way is called a @dfn{dynamic
2869regexp}.  For example:
2870
2871@example
2872BEGIN @{ identifier_regexp = "[A-Za-z_][A-Za-z_0-9]*" @}
2873$0 ~ identifier_regexp    @{ print @}
2874@end example
2875
2876@noindent
2877sets @code{identifier_regexp} to a regexp that describes @code{awk}
2878variable names, and tests if the input record matches this regexp.
2879
2880@ignore
2881Do we want to use "^[A-Za-z_][A-Za-z_0-9]*$" to restrict the entire
2882record to just identifiers?  Doing that also would disrupt the flow of
2883the text.
2884@end ignore
2885
2886@strong{Caution:} When using the @samp{~} and @samp{!~}
2887operators, there is a difference between a regexp constant
2888enclosed in slashes, and a string constant enclosed in double quotes.
2889If you are going to use a string constant, you have to understand that
2890the string is in essence scanned @emph{twice}; the first time when
2891@code{awk} reads your program, and the second time when it goes to
2892match the string on the left-hand side of the operator with the pattern
2893on the right.  This is true of any string valued expression (such as
2894@code{identifier_regexp} above), not just string constants.
2895
2896@cindex regexp constants, difference between slashes and quotes
2897What difference does it make if the string is
2898scanned twice? The answer has to do with escape sequences, and particularly
2899with backslashes.  To get a backslash into a regular expression inside a
2900string, you have to type two backslashes.
2901
2902For example, @code{/\*/} is a regexp constant for a literal @samp{*}.
2903Only one backslash is needed.  To do the same thing with a string,
2904you would have to type @code{"\\*"}.  The first backslash escapes the
2905second one, so that the string actually contains the
2906two characters @samp{\} and @samp{*}.
2907
2908@cindex common mistakes
2909@cindex mistakes, common
2910@cindex errors, common
2911Given that you can use both regexp and string constants to describe
2912regular expressions, which should you use?  The answer is ``regexp
2913constants,'' for several reasons.
2914
2915@enumerate 1
2916@item
2917String constants are more complicated to write, and
2918more difficult to read. Using regexp constants makes your programs
2919less error-prone.  Not understanding the difference between the two
2920kinds of constants is a common source of errors.
2921
2922@item
2923It is also more efficient to use regexp constants: @code{awk} can note
2924that you have supplied a regexp and store it internally in a form that
2925makes pattern matching more efficient.  When using a string constant,
2926@code{awk} must first convert the string into this internal form, and
2927then perform the pattern matching.
2928
2929@item
2930Using regexp constants is better style; it shows clearly that you
2931intend a regexp match.
2932@end enumerate
2933
2934@node Reading Files, Printing, Regexp, Top
2935@chapter Reading Input Files
2936
2937@cindex reading files
2938@cindex input
2939@cindex standard input
2940@vindex FILENAME
2941In the typical @code{awk} program, all input is read either from the
2942standard input (by default the keyboard, but often a pipe from another
2943command) or from files whose names you specify on the @code{awk} command
2944line.  If you specify input files, @code{awk} reads them in order, reading
2945all the data from one before going on to the next.  The name of the current
2946input file can be found in the built-in variable @code{FILENAME}
2947(@pxref{Built-in Variables}).
2948
2949The input is read in units called @dfn{records}, and processed by the
2950rules of your program one record at a time.
2951By default, each record is one line.  Each
2952record is automatically split into chunks called @dfn{fields}.
2953This makes it more convenient for programs to work on the parts of a record.
2954
2955On rare occasions you will need to use the @code{getline} command.
2956The  @code{getline} command is valuable, both because it
2957can do explicit input from any number of files, and because the files
2958used with it do not have to be named on the @code{awk} command line
2959(@pxref{Getline, ,Explicit Input with @code{getline}}).
2960
2961@menu
2962* Records::                     Controlling how data is split into records.
2963* Fields::                      An introduction to fields.
2964* Non-Constant Fields::         Non-constant Field Numbers.
2965* Changing Fields::             Changing the Contents of a Field.
2966* Field Separators::            The field separator and how to change it.
2967* Constant Size::               Reading constant width data.
2968* Multiple Line::               Reading multi-line records.
2969* Getline::                     Reading files under explicit program control
2970                                using the @code{getline} function.
2971@end menu
2972
2973@node Records, Fields, Reading Files, Reading Files
2974@section How Input is Split into Records
2975
2976@cindex record separator, @code{RS}
2977@cindex changing the record separator
2978@cindex record, definition of
2979@vindex RS
2980The @code{awk} utility divides the input for your @code{awk}
2981program into records and fields.
2982Records are separated by a character called the @dfn{record separator}.
2983By default, the record separator is the newline character.
2984This is why records are, by default, single lines.
2985You can use a different character for the record separator by
2986assigning the character to the built-in variable @code{RS}.
2987
2988You can change the value of @code{RS} in the @code{awk} program,
2989like any other variable, with the
2990assignment operator, @samp{=} (@pxref{Assignment Ops, ,Assignment Expressions}).
2991The new record-separator character should be enclosed in quotation marks,
2992which indicate
2993a string constant.  Often the right time to do this is at the beginning
2994of execution, before any input has been processed, so that the very
2995first record will be read with the proper separator.  To do this, use
2996the special @code{BEGIN} pattern
2997(@pxref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns}).  For
2998example:
2999
3000@example
3001awk 'BEGIN @{ RS = "/" @} ; @{ print $0 @}' BBS-list
3002@end example
3003
3004@noindent
3005changes the value of @code{RS} to @code{"/"}, before reading any input.
3006This is a string whose first character is a slash; as a result, records
3007are separated by slashes.  Then the input file is read, and the second
3008rule in the @code{awk} program (the action with no pattern) prints each
3009record.  Since each @code{print} statement adds a newline at the end of
3010its output, the effect of this @code{awk} program is to copy the input
3011with each slash changed to a newline.  Here are the results of running
3012the program on @file{BBS-list}:
3013
3014@example
3015@group
3016$ awk 'BEGIN @{ RS = "/" @} ; @{ print $0 @}' BBS-list
3017@print{} aardvark     555-5553     1200
3018@print{} 300          B
3019@print{} alpo-net     555-3412     2400
3020@print{} 1200
3021@print{} 300     A
3022@print{} barfly       555-7685     1200
3023@print{} 300          A
3024@print{} bites        555-1675     2400
3025@print{} 1200
3026@print{} 300     A
3027@print{} camelot      555-0542     300               C
3028@print{} core         555-2912     1200
3029@print{} 300          C
3030@print{} fooey        555-1234     2400
3031@print{} 1200
3032@print{} 300     B
3033@print{} foot         555-6699     1200
3034@print{} 300          B
3035@print{} macfoo       555-6480     1200
3036@print{} 300          A
3037@print{} sdace        555-3430     2400
3038@print{} 1200
3039@print{} 300     A
3040@print{} sabafoo      555-2127     1200
3041@print{} 300          C
3042@print{}
3043@end group
3044@end example
3045
3046@noindent
3047Note that the entry for the @samp{camelot} BBS is not split.
3048In the original data file
3049(@pxref{Sample Data Files,  , Data Files for the Examples}),
3050the line looks like this:
3051
3052@example
3053camelot      555-0542     300               C
3054@end example
3055
3056@noindent
3057It only has one baud rate; there are no slashes in the record.
3058
3059Another way to change the record separator is on the command line,
3060using the variable-assignment feature
3061(@pxref{Other Arguments, ,Other Command Line Arguments}).
3062
3063@example
3064awk '@{ print $0 @}' RS="/" BBS-list
3065@end example
3066
3067@noindent
3068This sets @code{RS} to @samp{/} before processing @file{BBS-list}.
3069
3070Using an unusual character such as @samp{/} for the record separator
3071produces correct behavior in the vast majority of cases.  However,
3072the following (extreme) pipeline prints a surprising @samp{1}.  There
3073is one field, consisting of a newline.  The value of the built-in
3074variable @code{NF} is the number of fields in the current record.
3075
3076@example
3077@group
3078$ echo | awk 'BEGIN @{ RS = "a" @} ; @{ print NF @}'
3079@print{} 1
3080@end group
3081@end example
3082
3083@cindex dark corner
3084@noindent
3085Reaching the end of an input file terminates the current input record,
3086even if the last character in the file is not the character in @code{RS}
3087(d.c.).
3088
3089@cindex empty string
3090The empty string, @code{""} (a string of no characters), has a special meaning
3091as the value of @code{RS}: it means that records are separated
3092by one or more blank lines, and nothing else.
3093@xref{Multiple Line, ,Multiple-Line Records}, for more details.
3094
3095If you change the value of @code{RS} in the middle of an @code{awk} run,
3096the new value is used to delimit subsequent records, but the record
3097currently being processed (and records already processed) are not
3098affected.
3099
3100@vindex RT
3101@cindex record terminator, @code{RT}
3102@cindex terminator, record
3103@cindex differences between @code{gawk} and @code{awk}
3104After the end of the record has been determined, @code{gawk}
3105sets the variable @code{RT} to the text in the input that matched
3106@code{RS}.
3107
3108@cindex regular expressions as record separators
3109The value of @code{RS} is in fact not limited to a one-character
3110string.  It can be any regular expression
3111(@pxref{Regexp, ,Regular Expressions}).
3112In general, each record
3113ends at the next string that matches the regular expression; the next
3114record starts at the end of the matching string.  This general rule is
3115actually at work in the usual case, where @code{RS} contains just a
3116newline: a record ends at the beginning of the next matching string (the
3117next newline in the input) and the following record starts just after
3118the end of this string (at the first character of the following line).
3119The newline, since it matches @code{RS}, is not part of either record.
3120
3121When @code{RS} is a single character, @code{RT} will
3122contain the same single character. However, when @code{RS} is a
3123regular expression, then @code{RT} becomes more useful; it contains
3124the actual input text that matched the regular expression.
3125
3126The following example illustrates both of these features.
3127It sets @code{RS} equal to a regular expression that
3128matches either a newline, or a series of one or more upper-case letters
3129with optional leading and/or trailing white space
3130(@pxref{Regexp, , Regular Expressions}).
3131
3132@example
3133$ echo record 1 AAAA record 2 BBBB record 3 |
3134> gawk 'BEGIN @{ RS = "\n|( *[[:upper:]]+ *)" @}
3135>             @{ print "Record =", $0, "and RT =", RT @}'
3136@print{} Record = record 1 and RT =  AAAA
3137@print{} Record = record 2 and RT =  BBBB
3138@print{} Record = record 3 and RT =
3139@print{}
3140@end example
3141
3142@noindent
3143The final line of output has an extra blank line. This is because the
3144value of @code{RT} is a newline, and then the @code{print} statement
3145supplies its own terminating newline.
3146
3147@xref{Simple Sed, ,A Simple Stream Editor}, for a more useful example
3148of @code{RS} as a regexp and @code{RT}.
3149
3150@cindex differences between @code{gawk} and @code{awk}
3151The use of @code{RS} as a regular expression and the @code{RT}
3152variable are @code{gawk} extensions; they are not available in
3153compatibility mode
3154(@pxref{Options, ,Command Line Options}).
3155In compatibility mode, only the first character of the value of
3156@code{RS} is used to determine the end of the record.
3157
3158@cindex number of records, @code{NR}, @code{FNR}
3159@vindex NR
3160@vindex FNR
3161The @code{awk} utility keeps track of the number of records that have
3162been read so far from the current input file.  This value is stored in a
3163built-in variable called @code{FNR}.  It is reset to zero when a new
3164file is started.  Another built-in variable, @code{NR}, is the total
3165number of input records read so far from all data files.  It starts at zero
3166but is never automatically reset to zero.
3167
3168@node Fields, Non-Constant Fields, Records, Reading Files
3169@section Examining Fields
3170
3171@cindex examining fields
3172@cindex fields
3173@cindex accessing fields
3174When @code{awk} reads an input record, the record is
3175automatically separated or @dfn{parsed} by the interpreter into chunks
3176called @dfn{fields}.  By default, fields are separated by whitespace,
3177like words in a line.
3178Whitespace in @code{awk} means any string of one or more spaces,
3179tabs or newlines;@footnote{In POSIX @code{awk}, newlines are not
3180considered whitespace for separating fields.} other characters such as
3181formfeed, and so on, that are
3182considered whitespace by other languages are @emph{not} considered
3183whitespace by @code{awk}.
3184
3185The purpose of fields is to make it more convenient for you to refer to
3186these pieces of the record.  You don't have to use them---you can
3187operate on the whole record if you wish---but fields are what make
3188simple @code{awk} programs so powerful.
3189
3190@cindex @code{$} (field operator)
3191@cindex field operator @code{$}
3192To refer to a field in an @code{awk} program, you use a dollar-sign,
3193@samp{$}, followed by the number of the field you want.  Thus, @code{$1}
3194refers to the first field, @code{$2} to the second, and so on.  For
3195example, suppose the following is a line of input:
3196
3197@example
3198This seems like a pretty nice example.
3199@end example
3200
3201@noindent
3202Here the first field, or @code{$1}, is @samp{This}; the second field, or
3203@code{$2}, is @samp{seems}; and so on.  Note that the last field,
3204@code{$7}, is @samp{example.}.  Because there is no space between the
3205@samp{e} and the @samp{.}, the period is considered part of the seventh
3206field.
3207
3208@vindex NF
3209@cindex number of fields, @code{NF}
3210@code{NF} is a built-in variable whose value
3211is the number of fields in the current record.
3212@code{awk} updates the value of @code{NF} automatically, each time
3213a record is read.
3214
3215No matter how many fields there are, the last field in a record can be
3216represented by @code{$NF}.  So, in the example above, @code{$NF} would
3217be the same as @code{$7}, which is @samp{example.}.  Why this works is
3218explained below (@pxref{Non-Constant Fields, ,Non-constant Field Numbers}).
3219If you try to reference a field beyond the last one, such as @code{$8}
3220when the record has only seven fields, you get the empty string.
3221@c the empty string acts like 0 in some contexts, but I don't want to
3222@c get into that here....
3223
3224@code{$0}, which looks like a reference to the ``zeroth'' field, is
3225a special case: it represents the whole input record.  @code{$0} is
3226used when you are not interested in fields.
3227
3228@c NEEDED
3229@page
3230Here are some more examples:
3231
3232@example
3233@group
3234$ awk '$1 ~ /foo/ @{ print $0 @}' BBS-list
3235@print{} fooey        555-1234     2400/1200/300     B
3236@print{} foot         555-6699     1200/300          B
3237@print{} macfoo       555-6480     1200/300          A
3238@print{} sabafoo      555-2127     1200/300          C
3239@end group
3240@end example
3241
3242@noindent
3243This example prints each record in the file @file{BBS-list} whose first
3244field contains the string @samp{foo}.  The operator @samp{~} is called a
3245@dfn{matching operator}
3246(@pxref{Regexp Usage, , How to Use Regular Expressions});
3247it tests whether a string (here, the field @code{$1}) matches a given regular
3248expression.
3249
3250By contrast, the following example
3251looks for @samp{foo} in @emph{the entire record} and prints the first
3252field and the last field for each input record containing a
3253match.
3254
3255@example
3256@group
3257$ awk '/foo/ @{ print $1, $NF @}' BBS-list
3258@print{} fooey B
3259@print{} foot B
3260@print{} macfoo A
3261@print{} sabafoo C
3262@end group
3263@end example
3264
3265@node Non-Constant Fields, Changing Fields, Fields, Reading Files
3266@section Non-constant Field Numbers
3267
3268The number of a field does not need to be a constant.  Any expression in
3269the @code{awk} language can be used after a @samp{$} to refer to a
3270field.  The value of the expression specifies the field number.  If the
3271value is a string, rather than a number, it is converted to a number.
3272Consider this example:
3273
3274@example
3275awk '@{ print $NR @}'
3276@end example
3277
3278@noindent
3279Recall that @code{NR} is the number of records read so far: one in the
3280first record, two in the second, etc.  So this example prints the first
3281field of the first record, the second field of the second record, and so
3282on.  For the twentieth record, field number 20 is printed; most likely,
3283the record has fewer than 20 fields, so this prints a blank line.
3284
3285Here is another example of using expressions as field numbers:
3286
3287@example
3288awk '@{ print $(2*2) @}' BBS-list
3289@end example
3290
3291@code{awk} must evaluate the expression @samp{(2*2)} and use
3292its value as the number of the field to print.  The @samp{*} sign
3293represents multiplication, so the expression @samp{2*2} evaluates to four.
3294The parentheses are used so that the multiplication is done before the
3295@samp{$} operation; they are necessary whenever there is a binary
3296operator in the field-number expression.  This example, then, prints the
3297hours of operation (the fourth field) for every line of the file
3298@file{BBS-list}.  (All of the @code{awk} operators are listed, in
3299order of decreasing precedence, in
3300@ref{Precedence,  , Operator Precedence (How Operators Nest)}.)
3301
3302If the field number you compute is zero, you get the entire record.
3303Thus, @code{$(2-2)} has the same value as @code{$0}.  Negative field
3304numbers are not allowed; trying to reference one will usually terminate
3305your running @code{awk} program.  (The POSIX standard does not define
3306what happens when you reference a negative field number.  @code{gawk}
3307will notice this and terminate your program.  Other @code{awk}
3308implementations may behave differently.)
3309
3310As mentioned in @ref{Fields, ,Examining Fields},
3311the number of fields in the current record is stored in the built-in
3312variable @code{NF} (also @pxref{Built-in Variables}).  The expression
3313@code{$NF} is not a special feature: it is the direct consequence of
3314evaluating @code{NF} and using its value as a field number.
3315
3316@node Changing Fields, Field Separators, Non-Constant Fields, Reading Files
3317@section Changing the Contents of a Field
3318
3319@cindex field, changing contents of
3320@cindex changing contents of a field
3321@cindex assignment to fields
3322You can change the contents of a field as seen by @code{awk} within an
3323@code{awk} program; this changes what @code{awk} perceives as the
3324current input record.  (The actual input is untouched; @code{awk} @emph{never}
3325modifies the input file.)
3326
3327Consider this example and its output:
3328
3329@example
3330@group
3331$ awk '@{ $3 = $2 - 10; print $2, $3 @}' inventory-shipped
3332@print{} 13 3
3333@print{} 15 5
3334@print{} 15 5
3335@dots{}
3336@end group
3337@end example
3338
3339@noindent
3340The @samp{-} sign represents subtraction, so this program reassigns
3341field three, @code{$3}, to be the value of field two minus ten,
3342@samp{$2 - 10}.  (@xref{Arithmetic Ops, ,Arithmetic Operators}.)
3343Then field two, and the new value for field three, are printed.
3344
3345In order for this to work, the text in field @code{$2} must make sense
3346as a number; the string of characters must be converted to a number in
3347order for the computer to do arithmetic on it.  The number resulting
3348from the subtraction is converted back to a string of characters which
3349then becomes field three.
3350@xref{Conversion, ,Conversion of Strings and Numbers}.
3351
3352When you change the value of a field (as perceived by @code{awk}), the
3353text of the input record is recalculated to contain the new field where
3354the old one was.  Therefore, @code{$0} changes to reflect the altered
3355field.  Thus, this program
3356prints a copy of the input file, with 10 subtracted from the second
3357field of each line.
3358
3359@example
3360@group
3361$ awk '@{ $2 = $2 - 10; print $0 @}' inventory-shipped
3362@print{} Jan 3 25 15 115
3363@print{} Feb 5 32 24 226
3364@print{} Mar 5 24 34 228
3365@dots{}
3366@end group
3367@end example
3368
3369You can also assign contents to fields that are out of range.  For
3370example:
3371
3372@example
3373$ awk '@{ $6 = ($5 + $4 + $3 + $2)
3374>        print $6 @}' inventory-shipped
3375@print{} 168
3376@print{} 297
3377@print{} 301
3378@dots{}
3379@end example
3380
3381@noindent
3382We've just created @code{$6}, whose value is the sum of fields
3383@code{$2}, @code{$3}, @code{$4}, and @code{$5}.  The @samp{+} sign
3384represents addition.  For the file @file{inventory-shipped}, @code{$6}
3385represents the total number of parcels shipped for a particular month.
3386
3387Creating a new field changes @code{awk}'s internal copy of the current
3388input record---the value of @code{$0}.  Thus, if you do @samp{print $0}
3389after adding a field, the record printed includes the new field, with
3390the appropriate number of field separators between it and the previously
3391existing fields.
3392
3393This recomputation affects and is affected by
3394@code{NF} (the number of fields; @pxref{Fields, ,Examining Fields}),
3395and by a feature that has not been discussed yet,
3396the @dfn{output field separator}, @code{OFS},
3397which is used to separate the fields (@pxref{Output Separators}).
3398For example, the value of @code{NF} is set to the number of the highest
3399field you create.
3400
3401Note, however, that merely @emph{referencing} an out-of-range field
3402does @emph{not} change the value of either @code{$0} or @code{NF}.
3403Referencing an out-of-range field only produces an empty string.  For
3404example:
3405
3406@example
3407if ($(NF+1) != "")
3408    print "can't happen"
3409else
3410    print "everything is normal"
3411@end example
3412
3413@noindent
3414should print @samp{everything is normal}, because @code{NF+1} is certain
3415to be out of range.  (@xref{If Statement, ,The @code{if}-@code{else} Statement},
3416for more information about @code{awk}'s @code{if-else} statements.
3417@xref{Typing and Comparison, ,Variable Typing and Comparison Expressions},
3418for more information about the @samp{!=} operator.)
3419
3420It is important to note that making an assignment to an existing field
3421will change the
3422value of @code{$0}, but will not change the value of @code{NF},
3423even when you assign the empty string to a field.  For example:
3424
3425@example
3426@group
3427$ echo a b c d | awk '@{ OFS = ":"; $2 = ""
3428>                       print $0; print NF @}'
3429@print{} a::c:d
3430@print{} 4
3431@end group
3432@end example
3433
3434@noindent
3435The field is still there; it just has an empty value.  You can tell
3436because there are two colons in a row.
3437
3438This example shows what happens if you create a new field.
3439
3440@example
3441$ echo a b c d | awk '@{ OFS = ":"; $2 = ""; $6 = "new"
3442>                       print $0; print NF @}'
3443@print{} a::c:d::new
3444@print{} 6
3445@end example
3446
3447@noindent
3448The intervening field, @code{$5} is created with an empty value
3449(indicated by the second pair of adjacent colons),
3450and @code{NF} is updated with the value six.
3451
3452Finally, decrementing @code{NF} will lose the values of the fields
3453after the new value of @code{NF}, and @code{$0} will be recomputed.
3454Here is an example:
3455
3456@example
3457$ echo a b c d e f | ../gawk '@{ print "NF =", NF;
3458>                               NF = 3; print $0 @}'
3459@print{} NF = 6
3460@print{} a b c
3461@end example
3462
3463@node Field Separators, Constant Size, Changing Fields, Reading Files
3464@section Specifying How Fields are Separated
3465
3466This section is rather long; it describes one of the most fundamental
3467operations in @code{awk}.
3468
3469@menu
3470* Basic Field Splitting::        How fields are split with single characters
3471                                 or simple strings.
3472* Regexp Field Splitting::       Using regexps as the field separator.
3473* Single Character Fields::      Making each character a separate field.
3474* Command Line Field Separator:: Setting @code{FS} from the command line.
3475* Field Splitting Summary::      Some final points and a summary table.
3476@end menu
3477
3478@node Basic Field Splitting, Regexp Field Splitting, Field Separators, Field Separators
3479@subsection The Basics of Field Separating
3480@vindex FS
3481@cindex fields, separating
3482@cindex field separator, @code{FS}
3483
3484The @dfn{field separator}, which is either a single character or a regular
3485expression, controls the way @code{awk} splits an input record into fields.
3486@code{awk} scans the input record for character sequences that
3487match the separator; the fields themselves are the text between the matches.
3488
3489In the examples below, we use the bullet symbol ``@bullet{}'' to represent
3490spaces in the output.
3491
3492If the field separator is @samp{oo}, then the following line:
3493
3494@example
3495moo goo gai pan
3496@end example
3497
3498@noindent
3499would be split into three fields: @samp{m}, @samp{@bullet{}g} and
3500@samp{@bullet{}gai@bullet{}pan}.
3501Note the leading spaces in the values of the second and third fields.
3502
3503@cindex common mistakes
3504@cindex mistakes, common
3505@cindex errors, common
3506The field separator is represented by the built-in variable @code{FS}.
3507Shell programmers take note!  @code{awk} does @emph{not} use the name @code{IFS}
3508which is used by the POSIX compatible shells (such as the Bourne shell,
3509@code{sh}, or the GNU Bourne-Again Shell, Bash).
3510
3511You can change the value of @code{FS} in the @code{awk} program with the
3512assignment operator, @samp{=} (@pxref{Assignment Ops, ,Assignment Expressions}).
3513Often the right time to do this is at the beginning of execution,
3514before any input has been processed, so that the very first record
3515will be read with the proper separator.  To do this, use the special
3516@code{BEGIN} pattern
3517(@pxref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns}).
3518For example, here we set the value of @code{FS} to the string
3519@code{","}:
3520
3521@example
3522awk 'BEGIN @{ FS = "," @} ; @{ print $2 @}'
3523@end example
3524
3525@noindent
3526Given the input line,
3527
3528@example
3529John Q. Smith, 29 Oak St., Walamazoo, MI 42139
3530@end example
3531
3532@noindent
3533this @code{awk} program extracts and prints the string
3534@samp{@bullet{}29@bullet{}Oak@bullet{}St.}.
3535
3536@cindex field separator, choice of
3537@cindex regular expressions as field separators
3538Sometimes your input data will contain separator characters that don't
3539separate fields the way you thought they would.  For instance, the
3540person's name in the example we just used might have a title or
3541suffix attached, such as @samp{John Q. Smith, LXIX}.  From input
3542containing such a name:
3543
3544@example
3545John Q. Smith, LXIX, 29 Oak St., Walamazoo, MI 42139
3546@end example
3547
3548@noindent
3549@c careful of an overfull hbox here!
3550the above program would extract @samp{@bullet{}LXIX}, instead of
3551@samp{@bullet{}29@bullet{}Oak@bullet{}St.}.
3552If you were expecting the program to print the
3553address, you would be surprised.  The moral is: choose your data layout and
3554separator characters carefully to prevent such problems.
3555
3556@iftex
3557As you know, normally,
3558@end iftex
3559@ifinfo
3560Normally,
3561@end ifinfo
3562fields are separated by whitespace sequences
3563(spaces, tabs and newlines), not by single spaces: two spaces in a row do not
3564delimit an empty field.  The default value of the field separator @code{FS}
3565is a string containing a single space, @w{@code{" "}}.  If this value were
3566interpreted in the usual way, each space character would separate
3567fields, so two spaces in a row would make an empty field between them.
3568The reason this does not happen is that a single space as the value of
3569@code{FS} is a special case: it is taken to specify the default manner
3570of delimiting fields.
3571
3572If @code{FS} is any other single character, such as @code{","}, then
3573each occurrence of that character separates two fields.  Two consecutive
3574occurrences delimit an empty field.  If the character occurs at the
3575beginning or the end of the line, that too delimits an empty field.  The
3576space character is the only single character which does not follow these
3577rules.
3578
3579@node Regexp Field Splitting, Single Character Fields, Basic Field Splitting, Field Separators
3580@subsection Using Regular Expressions to Separate Fields
3581
3582The previous
3583@iftex
3584subsection
3585@end iftex
3586@ifinfo
3587node
3588@end ifinfo
3589discussed the use of single characters or simple strings as the
3590value of @code{FS}.
3591More generally, the value of @code{FS} may be a string containing any
3592regular expression.  In this case, each match in the record for the regular
3593expression separates fields.  For example, the assignment:
3594
3595@example
3596FS = ", \t"
3597@end example
3598
3599@noindent
3600makes every area of an input line that consists of a comma followed by a
3601space and a tab, into a field separator.  (@samp{\t}
3602is an @dfn{escape sequence} that stands for a tab;
3603@pxref{Escape Sequences},
3604for the complete list of similar escape sequences.)
3605
3606For a less trivial example of a regular expression, suppose you want
3607single spaces to separate fields the way single commas were used above.
3608You can set @code{FS} to @w{@code{"[@ ]"}} (left bracket, space, right
3609bracket).  This regular expression matches a single space and nothing else
3610(@pxref{Regexp, ,Regular Expressions}).
3611
3612There is an important difference between the two cases of @samp{FS = @w{" "}}
3613(a single space) and @samp{FS = @w{"[ \t\n]+"}} (left bracket, space,
3614backslash, ``t'', backslash, ``n'', right bracket, which is a regular
3615expression matching one or more spaces, tabs, or newlines).  For both
3616values of @code{FS}, fields are separated by runs of spaces, tabs
3617and/or newlines.  However, when the value of @code{FS} is @w{@code{"
3618"}}, @code{awk} will first strip leading and trailing whitespace from
3619the record, and then decide where the fields are.
3620
3621For example, the following pipeline prints @samp{b}:
3622
3623@example
3624@group
3625$ echo ' a b c d ' | awk '@{ print $2 @}'
3626@print{} b
3627@end group
3628@end example
3629
3630@noindent
3631However, this pipeline prints @samp{a} (note the extra spaces around
3632each letter):
3633
3634@example
3635$ echo ' a  b  c  d ' | awk 'BEGIN @{ FS = "[ \t]+" @}
3636>                                  @{ print $2 @}'
3637@print{} a
3638@end example
3639
3640@noindent
3641@cindex null string
3642@cindex empty string
3643In this case, the first field is @dfn{null}, or empty.
3644
3645The stripping of leading and trailing whitespace also comes into
3646play whenever @code{$0} is recomputed.  For instance, study this pipeline:
3647
3648@example
3649$ echo '   a b c d' | awk '@{ print; $2 = $2; print @}'
3650@print{}    a b c d
3651@print{} a b c d
3652@end example
3653
3654@noindent
3655The first @code{print} statement prints the record as it was read,
3656with leading whitespace intact.  The assignment to @code{$2} rebuilds
3657@code{$0} by concatenating @code{$1} through @code{$NF} together,
3658separated by the value of @code{OFS}.  Since the leading whitespace
3659was ignored when finding @code{$1}, it is not part of the new @code{$0}.
3660Finally, the last @code{print} statement prints the new @code{$0}.
3661
3662@node Single Character Fields, Command Line Field Separator, Regexp Field Splitting, Field Separators
3663@subsection Making Each Character a Separate Field
3664
3665@cindex differences between @code{gawk} and @code{awk}
3666@cindex single character fields
3667There are times when you may want to examine each character
3668of a record separately.  In @code{gawk}, this is easy to do, you
3669simply assign the null string (@code{""}) to @code{FS}. In this case,
3670each individual character in the record will become a separate field.
3671Here is an example:
3672
3673@example
3674@group
3675$ echo a b | gawk 'BEGIN @{ FS = "" @}
3676>                  @{
3677>                      for (i = 1; i <= NF; i = i + 1)
3678>                          print "Field", i, "is", $i
3679>                  @}'
3680@print{} Field 1 is a
3681@print{} Field 2 is
3682@print{} Field 3 is b
3683@end group
3684@end example
3685
3686@cindex dark corner
3687Traditionally, the behavior for @code{FS} equal to @code{""} was not defined.
3688In this case, Unix @code{awk} would simply treat the entire record
3689as only having one field (d.c.).  In compatibility mode
3690(@pxref{Options, ,Command Line Options}),
3691if @code{FS} is the null string, then @code{gawk} will also
3692behave this way.
3693
3694@node Command Line Field Separator, Field Splitting Summary, Single Character Fields, Field Separators
3695@subsection Setting @code{FS} from the Command Line
3696@cindex @code{-F} option
3697@cindex field separator, on command line
3698@cindex command line, setting @code{FS} on
3699
3700@code{FS} can be set on the command line.  You use the @samp{-F} option to
3701do so.  For example:
3702
3703@example
3704awk -F, '@var{program}' @var{input-files}
3705@end example
3706
3707@noindent
3708sets @code{FS} to be the @samp{,} character.  Notice that the option uses
3709a capital @samp{F}.  Contrast this with @samp{-f}, which specifies a file
3710containing an @code{awk} program.  Case is significant in command line options:
3711the @samp{-F} and @samp{-f} options have nothing to do with each other.
3712You can use both options at the same time to set the @code{FS} variable
3713@emph{and} get an @code{awk} program from a file.
3714
3715The value used for the argument to @samp{-F} is processed in exactly the
3716same way as assignments to the built-in variable @code{FS}.  This means that
3717if the field separator contains special characters, they must be escaped
3718appropriately.  For example, to use a @samp{\} as the field separator, you
3719would have to type:
3720
3721@example
3722# same as FS = "\\"
3723awk -F\\\\ '@dots{}' files @dots{}
3724@end example
3725
3726@noindent
3727Since @samp{\} is used for quoting in the shell, @code{awk} will see
3728@samp{-F\\}.  Then @code{awk} processes the @samp{\\} for escape
3729characters (@pxref{Escape Sequences}), finally yielding
3730a single @samp{\} to be used for the field separator.
3731
3732@cindex historical features
3733As a special case, in compatibility mode
3734(@pxref{Options, ,Command Line Options}), if the
3735argument to @samp{-F} is @samp{t}, then @code{FS} is set to the tab
3736character.  This is because if you type @samp{-F\t} at the shell,
3737without any quotes, the @samp{\} gets deleted, so @code{awk} figures that you
3738really want your fields to be separated with tabs, and not @samp{t}s.
3739Use @samp{-v FS="t"} on the command line if you really do want to separate
3740your fields with @samp{t}s
3741(@pxref{Options, ,Command Line Options}).
3742
3743For example, let's use an @code{awk} program file called @file{baud.awk}
3744that contains the pattern @code{/300/}, and the action @samp{print $1}.
3745Here is the program:
3746
3747@example
3748/300/   @{ print $1 @}
3749@end example
3750
3751Let's also set @code{FS} to be the @samp{-} character, and run the
3752program on the file @file{BBS-list}.  The following command prints a
3753list of the names of the bulletin boards that operate at 300 baud and
3754the first three digits of their phone numbers:
3755
3756@c tweaked to make the tex output look better in @smallbook
3757@example
3758@group
3759$ awk -F- -f baud.awk BBS-list
3760@print{} aardvark     555
3761@print{} alpo
3762@print{} barfly       555
3763@dots{}
3764@end group
3765@ignore
3766@print{} bites        555
3767@print{} camelot      555
3768@print{} core         555
3769@print{} fooey        555
3770@print{} foot         555
3771@print{} macfoo       555
3772@print{} sdace        555
3773@print{} sabafoo      555
3774@end ignore
3775@end example
3776
3777@noindent
3778Note the second line of output.  In the original file
3779(@pxref{Sample Data Files, ,Data Files for the Examples}),
3780the second line looked like this:
3781
3782@example
3783alpo-net     555-3412     2400/1200/300     A
3784@end example
3785
3786The @samp{-} as part of the system's name was used as the field
3787separator, instead of the @samp{-} in the phone number that was
3788originally intended.  This demonstrates why you have to be careful in
3789choosing your field and record separators.
3790
3791On many Unix systems, each user has a separate entry in the system password
3792file, one line per user.  The information in these lines is separated
3793by colons.  The first field is the user's logon name, and the second is
3794the user's encrypted password.  A password file entry might look like this:
3795
3796@example
3797arnold:xyzzy:2076:10:Arnold Robbins:/home/arnold:/bin/sh
3798@end example
3799
3800The following program searches the system password file, and prints
3801the entries for users who have no password:
3802
3803@example
3804awk -F: '$2 == ""' /etc/passwd
3805@end example
3806
3807@node Field Splitting Summary,  , Command Line Field Separator, Field Separators
3808@subsection Field Splitting Summary
3809
3810@cindex @code{awk} language, POSIX version
3811@cindex POSIX @code{awk}
3812According to the POSIX standard, @code{awk} is supposed to behave
3813as if each record is split into fields at the time that it is read.
3814In particular, this means that you can change the value of @code{FS}
3815after a record is read, and the value of the fields (i.e.@: how they were split)
3816should reflect the old value of @code{FS}, not the new one.
3817
3818@cindex dark corner
3819@cindex @code{sed} utility
3820@cindex stream editor
3821However, many implementations of @code{awk} do not work this way.  Instead,
3822they defer splitting the fields until a field is actually
3823referenced.  The fields will be split
3824using the @emph{current} value of @code{FS}! (d.c.)
3825This behavior can be difficult
3826to diagnose. The following example illustrates the difference
3827between the two methods.
3828(The @code{sed}@footnote{The @code{sed} utility is a ``stream editor.''
3829Its behavior is also defined by the POSIX standard.}
3830command prints just the first line of @file{/etc/passwd}.)
3831
3832@example
3833sed 1q /etc/passwd | awk '@{ FS = ":" ; print $1 @}'
3834@end example
3835
3836@noindent
3837will usually print
3838
3839@example
3840root
3841@end example
3842
3843@noindent
3844on an incorrect implementation of @code{awk}, while @code{gawk}
3845will print something like
3846
3847@example
3848root:nSijPlPhZZwgE:0:0:Root:/:
3849@end example
3850
3851The following table summarizes how fields are split, based on the
3852value of @code{FS}. (@samp{==} means ``is equal to.'')
3853
3854@c @cartouche
3855@table @code
3856@item FS == " "
3857Fields are separated by runs of whitespace.  Leading and trailing
3858whitespace are ignored.  This is the default.
3859
3860@item FS == @var{any other single character}
3861Fields are separated by each occurrence of the character.  Multiple
3862successive occurrences delimit empty fields, as do leading and
3863trailing occurrences.
3864The character can even be a regexp metacharacter; it does not need
3865to be escaped.
3866
3867@item FS == @var{regexp}
3868Fields are separated by occurrences of characters that match @var{regexp}.
3869Leading and trailing matches of @var{regexp} delimit empty fields.
3870
3871@item FS == ""
3872Each individual character in the record becomes a separate field.
3873@end table
3874@c @end cartouche
3875
3876@node Constant Size, Multiple Line, Field Separators, Reading Files
3877@section Reading Fixed-width Data
3878
3879(This section discusses an advanced, experimental feature.  If you are
3880a novice @code{awk} user, you may wish to skip it on the first reading.)
3881
3882@code{gawk} version 2.13 introduced a new facility for dealing with
3883fixed-width fields with no distinctive field separator.  Data of this
3884nature arises, for example, in  the input for old FORTRAN programs where
3885numbers are run together; or in the output of programs that did not
3886anticipate the use of their output as input for other programs.
3887
3888An example of the latter is a table where all the columns are lined up by
3889the use of a variable number of spaces and @emph{empty fields are just
3890spaces}.  Clearly, @code{awk}'s normal field splitting based on @code{FS}
3891will not work well in this case.  Although a portable @code{awk} program
3892can use a series of @code{substr} calls on @code{$0}
3893(@pxref{String Functions, ,Built-in Functions for String Manipulation}),
3894this is awkward and inefficient for a large number of fields.
3895
3896The splitting of an input record into fixed-width fields is specified by
3897assigning a string containing space-separated numbers to the built-in
3898variable @code{FIELDWIDTHS}.  Each number specifies the width of the field
3899@emph{including} columns between fields.  If you want to ignore the columns
3900between fields, you can specify the width as a separate field that is
3901subsequently ignored.
3902
3903The following data is the output of the Unix @code{w} utility.  It is useful
3904to illustrate the use of @code{FIELDWIDTHS}.
3905
3906@example
3907@group
3908 10:06pm  up 21 days, 14:04,  23 users
3909User     tty       login@  idle   JCPU   PCPU  what
3910hzuo     ttyV0     8:58pm            9      5  vi p24.tex
3911hzang    ttyV3     6:37pm    50                -csh
3912eklye    ttyV5     9:53pm            7      1  em thes.tex
3913dportein ttyV6     8:17pm  1:47                -csh
3914gierd    ttyD3    10:00pm     1                elm
3915dave     ttyD4     9:47pm            4      4  w
3916brent    ttyp0    26Jun91  4:46  26:46   4:41  bash
3917dave     ttyq4    26Jun9115days     46     46  wnewmail
3918@end group
3919@end example
3920
3921The following program takes the above input, converts the idle time to
3922number of seconds and prints out the first two fields and the calculated
3923idle time.  (This program uses a number of @code{awk} features that
3924haven't been introduced yet.)
3925
3926@example
3927BEGIN  @{ FIELDWIDTHS = "9 6 10 6 7 7 35" @}
3928NR > 2 @{
3929    idle = $4
3930    sub(/^  */, "", idle)   # strip leading spaces
3931    if (idle == "")
3932        idle = 0
3933@group
3934    if (idle ~ /:/) @{
3935        split(idle, t, ":")
3936        idle = t[1] * 60 + t[2]
3937    @}
3938@end group
3939@group
3940    if (idle ~ /days/)
3941        idle *= 24 * 60 * 60
3942
3943    print $1, $2, idle
3944@}
3945@end group
3946@end example
3947
3948Here is the result of running the program on the data:
3949
3950@example
3951hzuo      ttyV0  0
3952hzang     ttyV3  50
3953eklye     ttyV5  0
3954dportein  ttyV6  107
3955gierd     ttyD3  1
3956dave      ttyD4  0
3957brent     ttyp0  286
3958dave      ttyq4  1296000
3959@end example
3960
3961Another (possibly more practical) example of fixed-width input data
3962would be the input from a deck of balloting cards.  In some parts of
3963the United States, voters mark their choices by punching holes in computer
3964cards.  These cards are then processed to count the votes for any particular
3965candidate or on any particular issue.  Since a voter may choose not to
3966vote on some issue, any column on the card may be empty.  An @code{awk}
3967program for processing such data could use the @code{FIELDWIDTHS} feature
3968to simplify reading the data.  (Of course, getting @code{gawk} to run on
3969a system with card readers is another story!)
3970
3971@ignore
3972Exercise: Write a ballot card reading program
3973@end ignore
3974
3975Assigning a value to @code{FS} causes @code{gawk} to return to using
3976@code{FS} for field splitting.  Use @samp{FS = FS} to make this happen,
3977without having to know the current value of @code{FS}.
3978
3979This feature is still experimental, and may evolve over time.
3980Note that in particular, @code{gawk} does not attempt to verify
3981the sanity of the values used in the value of @code{FIELDWIDTHS}.
3982
3983@node Multiple Line, Getline, Constant Size, Reading Files
3984@section Multiple-Line Records
3985
3986@cindex multiple line records
3987@cindex input, multiple line records
3988@cindex reading files, multiple line records
3989@cindex records, multiple line
3990In some data bases, a single line cannot conveniently hold all the
3991information in one entry.  In such cases, you can use multi-line
3992records.
3993
3994The first step in doing this is to choose your data format: when records
3995are not defined as single lines, how do you want to define them?
3996What should separate records?
3997
3998One technique is to use an unusual character or string to separate
3999records.  For example, you could use the formfeed character (written
4000@samp{\f} in @code{awk}, as in C) to separate them, making each record
4001a page of the file.  To do this, just set the variable @code{RS} to
4002@code{"\f"} (a string containing the formfeed character).  Any
4003other character could equally well be used, as long as it won't be part
4004of the data in a record.
4005
4006Another technique is to have blank lines separate records.  By a special
4007dispensation, an empty string as the value of @code{RS} indicates that
4008records are separated by one or more blank lines.  If you set @code{RS}
4009to the empty string, a record always ends at the first blank line
4010encountered.  And the next record doesn't start until the first non-blank
4011line that follows---no matter how many blank lines appear in a row, they
4012are considered one record-separator.
4013
4014@cindex leftmost longest match
4015@cindex matching, leftmost longest
4016You can achieve the same effect as @samp{RS = ""} by assigning the
4017string @code{"\n\n+"} to @code{RS}. This regexp matches the newline
4018at the end of the record, and one or more blank lines after the record.
4019In addition, a regular expression always matches the longest possible
4020sequence when there is a choice
4021(@pxref{Leftmost Longest, ,How Much Text Matches?}).
4022So the next record doesn't start until
4023the first non-blank line that follows---no matter how many blank lines
4024appear in a row, they are considered one record-separator.
4025
4026@cindex dark corner
4027There is an important difference between @samp{RS = ""} and
4028@samp{RS = "\n\n+"}. In the first case, leading newlines in the input
4029data file are ignored, and if a file ends without extra blank lines
4030after the last record, the final newline is removed from the record.
4031In the second case, this special processing is not done (d.c.).
4032
4033Now that the input is separated into records, the second step is to
4034separate the fields in the record.  One way to do this is to divide each
4035of the lines into fields in the normal manner.  This happens by default
4036as the result of a special feature: when @code{RS} is set to the empty
4037string, the newline character @emph{always} acts as a field separator.
4038This is in addition to whatever field separations result from @code{FS}.
4039
4040The original motivation for this special exception was probably to provide
4041useful behavior in the default case (i.e.@: @code{FS} is equal
4042to @w{@code{" "}}).  This feature can be a problem if you really don't
4043want the newline character to separate fields, since there is no way to
4044prevent it.  However, you can work around this by using the @code{split}
4045function to break up the record manually
4046(@pxref{String Functions, ,Built-in Functions for String Manipulation}).
4047
4048Another way to separate fields is to
4049put each field on a separate line: to do this, just set the
4050variable @code{FS} to the string @code{"\n"}.  (This simple regular
4051expression matches a single newline.)
4052
4053A practical example of a data file organized this way might be a mailing
4054list, where each entry is separated by blank lines.  If we have a mailing
4055list in a file named @file{addresses}, that looks like this:
4056
4057@c NEEDED
4058@page
4059@example
4060Jane Doe
4061123 Main Street
4062Anywhere, SE 12345-6789
4063
4064John Smith
4065456 Tree-lined Avenue
4066Smallville, MW 98765-4321
4067@dots{}
4068@end example
4069
4070@noindent
4071A simple program to process this file would look like this:
4072
4073@example
4074@group
4075# addrs.awk --- simple mailing list program
4076
4077# Records are separated by blank lines.
4078# Each line is one field.
4079BEGIN @{ RS = "" ; FS = "\n" @}
4080
4081@{
4082      print "Name is:", $1
4083      print "Address is:", $2
4084      print "City and State are:", $3
4085      print ""
4086@}
4087@end group
4088@end example
4089
4090Running the program produces the following output:
4091
4092@example
4093@group
4094$ awk -f addrs.awk addresses
4095@print{} Name is: Jane Doe
4096@print{} Address is: 123 Main Street
4097@print{} City and State are: Anywhere, SE 12345-6789
4098@print{}
4099@end group
4100@group
4101@print{} Name is: John Smith
4102@print{} Address is: 456 Tree-lined Avenue
4103@print{} City and State are: Smallville, MW 98765-4321
4104@print{}
4105@dots{}
4106@end group
4107@end example
4108
4109@xref{Labels Program, ,Printing Mailing Labels}, for a more realistic
4110program that deals with address lists.
4111
4112The following table summarizes how records are split, based on the
4113value of @code{RS}. (@samp{==} means ``is equal to.'')
4114
4115@c @cartouche
4116@table @code
4117@item RS == "\n"
4118Records are separated by the newline character (@samp{\n}).  In effect,
4119every line in the data file is a separate record, including blank lines.
4120This is the default.
4121
4122@item RS == @var{any single character}
4123Records are separated by each occurrence of the character.  Multiple
4124successive occurrences delimit empty records.
4125
4126@item RS == ""
4127Records are separated by runs of blank lines.  The newline character
4128always serves as a field separator, in addition to whatever value
4129@code{FS} may have. Leading and trailing newlines in a file are ignored.
4130
4131@item RS == @var{regexp}
4132Records are separated by occurrences of characters that match @var{regexp}.
4133Leading and trailing matches of @var{regexp} delimit empty records.
4134@end table
4135@c @end cartouche
4136
4137@vindex RT
4138In all cases, @code{gawk} sets @code{RT} to the input text that matched the
4139value specified by @code{RS}.
4140
4141@node Getline, , Multiple Line, Reading Files
4142@section Explicit Input with @code{getline}
4143
4144@findex getline
4145@cindex input, explicit
4146@cindex explicit input
4147@cindex input, @code{getline} command
4148@cindex reading files, @code{getline} command
4149So far we have been getting our input data from @code{awk}'s main
4150input stream---either the standard input (usually your terminal, sometimes
4151the output from another program) or from the
4152files specified on the command line.  The @code{awk} language has a
4153special built-in command called @code{getline} that
4154can be used to read input under your explicit control.
4155
4156@menu
4157* Getline Intro::            Introduction to the @code{getline} function.
4158* Plain Getline::            Using @code{getline} with no arguments.
4159* Getline/Variable::         Using @code{getline} into a variable.
4160* Getline/File::             Using @code{getline} from a file.
4161* Getline/Variable/File::    Using @code{getline} into a variable from a
4162                             file.
4163* Getline/Pipe::             Using @code{getline} from a pipe.
4164* Getline/Variable/Pipe::    Using @code{getline} into a variable from a
4165                             pipe.
4166* Getline Summary::          Summary Of @code{getline} Variants.
4167@end menu
4168
4169@node Getline Intro, Plain Getline, Getline, Getline
4170@subsection Introduction to @code{getline}
4171
4172This command is used in several different ways, and should @emph{not} be
4173used by beginners.  It is covered here because this is the chapter on input.
4174The examples that follow the explanation of the @code{getline} command
4175include material that has not been covered yet.  Therefore, come back
4176and study the @code{getline} command @emph{after} you have reviewed the
4177rest of this @value{DOCUMENT} and have a good knowledge of how @code{awk} works.
4178
4179@vindex ERRNO
4180@cindex differences between @code{gawk} and @code{awk}
4181@cindex @code{getline}, return values
4182@code{getline} returns one if it finds a record, and zero if the end of the
4183file is encountered.  If there is some error in getting a record, such
4184as a file that cannot be opened, then @code{getline} returns @minus{}1.
4185In this case, @code{gawk} sets the variable @code{ERRNO} to a string
4186describing the error that occurred.
4187
4188In the following examples, @var{command} stands for a string value that
4189represents a shell command.
4190
4191@node Plain Getline, Getline/Variable, Getline Intro, Getline
4192@subsection Using @code{getline} with No Arguments
4193
4194The @code{getline} command can be used without arguments to read input
4195from the current input file.  All it does in this case is read the next
4196input record and split it up into fields.  This is useful if you've
4197finished processing the current record, but you want to do some special
4198processing @emph{right now} on the next record.  Here's an
4199example:
4200
4201@example
4202@group
4203awk '@{
4204     if ((t = index($0, "/*")) != 0) @{
4205          # value will be "" if t is 1
4206          tmp = substr($0, 1, t - 1)
4207          u = index(substr($0, t + 2), "*/")
4208          while (u == 0) @{
4209               if (getline <= 0) @{
4210                    m = "unexpected EOF or error"
4211                    m = (m ": " ERRNO)
4212                    print m > "/dev/stderr"
4213                    exit
4214               @}
4215               t = -1
4216               u = index($0, "*/")
4217          @}
4218@end group
4219@group
4220          # substr expression will be "" if */
4221          # occurred at end of line
4222          $0 = tmp substr($0, t + u + 3)
4223     @}
4224     print $0
4225@}'
4226@end group
4227@end example
4228
4229This @code{awk} program deletes all C-style comments, @samp{/* @dots{}
4230*/}, from the input.  By replacing the @samp{print $0} with other
4231statements, you could perform more complicated processing on the
4232decommented input, like searching for matches of a regular
4233expression.  This program has a subtle problem---it does not work if one
4234comment ends and another begins on the same line.
4235
4236@ignore
4237Exercise,
4238write a program that does handle multiple comments on the line.
4239@end ignore
4240
4241This form of the @code{getline} command sets @code{NF} (the number of
4242fields; @pxref{Fields, ,Examining Fields}), @code{NR} (the number of
4243records read so far; @pxref{Records, ,How Input is Split into Records}),
4244@code{FNR} (the number of records read from this input file), and the
4245value of @code{$0}.
4246
4247@cindex dark corner
4248@strong{Note:} the new value of @code{$0} is used in testing
4249the patterns of any subsequent rules.  The original value
4250of @code{$0} that triggered the rule which executed @code{getline}
4251is lost (d.c.).
4252By contrast, the @code{next} statement reads a new record
4253but immediately begins processing it normally, starting with the first
4254rule in the program.  @xref{Next Statement, ,The @code{next} Statement}.
4255
4256@node Getline/Variable, Getline/File, Plain Getline, Getline
4257@subsection Using @code{getline} Into a Variable
4258
4259You can use @samp{getline @var{var}} to read the next record from
4260@code{awk}'s input into the variable @var{var}.  No other processing is
4261done.
4262
4263For example, suppose the next line is a comment, or a special string,
4264and you want to read it, without triggering
4265any rules.  This form of @code{getline} allows you to read that line
4266and store it in a variable so that the main
4267read-a-line-and-check-each-rule loop of @code{awk} never sees it.
4268
4269The following example swaps every two lines of input.  For example, given:
4270
4271@example
4272wan
4273tew
4274free
4275phore
4276@end example
4277
4278@noindent
4279it outputs:
4280
4281@example
4282tew
4283wan
4284phore
4285free
4286@end example
4287
4288@noindent
4289Here's the program:
4290
4291@example
4292@group
4293awk '@{
4294     if ((getline tmp) > 0) @{
4295          print tmp
4296          print $0
4297     @} else
4298          print $0
4299@}'
4300@end group
4301@end example
4302
4303The @code{getline} command used in this way sets only the variables
4304@code{NR} and @code{FNR} (and of course, @var{var}).  The record is not
4305split into fields, so the values of the fields (including @code{$0}) and
4306the value of @code{NF} do not change.
4307
4308@node Getline/File, Getline/Variable/File, Getline/Variable, Getline
4309@subsection Using @code{getline} from a File
4310
4311@cindex input redirection
4312@cindex redirection of input
4313Use @samp{getline < @var{file}} to read
4314the next record from the file
4315@var{file}.  Here @var{file} is a string-valued expression that
4316specifies the file name.  @samp{< @var{file}} is called a @dfn{redirection}
4317since it directs input to come from a different place.
4318
4319For example, the following
4320program reads its input record from the file @file{secondary.input} when it
4321encounters a first field with a value equal to 10 in the current input
4322file.
4323
4324@example
4325@group
4326awk '@{
4327    if ($1 == 10) @{
4328         getline < "secondary.input"
4329         print
4330    @} else
4331         print
4332@}'
4333@end group
4334@end example
4335
4336Since the main input stream is not used, the values of @code{NR} and
4337@code{FNR} are not changed.  But the record read is split into fields in
4338the normal manner, so the values of @code{$0} and other fields are
4339changed.  So is the value of @code{NF}.
4340
4341@c Thanks to Paul Eggert for initial wording here
4342According to POSIX, @samp{getline < @var{expression}} is ambiguous if
4343@var{expression} contains unparenthesized operators other than
4344@samp{$}; for example, @samp{getline < dir "/" file} is ambiguous
4345because the concatenation operator is not parenthesized, and you should
4346write it as @samp{getline < (dir "/" file)} if you want your program
4347to be portable to other @code{awk} implementations.
4348
4349@node Getline/Variable/File, Getline/Pipe, Getline/File, Getline
4350@subsection Using @code{getline} Into a Variable from a File
4351
4352Use @samp{getline @var{var} < @var{file}} to read input
4353the file
4354@var{file} and put it in the variable @var{var}.  As above, @var{file}
4355is a string-valued expression that specifies the file from which to read.
4356
4357In this version of @code{getline}, none of the built-in variables are
4358changed, and the record is not split into fields.  The only variable
4359changed is @var{var}.
4360
4361@ifinfo
4362@c Thanks to Paul Eggert for initial wording here
4363According to POSIX, @samp{getline @var{var} < @var{expression}} is ambiguous if
4364@var{expression} contains unparenthesized operators other than
4365@samp{$}; for example, @samp{getline < dir "/" file} is ambiguous
4366because the concatenation operator is not parenthesized, and you should
4367write it as @samp{getline < (dir "/" file)} if you want your program
4368to be portable to other @code{awk} implementations.
4369@end ifinfo
4370
4371For example, the following program copies all the input files to the
4372output, except for records that say @w{@samp{@@include @var{filename}}}.
4373Such a record is replaced by the contents of the file
4374@var{filename}.
4375
4376@example
4377@group
4378awk '@{
4379     if (NF == 2 && $1 == "@@include") @{
4380          while ((getline line < $2) > 0)
4381               print line
4382          close($2)
4383     @} else
4384          print
4385@}'
4386@end group
4387@end example
4388
4389Note here how the name of the extra input file is not built into
4390the program; it is taken directly from the data, from the second field on
4391the @samp{@@include} line.
4392
4393The @code{close} function is called to ensure that if two identical
4394@samp{@@include} lines appear in the input, the entire specified file is
4395included twice.
4396@xref{Close Files And Pipes, ,Closing Input and Output Files and Pipes}.
4397
4398One deficiency of this program is that it does not process nested
4399@samp{@@include} statements
4400(@samp{@@include} statements in included files)
4401the way a true macro preprocessor would.
4402@xref{Igawk Program, ,An Easy Way to Use Library Functions}, for a program
4403that does handle nested @samp{@@include} statements.
4404
4405@node Getline/Pipe, Getline/Variable/Pipe, Getline/Variable/File, Getline
4406@subsection Using @code{getline} from a Pipe
4407
4408@cindex input pipeline
4409@cindex pipeline, input
4410You can pipe the output of a command into @code{getline}, using
4411@samp{@var{command} | getline}.  In
4412this case, the string @var{command} is run as a shell command and its output
4413is piped into @code{awk} to be used as input.  This form of @code{getline}
4414reads one record at a time from the pipe.
4415
4416For example, the following program copies its input to its output, except for
4417lines that begin with @samp{@@execute}, which are replaced by the output
4418produced by running the rest of the line as a shell command:
4419
4420@example
4421@group
4422awk '@{
4423     if ($1 == "@@execute") @{
4424          tmp = substr($0, 10)
4425          while ((tmp | getline) > 0)
4426               print
4427          close(tmp)
4428     @} else
4429          print
4430@}'
4431@end group
4432@end example
4433
4434@noindent
4435The @code{close} function is called to ensure that if two identical
4436@samp{@@execute} lines appear in the input, the command is run for
4437each one.
4438@xref{Close Files And Pipes, ,Closing Input and Output Files and Pipes}.
4439@c Exercise!!
4440@c This example is unrealistic, since you could just use system
4441
4442Given the input:
4443
4444@example
4445@group
4446foo
4447bar
4448baz
4449@@execute who
4450bletch
4451@end group
4452@end example
4453
4454@noindent
4455the program might produce:
4456
4457@example
4458@group
4459foo
4460bar
4461baz
4462arnold     ttyv0   Jul 13 14:22
4463miriam     ttyp0   Jul 13 14:23     (murphy:0)
4464bill       ttyp1   Jul 13 14:23     (murphy:0)
4465bletch
4466@end group
4467@end example
4468
4469@noindent
4470Notice that this program ran the command @code{who} and printed the result.
4471(If you try this program yourself, you will of course get different results,
4472showing you who is logged in on your system.)
4473
4474This variation of @code{getline} splits the record into fields, sets the
4475value of @code{NF} and recomputes the value of @code{$0}.  The values of
4476@code{NR} and @code{FNR} are not changed.
4477
4478@c Thanks to Paul Eggert for initial wording here
4479According to POSIX, @samp{@var{expression} | getline} is ambiguous if
4480@var{expression} contains unparenthesized operators other than
4481@samp{$}; for example, @samp{"echo " "date" | getline} is ambiguous
4482because the concatenation operator is not parenthesized, and you should
4483write it as @samp{("echo " "date") | getline} if you want your program
4484to be portable to other @code{awk} implementations.
4485(It happens that @code{gawk} gets it right, but you should not
4486rely on this. Parentheses make it easier to read, anyway.)
4487
4488@node Getline/Variable/Pipe, Getline Summary, Getline/Pipe, Getline
4489@subsection Using @code{getline} Into a Variable from a Pipe
4490
4491When you use @samp{@var{command} | getline @var{var}}, the
4492output of the command @var{command} is sent through a pipe to
4493@code{getline} and into the variable @var{var}.  For example, the
4494following program reads the current date and time into the variable
4495@code{current_time}, using the @code{date} utility, and then
4496prints it.
4497
4498@example
4499@group
4500awk 'BEGIN @{
4501     "date" | getline current_time
4502     close("date")
4503     print "Report printed on " current_time
4504@}'
4505@end group
4506@end example
4507
4508In this version of @code{getline}, none of the built-in variables are
4509changed, and the record is not split into fields.
4510
4511@ifinfo
4512@c Thanks to Paul Eggert for initial wording here
4513According to POSIX, @samp{@var{expression} | getline @var{var}} is ambiguous if
4514@var{expression} contains unparenthesized operators other than
4515@samp{$}; for example, @samp{"echo " "date" | getline @var{var}} is ambiguous
4516because the concatenation operator is not parenthesized, and you should
4517write it as @samp{("echo " "date") | getline @var{var}} if you want your
4518program to be portable to other @code{awk} implementations.
4519(It happens that @code{gawk} gets it right, but you should not
4520rely on this. Parentheses make it easier to read, anyway.)
4521@end ifinfo
4522
4523@node Getline Summary,  , Getline/Variable/Pipe, Getline
4524@subsection Summary of @code{getline} Variants
4525
4526With all the forms of @code{getline}, even though @code{$0} and @code{NF},
4527may be updated, the record will not be tested against all the patterns
4528in the @code{awk} program, in the way that would happen if the record
4529were read normally by the main processing loop of @code{awk}.  However
4530the new record is tested against any subsequent rules.
4531
4532@cindex differences between @code{gawk} and @code{awk}
4533@cindex limitations
4534@cindex implementation limits
4535Many @code{awk} implementations limit the number of pipelines an @code{awk}
4536program may have open to just one!  In @code{gawk}, there is no such limit.
4537You can open as many pipelines as the underlying operating system will
4538permit.
4539
4540@vindex FILENAME
4541@cindex dark corner
4542@cindex @code{getline}, setting @code{FILENAME}
4543@cindex @code{FILENAME}, being set by @code{getline}
4544An interesting side-effect occurs if you use @code{getline} (without a
4545redirection) inside a @code{BEGIN} rule. Since an unredirected @code{getline}
4546reads from the command line data files, the first @code{getline} command
4547causes @code{awk} to set the value of @code{FILENAME}. Normally,
4548@code{FILENAME} does not have a value inside @code{BEGIN} rules, since you
4549have not yet started to process the command line data files (d.c.).
4550(@xref{BEGIN/END, , The @code{BEGIN} and @code{END} Special Patterns},
4551also @pxref{Auto-set, , Built-in Variables that Convey Information}.)
4552
4553The following table summarizes the six variants of @code{getline},
4554listing which built-in variables are set by each one.
4555
4556@c @cartouche
4557@table @code
4558@item getline
4559sets @code{$0}, @code{NF}, @code{FNR}, and @code{NR}.
4560
4561@item getline @var{var}
4562sets @var{var}, @code{FNR}, and @code{NR}.
4563
4564@item getline < @var{file}
4565sets @code{$0}, and @code{NF}.
4566
4567@item getline @var{var} < @var{file}
4568sets @var{var}.
4569
4570@item @var{command} | getline
4571sets @code{$0}, and @code{NF}.
4572
4573@item @var{command} | getline @var{var}
4574sets @var{var}.
4575@end table
4576@c @end cartouche
4577
4578@node Printing, Expressions, Reading Files, Top
4579@chapter Printing Output
4580
4581@cindex printing
4582@cindex output
4583One of the most common actions is to @dfn{print}, or output,
4584some or all of the input.  You use the @code{print} statement
4585for simple output.  You use the @code{printf} statement
4586for fancier formatting.  Both are described in this chapter.
4587
4588@menu
4589* Print::                       The @code{print} statement.
4590* Print Examples::              Simple examples of @code{print} statements.
4591* Output Separators::           The output separators and how to change them.
4592* OFMT::                        Controlling Numeric Output With @code{print}.
4593* Printf::                      The @code{printf} statement.
4594* Redirection::                 How to redirect output to multiple files and
4595                                pipes.
4596* Special Files::               File name interpretation in @code{gawk}.
4597                                @code{gawk} allows access to inherited file
4598                                descriptors.
4599* Close Files And Pipes::       Closing Input and Output Files and Pipes.
4600@end menu
4601
4602@node Print, Print Examples, Printing, Printing
4603@section The @code{print} Statement
4604@cindex @code{print} statement
4605
4606The @code{print} statement does output with simple, standardized
4607formatting.  You specify only the strings or numbers to be printed, in a
4608list separated by commas.  They are output, separated by single spaces,
4609followed by a newline.  The statement looks like this:
4610
4611@example
4612print @var{item1}, @var{item2}, @dots{}
4613@end example
4614
4615@noindent
4616The entire list of items may optionally be enclosed in parentheses.  The
4617parentheses are necessary if any of the item expressions uses the @samp{>}
4618relational operator; otherwise it could be confused with a redirection
4619(@pxref{Redirection, ,Redirecting Output of @code{print} and @code{printf}}).
4620
4621The items to be printed can be constant strings or numbers, fields of the
4622current record (such as @code{$1}), variables, or any @code{awk}
4623expressions.
4624Numeric values are converted to strings, and then printed.
4625
4626The @code{print} statement is completely general for
4627computing @emph{what} values to print. However, with two exceptions,
4628you cannot specify @emph{how} to print them---how many
4629columns, whether to use exponential notation or not, and so on.
4630(For the exceptions, @pxref{Output Separators}, and
4631@ref{OFMT, ,Controlling Numeric Output with @code{print}}.)
4632For that, you need the @code{printf} statement
4633(@pxref{Printf, ,Using @code{printf} Statements for Fancier Printing}).
4634
4635The simple statement @samp{print} with no items is equivalent to
4636@samp{print $0}: it prints the entire current record.  To print a blank
4637line, use @samp{print ""}, where @code{""} is the empty string.
4638
4639To print a fixed piece of text, use a string constant such as
4640@w{@code{"Don't Panic"}} as one item.  If you forget to use the
4641double-quote characters, your text will be taken as an @code{awk}
4642expression, and you will probably get an error.  Keep in mind that a
4643space is printed between any two items.
4644
4645Each @code{print} statement makes at least one line of output.  But it
4646isn't limited to one line.  If an item value is a string that contains a
4647newline, the newline is output along with the rest of the string.  A
4648single @code{print} can make any number of lines this way.
4649
4650@node Print Examples, Output Separators, Print, Printing
4651@section Examples of @code{print} Statements
4652
4653Here is an example of printing a string that contains embedded newlines
4654(the @samp{\n} is an escape sequence, used to represent the newline
4655character; @pxref{Escape Sequences}):
4656
4657@example
4658@group
4659$ awk 'BEGIN @{ print "line one\nline two\nline three" @}'
4660@print{} line one
4661@print{} line two
4662@print{} line three
4663@end group
4664@end example
4665
4666Here is an example that prints the first two fields of each input record,
4667with a space between them:
4668
4669@example
4670@group
4671$ awk '@{ print $1, $2 @}' inventory-shipped
4672@print{} Jan 13
4673@print{} Feb 15
4674@print{} Mar 15
4675@dots{}
4676@end group
4677@end example
4678
4679@cindex common mistakes
4680@cindex mistakes, common
4681@cindex errors, common
4682A common mistake in using the @code{print} statement is to omit the comma
4683between two items.  This often has the effect of making the items run
4684together in the output, with no space.  The reason for this is that
4685juxtaposing two string expressions in @code{awk} means to concatenate
4686them.  Here is the same program, without the comma:
4687
4688@example
4689@group
4690$ awk '@{ print $1 $2 @}' inventory-shipped
4691@print{} Jan13
4692@print{} Feb15
4693@print{} Mar15
4694@dots{}
4695@end group
4696@end example
4697
4698To someone unfamiliar with the file @file{inventory-shipped}, neither
4699example's output makes much sense.  A heading line at the beginning
4700would make it clearer.  Let's add some headings to our table of months
4701(@code{$1}) and green crates shipped (@code{$2}).  We do this using the
4702@code{BEGIN} pattern
4703(@pxref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns})
4704to force the headings to be printed only once:
4705
4706@example
4707awk 'BEGIN @{  print "Month Crates"
4708              print "----- ------" @}
4709           @{  print $1, $2 @}' inventory-shipped
4710@end example
4711
4712@noindent
4713Did you already guess what happens? When run, the program prints
4714the following:
4715
4716@example
4717@group
4718Month Crates
4719----- ------
4720Jan 13
4721Feb 15
4722Mar 15
4723@dots{}
4724@end group
4725@end example
4726
4727@noindent
4728The headings and the table data don't line up!  We can fix this by printing
4729some spaces between the two fields:
4730
4731@example
4732awk 'BEGIN @{ print "Month Crates"
4733             print "----- ------" @}
4734           @{ print $1, "     ", $2 @}' inventory-shipped
4735@end example
4736
4737You can imagine that this way of lining up columns can get pretty
4738complicated when you have many columns to fix.  Counting spaces for two
4739or three columns can be simple, but more than this and you can get
4740lost quite easily.  This is why the @code{printf} statement was
4741created (@pxref{Printf, ,Using @code{printf} Statements for Fancier Printing});
4742one of its specialties is lining up columns of data.
4743
4744@cindex line continuation
4745As a side point,
4746you can continue either a @code{print} or @code{printf} statement simply
4747by putting a newline after any comma
4748(@pxref{Statements/Lines, ,@code{awk} Statements Versus Lines}).
4749
4750@node Output Separators, OFMT, Print Examples, Printing
4751@section Output Separators
4752
4753@cindex output field separator, @code{OFS}
4754@cindex output record separator, @code{ORS}
4755@vindex OFS
4756@vindex ORS
4757As mentioned previously, a @code{print} statement contains a list
4758of items, separated by commas.  In the output, the items are normally
4759separated by single spaces.  This need not be the case; a
4760single space is only the default.  You can specify any string of
4761characters to use as the @dfn{output field separator} by setting the
4762built-in variable @code{OFS}.  The initial value of this variable
4763is the string @w{@code{" "}}, that is, a single space.
4764
4765The output from an entire @code{print} statement is called an
4766@dfn{output record}.  Each @code{print} statement outputs one output
4767record and then outputs a string called the @dfn{output record separator}.
4768The built-in variable @code{ORS} specifies this string.  The initial
4769value of @code{ORS} is the string @code{"\n"}, i.e.@: a newline
4770character; thus, normally each @code{print} statement makes a separate line.
4771
4772You can change how output fields and records are separated by assigning
4773new values to the variables @code{OFS} and/or @code{ORS}.  The usual
4774place to do this is in the @code{BEGIN} rule
4775(@pxref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns}), so
4776that it happens before any input is processed.  You may also do this
4777with assignments on the command line, before the names of your input
4778files, or using the @samp{-v} command line option
4779(@pxref{Options, ,Command Line Options}).
4780
4781@ignore
4782Exercise,
4783Rewrite the
4784@example
4785awk 'BEGIN @{ print "Month Crates"
4786             print "----- ------" @}
4787           @{ print $1, "     ", $2 @}' inventory-shipped
4788@end example
4789program by using a new value of @code{OFS}.
4790@end ignore
4791
4792The following example prints the first and second fields of each input
4793record separated by a semicolon, with a blank line added after each
4794line:
4795
4796@example
4797@group
4798$ awk 'BEGIN @{ OFS = ";"; ORS = "\n\n" @}
4799>            @{ print $1, $2 @}' BBS-list
4800@print{} aardvark;555-5553
4801@print{}
4802@print{} alpo-net;555-3412
4803@print{}
4804@print{} barfly;555-7685
4805@dots{}
4806@end group
4807@end example
4808
4809If the value of @code{ORS} does not contain a newline, all your output
4810will be run together on a single line, unless you output newlines some
4811other way.
4812
4813@node OFMT, Printf, Output Separators, Printing
4814@section Controlling Numeric Output with @code{print}
4815@vindex OFMT
4816@cindex numeric output format
4817@cindex format, numeric output
4818@cindex output format specifier, @code{OFMT}
4819When you use the @code{print} statement to print numeric values,
4820@code{awk} internally converts the number to a string of characters,
4821and prints that string.  @code{awk} uses the @code{sprintf} function
4822to do this conversion
4823(@pxref{String Functions, ,Built-in Functions for String Manipulation}).
4824For now, it suffices to say that the @code{sprintf}
4825function accepts a @dfn{format specification} that tells it how to format
4826numbers (or strings), and that there are a number of different ways in which
4827numbers can be formatted.  The different format specifications are discussed
4828more fully in
4829@ref{Control Letters, , Format-Control Letters}.
4830
4831The built-in variable @code{OFMT} contains the default format specification
4832that @code{print} uses with @code{sprintf} when it wants to convert a
4833number to a string for printing.
4834The default value of @code{OFMT} is @code{"%.6g"}.
4835By supplying different format specifications
4836as the value of @code{OFMT}, you can change how @code{print} will print
4837your numbers.  As a brief example:
4838
4839@example
4840@group
4841$ awk 'BEGIN @{
4842>   OFMT = "%.0f"  # print numbers as integers (rounds)
4843>   print 17.23 @}'
4844@print{} 17
4845@end group
4846@end example
4847
4848@noindent
4849@cindex dark corner
4850@cindex @code{awk} language, POSIX version
4851@cindex POSIX @code{awk}
4852According to the POSIX standard, @code{awk}'s behavior will be undefined
4853if @code{OFMT} contains anything but a floating point conversion specification
4854(d.c.).
4855
4856@node Printf, Redirection, OFMT, Printing
4857@section Using @code{printf} Statements for Fancier Printing
4858@cindex formatted output
4859@cindex output, formatted
4860
4861If you want more precise control over the output format than
4862@code{print} gives you, use @code{printf}.  With @code{printf} you can
4863specify the width to use for each item, and you can specify various
4864formatting choices for numbers (such as what radix to use, whether to
4865print an exponent, whether to print a sign, and how many digits to print
4866after the decimal point).  You do this by supplying a string, called
4867the @dfn{format string}, which controls how and where to print the other
4868arguments.
4869
4870@menu
4871* Basic Printf::                Syntax of the @code{printf} statement.
4872* Control Letters::             Format-control letters.
4873* Format Modifiers::            Format-specification modifiers.
4874* Printf Examples::             Several examples.
4875@end menu
4876
4877@node Basic Printf, Control Letters, Printf, Printf
4878@subsection Introduction to the @code{printf} Statement
4879
4880@cindex @code{printf} statement, syntax of
4881The @code{printf} statement looks like this:
4882
4883@example
4884printf @var{format}, @var{item1}, @var{item2}, @dots{}
4885@end example
4886
4887@noindent
4888The entire list of arguments may optionally be enclosed in parentheses.  The
4889parentheses are necessary if any of the item expressions use the @samp{>}
4890relational operator; otherwise it could be confused with a redirection
4891(@pxref{Redirection, ,Redirecting Output of @code{print} and @code{printf}}).
4892
4893@cindex format string
4894The difference between @code{printf} and @code{print} is the @var{format}
4895argument.  This is an expression whose value is taken as a string; it
4896specifies how to output each of the other arguments.  It is called
4897the @dfn{format string}.
4898
4899The format string is very similar to that in the ANSI C library function
4900@code{printf}.  Most of @var{format} is text to be output verbatim.
4901Scattered among this text are @dfn{format specifiers}, one per item.
4902Each format specifier says to output the next item in the argument list
4903at that place in the format.
4904
4905The @code{printf} statement does not automatically append a newline to its
4906output.  It outputs only what the format string specifies.  So if you want
4907a newline, you must include one in the format string.  The output separator
4908variables @code{OFS} and @code{ORS} have no effect on @code{printf}
4909statements. For example:
4910
4911@example
4912@group
4913BEGIN @{
4914   ORS = "\nOUCH!\n"; OFS = "!"
4915   msg = "Don't Panic!"; printf "%s\n", msg
4916@}
4917@end group
4918@end example
4919
4920This program still prints the familiar @samp{Don't Panic!} message.
4921
4922@node Control Letters, Format Modifiers, Basic Printf, Printf
4923@subsection Format-Control Letters
4924@cindex @code{printf}, format-control characters
4925@cindex format specifier
4926
4927A format specifier starts with the character @samp{%} and ends with a
4928@dfn{format-control letter}; it tells the @code{printf} statement how
4929to output one item.  (If you actually want to output a @samp{%}, write
4930@samp{%%}.)  The format-control letter specifies what kind of value to
4931print.  The rest of the format specifier is made up of optional
4932@dfn{modifiers} which are parameters to use, such as the field width.
4933
4934Here is a list of the format-control letters:
4935
4936@table @code
4937@item c
4938This prints a number as an ASCII character.  Thus, @samp{printf "%c",
493965} outputs the letter @samp{A}.  The output for a string value is
4940the first character of the string.
4941
4942@item d
4943@itemx i
4944These are equivalent. They both print a decimal integer.
4945The @samp{%i} specification is for compatibility with ANSI C.
4946
4947@item e
4948@itemx E
4949This prints a number in scientific (exponential) notation.
4950For example,
4951
4952@example
4953printf "%4.3e\n", 1950
4954@end example
4955
4956@noindent
4957prints @samp{1.950e+03}, with a total of four significant figures of
4958which three follow the decimal point.  The @samp{4.3} are modifiers,
4959discussed below. @samp{%E} uses @samp{E} instead of @samp{e} in the output.
4960
4961@item f
4962This prints a number in floating point notation.
4963For example,
4964
4965@example
4966printf "%4.3f", 1950
4967@end example
4968
4969@noindent
4970prints @samp{1950.000}, with a total of four significant figures of
4971which three follow the decimal point.  The @samp{4.3} are modifiers,
4972discussed below.
4973
4974@item g
4975@itemx G
4976This prints a number in either scientific notation or floating point
4977notation, whichever uses fewer characters. If the result is printed in
4978scientific notation, @samp{%G} uses @samp{E} instead of @samp{e}.
4979
4980@item o
4981This prints an unsigned octal integer.
4982(In octal, or base-eight notation, the digits run from @samp{0} to @samp{7};
4983the decimal number eight is represented as @samp{10} in octal.)
4984
4985@item s
4986This prints a string.
4987
4988@item u
4989This prints an unsigned decimal number.
4990(This format is of marginal use, since all numbers in @code{awk}
4991are floating point.  It is provided primarily for compatibility
4992with C.)
4993
4994@item x
4995@itemx X
4996This prints an unsigned hexadecimal integer.
4997(In hexadecimal, or base-16 notation, the digits are @samp{0} through @samp{9}
4998and @samp{a} through @samp{f}.  The hexadecimal digit @samp{f} represents
4999the decimal number 15.) @samp{%X} uses the letters @samp{A} through @samp{F}
5000instead of @samp{a} through @samp{f}.
5001
5002@item %
5003This isn't really a format-control letter, but it does have a meaning
5004when used after a @samp{%}: the sequence @samp{%%} outputs one
5005@samp{%}.  It does not consume an argument, and it ignores any
5006modifiers.
5007@end table
5008
5009@cindex dark corner
5010When using the integer format-control letters for values that are outside
5011the range of a C @code{long} integer, @code{gawk} will switch to the
5012@samp{%g} format specifier. Other versions of @code{awk} may print
5013invalid values, or do something else entirely (d.c.).
5014
5015@node Format Modifiers, Printf Examples, Control Letters, Printf
5016@subsection Modifiers for @code{printf} Formats
5017
5018@cindex @code{printf}, modifiers
5019@cindex modifiers (in format specifiers)
5020A format specification can also include @dfn{modifiers} that can control
5021how much of the item's value is printed and how much space it gets.  The
5022modifiers come between the @samp{%} and the format-control letter.
5023In the examples below, we use the bullet symbol ``@bullet{}'' to represent
5024spaces in the output. Here are the possible modifiers, in the order in
5025which they may appear:
5026
5027@table @code
5028@item -
5029The minus sign, used before the width modifier (see below),
5030says to left-justify
5031the argument within its specified width.  Normally the argument
5032is printed right-justified in the specified width.  Thus,
5033
5034@example
5035printf "%-4s", "foo"
5036@end example
5037
5038@noindent
5039prints @samp{foo@bullet{}}.
5040
5041@item @var{space}
5042For numeric conversions, prefix positive values with a space, and
5043negative values with a minus sign.
5044
5045@item +
5046The plus sign, used before the width modifier (see below),
5047says to always supply a sign for numeric conversions, even if the data
5048to be formatted is positive. The @samp{+} overrides the space modifier.
5049
5050@item #
5051Use an ``alternate form'' for certain control letters.
5052For @samp{%o}, supply a leading zero.
5053For @samp{%x}, and @samp{%X}, supply a leading @samp{0x} or @samp{0X} for
5054a non-zero result.
5055For @samp{%e}, @samp{%E}, and @samp{%f}, the result will always contain a
5056decimal point.
5057For @samp{%g}, and @samp{%G}, trailing zeros are not removed from the result.
5058
5059@cindex dark corner
5060@item 0
5061A leading @samp{0} (zero) acts as a flag, that indicates output should be
5062padded with zeros instead of spaces.
5063This applies even to non-numeric output formats (d.c.).
5064This flag only has an effect when the field width is wider than the
5065value to be printed.
5066
5067@item @var{width}
5068This is a number specifying the desired minimum width of a field.  Inserting any
5069number between the @samp{%} sign and the format control character forces the
5070field to be expanded to this width.  The default way to do this is to
5071pad with spaces on the left.  For example,
5072
5073@example
5074printf "%4s", "foo"
5075@end example
5076
5077@noindent
5078prints @samp{@bullet{}foo}.
5079
5080The value of @var{width} is a minimum width, not a maximum.  If the item
5081value requires more than @var{width} characters, it can be as wide as
5082necessary.  Thus,
5083
5084@example
5085printf "%4s", "foobar"
5086@end example
5087
5088@noindent
5089prints @samp{foobar}.
5090
5091Preceding the @var{width} with a minus sign causes the output to be
5092padded with spaces on the right, instead of on the left.
5093
5094@item .@var{prec}
5095This is a number that specifies the precision to use when printing.
5096For the @samp{e}, @samp{E}, and @samp{f} formats, this specifies the
5097number of digits you want printed to the right of the decimal point.
5098For the @samp{g}, and @samp{G} formats, it specifies the maximum number
5099of significant digits.  For the @samp{d}, @samp{o}, @samp{i}, @samp{u},
5100@samp{x}, and @samp{X} formats, it specifies the minimum number of
5101digits to print.  For a string, it specifies the maximum number of
5102characters from the string that should be printed.  Thus,
5103
5104@example
5105printf "%.4s", "foobar"
5106@end example
5107
5108@noindent
5109prints @samp{foob}.
5110@end table
5111
5112The C library @code{printf}'s dynamic @var{width} and @var{prec}
5113capability (for example, @code{"%*.*s"}) is supported.  Instead of
5114supplying explicit @var{width} and/or @var{prec} values in the format
5115string, you pass them in the argument list.  For example:
5116
5117@example
5118w = 5
5119p = 3
5120s = "abcdefg"
5121printf "%*.*s\n", w, p, s
5122@end example
5123
5124@noindent
5125is exactly equivalent to
5126
5127@example
5128s = "abcdefg"
5129printf "%5.3s\n", s
5130@end example
5131
5132@noindent
5133Both programs output @samp{@w{@bullet{}@bullet{}abc}}.
5134
5135Earlier versions of @code{awk} did not support this capability.
5136If you must use such a version, you may simulate this feature by using
5137concatenation to build up the format string, like so:
5138
5139@example
5140w = 5
5141p = 3
5142s = "abcdefg"
5143printf "%" w "." p "s\n", s
5144@end example
5145
5146@noindent
5147This is not particularly easy to read, but it does work.
5148
5149@cindex @code{awk} language, POSIX version
5150@cindex POSIX @code{awk}
5151C programmers may be used to supplying additional @samp{l} and @samp{h}
5152flags in @code{printf} format strings. These are not valid in @code{awk}.
5153Most @code{awk} implementations silently ignore these flags.
5154If @samp{--lint} is provided on the command line
5155(@pxref{Options, ,Command Line Options}),
5156@code{gawk} will warn about their use. If @samp{--posix} is supplied,
5157their use is a fatal error.
5158
5159@node Printf Examples,  , Format Modifiers, Printf
5160@subsection Examples Using @code{printf}
5161
5162Here is how to use @code{printf} to make an aligned table:
5163
5164@example
5165awk '@{ printf "%-10s %s\n", $1, $2 @}' BBS-list
5166@end example
5167
5168@noindent
5169prints the names of bulletin boards (@code{$1}) of the file
5170@file{BBS-list} as a string of 10 characters, left justified.  It also
5171prints the phone numbers (@code{$2}) afterward on the line.  This
5172produces an aligned two-column table of names and phone numbers:
5173
5174@example
5175@group
5176$ awk '@{ printf "%-10s %s\n", $1, $2 @}' BBS-list
5177@print{} aardvark   555-5553
5178@print{} alpo-net   555-3412
5179@print{} barfly     555-7685
5180@print{} bites      555-1675
5181@print{} camelot    555-0542
5182@print{} core       555-2912
5183@print{} fooey      555-1234
5184@print{} foot       555-6699
5185@print{} macfoo     555-6480
5186@print{} sdace      555-3430
5187@print{} sabafoo    555-2127
5188@end group
5189@end example
5190
5191Did you notice that we did not specify that the phone numbers be printed
5192as numbers?  They had to be printed as strings because the numbers are
5193separated by a dash.
5194If we had tried to print the phone numbers as numbers, all we would have
5195gotten would have been the first three digits, @samp{555}.
5196This would have been pretty confusing.
5197
5198We did not specify a width for the phone numbers because they are the
5199last things on their lines.  We don't need to put spaces after them.
5200
5201We could make our table look even nicer by adding headings to the tops
5202of the columns.  To do this, we use the @code{BEGIN} pattern
5203(@pxref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns})
5204to force the header to be printed only once, at the beginning of
5205the @code{awk} program:
5206
5207@example
5208@group
5209awk 'BEGIN @{ print "Name      Number"
5210             print "----      ------" @}
5211     @{ printf "%-10s %s\n", $1, $2 @}' BBS-list
5212@end group
5213@end example
5214
5215Did you notice that we mixed @code{print} and @code{printf} statements in
5216the above example?  We could have used just @code{printf} statements to get
5217the same results:
5218
5219@example
5220@group
5221awk 'BEGIN @{ printf "%-10s %s\n", "Name", "Number"
5222             printf "%-10s %s\n", "----", "------" @}
5223     @{ printf "%-10s %s\n", $1, $2 @}' BBS-list
5224@end group
5225@end example
5226
5227@noindent
5228By printing each column heading with the same format specification
5229used for the elements of the column, we have made sure that the headings
5230are aligned just like the columns.
5231
5232The fact that the same format specification is used three times can be
5233emphasized by storing it in a variable, like this:
5234
5235@example
5236@group
5237awk 'BEGIN @{ format = "%-10s %s\n"
5238             printf format, "Name", "Number"
5239             printf format, "----", "------" @}
5240     @{ printf format, $1, $2 @}' BBS-list
5241@end group
5242@end example
5243
5244@c !!! exercise
5245See if you can use the @code{printf} statement to line up the headings and
5246table data for our @file{inventory-shipped} example covered earlier in the
5247section on the @code{print} statement
5248(@pxref{Print, ,The @code{print} Statement}).
5249
5250@node Redirection, Special Files, Printf, Printing
5251@section Redirecting Output of @code{print} and @code{printf}
5252
5253@cindex output redirection
5254@cindex redirection of output
5255So far we have been dealing only with output that prints to the standard
5256output, usually your terminal.  Both @code{print} and @code{printf} can
5257also send their output to other places.
5258This is called @dfn{redirection}.
5259
5260A redirection appears after the @code{print} or @code{printf} statement.
5261Redirections in @code{awk} are written just like redirections in shell
5262commands, except that they are written inside the @code{awk} program.
5263
5264There are three forms of output redirection: output to a file,
5265output appended to a file, and output through a pipe to another
5266command.
5267They are all shown for
5268the @code{print} statement, but they work identically for @code{printf}
5269also.
5270
5271@table @code
5272@item print @var{items} > @var{output-file}
5273This type of redirection prints the items into the output file
5274@var{output-file}.  The file name @var{output-file} can be any
5275expression.  Its value is changed to a string and then used as a
5276file name (@pxref{Expressions}).
5277
5278When this type of redirection is used, the @var{output-file} is erased
5279before the first output is written to it.  Subsequent writes
5280to the same @var{output-file} do not
5281erase @var{output-file}, but append to it.  If @var{output-file} does
5282not exist, then it is created.
5283
5284For example, here is how an @code{awk} program can write a list of
5285BBS names to a file @file{name-list} and a list of phone numbers to a
5286file @file{phone-list}.  Each output file contains one name or number
5287per line.
5288
5289@example
5290@group
5291$ awk '@{ print $2 > "phone-list"
5292>        print $1 > "name-list" @}' BBS-list
5293@end group
5294@group
5295$ cat phone-list
5296@print{} 555-5553
5297@print{} 555-3412
5298@dots{}
5299@end group
5300@group
5301$ cat name-list
5302@print{} aardvark
5303@print{} alpo-net
5304@dots{}
5305@end group
5306@end example
5307
5308@item print @var{items} >> @var{output-file}
5309This type of redirection prints the items into the pre-existing output file
5310@var{output-file}.  The difference between this and the
5311single-@samp{>} redirection is that the old contents (if any) of
5312@var{output-file} are not erased.  Instead, the @code{awk} output is
5313appended to the file.
5314If @var{output-file} does not exist, then it is created.
5315
5316@cindex pipes for output
5317@cindex output, piping
5318@item print @var{items} | @var{command}
5319It is also possible to send output to another program through a pipe
5320instead of into a
5321file.   This type of redirection opens a pipe to @var{command} and writes
5322the values of @var{items} through this pipe, to another process created
5323to execute @var{command}.
5324
5325The redirection argument @var{command} is actually an @code{awk}
5326expression.  Its value is converted to a string, whose contents give the
5327shell command to be run.
5328
5329For example, this produces two files, one unsorted list of BBS names
5330and one list sorted in reverse alphabetical order:
5331
5332@example
5333awk '@{ print $1 > "names.unsorted"
5334       command = "sort -r > names.sorted"
5335       print $1 | command @}' BBS-list
5336@end example
5337
5338Here the unsorted list is written with an ordinary redirection while
5339the sorted list is written by piping through the @code{sort} utility.
5340
5341This example uses redirection to mail a message to a mailing
5342list @samp{bug-system}.  This might be useful when trouble is encountered
5343in an @code{awk} script run periodically for system maintenance.
5344
5345@example
5346report = "mail bug-system"
5347print "Awk script failed:", $0 | report
5348m = ("at record number " FNR " of " FILENAME)
5349print m | report
5350close(report)
5351@end example
5352
5353The message is built using string concatenation and saved in the variable
5354@code{m}.  It is then sent down the pipeline to the @code{mail} program.
5355
5356We call the @code{close} function here because it's a good idea to close
5357the pipe as soon as all the intended output has been sent to it.
5358@xref{Close Files And Pipes, ,Closing Input and Output Files and Pipes},
5359for more information
5360on this.  This example also illustrates the use of a variable to represent
5361a @var{file} or @var{command}: it is not necessary to always
5362use a string constant.  Using a variable is generally a good idea,
5363since @code{awk} requires you to spell the string value identically
5364every time.
5365@end table
5366
5367Redirecting output using @samp{>}, @samp{>>}, or @samp{|} asks the system
5368to open a file or pipe only if the particular @var{file} or @var{command}
5369you've specified has not already been written to by your program, or if
5370it has been closed since it was last written to.
5371
5372@cindex differences between @code{gawk} and @code{awk}
5373@cindex limitations
5374@cindex implementation limits
5375@iftex
5376As mentioned earlier
5377(@pxref{Getline Summary,  , Summary of @code{getline} Variants}),
5378many
5379@end iftex
5380@ifinfo
5381Many
5382@end ifinfo
5383@code{awk} implementations limit the number of pipelines an @code{awk}
5384program may have open to just one!  In @code{gawk}, there is no such limit.
5385You can open as many pipelines as the underlying operating system will
5386permit.
5387
5388@node Special Files, Close Files And Pipes , Redirection, Printing
5389@section Special File Names in @code{gawk}
5390@cindex standard input
5391@cindex standard output
5392@cindex standard error output
5393@cindex file descriptors
5394
5395Running programs conventionally have three input and output streams
5396already available to them for reading and writing.  These are known as
5397the @dfn{standard input}, @dfn{standard output}, and @dfn{standard error
5398output}.  These streams are, by default, connected to your terminal, but
5399they are often redirected with the shell, via the @samp{<}, @samp{<<},
5400@samp{>}, @samp{>>}, @samp{>&} and @samp{|} operators.  Standard error
5401is typically used for writing error messages; the reason we have two separate
5402streams, standard output and standard error, is so that they can be
5403redirected separately.
5404
5405@cindex differences between @code{gawk} and @code{awk}
5406In other implementations of @code{awk}, the only way to write an error
5407message to standard error in an @code{awk} program is as follows:
5408
5409@example
5410print "Serious error detected!" | "cat 1>&2"
5411@end example
5412
5413@noindent
5414This works by opening a pipeline to a shell command which can access the
5415standard error stream which it inherits from the @code{awk} process.
5416This is far from elegant, and is also inefficient, since it requires a
5417separate process.  So people writing @code{awk} programs often
5418neglect to do this.  Instead, they send the error messages to the
5419terminal, like this:
5420
5421@example
5422@group
5423print "Serious error detected!" > "/dev/tty"
5424@end group
5425@end example
5426
5427@noindent
5428This usually has the same effect, but not always: although the
5429standard error stream is usually the terminal, it can be redirected, and
5430when that happens, writing to the terminal is not correct.  In fact, if
5431@code{awk} is run from a background job, it may not have a terminal at all.
5432Then opening @file{/dev/tty} will fail.
5433
5434@code{gawk} provides special file names for accessing the three standard
5435streams.  When you redirect input or output in @code{gawk}, if the file name
5436matches one of these special names, then @code{gawk} directly uses the
5437stream it stands for.
5438
5439@cindex @file{/dev/stdin}
5440@cindex @file{/dev/stdout}
5441@cindex @file{/dev/stderr}
5442@cindex @file{/dev/fd}
5443@c @cartouche
5444@table @file
5445@item /dev/stdin
5446The standard input (file descriptor 0).
5447
5448@item /dev/stdout
5449The standard output (file descriptor 1).
5450
5451@item /dev/stderr
5452The standard error output (file descriptor 2).
5453
5454@item /dev/fd/@var{N}
5455The file associated with file descriptor @var{N}.  Such a file must have
5456been opened by the program initiating the @code{awk} execution (typically
5457the shell).  Unless you take special pains in the shell from which
5458you invoke @code{gawk}, only descriptors 0, 1 and 2 are available.
5459@end table
5460@c @end cartouche
5461
5462The file names @file{/dev/stdin}, @file{/dev/stdout}, and @file{/dev/stderr}
5463are aliases for @file{/dev/fd/0}, @file{/dev/fd/1}, and @file{/dev/fd/2},
5464respectively, but they are more self-explanatory.
5465
5466The proper way to write an error message in a @code{gawk} program
5467is to use @file{/dev/stderr}, like this:
5468
5469@example
5470print "Serious error detected!" > "/dev/stderr"
5471@end example
5472
5473@code{gawk} also provides special file names that give access to information
5474about the running @code{gawk} process.  Each of these ``files'' provides
5475a single record of information.  To read them more than once, you must
5476first close them with the @code{close} function
5477(@pxref{Close Files And Pipes, ,Closing Input and Output Files and Pipes}).
5478The filenames are:
5479
5480@cindex process information
5481@cindex @file{/dev/pid}
5482@cindex @file{/dev/pgrpid}
5483@cindex @file{/dev/ppid}
5484@cindex @file{/dev/user}
5485@c @cartouche
5486@table @file
5487@item /dev/pid
5488Reading this file returns the process ID of the current process,
5489in decimal, terminated with a newline.
5490
5491@item  /dev/ppid
5492Reading this file returns the parent process ID of the current process,
5493in decimal, terminated with a newline.
5494
5495@item  /dev/pgrpid
5496Reading this file returns the process group ID of the current process,
5497in decimal, terminated with a newline.
5498
5499@item /dev/user
5500Reading this file returns a single record terminated with a newline.
5501The fields are separated with spaces.  The fields represent the
5502following information:
5503
5504@table @code
5505@item $1
5506The return value of the @code{getuid} system call
5507(the real user ID number).
5508
5509@item $2
5510The return value of the @code{geteuid} system call
5511(the effective user ID number).
5512
5513@item $3
5514The return value of the @code{getgid} system call
5515(the real group ID number).
5516
5517@item $4
5518The return value of the @code{getegid} system call
5519(the effective group ID number).
5520@end table
5521
5522If there are any additional fields, they are the group IDs returned by
5523@code{getgroups} system call.
5524(Multiple groups may not be supported on all systems.)
5525@end table
5526@c @end cartouche
5527
5528These special file names may be used on the command line as data
5529files, as well as for I/O redirections within an @code{awk} program.
5530They may not be used as source files with the @samp{-f} option.
5531
5532Recognition of these special file names is disabled if @code{gawk} is in
5533compatibility mode (@pxref{Options, ,Command Line Options}).
5534
5535@strong{Caution}:  Unless your system actually has a @file{/dev/fd} directory
5536(or any of the other above listed special files),
5537the interpretation of these file names is done by @code{gawk} itself.
5538For example, using @samp{/dev/fd/4} for output will actually write on
5539file descriptor 4, and not on a new file descriptor that was @code{dup}'ed
5540from file descriptor 4.  Most of the time this does not matter; however, it
5541is important to @emph{not} close any of the files related to file descriptors
55420, 1, and 2.  If you do close one of these files, unpredictable behavior
5543will result.
5544
5545The special files that provide process-related information will disappear
5546in a future version of @code{gawk}.
5547@xref{Future Extensions, ,Probable Future Extensions}.
5548
5549@node Close Files And Pipes, , Special Files, Printing
5550@section Closing Input and Output Files and Pipes
5551@cindex closing input files and pipes
5552@cindex closing output files and pipes
5553@findex close
5554
5555If the same file name or the same shell command is used with
5556@code{getline}
5557(@pxref{Getline, ,Explicit Input with @code{getline}})
5558more than once during the execution of an @code{awk}
5559program, the file is opened (or the command is executed) only the first time.
5560At that time, the first record of input is read from that file or command.
5561The next time the same file or command is used in @code{getline}, another
5562record is read from it, and so on.
5563
5564Similarly, when a file or pipe is opened for output, the file name or command
5565associated with
5566it is remembered by @code{awk} and subsequent writes to the same file or
5567command are appended to the previous writes.  The file or pipe stays
5568open until @code{awk} exits.
5569
5570This implies that if you want to start reading the same file again from
5571the beginning, or if you want to rerun a shell command (rather than
5572reading more output from the command), you must take special steps.
5573What you must do is use the @code{close} function, as follows:
5574
5575@example
5576close(@var{filename})
5577@end example
5578
5579@noindent
5580or
5581
5582@example
5583close(@var{command})
5584@end example
5585
5586The argument @var{filename} or @var{command} can be any expression.  Its
5587value must @emph{exactly} match the string that was used to open the file or
5588start the command (spaces and other ``irrelevant'' characters
5589included). For example, if you open a pipe with this:
5590
5591@example
5592"sort -r names" | getline foo
5593@end example
5594
5595@noindent
5596then you must close it with this:
5597
5598@example
5599close("sort -r names")
5600@end example
5601
5602Once this function call is executed, the next @code{getline} from that
5603file or command, or the next @code{print} or @code{printf} to that
5604file or command, will reopen the file or rerun the command.
5605
5606Because the expression that you use to close a file or pipeline must
5607exactly match the expression used to open the file or run the command,
5608it is good practice to use a variable to store the file name or command.
5609The previous example would become
5610
5611@example
5612sortcom = "sort -r names"
5613sortcom | getline foo
5614@dots{}
5615close(sortcom)
5616@end example
5617
5618@noindent
5619This helps avoid hard-to-find typographical errors in your @code{awk}
5620programs.
5621
5622Here are some reasons why you might need to close an output file:
5623
5624@itemize @bullet
5625@item
5626To write a file and read it back later on in the same @code{awk}
5627program.  Close the file when you are finished writing it; then
5628you can start reading it with @code{getline}.
5629
5630@item
5631To write numerous files, successively, in the same @code{awk}
5632program.  If you don't close the files, eventually you may exceed a
5633system limit on the number of open files in one process.  So close
5634each one when you are finished writing it.
5635
5636@item
5637To make a command finish.  When you redirect output through a pipe,
5638the command reading the pipe normally continues to try to read input
5639as long as the pipe is open.  Often this means the command cannot
5640really do its work until the pipe is closed.  For example, if you
5641redirect output to the @code{mail} program, the message is not
5642actually sent until the pipe is closed.
5643
5644@c NEEDED
5645@page
5646@item
5647To run the same program a second time, with the same arguments.
5648This is not the same thing as giving more input to the first run!
5649
5650For example, suppose you pipe output to the @code{mail} program.  If you
5651output several lines redirected to this pipe without closing it, they make
5652a single message of several lines.  By contrast, if you close the pipe
5653after each line of output, then each line makes a separate message.
5654@end itemize
5655
5656@vindex ERRNO
5657@cindex differences between @code{gawk} and @code{awk}
5658@code{close} returns a value of zero if the close succeeded.
5659Otherwise, the value will be non-zero.
5660In this case, @code{gawk} sets the variable @code{ERRNO} to a string
5661describing the error that occurred.
5662
5663@cindex differences between @code{gawk} and @code{awk}
5664@cindex portability issues
5665If you use more files than the system allows you to have open,
5666@code{gawk} will attempt to multiplex the available open files among
5667your data files.  @code{gawk}'s ability to do this depends upon the
5668facilities of your operating system: it may not always work.  It is
5669therefore both good practice and good portability advice to always
5670use @code{close} on your files when you are done with them.
5671
5672@node Expressions, Patterns and Actions, Printing, Top
5673@chapter Expressions
5674@cindex expression
5675
5676Expressions are the basic building blocks of @code{awk} patterns
5677and actions.  An expression evaluates to a value, which you can print, test,
5678store in a variable or pass to a function.  Additionally, an expression
5679can assign a new value to a variable or a field, with an assignment operator.
5680
5681An expression can serve as a pattern or action statement on its own.
5682Most other kinds of
5683statements contain one or more expressions which specify data on which to
5684operate.  As in other languages, expressions in @code{awk} include
5685variables, array references, constants, and function calls, as well as
5686combinations of these with various operators.
5687
5688@menu
5689* Constants::                   String, numeric, and regexp constants.
5690* Using Constant Regexps::      When and how to use a regexp constant.
5691* Variables::                   Variables give names to values for later use.
5692* Conversion::                  The conversion of strings to numbers and vice
5693                                versa.
5694* Arithmetic Ops::              Arithmetic operations (@samp{+}, @samp{-},
5695                                etc.)
5696* Concatenation::               Concatenating strings.
5697* Assignment Ops::              Changing the value of a variable or a field.
5698* Increment Ops::               Incrementing the numeric value of a variable.
5699* Truth Values::                What is ``true'' and what is ``false''.
5700* Typing and Comparison::       How variables acquire types, and how this
5701                                affects comparison of numbers and strings with
5702                                @samp{<}, etc.
5703* Boolean Ops::                 Combining comparison expressions using boolean
5704                                operators @samp{||} (``or''), @samp{&&}
5705                                (``and'') and @samp{!} (``not'').
5706* Conditional Exp::             Conditional expressions select between two
5707                                subexpressions under control of a third
5708                                subexpression.
5709* Function Calls::              A function call is an expression.
5710* Precedence::                  How various operators nest.
5711@end menu
5712
5713@node Constants, Using Constant Regexps, Expressions, Expressions
5714@section Constant Expressions
5715@cindex constants, types of
5716@cindex string constants
5717
5718The simplest type of expression is the @dfn{constant}, which always has
5719the same value.  There are three types of constants: numeric constants,
5720string constants, and regular expression constants.
5721
5722@menu
5723* Scalar Constants::            Numeric and string constants.
5724* Regexp Constants::            Regular Expression constants.
5725@end menu
5726
5727@node Scalar Constants, Regexp Constants, Constants, Constants
5728@subsection Numeric and String Constants
5729
5730@cindex numeric constant
5731@cindex numeric value
5732A @dfn{numeric constant} stands for a number.  This number can be an
5733integer, a decimal fraction, or a number in scientific (exponential)
5734notation.@footnote{The internal representation uses double-precision
5735floating point numbers. If you don't know what that means, then don't
5736worry about it.} Here are some examples of numeric constants, which all
5737have the same value:
5738
5739@example
5740105
57411.05e+2
57421050e-1
5743@end example
5744
5745A string constant consists of a sequence of characters enclosed in
5746double-quote marks.  For example:
5747
5748@example
5749"parrot"
5750@end example
5751
5752@noindent
5753@cindex differences between @code{gawk} and @code{awk}
5754represents the string whose contents are @samp{parrot}.  Strings in
5755@code{gawk} can be of any length and they can contain any of the possible
5756eight-bit ASCII characters including ASCII NUL (character code zero).
5757Other @code{awk}
5758implementations may have difficulty with some character codes.
5759
5760@node Regexp Constants,  , Scalar Constants, Constants
5761@subsection Regular Expression Constants
5762
5763@cindex @code{~} operator
5764@cindex @code{!~} operator
5765A regexp constant is a regular expression description enclosed in
5766slashes, such as @code{@w{/^beginning and end$/}}.  Most regexps used in
5767@code{awk} programs are constant, but the @samp{~} and @samp{!~}
5768matching operators can also match computed or ``dynamic'' regexps
5769(which are just ordinary strings or variables that contain a regexp).
5770
5771@node Using Constant Regexps, Variables, Constants, Expressions
5772@section Using Regular Expression Constants
5773
5774When used on the right hand side of the @samp{~} or @samp{!~}
5775operators, a regexp constant merely stands for the regexp that is to be
5776matched.
5777
5778@cindex dark corner
5779Regexp constants (such as @code{/foo/}) may be used like simple expressions.
5780When a
5781regexp constant appears by itself, it has the same meaning as if it appeared
5782in a pattern, i.e.@: @samp{($0 ~ /foo/)} (d.c.)
5783(@pxref{Expression Patterns, ,Expressions as Patterns}).
5784This means that the two code segments,
5785
5786@example
5787if ($0 ~ /barfly/ || $0 ~ /camelot/)
5788    print "found"
5789@end example
5790
5791@noindent
5792and
5793
5794@example
5795if (/barfly/ || /camelot/)
5796    print "found"
5797@end example
5798
5799@noindent
5800are exactly equivalent.
5801
5802One rather bizarre consequence of this rule is that the following
5803boolean expression is valid, but does not do what the user probably
5804intended:
5805
5806@example
5807# note that /foo/ is on the left of the ~
5808if (/foo/ ~ $1) print "found foo"
5809@end example
5810
5811@noindent
5812This code is ``obviously'' testing @code{$1} for a match against the regexp
5813@code{/foo/}.  But in fact, the expression @samp{/foo/ ~ $1} actually means
5814@samp{($0 ~ /foo/) ~ $1}.  In other words, first match the input record
5815against the regexp @code{/foo/}.  The result will be either zero or one,
5816depending upon the success or failure of the match.  Then match that result
5817against the first field in the record.
5818
5819Since it is unlikely that you would ever really wish to make this kind of
5820test, @code{gawk} will issue a warning when it sees this construct in
5821a program.
5822
5823Another consequence of this rule is that the assignment statement
5824
5825@example
5826matches = /foo/
5827@end example
5828
5829@noindent
5830will assign either zero or one to the variable @code{matches}, depending
5831upon the contents of the current input record.
5832
5833This feature of the language was never well documented until the
5834POSIX specification.
5835
5836@cindex differences between @code{gawk} and @code{awk}
5837@cindex dark corner
5838Constant regular expressions are also used as the first argument for
5839the @code{gensub}, @code{sub} and @code{gsub} functions, and as the
5840second argument of the @code{match} function
5841(@pxref{String Functions, ,Built-in Functions for String Manipulation}).
5842Modern implementations of @code{awk}, including @code{gawk}, allow
5843the third argument of @code{split} to be a regexp constant, while some
5844older implementations do not (d.c.).
5845
5846This can lead to confusion when attempting to use regexp constants
5847as arguments to user defined functions
5848(@pxref{User-defined, , User-defined Functions}).
5849For example:
5850
5851@example
5852@group
5853function mysub(pat, repl, str, global)
5854@{
5855    if (global)
5856        gsub(pat, repl, str)
5857    else
5858        sub(pat, repl, str)
5859    return str
5860@}
5861@end group
5862
5863@group
5864@{
5865    @dots{}
5866    text = "hi! hi yourself!"
5867    mysub(/hi/, "howdy", text, 1)
5868    @dots{}
5869@}
5870@end group
5871@end example
5872
5873In this example, the programmer wishes to pass a regexp constant to the
5874user-defined function @code{mysub}, which will in turn pass it on to
5875either @code{sub} or @code{gsub}.  However, what really happens is that
5876the @code{pat} parameter will be either one or zero, depending upon whether
5877or not @code{$0} matches @code{/hi/}.
5878
5879As it is unlikely that you would ever really wish to pass a truth value
5880in this way, @code{gawk} will issue a warning when it sees a regexp
5881constant used as a parameter to a user-defined function.
5882
5883@node Variables, Conversion, Using Constant Regexps, Expressions
5884@section Variables
5885
5886Variables are ways of storing values at one point in your program for
5887use later in another part of your program.  You can manipulate them
5888entirely within your program text, and you can also assign values to
5889them on the @code{awk} command line.
5890
5891@menu
5892* Using Variables::             Using variables in your programs.
5893* Assignment Options::          Setting variables on the command line and a
5894                                summary of command line syntax. This is an
5895                                advanced method of input.
5896@end menu
5897
5898@node Using Variables, Assignment Options, Variables, Variables
5899@subsection Using Variables in a Program
5900
5901@cindex variables, user-defined
5902@cindex user-defined variables
5903Variables let you give names to values and refer to them later.  You have
5904already seen variables in many of the examples.  The name of a variable
5905must be a sequence of letters, digits and underscores, but it may not begin
5906with a digit.  Case is significant in variable names; @code{a} and @code{A}
5907are distinct variables.
5908
5909A variable name is a valid expression by itself; it represents the
5910variable's current value.  Variables are given new values with
5911@dfn{assignment operators}, @dfn{increment operators} and
5912@dfn{decrement operators}.
5913@xref{Assignment Ops, ,Assignment Expressions}.
5914
5915A few variables have special built-in meanings, such as @code{FS}, the
5916field separator, and @code{NF}, the number of fields in the current
5917input record.  @xref{Built-in Variables}, for a list of them.  These
5918built-in variables can be used and assigned just like all other
5919variables, but their values are also used or changed automatically by
5920@code{awk}.  All built-in variables names are entirely upper-case.
5921
5922Variables in @code{awk} can be assigned either numeric or string
5923values.  By default, variables are initialized to the empty string, which
5924is zero if converted to a number.  There is no need to
5925``initialize'' each variable explicitly in @code{awk},
5926the way you would in C and in most other traditional languages.
5927
5928@node Assignment Options,  , Using Variables, Variables
5929@subsection Assigning Variables on the Command Line
5930
5931You can set any @code{awk} variable by including a @dfn{variable assignment}
5932among the arguments on the command line when you invoke @code{awk}
5933(@pxref{Other Arguments, ,Other Command Line Arguments}).  Such an assignment has
5934this form:
5935
5936@example
5937@var{variable}=@var{text}
5938@end example
5939
5940@noindent
5941With it, you can set a variable either at the beginning of the
5942@code{awk} run or in between input files.
5943
5944If you precede the assignment with the @samp{-v} option, like this:
5945
5946@example
5947-v @var{variable}=@var{text}
5948@end example
5949
5950@noindent
5951then the variable is set at the very beginning, before even the
5952@code{BEGIN} rules are run.  The @samp{-v} option and its assignment
5953must precede all the file name arguments, as well as the program text.
5954(@xref{Options, ,Command Line Options}, for more information about
5955the @samp{-v} option.)
5956
5957Otherwise, the variable assignment is performed at a time determined by
5958its position among the input file arguments: after the processing of the
5959preceding input file argument.  For example:
5960
5961@example
5962awk '@{ print $n @}' n=4 inventory-shipped n=2 BBS-list
5963@end example
5964
5965@noindent
5966prints the value of field number @code{n} for all input records.  Before
5967the first file is read, the command line sets the variable @code{n}
5968equal to four.  This causes the fourth field to be printed in lines from
5969the file @file{inventory-shipped}.  After the first file has finished,
5970but before the second file is started, @code{n} is set to two, so that the
5971second field is printed in lines from @file{BBS-list}.
5972
5973@example
5974@group
5975$ awk '@{ print $n @}' n=4 inventory-shipped n=2 BBS-list
5976@print{} 15
5977@print{} 24
5978@dots{}
5979@print{} 555-5553
5980@print{} 555-3412
5981@dots{}
5982@end group
5983@end example
5984
5985Command line arguments are made available for explicit examination by
5986the @code{awk} program in an array named @code{ARGV}
5987(@pxref{ARGC and ARGV, ,Using @code{ARGC} and @code{ARGV}}).
5988
5989@cindex dark corner
5990@code{awk} processes the values of command line assignments for escape
5991sequences (d.c.) (@pxref{Escape Sequences}).
5992
5993@node Conversion, Arithmetic Ops, Variables, Expressions
5994@section Conversion of Strings and Numbers
5995
5996@cindex conversion of strings and numbers
5997Strings are converted to numbers, and numbers to strings, if the context
5998of the @code{awk} program demands it.  For example, if the value of
5999either @code{foo} or @code{bar} in the expression @samp{foo + bar}
6000happens to be a string, it is converted to a number before the addition
6001is performed.  If numeric values appear in string concatenation, they
6002are converted to strings.  Consider this:
6003
6004@example
6005two = 2; three = 3
6006print (two three) + 4
6007@end example
6008
6009@noindent
6010This prints the (numeric) value 27.  The numeric values of
6011the variables @code{two} and @code{three} are converted to strings and
6012concatenated together, and the resulting string is converted back to the
6013number 23, to which four is then added.
6014
6015@cindex null string
6016@cindex empty string
6017@cindex type conversion
6018If, for some reason, you need to force a number to be converted to a
6019string, concatenate the empty string, @code{""}, with that number.
6020To force a string to be converted to a number, add zero to that string.
6021
6022A string is converted to a number by interpreting any numeric prefix
6023of the string as numerals:
6024@code{"2.5"} converts to 2.5, @code{"1e3"} converts to 1000, and @code{"25fix"}
6025has a numeric value of 25.
6026Strings that can't be interpreted as valid numbers are converted to
6027zero.
6028
6029@vindex CONVFMT
6030The exact manner in which numbers are converted into strings is controlled
6031by the @code{awk} built-in variable @code{CONVFMT} (@pxref{Built-in Variables}).
6032Numbers are converted using the @code{sprintf} function
6033(@pxref{String Functions, ,Built-in Functions for String Manipulation})
6034with @code{CONVFMT} as the format
6035specifier.
6036
6037@code{CONVFMT}'s default value is @code{"%.6g"}, which prints a value with
6038at least six significant digits.  For some applications you will want to
6039change it to specify more precision.  On most modern machines, you must
6040print 17 digits to capture a floating point number's value exactly.
6041
6042Strange results can happen if you set @code{CONVFMT} to a string that doesn't
6043tell @code{sprintf} how to format floating point numbers in a useful way.
6044For example, if you forget the @samp{%} in the format, all numbers will be
6045converted to the same constant string.
6046
6047@cindex dark corner
6048As a special case, if a number is an integer, then the result of converting
6049it to a string is @emph{always} an integer, no matter what the value of
6050@code{CONVFMT} may be.  Given the following code fragment:
6051
6052@example
6053CONVFMT = "%2.2f"
6054a = 12
6055b = a ""
6056@end example
6057
6058@noindent
6059@code{b} has the value @code{"12"}, not @code{"12.00"} (d.c.).
6060
6061@cindex @code{awk} language, POSIX version
6062@cindex POSIX @code{awk}
6063@vindex OFMT
6064Prior to the POSIX standard, @code{awk} specified that the value
6065of @code{OFMT} was used for converting numbers to strings.  @code{OFMT}
6066specifies the output format to use when printing numbers with @code{print}.
6067@code{CONVFMT} was introduced in order to separate the semantics of
6068conversion from the semantics of printing.  Both @code{CONVFMT} and
6069@code{OFMT} have the same default value: @code{"%.6g"}.  In the vast majority
6070of cases, old @code{awk} programs will not change their behavior.
6071However, this use of @code{OFMT} is something to keep in mind if you must
6072port your program to other implementations of @code{awk}; we recommend
6073that instead of changing your programs, you just port @code{gawk} itself!
6074@xref{Print, ,The @code{print} Statement},
6075for more information on the @code{print} statement.
6076
6077@node Arithmetic Ops, Concatenation, Conversion, Expressions
6078@section Arithmetic Operators
6079@cindex arithmetic operators
6080@cindex operators, arithmetic
6081@cindex addition
6082@cindex subtraction
6083@cindex multiplication
6084@cindex division
6085@cindex remainder
6086@cindex quotient
6087@cindex exponentiation
6088
6089The @code{awk} language uses the common arithmetic operators when
6090evaluating expressions.  All of these arithmetic operators follow normal
6091precedence rules, and work as you would expect them to.  Arithmetic
6092operations are evaluated using double precision floating point, which
6093has the usual problems of inexactness and exceptions.@footnote{David
6094Goldberg, @uref{http://www.validgh.com/goldberg/paper.ps, @cite{What Every
6095Computer Scientist Should Know About Floating-point Arithmetic}},
6096@cite{ACM Computing Surveys} @strong{23}, 1 (1991-03), 5-48.}
6097
6098Here is a file @file{grades} containing a list of student names and
6099three test scores per student (it's a small class):
6100
6101@example
6102Pat   100 97 58
6103Sandy  84 72 93
6104Chris  72 92 89
6105@end example
6106
6107@noindent
6108This programs takes the file @file{grades}, and prints the average
6109of the scores.
6110
6111@example
6112$ awk '@{ sum = $2 + $3 + $4 ; avg = sum / 3
6113>        print $1, avg @}' grades
6114@print{} Pat 85
6115@print{} Sandy 83
6116@print{} Chris 84.3333
6117@end example
6118
6119This table lists the arithmetic operators in @code{awk}, in order from
6120highest precedence to lowest:
6121
6122@c @cartouche
6123@table @code
6124@item - @var{x}
6125Negation.
6126
6127@item + @var{x}
6128Unary plus.  The expression is converted to a number.
6129
6130@cindex @code{awk} language, POSIX version
6131@cindex POSIX @code{awk}
6132@item @var{x} ^ @var{y}
6133@itemx @var{x} ** @var{y}
6134Exponentiation: @var{x} raised to the @var{y} power.  @samp{2 ^ 3} has
6135the value eight.  The character sequence @samp{**} is equivalent to
6136@samp{^}.  (The POSIX standard only specifies the use of @samp{^}
6137for exponentiation.)
6138
6139@item @var{x} * @var{y}
6140Multiplication.
6141
6142@item @var{x} / @var{y}
6143Division.  Since all numbers in @code{awk} are
6144floating point numbers, the result is not rounded to an integer: @samp{3 / 4}
6145has the value 0.75.
6146
6147@item @var{x} % @var{y}
6148@cindex differences between @code{gawk} and @code{awk}
6149Remainder.  The quotient is rounded toward zero to an integer,
6150multiplied by @var{y} and this result is subtracted from @var{x}.
6151This operation is sometimes known as ``trunc-mod.''  The following
6152relation always holds:
6153
6154@example
6155b * int(a / b) + (a % b) == a
6156@end example
6157
6158One possibly undesirable effect of this definition of remainder is that
6159@code{@var{x} % @var{y}} is negative if @var{x} is negative.  Thus,
6160
6161@example
6162-17 % 8 = -1
6163@end example
6164
6165In other @code{awk} implementations, the signedness of the remainder
6166may be machine dependent.
6167@c !!! what does posix say?
6168
6169@item @var{x} + @var{y}
6170Addition.
6171
6172@item @var{x} - @var{y}
6173Subtraction.
6174@end table
6175@c @end cartouche
6176
6177For maximum portability, do not use the @samp{**} operator.
6178
6179Unary plus and minus have the same precedence,
6180the multiplication operators all have the same precedence, and
6181addition and subtraction have the same precedence.
6182
6183@node Concatenation, Assignment Ops, Arithmetic Ops, Expressions
6184@section String Concatenation
6185@cindex Kernighan, Brian
6186@display
6187@i{It seemed like a good idea at the time.}
6188Brian Kernighan
6189@end display
6190@sp 1
6191
6192@cindex string operators
6193@cindex operators, string
6194@cindex concatenation
6195There is only one string operation: concatenation.  It does not have a
6196specific operator to represent it.  Instead, concatenation is performed by
6197writing expressions next to one another, with no operator.  For example:
6198
6199@example
6200@group
6201$ awk '@{ print "Field number one: " $1 @}' BBS-list
6202@print{} Field number one: aardvark
6203@print{} Field number one: alpo-net
6204@dots{}
6205@end group
6206@end example
6207
6208Without the space in the string constant after the @samp{:}, the line
6209would run together.  For example:
6210
6211@example
6212@group
6213$ awk '@{ print "Field number one:" $1 @}' BBS-list
6214@print{} Field number one:aardvark
6215@print{} Field number one:alpo-net
6216@dots{}
6217@end group
6218@end example
6219
6220Since string concatenation does not have an explicit operator, it is
6221often necessary to insure that it happens where you want it to by
6222using parentheses to enclose
6223the items to be concatenated.  For example, the
6224following code fragment does not concatenate @code{file} and @code{name}
6225as you might expect:
6226
6227@example
6228@group
6229file = "file"
6230name = "name"
6231print "something meaningful" > file name
6232@end group
6233@end example
6234
6235@noindent
6236It is necessary to use the following:
6237
6238@example
6239print "something meaningful" > (file name)
6240@end example
6241
6242We recommend that you use parentheses around concatenation in all but the
6243most common contexts (such as on the right-hand side of @samp{=}).
6244
6245@node Assignment Ops, Increment Ops, Concatenation, Expressions
6246@section Assignment Expressions
6247@cindex assignment operators
6248@cindex operators, assignment
6249@cindex expression, assignment
6250
6251An @dfn{assignment} is an expression that stores a new value into a
6252variable.  For example, let's assign the value one to the variable
6253@code{z}:
6254
6255@example
6256z = 1
6257@end example
6258
6259After this expression is executed, the variable @code{z} has the value one.
6260Whatever old value @code{z} had before the assignment is forgotten.
6261
6262Assignments can store string values also.  For example, this would store
6263the value @code{"this food is good"} in the variable @code{message}:
6264
6265@example
6266thing = "food"
6267predicate = "good"
6268message = "this " thing " is " predicate
6269@end example
6270
6271@noindent
6272(This also illustrates string concatenation.)
6273
6274The @samp{=} sign is called an @dfn{assignment operator}.  It is the
6275simplest assignment operator because the value of the right-hand
6276operand is stored unchanged.
6277
6278@cindex side effect
6279Most operators (addition, concatenation, and so on) have no effect
6280except to compute a value.  If you ignore the value, you might as well
6281not use the operator.  An assignment operator is different; it does
6282produce a value, but even if you ignore the value, the assignment still
6283makes itself felt through the alteration of the variable.  We call this
6284a @dfn{side effect}.
6285
6286@cindex lvalue
6287@cindex rvalue
6288The left-hand operand of an assignment need not be a variable
6289(@pxref{Variables}); it can also be a field
6290(@pxref{Changing Fields, ,Changing the Contents of a Field}) or
6291an array element (@pxref{Arrays, ,Arrays in @code{awk}}).
6292These are all called @dfn{lvalues},
6293which means they can appear on the left-hand side of an assignment operator.
6294The right-hand operand may be any expression; it produces the new value
6295which the assignment stores in the specified variable, field or array
6296element. (Such values are called @dfn{rvalues}).
6297
6298@cindex types of variables
6299It is important to note that variables do @emph{not} have permanent types.
6300The type of a variable is simply the type of whatever value it happens
6301to hold at the moment.  In the following program fragment, the variable
6302@code{foo} has a numeric value at first, and a string value later on:
6303
6304@example
6305@group
6306foo = 1
6307print foo
6308foo = "bar"
6309print foo
6310@end group
6311@end example
6312
6313@noindent
6314When the second assignment gives @code{foo} a string value, the fact that
6315it previously had a numeric value is forgotten.
6316
6317String values that do not begin with a digit have a numeric value of
6318zero. After executing this code, the value of @code{foo} is five:
6319
6320@example
6321foo = "a string"
6322foo = foo + 5
6323@end example
6324
6325@noindent
6326(Note that using a variable as a number and then later as a string can
6327be confusing and is poor programming style.  The above examples illustrate how
6328@code{awk} works, @emph{not} how you should write your own programs!)
6329
6330An assignment is an expression, so it has a value: the same value that
6331is assigned.  Thus, @samp{z = 1} as an expression has the value one.
6332One consequence of this is that you can write multiple assignments together:
6333
6334@example
6335x = y = z = 0
6336@end example
6337
6338@noindent
6339stores the value zero in all three variables.  It does this because the
6340value of @samp{z = 0}, which is zero, is stored into @code{y}, and then
6341the value of @samp{y = z = 0}, which is zero, is stored into @code{x}.
6342
6343You can use an assignment anywhere an expression is called for.  For
6344example, it is valid to write @samp{x != (y = 1)} to set @code{y} to one
6345and then test whether @code{x} equals one.  But this style tends to make
6346programs hard to read; except in a one-shot program, you should
6347not use such nesting of assignments.
6348
6349Aside from @samp{=}, there are several other assignment operators that
6350do arithmetic with the old value of the variable.  For example, the
6351operator @samp{+=} computes a new value by adding the right-hand value
6352to the old value of the variable.  Thus, the following assignment adds
6353five to the value of @code{foo}:
6354
6355@example
6356foo += 5
6357@end example
6358
6359@noindent
6360This is equivalent to the following:
6361
6362@example
6363foo = foo + 5
6364@end example
6365
6366@noindent
6367Use whichever one makes the meaning of your program clearer.
6368
6369There are situations where using @samp{+=} (or any assignment operator)
6370is @emph{not} the same as simply repeating the left-hand operand in the
6371right-hand expression.  For example:
6372
6373@cindex Rankin, Pat
6374@example
6375@group
6376# Thanks to Pat Rankin for this example
6377BEGIN  @{
6378    foo[rand()] += 5
6379    for (x in foo)
6380       print x, foo[x]
6381
6382    bar[rand()] = bar[rand()] + 5
6383    for (x in bar)
6384       print x, bar[x]
6385@}
6386@end group
6387@end example
6388
6389@noindent
6390The indices of @code{bar} are guaranteed to be different, because
6391@code{rand} will return different values each time it is called.
6392(Arrays and the @code{rand} function haven't been covered yet.
6393@xref{Arrays, ,Arrays in @code{awk}},
6394and see @ref{Numeric Functions, ,Numeric Built-in Functions}, for more information).
6395This example illustrates an important fact about the assignment
6396operators: the left-hand expression is only evaluated @emph{once}.
6397
6398It is also up to the implementation as to which expression is evaluated
6399first, the left-hand one or the right-hand one.
6400Consider this example:
6401
6402@example
6403i = 1
6404a[i += 2] = i + 1
6405@end example
6406
6407@noindent
6408The value of @code{a[3]} could be either two or four.
6409
6410Here is a table of the arithmetic assignment operators.  In each
6411case, the right-hand operand is an expression whose value is converted
6412to a number.
6413
6414@c @cartouche
6415@table @code
6416@item @var{lvalue} += @var{increment}
6417Adds @var{increment} to the value of @var{lvalue} to make the new value
6418of @var{lvalue}.
6419
6420@item @var{lvalue} -= @var{decrement}
6421Subtracts @var{decrement} from the value of @var{lvalue}.
6422
6423@item @var{lvalue} *= @var{coefficient}
6424Multiplies the value of @var{lvalue} by @var{coefficient}.
6425
6426@item @var{lvalue} /= @var{divisor}
6427Divides the value of @var{lvalue} by @var{divisor}.
6428
6429@item @var{lvalue} %= @var{modulus}
6430Sets @var{lvalue} to its remainder by @var{modulus}.
6431
6432@cindex @code{awk} language, POSIX version
6433@cindex POSIX @code{awk}
6434@item @var{lvalue} ^= @var{power}
6435@itemx @var{lvalue} **= @var{power}
6436Raises @var{lvalue} to the power @var{power}.
6437(Only the @samp{^=} operator is specified by POSIX.)
6438@end table
6439@c @end cartouche
6440
6441For maximum portability, do not use the @samp{**=} operator.
6442
6443@node Increment Ops, Truth Values, Assignment Ops, Expressions
6444@section Increment and Decrement Operators
6445
6446@cindex increment operators
6447@cindex operators, increment
6448@dfn{Increment} and @dfn{decrement operators} increase or decrease the value of
6449a variable by one.  You could do the same thing with an assignment operator, so
6450the increment operators add no power to the @code{awk} language; but they
6451are convenient abbreviations for very common operations.
6452
6453The operator to add one is written @samp{++}.  It can be used to increment
6454a variable either before or after taking its value.
6455
6456To pre-increment a variable @var{v}, write @samp{++@var{v}}.  This adds
6457one to the value of @var{v} and that new value is also the value of this
6458expression.  The assignment expression @samp{@var{v} += 1} is completely
6459equivalent.
6460
6461Writing the @samp{++} after the variable specifies post-increment.  This
6462increments the variable value just the same; the difference is that the
6463value of the increment expression itself is the variable's @emph{old}
6464value.  Thus, if @code{foo} has the value four, then the expression @samp{foo++}
6465has the value four, but it changes the value of @code{foo} to five.
6466
6467The post-increment @samp{foo++} is nearly equivalent to writing @samp{(foo
6468+= 1) - 1}.  It is not perfectly equivalent because all numbers in
6469@code{awk} are floating point: in floating point, @samp{foo + 1 - 1} does
6470not necessarily equal @code{foo}.  But the difference is minute as
6471long as you stick to numbers that are fairly small (less than 10e12).
6472
6473Any lvalue can be incremented.  Fields and array elements are incremented
6474just like variables.  (Use @samp{$(i++)} when you wish to do a field reference
6475and a variable increment at the same time.  The parentheses are necessary
6476because of the precedence of the field reference operator, @samp{$}.)
6477
6478@cindex decrement operators
6479@cindex operators, decrement
6480The decrement operator @samp{--} works just like @samp{++} except that
6481it subtracts one instead of adding.  Like @samp{++}, it can be used before
6482the lvalue to pre-decrement or after it to post-decrement.
6483
6484Here is a summary of increment and decrement expressions.
6485
6486@c @cartouche
6487@table @code
6488@item ++@var{lvalue}
6489This expression increments @var{lvalue} and the new value becomes the
6490value of the expression.
6491
6492@item @var{lvalue}++
6493This expression increments @var{lvalue}, but
6494the value of the expression is the @emph{old} value of @var{lvalue}.
6495
6496@item --@var{lvalue}
6497Like @samp{++@var{lvalue}}, but instead of adding, it subtracts.  It
6498decrements @var{lvalue} and delivers the value that results.
6499
6500@item @var{lvalue}--
6501Like @samp{@var{lvalue}++}, but instead of adding, it subtracts.  It
6502decrements @var{lvalue}.  The value of the expression is the @emph{old}
6503value of @var{lvalue}.
6504@end table
6505@c @end cartouche
6506
6507@node Truth Values, Typing and Comparison, Increment Ops, Expressions
6508@section True and False in @code{awk}
6509@cindex truth values
6510@cindex logical true
6511@cindex logical false
6512
6513Many programming languages have a special representation for the concepts
6514of ``true'' and ``false.''  Such languages usually use the special
6515constants @code{true} and @code{false}, or perhaps their upper-case
6516equivalents.
6517
6518@cindex null string
6519@cindex empty string
6520@code{awk} is different.  It borrows a very simple concept of true and
6521false from C.  In @code{awk}, any non-zero numeric value, @emph{or} any
6522non-empty string value is true.  Any other value (zero or the null
6523string, @code{""}) is false.  The following program will print @samp{A strange
6524truth value} three times:
6525
6526@example
6527@group
6528BEGIN @{
6529   if (3.1415927)
6530       print "A strange truth value"
6531   if ("Four Score And Seven Years Ago")
6532       print "A strange truth value"
6533   if (j = 57)
6534       print "A strange truth value"
6535@}
6536@end group
6537@end example
6538
6539@cindex dark corner
6540There is a surprising consequence of the ``non-zero or non-null'' rule:
6541The string constant @code{"0"} is actually true, since it is non-null (d.c.).
6542
6543@node Typing and Comparison, Boolean Ops, Truth Values, Expressions
6544@section Variable Typing and Comparison Expressions
6545@cindex comparison expressions
6546@cindex expression, comparison
6547@cindex expression, matching
6548@cindex relational operators
6549@cindex operators, relational
6550@cindex regexp match/non-match operators
6551@cindex variable typing
6552@cindex types of variables
6553@c 2e: consider splitting this section into subsections
6554@display
6555@i{The Guide is definitive. Reality is frequently inaccurate.}
6556The Hitchhiker's Guide to the Galaxy
6557@end display
6558@sp 1
6559
6560Unlike other programming languages, @code{awk} variables do not have a
6561fixed type. Instead, they can be either a number or a string, depending
6562upon the value that is assigned to them.
6563
6564@cindex numeric string
6565The 1992 POSIX standard introduced
6566the concept of a @dfn{numeric string}, which is simply a string that looks
6567like a number, for example, @code{@w{" +2"}}.  This concept is used
6568for determining the type of a variable.
6569
6570The type of the variable is important, since the types of two variables
6571determine how they are compared.
6572
6573In @code{gawk}, variable typing follows these rules.
6574
6575@enumerate 1
6576@item
6577A numeric literal or the result of a numeric operation has the @var{numeric}
6578attribute.
6579
6580@item
6581A string literal or the result of a string operation has the @var{string}
6582attribute.
6583
6584@item
6585Fields, @code{getline} input, @code{FILENAME}, @code{ARGV} elements,
6586@code{ENVIRON} elements and the
6587elements of an array created by @code{split} that are numeric strings
6588have the @var{strnum} attribute.  Otherwise, they have the @var{string}
6589attribute.
6590Uninitialized variables also have the @var{strnum} attribute.
6591
6592@item
6593Attributes propagate across assignments, but are not changed by
6594any use.
6595@c  (Although a use may cause the entity to acquire an additional
6596@c value such that it has both a numeric and string value -- this leaves the
6597@c attribute unchanged.)
6598@c This is important but not relevant
6599@end enumerate
6600
6601The last rule is particularly important. In the following program,
6602@code{a} has numeric type, even though it is later used in a string
6603operation.
6604
6605@example
6606BEGIN @{
6607         a = 12.345
6608         b = a " is a cute number"
6609         print b
6610@}
6611@end example
6612
6613When two operands are compared, either string comparison or numeric comparison
6614may be used, depending on the attributes of the operands, according to the
6615following, symmetric, matrix:
6616
6617@c thanks to Karl Berry, kb@cs.umb.edu, for major help with TeX tables
6618@tex
6619\centerline{
6620\vbox{\bigskip % space above the table (about 1 linespace)
6621% Because we have vertical rules, we can't let TeX insert interline space
6622% in its usual way.
6623\offinterlineskip
6624%
6625% Define the table template. & separates columns, and \cr ends the
6626% template (and each row). # is replaced by the text of that entry on
6627% each row. The template for the first column breaks down like this:
6628%   \strut -- a way to make each line have the height and depth
6629%             of a normal line of type, since we turned off interline spacing.
6630%   \hfil -- infinite glue; has the effect of right-justifying in this case.
6631%   #     -- replaced by the text (for instance, `STRNUM', in the last row).
6632%   \quad -- about the width of an `M'. Just separates the columns.
6633%
6634% The second column (\vrule#) is what generates the vertical rule that
6635% spans table rows.
6636%
6637% The doubled && before the next entry means `repeat the following
6638% template as many times as necessary on each line' -- in our case, twice.
6639%
6640% The template itself, \quad#\hfil, left-justifies with a little space before.
6641%
6642\halign{\strut\hfil#\quad&\vrule#&&\quad#\hfil\cr
6643	&&STRING	&NUMERIC	&STRNUM\cr
6644% The \omit tells TeX to skip inserting the template for this column on
6645% this particular row. In this case, we only want a little extra space
6646% to separate the heading row from the rule below it.  the depth 2pt --
6647% `\vrule depth 2pt' is that little space.
6648\omit	&depth 2pt\cr
6649% This is the horizontal rule below the heading. Since it has nothing to
6650% do with the columns of the table, we use \noalign to get it in there.
6651\noalign{\hrule}
6652% Like above, this time a little more space.
6653\omit	&depth 4pt\cr
6654% The remaining rows have nothing special about them.
6655STRING	&&string	&string		&string\cr
6656NUMERIC	&&string	&numeric	&numeric\cr
6657STRNUM  &&string	&numeric	&numeric\cr
6658}}}
6659@end tex
6660@ifinfo
6661@display
6662	+----------------------------------------------
6663	|	STRING		NUMERIC		STRNUM
6664--------+----------------------------------------------
6665	|
6666STRING	|	string		string		string
6667	|
6668NUMERIC	|	string		numeric		numeric
6669	|
6670STRNUM	|	string		numeric		numeric
6671--------+----------------------------------------------
6672@end display
6673@end ifinfo
6674
6675The basic idea is that user input that looks numeric, and @emph{only}
6676user input, should be treated as numeric, even though it is actually
6677made of characters, and is therefore also a string.
6678
6679@dfn{Comparison expressions} compare strings or numbers for
6680relationships such as equality.  They are written using @dfn{relational
6681operators}, which are a superset of those in C.  Here is a table of
6682them:
6683
6684@cindex relational operators
6685@cindex operators, relational
6686@cindex @code{<} operator
6687@cindex @code{<=} operator
6688@cindex @code{>} operator
6689@cindex @code{>=} operator
6690@cindex @code{==} operator
6691@cindex @code{!=} operator
6692@cindex @code{~} operator
6693@cindex @code{!~} operator
6694@cindex @code{in} operator
6695@c @cartouche
6696@table @code
6697@item @var{x} < @var{y}
6698True if @var{x} is less than @var{y}.
6699
6700@item @var{x} <= @var{y}
6701True if @var{x} is less than or equal to @var{y}.
6702
6703@item @var{x} > @var{y}
6704True if @var{x} is greater than @var{y}.
6705
6706@item @var{x} >= @var{y}
6707True if @var{x} is greater than or equal to @var{y}.
6708
6709@item @var{x} == @var{y}
6710True if @var{x} is equal to @var{y}.
6711
6712@item @var{x} != @var{y}
6713True if @var{x} is not equal to @var{y}.
6714
6715@item @var{x} ~ @var{y}
6716True if the string @var{x} matches the regexp denoted by @var{y}.
6717
6718@item @var{x} !~ @var{y}
6719True if the string @var{x} does not match the regexp denoted by @var{y}.
6720
6721@item @var{subscript} in @var{array}
6722True if the array @var{array} has an element with the subscript @var{subscript}.
6723@end table
6724@c @end cartouche
6725
6726Comparison expressions have the value one if true and zero if false.
6727
6728When comparing operands of mixed types, numeric operands are converted
6729to strings using the value of @code{CONVFMT}
6730(@pxref{Conversion, ,Conversion of Strings and Numbers}).
6731
6732Strings are compared
6733by comparing the first character of each, then the second character of each,
6734and so on.  Thus @code{"10"} is less than @code{"9"}.  If there are two
6735strings where one is a prefix of the other, the shorter string is less than
6736the longer one.  Thus @code{"abc"} is less than @code{"abcd"}.
6737
6738@cindex common mistakes
6739@cindex mistakes, common
6740@cindex errors, common
6741It is very easy to accidentally mistype the @samp{==} operator, and
6742leave off one of the @samp{=}s.  The result is still valid @code{awk}
6743code, but the program will not do what you mean:
6744
6745@example
6746if (a = b)   # oops! should be a == b
6747   @dots{}
6748else
6749   @dots{}
6750@end example
6751
6752@noindent
6753Unless @code{b} happens to be zero or the null string, the @code{if}
6754part of the test will always succeed.  Because the operators are
6755so similar, this kind of error is very difficult to spot when
6756scanning the source code.
6757
6758Here are some sample expressions, how @code{gawk} compares them, and what
6759the result of the comparison is.
6760
6761@table @code
6762@item 1.5 <= 2.0
6763numeric comparison (true)
6764
6765@item "abc" >= "xyz"
6766string comparison (false)
6767
6768@item 1.5 != " +2"
6769string comparison (true)
6770
6771@item "1e2" < "3"
6772string comparison (true)
6773
6774@item a = 2; b = "2"
6775@itemx a == b
6776string comparison (true)
6777
6778@item a = 2; b = " +2"
6779@itemx a == b
6780string comparison (false)
6781@end table
6782
6783In this example,
6784
6785@example
6786@group
6787$ echo 1e2 3 | awk '@{ print ($1 < $2) ? "true" : "false" @}'
6788@print{} false
6789@end group
6790@end example
6791
6792@noindent
6793the result is @samp{false} since both @code{$1} and @code{$2} are numeric
6794strings and thus both have the @var{strnum} attribute,
6795dictating a numeric comparison.
6796
6797The purpose of the comparison rules and the use of numeric strings is
6798to attempt to produce the behavior that is ``least surprising,'' while
6799still ``doing the right thing.''
6800
6801@cindex comparisons, string vs. regexp
6802@cindex string comparison vs. regexp comparison
6803@cindex regexp comparison vs. string comparison
6804String comparisons and regular expression comparisons are very different.
6805For example,
6806
6807@example
6808x == "foo"
6809@end example
6810
6811@noindent
6812has the value of one, or is true, if the variable @code{x}
6813is precisely @samp{foo}.  By contrast,
6814
6815@example
6816x ~ /foo/
6817@end example
6818
6819@noindent
6820has the value one if @code{x} contains @samp{foo}, such as
6821@code{"Oh, what a fool am I!"}.
6822
6823The right hand operand of the @samp{~} and @samp{!~} operators may be
6824either a regexp constant (@code{/@dots{}/}), or an ordinary
6825expression, in which case the value of the expression as a string is used as a
6826dynamic regexp (@pxref{Regexp Usage, ,How to Use Regular Expressions}; also
6827@pxref{Computed Regexps, ,Using Dynamic Regexps}).
6828
6829@cindex regexp as expression
6830In recent implementations of @code{awk}, a constant regular
6831expression in slashes by itself is also an expression.  The regexp
6832@code{/@var{regexp}/} is an abbreviation for this comparison expression:
6833
6834@example
6835$0 ~ /@var{regexp}/
6836@end example
6837
6838One special place where @code{/foo/} is @emph{not} an abbreviation for
6839@samp{$0 ~ /foo/} is when it is the right-hand operand of @samp{~} or
6840@samp{!~}!
6841@xref{Using Constant Regexps, ,Using Regular Expression Constants},
6842where this is discussed in more detail.
6843
6844@c This paragraph has been here since day 1, and has always bothered
6845@c me, especially since the expression doesn't really make a lot of
6846@c sense. So, just take it out.
6847@ignore
6848In some contexts it may be necessary to write parentheses around the
6849regexp to avoid confusing the @code{gawk} parser.  For example,
6850@samp{(/x/ - /y/) > threshold} is not allowed, but @samp{((/x/) - (/y/))
6851> threshold} parses properly.
6852@end ignore
6853
6854@node Boolean Ops, Conditional Exp, Typing and Comparison, Expressions
6855@section Boolean Expressions
6856@cindex expression, boolean
6857@cindex boolean expressions
6858@cindex operators, boolean
6859@cindex boolean operators
6860@cindex logical operations
6861@cindex operations, logical
6862@cindex short-circuit operators
6863@cindex operators, short-circuit
6864@cindex and operator
6865@cindex or operator
6866@cindex not operator
6867@cindex @code{&&} operator
6868@cindex @code{||} operator
6869@cindex @code{!} operator
6870
6871A @dfn{boolean expression} is a combination of comparison expressions or
6872matching expressions, using the boolean operators ``or''
6873(@samp{||}), ``and'' (@samp{&&}), and ``not'' (@samp{!}), along with
6874parentheses to control nesting.  The truth value of the boolean expression is
6875computed by combining the truth values of the component expressions.
6876Boolean expressions are also referred to as @dfn{logical expressions}.
6877The terms are equivalent.
6878
6879Boolean expressions can be used wherever comparison and matching
6880expressions can be used.  They can be used in @code{if}, @code{while},
6881@code{do} and @code{for} statements
6882(@pxref{Statements, ,Control Statements in Actions}).
6883They have numeric values (one if true, zero if false), which come into play
6884if the result of the boolean expression is stored in a variable, or
6885used in arithmetic.
6886
6887In addition, every boolean expression is also a valid pattern, so
6888you can use one as a pattern to control the execution of rules.
6889
6890Here are descriptions of the three boolean operators, with examples.
6891
6892@c @cartouche
6893@table @code
6894@item @var{boolean1} && @var{boolean2}
6895True if both @var{boolean1} and @var{boolean2} are true.  For example,
6896the following statement prints the current input record if it contains
6897both @samp{2400} and @samp{foo}.
6898
6899@example
6900if ($0 ~ /2400/ && $0 ~ /foo/) print
6901@end example
6902
6903The subexpression @var{boolean2} is evaluated only if @var{boolean1}
6904is true.  This can make a difference when @var{boolean2} contains
6905expressions that have side effects: in the case of @samp{$0 ~ /foo/ &&
6906($2 == bar++)}, the variable @code{bar} is not incremented if there is
6907no @samp{foo} in the record.
6908
6909@item @var{boolean1} || @var{boolean2}
6910True if at least one of @var{boolean1} or @var{boolean2} is true.
6911For example, the following statement prints all records in the input
6912that contain @emph{either} @samp{2400} or
6913@samp{foo}, or both.
6914
6915@example
6916if ($0 ~ /2400/ || $0 ~ /foo/) print
6917@end example
6918
6919The subexpression @var{boolean2} is evaluated only if @var{boolean1}
6920is false.  This can make a difference when @var{boolean2} contains
6921expressions that have side effects.
6922
6923@item ! @var{boolean}
6924True if @var{boolean} is false.  For example, the following program prints
6925all records in the input file @file{BBS-list} that do @emph{not} contain the
6926string @samp{foo}.
6927
6928@c A better example would be `if (! (subscript in array)) ...' but we
6929@c haven't done anything with arrays or `in' yet. Sigh.
6930@example
6931awk '@{ if (! ($0 ~ /foo/)) print @}' BBS-list
6932@end example
6933@end table
6934@c @end cartouche
6935
6936The @samp{&&} and @samp{||} operators are called @dfn{short-circuit}
6937operators because of the way they work.  Evaluation of the full expression
6938is ``short-circuited'' if the result can be determined part way through
6939its evaluation.
6940
6941@cindex line continuation
6942You can continue a statement that uses @samp{&&} or @samp{||} simply
6943by putting a newline after them.  But you cannot put a newline in front
6944of either of these operators without using backslash continuation
6945(@pxref{Statements/Lines, ,@code{awk} Statements Versus Lines}).
6946
6947The actual value of an expression using the @samp{!} operator will be
6948either one or zero, depending upon the truth value of the expression it
6949is applied to.
6950
6951The @samp{!} operator is often useful for changing the sense of a flag
6952variable from false to true and back again. For example, the following
6953program is one way to print lines in between special bracketing lines:
6954
6955@example
6956$1 == "START"   @{ interested = ! interested @}
6957interested == 1 @{ print @}
6958$1 == "END"     @{ interested = ! interested @}
6959@end example
6960
6961@noindent
6962The variable @code{interested}, like all @code{awk} variables, starts
6963out initialized to zero, which is also false.  When a line is seen whose
6964first field is @samp{START}, the value of @code{interested} is toggled
6965to true, using @samp{!}. The next rule prints lines as long as
6966@code{interested} is true.  When a line is seen whose first field is
6967@samp{END}, @code{interested} is toggled back to false.
6968@ignore
6969We should discuss using `next' in the two rules that toggle the
6970variable, to avoid printing the bracketing lines, but that's more
6971distraction than really needed.
6972@end ignore
6973
6974@node Conditional Exp, Function Calls, Boolean Ops, Expressions
6975@section Conditional Expressions
6976@cindex conditional expression
6977@cindex expression, conditional
6978
6979A @dfn{conditional expression} is a special kind of expression with
6980three operands.  It allows you to use one expression's value to select
6981one of two other expressions.
6982
6983The conditional expression is the same as in the C language:
6984
6985@example
6986@var{selector} ? @var{if-true-exp} : @var{if-false-exp}
6987@end example
6988
6989@noindent
6990There are three subexpressions.  The first, @var{selector}, is always
6991computed first.  If it is ``true'' (not zero and not null) then
6992@var{if-true-exp} is computed next and its value becomes the value of
6993the whole expression.  Otherwise, @var{if-false-exp} is computed next
6994and its value becomes the value of the whole expression.
6995
6996For example, this expression produces the absolute value of @code{x}:
6997
6998@example
6999x > 0 ? x : -x
7000@end example
7001
7002Each time the conditional expression is computed, exactly one of
7003@var{if-true-exp} and @var{if-false-exp} is used; the other is ignored.
7004This is important when the expressions have side effects.  For example,
7005this conditional expression examines element @code{i} of either array
7006@code{a} or array @code{b}, and increments @code{i}.
7007
7008@example
7009x == y ? a[i++] : b[i++]
7010@end example
7011
7012@noindent
7013This is guaranteed to increment @code{i} exactly once, because each time
7014only one of the two increment expressions is executed,
7015and the other is not.
7016@xref{Arrays, ,Arrays in @code{awk}},
7017for more information about arrays.
7018
7019@cindex differences between @code{gawk} and @code{awk}
7020@cindex line continuation
7021As a minor @code{gawk} extension,
7022you can continue a statement that uses @samp{?:} simply
7023by putting a newline after either character.
7024However, you cannot put a newline in front
7025of either character without using backslash continuation
7026(@pxref{Statements/Lines, ,@code{awk} Statements Versus Lines}).
7027If @samp{--posix} is specified
7028(@pxref{Options, , Command Line Options}), then this extension is disabled.
7029
7030@node Function Calls, Precedence, Conditional Exp, Expressions
7031@section Function Calls
7032@cindex function call
7033@cindex calling a function
7034
7035A @dfn{function} is a name for a particular calculation.  Because it has
7036a name, you can ask for it by name at any point in the program.  For
7037example, the function @code{sqrt} computes the square root of a number.
7038
7039A fixed set of functions are @dfn{built-in}, which means they are
7040available in every @code{awk} program.  The @code{sqrt} function is one
7041of these.  @xref{Built-in, ,Built-in Functions}, for a list of built-in
7042functions and their descriptions.  In addition, you can define your own
7043functions for use in your program.
7044@xref{User-defined, ,User-defined Functions}, for how to do this.
7045
7046@cindex arguments in function call
7047The way to use a function is with a @dfn{function call} expression,
7048which consists of the function name followed immediately by a list of
7049@dfn{arguments} in parentheses.  The arguments are expressions which
7050provide the raw materials for the function's calculations.
7051When there is more than one argument, they are separated by commas.  If
7052there are no arguments, write just @samp{()} after the function name.
7053Here are some examples:
7054
7055@example
7056sqrt(x^2 + y^2)        @i{one argument}
7057atan2(y, x)            @i{two arguments}
7058rand()                 @i{no arguments}
7059@end example
7060
7061@strong{Do not put any space between the function name and the
7062open-parenthesis!}  A user-defined function name looks just like the name of
7063a variable, and space would make the expression look like concatenation
7064of a variable with an expression inside parentheses.  Space before the
7065parenthesis is harmless with built-in functions, but it is best not to get
7066into the habit of using space to avoid mistakes with user-defined
7067functions.
7068
7069Each function expects a particular number of arguments.  For example, the
7070@code{sqrt} function must be called with a single argument, the number
7071to take the square root of:
7072
7073@example
7074sqrt(@var{argument})
7075@end example
7076
7077Some of the built-in functions allow you to omit the final argument.
7078If you do so, they use a reasonable default.
7079@xref{Built-in, ,Built-in Functions}, for full details.  If arguments
7080are omitted in calls to user-defined functions, then those arguments are
7081treated as local variables, initialized to the empty string
7082(@pxref{User-defined, ,User-defined Functions}).
7083
7084Like every other expression, the function call has a value, which is
7085computed by the function based on the arguments you give it.  In this
7086example, the value of @samp{sqrt(@var{argument})} is the square root of
7087@var{argument}.  A function can also have side effects, such as assigning
7088values to certain variables or doing I/O.
7089
7090Here is a command to read numbers, one number per line, and print the
7091square root of each one:
7092
7093@example
7094@group
7095$ awk '@{ print "The square root of", $1, "is", sqrt($1) @}'
70961
7097@print{} The square root of 1 is 1
70983
7099@print{} The square root of 3 is 1.73205
71005
7101@print{} The square root of 5 is 2.23607
7102@kbd{Control-d}
7103@end group
7104@end example
7105
7106@node Precedence,  , Function Calls, Expressions
7107@section Operator Precedence (How Operators Nest)
7108@cindex precedence
7109@cindex operator precedence
7110
7111@dfn{Operator precedence} determines how operators are grouped, when
7112different operators appear close by in one expression.  For example,
7113@samp{*} has higher precedence than @samp{+}; thus, @samp{a + b * c}
7114means to multiply @code{b} and @code{c}, and then add @code{a} to the
7115product (i.e.@: @samp{a + (b * c)}).
7116
7117You can overrule the precedence of the operators by using parentheses.
7118You can think of the precedence rules as saying where the
7119parentheses are assumed to be if you do not write parentheses yourself.  In
7120fact, it is wise to always use parentheses whenever you have an unusual
7121combination of operators, because other people who read the program may
7122not remember what the precedence is in this case.  You might forget,
7123too; then you could make a mistake.  Explicit parentheses will help prevent
7124any such mistake.
7125
7126When operators of equal precedence are used together, the leftmost
7127operator groups first, except for the assignment, conditional and
7128exponentiation operators, which group in the opposite order.
7129Thus, @samp{a - b + c} groups as @samp{(a - b) + c}, and
7130@samp{a = b = c} groups as @samp{a = (b = c)}.
7131
7132The precedence of prefix unary operators does not matter as long as only
7133unary operators are involved, because there is only one way to interpret
7134them---innermost first.  Thus, @samp{$++i} means @samp{$(++i)} and
7135@samp{++$x} means @samp{++($x)}.  However, when another operator follows
7136the operand, then the precedence of the unary operators can matter.
7137Thus, @samp{$x^2} means @samp{($x)^2}, but @samp{-x^2} means
7138@samp{-(x^2)}, because @samp{-} has lower precedence than @samp{^}
7139while @samp{$} has higher precedence.
7140
7141Here is a table of @code{awk}'s operators, in order from highest
7142precedence to lowest:
7143
7144@c use @code in the items, looks better in TeX w/o all the quotes
7145@table @code
7146@item (@dots{})
7147Grouping.
7148
7149@item $
7150Field.
7151
7152@item ++ --
7153Increment, decrement.
7154
7155@cindex @code{awk} language, POSIX version
7156@cindex POSIX @code{awk}
7157@item ^ **
7158Exponentiation.  These operators group right-to-left.
7159(The @samp{**} operator is not specified by POSIX.)
7160
7161@item + - !
7162Unary plus, minus, logical ``not''.
7163
7164@item * / %
7165Multiplication, division, modulus.
7166
7167@item + -
7168Addition, subtraction.
7169
7170@item @r{Concatenation}
7171No special token is used to indicate concatenation.
7172The operands are simply written side by side.
7173
7174@item < <= == !=
7175@itemx > >= >> |
7176Relational, and redirection.
7177The relational operators and the redirections have the same precedence
7178level.  Characters such as @samp{>} serve both as relationals and as
7179redirections; the context distinguishes between the two meanings.
7180
7181Note that the I/O redirection operators in @code{print} and @code{printf}
7182statements belong to the statement level, not to expressions.  The
7183redirection does not produce an expression which could be the operand of
7184another operator.  As a result, it does not make sense to use a
7185redirection operator near another operator of lower precedence, without
7186parentheses.  Such combinations, for example @samp{print foo > a ? b : c},
7187result in syntax errors.
7188The correct way to write this statement is @samp{print foo > (a ? b : c)}.
7189
7190@item ~ !~
7191Matching, non-matching.
7192
7193@item in
7194Array membership.
7195
7196@item &&
7197Logical ``and''.
7198
7199@item ||
7200Logical ``or''.
7201
7202@item ?:
7203Conditional.  This operator groups right-to-left.
7204
7205@cindex @code{awk} language, POSIX version
7206@cindex POSIX @code{awk}
7207@item = += -= *=
7208@itemx /= %= ^= **=
7209Assignment.  These operators group right-to-left.
7210(The @samp{**=} operator is not specified by POSIX.)
7211@end table
7212
7213@node Patterns and Actions, Statements, Expressions, Top
7214@chapter Patterns and Actions
7215@cindex pattern, definition of
7216
7217As you have already seen, each @code{awk} statement consists of
7218a pattern with an associated action.  This chapter describes how
7219you build patterns and actions.
7220
7221@menu
7222* Pattern Overview::            What goes into a pattern.
7223* Action Overview::             What goes into an action.
7224@end menu
7225
7226@node Pattern Overview, Action Overview, Patterns and Actions, Patterns and Actions
7227@section Pattern Elements
7228
7229Patterns in @code{awk} control the execution of rules: a rule is
7230executed when its pattern matches the current input record.  This
7231section explains all about how to write patterns.
7232
7233@menu
7234* Kinds of Patterns::           A list of all kinds of patterns.
7235* Regexp Patterns::             Using regexps as patterns.
7236* Expression Patterns::         Any expression can be used as a pattern.
7237* Ranges::                      Pairs of patterns specify record ranges.
7238* BEGIN/END::                   Specifying initialization and cleanup rules.
7239* Empty::                       The empty pattern, which matches every record.
7240@end menu
7241
7242@node Kinds of Patterns, Regexp Patterns, Pattern Overview, Pattern Overview
7243@subsection Kinds of Patterns
7244@cindex patterns, types of
7245
7246Here is a summary of the types of patterns supported in @code{awk}.
7247
7248@table @code
7249@item /@var{regular expression}/
7250A regular expression as a pattern.  It matches when the text of the
7251input record fits the regular expression.
7252(@xref{Regexp, ,Regular Expressions}.)
7253
7254@item @var{expression}
7255A single expression.  It matches when its value
7256is non-zero (if a number) or non-null (if a string).
7257(@xref{Expression Patterns, ,Expressions as Patterns}.)
7258
7259@item @var{pat1}, @var{pat2}
7260A pair of patterns separated by a comma, specifying a range of records.
7261The range includes both the initial record that matches @var{pat1}, and
7262the final record that matches @var{pat2}.
7263(@xref{Ranges, ,Specifying Record Ranges with Patterns}.)
7264
7265@item BEGIN
7266@itemx END
7267Special patterns for you to supply start-up or clean-up actions for your
7268@code{awk} program.
7269(@xref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns}.)
7270
7271@item @var{empty}
7272The empty pattern matches every input record.
7273(@xref{Empty, ,The Empty Pattern}.)
7274@end table
7275
7276@node Regexp Patterns, Expression Patterns, Kinds of Patterns, Pattern Overview
7277@subsection Regular Expressions as Patterns
7278
7279We have been using regular expressions as patterns since our early examples.
7280This kind of pattern is simply a regexp constant in the pattern part of
7281a rule.  Its  meaning is @samp{$0 ~ /@var{pattern}/}.
7282The pattern matches when the input record matches the regexp.
7283For example:
7284
7285@example
7286/foo|bar|baz/  @{ buzzwords++ @}
7287END            @{ print buzzwords, "buzzwords seen" @}
7288@end example
7289
7290@node Expression Patterns, Ranges, Regexp Patterns, Pattern Overview
7291@subsection Expressions as Patterns
7292
7293Any @code{awk} expression is valid as an @code{awk} pattern.
7294Then the pattern matches if the expression's value is non-zero (if a
7295number) or non-null (if a string).
7296
7297The expression is reevaluated each time the rule is tested against a new
7298input record.  If the expression uses fields such as @code{$1}, the
7299value depends directly on the new input record's text; otherwise, it
7300depends only on what has happened so far in the execution of the
7301@code{awk} program, but that may still be useful.
7302
7303A very common kind of expression used as a pattern is the comparison
7304expression, using the comparison operators described in
7305@ref{Typing and Comparison, ,Variable Typing and Comparison Expressions}.
7306
7307Regexp matching and non-matching are also very common expressions.
7308The left operand of the @samp{~} and @samp{!~} operators is a string.
7309The right operand is either a constant regular expression enclosed in
7310slashes (@code{/@var{regexp}/}), or any expression, whose string value
7311is used as a dynamic regular expression
7312(@pxref{Computed Regexps, , Using Dynamic Regexps}).
7313
7314The following example prints the second field of each input record
7315whose first field is precisely @samp{foo}.
7316
7317@example
7318$ awk '$1 == "foo" @{ print $2 @}' BBS-list
7319@end example
7320
7321@noindent
7322(There is no output, since there is no BBS site named ``foo''.)
7323Contrast this with the following regular expression match, which would
7324accept any record with a first field that contains @samp{foo}:
7325
7326@example
7327@group
7328$ awk '$1 ~ /foo/ @{ print $2 @}' BBS-list
7329@print{} 555-1234
7330@print{} 555-6699
7331@print{} 555-6480
7332@print{} 555-2127
7333@end group
7334@end example
7335
7336Boolean expressions are also commonly used as patterns.
7337Whether the pattern
7338matches an input record depends on whether its subexpressions match.
7339
7340For example, the following command prints all records in
7341@file{BBS-list} that contain both @samp{2400} and @samp{foo}.
7342
7343@example
7344$ awk '/2400/ && /foo/' BBS-list
7345@print{} fooey        555-1234     2400/1200/300     B
7346@end example
7347
7348The following command prints all records in
7349@file{BBS-list} that contain @emph{either} @samp{2400} or @samp{foo}, or
7350both.
7351
7352@example
7353@group
7354$ awk '/2400/ || /foo/' BBS-list
7355@print{} alpo-net     555-3412     2400/1200/300     A
7356@print{} bites        555-1675     2400/1200/300     A
7357@print{} fooey        555-1234     2400/1200/300     B
7358@print{} foot         555-6699     1200/300          B
7359@print{} macfoo       555-6480     1200/300          A
7360@print{} sdace        555-3430     2400/1200/300     A
7361@print{} sabafoo      555-2127     1200/300          C
7362@end group
7363@end example
7364
7365The following command prints all records in
7366@file{BBS-list} that do @emph{not} contain the string @samp{foo}.
7367
7368@example
7369@group
7370$ awk '! /foo/' BBS-list
7371@print{} aardvark     555-5553     1200/300          B
7372@print{} alpo-net     555-3412     2400/1200/300     A
7373@print{} barfly       555-7685     1200/300          A
7374@print{} bites        555-1675     2400/1200/300     A
7375@print{} camelot      555-0542     300               C
7376@print{} core         555-2912     1200/300          C
7377@print{} sdace        555-3430     2400/1200/300     A
7378@end group
7379@end example
7380
7381The subexpressions of a boolean operator in a pattern can be constant regular
7382expressions, comparisons, or any other @code{awk} expressions.  Range
7383patterns are not expressions, so they cannot appear inside boolean
7384patterns.  Likewise, the special patterns @code{BEGIN} and @code{END},
7385which never match any input record, are not expressions and cannot
7386appear inside boolean patterns.
7387
7388A regexp constant as a pattern is also a special case of an expression
7389pattern.  @code{/foo/} as an expression has the value one if @samp{foo}
7390appears in the current input record; thus, as a pattern, @code{/foo/}
7391matches any record containing @samp{foo}.
7392
7393@node Ranges, BEGIN/END, Expression Patterns, Pattern Overview
7394@subsection Specifying Record Ranges with Patterns
7395
7396@cindex range pattern
7397@cindex pattern, range
7398@cindex matching ranges of lines
7399A @dfn{range pattern} is made of two patterns separated by a comma, of
7400the form @samp{@var{begpat}, @var{endpat}}.  It matches ranges of
7401consecutive input records.  The first pattern, @var{begpat}, controls
7402where the range begins, and the second one, @var{endpat}, controls where
7403it ends.  For example,
7404
7405@example
7406awk '$1 == "on", $1 == "off"'
7407@end example
7408
7409@noindent
7410prints every record between @samp{on}/@samp{off} pairs, inclusive.
7411
7412A range pattern starts out by matching @var{begpat}
7413against every input record; when a record matches @var{begpat}, the
7414range pattern becomes @dfn{turned on}.  The range pattern matches this
7415record.  As long as it stays turned on, it automatically matches every
7416input record read.  It also matches @var{endpat} against
7417every input record; when that succeeds, the range pattern is turned
7418off again for the following record.  Then it goes back to checking
7419@var{begpat} against each record.
7420
7421The record that turns on the range pattern and the one that turns it
7422off both match the range pattern.  If you don't want to operate on
7423these records, you can write @code{if} statements in the rule's action
7424to distinguish them from the records you are interested in.
7425
7426It is possible for a pattern to be turned both on and off by the same
7427record, if the record satisfies both conditions.  Then the action is
7428executed for just that record.
7429
7430For example, suppose you have text between two identical markers (say
7431the @samp{%} symbol) that you wish to ignore.  You might try to
7432combine a range pattern that describes the delimited text with the
7433@code{next} statement
7434(not discussed yet, @pxref{Next Statement, , The @code{next} Statement}),
7435which causes @code{awk} to skip any further processing of the current
7436record and start over again with the next input record. Such a program
7437would look like this:
7438
7439@example
7440/^%$/,/^%$/    @{ next @}
7441               @{ print @}
7442@end example
7443
7444@noindent
7445@cindex skipping lines between markers
7446This program fails because the range pattern is both turned on and turned off
7447by the first line with just a @samp{%} on it.  To accomplish this task, you
7448must write the program this way, using a flag:
7449
7450@example
7451/^%$/     @{ skip = ! skip; next @}
7452skip == 1 @{ next @} # skip lines with `skip' set
7453@end example
7454
7455Note that in a range pattern, the @samp{,} has the lowest precedence
7456(is evaluated last) of all the operators.  Thus, for example, the
7457following program attempts to combine a range pattern with another,
7458simpler test.
7459
7460@example
7461echo Yes | awk '/1/,/2/ || /Yes/'
7462@end example
7463
7464The author of this program intended it to mean @samp{(/1/,/2/) || /Yes/}.
7465However, @code{awk} interprets this as @samp{/1/, (/2/ || /Yes/)}.
7466This cannot be changed or worked around; range patterns do not combine
7467with other patterns.
7468
7469@node BEGIN/END, Empty, Ranges, Pattern Overview
7470@subsection The @code{BEGIN} and @code{END} Special Patterns
7471
7472@cindex @code{BEGIN} special pattern
7473@cindex pattern, @code{BEGIN}
7474@cindex @code{END} special pattern
7475@cindex pattern, @code{END}
7476@code{BEGIN} and @code{END} are special patterns.  They are not used to
7477match input records.  Rather, they supply start-up or
7478clean-up actions for your @code{awk} script.
7479
7480@menu
7481* Using BEGIN/END::             How and why to use BEGIN/END rules.
7482* I/O And BEGIN/END::           I/O issues in BEGIN/END rules.
7483@end menu
7484
7485@node Using BEGIN/END, I/O And BEGIN/END, BEGIN/END, BEGIN/END
7486@subsubsection Startup and Cleanup Actions
7487
7488A @code{BEGIN} rule is executed, once, before the first input record
7489has been read.  An @code{END} rule is executed, once, after all the
7490input has been read.  For example:
7491
7492@example
7493@group
7494$ awk '
7495> BEGIN @{ print "Analysis of \"foo\"" @}
7496> /foo/ @{ ++n @}
7497> END   @{ print "\"foo\" appears " n " times." @}' BBS-list
7498@print{} Analysis of "foo"
7499@print{} "foo" appears 4 times.
7500@end group
7501@end example
7502
7503This program finds the number of records in the input file @file{BBS-list}
7504that contain the string @samp{foo}.  The @code{BEGIN} rule prints a title
7505for the report.  There is no need to use the @code{BEGIN} rule to
7506initialize the counter @code{n} to zero, as @code{awk} does this
7507automatically (@pxref{Variables}).
7508
7509The second rule increments the variable @code{n} every time a
7510record containing the pattern @samp{foo} is read.  The @code{END} rule
7511prints the value of @code{n} at the end of the run.
7512
7513The special patterns @code{BEGIN} and @code{END} cannot be used in ranges
7514or with boolean operators (indeed, they cannot be used with any operators).
7515
7516An @code{awk} program may have multiple @code{BEGIN} and/or @code{END}
7517rules.  They are executed in the order they appear, all the @code{BEGIN}
7518rules at start-up and all the @code{END} rules at termination.
7519@code{BEGIN} and @code{END} rules may be intermixed with other rules.
7520This feature was added in the 1987 version of @code{awk}, and is included
7521in the POSIX standard.  The original (1978) version of @code{awk}
7522required you to put the @code{BEGIN} rule at the beginning of the
7523program, and the @code{END} rule at the end, and only allowed one of
7524each.  This is no longer required, but it is a good idea in terms of
7525program organization and readability.
7526
7527Multiple @code{BEGIN} and @code{END} rules are useful for writing
7528library functions, since each library file can have its own @code{BEGIN} and/or
7529@code{END} rule to do its own initialization and/or cleanup.  Note that
7530the order in which library functions are named on the command line
7531controls the order in which their @code{BEGIN} and @code{END} rules are
7532executed.  Therefore you have to be careful to write such rules in
7533library files so that the order in which they are executed doesn't matter.
7534@xref{Options, ,Command Line Options}, for more information on
7535using library functions.
7536@xref{Library Functions, ,A Library of @code{awk} Functions},
7537for a number of useful library functions.
7538
7539@cindex dark corner
7540If an @code{awk} program only has a @code{BEGIN} rule, and no other
7541rules, then the program exits after the @code{BEGIN} rule has been run.
7542(The original version of @code{awk} used to keep reading and ignoring input
7543until end of file was seen.)  However, if an @code{END} rule exists,
7544then the input will be read, even if there are no other rules in
7545the program.  This is necessary in case the @code{END} rule checks the
7546@code{FNR} and @code{NR} variables (d.c.).
7547
7548@code{BEGIN} and @code{END} rules must have actions; there is no default
7549action for these rules since there is no current record when they run.
7550
7551@node I/O And BEGIN/END, , Using BEGIN/END, BEGIN/END
7552@subsubsection Input/Output from @code{BEGIN} and @code{END} Rules
7553
7554@cindex I/O from @code{BEGIN} and @code{END}
7555There are several (sometimes subtle) issues involved when doing I/O
7556from a @code{BEGIN} or @code{END} rule.
7557
7558The first has to do with the value of @code{$0} in a @code{BEGIN}
7559rule.  Since @code{BEGIN} rules are executed before any input is read,
7560there simply is no input record, and therefore no fields, when
7561executing @code{BEGIN} rules.  References to @code{$0} and the fields
7562yield a null string or zero, depending upon the context.  One way
7563to give @code{$0} a real value is to execute a @code{getline} command
7564without a variable (@pxref{Getline, ,Explicit Input with @code{getline}}).
7565Another way is to simply assign a value to it.
7566
7567@cindex differences between @code{gawk} and @code{awk}
7568The second point is similar to the first, but from the other direction.
7569Inside an @code{END} rule, what is the value of @code{$0} and @code{NF}?
7570Traditionally, due largely to implementation issues, @code{$0} and
7571@code{NF} were @emph{undefined} inside an @code{END} rule.
7572The POSIX standard specified that @code{NF} was available in an @code{END}
7573rule, containing the number of fields from the last input record.
7574Due most probably to an oversight, the standard does not say that @code{$0}
7575is also preserved, although logically one would think that it should be.
7576In fact, @code{gawk} does preserve the value of @code{$0} for use in
7577@code{END} rules.  Be aware, however, that Unix @code{awk}, and possibly
7578other implementations, do not.
7579
7580The third point follows from the first two.  What is the meaning of
7581@samp{print} inside a @code{BEGIN} or @code{END} rule?  The meaning is
7582the same as always, @samp{print $0}.  If @code{$0} is the null string,
7583then this prints an empty line.  Many long time @code{awk} programmers
7584use @samp{print} in @code{BEGIN} and @code{END} rules, to mean
7585@samp{@w{print ""}}, relying on @code{$0} being null.  While you might
7586generally get away with this in @code{BEGIN} rules, in @code{gawk} at
7587least, it is a very bad idea in @code{END} rules.  It is also poor
7588style, since if you want an empty line in the output, you
7589should say so explicitly in your program.
7590
7591@node Empty,  , BEGIN/END, Pattern Overview
7592@subsection The Empty Pattern
7593
7594@cindex empty pattern
7595@cindex pattern, empty
7596An empty (i.e.@: non-existent) pattern is considered to match @emph{every}
7597input record.  For example, the program:
7598
7599@example
7600awk '@{ print $1 @}' BBS-list
7601@end example
7602
7603@noindent
7604prints the first field of every record.
7605
7606@node Action Overview,  , Pattern Overview, Patterns and Actions
7607@section Overview of Actions
7608@cindex action, definition of
7609@cindex curly braces
7610@cindex action, curly braces
7611@cindex action, separating statements
7612
7613An @code{awk} program or script consists of a series of
7614rules and function definitions, interspersed.  (Functions are
7615described later.  @xref{User-defined, ,User-defined Functions}.)
7616
7617A rule contains a pattern and an action, either of which (but not
7618both) may be
7619omitted.  The purpose of the @dfn{action} is to tell @code{awk} what to do
7620once a match for the pattern is found.  Thus, in outline, an @code{awk}
7621program generally looks like this:
7622
7623@example
7624@r{[}@var{pattern}@r{]} @r{[}@{ @var{action} @}@r{]}
7625@r{[}@var{pattern}@r{]} @r{[}@{ @var{action} @}@r{]}
7626@dots{}
7627function @var{name}(@var{args}) @{ @dots{} @}
7628@dots{}
7629@end example
7630
7631An action consists of one or more @code{awk} @dfn{statements}, enclosed
7632in curly braces (@samp{@{} and @samp{@}}).  Each statement specifies one
7633thing to be done.  The statements are separated by newlines or
7634semicolons.
7635
7636The curly braces around an action must be used even if the action
7637contains only one statement, or even if it contains no statements at
7638all.  However, if you omit the action entirely, omit the curly braces as
7639well.  An omitted action is equivalent to @samp{@{ print $0 @}}.
7640
7641@example
7642/foo/  @{ @}  # match foo, do nothing - empty action
7643/foo/       # match foo, print the record - omitted action
7644@end example
7645
7646Here are the kinds of statements supported in @code{awk}:
7647
7648@itemize @bullet
7649@item
7650Expressions, which can call functions or assign values to variables
7651(@pxref{Expressions}).  Executing
7652this kind of statement simply computes the value of the expression.
7653This is useful when the expression has side effects
7654(@pxref{Assignment Ops, ,Assignment Expressions}).
7655
7656@item
7657Control statements, which specify the control flow of @code{awk}
7658programs.  The @code{awk} language gives you C-like constructs
7659(@code{if}, @code{for}, @code{while}, and @code{do}) as well as a few
7660special ones (@pxref{Statements, ,Control Statements in Actions}).
7661
7662@item
7663Compound statements, which consist of one or more statements enclosed in
7664curly braces.  A compound statement is used in order to put several
7665statements together in the body of an @code{if}, @code{while}, @code{do}
7666or @code{for} statement.
7667
7668@item
7669Input statements, using the @code{getline} command
7670(@pxref{Getline, ,Explicit Input with @code{getline}}), the @code{next}
7671statement (@pxref{Next Statement, ,The @code{next} Statement}),
7672and the @code{nextfile} statement
7673(@pxref{Nextfile Statement, ,The @code{nextfile} Statement}).
7674
7675@item
7676Output statements, @code{print} and @code{printf}.
7677@xref{Printing, ,Printing Output}.
7678
7679@item
7680Deletion statements, for deleting array elements.
7681@xref{Delete, ,The @code{delete} Statement}.
7682@end itemize
7683
7684@iftex
7685The next chapter covers control statements in detail.
7686@end iftex
7687
7688@node Statements, Built-in Variables, Patterns and Actions, Top
7689@chapter Control Statements in Actions
7690@cindex control statement
7691
7692@dfn{Control statements} such as @code{if}, @code{while}, and so on
7693control the flow of execution in @code{awk} programs.  Most of the
7694control statements in @code{awk} are patterned on similar statements in
7695C.
7696
7697All the control statements start with special keywords such as @code{if}
7698and @code{while}, to distinguish them from simple expressions.
7699
7700@cindex compound statement
7701@cindex statement, compound
7702Many control statements contain other statements; for example, the
7703@code{if} statement contains another statement which may or may not be
7704executed.  The contained statement is called the @dfn{body}.  If you
7705want to include more than one statement in the body, group them into a
7706single @dfn{compound statement} with curly braces, separating them with
7707newlines or semicolons.
7708
7709@menu
7710* If Statement::                Conditionally execute some @code{awk}
7711                                statements.
7712* While Statement::             Loop until some condition is satisfied.
7713* Do Statement::                Do specified action while looping until some
7714                                condition is satisfied.
7715* For Statement::               Another looping statement, that provides
7716                                initialization and increment clauses.
7717* Break Statement::             Immediately exit the innermost enclosing loop.
7718* Continue Statement::          Skip to the end of the innermost enclosing
7719                                loop.
7720* Next Statement::              Stop processing the current input record.
7721* Nextfile Statement::          Stop processing the current file.
7722* Exit Statement::              Stop execution of @code{awk}.
7723@end menu
7724
7725@node If Statement, While Statement, Statements, Statements
7726@section The @code{if}-@code{else} Statement
7727
7728@cindex @code{if}-@code{else} statement
7729The @code{if}-@code{else} statement is @code{awk}'s decision-making
7730statement.  It looks like this:
7731
7732@example
7733if (@var{condition}) @var{then-body} @r{[}else @var{else-body}@r{]}
7734@end example
7735
7736@noindent
7737The @var{condition} is an expression that controls what the rest of the
7738statement will do.  If @var{condition} is true, @var{then-body} is
7739executed; otherwise, @var{else-body} is executed.
7740The @code{else} part of the statement is
7741optional.  The condition is considered false if its value is zero or
7742the null string, and true otherwise.
7743
7744Here is an example:
7745
7746@example
7747if (x % 2 == 0)
7748    print "x is even"
7749else
7750    print "x is odd"
7751@end example
7752
7753In this example, if the expression @samp{x % 2 == 0} is true (that is,
7754the value of @code{x} is evenly divisible by two), then the first @code{print}
7755statement is executed, otherwise the second @code{print} statement is
7756executed.
7757
7758If the @code{else} appears on the same line as @var{then-body}, and
7759@var{then-body} is not a compound statement (i.e.@: not surrounded by
7760curly braces), then a semicolon must separate @var{then-body} from
7761@code{else}.  To illustrate this, let's rewrite the previous example:
7762
7763@example
7764if (x % 2 == 0) print "x is even"; else
7765        print "x is odd"
7766@end example
7767
7768@noindent
7769If you forget the @samp{;}, @code{awk} won't be able to interpret the
7770statement, and you will get a syntax error.
7771
7772We would not actually write this example this way, because a human
7773reader might fail to see the @code{else} if it were not the first thing
7774on its line.
7775
7776@node While Statement, Do Statement, If Statement, Statements
7777@section The @code{while} Statement
7778@cindex @code{while} statement
7779@cindex loop
7780@cindex body of a loop
7781
7782In programming, a @dfn{loop} means a part of a program that can
7783be executed two or more times in succession.
7784
7785The @code{while} statement is the simplest looping statement in
7786@code{awk}.  It repeatedly executes a statement as long as a condition is
7787true.  It looks like this:
7788
7789@example
7790while (@var{condition})
7791  @var{body}
7792@end example
7793
7794@noindent
7795Here @var{body} is a statement that we call the @dfn{body} of the loop,
7796and @var{condition} is an expression that controls how long the loop
7797keeps running.
7798
7799The first thing the @code{while} statement does is test @var{condition}.
7800If @var{condition} is true, it executes the statement @var{body}.
7801@ifinfo
7802(The @var{condition} is true when the value
7803is not zero and not a null string.)
7804@end ifinfo
7805After @var{body} has been executed,
7806@var{condition} is tested again, and if it is still true, @var{body} is
7807executed again.  This process repeats until @var{condition} is no longer
7808true.  If @var{condition} is initially false, the body of the loop is
7809never executed, and @code{awk} continues with the statement following
7810the loop.
7811
7812This example prints the first three fields of each record, one per line.
7813
7814@example
7815awk '@{ i = 1
7816       while (i <= 3) @{
7817           print $i
7818           i++
7819       @}
7820@}' inventory-shipped
7821@end example
7822
7823@noindent
7824Here the body of the loop is a compound statement enclosed in braces,
7825containing two statements.
7826
7827The loop works like this: first, the value of @code{i} is set to one.
7828Then, the @code{while} tests whether @code{i} is less than or equal to
7829three.  This is true when @code{i} equals one, so the @code{i}-th
7830field is printed.  Then the @samp{i++} increments the value of @code{i}
7831and the loop repeats.  The loop terminates when @code{i} reaches four.
7832
7833As you can see, a newline is not required between the condition and the
7834body; but using one makes the program clearer unless the body is a
7835compound statement or is very simple.  The newline after the open-brace
7836that begins the compound statement is not required either, but the
7837program would be harder to read without it.
7838
7839@node Do Statement, For Statement, While Statement, Statements
7840@section The @code{do}-@code{while} Statement
7841
7842The @code{do} loop is a variation of the @code{while} looping statement.
7843The @code{do} loop executes the @var{body} once, and then repeats @var{body}
7844as long as @var{condition} is true.  It looks like this:
7845
7846@example
7847@group
7848do
7849  @var{body}
7850while (@var{condition})
7851@end group
7852@end example
7853
7854Even if @var{condition} is false at the start, @var{body} is executed at
7855least once (and only once, unless executing @var{body} makes
7856@var{condition} true).  Contrast this with the corresponding
7857@code{while} statement:
7858
7859@example
7860while (@var{condition})
7861  @var{body}
7862@end example
7863
7864@noindent
7865This statement does not execute @var{body} even once if @var{condition}
7866is false to begin with.
7867
7868Here is an example of a @code{do} statement:
7869
7870@example
7871awk '@{ i = 1
7872       do @{
7873          print $0
7874          i++
7875       @} while (i <= 10)
7876@}'
7877@end example
7878
7879@noindent
7880This program prints each input record ten times.  It isn't a very
7881realistic example, since in this case an ordinary @code{while} would do
7882just as well.  But this reflects actual experience; there is only
7883occasionally a real use for a @code{do} statement.
7884
7885@node For Statement, Break Statement, Do Statement, Statements
7886@section The @code{for} Statement
7887@cindex @code{for} statement
7888
7889The @code{for} statement makes it more convenient to count iterations of a
7890loop.  The general form of the @code{for} statement looks like this:
7891
7892@example
7893for (@var{initialization}; @var{condition}; @var{increment})
7894  @var{body}
7895@end example
7896
7897@noindent
7898The @var{initialization}, @var{condition} and @var{increment} parts are
7899arbitrary @code{awk} expressions, and @var{body} stands for any
7900@code{awk} statement.
7901
7902The @code{for} statement starts by executing @var{initialization}.
7903Then, as long
7904as @var{condition} is true, it repeatedly executes @var{body} and then
7905@var{increment}.  Typically @var{initialization} sets a variable to
7906either zero or one, @var{increment} adds one to it, and @var{condition}
7907compares it against the desired number of iterations.
7908
7909Here is an example of a @code{for} statement:
7910
7911@example
7912@group
7913awk '@{ for (i = 1; i <= 3; i++)
7914          print $i
7915@}' inventory-shipped
7916@end group
7917@end example
7918
7919@noindent
7920This prints the first three fields of each input record, one field per
7921line.
7922
7923You cannot set more than one variable in the
7924@var{initialization} part unless you use a multiple assignment statement
7925such as @samp{x = y = 0}, which is possible only if all the initial values
7926are equal.  (But you can initialize additional variables by writing
7927their assignments as separate statements preceding the @code{for} loop.)
7928
7929The same is true of the @var{increment} part; to increment additional
7930variables, you must write separate statements at the end of the loop.
7931The C compound expression, using C's comma operator, would be useful in
7932this context, but it is not supported in @code{awk}.
7933
7934Most often, @var{increment} is an increment expression, as in the
7935example above.  But this is not required; it can be any expression
7936whatever.  For example, this statement prints all the powers of two
7937between one and 100:
7938
7939@example
7940for (i = 1; i <= 100; i *= 2)
7941  print i
7942@end example
7943
7944Any of the three expressions in the parentheses following the @code{for} may
7945be omitted if there is nothing to be done there.  Thus, @w{@samp{for (; x
7946> 0;)}} is equivalent to @w{@samp{while (x > 0)}}.  If the
7947@var{condition} is omitted, it is treated as @var{true}, effectively
7948yielding an @dfn{infinite loop} (i.e.@: a loop that will never
7949terminate).
7950
7951In most cases, a @code{for} loop is an abbreviation for a @code{while}
7952loop, as shown here:
7953
7954@example
7955@var{initialization}
7956while (@var{condition}) @{
7957  @var{body}
7958  @var{increment}
7959@}
7960@end example
7961
7962@noindent
7963The only exception is when the @code{continue} statement
7964(@pxref{Continue Statement, ,The @code{continue} Statement}) is used
7965inside the loop; changing a @code{for} statement to a @code{while}
7966statement in this way can change the effect of the @code{continue}
7967statement inside the loop.
7968
7969There is an alternate version of the @code{for} loop, for iterating over
7970all the indices of an array:
7971
7972@example
7973for (i in array)
7974    @var{do something with} array[i]
7975@end example
7976
7977@noindent
7978@xref{Scanning an Array, ,Scanning All Elements of an Array},
7979for more information on this version of the @code{for} loop.
7980
7981The @code{awk} language has a @code{for} statement in addition to a
7982@code{while} statement because often a @code{for} loop is both less work to
7983type and more natural to think of.  Counting the number of iterations is
7984very common in loops.  It can be easier to think of this counting as part
7985of looping rather than as something to do inside the loop.
7986
7987The next section has more complicated examples of @code{for} loops.
7988
7989@node Break Statement, Continue Statement, For Statement, Statements
7990@section The @code{break} Statement
7991@cindex @code{break} statement
7992@cindex loops, exiting
7993
7994The @code{break} statement jumps out of the innermost @code{for},
7995@code{while}, or @code{do} loop that encloses it.  The
7996following example finds the smallest divisor of any integer, and also
7997identifies prime numbers:
7998
7999@example
8000awk '# find smallest divisor of num
8001     @{ num = $1
8002@group
8003       for (div = 2; div*div <= num; div++)
8004         if (num % div == 0)
8005           break
8006@end group
8007       if (num % div == 0)
8008         printf "Smallest divisor of %d is %d\n", num, div
8009       else
8010         printf "%d is prime\n", num
8011     @}'
8012@end example
8013
8014When the remainder is zero in the first @code{if} statement, @code{awk}
8015immediately @dfn{breaks out} of the containing @code{for} loop.  This means
8016that @code{awk} proceeds immediately to the statement following the loop
8017and continues processing.  (This is very different from the @code{exit}
8018statement which stops the entire @code{awk} program.
8019@xref{Exit Statement, ,The @code{exit} Statement}.)
8020
8021Here is another program equivalent to the previous one.  It illustrates how
8022the @var{condition} of a @code{for} or @code{while} could just as well be
8023replaced with a @code{break} inside an @code{if}:
8024
8025@example
8026@group
8027awk '# find smallest divisor of num
8028     @{ num = $1
8029       for (div = 2; ; div++) @{
8030         if (num % div == 0) @{
8031           printf "Smallest divisor of %d is %d\n", num, div
8032           break
8033         @}
8034         if (div*div > num) @{
8035           printf "%d is prime\n", num
8036           break
8037         @}
8038       @}
8039@}'
8040@end group
8041@end example
8042
8043@cindex @code{break}, outside of loops
8044@cindex historical features
8045@cindex @code{awk} language, POSIX version
8046@cindex POSIX @code{awk}
8047@cindex dark corner
8048As described above, the @code{break} statement has no meaning when
8049used outside the body of a loop.  However, although it was never documented,
8050historical implementations of @code{awk} have treated the @code{break}
8051statement outside of a loop as if it were a @code{next} statement
8052(@pxref{Next Statement, ,The @code{next} Statement}).
8053Recent versions of Unix @code{awk} no longer allow this usage.
8054@code{gawk} will support this use of @code{break} only if @samp{--traditional}
8055has been specified on the command line
8056(@pxref{Options, ,Command Line Options}).
8057Otherwise, it will be treated as an error, since the POSIX standard
8058specifies that @code{break} should only be used inside the body of a
8059loop (d.c.).
8060
8061@node Continue Statement, Next Statement, Break Statement, Statements
8062@section The @code{continue} Statement
8063
8064@cindex @code{continue} statement
8065The @code{continue} statement, like @code{break}, is used only inside
8066@code{for}, @code{while}, and @code{do} loops.  It skips
8067over the rest of the loop body, causing the next cycle around the loop
8068to begin immediately.  Contrast this with @code{break}, which jumps out
8069of the loop altogether.
8070
8071@c The point of this program was to illustrate the use of continue with
8072@c a while loop. But Karl Berry points out that that is done adequately
8073@c below, and that this example is very un-awk-like. So for now, we'll
8074@c omit it.
8075@ignore
8076In Texinfo source files, text that the author wishes to ignore can be
8077enclosed between lines that start with @samp{@@ignore} and end with
8078@samp{@atend ignore}.  Here is a program that strips out lines between
8079@samp{@@ignore} and @samp{@atend ignore} pairs.
8080
8081@example
8082BEGIN @{
8083    while (getline > 0) @{
8084       if (/^@@ignore/)
8085           ignoring = 1
8086       else if (/^@@end[ \t]+ignore/) @{
8087           ignoring = 0
8088           continue
8089       @}
8090       if (ignoring)
8091           continue
8092       print
8093    @}
8094@}
8095@end example
8096
8097When an @samp{@@ignore} is seen, the @code{ignoring} flag is set to one (true).
8098When @samp{@atend ignore} is seen, the flag is reset to zero (false). As long
8099as the flag is true, the input record is not printed, because the
8100@code{continue} restarts the @code{while} loop, skipping over the @code{print}
8101statement.
8102
8103@c Exercise!!!
8104@c How could this program be written to make better use of the awk language?
8105@end ignore
8106
8107The @code{continue} statement in a @code{for} loop directs @code{awk} to
8108skip the rest of the body of the loop, and resume execution with the
8109increment-expression of the @code{for} statement.  The following program
8110illustrates this fact:
8111
8112@example
8113awk 'BEGIN @{
8114     for (x = 0; x <= 20; x++) @{
8115         if (x == 5)
8116             continue
8117         printf "%d ", x
8118     @}
8119     print ""
8120@}'
8121@end example
8122
8123@noindent
8124This program prints all the numbers from zero to 20, except for five, for
8125which the @code{printf} is skipped.  Since the increment @samp{x++}
8126is not skipped, @code{x} does not remain stuck at five.  Contrast the
8127@code{for} loop above with this @code{while} loop:
8128
8129@example
8130awk 'BEGIN @{
8131     x = 0
8132     while (x <= 20) @{
8133         if (x == 5)
8134             continue
8135         printf "%d ", x
8136         x++
8137     @}
8138     print ""
8139@}'
8140@end example
8141
8142@noindent
8143This program loops forever once @code{x} gets to five.
8144
8145@cindex @code{continue}, outside of loops
8146@cindex historical features
8147@cindex @code{awk} language, POSIX version
8148@cindex POSIX @code{awk}
8149@cindex dark corner
8150As described above, the @code{continue} statement has no meaning when
8151used outside the body of a loop.  However, although it was never documented,
8152historical implementations of @code{awk} have treated the @code{continue}
8153statement outside of a loop as if it were a @code{next} statement
8154(@pxref{Next Statement, ,The @code{next} Statement}).
8155Recent versions of Unix @code{awk} no longer allow this usage.
8156@code{gawk} will support this use of @code{continue} only if
8157@samp{--traditional} has been specified on the command line
8158(@pxref{Options, ,Command Line Options}).
8159Otherwise, it will be treated as an error, since the POSIX standard
8160specifies that @code{continue} should only be used inside the body of a
8161loop (d.c.).
8162
8163@node Next Statement, Nextfile Statement, Continue Statement, Statements
8164@section The @code{next} Statement
8165@cindex @code{next} statement
8166
8167The @code{next} statement forces @code{awk} to immediately stop processing
8168the current record and go on to the next record.  This means that no
8169further rules are executed for the current record.  The rest of the
8170current rule's action is not executed either.
8171
8172Contrast this with the effect of the @code{getline} function
8173(@pxref{Getline, ,Explicit Input with @code{getline}}).  That too causes
8174@code{awk} to read the next record immediately, but it does not alter the
8175flow of control in any way.  So the rest of the current action executes
8176with a new input record.
8177
8178At the highest level, @code{awk} program execution is a loop that reads
8179an input record and then tests each rule's pattern against it.  If you
8180think of this loop as a @code{for} statement whose body contains the
8181rules, then the @code{next} statement is analogous to a @code{continue}
8182statement: it skips to the end of the body of this implicit loop, and
8183executes the increment (which reads another record).
8184
8185For example, if your @code{awk} program works only on records with four
8186fields, and you don't want it to fail when given bad input, you might
8187use this rule near the beginning of the program:
8188
8189@example
8190@group
8191NF != 4 @{
8192  err = sprintf("%s:%d: skipped: NF != 4\n", FILENAME, FNR)
8193  print err > "/dev/stderr"
8194  next
8195@}
8196@end group
8197@end example
8198
8199@noindent
8200so that the following rules will not see the bad record.  The error
8201message is redirected to the standard error output stream, as error
8202messages should be.  @xref{Special Files, ,Special File Names in @code{gawk}}.
8203
8204@cindex @code{awk} language, POSIX version
8205@cindex POSIX @code{awk}
8206According to the POSIX standard, the behavior is undefined if
8207the @code{next} statement is used in a @code{BEGIN} or @code{END} rule.
8208@code{gawk} will treat it as a syntax error.
8209Although POSIX permits it,
8210some other @code{awk} implementations don't allow the @code{next}
8211statement inside function bodies
8212(@pxref{User-defined, ,User-defined Functions}).
8213Just as any other @code{next} statement, a @code{next} inside a
8214function body reads the next record and starts processing it with the
8215first rule in the program.
8216
8217If the @code{next} statement causes the end of the input to be reached,
8218then the code in any @code{END} rules will be executed.
8219@xref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns}.
8220
8221@cindex @code{next}, inside a user-defined function
8222@strong{Caution:} Some @code{awk} implementations generate a run-time
8223error if you use the @code{next} statement inside a user-defined function
8224(@pxref{User-defined, , User-defined Functions}).
8225@code{gawk} does not have this problem.
8226
8227@node Nextfile Statement, Exit Statement, Next Statement, Statements
8228@section The @code{nextfile} Statement
8229@cindex @code{nextfile} statement
8230@cindex differences between @code{gawk} and @code{awk}
8231
8232@code{gawk} provides the @code{nextfile} statement,
8233which is similar to the @code{next} statement.
8234However, instead of abandoning processing of the current record, the
8235@code{nextfile} statement instructs @code{gawk} to stop processing the
8236current data file.
8237
8238Upon execution of the @code{nextfile} statement, @code{FILENAME} is
8239updated to the name of the next data file listed on the command line,
8240@code{FNR} is reset to one, @code{ARGIND} is incremented, and processing
8241starts over with the first rule in the progam.  @xref{Built-in Variables}.
8242
8243If the @code{nextfile} statement causes the end of the input to be reached,
8244then the code in any @code{END} rules will be executed.
8245@xref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns}.
8246
8247The @code{nextfile} statement is a @code{gawk} extension; it is not
8248(currently) available in any other @code{awk} implementation.
8249@xref{Nextfile Function, ,Implementing @code{nextfile} as a Function},
8250for a user-defined function you can use to simulate the @code{nextfile}
8251statement.
8252
8253The @code{nextfile} statement would be useful if you have many data
8254files to process, and you expect that you
8255would not want to process every record in every file.
8256Normally, in order to move on to
8257the next data file, you would have to continue scanning the unwanted
8258records.  The @code{nextfile} statement accomplishes this much more
8259efficiently.
8260
8261@cindex @code{next file} statement
8262@strong{Caution:}  Versions of @code{gawk} prior to 3.0 used two
8263words (@samp{next file}) for the @code{nextfile} statement.  This was
8264changed in 3.0 to one word, since the treatment of @samp{file} was
8265inconsistent. When it appeared after @code{next}, it was a keyword.
8266Otherwise, it was a regular identifier.  The old usage is still
8267accepted. However, @code{gawk} will generate a warning message, and
8268support for @code{next file} will eventually be discontinued in a
8269future version of @code{gawk}.
8270
8271@node Exit Statement,  , Nextfile Statement, Statements
8272@section The @code{exit} Statement
8273
8274@cindex @code{exit} statement
8275The @code{exit} statement causes @code{awk} to immediately stop
8276executing the current rule and to stop processing input; any remaining input
8277is ignored.  It looks like this:
8278
8279@example
8280exit @r{[}@var{return code}@r{]}
8281@end example
8282
8283If an @code{exit} statement is executed from a @code{BEGIN} rule the
8284program stops processing everything immediately.  No input records are
8285read.  However, if an @code{END} rule is present, it is executed
8286(@pxref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns}).
8287
8288If @code{exit} is used as part of an @code{END} rule, it causes
8289the program to stop immediately.
8290
8291An @code{exit} statement that is not part
8292of a @code{BEGIN} or @code{END} rule stops the execution of any further
8293automatic rules for the current record, skips reading any remaining input
8294records, and executes
8295the @code{END} rule if there is one.
8296
8297If you do not want the @code{END} rule to do its job in this case, you
8298can set a variable to non-zero before the @code{exit} statement, and check
8299that variable in the @code{END} rule.
8300@xref{Assert Function, ,Assertions},
8301for an example that does this.
8302
8303@cindex dark corner
8304If an argument is supplied to @code{exit}, its value is used as the exit
8305status code for the @code{awk} process.  If no argument is supplied,
8306@code{exit} returns status zero (success).  In the case where an argument
8307is supplied to a first @code{exit} statement, and then @code{exit} is
8308called a second time with no argument, the previously supplied exit value
8309is used (d.c.).
8310
8311For example, let's say you've discovered an error condition you really
8312don't know how to handle.  Conventionally, programs report this by
8313exiting with a non-zero status.  Your @code{awk} program can do this
8314using an @code{exit} statement with a non-zero argument.  Here is an
8315example:
8316
8317@example
8318@group
8319BEGIN @{
8320       if (("date" | getline date_now) <= 0) @{
8321         print "Can't get system date" > "/dev/stderr"
8322         exit 1
8323       @}
8324       print "current date is", date_now
8325       close("date")
8326@}
8327@end group
8328@end example
8329
8330@node Built-in Variables, Arrays, Statements, Top
8331@chapter Built-in Variables
8332@cindex built-in variables
8333
8334Most @code{awk} variables are available for you to use for your own
8335purposes; they never change except when your program assigns values to
8336them, and never affect anything except when your program examines them.
8337However, a few variables in @code{awk} have special built-in meanings.
8338Some of them @code{awk} examines automatically, so that they enable you
8339to tell @code{awk} how to do certain things.  Others are set
8340automatically by @code{awk}, so that they carry information from the
8341internal workings of @code{awk} to your program.
8342
8343This chapter documents all the built-in variables of @code{gawk}.  Most
8344of them are also documented in the chapters describing their areas of
8345activity.
8346
8347@menu
8348* User-modified::               Built-in variables that you change to control
8349                                @code{awk}.
8350* Auto-set::                    Built-in variables where @code{awk} gives you
8351                                information.
8352* ARGC and ARGV::               Ways to use @code{ARGC} and @code{ARGV}.
8353@end menu
8354
8355@node User-modified, Auto-set, Built-in Variables, Built-in Variables
8356@section Built-in Variables that Control @code{awk}
8357@cindex built-in variables, user modifiable
8358
8359This is an alphabetical list of the variables which you can change to
8360control how @code{awk} does certain things. Those variables that are
8361specific to @code{gawk} are marked with an asterisk, @samp{*}.
8362
8363@table @code
8364@vindex CONVFMT
8365@cindex @code{awk} language, POSIX version
8366@cindex POSIX @code{awk}
8367@item CONVFMT
8368This string controls conversion of numbers to
8369strings (@pxref{Conversion, ,Conversion of Strings and Numbers}).
8370It works by being passed, in effect, as the first argument to the
8371@code{sprintf} function
8372(@pxref{String Functions, ,Built-in Functions for String Manipulation}).
8373Its default value is @code{"%.6g"}.
8374@code{CONVFMT} was introduced by the POSIX standard.
8375
8376@vindex FIELDWIDTHS
8377@item FIELDWIDTHS *
8378This is a space separated list of columns that tells @code{gawk}
8379how to split input with fixed, columnar boundaries.  It is an
8380experimental feature.  Assigning to @code{FIELDWIDTHS}
8381overrides the use of @code{FS} for field splitting.
8382@xref{Constant Size, ,Reading Fixed-width Data}, for more information.
8383
8384If @code{gawk} is in compatibility mode
8385(@pxref{Options, ,Command Line Options}), then @code{FIELDWIDTHS}
8386has no special meaning, and field splitting operations are done based
8387exclusively on the value of @code{FS}.
8388
8389@vindex FS
8390@item FS
8391@code{FS} is the input field separator
8392(@pxref{Field Separators, ,Specifying How Fields are Separated}).
8393The value is a single-character string or a multi-character regular
8394expression that matches the separations between fields in an input
8395record.  If the value is the null string (@code{""}), then each
8396character in the record becomes a separate field.
8397
8398The default value is @w{@code{" "}}, a string consisting of a single
8399space.  As a special exception, this value means that any
8400sequence of spaces, tabs, and/or newlines is a single separator.@footnote{In
8401POSIX @code{awk}, newline does not count as whitespace.}  It also causes
8402spaces, tabs, and newlines at the beginning and end of a record to be ignored.
8403
8404You can set the value of @code{FS} on the command line using the
8405@samp{-F} option:
8406
8407@example
8408awk -F, '@var{program}' @var{input-files}
8409@end example
8410
8411If @code{gawk} is using @code{FIELDWIDTHS} for field-splitting,
8412assigning a value to @code{FS} will cause @code{gawk} to return to
8413the normal, @code{FS}-based, field splitting. An easy way to do this
8414is to simply say @samp{FS = FS}, perhaps with an explanatory comment.
8415
8416@vindex IGNORECASE
8417@item IGNORECASE *
8418If @code{IGNORECASE} is non-zero or non-null, then all string comparisons,
8419and all regular expression matching are case-independent.  Thus, regexp
8420matching with @samp{~} and @samp{!~}, and the @code{gensub},
8421@code{gsub}, @code{index}, @code{match}, @code{split} and @code{sub}
8422functions, record termination with @code{RS}, and field splitting with
8423@code{FS} all ignore case when doing their particular regexp operations.
8424The value of @code{IGNORECASE} does @emph{not} affect array subscripting.
8425@xref{Case-sensitivity, ,Case-sensitivity in Matching}.
8426
8427If @code{gawk} is in compatibility mode
8428(@pxref{Options, ,Command Line Options}),
8429then @code{IGNORECASE} has no special meaning, and string
8430and regexp operations are always case-sensitive.
8431
8432@vindex OFMT
8433@item OFMT
8434This string controls conversion of numbers to
8435strings (@pxref{Conversion, ,Conversion of Strings and Numbers}) for
8436printing with the @code{print} statement.  It works by being passed, in
8437effect, as the first argument to the @code{sprintf} function
8438(@pxref{String Functions, ,Built-in Functions for String Manipulation}).
8439Its default value is @code{"%.6g"}.  Earlier versions of @code{awk}
8440also used @code{OFMT} to specify the format for converting numbers to
8441strings in general expressions; this is now done by @code{CONVFMT}.
8442
8443@vindex OFS
8444@item OFS
8445This is the output field separator (@pxref{Output Separators}).  It is
8446output between the fields output by a @code{print} statement.  Its
8447default value is @w{@code{" "}}, a string consisting of a single space.
8448
8449@vindex ORS
8450@item ORS
8451This is the output record separator.  It is output at the end of every
8452@code{print} statement.  Its default value is @code{"\n"}.
8453(@xref{Output Separators}.)
8454
8455@vindex RS
8456@item RS
8457This is @code{awk}'s input record separator.  Its default value is a string
8458containing a single newline character, which means that an input record
8459consists of a single line of text.
8460It can also be the null string, in which case records are separated by
8461runs of blank lines, or a regexp, in which case records are separated by
8462matches of the regexp in the input text.
8463(@xref{Records, ,How Input is Split into Records}.)
8464
8465@vindex SUBSEP
8466@item SUBSEP
8467@code{SUBSEP} is the subscript separator.  It has the default value of
8468@code{"\034"}, and is used to separate the parts of the indices of a
8469multi-dimensional array.  Thus, the expression @code{@w{foo["A", "B"]}}
8470really accesses @code{foo["A\034B"]}
8471(@pxref{Multi-dimensional, ,Multi-dimensional Arrays}).
8472@end table
8473
8474@node Auto-set, ARGC and ARGV, User-modified, Built-in Variables
8475@section Built-in Variables that Convey Information
8476@cindex built-in variables, convey information
8477
8478This is an alphabetical list of the variables that are set
8479automatically by @code{awk} on certain occasions in order to provide
8480information to your program.  Those variables that are specific to
8481@code{gawk} are marked with an asterisk, @samp{*}.
8482
8483@table @code
8484@vindex ARGC
8485@vindex ARGV
8486@item ARGC
8487@itemx ARGV
8488The command-line arguments available to @code{awk} programs are stored in
8489an array called @code{ARGV}.  @code{ARGC} is the number of command-line
8490arguments present.  @xref{Other Arguments, ,Other Command Line Arguments}.
8491Unlike most @code{awk} arrays,
8492@code{ARGV} is indexed from zero to @code{ARGC} @minus{} 1.  For example:
8493
8494@example
8495@group
8496$ awk 'BEGIN @{
8497>        for (i = 0; i < ARGC; i++)
8498>            print ARGV[i]
8499>      @}' inventory-shipped BBS-list
8500@print{} awk
8501@print{} inventory-shipped
8502@print{} BBS-list
8503@end group
8504@end example
8505
8506@noindent
8507In this example, @code{ARGV[0]} contains @code{"awk"}, @code{ARGV[1]}
8508contains @code{"inventory-shipped"}, and @code{ARGV[2]} contains
8509@code{"BBS-list"}.  The value of @code{ARGC} is three, one more than the
8510index of the last element in @code{ARGV}, since the elements are numbered
8511from zero.
8512
8513The names @code{ARGC} and @code{ARGV}, as well as the convention of indexing
8514the array from zero to @code{ARGC} @minus{} 1, are derived from the C language's
8515method of accessing command line arguments.
8516@xref{ARGC and ARGV, , Using @code{ARGC} and @code{ARGV}}, for information
8517about how @code{awk} uses these variables.
8518
8519@vindex ARGIND
8520@item ARGIND *
8521The index in @code{ARGV} of the current file being processed.
8522Every time @code{gawk} opens a new data file for processing, it sets
8523@code{ARGIND} to the index in @code{ARGV} of the file name.
8524When @code{gawk} is processing the input files, it is always
8525true that @samp{FILENAME == ARGV[ARGIND]}.
8526
8527This variable is useful in file processing; it allows you to tell how far
8528along you are in the list of data files, and to distinguish between
8529successive instances of the same filename on the command line.
8530
8531While you can change the value of @code{ARGIND} within your @code{awk}
8532program, @code{gawk} will automatically set it to a new value when the
8533next file is opened.
8534
8535This variable is a @code{gawk} extension. In other @code{awk} implementations,
8536or if @code{gawk} is in compatibility mode
8537(@pxref{Options, ,Command Line Options}),
8538it is not special.
8539
8540@vindex ENVIRON
8541@item ENVIRON
8542An associative array that contains the values of the environment.  The array
8543indices are the environment variable names; the values are the values of
8544the particular environment variables.  For example,
8545@code{ENVIRON["HOME"]} might be @file{/home/arnold}.  Changing this array
8546does not affect the environment passed on to any programs that
8547@code{awk} may spawn via redirection or the @code{system} function.
8548(In a future version of @code{gawk}, it may do so.)
8549
8550Some operating systems may not have environment variables.
8551On such systems, the @code{ENVIRON} array is empty (except for
8552@w{@code{ENVIRON["AWKPATH"]}}).
8553
8554@vindex ERRNO
8555@item ERRNO *
8556If a system error occurs either doing a redirection for @code{getline},
8557during a read for @code{getline}, or during a @code{close} operation,
8558then @code{ERRNO} will contain a string describing the error.
8559
8560This variable is a @code{gawk} extension. In other @code{awk} implementations,
8561or if @code{gawk} is in compatibility mode
8562(@pxref{Options, ,Command Line Options}),
8563it is not special.
8564
8565@cindex dark corner
8566@vindex FILENAME
8567@item FILENAME
8568This is the name of the file that @code{awk} is currently reading.
8569When no data files are listed on the command line, @code{awk} reads
8570from the standard input, and @code{FILENAME} is set to @code{"-"}.
8571@code{FILENAME} is changed each time a new file is read
8572(@pxref{Reading Files, ,Reading Input Files}).
8573Inside a @code{BEGIN} rule, the value of @code{FILENAME} is
8574@code{""}, since there are no input files being processed
8575yet.@footnote{Some early implementations of Unix @code{awk} initialized
8576@code{FILENAME} to @code{"-"}, even if there were data files to be
8577processed. This behavior was incorrect, and should not be relied
8578upon in your programs.} (d.c.)
8579
8580@vindex FNR
8581@item FNR
8582@code{FNR} is the current record number in the current file.  @code{FNR} is
8583incremented each time a new record is read
8584(@pxref{Getline, ,Explicit Input with @code{getline}}).  It is reinitialized
8585to zero each time a new input file is started.
8586
8587@vindex NF
8588@item NF
8589@code{NF} is the number of fields in the current input record.
8590@code{NF} is set each time a new record is read, when a new field is
8591created, or when @code{$0} changes (@pxref{Fields, ,Examining Fields}).
8592
8593@vindex NR
8594@item NR
8595This is the number of input records @code{awk} has processed since
8596the beginning of the program's execution
8597(@pxref{Records, ,How Input is Split into Records}).
8598@code{NR} is set each time a new record is read.
8599
8600@vindex RLENGTH
8601@item RLENGTH
8602@code{RLENGTH} is the length of the substring matched by the
8603@code{match} function
8604(@pxref{String Functions, ,Built-in Functions for String Manipulation}).
8605@code{RLENGTH} is set by invoking the @code{match} function.  Its value
8606is the length of the matched string, or @minus{}1 if no match was found.
8607
8608@vindex RSTART
8609@item RSTART
8610@code{RSTART} is the start-index in characters of the substring matched by the
8611@code{match} function
8612(@pxref{String Functions, ,Built-in Functions for String Manipulation}).
8613@code{RSTART} is set by invoking the @code{match} function.  Its value
8614is the position of the string where the matched substring starts, or zero
8615if no match was found.
8616
8617@vindex RT
8618@item RT *
8619@code{RT} is set each time a record is read. It contains the input text
8620that matched the text denoted by @code{RS}, the record separator.
8621
8622This variable is a @code{gawk} extension. In other @code{awk} implementations,
8623or if @code{gawk} is in compatibility mode
8624(@pxref{Options, ,Command Line Options}),
8625it is not special.
8626@end table
8627
8628@cindex dark corner
8629A side note about @code{NR} and @code{FNR}.
8630@code{awk} simply increments both of these variables
8631each time it reads a record, instead of setting them to the absolute
8632value of the number of records read.  This means that your program can
8633change these variables, and their new values will be incremented for
8634each record (d.c.).  For example:
8635
8636@example
8637@group
8638$ echo '1
8639> 2
8640> 3
8641> 4' | awk 'NR == 2 @{ NR = 17 @}
8642> @{ print NR @}'
8643@print{} 1
8644@print{} 17
8645@print{} 18
8646@print{} 19
8647@end group
8648@end example
8649
8650@noindent
8651Before @code{FNR} was added to the @code{awk} language
8652(@pxref{V7/SVR3.1, ,Major Changes between V7 and SVR3.1}),
8653many @code{awk} programs used this feature to track the number of
8654records in a file by resetting @code{NR} to zero when @code{FILENAME}
8655changed.
8656
8657@node ARGC and ARGV, , Auto-set, Built-in Variables
8658@section Using @code{ARGC} and @code{ARGV}
8659
8660In @ref{Auto-set,  ,  Built-in Variables that Convey Information},
8661you saw this program describing the information contained in @code{ARGC}
8662and @code{ARGV}:
8663
8664@example
8665@group
8666$ awk 'BEGIN @{
8667>        for (i = 0; i < ARGC; i++)
8668>            print ARGV[i]
8669>      @}' inventory-shipped BBS-list
8670@print{} awk
8671@print{} inventory-shipped
8672@print{} BBS-list
8673@end group
8674@end example
8675
8676@noindent
8677In this example, @code{ARGV[0]} contains @code{"awk"}, @code{ARGV[1]}
8678contains @code{"inventory-shipped"}, and @code{ARGV[2]} contains
8679@code{"BBS-list"}.
8680
8681Notice that the @code{awk} program is not entered in @code{ARGV}.  The
8682other special command line options, with their arguments, are also not
8683entered.  This includes variable assignments done with the @samp{-v}
8684option (@pxref{Options, ,Command Line Options}).
8685Normal variable assignments on the command line @emph{are}
8686treated as arguments, and do show up in the @code{ARGV} array.
8687
8688@example
8689$ cat showargs.awk
8690@print{} BEGIN @{
8691@print{}     printf "A=%d, B=%d\n", A, B
8692@print{}     for (i = 0; i < ARGC; i++)
8693@print{}         printf "\tARGV[%d] = %s\n", i, ARGV[i]
8694@print{} @}
8695@print{} END   @{ printf "A=%d, B=%d\n", A, B @}
8696$ awk -v A=1 -f showargs.awk B=2 /dev/null
8697@print{} A=1, B=0
8698@print{} 	ARGV[0] = awk
8699@print{} 	ARGV[1] = B=2
8700@print{} 	ARGV[2] = /dev/null
8701@print{} A=1, B=2
8702@end example
8703
8704Your program can alter @code{ARGC} and the elements of @code{ARGV}.
8705Each time @code{awk} reaches the end of an input file, it uses the next
8706element of @code{ARGV} as the name of the next input file.  By storing a
8707different string there, your program can change which files are read.
8708You can use @code{"-"} to represent the standard input.  By storing
8709additional elements and incrementing @code{ARGC} you can cause
8710additional files to be read.
8711
8712If you decrease the value of @code{ARGC}, that eliminates input files
8713from the end of the list.  By recording the old value of @code{ARGC}
8714elsewhere, your program can treat the eliminated arguments as
8715something other than file names.
8716
8717To eliminate a file from the middle of the list, store the null string
8718(@code{""}) into @code{ARGV} in place of the file's name.  As a
8719special feature, @code{awk} ignores file names that have been
8720replaced with the null string.
8721You may also use the @code{delete} statement to remove elements from
8722@code{ARGV} (@pxref{Delete, ,The @code{delete} Statement}).
8723
8724All of these actions are typically done from the @code{BEGIN} rule,
8725before actual processing of the input begins.
8726@xref{Split Program, ,Splitting a Large File Into Pieces}, and see
8727@ref{Tee Program, ,Duplicating Output Into Multiple Files}, for an example
8728of each way of removing elements from @code{ARGV}.
8729
8730The following fragment processes @code{ARGV} in order to examine, and
8731then remove, command line options.
8732
8733@example
8734@group
8735BEGIN @{
8736    for (i = 1; i < ARGC; i++) @{
8737        if (ARGV[i] == "-v")
8738            verbose = 1
8739        else if (ARGV[i] == "-d")
8740            debug = 1
8741@end group
8742@group
8743        else if (ARGV[i] ~ /^-?/) @{
8744            e = sprintf("%s: unrecognized option -- %c",
8745                    ARGV[0], substr(ARGV[i], 1, ,1))
8746            print e > "/dev/stderr"
8747        @} else
8748            break
8749        delete ARGV[i]
8750    @}
8751@}
8752@end group
8753@end example
8754
8755To actually get the options into the @code{awk} program, you have to
8756end the @code{awk} options with @samp{--}, and then supply your options,
8757like so:
8758
8759@example
8760awk -f myprog -- -v -d file1 file2 @dots{}
8761@end example
8762
8763@cindex differences between @code{gawk} and @code{awk}
8764This is not necessary in @code{gawk}: Unless @samp{--posix} has been
8765specified, @code{gawk} silently puts any unrecognized options into
8766@code{ARGV} for the @code{awk} program to deal with.
8767
8768As soon as it
8769sees an unknown option, @code{gawk} stops looking for other options it might
8770otherwise recognize.  The above example with @code{gawk} would be:
8771
8772@example
8773gawk -f myprog -d -v file1 file2 @dots{}
8774@end example
8775
8776@noindent
8777Since @samp{-d} is not a valid @code{gawk} option, the following @samp{-v}
8778is passed on to the @code{awk} program.
8779
8780@node Arrays, Built-in, Built-in Variables, Top
8781@chapter Arrays in @code{awk}
8782
8783An @dfn{array} is a table of values, called @dfn{elements}.  The
8784elements of an array are distinguished by their indices.  @dfn{Indices}
8785may be either numbers or strings.  @code{awk} maintains a single set
8786of names that may be used for naming variables, arrays and functions
8787(@pxref{User-defined, ,User-defined Functions}).
8788Thus, you cannot have a variable and an array with the same name in the
8789same @code{awk} program.
8790
8791@menu
8792* Array Intro::                 Introduction to Arrays
8793* Reference to Elements::       How to examine one element of an array.
8794* Assigning Elements::          How to change an element of an array.
8795* Array Example::               Basic Example of an Array
8796* Scanning an Array::           A variation of the @code{for} statement. It
8797                                loops through the indices of an array's
8798                                existing elements.
8799* Delete::                      The @code{delete} statement removes an element
8800                                from an array.
8801* Numeric Array Subscripts::    How to use numbers as subscripts in
8802                                @code{awk}.
8803* Uninitialized Subscripts::    Using Uninitialized variables as subscripts.
8804* Multi-dimensional::           Emulating multi-dimensional arrays in
8805                                @code{awk}.
8806* Multi-scanning::              Scanning multi-dimensional arrays.
8807* Array Efficiency::            Implementation-specific tips.
8808@end menu
8809
8810@node Array Intro, Reference to Elements, Arrays, Arrays
8811@section Introduction to Arrays
8812
8813@cindex arrays
8814The @code{awk} language provides one-dimensional @dfn{arrays} for storing groups
8815of related strings or numbers.
8816
8817Every @code{awk} array must have a name.  Array names have the same
8818syntax as variable names; any valid variable name would also be a valid
8819array name.  But you cannot use one name in both ways (as an array and
8820as a variable) in one @code{awk} program.
8821
8822Arrays in @code{awk} superficially resemble arrays in other programming
8823languages; but there are fundamental differences.  In @code{awk}, you
8824don't need to specify the size of an array before you start to use it.
8825Additionally, any number or string in @code{awk} may be used as an
8826array index, not just consecutive integers.
8827
8828In most other languages, you have to @dfn{declare} an array and specify
8829how many elements or components it contains.  In such languages, the
8830declaration causes a contiguous block of memory to be allocated for that
8831many elements.  An index in the array usually must be a positive integer; for
8832example, the index zero specifies the first element in the array, which is
8833actually stored at the beginning of the block of memory.  Index one
8834specifies the second element, which is stored in memory right after the
8835first element, and so on.  It is impossible to add more elements to the
8836array, because it has room for only as many elements as you declared.
8837(Some languages allow arbitrary starting and ending indices,
8838e.g., @samp{15 .. 27}, but the size of the array is still fixed when
8839the array is declared.)
8840
8841A contiguous array of four elements might look like this,
8842conceptually, if the element values are eight, @code{"foo"},
8843@code{""} and 30:
8844
8845@iftex
8846@c from Karl Berry, much thanks for the help.
8847@tex
8848\bigskip % space above the table (about 1 linespace)
8849\offinterlineskip
8850\newdimen\width \width = 1.5cm
8851\newdimen\hwidth \hwidth = 4\width \advance\hwidth by 2pt % 5 * 0.4pt
8852\centerline{\vbox{
8853\halign{\strut\hfil\ignorespaces#&&\vrule#&\hbox to\width{\hfil#\unskip\hfil}\cr
8854\noalign{\hrule width\hwidth}
8855	&&{\tt 8} &&{\tt "foo"} &&{\tt ""} &&{\tt 30} &&\quad value\cr
8856\noalign{\hrule width\hwidth}
8857\noalign{\smallskip}
8858	&\omit&0&\omit &1   &\omit&2 &\omit&3 &\omit&\quad index\cr
8859}
8860}}
8861@end tex
8862@end iftex
8863@ifinfo
8864@example
8865+---------+---------+--------+---------+
8866|    8    |  "foo"  |   ""   |    30   |    @r{value}
8867+---------+---------+--------+---------+
8868     0         1         2         3        @r{index}
8869@end example
8870@end ifinfo
8871
8872@noindent
8873Only the values are stored; the indices are implicit from the order of
8874the values.  Eight is the value at index zero, because eight appears in the
8875position with zero elements before it.
8876
8877@cindex arrays, definition of
8878@cindex associative arrays
8879@cindex arrays, associative
8880Arrays in @code{awk} are different: they are @dfn{associative}.  This means
8881that each array is a collection of pairs: an index, and its corresponding
8882array element value:
8883
8884@example
8885@r{Element} 4     @r{Value} 30
8886@r{Element} 2     @r{Value} "foo"
8887@r{Element} 1     @r{Value} 8
8888@r{Element} 3     @r{Value} ""
8889@end example
8890
8891@noindent
8892We have shown the pairs in jumbled order because their order is irrelevant.
8893
8894One advantage of associative arrays is that new pairs can be added
8895at any time.  For example, suppose we add to the above array a tenth element
8896whose value is @w{@code{"number ten"}}.  The result is this:
8897
8898@example
8899@r{Element} 10    @r{Value} "number ten"
8900@r{Element} 4     @r{Value} 30
8901@r{Element} 2     @r{Value} "foo"
8902@r{Element} 1     @r{Value} 8
8903@r{Element} 3     @r{Value} ""
8904@end example
8905
8906@noindent
8907@cindex sparse arrays
8908@cindex arrays, sparse
8909Now the array is @dfn{sparse}, which just means some indices are missing:
8910it has elements 1--4 and 10, but doesn't have elements 5, 6, 7, 8, or 9.
8911@c ok, I should spell out the above, but ...
8912
8913Another consequence of associative arrays is that the indices don't
8914have to be positive integers.  Any number, or even a string, can be
8915an index.  For example, here is an array which translates words from
8916English into French:
8917
8918@example
8919@r{Element} "dog" @r{Value} "chien"
8920@r{Element} "cat" @r{Value} "chat"
8921@r{Element} "one" @r{Value} "un"
8922@r{Element} 1     @r{Value} "un"
8923@end example
8924
8925@noindent
8926Here we decided to translate the number one in both spelled-out and
8927numeric form---thus illustrating that a single array can have both
8928numbers and strings as indices.
8929(In fact, array subscripts are always strings; this is discussed
8930in more detail in
8931@ref{Numeric Array Subscripts, ,Using Numbers to Subscript Arrays}.)
8932
8933@cindex Array subscripts and @code{IGNORECASE}
8934@cindex @code{IGNORECASE} and array subscripts
8935@vindex IGNORECASE
8936The value of @code{IGNORECASE} has no effect upon array subscripting.
8937You must use the exact same string value to retrieve an array element
8938as you used to store it.
8939
8940When @code{awk} creates an array for you, e.g., with the @code{split}
8941built-in function,
8942that array's indices are consecutive integers starting at one.
8943(@xref{String Functions, ,Built-in Functions for String Manipulation}.)
8944
8945@node Reference to Elements, Assigning Elements, Array Intro, Arrays
8946@section Referring to an Array Element
8947@cindex array reference
8948@cindex element of array
8949@cindex reference to array
8950
8951The principal way of using an array is to refer to one of its elements.
8952An array reference is an expression which looks like this:
8953
8954@example
8955@var{array}[@var{index}]
8956@end example
8957
8958@noindent
8959Here, @var{array} is the name of an array.  The expression @var{index} is
8960the index of the element of the array that you want.
8961
8962The value of the array reference is the current value of that array
8963element.  For example, @code{foo[4.3]} is an expression for the element
8964of array @code{foo} at index @samp{4.3}.
8965
8966If you refer to an array element that has no recorded value, the value
8967of the reference is @code{""}, the null string.  This includes elements
8968to which you have not assigned any value, and elements that have been
8969deleted (@pxref{Delete, ,The @code{delete} Statement}).  Such a reference
8970automatically creates that array element, with the null string as its value.
8971(In some cases, this is unfortunate, because it might waste memory inside
8972@code{awk}.)
8973
8974@cindex arrays, presence of elements
8975@cindex arrays, the @code{in} operator
8976You can find out if an element exists in an array at a certain index with
8977the expression:
8978
8979@example
8980@var{index} in @var{array}
8981@end example
8982
8983@noindent
8984This expression tests whether or not the particular index exists,
8985without the side effect of creating that element if it is not present.
8986The expression has the value one (true) if @code{@var{array}[@var{index}]}
8987exists, and zero (false) if it does not exist.
8988
8989For example, to test whether the array @code{frequencies} contains the
8990index @samp{2}, you could write this statement:
8991
8992@example
8993if (2 in frequencies)
8994    print "Subscript 2 is present."
8995@end example
8996
8997Note that this is @emph{not} a test of whether or not the array
8998@code{frequencies} contains an element whose @emph{value} is two.
8999(There is no way to do that except to scan all the elements.)  Also, this
9000@emph{does not} create @code{frequencies[2]}, while the following
9001(incorrect) alternative would do so:
9002
9003@example
9004if (frequencies[2] != "")
9005    print "Subscript 2 is present."
9006@end example
9007
9008@node Assigning Elements, Array Example, Reference to Elements, Arrays
9009@section Assigning Array Elements
9010@cindex array assignment
9011@cindex element assignment
9012
9013Array elements are lvalues: they can be assigned values just like
9014@code{awk} variables:
9015
9016@example
9017@var{array}[@var{subscript}] = @var{value}
9018@end example
9019
9020@noindent
9021Here @var{array} is the name of your array.  The expression
9022@var{subscript} is the index of the element of the array that you want
9023to assign a value.  The expression @var{value} is the value you are
9024assigning to that element of the array.
9025
9026@node Array Example, Scanning an Array, Assigning Elements, Arrays
9027@section Basic Array Example
9028
9029The following program takes a list of lines, each beginning with a line
9030number, and prints them out in order of line number.  The line numbers are
9031not in order, however, when they are first read:  they are scrambled.  This
9032program sorts the lines by making an array using the line numbers as
9033subscripts.  It then prints out the lines in sorted order of their numbers.
9034It is a very simple program, and gets confused if it encounters repeated
9035numbers, gaps, or lines that don't begin with a number.
9036
9037@example
9038@group
9039@c file eg/misc/arraymax.awk
9040@{
9041  if ($1 > max)
9042    max = $1
9043  arr[$1] = $0
9044@}
9045@end group
9046
9047END @{
9048  for (x = 1; x <= max; x++)
9049    print arr[x]
9050@}
9051@c endfile
9052@end example
9053
9054The first rule keeps track of the largest line number seen so far;
9055it also stores each line into the array @code{arr}, at an index that
9056is the line's number.
9057
9058The second rule runs after all the input has been read, to print out
9059all the lines.
9060
9061When this program is run with the following input:
9062
9063@example
9064@group
9065@c file eg/misc/arraymax.data
90665  I am the Five man
90672  Who are you?  The new number two!
90684  . . . And four on the floor
90691  Who is number one?
90703  I three you.
9071@c endfile
9072@end group
9073@end example
9074
9075@noindent
9076its output is this:
9077
9078@example
90791  Who is number one?
90802  Who are you?  The new number two!
90813  I three you.
90824  . . . And four on the floor
90835  I am the Five man
9084@end example
9085
9086If a line number is repeated, the last line with a given number overrides
9087the others.
9088
9089Gaps in the line numbers can be handled with an easy improvement to the
9090program's @code{END} rule:
9091
9092@example
9093END @{
9094  for (x = 1; x <= max; x++)
9095    if (x in arr)
9096      print arr[x]
9097@}
9098@end example
9099
9100@node Scanning an Array, Delete, Array Example, Arrays
9101@section Scanning All Elements of an Array
9102@cindex @code{for (x in @dots{})}
9103@cindex arrays, special @code{for} statement
9104@cindex scanning an array
9105
9106In programs that use arrays, you often need a loop that executes
9107once for each element of an array.  In other languages, where arrays are
9108contiguous and indices are limited to positive integers, this is
9109easy: you can
9110find all the valid indices by counting from the lowest index
9111up to the highest.  This
9112technique won't do the job in @code{awk}, since any number or string
9113can be an array index.  So @code{awk} has a special kind of @code{for}
9114statement for scanning an array:
9115
9116@example
9117for (@var{var} in @var{array})
9118  @var{body}
9119@end example
9120
9121@noindent
9122This loop executes @var{body} once for each index in @var{array} that your
9123program has previously used, with the
9124variable @var{var} set to that index.
9125
9126Here is a program that uses this form of the @code{for} statement.  The
9127first rule scans the input records and notes which words appear (at
9128least once) in the input, by storing a one into the array @code{used} with
9129the word as index.  The second rule scans the elements of @code{used} to
9130find all the distinct words that appear in the input.  It prints each
9131word that is more than 10 characters long, and also prints the number of
9132such words.  @xref{String Functions, ,Built-in Functions for String Manipulation}, for more information
9133on the built-in function @code{length}.
9134
9135@example
9136# Record a 1 for each word that is used at least once.
9137@{
9138    for (i = 1; i <= NF; i++)
9139        used[$i] = 1
9140@}
9141
9142# Find number of distinct words more than 10 characters long.
9143END @{
9144    for (x in used)
9145        if (length(x) > 10) @{
9146            ++num_long_words
9147            print x
9148        @}
9149    print num_long_words, "words longer than 10 characters"
9150@}
9151@end example
9152
9153@noindent
9154@xref{Word Sorting, ,Generating Word Usage Counts},
9155for a more detailed example of this type.
9156
9157The order in which elements of the array are accessed by this statement
9158is determined by the internal arrangement of the array elements within
9159@code{awk} and cannot be controlled or changed.  This can lead to
9160problems if new elements are added to @var{array} by statements in
9161the loop body; you cannot predict whether or not the @code{for} loop will
9162reach them.  Similarly, changing @var{var} inside the loop may produce
9163strange results.  It is best to avoid such things.
9164
9165@node Delete, Numeric Array Subscripts, Scanning an Array, Arrays
9166@section The @code{delete} Statement
9167@cindex @code{delete} statement
9168@cindex deleting elements of arrays
9169@cindex removing elements of arrays
9170@cindex arrays, deleting an element
9171
9172You can remove an individual element of an array using the @code{delete}
9173statement:
9174
9175@example
9176delete @var{array}[@var{index}]
9177@end example
9178
9179Once you have deleted an array element, you can no longer obtain any
9180value the element once had.  It is as if you had never referred
9181to it and had never given it any value.
9182
9183Here is an example of deleting elements in an array:
9184
9185@example
9186for (i in frequencies)
9187  delete frequencies[i]
9188@end example
9189
9190@noindent
9191This example removes all the elements from the array @code{frequencies}.
9192
9193If you delete an element, a subsequent @code{for} statement to scan the array
9194will not report that element, and the @code{in} operator to check for
9195the presence of that element will return zero (i.e.@: false):
9196
9197@example
9198delete foo[4]
9199if (4 in foo)
9200    print "This will never be printed"
9201@end example
9202
9203It is important to note that deleting an element is @emph{not} the
9204same as assigning it a null value (the empty string, @code{""}).
9205
9206@example
9207foo[4] = ""
9208if (4 in foo)
9209  print "This is printed, even though foo[4] is empty"
9210@end example
9211
9212It is not an error to delete an element that does not exist.
9213
9214@cindex arrays, deleting entire contents
9215@cindex deleting entire arrays
9216@cindex differences between @code{gawk} and @code{awk}
9217You can delete all the elements of an array with a single statement,
9218by leaving off the subscript in the @code{delete} statement.
9219
9220@example
9221delete @var{array}
9222@end example
9223
9224This ability is a @code{gawk} extension; it is not available in
9225compatibility mode (@pxref{Options, ,Command Line Options}).
9226
9227Using this version of the @code{delete} statement is about three times
9228more efficient than the equivalent loop that deletes each element one
9229at a time.
9230
9231@cindex portability issues
9232The following statement provides a portable, but non-obvious way to clear
9233out an array.
9234
9235@cindex Brennan, Michael
9236@example
9237@group
9238# thanks to Michael Brennan for pointing this out
9239split("", array)
9240@end group
9241@end example
9242
9243The @code{split} function
9244(@pxref{String Functions, ,Built-in Functions for String Manipulation})
9245clears out the target array first. This call asks it to split
9246apart the null string. Since there is no data to split out, the
9247function simply clears the array and then returns.
9248
9249@strong{Caution:} Deleting an array does not change its type; you cannot
9250delete an array and then use the array's name as a scalar. For
9251example, this will not work:
9252
9253@example
9254a[1] = 3; delete a; a = 3
9255@end example
9256
9257@node Numeric Array Subscripts, Uninitialized Subscripts, Delete, Arrays
9258@section Using Numbers to Subscript Arrays
9259
9260An important aspect of arrays to remember is that @emph{array subscripts
9261are always strings}.  If you use a numeric value as a subscript,
9262it will be converted to a string value before it is used for subscripting
9263(@pxref{Conversion, ,Conversion of Strings and Numbers}).
9264
9265@cindex conversions, during subscripting
9266@cindex numbers, used as subscripts
9267@vindex CONVFMT
9268This means that the value of the built-in variable @code{CONVFMT} can potentially
9269affect how your program accesses elements of an array.  For example:
9270
9271@example
9272xyz = 12.153
9273data[xyz] = 1
9274CONVFMT = "%2.2f"
9275@group
9276if (xyz in data)
9277    printf "%s is in data\n", xyz
9278else
9279    printf "%s is not in data\n", xyz
9280@end group
9281@end example
9282
9283@noindent
9284This prints @samp{12.15 is not in data}.  The first statement gives
9285@code{xyz} a numeric value.  Assigning to
9286@code{data[xyz]} subscripts @code{data} with the string value @code{"12.153"}
9287(using the default conversion value of @code{CONVFMT}, @code{"%.6g"}),
9288and assigns one to @code{data["12.153"]}.  The program then changes
9289the value of @code{CONVFMT}.  The test @samp{(xyz in data)} generates a new
9290string value from @code{xyz}, this time @code{"12.15"}, since the value of
9291@code{CONVFMT} only allows two significant digits.  This test fails,
9292since @code{"12.15"} is a different string from @code{"12.153"}.
9293
9294According to the rules for conversions
9295(@pxref{Conversion, ,Conversion of Strings and Numbers}), integer
9296values are always converted to strings as integers, no matter what the
9297value of @code{CONVFMT} may happen to be.  So the usual case of:
9298
9299@example
9300for (i = 1; i <= maxsub; i++)
9301    @i{do something with} array[i]
9302@end example
9303
9304@noindent
9305will work, no matter what the value of @code{CONVFMT}.
9306
9307Like many things in @code{awk}, the majority of the time things work
9308as you would expect them to work.  But it is useful to have a precise
9309knowledge of the actual rules, since sometimes they can have a subtle
9310effect on your programs.
9311
9312@node Uninitialized Subscripts, Multi-dimensional, Numeric Array Subscripts, Arrays
9313@section Using Uninitialized Variables as Subscripts
9314
9315@cindex uninitialized variables, as array subscripts
9316@cindex array subscripts, uninitialized variables
9317Suppose you want to print your input data in reverse order.
9318A reasonable attempt at a program to do so (with some test
9319data) might look like this:
9320
9321@example
9322@group
9323$ echo 'line 1
9324> line 2
9325> line 3' | awk '@{ l[lines] = $0; ++lines @}
9326> END @{
9327>     for (i = lines-1; i >= 0; --i)
9328>        print l[i]
9329> @}'
9330@print{} line 3
9331@print{} line 2
9332@end group
9333@end example
9334
9335Unfortunately, the very first line of input data did not come out in the
9336output!
9337
9338At first glance, this program should have worked.  The variable @code{lines}
9339is uninitialized, and uninitialized variables have the numeric value zero.
9340So, @code{awk} should have printed the value of @code{l[0]}.
9341
9342The issue here is that subscripts for @code{awk} arrays are @strong{always}
9343strings. And uninitialized variables, when used as strings, have the
9344value @code{""}, not zero.  Thus, @samp{line 1} ended up stored in
9345@code{l[""]}.
9346
9347The following version of the program works correctly:
9348
9349@example
9350@{ l[lines++] = $0 @}
9351END @{
9352    for (i = lines - 1; i >= 0; --i)
9353       print l[i]
9354@}
9355@end example
9356
9357Here, the @samp{++} forces @code{lines} to be numeric, thus making
9358the ``old value'' numeric zero, which is then converted to @code{"0"}
9359as the array subscript.
9360
9361@cindex null string, as array subscript
9362@cindex dark corner
9363As we have just seen, even though it is somewhat unusual, the null string
9364(@code{""}) is a valid array subscript (d.c.). If @samp{--lint} is provided
9365on the command line (@pxref{Options, ,Command Line Options}),
9366@code{gawk} will warn about the use of the null string as a subscript.
9367
9368@node Multi-dimensional, Multi-scanning, Uninitialized Subscripts, Arrays
9369@section Multi-dimensional Arrays
9370
9371@cindex subscripts in arrays
9372@cindex arrays, multi-dimensional subscripts
9373@cindex multi-dimensional subscripts
9374A multi-dimensional array is an array in which an element is identified
9375by a sequence of indices, instead of a single index.  For example, a
9376two-dimensional array requires two indices.  The usual way (in most
9377languages, including @code{awk}) to refer to an element of a
9378two-dimensional array named @code{grid} is with
9379@code{grid[@var{x},@var{y}]}.
9380
9381@vindex SUBSEP
9382Multi-dimensional arrays are supported in @code{awk} through
9383concatenation of indices into one string.  What happens is that
9384@code{awk} converts the indices into strings
9385(@pxref{Conversion, ,Conversion of Strings and Numbers}) and
9386concatenates them together, with a separator between them.  This creates
9387a single string that describes the values of the separate indices.  The
9388combined string is used as a single index into an ordinary,
9389one-dimensional array.  The separator used is the value of the built-in
9390variable @code{SUBSEP}.
9391
9392For example, suppose we evaluate the expression @samp{foo[5,12] = "value"}
9393when the value of @code{SUBSEP} is @code{"@@"}.  The numbers five and 12 are
9394converted to strings and
9395concatenated with an @samp{@@} between them, yielding @code{"5@@12"}; thus,
9396the array element @code{foo["5@@12"]} is set to @code{"value"}.
9397
9398Once the element's value is stored, @code{awk} has no record of whether
9399it was stored with a single index or a sequence of indices.  The two
9400expressions @samp{foo[5,12]} and @w{@samp{foo[5 SUBSEP 12]}} are always
9401equivalent.
9402
9403The default value of @code{SUBSEP} is the string @code{"\034"},
9404which contains a non-printing character that is unlikely to appear in an
9405@code{awk} program or in most input data.
9406
9407The usefulness of choosing an unlikely character comes from the fact
9408that index values that contain a string matching @code{SUBSEP} lead to
9409combined strings that are ambiguous.  Suppose that @code{SUBSEP} were
9410@code{"@@"}; then @w{@samp{foo["a@@b", "c"]}} and @w{@samp{foo["a",
9411"b@@c"]}} would be indistinguishable because both would actually be
9412stored as @samp{foo["a@@b@@c"]}.
9413
9414You can test whether a particular index-sequence exists in a
9415``multi-dimensional'' array with the same operator @samp{in} used for single
9416dimensional arrays.  Instead of a single index as the left-hand operand,
9417write the whole sequence of indices, separated by commas, in
9418parentheses:
9419
9420@example
9421(@var{subscript1}, @var{subscript2}, @dots{}) in @var{array}
9422@end example
9423
9424The following example treats its input as a two-dimensional array of
9425fields; it rotates this array 90 degrees clockwise and prints the
9426result.  It assumes that all lines have the same number of
9427elements.
9428
9429@example
9430@group
9431awk '@{
9432     if (max_nf < NF)
9433          max_nf = NF
9434     max_nr = NR
9435     for (x = 1; x <= NF; x++)
9436          vector[x, NR] = $x
9437@}
9438@end group
9439
9440@group
9441END @{
9442     for (x = 1; x <= max_nf; x++) @{
9443          for (y = max_nr; y >= 1; --y)
9444               printf("%s ", vector[x, y])
9445          printf("\n")
9446     @}
9447@}'
9448@end group
9449@end example
9450
9451@noindent
9452When given the input:
9453
9454@example
9455@group
94561 2 3 4 5 6
94572 3 4 5 6 1
94583 4 5 6 1 2
94594 5 6 1 2 3
9460@end group
9461@end example
9462
9463@noindent
9464it produces:
9465
9466@example
9467@group
94684 3 2 1
94695 4 3 2
94706 5 4 3
94711 6 5 4
94722 1 6 5
94733 2 1 6
9474@end group
9475@end example
9476
9477@node Multi-scanning,  Array Efficiency, Multi-dimensional, Arrays
9478@section Scanning Multi-dimensional Arrays
9479
9480There is no special @code{for} statement for scanning a
9481``multi-dimensional'' array; there cannot be one, because in truth there
9482are no multi-dimensional arrays or elements; there is only a
9483multi-dimensional @emph{way of accessing} an array.
9484
9485However, if your program has an array that is always accessed as
9486multi-dimensional, you can get the effect of scanning it by combining
9487the scanning @code{for} statement
9488(@pxref{Scanning an Array, ,Scanning All Elements of an Array}) with the
9489@code{split} built-in function
9490(@pxref{String Functions, ,Built-in Functions for String Manipulation}).
9491It works like this:
9492
9493@example
9494for (combined in array) @{
9495  split(combined, separate, SUBSEP)
9496  @dots{}
9497@}
9498@end example
9499
9500@noindent
9501This sets @code{combined} to
9502each concatenated, combined index in the array, and splits it
9503into the individual indices by breaking it apart where the value of
9504@code{SUBSEP} appears.  The split-out indices become the elements of
9505the array @code{separate}.
9506
9507Thus, suppose you have previously stored a value in @code{array[1, "foo"]};
9508then an element with index @code{"1\034foo"} exists in
9509@code{array}.  (Recall that the default value of @code{SUBSEP} is
9510the character with code 034.)  Sooner or later the @code{for} statement
9511will find that index and do an iteration with @code{combined} set to
9512@code{"1\034foo"}.  Then the @code{split} function is called as
9513follows:
9514
9515@example
9516split("1\034foo", separate, "\034")
9517@end example
9518
9519@noindent
9520The result of this is to set @code{separate[1]} to @code{"1"} and
9521@code{separate[2]} to @code{"foo"}.  Presto, the original sequence of
9522separate indices has been recovered.
9523
9524@node Array Efficiency, , Multi-scanning, Arrays
9525@section Using Array Memory Efficiently
9526
9527This section applies just to @code{gawk}.
9528
9529It is often useful to use the same bit of data as an index
9530into multiple arrays.
9531Due to the way @code{gawk} implements associative arrays,
9532when you need to use input data as an index for multiple
9533arrays, it is much more effecient to assign the input field
9534to a separate variable, and then use that variable as the index.
9535
9536@example
9537@{
9538      name = $1
9539      ssn = $2
9540      nkids = $3
9541      @dots{}
9542      seniority[name]++    # better than seniority[$1]++
9543      kids[name] = nkids   # better than kids[$1] = nkids
9544@}
9545@end example
9546
9547Using separate variables with mnemonic names for the input fields
9548makes programs more readable, in any case.
9549It is an eventual goal to make @code{gawk}'s array indexing as efficient
9550as possible, no matter what the source of the index value.
9551
9552@node Built-in, User-defined, Arrays, Top
9553@chapter Built-in Functions
9554
9555@c 2e: USE TEXINFO-2 FUNCTION DEFINITION STUFF!!!!!!!!!!!!!
9556@cindex built-in functions
9557@dfn{Built-in} functions are functions that are always available for
9558your @code{awk} program to call.  This chapter defines all the built-in
9559functions in @code{awk}; some of them are mentioned in other sections,
9560but they are summarized here for your convenience.  (You can also define
9561new functions yourself.  @xref{User-defined, ,User-defined Functions}.)
9562
9563@menu
9564* Calling Built-in::            How to call built-in functions.
9565* Numeric Functions::           Functions that work with numbers, including
9566                                @code{int}, @code{sin} and @code{rand}.
9567* String Functions::            Functions for string manipulation, such as
9568                                @code{split}, @code{match}, and
9569                                @code{sprintf}.
9570* I/O Functions::               Functions for files and shell commands.
9571* Time Functions::              Functions for dealing with time stamps.
9572@end menu
9573
9574@node Calling Built-in, Numeric Functions, Built-in, Built-in
9575@section Calling Built-in Functions
9576
9577To call a built-in function, write the name of the function followed
9578by arguments in parentheses.  For example, @samp{atan2(y + z, 1)}
9579is a call to the function @code{atan2}, with two arguments.
9580
9581Whitespace is ignored between the built-in function name and the
9582open-parenthesis, but we recommend that you avoid using whitespace
9583there.  User-defined functions do not permit whitespace in this way, and
9584you will find it easier to avoid mistakes by following a simple
9585convention which always works: no whitespace after a function name.
9586
9587@cindex differences between @code{gawk} and @code{awk}
9588Each built-in function accepts a certain number of arguments.
9589In some cases, arguments can be omitted. The defaults for omitted
9590arguments vary from function to function and are described under the
9591individual functions.  In some @code{awk} implementations, extra
9592arguments given to built-in functions are ignored.  However, in @code{gawk},
9593it is a fatal error to give extra arguments to a built-in function.
9594
9595When a function is called, expressions that create the function's actual
9596parameters are evaluated completely before the function call is performed.
9597For example, in the code fragment:
9598
9599@example
9600i = 4
9601j = sqrt(i++)
9602@end example
9603
9604@noindent
9605the variable @code{i} is set to five before @code{sqrt} is called
9606with a value of four for its actual parameter.
9607
9608@cindex evaluation, order of
9609@cindex order of evaluation
9610The order of evaluation of the expressions used for the function's
9611parameters is undefined.  Thus, you should not write programs that
9612assume that parameters are evaluated from left to right or from
9613right to left.  For example,
9614
9615@example
9616i = 5
9617j = atan2(i++, i *= 2)
9618@end example
9619
9620If the order of evaluation is left to right, then @code{i} first becomes
9621six, and then 12, and @code{atan2} is called with the two arguments six
9622and 12.  But if the order of evaluation is right to left, @code{i}
9623first becomes 10, and then 11, and @code{atan2} is called with the
9624two arguments 11 and 10.
9625
9626@node Numeric Functions, String Functions, Calling Built-in, Built-in
9627@section Numeric Built-in Functions
9628
9629Here is a full list of built-in functions that work with numbers.
9630Optional parameters are enclosed in square brackets (``['' and ``]'').
9631
9632@table @code
9633@item int(@var{x})
9634@findex int
9635This produces the nearest integer to @var{x}, located between @var{x} and zero,
9636truncated toward zero.
9637
9638For example, @code{int(3)} is three, @code{int(3.9)} is three, @code{int(-3.9)}
9639is @minus{}3, and @code{int(-3)} is @minus{}3 as well.
9640
9641@item sqrt(@var{x})
9642@findex sqrt
9643This gives you the positive square root of @var{x}.  It reports an error
9644if @var{x} is negative.  Thus, @code{sqrt(4)} is two.
9645
9646@item exp(@var{x})
9647@findex exp
9648This gives you the exponential of @var{x} (@code{e ^ @var{x}}), or reports
9649an error if @var{x} is out of range.  The range of values @var{x} can have
9650depends on your machine's floating point representation.
9651
9652@item log(@var{x})
9653@findex log
9654This gives you the natural logarithm of @var{x}, if @var{x} is positive;
9655otherwise, it reports an error.
9656
9657@item sin(@var{x})
9658@findex sin
9659This gives you the sine of @var{x}, with @var{x} in radians.
9660
9661@item cos(@var{x})
9662@findex cos
9663This gives you the cosine of @var{x}, with @var{x} in radians.
9664
9665@item atan2(@var{y}, @var{x})
9666@findex atan2
9667This gives you the arctangent of @code{@var{y} / @var{x}} in radians.
9668
9669@item rand()
9670@findex rand
9671This gives you a random number.  The values of @code{rand} are
9672uniformly-distributed between zero and one.
9673The value is never zero and never one.
9674
9675Often you want random integers instead.  Here is a user-defined function
9676you can use to obtain a random non-negative integer less than @var{n}:
9677
9678@example
9679function randint(n) @{
9680     return int(n * rand())
9681@}
9682@end example
9683
9684@noindent
9685The multiplication produces a random number greater than zero and less
9686than @code{n}.  We then make it an integer (using @code{int}) between zero
9687and @code{n} @minus{} 1, inclusive.
9688
9689Here is an example where a similar function is used to produce
9690random integers between one and @var{n}.  This program
9691prints a new random number for each input record.
9692
9693@example
9694@group
9695awk '
9696# Function to roll a simulated die.
9697function roll(n) @{ return 1 + int(rand() * n) @}
9698@end group
9699
9700@group
9701# Roll 3 six-sided dice and
9702# print total number of points.
9703@{
9704      printf("%d points\n",
9705             roll(6)+roll(6)+roll(6))
9706@}'
9707@end group
9708@end example
9709
9710@cindex seed for random numbers
9711@cindex random numbers, seed of
9712@comment MAWK uses a different seed each time.
9713@strong{Caution:} In most @code{awk} implementations, including @code{gawk},
9714@code{rand} starts generating numbers from the same
9715starting number, or @dfn{seed}, each time you run @code{awk}.  Thus,
9716a program will generate the same results each time you run it.
9717The numbers are random within one @code{awk} run, but predictable
9718from run to run.  This is convenient for debugging, but if you want
9719a program to do different things each time it is used, you must change
9720the seed to a value that will be different in each run.  To do this,
9721use @code{srand}.
9722
9723@item srand(@r{[}@var{x}@r{]})
9724@findex srand
9725The function @code{srand} sets the starting point, or seed,
9726for generating random numbers to the value @var{x}.
9727
9728Each seed value leads to a particular sequence of random
9729numbers.@footnote{Computer generated random numbers really are not truly
9730random.  They are technically known as ``pseudo-random.''  This means
9731that while the numbers in a sequence appear to be random, you can in
9732fact generate the same sequence of random numbers over and over again.}
9733Thus, if you set the seed to the same value a second time, you will get
9734the same sequence of random numbers again.
9735
9736If you omit the argument @var{x}, as in @code{srand()}, then the current
9737date and time of day are used for a seed.  This is the way to get random
9738numbers that are truly unpredictable.
9739
9740The return value of @code{srand} is the previous seed.  This makes it
9741easy to keep track of the seeds for use in consistently reproducing
9742sequences of random numbers.
9743@end table
9744
9745@node String Functions, I/O Functions, Numeric Functions, Built-in
9746@section Built-in Functions for String Manipulation
9747
9748The functions in this section look at or change the text of one or more
9749strings.
9750Optional parameters are enclosed in square brackets (``['' and ``]'').
9751
9752@table @code
9753@item index(@var{in}, @var{find})
9754@findex index
9755This searches the string @var{in} for the first occurrence of the string
9756@var{find}, and returns the position in characters where that occurrence
9757begins in the string @var{in}.  For example:
9758
9759@example
9760$ awk 'BEGIN @{ print index("peanut", "an") @}'
9761@print{} 3
9762@end example
9763
9764@noindent
9765If @var{find} is not found, @code{index} returns zero.
9766(Remember that string indices in @code{awk} start at one.)
9767
9768@item length(@r{[}@var{string}@r{]})
9769@findex length
9770This gives you the number of characters in @var{string}.  If
9771@var{string} is a number, the length of the digit string representing
9772that number is returned.  For example, @code{length("abcde")} is five.  By
9773contrast, @code{length(15 * 35)} works out to three.  How?  Well, 15 * 35 =
9774525, and 525 is then converted to the string @code{"525"}, which has
9775three characters.
9776
9777If no argument is supplied, @code{length} returns the length of @code{$0}.
9778
9779@cindex historical features
9780@cindex portability issues
9781@cindex @code{awk} language, POSIX version
9782@cindex POSIX @code{awk}
9783In older versions of @code{awk}, you could call the @code{length} function
9784without any parentheses.  Doing so is marked as ``deprecated'' in the
9785POSIX standard.  This means that while you can do this in your
9786programs, it is a feature that can eventually be removed from a future
9787version of the standard.  Therefore, for maximal portability of your
9788@code{awk} programs, you should always supply the parentheses.
9789
9790@item match(@var{string}, @var{regexp})
9791@findex match
9792The @code{match} function searches the string, @var{string}, for the
9793longest, leftmost substring matched by the regular expression,
9794@var{regexp}.  It returns the character position, or @dfn{index}, of
9795where that substring begins (one, if it starts at the beginning of
9796@var{string}).  If no match is found, it returns zero.
9797
9798@vindex RSTART
9799@vindex RLENGTH
9800The @code{match} function sets the built-in variable @code{RSTART} to
9801the index.  It also sets the built-in variable @code{RLENGTH} to the
9802length in characters of the matched substring.  If no match is found,
9803@code{RSTART} is set to zero, and @code{RLENGTH} to @minus{}1.
9804
9805For example:
9806
9807@example
9808@group
9809@c file eg/misc/findpat.sh
9810awk '@{
9811       if ($1 == "FIND")
9812         regex = $2
9813       else @{
9814         where = match($0, regex)
9815         if (where != 0)
9816           print "Match of", regex, "found at", \
9817                     where, "in", $0
9818       @}
9819@}'
9820@c endfile
9821@end group
9822@end example
9823
9824@noindent
9825This program looks for lines that match the regular expression stored in
9826the variable @code{regex}.  This regular expression can be changed.  If the
9827first word on a line is @samp{FIND}, @code{regex} is changed to be the
9828second word on that line.  Therefore, given:
9829
9830@example
9831@c file eg/misc/findpat.data
9832FIND ru+n
9833My program runs
9834but not very quickly
9835FIND Melvin
9836JF+KM
9837This line is property of Reality Engineering Co.
9838Melvin was here.
9839@c endfile
9840@end example
9841
9842@noindent
9843@code{awk} prints:
9844
9845@example
9846Match of ru+n found at 12 in My program runs
9847Match of Melvin found at 1 in Melvin was here.
9848@end example
9849
9850@item split(@var{string}, @var{array} @r{[}, @var{fieldsep}@r{]})
9851@findex split
9852This divides @var{string} into pieces separated by @var{fieldsep},
9853and stores the pieces in @var{array}.  The first piece is stored in
9854@code{@var{array}[1]}, the second piece in @code{@var{array}[2]}, and so
9855forth.  The string value of the third argument, @var{fieldsep}, is
9856a regexp describing where to split @var{string} (much as @code{FS} can
9857be a regexp describing where to split input records).  If
9858the @var{fieldsep} is omitted, the value of @code{FS} is used.
9859@code{split} returns the number of elements created.
9860
9861The @code{split} function splits strings into pieces in a
9862manner similar to the way input lines are split into fields.  For example:
9863
9864@example
9865split("cul-de-sac", a, "-")
9866@end example
9867
9868@noindent
9869splits the string @samp{cul-de-sac} into three fields using @samp{-} as the
9870separator.  It sets the contents of the array @code{a} as follows:
9871
9872@example
9873a[1] = "cul"
9874a[2] = "de"
9875a[3] = "sac"
9876@end example
9877
9878@noindent
9879The value returned by this call to @code{split} is three.
9880
9881As with input field-splitting, when the value of @var{fieldsep} is
9882@w{@code{" "}}, leading and trailing whitespace is ignored, and the elements
9883are separated by runs of whitespace.
9884
9885@cindex differences between @code{gawk} and @code{awk}
9886Also as with input field-splitting, if @var{fieldsep} is the null string, each
9887individual character in the string is split into its own array element.
9888(This is a @code{gawk}-specific extension.)
9889
9890@cindex dark corner
9891Recent implementations of @code{awk}, including @code{gawk}, allow
9892the third argument to be a regexp constant (@code{/abc/}), as well as a
9893string (d.c.).  The POSIX standard allows this as well.
9894
9895Before splitting the string, @code{split} deletes any previously existing
9896elements in the array @var{array} (d.c.).
9897
9898If @var{string} does not match @var{fieldsep} at all, @var{array} will have
9899one element. The value of that element will be the original
9900@var{string}.
9901
9902@item sprintf(@var{format}, @var{expression1},@dots{})
9903@findex sprintf
9904This returns (without printing) the string that @code{printf} would
9905have printed out with the same arguments
9906(@pxref{Printf, ,Using @code{printf} Statements for Fancier Printing}).
9907For example:
9908
9909@example
9910sprintf("pi = %.2f (approx.)", 22/7)
9911@end example
9912
9913@noindent
9914returns the string @w{@code{"pi = 3.14 (approx.)"}}.
9915
9916@ignore
99172e: For sub, gsub, and gensub, either here or in the "how much matches"
9918    section, we need some explanation that it is possible to match the
9919    null string when using closures like *.  E.g.,
9920
9921         $ echo abc | awk '{ gsub(/m*/, "X"); print }'
9922         @print{} XaXbXcX
9923
9924    Although this makes a certain amount of sense, it can be very
9925    suprising.
9926@end ignore
9927
9928@item sub(@var{regexp}, @var{replacement} @r{[}, @var{target}@r{]})
9929@findex sub
9930The @code{sub} function alters the value of @var{target}.
9931It searches this value, which is treated as a string, for the
9932leftmost longest substring matched by the regular expression, @var{regexp},
9933extending this match as far as possible.  Then the entire string is
9934changed by replacing the matched text with @var{replacement}.
9935The modified string becomes the new value of @var{target}.
9936
9937This function is peculiar because @var{target} is not simply
9938used to compute a value, and not just any expression will do: it
9939must be a variable, field or array element, so that @code{sub} can
9940store a modified value there.  If this argument is omitted, then the
9941default is to use and alter @code{$0}.
9942
9943For example:
9944
9945@example
9946str = "water, water, everywhere"
9947sub(/at/, "ith", str)
9948@end example
9949
9950@noindent
9951sets @code{str} to @w{@code{"wither, water, everywhere"}}, by replacing the
9952leftmost, longest occurrence of @samp{at} with @samp{ith}.
9953
9954The @code{sub} function returns the number of substitutions made (either
9955one or zero).
9956
9957If the special character @samp{&} appears in @var{replacement}, it
9958stands for the precise substring that was matched by @var{regexp}.  (If
9959the regexp can match more than one string, then this precise substring
9960may vary.)  For example:
9961
9962@example
9963awk '@{ sub(/candidate/, "& and his wife"); print @}'
9964@end example
9965
9966@noindent
9967changes the first occurrence of @samp{candidate} to @samp{candidate
9968and his wife} on each input line.
9969
9970Here is another example:
9971
9972@example
9973awk 'BEGIN @{
9974        str = "daabaaa"
9975        sub(/a+/, "C&C", str)
9976        print str
9977@}'
9978@print{} dCaaCbaaa
9979@end example
9980
9981@noindent
9982This shows how @samp{&} can represent a non-constant string, and also
9983illustrates the ``leftmost, longest'' rule in regexp matching
9984(@pxref{Leftmost Longest, ,How Much Text Matches?}).
9985
9986The effect of this special character (@samp{&}) can be turned off by putting a
9987backslash before it in the string.  As usual, to insert one backslash in
9988the string, you must write two backslashes.  Therefore, write @samp{\\&}
9989in a string constant to include a literal @samp{&} in the replacement.
9990For example, here is how to replace the first @samp{|} on each line with
9991an @samp{&}:
9992
9993@example
9994awk '@{ sub(/\|/, "\\&"); print @}'
9995@end example
9996
9997@cindex @code{sub}, third argument of
9998@cindex @code{gsub}, third argument of
9999@strong{Note:} As mentioned above, the third argument to @code{sub} must
10000be a variable, field or array reference.
10001Some versions of @code{awk} allow the third argument to
10002be an expression which is not an lvalue.  In such a case, @code{sub}
10003would still search for the pattern and return zero or one, but the result of
10004the substitution (if any) would be thrown away because there is no place
10005to put it.  Such versions of @code{awk} accept expressions like
10006this:
10007
10008@example
10009sub(/USA/, "United States", "the USA and Canada")
10010@end example
10011
10012@noindent
10013For historical compatibility, @code{gawk} will accept erroneous code,
10014such as in the above example. However, using any other non-changeable
10015object as the third parameter will cause a fatal error, and your program
10016will not run.
10017
10018Finally, if the @var{regexp} is not a regexp constant, it is converted into a
10019string and then the value of that string is treated as the regexp to match.
10020
10021@item gsub(@var{regexp}, @var{replacement} @r{[}, @var{target}@r{]})
10022@findex gsub
10023This is similar to the @code{sub} function, except @code{gsub} replaces
10024@emph{all} of the longest, leftmost, @emph{non-overlapping} matching
10025substrings it can find.  The @samp{g} in @code{gsub} stands for
10026``global,'' which means replace everywhere.  For example:
10027
10028@example
10029awk '@{ gsub(/Britain/, "United Kingdom"); print @}'
10030@end example
10031
10032@noindent
10033replaces all occurrences of the string @samp{Britain} with @samp{United
10034Kingdom} for all input records.
10035
10036The @code{gsub} function returns the number of substitutions made.  If
10037the variable to be searched and altered, @var{target}, is
10038omitted, then the entire input record, @code{$0}, is used.
10039
10040As in @code{sub}, the characters @samp{&} and @samp{\} are special,
10041and the third argument must be an lvalue.
10042@end table
10043
10044@table @code
10045@item gensub(@var{regexp}, @var{replacement}, @var{how} @r{[}, @var{target}@r{]})
10046@findex gensub
10047@code{gensub} is a general substitution function.  Like @code{sub} and
10048@code{gsub}, it searches the target string @var{target} for matches of
10049the regular expression @var{regexp}.  Unlike @code{sub} and
10050@code{gsub}, the modified string is returned as the result of the
10051function, and the original target string is @emph{not} changed.  If
10052@var{how} is a string beginning with @samp{g} or @samp{G}, then it
10053replaces all matches of @var{regexp} with @var{replacement}.
10054Otherwise, @var{how} is a number indicating which match of @var{regexp}
10055to replace. If no @var{target} is supplied, @code{$0} is used instead.
10056
10057@code{gensub} provides an additional feature that is not available
10058in @code{sub} or @code{gsub}: the ability to specify components of
10059a regexp in the replacement text.  This is done by using parentheses
10060in the regexp to mark the components, and then specifying @samp{\@var{n}}
10061in the replacement text, where @var{n} is a digit from one to nine.
10062For example:
10063
10064@example
10065@group
10066$ gawk '
10067> BEGIN @{
10068>      a = "abc def"
10069>      b = gensub(/(.+) (.+)/, "\\2 \\1", "g", a)
10070>      print b
10071> @}'
10072@print{} def abc
10073@end group
10074@end example
10075
10076@noindent
10077As described above for @code{sub}, you must type two backslashes in order
10078to get one into the string.
10079
10080In the replacement text, the sequence @samp{\0} represents the entire
10081matched text, as does the character @samp{&}.
10082
10083This example shows how you can use the third argument to control
10084which match of the regexp should be changed.
10085
10086@example
10087$ echo a b c a b c |
10088> gawk '@{ print gensub(/a/, "AA", 2) @}'
10089@print{} a b c AA b c
10090@end example
10091
10092In this case, @code{$0} is used as the default target string.
10093@code{gensub} returns the new string as its result, which is
10094passed directly to @code{print} for printing.
10095
10096If the @var{how} argument is a string that does not begin with @samp{g} or
10097@samp{G}, or if it is a number that is less than zero, only one
10098substitution is performed.
10099
10100If @var{regexp} does not match @var{target}, @code{gensub}'s return value
10101is the original, unchanged value of @var{target}.
10102
10103@cindex differences between @code{gawk} and @code{awk}
10104@code{gensub} is a @code{gawk} extension; it is not available
10105in compatibility mode (@pxref{Options, ,Command Line Options}).
10106
10107@item substr(@var{string}, @var{start} @r{[}, @var{length}@r{]})
10108@findex substr
10109This returns a @var{length}-character-long substring of @var{string},
10110starting at character number @var{start}.  The first character of a
10111string is character number one.  For example,
10112@code{substr("washington", 5, 3)} returns @code{"ing"}.
10113
10114If @var{length} is not present, this function returns the whole suffix of
10115@var{string} that begins at character number @var{start}.  For example,
10116@code{substr("washington", 5)} returns @code{"ington"}.  The whole
10117suffix is also returned
10118if @var{length} is greater than the number of characters remaining
10119in the string, counting from character number @var{start}.
10120
10121@strong{Note:} The string returned by @code{substr} @emph{cannot} be
10122assigned to.  Thus, it is a mistake to attempt to change a portion of
10123a string, like this:
10124
10125@example
10126string = "abcdef"
10127# try to get "abCDEf", won't work
10128substr(string, 3, 3) = "CDE"
10129@end example
10130
10131@noindent
10132or to use @code{substr} as the third agument of @code{sub} or @code{gsub}:
10133
10134@example
10135gsub(/xyz/, "pdq", substr($0, 5, 20))  # WRONG
10136@end example
10137
10138@cindex case conversion
10139@cindex conversion of case
10140@item tolower(@var{string})
10141@findex tolower
10142This returns a copy of @var{string}, with each upper-case character
10143in the string replaced with its corresponding lower-case character.
10144Non-alphabetic characters are left unchanged.  For example,
10145@code{tolower("MiXeD cAsE 123")} returns @code{"mixed case 123"}.
10146
10147@item toupper(@var{string})
10148@findex toupper
10149This returns a copy of @var{string}, with each lower-case character
10150in the string replaced with its corresponding upper-case character.
10151Non-alphabetic characters are left unchanged.  For example,
10152@code{toupper("MiXeD cAsE 123")} returns @code{"MIXED CASE 123"}.
10153@end table
10154
10155@c fakenode --- for prepinfo
10156@subheading More About @samp{\} and @samp{&} with @code{sub}, @code{gsub} and @code{gensub}
10157
10158@cindex escape processing, @code{sub} et. al.
10159When using @code{sub}, @code{gsub} or @code{gensub}, and trying to get literal
10160backslashes and ampersands into the replacement text, you need to remember
10161that there are several levels of @dfn{escape processing} going on.
10162
10163First, there is the @dfn{lexical} level, which is when @code{awk} reads
10164your program, and builds an internal copy of your program that can
10165be executed.
10166
10167Then there is the run-time level, when @code{awk} actually scans the
10168replacement string to determine what to generate.
10169
10170At both levels, @code{awk} looks for a defined set of characters that
10171can come after a backslash.  At the lexical level, it looks for the
10172escape sequences listed in @ref{Escape Sequences}.
10173Thus, for every @samp{\} that @code{awk} will process at the run-time
10174level, you type two @samp{\}s at the lexical level.
10175When a character that is not valid for an escape sequence follows the
10176@samp{\}, Unix @code{awk} and @code{gawk} both simply remove the initial
10177@samp{\}, and put the following character into the string. Thus, for
10178example, @code{"a\qb"} is treated as @code{"aqb"}.
10179
10180At the run-time level, the various functions handle sequences of
10181@samp{\} and @samp{&} differently.  The situation is (sadly) somewhat complex.
10182
10183Historically, the @code{sub} and @code{gsub} functions treated the two
10184character sequence @samp{\&} specially; this sequence was replaced in
10185the generated text with a single @samp{&}.  Any other @samp{\} within
10186the @var{replacement} string that did not precede an @samp{&} was passed
10187through unchanged.  To illustrate with a table:
10188
10189@c Thank to Karl Berry for help with the TeX stuff.
10190@tex
10191\vbox{\bigskip
10192% This table has lots of &'s and \'s, so unspecialize them.
10193\catcode`\& = \other \catcode`\\ = \other
10194% But then we need character for escape and tab.
10195@catcode`! = 4
10196@halign{@hfil#!@qquad@hfil#!@qquad#@hfil@cr
10197    You type!@code{sub} sees!@code{sub} generates@cr
10198@hrulefill!@hrulefill!@hrulefill@cr
10199   @code{\&}!       @code{&}!the matched text@cr
10200  @code{\\&}!      @code{\&}!a literal @samp{&}@cr
10201 @code{\\\&}!      @code{\&}!a literal @samp{&}@cr
10202@code{\\\\&}!     @code{\\&}!a literal @samp{\&}@cr
10203@code{\\\\\&}!     @code{\\&}!a literal @samp{\&}@cr
10204@code{\\\\\\&}!     @code{\\\&}!a literal @samp{\\&}@cr
10205  @code{\\q}!      @code{\q}!a literal @samp{\q}@cr
10206}
10207@bigskip}
10208@end tex
10209@ifinfo
10210@display
10211 You type         @code{sub} sees          @code{sub} generates
10212 --------         ----------          ---------------
10213     @code{\&}              @code{&}            the matched text
10214    @code{\\&}             @code{\&}            a literal @samp{&}
10215   @code{\\\&}             @code{\&}            a literal @samp{&}
10216  @code{\\\\&}            @code{\\&}            a literal @samp{\&}
10217 @code{\\\\\&}            @code{\\&}            a literal @samp{\&}
10218@code{\\\\\\&}           @code{\\\&}            a literal @samp{\\&}
10219    @code{\\q}             @code{\q}            a literal @samp{\q}
10220@end display
10221@end ifinfo
10222
10223@noindent
10224This table shows both the lexical level processing, where
10225an odd number of backslashes becomes an even number at the run time level,
10226and the run-time processing done by @code{sub}.
10227(For the sake of simplicity, the rest of the tables below only show the
10228case of even numbers of @samp{\}s entered at the lexical level.)
10229
10230The problem with the historical approach is that there is no way to get
10231a literal @samp{\} followed by the matched text.
10232
10233@cindex @code{awk} language, POSIX version
10234@cindex POSIX @code{awk}
10235The 1992 POSIX standard attempted to fix this problem. The standard
10236says that @code{sub} and @code{gsub} look for either a @samp{\} or an @samp{&}
10237after the @samp{\}. If either one follows a @samp{\}, that character is
10238output literally.  The interpretation of @samp{\} and @samp{&} then becomes
10239like this:
10240
10241@c thanks to Karl Berry for formatting this table
10242@tex
10243\vbox{\bigskip
10244% This table has lots of &'s and \'s, so unspecialize them.
10245\catcode`\& = \other \catcode`\\ = \other
10246% But then we need character for escape and tab.
10247@catcode`! = 4
10248@halign{@hfil#!@qquad@hfil#!@qquad#@hfil@cr
10249    You type!@code{sub} sees!@code{sub} generates@cr
10250@hrulefill!@hrulefill!@hrulefill@cr
10251    @code{&}!       @code{&}!the matched text@cr
10252  @code{\\&}!      @code{\&}!a literal @samp{&}@cr
10253@code{\\\\&}!     @code{\\&}!a literal @samp{\}, then the matched text@cr
10254@code{\\\\\\&}!  @code{\\\&}!a literal @samp{\&}@cr
10255}
10256@bigskip}
10257@end tex
10258@ifinfo
10259@display
10260 You type         @code{sub} sees          @code{sub} generates
10261 --------         ----------          ---------------
10262      @code{&}              @code{&}            the matched text
10263    @code{\\&}             @code{\&}            a literal @samp{&}
10264  @code{\\\\&}            @code{\\&}            a literal @samp{\}, then the matched text
10265@code{\\\\\\&}           @code{\\\&}            a literal @samp{\&}
10266@end display
10267@end ifinfo
10268
10269@noindent
10270This would appear to solve the problem.
10271Unfortunately, the phrasing of the standard is unusual. It
10272says, in effect, that @samp{\} turns off the special meaning of any
10273following character, but that for anything other than @samp{\} and @samp{&},
10274such special meaning is undefined.  This wording leads to two problems.
10275
10276@enumerate
10277@item
10278Backslashes must now be doubled in the @var{replacement} string, breaking
10279historical @code{awk} programs.
10280
10281@item
10282To make sure that an @code{awk} program is portable, @emph{every} character
10283in the @var{replacement} string must be preceded with a
10284backslash.@footnote{This consequence was certainly unintended.}
10285@c I can say that, 'cause I was involved in making this change
10286@end enumerate
10287
10288The POSIX standard is under revision.@footnote{As of @value{UPDATE-MONTH},
10289with final approval and publication as part of the Austin Group
10290Standards hopefully sometime in 2001.}
10291Because of the above problems, proposed text for the revised standard
10292reverts to rules that correspond more closely to the original existing
10293practice. The proposed rules have special cases that make it possible
10294to produce a @samp{\} preceding the matched text.
10295
10296@tex
10297\vbox{\bigskip
10298% This table has lots of &'s and \'s, so unspecialize them.
10299\catcode`\& = \other \catcode`\\ = \other
10300% But then we need character for escape and tab.
10301@catcode`! = 4
10302@halign{@hfil#!@qquad@hfil#!@qquad#@hfil@cr
10303    You type!@code{sub} sees!@code{sub} generates@cr
10304@hrulefill!@hrulefill!@hrulefill@cr
10305@code{\\\\\\&}!     @code{\\\&}!a literal @samp{\&}@cr
10306@code{\\\\&}!     @code{\\&}!a literal @samp{\}, followed by the matched text@cr
10307  @code{\\&}!      @code{\&}!a literal @samp{&}@cr
10308  @code{\\q}!      @code{\q}!a literal @samp{\q}@cr
10309}
10310@bigskip}
10311@end tex
10312@ifinfo
10313@display
10314 You type         @code{sub} sees         @code{sub} generates
10315 --------         ----------         ---------------
10316@code{\\\\\\&}           @code{\\\&}            a literal @samp{\&}
10317  @code{\\\\&}            @code{\\&}            a literal @samp{\}, followed by the matched text
10318    @code{\\&}             @code{\&}            a literal @samp{&}
10319    @code{\\q}             @code{\q}            a literal @samp{\q}
10320@end display
10321@end ifinfo
10322
10323In a nutshell, at the run-time level, there are now three special sequences
10324of characters, @samp{\\\&}, @samp{\\&} and @samp{\&}, whereas historically,
10325there was only one.  However, as in the historical case, any @samp{\} that
10326is not part of one of these three sequences is not special, and appears
10327in the output literally.
10328
10329@code{gawk} 3.0 follows these proposed POSIX rules for @code{sub} and
10330@code{gsub}.
10331@c As much as we think it's a lousy idea. You win some, you lose some. Sigh.
10332Whether these proposed rules will actually become codified into the
10333standard is unknown at this point. Subsequent @code{gawk} releases will
10334track the standard and implement whatever the final version specifies;
10335this @value{DOCUMENT} will be updated as well.
10336
10337The rules for @code{gensub} are considerably simpler. At the run-time
10338level, whenever @code{gawk} sees a @samp{\}, if the following character
10339is a digit, then the text that matched the corresponding parenthesized
10340subexpression is placed in the generated output.  Otherwise,
10341no matter what the character after the @samp{\} is, that character will
10342appear in the generated text, and the @samp{\} will not.
10343
10344@tex
10345\vbox{\bigskip
10346% This table has lots of &'s and \'s, so unspecialize them.
10347\catcode`\& = \other \catcode`\\ = \other
10348% But then we need character for escape and tab.
10349@catcode`! = 4
10350@halign{@hfil#!@qquad@hfil#!@qquad#@hfil@cr
10351    You type!@code{gensub} sees!@code{gensub} generates@cr
10352@hrulefill!@hrulefill!@hrulefill@cr
10353      @code{&}!           @code{&}!the matched text@cr
10354    @code{\\&}!          @code{\&}!a literal @samp{&}@cr
10355   @code{\\\\}!          @code{\\}!a literal @samp{\}@cr
10356  @code{\\\\&}!         @code{\\&}!a literal @samp{\}, then the matched text@cr
10357@code{\\\\\\&}!        @code{\\\&}!a literal @samp{\&}@cr
10358    @code{\\q}!          @code{\q}!a literal @samp{q}@cr
10359}
10360@bigskip}
10361@end tex
10362@ifinfo
10363@display
10364  You type          @code{gensub} sees         @code{gensub} generates
10365  --------          -------------         ------------------
10366      @code{&}                    @code{&}            the matched text
10367    @code{\\&}                   @code{\&}            a literal @samp{&}
10368   @code{\\\\}                   @code{\\}            a literal @samp{\}
10369  @code{\\\\&}                  @code{\\&}            a literal @samp{\}, then the matched text
10370@code{\\\\\\&}                 @code{\\\&}            a literal @samp{\&}
10371    @code{\\q}                   @code{\q}            a literal @samp{q}
10372@end display
10373@end ifinfo
10374
10375Because of the complexity of the lexical and run-time level processing,
10376and the special cases for @code{sub} and @code{gsub},
10377we recommend the use of @code{gawk} and @code{gensub} for when you have
10378to do substitutions.
10379
10380@node I/O Functions, Time Functions, String Functions, Built-in
10381@section Built-in Functions for Input/Output
10382
10383The following functions are related to Input/Output (I/O).
10384Optional parameters are enclosed in square brackets (``['' and ``]'').
10385
10386@table @code
10387@item close(@var{filename})
10388@findex close
10389Close the file @var{filename}, for input or output.  The argument may
10390alternatively be a shell command that was used for redirecting to or
10391from a pipe; then the pipe is closed.
10392@xref{Close Files And Pipes, ,Closing Input and Output Files and Pipes},
10393for more information.
10394
10395@item fflush(@r{[}@var{filename}@r{]})
10396@findex fflush
10397@cindex portability issues
10398@cindex flushing buffers
10399@cindex buffers, flushing
10400@cindex buffering output
10401@cindex output, buffering
10402Flush any buffered output associated @var{filename}, which is either a
10403file opened for writing, or a shell command for redirecting output to
10404a pipe.
10405
10406Many utility programs will @dfn{buffer} their output; they save information
10407to be written to a disk file or terminal in memory, until there is enough
10408for it to be worthwhile to send the data to the ouput device.
10409This is often more efficient than writing
10410every little bit of information as soon as it is ready.  However, sometimes
10411it is necessary to force a program to @dfn{flush} its buffers; that is,
10412write the information to its destination, even if a buffer is not full.
10413This is the purpose of the @code{fflush} function; @code{gawk} too
10414buffers its output, and the @code{fflush} function can be used to force
10415@code{gawk} to flush its buffers.
10416
10417@code{fflush} is a recent (1994) addition to the Bell Labs research
10418version of @code{awk}; it is not part of the POSIX standard, and will
10419not be available if @samp{--posix} has been specified on the command
10420line (@pxref{Options, ,Command Line Options}).
10421
10422@code{gawk} extends the @code{fflush} function in two ways.  The first
10423is to allow no argument at all. In this case, the buffer for the
10424standard output is flushed.  The second way is to allow the null string
10425(@w{@code{""}}) as the argument. In this case, the buffers for
10426@emph{all} open output files and pipes are flushed.
10427
10428@code{fflush} returns zero if the buffer was successfully flushed,
10429and nonzero otherwise.
10430
10431@item system(@var{command})
10432@findex system
10433@cindex interaction, @code{awk} and other programs
10434The @code{system} function allows the user to execute operating system commands
10435and then return to the @code{awk} program.  The @code{system} function
10436executes the command given by the string @var{command}.  It returns, as
10437its value, the status returned by the command that was executed.
10438
10439For example, if the following fragment of code is put in your @code{awk}
10440program:
10441
10442@example
10443END @{
10444     system("date | mail -s 'awk run done' root")
10445@}
10446@end example
10447
10448@noindent
10449the system administrator will be sent mail when the @code{awk} program
10450finishes processing input and begins its end-of-input processing.
10451
10452Note that redirecting @code{print} or @code{printf} into a pipe is often
10453enough to accomplish your task.  If you need to run many commands, it
10454will be more efficient to simply print them to a pipe to the shell:
10455
10456@example
10457while (@var{more stuff to do})
10458    print @var{command} | "/bin/sh"
10459close("/bin/sh")
10460@end example
10461
10462@noindent
10463However, if your @code{awk}
10464program is interactive, @code{system} is useful for cranking up large
10465self-contained programs, such as a shell or an editor.
10466
10467Some operating systems cannot implement the @code{system} function.
10468@code{system} causes a fatal error if it is not supported.
10469@end table
10470
10471@c fakenode --- for prepinfo
10472@subheading Interactive vs. Non-Interactive Buffering
10473@cindex buffering, interactive vs. non-interactive
10474@cindex buffering, non-interactive vs. interactive
10475@cindex interactive buffering vs. non-interactive
10476@cindex non-interactive buffering vs. interactive
10477
10478As a side point, buffering issues can be even more confusing depending
10479upon whether or not your program is @dfn{interactive}, i.e., communicating
10480with a user sitting at a keyboard.@footnote{A program is interactive
10481if the standard output is connected
10482to a terminal device.}
10483
10484Interactive programs generally @dfn{line buffer} their output; they
10485write out every line.  Non-interactive programs wait until they have
10486a full buffer, which may be many lines of output.
10487
10488@c Thanks to Walter.Mecky@dresdnerbank.de for this example, and for
10489@c motivating me to write this section.
10490Here is an example of the difference.
10491
10492@example
10493$ awk '@{ print $1 + $2 @}'
104941 1
10495@print{} 2
104962 3
10497@print{} 5
10498@kbd{Control-d}
10499@end example
10500
10501@noindent
10502Each line of output is printed immediately. Compare that behavior
10503with this example.
10504
10505@example
10506$ awk '@{ print $1 + $2 @}' | cat
105071 1
105082 3
10509@kbd{Control-d}
10510@print{} 2
10511@print{} 5
10512@end example
10513
10514@noindent
10515Here, no output is printed until after the @kbd{Control-d} is typed, since
10516it is all buffered, and sent down the pipe to @code{cat} in one shot.
10517
10518@c fakenode --- for prepinfo
10519@subheading Controlling Output Buffering with @code{system}
10520@cindex flushing buffers
10521@cindex buffers, flushing
10522@cindex buffering output
10523@cindex output, buffering
10524
10525The @code{fflush} function provides explicit control over output buffering for
10526individual files and pipes.  However, its use is not portable to many other
10527@code{awk} implementations.  An alternative method to flush output
10528buffers is by calling @code{system} with a null string as its argument:
10529
10530@example
10531system("")   # flush output
10532@end example
10533
10534@noindent
10535@code{gawk} treats this use of the @code{system} function as a special
10536case, and is smart enough not to run a shell (or other command
10537interpreter) with the empty command.  Therefore, with @code{gawk}, this
10538idiom is not only useful, it is efficient.  While this method should work
10539with other @code{awk} implementations, it will not necessarily avoid
10540starting an unnecessary shell.  (Other implementations may only
10541flush the buffer associated with the standard output, and not necessarily
10542all buffered output.)
10543
10544If you think about what a programmer expects, it makes sense that
10545@code{system} should flush any pending output.  The following program:
10546
10547@example
10548BEGIN @{
10549     print "first print"
10550     system("echo system echo")
10551     print "second print"
10552@}
10553@end example
10554
10555@noindent
10556must print
10557
10558@example
10559first print
10560system echo
10561second print
10562@end example
10563
10564@noindent
10565and not
10566
10567@example
10568system echo
10569first print
10570second print
10571@end example
10572
10573If @code{awk} did not flush its buffers before calling @code{system}, the
10574latter (undesirable) output is what you would see.
10575
10576@node Time Functions,  , I/O Functions, Built-in
10577@section Functions for Dealing with Time Stamps
10578
10579@cindex timestamps
10580@cindex time of day
10581A common use for @code{awk} programs is the processing of log files
10582containing time stamp information, indicating when a
10583particular log record was written.  Many programs log their time stamp
10584in the form returned by the @code{time} system call, which is the
10585number of seconds since a particular epoch.  On POSIX systems,
10586it is the number of seconds since Midnight, January 1, 1970, UTC.
10587
10588In order to make it easier to process such log files, and to produce
10589useful reports, @code{gawk} provides two functions for working with time
10590stamps.  Both of these are @code{gawk} extensions; they are not specified
10591in the POSIX standard, nor are they in any other known version
10592of @code{awk}.
10593
10594Optional parameters are enclosed in square brackets (``['' and ``]'').
10595
10596@table @code
10597@item systime()
10598@findex systime
10599This function returns the current time as the number of seconds since
10600the system epoch.  On POSIX systems, this is the number of seconds
10601since Midnight, January 1, 1970, UTC.  It may be a different number on
10602other systems.
10603
10604@item strftime(@r{[}@var{format} @r{[}, @var{timestamp}@r{]]})
10605@findex strftime
10606This function returns a string.  It is similar to the function of the
10607same name in ANSI C.  The time specified by @var{timestamp} is used to
10608produce a string, based on the contents of the @var{format} string.
10609The @var{timestamp} is in the same format as the value returned by the
10610@code{systime} function.  If no @var{timestamp} argument is supplied,
10611@code{gawk} will use the current time of day as the time stamp.
10612If no @var{format} argument is supplied, @code{strftime} uses
10613@code{@w{"%a %b %d %H:%M:%S %Z %Y"}}.  This format string produces
10614output (almost) equivalent to that of the @code{date} utility.
10615(Versions of @code{gawk} prior to 3.0 require the @var{format} argument.)
10616@end table
10617
10618The @code{systime} function allows you to compare a time stamp from a
10619log file with the current time of day.  In particular, it is easy to
10620determine how long ago a particular record was logged.  It also allows
10621you to produce log records using the ``seconds since the epoch'' format.
10622
10623The @code{strftime} function allows you to easily turn a time stamp
10624into human-readable information.  It is similar in nature to the @code{sprintf}
10625function
10626(@pxref{String Functions, ,Built-in Functions for String Manipulation}),
10627in that it copies non-format specification characters verbatim to the
10628returned string, while substituting date and time values for format
10629specifications in the @var{format} string.
10630
10631@code{strftime} is guaranteed by the ANSI C standard to support
10632the following date format specifications:
10633
10634@table @code
10635@item %a
10636The locale's abbreviated weekday name.
10637
10638@item %A
10639The locale's full weekday name.
10640
10641@item %b
10642The locale's abbreviated month name.
10643
10644@item %B
10645The locale's full month name.
10646
10647@item %c
10648The locale's ``appropriate'' date and time representation.
10649
10650@item %d
10651The day of the month as a decimal number (01--31).
10652
10653@item %H
10654The hour (24-hour clock) as a decimal number (00--23).
10655
10656@item %I
10657The hour (12-hour clock) as a decimal number (01--12).
10658
10659@item %j
10660The day of the year as a decimal number (001--366).
10661
10662@item %m
10663The month as a decimal number (01--12).
10664
10665@item %M
10666The minute as a decimal number (00--59).
10667
10668@item %p
10669The locale's equivalent of the AM/PM designations associated
10670with a 12-hour clock.
10671
10672@item %S
10673The second as a decimal number (00--60).@footnote{Occasionally there are
10674minutes in a year with a leap second, which is why the
10675seconds can go up to 60.}
10676
10677@item %U
10678The week number of the year (the first Sunday as the first day of week one)
10679as a decimal number (00--53).
10680
10681@item %w
10682The weekday as a decimal number (0--6).  Sunday is day zero.
10683
10684@item %W
10685The week number of the year (the first Monday as the first day of week one)
10686as a decimal number (00--53).
10687
10688@item %x
10689The locale's ``appropriate'' date representation.
10690
10691@item %X
10692The locale's ``appropriate'' time representation.
10693
10694@item %y
10695The year without century as a decimal number (00--99).
10696
10697@item %Y
10698The year with century as a decimal number (e.g., 1995).
10699
10700@item %Z
10701The time zone name or abbreviation, or no characters if
10702no time zone is determinable.
10703
10704@item %%
10705A literal @samp{%}.
10706@end table
10707
10708If a conversion specifier is not one of the above, the behavior is
10709undefined.@footnote{This is because ANSI C leaves the
10710behavior of the C version of @code{strftime} undefined, and @code{gawk}
10711will use the system's version of @code{strftime} if it's there.
10712Typically, the conversion specifier will either not appear in the
10713returned string, or it will appear literally.}
10714
10715@cindex locale, definition of
10716Informally, a @dfn{locale} is the geographic place in which a program
10717is meant to run.  For example, a common way to abbreviate the date
10718September 4, 1991 in the United States would be ``9/4/91''.
10719In many countries in Europe, however, it would be abbreviated ``4.9.91''.
10720Thus, the @samp{%x} specification in a @code{"US"} locale might produce
10721@samp{9/4/91}, while in a @code{"EUROPE"} locale, it might produce
10722@samp{4.9.91}.  The ANSI C standard defines a default @code{"C"}
10723locale, which is an environment that is typical of what most C programmers
10724are used to.
10725
10726A public-domain C version of @code{strftime} is supplied with @code{gawk}
10727for systems that are not yet fully ANSI-compliant.  If that version is
10728used to compile @code{gawk} (@pxref{Installation, ,Installing @code{gawk}}),
10729then the following additional format specifications are available:
10730
10731@table @code
10732@item %D
10733Equivalent to specifying @samp{%m/%d/%y}.
10734
10735@item %e
10736The day of the month, padded with a space if it is only one digit.
10737
10738@item %h
10739Equivalent to @samp{%b}, above.
10740
10741@item %n
10742A newline character (ASCII LF).
10743
10744@item %r
10745Equivalent to specifying @samp{%I:%M:%S %p}.
10746
10747@item %R
10748Equivalent to specifying @samp{%H:%M}.
10749
10750@item %T
10751Equivalent to specifying @samp{%H:%M:%S}.
10752
10753@item %t
10754A tab character.
10755
10756@item %k
10757The hour (24-hour clock) as a decimal number (0-23).
10758Single digit numbers are padded with a space.
10759
10760@item %l
10761The hour (12-hour clock) as a decimal number (1-12).
10762Single digit numbers are padded with a space.
10763
10764@item %C
10765The century, as a number between 00 and 99.
10766
10767@item %u
10768The weekday as a decimal number
10769[1 (Monday)--7].
10770
10771@cindex ISO 8601
10772@item %V
10773The week number of the year (the first Monday as the first
10774day of week one) as a decimal number (01--53).
10775The method for determining the week number is as specified by ISO 8601
10776(to wit: if the week containing January 1 has four or more days in the
10777new year, then it is week one, otherwise it is week 53 of the previous year
10778and the next week is week one).
10779
10780@item %G
10781The year with century of the ISO week number, as a decimal number.
10782
10783For example, January 1, 1993, is in week 53 of 1992. Thus, the year
10784of its ISO week number is 1992, even though its year is 1993.
10785Similarly, December 31, 1973, is in week 1 of 1974. Thus, the year
10786of its ISO week number is 1974, even though its year is 1973.
10787
10788@item %g
10789The year without century of the ISO week number, as a decimal number (00--99).
10790
10791@item %Ec %EC %Ex %Ey %EY %Od %Oe %OH %OI
10792@itemx %Om %OM %OS %Ou %OU %OV %Ow %OW %Oy
10793These are ``alternate representations'' for the specifications
10794that use only the second letter (@samp{%c}, @samp{%C}, and so on).
10795They are recognized, but their normal representations are
10796used.@footnote{If you don't understand any of this, don't worry about
10797it; these facilities are meant to make it easier to ``internationalize''
10798programs.}
10799(These facilitate compliance with the POSIX @code{date} utility.)
10800
10801@item %v
10802The date in VMS format (e.g., 20-JUN-1991).
10803
10804@cindex RFC-822
10805@cindex RFC-1036
10806@item %z
10807The timezone offset in a +HHMM format (e.g., the format necessary to
10808produce RFC-822/RFC-1036 date headers).
10809@end table
10810
10811This example is an @code{awk} implementation of the POSIX
10812@code{date} utility.  Normally, the @code{date} utility prints the
10813current date and time of day in a well known format.  However, if you
10814provide an argument to it that begins with a @samp{+}, @code{date}
10815will copy non-format specifier characters to the standard output, and
10816will interpret the current time according to the format specifiers in
10817the string.  For example:
10818
10819@example
10820$ date '+Today is %A, %B %d, %Y.'
10821@print{} Today is Thursday, July 11, 1991.
10822@end example
10823
10824Here is the @code{gawk} version of the @code{date} utility.
10825It has a shell ``wrapper'', to handle the @samp{-u} option,
10826which requires that @code{date} run as if the time zone
10827was set to UTC.
10828
10829@example
10830@group
10831#! /bin/sh
10832#
10833# date --- approximate the P1003.2 'date' command
10834
10835case $1 in
10836-u)  TZ=GMT0     # use UTC
10837     export TZ
10838     shift ;;
10839esac
10840@end group
10841
10842@group
10843gawk 'BEGIN  @{
10844    format = "%a %b %d %H:%M:%S %Z %Y"
10845    exitval = 0
10846@end group
10847
10848@group
10849    if (ARGC > 2)
10850        exitval = 1
10851    else if (ARGC == 2) @{
10852        format = ARGV[1]
10853        if (format ~ /^\+/)
10854            format = substr(format, 2)   # remove leading +
10855    @}
10856    print strftime(format)
10857    exit exitval
10858@}' "$@@"
10859@end group
10860@end example
10861
10862@node User-defined, Invoking Gawk, Built-in, Top
10863@chapter User-defined Functions
10864
10865@cindex user-defined functions
10866@cindex functions, user-defined
10867Complicated @code{awk} programs can often be simplified by defining
10868your own functions.  User-defined functions can be called just like
10869built-in ones (@pxref{Function Calls}), but it is up to you to define
10870them---to tell @code{awk} what they should do.
10871
10872@menu
10873* Definition Syntax::           How to write definitions and what they mean.
10874* Function Example::            An example function definition and what it
10875                                does.
10876* Function Caveats::            Things to watch out for.
10877* Return Statement::            Specifying the value a function returns.
10878@end menu
10879
10880@node Definition Syntax, Function Example, User-defined, User-defined
10881@section Function Definition Syntax
10882@cindex defining functions
10883@cindex function definition
10884
10885Definitions of functions can appear anywhere between the rules of an
10886@code{awk} program.  Thus, the general form of an @code{awk} program is
10887extended to include sequences of rules @emph{and} user-defined function
10888definitions.
10889There is no need in @code{awk} to put the definition of a function
10890before all uses of the function.  This is because @code{awk} reads the
10891entire program before starting to execute any of it.
10892
10893The definition of a function named @var{name} looks like this:
10894
10895@example
10896function @var{name}(@var{parameter-list})
10897@{
10898     @var{body-of-function}
10899@}
10900@end example
10901
10902@cindex names, use of
10903@cindex namespaces
10904@noindent
10905@var{name} is the name of the function to be defined.  A valid function
10906name is like a valid variable name: a sequence of letters, digits and
10907underscores, not starting with a digit.
10908Within a single @code{awk} program, any particular name can only be
10909used as a variable, array or function.
10910
10911@var{parameter-list} is a list of the function's arguments and local
10912variable names, separated by commas.  When the function is called,
10913the argument names are used to hold the argument values given in
10914the call.  The local variables are initialized to the empty string.
10915A function cannot have two parameters with the same name.
10916
10917The @var{body-of-function} consists of @code{awk} statements.  It is the
10918most important part of the definition, because it says what the function
10919should actually @emph{do}.  The argument names exist to give the body a
10920way to talk about the arguments; local variables, to give the body
10921places to keep temporary values.
10922
10923Argument names are not distinguished syntactically from local variable
10924names; instead, the number of arguments supplied when the function is
10925called determines how many argument variables there are.  Thus, if three
10926argument values are given, the first three names in @var{parameter-list}
10927are arguments, and the rest are local variables.
10928
10929It follows that if the number of arguments is not the same in all calls
10930to the function, some of the names in @var{parameter-list} may be
10931arguments on some occasions and local variables on others.  Another
10932way to think of this is that omitted arguments default to the
10933null string.
10934
10935Usually when you write a function you know how many names you intend to
10936use for arguments and how many you intend to use as local variables.  It is
10937conventional to place some extra space between the arguments and
10938the local variables, to document how your function is supposed to be used.
10939
10940@cindex variable shadowing
10941During execution of the function body, the arguments and local variable
10942values hide or @dfn{shadow} any variables of the same names used in the
10943rest of the program.  The shadowed variables are not accessible in the
10944function definition, because there is no way to name them while their
10945names have been taken away for the local variables.  All other variables
10946used in the @code{awk} program can be referenced or set normally in the
10947function's body.
10948
10949The arguments and local variables last only as long as the function body
10950is executing.  Once the body finishes, you can once again access the
10951variables that were shadowed while the function was running.
10952
10953@cindex recursive function
10954@cindex function, recursive
10955The function body can contain expressions which call functions.  They
10956can even call this function, either directly or by way of another
10957function.  When this happens, we say the function is @dfn{recursive}.
10958
10959@cindex @code{awk} language, POSIX version
10960@cindex POSIX @code{awk}
10961In many @code{awk} implementations, including @code{gawk},
10962the keyword @code{function} may be
10963abbreviated @code{func}.  However, POSIX only specifies the use of
10964the keyword @code{function}.  This actually has some practical implications.
10965If @code{gawk} is in POSIX-compatibility mode
10966(@pxref{Options, ,Command Line Options}), then the following
10967statement will @emph{not} define a function:
10968
10969@example
10970func foo() @{ a = sqrt($1) ; print a @}
10971@end example
10972
10973@noindent
10974Instead it defines a rule that, for each record, concatenates the value
10975of the variable @samp{func} with the return value of the function @samp{foo}.
10976If the resulting string is non-null, the action is executed.
10977This is probably not what was desired.  (@code{awk} accepts this input as
10978syntactically valid, since functions may be used before they are defined
10979in @code{awk} programs.)
10980
10981@cindex portability issues
10982To ensure that your @code{awk} programs are portable, always use the
10983keyword @code{function} when defining a function.
10984
10985@node Function Example, Function Caveats, Definition Syntax, User-defined
10986@section Function Definition Examples
10987
10988Here is an example of a user-defined function, called @code{myprint}, that
10989takes a number and prints it in a specific format.
10990
10991@example
10992function myprint(num)
10993@{
10994     printf "%6.3g\n", num
10995@}
10996@end example
10997
10998@noindent
10999To illustrate, here is an @code{awk} rule which uses our @code{myprint}
11000function:
11001
11002@example
11003$3 > 0     @{ myprint($3) @}
11004@end example
11005
11006@noindent
11007This program prints, in our special format, all the third fields that
11008contain a positive number in our input.  Therefore, when given:
11009
11010@example
11011@group
11012 1.2   3.4    5.6   7.8
11013 9.10 11.12 -13.14 15.16
1101417.18 19.20  21.22 23.24
11015@end group
11016@end example
11017
11018@noindent
11019this program, using our function to format the results, prints:
11020
11021@example
11022   5.6
11023  21.2
11024@end example
11025
11026This function deletes all the elements in an array.
11027
11028@example
11029function delarray(a,    i)
11030@{
11031    for (i in a)
11032       delete a[i]
11033@}
11034@end example
11035
11036When working with arrays, it is often necessary to delete all the elements
11037in an array and start over with a new list of elements
11038(@pxref{Delete, ,The @code{delete} Statement}).
11039Instead of having
11040to repeat this loop everywhere in your program that you need to clear out
11041an array, your program can just call @code{delarray}.
11042(This guarantees portability.  The usage @samp{delete @var{array}} to delete
11043the contents of an entire array is a non-standard extension.)
11044
11045Here is an example of a recursive function.  It takes a string
11046as an input parameter, and returns the string in backwards order.
11047
11048@example
11049function rev(str, start)
11050@{
11051    if (start == 0)
11052        return ""
11053
11054    return (substr(str, start, 1) rev(str, start - 1))
11055@}
11056@end example
11057
11058If this function is in a file named @file{rev.awk}, we can test it
11059this way:
11060
11061@example
11062$ echo "Don't Panic!" |
11063> gawk --source '@{ print rev($0, length($0)) @}' -f rev.awk
11064@print{} !cinaP t'noD
11065@end example
11066
11067Here is an example that uses the built-in function @code{strftime}.
11068(@xref{Time Functions, ,Functions for Dealing with Time Stamps},
11069for more information on @code{strftime}.)
11070The C @code{ctime} function takes a timestamp and returns it in a string,
11071formatted in a well known fashion.  Here is an @code{awk} version:
11072
11073@example
11074@c file eg/lib/ctime.awk
11075# ctime.awk
11076#
11077# awk version of C ctime(3) function
11078
11079@group
11080function ctime(ts,    format)
11081@{
11082    format = "%a %b %d %H:%M:%S %Z %Y"
11083    if (ts == 0)
11084        ts = systime()       # use current time as default
11085    return strftime(format, ts)
11086@}
11087@c endfile
11088@end group
11089@end example
11090
11091@node Function Caveats, Return Statement, Function Example, User-defined
11092@section Calling User-defined Functions
11093
11094@cindex call by value
11095@cindex call by reference
11096@cindex calling a function
11097@cindex function call
11098@dfn{Calling a function} means causing the function to run and do its job.
11099A function call is an expression, and its value is the value returned by
11100the function.
11101
11102A function call consists of the function name followed by the arguments
11103in parentheses.  What you write in the call for the arguments are
11104@code{awk} expressions; each time the call is executed, these
11105expressions are evaluated, and the values are the actual arguments.  For
11106example, here is a call to @code{foo} with three arguments (the first
11107being a string concatenation):
11108
11109@example
11110foo(x y, "lose", 4 * z)
11111@end example
11112
11113@strong{Caution:} whitespace characters (spaces and tabs) are not allowed
11114between the function name and the open-parenthesis of the argument list.
11115If you write whitespace by mistake, @code{awk} might think that you mean
11116to concatenate a variable with an expression in parentheses.  However, it
11117notices that you used a function name and not a variable name, and reports
11118an error.
11119
11120@cindex call by value
11121When a function is called, it is given a @emph{copy} of the values of
11122its arguments.  This is known as @dfn{call by value}.  The caller may use
11123a variable as the expression for the argument, but the called function
11124does not know this: it only knows what value the argument had.  For
11125example, if you write this code:
11126
11127@example
11128foo = "bar"
11129z = myfunc(foo)
11130@end example
11131
11132@noindent
11133then you should not think of the argument to @code{myfunc} as being
11134``the variable @code{foo}.''  Instead, think of the argument as the
11135string value, @code{"bar"}.
11136
11137If the function @code{myfunc} alters the values of its local variables,
11138this has no effect on any other variables.  Thus, if @code{myfunc}
11139does this:
11140
11141@example
11142@group
11143function myfunc(str)
11144@{
11145  print str
11146  str = "zzz"
11147  print str
11148@}
11149@end group
11150@end example
11151
11152@noindent
11153to change its first argument variable @code{str}, this @emph{does not}
11154change the value of @code{foo} in the caller.  The role of @code{foo} in
11155calling @code{myfunc} ended when its value, @code{"bar"}, was computed.
11156If @code{str} also exists outside of @code{myfunc}, the function body
11157cannot alter this outer value, because it is shadowed during the
11158execution of @code{myfunc} and cannot be seen or changed from there.
11159
11160@cindex call by reference
11161However, when arrays are the parameters to functions, they are @emph{not}
11162copied.  Instead, the array itself is made available for direct manipulation
11163by the function.  This is usually called @dfn{call by reference}.
11164Changes made to an array parameter inside the body of a function @emph{are}
11165visible outside that function.
11166@ifinfo
11167This can be @strong{very} dangerous if you do not watch what you are
11168doing.  For example:
11169@end ifinfo
11170@iftex
11171@emph{This can be very dangerous if you do not watch what you are
11172doing.}  For example:
11173@end iftex
11174
11175@example
11176@group
11177function changeit(array, ind, nvalue)
11178@{
11179     array[ind] = nvalue
11180@}
11181@end group
11182
11183BEGIN @{
11184    a[1] = 1; a[2] = 2; a[3] = 3
11185    changeit(a, 2, "two")
11186    printf "a[1] = %s, a[2] = %s, a[3] = %s\n",
11187            a[1], a[2], a[3]
11188@}
11189@end example
11190
11191@noindent
11192This program prints @samp{a[1] = 1, a[2] = two, a[3] = 3}, because
11193@code{changeit} stores @code{"two"} in the second element of @code{a}.
11194
11195@cindex undefined functions
11196@cindex functions, undefined
11197Some @code{awk} implementations allow you to call a function that
11198has not been defined, and only report a problem at run-time when the
11199program actually tries to call the function. For example:
11200
11201@example
11202@group
11203BEGIN @{
11204    if (0)
11205        foo()
11206    else
11207        bar()
11208@}
11209function bar() @{ @dots{} @}
11210# note that `foo' is not defined
11211@end group
11212@end example
11213
11214@noindent
11215Since the @samp{if} statement will never be true, it is not really a
11216problem that @code{foo} has not been defined.  Usually though, it is a
11217problem if a program calls an undefined function.
11218
11219@ignore
11220At one point, I had gawk dieing on this, but later decided that this might
11221break old programs and/or test suites.
11222@end ignore
11223
11224If @samp{--lint} has been specified
11225(@pxref{Options, ,Command Line Options}),
11226@code{gawk} will report about calls to undefined functions.
11227
11228Some @code{awk} implementations generate a run-time
11229error if you use the @code{next} statement
11230(@pxref{Next Statement, , The @code{next} Statement})
11231inside a user-defined function.
11232@code{gawk} does not have this problem.
11233
11234@node Return Statement,  , Function Caveats, User-defined
11235@section The @code{return} Statement
11236@cindex @code{return} statement
11237
11238The body of a user-defined function can contain a @code{return} statement.
11239This statement returns control to the rest of the @code{awk} program.  It
11240can also be used to return a value for use in the rest of the @code{awk}
11241program.  It looks like this:
11242
11243@example
11244return @r{[}@var{expression}@r{]}
11245@end example
11246
11247The @var{expression} part is optional.  If it is omitted, then the returned
11248value is undefined and, therefore, unpredictable.
11249
11250A @code{return} statement with no value expression is assumed at the end of
11251every function definition.  So if control reaches the end of the function
11252body, then the function returns an unpredictable value.  @code{awk}
11253will @emph{not} warn you if you use the return value of such a function.
11254
11255Sometimes, you want to write a function for what it does, not for
11256what it returns.  Such a function corresponds to a @code{void} function
11257in C or to a @code{procedure} in Pascal.  Thus, it may be appropriate to not
11258return any value; you should simply bear in mind that if you use the return
11259value of such a function, you do so at your own risk.
11260
11261Here is an example of a user-defined function that returns a value
11262for the largest number among the elements of an array:
11263
11264@example
11265@group
11266function maxelt(vec,   i, ret)
11267@{
11268     for (i in vec) @{
11269          if (ret == "" || vec[i] > ret)
11270               ret = vec[i]
11271     @}
11272     return ret
11273@}
11274@end group
11275@end example
11276
11277@noindent
11278You call @code{maxelt} with one argument, which is an array name.  The local
11279variables @code{i} and @code{ret} are not intended to be arguments;
11280while there is nothing to stop you from passing two or three arguments
11281to @code{maxelt}, the results would be strange.  The extra space before
11282@code{i} in the function parameter list indicates that @code{i} and
11283@code{ret} are not supposed to be arguments.  This is a convention that
11284you should follow when you define functions.
11285
11286Here is a program that uses our @code{maxelt} function.  It loads an
11287array, calls @code{maxelt}, and then reports the maximum number in that
11288array:
11289
11290@example
11291@group
11292awk '
11293function maxelt(vec,   i, ret)
11294@{
11295     for (i in vec) @{
11296          if (ret == "" || vec[i] > ret)
11297               ret = vec[i]
11298     @}
11299     return ret
11300@}
11301@end group
11302
11303@group
11304# Load all fields of each record into nums.
11305@{
11306     for(i = 1; i <= NF; i++)
11307          nums[NR, i] = $i
11308@}
11309
11310END @{
11311     print maxelt(nums)
11312@}'
11313@end group
11314@end example
11315
11316Given the following input:
11317
11318@example
11319@group
11320 1 5 23 8 16
1132144 3 5 2 8 26
11322256 291 1396 2962 100
11323-6 467 998 1101
1132499385 11 0 225
11325@end group
11326@end example
11327
11328@noindent
11329our program tells us (predictably) that @code{99385} is the largest number
11330in our array.
11331
11332@node Invoking Gawk, Library Functions, User-defined, Top
11333@chapter Running @code{awk}
11334@cindex command line
11335@cindex invocation of @code{gawk}
11336@cindex arguments, command line
11337@cindex options, command line
11338@cindex long options
11339@cindex options, long
11340
11341There are two ways to run @code{awk}: with an explicit program, or with
11342one or more program files.  Here are templates for both of them; items
11343enclosed in @samp{@r{[}@dots{}@r{]}} in these templates are optional.
11344
11345Besides traditional one-letter POSIX-style options, @code{gawk} also
11346supports GNU long options.
11347
11348@example
11349awk @r{[@var{options}]} -f progfile @r{[@code{--}]} @var{file} @dots{}
11350awk @r{[@var{options}]} @r{[@code{--}]} '@var{program}' @var{file} @dots{}
11351@end example
11352
11353@cindex empty program
11354@cindex dark corner
11355It is possible to invoke @code{awk} with an empty program:
11356
11357@example
11358$ awk '' datafile1 datafile2
11359@end example
11360
11361@noindent
11362Doing so makes little sense though; @code{awk} will simply exit
11363silently when given an empty program (d.c.).  If @samp{--lint} has
11364been specified on the command line, @code{gawk} will issue a
11365warning that the program is empty.
11366
11367@menu
11368* Options::                     Command line options and their meanings.
11369* Other Arguments::             Input file names and variable assignments.
11370* AWKPATH Variable::            Searching directories for @code{awk} programs.
11371* Obsolete::                    Obsolete Options and/or features.
11372* Undocumented::                Undocumented Options and Features.
11373* Known Bugs::                  Known Bugs in @code{gawk}.
11374@end menu
11375
11376@node Options, Other Arguments, Invoking Gawk, Invoking Gawk
11377@section Command Line Options
11378
11379Options begin with a dash, and consist of a single character.
11380GNU style long options consist of two dashes and a keyword.
11381The keyword can be abbreviated, as long the abbreviation allows the option
11382to be uniquely identified.  If the option takes an argument, then the
11383keyword is either immediately followed by an equals sign (@samp{=}) and the
11384argument's value, or the keyword and the argument's value are separated
11385by whitespace.  For brevity, the discussion below only refers to the
11386traditional short options; however the long and short options are
11387interchangeable in all contexts.
11388
11389Each long option for @code{gawk} has a corresponding
11390POSIX-style option.  The options and their meanings are as follows:
11391
11392@table @code
11393@item -F @var{fs}
11394@itemx --field-separator @var{fs}
11395@cindex @code{-F} option
11396@cindex @code{--field-separator} option
11397Sets the @code{FS} variable to @var{fs}
11398(@pxref{Field Separators, ,Specifying How Fields are Separated}).
11399
11400@item -f @var{source-file}
11401@itemx --file @var{source-file}
11402@cindex @code{-f} option
11403@cindex @code{--file} option
11404Indicates that the @code{awk} program is to be found in @var{source-file}
11405instead of in the first non-option argument.
11406
11407@item -v @var{var}=@var{val}
11408@itemx --assign @var{var}=@var{val}
11409@cindex @code{-v} option
11410@cindex @code{--assign} option
11411Sets the variable @var{var} to the value @var{val} @strong{before}
11412execution of the program begins.  Such variable values are available
11413inside the @code{BEGIN} rule
11414(@pxref{Other Arguments, ,Other Command Line Arguments}).
11415
11416The @samp{-v} option can only set one variable, but you can use
11417it more than once, setting another variable each time, like this:
11418@samp{awk @w{-v foo=1} @w{-v bar=2} @dots{}}.
11419
11420@strong{Caution:}  Using @samp{-v} to set the values of the builtin
11421variables may lead to suprising results.  @code{awk} will reset the
11422values of those variables as it needs to, possibly ignoring any
11423predefined value you may have given.
11424
11425@item -mf @var{NNN}
11426@itemx -mr @var{NNN}
11427Set various memory limits to the value @var{NNN}.  The @samp{f} flag sets
11428the maximum number of fields, and the @samp{r} flag sets the maximum
11429record size.  These two flags and the @samp{-m} option are from the
11430Bell Labs research version of Unix @code{awk}.  They are provided
11431for compatibility, but otherwise ignored by
11432@code{gawk}, since @code{gawk} has no predefined limits.
11433
11434@item -W @var{gawk-opt}
11435@cindex @code{-W} option
11436Following the POSIX standard, options that are implementation
11437specific are supplied as arguments to the @samp{-W} option.  These options
11438also have corresponding GNU style long options.
11439See below.
11440
11441@item --
11442Signals the end of the command line options.  The following arguments
11443are not treated as options even if they begin with @samp{-}.  This
11444interpretation of @samp{--} follows the POSIX argument parsing
11445conventions.
11446
11447This is useful if you have file names that start with @samp{-},
11448or in shell scripts, if you have file names that will be specified
11449by the user which could start with @samp{-}.
11450@end table
11451
11452The following @code{gawk}-specific options are available:
11453
11454@table @code
11455@item -W traditional
11456@itemx -W compat
11457@itemx --traditional
11458@itemx --compat
11459@cindex @code{--compat} option
11460@cindex @code{--traditional} option
11461@cindex compatibility mode
11462Specifies @dfn{compatibility mode}, in which the GNU extensions to
11463the @code{awk} language are disabled, so that @code{gawk} behaves just
11464like the Bell Labs research version of Unix @code{awk}.
11465@samp{--traditional} is the preferred form of this option.
11466@xref{POSIX/GNU, ,Extensions in @code{gawk} Not in POSIX @code{awk}},
11467which summarizes the extensions.  Also see
11468@ref{Compatibility Mode, ,Downward Compatibility and Debugging}.
11469
11470@item -W copyleft
11471@itemx -W copyright
11472@itemx --copyleft
11473@itemx --copyright
11474@cindex @code{--copyleft} option
11475@cindex @code{--copyright} option
11476Print the short version of the General Public License, and then exit.
11477This option may disappear in a future version of @code{gawk}.
11478
11479@item -W help
11480@itemx -W usage
11481@itemx --help
11482@itemx --usage
11483@cindex @code{--help} option
11484@cindex @code{--usage} option
11485Print a ``usage'' message summarizing the short and long style options
11486that @code{gawk} accepts, and then exit.
11487
11488@item -W lint
11489@itemx --lint
11490@cindex @code{--lint} option
11491Warn about constructs that are dubious or non-portable to
11492other @code{awk} implementations.
11493Some warnings are issued when @code{gawk} first reads your program.  Others
11494are issued at run-time, as your program executes.
11495
11496@item -W lint-old
11497@itemx --lint-old
11498@cindex @code{--lint-old} option
11499Warn about constructs that are not available in
11500the original Version 7 Unix version of @code{awk}
11501(@pxref{V7/SVR3.1, , Major Changes between V7 and SVR3.1}).
11502
11503@item -W posix
11504@itemx --posix
11505@cindex @code{--posix} option
11506@cindex POSIX mode
11507Operate in strict POSIX mode.  This disables all @code{gawk}
11508extensions (just like @samp{--traditional}), and adds the following additional
11509restrictions:
11510
11511@c IMPORTANT! Keep this list in sync with the one in node POSIX
11512
11513@itemize @bullet
11514@item
11515@code{\x} escape sequences are not recognized
11516(@pxref{Escape Sequences}).
11517
11518@item
11519Newlines do not act as whitespace to separate fields when @code{FS} is
11520equal to a single space.
11521
11522@item
11523The synonym @code{func} for the keyword @code{function} is not
11524recognized (@pxref{Definition Syntax, ,Function Definition Syntax}).
11525
11526@item
11527The operators @samp{**} and @samp{**=} cannot be used in
11528place of @samp{^} and @samp{^=} (@pxref{Arithmetic Ops, ,Arithmetic Operators},
11529and also @pxref{Assignment Ops, ,Assignment Expressions}).
11530
11531@item
11532Specifying @samp{-Ft} on the command line does not set the value
11533of @code{FS} to be a single tab character
11534(@pxref{Field Separators, ,Specifying How Fields are Separated}).
11535
11536@item
11537The @code{fflush} built-in function is not supported
11538(@pxref{I/O Functions, , Built-in Functions for Input/Output}).
11539@end itemize
11540
11541If you supply both @samp{--traditional} and @samp{--posix} on the
11542command line, @samp{--posix} will take precedence. @code{gawk}
11543will also issue a warning if both options are supplied.
11544
11545@item -W re-interval
11546@itemx --re-interval
11547Allow interval expressions
11548(@pxref{Regexp Operators, , Regular Expression Operators}),
11549in regexps.
11550Because interval expressions were traditionally not available in @code{awk},
11551@code{gawk} does not provide them by default. This prevents old @code{awk}
11552programs from breaking.
11553
11554@item -W source @var{program-text}
11555@itemx --source @var{program-text}
11556@cindex @code{--source} option
11557Program source code is taken from the @var{program-text}.  This option
11558allows you to mix source code in files with source
11559code that you enter on the command line. This is particularly useful
11560when you have library functions that you wish to use from your command line
11561programs (@pxref{AWKPATH Variable, ,The @code{AWKPATH} Environment Variable}).
11562
11563@item -W version
11564@itemx --version
11565@cindex @code{--version} option
11566Prints version information for this particular copy of @code{gawk}.
11567This allows you to determine if your copy of @code{gawk} is up to date
11568with respect to whatever the Free Software Foundation is currently
11569distributing.
11570It is also useful for bug reports
11571(@pxref{Bugs,  , Reporting Problems and Bugs}).
11572@end table
11573
11574Any other options are flagged as invalid with a warning message, but
11575are otherwise ignored.
11576
11577In compatibility mode, as a special case, if the value of @var{fs} supplied
11578to the @samp{-F} option is @samp{t}, then @code{FS} is set to the tab
11579character (@code{"\t"}).  This is only true for @samp{--traditional}, and not
11580for @samp{--posix}
11581(@pxref{Field Separators, ,Specifying How Fields are Separated}).
11582
11583The @samp{-f} option may be used more than once on the command line.
11584If it is, @code{awk} reads its program source from all of the named files, as
11585if they had been concatenated together into one big file.  This is
11586useful for creating libraries of @code{awk} functions.  Useful functions
11587can be written once, and then retrieved from a standard place, instead
11588of having to be included into each individual program.
11589
11590You can type in a program at the terminal and still use library functions,
11591by specifying @samp{-f /dev/tty}.  @code{awk} will read a file from the terminal
11592to use as part of the @code{awk} program.  After typing your program,
11593type @kbd{Control-d} (the end-of-file character) to terminate it.
11594(You may also use @samp{-f -} to read program source from the standard
11595input, but then you will not be able to also use the standard input as a
11596source of data.)
11597
11598Because it is clumsy using the standard @code{awk} mechanisms to mix source
11599file and command line @code{awk} programs, @code{gawk} provides the
11600@samp{--source} option.  This does not require you to pre-empt the standard
11601input for your source code, and allows you to easily mix command line
11602and library source code
11603(@pxref{AWKPATH Variable, ,The @code{AWKPATH} Environment Variable}).
11604
11605If no @samp{-f} or @samp{--source} option is specified, then @code{gawk}
11606will use the first non-option command line argument as the text of the
11607program source code.
11608
11609@cindex @code{POSIXLY_CORRECT} environment variable
11610@cindex environment variable, @code{POSIXLY_CORRECT}
11611If the environment variable @code{POSIXLY_CORRECT} exists,
11612then @code{gawk} will behave in strict POSIX mode, exactly as if
11613you had supplied the @samp{--posix} command line option.
11614Many GNU programs look for this environment variable to turn on
11615strict POSIX mode. If you supply @samp{--lint} on the command line,
11616and @code{gawk} turns on POSIX mode because of @code{POSIXLY_CORRECT},
11617then it will print a warning message indicating that POSIX
11618mode is in effect.
11619
11620You would typically set this variable in your shell's startup file.
11621For a Bourne compatible shell (such as Bash), you would add these
11622lines to the @file{.profile} file in your home directory.
11623
11624@example
11625@group
11626POSIXLY_CORRECT=true
11627export POSIXLY_CORRECT
11628@end group
11629@end example
11630
11631For a @code{csh} compatible shell,@footnote{Not recommended.}
11632you would add this line to the @file{.login} file in your home directory.
11633
11634@example
11635setenv POSIXLY_CORRECT true
11636@end example
11637
11638@node Other Arguments, AWKPATH Variable, Options, Invoking Gawk
11639@section Other Command Line Arguments
11640
11641Any additional arguments on the command line are normally treated as
11642input files to be processed in the order specified.   However, an
11643argument that has the form @code{@var{var}=@var{value}}, assigns
11644the value @var{value} to the variable @var{var}---it does not specify a
11645file at all.
11646
11647@vindex ARGIND
11648@vindex ARGV
11649All these arguments are made available to your @code{awk} program in the
11650@code{ARGV} array (@pxref{Built-in Variables}).  Command line options
11651and the program text (if present) are omitted from @code{ARGV}.
11652All other arguments, including variable assignments, are
11653included.   As each element of @code{ARGV} is processed, @code{gawk}
11654sets the variable @code{ARGIND} to the index in @code{ARGV} of the
11655current element.
11656
11657The distinction between file name arguments and variable-assignment
11658arguments is made when @code{awk} is about to open the next input file.
11659At that point in execution, it checks the ``file name'' to see whether
11660it is really a variable assignment; if so, @code{awk} sets the variable
11661instead of reading a file.
11662
11663Therefore, the variables actually receive the given values after all
11664previously specified files have been read.  In particular, the values of
11665variables assigned in this fashion are @emph{not} available inside a
11666@code{BEGIN} rule
11667(@pxref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns}),
11668since such rules are run before @code{awk} begins scanning the argument list.
11669
11670@cindex dark corner
11671The variable values given on the command line are processed for escape
11672sequences (d.c.) (@pxref{Escape Sequences}).
11673
11674In some earlier implementations of @code{awk}, when a variable assignment
11675occurred before any file names, the assignment would happen @emph{before}
11676the @code{BEGIN} rule was executed.  @code{awk}'s behavior was thus
11677inconsistent; some command line assignments were available inside the
11678@code{BEGIN} rule, while others were not.  However,
11679some applications came to depend
11680upon this ``feature.''  When @code{awk} was changed to be more consistent,
11681the @samp{-v} option was added to accommodate applications that depended
11682upon the old behavior.
11683
11684The variable assignment feature is most useful for assigning to variables
11685such as @code{RS}, @code{OFS}, and @code{ORS}, which control input and
11686output formats, before scanning the data files.  It is also useful for
11687controlling state if multiple passes are needed over a data file.  For
11688example:
11689
11690@cindex multiple passes over data
11691@cindex passes, multiple
11692@example
11693awk 'pass == 1  @{ @var{pass 1 stuff} @}
11694     pass == 2  @{ @var{pass 2 stuff} @}' pass=1 mydata pass=2 mydata
11695@end example
11696
11697Given the variable assignment feature, the @samp{-F} option for setting
11698the value of @code{FS} is not
11699strictly necessary.  It remains for historical compatibility.
11700
11701@node AWKPATH Variable, Obsolete, Other Arguments, Invoking Gawk
11702@section The @code{AWKPATH} Environment Variable
11703@cindex @code{AWKPATH} environment variable
11704@cindex environment variable, @code{AWKPATH}
11705@cindex search path
11706@cindex directory search
11707@cindex path, search
11708@cindex differences between @code{gawk} and @code{awk}
11709
11710The previous section described how @code{awk} program files can be named
11711on the command line with the @samp{-f} option.  In most @code{awk}
11712implementations, you must supply a precise path name for each program
11713file, unless the file is in the current directory.
11714
11715@cindex search path, for source files
11716But in @code{gawk}, if the file name supplied to the @samp{-f} option
11717does not contain a @samp{/}, then @code{gawk} searches a list of
11718directories (called the @dfn{search path}), one by one, looking for a
11719file with the specified name.
11720
11721The search path is a string consisting of directory names
11722separated by colons.  @code{gawk} gets its search path from the
11723@code{AWKPATH} environment variable.  If that variable does not exist,
11724@code{gawk} uses a default path, which is
11725@samp{.:/usr/local/share/awk}.@footnote{Your version of @code{gawk}
11726may use a different directory; it
11727will depend upon how @code{gawk} was built and installed. The actual
11728directory will be the value of @samp{$(datadir)} generated when
11729@code{gawk} was configured.  You probably don't need to worry about this
11730though.} (Programs written for use by
11731system administrators should use an @code{AWKPATH} variable that
11732does not include the current directory, @file{.}.)
11733
11734The search path feature is particularly useful for building up libraries
11735of useful @code{awk} functions.  The library files can be placed in a
11736standard directory that is in the default path, and then specified on
11737the command line with a short file name.  Otherwise, the full file name
11738would have to be typed for each file.
11739
11740By using both the @samp{--source} and @samp{-f} options, your command line
11741@code{awk} programs can use facilities in @code{awk} library files.
11742@xref{Library Functions, , A Library of @code{awk} Functions}.
11743
11744Path searching is not done if @code{gawk} is in compatibility mode.
11745This is true for both @samp{--traditional} and @samp{--posix}.
11746@xref{Options, ,Command Line Options}.
11747
11748@strong{Note:} if you want files in the current directory to be found,
11749you must include the current directory in the path, either by including
11750@file{.} explicitly in the path, or by writing a null entry in the
11751path.  (A null entry is indicated by starting or ending the path with a
11752colon, or by placing two colons next to each other (@samp{::}).)  If the
11753current directory is not included in the path, then files cannot be
11754found in the current directory.  This path search mechanism is identical
11755to the shell's.
11756@c someday, @cite{The Bourne Again Shell}....
11757
11758Starting with version 3.0, if @code{AWKPATH} is not defined in the
11759environment, @code{gawk} will place its default search path into
11760@code{ENVIRON["AWKPATH"]}. This makes it easy to determine
11761the actual search path @code{gawk} will use.
11762
11763@node Obsolete, Undocumented, AWKPATH Variable, Invoking Gawk
11764@section Obsolete Options and/or Features
11765
11766@cindex deprecated options
11767@cindex obsolete options
11768@cindex deprecated features
11769@cindex obsolete features
11770This section describes features and/or command line options from
11771previous releases of @code{gawk} that are either not available in the
11772current version, or that are still supported but deprecated (meaning that
11773they will @emph{not} be in the next release).
11774
11775@c update this section for each release!
11776
11777For version @value{VERSION}.@value{PATCHLEVEL} of @code{gawk}, there are no
11778command line options
11779or other deprecated features from the previous version of @code{gawk}.
11780@iftex
11781This section
11782@end iftex
11783@ifinfo
11784This node
11785@end ifinfo
11786is thus essentially a place holder,
11787in case some option becomes obsolete in a future version of @code{gawk}.
11788
11789@ignore
11790@c This is pretty old news...
11791The public-domain version of @code{strftime} that is distributed with
11792@code{gawk} changed for the 2.14 release.  The @samp{%V} conversion specifier
11793that used to generate the date in VMS format was changed to @samp{%v}.
11794This is because the POSIX standard for the @code{date} utility now
11795specifies a @samp{%V} conversion specifier.
11796@xref{Time Functions, ,Functions for Dealing with Time Stamps}, for details.
11797@end ignore
11798
11799@node Undocumented, Known Bugs, Obsolete, Invoking Gawk
11800@section Undocumented Options and Features
11801@cindex undocumented features
11802@display
11803@i{Use the Source, Luke!}
11804Obi-Wan
11805@end display
11806@sp 1
11807
11808This section intentionally left blank.
11809
11810@c Read The Source, Luke!
11811
11812@ignore
11813@c If these came out in the Info file or TeX document, then they wouldn't
11814@c be undocumented, would they?
11815
11816@code{gawk} has one undocumented option:
11817
11818@table @code
11819@item -W nostalgia
11820@itemx --nostalgia
11821Print the message @code{"awk: bailing out near line 1"} and dump core.
11822This option was inspired by the common behavior of very early versions of
11823Unix @code{awk}, and by a t--shirt.
11824@end table
11825
11826Early versions of @code{awk} used to not require any separator (either
11827a newline or @samp{;}) between the rules in @code{awk} programs.  Thus,
11828it was common to see one-line programs like:
11829
11830@example
11831awk '@{ sum += $1 @} END @{ print sum @}'
11832@end example
11833
11834@code{gawk} actually supports this, but it is purposely undocumented
11835since it is considered bad style.  The correct way to write such a program
11836is either
11837
11838@example
11839awk '@{ sum += $1 @} ; END @{ print sum @}'
11840@end example
11841
11842@noindent
11843or
11844
11845@example
11846awk '@{ sum += $1 @}
11847     END @{ print sum @}' data
11848@end example
11849
11850@noindent
11851@xref{Statements/Lines, ,@code{awk} Statements Versus Lines}, for a fuller
11852explanation.
11853
11854@end ignore
11855
11856@node Known Bugs, , Undocumented, Invoking Gawk
11857@section Known Bugs in @code{gawk}
11858@cindex bugs, known in @code{gawk}
11859@cindex known bugs
11860
11861@itemize @bullet
11862@item
11863The @samp{-F} option for changing the value of @code{FS}
11864(@pxref{Options, ,Command Line Options})
11865is not necessary given the command line variable
11866assignment feature; it remains only for backwards compatibility.
11867
11868@item
11869If your system actually has support for @file{/dev/fd} and the
11870associated @file{/dev/stdin}, @file{/dev/stdout}, and
11871@file{/dev/stderr} files, you may get different output from @code{gawk}
11872than you would get on a system without those files.  When @code{gawk}
11873interprets these files internally, it synchronizes output to the
11874standard output with output to @file{/dev/stdout}, while on a system
11875with those files, the output is actually to different open files
11876(@pxref{Special Files, ,Special File Names in @code{gawk}}).
11877
11878@item
11879Syntactically invalid single character programs tend to overflow
11880the parse stack, generating a rather unhelpful message.  Such programs
11881are surprisingly difficult to diagnose in the completely general case,
11882and the effort to do so really is not worth it.
11883@end itemize
11884
11885@node Library Functions, Sample Programs, Invoking Gawk, Top
11886@chapter A Library of @code{awk} Functions
11887
11888@c 2e: USE TEXINFO-2 FUNCTION DEFINITION STUFF!!!!!!!!!!!!!
11889This chapter presents a library of useful @code{awk} functions.  The
11890sample programs presented later
11891(@pxref{Sample Programs, ,Practical @code{awk} Programs})
11892use these functions.
11893The functions are presented here in a progression from simple to complex.
11894
11895@ref{Extract Program, ,Extracting Programs from Texinfo Source Files},
11896presents a program that you can use to extract the source code for
11897these example library functions and programs from the Texinfo source
11898for this @value{DOCUMENT}.
11899(This has already been done as part of the @code{gawk} distribution.)
11900
11901If you have written one or more useful, general purpose @code{awk} functions,
11902and would like to contribute them for a subsequent edition of this @value{DOCUMENT},
11903please contact the author.  @xref{Bugs, ,Reporting Problems and Bugs},
11904for information on doing this.  Don't just send code, as you will be
11905required to either place your code in the public domain,
11906publish it under the GPL (@pxref{Copying, ,GNU GENERAL PUBLIC LICENSE}),
11907or assign the copyright in it to the Free Software Foundation.
11908
11909@menu
11910* Portability Notes::           What to do if you don't have @code{gawk}.
11911* Nextfile Function::           Two implementations of a @code{nextfile}
11912                                function.
11913* Assert Function::             A function for assertions in @code{awk}
11914                                programs.
11915* Round Function::              A function for rounding if @code{sprintf} does
11916                                not do it correctly.
11917* Ordinal Functions::           Functions for using characters as numbers and
11918                                vice versa.
11919* Join Function::               A function to join an array into a string.
11920* Mktime Function::             A function to turn a date into a timestamp.
11921* Gettimeofday Function::       A function to get formatted times.
11922* Filetrans Function::          A function for handling data file transitions.
11923* Getopt Function::             A function for processing command line
11924                                arguments.
11925* Passwd Functions::            Functions for getting user information.
11926* Group Functions::             Functions for getting group information.
11927* Library Names::               How to best name private global variables in
11928                                library functions.
11929@end menu
11930
11931@node Portability Notes, Nextfile Function, Library Functions, Library Functions
11932@section Simulating @code{gawk}-specific Features
11933@cindex portability issues
11934
11935The programs in this chapter and in
11936@ref{Sample Programs, ,Practical @code{awk} Programs},
11937freely use features that are specific to @code{gawk}.
11938This section briefly discusses how you can rewrite these programs for
11939different implementations of @code{awk}.
11940
11941Diagnostic error messages are sent to @file{/dev/stderr}.
11942Use @samp{| "cat 1>&2"} instead of @samp{> "/dev/stderr"}, if your system
11943does not have a @file{/dev/stderr}, or if you cannot use @code{gawk}.
11944
11945A number of programs use @code{nextfile}
11946(@pxref{Nextfile Statement, ,The @code{nextfile} Statement}),
11947to skip any remaining input in the input file.
11948@ref{Nextfile Function, ,Implementing @code{nextfile} as a Function},
11949shows you how to write a function that will do the same thing.
11950
11951Finally, some of the programs choose to ignore upper-case and lower-case
11952distinctions in their input. They do this by assigning one to @code{IGNORECASE}.
11953You can achieve the same effect by adding the following rule to the
11954beginning of the program:
11955
11956@example
11957# ignore case
11958@{ $0 = tolower($0) @}
11959@end example
11960
11961@noindent
11962Also, verify that all regexp and string constants used in
11963comparisons only use lower-case letters.
11964
11965@node Nextfile Function, Assert Function, Portability Notes, Library Functions
11966@section Implementing @code{nextfile} as a Function
11967
11968@cindex skipping input files
11969@cindex input files, skipping
11970The @code{nextfile} statement presented in
11971@ref{Nextfile Statement, ,The @code{nextfile} Statement},
11972is a @code{gawk}-specific extension.  It is not available in other
11973implementations of @code{awk}.  This section shows two versions of a
11974@code{nextfile} function that you can use to simulate @code{gawk}'s
11975@code{nextfile} statement if you cannot use @code{gawk}.
11976
11977Here is a first attempt at writing a @code{nextfile} function.
11978
11979@example
11980@group
11981# nextfile --- skip remaining records in current file
11982
11983# this should be read in before the "main" awk program
11984
11985function nextfile()    @{ _abandon_ = FILENAME; next @}
11986
11987_abandon_ == FILENAME  @{ next @}
11988@end group
11989@end example
11990
11991This file should be included before the main program, because it supplies
11992a rule that must be executed first.  This rule compares the current data
11993file's name (which is always in the @code{FILENAME} variable) to a private
11994variable named @code{_abandon_}.  If the file name matches, then the action
11995part of the rule executes a @code{next} statement, to go on to the next
11996record.  (The use of @samp{_} in the variable name is a convention.
11997It is discussed more fully in
11998@ref{Library Names,  , Naming Library Function Global Variables}.)
11999
12000The use of the @code{next} statement effectively creates a loop that reads
12001all the records from the current data file.
12002Eventually, the end of the file is reached, and
12003a new data file is opened, changing the value of @code{FILENAME}.
12004Once this happens, the comparison of @code{_abandon_} to @code{FILENAME}
12005fails, and execution continues with the first rule of the ``real'' program.
12006
12007The @code{nextfile} function itself simply sets the value of @code{_abandon_}
12008and then executes a @code{next} statement to start the loop
12009going.@footnote{Some implementations of @code{awk} do not allow you to
12010execute @code{next} from within a function body. Some other work-around
12011will be necessary if you use such a version.}
12012@c mawk is what we're talking about.
12013
12014This initial version has a subtle problem.  What happens if the same data
12015file is listed @emph{twice} on the command line, one right after the other,
12016or even with just a variable assignment between the two occurrences of
12017the file name?
12018
12019@c @findex nextfile
12020@c do it this way, since all the indices are merged
12021@cindex @code{nextfile} function
12022In such a case,
12023this code will skip right through the file, a second time, even though
12024it should stop when it gets to the end of the first occurrence.
12025Here is a second version of @code{nextfile} that remedies this problem.
12026
12027@example
12028@c file eg/lib/nextfile.awk
12029# nextfile --- skip remaining records in current file
12030# correctly handle successive occurrences of the same file
12031# Arnold Robbins, arnold@@gnu.org, Public Domain
12032# May, 1993
12033
12034# this should be read in before the "main" awk program
12035
12036function nextfile()   @{ _abandon_ = FILENAME; next @}
12037
12038@group
12039_abandon_ == FILENAME @{
12040      if (FNR == 1)
12041          _abandon_ = ""
12042      else
12043          next
12044@}
12045@end group
12046@c endfile
12047@end example
12048
12049The @code{nextfile} function has not changed.  It sets @code{_abandon_}
12050equal to the current file name and then executes a @code{next} satement.
12051The @code{next} statement reads the next record and increments @code{FNR},
12052so @code{FNR} is guaranteed to have a value of at least two.
12053However, if @code{nextfile} is called for the last record in the file,
12054then @code{awk} will close the current data file and move on to the next
12055one.  Upon doing so, @code{FILENAME} will be set to the name of the new file,
12056and @code{FNR} will be reset to one.  If this next file is the same as
12057the previous one, @code{_abandon_} will still be equal to @code{FILENAME}.
12058However, @code{FNR} will be equal to one, telling us that this is a new
12059occurrence of the file, and not the one we were reading when the
12060@code{nextfile} function was executed.  In that case, @code{_abandon_}
12061is reset to the empty string, so that further executions of this rule
12062will fail (until the next time that @code{nextfile} is called).
12063
12064If @code{FNR} is not one, then we are still in the original data file,
12065and the program executes a @code{next} statement to skip through it.
12066
12067An important question to ask at this point is: ``Given that the
12068functionality of @code{nextfile} can be provided with a library file,
12069why is it built into @code{gawk}?''  This is an important question.  Adding
12070features for little reason leads to larger, slower programs that are
12071harder to maintain.
12072
12073The answer is that building @code{nextfile} into @code{gawk} provides
12074significant gains in efficiency.  If the @code{nextfile} function is executed
12075at the beginning of a large data file, @code{awk} still has to scan the entire
12076file, splitting it up into records, just to skip over it.  The built-in
12077@code{nextfile} can simply close the file immediately and proceed to the
12078next one, saving a lot of time.  This is particularly important in
12079@code{awk}, since @code{awk} programs are generally I/O bound (i.e.@:
12080they spend most of their time doing input and output, instead of performing
12081computations).
12082
12083@node Assert Function, Round Function, Nextfile Function, Library Functions
12084@section Assertions
12085
12086@cindex assertions
12087@cindex @code{assert}, C version
12088When writing large programs, it is often useful to be able to know
12089that a condition or set of conditions is true.  Before proceeding with a
12090particular computation, you make a statement about what you believe to be
12091the case.  Such a statement is known as an
12092``assertion.''  The C language provides an @code{<assert.h>} header file
12093and corresponding @code{assert} macro that the programmer can use to make
12094assertions.  If an assertion fails, the @code{assert} macro arranges to
12095print a diagnostic message describing the condition that should have
12096been true but was not, and then it kills the program.  In C, using
12097@code{assert} looks this:
12098
12099@c NEEDED
12100@page
12101@example
12102#include <assert.h>
12103
12104int myfunc(int a, double b)
12105@{
12106     assert(a <= 5 && b >= 17);
12107     @dots{}
12108@}
12109@end example
12110
12111If the assertion failed, the program would print a message similar to
12112this:
12113
12114@example
12115prog.c:5: assertion failed: a <= 5 && b >= 17
12116@end example
12117
12118@findex assert
12119The ANSI C language makes it possible to turn the condition into a string for use
12120in printing the diagnostic message.  This is not possible in @code{awk}, so
12121this @code{assert} function also requires a string version of the condition
12122that is being tested.
12123
12124@example
12125@c @group
12126@c file eg/lib/assert.awk
12127# assert --- assert that a condition is true. Otherwise exit.
12128# Arnold Robbins, arnold@@gnu.org, Public Domain
12129# May, 1993
12130
12131function assert(condition, string)
12132@{
12133    if (! condition) @{
12134        printf("%s:%d: assertion failed: %s\n",
12135            FILENAME, FNR, string) > "/dev/stderr"
12136        _assert_exit = 1
12137        exit 1
12138    @}
12139@}
12140
12141END @{
12142    if (_assert_exit)
12143        exit 1
12144@}
12145@c endfile
12146@c @end group
12147@end example
12148
12149The @code{assert} function tests the @code{condition} parameter. If it
12150is false, it prints a message to standard error, using the @code{string}
12151parameter to describe the failed condition.  It then sets the variable
12152@code{_assert_exit} to one, and executes the @code{exit} statement.
12153The @code{exit} statement jumps to the @code{END} rule. If the @code{END}
12154rules finds @code{_assert_exit} to be true, then it exits immediately.
12155
12156The purpose of the @code{END} rule with its test is to
12157keep any other @code{END} rules from running.  When an assertion fails, the
12158program should exit immediately.
12159If no assertions fail, then @code{_assert_exit} will still be
12160false when the @code{END} rule is run normally, and the rest of the
12161program's @code{END} rules will execute.
12162For all of this to work correctly, @file{assert.awk} must be the
12163first source file read by @code{awk}.
12164
12165@c NEEDED
12166@page
12167You would use this function in your programs this way:
12168
12169@example
12170function myfunc(a, b)
12171@{
12172     assert(a <= 5 && b >= 17, "a <= 5 && b >= 17")
12173     @dots{}
12174@}
12175@end example
12176
12177@noindent
12178If the assertion failed, you would see a message like this:
12179
12180@example
12181mydata:1357: assertion failed: a <= 5 && b >= 17
12182@end example
12183
12184There is a problem with this version of @code{assert}, that it may not
12185be possible to work around with standard @code{awk}.
12186An @code{END} rule is automatically added
12187to the program calling @code{assert}.  Normally, if a program consists
12188of just a @code{BEGIN} rule, the input files and/or standard input are
12189not read. However, now that the program has an @code{END} rule, @code{awk}
12190will attempt to read the input data files, or standard input
12191(@pxref{Using BEGIN/END, , Startup and Cleanup Actions}),
12192most likely causing the program to hang, waiting for input.
12193
12194@node Round Function, Ordinal Functions, Assert Function, Library Functions
12195@section Rounding Numbers
12196
12197@cindex rounding
12198The way @code{printf} and @code{sprintf}
12199(@pxref{Printf, , Using @code{printf} Statements for Fancier Printing})
12200do rounding will often depend
12201upon the system's C @code{sprintf} subroutine.
12202On many machines,
12203@code{sprintf} rounding is ``unbiased,'' which means it doesn't always
12204round a trailing @samp{.5} up, contrary to naive expectations.  In unbiased
12205rounding, @samp{.5} rounds to even, rather than always up, so 1.5 rounds to
122062 but 4.5 rounds to 4.
12207The result is that if you are using a format that does
12208rounding (e.g., @code{"%.0f"}) you should check what your system does.
12209The following function does traditional rounding;
12210it might be useful if your awk's @code{printf} does unbiased rounding.
12211
12212@findex round
12213@example
12214@c file eg/lib/round.awk
12215# round --- do normal rounding
12216#
12217# Arnold Robbins, arnold@@gnu.org, August, 1996
12218# Public Domain
12219
12220function round(x,   ival, aval, fraction)
12221@{
12222   ival = int(x)    # integer part, int() truncates
12223
12224   # see if fractional part
12225   if (ival == x)   # no fraction
12226      return x
12227
12228   if (x < 0) @{
12229      aval = -x     # absolute value
12230      ival = int(aval)
12231      fraction = aval - ival
12232@group
12233      if (fraction >= .5)
12234         return int(x) - 1   # -2.5 --> -3
12235      else
12236         return int(x)       # -2.3 --> -2
12237@end group
12238   @} else @{
12239      fraction = x - ival
12240      if (fraction >= .5)
12241         return ival + 1
12242      else
12243         return ival
12244   @}
12245@}
12246
12247# test harness
12248@{ print $0, round($0) @}
12249@c endfile
12250@end example
12251
12252@node Ordinal Functions, Join Function, Round Function, Library Functions
12253@section Translating Between Characters and Numbers
12254
12255@cindex numeric character values
12256@cindex values of characters as numbers
12257One commercial implementation of @code{awk} supplies a built-in function,
12258@code{ord}, which takes a character and returns the numeric value for that
12259character in the machine's character set.  If the string passed to
12260@code{ord} has more than one character, only the first one is used.
12261
12262The inverse of this function is @code{chr} (from the function of the same
12263name in Pascal), which takes a number and returns the corresponding character.
12264
12265Both functions can be written very nicely in @code{awk}; there is no real
12266reason to build them into the @code{awk} interpreter.
12267
12268@findex ord
12269@findex chr
12270@example
12271@group
12272@c file eg/lib/ord.awk
12273# ord.awk --- do ord and chr
12274#
12275# Global identifiers:
12276#    _ord_:        numerical values indexed by characters
12277#    _ord_init:    function to initialize _ord_
12278#
12279# Arnold Robbins
12280# arnold@@gnu.org
12281# Public Domain
12282# 16 January, 1992
12283# 20 July, 1992, revised
12284
12285BEGIN    @{ _ord_init() @}
12286@c endfile
12287@end group
12288
12289@c @group
12290@c file eg/lib/ord.awk
12291function _ord_init(    low, high, i, t)
12292@{
12293    low = sprintf("%c", 7) # BEL is ascii 7
12294    if (low == "\a") @{    # regular ascii
12295        low = 0
12296        high = 127
12297    @} else if (sprintf("%c", 128 + 7) == "\a") @{
12298        # ascii, mark parity
12299        low = 128
12300        high = 255
12301    @} else @{        # ebcdic(!)
12302        low = 0
12303        high = 255
12304    @}
12305
12306    for (i = low; i <= high; i++) @{
12307        t = sprintf("%c", i)
12308        _ord_[t] = i
12309    @}
12310@}
12311@c endfile
12312@c @end group
12313@end example
12314
12315@cindex character sets
12316@cindex character encodings
12317@cindex ASCII
12318@cindex EBCDIC
12319@cindex mark parity
12320Some explanation of the numbers used by @code{chr} is worthwhile.
12321The most prominent character set in use today is ASCII. Although an
12322eight-bit byte can hold 256 distinct values (from zero to 255), ASCII only
12323defines characters that use the values from zero to 127.@footnote{ASCII
12324has been extended in many countries to use the values from 128 to 255
12325for country-specific characters.  If your  system uses these extensions,
12326you can simplify @code{_ord_init} to simply loop from zero to 255.}
12327At least one computer manufacturer that we know of
12328@c Pr1me, blech
12329uses ASCII, but with mark parity, meaning that the leftmost bit in the byte
12330is always one.  What this means is that on those systems, characters
12331have numeric values from 128 to 255.
12332Finally, large mainframe systems use the EBCDIC character set, which
12333uses all 256 values.
12334While there are other character sets in use on some older systems,
12335they are not really worth worrying about.
12336
12337@example
12338@group
12339@c file eg/lib/ord.awk
12340function ord(str,    c)
12341@{
12342    # only first character is of interest
12343    c = substr(str, 1, 1)
12344    return _ord_[c]
12345@}
12346@c endfile
12347@end group
12348
12349@group
12350@c file eg/lib/ord.awk
12351function chr(c)
12352@{
12353    # force c to be numeric by adding 0
12354    return sprintf("%c", c + 0)
12355@}
12356@c endfile
12357@end group
12358
12359@group
12360@c file eg/lib/ord.awk
12361#### test code ####
12362# BEGIN    \
12363# @{
12364#    for (;;) @{
12365#        printf("enter a character: ")
12366#        if (getline var <= 0)
12367#            break
12368#        printf("ord(%s) = %d\n", var, ord(var))
12369#    @}
12370# @}
12371@c endfile
12372@end group
12373@end example
12374
12375An obvious improvement to these functions would be to move the code for the
12376@code{@w{_ord_init}} function into the body of the @code{BEGIN} rule.  It was
12377written this way initially for ease of development.
12378
12379There is a ``test program'' in a @code{BEGIN} rule, for testing the
12380function.  It is commented out for production use.
12381
12382@node Join Function, Mktime Function, Ordinal Functions, Library Functions
12383@section Merging an Array Into a String
12384
12385@cindex merging strings
12386When doing string processing, it is often useful to be able to join
12387all the strings in an array into one long string.  The following function,
12388@code{join}, accomplishes this task.  It is used later in several of
12389the application programs
12390(@pxref{Sample Programs, ,Practical @code{awk} Programs}).
12391
12392Good function design is important; this function needs to be general, but it
12393should also have a reasonable default behavior.  It is called with an array
12394and the beginning and ending indices of the elements in the array to be
12395merged.  This assumes that the array indices are numeric---a reasonable
12396assumption since the array was likely created with @code{split}
12397(@pxref{String Functions, ,Built-in Functions for String Manipulation}).
12398
12399@findex join
12400@example
12401@group
12402@c file eg/lib/join.awk
12403# join.awk --- join an array into a string
12404# Arnold Robbins, arnold@@gnu.org, Public Domain
12405# May 1993
12406
12407function join(array, start, end, sep,    result, i)
12408@{
12409    if (sep == "")
12410       sep = " "
12411    else if (sep == SUBSEP) # magic value
12412       sep = ""
12413    result = array[start]
12414    for (i = start + 1; i <= end; i++)
12415        result = result sep array[i]
12416    return result
12417@}
12418@c endfile
12419@end group
12420@end example
12421
12422An optional additional argument is the separator to use when joining the
12423strings back together.  If the caller supplies a non-empty value,
12424@code{join} uses it.  If it is not supplied, it will have a null
12425value.  In this case, @code{join} uses a single blank as a default
12426separator for the strings.  If the value is equal to @code{SUBSEP},
12427then @code{join} joins the strings with no separator between them.
12428@code{SUBSEP} serves as a ``magic'' value to indicate that there should
12429be no separation between the component strings.
12430
12431It would be nice if @code{awk} had an assignment operator for concatenation.
12432The lack of an explicit operator for concatenation makes string operations
12433more difficult than they really need to be.
12434
12435@node Mktime Function, Gettimeofday Function, Join Function, Library Functions
12436@section Turning Dates Into Timestamps
12437
12438The @code{systime} function built in to @code{gawk}
12439returns the current time of day as
12440a timestamp in ``seconds since the Epoch.''  This timestamp
12441can be converted into a printable date of almost infinitely variable
12442format using the built-in @code{strftime} function.
12443(For more information on @code{systime} and @code{strftime},
12444@pxref{Time Functions, ,Functions for Dealing with Time Stamps}.)
12445
12446@cindex converting dates to timestamps
12447@cindex dates, converting to timestamps
12448@cindex timestamps, converting from dates
12449An interesting but difficult problem is to convert a readable representation
12450of a date back into a timestamp.  The ANSI C library provides a @code{mktime}
12451function that does the basic job, converting a canonical representation of a
12452date into a timestamp.
12453
12454It would appear at first glance that @code{gawk} would have to supply a
12455@code{mktime} built-in function that was simply a ``hook'' to the C language
12456version.  In fact though, @code{mktime} can be implemented entirely in
12457@code{awk}.@footnote{@value{UPDATE-MONTH}: Actually, I was mistaken when
12458I wrote this.  The version presented here doesn't always work correctly,
12459and the next major version of @code{gawk} will provide @code{mktime}
12460as a built-in function.}
12461@c sigh.
12462
12463Here is a version of @code{mktime} for @code{awk}.  It takes a simple
12464representation of the date and time, and converts it into a timestamp.
12465
12466The code is presented here intermixed with explanatory prose.  In
12467@ref{Extract Program, ,Extracting Programs from Texinfo Source Files},
12468you will see how the Texinfo source file for this @value{DOCUMENT}
12469can be processed to extract the code into a single source file.
12470
12471The program begins with a descriptive comment and a @code{BEGIN} rule
12472that initializes a table @code{_tm_months}.  This table is a two-dimensional
12473array that has the lengths of the months.  The first index is zero for
12474regular years, and one for leap years.  The values are the same for all the
12475months in both kinds of years, except for February; thus the use of multiple
12476assignment.
12477
12478@example
12479@c @group
12480@c file eg/lib/mktime.awk
12481# mktime.awk --- convert a canonical date representation
12482#                into a timestamp
12483# Arnold Robbins, arnold@@gnu.org, Public Domain
12484# May 1993
12485
12486BEGIN    \
12487@{
12488    # Initialize table of month lengths
12489    _tm_months[0,1] = _tm_months[1,1] = 31
12490    _tm_months[0,2] = 28; _tm_months[1,2] = 29
12491    _tm_months[0,3] = _tm_months[1,3] = 31
12492    _tm_months[0,4] = _tm_months[1,4] = 30
12493    _tm_months[0,5] = _tm_months[1,5] = 31
12494    _tm_months[0,6] = _tm_months[1,6] = 30
12495    _tm_months[0,7] = _tm_months[1,7] = 31
12496    _tm_months[0,8] = _tm_months[1,8] = 31
12497    _tm_months[0,9] = _tm_months[1,9] = 30
12498    _tm_months[0,10] = _tm_months[1,10] = 31
12499    _tm_months[0,11] = _tm_months[1,11] = 30
12500    _tm_months[0,12] = _tm_months[1,12] = 31
12501@}
12502@c endfile
12503@c @end group
12504@end example
12505
12506The benefit of merging multiple @code{BEGIN} rules
12507(@pxref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns})
12508is particularly clear when writing library files.  Functions in library
12509files can cleanly initialize their own private data and also provide clean-up
12510actions in private @code{END} rules.
12511
12512The next function is a simple one that computes whether a given year is or
12513is not a leap year.  If a year is evenly divisible by four, but not evenly
12514divisible by 100, or if it is evenly divisible by 400, then it is a leap
12515year.  Thus, 1904 was a leap year, 1900 was not, but 2000 will be.
12516@c Change this after the year 2000 to ``2000 was'' (:-)
12517
12518@findex _tm_isleap
12519@example
12520@group
12521@c file eg/lib/mktime.awk
12522# decide if a year is a leap year
12523function _tm_isleap(year,    ret)
12524@{
12525    ret = (year % 4 == 0 && year % 100 != 0) ||
12526            (year % 400 == 0)
12527
12528    return ret
12529@}
12530@c endfile
12531@end group
12532@end example
12533
12534This function is only used a few times in this file, and its computation
12535could have been written @dfn{in-line} (at the point where it's used).
12536Making it a separate function made the original development easier, and also
12537avoids the possibility of typing errors when duplicating the code in
12538multiple places.
12539
12540The next function is more interesting.  It does most of the work of
12541generating a timestamp, which is converting a date and time into some number
12542of seconds since the Epoch.  The caller passes an array (rather
12543imaginatively named @code{a}) containing six
12544values: the year including century, the month as a number between one and 12,
12545the day of the month, the hour as a number between zero and 23, the minute in
12546the hour, and the seconds within the minute.
12547
12548The function uses several local variables to precompute the number of
12549seconds in an hour, seconds in a day, and seconds in a year.  Often,
12550similar C code simply writes out the expression in-line, expecting the
12551compiler to do @dfn{constant folding}.  E.g., most C compilers would
12552turn @samp{60 * 60} into @samp{3600} at compile time, instead of recomputing
12553it every time at run time.  Precomputing these values makes the
12554function more efficient.
12555
12556@findex _tm_addup
12557@example
12558@c @group
12559@c file eg/lib/mktime.awk
12560# convert a date into seconds
12561function _tm_addup(a,    total, yearsecs, daysecs,
12562                         hoursecs, i, j)
12563@{
12564    hoursecs = 60 * 60
12565    daysecs = 24 * hoursecs
12566    yearsecs = 365 * daysecs
12567
12568    total = (a[1] - 1970) * yearsecs
12569
12570@group
12571    # extra day for leap years
12572    for (i = 1970; i < a[1]; i++)
12573        if (_tm_isleap(i))
12574            total += daysecs
12575@end group
12576
12577@group
12578    j = _tm_isleap(a[1])
12579    for (i = 1; i < a[2]; i++)
12580        total += _tm_months[j, i] * daysecs
12581@end group
12582
12583    total += (a[3] - 1) * daysecs
12584    total += a[4] * hoursecs
12585    total += a[5] * 60
12586    total += a[6]
12587
12588    return total
12589@}
12590@c endfile
12591@c @end group
12592@end example
12593
12594The function starts with a first approximation of all the seconds between
12595Midnight, January 1, 1970,@footnote{This is the Epoch on POSIX systems.
12596It may be different on other systems.} and the beginning of the current
12597year.  It then goes through all those years, and for every leap year,
12598adds an additional day's worth of seconds.
12599
12600The variable @code{j} holds either one or zero, if the current year is or is not
12601a leap year.
12602For every month in the current year prior to the current month, it adds
12603the number of seconds in the month, using the appropriate entry in the
12604@code{_tm_months} array.
12605
12606Finally, it adds in the seconds for the number of days prior to the current
12607day, and the number of hours, minutes, and seconds in the current day.
12608
12609The result is a count of seconds since January 1, 1970.  This value is not
12610yet what is needed though.  The reason why is described shortly.
12611
12612The main @code{mktime} function takes a single character string argument.
12613This string is a representation of a date and time in a ``canonical''
12614(fixed) form.  This string should be
12615@code{"@var{year} @var{month} @var{day} @var{hour} @var{minute} @var{second}"}.
12616
12617@findex mktime
12618@example
12619@c @group
12620@c file eg/lib/mktime.awk
12621# mktime --- convert a date into seconds,
12622#            compensate for time zone
12623
12624function mktime(str,    res1, res2, a, b, i, j, t, diff)
12625@{
12626    i = split(str, a, " ")    # don't rely on FS
12627
12628    if (i != 6)
12629        return -1
12630
12631    # force numeric
12632    for (j in a)
12633        a[j] += 0
12634
12635@group
12636    # validate
12637    if (a[1] < 1970 ||
12638        a[2] < 1 || a[2] > 12 ||
12639        a[3] < 1 || a[3] > 31 ||
12640        a[4] < 0 || a[4] > 23 ||
12641        a[5] < 0 || a[5] > 59 ||
12642        a[6] < 0 || a[6] > 60 )
12643            return -1
12644@end group
12645
12646    res1 = _tm_addup(a)
12647    t = strftime("%Y %m %d %H %M %S", res1)
12648
12649    if (_tm_debug)
12650        printf("(%s) -> (%s)\n", str, t) > "/dev/stderr"
12651
12652    split(t, b, " ")
12653    res2 = _tm_addup(b)
12654
12655    diff = res1 - res2
12656
12657    if (_tm_debug)
12658        printf("diff = %d seconds\n", diff) > "/dev/stderr"
12659
12660    res1 += diff
12661
12662    return res1
12663@}
12664@c endfile
12665@c @end group
12666@end example
12667
12668The function first splits the string into an array, using spaces and tabs as
12669separators.  If there are not six elements in the array, it returns an
12670error, signaled as the value @minus{}1.
12671Next, it forces each element of the array to be numeric, by adding zero to it.
12672The following @samp{if} statement then makes sure that each element is
12673within an allowable range.  (This checking could be extended further, e.g.,
12674to make sure that the day of the month is within the correct range for the
12675particular month supplied.)  All of this is essentially preliminary set-up
12676and error checking.
12677
12678Recall that @code{_tm_addup} generated a value in seconds since Midnight,
12679January 1, 1970.  This value is not directly usable as the result we want,
12680@emph{since the calculation does not account for the local timezone}.  In other
12681words, the value represents the count in seconds since the Epoch, but only
12682for UTC (Universal Coordinated Time).  If the local timezone is east or west
12683of UTC, then some number of hours should be either added to, or subtracted from
12684the resulting timestamp.
12685
12686For example, 6:23 p.m. in Atlanta, Georgia (USA), is normally five hours west
12687of (behind) UTC.  It is only four hours behind UTC if daylight savings
12688time is in effect.
12689If you are calling @code{mktime} in Atlanta, with the argument
12690@code{@w{"1993 5 23 18 23 12"}}, the result from @code{_tm_addup} will be
12691for 6:23 p.m. UTC, which is only 2:23 p.m. in Atlanta.  It is necessary to
12692add another four hours worth of seconds to the result.
12693
12694How can @code{mktime} determine how far away it is from UTC?  This is
12695surprisingly easy.  The returned timestamp represents the time passed to
12696@code{mktime} @emph{as UTC}.  This timestamp can be fed back to
12697@code{strftime}, which will format it as a @emph{local} time; i.e.@: as
12698if it already had the UTC difference added in to it.  This is done by
12699giving @code{@w{"%Y %m %d %H %M %S"}} to @code{strftime} as the format
12700argument.  It returns the computed timestamp in the original string
12701format.  The result represents a time that accounts for the UTC
12702difference.  When the new time is converted back to a timestamp, the
12703difference between the two timestamps is the difference (in seconds)
12704between the local timezone and UTC.  This difference is then added back
12705to the original result.  An example demonstrating this is presented below.
12706
12707Finally, there is a ``main'' program for testing the function.
12708
12709@example
12710@c there used to be a blank line after the getline,
12711@c squished out for page formatting reasons
12712@c @group
12713@c file eg/lib/mktime.awk
12714BEGIN  @{
12715    if (_tm_test) @{
12716        printf "Enter date as yyyy mm dd hh mm ss: "
12717        getline _tm_test_date
12718        t = mktime(_tm_test_date)
12719        r = strftime("%Y %m %d %H %M %S", t)
12720        printf "Got back (%s)\n", r
12721    @}
12722@}
12723@c endfile
12724@c @end group
12725@end example
12726
12727The entire program uses two variables that can be set on the command
12728line to control debugging output and to enable the test in the final
12729@code{BEGIN} rule.  Here is the result of a test run. (Note that debugging
12730output is to standard error, and test output is to standard output.)
12731
12732@example
12733@c @group
12734$ gawk -f mktime.awk -v _tm_test=1 -v _tm_debug=1
12735@print{} Enter date as yyyy mm dd hh mm ss: 1993 5 23 15 35 10
12736@error{} (1993 5 23 15 35 10) -> (1993 05 23 11 35 10)
12737@error{} diff = 14400 seconds
12738@print{} Got back (1993 05 23 15 35 10)
12739@c @end group
12740@end example
12741
12742The time entered was 3:35 p.m. (15:35 on a 24-hour clock), on May 23, 1993.
12743The first line
12744of debugging output shows the resulting time as UTC---four hours ahead of
12745the local time zone.  The second line shows that the difference is 14400
12746seconds, which is four hours.  (The difference is only four hours, since
12747daylight savings time is in effect during May.)
12748The final line of test output shows that the timezone compensation
12749algorithm works; the returned time is the same as the entered time.
12750
12751This program does not solve the general problem of turning an arbitrary date
12752representation into a timestamp.  That problem is very involved.  However,
12753the @code{mktime} function provides a foundation upon which to build. Other
12754software can convert month names into numeric months, and AM/PM times into
1275524-hour clocks, to generate the ``canonical'' format that @code{mktime}
12756requires.
12757
12758@node Gettimeofday Function, Filetrans Function, Mktime Function, Library Functions
12759@section Managing the Time of Day
12760
12761@cindex formatted timestamps
12762@cindex timestamps, formatted
12763The @code{systime} and @code{strftime} functions described in
12764@ref{Time Functions, ,Functions for Dealing with Time Stamps},
12765provide the minimum functionality necessary for dealing with the time of day
12766in human readable form.  While @code{strftime} is extensive, the control
12767formats are not necessarily easy to remember or intuitively obvious when
12768reading a program.
12769
12770The following function, @code{gettimeofday}, populates a user-supplied array
12771with pre-formatted time information.  It returns a string with the current
12772time formatted in the same way as the @code{date} utility.
12773
12774@findex gettimeofday
12775@example
12776@c @group
12777@c file eg/lib/gettime.awk
12778# gettimeofday --- get the time of day in a usable format
12779# Arnold Robbins, arnold@@gnu.org, Public Domain, May 1993
12780#
12781# Returns a string in the format of output of date(1)
12782# Populates the array argument time with individual values:
12783#    time["second"]       -- seconds (0 - 59)
12784#    time["minute"]       -- minutes (0 - 59)
12785#    time["hour"]         -- hours (0 - 23)
12786#    time["althour"]      -- hours (0 - 12)
12787#    time["monthday"]     -- day of month (1 - 31)
12788#    time["month"]        -- month of year (1 - 12)
12789#    time["monthname"]    -- name of the month
12790#    time["shortmonth"]   -- short name of the month
12791#    time["year"]         -- year within century (0 - 99)
12792#    time["fullyear"]     -- year with century (19xx or 20xx)
12793#    time["weekday"]      -- day of week (Sunday = 0)
12794#    time["altweekday"]   -- day of week (Monday = 0)
12795#    time["weeknum"]      -- week number, Sunday first day
12796#    time["altweeknum"]   -- week number, Monday first day
12797#    time["dayname"]      -- name of weekday
12798#    time["shortdayname"] -- short name of weekday
12799#    time["yearday"]      -- day of year (0 - 365)
12800#    time["timezone"]     -- abbreviation of timezone name
12801#    time["ampm"]         -- AM or PM designation
12802
12803function gettimeofday(time,    ret, now, i)
12804@{
12805    # get time once, avoids unnecessary system calls
12806    now = systime()
12807
12808    # return date(1)-style output
12809    ret = strftime("%a %b %d %H:%M:%S %Z %Y", now)
12810
12811    # clear out target array
12812    for (i in time)
12813        delete time[i]
12814
12815    # fill in values, force numeric values to be
12816    # numeric by adding 0
12817    time["second"]       = strftime("%S", now) + 0
12818    time["minute"]       = strftime("%M", now) + 0
12819    time["hour"]         = strftime("%H", now) + 0
12820    time["althour"]      = strftime("%I", now) + 0
12821    time["monthday"]     = strftime("%d", now) + 0
12822    time["month"]        = strftime("%m", now) + 0
12823    time["monthname"]    = strftime("%B", now)
12824    time["shortmonth"]   = strftime("%b", now)
12825    time["year"]         = strftime("%y", now) + 0
12826    time["fullyear"]     = strftime("%Y", now) + 0
12827    time["weekday"]      = strftime("%w", now) + 0
12828    time["altweekday"]   = strftime("%u", now) + 0
12829    time["dayname"]      = strftime("%A", now)
12830    time["shortdayname"] = strftime("%a", now)
12831    time["yearday"]      = strftime("%j", now) + 0
12832    time["timezone"]     = strftime("%Z", now)
12833    time["ampm"]         = strftime("%p", now)
12834    time["weeknum"]      = strftime("%U", now) + 0
12835    time["altweeknum"]   = strftime("%W", now) + 0
12836
12837    return ret
12838@}
12839@c endfile
12840@end example
12841
12842The string indices are easier to use and read than the various formats
12843required by @code{strftime}.  The @code{alarm} program presented in
12844@ref{Alarm Program, ,An Alarm Clock Program},
12845uses this function.
12846
12847@c exercise!!!
12848The @code{gettimeofday} function is presented above as it was written. A
12849more general design for this function would have allowed the user to supply
12850an optional timestamp value that would have been used instead of the current
12851time.
12852
12853@node Filetrans Function, Getopt Function, Gettimeofday Function, Library Functions
12854@section Noting Data File Boundaries
12855
12856@cindex per file initialization and clean-up
12857The @code{BEGIN} and @code{END} rules are each executed exactly once, at
12858the beginning and end respectively of your @code{awk} program
12859(@pxref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns}).
12860We (the @code{gawk} authors) once had a user who mistakenly thought that the
12861@code{BEGIN} rule was executed at the beginning of each data file and the
12862@code{END} rule was executed at the end of each data file.  When informed
12863that this was not the case, the user requested that we add new special
12864patterns to @code{gawk}, named @code{BEGIN_FILE} and @code{END_FILE}, that
12865would have the desired behavior.  He even supplied us the code to do so.
12866
12867However, after a little thought, I came up with the following library program.
12868It arranges to call two user-supplied functions, @code{beginfile} and
12869@code{endfile}, at the beginning and end of each data file.
12870Besides solving the problem in only nine(!) lines of code, it does so
12871@emph{portably}; this will work with any implementation of @code{awk}.
12872
12873@example
12874@c @group
12875# transfile.awk
12876#
12877# Give the user a hook for filename transitions
12878#
12879# The user must supply functions beginfile() and endfile()
12880# that each take the name of the file being started or
12881# finished, respectively.
12882#
12883# Arnold Robbins, arnold@@gnu.org, January 1992
12884# Public Domain
12885
12886FILENAME != _oldfilename \
12887@{
12888    if (_oldfilename != "")
12889        endfile(_oldfilename)
12890    _oldfilename = FILENAME
12891    beginfile(FILENAME)
12892@}
12893
12894END   @{ endfile(FILENAME) @}
12895@c @end group
12896@end example
12897
12898This file must be loaded before the user's ``main'' program, so that the
12899rule it supplies will be executed first.
12900
12901This rule relies on @code{awk}'s @code{FILENAME} variable that
12902automatically changes for each new data file.  The current file name is
12903saved in a private variable, @code{_oldfilename}.  If @code{FILENAME} does
12904not equal @code{_oldfilename}, then a new data file is being processed, and
12905it is necessary to call @code{endfile} for the old file.  Since
12906@code{endfile} should only be called if a file has been processed, the
12907program first checks to make sure that @code{_oldfilename} is not the null
12908string.  The program then assigns the current file name to
12909@code{_oldfilename}, and calls @code{beginfile} for the file.
12910Since, like all @code{awk} variables, @code{_oldfilename} will be
12911initialized to the null string, this rule executes correctly even for the
12912first data file.
12913
12914The program also supplies an @code{END} rule, to do the final processing for
12915the last file.  Since this @code{END} rule comes before any @code{END} rules
12916supplied in the ``main'' program, @code{endfile} will be called first.  Once
12917again the value of multiple @code{BEGIN} and @code{END} rules should be clear.
12918
12919@findex beginfile
12920@findex endfile
12921This version has same problem as the first version of @code{nextfile}
12922(@pxref{Nextfile Function, ,Implementing @code{nextfile} as a Function}).
12923If the same data file occurs twice in a row on command line, then
12924@code{endfile} and @code{beginfile} will not be executed at the end of the
12925first pass and at the beginning of the second pass.
12926This version solves the problem.
12927
12928@example
12929@c @group
12930@c file eg/lib/ftrans.awk
12931# ftrans.awk --- handle data file transitions
12932#
12933# user supplies beginfile() and endfile() functions
12934#
12935# Arnold Robbins, arnold@@gnu.org, November 1992
12936# Public Domain
12937
12938FNR == 1 @{
12939    if (_filename_ != "")
12940        endfile(_filename_)
12941    _filename_ = FILENAME
12942    beginfile(FILENAME)
12943@}
12944
12945END  @{ endfile(_filename_) @}
12946@c endfile
12947@c @end group
12948@end example
12949
12950In @ref{Wc Program, ,Counting Things},
12951you will see how this library function can be used, and
12952how it simplifies writing the main program.
12953
12954@node Getopt Function, Passwd Functions, Filetrans Function, Library Functions
12955@section Processing Command Line Options
12956
12957@cindex @code{getopt}, C version
12958@cindex processing arguments
12959@cindex argument processing
12960Most utilities on POSIX compatible systems take options or ``switches'' on
12961the command line that can be used to change the way a program behaves.
12962@code{awk} is an example of such a program
12963(@pxref{Options, ,Command Line Options}).
12964Often, options take @dfn{arguments}, data that the program needs to
12965correctly obey the command line option.  For example, @code{awk}'s
12966@samp{-F} option requires a string to use as the field separator.
12967The first occurrence on the command line of either @samp{--} or a
12968string that does not begin with @samp{-} ends the options.
12969
12970Most Unix systems provide a C function named @code{getopt} for processing
12971command line arguments.  The programmer provides a string describing the one
12972letter options. If an option requires an argument, it is followed in the
12973string with a colon.  @code{getopt} is also passed the
12974count and values of the command line arguments, and is called in a loop.
12975@code{getopt} processes the command line arguments for option letters.
12976Each time around the loop, it returns a single character representing the
12977next option letter that it found, or @samp{?} if it found an invalid option.
12978When it returns @minus{}1, there are no options left on the command line.
12979
12980When using @code{getopt}, options that do not take arguments can be
12981grouped together.  Furthermore, options that take arguments require that the
12982argument be present.  The argument can immediately follow the option letter,
12983or it can be a separate command line argument.
12984
12985Given a hypothetical program that takes
12986three command line options, @samp{-a}, @samp{-b}, and @samp{-c}, and
12987@samp{-b} requires an argument, all of the following are valid ways of
12988invoking the program:
12989
12990@example
12991@c @group
12992prog -a -b foo -c data1 data2 data3
12993prog -ac -bfoo -- data1 data2 data3
12994prog -acbfoo data1 data2 data3
12995@c @end group
12996@end example
12997
12998Notice that when the argument is grouped with its option, the rest of
12999the command line argument is considered to be the option's argument.
13000In the above example, @samp{-acbfoo} indicates that all of the
13001@samp{-a}, @samp{-b}, and @samp{-c} options were supplied,
13002and that @samp{foo} is the argument to the @samp{-b} option.
13003
13004@code{getopt} provides four external variables that the programmer can use.
13005
13006@table @code
13007@item optind
13008The index in the argument value array (@code{argv}) where the first
13009non-option command line argument can be found.
13010
13011@item optarg
13012The string value of the argument to an option.
13013
13014@item opterr
13015Usually @code{getopt} prints an error message when it finds an invalid
13016option.  Setting @code{opterr} to zero disables this feature.  (An
13017application might wish to print its own error message.)
13018
13019@item optopt
13020The letter representing the command line option.
13021While not usually documented, most versions supply this variable.
13022@end table
13023
13024The following C fragment shows how @code{getopt} might process command line
13025arguments for @code{awk}.
13026
13027@example
13028@group
13029int
13030main(int argc, char *argv[])
13031@{
13032    @dots{}
13033    /* print our own message */
13034    opterr = 0;
13035@end group
13036@group
13037    while ((c = getopt(argc, argv, "v:f:F:W:")) != -1) @{
13038        switch (c) @{
13039        case 'f':    /* file */
13040            @dots{}
13041            break;
13042        case 'F':    /* field separator */
13043            @dots{}
13044            break;
13045        case 'v':    /* variable assignment */
13046            @dots{}
13047            break;
13048        case 'W':    /* extension */
13049            @dots{}
13050            break;
13051        case '?':
13052        default:
13053            usage();
13054            break;
13055        @}
13056    @}
13057    @dots{}
13058@}
13059@end group
13060@end example
13061
13062As a side point, @code{gawk} actually uses the GNU @code{getopt_long}
13063function to process both normal and GNU-style long options
13064(@pxref{Options, ,Command Line Options}).
13065
13066The abstraction provided by @code{getopt} is very useful, and would be quite
13067handy in @code{awk} programs as well.  Here is an @code{awk} version of
13068@code{getopt}.  This function highlights one of the greatest weaknesses in
13069@code{awk}, which is that it is very poor at manipulating single characters.
13070Repeated calls to @code{substr} are necessary for accessing individual
13071characters (@pxref{String Functions, ,Built-in Functions for String Manipulation}).
13072
13073The discussion walks through the code a bit at a time.
13074
13075@example
13076@c @group
13077@c file eg/lib/getopt.awk
13078# getopt --- do C library getopt(3) function in awk
13079#
13080# arnold@@gnu.org
13081# Public domain
13082#
13083# Initial version: March, 1991
13084# Revised: May, 1993
13085
13086@group
13087# External variables:
13088#    Optind -- index of ARGV for first non-option argument
13089#    Optarg -- string value of argument to current option
13090#    Opterr -- if non-zero, print our own diagnostic
13091#    Optopt -- current option letter
13092@end group
13093
13094# Returns
13095#    -1     at end of options
13096#    ?      for unrecognized option
13097#    <c>    a character representing the current option
13098
13099# Private Data
13100#    _opti  index in multi-flag option, e.g., -abc
13101@c endfile
13102@c @end group
13103@end example
13104
13105The function starts out with some documentation: who wrote the code,
13106and when it was revised, followed by a list of the global variables it uses,
13107what the return values are and what they mean, and any global variables that
13108are ``private'' to this library function.  Such documentation is essential
13109for any program, and particularly for library functions.
13110
13111@findex getopt
13112@example
13113@c @group
13114@c file eg/lib/getopt.awk
13115function getopt(argc, argv, options,    optl, thisopt, i)
13116@{
13117    optl = length(options)
13118    if (optl == 0)        # no options given
13119        return -1
13120
13121    if (argv[Optind] == "--") @{  # all done
13122        Optind++
13123        _opti = 0
13124        return -1
13125    @} else if (argv[Optind] !~ /^-[^: \t\n\f\r\v\b]/) @{
13126        _opti = 0
13127        return -1
13128    @}
13129@c endfile
13130@c @end group
13131@end example
13132
13133The function first checks that it was indeed called with a string of options
13134(the @code{options} parameter).  If @code{options} has a zero length,
13135@code{getopt} immediately returns @minus{}1.
13136
13137The next thing to check for is the end of the options.  A @samp{--} ends the
13138command line options, as does any command line argument that does not begin
13139with a @samp{-}.  @code{Optind} is used to step through the array of command
13140line arguments; it retains its value across calls to @code{getopt}, since it
13141is a global variable.
13142
13143The regexp used, @code{@w{/^-[^: \t\n\f\r\v\b]/}}, is
13144perhaps a bit of overkill; it checks for a @samp{-} followed by anything
13145that is not whitespace and not a colon.
13146If the current command line argument does not match this pattern,
13147it is not an option, and it ends option processing.
13148
13149@example
13150@group
13151@c file eg/lib/getopt.awk
13152    if (_opti == 0)
13153        _opti = 2
13154    thisopt = substr(argv[Optind], _opti, 1)
13155    Optopt = thisopt
13156    i = index(options, thisopt)
13157    if (i == 0) @{
13158        if (Opterr)
13159            printf("%c -- invalid option\n",
13160                                  thisopt) > "/dev/stderr"
13161        if (_opti >= length(argv[Optind])) @{
13162            Optind++
13163            _opti = 0
13164        @} else
13165            _opti++
13166        return "?"
13167    @}
13168@c endfile
13169@end group
13170@end example
13171
13172The @code{_opti} variable tracks the position in the current command line
13173argument (@code{argv[Optind]}).  In the case that multiple options were
13174grouped together with one @samp{-} (e.g., @samp{-abx}), it is necessary
13175to return them to the user one at a time.
13176
13177If @code{_opti} is equal to zero, it is set to two, the index in the string
13178of the next character to look at (we skip the @samp{-}, which is at position
13179one).  The variable @code{thisopt} holds the character, obtained with
13180@code{substr}.  It is saved in @code{Optopt} for the main program to use.
13181
13182If @code{thisopt} is not in the @code{options} string, then it is an
13183invalid option.  If @code{Opterr} is non-zero, @code{getopt} prints an error
13184message on the standard error that is similar to the message from the C
13185version of @code{getopt}.
13186
13187Since the option is invalid, it is necessary to skip it and move on to the
13188next option character.  If @code{_opti} is greater than or equal to the
13189length of the current command line argument, then it is necessary to move on
13190to the next one, so @code{Optind} is incremented and @code{_opti} is reset
13191to zero. Otherwise, @code{Optind} is left alone and @code{_opti} is merely
13192incremented.
13193
13194In any case, since the option was invalid, @code{getopt} returns @samp{?}.
13195The main program can examine @code{Optopt} if it needs to know what the
13196invalid option letter actually was.
13197
13198@example
13199@group
13200@c file eg/lib/getopt.awk
13201    if (substr(options, i + 1, 1) == ":") @{
13202        # get option argument
13203        if (length(substr(argv[Optind], _opti + 1)) > 0)
13204            Optarg = substr(argv[Optind], _opti + 1)
13205        else
13206            Optarg = argv[++Optind]
13207        _opti = 0
13208    @} else
13209        Optarg = ""
13210@c endfile
13211@end group
13212@end example
13213
13214If the option requires an argument, the option letter is followed by a colon
13215in the @code{options} string.  If there are remaining characters in the
13216current command line argument (@code{argv[Optind]}), then the rest of that
13217string is assigned to @code{Optarg}.  Otherwise, the next command line
13218argument is used (@samp{-xFOO} vs. @samp{@w{-x FOO}}). In either case,
13219@code{_opti} is reset to zero, since there are no more characters left to
13220examine in the current command line argument.
13221
13222@example
13223@c @group
13224@c file eg/lib/getopt.awk
13225    if (_opti == 0 || _opti >= length(argv[Optind])) @{
13226        Optind++
13227        _opti = 0
13228    @} else
13229        _opti++
13230    return thisopt
13231@}
13232@c endfile
13233@c @end group
13234@end example
13235
13236Finally, if @code{_opti} is either zero or greater than the length of the
13237current command line argument, it means this element in @code{argv} is
13238through being processed, so @code{Optind} is incremented to point to the
13239next element in @code{argv}.  If neither condition is true, then only
13240@code{_opti} is incremented, so that the next option letter can be processed
13241on the next call to @code{getopt}.
13242
13243@example
13244@c @group
13245@c file eg/lib/getopt.awk
13246BEGIN @{
13247    Opterr = 1    # default is to diagnose
13248    Optind = 1    # skip ARGV[0]
13249
13250    # test program
13251    if (_getopt_test) @{
13252        while ((_go_c = getopt(ARGC, ARGV, "ab:cd")) != -1)
13253            printf("c = <%c>, optarg = <%s>\n",
13254                                       _go_c, Optarg)
13255        printf("non-option arguments:\n")
13256        for (; Optind < ARGC; Optind++)
13257            printf("\tARGV[%d] = <%s>\n",
13258                                    Optind, ARGV[Optind])
13259    @}
13260@}
13261@c endfile
13262@c @end group
13263@end example
13264
13265The @code{BEGIN} rule initializes both @code{Opterr} and @code{Optind} to one.
13266@code{Opterr} is set to one, since the default behavior is for @code{getopt}
13267to print a diagnostic message upon seeing an invalid option.  @code{Optind}
13268is set to one, since there's no reason to look at the program name, which is
13269in @code{ARGV[0]}.
13270
13271The rest of the @code{BEGIN} rule is a simple test program.  Here is the
13272result of two sample runs of the test program.
13273
13274@example
13275@group
13276$ awk -f getopt.awk -v _getopt_test=1 -- -a -cbARG bax -x
13277@print{} c = <a>, optarg = <>
13278@print{} c = <c>, optarg = <>
13279@print{} c = <b>, optarg = <ARG>
13280@print{} non-option arguments:
13281@print{}         ARGV[3] = <bax>
13282@print{}         ARGV[4] = <-x>
13283@end group
13284
13285@group
13286$ awk -f getopt.awk -v _getopt_test=1 -- -a -x -- xyz abc
13287@print{} c = <a>, optarg = <>
13288@error{} x -- invalid option
13289@print{} c = <?>, optarg = <>
13290@print{} non-option arguments:
13291@print{}         ARGV[4] = <xyz>
13292@print{}         ARGV[5] = <abc>
13293@end group
13294@end example
13295
13296The first @samp{--} terminates the arguments to @code{awk}, so that it does
13297not try to interpret the @samp{-a} etc. as its own options.
13298
13299Several of the sample programs presented in
13300@ref{Sample Programs, ,Practical @code{awk} Programs},
13301use @code{getopt} to process their arguments.
13302
13303@node Passwd Functions, Group Functions, Getopt Function, Library Functions
13304@section Reading the User Database
13305
13306@cindex @file{/dev/user}
13307The @file{/dev/user} special file
13308(@pxref{Special Files, ,Special File Names in @code{gawk}})
13309provides access to the current user's real and effective user and group id
13310numbers, and if available, the user's supplementary group set.
13311However, since these are numbers, they do not provide very useful
13312information to the average user.  There needs to be some way to find the
13313user information associated with the user and group numbers.  This
13314section presents a suite of functions for retrieving information from the
13315user database.  @xref{Group Functions, ,Reading the Group Database},
13316for a similar suite that retrieves information from the group database.
13317
13318@cindex @code{getpwent}, C version
13319@cindex user information
13320@cindex login information
13321@cindex account information
13322@cindex password file
13323The POSIX standard does not define the file where user information is
13324kept.  Instead, it provides the @code{<pwd.h>} header file
13325and several C language subroutines for obtaining user information.
13326The primary function is @code{getpwent}, for ``get password entry.''
13327The ``password'' comes from the original user database file,
13328@file{/etc/passwd}, which kept user information, along with the
13329encrypted passwords (hence the name).
13330
13331While an @code{awk} program could simply read @file{/etc/passwd} directly
13332(the format is well known), because of the way password
13333files are handled on networked systems,
13334this file may not contain complete information about the system's set of users.
13335
13336@cindex @code{pwcat} program
13337To be sure of being
13338able to produce a readable, complete version of the user database, it is
13339necessary to write a small C program that calls @code{getpwent}.
13340@code{getpwent} is defined to return a pointer to a @code{struct passwd}.
13341Each time it is called, it returns the next entry in the database.
13342When there are no more entries, it returns @code{NULL}, the null pointer.
13343When this happens, the C program should call @code{endpwent} to close the
13344database.
13345Here is @code{pwcat}, a C program that ``cats'' the password database.
13346
13347@findex pwcat.c
13348@example
13349@c @group
13350@c file eg/lib/pwcat.c
13351/*
13352 * pwcat.c
13353 *
13354 * Generate a printable version of the password database
13355 *
13356 * Arnold Robbins
13357 * arnold@@gnu.org
13358 * May 1993
13359 * Public Domain
13360 */
13361
13362#include <stdio.h>
13363#include <pwd.h>
13364
13365int
13366main(argc, argv)
13367int argc;
13368char **argv;
13369@{
13370    struct passwd *p;
13371
13372    while ((p = getpwent()) != NULL)
13373        printf("%s:%s:%d:%d:%s:%s:%s\n",
13374            p->pw_name, p->pw_passwd, p->pw_uid,
13375            p->pw_gid, p->pw_gecos, p->pw_dir, p->pw_shell);
13376
13377    endpwent();
13378    exit(0);
13379@}
13380@c endfile
13381@c @end group
13382@end example
13383
13384If you don't understand C, don't worry about it.
13385The output from @code{pwcat} is the user database, in the traditional
13386@file{/etc/passwd} format of colon-separated fields.  The fields are:
13387
13388@table @asis
13389@item Login name
13390The user's login name.
13391
13392@item Encrypted password
13393The user's encrypted password.  This may not be available on some systems.
13394
13395@item User-ID
13396The user's numeric user-id number.
13397
13398@item Group-ID
13399The user's numeric group-id number.
13400
13401@item Full name
13402The user's full name, and perhaps other information associated with the
13403user.
13404
13405@item Home directory
13406The user's login, or ``home'' directory (familiar to shell programmers as
13407@code{$HOME}).
13408
13409@item Login shell
13410The program that will be run when the user logs in.  This is usually a
13411shell, such as Bash (the Gnu Bourne-Again shell).
13412@end table
13413
13414Here are a few lines representative of @code{pwcat}'s output.
13415
13416@example
13417@c @group
13418$ pwcat
13419@print{} root:3Ov02d5VaUPB6:0:1:Operator:/:/bin/sh
13420@print{} nobody:*:65534:65534::/:
13421@print{} daemon:*:1:1::/:
13422@print{} sys:*:2:2::/:/bin/csh
13423@print{} bin:*:3:3::/bin:
13424@print{} arnold:xyzzy:2076:10:Arnold Robbins:/home/arnold:/bin/sh
13425@print{} miriam:yxaay:112:10:Miriam Robbins:/home/miriam:/bin/sh
13426@print{} andy:abcca2:113:10:Andy Jacobs:/home/andy:/bin/sh
13427@dots{}
13428@c @end group
13429@end example
13430
13431With that introduction, here is a group of functions for getting user
13432information.  There are several functions here, corresponding to the C
13433functions of the same name.
13434
13435@findex _pw_init
13436@example
13437@c file eg/lib/passwdawk.in
13438@group
13439# passwd.awk --- access password file information
13440# Arnold Robbins, arnold@@gnu.org, Public Domain
13441# May 1993
13442
13443BEGIN @{
13444    # tailor this to suit your system
13445    _pw_awklib = "/usr/local/libexec/awk/"
13446@}
13447@end group
13448
13449@group
13450function _pw_init(    oldfs, oldrs, olddol0, pwcat)
13451@{
13452    if (_pw_inited)
13453        return
13454    oldfs = FS
13455    oldrs = RS
13456    olddol0 = $0
13457    FS = ":"
13458    RS = "\n"
13459    pwcat = _pw_awklib "pwcat"
13460    while ((pwcat | getline) > 0) @{
13461        _pw_byname[$1] = $0
13462        _pw_byuid[$3] = $0
13463        _pw_bycount[++_pw_total] = $0
13464    @}
13465    close(pwcat)
13466    _pw_count = 0
13467    _pw_inited = 1
13468    FS = oldfs
13469    RS = oldrs
13470    $0 = olddol0
13471@}
13472@c endfile
13473@end group
13474@end example
13475
13476The @code{BEGIN} rule sets a private variable to the directory where
13477@code{pwcat} is stored.  Since it is used to help out an @code{awk} library
13478routine, we have chosen to put it in @file{/usr/local/libexec/awk}.
13479You might want it to be in a different directory on your system.
13480
13481The function @code{_pw_init} keeps three copies of the user information
13482in three associative arrays.  The arrays are indexed by user name
13483(@code{_pw_byname}), by user-id number (@code{_pw_byuid}), and by order of
13484occurrence (@code{_pw_bycount}).
13485
13486The variable @code{_pw_inited} is used for efficiency; @code{_pw_init} only
13487needs to be called once.
13488
13489Since this function uses @code{getline} to read information from
13490@code{pwcat}, it first saves the values of @code{FS}, @code{RS}, and
13491@code{$0}.  Doing so is necessary, since these functions could be called
13492from anywhere within a user's program, and the user may have his or her
13493own values for @code{FS} and @code{RS}.
13494@ignore
13495Problem, what if FIELDWIDTHS is in use? Sigh.
13496@end ignore
13497
13498The main part of the function uses a loop to read database lines, split
13499the line into fields, and then store the line into each array as necessary.
13500When the loop is done, @code{@w{_pw_init}} cleans up by closing the pipeline,
13501setting @code{@w{_pw_inited}} to one, and restoring @code{FS}, @code{RS}, and
13502@code{$0}.  The use of @code{@w{_pw_count}} will be explained below.
13503
13504@findex getpwnam
13505@example
13506@group
13507@c file eg/lib/passwdawk.in
13508function getpwnam(name)
13509@{
13510    _pw_init()
13511    if (name in _pw_byname)
13512        return _pw_byname[name]
13513    return ""
13514@}
13515@c endfile
13516@end group
13517@end example
13518
13519The @code{getpwnam} function takes a user name as a string argument. If that
13520user is in the database, it returns the appropriate line. Otherwise it
13521returns the null string.
13522
13523@findex getpwuid
13524@example
13525@group
13526@c file eg/lib/passwdawk.in
13527function getpwuid(uid)
13528@{
13529    _pw_init()
13530    if (uid in _pw_byuid)
13531        return _pw_byuid[uid]
13532    return ""
13533@}
13534@c endfile
13535@end group
13536@end example
13537
13538Similarly,
13539the @code{getpwuid} function takes a user-id number argument. If that
13540user number is in the database, it returns the appropriate line. Otherwise it
13541returns the null string.
13542
13543@findex getpwent
13544@example
13545@c @group
13546@c file eg/lib/passwdawk.in
13547function getpwent()
13548@{
13549    _pw_init()
13550    if (_pw_count < _pw_total)
13551        return _pw_bycount[++_pw_count]
13552    return ""
13553@}
13554@c endfile
13555@c @end group
13556@end example
13557
13558The @code{getpwent} function simply steps through the database, one entry at
13559a time.  It uses @code{_pw_count} to track its current position in the
13560@code{_pw_bycount} array.
13561
13562@findex endpwent
13563@example
13564@c @group
13565@c file eg/lib/passwdawk.in
13566function endpwent()
13567@{
13568    _pw_count = 0
13569@}
13570@c endfile
13571@c @end group
13572@end example
13573
13574The @code{@w{endpwent}} function resets @code{@w{_pw_count}} to zero, so that
13575subsequent calls to @code{getpwent} will start over again.
13576
13577A conscious design decision in this suite is that each subroutine calls
13578@code{@w{_pw_init}} to initialize the database arrays.  The overhead of running
13579a separate process to generate the user database, and the I/O to scan it,
13580will only be incurred if the user's main program actually calls one of these
13581functions.  If this library file is loaded along with a user's program, but
13582none of the routines are ever called, then there is no extra run-time overhead.
13583(The alternative would be to move the body of @code{@w{_pw_init}} into a
13584@code{BEGIN} rule, which would always run @code{pwcat}.  This simplifies the
13585code but runs an extra process that may never be needed.)
13586
13587In turn, calling @code{_pw_init} is not too expensive, since the
13588@code{_pw_inited} variable keeps the program from reading the data more than
13589once.  If you are worried about squeezing every last cycle out of your
13590@code{awk} program, the check of @code{_pw_inited} could be moved out of
13591@code{_pw_init} and duplicated in all the other functions.  In practice,
13592this is not necessary, since most @code{awk} programs are I/O bound, and it
13593would clutter up the code.
13594
13595The @code{id} program in @ref{Id Program, ,Printing Out User Information},
13596uses these functions.
13597
13598@node Group Functions, Library Names, Passwd Functions, Library Functions
13599@section Reading the Group Database
13600
13601@cindex @code{getgrent}, C version
13602@cindex group information
13603@cindex account information
13604@cindex group file
13605Much of the discussion presented in
13606@ref{Passwd Functions, ,Reading the User Database},
13607applies to the group database as well.  Although there has traditionally
13608been a well known file, @file{/etc/group}, in a well known format, the POSIX
13609standard only provides a set of C library routines
13610(@code{<grp.h>} and @code{getgrent})
13611for accessing the information.
13612Even though this file may exist, it likely does not have
13613complete information.  Therefore, as with the user database, it is necessary
13614to have a small C program that generates the group database as its output.
13615
13616@cindex @code{grcat} program
13617Here is @code{grcat}, a C program that ``cats'' the group database.
13618
13619@findex grcat.c
13620@example
13621@c @group
13622@c file eg/lib/grcat.c
13623/*
13624 * grcat.c
13625 *
13626 * Generate a printable version of the group database
13627 *
13628 * Arnold Robbins, arnold@@gnu.org
13629 * May 1993
13630 * Public Domain
13631 */
13632
13633#include <stdio.h>
13634#include <grp.h>
13635
13636@group
13637int
13638main(argc, argv)
13639int argc;
13640char **argv;
13641@{
13642    struct group *g;
13643    int i;
13644@end group
13645
13646@group
13647    while ((g = getgrent()) != NULL) @{
13648        printf("%s:%s:%d:", g->gr_name, g->gr_passwd,
13649                                            g->gr_gid);
13650@end group
13651        for (i = 0; g->gr_mem[i] != NULL; i++) @{
13652            printf("%s", g->gr_mem[i]);
13653            if (g->gr_mem[i+1] != NULL)
13654                putchar(',');
13655        @}
13656        putchar('\n');
13657    @}
13658    endgrent();
13659    exit(0);
13660@}
13661@c endfile
13662@c @end group
13663@end example
13664
13665Each line in the group database represent one group.  The fields are
13666separated with colons, and represent the following information.
13667
13668@table @asis
13669@item Group Name
13670The name of the group.
13671
13672@item Group Password
13673The encrypted group password. In practice, this field is never used. It is
13674usually empty, or set to @samp{*}.
13675
13676@item Group ID Number
13677The numeric group-id number. This number should be unique within the file.
13678
13679@item Group Member List
13680A comma-separated list of user names.  These users are members of the group.
13681Most Unix systems allow users to be members of several groups
13682simultaneously.  If your system does, then reading @file{/dev/user} will
13683return those group-id numbers in @code{$5} through @code{$NF}.
13684(Note that @file{/dev/user} is a @code{gawk} extension;
13685@pxref{Special Files, ,Special File Names in @code{gawk}}.)
13686@end table
13687
13688Here is what running @code{grcat} might produce:
13689
13690@example
13691@group
13692$ grcat
13693@print{} wheel:*:0:arnold
13694@print{} nogroup:*:65534:
13695@print{} daemon:*:1:
13696@print{} kmem:*:2:
13697@print{} staff:*:10:arnold,miriam,andy
13698@print{} other:*:20:
13699@dots{}
13700@end group
13701@end example
13702
13703Here are the functions for obtaining information from the group database.
13704There are several, modeled after the C library functions of the same names.
13705
13706@findex _gr_init
13707@example
13708@group
13709@c file eg/lib/groupawk.in
13710# group.awk --- functions for dealing with the group file
13711# Arnold Robbins, arnold@@gnu.org, Public Domain
13712# May 1993
13713
13714BEGIN    \
13715@{
13716    # Change to suit your system
13717    _gr_awklib = "/usr/local/libexec/awk/"
13718@}
13719@c endfile
13720@end group
13721
13722@group
13723@c file eg/lib/groupawk.in
13724function _gr_init(    oldfs, oldrs, olddol0, grcat, n, a, i)
13725@{
13726    if (_gr_inited)
13727        return
13728@end group
13729
13730@group
13731    oldfs = FS
13732    oldrs = RS
13733    olddol0 = $0
13734    FS = ":"
13735    RS = "\n"
13736@end group
13737
13738@group
13739    grcat = _gr_awklib "grcat"
13740    while ((grcat | getline) > 0) @{
13741        if ($1 in _gr_byname)
13742            _gr_byname[$1] = _gr_byname[$1] "," $4
13743        else
13744            _gr_byname[$1] = $0
13745        if ($3 in _gr_bygid)
13746            _gr_bygid[$3] = _gr_bygid[$3] "," $4
13747        else
13748            _gr_bygid[$3] = $0
13749
13750        n = split($4, a, "[ \t]*,[ \t]*")
13751@end group
13752@group
13753        for (i = 1; i <= n; i++)
13754            if (a[i] in _gr_groupsbyuser)
13755                _gr_groupsbyuser[a[i]] = \
13756                    _gr_groupsbyuser[a[i]] " " $1
13757            else
13758                _gr_groupsbyuser[a[i]] = $1
13759@end group
13760
13761@group
13762        _gr_bycount[++_gr_count] = $0
13763    @}
13764@end group
13765@group
13766    close(grcat)
13767    _gr_count = 0
13768    _gr_inited++
13769    FS = oldfs
13770    RS = oldrs
13771    $0 = olddol0
13772@}
13773@c endfile
13774@end group
13775@end example
13776
13777The @code{BEGIN} rule sets a private variable to the directory where
13778@code{grcat} is stored.  Since it is used to help out an @code{awk} library
13779routine, we have chosen to put it in @file{/usr/local/libexec/awk}.  You might
13780want it to be in a different directory on your system.
13781
13782These routines follow the same general outline as the user database routines
13783(@pxref{Passwd Functions, ,Reading the User Database}).
13784The @code{@w{_gr_inited}} variable is used to
13785ensure that the database is scanned no more than once.
13786The @code{@w{_gr_init}} function first saves @code{FS}, @code{RS}, and
13787@code{$0}, and then sets @code{FS} and @code{RS} to the correct values for
13788scanning the group information.
13789
13790The group information is stored is several associative arrays.
13791The arrays are indexed by group name (@code{@w{_gr_byname}}), by group-id number
13792(@code{@w{_gr_bygid}}), and by position in the database (@code{@w{_gr_bycount}}).
13793There is an additional array indexed by user name (@code{@w{_gr_groupsbyuser}}),
13794that is a space separated list of groups that each user belongs to.
13795
13796Unlike the user database, it is possible to have multiple records in the
13797database for the same group.  This is common when a group has a large number
13798of members.  Such a pair of entries might look like:
13799
13800@example
13801tvpeople:*:101:johny,jay,arsenio
13802tvpeople:*:101:david,conan,tom,joan
13803@end example
13804
13805For this reason, @code{_gr_init} looks to see if a group name or
13806group-id number has already been seen.  If it has, then the user names are
13807simply concatenated onto the previous list of users.  (There is actually a
13808subtle problem with the code presented above.  Suppose that
13809the first time there were no names. This code adds the names with
13810a leading comma. It also doesn't check that there is a @code{$4}.)
13811
13812Finally, @code{_gr_init} closes the pipeline to @code{grcat}, restores
13813@code{FS}, @code{RS}, and @code{$0}, initializes @code{_gr_count} to zero
13814(it is used later), and makes @code{_gr_inited} non-zero.
13815
13816@findex getgrnam
13817@example
13818@c @group
13819@c file eg/lib/groupawk.in
13820function getgrnam(group)
13821@{
13822    _gr_init()
13823    if (group in _gr_byname)
13824        return _gr_byname[group]
13825    return ""
13826@}
13827@c endfile
13828@c @end group
13829@end example
13830
13831The @code{getgrnam} function takes a group name as its argument, and if that
13832group exists, it is returned. Otherwise, @code{getgrnam} returns the null
13833string.
13834
13835@findex getgrgid
13836@example
13837@c @group
13838@c file eg/lib/groupawk.in
13839function getgrgid(gid)
13840@{
13841    _gr_init()
13842    if (gid in _gr_bygid)
13843        return _gr_bygid[gid]
13844    return ""
13845@}
13846@c endfile
13847@c @end group
13848@end example
13849
13850The @code{getgrgid} function is similar, it takes a numeric group-id, and
13851looks up the information associated with that group-id.
13852
13853@findex getgruser
13854@example
13855@group
13856@c file eg/lib/groupawk.in
13857function getgruser(user)
13858@{
13859    _gr_init()
13860    if (user in _gr_groupsbyuser)
13861        return _gr_groupsbyuser[user]
13862    return ""
13863@}
13864@c endfile
13865@end group
13866@end example
13867
13868The @code{getgruser} function does not have a C counterpart. It takes a
13869user name, and returns the list of groups that have the user as a member.
13870
13871@findex getgrent
13872@example
13873@c @group
13874@c file eg/lib/groupawk.in
13875function getgrent()
13876@{
13877    _gr_init()
13878    if (++_gr_count in _gr_bycount)
13879        return _gr_bycount[_gr_count]
13880    return ""
13881@}
13882@c endfile
13883@c @end group
13884@end example
13885
13886The @code{getgrent} function steps through the database one entry at a time.
13887It uses @code{_gr_count} to track its position in the list.
13888
13889@findex endgrent
13890@example
13891@group
13892@c file eg/lib/groupawk.in
13893function endgrent()
13894@{
13895    _gr_count = 0
13896@}
13897@c endfile
13898@end group
13899@end example
13900
13901@code{endgrent} resets @code{_gr_count} to zero so that @code{getgrent} can
13902start over again.
13903
13904As with the user database routines, each function calls @code{_gr_init} to
13905initialize the arrays.  Doing so only incurs the extra overhead of running
13906@code{grcat} if these functions are used (as opposed to moving the body of
13907@code{_gr_init} into a @code{BEGIN} rule).
13908
13909Most of the work is in scanning the database and building the various
13910associative arrays.  The functions that the user calls are themselves very
13911simple, relying on @code{awk}'s associative arrays to do work.
13912
13913The @code{id} program in @ref{Id Program, ,Printing Out User Information},
13914uses these functions.
13915
13916@node Library Names,  , Group Functions, Library Functions
13917@section Naming Library Function Global Variables
13918
13919@cindex namespace issues in @code{awk}
13920@cindex documenting @code{awk} programs
13921@cindex programs, documenting
13922Due to the way the @code{awk} language evolved, variables are either
13923@dfn{global} (usable by the entire program), or @dfn{local} (usable just by
13924a specific function).  There is no intermediate state analogous to
13925@code{static} variables in C.
13926
13927Library functions often need to have global variables that they can use to
13928preserve state information between calls to the function. For example,
13929@code{getopt}'s variable @code{_opti}
13930(@pxref{Getopt Function, ,Processing Command Line Options}),
13931and the @code{_tm_months} array used by @code{mktime}
13932(@pxref{Mktime Function, ,Turning Dates Into Timestamps}).
13933Such variables are called @dfn{private}, since the only functions that need to
13934use them are the ones in the library.
13935
13936When writing a library function, you should try to choose names for your
13937private variables so that they will not conflict with any variables used by
13938either another library function or a user's main program.  For example, a
13939name like @samp{i} or @samp{j} is not a good choice, since user programs
13940often use variable names like these for their own purposes.
13941
13942The example programs shown in this chapter all start the names of their
13943private variables with an underscore (@samp{_}).  Users generally don't use
13944leading underscores in their variable names, so this convention immediately
13945decreases the chances that the variable name will be accidentally shared
13946with the user's program.
13947
13948In addition, several of the library functions use a prefix that helps
13949indicate what function or set of functions uses the variables. For example,
13950@code{_tm_months} in @code{mktime}
13951(@pxref{Mktime Function, ,Turning Dates Into Timestamps}), and
13952@code{_pw_byname} in the user data base routines
13953(@pxref{Passwd Functions, ,Reading the User Database}).
13954This convention is recommended, since it even further decreases the chance
13955of inadvertent conflict among variable names.
13956Note that this convention can be used equally well both for variable names
13957and for private function names too.
13958
13959While I could have re-written all the library routines to use this
13960convention, I did not do so, in order to show how my own @code{awk}
13961programming style has evolved, and to provide some basis for this
13962discussion.
13963
13964As a final note on variable naming, if a function makes global variables
13965available for use by a main program, it is a good convention to start that
13966variable's name with a capital letter.
13967For example, @code{getopt}'s @code{Opterr} and @code{Optind} variables
13968(@pxref{Getopt Function, ,Processing Command Line Options}).
13969The leading capital letter indicates that it is global, while the fact that
13970the variable name is not all capital letters indicates that the variable is
13971not one of @code{awk}'s built-in variables, like @code{FS}.
13972
13973It is also important that @emph{all} variables in library functions
13974that do not need to save state are in fact declared local.  If this is
13975not done, the variable could accidentally be used in the user's program,
13976leading to bugs that are very difficult to track down.
13977
13978@example
13979function lib_func(x, y,    l1, l2)
13980@{
13981    @dots{}
13982    @var{use variable} some_var  # some_var could be local
13983    @dots{}                   # but is not by oversight
13984@}
13985@end example
13986
13987@cindex Tcl
13988A different convention, common in the Tcl community, is to use a single
13989associative array to hold the values needed by the library function(s), or
13990``package.''  This significantly decreases the number of actual global names
13991in use.  For example, the functions described in
13992@ref{Passwd Functions, , Reading the User Database},
13993might have used @code{@w{PW_data["inited"]}}, @code{@w{PW_data["total"]}},
13994@code{@w{PW_data["count"]}} and @code{@w{PW_data["awklib"]}}, instead of
13995@code{@w{_pw_inited}}, @code{@w{_pw_awklib}}, @code{@w{_pw_total}},
13996and @code{@w{_pw_count}}.
13997
13998The conventions presented in this section are exactly that, conventions. You
13999are not required to write your programs this way, we merely recommend that
14000you do so.
14001
14002@node Sample Programs, Language History, Library Functions, Top
14003@chapter Practical @code{awk} Programs
14004
14005This chapter presents a potpourri of @code{awk} programs for your reading
14006enjoyment.
14007@iftex
14008There are two sections.  The first presents @code{awk}
14009versions of several common POSIX utilities.
14010The second is a grab-bag of interesting programs.
14011@end iftex
14012
14013Many of these programs use the library functions presented in
14014@ref{Library Functions, ,A Library of @code{awk} Functions}.
14015
14016@menu
14017* Clones::                    Clones of common utilities.
14018* Miscellaneous Programs::    Some interesting @code{awk} programs.
14019@end menu
14020
14021@node Clones, Miscellaneous Programs, Sample Programs, Sample Programs
14022@section Re-inventing Wheels for Fun and Profit
14023
14024This section presents a number of POSIX utilities that are implemented in
14025@code{awk}.  Re-inventing these programs in @code{awk} is often enjoyable,
14026since the algorithms can be very clearly expressed, and usually the code is
14027very concise and simple.  This is true because @code{awk} does so much for you.
14028
14029It should be noted that these programs are not necessarily intended to
14030replace the installed versions on your system.  Instead, their
14031purpose is to illustrate @code{awk} language programming for ``real world''
14032tasks.
14033
14034The programs are presented in alphabetical order.
14035
14036@menu
14037* Cut Program::             The @code{cut} utility.
14038* Egrep Program::           The @code{egrep} utility.
14039* Id Program::              The @code{id} utility.
14040* Split Program::           The @code{split} utility.
14041* Tee Program::             The @code{tee} utility.
14042* Uniq Program::            The @code{uniq} utility.
14043* Wc Program::              The @code{wc} utility.
14044@end menu
14045
14046@node Cut Program, Egrep Program, Clones, Clones
14047@subsection Cutting Out Fields and Columns
14048
14049@cindex @code{cut} utility
14050The @code{cut} utility selects, or ``cuts,'' either characters or fields
14051from its standard
14052input and sends them to its standard output.  @code{cut} can cut out either
14053a list of characters, or a list of fields.  By default, fields are separated
14054by tabs, but you may supply a command line option to change the field
14055@dfn{delimiter}, i.e.@: the field separator character. @code{cut}'s definition
14056of fields is less general than @code{awk}'s.
14057
14058A common use of @code{cut} might be to pull out just the login name of
14059logged-on users from the output of @code{who}.  For example, the following
14060pipeline generates a sorted, unique list of the logged on users:
14061
14062@example
14063who | cut -c1-8 | sort | uniq
14064@end example
14065
14066The options for @code{cut} are:
14067
14068@table @code
14069@item -c @var{list}
14070Use @var{list} as the list of characters to cut out.  Items within the list
14071may be separated by commas, and ranges of characters can be separated with
14072dashes.  The list @samp{1-8,15,22-35} specifies characters one through
14073eight, 15, and 22 through 35.
14074
14075@item -f @var{list}
14076Use @var{list} as the list of fields to cut out.
14077
14078@item -d @var{delim}
14079Use @var{delim} as the field separator character instead of the tab
14080character.
14081
14082@item -s
14083Suppress printing of lines that do not contain the field delimiter.
14084@end table
14085
14086The @code{awk} implementation of @code{cut} uses the @code{getopt} library
14087function (@pxref{Getopt Function, ,Processing Command Line Options}),
14088and the @code{join} library function
14089(@pxref{Join Function, ,Merging an Array Into a String}).
14090
14091The program begins with a comment describing the options and a @code{usage}
14092function which prints out a usage message and exits.  @code{usage} is called
14093if invalid arguments are supplied.
14094
14095@findex cut.awk
14096@example
14097@c @group
14098@c file eg/prog/cut.awk
14099# cut.awk --- implement cut in awk
14100# Arnold Robbins, arnold@@gnu.org, Public Domain
14101# May 1993
14102
14103# Options:
14104#    -f list        Cut fields
14105#    -d c           Field delimiter character
14106#    -c list        Cut characters
14107#
14108#    -s        Suppress lines without the delimiter character
14109
14110function usage(    e1, e2)
14111@{
14112    e1 = "usage: cut [-f list] [-d c] [-s] [files...]"
14113    e2 = "usage: cut [-c list] [files...]"
14114    print e1 > "/dev/stderr"
14115    print e2 > "/dev/stderr"
14116    exit 1
14117@}
14118@c endfile
14119@c @end group
14120@end example
14121
14122@noindent
14123The variables @code{e1} and @code{e2} are used so that the function
14124fits nicely on the
14125@iftex
14126page.
14127@end iftex
14128@ifinfo
14129screen.
14130@end ifinfo
14131
14132Next comes a @code{BEGIN} rule that parses the command line options.
14133It sets @code{FS} to a single tab character, since that is @code{cut}'s
14134default field separator.  The output field separator is also set to be the
14135same as the input field separator.  Then @code{getopt} is used to step
14136through the command line options.  One or the other of the variables
14137@code{by_fields} or @code{by_chars} is set to true, to indicate that
14138processing should be done by fields or by characters respectively.
14139When cutting by characters, the output field separator is set to the null
14140string.
14141
14142@example
14143@c @group
14144@c file eg/prog/cut.awk
14145BEGIN    \
14146@{
14147    FS = "\t"    # default
14148    OFS = FS
14149    while ((c = getopt(ARGC, ARGV, "sf:c:d:")) != -1) @{
14150        if (c == "f") @{
14151            by_fields = 1
14152            fieldlist = Optarg
14153        @} else if (c == "c") @{
14154            by_chars = 1
14155            fieldlist = Optarg
14156            OFS = ""
14157@group
14158        @} else if (c == "d") @{
14159            if (length(Optarg) > 1) @{
14160                printf("Using first character of %s" \
14161                " for delimiter\n", Optarg) > "/dev/stderr"
14162                Optarg = substr(Optarg, 1, 1)
14163            @}
14164            FS = Optarg
14165            OFS = FS
14166            if (FS == " ")    # defeat awk semantics
14167                FS = "[ ]"
14168        @} else if (c == "s")
14169            suppress++
14170        else
14171            usage()
14172    @}
14173@end group
14174
14175    for (i = 1; i < Optind; i++)
14176        ARGV[i] = ""
14177@c endfile
14178@c @end group
14179@end example
14180
14181Special care is taken when the field delimiter is a space. Using
14182@code{@w{" "}} (a single space) for the value of @code{FS} is
14183incorrect---@code{awk} would
14184separate fields with runs of spaces, tabs and/or newlines, and we want them to be
14185separated with individual spaces.  Also, note that after @code{getopt} is
14186through, we have to clear out all the elements of @code{ARGV} from one to
14187@code{Optind}, so that @code{awk} will not try to process the command line
14188options as file names.
14189
14190After dealing with the command line options, the program verifies that the
14191options make sense.  Only one or the other of @samp{-c} and @samp{-f} should
14192be used, and both require a field list.  Then either @code{set_fieldlist} or
14193@code{set_charlist} is called to pull apart the list of fields or
14194characters.
14195
14196@example
14197@c @group
14198@c file eg/prog/cut.awk
14199    if (by_fields && by_chars)
14200        usage()
14201
14202    if (by_fields == 0 && by_chars == 0)
14203        by_fields = 1    # default
14204
14205    if (fieldlist == "") @{
14206        print "cut: needs list for -c or -f" > "/dev/stderr"
14207        exit 1
14208    @}
14209
14210@group
14211    if (by_fields)
14212        set_fieldlist()
14213    else
14214        set_charlist()
14215@}
14216@c endfile
14217@end group
14218@end example
14219
14220Here is @code{set_fieldlist}.  It first splits the field list apart
14221at the commas, into an array.  Then, for each element of the array, it
14222looks to see if it is actually a range, and if so splits it apart. The range
14223is verified to make sure the first number is smaller than the second.
14224Each number in the list is added to the @code{flist} array, which simply
14225lists the fields that will be printed.
14226Normal field splitting is used.
14227The program lets @code{awk}
14228handle the job of doing the field splitting.
14229
14230@example
14231@c @group
14232@c file eg/prog/cut.awk
14233function set_fieldlist(        n, m, i, j, k, f, g)
14234@{
14235    n = split(fieldlist, f, ",")
14236    j = 1    # index in flist
14237    for (i = 1; i <= n; i++) @{
14238        if (index(f[i], "-") != 0) @{ # a range
14239            m = split(f[i], g, "-")
14240            if (m != 2 || g[1] >= g[2]) @{
14241                printf("bad field list: %s\n",
14242                                  f[i]) > "/dev/stderr"
14243                exit 1
14244            @}
14245            for (k = g[1]; k <= g[2]; k++)
14246                flist[j++] = k
14247        @} else
14248            flist[j++] = f[i]
14249    @}
14250    nfields = j - 1
14251@}
14252@c endfile
14253@c @end group
14254@end example
14255
14256The @code{set_charlist} function is more complicated than @code{set_fieldlist}.
14257The idea here is to use @code{gawk}'s @code{FIELDWIDTHS} variable
14258(@pxref{Constant Size, ,Reading Fixed-width Data}),
14259which describes constant width input.  When using a character list, that is
14260exactly what we have.
14261
14262Setting up @code{FIELDWIDTHS} is more complicated than simply listing the
14263fields that need to be printed.  We have to keep track of the fields to be
14264printed, and also the intervening characters that have to be skipped.
14265For example, suppose you wanted characters one through eight, 15, and
1426622 through 35.  You would use @samp{-c 1-8,15,22-35}.  The necessary value
14267for @code{FIELDWIDTHS} would be @code{@w{"8 6 1 6 14"}}.  This gives us five
14268fields, and what should be printed are @code{$1}, @code{$3}, and @code{$5}.
14269The intermediate fields are ``filler,'' stuff in between the desired data.
14270
14271@code{flist} lists the fields to be printed, and @code{t} tracks the
14272complete field list, including filler fields.
14273
14274@example
14275@c @group
14276@c file eg/prog/cut.awk
14277function set_charlist(    field, i, j, f, g, t,
14278                          filler, last, len)
14279@{
14280    field = 1   # count total fields
14281    n = split(fieldlist, f, ",")
14282    j = 1       # index in flist
14283    for (i = 1; i <= n; i++) @{
14284        if (index(f[i], "-") != 0) @{ # range
14285            m = split(f[i], g, "-")
14286            if (m != 2 || g[1] >= g[2]) @{
14287                printf("bad character list: %s\n",
14288                               f[i]) > "/dev/stderr"
14289                exit 1
14290            @}
14291            len = g[2] - g[1] + 1
14292            if (g[1] > 1)  # compute length of filler
14293                filler = g[1] - last - 1
14294            else
14295                filler = 0
14296            if (filler)
14297                t[field++] = filler
14298            t[field++] = len  # length of field
14299            last = g[2]
14300            flist[j++] = field - 1
14301        @} else @{
14302            if (f[i] > 1)
14303                filler = f[i] - last - 1
14304            else
14305                filler = 0
14306            if (filler)
14307                t[field++] = filler
14308            t[field++] = 1
14309            last = f[i]
14310            flist[j++] = field - 1
14311        @}
14312    @}
14313@group
14314    FIELDWIDTHS = join(t, 1, field - 1)
14315    nfields = j - 1
14316@}
14317@end group
14318@c endfile
14319@end example
14320
14321Here is the rule that actually processes the data.  If the @samp{-s} option
14322was given, then @code{suppress} will be true.  The first @code{if} statement
14323makes sure that the input record does have the field separator.  If
14324@code{cut} is processing fields, @code{suppress} is true, and the field
14325separator character is not in the record, then the record is skipped.
14326
14327If the record is valid, then at this point, @code{gawk} has split the data
14328into fields, either using the character in @code{FS} or using fixed-length
14329fields and @code{FIELDWIDTHS}.  The loop goes through the list of fields
14330that should be printed.  If the corresponding field has data in it, it is
14331printed.  If the next field also has data, then the separator character is
14332written out in between the fields.
14333
14334@c 2e: Could use `index($0, FS) != 0' instead of `$0 !~ FS', below
14335
14336@example
14337@c @group
14338@c file eg/prog/cut.awk
14339@{
14340    if (by_fields && suppress && $0 !~ FS)
14341        next
14342
14343    for (i = 1; i <= nfields; i++) @{
14344        if ($flist[i] != "") @{
14345            printf "%s", $flist[i]
14346            if (i < nfields && $flist[i+1] != "")
14347                printf "%s", OFS
14348        @}
14349    @}
14350    print ""
14351@}
14352@c endfile
14353@c @end group
14354@end example
14355
14356This version of @code{cut} relies on @code{gawk}'s @code{FIELDWIDTHS}
14357variable to do the character-based cutting.  While it would be possible in
14358other @code{awk} implementations to use @code{substr}
14359(@pxref{String Functions, ,Built-in Functions for String Manipulation}),
14360it would also be extremely painful to do so.
14361The @code{FIELDWIDTHS} variable supplies an elegant solution to the problem
14362of picking the input line apart by characters.
14363
14364@node Egrep Program, Id Program, Cut Program, Clones
14365@subsection Searching for Regular Expressions in Files
14366
14367@cindex @code{egrep} utility
14368The @code{egrep} utility searches files for patterns.  It uses regular
14369expressions that are almost identical to those available in @code{awk}
14370(@pxref{Regexp Constants, ,Regular Expression Constants}).  It is used this way:
14371
14372@example
14373egrep @r{[} @var{options} @r{]} '@var{pattern}' @var{files} @dots{}
14374@end example
14375
14376The @var{pattern} is a regexp.
14377In typical usage, the regexp is quoted to prevent the shell from expanding
14378any of the special characters as file name wildcards.
14379Normally, @code{egrep} prints the
14380lines that matched.  If multiple file names are provided on the command
14381line, each output line is preceded by the name of the file and a colon.
14382
14383The options are:
14384
14385@table @code
14386@item -c
14387Print out a count of the lines that matched the pattern, instead of the
14388lines themselves.
14389
14390@item -s
14391Be silent.  No output is produced, and the exit value indicates whether
14392or not the pattern was matched.
14393
14394@item -v
14395Invert the sense of the test. @code{egrep} prints the lines that do
14396@emph{not} match the pattern, and exits successfully if the pattern was not
14397matched.
14398
14399@item -i
14400Ignore case distinctions in both the pattern and the input data.
14401
14402@item -l
14403Only print the names of the files that matched, not the lines that matched.
14404
14405@item -e @var{pattern}
14406Use @var{pattern} as the regexp to match.  The purpose of the @samp{-e}
14407option is to allow patterns that start with a @samp{-}.
14408@end table
14409
14410This version uses the @code{getopt} library function
14411(@pxref{Getopt Function, ,Processing Command Line Options}),
14412and the file transition library program
14413(@pxref{Filetrans Function, ,Noting Data File Boundaries}).
14414
14415The program begins with a descriptive comment, and then a @code{BEGIN} rule
14416that processes the command line arguments with @code{getopt}.  The @samp{-i}
14417(ignore case) option is particularly easy with @code{gawk}; we just use the
14418@code{IGNORECASE} built in variable
14419(@pxref{Built-in Variables}).
14420
14421@findex egrep.awk
14422@example
14423@c @group
14424@c file eg/prog/egrep.awk
14425# egrep.awk --- simulate egrep in awk
14426# Arnold Robbins, arnold@@gnu.org, Public Domain
14427# May 1993
14428
14429# Options:
14430#    -c    count of lines
14431#    -s    silent - use exit value
14432#    -v    invert test, success if no match
14433#    -i    ignore case
14434#    -l    print filenames only
14435#    -e    argument is pattern
14436
14437BEGIN @{
14438    while ((c = getopt(ARGC, ARGV, "ce:svil")) != -1) @{
14439        if (c == "c")
14440            count_only++
14441        else if (c == "s")
14442            no_print++
14443        else if (c == "v")
14444            invert++
14445        else if (c == "i")
14446            IGNORECASE = 1
14447        else if (c == "l")
14448            filenames_only++
14449        else if (c == "e")
14450            pattern = Optarg
14451        else
14452            usage()
14453    @}
14454@c endfile
14455@c @end group
14456@end example
14457
14458Next comes the code that handles the @code{egrep} specific behavior. If no
14459pattern was supplied with @samp{-e}, the first non-option on the command
14460line is used.  The @code{awk} command line arguments up to @code{ARGV[Optind]}
14461are cleared, so that @code{awk} won't try to process them as files.  If no
14462files were specified, the standard input is used, and if multiple files were
14463specified, we make sure to note this so that the file names can precede the
14464matched lines in the output.
14465
14466The last two lines are commented out, since they are not needed in
14467@code{gawk}.  They should be uncommented if you have to use another version
14468of @code{awk}.
14469
14470@example
14471@c @group
14472@c file eg/prog/egrep.awk
14473    if (pattern == "")
14474        pattern = ARGV[Optind++]
14475
14476    for (i = 1; i < Optind; i++)
14477        ARGV[i] = ""
14478    if (Optind >= ARGC) @{
14479        ARGV[1] = "-"
14480        ARGC = 2
14481    @} else if (ARGC - Optind > 1)
14482        do_filenames++
14483
14484#    if (IGNORECASE)
14485#        pattern = tolower(pattern)
14486@}
14487@c endfile
14488@c @end group
14489@end example
14490
14491The next set of lines should be uncommented if you are not using
14492@code{gawk}.  This rule translates all the characters in the input line
14493into lower-case if the @samp{-i} option was specified.  The rule is
14494commented out since it is not necessary with @code{gawk}.
14495@c bug: if a match happens, we output the translated line, not the original
14496
14497@example
14498@c @group
14499@c file eg/prog/egrep.awk
14500#@{
14501#    if (IGNORECASE)
14502#        $0 = tolower($0)
14503#@}
14504@c endfile
14505@c @end group
14506@end example
14507
14508The @code{beginfile} function is called by the rule in @file{ftrans.awk}
14509when each new file is processed.  In this case, it is very simple; all it
14510does is initialize a variable @code{fcount} to zero. @code{fcount} tracks
14511how many lines in the current file matched the pattern.
14512
14513@example
14514@group
14515@c file eg/prog/egrep.awk
14516function beginfile(junk)
14517@{
14518    fcount = 0
14519@}
14520@c endfile
14521@end group
14522@end example
14523
14524The @code{endfile} function is called after each file has been processed.
14525It is used only when the user wants a count of the number of lines that
14526matched.  @code{no_print} will be true only if the exit status is desired.
14527@code{count_only} will be true if line counts are desired.  @code{egrep}
14528will therefore only print line counts if printing and counting are enabled.
14529The output format must be adjusted depending upon the number of files to be
14530processed.  Finally, @code{fcount} is added to @code{total}, so that we
14531know how many lines altogether matched the pattern.
14532
14533@example
14534@group
14535@c file eg/prog/egrep.awk
14536function endfile(file)
14537@{
14538    if (! no_print && count_only)
14539        if (do_filenames)
14540            print file ":" fcount
14541        else
14542            print fcount
14543
14544    total += fcount
14545@}
14546@c endfile
14547@end group
14548@end example
14549
14550This rule does most of the work of matching lines. The variable
14551@code{matches} will be true if the line matched the pattern. If the user
14552wants lines that did not match, the sense of the @code{matches} is inverted
14553using the @samp{!} operator. @code{fcount} is incremented with the value of
14554@code{matches}, which will be either one or zero, depending upon a
14555successful or unsuccessful match.  If the line did not match, the
14556@code{next} statement just moves on to the next record.
14557
14558There are several optimizations for performance in the following few lines
14559of code. If the user only wants exit status (@code{no_print} is true), and
14560we don't have to count lines, then it is enough to know that one line in
14561this file matched, and we can skip on to the next file with @code{nextfile}.
14562Along similar lines, if we are only printing file names, and we
14563don't need to count lines, we can print the file name, and then skip to the
14564next file with @code{nextfile}.
14565
14566Finally, each line is printed, with a leading filename and colon if
14567necessary.
14568
14569@ignore
145702e: note, probably better to recode the last few lines as
14571    if (! count_only) @{
14572        if (no_print)
14573            nextfile
14574
14575        if (filenames_only) @{
14576            print FILENAME
14577            nextfile
14578        @}
14579
14580        if (do_filenames)
14581            print FILENAME ":" $0
14582        else
14583            print
14584    @}
14585@end ignore
14586
14587@example
14588@c @group
14589@c file eg/prog/egrep.awk
14590@{
14591    matches = ($0 ~ pattern)
14592    if (invert)
14593        matches = ! matches
14594
14595    fcount += matches    # 1 or 0
14596
14597    if (! matches)
14598        next
14599
14600    if (no_print && ! count_only)
14601        nextfile
14602
14603    if (filenames_only && ! count_only) @{
14604        print FILENAME
14605        nextfile
14606    @}
14607
14608    if (do_filenames && ! count_only)
14609        print FILENAME ":" $0
14610@group
14611    else if (! count_only)
14612        print
14613@end group
14614@}
14615@c endfile
14616@c @end group
14617@end example
14618
14619@c @strong{Exercise}: rearrange the code inside @samp{if (! count_only)}.
14620
14621The @code{END} rule takes care of producing the correct exit status. If
14622there were no matches, the exit status is one, otherwise it is zero.
14623
14624@example
14625@c @group
14626@c file eg/prog/egrep.awk
14627END    \
14628@{
14629    if (total == 0)
14630        exit 1
14631    exit 0
14632@}
14633@c endfile
14634@c @end group
14635@end example
14636
14637The @code{usage} function prints a usage message in case of invalid options
14638and then exits.
14639
14640@example
14641@c @group
14642@c file eg/prog/egrep.awk
14643function usage(    e)
14644@{
14645    e = "Usage: egrep [-csvil] [-e pat] [files ...]"
14646    print e > "/dev/stderr"
14647    exit 1
14648@}
14649@c endfile
14650@c @end group
14651@end example
14652
14653The variable @code{e} is used so that the function fits nicely
14654on the printed page.
14655
14656@cindex backslash continuation
14657Just a note on programming style. You may have noticed that the @code{END}
14658rule uses backslash continuation, with the open brace on a line by
14659itself.  This is so that it more closely resembles the way functions
14660are written.  Many of the examples
14661@iftex
14662in this chapter
14663@end iftex
14664use this style. You can decide for yourself if you like writing
14665your @code{BEGIN} and @code{END} rules this way,
14666or not.
14667
14668@node Id Program, Split Program, Egrep Program, Clones
14669@subsection Printing Out User Information
14670
14671@cindex @code{id} utility
14672The @code{id} utility lists a user's real and effective user-id numbers,
14673real and effective group-id numbers, and the user's group set, if any.
14674@code{id} will only print the effective user-id and group-id if they are
14675different from the real ones.  If possible, @code{id} will also supply the
14676corresponding user and group names.  The output might look like this:
14677
14678@example
14679$ id
14680@print{} uid=2076(arnold) gid=10(staff) groups=10(staff),4(tty)
14681@end example
14682
14683This information is exactly what is provided by @code{gawk}'s
14684@file{/dev/user} special file (@pxref{Special Files, ,Special File Names in @code{gawk}}).
14685However, the @code{id} utility provides a more palatable output than just a
14686string of numbers.
14687
14688Here is a simple version of @code{id} written in @code{awk}.
14689It uses the user database library functions
14690(@pxref{Passwd Functions, ,Reading the User Database}),
14691and the group database library functions
14692(@pxref{Group Functions, ,Reading the Group Database}).
14693
14694The program is fairly straightforward.  All the work is done in the
14695@code{BEGIN} rule.  The user and group id numbers are obtained from
14696@file{/dev/user}.  If there is no support for @file{/dev/user}, the program
14697gives up.
14698
14699The code is repetitive.  The entry in the user database for the real user-id
14700number is split into parts at the @samp{:}. The name is the first field.
14701Similar code is used for the effective user-id number, and the group
14702numbers.
14703
14704@findex id.awk
14705@example
14706@c @group
14707@c file eg/prog/id.awk
14708# id.awk --- implement id in awk
14709# Arnold Robbins, arnold@@gnu.org, Public Domain
14710# May 1993
14711
14712# output is:
14713# uid=12(foo) euid=34(bar) gid=3(baz) \
14714#             egid=5(blat) groups=9(nine),2(two),1(one)
14715
14716BEGIN    \
14717@{
14718    if ((getline < "/dev/user") < 0) @{
14719        err = "id: no /dev/user support - cannot run"
14720        print err > "/dev/stderr"
14721        exit 1
14722    @}
14723    close("/dev/user")
14724
14725    uid = $1
14726    euid = $2
14727    gid = $3
14728    egid = $4
14729
14730    printf("uid=%d", uid)
14731    pw = getpwuid(uid)
14732@group
14733    if (pw != "") @{
14734        split(pw, a, ":")
14735        printf("(%s)", a[1])
14736    @}
14737@end group
14738
14739    if (euid != uid) @{
14740        printf(" euid=%d", euid)
14741        pw = getpwuid(euid)
14742        if (pw != "") @{
14743            split(pw, a, ":")
14744            printf("(%s)", a[1])
14745        @}
14746    @}
14747
14748    printf(" gid=%d", gid)
14749    pw = getgrgid(gid)
14750    if (pw != "") @{
14751        split(pw, a, ":")
14752        printf("(%s)", a[1])
14753    @}
14754
14755    if (egid != gid) @{
14756        printf(" egid=%d", egid)
14757        pw = getgrgid(egid)
14758        if (pw != "") @{
14759            split(pw, a, ":")
14760            printf("(%s)", a[1])
14761        @}
14762    @}
14763
14764    if (NF > 4) @{
14765        printf(" groups=");
14766        for (i = 5; i <= NF; i++) @{
14767            printf("%d", $i)
14768            pw = getgrgid($i)
14769            if (pw != "") @{
14770                split(pw, a, ":")
14771                printf("(%s)", a[1])
14772            @}
14773@group
14774            if (i < NF)
14775                printf(",")
14776@end group
14777        @}
14778    @}
14779    print ""
14780@}
14781@c endfile
14782@c @end group
14783@end example
14784
14785@c exercise!!!
14786@ignore
14787The POSIX version of @code{id} takes arguments that control which
14788information is printed.  Modify this version to accept the same
14789arguments and perform in the same way.
14790@end ignore
14791
14792@node Split Program, Tee Program, Id Program, Clones
14793@subsection Splitting a Large File Into Pieces
14794
14795@cindex @code{split} utility
14796The @code{split} program splits large text files into smaller pieces. By default,
14797the output files are named @file{xaa}, @file{xab}, and so on. Each file has
147981000 lines in it, with the likely exception of the last file. To change the
14799number of lines in each file, you supply a number on the command line
14800preceded with a minus, e.g., @samp{-500} for files with 500 lines in them
14801instead of 1000.  To change the name of the output files to something like
14802@file{myfileaa}, @file{myfileab}, and so on, you supply an additional
14803argument that specifies the filename.
14804
14805Here is a version of @code{split} in @code{awk}. It uses the @code{ord} and
14806@code{chr} functions presented in
14807@ref{Ordinal Functions, ,Translating Between Characters and Numbers}.
14808
14809The program first sets its defaults, and then tests to make sure there are
14810not too many arguments.  It then looks at each argument in turn.  The
14811first argument could be a minus followed by a number. If it is, this happens
14812to look like a negative number, so it is made positive, and that is the
14813count of lines.  The data file name is skipped over, and the final argument
14814is used as the prefix for the output file names.
14815
14816@findex split.awk
14817@example
14818@c @group
14819@c file eg/prog/split.awk
14820# split.awk --- do split in awk
14821# Arnold Robbins, arnold@@gnu.org, Public Domain
14822# May 1993
14823
14824# usage: split [-num] [file] [outname]
14825
14826BEGIN @{
14827    outfile = "x"    # default
14828    count = 1000
14829    if (ARGC > 4)
14830        usage()
14831
14832    i = 1
14833    if (ARGV[i] ~ /^-[0-9]+$/) @{
14834        count = -ARGV[i]
14835        ARGV[i] = ""
14836        i++
14837    @}
14838    # test argv in case reading from stdin instead of file
14839    if (i in ARGV)
14840        i++    # skip data file name
14841    if (i in ARGV) @{
14842        outfile = ARGV[i]
14843        ARGV[i] = ""
14844    @}
14845
14846    s1 = s2 = "a"
14847    out = (outfile s1 s2)
14848@}
14849@c endfile
14850@c @end group
14851@end example
14852
14853The next rule does most of the work. @code{tcount} (temporary count) tracks
14854how many lines have been printed to the output file so far. If it is greater
14855than @code{count}, it is time to close the current file and start a new one.
14856@code{s1} and @code{s2} track the current suffixes for the file name. If
14857they are both @samp{z}, the file is just too big.  Otherwise, @code{s1}
14858moves to the next letter in the alphabet and @code{s2} starts over again at
14859@samp{a}.
14860
14861@example
14862@c @group
14863@c file eg/prog/split.awk
14864@{
14865    if (++tcount > count) @{
14866        close(out)
14867        if (s2 == "z") @{
14868            if (s1 == "z") @{
14869                printf("split: %s is too large to split\n", \
14870                       FILENAME) > "/dev/stderr"
14871                exit 1
14872            @}
14873            s1 = chr(ord(s1) + 1)
14874            s2 = "a"
14875        @} else
14876            s2 = chr(ord(s2) + 1)
14877        out = (outfile s1 s2)
14878        tcount = 1
14879    @}
14880    print > out
14881@}
14882@c endfile
14883@c @end group
14884@end example
14885
14886The @code{usage} function simply prints an error message and exits.
14887
14888@example
14889@c @group
14890@c file eg/prog/split.awk
14891function usage(   e)
14892@{
14893    e = "usage: split [-num] [file] [outname]"
14894    print e > "/dev/stderr"
14895    exit 1
14896@}
14897@c endfile
14898@c @end group
14899@end example
14900
14901@noindent
14902The variable @code{e} is used so that the function
14903fits nicely on the
14904@iftex
14905page.
14906@end iftex
14907@ifinfo
14908screen.
14909@end ifinfo
14910
14911This program is a bit sloppy; it relies on @code{awk} to close the last file
14912for it automatically, instead of doing it in an @code{END} rule.
14913
14914@node Tee Program, Uniq Program, Split Program, Clones
14915@subsection Duplicating Output Into Multiple Files
14916
14917@cindex @code{tee} utility
14918The @code{tee} program is known as a ``pipe fitting.''  @code{tee} copies
14919its standard input to its standard output, and also duplicates it to the
14920files named on the command line.  Its usage is:
14921
14922@example
14923tee @r{[}-a@r{]} file @dots{}
14924@end example
14925
14926The @samp{-a} option tells @code{tee} to append to the named files, instead of
14927truncating them and starting over.
14928
14929The @code{BEGIN} rule first makes a copy of all the command line arguments,
14930into an array named @code{copy}.
14931@code{ARGV[0]} is not copied, since it is not needed.
14932@code{tee} cannot use @code{ARGV} directly, since @code{awk} will attempt to
14933process each file named in @code{ARGV} as input data.
14934
14935If the first argument is @samp{-a}, then the flag variable
14936@code{append} is set to true, and both @code{ARGV[1]} and
14937@code{copy[1]} are deleted. If @code{ARGC} is less than two, then no file
14938names were supplied, and @code{tee} prints a usage message and exits.
14939Finally, @code{awk} is forced to read the standard input by setting
14940@code{ARGV[1]} to @code{"-"}, and @code{ARGC} to two.
14941
14942@c 2e: the `ARGC--' in the `if (ARGV[1] == "-a")' isn't needed.
14943
14944@findex tee.awk
14945@example
14946@group
14947@c file eg/prog/tee.awk
14948# tee.awk --- tee in awk
14949# Arnold Robbins, arnold@@gnu.org, Public Domain
14950# May 1993
14951# Revised December 1995
14952@end group
14953
14954@group
14955BEGIN    \
14956@{
14957    for (i = 1; i < ARGC; i++)
14958        copy[i] = ARGV[i]
14959@end group
14960
14961@group
14962    if (ARGV[1] == "-a") @{
14963        append = 1
14964        delete ARGV[1]
14965        delete copy[1]
14966        ARGC--
14967    @}
14968@end group
14969@group
14970    if (ARGC < 2) @{
14971        print "usage: tee [-a] file ..." > "/dev/stderr"
14972        exit 1
14973    @}
14974@end group
14975@group
14976    ARGV[1] = "-"
14977    ARGC = 2
14978@}
14979@c endfile
14980@end group
14981@end example
14982
14983The single rule does all the work.  Since there is no pattern, it is
14984executed for each line of input.  The body of the rule simply prints the
14985line into each file on the command line, and then to the standard output.
14986
14987@example
14988@group
14989@c file eg/prog/tee.awk
14990@{
14991    # moving the if outside the loop makes it run faster
14992    if (append)
14993        for (i in copy)
14994            print >> copy[i]
14995    else
14996        for (i in copy)
14997            print > copy[i]
14998    print
14999@}
15000@c endfile
15001@end group
15002@end example
15003
15004It would have been possible to code the loop this way:
15005
15006@example
15007for (i in copy)
15008    if (append)
15009        print >> copy[i]
15010    else
15011        print > copy[i]
15012@end example
15013
15014@noindent
15015This is more concise, but it is also less efficient.  The @samp{if} is
15016tested for each record and for each output file.  By duplicating the loop
15017body, the @samp{if} is only tested once for each input record.  If there are
15018@var{N} input records and @var{M} input files, the first method only
15019executes @var{N} @samp{if} statements, while the second would execute
15020@var{N}@code{*}@var{M} @samp{if} statements.
15021
15022Finally, the @code{END} rule cleans up, by closing all the output files.
15023
15024@example
15025@c @group
15026@c file eg/prog/tee.awk
15027END    \
15028@{
15029    for (i in copy)
15030        close(copy[i])
15031@}
15032@c endfile
15033@c @end group
15034@end example
15035
15036@node Uniq Program, Wc Program, Tee Program, Clones
15037@subsection Printing Non-duplicated Lines of Text
15038
15039@cindex @code{uniq} utility
15040The @code{uniq} utility reads sorted lines of data on its standard input,
15041and (by default) removes duplicate lines.  In other words, only unique lines
15042are printed, hence the name.  @code{uniq} has a number of options. The usage is:
15043
15044@example
15045uniq @r{[}-udc @r{[}-@var{n}@r{]]} @r{[}+@var{n}@r{]} @r{[} @var{input file} @r{[} @var{output file} @r{]]}
15046@end example
15047
15048The option meanings are:
15049
15050@table @code
15051@item -d
15052Only print repeated lines.
15053
15054@item -u
15055Only print non-repeated lines.
15056
15057@item -c
15058Count lines. This option overrides @samp{-d} and @samp{-u}.  Both repeated
15059and non-repeated lines are counted.
15060
15061@item -@var{n}
15062Skip @var{n} fields before comparing lines.  The definition of fields
15063is similar to @code{awk}'s default: non-whitespace characters separated
15064by runs of spaces and/or tabs.
15065
15066@item +@var{n}
15067Skip @var{n} characters before comparing lines.  Any fields specified with
15068@samp{-@var{n}} are skipped first.
15069
15070@item @var{input file}
15071Data is read from the input file named on the command line, instead of from
15072the standard input.
15073
15074@item @var{output file}
15075The generated output is sent to the named output file, instead of to the
15076standard output.
15077@end table
15078
15079Normally @code{uniq} behaves as if both the @samp{-d} and @samp{-u} options
15080had been provided.
15081
15082Here is an @code{awk} implementation of @code{uniq}. It uses the
15083@code{getopt} library function
15084(@pxref{Getopt Function, ,Processing Command Line Options}),
15085and the @code{join} library function
15086(@pxref{Join Function, ,Merging an Array Into a String}).
15087
15088The program begins with a @code{usage} function and then a brief outline of
15089the options and their meanings in a comment.
15090
15091The @code{BEGIN} rule deals with the command line arguments and options. It
15092uses a trick to get @code{getopt} to handle options of the form @samp{-25},
15093treating such an option as the option letter @samp{2} with an argument of
15094@samp{5}. If indeed two or more digits were supplied (@code{Optarg} looks
15095like a number), @code{Optarg} is
15096concatenated with the option digit, and then result is added to zero to make
15097it into a number.  If there is only one digit in the option, then
15098@code{Optarg} is not needed, and @code{Optind} must be decremented so that
15099@code{getopt} will process it next time.  This code is admittedly a bit
15100tricky.
15101
15102If no options were supplied, then the default is taken, to print both
15103repeated and non-repeated lines.  The output file, if provided, is assigned
15104to @code{outputfile}.  Earlier, @code{outputfile} was initialized to the
15105standard output, @file{/dev/stdout}.
15106
15107@findex uniq.awk
15108@example
15109@c file eg/prog/uniq.awk
15110# uniq.awk --- do uniq in awk
15111# Arnold Robbins, arnold@@gnu.org, Public Domain
15112# May 1993
15113
15114@group
15115function usage(    e)
15116@{
15117    e = "Usage: uniq [-udc [-n]] [+n] [ in [ out ]]"
15118    print e > "/dev/stderr"
15119    exit 1
15120@}
15121@end group
15122
15123# -c    count lines. overrides -d and -u
15124# -d    only repeated lines
15125# -u    only non-repeated lines
15126# -n    skip n fields
15127# +n    skip n characters, skip fields first
15128
15129BEGIN   \
15130@{
15131    count = 1
15132    outputfile = "/dev/stdout"
15133    opts = "udc0:1:2:3:4:5:6:7:8:9:"
15134    while ((c = getopt(ARGC, ARGV, opts)) != -1) @{
15135        if (c == "u")
15136            non_repeated_only++
15137        else if (c == "d")
15138            repeated_only++
15139        else if (c == "c")
15140            do_count++
15141        else if (index("0123456789", c) != 0) @{
15142            # getopt requires args to options
15143            # this messes us up for things like -5
15144            if (Optarg ~ /^[0-9]+$/)
15145                fcount = (c Optarg) + 0
15146@group
15147            else @{
15148                fcount = c + 0
15149                Optind--
15150            @}
15151@end group
15152        @} else
15153            usage()
15154    @}
15155
15156    if (ARGV[Optind] ~ /^\+[0-9]+$/) @{
15157        charcount = substr(ARGV[Optind], 2) + 0
15158        Optind++
15159    @}
15160
15161    for (i = 1; i < Optind; i++)
15162        ARGV[i] = ""
15163
15164    if (repeated_only == 0 && non_repeated_only == 0)
15165        repeated_only = non_repeated_only = 1
15166
15167    if (ARGC - Optind == 2) @{
15168        outputfile = ARGV[ARGC - 1]
15169        ARGV[ARGC - 1] = ""
15170    @}
15171@}
15172@c endfile
15173@end example
15174
15175The following function, @code{are_equal}, compares the current line,
15176@code{$0}, to the
15177previous line, @code{last}.  It handles skipping fields and characters.
15178
15179If no field count and no character count were specified, @code{are_equal}
15180simply returns one or zero depending upon the result of a simple string
15181comparison of @code{last} and @code{$0}.  Otherwise, things get more
15182complicated.
15183
15184If fields have to be skipped, each line is broken into an array using
15185@code{split}
15186(@pxref{String Functions, ,Built-in Functions for String Manipulation}),
15187and then the desired fields are joined back into a line using @code{join}.
15188The joined lines are stored in @code{clast} and @code{cline}.
15189If no fields are skipped, @code{clast} and @code{cline} are set to
15190@code{last} and @code{$0} respectively.
15191
15192Finally, if characters are skipped, @code{substr} is used to strip off the
15193leading @code{charcount} characters in @code{clast} and @code{cline}.  The
15194two strings are then compared, and @code{are_equal} returns the result.
15195
15196@example
15197@c @group
15198@c file eg/prog/uniq.awk
15199function are_equal(    n, m, clast, cline, alast, aline)
15200@{
15201    if (fcount == 0 && charcount == 0)
15202        return (last == $0)
15203
15204    if (fcount > 0) @{
15205        n = split(last, alast)
15206        m = split($0, aline)
15207        clast = join(alast, fcount+1, n)
15208        cline = join(aline, fcount+1, m)
15209    @} else @{
15210        clast = last
15211        cline = $0
15212    @}
15213    if (charcount) @{
15214        clast = substr(clast, charcount + 1)
15215        cline = substr(cline, charcount + 1)
15216    @}
15217
15218    return (clast == cline)
15219@}
15220@c endfile
15221@c @end group
15222@end example
15223
15224The following two rules are the body of the program.  The first one is
15225executed only for the very first line of data.  It sets @code{last} equal to
15226@code{$0}, so that subsequent lines of text have something to be compared to.
15227
15228The second rule does the work. The variable @code{equal} will be one or zero
15229depending upon the results of @code{are_equal}'s comparison. If @code{uniq}
15230is counting repeated lines, then the @code{count} variable is incremented if
15231the lines are equal. Otherwise the line is printed and @code{count} is
15232reset, since the two lines are not equal.
15233
15234If @code{uniq} is not counting, @code{count} is incremented if the lines are
15235equal. Otherwise, if @code{uniq} is counting repeated lines, and more than
15236one line has been seen, or if @code{uniq} is counting non-repeated lines,
15237and only one line has been seen, then the line is printed, and @code{count}
15238is reset.
15239
15240Finally, similar logic is used in the @code{END} rule to print the final
15241line of input data.
15242
15243@example
15244@c @group
15245@c file eg/prog/uniq.awk
15246@group
15247NR == 1 @{
15248    last = $0
15249    next
15250@}
15251@end group
15252
15253@{
15254    equal = are_equal()
15255
15256    if (do_count) @{    # overrides -d and -u
15257        if (equal)
15258            count++
15259        else @{
15260            printf("%4d %s\n", count, last) > outputfile
15261            last = $0
15262            count = 1    # reset
15263        @}
15264        next
15265    @}
15266
15267    if (equal)
15268        count++
15269    else @{
15270        if ((repeated_only && count > 1) ||
15271            (non_repeated_only && count == 1))
15272                print last > outputfile
15273        last = $0
15274        count = 1
15275    @}
15276@}
15277
15278@group
15279END @{
15280    if (do_count)
15281        printf("%4d %s\n", count, last) > outputfile
15282    else if ((repeated_only && count > 1) ||
15283            (non_repeated_only && count == 1))
15284        print last > outputfile
15285@}
15286@end group
15287@c endfile
15288@c @end group
15289@end example
15290
15291@node Wc Program,  , Uniq Program, Clones
15292@subsection Counting Things
15293
15294@cindex @code{wc} utility
15295The @code{wc} (word count) utility counts lines, words, and characters in
15296one or more input files. Its usage is:
15297
15298@example
15299wc @r{[}-lwc@r{]} @r{[} @var{files} @dots{} @r{]}
15300@end example
15301
15302If no files are specified on the command line, @code{wc} reads its standard
15303input. If there are multiple files, it will also print total counts for all
15304the files.  The options and their meanings are:
15305
15306@table @code
15307@item -l
15308Only count lines.
15309
15310@item -w
15311Only count words.
15312A ``word'' is a contiguous sequence of non-whitespace characters, separated
15313by spaces and/or tabs.  Happily, this is the normal way @code{awk} separates
15314fields in its input data.
15315
15316@item -c
15317Only count characters.
15318@end table
15319
15320Implementing @code{wc} in @code{awk} is particularly elegant, since
15321@code{awk} does a lot of the work for us; it splits lines into words (i.e.@:
15322fields) and counts them, it counts lines (i.e.@: records) for us, and it can
15323easily tell us how long a line is.
15324
15325This version uses the @code{getopt} library function
15326(@pxref{Getopt Function, ,Processing Command Line Options}),
15327and the file transition functions
15328(@pxref{Filetrans Function, ,Noting Data File Boundaries}).
15329
15330This version has one major difference from traditional versions of @code{wc}.
15331Our version always prints the counts in the order lines, words,
15332and characters.  Traditional versions note the order of the @samp{-l},
15333@samp{-w}, and @samp{-c} options on the command line, and print the counts
15334in that order.
15335
15336The @code{BEGIN} rule does the argument processing.
15337The variable @code{print_total} will
15338be true if more than one file was named on the command line.
15339
15340@findex wc.awk
15341@example
15342@c @group
15343@c file eg/prog/wc.awk
15344# wc.awk --- count lines, words, characters
15345# Arnold Robbins, arnold@@gnu.org, Public Domain
15346# May 1993
15347
15348# Options:
15349#    -l    only count lines
15350#    -w    only count words
15351#    -c    only count characters
15352#
15353# Default is to count lines, words, characters
15354
15355BEGIN @{
15356    # let getopt print a message about
15357    # invalid options. we ignore them
15358    while ((c = getopt(ARGC, ARGV, "lwc")) != -1) @{
15359        if (c == "l")
15360            do_lines = 1
15361        else if (c == "w")
15362            do_words = 1
15363        else if (c == "c")
15364            do_chars = 1
15365    @}
15366    for (i = 1; i < Optind; i++)
15367        ARGV[i] = ""
15368
15369    # if no options, do all
15370    if (! do_lines && ! do_words && ! do_chars)
15371        do_lines = do_words = do_chars = 1
15372
15373    print_total = (ARGC - i > 2)
15374@}
15375@c endfile
15376@c @end group
15377@end example
15378
15379The @code{beginfile} function is simple; it just resets the counts of lines,
15380words, and characters to zero, and saves the current file name in
15381@code{fname}.
15382
15383The @code{endfile} function adds the current file's numbers to the running
15384totals of lines, words, and characters.  It then prints out those numbers
15385for the file that was just read. It relies on @code{beginfile} to reset the
15386numbers for the following data file.
15387
15388@example
15389@c left brace on line with `function' because of page breaking
15390@c file eg/prog/wc.awk
15391@group
15392function beginfile(file) @{
15393    chars = lines = words = 0
15394    fname = FILENAME
15395@}
15396@end group
15397
15398function endfile(file)
15399@{
15400    tchars += chars
15401    tlines += lines
15402    twords += words
15403    if (do_lines)
15404        printf "\t%d", lines
15405    if (do_words)
15406        printf "\t%d", words
15407    if (do_chars)
15408        printf "\t%d", chars
15409    printf "\t%s\n", fname
15410@}
15411@c endfile
15412@end example
15413
15414There is one rule that is executed for each line. It adds the length of the
15415record to @code{chars}.  It has to add one, since the newline character
15416separating records (the value of @code{RS}) is not part of the record
15417itself.  @code{lines} is incremented for each line read, and @code{words} is
15418incremented by the value of @code{NF}, the number of ``words'' on this
15419line.@footnote{Examine the code in
15420@ref{Filetrans Function, ,Noting Data File Boundaries}.
15421Why must @code{wc} use a separate @code{lines} variable, instead of using
15422the value of @code{FNR} in @code{endfile}?}
15423
15424Finally, the @code{END} rule simply prints the totals for all the files.
15425
15426@example
15427@c @group
15428@c file eg/prog/wc.awk
15429# do per line
15430@{
15431    chars += length($0) + 1    # get newline
15432    lines++
15433    words += NF
15434@}
15435
15436END @{
15437    if (print_total) @{
15438        if (do_lines)
15439            printf "\t%d", tlines
15440        if (do_words)
15441            printf "\t%d", twords
15442        if (do_chars)
15443            printf "\t%d", tchars
15444        print "\ttotal"
15445    @}
15446@}
15447@c endfile
15448@c @end group
15449@end example
15450
15451@node Miscellaneous Programs,  , Clones, Sample Programs
15452@section A Grab Bag of @code{awk} Programs
15453
15454This section is a large ``grab bag'' of miscellaneous programs.
15455We hope you find them both interesting and enjoyable.
15456
15457@menu
15458* Dupword Program::         Finding duplicated words in a document.
15459* Alarm Program::           An alarm clock.
15460* Translate Program::       A program similar to the @code{tr} utility.
15461* Labels Program::          Printing mailing labels.
15462* Word Sorting::            A program to produce a word usage count.
15463* History Sorting::         Eliminating duplicate entries from a history
15464                            file.
15465* Extract Program::         Pulling out programs from Texinfo source
15466                            files.
15467* Simple Sed::              A Simple Stream Editor.
15468* Igawk Program::           A wrapper for @code{awk} that includes files.
15469@end menu
15470
15471@node Dupword Program, Alarm Program, Miscellaneous Programs, Miscellaneous Programs
15472@subsection Finding Duplicated Words in a Document
15473
15474A common error when writing large amounts of prose is to accidentally
15475duplicate words.  Often you will see this in text as something like ``the
15476the program does the following @dots{}.''  When the text is on-line, often
15477the duplicated words occur at the end of one line and the beginning of
15478another, making them very difficult to spot.
15479@c as here!
15480
15481This program, @file{dupword.awk}, scans through a file one line at a time,
15482and looks for adjacent occurrences of the same word.  It also saves the last
15483word on a line (in the variable @code{prev}) for comparison with the first
15484word on the next line.
15485
15486The first two statements make sure that the line is all lower-case, so that,
15487for example,
15488``The'' and ``the'' compare equal to each other.  The second statement
15489removes all non-alphanumeric and non-whitespace characters from the line, so
15490that punctuation does not affect the comparison either.  This sometimes
15491leads to reports of duplicated words that really are different, but this is
15492unusual.
15493
15494@c FIXME: add check for $i != ""
15495@findex dupword.awk
15496@example
15497@group
15498@c file eg/prog/dupword.awk
15499# dupword --- find duplicate words in text
15500# Arnold Robbins, arnold@@gnu.org, Public Domain
15501# December 1991
15502
15503@{
15504    $0 = tolower($0)
15505    gsub(/[^A-Za-z0-9 \t]/, "");
15506    if ($1 == prev)
15507        printf("%s:%d: duplicate %s\n",
15508            FILENAME, FNR, $1)
15509    for (i = 2; i <= NF; i++)
15510        if ($i == $(i-1))
15511            printf("%s:%d: duplicate %s\n",
15512                FILENAME, FNR, $i)
15513    prev = $NF
15514@}
15515@c endfile
15516@end group
15517@end example
15518
15519@node Alarm Program, Translate Program, Dupword Program, Miscellaneous Programs
15520@subsection An Alarm Clock Program
15521
15522The following program is a simple ``alarm clock'' program.
15523You give it a time of day, and an optional message.  At the given time,
15524it prints the message on the standard output. In addition, you can give it
15525the number of times to repeat the message, and also a delay between
15526repetitions.
15527
15528This program uses the @code{gettimeofday} function from
15529@ref{Gettimeofday Function, ,Managing the Time of Day}.
15530
15531All the work is done in the @code{BEGIN} rule.  The first part is argument
15532checking and setting of defaults; the delay, the count, and the message to
15533print.  If the user supplied a message, but it does not contain the ASCII BEL
15534character (known as the ``alert'' character, @samp{\a}), then it is added to
15535the message.  (On many systems, printing the ASCII BEL generates some sort
15536of audible alert. Thus, when the alarm goes off, the system calls attention
15537to itself, in case the user is not looking at their computer or terminal.)
15538
15539@findex alarm.awk
15540@example
15541@c @group
15542@c file eg/prog/alarm.awk
15543# alarm --- set an alarm
15544# Arnold Robbins, arnold@@gnu.org, Public Domain
15545# May 1993
15546
15547# usage: alarm time [ "message" [ count [ delay ] ] ]
15548
15549BEGIN    \
15550@{
15551    # Initial argument sanity checking
15552    usage1 = "usage: alarm time ['message' [count [delay]]]"
15553    usage2 = sprintf("\t(%s) time ::= hh:mm", ARGV[1])
15554
15555    if (ARGC < 2) @{
15556        print usage > "/dev/stderr"
15557        exit 1
15558    @} else if (ARGC == 5) @{
15559        delay = ARGV[4] + 0
15560        count = ARGV[3] + 0
15561        message = ARGV[2]
15562    @} else if (ARGC == 4) @{
15563        count = ARGV[3] + 0
15564        message = ARGV[2]
15565    @} else if (ARGC == 3) @{
15566        message = ARGV[2]
15567    @} else if (ARGV[1] !~ /[0-9]?[0-9]:[0-9][0-9]/) @{
15568        print usage1 > "/dev/stderr"
15569        print usage2 > "/dev/stderr"
15570        exit 1
15571    @}
15572
15573    # set defaults for once we reach the desired time
15574    if (delay == 0)
15575        delay = 180    # 3 minutes
15576    if (count == 0)
15577        count = 5
15578@group
15579    if (message == "")
15580        message = sprintf("\aIt is now %s!\a", ARGV[1])
15581    else if (index(message, "\a") == 0)
15582        message = "\a" message "\a"
15583@end group
15584@c endfile
15585@end example
15586
15587The next section of code turns the alarm time into hours and minutes,
15588and converts it if necessary to a 24-hour clock.  Then it turns that
15589time into a count of the seconds since midnight.  Next it turns the current
15590time into a count of seconds since midnight.  The difference between the two
15591is how long to wait before setting off the alarm.
15592
15593@example
15594@c @group
15595@c file eg/prog/alarm.awk
15596    # split up dest time
15597    split(ARGV[1], atime, ":")
15598    hour = atime[1] + 0    # force numeric
15599    minute = atime[2] + 0  # force numeric
15600
15601    # get current broken down time
15602    gettimeofday(now)
15603
15604    # if time given is 12-hour hours and it's after that
15605    # hour, e.g., `alarm 5:30' at 9 a.m. means 5:30 p.m.,
15606    # then add 12 to real hour
15607    if (hour < 12 && now["hour"] > hour)
15608        hour += 12
15609
15610    # set target time in seconds since midnight
15611    target = (hour * 60 * 60) + (minute * 60)
15612
15613    # get current time in seconds since midnight
15614    current = (now["hour"] * 60 * 60) + \
15615               (now["minute"] * 60) + now["second"]
15616
15617    # how long to sleep for
15618    naptime = target - current
15619    if (naptime <= 0) @{
15620        print "time is in the past!" > "/dev/stderr"
15621        exit 1
15622    @}
15623@c endfile
15624@c @end group
15625@end example
15626
15627Finally, the program uses the @code{system} function
15628(@pxref{I/O Functions, ,Built-in Functions for Input/Output})
15629to call the @code{sleep} utility.  The @code{sleep} utility simply pauses
15630for the given number of seconds.  If the exit status is not zero,
15631the program assumes that @code{sleep} was interrupted, and exits. If
15632@code{sleep} exited with an OK status (zero), then the program prints the
15633message in a loop, again using @code{sleep} to delay for however many
15634seconds are necessary.
15635
15636@example
15637@c file eg/prog/alarm.awk
15638@group
15639    # zzzzzz..... go away if interrupted
15640    if (system(sprintf("sleep %d", naptime)) != 0)
15641        exit 1
15642@end group
15643
15644    # time to notify!
15645    command = sprintf("sleep %d", delay)
15646    for (i = 1; i <= count; i++) @{
15647        print message
15648        # if sleep command interrupted, go away
15649        if (system(command) != 0)
15650            break
15651    @}
15652
15653    exit 0
15654@}
15655@c endfile
15656@end example
15657
15658@node Translate Program, Labels Program, Alarm Program, Miscellaneous Programs
15659@subsection Transliterating Characters
15660
15661The system @code{tr} utility transliterates characters.  For example, it is
15662often used to map upper-case letters into lower-case, for further
15663processing.
15664
15665@example
15666@var{generate data} | tr '[A-Z]' '[a-z]' | @var{process data} @dots{}
15667@end example
15668
15669You give @code{tr} two lists of characters enclosed in square brackets.
15670Usually, the lists are quoted to keep the shell from attempting to do a
15671filename expansion.@footnote{On older, non-POSIX systems, @code{tr} often
15672does not require that the lists be enclosed in square brackets and quoted.
15673This is a feature.}  When processing the input, the
15674first character in the first list is replaced with the first character in the
15675second list, the second character in the first list is replaced with the
15676second character in the second list, and so on.
15677If there are more characters in the ``from'' list than in the ``to'' list,
15678the last character of the ``to'' list is used for the remaining characters
15679in the ``from'' list.
15680
15681Some time ago,
15682@c early or mid-1989!
15683a user proposed to us that we add a transliteration function to @code{gawk}.
15684Being opposed to ``creeping featurism,'' I wrote the following program to
15685prove that character transliteration could be done with a user-level
15686function.  This program is not as complete as the system @code{tr} utility,
15687but it will do most of the job.
15688
15689The @code{translate} program demonstrates one of the few weaknesses of
15690standard
15691@code{awk}: dealing with individual characters is very painful, requiring
15692repeated use of the @code{substr}, @code{index}, and @code{gsub} built-in
15693functions
15694(@pxref{String Functions, ,Built-in Functions for String Manipulation}).@footnote{This
15695program was written before @code{gawk} acquired the ability to
15696split each character in a string into separate array elements.
15697How might you use this new feature to simplify the program?}
15698
15699There are two functions.  The first, @code{stranslate}, takes three
15700arguments.
15701
15702@table @code
15703@item from
15704A list of characters to translate from.
15705
15706@item to
15707A list of characters to translate to.
15708
15709@item target
15710The string to do the translation on.
15711@end table
15712
15713Associative arrays make the translation part fairly easy. @code{t_ar} holds
15714the ``to'' characters, indexed by the ``from'' characters.  Then a simple
15715loop goes through @code{from}, one character at a time.  For each character
15716in @code{from}, if the character appears in @code{target}, @code{gsub}
15717is used to change it to the corresponding @code{to} character.
15718
15719The @code{translate} function simply calls @code{stranslate} using @code{$0}
15720as the target.  The main program sets two global variables, @code{FROM} and
15721@code{TO}, from the command line, and then changes @code{ARGV} so that
15722@code{awk} will read from the standard input.
15723
15724Finally, the processing rule simply calls @code{translate} for each record.
15725
15726@findex translate.awk
15727@example
15728@c @group
15729@c file eg/prog/translate.awk
15730# translate --- do tr like stuff
15731# Arnold Robbins, arnold@@gnu.org, Public Domain
15732# August 1989
15733
15734# bugs: does not handle things like: tr A-Z a-z, it has
15735# to be spelled out. However, if `to' is shorter than `from',
15736# the last character in `to' is used for the rest of `from'.
15737
15738function stranslate(from, to, target,     lf, lt, t_ar, i, c)
15739@{
15740    lf = length(from)
15741    lt = length(to)
15742    for (i = 1; i <= lt; i++)
15743        t_ar[substr(from, i, 1)] = substr(to, i, 1)
15744    if (lt < lf)
15745        for (; i <= lf; i++)
15746            t_ar[substr(from, i, 1)] = substr(to, lt, 1)
15747    for (i = 1; i <= lf; i++) @{
15748        c = substr(from, i, 1)
15749        if (index(target, c) > 0)
15750            gsub(c, t_ar[c], target)
15751    @}
15752    return target
15753@}
15754
15755function translate(from, to)
15756@{
15757    return $0 = stranslate(from, to, $0)
15758@}
15759
15760@group
15761# main program
15762BEGIN @{
15763    if (ARGC < 3) @{
15764        print "usage: translate from to" > "/dev/stderr"
15765        exit
15766    @}
15767@end group
15768    FROM = ARGV[1]
15769    TO = ARGV[2]
15770    ARGC = 2
15771    ARGV[1] = "-"
15772@}
15773
15774@{
15775    translate(FROM, TO)
15776    print
15777@}
15778@c endfile
15779@c @end group
15780@end example
15781
15782While it is possible to do character transliteration in a user-level
15783function, it is not necessarily efficient, and we started to consider adding
15784a built-in function.  However, shortly after writing this program, we learned
15785that the System V Release 4 @code{awk} had added the @code{toupper} and
15786@code{tolower} functions.  These functions handle the vast majority of the
15787cases where character transliteration is necessary, and so we chose to
15788simply add those functions to @code{gawk} as well, and then leave well
15789enough alone.
15790
15791An obvious improvement to this program would be to set up the
15792@code{t_ar} array only once, in a @code{BEGIN} rule. However, this
15793assumes that the ``from'' and ``to'' lists
15794will never change throughout the lifetime of the program.
15795
15796@node Labels Program, Word Sorting, Translate Program, Miscellaneous Programs
15797@subsection Printing Mailing Labels
15798
15799Here is a ``real world''@footnote{``Real world'' is defined as
15800``a program actually used to get something done.''}
15801program.  This script reads lists of names and
15802addresses, and generates mailing labels.  Each page of labels has 20 labels
15803on it, two across and ten down.  The addresses are guaranteed to be no more
15804than five lines of data.  Each address is separated from the next by a blank
15805line.
15806
15807The basic idea is to read 20 labels worth of data.  Each line of each label
15808is stored in the @code{line} array.  The single rule takes care of filling
15809the @code{line} array and printing the page when 20 labels have been read.
15810
15811The @code{BEGIN} rule simply sets @code{RS} to the empty string, so that
15812@code{awk} will split records at blank lines
15813(@pxref{Records, ,How Input is Split into Records}).
15814It sets @code{MAXLINES} to 100, since @code{MAXLINE} is the maximum number
15815of lines on the page (20 * 5 = 100).
15816
15817Most of the work is done in the @code{printpage} function.
15818The label lines are stored sequentially in the @code{line} array.  But they
15819have to be printed horizontally; @code{line[1]} next to @code{line[6]},
15820@code{line[2]} next to @code{line[7]}, and so on.  Two loops are used to
15821accomplish this.  The outer loop, controlled by @code{i}, steps through
15822every 10 lines of data; this is each row of labels.  The inner loop,
15823controlled by @code{j}, goes through the lines within the row.
15824As @code{j} goes from zero to four, @samp{i+j} is the @code{j}'th line in
15825the row, and @samp{i+j+5} is the entry next to it.  The output ends up
15826looking something like this:
15827
15828@example
15829line 1          line 6
15830line 2          line 7
15831line 3          line 8
15832line 4          line 9
15833line 5          line 10
15834@end example
15835
15836As a final note, at lines 21 and 61, an extra blank line is printed, to keep
15837the output lined up on the labels.  This is dependent on the particular
15838brand of labels in use when the program was written.  You will also note
15839that there are two blank lines at the top and two blank lines at the bottom.
15840
15841The @code{END} rule arranges to flush the final page of labels; there may
15842not have been an even multiple of 20 labels in the data.
15843
15844@findex labels.awk
15845@example
15846@c @group
15847@c file eg/prog/labels.awk
15848# labels.awk
15849# Arnold Robbins, arnold@@gnu.org, Public Domain
15850# June 1992
15851
15852# Program to print labels.  Each label is 5 lines of data
15853# that may have blank lines.  The label sheets have 2
15854# blank lines at the top and 2 at the bottom.
15855
15856BEGIN    @{ RS = "" ; MAXLINES = 100 @}
15857
15858function printpage(    i, j)
15859@{
15860    if (Nlines <= 0)
15861        return
15862
15863    printf "\n\n"        # header
15864
15865    for (i = 1; i <= Nlines; i += 10) @{
15866        if (i == 21 || i == 61)
15867            print ""
15868        for (j = 0; j < 5; j++) @{
15869            if (i + j > MAXLINES)
15870                break
15871            printf "   %-41s %s\n", line[i+j], line[i+j+5]
15872        @}
15873        print ""
15874    @}
15875
15876    printf "\n\n"        # footer
15877
15878    for (i in line)
15879        line[i] = ""
15880@}
15881
15882# main rule
15883@{
15884    if (Count >= 20) @{
15885        printpage()
15886        Count = 0
15887        Nlines = 0
15888    @}
15889    n = split($0, a, "\n")
15890    for (i = 1; i <= n; i++)
15891        line[++Nlines] = a[i]
15892    for (; i <= 5; i++)
15893        line[++Nlines] = ""
15894    Count++
15895@}
15896
15897END    \
15898@{
15899    printpage()
15900@}
15901@c endfile
15902@c @end group
15903@end example
15904
15905@node Word Sorting, History Sorting, Labels Program, Miscellaneous Programs
15906@subsection Generating Word Usage Counts
15907
15908The following @code{awk} program prints
15909the number of occurrences of each word in its input.  It illustrates the
15910associative nature of @code{awk} arrays by using strings as subscripts.  It
15911also demonstrates the @samp{for @var{x} in @var{array}} construction.
15912Finally, it shows how @code{awk} can be used in conjunction with other
15913utility programs to do a useful task of some complexity with a minimum of
15914effort.  Some explanations follow the program listing.
15915
15916@example
15917awk '
15918# Print list of word frequencies
15919@{
15920    for (i = 1; i <= NF; i++)
15921        freq[$i]++
15922@}
15923
15924@group
15925END @{
15926    for (word in freq)
15927        printf "%s\t%d\n", word, freq[word]
15928@}'
15929@end group
15930@end example
15931
15932The first thing to notice about this program is that it has two rules.  The
15933first rule, because it has an empty pattern, is executed on every line of
15934the input.  It uses @code{awk}'s field-accessing mechanism
15935(@pxref{Fields, ,Examining Fields}) to pick out the individual words from
15936the line, and the built-in variable @code{NF} (@pxref{Built-in Variables})
15937to know how many fields are available.
15938
15939For each input word, an element of the array @code{freq} is incremented to
15940reflect that the word has been seen an additional time.
15941
15942The second rule, because it has the pattern @code{END}, is not executed
15943until the input has been exhausted.  It prints out the contents of the
15944@code{freq} table that has been built up inside the first action.
15945
15946This program has several problems that would prevent it from being
15947useful by itself on real text files:
15948
15949@itemize @bullet
15950@item
15951Words are detected using the @code{awk} convention that fields are
15952separated by whitespace and that other characters in the input (except
15953newlines) don't have any special meaning to @code{awk}.  This means that
15954punctuation characters count as part of words.
15955
15956@item
15957The @code{awk} language considers upper- and lower-case characters to be
15958distinct.  Therefore, @samp{bartender} and @samp{Bartender} are not treated
15959as the same word.  This is undesirable since, in normal text, words
15960are capitalized if they begin sentences, and a frequency analyzer should not
15961be sensitive to capitalization.
15962
15963@item
15964The output does not come out in any useful order.  You're more likely to be
15965interested in which words occur most frequently, or having an alphabetized
15966table of how frequently each word occurs.
15967@end itemize
15968
15969The way to solve these problems is to use some of the more advanced
15970features of the @code{awk} language.  First, we use @code{tolower} to remove
15971case distinctions.  Next, we use @code{gsub} to remove punctuation
15972characters.  Finally, we use the system @code{sort} utility to process the
15973output of the @code{awk} script.  Here is the new version of
15974the program:
15975
15976@findex wordfreq.sh
15977@example
15978@c file eg/prog/wordfreq.awk
15979# Print list of word frequencies
15980@{
15981    $0 = tolower($0)    # remove case distinctions
15982    gsub(/[^a-z0-9_ \t]/, "", $0)  # remove punctuation
15983    for (i = 1; i <= NF; i++)
15984        freq[$i]++
15985@}
15986@c endfile
15987
15988@group
15989END @{
15990    for (word in freq)
15991        printf "%s\t%d\n", word, freq[word]
15992@}
15993@end group
15994@end example
15995
15996Assuming we have saved this program in a file named @file{wordfreq.awk},
15997and that the data is in @file{file1}, the following pipeline
15998
15999@example
16000awk -f wordfreq.awk file1 | sort +1 -nr
16001@end example
16002
16003@noindent
16004produces a table of the words appearing in @file{file1} in order of
16005decreasing frequency.
16006
16007The @code{awk} program suitably massages the data and produces a word
16008frequency table, which is not ordered.
16009
16010The @code{awk} script's output is then sorted by the @code{sort} utility and
16011printed on the terminal.  The options given to @code{sort} in this example
16012specify to sort using the second field of each input line (skipping one field),
16013that the sort keys should be treated as numeric quantities (otherwise
16014@samp{15} would come before @samp{5}), and that the sorting should be done
16015in descending (reverse) order.
16016
16017We could have even done the @code{sort} from within the program, by
16018changing the @code{END} action to:
16019
16020@example
16021@c file eg/prog/wordfreq.awk
16022END @{
16023    sort = "sort +1 -nr"
16024    for (word in freq)
16025        printf "%s\t%d\n", word, freq[word] | sort
16026    close(sort)
16027@}
16028@c endfile
16029@end example
16030
16031You would have to use this way of sorting on systems that do not
16032have true pipes.
16033
16034See the general operating system documentation for more information on how
16035to use the @code{sort} program.
16036
16037@node History Sorting, Extract Program, Word Sorting, Miscellaneous Programs
16038@subsection Removing Duplicates from Unsorted Text
16039
16040The @code{uniq} program
16041(@pxref{Uniq Program, ,Printing Non-duplicated Lines of Text}),
16042removes duplicate lines from @emph{sorted} data.
16043
16044Suppose, however, you need to remove duplicate lines from a data file, but
16045that you wish to preserve the order the lines are in?  A good example of
16046this might be a shell history file.  The history file keeps a copy of all
16047the commands you have entered, and it is not unusual to repeat a command
16048several times in a row.  Occasionally you might wish to compact the history
16049by removing duplicate entries.  Yet it is desirable to maintain the order
16050of the original commands.
16051
16052This simple program does the job.  It uses two arrays.  The @code{data}
16053array is indexed by the text of each line.
16054For each line, @code{data[$0]} is incremented.
16055
16056If a particular line has not
16057been seen before, then @code{data[$0]} will be zero.
16058In that case, the text of the line is stored in @code{lines[count]}.
16059Each element of @code{lines} is a unique command, and the indices of
16060@code{lines} indicate the order in which those lines were encountered.
16061The @code{END} rule simply prints out the lines, in order.
16062
16063@cindex Rakitzis, Byron
16064@findex histsort.awk
16065@example
16066@group
16067@c file eg/prog/histsort.awk
16068# histsort.awk --- compact a shell history file
16069# Arnold Robbins, arnold@@gnu.org, Public Domain
16070# May 1993
16071
16072# Thanks to Byron Rakitzis for the general idea
16073@{
16074    if (data[$0]++ == 0)
16075        lines[++count] = $0
16076@}
16077
16078END @{
16079    for (i = 1; i <= count; i++)
16080        print lines[i]
16081@}
16082@c endfile
16083@end group
16084@end example
16085
16086This program also provides a foundation for generating other useful
16087information.  For example, using the following @code{print} satement in the
16088@code{END} rule would indicate how often a particular command was used.
16089
16090@example
16091print data[lines[i]], lines[i]
16092@end example
16093
16094This works because @code{data[$0]} was incremented each time a line was
16095seen.
16096
16097@node Extract Program, Simple Sed, History Sorting, Miscellaneous Programs
16098@subsection Extracting Programs from Texinfo Source Files
16099
16100@iftex
16101Both this chapter and the previous chapter
16102(@ref{Library Functions, ,A Library of @code{awk} Functions}),
16103present a large number of @code{awk} programs.
16104@end iftex
16105@ifinfo
16106The nodes
16107@ref{Library Functions, ,A Library of @code{awk} Functions},
16108and @ref{Sample Programs, ,Practical @code{awk} Programs},
16109are the top level nodes for a large number of @code{awk} programs.
16110@end ifinfo
16111If you wish to experiment with these programs, it is tedious to have to type
16112them in by hand.  Here we present a program that can extract parts of a
16113Texinfo input file into separate files.
16114
16115This @value{DOCUMENT} is written in Texinfo, the GNU project's document
16116formatting language.  A single Texinfo source file can be used to produce both
16117printed and on-line documentation.
16118@iftex
16119Texinfo is fully documented in @cite{Texinfo---The GNU Documentation Format},
16120available from the Free Software Foundation.
16121@end iftex
16122@ifinfo
16123The Texinfo language is described fully, starting with
16124@ref{Top, , Introduction, texi, Texinfo---The GNU Documentation Format}.
16125@end ifinfo
16126
16127For our purposes, it is enough to know three things about Texinfo input
16128files.
16129
16130@itemize @bullet
16131@item
16132The ``at'' symbol, @samp{@@}, is special in Texinfo, much like @samp{\} in C
16133or @code{awk}.  Literal @samp{@@} symbols are represented in Texinfo source
16134files as @samp{@@@@}.
16135
16136@item
16137Comments start with either @samp{@@c} or @samp{@@comment}.
16138The file extraction program will work by using special comments that start
16139at the beginning of a line.
16140
16141@item
16142Example text that should not be split across a page boundary is bracketed
16143between lines containing @samp{@@group} and @samp{@@end group} commands.
16144@end itemize
16145
16146The following program, @file{extract.awk}, reads through a Texinfo source
16147file, and does two things, based on the special comments.
16148Upon seeing @samp{@w{@@c system @dots{}}},
16149it runs a command, by extracting the command text from the
16150control line and passing it on to the @code{system} function
16151(@pxref{I/O Functions, ,Built-in Functions for Input/Output}).
16152Upon seeing @samp{@@c file @var{filename}}, each subsequent line is sent to
16153the file @var{filename}, until @samp{@@c endfile} is encountered.
16154The rules in @file{extract.awk} will match either @samp{@@c} or
16155@samp{@@comment} by letting the @samp{omment} part be optional.
16156Lines containing @samp{@@group} and @samp{@@end group} are simply removed.
16157@file{extract.awk} uses the @code{join} library function
16158(@pxref{Join Function, ,Merging an Array Into a String}).
16159
16160The example programs in the on-line Texinfo source for @cite{@value{TITLE}}
16161(@file{gawk.texi}) have all been bracketed inside @samp{file},
16162and @samp{endfile} lines.  The @code{gawk} distribution uses a copy of
16163@file{extract.awk} to extract the sample
16164programs and install many of them in a standard directory, where
16165@code{gawk} can find them.
16166The Texinfo file looks something like this:
16167
16168@example
16169@dots{}
16170This program has a @@code@{BEGIN@} block,
16171which prints a nice message:
16172
16173@@example
16174@@c file examples/messages.awk
16175BEGIN @@@{ print "Don't panic!" @@@}
16176@@c end file
16177@@end example
16178
16179It also prints some final advice:
16180
16181@@example
16182@@c file examples/messages.awk
16183END @@@{ print "Always avoid bored archeologists!" @@@}
16184@@c end file
16185@@end example
16186@dots{}
16187@end example
16188
16189@file{extract.awk} begins by setting @code{IGNORECASE} to one, so that
16190mixed upper-case and lower-case letters in the directives won't matter.
16191
16192The first rule handles calling @code{system}, checking that a command was
16193given (@code{NF} is at least three), and also checking that the command
16194exited with a zero exit status, signifying OK.
16195
16196@findex extract.awk
16197@example
16198@c @group
16199@c file eg/prog/extract.awk
16200# extract.awk --- extract files and run programs
16201#                 from texinfo files
16202# Arnold Robbins, arnold@@gnu.org, Public Domain, May 1993
16203
16204BEGIN    @{ IGNORECASE = 1 @}
16205
16206@group
16207/^@@c(omment)?[ \t]+system/    \
16208@{
16209    if (NF < 3) @{
16210        e = (FILENAME ":" FNR)
16211        e = (e  ": badly formed `system' line")
16212        print e > "/dev/stderr"
16213        next
16214    @}
16215    $1 = ""
16216    $2 = ""
16217    stat = system($0)
16218    if (stat != 0) @{
16219        e = (FILENAME ":" FNR)
16220        e = (e ": warning: system returned " stat)
16221        print e > "/dev/stderr"
16222    @}
16223@}
16224@end group
16225@c endfile
16226@end example
16227
16228@noindent
16229The variable @code{e} is used so that the function
16230fits nicely on the
16231@iftex
16232page.
16233@end iftex
16234@ifinfo
16235screen.
16236@end ifinfo
16237
16238The second rule handles moving data into files.  It verifies that a file
16239name was given in the directive.  If the file named is not the current file,
16240then the current file is closed.  This means that an @samp{@@c endfile} was
16241not given for that file.  (We should probably print a diagnostic in this
16242case, although at the moment we do not.)
16243
16244The @samp{for} loop does the work.  It reads lines using @code{getline}
16245(@pxref{Getline, ,Explicit Input with @code{getline}}).
16246For an unexpected end of file, it calls the @code{@w{unexpected_eof}}
16247function.  If the line is an ``endfile'' line, then it breaks out of
16248the loop.
16249If the line is an @samp{@@group} or @samp{@@end group} line, then it
16250ignores it, and goes on to the next line.
16251(These Texinfo control lines keep blocks of code together on one page;
16252unfortunately, @TeX{} isn't always smart enough to do things exactly right,
16253and we have to give it some advice.)
16254
16255Most of the work is in the following few lines.  If the line has no @samp{@@}
16256symbols, it can be printed directly.  Otherwise, each leading @samp{@@} must be
16257stripped off.
16258
16259To remove the @samp{@@} symbols, the line is split into separate elements of
16260the array @code{a}, using the @code{split} function
16261(@pxref{String Functions, ,Built-in Functions for String Manipulation}).
16262Each element of @code{a} that is empty indicates two successive @samp{@@}
16263symbols in the original line.  For each two empty elements (@samp{@@@@} in
16264the original file), we have to add back in a single @samp{@@} symbol.
16265
16266When the processing of the array is finished, @code{join} is called with the
16267value of @code{SUBSEP}, to rejoin the pieces back into a single
16268line.  That line is then printed to the output file.
16269
16270@example
16271@c @group
16272@c file eg/prog/extract.awk
16273@group
16274/^@@c(omment)?[ \t]+file/    \
16275@{
16276    if (NF != 3) @{
16277        e = (FILENAME ":" FNR ": badly formed `file' line")
16278        print e > "/dev/stderr"
16279        next
16280    @}
16281@end group
16282    if ($3 != curfile) @{
16283        if (curfile != "")
16284            close(curfile)
16285        curfile = $3
16286    @}
16287
16288    for (;;) @{
16289        if ((getline line) <= 0)
16290            unexpected_eof()
16291        if (line ~ /^@@c(omment)?[ \t]+endfile/)
16292            break
16293        else if (line ~ /^@@(end[ \t]+)?group/)
16294            continue
16295        if (index(line, "@@") == 0) @{
16296            print line > curfile
16297            continue
16298        @}
16299        n = split(line, a, "@@")
16300@group
16301        # if a[1] == "", means leading @@,
16302        # don't add one back in.
16303@end group
16304        for (i = 2; i <= n; i++) @{
16305            if (a[i] == "") @{ # was an @@@@
16306                a[i] = "@@"
16307                if (a[i+1] == "")
16308                    i++
16309            @}
16310        @}
16311        print join(a, 1, n, SUBSEP) > curfile
16312    @}
16313@}
16314@c endfile
16315@c @end group
16316@end example
16317
16318An important thing to note is the use of the @samp{>} redirection.
16319Output done with @samp{>} only opens the file once; it stays open and
16320subsequent output is appended to the file
16321(@pxref{Redirection, , Redirecting Output of @code{print} and @code{printf}}).
16322This allows us to easily mix program text and explanatory prose for the same
16323sample source file (as has been done here!) without any hassle.  The file is
16324only closed when a new data file name is encountered, or at the end of the
16325input file.
16326
16327Finally, the function @code{@w{unexpected_eof}} prints an appropriate
16328error message and then exits.
16329
16330The @code{END} rule handles the final cleanup, closing the open file.
16331
16332@example
16333@c file eg/prog/extract.awk
16334@group
16335function unexpected_eof()
16336@{
16337    printf("%s:%d: unexpected EOF or error\n", \
16338        FILENAME, FNR) > "/dev/stderr"
16339    exit 1
16340@}
16341@end group
16342
16343END @{
16344    if (curfile)
16345        close(curfile)
16346@}
16347@c endfile
16348@end example
16349
16350@node Simple Sed, Igawk Program, Extract Program, Miscellaneous Programs
16351@subsection A Simple Stream Editor
16352
16353@cindex @code{sed} utility
16354The @code{sed} utility is a ``stream editor,'' a program that reads a
16355stream of data, makes changes to it, and passes the modified data on.
16356It is often used to make global changes to a large file, or to a stream
16357of data generated by a pipeline of commands.
16358
16359While @code{sed} is a complicated program in its own right, its most common
16360use is to perform global substitutions in the middle of a pipeline:
16361
16362@example
16363command1 < orig.data | sed 's/old/new/g' | command2 > result
16364@end example
16365
16366Here, the @samp{s/old/new/g} tells @code{sed} to look for the regexp
16367@samp{old} on each input line, and replace it with the text @samp{new},
16368globally (i.e.@: all the occurrences on a line).  This is similar to
16369@code{awk}'s @code{gsub} function
16370(@pxref{String Functions, , Built-in Functions for String Manipulation}).
16371
16372The following program, @file{awksed.awk}, accepts at least two command line
16373arguments; the pattern to look for and the text to replace it with. Any
16374additional arguments are treated as data file names to process. If none
16375are provided, the standard input is used.
16376
16377@cindex Brennan, Michael
16378@cindex @code{awksed}
16379@cindex simple stream editor
16380@cindex stream editor, simple
16381@example
16382@c @group
16383@c file eg/prog/awksed.awk
16384# awksed.awk --- do s/foo/bar/g using just print
16385#    Thanks to Michael Brennan for the idea
16386
16387# Arnold Robbins, arnold@@gnu.org, Public Domain
16388# August 1995
16389
16390function usage()
16391@{
16392    print "usage: awksed pat repl [files...]" > "/dev/stderr"
16393    exit 1
16394@}
16395
16396@group
16397BEGIN @{
16398    # validate arguments
16399    if (ARGC < 3)
16400        usage()
16401@end group
16402
16403    RS = ARGV[1]
16404    ORS = ARGV[2]
16405
16406    # don't use arguments as files
16407    ARGV[1] = ARGV[2] = ""
16408@}
16409
16410# look ma, no hands!
16411@{
16412    if (RT == "")
16413        printf "%s", $0
16414    else
16415        print
16416@}
16417@c endfile
16418@c @end group
16419@end example
16420
16421The program relies on @code{gawk}'s ability to have @code{RS} be a regexp
16422and on the setting of @code{RT} to the actual text that terminated the
16423record (@pxref{Records, ,How Input is Split into Records}).
16424
16425The idea is to have @code{RS} be the pattern to look for. @code{gawk}
16426will automatically set @code{$0} to the text between matches of the pattern.
16427This is text that we wish to keep, unmodified.  Then, by setting @code{ORS}
16428to the replacement text, a simple @code{print} statement will output the
16429text we wish to keep, followed by the replacement text.
16430
16431There is one wrinkle to this scheme, which is what to do if the last record
16432doesn't end with text that matches @code{RS}?  Using a @code{print}
16433statement unconditionally prints the replacement text, which is not correct.
16434
16435However, if the file did not end in text that matches @code{RS}, @code{RT}
16436will be set to the null string.  In this case, we can print @code{$0} using
16437@code{printf}
16438(@pxref{Printf, ,Using @code{printf} Statements for Fancier Printing}).
16439
16440The @code{BEGIN} rule handles the setup, checking for the right number
16441of arguments, and calling @code{usage} if there is a problem. Then it sets
16442@code{RS} and @code{ORS} from the command line arguments, and sets
16443@code{ARGV[1]} and @code{ARGV[2]} to the null string, so that they will
16444not be treated as file names
16445(@pxref{ARGC and ARGV, , Using @code{ARGC} and @code{ARGV}}).
16446
16447The @code{usage} function prints an error message and exits.
16448
16449Finally, the single rule handles the printing scheme outlined above,
16450using @code{print} or @code{printf} as appropriate, depending upon the
16451value of @code{RT}.
16452
16453@ignore
16454Exercise, compare the performance of this version with the more
16455straightforward:
16456
16457BEGIN {
16458    pat = ARGV[1]
16459    repl = ARGV[2]
16460    ARGV[1] = ARGV[2] = ""
16461}
16462
16463{ gsub(pat, repl); print }
16464
16465Exercise: what are the advantages and disadvantages of this version vs. sed?
16466  Advantage: egrep regexps
16467             speed (?)
16468  Disadvantage: no & in replacement text
16469
16470Others?
16471@end ignore
16472
16473@node Igawk Program, , Simple Sed, Miscellaneous Programs
16474@subsection An Easy Way to Use Library Functions
16475
16476Using library functions in @code{awk} can be very beneficial. It
16477encourages code re-use and the writing of general functions. Programs are
16478smaller, and therefore clearer.
16479However, using library functions is only easy when writing @code{awk}
16480programs; it is painful when running them, requiring multiple @samp{-f}
16481options.  If @code{gawk} is unavailable, then so too is the @code{AWKPATH}
16482environment variable and the ability to put @code{awk} functions into a
16483library directory (@pxref{Options, ,Command Line Options}).
16484
16485It would be nice to be able to write programs like so:
16486
16487@example
16488# library functions
16489@@include getopt.awk
16490@@include join.awk
16491@dots{}
16492
16493# main program
16494BEGIN @{
16495    while ((c = getopt(ARGC, ARGV, "a:b:cde")) != -1)
16496        @dots{}
16497    @dots{}
16498@}
16499@end example
16500
16501The following program, @file{igawk.sh}, provides this service.
16502It simulates @code{gawk}'s searching of the @code{AWKPATH} variable,
16503and also allows @dfn{nested} includes; i.e.@: a file that has been included
16504with @samp{@@include} can contain further @samp{@@include} statements.
16505@code{igawk} will make an effort to only include files once, so that nested
16506includes don't accidentally include a library function twice.
16507
16508@code{igawk} should behave externally just like @code{gawk}.  This means it
16509should accept all of @code{gawk}'s command line arguments, including the
16510ability to have multiple source files specified via @samp{-f}, and the
16511ability to mix command line and library source files.
16512
16513The program is written using the POSIX Shell (@code{sh}) command language.
16514The way the program works is as follows:
16515
16516@enumerate
16517@item
16518Loop through the arguments, saving anything that doesn't represent
16519@code{awk} source code for later, when the expanded program is run.
16520
16521@item
16522For any arguments that do represent @code{awk} text, put the arguments into
16523a temporary file that will be expanded.  There are two cases.
16524
16525@enumerate a
16526@item
16527Literal text, provided with @samp{--source} or @samp{--source=}.  This
16528text is just echoed directly.  The @code{echo} program will automatically
16529supply a trailing newline.
16530
16531@item
16532File names provided with @samp{-f}.  We use a neat trick, and echo
16533@samp{@@include @var{filename}} into the temporary file.  Since the file
16534inclusion program will work the way @code{gawk} does, this will get the text
16535of the file included into the program at the correct point.
16536@end enumerate
16537
16538@item
16539Run an @code{awk} program (naturally) over the temporary file to expand
16540@samp{@@include} statements.  The expanded program is placed in a second
16541temporary file.
16542
16543@item
16544Run the expanded program with @code{gawk} and any other original command line
16545arguments that the user supplied (such as the data file names).
16546@end enumerate
16547
16548The initial part of the program turns on shell tracing if the first
16549argument was @samp{debug}.  Otherwise, a shell @code{trap} statement
16550arranges to clean up any temporary files on program exit or upon an
16551interrupt.
16552
16553@c 2e: For the temporary file handling, use mktemp with $@{TMPDIR:-/tmp@}.
16554
16555The next part loops through all the command line arguments.
16556There are several cases of interest.
16557
16558@table @code
16559@item --
16560This ends the arguments to @code{igawk}.  Anything else should be passed on
16561to the user's @code{awk} program without being evaluated.
16562
16563@item -W
16564This indicates that the next option is specific to @code{gawk}.  To make
16565argument processing easier, the @samp{-W} is appended to the front of the
16566remaining arguments and the loop continues.  (This is an @code{sh}
16567programming trick.  Don't worry about it if you are not familiar with
16568@code{sh}.)
16569
16570@item -v
16571@itemx -F
16572These are saved and passed on to @code{gawk}.
16573
16574@item -f
16575@itemx --file
16576@itemx --file=
16577@itemx -Wfile=
16578The file name is saved to a temporary file with an
16579@samp{@@include} statement.
16580The @code{sed} utility is used to remove the leading option part of the
16581argument (e.g., @samp{--file=}).
16582
16583@item --source
16584@itemx --source=
16585@itemx -Wsource=
16586The source text is echoed into a temporary file.
16587
16588@item --version
16589@itemx -Wversion
16590@code{igawk} prints its version number, and runs @samp{gawk --version}
16591to get the @code{gawk} version information, and then exits.
16592@end table
16593
16594If none of @samp{-f}, @samp{--file}, @samp{-Wfile}, @samp{--source},
16595or @samp{-Wsource}, were supplied, then the first non-option argument
16596should be the @code{awk} program.  If there are no command line
16597arguments left, @code{igawk} prints an error message and exits.
16598Otherwise, the first argument is echoed into a temporary file.
16599
16600In any case, after the arguments have been processed,
16601the complete text of the original @code{awk} program
16602is contained in a temporary file.
16603
16604@cindex @code{sed} utility
16605Here's the program:
16606
16607@findex igawk.sh
16608@example
16609@c @group
16610@c file eg/prog/igawk.sh
16611#! /bin/sh
16612
16613# igawk --- like gawk but do @@include processing
16614# Arnold Robbins, arnold@@gnu.org, Public Domain
16615# July 1993
16616
16617# Temporary file handling modifications for Owl by
16618# Jarno Huuskonen and Solar Designer, still Public Domain
16619# May 2001
16620
16621if [ ! -x /bin/mktemp ]; then
16622    echo "$0 needs mktemp to create temporary files."
16623    exit 1
16624fi
16625
16626STEMPFILE=`/bin/mktemp $@{TMPDIR:-/tmp@}/igawk.s.XXXXXX` || exit 1
16627ETEMPFILE=`/bin/mktemp $@{TMPDIR:-/tmp@}/igawk.e.XXXXXX` || exit 1
16628
16629if [ "$1" = debug ]
16630then
16631    set -x
16632    shift
16633else
16634    # cleanup on exit, hangup, interrupt, quit, termination
16635    trap 'rm -f $STEMPFILE $ETEMPFILE' EXIT HUP INT QUIT TERM
16636fi
16637
16638while [ $# -ne 0 ] # loop over arguments
16639do
16640    case $1 in
16641    --)     shift; break;;
16642
16643    -W)     shift
16644            set -- -W"$@@"
16645            continue;;
16646
16647    -[vF])  opts="$opts $1 '$2'"
16648            shift;;
16649
16650    -[vF]*) opts="$opts '$1'" ;;
16651
16652    -f)     echo @@include "$2" >> $STEMPFILE
16653            shift;;
16654
16655@group
16656    -f*)    f=`echo "$1" | sed 's/-f//'`
16657            echo @@include "$f" >> $STEMPFILE ;;
16658@end group
16659
16660    -?file=*)    # -Wfile or --file
16661            f=`echo "$1" | sed 's/-.file=//'`
16662            echo @@include "$f" >> $STEMPFILE ;;
16663
16664    -?file)    # get arg, $2
16665            echo @@include "$2" >> $STEMPFILE
16666            shift;;
16667
16668    -?source=*)    # -Wsource or --source
16669            t=`echo "$1" | sed 's/-.source=//'`
16670            echo "$t" >> $STEMPFILE ;;
16671
16672    -?source)  # get arg, $2
16673            echo "$2" >> $STEMPFILE
16674            shift;;
16675
16676    -?version)
16677            echo igawk: version 1.0 1>&2
16678            gawk --version
16679            exit 0 ;;
16680
16681    -[W-]*)    opts="$opts '$1'" ;;
16682
16683    *)      break;;
16684    esac
16685    shift
16686done
16687
16688if [ ! -s $STEMPFILE ]
16689then
16690    if [ -z "$1" ]
16691    then
16692         echo igawk: no program! 1>&2
16693         exit 1
16694    else
16695        echo "$1" > $STEMPFILE
16696        shift
16697    fi
16698fi
16699
16700# at this point, $STEMPFILE has the program
16701@c endfile
16702@c @end group
16703@end example
16704
16705The @code{awk} program to process @samp{@@include} directives reads through
16706the program, one line at a time using @code{getline}
16707(@pxref{Getline, ,Explicit Input with @code{getline}}).
16708The input file names and @samp{@@include} statements are managed using a
16709stack.  As each @samp{@@include} is encountered, the current file name is
16710``pushed'' onto the stack, and the file named in the @samp{@@include}
16711directive becomes
16712the current file name.  As each file is finished, the stack is ``popped,''
16713and the previous input file becomes the current input file again.
16714The process is started by making the original file the first one on the
16715stack.
16716
16717The @code{pathto} function does the work of finding the full path to a
16718file.  It simulates @code{gawk}'s behavior when searching the @code{AWKPATH}
16719environment variable
16720(@pxref{AWKPATH Variable, ,The @code{AWKPATH} Environment Variable}).
16721If a file name has a @samp{/} in it, no path search
16722is done. Otherwise, the file name is concatenated with the name of each
16723directory in the path, and an attempt is made to open the generated file
16724name.  The only way in @code{awk} to test if a file can be read is to go
16725ahead and try to read it with @code{getline}; that is what @code{pathto}
16726does.@footnote{On some very old versions of @code{awk}, the test
16727@samp{getline junk < t} can loop forever if the file exists but is empty.
16728Caveat Emptor.}
16729If the file can be read, it is closed, and the file name is
16730returned.
16731@ignore
16732An alternative way to test for the file's existence would be to call
16733@samp{system("test -r " t)}, which uses the @code{test} utility to
16734see if the file exists and is readable.  The disadvantage to this method
16735is that it requires creating an extra process, and can thus be slightly
16736slower.
16737@end ignore
16738
16739@example
16740@c file eg/prog/igawk.sh
16741gawk -- '
16742# process @@include directives
16743@c endfile
16744
16745@group
16746@c file eg/prog/igawk.sh
16747function pathto(file,    i, t, junk)
16748@{
16749    if (index(file, "/") != 0)
16750        return file
16751
16752    for (i = 1; i <= ndirs; i++) @{
16753        t = (pathlist[i] "/" file)
16754        if ((getline junk < t) > 0) @{
16755            # found it
16756            close(t)
16757            return t
16758        @}
16759    @}
16760    return ""
16761@}
16762@c endfile
16763@end group
16764@end example
16765
16766The main program is contained inside one @code{BEGIN} rule.  The first thing it
16767does is set up the @code{pathlist} array that @code{pathto} uses.  After
16768splitting the path on @samp{:}, null elements are replaced with @code{"."},
16769which represents the current directory.
16770
16771@example
16772@group
16773@c file eg/prog/igawk.sh
16774BEGIN @{
16775    path = ENVIRON["AWKPATH"]
16776    ndirs = split(path, pathlist, ":")
16777    for (i = 1; i <= ndirs; i++) @{
16778        if (pathlist[i] == "")
16779            pathlist[i] = "."
16780    @}
16781@c endfile
16782@end group
16783@end example
16784
16785The stack is initialized with @code{ARGV[1]}, which will be @file{$STEMPFILE}.
16786The main loop comes next.  Input lines are read in succession. Lines that
16787do not start with @samp{@@include} are printed verbatim.
16788
16789If the line does start with @samp{@@include}, the file name is in @code{$2}.
16790@code{pathto} is called to generate the full path.  If it could not, then we
16791print an error message and continue.
16792
16793The next thing to check is if the file has been included already.  The
16794@code{processed} array is indexed by the full file name of each included
16795file, and it tracks this information for us.  If the file has been
16796seen, a warning message is printed. Otherwise, the new file name is
16797pushed onto the stack and processing continues.
16798
16799Finally, when @code{getline} encounters the end of the input file, the file
16800is closed and the stack is popped.  When @code{stackptr} is less than zero,
16801the program is done.
16802
16803@example
16804@c @group
16805@c file eg/prog/igawk.sh
16806    stackptr = 0
16807    input[stackptr] = ARGV[1] # ARGV[1] is first file
16808
16809    for (; stackptr >= 0; stackptr--) @{
16810        while ((getline < input[stackptr]) > 0) @{
16811            if (tolower($1) != "@@include") @{
16812                print
16813                continue
16814            @}
16815            fpath = pathto($2)
16816            if (fpath == "") @{
16817                printf("igawk:%s:%d: cannot find %s\n", \
16818                    input[stackptr], FNR, $2) > "/dev/stderr"
16819                continue
16820            @}
16821@group
16822            if (! (fpath in processed)) @{
16823                processed[fpath] = input[stackptr]
16824                input[++stackptr] = fpath
16825            @} else
16826                print $2, "included in", input[stackptr], \
16827                    "already included in", \
16828                    processed[fpath] > "/dev/stderr"
16829        @}
16830@end group
16831@group
16832        close(input[stackptr])
16833    @}
16834@}' $STEMPFILE > $ETEMPFILE
16835@end group
16836@c endfile
16837@c @end group
16838@end example
16839
16840The last step is to call @code{gawk} with the expanded program and the original
16841options and command line arguments that the user supplied.  @code{gawk}'s
16842exit status is passed back on to @code{igawk}'s calling program.
16843
16844@c this causes more problems than it solves, so leave it out.
16845@ignore
16846The special file @file{/dev/null} is passed as a data file to @code{gawk}
16847to handle an interesting case. Suppose that the user's program only has
16848a @code{BEGIN} rule, and there are no data files to read. The program should exit without reading any data
16849files.  However, suppose that an included library file defines an @code{END}
16850rule of its own. In this case, @code{gawk} will hang, reading standard
16851input. In order to avoid this, @file{/dev/null} is explicitly to the
16852command line. Reading from @file{/dev/null} always returns an immediate
16853end of file indication.
16854
16855@c Hmm. Add /dev/null if $# is 0?  Still messes up ARGV. Sigh.
16856@end ignore
16857
16858@example
16859@c @group
16860@c file eg/prog/igawk.sh
16861eval gawk -f $ETEMPFILE $opts -- "$@@"
16862
16863exit $?
16864@c endfile
16865@c @end group
16866@end example
16867
16868This version of @code{igawk} represents my third attempt at this program.
16869There are three key simplifications that made the program work better.
16870
16871@enumerate
16872@item
16873Using @samp{@@include} even for the files named with @samp{-f} makes building
16874the initial collected @code{awk} program much simpler; all the
16875@samp{@@include} processing can be done once.
16876
16877@item
16878The @code{pathto} function doesn't try to save the line read with
16879@code{getline} when testing for the file's accessibility.  Trying to save
16880this line for use with the main program complicates things considerably.
16881@c what problem does this engender though - exercise
16882@c answer, reading from "-" or /dev/stdin
16883
16884@item
16885Using a @code{getline} loop in the @code{BEGIN} rule does it all in one
16886place.  It is not necessary to call out to a separate loop for processing
16887nested @samp{@@include} statements.
16888@end enumerate
16889
16890Also, this program illustrates that it is often worthwhile to combine
16891@code{sh} and @code{awk} programming together.  You can usually accomplish
16892quite a lot, without having to resort to low-level programming in C or C++, and it
16893is frequently easier to do certain kinds of string and argument manipulation
16894using the shell than it is in @code{awk}.
16895
16896Finally, @code{igawk} shows that it is not always necessary to add new
16897features to a program; they can often be layered on top.  With @code{igawk},
16898there is no real reason to build @samp{@@include} processing into
16899@code{gawk} itself.
16900
16901As an additional example of this, consider the idea of having two
16902files in a directory in the search path.
16903
16904@table @file
16905@item default.awk
16906This file would contain a set of default library functions, such
16907as @code{getopt} and @code{assert}.
16908
16909@item site.awk
16910This file would contain library functions that are specific to a site or
16911installation, i.e.@: locally developed functions.
16912Having a separate file allows @file{default.awk} to change with
16913new @code{gawk} releases, without requiring the system administrator to
16914update it each time by adding the local functions.
16915@end table
16916
16917One user
16918@c Karl Berry, karl@ileaf.com, 10/95
16919suggested that @code{gawk} be modified to automatically read these files
16920upon startup.  Instead, it would be very simple to modify @code{igawk}
16921to do this. Since @code{igawk} can process nested @samp{@@include}
16922directives, @file{default.awk} could simply contain @samp{@@include}
16923statements for the desired library functions.
16924
16925@c Exercise: make this change
16926
16927@node Language History, Gawk Summary, Sample Programs, Top
16928@chapter The Evolution of the @code{awk} Language
16929
16930This @value{DOCUMENT} describes the GNU implementation of @code{awk}, which follows
16931the POSIX specification.  Many @code{awk} users are only familiar
16932with the original @code{awk} implementation in Version 7 Unix.
16933(This implementation was the basis for @code{awk} in Berkeley Unix,
16934through 4.3--Reno.  The 4.4 release of Berkeley Unix uses @code{gawk} 2.15.2
16935for its version of @code{awk}.) This chapter briefly describes the
16936evolution of the @code{awk} language, with cross references to other parts
16937of the @value{DOCUMENT} where you can find more information.
16938
16939@menu
16940* V7/SVR3.1::                   The major changes between V7 and System V
16941                                Release 3.1.
16942* SVR4::                        Minor changes between System V Releases 3.1
16943                                and 4.
16944* POSIX::                       New features from the POSIX standard.
16945* BTL::                         New features from the Bell Laboratories
16946                                version of @code{awk}.
16947* POSIX/GNU::                   The extensions in @code{gawk} not in POSIX
16948                                @code{awk}.
16949@end menu
16950
16951@node V7/SVR3.1, SVR4, Language History, Language History
16952@section Major Changes between V7 and SVR3.1
16953
16954The @code{awk} language evolved considerably between the release of
16955Version 7 Unix (1978) and the new version first made generally available in
16956System V Release 3.1 (1987).  This section summarizes the changes, with
16957cross-references to further details.
16958
16959@itemize @bullet
16960@item
16961The requirement for @samp{;} to separate rules on a line
16962(@pxref{Statements/Lines, ,@code{awk} Statements Versus Lines}).
16963
16964@item
16965User-defined functions, and the @code{return} statement
16966(@pxref{User-defined, ,User-defined Functions}).
16967
16968@item
16969The @code{delete} statement (@pxref{Delete, ,The @code{delete} Statement}).
16970
16971@item
16972The @code{do}-@code{while} statement
16973(@pxref{Do Statement, ,The @code{do}-@code{while} Statement}).
16974
16975@item
16976The built-in functions @code{atan2}, @code{cos}, @code{sin}, @code{rand} and
16977@code{srand} (@pxref{Numeric Functions, ,Numeric Built-in Functions}).
16978
16979@item
16980The built-in functions @code{gsub}, @code{sub}, and @code{match}
16981(@pxref{String Functions, ,Built-in Functions for String Manipulation}).
16982
16983@item
16984The built-in functions @code{close}, and @code{system}
16985(@pxref{I/O Functions, ,Built-in Functions for Input/Output}).
16986
16987@item
16988The @code{ARGC}, @code{ARGV}, @code{FNR}, @code{RLENGTH}, @code{RSTART},
16989and @code{SUBSEP} built-in variables (@pxref{Built-in Variables}).
16990
16991@item
16992The conditional expression using the ternary operator @samp{?:}
16993(@pxref{Conditional Exp, ,Conditional Expressions}).
16994
16995@item
16996The exponentiation operator @samp{^}
16997(@pxref{Arithmetic Ops, ,Arithmetic Operators}) and its assignment operator
16998form @samp{^=} (@pxref{Assignment Ops, ,Assignment Expressions}).
16999
17000@item
17001C-compatible operator precedence, which breaks some old @code{awk}
17002programs (@pxref{Precedence, ,Operator Precedence (How Operators Nest)}).
17003
17004@item
17005Regexps as the value of @code{FS}
17006(@pxref{Field Separators, ,Specifying How Fields are Separated}), and as the
17007third argument to the @code{split} function
17008(@pxref{String Functions, ,Built-in Functions for String Manipulation}).
17009
17010@item
17011Dynamic regexps as operands of the @samp{~} and @samp{!~} operators
17012(@pxref{Regexp Usage, ,How to Use Regular Expressions}).
17013
17014@item
17015The escape sequences @samp{\b}, @samp{\f}, and @samp{\r}
17016(@pxref{Escape Sequences}).
17017(Some vendors have updated their old versions of @code{awk} to
17018recognize @samp{\r}, @samp{\b}, and @samp{\f}, but this is not
17019something you can rely on.)
17020
17021@item
17022Redirection of input for the @code{getline} function
17023(@pxref{Getline, ,Explicit Input with @code{getline}}).
17024
17025@item
17026Multiple @code{BEGIN} and @code{END} rules
17027(@pxref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns}).
17028
17029@item
17030Multi-dimensional arrays
17031(@pxref{Multi-dimensional, ,Multi-dimensional Arrays}).
17032@end itemize
17033
17034@node SVR4, POSIX, V7/SVR3.1, Language History
17035@section Changes between SVR3.1 and SVR4
17036
17037@cindex @code{awk} language, V.4 version
17038The System V Release 4 version of Unix @code{awk} added these features
17039(some of which originated in @code{gawk}):
17040
17041@itemize @bullet
17042@item
17043The @code{ENVIRON} variable (@pxref{Built-in Variables}).
17044
17045@item
17046Multiple @samp{-f} options on the command line
17047(@pxref{Options, ,Command Line Options}).
17048
17049@item
17050The @samp{-v} option for assigning variables before program execution begins
17051(@pxref{Options, ,Command Line Options}).
17052
17053@item
17054The @samp{--} option for terminating command line options.
17055
17056@item
17057The @samp{\a}, @samp{\v}, and @samp{\x} escape sequences
17058(@pxref{Escape Sequences}).
17059
17060@item
17061A defined return value for the @code{srand} built-in function
17062(@pxref{Numeric Functions, ,Numeric Built-in Functions}).
17063
17064@item
17065The @code{toupper} and @code{tolower} built-in string functions
17066for case translation
17067(@pxref{String Functions, ,Built-in Functions for String Manipulation}).
17068
17069@item
17070A cleaner specification for the @samp{%c} format-control letter in the
17071@code{printf} function
17072(@pxref{Control Letters, ,Format-Control Letters}).
17073
17074@item
17075The ability to dynamically pass the field width and precision (@code{"%*.*d"})
17076in the argument list of the @code{printf} function
17077(@pxref{Control Letters, ,Format-Control Letters}).
17078
17079@item
17080The use of regexp constants such as @code{/foo/} as expressions, where
17081they are equivalent to using the matching operator, as in @samp{$0 ~ /foo/}
17082(@pxref{Using Constant Regexps, ,Using Regular Expression Constants}).
17083@end itemize
17084
17085@node POSIX, BTL, SVR4, Language History
17086@section Changes between SVR4 and POSIX @code{awk}
17087
17088The POSIX Command Language and Utilities standard for @code{awk}
17089introduced the following changes into the language:
17090
17091@itemize @bullet
17092@item
17093The use of @samp{-W} for implementation-specific options.
17094
17095@item
17096The use of @code{CONVFMT} for controlling the conversion of numbers
17097to strings (@pxref{Conversion, ,Conversion of Strings and Numbers}).
17098
17099@item
17100The concept of a numeric string, and tighter comparison rules to go
17101with it (@pxref{Typing and Comparison, ,Variable Typing and Comparison Expressions}).
17102
17103@item
17104More complete documentation of many of the previously undocumented
17105features of the language.
17106@end itemize
17107
17108The following common extensions are not permitted by the POSIX
17109standard:
17110
17111@c IMPORTANT! Keep this list in sync with the one in node Options
17112
17113@itemize @bullet
17114@item
17115@code{\x} escape sequences are not recognized
17116(@pxref{Escape Sequences}).
17117
17118@item
17119Newlines do not act as whitespace to separate fields when @code{FS} is
17120equal to a single space.
17121
17122@item
17123The synonym @code{func} for the keyword @code{function} is not
17124recognized (@pxref{Definition Syntax, ,Function Definition Syntax}).
17125
17126@item
17127The operators @samp{**} and @samp{**=} cannot be used in
17128place of @samp{^} and @samp{^=} (@pxref{Arithmetic Ops, ,Arithmetic Operators},
17129and also @pxref{Assignment Ops, ,Assignment Expressions}).
17130
17131@item
17132Specifying @samp{-Ft} on the command line does not set the value
17133of @code{FS} to be a single tab character
17134(@pxref{Field Separators, ,Specifying How Fields are Separated}).
17135
17136@item
17137The @code{fflush} built-in function is not supported
17138(@pxref{I/O Functions, , Built-in Functions for Input/Output}).
17139@end itemize
17140
17141@node BTL, POSIX/GNU, POSIX, Language History
17142@section Extensions in the Bell Laboratories @code{awk}
17143
17144@cindex Kernighan, Brian
17145Brian Kernighan, one of the original designers of Unix @code{awk},
17146has made his version available via anonymous @code{ftp}
17147(@pxref{Other Versions, ,Other Freely Available @code{awk} Implementations}).
17148This section describes extensions in his version of @code{awk} that are
17149not in POSIX @code{awk}.
17150
17151@itemize @bullet
17152@item
17153The @samp{-mf @var{NNN}} and @samp{-mr @var{NNN}} command line options
17154to set the maximum number of fields, and the maximum
17155record size, respectively
17156(@pxref{Options, ,Command Line Options}).
17157
17158@item
17159The @code{fflush} built-in function for flushing buffered output
17160(@pxref{I/O Functions, ,Built-in Functions for Input/Output}).
17161
17162@ignore
17163@item
17164The @code{SYMTAB} array, that allows access to the internal symbol
17165table of @code{awk}. This feature is not documented, largely because
17166it is somewhat shakily implemented. For instance, you cannot access arrays
17167or array elements through it.
17168@end ignore
17169@end itemize
17170
17171@node POSIX/GNU, , BTL, Language History
17172@section Extensions in @code{gawk} Not in POSIX @code{awk}
17173
17174@cindex compatibility mode
17175The GNU implementation, @code{gawk}, adds a number of features.
17176This sections lists them in the order they were added to @code{gawk}.
17177They can all be disabled with either the @samp{--traditional} or
17178@samp{--posix} options
17179(@pxref{Options, ,Command Line Options}).
17180
17181Version 2.10 of @code{gawk} introduced these features:
17182
17183@itemize @bullet
17184@item
17185The @code{AWKPATH} environment variable for specifying a path search for
17186the @samp{-f} command line option
17187(@pxref{Options, ,Command Line Options}).
17188
17189@item
17190The @code{IGNORECASE} variable and its effects
17191(@pxref{Case-sensitivity, ,Case-sensitivity in Matching}).
17192
17193@item
17194The @file{/dev/stdin}, @file{/dev/stdout}, @file{/dev/stderr}, and
17195@file{/dev/fd/@var{n}} file name interpretation
17196(@pxref{Special Files, ,Special File Names in @code{gawk}}).
17197@end itemize
17198
17199Version 2.13 of @code{gawk} introduced these features:
17200
17201@itemize @bullet
17202@item
17203The @code{FIELDWIDTHS} variable and its effects
17204(@pxref{Constant Size, ,Reading Fixed-width Data}).
17205
17206@item
17207The @code{systime} and @code{strftime} built-in functions for obtaining
17208and printing time stamps
17209(@pxref{Time Functions, ,Functions for Dealing with Time Stamps}).
17210
17211@item
17212The @samp{-W lint} option to provide source code and run time error
17213and portability checking
17214(@pxref{Options, ,Command Line Options}).
17215
17216@item
17217The @samp{-W compat} option to turn off these extensions
17218(@pxref{Options, ,Command Line Options}).
17219
17220@item
17221The @samp{-W posix} option for full POSIX compliance
17222(@pxref{Options, ,Command Line Options}).
17223@end itemize
17224
17225Version 2.14 of @code{gawk} introduced these features:
17226
17227@itemize @bullet
17228@item
17229The @code{next file} statement for skipping to the next data file
17230(@pxref{Nextfile Statement, ,The @code{nextfile} Statement}).
17231@end itemize
17232
17233Version 2.15 of @code{gawk} introduced these features:
17234
17235@itemize @bullet
17236@item
17237The @code{ARGIND} variable, that tracks the movement of @code{FILENAME}
17238through @code{ARGV}  (@pxref{Built-in Variables}).
17239
17240@item
17241The @code{ERRNO} variable, that contains the system error message when
17242@code{getline} returns @minus{}1, or when @code{close} fails
17243(@pxref{Built-in Variables}).
17244
17245@item
17246The ability to use GNU-style long named options that start with @samp{--}
17247(@pxref{Options, ,Command Line Options}).
17248
17249@item
17250The @samp{--source} option for mixing command line and library
17251file source code
17252(@pxref{Options, ,Command Line Options}).
17253
17254@item
17255The @file{/dev/pid}, @file{/dev/ppid}, @file{/dev/pgrpid}, and
17256@file{/dev/user} file name interpretation
17257(@pxref{Special Files, ,Special File Names in @code{gawk}}).
17258@end itemize
17259
17260Version 3.0 of @code{gawk} introduced these features:
17261
17262@itemize @bullet
17263@item
17264The @code{next file} statement became @code{nextfile}
17265(@pxref{Nextfile Statement, ,The @code{nextfile} Statement}).
17266
17267@item
17268The @samp{--lint-old} option to
17269warn about constructs that are not available in
17270the original Version 7 Unix version of @code{awk}
17271(@pxref{V7/SVR3.1, , Major Changes between V7 and SVR3.1}).
17272
17273@item
17274The @samp{--traditional} option was added as a better name for
17275@samp{--compat} (@pxref{Options, ,Command Line Options}).
17276
17277@item
17278The ability for @code{FS} to be a null string, and for the third
17279argument to @code{split} to be the null string
17280(@pxref{Single Character Fields, , Making Each Character a Separate Field}).
17281
17282@item
17283The ability for @code{RS} to be a regexp
17284(@pxref{Records, , How Input is Split into Records}).
17285
17286@item
17287The @code{RT} variable
17288(@pxref{Records, , How Input is Split into Records}).
17289
17290@item
17291The @code{gensub} function for more powerful text manipulation
17292(@pxref{String Functions, , Built-in Functions for String Manipulation}).
17293
17294@item
17295The @code{strftime} function acquired a default time format,
17296allowing it to be called with no arguments
17297(@pxref{Time Functions,  , Functions for Dealing with Time Stamps}).
17298
17299@item
17300Full support for both POSIX and GNU regexps
17301(@pxref{Regexp, , Regular Expressions}).
17302
17303@item
17304The @samp{--re-interval} option to provide interval expressions in regexps
17305(@pxref{Regexp Operators, , Regular Expression Operators}).
17306
17307@item
17308@code{IGNORECASE} changed, now applying to string comparison as well
17309as regexp operations
17310(@pxref{Case-sensitivity, ,Case-sensitivity in Matching}).
17311
17312@item
17313The @samp{-m} option and the @code{fflush} function from the
17314Bell Labs research version of @code{awk}
17315(@pxref{Options, ,Command Line Options}; also
17316@pxref{I/O Functions, ,Built-in Functions for Input/Output}).
17317
17318@item
17319The use of GNU Autoconf to control the configuration process
17320(@pxref{Quick Installation, , Compiling @code{gawk} for Unix}).
17321
17322@item
17323Amiga support
17324(@pxref{Amiga Installation, ,Installing @code{gawk} on an Amiga}).
17325
17326@c XXX ADD MORE STUFF HERE
17327
17328@end itemize
17329
17330@node Gawk Summary, Installation, Language History, Top
17331@appendix @code{gawk} Summary
17332
17333This appendix provides a brief summary of the @code{gawk} command line and the
17334@code{awk} language.  It is designed to serve as ``quick reference.''  It is
17335therefore terse, but complete.
17336
17337@menu
17338* Command Line Summary::        Recapitulation of the command line.
17339* Language Summary::            A terse review of the language.
17340* Variables/Fields::            Variables, fields, and arrays.
17341* Rules Summary::               Patterns and Actions, and their component
17342                                parts.
17343* Actions Summary::             Quick overview of actions.
17344* Functions Summary::           Defining and calling functions.
17345* Historical Features::         Some undocumented but supported ``features''.
17346@end menu
17347
17348@node Command Line Summary, Language Summary, Gawk Summary, Gawk Summary
17349@appendixsec Command Line Options Summary
17350
17351The command line consists of options to @code{gawk} itself, the
17352@code{awk} program text (if not supplied via the @samp{-f} option), and
17353values to be made available in the @code{ARGC} and @code{ARGV}
17354predefined @code{awk} variables:
17355
17356@example
17357gawk @r{[@var{POSIX or GNU style options}]} -f @var{source-file} @r{[@code{--}]} @var{file} @dots{}
17358gawk @r{[@var{POSIX or GNU style options}]} @r{[@code{--}]} '@var{program}' @var{file} @dots{}
17359@end example
17360
17361The options that @code{gawk} accepts are:
17362
17363@table @code
17364@item -F @var{fs}
17365@itemx --field-separator @var{fs}
17366Use @var{fs} for the input field separator (the value of the @code{FS}
17367predefined variable).
17368
17369@item -f @var{program-file}
17370@itemx --file @var{program-file}
17371Read the @code{awk} program source from the file @var{program-file}, instead
17372of from the first command line argument.
17373
17374@item -mf @var{NNN}
17375@itemx -mr @var{NNN}
17376The @samp{f} flag sets
17377the maximum number of fields, and the @samp{r} flag sets the maximum
17378record size.  These options are ignored by @code{gawk}, since @code{gawk}
17379has no predefined limits; they are only for compatibility with the
17380Bell Labs research version of Unix @code{awk}.
17381
17382@item -v @var{var}=@var{val}
17383@itemx --assign @var{var}=@var{val}
17384Assign the variable @var{var} the value @var{val} before program execution
17385begins.
17386
17387@item -W traditional
17388@itemx -W compat
17389@itemx --traditional
17390@itemx --compat
17391Use compatibility mode, in which @code{gawk} extensions are turned
17392off.
17393
17394@item -W copyleft
17395@itemx -W copyright
17396@itemx --copyleft
17397@itemx --copyright
17398Print the short version of the General Public License on the standard
17399output, and exit.  This option may disappear in a future version of @code{gawk}.
17400
17401@item -W help
17402@itemx -W usage
17403@itemx --help
17404@itemx --usage
17405Print a relatively short summary of the available options on the standard
17406output, and exit.
17407
17408@item -W lint
17409@itemx --lint
17410Give warnings about dubious or non-portable @code{awk} constructs.
17411
17412@item -W lint-old
17413@itemx --lint-old
17414Warn about constructs that are not available in
17415the original Version 7 Unix version of @code{awk}.
17416
17417@item -W posix
17418@itemx --posix
17419Use POSIX compatibility mode, in which @code{gawk} extensions
17420are turned off and additional restrictions apply.
17421
17422@item -W re-interval
17423@itemx --re-interval
17424Allow interval expressions
17425(@pxref{Regexp Operators, , Regular Expression Operators}),
17426in regexps.
17427
17428@item -W source=@var{program-text}
17429@itemx --source @var{program-text}
17430Use @var{program-text} as @code{awk} program source code.  This option allows
17431mixing command line source code with source code from files, and is
17432particularly useful for mixing command line programs with library functions.
17433
17434@item -W version
17435@itemx --version
17436Print version information for this particular copy of @code{gawk} on the error
17437output.
17438
17439@item --
17440Signal the end of options.  This is useful to allow further arguments to the
17441@code{awk} program itself to start with a @samp{-}.  This is mainly for
17442consistency with POSIX argument parsing conventions.
17443@end table
17444
17445Any other options are flagged as invalid, but are otherwise ignored.
17446@xref{Options, ,Command Line Options}, for more details.
17447
17448@node Language Summary, Variables/Fields, Command Line Summary, Gawk Summary
17449@appendixsec Language Summary
17450
17451An @code{awk} program consists of a sequence of zero or more pattern-action
17452statements and optional function definitions.  One or the other of the
17453pattern and action may be omitted.
17454
17455@example
17456@var{pattern}    @{ @var{action statements} @}
17457@var{pattern}
17458          @{ @var{action statements} @}
17459
17460function @var{name}(@var{parameter list})     @{ @var{action statements} @}
17461@end example
17462
17463@code{gawk} first reads the program source from the
17464@var{program-file}(s), if specified, or from the first non-option
17465argument on the command line.  The @samp{-f} option may be used multiple
17466times on the command line.  @code{gawk} reads the program text from all
17467the @var{program-file} files, effectively concatenating them in the
17468order they are specified.  This is useful for building libraries of
17469@code{awk} functions, without having to include them in each new
17470@code{awk} program that uses them.  To use a library function in a file
17471from a program typed in on the command line, specify
17472@samp{--source '@var{program}'}, and type your program in between the single
17473quotes.
17474@xref{Options, ,Command Line Options}.
17475
17476The environment variable @code{AWKPATH} specifies a search path to use
17477when finding source files named with the @samp{-f} option.  The default
17478path, which is
17479@samp{.:/usr/local/share/awk}@footnote{The path may use a directory
17480other than @file{/usr/local/share/awk}, depending upon how @code{gawk}
17481was built and installed.} is used if @code{AWKPATH} is not set.
17482If a file name given to the @samp{-f} option contains a @samp{/} character,
17483no path search is performed.
17484@xref{AWKPATH Variable, ,The @code{AWKPATH} Environment Variable}.
17485
17486@code{gawk} compiles the program into an internal form, and then proceeds to
17487read each file named in the @code{ARGV} array.
17488The initial values of @code{ARGV} come from the command line arguments.
17489If there are no files named
17490on the command line, @code{gawk} reads the standard input.
17491
17492If a ``file'' named on the command line has the form
17493@samp{@var{var}=@var{val}}, it is treated as a variable assignment: the
17494variable @var{var} is assigned the value @var{val}.
17495If any of the files have a value that is the null string, that
17496element in the list is skipped.
17497
17498For each record in the input, @code{gawk} tests to see if it matches any
17499@var{pattern} in the @code{awk} program.  For each pattern that the record
17500matches, the associated @var{action} is executed.
17501
17502@node Variables/Fields, Rules Summary, Language Summary, Gawk Summary
17503@appendixsec Variables and Fields
17504
17505@code{awk} variables are not declared; they come into existence when they are
17506first used.  Their values are either floating-point numbers or strings.
17507@code{awk} also has one-dimensional arrays; multiple-dimensional arrays
17508may be simulated.  There are several predefined variables that
17509@code{awk} sets as a program runs; these are summarized below.
17510
17511@menu
17512* Fields Summary::              Input field splitting.
17513* Built-in Summary::            @code{awk}'s built-in variables.
17514* Arrays Summary::              Using arrays.
17515* Data Type Summary::           Values in @code{awk} are numbers or strings.
17516@end menu
17517
17518@node Fields Summary, Built-in Summary, Variables/Fields, Variables/Fields
17519@appendixsubsec Fields
17520
17521As each input line is read, @code{gawk} splits the line into
17522@var{fields}, using the value of the @code{FS} variable as the field
17523separator.  If @code{FS} is a single character, fields are separated by
17524that character.  Otherwise, @code{FS} is expected to be a full regular
17525expression.  In the special case that @code{FS} is a single space,
17526fields are separated by runs of spaces, tabs and/or newlines.@footnote{In
17527POSIX @code{awk}, newline does not separate fields.}
17528If @code{FS} is the null string (@code{""}), then each individual
17529character in the record becomes a separate field.
17530Note that the value
17531of @code{IGNORECASE} (@pxref{Case-sensitivity, ,Case-sensitivity in Matching})
17532also affects how fields are split when @code{FS} is a regular expression.
17533
17534Each field in the input line may be referenced by its position, @code{$1},
17535@code{$2}, and so on.  @code{$0} is the whole line.  The value of a field may
17536be assigned to as well.  Field numbers need not be constants:
17537
17538@example
17539n = 5
17540print $n
17541@end example
17542
17543@noindent
17544prints the fifth field in the input line.  The variable @code{NF} is set to
17545the total number of fields in the input line.
17546
17547References to non-existent fields (i.e.@: fields after @code{$NF}) return
17548the null string.  However, assigning to a non-existent field (e.g.,
17549@code{$(NF+2) = 5}) increases the value of @code{NF}, creates any
17550intervening fields with the null string as their value, and causes the
17551value of @code{$0} to be recomputed, with the fields being separated by
17552the value of @code{OFS}.
17553Decrementing @code{NF} causes the values of fields past the new value to
17554be lost, and the value of @code{$0} to be recomputed, with the fields being
17555separated by the value of @code{OFS}.
17556@xref{Reading Files, ,Reading Input Files}.
17557
17558@node Built-in Summary, Arrays Summary, Fields Summary, Variables/Fields
17559@appendixsubsec Built-in Variables
17560
17561@code{gawk}'s built-in variables are:
17562
17563@table @code
17564@item ARGC
17565The number of elements in @code{ARGV}. See below for what is actually
17566included in @code{ARGV}.
17567
17568@item ARGIND
17569The index in @code{ARGV} of the current file being processed.
17570When @code{gawk} is processing the input data files,
17571it is always true that @samp{FILENAME == ARGV[ARGIND]}.
17572
17573@item ARGV
17574The array of command line arguments.  The array is indexed from zero to
17575@code{ARGC} @minus{} 1.  Dynamically changing @code{ARGC} and
17576the contents of @code{ARGV}
17577can control the files used for data.  A null-valued element in
17578@code{ARGV} is ignored. @code{ARGV} does not include the options to
17579@code{awk} or the text of the @code{awk} program itself.
17580
17581@item CONVFMT
17582The conversion format to use when converting numbers to strings.
17583
17584@item FIELDWIDTHS
17585A space separated list of numbers describing the fixed-width input data.
17586
17587@item ENVIRON
17588An array of environment variable values.  The array
17589is indexed by variable name, each element being the value of that
17590variable.  Thus, the environment variable @code{HOME} is
17591@code{ENVIRON["HOME"]}.  One possible value might be @file{/home/arnold}.
17592
17593Changing this array does not affect the environment seen by programs
17594which @code{gawk} spawns via redirection or the @code{system} function.
17595(This may change in a future version of @code{gawk}.)
17596
17597Some operating systems do not have environment variables.
17598The @code{ENVIRON} array is empty when running on these systems.
17599
17600@item ERRNO
17601The system error message when an error occurs using @code{getline}
17602or @code{close}.
17603
17604@item FILENAME
17605The name of the current input file.  If no files are specified on the command
17606line, the value of @code{FILENAME} is the null string.
17607
17608@item FNR
17609The input record number in the current input file.
17610
17611@item FS
17612The input field separator, a space by default.
17613
17614@item IGNORECASE
17615The case-sensitivity flag for string comparisons and regular expression
17616operations.  If @code{IGNORECASE} has a non-zero value, then pattern
17617matching in rules, record separating with @code{RS}, field splitting
17618with @code{FS}, regular expression matching with @samp{~} and
17619@samp{!~}, and the @code{gensub}, @code{gsub}, @code{index},
17620@code{match}, @code{split} and @code{sub} built-in functions all
17621ignore case when doing regular expression operations, and all string
17622comparisons are done ignoring case.
17623The value of @code{IGNORECASE} does @emph{not} affect array subscripting.
17624
17625@item NF
17626The number of fields in the current input record.
17627
17628@item NR
17629The total number of input records seen so far.
17630
17631@item OFMT
17632The output format for numbers for the @code{print} statement,
17633@code{"%.6g"} by default.
17634
17635@item OFS
17636The output field separator, a space by default.
17637
17638@item ORS
17639The output record separator, by default a newline.
17640
17641@item RS
17642The input record separator, by default a newline.
17643If @code{RS} is set to the null string, then records are separated by
17644blank lines.  When @code{RS} is set to the null string, then the newline
17645character always acts as a field separator, in addition to whatever value
17646@code{FS} may have.  If @code{RS} is set to a multi-character
17647string, it denotes a regexp; input text matching the regexp
17648separates records.
17649
17650@item RT
17651The input text that matched the text denoted by @code{RS},
17652the record separator.
17653
17654@item RSTART
17655The index of the first character last matched by @code{match}; zero if no match.
17656
17657@item RLENGTH
17658The length of the string last matched by @code{match}; @minus{}1 if no match.
17659
17660@item SUBSEP
17661The string used to separate multiple subscripts in array elements, by
17662default @code{"\034"}.
17663@end table
17664
17665@xref{Built-in Variables}, for more information.
17666
17667@node Arrays Summary, Data Type Summary, Built-in Summary, Variables/Fields
17668@appendixsubsec Arrays
17669
17670Arrays are subscripted with an expression between square brackets
17671(@samp{[} and @samp{]}).  Array subscripts are @emph{always} strings;
17672numbers are converted to strings as necessary, following the standard
17673conversion rules
17674(@pxref{Conversion, ,Conversion of Strings and Numbers}).
17675
17676If you use multiple expressions separated by commas inside the square
17677brackets, then the array subscript is a string consisting of the
17678concatenation of the individual subscript values, converted to strings,
17679separated by the subscript separator (the value of @code{SUBSEP}).
17680
17681The special operator @code{in} may be used in a conditional context
17682to see if an array has an index consisting of a particular value.
17683
17684@example
17685if (val in array)
17686        print array[val]
17687@end example
17688
17689If the array has multiple subscripts, use @samp{(i, j, @dots{}) in @var{array}}
17690to test for existence of an element.
17691
17692The @code{in} construct may also be used in a @code{for} loop to iterate
17693over all the elements of an array.
17694@xref{Scanning an Array, ,Scanning All Elements of an Array}.
17695
17696You can remove an element from an array using the @code{delete} statement.
17697
17698You can clear an entire array using @samp{delete @var{array}}.
17699
17700@xref{Arrays, ,Arrays in @code{awk}}.
17701
17702@node Data Type Summary,  , Arrays Summary, Variables/Fields
17703@appendixsubsec Data Types
17704
17705The value of an @code{awk} expression is always either a number
17706or a string.
17707
17708Some contexts (such as arithmetic operators) require numeric
17709values.  They convert strings to numbers by interpreting the text
17710of the string as a number.  If the string does not look like a
17711number, it converts to zero.
17712
17713Other contexts (such as concatenation) require string values.
17714They convert numbers to strings by effectively printing them
17715with @code{sprintf}.
17716@xref{Conversion, ,Conversion of Strings and Numbers}, for the details.
17717
17718To force conversion of a string value to a number, simply add zero
17719to it.  If the value you start with is already a number, this
17720does not change it.
17721
17722To force conversion of a numeric value to a string, concatenate it with
17723the null string.
17724
17725Comparisons are done numerically if both operands are numeric, or if
17726one is numeric and the other is a numeric string.  Otherwise one or
17727both operands are converted to strings and a string comparison is
17728performed.  Fields, @code{getline} input, @code{FILENAME}, @code{ARGV}
17729elements, @code{ENVIRON} elements and the elements of an array created
17730by @code{split} are the only items that can be numeric strings. String
17731constants, such as @code{"3.1415927"} are not numeric strings, they are
17732string constants.  The full rules for comparisons are described in
17733@ref{Typing and Comparison, ,Variable Typing and Comparison Expressions}.
17734
17735Uninitialized variables have the string value @code{""} (the null, or
17736empty, string).  In contexts where a number is required, this is
17737equivalent to zero.
17738
17739@xref{Variables}, for more information on variable naming and initialization;
17740@pxref{Conversion, ,Conversion of Strings and Numbers}, for more information
17741on how variable values are interpreted.
17742
17743@node Rules Summary, Actions Summary, Variables/Fields, Gawk Summary
17744@appendixsec Patterns
17745
17746@menu
17747* Pattern Summary::             Quick overview of patterns.
17748* Regexp Summary::              Quick overview of regular expressions.
17749@end menu
17750
17751An @code{awk} program is mostly composed of rules, each consisting of a
17752pattern followed by an action.  The action is enclosed in @samp{@{} and
17753@samp{@}}.  Either the pattern may be missing, or the action may be
17754missing, but not both.  If the pattern is missing, the
17755action is executed for every input record.  A missing action is
17756equivalent to @samp{@w{@{ print @}}}, which prints the entire line.
17757
17758@c These paragraphs repeated for both patterns and actions. I don't
17759@c like this, but I also don't see any way around it. Update both copies
17760@c if they need fixing.
17761Comments begin with the @samp{#} character, and continue until the end of the
17762line.  Blank lines may be used to separate statements.  Statements normally
17763end with a newline; however, this is not the case for lines ending in a
17764@samp{,}, @samp{@{}, @samp{?}, @samp{:}, @samp{&&}, or @samp{||}.  Lines
17765ending in @code{do} or @code{else} also have their statements automatically
17766continued on the following line.  In other cases, a line can be continued by
17767ending it with a @samp{\}, in which case the newline is ignored.
17768
17769Multiple statements may be put on one line by separating each one with
17770a @samp{;}.
17771This applies to both the statements within the action part of a rule (the
17772usual case), and to the rule statements.
17773
17774@xref{Comments, ,Comments in @code{awk} Programs}, for information on
17775@code{awk}'s commenting convention;
17776@pxref{Statements/Lines, ,@code{awk} Statements Versus Lines}, for a
17777description of the line continuation mechanism in @code{awk}.
17778
17779@node Pattern Summary, Regexp Summary, Rules Summary, Rules Summary
17780@appendixsubsec Pattern Summary
17781
17782@code{awk} patterns may be one of the following:
17783
17784@example
17785/@var{regular expression}/
17786@var{relational expression}
17787@var{pattern} && @var{pattern}
17788@var{pattern} || @var{pattern}
17789@var{pattern} ? @var{pattern} : @var{pattern}
17790(@var{pattern})
17791! @var{pattern}
17792@var{pattern1}, @var{pattern2}
17793BEGIN
17794END
17795@end example
17796
17797@code{BEGIN} and @code{END} are two special kinds of patterns that are not
17798tested against the input.  The action parts of all @code{BEGIN} rules are
17799concatenated as if all the statements had been written in a single @code{BEGIN}
17800rule.  They are executed before any of the input is read.  Similarly, all the
17801@code{END} rules are concatenated, and executed when all the input is exhausted (or
17802when an @code{exit} statement is executed).  @code{BEGIN} and @code{END}
17803patterns cannot be combined with other patterns in pattern expressions.
17804@code{BEGIN} and @code{END} rules cannot have missing action parts.
17805
17806For @code{/@var{regular-expression}/} patterns, the associated statement is
17807executed for each input record that matches the regular expression.  Regular
17808expressions are summarized below.
17809
17810A @var{relational expression} may use any of the operators defined below in
17811the section on actions.  These generally test whether certain fields match
17812certain regular expressions.
17813
17814The @samp{&&}, @samp{||}, and @samp{!} operators are logical ``and,''
17815logical ``or,'' and logical ``not,'' respectively, as in C.  They do
17816short-circuit evaluation, also as in C, and are used for combining more
17817primitive pattern expressions.  As in most languages, parentheses may be
17818used to change the order of evaluation.
17819
17820The @samp{?:} operator is like the same operator in C.  If the first
17821pattern matches, then the second pattern is matched against the input
17822record; otherwise, the third is matched.  Only one of the second and
17823third patterns is matched.
17824
17825The @samp{@var{pattern1}, @var{pattern2}} form of a pattern is called a
17826range pattern.  It matches all input lines starting with a line that
17827matches @var{pattern1}, and continuing until a line that matches
17828@var{pattern2}, inclusive.  A range pattern cannot be used as an operand
17829of any of the pattern operators.
17830
17831@xref{Pattern Overview, ,Pattern Elements}.
17832
17833@node Regexp Summary, , Pattern Summary, Rules Summary
17834@appendixsubsec Regular Expressions
17835
17836Regular expressions are based on POSIX EREs (extended regular expressions).
17837The escape sequences allowed in string constants are also valid in
17838regular expressions (@pxref{Escape Sequences}).
17839Regexps are composed of characters as follows:
17840
17841@table @code
17842@item @var{c}
17843matches the character @var{c} (assuming @var{c} is none of the characters
17844listed below).
17845
17846@item \@var{c}
17847matches the literal character @var{c}.
17848
17849@item .
17850matches any character, @emph{including} newline.
17851In strict POSIX mode, @samp{.} does not match the @sc{nul}
17852character, which is a character with all bits equal to zero.
17853
17854@item ^
17855matches the beginning of a string.
17856
17857@item $
17858matches the end of a string.
17859
17860@item [@var{abc}@dots{}]
17861matches any of the characters @var{abc}@dots{} (character list).
17862
17863@item [[:@var{class}:]]
17864matches any character in the character class @var{class}. Allowable classes
17865are @code{alnum}, @code{alpha}, @code{blank}, @code{cntrl},
17866@code{digit}, @code{graph}, @code{lower}, @code{print}, @code{punct},
17867@code{space}, @code{upper}, and @code{xdigit}.
17868
17869@item [[.@var{symbol}.]]
17870matches the multi-character collating symbol @var{symbol}.
17871@code{gawk} does not currently support collating symbols.
17872
17873@item [[=@var{classname}=]]
17874matches any of the equivalent characters in the current locale named by the
17875equivalence class @var{classname}.
17876@code{gawk} does not currently support equivalence classes.
17877
17878@item [^@var{abc}@dots{}]
17879matches any character except @var{abc}@dots{} (negated
17880character list).
17881
17882@item @var{r1}|@var{r2}
17883matches either @var{r1} or @var{r2} (alternation).
17884
17885@item @var{r1r2}
17886matches @var{r1}, and then @var{r2} (concatenation).
17887
17888@item @var{r}+
17889matches one or more @var{r}'s.
17890
17891@item @var{r}*
17892matches zero or more @var{r}'s.
17893
17894@item @var{r}?
17895matches zero or one @var{r}'s.
17896
17897@item (@var{r})
17898matches @var{r} (grouping).
17899
17900@item @var{r}@{@var{n}@}
17901@itemx @var{r}@{@var{n},@}
17902@itemx @var{r}@{@var{n},@var{m}@}
17903matches at least @var{n}, @var{n} to any number, or @var{n} to @var{m}
17904occurrences of @var{r} (interval expressions).
17905
17906@item \y
17907matches the empty string at either the beginning or the
17908end of a word.
17909
17910@item \B
17911matches the empty string within a word.
17912
17913@item \<
17914matches the empty string at the beginning of a word.
17915
17916@item \>
17917matches the empty string at the end of a word.
17918
17919@item \w
17920matches any word-constituent character (alphanumeric characters and
17921the underscore).
17922
17923@item \W
17924matches any character that is not word-constituent.
17925
17926@item \`
17927matches the empty string at the beginning of a buffer (same as a string
17928in @code{gawk}).
17929
17930@item \'
17931matches the empty string at the end of a buffer.
17932@end table
17933
17934The various command line options
17935control how @code{gawk} interprets characters in regexps.
17936
17937@c NOTE!!! Keep this in sync with the same table in the regexp chapter!
17938@table @asis
17939@item No options
17940In the default case, @code{gawk} provide all the facilities of
17941POSIX regexps and the GNU regexp operators described above.
17942However, interval expressions are not supported.
17943
17944@item @code{--posix}
17945Only POSIX regexps are supported, the GNU operators are not special
17946(e.g., @samp{\w} matches a literal @samp{w}).  Interval expressions
17947are allowed.
17948
17949@item @code{--traditional}
17950Traditional Unix @code{awk} regexps are matched. The GNU operators
17951are not special, interval expressions are not available, and neither
17952are the POSIX character classes (@code{[[:alnum:]]} and so on).
17953Characters described by octal and hexadecimal escape sequences are
17954treated literally, even if they represent regexp metacharacters.
17955
17956@item @code{--re-interval}
17957Allow interval expressions in regexps, even if @samp{--traditional}
17958has been provided.
17959@end table
17960
17961@xref{Regexp, ,Regular Expressions}.
17962
17963@node Actions Summary, Functions Summary, Rules Summary, Gawk Summary
17964@appendixsec Actions
17965
17966Action statements are enclosed in braces, @samp{@{} and @samp{@}}.
17967A missing action statement is equivalent to @samp{@w{@{ print @}}}.
17968
17969Action statements consist of the usual assignment, conditional, and looping
17970statements found in most languages.  The operators, control statements,
17971and Input/Output statements available are similar to those in C.
17972
17973@c These paragraphs repeated for both patterns and actions. I don't
17974@c like this, but I also don't see any way around it. Update both copies
17975@c if they need fixing.
17976Comments begin with the @samp{#} character, and continue until the end of the
17977line.  Blank lines may be used to separate statements.  Statements normally
17978end with a newline; however, this is not the case for lines ending in a
17979@samp{,}, @samp{@{}, @samp{?}, @samp{:}, @samp{&&}, or @samp{||}.  Lines
17980ending in @code{do} or @code{else} also have their statements automatically
17981continued on the following line.  In other cases, a line can be continued by
17982ending it with a @samp{\}, in which case the newline is ignored.
17983
17984Multiple statements may be put on one line by separating each one with
17985a @samp{;}.
17986This applies to both the statements within the action part of a rule (the
17987usual case), and to the rule statements.
17988
17989@xref{Comments, ,Comments in @code{awk} Programs}, for information on
17990@code{awk}'s commenting convention;
17991@pxref{Statements/Lines, ,@code{awk} Statements Versus Lines}, for a
17992description of the line continuation mechanism in @code{awk}.
17993
17994@menu
17995* Operator Summary::            @code{awk} operators.
17996* Control Flow Summary::        The control statements.
17997* I/O Summary::                 The I/O statements.
17998* Printf Summary::              A summary of @code{printf}.
17999* Special File Summary::        Special file names interpreted internally.
18000* Built-in Functions Summary::  Built-in numeric and string functions.
18001* Time Functions Summary::      Built-in time functions.
18002* String Constants Summary::    Escape sequences in strings.
18003@end menu
18004
18005@node Operator Summary, Control Flow Summary, Actions Summary, Actions Summary
18006@appendixsubsec Operators
18007
18008The operators in @code{awk}, in order of decreasing precedence, are:
18009
18010@table @code
18011@item (@dots{})
18012Grouping.
18013
18014@item $
18015Field reference.
18016
18017@item ++ --
18018Increment and decrement, both prefix and postfix.
18019
18020@item ^
18021Exponentiation (@samp{**} may also be used, and @samp{**=} for the assignment
18022operator, but they are not specified in the POSIX standard).
18023
18024@item + - !
18025Unary plus, unary minus, and logical negation.
18026
18027@item * / %
18028Multiplication, division, and modulus.
18029
18030@item + -
18031Addition and subtraction.
18032
18033@item @var{space}
18034String concatenation.
18035
18036@item < <= > >= != ==
18037The usual relational operators.
18038
18039@item ~ !~
18040Regular expression match, negated match.
18041
18042@item in
18043Array membership.
18044
18045@item &&
18046Logical ``and''.
18047
18048@item ||
18049Logical ``or''.
18050
18051@item ?:
18052A conditional expression.  This has the form @samp{@var{expr1} ?
18053@var{expr2} : @var{expr3}}.  If @var{expr1} is true, the value of the
18054expression is @var{expr2}; otherwise it is @var{expr3}.  Only one of
18055@var{expr2} and @var{expr3} is evaluated.
18056
18057@item = += -= *= /= %= ^=
18058Assignment.  Both absolute assignment (@code{@var{var}=@var{value}})
18059and operator assignment (the other forms) are supported.
18060@end table
18061
18062@xref{Expressions}.
18063
18064@node Control Flow Summary, I/O Summary, Operator Summary, Actions Summary
18065@appendixsubsec Control Statements
18066
18067The control statements are as follows:
18068
18069@example
18070if (@var{condition}) @var{statement} @r{[} else @var{statement} @r{]}
18071while (@var{condition}) @var{statement}
18072do @var{statement} while (@var{condition})
18073for (@var{expr1}; @var{expr2}; @var{expr3}) @var{statement}
18074for (@var{var} in @var{array}) @var{statement}
18075break
18076continue
18077delete @var{array}[@var{index}]
18078delete @var{array}
18079exit @r{[} @var{expression} @r{]}
18080@{ @var{statements} @}
18081@end example
18082
18083@xref{Statements, ,Control Statements in Actions}.
18084
18085@node I/O Summary, Printf Summary, Control Flow Summary, Actions Summary
18086@appendixsubsec I/O Statements
18087
18088The Input/Output statements are as follows:
18089
18090@table @code
18091@item getline
18092Set @code{$0} from next input record; set @code{NF}, @code{NR}, @code{FNR}.
18093@xref{Getline, ,Explicit Input with @code{getline}}.
18094
18095@item getline <@var{file}
18096Set @code{$0} from next record of @var{file}; set @code{NF}.
18097
18098@item getline @var{var}
18099Set @var{var} from next input record; set @code{NR}, @code{FNR}.
18100
18101@item getline @var{var} <@var{file}
18102Set @var{var} from next record of @var{file}.
18103
18104@item @var{command} | getline
18105Run @var{command}, piping its output into @code{getline}; sets @code{$0},
18106@code{NF}, @code{NR}.
18107
18108@item @var{command} | getline @code{var}
18109Run @var{command}, piping its output into @code{getline}; sets @var{var}.
18110
18111@item next
18112Stop processing the current input record.  The next input record is read and
18113processing starts over with the first pattern in the @code{awk} program.
18114If the end of the input data is reached, the @code{END} rule(s), if any,
18115are executed.
18116@xref{Next Statement, ,The @code{next} Statement}.
18117
18118@item nextfile
18119Stop processing the current input file.  The next input record read comes
18120from the next input file.  @code{FILENAME} is updated, @code{FNR} is set to one,
18121@code{ARGIND} is incremented,
18122and processing starts over with the first pattern in the @code{awk} program.
18123If the end of the input data is reached, the @code{END} rule(s), if any,
18124are executed.
18125Earlier versions of @code{gawk} used @samp{next file}; this usage is still
18126supported, but is considered to be deprecated.
18127@xref{Nextfile Statement, ,The @code{nextfile} Statement}.
18128
18129@item print
18130Prints the current record.
18131@xref{Printing, ,Printing Output}.
18132
18133@item print @var{expr-list}
18134Prints expressions.
18135
18136@item print @var{expr-list} > @var{file}
18137Prints expressions to @var{file}. If @var{file} does not exist, it is
18138created. If it does exist, its contents are deleted the first time the
18139@code{print} is executed.
18140
18141@item print @var{expr-list} >> @var{file}
18142Prints expressions to @var{file}.  The previous contents of @var{file}
18143are retained, and the output of @code{print} is appended to the file.
18144
18145@item print @var{expr-list} | @var{command}
18146Prints expressions, sending the output down a pipe to @var{command}.
18147The pipeline to the command stays open until the @code{close} function
18148is called.
18149
18150@item printf @var{fmt}, @var{expr-list}
18151Format and print.
18152
18153@item printf @var{fmt}, @var{expr-list} > @var{file}
18154Format and print to @var{file}. If @var{file} does not exist, it is
18155created. If it does exist, its contents are deleted the first time the
18156@code{printf} is executed.
18157
18158@item printf @var{fmt}, @var{expr-list} >> @var{file}
18159Format and print to @var{file}.  The previous contents of @var{file}
18160are retained, and the output of @code{printf} is appended to the file.
18161
18162@item printf @var{fmt}, @var{expr-list} | @var{command}
18163Format and print, sending the output down a pipe to @var{command}.
18164The pipeline to the command stays open until the @code{close} function
18165is called.
18166@end table
18167
18168@code{getline} returns zero on end of file, and @minus{}1 on an error.
18169In the event of an error, @code{getline} will set @code{ERRNO} to
18170the value of a system-dependent string that describes the error.
18171
18172@node Printf Summary, Special File Summary, I/O Summary, Actions Summary
18173@appendixsubsec @code{printf} Summary
18174
18175Conversion specification have the form
18176@code{%}[@var{flag}][@var{width}][@code{.}@var{prec}]@var{format}.
18177@c whew!
18178Items in brackets are optional.
18179
18180The @code{awk} @code{printf} statement and @code{sprintf} function
18181accept the following conversion specification formats:
18182
18183@table @code
18184@item %c
18185An ASCII character.  If the argument used for @samp{%c} is numeric, it is
18186treated as a character and printed.  Otherwise, the argument is assumed to
18187be a string, and the only first character of that string is printed.
18188
18189@item %d
18190@itemx %i
18191A decimal number (the integer part).
18192
18193@item %e
18194@itemx %E
18195A floating point number of the form
18196@samp{@r{[}-@r{]}d.dddddde@r{[}+-@r{]}dd}.
18197The @samp{%E} format uses @samp{E} instead of @samp{e}.
18198
18199@item %f
18200A floating point number of the form
18201@r{[}@code{-}@r{]}@code{ddd.dddddd}.
18202
18203@item %g
18204@itemx %G
18205Use either the @samp{%e} or @samp{%f} formats, whichever produces a shorter
18206string, with non-significant zeros suppressed.
18207@samp{%G} will use @samp{%E} instead of @samp{%e}.
18208
18209@item %o
18210An unsigned octal number (also an integer).
18211
18212@item %u
18213An unsigned decimal number (again, an integer).
18214
18215@item %s
18216A character string.
18217
18218@item %x
18219@itemx %X
18220An unsigned hexadecimal number (an integer).
18221The @samp{%X} format uses @samp{A} through @samp{F} instead of
18222@samp{a} through @samp{f} for decimal 10 through 15.
18223
18224@item %%
18225A single @samp{%} character; no argument is converted.
18226@end table
18227
18228There are optional, additional parameters that may lie between the @samp{%}
18229and the control letter:
18230
18231@table @code
18232@item -
18233The expression should be left-justified within its field.
18234
18235@item @var{space}
18236For numeric conversions, prefix positive values with a space, and
18237negative values with a minus sign.
18238
18239@item +
18240The plus sign, used before the width modifier (see below),
18241says to always supply a sign for numeric conversions, even if the data
18242to be formatted is positive. The @samp{+} overrides the space modifier.
18243
18244@item #
18245Use an ``alternate form'' for certain control letters.
18246For @samp{o}, supply a leading zero.
18247For @samp{x}, and @samp{X}, supply a leading @samp{0x} or @samp{0X} for
18248a non-zero result.
18249For @samp{e}, @samp{E}, and @samp{f}, the result will always contain a
18250decimal point.
18251For @samp{g}, and @samp{G}, trailing zeros are not removed from the result.
18252
18253@item 0
18254A leading @samp{0} (zero) acts as a flag, that indicates output should be
18255padded with zeros instead of spaces.
18256This applies even to non-numeric output formats.
18257This flag only has an effect when the field width is wider than the
18258value to be printed.
18259
18260@item @var{width}
18261The field should be padded to this width. The field is normally padded
18262with spaces.  If the @samp{0} flag has been used, it is padded with zeros.
18263
18264@item .@var{prec}
18265A number that specifies the precision to use when printing.
18266For the @samp{e}, @samp{E}, and @samp{f} formats, this specifies the
18267number of digits you want printed to the right of the decimal point.
18268For the @samp{g}, and @samp{G} formats, it specifies the maximum number
18269of significant digits.  For the @samp{d}, @samp{o}, @samp{i}, @samp{u},
18270@samp{x}, and @samp{X} formats, it specifies the minimum number of
18271digits to print.  For the @samp{s} format, it specifies the maximum number of
18272characters from the string that should be printed.
18273@end table
18274
18275Either or both of the @var{width} and @var{prec} values may be specified
18276as @samp{*}.  In that case, the particular value is taken from the argument
18277list.
18278
18279@xref{Printf, ,Using @code{printf} Statements for Fancier Printing}.
18280
18281@node Special File Summary, Built-in Functions Summary, Printf Summary, Actions Summary
18282@appendixsubsec Special File Names
18283
18284When doing I/O redirection from either @code{print} or @code{printf} into a
18285file, or via @code{getline} from a file, @code{gawk} recognizes certain special
18286file names internally.  These file names allow access to open file descriptors
18287inherited from @code{gawk}'s parent process (usually the shell).  The
18288file names are:
18289
18290@table @file
18291@item /dev/stdin
18292The standard input.
18293
18294@item /dev/stdout
18295The standard output.
18296
18297@item /dev/stderr
18298The standard error output.
18299
18300@item /dev/fd/@var{n}
18301The file denoted by the open file descriptor @var{n}.
18302@end table
18303
18304In addition, reading the following files provides process related information
18305about the running @code{gawk} program.  All returned records are terminated
18306with a newline.
18307
18308@table @file
18309@item /dev/pid
18310Returns the process ID of the current process.
18311
18312@item  /dev/ppid
18313Returns the parent process ID of the current process.
18314
18315@item  /dev/pgrpid
18316Returns the process group ID of the current process.
18317
18318@item /dev/user
18319At least four space-separated fields, containing the return values of
18320the @code{getuid}, @code{geteuid}, @code{getgid}, and @code{getegid}
18321system calls.
18322If there are any additional fields, they are the group IDs returned by
18323@code{getgroups} system call.
18324(Multiple groups may not be supported on all systems.)
18325@end table
18326
18327@noindent
18328These file names may also be used on the command line to name data files.
18329These file names are only recognized internally if you do not
18330actually have files with these names on your system.
18331
18332@xref{Special Files, ,Special File Names in @code{gawk}}, for a longer description that
18333provides the motivation for this feature.
18334
18335@node Built-in Functions Summary, Time Functions Summary, Special File Summary, Actions Summary
18336@appendixsubsec Built-in Functions
18337
18338@code{awk} provides a number of built-in functions for performing
18339numeric operations, string related operations, and I/O related operations.
18340
18341@c NEEDED
18342@page
18343The built-in arithmetic functions are:
18344
18345@table @code
18346@item atan2(@var{y}, @var{x})
18347the arctangent of @var{y/x} in radians.
18348
18349@item cos(@var{expr})
18350the cosine of @var{expr}, which is in radians.
18351
18352@item exp(@var{expr})
18353the exponential function (@code{e ^ @var{expr}}).
18354
18355@item int(@var{expr})
18356truncates to integer.
18357
18358@item log(@var{expr})
18359the natural logarithm of @code{expr}.
18360
18361@item rand()
18362a random number between zero and one.
18363
18364@item sin(@var{expr})
18365the sine of @var{expr}, which is in radians.
18366
18367@item sqrt(@var{expr})
18368the square root function.
18369
18370@item srand(@r{[}@var{expr}@r{]})
18371use @var{expr} as a new seed for the random number generator.  If no @var{expr}
18372is provided, the time of day is used.  The return value is the previous
18373seed for the random number generator.
18374@end table
18375
18376@code{awk} has the following built-in string functions:
18377
18378@table @code
18379@item gensub(@var{regex}, @var{subst}, @var{how} @r{[}, @var{target}@r{]})
18380If @var{how} is a string beginning with @samp{g} or @samp{G}, then
18381replace each match of @var{regex} in @var{target} with @var{subst}.
18382Otherwise, replace the @var{how}'th occurrence. If @var{target} is not
18383supplied, use @code{$0}.  The return value is the changed string; the
18384original @var{target} is not modified. Within @var{subst},
18385@samp{\@var{n}}, where @var{n} is a digit from one to nine, can be used to
18386indicate the text that matched the @var{n}'th parenthesized
18387subexpression.
18388This function is @code{gawk}-specific.
18389
18390@item gsub(@var{regex}, @var{subst} @r{[}, @var{target}@r{]})
18391for each substring matching the regular expression @var{regex} in the string
18392@var{target}, substitute the string @var{subst}, and return the number of
18393substitutions. If @var{target} is not supplied, use @code{$0}.
18394
18395@item index(@var{str}, @var{search})
18396returns the index of the string @var{search} in the string @var{str}, or
18397zero if
18398@var{search} is not present.
18399
18400@item length(@r{[}@var{str}@r{]})
18401returns the length of the string @var{str}.  The length of @code{$0}
18402is returned if no argument is supplied.
18403
18404@item match(@var{str}, @var{regex})
18405returns the position in @var{str} where the regular expression @var{regex}
18406occurs, or zero if @var{regex} is not present, and sets the values of
18407@code{RSTART} and @code{RLENGTH}.
18408
18409@item split(@var{str}, @var{arr} @r{[}, @var{regex}@r{]})
18410splits the string @var{str} into the array @var{arr} on the regular expression
18411@var{regex}, and returns the number of elements.  If @var{regex} is omitted,
18412@code{FS} is used instead. @var{regex} can be the null string, causing
18413each character to be placed into its own array element.
18414The array @var{arr} is cleared first.
18415
18416@item sprintf(@var{fmt}, @var{expr-list})
18417prints @var{expr-list} according to @var{fmt}, and returns the resulting string.
18418
18419@item sub(@var{regex}, @var{subst} @r{[}, @var{target}@r{]})
18420just like @code{gsub}, but only the first matching substring is replaced.
18421
18422@item substr(@var{str}, @var{index} @r{[}, @var{len}@r{]})
18423returns the @var{len}-character substring of @var{str} starting at @var{index}.
18424If @var{len} is omitted, the rest of @var{str} is used.
18425
18426@item tolower(@var{str})
18427returns a copy of the string @var{str}, with all the upper-case characters in
18428@var{str} translated to their corresponding lower-case counterparts.
18429Non-alphabetic characters are left unchanged.
18430
18431@item toupper(@var{str})
18432returns a copy of the string @var{str}, with all the lower-case characters in
18433@var{str} translated to their corresponding upper-case counterparts.
18434Non-alphabetic characters are left unchanged.
18435@end table
18436
18437The I/O related functions are:
18438
18439@table @code
18440@item close(@var{expr})
18441Close the open file or pipe denoted by @var{expr}.
18442
18443@item fflush(@r{[}@var{expr}@r{]})
18444Flush any buffered output for the output file or pipe denoted by @var{expr}.
18445If @var{expr} is omitted, standard output is flushed.
18446If @var{expr} is the null string (@code{""}), all output buffers are flushed.
18447
18448@item system(@var{cmd-line})
18449Execute the command @var{cmd-line}, and return the exit status.
18450If your operating system does not support @code{system}, calling it will
18451generate a fatal error.
18452
18453@samp{system("")} can be used to force @code{awk} to flush any pending
18454output.  This is more portable, but less obvious, than calling @code{fflush}.
18455@end table
18456
18457@node Time Functions Summary, String Constants Summary, Built-in Functions Summary, Actions Summary
18458@appendixsubsec Time Functions
18459
18460The following two functions are available for getting the current
18461time of day, and for formatting time stamps.
18462They are specific to @code{gawk}.
18463
18464@table @code
18465@item systime()
18466returns the current time of day as the number of seconds since a particular
18467epoch (Midnight, January 1, 1970 UTC, on POSIX systems).
18468
18469@item strftime(@r{[}@var{format}@r{[}, @var{timestamp}@r{]]})
18470formats @var{timestamp} according to the specification in @var{format}.
18471The current time of day is used if no @var{timestamp} is supplied.
18472A default format equivalent to the output of the @code{date} utility is used if
18473no @var{format} is supplied.
18474@xref{Time Functions, ,Functions for Dealing with Time Stamps}, for the
18475details on the conversion specifiers that @code{strftime} accepts.
18476@end table
18477
18478@iftex
18479@xref{Built-in, ,Built-in Functions}, for a description of all of
18480@code{awk}'s built-in functions.
18481@end iftex
18482
18483@node String Constants Summary,  , Time Functions Summary, Actions Summary
18484@appendixsubsec String Constants
18485
18486String constants in @code{awk} are sequences of characters enclosed
18487in double quotes (@code{"}).  Within strings, certain @dfn{escape sequences}
18488are recognized, as in C.  These are:
18489
18490@table @code
18491@item \\
18492A literal backslash.
18493
18494@item \a
18495The ``alert'' character; usually the ASCII BEL character.
18496
18497@item \b
18498Backspace.
18499
18500@item \f
18501Formfeed.
18502
18503@item \n
18504Newline.
18505
18506@item \r
18507Carriage return.
18508
18509@item \t
18510Horizontal tab.
18511
18512@item \v
18513Vertical tab.
18514
18515@item \x@var{hex digits}
18516The character represented by the string of hexadecimal digits following
18517the @samp{\x}.  As in ANSI C, all following hexadecimal digits are
18518considered part of the escape sequence.  E.g., @code{"\x1B"} is a
18519string containing the ASCII ESC (escape) character.  (The @samp{\x}
18520escape sequence is not in POSIX @code{awk}.)
18521
18522@item \@var{ddd}
18523The character represented by the one, two, or three digit sequence of octal
18524digits.  Thus, @code{"\033"} is also a string containing the ASCII ESC
18525(escape) character.
18526
18527@item \@var{c}
18528The literal character @var{c}, if @var{c} is not one of the above.
18529@end table
18530
18531The escape sequences may also be used inside constant regular expressions
18532(e.g., the regexp @code{@w{/[@ \t\f\n\r\v]/}} matches whitespace
18533characters).
18534
18535@xref{Escape Sequences}.
18536
18537@node Functions Summary, Historical Features, Actions Summary, Gawk Summary
18538@appendixsec User-defined Functions
18539
18540Functions in @code{awk} are defined as follows:
18541
18542@example
18543function @var{name}(@var{parameter list}) @{ @var{statements} @}
18544@end example
18545
18546Actual parameters supplied in the function call are used to instantiate
18547the formal parameters declared in the function.  Arrays are passed by
18548reference, other variables are passed by value.
18549
18550If there are fewer arguments passed than there are names in @var{parameter-list},
18551the extra names are given the null string as their value.  Extra names have the
18552effect of local variables.
18553
18554The open-parenthesis in a function call of a user-defined function must
18555immediately follow the function name, without any intervening white space.
18556This is to avoid a syntactic ambiguity with the concatenation operator.
18557
18558The word @code{func} may be used in place of @code{function} (but not in
18559POSIX @code{awk}).
18560
18561Use the @code{return} statement to return a value from a function.
18562
18563@xref{User-defined, ,User-defined Functions}.
18564
18565@node Historical Features,  , Functions Summary, Gawk Summary
18566@appendixsec Historical Features
18567
18568@cindex historical features
18569There are two features of historical @code{awk} implementations that
18570@code{gawk} supports.
18571
18572First, it is possible to call the @code{length} built-in function not only
18573with no arguments, but even without parentheses!
18574
18575@example
18576a = length
18577@end example
18578
18579@noindent
18580is the same as either of
18581
18582@example
18583a = length()
18584a = length($0)
18585@end example
18586
18587@noindent
18588For example:
18589
18590@example
18591$ echo abcdef | awk '@{ print length @}'
18592@print{} 6
18593@end example
18594
18595@noindent
18596This feature is marked as ``deprecated'' in the POSIX standard, and
18597@code{gawk} will issue a warning about its use if @samp{--lint} is
18598specified on the command line.
18599(The ability to use @code{length} this way was actually an accident of the
18600original Unix @code{awk} implementation.  If any built-in function used
18601@code{$0} as its default argument, it was possible to call that function
18602without the parentheses.  In particular, it was common practice to use
18603the @code{length} function in this fashion, and this usage was documented
18604in the @code{awk} manual page.)
18605
18606The other historical feature is the use of either the @code{break} statement,
18607or the @code{continue} statement
18608outside the body of a @code{while}, @code{for}, or @code{do} loop.  Traditional
18609@code{awk} implementations have treated such usage as equivalent to the
18610@code{next} statement.  More recent versions of Unix @code{awk} do not allow
18611it. @code{gawk} supports this usage if @samp{--traditional} has been
18612specified.
18613
18614@xref{Options, ,Command Line Options}, for more information about the
18615@samp{--posix} and @samp{--lint} options.
18616
18617@node Installation, Notes, Gawk Summary, Top
18618@appendix Installing @code{gawk}
18619
18620This appendix provides instructions for installing @code{gawk} on the
18621various platforms that are supported by the developers.  The primary
18622developers support Unix (and one day, GNU), while the other ports were
18623contributed.  The file @file{ACKNOWLEDGMENT} in the @code{gawk}
18624distribution lists the electronic mail addresses of the people who did
18625the respective ports, and they are also provided in
18626@ref{Bugs, , Reporting Problems and Bugs}.
18627
18628@menu
18629* Gawk Distribution::           What is in the @code{gawk} distribution.
18630* Unix Installation::           Installing @code{gawk} under various versions
18631                                of Unix.
18632* VMS Installation::            Installing @code{gawk} on VMS.
18633* PC Installation::             Installing and Compiling @code{gawk} on MS-DOS
18634                                and OS/2
18635* Atari Installation::          Installing @code{gawk} on the Atari ST.
18636* Amiga Installation::          Installing @code{gawk} on an Amiga.
18637* Bugs::                        Reporting Problems and Bugs.
18638* Other Versions::              Other freely available @code{awk}
18639                                implementations.
18640@end menu
18641
18642@node Gawk Distribution, Unix Installation, Installation, Installation
18643@appendixsec The @code{gawk} Distribution
18644
18645This section first describes how to get the @code{gawk}
18646distribution, how to extract it, and then what is in the various files and
18647subdirectories.
18648
18649@menu
18650* Getting::                     How to get the distribution.
18651* Extracting::                  How to extract the distribution.
18652* Distribution contents::       What is in the distribution.
18653@end menu
18654
18655@node Getting, Extracting, Gawk Distribution, Gawk Distribution
18656@appendixsubsec Getting the @code{gawk} Distribution
18657@cindex getting @code{gawk}
18658@cindex anonymous @code{ftp}
18659@cindex @code{ftp}, anonymous
18660@cindex Free Software Foundation
18661There are three ways you can get GNU software.
18662
18663@enumerate
18664@item
18665You can copy it from someone else who already has it.
18666
18667@cindex Free Software Foundation
18668@item
18669You can order @code{gawk} directly from the Free Software Foundation.
18670Software distributions are available for Unix, MS-DOS, and VMS, on
18671tape and CD-ROM.  The address is:
18672
18673@quotation
18674Free Software Foundation @*
1867559 Temple Place---Suite 330 @*
18676Boston, MA  02111-1307 USA @*
18677Phone: +1-617-542-5942 @*
18678Fax (including Japan): +1-617-542-2652 @*
18679Email: @code{gnu@@gnu.org} @*
18680URL: @code{http://www.gnu.org/} @*
18681@end quotation
18682
18683@noindent
18684Ordering from the FSF directly contributes to the support of the foundation
18685and to the production of more free software.
18686
18687@item
18688You can get @code{gawk} by using anonymous @code{ftp} to the Internet host
18689@code{gnudist.gnu.org}, in the directory @file{/gnu/gawk}.
18690
18691Here is a list of alternate @code{ftp} sites from which you can obtain GNU
18692software.  When a site is listed as ``@var{site}@code{:}@var{directory}'' the
18693@var{directory} indicates the directory where GNU software is kept.
18694You should use a site that is geographically close to you.
18695
18696@table @asis
18697@item Asia:
18698@table @code
18699@item cair-archive.kaist.ac.kr:/pub/gnu
18700@itemx ftp.cs.titech.ac.jp
18701@itemx ftp.nectec.or.th:/pub/mirrors/gnu
18702@itemx utsun.s.u-tokyo.ac.jp:/ftpsync/prep
18703@end table
18704
18705@c NEEDED
18706@page
18707@item Australia:
18708@table @code
18709@item archie.au:/gnu
18710(@code{archie.oz} or @code{archie.oz.au} for ACSnet)
18711@end table
18712
18713@item Africa:
18714@table @code
18715@item ftp.sun.ac.za:/pub/gnu
18716@end table
18717
18718@item Middle East:
18719@table @code
18720@item ftp.technion.ac.il:/pub/unsupported/gnu
18721@end table
18722
18723@item Europe:
18724@table @code
18725@item archive.eu.net
18726@itemx ftp.denet.dk
18727@itemx ftp.eunet.ch
18728@itemx ftp.funet.fi:/pub/gnu
18729@itemx ftp.ieunet.ie:pub/gnu
18730@itemx ftp.informatik.rwth-aachen.de:/pub/gnu
18731@itemx ftp.informatik.tu-muenchen.de
18732@itemx ftp.luth.se:/pub/unix/gnu
18733@itemx ftp.mcc.ac.uk
18734@itemx ftp.stacken.kth.se
18735@itemx ftp.sunet.se:/pub/gnu
18736@itemx ftp.univ-lyon1.fr:pub/gnu
18737@itemx ftp.win.tue.nl:/pub/gnu
18738@itemx irisa.irisa.fr:/pub/gnu
18739@itemx isy.liu.se
18740@itemx nic.switch.ch:/mirror/gnu
18741@itemx src.doc.ic.ac.uk:/gnu
18742@itemx unix.hensa.ac.uk:/pub/uunet/systems/gnu
18743@end table
18744
18745@item South America:
18746@table @code
18747@item ftp.inf.utfsm.cl:/pub/gnu
18748@itemx ftp.unicamp.br:/pub/gnu
18749@end table
18750
18751@item Western Canada:
18752@table @code
18753@item ftp.cs.ubc.ca:/mirror2/gnu
18754@end table
18755
18756@item USA:
18757@table @code
18758@item col.hp.com:/mirrors/gnu
18759@itemx f.ms.uky.edu:/pub3/gnu
18760@itemx ftp.cc.gatech.edu:/pub/gnu
18761@itemx ftp.cs.columbia.edu:/archives/gnu/prep
18762@itemx ftp.digex.net:/pub/gnu
18763@itemx ftp.hawaii.edu:/mirrors/gnu
18764@itemx ftp.kpc.com:/pub/mirror/gnu
18765@end table
18766
18767@c NEEDED
18768@page
18769@item USA (continued):
18770@table @code
18771@itemx ftp.uu.net:/systems/gnu
18772@itemx gatekeeper.dec.com:/pub/GNU
18773@itemx jaguar.utah.edu:/gnustuff
18774@itemx labrea.stanford.edu
18775@itemx mrcnext.cso.uiuc.edu:/pub/gnu
18776@itemx vixen.cso.uiuc.edu:/gnu
18777@itemx wuarchive.wustl.edu:/systems/gnu
18778@end table
18779@end table
18780@end enumerate
18781
18782@node Extracting, Distribution contents, Getting, Gawk Distribution
18783@appendixsubsec Extracting the Distribution
18784@code{gawk} is distributed as a @code{tar} file compressed with the
18785GNU Zip program, @code{gzip}.
18786
18787Once you have the distribution (for example,
18788@file{gawk-@value{VERSION}.@value{PATCHLEVEL}.tar.gz}), first use @code{gzip} to expand the
18789file, and then use @code{tar} to extract it.  You can use the following
18790pipeline to produce the @code{gawk} distribution:
18791
18792@example
18793# Under System V, add 'o' to the tar flags
18794gzip -d -c gawk-@value{VERSION}.@value{PATCHLEVEL}.tar.gz | tar -xvpf -
18795@end example
18796
18797@noindent
18798This will create a directory named @file{gawk-@value{VERSION}.@value{PATCHLEVEL}} in the current
18799directory.
18800
18801The distribution file name is of the form
18802@file{gawk-@var{V}.@var{R}.@var{n}.tar.gz}.
18803The @var{V} represents the major version of @code{gawk},
18804the @var{R} represents the current release of version @var{V}, and
18805the @var{n} represents a @dfn{patch level}, meaning that minor bugs have
18806been fixed in the release.  The current patch level is @value{PATCHLEVEL},
18807but when
18808retrieving distributions, you should get the version with the highest
18809version, release, and patch level.  (Note that release levels greater than
18810or equal to 90 denote ``beta,'' or non-production software; you may not wish
18811to retrieve such a version unless you don't mind experimenting.)
18812
18813If you are not on a Unix system, you will need to make other arrangements
18814for getting and extracting the @code{gawk} distribution.  You should consult
18815a local expert.
18816
18817@node Distribution contents,  , Extracting, Gawk Distribution
18818@appendixsubsec Contents of the @code{gawk} Distribution
18819
18820The @code{gawk} distribution has a number of C source files,
18821documentation files,
18822subdirectories and files related to the configuration process
18823(@pxref{Unix Installation, ,Compiling and Installing @code{gawk} on Unix}),
18824and several subdirectories related to different, non-Unix,
18825operating systems.
18826
18827@table @asis
18828@item various @samp{.c}, @samp{.y}, and @samp{.h} files
18829These files are the actual @code{gawk} source code.
18830@end table
18831
18832@table @file
18833@item README
18834@itemx README_d/README.*
18835Descriptive files: @file{README} for @code{gawk} under Unix, and the
18836rest for the various hardware and software combinations.
18837
18838@item INSTALL
18839A file providing an overview of the configuration and installation process.
18840
18841@item PORTS
18842A list of systems to which @code{gawk} has been ported, and which
18843have successfully run the test suite.
18844
18845@item ACKNOWLEDGMENT
18846A list of the people who contributed major parts of the code or documentation.
18847
18848@item ChangeLog
18849A detailed list of source code changes as bugs are fixed or improvements made.
18850
18851@item NEWS
18852A list of changes to @code{gawk} since the last release or patch.
18853
18854@item COPYING
18855The GNU General Public License.
18856
18857@item FUTURES
18858A brief list of features and/or changes being contemplated for future
18859releases, with some indication of the time frame for the feature, based
18860on its difficulty.
18861
18862@item LIMITATIONS
18863A list of those factors that limit @code{gawk}'s performance.
18864Most of these depend on the hardware or operating system software, and
18865are not limits in @code{gawk} itself.
18866
18867@item POSIX.STD
18868A description of one area where the POSIX standard for @code{awk} is
18869incorrect, and how @code{gawk} handles the problem.
18870
18871@item PROBLEMS
18872A file describing known problems with the current release.
18873
18874@cindex artificial intelligence, using @code{gawk}
18875@cindex AI programming, using @code{gawk}
18876@item doc/awkforai.txt
18877A short article describing why @code{gawk} is a good language for
18878AI (Artificial Intelligence) programming.
18879
18880@item doc/README.card
18881@itemx doc/ad.block
18882@itemx doc/awkcard.in
18883@itemx doc/cardfonts
18884@itemx doc/colors
18885@itemx doc/macros
18886@itemx doc/no.colors
18887@itemx doc/setter.outline
18888The @code{troff} source for a five-color @code{awk} reference card.
18889A modern version of @code{troff}, such as GNU Troff (@code{groff}) is
18890needed to produce the color version. See the file @file{README.card}
18891for instructions if you have an older @code{troff}.
18892
18893@item doc/gawk.1
18894The @code{troff} source for a manual page describing @code{gawk}.
18895This is distributed for the convenience of Unix users.
18896
18897@item doc/gawk.texi
18898The Texinfo source file for this @value{DOCUMENT}.
18899It should be processed with @TeX{} to produce a printed document, and
18900with @code{makeinfo} to produce an Info file.
18901
18902@item doc/gawk.info
18903The generated Info file for this @value{DOCUMENT}.
18904
18905@item doc/igawk.1
18906The @code{troff} source for a manual page describing the @code{igawk}
18907program presented in
18908@ref{Igawk Program, ,An Easy Way to Use Library Functions}.
18909
18910@item doc/Makefile.in
18911The input file used during the configuration process to generate the
18912actual @file{Makefile} for creating the documentation.
18913
18914@item Makefile.in
18915@itemx acconfig.h
18916@itemx aclocal.m4
18917@itemx configh.in
18918@itemx configure.in
18919@itemx configure
18920@itemx custom.h
18921@itemx missing/*
18922These files and subdirectory are used when configuring @code{gawk}
18923for various Unix systems.  They are explained in detail in
18924@ref{Unix Installation, ,Compiling and Installing @code{gawk} on Unix}.
18925
18926@item awklib/extract.awk
18927@itemx awklib/Makefile.in
18928The @file{awklib} directory contains a copy of @file{extract.awk}
18929(@pxref{Extract Program, ,Extracting Programs from Texinfo Source Files}),
18930which can be used to extract the sample programs from the Texinfo
18931source file for this @value{DOCUMENT}, and a @file{Makefile.in} file, which
18932@code{configure} uses to generate a @file{Makefile}.
18933As part of the process of building @code{gawk}, the library functions from
18934@ref{Library Functions, , A Library of @code{awk} Functions},
18935and the @code{igawk} program from
18936@ref{Igawk Program, , An Easy Way to Use Library Functions},
18937are extracted into ready to use files.
18938They are installed as part of the installation process.
18939
18940@item atari/*
18941Files needed for building @code{gawk} on an Atari ST.
18942@xref{Atari Installation, ,Installing @code{gawk} on the Atari ST}, for details.
18943
18944@item pc/*
18945Files needed for building @code{gawk} under MS-DOS and OS/2.
18946@xref{PC Installation, ,MS-DOS and OS/2 Installation and Compilation}, for details.
18947
18948@item vms/*
18949Files needed for building @code{gawk} under VMS.
18950@xref{VMS Installation, ,How to Compile and Install @code{gawk} on VMS}, for details.
18951
18952@item test/*
18953A test suite for
18954@code{gawk}.  You can use @samp{make check} from the top level @code{gawk}
18955directory to run your version of @code{gawk} against the test suite.
18956If @code{gawk} successfully passes @samp{make check} then you can
18957be confident of a successful port.
18958@end table
18959
18960@node Unix Installation, VMS Installation, Gawk Distribution, Installation
18961@appendixsec Compiling and Installing @code{gawk} on Unix
18962
18963Usually, you can compile and install @code{gawk} by typing only two
18964commands.  However, if you do use an unusual system, you may need
18965to configure @code{gawk} for your system yourself.
18966
18967@menu
18968* Quick Installation::          Compiling @code{gawk} under Unix.
18969* Configuration Philosophy::    How it's all supposed to work.
18970@end menu
18971
18972@node Quick Installation, Configuration Philosophy, Unix Installation, Unix Installation
18973@appendixsubsec Compiling @code{gawk} for Unix
18974
18975@cindex installation, unix
18976After you have extracted the @code{gawk} distribution, @code{cd}
18977to @file{gawk-@value{VERSION}.@value{PATCHLEVEL}}.  Like most GNU software,
18978@code{gawk} is configured
18979automatically for your Unix system by running the @code{configure} program.
18980This program is a Bourne shell script that was generated automatically using
18981GNU @code{autoconf}.
18982@iftex
18983(The @code{autoconf} software is
18984described fully in
18985@cite{Autoconf---Generating Automatic Configuration Scripts},
18986which is available from the Free Software Foundation.)
18987@end iftex
18988@ifinfo
18989(The @code{autoconf} software is described fully starting with
18990@ref{Top, , Introduction, autoconf, Autoconf---Generating Automatic Configuration Scripts}.)
18991@end ifinfo
18992
18993To configure @code{gawk}, simply run @code{configure}:
18994
18995@example
18996sh ./configure
18997@end example
18998
18999This produces a @file{Makefile} and @file{config.h} tailored to your system.
19000The @file{config.h} file describes various facts about your system.
19001You may wish to edit the @file{Makefile} to
19002change the @code{CFLAGS} variable, which controls
19003the command line options that are passed to the C compiler (such as
19004optimization levels, or compiling for debugging).
19005
19006Alternatively, you can add your own values for most @code{make}
19007variables, such as @code{CC} and @code{CFLAGS}, on the command line when
19008running @code{configure}:
19009
19010@example
19011CC=cc CFLAGS=-g sh ./configure
19012@end example
19013
19014@noindent
19015See the file @file{INSTALL} in the @code{gawk} distribution for
19016all the details.
19017
19018After you have run @code{configure}, and possibly edited the @file{Makefile},
19019type:
19020
19021@example
19022make
19023@end example
19024
19025@noindent
19026and shortly thereafter, you should have an executable version of @code{gawk}.
19027That's all there is to it!
19028(If these steps do not work, please send in a bug report;
19029@pxref{Bugs, ,Reporting Problems and Bugs}.)
19030
19031@node Configuration Philosophy, , Quick Installation, Unix Installation
19032@appendixsubsec The Configuration Process
19033
19034@cindex configuring @code{gawk}
19035(This section is of interest only if you know something about using the
19036C language and the Unix operating system.)
19037
19038The source code for @code{gawk} generally attempts to adhere to formal
19039standards wherever possible.  This means that @code{gawk} uses library
19040routines that are specified by the ANSI C standard and by the POSIX
19041operating system interface standard.  When using an ANSI C compiler,
19042function prototypes are used to help improve the compile-time checking.
19043
19044Many Unix systems do not support all of either the ANSI or the
19045POSIX standards.  The @file{missing} subdirectory in the @code{gawk}
19046distribution contains replacement versions of those subroutines that are
19047most likely to be missing.
19048
19049The @file{config.h} file that is created by the @code{configure} program
19050contains definitions that describe features of the particular operating
19051system where you are attempting to compile @code{gawk}.  The three things
19052described by this file are what header files are available, so that
19053they can be correctly included,
19054what (supposedly) standard functions are actually available in your C
19055libraries, and
19056other miscellaneous facts about your
19057variant of Unix.  For example, there may not be an @code{st_blksize}
19058element in the @code{stat} structure.  In this case @samp{HAVE_ST_BLKSIZE}
19059would be undefined.
19060
19061@cindex @code{custom.h} configuration file
19062It is possible for your C compiler to lie to @code{configure}. It may
19063do so by not exiting with an error when a library function is not
19064available.  To get around this, you can edit the file @file{custom.h}.
19065Use an @samp{#ifdef} that is appropriate for your system, and either
19066@code{#define} any constants that @code{configure} should have defined but
19067didn't, or @code{#undef} any constants that @code{configure} defined and
19068should not have.  @file{custom.h} is automatically included by
19069@file{config.h}.
19070
19071It is also possible that the @code{configure} program generated by
19072@code{autoconf}
19073will not work on your system in some other fashion.  If you do have a problem,
19074the file
19075@file{configure.in} is the input for @code{autoconf}.  You may be able to
19076change this file, and generate a new version of @code{configure} that will
19077work on your system.  @xref{Bugs, ,Reporting Problems and Bugs}, for
19078information on how to report problems in configuring @code{gawk}.  The same
19079mechanism may be used to send in updates to @file{configure.in} and/or
19080@file{custom.h}.
19081
19082@node VMS Installation, PC Installation, Unix Installation, Installation
19083@appendixsec How to Compile and Install @code{gawk} on VMS
19084
19085@c based on material from Pat Rankin <rankin@eql.caltech.edu>
19086
19087@cindex installation, vms
19088This section describes how to compile and install @code{gawk} under VMS.
19089
19090@menu
19091* VMS Compilation::             How to compile @code{gawk} under VMS.
19092* VMS Installation Details::    How to install @code{gawk} under VMS.
19093* VMS Running::                 How to run @code{gawk} under VMS.
19094* VMS POSIX::                   Alternate instructions for VMS POSIX.
19095@end menu
19096
19097@node VMS Compilation, VMS Installation Details, VMS Installation, VMS Installation
19098@appendixsubsec Compiling @code{gawk} on VMS
19099
19100To compile @code{gawk} under VMS, there is a @code{DCL} command procedure that
19101will issue all the necessary @code{CC} and @code{LINK} commands, and there is
19102also a @file{Makefile} for use with the @code{MMS} utility.  From the source
19103directory, use either
19104
19105@example
19106$ @@[.VMS]VMSBUILD.COM
19107@end example
19108
19109@noindent
19110or
19111
19112@example
19113$ MMS/DESCRIPTION=[.VMS]DESCRIP.MMS GAWK
19114@end example
19115
19116Depending upon which C compiler you are using, follow one of the sets
19117of instructions in this table:
19118
19119@table @asis
19120@item VAX C V3.x
19121Use either @file{vmsbuild.com} or @file{descrip.mms} as is.  These use
19122@code{CC/OPTIMIZE=NOLINE}, which is essential for Version 3.0.
19123
19124@item VAX C V2.x
19125You must have Version 2.3 or 2.4; older ones won't work.  Edit either
19126@file{vmsbuild.com} or @file{descrip.mms} according to the comments in them.
19127For @file{vmsbuild.com}, this just entails removing two @samp{!} delimiters.
19128Also edit @file{config.h} (which is a copy of file @file{[.config]vms-conf.h})
19129and comment out or delete the two lines @samp{#define __STDC__ 0} and
19130@samp{#define VAXC_BUILTINS} near the end.
19131
19132@item GNU C
19133Edit @file{vmsbuild.com} or @file{descrip.mms}; the changes are different
19134from those for VAX C V2.x, but equally straightforward.  No changes to
19135@file{config.h} should be needed.
19136
19137@item DEC C
19138Edit @file{vmsbuild.com} or @file{descrip.mms} according to their comments.
19139No changes to @file{config.h} should be needed.
19140@end table
19141
19142@code{gawk} has been tested under VAX/VMS 5.5-1 using VAX C V3.2,
19143GNU C 1.40 and 2.3.  It should work without modifications for VMS V4.6 and up.
19144
19145@node VMS Installation Details, VMS Running, VMS Compilation, VMS Installation
19146@appendixsubsec Installing @code{gawk} on VMS
19147
19148To install @code{gawk}, all you need is a ``foreign'' command, which is
19149a @code{DCL} symbol whose value begins with a dollar sign. For example:
19150
19151@example
19152$ GAWK :== $disk1:[gnubin]GAWK
19153@end example
19154
19155@noindent
19156(Substitute the actual location of @code{gawk.exe} for
19157@samp{$disk1:[gnubin]}.) The symbol should be placed in the
19158@file{login.com} of any user who wishes to run @code{gawk},
19159so that it will be defined every time the user logs on.
19160Alternatively, the symbol may be placed in the system-wide
19161@file{sylogin.com} procedure, which will allow all users
19162to run @code{gawk}.
19163
19164Optionally, the help entry can be loaded into a VMS help library:
19165
19166@example
19167$ LIBRARY/HELP SYS$HELP:HELPLIB [.VMS]GAWK.HLP
19168@end example
19169
19170@noindent
19171(You may want to substitute a site-specific help library rather than
19172the standard VMS library @samp{HELPLIB}.)  After loading the help text,
19173
19174@example
19175$ HELP GAWK
19176@end example
19177
19178@noindent
19179will provide information about both the @code{gawk} implementation and the
19180@code{awk} programming language.
19181
19182The logical name @samp{AWK_LIBRARY} can designate a default location
19183for @code{awk} program files.  For the @samp{-f} option, if the specified
19184filename has no device or directory path information in it, @code{gawk}
19185will look in the current directory first, then in the directory specified
19186by the translation of @samp{AWK_LIBRARY} if the file was not found.
19187If after searching in both directories, the file still is not found,
19188then @code{gawk} appends the suffix @samp{.awk} to the filename and the
19189file search will be re-tried.  If @samp{AWK_LIBRARY} is not defined, that
19190portion of the file search will fail benignly.
19191
19192@node VMS Running, VMS POSIX, VMS Installation Details, VMS Installation
19193@appendixsubsec Running @code{gawk} on VMS
19194
19195Command line parsing and quoting conventions are significantly different
19196on VMS, so examples in this @value{DOCUMENT} or from other sources often need minor
19197changes.  They @emph{are} minor though, and all @code{awk} programs
19198should run correctly.
19199
19200Here are a couple of trivial tests:
19201
19202@example
19203$ gawk -- "BEGIN @{print ""Hello, World!""@}"
19204$ gawk -"W" version
19205! could also be -"W version" or "-W version"
19206@end example
19207
19208@noindent
19209Note that upper-case and mixed-case text must be quoted.
19210
19211The VMS port of @code{gawk} includes a @code{DCL}-style interface in addition
19212to the original shell-style interface (see the help entry for details).
19213One side-effect of dual command line parsing is that if there is only a
19214single parameter (as in the quoted string program above), the command
19215becomes ambiguous.  To work around this, the normally optional @samp{--}
19216flag is required to force Unix style rather than @code{DCL} parsing.  If any
19217other dash-type options (or multiple parameters such as data files to be
19218processed) are present, there is no ambiguity and @samp{--} can be omitted.
19219
19220The default search path when looking for @code{awk} program files specified
19221by the @samp{-f} option is @code{"SYS$DISK:[],AWK_LIBRARY:"}.  The logical
19222name @samp{AWKPATH} can be used to override this default.  The format
19223of @samp{AWKPATH} is a comma-separated list of directory specifications.
19224When defining it, the value should be quoted so that it retains a single
19225translation, and not a multi-translation @code{RMS} searchlist.
19226
19227@node VMS POSIX,  , VMS Running, VMS Installation
19228@appendixsubsec Building and Using @code{gawk} on VMS POSIX
19229
19230Ignore the instructions above, although @file{vms/gawk.hlp} should still
19231be made available in a help library.  The source tree should be unpacked
19232into a container file subsystem rather than into the ordinary VMS file
19233system.  Make sure that the two scripts, @file{configure} and
19234@file{vms/posix-cc.sh}, are executable; use @samp{chmod +x} on them if
19235necessary.  Then execute the following two commands:
19236
19237@example
19238@group
19239psx> CC=vms/posix-cc.sh configure
19240psx> make CC=c89 gawk
19241@end group
19242@end example
19243
19244@noindent
19245The first command will construct files @file{config.h} and @file{Makefile} out
19246of templates, using a script to make the C compiler fit @code{configure}'s
19247expectations.  The second command will compile and link @code{gawk} using
19248the C compiler directly; ignore any warnings from @code{make} about being
19249unable to redefine @code{CC}.  @code{configure} will take a very long
19250time to execute, but at least it provides incremental feedback as it
19251runs.
19252
19253This has been tested with VAX/VMS V6.2, VMS POSIX V2.0, and DEC C V5.2.
19254
19255Once built, @code{gawk} will work like any other shell utility.  Unlike
19256the normal VMS port of @code{gawk}, no special command line manipulation is
19257needed in the VMS POSIX environment.
19258
19259@c Rewritten by Scott Deifik <scottd@amgen.com>
19260@c and Darrel Hankerson <hankedr@mail.auburn.edu>
19261@node PC Installation, Atari Installation, VMS Installation, Installation
19262@appendixsec MS-DOS and OS/2 Installation and Compilation
19263
19264@cindex installation, MS-DOS and OS/2
19265If you have received a binary distribution prepared by the DOS
19266maintainers, then @code{gawk} and the necessary support files will appear
19267under the @file{gnu} directory, with executables in @file{gnu/bin},
19268libraries in @file{gnu/lib/awk}, and manual pages under @file{gnu/man}.
19269This is designed for easy installation to a @file{/gnu} directory on your
19270drive, but the files can be installed anywhere provided @code{AWKPATH} is
19271set properly.  Regardless of the installation directory, the first line of
19272@file{igawk.cmd} and @file{igawk.bat} (in @file{gnu/bin}) may need to be
19273edited.
19274
19275The binary distribution will contain a separate file describing the
19276contents. In particular, it may include more than one version of the
19277@code{gawk} executable. OS/2 binary distributions may have a
19278different arrangement, but installation is similar.
19279
19280The OS/2 and MS-DOS versions of @code{gawk} search for program files as
19281described in @ref{AWKPATH Variable, ,The @code{AWKPATH} Environment Variable}.
19282However, semicolons (rather than colons) separate elements
19283in the @code{AWKPATH} variable. If @code{AWKPATH} is not set or is empty,
19284then the default search path is @code{@w{".;c:/lib/awk;c:/gnu/lib/awk"}}.
19285
19286An @code{sh}-like shell (as opposed to @code{command.com} under MS-DOS
19287or @code{cmd.exe} under OS/2) may be useful for @code{awk} programming.
19288Ian Stewartson has written an excellent shell for MS-DOS and OS/2, and a
19289@code{ksh} clone and GNU Bash are available for OS/2. The file
19290@file{README_d/README.pc} in the @code{gawk} distribution contains
19291information on these shells. Users of Stewartson's shell on DOS should
19292examine its documentation on handling of command-lines. In particular,
19293the setting for @code{gawk} in the shell configuration may need to be
19294changed, and the @code{ignoretype} option may also be of interest.
19295
19296@code{gawk} can be compiled for MS-DOS and OS/2 using the GNU development tools
19297from DJ Delorie (DJGPP, MS-DOS-only) or Eberhard Mattes (EMX, MS-DOS and OS/2).
19298Microsoft C can be used to build 16-bit versions for MS-DOS and OS/2.  The file
19299@file{README_d/README.pc} in the @code{gawk} distribution contains additional
19300notes, and @file{pc/Makefile} contains important notes on compilation options.
19301
19302To build @code{gawk}, copy the files in the @file{pc} directory (@emph{except}
19303for @file{ChangeLog}) to the
19304directory with the rest of the @code{gawk} sources. The @file{Makefile}
19305contains a configuration section with comments, and may need to be
19306edited in order to work with your @code{make} utility.
19307
19308The @file{Makefile} contains a number of targets for building various MS-DOS
19309and OS/2 versions. A list of targets will be printed if the @code{make}
19310command is given without a target. As an example, to build @code{gawk}
19311using the DJGPP tools, enter @samp{make djgpp}.
19312
19313Using @code{make} to run the standard tests and to install @code{gawk}
19314requires additional Unix-like tools, including @code{sh}, @code{sed}, and
19315@code{cp}. In order to run the tests, the @file{test/*.ok} files may need to
19316be converted so that they have the usual DOS-style end-of-line markers. Most
19317of the tests will work properly with Stewartson's shell along with the
19318companion utilities or appropriate GNU utilities.  However, some editing of
19319@file{test/Makefile} is required. It is recommended that the file
19320@file{pc/Makefile.tst} be copied to @file{test/Makefile} as a
19321replacement. Details can be found in @file{README_d/README.pc}.
19322
19323@node Atari Installation, Amiga Installation, PC Installation, Installation
19324@appendixsec Installing @code{gawk} on the Atari ST
19325
19326@c based on material from Michal Jaegermann <michal@gortel.phys.ualberta.ca>
19327
19328@cindex atari
19329@cindex installation, atari
19330There are no substantial differences when installing @code{gawk} on
19331various Atari models.  Compiled @code{gawk} executables do not require
19332a large amount of memory with most @code{awk} programs and should run on all
19333Motorola processor based models (called further ST, even if that is not
19334exactly right).
19335
19336In order to use @code{gawk}, you need to have a shell, either text or
19337graphics, that does not map all the characters of a command line to
19338upper-case.  Maintaining case distinction in option flags is very
19339important (@pxref{Options, ,Command Line Options}).
19340These days this is the default, and it may only be a problem for some
19341very old machines.  If your system does not preserve the case of option
19342flags, you will need to upgrade your tools.  Support for I/O
19343redirection is necessary to make it easy to import @code{awk} programs
19344from other environments.  Pipes are nice to have, but not vital.
19345
19346@menu
19347* Atari Compiling::           Compiling @code{gawk} on Atari
19348* Atari Using::               Running @code{gawk} on Atari
19349@end menu
19350
19351@node Atari Compiling, Atari Using, Atari Installation, Atari Installation
19352@appendixsubsec Compiling @code{gawk} on the Atari ST
19353
19354A proper compilation of @code{gawk} sources when @code{sizeof(int)}
19355differs from @code{sizeof(void *)} requires an ANSI C compiler. An initial
19356port was done with @code{gcc}.  You may actually prefer executables
19357where @code{int}s are four bytes wide, but the other variant works as well.
19358
19359You may need quite a bit of memory when trying to recompile the @code{gawk}
19360sources, as some source files (@file{regex.c} in particular) are quite
19361big.  If you run out of memory compiling such a file, try reducing the
19362optimization level for this particular file; this may help.
19363
19364@cindex Linux
19365With a reasonable shell (Bash will do), and in particular if you run
19366Linux, MiNT or a similar operating system, you have a pretty good
19367chance that the @code{configure} utility will succeed.  Otherwise
19368sample versions of @file{config.h} and @file{Makefile.st} are given in the
19369@file{atari} subdirectory and can be edited and copied to the
19370corresponding files in the main source directory.  Even if
19371@code{configure} produced something, it might be advisable to compare
19372its results with the sample versions and possibly make adjustments.
19373
19374Some @code{gawk} source code fragments depend on a preprocessor define
19375@samp{atarist}.  This basically assumes the TOS environment with @code{gcc}.
19376Modify these sections as appropriate if they are not right for your
19377environment.  Also see the remarks about @code{AWKPATH} and @code{envsep} in
19378@ref{Atari Using, ,Running @code{gawk} on the Atari ST}.
19379
19380As shipped, the sample @file{config.h} claims that the @code{system}
19381function is missing from the libraries, which is not true, and an
19382alternative implementation of this function is provided in
19383@file{atari/system.c}.  Depending upon your particular combination of
19384shell and operating system, you may wish to change the file to indicate
19385that @code{system} is available.
19386
19387@node Atari Using, , Atari Compiling, Atari Installation
19388@appendixsubsec Running @code{gawk} on the Atari ST
19389
19390An executable version of @code{gawk} should be placed, as usual,
19391anywhere in your @code{PATH} where your shell can find it.
19392
19393While executing, @code{gawk} creates a number of temporary files.  When
19394using @code{gcc} libraries for TOS, @code{gawk} looks for either of
19395the environment variables @code{TEMP} or @code{TMPDIR}, in that order.
19396If either one is found, its value is assumed to be a directory for
19397temporary files.  This directory must exist, and if you can spare the
19398memory, it is a good idea to put it on a RAM drive.  If neither
19399@code{TEMP} nor @code{TMPDIR} are found, then @code{gawk} uses the
19400current directory for its temporary files.
19401
19402The ST version of @code{gawk} searches for its program files as described in
19403@ref{AWKPATH Variable, ,The @code{AWKPATH} Environment Variable}.
19404The default value for the @code{AWKPATH} variable is taken from
19405@code{DEFPATH} defined in @file{Makefile}. The sample @code{gcc}/TOS
19406@file{Makefile} for the ST in the distribution sets @code{DEFPATH} to
19407@code{@w{".,c:\lib\awk,c:\gnu\lib\awk"}}.  The search path can be
19408modified by explicitly setting @code{AWKPATH} to whatever you wish.
19409Note that colons cannot be used on the ST to separate elements in the
19410@code{AWKPATH} variable, since they have another, reserved, meaning.
19411Instead, you must use a comma to separate elements in the path.  When
19412recompiling, the separating character can be modified by initializing
19413the @code{envsep} variable in @file{atari/gawkmisc.atr} to another
19414value.
19415
19416Although @code{awk} allows great flexibility in doing I/O redirections
19417from within a program, this facility should be used with care on the ST
19418running under TOS.  In some circumstances the OS routines for file
19419handle pool processing lose track of certain events, causing the
19420computer to crash, and requiring a reboot.  Often a warm reboot is
19421sufficient.  Fortunately, this happens infrequently, and in rather
19422esoteric situations.  In particular, avoid having one part of an
19423@code{awk} program using @code{print} statements explicitly redirected
19424to @code{"/dev/stdout"}, while other @code{print} statements use the
19425default standard output, and a calling shell has redirected standard
19426output to a file.
19427
19428When @code{gawk} is compiled with the ST version of @code{gcc} and its
19429usual libraries, it will accept both @samp{/} and @samp{\} as path separators.
19430While this is convenient, it should be remembered that this removes one,
19431technically valid, character (@samp{/}) from your file names, and that
19432it may create problems for external programs, called via the @code{system}
19433function, which may not support this convention.  Whenever it is possible
19434that a file created by @code{gawk} will be used by some other program,
19435use only backslashes.  Also remember that in @code{awk}, backslashes in
19436strings have to be doubled in order to get literal backslashes
19437(@pxref{Escape Sequences}).
19438
19439@node Amiga Installation, Bugs, Atari Installation, Installation
19440@appendixsec Installing @code{gawk} on an Amiga
19441
19442@cindex amiga
19443@cindex installation, amiga
19444You can install @code{gawk} on an Amiga system using a Unix emulation
19445environment available via anonymous @code{ftp} from
19446@code{ftp.ninemoons.com} in the directory @file{pub/ade/current}.
19447This includes a shell based on @code{pdksh}.  The primary component of
19448this environment is a Unix emulation library, @file{ixemul.lib}.
19449@c could really use more background here, who wrote this, etc.
19450
19451A more complete distribution for the Amiga is available on
19452the Geek Gadgets CD-ROM from:
19453
19454@quotation
19455CRONUS @*
194561840 E. Warner Road #105-265 @*
19457Tempe, AZ 85284  USA @*
19458US Toll Free: (800) 804-0833 @*
19459Phone: +1-602-491-0442 @*
19460FAX: +1-602-491-0048 @*
19461Email:  @code{info@@ninemoons.com} @*
19462WWW: @code{http://www.ninemoons.com} @*
19463Anonymous @code{ftp} site: @code{ftp.ninemoons.com} @*
19464@end quotation
19465
19466Once you have the distribution, you can configure @code{gawk} simply by
19467running @code{configure}:
19468
19469@example
19470configure -v m68k-amigaos
19471@end example
19472
19473Then run @code{make}, and you should be all set!
19474(If these steps do not work, please send in a bug report;
19475@pxref{Bugs, ,Reporting Problems and Bugs}.)
19476
19477@node Bugs, Other Versions, Amiga Installation, Installation
19478@appendixsec Reporting Problems and Bugs
19479@display
19480@i{There is nothing more dangerous than a bored archeologist.}
19481The Hitchhiker's Guide to the Galaxy
19482@c the radio show, not the book. :-)
19483@end display
19484@sp 1
19485
19486If you have problems with @code{gawk} or think that you have found a bug,
19487please report it to the developers; we cannot promise to do anything
19488but we might well want to fix it.
19489
19490Before reporting a bug, make sure you have actually found a real bug.
19491Carefully reread the documentation and see if it really says you can do
19492what you're trying to do.  If it's not clear whether you should be able
19493to do something or not, report that too; it's a bug in the documentation!
19494
19495Before reporting a bug or trying to fix it yourself, try to isolate it
19496to the smallest possible @code{awk} program and input data file that
19497reproduces the problem.  Then send us the program and data file,
19498some idea of what kind of Unix system you're using, and the exact results
19499@code{gawk} gave you.  Also say what you expected to occur; this will help
19500us decide whether the problem was really in the documentation.
19501
19502Once you have a precise problem, send email to @email{bug-gawk@@gnu.org}.
19503
19504Please include the version number of @code{gawk} you are using.
19505You can get this information with the command @samp{gawk --version}.
19506Using this address will automatically send a carbon copy of your
19507mail to Arnold Robbins.  If necessary, he can be reached directly at
19508@email{arnold@@gnu.org}.
19509
19510@cindex @code{comp.lang.awk}
19511@strong{Important!} Do @emph{not} try to report bugs in @code{gawk} by
19512posting to the Usenet/Internet newsgroup @code{comp.lang.awk}.
19513While the @code{gawk} developers do occasionally read this newsgroup,
19514there is no guarantee that we will see your posting.  The steps described
19515above are the official, recognized ways for reporting bugs.
19516
19517Non-bug suggestions are always welcome as well.  If you have questions
19518about things that are unclear in the documentation or are just obscure
19519features, ask Arnold Robbins; he will try to help you out, although he
19520may not have the time to fix the problem.  You can send him electronic
19521mail at the Internet address above.
19522
19523If you find bugs in one of the non-Unix ports of @code{gawk}, please send
19524an electronic mail message to the person who maintains that port.  They
19525are listed below, and also in the @file{README} file in the @code{gawk}
19526distribution.  Information in the @file{README} file should be considered
19527authoritative if it conflicts with this @value{DOCUMENT}.
19528
19529@c NEEDED for looks
19530@page
19531The people maintaining the non-Unix ports of @code{gawk} are:
19532
19533@cindex Deifik, Scott
19534@cindex Fish, Fred
19535@cindex Hankerson, Darrel
19536@cindex Jaegermann, Michal
19537@cindex Rankin, Pat
19538@cindex Rommel, Kai Uwe
19539@table @asis
19540@item MS-DOS
19541Scott Deifik, @samp{scottd@@amgen.com}, and
19542Darrel Hankerson, @samp{hankedr@@mail.auburn.edu}.
19543
19544@item OS/2
19545Kai Uwe Rommel, @samp{rommel@@ars.de}.
19546
19547@item VMS
19548Pat Rankin, @samp{rankin@@eql.caltech.edu}.
19549
19550@item Atari ST
19551Michal Jaegermann, @samp{michal@@gortel.phys.ualberta.ca}.
19552
19553@item Amiga
19554Fred Fish, @samp{fnf@@ninemoons.com}.
19555@end table
19556
19557If your bug is also reproducible under Unix, please send copies of your
19558report to the general GNU bug list, as well as to Arnold Robbins, at the
19559addresses listed above.
19560
19561@node Other Versions, , Bugs, Installation
19562@appendixsec Other Freely Available @code{awk} Implementations
19563@cindex Brennan, Michael
19564@ignore
19565From: emory!amc.com!brennan (Michael Brennan)
19566Subject: C++ comments in awk programs
19567To: arnold@gnu.ai.mit.edu (Arnold Robbins)
19568Date: Wed, 4 Sep 1996 08:11:48 -0700 (PDT)
19569
19570@end ignore
19571@display
19572@i{It's kind of fun to put comments like this in your awk code.}
19573      @code{// Do C++ comments work? answer: yes! of course}
19574Michael Brennan
19575@end display
19576@sp 1
19577
19578There are two other freely available @code{awk} implementations.
19579This section briefly describes where to get them.
19580
19581@table @asis
19582@cindex Kernighan, Brian
19583@cindex anonymous @code{ftp}
19584@cindex @code{ftp}, anonymous
19585@item Unix @code{awk}
19586Brian Kernighan has been able to make his implementation of
19587@code{awk} freely available.  You can get it via anonymous @code{ftp}
19588to the host @code{@w{netlib.bell-labs.com}}.  Change directory to
19589@file{/netlib/research}. Use ``binary'' or ``image'' mode, and
19590retrieve @file{awk.bundle.gz}.
19591
19592This is a shell archive that has been compressed with the GNU @code{gzip}
19593utility. It can be uncompressed with the @code{gunzip} utility.
19594
19595You can also retrieve this version via the World Wide Web from his
19596@uref{http://cm.bell-labs.com/who/bwk, home page}.
19597
19598This version requires an ANSI C compiler; GCC (the GNU C compiler)
19599works quite nicely.
19600
19601@cindex Brennan, Michael
19602@cindex @code{mawk}
19603@item @code{mawk}
19604Michael Brennan has written an independent implementation of @code{awk},
19605called @code{mawk}.  It is available under the GPL
19606(@pxref{Copying, ,GNU GENERAL PUBLIC LICENSE}),
19607just as @code{gawk} is.
19608
19609You can get it via anonymous @code{ftp} to the host
19610@code{@w{ftp.whidbey.net}}.  Change directory to @file{/pub/brennan}.
19611Use ``binary'' or ``image'' mode, and retrieve @file{mawk1.3.3.tar.gz}
19612(or the latest version that is there).
19613
19614@code{gunzip} may be used to decompress this file. Installation
19615is similar to @code{gawk}'s
19616(@pxref{Unix Installation, , Compiling and Installing @code{gawk} on Unix}).
19617@end table
19618
19619@node Notes, Glossary, Installation, Top
19620@appendix Implementation Notes
19621
19622This appendix contains information mainly of interest to implementors and
19623maintainers of @code{gawk}.  Everything in it applies specifically to
19624@code{gawk}, and not to other implementations.
19625
19626@menu
19627* Compatibility Mode::          How to disable certain @code{gawk} extensions.
19628* Additions::                   Making Additions To @code{gawk}.
19629* Future Extensions::           New features that may be implemented one day.
19630* Improvements::                Suggestions for improvements by volunteers.
19631@end menu
19632
19633@node Compatibility Mode, Additions, Notes, Notes
19634@appendixsec Downward Compatibility and Debugging
19635
19636@xref{POSIX/GNU, ,Extensions in @code{gawk} Not in POSIX @code{awk}},
19637for a summary of the GNU extensions to the @code{awk} language and program.
19638All of these features can be turned off by invoking @code{gawk} with the
19639@samp{--traditional} option, or with the @samp{--posix} option.
19640
19641If @code{gawk} is compiled for debugging with @samp{-DDEBUG}, then there
19642is one more option available on the command line:
19643
19644@table @code
19645@item -W parsedebug
19646@itemx --parsedebug
19647Print out the parse stack information as the program is being parsed.
19648@end table
19649
19650This option is intended only for serious @code{gawk} developers,
19651and not for the casual user.  It probably has not even been compiled into
19652your version of @code{gawk}, since it slows down execution.
19653
19654@node Additions, Future Extensions, Compatibility Mode, Notes
19655@appendixsec Making Additions to @code{gawk}
19656
19657If you should find that you wish to enhance @code{gawk} in a significant
19658fashion, you are perfectly free to do so.  That is the point of having
19659free software; the source code is available, and you are free to change
19660it as you wish (@pxref{Copying, ,GNU GENERAL PUBLIC LICENSE}).
19661
19662This section discusses the ways you might wish to change @code{gawk},
19663and any considerations you should bear in mind.
19664
19665@menu
19666* Adding Code::             Adding code to the main body of @code{gawk}.
19667* New Ports::               Porting @code{gawk} to a new operating system.
19668@end menu
19669
19670@node Adding Code, New Ports, Additions, Additions
19671@appendixsubsec Adding New Features
19672
19673@cindex adding new features
19674@cindex features, adding
19675You are free to add any new features you like to @code{gawk}.
19676However, if you want your changes to be incorporated into the @code{gawk}
19677distribution, there are several steps that you need to take in order to
19678make it possible for me to include your changes.
19679
19680@enumerate 1
19681@item
19682Get the latest version.
19683It is much easier for me to integrate changes if they are relative to
19684the most recent distributed version of @code{gawk}.  If your version of
19685@code{gawk} is very old, I may not be able to integrate them at all.
19686@xref{Getting, ,Getting the @code{gawk} Distribution},
19687for information on getting the latest version of @code{gawk}.
19688
19689@item
19690@iftex
19691Follow the @cite{GNU Coding Standards}.
19692@end iftex
19693@ifinfo
19694See @inforef{Top, , Version, standards, GNU Coding Standards}.
19695@end ifinfo
19696This document describes how GNU software should be written. If you haven't
19697read it, please do so, preferably @emph{before} starting to modify @code{gawk}.
19698(The @cite{GNU Coding Standards} are available as part of the Autoconf
19699distribution, from the FSF.)
19700
19701@cindex @code{gawk} coding style
19702@cindex coding style used in @code{gawk}
19703@item
19704Use the @code{gawk} coding style.
19705The C code for @code{gawk} follows the instructions in the
19706@cite{GNU Coding Standards}, with minor exceptions.  The code is formatted
19707using the traditional ``K&R'' style, particularly as regards the placement
19708of braces and the use of tabs.  In brief, the coding rules for @code{gawk}
19709are:
19710
19711@itemize @bullet
19712@item
19713Use old style (non-prototype) function headers when defining functions.
19714
19715@item
19716Put the name of the function at the beginning of its own line.
19717
19718@item
19719Put the return type of the function, even if it is @code{int}, on the
19720line above the line with the name and arguments of the function.
19721
19722@item
19723The declarations for the function arguments should not be indented.
19724
19725@item
19726Put spaces around parentheses used in control structures
19727(@code{if}, @code{while}, @code{for}, @code{do}, @code{switch}
19728and @code{return}).
19729
19730@item
19731Do not put spaces in front of parentheses used in function calls.
19732
19733@item
19734Put spaces around all C operators, and after commas in function calls.
19735
19736@item
19737Do not use the comma operator to produce multiple side-effects, except
19738in @code{for} loop initialization and increment parts, and in macro bodies.
19739
19740@item
19741Use real tabs for indenting, not spaces.
19742
19743@item
19744Use the ``K&R'' brace layout style.
19745
19746@item
19747Use comparisons against @code{NULL} and @code{'\0'} in the conditions of
19748@code{if}, @code{while} and @code{for} statements, and in the @code{case}s
19749of @code{switch} statements, instead of just the
19750plain pointer or character value.
19751
19752@item
19753Use the @code{TRUE}, @code{FALSE}, and @code{NULL} symbolic constants,
19754and the character constant @code{'\0'} where appropriate, instead of @code{1}
19755and @code{0}.
19756
19757@item
19758Provide one-line descriptive comments for each function.
19759
19760@item
19761Do not use @samp{#elif}. Many older Unix C compilers cannot handle it.
19762
19763@item
19764Do not use the @code{alloca} function for allocating memory off the stack.
19765Its use causes more portability trouble than the minor benefit of not having
19766to free the storage. Instead, use @code{malloc} and @code{free}.
19767@end itemize
19768
19769If I have to reformat your code to follow the coding style used in
19770@code{gawk}, I may not bother.
19771
19772@item
19773Be prepared to sign the appropriate paperwork.
19774In order for the FSF to distribute your changes, you must either place
19775those changes in the public domain, and submit a signed statement to that
19776effect, or assign the copyright in your changes to the FSF.
19777Both of these actions are easy to do, and @emph{many} people have done so
19778already. If you have questions, please contact me
19779(@pxref{Bugs, , Reporting Problems and Bugs}),
19780or @code{gnu@@gnu.org}.
19781
19782@item
19783Update the documentation.
19784Along with your new code, please supply new sections and or chapters
19785for this @value{DOCUMENT}.  If at all possible, please use real
19786Texinfo, instead of just supplying unformatted ASCII text (although
19787even that is better than no documentation at all).
19788Conventions to be followed in @cite{@value{TITLE}} are provided
19789after the @samp{@@bye} at the end of the Texinfo source file.
19790If possible, please update the man page as well.
19791
19792You will also have to sign paperwork for your documentation changes.
19793
19794@item
19795Submit changes as context diffs or unified diffs.
19796Use @samp{diff -c -r -N} or @samp{diff -u -r -N} to compare
19797the original @code{gawk} source tree with your version.
19798(I find context diffs to be more readable, but unified diffs are
19799more compact.)
19800I recommend using the GNU version of @code{diff}.
19801Send the output produced by either run of @code{diff} to me when you
19802submit your changes.
19803@xref{Bugs, , Reporting Problems and Bugs}, for the electronic mail
19804information.
19805
19806Using this format makes it easy for me to apply your changes to the
19807master version of the @code{gawk} source code (using @code{patch}).
19808If I have to apply the changes manually, using a text editor, I may
19809not do so, particularly if there are lots of changes.
19810
19811@item
19812Include an entry for the @file{ChangeLog} file with your submission.
19813This further helps minimize the amount of work I have to do,
19814making it easier for me to accept patches.
19815@end enumerate
19816
19817Although this sounds like a lot of work, please remember that while you
19818may write the new code, I have to maintain it and support it, and if it
19819isn't possible for me to do that with a minimum of extra work, then I
19820probably will not.
19821
19822
19823@node New Ports, , Adding Code, Additions
19824@appendixsubsec Porting @code{gawk} to a New Operating System
19825
19826@cindex porting @code{gawk}
19827If you wish to port @code{gawk} to a new operating system, there are
19828several steps to follow.
19829
19830@enumerate 1
19831@item
19832Follow the guidelines in
19833@ref{Adding Code, ,Adding New Features},
19834concerning coding style, submission of diffs, and so on.
19835
19836@item
19837When doing a port, bear in mind that your code must co-exist peacefully
19838with the rest of @code{gawk}, and the other ports. Avoid gratuitous
19839changes to the system-independent parts of the code. If at all possible,
19840avoid sprinkling @samp{#ifdef}s just for your port throughout the
19841code.
19842
19843If the changes needed for a particular system affect too much of the
19844code, I probably will not accept them.  In such a case, you will, of course,
19845be able to distribute your changes on your own, as long as you comply
19846with the GPL
19847(@pxref{Copying, ,GNU GENERAL PUBLIC LICENSE}).
19848
19849@item
19850A number of the files that come with @code{gawk} are maintained by other
19851people at the Free Software Foundation.  Thus, you should not change them
19852unless it is for a very good reason. I.e.@: changes are not out of the
19853question, but changes to these files will be scrutinized extra carefully.
19854The files are @file{alloca.c}, @file{getopt.h}, @file{getopt.c},
19855@file{getopt1.c}, @file{regex.h}, @file{regex.c}, @file{dfa.h},
19856@file{dfa.c}, @file{install-sh}, and @file{mkinstalldirs}.
19857
19858@item
19859Be willing to continue to maintain the port.
19860Non-Unix operating systems are supported by volunteers who maintain
19861the code needed to compile and run @code{gawk} on their systems. If no-one
19862volunteers to maintain a port, that port becomes unsupported, and it may
19863be necessary to remove it from the distribution.
19864
19865@item
19866Supply an appropriate @file{gawkmisc.???} file.
19867Each port has its own @file{gawkmisc.???} that implements certain
19868operating system specific functions. This is cleaner than a plethora of
19869@samp{#ifdef}s scattered throughout the code.  The @file{gawkmisc.c} in
19870the main source directory includes the appropriate
19871@file{gawkmisc.???} file from each subdirectory.
19872Be sure to update it as well.
19873
19874Each port's @file{gawkmisc.???} file has a suffix reminiscent of the machine
19875or operating system for the port. For example, @file{pc/gawkmisc.pc} and
19876@file{vms/gawkmisc.vms}. The use of separate suffixes, instead of plain
19877@file{gawkmisc.c}, makes it possible to move files from a port's subdirectory
19878into the main subdirectory, without accidentally destroying the real
19879@file{gawkmisc.c} file.  (Currently, this is only an issue for the MS-DOS
19880and OS/2 ports.)
19881
19882@item
19883Supply a @file{Makefile} and any other C source and header files that are
19884necessary for your operating system.  All your code should be in a
19885separate subdirectory, with a name that is the same as, or reminiscent
19886of, either your operating system or the computer system.  If possible,
19887try to structure things so that it is not necessary to move files out
19888of the subdirectory into the main source directory.  If that is not
19889possible, then be sure to avoid using names for your files that
19890duplicate the names of files in the main source directory.
19891
19892@item
19893Update the documentation.
19894Please write a section (or sections) for this @value{DOCUMENT} describing the
19895installation and compilation steps needed to install and/or compile
19896@code{gawk} for your system.
19897
19898@item
19899Be prepared to sign the appropriate paperwork.
19900In order for the FSF to distribute your code, you must either place
19901your code in the public domain, and submit a signed statement to that
19902effect, or assign the copyright in your code to the FSF.
19903@ifinfo
19904Both of these actions are easy to do, and @emph{many} people have done so
19905already. If you have questions, please contact me, or
19906@code{gnu@@gnu.org}.
19907@end ifinfo
19908@end enumerate
19909
19910Following these steps will make it much easier to integrate your changes
19911into @code{gawk}, and have them co-exist happily with the code for other
19912operating systems that is already there.
19913
19914In the code that you supply, and that you maintain, feel free to use a
19915coding style and brace layout that suits your taste.
19916
19917@node Future Extensions, Improvements, Additions, Notes
19918@appendixsec Probable Future Extensions
19919@ignore
19920From emory!scalpel.netlabs.com!lwall Tue Oct 31 12:43:17 1995
19921Return-Path: <emory!scalpel.netlabs.com!lwall>
19922Message-Id: <9510311732.AA28472@scalpel.netlabs.com>
19923To: arnold@skeeve.atl.ga.us (Arnold D. Robbins)
19924Subject: Re: May I quote you?
19925In-Reply-To: Your message of "Tue, 31 Oct 95 09:11:00 EST."
19926             <m0tAHPQ-00014MC@skeeve.atl.ga.us>
19927Date: Tue, 31 Oct 95 09:32:46 -0800
19928From: Larry Wall <emory!scalpel.netlabs.com!lwall>
19929
19930: Greetings. I am working on the release of gawk 3.0. Part of it will be a
19931: thoroughly updated manual. One of the sections deals with planned future
19932: extensions and enhancements.  I have the following at the beginning
19933: of it:
19934:
19935: @cindex PERL
19936: @cindex Wall, Larry
19937: @display
19938: @i{AWK is a language similar to PERL, only considerably more elegant.} @*
19939: Arnold Robbins
19940: @sp 1
19941: @i{Hey!} @*
19942: Larry Wall
19943: @end display
19944:
19945: Before I actually release this for publication, I wanted to get your
19946: permission to quote you.  (Hopefully, in the spirit of much of GNU, the
19947: implied humor is visible... :-)
19948
19949I think that would be fine.
19950
19951Larry
19952@end ignore
19953@cindex PERL
19954@cindex Wall, Larry
19955@display
19956@i{AWK is a language similar to PERL, only considerably more elegant.}
19957Arnold Robbins
19958
19959@i{Hey!}
19960Larry Wall
19961@end display
19962@sp 1
19963
19964This section briefly lists extensions and possible improvements
19965that indicate the directions we are
19966currently considering for @code{gawk}.  The file @file{FUTURES} in the
19967@code{gawk} distributions lists these extensions as well.
19968
19969This is a list of probable future changes that will be usable by the
19970@code{awk} language programmer.
19971
19972@c these are ordered by likelihood
19973@table @asis
19974@item Localization
19975The GNU project is starting to support multiple languages.
19976It will at least be possible to make @code{gawk} print its warnings and
19977error messages in languages other than English.
19978It may be possible for @code{awk} programs to also use the multiple
19979language facilities, separate from @code{gawk} itself.
19980
19981@item Databases
19982It may be possible to map a GDBM/NDBM/SDBM file into an @code{awk} array.
19983
19984@item A @code{PROCINFO} Array
19985The special files that provide process-related information
19986(@pxref{Special Files, ,Special File Names in @code{gawk}})
19987will be superseded by a @code{PROCINFO} array that would provide the same
19988information, in an easier to access fashion.
19989
19990@item More @code{lint} warnings
19991There are more things that could be checked for portability.
19992
19993@item Control of subprocess environment
19994Changes made in @code{gawk} to the array @code{ENVIRON} may be
19995propagated to subprocesses run by @code{gawk}.
19996
19997@ignore
19998@item @code{RECLEN} variable for fixed length records
19999Along with @code{FIELDWIDTHS}, this would speed up the processing of
20000fixed-length records.
20001
20002@item A @code{restart} keyword
20003After modifying @code{$0}, @code{restart} would restart the pattern
20004matching loop, without reading a new record from the input.
20005
20006@item A @samp{|&} redirection
20007The @samp{|&} redirection, in place of @samp{|}, would open a two-way
20008pipeline for communication with a sub-process (via @code{getline} and
20009@code{print} and @code{printf}).
20010
20011@item Function valued variables
20012It would be possible to assign the name of a user-defined or built-in
20013function to a regular @code{awk} variable, and then call the function
20014indirectly, by using the regular variable.  This would make it possible
20015to write general purpose sorting and comparing routines, for example,
20016by simply passing the name of one function into another.
20017
20018@item A built-in @code{stat} function
20019The @code{stat} function would provide an easy-to-use hook to the
20020@code{stat} system call so that @code{awk} programs could determine information
20021about files.
20022
20023@item A built-in @code{ftw} function
20024Combined with function valued variables and the @code{stat} function,
20025@code{ftw} (file tree walk) would make it easy for an @code{awk} program
20026to walk an entire file tree.
20027@end ignore
20028@end table
20029
20030This is a list of probable improvements that will make @code{gawk}
20031perform better.
20032
20033@table @asis
20034@item An Improved Version of @code{dfa}
20035The @code{dfa} pattern matcher from GNU @code{grep} has some
20036problems. Either a new version or a fixed one will deal with some
20037important regexp matching issues.
20038
20039@item Use of GNU @code{malloc}
20040The GNU version of @code{malloc} could potentially speed up @code{gawk},
20041since it relies heavily on the use of dynamic memory allocation.
20042
20043@end table
20044
20045@node Improvements,  , Future Extensions, Notes
20046@appendixsec Suggestions for Improvements
20047
20048Here are some projects that would-be @code{gawk} hackers might like to take
20049on.  They vary in size from a few days to a few weeks of programming,
20050depending on which one you choose and how fast a programmer you are.  Please
20051send any improvements you write to the maintainers at the GNU project.
20052@xref{Adding Code, , Adding New Features},
20053for guidelines to follow when adding new features to @code{gawk}.
20054@xref{Bugs, ,Reporting Problems and Bugs}, for information on
20055contacting the maintainers.
20056
20057@enumerate
20058@item
20059Compilation of @code{awk} programs: @code{gawk} uses a Bison (YACC-like)
20060parser to convert the script given it into a syntax tree; the syntax
20061tree is then executed by a simple recursive evaluator.  This method incurs
20062a lot of overhead, since the recursive evaluator performs many procedure
20063calls to do even the simplest things.
20064
20065It should be possible for @code{gawk} to convert the script's parse tree
20066into a C program which the user would then compile, using the normal
20067C compiler and a special @code{gawk} library to provide all the needed
20068functions (regexps, fields, associative arrays, type coercion, and so
20069on).
20070
20071An easier possibility might be for an intermediate phase of @code{awk} to
20072convert the parse tree into a linear byte code form like the one used
20073in GNU Emacs Lisp.  The recursive evaluator would then be replaced by
20074a straight line byte code interpreter that would be intermediate in speed
20075between running a compiled program and doing what @code{gawk} does
20076now.
20077
20078@item
20079The programs in the test suite could use documenting in this @value{DOCUMENT}.
20080
20081@item
20082See the @file{FUTURES} file for more ideas.  Contact us if you would
20083seriously like to tackle any of the items listed there.
20084@end enumerate
20085
20086@node Glossary, Copying, Notes, Top
20087@appendix Glossary
20088
20089@table @asis
20090@item Action
20091A series of @code{awk} statements attached to a rule.  If the rule's
20092pattern matches an input record, @code{awk} executes the
20093rule's action.  Actions are always enclosed in curly braces.
20094@xref{Action Overview, ,Overview of Actions}.
20095
20096@item Amazing @code{awk} Assembler
20097Henry Spencer at the University of Toronto wrote a retargetable assembler
20098completely as @code{awk} scripts.  It is thousands of lines long, including
20099machine descriptions for several eight-bit microcomputers.
20100It is a good example of a
20101program that would have been better written in another language.
20102
20103@item Amazingly Workable Formatter (@code{awf})
20104Henry Spencer at the University of Toronto wrote a formatter that accepts
20105a large subset of the @samp{nroff -ms} and @samp{nroff -man} formatting
20106commands, using @code{awk} and @code{sh}.
20107
20108@item ANSI
20109The American National Standards Institute.  This organization produces
20110many standards, among them the standards for the C and C++ programming
20111languages.
20112
20113@item Assignment
20114An @code{awk} expression that changes the value of some @code{awk}
20115variable or data object.  An object that you can assign to is called an
20116@dfn{lvalue}.  The assigned values are called @dfn{rvalues}.
20117@xref{Assignment Ops, ,Assignment Expressions}.
20118
20119@item @code{awk} Language
20120The language in which @code{awk} programs are written.
20121
20122@item @code{awk} Program
20123An @code{awk} program consists of a series of @dfn{patterns} and
20124@dfn{actions}, collectively known as @dfn{rules}.  For each input record
20125given to the program, the program's rules are all processed in turn.
20126@code{awk} programs may also contain function definitions.
20127
20128@item @code{awk} Script
20129Another name for an @code{awk} program.
20130
20131@item Bash
20132The GNU version of the standard shell (the Bourne-Again shell).
20133See ``Bourne Shell.''
20134
20135@item BBS
20136See ``Bulletin Board System.''
20137
20138@item Boolean Expression
20139Named after the English mathematician Boole. See ``Logical Expression.''
20140
20141@item Bourne Shell
20142The standard shell (@file{/bin/sh}) on Unix and Unix-like systems,
20143originally written by Steven R.@: Bourne.
20144Many shells (Bash, @code{ksh}, @code{pdksh}, @code{zsh}) are
20145generally upwardly compatible with the Bourne shell.
20146
20147@item Built-in Function
20148The @code{awk} language provides built-in functions that perform various
20149numerical, time stamp related, and string computations.  Examples are
20150@code{sqrt} (for the square root of a number) and @code{substr} (for a
20151substring of a string).  @xref{Built-in, ,Built-in Functions}.
20152
20153@item Built-in Variable
20154@code{ARGC}, @code{ARGIND}, @code{ARGV}, @code{CONVFMT}, @code{ENVIRON},
20155@code{ERRNO}, @code{FIELDWIDTHS}, @code{FILENAME}, @code{FNR}, @code{FS},
20156@code{IGNORECASE}, @code{NF}, @code{NR}, @code{OFMT}, @code{OFS}, @code{ORS},
20157@code{RLENGTH}, @code{RSTART}, @code{RS}, @code{RT}, and @code{SUBSEP},
20158are the variables that have special meaning to @code{awk}.
20159Changing some of them affects @code{awk}'s running environment.
20160Several of these variables are specific to @code{gawk}.
20161@xref{Built-in Variables}.
20162
20163@item Braces
20164See ``Curly Braces.''
20165
20166@item Bulletin Board System
20167A computer system allowing users to log in and read and/or leave messages
20168for other users of the system, much like leaving paper notes on a bulletin
20169board.
20170
20171@item C
20172The system programming language that most GNU software is written in.  The
20173@code{awk} programming language has C-like syntax, and this @value{DOCUMENT}
20174points out similarities between @code{awk} and C when appropriate.
20175
20176@cindex ISO 8859-1
20177@cindex ISO Latin-1
20178@item Character Set
20179The set of numeric codes used by a computer system to represent the
20180characters (letters, numbers, punctuation, etc.) of a particular country
20181or place. The most common character set in use today is ASCII (American
20182Standard Code for Information Interchange).  Many European
20183countries use an extension of ASCII known as ISO-8859-1 (ISO Latin-1).
20184
20185@item CHEM
20186A preprocessor for @code{pic} that reads descriptions of molecules
20187and produces @code{pic} input for drawing them.  It was written in @code{awk}
20188by Brian Kernighan and Jon Bentley, and is available from
20189@email{@w{netlib@@research.bell-labs.com}}.
20190
20191@item Compound Statement
20192A series of @code{awk} statements, enclosed in curly braces.  Compound
20193statements may be nested.
20194@xref{Statements, ,Control Statements in Actions}.
20195
20196@item Concatenation
20197Concatenating two strings means sticking them together, one after another,
20198giving a new string.  For example, the string @samp{foo} concatenated with
20199the string @samp{bar} gives the string @samp{foobar}.
20200@xref{Concatenation, ,String Concatenation}.
20201
20202@item Conditional Expression
20203An expression using the @samp{?:} ternary operator, such as
20204@samp{@var{expr1} ? @var{expr2} : @var{expr3}}.  The expression
20205@var{expr1} is evaluated; if the result is true, the value of the whole
20206expression is the value of @var{expr2}, otherwise the value is
20207@var{expr3}.  In either case, only one of @var{expr2} and @var{expr3}
20208is evaluated.  @xref{Conditional Exp, ,Conditional Expressions}.
20209
20210@item Comparison Expression
20211A relation that is either true or false, such as @samp{(a < b)}.
20212Comparison expressions are used in @code{if}, @code{while}, @code{do},
20213and @code{for}
20214statements, and in patterns to select which input records to process.
20215@xref{Typing and Comparison, ,Variable Typing and Comparison Expressions}.
20216
20217@item Curly Braces
20218The characters @samp{@{} and @samp{@}}.  Curly braces are used in
20219@code{awk} for delimiting actions, compound statements, and function
20220bodies.
20221
20222@item Dark Corner
20223An area in the language where specifications often were (or still
20224are) not clear, leading to unexpected or undesirable behavior.
20225Such areas are marked in this @value{DOCUMENT} with ``(d.c.)'' in the
20226text, and are indexed under the heading ``dark corner.''
20227
20228@item Data Objects
20229These are numbers and strings of characters.  Numbers are converted into
20230strings and vice versa, as needed.
20231@xref{Conversion, ,Conversion of Strings and Numbers}.
20232
20233@item Double Precision
20234An internal representation of numbers that can have fractional parts.
20235Double precision numbers keep track of more digits than do single precision
20236numbers, but operations on them are more expensive.  This is the way
20237@code{awk} stores numeric values.  It is the C type @code{double}.
20238
20239@item Dynamic Regular Expression
20240A dynamic regular expression is a regular expression written as an
20241ordinary expression.  It could be a string constant, such as
20242@code{"foo"}, but it may also be an expression whose value can vary.
20243@xref{Computed Regexps, , Using Dynamic Regexps}.
20244
20245@item Environment
20246A collection of strings, of the form @var{name@code{=}val}, that each
20247program has available to it. Users generally place values into the
20248environment in order to provide information to various programs. Typical
20249examples are the environment variables @code{HOME} and @code{PATH}.
20250
20251@item Empty String
20252See ``Null String.''
20253
20254@item Escape Sequences
20255A special sequence of characters used for describing non-printing
20256characters, such as @samp{\n} for newline, or @samp{\033} for the ASCII
20257ESC (escape) character.  @xref{Escape Sequences}.
20258
20259@item Field
20260When @code{awk} reads an input record, it splits the record into pieces
20261separated by whitespace (or by a separator regexp which you can
20262change by setting the built-in variable @code{FS}).  Such pieces are
20263called fields.  If the pieces are of fixed length, you can use the built-in
20264variable @code{FIELDWIDTHS} to describe their lengths.
20265@xref{Field Separators, ,Specifying How Fields are Separated},
20266and also see
20267@xref{Constant Size, , Reading Fixed-width Data}.
20268
20269@item Floating Point Number
20270Often referred to in mathematical terms as a ``rational'' number, this is
20271just a number that can have a fractional part.
20272See ``Double Precision'' and ``Single Precision.''
20273
20274@item Format
20275Format strings are used to control the appearance of output in the
20276@code{printf} statement.  Also, data conversions from numbers to strings
20277are controlled by the format string contained in the built-in variable
20278@code{CONVFMT}.  @xref{Control Letters, ,Format-Control Letters}.
20279
20280@item Function
20281A specialized group of statements used to encapsulate general
20282or program-specific tasks.  @code{awk} has a number of built-in
20283functions, and also allows you to define your own.
20284@xref{Built-in, ,Built-in Functions},
20285and @ref{User-defined, ,User-defined Functions}.
20286
20287@item FSF
20288See ``Free Software Foundation.''
20289
20290@item Free Software Foundation
20291A non-profit organization dedicated
20292to the production and distribution of freely distributable software.
20293It was founded by Richard M.@: Stallman, the author of the original
20294Emacs editor.  GNU Emacs is the most widely used version of Emacs today.
20295
20296@item @code{gawk}
20297The GNU implementation of @code{awk}.
20298
20299@item General Public License
20300This document describes the terms under which @code{gawk} and its source
20301code may be distributed. (@pxref{Copying, ,GNU GENERAL PUBLIC LICENSE})
20302
20303@item GNU
20304``GNU's not Unix''.  An on-going project of the Free Software Foundation
20305to create a complete, freely distributable, POSIX-compliant computing
20306environment.
20307
20308@item GPL
20309See ``General Public License.''
20310
20311@item Hexadecimal
20312Base 16 notation, where the digits are @code{0}-@code{9} and
20313@code{A}-@code{F}, with @samp{A}
20314representing 10, @samp{B} representing 11, and so on up to @samp{F} for 15.
20315Hexadecimal numbers are written in C using a leading @samp{0x},
20316to indicate their base.  Thus, @code{0x12} is 18 (one times 16 plus 2).
20317
20318@item I/O
20319Abbreviation for ``Input/Output,'' the act of moving data into and/or
20320out of a running program.
20321
20322@item Input Record
20323A single chunk of data read in by @code{awk}.  Usually, an @code{awk} input
20324record consists of one line of text.
20325@xref{Records, ,How Input is Split into Records}.
20326
20327@item Integer
20328A whole number, i.e.@: a number that does not have a fractional part.
20329
20330@item Keyword
20331In the @code{awk} language, a keyword is a word that has special
20332meaning.  Keywords are reserved and may not be used as variable names.
20333
20334@code{gawk}'s keywords are:
20335@code{BEGIN},
20336@code{END},
20337@code{if},
20338@code{else},
20339@code{while},
20340@code{do@dots{}while},
20341@code{for},
20342@code{for@dots{}in},
20343@code{break},
20344@code{continue},
20345@code{delete},
20346@code{next},
20347@code{nextfile},
20348@code{function},
20349@code{func},
20350and @code{exit}.
20351
20352@item Logical Expression
20353An expression using the operators for logic, AND, OR, and NOT, written
20354@samp{&&}, @samp{||}, and @samp{!} in @code{awk}. Often called Boolean
20355expressions, after the mathematician who pioneered this kind of
20356mathematical logic.
20357
20358@item Lvalue
20359An expression that can appear on the left side of an assignment
20360operator.  In most languages, lvalues can be variables or array
20361elements.  In @code{awk}, a field designator can also be used as an
20362lvalue.
20363
20364@item Null String
20365A string with no characters in it.  It is represented explicitly in
20366@code{awk} programs by placing two double-quote characters next to
20367each other (@code{""}).  It can appear in input data by having two successive
20368occurrences of the field separator appear next to each other.
20369
20370@item Number
20371A numeric valued data object.  The @code{gawk} implementation uses double
20372precision floating point to represent numbers.
20373Very old @code{awk} implementations use single precision floating
20374point.
20375
20376@item Octal
20377Base-eight notation, where the digits are @code{0}-@code{7}.
20378Octal numbers are written in C using a leading @samp{0},
20379to indicate their base.  Thus, @code{013} is 11 (one times 8 plus 3).
20380
20381@item Pattern
20382Patterns tell @code{awk} which input records are interesting to which
20383rules.
20384
20385A pattern is an arbitrary conditional expression against which input is
20386tested.  If the condition is satisfied, the pattern is said to @dfn{match}
20387the input record.  A typical pattern might compare the input record against
20388a regular expression.  @xref{Pattern Overview, ,Pattern Elements}.
20389
20390@item POSIX
20391The name for a series of standards being developed by the IEEE
20392that specify a Portable Operating System interface.  The ``IX'' denotes
20393the Unix heritage of these standards.  The main standard of interest for
20394@code{awk} users is
20395@cite{IEEE Standard for Information Technology, Standard 1003.2-1992,
20396Portable Operating System Interface (POSIX) Part 2: Shell and Utilities}.
20397Informally, this standard is often referred to as simply ``P1003.2.''
20398
20399@item Private
20400Variables and/or functions that are meant for use exclusively by library
20401functions, and not for the main @code{awk} program. Special care must be
20402taken when naming such variables and functions.
20403@xref{Library Names,  ,  Naming Library Function Global Variables}.
20404
20405@item Range (of input lines)
20406A sequence of consecutive lines from the input file.  A pattern
20407can specify ranges of input lines for @code{awk} to process, or it can
20408specify single lines.  @xref{Pattern Overview, ,Pattern Elements}.
20409
20410@item Recursion
20411When a function calls itself, either directly or indirectly.
20412If this isn't clear, refer to the entry for ``recursion.''
20413
20414@item Redirection
20415Redirection means performing input from other than the standard input
20416stream, or output to other than the standard output stream.
20417
20418You can redirect the output of the @code{print} and @code{printf} statements
20419to a file or a system command, using the @samp{>}, @samp{>>}, and @samp{|}
20420operators.  You can redirect input to the @code{getline} statement using
20421the @samp{<} and @samp{|} operators.
20422@xref{Redirection, ,Redirecting Output of @code{print} and @code{printf}},
20423and @ref{Getline, ,Explicit Input with @code{getline}}.
20424
20425@item Regexp
20426Short for @dfn{regular expression}.  A regexp is a pattern that denotes a
20427set of strings, possibly an infinite set.  For example, the regexp
20428@samp{R.*xp} matches any string starting with the letter @samp{R}
20429and ending with the letters @samp{xp}.  In @code{awk}, regexps are
20430used in patterns and in conditional expressions.  Regexps may contain
20431escape sequences.  @xref{Regexp, ,Regular Expressions}.
20432
20433@item Regular Expression
20434See ``regexp.''
20435
20436@item Regular Expression Constant
20437A regular expression constant is a regular expression written within
20438slashes, such as @code{/foo/}.  This regular expression is chosen
20439when you write the @code{awk} program, and cannot be changed doing
20440its execution.  @xref{Regexp Usage, ,How to Use Regular Expressions}.
20441
20442@item Rule
20443A segment of an @code{awk} program that specifies how to process single
20444input records.  A rule consists of a @dfn{pattern} and an @dfn{action}.
20445@code{awk} reads an input record; then, for each rule, if the input record
20446satisfies the rule's pattern, @code{awk} executes the rule's action.
20447Otherwise, the rule does nothing for that input record.
20448
20449@item Rvalue
20450A value that can appear on the right side of an assignment operator.
20451In @code{awk}, essentially every expression has a value. These values
20452are rvalues.
20453
20454@item @code{sed}
20455See ``Stream Editor.''
20456
20457@item Short-Circuit
20458The nature of the @code{awk} logical operators @samp{&&} and @samp{||}.
20459If the value of the entire expression can be deduced from evaluating just
20460the left-hand side of these operators, the right-hand side will not
20461be evaluated
20462(@pxref{Boolean Ops, ,Boolean Expressions}).
20463
20464@item Side Effect
20465A side effect occurs when an expression has an effect aside from merely
20466producing a value.  Assignment expressions, increment and decrement
20467expressions and function calls have side effects.
20468@xref{Assignment Ops, ,Assignment Expressions}.
20469
20470@item Single Precision
20471An internal representation of numbers that can have fractional parts.
20472Single precision numbers keep track of fewer digits than do double precision
20473numbers, but operations on them are less expensive in terms of CPU time.
20474This is the type used by some very old versions of @code{awk} to store
20475numeric values.  It is the C type @code{float}.
20476
20477@item Space
20478The character generated by hitting the space bar on the keyboard.
20479
20480@item Special File
20481A file name interpreted internally by @code{gawk}, instead of being handed
20482directly to the underlying operating system.  For example, @file{/dev/stderr}.
20483@xref{Special Files, ,Special File Names in @code{gawk}}.
20484
20485@item Stream Editor
20486A program that reads records from an input stream and processes them one
20487or more at a time.  This is in contrast with batch programs, which may
20488expect to read their input files in entirety before starting to do
20489anything, and with interactive programs, which require input from the
20490user.
20491
20492@item String
20493A datum consisting of a sequence of characters, such as @samp{I am a
20494string}.  Constant strings are written with double-quotes in the
20495@code{awk} language, and may contain escape sequences.
20496@xref{Escape Sequences}.
20497
20498@item Tab
20499The character generated by hitting the @kbd{TAB} key on the keyboard.
20500It usually expands to up to eight spaces upon output.
20501
20502@item Unix
20503A computer operating system originally developed in the early 1970's at
20504AT&T Bell Laboratories.  It initially became popular in universities around
20505the world, and later moved into commercial evnironments as a software
20506development system and network server system. There are many commercial
20507versions of Unix, as well as several work-alike systems whose source code
20508is freely available (such as Linux, NetBSD, and FreeBSD).
20509
20510@item Whitespace
20511A sequence of space, tab, or newline characters occurring inside an input
20512record or a string.
20513@end table
20514
20515@node Copying, Index, Glossary, Top
20516@unnumbered GNU GENERAL PUBLIC LICENSE
20517@center Version 2, June 1991
20518
20519@display
20520Copyright @copyright{} 1989, 1991 Free Software Foundation, Inc.
2052159 Temple Place --- Suite 330, Boston, MA 02111-1307, USA
20522
20523Everyone is permitted to copy and distribute verbatim copies
20524of this license document, but changing it is not allowed.
20525@end display
20526
20527@c fakenode --- for prepinfo
20528@unnumberedsec Preamble
20529
20530  The licenses for most software are designed to take away your
20531freedom to share and change it.  By contrast, the GNU General Public
20532License is intended to guarantee your freedom to share and change free
20533software---to make sure the software is free for all its users.  This
20534General Public License applies to most of the Free Software
20535Foundation's software and to any other program whose authors commit to
20536using it.  (Some other Free Software Foundation software is covered by
20537the GNU Library General Public License instead.)  You can apply it to
20538your programs, too.
20539
20540  When we speak of free software, we are referring to freedom, not
20541price.  Our General Public Licenses are designed to make sure that you
20542have the freedom to distribute copies of free software (and charge for
20543this service if you wish), that you receive source code or can get it
20544if you want it, that you can change the software or use pieces of it
20545in new free programs; and that you know you can do these things.
20546
20547  To protect your rights, we need to make restrictions that forbid
20548anyone to deny you these rights or to ask you to surrender the rights.
20549These restrictions translate to certain responsibilities for you if you
20550distribute copies of the software, or if you modify it.
20551
20552  For example, if you distribute copies of such a program, whether
20553gratis or for a fee, you must give the recipients all the rights that
20554you have.  You must make sure that they, too, receive or can get the
20555source code.  And you must show them these terms so they know their
20556rights.
20557
20558  We protect your rights with two steps: (1) copyright the software, and
20559(2) offer you this license which gives you legal permission to copy,
20560distribute and/or modify the software.
20561
20562  Also, for each author's protection and ours, we want to make certain
20563that everyone understands that there is no warranty for this free
20564software.  If the software is modified by someone else and passed on, we
20565want its recipients to know that what they have is not the original, so
20566that any problems introduced by others will not reflect on the original
20567authors' reputations.
20568
20569  Finally, any free program is threatened constantly by software
20570patents.  We wish to avoid the danger that redistributors of a free
20571program will individually obtain patent licenses, in effect making the
20572program proprietary.  To prevent this, we have made it clear that any
20573patent must be licensed for everyone's free use or not licensed at all.
20574
20575  The precise terms and conditions for copying, distribution and
20576modification follow.
20577
20578@iftex
20579@c fakenode --- for prepinfo
20580@unnumberedsec TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION
20581@end iftex
20582@ifinfo
20583@center TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION
20584@end ifinfo
20585
20586@enumerate 0
20587@item
20588This License applies to any program or other work which contains
20589a notice placed by the copyright holder saying it may be distributed
20590under the terms of this General Public License.  The ``Program'', below,
20591refers to any such program or work, and a ``work based on the Program''
20592means either the Program or any derivative work under copyright law:
20593that is to say, a work containing the Program or a portion of it,
20594either verbatim or with modifications and/or translated into another
20595language.  (Hereinafter, translation is included without limitation in
20596the term ``modification''.)  Each licensee is addressed as ``you''.
20597
20598Activities other than copying, distribution and modification are not
20599covered by this License; they are outside its scope.  The act of
20600running the Program is not restricted, and the output from the Program
20601is covered only if its contents constitute a work based on the
20602Program (independent of having been made by running the Program).
20603Whether that is true depends on what the Program does.
20604
20605@item
20606You may copy and distribute verbatim copies of the Program's
20607source code as you receive it, in any medium, provided that you
20608conspicuously and appropriately publish on each copy an appropriate
20609copyright notice and disclaimer of warranty; keep intact all the
20610notices that refer to this License and to the absence of any warranty;
20611and give any other recipients of the Program a copy of this License
20612along with the Program.
20613
20614You may charge a fee for the physical act of transferring a copy, and
20615you may at your option offer warranty protection in exchange for a fee.
20616
20617@item
20618You may modify your copy or copies of the Program or any portion
20619of it, thus forming a work based on the Program, and copy and
20620distribute such modifications or work under the terms of Section 1
20621above, provided that you also meet all of these conditions:
20622
20623@enumerate a
20624@item
20625You must cause the modified files to carry prominent notices
20626stating that you changed the files and the date of any change.
20627
20628@item
20629You must cause any work that you distribute or publish, that in
20630whole or in part contains or is derived from the Program or any
20631part thereof, to be licensed as a whole at no charge to all third
20632parties under the terms of this License.
20633
20634@item
20635If the modified program normally reads commands interactively
20636when run, you must cause it, when started running for such
20637interactive use in the most ordinary way, to print or display an
20638announcement including an appropriate copyright notice and a
20639notice that there is no warranty (or else, saying that you provide
20640a warranty) and that users may redistribute the program under
20641these conditions, and telling the user how to view a copy of this
20642License.  (Exception: if the Program itself is interactive but
20643does not normally print such an announcement, your work based on
20644the Program is not required to print an announcement.)
20645@end enumerate
20646
20647These requirements apply to the modified work as a whole.  If
20648identifiable sections of that work are not derived from the Program,
20649and can be reasonably considered independent and separate works in
20650themselves, then this License, and its terms, do not apply to those
20651sections when you distribute them as separate works.  But when you
20652distribute the same sections as part of a whole which is a work based
20653on the Program, the distribution of the whole must be on the terms of
20654this License, whose permissions for other licensees extend to the
20655entire whole, and thus to each and every part regardless of who wrote it.
20656
20657Thus, it is not the intent of this section to claim rights or contest
20658your rights to work written entirely by you; rather, the intent is to
20659exercise the right to control the distribution of derivative or
20660collective works based on the Program.
20661
20662In addition, mere aggregation of another work not based on the Program
20663with the Program (or with a work based on the Program) on a volume of
20664a storage or distribution medium does not bring the other work under
20665the scope of this License.
20666
20667@item
20668You may copy and distribute the Program (or a work based on it,
20669under Section 2) in object code or executable form under the terms of
20670Sections 1 and 2 above provided that you also do one of the following:
20671
20672@enumerate a
20673@item
20674Accompany it with the complete corresponding machine-readable
20675source code, which must be distributed under the terms of Sections
206761 and 2 above on a medium customarily used for software interchange; or,
20677
20678@item
20679Accompany it with a written offer, valid for at least three
20680years, to give any third party, for a charge no more than your
20681cost of physically performing source distribution, a complete
20682machine-readable copy of the corresponding source code, to be
20683distributed under the terms of Sections 1 and 2 above on a medium
20684customarily used for software interchange; or,
20685
20686@item
20687Accompany it with the information you received as to the offer
20688to distribute corresponding source code.  (This alternative is
20689allowed only for non-commercial distribution and only if you
20690received the program in object code or executable form with such
20691an offer, in accord with Subsection b above.)
20692@end enumerate
20693
20694The source code for a work means the preferred form of the work for
20695making modifications to it.  For an executable work, complete source
20696code means all the source code for all modules it contains, plus any
20697associated interface definition files, plus the scripts used to
20698control compilation and installation of the executable.  However, as a
20699special exception, the source code distributed need not include
20700anything that is normally distributed (in either source or binary
20701form) with the major components (compiler, kernel, and so on) of the
20702operating system on which the executable runs, unless that component
20703itself accompanies the executable.
20704
20705If distribution of executable or object code is made by offering
20706access to copy from a designated place, then offering equivalent
20707access to copy the source code from the same place counts as
20708distribution of the source code, even though third parties are not
20709compelled to copy the source along with the object code.
20710
20711@item
20712You may not copy, modify, sublicense, or distribute the Program
20713except as expressly provided under this License.  Any attempt
20714otherwise to copy, modify, sublicense or distribute the Program is
20715void, and will automatically terminate your rights under this License.
20716However, parties who have received copies, or rights, from you under
20717this License will not have their licenses terminated so long as such
20718parties remain in full compliance.
20719
20720@item
20721You are not required to accept this License, since you have not
20722signed it.  However, nothing else grants you permission to modify or
20723distribute the Program or its derivative works.  These actions are
20724prohibited by law if you do not accept this License.  Therefore, by
20725modifying or distributing the Program (or any work based on the
20726Program), you indicate your acceptance of this License to do so, and
20727all its terms and conditions for copying, distributing or modifying
20728the Program or works based on it.
20729
20730@item
20731Each time you redistribute the Program (or any work based on the
20732Program), the recipient automatically receives a license from the
20733original licensor to copy, distribute or modify the Program subject to
20734these terms and conditions.  You may not impose any further
20735restrictions on the recipients' exercise of the rights granted herein.
20736You are not responsible for enforcing compliance by third parties to
20737this License.
20738
20739@item
20740If, as a consequence of a court judgment or allegation of patent
20741infringement or for any other reason (not limited to patent issues),
20742conditions are imposed on you (whether by court order, agreement or
20743otherwise) that contradict the conditions of this License, they do not
20744excuse you from the conditions of this License.  If you cannot
20745distribute so as to satisfy simultaneously your obligations under this
20746License and any other pertinent obligations, then as a consequence you
20747may not distribute the Program at all.  For example, if a patent
20748license would not permit royalty-free redistribution of the Program by
20749all those who receive copies directly or indirectly through you, then
20750the only way you could satisfy both it and this License would be to
20751refrain entirely from distribution of the Program.
20752
20753If any portion of this section is held invalid or unenforceable under
20754any particular circumstance, the balance of the section is intended to
20755apply and the section as a whole is intended to apply in other
20756circumstances.
20757
20758It is not the purpose of this section to induce you to infringe any
20759patents or other property right claims or to contest validity of any
20760such claims; this section has the sole purpose of protecting the
20761integrity of the free software distribution system, which is
20762implemented by public license practices.  Many people have made
20763generous contributions to the wide range of software distributed
20764through that system in reliance on consistent application of that
20765system; it is up to the author/donor to decide if he or she is willing
20766to distribute software through any other system and a licensee cannot
20767impose that choice.
20768
20769This section is intended to make thoroughly clear what is believed to
20770be a consequence of the rest of this License.
20771
20772@item
20773If the distribution and/or use of the Program is restricted in
20774certain countries either by patents or by copyrighted interfaces, the
20775original copyright holder who places the Program under this License
20776may add an explicit geographical distribution limitation excluding
20777those countries, so that distribution is permitted only in or among
20778countries not thus excluded.  In such case, this License incorporates
20779the limitation as if written in the body of this License.
20780
20781@item
20782The Free Software Foundation may publish revised and/or new versions
20783of the General Public License from time to time.  Such new versions will
20784be similar in spirit to the present version, but may differ in detail to
20785address new problems or concerns.
20786
20787Each version is given a distinguishing version number.  If the Program
20788specifies a version number of this License which applies to it and ``any
20789later version'', you have the option of following the terms and conditions
20790either of that version or of any later version published by the Free
20791Software Foundation.  If the Program does not specify a version number of
20792this License, you may choose any version ever published by the Free Software
20793Foundation.
20794
20795@item
20796If you wish to incorporate parts of the Program into other free
20797programs whose distribution conditions are different, write to the author
20798to ask for permission.  For software which is copyrighted by the Free
20799Software Foundation, write to the Free Software Foundation; we sometimes
20800make exceptions for this.  Our decision will be guided by the two goals
20801of preserving the free status of all derivatives of our free software and
20802of promoting the sharing and reuse of software generally.
20803
20804@iftex
20805@c fakenode --- for prepinfo
20806@heading NO WARRANTY
20807@end iftex
20808@ifinfo
20809@center NO WARRANTY
20810@end ifinfo
20811
20812@item
20813BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY
20814FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW@.  EXCEPT WHEN
20815OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES
20816PROVIDE THE PROGRAM ``AS IS'' WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED
20817OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
20818MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE@.  THE ENTIRE RISK AS
20819TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU@.  SHOULD THE
20820PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING,
20821REPAIR OR CORRECTION.
20822
20823@item
20824IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
20825WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR
20826REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES,
20827INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING
20828OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED
20829TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY
20830YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER
20831PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE
20832POSSIBILITY OF SUCH DAMAGES.
20833@end enumerate
20834
20835@iftex
20836@c fakenode --- for prepinfo
20837@heading END OF TERMS AND CONDITIONS
20838@end iftex
20839@ifinfo
20840@center END OF TERMS AND CONDITIONS
20841@end ifinfo
20842
20843@page
20844@c fakenode --- for prepinfo
20845@unnumberedsec How to Apply These Terms to Your New Programs
20846
20847  If you develop a new program, and you want it to be of the greatest
20848possible use to the public, the best way to achieve this is to make it
20849free software which everyone can redistribute and change under these terms.
20850
20851  To do so, attach the following notices to the program.  It is safest
20852to attach them to the start of each source file to most effectively
20853convey the exclusion of warranty; and each file should have at least
20854the ``copyright'' line and a pointer to where the full notice is found.
20855
20856@smallexample
20857@var{one line to give the program's name and an idea of what it does.}
20858Copyright (C) @var{year}  @var{name of author}
20859
20860This program is free software; you can redistribute it and/or
20861modify it under the terms of the GNU General Public License
20862as published by the Free Software Foundation; either version 2
20863of the License, or (at your option) any later version.
20864
20865This program is distributed in the hope that it will be useful,
20866but WITHOUT ANY WARRANTY; without even the implied warranty of
20867MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE@.  See the
20868GNU General Public License for more details.
20869
20870You should have received a copy of the GNU General Public License
20871along with this program; if not, write to the Free Software
20872Foundation, Inc., 59 Temple Place --- Suite 330, Boston, MA 02111-1307, USA.
20873@end smallexample
20874
20875Also add information on how to contact you by electronic and paper mail.
20876
20877If the program is interactive, make it output a short notice like this
20878when it starts in an interactive mode:
20879
20880@smallexample
20881Gnomovision version 69, Copyright (C) @var{year} @var{name of author}
20882Gnomovision comes with ABSOLUTELY NO WARRANTY; for details
20883type `show w'.  This is free software, and you are welcome
20884to redistribute it under certain conditions; type `show c'
20885for details.
20886@end smallexample
20887
20888The hypothetical commands @samp{show w} and @samp{show c} should show
20889the appropriate parts of the General Public License.  Of course, the
20890commands you use may be called something other than @samp{show w} and
20891@samp{show c}; they could even be mouse-clicks or menu items---whatever
20892suits your program.
20893
20894You should also get your employer (if you work as a programmer) or your
20895school, if any, to sign a ``copyright disclaimer'' for the program, if
20896necessary.  Here is a sample; alter the names:
20897
20898@smallexample
20899@group
20900Yoyodyne, Inc., hereby disclaims all copyright
20901interest in the program `Gnomovision'
20902(which makes passes at compilers) written
20903by James Hacker.
20904
20905@var{signature of Ty Coon}, 1 April 1989
20906Ty Coon, President of Vice
20907@end group
20908@end smallexample
20909
20910This General Public License does not permit incorporating your program into
20911proprietary programs.  If your program is a subroutine library, you may
20912consider it more useful to permit linking proprietary applications with the
20913library.  If this is what you want to do, use the GNU Library General
20914Public License instead of this License.
20915
20916@node Index, , Copying, Top
20917@unnumbered Index
20918@printindex cp
20919
20920@summarycontents
20921@contents
20922@bye
20923
20924Unresolved Issues:
20925------------------
209261. From ADR.
20927
20928   Robert J. Chassell points out that awk programs should have some indication
20929   of how to use them.  It would be useful to perhaps have a "programming
20930   style" section of the manual that would include this and other tips.
20931
209322. The default AWKPATH search path should be configurable via `configure'
20933   The default and how this changes needs to be documented.
20934
20935Consistency issues:
20936	/.../ regexps are in @code, not @samp
20937	".." strings are in @code, not @samp
20938	no @print before @dots
20939	values of expressions in the text (@code{x} has the value 15),
20940		should be in roman, not @code
20941	Use   tab   and not   TAB
20942	Use   ESC   and not   ESCAPE
20943	Use   space and not   blank	to describe the space bar's character
20944	The term "blank" is thus basically reserved for "blank lines" etc.
20945	The `(d.c.)' should appear inside the closing `.' of a sentence
20946		It should come before (pxref{...})
20947	" " should have an @w{} around it
20948	Use "non-" everywhere
20949	Use @code{ftp} when talking about anonymous ftp
20950	Use upper-case and lower-case, not "upper case" and "lower case"
20951	Use alphanumeric, not alpha-numeric
20952	Use --foo, not -Wfoo when describing long options
20953	Use findex for all programs and functions in the example chapters
20954	Use "Bell Laboratories", but not "Bell Labs".
20955	Use "behavior" instead of "behaviour".
20956	Use "zeros" instead of "zeroes".
20957	Use "Input/Output", not "input/output". Also "I/O", not "i/o".
20958	Use @code{do}, and not @code{do}-@code{while}, except where
20959		actually discussing the do-while.
20960	The words "a", "and", "as", "between", "for", "from", "in", "of",
20961		"on", "that", "the", "to", "with", and "without",
20962		should not be capitalized in @chapter, @section etc.
20963		"Into" and "How" should.
20964	Search for @dfn; make sure important items are also indexed.
20965	"e.g." should always be followed by a comma.
20966	"i.e." should never be followed by a comma, and should be followed
20967		by `@:'.
20968	The numbers zero through ten should be spelled out, except when
20969		talking about file descriptor numbers. > 10 and < 0, it's
20970		ok to use numbers.
20971	In tables, put command line options in @code, while in the text,
20972		put them in @samp.
20973	When using @strong, use "Note:" or "Caution:" with colons and
20974		not exclamation points.  Do not surround the paragraphs
20975		with @quotation ... @end quotation.
20976
20977Date: Wed, 13 Apr 94 15:20:52 -0400
20978From: rsm@gnu.ai.mit.edu (Richard Stallman)
20979To: gnu-prog@gnu.ai.mit.edu
20980Subject: A reminder: no pathnames in GNU
20981
20982It's a GNU convention to use the term "file name" for the name of a
20983file, never "pathname".  We use the term "path" for search paths,
20984which are lists of file names.  Using it for a single file name as
20985well is potentially confusing to users.
20986
20987So please check any documentation you maintain, if you think you might
20988have used "pathname".
20989
20990Note that "file name" should be two words when it appears as ordinary
20991text.  It's ok as one word when it's a metasyntactic variable, though.
20992
20993Suggestions:
20994------------
20995Enhance FIELDWIDTHS with some way to indicate "the rest of the record".
20996E.g., a length of 0 or -1 or something.  May be "n"?
20997
20998Make FIELDWIDTHS be an array?
20999
21000What if FIELDWIDTHS has invalid values in it?
21001