1\input texinfo   @c -*-texinfo-*-
2@c vim: filetype=texinfo
3@c %**start of header (This is for running Texinfo on a region.)
4@setfilename gawk.info
5@settitle The GNU Awk User's Guide
6@c %**end of header (This is for running Texinfo on a region.)
7
8@dircategory Text creation and manipulation
9@direntry
10* Gawk: (gawk).                 A text scanning and processing language.
11@end direntry
12@dircategory Individual utilities
13@direntry
14* awk: (gawk)Invoking Gawk.                     Text scanning and processing.
15@end direntry
16
17@ifset FOR_PRINT
18@tex
19\gdef\xrefprintnodename#1{``#1''}
20@end tex
21@end ifset
22
23@ifclear FOR_PRINT
24@c With early 2014 texinfo.tex, restore PDF links and colors
25@tex
26\gdef\linkcolor{0.5 0.09 0.12} % Dark Red
27\gdef\urlcolor{0.5 0.09 0.12} % Also
28\global\urefurlonlylinktrue
29@end tex
30@end ifclear
31
32@ifnotdocbook
33@set BULLET @bullet{}
34@set MINUS @minus{}
35@end ifnotdocbook
36
37@ifdocbook
38@set BULLET
39@set MINUS
40@end ifdocbook
41
42@iftex
43@set TIMES @times
44@end iftex
45@ifnottex
46@set TIMES *
47@end ifnottex
48
49@c Let texinfo.tex give us full section titles
50@xrefautomaticsectiontitle on
51
52@c The following information should be updated here only!
53@c This sets the edition of the document, the version of gawk it
54@c applies to and all the info about who's publishing this edition
55
56@c These apply across the board.
57@set UPDATE-MONTH October, 2021
58@set VERSION 5.1
59@set PATCHLEVEL 1
60
61@set GAWKINETTITLE TCP/IP Internetworking with @command{gawk}
62@set GAWKWORKFLOWTITLE Participating in @command{gawk} Development
63@ifset FOR_PRINT
64@set TITLE Effective awk Programming
65@end ifset
66@ifclear FOR_PRINT
67@set TITLE GAWK: Effective AWK Programming
68@end ifclear
69@set SUBTITLE A User's Guide for GNU Awk
70@set EDITION 5.1
71
72@iftex
73@set DOCUMENT book
74@set CHAPTER chapter
75@set APPENDIX appendix
76@set SECTION section
77@set SUBSECTION subsection
78@set DARKCORNER @inmargin{@image{lflashlight,1cm}, @image{rflashlight,1cm}}
79@set COMMONEXT (c.e.)
80@set PAGE page
81@end iftex
82@ifinfo
83@set DOCUMENT Info file
84@set CHAPTER major node
85@set APPENDIX major node
86@set SECTION minor node
87@set SUBSECTION node
88@set DARKCORNER (d.c.)
89@set COMMONEXT (c.e.)
90@set PAGE screen
91@end ifinfo
92@ifhtml
93@set DOCUMENT Web page
94@set CHAPTER chapter
95@set APPENDIX appendix
96@set SECTION section
97@set SUBSECTION subsection
98@set DARKCORNER (d.c.)
99@set COMMONEXT (c.e.)
100@set PAGE screen
101@end ifhtml
102@ifdocbook
103@set DOCUMENT book
104@set CHAPTER chapter
105@set APPENDIX appendix
106@set SECTION section
107@set SUBSECTION subsection
108@set DARKCORNER (d.c.)
109@set COMMONEXT (c.e.)
110@set PAGE page
111@end ifdocbook
112@ifxml
113@set DOCUMENT book
114@set CHAPTER chapter
115@set APPENDIX appendix
116@set SECTION section
117@set SUBSECTION subsection
118@set DARKCORNER (d.c.)
119@set COMMONEXT (c.e.)
120@set PAGE page
121@end ifxml
122@ifplaintext
123@set DOCUMENT book
124@set CHAPTER chapter
125@set APPENDIX appendix
126@set SECTION section
127@set SUBSECTION subsection
128@set DARKCORNER (d.c.)
129@set COMMONEXT (c.e.)
130@set PAGE page
131@end ifplaintext
132
133@ifdocbook
134@c empty on purpose
135@set PART1
136@set PART2
137@set PART3
138@set PART4
139@end ifdocbook
140
141@ifnotdocbook
142@set PART1 Part I:@*
143@set PART2 Part II:@*
144@set PART3 Part III:@*
145@set PART4 Part IV:@*
146@end ifnotdocbook
147
148@c some special symbols
149@iftex
150@set LEQ @math{@leq}
151@set PI @math{@pi}
152@end iftex
153@ifdocbook
154@set LEQ @inlineraw{docbook, &le;}
155@set PI @inlineraw{docbook, &pgr;}
156@end ifdocbook
157@ifnottex
158@ifnotdocbook
159@set LEQ <=
160@set PI @i{pi}
161@end ifnotdocbook
162@end ifnottex
163
164@ifnottex
165@ifnotdocbook
166@macro ii{text}
167@i{\text\}
168@end macro
169@end ifnotdocbook
170@end ifnottex
171
172@ifdocbook
173@macro ii{text}
174@inlineraw{docbook,<lineannotation>\text\</lineannotation>}
175@end macro
176@end ifdocbook
177
178@ifclear FOR_PRINT
179@set FN file name
180@set FFN File name
181@set DF data file
182@set DDF Data file
183@set PVERSION version
184@end ifclear
185@ifset FOR_PRINT
186@set FN filename
187@set FFN Filename
188@set DF datafile
189@set DDF Datafile
190@set PVERSION version
191@end ifset
192
193@c For HTML, spell out email addresses, to avoid problems with
194@c address harvesters for spammers.
195@ifhtml
196@macro EMAIL{real,spelled}
197``\spelled\''
198@end macro
199@end ifhtml
200@ifnothtml
201@macro EMAIL{real,spelled}
202@email{\real\}
203@end macro
204@end ifnothtml
205
206@c Indexing macros
207@ifinfo
208
209@macro cindexawkfunc{name}
210@cindex @code{\name\}
211@end macro
212
213@macro cindexgawkfunc{name}
214@cindex @code{\name\}
215@end macro
216
217@end ifinfo
218
219@ifnotinfo
220
221@macro cindexawkfunc{name}
222@cindex @code{\name\()} function
223@end macro
224
225@macro cindexgawkfunc{name}
226@cindex @code{\name\()} function (@command{gawk})
227@end macro
228@end ifnotinfo
229
230@ignore
231Some comments on the layout for TeX.
2321. Use at least texinfo.tex 2016-02-05.07.
233@end ignore
234
235@c merge the function and variable indexes into the concept index
236@ifinfo
237@synindex fn cp
238@synindex vr cp
239@end ifinfo
240@iftex
241@syncodeindex fn cp
242@syncodeindex vr cp
243@end iftex
244@ifxml
245@syncodeindex fn cp
246@syncodeindex vr cp
247@end ifxml
248@ifdocbook
249@synindex fn cp
250@synindex vr cp
251@end ifdocbook
252
253@c If "finalout" is commented out, the printed output will show
254@c black boxes that mark lines that are too long.  Thus, it is
255@c unwise to comment it out when running a master in case there are
256@c overfulls which are deemed okay.
257
258@iftex
259@finalout
260@end iftex
261
262@c Enabled '-quotes in PDF files so that cut/paste works in
263@c more places.
264
265@codequoteundirected on
266@codequotebacktick on
267
268@copying
269@docbook
270<para>
271&ldquo;To boldly go where no man has gone before&rdquo; is a
272Registered Trademark of Paramount Pictures Corporation.</para>
273
274<para>Published by:</para>
275
276<literallayout class="normal">Free Software Foundation
27751 Franklin Street, Fifth Floor
278Boston, MA  02110-1301 USA
279Phone: +1-617-542-5942
280Fax: +1-617-542-2652
281Email: <email>gnu@@gnu.org</email>
282URL: <ulink url="https://www.gnu.org">https://www.gnu.org/</ulink></literallayout>
283
284<literallayout class="normal">Copyright &copy; 1989, 1991, 1992, 1993, 1996&ndash;2005, 2007, 2009&ndash;2021
285Free Software Foundation, Inc.
286All Rights Reserved.</literallayout>
287@end docbook
288
289@ifnotdocbook
290Copyright @copyright{} 1989, 1991, 1992, 1993, 1996--2005, 2007, 2009--2021 @*
291Free Software Foundation, Inc.
292@end ifnotdocbook
293@sp 2
294
295This is Edition @value{EDITION} of @cite{@value{TITLE}: @value{SUBTITLE}},
296for the @value{VERSION}.@value{PATCHLEVEL} (or later) version of the GNU
297implementation of AWK.
298
299Permission is granted to copy, distribute and/or modify this document
300under the terms of the GNU Free Documentation License, Version 1.3 or
301any later version published by the Free Software Foundation; with the
302Invariant Sections being ``GNU General Public License'', with the
303Front-Cover Texts being ``A GNU Manual'', and with the Back-Cover Texts
304as in (a) below.
305@ifclear FOR_PRINT
306A copy of the license is included in the section entitled
307``GNU Free Documentation License''.
308@end ifclear
309@ifset FOR_PRINT
310A copy of the license
311may be found on the Internet at
312@uref{https://www.gnu.org/software/gawk/manual/html_node/GNU-Free-Documentation-License.html,
313the GNU Project's website}.
314@end ifset
315
316@enumerate a
317@item
318The FSF's Back-Cover Text is: ``You have the freedom to
319copy and modify this GNU manual.''
320@end enumerate
321@end copying
322
323@c Comment out the "smallbook" for technical review.  Saves
324@c considerable paper.  Remember to turn it back on *before*
325@c starting the page-breaking work.
326
327@c 4/2002: Karl Berry recommends commenting out this and the
328@c `@setchapternewpage odd', and letting users use `texi2dvi -t'
329@c if they want to waste paper.
330@c @smallbook
331
332
333@c Uncomment this for the release.  Leaving it off saves paper
334@c during editing and review.
335@setchapternewpage odd
336
337@shorttitlepage GNU Awk
338@titlepage
339@title @value{TITLE}
340@subtitle @value{SUBTITLE}
341@subtitle Edition @value{EDITION}
342@subtitle @value{UPDATE-MONTH}
343@author Arnold D. Robbins
344
345@ifnotdocbook
346@c Include the Distribution inside the titlepage environment so
347@c that headings are turned off.  Headings on and off do not work.
348
349@page
350@vskip 0pt plus 1filll
351``To boldly go where no man has gone before'' is a
352Registered Trademark of Paramount Pictures Corporation. @*
353@c sorry, i couldn't resist
354@sp 3
355Published by:
356@sp 1
357
358Free Software Foundation @*
35951 Franklin Street, Fifth Floor @*
360Boston, MA  02110-1301 USA @*
361Phone: +1-617-542-5942 @*
362Fax: +1-617-542-2652 @*
363Email: @email{gnu@@gnu.org} @*
364URL: @uref{https://www.gnu.org/} @*
365
366@c This one is correct for gawk 3.1.0 from the FSF
367ISBN 1-882114-28-0 @*
368@sp 2
369@insertcopying
370@end ifnotdocbook
371@end titlepage
372
373@c Thanks to Bob Chassell for directions on doing dedications.
374@iftex
375@headings off
376@page
377@w{ }
378@sp 9
379@center @i{To my parents, for their love, and for the wonderful example they set for me.}
380@sp 1
381@center @i{To my wife, Miriam, for making me complete.
382Thank you for building your life together with me.}
383@sp 1
384@center @i{To our children, Chana, Rivka, Nachum, and Malka, for enriching our lives in innumerable ways.}
385@sp 1
386@w{ }
387@page
388@w{ }
389@page
390@headings on
391@end iftex
392
393@docbook
394<dedication>
395<para>To my parents, for their love, and for the wonderful
396example they set for me.</para>
397<para>To my wife, Miriam, for making me complete.
398Thank you for building your life together with me.</para>
399<para>To our children, Chana, Rivka, Nachum, and Malka,
400for enriching our lives in innumerable ways.</para>
401</dedication>
402@end docbook
403
404@iftex
405@headings off
406@evenheading @thispage@ @ @ @strong{@value{TITLE}} @| @|
407@oddheading  @| @| @strong{@thischapter}@ @ @ @thispage
408@end iftex
409
410@ifnottex
411@ifnotxml
412@ifnotdocbook
413@node Top
414@top General Introduction
415@c Preface node should come right after the Top
416@c node, in `unnumbered' sections, then the chapter, `What is gawk'.
417@c Licensing nodes are appendices, they're not central to AWK.
418
419This file documents @command{awk}, a program that you can use to select
420particular records in a file and perform operations upon them.
421
422@insertcopying
423
424@end ifnotdocbook
425@end ifnotxml
426@end ifnottex
427
428@menu
429* Foreword3::                      Some nice words about this
430                                   @value{DOCUMENT}.
431* Foreword4::                      More nice words.
432* Preface::                        What this @value{DOCUMENT} is about; brief
433                                   history and acknowledgments.
434* Getting Started::                A basic introduction to using
435                                   @command{awk}. How to run an @command{awk}
436                                   program. Command-line syntax.
437* Invoking Gawk::                  How to run @command{gawk}.
438* Regexp::                         All about matching things using regular
439                                   expressions.
440* Reading Files::                  How to read files and manipulate fields.
441* Printing::                       How to print using @command{awk}. Describes
442                                   the @code{print} and @code{printf}
443                                   statements. Also describes redirection of
444                                   output.
445* Expressions::                    Expressions are the basic building blocks
446                                   of statements.
447* Patterns and Actions::           Overviews of patterns and actions.
448* Arrays::                         The description and use of arrays. Also
449                                   includes array-oriented control statements.
450* Functions::                      Built-in and user-defined functions.
451* Library Functions::              A Library of @command{awk} Functions.
452* Sample Programs::                Many @command{awk} programs with complete
453                                   explanations.
454* Advanced Features::              Stuff for advanced users, specific to
455                                   @command{gawk}.
456* Internationalization::           Getting @command{gawk} to speak your
457                                   language.
458* Debugger::                       The @command{gawk} debugger.
459* Namespaces::                     How namespaces work in @command{gawk}.
460* Arbitrary Precision Arithmetic:: Arbitrary precision arithmetic with
461                                   @command{gawk}.
462* Dynamic Extensions::             Adding new built-in functions to
463                                   @command{gawk}.
464* Language History::               The evolution of the @command{awk}
465                                   language.
466* Installation::                   Installing @command{gawk} under various
467                                   operating systems.
468* Notes::                          Notes about adding things to @command{gawk}
469                                   and possible future work.
470* Basic Concepts::                 A very quick introduction to programming
471                                   concepts.
472* Glossary::                       An explanation of some unfamiliar terms.
473* Copying::                        Your right to copy and distribute
474                                   @command{gawk}.
475* GNU Free Documentation License:: The license for this @value{DOCUMENT}.
476* Index::                          Concept and Variable Index.
477
478@detailmenu
479* History::                             The history of @command{gawk} and
480                                        @command{awk}.
481* Names::                               What name to use to find
482                                        @command{awk}.
483* This Manual::                         Using this @value{DOCUMENT}. Includes
484                                        sample input files that you can use.
485* Conventions::                         Typographical Conventions.
486* Manual History::                      Brief history of the GNU project and
487                                        this @value{DOCUMENT}.
488* How To Contribute::                   Helping to save the world.
489* Acknowledgments::                     Acknowledgments.
490* Running gawk::                        How to run @command{gawk} programs;
491                                        includes command-line syntax.
492* One-shot::                            Running a short throwaway
493                                        @command{awk} program.
494* Read Terminal::                       Using no input files (input from the
495                                        keyboard instead).
496* Long::                                Putting permanent @command{awk}
497                                        programs in files.
498* Executable Scripts::                  Making self-contained @command{awk}
499                                        programs.
500* Comments::                            Adding documentation to @command{gawk}
501                                        programs.
502* Quoting::                             More discussion of shell quoting
503                                        issues.
504* DOS Quoting::                         Quoting in Windows Batch Files.
505* Sample Data Files::                   Sample data files for use in the
506                                        @command{awk} programs illustrated in
507                                        this @value{DOCUMENT}.
508* Very Simple::                         A very simple example.
509* Two Rules::                           A less simple one-line example using
510                                        two rules.
511* More Complex::                        A more complex example.
512* Statements/Lines::                    Subdividing or combining statements
513                                        into lines.
514* Other Features::                      Other Features of @command{awk}.
515* When::                                When to use @command{gawk} and when to
516                                        use other things.
517* Intro Summary::                       Summary of the introduction.
518* Command Line::                        How to run @command{awk}.
519* Options::                             Command-line options and their
520                                        meanings.
521* Other Arguments::                     Input file names and variable
522                                        assignments.
523* Naming Standard Input::               How to specify standard input with
524                                        other files.
525* Environment Variables::               The environment variables
526                                        @command{gawk} uses.
527* AWKPATH Variable::                    Searching directories for
528                                        @command{awk} programs.
529* AWKLIBPATH Variable::                 Searching directories for
530                                        @command{awk} shared libraries.
531* Other Environment Variables::         Other environment variables.
532* Exit Status::                         @command{gawk}'s exit status.
533* Include Files::                       Including other files into your
534                                        program.
535* Loading Shared Libraries::            Loading shared libraries into your
536                                        program.
537* Obsolete::                            Obsolete Options and/or features.
538* Undocumented::                        Undocumented Options and Features.
539* Invoking Summary::                    Invocation summary.
540* Regexp Usage::                        How to Use Regular Expressions.
541* Escape Sequences::                    How to write nonprinting characters.
542* Regexp Operators::                    Regular Expression Operators.
543* Regexp Operator Details::             The actual details.
544* Interval Expressions::                Notes on interval expressions.
545* Bracket Expressions::                 What can go between @samp{[...]}.
546* Leftmost Longest::                    How much text matches.
547* Computed Regexps::                    Using Dynamic Regexps.
548* GNU Regexp Operators::                Operators specific to GNU software.
549* Case-sensitivity::                    How to do case-insensitive matching.
550* Regexp Summary::                      Regular expressions summary.
551* Records::                             Controlling how data is split into
552                                        records.
553* awk split records::                   How standard @command{awk} splits
554                                        records.
555* gawk split records::                  How @command{gawk} splits records.
556* Fields::                              An introduction to fields.
557* Nonconstant Fields::                  Nonconstant Field Numbers.
558* Changing Fields::                     Changing the Contents of a Field.
559* Field Separators::                    The field separator and how to change
560                                        it.
561* Default Field Splitting::             How fields are normally separated.
562* Regexp Field Splitting::              Using regexps as the field separator.
563* Single Character Fields::             Making each character a separate
564                                        field.
565* Command Line Field Separator::        Setting @code{FS} from the command
566                                        line.
567* Full Line Fields::                    Making the full line be a single
568                                        field.
569* Field Splitting Summary::             Some final points and a summary table.
570* Constant Size::                       Reading constant width data.
571* Fixed width data::                    Processing fixed-width data.
572* Skipping intervening::                Skipping intervening fields.
573* Allowing trailing data::              Capturing optional trailing data.
574* Fields with fixed data::              Field values with fixed-width data.
575* Splitting By Content::                Defining Fields By Content.
576* More CSV::                            More on CSV files.
577* FS versus FPAT::                      A subtle difference.
578* Testing field creation::              Checking how @command{gawk} is
579                                        splitting records.
580* Multiple Line::                       Reading multiline records.
581* Getline::                             Reading files under explicit program
582                                        control using the @code{getline}
583                                        function.
584* Plain Getline::                       Using @code{getline} with no
585                                        arguments.
586* Getline/Variable::                    Using @code{getline} into a variable.
587* Getline/File::                        Using @code{getline} from a file.
588* Getline/Variable/File::               Using @code{getline} into a variable
589                                        from a file.
590* Getline/Pipe::                        Using @code{getline} from a pipe.
591* Getline/Variable/Pipe::               Using @code{getline} into a variable
592                                        from a pipe.
593* Getline/Coprocess::                   Using @code{getline} from a coprocess.
594* Getline/Variable/Coprocess::          Using @code{getline} into a variable
595                                        from a coprocess.
596* Getline Notes::                       Important things to know about
597                                        @code{getline}.
598* Getline Summary::                     Summary of @code{getline} Variants.
599* Read Timeout::                        Reading input with a timeout.
600* Retrying Input::                      Retrying input after certain errors.
601* Command-line directories::            What happens if you put a directory on
602                                        the command line.
603* Input Summary::                       Input summary.
604* Input Exercises::                     Exercises.
605* Print::                               The @code{print} statement.
606* Print Examples::                      Simple examples of @code{print}
607                                        statements.
608* Output Separators::                   The output separators and how to
609                                        change them.
610* OFMT::                                Controlling Numeric Output With
611                                        @code{print}.
612* Printf::                              The @code{printf} statement.
613* Basic Printf::                        Syntax of the @code{printf} statement.
614* Control Letters::                     Format-control letters.
615* Format Modifiers::                    Format-specification modifiers.
616* Printf Examples::                     Several examples.
617* Redirection::                         How to redirect output to multiple
618                                        files and pipes.
619* Special FD::                          Special files for I/O.
620* Special Files::                       File name interpretation in
621                                        @command{gawk}. @command{gawk} allows
622                                        access to inherited file descriptors.
623* Other Inherited Files::               Accessing other open files with
624                                        @command{gawk}.
625* Special Network::                     Special files for network
626                                        communications.
627* Special Caveats::                     Things to watch out for.
628* Close Files And Pipes::               Closing Input and Output Files and
629                                        Pipes.
630* Nonfatal::                            Enabling Nonfatal Output.
631* Output Summary::                      Output summary.
632* Output Exercises::                    Exercises.
633* Values::                              Constants, Variables, and Regular
634                                        Expressions.
635* Constants::                           String, numeric and regexp constants.
636* Scalar Constants::                    Numeric and string constants.
637* Nondecimal-numbers::                  What are octal and hex numbers.
638* Regexp Constants::                    Regular Expression constants.
639* Using Constant Regexps::              When and how to use a regexp constant.
640* Standard Regexp Constants::           Regexp constants in standard
641                                        @command{awk}.
642* Strong Regexp Constants::             Strongly typed regexp constants.
643* Variables::                           Variables give names to values for
644                                        later use.
645* Using Variables::                     Using variables in your programs.
646* Assignment Options::                  Setting variables on the command line
647                                        and a summary of command-line syntax.
648                                        This is an advanced method of input.
649* Conversion::                          The conversion of strings to numbers
650                                        and vice versa.
651* Strings And Numbers::                 How @command{awk} Converts Between
652                                        Strings And Numbers.
653* Locale influences conversions::       How the locale may affect conversions.
654* All Operators::                       @command{gawk}'s operators.
655* Arithmetic Ops::                      Arithmetic operations (@samp{+},
656                                        @samp{-}, etc.)
657* Concatenation::                       Concatenating strings.
658* Assignment Ops::                      Changing the value of a variable or a
659                                        field.
660* Increment Ops::                       Incrementing the numeric value of a
661                                        variable.
662* Truth Values and Conditions::         Testing for true and false.
663* Truth Values::                        What is ``true'' and what is
664                                        ``false''.
665* Typing and Comparison::               How variables acquire types and how
666                                        this affects comparison of numbers and
667                                        strings with @samp{<}, etc.
668* Variable Typing::                     String type versus numeric type.
669* Comparison Operators::                The comparison operators.
670* POSIX String Comparison::             String comparison with POSIX rules.
671* Boolean Ops::                         Combining comparison expressions using
672                                        boolean operators @samp{||} (``or''),
673                                        @samp{&&} (``and'') and @samp{!}
674                                        (``not'').
675* Conditional Exp::                     Conditional expressions select between
676                                        two subexpressions under control of a
677                                        third subexpression.
678* Function Calls::                      A function call is an expression.
679* Precedence::                          How various operators nest.
680* Locales::                             How the locale affects things.
681* Expressions Summary::                 Expressions summary.
682* Pattern Overview::                    What goes into a pattern.
683* Regexp Patterns::                     Using regexps as patterns.
684* Expression Patterns::                 Any expression can be used as a
685                                        pattern.
686* Ranges::                              Pairs of patterns specify record
687                                        ranges.
688* BEGIN/END::                           Specifying initialization and cleanup
689                                        rules.
690* Using BEGIN/END::                     How and why to use BEGIN/END rules.
691* I/O And BEGIN/END::                   I/O issues in BEGIN/END rules.
692* BEGINFILE/ENDFILE::                   Two special patterns for advanced
693                                        control.
694* Empty::                               The empty pattern, which matches every
695                                        record.
696* Using Shell Variables::               How to use shell variables with
697                                        @command{awk}.
698* Action Overview::                     What goes into an action.
699* Statements::                          Describes the various control
700                                        statements in detail.
701* If Statement::                        Conditionally execute some
702                                        @command{awk} statements.
703* While Statement::                     Loop until some condition is
704                                        satisfied.
705* Do Statement::                        Do specified action while looping
706                                        until some condition is satisfied.
707* For Statement::                       Another looping statement, that
708                                        provides initialization and increment
709                                        clauses.
710* Switch Statement::                    Switch/case evaluation for conditional
711                                        execution of statements based on a
712                                        value.
713* Break Statement::                     Immediately exit the innermost
714                                        enclosing loop.
715* Continue Statement::                  Skip to the end of the innermost
716                                        enclosing loop.
717* Next Statement::                      Stop processing the current input
718                                        record.
719* Nextfile Statement::                  Stop processing the current file.
720* Exit Statement::                      Stop execution of @command{awk}.
721* Built-in Variables::                  Summarizes the predefined variables.
722* User-modified::                       Built-in variables that you change to
723                                        control @command{awk}.
724* Auto-set::                            Built-in variables where @command{awk}
725                                        gives you information.
726* ARGC and ARGV::                       Ways to use @code{ARGC} and
727                                        @code{ARGV}.
728* Pattern Action Summary::              Patterns and Actions summary.
729* Array Basics::                        The basics of arrays.
730* Array Intro::                         Introduction to Arrays.
731* Reference to Elements::               How to examine one element of an
732                                        array.
733* Assigning Elements::                  How to change an element of an array.
734* Array Example::                       Basic Example of an Array
735* Scanning an Array::                   A variation of the @code{for}
736                                        statement. It loops through the
737                                        indices of an array's existing
738                                        elements.
739* Controlling Scanning::                Controlling the order in which arrays
740                                        are scanned.
741* Numeric Array Subscripts::            How to use numbers as subscripts in
742                                        @command{awk}.
* Uninitialized Subscripts::            Using uninitialized variables as
744                                        subscripts.
745* Delete::                              The @code{delete} statement removes an
746                                        element from an array.
747* Multidimensional::                    Emulating multidimensional arrays in
748                                        @command{awk}.
749* Multiscanning::                       Scanning multidimensional arrays.
750* Arrays of Arrays::                    True multidimensional arrays.
751* Arrays Summary::                      Summary of arrays.
752* Built-in::                            Summarizes the built-in functions.
753* Calling Built-in::                    How to call built-in functions.
754* Numeric Functions::                   Functions that work with numbers,
755                                        including @code{int()}, @code{sin()}
756                                        and @code{rand()}.
757* String Functions::                    Functions for string manipulation,
758                                        such as @code{split()}, @code{match()}
759                                        and @code{sprintf()}.
760* Gory Details::                        More than you want to know about
761                                        @samp{\} and @samp{&} with
762                                        @code{sub()}, @code{gsub()}, and
763                                        @code{gensub()}.
764* I/O Functions::                       Functions for files and shell
765                                        commands.
766* Time Functions::                      Functions for dealing with timestamps.
767* Bitwise Functions::                   Functions for bitwise operations.
768* Type Functions::                      Functions for type information.
769* I18N Functions::                      Functions for string translation.
770* User-defined::                        Describes User-defined functions in
771                                        detail.
772* Definition Syntax::                   How to write definitions and what they
773                                        mean.
774* Function Example::                    An example function definition and
775                                        what it does.
776* Function Calling::                    Calling user-defined functions.
777* Calling A Function::                  Don't use spaces.
778* Variable Scope::                      Controlling variable scope.
779* Pass By Value/Reference::             Passing parameters.
780* Function Caveats::                    Other points to know about functions.
781* Return Statement::                    Specifying the value a function
782                                        returns.
783* Dynamic Typing::                      How variable types can change at
784                                        runtime.
785* Indirect Calls::                      Choosing the function to call at
786                                        runtime.
787* Functions Summary::                   Summary of functions.
788* Library Names::                       How to best name private global
789                                        variables in library functions.
790* General Functions::                   Functions that are of general use.
791* Strtonum Function::                   A replacement for the built-in
792                                        @code{strtonum()} function.
793* Assert Function::                     A function for assertions in
794                                        @command{awk} programs.
795* Round Function::                      A function for rounding if
796                                        @code{sprintf()} does not do it
797                                        correctly.
798* Cliff Random Function::               The Cliff Random Number Generator.
799* Ordinal Functions::                   Functions for using characters as
800                                        numbers and vice versa.
801* Join Function::                       A function to join an array into a
802                                        string.
803* Getlocaltime Function::               A function to get formatted times.
804* Readfile Function::                   A function to read an entire file at
805                                        once.
806* Shell Quoting::                       A function to quote strings for the
807                                        shell.
808* Isnumeric Function::                  A function to test whether a value is
809                                        numeric.
810* Data File Management::                Functions for managing command-line
811                                        data files.
812* Filetrans Function::                  A function for handling data file
813                                        transitions.
814* Rewind Function::                     A function for rereading the current
815                                        file.
816* File Checking::                       Checking that data files are readable.
817* Empty Files::                         Checking for zero-length files.
818* Ignoring Assigns::                    Treating assignments as file names.
819* Getopt Function::                     A function for processing command-line
820                                        arguments.
821* Passwd Functions::                    Functions for getting user
822                                        information.
823* Group Functions::                     Functions for getting group
824                                        information.
825* Walking Arrays::                      A function to walk arrays of arrays.
826* Library Functions Summary::           Summary of library functions.
827* Library Exercises::                   Exercises.
828* Running Examples::                    How to run these examples.
829* Clones::                              Clones of common utilities.
830* Cut Program::                         The @command{cut} utility.
831* Egrep Program::                       The @command{egrep} utility.
832* Id Program::                          The @command{id} utility.
833* Split Program::                       The @command{split} utility.
834* Tee Program::                         The @command{tee} utility.
835* Uniq Program::                        The @command{uniq} utility.
836* Wc Program::                          The @command{wc} utility.
837* Bytes vs. Characters::                Modern character sets.
838* Using extensions::                    A brief intro to extensions.
839* @command{wc} program::                Code for @file{wc.awk}.
840* Miscellaneous Programs::              Some interesting @command{awk}
841                                        programs.
842* Dupword Program::                     Finding duplicated words in a
843                                        document.
844* Alarm Program::                       An alarm clock.
845* Translate Program::                   A program similar to the @command{tr}
846                                        utility.
847* Labels Program::                      Printing mailing labels.
848* Word Sorting::                        A program to produce a word usage
849                                        count.
850* History Sorting::                     Eliminating duplicate entries from a
851                                        history file.
852* Extract Program::                     Pulling out programs from Texinfo
853                                        source files.
854* Simple Sed::                          A Simple Stream Editor.
855* Igawk Program::                       A wrapper for @command{awk} that
856                                        includes files.
857* Anagram Program::                     Finding anagrams from a dictionary.
858* Signature Program::                   People do amazing things with too much
859                                        time on their hands.
860* Programs Summary::                    Summary of programs.
861* Programs Exercises::                  Exercises.
862* Nondecimal Data::                     Allowing nondecimal input data.
863* Array Sorting::                       Facilities for controlling array
864                                        traversal and sorting arrays.
* Controlling Array Traversal::         How to use @code{PROCINFO["sorted_in"]}.
866* Array Sorting Functions::             How to use @code{asort()} and
867                                        @code{asorti()}.
868* Two-way I/O::                         Two-way communications with another
869                                        process.
870* TCP/IP Networking::                   Using @command{gawk} for network
871                                        programming.
872* Profiling::                           Profiling your @command{awk} programs.
873* Extension Philosophy::                What should be built-in and what
874                                        should not.
875* Advanced Features Summary::           Summary of advanced features.
876* I18N and L10N::                       Internationalization and Localization.
877* Explaining gettext::                  How GNU @command{gettext} works.
878* Programmer i18n::                     Features for the programmer.
879* Translator i18n::                     Features for the translator.
880* String Extraction::                   Extracting marked strings.
881* Printf Ordering::                     Rearranging @code{printf} arguments.
882* I18N Portability::                    @command{awk}-level portability
883                                        issues.
884* I18N Example::                        A simple i18n example.
885* Gawk I18N::                           @command{gawk} is also
886                                        internationalized.
887* I18N Summary::                        Summary of I18N stuff.
888* Debugging::                           Introduction to @command{gawk}
889                                        debugger.
890* Debugging Concepts::                  Debugging in General.
891* Debugging Terms::                     Additional Debugging Concepts.
892* Awk Debugging::                       Awk Debugging.
893* Sample Debugging Session::            Sample debugging session.
894* Debugger Invocation::                 How to Start the Debugger.
895* Finding The Bug::                     Finding the Bug.
896* List of Debugger Commands::           Main debugger commands.
897* Breakpoint Control::                  Control of Breakpoints.
898* Debugger Execution Control::          Control of Execution.
899* Viewing And Changing Data::           Viewing and Changing Data.
900* Execution Stack::                     Dealing with the Stack.
901* Debugger Info::                       Obtaining Information about the
902                                        Program and the Debugger State.
903* Miscellaneous Debugger Commands::     Miscellaneous Commands.
904* Readline Support::                    Readline support.
905* Limitations::                         Limitations and future plans.
906* Debugging Summary::                   Debugging summary.
907* Global Namespace::                    The global namespace in standard
908                                        @command{awk}.
909* Qualified Names::                     How to qualify names with a namespace.
910* Default Namespace::                   The default namespace.
911* Changing The Namespace::              How to change the namespace.
912* Naming Rules::                        Namespace and Component Naming Rules.
913* Internal Name Management::            How names are stored internally.
914* Namespace Example::                   An example of code using a namespace.
915* Namespace And Features::              Namespaces and other @command{gawk}
916                                        features.
917* Namespace Summary::                   Summarizing namespaces.
918* Computer Arithmetic::                 A quick intro to computer math.
919* Math Definitions::                    Defining terms used.
920* MPFR features::                       The MPFR features in @command{gawk}.
921* FP Math Caution::                     Things to know.
922* Inexactness of computations::         Floating point math is not exact.
923* Inexact representation::              Numbers are not exactly represented.
924* Comparing FP Values::                 How to compare floating point values.
925* Errors accumulate::                   Errors get bigger as they go.
926* Getting Accuracy::                    Getting more accuracy takes some work.
927* Try To Round::                        Add digits and round.
928* Setting precision::                   How to set the precision.
929* Setting the rounding mode::           How to set the rounding mode.
930* Arbitrary Precision Integers::        Arbitrary Precision Integer Arithmetic
931                                        with @command{gawk}.
932* Checking for MPFR::                   How to check if MPFR is available.
933* POSIX Floating Point Problems::       Standards Versus Existing Practice.
934* Floating point summary::              Summary of floating point discussion.
935* Extension Intro::                     What is an extension.
936* Plugin License::                      A note about licensing.
937* Extension Mechanism Outline::         An outline of how it works.
938* Extension API Description::           A full description of the API.
939* Extension API Functions Introduction:: Introduction to the API functions.
940* General Data Types::                  The data types.
941* Memory Allocation Functions::         Functions for allocating memory.
942* Constructor Functions::               Functions for creating values.
943* API Ownership of MPFR and GMP Values:: Managing MPFR and GMP Values.
944* Registration Functions::              Functions to register things with
945                                        @command{gawk}.
946* Extension Functions::                 Registering extension functions.
947* Exit Callback Functions::             Registering an exit callback.
948* Extension Version String::            Registering a version string.
949* Input Parsers::                       Registering an input parser.
950* Output Wrappers::                     Registering an output wrapper.
951* Two-way processors::                  Registering a two-way processor.
952* Printing Messages::                   Functions for printing messages.
953* Updating @code{ERRNO}::               Functions for updating @code{ERRNO}.
954* Requesting Values::                   How to get a value.
955* Accessing Parameters::                Functions for accessing parameters.
956* Symbol Table Access::                 Functions for accessing global
957                                        variables.
958* Symbol table by name::                Accessing variables by name.
959* Symbol table by cookie::              Accessing variables by ``cookie''.
960* Cached values::                       Creating and using cached values.
961* Array Manipulation::                  Functions for working with arrays.
962* Array Data Types::                    Data types for working with arrays.
963* Array Functions::                     Functions for working with arrays.
964* Flattening Arrays::                   How to flatten arrays.
965* Creating Arrays::                     How to create and populate arrays.
966* Redirection API::                     How to access and manipulate
967                                        redirections.
968* Extension API Variables::             Variables provided by the API.
969* Extension Versioning::                API Version information.
970* Extension GMP/MPFR Versioning::       Version information about GMP and
971                                        MPFR.
972* Extension API Informational Variables:: Variables providing information about
973                                        @command{gawk}'s invocation.
974* Extension API Boilerplate::           Boilerplate code for using the API.
975* Changes from API V1::                 Changes from V1 of the API.
976* Finding Extensions::                  How @command{gawk} finds compiled
977                                        extensions.
978* Extension Example::                   Example C code for an extension.
979* Internal File Description::           What the new functions will do.
980* Internal File Ops::                   The code for internal file operations.
981* Using Internal File Ops::             How to use an external extension.
982* Extension Samples::                   The sample extensions that ship with
983                                        @command{gawk}.
984* Extension Sample File Functions::     The file functions sample.
985* Extension Sample Fnmatch::            An interface to @code{fnmatch()}.
986* Extension Sample Fork::               An interface to @code{fork()} and
987                                        other process functions.
988* Extension Sample Inplace::            Enabling in-place file editing.
989* Extension Sample Ord::                Character to value to character
990                                        conversions.
991* Extension Sample Readdir::            An interface to @code{readdir()}.
992* Extension Sample Revout::             Reversing output sample output
993                                        wrapper.
994* Extension Sample Rev2way::            Reversing data sample two-way
995                                        processor.
996* Extension Sample Read write array::   Serializing an array to a file.
997* Extension Sample Readfile::           Reading an entire file into a string.
998* Extension Sample Time::               An interface to @code{gettimeofday()}
999                                        and @code{sleep()}.
1000* Extension Sample API Tests::          Tests for the API.
1001* gawkextlib::                          The @code{gawkextlib} project.
1002* Extension summary::                   Extension summary.
1003* Extension Exercises::                 Exercises.
1004* V7/SVR3.1::                           The major changes between V7 and
1005                                        System V Release 3.1.
1006* SVR4::                                Minor changes between System V
1007                                        Releases 3.1 and 4.
1008* POSIX::                               New features from the POSIX standard.
1009* BTL::                                 New features from Brian Kernighan's
1010                                        version of @command{awk}.
1011* POSIX/GNU::                           The extensions in @command{gawk} not
1012                                        in POSIX @command{awk}.
1013* Feature History::                     The history of the features in
1014                                        @command{gawk}.
1015* Common Extensions::                   Common Extensions Summary.
1016* Ranges and Locales::                  How locales used to affect regexp
1017                                        ranges.
1018* Contributors::                        The major contributors to
1019                                        @command{gawk}.
1020* History summary::                     History summary.
1021* Gawk Distribution::                   What is in the @command{gawk}
1022                                        distribution.
1023* Getting::                             How to get the distribution.
1024* Extracting::                          How to extract the distribution.
1025* Distribution contents::               What is in the distribution.
1026* Unix Installation::                   Installing @command{gawk} under
1027                                        various versions of Unix.
1028* Quick Installation::                  Compiling @command{gawk} under Unix.
1029* Compiling with MPFR::                 Building with MPFR.
1030* Shell Startup Files::                 Shell convenience functions.
1031* Additional Configuration Options::    Other compile-time options.
1032* Configuration Philosophy::            How it's all supposed to work.
1033* Compiling from Git::                  Compiling from Git.
1034* Building the Documentation::          Building the Documentation.
1035* Non-Unix Installation::               Installation on Other Operating
1036                                        Systems.
1037* PC Installation::                     Installing and Compiling
1038                                        @command{gawk} on Microsoft Windows.
1039* PC Binary Installation::              Installing a prepared distribution.
1040* PC Compiling::                        Compiling @command{gawk} for
1041                                        Windows32.
1042* PC Using::                            Running @command{gawk} on Windows32.
1043* Cygwin::                              Building and running @command{gawk}
1044                                        for Cygwin.
1045* MSYS::                                Using @command{gawk} In The MSYS
1046                                        Environment.
1047* VMS Installation::                    Installing @command{gawk} on VMS.
1048* VMS Compilation::                     How to compile @command{gawk} under
1049                                        VMS.
1050* VMS Dynamic Extensions::              Compiling @command{gawk} dynamic
1051                                        extensions on VMS.
1052* VMS Installation Details::            How to install @command{gawk} under
1053                                        VMS.
1054* VMS Running::                         How to run @command{gawk} under VMS.
1055* VMS GNV::                             The VMS GNV Project.
1056* Bugs::                                Reporting Problems and Bugs.
1057* Bug definition::                      Defining what is and is not a bug.
1058* Bug address::                         Where to send reports to.
1059* Usenet::                              Where not to send reports to.
1060* Performance bugs::                    What to do if you think there is a
1061                                        performance issue.
1062* Asking for help::                     Dealing with non-bug questions.
1063* Maintainers::                         Maintainers of non-*nix ports.
1064* Other Versions::                      Other freely available @command{awk}
1065                                        implementations.
1066* Installation summary::                Summary of installation.
1067* Compatibility Mode::                  How to disable certain @command{gawk}
1068                                        extensions.
1069* Additions::                           Making Additions To @command{gawk}.
1070* Accessing The Source::                Accessing the Git repository.
1071* Adding Code::                         Adding code to the main body of
1072                                        @command{gawk}.
1073* New Ports::                           Porting @command{gawk} to a new
1074                                        operating system.
1075* Derived Files::                       Why derived files are kept in the Git
1076                                        repository.
1077* Future Extensions::                   New features that may be implemented
1078                                        one day.
1079* Implementation Limitations::          Some limitations of the
1080                                        implementation.
1081* Extension Design::                    Design notes about the extension API.
1082* Old Extension Problems::              Problems with the old mechanism.
1083* Extension New Mechanism Goals::       Goals for the new mechanism.
1084* Extension Other Design Decisions::    Some other design decisions.
1085* Extension Future Growth::             Some room for future growth.
1086* Notes summary::                       Summary of implementation notes.
1087* Basic High Level::                    The high level view.
1088* Basic Data Typing::                   A very quick intro to data types.
1089@end detailmenu
1090@end menu
1091
1092@c dedication for Info file
1093@ifinfo
1094To my parents, for their love, and for the wonderful
1095example they set for me.
1096@sp 1
1097To my wife Miriam, for making me complete.
1098Thank you for building your life together with me.
1099@sp 1
1100To our children Chana, Rivka, Nachum and Malka,
for enriching our lives in innumerable ways.
1102@end ifinfo
1103
1104@summarycontents
1105@contents
1106
1107@node Foreword3
1108@unnumbered Foreword to the Third Edition
1109
1110@c This bit is post-processed by a script which turns the chapter
1111@c tag into a preface tag, and moves this stuff to before the title.
1112@c Bleah.
1113@docbook
1114  <prefaceinfo>
1115    <author>
1116      <firstname>Michael</firstname>
1117      <surname>Brennan</surname>
1118      <!-- can't put mawk into command tags. sigh. -->
1119      <affiliation><jobtitle>Author of mawk</jobtitle></affiliation>
1120    </author>
1121    <date>March 2001</date>
1122   </prefaceinfo>
1123@end docbook
1124
1125Arnold Robbins and I are good friends. We were introduced
1126@c 11 years ago
1127in 1990
1128by circumstances---and our favorite programming language, AWK.
1129The circumstances started a couple of years
1130earlier. I was working at a new job and noticed an unplugged
1131Unix computer sitting in the corner.  No one knew how to use it,
1132and neither did I.  However,
1133a couple of days later, it was running, and
1134I was @code{root} and the one-and-only user.
1135That day, I began the transition from statistician to Unix programmer.
1136
1137On one of many trips to the library or bookstore in search of
1138books on Unix, I found the gray AWK book, a.k.a.@:
1139Alfred V.@: Aho, Brian W.@: Kernighan, and
1140Peter J.@: Weinberger's @cite{The AWK Programming Language} (Addison-Wesley,
1988).  @command{awk}'s simple programming paradigm---find a pattern in the
1142input and then perform an action---often reduced complex or tedious
1143data manipulations to a few lines of code.  I was excited to try my
1144hand at programming in AWK.
1145
Alas, the @command{awk} on my computer was a limited version of the
1147language described in the gray book.  I discovered that my computer
1148had ``old @command{awk}'' and the book described
1149``new @command{awk}.''
1150I learned that this was typical; the old version refused to step
1151aside or relinquish its name.  If a system had a new @command{awk}, it was
1152invariably called @command{nawk}, and few systems had it.
1153The best way to get a new @command{awk} was to @command{ftp} the source code for
1154@command{gawk} from @code{prep.ai.mit.edu}.  @command{gawk} was a version of
1155new @command{awk} written by David Trueman and Arnold, and available under
1156the GNU General Public License.
1157
1158(Incidentally,
1159it's no longer difficult to find a new @command{awk}. @command{gawk} ships with
1160GNU/Linux, and you can download binaries or source code for almost
1161any system; my wife uses @command{gawk} on her VMS box.)
1162
1163My Unix system started out unplugged from the wall; it certainly was not
1164plugged into a network.  So, oblivious to the existence of @command{gawk}
1165and the Unix community in general, and desiring a new @command{awk}, I wrote
1166my own, called @command{mawk}.
1167Before I was finished, I knew about @command{gawk},
1168but it was too late to stop, so I eventually posted
1169to a @code{comp.sources} newsgroup.
1170
1171A few days after my posting, I got a friendly email
1172from Arnold introducing
himself.  He suggested we share design and algorithms and
1174attached a draft of the POSIX standard so
1175that I could update @command{mawk} to support language extensions added
1176after publication of @cite{The AWK Programming Language}.
1177
1178Frankly, if our roles had
1179been reversed, I would not have been so open and we probably would
1180have never met.  I'm glad we did meet.
1181He is an AWK expert's AWK expert and a genuinely nice person.
1182Arnold contributes significant amounts of his
1183expertise and time to the Free Software Foundation.
1184
1185This book is the @command{gawk} reference manual, but at its core it
1186is a book about AWK programming that
1187will appeal to a wide audience.
1188It is a definitive reference to the AWK language as defined by the
1987 Bell Laboratories release and codified in the 1992 POSIX Utilities
1190standard.
1191
1192On the other hand, the novice AWK programmer can study
1193a wealth of practical programs that emphasize
1194the power of AWK's basic idioms:
1195data-driven control flow, pattern matching with regular expressions,
1196and associative arrays.
1197Those looking for something new can try out @command{gawk}'s
1198interface to network protocols via special @file{/inet} files.
1199
1200The programs in this book make clear that an AWK program is
1201typically much smaller and faster to develop than
1202a counterpart written in C.
1203Consequently, there is often a payoff to prototyping an
1204algorithm or design in AWK to get it running quickly and expose
1205problems early. Often, the interpreted performance is adequate
1206and the AWK prototype becomes the product.
1207
The new @command{pgawk} (profiling @command{gawk}) produces
1209program execution counts.
1210I recently experimented with an algorithm that for
1211@ifnotdocbook
1212@math{n}
1213@end ifnotdocbook
1214@ifdocbook
1215@i{n}
1216@end ifdocbook
1217lines of input, exhibited
1218@tex
1219$\sim\! Cn^2$
1220@end tex
1221@ifnottex
1222@ifnotdocbook
1223~ C n^2
1224@end ifnotdocbook
1225@end ifnottex
1226@docbook
1227<emphasis>&sim; Cn<superscript>2</superscript></emphasis>
1228@end docbook
1229performance, while
1230theory predicted
1231@tex
1232$\sim\! Cn\log n$
1233@end tex
1234@ifnottex
1235@ifnotdocbook
1236~ C n log n
1237@end ifnotdocbook
1238@end ifnottex
1239@docbook
1240<emphasis>&sim; Cn log n</emphasis>
1241@end docbook
1242behavior. A few minutes poring
1243over the @file{awkprof.out} profile pinpointed the problem to
1244a single line of code.  @command{pgawk} is a welcome addition to
1245my programmer's toolbox.
1246
1247Arnold has distilled over a decade of experience writing and
1248using AWK programs, and developing @command{gawk}, into this book.  If you use
1249AWK or want to learn how, then read this book.
1250
1251@ifnotdocbook
1252@cindex Brennan, Michael
1253@display
1254Michael Brennan
1255Author of @command{mawk}
1256March 2001
1257@end display
1258@end ifnotdocbook
1259
1260@node Foreword4
1261@unnumbered Foreword to the Fourth Edition
1262
1263@c This bit is post-processed by a script which turns the chapter
1264@c tag into a preface tag, and moves this stuff to before the title.
1265@c Bleah.
1266@docbook
1267  <prefaceinfo>
1268    <author>
1269      <firstname>Michael</firstname>
1270      <surname>Brennan</surname>
1271      <!-- can't put mawk into command tags. sigh. -->
1272      <affiliation><jobtitle>Author of mawk</jobtitle></affiliation>
1273    </author>
1274    <date>October 2014</date>
1275   </prefaceinfo>
1276@end docbook
1277
1278Some things don't change.  Thirteen years ago I wrote:
1279``If you use AWK or want to learn how, then read this book.''
1280True then, and still true today.
1281
1282Learning to use a programming language is about more than mastering the
1283syntax.  One needs to acquire an understanding of how to use the
1284features of the language to solve practical programming problems.
A focus of this book is its many examples that show how to use AWK.
1286
1287Some things do change. Our computers are much faster and have more memory.
1288Consequently, speed and storage inefficiencies of a high-level language
1289matter less.  Prototyping in AWK and then rewriting in C for performance
1290reasons happens less, because more often the prototype is fast enough.
1291
1292Of course, there are computing operations that are best done in C or C++.
1293With @command{gawk} 4.1 and later, you do not have to choose between writing
1294your program in AWK or in C/C++.  You can write most of your
1295program in AWK and the aspects that require C/C++ capabilities can be written
in C/C++, and then the pieces are glued together when the @command{gawk} module loads
1297the C/C++ module as a dynamic plug-in.
1298@c Chapter 16
1299@ref{Dynamic Extensions},
1300has all the
1301details, and, as expected, many examples to help you learn the ins and outs.
1302
1303I enjoy programming in AWK and had fun (re)reading this book.
1304I think you will too.
1305
1306@ifnotdocbook
1307@cindex Brennan, Michael
1308@display
1309Michael Brennan
1310Author of @command{mawk}
1311October 2014
1312@end display
1313@end ifnotdocbook
1314
1315@node Preface
1316@unnumbered Preface
1317@c I saw a comment somewhere that the preface should describe the book itself,
1318@c and the introduction should describe what the book covers.
1319@c
1320@c 12/2000: Chuck wants the preface & intro combined.
1321
1322@c This bit is post-processed by a script which turns the chapter
1323@c tag into a preface tag, and moves this stuff to before the title.
1324@c Bleah.
1325@docbook
1326  <prefaceinfo>
1327    <author>
1328      <firstname>Arnold</firstname>
1329      <surname>Robbins</surname>
1330      <affiliation><jobtitle>Nof Ayalon</jobtitle></affiliation>
1331      <affiliation><jobtitle>Israel</jobtitle></affiliation>
1332    </author>
1333    <date>February 2015</date>
1334   </prefaceinfo>
1335@end docbook
1336
1337@cindex @command{awk}
1338Several kinds of tasks occur repeatedly when working with text files.
1339You might want to extract certain lines and discard the rest.  Or you
1340may need to make changes wherever certain patterns appear, but leave the
1341rest of the file alone.  Such jobs are often easy with @command{awk}.
1342The @command{awk} utility interprets a special-purpose programming
1343language that makes it easy to handle simple data-reformatting jobs.
1344
1345@cindex @command{gawk}
1346The GNU implementation of @command{awk} is called @command{gawk}; if you
1347invoke it with the proper options or environment variables,
1348it is fully compatible with
1349the POSIX@footnote{The 2018 POSIX standard is accessible online at
1350@w{@url{https://pubs.opengroup.org/onlinepubs/9699919799/}.}}
1351specification of the @command{awk} language
1352and with the Unix version of @command{awk} maintained
1353by Brian Kernighan.
1354This means that all
1355properly written @command{awk} programs should work with @command{gawk}.
1356So most of the time, we don't distinguish between @command{gawk} and other
1357@command{awk} implementations.
1358
1359@cindex @command{awk} @subentry POSIX and @seealso{POSIX @command{awk}}
1360@cindex @command{awk} @subentry POSIX and
1361@cindex POSIX @subentry @command{awk} and
1362@cindex @command{gawk} @subentry @command{awk} and
1363@cindex @command{awk} @subentry @command{gawk} and
1364@cindex @command{awk} @subentry uses for
1365Using @command{awk} you can:
1366
1367@itemize @value{BULLET}
1368@item
1369Manage small, personal databases
1370
1371@item
1372Generate reports
1373
1374@item
1375Validate data
1376
1377@item
1378Produce indexes and perform other document-preparation tasks
1379
1380@item
1381Experiment with algorithms that you can adapt later to other computer
1382languages
1383@end itemize
1384
1385@cindex @command{awk} @seealso{@command{gawk}}
1386@cindex @command{gawk} @seealso{@command{awk}}
1387@cindex @command{gawk} @subentry uses for
1388In addition,
1389@command{gawk}
1390provides facilities that make it easy to:
1391
1392@itemize @value{BULLET}
1393@item
1394Extract bits and pieces of data for processing
1395
1396@item
1397Sort data
1398
1399@item
1400Perform simple network communications
1401
1402@item
1403Profile and debug @command{awk} programs
1404
1405@item
1406Extend the language with functions written in C or C++
1407@end itemize
1408
1409This @value{DOCUMENT} teaches you about the @command{awk} language and
1410how you can use it effectively.  You should already be familiar with basic
1411system commands, such as @command{cat} and @command{ls},@footnote{These utilities
1412are available on POSIX-compliant systems, as well as on traditional
1413Unix-based systems. If you are using some other operating system, you still need to
1414be familiar with the ideas of I/O redirection and pipes.} as well as basic shell
1415facilities, such as input/output (I/O) redirection and pipes.
1416
1417@cindex GNU @command{awk} @seeentry{@command{gawk}}
1418Implementations of the @command{awk} language are available for many
1419different computing environments.  This @value{DOCUMENT}, while describing
1420the @command{awk} language in general, also describes the particular
1421implementation of @command{awk} called @command{gawk} (which stands for
``GNU @command{awk}'').  @command{gawk} runs on a broad range of Unix systems,
from Intel-architecture PC-based computers
up through large-scale systems.
1425@command{gawk} has also been ported to Mac OS X,
1426Microsoft Windows
1427(all versions),
1428and OpenVMS.@footnote{Some other, obsolete systems to which @command{gawk}
1429was once ported are no longer supported and the code for those systems
1430has been removed.}
1431
1432@menu
1433* History::                     The history of @command{gawk} and
1434                                @command{awk}.
1435* Names::                       What name to use to find @command{awk}.
1436* This Manual::                 Using this @value{DOCUMENT}. Includes sample
1437                                input files that you can use.
1438* Conventions::                 Typographical Conventions.
1439* Manual History::              Brief history of the GNU project and this
1440                                @value{DOCUMENT}.
1441* How To Contribute::           Helping to save the world.
1442* Acknowledgments::             Acknowledgments.
1443@end menu
1444
1445@node History
1446@unnumberedsec History of @command{awk} and @command{gawk}
1447@cindex recipe for a programming language
1448@cindex programming language, recipe for
1449@sidebar Recipe for a Programming Language
1450
1451@multitable {2 parts} {1 part  @code{egrep}} {1 part  @code{snobol}}
1452@item @tab 1 part  @code{egrep} @tab 1 part  @code{snobol}
1453@item @tab 2 parts @code{ed} @tab 3 parts C
1454@end multitable
1455
1456Blend all parts well using @code{lex} and @code{yacc}.
1457Document minimally and release.
1458
1459After eight years, add another part @code{egrep} and two
1460more parts C.  Document very well and release.
1461@end sidebar
1462
1463@cindex Aho, Alfred
1464@cindex Weinberger, Peter
1465@cindex Kernighan, Brian
1466@cindex @command{awk} @subentry history of
1467The name @command{awk} comes from the initials of its designers: Alfred V.@:
1468Aho, Peter J.@: Weinberger, and Brian W.@: Kernighan.  The original version of
1469@command{awk} was written in 1977 at AT&T Bell Laboratories.
1470In 1985, a new version made the programming
1471language more powerful, introducing user-defined functions, multiple input
1472streams, and computed regular expressions.
1473This new version became widely available with Unix System V
1474Release 3.1 (1987).
1475The version in System V Release 4 (1989) added some new features and cleaned
1476up the behavior in some of the ``dark corners'' of the language.
1477The specification for @command{awk} in the POSIX Command Language
1478and Utilities standard further clarified the language.
1479Both the @command{gawk} designers and the original @command{awk} designers at Bell Laboratories
1480provided feedback for the POSIX specification.
1481
1482@cindex Rubin, Paul
1483@cindex Fenlason, Jay
1484@cindex Trueman, David
1485Paul Rubin wrote @command{gawk} in 1986.
1486Jay Fenlason completed it, with advice from Richard Stallman.  John Woods
1487contributed parts of the code as well.  In 1988 and 1989, David Trueman, with
1488help from me, thoroughly reworked @command{gawk} for compatibility
1489with the newer @command{awk}.
1490Circa 1994, I became the primary maintainer.
1491Current development focuses on bug fixes,
1492performance improvements, standards compliance, and, occasionally, new features.
1493
1494In May 1997, J@"urgen Kahrs felt the need for network access
1495from @command{awk}, and with a little help from me, set about adding
1496features to do this for @command{gawk}.  At that time, he also
1497wrote the bulk of
1498@cite{@value{GAWKINETTITLE}}
1499(a separate document, available as part of the @command{gawk} distribution).
1500His code finally became part of the main @command{gawk} distribution
1501with @command{gawk} @value{PVERSION} 3.1.
1502
1503John Haque rewrote the @command{gawk} internals, in the process providing
1504an @command{awk}-level debugger. This version became available as
1505@command{gawk} @value{PVERSION} 4.0 in 2011.
1506
@xref{Contributors},
1508for a full list of those who have made important contributions to @command{gawk}.
1509
1510@node Names
1511@unnumberedsec A Rose by Any Other Name
1512
1513@cindex @command{awk} @subentry new vs.@: old
1514The @command{awk} language has evolved over the years. Full details are
1515provided in @ref{Language History}.
1516The language described in this @value{DOCUMENT}
1517is often referred to as ``new @command{awk}.''
1518By analogy, the original version of @command{awk} is
1519referred to as ``old @command{awk}.''
1520
1521On most current systems, when you run the @command{awk} utility
1522you get some version of new @command{awk}.@footnote{Only
1523Solaris systems still use an old @command{awk} for the
1524default @command{awk} utility. A more modern @command{awk} lives in
1525@file{/usr/xpg6/bin} on these systems.} If your system's standard
1526@command{awk} is the old one, you will see something like this
1527if you try the following test program:
1528
1529@example
1530@group
1531$ @kbd{awk 1 /dev/null}
1532@error{} awk: syntax error near line 1
1533@error{} awk: bailing out near line 1
1534@end group
1535@end example
1536
1537@noindent
1538In this case, you should find a version of new @command{awk},
1539or just install @command{gawk}!
1540
1541Throughout this @value{DOCUMENT}, whenever we refer to a language feature
1542that should be available in any complete implementation of POSIX @command{awk},
1543we simply use the term @command{awk}.  When referring to a feature that is
1544specific to the GNU implementation, we use the term @command{gawk}.
1545
1546@node This Manual
1547@unnumberedsec Using This Book
1548@cindex @command{awk} @subentry terms describing
1549
1550The term @command{awk} refers to a particular program as well as to the language you
1551use to tell this program what to do.  When we need to be careful, we call
1552the language ``the @command{awk} language,''
1553and the program ``the @command{awk} utility.''
1554This @value{DOCUMENT} explains
1555both how to write programs in the @command{awk} language and how to
1556run the @command{awk} utility.
1557The term ``@command{awk} program'' refers to a program written by you in
1558the @command{awk} programming language.
1559
1560@cindex @command{gawk} @subentry @command{awk} and
1561@cindex @command{awk} @subentry @command{gawk} and
1562@cindex POSIX @command{awk}
1563Primarily, this @value{DOCUMENT} explains the features of @command{awk}
1564as defined in the POSIX standard.  It does so in the context of the
1565@command{gawk} implementation.  While doing so, it also
1566attempts to describe important differences between @command{gawk}
1567and other @command{awk}
1568@ifclear FOR_PRINT
1569implementations.@footnote{All such differences
1570appear in the index under the
1571entry ``differences in @command{awk} and @command{gawk}.''}
1572@end ifclear
1573@ifset FOR_PRINT
1574implementations.
1575@end ifset
1576Finally, it notes any @command{gawk} features that are not in
1577the POSIX standard for @command{awk}.
1578
1579@ifnotinfo
1580This @value{DOCUMENT} has the difficult task of being both a tutorial and a reference.
1581If you are a novice, feel free to skip over details that seem too complex.
1582You should also ignore the many cross-references; they are for the
1583expert user and for the Info and
1584@uref{https://www.gnu.org/software/gawk/manual/, HTML}
1585versions of the @value{DOCUMENT}.
1586@end ifnotinfo
1587
1588There are sidebars
1589scattered throughout the @value{DOCUMENT}.
1590They add a more complete explanation of points that are relevant, but not likely
1591to be of interest on first reading.
1592@ifclear FOR_PRINT
1593All appear in the index, under the heading ``sidebar.''
1594@end ifclear
1595
1596Most of the time, the examples use complete @command{awk} programs.
1597Some of the more advanced @value{SECTION}s show only the part of the @command{awk}
1598program that illustrates the concept being described.
1599
1600Although this @value{DOCUMENT} is aimed principally at people who have not been
1601exposed
1602to @command{awk}, there is a lot of information here that even the @command{awk}
1603expert should find useful.  In particular, the description of POSIX
1604@command{awk} and the example programs in
1605@ref{Library Functions}, and
1606@ifnotdocbook
1607in
1608@end ifnotdocbook
1609@ref{Sample Programs},
1610should be of interest.
1611
1612This @value{DOCUMENT} is split into several parts, as follows:
1613
1614@c FULLXREF ON
1615
1616@itemize @value{BULLET}
1617@item
1618Part I describes the @command{awk} language and the @command{gawk} program in detail.
1619It starts with the basics, and continues through all of the features of @command{awk}.
1620It contains the following chapters:
1621
1622@c nested
1623@itemize @value{MINUS}
1624@item
1625@ref{Getting Started},
1626provides the essentials you need to know to begin using @command{awk}.
1627
1628@item
1629@ref{Invoking Gawk},
1630describes how to run @command{gawk}, the meaning of its
1631command-line options, and how it finds @command{awk}
1632program source files.
1633
1634@item
1635@ref{Regexp},
1636introduces regular expressions in general, and in particular the flavors
1637supported by POSIX @command{awk} and @command{gawk}.
1638
1639@item
1640@ref{Reading Files},
1641describes how @command{awk} reads your data.
1642It introduces the concepts of records and fields, as well
1643as the @code{getline} command.
1644I/O redirection is first described here.
1645Network I/O is also briefly introduced here.
1646
1647@item
1648@ref{Printing},
1649describes how @command{awk} programs can produce output with
1650@code{print} and @code{printf}.
1651
1652@item
1653@ref{Expressions},
1654describes expressions, which are the basic building blocks
1655for getting most things done in a program.
1656
1657@item
1658@ref{Patterns and Actions},
1659describes how to write patterns for matching records, actions for
1660doing something when a record is matched, and the predefined variables
1661@command{awk} and @command{gawk} use.
1662
1663@item
1664@ref{Arrays},
1665covers @command{awk}'s one-and-only data structure: the associative array.
1666Deleting array elements and whole arrays is described, as well as
1667sorting arrays in @command{gawk}.  The @value{CHAPTER} also describes how
1668@command{gawk} provides arrays of arrays.
1669
1670@item
1671@ref{Functions},
1672describes the built-in functions @command{awk} and @command{gawk} provide,
1673as well as how to define your own functions.  It also discusses how
1674@command{gawk} lets you call functions indirectly.
1675@end itemize
1676
1677@item
1678Part II shows how to use @command{awk} and @command{gawk} for problem solving.
There is a lot of code here for you to read and learn from.
1680This part contains the following chapters:
1681
1682@c nested
1683@itemize @value{MINUS}
1684@item
1685@ref{Library Functions}, provides a number of functions meant to
1686be used from main @command{awk} programs.
1687
1688@item
1689@ref{Sample Programs},
1690provides many sample @command{awk} programs.
1691@end itemize
1692
1693Reading these two chapters allows you to see @command{awk}
1694solving real problems.
1695
1696@item
1697Part III focuses on features specific to @command{gawk}.
1698It contains the following chapters:
1699
1700@c nested
1701@itemize @value{MINUS}
1702@item
1703@ref{Advanced Features},
1704describes a number of advanced features.
1705Of particular note
1706are the abilities to control the order of array traversal,
1707have two-way communications with another process,
1708perform TCP/IP networking, and
1709profile your @command{awk} programs.
1710
1711@item
1712@ref{Internationalization},
1713describes special features for translating program
1714messages into different languages at runtime.
1715
1716@item
1717@ref{Debugger}, describes the @command{gawk} debugger.
1718
1719@item
1720@ref{Namespaces}, describes how @command{gawk} allows variables and/or
1721functions of the same name to be in different namespaces.
1722
1723@item
1724@ref{Arbitrary Precision Arithmetic},
1725describes advanced arithmetic facilities.
1726
1727@item
1728@ref{Dynamic Extensions}, describes how to add new variables and
1729functions to @command{gawk} by writing extensions in C or C++.
1730@end itemize
1731
1732@item
1733@ifclear FOR_PRINT
1734Part IV provides the appendices, the Glossary, and two licenses that cover
1735the @command{gawk} source code and this @value{DOCUMENT}, respectively.
1736It contains the following appendices:
1737@end ifclear
1738@ifset FOR_PRINT
1739Part IV provides the following appendices,
1740including the GNU General Public License:
1741@end ifset
1742
1743@itemize @value{MINUS}
1744@item
1745@ref{Language History},
1746describes how the @command{awk} language has evolved since
1747its first release to the present.  It also describes how @command{gawk}
1748has acquired features over time.
1749
1750@item
1751@ref{Installation},
1752describes how to get @command{gawk}, how to compile it
1753on POSIX-compatible systems,
1754and how to compile and use it on different
1755non-POSIX systems.  It also describes how to report bugs
1756in @command{gawk} and where to get other freely
1757available @command{awk} implementations.
1758
1759@ifset FOR_PRINT
1760@item
1761@ref{Copying},
1762presents the license that covers the @command{gawk} source code.
1763@end ifset
1764
1765@ifclear FOR_PRINT
1766@item
1767@ref{Notes},
1768describes how to disable @command{gawk}'s extensions, as
1769well as how to contribute new code to @command{gawk},
1770and some possible future directions for @command{gawk} development.
1771
1772@item
1773@ref{Basic Concepts},
1774provides some very cursory background material for those who
1775are completely unfamiliar with computer programming.
1776
1777@item
1778The @ref{Glossary}, defines most, if not all, of the significant terms used
1779throughout the @value{DOCUMENT}.  If you find terms that you aren't familiar with,
1780try looking them up here.
1781
1782@item
1783@ref{Copying}, and
1784@ref{GNU Free Documentation License},
1785present the licenses that cover the @command{gawk} source code
1786and this @value{DOCUMENT}, respectively.
1787@end ifclear
1788@end itemize
1789@end itemize
1790
1791@ifset FOR_PRINT
1792The version of this @value{DOCUMENT} distributed with @command{gawk}
1793contains additional appendices and other end material.
1794To save space, we have omitted them from the
1795printed edition. You may find them online, as follows:
1796
1797@itemize @value{BULLET}
1798@item
1799@uref{https://www.gnu.org/software/gawk/manual/html_node/Notes.html,
1800The appendix on implementation notes}
1801describes how to disable @command{gawk}'s extensions, how to contribute
1802new code to @command{gawk}, where to find information on some possible
1803future directions for @command{gawk} development, and the design decisions
1804behind the extension API.
1805
1806@item
1807@uref{https://www.gnu.org/software/gawk/manual/html_node/Basic-Concepts.html,
1808The appendix on basic concepts}
1809provides some very cursory background material for those who
1810are completely unfamiliar with computer programming.
1811
1812@item
1813@uref{https://www.gnu.org/software/gawk/manual/html_node/Glossary.html,
1814The Glossary}
1815defines most, if not all, of the significant terms used
1816throughout the @value{DOCUMENT}.  If you find terms that you aren't familiar with,
1817try looking them up here.
1818
1819@item
1820@uref{https://www.gnu.org/software/gawk/manual/html_node/GNU-Free-Documentation-License.html,
1821The GNU FDL}
1822is the license that covers this @value{DOCUMENT}.
1823@end itemize
1824
1825@c ok not to use CHAPTER / SECTION here
1826Some of the chapters have exercise sections; these have also been
1827omitted from the print edition but are available online.
1828@end ifset
1829
1830@c FULLXREF OFF
1831
1832@node Conventions
1833@unnumberedsec Typographical Conventions
1834
1835@cindex Texinfo
1836This @value{DOCUMENT} is written in @uref{https://www.gnu.org/software/texinfo/, Texinfo},
1837the GNU documentation formatting language.
1838A single Texinfo source file is used to produce both the printed and online
1839versions of the documentation.
1840@ifnotinfo
1841Because of this, the typographical conventions
are slightly different from those in other books you may have read.
1843@end ifnotinfo
1844@ifinfo
1845This @value{SECTION} briefly documents the typographical conventions used in Texinfo.
1846@end ifinfo
1847
1848Examples you would type at the command line are preceded by the common
1849shell primary and secondary prompts, @samp{$} and @samp{>}, respectively.
1850Input that you type is shown @kbd{like this}.
1851@c 8/2014: @print{} is stripped from the texi to make docbook.
1852@ifclear FOR_PRINT
1853Output from the command is preceded by the glyph ``@print{}''.
1854This typically represents the command's standard output.
1855@end ifclear
1856@ifset FOR_PRINT
1857Output from the command, usually its standard output, appears
1858@code{like this}.
1859@end ifset
1860Error messages and other output on the command's standard error are preceded
1861by the glyph ``@error{}''.  For example:
1862
1863@example
1864$ @kbd{echo hi on stdout}
1865@print{} hi on stdout
1866$ @kbd{echo hello on stderr 1>&2}
1867@error{} hello on stderr
1868@end example
1869
1870@ifnotinfo
1871In the text, almost anything related to programming, such as
1872command names,
variable and function names, and string, numeric, and regexp constants,
appears in @code{this font}. Code fragments
1875appear in the same font and quoted, @samp{like this}.
1876Things that are replaced by the user or programmer
1877appear in @var{this font}.
1878Options look like this: @option{-f}.
1879@value{FFN}s are indicated like this: @file{/path/to/ourfile}.
1880@ifclear FOR_PRINT
1881Some things are
1882emphasized @emph{like this}, and if a point needs to be made
1883strongly, it is done @strong{like this}.
1884@end ifclear
1885The first occurrence of
1886a new term is usually its @dfn{definition} and appears in the same
1887font as the previous occurrence of ``definition'' in this sentence.
1888@end ifnotinfo
1889
1890Characters that you type at the keyboard look @kbd{like this}.  In particular,
1891there are special characters called ``control characters.''  These are
1892characters that you type by holding down both the @kbd{CONTROL} key and
1893another key, at the same time.  For example, a @kbd{Ctrl-d} is typed
1894by first pressing and holding the @kbd{CONTROL} key, next
1895pressing the @kbd{d} key, and finally releasing both keys.
1896
1897For the sake of brevity, throughout this @value{DOCUMENT}, we refer to
1898Brian Kernighan's version of @command{awk} as ``BWK @command{awk}.''
(@xref{Other Versions}, for information on his and other versions.)
1900
1901@ifset FOR_PRINT
1902@quotation NOTE
1903Notes of interest look like this.
1904@end quotation
1905
1906@quotation CAUTION
1907Cautionary or warning notes look like this.
1908@end quotation
1909@end ifset
1910
1911@c fakenode --- for prepinfo
1912@unnumberedsubsec Dark Corners
1913@cindex Kernighan, Brian @subentry quotes
1914@quotation
1915@i{Dark corners are basically fractal---no matter how much
1916you illuminate, there's always a smaller but darker one.}
1917@author Brian Kernighan
1918@end quotation
1919
1920@cindex d.c. @seeentry{dark corner}
1921@cindex dark corner
1922Until the POSIX standard (and @cite{@value{TITLE}}),
1923many features of @command{awk} were either poorly documented or not
1924documented at all.  Descriptions of such features
1925(often called ``dark corners'') are noted in this @value{DOCUMENT} with
1926@iftex
1927the picture of a flashlight in the margin, as shown here.
1928@value{DARKCORNER}
1929@end iftex
1930@ifnottex
1931``(d.c.).''
1932@end ifnottex
1933@ifclear FOR_PRINT
1934They also appear in the index under the heading ``dark corner.''
1935@end ifclear
1936
1937But, as noted by the opening quote, any coverage of dark
1938corners is by definition incomplete.
1939
1940@cindex c.e. @seeentry{common extensions}
1941Extensions to the standard @command{awk} language that are supported by
1942more than one @command{awk} implementation are marked
1943@ifclear FOR_PRINT
1944``@value{COMMONEXT},'' and listed in the index under ``common extensions''
1945and ``extensions, common.''
1946@end ifclear
1947@ifset FOR_PRINT
1948``@value{COMMONEXT}'' for ``common extension.''
1949@end ifset
1950
1951@node Manual History
1952@unnumberedsec The GNU Project and This Book
1953
1954@cindex FSF (Free Software Foundation)
1955@cindex Free Software Foundation (FSF)
1956@cindex Stallman, Richard
1957The Free Software Foundation (FSF) is a nonprofit organization dedicated
1958to the production and distribution of freely distributable software.
1959It was founded by Richard M.@: Stallman, the author of the original
1960Emacs editor.  GNU Emacs is the most widely used version of Emacs today.
1961
1962@cindex GNU Project
1963@cindex GPL (General Public License)
1964@cindex GNU General Public License @seeentry{GPL}
1965@cindex General Public License @seeentry{GPL}
1966@cindex documentation @subentry online
1967The GNU@footnote{GNU stands for ``GNU's Not Unix.''}
1968Project is an ongoing effort on the part of the Free Software
1969Foundation to create a complete, freely distributable, POSIX-compliant
1970computing environment.
1971The FSF uses the GNU General Public License (GPL) to ensure that
1972its software's
1973source code is always available to the end user.
1974@ifclear FOR_PRINT
1975A copy of the GPL is included
1976@ifnotinfo
1977in this @value{DOCUMENT}
1978@end ifnotinfo
1979for your reference
1980(@pxref{Copying}).
1981@end ifclear
1982The GPL applies to the C language source code for @command{gawk}.
1983To find out more about the FSF and the GNU Project online,
1984see @uref{https://www.gnu.org, the GNU Project's home page}.
1985This @value{DOCUMENT} may also be read from
1986@uref{https://www.gnu.org/software/gawk/manual/, GNU's website}.
1987
1988@ifclear FOR_PRINT
A shell, an editor (Emacs), highly portable optimizing C, C++, and
Objective-C compilers, a symbolic debugger, and dozens of large and
small utilities (such as @command{gawk}) have all been completed and are
freely available.  The GNU operating
system kernel (the HURD) has been released but remains in an early
stage of development.
1995
1996@cindex Linux @seeentry{GNU/Linux}
1997@cindex GNU/Linux
1998@cindex operating systems @subentry BSD-based
1999Until the GNU operating system is more fully developed, you should
2000consider using GNU/Linux, a freely distributable, Unix-like operating
2001system for Intel,
2002Power Architecture,
2003Sun SPARC, IBM S/390, and other
2004systems.@footnote{The terminology ``GNU/Linux'' is explained
2005in the @ref{Glossary}.}
2006Many GNU/Linux distributions are
2007available for download from the Internet.
2008@end ifclear
2009
2010@ifnotinfo
2011The @value{DOCUMENT} you are reading is actually free---at least, the
2012information in it is free to anyone.  The machine-readable
2013source code for the @value{DOCUMENT} comes with @command{gawk}.
2014@ifclear FOR_PRINT
2015(Take a moment to check the Free Documentation
2016License in @ref{GNU Free Documentation License}.)
2017@end ifclear
2018@end ifnotinfo
2019
2020@cindex Close, Diane
2021The @value{DOCUMENT} itself has gone through multiple previous editions.
2022Paul Rubin wrote the very first draft of @cite{The GAWK Manual};
2023it was around 40 pages long.
2024Diane Close and Richard Stallman improved it, yielding a
2025version that was
2026around 90 pages and barely described the original, ``old''
2027version of @command{awk}.
2028
2029I started working with that version in the fall of 1988.
2030As work on it progressed,
2031the FSF published several preliminary versions (numbered 0.@var{x}).
2032In 1996, edition 1.0 was released with @command{gawk} 3.0.0.
2033The FSF published the first two editions under
2034the title @cite{The GNU Awk User's Guide}.
2035@ifset FOR_PRINT
2036SSC published two editions of the @value{DOCUMENT} under the
2037title @cite{Effective awk Programming}, and O'Reilly published
2038the third edition in 2001.
2039@end ifset
2040
2041This edition maintains the basic structure of the previous editions.
2042For FSF edition 4.0, the content was thoroughly reviewed and updated. All
2043references to @command{gawk} versions prior to 4.0 were removed.
2044Of significant note for that edition was the addition of @ref{Debugger}.
2045
2046For FSF edition
2047@ifclear FOR_PRINT
20485.0,
2049@end ifclear
2050@ifset FOR_PRINT
2051@value{EDITION}
2052(the fourth edition as published by O'Reilly),
2053@end ifset
2054the content has been reorganized into parts,
2055and the major new additions are @ref{Arbitrary Precision Arithmetic},
2056and @ref{Dynamic Extensions}.
2057
2058This @value{DOCUMENT} will undoubtedly continue to evolve.  If you
find an error in the @value{DOCUMENT}, please report it!  @xref{Bugs},
2060for information on submitting problem reports electronically.
2061
2062@ifset FOR_PRINT
2063@c fakenode --- for prepinfo
2064@unnumberedsec How to Stay Current
2065
2066You may have a newer version of @command{gawk} than the
2067one described here.  To find out what has changed,
2068you should first look at the @file{NEWS} file in the @command{gawk}
2069distribution, which provides a high-level summary of the changes in
2070each release.
2071
2072You can then look at the @uref{https://www.gnu.org/software/gawk/manual/,
2073online version} of this @value{DOCUMENT} to read about any new features.
2074@end ifset
2075
2076@ifclear FOR_PRINT
2077@node How To Contribute
2078@unnumberedsec How to Contribute
2079
2080As the maintainer of GNU @command{awk}, I once thought that I would be
2081able to manage a collection of publicly available @command{awk} programs
2082and I even solicited contributions.  Making things available on the Internet
2083helps keep the @command{gawk} distribution down to manageable size.
2084
2085The initial collection of material, such as it is, is still available
2086at @uref{ftp://ftp.freefriends.org/arnold/Awkstuff}.
2087
2088In the hopes of doing something more broad, I acquired the
2089@code{awklang.org} domain.  Late in 2017, a volunteer took on the task
2090of managing it.
2091
If you have written an interesting @command{awk} program that
you would like to share with the rest of the world, please see
@uref{http://www.awklang.org} and use the ``Contact'' link.
2095
2096If you have written a @command{gawk} extension, please see
2097@ref{gawkextlib}.
2098@end ifclear
2099
2100@node Acknowledgments
2101@unnumberedsec Acknowledgments
2102
2103The initial draft of @cite{The GAWK Manual} had the following acknowledgments:
2104
2105@quotation
2106Many people need to be thanked for their assistance in producing this
2107manual.  Jay Fenlason contributed many ideas and sample programs.  Richard
2108Mlynarik and Robert Chassell gave helpful comments on drafts of this
2109manual.  The paper @cite{A Supplemental Document for AWK} by John W.@:
2110Pierce of the Chemistry Department at UC San Diego, pinpointed several
2111issues relevant both to @command{awk} implementation and to this manual, that
2112would otherwise have escaped us.
2113@end quotation
2114
2115@cindex Stallman, Richard
2116I would like to acknowledge Richard M.@: Stallman, for his vision of a
2117better world and for his courage in founding the FSF and starting the
2118GNU Project.
2119
2120@ifclear FOR_PRINT
Earlier editions of this @value{DOCUMENT} had the following acknowledgments:
2122@end ifclear
2123@ifset FOR_PRINT
The previous edition of this @value{DOCUMENT} had
the following acknowledgments:
2126@end ifset
2127
2128@quotation
2129The following people (in alphabetical order)
2130provided helpful comments on various
2131versions of this book:
2132Rick Adams,
2133Dr.@: Nelson H.F. Beebe,
2134Karl Berry,
2135Dr.@: Michael Brennan,
2136Rich Burridge,
2137Claire Cloutier,
2138Diane Close,
2139Scott Deifik,
2140Christopher (``Topher'') Eliot,
2141Jeffrey Friedl,
2142Dr.@: Darrel Hankerson,
2143Michal Jaegermann,
2144Dr.@: Richard J.@: LeBlanc,
2145Michael Lijewski,
2146Pat Rankin,
2147Miriam Robbins,
2148Mary Sheehan,
2149and
2150Chuck Toporek.
2151
2152@cindex Berry, Karl
2153@cindex Chassell, Robert J.@:
2154@c @cindex Texinfo
2155Robert J.@: Chassell provided much valuable advice on
2156the use of Texinfo.
2157He also deserves special thanks for
2158convincing me @emph{not} to title this @value{DOCUMENT}
2159@cite{How to Gawk Politely}.
2160Karl Berry helped significantly with the @TeX{} part of Texinfo.
2161
2162@cindex Hartholz @subentry Marshall
2163@cindex Hartholz @subentry Elaine
2164@cindex Schreiber @subentry Bert
2165@cindex Schreiber @subentry Rita
2166I would like to thank Marshall and Elaine Hartholz of Seattle and
2167Dr.@: Bert and Rita Schreiber of Detroit for large amounts of quiet vacation
2168time in their homes, which allowed me to make significant progress on
2169this @value{DOCUMENT} and on @command{gawk} itself.
2170
2171@cindex Hughes, Phil
2172Phil Hughes of SSC
2173contributed in a very important way by loaning me his laptop GNU/Linux
2174system, not once, but twice, which allowed me to do a lot of work while
2175away from home.
2176
2177@cindex Trueman, David
2178David Trueman deserves special credit; he has done a yeoman job
2179of evolving @command{gawk} so that it performs well and without bugs.
2180Although he is no longer involved with @command{gawk},
2181working with him on this project was a significant pleasure.
2182
2183@cindex Drepper, Ulrich
2184@cindex GNITS mailing list
2185@cindex mailing list, GNITS
2186The intrepid members of the GNITS mailing list, and most notably Ulrich
2187Drepper, provided invaluable help and feedback for the design of the
2188internationalization features.
2189
2190Chuck Toporek, Mary Sheehan, and Claire Cloutier of O'Reilly & Associates contributed
2191significant editorial help for this @value{DOCUMENT} for the
21923.1 release of @command{gawk}.
2193@end quotation
2194
2195@cindex Beebe, Nelson H.F.@:
2196@cindex Buening, Andreas
2197@cindex Collado, Manuel
2198@cindex Colombo, Antonio
2199@cindex Davies, Stephen
2200@cindex Deifik, Scott
2201@cindex Demaille, Akim
2202@cindex G., Daniel Richard
2203@cindex Guerrero, Juan Manuel
2204@cindex Hankerson, Darrel
2205@cindex Jaegermann, Michal
2206@cindex Kahrs, J@"urgen
2207@cindex Kasal, Stepan
2208@cindex Malmberg, John
2209@cindex Ramey, Chet
2210@cindex Rankin, Pat
2211@cindex Schorr, Andrew
2212@cindex Vinschen, Corinna
2213@cindex Zaretskii, Eli
2214
2215Dr.@: Nelson Beebe,
2216Andreas Buening,
2217Dr.@: Manuel Collado,
2218Antonio Colombo,
2219Stephen Davies,
2220Scott Deifik,
2221Akim Demaille,
2222Daniel Richard G.,
2223Juan Manuel Guerrero,
2224Darrel Hankerson,
2225Michal Jaegermann,
2226J@"urgen Kahrs,
2227Stepan Kasal,
2228John Malmberg,
2229Chet Ramey,
2230Pat Rankin,
2231Andrew Schorr,
2232Corinna Vinschen,
2233and Eli Zaretskii
2234(in alphabetical order)
2235make up the current @command{gawk} ``crack portability team.''  Without
2236their hard work and help, @command{gawk} would not be nearly the robust,
2237portable program it is today.  It has been and continues to be a pleasure
2238working with this team of fine people.
2239
2240Notable code and documentation contributions were made by
2241a number of people. @xref{Contributors} for the full list.
2242
2243@ifset FOR_PRINT
2244@cindex Oram, Andy
2245Thanks to Andy Oram of O'Reilly Media for initiating
2246the fourth edition and for his support during the work.
2247Thanks to Jasmine Kwityn for her copyediting work.
2248@end ifset
2249
2250Thanks to Michael Brennan for the Forewords.
2251
@cindex Dumas, Patrice
2253@cindex Berry, Karl
2254@cindex Smith, Gavin
2255Thanks to Patrice Dumas for the new @command{makeinfo} program.
2256Thanks to Karl Berry for his past work on Texinfo, and
2257to Gavin Smith, who continues to work to improve
2258the Texinfo markup language.
2259
2260@cindex Kernighan, Brian
2261@cindex Brennan, Michael
2262@cindex Day, Robert P.J.@:
2263Robert P.J.@: Day, Michael Brennan, and Brian Kernighan kindly acted as
2264reviewers for the 2015 edition of this @value{DOCUMENT}. Their feedback
2265helped improve the final work.
2266
2267I would also like to thank Brian Kernighan for his invaluable assistance during the
2268testing and debugging of @command{gawk}, and for his ongoing
2269help and advice in clarifying numerous points about the language.
2270We could not have done nearly as good a job on either @command{gawk}
2271or its documentation without his help.
2272
2273Brian is in a class by himself as a programmer and technical
2274author.  I have to thank him (yet again) for his ongoing friendship
2275and for being a role model to me for over 30 years!
2276Having him as a reviewer is an exciting privilege. It has also
2277been extremely humbling@enddots{}
2278
2279@cindex Robbins @subentry Miriam
2280@cindex Robbins @subentry Jean
2281@cindex Robbins @subentry Harry
2282@cindex G-d
2283I must thank my wonderful wife, Miriam, for her patience through
2284the many versions of this project, for her proofreading,
2285and for sharing me with the computer.
2286I would like to thank my parents for their love, and for the grace with
2287which they raised and educated me.
2288Finally, I also must acknowledge my gratitude to G-d, for the many opportunities
2289He has sent my way, as well as for the gifts He has given me with which to
2290take advantage of those opportunities.
2291@ifnotdocbook
2292@sp 2
2293@noindent
2294Arnold Robbins @*
2295Nof Ayalon @*
2296Israel @*
2297March, 2020
2298@end ifnotdocbook
2299
2300@ifnotinfo
2301@part @value{PART1}The @command{awk} Language
2302@end ifnotinfo
2303
2304@ifdocbook
2305
2306Part I describes the @command{awk} language and @command{gawk} program
2307in detail.  It starts with the basics, and continues through all of
2308the features of @command{awk}. Included also are many, but not all,
2309of the features of @command{gawk}.  This part contains the
2310following chapters:
2311
2312@itemize @value{BULLET}
2313@item
2314@ref{Getting Started}
2315
2316@item
2317@ref{Invoking Gawk}
2318
2319@item
2320@ref{Regexp}
2321
2322@item
2323@ref{Reading Files}
2324
2325@item
2326@ref{Printing}
2327
2328@item
2329@ref{Expressions}
2330
2331@item
2332@ref{Patterns and Actions}
2333
2334@item
2335@ref{Arrays}
2336
2337@item
2338@ref{Functions}
2339@end itemize
2340@end ifdocbook
2341
2342@node Getting Started
2343@chapter Getting Started with @command{awk}
2344@c @cindex script, definition of
2345@c @cindex rule, definition of
2346@c @cindex program, definition of
2347@c @cindex basic function of @command{awk}
2348@cindex @command{awk} @subentry function of
2349
2350The basic function of @command{awk} is to search files for lines (or other
2351units of text) that contain certain patterns.  When a line matches one
2352of the patterns, @command{awk} performs specified actions on that line.
2353@command{awk} continues to process input lines in this way until it reaches
2354the end of the input files.
2355
2356@cindex @command{awk} @subentry uses for
2357@cindex programming languages @subentry data-driven vs.@: procedural
2358@cindex @command{awk} programs
2359Programs in @command{awk} are different from programs in most other languages,
2360because @command{awk} programs are @dfn{data driven} (i.e., you describe
2361the data you want to work with and then what to do when you find it).
2362Most other languages are @dfn{procedural}; you have to describe, in great
2363detail, every step the program should take.  When working with procedural
2364languages, it is usually much
2365harder to clearly describe the data your program will process.
2366For this reason, @command{awk} programs are often refreshingly easy to
2367read and write.
2368
2369@cindex program, definition of
2370@cindex rule, definition of
2371When you run @command{awk}, you specify an @command{awk} @dfn{program} that
2372tells @command{awk} what to do.  The program consists of a series of
2373@dfn{rules} (it may also contain @dfn{function definitions},
2374an advanced feature that we will ignore for now;
2375@pxref{User-defined}).  Each rule specifies one
2376pattern to search for and one action to perform
2377upon finding the pattern.
2378
2379Syntactically, a rule consists of a @dfn{pattern} followed by an
2380@dfn{action}.  The action is enclosed in braces to separate it from the
2381pattern.  Newlines usually separate rules.  Therefore, an @command{awk}
2382program looks like this:
2383
2384@example
2385@var{pattern} @{ @var{action} @}
2386@var{pattern} @{ @var{action} @}
2387@dots{}
2388@end example
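
As a concrete sketch of this shape, the following two-rule program
prints a message for each input line that matches one of its patterns.
The patterns, the messages, and the input text are made up for
illustration, and we assume a POSIX-style shell; regular expression
patterns such as @code{/apple/} are explained later in this chapter:

@example
$ @kbd{printf 'apple pie\nbanana split\n' |}
> @kbd{awk '/apple/  @{ print "found apple" @}}
> @kbd{     /banana/ @{ print "found banana" @}'}
@print{} found apple
@print{} found banana
@end example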
2389
2390@menu
2391* Running gawk::                How to run @command{gawk} programs; includes
2392                                command-line syntax.
2393* Sample Data Files::           Sample data files for use in the @command{awk}
2394                                programs illustrated in this @value{DOCUMENT}.
2395* Very Simple::                 A very simple example.
2396* Two Rules::                   A less simple one-line example using two
2397                                rules.
2398* More Complex::                A more complex example.
2399* Statements/Lines::            Subdividing or combining statements into
2400                                lines.
2401* Other Features::              Other Features of @command{awk}.
2402* When::                        When to use @command{gawk} and when to use
2403                                other things.
2404* Intro Summary::               Summary of the introduction.
2405@end menu
2406
2407@node Running gawk
2408@section How to Run @command{awk} Programs
2409
2410@cindex @command{awk} programs @subentry running
2411There are several ways to run an @command{awk} program.  If the program is
2412short, it is easiest to include it in the command that runs @command{awk},
2413like this:
2414
2415@example
2416awk '@var{program}' @var{input-file1} @var{input-file2} @dots{}
2417@end example
2418
2419@cindex command line @subentry formats
2420When the program is long, it is usually more convenient to put it in a file
2421and run it with a command like this:
2422
2423@example
2424awk -f @var{program-file} @var{input-file1} @var{input-file2} @dots{}
2425@end example
2426
2427This @value{SECTION} discusses both mechanisms, along with several
2428variations of each.
2429
2430@menu
2431* One-shot::                    Running a short throwaway @command{awk}
2432                                program.
2433* Read Terminal::               Using no input files (input from the keyboard
2434                                instead).
2435* Long::                        Putting permanent @command{awk} programs in
2436                                files.
2437* Executable Scripts::          Making self-contained @command{awk} programs.
2438* Comments::                    Adding documentation to @command{gawk}
2439                                programs.
2440* Quoting::                     More discussion of shell quoting issues.
2441@end menu
2442
2443@node One-shot
2444@subsection One-Shot Throwaway @command{awk} Programs
2445
Once you are familiar with @command{awk}, you will often type in simple
programs the moment you want to use them.  In such cases, you can write the
program as the first argument of the @command{awk} command, like this:
2449
2450@example
2451awk '@var{program}' @var{input-file1} @var{input-file2} @dots{}
2452@end example
2453
2454@noindent
2455where @var{program} consists of a series of patterns and
2456actions, as described earlier.
2457
2458@cindex single quote (@code{'})
2459@cindex @code{'} (single quote)
2460This command format instructs the @dfn{shell}, or command interpreter,
2461to start @command{awk} and use the @var{program} to process records in the
2462input file(s).  There are single quotes around @var{program} so
2463the shell won't interpret any @command{awk} characters as special shell
2464characters.  The quotes also cause the shell to treat all of @var{program} as
2465a single argument for @command{awk}, and allow @var{program} to be more
2466than one line long.
2467
2468@cindex shells @subentry scripts
2469@cindex @command{awk} programs @subentry running @subentry from shell scripts
2470This format is also useful for running short or medium-sized @command{awk}
2471programs from shell scripts, because it avoids the need for a separate
2472file for the @command{awk} program.  A self-contained shell script is more
2473reliable because there are no other files to misplace.
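
As a minimal sketch of such a script, the following counts the lines
in the files named as its arguments.  The script name is hypothetical,
and @code{END} and @code{NR} are @command{awk} features described later
in this @value{DOCUMENT}:

@example
#! /bin/sh
# count-lines (hypothetical name): print the total number of
# input lines in the files named as arguments
awk 'END @{ print NR, "lines" @}' "$@@"
@end example

@noindent
Because the @command{awk} program is part of the script text,
this one file is all that needs to be installed.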
2474
2475Later in this chapter, in
2476@ifdocbook
2477the @value{SECTION}
2478@end ifdocbook
2479@ref{Very Simple},
2480we'll see examples of several short,
2481self-contained programs.
2482
2483@node Read Terminal
2484@subsection Running @command{awk} Without Input Files
2485
2486@cindex standard input
2487@cindex input @subentry standard
2488@cindex input files @subentry running @command{awk} without
2489You can also run @command{awk} without any input files.  If you type the
2490following command line:
2491
2492@example
2493awk '@var{program}'
2494@end example
2495
2496@noindent
2497@command{awk} applies the @var{program} to the @dfn{standard input},
2498which usually means whatever you type on the keyboard.  This continues
2499until you indicate end-of-file by typing @kbd{Ctrl-d}.
2500(On non-POSIX operating systems, the end-of-file character may be different.)
2501
2502@cindex files @subentry input @seeentry{input files}
2503@cindex input files @subentry running @command{awk} without
2504@cindex @command{awk} programs @subentry running @subentry without input files
2505As an example, the following program prints a friendly piece of advice
2506(from Douglas Adams's @cite{The Hitchhiker's Guide to the Galaxy}),
2507to keep you from worrying about the complexities of computer
2508programming:
2509
2510@example
2511$ @kbd{awk 'BEGIN @{ print "Don\47t Panic!" @}'}
2512@print{} Don't Panic!
2513@end example
2514
2515@command{awk} executes statements associated with @code{BEGIN} before
2516reading any input.  If there are no other statements in your program,
2517as is the case here, @command{awk} just stops, instead of trying to read
2518input it doesn't know how to process.
The @samp{\47} is an octal escape sequence
(@pxref{Quoting})
that puts a single quote into
the program, without having to engage in ugly shell quoting tricks.
2521
2522@quotation NOTE
2523If you use Bash as your shell, you should execute the
2524command @samp{set +H} before running this program interactively, to
2525disable the C shell-style command history, which treats @samp{!} as a
2526special character. We recommend putting this command into your personal
2527startup file.
2528@end quotation
2529
2530This next simple @command{awk} program
2531emulates the @command{cat} utility; it copies whatever you type on the
2532keyboard to its standard output (why this works is explained shortly):
2533
2534@example
2535$ @kbd{awk '@{ print @}'}
2536@kbd{Now is the time for all good men}
2537@print{} Now is the time for all good men
2538@kbd{to come to the aid of their country.}
2539@print{} to come to the aid of their country.
2540@kbd{Four score and seven years ago, ...}
2541@print{} Four score and seven years ago, ...
2542@kbd{What, me worry?}
2543@print{} What, me worry?
2544@kbd{Ctrl-d}
2545@end example
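
Standard input need not be the keyboard.  As a minimal sketch
(assuming a POSIX-style shell), the same program can read the output
of another command through a pipe; the text fed to it here is
made up for illustration:

@example
$ @kbd{printf 'first line\nsecond line\n' | awk '@{ print @}'}
@print{} first line
@print{} second line
@end example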
2546
2547@node Long
2548@subsection Running Long Programs
2549
2550@cindex @command{awk} programs @subentry running
2551@cindex @command{awk} programs @subentry lengthy
2552@cindex files @subentry @command{awk} programs in
2553Sometimes @command{awk} programs are very long.  In these cases, it is
2554more convenient to put the program into a separate file.  In order to tell
2555@command{awk} to use that file for its program, you type:
2556
2557@example
2558awk -f @var{source-file} @var{input-file1} @var{input-file2} @dots{}
2559@end example
2560
2561@cindex @option{-f} option
2562@cindex command line @subentry option @option{-f}
2563The @option{-f} instructs the @command{awk} utility to get the
2564@command{awk} program from the file @var{source-file} (@pxref{Options}).
2565Any @value{FN} can be used for @var{source-file}.  For example, you
2566could put the program:
2567
2568@example
2569BEGIN @{ print "Don't Panic!" @}
2570@end example
2571
2572@noindent
2573into the file @file{advice}.  Then this command:
2574
2575@example
2576awk -f advice
2577@end example
2578
2579@noindent
2580does the same thing as this one:
2581
2582@example
2583awk 'BEGIN @{ print "Don\47t Panic!" @}'
2584@end example
2585
2586@cindex quoting @subentry in @command{gawk} command lines
2587@noindent
2588This was explained earlier
2589(@pxref{Read Terminal}).
2590Note that you don't usually need single quotes around the @value{FN} that you
2591specify with @option{-f}, because most @value{FN}s don't contain any of the shell's
2592special characters.  Notice that in @file{advice}, the @command{awk}
2593program did not have single quotes around it.  The quotes are only needed
2594for programs that are provided on the @command{awk} command line.
2595(Also, placing the program in a file allows us to use a literal single quote in the program
2596text, instead of the magic @samp{\47}.)
2597
2598@cindex single quote (@code{'}) @subentry in @command{gawk} command lines
2599@cindex @code{'} (single quote) @subentry in @command{gawk} command lines
2600If you want to clearly identify an @command{awk} program file as such,
2601you can add the extension @file{.awk} to the @value{FN}.  This doesn't
2602affect the execution of the @command{awk} program but it does make
2603``housekeeping'' easier.
2604
2605@node Executable Scripts
2606@subsection Executable @command{awk} Programs
2607@cindex @command{awk} programs
2608@cindex @code{#} (number sign) @subentry @code{#!} (executable scripts)
2609@cindex Unix @subentry @command{awk} scripts and
2610@cindex number sign (@code{#}) @subentry @code{#!} (executable scripts)
2611
2612Once you have learned @command{awk}, you may want to write self-contained
2613@command{awk} scripts, using the @samp{#!} script mechanism.  You can do
2614this on many systems.@footnote{The @samp{#!} mechanism works on
2615GNU/Linux systems, BSD-based systems, and commercial Unix systems.}
2616For example, you could update the file @file{advice} to look like this:
2617
2618@example
2619#! /bin/awk -f
2620
2621BEGIN @{ print "Don't Panic!" @}
2622@end example
2623
2624@noindent
After making this file executable (with the @command{chmod} utility),
simply type @samp{./advice}
at the shell (or just @samp{advice}, if the file's directory is in your
shell's search path) and the system arranges to run @command{awk} as if you had
typed @samp{awk -f advice}:
2629
2630@example
2631$ @kbd{chmod +x advice}
2632$ @kbd{./advice}
2633@print{} Don't Panic!
2634@end example
2635
2636@noindent
2637Self-contained @command{awk} scripts are useful when you want to write a
2638program that users can invoke without their having to know that the program is
2639written in @command{awk}.
2640
2641@sidebar Understanding @samp{#!}
2642@cindex portability @subentry @code{#!} (executable scripts)
2643
2644@command{awk} is an @dfn{interpreted} language. This means that the
2645@command{awk} utility reads your program and then processes your data
2646according to the instructions in your program. (This is different
2647from a @dfn{compiled} language such as C, where your program is first
2648compiled into machine code that is executed directly by your system's
2649processor.)  The @command{awk} utility is thus termed an @dfn{interpreter}.
2650Many modern languages are interpreted.
2651
2652The line beginning with @samp{#!} lists the full @value{FN} of an
2653interpreter to run and a single optional initial command-line argument
2654to pass to that interpreter.  The operating system then runs the
2655interpreter with the given argument and the full argument list of the
2656executed program.  The first argument in the list is the full @value{FN}
2657of the @command{awk} program.  The rest of the argument list contains
2658either options to @command{awk}, or @value{DF}s, or both. (Note that on
2659many systems @command{awk} is found in @file{/usr/bin} instead of
2660in @file{/bin}.)
2661
2662Some systems limit the length of the interpreter name to 32 characters.
2663Often, this can be dealt with by using a symbolic link.
2664
2665You should not put more than one argument on the @samp{#!}
2666line after the path to @command{awk}. It does not work. The operating system
2667treats the rest of the line as a single argument and passes it to @command{awk}.
2668Doing this leads to confusing behavior---most likely a usage diagnostic
2669of some sort from @command{awk}.
2670
2671@cindex @code{ARGC}/@code{ARGV} variables @subentry portability and
2672@cindex portability @subentry @code{ARGV} variable
2673@cindex dark corner @subentry @code{ARGV} variable, value of
2674Finally, the value of @code{ARGV[0]}
2675(@pxref{Built-in Variables})
2676varies depending upon your operating system.
2677Some systems put @samp{awk} there, some put the full pathname
2678of @command{awk} (such as @file{/bin/awk}), and some put the name
2679of your script (@samp{advice}).  @value{DARKCORNER}
2680Don't rely on the value of @code{ARGV[0]}
2681to provide your script name.
2682@end sidebar
2683
2684@node Comments
2685@subsection Comments in @command{awk} Programs
2686@cindex @code{#} (number sign) @subentry commenting
2687@cindex number sign (@code{#}) @subentry commenting
2688@cindex commenting
2689@cindex @command{awk} programs @subentry documenting
2690
2691A @dfn{comment} is some text that is included in a program for the sake
2692of human readers; it is not really an executable part of the program.  Comments
2693can explain what the program does and how it works.  Nearly all
2694programming languages have provisions for comments, as programs are
2695typically hard to understand without them.
2696
2697In the @command{awk} language, a comment starts with the number sign
2698character (@samp{#}) and continues to the end of the line.
2699The @samp{#} does not have to be the first character on the line. The
2700@command{awk} language ignores the rest of a line following a number sign.
2701For example, we could have put the following into @file{advice}:
2702
2703@example
2704# This program prints a nice, friendly message.  It helps
2705# keep novice users from being afraid of the computer.
2706BEGIN    @{ print "Don't Panic!" @}
2707@end example
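
A comment may also share a line with code.  The following variant
is a sketch of the same program with a trailing comment (the wording
of the comment is ours):

@example
BEGIN @{ print "Don't Panic!" @}   # runs once, before any input is read
@end example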
2708
2709You can put comment lines into keyboard-composed throwaway @command{awk}
2710programs, but this usually isn't very useful; the purpose of a
2711comment is to help you or another person understand the program
2712when reading it at a later time.
2713
2714@cindex quoting @subentry for small awk programs
2715@cindex single quote (@code{'}) @subentry vs.@: apostrophe
2716@cindex @code{'} (single quote) @subentry vs.@: apostrophe
2717@quotation CAUTION
2718As mentioned in
2719@ref{One-shot},
2720you can enclose short to medium-sized programs in single quotes,
2721in order to keep
2722your shell scripts self-contained.  When doing so, @emph{don't} put
2723an apostrophe (i.e., a single quote) into a comment (or anywhere else
2724in your program). The shell interprets the quote as the closing
2725quote for the entire program. As a result, usually the shell
2726prints a message about mismatched quotes, and if @command{awk} actually
2727runs, it will probably print strange messages about syntax errors.
2728For example, look at the following:
2729
2730@example
2731$ @kbd{awk 'BEGIN @{ print "hello" @} # let's be cute'}
2732>
2733@end example
2734
2735The shell sees that the first two quotes match, and that
2736a new quoted object begins at the end of the command line.
2737It therefore prompts with the secondary prompt, waiting for more input.
2738With Unix @command{awk}, closing the quoted string produces this result:
2739
2740@example
2741$ @kbd{awk '@{ print "hello" @} # let's be cute'}
2742> @kbd{'}
2743@error{} awk: can't open file be
2744@error{}  source line number 1
2745@end example
2746
2747@cindex @code{\} (backslash)
2748@cindex backslash (@code{\})
2749Putting a backslash before the single quote in @samp{let's} wouldn't help,
2750because backslashes are not special inside single quotes.
2751The next @value{SUBSECTION} describes the shell's quoting rules.
2752@end quotation
2753
2754@node Quoting
2755@subsection Shell Quoting Issues
2756@cindex shell quoting, rules for
2757
2758@menu
2759* DOS Quoting::                 Quoting in Windows Batch Files.
2760@end menu
2761
2762For short to medium-length @command{awk} programs, it is most convenient
2763to enter the program on the @command{awk} command line.
2764This is best done by enclosing the entire program in single quotes.
2765This is true whether you are entering the program interactively at
2766the shell prompt, or writing it as part of a larger shell script:
2767
2768@example
2769awk '@var{program text}' @var{input-file1} @var{input-file2} @dots{}
2770@end example
2771
2772@cindex shells @subentry quoting @subentry rules for
2773@cindex Bourne shell, quoting rules for
2774Once you are working with the shell, it is helpful to have a basic
2775knowledge of shell quoting rules.  The following rules apply only to
2776POSIX-compliant, Bourne-style shells (such as Bash, the GNU Bourne-Again
2777Shell).  If you use the C shell, you're on your own.
2778
2779Before diving into the rules, we introduce a concept that appears
2780throughout this @value{DOCUMENT}, which is that of the @dfn{null},
2781or empty, string.
2782
2783The null string is character data that has no value.
2784In other words, it is empty.  It is written in @command{awk} programs
2785like this: @code{""}. In the shell, it can be written using single
2786or double quotes: @code{""} or @code{''}. Although the null string has
2787no characters in it, it does exist. For example, consider this command:
2788
2789@example
2790$ @kbd{echo ""}
2791@end example
2792
2793@noindent
2794Here, the @command{echo} utility receives a single argument, even
2795though that argument has no characters in it. In the rest of this
2796@value{DOCUMENT}, we use the terms @dfn{null string} and @dfn{empty string}
2797interchangeably.  Now, on to the quoting rules:
2798
2799@itemize @value{BULLET}
2800@item
2801Quoted items can be concatenated with nonquoted items as well as with other
2802quoted items.  The shell turns everything into one argument for
2803the command.
2804
2805@item
2806Preceding any single character with a backslash (@samp{\}) quotes
2807that character.  The shell removes the backslash and passes the quoted
2808character on to the command.
2809
2810@item
2811@cindex @code{\} (backslash) @subentry in shell commands
2812@cindex backslash (@code{\}) @subentry in shell commands
2813@cindex single quote (@code{'}) @subentry in shell commands
2814@cindex @code{'} (single quote) @subentry in shell commands
2815Single quotes protect everything between the opening and closing quotes.
2816The shell does no interpretation of the quoted text, passing it on verbatim
2817to the command.
2818It is @emph{impossible} to embed a single quote inside single-quoted text.
2819Refer back to
2820@ref{Comments}
2821for an example of what happens if you try.
2822
2823@item
2824@cindex double quote (@code{"}) @subentry in shell commands
2825@cindex @code{"} (double quote) @subentry in shell commands
2826Double quotes protect most things between the opening and closing quotes.
2827The shell does at least variable and command substitution on the quoted text.
2828Different shells may do additional kinds of processing on double-quoted text.
2829
2830Because certain characters within double-quoted text are processed by the shell,
2831they must be @dfn{escaped} within the text.  Of note are the characters
2832@samp{$}, @samp{`}, @samp{\}, and @samp{"}, all of which must be preceded by
2833a backslash within double-quoted text if they are to be passed on literally
2834to the program.  (The leading backslash is stripped first.)
2835Thus, the example seen
2836@ifnotinfo
2837previously
2838@end ifnotinfo
2839in @ref{Read Terminal}:
2840
2841@example
2842awk 'BEGIN @{ print "Don\47t Panic!" @}'
2843@end example
2844
2845@noindent
2846could instead be written this way:
2847
2848@example
2849$ @kbd{awk "BEGIN @{ print \"Don't Panic!\" @}"}
2850@print{} Don't Panic!
2851@end example
2852
2853@cindex single quote (@code{'}) @subentry with double quotes
2854@cindex @code{'} (single quote) @subentry with double quotes
2855Note that the single quote is not special within double quotes.
2856
2857@item
2858Null strings are removed when they occur as part of a non-null
2859command-line argument, while explicit null objects are kept.
2860For example, to specify that the field separator @code{FS} should
2861be set to the null string, use:
2862
2863@example
2864awk -F "" '@var{program}' @var{files} # correct
2865@end example
2866
2867@noindent
2868@cindex null strings @subentry in @command{gawk} arguments, quoting and
2869Don't use this:
2870
2871@example
2872awk -F"" '@var{program}' @var{files}  # wrong!
2873@end example
2874
2875@noindent
2876In the second case, @command{awk} attempts to use the text of the program
2877as the value of @code{FS}, and the first @value{FN} as the text of the program!
2878This results in syntax errors at best, and confusing behavior at worst.
2879@end itemize
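
The following transcript is a sketch of two of these rules in action,
assuming a POSIX-style shell; the input text and the shell variable
@code{greeting} are made up for illustration.  In the first command,
the backslash keeps the shell from expanding @samp{$2} inside double
quotes, so @command{awk} receives it intact and prints the second field.
In the second command, a double-quoted shell variable is concatenated
between two single-quoted pieces to form a single program argument:

@example
$ @kbd{echo 'one two' | awk "@{ print \$2 @}"}
@print{} two
$ @kbd{greeting=hello}
$ @kbd{awk 'BEGIN @{ print "'"$greeting"', world" @}'}
@print{} hello, world
@end example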
2880
2881@cindex quoting @subentry in @command{gawk} command lines @subentry tricks for
2882Mixing single and double quotes is difficult.  You have to resort
2883to shell quoting tricks, like this:
2884
2885@example
2886$ @kbd{awk 'BEGIN @{ print "Here is a single quote <'"'"'>" @}'}
2887@print{} Here is a single quote <'>
2888@end example
2889
2890@noindent
2891This program consists of three concatenated quoted strings.  The first and the
2892third are single-quoted, and the second is double-quoted.
2893
2894This can be ``simplified'' to:
2895
2896@example
2897$ @kbd{awk 'BEGIN @{ print "Here is a single quote <'\''>" @}'}
2898@print{} Here is a single quote <'>
2899@end example
2900
2901@noindent
2902Judge for yourself which of these two is the more readable.
2903
2904Another option is to use double quotes, escaping the embedded, @command{awk}-level
2905double quotes:
2906
2907@example
2908$ @kbd{awk "BEGIN @{ print \"Here is a single quote <'>\" @}"}
2909@print{} Here is a single quote <'>
2910@end example
2911
2912@noindent
2913This option is also painful, because double quotes, backslashes, and dollar signs
2914are very common in more advanced @command{awk} programs.
2915
2916A third option is to use the octal escape sequence equivalents
2917(@pxref{Escape Sequences})
2918for the
2919single- and double-quote characters, like so:
2920
2921@example
2922@group
2923$ @kbd{awk 'BEGIN @{ print "Here is a single quote <\47>" @}'}
2924@print{} Here is a single quote <'>
2925$ @kbd{awk 'BEGIN @{ print "Here is a double quote <\42>" @}'}
2926@print{} Here is a double quote <">
2927@end group
2928@end example
2929
2930@noindent
2931This works nicely, but you should comment clearly what the
2932escape sequences mean.
2933
2934A fourth option is to use command-line variable assignment, like this:
2935
2936@example
2937$ @kbd{awk -v sq="'" 'BEGIN @{ print "Here is a single quote <" sq ">" @}'}
2938@print{} Here is a single quote <'>
2939@end example
2940
2941(Here, the two string constants and the value of @code{sq} are concatenated
2942into a single string that is printed by @code{print}.)
2943
2944If you really need both single and double quotes in your @command{awk}
2945program, it is probably best to move it into a separate file, where
2946the shell won't be part of the picture and you can say what you mean.
2947
2948@node DOS Quoting
2949@subsubsection Quoting in MS-Windows Batch Files
2950
2951@ignore
2952Date: Wed, 21 May 2008 09:58:43 +0200 (CEST)
2953From: jeroen.brink@inter.NL.net
2954Subject: (g)awk "contribution"
2955To: arnold@skeeve.com
2956Message-id: <42220.193.172.132.34.1211356723.squirrel@webmail.internl.net>
2957
2958Hello Arnold,
2959
2960maybe you can help me out. Found your email on the GNU/awk online manual
2961pages.
2962
2963I've searched hard to figure out how, on Windows, to print double quotes.
2964Couldn't find it in the Quotes area, nor on google or elsewhere. Finally i
2965figured out how to do this myself.
2966
2967How to print all lines in a file surrounded by double quotes (on Windows):
2968
2969gawk "{ print \"\042\" $0 \"\042\" }" <file>
2970
2971Maybe this is a helpfull tip for other (Windows) gawk users. However, i
2972don't have a clue as to where to "publish" this tip! Do you?
2973
2974Kind regards,
2975
2976Jeroen Brink
2977@end ignore
2978
2979Although this @value{DOCUMENT} generally only worries about POSIX systems and the
2980POSIX shell, the following issue arises often enough for many users that
2981it is worth addressing.
2982
2983@cindex Brink, Jeroen
2984The ``shells'' on Microsoft Windows systems use the double-quote
2985character for quoting, and make it difficult or impossible to include an
2986escaped double-quote character in a command-line script.  The following
2987example, courtesy of Jeroen Brink, shows how to escape the double quotes
in this one-liner script that prints all lines in a file surrounded by
2989double quotes:
2990
2991@example
2992@{ print "\"" $0 "\"" @}
2993@end example
2994
2995@noindent
On an MS-Windows command line, the one-liner script above may be passed as
follows:
2998
2999@example
3000gawk "@{ print \"\042\" $0 \"\042\" @}" @var{file}
3001@end example
3002
3003In this example the @samp{\042} is the octal code for a double-quote;
3004@command{gawk} converts it into a real double-quote for output by
3005the @code{print} statement.
3006
In MS-Windows, escaping double-quotes is a little tricky because you use
backslashes to escape double-quotes, but backslashes themselves are not
escaped in the usual way; they are doubled only when a double-quote
follows them.  The MS-Windows
rule for double-quoting a string is the following:
3012
3013@enumerate
3014@item
For each double quote in the original string, let @var{N} be the number
of backslashes immediately before it (@var{N} may be zero).  Replace these
@var{N} backslashes with @math{2@value{TIMES}@var{N}+1} backslashes.

@item
Let @var{N} be the number of backslashes at the end of the original string
(@var{N} may be zero).  Replace these @var{N} backslashes with
@math{2@value{TIMES}@var{N}} backslashes.
3023
3024@item
3025Surround the resulting string by double-quotes.
3026@end enumerate
3027
3028So to double-quote the one-liner script @samp{@{ print "\"" $0 "\"" @}}
3029from the previous example you would do it this way:
3030
3031@example
3032gawk "@{ print \"\\\"\" $0 \"\\\"\" @}" @var{file}
3033@end example
3034
3035@noindent
3036However, the use of @samp{\042} instead of @samp{\\\"} is also possible
3037and easier to read, because backslashes that are not followed by a
3038double-quote don't need duplication.
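
The three-step rule given earlier is mechanical enough to express as a
short @command{awk} program.  The following is only a sketch: the
@value{FN} @file{winquote.awk} and the function names @code{repeat()}
and @code{winquote()} are hypothetical, chosen for illustration, and are
not part of @command{gawk}.  The program reads each input line and prints
its double-quoted form:

@example
# winquote.awk: sketch of the MS-Windows double-quoting rule
# (hypothetical helper, not part of gawk)

# repeat: return a string made of n copies of ch
function repeat(ch, n,    out)
@{
    while (n-- > 0)
        out = out ch
    return out
@}

# winquote: apply the three steps to the string s
function winquote(s,    i, c, n, out)
@{
    n = 0    # length of the current run of backslashes
    for (i = 1; i <= length(s); i++) @{
        c = substr(s, i, 1)
        if (c == "\\")
            n++
        else if (c == "\"") @{
            out = out repeat("\\", 2 * n + 1) c    # step 1
            n = 0
        @} else @{
            out = out repeat("\\", n) c            # run stays as is
            n = 0
        @}
    @}
    # step 2 doubles a trailing run; step 3 adds the outer quotes
    return "\"" out repeat("\\", 2 * n) "\""
@}

@{ print winquote($0) @}
@end example

@noindent
Run from a POSIX shell on the one-liner script used in the examples, it
produces the same quoted argument as the hand-built command:

@example
$ @kbd{printf '%s\n' '@{ print "\"" $0 "\"" @}' | awk -f winquote.awk}
@print{} "@{ print \"\\\"\" $0 \"\\\"\" @}"
@end example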
3039
3040@node Sample Data Files
3041@section @value{DDF}s for the Examples
3042
3043@cindex input files @subentry examples
3044@cindex @code{mail-list} file
3045Many of the examples in this @value{DOCUMENT} take their input from two sample
@value{DF}s.  The first, @file{mail-list}, represents a list of people's names
3047together with their email addresses and information about those people.
3048The second @value{DF}, called @file{inventory-shipped}, contains
3049information about monthly shipments.  In both files,
3050each line is considered to be one @dfn{record}.
3051
In @file{mail-list}, each record contains the name of a person,
their phone number, their email address, and a code for their relationship
with the author of the list.
3055The columns are aligned using spaces.
3056An @samp{A} in the last column
3057means that the person is an acquaintance.  An @samp{F} in the last
3058column means that the person is a friend.
3059An @samp{R} means that the person is a relative:
3060
3061@example
3062@c system if test ! -d eg      ; then mkdir eg      ; fi
3063@c system if test ! -d eg/lib  ; then mkdir eg/lib  ; fi
3064@c system if test ! -d eg/data ; then mkdir eg/data ; fi
3065@c system if test ! -d eg/prog ; then mkdir eg/prog ; fi
3066@c system if test ! -d eg/misc ; then mkdir eg/misc ; fi
3067@c file eg/data/mail-list
3068Amelia       555-5553     amelia.zodiacusque@@gmail.com    F
3069Anthony      555-3412     anthony.asserturo@@hotmail.com   A
3070Becky        555-7685     becky.algebrarum@@gmail.com      A
3071Bill         555-1675     bill.drowning@@hotmail.com       A
3072Broderick    555-0542     broderick.aliquotiens@@yahoo.com R
3073Camilla      555-2912     camilla.infusarum@@skynet.be     R
3074Fabius       555-1234     fabius.undevicesimus@@ucb.edu    F
3075Julie        555-6699     julie.perscrutabor@@skeeve.com   F
3076Martin       555-6480     martin.codicibus@@hotmail.com    A
3077Samuel       555-3430     samuel.lanceolis@@shu.edu        A
3078Jean-Paul    555-2127     jeanpaul.campanorum@@nyu.edu     R
3079@c endfile
3080@end example
3081
3082@cindex @code{inventory-shipped} file
3083The @value{DF} @file{inventory-shipped} represents
3084information about shipments during the year.
3085Each record contains the month, the number
3086of green crates shipped, the number of red boxes shipped, the number of
3087orange bags shipped, and the number of blue packages shipped,
3088respectively.  There are 16 entries, covering the 12 months of last year
3089and the first four months of the current year.
3090An empty line separates the data for the two years:
3091
3092@example
3093@c file eg/data/inventory-shipped
3094Jan  13  25  15 115
3095Feb  15  32  24 226
3096Mar  15  24  34 228
3097Apr  31  52  63 420
3098May  16  34  29 208
3099Jun  31  42  75 492
3100Jul  24  34  67 436
3101Aug  15  34  47 316
3102Sep  13  55  37 277
3103Oct  29  54  68 525
3104Nov  20  87  82 577
3105Dec  17  35  61 401
3106
3107Jan  21  36  64 620
3108Feb  26  58  80 652
3109Mar  24  75  70 495
3110Apr  21  70  74 514
3111@c endfile
3112@end example
3113
3114The sample files are included in the @command{gawk} distribution,
3115in the directory @file{awklib/eg/data}.
3116
3117@node Very Simple
3118@section Some Simple Examples
3119
3120The following command runs a simple @command{awk} program that searches the
3121input file @file{mail-list} for the character string @samp{li} (a
3122grouping of characters is usually called a @dfn{string};
3123the term @dfn{string} is based on similar usage in English, such
3124as ``a string of pearls'' or ``a string of cars in a train''):
3125
3126@example
3127awk '/li/ @{ print $0 @}' mail-list
3128@end example
3129
3130@noindent
3131When lines containing @samp{li} are found, they are printed because
3132@w{@samp{print $0}} means print the current line.  (Just @samp{print} by
3133itself means the same thing, so we could have written that
3134instead.)
3135
3136You will notice that slashes (@samp{/}) surround the string @samp{li}
3137in the @command{awk} program.  The slashes indicate that @samp{li}
3138is the pattern to search for.  This type of pattern is called a
3139@dfn{regular expression}, which is covered in more detail later
3140(@pxref{Regexp}).
3141The pattern is allowed to match parts of words.
3142There are
3143single quotes around the @command{awk} program so that the shell won't
3144interpret any of it as special shell characters.
3145
3146Here is what this program prints:
3147
3148@example
3149$ @kbd{awk '/li/ @{ print $0 @}' mail-list}
3150@print{} Amelia       555-5553     amelia.zodiacusque@@gmail.com    F
3151@print{} Broderick    555-0542     broderick.aliquotiens@@yahoo.com R
3152@print{} Julie        555-6699     julie.perscrutabor@@skeeve.com   F
3153@print{} Samuel       555-3430     samuel.lanceolis@@shu.edu        A
3154@end example
3155
3156@cindex actions @subentry default
3157@cindex patterns @subentry default
3158In an @command{awk} rule, either the pattern or the action can be omitted,
3159but not both.  If the pattern is omitted, then the action is performed
3160for @emph{every} input line.  If the action is omitted, the default
3161action is to print all lines that match the pattern.
3162
3163@cindex actions @subentry empty
3164Thus, we could leave out the action (the @code{print} statement and the
3165braces) in the previous example and the result would be the same:
3166@command{awk} prints all lines matching the pattern @samp{li}.  By comparison,
3167omitting the @code{print} statement but retaining the braces makes an
3168empty action that does nothing (i.e., no lines are printed).
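
As a quick check of these defaults, the following sketch runs both
variants against the @file{mail-list} sample file.  The first command
omits the action entirely; the second keeps the braces but leaves them
empty:

@example
$ @kbd{awk '/li/' mail-list}
@print{} Amelia       555-5553     amelia.zodiacusque@@gmail.com    F
@print{} Broderick    555-0542     broderick.aliquotiens@@yahoo.com R
@print{} Julie        555-6699     julie.perscrutabor@@skeeve.com   F
@print{} Samuel       555-3430     samuel.lanceolis@@shu.edu        A
$ @kbd{awk '/li/ @{ @}' mail-list}
@end example

@noindent
The second command prints nothing, because its empty action replaces
the default one.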
3169
3170@cindex @command{awk} programs @subentry one-line examples
3171Many practical @command{awk} programs are just a line or two long.  Following is a
3172collection of useful, short programs to get you started.  Some of these
3173programs contain constructs that haven't been covered yet. (The description
3174of the program will give you a good idea of what is going on, but you'll
3175need to read the rest of the @value{DOCUMENT} to become an @command{awk} expert!)
3176Most of the examples use a @value{DF} named @file{data}.  This is just a
3177placeholder; if you use these programs yourself, substitute
3178your own @value{FN}s for @file{data}.
3179
3180@cindex @command{ls} utility
3181Some of the following examples use the output of @w{@samp{ls -l}} as input.
3182@command{ls} is a system command that gives you a listing of the files in a
3183directory. With the @option{-l} option, this listing includes each file's
3184size and the date the file was last modified. Its output looks like this:
3185
3186@example
3187-rw-r--r--  1 arnold   user   1933 Nov  7 13:05 Makefile
3188-rw-r--r--  1 arnold   user  10809 Nov  7 13:03 awk.h
3189-rw-r--r--  1 arnold   user    983 Apr 13 12:14 awk.tab.h
3190-rw-r--r--  1 arnold   user  31869 Jun 15 12:20 awkgram.y
3191-rw-r--r--  1 arnold   user  22414 Nov  7 13:03 awk1.c
3192-rw-r--r--  1 arnold   user  37455 Nov  7 13:03 awk2.c
3193-rw-r--r--  1 arnold   user  27511 Dec  9 13:07 awk3.c
3194-rw-r--r--  1 arnold   user   7989 Nov  7 13:03 awk4.c
3195@end example
3196
3197@noindent
3198The first field contains read-write permissions, the second field contains
3199the number of links to the file, and the third field identifies the
3200file's owner.  The fourth field identifies the file's group.  The fifth
3201field contains the file's size in bytes.  The sixth, seventh, and eighth
3202fields contain the month, day, and time, respectively, that the file
3203was last modified.  Finally, the ninth field contains the @value{FN}.
3204
3205For future reference, note that there is often more than
3206one way to do things in @command{awk}.  At some point, you may want
3207to look back at these examples and see if
3208you can come up with different ways to do the same things shown here:
3209
3210@itemize @value{BULLET}
3211@item
3212Print every line that is longer than 80 characters:
3213
3214@example
3215awk 'length($0) > 80' data
3216@end example
3217
3218The sole rule has a relational expression as its pattern and has no
3219action---so it uses the default action, printing the record.
3220
3221@item
3222Print the length of the longest input line:
3223
3224@example
3225@group
3226awk '@{ if (length($0) > max) max = length($0) @}
3227     END @{ print max @}' data
3228@end group
3229@end example
3230
3231The code associated with @code{END} executes after all
3232input has been read; it's the other side of the coin to @code{BEGIN}.
3233
3234@cindex @command{expand} utility
3235@item
3236Print the length of the longest line in @file{data}:
3237
3238@example
3239expand data | awk '@{ if (x < length($0)) x = length($0) @}
3240                   END @{ print "maximum line length is " x @}'
3241@end example
3242
3243This example differs slightly from the previous one:
3244the input is processed by the @command{expand} utility to change TABs
3245into spaces, so the widths compared are actually the right-margin columns,
3246as opposed to the number of input characters on each line.
3247
3248@item
3249Print every line that has at least one field:
3250
3251@example
3252awk 'NF > 0' data
3253@end example
3254
3255This is an easy way to delete blank lines from a file (or rather, to
3256create a new file similar to the old file but from which the blank lines
3257have been removed).
3258
3259@item
3260Print seven random numbers from 0 to 100, inclusive:
3261
3262@example
3263awk 'BEGIN @{ for (i = 1; i <= 7; i++)
3264                 print int(101 * rand()) @}'
3265@end example
3266
3267@item
3268Print the total number of bytes used by @var{files}:
3269
3270@example
3271ls -l @var{files} | awk '@{ x += $5 @}
3272                   END @{ print "total bytes: " x @}'
3273@end example
3274
3275@item
3276Print the total number of kilobytes used by @var{files}:
3277
3278@c Don't use \ continuation, not discussed yet
3279@c Remember that awk does floating point division,
3280@c no need for (x+1023) / 1024
3281@example
3282ls -l @var{files} | awk '@{ x += $5 @}
3283   END @{ print "total K-bytes:", x / 1024 @}'
3284@end example
3285
3286@item
3287Print a sorted list of the login names of all users:
3288
3289@example
3290awk -F: '@{ print $1 @}' /etc/passwd | sort
3291@end example
3292
3293@item
3294Count the lines in a file:
3295
3296@example
3297awk 'END @{ print NR @}' data
3298@end example
3299
3300@item
3301Print the even-numbered lines in the @value{DF}:
3302
3303@example
3304awk 'NR % 2 == 0' data
3305@end example
3306
3307If you used the expression @samp{NR % 2 == 1} instead,
3308the program would print the odd-numbered lines.
3309@end itemize
3310
3311@node Two Rules
3312@section An Example with Two Rules
3313@cindex @command{awk} programs
3314
3315The @command{awk} utility reads the input files one line at a
3316time.  For each line, @command{awk} tries the patterns of each rule.
3317If several patterns match, then several actions execute in the order in
3318which they appear in the @command{awk} program.  If no patterns match, then
3319no actions run.
3320
3321After processing all the rules that match the line (and perhaps there are none),
3322@command{awk} reads the next line.  (However,
3323@pxref{Next Statement}
3324@ifdocbook
3325and @ref{Nextfile Statement}.)
3326@end ifdocbook
3327@ifnotdocbook
3328and also @pxref{Nextfile Statement}.)
3329@end ifnotdocbook
3330This continues until the program reaches the end of the file.
3331For example, the following @command{awk} program contains two rules:
3332
3333@example
3334/12/  @{ print $0 @}
3335/21/  @{ print $0 @}
3336@end example
3337
3338@noindent
3339The first rule has the string @samp{12} as the
3340pattern and @samp{print $0} as the action.  The second rule has the
3341string @samp{21} as the pattern and also has @samp{print $0} as the
3342action.  Each rule's action is enclosed in its own pair of braces.
3343
3344This program prints every line that contains the string
3345@samp{12} @emph{or} the string @samp{21}.  If a line contains both
3346strings, it is printed twice, once by each rule.
3347
3348This is what happens if we run this program on our two sample @value{DF}s,
3349@file{mail-list} and @file{inventory-shipped}:
3350
3351@example
3352$ @kbd{awk '/12/ @{ print $0 @}}
3353>      @kbd{/21/ @{ print $0 @}' mail-list inventory-shipped}
3354@print{} Anthony      555-3412     anthony.asserturo@@hotmail.com   A
3355@print{} Camilla      555-2912     camilla.infusarum@@skynet.be     R
3356@print{} Fabius       555-1234     fabius.undevicesimus@@ucb.edu    F
3357@print{} Jean-Paul    555-2127     jeanpaul.campanorum@@nyu.edu     R
3358@print{} Jean-Paul    555-2127     jeanpaul.campanorum@@nyu.edu     R
3359@print{} Jan  21  36  64 620
3360@print{} Apr  21  70  74 514
3361@end example
3362
3363@noindent
3364Note how the line beginning with @samp{Jean-Paul}
3365in @file{mail-list} was printed twice, once for each rule.
3366
3367@node More Complex
3368@section A More Complex Example
3369
3370Now that we've mastered some simple tasks, let's look at
3371what typical @command{awk}
3372programs do.  This example shows how @command{awk} can be used to
3373summarize, select, and rearrange the output of another utility.  It uses
3374features that haven't been covered yet, so don't worry if you don't
3375understand all the details:
3376
3377@example
3378ls -l | awk '$6 == "Nov" @{ sum += $5 @}
3379             END @{ print sum @}'
3380@end example
3381
3382@cindex @command{ls} utility
3383This command prints the total number of bytes in all the files in the
3384current directory that were last modified in November (of any year).
3385
3386As a reminder, the output of @w{@samp{ls -l}} gives you a listing of the
3387files in a directory, including each file's size and the date the file
3388was last modified.  The first field contains read-write permissions,
3389the second field contains the number of links to the file, and the
3390third field identifies the file's owner.  The fourth field identifies
3391the file's group.  The fifth field contains the file's size in bytes.
3392The sixth, seventh, and eighth fields contain the month, day, and time,
3393respectively, that the file was last modified.  Finally, the ninth field
3394contains the @value{FN}.
3395
3396@c @cindex automatic initialization
3397@cindex initialization, automatic
3398The @samp{$6 == "Nov"} in our @command{awk} program is an expression that
3399tests whether the sixth field of the output from @w{@samp{ls -l}}
3400matches the string @samp{Nov}.  Each time a line has the string
3401@samp{Nov} for its sixth field, @command{awk} performs the action
3402@samp{sum += $5}.  This adds the fifth field (the file's size) to the variable
3403@code{sum}.  As a result, when @command{awk} has finished reading all the
3404input lines, @code{sum} is the total of the sizes of the files whose
3405lines matched the pattern.  (This works because @command{awk} variables
3406are automatically initialized to zero.)
3407
3408After the last line of output from @command{ls} has been processed, the
3409@code{END} rule executes and prints the value of @code{sum}.
3410In this example, the value of @code{sum} is 80600.
3411
3412These more advanced @command{awk} techniques are covered in later
3413@value{SECTION}s
3414(@pxref{Action Overview}).  Before you can move on to more
3415advanced @command{awk} programming, you have to know how @command{awk} interprets
3416your input and displays your output.  By manipulating fields and using
3417@code{print} statements, you can produce some very useful and
3418impressive-looking reports.
3419
3420@node Statements/Lines
3421@section @command{awk} Statements Versus Lines
3422@cindex line breaks
3423@cindex newlines
3424
3425Most often, each line in an @command{awk} program is a separate statement or
3426separate rule, like this:
3427
3428@example
3429awk '/12/  @{ print $0 @}
3430     /21/  @{ print $0 @}' mail-list inventory-shipped
3431@end example
3432
3433@cindex @command{gawk} @subentry newlines in
3434However, @command{gawk} ignores newlines after any of the following
3435symbols and keywords:
3436
3437@example
3438,    @{    ?    :    ||    &&    do    else
3439@end example
3440
3441@noindent
A newline at any other point is considered the end of the
statement.@footnote{The @samp{?} and @samp{:} referred to here are the
operators of the three-operand conditional expression described in
@ref{Conditional Exp}.
Splitting lines after @samp{?} and @samp{:} is a minor @command{gawk}
extension; if @option{--posix} is specified
(@pxref{Options}), then this extension is disabled.}
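
To illustrate, here is a small sketch (assuming a POSIX-compliant shell)
in which a condition is split after @samp{&&} and a @code{print}
statement is split after a comma.  No backslashes are needed, because
@command{awk} ignores the newline at both points:

@example
$ @kbd{awk 'BEGIN @{}
> @kbd{    if (1 < 2 &&}
> @kbd{        2 < 3)}
> @kbd{        print "first",}
> @kbd{              "second"}
> @kbd{@}'}
@print{} first second
@end example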
3449
3450@cindex @code{\} (backslash) @subentry continuing lines and
3451@cindex backslash (@code{\}) @subentry continuing lines and
3452If you would like to split a single statement into two lines at a point
3453where a newline would terminate it, you can @dfn{continue} it by ending the
3454first line with a backslash character (@samp{\}).  The backslash must be
3455the final character on the line in order to be recognized as a continuation
3456character.  A backslash followed by a newline is allowed anywhere in the statement, even
3457in the middle of a string or regular expression.  For example:
3458
3459@example
3460awk '/This regular expression is too long, so continue it\
3461 on the next line/ @{ print $1 @}'
3462@end example
3463
3464@noindent
3465@cindex portability @subentry backslash continuation and
3466We have generally not used backslash continuation in our sample programs.
3467@command{gawk} places no limit on the
3468length of a line, so backslash continuation is never strictly necessary;
3469it just makes programs more readable.  For this same reason, as well as
3470for clarity, we have kept most statements short in the programs
3471presented throughout the @value{DOCUMENT}.
3472
3473Backslash continuation is
3474most useful when your @command{awk} program is in a separate source file
3475instead of entered from the command line.  You should also note that
3476many @command{awk} implementations are more particular about where you
3477may use backslash continuation. For example, they may not allow you to
3478split a string constant using backslash continuation.  Thus, for maximum
3479portability of your @command{awk} programs, it is best not to split your
3480lines in the middle of a regular expression or a string.
3481@c 10/2000: gawk, mawk, and current bell labs awk allow it,
3482@c solaris 2.7 nawk does not. Solaris /usr/xpg4/bin/awk does though!  sigh.
3483
3484@cindex @command{csh} utility
3485@cindex line continuations @subentry with C shell
3486@cindex backslash (@code{\}) @subentry continuing lines and @subentry in @command{csh}
3487@cindex @code{\} (backslash) @subentry continuing lines and @subentry in @command{csh}
3488@quotation CAUTION
3489@emph{Backslash continuation does not work as described
3490with the C shell.}  It works for @command{awk} programs in files and
3491for one-shot programs, @emph{provided} you are using a POSIX-compliant
3492shell, such as the Unix Bourne shell or Bash.  But the C shell behaves
3493differently!  There you must use two backslashes in a row, followed by
3494a newline.  Note also that when using the C shell, @emph{every} newline
3495in your @command{awk} program must be escaped with a backslash. To illustrate:
3496
3497@example
3498% @kbd{awk 'BEGIN @{ \}
3499? @kbd{  print \\}
3500? @kbd{      "hello, world" \}
3501? @kbd{@}'}
3502@print{} hello, world
3503@end example
3504
3505@noindent
3506Here, the @samp{%} and @samp{?} are the C shell's primary and secondary
3507prompts, analogous to the standard shell's @samp{$} and @samp{>}.
3508
3509Compare the previous example to how it is done with a POSIX-compliant shell:
3510
3511@example
3512$ @kbd{awk 'BEGIN @{}
3513>   @kbd{print \}
3514>       @kbd{"hello, world"}
3515> @kbd{@}'}
3516@print{} hello, world
3517@end example
3518@end quotation
3519
3520@command{awk} is a line-oriented language.  Each rule's action has to
3521begin on the same line as the pattern.  To have the pattern and action
3522on separate lines, you @emph{must} use backslash continuation; there
3523is no other option.
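
As a sketch of this point (again assuming a POSIX-compliant shell), the
following command places the pattern and its action on separate lines
by ending the first line with a backslash:

@example
$ @kbd{awk '/12/ \}
> @kbd{@{ print $1 @}' mail-list}
@print{} Anthony
@print{} Camilla
@print{} Fabius
@print{} Jean-Paul
@end example

@noindent
Without the backslash, @command{awk} would see two separate rules: the
pattern @code{/12/} with the default action, and an action with no
pattern that runs for every input line.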
3524
3525@cindex backslash (@code{\}) @subentry continuing lines and @subentry comments and
3526@cindex @code{\} (backslash) @subentry continuing lines and @subentry comments and
3527@cindex commenting @subentry backslash continuation and
3528Another thing to keep in mind is that backslash continuation and
3529comments do not mix. As soon as @command{awk} sees the @samp{#} that
3530starts a comment, it ignores @emph{everything} on the rest of the
3531line. For example:
3532
3533@example
3534@group
3535$ @kbd{gawk 'BEGIN @{ print "dont panic" # a friendly \}
3536> @kbd{                                   BEGIN rule}
3537> @kbd{@}'}
3538@error{} gawk: cmd. line:2:                BEGIN rule
3539@error{} gawk: cmd. line:2:                ^ syntax error
3540@end group
3541@end example
3542
3543@noindent
3544In this case, it looks like the backslash would continue the comment onto the
3545next line. However, the backslash-newline combination is never even
3546noticed because it is ``hidden'' inside the comment. Thus, the
3547@code{BEGIN} is noted as a syntax error.
3548
3549@cindex statements @subentry multiple
3550@cindex @code{;} (semicolon) @subentry separating statements in actions
3551@cindex semicolon (@code{;}) @subentry separating statements in actions
3552@cindex @code{;} (semicolon) @subentry separating rules
3553@cindex semicolon (@code{;}) @subentry separating rules
3554When @command{awk} statements within one rule are short, you might want to put
3555more than one of them on a line.  This is accomplished by separating the statements
3556with a semicolon (@samp{;}).
3557This also applies to the rules themselves.
3558Thus, the program shown at the start of this @value{SECTION}
3559could also be written this way:
3560
3561@example
3562/12/ @{ print $0 @} ; /21/ @{ print $0 @}
3563@end example
3564
3565@quotation NOTE
The requirement that rules on the same line be
separated with a semicolon was not in the original @command{awk}
language; it was added for consistency with the treatment of statements
within an action.
3570@end quotation
3571
3572@node Other Features
3573@section Other Features of @command{awk}
3574
3575@cindex variables
3576The @command{awk} language provides a number of predefined, or
3577@dfn{built-in}, variables that your programs can use to get information
3578from @command{awk}.  There are other variables your program can set
3579as well to control how @command{awk} processes your data.
3580
3581In addition, @command{awk} provides a number of built-in functions for doing
3582common computational and string-related operations.
@command{gawk} also provides built-in functions for working with timestamps,
performing bit manipulation, translating strings at runtime (internationalization),
determining the type of a variable,
and sorting arrays.
3587
3588As we develop our presentation of the @command{awk} language, we will introduce
3589most of the variables and many of the functions. They are described
3590systematically in @ref{Built-in Variables} and in
3591@ref{Built-in}.
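
As a small taste, the following sketch combines two built-in variables,
@code{NR} (the number of the current record) and @code{NF} (the number
of fields in it), with the built-in string function @code{toupper()}.
It uses the @file{mail-list} sample file and stops after the first
three records:

@example
$ @kbd{awk 'NR <= 3 @{ print NR, NF, toupper($1) @}' mail-list}
@print{} 1 4 AMELIA
@print{} 2 4 ANTHONY
@print{} 3 4 BECKY
@end example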
3592
3593@node When
3594@section When to Use @command{awk}
3595
3596@cindex @command{awk} @subentry uses for
3597Now that you've seen some of what @command{awk} can do,
3598you might wonder how @command{awk} could be useful for you.  By using
3599utility programs, advanced patterns, field separators, arithmetic
3600statements, and other selection criteria, you can produce much more
3601complex output.  The @command{awk} language is very useful for producing
3602reports from large amounts of raw data, such as summarizing information
3603from the output of other utility programs like @command{ls}.
3604(@xref{More Complex}.)
3605
3606Programs written with @command{awk} are usually much smaller than they would
3607be in other languages.  This makes @command{awk} programs easy to compose and
3608use.  Often, @command{awk} programs can be quickly composed at your keyboard,
3609used once, and thrown away.  Because @command{awk} programs are interpreted, you
3610can avoid the (usually lengthy) compilation part of the typical
3611edit-compile-test-debug cycle of software development.
3612
3613@cindex BWK @command{awk} @seeentry{Brian Kernighan's @command{awk}}
3614@cindex Brian Kernighan's @command{awk}
3615Complex programs have been written in @command{awk}, including a complete
3616retargetable assembler for
3617@ifclear FOR_PRINT
3618eight-bit microprocessors (@pxref{Glossary}, for more information),
3619@end ifclear
3620@ifset FOR_PRINT
3621eight-bit microprocessors,
3622@end ifset
3623and a microcode assembler for a special-purpose Prolog
3624computer.
3625The original @command{awk}'s capabilities were strained by tasks
3626of such complexity, but modern versions are more capable.
3627
3628@cindex @command{awk} programs @subentry complex
3629If you find yourself writing @command{awk} scripts of more than, say,
3630a few hundred lines, you might consider using a different programming
3631language.  The shell is good at string and pattern matching; in addition,
3632it allows powerful use of the system utilities.  Python offers a nice
3633balance between high-level ease of programming and access to system
3634facilities.@footnote{Other popular scripting languages include Ruby
3635and Perl.}
3636
3637@node Intro Summary
3638@section Summary
3639
3640@itemize @value{BULLET}
3641@item
3642Programs in @command{awk} consist of @var{pattern}--@var{action} pairs.
3643
3644@item
3645An @var{action} without a @var{pattern} always runs.  The default
3646@var{action} for a pattern without one is @samp{@{ print $0 @}}.
3647
3648@item
3649Use either
3650@samp{awk '@var{program}' @var{files}}
3651or
3652@samp{awk -f @var{program-file} @var{files}}
3653to run @command{awk}.
3654
3655@item
3656You may use the special @samp{#!} header line to create @command{awk}
3657programs that are directly executable.
3658
3659@item
3660Comments in @command{awk} programs start with @samp{#} and continue to
3661the end of the same line.
3662
3663@item
3664Be aware of quoting issues when writing @command{awk} programs as
3665part of a larger shell script (or MS-Windows batch file).
3666
3667@item
3668You may use backslash continuation to continue a source line.
3669Lines are automatically continued after
3670a comma, open brace, question mark, colon,
3671@samp{||}, @samp{&&}, @code{do}, and @code{else}.
3672@end itemize
3673
3674@node Invoking Gawk
3675@chapter Running @command{awk} and @command{gawk}
3676
3677This @value{CHAPTER} covers how to run @command{awk}, both POSIX-standard
3678and @command{gawk}-specific command-line options, and what
3679@command{awk} and
3680@command{gawk} do with nonoption arguments.
3681It then proceeds to cover how @command{gawk} searches for source files,
3682reading standard input along with other files, @command{gawk}'s
3683environment variables, @command{gawk}'s exit status, using include files,
3684and obsolete and undocumented options and/or features.
3685
3686Many of the options and features described here are discussed in
3687more detail later in the @value{DOCUMENT}; feel free to skip over
3688things in this @value{CHAPTER} that don't interest you right now.
3689
3690@menu
3691* Command Line::                How to run @command{awk}.
3692* Options::                     Command-line options and their meanings.
3693* Other Arguments::             Input file names and variable assignments.
3694* Naming Standard Input::       How to specify standard input with other
3695                                files.
3696* Environment Variables::       The environment variables @command{gawk} uses.
3697* Exit Status::                 @command{gawk}'s exit status.
3698* Include Files::               Including other files into your program.
3699* Loading Shared Libraries::    Loading shared libraries into your program.
* Obsolete::                    Obsolete options and/or features.
3701* Undocumented::                Undocumented Options and Features.
3702* Invoking Summary::            Invocation summary.
3703@end menu
3704
3705@node Command Line
3706@section Invoking @command{awk}
3707@cindex command line @subentry invoking @command{awk} from
3708@cindex @command{awk} @subentry invoking
3709@cindex arguments @subentry command-line @subentry invoking @command{awk}
3710@cindex options @subentry command-line @subentry invoking @command{awk}
3711
3712There are two ways to run @command{awk}---with an explicit program or with
3713one or more program files.  Here are templates for both of them; items
3714enclosed in [@dots{}] in these templates are optional:
3715
3716@display
3717@command{awk} [@var{options}] @option{-f} @var{progfile} [@option{--}] @var{file} @dots{}
3718@command{awk} [@var{options}] [@option{--}] @code{'@var{program}'} @var{file} @dots{}
3719@end display
3720
3721@cindex GNU long options
3722@cindex long options
3723@cindex options @subentry long
3724In addition to traditional one-letter POSIX-style options, @command{gawk} also
3725supports GNU long options.
3726
3727@cindex dark corner @subentry invoking @command{awk}
3728@cindex lint checking @subentry empty programs
3729It is possible to invoke @command{awk} with an empty program:
3730
3731@example
3732awk '' datafile1 datafile2
3733@end example
3734
3735@cindex @option{--lint} option
3736@cindex dark corner @subentry empty programs
3737@noindent
3738Doing so makes little sense, though; @command{awk} exits
3739silently when given an empty program.
3740@value{DARKCORNER}
3741If @option{--lint} has
3742been specified on the command line, @command{gawk} issues a
3743warning that the program is empty.
3744
3745@node Options
3746@section Command-Line Options
3747@cindex options @subentry command-line
3748@cindex command line @subentry options
3749@cindex GNU long options
3750@cindex options @subentry long
3751
3752Options begin with a dash and consist of a single character.
3753GNU-style long options consist of two dashes and a keyword.
3754The keyword can be abbreviated, as long as the abbreviation allows the option
3755to be uniquely identified.  If the option takes an argument, either the
3756keyword is immediately followed by an equals sign (@samp{=}) and the
3757argument's value, or the keyword and the argument's value are separated
3758by whitespace (spaces or TABs).
3759If a particular option with a value is given more than once, it is (usually)
3760the last value that counts.
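
As a brief illustration of these rules, the following three commands
should be equivalent when run with @command{gawk}.  The input
@samp{a:b:c} is made up for the example, and @option{--field} is simply
one abbreviation that is assumed to be unique among the long options
described in this @value{CHAPTER}:

@example
$ @kbd{echo a:b:c | gawk --field-separator=: '@{ print $2 @}'}
@print{} b
$ @kbd{echo a:b:c | gawk --field-separator : '@{ print $2 @}'}
@print{} b
$ @kbd{echo a:b:c | gawk --field : '@{ print $2 @}'}
@print{} b
@end example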
3761
3762@cindex POSIX @command{awk} @subentry GNU long options and
3763Each long option for @command{gawk} has a corresponding
3764POSIX-style short option.
3765The long and short options are
3766interchangeable in all contexts.
3767The following list describes options mandated by the POSIX standard:
3768
3769@table @code
3770@item -F @var{fs}
3771@itemx --field-separator @var{fs}
3772@cindex @option{-F} option
3773@cindex @option{--field-separator} option
3774@cindex @code{FS} variable @subentry @option{--field-separator} option and
3775Set the @code{FS} variable to @var{fs}
3776(@pxref{Field Separators}).
3777
3778@item -f @var{source-file}
3779@itemx --file @var{source-file}
3780@cindex @option{-f} option
3781@cindex @option{--file} option
3782@cindex @command{awk} programs @subentry location of
3783Read the @command{awk} program source from @var{source-file}
3784instead of in the first nonoption argument.
3785This option may be given multiple times; the @command{awk}
3786program consists of the concatenation of the contents of
3787each specified @var{source-file}.
3788
3789Files named with @option{-f} are treated as if they had @samp{@@namespace "awk"}
3790at their beginning. @xref{Changing The Namespace}, for more information
3791on this advanced feature.
3792
3793@item -v @var{var}=@var{val}
3794@itemx --assign @var{var}=@var{val}
3795@cindex @option{-v} option
3796@cindex @option{--assign} option
3797@cindex variables @subentry setting
3798Set the variable @var{var} to the value @var{val} @emph{before}
3799execution of the program begins.  Such variable values are available
3800inside the @code{BEGIN} rule
3801(@pxref{Other Arguments}).
3802
3803The @option{-v} option can only set one variable, but it can be used
3804more than once, setting another variable each time, like this:
3805@samp{awk @w{-v foo=1} @w{-v bar=2} @dots{}}.
3806
3807@cindex predefined variables @subentry @code{-v} option, setting with
3808@cindex variables @subentry predefined @subentry @code{-v} option, setting with
3809@quotation CAUTION
3810Using @option{-v} to set the values of the built-in
3811variables may lead to surprising results.  @command{awk} will reset the
3812values of those variables as it needs to, possibly ignoring any
3813initial value you may have given.
3814@end quotation
3815
3816@item -W @var{gawk-opt}
3817@cindex @option{-W} option
3818Provide an implementation-specific option.
3819This is the POSIX convention for providing implementation-specific options.
3820These options
3821also have corresponding GNU-style long options.
3822Note that the long options may be abbreviated, as long as
3823the abbreviations remain unique.
3824The full list of @command{gawk}-specific options is provided next.
3825
3826@item --
3827@cindex command line @subentry options @subentry end of
3828@cindex options @subentry command-line @subentry end of
3829Signal the end of the command-line options.  The following arguments
3830are not treated as options even if they begin with @samp{-}.  This
3831interpretation of @option{--} follows the POSIX argument parsing
3832conventions.
3833
3834@cindex @code{-} (hyphen) @subentry file names beginning with
3835@cindex hyphen (@code{-}) @subentry file names beginning with
3836This is useful if you have @value{FN}s that start with @samp{-},
3837or in shell scripts, if you have @value{FN}s that will be specified
3838by the user that could start with @samp{-}.
3839It is also useful for passing options on to the @command{awk}
3840program; see @ref{Getopt Function}.
3841@end table
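
As a small sketch of the @option{--} marker in use, assume a one-line
program file named @file{cat.awk} and a data file named @file{-input}
(both names are invented for this illustration):

@example
$ @kbd{echo '@{ print @}' > cat.awk}
$ @kbd{echo hello > ./-input}
$ @kbd{awk -f cat.awk -- -input}
@print{} hello
@end example

@noindent
Without the @option{--}, @command{awk} would attempt to parse
@samp{-input} as an option instead of opening it as a file.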
3842
3843The following list describes @command{gawk}-specific options:
3844
3845@c Have to use @asis here to get docbook to come out right.
3846@table @asis
3847@item @option{-b}
3848@itemx @option{--characters-as-bytes}
3849@cindex @option{-b} option
3850@cindex @option{--characters-as-bytes} option
3851Cause @command{gawk} to treat all input data as single-byte characters.
3852In addition, all output written with @code{print} or @code{printf}
3853is treated as single-byte characters.
3854
3855Normally, @command{gawk} follows the POSIX standard and attempts to process
3856its input data according to the current locale (@pxref{Locales}). This can often involve
3857converting multibyte characters into wide characters (internally), and
3858can lead to problems or confusion if the input data does not contain valid
3859multibyte characters. This option is an easy way to tell @command{gawk},
3860``Hands off my data!''
3861
3862@item @option{-c}
3863@itemx @option{--traditional}
3864@cindex @option{-c} option
3865@cindex @option{--traditional} option
3866@cindex compatibility mode (@command{gawk}) @subentry specifying
3867Specify @dfn{compatibility mode}, in which the GNU extensions to
3868the @command{awk} language are disabled, so that @command{gawk} behaves just
3869like BWK @command{awk}.
3870@xref{POSIX/GNU},
3871which summarizes the extensions.
3872@ifclear FOR_PRINT
3873Also see
3874@ref{Compatibility Mode}.
3875@end ifclear
3876
3877@item @option{-C}
3878@itemx @option{--copyright}
3879@cindex @option{-C} option
3880@cindex @option{--copyright} option
3881@cindex GPL (General Public License) @subentry printing
3882Print the short version of the General Public License and then exit.
3883
3884@item @option{-d}[@var{file}]
3885@itemx @option{--dump-variables}[@code{=}@var{file}]
3886@cindex @option{-d} option
3887@cindex @option{--dump-variables} option
3888@cindex dump all variables of a program
3889@cindex @file{awkvars.out} file
3890@cindex files @subentry @file{awkvars.out}
3891@cindex variables @subentry global @subentry printing list of
3892Print a sorted list of global variables, their types, and final values
3893to @var{file}.  If no @var{file} is provided, print this
3894list to a file named @file{awkvars.out} in the current directory.
3895No space is allowed between the @option{-d} and @var{file}, if
3896@var{file} is supplied.
3897
3898@cindex troubleshooting @subentry typographical errors, global variables
3899Having a list of all global variables is a good way to look for
3900typographical errors in your programs.
3901You would also use this option if you have a large program with a lot of
3902functions, and you want to be sure that your functions don't
3903inadvertently use global variables that you meant to be local.
3904(This is a particularly easy mistake to make with simple variable
3905names like @code{i}, @code{j}, etc.)
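
As a minimal sketch of this use, consider a program in which
@code{total} has been mistyped once as @code{totl}.  The variable names
and the output file name @file{vars.out} are invented for the example;
note the absence of a space after @option{-d}:

@example
$ @kbd{gawk -dvars.out 'BEGIN @{ total = 5; totl++ @}'}
$ @kbd{grep '^tot' vars.out}
@end example

@noindent
Both @code{total} and @code{totl} should appear in the list, which
makes the stray variable easy to spot.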
3906
3907@item @option{-D}[@var{file}]
3908@itemx @option{--debug}[@code{=}@var{file}]
3909@cindex @option{-D} option
3910@cindex @option{--debug} option
3911@cindex @command{awk} programs @subentry debugging, enabling
3912Enable debugging of @command{awk} programs
3913(@pxref{Debugging}).
3914By default, the debugger reads commands interactively from the keyboard
3915(standard input).
3916The optional @var{file} argument allows you to specify a file with a list
3917of commands for the debugger to execute noninteractively.
3918No space is allowed between the @option{-D} and @var{file}, if
3919@var{file} is supplied.
3920
3921@item @option{-e} @var{program-text}
3922@itemx @option{--source} @var{program-text}
3923@cindex @option{-e} option
3924@cindex @option{--source} option
3925@cindex source code @subentry mixing
3926Provide program source code in the @var{program-text}.
3927This option allows you to mix source code in files with source
3928code that you enter on the command line.
3929This is particularly useful
3930when you have library functions that you want to use from your command-line
3931programs (@pxref{AWKPATH Variable}).
3932
3933Note that @command{gawk} treats each string as if it ended with
3934a newline character (even if it doesn't). This makes building
3935the total program easier.
3936
3937@quotation CAUTION
3938Prior to @value{PVERSION} 5.0, there was
3939no requirement that each @var{program-text}
3940be a full syntactic unit. I.e., the following worked:
3941
3942@example
3943$ @kbd{gawk -e 'BEGIN @{ a = 5 ;' -e 'print a @}'}
3944@print{} 5
3945@end example
3946
3947@noindent
3948However, this is no longer true. If you have any scripts that
3949rely upon this feature, you should revise them.
3950
3951This is because each @var{program-text} is treated as if it had
3952@samp{@@namespace "awk"} at its beginning. @xref{Changing The Namespace},
3953for more information.
3954@end quotation
3955
3956@item @option{-E} @var{file}
3957@itemx @option{--exec} @var{file}
3958@cindex @option{-E} option
3959@cindex @option{--exec} option
3960@cindex @command{awk} programs @subentry location of
3961@cindex CGI, @command{awk} scripts for
3962Similar to @option{-f}, read @command{awk} program text from @var{file}.
3963There are two differences from @option{-f}:
3964
3965@itemize @value{BULLET}
3966@item
3967This option terminates option processing; anything
3968else on the command line is passed on directly to the @command{awk} program.
3969
3970@item
3971Command-line variable assignments of the form
3972@samp{@var{var}=@var{value}} are disallowed.
3973@end itemize
3974
3975This option is particularly necessary for World Wide Web CGI applications
3976that pass arguments through the URL; using this option prevents a malicious
3977(or other) user from passing in options, assignments, or @command{awk} source
3978code (via @option{-e}) to the CGI application.@footnote{For more detail,
3979please see Section 4.4 of @uref{http://www.ietf.org/rfc/rfc3875,
3980RFC 3875}. Also see the
3981@uref{https://lists.gnu.org/archive/html/bug-gawk/2014-11/msg00022.html,
3982explanatory note sent to the @command{gawk} bug
3983mailing list}.}
3984This option should be used
3985with @samp{#!} scripts (@pxref{Executable Scripts}), like so:
3986
3987@example
3988#! /usr/local/bin/gawk -E
3989
3990@var{awk program here @dots{}}
3991@end example
3992
3993@item @option{-g}
3994@itemx @option{--gen-pot}
3995@cindex @option{-g} option
3996@cindex @option{--gen-pot} option
3997@cindex portable object @subentry files @subentry generating
3998@cindex files @subentry portable object @subentry generating
3999Analyze the source program and
4000generate a GNU @command{gettext} portable object template file on standard
4001output for all string constants that have been marked for translation.
4002@xref{Internationalization},
4003for information about this option.
4004
4005@item @option{-h}
4006@itemx @option{--help}
4007@cindex @option{-h} option
4008@cindex @option{--help} option
4009@cindex GNU long options @subentry printing list of
4010@cindex options @subentry printing list of
4011@cindex printing @subentry list of options
4012Print a ``usage'' message summarizing the short- and long-style options
4013that @command{gawk} accepts and then exit.
4014
4015@item @option{-i} @var{source-file}
4016@itemx @option{--include} @var{source-file}
4017@cindex @option{-i} option
4018@cindex @option{--include} option
4019@cindex @command{awk} programs @subentry location of
4020Read an @command{awk} source library from @var{source-file}.  This option
4021is completely equivalent to using the @code{@@include} directive inside
4022your program.  It is very similar to the @option{-f} option,
4023but there are two important differences.  First, when @option{-i} is
4024used, the program source is not loaded if it has been previously
4025loaded, whereas with @option{-f}, @command{gawk} always loads the file.
4026Second, because this option is intended to be used with code libraries,
4027@command{gawk} does not recognize such files as constituting main program
4028input.  Thus, after processing an @option{-i} argument, @command{gawk}
4029still expects to find the main source code via the @option{-f} option
4030or on the command line.
4031
4032Files named with @option{-i} are treated as if they had @samp{@@namespace "awk"}
4033at their beginning.  @xref{Changing The Namespace}, for more information.
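
The first difference can be seen with a small library file, here
hypothetically named @file{lib.awk} and placed in the current directory
(which is assumed to be on the search path).  Naming the file twice with
@option{-i} should be harmless, because the second request is skipped:

@example
$ @kbd{echo 'function double(x) @{ return 2 * x @}' > lib.awk}
$ @kbd{gawk -i lib.awk -i lib.awk 'BEGIN @{ print double(21) @}'}
@print{} 42
@end example

@noindent
By contrast, naming the same file twice with @option{-f} loads it twice,
which for a file of function definitions is expected to produce a
duplicate-definition error.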
4034
4035@item @option{-I}
4036@itemx @option{--trace}
4037@cindex @option{-I} option
4038@cindex @option{--trace} option
4039@cindex trace, internal instructions
4040@cindex instructions, trace of internal
4041@cindex op-codes, trace of internal
4042Print the internal byte code names as they are executed when running
4043the program. The trace is printed to standard error. Each ``op code''
4044is preceded by a @code{+}
4045sign in the output.
4046
4047@item @option{-l} @var{ext}
4048@itemx @option{--load} @var{ext}
4049@cindex @option{-l} option
4050@cindex @option{--load} option
4051@cindex loading extensions
4052Load a dynamic extension named @var{ext}. Extensions
4053are stored as system shared libraries.
4054This option searches for the library using the @env{AWKLIBPATH}
4055environment variable.  The correct library suffix for your platform will be
4056supplied by default, so it need not be specified in the extension name.
4057The extension initialization routine should be named @code{dl_load()}.
4058An alternative is to use the @code{@@load} keyword inside the program to load
4059a shared library.  This advanced feature is described in detail in @ref{Dynamic Extensions}.
4060
4061@item @option{-L}[@var{value}]
4062@itemx @option{--lint}[@code{=}@var{value}]
@cindex @option{-L} option
4064@cindex @option{--lint} option
4065@cindex lint checking @subentry issuing warnings
4066@cindex warnings, issuing
4067Warn about constructs that are dubious or nonportable to
4068other @command{awk} implementations.
4069No space is allowed between the @option{-L} and @var{value}, if
4070@var{value} is supplied.
4071Some warnings are issued when @command{gawk} first reads your program.  Others
4072are issued at runtime, as your program executes. The optional
4073argument may be one of the following:
4074
4075@table @code
4076@item fatal
Cause lint warnings to become fatal errors.
4078This may be drastic, but its use will certainly encourage the
4079development of cleaner @command{awk} programs.
4080
4081@item invalid
Only issue warnings about things
that are actually invalid. (This is not fully implemented yet.)
4084
4085@item no-ext
4086Disable warnings about @command{gawk} extensions.
4087@end table
4088
4089Some warnings are only printed once, even if the dubious constructs they
4090warn about occur multiple times in your @command{awk} program.  Thus,
4091when eliminating problems pointed out by @option{--lint}, you should take
4092care to search for all occurrences of each inappropriate construct. As
4093@command{awk} programs are usually short, doing so is not burdensome.
4094
4095@item @option{-M}
4096@itemx @option{--bignum}
4097@cindex @option{-M} option
4098@cindex @option{--bignum} option
4099Select arbitrary-precision arithmetic on numbers. This option has no effect
4100if @command{gawk} is not compiled to use the GNU MPFR and MP libraries
4101(@pxref{Arbitrary Precision Arithmetic}).
4102
4103@item @option{-n}
4104@itemx @option{--non-decimal-data}
4105@cindex @option{-n} option
4106@cindex @option{--non-decimal-data} option
4107@cindex hexadecimal values, enabling interpretation of
4108@cindex octal values, enabling interpretation of
4109@cindex troubleshooting @subentry @option{--non-decimal-data} option
4110Enable automatic interpretation of octal and hexadecimal
4111values in input data
4112(@pxref{Nondecimal Data}).
4113
4114@quotation CAUTION
4115This option can severely break old programs.  Use with care.  Also note
4116that this option may disappear in a future version of @command{gawk}.
4117@end quotation
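
The effect is easiest to see on a single made-up input value.  Without
the option, a field such as @samp{0x1F} has the numeric value zero; with
it, the field is interpreted as hexadecimal:

@example
$ @kbd{echo 0x1F | gawk '@{ print $1 + 0 @}'}
@print{} 0
$ @kbd{echo 0x1F | gawk -n '@{ print $1 + 0 @}'}
@print{} 31
@end example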
4118
4119@item @option{-N}
4120@itemx @option{--use-lc-numeric}
4121@cindex @option{-N} option
4122@cindex @option{--use-lc-numeric} option
4123Force the use of the locale's decimal point character
4124when parsing numeric input data (@pxref{Locales}).
4125
4126@cindex pretty printing
4127@item @option{-o}[@var{file}]
4128@itemx @option{--pretty-print}[@code{=}@var{file}]
4129@cindex @option{-o} option
4130@cindex @option{--pretty-print} option
4131Enable pretty-printing of @command{awk} programs.
4132Implies @option{--no-optimize}.
4133By default, the output program is created in a file named @file{awkprof.out}
4134(@pxref{Profiling}).
4135The optional @var{file} argument allows you to specify a different
4136@value{FN} for the output.
4137No space is allowed between the @option{-o} and @var{file}, if
4138@var{file} is supplied.
4139
4140@quotation NOTE
4141In the past, this option would also execute your program.
4142This is no longer the case.
4143@end quotation
4144
4145@item @option{-O}
4146@itemx @option{--optimize}
4147@cindex @option{--optimize} option
4148@cindex @option{-O} option
4149Enable @command{gawk}'s default optimizations on the internal
4150representation of the program.  At the moment, this includes just simple
4151constant folding.
4152
4153Optimization is enabled by default.
4154This option remains primarily for backwards compatibility. However, it may
4155be used to cancel the effect of an earlier @option{-s} option
4156(see later in this list).
4157
4158@item @option{-p}[@var{file}]
4159@itemx @option{--profile}[@code{=}@var{file}]
4160@cindex @option{-p} option
4161@cindex @option{--profile} option
4162@cindex @command{awk} @subentry profiling, enabling
4163Enable profiling of @command{awk} programs
4164(@pxref{Profiling}).
4165Implies @option{--no-optimize}.
4166By default, profiles are created in a file named @file{awkprof.out}.
4167The optional @var{file} argument allows you to specify a different
4168@value{FN} for the profile file.
4169No space is allowed between the @option{-p} and @var{file}, if
4170@var{file} is supplied.
4171
4172The profile contains execution counts for each statement in the program
4173in the left margin, and function call counts for each function.
4174
4175@item @option{-P}
4176@itemx @option{--posix}
4177@cindex @option{-P} option
4178@cindex @option{--posix} option
4179@cindex POSIX mode
4180@cindex @command{gawk} @subentry extensions, disabling
4181Operate in strict POSIX mode.  This disables all @command{gawk}
4182extensions (just like @option{--traditional}) and
4183disables all extensions not allowed by POSIX.
@xref{Common Extensions}, for a summary of the extensions
4185in @command{gawk} that are disabled by this option.
4186Also,
4187the following additional
4188restrictions apply:
4189
4190@itemize @value{BULLET}
4191
4192@cindex newlines
4193@cindex whitespace @subentry newlines as
4194@item
4195Newlines are not allowed after @samp{?} or @samp{:}
4196(@pxref{Conditional Exp}).
4197
4198
4199@cindex @code{FS} variable @subentry TAB character as
4200@item
4201Specifying @samp{-Ft} on the command line does not set the value
4202of @code{FS} to be a single TAB character
4203(@pxref{Field Separators}).
4204
4205@cindex locale decimal point character
4206@cindex decimal point character, locale specific
4207@item
4208The locale's decimal point character is used for parsing input
4209data (@pxref{Locales}).
4210@end itemize
4211
4212@c @cindex automatic warnings
4213@c @cindex warnings, automatic
4214@cindex @option{--traditional} option @subentry @option{--posix} option and
4215@cindex @option{--posix} option @subentry @option{--traditional} option and
4216If you supply both @option{--traditional} and @option{--posix} on the
4217command line, @option{--posix} takes precedence. @command{gawk}
4218issues a warning if both options are supplied.
4219
4220@item @option{-r}
4221@itemx @option{--re-interval}
4222@cindex @option{-r} option
4223@cindex @option{--re-interval} option
4224@cindex regular expressions @subentry interval expressions and
4225Allow interval expressions
4226(@pxref{Regexp Operators})
4227in regexps.
4228This is now @command{gawk}'s default behavior.
4229Nevertheless, this option remains (both for backward compatibility
4230and for use in combination with @option{--traditional}).
4231
4232@item @option{-s}
4233@itemx @option{--no-optimize}
4234@cindex @option{--no-optimize} option
4235@cindex @option{-s} option
4236Disable @command{gawk}'s default optimizations on the internal
4237representation of the program.
4238
4239@item @option{-S}
4240@itemx @option{--sandbox}
4241@cindex @option{-S} option
4242@cindex @option{--sandbox} option
4243@cindex sandbox mode
4244@cindex @code{ARGV} array
4245Disable the @code{system()} function,
4246input redirections with @code{getline},
4247output redirections with @code{print} and @code{printf},
4248and dynamic extensions.
4249Also, disallow adding @value{FN}s to @code{ARGV} that were
4250not there when @command{gawk} started running.
4251This is particularly useful when you want to run @command{awk} scripts
4252from questionable sources and need to make sure the scripts
4253can't access your system (other than the specified input @value{DF}s).
4254
4255@item @option{-t}
4256@itemx @option{--lint-old}
@cindex @option{-t} option
4258@cindex @option{--lint-old} option
4259Warn about constructs that are not available in the original version of
4260@command{awk} from Version 7 Unix
4261(@pxref{V7/SVR3.1}).
4262
4263@item @option{-V}
4264@itemx @option{--version}
4265@cindex @option{-V} option
4266@cindex @option{--version} option
4267@cindex @command{gawk} @subentry version of @subentry printing information about
4268Print version information for this particular copy of @command{gawk}.
4269This allows you to determine if your copy of @command{gawk} is up to date
4270with respect to whatever the Free Software Foundation is currently
4271distributing.
4272It is also useful for bug reports
4273(@pxref{Bugs}).
4274
4275@cindex @code{-} (hyphen) @subentry @code{--} end of options marker
4276@cindex hyphen (@code{-}) @subentry @code{--} end of options marker
4277@item @code{--}
4278Mark the end of all options.
4279Any command-line arguments following @code{--} are placed in @code{ARGV},
4280even if they start with a minus sign.
4281@end table
4282
4283In compatibility mode,
4284as long as program text has been supplied,
4285any other options are flagged as invalid with a warning message but
4286are otherwise ignored.
4287
4288@cindex @option{-F} option @subentry @option{-Ft} sets @code{FS} to TAB
4289In compatibility mode, as a special case, if the value of @var{fs} supplied
4290to the @option{-F} option is @samp{t}, then @code{FS} is set to the TAB
4291character (@code{"\t"}).  This is true only for @option{--traditional} and not
4292for @option{--posix}
4293(@pxref{Field Separators}).
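
The difference can be observed with a made-up input line that contains
both a TAB and the letter @samp{t}.  In this sketch,
@option{--traditional} is given before @option{-F} so that compatibility
mode is already in effect when the separator is examined:

@example
$ @kbd{printf 'a\tbtc\n' | gawk --traditional -Ft '@{ print $2 @}'}
@print{} btc
$ @kbd{printf 'a\tbtc\n' | gawk -Ft '@{ print $2 @}'}
@print{} c
@end example

@noindent
In the first command the line is split at the TAB; in the second it is
split at the letter @samp{t}.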
4294
4295@cindex @option{-f} option @subentry multiple uses
4296The @option{-f} option may be used more than once on the command line.
4297If it is, @command{awk} reads its program source from all of the named files, as
4298if they had been concatenated together into one big file.  This is
4299useful for creating libraries of @command{awk} functions.  These functions
4300can be written once and then retrieved from a standard place, instead
4301of having to be included in each individual program.
4302The @option{-i} option is similar in this regard.
4303(As mentioned in
4304@ref{Definition Syntax},
4305function names must be unique.)
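
A minimal sketch of such a setup follows.  The file names
@file{max.awk} and @file{main.awk}, and the function @code{max()}, are
invented for the example:

@example
$ @kbd{echo 'function max(a, b) @{ return a > b ? a : b @}' > max.awk}
$ @kbd{echo 'BEGIN @{ print max(3, 7) @}' > main.awk}
$ @kbd{awk -f max.awk -f main.awk}
@print{} 7
@end example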
4306
4307With standard @command{awk}, library functions can still be used, even
4308if the program is entered at the keyboard,
4309by specifying @samp{-f /dev/tty}.  After typing your program,
4310type @kbd{Ctrl-d} (the end-of-file character) to terminate it.
4311(You may also use @samp{-f -} to read program source from the standard
4312input, but then you will not be able to also use the standard input as a
4313source of data.)
4314
4315Because it is clumsy using the standard @command{awk} mechanisms to mix
4316source file and command-line @command{awk} programs, @command{gawk}
4317provides the @option{-e} option.  This does not require you to
4318preempt the standard input for your source code, and it allows you to easily
4319mix command-line and library source code (@pxref{AWKPATH Variable}).
4320As with @option{-f}, the @option{-e} and @option{-i}
4321options may also be used multiple times on the command line.
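
For instance, assuming a library file that defines a function
@code{max()} (the file @file{max.awk} is hypothetical and is created
here only for the illustration), the main program can be typed directly
on the command line:

@example
$ @kbd{echo 'function max(a, b) @{ return a > b ? a : b @}' > max.awk}
$ @kbd{gawk -f max.awk -e 'BEGIN @{ print max(3, 7) @}'}
@print{} 7
@end example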
4322
4323@cindex @option{-e} option
4324If no @option{-f} option (or @option{-e} option for @command{gawk})
4325is specified, then @command{awk} uses the first nonoption command-line
4326argument as the text of the program source code.  Arguments on
4327the command line that follow the program text are entered into the
4328@code{ARGV} array; @command{awk} does @emph{not} continue to parse the
4329command line looking for options.
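
The following sketch shows the consequence: an option-like argument
placed after the program text is simply stored in @code{ARGV}.  The
program has only a @code{BEGIN} rule, so no input is read:

@example
$ @kbd{awk 'BEGIN @{ for (i = 1; i < ARGC; i++) print i, ARGV[i] @}' -v x=1}
@print{} 1 -v
@print{} 2 x=1
@end example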
4330
4331@cindex @env{POSIXLY_CORRECT} environment variable
4332@cindex environment variables @subentry @env{POSIXLY_CORRECT}
4333@cindex lint checking @subentry @env{POSIXLY_CORRECT} environment variable
4334@cindex POSIX mode
4335If the environment variable @env{POSIXLY_CORRECT} exists,
4336then @command{gawk} behaves in strict POSIX mode, exactly as if
4337you had supplied @option{--posix}.
4338Many GNU programs look for this environment variable to suppress
4339extensions that conflict with POSIX, but @command{gawk} behaves
4340differently: it suppresses all extensions, even those that do not
4341conflict with POSIX, and behaves in
4342strict POSIX mode. If @option{--lint} is supplied on the command line
4343and @command{gawk} turns on POSIX mode because of @env{POSIXLY_CORRECT},
4344then it issues a warning message indicating that POSIX
4345mode is in effect.
4346You would typically set this variable in your shell's startup file.
4347For a Bourne-compatible shell (such as Bash), you would add these
4348lines to the @file{.profile} file in your home directory:
4349
4350@example
4351POSIXLY_CORRECT=true
4352export POSIXLY_CORRECT
4353@end example
4354
4355@cindex @command{csh} utility @subentry @env{POSIXLY_CORRECT} environment variable
4356For a C shell-compatible
4357shell,@footnote{Not recommended.}
4358you would add this line to the @file{.login} file in your home directory:
4359
4360@example
4361setenv POSIXLY_CORRECT true
4362@end example
4363
4364@cindex portability @subentry @env{POSIXLY_CORRECT} environment variable
4365Having @env{POSIXLY_CORRECT} set is not recommended for daily use,
4366but it is good for testing the portability of your programs to other
4367environments.
4368
4369@node Other Arguments
4370@section Other Command-Line Arguments
4371@cindex command line @subentry arguments
4372@cindex arguments @subentry command-line
4373
4374Any additional arguments on the command line are normally treated as
4375input files to be processed in the order specified.   However, an
argument that has the form @code{@var{var}=@var{value}} assigns
4377the value @var{value} to the variable @var{var}---it does not specify a
4378file at all.  (See @ref{Assignment Options}.) In the following example,
4379@samp{count=1} is a variable assignment, not a @value{FN}:
4380
4381@example
4382awk -f program.awk file1 count=1 file2
4383@end example
4384
4385@noindent
4386As a side point, should you really need to have @command{awk}
4387process a file named @file{count=1} (or any file whose name looks like
4388a variable assignment), precede the file name with @samp{./}, like so:
4389
4390@example
4391awk -f program.awk file1 ./count=1 file2
4392@end example
4393
4394@cindex @command{gawk} @subentry @code{ARGIND} variable in
4395@cindex @code{ARGIND} variable @subentry command-line arguments
4396@cindex @code{ARGV} array, indexing into
4397@cindex @code{ARGC}/@code{ARGV} variables @subentry command-line arguments
4398@cindex @command{gawk} @subentry @code{PROCINFO} array in
4399All the command-line arguments are made available to your @command{awk} program in the
4400@code{ARGV} array (@pxref{Built-in Variables}).  Command-line options
4401and the program text (if present) are omitted from @code{ARGV}.
4402All other arguments, including variable assignments, are
4403included.   As each element of @code{ARGV} is processed, @command{gawk}
4404sets @code{ARGIND} to the index in @code{ARGV} of the
4405current element.  (@command{gawk} makes the full command line,
4406including program text and options, available in @code{PROCINFO["argv"]};
4407@pxref{Auto-set}.)
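
As a rough illustration, a program consisting only of a @code{BEGIN}
rule can print the elements of @code{ARGV}.  The names @file{file1}
and @file{file2} are placeholders; they need not exist, because this
program never reads any input:

@example
$ @kbd{awk -v x=1 'BEGIN @{}
> @kbd{    for (i = 1; i < ARGC; i++)}
> @kbd{        print i, ARGV[i]}
> @kbd{@}' file1 count=1 file2}
@print{} 1 file1
@print{} 2 count=1
@print{} 3 file2
@end example

@noindent
The @option{-v} option and the program text do not appear, while the
assignment @samp{count=1} does.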
4408
4409Changing @code{ARGC} and @code{ARGV} in your @command{awk} program lets
4410you control how @command{awk} processes the input files; this is described
4411in more detail in @ref{ARGC and ARGV}.
4412
4413@cindex input files @subentry variable assignments and
4414@cindex variable assignments and input files
4415The distinction between @value{FN} arguments and variable-assignment
4416arguments is made when @command{awk} is about to open the next input file.
4417At that point in execution, it checks the @value{FN} to see whether
4418it is really a variable assignment; if so, @command{awk} sets the variable
4419instead of reading a file.
4420
4421Therefore, the variables actually receive the given values after all
4422previously specified files have been read.  In particular, the values of
4423variables assigned in this fashion are @emph{not} available inside a
4424@code{BEGIN} rule
4425(@pxref{BEGIN/END}),
4426because such rules are run before @command{awk} begins scanning the argument list.
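
A small sketch of this timing follows.  It uses @file{/dev/null} as an
empty input file, so that the result does not depend on any data:

@example
$ @kbd{awk 'BEGIN @{ print "BEGIN: n=" n @}}
> @kbd{     END   @{ print "END:   n=" n @}' n=5 /dev/null}
@print{} BEGIN: n=
@print{} END:   n=5
@end example

@noindent
By contrast, a value supplied with the @option{-v} option
(@pxref{Options}) is already set when the @code{BEGIN} rule runs:

@example
$ @kbd{awk -v n=5 'BEGIN @{ print "BEGIN: n=" n @}'}
@print{} BEGIN: n=5
@end example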
4427
4428@cindex dark corner @subentry escape sequences
4429The variable values given on the command line are processed for escape
4430sequences (@pxref{Escape Sequences}).
4431@value{DARKCORNER}
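
For instance, assuming a shell that passes the single-quoted argument
through unchanged, the @samp{\n} in the following assignment reaches
the program as a newline:

@example
$ @kbd{awk 'END @{ print x @}' 'x=one\ntwo' /dev/null}
@print{} one
@print{} two
@end example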
4432
4433In some very early implementations of @command{awk}, when a variable assignment
4434occurred before any @value{FN}s, the assignment would happen @emph{before}
4435the @code{BEGIN} rule was executed.  @command{awk}'s behavior was thus
4436inconsistent; some command-line assignments were available inside the
4437@code{BEGIN} rule, while others were not.  Unfortunately,
4438some applications came to depend
4439upon this ``feature.''  When @command{awk} was changed to be more consistent,
4440the @option{-v} option was added to accommodate applications that depended
4441upon the old behavior.
4442
4443The variable assignment feature is most useful for assigning to variables
4444such as @code{RS}, @code{OFS}, and @code{ORS}, which control input and
4445output formats, before scanning the @value{DF}s.  It is also useful for
4446controlling state if multiple passes are needed over a @value{DF}.  For
4447example:
4448
4449@cindex files @subentry multiple passes over
4450@example
4451awk 'pass == 1  @{ @var{pass 1 stuff} @}
4452     pass == 2  @{ @var{pass 2 stuff} @}' pass=1 mydata pass=2 mydata
4453@end example
4454
4455Given the variable assignment feature, the @option{-F} option for setting
4456the value of @code{FS} is not
4457strictly necessary.  It remains for historical compatibility.
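
As a sketch, the following two commands should behave identically.
In the second one, the lone @samp{-} names standard input
(@pxref{Naming Standard Input}), and the assignment to @code{FS} takes
effect before the first record is read:

@example
$ @kbd{echo a:b:c | awk -F: '@{ print $2 @}'}
@print{} b
$ @kbd{echo a:b:c | awk '@{ print $2 @}' FS=: -}
@print{} b
@end example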
4458
4459@sidebar Quoting Shell Variables On The @command{awk} Command Line
4460@cindex quoting @subentry in @command{gawk} command lines
4461@cindex shell quoting, rules for
4462@cindex null strings @subentry in @command{gawk} arguments, quoting and
4463
4464Small @command{awk} programs are often embedded in larger shell scripts,
4465so it's worthwhile to understand some shell basics. Consider the following:
4466
4467@example
4468f=""
4469awk '@{ print("hi") @}' $f
4470@end example
4471
4472In this case, @command{awk} reads from standard input instead of trying
4473to open any command line files. To the unwary, this looks like @command{awk}
4474is hanging.
4475
However, @command{awk} doesn't see an explicit empty string. When a
4477variable expansion is the null string, @emph{and} it's not quoted,
4478the shell simply removes it from the command line. To demonstrate:
4479
4480@example
4481$ @kbd{f=""}
4482$ @kbd{awk 'BEGIN @{ print ARGC @}' $f}
4483@print{} 1
4484$ @kbd{awk 'BEGIN @{ print ARGC @}' "$f"}
4485@print{} 2
4486@end example
4487@end sidebar
4488
4489@node Naming Standard Input
4490@section Naming Standard Input
4491
4492Often, you may wish to read standard input together with other files.
4493For example, you may wish to read one file, read standard input coming
4494from a pipe, and then read another file.
4495
4496The way to name the standard input, with all versions of @command{awk},
4497is to use a single, standalone minus sign or dash, @samp{-}.  For example:
4498
4499@example
4500@var{some_command} | awk -f myprog.awk file1 - file2
4501@end example
4502
4503@noindent
4504Here, @command{awk} first reads @file{file1}, then it reads
4505the output of @var{some_command}, and finally it reads
4506@file{file2}.
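
Here is a concrete sketch, with @command{echo} standing in for
@var{some_command} and two throwaway one-line files created just for
the demonstration:

@example
$ @kbd{echo first > file1}
$ @kbd{echo last > file2}
$ @kbd{echo middle | awk '@{ print @}' file1 - file2}
@print{} first
@print{} middle
@print{} last
@end example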
4507
4508You may also use @code{"-"} to name standard input when reading
4509files with @code{getline} (@pxref{Getline/File}).
4510And, you can even use @code{"-"} with the @option{-f} option
4511to read program source code from standard input (@pxref{Options}).
4512
4513In addition, @command{gawk} allows you to specify the special
4514@value{FN} @file{/dev/stdin}, both on the command line and
4515with @code{getline}.
4516Some other versions of @command{awk} also support this, but it
4517is not standard.
4518(Some operating systems provide a @file{/dev/stdin} file
4519in the filesystem; however, @command{gawk} always processes
4520this @value{FN} itself.)
4521
4522@node Environment Variables
4523@section The Environment Variables @command{gawk} Uses
4524@cindex environment variables @subentry used by @command{gawk}
4525
4526A number of environment variables influence how @command{gawk}
4527behaves.
4528
4529@menu
4530* AWKPATH Variable::            Searching directories for @command{awk}
4531                                programs.
4532* AWKLIBPATH Variable::         Searching directories for @command{awk} shared
4533                                libraries.
4534* Other Environment Variables:: The environment variables.
4535@end menu
4536
4537@node AWKPATH Variable
4538@subsection The @env{AWKPATH} Environment Variable
4539@cindex @env{AWKPATH} environment variable
4540@cindex environment variables @subentry @env{AWKPATH}
4541@cindex directories @subentry searching @subentry for source files
4542@cindex search paths @subentry for source files
4543@cindex differences in @command{awk} and @command{gawk} @subentry @env{AWKPATH} environment variable
4544@ifinfo
4545The previous @value{SECTION} described how @command{awk} program files can be named
4546on the command line with the @option{-f} option.
4547@end ifinfo
4548In most @command{awk}
4549implementations, you must supply a precise pathname for each program
4550file, unless the file is in the current directory.
4551But with @command{gawk}, if the @value{FN} supplied to the @option{-f}
4552or @option{-i} options
4553does not contain a directory separator @samp{/}, then @command{gawk} searches a list of
4554directories (called the @dfn{search path}) one by one, looking for a
4555file with the specified name.
4556
4557The search path is a string consisting of directory names
4558separated by colons.@footnote{Semicolons on MS-Windows.}
4559@command{gawk} gets its search path from the
4560@env{AWKPATH} environment variable.  If that variable does not exist,
4561or if it has an empty value,
4562@command{gawk} uses a default path (described shortly).
4563
4564The search path feature is particularly helpful for building libraries
4565of useful @command{awk} functions.  The library files can be placed in a
4566standard directory in the default path and then specified on
4567the command line with a short @value{FN}.  Otherwise, you would have to
4568type the full @value{FN} for each file.
4569
4570By using the @option{-i} or @option{-f} options, your command-line
4571@command{awk} programs can use facilities in @command{awk} library files
4572(@pxref{Library Functions}).
4573Path searching is not done if @command{gawk} is in compatibility mode.
4574This is true for both @option{--traditional} and @option{--posix}.
4575@xref{Options}.
4576
4577If the source code file is not found after the initial search, the path is searched
4578again after adding the suffix @samp{.awk} to the @value{FN}.
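
To illustrate, here is a small sketch for a Bourne-style shell.  The
directory @file{/tmp/awklib}, the file @file{greet.awk}, and the
function @code{greet()} are all made up for this example:

@example
$ @kbd{mkdir -p /tmp/awklib}
$ @kbd{echo 'function greet() @{ print "hello" @}' > /tmp/awklib/greet.awk}
$ @kbd{AWKPATH=/tmp/awklib gawk -f greet -e 'BEGIN @{ greet() @}'}
@print{} hello
@end example

@noindent
No file named @file{greet} exists in the search path, so
@command{gawk} finds @file{greet.awk} on the second pass.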
4579
4580@command{gawk}'s path search mechanism is similar
4581to the shell's.
4582(See @uref{https://www.gnu.org/software/bash/manual/,
4583@cite{The Bourne-Again SHell manual}}.)
4584It treats a null entry in the path as indicating the current
4585directory.
4586(A null entry is indicated by starting or ending the path with a
4587colon or by placing two colons next to each other [@samp{::}].)
4588
4589@quotation NOTE
4590To include the current directory in the path, either place @file{.}
4591as an entry in the path or write a null entry in the path.
4592
4593Different past versions of @command{gawk} would also look explicitly in
4594the current directory, either before or after the path search.  As of
4595@value{PVERSION} 4.1.2, this no longer happens; if you wish to look
4596in the current directory, you must include @file{.} either as a separate
4597entry or as a null entry in the search path.
4598@end quotation
4599
4600The default value for @env{AWKPATH} is
4601@samp{.:/usr/local/share/awk}.@footnote{Your version of @command{gawk}
4602may use a different directory; it
4603will depend upon how @command{gawk} was built and installed. The actual
4604directory is the value of @code{$(pkgdatadir)} generated when
4605@command{gawk} was configured.
4606(For more detail, see the @file{INSTALL} file in the source distribution,
4607and see @ref{Quick Installation}.
4608You probably don't need to worry about this,
4609though.)}  Since @file{.} is included at the beginning, @command{gawk}
4610searches first in the current directory and then in @file{/usr/local/share/awk}.
4611In practice, this means that you will rarely need to change the
4612value of @env{AWKPATH}.
4613
4614@xref{Shell Startup Files}, for information on functions that help to
4615manipulate the @env{AWKPATH} variable.
4616
4617@command{gawk} places the value of the search path that it used into
4618@code{ENVIRON["AWKPATH"]}. This provides access to the actual search
4619path value from within an @command{awk} program.
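
For example, with a Bourne-style shell and the hypothetical directory
@file{/tmp/awklib}:

@example
$ @kbd{AWKPATH=/tmp/awklib gawk 'BEGIN @{ print ENVIRON["AWKPATH"] @}'}
@print{} /tmp/awklib
@end example

@noindent
When @env{AWKPATH} is not set, the same program prints the default
path, which varies with how @command{gawk} was built.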
4620
4621Although you can change @code{ENVIRON["AWKPATH"]} within your @command{awk}
4622program, this has no effect on the running program's behavior.  This makes
4623sense: the @env{AWKPATH} environment variable is used to find the program
4624source files.  Once your program is running, all the files have been
4625found, and @command{gawk} no longer needs to use @env{AWKPATH}.
4626
4627@node AWKLIBPATH Variable
4628@subsection The @env{AWKLIBPATH} Environment Variable
4629@cindex @env{AWKLIBPATH} environment variable
4630@cindex environment variables @subentry @env{AWKLIBPATH}
4631@cindex directories @subentry searching @subentry for loadable extensions
4632@cindex search paths @subentry for loadable extensions
4633@cindex differences in @command{awk} and @command{gawk} @subentry @code{AWKLIBPATH} environment variable
4634
4635The @env{AWKLIBPATH} environment variable is similar to the @env{AWKPATH}
4636variable, but it is used to search for loadable extensions (stored as
4637system shared libraries) specified with the @option{-l} option rather
4638than for source files.  If the extension is not found, the path is
4639searched again after adding the appropriate shared library suffix for
4640the platform.  For example, on GNU/Linux systems, the suffix @samp{.so}
4641is used.  The search path specified is also used for extensions loaded
4642via the @code{@@load} keyword (@pxref{Loading Shared Libraries}).
4643
4644If @env{AWKLIBPATH} does not exist in the environment, or if it has
4645an empty value, @command{gawk} uses a default path; this
4646is typically @samp{/usr/local/lib/gawk}, although it can vary depending
4647upon how @command{gawk} was built.@footnote{Your version of @command{gawk}
4648may use a different directory; it
4649will depend upon how @command{gawk} was built and installed. The actual
4650directory is the value of @code{$(pkgextensiondir)} generated when
4651@command{gawk} was configured.
4652(For more detail, see the @file{INSTALL} file in the source distribution,
4653and see @ref{Quick Installation}.
4654You probably don't need to worry about this,
4655though.)}
4656
4657@xref{Shell Startup Files}, for information on functions that help to
4658manipulate the @env{AWKLIBPATH} variable.
4659
4660@command{gawk} places the value of the search path that it used into
4661@code{ENVIRON["AWKLIBPATH"]}. This provides access to the actual search
4662path value from within an @command{awk} program.
4663
4664Although you can change @code{ENVIRON["AWKLIBPATH"]} within your
4665@command{awk} program, this has no effect on the running program's
4666behavior.  This makes sense: the @env{AWKLIBPATH} environment variable
4667is used to find any requested extensions, and they are loaded before
4668the program starts to run.  Once your program is running, all the
4669extensions have been found, and @command{gawk} no longer needs to use
4670@env{AWKLIBPATH}.
4671
4672@node Other Environment Variables
4673@subsection Other Environment Variables
4674
4675A number of other environment variables affect @command{gawk}'s
4676behavior, but they are more specialized. Those in the following
4677list are meant to be used by regular users:
4678
4679@table @env
4680@item GAWK_MSEC_SLEEP
4681Specifies the interval between connection retries,
4682in milliseconds. On systems that do not support
4683the @code{usleep()} system call,
4684the value is rounded up to an integral number of seconds.
4685
4686@item GAWK_READ_TIMEOUT
4687Specifies the time, in milliseconds, for @command{gawk} to
4688wait for input before returning with an error.
4689@xref{Read Timeout}.
4690
4691@item GAWK_SOCK_RETRIES
4692Controls the number of times @command{gawk} attempts to
4693retry a two-way TCP/IP (socket) connection before giving up.
4694@xref{TCP/IP Networking}.
4695Note that when nonfatal I/O is enabled (@pxref{Nonfatal}),
4696@command{gawk} only tries to open a TCP/IP socket once.
4697
4698@item POSIXLY_CORRECT
4699Causes @command{gawk} to switch to POSIX-compatibility
4700mode, disabling all traditional and GNU extensions.
4701@xref{Options}.
4702@end table
4703
4704The environment variables in the following list are meant
4705for use by the @command{gawk} developers for testing and tuning.
4706They are subject to change. The variables are:
4707
4708@table @env
4709@item AWKBUFSIZE
4710This variable only affects @command{gawk} on POSIX-compliant systems.
4711With a value of @samp{exact}, @command{gawk} uses the size of each input
4712file as the size of the memory buffer to allocate for I/O. Otherwise,
4713the value should be a number, and @command{gawk} uses that number as
4714the size of the buffer to allocate.  (When this variable is not set,
4715@command{gawk} uses the smaller of the file's size and the ``default''
4716blocksize, which is usually the filesystem's I/O blocksize.)
4717
4718@item AWK_HASH
4719If this variable exists with a value of @samp{gst}, @command{gawk}
4720switches to using the hash function from GNU Smalltalk for
4721managing arrays.
4722This function may be marginally faster than the standard function.
4723
4724@item AWKREADFUNC
4725If this variable exists, @command{gawk} switches to reading source
4726files one line at a time, instead of reading in blocks. This exists
4727for debugging problems on filesystems on non-POSIX operating systems
4728where I/O is performed in records, not in blocks.
4729
4730@item GAWK_MSG_SRC
4731If this variable exists, @command{gawk} includes the @value{FN}
4732and line number within the @command{gawk} source code
4733from which warning and/or fatal messages
4734are generated.  Its purpose is to help isolate the source of a
4735message, as there are multiple places that produce the
4736same warning or error message.
4737
4738@item GAWK_LOCALE_DIR
4739Specifies the location of compiled message object files
4740for @command{gawk} itself. This is passed to the @code{bindtextdomain()}
4741function when @command{gawk} starts up.
4742
4743@item GAWK_NO_DFA
4744If this variable exists, @command{gawk} does not use the DFA regexp matcher
4745for ``does it match'' kinds of tests. This can cause @command{gawk}
4746to be slower. Its purpose is to help isolate differences between the
4747two regexp matchers that @command{gawk} uses internally. (There aren't
4748supposed to be differences, but occasionally theory and practice don't
4749coordinate with each other.)
4750
4751@item GAWK_STACKSIZE
4752This specifies the amount by which @command{gawk} should grow its
4753internal evaluation stack, when needed.
4754
4755@item INT_CHAIN_MAX
This specifies the intended maximum number of items @command{gawk} will maintain on a
4757hash chain for managing arrays indexed by integers.
4758
4759@item STR_CHAIN_MAX
This specifies the intended maximum number of items @command{gawk} will maintain on a
4761hash chain for managing arrays indexed by strings.
4762
4763@item TIDYMEM
4764If this variable exists, @command{gawk} uses the @code{mtrace()} library
4765calls from the GNU C library to help track down possible memory leaks.
4766@end table
4767
4768@node Exit Status
4769@section @command{gawk}'s Exit Status
4770
4771@cindex exit status, of @command{gawk}
4772If the @code{exit} statement is used with a value
4773(@pxref{Exit Statement}), then @command{gawk} exits with
4774the numeric value given to it.
4775
4776Otherwise, if there were no problems during execution,
4777@command{gawk} exits with the value of the C constant
4778@code{EXIT_SUCCESS}.  This is usually zero.
4779
4780If an error occurs, @command{gawk} exits with the value of
4781the C constant @code{EXIT_FAILURE}.  This is usually one.
4782
4783If @command{gawk} exits because of a fatal error, the exit
4784status is two.  On non-POSIX systems, this value may be mapped
4785to @code{EXIT_FAILURE}.
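
The following sketch shows some of these cases on a typical POSIX
system with a Bourne-style shell, where @samp{$?} holds the exit
status of the last command.  The path @file{/no/such/dir/out} is
assumed not to exist, so the output redirection fails with a fatal
error (the error message is discarded here):

@example
$ @kbd{gawk 'BEGIN @{ exit 3 @}'; echo $?}
@print{} 3
$ @kbd{gawk 'BEGIN @{ exit @}'; echo $?}
@print{} 0
$ @kbd{gawk 'BEGIN @{ print "x" > "/no/such/dir/out" @}' 2> /dev/null; echo $?}
@print{} 2
@end example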
4786
4787@node Include Files
4788@section Including Other Files into Your Program
4789
4790@c Panos Papadopoulos <panos1962@gmail.com> contributed the original
4791@c text for this section.
4792
4793This @value{SECTION} describes a feature that is specific to @command{gawk}.
4794
4795@cindex @code{@@} (at-sign) @subentry @code{@@include} directive
4796@cindex at-sign (@code{@@}) @subentry @code{@@include} directive
4797@cindex file inclusion, @code{@@include} directive
4798@cindex including files, @code{@@include} directive
4799@cindex @code{@@include} directive @sortas{include directive}
4800The @code{@@include} keyword can be used to read external @command{awk} source
4801files.  This gives you the ability to split large @command{awk} source files
4802into smaller, more manageable pieces, and also lets you reuse common @command{awk}
4803code from various @command{awk} scripts.  In other words, you can group
4804together @command{awk} functions used to carry out specific tasks
4805into external files. These files can be used just like function libraries,
4806using the @code{@@include} keyword in conjunction with the @env{AWKPATH}
4807environment variable.  Note that source files may also be included
4808using the @option{-i} option.
4809
4810Let's see an example.
4811We'll start with two (trivial) @command{awk} scripts, namely
4812@file{test1} and @file{test2}. Here is the @file{test1} script:
4813
4814@example
4815BEGIN @{
4816    print "This is script test1."
4817@}
4818@end example
4819
4820@noindent
4821and here is @file{test2}:
4822
4823@example
4824@@include "test1"
4825BEGIN @{
4826    print "This is script test2."
4827@}
4828@end example
4829
4830Running @command{gawk} with @file{test2}
4831produces the following result:
4832
4833@example
4834$ @kbd{gawk -f test2}
4835@print{} This is script test1.
4836@print{} This is script test2.
4837@end example
4838
4839@command{gawk} runs the @file{test2} script, which includes @file{test1}
4840using the @code{@@include}
4841keyword.  So, to include external @command{awk} source files, you just
4842use @code{@@include} followed by the name of the file to be included,
4843enclosed in double quotes.
4844
4845@quotation NOTE
4846Keep in mind that this is a language construct and the @value{FN} cannot
4847be a string variable, but rather just a literal string constant in double quotes.
4848@end quotation
4849
4850The files to be included may be nested; e.g., given a third
4851script, namely @file{test3}:
4852
4853@example
4854@group
4855@@include "test2"
4856BEGIN @{
4857    print "This is script test3."
4858@}
4859@end group
4860@end example
4861
4862@noindent
4863Running @command{gawk} with the @file{test3} script produces the
4864following results:
4865
4866@example
4867$ @kbd{gawk -f test3}
4868@print{} This is script test1.
4869@print{} This is script test2.
4870@print{} This is script test3.
4871@end example
4872
4873The @value{FN} can, of course, be a pathname. For example:
4874
4875@example
4876@@include "../io_funcs"
4877@end example
4878
4879@noindent
4880and:
4881
4882@example
4883@@include "/usr/awklib/network"
4884@end example
4885
4886@noindent
4887are both valid. The @env{AWKPATH} environment variable can be of great
4888value when using @code{@@include}. The same rules for the use
4889of the @env{AWKPATH} variable in command-line file searches
4890(@pxref{AWKPATH Variable}) apply to
4891@code{@@include} also.
4892
4893This is very helpful in constructing @command{gawk} function libraries.
4894If you have a large script with useful, general-purpose @command{awk}
4895functions, you can break it down into library files and put those files
4896in a special directory.  You can then include those ``libraries,''
4897either by using the full pathnames of the files, or by setting the @env{AWKPATH}
4898environment variable accordingly and then using @code{@@include} with
4899just the file part of the full pathname. Of course,
4900you can keep library files in more than one directory;
4901the more complex the working
4902environment is, the more directories you may need to organize the files
4903to be included.
4904
4905Given the ability to specify multiple @option{-f} options, the
4906@code{@@include} mechanism is not strictly necessary.
4907However, the @code{@@include} keyword
4908can help you in constructing self-contained @command{gawk} programs,
4909thus reducing the need for writing complex and tedious command lines.
4910In particular, @code{@@include} is very useful for writing CGI scripts
4911to be run from web pages.
4912
4913The rules for finding a source file described in @ref{AWKPATH Variable} also
4914apply to files loaded with @code{@@include}.
4915
4916Finally, files included with @code{@@include}
4917are treated as if they had @samp{@@namespace "awk"}
4918at their beginning.  @xref{Changing The Namespace}, for more information.
4919
4920@node Loading Shared Libraries
4921@section Loading Dynamic Extensions into Your Program
4922
4923This @value{SECTION} describes a feature that is specific to @command{gawk}.
4924
4925@cindex @code{@@} (at-sign) @subentry @code{@@load} directive
4926@cindex at-sign (@code{@@}) @subentry @code{@@load} directive
4927@cindex loading extensions @subentry @code{@@load} directive
4928@cindex extensions @subentry loadable @subentry loading, @code{@@load} directive
4929@cindex @code{@@load} directive @sortas{load directive}
4930The @code{@@load} keyword can be used to read external @command{awk} extensions
4931(stored as system shared libraries).
4932This allows you to link in compiled code that may offer superior
4933performance and/or give you access to extended capabilities not supported
4934by the @command{awk} language.  The @env{AWKLIBPATH} variable is used to
4935search for the extension.  Using @code{@@load} is completely equivalent
4936to using the @option{-l} command-line option.
4937
4938If the extension is not initially found in @env{AWKLIBPATH}, another
4939search is conducted after appending the platform's default shared library
4940suffix to the @value{FN}.  For example, on GNU/Linux systems, the suffix
4941@samp{.so} is used:
4942
4943@example
4944$ @kbd{gawk '@@load "ordchr"; BEGIN @{print chr(65)@}'}
4945@print{} A
4946@end example
4947
4948@noindent
4949This is equivalent to the following example:
4950
4951@example
4952@group
4953$ @kbd{gawk -lordchr 'BEGIN @{print chr(65)@}'}
4954@print{} A
4955@end group
4956@end example
4957
4958@noindent
4959For command-line usage, the @option{-l} option is more convenient,
4960but @code{@@load} is useful for embedding inside an @command{awk} source file
4961that requires access to an extension.
4962
4963@ref{Dynamic Extensions}, describes how to write extensions (in C or C++)
4964that can be loaded with either @code{@@load} or the @option{-l} option.
4965It also describes the @code{ordchr} extension.
4966
4967@node Obsolete
4968@section Obsolete Options and/or Features
4969
4970@c update this section for each release!
4971
4972@cindex options @subentry deprecated
4973@cindex features @subentry deprecated
4974@cindex obsolete features
4975This @value{SECTION} describes features and/or command-line options from
4976previous releases of @command{gawk} that either are not available in the
4977current version or are still supported but deprecated (meaning that
4978they will @emph{not} be in the next release).
4979
4980The process-related special files @file{/dev/pid}, @file{/dev/ppid},
4981@file{/dev/pgrpid}, and @file{/dev/user} were deprecated in @command{gawk}
49823.1, but still worked.  As of @value{PVERSION} 4.0, they are no longer
4983interpreted specially by @command{gawk}.  (Use @code{PROCINFO} instead;
4984see @ref{Auto-set}.)
4985
4986@ignore
4987This @value{SECTION}
4988is thus essentially a place holder,
4989in case some option becomes obsolete in a future version of @command{gawk}.
4990@end ignore
4991
4992@node Undocumented
4993@section Undocumented Options and Features
4994@cindex undocumented features
4995@cindex features @subentry undocumented
4996@cindex Skywalker, Luke
4997@cindex Kenobi, Obi-Wan
4998@cindex jedi knights
4999@cindex knights, jedi
5000@quotation
5001@i{Use the Source, Luke!}
5002@author Obi-Wan
5003@end quotation
5004
5005@cindex shells @subentry sea
5006This @value{SECTION} intentionally left
5007blank.
5008
5009@ignore
5010@c If these came out in the Info file or TeX document, then they wouldn't
5011@c be undocumented, would they?
5012
5013@command{gawk} has one undocumented option:
5014
5015@table @code
5016@item -W nostalgia
5017@itemx --nostalgia
5018Print the message @samp{awk: bailing out near line 1} and dump core.
5019This option was inspired by the common behavior of very early versions of
5020Unix @command{awk} and by a t--shirt.
5021The message is @emph{not} subject to translation in non-English locales.
5022@c so there! nyah, nyah.
5023@end table
5024
5025Early versions of @command{awk} used to not require any separator (either
5026a newline or @samp{;}) between the rules in @command{awk} programs.  Thus,
5027it was common to see one-line programs like:
5028
5029@example
5030awk '@{ sum += $1 @} END @{ print sum @}'
5031@end example
5032
5033@command{gawk} actually supports this but it is purposely undocumented
5034because it is bad style.  The correct way to write such a program
5035is either:
5036
5037@example
5038awk '@{ sum += $1 @} ; END @{ print sum @}'
5039@end example
5040
5041@noindent
5042or:
5043
5044@example
5045awk '@{ sum += $1 @}
5046     END @{ print sum @}' data
5047@end example
5048
5049@noindent
5050@xref{Statements/Lines}, for a fuller explanation.
5051
5052You can insert newlines after the @samp{;} in @code{for} loops.
5053This seems to have been a long-undocumented feature in Unix @command{awk}.
5054
5055Similarly, you may use @code{print} or @code{printf} statements in the
5056@var{init} and @var{increment} parts of a @code{for} loop.  This is another
5057long-undocumented ``feature'' of Unix @command{awk}.
5058
5059@command{gawk} lets you use the names of built-in functions that are
5060@command{gawk} extensions as the names of parameters in user-defined functions.
5061This is intended to ``future-proof'' old code that happens to use
5062function names added by @command{gawk} after the code was written.
5063Standard @command{awk} built-in functions, such as @code{sin()} or
5064@code{substr()} are @emph{not} shadowed in this way.
5065
5066You can use a @samp{P} modifier for the @code{printf()} floating-point
5067format control letters to use the underlying C library's result for
5068NaN and Infinity values, instead of the special values @command{gawk}
5069usually produces, as described in @ref{POSIX Floating Point Problems}.
5070This is mainly useful for the included unit tests.
5071
5072The @code{typeof()} built-in function
5073(@pxref{Type Functions})
5074takes an optional second array argument that, if present, will be cleared
5075and populated with some information about the internal implementation of
5076the variable. This can be useful for debugging. At the moment, this
5077returns a textual version of the flags for scalar variables, and the
5078array back-end implementation type for arrays. This interface is subject
5079to change and may not be stable.
5080
5081When not in POSIX or compatibility mode, if you set @code{LINENO} to a
5082numeric value using the @option{-v} option, @command{gawk} adds that value
5083to the real line number for use in error messages.  This is intended for
5084use within Bash shell scripts, such that the error message will reflect
5085the line number in the shell script, instead of in the @command{awk}
5086program. To demonstrate:
5087
5088@example
5089$ @kbd{gawk -v LINENO=10 'BEGIN @{ print("hi" @}'}
5090@error{} gawk: cmd. line:11: BEGIN @{ print("hi" @}
5091@error{} gawk: cmd. line:11:                    ^ syntax error
5092@end example
5093
5094@end ignore
5095
5096@node Invoking Summary
5097@section Summary
5098
5099@itemize @value{BULLET}
5100
5101@c From Neil R. Ormos
5102@item
5103@command{gawk} parses arguments on the command line, left to right, to
5104determine if they should be treated as options or as non-option arguments.
5105
5106@item
5107@command{gawk} recognizes several options which control its operation,
5108as described in @ref{Options}.  All options begin with @samp{-}.
5109
5110@item
5111Any argument that is not recognized as an option is treated as a
5112non-option argument, even if it begins with @samp{-}.
5113
5114@itemize @value{MINUS}
5115@item
5116However, when an option itself requires an argument, and the option is separated
5117from that argument on the command line by at least one space, the space
5118is ignored, and the argument is considered to be related to the option.  Thus, in
5119the invocation, @samp{gawk -F x}, the @samp{x} is treated as belonging to the
5120@option{-F} option, not as a separate non-option argument.
5121@end itemize
5122
5123@item
5124Once @command{gawk} finds a non-option argument, it stops looking for
5125options. Therefore, all following arguments are also non-option arguments,
5126even if they resemble recognized options.
5127
5128@item
5129If no @option{-e} or @option{-f} options are present, @command{gawk}
5130expects the program text to be in the first non-option argument.
5131
5132@item
5133All non-option arguments, except program text provided in the first
5134non-option argument, are placed in @code{ARGV} as explained in
5135@ref{ARGC and ARGV}, and are processed as described in @ref{Other Arguments}.
5136@c And I wrote:
5137Adjusting @code{ARGC} and @code{ARGV}
5138affects how @command{awk} processes input.
5139
5140@c ----------------------------------------
5141
5142@item
5143The three standard options for all versions of @command{awk} are
5144@option{-f}, @option{-F}, and @option{-v}.  @command{gawk} supplies these
5145and many others, as well as corresponding GNU-style long options.
5146
5147@item
Non-option command-line arguments are usually treated as @value{FN}s,
5149unless they have the form @samp{@var{var}=@var{value}}, in which case
5150they are taken as variable assignments to be performed at that point
5151in processing the input.
5152
5153@item
5154You can use a single minus sign (@samp{-}) to refer to standard input
5155on the command line. @command{gawk} also lets you use the special
5156@value{FN} @file{/dev/stdin}.
5157
5158@item
5159@command{gawk} pays attention to a number of environment variables.
5160@env{AWKPATH}, @env{AWKLIBPATH}, and @env{POSIXLY_CORRECT} are the
5161most important ones.
5162
5163@item
5164@command{gawk}'s exit status conveys information to the program
5165that invoked it. Use the @code{exit} statement from within
5166an @command{awk} program to set the exit status.
5167
5168@item
5169@command{gawk} allows you to include other @command{awk} source files into
5170your program using the @code{@@include} statement and/or the @option{-i}
5171and @option{-f} command-line options.
5172
5173@item
5174@command{gawk} allows you to load additional functions written in C
5175or C++ using the @code{@@load} statement and/or the @option{-l} option.
5176(This advanced feature is described later, in @ref{Dynamic Extensions}.)
5177@end itemize
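
As a brief illustration of several of these points, the following
sketch combines an option that takes an argument, program text in the
first non-option argument, a variable assignment, and @samp{-} for
standard input.  The input text and the variable name @code{n} are made
up for this example:

@example
$ @kbd{echo "a:b" | awk -F: '@{ print n, $2 @}' n=1 -}
@print{} 1 b
@end example

@noindent
The assignment @samp{n=1} takes effect before standard input is read,
so @code{n} already has its value when the record is processed.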
5178
5179@node Regexp
5180@chapter Regular Expressions
5181@cindex regexp
5182@cindex regular expressions
5183
5184A @dfn{regular expression}, or @dfn{regexp}, is a way of describing a
5185set of strings.
5186Because regular expressions are such a fundamental part of @command{awk}
5187programming, their format and use deserve a separate @value{CHAPTER}.
5188
5189@cindex forward slash (@code{/}) @subentry to enclose regular expressions
5190@cindex @code{/} (forward slash) @subentry to enclose regular expressions
5191A regular expression enclosed in slashes (@samp{/})
5192is an @command{awk} pattern that matches every input record whose text
5193belongs to that set.
5194The simplest regular expression is a sequence of letters, numbers, or
5195both.  Such a regexp matches any string that contains that sequence.
5196Thus, the regexp @samp{foo} matches any string containing @samp{foo}.
Likewise, the pattern @code{/foo/} matches any input record containing
5198the three adjacent characters @samp{foo} @emph{anywhere} in the record.  Other
5199kinds of regexps let you specify more complicated classes of strings.
5200
5201@ifnotinfo
5202Initially, the examples in this @value{CHAPTER} are simple.
5203As we explain more about how
5204regular expressions work, we present more complicated instances.
5205@end ifnotinfo
5206
5207@menu
5208* Regexp Usage::                How to Use Regular Expressions.
5209* Escape Sequences::            How to write nonprinting characters.
5210* Regexp Operators::            Regular Expression Operators.
5211* Bracket Expressions::         What can go between @samp{[...]}.
5212* Leftmost Longest::            How much text matches.
5213* Computed Regexps::            Using Dynamic Regexps.
5214* GNU Regexp Operators::        Operators specific to GNU software.
5215* Case-sensitivity::            How to do case-insensitive matching.
5216* Regexp Summary::              Regular expressions summary.
5217@end menu
5218
5219@node Regexp Usage
5220@section How to Use Regular Expressions
5221
5222@cindex patterns @subentry regexp constants as
5223@cindex regular expressions @subentry as patterns
5224A regular expression can be used as a pattern by enclosing it in
5225slashes.  Then the regular expression is tested against the
5226entire text of each record.  (Normally, it only needs
5227to match some part of the text in order to succeed.)  For example, the
5228following prints the second field of each record where the string
5229@samp{li} appears anywhere in the record:
5230
5231@example
5232$ @kbd{awk '/li/ @{ print $2 @}' mail-list}
5233@print{} 555-5553
5234@print{} 555-0542
5235@print{} 555-6699
5236@print{} 555-3430
5237@end example
5238
5239@cindex regular expressions @subentry operators
5240@cindex operators @subentry string-matching
5241@c @cindex operators, @code{~}
5242@cindex string-matching operators
5243@cindex @code{~} (tilde), @code{~} operator
5244@cindex tilde (@code{~}), @code{~} operator
5245@cindex @code{!} (exclamation point) @subentry @code{!~} operator
5246@cindex exclamation point (@code{!}) @subentry @code{!~} operator
5247@c @cindex operators, @code{!~}
5248@cindex @code{if} statement @subentry use of regexps in
5249@cindex @code{while} statement @subentry use of regexps in
5250@cindex @code{do}-@code{while} statement @subentry use of regexps in
5251@c @cindex statements, @code{if}
5252@c @cindex statements, @code{while}
5253@c @cindex statements, @code{do}
5254Regular expressions can also be used in matching expressions.  These
5255expressions allow you to specify the string to match against; it need
5256not be the entire current input record.  The two operators @samp{~}
5257and @samp{!~} perform regular expression comparisons.  Expressions
5258using these operators can be used as patterns, or in @code{if},
5259@code{while}, @code{for}, and @code{do} statements.
5260(@xref{Statements}.)
5261For example, the following is true if the expression @var{exp} (taken
5262as a string) matches @var{regexp}:
5263
5264@example
5265@var{exp} ~ /@var{regexp}/
5266@end example
5267
5268@noindent
5269This example matches, or selects, all input records with the uppercase
5270letter @samp{J} somewhere in the first field:
5271
5272@example
5273$ @kbd{awk '$1 ~ /J/' inventory-shipped}
5274@print{} Jan  13  25  15 115
5275@print{} Jun  31  42  75 492
5276@print{} Jul  24  34  67 436
5277@print{} Jan  21  36  64 620
5278@end example
5279
5280So does this:
5281
5282@example
5283awk '@{ if ($1 ~ /J/) print @}' inventory-shipped
5284@end example
5285
5286This next example is true if the expression @var{exp}
5287(taken as a character string)
5288does @emph{not} match @var{regexp}:
5289
5290@example
5291@var{exp} !~ /@var{regexp}/
5292@end example
5293
5294The following example matches,
5295or selects, all input records whose first field @emph{does not} contain
5296the uppercase letter @samp{J}:
5297
5298@example
5299$ @kbd{awk '$1 !~ /J/' inventory-shipped}
5300@print{} Feb  15  32  24 226
5301@print{} Mar  15  24  34 228
5302@print{} Apr  31  52  63 420
5303@print{} May  16  34  29 208
5304@dots{}
5305@end example
5306
5307@cindex regexp constants
5308@cindex constants @subentry regexp
5309@cindex regular expressions, constants @seeentry{regexp constants}
5310When a regexp is enclosed in slashes, such as @code{/foo/}, we call it
5311a @dfn{regexp constant}, much like @code{5.27} is a numeric constant and
5312@code{"foo"} is a string constant.
5313
5314@node Escape Sequences
5315@section Escape Sequences
5316
5317@cindex escape sequences
5318@cindex escape sequences @seealso{backslash}
5319@cindex backslash (@code{\}) @subentry in escape sequences
5320@cindex @code{\} (backslash) @subentry in escape sequences
5321Some characters cannot be included literally in string constants
5322(@code{"foo"}) or regexp constants (@code{/foo/}).
5323Instead, they should be represented with @dfn{escape sequences},
5324which are character sequences beginning with a backslash (@samp{\}).
5325One use of an escape sequence is to include a double-quote character in
5326a string constant.  Because a plain double quote ends the string, you
5327must use @samp{\"} to represent an actual double-quote character as a
5328part of the string.  For example:
5329
5330@example
5331$ @kbd{awk 'BEGIN @{ print "He said \"hi!\" to her." @}'}
5332@print{} He said "hi!" to her.
5333@end example
5334
The backslash character itself is another character that cannot be
5336included normally; you must write @samp{\\} to put one backslash in the
5337string or regexp.  Thus, the string whose contents are the two characters
5338@samp{"} and @samp{\} must be written @code{"\"\\"}.
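
As a quick check (a sketch; the variable name @code{s} is arbitrary),
the following prints that string along with its length, showing that
each escape sequence contributes exactly one character:

@example
$ @kbd{awk 'BEGIN @{ s = "\"\\"; print s, length(s) @}'}
@print{} "\ 2
@end example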
5339
5340Other escape sequences represent unprintable characters
5341such as TAB or newline.  There is nothing to stop you from entering most
5342unprintable characters directly in a string constant or regexp constant,
5343but they may look ugly.
5344
5345The following list presents
5346all the escape sequences used in @command{awk} and
5347what they represent. Unless noted otherwise, all these escape
5348sequences apply to both string constants and regexp constants:
5349
5350@cindex ASCII
5351@table @code
5352@item \\
5353A literal backslash, @samp{\}.
5354
5355@c @cindex @command{awk} language, V.4 version
5356@cindex @code{\} (backslash) @subentry @code{\a} escape sequence
5357@cindex backslash (@code{\}) @subentry @code{\a} escape sequence
5358@item \a
5359The ``alert'' character, @kbd{Ctrl-g}, ASCII code 7 (BEL).
5360(This often makes some sort of audible noise.)
5361
5362@cindex @code{\} (backslash) @subentry @code{\b} escape sequence
5363@cindex backslash (@code{\}) @subentry @code{\b} escape sequence
5364@item \b
5365Backspace, @kbd{Ctrl-h}, ASCII code 8 (BS).
5366
5367@cindex @code{\} (backslash) @subentry @code{\f} escape sequence
5368@cindex backslash (@code{\}) @subentry @code{\f} escape sequence
5369@item \f
5370Formfeed, @kbd{Ctrl-l}, ASCII code 12 (FF).
5371
5372@cindex @code{\} (backslash) @subentry @code{\n} escape sequence
5373@cindex backslash (@code{\}) @subentry @code{\n} escape sequence
5374@item \n
5375Newline, @kbd{Ctrl-j}, ASCII code 10 (LF).
5376
5377@cindex @code{\} (backslash) @subentry @code{\r} escape sequence
5378@cindex backslash (@code{\}) @subentry @code{\r} escape sequence
5379@item \r
5380Carriage return, @kbd{Ctrl-m}, ASCII code 13 (CR).
5381
5382@cindex @code{\} (backslash) @subentry @code{\t} escape sequence
5383@cindex backslash (@code{\}) @subentry @code{\t} escape sequence
5384@item \t
5385Horizontal TAB, @kbd{Ctrl-i}, ASCII code 9 (HT).
5386
5387@c @cindex @command{awk} language, V.4 version
5388@cindex @code{\} (backslash) @subentry @code{\v} escape sequence
5389@cindex backslash (@code{\}) @subentry @code{\v} escape sequence
5390@item \v
5391Vertical TAB, @kbd{Ctrl-k}, ASCII code 11 (VT).
5392
5393@cindex @code{\} (backslash) @subentry @code{\}@var{nnn} escape sequence
5394@cindex backslash (@code{\}) @subentry @code{\}@var{nnn} escape sequence
5395@item \@var{nnn}
5396The octal value @var{nnn}, where @var{nnn} stands for 1 to 3 digits
5397between @samp{0} and @samp{7}.  For example, the code for the ASCII ESC
5398(escape) character is @samp{\033}.
5399
5400@c @cindex @command{awk} language, V.4 version
5401@c @cindex @command{awk} language, POSIX version
5402@cindex @code{\} (backslash) @subentry @code{\x} escape sequence
5403@cindex backslash (@code{\}) @subentry @code{\x} escape sequence
5404@cindex common extensions @subentry @code{\x} escape sequence
5405@cindex extensions @subentry common @subentry @code{\x} escape sequence
5406@item \x@var{hh}@dots{}
5407The hexadecimal value @var{hh}, where @var{hh} stands for a sequence
5408of hexadecimal digits (@samp{0}--@samp{9}, and either @samp{A}--@samp{F}
or @samp{a}--@samp{f}).  A maximum of two digits is allowed after
the @samp{\x}. Any further hexadecimal digits are treated as simple
letters or numbers.  @value{COMMONEXT}
(The @samp{\x} escape sequence is not allowed in POSIX @command{awk}.)
5413
5414@quotation CAUTION
5415In ISO C, the escape sequence continues until the first nonhexadecimal
5416digit is seen.
For many years, @command{gawk} would continue incorporating
hexadecimal digits into the value until a nonhexadecimal digit
or the end of the string was encountered.
5420However, using more than two hexadecimal digits produced
5421undefined results.
5422As of @value{PVERSION} 4.2, only two digits
5423are processed.
5424@end quotation
5425
5426@cindex @code{\} (backslash) @subentry @code{\/} escape sequence
5427@cindex backslash (@code{\}) @subentry @code{\/} escape sequence
5428@item \/
5429A literal slash (should be used for regexp constants only).
5430This sequence is used when you want to write a regexp
5431constant that contains a slash
5432(such as @code{/.*:\/home\/[[:alnum:]]+:.*/}; the @samp{[[:alnum:]]}
5433notation is discussed in @ref{Bracket Expressions}).
5434Because the regexp is delimited by
5435slashes, you need to escape any slash that is part of the pattern,
5436in order to tell @command{awk} to keep processing the rest of the regexp.
5437
5438@cindex @code{\} (backslash) @subentry @code{\"} escape sequence
5439@cindex backslash (@code{\}) @subentry @code{\"} escape sequence
5440@item \"
5441A literal double quote (should be used for string constants only).
5442This sequence is used when you want to write a string
5443constant that contains a double quote
5444(such as @code{"He said \"hi!\" to her."}).
5445Because the string is delimited by
5446double quotes, you need to escape any quote that is part of the string,
5447in order to tell @command{awk} to keep processing the rest of the string.
5448@end table
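
The octal form can be tried with a short sketch such as the following.
The strings are arbitrary; @samp{\101}, @samp{\102}, and @samp{\103}
are the ASCII codes for @samp{A}, @samp{B}, and @samp{C}, and the
second string consists of ESC followed by three ordinary characters:

@example
$ @kbd{awk 'BEGIN @{ print "\101\102\103"; print length("\033[0m") @}'}
@print{} ABC
@print{} 4
@end example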
5449
5450In @command{gawk}, a number of additional two-character sequences that begin
5451with a backslash have special meaning in regexps.
5452@xref{GNU Regexp Operators}.
5453
5454In a regexp, a backslash before any character that is not in the previous list
5455and not listed in
5456@ref{GNU Regexp Operators}
5457means that the next character should be taken literally, even if it would
5458normally be a regexp operator.  For example, @code{/a\+b/} matches the three
5459characters @samp{a+b}.
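
A minimal sketch of this, using two made-up input lines, shows that the
escaped plus sign matches only a literal @samp{+}:

@example
$ @kbd{printf 'a+b\naab\n' | awk '/a\+b/'}
@print{} a+b
@end example

@noindent
Without the backslash, @code{/a+b/} would instead select the second
line, because @samp{a+} would then mean ``one or more @samp{a}s.''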
5460
5461@cindex backslash (@code{\}) @subentry in escape sequences
5462@cindex @code{\} (backslash) @subentry in escape sequences
5463@cindex portability
For complete portability, do not use a backslash before any character that
is neither shown in the previous list nor a regexp operator.
5466
5467@c 11/2014: Moved so as to not stack sidebars
5468@sidebar Backslash Before Regular Characters
5469@cindex portability @subentry backslash in escape sequences
5470@cindex POSIX @command{awk} @subentry backslashes in string constants
5471@cindex backslash (@code{\}) @subentry in escape sequences @subentry POSIX and
5472@cindex @code{\} (backslash) @subentry in escape sequences @subentry POSIX and
5473
5474@cindex troubleshooting @subentry backslash before nonspecial character
5475If you place a backslash in a string constant before something that is
5476not one of the characters previously listed, POSIX @command{awk} purposely
5477leaves what happens as undefined.  There are two choices:
5478
5479@c @cindex automatic warnings
5480@c @cindex warnings, automatic
5481@cindex Brian Kernighan's @command{awk}
5482@table @asis
5483@item Strip the backslash out
5484This is what BWK @command{awk} and @command{gawk} both do.
5485For example, @code{"a\qc"} is the same as @code{"aqc"}.
5486(Because this is such an easy bug both to introduce and to miss,
5487@command{gawk} warns you about it.)
For example, @samp{FS = @w{"[ \t]+\|[ \t]+"}} is meant to use vertical bars
surrounded by whitespace as the field separator, but the single backslash
is stripped.  The string needs two backslashes:
@samp{FS = @w{"[ \t]+\\|[ \t]+"}}.
5491@c I did this!  This is why I added the warning.
5492
5493@cindex @command{gawk} @subentry escape sequences
5494@cindex @command{gawk} @subentry escape sequences @seealso{backslash}
5495@cindex Unix @command{awk} @subentry backslashes in escape sequences
5496@cindex @command{mawk} utility
5497@item Leave the backslash alone
5498Some other @command{awk} implementations do this.
5499In such implementations, typing @code{"a\qc"} is the same as typing
5500@code{"a\\qc"}.
5501@end table
5502@end sidebar
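
The following sketch shows the two-backslash form of the field
separator from the sidebar in use.  The input line is made up for
illustration:

@example
$ @kbd{echo 'a | b | c' | awk 'BEGIN @{ FS = "[ \t]+\\|[ \t]+" @} @{ print $2 @}'}
@print{} b
@end example

@noindent
After string processing, the regexp that @command{awk} sees is
@samp{[ \t]+\|[ \t]+}, in which the vertical bar is escaped and
therefore literal.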
5503
5504To summarize:
5505
5506@itemize @value{BULLET}
5507@item
5508The escape sequences in the preceding list are always processed first,
5509for both string constants and regexp constants. This happens very early,
5510as soon as @command{awk} reads your program.
5511
5512@item
5513@command{gawk} processes both regexp constants and dynamic regexps
5514(@pxref{Computed Regexps}),
5515for the special operators listed in
5516@ref{GNU Regexp Operators}.
5517
5518@item
5519A backslash before any other character means to treat that character
5520literally.
5521@end itemize
5522
5523@sidebar Escape Sequences for Metacharacters
5524@cindex metacharacters @subentry escape sequences for
5525
5526Suppose you use an octal or hexadecimal
5527escape to represent a regexp metacharacter.
5528(See @ref{Regexp Operators}.)
5529Does @command{awk} treat the character as a literal character or as a regexp
5530operator?
5531
5532@cindex dark corner @subentry escape sequences @subentry for metacharacters
5533Historically, such characters were taken literally.
5534@value{DARKCORNER}
5535However, the POSIX standard indicates that they should be treated
5536as real metacharacters, which is what @command{gawk} does.
5537In compatibility mode (@pxref{Options}),
5538@command{gawk} treats the characters represented by octal and hexadecimal
escape sequences literally when used in regexp constants. Thus, in that
mode, @code{/a\52b/} is equivalent to @code{/a\*b/}.
5541@end sidebar
5542
5543@node Regexp Operators
5544@section Regular Expression Operators
5545@cindex regular expressions @subentry operators
5546@cindex metacharacters @subentry in regular expressions
5547
5548You can combine regular expressions with special characters,
5549called @dfn{regular expression operators} or @dfn{metacharacters}, to
5550increase the power and versatility of regular expressions.
5551
5552@menu
5553* Regexp Operator Details::     The actual details.
5554* Interval Expressions::        Notes on interval expressions.
5555@end menu
5556
5557@node Regexp Operator Details
5558@subsection Regexp Operators in @command{awk}
5559
5560The escape sequences described
5561@ifnotinfo
5562earlier
5563@end ifnotinfo
5564in @ref{Escape Sequences}
5565are valid inside a regexp.  They are introduced by a @samp{\} and
5566are recognized and converted into corresponding real characters as
5567the very first step in processing regexps.
5568
5569Here is a list of metacharacters.  All characters that are not escape
5570sequences and that are not listed here stand for themselves:
5571
5572@c Use @asis so the docbook comes out ok. Sigh.
5573@table @asis
5574@cindex backslash (@code{\}) @subentry regexp operator
5575@cindex @code{\} (backslash) @subentry regexp operator
5576@item @code{\}
5577This suppresses the special meaning of a character when
5578matching.  For example, @samp{\$}
5579matches the character @samp{$}.
5580
5581@cindex regular expressions @subentry anchors in
5582@cindex Texinfo @subentry chapter beginnings in files
5583@cindex @code{^} (caret) @subentry regexp operator
5584@cindex caret (@code{^}) @subentry regexp operator
5585@item @code{^}
5586This matches the beginning of a string.  @samp{^@@chapter}
5587matches @samp{@@chapter} at the beginning of a string,
5588for example, and can be used
5589to identify chapter beginnings in Texinfo source files.
5590The @samp{^} is known as an @dfn{anchor}, because it anchors the pattern to
5591match only at the beginning of the string.
5592
5593It is important to realize that @samp{^} does not match the beginning of
5594a line (the point right after a @samp{\n} newline character) embedded in a string.
5595The condition is not true in the following example:
5596
5597@example
5598if ("line1\nLINE 2" ~ /^L/) @dots{}
5599@end example
5600
5601@cindex @code{$} (dollar sign) @subentry regexp operator
5602@cindex dollar sign (@code{$}) @subentry regexp operator
5603@item @code{$}
5604This is similar to @samp{^}, but it matches only at the end of a string.
5605For example, @samp{p$}
5606matches a record that ends with a @samp{p}.  The @samp{$} is an anchor
5607and does not match the end of a line
5608(the point right before a @samp{\n} newline character)
5609embedded in a string.
5610The condition in the following example is not true:
5611
5612@example
5613if ("line1\nLINE 2" ~ /1$/) @dots{}
5614@end example
5615
5616@cindex @code{.} (period), regexp operator
5617@cindex period (@code{.}), regexp operator
5618@item @code{.} (period)
5619This matches any single character,
5620@emph{including} the newline character.  For example, @samp{.P}
5621matches any single character followed by a @samp{P} in a string.  Using
5622concatenation, we can make a regular expression such as @samp{U.A}, which
5623matches any three-character sequence that begins with @samp{U} and ends
5624with @samp{A}.
5625
5626@cindex POSIX mode
5627@cindex POSIX @command{awk} @subentry period (@code{.}), using
5628In strict POSIX mode (@pxref{Options}),
5629@samp{.} does not match the @sc{nul}
5630character, which is a character with all bits equal to zero.
5631Otherwise, @sc{nul} is just another character. Other versions of @command{awk}
5632may not be able to match the @sc{nul} character.
5633
5634@cindex @code{[]} (square brackets), regexp operator
5635@cindex square brackets (@code{[]}), regexp operator
5636@cindex bracket expressions
5637@cindex character sets (in regular expressions) @seeentry{bracket expressions}
5638@cindex character lists @seeentry{bracket expressions}
5639@cindex character classes @seeentry{bracket expressions}
5640@item @code{[}@dots{}@code{]}
5641This is called a @dfn{bracket expression}.@footnote{In other literature,
5642you may see a bracket expression referred to as either a
5643@dfn{character set}, a @dfn{character class}, or a @dfn{character list}.}
5644It matches any @emph{one} of the characters that are enclosed in
5645the square brackets.  For example, @samp{[MVX]} matches any one of
5646the characters @samp{M}, @samp{V}, or @samp{X} in a string.  A full
5647discussion of what can be inside the square brackets of a bracket expression
5648is given in
5649@ref{Bracket Expressions}.
5650
5651@cindex bracket expressions @subentry complemented
5652@item @code{[^}@dots{}@code{]}
5653This is a @dfn{complemented bracket expression}.  The first character after
the @samp{[} @emph{must} be a @samp{^}.  It matches any single character
5655@emph{except} those in the square brackets.  For example, @samp{[^awk]}
5656matches any character that is not an @samp{a}, @samp{w},
5657or @samp{k}.
5658
5659@cindex @code{|} (vertical bar)
5660@cindex vertical bar (@code{|})
5661@item @code{|}
5662This is the @dfn{alternation operator} and it is used to specify
5663alternatives.  The @samp{|} has the lowest precedence of all the regular
5664expression operators.  For example, @samp{^P|[aeiouy]} matches any string
5665that matches either @samp{^P} or @samp{[aeiouy]}.  This means it matches
5666any string that starts with @samp{P} or contains (anywhere within it)
5667a lowercase English vowel.
5668
5669The alternation applies to the largest possible regexps on either side.
5670
5671@cindex @code{()} (parentheses) @subentry regexp operator
5672@cindex parentheses @code{()} @subentry regexp operator
5673@item @code{(}@dots{}@code{)}
5674Parentheses are used for grouping in regular expressions, as in
5675arithmetic.  They can be used to concatenate regular expressions
5676containing the alternation operator, @samp{|}.  For example,
5677@samp{@@(samp|code)\@{[^@}]+\@}} matches both @samp{@@code@{foo@}} and
5678@samp{@@samp@{bar@}}.
5679(These are Texinfo formatting control sequences. The @samp{+} is
5680explained further on in this list.)
5681
5682The left or opening parenthesis is always a metacharacter; to match
5683one literally, precede it with a backslash. However, the right or
5684closing parenthesis is only special when paired with a left parenthesis;
5685an unpaired right parenthesis is (silently) treated as a regular character.
5686
5687@cindex @code{*} (asterisk) @subentry @code{*} operator @subentry as regexp operator
5688@cindex asterisk (@code{*}) @subentry @code{*} operator @subentry as regexp operator
5689@item @code{*}
5690This symbol means that the preceding regular expression should be
5691repeated as many times as necessary to find a match.  For example, @samp{ph*}
5692applies the @samp{*} symbol to the preceding @samp{h} and looks for matches
5693of one @samp{p} followed by any number of @samp{h}s.  This also matches
5694just @samp{p} if no @samp{h}s are present.
5695
5696There are two subtle points to understand about how @samp{*} works.
5697First, the @samp{*} applies only to the single preceding regular expression
5698component (e.g., in @samp{ph*}, it applies just to the @samp{h}).
5699To cause @samp{*} to apply to a larger subexpression, use parentheses:
5700@samp{(ph)*} matches @samp{ph}, @samp{phph}, @samp{phphph}, and so on.
5701
5702Second, @samp{*} finds as many repetitions as possible. If the text
5703to be matched is @samp{phhhhhhhhhhhhhhooey}, @samp{ph*} matches all of
5704the @samp{h}s.
5705
5706@cindex @code{+} (plus sign) @subentry regexp operator
5707@cindex plus sign (@code{+}) @subentry regexp operator
5708@item @code{+}
5709This symbol is similar to @samp{*}, except that the preceding expression must be
5710matched at least once.  This means that @samp{wh+y}
5711would match @samp{why} and @samp{whhy}, but not @samp{wy}, whereas
5712@samp{wh*y} would match all three.
5713
5714@cindex @code{?} (question mark) @subentry regexp operator
5715@cindex question mark (@code{?}) @subentry regexp operator
5716@item @code{?}
5717This symbol is similar to @samp{*}, except that the preceding expression can be
5718matched either once or not at all.  For example, @samp{fe?d}
5719matches @samp{fed} and @samp{fd}, but nothing else.
5720
5721@cindex @code{@{@}} (braces) @subentry regexp operator
5722@cindex braces (@code{@{@}}) @subentry regexp operator
5723@cindex interval expressions, regexp operator
5724@item @code{@{}@var{n}@code{@}}
5725@itemx @code{@{}@var{n}@code{,@}}
5726@itemx @code{@{}@var{n}@code{,}@var{m}@code{@}}
5727One or two numbers inside braces denote an @dfn{interval expression}.
5728If there is one number in the braces, the preceding regexp is repeated
5729@var{n} times.
5730If there are two numbers separated by a comma, the preceding regexp is
5731repeated @var{n} to @var{m} times.
5732If there is one number followed by a comma, then the preceding regexp
5733is repeated at least @var{n} times:
5734
5735@table @code
5736@item wh@{3@}y
5737Matches @samp{whhhy}, but not @samp{why} or @samp{whhhhy}.
5738
5739@item wh@{3,5@}y
5740Matches @samp{whhhy}, @samp{whhhhy}, or @samp{whhhhhy} only.
5741
5742@item wh@{2,@}y
5743Matches @samp{whhy}, @samp{whhhy}, and so on.
5744@end table
5745@end table
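
The behavior of the @samp{^} and @samp{$} anchors on a string with an
embedded newline can be checked with a short sketch like the following.
It reuses the string from the anchor examples above; the variable name
@code{s} is arbitrary.  Each test prints 1 for a match and 0 otherwise:

@example
$ @kbd{awk 'BEGIN @{}
> @kbd{    s = "line1\nLINE 2"}
> @kbd{    printf "%d %d\n", (s ~ /^L/), (s ~ /1$/)}
> @kbd{    printf "%d %d\n", (s ~ /^l/), (s ~ /2$/)}
> @kbd{@}'}
@print{} 0 0
@print{} 1 1
@end example

@noindent
Only the very beginning and the very end of the whole string satisfy
the anchors.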
5746
5747@cindex precedence @subentry regexp operators
5748@cindex regular expressions @subentry operators @subentry precedence of
5749In regular expressions, the @samp{*}, @samp{+}, and @samp{?} operators,
5750as well as the braces @samp{@{} and @samp{@}},
5751have
5752the highest precedence, followed by concatenation, and finally by @samp{|}.
5753As in arithmetic, parentheses can change how operators are grouped.
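
To see the effect of precedence and grouping, consider this sketch,
which uses three made-up input lines.  In the first command, @samp{+}
applies only to the @samp{b}; in the second, parentheses make it apply
to the whole of @samp{ab}:

@example
$ @kbd{printf 'ab\nabab\nabb\n' | awk '/^ab+$/'}
@print{} ab
@print{} abb
$ @kbd{printf 'ab\nabab\nabb\n' | awk '/^(ab)+$/'}
@print{} ab
@print{} abab
@end example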
5754
5755@cindex POSIX @command{awk} @subentry regular expressions and
5756@cindex @command{gawk} @subentry regular expressions @subentry precedence
5757In POSIX @command{awk} and @command{gawk}, the @samp{*}, @samp{+}, and
5758@samp{?} operators stand for themselves when there is nothing in the
5759regexp that precedes them.  For example, @code{/+/} matches a literal
5760plus sign.  However, many other versions of @command{awk} treat such a
5761usage as a syntax error.
5762
5763@sidebar What About The Empty Regexp?
5764@cindex empty regexps
5765@cindex regexps, empty
5766We describe here an advanced regexp usage. Feel free to skip it
5767upon first reading.
5768
5769You can supply an empty regexp constant (@samp{//}) in all places
5770where a regexp is expected. Is this useful?  What does it match?
5771
5772It is useful. It matches the (invisible) empty string at the start
5773and end of a string of characters, as well as the empty string
5774between characters. This is best illustrated with the @code{gsub()}
5775function, which makes global substitutions in a string
5776(@pxref{String Functions}).  Normal usage of @code{gsub()} is like
5777so:
5778
5779@example
5780$ @kbd{awk '}
5781> @kbd{BEGIN @{}
5782> @kbd{    x = "ABC_CBA"}
5783> @kbd{    gsub(/B/, "bb", x)}
5784> @kbd{    print x}
5785> @kbd{@}'}
5786@print{} AbbC_CbbA
5787@end example
5788
5789We can use @code{gsub()} to see where the empty strings
5790are that match the empty regexp:
5791
5792@example
5793$ @kbd{awk '}
5794> @kbd{BEGIN @{}
5795> @kbd{    x = "ABC"}
5796> @kbd{    gsub(//, "x", x)}
5797> @kbd{    print x}
5798> @kbd{@}'}
5799@print{} xAxBxCx
5800@end example
5801@end sidebar
5802
5803@node Interval Expressions
5804@subsection Some Notes On Interval Expressions
5805
5806@cindex POSIX @command{awk} @subentry interval expressions in
5807Interval expressions were not traditionally available in @command{awk}.
5808They were added as part of the POSIX standard to make @command{awk}
5809and @command{egrep} consistent with each other.
5810
5811@cindex @command{gawk} @subentry interval expressions and
5812Initially, because old programs may use @samp{@{} and @samp{@}} in regexp
5813constants,
5814@command{gawk} did @emph{not} match interval expressions
5815in regexps.
5816
5817However, beginning with @value{PVERSION} 4.0,
5818@command{gawk} does match interval expressions by default.
5819This is because compatibility with POSIX has become more
5820important to most @command{gawk} users than compatibility with
5821old programs.
5822
5823For programs that use @samp{@{} and @samp{@}} in regexp constants,
5824it is good practice to always escape them with a backslash.  Then the
5825regexp constants are valid and work the way you want them to, using
5826any version of @command{awk}.@footnote{Use two backslashes if you're
5827using a string constant with a regexp operator or function.}
5828
5829When @samp{@{} and @samp{@}} appear in regexp constants
5830in a way that cannot be interpreted as an interval expression
5831(such as @code{/q@{a@}/}), then they stand for themselves.
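
The difference between an interval expression and escaped braces can
be seen in the following sketch.  It assumes @command{gawk}
@value{PVERSION} 4.0 or later running in its default mode, and the
input lines are made up for illustration:

@example
$ @kbd{printf 'aa\na@{2@}\n' | gawk '/^a@{2@}$/'}
@print{} aa
$ @kbd{printf 'aa\na@{2@}\n' | gawk '/^a\@{2\@}$/'}
@print{} a@{2@}
@end example

@noindent
Other @command{awk} implementations that lack interval expressions
may give a different result for the first command.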
5832
5833As mentioned, interval expressions were not traditionally available
5834in @command{awk}. In March of 2019, BWK @command{awk} (finally) acquired them.
5835Nonetheless, because they were not available for
5836so many decades, @command{gawk} continues to not supply them
5837when in compatibility mode (@pxref{Options}).
5838
5839POSIX says that interval expressions containing repetition counts greater
5840than 255 produce unspecified results.
5841
5842@cindex Eggert, Paul
5843In the manual for GNU @command{grep}, Paul Eggert notes the following:
5844
5845@quotation
5846Interval expressions may be implemented internally via repetition.
5847For example, @samp{^(a|bc)@{2,4@}$} might be implemented as
5848@samp{^(a|bc)(a|bc)((a|bc)(a|bc)?)?$}.  A large repetition count may
5849exhaust memory or greatly slow matching.  Even small counts can cause
5850problems if cascaded; for example, @samp{grep -E
5851".*@{10,@}@{10,@}@{10,@}@{10,@}@{10,@}"} is likely to overflow a
5852stack.  Fortunately, regular expressions like these are typically
5853artificial, and cascaded repetitions do not conform to POSIX so cannot
5854be used in portable programs anyway.
5855@end quotation
5856
5857@noindent
5858This same caveat applies to @command{gawk}.
5859
5860@node Bracket Expressions
5861@section Using Bracket Expressions
5862@cindex bracket expressions
5863@cindex bracket expressions @subentry range expressions
5864@cindex range expressions (regexps)
5865@cindex bracket expressions @subentry character lists
5866
5867As mentioned earlier, a bracket expression matches any character among
5868those listed between the opening and closing square brackets.
5869
5870Within a bracket expression, a @dfn{range expression} consists of two
5871characters separated by a hyphen.  It matches any single character that
5872sorts between the two characters, based upon the system's native character
5873set.  For example, @samp{[0-9]} is equivalent to @samp{[0123456789]}.
5874(See @ref{Ranges and Locales} for an explanation of how the POSIX
5875standard and @command{gawk} have changed over time.  This is mainly
5876of historical interest.)
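
As a small sketch of a range expression in use (the input lines are
made up), the following selects only records consisting entirely of
decimal digits:

@example
$ @kbd{printf '42\n4x2\n' | awk '/^[0-9]+$/'}
@print{} 42
@end example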
5877
5878With the increasing popularity of the
5879@uref{http://www.unicode.org, Unicode character standard},
there is an additional wrinkle to consider.  Octal and hexadecimal
escape sequences inside bracket expressions are taken to represent
only single-byte characters (characters whose values fit within
the range 0--255).  To match a range of characters where the endpoints
of the range are larger than 255, enter the multibyte encodings of
the characters directly.
5886
5887@cindex @code{\} (backslash) @subentry in bracket expressions
5888@cindex backslash (@code{\}) @subentry in bracket expressions
5889@cindex @code{^} (caret) @subentry in bracket expressions
5890@cindex caret (@code{^}) @subentry in bracket expressions
5891@cindex @code{-} (hyphen) @subentry in bracket expressions
5892@cindex hyphen (@code{-}) @subentry in bracket expressions
5893To include one of the characters @samp{\}, @samp{]}, @samp{-}, or @samp{^} in a
5894bracket expression, put a @samp{\} in front of it.  For example:
5895
5896@example
5897[d\]]
5898@end example
5899
5900@noindent
5901matches either @samp{d} or @samp{]}.
5902Additionally, if you place @samp{]} right after the opening
5903@samp{[}, the closing bracket is treated as one of the
5904characters to be matched.
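
A quick way to check both spellings is to filter a few sample lines.
The input here is made up for illustration:

@example
$ @kbd{printf 'a]b\nadb\nabc\n' | awk '/[d\]]/'}
@print{} a]b
@print{} adb
$ @kbd{printf 'a]b\nadb\nabc\n' | awk '/[]d]/'}
@print{} a]b
@print{} adb
@end example

@noindent
In both cases the line @samp{abc} is skipped, because it contains
neither @samp{d} nor @samp{]}.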
5905
5906@cindex POSIX @command{awk} @subentry bracket expressions and
5907@cindex Extended Regular Expressions (EREs)
5908@cindex EREs (Extended Regular Expressions)
5909@cindex @command{egrep} utility
5910The treatment of @samp{\} in bracket expressions
5911is compatible with other @command{awk}
5912implementations and is also mandated by POSIX.
5913The regular expressions in @command{awk} are a superset
5914of the POSIX specification for Extended Regular Expressions (EREs).
5915POSIX EREs are based on the regular expressions accepted by the
5916traditional @command{egrep} utility.
5917
5918@cindex bracket expressions @subentry character classes
5919@cindex POSIX @command{awk} @subentry bracket expressions and @subentry character classes
5920@dfn{Character classes} are a feature introduced in the POSIX standard.
5921A character class is a special notation for describing
5922lists of characters that have a specific attribute, but the
5923actual characters can vary from country to country and/or
5924from character set to character set.  For example, the notion of what
5925is an alphabetic character differs between the United States and France.
5926
5927A character class is only valid in a regexp @emph{inside} the
5928brackets of a bracket expression.  Character classes consist of @samp{[:},
5929a keyword denoting the class, and @samp{:]}.
5930@ref{table-char-classes} lists the character classes defined by the
5931POSIX standard.
5932
5933@float Table,table-char-classes
5934@caption{POSIX character classes}
5935@multitable @columnfractions .15 .85
5936@headitem Class @tab Meaning
5937@item @code{[:alnum:]} @tab Alphanumeric characters
5938@item @code{[:alpha:]} @tab Alphabetic characters
5939@item @code{[:blank:]} @tab Space and TAB characters
5940@item @code{[:cntrl:]} @tab Control characters
5941@item @code{[:digit:]} @tab Numeric characters
5942@item @code{[:graph:]} @tab Characters that are both printable and visible
5943(a space is printable but not visible, whereas an @samp{a} is both)
5944@item @code{[:lower:]} @tab Lowercase alphabetic characters
5945@item @code{[:print:]} @tab Printable characters (characters that are not control characters)
5946@item @code{[:punct:]} @tab Punctuation characters (characters that are not letters, digits,
5947control characters, or space characters)
5948@item @code{[:space:]} @tab Space characters (these are: space, TAB, newline, carriage return, formfeed and vertical tab)
5949@item @code{[:upper:]} @tab Uppercase alphabetic characters
5950@item @code{[:xdigit:]} @tab Characters that are hexadecimal digits
5951@end multitable
5952@end float
5953
5954For example, before the POSIX standard, you had to write @code{/[A-Za-z0-9]/}
5955to match alphanumeric characters.  If your
5956character set had other alphabetic characters in it, this would not
5957match them.
5958With the POSIX character classes, you can write
5959@code{/[[:alnum:]]/} to match the alphabetic
5960and numeric characters in your character set.
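
As a small sketch, the following command deletes every character that
is not alphanumeric from a made-up input string.  The input is plain
ASCII so that the result does not depend on the current locale:

@example
$ @kbd{echo 'foo-bar_42!' | awk '@{ gsub(/[^[:alnum:]]/, ""); print @}'}
@print{} foobar42
@end example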
5961
5962@ignore
5963From eliz@gnu.org  Fri Feb 15 03:38:41 2019
5964Date: Fri, 15 Feb 2019 12:38:23 +0200
5965From: Eli Zaretskii <eliz@gnu.org>
5966To: arnold@skeeve.com
5967CC: pengyu.ut@gmail.com, bug-gawk@gnu.org
5968Subject: Re: [bug-gawk] Does gawk character classes follow this?
5969
5970> From: arnold@skeeve.com
5971> Date: Fri, 15 Feb 2019 03:01:34 -0700
5972> Cc: pengyu.ut@gmail.com, bug-gawk@gnu.org
5973>
5974> I get the feeling that there's something really bothering you, but
5975> I don't understand what.
5976>
5977> Can you clarify, please?
5978
5979I thought I already did: we cannot be expected to provide a definitive
5980description of what the named classes stand for, because the answer
5981depends on various factors out of our control.
5982@end ignore
5983
5984@c Thanks to
5985@c Date: Tue, 01 Jul 2014 07:39:51 +0200
5986@c From: Hermann Peifer <peifer@gmx.eu>
5987@cindex ASCII
5988Some utilities that match regular expressions provide a nonstandard
5989@samp{[:ascii:]} character class; @command{awk} does not. However, you
5990can simulate such a construct using @samp{[\x00-\x7F]}.  This matches
5991all values numerically between zero and 127, which is the defined
5992range of the ASCII character set.  Use a complemented character list
5993(@samp{[^\x00-\x7F]}) to match any single-byte characters that are not
5994in the ASCII range.
5995
5996@quotation NOTE
5997Some older versions of Unix @command{awk}
5998treat @code{[:blank:]} like @code{[:space:]}, incorrectly matching
5999more characters than they should.  Caveat Emptor.
6000@end quotation
6001
6002@cindex bracket expressions @subentry collating elements
6003@cindex bracket expressions @subentry non-ASCII
6004@cindex collating elements
6005Two additional special sequences can appear in bracket expressions.
6006These apply to non-ASCII character sets, which can have single symbols
6007(called @dfn{collating elements}) that are represented with more than one
6008character. They can also have several characters that are equivalent for
6009@dfn{collating}, or sorting, purposes.  (For example, in French, a plain ``e''
6010and a grave-accented ``@`e'' are equivalent.)
6011These sequences are:
6012
6013@table @asis
6014@cindex bracket expressions @subentry collating symbols
6015@cindex collating symbols
6016@item Collating symbols
6017Multicharacter collating elements enclosed between
6018@samp{[.} and @samp{.]}.  For example, if @samp{ch} is a collating element,
6019then @samp{[[.ch.]]} is a regexp that matches this collating element, whereas
6020@samp{[ch]} is a regexp that matches either @samp{c} or @samp{h}.
6021
6022@cindex bracket expressions @subentry equivalence classes
6023@item Equivalence classes
6024Locale-specific names for a list of
6025characters that are equal. The name is enclosed between
6026@samp{[=} and @samp{=]}.
6027For example, the name @samp{e} might be used to represent all of
6028``e,'' ``@^e,'' ``@`e,'' and ``@'e.'' In this case, @samp{[[=e=]]} is a regexp
6029that matches any of @samp{e}, @samp{@^e}, @samp{@'e}, or @samp{@`e}.
6030@end table
6031
6032These features are very valuable in non-English-speaking locales.
6033
6034@cindex internationalization @subentry localization @subentry character classes
6035@cindex @command{gawk} @subentry character classes and
6036@cindex POSIX @command{awk} @subentry bracket expressions and @subentry character classes
6037@quotation CAUTION
6038The library functions that @command{gawk} uses for regular
6039expression matching currently recognize only POSIX character classes;
6040they do not recognize collating symbols or equivalence classes.
6041@end quotation
6042@c maybe one day ...
6043
6044Inside a bracket expression, an opening bracket (@samp{[}) that does
6045not start a character class, collating element or equivalence class is
6046taken literally. This is also true of @samp{.} and @samp{*}.
6047
6048@node Leftmost Longest
6049@section How Much Text Matches?
6050
6051@cindex regular expressions @subentry leftmost longest match
6052@c @cindex matching, leftmost longest
6053Consider the following:
6054
6055@example
6056echo aaaabcd | awk '@{ sub(/a+/, "<A>"); print @}'
6057@end example
6058
6059This example uses the @code{sub()} function to make a change to the input
6060record.  (@code{sub()} replaces the first instance of any text matched
6061by the first argument with the string provided as the second argument;
6062@pxref{String Functions}.)  Here, the regexp @code{/a+/} indicates ``one
6063or more @samp{a} characters,'' and the replacement text is @samp{<A>}.
6064
6065The input contains four @samp{a} characters.
6066@command{awk} (and POSIX) regular expressions always match
6067the leftmost, @emph{longest} sequence of input characters that can
6068match.  Thus, all four @samp{a} characters are
6069replaced with @samp{<A>} in this example:
6070
6071@example
6072$ @kbd{echo aaaabcd | awk '@{ sub(/a+/, "<A>"); print @}'}
6073@print{} <A>bcd
6074@end example
6075
6076For simple match/no-match tests, this is not so important. But when doing
6077text matching and substitutions with the @code{match()}, @code{sub()}, @code{gsub()},
6078and @code{gensub()} functions, it is very important.
6079@ifinfo
6080@xref{String Functions},
6081for more information on these functions.
6082@end ifinfo
6083Understanding this principle is also important for regexp-based record
6084and field splitting (@pxref{Records},
6085and also @pxref{Field Separators}).
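
The following sketch illustrates both halves of the rule with made-up
input.  The first command uses @code{match()}, which sets the variables
@code{RSTART} and @code{RLENGTH} to the position and length of the match;
the second shows that ``leftmost'' takes precedence over ``longest'':

@example
$ @kbd{echo aaaabcd | awk '@{ if (match($0, /a+/)) print RSTART, RLENGTH @}'}
@print{} 1 4
$ @kbd{echo xaaay | awk '@{ sub(/a*/, "<A>"); print @}'}
@print{} <A>xaaay
@end example

@noindent
In the second command, @samp{a*} can match the empty string at the very
start of the record.  That match is the leftmost one, so it is chosen
even though a longer run of @samp{a} characters appears farther to
the right.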
6086
6087@node Computed Regexps
6088@section Using Dynamic Regexps
6089
6090@cindex regular expressions @subentry computed
6091@cindex regular expressions @subentry dynamic
6092@cindex @code{~} (tilde), @code{~} operator
6093@cindex tilde (@code{~}), @code{~} operator
6094@cindex @code{!} (exclamation point) @subentry @code{!~} operator
6095@cindex exclamation point (@code{!}) @subentry @code{!~} operator
6096@c @cindex operators, @code{~}
6097@c @cindex operators, @code{!~}
6098The righthand side of a @samp{~} or @samp{!~} operator need not be a
6099regexp constant (i.e., a string of characters between slashes).  It may
6100be any expression.  The expression is evaluated and converted to a string
6101if necessary; the contents of the string are then used as the
6102regexp.  A regexp computed in this way is called a @dfn{dynamic
6103regexp} or a @dfn{computed regexp}:
6104
6105@example
6106BEGIN @{ digits_regexp = "[[:digit:]]+" @}
6107$0 ~ digits_regexp    @{ print @}
6108@end example
6109
6110@noindent
6111This sets @code{digits_regexp} to a regexp that describes one or more digits,
6112and tests whether the input record matches this regexp.
6113
6114@quotation NOTE
6115When using the @samp{~} and @samp{!~}
6116operators, be aware that there is a difference between a regexp constant
6117enclosed in slashes and a string constant enclosed in double quotes.
6118If you are going to use a string constant, you have to understand that
6119the string is, in essence, scanned @emph{twice}: the first time when
6120@command{awk} reads your program, and the second time when it goes to
6121match the string on the lefthand side of the operator with the pattern
6122on the right.  This is true of any string-valued expression (such as
6123@code{digits_regexp}, shown in the previous example), not just string constants.
6124@end quotation
6125
6126@cindex regexp constants @subentry slashes vs.@: quotes
6127@cindex @code{\} (backslash) @subentry in regexp constants
6128@cindex backslash (@code{\}) @subentry in regexp constants
6129@cindex @code{"} (double quote) @subentry in regexp constants
6130@cindex double quote (@code{"}) @subentry in regexp constants
6131What difference does it make if the string is
6132scanned twice? The answer has to do with escape sequences, and particularly
6133with backslashes.  To get a backslash into a regular expression inside a
6134string, you have to type two backslashes.
6135
6136For example, @code{/\*/} is a regexp constant for a literal @samp{*}.
6137Only one backslash is needed.  To do the same thing with a string,
6138you have to type @code{"\\*"}.  The first backslash escapes the
6139second one so that the string actually contains the
6140two characters @samp{\} and @samp{*}.
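
The two spellings can be compared directly.  In this sketch (the input
string is made up for illustration), both rules fire for a line
containing a literal asterisk:

@example
$ @kbd{echo 'a*b' | awk '$0 ~ /\*/ @{ print "constant:", $0 @}}
> @kbd{$0 ~ "\\*" @{ print "string:", $0 @}'}
@print{} constant: a*b
@print{} string: a*b
@end example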
6141
6142@cindex troubleshooting @subentry regexp constants vs.@: string constants
6143@cindex regexp constants @subentry vs.@: string constants
6144@cindex string @subentry constants @subentry vs.@: regexp constants
6145Given that you can use both regexp and string constants to describe
6146regular expressions, which should you use?  The answer is ``regexp
6147constants,'' for several reasons:
6148
6149@itemize @value{BULLET}
6150@item
6151String constants are more complicated to write and
6152more difficult to read. Using regexp constants makes your programs
6153less error-prone.  Not understanding the difference between the two
6154kinds of constants is a common source of errors.
6155
6156@item
6157It is more efficient to use regexp constants. @command{awk} can note
6158that you have supplied a regexp and store it internally in a form that
6159makes pattern matching more efficient.  When using a string constant,
6160@command{awk} must first convert the string into this internal form and
6161then perform the pattern matching.
6162
6163@item
6164Using regexp constants is better form; it shows clearly that you
6165intend a regexp match.
6166@end itemize
6167
6168@sidebar Using @code{\n} in Bracket Expressions of Dynamic Regexps
6169@cindex regular expressions @subentry dynamic @subentry with embedded newlines
6170@cindex newlines @subentry in dynamic regexps
6171
6172Some older versions of @command{awk} do not allow the newline
6173character to be used inside a bracket expression for a dynamic regexp:
6174
6175@example
6176$ @kbd{awk '$0 ~ "[ \t\n]"'}
6177@error{} awk: newline in character class [
6178@error{} ]...
6179@error{}  source line number 1
6180@error{}  context is
6181@error{}        $0 ~ "[ >>>  \t\n]" <<<
6182@end example
6183
6184@cindex newlines @subentry in regexp constants
6185But a newline in a regexp constant works with no problem:
6186
6187@example
6188$ @kbd{awk '$0 ~ /[ \t\n]/'}
6189@kbd{here is a sample line}
6190@print{} here is a sample line
6191@kbd{Ctrl-d}
6192@end example
6193
6194@command{gawk} does not have this problem, and it isn't likely to
6195occur often in practice, but it's worth noting for future reference.
6196@end sidebar
6197
6198@node GNU Regexp Operators
6199@section @command{gawk}-Specific Regexp Operators
6200
6201@c This section adapted (long ago) from the regex-0.12 manual
6202
6203@cindex regular expressions @subentry operators @subentry @command{gawk}
6204@cindex @command{gawk} @subentry regular expressions @subentry operators
6205@cindex operators @subentry GNU-specific
6206@cindex regular expressions @subentry operators @subentry for words
6207@cindex word, regexp definition of
6208GNU software that deals with regular expressions provides a number of
6209additional regexp operators.  These operators are described in this
6210@value{SECTION} and are specific to @command{gawk};
6211they are not available in other @command{awk} implementations.
6212Most of the additional operators deal with word matching.
6213For our purposes, a @dfn{word} is a sequence of one or more letters, digits,
6214or underscores (@samp{_}):
6215
6216@table @code
6217@c @cindex operators, @code{\s} (@command{gawk})
6218@cindex backslash (@code{\}) @subentry @code{\s} operator (@command{gawk})
6219@cindex @code{\} (backslash) @subentry @code{\s} operator (@command{gawk})
6220@item \s
6221Matches any space character as defined by the current locale.
6222Think of it as shorthand for
6223@w{@samp{[[:space:]]}}.
6224
6225@c @cindex operators, @code{\S} (@command{gawk})
6226@cindex backslash (@code{\}) @subentry @code{\S} operator (@command{gawk})
6227@cindex @code{\} (backslash) @subentry @code{\S} operator (@command{gawk})
6228@item \S
6229Matches any character that is not a space, as defined by the current locale.
6230Think of it as shorthand for
6231@w{@samp{[^[:space:]]}}.
6232
6233@c @cindex operators, @code{\w} (@command{gawk})
6234@cindex backslash (@code{\}) @subentry @code{\w} operator (@command{gawk})
6235@cindex @code{\} (backslash) @subentry @code{\w} operator (@command{gawk})
6236@item \w
6237Matches any word-constituent character---that is, it matches any
6238letter, digit, or underscore. Think of it as shorthand for
6239@w{@samp{[[:alnum:]_]}}.
6240
6241@c @cindex operators, @code{\W} (@command{gawk})
6242@cindex backslash (@code{\}) @subentry @code{\W} operator (@command{gawk})
6243@cindex @code{\} (backslash) @subentry @code{\W} operator (@command{gawk})
6244@item \W
6245Matches any character that is not word-constituent.
6246Think of it as shorthand for
6247@w{@samp{[^[:alnum:]_]}}.
6248
6249@c @cindex operators, @code{\<} (@command{gawk})
6250@cindex backslash (@code{\}) @subentry @code{\<} operator (@command{gawk})
6251@cindex @code{\} (backslash) @subentry @code{\<} operator (@command{gawk})
6252@item \<
6253Matches the empty string at the beginning of a word.
6254For example, @code{/\<away/} matches @samp{away} but not
6255@samp{stowaway}.
6256
6257@c @cindex operators, @code{\>} (@command{gawk})
6258@cindex backslash (@code{\}) @subentry @code{\>} operator (@command{gawk})
6259@cindex @code{\} (backslash) @subentry @code{\>} operator (@command{gawk})
6260@item \>
6261Matches the empty string at the end of a word.
6262For example, @code{/stow\>/} matches @samp{stow} but not @samp{stowaway}.
6263
6264@c @cindex operators, @code{\y} (@command{gawk})
6265@cindex backslash (@code{\}) @subentry @code{\y} operator (@command{gawk})
6266@cindex @code{\} (backslash) @subentry @code{\y} operator (@command{gawk})
6267@cindex word boundaries, matching
6268@item \y
6269Matches the empty string at either the beginning or the
6270end of a word (i.e., the word boundar@strong{y}).  For example, @samp{\yballs?\y}
6271matches either @samp{ball} or @samp{balls}, as a separate word.
6272
6273@c @cindex operators, @code{\B} (@command{gawk})
6274@cindex backslash (@code{\}) @subentry @code{\B} operator (@command{gawk})
6275@cindex @code{\} (backslash) @subentry @code{\B} operator (@command{gawk})
6276@item \B
6277Matches the empty string that occurs between two
6278word-constituent characters. For example,
6279@code{/\Brat\B/} matches @samp{crate}, but it does not match @samp{dirty rat}.
6280@samp{\B} is essentially the opposite of @samp{\y}.
6281@end table
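
The following sketch exercises a few of these operators.  It assumes
@command{gawk}, because the operators are GNU extensions, and the sample
text is made up for illustration:

@example
$ @kbd{echo 'stowaway away' | gawk '@{ gsub(/\<away/, "X"); print @}'}
@print{} stowaway X
$ @kbd{echo 'ball balls football' |}
> @kbd{gawk '@{ n = gsub(/\yballs?\y/, "X"); print n, $0 @}'}
@print{} 2 X X football
$ @kbd{echo 'crate rat' | gawk '@{ gsub(/\Brat\B/, "RAT"); print @}'}
@print{} cRATe rat
@end example

@noindent
In each case, only the occurrences that satisfy the word-boundary
condition are replaced.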
6282
6283@cindex buffers @subentry operators for
6284@cindex regular expressions @subentry operators @subentry for buffers
6285@cindex operators @subentry string-matching @subentry for buffers
6286There are two other operators that work on buffers.  In Emacs, a
6287@dfn{buffer} is, naturally, an Emacs buffer.
6288Other GNU programs, including @command{gawk},
6289consider the entire string to match as the buffer.
6290The operators are:
6291
6292@table @code
6293@item \`
6294@c @cindex operators, @code{\`} (@command{gawk})
6295@cindex backslash (@code{\}) @subentry @code{\`} operator (@command{gawk})
6296@cindex @code{\} (backslash) @subentry @code{\`} operator (@command{gawk})
Matches the empty string at the
beginning of a buffer (string).
6299
6300@c @cindex operators, @code{\'} (@command{gawk})
6301@cindex backslash (@code{\}) @subentry @code{\'} operator (@command{gawk})
6302@cindex @code{\} (backslash) @subentry @code{\'} operator (@command{gawk})
6303@item \'
Matches the empty string at the
end of a buffer (string).
6306@end table
6307
6308@cindex @code{^} (caret) @subentry regexp operator
6309@cindex caret (@code{^}) @subentry regexp operator
@cindex @code{$} (dollar sign) @subentry regexp operator
@cindex dollar sign (@code{$}) @subentry regexp operator
6312Because @samp{^} and @samp{$} always work in terms of the beginning
6313and end of strings, these operators don't add any new capabilities
6314for @command{awk}.  They are provided for compatibility with other
6315GNU software.
6316
6317@cindex @command{gawk} @subentry word-boundary operator
6318@cindex word-boundary operator (@command{gawk})
6319@cindex operators @subentry word-boundary (@command{gawk})
6320In other GNU software, the word-boundary operator is @samp{\b}. However,
6321that conflicts with the @command{awk} language's definition of @samp{\b}
6322as backspace, so @command{gawk} uses a different letter.
6323An alternative method would have been to require two backslashes in the
6324GNU operators, but this was deemed too confusing. The current
6325method of using @samp{\y} for the GNU @samp{\b} appears to be the
6326lesser of two evils.
6327
6328@cindex regular expressions @subentry @command{gawk}, command-line options
6329@cindex @command{gawk} @subentry command-line options, regular expressions and
6330The various command-line options
6331(@pxref{Options})
6332control how @command{gawk} interprets characters in regexps:
6333
6334@table @asis
6335@item No options
6336In the default case, @command{gawk} provides all the facilities of
6337POSIX regexps and the
6338@ifnotinfo
6339previously described
6340GNU regexp operators.
6341@end ifnotinfo
6342@ifnottex
6343@ifnotdocbook
6344GNU regexp operators described
6345in @ref{Regexp Operators}.
6346@end ifnotdocbook
6347@end ifnottex
6348
6349@item @option{--posix}
6350Match only POSIX regexps; the GNU operators are not special
6351(e.g., @samp{\w} matches a literal @samp{w}).  Interval expressions
6352are allowed.
6353
6354@cindex Brian Kernighan's @command{awk}
6355@item @option{--traditional}
6356Match traditional Unix @command{awk} regexps. The GNU operators
6357are not special, and interval expressions are not available.
6358Because BWK @command{awk} supports them,
6359the POSIX character classes (@samp{[[:alnum:]]}, etc.) are available.
6360Characters described by octal and hexadecimal escape sequences are
6361treated literally, even if they represent regexp metacharacters.
6362
6363@item @option{--re-interval}
6364Allow interval expressions in regexps, if @option{--traditional}
6365has been provided.
6366Otherwise, interval expressions are available by default.
6367@end table
6368
6369@node Case-sensitivity
6370@section Case Sensitivity in Matching
6371
6372@cindex regular expressions @subentry case sensitivity
6373@cindex case sensitivity @subentry regexps and
6374Case is normally significant in regular expressions, both when matching
6375ordinary characters (i.e., not metacharacters) and inside bracket
6376expressions.  Thus, a @samp{w} in a regular expression matches only a lowercase
6377@samp{w} and not an uppercase @samp{W}.
6378
6379The simplest way to do a case-independent match is to use a bracket
6380expression---for example, @samp{[Ww]}.  However, this can be cumbersome if
6381you need to use it often, and it can make the regular expressions harder
6382to read.  There are two alternatives that you might prefer.
6383
6384One way to perform a case-insensitive match at a particular point in the
6385program is to convert the data to a single case, using the
6386@code{tolower()} or @code{toupper()} built-in string functions (which we
6387haven't discussed yet;
6388@pxref{String Functions}).
6389For example:
6390
6391@example
6392tolower($1) ~ /foo/  @{ @dots{} @}
6393@end example
6394
6395@noindent
6396converts the first field to lowercase before matching against it.
6397This works in any POSIX-compliant @command{awk}.
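
For instance, with some made-up input, the following command prints
only the lines whose first field contains @samp{foo} in any mixture
of upper- and lowercase:

@example
$ @kbd{printf 'Foo 1\nbar 2\nFOOD 3\n' | awk 'tolower($1) ~ /foo/'}
@print{} Foo 1
@print{} FOOD 3
@end example

@noindent
The records themselves are printed unchanged, because @code{tolower()}
returns a converted copy and does not modify its argument.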
6398
6399@cindex @command{gawk} @subentry regular expressions @subentry case sensitivity
6400@cindex case sensitivity @subentry @command{gawk}
6401@cindex differences in @command{awk} and @command{gawk} @subentry regular expressions
6402@cindex @code{~} (tilde), @code{~} operator
6403@cindex tilde (@code{~}), @code{~} operator
6404@cindex @code{!} (exclamation point) @subentry @code{!~} operator
6405@cindex exclamation point (@code{!}) @subentry @code{!~} operator
6406@cindex @code{IGNORECASE} variable @subentry with @code{~} and @code{!~} operators
6407@cindex @command{gawk} @subentry @code{IGNORECASE} variable in
6408@c @cindex variables, @code{IGNORECASE}
6409Another method, specific to @command{gawk}, is to set the variable
6410@code{IGNORECASE} to a nonzero value (@pxref{Built-in Variables}).
6411When @code{IGNORECASE} is not zero, @emph{all} regexp and string
6412operations ignore case.
6413
6414Changing the value of @code{IGNORECASE} dynamically controls the
6415case sensitivity of the program as it runs.  Case is significant by
6416default because @code{IGNORECASE} (like most variables) is initialized
6417to zero:
6418
6419@example
6420x = "aB"
6421if (x ~ /ab/) @dots{}   # this test will fail
6422
6423IGNORECASE = 1
6424if (x ~ /ab/) @dots{}   # now it will succeed
6425@end example
6426
6427In general, you cannot use @code{IGNORECASE} to make certain rules
6428case insensitive and other rules case sensitive, as there is no
6429straightforward way
6430to set @code{IGNORECASE} just for the pattern of
6431a particular rule.@footnote{Experienced C and C++ programmers will note
6432that it is possible, using something like
6433@samp{IGNORECASE = 1 && /foObAr/ @{ @dots{} @}}
6434and
6435@samp{IGNORECASE = 0 || /foobar/ @{ @dots{} @}}.
6436However, this is somewhat obscure and we don't recommend it.}
To make only some rules case insensitive, use either bracket
expressions or @code{tolower()}.  However, only @code{IGNORECASE}
lets you dynamically turn
case sensitivity on or off for all the rules at once.
6440
6441@code{IGNORECASE} can be set on the command line or in a @code{BEGIN} rule
6442(@pxref{Other Arguments}; also
6443@pxref{Using BEGIN/END}).
6444Setting @code{IGNORECASE} from the command line is a way to make
6445a program case insensitive without having to edit it.
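
As a minimal sketch, the following command uses the @option{-v} option
to set @code{IGNORECASE} before the program runs.  It assumes
@command{gawk}, and the input is made up for illustration:

@example
$ @kbd{echo 'FooBar' | gawk -v IGNORECASE=1 '/foobar/'}
@print{} FooBar
@end example

@noindent
Without the assignment, the same command prints nothing, because
the match is then case sensitive.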
6446
6447@c @cindex ISO 8859-1
6448@c @cindex ISO Latin-1
6449In multibyte locales, the equivalences between upper- and lowercase
6450characters are tested based on the wide-character values of the locale's
6451character set.  Prior to @value{PVERSION} 5.0, single-byte characters were
6452tested based on the ISO-8859-1 (ISO Latin-1) character set.  However, as
6453of @value{PVERSION} 5.0, single-byte characters are also tested based on
6454the values of the locale's character set.@footnote{If you don't understand
6455this, don't worry about it; it just means that @command{gawk} does the
6456right thing.}
6457
6458The value of @code{IGNORECASE} has no effect if @command{gawk} is in
6459compatibility mode (@pxref{Options}).
6460Case is always significant in compatibility mode.
6461
6462@node Regexp Summary
6463@section Summary
6464
6465@itemize @value{BULLET}
6466@item
6467Regular expressions describe sets of strings to be matched.
6468In @command{awk}, regular expression constants are written enclosed
6469between slashes: @code{/}@dots{}@code{/}.
6470
6471@item
6472Regexp constants may be used standalone in patterns and
6473in conditional expressions, or as part of matching expressions
6474using the @samp{~} and @samp{!~} operators.
6475
6476@item
6477Escape sequences let you represent nonprintable characters and
6478also let you represent regexp metacharacters as literal characters
6479to be matched.
6480
6481@item
6482Regexp operators provide grouping, alternation, and repetition.
6483
6484@item
6485Bracket expressions give you a shorthand for specifying sets
6486of characters that can match at a particular point in a regexp.
6487Within bracket expressions, POSIX character classes let you specify
6488certain groups of characters in a locale-independent fashion.
6489
6490@item
6491Regular expressions match the leftmost longest text in the string being
6492matched.  This matters for cases where you need to know the extent of
6493the match, such as for text substitution and when the record separator
6494is a regexp.
6495
6496@item
6497Matching expressions may use dynamic regexps (i.e., string values
6498treated as regular expressions).
6499
6500@item
6501@command{gawk}'s @code{IGNORECASE} variable lets you control the
6502case sensitivity of regexp matching.  In other @command{awk}
6503versions, use @code{tolower()} or @code{toupper()}.
6504
6505@end itemize
6506
6507
6508@node Reading Files
6509@chapter Reading Input Files
6510
6511@cindex reading input files
6512@cindex input files @subentry reading
6513@cindex input files
6514@cindex @code{FILENAME} variable
6515In the typical @command{awk} program,
6516@command{awk} reads all input either from the
6517standard input (by default, this is the keyboard, but often it is a pipe from another
6518command) or from files whose names you specify on the @command{awk}
6519command line.  If you specify input files, @command{awk} reads them
6520in order, processing all the data from one before going on to the next.
6521The name of the current input file can be found in the predefined variable
6522@code{FILENAME}
6523(@pxref{Built-in Variables}).
6524
6525@cindex records
6526@cindex fields
6527The input is read in units called @dfn{records}, and is processed by the
6528rules of your program one record at a time.
6529By default, each record is one line.  Each
6530record is automatically split into chunks called @dfn{fields}.
6531This makes it more convenient for programs to work on the parts of a record.
6532
6533@cindex @code{getline} command
6534On rare occasions, you may need to use the @code{getline} command.
The @code{getline} command is valuable both because it
6536can do explicit input from any number of files, and because the files
6537used with it do not have to be named on the @command{awk} command line
6538(@pxref{Getline}).
6539
6540@menu
6541* Records::                     Controlling how data is split into records.
6542* Fields::                      An introduction to fields.
6543* Nonconstant Fields::          Nonconstant Field Numbers.
6544* Changing Fields::             Changing the Contents of a Field.
6545* Field Separators::            The field separator and how to change it.
6546* Constant Size::               Reading constant width data.
6547* Splitting By Content::        Defining Fields By Content
6548* Testing field creation::      Checking how @command{gawk} is splitting
6549                                records.
6550* Multiple Line::               Reading multiline records.
6551* Getline::                     Reading files under explicit program control
6552                                using the @code{getline} function.
6553* Read Timeout::                Reading input with a timeout.
6554* Retrying Input::              Retrying input after certain errors.
6555* Command-line directories::    What happens if you put a directory on the
6556                                command line.
6557* Input Summary::               Input summary.
6558* Input Exercises::             Exercises.
6559@end menu
6560
6561@node Records
6562@section How Input Is Split into Records
6563
6564@cindex input @subentry splitting into records
6565@cindex records @subentry splitting input into
6566@cindex @code{NR} variable
6567@cindex @code{FNR} variable
6568@command{awk} divides the input for your program into records and fields.
6569It keeps track of the number of records that have been read so far from
6570the current input file.  This value is stored in a predefined variable
6571called @code{FNR}, which is reset to zero every time a new file is started.
6572Another predefined variable, @code{NR}, records the total number of input
6573records read so far from all @value{DF}s.  It starts at zero, but is
6574never automatically reset to zero.
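
The difference between the two counters is easiest to see with more
than one input file.  This sketch creates two small scratch files
(the names @file{one.txt} and @file{two.txt} are placeholders chosen
for illustration) and prints both counters for every record:

@example
$ @kbd{printf 'a\nb\n' > one.txt}
$ @kbd{printf 'c\n' > two.txt}
$ @kbd{awk '@{ print FILENAME, FNR, NR @}' one.txt two.txt}
@print{} one.txt 1 1
@print{} one.txt 2 2
@print{} two.txt 1 3
@end example

@noindent
@code{FNR} starts over at one for the second file, whereas @code{NR}
keeps counting.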
6575
6576Normally, records are separated by newline characters.  You can control how
6577records are separated by assigning values to the built-in variable @code{RS}.
6578If @code{RS} is any single character, that character separates records.
6579Otherwise (in @command{gawk}), @code{RS} is treated as a regular expression.
6580This mechanism is explained in greater detail shortly.
6581
6582@menu
6583* awk split records::           How standard @command{awk} splits records.
6584* gawk split records::          How @command{gawk} splits records.
6585@end menu
6586
6587@node awk split records
6588@subsection Record Splitting with Standard @command{awk}
6589
6590@cindex separators @subentry for records
6591@cindex record separators
6592Records are separated by a character called the @dfn{record separator}.
6593By default, the record separator is the newline character.
6594This is why records are, by default, single lines.
6595To use a different character for the record separator,
6596simply assign that character to the predefined variable @code{RS}.
6597
6598@cindex record separators @subentry newlines as
6599@cindex newlines @subentry as record separators
6600@cindex @code{RS} variable
6601Like any other variable,
6602the value of @code{RS} can be changed in the @command{awk} program
6603with the assignment operator, @samp{=}
6604(@pxref{Assignment Ops}).
6605The new record-separator character should be enclosed in quotation marks,
6606which indicate a string constant.  Often, the right time to do this is
6607at the beginning of execution, before any input is processed,
6608so that the very first record is read with the proper separator.
6609To do this, use the special @code{BEGIN} pattern
6610(@pxref{BEGIN/END}).
6611For example:
6612
6613@example
6614awk 'BEGIN @{ RS = "u" @}
6615     @{ print $0 @}' mail-list
6616@end example
6617
6618@noindent
6619changes the value of @code{RS} to @samp{u}, before reading any input.
6620The new value is a string whose first character is the letter ``u''; as a result, records
6621are separated by the letter ``u''.  Then the input file is read, and the second
6622rule in the @command{awk} program (the action with no pattern) prints each
6623record.  Because each @code{print} statement adds a newline at the end of
6624its output, this @command{awk} program copies the input
6625with each @samp{u} changed to a newline.  Here are the results of running
6626the program on @file{mail-list}:
6627
6628@example
6629@group
6630$ @kbd{awk 'BEGIN @{ RS = "u" @}}
6631>      @kbd{@{ print $0 @}' mail-list}
6632@end group
6633@print{} Amelia       555-5553     amelia.zodiac
6634@print{} sq
6635@print{} e@@gmail.com    F
6636@print{} Anthony      555-3412     anthony.assert
6637@print{} ro@@hotmail.com   A
6638@print{} Becky        555-7685     becky.algebrar
6639@print{} m@@gmail.com      A
6640@print{} Bill         555-1675     bill.drowning@@hotmail.com       A
6641@print{} Broderick    555-0542     broderick.aliq
6642@print{} otiens@@yahoo.com R
6643@print{} Camilla      555-2912     camilla.inf
6644@print{} sar
6645@print{} m@@skynet.be     R
6646@print{} Fabi
6647@print{} s       555-1234     fabi
6648@print{} s.
6649@print{} ndevicesim
6650@print{} s@@
6651@print{} cb.ed
6652@print{}     F
6653@print{} J
6654@print{} lie        555-6699     j
6655@print{} lie.perscr
6656@print{} tabor@@skeeve.com   F
6657@print{} Martin       555-6480     martin.codicib
6658@print{} s@@hotmail.com    A
6659@print{} Sam
6660@print{} el       555-3430     sam
6661@print{} el.lanceolis@@sh
6662@print{} .ed
6663@print{}         A
6664@print{} Jean-Pa
6665@print{} l    555-2127     jeanpa
6666@print{} l.campanor
6667@print{} m@@ny
6668@print{} .ed
6669@print{}      R
6670@print{}
6671@end example
6672
6673@noindent
6674Note that the entry for the name @samp{Bill} is not split.
6675In the original @value{DF}
6676(@pxref{Sample Data Files}),
6677the line looks like this:
6678
6679@example
6680Bill         555-1675     bill.drowning@@hotmail.com       A
6681@end example
6682
6683@noindent
6684It contains no @samp{u}, so there is no reason to split the record,
6685unlike the others, which each have one or more occurrences of the @samp{u}.
6686In fact, this record is treated as part of the previous record;
6687the newline separating them in the output
6688is the original newline in the @value{DF}, not the one added by
6689@command{awk} when it printed the record!
6690
6691@cindex record separators @subentry changing
6692@cindex separators @subentry for records
6693Another way to change the record separator is on the command line,
6694using the variable-assignment feature
6695(@pxref{Other Arguments}):
6696
6697@example
6698awk '@{ print $0 @}' RS="u" mail-list
6699@end example
6700
6701@noindent
6702This sets @code{RS} to @samp{u} before processing @file{mail-list}.
6703
6704Using an alphabetic character such as @samp{u} for the record separator
6705is highly likely to produce strange results.
6706Using an unusual character such as @samp{/} is more likely to
6707produce correct behavior in the majority of cases, but there
6708are no guarantees. The moral is: Know Your Data.
6709
6710@command{gawk} allows @code{RS} to be a full regular expression
6711(discussed shortly; @pxref{gawk split records}).  Even so, using
6712a regular expression metacharacter, such as @samp{.} as the single
6713character in the value of @code{RS} has no special effect: it is
treated literally. This is required for backward compatibility with
both Unix @command{awk} and POSIX.
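
As a small sketch of this rule, the following pipeline uses a period as the
record separator.  The input text and the shell variable @code{records}
are invented for the illustration.

@verbatim
```shell
# A single "." in RS is a literal period, not the regexp "any character".
records=$(printf 'a.b.c' | awk 'BEGIN { RS = "." } { print NR ": " $0 }')
printf '%s\n' "$records"
```
@end verbatim

@noindent
If the period were treated as a regexp metacharacter, every character
would act as a separator.  Instead, the input should be split into the
three records @samp{a}, @samp{b}, and @samp{c}.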
6716
6717@cindex dark corner @subentry input files
6718Reaching the end of an input file terminates the current input record,
6719even if the last character in the file is not the character in @code{RS}.
6720@value{DARKCORNER}
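
A quick way to check this is to feed @command{awk} input whose last line
has no trailing newline.  This is only a sketch; the input text and the
shell variable @code{summary} are made up for the example.

@verbatim
```shell
# The input ends with "two" and no final newline.
summary=$(printf 'one\ntwo' | awk '{ last = $0 } END { print NR, last }')
printf '%s\n' "$summary"
```
@end verbatim

@noindent
The unterminated text @samp{two} should still be counted as the second
record, so the result is expected to be @samp{2 two}.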
6721
6722@cindex empty strings @seeentry{null strings}
6723@cindex null strings
6724@cindex strings @subentry empty @seeentry{null strings}
6725The empty string @code{""} (a string without any characters)
6726has a special meaning
6727as the value of @code{RS}. It means that records are separated
6728by one or more blank lines and nothing else.
6729@xref{Multiple Line} for more details.
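
The following sketch shows the effect on a tiny, made-up address list in
which the entries are separated by several blank lines.  The data and the
shell variable @code{summary} are assumptions for the illustration only.

@verbatim
```shell
# Two multiline entries separated by a run of blank lines.
summary=$(printf 'Jane Doe\n123 Main St\n\n\n\nJohn Roe\n' |
    awk 'BEGIN { RS = "" } { print NR ": " NF " fields, first = " $1 }')
printf '%s\n' "$summary"
```
@end verbatim

@noindent
The run of blank lines should count as a single separator, giving two
records.  With the default field separator, the newline inside the first
record acts as ordinary whitespace, so that record should have five fields.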
6730
6731If you change the value of @code{RS} in the middle of an @command{awk} run,
6732the new value is used to delimit subsequent records, but the record
6733currently being processed, as well as records already processed, are not
6734affected.
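
Here is a minimal sketch of changing @code{RS} partway through the input.
The input text and the shell variable @code{records} are invented for the
example.

@verbatim
```shell
# The first record is read with the default RS (newline).
# While processing it, RS is changed to ";" for all later records.
records=$(printf 'x\ny;z;w' |
    awk 'NR == 1 { RS = ";" } { print NR ": " $0 }')
printf '%s\n' "$records"
```
@end verbatim

@noindent
The first record, @samp{x}, was already delimited by a newline and is not
affected.  The remaining input should then be split at each semicolon into
@samp{y}, @samp{z}, and @samp{w}.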
6735
6736@cindex @command{gawk} @subentry @code{RT} variable in
6737@cindex @code{RT} variable
6738@cindex records @subentry terminating
6739@cindex terminating records
6740@cindex differences in @command{awk} and @command{gawk} @subentry record separators
6741@cindex differences in @command{awk} and @command{gawk} @subentry @code{RS}/@code{RT} variables
6742@cindex regular expressions @subentry as record separators
6743@cindex record separators @subentry regular expressions as
6744@cindex separators @subentry for records @subentry regular expressions as
6745After the end of the record has been determined, @command{gawk}
6746sets the variable @code{RT} to the text in the input that matched
6747@code{RS}.
6748
6749@node gawk split records
6750@subsection Record Splitting with @command{gawk}
6751
6752@cindex common extensions @subentry @code{RS} as a regexp
6753@cindex extensions @subentry common @subentry @code{RS} as a regexp
6754When using @command{gawk}, the value of @code{RS} is not limited to a
6755one-character string.  If it contains more than one character, it is
6756treated as a regular expression
6757(@pxref{Regexp}). @value{COMMONEXT}
6758In general, each record
6759ends at the next string that matches the regular expression; the next
6760record starts at the end of the matching string.  This general rule is
6761actually at work in the usual case, where @code{RS} contains just a
6762newline: a record ends at the beginning of the next matching string (the
6763next newline in the input), and the following record starts just after
6764the end of this string (at the first character of the following line).
6765The newline, because it matches @code{RS}, is not part of either record.
6766
6767When @code{RS} is a single character, @code{RT}
6768contains the same single character. However, when @code{RS} is a
6769regular expression, @code{RT} contains
6770the actual input text that matched the regular expression.
6771
6772If the input file ends without any text matching @code{RS},
6773@command{gawk} sets @code{RT} to the null string.
6774
6775The following example illustrates both of these features.
6776It sets @code{RS} equal to a regular expression that
6777matches either a newline or a series of one or more uppercase letters
6778with optional leading and/or trailing whitespace:
6779
6780@example
6781@group
6782$ @kbd{echo record 1 AAAA record 2 BBBB record 3 |}
6783> @kbd{gawk 'BEGIN @{ RS = "\n|( *[[:upper:]]+ *)" @}}
6784>             @kbd{@{ print "Record =", $0,"and RT = [" RT "]" @}'}
6785@end group
6786@print{} Record = record 1 and RT = [ AAAA ]
6787@print{} Record = record 2 and RT = [ BBBB ]
6788@print{} Record = record 3 and RT = [
6789@print{} ]
6790@end example
6791
6792@noindent
6793The square brackets delineate the contents of @code{RT}, letting you
6794see the leading and trailing whitespace. The final value of
6795@code{RT} is a newline.
6796@xref{Simple Sed} for a more useful example
6797of @code{RS} as a regexp and @code{RT}.
6798
6799If you set @code{RS} to a regular expression that allows optional
6800trailing text, such as @samp{RS = "abc(XYZ)?"}, it is possible, due
6801to implementation constraints, that @command{gawk} may match the leading
6802part of the regular expression, but not the trailing part, particularly
6803if the input text that could match the trailing part is fairly long.
6804@command{gawk} attempts to avoid this problem, but currently, there's
6805no guarantee that this will never happen.
6806
6807@sidebar Caveats When Using Regular Expressions for @code{RS}
6808Remember that in @command{awk}, the @samp{^} and @samp{$} anchor
6809metacharacters match the beginning and end of a @emph{string}, and not
6810the beginning and end of a @emph{line}.  As a result, something like
6811@samp{RS = "^[[:upper:]]"} can only match at the beginning of a file.
6812This is because @command{gawk} views the input file as one long string
6813that happens to contain newline characters.
6814It is thus best to avoid anchor metacharacters in the value of @code{RS}.
6815
Record splitting with regular expressions works differently than
regexp matching with the @code{sub()}, @code{gsub()}, and @code{gensub()}
functions (@pxref{String Functions}).  Those functions allow a regexp to
match the empty string; record splitting does not.  Thus, for example,
@samp{RS = "()"} does @emph{not} split records between characters.
6821@end sidebar
6822
6823@cindex @command{gawk} @subentry @code{RT} variable in
6824@cindex @code{RT} variable
6825@cindex differences in @command{awk} and @command{gawk} @subentry @code{RS}/@code{RT} variables
6826The use of @code{RS} as a regular expression and the @code{RT}
6827variable are @command{gawk} extensions; they are not available in
6828compatibility mode
6829(@pxref{Options}).
6830In compatibility mode, only the first character of the value of
6831@code{RS} determines the end of the record.
6832
6833@cindex Brian Kernighan's @command{awk}
6834@command{mawk} has allowed @code{RS} to be a regexp for decades.
6835As of October, 2019, BWK @command{awk} also supports it.  Neither
6836version supplies @code{RT}, however.
6837
6838@sidebar @code{RS = "\0"} Is Not Portable
6839@cindex portability @subentry data files as single record
6840There are times when you might want to treat an entire @value{DF} as a
6841single record.  The only way to make this happen is to give @code{RS}
6842a value that you know doesn't occur in the input file.  This is hard
6843to do in a general way, such that a program always works for arbitrary
6844input files.
6845
6846You might think that for text files, the @sc{nul} character, which
6847consists of a character with all bits equal to zero, is a good
6848value to use for @code{RS} in this case:
6849
6850@example
6851BEGIN @{ RS = "\0" @}  # whole file becomes one record?
6852@end example
6853
6854@cindex differences in @command{awk} and @command{gawk} @subentry strings @subentry storing
6855@command{gawk} in fact accepts this, and uses the @sc{nul}
6856character for the record separator.
This works for certain special files, such as @file{/proc/self/environ} on
6858GNU/Linux systems, where the @sc{nul} character is in fact the record separator.
6859However, this usage is @emph{not} portable
6860to most other @command{awk} implementations.
6861
6862@cindex dark corner @subentry strings, storing
6863Almost all other @command{awk} implementations@footnote{At least that we know
6864about.} store strings internally as C-style strings.  C strings use the
6865@sc{nul} character as the string terminator.  In effect, this means that
6866@samp{RS = "\0"} is the same as @samp{RS = ""}.
6867@value{DARKCORNER}
6868
6869It happens that recent versions of @command{mawk} can use the @sc{nul}
6870character as a record separator. However, this is a special case:
6871@command{mawk} does not allow embedded @sc{nul} characters in strings.
6872(This may change in a future version of @command{mawk}.)
6873
6874@cindex records @subentry treating files as
6875@cindex treating files, as single records
6876@cindex single records, treating files as
6877@xref{Readfile Function} for an interesting way to read
6878whole files.  If you are using @command{gawk}, see @ref{Extension Sample
6879Readfile} for another option.
6880@end sidebar
6881
6882@node Fields
6883@section Examining Fields
6884
6885@cindex examining fields
6886@cindex fields
6887@cindex accessing fields
6888@cindex fields @subentry examining
6889@cindex whitespace @subentry definition of
6890When @command{awk} reads an input record, the record is
6891automatically @dfn{parsed} or separated by the @command{awk} utility into chunks
6892called @dfn{fields}.  By default, fields are separated by @dfn{whitespace},
6893like words in a line.
6894Whitespace in @command{awk} means any string of one or more spaces,
6895TABs, or newlines; other characters
6896that are considered whitespace by other languages
6897(such as formfeed, vertical tab, etc.) are @emph{not} considered
6898whitespace by @command{awk}.
6899
6900The purpose of fields is to make it more convenient for you to refer to
6901these pieces of the record.  You don't have to use them---you can
6902operate on the whole record if you want---but fields are what make
6903simple @command{awk} programs so powerful.
6904
6905@cindex field operator @code{$}
6906@cindex @code{$} (dollar sign) @subentry @code{$} field operator
6907@cindex dollar sign (@code{$}) @subentry @code{$} field operator
6908@cindex field operators, dollar sign as
6909You use a dollar sign (@samp{$})
6910to refer to a field in an @command{awk} program,
6911followed by the number of the field you want.  Thus, @code{$1}
6912refers to the first field, @code{$2} to the second, and so on.
6913(Unlike in the Unix shells, the field numbers are not limited to single digits.
6914@code{$127} is the 127th field in the record.)
6915For example, suppose the following is a line of input:
6916
6917@example
6918This seems like a pretty nice example.
6919@end example
6920
6921@noindent
6922Here the first field, or @code{$1}, is @samp{This}, the second field, or
6923@code{$2}, is @samp{seems}, and so on.  Note that the last field,
6924@code{$7}, is @samp{example.}.  Because there is no space between the
6925@samp{e} and the @samp{.}, the period is considered part of the seventh
6926field.
6927
6928@cindex @code{NF} variable
6929@cindex fields @subentry number of
6930@code{NF} is a predefined variable whose value is the number of fields
6931in the current record.  @command{awk} automatically updates the value
6932of @code{NF} each time it reads a record.  No matter how many fields
6933there are, the last field in a record can be represented by @code{$NF}.
6934So, @code{$NF} is the same as @code{$7}, which is @samp{example.}.
6935If you try to reference a field beyond the last
6936one (such as @code{$8} when the record has only seven fields), you get
6937the empty string.  (If used in a numeric operation, you get zero.)
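
These points can be checked with the sample line shown earlier.  This is
only a sketch; the shell variable @code{report} is a name chosen for the
illustration.

@verbatim
```shell
# Show NF, the last field, an out-of-range field as a string,
# and the same out-of-range field used as a number.
report=$(echo 'This seems like a pretty nice example.' |
    awk '{ print NF; print $NF; print "[" $8 "]"; print $8 + 0 }')
printf '%s\n' "$report"
```
@end verbatim

@noindent
The expected output is:

@example
@print{} 7
@print{} example.
@print{} []
@print{} 0
@end example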
6938
6939The use of @code{$0}, which looks like a reference to the ``zeroth'' field, is
6940a special case: it represents the whole input record. Use it
6941when you are not interested in specific fields.
6942Here are some more examples:
6943
6944@example
6945$ @kbd{awk '$1 ~ /li/ @{ print $0 @}' mail-list}
6946@print{} Amelia       555-5553     amelia.zodiacusque@@gmail.com    F
6947@print{} Julie        555-6699     julie.perscrutabor@@skeeve.com   F
6948@end example
6949
6950@noindent
6951This example prints each record in the file @file{mail-list} whose first
6952field contains the string @samp{li}.
6953
6954By contrast, the following example looks for @samp{li} in @emph{the
6955entire record} and prints the first and last fields for each matching
6956input record:
6957
6958@example
6959$ @kbd{awk '/li/ @{ print $1, $NF @}' mail-list}
6960@print{} Amelia F
6961@print{} Broderick R
6962@print{} Julie F
6963@print{} Samuel A
6964@end example
6965
6966@node Nonconstant Fields
6967@section Nonconstant Field Numbers
6968@cindex fields @subentry numbers
6969@cindex field numbers
6970
6971A field number need not be a constant.  Any expression in
6972the @command{awk} language can be used after a @samp{$} to refer to a
6973field.  The value of the expression specifies the field number.  If the
6974value is a string, rather than a number, it is converted to a number.
6975Consider this example:
6976
6977@example
6978awk '@{ print $NR @}'
6979@end example
6980
6981@noindent
6982Recall that @code{NR} is the number of records read so far: one in the
6983first record, two in the second, and so on.  So this example prints the first
6984field of the first record, the second field of the second record, and so
6985on.  For the twentieth record, field number 20 is printed; most likely,
6986the record has fewer than 20 fields, so this prints a blank line.
6987Here is another example of using expressions as field numbers:
6988
6989@example
6990awk '@{ print $(2*2) @}' mail-list
6991@end example
6992
6993@command{awk} evaluates the expression @samp{(2*2)} and uses
6994its value as the number of the field to print.  The @samp{*}
6995represents multiplication, so the expression @samp{2*2} evaluates to four.
6996The parentheses are used so that the multiplication is done before the
6997@samp{$} operation; they are necessary whenever there is a binary
6998operator@footnote{A @dfn{binary operator}, such as @samp{*} for
6999multiplication, is one that takes two operands. The distinction
7000is required because @command{awk} also has unary (one-operand)
7001and ternary (three-operand) operators.}
7002in the field-number expression.  This example, then, prints the
7003type of relationship (the fourth field) for every line of the file
7004@file{mail-list}.  (All of the @command{awk} operators are listed, in
7005order of decreasing precedence, in
7006@ref{Precedence}.)
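
A common case where the parentheses matter is referring to the
next-to-last field.  The following sketch uses made-up input, and the
shell variable @code{result} is a name chosen for the illustration.

@verbatim
```shell
# $NF-1 means "the last field, minus one";
# $(NF-1) means "the field whose number is NF minus one".
result=$(echo '10 20 30' | awk '{ print $NF-1, $(NF-1) }')
printf '%s\n' "$result"
```
@end verbatim

@noindent
Because @samp{$} binds more tightly than subtraction, the first expression
should yield 29 (30 minus one), whereas the parenthesized form should
yield 20, the value of the second field.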
7007
7008If the field number you compute is zero, you get the entire record.
7009Thus, @samp{$(2-2)} has the same value as @code{$0}.  Negative field
7010numbers are not allowed; trying to reference one usually terminates
7011the program.  (The POSIX standard does not define
7012what happens when you reference a negative field number.  @command{gawk}
7013notices this and terminates your program.  Other @command{awk}
7014implementations may behave differently.)
7015
7016As mentioned in @ref{Fields},
@command{awk} stores the current record's number of fields in the predefined
7018variable @code{NF} (also @pxref{Built-in Variables}).  Thus, the expression
7019@code{$NF} is not a special feature---it is the direct consequence of
7020evaluating @code{NF} and using its value as a field number.
7021
7022@node Changing Fields
7023@section Changing the Contents of a Field
7024
7025@cindex fields @subentry changing contents of
7026The contents of a field, as seen by @command{awk}, can be changed within an
7027@command{awk} program; this changes what @command{awk} perceives as the
7028current input record.  (The actual input is untouched; @command{awk} @emph{never}
7029modifies the input file.)
7030Consider the following example and its output:
7031
7032@example
7033$ @kbd{awk '@{ nboxes = $3 ; $3 = $3 - 10}
7034>        @kbd{print nboxes, $3 @}' inventory-shipped}
7035@print{} 25 15
7036@print{} 32 22
7037@print{} 24 14
7038@dots{}
7039@end example
7040
7041@noindent
7042The program first saves the original value of field three in the variable
7043@code{nboxes}.
7044The @samp{-} sign represents subtraction, so this program reassigns
7045field three, @code{$3}, as the original value of field three minus ten:
7046@samp{$3 - 10}.  (@xref{Arithmetic Ops}.)
7047Then it prints the original and new values for field three.
7048(Someone in the warehouse made a consistent mistake while inventorying
7049the red boxes.)
7050
7051For this to work, the text in @code{$3} must make sense
7052as a number; the string of characters must be converted to a number
7053for the computer to do arithmetic on it.  The number resulting
7054from the subtraction is converted back to a string of characters that
7055then becomes field three.
7056@xref{Conversion}.
7057
7058When the value of a field is changed (as perceived by @command{awk}), the
7059text of the input record is recalculated to contain the new field where
7060the old one was.  In other words, @code{$0} changes to reflect the altered
7061field.  Thus, this program
7062prints a copy of the input file, with 10 subtracted from the second
7063field of each line:
7064
7065@example
7066$ @kbd{awk '@{ $2 = $2 - 10; print $0 @}' inventory-shipped}
7067@print{} Jan 3 25 15 115
7068@print{} Feb 5 32 24 226
7069@print{} Mar 5 24 34 228
7070@dots{}
7071@end example
7072
7073It is also possible to assign contents to fields that are out
7074of range.  For example:
7075
7076@example
7077$ @kbd{awk '@{ $6 = ($5 + $4 + $3 + $2)}
7078> @kbd{       print $6 @}' inventory-shipped}
7079@print{} 168
7080@print{} 297
7081@print{} 301
7082@dots{}
7083@end example
7084
7085@cindex adding @subentry fields
7086@cindex fields @subentry adding
7087@noindent
7088We've just created @code{$6}, whose value is the sum of fields
7089@code{$2}, @code{$3}, @code{$4}, and @code{$5}.  The @samp{+} sign
7090represents addition.  For the file @file{inventory-shipped}, @code{$6}
7091represents the total number of parcels shipped for a particular month.
7092
7093Creating a new field changes @command{awk}'s internal copy of the current
7094input record, which is the value of @code{$0}.  Thus, if you do @samp{print $0}
7095after adding a field, the record printed includes the new field, with
7096the appropriate number of field separators between it and the previously
7097existing fields.
7098
7099@cindex @code{OFS} variable
7100@cindex output field separator @seeentry{@code{OFS} variable}
7101@cindex field separator @seealso{@code{OFS}}
7102This recomputation affects and is affected by
7103@code{NF} (the number of fields; @pxref{Fields}).
7104For example, the value of @code{NF} is set to the number of the highest
7105field you create.
7106The exact format of @code{$0} is also affected by a feature that has not been discussed yet:
7107the @dfn{output field separator}, @code{OFS},
7108used to separate the fields (@pxref{Output Separators}).
7109
7110Note, however, that merely @emph{referencing} an out-of-range field
7111does @emph{not} change the value of either @code{$0} or @code{NF}.
7112Referencing an out-of-range field only produces an empty string.  For
7113example:
7114
7115@example
7116if ($(NF+1) != "")
7117    print "can't happen"
7118else
7119    print "everything is normal"
7120@end example
7121
7122@noindent
7123should print @samp{everything is normal}, because @code{NF+1} is certain
7124to be out of range.  (@xref{If Statement}
7125for more information about @command{awk}'s @code{if-else} statements.
7126@xref{Typing and Comparison}
7127for more information about the @samp{!=} operator.)
7128
7129It is important to note that making an assignment to an existing field
7130changes the
7131value of @code{$0} but does not change the value of @code{NF},
7132even when you assign the empty string to a field.  For example:
7133
7134@example
7135$ @kbd{echo a b c d | awk '@{ OFS = ":"; $2 = ""}
7136>                       @kbd{print $0; print NF @}'}
7137@print{} a::c:d
7138@print{} 4
7139@end example
7140
7141@noindent
7142The field is still there; it just has an empty value, delimited by
7143the two colons between @samp{a} and @samp{c}.
7144This example shows what happens if you create a new field:
7145
7146@example
7147$ @kbd{echo a b c d | awk '@{ OFS = ":"; $2 = ""; $6 = "new"}
7148>                       @kbd{print $0; print NF @}'}
7149@print{} a::c:d::new
7150@print{} 6
7151@end example
7152
7153@noindent
7154The intervening field, @code{$5}, is created with an empty value
7155(indicated by the second pair of adjacent colons),
7156and @code{NF} is updated with the value six.
7157
7158@cindex dark corner @subentry @code{NF} variable, decrementing
7159@cindex @code{NF} variable @subentry decrementing
7160Decrementing @code{NF} throws away the values of the fields
7161after the new value of @code{NF} and recomputes @code{$0}.
7162@value{DARKCORNER}
7163Here is an example:
7164
7165@example
7166$ @kbd{echo a b c d e f | awk '@{ print "NF =", NF;}
7167> @kbd{                          NF = 3; print $0 @}'}
7168@print{} NF = 6
7169@print{} a b c
7170@end example
7171
7172@cindex portability @subentry @code{NF} variable, decrementing
7173@quotation CAUTION
7174Some versions of @command{awk} don't
7175rebuild @code{$0} when @code{NF} is decremented.
7176Until August, 2018, this included BWK @command{awk}; fortunately
7177his version now handles this correctly.
7178@end quotation
7179
7180Finally, there are times when it is convenient to force
7181@command{awk} to rebuild the entire record, using the current
7182values of the fields and @code{OFS}.  To do this, use the
7183seemingly innocuous assignment:
7184
7185@example
7186@group
7187$1 = $1   # force record to be reconstituted
7188print $0  # or whatever else with $0
7189@end group
7190@end example
7191
7192@noindent
7193This forces @command{awk} to rebuild the record.  It does help
7194to add a comment, as we've shown here.
7195
7196There is a flip side to the relationship between @code{$0} and
7197the fields.  Any assignment to @code{$0} causes the record to be
7198reparsed into fields using the @emph{current} value of @code{FS}.
7199This also applies to any built-in function that updates @code{$0},
7200such as @code{sub()} and @code{gsub()}
7201(@pxref{String Functions}).
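
As a sketch of this reparsing, the following program changes @code{FS}
after the record has been read and then assigns @code{$0} to itself.
The input text and the names @code{before} and @code{result} are invented
for the example.

@verbatim
```shell
# The record is first split with the default FS, giving one field.
# After FS is changed, assigning to $0 forces a new split.
result=$(echo 'a:b:c' |
    awk '{ before = NF; FS = ":"; $0 = $0; print before, NF, $2 }')
printf '%s\n' "$result"
```
@end verbatim

@noindent
The record should have one field before the assignment and three
afterwards, with @samp{b} as the second field.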
7202
7203@sidebar Understanding @code{$0}
7204
7205It is important to remember that @code{$0} is the @emph{full}
7206record, exactly as it was read from the input.  This includes
7207any leading or trailing whitespace, and the exact whitespace (or other
7208characters) that separates the fields.
7209
7210It is a common error to try to change the field separators
7211in a record simply by setting @code{FS} and @code{OFS}, and then
7212expecting a plain @samp{print} or @samp{print $0} to print the
7213modified record.
7214
7215But this does not work, because nothing was done to change the record
7216itself.  Instead, you must force the record to be rebuilt, typically
7217with a statement such as @samp{$1 = $1}, as described earlier.
7218@end sidebar
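
The point made in the sidebar can be sketched as follows.  The input text
and the shell variable @code{lines} are made up for the illustration.

@verbatim
```shell
# Setting OFS alone does not change $0; the record is rebuilt
# with the new separator only after a field is assigned.
lines=$(echo 'a b c' |
    awk 'BEGIN { OFS = "-" } { print; $1 = $1; print }')
printf '%s\n' "$lines"
```
@end verbatim

@noindent
The first @code{print} should show the record unchanged, and the second
should show it rebuilt with the new output field separator:

@example
@print{} a b c
@print{} a-b-c
@end example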
7219
7220
7221@node Field Separators
7222@section Specifying How Fields Are Separated
7223
7224@menu
7225* Default Field Splitting::      How fields are normally separated.
7226* Regexp Field Splitting::       Using regexps as the field separator.
7227* Single Character Fields::      Making each character a separate field.
7228* Command Line Field Separator:: Setting @code{FS} from the command line.
7229* Full Line Fields::             Making the full line be a single field.
7230* Field Splitting Summary::      Some final points and a summary table.
7231@end menu
7232
7233@cindex @code{FS} variable
7234@cindex fields @subentry separating
7235@cindex field separator
7237The @dfn{field separator}, which is either a single character or a regular
7238expression, controls the way @command{awk} splits an input record into fields.
7239@command{awk} scans the input record for character sequences that
7240match the separator; the fields themselves are the text between the matches.
7241
7242In the examples that follow, we use the bullet symbol (@bullet{}) to
7243represent spaces in the output.
7244If the field separator is @samp{oo}, then the following line:
7245
7246@example
7247moo goo gai pan
7248@end example
7249
7250@noindent
7251is split into three fields: @samp{m}, @samp{@bullet{}g}, and
7252@samp{@bullet{}gai@bullet{}pan}.
7253Note the leading spaces in the values of the second and third fields.
7254
7255@cindex troubleshooting @subentry @command{awk} uses @code{FS} not @code{IFS}
7256The field separator is represented by the predefined variable @code{FS}.
7257Shell programmers take note:  @command{awk} does @emph{not} use the
7258name @code{IFS} that is used by the POSIX-compliant shells (such as
7259the Unix Bourne shell, @command{sh}, or Bash).
7260
7261@cindex @code{FS} variable @subentry changing value of
7262The value of @code{FS} can be changed in the @command{awk} program with the
7263assignment operator, @samp{=} (@pxref{Assignment Ops}).
7264Often, the right time to do this is at the beginning of execution
7265before any input has been processed, so that the very first record
7266is read with the proper separator.  To do this, use the special
7267@code{BEGIN} pattern
7268(@pxref{BEGIN/END}).
7269For example, here we set the value of @code{FS} to the string
7270@code{","}:
7271
7272@example
7273awk 'BEGIN @{ FS = "," @} ; @{ print $2 @}'
7274@end example
7275
7276@cindex @code{BEGIN} pattern
7277@noindent
7278Given the input line:
7279
7280@example
7281John Q. Smith, 29 Oak St., Walamazoo, MI 42139
7282@end example
7283
7284@noindent
7285this @command{awk} program extracts and prints the string
7286@samp{@bullet{}29@bullet{}Oak@bullet{}St.}.
7287
7288@cindex field separator @subentry choice of
7289@cindex regular expressions @subentry as field separators
7290@cindex field separator @subentry regular expression as
7291Sometimes the input data contains separator characters that don't
7292separate fields the way you thought they would.  For instance, the
7293person's name in the example we just used might have a title or
7294suffix attached, such as:
7295
7296@example
7297John Q. Smith, LXIX, 29 Oak St., Walamazoo, MI 42139
7298@end example
7299
7300@noindent
7301The same program would extract @samp{@bullet{}LXIX} instead of
7302@samp{@bullet{}29@bullet{}Oak@bullet{}St.}.
7303If you were expecting the program to print the
7304address, you would be surprised.  The moral is to choose your data layout and
7305separator characters carefully to prevent such problems.
7306(If the data is not in a form that is easy to process, perhaps you
7307can massage it first with a separate @command{awk} program.)
7308
7309
7310@node Default Field Splitting
7311@subsection Whitespace Normally Separates Fields
7312
7313@cindex field separator @subentry whitespace as
7314@cindex whitespace @subentry as field separators
7315@cindex field separator @subentry @code{FS} variable and
7316@cindex separators @subentry field @subentry @code{FS} variable and
7317Fields are normally separated by whitespace sequences
7318(spaces, TABs, and newlines), not by single spaces.  Two spaces in a row do not
7319delimit an empty field.  The default value of the field separator @code{FS}
7320is a string containing a single space, @w{@code{" "}}.  If @command{awk}
7321interpreted this value in the usual way, each space character would separate
7322fields, so two spaces in a row would make an empty field between them.
7323The reason this does not happen is that a single space as the value of
7324@code{FS} is a special case---it is taken to specify the default manner
7325of delimiting fields.
7326
7327If @code{FS} is any other single character, such as @code{","}, then
7328each occurrence of that character separates two fields.  Two consecutive
7329occurrences delimit an empty field.  If the character occurs at the
7330beginning or the end of the line, that too delimits an empty field.  The
7331space character is the only single character that does not follow these
7332rules.
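
As a quick illustration (a minimal sketch; the input strings are made up
for this example), compare how many fields @command{awk} finds with the
default separator and with a comma:

@example
$ @kbd{echo 'a  b ' | awk '@{ print NF @}'}
@print{} 2
$ @kbd{echo 'a,,b,' | awk 'BEGIN @{ FS = "," @} @{ print NF @}'}
@print{} 4
@end example

@noindent
With the default @code{FS}, neither the doubled space nor the trailing
space creates an empty field.  With a comma, both the doubled comma and
the trailing comma do.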
7333
7334@node Regexp Field Splitting
7335@subsection Using Regular Expressions to Separate Fields
7336
7337@cindex regular expressions @subentry as field separators
7338@cindex field separator @subentry regular expression as
7339The previous @value{SUBSECTION}
7340discussed the use of single characters or simple strings as the
7341value of @code{FS}.
7342More generally, the value of @code{FS} may be a string containing any
7343regular expression.  In this case, each match in the record for the regular
7344expression separates fields.  For example, the assignment:
7345
7346@example
7347FS = ", \t"
7348@end example
7349
7350@noindent
7351makes every area of an input line that consists of a comma followed by a
7352space and a TAB into a field separator.
7353@ifinfo
7354(@samp{\t}
7355is an @dfn{escape sequence} that stands for a TAB;
7356@pxref{Escape Sequences},
7357for the complete list of similar escape sequences.)
7358@end ifinfo
7359
7360For a less trivial example of a regular expression, try using
7361single spaces to separate fields the way single commas are used.
7362@code{FS} can be set to @w{@code{"[@ ]"}} (left bracket, space, right
7363bracket).  This regular expression matches a single space and nothing else
7364(@pxref{Regexp}).
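
For instance, the following sketch (using made-up input that contains two
adjacent spaces) suggests how each space now separates fields on its own,
so that the second field is empty:

@example
$ @kbd{echo 'a  b' | awk 'BEGIN @{ FS = "[ ]" @} @{ print NF @}'}
@print{} 3
@end example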
7365
7366There is an important difference between the two cases of @samp{FS = @w{" "}}
7367(a single space) and @samp{FS = @w{"[ \t\n]+"}}
7368(a regular expression matching one or more spaces, TABs, or newlines).
7369For both values of @code{FS}, fields are separated by @dfn{runs}
7370(multiple adjacent occurrences) of spaces, TABs,
7371and/or newlines.  However, when the value of @code{FS} is @w{@code{" "}},
7372@command{awk} first strips leading and trailing whitespace from
7373the record and then decides where the fields are.
7374For example, the following pipeline prints @samp{b}:
7375
7376@example
7377$ @kbd{echo ' a b c d ' | awk '@{ print $2 @}'}
7378@print{} b
7379@end example
7380
7381@noindent
7382However, this pipeline prints @samp{a} (note the extra spaces around
7383each letter):
7384
7385@example
7386$ @kbd{echo ' a  b  c  d ' | awk 'BEGIN @{ FS = "[ \t\n]+" @}}
7387>                                  @kbd{@{ print $2 @}'}
7388@print{} a
7389@end example
7390
7391@noindent
7392@cindex null strings
7393@cindex strings @subentry null
7394In this case, the first field is null, or empty.
7395
7396The stripping of leading and trailing whitespace also comes into
7397play whenever @code{$0} is recomputed.  For instance, study this pipeline:
7398
7399@example
7400$ @kbd{echo '   a b c d' | awk '@{ print; $2 = $2; print @}'}
7401@print{}    a b c d
7402@print{} a b c d
7403@end example
7404
7405@noindent
7406The first @code{print} statement prints the record as it was read,
7407with leading whitespace intact.  The assignment to @code{$2} rebuilds
7408@code{$0} by concatenating @code{$1} through @code{$NF} together,
7409separated by the value of @code{OFS} (which is a space by default).
7410Because the leading whitespace was ignored when finding @code{$1},
7411it is not part of the new @code{$0}.  Finally, the last @code{print}
7412statement prints the new @code{$0}.
7413
7414@cindex @code{FS} variable @subentry containing @code{^}
7415@cindex @code{^} (caret) @subentry in @code{FS}
7416@cindex dark corner @subentry @code{^}, in @code{FS}
7417There is an additional subtlety to be aware of when using regular expressions
7418for field splitting.
7419It is not well specified in the POSIX standard, or anywhere else, what @samp{^}
7420means when splitting fields.  Does the @samp{^}  match only at the beginning of
7421the entire record? Or is each field separator a new string?  It turns out that
7422different @command{awk} versions answer this question differently, and you
7423should not rely on any specific behavior in your programs.
7424@value{DARKCORNER}
7425
7426@cindex Brian Kernighan's @command{awk}
7427As a point of information, BWK @command{awk} allows @samp{^}
7428to match only at the beginning of the record. @command{gawk}
7429also works this way. For example:
7430
7431@example
7432$ @kbd{echo 'xxAA  xxBxx  C' |}
7433> @kbd{gawk -F '(^x+)|( +)' '@{ for (i = 1; i <= NF; i++)}
7434> @kbd{                            printf "-->%s<--\n", $i @}'}
7435@print{} --><--
7436@print{} -->AA<--
7437@print{} -->xxBxx<--
7438@print{} -->C<--
7439@end example
7440
Finally, field splitting with regular expressions works differently than
regexp matching with the @code{sub()}, @code{gsub()}, and @code{gensub()}
functions (@pxref{String Functions}).  Those functions allow a regexp to match the
empty string; field splitting does not.  Thus, for example, @samp{FS =
"()"} does @emph{not} split fields between characters.
7446
7447@node Single Character Fields
7448@subsection Making Each Character a Separate Field
7449
7450@cindex common extensions @subentry single character fields
7451@cindex extensions @subentry common @subentry single character fields
7452@cindex differences in @command{awk} and @command{gawk} @subentry single-character fields
7453@cindex single-character fields
7454@cindex fields @subentry single-character
7455There are times when you may want to examine each character
7456of a record separately.  This can be done in @command{gawk} by
7457simply assigning the null string (@code{""}) to @code{FS}. @value{COMMONEXT}
7458In this case,
7459each individual character in the record becomes a separate field.
7460For example:
7461
7462@example
7463$ @kbd{echo a b | gawk 'BEGIN @{ FS = "" @}}
7464>                  @kbd{@{}
7465>                      @kbd{for (i = 1; i <= NF; i = i + 1)}
7466>                          @kbd{print "Field", i, "is", $i}
7467>                  @kbd{@}'}
7468@print{} Field 1 is a
7469@print{} Field 2 is
7470@print{} Field 3 is b
7471@end example
7472
7473@cindex dark corner @subentry @code{FS} as null string
7474@cindex @code{FS} variable @subentry null string as
7475Traditionally, the behavior of @code{FS} equal to @code{""} was not defined.
7476In this case, most versions of Unix @command{awk} simply treat the entire record
7477as only having one field.
7478@value{DARKCORNER}
7479In compatibility mode
7480(@pxref{Options}),
7481if @code{FS} is the null string, then @command{gawk} also
7482behaves this way.
7483
7484@node Command Line Field Separator
7485@subsection Setting @code{FS} from the Command Line
7486@cindex @option{-F} option @subentry command-line
7487@cindex field separator @subentry on command line
7488@cindex command line @subentry @code{FS} on, setting
7489@cindex @code{FS} variable @subentry setting from command line
7490
7491@code{FS} can be set on the command line.  Use the @option{-F} option to
7492do so.  For example:
7493
7494@example
7495awk -F, '@var{program}' @var{input-files}
7496@end example
7497
7498@noindent
7499sets @code{FS} to the @samp{,} character.  Notice that the option uses
7500an uppercase @samp{F} instead of a lowercase @samp{f}. The latter
7501option (@option{-f}) specifies a file containing an @command{awk} program.
7502
7503The value used for the argument to @option{-F} is processed in exactly the
7504same way as assignments to the predefined variable @code{FS}.
7505Any special characters in the field separator must be escaped
7506appropriately.  For example, to use a @samp{\} as the field separator
7507on the command line, you would have to type:
7508
7509@example
7510# same as FS = "\\"
7511awk -F\\\\ '@dots{}' files @dots{}
7512@end example
7513
7514@noindent
7515@cindex field separator @subentry backslash (@code{\}) as
7516@cindex @code{\} (backslash) @subentry as field separator
7517@cindex backslash (@code{\}) @subentry as field separator
7518Because @samp{\} is used for quoting in the shell, @command{awk} sees
7519@samp{-F\\}.  Then @command{awk} processes the @samp{\\} for escape
7520characters (@pxref{Escape Sequences}), finally yielding
7521a single @samp{\} to use for the field separator.
7522
7523@c @cindex historical features
7524As a special case, in compatibility mode
7525(@pxref{Options}),
7526if the argument to @option{-F} is @samp{t}, then @code{FS} is set to
7527the TAB character.  If you type @samp{-F\t} at the
7528shell, without any quotes, the @samp{\} gets deleted, so @command{awk}
7529figures that you really want your fields to be separated with TABs and
7530not @samp{t}s.  Use @samp{-v FS="t"} or @samp{-F"[t]"} on the command line
7531if you really do want to separate your fields with @samp{t}s.
7532Use @samp{-F '\t'} when not in compatibility mode to specify that TABs
7533separate fields.
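
For example, assuming that @command{gawk} is not running in compatibility
mode, this sketch splits a made-up input line on TABs only, so the space
inside the second field is preserved:

@example
$ @kbd{printf 'a\tb c\n' | gawk -F '\t' '@{ print $2 @}'}
@print{} b c
@end example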
7534
7535As an example, let's use an @command{awk} program file called @file{edu.awk}
7536that contains the pattern @code{/edu/} and the action @samp{print $1}:
7537
7538@example
7539/edu/   @{ print $1 @}
7540@end example
7541
7542Let's also set @code{FS} to be the @samp{-} character and run the
7543program on the file @file{mail-list}.  The following command prints a
7544list of the names of the people that work at or attend a university, and
7545the first three digits of their phone numbers:
7546
7547@example
7548$ @kbd{awk -F- -f edu.awk mail-list}
7549@print{} Fabius       555
7550@print{} Samuel       555
7551@print{} Jean
7552@end example
7553
7554@noindent
7555Note the third line of output.  The third line
7556in the original file looked like this:
7557
7558@example
7559Jean-Paul    555-2127     jeanpaul.campanorum@@nyu.edu     R
7560@end example
7561
7562The @samp{-} as part of the person's name was used as the field
7563separator, instead of the @samp{-} in the phone number that was
7564originally intended.  This demonstrates why you have to be careful in
7565choosing your field and record separators.
7566
7567@cindex Unix @command{awk} @subentry password files, field separators and
7568Perhaps the most common use of a single character as the field separator
7569occurs when processing the Unix system password file.  On many Unix
7570systems, each user has a separate entry in the system password file, with one
7571line per user.  The information in these lines is separated by colons.
7572The first field is the user's login name and the second is the user's
7573encrypted or shadow password.  (A shadow password is indicated by the
7574presence of a single @samp{x} in the second field.)  A password file
7575entry might look like this:
7576
7577@cindex Robbins @subentry Arnold
7578@example
7579arnold:x:2076:10:Arnold Robbins:/home/arnold:/bin/bash
7580@end example
7581
7582The following program searches the system password file and prints
7583the entries for users whose full name is not indicated:
7584
7585@example
7586awk -F: '$5 == ""' /etc/passwd
7587@end example
7588
7589@node Full Line Fields
7590@subsection Making the Full Line Be a Single Field
7591
7592Occasionally, it's useful to treat the whole input line as a
7593single field.  This can be done easily and portably simply by
7594setting @code{FS} to @code{"\n"} (a newline):@footnote{Thanks to
7595Andrew Schorr for this tip.}
7596
7597@example
7598awk -F'\n' '@var{program}' @var{files @dots{}}
7599@end example
7600
7601@noindent
7602When you do this, @code{$1} is the same as @code{$0}.
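
A minimal check of this, using a made-up input line:

@example
$ @kbd{echo 'a b c' | awk -F'\n' '@{ print NF; print $1 @}'}
@print{} 1
@print{} a b c
@end example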
7603
7604@sidebar Changing @code{FS} Does Not Affect the Fields
7605
7606@cindex POSIX @command{awk} @subentry field separators and
7607@cindex field separator @subentry POSIX and
7608According to the POSIX standard, @command{awk} is supposed to behave
7609as if each record is split into fields at the time it is read.
7610In particular, this means that if you change the value of @code{FS}
7611after a record is read, the values of the fields (i.e., how they were split)
7612should reflect the old value of @code{FS}, not the new one.
7613
7614@cindex dark corner @subentry field separators
7615@cindex @command{sed} utility
7616@cindex stream editors
7617However, many older implementations of @command{awk} do not work this way.  Instead,
7618they defer splitting the fields until a field is actually
7619referenced.  The fields are split
7620using the @emph{current} value of @code{FS}!
7621@value{DARKCORNER}
7622This behavior can be difficult
7623to diagnose. The following example illustrates the difference
7624between the two methods:
7625
7626@example
7627sed 1q /etc/passwd | awk '@{ FS = ":" ; print $1 @}'
7628@end example
7629
7630@noindent
7631which usually prints:
7632
7633@example
7634root
7635@end example
7636
7637@noindent
7638on an incorrect implementation of @command{awk}, while @command{gawk}
7639prints the full first line of the file, something like:
7640
7641@example
7642root:x:0:0:Root:/:
7643@end example
7644
7645(The @command{sed}@footnote{The @command{sed} utility is a ``stream editor.''
7646Its behavior is also defined by the POSIX standard.}
7647command prints just the first line of @file{/etc/passwd}.)
7648@end sidebar
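
If a new value of @code{FS} must apply to the record that has already
been read, one possible approach is to assign @code{$0} to itself after
changing @code{FS}; the assignment causes the record to be split again.
The following sketch uses made-up input and shows the first field before
and after the record is resplit by @command{gawk}:

@example
$ @kbd{echo 'a:b:c' | gawk '@{ FS = ":"; print $1; $0 = $0; print $1 @}'}
@print{} a:b:c
@print{} a
@end example

@noindent
When the separator is known in advance, setting @code{FS} in a
@code{BEGIN} rule avoids the issue entirely.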
7649
7650@node Field Splitting Summary
7651@subsection Field-Splitting Summary
7652
7653It is important to remember that when you assign a string constant
7654as the value of @code{FS}, it undergoes normal @command{awk} string
7655processing.  For example, with Unix @command{awk} and @command{gawk},
7656the assignment @samp{FS = "\.."} assigns the character string @code{".."}
7657to @code{FS} (the backslash is stripped).  This creates a regexp meaning
7658``fields are separated by occurrences of any two characters.''
7659If instead you want fields to be separated by a literal period followed
7660by any single character, use @samp{FS = "\\.."}.
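
As a small sketch of the doubled-backslash form (the input string is made
up for illustration), each period together with the character following
it acts as one separator:

@example
$ @kbd{echo 'a.xb.yc' | awk 'BEGIN @{ FS = "\\.." @} @{ print $1, $2, $3 @}'}
@print{} a b c
@end example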
7661
7662The following list summarizes how fields are split, based on the value
7663of @code{FS} (@samp{==} means ``is equal to''):
7664
7665@table @code
7666@item FS == " "
7667Fields are separated by runs of whitespace.  Leading and trailing
7668whitespace are ignored.  This is the default.
7669
7670@item FS == @var{any other single character}
7671Fields are separated by each occurrence of the character.  Multiple
7672successive occurrences delimit empty fields, as do leading and
7673trailing occurrences.
7674The character can even be a regexp metacharacter; it does not need
7675to be escaped.
7676
7677@item FS == @var{regexp}
7678Fields are separated by occurrences of characters that match @var{regexp}.
7679Leading and trailing matches of @var{regexp} delimit empty fields.
7680
7681@item FS == ""
7682Each individual character in the record becomes a separate field.
7683(This is a common extension; it is not specified by the POSIX standard.)
7684@end table
7685
7686@sidebar @code{FS} and @code{IGNORECASE}
7687
7688The @code{IGNORECASE} variable
7689(@pxref{User-modified})
7690affects field splitting @emph{only} when the value of @code{FS} is a regexp.
7691It has no effect when @code{FS} is a single character, even if
7692that character is a letter.  Thus, in the following code:
7693
7694@example
7695FS = "c"
7696IGNORECASE = 1
7697$0 = "aCa"
7698print $1
7699@end example
7700
7701@noindent
7702The output is @samp{aCa}.  If you really want to split fields on an
7703alphabetic character while ignoring case, use a regexp that will
7704do it for you (e.g., @samp{FS = "[c]"}).  In this case, @code{IGNORECASE}
7705will take effect.
7706@end sidebar
7707
7708
7709@node Constant Size
7710@section Reading Fixed-Width Data
7711
7712@cindex data, fixed-width
7713@cindex fixed-width data
7714@cindex advanced features @subentry fixed-width data
7715
7716@c O'Reilly doesn't like it as a note the first thing in the section.
7717This @value{SECTION} discusses an advanced
7718feature of @command{gawk}.  If you are a novice @command{awk} user,
7719you might want to skip it on the first reading.
7720
7721@command{gawk} provides a facility for dealing with fixed-width fields
7722with no distinctive field separator. We discuss this feature in
7723the following @value{SUBSECTION}s.
7724
7725@menu
7726* Fixed width data::            Processing fixed-width data.
7727* Skipping intervening::        Skipping intervening fields.
7728* Allowing trailing data::      Capturing optional trailing data.
7729* Fields with fixed data::      Field values with fixed-width data.
7730@end menu
7731
7732@node Fixed width data
7733@subsection Processing Fixed-Width Data
7734
7735An example of fixed-width data would be the input for old Fortran programs
7736where numbers are run together, or the output of programs that did not
7737anticipate the use of their output as input for other programs.
7738
7739An example of the latter is a table where all the columns are lined up
7740by the use of a variable number of spaces and @emph{empty fields are
7741just spaces}.  Clearly, @command{awk}'s normal field splitting based
7742on @code{FS} does not work well in this case.  Although a portable
7743@command{awk} program can use a series of @code{substr()} calls on
7744@code{$0} (@pxref{String Functions}), this is awkward and inefficient
7745for a large number of fields.
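
As a sketch of the portable approach (the eight-digit date is made-up
data for illustration), each field has to be cut out of @code{$0} by
position and length:

@example
$ @kbd{echo 20240131 |}
> @kbd{awk '@{ print substr($0, 1, 4), substr($0, 5, 2), substr($0, 7, 2) @}'}
@print{} 2024 01 31
@end example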
7746
7747@cindex troubleshooting @subentry fatal errors @subentry field widths, specifying
7748@cindex @command{w} utility
7749@cindex @code{FIELDWIDTHS} variable
7750@cindex @command{gawk} @subentry @code{FIELDWIDTHS} variable in
7751The splitting of an input record into fixed-width fields is specified by
7752assigning a string containing space-separated numbers to the built-in
7753variable @code{FIELDWIDTHS}.  Each number specifies the width of the
7754field, @emph{including} columns between fields.  If you want to ignore
7755the columns between fields, you can specify the width as a separate
7756field that is subsequently ignored.  It is a fatal error to supply a
7757field width that has a negative value.
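
Before turning to a larger example, here is a minimal sketch of the idea.
The eight-digit date is made-up data; the three widths split it into
year, month, and day:

@example
$ @kbd{echo 20240131 |}
> @kbd{gawk 'BEGIN @{ FIELDWIDTHS = "4 2 2" @} @{ print $1, $2, $3 @}'}
@print{} 2024 01 31
@end example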
7758
7759The following data is the output of the Unix @command{w} utility.  It is useful
7760to illustrate the use of @code{FIELDWIDTHS}:
7761
7762@example
7763@group
7764 10:06pm  up 21 days, 14:04,  23 users
7765User     tty       login@  idle   JCPU   PCPU  what
7766hzuo     ttyV0     8:58pm            9      5  vi p24.tex
7767hzang    ttyV3     6:37pm    50                -csh
7768eklye    ttyV5     9:53pm            7      1  em thes.tex
7769dportein ttyV6     8:17pm  1:47                -csh
7770gierd    ttyD3    10:00pm     1                elm
7771dave     ttyD4     9:47pm            4      4  w
7772brent    ttyp0    26Jun91  4:46  26:46   4:41  bash
7773dave     ttyq4    26Jun9115days     46     46  wnewmail
7774@end group
7775@end example
7776
7777The following program takes this input, converts the idle time to
7778number of seconds, and prints out the first two fields and the calculated
7779idle time:
7780
7781@example
7782BEGIN  @{ FIELDWIDTHS = "9 6 10 6 7 7 35" @}
7783NR > 2 @{
7784    idle = $4
7785    sub(/^ +/, "", idle)   # strip leading spaces
7786    if (idle == "")
7787        idle = 0
7788    if (idle ~ /:/) @{      # hh:mm
7789        split(idle, t, ":")
7790        idle = t[1] * 60 + t[2]
7791    @}
7792    if (idle ~ /days/)
7793        idle *= 24 * 60 * 60
7794
7795    print $1, $2, idle
7796@}
7797@end example
7798
7799@quotation NOTE
7800The preceding program uses a number of @command{awk} features that
7801haven't been introduced yet.
7802@end quotation
7803
7804Running the program on the data produces the following results:
7805
7806@example
7807hzuo      ttyV0  0
7808hzang     ttyV3  50
7809eklye     ttyV5  0
7810dportein  ttyV6  107
7811gierd     ttyD3  1
7812dave      ttyD4  0
7813brent     ttyp0  286
7814dave      ttyq4  1296000
7815@end example
7816
7817Another (possibly more practical) example of fixed-width input data
7818is the input from a deck of balloting cards.  In some parts of
7819the United States, voters mark their choices by punching holes in computer
7820cards.  These cards are then processed to count the votes for any particular
7821candidate or on any particular issue.  Because a voter may choose not to
7822vote on some issue, any column on the card may be empty.  An @command{awk}
7823program for processing such data could use the @code{FIELDWIDTHS} feature
7824to simplify reading the data.  (Of course, getting @command{gawk} to run on
7825a system with card readers is another story!)
7826
7827@node Skipping intervening
7828@subsection Skipping Intervening Fields
7829
7830Starting in @value{PVERSION} 4.2, each field width may optionally be
7831preceded by a colon-separated value specifying the number of characters
7832to skip before the field starts.  Thus, the preceding program could be
7833rewritten to specify @code{FIELDWIDTHS} like so:
7834
7835@example
7836BEGIN  @{ FIELDWIDTHS = "8 1:5 4:7 6 1:6 1:6 2:33" @}
7837@end example
7838
This strips away some of the whitespace separating the fields. With such
a change, the program produces the following results:
7841
7842@example
hzuo     ttyV0 0
hzang    ttyV3 50
7844eklye    ttyV5 0
7845dportein ttyV6 107
7846gierd    ttyD3 1
7847dave     ttyD4 0
7848brent    ttyp0 286
7849dave     ttyq4 1296000
7850@end example
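
As a smaller, self-contained sketch of the skip syntax (it assumes
@command{gawk} 4.2 or later, and the input string is made up), the
following skips the two dashes between a pair of three-character fields:

@example
$ @kbd{echo 'AAA--BBB' |}
> @kbd{gawk 'BEGIN @{ FIELDWIDTHS = "3 2:3" @} @{ print $1, $2 @}'}
@print{} AAA BBB
@end example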
7851
7852@node Allowing trailing data
7853@subsection Capturing Optional Trailing Data
7854
7855There are times when fixed-width data may be followed by additional data
7856that has no fixed length.  Such data may or may not be present, but if
7857it is, it should be possible to get at it from an @command{awk} program.
7858
7859Starting with @value{PVERSION} 4.2, in order to provide a way to say ``anything
7860else in the record after the defined fields,'' @command{gawk}
7861allows you to add a final @samp{*} character to the value of
7862@code{FIELDWIDTHS}. There can only be one such character, and it must
7863be the final non-whitespace character in @code{FIELDWIDTHS}.
7864For example:
7865
7866@example
7867$ @kbd{cat fw.awk}                         @ii{Show the program}
7868@print{} BEGIN @{ FIELDWIDTHS = "2 2 *" @}
7869@print{} @{ print NF, $1, $2, $3 @}
7870$ @kbd{cat fw.in}                          @ii{Show sample input}
7871@print{} 1234abcdefghi
7872$ @kbd{gawk -f fw.awk fw.in}               @ii{Run the program}
7873@print{} 3 12 34 abcdefghi
7874@end example
7875
7876@node Fields with fixed data
7877@subsection Field Values With Fixed-Width Data
7878
7879So far, so good.  But what happens if there isn't as much data as there
7880should be based on the contents of @code{FIELDWIDTHS}? Or, what happens
7881if there is more data than expected?
7882
7883For many years, what happens in these cases was not well defined. Starting
7884with @value{PVERSION} 4.2, the rules are as follows:
7885
7886@table @asis
7887@item Enough data for some fields
For example, suppose @code{FIELDWIDTHS} is set to @code{"2 3 4"} and the
input record is @samp{aabbb}.  In this case, @code{NF} is set to two.

@item Not enough data for a field
For example, suppose @code{FIELDWIDTHS} is set to @code{"2 3 4"} and the
input record is @samp{aab}.  In this case, @code{NF} is set to two and
@code{$2} has the value @code{"b"}. The idea is that even though there
aren't as many characters as were expected, there are some, so the data
should be made available to the program.

@item Too much data
For example, suppose @code{FIELDWIDTHS} is set to @code{"2 3 4"} and the
input record is @samp{aabbbccccddd}.  In this case, @code{NF} is set to
three and the extra characters (@samp{ddd}) are ignored.  If you want
@command{gawk} to capture the extra characters, supply a final @samp{*}
in the value of @code{FIELDWIDTHS}.

@item Too much data, but with @samp{*} supplied
For example, suppose @code{FIELDWIDTHS} is set to @code{"2 3 4 *"} and the
input record is @samp{aabbbccccddd}.  In this case, @code{NF} is set to
four, and @code{$4} has the value @code{"ddd"}.
7909
7910@end table
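
The first three rules can be checked with a short sketch such as the
following, which assumes @command{gawk} 4.2 or later and feeds the three
sample records from the table through a single program that prints
@code{NF} and the last field:

@example
$ @kbd{printf 'aabbb\naab\naabbbccccddd\n' |}
> @kbd{gawk 'BEGIN @{ FIELDWIDTHS = "2 3 4" @} @{ print NF, $NF @}'}
@print{} 2 bbb
@print{} 2 b
@print{} 3 cccc
@end example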
7911
7912@node Splitting By Content
7913@section Defining Fields by Content
7914
7915@menu
7916* More CSV::                    More on CSV files.
7917* FS versus FPAT::              A subtle difference.
7918@end menu
7919
7920@c O'Reilly doesn't like it as a note the first thing in the section.
7921This @value{SECTION} discusses an advanced
7922feature of @command{gawk}.  If you are a novice @command{awk} user,
7923you might want to skip it on the first reading.
7924
7925@cindex advanced features @subentry specifying field content
7926Normally, when using @code{FS}, @command{gawk} defines the fields as the
7927parts of the record that occur in between each field separator. In other
7928words, @code{FS} defines what a field @emph{is not}, instead of what a field
7929@emph{is}.
7930However, there are times when you really want to define the fields by
7931what they are, and not by what they are not.
7932
7933@cindex CSV (comma separated values) data @subentry parsing with @code{FPAT}
7934@cindex Comma separated values (CSV) data @subentry parsing with @code{FPAT}
7935The most notorious such case
7936is so-called @dfn{comma-separated values} (CSV) data. Many spreadsheet programs,
7937for example, can export their data into text files, where each record is
7938terminated with a newline, and fields are separated by commas. If
7939commas only separated the data, there wouldn't be an issue. The problem comes when
7940one of the fields contains an @emph{embedded} comma.
7941In such cases, most programs embed the field in double quotes.@footnote{The
7942CSV format lacked a formal standard definition for many years.
7943@uref{http://www.ietf.org/rfc/rfc4180.txt, RFC 4180}
7944standardizes the most common practices.}
7945So, we might have data like this:
7946
7947@example
7948@c file eg/misc/addresses.csv
7949Robbins,Arnold,"1234 A Pretty Street, NE",MyTown,MyState,12345-6789,USA
7950@c endfile
7951@end example
7952
7953@cindex @command{gawk} @subentry @code{FPAT} variable in
7954@cindex @code{FPAT} variable
7955The @code{FPAT} variable offers a solution for cases like this.
7956The value of @code{FPAT} should be a string that provides a regular expression.
7957This regular expression describes the contents of each field.
7958
7959In the case of CSV data as presented here, each field is either ``anything that
7960is not a comma,'' or ``a double quote, anything that is not a double quote, and a
7961closing double quote.''  (There are more complicated definitions of CSV data,
7962treated shortly.)
7963If written as a regular expression constant
7964(@pxref{Regexp}),
7965we would have @code{/([^,]+)|("[^"]+")/}.
7966Writing this as a string requires us to escape the double quotes, leading to:
7967
7968@example
7969FPAT = "([^,]+)|(\"[^\"]+\")"
7970@end example
7971
7972Putting this to use, here is a simple program to parse the data:
7973
7974@example
7975@c file eg/misc/simple-csv.awk
7976@group
7977BEGIN @{
7978    FPAT = "([^,]+)|(\"[^\"]+\")"
7979@}
7980@end group
7981
7982@group
7983@{
7984    print "NF = ", NF
7985    for (i = 1; i <= NF; i++) @{
7986        printf("$%d = <%s>\n", i, $i)
7987    @}
7988@}
7989@end group
7990@c endfile
7991@end example
7992
7993When run, we get the following:
7994
7995@example
$ @kbd{gawk -f simple-csv.awk addresses.csv}
@print{} NF =  7
@print{} $1 = <Robbins>
@print{} $2 = <Arnold>
@print{} $3 = <"1234 A Pretty Street, NE">
@print{} $4 = <MyTown>
@print{} $5 = <MyState>
@print{} $6 = <12345-6789>
@print{} $7 = <USA>
8006
8007Note the embedded comma in the value of @code{$3}.
8008
8009A straightforward improvement when processing CSV data of this sort
8010would be to remove the quotes when they occur, with something like this:
8011
8012@example
8013if (substr($i, 1, 1) == "\"") @{
8014    len = length($i)
8015    $i = substr($i, 2, len - 2)    # Get text within the two quotes
8016@}
8017@end example
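
Placed in a complete program, the fragment might look like the following
sketch.  The loop and the final @code{print} are added here for
illustration only; they are not part of the example files:

@example
BEGIN @{ FPAT = "([^,]+)|(\"[^\"]+\")" @}
@{
    for (i = 1; i <= NF; i++) @{
        if (substr($i, 1, 1) == "\"")
            $i = substr($i, 2, length($i) - 2)   # drop the enclosing quotes
    @}
    print $3
@}
@end example

@noindent
Run on the sample address record shown earlier, this should print
@samp{1234 A Pretty Street, NE}, with the embedded comma intact and the
quotes removed.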
8018
8019@quotation NOTE
8020Some programs export CSV data that contains embedded newlines between
8021the double quotes.  @command{gawk} provides no way to deal with this.
8022Even though a formal specification for CSV data exists, there isn't much
8023more to be done;
8024the @code{FPAT} mechanism provides an elegant solution for the majority
8025of cases, and the @command{gawk} developers are satisfied with that.
8026@end quotation
8027
8028As written, the regexp used for @code{FPAT} requires that each field
8029contain at least one character.  A straightforward modification
8030(changing the first @samp{+} to @samp{*}) allows fields to be empty:
8031
8032@example
8033FPAT = "([^,]*)|(\"[^\"]+\")"
8034@end example
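
The difference can be seen with a record that has an empty middle field.
This is a sketch using made-up input and a recent @command{gawk}; the
first command uses the original pattern and the second uses the modified
one:

@example
$ @kbd{echo 'a,,c' |}
> @kbd{gawk 'BEGIN @{ FPAT = "([^,]+)|(\"[^\"]+\")" @} @{ print NF @}'}
@print{} 2
$ @kbd{echo 'a,,c' |}
> @kbd{gawk 'BEGIN @{ FPAT = "([^,]*)|(\"[^\"]+\")" @} @{ print NF @}'}
@print{} 3
@end example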
8035
8036@c 4/2015:
8037@c Consider use of FPAT = "([^,]*)|(\"[^\"]*\")"
8038@c (star in latter part of value) to allow quoted strings to be empty.
8039@c Per email from Ed Morton <mortoneccc@comcast.net>
8040@c
8041@c WONTFIX: 10/2020
8042@c This is too much work. FPAT and CSV files are very flaky and
8043@c fragile. Doing something like this is merely inviting trouble.
8044
8045As with @code{FS}, the @code{IGNORECASE} variable (@pxref{User-modified})
8046affects field splitting with @code{FPAT}.
8047
8048Assigning a value to @code{FPAT} overrides field splitting
8049with @code{FS} and with @code{FIELDWIDTHS}.
8050
8051Finally, the @code{patsplit()} function makes the same functionality
8052available for splitting regular strings (@pxref{String Functions}).
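
As a brief sketch (the string being split and the array name @code{f}
are made up for illustration), the same pattern can be handed to
@code{patsplit()} directly:

@example
$ @kbd{gawk 'BEGIN @{}
>     @kbd{n = patsplit("a,\"b,c\",d", f, "([^,]+)|(\"[^\"]+\")")}
>     @kbd{print n, f[2]}
> @kbd{@}'}
@print{} 3 "b,c"
@end example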
8053
8054@node More CSV
8055@subsection More on CSV Files
8056
8057@cindex Collado, Manuel
Manuel Collado notes that in addition to commas, a CSV field can also
contain quotes, which have to be escaped by doubling them. The previously
described regexps fail to accept quoted fields with both commas and
quotes inside. He suggests that the simplest @code{FPAT} expression that
recognizes this kind of field is @code{/([^,]*)|("([^"]|"")+")/}. He
provides the following input data to test these variants:
8064
8065@example
8066@c file eg/misc/sample.csv
8067p,"q,r",s
8068p,"q""r",s
8069p,"q,""r",s
8070p,"",s
8071p,,s
8072@c endfile
8073@end example
8074
8075@noindent
8076And here is his test program:
8077
8078@example
8079@c file eg/misc/test-csv.awk
8080@group
8081BEGIN @{
8082     fp[0] = "([^,]+)|(\"[^\"]+\")"
8083     fp[1] = "([^,]*)|(\"[^\"]+\")"
8084     fp[2] = "([^,]*)|(\"([^\"]|\"\")+\")"
8085     FPAT = fp[fpat+0]
8086@}
8087@end group
8088
8089@group
8090@{
8091     print "<" $0 ">"
8092     printf("NF = %s ", NF)
8093     for (i = 1; i <= NF; i++) @{
8094         printf("<%s>", $i)
8095     @}
8096     print ""
8097@}
8098@end group
8099@c endfile
8100@end example
8101
8102When run on the third variant, it produces:
8103
8104@example
8105$ @kbd{gawk -v fpat=2 -f test-csv.awk sample.csv}
8106@print{} <p,"q,r",s>
8107@print{} NF = 3 <p><"q,r"><s>
8108@print{} <p,"q""r",s>
8109@print{} NF = 3 <p><"q""r"><s>
8110@print{} <p,"q,""r",s>
8111@print{} NF = 3 <p><"q,""r"><s>
8112@print{} <p,"",s>
8113@print{} NF = 3 <p><""><s>
8114@print{} <p,,s>
8115@print{} NF = 3 <p><><s>
8116@end example
8117
8118@cindex Collado, Manuel
8119@cindex @code{CSVMODE} library for @command{gawk}
8120@cindex CSV (comma separated values) data @subentry parsing with @code{CSVMODE} library
@cindex Comma separated values (CSV) data @subentry parsing with @code{CSVMODE} library
8122In general, using @code{FPAT} to do your own CSV parsing is like having
8123a bed with a blanket that's not quite big enough. There's always a corner
8124that isn't covered. We recommend, instead, that you use Manuel Collado's
8125@uref{http://mcollado.z15.es/xgawk/, @code{CSVMODE} library for @command{gawk}}.
8126
8127@node FS versus FPAT
8128@subsection @code{FS} Versus @code{FPAT}: A Subtle Difference
8129
8130As we discussed earlier, @code{FS} describes the data between fields (``what fields are not'')
8131and @code{FPAT} describes the fields themselves (``what fields are'').
8132This leads to a subtle difference in how fields are found when using regexps as the value
8133for @code{FS} or @code{FPAT}.
8134
8135In order to distinguish one field from another, there must be a non-empty separator between
8136each field.  This makes intuitive sense---otherwise one could not distinguish fields from
8137separators.
8138
8139Thus, regular expression matching as done when splitting fields with @code{FS} is not
8140allowed to match the null string; it must always match at least one character, in order
8141to be able to proceed through the entire record.
8142
8143On the other hand, regular expression matching with @code{FPAT} can match the null
8144string, and the non-matching intervening characters function as the separators.
8145
8146This same difference is reflected in how matching is done with the @code{split()}
8147and @code{patsplit()} functions (@pxref{String Functions}).
8148
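As a small experiment (a sketch only; the sample input is made up for
illustration), the following two commands split the same record, first
by describing the separators and then by describing the fields.  In the
second command, the pattern @samp{[^,]*} is allowed to match the null
string between the two adjacent commas.  Comparing the two values of
@code{NF} shows whether the empty field is counted in each case:

@example
$ @kbd{echo 'a,b,,d' | gawk 'BEGIN @{ FS = "," @} @{ print NF @}'}
$ @kbd{echo 'a,b,,d' | gawk 'BEGIN @{ FPAT = "[^,]*" @} @{ print NF @}'}
@end example
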
8149@node Testing field creation
8150@section Checking How @command{gawk} Is Splitting Records
8151
8152@cindex @command{gawk} @subentry splitting fields and
8153As we've seen, @command{gawk} provides three independent methods to split
8154input records into fields.  The mechanism used is based on which of the
8155three variables---@code{FS}, @code{FIELDWIDTHS}, or @code{FPAT}---was
8156last assigned to. In addition, an API input parser may choose to override
8157the record parsing mechanism; please refer to @ref{Input Parsers} for
8158further information about this feature.
8159
8160To restore normal field splitting after using @code{FIELDWIDTHS}
8161and/or @code{FPAT}, simply assign a value to @code{FS}.
8162You can use @samp{FS = FS} to do this,
8163without having to know the current value of @code{FS}.
8164
8165In order to tell which kind of field splitting is in effect,
8166use @code{PROCINFO["FS"]} (@pxref{Auto-set}).
8167The value is @code{"FS"} if regular field splitting is being used,
8168@code{"FIELDWIDTHS"} if fixed-width field splitting is being used,
8169or @code{"FPAT"} if content-based field splitting is being used:
8170
8171@example
8172if (PROCINFO["FS"] == "FS")
8173    @var{regular field splitting} @dots{}
8174else if (PROCINFO["FS"] == "FIELDWIDTHS")
8175    @var{fixed-width field splitting} @dots{}
8176else if (PROCINFO["FS"] == "FPAT")
8177    @var{content-based field splitting} @dots{}
8178else
8179    @var{API input parser field splitting} @dots{} @ii{(advanced feature)}
8180@end example
8181
8182This information is useful when writing a function that needs to
8183temporarily change @code{FS} or @code{FIELDWIDTHS}, read some records,
8184and then restore the original settings (@pxref{Passwd Functions} for an
8185example of such a function).
8186
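The save-and-restore idea can be sketched as follows.  This is only an
outline under stated assumptions: the function name
@code{count_colon_fields()} and its colon-separated input file are
hypothetical, and the case of an API input parser is not handled:

@example
# count_colon_fields --- hypothetical helper: return the number of
# colon-separated fields on the first line of FILE, then put back
# the caller's field-splitting mode

function count_colon_fields(file,    mode, saved_fs, saved_rec, n)
@{
    mode = PROCINFO["FS"]     # "FS", "FIELDWIDTHS", or "FPAT"
    saved_fs = FS
    saved_rec = $0

    FS = ":"                  # switches to regular field splitting
    n = -1                    # returned if FILE cannot be read
    if ((getline < file) > 0)
        n = NF
    close(file)

    FS = saved_fs             # put back the old separator
    if (mode == "FIELDWIDTHS")
        FIELDWIDTHS = FIELDWIDTHS   # reselect fixed-width splitting
    else if (mode == "FPAT")
        FPAT = FPAT                 # reselect content-based splitting
    $0 = saved_rec            # resplit the caller's record
    return n
@}
@end example

Assigning a variable to itself is enough to reselect the corresponding
mechanism, just as @samp{FS = FS} is for regular field splitting.
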
8187@node Multiple Line
8188@section Multiple-Line Records
8189
8190@cindex multiple-line records
8191@cindex records @subentry multiline
8192@cindex input @subentry multiline records
8193@cindex files @subentry reading @subentry multiline records
8194@cindex input, files @seeentry{input files}
8195In some databases, a single line cannot conveniently hold all the
8196information in one entry.  In such cases, you can use multiline
8197records.  The first step in doing this is to choose your data format.
8198
8199@cindex record separators @subentry with multiline records
8200One technique is to use an unusual character or string to separate
8201records.  For example, you could use the formfeed character (written
8202@samp{\f} in @command{awk}, as in C) to separate them, making each record
8203a page of the file.  To do this, just set the variable @code{RS} to
8204@code{"\f"} (a string containing the formfeed character).  Any
8205other character could equally well be used, as long as it won't be part
8206of the data in a record.
8207
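For instance, a minimal sketch along the following lines treats each
formfeed-separated page as one record and reports its size (the output
format is only an illustration):

@example
BEGIN @{ RS = "\f" @}

@{
    # $0 holds one whole page, embedded newlines included
    printf "page %d: %d characters\n", NR, length($0)
@}
@end example
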
8208@cindex @code{RS} variable @subentry multiline records and
8209Another technique is to have blank lines separate records.  By a special
8210dispensation, an empty string as the value of @code{RS} indicates that
8211records are separated by one or more blank lines.  When @code{RS} is set
8212to the empty string, each record always ends at the first blank line
8213encountered.  The next record doesn't start until the first nonblank
8214line that follows.  No matter how many blank lines appear in a row, they
8215all act as one record separator.
8216(Blank lines must be completely empty; lines that contain only
8217whitespace do not count.)
8218
8219@cindex leftmost longest match
8220@cindex matching @subentry leftmost longest
8221You can achieve the same effect as @samp{RS = ""} by assigning the
8222string @code{"\n\n+"} to @code{RS}. This regexp matches the newline
8223at the end of the record and one or more blank lines after the record.
8224In addition, a regular expression always matches the longest possible
8225sequence when there is a choice
8226(@pxref{Leftmost Longest}).
8227So, the next record doesn't start until
8228the first nonblank line that follows---no matter how many blank lines
8229appear in a row, they are considered one record separator.
8230
8231@cindex dark corner @subentry multiline records
8232However, there is an important difference between @samp{RS = ""} and
8233@samp{RS = "\n\n+"}. In the first case, leading newlines in the input
8234@value{DF} are ignored, and if a file ends without extra blank lines
8235after the last record, the final newline is removed from the record.
8236In the second case, this special processing is not done.
8237@value{DARKCORNER}
8238
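One way to observe the difference is to feed the same input, including
leading blank lines, to both settings and print each record between
angle brackets.  This is just an experiment to try; the input is made up
for illustration:

@example
$ @kbd{printf '\n\nA\nB\n\nC\n' | gawk 'BEGIN @{ RS = "" @}}
>      @kbd{@{ printf "%d: <%s>\n", NR, $0 @}'}
$ @kbd{printf '\n\nA\nB\n\nC\n' | gawk 'BEGIN @{ RS = "\n\n+" @}}
>      @kbd{@{ printf "%d: <%s>\n", NR, $0 @}'}
@end example

The record numbers and the placement of the closing angle bracket make
any leading empty record or retained final newline easy to spot.
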
8239@cindex field separator @subentry in multiline records
8240@cindex @code{FS} variable @subentry in multiline records
8241Now that the input is separated into records, the second step is to
8242separate the fields in the records.  One way to do this is to divide each
8243of the lines into fields in the normal manner.  This happens by default
8244as the result of a special feature.  When @code{RS} is set to the empty
8245string @emph{and} @code{FS} is set to a single character,
8246the newline character @emph{always} acts as a field separator.
8247This is in addition to whatever field separations result from
8248@code{FS}.
8249
8250@quotation NOTE
8251When @code{FS} is the null string (@code{""})
8252or a regexp, this special feature of @code{RS} does not apply.
8253It does apply to the default field separator of a single space:
8254@samp{FS = @w{" "}}.
8255
8256Note that language in the POSIX specification implies that
8257this special feature should apply when @code{FS} is a regexp.
8258However, Unix @command{awk} has never behaved that way, nor has
8259@command{gawk}. This is essentially a bug in POSIX.
8260@c Noted as of 4/2019; working to get the standard fixed.
8261@end quotation
8262
8263The original motivation for this special exception was probably to provide
8264useful behavior in the default case (i.e., @code{FS} is equal
8265to @w{@code{" "}}).  This feature can be a problem if you really don't
8266want the newline character to separate fields, because there is no way to
8267prevent it.  However, you can work around this by using the @code{split()}
8268function to break up the record manually
8269(@pxref{String Functions}).
8270If you have a single-character field separator, you can work around
8271the special feature in a different way, by making @code{FS} into a
8272regexp for that single character.  For example, if the field
8273separator is a percent character, instead of
8274@samp{FS = "%"}, use @samp{FS = "[%]"}.
8275
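A minimal sketch of this workaround, assuming records separated by blank
lines and fields separated by percent characters, might look like this:

@example
# Records are separated by blank lines; fields by percent signs only.
# Writing FS as a regexp keeps newlines from also separating fields.
BEGIN @{ RS = "" ; FS = "[%]" @}

@{ print NF @}
@end example
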
8276Another way to separate fields is to
8277put each field on a separate line: to do this, just set the
8278variable @code{FS} to the string @code{"\n"}.
8279(This single-character separator matches a single newline.)
8280A practical example of a @value{DF} organized this way might be a mailing
8281list, where blank lines separate the entries.  Consider a mailing
8282list in a file named @file{addresses}, which looks like this:
8283
8284@example
8285Jane Doe
8286123 Main Street
8287Anywhere, SE 12345-6789
8288
8289John Smith
8290456 Tree-lined Avenue
8291Smallville, MW 98765-4321
8292@dots{}
8293@end example
8294
8295@noindent
8296A simple program to process this file is as follows:
8297
8298@example
8299# addrs.awk --- simple mailing list program
8300
8301# Records are separated by blank lines.
8302# Each line is one field.
8303BEGIN @{ RS = "" ; FS = "\n" @}
8304
8305@{
8306      print "Name is:", $1
8307      print "Address is:", $2
8308      print "City and State are:", $3
8309      print ""
8310@}
8311@end example
8312
8313Running the program produces the following output:
8314
8315@example
8316$ @kbd{awk -f addrs.awk addresses}
8317@print{} Name is: Jane Doe
8318@print{} Address is: 123 Main Street
8319@print{} City and State are: Anywhere, SE 12345-6789
8320@print{}
8321@print{} Name is: John Smith
8322@print{} Address is: 456 Tree-lined Avenue
8323@print{} City and State are: Smallville, MW 98765-4321
8324@print{}
8325@dots{}
8326@end example
8327
8328@xref{Labels Program} for a more realistic program dealing with
8329address lists.  The following list summarizes how records are split,
8330based on the value of
8331@ifinfo
8332@code{RS}.
8333(@samp{==} means ``is equal to.'')
8334@end ifinfo
8335@ifnotinfo
8336@code{RS}:
8337@end ifnotinfo
8338
8339@table @code
8340@item RS == "\n"
8341Records are separated by the newline character (@samp{\n}).  In effect,
8342every line in the @value{DF} is a separate record, including blank lines.
8343This is the default.
8344
8345@item RS == @var{any single character}
8346Records are separated by each occurrence of the character.  Multiple
8347successive occurrences delimit empty records.
8348
8349@item RS == ""
8350Records are separated by runs of blank lines.
8351When @code{FS} is a single character, then
8352the newline character
8353always serves as a field separator, in addition to whatever value
8354@code{FS} may have. Leading and trailing newlines in a file are ignored.
8355
8356@item RS == @var{regexp}
8357Records are separated by occurrences of characters that match @var{regexp}.
8358Leading and trailing matches of @var{regexp} delimit empty records.
8359(This is a @command{gawk} extension; it is not specified by the
8360POSIX standard.)
8361@end table
8362
8363@cindex @command{gawk} @subentry @code{RT} variable in
8364@cindex @code{RT} variable
8365@cindex differences in @command{awk} and @command{gawk} @subentry @code{RS}/@code{RT} variables
8366If not in compatibility mode (@pxref{Options}), @command{gawk} sets
8367@code{RT} to the input text that matched the value specified by @code{RS}.
8368But if the input file ended without any text that matches @code{RS},
8369then @command{gawk} sets @code{RT} to the null string.
8370
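As an illustration (the input here is made up), the following command
uses a regexp for @code{RS} and prints each record together with the
text that terminated it.  The last record has no terminator, so
@code{RT} is the null string:

@example
$ @kbd{printf 'a,b;c' | gawk 'BEGIN @{ RS = "[,;]" @}}
>      @kbd{@{ printf "<%s> ended by <%s>\n", $0, RT @}'}
@print{} <a> ended by <,>
@print{} <b> ended by <;>
@print{} <c> ended by <>
@end example
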
8371@node Getline
8372@section Explicit Input with @code{getline}
8373
8374@cindex @code{getline} command @subentry explicit input with
8375@cindex input @subentry explicit
8376So far we have been getting our input data from @command{awk}'s main
8377input stream---either the standard input (usually your keyboard, sometimes
8378the output from another program) or the
8379files specified on the command line.  The @command{awk} language has a
8380special built-in command called @code{getline} that
8381can be used to read input under your explicit control.
8382
8383The @code{getline} command is used in several different ways and should
8384@emph{not} be used by beginners.
8385The examples that follow the explanation of the @code{getline} command
8386include material that has not been covered yet.  Therefore, come back
8387and study the @code{getline} command @emph{after} you have reviewed the
8388rest of
8389@ifinfo
8390this @value{DOCUMENT}
8391@end ifinfo
8392@ifhtml
8393this @value{DOCUMENT}
8394@end ifhtml
8395@ifnotinfo
8396@ifnothtml
8397Parts I and II
8398@end ifnothtml
8399@end ifnotinfo
8400and have a good knowledge of how @command{awk} works.
8401
8402@cindex @command{gawk} @subentry @code{ERRNO} variable in
8403@cindex @code{ERRNO} variable @subentry with @command{getline} command
8404@cindex differences in @command{awk} and @command{gawk} @subentry @code{getline} command
8405@cindex @code{getline} command @subentry return values
8406@cindex @option{--sandbox} option @subentry input redirection with @code{getline}
8407
8408The @code{getline} command returns 1 if it finds a record and 0 if
8409it encounters the end of the file.  If there is some error in getting
8410a record, such as a file that cannot be opened, then @code{getline}
8411returns @minus{}1.  In this case, @command{gawk} sets the variable
8412@code{ERRNO} to a string describing the error that occurred.
8413
8414If @code{ERRNO} indicates that the I/O operation may be
8415retried, and @code{PROCINFO["@var{input}", "RETRY"]} is set,
8416then @code{getline} returns @minus{}2
8417instead of @minus{}1, and further calls to @code{getline}
8418may be attempted.  @xref{Retrying Input} for further information about
8419this feature.
8420
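The return values can be checked along the following lines.  This is a
minimal sketch: the @value{FN} @file{data.txt} is a placeholder, and the
use of @code{ERRNO} assumes @command{gawk}:

@example
# Count the records in a file, telling end-of-file apart from an error
BEGIN @{
    file = "data.txt"         # placeholder name
    while ((status = (getline line < file)) > 0)
        count++
    if (status < 0)           # the file could not be read
        print "cannot read " file ": " ERRNO > "/dev/stderr"
    else                      # status is 0: end of file
        print count + 0, "record(s) read"
    close(file)
@}
@end example

Testing for @samp{> 0} rather than for a nonzero value matters here; a
loop written as @samp{while (getline line < file)} would never end if
the file could not be opened, because @minus{}1 counts as true.
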
8421In the following examples, @var{command} stands for a string value that
8422represents a shell command.
8423
8424@quotation NOTE
8425When @option{--sandbox} is specified (@pxref{Options}),
8426reading lines from files, pipes, and coprocesses is disabled.
8427@end quotation
8428
8429@menu
8430* Plain Getline::               Using @code{getline} with no arguments.
8431* Getline/Variable::            Using @code{getline} into a variable.
8432* Getline/File::                Using @code{getline} from a file.
8433* Getline/Variable/File::       Using @code{getline} into a variable from a
8434                                file.
8435* Getline/Pipe::                Using @code{getline} from a pipe.
8436* Getline/Variable/Pipe::       Using @code{getline} into a variable from a
8437                                pipe.
8438* Getline/Coprocess::           Using @code{getline} from a coprocess.
8439* Getline/Variable/Coprocess::  Using @code{getline} into a variable from a
8440                                coprocess.
8441* Getline Notes::               Important things to know about @code{getline}.
8442* Getline Summary::             Summary of @code{getline} Variants.
8443@end menu
8444
8445@node Plain Getline
8446@subsection Using @code{getline} with No Arguments
8447
8448The @code{getline} command can be used without arguments to read input
8449from the current input file.  All it does in this case is read the next
8450input record and split it up into fields.  This is useful if you've
8451finished processing the current record, but want to do some special
8452processing on the next record @emph{right now}.  For example:
8453
8454@c 6/2019: Thanks to Mark Krauze <daburashka@ya.ru> for suggested
8455@c improvements (the inner while loop).
8456@example
8457# Remove text between /* and */, inclusive
8458@{
8459    while ((start = index($0, "/*")) != 0) @{
8460        out = substr($0, 1, start - 1)  # leading part of the string
8461        rest = substr($0, start + 2)    # ... */ ...
8462        while ((end = index(rest, "*/")) == 0) @{  # is */ in trailing part?
8463            # get more text
8464            if (getline <= 0) @{
8465                print("unexpected EOF or error:", ERRNO) > "/dev/stderr"
8466                exit
8467            @}
8468            # build up the line using string concatenation
8469            rest = rest $0
8470        @}
8471        rest = substr(rest, end + 2)  # remove comment
8472        # build up the output line using string concatenation
8473        $0 = out rest
8474    @}
8475    print $0
8476@}
8477@end example
8478
8479This @command{awk} program deletes C-style comments (@samp{/* @dots{}
8480*/}) from the input.
8481It uses a number of features we haven't covered yet, including
8482string concatenation
8483(@pxref{Concatenation})
8484and the @code{index()} and @code{substr()} built-in
8485functions
8486(@pxref{String Functions}).
8487By replacing the @samp{print $0} with other
8488statements, you could perform more complicated processing on the
8489decommented input, such as searching for matches of a regular
8490expression.
8491
8492Here is some sample input:
8493
8494@example
8495mon/*comment*/key
8496rab/*commen
8497t*/bit
8498horse /*comment*/more text
8499part 1 /*comment*/part 2 /*comment*/part 3
8500no comment
8501@end example
8502
8503When run, the output is:
8504
8505@example
8506$ @kbd{awk -f strip_comments.awk example_text}
8507@print{} monkey
8508@print{} rabbit
8509@print{} horse more text
8510@print{} part 1 part 2 part 3
8511@print{} no comment
8512@end example
8513
8514This form of the @code{getline} command sets @code{NF},
8515@code{NR}, @code{FNR}, @code{RT}, and the value of @code{$0}.
8516
8517@quotation NOTE
8518The new value of @code{$0} is used to test
8519the patterns of any subsequent rules.  The original value
8520of @code{$0} that triggered the rule that executed @code{getline}
8521is lost.
8522By contrast, the @code{next} statement reads a new record
8523but immediately begins processing it normally, starting with the first
8524rule in the program.  @xref{Next Statement}.
8525@end quotation
8526
8527@node Getline/Variable
8528@subsection Using @code{getline} into a Variable
8529@cindex @code{getline} command @subentry into a variable
8530@cindex variables @subentry @code{getline} command into, using
8531
8532You can use @samp{getline @var{var}} to read the next record from
8533@command{awk}'s input into the variable @var{var}.  No other processing is
8534done.
8535For example, suppose the next line is a comment or a special string,
8536and you want to read it without triggering
8537any rules.  This form of @code{getline} allows you to read that line
8538and store it in a variable so that the main
8539read-a-line-and-check-each-rule loop of @command{awk} never sees it.
8540The following example swaps every two lines of input:
8541
8542@example
8543@group
8544@{
8545     if ((getline tmp) > 0) @{
8546          print tmp
8547          print $0
8548     @} else
8549          print $0
8550@}
8551@end group
8552@end example
8553
8554@noindent
8555It takes the following list:
8556
8557@example
8558wan
8559tew
8560free
8561phore
8562@end example
8563
8564@noindent
8565and produces these results:
8566
8567@example
8568tew
8569wan
8570phore
8571free
8572@end example
8573
8574The @code{getline} command used in this way sets only the variables
8575@code{NR}, @code{FNR}, and @code{RT} (and, of course, @var{var}).
8576The record is not
8577split into fields, so the values of the fields (including @code{$0}) and
8578the value of @code{NF} do not change.
8579
8580@node Getline/File
8581@subsection Using @code{getline} from a File
8582
8583@cindex @code{getline} command @subentry from a file
8584@cindex input redirection
8585@cindex redirection @subentry of input
8586@cindex @code{<} (left angle bracket) @subentry @code{<} operator (I/O)
8587@cindex left angle bracket (@code{<}) @subentry @code{<} operator (I/O)
8588@cindex operators @subentry input/output
8589Use @samp{getline < @var{file}} to read the next record from @var{file}.
8590Here, @var{file} is a string-valued expression that
8591specifies the @value{FN}.  @samp{< @var{file}} is called a @dfn{redirection}
8592because it directs input to come from a different place.
8593For example, the following
8594program reads its input record from the file @file{secondary.input} when it
8595encounters a first field with a value equal to 10 in the current input
8596file:
8597
8598@example
8599@{
8600    if ($1 == 10) @{
8601         getline < "secondary.input"
8602         print
8603    @} else
8604         print
8605@}
8606@end example
8607
8608Because the main input stream is not used, the values of @code{NR} and
8609@code{FNR} are not changed. However, the record it reads is split into fields in
8610the normal manner, so the values of @code{$0} and the other fields are
8611changed, resulting in a new value of @code{NF}.
8612@code{RT} is also set.
8613
8614@cindex POSIX @command{awk} @subentry @code{<} operator and
8615@c Thanks to Paul Eggert for initial wording here
8616According to POSIX, @samp{getline < @var{expression}} is ambiguous if
8617@var{expression} contains unparenthesized operators other than
8618@samp{$}; for example, @samp{getline < dir "/" file} is ambiguous
8619because the concatenation operator (not discussed yet; @pxref{Concatenation})
8620is not parenthesized.  You should write it as @samp{getline < (dir "/" file)} if
8621you want your program to be portable to all @command{awk} implementations.
8622
8623@node Getline/Variable/File
8624@subsection Using @code{getline} into a Variable from a File
8625@cindex variables @subentry @code{getline} command into, using
8626
8627Use @samp{getline @var{var} < @var{file}} to read input
8628from the file
8629@var{file}, and put it in the variable @var{var}.  As earlier, @var{file}
8630is a string-valued expression that specifies the file from which to read.
8631
8632In this version of @code{getline}, none of the predefined variables are
8633changed and the record is not split into fields.  The only variable
8634changed is @var{var}.@footnote{This is not quite true. @code{RT} could
8635be changed if @code{RS} is a regular expression.}
8636For example, the following program copies all the input files to the
8637output, except for records that say @w{@samp{@@include @var{filename}}}.
8638Such a record is replaced by the contents of the file
8639@var{filename}:
8640
8641@example
8642@{
8643     if (NF == 2 && $1 == "@@include") @{
8644          while ((getline line < $2) > 0)
8645               print line
8646          close($2)
8647     @} else
8648          print
8649@}
8650@end example
8651
8652Note here how the name of the extra input file is not built into
8653the program; it is taken directly from the data, specifically from the second field on
8654the @code{@@include} line.
8655
8656The @code{close()} function is called to ensure that if two identical
8657@code{@@include} lines appear in the input, the entire specified file is
8658included twice.
8659@xref{Close Files And Pipes}.
8660
8661One deficiency of this program is that it does not process nested
8662@code{@@include} statements
8663(i.e., @code{@@include} statements in included files)
8664the way a true macro preprocessor would.
8665@xref{Igawk Program} for a program
8666that does handle nested @code{@@include} statements.
8667
8668@node Getline/Pipe
8669@subsection Using @code{getline} from a Pipe
8670
8671@c From private email, dated October 2, 1988. Used by permission, March 2013.
8672@cindex Kernighan, Brian @subentry quotes
8673@quotation
8674@i{Omniscience has much to recommend it.
8675Failing that, attention to details would be useful.}
8676@author Brian Kernighan
8677@end quotation
8678
8679@cindex @code{|} (vertical bar) @subentry @code{|} operator (I/O)
8680@cindex vertical bar (@code{|}) @subentry @code{|} operator (I/O)
8681@cindex input pipeline
8682@cindex pipe @subentry input
8683@cindex operators @subentry input/output
8684The output of a command can also be piped into @code{getline}, using
8685@samp{@var{command} | getline}.  In
8686this case, the string @var{command} is run as a shell command and its output
8687is piped into @command{awk} to be used as input.  This form of @code{getline}
8688reads one record at a time from the pipe.
8689For example, the following program copies its input to its output, except for
8690lines that begin with @samp{@@execute}, which are replaced by the output
8691produced by running the rest of the line as a shell command:
8692
8693@example
8694@group
8695@{
8696     if ($1 == "@@execute") @{
8697          tmp = substr($0, 10)        # Remove "@@execute"
8698          while ((tmp | getline) > 0)
8699               print
8700          close(tmp)
8701     @} else
8702          print
8703@}
8704@end group
8705@end example
8706
8707@noindent
8708The @code{close()} function is called to ensure that if two identical
8709@samp{@@execute} lines appear in the input, the command is run for
8710each one.
8711@ifnottex
8712@ifnotdocbook
8713@xref{Close Files And Pipes}.
8714@end ifnotdocbook
8715@end ifnottex
8716@c This example is unrealistic, since you could just use system
8717Given the input:
8718
8719@example
8720foo
8721bar
8722baz
8723@@execute who
8724bletch
8725@end example
8726
8727@noindent
8728the program might produce:
8729
8730@cindex Robbins @subentry Bill
8731@cindex Robbins @subentry Miriam
8732@cindex Robbins @subentry Arnold
8733@example
8734foo
8735bar
8736baz
8737arnold     ttyv0   Jul 13 14:22
8738miriam     ttyp0   Jul 13 14:23     (murphy:0)
8739bill       ttyp1   Jul 13 14:23     (murphy:0)
8740bletch
8741@end example
8742
8743@noindent
8744Notice that this program ran the command @command{who} and printed the result.
8745(If you try this program yourself, you will of course get different results,
8746depending upon who is logged in on your system.)
8747
8748This variation of @code{getline} splits the record into fields, sets the
8749value of @code{NF}, and recomputes the value of @code{$0}.  The values of
8750@code{NR} and @code{FNR} are not changed.
8751@code{RT} is set.
8752
8753@cindex POSIX @command{awk} @subentry @code{|} I/O operator and
8754@c Thanks to Paul Eggert for initial wording here
8755According to POSIX, @samp{@var{expression} | getline} is ambiguous if
8756@var{expression} contains unparenthesized operators other than
8757@samp{$}---for example, @samp{@w{"echo "} "date" | getline} is ambiguous
8758because the concatenation operator is not parenthesized.  You should
8759write it as @samp{(@w{"echo "} "date") | getline} if you want your program
8760to be portable to all @command{awk} implementations.
8761
8762@cindex Brian Kernighan's @command{awk}
8763@cindex @command{mawk} utility
8764@quotation NOTE
8765Unfortunately, @command{gawk} has not been consistent in its treatment
8766of a construct like @samp{@w{"echo "} "date" | getline}.
8767Most versions, including the current version, treat it as
8768@samp{@w{("echo "} "date") | getline}.
8769(This is also how BWK @command{awk} behaves.)
8770Some versions instead treat it as
8771@samp{@w{"echo "} ("date" | getline)}.
8772(This is how @command{mawk} behaves.)
8773In short, @emph{always} use explicit parentheses, and then you won't
8774have to worry.
8775@end quotation
8776
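One way to sidestep the question entirely is to build the command in a
variable first.  The following sketch does so; the directory name and
the choice of @command{ls} are only placeholders:

@example
BEGIN @{
    dir = "/tmp"                    # placeholder directory
    cmd = "ls " dir                 # concatenate first ...
    while ((cmd | getline) > 0)     # ... then pipe
        print "found", $0
    close(cmd)
@}
@end example

Keeping the command in a variable also guarantees that @code{close()}
is called with exactly the same string that opened the pipe.
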
8777@node Getline/Variable/Pipe
8778@subsection Using @code{getline} into a Variable from a Pipe
8779@cindex variables @subentry @code{getline} command into, using
8780
8781When you use @samp{@var{command} | getline @var{var}}, the
8782output of @var{command} is sent through a pipe to
8783@code{getline} and into the variable @var{var}.  For example, the
8784following program reads the current date and time into the variable
8785@code{current_time}, using the @command{date} utility, and then
8786prints it:
8787
8788@example
8789BEGIN @{
8790     "date" | getline current_time
8791     close("date")
8792     print "Report printed on " current_time
8793@}
8794@end example
8795
In this version of @code{getline}, the record is not split into fields
and, apart from @code{RT}, none of the predefined variables are changed.
8798
8799@ifinfo
8800@c Thanks to Paul Eggert for initial wording here
8801According to POSIX, @samp{@var{expression} | getline @var{var}} is ambiguous if
8802@var{expression} contains unparenthesized operators other than
8803@samp{$}; for example, @samp{@w{"echo "} "date" | getline @var{var}} is ambiguous
8804because the concatenation operator is not parenthesized. You should
8805write it as @samp{(@w{"echo "} "date") | getline @var{var}} if you want your
8806program to be portable to other @command{awk} implementations.
8807@end ifinfo
8808
8809@node Getline/Coprocess
8810@subsection Using @code{getline} from a Coprocess
8811@cindex coprocesses @subentry @code{getline} from
8812@cindex @code{getline} command @subentry coprocesses, using from
8813@cindex @code{|} (vertical bar) @subentry @code{|&} operator (I/O)
8814@cindex vertical bar (@code{|}) @subentry @code{|&} operator (I/O)
8815@cindex operators @subentry input/output
8816@cindex differences in @command{awk} and @command{gawk} @subentry input/output operators
8817
8818Reading input into @code{getline} from a pipe is a one-way operation.
8819The command that is started with @samp{@var{command} | getline} only
8820sends data @emph{to} your @command{awk} program.
8821
8822On occasion, you might want to send data to another program
8823for processing and then read the results back.
8824@command{gawk} allows you to start a @dfn{coprocess}, with which two-way
8825communications are possible.  This is done with the @samp{|&}
8826operator.
8827Typically, you write data to the coprocess first and then
8828read the results back, as shown in the following:
8829
8830@example
8831print "@var{some query}" |& "db_server"
8832"db_server" |& getline
8833@end example
8834
8835@noindent
8836which sends a query to @command{db_server} and then reads the results.
8837
8838The values of @code{NR} and
8839@code{FNR} are not changed,
8840because the main input stream is not used.
8841However, the record is split into fields in
8842the normal manner, thus changing the values of @code{$0}, of the other fields,
8843and of @code{NF} and @code{RT}.
8844
8845Coprocesses are an advanced feature. They are discussed here only because
8846this is the @value{SECTION} on @code{getline}.
8847@xref{Two-way I/O},
8848where coprocesses are discussed in more detail.
8849
8850@node Getline/Variable/Coprocess
8851@subsection Using @code{getline} into a Variable from a Coprocess
8852@cindex variables @subentry @code{getline} command into, using
8853
8854When you use @samp{@var{command} |& getline @var{var}}, the output from
8855the coprocess @var{command} is sent through a two-way pipe to @code{getline}
8856and into the variable @var{var}.
8857
In this version of @code{getline}, the record is not split into fields
and, apart from @code{RT}, none of the predefined variables are changed.
The only other variable changed is @var{var}.
8862
8863@ifinfo
8864Coprocesses are an advanced feature. They are discussed here only because
8865this is the @value{SECTION} on @code{getline}.
8866@xref{Two-way I/O},
8867where coprocesses are discussed in more detail.
8868@end ifinfo
8869
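The write-then-read pattern can be sketched more fully as follows.  This
is only an illustration, assuming that a @command{sort} utility is
available; the data are made up:

@example
BEGIN @{
    command = "LC_ALL=C sort"
    print "pear"  |& command
    print "apple" |& command
    close(command, "to")      # send end-of-file to sort
    while ((command |& getline line) > 0)
        print "got", line
    close(command)            # finish with the coprocess
@}
@end example

Closing the @code{"to"} end is needed here because @command{sort}
cannot produce any output until it has seen all of its input.
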
8870@node Getline Notes
8871@subsection Points to Remember About @code{getline}
8872Here are some miscellaneous points about @code{getline} that
8873you should bear in mind:
8874
8875@itemize @value{BULLET}
8876@item
8877When @code{getline} changes the value of @code{$0} and @code{NF},
8878@command{awk} does @emph{not} automatically jump to the start of the
8879program and start testing the new record against every pattern.
8880However, the new record is tested against any subsequent rules.
8881
8882@cindex differences in @command{awk} and @command{gawk} @subentry implementation limitations
8883@cindex implementation issues, @command{gawk} @subentry limits
8884@cindex @command{awk} @subentry implementations @subentry limits
8885@cindex @command{gawk} @subentry implementation issues @subentry limits
8886@item
8887Some very old @command{awk} implementations limit the number of pipelines that an @command{awk}
8888program may have open to just one.  In @command{gawk}, there is no such limit.
8889You can open as many pipelines (and coprocesses) as the underlying operating
8890system permits.
8891
8892@cindex side effects @subentry @code{FILENAME} variable
8893@cindex @code{FILENAME} variable @subentry @code{getline}, setting with
8894@cindex dark corner @subentry @code{FILENAME} variable
8895@cindex @code{getline} command @subentry @code{FILENAME} variable and
8896@cindex @code{BEGIN} pattern @subentry @code{getline} and
8897@item
8898An interesting side effect occurs if you use @code{getline} without a
8899redirection inside a @code{BEGIN} rule. Because an unredirected @code{getline}
8900reads from the command-line @value{DF}s, the first @code{getline} command
8901causes @command{awk} to set the value of @code{FILENAME}. Normally,
8902@code{FILENAME} does not have a value inside @code{BEGIN} rules, because you
8903have not yet started to process the command-line @value{DF}s.
8904@value{DARKCORNER}
8905(See @ref{BEGIN/END};
8906also @pxref{Auto-set}.)
8907
8908@item
8909Using @code{FILENAME} with @code{getline}
8910(@samp{getline < FILENAME})
8911is likely to be a source of
8912confusion.  @command{awk} opens a separate input stream from the
8913current input file.  However, by not using a variable, @code{$0}
8914and @code{NF} are still updated.  If you're doing this, it's
8915probably by accident, and you should reconsider what it is you're
8916trying to accomplish.
8917
8918@item
8919@ifdocbook
8920The next @value{SECTION}
8921@end ifdocbook
8922@ifnotdocbook
8923@ref{Getline Summary},
8924@end ifnotdocbook
8925presents a table summarizing the
8926@code{getline} variants and which variables they can affect.
8927It is worth noting that those variants that do not use redirection
8928can cause @code{FILENAME} to be updated if they cause
8929@command{awk} to start reading a new input file.
8930
8931@item
8932@cindex Moore, Duncan
8933If the variable being assigned is an expression with side effects,
8934different versions of @command{awk} behave differently upon encountering
8935end-of-file.  Some versions don't evaluate the expression; many versions
8936(including @command{gawk}) do.  Here is an example, courtesy of Duncan Moore:
8937
8938@ignore
8939Date: Sun, 01 Apr 2012 11:49:33 +0100
8940From: Duncan Moore <duncan.moore@@gmx.com>
8941@end ignore
8942
8943@example
8944BEGIN @{
8945    system("echo 1 > f")
8946    while ((getline a[++c] < "f") > 0) @{ @}
8947    print c
8948@}
8949@end example
8950
8951@noindent
8952Here, the side effect is the @samp{++c}.  Is @code{c} incremented if
8953end-of-file is encountered before the element in @code{a} is assigned?
8954
8955@command{gawk} treats @code{getline} like a function call, and evaluates
8956the expression @samp{a[++c]} before attempting to read from @file{f}.
8957However, some versions of @command{awk} only evaluate the expression once they
8958know that there is a string value to be assigned.
8959@end itemize
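
The first point in the preceding list can be seen in a small sketch.
The input and the three rules are made up purely for illustration;
the second rule calls @code{getline}, so the record @samp{two} is only
ever tested against the rule that follows it:

@example
$ @kbd{printf 'one\ntwo\n' | awk '/two/ @{ print "rule 1 saw", $0 @}}
> @kbd{/one/ @{ getline @}}
> @kbd{/two/ @{ print "rule 3 saw", $0 @}'}
@print{} rule 3 saw two
@end example

@noindent
The first rule never sees @samp{two}, because that record was consumed
by @code{getline} and not by the main input loop.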
8960
8961@node Getline Summary
8962@subsection Summary of @code{getline} Variants
8963@cindex @code{getline} command @subentry variants
8964
8965@ref{table-getline-variants}
8966summarizes the eight variants of @code{getline},
8967listing which predefined variables are set by each one,
8968and whether the variant is standard or a @command{gawk} extension.
8969Note: for each variant, @command{gawk} sets the @code{RT} predefined variable.
8970
8971@float Table,table-getline-variants
8972@caption{@code{getline} variants and what they set}
8973@multitable @columnfractions .33 .38 .27
8974@headitem Variant @tab Effect @tab @command{awk} / @command{gawk}
8975@item @code{getline} @tab Sets @code{$0}, @code{NF}, @code{FNR}, @code{NR}, and @code{RT} @tab @command{awk}
8976@item @code{getline} @var{var} @tab Sets @var{var}, @code{FNR}, @code{NR}, and @code{RT} @tab @command{awk}
8977@item @code{getline <} @var{file} @tab Sets @code{$0}, @code{NF}, and @code{RT} @tab @command{awk}
8978@item @code{getline @var{var} < @var{file}} @tab Sets @var{var} and @code{RT} @tab @command{awk}
8979@item @var{command} @code{| getline} @tab Sets @code{$0}, @code{NF}, and @code{RT} @tab @command{awk}
8980@item @var{command} @code{| getline} @var{var} @tab Sets @var{var} and @code{RT} @tab @command{awk}
8981@item @var{command} @code{|& getline} @tab Sets @code{$0}, @code{NF}, and @code{RT} @tab @command{gawk}
8982@item @var{command} @code{|& getline} @var{var} @tab Sets @var{var} and @code{RT} @tab @command{gawk}
8983@end multitable
8984@end float
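
As a rough illustration of the table, the following sketch reads one
record from a command and one from the main input, printing @code{NR}
after each.  The command @samp{echo x} and the variable names @code{v}
and @code{w} are arbitrary choices for this example:

@example
$ @kbd{printf 'a\nb\n' | awk 'NR == 1 @{}
>     @kbd{"echo x" | getline v; print NR, v}
>     @kbd{getline w; print NR, w}
> @kbd{@}'}
@print{} 1 x
@print{} 2 b
@end example

@noindent
Reading from the pipe leaves @code{NR} alone, whereas the unredirected
@samp{getline w} increments it.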
8985
8986@node Read Timeout
8987@section Reading Input with a Timeout
8988@cindex timeout, reading input
8989
8990@cindex differences in @command{awk} and @command{gawk} @subentry read timeouts
8991This @value{SECTION} describes a feature that is specific to @command{gawk}.
8992
8993You may specify a timeout in milliseconds for reading input from the keyboard,
8994a pipe, or two-way communication, including TCP/IP sockets. This can be done
8995on a per-input, per-command, or per-connection basis, by setting a special
8996element in the @code{PROCINFO} array (@pxref{Auto-set}):
8997
8998@example
PROCINFO["@var{input_name}", "READ_TIMEOUT"] = @var{timeout in milliseconds}
9000@end example
9001
9002When set, this causes @command{gawk} to time out and return failure
9003if no data is available to read within the specified timeout period.
9004For example, a TCP client can decide to give up on receiving
9005any response from the server after a certain amount of time:
9006
9007@example
9008@group
9009Service = "/inet/tcp/0/localhost/daytime"
9010PROCINFO[Service, "READ_TIMEOUT"] = 100
9011if ((Service |& getline) > 0)
9012    print $0
9013else if (ERRNO != "")
9014    print ERRNO
9015@end group
9016@end example
9017
9018Here is how to read interactively from the user@footnote{This assumes
9019that standard input is the keyboard.} without waiting
9020for more than five seconds:
9021
9022@example
9023PROCINFO["/dev/stdin", "READ_TIMEOUT"] = 5000
9024while ((getline < "/dev/stdin") > 0)
9025    print $0
9026@end example
9027
9028@command{gawk} terminates the read operation if input does not
9029arrive after waiting for the timeout period, returns failure,
9030and sets @code{ERRNO} to an appropriate string value.
9031A negative or zero value for the timeout is the same as specifying
9032no timeout at all.
9033
9034A timeout can also be set for reading from the keyboard in the implicit
9035loop that reads input records and matches them against patterns,
9036like so:
9037
9038@example
9039$ @kbd{gawk 'BEGIN @{ PROCINFO["-", "READ_TIMEOUT"] = 5000 @}}
9040> @kbd{@{ print "You entered: " $0 @}'}
9041@kbd{gawk}
9042@print{} You entered: gawk
9043@end example
9044
9045In this case, failure to respond within five seconds results in the following
9046error message:
9047
9048@example
9049@error{} gawk: cmd. line:2: (FILENAME=- FNR=1) fatal: error reading input file `-': Connection timed out
9050@end example
9051
9052The timeout can be set or changed at any time, and will take effect on the
9053next attempt to read from the input device. In the following example,
we start with a timeout value of one second and reduce it by
one-tenth of a second after each record. Once the value reaches zero,
there is no timeout at all, and we wait indefinitely
for the input to arrive:
9057
9058@example
9059PROCINFO[Service, "READ_TIMEOUT"] = 1000
9060while ((Service |& getline) > 0) @{
9061    print $0
9062    PROCINFO[Service, "READ_TIMEOUT"] -= 100
9063@}
9064@end example
9065
9066@quotation NOTE
9067You should not assume that the read operation will block
9068exactly after the tenth record has been printed. It is possible that
9069@command{gawk} will read and buffer more than one record's
worth of data the first time. Because of this, changing the timeout
value as in the preceding example is not very useful.
9072@end quotation
9073
9074@cindex @env{GAWK_READ_TIMEOUT} environment variable
9075@cindex environment variables @subentry @env{GAWK_READ_TIMEOUT}
9076If the @code{PROCINFO} element is not present and the
9077@env{GAWK_READ_TIMEOUT} environment variable exists,
9078@command{gawk} uses its value to initialize the timeout value.
9079The exclusive use of the environment variable to specify timeout
9080has the disadvantage of not being able to control it
9081on a per-command or per-connection basis.
9082
9083@command{gawk} considers a timeout event to be an error even though
9084the attempt to read from the underlying device may
9085succeed in a later attempt. This is a limitation, and it also
9086means that you cannot use this to multiplex input from
9087two or more sources.  @xref{Retrying Input} for a way to enable
9088later I/O attempts to succeed.
9089
9090Assigning a timeout value prevents read operations from
9091blocking indefinitely. But bear in mind that there are other ways
9092@command{gawk} can stall waiting for an input device to be ready.
9093A network client can sometimes take a long time to establish
9094a connection before it can start reading any data,
9095or the attempt to open a FIFO special file for reading can block
9096indefinitely until some other process opens it for writing.
9097
9098@node Retrying Input
9099@section Retrying Reads After Certain Input Errors
9100@cindex retrying input
9101
9102@cindex differences in @command{awk} and @command{gawk} @subentry retrying input
9103This @value{SECTION} describes a feature that is specific to @command{gawk}.
9104
9105When @command{gawk} encounters an error while reading input, by
9106default @code{getline} returns @minus{}1, and subsequent attempts to
9107read from that file result in an end-of-file indication.  However, you
9108may optionally instruct @command{gawk} to allow I/O to be retried when
9109certain errors are encountered by setting a special element in
9110the @code{PROCINFO} array (@pxref{Auto-set}):
9111
9112@example
9113PROCINFO["@var{input_name}", "RETRY"] = 1
9114@end example
9115
9116When this element exists, @command{gawk} checks the value of the system
9117(C language)
9118@code{errno} variable when an I/O error occurs.  If @code{errno} indicates
9119a subsequent I/O attempt may succeed, @code{getline} instead returns
9120@minus{}2 and
9121further calls to @code{getline} may succeed.  This applies to the @code{errno}
9122values @code{EAGAIN}, @code{EWOULDBLOCK}, @code{EINTR}, or @code{ETIMEDOUT}.
9123
9124This feature is useful in conjunction with
9125@code{PROCINFO["@var{input_name}", "READ_TIMEOUT"]} or situations where a file
9126descriptor has been configured to behave in a non-blocking fashion.
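
The following is a minimal sketch of how the two @code{PROCINFO}
elements might be combined.  The command @samp{sleep 1; echo done}
merely stands in for a slow data source, and the 200-millisecond
timeout and the limit of 20 retries are arbitrary values chosen for
illustration:

@example
BEGIN @{
    cmd = "sleep 1; echo done"           # stand-in for a slow command
    PROCINFO[cmd, "READ_TIMEOUT"] = 200  # milliseconds
    PROCINFO[cmd, "RETRY"] = 1
    while ((rc = (cmd | getline line)) != 0) @{
        if (rc > 0)
            print line
        else if (rc == -2 && ++tries <= 20)
            continue                     # timed out; try again
        else @{
            print "giving up:", ERRNO > "/dev/stderr"
            break
        @}
    @}
    close(cmd)
@}
@end example

@noindent
Each timed-out read returns @minus{}2 instead of @minus{}1, so the loop
can keep polling until the data arrive or the retry limit is reached.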
9127
9128@node Command-line directories
9129@section Directories on the Command Line
9130@cindex differences in @command{awk} and @command{gawk} @subentry command-line directories
9131@cindex directories @subentry command-line
9132@cindex command line @subentry directories on
9133
9134According to the POSIX standard, files named on the @command{awk}
9135command line must be text files; it is a fatal error if they are not.
9136Most versions of @command{awk} treat a directory on the command line as
9137a fatal error.
9138
9139By default, @command{gawk} produces a warning for a directory on the
9140command line, but otherwise ignores it.  This makes it easier to use
9141shell wildcards with your @command{awk} program:
9142
9143@example
9144$ @kbd{gawk -f whizprog.awk *}        @ii{Directories could kill this program}
9145@end example
9146
9147If either of the @option{--posix}
9148or @option{--traditional} options is given, then @command{gawk} reverts
9149to treating a directory on the command line as a fatal error.
9150
9151@xref{Extension Sample Readdir} for a way to treat directories
9152as usable data from an @command{awk} program.
9153
9154@node Input Summary
9155@section Summary
9156
9157@itemize @value{BULLET}
9158@item
9159Input is split into records based on the value of @code{RS}.
9160The possibilities are as follows:
9161
9162@multitable @columnfractions .25 .35 .40
9163@headitem Value of @code{RS} @tab Records are split on @dots{} @tab @command{awk} / @command{gawk}
9164@item Any single character @tab That character @tab @command{awk}
9165@item The empty string (@code{""}) @tab Runs of two or more newlines @tab @command{awk}
9166@item A regexp @tab Text that matches the regexp @tab @command{gawk}
9167@end multitable
9168
9169@item
9170@code{FNR} indicates how many records have been read from the current input file;
9171@code{NR} indicates how many records have been read in total.
9172
9173@item
9174@command{gawk} sets @code{RT} to the text matched by @code{RS}.
9175
9176@item
9177After splitting the input into records, @command{awk} further splits
9178the records into individual fields, named @code{$1}, @code{$2}, and so
9179on. @code{$0} is the whole record, and @code{NF} indicates how many
fields there are.  By default, fields are separated by runs of
whitespace characters.
9182
9183@item
9184Fields may be referenced using a variable, as in @code{$NF}.  Fields
9185may also be assigned values, which causes the value of @code{$0} to be
9186recomputed when it is later referenced. Assigning to a field with a number
9187greater than @code{NF} creates the field and rebuilds the record, using
9188@code{OFS} to separate the fields.  Incrementing @code{NF} does the same
9189thing. Decrementing @code{NF} throws away fields and rebuilds the record.
9190
9191@item
9192Field splitting is more complicated than record splitting:
9193
9194@multitable @columnfractions .40 .40 .20
9195@headitem Field separator value @tab Fields are split @dots{} @tab @command{awk} / @command{gawk}
9196@item @code{FS == " "} @tab On runs of whitespace @tab @command{awk}
9197@item @code{FS == @var{any single character}} @tab On that character @tab @command{awk}
9198@item @code{FS == @var{regexp}} @tab On text matching the regexp @tab @command{awk}
9199@item @code{FS == ""}  @tab Such that each individual character is a separate field @tab @command{gawk}
9200@item @code{FIELDWIDTHS == @var{list of columns}} @tab Based on character position @tab @command{gawk}
9201@item @code{FPAT == @var{regexp}} @tab On the text surrounding text matching the regexp @tab @command{gawk}
9202@end multitable
9203
9204@item
9205Using @samp{FS = "\n"} causes the entire record to be a single field
9206(assuming that newlines separate records).
9207
9208@item
9209@code{FS} may be set from the command line using the @option{-F} option.
9210This can also be done using command-line variable assignment.
9211
9212@item
9213Use @code{PROCINFO["FS"]} to see how fields are being split.
9214
9215@item
9216Use @code{getline} in its various forms to read additional records
9217from the default input stream, from a file, or from a pipe or coprocess.
9218
9219@item
9220Use @code{PROCINFO[@var{file}, "READ_TIMEOUT"]} to cause reads to time out
9221for @var{file}.
9222
9223@cindex POSIX mode
9224@item
Directories on the command line are fatal for standard @command{awk};
@command{gawk} warns about them and otherwise ignores them, unless
@option{--posix} or @option{--traditional} is in effect.
9227
9228@end itemize
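
As a quick check of the point about rebuilding records, here is a small
sketch.  The input line and the choice of @samp{-} for @code{OFS} are
arbitrary, picked only to make the separators visible:

@example
$ @kbd{echo 'a b c' | awk 'BEGIN @{ OFS = "-" @} @{ $5 = "e"; print; print NF @}'}
@print{} a-b-c--e
@print{} 5
$ @kbd{echo 'a b c' | awk 'BEGIN @{ OFS = "-" @} @{ NF = 2; print @}'}
@print{} a-b
@end example

@noindent
In the first command, the empty fourth field accounts for the doubled
separator in the rebuilt record.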
9229
9230@c EXCLUDE START
9231@node Input Exercises
9232@section Exercises
9233
9234@enumerate
9235@item
9236Using the @code{FIELDWIDTHS} variable (@pxref{Constant Size}),
9237write a program to read election data, where each record represents
9238one voter's votes.  Come up with a way to define which columns are
9239associated with each ballot item, and print the total votes,
9240including abstentions, for each item.
9241
9242@end enumerate
9243@c EXCLUDE END
9244
9245@node Printing
9246@chapter Printing Output
9247
9248@cindex printing
9249@cindex output, printing @seeentry{printing}
9250One of the most common programming actions is to @dfn{print}, or output,
9251some or all of the input.  Use the @code{print} statement
9252for simple output, and the @code{printf} statement
9253for fancier formatting.
9254The @code{print} statement is not limited when
9255computing @emph{which} values to print. However, with two exceptions,
9256you cannot specify @emph{how} to print them---how many
9257columns, whether to use exponential notation or not, and so on.
9258(For the exceptions, @pxref{Output Separators} and
9259@ref{OFMT}.)
9260For printing with specifications, you need the @code{printf} statement
9261(@pxref{Printf}).
9262
9263@cindex @code{print} statement
9264@cindex @code{printf} statement
9265Besides basic and formatted printing, this @value{CHAPTER}
9266also covers I/O redirections to files and pipes, introduces
9267the special @value{FN}s that @command{gawk} processes internally,
9268and discusses the @code{close()} built-in function.
9269
9270@menu
9271* Print::                       The @code{print} statement.
9272* Print Examples::              Simple examples of @code{print} statements.
9273* Output Separators::           The output separators and how to change them.
9274* OFMT::                        Controlling Numeric Output With @code{print}.
9275* Printf::                      The @code{printf} statement.
9276* Redirection::                 How to redirect output to multiple files and
9277                                pipes.
9278* Special FD::                  Special files for I/O.
9279* Special Files::               File name interpretation in @command{gawk}.
9280                                @command{gawk} allows access to inherited file
9281                                descriptors.
9282* Close Files And Pipes::       Closing Input and Output Files and Pipes.
9283* Nonfatal::                    Enabling Nonfatal Output.
9284* Output Summary::              Output summary.
9285* Output Exercises::            Exercises.
9286@end menu
9287
9288@node Print
9289@section The @code{print} Statement
9290
9291Use the @code{print} statement to produce output with simple, standardized
9292formatting.  You specify only the strings or numbers to print, in a
9293list separated by commas.  They are output, separated by single spaces,
9294followed by a newline.  The statement looks like this:
9295
9296@example
9297print @var{item1}, @var{item2}, @dots{}
9298@end example
9299
9300@noindent
9301The entire list of items may be optionally enclosed in parentheses.  The
9302parentheses are necessary if any of the item expressions uses the @samp{>}
9303relational operator; otherwise it could be confused with an output redirection
9304(@pxref{Redirection}).
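
For instance, in the following sketch (the numbers are arbitrary),
enclosing the whole list in parentheses makes each @samp{>} a
comparison and not a redirection:

@example
$ @kbd{awk 'BEGIN @{ print (2 > 1, 1 > 2) @}'}
@print{} 1 0
@end example

@noindent
Without the parentheses, a statement such as @samp{print 2 > 1} would
instead write @samp{2} to a file named @file{1}.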
9305
9306The items to print can be constant strings or numbers, fields of the
9307current record (such as @code{$1}), variables, or any @command{awk}
9308expression.  Numeric values are converted to strings and then printed.
9309
9310@cindex records @subentry printing
9311@cindex lines @subentry blank, printing
9312@cindex text, printing
9313The simple statement @samp{print} with no items is equivalent to
9314@samp{print $0}: it prints the entire current record.  To print a blank
9315line, use @samp{print ""}.
9316To print a fixed piece of text, use a string constant, such as
9317@w{@code{"Don't Panic"}}, as one item.  If you forget to use the
9318double-quote characters, your text is taken as an @command{awk}
9319expression, and you will probably get an error.  Keep in mind that a
9320space is printed between any two items.
9321
9322Note that the @code{print} statement is a statement and not an
9323expression---you can't use it in the pattern part of a
9324pattern--action statement, for example.
9325
9326@node Print Examples
9327@section @code{print} Statement Examples
9328
9329Each @code{print} statement makes at least one line of output.  However, it
9330isn't limited to only one line.  If an item value is a string containing a
9331newline, the newline is output along with the rest of the string.  A
9332single @code{print} statement can make any number of lines this way.
9333
9334@cindex newlines @subentry printing
9335The following is an example of printing a string that contains embedded
9336@ifinfo
9337newlines
9338(the @samp{\n} is an escape sequence, used to represent the newline
9339character; @pxref{Escape Sequences}):
9340@end ifinfo
9341@ifhtml
9342newlines
9343(the @samp{\n} is an escape sequence, used to represent the newline
9344character; @pxref{Escape Sequences}):
9345@end ifhtml
9346@ifnotinfo
9347@ifnothtml
9348newlines:
9349@end ifnothtml
9350@end ifnotinfo
9351
9352@example
9353@group
9354$ @kbd{awk 'BEGIN @{ print "line one\nline two\nline three" @}'}
9355@print{} line one
9356@print{} line two
9357@print{} line three
9358@end group
9359@end example
9360
9361@cindex fields @subentry printing
9362The next example, which is run on the @file{inventory-shipped} file,
9363prints the first two fields of each input record, with a space between
9364them:
9365
9366@example
9367$ @kbd{awk '@{ print $1, $2 @}' inventory-shipped}
9368@print{} Jan 13
9369@print{} Feb 15
9370@print{} Mar 15
9371@dots{}
9372@end example
9373
9374@cindex @code{print} statement @subentry commas, omitting
9375@cindex troubleshooting @subentry @code{print} statement, omitting commas
9376A common mistake in using the @code{print} statement is to omit the comma
9377between two items.  This often has the effect of making the items run
9378together in the output, with no space.  The reason for this is that
9379juxtaposing two string expressions in @command{awk} means to concatenate
9380them.  Here is the same program, without the comma:
9381
9382@example
9383$ @kbd{awk '@{ print $1 $2 @}' inventory-shipped}
9384@print{} Jan13
9385@print{} Feb15
9386@print{} Mar15
9387@dots{}
9388@end example
9389
9390@cindex @code{BEGIN} pattern @subentry headings, adding
9391To someone unfamiliar with the @file{inventory-shipped} file, neither
9392example's output makes much sense.  A heading line at the beginning
9393would make it clearer.  Let's add some headings to our table of months
9394(@code{$1}) and green crates shipped (@code{$2}).  We do this using
9395a @code{BEGIN} rule (@pxref{BEGIN/END}) so that the headings are only
9396printed once:
9397
9398@example
9399awk 'BEGIN @{  print "Month Crates"
9400              print "----- ------" @}
9401           @{  print $1, $2 @}' inventory-shipped
9402@end example
9403
9404@noindent
9405When run, the program prints the following:
9406
9407@example
9408Month Crates
9409----- ------
9410Jan 13
9411Feb 15
9412Mar 15
9413@dots{}
9414@end example
9415
9416@noindent
9417The only problem, however, is that the headings and the table data
9418don't line up!  We can fix this by printing some spaces between the
9419two fields:
9420
9421@example
9422@group
9423awk 'BEGIN @{ print "Month Crates"
9424             print "----- ------" @}
9425           @{ print $1, "     ", $2 @}' inventory-shipped
9426@end group
9427@end example
9428
9429@cindex @code{printf} statement @subentry columns, aligning
9430@cindex columns @subentry aligning
9431Lining up columns this way can get pretty
9432complicated when there are many columns to fix.  Counting spaces for two
9433or three columns is simple, but any more than this can take up
9434a lot of time. This is why the @code{printf} statement was
9435created (@pxref{Printf});
9436one of its specialties is lining up columns of data.
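
As a brief preview, the same table might be aligned with @code{printf}
roughly as follows.  This is only a sketch; the widths 5 and 6 are
assumptions chosen to match the lengths of the two headings:

@example
$ @kbd{awk 'BEGIN @{ printf "%-5s %6s\n", "Month", "Crates" @}}
>            @kbd{@{ printf "%-5s %6d\n", $1, $2 @}' inventory-shipped}
@print{} Month Crates
@print{} Jan       13
@print{} Feb       15
@print{} Mar       15
@dots{}
@end example

@noindent
The @samp{%-5s} left-justifies the month in five columns, and
@samp{%6d} right-justifies the count in six.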
9437
9438@cindex line continuations @subentry in @code{print} statement
9439@cindex @code{print} statement @subentry line continuations and
9440@quotation NOTE
9441You can continue either a @code{print} or
9442@code{printf} statement simply by putting a newline after any comma
9443(@pxref{Statements/Lines}).
9444@end quotation
9445
9446@node Output Separators
9447@section Output Separators
9448
9449@cindex @code{OFS} variable
9450As mentioned previously, a @code{print} statement contains a list
9451of items separated by commas.  In the output, the items are normally
9452separated by single spaces.  However, this doesn't need to be the case;
9453a single space is simply the default.  Any string of
9454characters may be used as the @dfn{output field separator} by setting the
9455predefined variable @code{OFS}.  The initial value of this variable
9456is the string @w{@code{" "}} (i.e., a single space).
9457
9458The output from an entire @code{print} statement is called an @dfn{output
9459record}.  Each @code{print} statement outputs one output record, and
9460then outputs a string called the @dfn{output record separator} (or
9461@code{ORS}).  The initial value of @code{ORS} is the string @code{"\n"}
9462(i.e., a newline character).  Thus, each @code{print} statement normally
9463makes a separate line.
9464
9465@cindex output @subentry records
9466@cindex output record separator @seeentry{@code{ORS} variable}
9467@cindex @code{ORS} variable
9468@cindex @code{BEGIN} pattern @subentry @code{OFS}/@code{ORS} variables, assigning values to
9469In order to change how output fields and records are separated, assign
9470new values to the variables @code{OFS} and @code{ORS}.  The usual
9471place to do this is in the @code{BEGIN} rule
9472(@pxref{BEGIN/END}), so
9473that it happens before any input is processed.  It can also be done
9474with assignments on the command line, before the names of the input
9475files, or using the @option{-v} command-line option
9476(@pxref{Options}).
9477The following example prints the first and second fields of each input
record, separated by a semicolon, with a blank line added after each
output line:
9480
9481
9482@example
9483$ @kbd{awk 'BEGIN @{ OFS = ";"; ORS = "\n\n" @}}
9484>            @kbd{@{ print $1, $2 @}' mail-list}
9485@print{} Amelia;555-5553
9486@print{}
9487@print{} Anthony;555-3412
9488@print{}
9489@print{} Becky;555-7685
9490@print{}
9491@print{} Bill;555-1675
9492@print{}
9493@print{} Broderick;555-0542
9494@print{}
9495@print{} Camilla;555-2912
9496@print{}
9497@print{} Fabius;555-1234
9498@print{}
9499@print{} Julie;555-6699
9500@print{}
9501@print{} Martin;555-6480
9502@print{}
9503@print{} Samuel;555-3430
9504@print{}
9505@print{} Jean-Paul;555-2127
9506@print{}
9507@end example
9508
9509If the value of @code{ORS} does not contain a newline, the program's output
9510runs together on a single line.
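
Here is a small sketch of that effect.  The input and the comma used
for @code{ORS} are arbitrary, and the @code{END} rule is there only to
finish the line:

@example
$ @kbd{printf 'a\nb\nc\n' | awk 'BEGIN @{ ORS = "," @} @{ print @} END @{ printf "\n" @}'}
@print{} a,b,c,
@end example

@noindent
Note the trailing comma: @code{ORS} is written after every record,
including the last one.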
9511
9512@node OFMT
9513@section Controlling Numeric Output with @code{print}
9514@cindex numeric @subentry output format
9515@cindex formats, numeric output
9516When printing numeric values with the @code{print} statement,
9517@command{awk} internally converts each number to a string of characters
9518and prints that string.  @command{awk} uses the @code{sprintf()} function
9519to do this conversion
9520(@pxref{String Functions}).
9521For now, it suffices to say that the @code{sprintf()}
9522function accepts a @dfn{format specification} that tells it how to format
9523numbers (or strings), and that there are a number of different ways in which
9524numbers can be formatted.  The different format specifications are discussed
9525more fully in
9526@ref{Control Letters}.
9527
9528@cindexawkfunc{sprintf}
9529@cindex @code{OFMT} variable
9530@cindex output @subentry format specifier, @code{OFMT}
9531The predefined variable @code{OFMT} contains the format specification
9532that @code{print} uses with @code{sprintf()} when it wants to convert a
9533number to a string for printing.
9534The default value of @code{OFMT} is @code{"%.6g"}.
9535The way @code{print} prints numbers can be changed
9536by supplying a different format specification
9537for the value of @code{OFMT}, as shown in the following example:
9538
9539@example
9540$ @kbd{awk 'BEGIN @{}
9541>   @kbd{OFMT = "%.0f"  # print numbers as integers (rounds)}
9542>   @kbd{print 17.23, 17.54 @}'}
9543@print{} 17 18
9544@end example
9545
9546@noindent
9547@cindex dark corner @subentry @code{OFMT} variable
9548@cindex POSIX @command{awk} @subentry @code{OFMT} variable and
9549@cindex @code{OFMT} variable @subentry POSIX @command{awk} and
9550According to the POSIX standard, @command{awk}'s behavior is undefined
9551if @code{OFMT} contains anything but a floating-point conversion specification.
9552@value{DARKCORNER}
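
One detail worth keeping in mind is that @code{OFMT} applies only to
values that are not integers; integral values are printed as integers.
A short sketch, with arbitrarily chosen numbers:

@example
$ @kbd{awk 'BEGIN @{ OFMT = "%.2f"; print 3.14159, 17 @}'}
@print{} 3.14 17
@end example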
9553
9554@node Printf
9555@section Using @code{printf} Statements for Fancier Printing
9556
9557@cindex @code{printf} statement
9558@cindex output @subentry formatted
9559@cindex formatting @subentry output
9560For more precise control over the output format than what is
9561provided by @code{print}, use @code{printf}.
9562With @code{printf} you can
9563specify the width to use for each item, as well as various
9564formatting choices for numbers (such as what output base to use, whether to
9565print an exponent, whether to print a sign, and how many digits to print
9566after the decimal point).
9567
9568@menu
9569* Basic Printf::                Syntax of the @code{printf} statement.
9570* Control Letters::             Format-control letters.
9571* Format Modifiers::            Format-specification modifiers.
9572* Printf Examples::             Several examples.
9573@end menu
9574
9575@node Basic Printf
9576@subsection Introduction to the @code{printf} Statement
9577
9578@cindex @code{printf} statement @subentry syntax of
9579A simple @code{printf} statement looks like this:
9580
9581@example
9582printf @var{format}, @var{item1}, @var{item2}, @dots{}
9583@end example
9584
9585@noindent
9586As for @code{print}, the entire list of arguments may optionally be
9587enclosed in parentheses. Here too, the parentheses are necessary if any
9588of the item expressions uses the @samp{>} relational operator; otherwise,
9589it can be confused with an output redirection (@pxref{Redirection}).
9590
9591@cindex format specifiers
9592The difference between @code{printf} and @code{print} is the @var{format}
9593argument.  This is an expression whose value is taken as a string; it
9594specifies how to output each of the other arguments.  It is called the
9595@dfn{format string}.
9596
9597The format string is very similar to that in the ISO C library function
9598@code{printf()}.  Most of @var{format} is text to output verbatim.
9599Scattered among this text are @dfn{format specifiers}---one per item.
9600Each format specifier says to output the next item in the argument list
9601at that place in the format.
9602
9603The @code{printf} statement does not automatically append a newline
9604to its output.  It outputs only what the format string specifies.
9605So if a newline is needed, you must include one in the format string.
9606The output separator variables @code{OFS} and @code{ORS} have no effect
9607on @code{printf} statements. For example:
9608
9609@example
9610@group
9611$ @kbd{awk 'BEGIN @{}
9612>    @kbd{ORS = "\nOUCH!\n"; OFS = "+"}
9613>    @kbd{msg = "Don\47t Panic!"}
9614>    @kbd{printf "%s\n", msg}
9615> @kbd{@}'}
9616@print{} Don't Panic!
9617@end group
9618@end example
9619
9620@noindent
9621Here, neither the @samp{+} nor the @samp{OUCH!} appears in
9622the output message.
9623
9624@node Control Letters
9625@subsection Format-Control Letters
9626@cindex @code{printf} statement @subentry format-control characters
9627@cindex format specifiers @subentry @code{printf} statement
9628
9629A format specifier starts with the character @samp{%} and ends with
9630a @dfn{format-control letter}---it tells the @code{printf} statement
9631how to output one item.  The format-control letter specifies what @emph{kind}
9632of value to print.  The rest of the format specifier is made up of
9633optional @dfn{modifiers} that control @emph{how} to print the value, such as
9634the field width.  Here is a list of the format-control letters:
9635
9636@c @asis for docbook to come out right
9637@table @asis
9638@item @code{%a}, @code{%A}
9639A floating point number of the form
9640[@code{-}]@code{0x@var{h}.@var{hhhh}p+-@var{dd}}
9641(C99 hexadecimal floating point format).
9642For @code{%A},
9643uppercase letters are used instead of lowercase ones.
9644
9645@quotation NOTE
9646The current POSIX standard requires support for @code{%a} and @code{%A} in
9647@command{awk}. As far as we know, besides @command{gawk}, the only other
9648version of @command{awk} that actually implements it is BWK @command{awk}.
Its use is thus highly nonportable!
9650
9651Furthermore, these formats are not available on any system where the
9652underlying C library @code{printf()} function does not support them. As
9653of this writing, among current systems, only OpenVMS is known to not
9654support them.
9655@end quotation
9656
9657@item @code{%c}
9658Print a number as a character; thus, @samp{printf "%c",
965965} outputs the letter @samp{A}. The output for a string value is
9660the first character of the string.
9661
9662@cindex dark corner @subentry format-control characters
9663@cindex @command{gawk} @subentry format-control characters
9664@quotation NOTE
9665The POSIX standard says the first character of a string is printed.
9666In locales with multibyte characters, @command{gawk} attempts to
9667convert the leading bytes of the string into a valid wide character
9668and then to print the multibyte encoding of that character.
9669Similarly, when printing a numeric value, @command{gawk} allows the
9670value to be within the numeric range of values that can be held
9671in a wide character.
9672If the conversion to multibyte encoding fails, @command{gawk}
9673uses the low eight bits of the value as the character to print.
9674
9675Other @command{awk} versions generally restrict themselves to printing
9676the first byte of a string or to numeric values within the range of
9677a single byte (0--255).
9678@value{DARKCORNER}
9679@end quotation
9680
9681
9682@item @code{%d}, @code{%i}
9683Print a decimal integer.
9684The two control letters are equivalent.
9685(The @samp{%i} specification is for compatibility with ISO C.)
9686
9687@item @code{%e}, @code{%E}
9688Print a number in scientific (exponential) notation.
9689For example:
9690
9691@example
9692printf "%4.3e\n", 1950
9693@end example
9694
9695@noindent
9696prints @samp{1.950e+03}, with a total of four significant figures, three of
9697which follow the decimal point.
9698(The @samp{4.3} represents two modifiers,
9699discussed in the next @value{SUBSECTION}.)
9700@samp{%E} uses @samp{E} instead of @samp{e} in the output.
9701
9702@item @code{%f}
9703Print a number in floating-point notation.
9704For example:
9705
9706@example
9707printf "%4.3f", 1950
9708@end example
9709
9710@noindent
prints @samp{1950.000}, using a minimum field width of four characters,
with three digits following the decimal point.
9713(The @samp{4.3} represents two modifiers,
9714discussed in the next @value{SUBSECTION}.)
9715
9716On systems supporting IEEE 754 floating-point format, values
9717representing negative
9718infinity are formatted as
9719@samp{-inf} or @samp{-infinity},
9720and positive infinity as
9721@samp{inf} or @samp{infinity}.
9722The special ``not a number'' value formats as @samp{-nan} or @samp{nan}
9723(@pxref{Math Definitions}).
9724
9725@item @code{%F}
9726Like @samp{%f}, but the infinity and ``not a number'' values are spelled
9727using uppercase letters.
9728
9729The @samp{%F} format is a POSIX extension to ISO C; not all systems
9730support it.  On those that don't, @command{gawk} uses @samp{%f} instead.
9731
9732@item @code{%g}, @code{%G}
9733Print a number in either scientific notation or in floating-point
9734notation, whichever uses fewer characters; if the result is printed in
9735scientific notation, @samp{%G} uses @samp{E} instead of @samp{e}.
9736
9737@item @code{%o}
9738Print an unsigned octal integer
9739(@pxref{Nondecimal-numbers}).
9740
9741@item @code{%s}
9742Print a string.
9743
9744@item @code{%u}
9745Print an unsigned decimal integer.
9746(This format is of marginal use, because all numbers in @command{awk}
9747are floating point; it is provided primarily for compatibility with C.)
9748
9749@item @code{%x}, @code{%X}
9750Print an unsigned hexadecimal integer;
9751@samp{%X} uses the letters @samp{A} through @samp{F}
9752instead of @samp{a} through @samp{f}
9753(@pxref{Nondecimal-numbers}).
9754
9755@item @code{%%}
9756Print a single @samp{%}.
9757This does not consume an
9758argument and it ignores any modifiers.
9759@end table
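
The following short session exercises several of the control letters
from the preceding list.  It is only an illustrative sketch: the file
name @file{letters.awk} and the values are made up, and the output shown
assumes @command{gawk} running in an ASCII-compatible locale:

@example
$ @kbd{cat letters.awk}
@print{} BEGIN @{
@print{}     printf "%c %d %i\n", 65, 65.9, -65.9   # character, integers
@print{}     printf "%e %f %g\n", 1950, 1950, 1950  # three numeric styles
@print{}     printf "%o %x %X\n", 255, 255, 255     # octal and hexadecimal
@print{}     printf "%s 100%%\n", "done:"           # string, literal percent
@print{} @}
$ @kbd{gawk -f letters.awk}
@print{} A 65 -65
@print{} 1.950000e+03 1950.000000 1950
@print{} 377 ff FF
@print{} done: 100%
@end example

@noindent
Note that the integer formats discard the fractional part of the value
instead of rounding it.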
9760
9761@cindex dark corner @subentry format-control characters
9762@cindex @command{gawk} @subentry format-control characters
9763@quotation NOTE
9764When using the integer format-control letters for values that are
9765outside the range of the widest C integer type, @command{gawk} switches to
9766the @samp{%g} format specifier. If @option{--lint} is provided on the
9767command line (@pxref{Options}), @command{gawk}
9768warns about this.  Other versions of @command{awk} may print invalid
9769values or do something else entirely.
9770@value{DARKCORNER}
9771@end quotation
9772
9773@quotation NOTE
9774The IEEE 754 standard for floating-point arithmetic allows for special
9775values that represent ``infinity'' (positive and negative) and values
9776that are ``not a number'' (NaN).
9777
9778Input and output of these values occurs as text strings. This is
9779somewhat problematic for the @command{awk} language, which predates
the IEEE standard.
@xref{POSIX Floating Point Problems}, for further details.
9782@end quotation
9783
9784@node Format Modifiers
9785@subsection Modifiers for @code{printf} Formats
9786
9787@cindex @code{printf} statement @subentry modifiers
9788@cindex modifiers, in format specifiers
9789A format specification can also include @dfn{modifiers} that can control
9790how much of the item's value is printed, as well as how much space it gets.
9791The modifiers come between the @samp{%} and the format-control letter.
9792We use the bullet symbol ``@bullet{}'' in the following examples to
9793represent
9794spaces in the output. Here are the possible modifiers, in the order in
9795which they may appear:
9796
9797@table @asis
9798@cindex differences in @command{awk} and @command{gawk} @subentry @code{print}/@code{printf} statements
9799@cindex @code{printf} statement @subentry positional specifiers
9800@c the code{} does NOT start a secondary
9801@cindex positional specifiers, @code{printf} statement
9802@item @code{@var{N}$}
9803An integer constant followed by a @samp{$} is a @dfn{positional specifier}.
9804Normally, format specifications are applied to arguments in the order
9805given in the format string.  With a positional specifier, the format
9806specification is applied to a specific argument, instead of what
9807would be the next argument in the list.  Positional specifiers begin
9808counting with one. Thus:
9809
9810@example
9811printf "%s %s\n", "don't", "panic"
9812printf "%2$s %1$s\n", "panic", "don't"
9813@end example
9814
9815@noindent
9816prints the famous friendly message twice.
9817
9818At first glance, this feature doesn't seem to be of much use.
9819It is in fact a @command{gawk} extension, intended for use in translating
9820messages at runtime.
9821@xref{Printf Ordering},
9822which describes how and why to use positional specifiers.
9823For now, we ignore them.
9824
9825@item @code{-} (Minus)
9826The minus sign, used before the width modifier (see later on in
9827this list),
9828says to left-justify
9829the argument within its specified width.  Normally, the argument
9830is printed right-justified in the specified width.  Thus:
9831
9832@example
9833printf "%-4s", "foo"
9834@end example
9835
9836@noindent
9837prints @samp{foo@bullet{}}.
9838
9839@item @var{space}
9840For numeric conversions, prefix positive values with a space and
9841negative values with a minus sign.
9842
9843@item @code{+}
9844The plus sign, used before the width modifier (see later on in
9845this list),
9846says to always supply a sign for numeric conversions, even if the data
9847to format is positive. The @samp{+} overrides the space modifier.
9848
9849@item @code{#}
9850Use an ``alternative form'' for certain control letters.
9851For @samp{%o}, supply a leading zero.
9852For @samp{%x} and @samp{%X}, supply a leading @samp{0x} or @samp{0X} for
9853a nonzero result.
9854For @samp{%e}, @samp{%E}, @samp{%f}, and @samp{%F}, the result always
9855contains a decimal point.
9856For @samp{%g} and @samp{%G}, trailing zeros are not removed from the result.
9857
9858@item @code{0}
9859A leading @samp{0} (zero) acts as a flag indicating that output should be
9860padded with zeros instead of spaces.
9861This applies only to the numeric output formats.
9862This flag only has an effect when the field width is wider than the
9863value to print.
9864
9865@item @code{'}
9866A single quote or apostrophe character is a POSIX extension to ISO C.
9867It indicates that the integer part of a floating-point value, or the
9868entire part of an integer decimal value, should have a thousands-separator
9869character in it.  This only works in locales that support such characters.
9870For example:
9871
9872@example
9873$ @kbd{cat thousands.awk}          @ii{Show source program}
9874@print{} BEGIN @{ printf "%'d\n", 1234567 @}
9875$ @kbd{LC_ALL=C gawk -f thousands.awk}
9876@print{} 1234567                   @ii{Results in} "C" @ii{locale}
9877$ @kbd{LC_ALL=en_US.UTF-8 gawk -f thousands.awk}
9878@print{} 1,234,567                 @ii{Results in US English UTF locale}
9879@end example
9880
9881@noindent
9882For more information about locales and internationalization issues,
9883see @ref{Locales}.
9884
9885@quotation NOTE
9886The @samp{'} flag is a nice feature, but its use complicates things: it
9887becomes difficult to use it in command-line programs.  For information
9888on appropriate quoting tricks, see @ref{Quoting}.
9889@end quotation
9890
9891@item @var{width}
9892This is a number specifying the desired minimum width of a field.  Inserting any
9893number between the @samp{%} sign and the format-control character forces the
9894field to expand to this width.  The default way to do this is to
9895pad with spaces on the left.  For example:
9896
9897@example
9898printf "%4s", "foo"
9899@end example
9900
9901@noindent
9902prints @samp{@bullet{}foo}.
9903
9904The value of @var{width} is a minimum width, not a maximum.  If the item
9905value requires more than @var{width} characters, it can be as wide as
9906necessary.  Thus, the following:
9907
9908@example
9909printf "%4s", "foobar"
9910@end example
9911
9912@noindent
9913prints @samp{foobar}.
9914
9915Preceding the @var{width} with a minus sign causes the output to be
9916padded with spaces on the right, instead of on the left.
9917
9918@item @code{.@var{prec}}
9919A period followed by an integer constant
9920specifies the precision to use when printing.
9921The meaning of the precision varies by control letter:
9922
9923@table @asis
9924@item @code{%d}, @code{%i}, @code{%o}, @code{%u}, @code{%x}, @code{%X}
9925Minimum number of digits to print.
9926
9927@item @code{%e}, @code{%E}, @code{%f}, @code{%F}
9928Number of digits to the right of the decimal point.
9929
9930@item @code{%g}, @code{%G}
9931Maximum number of significant digits.
9932
9933@item @code{%s}
9934Maximum number of characters from the string that should print.
9935@end table
9936
9937Thus, the following:
9938
9939@example
9940printf "%.4s", "foobar"
9941@end example
9942
9943@noindent
9944prints @samp{foob}.
9945@end table
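
The following sketch combines several of the modifiers just described.
The file name @file{modifiers.awk} and the data values are made up for
illustration.  Here, square brackets are printed around each item so that
the padding is visible:

@example
$ @kbd{cat modifiers.awk}
@print{} BEGIN @{
@print{}     printf "[%-6s]\n", "ab"          # left-justify in six columns
@print{}     printf "[%+d] [% d]\n", 42, 42   # explicit sign, or a space
@print{}     printf "[%05d]\n", 42            # pad with zeros
@print{}     printf "[%#o] [%#x]\n", 8, 255   # alternative forms
@print{}     printf "[%6.2f]\n", 3.14159      # width and precision together
@print{} @}
$ @kbd{gawk -f modifiers.awk}
@print{} [ab    ]
@print{} [+42] [ 42]
@print{} [00042]
@print{} [010] [0xff]
@print{} [  3.14]
@end example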
9946
9947The C library @code{printf}'s dynamic @var{width} and @var{prec}
9948capability (e.g., @code{"%*.*s"}) is supported.  Instead of
9949supplying explicit @var{width} and/or @var{prec} values in the format
9950string, they are passed in the argument list.  For example:
9951
9952@example
9953w = 5
9954p = 3
9955s = "abcdefg"
9956printf "%*.*s\n", w, p, s
9957@end example
9958
9959@noindent
9960is exactly equivalent to:
9961
9962@example
9963s = "abcdefg"
9964printf "%5.3s\n", s
9965@end example
9966
9967@noindent
9968Both programs output @samp{@w{@bullet{}@bullet{}abc}}.
9969Earlier versions of @command{awk} did not support this capability.
9970If you must use such a version, you may simulate this feature by using
9971concatenation to build up the format string, like so:
9972
9973@example
9974w = 5
9975p = 3
9976s = "abcdefg"
9977printf "%" w "." p "s\n", s
9978@end example
9979
9980@noindent
9981This is not particularly easy to read, but it does work.
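
Dynamic widths are most useful when the width is not known until
runtime.  As a sketch (the array names @code{name} and @code{phone} and
the variable @code{w} are chosen here only for illustration), the
following program reads the file @file{mail-list}, remembers the length of
the longest name, and then uses that length as the width of the first column:

@example
awk '@{ name[NR] = $1; phone[NR] = $2
       if (length($1) > w)
           w = length($1) @}
     END @{ for (i = 1; i <= NR; i++)
               printf "%-*s %s\n", w, name[i], phone[i] @}' mail-list
@end example

@noindent
This approach holds all the records in memory until the end of the input,
which is reasonable for small files but may not be for very large ones.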
9982
9983@c @cindex lint checks
9984@cindex troubleshooting @subentry fatal errors @subentry @code{printf} format strings
9985@cindex POSIX @command{awk} @subentry @code{printf} format strings and
9986C programmers may be used to supplying additional modifiers (@samp{h},
9987@samp{j}, @samp{l}, @samp{L}, @samp{t}, and @samp{z}) in @code{printf}
9988format strings. These are not valid in @command{awk}.  Most @command{awk}
9989implementations silently ignore them.  If @option{--lint} is provided
9990on the command line (@pxref{Options}), @command{gawk} warns about their
9991use. If @option{--posix} is supplied, their use is a fatal error.
9992
9993@node Printf Examples
9994@subsection Examples Using @code{printf}
9995
9996The following simple example shows
9997how to use @code{printf} to make an aligned table:
9998
9999@example
10000awk '@{ printf "%-10s %s\n", $1, $2 @}' mail-list
10001@end example
10002
10003@noindent
10004This command
10005prints the names of the people (@code{$1}) in the file
10006@file{mail-list} as a string of 10 characters that are left-justified.  It also
10007prints the phone numbers (@code{$2}) next on the line.  This
10008produces an aligned two-column table of names and phone numbers,
10009as shown here:
10010
10011@example
10012$ @kbd{awk '@{ printf "%-10s %s\n", $1, $2 @}' mail-list}
10013@print{} Amelia     555-5553
10014@print{} Anthony    555-3412
10015@print{} Becky      555-7685
10016@print{} Bill       555-1675
10017@print{} Broderick  555-0542
10018@print{} Camilla    555-2912
10019@print{} Fabius     555-1234
10020@print{} Julie      555-6699
10021@print{} Martin     555-6480
10022@print{} Samuel     555-3430
10023@print{} Jean-Paul  555-2127
10024@end example
10025
10026In this case, the phone numbers had to be printed as strings because
10027the numbers are separated by dashes.  Printing the phone numbers as
10028numbers would have produced just the first three digits: @samp{555}.
10029This would have been pretty confusing.
10030
10031It wasn't necessary to specify a width for the phone numbers because
10032they are last on their lines.  They don't need to have spaces
10033after them.
10034
10035The table could be made to look even nicer by adding headings to the
10036tops of the columns.  This is done using a @code{BEGIN} rule
10037(@pxref{BEGIN/END})
10038so that the headers are only printed once, at the beginning of
10039the @command{awk} program:
10040
10041@example
10042awk 'BEGIN @{ print "Name      Number"
10043             print "----      ------" @}
10044           @{ printf "%-10s %s\n", $1, $2 @}' mail-list
10045@end example
10046
10047The preceding example mixes @code{print} and @code{printf} statements in
10048the same program.  Using just @code{printf} statements can produce the
10049same results:
10050
10051@example
10052awk 'BEGIN @{ printf "%-10s %s\n", "Name", "Number"
10053             printf "%-10s %s\n", "----", "------" @}
10054           @{ printf "%-10s %s\n", $1, $2 @}' mail-list
10055@end example
10056
10057@noindent
10058Printing each column heading with the same format specification
10059used for the column elements ensures that the headings
10060are aligned just like the columns.
10061
10062The fact that the same format specification is used three times can be
10063emphasized by storing it in a variable, like this:
10064
10065@example
10066awk 'BEGIN @{ format = "%-10s %s\n"
10067             printf format, "Name", "Number"
10068             printf format, "----", "------" @}
10069           @{ printf format, $1, $2 @}' mail-list
10070@end example
10071
10072
10073@node Redirection
10074@section Redirecting Output of @code{print} and @code{printf}
10075
10076@cindex output redirection
10077@cindex redirection @subentry of output
10078@cindex @option{--sandbox} option @subentry output redirection with @code{print} @subentry @code{printf}
10079So far, the output from @code{print} and @code{printf} has gone
10080to the standard
10081output, usually the screen.  Both @code{print} and @code{printf} can
10082also send their output to other places.
10083This is called @dfn{redirection}.
10084
10085@quotation NOTE
10086When @option{--sandbox} is specified (@pxref{Options}),
10087redirecting output to files, pipes, and coprocesses is disabled.
10088@end quotation
10089
10090A redirection appears after the @code{print} or @code{printf} statement.
10091Redirections in @command{awk} are written just like redirections in shell
10092commands, except that they are written inside the @command{awk} program.
10093
10094@c the commas here are part of the see also
10095@cindex @code{print} statement @seealso{redirection of output}
10096@cindex @code{printf} statement @seealso{redirection of output}
10097There are four forms of output redirection: output to a file, output
10098appended to a file, output through a pipe to another command, and output
10099to a coprocess.  We show them all for the @code{print} statement,
10100but they work identically for @code{printf}:
10101
10102@table @code
10103@cindex @code{>} (right angle bracket) @subentry @code{>} operator (I/O)
10104@cindex right angle bracket (@code{>}) @subentry @code{>} operator (I/O)
10105@cindex operators @subentry input/output
10106@item print @var{items} > @var{output-file}
10107This redirection prints the items into the output file named
10108@var{output-file}.  The @value{FN} @var{output-file} can be any
10109expression.  Its value is changed to a string and then used as a
10110@value{FN} (@pxref{Expressions}).
10111
10112When this type of redirection is used, the @var{output-file} is erased
10113before the first output is written to it.  Subsequent writes to the same
10114@var{output-file} do not erase @var{output-file}, but append to it.
10115(This is different from how you use redirections in shell scripts.)
10116If @var{output-file} does not exist, it is created.  For example, here
is how an @command{awk} program can write a list of people's names to one
10118file named @file{name-list}, and a list of phone numbers to another file
10119named @file{phone-list}:
10120
10121@example
10122$ @kbd{awk '@{ print $2 > "phone-list"}
10123>        @kbd{print $1 > "name-list" @}' mail-list}
10124$ @kbd{cat phone-list}
10125@print{} 555-5553
10126@print{} 555-3412
10127@dots{}
10128$ @kbd{cat name-list}
10129@print{} Amelia
10130@print{} Anthony
10131@dots{}
10132@end example
10133
10134@noindent
10135Each output file contains one name or number per line.
10136
10137@cindex @code{>} (right angle bracket) @subentry @code{>>} operator (I/O)
10138@cindex right angle bracket (@code{>}) @subentry @code{>>} operator (I/O)
10139@item print @var{items} >> @var{output-file}
10140This redirection prints the items into the preexisting output file
10141named @var{output-file}.  The difference between this and the
10142single-@samp{>} redirection is that the old contents (if any) of
10143@var{output-file} are not erased.  Instead, the @command{awk} output is
10144appended to the file.
10145If @var{output-file} does not exist, then it is created.
10146
10147@cindex @code{|} (vertical bar) @subentry @code{|} operator (I/O)
10148@cindex pipe @subentry output
10149@cindex output @subentry pipes
10150@item print @var{items} | @var{command}
10151It is possible to send output to another program through a pipe
10152instead of into a file.   This redirection opens a pipe to
10153@var{command}, and writes the values of @var{items} through this pipe
10154to another process created to execute @var{command}.
10155
10156The redirection argument @var{command} is actually an @command{awk}
10157expression.  Its value is converted to a string whose contents give
10158the shell command to be run.  For example, the following produces two
files, one unsorted list of people's names, and one list sorted in reverse
10160alphabetical order:
10161
10162@ignore
1016310/2000:
10164This isn't the best style, since COMMAND is assigned for each
10165record.  It's done to avoid overfull hboxes in TeX.  Leave it
10166alone for now and let's hope no-one notices.
10167@end ignore
10168
10169@example
10170@group
10171awk '@{ print $1 > "names.unsorted"
10172       command = "sort -r > names.sorted"
10173       print $1 | command @}' mail-list
10174@end group
10175@end example
10176
10177The unsorted list is written with an ordinary redirection, while
10178the sorted list is written by piping through the @command{sort} utility.
10179
10180The next example uses redirection to mail a message to the mailing
10181list @code{bug-system}.  This might be useful when trouble is encountered
10182in an @command{awk} script run periodically for system maintenance:
10183
10184@example
10185report = "mail bug-system"
10186print("Awk script failed:", $0) | report
10187print("at record number", FNR, "of", FILENAME) | report
10188close(report)
10189@end example
10190
10191The @code{close()} function is called here because it's a good idea to close
10192the pipe as soon as all the intended output has been sent to it.
10193@xref{Close Files And Pipes}
10194for more information.
10195
10196This example also illustrates the use of a variable to represent
10197a @var{file} or @var{command}---it is not necessary to always
10198use a string constant.  Using a variable is generally a good idea,
10199because (if you mean to refer to that same file or command)
10200@command{awk} requires that the string value be written identically
10201every time.
10202
10203@cindex coprocesses
10204@cindex @code{|} (vertical bar) @subentry @code{|&} operator (I/O)
10205@cindex operators @subentry input/output
10206@cindex differences in @command{awk} and @command{gawk} @subentry input/output operators
10207@item print @var{items} |& @var{command}
10208This redirection prints the items to the input of @var{command}.
10209The difference between this and the
10210single-@samp{|} redirection is that the output from @var{command}
10211can be read with @code{getline}.
10212Thus, @var{command} is a @dfn{coprocess}, which works together with
10213but is subsidiary to the @command{awk} program.
10214
10215This feature is a @command{gawk} extension, and is not available in
10216POSIX @command{awk}.
10217@ifnotdocbook
10218@xref{Getline/Coprocess},
10219for a brief discussion.
10220@xref{Two-way I/O},
10221for a more complete discussion.
10222@end ifnotdocbook
10223@ifdocbook
10224@xref{Getline/Coprocess}
10225for a brief discussion and
10226@ref{Two-way I/O}
10227for a more complete discussion.
10228@end ifdocbook
10229@end table
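
As a minimal sketch of the last form (a @command{gawk} extension; the
variable names @code{cmd} and @code{line} are illustrative), the following
program sends two lines to a @command{sort} coprocess and then reads the
sorted result back:

@example
BEGIN @{
    cmd = "sort"
    print "pear"  |& cmd
    print "apple" |& cmd
    close(cmd, "to")               # signal end of input to sort
    while ((cmd |& getline line) > 0)
        print "sorted:", line
    close(cmd)
@}
@end example

@noindent
The two-argument call to @code{close()} closes only the sending end of the
connection, so that @command{sort} sees end-of-file and produces its output.
This is covered in @ref{Two-way I/O}.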
10230
10231Redirecting output using @samp{>}, @samp{>>}, @samp{|}, or @samp{|&}
10232asks the system to open a file, pipe, or coprocess only if the particular
10233@var{file} or @var{command} you specify has not already been written
10234to by your program or if it has been closed since it was last written to.
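
To illustrate, here is a small sketch (the file name @file{notes.txt}
is hypothetical).  The first two @code{print} statements share one open
file.  After the call to @code{close()}, the third statement reopens the
file, and because it uses @samp{>}, the earlier contents are erased:

@example
BEGIN @{
    out = "notes.txt"
    print "first"  > out    # opens the file, erasing any old contents
    print "second" > out    # file is already open; output is appended
    close(out)
    print "third"  > out    # reopened with >, so the file is erased again
@}
@end example

@noindent
When the program finishes, @file{notes.txt} contains only the line
@samp{third}.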
10235
10236@cindex troubleshooting @subentry printing
10237It is a common error to use @samp{>} redirection for the first @code{print}
10238to a file, and then to use @samp{>>} for subsequent output:
10239
10240@example
10241# clear the file
10242print "Don't panic" > "guide.txt"
10243@dots{}
10244# append
10245print "Avoid improbability generators" >> "guide.txt"
10246@end example
10247
10248@noindent
10249This is indeed how redirections must be used from the shell.  But in
10250@command{awk}, it isn't necessary.  In this kind of case, a program should
10251use @samp{>} for all the @code{print} statements, because the output file
is only opened once. (It happens that if you mix @samp{>} and @samp{>>},
output is produced in the expected order. However, mixing the operators
10254for the same file is definitely poor style, and is confusing to readers
10255of your program.)
10256
10257@cindex differences in @command{awk} and @command{gawk} @subentry implementation limitations
10258@cindex implementation issues, @command{gawk} @subentry limits
10259@cindex @command{awk} @subentry implementation issues @subentry pipes
10260@cindex @command{gawk} @subentry implementation issues @subentry pipes
10261@ifnotinfo
10262As mentioned earlier
10263(@pxref{Getline Notes}),
10264many
10265@end ifnotinfo
10266@ifnottex
10267@ifnotdocbook
10268Many
10269@end ifnotdocbook
10270@end ifnottex
10271older
10272@command{awk} implementations limit the number of pipelines that an @command{awk}
10273program may have open to just one!  In @command{gawk}, there is no such limit.
10274@command{gawk} allows a program to
10275open as many pipelines as the underlying operating system permits.
10276
10277@sidebar Piping into @command{sh}
10278@cindex shells @subentry piping commands into
10279
10280A particularly powerful way to use redirection is to build command lines
10281and pipe them into the shell, @command{sh}.  For example, suppose you
10282have a list of files brought over from a system where all the @value{FN}s
10283are stored in uppercase, and you wish to rename them to have names in
10284all lowercase.  The following program is both simple and efficient:
10285
10286@c @cindex @command{mv} utility
10287@example
10288@{ printf("mv %s %s\n", $0, tolower($0)) | "sh" @}
10289
10290END @{ close("sh") @}
10291@end example
10292
10293The @code{tolower()} function returns its argument string with all
10294uppercase characters converted to lowercase
10295(@pxref{String Functions}).
10296The program builds up a list of command lines,
10297using the @command{mv} utility to rename the files.
10298It then sends the list to the shell for execution.
10299
10300@xref{Shell Quoting} for a function that can help in generating
10301command lines to be fed to the shell.
10302@end sidebar
10303
10304@node Special FD
10305@section Special Files for Standard Preopened Data Streams
10306@cindex standard input
10307@cindex input @subentry standard
10308@cindex standard output
10309@cindex output @subentry standard
10310@cindex error output
10311@cindex standard error
10312@cindex file descriptors
10313@cindex files @subentry descriptors @seeentry{file descriptors}
10314
10315Running programs conventionally have three input and output streams
10316already available to them for reading and writing.  These are known
10317as the @dfn{standard input}, @dfn{standard output}, and @dfn{standard
10318error output}.  These open streams (and any other open files or pipes)
10319are often referred to by the technical term @dfn{file descriptors}.
10320
10321These streams are, by default, connected to your keyboard and screen, but
10322they are often redirected with the shell, via the @samp{<}, @samp{<<},
10323@samp{>}, @samp{>>}, @samp{>&}, and @samp{|} operators.  Standard error
10324is typically used for writing error messages; the reason there are two separate
10325streams, standard output and standard error, is so that they can be
10326redirected separately.
10327
10328@cindex differences in @command{awk} and @command{gawk} @subentry error messages
10329@cindex error handling
10330In traditional implementations of @command{awk}, the only way to write an error
10331message to standard error in an @command{awk} program is as follows:
10332
10333@example
10334print "Serious error detected!" | "cat 1>&2"
10335@end example
10336
10337@noindent
10338This works by opening a pipeline to a shell command that can access the
10339standard error stream that it inherits from the @command{awk} process.
10340@c 8/2014: Mike Brennan says not to cite this as inefficient. So, fixed.
10341This is far from elegant, and it also requires a
10342separate process.  So people writing @command{awk} programs often
10343don't do this.  Instead, they send the error messages to the
10344screen, like this:
10345
10346@example
10347print "Serious error detected!" > "/dev/tty"
10348@end example
10349
10350@noindent
10351(@file{/dev/tty} is a special file supplied by the operating system
10352that is connected to your keyboard and screen. It represents the
10353``terminal,''@footnote{The ``tty'' in @file{/dev/tty} stands for
10354``Teletype,'' a serial terminal.} which on modern systems is a keyboard
10355and screen, not a serial console.)
10356This generally has the same effect, but not always: although the
10357standard error stream is usually the screen, it can be redirected; when
10358that happens, writing to the screen is not correct.  In fact, if
10359@command{awk} is run from a background job, it may not have a
10360terminal at all.
10361Then opening @file{/dev/tty} fails.
10362
10363@command{gawk}, BWK @command{awk}, and @command{mawk} provide
10364special @value{FN}s for accessing the three standard streams.
10365If the @value{FN} matches one of these special names when @command{gawk}
10366(or one of the others) redirects input or output, then it directly uses
10367the descriptor that the @value{FN} stands for.  These special
10368@value{FN}s work for all operating systems that @command{gawk}
10369has been ported to, not just those that are POSIX-compliant:
10370
10371@cindex common extensions @subentry @code{/dev/stdin} special file
10372@cindex common extensions @subentry @code{/dev/stdout} special file
10373@cindex common extensions @subentry @code{/dev/stderr} special file
10374@cindex extensions @subentry common @subentry @code{/dev/stdin} special file
10375@cindex extensions @subentry common @subentry @code{/dev/stdout} special file
10376@cindex extensions @subentry common @subentry @code{/dev/stderr} special file
10377@cindex file names @subentry standard streams in @command{gawk}
10378@cindex @code{/dev/@dots{}} special files
10379@cindex files @subentry @code{/dev/@dots{}} special files
10380@cindex @code{/dev/fd/@var{N}} special files (@command{gawk})
10381@table @file
10382@item /dev/stdin
10383The standard input (file descriptor 0).
10384
10385@item /dev/stdout
10386The standard output (file descriptor 1).
10387
10388@item /dev/stderr
10389The standard error output (file descriptor 2).
10390@end table
10391
10392With these facilities,
10393the proper way to write an error message then becomes:
10394
10395@example
10396print "Serious error detected!" > "/dev/stderr"
10397@end example
10398
10399@cindex troubleshooting @subentry quotes with file names
10400Note the use of quotes around the @value{FN}.
As with any other redirection, the value must be a string.
10402It is a common error to omit the quotes, which leads
10403to confusing results.
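
Because such a message goes to standard error, it can be separated from
the program's regular output.  The following shell session is a sketch
that assumes a POSIX-style shell; the file name @file{errors.log} is
made up for illustration:

@example
$ @kbd{gawk 'BEGIN @{ print "data"; print "oops" > "/dev/stderr" @}' 2> errors.log}
@print{} data
$ @kbd{cat errors.log}
@print{} oops
@end example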
10404
10405@command{gawk} does not treat these @value{FN}s as special when
10406in POSIX-compatibility mode. However, because BWK @command{awk}
10407supports them, @command{gawk} does support them even when
10408invoked with the @option{--traditional} option (@pxref{Options}).
10409
10410@node Special Files
10411@section Special @value{FFN}s in @command{gawk}
10412@cindex @command{gawk} @subentry file names in
10413
10414Besides access to standard input, standard output, and standard error,
10415@command{gawk} provides access to any open file descriptor.
10416Additionally, there are special @value{FN}s reserved for
10417TCP/IP networking.
10418
10419@menu
10420* Other Inherited Files::       Accessing other open files with
10421                                @command{gawk}.
10422* Special Network::             Special files for network communications.
10423* Special Caveats::             Things to watch out for.
10424@end menu
10425
10426@node Other Inherited Files
10427@subsection Accessing Other Open Files with @command{gawk}
10428
10429Besides the @code{/dev/stdin}, @code{/dev/stdout}, and @code{/dev/stderr}
10430special @value{FN}s mentioned earlier, @command{gawk} provides syntax
10431for accessing any other inherited open file:
10432
10433@table @file
10434@item /dev/fd/@var{N}
10435The file associated with file descriptor @var{N}.  Such a file must
10436be opened by the program initiating the @command{awk} execution (typically
10437the shell).  Unless special pains are taken in the shell from which
10438@command{gawk} is invoked, only descriptors 0, 1, and 2 are available.
10439@end table
10440
10441The @value{FN}s @file{/dev/stdin}, @file{/dev/stdout}, and @file{/dev/stderr}
10442are essentially aliases for @file{/dev/fd/0}, @file{/dev/fd/1}, and
10443@file{/dev/fd/2}, respectively. However, those names are more self-explanatory.
10444
10445Note that using @code{close()} on a @value{FN} of the
10446form @code{"/dev/fd/@var{N}"}, for file descriptor numbers
10447above two, does actually close the given file descriptor.
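
As a sketch of the ``special pains'' mentioned earlier, a POSIX-style
shell can open an additional descriptor for @command{gawk} to inherit.
Here, the shell connects file descriptor 3 to a file (the name
@file{fd3.out} is hypothetical), and the program writes to it by way of
@file{/dev/fd/3}:

@example
$ @kbd{gawk 'BEGIN @{ print "hello" > "/dev/fd/3" @}' 3> fd3.out}
$ @kbd{cat fd3.out}
@print{} hello
@end example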
10448
10449@node Special Network
10450@subsection Special Files for Network Communications
10451@cindex networks @subentry support for
10452@cindex TCP/IP @subentry support for
10453
10454@command{gawk} programs
10455can open a two-way
10456TCP/IP connection, acting as either a client or a server.
10457This is done using a special @value{FN} of the form:
10458
10459@example
10460@file{/@var{net-type}/@var{protocol}/@var{local-port}/@var{remote-host}/@var{remote-port}}
10461@end example
10462
10463The @var{net-type} is one of @samp{inet}, @samp{inet4}, or @samp{inet6}.
10464The @var{protocol} is one of @samp{tcp} or @samp{udp},
10465and the other fields represent the other essential pieces of information
10466for making a networking connection.
10467These @value{FN}s are used with the @samp{|&} operator for communicating
10468with @w{a coprocess}
10469(@pxref{Two-way I/O}).
10470This is an advanced feature, mentioned here only for completeness.
10471Full discussion is delayed until
10472@ref{TCP/IP Networking}.
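
As a brief taste, a client might look like the following sketch.  It
assumes that the local machine offers the @samp{daytime} service, which
many systems disable; the variable name @code{service} is arbitrary:

@example
BEGIN @{
    # a local port of 0 lets the system choose one
    service = "/inet/tcp/0/localhost/daytime"
    if ((service |& getline line) > 0)
        print line
    close(service)
@}
@end example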
10473
10474@node Special Caveats
10475@subsection Special @value{FFN} Caveats
10476
10477Here are some things to bear in mind when using the
10478special @value{FN}s that @command{gawk} provides:
10479
10480@itemize @value{BULLET}
10481@cindex compatibility mode (@command{gawk}) @subentry file names
10482@cindex file names @subentry in compatibility mode
10483@cindex POSIX mode
10484@item
10485Recognition of the @value{FN}s for the three standard preopened
10486files is disabled only in POSIX mode.
10487
10488@item
10489Recognition of the other special @value{FN}s is disabled if @command{gawk} is in
10490compatibility mode (either @option{--traditional} or @option{--posix};
10491@pxref{Options}).
10492
10493@item
10494@command{gawk} @emph{always}
10495interprets these special @value{FN}s.
10496For example, using @samp{/dev/fd/4}
10497for output actually writes on file descriptor 4, and not on a new
10498file descriptor that is @code{dup()}ed from file descriptor 4.  Most of
10499the time this does not matter; however, it is important to @emph{not}
10500close any of the files related to file descriptors 0, 1, and 2.
10501Doing so results in unpredictable behavior.
10502@end itemize
10503
10504@node Close Files And Pipes
10505@section Closing Input and Output Redirections
10506@cindex files @subentry output @seeentry{output files}
10507@cindex input files @subentry closing
10508@cindex output @subentry files, closing
10509@cindex pipe @subentry closing
10510@cindex coprocesses @subentry closing
10511@cindex @code{getline} command @subentry coprocesses, using from
10512
10513If the same @value{FN} or the same shell command is used with @code{getline}
10514more than once during the execution of an @command{awk} program
10515(@pxref{Getline}),
10516the file is opened (or the command is executed) the first time only.
10517At that time, the first record of input is read from that file or command.
10518The next time the same file or command is used with @code{getline},
10519another record is read from it, and so on.
10520
10521Similarly, when a file or pipe is opened for output, @command{awk} remembers
10522the @value{FN} or command associated with it, and subsequent
10523writes to the same file or command are appended to the previous writes.
10524The file or pipe stays open until @command{awk} exits.
10525
10526@cindexawkfunc{close}
10527This implies that special steps are necessary in order to read the same
10528file again from the beginning, or to rerun a shell command (rather than
10529reading more output from the same command).  The @code{close()} function
10530makes these things possible:
10531
10532@example
10533close(@var{filename})
10534@end example
10535
10536@noindent
10537or:
10538
10539@example
10540close(@var{command})
10541@end example
10542
10543The argument @var{filename} or @var{command} can be any expression.  Its
10544value must @emph{exactly} match the string that was used to open the file or
10545start the command (spaces and other ``irrelevant'' characters
10546included). For example, if you open a pipe with this:
10547
10548@example
10549"sort -r names" | getline foo
10550@end example
10551
10552@noindent
10553then you must close it with this:
10554
10555@example
10556close("sort -r names")
10557@end example
10558
10559Once this function call is executed, the next @code{getline} from that
10560file or command, or the next @code{print} or @code{printf} to that
10561file or command, reopens the file or reruns the command.
10562Because the expression that you use to close a file or pipeline must
10563exactly match the expression used to open the file or run the command,
10564it is good practice to use a variable to store the @value{FN} or command.
10565The previous example becomes the following:
10566
10567@example
10568@group
10569sortcom = "sort -r names"
10570sortcom | getline foo
10571@end group
10572@group
10573@dots{}
10574close(sortcom)
10575@end group
10576@end example
10577
10578@noindent
10579This helps avoid hard-to-find typographical errors in your @command{awk}
10580programs.  Here are some of the reasons for closing an output file:
10581
10582@itemize @value{BULLET}
10583@item
10584To write a file and read it back later on in the same @command{awk}
10585program.  Close the file after writing it, then
10586begin reading it with @code{getline}.
10587
10588@item
10589To write numerous files, successively, in the same @command{awk}
10590program.  If the files aren't closed, eventually @command{awk} may exceed a
10591system limit on the number of open files in one process.  It is best to
10592close each one when the program has finished writing it.
10593
10594@item
10595To make a command finish.  When output is redirected through a pipe,
10596the command reading the pipe normally continues to try to read input
10597as long as the pipe is open.  Often this means the command cannot
10598really do its work until the pipe is closed.  For example, if
10599output is redirected to the @command{mail} program, the message is not
10600actually sent until the pipe is closed.
10601
10602@item
10603To run the same program a second time, with the same arguments.
10604This is not the same thing as giving more input to the first run!
10605
10606For example, suppose a program pipes output to the @command{mail} program.
10607If it outputs several lines redirected to this pipe without closing
10608it, they make a single message of several lines.  By contrast, if the
10609program closes the pipe after each line of output, then each line makes
10610a separate message.
10611@end itemize
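
The first reason in the list, writing a file and reading it back, can be
sketched as follows.  The scratch @value{FN} is hypothetical; any
writable location would do:

@example
BEGIN @{
    tmpfile = "/tmp/close-demo.txt"    # hypothetical scratch file
    print "first line" > tmpfile
    print "second line" > tmpfile
    close(tmpfile)                     # finish writing before reading
    while ((getline line < tmpfile) > 0)
        print "read back:", line
    close(tmpfile)
@}
@end example

@noindent
Without the first call to @code{close()}, the data might still be sitting
in an output buffer when @code{getline} starts reading.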
10612
10613@cindex differences in @command{awk} and @command{gawk} @subentry @code{close()} function
10614@cindex portability @subentry @code{close()} function and
10615@cindex @code{close()} function @subentry portability
10616If you use more files than the system allows you to have open,
10617@command{gawk} attempts to multiplex the available open files among
10618your @value{DF}s.  @command{gawk}'s ability to do this depends upon the
10619facilities of your operating system, so it may not always work.  It is
10620therefore both good practice and good portability advice to always
10621use @code{close()} on your files when you are done with them.
10622In fact, if you are using a lot of pipes, it is essential that
10623you close commands when done. For example, consider something like this:
10624
10625@example
10626@{
10627    @dots{}
10628    command = ("grep " $1 " /some/file | my_prog -q " $3)
10629    while ((command | getline) > 0) @{
10630        @var{process output of} command
10631    @}
10632    # need close(command) here
10633@}
10634@end example
10635
10636This example creates a new pipeline based on data in @emph{each} record.
10637Without the call to @code{close()} indicated in the comment, @command{awk}
10638creates child processes to run the commands, until it eventually
10639runs out of file descriptors for more pipelines.
10640
10641Even though each command has finished (as indicated by the end-of-file
10642return status from @code{getline}), the child process is not
10643terminated;@footnote{The technical terminology is rather morbid.
10644The finished child is called a ``zombie,'' and cleaning up after
10645it is referred to as ``reaping.''}
10646@c Good old UNIX: give the marketing guys fits, that's the ticket
10647more importantly, the file descriptor for the pipe
10648is not closed and released until @code{close()} is called or
10649@command{awk} exits.
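
A small, self-contained variation on the same pattern might look like this.
The pipeline is purely illustrative (@command{tr} merely stands in
for a real per-record command), and the sketch assumes that the first field
contains no characters that are special to the shell:

@example
@{
    command = ("echo " $1 " | tr a-z A-Z")
    while ((command | getline line) > 0)
        print line
    close(command)    # release the pipe and reap the child
@}
@end example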
10650
10651@code{close()} silently does nothing if given an argument that
10652does not represent a file, pipe, or coprocess that was opened with
10653a redirection.  In such a case, it returns a negative value,
10654indicating an error. In addition, @command{gawk} sets @code{ERRNO}
10655to a string indicating the error.
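
A program that wants to notice this condition can test the return value,
as in the following sketch (the name passed to @code{close()} is
deliberately one that was never opened):

@example
BEGIN @{
    if (close("never-opened") < 0)
        print "close failed:", ERRNO > "/dev/stderr"
@}
@end example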
10656
10657Note also that @samp{close(FILENAME)} has no ``magic'' effects on the
10658implicit loop that reads through the files named on the command line.
10659It is, more likely, a close of a file that was never opened with a
10660redirection, so @command{awk} silently does nothing, except return
10661a negative value.
10662
10663@cindex @code{|} (vertical bar) @subentry @code{|&} operator (I/O) @subentry pipes, closing
10664When using the @samp{|&} operator to communicate with a coprocess,
10665it is occasionally useful to be able to close one end of the two-way
10666pipe without closing the other.
10667This is done by supplying a second argument to @code{close()}.
10668As in any other call to @code{close()},
10669the first argument is the name of the command or special file used
10670to start the coprocess.
10671The second argument should be a string, with either of the values
10672@code{"to"} or @code{"from"}.  Case does not matter.
10673As this is an advanced feature, discussion is
10674delayed until
10675@ref{Two-way I/O},
10676which describes it in more detail and gives an example.
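
In the meantime, here is a minimal sketch of the idea, assuming that a
@command{sort} utility is available.  Closing the @code{"to"} end lets
@command{sort} see end-of-file and produce its output, which is
then read back through the other end:

@example
BEGIN @{
    cmd = "sort"
    print "pear"  |& cmd
    print "apple" |& cmd
    close(cmd, "to")      # sort now sees end-of-file
    while ((cmd |& getline line) > 0)
        print line
    close(cmd)            # close the remaining end
@}
@end example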
10677
10678@sidebar Using @code{close()}'s Return Value
10679@cindex dark corner @subentry @code{close()} function
10680@cindex @code{close()} function @subentry return value
10681@cindex return value, @code{close()} function
10682@cindex differences in @command{awk} and @command{gawk} @subentry @code{close()} function
10683@cindex Unix @command{awk} @subentry @code{close()} function and
10684
10685In many older versions of Unix @command{awk}, the @code{close()} function
10686is actually a statement.
10687@value{DARKCORNER}
It is a syntax error to try to use the return
value from @code{close()}:
10690
10691@example
10692command = "@dots{}"
10693command | getline info
10694retval = close(command)  # syntax error in many Unix awks
10695@end example
10696
10697@cindex @command{gawk} @subentry @code{ERRNO} variable in
10698@cindex @code{ERRNO} variable @subentry with @command{close()} function
10699@command{gawk} treats @code{close()} as a function.
10700The return value is @minus{}1 if the argument names something
10701that was never opened with a redirection, or if there is
10702a system problem closing the file or process.
10703In these cases, @command{gawk} sets the predefined variable
10704@code{ERRNO} to a string describing the problem.
10705
10706In @command{gawk}, starting with @value{PVERSION} 4.2, when closing a pipe or
10707coprocess (input or output), the return value is the exit status of the
10708command, as described in @ref{table-close-pipe-return-values}.@footnote{Prior
to @value{PVERSION} 4.2, the return value from closing a pipe or coprocess
10710was the full 16-bit exit value as defined by the @code{wait()} system
10711call.} Otherwise, it is the return value from the system's @code{close()}
10712or @code{fclose()} C functions when closing input or output files,
10713respectively.  This value is zero if the close succeeds, or @minus{}1
10714if it fails.
10715
10716@float Table,table-close-pipe-return-values
10717@caption{Return values from @code{close()} of a pipe}
10718@multitable @columnfractions .50 .50
10719@headitem Situation @tab Return value from @code{close()}
10720@item Normal exit of command @tab Command's exit status
10721@item Death by signal of command @tab 256 + number of murderous signal
10722@item Death by signal of command with core dump @tab 512 + number of murderous signal
10723@item Some kind of error @tab @minus{}1
10724@end multitable
10725@end float
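
The first row of the table can be explored with a sketch such as the
following.  The command string is an arbitrary illustration; with
@command{gawk} 4.2 or later, and outside of POSIX mode, the program is
expected to report a value of 3:

@example
BEGIN @{
    command = "exit 3"     # run by the shell; produces no output
    command | getline      # encounters end-of-file immediately
    status = close(command)
    print "close() returned", status
@}
@end example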
10726
10727@cindex POSIX mode
10728The POSIX standard is very vague; it says that @code{close()}
10729returns zero on success and a nonzero value otherwise.  In general,
10730different implementations vary in what they report when closing
10731pipes; thus, the return value cannot be used portably.
10732@value{DARKCORNER}
10733In POSIX mode (@pxref{Options}), @command{gawk} just returns zero
10734when closing a pipe.
10735@end sidebar
10736
10737@node Nonfatal
10738@section Enabling Nonfatal Output
10739
10740This @value{SECTION} describes a @command{gawk}-specific feature.
10741
In standard @command{awk}, a @code{print} or @code{printf} that fails,
whether because the output file cannot be opened or because of some other
I/O error (such as filling up the disk), causes a fatal error:
10745
10746@example
10747$ @kbd{gawk 'BEGIN @{ print "hi" > "/no/such/file" @}'}
10748@error{} gawk: cmd. line:1: fatal: can't redirect to `/no/such/file' (No
10749@error{} such file or directory)
10750@end example
10751
10752@command{gawk} makes it possible to detect that an error has
10753occurred, allowing you to possibly recover from the error, or
10754at least print an error message of your choosing before exiting.
10755You can do this in one of two ways:
10756
@itemize @value{BULLET}
10758@item
10759For all output files, by assigning any value to @code{PROCINFO["NONFATAL"]}.
10760
10761@item
10762On a per-file basis, by assigning any value to
10763@code{PROCINFO[@var{filename}, "NONFATAL"]}.
10764Here, @var{filename} is the name of the file to which
10765you wish output to be nonfatal.
10766@end itemize
10767
10768Once you have enabled nonfatal output, you must check @code{ERRNO}
10769after every relevant @code{print} or @code{printf} statement to
10770see if something went wrong.  It is also a good idea to initialize
10771@code{ERRNO} to zero before attempting the output. For example:
10772
10773@example
10774$ @kbd{gawk '}
10775> @kbd{BEGIN @{}
10776> @kbd{    PROCINFO["NONFATAL"] = 1}
10777> @kbd{    ERRNO = 0}
10778> @kbd{    print "hi" > "/no/such/file"}
10779> @kbd{    if (ERRNO) @{}
10780> @kbd{        print("Output failed:", ERRNO) > "/dev/stderr"}
10781> @kbd{        exit 1}
10782> @kbd{    @}}
10783> @kbd{@}'}
10784@error{} Output failed: No such file or directory
10785@end example
10786
10787Here, @command{gawk} did not produce a fatal error; instead
10788it let the @command{awk} program code detect the problem and handle it.
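
The per-file form works the same way.  In the following sketch, the
variable @code{target} is an arbitrary name chosen for illustration;
output errors remain fatal for every other file:

@example
BEGIN @{
    target = "/no/such/file"
    PROCINFO[target, "NONFATAL"] = 1    # nonfatal for this file only
    ERRNO = 0
    print "hi" > target
    if (ERRNO) @{
        print "Output to", target, "failed:", ERRNO > "/dev/stderr"
        exit 1
    @}
@}
@end example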
10789
This mechanism also works for standard output and standard error.
10791For standard output, you may use @code{PROCINFO["-", "NONFATAL"]}
10792or @code{PROCINFO["/dev/stdout", "NONFATAL"]}.  For standard error, use
10793@code{PROCINFO["/dev/stderr", "NONFATAL"]}.
10794
10795@cindex @env{GAWK_SOCK_RETRIES} environment variable
10796@cindex environment variables @subentry @env{GAWK_SOCK_RETRIES}
10797When attempting to open a TCP/IP socket (@pxref{TCP/IP Networking}),
10798@command{gawk} tries multiple times. The @env{GAWK_SOCK_RETRIES}
10799environment variable (@pxref{Other Environment Variables}) allows you to
override @command{gawk}'s built-in default number of attempts.  However,
10801once nonfatal I/O is enabled for a given socket, @command{gawk} only
10802retries once, relying on @command{awk}-level code to notice that there
10803was a problem.
10804
10805@node Output Summary
10806@section Summary
10807
10808@itemize @value{BULLET}
10809@item
The @code{print} statement prints comma-separated expressions. The
expressions are separated by the value of @code{OFS}, and the output is
terminated by the value of @code{ORS}.  @code{OFMT} provides the conversion
format for numeric values for the @code{print} statement.
10814
10815@item
10816The @code{printf} statement provides finer-grained control over output,
10817with format-control letters for different data types and various flags
10818that modify the behavior of the format-control letters.
10819
10820@item
10821Output from both @code{print} and @code{printf} may be redirected to
10822files, pipes, and coprocesses.
10823
10824@item
10825@command{gawk} provides special @value{FN}s for access to standard input,
10826output, and error, and for network communications.
10827
10828@item
10829Use @code{close()} to close open file, pipe, and coprocess redirections.
10830For coprocesses, it is possible to close only one direction of the
10831communications.
10832
10833@item
10834Normally errors with @code{print} or @code{printf} are fatal.
@command{gawk} lets you make output errors nonfatal, either for
10836all files or on a per-file basis. You must then check for errors
10837after every relevant output statement.
10838
10839@end itemize
10840
10841@c EXCLUDE START
10842@node Output Exercises
10843@section Exercises
10844
10845@enumerate
10846@item
10847Rewrite the program:
10848
10849@example
10850awk 'BEGIN @{ print "Month Crates"
10851             print "----- ------" @}
10852           @{ print $1, "     ", $2 @}' inventory-shipped
10853@end example
10854
10855@noindent
10856from @ref{Output Separators}, by using a new value of @code{OFS}.
10857
10858@item
10859Use the @code{printf} statement to line up the headings and table data
10860for the @file{inventory-shipped} example that was covered in @ref{Print}.
10861
10862@item
10863What happens if you forget the double quotes when redirecting
10864output, as follows:
10865
10866@example
10867BEGIN @{ print "Serious error detected!" > /dev/stderr @}
10868@end example
10869
10870@end enumerate
10871@c EXCLUDE END
10872
10873
10874@node Expressions
10875@chapter Expressions
10876@cindex expressions
10877
10878Expressions are the basic building blocks of @command{awk} patterns
10879and actions.  An expression evaluates to a value that you can print, test,
10880or pass to a function.  Additionally, an expression
10881can assign a new value to a variable or a field by using an assignment operator.
10882
10883An expression can serve as a pattern or action statement on its own.
10884Most other kinds of
10885statements contain one or more expressions that specify the data on which to
10886operate.  As in other languages, expressions in @command{awk} can include
10887variables, array references, constants, and function calls, as well as
10888combinations of these with various operators.
10889
10890@menu
10891* Values::                      Constants, Variables, and Regular Expressions.
10892* All Operators::               @command{gawk}'s operators.
10893* Truth Values and Conditions:: Testing for true and false.
10894* Function Calls::              A function call is an expression.
10895* Precedence::                  How various operators nest.
10896* Locales::                     How the locale affects things.
10897* Expressions Summary::         Expressions summary.
10898@end menu
10899
10900@node Values
10901@section Constants, Variables, and Conversions
10902
10903Expressions are built up from values and the operations performed
10904upon them. This @value{SECTION} describes the elementary objects
10905that provide the values used in expressions.
10906
10907@menu
10908* Constants::                   String, numeric and regexp constants.
10909* Using Constant Regexps::      When and how to use a regexp constant.
10910* Variables::                   Variables give names to values for later use.
10911* Conversion::                  The conversion of strings to numbers and vice
10912                                versa.
10913@end menu
10914
10915@node Constants
10916@subsection Constant Expressions
10917
10918@cindex constants @subentry types of
10919
10920The simplest type of expression is the @dfn{constant}, which always has
10921the same value.  There are three types of constants: numeric,
10922string, and regular expression.
10923
10924Each is used in the appropriate context when you need a data
10925value that isn't going to change.  Numeric constants can
10926have different forms, but are internally stored in an identical manner.
10927
10928@menu
10929* Scalar Constants::            Numeric and string constants.
10930* Nondecimal-numbers::          What are octal and hex numbers.
10931* Regexp Constants::            Regular Expression constants.
10932@end menu
10933
10934@node Scalar Constants
10935@subsubsection Numeric and String Constants
10936
10937@cindex constants @subentry numeric
10938@cindex numeric @subentry constants
10939A @dfn{numeric constant} stands for a number.  This number can be an
10940integer, a decimal fraction, or a number in scientific (exponential)
10941notation.@footnote{The internal representation of all numbers,
10942including integers, uses double-precision floating-point numbers.
10943On most modern systems, these are in IEEE 754 standard format.
10944@xref{Arbitrary Precision Arithmetic}, for much more information.}
10945Here are some examples of numeric constants that all
10946have the same value:
10947
10948@example
10949105
109501.05e+2
109511050e-1
10952@end example
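
A quick way to confirm that the three forms denote the same value is
to print them:

@example
$ @kbd{gawk 'BEGIN @{ print 105, 1.05e+2, 1050e-1 @}'}
@print{} 105 105 105
@end example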
10953
10954@cindex string @subentry constants
10955@cindex constants @subentry string
10956A @dfn{string constant} consists of a sequence of characters enclosed in
10957double quotation marks.  For example:
10958
10959@example
10960"parrot"
10961@end example
10962
10963@noindent
10964@cindex differences in @command{awk} and @command{gawk} @subentry strings
10965@cindex strings @subentry length limitations
10966@cindex ASCII
10967represents the string whose contents are @samp{parrot}.  Strings in
10968@command{gawk} can be of any length, and they can contain any of the possible
10969eight-bit ASCII characters, including ASCII @sc{nul} (character code zero).
10970Other @command{awk}
10971implementations may have difficulty with some character codes.
10972
10973Some languages allow you to continue long strings across
multiple lines by ending the line with a backslash. For example, in C:
10975
10976@example
10977#include <stdio.h>
10978
10979int main()
10980@{
10981    printf("hello, \
10982world\n");
10983    return 0;
10984@}
10985@end example
10986
10987@noindent
10988In such a case, the C compiler removes both the backslash and the newline,
10989producing a string as if it had been typed @samp{"hello, world\n"}.
10990This is useful when a single string needs to contain a large amount of text.
10991
The POSIX standard says explicitly that newlines are not allowed inside string
constants.  And indeed, all @command{awk} implementations report an error
if you try to include one. For example:
10995
10996@example
10997$ @kbd{gawk 'BEGIN @{ print "hello, }
10998> @kbd{world" @}'}
10999@print{} gawk: cmd. line:1: BEGIN @{ print "hello,
11000@print{} gawk: cmd. line:1:               ^ unterminated string
11001@print{} gawk: cmd. line:1: BEGIN @{ print "hello,
11002@print{} gawk: cmd. line:1:               ^ syntax error
11003@end example
11004
11005@cindex dark corner @subentry string continuation
11006@cindex strings @subentry continuation across lines
11007@cindex differences in @command{awk} and @command{gawk} @subentry strings
11008Although POSIX doesn't define what happens if you use an escaped
11009newline, as in the previous C example, all known versions of
11010@command{awk} allow you to do so.  Unfortunately, what each one
11011does with such a string varies.  @value{DARKCORNER} @command{gawk},
11012@command{mawk}, and the OpenSolaris POSIX @command{awk}
11013(@pxref{Other Versions}) elide the backslash and newline, as in C:
11014
11015@example
11016$ @kbd{gawk 'BEGIN @{ print "hello, \}
11017> @kbd{world" @}'}
11018@print{} hello, world
11019@end example
11020
11021@cindex POSIX mode
11022In POSIX mode (@pxref{Options}), @command{gawk} does not
11023allow escaped newlines.  Otherwise, it behaves as just described.
11024
11025BWK @command{awk} and BusyBox @command{awk}
11026remove the backslash but leave the newline
11027intact, as part of the string:
11028
11029@example
11030$ @kbd{nawk 'BEGIN @{ print "hello, \}
11031> @kbd{world" @}'}
11032@print{} hello,
11033@print{} world
11034@end example
11035
11036@node Nondecimal-numbers
11037@subsubsection Octal and Hexadecimal Numbers
11038@cindex octal numbers
11039@cindex hexadecimal numbers
11040@cindex numbers @subentry octal
11041@cindex numbers @subentry hexadecimal
11042
11043In @command{awk}, all numbers are in decimal (i.e., base 10).  Many other
11044programming languages allow you to specify numbers in other bases, often
11045octal (base 8) and hexadecimal (base 16).
11046In octal, the numbers go 0, 1, 2, 3, 4, 5, 6, 7, 10, 11, 12, and so on.
11047Just as @samp{11} in decimal is 1 times 10 plus 1, so
11048@samp{11} in octal is 1 times 8 plus 1. This equals 9 in decimal.
11049In hexadecimal, there are 16 digits. Because the everyday decimal
11050number system only has ten digits (@samp{0}--@samp{9}), the letters
11051@samp{a} through @samp{f} represent the rest.
11052(Case in the letters is usually irrelevant; hexadecimal @samp{a} and @samp{A}
11053have the same value.)
11054Thus, @samp{11} in
11055hexadecimal is 1 times 16 plus 1, which equals 17 in decimal.
11056
11057Just by looking at plain @samp{11}, you can't tell what base it's in.
11058So, in C, C++, and other languages derived from C,
11059@c such as PERL, but we won't mention that....
11060there is a special notation to signify the base.
11061Octal numbers start with a leading @samp{0},
11062and hexadecimal numbers start with a leading @samp{0x} or @samp{0X}:
11063
11064@table @code
11065@item 11
11066Decimal value 11
11067
11068@item 011
11069Octal 11, decimal value 9
11070
11071@item 0x11
11072Hexadecimal 11, decimal value 17
11073@end table
11074
11075This example shows the difference:
11076
11077@example
11078$ @kbd{gawk 'BEGIN @{ printf "%d, %d, %d\n", 011, 11, 0x11 @}'}
11079@print{} 9, 11, 17
11080@end example
11081
11082Being able to use octal and hexadecimal constants in your programs is most
11083useful when working with data that cannot be represented conveniently as
11084characters or as regular numbers, such as binary data of various sorts.
11085
11086@cindex @command{gawk} @subentry octal numbers and
11087@cindex @command{gawk} @subentry hexadecimal numbers and
11088@command{gawk} allows the use of octal and hexadecimal
11089constants in your program text.  However, such numbers in the input data
11090are not treated differently; doing so by default would break old
11091programs.
11092(If you really need to do this, use the @option{--non-decimal-data}
11093command-line option;
11094@pxref{Nondecimal Data}.)
11095If you have octal or hexadecimal data,
11096you can use the @code{strtonum()} function
11097(@pxref{String Functions})
11098to convert the data into a number.
11099Most of the time, you will want to use octal or hexadecimal constants
11100when working with the built-in bit-manipulation functions;
11101see @ref{Bitwise Functions}
11102for more information.
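
The following sketch contrasts the default treatment of such input data
with what @code{strtonum()} does.  It assumes that @command{gawk} is
running in its default mode, without @option{--non-decimal-data} or
@option{--posix}:

@example
$ @kbd{echo 0x11 011 | gawk '@{ print $1 + 0, $2 + 0 @}'}
@print{} 0 11
$ @kbd{echo 0x11 011 | gawk '@{ print strtonum($1), strtonum($2) @}'}
@print{} 17 9
@end example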
11103
11104Unlike in some early C implementations, @samp{8} and @samp{9} are not
11105valid in octal constants.  For example, @command{gawk} treats @samp{018}
11106as decimal 18:
11107
11108@example
11109$ @kbd{gawk 'BEGIN @{ print "021 is", 021 ; print 018 @}'}
11110@print{} 021 is 17
11111@print{} 18
11112@end example
11113
11114@cindex compatibility mode (@command{gawk}) @subentry octal numbers
11115@cindex compatibility mode (@command{gawk}) @subentry hexadecimal numbers
11116Octal and hexadecimal source code constants are a @command{gawk} extension.
11117If @command{gawk} is in compatibility mode
11118(@pxref{Options}),
11119they are not available.
11120
11121@sidebar A Constant's Base Does Not Affect Its Value
11122
11123Once a numeric constant has
11124been converted internally into a number,
11125@command{gawk} no longer remembers
11126what the original form of the constant was; the internal value is
11127always used.  This has particular consequences for conversion of
11128numbers to strings:
11129
11130@example
11131$ @kbd{gawk 'BEGIN @{ printf "0x11 is <%s>\n", 0x11 @}'}
11132@print{} 0x11 is <17>
11133@end example
11134@end sidebar
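
To display a value in a particular base, request that base explicitly with
a @code{printf} format.  The following is a small illustration that uses the
@samp{%o} and @samp{%x} format-control letters together with the @samp{#} flag:

@example
$ @kbd{gawk 'BEGIN @{ printf "%d %#o %#x\n", 0x11, 0x11, 0x11 @}'}
@print{} 17 021 0x11
@end example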
11135
11136@node Regexp Constants
11137@subsubsection Regular Expression Constants
11138
11139@cindex regexp constants
11140@cindex @code{~} (tilde), @code{~} operator
11141@cindex tilde (@code{~}), @code{~} operator
11142@cindex @code{!} (exclamation point) @subentry @code{!~} operator
11143@cindex exclamation point (@code{!}) @subentry @code{!~} operator
11144A @dfn{regexp constant} is a regular expression description enclosed in
11145slashes, such as @code{@w{/^beginning and end$/}}.  Most regexps used in
11146@command{awk} programs are constant, but the @samp{~} and @samp{!~}
11147matching operators can also match computed or dynamic regexps
11148(which are typically just ordinary strings or variables that contain a regexp,
11149but could be more complex expressions).
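
The two kinds of regexp can be seen side by side in the following sketch.
The variable name @code{digits} is arbitrary:

@example
BEGIN @{ digits = "^[0-9]+$" @}     # a dynamic regexp held in a string

$1 ~ digits      @{ print $1, "is all digits" @}
$1 ~ /^[a-z]+$/  @{ print $1, "is all lowercase letters" @}
@end example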
11150
11151@node Using Constant Regexps
11152@subsection Using Regular Expression Constants
11153
11154Regular expression constants consist of text describing
11155a regular expression enclosed in slashes (such as @code{/the +answer/}).
11156This @value{SECTION} describes how such constants work in
11157POSIX @command{awk} and @command{gawk}, and then goes on to describe
11158@dfn{strongly typed regexp constants}, which are a @command{gawk} extension.
11159
11160@menu
11161* Standard Regexp Constants::   Regexp constants in standard @command{awk}.
11162* Strong Regexp Constants::     Strongly typed regexp constants.
11163@end menu
11164
11165@node Standard Regexp Constants
11166@subsubsection Standard Regular Expression Constants
11167
11168@cindex dark corner @subentry regexp constants
11169When used on the righthand side of the @samp{~} or @samp{!~}
11170operators, a regexp constant merely stands for the regexp that is to be
11171matched.
11172However, regexp constants (such as @code{/foo/}) may be used like simple expressions.
11173When a
11174regexp constant appears by itself, it has the same meaning as if it appeared
11175in a pattern (i.e., @samp{($0 ~ /foo/)}).
11176@value{DARKCORNER}
11177@xref{Expression Patterns}.
11178This means that the following two code segments:
11179
11180@example
11181if ($0 ~ /barfly/ || $0 ~ /camelot/)
11182    print "found"
11183@end example
11184
11185@noindent
11186and:
11187
11188@example
11189if (/barfly/ || /camelot/)
11190    print "found"
11191@end example
11192
11193@noindent
11194are exactly equivalent.
11195One rather bizarre consequence of this rule is that the following
11196Boolean expression is valid, but does not do what its author probably
11197intended:
11198
11199@example
11200# Note that /foo/ is on the left of the ~
11201if (/foo/ ~ $1) print "found foo"
11202@end example
11203
11204@c @cindex automatic warnings
11205@c @cindex warnings, automatic
11206@cindex @command{gawk} @subentry regexp constants and
11207@cindex regexp constants @subentry in @command{gawk}
11208@noindent
11209This code is ``obviously'' testing @code{$1} for a match against the regexp
11210@code{/foo/}.  But in fact, the expression @samp{/foo/ ~ $1} really means
11211@samp{($0 ~ /foo/) ~ $1}.  In other words, first match the input record
11212against the regexp @code{/foo/}.  The result is either zero or one,
11213depending upon the success or failure of the match.  That result
11214is then matched against the first field in the record.
11215Because it is unlikely that you would ever really want to make this kind of
11216test, @command{gawk} issues a warning when it sees this construct in
11217a program.
11218Another consequence of this rule is that the assignment statement:
11219
11220@example
11221matches = /foo/
11222@end example
11223
11224@noindent
11225assigns either zero or one to the variable @code{matches}, depending
11226upon the contents of the current input record.
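
For example, with two arbitrary input records:

@example
$ @kbd{echo seafood | gawk '@{ matches = /foo/; print matches @}'}
@print{} 1
$ @kbd{echo salad | gawk '@{ matches = /foo/; print matches @}'}
@print{} 0
@end example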
11227
11228@cindex differences in @command{awk} and @command{gawk} @subentry regexp constants
11229@cindex dark corner @subentry regexp constants @subentry as arguments to user-defined functions
11230@cindexgawkfunc{gensub}
11231@cindexawkfunc{sub}
11232@cindexawkfunc{gsub}
11233Constant regular expressions are also used as the first argument for
11234the @code{gensub()}, @code{sub()}, and @code{gsub()} functions, as the
11235second argument of the @code{match()} function,
11236and as the third argument of the @code{split()} and @code{patsplit()} functions
11237(@pxref{String Functions}).
11238Modern implementations of @command{awk}, including @command{gawk}, allow
11239the third argument of @code{split()} to be a regexp constant, but some
11240older implementations do not.
11241@value{DARKCORNER}
11242Because some built-in functions accept regexp constants as arguments,
11243confusion can arise when attempting to use regexp constants as arguments
11244to user-defined functions (@pxref{User-defined}).  For example:
11245
11246@example
11247@group
11248function mysub(pat, repl, str, global)
11249@{
11250    if (global)
11251        gsub(pat, repl, str)
11252    else
11253        sub(pat, repl, str)
11254    return str
11255@}
11256@end group
11257
11258@group
11259@{
11260    @dots{}
11261    text = "hi! hi yourself!"
11262    mysub(/hi/, "howdy", text, 1)
11263    @dots{}
11264@}
11265@end group
11266@end example
11267
11268@c @cindex automatic warnings
11269@c @cindex warnings, automatic
11270In this example, the programmer wants to pass a regexp constant to the
11271user-defined function @code{mysub()}, which in turn passes it on to
11272either @code{sub()} or @code{gsub()}.  However, what really happens is that
11273the @code{pat} parameter is assigned a value of either one or zero, depending upon whether
11274or not @code{$0} matches @code{/hi/}.
11275@command{gawk} issues a warning when it sees a regexp constant used as
11276a parameter to a user-defined function, because passing a truth value in
11277this way is probably not what was intended.
11278
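One portable way to sidestep the problem, sketched here on the assumption
that a dynamic regexp is acceptable (@pxref{Computed Regexps}), is to pass
the pattern as a string instead of as a regexp constant.  The sample text
is the same as in the previous example:

@example
$ @kbd{awk 'function mysub(pat, repl, str, global) @{}
>          @kbd{if (global) gsub(pat, repl, str); else sub(pat, repl, str)}
>          @kbd{return str @}}
>      @kbd{BEGIN @{ print mysub("hi", "howdy", "hi! hi yourself!", 1) @}'}
@print{} howdy! howdy yourself!
@end example
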
11279@node Strong Regexp Constants
11280@subsubsection Strongly Typed Regexp Constants
11281
11282This @value{SECTION} describes a @command{gawk}-specific feature.
11283
11284As we saw in the previous @value{SECTION},
11285regexp constants (@code{/@dots{}/}) hold a strange position in the
11286@command{awk} language. In most contexts, they act like an expression:
11287@samp{$0 ~ /@dots{}/}. In other contexts, they denote only a regexp to
11288be matched. In no case are they really a ``first class citizen'' of the
11289language. That is, you cannot define a scalar variable whose type is
11290``regexp'' in the same sense that you can define a variable to be a
11291number or a string:
11292
11293@example
11294num = 42        @ii{Numeric variable}
11295str = "hi"      @ii{String variable}
11296re = /foo/      @ii{Wrong!} re @ii{is the result of} $0 ~ /foo/
11297@end example
11298
11299For a number of more advanced use cases,
11300it would be nice to have regexp constants that
11301are @dfn{strongly typed}; in other words, that denote a regexp useful
11302for matching, and not an expression.
11303
11304@cindex values @subentry regexp
11305@command{gawk} provides this feature.  A strongly typed regexp constant
11306looks almost like a regular regexp constant, except that it is preceded
11307by an @samp{@@} sign:
11308
11309@example
11310re = @@/foo/     @ii{Regexp variable}
11311@end example
11312
11313Strongly typed regexp constants @emph{cannot} be used everywhere that a
11314regular regexp constant can, because this would make the language even more
11315confusing.  Instead, you may use them only in certain contexts:
11316
11317@itemize @bullet
11318@item
11319On the righthand side of the @samp{~} and @samp{!~} operators: @samp{some_var ~ @@/foo/}
11320(@pxref{Regexp Usage}).
11321
11322@item
11323In the @code{case} part of a @code{switch} statement
11324(@pxref{Switch Statement}).
11325
11326@item
11327As an argument to one of the built-in functions that accept regexp constants:
11328@code{gensub()},
11329@code{gsub()},
11330@code{match()},
11331@code{patsplit()},
11332@code{split()},
11333and
11334@code{sub()}
11335(@pxref{String Functions}).
11336
11337@item
11338As a parameter in a call to a user-defined function
11339(@pxref{User-defined}).
11340
11341@item
11342As the return value of a user-defined function.
11343
11344@item
11345On the righthand side of an assignment to a variable: @samp{some_var = @@/foo/}.
11346In this case, the type of @code{some_var} is regexp. Additionally, @code{some_var}
11347can be used with @samp{~} and @samp{!~}, passed to one of the built-in functions
11348listed above, or passed as a parameter to a user-defined function.
11349@end itemize
11350
You may use the @option{-v} option (@pxref{Options}) to assign a
strongly typed regexp constant to a variable on the command line, like so:
11353
11354@example
11355gawk -v pattern='@@/something(interesting)+/' @dots{}
11356@end example
11357
11358@noindent
11359You may also make such assignments as regular command-line arguments
11360(@pxref{Other Arguments}).
11361
11362You may use the @code{typeof()} built-in function
11363(@pxref{Type Functions})
11364to determine if a variable or function parameter is
11365a regexp variable.
11366
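For instance, the following short sketch (the variable name @code{re} is
arbitrary) prints the type that @code{typeof()} reports for a regexp
variable:

@example
$ @kbd{gawk 'BEGIN @{ re = @@/foo/; print typeof(re) @}'}
@print{} regexp
@end example
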
11367The true power of this feature comes from the ability to create variables that
11368have regexp type. Such variables can be passed on to user-defined functions,
11369without the confusing aspects of computed regular expressions created from
11370strings or string constants. They may also be passed through indirect function
11371calls (@pxref{Indirect Calls})
11372and on to the built-in functions that accept regexp constants.
11373
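As a sketch of this usage, the following passes a strongly typed regexp
constant to a user-defined function, which hands it on to @code{gsub()}.
The function @code{mysub()} here is a simplified, three-argument variant
written only for illustration:

@example
$ @kbd{gawk 'function mysub(pat, repl, str) @{ gsub(pat, repl, str); return str @}}
>       @kbd{BEGIN @{ print mysub(@@/hi/, "howdy", "hi! hi yourself!") @}'}
@print{} howdy! howdy yourself!
@end example
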
11374When used in numeric conversions, strongly typed regexp variables convert
11375to zero. When used in string conversions, they convert to the string
11376value of the original regexp text.
11377
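A brief sketch of both conversions, using an arbitrary regexp chosen for
illustration:

@example
$ @kbd{gawk 'BEGIN @{ re = @@/foo|bar/; print re + 0; print re "" @}'}
@print{} 0
@print{} foo|bar
@end example
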
11378There is an additional, interesting corner case. When used as the third
11379argument to @code{sub()} or @code{gsub()}, they retain their type.  Thus,
11380if you have something like this:
11381
11382@example
re = @@/don't panic/
11384sub(/don't/, "do", re)
11385print typeof(re), re
11386@end example
11387
11388@noindent
11389then @code{re} retains its type, but now attempts to match the string
11390@samp{do panic}.  This provides a (very indirect) way to create regexp-typed
11391variables at runtime.
11392
11393@node Variables
11394@subsection Variables
11395
11396@cindex variables @subentry user-defined
11397@cindex user-defined @subentry variables
11398@dfn{Variables} are ways of storing values at one point in your program for
11399use later in another part of your program.  They can be manipulated
11400entirely within the program text, and they can also be assigned values
11401on the @command{awk} command line.
11402
11403@menu
11404* Using Variables::             Using variables in your programs.
11405* Assignment Options::          Setting variables on the command line and a
11406                                summary of command-line syntax. This is an
11407                                advanced method of input.
11408@end menu
11409
11410@node Using Variables
11411@subsubsection Using Variables in a Program
11412
11413Variables let you give names to values and refer to them later.  Variables
11414have already been used in many of the examples.  The name of a variable
11415must be a sequence of letters, digits, or underscores, and it may not begin
11416with a digit.
11417Here, a @dfn{letter} is any one of the 52 upper- and lowercase
11418English letters.  Other characters that may be defined as letters
11419in non-English locales are not valid in variable names.
11420Case is significant in variable names; @code{a} and @code{A}
11421are distinct variables.
11422
11423A variable name is a valid expression by itself; it represents the
11424variable's current value.  Variables are given new values with
11425@dfn{assignment operators}, @dfn{increment operators}, and
11426@dfn{decrement operators}
11427(@pxref{Assignment Ops}).
11428In addition, the @code{sub()} and @code{gsub()} functions can
11429change a variable's value, and the @code{match()}, @code{split()},
11430and @code{patsplit()} functions can change the contents of their
11431array parameters (@pxref{String Functions}).
11432
11433@cindex variables @subentry built-in
11434@cindex variables @subentry initializing
11435A few variables have special built-in meanings, such as @code{FS} (the
11436field separator) and @code{NF} (the number of fields in the current input
11437record).  @xref{Built-in Variables} for a list of the predefined variables.
11438These predefined variables can be used and assigned just like all other
11439variables, but their values are also used or changed automatically by
11440@command{awk}.  All predefined variables' names are entirely uppercase.
11441
11442Variables in @command{awk} can be assigned either numeric or string values.
11443The kind of value a variable holds can change over the life of a program.
11444By default, variables are initialized to the empty string, which
is zero if converted to a number.  Unlike in C and most other
traditional languages, there is no need to explicitly
initialize a variable in @command{awk}.
11448
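The following sketch illustrates the default value; @code{x} is a
hypothetical variable that is never assigned.  The brackets make the
empty string visible:

@example
$ @kbd{awk 'BEGIN @{ print "[" x "]", x + 0 @}'}
@print{} [] 0
@end example
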
11449@node Assignment Options
11450@subsubsection Assigning Variables on the Command Line
11451@cindex variables @subentry assigning on command line
11452@cindex command line @subentry variables, assigning on
11453
11454Any @command{awk} variable can be set by including a @dfn{variable assignment}
11455among the arguments on the command line when @command{awk} is invoked
11456(@pxref{Other Arguments}).
11457Such an assignment has the following form:
11458
11459@example
11460@var{variable}=@var{text}
11461@end example
11462
11463@cindex @option{-v} option
11464@noindent
11465With it, a variable is set either at the beginning of the
11466@command{awk} run or in between input files.
11467When the assignment is preceded with the @option{-v} option,
11468as in the following:
11469
11470@example
11471-v @var{variable}=@var{text}
11472@end example
11473
11474@noindent
11475the variable is set at the very beginning, even before the
11476@code{BEGIN} rules execute.  The @option{-v} option and its assignment
11477must precede all the @value{FN} arguments, as well as the program text.
11478(@xref{Options} for more information about
11479the @option{-v} option.)
11480Otherwise, the variable assignment is performed at a time determined by
11481its position among the input file arguments---after the processing of the
11482preceding input file argument.  For example:
11483
11484@example
11485awk '@{ print $n @}' n=4 inventory-shipped n=2 mail-list
11486@end example
11487
11488@noindent
11489prints the value of field number @code{n} for all input records.  Before
11490the first file is read, the command line sets the variable @code{n}
11491equal to four.  This causes the fourth field to be printed in lines from
11492@file{inventory-shipped}.  After the first file has finished,
11493but before the second file is started, @code{n} is set to two, so that the
11494second field is printed in lines from @file{mail-list}:
11495
11496@example
11497$ @kbd{awk '@{ print $n @}' n=4 inventory-shipped n=2 mail-list}
11498@print{} 15
11499@print{} 24
11500@dots{}
11501@print{} 555-5553
11502@print{} 555-3412
11503@dots{}
11504@end example
11505
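The difference in timing between the two forms can be seen with a program
that consists only of a @code{BEGIN} rule.  In this sketch, @code{n} is
again just an illustrative variable name.  With @option{-v}, the value is
available inside @code{BEGIN}; as a plain command-line argument, it is not
yet set when @code{BEGIN} runs:

@example
$ @kbd{awk -v n=1 'BEGIN @{ print "n is [" n "]" @}'}
@print{} n is [1]
$ @kbd{awk 'BEGIN @{ print "n is [" n "]" @}' n=1}
@print{} n is []
@end example
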
11506@cindex dark corner @subentry command-line arguments
11507Command-line arguments are made available for explicit examination by
11508the @command{awk} program in the @code{ARGV} array
11509(@pxref{ARGC and ARGV}).
11510@command{awk} processes the values of command-line assignments for escape
11511sequences
11512(@pxref{Escape Sequences}).
11513@value{DARKCORNER}
11514
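For example, in the following sketch the @samp{\n} in the assigned text
becomes a real newline; the variable name @code{msg} is arbitrary:

@example
$ @kbd{awk -v 'msg=one\ntwo' 'BEGIN @{ print msg @}'}
@print{} one
@print{} two
@end example
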
11515Normally, variables assigned on the command line (with or without the
11516@option{-v} option) are treated as strings.  When such variables are
11517used as numbers, @command{awk}'s normal automatic conversion of strings
11518to numbers takes place, and everything ``just works.''
11519
11520However, @command{gawk} supports variables whose types are ``regexp''.
11521You can assign variables of this type using the following syntax:
11522
11523@example
11524gawk -v 're1=@@/foo|bar/' '@dots{}' /path/to/file1 're2=@@/baz|quux/' /path/to/file2
11525@end example
11526
11527@noindent
11528Strongly typed regexps are an advanced feature (@pxref{Strong Regexp Constants}).
11529We mention them here only for completeness.
11530
11531@node Conversion
11532@subsection Conversion of Strings and Numbers
11533
Number-to-string and string-to-number conversions are generally
straightforward.  However, there are subtleties to be aware of;
this @value{SECTION} discusses this important facet of @command{awk}.
11537
11538@menu
11539* Strings And Numbers::         How @command{awk} Converts Between Strings And
11540                                Numbers.
11541* Locale influences conversions:: How the locale may affect conversions.
11542@end menu
11543
11544@node Strings And Numbers
11545@subsubsection How @command{awk} Converts Between Strings and Numbers
11546
11547@cindex converting @subentry string to numbers
11548@cindex strings @subentry converting
11549@cindex numbers @subentry converting
11550@cindex converting @subentry numbers to strings
11551Strings are converted to numbers and numbers are converted to strings, if the context
11552of the @command{awk} program demands it.  For example, if the value of
11553either @code{foo} or @code{bar} in the expression @samp{foo + bar}
11554happens to be a string, it is converted to a number before the addition
11555is performed.  If numeric values appear in string concatenation, they
11556are converted to strings.  Consider the following:
11557
11558@example
11559@group
11560two = 2; three = 3
11561print (two three) + 4
11562@end group
11563@end example
11564
11565@noindent
11566This prints the (numeric) value 27.  The numeric values of
11567the variables @code{two} and @code{three} are converted to strings and
11568concatenated together.  The resulting string is converted back to the
11569number 23, to which 4 is then added.
11570
11571@cindex null strings @subentry converting numbers to strings
11572@cindex type @subentry conversion
11573If, for some reason, you need to force a number to be converted to a
11574string, concatenate that number with the empty string, @code{""}.
11575To force a string to be converted to a number, add zero to that string.
11576A string is converted to a number by interpreting any numeric prefix
11577of the string as numerals:
11578@code{"2.5"} converts to 2.5, @code{"1e3"} converts to 1,000, and @code{"25fix"}
11579has a numeric value of 25.
11580Strings that can't be interpreted as valid numbers convert to zero.
11581
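These rules can be checked with a one-line sketch that adds zero to each
string; the last string, @code{"fix25"}, is an extra case added here to
show a string with no numeric prefix:

@example
$ @kbd{awk 'BEGIN @{ print "2.5" + 0, "1e3" + 0, "25fix" + 0, "fix25" + 0 @}'}
@print{} 2.5 1000 25 0
@end example
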
11582@cindex @code{CONVFMT} variable
11583The exact manner in which numbers are converted into strings is controlled
11584by the @command{awk} predefined variable @code{CONVFMT} (@pxref{Built-in Variables}).
11585Numbers are converted using the @code{sprintf()} function
11586with @code{CONVFMT} as the format
11587specifier
11588(@pxref{String Functions}).
11589
11590@code{CONVFMT}'s default value is @code{"%.6g"}, which creates a value with
11591at most six significant digits.  For some applications, you might want to
11592change it to specify more precision.
11593On most modern machines,
1159417 digits is usually enough to capture a floating-point number's
11595value exactly.@footnote{Pathological cases can require up to
11596752 digits (!), but we doubt that you need to worry about this.}
11597
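As an illustration, the following sketch converts the same number to a
string with the default @code{CONVFMT} and then with a more precise
setting.  The value and the format @code{"%.15g"} are chosen only as
examples.  The concatenation with @code{""} is what forces the
number-to-string conversion:

@example
$ @kbd{awk 'BEGIN @{ x = 3.14159265358979; s = x ""; print s @}'}
@print{} 3.14159
$ @kbd{awk 'BEGIN @{ CONVFMT = "%.15g"; x = 3.14159265358979; s = x ""; print s @}'}
@print{} 3.14159265358979
@end example
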
11598@cindex dark corner @subentry @code{CONVFMT} variable
11599Strange results can occur if you set @code{CONVFMT} to a string that doesn't
11600tell @code{sprintf()} how to format floating-point numbers in a useful way.
11601For example, if you forget the @samp{%} in the format, @command{awk} converts
11602all numbers to the same constant string.
11603
11604As a special case, if a number is an integer, then the result of converting
11605it to a string is @emph{always} an integer, no matter what the value of
11606@code{CONVFMT} may be.  Given the following code fragment:
11607
11608@example
11609CONVFMT = "%2.2f"
11610a = 12
11611b = a ""
11612@end example
11613
11614@noindent
11615@code{b} has the value @code{"12"}, not @code{"12.00"}.
11616@value{DARKCORNER}
11617
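To contrast the integer case with a nonintegral value, here is a small
sketch using the same @code{CONVFMT} setting; the variables @code{c},
@code{s}, and @code{t} are added for illustration:

@example
$ @kbd{awk 'BEGIN @{ CONVFMT = "%2.2f"; a = 12; c = 12.5}
>             @kbd{s = a ""; t = c ""; print s, t @}'}
@print{} 12 12.50
@end example
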
11618@sidebar Pre-POSIX @command{awk} Used @code{OFMT} for String Conversion
11619@cindex POSIX @command{awk} @subentry @code{OFMT} variable and
11620@cindex @code{OFMT} variable
11621@cindex portability @subentry new @command{awk} vs.@: old @command{awk}
11622@cindex @command{awk} @subentry new vs.@: old @subentry @code{OFMT} variable
11623Prior to the POSIX standard, @command{awk} used the value
11624of @code{OFMT} for converting numbers to strings.  @code{OFMT}
11625specifies the output format to use when printing numbers with @code{print}.
11626@code{CONVFMT} was introduced in order to separate the semantics of
11627conversion from the semantics of printing.  Both @code{CONVFMT} and
11628@code{OFMT} have the same default value: @code{"%.6g"}.  In the vast majority
11629of cases, old @command{awk} programs do not change their behavior.
11630@xref{Print} for more information on the @code{print} statement.
11631@end sidebar
11632
11633@node Locale influences conversions
11634@subsubsection Locales Can Influence Conversion
11635
11636Where you are can matter when it comes to converting between numbers and
11637strings.  The local character set and language---the @dfn{locale}---can
11638affect numeric formats.  In particular, for @command{awk} programs,
11639it affects the decimal point character and the thousands-separator
11640character.  The @code{"C"} locale, and most English-language locales,
11641use the period character (@samp{.}) as the decimal point and don't
11642have a thousands separator.  However, many (if not most) European and
11643non-English locales use the comma (@samp{,}) as the decimal point
11644character. European locales often use either a space or a period as
11645the thousands separator, if they have one.
11646
11647@cindex dark corner @subentry locale's decimal point character
11648The POSIX standard says that @command{awk} always uses the period as the decimal
11649point when reading the @command{awk} program source code, and for
11650command-line variable assignments (@pxref{Other Arguments}).  However,
11651when interpreting input data, for @code{print} and @code{printf} output,
11652and for number-to-string conversion, the local decimal point character
11653is used.  @value{DARKCORNER} In all cases, numbers in source code and
11654in input data cannot have a thousands separator.  Here are some examples
11655indicating the difference in behavior, on a GNU/Linux system:
11656
11657@example
11658$ @kbd{export POSIXLY_CORRECT=1}                        @ii{Force POSIX behavior}
11659$ @kbd{gawk 'BEGIN @{ printf "%g\n", 3.1415927 @}'}
11660@print{} 3.14159
11661$ @kbd{LC_ALL=en_DK.utf-8 gawk 'BEGIN @{ printf "%g\n", 3.1415927 @}'}
11662@print{} 3,14159
11663$ @kbd{echo 4,321 | gawk '@{ print $1 + 1 @}'}
11664@print{} 5
11665$ @kbd{echo 4,321 | LC_ALL=en_DK.utf-8 gawk '@{ print $1 + 1 @}'}
11666@print{} 5,321
11667@end example
11668
11669@noindent
11670The @code{en_DK.utf-8} locale is for English in Denmark, where the comma acts as
11671the decimal point separator.  In the normal @code{"C"} locale, @command{gawk}
11672treats @samp{4,321} as 4, while in the Danish locale, it's treated
11673as the full number including the fractional part, 4.321.
11674
11675@cindex POSIX mode
11676Some earlier versions of @command{gawk} fully complied with this aspect
11677of the standard.  However, many users in non-English locales complained
11678about this behavior, because their data used a period as the decimal
11679point, so the default behavior was restored to use a period as the
11680decimal point character.  You can use the @option{--use-lc-numeric}
11681option (@pxref{Options}) to force @command{gawk} to use the locale's
11682decimal point character.  (@command{gawk} also uses the locale's decimal
11683point character when in POSIX mode, either via @option{--posix} or the
11684@env{POSIXLY_CORRECT} environment variable, as shown previously.)
11685
11686@ref{table-locale-affects} describes the cases in which the locale's decimal
11687point character is used and when a period is used. Some of these
11688features have not been described yet.
11689
11690@float Table,table-locale-affects
11691@caption{Locale decimal point versus a period}
11692@multitable @columnfractions .15 .20 .45
11693@headitem Feature @tab Default @tab @option{--posix} or @option{--use-lc-numeric}
11694@item @code{%'g} @tab Use locale @tab Use locale
11695@item @code{%g} @tab Use period @tab Use locale
11696@item Input @tab Use period @tab Use locale
11697@item @code{strtonum()} @tab Use period @tab Use locale
11698@end multitable
11699@end float
11700
11701Finally, modern-day formal standards and the IEEE standard floating-point
11702representation can have an unusual but important effect on the way
11703@command{gawk} converts some special string values to numbers.  The details
11704are presented in @ref{POSIX Floating Point Problems}.
11705
11706@node All Operators
11707@section Operators: Doing Something with Values
11708
11709This @value{SECTION} introduces the @dfn{operators} that make use
11710of the values provided by constants and variables.
11711
11712@menu
11713* Arithmetic Ops::              Arithmetic operations (@samp{+}, @samp{-},
11714                                etc.)
11715* Concatenation::               Concatenating strings.
11716* Assignment Ops::              Changing the value of a variable or a field.
11717* Increment Ops::               Incrementing the numeric value of a variable.
11718@end menu
11719
11720@node Arithmetic Ops
11721@subsection Arithmetic Operators
11722@cindex arithmetic operators
11723@cindex operators @subentry arithmetic
11724@c @cindex addition
11725@c @cindex subtraction
11726@c @cindex multiplication
11727@c @cindex division
11728@c @cindex remainder
11729@c @cindex quotient
11730@c @cindex exponentiation
11731
11732The @command{awk} language uses the common arithmetic operators when
11733evaluating expressions.  All of these arithmetic operators follow normal
11734precedence rules and work as you would expect them to.
11735
11736The following example uses a file named @file{grades}, which contains
11737a list of student names as well as three test scores per student (it's
11738a small class):
11739
11740@example
11741Pat   100 97 58
11742Sandy  84 72 93
11743Chris  72 92 89
11744@end example
11745
11746@noindent
11747This program takes the file @file{grades} and prints the average
11748of the scores:
11749
11750@example
11751$ @kbd{awk '@{ sum = $2 + $3 + $4 ; avg = sum / 3}
11752>        @kbd{print $1, avg @}' grades}
11753@print{} Pat 85
11754@print{} Sandy 83
11755@print{} Chris 84.3333
11756@end example
11757
11758The following list provides the arithmetic operators in @command{awk},
11759in order from the highest precedence to the lowest:
11760
11761@table @code
11762@cindex common extensions @subentry @code{**} operator
11763@cindex extensions @subentry common @subentry @code{**} operator
11764@cindex POSIX @command{awk} @subentry arithmetic operators and
11765@item @var{x} ^ @var{y}
11766@itemx @var{x} ** @var{y}
11767Exponentiation; @var{x} raised to the @var{y} power.  @samp{2 ^ 3} has
11768the value eight; the character sequence @samp{**} is equivalent to
11769@samp{^}. @value{COMMONEXT}
11770
11771@item - @var{x}
11772Negation.
11773
11774@item + @var{x}
11775Unary plus; the expression is converted to a number.
11776
11777@item @var{x} * @var{y}
11778Multiplication.
11779
11780@cindex troubleshooting @subentry division
11781@cindex division
11782@item @var{x} / @var{y}
11783Division;  because all numbers in @command{awk} are floating-point
11784numbers, the result is @emph{not} rounded to an integer---@samp{3 / 4} has
11785the value 0.75.  (It is a common mistake, especially for C programmers,
11786to forget that @emph{all} numbers in @command{awk} are floating point,
11787and that division of integer-looking constants produces a real number,
11788not an integer.)
11789
11790@item @var{x} % @var{y}
11791Remainder; further discussion is provided in the text, just
11792after this list.
11793
11794@item @var{x} + @var{y}
11795Addition.
11796
11797@item @var{x} - @var{y}
11798Subtraction.
11799@end table
11800
11801Unary plus and minus have the same precedence,
11802the multiplication operators all have the same precedence, and
11803addition and subtraction have the same precedence.
11804
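The following sketch exercises a few of these rules.  Exponentiation binds
more tightly than negation, so @samp{-2 ^ 2} is the negation of four;
multiplication is done before addition; and division yields a
floating-point result:

@example
$ @kbd{awk 'BEGIN @{ print -2 ^ 2, 2 + 3 * 4, 3 / 4 @}'}
@print{} -4 14 0.75
@end example
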
11805@cindex differences in @command{awk} and @command{gawk} @subentry trunc-mod operation
11806@cindex trunc-mod operation
11807When computing the remainder of @samp{@var{x} % @var{y}},
11808the quotient is rounded toward zero to an integer and
11809multiplied by @var{y}. This result is subtracted from @var{x};
11810this operation is sometimes known as ``trunc-mod.''  The following
11811relation always holds:
11812
11813@example
11814b * int(a / b) + (a % b) == a
11815@end example
11816
11817One possibly undesirable effect of this definition of remainder is that
11818@samp{@var{x} % @var{y}} is negative if @var{x} is negative.  Thus:
11819
11820@example
11821-17 % 8 = -1
11822@end example
11823
11824@noindent
11825This definition is compliant with the POSIX standard, which says that the @code{%}
11826operator produces results equivalent to using the standard C
11827@code{fmod()} function, and that function in turn works as just
11828described.
11829
11830In other @command{awk} implementations, the signedness of the remainder
11831may be machine-dependent.
11832
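The relation given earlier can be checked with a short sketch.  The first
two values printed are the remainder and the reconstructed dividend.  The
third uses a common idiom, which is an addition here and not part of the
definition above, for obtaining a nonnegative remainder when @code{b} is
positive:

@example
$ @kbd{awk 'BEGIN @{ a = -17; b = 8; r = a % b}
>             @kbd{print r, b * int(a / b) + r, (r + b) % b @}'}
@print{} -1 -17 7
@end example
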
11833@cindex portability @subentry @code{**} operator and
11834@cindex @code{*} (asterisk) @subentry @code{**} operator
11835@cindex asterisk (@code{*}) @subentry @code{**} operator
11836@quotation NOTE
11837The POSIX standard only specifies the use of @samp{^}
11838for exponentiation.
11839For maximum portability, do not use the @samp{**} operator.
11840@end quotation
11841
11842@node Concatenation
11843@subsection String Concatenation
11844@cindex Kernighan, Brian @subentry quotes
11845@quotation
11846@i{It seemed like a good idea at the time.}
11847@author Brian Kernighan
11848@end quotation
11849
11850@cindex string @subentry operators
11851@cindex operators @subentry string
11852@cindex concatenating
11853There is only one string operation: concatenation.  It does not have a
11854specific operator to represent it.  Instead, concatenation is performed by
11855writing expressions next to one another, with no operator.  For example:
11856
11857@example
11858$ @kbd{awk '@{ print "Field number one: " $1 @}' mail-list}
11859@print{} Field number one: Amelia
11860@print{} Field number one: Anthony
11861@dots{}
11862@end example
11863
11864Without the space in the string constant after the @samp{:}, the line
11865runs together.  For example:
11866
11867@example
11868$ @kbd{awk '@{ print "Field number one:" $1 @}' mail-list}
11869@print{} Field number one:Amelia
11870@print{} Field number one:Anthony
11871@dots{}
11872@end example
11873
11874@cindex troubleshooting @subentry string concatenation
11875Because string concatenation does not have an explicit operator, it is
11876often necessary to ensure that it happens at the right time by using
11877parentheses to enclose the items to concatenate.  For example,
11878you might expect that the
11879following code fragment concatenates @code{file} and @code{name}:
11880
11881@example
11882file = "file"
11883name = "name"
11884print "something meaningful" > file name
11885@end example
11886
11887@cindex Brian Kernighan's @command{awk}
11888@cindex @command{mawk} utility
11889@noindent
11890This produces a syntax error with some versions of Unix
11891@command{awk}.@footnote{It happens that BWK
11892@command{awk}, @command{gawk}, and @command{mawk} all ``get it right,''
11893but you should not rely on this.}
11894It is necessary to use the following:
11895
11896@example
11897print "something meaningful" > (file name)
11898@end example
11899
11900@cindex order of evaluation, concatenation
11901@cindex evaluation order @subentry concatenation
11902@cindex side effects
11903Parentheses should be used around concatenation in all but the
11904most common contexts, such as on the righthand side of @samp{=}.
11905Be careful about the kinds of expressions used in string concatenation.
11906In particular, the order of evaluation of expressions used for concatenation
11907is undefined in the @command{awk} language.  Consider this example:
11908
11909@example
11910BEGIN @{
11911    a = "don't"
11912    print (a " " (a = "panic"))
11913@}
11914@end example
11915
11916@noindent
11917It is not defined whether the second assignment to @code{a} happens
11918before or after the value of @code{a} is retrieved for producing the
11919concatenated value.  The result could be either @samp{don't panic},
11920or @samp{panic panic}.
11921@c see test/nasty.awk for a worse example
11922
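If the order matters, one way to make it explicit is to do the assignment
in a separate statement.  In this sketch, @code{old} is a hypothetical
temporary variable introduced for the purpose:

@example
BEGIN @{
    a = "don't"
    old = a              # save the current value first
    a = "panic"
    print (old " " a)
@}
@end example

@noindent
Written this way, the program prints @samp{don't panic}, because each
value is fixed before the concatenation is evaluated.
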
11923The precedence of concatenation, when mixed with other operators, is often
11924counter-intuitive.  Consider this example:
11925
11926@ignore
11927> To: bug-gnu-utils@@gnu.org
11928> CC: arnold@@gnu.org
11929> Subject: gawk 3.0.4 bug with {print -12 " " -24}
11930> From: Russell Schulz <Russell_Schulz@locutus.ofB.ORG>
11931> Date: Tue, 8 Feb 2000 19:56:08 -0700
11932>
11933> gawk 3.0.4 on NT gives me:
11934>
11935> prompt> cat bad.awk
11936> BEGIN { print -12 " " -24; }
11937>
11938> prompt> gawk -f bad.awk
11939> -12-24
11940>
11941> when I would expect
11942>
11943> -12 -24
11944>
11945> I have not investigated the source, or other implementations.  The
11946> bug is there on my NT and DOS versions 2.15.6 .
11947@end ignore
11948
11949@example
11950$ @kbd{awk 'BEGIN @{ print -12 " " -24 @}'}
11951@print{} -12-24
11952@end example
11953
11954This ``obviously'' is concatenating @minus{}12, a space, and @minus{}24.
11955But where did the space disappear to?
11956The answer lies in the combination of operator precedences and
11957@command{awk}'s automatic conversion rules.  To get the desired result,
11958write the program this way:
11959
11960@example
11961$ @kbd{awk 'BEGIN @{ print -12 " " (-24) @}'}
11962@print{} -12 -24
11963@end example
11964
11965This forces @command{awk} to treat the @samp{-} on the @samp{-24} as unary.
11966Otherwise, it's parsed as follows:
11967
11968@display
11969    @minus{}12 (@code{"@ "} @minus{} 24)
11970@result{} @minus{}12 (0 @minus{} 24)
11971@result{} @minus{}12 (@minus{}24)
11972@result{} @minus{}12@minus{}24
11973@end display
11974
11975As mentioned earlier,
11976when mixing concatenation with other operators, @emph{parenthesize}.  Otherwise,
11977you're never quite sure what you'll get.
11978
11979@node Assignment Ops
11980@subsection Assignment Expressions
11981@cindex assignment operators
11982@cindex operators @subentry assignment
11983@cindex expressions @subentry assignment
11984@cindex @code{=} (equals sign) @subentry @code{=} operator
11985@cindex equals sign (@code{=}) @subentry @code{=} operator
11986An @dfn{assignment} is an expression that stores a (usually different)
11987value into a variable.  For example, let's assign the value one to the variable
11988@code{z}:
11989
11990@example
11991z = 1
11992@end example
11993
11994After this expression is executed, the variable @code{z} has the value one.
11995Whatever old value @code{z} had before the assignment is forgotten.
11996
11997Assignments can also store string values.  For example, the
11998following stores
11999the value @code{"this food is good"} in the variable @code{message}:
12000
12001@example
12002thing = "food"
12003predicate = "good"
12004message = "this " thing " is " predicate
12005@end example
12006
12007@noindent
12008@cindex side effects @subentry assignment expressions
12009This also illustrates string concatenation.
12010The @samp{=} sign is called an @dfn{assignment operator}.  It is the
12011simplest assignment operator because the value of the righthand
12012operand is stored unchanged.
12013Most operators (addition, concatenation, and so on) have no effect
12014except to compute a value.  If the value isn't used, there's no reason to
12015use the operator.  An assignment operator is different; it does
12016produce a value, but even if you ignore it, the assignment still
12017makes itself felt through the alteration of the variable.  We call this
12018a @dfn{side effect}.
12019
12020@cindex lvalues/rvalues
12021@cindex rvalues/lvalues
12022@cindex assignment operators @subentry lvalues/rvalues
12023@cindex operators @subentry assignment
12024The lefthand operand of an assignment need not be a variable
12025(@pxref{Variables}); it can also be a field
12026(@pxref{Changing Fields}) or
12027an array element (@pxref{Arrays}).
12028These are all called @dfn{lvalues},
12029which means they can appear on the lefthand side of an assignment operator.
12030The righthand operand may be any expression; it produces the new value
12031that the assignment stores in the specified variable, field, or array
12032element. (Such values are called @dfn{rvalues}.)
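
As a minimal sketch of the three kinds of lvalue, the following command
assigns to a field, to an array element, and to an ordinary variable.
The names @code{arr} and @code{n} are made up for this illustration:

@example
$ @kbd{echo 'a b c' | awk '@{ $2 = "X"; arr["k"] = NF; n = arr["k"] + 1}
> @kbd{                     print $0, arr["k"], n @}'}
@print{} a X c 3 4
@end example

@noindent
Assigning to @code{$2} rebuilds the record, which is why @code{$0}
prints with the new field value.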
12033
12034@cindex variables @subentry types of
12035It is important to note that variables do @emph{not} have permanent types.
12036A variable's type is simply the type of whatever value was last assigned
12037to it.  In the following program fragment, the variable
12038@code{foo} has a numeric value at first, and a string value later on:
12039
12040@example
12041@group
12042foo = 1
12043print foo
12044@end group
12045@group
12046foo = "bar"
12047print foo
12048@end group
12049@end example
12050
12051@noindent
12052When the second assignment gives @code{foo} a string value, the fact that
12053it previously had a numeric value is forgotten.
12054
String values that do not begin with something that looks like a number
have a numeric value of zero. After executing the following code, the
value of @code{foo} is five:
12057
12058@example
12059foo = "a string"
12060foo = foo + 5
12061@end example
12062
12063@quotation NOTE
12064Using a variable as a number and then later as a string
12065can be confusing and is poor programming style.  The previous two examples
12066illustrate how @command{awk} works, @emph{not} how you should write your
12067programs!
12068@end quotation
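
To watch the string-to-number conversion without reusing one variable
in both roles, here is a small sketch; the variable names @code{s} and
@code{t} are illustrative only:

@example
$ @kbd{awk 'BEGIN @{ s = "a string"; t = "3 apples"; print s + 5, t + 5 @}'}
@print{} 5 8
@end example

@noindent
Only a leading numeric part of a string, if there is one, contributes
to its numeric value.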
12069
12070An assignment is an expression, so it has a value---the same value that
12071is assigned.  Thus, @samp{z = 1} is an expression with the value one.
12072One consequence of this is that you can write multiple assignments together,
12073such as:
12074
12075@example
12076x = y = z = 5
12077@end example
12078
12079@noindent
12080This example stores the value five in all three variables
12081(@code{x}, @code{y}, and @code{z}).
12082It does so because the
12083value of @samp{z = 5}, which is five, is stored into @code{y} and then
12084the value of @samp{y = z = 5}, which is five, is stored into @code{x}.
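
A quick way to check this, shown here only as a sketch:

@example
$ @kbd{awk 'BEGIN @{ x = y = z = 5; print x, y, z @}'}
@print{} 5 5 5
@end example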
12085
Assignments may be used anywhere an expression is called for.  For
example, it is valid to write @samp{x != (y = 1)} to set @code{y} to one,
and then test whether @code{x} is not equal to one.  But this style tends to make
programs hard to read; such nesting of assignments should be avoided,
except perhaps in a one-shot program.
12091
12092@cindex @code{+} (plus sign) @subentry @code{+=} operator
12093@cindex plus sign (@code{+}) @subentry @code{+=} operator
12094Aside from @samp{=}, there are several other assignment operators that
12095do arithmetic with the old value of the variable.  For example, the
12096operator @samp{+=} computes a new value by adding the righthand value
12097to the old value of the variable.  Thus, the following assignment adds
12098five to the value of @code{foo}:
12099
12100@example
12101foo += 5
12102@end example
12103
12104@noindent
12105This is equivalent to the following:
12106
12107@example
12108foo = foo + 5
12109@end example
12110
12111@noindent
12112Use whichever makes the meaning of your program clearer.
12113
12114There are situations where using @samp{+=} (or any assignment operator)
12115is @emph{not} the same as simply repeating the lefthand operand in the
12116righthand expression.  For example:
12117
12118@cindex Rankin, Pat
12119@example
12120@group
12121# Thanks to Pat Rankin for this example
12122BEGIN  @{
12123    foo[rand()] += 5
12124    for (x in foo)
12125       print x, foo[x]
12126@end group
12127
12128@group
12129    bar[rand()] = bar[rand()] + 5
12130    for (x in bar)
12131       print x, bar[x]
12132@}
12133@end group
12134@end example
12135
12136@cindex operators @subentry assignment @subentry evaluation order
12137@cindex assignment operators @subentry evaluation order
12138@noindent
12139The indices of @code{bar} are practically guaranteed to be different, because
12140@code{rand()} returns different values each time it is called.
12141(Arrays and the @code{rand()} function haven't been covered yet.
12142@xref{Arrays},
12143and
12144@ifnotdocbook
12145@pxref{Numeric Functions}
12146@end ifnotdocbook
12147@ifdocbook
12148@ref{Numeric Functions}
12149@end ifdocbook
12150for more information.)
12151This example illustrates an important fact about assignment
12152operators: the lefthand expression is only evaluated @emph{once}.
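
One way to make the difference visible is to count how often the
subscript expression runs.  The following sketch uses a hypothetical
helper function, @code{idx()}, that increments a counter named
@code{calls} each time it is called.  (User-defined functions haven't
been covered yet either.)

@example
function idx()
@{
    calls++        # count each evaluation of the subscript
    return "k"
@}

BEGIN @{
    calls = 0
    foo[idx()] += 5
    print calls

    calls = 0
    bar[idx()] = bar[idx()] + 5
    print calls
@}
@end example

@noindent
The first @code{print} reports one call, and the second reports two.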
12153
Which operand is evaluated first, the lefthand or the righthand,
is up to the implementation.
12156Consider this example:
12157
12158@example
12159i = 1
12160a[i += 2] = i + 1
12161@end example
12162
12163@noindent
12164The value of @code{a[3]} could be either two or four.
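
If one particular result is wanted, a portable approach is to split
the work into separate statements so that the order is explicit.  For
instance, this sketch always stores four in @code{a[3]}:

@example
i = 1
i += 2
a[i] = i + 1
@end example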
12165
12166@ref{table-assign-ops} lists the arithmetic assignment operators.  In each
12167case, the righthand operand is an expression whose value is converted
12168to a number.
12169
12170@cindex @code{-} (hyphen) @subentry @code{-=} operator
12171@cindex hyphen (@code{-}) @subentry @code{-=} operator
12172@cindex @code{*} (asterisk) @subentry @code{*=} operator
12173@cindex asterisk (@code{*}) @subentry @code{*=} operator
12174@cindex @code{/} (forward slash) @subentry @code{/=} operator
12175@cindex forward slash (@code{/}) @subentry @code{/=} operator
12176@cindex @code{%} (percent sign) @subentry @code{%=} operator
12177@cindex percent sign (@code{%}) @subentry @code{%=} operator
12178@cindex @code{^} (caret) @subentry @code{^=} operator
12179@cindex caret (@code{^}) @subentry @code{^=} operator
12180@cindex @code{*} (asterisk) @subentry @code{**=} operator
12181@cindex asterisk (@code{*}) @subentry @code{**=} operator
12182@float Table,table-assign-ops
12183@caption{Arithmetic assignment operators}
12184@multitable @columnfractions .30 .70
12185@headitem Operator @tab Effect
12186@item @var{lvalue} @code{+=} @var{increment} @tab Add @var{increment} to the value of @var{lvalue}.
12187@item @var{lvalue} @code{-=} @var{decrement} @tab Subtract @var{decrement} from the value of @var{lvalue}.
12188@item @var{lvalue} @code{*=} @var{coefficient} @tab Multiply the value of @var{lvalue} by @var{coefficient}.
12189@item @var{lvalue} @code{/=} @var{divisor} @tab Divide the value of @var{lvalue} by @var{divisor}.
12190@item @var{lvalue} @code{%=} @var{modulus} @tab Set @var{lvalue} to its remainder by @var{modulus}.
12191@cindex common extensions @subentry @code{**=} operator
12192@cindex extensions @subentry common @subentry @code{**=} operator
12193@cindex @command{awk} @subentry language, POSIX version
12194@cindex POSIX @command{awk}
12195@item @var{lvalue} @code{^=} @var{power} @tab Raise @var{lvalue} to the power @var{power}.
12196@item @var{lvalue} @code{**=} @var{power} @tab Raise @var{lvalue} to the power @var{power}. @value{COMMONEXT}
12197@end multitable
12198@end float
12199
12200@cindex POSIX @command{awk} @subentry @code{**=} operator and
12201@cindex portability @subentry @code{**=} operator and
12202@quotation NOTE
12203Only the @samp{^=} operator is specified by POSIX.
12204For maximum portability, do not use the @samp{**=} operator.
12205@end quotation
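
As an illustration, the following sketch applies the POSIX operators
from @ref{table-assign-ops} one after another to a variable @code{x}
(the starting value is arbitrary):

@example
$ @kbd{awk 'BEGIN @{ x = 17; x -= 2; x *= 4; x /= 3; x %= 7; x ^= 2; print x @}'}
@print{} 36
@end example

@noindent
The intermediate values of @code{x} are 15, 60, 20, and 6.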
12206
12207@sidebar Syntactic Ambiguities Between @samp{/=} and Regular Expressions
12208@cindex dark corner @subentry regexp constants @subentry @code{/=} operator and
12209@cindex @code{/} (forward slash) @subentry @code{/=} operator @subentry vs.@: @code{/=@dots{}/} regexp constant
12210@cindex forward slash (@code{/}) @subentry @code{/=} operator @subentry vs.@: @code{/=@dots{}/} regexp constant
12211@cindex regexp constants @subentry @code{/=@dots{}/} @subentry @code{/=} operator and
12212
12213@c derived from email from  "Nelson H. F. Beebe" <beebe@math.utah.edu>
12214@c Date: Mon, 1 Sep 1997 13:38:35 -0600 (MDT)
12215
12216@cindex dark corner @subentry @code{/=} operator vs.@: @code{/=@dots{}/} regexp constant
12217@cindex ambiguity, syntactic: @code{/=} operator vs.@: @code{/=@dots{}/} regexp constant
12218@cindex syntactic ambiguity: @code{/=} operator vs.@: @code{/=@dots{}/} regexp constant
12219@cindex @code{/=} operator vs.@: @code{/=@dots{}/} regexp constant
12220There is a syntactic ambiguity between the @code{/=} assignment
12221operator and regexp constants whose first character is an @samp{=}.
12222@value{DARKCORNER}
12223This is most notable in some commercial @command{awk} versions.
12224For example:
12225
12226@example
12227$ @kbd{awk /==/ /dev/null}
12228@error{} awk: syntax error at source line 1
12229@error{}  context is
12230@error{}         >>> /= <<<
12231@error{} awk: bailing out at source line 1
12232@end example
12233
12234@noindent
12235A workaround is:
12236
12237@example
12238awk '/[=]=/' /dev/null
12239@end example
12240
12241@command{gawk} does not have this problem; BWK @command{awk}
12242and @command{mawk} also do not.
12243@end sidebar
12244
12245@node Increment Ops
12246@subsection Increment and Decrement Operators
12247
12248@cindex increment operators
12249@cindex operators @subentry decrement/increment
12250@dfn{Increment} and @dfn{decrement operators} increase or decrease the value of
12251a variable by one.  An assignment operator can do the same thing, so
12252the increment operators add no power to the @command{awk} language; however, they
12253are convenient abbreviations for very common operations.
12254
12255@cindex side effects
12256@cindex @code{+} (plus sign) @subentry @code{++} operator
12257@cindex plus sign (@code{+}) @subentry @code{++} operator
12258@cindex side effects @subentry decrement/increment operators
12259The operator used for adding one is written @samp{++}.  It can be used to increment
12260a variable either before or after taking its value.
12261To @dfn{pre-increment} a variable @code{v}, write @samp{++v}.  This adds
12262one to the value of @code{v}---that new value is also the value of the
12263expression. (The assignment expression @samp{v += 1} is completely equivalent.)
12264Writing the @samp{++} after the variable specifies @dfn{post-increment}.  This
12265increments the variable value just the same; the difference is that the
12266value of the increment expression itself is the variable's @emph{old}
12267value.  Thus, if @code{foo} has the value four, then the expression @samp{foo++}
12268has the value four, but it changes the value of @code{foo} to five.
12269In other words, the operator returns the old value of the variable,
12270but with the side effect of incrementing it.
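
The difference between the two forms can be sketched as follows.
Each increment is in its own statement, so the result does not depend
on evaluation order; the names @code{post} and @code{pre} are
illustrative:

@example
$ @kbd{awk 'BEGIN @{ foo = 4; post = foo++; print post, foo}
> @kbd{             pre = ++foo; print pre, foo @}'}
@print{} 4 5
@print{} 6 6
@end example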
12271
12272The post-increment @samp{foo++} is nearly the same as writing @samp{(foo
12273+= 1) - 1}.  It is not perfectly equivalent because all numbers in
12274@command{awk} are floating point---in floating point, @samp{foo + 1 - 1} does
12275not necessarily equal @code{foo}.  But the difference is minute as
12276long as you stick to numbers that are fairly small (less than
12277@iftex
12278@math{10^{12}}).
12279@end iftex
12280@ifinfo
10^12).
12282@end ifinfo
12283@ifnottex
12284@ifnotinfo
1228510@sup{12}).
12286@end ifnotinfo
12287@end ifnottex
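
For a value large enough to show the effect, consider the following
sketch.  It assumes that numbers are IEEE 754 double-precision values,
as is the case for most @command{awk} implementations on current
systems; the variable @code{same} is illustrative:

@example
$ @kbd{awk 'BEGIN @{ foo = 2^53; same = (foo + 1 - 1 == foo); print same @}'}
@print{} 0
@end example

@noindent
Under that assumption, @samp{foo + 1} cannot be represented exactly
and rounds back to @code{foo}, so subtracting one yields a different
value.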
12288
12289@cindex @code{$} (dollar sign) @subentry incrementing fields and arrays
12290@cindex dollar sign (@code{$}) @subentry incrementing fields and arrays
12291Fields and array elements are incremented
12292just like variables.  (Use @samp{$(i++)} when you want to do a field reference
12293and a variable increment at the same time.  The parentheses are necessary
12294because of the precedence of the field reference operator @samp{$}.)
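
For example, this sketch walks through the fields of a record with
@samp{$(i++)}:

@example
$ @kbd{echo 'a b c' | awk '@{ i = 1; while (i <= NF) print $(i++) @}'}
@print{} a
@print{} b
@print{} c
@end example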
12295
12296@cindex decrement operators
12297The decrement operator @samp{--} works just like @samp{++}, except that
12298it subtracts one instead of adding it.  As with @samp{++}, it can be used before
12299the lvalue to pre-decrement or after it to post-decrement.
12300Following is a summary of increment and decrement expressions:
12301
12302@table @code
12303@cindex @code{+} (plus sign) @subentry @code{++} operator
12304@cindex plus sign (@code{+}) @subentry @code{++} operator
12305@item ++@var{lvalue}
12306Increment @var{lvalue}, returning the new value as the
12307value of the expression.
12308
12309@item @var{lvalue}++
12310Increment @var{lvalue}, returning the @emph{old} value of @var{lvalue}
12311as the value of the expression.
12312
12313@cindex @code{-} (hyphen) @subentry @code{--} operator
12314@cindex hyphen (@code{-}) @subentry @code{--} operator
12315@item --@var{lvalue}
12316Decrement @var{lvalue}, returning the new value as the
12317value of the expression.
12318(This expression is
12319like @samp{++@var{lvalue}}, but instead of adding, it subtracts.)
12320
12321@item @var{lvalue}--
12322Decrement @var{lvalue}, returning the @emph{old} value of @var{lvalue}
12323as the value of the expression.
12324(This expression is
12325like @samp{@var{lvalue}++}, but instead of adding, it subtracts.)
12326@end table
12327
12328@sidebar Operator Evaluation Order
12329@cindex precedence
12330@cindex operators @subentry precedence of
12331@cindex portability @subentry operators
12332@cindex evaluation order
12333@cindex Marx, Groucho
12334@quotation
12335@i{Doctor, it hurts when I do this!@*
12336Then don't do that!}
12337@author Groucho Marx
12338@end quotation
12339
12340@noindent
12341What happens for something like the following?
12342
12343@example
12344b = 6
12345print b += b++
12346@end example
12347
12348@noindent
12349Or something even stranger?
12350
12351@example
12352b = 6
12353b += ++b + b++
12354print b
12355@end example
12356
12357@cindex side effects
12358In other words, when do the various side effects prescribed by the
12359postfix operators (@samp{b++}) take effect?
When side effects happen is @dfn{implementation-defined};
that is, it is up to the particular version of @command{awk}.
12362The result for the first example may be 12 or 13, and for the second, it
12363may be 22 or 23.
12364
12365In short, doing things like this is not recommended and definitely
12366not anything that you can rely upon for portability.
12367You should avoid such things in your own programs.
12368@c You'll sleep better at night and be able to look at yourself
12369@c in the mirror in the morning.
12370@end sidebar
12371
12372@node Truth Values and Conditions
12373@section Truth Values and Conditions
12374
12375In certain contexts, expression values also serve as ``truth values''; i.e.,
12376they determine what should happen next as the program runs. This
12377@value{SECTION} describes how @command{awk} defines ``true'' and ``false''
12378and how values are compared.
12379
12380@menu
12381* Truth Values::                What is ``true'' and what is ``false''.
12382* Typing and Comparison::       How variables acquire types and how this
12383                                affects comparison of numbers and strings with
12384                                @samp{<}, etc.
12385* Boolean Ops::                 Combining comparison expressions using boolean
12386                                operators @samp{||} (``or''), @samp{&&}
12387                                (``and'') and @samp{!} (``not'').
12388* Conditional Exp::             Conditional expressions select between two
12389                                subexpressions under control of a third
12390                                subexpression.
12391@end menu
12392
12393@node Truth Values
12394@subsection True and False in @command{awk}
12395@cindex truth values
12396@cindex logical false/true
12397@cindex false, logical
12398@cindex true, logical
12399
12400@cindex null strings
12401Many programming languages have a special representation for the concepts
12402of ``true'' and ``false.''  Such languages usually use the special
12403constants @code{true} and @code{false}, or perhaps their uppercase
12404equivalents.
12405However, @command{awk} is different.
12406It borrows a very simple concept of true and
12407false from C.  In @command{awk}, any nonzero numeric value @emph{or} any
12408nonempty string value is true.  Any other value (zero or the null
12409string, @code{""}) is false.  The following program prints @samp{A strange
12410truth value} three times:
12411
12412@example
12413BEGIN @{
12414   if (3.1415927)
12415       print "A strange truth value"
12416   if ("Four Score And Seven Years Ago")
12417       print "A strange truth value"
12418   if (j = 57)
12419       print "A strange truth value"
12420@}
12421@end example
12422
12423@cindex dark corner @subentry @code{"0"} is actually true
12424There is a surprising consequence of the ``nonzero or non-null'' rule:
12425the string constant @code{"0"} is actually true, because it is non-null.
12426@value{DARKCORNER}
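
The following sketch contrasts the string constant @code{"0"} with the
numeric constant @code{0}; the variable names are illustrative:

@example
$ @kbd{awk 'BEGIN @{ s = ("0" ? "true" : "false")}
> @kbd{             n = (0 ? "true" : "false"); print s, n @}'}
@print{} true false
@end example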
12427
12428@node Typing and Comparison
12429@subsection Variable Typing and Comparison Expressions
12430@quotation
12431@i{The Guide is definitive. Reality is frequently inaccurate.}
12432@author Douglas Adams, @cite{The Hitchhiker's Guide to the Galaxy}
12433@end quotation
12434@c 2/2015: Antonio Colombo points out that this is really from
12435@c The Restaurant at the End of the Universe. But I'm going to
12436@c leave it alone.
12437
12438@cindex comparison expressions
12439@cindex expressions @subentry comparison
12440@cindex expressions, matching @seeentry{comparison expressions}
12441@cindex matching @subentry expressions @seeentry{comparison expressions}
12442@cindex relational operators @seeentry{comparison operators}
12443@cindex operators, relational @seeentry{operators, comparison}
12444@cindex variables @subentry types of @subentry comparison expressions and
Unlike in many other programming languages, variables in @command{awk} do not have a
fixed type. Instead, they can be either a number or a string, depending
upon the value that is assigned to them.
12448We look now at how variables are typed, and how @command{awk}
12449compares variables.
12450
12451@menu
12452* Variable Typing::             String type versus numeric type.
12453* Comparison Operators::        The comparison operators.
12454* POSIX String Comparison::     String comparison with POSIX rules.
12455@end menu
12456
12457@node Variable Typing
12458@subsubsection String Type versus Numeric Type
12459
12460Scalar objects in @command{awk} (variables, array elements, and fields)
12461are @emph{dynamically} typed.  This means their type can change as the
12462program runs, from @dfn{untyped} before any use,@footnote{@command{gawk}
12463calls this @dfn{unassigned}, as the following example shows.} to string
12464or number, and then from string to number or number to string, as the
12465program progresses.  (@command{gawk} also provides regexp-typed scalars,
12466but let's ignore that for now; @pxref{Strong Regexp Constants}.)
12467
12468You can't do much with untyped variables, other than tell that they
12469are untyped. The following program tests @code{a} against @code{""}
12470and @code{0}; the test succeeds when @code{a} has never been assigned
12471a value.  It also uses the built-in @code{typeof()} function
12472(not presented yet; @pxref{Type Functions}) to show @code{a}'s type:
12473
12474@example
12475$ @kbd{gawk 'BEGIN @{ print (a == "" && a == 0 ?}
12476> @kbd{"a is untyped" : "a has a type!") ; print typeof(a) @}'}
12477@print{} a is untyped
12478@print{} unassigned
12479@end example
12480
12481A scalar has numeric type when assigned a numeric value,
12482such as from a numeric constant, or from another scalar
12483with numeric type:
12484
12485@example
12486$ @kbd{gawk 'BEGIN @{ a = 42 ; print typeof(a)}
12487> @kbd{b = a ; print typeof(b) @}'}
@print{} number
@print{} number
12490@end example
12491
12492Similarly, a scalar has string type when assigned a string
12493value, such as from a string constant, or from another scalar
12494with string type:
12495
12496@example
12497$ @kbd{gawk 'BEGIN @{ a = "forty two" ; print typeof(a)}
12498> @kbd{b = a ; print typeof(b) @}'}
@print{} string
@print{} string
12501@end example
12502
12503So far, this is all simple and straightforward.  What happens, though,
12504when @command{awk} has to process data from a user?  Let's start with
12505field data.  What should the following command produce as output?
12506
12507@example
12508echo hello | awk '@{ printf("%s %s < 42\n", $1,
12509                           ($1 < 42 ? "is" : "is not")) @}'
12510@end example
12511
12512@noindent
12513Since @samp{hello} is alphabetic data, @command{awk} can only do a string
12514comparison.  Internally, it converts @code{42} into @code{"42"} and compares
12515the two string values @code{"hello"} and @code{"42"}. Here's the result:
12516
12517@example
12518$ @kbd{echo hello | awk '@{ printf("%s %s < 42\n", $1,}
12519> @kbd{                           ($1 < 42 ? "is" : "is not")) @}'}
12520@print{} hello is not < 42
12521@end example
12522
12523However, what happens when data from a user @emph{looks like} a number?
12524On the one hand, in reality, the input data consists of characters, not
12525binary numeric
12526values.  But, on the other hand, the data looks numeric, and @command{awk}
12527really ought to treat it as such. And indeed, it does:
12528
12529@example
12530$ @kbd{echo 37 | awk '@{ printf("%s %s < 42\n", $1,}
12531> @kbd{                        ($1 < 42 ? "is" : "is not")) @}'}
12532@print{} 37 is < 42
12533@end example
12534
12535Here are the rules for when @command{awk}
12536treats data as a number, and for when it treats data as a string.
12537
12538@cindex numeric @subentry strings
12539@cindex strings @subentry numeric
12540@cindex POSIX @command{awk} @subentry numeric strings and
12541The POSIX standard uses the term @dfn{numeric string} for input data that
12542looks numeric.  The @samp{37} in the previous example is a numeric string.
12543So what is the type of a numeric string? Answer: numeric.
12544
12545The type of a variable is important because the types of two variables
12546determine how they are compared.
12547Variable typing follows these definitions and rules:
12548
12549@itemize @value{BULLET}
12550@item
12551A numeric constant or the result of a numeric operation has the @dfn{numeric}
12552attribute.
12553
12554@item
12555A string constant or the result of a string operation has the @dfn{string}
12556attribute.
12557
12558@item
12559Fields, @code{getline} input, @code{FILENAME}, @code{ARGV} elements,
12560@code{ENVIRON} elements, and the elements of an array created by
12561@code{match()}, @code{split()}, and @code{patsplit()} that are numeric
12562strings have the @dfn{strnum} attribute.@footnote{Thus, a POSIX
12563numeric string and @command{gawk}'s strnum are the same thing.}
12564Otherwise, they have
12565the @dfn{string} attribute.  Uninitialized variables also have the
12566@dfn{strnum} attribute.
12567
12568@item
12569Attributes propagate across assignments but are not changed by
12570any use.
12571@c (Although a use may cause the entity to acquire an additional
12572@c value such that it has both a numeric and string value, this leaves the
12573@c attribute unchanged.)
12574@c This is important but not relevant
12575@end itemize
12576
12577The last rule is particularly important. In the following program,
12578@code{a} has numeric type, even though it is later used in a string
12579operation:
12580
12581@example
12582BEGIN @{
12583     a = 12.345
12584     b = a " is a cute number"
12585     print b
12586@}
12587@end example
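
With @command{gawk}, the @code{typeof()} function makes this visible.
As a sketch using the same two variables:

@example
$ @kbd{gawk 'BEGIN @{ a = 12.345; b = a " is a cute number"}
> @kbd{              print typeof(a), typeof(b) @}'}
@print{} number string
@end example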
12588
12589When two operands are compared, either string comparison or numeric comparison
12590may be used. This depends upon the attributes of the operands, according to the
12591following symmetric matrix:
12592
12593@c thanks to Karl Berry, kb@cs.umb.edu, for major help with TeX tables
12594@tex
12595\centerline{
12596\vbox{\bigskip % space above the table (about 1 linespace)
12597% Because we have vertical rules, we can't let TeX insert interline space
12598% in its usual way.
12599\offinterlineskip
12600%
12601% Define the table template. & separates columns, and \cr ends the
12602% template (and each row). # is replaced by the text of that entry on
12603% each row. The template for the first column breaks down like this:
12604%   \strut -- a way to make each line have the height and depth
12605%             of a normal line of type, since we turned off interline spacing.
12606%   \hfil -- infinite glue; has the effect of right-justifying in this case.
12607%   #     -- replaced by the text (for instance, `STRNUM', in the last row).
12608%   \quad -- about the width of an `M'. Just separates the columns.
12609%
12610% The second column (\vrule#) is what generates the vertical rule that
12611% spans table rows.
12612%
12613% The doubled && before the next entry means `repeat the following
12614% template as many times as necessary on each line' -- in our case, twice.
12615%
12616% The template itself, \quad#\hfil, left-justifies with a little space before.
12617%
12618\halign{\strut\hfil#\quad&\vrule#&&\quad#\hfil\cr
12619	&&STRING	&NUMERIC	&STRNUM\cr
12620% The \omit tells TeX to skip inserting the template for this column on
12621% this particular row. In this case, we only want a little extra space
12622% to separate the heading row from the rule below it.  the depth 2pt --
12623% `\vrule depth 2pt' is that little space.
12624\omit	&depth 2pt\cr
12625% This is the horizontal rule below the heading. Since it has nothing to
12626% do with the columns of the table, we use \noalign to get it in there.
12627\noalign{\hrule}
12628% Like above, this time a little more space.
12629\omit	&depth 4pt\cr
12630% The remaining rows have nothing special about them.
12631STRING	&&string	&string		&string\cr
12632NUMERIC	&&string	&numeric	&numeric\cr
12633STRNUM  &&string	&numeric	&numeric\cr
12634}}}
12635@end tex
12636@ifnottex
12637@ifnotdocbook
12638@verbatim
12639        +----------------------------------------------
12640        |       STRING          NUMERIC         STRNUM
12641--------+----------------------------------------------
12642        |
12643STRING  |       string          string          string
12644        |
12645NUMERIC |       string          numeric         numeric
12646        |
12647STRNUM  |       string          numeric         numeric
12648--------+----------------------------------------------
12649@end verbatim
12650@end ifnotdocbook
12651@end ifnottex
12652@docbook
12653<informaltable>
12654<tgroup cols="4">
12655<colspec colname="1" align="left"/>
12656<colspec colname="2" align="left"/>
12657<colspec colname="3" align="left"/>
12658<colspec colname="4" align="left"/>
12659<thead>
12660<row>
12661<entry/>
12662<entry>STRING</entry>
12663<entry>NUMERIC</entry>
12664<entry>STRNUM</entry>
12665</row>
12666</thead>
12667
12668<tbody>
12669<row>
12670<entry><emphasis role="bold">STRING</emphasis></entry>
12671<entry>string</entry>
12672<entry>string</entry>
12673<entry>string</entry>
12674</row>
12675
12676<row>
12677<entry><emphasis role="bold">NUMERIC</emphasis></entry>
12678<entry>string</entry>
12679<entry>numeric</entry>
12680<entry>numeric</entry>
12681</row>
12682
12683<row>
12684<entry><emphasis role="bold">STRNUM</emphasis></entry>
12685<entry>string</entry>
12686<entry>numeric</entry>
12687<entry>numeric</entry>
12688</row>
12689
12690</tbody>
12691</tgroup>
12692</informaltable>
12693
12694@end docbook
12695
12696The basic idea is that user input that looks numeric---and @emph{only}
12697user input---should be treated as numeric, even though it is actually
12698made of characters and is therefore also a string.
12699Thus, for example, the string constant @w{@code{" +3.14"}},
12700when it appears in program source code,
12701is a string---even though it looks numeric---and
12702is @emph{never} treated as a number for comparison
12703purposes.
12704
12705In short, when one operand is a ``pure'' string, such as a string
12706constant, then a string comparison is performed.  Otherwise, a
12707numeric comparison is performed.
12708(The primary difference between a number and a strnum is that
12709for strnums @command{gawk} preserves the original string value that
12710the scalar had when it came in.)
12711
12712This point bears additional emphasis:
12713Input that looks numeric @emph{is} numeric.
12714All other input is treated as strings.
12715
12716Thus, the six-character input string @w{@samp{ +3.14}} receives the
12717strnum attribute. In contrast, the eight characters
12718@w{@code{" +3.14"}} appearing in program text comprise a string constant.
The following examples print @samp{1} when the comparison between
the input and the constant is true, and @samp{0} otherwise:
12721
12722@c 22.9.2014: Tested with mawk and BWK awk, got same results.
12723@example
12724$ @kbd{echo ' +3.14' | awk '@{ print($0 == " +3.14") @}'}    @ii{True}
12725@print{} 1
12726$ @kbd{echo ' +3.14' | awk '@{ print($0 == "+3.14") @}'}     @ii{False}
12727@print{} 0
12728$ @kbd{echo ' +3.14' | awk '@{ print($0 == "3.14") @}'}      @ii{False}
12729@print{} 0
12730$ @kbd{echo ' +3.14' | awk '@{ print($0 == 3.14) @}'}        @ii{True}
12731@print{} 1
12732$ @kbd{echo ' +3.14' | awk '@{ print($1 == " +3.14") @}'}    @ii{False}
12733@print{} 0
12734$ @kbd{echo ' +3.14' | awk '@{ print($1 == "+3.14") @}'}     @ii{True}
12735@print{} 1
12736$ @kbd{echo ' +3.14' | awk '@{ print($1 == "3.14") @}'}      @ii{False}
12737@print{} 0
12738$ @kbd{echo ' +3.14' | awk '@{ print($1 == 3.14) @}'}        @ii{True}
12739@print{} 1
12740@end example
12741
12742You can see the type of an input field (or other user input)
12743using @code{typeof()}:
12744
12745@example
12746$ @kbd{echo hello 37 | gawk '@{ print typeof($1), typeof($2) @}'}
12747@print{} string strnum
12748@end example
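
The comparison matrix shown earlier can also be exercised with
constants alone.  In this sketch (the variable names are
illustrative), the first comparison is numeric, and the second is a
string comparison because one operand is a string constant:

@example
$ @kbd{awk 'BEGIN @{ n = (10 < 9); s = (10 < "9"); print n, s @}'}
@print{} 0 1
@end example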
12749
12750@node Comparison Operators
12751@subsubsection Comparison Operators
12752@cindex operators @subentry comparison
12753
12754@dfn{Comparison expressions} compare strings or numbers for
12755relationships such as equality.  They are written using @dfn{relational
12756operators}, which are a superset of those in C.
12757@ref{table-relational-ops} describes them.
12758
12759@cindex @code{<} (left angle bracket) @subentry @code{<} operator
12760@cindex left angle bracket (@code{<}) @subentry @code{<} operator
12761@cindex @code{<} (left angle bracket) @subentry @code{<=} operator
12762@cindex left angle bracket (@code{<}) @subentry @code{<=} operator
12763@cindex @code{>} (right angle bracket) @subentry @code{>=} operator
12764@cindex right angle bracket (@code{>}) @subentry @code{>=} operator
12765@cindex @code{>} (right angle bracket) @subentry @code{>} operator
12766@cindex right angle bracket (@code{>}) @subentry @code{>} operator
12767@cindex @code{=} (equals sign) @subentry @code{==} operator
12768@cindex equals sign (@code{=}) @subentry @code{==} operator
12769@cindex @code{!} (exclamation point) @subentry @code{!=} operator
12770@cindex exclamation point (@code{!}) @subentry @code{!=} operator
12771@cindex @code{~} (tilde), @code{~} operator
12772@cindex tilde (@code{~}), @code{~} operator
12773@cindex @code{!} (exclamation point) @subentry @code{!~} operator
12774@cindex exclamation point (@code{!}) @subentry @code{!~} operator
12775@cindex @code{in} operator
12776@float Table,table-relational-ops
12777@caption{Relational operators}
12778@multitable @columnfractions .25 .75
12779@headitem Expression @tab Result
12780@item @var{x} @code{<} @var{y} @tab True if @var{x} is less than @var{y}
12781@item @var{x} @code{<=} @var{y} @tab True if @var{x} is less than or equal to @var{y}
12782@item @var{x} @code{>} @var{y} @tab True if @var{x} is greater than @var{y}
12783@item @var{x} @code{>=} @var{y} @tab True if @var{x} is greater than or equal to @var{y}
12784@item @var{x} @code{==} @var{y} @tab True if @var{x} is equal to @var{y}
12785@item @var{x} @code{!=} @var{y} @tab True if @var{x} is not equal to @var{y}
12786@item @var{x} @code{~} @var{y} @tab True if the string @var{x} matches the regexp denoted by @var{y}
12787@item @var{x} @code{!~} @var{y} @tab True if the string @var{x} does not match the regexp denoted by @var{y}
12788@item @var{subscript} @code{in} @var{array} @tab True if the array @var{array} has an element with the subscript @var{subscript}
12789@end multitable
12790@end float
12791
12792Comparison expressions have the value one if true and zero if false.
12793When comparing operands of mixed types, numeric operands are converted
12794to strings using the value of @code{CONVFMT}
12795(@pxref{Conversion}).
12796
12797Strings are compared
12798by comparing the first character of each, then the second character of each,
12799and so on.  Thus, @code{"10"} is less than @code{"9"}.  If there are two
12800strings where one is a prefix of the other, the shorter string is less than
12801the longer one.  Thus, @code{"abc"} is less than @code{"abcd"}.
12802
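As a quick check of these rules, the following sketch (the values are
chosen purely for illustration) compares two string constants, the
corresponding numbers, and a string with its own prefix:

@example
$ @kbd{awk 'BEGIN @{ printf "%d %d %d\n", ("10" < "9"), (10 < 9),}
>               @kbd{("abc" < "abcd") @}'}
@print{} 1 0 1
@end example
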
12803@cindex troubleshooting @subentry @code{==} operator
12804It is very easy to accidentally mistype the @samp{==} operator and
12805leave off one of the @samp{=} characters.  The result is still valid
12806@command{awk} code, but the program does not do what is intended:
12807
12808@example
12809@group
12810if (a = b)   # oops! should be a == b
12811   @dots{}
12812else
12813   @dots{}
12814@end group
12815@end example
12816
12817@noindent
12818Unless @code{b} happens to be zero or the null string, the @code{if}
12819part of the test always succeeds.  Because the operators are
12820so similar, this kind of error is very difficult to spot when
12821scanning the source code.
12822
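A small demonstration of the consequence (the variable values here are
arbitrary) shows that the mistaken test both succeeds and silently
changes @code{a}:

@example
$ @kbd{awk 'BEGIN @{ a = 1; b = 2; if (a = b) print "equal?"; print a @}'}
@print{} equal?
@print{} 2
@end example
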
12823The following list of expressions illustrates the kinds of comparisons
12824@command{awk} performs, as well as what the result of each comparison is:
12825
12826@table @code
12827@item 1.5 <= 2.0
12828Numeric comparison (true)
12829
12830@item "abc" >= "xyz"
12831String comparison (false)
12832
12833@item 1.5 != " +2"
12834String comparison (true)
12835
12836@item "1e2" < "3"
12837String comparison (true)
12838
12839@item a = 2; b = "2"
12840@itemx a == b
12841String comparison (true)
12842
12843@item a = 2; b = " +2"
12844@itemx a == b
12845String comparison (false)
12846@end table
12847
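The last two entries of the list can be tried directly.  In this sketch
the extra variable @code{c} is introduced only so that both comparisons
fit in one command:

@example
$ @kbd{awk 'BEGIN @{ a = 2; b = "2"; c = " +2"}
>      @kbd{printf "%d %d\n", (a == b), (a == c) @}'}
@print{} 1 0
@end example
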
12848In this example:
12849
12850@example
12851$ @kbd{echo 1e2 3 | awk '@{ print ($1 < $2) ? "true" : "false" @}'}
12852@print{} false
12853@end example
12854
12855@cindex comparison expressions @subentry string vs.@: regexp
12856@c @cindex string comparison vs.@: regexp comparison
12857@c @cindex regexp comparison vs.@: string comparison
12858@noindent
12859the result is @samp{false} because both @code{$1} and @code{$2}
12860are user input.  They are numeric strings---therefore both have
12861the strnum attribute, dictating a numeric comparison.
12862The purpose of the comparison rules and the use of numeric strings is
12863to attempt to produce the behavior that is ``least surprising,'' while
12864still ``doing the right thing.''
12865
12866String comparisons and regular expression comparisons are very different.
12867For example:
12868
12869@example
12870x == "foo"
12871@end example
12872
12873@noindent
has the value one (i.e., is true) if the variable @code{x}
is precisely @samp{foo}.  By contrast:
12876
12877@example
12878x ~ /foo/
12879@end example
12880
12881@noindent
12882has the value one if @code{x} contains @samp{foo}, such as
12883@code{"Oh, what a fool am I!"}.
12884
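The difference is easy to see by applying both tests to the same value.
This is only an illustrative sketch, reusing the sample string from the
text:

@example
$ @kbd{awk 'BEGIN @{ x = "Oh, what a fool am I!"}
>      @kbd{printf "%d %d\n", (x == "foo"), (x ~ /foo/) @}'}
@print{} 0 1
@end example
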
12885@cindex @code{~} (tilde), @code{~} operator
12886@cindex tilde (@code{~}), @code{~} operator
12887@cindex @code{!} (exclamation point) @subentry @code{!~} operator
12888@cindex exclamation point (@code{!}) @subentry @code{!~} operator
12889The righthand operand of the @samp{~} and @samp{!~} operators may be
12890either a regexp constant (@code{/}@dots{}@code{/}) or an ordinary
12891expression. In the latter case, the value of the expression as a string is used as a
12892dynamic regexp (@pxref{Regexp Usage}; also
12893@pxref{Computed Regexps}).
12894
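For instance, a regexp can be kept in a variable and used as the
righthand operand.  In this sketch, the variable name @code{re} and the
test strings are merely illustrative:

@example
$ @kbd{awk 'BEGIN @{ re = "^[0-9]+$"}
>      @kbd{printf "%d %d\n", ("2021" ~ re), ("v5.1" ~ re) @}'}
@print{} 1 0
@end example
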
12895@cindex @command{awk} @subentry regexp constants and
12896@cindex regexp constants
12897A constant regular
12898expression in slashes by itself is also an expression.
12899@code{/@var{regexp}/} is an abbreviation for the following comparison expression:
12900
12901@example
12902$0 ~ /@var{regexp}/
12903@end example
12904
12905One special place where @code{/foo/} is @emph{not} an abbreviation for
12906@samp{$0 ~ /foo/} is when it is the righthand operand of @samp{~} or
12907@samp{!~}.
12908@xref{Using Constant Regexps},
12909where this is discussed in more detail.
12910
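The abbreviation can be observed by printing the value of a standalone
regexp constant next to the value of the explicit comparison (the input
text here is arbitrary):

@example
$ @kbd{echo 'food for thought' |}
> @kbd{awk '@{ printf "%d %d\n", /foo/, ($0 ~ /foo/) @}'}
@print{} 1 1
@end example
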
12911@node POSIX String Comparison
12912@subsubsection String Comparison Based on Locale Collating Order
12913
12914The POSIX standard used to say that all string comparisons are
12915performed based on the locale's @dfn{collating order}. This
12916is the order in which characters sort, as defined by the locale
12917(for more discussion, @pxref{Locales}).  This order is usually very
12918different from the results obtained when doing straight byte-by-byte
12919comparison.@footnote{Technically, string comparison is supposed to behave
12920the same way as if the strings were compared with the C @code{strcoll()}
12921function.}
12922
12923@cindex POSIX mode
12924Because this behavior differs considerably from existing practice,
12925@command{gawk} only implemented it when in POSIX mode (@pxref{Options}).
12926Here is an example to illustrate the difference, in an @code{en_US.UTF-8}
12927locale:
12928
12929@example
12930$ @kbd{gawk 'BEGIN @{ printf("ABC < abc = %s\n",}
12931>                     @kbd{("ABC" < "abc" ? "TRUE" : "FALSE")) @}'}
12932@print{} ABC < abc = TRUE
12933$ @kbd{gawk --posix 'BEGIN @{ printf("ABC < abc = %s\n",}
12934>                             @kbd{("ABC" < "abc" ? "TRUE" : "FALSE")) @}'}
12935@print{} ABC < abc = FALSE
12936@end example
12937
12938Fortunately, as of August 2016, comparison based on locale
12939collating order is no longer required for the @code{==} and @code{!=}
12940operators.@footnote{See @uref{http://austingroupbugs.net/view.php?id=1070,
12941the Austin Group website}.} However, comparison based on locales is still
12942required for @code{<}, @code{<=}, @code{>}, and @code{>=}.  POSIX thus
12943recommends as follows:
12944
12945@quotation
12946Since the @code{==} operator checks whether strings are identical,
12947not whether they collate equally, applications needing to check whether
12948strings collate equally can use:
12949
12950@example
12951a <= b && a >= b
12952@end example
12953@end quotation
12954
12955@cindex POSIX mode
12956As of @value{PVERSION} 4.2, @command{gawk} continues to use locale
12957collating order for @code{<}, @code{<=}, @code{>}, and @code{>=} only
12958in POSIX mode.
12959
12960@ignore
12961References: http://austingroupbugs.net/view.php?id=963
12962and http://austingroupbugs.net/view.php?id=1070.
12963@end ignore
12964
12965@node Boolean Ops
12966@subsection Boolean Expressions
12967@cindex and Boolean-logic operator
12968@cindex or Boolean-logic operator
12969@cindex not Boolean-logic operator
12970@cindex expressions @subentry Boolean
12971@cindex Boolean expressions
12972@cindex operators, Boolean @seeentry{Boolean expressions}
12973@cindex Boolean operators @seeentry{Boolean expressions}
12974@cindex logical operators @seeentry{Boolean expressions}
12975@cindex operators, logical @seeentry{Boolean expressions}
12976
12977A @dfn{Boolean expression} is a combination of comparison expressions or
12978matching expressions, using the Boolean operators ``or''
12979(@samp{||}), ``and'' (@samp{&&}), and ``not'' (@samp{!}), along with
12980parentheses to control nesting.  The truth value of the Boolean expression is
12981computed by combining the truth values of the component expressions.
12982Boolean expressions are also referred to as @dfn{logical expressions}.
12983The terms are equivalent.
12984
12985Boolean expressions can be used wherever comparison and matching
12986expressions can be used.  They can be used in @code{if}, @code{while},
12987@code{do}, and @code{for} statements
12988(@pxref{Statements}).
12989They have numeric values (one if true, zero if false) that come into play
12990if the result of the Boolean expression is stored in a variable or
12991used in arithmetic.
12992
12993In addition, every Boolean expression is also a valid pattern, so
12994you can use one as a pattern to control the execution of rules.
12995The Boolean operators are:
12996
12997@table @code
12998@item @var{boolean1} && @var{boolean2}
12999True if both @var{boolean1} and @var{boolean2} are true.  For example,
13000the following statement prints the current input record if it contains
13001both @samp{edu} and @samp{li}:
13002
13003@example
13004if ($0 ~ /edu/ && $0 ~ /li/) print
13005@end example
13006
13007@cindex side effects @subentry Boolean operators
13008The subexpression @var{boolean2} is evaluated only if @var{boolean1}
13009is true.  This can make a difference when @var{boolean2} contains
13010expressions that have side effects. In the case of @samp{$0 ~ /foo/ &&
13011($2 == bar++)}, the variable @code{bar} is not incremented if there is
13012no substring @samp{foo} in the record.
13013
13014@item @var{boolean1} || @var{boolean2}
13015True if at least one of @var{boolean1} or @var{boolean2} is true.
13016For example, the following statement prints all records in the input
13017that contain @emph{either} @samp{edu} or
13018@samp{li}:
13019
13020@example
13021if ($0 ~ /edu/ || $0 ~ /li/) print
13022@end example
13023
13024The subexpression @var{boolean2} is evaluated only if @var{boolean1}
13025is false.  This can make a difference when @var{boolean2} contains
13026expressions that have side effects.
13027(Thus, this test never really distinguishes records that contain both
13028@samp{edu} and @samp{li}---as soon as @samp{edu} is matched,
13029the full test succeeds.)
13030
13031@item ! @var{boolean}
13032True if @var{boolean} is false.  For example,
13033the following program prints @samp{no home!} in
13034the unusual event that the @env{HOME} environment
13035variable is not defined:
13036
13037@example
13038BEGIN @{ if (! ("HOME" in ENVIRON))
13039            print "no home!" @}
13040@end example
13041
13042(The @code{in} operator is described in
13043@ref{Reference to Elements}.)
13044@end table
13045
13046@cindex short-circuit operators
13047@cindex operators @subentry short-circuit
13048@cindex @code{&} (ampersand) @subentry @code{&&} operator
13049@cindex ampersand (@code{&}) @subentry @code{&&} operator
13050@cindex @code{|} (vertical bar) @subentry @code{||} operator
13051@cindex vertical bar (@code{|}) @subentry @code{||} operator
13052The @samp{&&} and @samp{||} operators are called @dfn{short-circuit}
13053operators because of the way they work.  Evaluation of the full expression
13054is ``short-circuited'' if the result can be determined partway through
13055its evaluation.
13056
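One way to watch short-circuiting happen is to put a side effect in the
righthand operand.  In the following sketch, the counter @code{n} and
the result variable @code{r} are illustrative names; @code{n} is
incremented only when the righthand operand is actually evaluated:

@example
$ @kbd{awk 'BEGIN @{ n = 0}
>     @kbd{r = (0 && n++); print r, n}
>     @kbd{r = (1 || n++); print r, n}
>     @kbd{r = (1 && n++); print r, n @}'}
@print{} 0 0
@print{} 1 0
@print{} 0 1
@end example

@noindent
Only the last expression evaluates @samp{n++}.  Its result is zero
because the postincrement yields the old value of @code{n}.
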
13057@cindex line continuations
13058Statements that end with @samp{&&} or @samp{||} can be continued simply
13059by putting a newline after them.  But you cannot put a newline in front
13060of either of these operators without using backslash continuation
13061(@pxref{Statements/Lines}).
13062
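For example, this sketch (with arbitrary input) breaks a condition
after @samp{&&} without needing a backslash:

@example
$ @kbd{echo 'a b' | awk '@{ if ($1 == "a" &&}
>                         @kbd{$2 == "b") print "both" @}'}
@print{} both
@end example
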
@cindex @code{!} (exclamation point) @subentry @code{!} operator
13064@cindex exclamation point (@code{!}) @subentry @code{!} operator
13065@cindex newlines
13066@cindex variables @subentry flag
13067@cindex flag variables
13068The actual value of an expression using the @samp{!} operator is
13069either one or zero, depending upon the truth value of the expression it
13070is applied to.
13071The @samp{!} operator is often useful for changing the sense of a flag
13072variable from false to true and back again. For example, the following
13073program is one way to print lines in between special bracketing lines:
13074
13075@example
13076$1 == "START"   @{ interested = ! interested; next @}
13077interested      @{ print @}
13078$1 == "END"     @{ interested = ! interested; next @}
13079@end example
13080
13081@noindent
13082The variable @code{interested}, as with all @command{awk} variables, starts
13083out initialized to zero, which is also false.  When a line is seen whose
13084first field is @samp{START}, the value of @code{interested} is toggled
13085to true, using @samp{!}. The next rule prints lines as long as
13086@code{interested} is true.  When a line is seen whose first field is
13087@samp{END}, @code{interested} is toggled back to false.@footnote{This
13088program has a bug; it prints lines starting with @samp{END}. How
13089would you fix it?}
13090
13091@ignore
13092Scott Deifik points out that this program isn't robust against
13093bogus input data, but the point is to illustrate the use of `!',
13094so we'll leave well enough alone.
13095@end ignore
13096
13097Most commonly, the @samp{!} operator is used in the conditions of
13098@code{if} and @code{while} statements, where it often makes more
13099sense to phrase the logic in the negative:
13100
13101@example
13102if (! @var{some condition} || @var{some other condition}) @{
13103    @var{@dots{} do whatever processing @dots{}}
13104@}
13105@end example
13106
13107@cindex @code{next} statement
13108@quotation NOTE
13109The @code{next} statement is discussed in
13110@ref{Next Statement}.
13111@code{next} tells @command{awk} to skip the rest of the rules, get the
13112next record, and start processing the rules over again at the top.
13113The reason it's there is to avoid printing the bracketing
13114@samp{START} and @samp{END} lines.
13115@end quotation
13116
13117@node Conditional Exp
13118@subsection Conditional Expressions
13119@cindex conditional expressions
13120@cindex expressions @subentry conditional
13121@cindex expressions @subentry selecting
13122
13123A @dfn{conditional expression} is a special kind of expression that has
13124three operands.  It allows you to use one expression's value to select
13125one of two other expressions.
13126The conditional expression in @command{awk} is the same as in the C
13127language, as shown here:
13128
13129@example
13130@var{selector} ? @var{if-true-exp} : @var{if-false-exp}
13131@end example
13132
13133@noindent
13134There are three subexpressions.  The first, @var{selector}, is always
computed first.  If it is ``true'' (nonzero, or a non-null string), then
13136@var{if-true-exp} is computed next, and its value becomes the value of
13137the whole expression.  Otherwise, @var{if-false-exp} is computed next,
13138and its value becomes the value of the whole expression.
13139For example, the following expression produces the absolute value of @code{x}:
13140
13141@example
13142x >= 0 ? x : -x
13143@end example
13144
13145@cindex side effects @subentry conditional expressions
13146Each time the conditional expression is computed, only one of
13147@var{if-true-exp} and @var{if-false-exp} is used; the other is ignored.
13148This is important when the expressions have side effects.  For example,
13149this conditional expression examines element @code{i} of either array
13150@code{a} or array @code{b}, and increments @code{i}:
13151
13152@example
13153x == y ? a[i++] : b[i++]
13154@end example
13155
13156@noindent
13157This is guaranteed to increment @code{i} exactly once, because each time
13158only one of the two increment expressions is executed
13159and the other is not.
13160@xref{Arrays},
13161for more information about arrays.
13162
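A short sketch makes the single increment visible.  The array contents
and the result variable @code{v} are assumptions made for the sake of
the demonstration:

@example
$ @kbd{awk 'BEGIN @{ a[0] = "A"; b[0] = "B"; i = 0; x = y = 1}
>     @kbd{v = (x == y ? a[i++] : b[i++]); print v, i @}'}
@print{} A 1
@end example
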
13163@cindex differences in @command{awk} and @command{gawk} @subentry line continuations
13164@cindex line continuations @subentry @command{gawk}
13165@cindex @command{gawk} @subentry line continuation in
13166As a minor @command{gawk} extension,
13167a statement that uses @samp{?:} can be continued simply
13168by putting a newline after either character.
13169However, putting a newline in front
13170of either character does not work without using backslash continuation
13171(@pxref{Statements/Lines}).
13172If @option{--posix} is specified
13173(@pxref{Options}), this extension is disabled.
13174
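The following sketch relies on that extension, so it is expected to
work with @command{gawk} when @option{--posix} is not in effect; other
@command{awk} implementations may reject it:

@example
$ @kbd{gawk 'BEGIN @{ x = -5}
>     @kbd{y = x >= 0 ?}
>           @kbd{x :}
>           @kbd{-x}
>     @kbd{print y @}'}
@print{} 5
@end example
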
13175@node Function Calls
13176@section Function Calls
13177@cindex function calls
13178
13179A @dfn{function} is a name for a particular calculation.
13180This enables you to
13181ask for it by name at any point in the program.  For
13182example, the function @code{sqrt()} computes the square root of a number.
13183
13184@cindex functions @subentry built-in
13185A fixed set of functions are @dfn{built in}, which means they are
13186available in every @command{awk} program.  The @code{sqrt()} function is one
of these.  @xref{Built-in}, for a list of built-in
functions and their descriptions.  In addition, you can define
functions for use in your program.
@xref{User-defined},
13191for instructions on how to do this.
13192Finally, @command{gawk} lets you write functions in C or C++
13193that may be called from your program (@pxref{Dynamic Extensions}).
13194
13195@cindex arguments @subentry in function calls
13196The way to use a function is with a @dfn{function call} expression,
13197which consists of the function name followed immediately by a list of
13198@dfn{arguments} in parentheses.  The arguments are expressions that
13199provide the raw materials for the function's calculations.
13200When there is more than one argument, they are separated by commas.  If
13201there are no arguments, just write @samp{()} after the function name.
13202The following examples show function calls with and without arguments:
13203
13204@example
13205sqrt(x^2 + y^2)        @ii{one argument}
13206atan2(y, x)            @ii{two arguments}
13207rand()                 @ii{no arguments}
13208@end example
13209
13210@cindex troubleshooting @subentry function call syntax
13211@quotation CAUTION
13212Do not put any space between the function name and the opening parenthesis!
13213A user-defined function name looks just like the name of a
13214variable---a space would make the expression look like concatenation of
13215a variable with an expression inside parentheses.
With built-in functions, space before the parenthesis is harmless, but
to avoid mistakes with user-defined functions, it is best not to get
into the habit of using space.
13219@end quotation
13220
Each function expects a particular number
of arguments.  For example, the @code{sqrt()} function must be called with
a single argument, the number whose square root is to be computed:
13224
13225@example
13226sqrt(@var{argument})
13227@end example
13228
13229Some of the built-in functions have one or
13230more optional arguments.
13231If those arguments are not supplied, the functions
13232use a reasonable default value.
@xref{Built-in}, for full details.  If arguments
13234are omitted in calls to user-defined functions, then those arguments are
13235treated as local variables. Such local variables act like the
13236empty string if referenced where a string value is required,
13237and like zero if referenced where a numeric value is required
13238(@pxref{User-defined}).
13239
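As a brief illustration, the hypothetical function @code{show()} below
is declared with two parameters but called with one, so @code{b} acts
as the empty string in string context and as zero in numeric context:

@example
$ @kbd{awk 'function show(a, b) @{ printf "[%s] %d\n", b, b + 0 @}}
>      @kbd{BEGIN @{ show(1) @}'}
@print{} [] 0
@end example
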
13240As an advanced feature, @command{gawk} provides indirect function calls,
13241which is a way to choose the function to call at runtime, instead of
13242when you write the source code to your program. We defer discussion of
13243this feature until later; see @ref{Indirect Calls}.
13244
13245@cindex side effects @subentry function calls
13246Like every other expression, the function call has a value, often
13247called the @dfn{return value}, which is computed by the function
13248based on the arguments you give it.  In this example, the return value
13249of @samp{sqrt(@var{argument})} is the square root of @var{argument}.
13250The following program reads numbers, one number per line, and prints
13251the square root of each one:
13252
13253@example
13254$ @kbd{awk '@{ print "The square root of", $1, "is", sqrt($1) @}'}
13255@kbd{1}
13256@print{} The square root of 1 is 1
13257@kbd{3}
13258@print{} The square root of 3 is 1.73205
13259@kbd{5}
13260@print{} The square root of 5 is 2.23607
13261@kbd{Ctrl-d}
13262@end example
13263
13264A function can also have side effects, such as assigning
13265values to certain variables or doing I/O.
13266This program shows how the @code{match()} function
13267(@pxref{String Functions})
13268changes the variables @code{RSTART} and @code{RLENGTH}:
13269
13270@example
13271@{
13272    if (match($1, $2))
13273        print RSTART, RLENGTH
13274    else
13275        print "no match"
13276@}
13277@end example
13278
13279@noindent
13280Here is a sample run:
13281
13282@example
13283$ @kbd{awk -f matchit.awk}
13284@kbd{aaccdd  c+}
13285@print{} 3 2
13286@kbd{foo     bar}
13287@print{} no match
13288@kbd{abcdefg e}
13289@print{} 5 1
13290@end example
13291
13292@node Precedence
13293@section Operator Precedence (How Operators Nest)
13294@cindex precedence
13295@cindex operators @subentry precedence of
13296
13297@dfn{Operator precedence} determines how operators are grouped when
13298different operators appear close by in one expression.  For example,
13299@samp{*} has higher precedence than @samp{+}; thus, @samp{a + b * c}
13300means to multiply @code{b} and @code{c}, and then add @code{a} to the
13301product (i.e., @samp{a + (b * c)}).
13302
13303The normal precedence of the operators can be overruled by using parentheses.
13304Think of the precedence rules as saying where the
13305parentheses are assumed to be.  In
13306fact, it is wise to always use parentheses whenever there is an unusual
13307combination of operators, because other people who read the program may
13308not remember what the precedence is in this case.
13309Even experienced programmers occasionally forget the exact rules,
13310which leads to mistakes.
13311Explicit parentheses help prevent
13312any such mistakes.
13313
13314When operators of equal precedence are used together, the leftmost
13315operator groups first, except for the assignment, conditional, and
13316exponentiation operators, which group in the opposite order.
13317Thus, @samp{a - b + c} groups as @samp{(a - b) + c} and
13318@samp{a = b = c} groups as @samp{a = (b = c)}.
13319
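The grouping direction can be checked with a few constants, chosen here
only for illustration.  Exponentiation groups right to left, so the
first value differs from the explicitly parenthesized second one:

@example
$ @kbd{awk 'BEGIN @{ print 2 ^ 3 ^ 2, (2 ^ 3) ^ 2, 10 - 4 + 3 @}'}
@print{} 512 64 9
@end example
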
13320Normally the precedence of prefix unary operators does not matter,
13321because there is only one way to interpret
13322them: innermost first.  Thus, @samp{$++i} means @samp{$(++i)} and
13323@samp{++$x} means @samp{++($x)}.  However, when another operator follows
13324the operand, then the precedence of the unary operators can matter.
13325@samp{$x^2} means @samp{($x)^2}, but @samp{-x^2} means
13326@samp{-(x^2)}, because @samp{-} has lower precedence than @samp{^},
13327whereas @samp{$} has higher precedence.
13328Also, operators cannot be combined in a way that violates the
13329precedence rules; for example, @samp{$$0++--} is not a valid
13330expression because the first @samp{$} has higher precedence than the
13331@samp{++}; to avoid the problem the expression can be rewritten as
13332@samp{$($0++)--}.
13333
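Here is a sketch of the two cases, using arbitrary input and variable
values.  The first field is 3, so @samp{$x^2} squares the field, whereas
@samp{-y^2} negates the square of @code{y}:

@example
$ @kbd{echo '3 7' | awk '@{ x = 1; y = 3; print $x^2, -y^2 @}'}
@print{} 9 -9
@end example
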
13334This list presents @command{awk}'s operators, in order of highest
13335to lowest precedence:
13336
13337@c @asis for docbook to come out right
13338@table @asis
13339@item @code{(}@dots{}@code{)}
13340Grouping.
13341
13342@cindex @code{$} (dollar sign) @subentry @code{$} field operator
13343@cindex dollar sign (@code{$}) @subentry @code{$} field operator
13344@item @code{$}
13345Field reference.
13346
13347@cindex @code{+} (plus sign) @subentry @code{++} operator
13348@cindex plus sign (@code{+}) @subentry @code{++} operator
13349@cindex @code{-} (hyphen) @subentry @code{--} operator
13350@cindex hyphen (@code{-}) @subentry @code{--} operator
13351@item @code{++ --}
13352Increment, decrement.
13353
13354@cindex @code{^} (caret) @subentry @code{^} operator
13355@cindex caret (@code{^}) @subentry @code{^} operator
13356@cindex @code{*} (asterisk) @subentry @code{**} operator
13357@cindex asterisk (@code{*}) @subentry @code{**} operator
13358@item @code{^ **}
13359Exponentiation.  These operators group right to left.
13360
13361@cindex @code{+} (plus sign) @subentry @code{+} operator
13362@cindex plus sign (@code{+}) @subentry @code{+} operator
13363@cindex @code{-} (hyphen) @subentry @code{-} operator
13364@cindex hyphen (@code{-}) @subentry @code{-} operator
13365@cindex @code{!} (exclamation point) @subentry @code{!} operator
13366@cindex exclamation point (@code{!}) @subentry @code{!} operator
13367@item @code{+ - !}
13368Unary plus, minus, logical ``not.''
13369
13370@cindex @code{*} (asterisk) @subentry @code{*} operator @subentry as multiplication operator
13371@cindex asterisk (@code{*}) @subentry @code{*} operator @subentry as multiplication operator
13372@cindex @code{/} (forward slash) @subentry @code{/} operator
13373@cindex forward slash (@code{/}) @subentry @code{/} operator
13374@cindex @code{%} (percent sign) @subentry @code{%} operator
13375@cindex percent sign (@code{%}) @subentry @code{%} operator
13376@item @code{* / %}
13377Multiplication, division, remainder.
13378
13379@cindex @code{+} (plus sign) @subentry @code{+} operator
13380@cindex plus sign (@code{+}) @subentry @code{+} operator
13381@cindex @code{-} (hyphen) @subentry @code{-} operator
13382@cindex hyphen (@code{-}) @subentry @code{-} operator
13383@item @code{+ -}
13384Addition, subtraction.
13385
13386@item String concatenation
13387There is no special symbol for concatenation.
13388The operands are simply written side by side
13389(@pxref{Concatenation}).
13390
13391@cindex @code{<} (left angle bracket) @subentry @code{<} operator
13392@cindex left angle bracket (@code{<}) @subentry @code{<} operator
13393@cindex @code{<} (left angle bracket) @subentry @code{<=} operator
13394@cindex left angle bracket (@code{<}) @subentry @code{<=} operator
13395@cindex @code{>} (right angle bracket) @subentry @code{>=} operator
13396@cindex right angle bracket (@code{>}) @subentry @code{>=} operator
13397@cindex @code{>} (right angle bracket) @subentry @code{>} operator
13398@cindex right angle bracket (@code{>}) @subentry @code{>} operator
13399@cindex @code{=} (equals sign) @subentry @code{==} operator
13400@cindex equals sign (@code{=}) @subentry @code{==} operator
13401@cindex @code{!} (exclamation point) @subentry @code{!=} operator
13402@cindex exclamation point (@code{!}) @subentry @code{!=} operator
13403@cindex @code{>} (right angle bracket) @subentry @code{>>} operator (I/O)
13404@cindex right angle bracket (@code{>}) @subentry @code{>>} operator (I/O)
13405@cindex operators @subentry input/output
13406@cindex @code{|} (vertical bar) @subentry @code{|} operator (I/O)
13407@cindex vertical bar (@code{|}) @subentry @code{|} operator (I/O)
13408@cindex operators @subentry input/output
13409@cindex @code{|} (vertical bar) @subentry @code{|&} operator (I/O)
13410@cindex vertical bar (@code{|}) @subentry @code{|&} operator (I/O)
13411@cindex operators @subentry input/output
13412@item @code{< <= == != > >= >> | |&}
13413Relational and redirection.
13414The relational operators and the redirections have the same precedence
13415level.  Characters such as @samp{>} serve both as relationals and as
13416redirections; the context distinguishes between the two meanings.
13417
13418@cindex @code{print} statement @subentry I/O operators in
13419@cindex @code{printf} statement @subentry I/O operators in
13420Note that the I/O redirection operators in @code{print} and @code{printf}
13421statements belong to the statement level, not to expressions.  The
13422redirection does not produce an expression that could be the operand of
13423another operator.  As a result, it does not make sense to use a
13424redirection operator near another operator of lower precedence without
13425parentheses.  Such combinations (e.g., @samp{print foo > a ? b : c})
13426result in syntax errors.
13427The correct way to write this statement is @samp{print foo > (a ? b : c)}.
13428
13429@cindex @code{~} (tilde), @code{~} operator
13430@cindex tilde (@code{~}), @code{~} operator
13431@cindex @code{!} (exclamation point) @subentry @code{!~} operator
13432@cindex exclamation point (@code{!}) @subentry @code{!~} operator
13433@item @code{~ !~}
13434Matching, nonmatching.
13435
13436@cindex @code{in} operator
13437@item @code{in}
13438Array membership.
13439
13440@cindex @code{&} (ampersand) @subentry @code{&&} operator
13441@cindex ampersand (@code{&}) @subentry @code{&&} operator
13442@item @code{&&}
13443Logical ``and.''
13444
13445@cindex @code{|} (vertical bar) @subentry @code{||} operator
13446@cindex vertical bar (@code{|}) @subentry @code{||} operator
13447@item @code{||}
13448Logical ``or.''
13449
13450@cindex @code{?} (question mark) @subentry @code{?:} operator
13451@cindex question mark (@code{?}) @subentry @code{?:} operator
13452@cindex @code{:} (colon) @subentry @code{?:} operator
13453@cindex colon (@code{:}) @subentry @code{?:} operator
13454@item @code{?:}
13455Conditional.  This operator groups right to left.
13456
13457@cindex @code{+} (plus sign) @subentry @code{+=} operator
13458@cindex plus sign (@code{+}) @subentry @code{+=} operator
13459@cindex @code{-} (hyphen) @subentry @code{-=} operator
13460@cindex hyphen (@code{-}) @subentry @code{-=} operator
13461@cindex @code{*} (asterisk) @subentry @code{*=} operator
13462@cindex asterisk (@code{*}) @subentry @code{*=} operator
13463@cindex @code{*} (asterisk) @subentry @code{**=} operator
13464@cindex asterisk (@code{*}) @subentry @code{**=} operator
13465@cindex @code{/} (forward slash) @subentry @code{/=} operator
13466@cindex forward slash (@code{/}) @subentry @code{/=} operator
13467@cindex @code{%} (percent sign) @subentry @code{%=} operator
13468@cindex percent sign (@code{%}) @subentry @code{%=} operator
13469@cindex @code{^} (caret) @subentry @code{^=} operator
13470@cindex caret (@code{^}) @subentry @code{^=} operator
13471@item @code{= += -= *= /= %= ^= **=}
13472Assignment.  These operators group right to left.
13473@end table
13474
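A few levels of the preceding list can be verified with a single
command.  This is only a sketch; the operands are arbitrary, and the
@samp{|} characters in the format simply separate the three results:

@example
$ @kbd{awk 'BEGIN @{ printf "%s|%s|%s\n", 1 + 2 * 3, 2 * 3 ^ 2,}
>               @kbd{1 " " 2 + 3 @}'}
@print{} 7|18|1 5
@end example

@noindent
In the last expression, the addition is performed before the
concatenation, because concatenation has lower precedence.
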
13475@cindex POSIX @command{awk} @subentry @code{**} operator and
13476@cindex portability @subentry operators @subentry not in POSIX @command{awk}
13477@quotation NOTE
13478The @samp{|&}, @samp{**}, and @samp{**=} operators are not specified by POSIX.
13479For maximum portability, do not use them.
13480@end quotation
13481
13482@node Locales
13483@section Where You Are Makes a Difference
13484@cindex locale, definition of
13485
13486Modern systems support the notion of @dfn{locales}: a way to tell the
13487system about the local character set and language.  The ISO C standard
13488defines a default @code{"C"} locale, which is an environment that is
13489typical of what many C programmers are used to.
13490
13491Once upon a time, the locale setting used to affect regexp matching,
13492but this is no longer true (@pxref{Ranges and Locales}).
13493
13494Locales can affect record splitting.  For the normal case of @samp{RS =
13495"\n"}, the locale is largely irrelevant.  For other single-character
13496record separators, setting @samp{LC_ALL=C} in the environment will
13497give you much better performance when reading records.  Otherwise,
13498@command{gawk} has to make several function calls, @emph{per input
13499character}, to find the record terminator.
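
As a rough sketch of the single-character case, the following shell
fragment splits semicolon-separated input with @samp{LC_ALL=C} set for
the @command{awk} process only.  The sample data and the shell variable
@code{records} are invented for illustration, and the fragment does not
measure the speedup itself:

@verbatim
```shell
# Illustrative data: three records separated by semicolons.
records=$(printf 'alpha;beta;gamma' |
    LC_ALL=C awk -v RS=';' '{ print NR ": " $0 }')
printf '%s\n' "$records"
```
@end verbatim

Setting the variable on the command line this way leaves the locale of
the rest of the script untouched.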
13500
13501Locales can affect how dates and times are formatted (@pxref{Time
13502Functions}).  For example, a common way to abbreviate the date September
135034, 2015, in the United States is ``9/4/15.''  In many countries in
13504Europe, however, it is abbreviated ``4.9.15.''  Thus, the @samp{%x}
13505specification in a @code{"US"} locale might produce @samp{9/4/15},
13506while in a @code{"EUROPE"} locale, it might produce @samp{4.9.15}.
13507
13508According to POSIX, string comparison is also affected by locales (similar
13509to regular expressions).  The details are presented in @ref{POSIX String
13510Comparison}.
13511
13512Finally, the locale affects the value of the decimal point character
13513used when @command{gawk} parses input data.  This is discussed in detail
13514in @ref{Conversion}.
13515
13516@node Expressions Summary
13517@section Summary
13518
13519@itemize @value{BULLET}
13520@item
13521Expressions are the basic elements of computation in programs.  They are
13522built from constants, variables, function calls, and combinations of the
13523various kinds of values with operators.
13524
13525@item
13526@command{awk} supplies three kinds of constants: numeric, string, and
13527regexp.  @command{gawk} lets you specify numeric constants in octal
13528and hexadecimal (bases 8 and 16) as well as decimal (base 10).
13529In certain contexts, a standalone regexp constant such as @code{/foo/}
13530has the same meaning as @samp{$0 ~ /foo/}.
13531
13532@item
13533Variables hold values between uses in computations. A number of built-in
13534variables provide information to your @command{awk} program, and a number
13535of others let you control how @command{awk} behaves.
13536
13537@item
13538Numbers are automatically converted to strings, and strings to numbers,
13539as needed by @command{awk}. Numeric values are converted as if they were
13540formatted with @code{sprintf()} using the format in @code{CONVFMT}.
13541Locales can influence the conversions.
13542
13543@item
13544@command{awk} provides the usual arithmetic operators (addition,
13545subtraction, multiplication, division, modulus), and unary plus and minus.
13546It also provides comparison operators, Boolean operators, an array membership
13547testing operator, and regexp
13548matching operators.  String concatenation is accomplished by placing
13549two expressions next to each other; there is no explicit operator.
13550The three-operand @samp{?:} operator provides an ``if-else'' test within
13551expressions.
13552
13553@item
13554Assignment operators provide convenient shorthands for common arithmetic
13555operations.
13556
13557@item
13558In @command{awk}, a value is considered to be true if it is nonzero
13559@emph{or} non-null. Otherwise, the value is false.
13560
13561@item
13562A variable's type is set upon each assignment and may change over its
13563lifetime.  The type determines how it behaves in comparisons (string
13564or numeric).
13565
13566@item
13567Function calls return a value that may be used as part of a larger
13568expression.  Expressions used to pass parameter values are fully
13569evaluated before the function is called.  @command{awk} provides
13570built-in and user-defined functions; this is described in
13571@ref{Functions}.
13572
13573@item
13574Operator precedence specifies the order in which operations are performed,
13575unless explicitly overridden by parentheses.  @command{awk}'s operator
13576precedence is compatible with that of C.
13577
13578@item
13579Locales can affect the format of data as output by an @command{awk}
13580program, and occasionally the format for data read as input.
13581
13582@end itemize
13583
13584
13585@node Patterns and Actions
13586@chapter Patterns, Actions, and Variables
13587@cindex patterns
13588
As you have already seen, each @command{awk} rule consists of
13590a pattern with an associated action.  This @value{CHAPTER} describes how
13591you build patterns and actions, what kinds of things you can do within
13592actions, and @command{awk}'s predefined variables.
13593
13594The pattern--action rules and the statements available for use
13595within actions form the core of @command{awk} programming.
In a sense, everything covered
up to here has been the foundation
on which programs are built.  Now it's time to start
building something useful.
13600
13601@menu
13602* Pattern Overview::            What goes into a pattern.
13603* Using Shell Variables::       How to use shell variables with @command{awk}.
13604* Action Overview::             What goes into an action.
13605* Statements::                  Describes the various control statements in
13606                                detail.
13607* Built-in Variables::          Summarizes the predefined variables.
13608* Pattern Action Summary::      Patterns and Actions summary.
13609@end menu
13610
13611@node Pattern Overview
13612@section Pattern Elements
13613
13614@menu
13615* Regexp Patterns::             Using regexps as patterns.
13616* Expression Patterns::         Any expression can be used as a pattern.
13617* Ranges::                      Pairs of patterns specify record ranges.
13618* BEGIN/END::                   Specifying initialization and cleanup rules.
13619* BEGINFILE/ENDFILE::           Two special patterns for advanced control.
13620* Empty::                       The empty pattern, which matches every record.
13621@end menu
13622
13623@cindex patterns @subentry types of
13624Patterns in @command{awk} control the execution of rules---a rule is
13625executed when its pattern matches the current input record.
13626The following is a summary of the types of @command{awk} patterns:
13627
13628@table @code
13629@item /@var{regular expression}/
13630A regular expression. It matches when the text of the
13631input record fits the regular expression.
13632(@xref{Regexp}.)
13633
13634@item @var{expression}
13635A single expression.  It matches when its value
13636is nonzero (if a number) or non-null (if a string).
13637(@xref{Expression Patterns}.)
13638
13639@item @var{begpat}, @var{endpat}
13640A pair of patterns separated by a comma, specifying a @dfn{range} of records.
13641The range includes both the initial record that matches @var{begpat} and
13642the final record that matches @var{endpat}.
13643(@xref{Ranges}.)
13644
13645@item BEGIN
13646@itemx END
13647Special patterns for you to supply startup or cleanup actions for your
13648@command{awk} program.
13649(@xref{BEGIN/END}.)
13650
13651@item BEGINFILE
13652@itemx ENDFILE
13653Special patterns for you to supply startup or cleanup actions to be
13654done on a per-file basis.
13655(@xref{BEGINFILE/ENDFILE}.)
13656
13657@item @var{empty}
13658The empty pattern matches every input record.
13659(@xref{Empty}.)
13660@end table
13661
13662@node Regexp Patterns
13663@subsection Regular Expressions as Patterns
13664@cindex patterns @subentry regexp constants as
13665@cindex regular expressions @subentry as patterns
13666
13667Regular expressions are one of the first kinds of patterns presented
13668in this book.
13669This kind of pattern is simply a regexp constant in the pattern part of
a rule.  Its meaning is @samp{$0 ~ /@var{pattern}/}.
13671The pattern matches when the input record matches the regexp.
13672For example:
13673
13674@example
13675/foo|bar|baz/  @{ buzzwords++ @}
13676END            @{ print buzzwords, "buzzwords seen" @}
13677@end example
13678
13679@node Expression Patterns
13680@subsection Expressions as Patterns
13681@cindex expressions @subentry as patterns
13682@cindex patterns @subentry expressions as
13683
13684Any @command{awk} expression is valid as an @command{awk} pattern.
13685The pattern matches if the expression's value is nonzero (if a
13686number) or non-null (if a string).
13687The expression is reevaluated each time the rule is tested against a new
13688input record.  If the expression uses fields such as @code{$1}, the
13689value depends directly on the new input record's text; otherwise, it
13690depends on only what has happened so far in the execution of the
13691@command{awk} program.
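
The two cases can be contrasted with a small sketch.  The input lines
and the shell variable names are made up for the example.  The first
pattern examines a field, so its value depends on the text of each
record; the second examines only @code{NR}, so it depends on how far
the program has progressed, not on what the record says:

@verbatim
```shell
# Pattern uses a field: matches records whose first field exceeds 10.
by_field=$(printf '20 a\n5 b\n30 c\n7 d\n' | awk '$1 > 10')
# Pattern uses no fields: matches every second record, whatever its text.
by_count=$(printf '20 a\n5 b\n30 c\n7 d\n' | awk 'NR % 2 == 0')
printf '%s\n' "$by_field" "$by_count"
```
@end verbatim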
13692
13693@cindex comparison expressions @subentry as patterns
13694@cindex patterns @subentry comparison expressions as
13695Comparison expressions, using the comparison operators described in
13696@ref{Typing and Comparison},
13697are a very common kind of pattern.
13698Regexp matching and nonmatching are also very common expressions.
13699The left operand of the @samp{~} and @samp{!~} operators is a string.
13700The right operand is either a constant regular expression enclosed in
13701slashes (@code{/@var{regexp}/}), or any expression whose string value
13702is used as a dynamic regular expression
13703(@pxref{Computed Regexps}).
13704The following example prints the second field of each input record
13705whose first field is precisely @samp{li}:
13706
13707@cindex @code{/} (forward slash) @subentry patterns and
13708@cindex forward slash (@code{/}) @subentry patterns and
13709@cindex @code{~} (tilde), @code{~} operator
13710@cindex tilde (@code{~}), @code{~} operator
13711@cindex @code{!} (exclamation point) @subentry @code{!~} operator
13712@cindex exclamation point (@code{!}) @subentry @code{!~} operator
13713@example
13714$ @kbd{awk '$1 == "li" @{ print $2 @}' mail-list}
13715@end example
13716
13717@noindent
13718(There is no output, because there is no person with the exact name @samp{li}.)
13719Contrast this with the following regular expression match, which
13720accepts any record with a first field that contains @samp{li}:
13721
13722@example
13723$ @kbd{awk '$1 ~ /li/ @{ print $2 @}' mail-list}
13724@print{} 555-5553
13725@print{} 555-6699
13726@end example
13727
13728@cindex regexp constants @subentry as patterns
13729@cindex patterns @subentry regexp constants as
13730A regexp constant as a pattern is also a special case of an expression
13731pattern.  The expression @code{/li/} has the value one if @samp{li}
13732appears in the current input record. Thus, as a pattern, @code{/li/}
13733matches any record containing @samp{li}.
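
One way to see this value directly is to assign regexp constants to
variables inside an action.  This is only a sketch; the input line and
the variable names @code{hit} and @code{miss} are chosen for
illustration:

@verbatim
```shell
# Evaluate two regexp constants as ordinary expressions and print their values.
values=$(echo 'Julie 555-6699' |
    awk '{ hit = /li/; miss = /xyz/; print hit, miss }')
echo "$values"
```
@end verbatim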
13734
13735@cindex Boolean expressions @subentry as patterns
13736@cindex patterns @subentry Boolean expressions as
13737Boolean expressions are also commonly used as patterns.
13738Whether the pattern
13739matches an input record depends on whether its subexpressions match.
13740For example, the following command prints all the records in
13741@file{mail-list} that contain both @samp{edu} and @samp{li}:
13742
13743@example
13744$ @kbd{awk '/edu/ && /li/' mail-list}
13745@print{} Samuel       555-3430     samuel.lanceolis@@shu.edu        A
13746@end example
13747
13748The following command prints all records in
13749@file{mail-list} that contain @emph{either} @samp{edu} or @samp{li}
13750(or both, of course):
13751
13752@example
13753$ @kbd{awk '/edu/ || /li/' mail-list}
13754@print{} Amelia       555-5553     amelia.zodiacusque@@gmail.com    F
13755@print{} Broderick    555-0542     broderick.aliquotiens@@yahoo.com R
13756@print{} Fabius       555-1234     fabius.undevicesimus@@ucb.edu    F
13757@print{} Julie        555-6699     julie.perscrutabor@@skeeve.com   F
13758@print{} Samuel       555-3430     samuel.lanceolis@@shu.edu        A
13759@print{} Jean-Paul    555-2127     jeanpaul.campanorum@@nyu.edu     R
13760@end example
13761
13762The following command prints all records in
13763@file{mail-list} that do @emph{not} contain the string @samp{li}:
13764
13765@example
13766$ @kbd{awk '! /li/' mail-list}
13767@print{} Anthony      555-3412     anthony.asserturo@@hotmail.com   A
13768@print{} Becky        555-7685     becky.algebrarum@@gmail.com      A
13769@print{} Bill         555-1675     bill.drowning@@hotmail.com       A
13770@print{} Camilla      555-2912     camilla.infusarum@@skynet.be     R
13771@print{} Fabius       555-1234     fabius.undevicesimus@@ucb.edu    F
13772@group
13773@print{} Martin       555-6480     martin.codicibus@@hotmail.com    A
13774@print{} Jean-Paul    555-2127     jeanpaul.campanorum@@nyu.edu     R
13775@end group
13776@end example
13777
13778@cindex @code{BEGIN} pattern @subentry Boolean patterns and
13779@cindex @code{END} pattern @subentry Boolean patterns and
13780@cindex @code{BEGINFILE} pattern @subentry Boolean patterns and
13781@cindex @code{ENDFILE} pattern @subentry Boolean patterns and
13782The subexpressions of a Boolean operator in a pattern can be constant regular
13783expressions, comparisons, or any other @command{awk} expressions.  Range
13784patterns are not expressions, so they cannot appear inside Boolean
13785patterns.  Likewise, the special patterns @code{BEGIN}, @code{END},
13786@code{BEGINFILE}, and @code{ENDFILE},
13787which never match any input record, are not expressions and cannot
13788appear inside Boolean patterns.
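
For instance, a comparison and a regexp match can be combined in one
Boolean pattern.  The following sketch, which uses invented input,
keeps records that are nonempty and whose first field does not start
with a comment character:

@verbatim
```shell
# NF > 0 rejects empty records; $1 !~ /^#/ rejects comment lines.
kept=$(printf 'alpha 1\n\n# note\nbeta 2\n' | awk 'NF > 0 && $1 !~ /^#/')
printf '%s\n' "$kept"
```
@end verbatim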
13789
13790The precedence of the different operators that can appear in
13791patterns is described in @ref{Precedence}.
13792
13793@node Ranges
13794@subsection Specifying Record Ranges with Patterns
13795
13796@cindex range patterns
13797@cindex patterns @subentry ranges in
13798@cindex lines @subentry matching ranges of
13799@cindex @code{,} (comma), in range patterns
13800@cindex comma (@code{,}), in range patterns
13801A @dfn{range pattern} is made of two patterns separated by a comma, in
13802the form @samp{@var{begpat}, @var{endpat}}.  It is used to match ranges of
consecutive input records.  The first pattern, @var{begpat}, controls
where the range begins, while @var{endpat} controls where
the range ends.  For example, the following:
13806
13807@example
13808awk '$1 == "on", $1 == "off"' myfile
13809@end example
13810
13811@noindent
13812prints every record in @file{myfile} between @samp{on}/@samp{off} pairs, inclusive.
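
To try this without a data file, the same range pattern can be fed
from a pipeline.  The input here is hypothetical and merely stands in
for @file{myfile}; note that the second @samp{on} has no matching
@samp{off}, so the range simply stays on until the input ends:

@verbatim
```shell
# Hypothetical contents standing in for myfile.
selected=$(printf 'a\non\nb\noff\nc\non\nd\n' |
    awk '$1 == "on", $1 == "off"')
printf '%s\n' "$selected"
```
@end verbatim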
13813
13814A range pattern starts out by matching @var{begpat} against every
13815input record.  When a record matches @var{begpat}, the range pattern is
13816@dfn{turned on}, and the range pattern matches this record as well.  As long as
13817the range pattern stays turned on, it automatically matches every input
13818record read.  The range pattern also matches @var{endpat} against every
13819input record; when this succeeds, the range pattern is @dfn{turned off} again
13820for the following record.  Then the range pattern goes back to checking
13821@var{begpat} against each record.
13822
13823@cindex @code{if} statement @subentry actions, changing
13824The record that turns on the range pattern and the one that turns it
13825off both match the range pattern.  If you don't want to operate on
13826these records, you can write @code{if} statements in the rule's action
13827to distinguish them from the records you are interested in.
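
A minimal sketch of that approach, reusing the @samp{on}/@samp{off}
markers with made-up input, tests each record inside the action and
prints only those that are not markers:

@verbatim
```shell
# The range still matches the marker records; the if statement filters them out.
inner=$(printf 'a\non\nb\nc\noff\nd\n' |
    awk '$1 == "on", $1 == "off" { if ($1 != "on" && $1 != "off") print }')
printf '%s\n' "$inner"
```
@end verbatim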
13828
It is possible for a range pattern to be turned on and off by the same
record. If the record satisfies both conditions, then the action is
executed for just that record.
13832For example, suppose there is text between two identical markers (e.g.,
13833the @samp{%} symbol), each on its own line, that should be ignored.
13834A first attempt would be to
13835combine a range pattern that describes the delimited text with the
13836@code{next} statement
13837(not discussed yet, @pxref{Next Statement}).
13838This causes @command{awk} to skip any further processing of the current
13839record and start over again with the next input record. Such a program
13840looks like this:
13841
13842@example
13843/^%$/,/^%$/    @{ next @}
13844               @{ print @}
13845@end example
13846
13847@noindent
13848@cindex lines @subentry skipping between markers
13849@c @cindex flag variables
13850This program fails because the range pattern is both turned on and turned off
13851by the first line, which just has a @samp{%} on it.  To accomplish this task,
13852write the program in the following manner, using a flag:
13853
13854@cindex @code{!} (exclamation point) @subentry @code{!} operator
13855@example
/^%$/     @{ skip = ! skip; next @}
skip == 1 @{ next @} # skip lines with `skip' set
          @{ print @}
13858@end example
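
The flag technique can be tried on a few lines of invented input.  This
sketch is a complete run and includes a rule that prints the lines that
are not skipped; the shell variable @code{visible} is just a name
chosen for the example:

@verbatim
```shell
visible=$(printf 'keep 1\n%%\nhidden\n%%\nkeep 2\n' | awk '
/^%$/     { skip = ! skip; next }   # a marker line toggles the flag
skip == 1 { next }                  # inside a marked region: ignore the line
          { print }                 # everything else is printed
')
printf '%s\n' "$visible"
```
@end verbatim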
13859
13860In a range pattern, the comma (@samp{,}) has the lowest precedence of
13861all the operators (i.e., it is evaluated last).  Thus, the following
13862program attempts to combine a range pattern with another, simpler test:
13863
13864@example
13865echo Yes | awk '/1/,/2/ || /Yes/'
13866@end example
13867
13868The intent of this program is @samp{(/1/,/2/) || /Yes/}.
13869However, @command{awk} interprets this as @samp{/1/, (/2/ || /Yes/)}.
13870This cannot be changed or worked around; range patterns do not combine
13871with other patterns:
13872
13873@example
13874$ @kbd{echo Yes | gawk '(/1/,/2/) || /Yes/'}
13875@error{} gawk: cmd. line:1: (/1/,/2/) || /Yes/
13876@error{} gawk: cmd. line:1:           ^ syntax error
13877@end example
13878
13879@cindex range patterns @subentry line continuation and
13880@cindex dark corner @subentry range patterns, line continuation and
13881As a minor point of interest, although it is poor style,
13882POSIX allows you to put a newline after the comma in
13883a range pattern.  @value{DARKCORNER}
13884
13885@node BEGIN/END
13886@subsection The @code{BEGIN} and @code{END} Special Patterns
13887
13888@cindex @code{BEGIN} pattern
13889@cindex @code{END} pattern
13890All the patterns described so far are for matching input records.
13891The @code{BEGIN} and @code{END} special patterns are different.
13892They supply startup and cleanup actions for @command{awk} programs.
13893@code{BEGIN} and @code{END} rules must have actions; there is no default
13894action for these rules because there is no current record when they run.
13895@code{BEGIN} and @code{END} rules are often referred to as
13896``@code{BEGIN} and @code{END} blocks'' by longtime @command{awk}
13897programmers.
13898
13899@menu
13900* Using BEGIN/END::             How and why to use BEGIN/END rules.
13901* I/O And BEGIN/END::           I/O issues in BEGIN/END rules.
13902@end menu
13903
13904@node Using BEGIN/END
13905@subsubsection Startup and Cleanup Actions
13906
13907@cindex @code{BEGIN} pattern
13908@cindex @code{END} pattern
A @code{BEGIN} rule is executed only once, before the first input record
is read. Likewise, an @code{END} rule is executed only once, after all the
input is read.  For example:
13912
13913@example
13914$ @kbd{awk '}
13915> @kbd{BEGIN @{ print "Analysis of \"li\"" @}}
13916> @kbd{/li/  @{ ++n @}}
13917> @kbd{END   @{ print "\"li\" appears in", n, "records." @}' mail-list}
13918@print{} Analysis of "li"
13919@print{} "li" appears in 4 records.
13920@end example
13921
13922@cindex @code{BEGIN} pattern @subentry operators and
13923@cindex @code{END} pattern @subentry operators and
13924This program finds the number of records in the input file @file{mail-list}
13925that contain the string @samp{li}.  The @code{BEGIN} rule prints a title
13926for the report.  There is no need to use the @code{BEGIN} rule to
13927initialize the counter @code{n} to zero, as @command{awk} does this
13928automatically (@pxref{Variables}).
13929The second rule increments the variable @code{n} every time a
13930record containing the pattern @samp{li} is read.  The @code{END} rule
13931prints the value of @code{n} at the end of the run.
13932
13933The special patterns @code{BEGIN} and @code{END} cannot be used in ranges
13934or with Boolean operators (indeed, they cannot be used with any operators).
13935An @command{awk} program may have multiple @code{BEGIN} and/or @code{END}
13936rules.  They are executed in the order in which they appear: all the @code{BEGIN}
13937rules at startup and all the @code{END} rules at termination.
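
The ordering can be checked with a short sketch that interleaves the
rules.  The messages are arbitrary, and @file{/dev/null} serves as an
empty input file:

@verbatim
```shell
order=$(awk '
BEGIN { print "begin 1" }
END   { print "end 1" }
BEGIN { print "begin 2" }
END   { print "end 2" }
' /dev/null)
printf '%s\n' "$order"
```
@end verbatim

Both @code{BEGIN} rules run before either @code{END} rule, each group
in the order written.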
13938
13939@code{BEGIN} and @code{END} rules may be intermixed with other rules.
13940This feature was added in the 1987 version of @command{awk} and is included
13941in the POSIX standard.
13942The original (1978) version of @command{awk}
13943required the @code{BEGIN} rule to be placed at the beginning of the
13944program, the @code{END} rule to be placed at the end, and only allowed one of
13945each.
13946This is no longer required, but it is a good idea to follow this template
13947in terms of program organization and readability.
13948
13949Multiple @code{BEGIN} and @code{END} rules are useful for writing
13950library functions, because each library file can have its own @code{BEGIN} and/or
13951@code{END} rule to do its own initialization and/or cleanup.
13952The order in which library functions are named on the command line
13953controls the order in which their @code{BEGIN} and @code{END} rules are
13954executed.  Therefore, you have to be careful when writing such rules in
13955library files so that the order in which they are executed doesn't matter.
@xref{Options}, for more information on
using library functions.
13958@xref{Library Functions},
13959for a number of useful library functions.
13960
13961If an @command{awk} program has only @code{BEGIN} rules and no
13962other rules, then the program exits after the @code{BEGIN} rules are
13963run.@footnote{The original version of @command{awk} kept
13964reading and ignoring input until the end of the file was seen.}  However, if an
13965@code{END} rule exists, then the input is read, even if there are
13966no other rules in the program.  This is necessary in case the @code{END}
13967rule checks the @code{FNR} and @code{NR} variables, or the fields.
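
The difference can be observed with two one-line programs.  This is a
sketch with invented input; it assumes a current @command{awk}, not the
original version mentioned in the footnote:

@verbatim
```shell
# Only a BEGIN rule: awk prints and exits; the input is never consulted.
greeting=$(awk 'BEGIN { print "no input needed" }' < /dev/null)
# Only an END rule: the input is still read, so NR holds the record count.
count=$(printf 'a\nb\nc\n' | awk 'END { print NR }')
printf '%s\n' "$greeting" "$count"
```
@end verbatim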
13968
13969@node I/O And BEGIN/END
13970@subsubsection Input/Output from @code{BEGIN} and @code{END} Rules
13971
13972@cindex input/output @subentry from @code{BEGIN} and @code{END}
13973There are several (sometimes subtle) points to be aware of when doing I/O
13974from a @code{BEGIN} or @code{END} rule.
13975The first has to do with the value of @code{$0} in a @code{BEGIN}
13976rule.  Because @code{BEGIN} rules are executed before any input is read,
13977there simply is no input record, and therefore no fields, when
13978executing @code{BEGIN} rules.  References to @code{$0} and the fields
13979yield a null string or zero, depending upon the context.  One way
13980to give @code{$0} a real value is to execute a @code{getline} command
13981without a variable (@pxref{Getline}).
13982Another way is simply to assign a value to @code{$0}.
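
Both approaches are sketched here with made-up data.  In each case
@code{$0} is split into fields in the usual way once it has a value:

@verbatim
```shell
# Assigning to $0 in a BEGIN rule splits the new text into fields.
assigned=$(awk 'BEGIN { $0 = "a b c"; print NF, $2 }' < /dev/null)
# A plain getline reads the first input record into $0.
fetched=$(printf 'x y\nsecond line\n' | awk 'BEGIN { getline; print $2 }')
printf '%s\n' "$assigned" "$fetched"
```
@end verbatim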
13983
13984@cindex Brian Kernighan's @command{awk}
13985@cindex differences in @command{awk} and @command{gawk} @subentry @code{BEGIN}/@code{END} patterns
13986@cindex POSIX @command{awk} @subentry @code{BEGIN}/@code{END} patterns
13987@cindex @code{print} statement @subentry @code{BEGIN}/@code{END} patterns and
13988@cindex @code{BEGIN} pattern @subentry @code{print} statement and
13989@cindex @code{END} pattern @subentry @code{print} statement and
13990The second point is similar to the first, but from the other direction.
13991Traditionally, due largely to implementation issues, @code{$0} and
13992@code{NF} were @emph{undefined} inside an @code{END} rule.
13993The POSIX standard specifies that @code{NF} is available in an @code{END}
13994rule. It contains the number of fields from the last input record.
13995@c FIXME: Update this if POSIX is ever fixed.
13996Most probably due to an oversight, the standard does not say that @code{$0}
13997is also preserved, although logically one would think that it should be.
13998In fact, all of BWK @command{awk}, @command{mawk}, and @command{gawk}
13999preserve the value of @code{$0} for use in @code{END} rules.  Be aware,
14000however, that some other implementations and many older versions
14001of Unix @command{awk} do not.
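
A quick way to check a particular implementation is the following
sketch.  With the three implementations named above, it is expected to
report the field count and text of the last record; other versions of
@command{awk} may behave differently:

@verbatim
```shell
# Print NF and $0 as they stand when the END rule runs.
last=$(printf 'a b\nc d e\n' | awk 'END { print NF ": " $0 }')
printf '%s\n' "$last"
```
@end verbatim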
14002
14003The third point follows from the first two.  The meaning of @samp{print}
14004inside a @code{BEGIN} or @code{END} rule is the same as always:
14005@samp{print $0}.  If @code{$0} is the null string, then this prints an
14006empty record.  Many longtime @command{awk} programmers use an unadorned
14007@samp{print} in @code{BEGIN} and @code{END} rules to mean @samp{@w{print ""}},
14008relying on @code{$0} being null.  Although one might generally get away with
14009this in @code{BEGIN} rules, it is a very bad idea in @code{END} rules,
14010at least in @command{gawk}.  It is also poor style, because if an empty
14011line is needed in the output, the program should print one explicitly.
14012
14013@cindex @code{next} statement @subentry @code{BEGIN}/@code{END} patterns and
14014@cindex @code{nextfile} statement @subentry @code{BEGIN}/@code{END} patterns and
14015@cindex @code{BEGIN} pattern @subentry @code{next}/@code{nextfile} statements and
14016@cindex @code{END} pattern @subentry @code{next}/@code{nextfile} statements and
14017Finally, the @code{next} and @code{nextfile} statements are not allowed
14018in a @code{BEGIN} rule, because the implicit
14019read-a-record-and-match-against-the-rules loop has not started yet.  Similarly, those statements
14020are not valid in an @code{END} rule, because all the input has been read.
14021(@xref{Next Statement} and
14022@ifnotdocbook
14023@pxref{Nextfile Statement}.)
14024@end ifnotdocbook
14025@ifdocbook
14026@ref{Nextfile Statement}.)
14027@end ifdocbook
14028
14029@node BEGINFILE/ENDFILE
14030@subsection The @code{BEGINFILE} and @code{ENDFILE} Special Patterns
14031@cindex @code{BEGINFILE} pattern
14032@cindex @code{ENDFILE} pattern
14033@cindex differences in @command{awk} and @command{gawk} @subentry @code{BEGINFILE}/@code{ENDFILE} patterns
14034
14035This @value{SECTION} describes a @command{gawk}-specific feature.
14036
14037Two special kinds of rule, @code{BEGINFILE} and @code{ENDFILE}, give
14038you ``hooks'' into @command{gawk}'s command-line file processing loop.
14039As with the @code{BEGIN} and @code{END} rules
14040@ifnottex
14041@ifnotdocbook
14042(@pxref{BEGIN/END}),
14043@end ifnotdocbook
14044@end ifnottex
14045@iftex
14046(see the previous @value{SECTION}),
14047@end iftex
14048@ifdocbook
14049(see the previous @value{SECTION}),
14050@end ifdocbook
14051@code{BEGINFILE} rules in a program execute in the order they are
14052read by @command{gawk}. Similarly, all @code{ENDFILE} rules also execute in
14053the order they are read.
14054
14055The bodies of the @code{BEGINFILE} rules execute just before
14056@command{gawk} reads the first record from a file.  @code{FILENAME}
14057is set to the name of the current file, and @code{FNR} is set to zero.
14058
14059Prior to @value{PVERSION} 5.1.1 of @command{gawk}, as an accident of the
14060implementation, @code{$0} and the fields retained any previous values
14061they had in @code{BEGINFILE} rules.  Starting with @value{PVERSION}
140625.1.1, @code{$0} and the fields are cleared, since no record has been
14063read yet from the file that is about to be processed.
14064
14065The @code{BEGINFILE} rule provides you the opportunity to accomplish two tasks
14066that would otherwise be difficult or impossible to perform:
14067
14068@itemize @value{BULLET}
14069@item
14070You can test if the file is readable.  Normally, it is a fatal error if a
14071file named on the command line cannot be opened for reading.  However,
14072you can bypass the fatal error and move on to the next file on the
14073command line.
14074
14075@cindex @command{gawk} @subentry @code{ERRNO} variable in
14076@cindex @code{ERRNO} variable @subentry with @code{BEGINFILE} pattern
14077@cindex @code{nextfile} statement @subentry @code{BEGINFILE}/@code{ENDFILE} patterns and
14078You do this by checking if the @code{ERRNO} variable is not the empty
14079string; if so, then @command{gawk} was not able to open the file. In
14080this case, your program can execute the @code{nextfile} statement
14081(@pxref{Nextfile Statement}).  This causes @command{gawk} to skip
14082the file entirely.  Otherwise, @command{gawk} exits with the usual
14083fatal error.
14084
14085@item
14086If you have written extensions that modify the record handling (by
14087inserting an ``input parser''; @pxref{Input Parsers}), you can invoke
14088them at this point, before @command{gawk} has started processing the file.
14089(This is a @emph{very} advanced feature, currently used only by the
14090@uref{https://sourceforge.net/projects/gawkextlib, @code{gawkextlib} project}.)
14091@end itemize
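
The first task might be sketched as follows.  The temporary directory,
the file names @file{good} and @file{missing}, and the shell variable
@code{kept} are all invented for the example, and the fragment does
nothing if @command{gawk} is not installed:

@verbatim
```shell
if command -v gawk >/dev/null 2>&1; then
    dir=$(mktemp -d)
    printf 'hello\n' > "$dir/good"
    kept=$(gawk '
    BEGINFILE {
        if (ERRNO != "") {      # gawk could not open FILENAME
            print "skipping " FILENAME ": " ERRNO > "/dev/stderr"
            nextfile
        }
    }
    { print }
    ' "$dir/missing" "$dir/good" 2>/dev/null)
    printf '%s\n' "$kept"
    rm -rf "$dir"
fi
```
@end verbatim

The nonexistent file is reported and skipped, and processing continues
with the readable one.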
14092
14093The @code{ENDFILE} rule is called when @command{gawk} has finished processing
14094the last record in an input file.  For the last input file,
14095it will be called before any @code{END} rules.
14096The @code{ENDFILE} rule is executed even for empty input files.
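
As an illustration, this sketch prints a per-file record count from an
@code{ENDFILE} rule and a grand total from an @code{END} rule.  The
file names @file{one} and @file{empty} are hypothetical, and the
fragment does nothing if @command{gawk} is not installed:

@verbatim
```shell
if command -v gawk >/dev/null 2>&1; then
    dir=$(mktemp -d)
    printf 'a\nb\n' > "$dir/one"
    : > "$dir/empty"            # a file with no records at all
    summary=$(cd "$dir" && gawk '
    ENDFILE { print FILENAME ": " FNR " records" }
    END     { print "total: " NR " records" }
    ' one empty)
    printf '%s\n' "$summary"
    rm -rf "$dir"
fi
```
@end verbatim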
14097
Normally, when an error occurs when reading input in the normal
input-processing loop, the error is fatal.  However, if an @code{ENDFILE}
rule is present, the error becomes non-fatal, and instead @code{ERRNO}
is set.  This makes it possible to catch and process I/O errors at the
level of the @command{awk} program.
14103
14104@cindex @code{next} statement @subentry @code{BEGINFILE}/@code{ENDFILE} patterns and
14105The @code{next} statement (@pxref{Next Statement}) is not allowed inside
14106either a @code{BEGINFILE} or an @code{ENDFILE} rule.  The @code{nextfile}
14107statement is allowed only inside a
14108@code{BEGINFILE} rule, not inside an @code{ENDFILE} rule.
14109
14110@cindex @code{getline} command @subentry @code{BEGINFILE}/@code{ENDFILE} patterns and
14111The @code{getline} statement (@pxref{Getline}) is restricted inside
14112both @code{BEGINFILE} and @code{ENDFILE}: only redirected
14113forms of @code{getline} are allowed.
14114
14115@code{BEGINFILE} and @code{ENDFILE} are @command{gawk} extensions.
14116In most other @command{awk} implementations, or if @command{gawk} is in
14117compatibility mode (@pxref{Options}), they are not special.
14118
14119@node Empty
14120@subsection The Empty Pattern
14121
14122@cindex empty pattern
14123@cindex patterns @subentry empty
14124An empty (i.e., nonexistent) pattern is considered to match @emph{every}
14125input record.  For example, the program:
14126
14127@example
14128awk '@{ print $1 @}' mail-list
14129@end example
14130
14131@noindent
14132prints the first field of every record.
14133
14134@node Using Shell Variables
14135@section Using Shell Variables in Programs
14136@cindex shells @subentry variables
14137@cindex @command{awk} programs @subentry shell variables in
14138@c @cindex shell and @command{awk} interaction
14139
14140@command{awk} programs are often used as components in larger
14141programs written in shell.
14142For example, it is very common to use a shell variable to
14143hold a pattern that the @command{awk} program searches for.
14144There are two ways to get the value of the shell variable
14145into the body of the @command{awk} program.
14146
14147@cindex shells @subentry quoting
14148A common method is to use shell quoting to substitute
14149the variable's value into the program inside the script.
14150For example, consider the following program:
14151
14152@example
14153@group
14154printf "Enter search pattern: "
14155read pattern
14156awk "/$pattern/ "'@{ nmatches++ @}
14157     END @{ print nmatches, "found" @}' /path/to/data
14158@end group
14159@end example
14160
14161@noindent
14162The @command{awk} program consists of two pieces of quoted text
14163that are concatenated together to form the program.
14164The first part is double-quoted, which allows substitution of
14165the @code{pattern} shell variable inside the quotes.
14166The second part is single-quoted.
14167
14168Variable substitution via quoting works, but can potentially be
14169messy.  It requires a good understanding of the shell's quoting rules
14170(@pxref{Quoting}),
14171and it's often difficult to correctly
14172match up the quotes when reading the program.
14173
14174A better method is to use @command{awk}'s variable assignment feature
14175(@pxref{Assignment Options})
14176to assign the shell variable's value to an @command{awk} variable.
14177Then use dynamic regexps to match the pattern
14178(@pxref{Computed Regexps}).
14179The following shows how to redo the
14180previous example using this technique:
14181
14182@example
14183printf "Enter search pattern: "
14184read pattern
14185awk -v pat="$pattern" '$0 ~ pat @{ nmatches++ @}
14186       END @{ print nmatches, "found" @}' /path/to/data
14187@end example
14188
14189@noindent
14190Now, the @command{awk} program is just one single-quoted string.
14191The assignment @samp{-v pat="$pattern"} still requires double quotes,
14192in case there is whitespace in the value of @code{$pattern}.
14193The @command{awk} variable @code{pat} could be named @code{pattern}
14194too, but that would be more confusing.  Using a variable also
14195provides more flexibility, as the variable can be used anywhere inside
14196the program---for printing, as an array subscript, or for any other
14197use---without requiring the quoting tricks at every point in the program.
14198
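As a small illustration of that flexibility, the following sketch reuses
@code{pat} both for matching and in the final report.  It is only a
variation on the previous example; the wording of the message is an
arbitrary choice:

@example
printf "Enter search pattern: "
read pattern
awk -v pat="$pattern" '$0 ~ pat @{ nmatches++ @}
       END @{ printf "%d found for \"%s\"\n", nmatches, pat @}' /path/to/data
@end example

@noindent
With the quoting method, the same report would need a second
double-quoted piece spliced into the @code{END} rule.
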
14199@node Action Overview
14200@section Actions
14201@c @cindex action, definition of
14202@c @cindex curly braces
14203@c @cindex action, curly braces
14204@c @cindex action, separating statements
14205@cindex actions
14206
14207An @command{awk} program or script consists of a series of
14208rules and function definitions interspersed.  (Functions are
14209described later.  @xref{User-defined}.)
14210A rule contains a pattern and an action, either of which (but not
14211both) may be omitted.  The purpose of the @dfn{action} is to tell
14212@command{awk} what to do once a match for the pattern is found.  Thus,
14213in outline, an @command{awk} program generally looks like this:
14214
14215@display
14216[@var{pattern}]  @code{@{ @var{action} @}}
14217 @var{pattern}  [@code{@{ @var{action} @}}]
14218@dots{}
14219@code{function @var{name}(@var{args}) @{ @dots{} @}}
14220@dots{}
14221@end display
14222
14223@cindex @code{@{@}} (braces) @subentry actions and
14224@cindex braces (@code{@{@}}) @subentry actions and
14225@cindex separators @subentry for statements in actions
14226@cindex newlines @subentry separating statements in actions
14227@cindex @code{;} (semicolon) @subentry separating statements in actions
14228@cindex semicolon (@code{;}) @subentry separating statements in actions
14229An action consists of one or more @command{awk} @dfn{statements}, enclosed
14230in braces (@samp{@{@r{@dots{}}@}}).  Each statement specifies one
14231thing to do.  The statements are separated by newlines or semicolons.
14232The braces around an action must be used even if the action
14233contains only one statement, or if it contains no statements at
14234all.  However, if you omit the action entirely, omit the braces as
14235well.  An omitted action is equivalent to @samp{@{ print $0 @}}:
14236
14237@example
14238/foo/  @{ @}     @ii{match @code{foo}, do nothing --- empty action}
14239/foo/          @ii{match @code{foo}, print the record --- omitted action}
14240@end example
14241
14242The following types of statements are supported in @command{awk}:
14243
14244@table @asis
14245@cindex side effects @subentry statements
14246@item Expressions
14247Call functions or assign values to variables
14248(@pxref{Expressions}).  Executing
14249this kind of statement simply computes the value of the expression.
14250This is useful when the expression has side effects
14251(@pxref{Assignment Ops}).
14252
14253@item Control statements
14254Specify the control flow of @command{awk}
14255programs.  The @command{awk} language gives you C-like constructs
14256(@code{if}, @code{for}, @code{while}, and @code{do}) as well as a few
14257special ones (@pxref{Statements}).
14258
14259@item Compound statements
14260Enclose one or more statements in braces.  A compound statement
14261is used in order to put several statements together in the body of an
14262@code{if}, @code{while}, @code{do}, or @code{for} statement.
14263
14264@item Input statements
14265Use the @code{getline} command
14266(@pxref{Getline}).
14267Also supplied in @command{awk} are the @code{next}
14268statement (@pxref{Next Statement})
14269and the @code{nextfile} statement
14270(@pxref{Nextfile Statement}).
14271
14272@item Output statements
14273Such as @code{print} and @code{printf}.
14274@xref{Printing}.
14275
14276@item Deletion statements
14277For deleting array elements.
14278@xref{Delete}.
14279@end table
14280
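To make the categories concrete, here is a minimal sketch that uses
statements of most of these kinds.  The input format is an assumption made
only for this illustration: each record is either
@samp{remove @var{item}} or @samp{@var{item} @var{quantity}}.  The array
name @code{stock} is likewise hypothetical:

@example
$1 == "remove" @{
    delete stock[$2]          # deletion statement
    next                      # input statement: go on to the next record
@}

@{
    stock[$1] += $2           # expression statement with a side effect
    if (stock[$1] < 0) @{      # control statement; its body is compound
        print "negative stock for", $1    # output statement
        stock[$1] = 0
    @}
@}
@end example
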
14281@node Statements
14282@section Control Statements in Actions
14283@cindex control statements
14284@cindex statements @subentry control, in actions
14285@cindex actions @subentry control statements in
14286
14287@dfn{Control statements}, such as @code{if}, @code{while}, and so on,
14288control the flow of execution in @command{awk} programs.  Most of @command{awk}'s
14289control statements are patterned after similar statements in C.
14290
14291@cindex compound statements, control statements and
14292@cindex statements @subentry compound, control statements and
14293@cindex body @subentry in actions
14294@cindex @code{@{@}} (braces) @subentry statements, grouping
14295@cindex braces (@code{@{@}}) @subentry statements, grouping
14296@cindex newlines @subentry separating statements in actions
14297@cindex @code{;} (semicolon) @subentry separating statements in actions
14298@cindex semicolon (@code{;}) @subentry separating statements in actions
14299All the control statements start with special keywords, such as @code{if}
14300and @code{while}, to distinguish them from simple expressions.
14301Many control statements contain other statements.  For example, the
14302@code{if} statement contains another statement that may or may not be
14303executed.  The contained statement is called the @dfn{body}.
14304To include more than one statement in the body, group them into a
14305single @dfn{compound statement} with braces, separating them with
14306newlines or semicolons.
14307
14308@menu
14309* If Statement::                Conditionally execute some @command{awk}
14310                                statements.
14311* While Statement::             Loop until some condition is satisfied.
14312* Do Statement::                Do specified action while looping until some
14313                                condition is satisfied.
14314* For Statement::               Another looping statement, that provides
14315                                initialization and increment clauses.
14316* Switch Statement::            Switch/case evaluation for conditional
14317                                execution of statements based on a value.
14318* Break Statement::             Immediately exit the innermost enclosing loop.
14319* Continue Statement::          Skip to the end of the innermost enclosing
14320                                loop.
14321* Next Statement::              Stop processing the current input record.
14322* Nextfile Statement::          Stop processing the current file.
14323* Exit Statement::              Stop execution of @command{awk}.
14324@end menu
14325
14326@node If Statement
14327@subsection The @code{if}-@code{else} Statement
14328
14329@cindex @code{if} statement
14330The @code{if}-@code{else} statement is @command{awk}'s decision-making
14331statement.  It looks like this:
14332
14333@display
14334@code{if (@var{condition}) @var{then-body}} [@code{else @var{else-body}}]
14335@end display
14336
14337@noindent
14338The @var{condition} is an expression that controls what the rest of the
14339statement does.  If the @var{condition} is true, @var{then-body} is
14340executed; otherwise, @var{else-body} is executed.
14341The @code{else} part of the statement is
14342optional.  The condition is considered false if its value is zero or
14343the null string; otherwise, the condition is true.
14344Refer to the following:
14345
14346@example
14347@group
14348if (x % 2 == 0)
14349    print "x is even"
14350else
14351    print "x is odd"
14352@end group
14353@end example
14354
14355In this example, if the expression @samp{x % 2 == 0} is true (i.e.,
14356if the value of @code{x} is evenly divisible by two), then the first
14357@code{print} statement is executed; otherwise, the second @code{print}
14358statement is executed.
14359If the @code{else} keyword appears on the same line as @var{then-body} and
14360@var{then-body} is not a compound statement (i.e., not surrounded by
14361braces), then a semicolon must separate @var{then-body} from
14362the @code{else}.
14363To illustrate this, the previous example can be rewritten as:
14364
14365@example
14366if (x % 2 == 0) print "x is even"; else
14367        print "x is odd"
14368@end example
14369
14370@noindent
14371If the @samp{;} is left out, @command{awk} can't interpret the statement and
14372it produces a syntax error.  Don't actually write programs this way,
14373because a human reader might fail to see the @code{else} if it is not
14374the first thing on its line.
14375
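Because the @var{else-body} is itself a statement, it can be another
@code{if}, which gives a chain of tests.  The following sketch assumes
that the first field of each nonempty record is a number; the counter
@code{evens} is arbitrary.  The second @var{then-body} is a compound
statement, so the @code{else} may follow its closing brace on the same
line without a semicolon:

@example
@{
    if (NF == 0)
        print "empty record"
    else if ($1 % 2 == 0) @{
        evens++
        print $1, "is even"
    @} else
        print $1, "is odd"
@}
@end example
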
14376@node While Statement
14377@subsection The @code{while} Statement
14378@cindex @code{while} statement
14379@cindex loops
14380@cindex loops @subentry @code{while}
14381@cindex loops @seealso{@code{while} statement}
14382
14383In programming, a @dfn{loop} is a part of a program that can
14384be executed two or more times in succession.
14385The @code{while} statement is the simplest looping statement in
@command{awk}.  It repeatedly executes a statement as long as a condition is
true.  It looks like this:
14388
14389@example
14390while (@var{condition})
14391  @var{body}
14392@end example
14393
14394@cindex body @subentry in loops
14395@noindent
14396@var{body} is a statement called the @dfn{body} of the loop,
14397and @var{condition} is an expression that controls how long the loop
14398keeps running.
14399The first thing the @code{while} statement does is test the @var{condition}.
14400If the @var{condition} is true, it executes the statement @var{body}.
14401@ifinfo
14402(The @var{condition} is true when the value
14403is not zero and not a null string.)
14404@end ifinfo
14405After @var{body} has been executed,
14406@var{condition} is tested again, and if it is still true, @var{body}
14407executes again.  This process repeats until the @var{condition} is no longer
14408true.  If the @var{condition} is initially false, the body of the loop
14409never executes and @command{awk} continues with the statement following
14410the loop.
14411This example prints the first three fields of each record, one per line:
14412
14413@example
14414awk '
14415@{
14416    i = 1
14417    while (i <= 3) @{
14418        print $i
14419        i++
14420    @}
14421@}' inventory-shipped
14422@end example
14423
14424@noindent
14425The body of this loop is a compound statement enclosed in braces,
14426containing two statements.
14427The loop works in the following manner: first, the value of @code{i} is set to one.
14428Then, the @code{while} statement tests whether @code{i} is less than or equal to
14429three.  This is true when @code{i} equals one, so the @code{i}th
14430field is printed.  Then the @samp{i++} increments the value of @code{i}
14431and the loop repeats.  The loop terminates when @code{i} reaches four.
14432
14433A newline is not required between the condition and the
14434body; however, using one makes the program clearer unless the body is a
14435compound statement or else is very simple.  The newline after the open brace
14436that begins the compound statement is not required either, but the
14437program is harder to read without it.
14438
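The limit in the @var{condition} need not be a constant.  As a sketch,
the following variation on the previous program uses the built-in
variable @code{NF} (the number of fields in the current record) as the
limit, so it prints every field of each record along with its position.
For an empty record, @code{NF} is zero, the @var{condition} is false
the first time it is tested, and the body never executes:

@example
awk '
@{
    i = 1
    while (i <= NF) @{
        print i, $i
        i++
    @}
@}' inventory-shipped
@end example
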
14439@node Do Statement
14440@subsection The @code{do}-@code{while} Statement
14441@cindex @code{do}-@code{while} statement
14442@cindex loops @subentry @code{do}-@code{while}
14443
14444The @code{do} loop is a variation of the @code{while} looping statement.
14445The @code{do} loop executes the @var{body} once and then repeats the
14446@var{body} as long as the @var{condition} is true.  It looks like this:
14447
14448@example
14449do
14450  @var{body}
14451while (@var{condition})
14452@end example
14453
14454Even if the @var{condition} is false at the start, the @var{body}
14455executes at least once (and only once, unless executing @var{body}
14456makes @var{condition} true).  Contrast this with the corresponding
14457@code{while} statement:
14458
14459@example
14460while (@var{condition})
14461    @var{body}
14462@end example
14463
14464@noindent
14465This statement does not execute the @var{body} even once if the
14466@var{condition} is false to begin with.  The following is an example of
14467a @code{do} statement:
14468
14469@example
14470@{
14471    i = 1
14472    do @{
14473        print $0
14474        i++
14475    @} while (i <= 10)
14476@}
14477@end example
14478
14479@noindent
14480This program prints each input record 10 times.  However, it isn't a very
14481realistic example, because in this case an ordinary @code{while} would do
14482just as well.  This situation reflects actual experience; only
14483occasionally is there a real use for a @code{do} statement.
14484
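One situation where @code{do} is a natural fit is when the body must run
before the @var{condition} can sensibly be tested.  The following sketch
counts the decimal digits of the first field.  It assumes that the field
holds a nonnegative integer, and the variable names are arbitrary.
Because the body runs at least once, an input of zero is reported as
having one digit:

@example
@{
    n = int($1)
    ndigits = 0
    do @{
        ndigits++
        n = int(n / 10)
    @} while (n > 0)
    print $1, "has", ndigits, "digit(s)"
@}
@end example
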
14485@node For Statement
14486@subsection The @code{for} Statement
14487@cindex @code{for} statement
14488@cindex loops @subentry @code{for} @subentry iterative
14489
14490The @code{for} statement makes it more convenient to count iterations of a
14491loop.  The general form of the @code{for} statement looks like this:
14492
14493@example
14494for (@var{initialization}; @var{condition}; @var{increment})
14495  @var{body}
14496@end example
14497
14498@noindent
14499The @var{initialization}, @var{condition}, and @var{increment} parts are
14500arbitrary @command{awk} expressions, and @var{body} stands for any
14501@command{awk} statement.
14502
14503The @code{for} statement starts by executing @var{initialization}.
14504Then, as long
14505as the @var{condition} is true, it repeatedly executes @var{body} and then
14506@var{increment}.  Typically, @var{initialization} sets a variable to
14507either zero or one, @var{increment} adds one to it, and @var{condition}
14508compares it against the desired number of iterations.
14509For example:
14510
14511@example
14512awk '
14513@{
14514    for (i = 1; i <= 3; i++)
14515        print $i
14516@}' inventory-shipped
14517@end example
14518
14519@noindent
14520This prints the first three fields of each input record, with one
14521input field per output line.
14522
14523@c @cindex comma operator, not supported
14524C and C++ programmers might expect to be able to use the comma
14525operator to set more than one variable in the @var{initialization}
14526part of the @code{for} loop, or to increment multiple variables in the
14527@var{increment} part of the loop, like so:
14528
14529@example
14530for (i = 0, j = length(a); i < j; i++, j--) @dots{}   @ii{C/C++, not awk!}
14531@end example
14532
14533@noindent
14534You cannot do this; the comma operator is not supported in @command{awk}.
14535There are workarounds, but they are nonobvious and can lead to
14536code that is difficult to read and understand. It is best, therefore,
14537to simply write additional initializations as separate statements
14538preceding the @code{for} loop and to place additional increment statements
14539at the end of the loop's body.
14540
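As a sketch of that advice, the following rule walks two indices toward
each other to test whether a record reads the same in both directions.
The variable names are arbitrary.  The second index, @code{j}, is
initialized before the loop and decremented at the end of the body:

@example
@{
    s = $0
    j = length(s)               # extra initialization, before the loop
    for (i = 1; i < j; i++) @{
        if (substr(s, i, 1) != substr(s, j, 1))
            break
        j--                     # extra increment, at the end of the body
    @}
    print s, (i >= j ? "is" : "is not"), "a palindrome"
@}
@end example
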
14541Most often, @var{increment} is an increment expression, as in the earlier
14542example.  But this is not required; it can be any expression
14543whatsoever.  For example, the following statement prints all the powers of two
14544between 1 and 100:
14545
14546@example
14547for (i = 1; i <= 100; i *= 2)
14548    print i
14549@end example
14550
14551If there is nothing to be done, any of the three expressions in the
14552parentheses following the @code{for} keyword may be omitted.  Thus,
14553@w{@samp{for (; x > 0;)}} is equivalent to @w{@samp{while (x > 0)}}.  If the
14554@var{condition} is omitted, it is treated as true, effectively
14555yielding an @dfn{infinite loop} (i.e., a loop that never terminates).
14556
14557In most cases, a @code{for} loop is an abbreviation for a @code{while}
14558loop, as shown here:
14559
14560@example
14561@var{initialization}
14562while (@var{condition}) @{
14563  @var{body}
14564  @var{increment}
14565@}
14566@end example
14567
14568@cindex loops @subentry @code{continue} statement and
14569@noindent
14570The only exception is when the @code{continue} statement
14571(@pxref{Continue Statement}) is used
14572inside the loop. Changing a @code{for} statement to a @code{while}
14573statement in this way can change the effect of the @code{continue}
14574statement inside the loop.
14575
14576The @command{awk} language has a @code{for} statement in addition to a
14577@code{while} statement because a @code{for} loop is often both less work to
14578type and more natural to think of.  Counting the number of iterations is
14579very common in loops.  It can be easier to think of this counting as part
14580of looping rather than as something to do inside the loop.
14581
14582@cindex @code{in} operator
14583There is an alternative version of the @code{for} loop, for iterating over
14584all the indices of an array:
14585
14586@example
14587for (i in array)
14588    @var{do something with} array[i]
14589@end example
14590
14591@noindent
14592@xref{Scanning an Array}
14593for more information on this version of the @code{for} loop.
14594
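As a brief sketch, the following program counts how many times each
distinct first field occurs and then scans the resulting array.  The
array name @code{count} is arbitrary, and the order in which the
indices are visited is not something the program can rely on:

@example
@{ count[$1]++ @}

END @{
    for (key in count)
        print key, count[key]
@}
@end example
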
14595@node Switch Statement
14596@subsection The @code{switch} Statement
14597@cindex @code{switch} statement
14598@cindex @code{case} keyword
14599@cindex @code{default} keyword
14600
14601This @value{SECTION} describes a @command{gawk}-specific feature.
14602If @command{gawk} is in compatibility mode (@pxref{Options}),
14603it is not available.
14604
14605The @code{switch} statement allows the evaluation of an expression and
14606the execution of statements based on a @code{case} match. Case statements
14607are checked for a match in the order they are defined.  If no suitable
14608@code{case} is found, the @code{default} section is executed, if supplied.
14609
14610Each @code{case} contains a single constant, be it numeric, string,
14611or regexp.  The @code{switch} expression is evaluated, and then each
@code{case}'s constant is compared against the result in turn. The
type of constant determines the comparison: numeric and string constants
use the usual comparisons.  A regexp constant (either regular, @code{/foo/}, or
14615strongly typed, @code{@@/foo/}) does a regular expression match against
14616the string value of the original expression.  The general form of the
14617@code{switch} statement looks like this:
14618
14619@example
14620switch (@var{expression}) @{
14621case @var{value or regular expression}:
14622    @var{case-body}
14623default:
14624    @var{default-body}
14625@}
14626@end example
14627
14628Control flow in
14629the @code{switch} statement works as it does in C. Once a match to a given
14630case is made, the case statement bodies execute until a @code{break},
14631@code{continue}, @code{next}, @code{nextfile}, or @code{exit} is encountered,
14632or the end of the @code{switch} statement itself. For example:
14633
14634@example
14635while ((c = getopt(ARGC, ARGV, "aksx")) != -1) @{
14636    switch (c) @{
14637    case "a":
14638        # report size of all files
        all_files = TRUE
14640        break
14641    case "k":
14642        BLOCK_SIZE = 1024       # 1K block size
14643        break
14644    case "s":
14645        # do sums only
14646        sum_only = TRUE
14647        break
14648    case "x":
14649        # don't cross filesystems
14650        fts_flags = or(fts_flags, FTS_XDEV)
14651        break
14652    case "?":
14653    default:
14654        usage()
14655        break
14656    @}
14657@}
14658@end example
14659
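Note that if none of the statements listed earlier (@code{break},
@code{continue}, @code{next}, @code{nextfile}, or @code{exit}) halts
execution of a matched @code{case}, execution falls through into the
following @code{case} bodies until one of those statements or the end of
the @code{switch} is reached.  In this example, the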
14663@code{case} for @code{"?"} falls through to the @code{default}
14664case, which is to call a function named @code{usage()}.
14665(The @code{getopt()} function being called here is
14666described in @ref{Getopt Function}.)
14667
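The previous example uses only string constants.  As a sketch of a
regexp @code{case}, the following @command{gawk} rule classifies the
first field of each record.  The categories and the variable
@code{kind} are invented for the illustration.  The regexp is matched
against the string value of @code{$1}, and the @code{"yes"} case has an
empty body, so it falls through to share the body of the @code{"no"}
case:

@example
@{
    switch ($1) @{
    case /^[0-9]+$/:
        kind = "integer"
        break
    case "yes":
    case "no":
        kind = "boolean"
        break
    default:
        kind = "other"
        break
    @}
    print $1, kind
@}
@end example
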
14668@node Break Statement
14669@subsection The @code{break} Statement
14670@cindex @code{break} statement
14671@cindex loops @subentry exiting
14672@cindex loops @subentry @code{break} statement and
14673
14674The @code{break} statement jumps out of the innermost @code{for},
14675@code{while}, or @code{do} loop that encloses it.  The following example
14676finds the smallest divisor of any integer, and also identifies prime
14677numbers:
14678
14679@example
14680@group
14681# find smallest divisor of num
14682@{
14683    num = $1
14684    for (divisor = 2; divisor * divisor <= num; divisor++) @{
14685        if (num % divisor == 0)
14686            break
14687    @}
14688@end group
14689@group
14690    if (num % divisor == 0)
14691        printf "Smallest divisor of %d is %d\n", num, divisor
14692    else
14693        printf "%d is prime\n", num
14694@}
14695@end group
14696@end example
14697
14698When the remainder is zero in the first @code{if} statement, @command{awk}
14699immediately @dfn{breaks out} of the containing @code{for} loop.  This means
14700that @command{awk} proceeds immediately to the statement following the loop
14701and continues processing.  (This is very different from the @code{exit}
14702statement, which stops the entire @command{awk} program.
14703@xref{Exit Statement}.)
14704
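Only the innermost loop is affected.  In the following sketch, the
@code{break} leaves the inner @code{for} loop each time @code{j} exceeds
@code{i}, while the outer loop carries on with its next iteration:

@example
BEGIN @{
    for (i = 1; i <= 3; i++) @{
        for (j = 1; j <= 3; j++) @{
            if (j > i)
                break           # leaves only the inner loop
            printf "%d,%d ", i, j
        @}
        print ""                # still runs after every inner loop
    @}
@}
@end example
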
14705The following program illustrates how the @var{condition} of a @code{for}
14706or @code{while} statement could be replaced with a @code{break} inside
14707an @code{if}:
14708
14709@example
14710# find smallest divisor of num
14711@{
14712    num = $1
14713    for (divisor = 2; ; divisor++) @{
14714        if (num % divisor == 0) @{
14715            printf "Smallest divisor of %d is %d\n", num, divisor
14716            break
14717        @}
14718        if (divisor * divisor > num) @{
14719            printf "%d is prime\n", num
14720            break
14721        @}
14722    @}
14723@}
14724@end example
14725
14726The @code{break} statement is also used to break out of the
14727@code{switch} statement.
14728This is discussed in @ref{Switch Statement}.
14729
14730@c @cindex @code{break}, outside of loops
14731@c @cindex historical features
14732@c @cindex @command{awk} language, POSIX version
14733@cindex POSIX @command{awk} @subentry @code{break} statement and
14734@cindex dark corner @subentry @code{break} statement
14735@cindex @command{gawk} @subentry @code{break} statement in
14736@cindex Brian Kernighan's @command{awk}
14737The @code{break} statement has no meaning when
14738used outside the body of a loop or @code{switch}.
14739However, although it was never documented,
14740historical implementations of @command{awk} treated the @code{break}
14741statement outside of a loop as if it were a @code{next} statement
14742(@pxref{Next Statement}).
14743@value{DARKCORNER}
14744Recent versions of BWK @command{awk} no longer allow this usage,
14745nor does @command{gawk}.
14746
14747@node Continue Statement
14748@subsection The @code{continue} Statement
14749
14750@cindex @code{continue} statement
14751Similar to @code{break}, the @code{continue} statement is used only inside
14752@code{for}, @code{while}, and @code{do} loops.  It skips
14753over the rest of the loop body, causing the next cycle around the loop
14754to begin immediately.  Contrast this with @code{break}, which jumps out
14755of the loop altogether.
14756
14757The @code{continue} statement in a @code{for} loop directs @command{awk} to
14758skip the rest of the body of the loop and resume execution with the
14759increment-expression of the @code{for} statement.  The following program
14760illustrates this fact:
14761
14762@example
14763BEGIN @{
14764     for (x = 0; x <= 20; x++) @{
14765         if (x == 5)
14766             continue
14767         printf "%d ", x
14768     @}
14769     print ""
14770@}
14771@end example
14772
14773@noindent
14774This program prints all the numbers from 0 to 20---except for 5, for
14775which the @code{printf} is skipped.  Because the increment @samp{x++}
14776is not skipped, @code{x} does not remain stuck at 5.  Contrast the
14777@code{for} loop from the previous example with the following @code{while} loop:
14778
14779@example
14780BEGIN @{
14781     x = 0
14782     while (x <= 20) @{
14783         if (x == 5)
14784             continue
14785         printf "%d ", x
14786         x++
14787     @}
14788     print ""
14789@}
14790@end example
14791
14792@noindent
14793This program loops forever once @code{x} reaches 5, because
14794the increment (@samp{x++}) is never reached.
14795
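One way to repair the @code{while} version, shown here only as a sketch,
is to perform the increment before the @code{continue}:

@example
BEGIN @{
     x = 0
     while (x <= 20) @{
         if (x == 5) @{
             x++
             continue
         @}
         printf "%d ", x
         x++
     @}
     print ""
@}
@end example
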
14796@c @cindex @code{continue}, outside of loops
14797@c @cindex historical features
14798@c @cindex @command{awk} language, POSIX version
14799@cindex POSIX @command{awk} @subentry @code{continue} statement and
14800@cindex dark corner @subentry @code{continue} statement
14801@cindex @command{gawk} @subentry @code{continue} statement in
14802@cindex Brian Kernighan's @command{awk}
14803The @code{continue} statement has no special meaning with respect to the
14804@code{switch} statement, nor does it have any meaning when used outside the
14805body of a loop.  Historical versions of @command{awk} treated a @code{continue}
14806statement outside a loop the same way they treated a @code{break}
14807statement outside a loop: as if it were a @code{next}
14808statement
14809@ifset FOR_PRINT
14810(discussed in the following @value{SECTION}).
14811@end ifset
14812@ifclear FOR_PRINT
14813(@pxref{Next Statement}).
14814@end ifclear
14815@value{DARKCORNER}
14816Recent versions of BWK @command{awk} no longer work this way, nor
14817does @command{gawk}.
14818
14819@node Next Statement
14820@subsection The @code{next} Statement
14821@cindex @code{next} statement
14822
14823The @code{next} statement forces @command{awk} to immediately stop processing
14824the current record and go on to the next record.  This means that no
14825further rules are executed for the current record, and the rest of the
14826current rule's action isn't executed.
14827
Contrast this with the effect of the @code{getline} command
14829(@pxref{Getline}).  That also causes
14830@command{awk} to read the next record immediately, but it does not alter the
14831flow of control in any way (i.e., the rest of the current action executes
14832with a new input record).
14833
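The difference shows up in a sketch such as the following pair of
programs, each of which tries to drop comment lines (assumed here to
begin with @samp{#}) and print the rest.  With @code{next}, a comment
record never reaches the second rule:

@example
/^#/ @{ next @}
@{ print FNR ": " $0 @}
@end example

@noindent
With @code{getline}, the first rule replaces the comment with the
following record and then control simply continues.  The second rule
prints whatever is now in @code{$0}, even if that record is also a
comment, because the first rule's pattern is not tested again for it:

@example
/^#/ @{ getline @}
@{ print FNR ": " $0 @}
@end example
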
14834@cindex @command{awk} programs @subentry execution of
14835At the highest level, @command{awk} program execution is a loop that reads
14836an input record and then tests each rule's pattern against it.  If you
14837think of this loop as a @code{for} statement whose body contains the
14838rules, then the @code{next} statement is analogous to a @code{continue}
14839statement. It skips to the end of the body of this implicit loop and
14840executes the increment (which reads another record).
14841
14842For example, suppose an @command{awk} program works only on records
14843with four fields, and it shouldn't fail when given bad input.  To avoid
14844complicating the rest of the program, write a ``weed out'' rule near
14845the beginning, in the following manner:
14846
14847@example
14848NF != 4 @{
14849    printf("%s:%d: skipped: NF != 4\n", FILENAME, FNR) > "/dev/stderr"
14850    next
14851@}
14852@end example
14853
14854@noindent
14855Because of the @code{next} statement,
14856the program's subsequent rules won't see the bad record.  The error
14857message is redirected to the standard error output stream, as error
14858messages should be.
14859For more detail, see
14860@ref{Special Files}.
14861
14862If the @code{next} statement causes the end of the input to be reached,
14863then the code in any @code{END} rules is executed.
14864@xref{BEGIN/END}.
14865
14866The @code{next} statement is not allowed inside @code{BEGINFILE} and
14867@code{ENDFILE} rules. @xref{BEGINFILE/ENDFILE}.
14868
14869@c @cindex @code{next}, inside a user-defined function
14870@cindex @command{awk} @subentry language, POSIX version
14871@cindex @code{BEGIN} pattern @subentry @code{next}/@code{nextfile} statements and
14872@cindex @code{END} pattern @subentry @code{next}/@code{nextfile} statements and
14873@cindex POSIX @command{awk} @subentry @code{next}/@code{nextfile} statements and
14874@cindex @code{next} statement @subentry user-defined functions and
14875@cindex functions @subentry user-defined @subentry @code{next}/@code{nextfile} statements and
14876According to the POSIX standard, the behavior is undefined if the
14877@code{next} statement is used in a @code{BEGIN} or @code{END} rule.
@command{gawk} treats it as a syntax error.  Although POSIX does not disallow it,
most other @command{awk} implementations don't allow the @code{next}
statement inside function bodies (@pxref{User-defined}).
@command{gawk} does allow it; just as with any
other @code{next} statement, a @code{next} statement inside a function
body reads the next record and starts processing it with the first rule
in the program.
14884
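As a sketch of such usage in @command{gawk}, the following program puts
the test for comment lines into a function.  The function name
@code{skip_comment()} and the comment convention (a leading @samp{#})
are assumptions made for the illustration:

@example
function skip_comment()
@{
    if ($0 ~ /^#/)
        next        # abandons the current record, as in a rule
@}

@{
    skip_comment()
    print
@}
@end example
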
14885@node Nextfile Statement
14886@subsection The @code{nextfile} Statement
14887@cindex @code{nextfile} statement
14888
14889The @code{nextfile} statement
14890is similar to the @code{next} statement.
14891However, instead of abandoning processing of the current record, the
14892@code{nextfile} statement instructs @command{awk} to stop processing the
14893current @value{DF}.
14894
14895Upon execution of the @code{nextfile} statement,
14896@code{FILENAME} is
14897updated to the name of the next @value{DF} listed on the command line,
14898@code{FNR} is reset to one,
14899and processing
14900starts over with the first rule in the program.
14901If the @code{nextfile} statement causes the end of the input to be reached,
14902then the code in any @code{END} rules is executed. An exception to this is
14903when @code{nextfile} is invoked during execution of any statement in an
14904@code{END} rule; in this case, it causes the program to stop immediately.
14905@xref{BEGIN/END}.
14906
14907The @code{nextfile} statement is useful when there are many @value{DF}s
14908to process but it isn't necessary to process every record in every file.
14909Without @code{nextfile},
14910in order to move on to the next @value{DF}, a program
14911would have to continue scanning the unwanted records.  The @code{nextfile}
14912statement accomplishes this much more efficiently.
14913
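For example, the following sketch prints the name of each file that
contains at least one record matching @samp{foo}.  Once a match is
found, the rest of that file is of no interest, so @code{nextfile} moves
straight on to the next one.  The regexp and the file names are
placeholders:

@example
awk '/foo/ @{ print FILENAME; nextfile @}' file1 file2 file3
@end example
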
14914In @command{gawk}, execution of @code{nextfile} causes additional things
14915to happen: any @code{ENDFILE} rules are executed if @command{gawk} is
14916not currently in an @code{END} rule, @code{ARGIND} is
14917incremented, and any @code{BEGINFILE} rules are executed.  (@code{ARGIND}
14918hasn't been introduced yet. @xref{Built-in Variables}.)
14919
14920There is an additional, special, use case
14921with @command{gawk}. @code{nextfile} is useful inside a @code{BEGINFILE}
14922rule to skip over a file that would otherwise cause @command{gawk}
14923to exit with a fatal error. In this special case, @code{ENDFILE} rules are not
14924executed. @xref{BEGINFILE/ENDFILE}.
14925
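A minimal sketch of this use follows.  It relies on @command{gawk}
setting @code{ERRNO} to a nonempty string when a file cannot be opened;
the wording of the message is arbitrary:

@example
BEGINFILE @{
    if (ERRNO != "") @{
        print "skipping " FILENAME ": " ERRNO > "/dev/stderr"
        nextfile
    @}
@}
@end example
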
14926Although it might seem that @samp{close(FILENAME)} would accomplish
14927the same as @code{nextfile}, this isn't true.  @code{close()} is
14928reserved for closing files, pipes, and coprocesses that are
14929opened with redirections.  It is not related to the main processing that
14930@command{awk} does with the files listed in @code{ARGV}.
14931
14932@quotation NOTE
14933For many years, @code{nextfile} was a
14934common extension. In September 2012, it was accepted for
14935inclusion into the POSIX standard.
14936See @uref{http://austingroupbugs.net/view.php?id=607, the Austin Group website}.
14937@end quotation
14938
14939@cindex functions @subentry user-defined @subentry @code{next}/@code{nextfile} statements and
14940@cindex @code{nextfile} statement @subentry user-defined functions and
14941@cindex Brian Kernighan's @command{awk}
14942@cindex @command{mawk} utility
14943The current version of BWK @command{awk} and @command{mawk}
14944also support @code{nextfile}.  However, they don't allow the
14945@code{nextfile} statement inside function bodies (@pxref{User-defined}).
14946@command{gawk} does; a @code{nextfile} inside a function body reads the
14947first record from the next file and starts processing it with the first
rule in the program, just as with any other @code{nextfile} statement.
14949
14950@node Exit Statement
14951@subsection The @code{exit} Statement
14952
14953@cindex @code{exit} statement
14954The @code{exit} statement causes @command{awk} to immediately stop
14955executing the current rule and to stop processing input; any remaining input
14956is ignored.  The @code{exit} statement is written as follows:
14957
14958@display
14959@code{exit} [@var{return code}]
14960@end display
14961
14962@cindex @code{BEGIN} pattern @subentry @code{exit} statement and
14963@cindex @code{END} pattern @subentry @code{exit} statement and
14964When an @code{exit} statement is executed from a @code{BEGIN} rule, the
14965program stops processing everything immediately.  No input records are
14966read.  However, if an @code{END} rule is present,
14967as part of executing the @code{exit} statement,
14968the @code{END} rule is executed
14969(@pxref{BEGIN/END}).
14970If @code{exit} is used in the body of an @code{END} rule, it causes
14971the program to stop immediately.
14972
14973An @code{exit} statement that is not part of a @code{BEGIN} or @code{END}
14974rule stops the execution of any further automatic rules for the current
14975record, skips reading any remaining input records, and executes the
14976@code{END} rule if there is one.  @command{gawk} also skips
14977any @code{ENDFILE} rules; they do not execute.
14978
14979In such a case,
14980if you don't want the @code{END} rule to do its job, set a variable
14981to a nonzero value before the @code{exit} statement and check that variable in
14982the @code{END} rule.
14983@xref{Assert Function}
14984for an example that does this.
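
As a minimal sketch of this technique (the variable name
@code{exiting} and the pattern @code{/fatal/} are made up for
illustration):

@example
/fatal/ @{
    exiting = 1     # tell the END rule not to do its usual work
    exit 1
@}

END @{
    if (exiting)
        exit        # no argument, so the earlier status is kept
    print NR, "records processed"
@}
@end example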
14985
14986@cindex dark corner @subentry @code{exit} statement
14987If an argument is supplied to @code{exit}, its value is used as the exit
14988status code for the @command{awk} process.  If no argument is supplied,
14989@code{exit} causes @command{awk} to return a ``success'' status.
14990In the case where an argument
14991is supplied to a first @code{exit} statement, and then @code{exit} is
14992called a second time from an @code{END} rule with no argument,
14993@command{awk} uses the previously supplied exit value.  @value{DARKCORNER}
14994@xref{Exit Status} for more information.
14995
14996@cindex programming conventions @subentry @code{exit} statement
14997For example, suppose an error condition occurs that is difficult or
14998impossible to handle.  Conventionally, programs report this by
14999exiting with a nonzero status.  An @command{awk} program can do this
15000using an @code{exit} statement with a nonzero argument, as shown
15001in the following example:
15002
15003@example
15004@group
15005BEGIN @{
15006    if (("date" | getline date_now) <= 0) @{
15007        print "Can't get system date" > "/dev/stderr"
15008        exit 1
15009    @}
15010@end group
15011@group
15012    print "current date is", date_now
15013    close("date")
15014@}
15015@end group
15016@end example
15017
15018@quotation NOTE
15019For full portability, exit values should be between zero and 126, inclusive.
15020Negative values, and values of 127 or greater, may not produce consistent
15021results across different operating systems.
15022@end quotation
15023
15024
15025@node Built-in Variables
15026@section Predefined Variables
15027@cindex predefined variables
15028@cindex variables @subentry predefined
15029
15030Most @command{awk} variables are available to use for your own
15031purposes; they never change unless your program assigns values to
15032them, and they never affect anything unless your program examines them.
15033However, a few variables in @command{awk} have special built-in meanings.
15034@command{awk} examines some of these automatically, so that they enable you
15035to tell @command{awk} how to do certain things.  Others are set
15036automatically by @command{awk}, so that they carry information from the
15037internal workings of @command{awk} to your program.
15038
15039@cindex @command{gawk} @subentry predefined variables and
15040This @value{SECTION} documents all of @command{gawk}'s predefined variables,
15041most of which are also documented in the @value{CHAPTER}s describing
15042their areas of activity.
15043
15044@menu
15045* User-modified::               Built-in variables that you change to control
15046                                @command{awk}.
15047* Auto-set::                    Built-in variables where @command{awk} gives
15048                                you information.
15049* ARGC and ARGV::               Ways to use @code{ARGC} and @code{ARGV}.
15050@end menu
15051
15052@node User-modified
15053@subsection Built-in Variables That Control @command{awk}
15054@cindex predefined variables @subentry user-modifiable
15055@cindex user-modifiable variables
15056
15057The following is an alphabetical list of variables that you can change to
15058control how @command{awk} does certain things.
15059
15060The variables that are specific to @command{gawk} are marked with a pound
15061sign (@samp{#}).  These variables are @command{gawk} extensions.  In other
15062@command{awk} implementations or if @command{gawk} is in compatibility
15063mode (@pxref{Options}), they are not special.  (Any exceptions are noted
15064in the description of each variable.)
15065
15066@table @code
15067@cindex @code{BINMODE} variable
15068@cindex binary input/output
15069@cindex input/output @subentry binary
15070@cindex differences in @command{awk} and @command{gawk} @subentry @code{BINMODE} variable
15071@item BINMODE #
15072On non-POSIX systems, this variable specifies use of binary mode
15073for all I/O.  Numeric values of one, two, or three specify that input
15074files, output files, or all files, respectively, should use binary I/O.
15075A numeric value less than zero is treated as zero, and a numeric value
15076greater than three is treated as three.  Alternatively, string values
15077of @code{"r"} or @code{"w"} specify that input files and output files,
15078respectively, should use binary I/O.  A string value of @code{"rw"} or
15079@code{"wr"} indicates that all files should use binary I/O.  Any other
15080string value is treated the same as @code{"rw"}, but causes @command{gawk}
15081to generate a warning message.  @code{BINMODE} is described in more
15082detail in @ref{PC Using}.  @command{mawk} (@pxref{Other Versions})
15083also supports this variable, but only using numeric values.
15084
15085@cindex @code{CONVFMT} variable
15086@cindex POSIX @command{awk} @subentry @code{CONVFMT} variable and
15087@cindex numbers @subentry converting @subentry to strings
15088@cindex strings @subentry converting @subentry numbers to
@item CONVFMT
15090A string that controls the conversion of numbers to
15091strings (@pxref{Conversion}).
15092It works by being passed, in effect, as the first argument to the
15093@code{sprintf()} function
15094(@pxref{String Functions}).
15095Its default value is @code{"%.6g"}.
15096@code{CONVFMT} was introduced by the POSIX standard.
15097
15098@cindex @command{gawk} @subentry @code{FIELDWIDTHS} variable in
15099@cindex @code{FIELDWIDTHS} variable
15100@cindex differences in @command{awk} and @command{gawk} @subentry @code{FIELDWIDTHS} variable
15101@cindex field separator @subentry @code{FIELDWIDTHS} variable and
15102@cindex separators @subentry field @subentry @code{FIELDWIDTHS} variable and
15103@item FIELDWIDTHS #
15104A space-separated list of columns that tells @command{gawk}
15105how to split input with fixed columnar boundaries.
15106Starting in @value{PVERSION} 4.2, each field width may optionally be
15107preceded by a colon-separated value specifying the number of characters to skip
15108before the field starts.
15109Assigning a value to @code{FIELDWIDTHS}
15110overrides the use of @code{FS} and @code{FPAT} for field splitting.
15111@xref{Constant Size} for more information.
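
For example, in the following sketch (the input string is made up),
the first field is three characters wide, and the second field is four
characters wide after two characters are skipped:

@example
$ @kbd{echo abcdefghij | gawk 'BEGIN @{ FIELDWIDTHS = "3 2:4" @}}
>                       @kbd{@{ print $1; print $2 @}'}
@print{} abc
@print{} fghi
@end example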
15112
15113@cindex @command{gawk} @subentry @code{FPAT} variable in
15114@cindex @code{FPAT} variable
15115@cindex differences in @command{awk} and @command{gawk} @subentry @code{FPAT} variable
15116@cindex field separator @subentry @code{FPAT} variable and
15117@cindex separators @subentry field @subentry @code{FPAT} variable and
15118@item FPAT #
15119A regular expression (as a string) that tells @command{gawk}
15120to create the fields based on text that matches the regular expression.
15121Assigning a value to @code{FPAT}
15122overrides the use of @code{FS} and @code{FIELDWIDTHS} for field splitting.
15123@xref{Splitting By Content} for more information.
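
For example, this sketch treats either a run of non-comma characters
or a double-quoted string as a field, so that a comma inside quotes
does not split the field (the data line is illustrative only):

@example
$ @kbd{echo 'Robbins,Arnold,"1234 A Pretty Street, NE",MyTown' |}
> @kbd{gawk 'BEGIN @{ FPAT = "([^,]+)|(\"[^\"]+\")" @}}
>       @kbd{@{ print NF; print $3 @}'}
@print{} 4
@print{} "1234 A Pretty Street, NE"
@end example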
15124
15125@cindex @code{FS} variable
15126@cindex separators @subentry field
15127@cindex field separator
15128@item FS
15129The input field separator (@pxref{Field Separators}).
15130The value is a single-character string or a multicharacter regular
15131expression that matches the separations between fields in an input
15132record.  If the value is the null string (@code{""}), then each
15133character in the record becomes a separate field.
15134(This behavior is a @command{gawk} extension. POSIX @command{awk} does not
15135specify the behavior when @code{FS} is the null string.
15136Nonetheless, some other versions of @command{awk} also treat
15137@code{""} specially.)
15138
15139The default value is @w{@code{" "}}, a string consisting of a single
15140space.  As a special exception, this value means that any sequence of
15141spaces, TABs, and/or newlines is a single separator.  It also causes
15142spaces, TABs, and newlines at the beginning and end of a record to
15143be ignored.
15144
15145You can set the value of @code{FS} on the command line using the
15146@option{-F} option:
15147
15148@example
15149awk -F, '@var{program}' @var{input-files}
15150@end example
15151
15152@cindex @command{gawk} @subentry field separators and
15153If @command{gawk} is using @code{FIELDWIDTHS} or @code{FPAT}
15154for field splitting,
15155assigning a value to @code{FS} causes @command{gawk} to return to
15156the normal, @code{FS}-based field splitting. An easy way to do this
15157is to simply say @samp{FS = FS}, perhaps with an explanatory comment.
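
The following sketch shows the switch.  It relies on
@code{PROCINFO["FS"]} (@pxref{Auto-set}) to report which
field-splitting method is in effect:

@example
$ @kbd{gawk 'BEGIN @{}
>     @kbd{FIELDWIDTHS = "3 3"}
>     @kbd{print PROCINFO["FS"]}
>     @kbd{FS = FS        # return to FS-based splitting}
>     @kbd{print PROCINFO["FS"]}
> @kbd{@}'}
@print{} FIELDWIDTHS
@print{} FS
@end example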
15158
15159@cindex @command{gawk} @subentry @code{IGNORECASE} variable in
15160@cindex @code{IGNORECASE} variable
15161@cindex differences in @command{awk} and @command{gawk} @subentry @code{IGNORECASE} variable
15162@cindex case sensitivity @subentry string comparisons and
15163@cindex case sensitivity @subentry regexps and
15164@cindex regular expressions @subentry case sensitivity
15165@item IGNORECASE #
15166If @code{IGNORECASE} is nonzero or non-null, then all string comparisons
15167and all regular expression matching are case-independent.
15168This applies to
15169regexp matching with @samp{~} and @samp{!~},
15170the @code{gensub()}, @code{gsub()}, @code{index()}, @code{match()},
15171@code{patsplit()}, @code{split()}, and @code{sub()} functions,
15172record termination with @code{RS}, and field splitting with
15173@code{FS} and @code{FPAT}.
15174However, the value of @code{IGNORECASE} does @emph{not} affect array subscripting
15175and it does not affect field splitting when using a single-character
15176field separator.
15177@xref{Case-sensitivity}.
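
For example, in this sketch the regexp match succeeds regardless of
case, whereas the array membership test does not, because subscripts
are not affected (the array name @code{seen} is arbitrary):

@example
$ @kbd{gawk 'BEGIN @{}
>     @kbd{IGNORECASE = 1}
>     @kbd{if ("GAWK" ~ /gawk/) print "regexp matches"}
>     @kbd{seen["GAWK"] = 1}
>     @kbd{if (! ("gawk" in seen)) print "subscripts differ"}
> @kbd{@}'}
@print{} regexp matches
@print{} subscripts differ
@end example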
15178
15179@cindex @command{gawk} @subentry @code{LINT} variable in
15180@cindex @code{LINT} variable
15181@cindex differences in @command{awk} and @command{gawk} @subentry @code{LINT} variable
15182@cindex lint checking
15183@item LINT #
15184When this variable is true (nonzero or non-null), @command{gawk}
15185behaves as if the @option{--lint} command-line option is in effect
15186(@pxref{Options}).
15187With a value of @code{"fatal"}, lint warnings become fatal errors.
15188With a value of @code{"invalid"}, only warnings about things that are
15189actually invalid are issued. (This is not fully implemented yet.)
15190Any other true value prints nonfatal warnings.
15191Assigning a false value to @code{LINT} turns off the lint warnings.
15192
15193This variable is a @command{gawk} extension.  It is not special
15194in other @command{awk} implementations.  Unlike with the other special variables,
15195changing @code{LINT} does affect the production of lint warnings,
15196even if @command{gawk} is in compatibility mode.  Much as
15197the @option{--lint} and @option{--traditional} options independently
15198control different aspects of @command{gawk}'s behavior, the control
15199of lint warnings during program execution is independent of the flavor
15200of @command{awk} being executed.
15201
15202@cindex @code{OFMT} variable
15203@cindex numbers @subentry converting @subentry to strings
15204@cindex strings @subentry converting @subentry numbers to
15205@item OFMT
15206A string that controls conversion of numbers to
15207strings (@pxref{Conversion}) for
15208printing with the @code{print} statement.  It works by being passed
15209as the first argument to the @code{sprintf()} function
15210(@pxref{String Functions}).
15211Its default value is @code{"%.6g"}.  Earlier versions of @command{awk}
15212used @code{OFMT} to specify the format for converting numbers to
15213strings in general expressions; this is now done by @code{CONVFMT}.
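
The difference between the two variables can be seen in the following
sketch; the format values are arbitrary.  @code{print} converts the
number using @code{OFMT}, whereas concatenation converts it using
@code{CONVFMT}:

@example
$ @kbd{awk 'BEGIN @{}
>     @kbd{OFMT = "%.2f"; CONVFMT = "%.4f"}
>     @kbd{x = 3.14159265}
>     @kbd{print x           # number: converted with OFMT}
>     @kbd{print x ""        # string: converted with CONVFMT}
> @kbd{@}'}
@print{} 3.14
@print{} 3.1416
@end example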
15214
15215@cindex @code{print} statement @subentry @code{OFMT} variable and
15216@cindex @code{OFS} variable
15217@cindex separators @subentry field
15218@cindex field separator
15219@item OFS
15220The output field separator (@pxref{Output Separators}).  It is
15221output between the fields printed by a @code{print} statement.  Its
15222default value is @w{@code{" "}}, a string consisting of a single space.
15223
15224@cindex @code{ORS} variable
15225@item ORS
15226The output record separator.  It is output at the end of every
15227@code{print} statement.  Its default value is @code{"\n"}, the newline
15228character.  (@xref{Output Separators}.)
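
As a short sketch, with arbitrary separator values chosen only to make
the effect visible:

@example
$ @kbd{echo 'a b c' |}
> @kbd{awk 'BEGIN @{ OFS = "-"; ORS = "!\n" @} @{ print $1, $2 @}'}
@print{} a-b!
@end example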
15229
15230@cindex @code{PREC} variable
15231@item PREC #
15232The working precision of arbitrary-precision floating-point numbers,
1523353 bits by default (@pxref{Setting precision}).
15234
15235@cindex @code{ROUNDMODE} variable
15236@item ROUNDMODE #
15237The rounding mode to use for arbitrary-precision arithmetic on
15238numbers, by default @code{"N"} (@code{roundTiesToEven} in
15239the IEEE 754 standard; @pxref{Setting the rounding mode}).
15240
15241@cindex @code{RS} variable
15242@cindex separators @subentry for records
15243@cindex record separators
@item RS
15245The input record separator.  Its default value is a string
15246containing a single newline character, which means that an input record
15247consists of a single line of text.
15248It can also be the null string, in which case records are separated by
15249runs of blank lines.
15250If it is a regexp, records are separated by
15251matches of the regexp in the input text.
15252(@xref{Records}.)
15253
15254The ability for @code{RS} to be a regular expression
15255is a @command{gawk} extension.
15256In most other @command{awk} implementations,
15257or if @command{gawk} is in compatibility mode
15258(@pxref{Options}),
15259just the first character of @code{RS}'s value is used.
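
For example, this sketch sets @code{RS} to the null string, so that
the blank line in the (made-up) input separates two records.  With the
default value of @code{FS}, a newline inside a record also separates
fields:

@example
$ @kbd{printf 'a b\nc\n\nd\n' |}
> @kbd{awk 'BEGIN @{ RS = "" @} @{ print "record", NR, "has NF =", NF @}'}
@print{} record 1 has NF = 3
@print{} record 2 has NF = 1
@end example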
15260
15261@cindex @code{SUBSEP} variable
15262@cindex separators @subentry subscript
15263@cindex subscript separators
@item SUBSEP
15265The subscript separator.  It has the default value of
15266@code{"\034"} and is used to separate the parts of the indices of a
15267multidimensional array.  Thus, the expression @samp{@w{foo["A", "B"]}}
15268really accesses @code{foo["A\034B"]}
15269(@pxref{Multidimensional}).
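
Because the combined index is an ordinary string, it can be taken
apart again with @code{split()}, as in this sketch (the names
@code{key} and @code{part} are arbitrary):

@example
$ @kbd{awk 'BEGIN @{}
>     @kbd{foo["A", "B"] = 1}
>     @kbd{for (key in foo) @{}
>         @kbd{split(key, part, SUBSEP)}
>         @kbd{print part[1], part[2]}
>     @kbd{@}}
> @kbd{@}'}
@print{} A B
@end example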
15270
15271@cindex @command{gawk} @subentry @code{TEXTDOMAIN} variable in
15272@cindex @code{TEXTDOMAIN} variable
15273@cindex differences in @command{awk} and @command{gawk} @subentry @code{TEXTDOMAIN} variable
15274@cindex internationalization @subentry localization
15275@item TEXTDOMAIN #
15276Used for internationalization of programs at the
15277@command{awk} level.  It sets the default text domain for specially
15278marked string constants in the source text, as well as for the
15279@code{dcgettext()}, @code{dcngettext()}, and @code{bindtextdomain()} functions
15280(@pxref{Internationalization}).
15281The default value of @code{TEXTDOMAIN} is @code{"messages"}.
15282@end table
15283
15284@node Auto-set
15285@subsection Built-in Variables That Convey Information
15286
15287@cindex predefined variables @subentry conveying information
15288@cindex variables @subentry predefined @subentry conveying information
15289The following is an alphabetical list of variables that @command{awk}
15290sets automatically on certain occasions in order to provide
15291information to your program.
15292
15293The variables that are specific to @command{gawk} are marked with a pound
15294sign (@samp{#}).  These variables are @command{gawk} extensions.  In other
15295@command{awk} implementations or if @command{gawk} is in compatibility
15296mode (@pxref{Options}), they are not special:
15297
15298@c @asis for docbook
15299@table @asis
15300@cindex @code{ARGC}/@code{ARGV} variables
15301@cindex arguments @subentry command-line
15302@cindex command line @subentry arguments
15303@item @code{ARGC}, @code{ARGV}
15304The command-line arguments available to @command{awk} programs are stored in
15305an array called @code{ARGV}.  @code{ARGC} is the number of command-line
15306arguments present.  @xref{Other Arguments}.
15307Unlike most @command{awk} arrays,
15308@code{ARGV} is indexed from 0 to @code{ARGC} @minus{} 1.
15309In the following example:
15310
15311@example
15312@group
15313$ @kbd{awk 'BEGIN @{}
15314>         @kbd{for (i = 0; i < ARGC; i++)}
15315>             @kbd{print ARGV[i]}
15316>      @kbd{@}' inventory-shipped mail-list}
15317@print{} awk
15318@print{} inventory-shipped
15319@print{} mail-list
15320@end group
15321@end example
15322
15323@noindent
15324@code{ARGV[0]} contains @samp{awk}, @code{ARGV[1]}
15325contains @samp{inventory-shipped}, and @code{ARGV[2]} contains
15326@samp{mail-list}.  The value of @code{ARGC} is three, one more than the
15327index of the last element in @code{ARGV}, because the elements are numbered
15328from zero.
15329
15330@cindex programming conventions @subentry @code{ARGC}/@code{ARGV} variables
15331The names @code{ARGC} and @code{ARGV}, as well as the convention of indexing
15332the array from 0 to @code{ARGC} @minus{} 1, are derived from the C language's
15333method of accessing command-line arguments.
15334
15335@cindex dark corner @subentry value of @code{ARGV[0]}
15336The value of @code{ARGV[0]} can vary from system to system.
15337Also, you should note that the program text is @emph{not} included in
15338@code{ARGV}, nor are any of @command{awk}'s command-line options.
15339@xref{ARGC and ARGV} for information
15340about how @command{awk} uses these variables.
15341@value{DARKCORNER}
15342
15343@cindex @code{ARGIND} variable
15344@cindex differences in @command{awk} and @command{gawk} @subentry @code{ARGIND} variable
15345@item @code{ARGIND #}
15346The index in @code{ARGV} of the current file being processed.
15347Every time @command{gawk} opens a new @value{DF} for processing, it sets
15348@code{ARGIND} to the index in @code{ARGV} of the @value{FN}.
15349When @command{gawk} is processing the input files,
15350@samp{FILENAME == ARGV[ARGIND]} is always true.
15351
15352@cindex files @subentry processing, @code{ARGIND} variable and
15353This variable is useful in file processing; it allows you to tell how far
15354along you are in the list of @value{DF}s as well as to distinguish between
15355successive instances of the same @value{FN} on the command line.
15356
15357@cindex file names @subentry distinguishing
15358While you can change the value of @code{ARGIND} within your @command{awk}
15359program, @command{gawk} automatically sets it to a new value when it
15360opens the next file.
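
As a minimal sketch, the following @code{BEGINFILE} rule
(@pxref{BEGINFILE/ENDFILE}) reports each file as it is opened, along
with its position among the command-line arguments:

@example
BEGINFILE @{
    printf("argument %d of %d: %s\n", ARGIND, ARGC - 1, FILENAME)
@}
@end example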
15361
15362@cindex @code{ENVIRON} array
15363@cindex environment variables @subentry in @code{ENVIRON} array
15364@item @code{ENVIRON}
15365An associative array containing the values of the environment.  The array
15366indices are the environment variable names; the elements are the values of
15367the particular environment variables.  For example,
15368@code{ENVIRON["HOME"]} might be @code{/home/arnold}.
15369
15370For POSIX @command{awk}, changing this array does not affect the
15371environment passed on to any programs that @command{awk} may spawn via
15372redirection or the @code{system()} function.
15373
15374However, beginning with @value{PVERSION} 4.2, if not in POSIX
15375compatibility mode, @command{gawk} does update its own environment when
15376@code{ENVIRON} is changed, thus changing the environment seen by programs
15377that it creates.  You should therefore be especially careful if you
15378modify @code{ENVIRON["PATH"]}, which is the search path for finding
15379executable programs.
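
For example, in the following sketch the variable name
@env{GREETING} is made up for illustration.  It assumes
@command{gawk} @value{PVERSION} 4.2 or later, not running in POSIX
compatibility mode:

@example
$ @kbd{gawk 'BEGIN @{}
>     @kbd{ENVIRON["GREETING"] = "hello"}
>     @kbd{system("echo $GREETING")}
> @kbd{@}'}
@print{} hello
@end example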
15380
15381This can also affect the running @command{gawk} program, since some of the
15382built-in functions may pay attention to certain environment variables.
15383The most notable instance of this is @code{mktime()} (@pxref{Time
Functions}), which pays attention to the value of the @env{TZ} environment
15385variable on many systems.
15386
15387Some operating systems may not have environment variables.
15388On such systems, the @code{ENVIRON} array is empty (except for
15389@w{@code{ENVIRON["AWKPATH"]}} and
15390@w{@code{ENVIRON["AWKLIBPATH"]}};
15391@pxref{AWKPATH Variable} and
15392@ifdocbook
15393@ref{AWKLIBPATH Variable}).
15394@end ifdocbook
15395@ifnotdocbook
15396@pxref{AWKLIBPATH Variable}).
15397@end ifnotdocbook
15398
15399@cindex @command{gawk} @subentry @code{ERRNO} variable in
15400@cindex @code{ERRNO} variable
15401@cindex differences in @command{awk} and @command{gawk} @subentry @code{ERRNO} variable
15402@cindex error handling @subentry @code{ERRNO} variable and
15403@item @code{ERRNO #}
15404If a system error occurs during a redirection for @code{getline}, during
15405a read for @code{getline}, or during a @code{close()} operation, then
15406@code{ERRNO} contains a string describing the error.
15407
15408In addition, @command{gawk} clears @code{ERRNO} before opening each
15409command-line input file. This enables checking if the file is readable
15410inside a @code{BEGINFILE} pattern (@pxref{BEGINFILE/ENDFILE}).
15411
15412Otherwise, @code{ERRNO} works similarly to the C variable @code{errno}.
15413Except for the case just mentioned, @command{gawk} @emph{never} clears
15414it (sets it to zero or @code{""}).  Thus, you should only expect its
15415value to be meaningful when an I/O operation returns a failure value,
15416such as @code{getline} returning @minus{}1.  You are, of course, free
15417to clear it yourself before doing an I/O operation.
15418
15419If the value of @code{ERRNO} corresponds to a system error in the C
15420@code{errno} variable, then @code{PROCINFO["errno"]} will be set to the value
15421of @code{errno}.  For non-system errors, @code{PROCINFO["errno"]} will
15422be zero.
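
The following sketch checks both values after a failed
@code{getline}.  The @value{FN} is hypothetical and is assumed not to
exist:

@example
BEGIN @{
    file = "/no/such/file"
    if ((getline line < file) < 0)
        printf("%s: %s (errno %d)\n", file, ERRNO,
               PROCINFO["errno"]) > "/dev/stderr"
@}
@end example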
15423
15424@cindex @code{FILENAME} variable
15425@cindex dark corner @subentry @code{FILENAME} variable
15426@item @code{FILENAME}
15427The name of the current input file.  When no @value{DF}s are listed
15428on the command line, @command{awk} reads from the standard input and
15429@code{FILENAME} is set to @code{"-"}.  @code{FILENAME} changes each
15430time a new file is read (@pxref{Reading Files}).  Inside a @code{BEGIN}
15431rule, the value of @code{FILENAME} is @code{""}, because there are no input
15432files being processed yet.@footnote{Some early implementations of Unix
15433@command{awk} initialized @code{FILENAME} to @code{"-"}, even if there
15434were @value{DF}s to be processed. This behavior was incorrect and should
15435not be relied upon in your programs.} @value{DARKCORNER} Note, though,
15436that using @code{getline} (@pxref{Getline}) inside a @code{BEGIN} rule
15437can give @code{FILENAME} a value.
15438
15439@cindex @code{FNR} variable
15440@item @code{FNR}
15441The current record number in the current file.  @command{awk} increments
15442@code{FNR} each time it reads a new record (@pxref{Records}).
15443@command{awk} resets @code{FNR} to zero each time it starts a new
15444input file.
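
Because @code{FNR} starts over for each file while @code{NR} does not,
comparing the two is a common way to recognize the first input file.
The following sketch assumes two input files, the first of which is
not empty; the array name @code{seen} is arbitrary:

@example
# Remember the first field of every record in the first file,
# then print records from later files whose first field was seen.
NR == FNR @{ seen[$1]; next @}
$1 in seen
@end example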
15445
15446@cindex @code{NF} variable
15447@item @code{NF}
15448The number of fields in the current input record.
15449@code{NF} is set each time a new record is read, when a new field is
15450created, or when @code{$0} changes (@pxref{Fields}).
15451
15452Unlike most of the variables described in this @value{SUBSECTION},
15453assigning a value to @code{NF} has the potential to affect
15454@command{awk}'s internal workings.  In particular, assignments
15455to @code{NF} can be used to create fields in or remove fields from the
15456current record. @xref{Changing Fields}.
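
For example, decreasing @code{NF} discards the trailing fields and
rebuilds the record:

@example
$ @kbd{echo 'a b c d' | awk '@{ NF = 2; print @}'}
@print{} a b
@end example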
15457
15458@cindex @code{FUNCTAB} array
15459@cindex @command{gawk} @subentry @code{FUNCTAB} array in
15460@cindex differences in @command{awk} and @command{gawk} @subentry @code{FUNCTAB} variable
15461@item @code{FUNCTAB #}
15462An array whose indices and corresponding values are the names of all
15463the built-in, user-defined, and extension functions in the program.
15464
15465@quotation NOTE
15466Attempting to use the @code{delete} statement with the @code{FUNCTAB}
15467array causes a fatal error.  Any attempt to assign to an element of
15468@code{FUNCTAB} also causes a fatal error.
15469@end quotation
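
One possible use, sketched here with a made-up function named
@code{greet()}, is to check that a function exists before calling it
through @command{gawk}'s indirect function call syntax:

@example
function greet()
@{
    print "hello"
@}

BEGIN @{
    name = "greet"
    if (name in FUNCTAB)
        @@name()        # indirect call to the function named in name
@}
@end example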
15470
15471@cindex @code{NR} variable
15472@item @code{NR}
15473The number of input records @command{awk} has processed since
15474the beginning of the program's execution
15475(@pxref{Records}).
15476@command{awk} increments @code{NR} each time it reads a new record.
15477
15478@cindex @command{gawk} @subentry @code{PROCINFO} array in
15479@cindex @code{PROCINFO} array
15480@cindex differences in @command{awk} and @command{gawk} @subentry @code{PROCINFO} array
15481@item @code{PROCINFO #}
15482The elements of this array provide access to information about the
15483running @command{awk} program.
15484The following elements (listed alphabetically)
15485are guaranteed to be available:
15486
15487@table @code
15488@item PROCINFO["argv"]
15489@cindex command line @subentry arguments
15490The @code{PROCINFO["argv"]} array contains all of the command-line arguments
15491(after glob expansion and redirection processing on platforms where that must
15492be done manually by the program) with subscripts ranging from 0 through
15493@code{argc} @minus{} 1.  For example, @code{PROCINFO["argv"][0]} will contain
15494the name by which @command{gawk} was invoked.  Here is an example of how this
15495feature may be used:
15496
15497@example
15498gawk '
15499BEGIN @{
15500        for (i = 0; i < length(PROCINFO["argv"]); i++)
15501                print i, PROCINFO["argv"][i]
15502@}'
15503@end example
15504
Please note that this differs from the standard @code{ARGV} array, which does
15506not include command-line arguments that have already been processed by
15507@command{gawk} (@pxref{ARGC and ARGV}).
15508
15509@cindex effective group ID of @command{gawk} user
15510@item PROCINFO["egid"]
15511The value of the @code{getegid()} system call.
15512
15513@item PROCINFO["errno"]
15514The value of the C @code{errno} variable when @code{ERRNO} is set to
15515the associated error message.
15516
15517@item PROCINFO["euid"]
15518@cindex effective user ID of @command{gawk} user
15519The value of the @code{geteuid()} system call.
15520
15521@item PROCINFO["FS"]
15522This is
15523@code{"FS"} if field splitting with @code{FS} is in effect,
15524@code{"FIELDWIDTHS"} if field splitting with @code{FIELDWIDTHS} is in effect,
15525@code{"FPAT"} if field matching with @code{FPAT} is in effect,
15526or @code{"API"} if field splitting is controlled by an API input parser.
15527
15528@item PROCINFO["gid"]
15529@cindex group ID of @command{gawk} user
15530The value of the @code{getgid()} system call.
15531
15532@item PROCINFO["identifiers"]
15533@cindex program identifiers
15534A subarray, indexed by the names of all identifiers used in the text of
15535the @command{awk} program.  An @dfn{identifier} is simply the name of a variable
15536(be it scalar or array), built-in function, user-defined function, or
15537extension function.  For each identifier, the value of the element is
15538one of the following:
15539
15540@table @code
15541@item "array"
15542The identifier is an array.
15543
15544@item "builtin"
15545The identifier is a built-in function.
15546
15547@item "extension"
15548The identifier is an extension function loaded via
15549@code{@@load} or @option{-l}.
15550
15551@item "scalar"
15552The identifier is a scalar.
15553
15554@item "untyped"
15555The identifier is untyped (could be used as a scalar or an array;
15556@command{gawk} doesn't know yet).
15557
15558@item "user"
15559The identifier is a user-defined function.
15560@end table
15561
15562@noindent
15563The values indicate what @command{gawk} knows about the identifiers
15564after it has finished parsing the program; they are @emph{not} updated
15565while the program runs.
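
As a minimal sketch, the following rule lists every identifier
together with its type.  The order of the output is not specified:

@example
BEGIN @{
    for (name in PROCINFO["identifiers"])
        printf("%-12s %s\n", name, PROCINFO["identifiers"][name])
@}
@end example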
15566
15567@item PROCINFO["platform"]
15568@cindex platform running on
15569@cindex @code{PROCINFO} array @subentry platform running on
15570This element gives a string indicating the platform for which
15571@command{gawk} was compiled. The value will be one of the following:
15572
15573@c nested table
15574@table @code
15575@item "djgpp"
15576@itemx "mingw"
15577Microsoft Windows, using either DJGPP or MinGW, respectively.
15578
15579@item "os2"
15580OS/2.
15581
15582@item "os390"
15583OS/390.
15584
15585@item "posix"
15586GNU/Linux, Cygwin, Mac OS X, and legacy Unix systems.
15587
15588@item "vms"
15589OpenVMS or Vax/VMS.
15590@end table
15591
15592@item PROCINFO["pgrpid"]
15593@cindex process group ID of @command{gawk} process
15594The process group ID of the current process.
15595
15596@item PROCINFO["pid"]
15597@cindex process ID of @command{gawk} process
15598The process ID of the current process.
15599
15600@item PROCINFO["ppid"]
15601@cindex parent process ID of @command{gawk} process
15602The parent process ID of the current process.
15603
15604@item PROCINFO["strftime"]
15605The default time format string for @code{strftime()}.
15606Assigning a new value to this element changes the default.
15607@xref{Time Functions}.
15608
15609@item PROCINFO["uid"]
15610The value of the @code{getuid()} system call.
15611
15612@item PROCINFO["version"]
15613@cindex version of @subentry @command{gawk}
15614@cindex @command{gawk} @subentry version of
15615The version of @command{gawk}.
15616@end table
15617
15618The following additional elements in the array
15619are available to provide information about the MPFR and GMP libraries
15620if your version of @command{gawk} supports arbitrary-precision arithmetic
15621(@pxref{Arbitrary Precision Arithmetic}):
15622
15623@table @code
15624@item PROCINFO["gmp_version"]
15625@cindex version of @subentry GNU MP library
15626The version of the GNU MP library.
15627
15628@cindex version of @subentry GNU MPFR library
15629@item PROCINFO["mpfr_version"]
15630The version of the GNU MPFR library.
15631
15632@item PROCINFO["prec_max"]
15633@cindex maximum precision supported by MPFR library
15634The maximum precision supported by MPFR.
15635
15636@item PROCINFO["prec_min"]
15637@cindex minimum precision required by MPFR library
15638The minimum precision required by MPFR.
15639@end table
15640
15641The following additional elements in the array are available to provide
15642information about the version of the extension API, if your version
15643of @command{gawk} supports dynamic loading of extension functions
15644(@pxref{Dynamic Extensions}):
15645
15646@table @code
15647@item PROCINFO["api_major"]
15648@cindex version of @subentry @command{gawk} extension API
15649@cindex extension API @subentry version number
15650The major version of the extension API.
15651
15652@item PROCINFO["api_minor"]
15653The minor version of the extension API.
15654@end table
15655
15656@cindex supplementary groups of @command{gawk} process
On some systems, the array may also contain the elements @code{"group1"}
through @code{"group@var{N}"}, where @var{N} is the number of
supplementary groups that the process has.  Use the @code{in} operator
to test for these elements
(@pxref{Reference to Elements}).
15662
15663The following elements allow you to change @command{gawk}'s behavior:
15664
15665@table @code
15666@item PROCINFO["NONFATAL"]
15667If this element exists, then I/O errors for all redirections become nonfatal.
15668@xref{Nonfatal}.
15669
15670@item PROCINFO["@var{name}", "NONFATAL"]
15671Make I/O errors for @var{name} be nonfatal.
15672@xref{Nonfatal}.
15673
15674@item PROCINFO["@var{command}", "pty"]
15675For two-way communication to @var{command}, use a pseudo-tty instead
15676of setting up a two-way pipe.
15677@xref{Two-way I/O} for more information.
15678
15679@item PROCINFO["@var{input_name}", "READ_TIMEOUT"]
15680Set a timeout for reading from input redirection @var{input_name}.
15681@xref{Read Timeout} for more information.
15682
15683@item PROCINFO["@var{input_name}", "RETRY"]
15684If an I/O error that may be retried occurs when reading data from
15685@var{input_name}, and this array entry exists, then @code{getline} returns
15686@minus{}2 instead of following the default behavior of returning @minus{}1
15687and configuring @var{input_name} to return no further data.  An I/O error
15688that may be retried is one where @code{errno} has the value @code{EAGAIN},
15689@code{EWOULDBLOCK}, @code{EINTR}, or @code{ETIMEDOUT}.  This may be useful
15690in conjunction with @code{PROCINFO["@var{input_name}", "READ_TIMEOUT"]}
15691or situations where a file descriptor has been configured to behave in
15692a non-blocking fashion.
15693@xref{Retrying Input} for more information.
15694
15695@item PROCINFO["sorted_in"]
15696If this element exists in @code{PROCINFO}, its value controls the
15697order in which array indices will be processed by
15698@samp{for (@var{indx} in @var{array})} loops.
15699This is an advanced feature, so we defer the
15700full description until later; see
15701@ref{Controlling Scanning}.
15702@end table
15703
15704@cindex @code{RLENGTH} variable
15705@item @code{RLENGTH}
15706The length of the substring matched by the
15707@code{match()} function
15708(@pxref{String Functions}).
15709@code{RLENGTH} is set by invoking the @code{match()} function.  Its value
15710is the length of the matched string, or @minus{}1 if no match is found.
15711
15712@cindex @code{RSTART} variable
15713@item @code{RSTART}
15714The start index in characters of the substring that is matched by the
15715@code{match()} function
15716(@pxref{String Functions}).
@code{RSTART} is set by invoking the @code{match()} function.  Its value
is the position in the string where the matched substring starts, or zero
if no match is found.
15720
15721@cindex @command{gawk} @subentry @code{RT} variable in
15722@cindex @code{RT} variable
15723@cindex differences in @command{awk} and @command{gawk} @subentry @code{RS}/@code{RT} variables
15724@item @code{RT #}
15725The input text that matched the text denoted by @code{RS},
15726the record separator.  It is set every time a record is read.
15727
15728@cindex @command{gawk} @subentry @code{SYMTAB} array in
15729@cindex @code{SYMTAB} array
15730@cindex differences in @command{awk} and @command{gawk} @subentry @code{SYMTAB} variable
15731@item @code{SYMTAB #}
15732An array whose indices are the names of all defined global variables and
15733arrays in the program.  @code{SYMTAB} makes @command{gawk}'s symbol table
15734visible to the @command{awk} programmer.  It is built as @command{gawk}
15735parses the program and is complete before the program starts to run.
15736
15737The array may be used for indirect access to read or write the value of
15738a variable:
15739
15740@example
15741foo = 5
15742SYMTAB["foo"] = 4
15743print foo    # prints 4
15744@end example
15745
15746@noindent
15747The @code{isarray()} function (@pxref{Type Functions}) may be used to test
15748if an element in @code{SYMTAB} is an array.
15749Also, you may not use the @code{delete} statement with the
15750@code{SYMTAB} array.
15751
15752Prior to @value{PVERSION} 5.0 of @command{gawk}, you could
15753use an index for @code{SYMTAB} that was not a predefined identifier:
15754
15755@example
15756SYMTAB["xxx"] = 5
15757print SYMTAB["xxx"]
15758@end example
15759
15760@noindent
This no longer works; it now produces a fatal error, because the old
behavior led to rampant confusion.
15763
15764@cindex Schorr, Andrew
15765The @code{SYMTAB} array is more interesting than it looks. Andrew Schorr
15766points out that it effectively gives @command{awk} data pointers. Consider his
15767example:
15768
15769@example
15770@group
15771# Indirect multiply of any variable by amount, return result
15772
15773function multiply(variable, amount)
15774@{
15775    return SYMTAB[variable] *= amount
15776@}
15777@end group
15778@end example
15779
15780@noindent
15781You would use it like this:
15782
15783@example
15784BEGIN @{
15785    answer = 10.5
15786    multiply("answer", 4)
15787    print "The answer is", answer
15788@}
15789@end example
15790
15791@noindent
15792When run, this produces:
15793
15794@example
15795$ @kbd{gawk -f answer.awk}
15796@print{} The answer is 42
15797@end example
15798
15799@quotation NOTE
15800In order to avoid severe time-travel paradoxes,@footnote{Not to mention
15801difficult implementation issues.} neither @code{FUNCTAB} nor @code{SYMTAB}
15802is available as an element within the @code{SYMTAB} array.
15803@end quotation
15804@end table
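
As a rough sketch of how a few of the variables described above might be
inspected, the following @command{gawk}-specific program prints two
@code{PROCINFO} elements, tests for an optional one with the @code{in}
operator, and shows the values that @code{match()} leaves in
@code{RSTART} and @code{RLENGTH}.  The sample string and regexps are
made up for illustration:

@example
BEGIN @{
    # These two elements are always present in gawk
    printf "gawk %s, pid %d\n", PROCINFO["version"], PROCINFO["pid"]

    # Supplementary group elements may be absent, so test with "in"
    if ("group1" in PROCINFO)
        print "first supplementary group:", PROCINFO["group1"]
    else
        print "no supplementary groups reported"

    # A successful match: "bar" starts at position 4 and is 3 long
    match("foobarbaz", /ba[rz]/)
    print RSTART, RLENGTH

    # A failed match sets RSTART to 0 and RLENGTH to -1
    match("foobarbaz", /xyz/)
    print RSTART, RLENGTH
@}
@end example

@noindent
The last two @code{print} statements should produce @samp{4 3} and
@samp{0 -1}; the rest of the output depends on the system.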
15805
15806@sidebar Changing @code{NR} and @code{FNR}
15807@cindex @code{NR} variable @subentry changing
15808@cindex @code{FNR} variable @subentry changing
15809@cindex dark corner @subentry @code{FNR}/@code{NR} variables
@command{awk} increments @code{NR} and @code{FNR}
each time it reads a record, instead of setting them to the absolute
value of the number of records read.  This means that a program can
change these variables, and the new values are then incremented for
each subsequent record.
15815@value{DARKCORNER}
15816The following example shows this:
15817
15818@example
15819$ @kbd{echo '1}
15820> @kbd{2}
15821> @kbd{3}
15822> @kbd{4' | awk 'NR == 2 @{ NR = 17 @}}
15823> @kbd{@{ print NR @}'}
15824@print{} 1
15825@print{} 17
15826@print{} 18
15827@print{} 19
15828@end example
15829
15830@noindent
15831Before @code{FNR} was added to the @command{awk} language
15832(@pxref{V7/SVR3.1}),
15833many @command{awk} programs used this feature to track the number of
15834records in a file by resetting @code{NR} to zero when @code{FILENAME}
15835changed.
15836@end sidebar
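
The idiom mentioned in the sidebar can be sketched as follows.  This is
only an illustration of the historical technique, not a recommendation;
the variable name @code{last} is arbitrary, and @code{NR} is reset to one
rather than zero so that the first record of each file is numbered one:

@example
# Emulate FNR in an awk that lacks it by resetting NR for each file
FILENAME != last @{
    last = FILENAME
    NR = 1
@}

@{ print FILENAME ": " NR ": " $0 @}
@end example

@noindent
Note that this sketch fails to notice a new file if the same @value{FN}
appears twice in a row on the command line, which is one reason to
prefer @code{FNR}.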
15837
15838@node ARGC and ARGV
15839@subsection Using @code{ARGC} and @code{ARGV}
15840@cindex @code{ARGC}/@code{ARGV} variables @subentry how to use
15841@cindex arguments @subentry command-line
15842@cindex command line @subentry arguments
15843
15844@ref{Auto-set}
15845presented the following program describing the information contained in @code{ARGC}
15846and @code{ARGV}:
15847
15848@example
15849@group
15850$ @kbd{awk 'BEGIN @{}
15851>        @kbd{for (i = 0; i < ARGC; i++)}
15852>            @kbd{print ARGV[i]}
15853>      @kbd{@}' inventory-shipped mail-list}
15854@print{} awk
15855@print{} inventory-shipped
15856@print{} mail-list
15857@end group
15858@end example
15859
15860@noindent
15861In this example, @code{ARGV[0]} contains @samp{awk}, @code{ARGV[1]}
15862contains @samp{inventory-shipped}, and @code{ARGV[2]} contains
15863@samp{mail-list}.
15864Notice that the @command{awk} program is not entered in @code{ARGV}.  The
15865other command-line options, with their arguments, are also not
15866entered.  This includes variable assignments done with the @option{-v}
15867option (@pxref{Options}).
15868Normal variable assignments on the command line @emph{are}
15869treated as arguments and do show up in the @code{ARGV} array.
15870Given the following program in a file named @file{showargs.awk}:
15871
15872@example
15873BEGIN @{
15874    printf "A=%d, B=%d\n", A, B
15875    for (i = 0; i < ARGC; i++)
15876        printf "\tARGV[%d] = %s\n", i, ARGV[i]
15877@}
15878END   @{ printf "A=%d, B=%d\n", A, B @}
15879@end example
15880
15881@noindent
15882Running it produces the following:
15883
15884@example
15885$ @kbd{awk -v A=1 -f showargs.awk B=2 /dev/null}
15886@print{} A=1, B=0
15887@print{}        ARGV[0] = awk
15888@print{}        ARGV[1] = B=2
15889@print{}        ARGV[2] = /dev/null
15890@print{} A=1, B=2
15891@end example
15892
15893A program can alter @code{ARGC} and the elements of @code{ARGV}.
15894Each time @command{awk} reaches the end of an input file, it uses the next
15895element of @code{ARGV} as the name of the next input file.  By storing a
15896different string there, a program can change which files are read.
15897Use @code{"-"} to represent the standard input.  Storing
15898additional elements and incrementing @code{ARGC} causes
15899additional files to be read.
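
For instance, a @code{BEGIN} rule along the following lines appends one
more file to the list of inputs.  The @value{FN} @file{extra.txt} is
purely hypothetical:

@example
BEGIN @{
    # Store the new name in the next free slot, then bump ARGC
    ARGV[ARGC++] = "extra.txt"
@}

@{ print FILENAME ": " $0 @}
@end example

@noindent
Storing @code{"-"} instead would make @command{awk} read the standard
input at that point in the list.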
15900
15901If the value of @code{ARGC} is decreased, that eliminates input files
15902from the end of the list.  By recording the old value of @code{ARGC}
15903elsewhere, a program can treat the eliminated arguments as
15904something other than @value{FN}s.
15905
15906To eliminate a file from the middle of the list, store the null string
15907(@code{""}) into @code{ARGV} in place of the file's name.  As a
15908special feature, @command{awk} ignores @value{FN}s that have been
15909replaced with the null string.
15910Another option is to
15911use the @code{delete} statement to remove elements from
15912@code{ARGV} (@pxref{Delete}).
15913
15914All of these actions are typically done in the @code{BEGIN} rule,
15915before actual processing of the input begins.
15916@xref{Split Program} and
15917@ifnotdocbook
15918@pxref{Tee Program}
15919@end ifnotdocbook
15920@ifdocbook
15921@ref{Tee Program}
15922@end ifdocbook
15923for examples
15924of each way of removing elements from @code{ARGV}.
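
As a minimal sketch of two of these techniques, assume a hypothetical
program @file{head.awk} that is invoked as
@samp{awk -f head.awk @var{file} @dots{} @var{count}}.  The final
argument is a number rather than a @value{FN}, so the program saves it
and decrements @code{ARGC}.  It also skips any argument equal to the
made-up name @file{SKIP} by replacing it with the null string:

@example
BEGIN @{
    if (ARGC < 3) @{
        print "usage: awk -f head.awk file ... count" > "/dev/stderr"
        exit 1
    @}
    count = ARGV[ARGC - 1] + 0    # last argument is not a file
    ARGC--                        # hide it from awk's input processing

    for (i = 1; i < ARGC; i++)
        if (ARGV[i] == "SKIP")
            ARGV[i] = ""          # ignored when awk opens input files
@}

FNR <= count    # print the first count records of each file
@end example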
15925
15926To actually get options into an @command{awk} program,
15927end the @command{awk} options with @option{--} and then supply
15928the @command{awk} program's options, in the following manner:
15929
15930@example
15931awk -f myprog.awk -- -v -q file1 file2 @dots{}
15932@end example
15933
15934The following fragment processes @code{ARGV} in order to examine, and
15935then remove, the previously mentioned command-line options:
15936
15937@example
15938BEGIN @{
15939    for (i = 1; i < ARGC; i++) @{
15940        if (ARGV[i] == "-v")
15941            verbose = 1
15942        else if (ARGV[i] == "-q")
15943            debug = 1
15944        else if (ARGV[i] ~ /^-./) @{
15945            e = sprintf("%s: unrecognized option -- %c",
15946                    ARGV[0], substr(ARGV[i], 2, 1))
15947            print e > "/dev/stderr"
15948        @} else
15949            break
15950        delete ARGV[i]
15951    @}
15952@}
15953@end example
15954
15955@cindex differences in @command{awk} and @command{gawk} @subentry @code{ARGC}/@code{ARGV} variables
15956Ending the @command{awk} options with @option{--} isn't
15957necessary in @command{gawk}. Unless @option{--posix} has
15958been specified, @command{gawk} silently puts any unrecognized options
15959into @code{ARGV} for the @command{awk} program to deal with.  As soon
15960as it sees an unknown option, @command{gawk} stops looking for other
15961options that it might otherwise recognize.  The previous command line with
15962@command{gawk} would be:
15963
15964@example
15965gawk -f myprog.awk -q -v file1 file2 @dots{}
15966@end example
15967
15968@noindent
15969Because @option{-q} is not a valid @command{gawk} option, it and the
15970following @option{-v} are passed on to the @command{awk} program.
15971(@xref{Getopt Function} for an @command{awk} library function that
15972parses command-line options.)
15973
15974When designing your program, you should choose options that don't
15975conflict with @command{gawk}'s, because it will process any options
15976that it accepts before passing the rest of the command line on to
15977your program.  Using @samp{#!} with the @option{-E} option may help
15978(@pxref{Executable Scripts}
15979and
15980@ifnotdocbook
15981@pxref{Options}).
15982@end ifnotdocbook
15983@ifdocbook
15984@ref{Options}).
15985@end ifdocbook
15986
15987@node Pattern Action Summary
15988@section Summary
15989
15990@itemize @value{BULLET}
15991@item
15992Pattern--action pairs make up the basic elements of an @command{awk}
15993program.  Patterns are either normal expressions, range expressions,
15994or regexp constants; one of the special keywords @code{BEGIN}, @code{END},
15995@code{BEGINFILE}, or @code{ENDFILE}; or empty.  The action executes if
15996the current record matches the pattern.  Empty (missing) patterns match
15997all records.
15998
15999@item
16000I/O from @code{BEGIN} and @code{END} rules has certain constraints.
16001This is also true, only more so, for @code{BEGINFILE} and @code{ENDFILE}
16002rules.  The latter two give you ``hooks'' into @command{gawk}'s file
16003processing, allowing you to recover from a file that otherwise would
16004cause a fatal error (such as a file that cannot be opened).
16005
16006@item
16007Shell variables can be used in @command{awk} programs by careful
16008use of shell quoting.  It is easier to pass a shell variable into
16009@command{awk} by using the @option{-v} option and an @command{awk}
16010variable.
16011
16012@item
16013Actions consist of statements enclosed in curly braces. Statements
16014are built up from expressions, control statements, compound statements,
16015input and output statements, and deletion statements.
16016
16017@item
16018The control statements in @command{awk} are @code{if}-@code{else},
16019@code{while}, @code{for}, and @code{do}-@code{while}.  @command{gawk}
16020adds the @code{switch} statement.  There are two flavors of @code{for}
16021statement: one for performing general looping, and the other for iterating
16022through an array.
16023
16024@item
16025@code{break} and @code{continue} let you exit early or start the next
16026iteration of a loop (or get out of a @code{switch}).
16027
16028@item
16029@code{next} and @code{nextfile} let you read the next record and start
16030over at the top of your program or skip to the next input file and
16031start over, respectively.
16032
16033@item
The @code{exit} statement terminates your program. When executed
from an action (or function body), it transfers control to the
@code{END} rules. From within an @code{END} rule, it exits
immediately.  You may pass an optional numeric value to be used
as @command{awk}'s exit status.
16039
16040@item
16041Some predefined variables provide control over @command{awk}, mainly for I/O.
16042Other variables convey information from @command{awk} to your program.
16043
16044@item
16045@code{ARGC} and @code{ARGV} make the command-line arguments available
16046to your program. Manipulating them from a @code{BEGIN} rule lets you
16047control how @command{awk} will process the provided @value{DF}s.
16048
16049@end itemize
16050
16051@node Arrays
16052@chapter Arrays in @command{awk}
16053@cindex arrays
16054
16055An @dfn{array} is a table of values called @dfn{elements}.  The
16056elements of an array are distinguished by their @dfn{indices}.  Indices
16057may be either numbers or strings.
16058
16059This @value{CHAPTER} describes how arrays work in @command{awk},
16060how to use array elements, how to scan through every element in an array,
16061and how to remove array elements.
16062It also describes how @command{awk} simulates multidimensional
16063arrays, as well as some of the less obvious points about array usage.
16064The @value{CHAPTER} moves on to discuss @command{gawk}'s facility
16065for sorting arrays, and ends with a brief description of @command{gawk}'s
16066ability to support true arrays of arrays.
16067
16068@menu
16069* Array Basics::                The basics of arrays.
16070* Numeric Array Subscripts::    How to use numbers as subscripts in
16071                                @command{awk}.
* Uninitialized Subscripts::    Using uninitialized variables as subscripts.
16073* Delete::                      The @code{delete} statement removes an element
16074                                from an array.
16075* Multidimensional::            Emulating multidimensional arrays in
16076                                @command{awk}.
16077* Arrays of Arrays::            True multidimensional arrays.
16078* Arrays Summary::              Summary of arrays.
16079@end menu
16080
16081@node Array Basics
16082@section The Basics of Arrays
16083
16084This @value{SECTION} presents the basics: working with elements
16085in arrays one at a time, and traversing all of the elements in
16086an array.
16087
16088@menu
16089* Array Intro::                 Introduction to Arrays
16090* Reference to Elements::       How to examine one element of an array.
16091* Assigning Elements::          How to change an element of an array.
16092* Array Example::               Basic Example of an Array
16093* Scanning an Array::           A variation of the @code{for} statement. It
16094                                loops through the indices of an array's
16095                                existing elements.
16096* Controlling Scanning::        Controlling the order in which arrays are
16097                                scanned.
16098@end menu
16099
16100@node Array Intro
16101@subsection Introduction to Arrays
16102
16103@cindex Wall, Larry
16104@quotation
16105@i{Doing linear scans over an associative array is like trying to club someone
16106to death with a loaded Uzi.}
16107@author Larry Wall
16108@end quotation
16109
16110The @command{awk} language provides one-dimensional arrays
16111for storing groups of related strings or numbers.
16112Every @command{awk} array must have a name.  Array names have the same
16113syntax as variable names; any valid variable name would also be a valid
16114array name.  But one name cannot be used in both ways (as an array and
16115as a variable) in the same @command{awk} program.
16116
16117Arrays in @command{awk} superficially resemble arrays in other programming
16118languages, but there are fundamental differences.  In @command{awk}, it
16119isn't necessary to specify the size of an array before starting to use it.
16120Additionally, any number or string, not just consecutive integers,
16121may be used as an array index.
16122
16123In most other languages, arrays must be @dfn{declared} before use,
16124including a specification of
16125how many elements or components they contain.  In such languages, the
16126declaration causes a contiguous block of memory to be allocated for that
16127many elements.  Usually, an index in the array must be a nonnegative integer.
16128For example, the index zero specifies the first element in the array, which is
16129actually stored at the beginning of the block of memory.  Index one
16130specifies the second element, which is stored in memory right after the
16131first element, and so on.  It is impossible to add more elements to the
16132array, because it has room only for as many elements as given in
16133the declaration.
16134(Some languages allow arbitrary starting and ending
16135indices---e.g., @samp{15 .. 27}---but the size of the array is still fixed when
16136the array is declared.)
16137
16138@c 1/2015: Do not put the numeric values into @code. Array element
16139@c values are no different than scalar variable values.
16140A contiguous array of four elements might look like
16141@ifnotdocbook
16142@ref{figure-array-elements},
16143@end ifnotdocbook
16144@ifdocbook
16145@inlineraw{docbook, <xref linkend="figure-array-elements"/>},
16146@end ifdocbook
16147conceptually, if the element values are eight, @code{"foo"},
16148@code{""}, and 30.
16149
16150@ifnotdocbook
16151@float Figure,figure-array-elements
16152@caption{A contiguous array}
16153@center @image{array-elements, , , A Contiguous Array}
16154@end float
16155@end ifnotdocbook
16156
16157@docbook
16158<figure id="figure-array-elements" float="0">
16159<title>A contiguous array</title>
16160<mediaobject>
16161<imageobject role="web"><imagedata fileref="array-elements.png" format="PNG"/></imageobject>
16162</mediaobject>
16163</figure>
16164@end docbook
16165
16166@noindent
16167Only the values are stored; the indices are implicit from the order of
16168the values. Here, eight is the value at index zero, because eight appears in the
16169position with zero elements before it.
16170
16171@cindex arrays @subentry indexing
16172@cindex indexing arrays
16173@cindex associative arrays
16174@cindex arrays @subentry associative
16175Arrays in @command{awk} are different---they are @dfn{associative}.  This means
16176that each array is a collection of pairs---an index and its corresponding
16177array element value:
16178
16179@ifnotdocbook
16180@c extra empty column to indent it right
16181@multitable @columnfractions .1 .1 .1
16182@headitem @tab Index @tab Value
16183@item @tab @code{3} @tab @code{30}
16184@item @tab @code{1} @tab @code{"foo"}
16185@item @tab @code{0} @tab @code{8}
16186@item @tab @code{2} @tab @code{""}
16187@end multitable
16188@end ifnotdocbook
16189
16190@docbook
16191<informaltable>
16192<tgroup cols="2">
16193<colspec colname="1" align="left"/>
16194<colspec colname="2" align="left"/>
16195<thead>
16196<row>
16197<entry>Index</entry>
16198<entry>Value</entry>
16199</row>
16200</thead>
16201
16202<tbody>
16203<row>
16204<entry><literal>3</literal></entry>
16205<entry><literal>30</literal></entry>
16206</row>
16207
16208<row>
16209<entry><literal>1</literal></entry>
16210<entry><literal>"foo"</literal></entry>
16211</row>
16212
16213<row>
16214<entry><literal>0</literal></entry>
16215<entry><literal>8</literal></entry>
16216</row>
16217
16218<row>
16219<entry><literal>2</literal></entry>
16220<entry><literal>""</literal></entry>
16221</row>
16222
16223</tbody>
16224</tgroup>
16225</informaltable>
16226
16227@end docbook
16228
16229@noindent
The pairs are shown in jumbled order because their order is
irrelevant.@footnote{The ordering will vary among @command{awk}
implementations, which typically use hash tables to store array indices
and their values.}
16234
16235One advantage of associative arrays is that new pairs can be added
16236at any time.  For example, suppose a tenth element is added to the array
16237whose value is @w{@code{"number ten"}}.  The result is:
16238
16239@ifnotdocbook
16240@c extra empty column to indent it right
16241@multitable @columnfractions .1 .1 .2
16242@headitem @tab Index @tab Value
16243@item @tab @code{10} @tab @code{"number ten"}
16244@item @tab @code{3} @tab @code{30}
16245@item @tab @code{1} @tab @code{"foo"}
16246@item @tab @code{0} @tab @code{8}
16247@item @tab @code{2} @tab @code{""}
16248@end multitable
16249@end ifnotdocbook
16250
16251@docbook
16252<informaltable>
16253<tgroup cols="2">
16254<colspec colname="1" align="left"/>
16255<colspec colname="2" align="left"/>
16256<thead>
16257<row>
16258<entry>Index</entry>
16259<entry>Value</entry>
16260</row>
16261</thead>
16262<tbody>
16263
16264<row>
16265<entry><literal>10</literal></entry>
16266<entry><literal>"number ten"</literal></entry>
16267</row>
16268
16269<row>
16270<entry><literal>3</literal></entry>
16271<entry><literal>30</literal></entry>
16272</row>
16273
16274<row>
16275<entry><literal>1</literal></entry>
16276<entry><literal>"foo"</literal></entry>
16277</row>
16278
16279<row>
16280<entry><literal>0</literal></entry>
16281<entry><literal>8</literal></entry>
16282</row>
16283
16284<row>
16285<entry><literal>2</literal></entry>
16286<entry><literal>""</literal></entry>
16287</row>
16288
16289</tbody>
16290</tgroup>
16291</informaltable>
16292
16293@end docbook
16294
16295@noindent
16296@cindex sparse arrays
16297@cindex arrays @subentry sparse
16298Now the array is @dfn{sparse}, which just means some indices are missing.
16299It has elements 0--3 and 10, but doesn't have elements 4, 5, 6, 7, 8, or 9.
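
In @command{awk} code, the array in the table could be built and probed
as follows.  The array name @code{data} is an arbitrary choice for this
sketch, and the @code{in} operator that it uses is described later
(@pxref{Reference to Elements}):

@example
BEGIN @{
    data[0] = 8;  data[1] = "foo";  data[2] = "";  data[3] = 30
    data[10] = "number ten"     # no need to fill in 4 through 9

    for (i = 0; i <= 10; i++)
        if (! (i in data))
            printf "%d ", i
    print "are missing"
@}
@end example

@noindent
Running this should print @samp{4 5 6 7 8 9 are missing}.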
16300
16301Another consequence of associative arrays is that the indices don't
16302have to be nonnegative integers.  Any number, or even a string, can be
16303an index.  For example, the following is an array that translates words from
16304English to French:
16305
16306@ifnotdocbook
16307@multitable @columnfractions .1 .1 .1
16308@headitem @tab Index @tab Value
16309@item @tab @code{"dog"} @tab @code{"chien"}
16310@item @tab @code{"cat"} @tab @code{"chat"}
16311@item @tab @code{"one"} @tab @code{"un"}
16312@item @tab @code{1} @tab @code{"un"}
16313@end multitable
16314@end ifnotdocbook
16315
16316@docbook
16317<informaltable>
16318<tgroup cols="2">
16319<colspec colname="1" align="left"/>
16320<colspec colname="2" align="left"/>
16321<thead>
16322<row>
16323<entry>Index</entry>
16324<entry>Value</entry>
16325</row>
16326</thead>
16327<tbody>
16328<row>
16329<entry><literal>"dog"</literal></entry>
16330<entry><literal>"chien"</literal></entry>
16331</row>
16332
16333<row>
16334<entry><literal>"cat"</literal></entry>
16335<entry><literal>"chat"</literal></entry>
16336</row>
16337
16338<row>
16339<entry><literal>"one"</literal></entry>
16340<entry><literal>"un"</literal></entry>
16341</row>
16342
16343<row>
16344<entry><literal>1</literal></entry>
16345<entry><literal>"un"</literal></entry>
16346</row>
16347
16348</tbody>
16349</tgroup>
16350</informaltable>
16351
16352@end docbook
16353
16354@noindent
16355Here we decided to translate the number one in both spelled-out and
16356numeric form---thus illustrating that a single array can have both
16357numbers and strings as indices.
16358(In fact, array subscripts are always strings.
16359There are some subtleties to how numbers work when used as
16360array subscripts; this is discussed in more detail in
16361@ref{Numeric Array Subscripts}.)
16362Here, the number @code{1} isn't double-quoted, because @command{awk}
16363automatically converts it to a string.
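
The table above might be written in @command{awk} roughly as follows;
the array name @code{french} is an arbitrary choice for this sketch:

@example
BEGIN @{
    french["dog"] = "chien"
    french["cat"] = "chat"
    french["one"] = "un"
    french[1]     = "un"

    print french["dog"]     # chien
    print french["1"]       # un: the index 1 was stored as "1"
@}
@end example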
16364
16365@cindex @command{gawk} @subentry @code{IGNORECASE} variable in
16366@cindex case sensitivity @subentry array indices and
16367@cindex arrays @subentry @code{IGNORECASE} variable and
16368@cindex @code{IGNORECASE} variable @subentry array indices and
16369The value of @code{IGNORECASE} has no effect upon array subscripting.
16370The identical string value used to store an array element must be used
16371to retrieve it.
16372When @command{awk} creates an array (e.g., with the @code{split()}
16373built-in function),
16374that array's indices are consecutive integers starting at one.
16375(@xref{String Functions}.)
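
Both points can be checked with a short test such as the following.
This is a sketch only; the array names are arbitrary, and
@code{IGNORECASE} has its special meaning only in @command{gawk}:

@example
BEGIN @{
    n = split("alpha beta gamma", parts, " ")
    print n, parts[1], parts[3]     # 3 alpha gamma

    IGNORECASE = 1                  # no effect on subscripts
    seen["Key"] = 1
    print ("key" in seen)           # 0
    print ("Key" in seen)           # 1
@}
@end example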
16376
16377@command{awk}'s arrays are efficient---the time to access an element
16378is independent of the number of elements in the array.
16379
16380@node Reference to Elements
16381@subsection Referring to an Array Element
16382@cindex arrays @subentry referencing elements
16383@cindex array members
16384@cindex elements in arrays
16385
16386The principal way to use an array is to refer to one of its elements.
16387An @dfn{array reference} is an expression as follows:
16388
16389@example
16390@var{array}[@var{index-expression}]
16391@end example
16392
16393@noindent
16394Here, @var{array} is the name of an array.  The expression @var{index-expression} is
16395the index of the desired element of the array.
16396
16397@c 1/2015: Having the 4.3 in @samp is a little iffy. It's essentially
@c an expression though, so leave be. It's too early in the discussion
16399@c to mention that it's really a string.
16400The value of the array reference is the current value of that array
16401element.  For example, @code{foo[4.3]} is an expression referencing the element
16402of array @code{foo} at index @samp{4.3}.
16403
16404@cindex arrays @subentry unassigned elements
16405@cindex unassigned array elements
16406@cindex empty array elements
16407A reference to an array element that has no recorded value yields a value of
16408@code{""}, the null string.  This includes elements
16409that have not been assigned any value as well as elements that have been
16410deleted (@pxref{Delete}).
16411
16412@cindex non-existent array elements
16413@cindex arrays @subentry elements @subentry that don't exist
16414@quotation NOTE
16415A reference to an element that does not exist @emph{automatically} creates
16416that array element, with the null string as its value.  (In some cases,
16417this is unfortunate, because it might waste memory inside @command{awk}.)
16418
16419Novice @command{awk} programmers often make the mistake of checking if
16420an element exists by checking if the value is empty:
16421
16422@example
16423# Check if "foo" exists in a:         @ii{Incorrect!}
16424if (a["foo"] != "") @dots{}
16425@end example
16426
16427@noindent
16428This is incorrect for two reasons. First, it @emph{creates} @code{a["foo"]}
16429if it didn't exist before! Second, it is valid (if a bit unusual) to set
16430an array element equal to the empty string.
16431@end quotation
16432
16433@c @cindex arrays, @code{in} operator and
16434@cindex @code{in} operator @subentry testing if array element exists
16435To determine whether an element exists in an array at a certain index, use
16436the following expression:
16437
16438@example
16439@var{indx} in @var{array}
16440@end example
16441
16442@cindex side effects @subentry array indexing
16443@noindent
16444This expression tests whether the particular index @var{indx} exists,
16445without the side effect of creating that element if it is not present.
16446The expression has the value one (true) if @code{@var{array}[@var{indx}]}
16447exists and zero (false) if it does not exist.
16448(We use @var{indx} here, because @samp{index} is the name of a built-in
16449function.)
16450For example, this statement tests whether the array @code{frequencies}
16451contains the index @samp{2}:
16452
16453@example
16454@group
16455if (2 in frequencies)
16456    print "Subscript 2 is present."
16457@end group
16458@end example
16459
16460Note that this is @emph{not} a test of whether the array
16461@code{frequencies} contains an element whose @emph{value} is two.
16462There is no way to do that except to scan all the elements.  Also, this
16463@emph{does not} create @code{frequencies[2]}, while the following
16464(incorrect) alternative does:
16465
16466@example
16467@group
16468if (frequencies[2] != "")
16469    print "Subscript 2 is present."
16470@end group
16471@end example
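
The difference can be observed directly.  In this sketch, the helper
function @code{has_value()} is a hypothetical name, not a built-in; it
shows the linear scan needed to look for a value rather than an index,
using the array form of the @code{for} statement
(@pxref{Scanning an Array}):

@example
# Return 1 if some element of arr has the value val, else 0
function has_value(arr, val,    i)
@{
    for (i in arr)
        if (arr[i] == val)
            return 1
    return 0
@}

BEGIN @{
    frequencies[7] = 2
    print (2 in frequencies)           # 0: no such index
    print has_value(frequencies, 2)    # 1: but the value is there

    if (frequencies[2] != "")          # the reference creates the element
        print "not reached"
    print (2 in frequencies)           # 1: index 2 now exists
@}
@end example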
16472
16473@node Assigning Elements
16474@subsection Assigning Array Elements
16475@cindex arrays @subentry elements @subentry assigning values
16476@cindex elements in arrays @subentry assigning values
16477
16478Array elements can be assigned values just like
16479@command{awk} variables:
16480
16481@example
16482@var{array}[@var{index-expression}] = @var{value}
16483@end example
16484
16485@noindent
16486@var{array} is the name of an array.  The expression
16487@var{index-expression} is the index of the element of the array that is
16488assigned a value.  The expression @var{value} is the value to
16489assign to that element of the array.
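
A few assignments of this form, using made-up array names, might look
like this:

@example
BEGIN @{
    count["apple"] = 3          # create an element
    count["apple"] += 2         # update it in place
    i = 4
    square[i] = i * i           # any expression can serve as the index
    print count["apple"], square[4]
@}
@end example

@noindent
This should print @samp{5 16}.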
16490
16491@node Array Example
16492@subsection Basic Array Example
16493@cindex arrays @subentry example of using
16494
16495The following program takes a list of lines, each beginning with a line
16496number, and prints them out in order of line number.  The line numbers
16497are not in order when they are first read---instead, they
16498are scrambled.  This program sorts the lines by making an array using
16499the line numbers as subscripts.  The program then prints out the lines
16500in sorted order of their numbers.  It is a very simple program and gets
16501confused upon encountering repeated numbers, gaps, or lines that don't
16502begin with a number:
16503
16504@example
16505@c file eg/misc/arraymax.awk
16506@{
16507    if ($1 > max)
16508        max = $1
16509    arr[$1] = $0
16510@}
16511
16512END @{
16513    for (x = 1; x <= max; x++)
16514        print arr[x]
16515@}
16516@c endfile
16517@end example
16518
16519The first rule keeps track of the largest line number seen so far;
16520it also stores each line into the array @code{arr}, at an index that
16521is the line's number.
16522The second rule runs after all the input has been read, to print out
16523all the lines.
16524When this program is run with the following input:
16525
16526@example
16527@group
16528@c file eg/misc/arraymax.data
165295  I am the Five man
165302  Who are you?  The new number two!
165314  . . . And four on the floor
165321  Who is number one?
165333  I three you.
16534@c endfile
16535@end group
16536@end example
16537
16538@noindent
16539Its output is:
16540
16541@example
16542@group
165431  Who is number one?
165442  Who are you?  The new number two!
165453  I three you.
165464  . . . And four on the floor
165475  I am the Five man
16548@end group
16549@end example
16550
16551If a line number is repeated, the last line with a given number overrides
16552the others.
16553Gaps in the line numbers can be handled with an easy improvement to the
16554program's @code{END} rule, as follows:
16555
16556@example
16557@group
16558END @{
16559    for (x = 1; x <= max; x++)
16560        if (x in arr)
16561            print arr[x]
16562@}
16563@end group
16564@end example
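
Lines that don't begin with a number can be skipped in a similar spirit.
The following variation is a sketch and not part of the original
program; it assumes that a valid line number is a string of decimal
digits, and the variable @code{n} is introduced only for this illustration:

@example
# Only process records whose first field consists entirely of digits
$1 ~ /^[0-9]+$/ @{
    n = $1 + 0          # force a numeric value, so "007" becomes 7
    if (n > max)
        max = n
    arr[n] = $0
@}

END @{
    for (x = 1; x <= max; x++)
        if (x in arr)
            print arr[x]
@}
@end example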
16565
16566@node Scanning an Array
16567@subsection Scanning All Elements of an Array
16568@cindex elements in arrays @subentry scanning
16569@cindex scanning arrays
16570@cindex arrays @subentry scanning
16571@cindex loops @subentry @code{for} @subentry array scanning
16572
16573In programs that use arrays, it is often necessary to use a loop that
16574executes once for each element of an array.  In other languages, where
16575arrays are contiguous and indices are limited to nonnegative integers,
16576this is easy: all the valid indices can be found by counting from
16577the lowest index up to the highest.  This technique won't do the job
16578in @command{awk}, because any number or string can be an array index.
16579So @command{awk} has a special kind of @code{for} statement for scanning
16580an array:
16581
16582@example
16583@group
16584for (@var{var} in @var{array})
16585    @var{body}
16586@end group
16587@end example
16588
16589@noindent
16590@cindex @code{in} operator @subentry use in loops
16591This loop executes @var{body} once for each index in @var{array} that the
16592program has previously used, with the variable @var{var} set to that index.
16593
16594@cindex arrays @subentry @code{for} statement and
16595@cindex @code{for} statement @subentry looping over arrays
16596The following program uses this form of the @code{for} statement.  The
16597first rule scans the input records and notes which words appear (at
16598least once) in the input, by storing a one into the array @code{used} with
16599the word as the index.  The second rule scans the elements of @code{used} to
16600find all the distinct words that appear in the input.  It prints each
16601word that is more than 10 characters long and also prints the number of
16602such words.
16603@xref{String Functions}
16604for more information on the built-in function @code{length()}.
16605
16606@example
16607@group
16608# Record a 1 for each word that is used at least once
16609@{
16610    for (i = 1; i <= NF; i++)
16611        used[$i] = 1
16612@}
16613@end group
16614
16615@group
16616# Find number of distinct words more than 10 characters long
16617END @{
16618    for (x in used) @{
16619        if (length(x) > 10) @{
16620            ++num_long_words
16621            print x
16622        @}
16623    @}
16624    print num_long_words, "words longer than 10 characters"
16625@}
16626@end group
16627@end example
16628
16629@noindent
16630@xref{Word Sorting}
16631for a more detailed example of this type.
16632
16633@cindex arrays @subentry elements @subentry order of access by @code{in} operator
16634@cindex elements in arrays @subentry order of access by @code{in} operator
16635@cindex @code{in} operator @subentry order of array access
16636The order in which elements of the array are accessed by this statement
16637is determined by the internal arrangement of the array elements within
16638@command{awk} and in standard @command{awk} cannot be controlled
16639or changed.  This can lead to problems if new elements are added to
16640@var{array} by statements in the loop body; it is not predictable whether
16641the @code{for} loop will reach them.  Similarly, changing @var{var} inside
16642the loop may produce strange results.  It is best to avoid such things.
16643
16644As a point of information, @command{gawk} sets up the list of elements
16645to be iterated over before the loop starts, and does not change it.
16646But not all @command{awk} versions do so. Consider this program, named
16647@file{loopcheck.awk}:
16648
16649@example
16650BEGIN @{
16651    a["here"] = "here"
16652    a["is"] = "is"
16653    a["a"] = "a"
16654    a["loop"] = "loop"
16655    for (i in a) @{
16656        j++
16657        a[j] = j
16658        print i
16659    @}
16660@}
16661@end example
16662
Here is what happens when it is run with @command{gawk} (and @command{mawk}):
16664
16665@example
16666$ @kbd{gawk -f loopcheck.awk}
16667@print{} here
16668@print{} loop
16669@print{} a
16670@print{} is
16671@end example
16672
16673Contrast this to BWK @command{awk}:
16674
16675@example
16676$ @kbd{nawk -f loopcheck.awk}
16677@print{} loop
16678@print{} here
16679@print{} is
16680@print{} a
16681@print{} 1
16682@end example
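
A portable way to sidestep the difference is to avoid adding elements to
the array that is being scanned.  The following sketch copies the
indices into a second array first (the name @code{keys} is made up for
this illustration) and then does the updates in an ordinary counting loop:

@example
BEGIN @{
    a["here"] = "here"
    a["is"] = "is"
    a["a"] = "a"
    a["loop"] = "loop"
    n = 0
    for (i in a)
        keys[++n] = i       # snapshot of the indices; a is not changed
    for (k = 1; k <= n; k++) @{
        a[k] = k            # safe: no scan of a is in progress
        print keys[k]
    @}
@}
@end example

@noindent
This version should print each of the four original indices exactly once
with any @command{awk}, although the order in which they appear is still
up to the implementation.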
16683
16684@node Controlling Scanning
16685@subsection Using Predefined Array Scanning Orders with @command{gawk}
16686
16687This @value{SUBSECTION} describes a feature that is specific to @command{gawk}.
16688
16689By default, when a @code{for} loop traverses an array, the order
16690is undefined, meaning that the @command{awk} implementation
16691determines the order in which the array is traversed.
16692This order is usually based on the internal implementation of arrays
16693and will vary from one version of @command{awk} to the next.
16694
16695@cindex array scanning order, controlling
16696@cindex controlling array scanning order
16697Often, though, you may wish to do something simple, such as
16698``traverse the array by comparing the indices in ascending order,''
16699or ``traverse the array by comparing the values in descending order.''
16700@command{gawk} provides two mechanisms that give you this control:
16701
16702@itemize @value{BULLET}
16703@item
16704Set @code{PROCINFO["sorted_in"]} to one of a set of predefined values.
16705We describe this now.
16706
16707@item
16708Set @code{PROCINFO["sorted_in"]} to the name of a user-defined function
16709to use for comparison of array elements. This advanced feature
16710is described later in @ref{Array Sorting}.
16711@end itemize
16712
16713@cindex @code{PROCINFO} array @subentry values of @code{sorted_in}
16714The following special values for @code{PROCINFO["sorted_in"]} are available:
16715
16716@table @code
16717@item "@@unsorted"
16718Array elements are processed in arbitrary order, which is the default
16719@command{awk} behavior.
16720
16721@item "@@ind_str_asc"
16722Order by indices in ascending order compared as strings; this is the most basic sort.
16723(Internally, array indices are always strings, so with @samp{a[2*5] = 1}
16724the index is @code{"10"} rather than numeric 10.)
16725
16726@item "@@ind_num_asc"
16727Order by indices in ascending order but force them to be treated as numbers in the process.
16728Any index with a non-numeric value will end up positioned as if it were zero.
16729
16730@item "@@val_type_asc"
16731Order by element values in ascending order (rather than by indices).
16732Ordering is by the type assigned to the element
16733(@pxref{Typing and Comparison}).
16734All numeric values come before all string values,
16735which in turn come before all subarrays.
16736(Subarrays have not been described yet;
16737@pxref{Arrays of Arrays}.)
16738
16739If you choose to use this feature in traversing @code{FUNCTAB}
16740(@pxref{Auto-set}), then the order is built-in functions first
16741(@pxref{Built-in}), then user-defined functions (@pxref{User-defined})
16742next, and finally functions loaded from an extension
16743(@pxref{Dynamic Extensions}).
16744
16745@item "@@val_str_asc"
16746Order by element values in ascending order (rather than by indices).  Scalar values are
16747compared as strings.
16748If the string values are identical,
16749the index string values are compared instead.
16750When comparing non-scalar values,
16751@code{"@@val_type_asc"} sort ordering is used, so subarrays, if present,
16752come out last.
16753
16754@item "@@val_num_asc"
16755Order by element values in ascending order (rather than by indices).  Scalar values are
16756compared as numbers.
16757Non-scalar values are compared using @code{"@@val_type_asc"} sort ordering,
16758so subarrays, if present, come out last.
16759When numeric values are equal, the string values are used to provide
16760an ordering: this guarantees consistent results across different
16761versions of the C @code{qsort()} function,@footnote{When two elements
16762compare as equal, the C @code{qsort()} function does not guarantee
16763that they will maintain their original relative order after sorting.
16764Using the string value to provide a unique ordering when the numeric
16765values are equal ensures that @command{gawk} behaves consistently
16766across different environments.} which @command{gawk} uses internally
16767to perform the sorting.
16768If the string values are also identical,
16769the index string values are compared instead.
16770
16771
16772@item "@@ind_str_desc"
16773Like @code{"@@ind_str_asc"}, but the
16774string indices are ordered from high to low.
16775
16776@item "@@ind_num_desc"
16777Like @code{"@@ind_num_asc"}, but the
16778numeric indices are ordered from high to low.
16779
16780@item "@@val_type_desc"
16781Like @code{"@@val_type_asc"}, but the
16782element values, based on type, are ordered from high to low.
16783Subarrays, if present, come out first.
16784
16785@item "@@val_str_desc"
16786Like @code{"@@val_str_asc"}, but the
16787element values, treated as strings, are ordered from high to low.
16788If the string values are identical,
16789the index string values are compared instead.
16790When comparing non-scalar values,
16791@code{"@@val_type_desc"} sort ordering is used, so subarrays, if present,
16792come out first.
16793
16794@item "@@val_num_desc"
16795Like @code{"@@val_num_asc"}, but the
16796element values, treated as numbers, are ordered from high to low.
16797If the numeric values are equal, the string values are compared instead.
16798If they are also identical, the index string values are compared instead.
16799Non-scalar values are compared using @code{"@@val_type_desc"} sort ordering,
16800so subarrays, if present, come out first.
16801@end table
16802
16803The array traversal order is determined before the @code{for} loop
16804starts to run. Changing @code{PROCINFO["sorted_in"]} in the loop body
16805does not affect the loop.
16806For example:
16807
16808@example
16809$ @kbd{gawk '}
16810> @kbd{BEGIN @{}
16811> @kbd{   a[4] = 4}
16812> @kbd{   a[3] = 3}
16813> @kbd{   for (i in a)}
16814> @kbd{       print i, a[i]}
16815> @kbd{@}'}
16816@print{} 4 4
16817@print{} 3 3
16818$ @kbd{gawk '}
16819> @kbd{BEGIN @{}
16820> @kbd{   PROCINFO["sorted_in"] = "@@ind_str_asc"}
16821> @kbd{   a[4] = 4}
16822> @kbd{   a[3] = 3}
16823> @kbd{   for (i in a)}
16824> @kbd{       print i, a[i]}
16825> @kbd{@}'}
16826@print{} 3 3
16827@print{} 4 4
16828@end example
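
The difference between string and numeric ordering of indices shows up
as soon as the indices have different numbers of digits.  Here is a
small sketch, with sample data made up for the purpose:

@example
BEGIN @{
    a[10] = "ten"; a[9] = "nine"; a[2] = "two"

    PROCINFO["sorted_in"] = "@@ind_str_asc"
    for (i in a)
        print "string order:", i      # indices compared as strings

    PROCINFO["sorted_in"] = "@@ind_num_asc"
    for (i in a)
        print "numeric order:", i     # indices compared as numbers
@}
@end example

@noindent
When run with @command{gawk}, it prints:

@example
@print{} string order: 10
@print{} string order: 2
@print{} string order: 9
@print{} numeric order: 2
@print{} numeric order: 9
@print{} numeric order: 10
@end example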
16829
16830When sorting an array by element values, if a value happens to be
16831a subarray then it is considered to be greater than any string or
16832numeric value, regardless of what the subarray itself contains,
16833and all subarrays are treated as being equal to each other.  Their
16834order relative to each other is determined by their index strings.
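
As a small illustration of ordering by type, consider the following
sketch.  The array contents are invented for the example, and it uses a
subarray, which is a @command{gawk} feature (@pxref{Arrays of Arrays}):

@example
BEGIN @{
    a["x"] = "a string"
    a["y"] = 5
    a["z"][1] = "inside a subarray"
    PROCINFO["sorted_in"] = "@@val_type_asc"
    for (i in a)
        print i         # number first, then string, then subarray
@}
@end example

@noindent
This prints @samp{y}, @samp{x}, and @samp{z}, in that order, each on its
own line.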
16835
16836Here are some additional things to bear in mind about sorted
16837array traversal:
16838
16839@itemize @value{BULLET}
16840@item
16841The value of @code{PROCINFO["sorted_in"]} is global. That is, it affects
16842all array traversal @code{for} loops.  If you need to change it within your
16843own code, you should see if it's defined and save and restore the value:
16844
16845@example
16846@dots{}
if ("sorted_in" in PROCINFO)
    save_sorted = PROCINFO["sorted_in"]

PROCINFO["sorted_in"] = "@@val_str_desc" # or whatever
@dots{}
if (save_sorted)
    PROCINFO["sorted_in"] = save_sorted
else
    delete PROCINFO["sorted_in"]
16854@end example
16855
16856@item
16857As already mentioned, the default array traversal order is represented by
16858@code{"@@unsorted"}.  You can also get the default behavior by assigning
16859the null string to @code{PROCINFO["sorted_in"]} or by just deleting the
16860@code{"sorted_in"} element from the @code{PROCINFO} array with
16861the @code{delete} statement.
16862(The @code{delete} statement hasn't been described yet; @pxref{Delete}.)
16863@end itemize
16864
16865In addition, @command{gawk} provides built-in functions for
16866sorting arrays; see @ref{Array Sorting Functions}.
16867
16868@node Numeric Array Subscripts
16869@section Using Numbers to Subscript Arrays
16870
16871@cindex numbers @subentry as array subscripts
16872@cindex array subscripts @subentry numbers as
16873@cindex arrays @subentry numeric subscripts
16874@cindex subscripts in arrays @subentry numbers as
16875@cindex @code{CONVFMT} variable @subentry array subscripts and
16876An important aspect to remember about arrays is that @emph{array subscripts
16877are always strings}.  When a numeric value is used as a subscript,
16878it is converted to a string value before being used for subscripting
16879(@pxref{Conversion}).
16880This means that the value of the predefined variable @code{CONVFMT} can
16881affect how your program accesses elements of an array.  For example:
16882
16883@example
16884xyz = 12.153
16885data[xyz] = 1
16886CONVFMT = "%2.2f"
16887if (xyz in data)
16888    printf "%s is in data\n", xyz
16889else
16890    printf "%s is not in data\n", xyz
16891@end example
16892
16893@noindent
16894This prints @samp{12.15 is not in data}.  The first statement gives
16895@code{xyz} a numeric value.  Assigning to
16896@code{data[xyz]} subscripts @code{data} with the string value @code{"12.153"}
16897(using the default conversion value of @code{CONVFMT}, @code{"%.6g"}).
16898Thus, the array element @code{data["12.153"]} is assigned the value one.
16899The program then changes
16900the value of @code{CONVFMT}.  The test @samp{(xyz in data)} generates a new
16901string value from @code{xyz}---this time @code{"12.15"}---because the value of
@code{CONVFMT} allows only two digits after the decimal point.  This test fails,
16903because @code{"12.15"} is different from @code{"12.153"}.
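
One way to stay clear of this problem is to convert the number to a
string once and then use that string for every access.  The following is
a sketch of the idea; the variable name @code{key} is made up for the
example:

@example
BEGIN @{
    xyz = 12.153
    key = xyz ""            # concatenation converts xyz using CONVFMT now
    data[key] = 1
    CONVFMT = "%2.2f"       # no effect on key, which is already a string
    if (key in data)
        printf "%s is in data\n", key
    else
        printf "%s is not in data\n", key
@}
@end example

@noindent
This prints @samp{12.153 is in data}.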
16904
16905@cindex converting @subentry integer array subscripts to strings
16906@cindex integer array indices
16907According to the rules for conversions
16908(@pxref{Conversion}), integer
16909values always convert to strings as integers, no matter what the
value of @code{CONVFMT} may happen to be.  So the usual case,
such as the following loop, works as expected:
16912
16913@example
16914for (i = 1; i <= maxsub; i++)
16915    @ii{do something with} array[i]
16916@end example
16917
16918The ``integer values always convert to strings as integers'' rule
16919has an additional consequence for array indexing.
16920Octal and hexadecimal constants
16921@ifnotdocbook
16922(@pxref{Nondecimal-numbers})
16923@end ifnotdocbook
16924@ifdocbook
16925(covered in @ref{Nondecimal-numbers})
16926@end ifdocbook
16927are converted internally into numbers, and their original form
16928is forgotten.  This means, for example, that @code{array[17]},
16929@code{array[021]}, and @code{array[0x11]} all refer to the same element!
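
A quick check of this behavior might look like the following sketch.
It assumes @command{gawk} running in its default mode, where octal and
hexadecimal constants are recognized in program source code:

@example
BEGIN @{
    array[17] = "seventeen"
    print array[021]        # octal 21 is decimal 17
    print array[0x11]       # hexadecimal 11 is decimal 17
    print length(array)     # still only one element
@}
@end example

@noindent
Under that assumption, the program should print @samp{seventeen} twice,
followed by @samp{1}.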
16930
16931As with many things in @command{awk}, the majority of the time
16932things work as you would expect them to.  But it is useful to have a precise
16933knowledge of the actual rules, as they can sometimes have a subtle
16934effect on your programs.
16935
16936@node Uninitialized Subscripts
16937@section Using Uninitialized Variables as Subscripts
16938
16939@cindex variables @subentry uninitialized, as array subscripts
16940@cindex uninitialized variables, as array subscripts
16941@cindex subscripts in arrays @subentry uninitialized variables as
16942@cindex arrays @subentry subscripts, uninitialized variables as
16943Suppose it's necessary to write a program
16944to print the input data in reverse order.
16945A reasonable attempt to do so (with some test
16946data) might look like this:
16947
16948@example
16949$ @kbd{echo 'line 1}
16950> @kbd{line 2}
16951> @kbd{line 3' | awk '@{ l[lines] = $0; ++lines @}}
16952> @kbd{END @{}
16953>     @kbd{for (i = lines - 1; i >= 0; i--)}
16954>        @kbd{print l[i]}
16955> @kbd{@}'}
16956@print{} line 3
16957@print{} line 2
16958@end example
16959
16960Unfortunately, the very first line of input data did not appear in the
16961output!
16962
At first glance, we would think that this program should have worked.
16964The variable @code{lines}
16965is uninitialized, and uninitialized variables have the numeric value zero.
16966So, @command{awk} should have printed the value of @code{l[0]}.
16967
16968The issue here is that subscripts for @command{awk} arrays are @emph{always}
16969strings. Uninitialized variables, when used as strings, have the
16970value @code{""}, not zero.  Thus, @samp{line 1} ends up stored in
16971@code{l[""]}.
16972The following version of the program works correctly:
16973
16974@example
16975@{ l[lines++] = $0 @}
16976END @{
16977    for (i = lines - 1; i >= 0; i--)
16978       print l[i]
16979@}
16980@end example
16981
16982Here, the @samp{++} forces @code{lines} to be numeric, thus making
16983the ``old value'' numeric zero. This is then converted to @code{"0"}
16984as the array subscript.
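
Another common arrangement, shown here only as an alternative sketch, is
to number the saved lines starting from one by incrementing the counter
before using it as the subscript:

@example
@{ l[++lines] = $0 @}        # first record is stored in l[1]
END @{
    for (i = lines; i >= 1; i--)
       print l[i]
@}
@end example

@noindent
The value of @samp{++lines} is always numeric, so the null string is
never used as a subscript here either.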
16985
16986@cindex array subscripts @subentry null string as
16987@cindex null strings @subentry as array subscripts
16988@cindex dark corner @subentry array subscripts
16989@cindex lint checking @subentry array subscripts
16990Even though it is somewhat unusual, the null string
16991(@code{""}) is a valid array subscript.
16992@value{DARKCORNER}
16993@command{gawk} warns about the use of the null string as a subscript
16994if @option{--lint} is provided
16995on the command line (@pxref{Options}).
16996
16997@node Delete
16998@section The @code{delete} Statement
16999@cindex @code{delete} statement
17000@cindex deleting @subentry elements in arrays
17001@cindex arrays @subentry elements @subentry deleting
17002@cindex elements in arrays @subentry deleting
17003
17004To remove an individual element of an array, use the @code{delete}
17005statement:
17006
17007@example
17008delete @var{array}[@var{index-expression}]
17009@end example
17010
17011Once an array element has been deleted, any value the element once
17012had is no longer available. It is as if the element had never
17013been referred to or been given a value.
17014The following is an example of deleting elements in an array:
17015
17016@example
17017for (i in frequencies)
17018    delete frequencies[i]
17019@end example
17020
17021@noindent
17022This example removes all the elements from the array @code{frequencies}.
17023Once an element is deleted, a subsequent @code{for} statement to scan the array
17024does not report that element and using the @code{in} operator to check for
17025the presence of that element returns zero (i.e., false):
17026
17027@example
17028delete foo[4]
17029if (4 in foo)
17030    print "This will never be printed"
17031@end example
17032
17033@cindex null strings @subentry deleting array elements and
17034It is important to note that deleting an element is @emph{not} the
17035same as assigning it a null value (the empty string, @code{""}).
17036For example:
17037
17038@example
17039@group
17040foo[4] = ""
17041if (4 in foo)
17042  print "This is printed, even though foo[4] is empty"
17043@end group
17044@end example
17045
17046@cindex lint checking @subentry array subscripts
17047It is not an error to delete an element that does not exist.
17048However, if @option{--lint} is provided on the command line
17049(@pxref{Options}),
17050@command{gawk} issues a warning message when an element that
17051is not in the array is deleted.
17052
17053@cindex common extensions @subentry @code{delete} to delete entire arrays
17054@cindex extensions @subentry common @subentry @code{delete} to delete entire arrays
17055@cindex arrays @subentry deleting entire contents
17056@cindex deleting @subentry entire arrays
17057@cindex @code{delete} @var{array}
17058@cindex differences in @command{awk} and @command{gawk} @subentry array elements, deleting
17059All the elements of an array may be deleted with a single statement
17060by leaving off the subscript in the @code{delete} statement,
17061as follows:
17062
17063
17064@example
17065delete @var{array}
17066@end example
17067
17068Using this version of the @code{delete} statement is about three times
17069more efficient than the equivalent loop that deletes each element one
17070at a time.
17071
17072This form of the @code{delete} statement is also supported
17073by BWK @command{awk} and @command{mawk}, as well as
17074by a number of other implementations.
17075
17076@cindex Brian Kernighan's @command{awk}
17077@quotation NOTE
17078For many years, using @code{delete} without a subscript was a common
17079extension.  In September 2012, it was accepted for inclusion into the
17080POSIX standard.  See @uref{http://austingroupbugs.net/view.php?id=544,
17081the Austin Group website}.
17082@end quotation
17083
17084@cindex portability @subentry deleting array elements
17085@cindex Brennan, Michael
17086The following statement provides a portable but nonobvious way to clear
17087out an array:@footnote{Thanks to Michael Brennan for pointing this out.}
17088
17089@example
17090split("", array)
17091@end example
17092
17093@cindex @code{split()} function @subentry array elements, deleting
17094The @code{split()} function
17095(@pxref{String Functions})
17096clears out the target array first. This call asks it to split
17097apart the null string. Because there is no data to split out, the
17098function simply clears the array and then returns.
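
The effect is easy to verify with a short test.  This is a minimal
sketch; the array contents and the message are arbitrary:

@example
BEGIN @{
    array["a"] = 1
    array["b"] = 2
    n = split("", array)    # nothing to split; array is emptied
    print n
    for (i in array)
        print "never reached"
@}
@end example

@noindent
The only output is @samp{0}, the number of pieces that @code{split()}
found; the loop body never runs because no elements remain.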
17099
17100@quotation CAUTION
17101Deleting all the elements from an array does not change its type; you cannot
17102clear an array and then use the array's name as a scalar
17103(i.e., a regular variable). For example, the following does not work:
17104
17105@example
17106a[1] = 3
17107delete a
17108a = 3
17109@end example
17110@end quotation
17111
17112@node Multidimensional
17113@section Multidimensional Arrays
17114
17115@menu
17116* Multiscanning::               Scanning multidimensional arrays.
17117@end menu
17118
17119@cindex subscripts in arrays @subentry multidimensional
17120@cindex arrays @subentry multidimensional
17121A @dfn{multidimensional array} is an array in which an element is identified
17122by a sequence of indices instead of a single index.  For example, a
17123two-dimensional array requires two indices.  The usual way (in many
17124languages, including @command{awk}) to refer to an element of a
17125two-dimensional array named @code{grid} is with
17126@code{grid[@var{x},@var{y}]}.
17127
17128@cindex @code{SUBSEP} variable @subentry multidimensional arrays and
17129Multidimensional arrays are supported in @command{awk} through
17130concatenation of indices into one string.
17131@command{awk} converts the indices into strings
17132(@pxref{Conversion}) and
17133concatenates them together, with a separator between them.  This creates
17134a single string that describes the values of the separate indices.  The
17135combined string is used as a single index into an ordinary,
17136one-dimensional array.  The separator used is the value of the built-in
17137variable @code{SUBSEP}.
17138
17139For example, suppose we evaluate the expression @samp{foo[5,12] = "value"}
17140when the value of @code{SUBSEP} is @code{"@@"}.  The numbers 5 and 12 are
17141converted to strings and
17142concatenated with an @samp{@@} between them, yielding @code{"5@@12"}; thus,
17143the array element @code{foo["5@@12"]} is set to @code{"value"}.
17144
17145Once the element's value is stored, @command{awk} has no record of whether
17146it was stored with a single index or a sequence of indices.  The two
17147expressions @samp{foo[5,12]} and @w{@samp{foo[5 SUBSEP 12]}} are always
17148equivalent.
17149
17150The default value of @code{SUBSEP} is the string @code{"\034"},
17151which contains a nonprinting character that is unlikely to appear in an
17152@command{awk} program or in most input data.
17153The usefulness of choosing an unlikely character comes from the fact
17154that index values that contain a string matching @code{SUBSEP} can lead to
17155combined strings that are ambiguous.  Suppose that @code{SUBSEP} is
17156@code{"@@"}; then @w{@samp{foo["a@@b", "c"]}} and @w{@samp{foo["a",
17157"b@@c"]}} are indistinguishable because both are actually
17158stored as @samp{foo["a@@b@@c"]}.
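
The collision can be demonstrated directly.  The following sketch uses
the same subscripts as the text, with made-up values, and counts the
elements with a scanning loop:

@example
BEGIN @{
    SUBSEP = "@@"
    foo["a@@b", "c"] = "first"
    foo["a", "b@@c"] = "second"     # same combined index; overwrites
    n = 0
    for (k in foo)
        n++
    print n, foo["a@@b@@c"]
@}
@end example

@noindent
This prints @samp{1 second}: the array holds a single element, and the
second assignment replaced the value stored by the first.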
17159
17160@cindex @code{in} operator @subentry index existence in multidimensional arrays
17161To test whether a particular index sequence exists in a
17162multidimensional array, use the same operator (@code{in}) that is
17163used for single-dimensional arrays.  Write the whole sequence of indices
17164in parentheses, separated by commas, as the left operand:
17165
17166@example
17167if ((@var{subscript1}, @var{subscript2}, @dots{}) in @var{array})
17168    @dots{}
17169@end example
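
For instance, the following short sketch (the array name @code{grid} and
the messages are chosen only for illustration) stores one element and
then tests for two different index sequences:

@example
BEGIN @{
    grid[1, 2] = "x"
    if ((1, 2) in grid)
        print "(1, 2) is present"
    if (! ((2, 1) in grid))
        print "(2, 1) is absent"
@}
@end example

@noindent
Both messages are printed.  As with one-dimensional arrays, the test
itself does not create the element being tested for.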
17170
17171Here is an example that treats its input as a two-dimensional array of
17172fields; it rotates this array 90 degrees clockwise and prints the
17173result.  It assumes that all lines have the same number of
17174elements:
17175
17176@example
17177@{
17178     if (max_nf < NF)
17179          max_nf = NF
17180     max_nr = NR
17181     for (x = 1; x <= NF; x++)
17182          vector[x, NR] = $x
17183@}
17184
17185END @{
17186     for (x = 1; x <= max_nf; x++) @{
17187          for (y = max_nr; y >= 1; --y)
17188               printf("%s ", vector[x, y])
17189          printf("\n")
17190     @}
17191@}
17192@end example
17193
17194@noindent
17195When given the input:
17196
17197@example
17198@group
171991 2 3 4 5 6
172002 3 4 5 6 1
172013 4 5 6 1 2
172024 5 6 1 2 3
17203@end group
17204@end example
17205
17206@noindent
17207the program produces the following output:
17208
17209@example
17210@group
172114 3 2 1
172125 4 3 2
172136 5 4 3
172141 6 5 4
172152 1 6 5
172163 2 1 6
17217@end group
17218@end example
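
If the input lines may have differing numbers of fields, the
multidimensional @code{in} test can be used to notice the missing
elements.  The following variation is only a sketch; printing @samp{-}
as a placeholder and the variable name @code{cell} are assumptions made
for this example:

@example
@{
     if (max_nf < NF)
          max_nf = NF
     max_nr = NR
     for (x = 1; x <= NF; x++)
          vector[x, NR] = $x
@}

END @{
     for (x = 1; x <= max_nf; x++) @{
          for (y = max_nr; y >= 1; --y) @{
               if ((x, y) in vector)      # field x existed on line y
                    cell = vector[x, y]
               else
                    cell = "-"            # placeholder for a missing field
               printf("%s ", cell)
          @}
          printf("\n")
     @}
@}
@end example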
17219
17220@node Multiscanning
17221@subsection Scanning Multidimensional Arrays
17222
17223There is no special @code{for} statement for scanning a
17224``multidimensional'' array. There cannot be one, because, in truth,
17225@command{awk} does not have
17226multidimensional arrays or elements---there is only a
17227multidimensional @emph{way of accessing} an array.
17228
17229@cindex subscripts in arrays @subentry multidimensional @subentry scanning
17230@cindex arrays @subentry multidimensional @subentry scanning
17231@cindex scanning multidimensional arrays
17232However, if your program has an array that is always accessed as
17233multidimensional, you can get the effect of scanning it by combining
17234the scanning @code{for} statement
17235(@pxref{Scanning an Array}) with the
17236built-in @code{split()} function
17237(@pxref{String Functions}).
17238It works in the following manner:
17239
17240@example
17241for (combined in array) @{
17242    split(combined, separate, SUBSEP)
17243    @dots{}
17244@}
17245@end example
17246
17247@noindent
17248This sets the variable @code{combined} to
17249each concatenated combined index in the array, and splits it
17250into the individual indices by breaking it apart where the value of
17251@code{SUBSEP} appears.  The individual indices then become the elements of
17252the array @code{separate}.
17253
17254Thus, if a value is previously stored in @code{array[1, "foo"]}, then
17255an element with index @code{"1\034foo"} exists in @code{array}.  (Recall
17256that the default value of @code{SUBSEP} is the character with code 034.)
17257Sooner or later, the @code{for} statement finds that index and does an
17258iteration with the variable @code{combined} set to @code{"1\034foo"}.
17259Then the @code{split()} function is called as follows:
17260
17261@example
17262split("1\034foo", separate, "\034")
17263@end example
17264
17265@noindent
17266The result is to set @code{separate[1]} to @code{"1"} and
17267@code{separate[2]} to @code{"foo"}.  Presto! The original sequence of
17268separate indices is recovered.
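
Putting the pieces together, a complete program might look like the
following sketch.  The array contents are made up for the example:

@example
BEGIN @{
    array[1, "foo"] = "a"
    array[2, "bar"] = "b"
    for (combined in array) @{
        split(combined, separate, SUBSEP)
        # separate[1] and separate[2] hold the original indices
        print separate[1], separate[2], array[combined]
    @}
@}
@end example

@noindent
Each output line shows the two recovered indices followed by the stored
value.  The order of the lines depends on the order in which the
@command{awk} implementation scans the array.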
17269
17270
17271@node Arrays of Arrays
17272@section Arrays of Arrays
17273@cindex arrays @subentry arrays of arrays
17274
17275@command{gawk} goes beyond standard @command{awk}'s multidimensional
17276array access and provides true arrays of
17277arrays. Elements of a subarray are referred to by their own indices
17278enclosed in square brackets, just like the elements of the main array.
17279For example, the following creates a two-element subarray at index @code{1}
17280of the main array @code{a}:
17281
17282@example
17283a[1][1] = 1
17284a[1][2] = 2
17285@end example
17286
17287This simulates a true two-dimensional array. Each subarray element can
17288contain another subarray as a value, which in turn can hold other arrays
17289as well. In this way, you can create arrays of three or more dimensions.
17290The indices can be any @command{awk} expressions, including scalars
17291separated by commas (i.e., a regular @command{awk} simulated
17292multidimensional subscript). So the following is valid in
17293@command{gawk}:
17294
17295@example
17296a[1][3][1, "name"] = "barney"
17297@end example
17298
17299Each subarray and the main array can be of different length. In fact, the
17300elements of an array or its subarray do not all have to have the same
17301type. This means that the main array and any of its subarrays can be
17302nonrectangular, or jagged in structure. You can assign a scalar value to
17303the index @code{4} of the main array @code{a}, even though @code{a[1]}
17304is itself an array and not a scalar:
17305
17306@example
17307a[4] = "An element in a jagged array"
17308@end example
17309
17310The terms @dfn{dimension}, @dfn{row}, and @dfn{column} are
17311meaningless when applied
17312to such an array, but we will use ``dimension'' henceforth to imply the
17313maximum number of indices needed to refer to an existing element. The
17314type of any element that has already been assigned cannot be changed
17315by assigning a value of a different type. You have to first delete the
17316current element, which effectively makes @command{gawk} forget about
17317the element at that index:
17318
17319@example
17320delete a[4]
17321a[4][5][6][7] = "An element in a four-dimensional array"
17322@end example
17323
17324@noindent
17325This removes the scalar value from index @code{4} and then inserts a
17326three-level nested subarray
17327containing a scalar. You can also
17328delete an entire subarray or subarray of subarrays:
17329
17330@example
17331delete a[4][5]
17332a[4][5] = "An element in subarray a[4]"
17333@end example
17334
But recall that you cannot delete the main array @code{a} and then use it
as a scalar.
17337
17338The built-in functions that take array arguments can also be used
17339with subarrays. For example, the following code fragment uses @code{length()}
17340(@pxref{String Functions})
17341to determine the number of elements in the main array @code{a} and
17342its subarrays:
17343
17344@example
17345print length(a), length(a[1]), length(a[1][3])
17346@end example
17347
17348@noindent
17349This results in the following output for our main array @code{a}:
17350
17351@example
2 3 1
17353@end example
17354
17355@noindent
17356The @samp{@var{subscript} in @var{array}} expression
17357(@pxref{Reference to Elements}) works similarly for both
17358regular @command{awk}-style
17359arrays and arrays of arrays. For example, the tests @samp{1 in a},
17360@samp{3 in a[1]}, and @samp{(1, "name") in a[1][3]} all evaluate to
17361one (true) for our array @code{a}.
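
The following fragment is a minimal sketch of such tests. It builds the
relevant part of @code{a} itself; the values stored in @code{a[1][1]}
and @code{a[1][2]} are assumptions made only for illustration. Testing
with @code{in} does not create the element being tested:

@example
BEGIN @{
    a[1][1] = 1                      # assumed value, for illustration
    a[1][2] = 2                      # assumed value, for illustration
    a[1][3][1, "name"] = "barney"

    if (3 in a[1])
        print "a[1] has index 3"
    if (! (5 in a[1]))
        print "a[1] has no index 5"
    if ((1, "name") in a[1][3])
        print "a[1][3] has index (1, \"name\")"
@}
@end example

@noindent
When run with @command{gawk}, this should print all three messages.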
17362
17363The @samp{for (item in array)} statement (@pxref{Scanning an Array})
17364can be nested to scan all the
17365elements of an array of arrays if it is rectangular in structure. In order
17366to print the contents (scalar values) of a two-dimensional array of arrays
17367(i.e., in which each first-level element is itself an
17368array, not necessarily of the same length),
17369you could use the following code:
17370
17371@example
17372for (i in array)
17373    for (j in array[i])
17374        print array[i][j]
17375@end example
17376
17377The @code{isarray()} function (@pxref{Type Functions})
17378lets you test if an array element is itself an array:
17379
17380@example
17381for (i in array) @{
17382    if (isarray(array[i])) @{
17383        for (j in array[i]) @{
17384            print array[i][j]
17385        @}
17386    @}
17387    else
17388        print array[i]
17389@}
17390@end example
17391
17392If the structure of a jagged array of arrays is known in advance,
17393you can often devise workarounds using control statements. For example,
17394the following code prints the elements of our main array @code{a}:
17395
17396@example
17397@group
17398for (i in a) @{
17399    for (j in a[i]) @{
17400        if (j == 3) @{
17401            for (k in a[i][j])
17402                print a[i][j][k]
17403@end group
17404@group
17405        @} else
17406            print a[i][j]
17407    @}
17408@}
17409@end group
17410@end example
17411
17412@noindent
17413@xref{Walking Arrays} for a user-defined function that ``walks'' an
17414arbitrarily dimensioned array of arrays.
17415
17416Recall that a reference to an uninitialized array element yields a value
17417of @code{""}, the null string. This has one important implication when you
17418intend to use a subarray as an argument to a function, as illustrated by
17419the following example:
17420
17421@example
17422$ @kbd{gawk 'BEGIN @{ split("a b c d", b[1]); print b[1][1] @}'}
17423@error{} gawk: cmd. line:1: fatal: split: second argument is not an array
17424@end example
17425
17426The way to work around this is to first force @code{b[1]} to be an array by
17427creating an arbitrary index:
17428
17429@example
17430$ @kbd{gawk 'BEGIN @{ b[1][1] = ""; split("a b c d", b[1]); print b[1][1] @}'}
17431@print{} a
17432@end example
17433
17434@node Arrays Summary
17435@section Summary
17436
17437@itemize @value{BULLET}
17438@item
17439Standard @command{awk} provides one-dimensional associative arrays
17440(arrays indexed by string values).  All arrays are associative; numeric
17441indices are converted automatically to strings.
17442
17443@item
17444Array elements are referenced as @code{@var{array}[@var{indx}]}.
17445Referencing an element creates it if it did not exist previously.
17446
17447@item
17448The proper way to see if an array has an element with a given index
17449is to use the @code{in} operator: @samp{@var{indx} in @var{array}}.
17450
17451@item
17452Use @samp{for (@var{indx} in @var{array}) @dots{}} to scan through all the
17453individual elements of an array. In the body of the loop, @var{indx} takes
17454on the value of each element's index in turn.
17455
17456@item
17457The order in which a @samp{for (@var{indx} in @var{array})} loop
17458traverses an array is undefined in POSIX @command{awk} and varies among
17459implementations.  @command{gawk} lets you control the order by assigning
17460special predefined values to @code{PROCINFO["sorted_in"]}.
17461
17462@item
17463Use @samp{delete @var{array}[@var{indx}]} to delete an individual element.
17464To delete all of the elements in an array,
17465use @samp{delete @var{array}}.
17466This latter feature has been a common extension for many
17467years and is now standard, but may not be supported by all commercial
17468versions of @command{awk}.
17469
17470@item
17471Standard @command{awk} simulates multidimensional arrays by separating
17472subscript values with commas.  The values are concatenated into a
17473single string, separated by the value of @code{SUBSEP}.  The fact
17474that such a subscript was created in this way is not retained; thus,
17475changing @code{SUBSEP} may have unexpected consequences.  You can use
17476@samp{(@var{sub1}, @var{sub2}, @dots{}) in @var{array}} to see if such
17477a multidimensional subscript exists in @var{array}.
17478
17479@item
17480@command{gawk} provides true arrays of arrays. You use a separate
17481set of square brackets for each dimension in such an array:
17482@code{data[row][col]}, for example. Array elements may thus be either
17483scalar values (number or string) or other arrays.
17484
17485@item
17486Use the @code{isarray()} built-in function to determine if an array
17487element is itself a subarray.
17488
17489@end itemize
17490
17491
17492@node Functions
17493@chapter Functions
17494
17495@cindex functions @subentry built-in
17496@cindex built-in functions
17497This @value{CHAPTER} describes @command{awk}'s built-in functions,
17498which fall into three categories: numeric, string, and I/O.
17499@command{gawk} provides additional groups of functions
17500to work with values that represent time, do
17501bit manipulation, sort arrays,
17502provide type information, and internationalize and localize programs.
17503
17504Besides the built-in functions, @command{awk} has provisions for
17505writing new functions that the rest of a program can use.
17506The second half of this @value{CHAPTER} describes these
17507@dfn{user-defined} functions.
17508Finally, we explore indirect function calls, a @command{gawk}-specific
17509extension that lets you determine at runtime what function is to
17510be called.
17511
17512@menu
17513* Built-in::                    Summarizes the built-in functions.
* User-defined::                Describes user-defined functions in detail.
17515* Indirect Calls::              Choosing the function to call at runtime.
17516* Functions Summary::           Summary of functions.
17517@end menu
17518
17519@node Built-in
17520@section Built-in Functions
17521
17522@dfn{Built-in} functions are always available for your @command{awk}
17523program to call.  This @value{SECTION} defines all the built-in functions
17524in @command{awk}; some of these are mentioned in other @value{SECTION}s
17525but are summarized here for your convenience.
17526
17527@menu
17528* Calling Built-in::            How to call built-in functions.
17529* Numeric Functions::           Functions that work with numbers, including
17530                                @code{int()}, @code{sin()} and @code{rand()}.
17531* String Functions::            Functions for string manipulation, such as
17532                                @code{split()}, @code{match()} and
17533                                @code{sprintf()}.
17534* I/O Functions::               Functions for files and shell commands.
17535* Time Functions::              Functions for dealing with timestamps.
17536* Bitwise Functions::           Functions for bitwise operations.
17537* Type Functions::              Functions for type information.
17538* I18N Functions::              Functions for string translation.
17539@end menu
17540
17541@node Calling Built-in
17542@subsection Calling Built-in Functions
17543
17544To call one of @command{awk}'s built-in functions, write the name of
17545the function followed
17546by arguments in parentheses.  For example, @samp{atan2(y + z, 1)}
17547is a call to the function @code{atan2()} and has two arguments.
17548
17549@cindex programming conventions @subentry functions @subentry calling
17550@cindex whitespace @subentry functions, calling
17551Whitespace is ignored between the built-in function name and the
17552opening parenthesis, but nonetheless it is good practice to avoid using whitespace
17553there.  User-defined functions do not permit whitespace in this way, and
17554it is easier to avoid mistakes by following a simple
17555convention that always works---no whitespace after a function name.
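
As a small illustration, both of the following calls to the built-in
@code{substr()} function are accepted, although the second form is the
one to prefer:

@example
$ @kbd{awk 'BEGIN @{ print substr ("hello", 2, 3); print substr("hello", 2, 3) @}'}
@print{} ell
@print{} ell
@end example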
17556
17557@cindex troubleshooting @subentry @command{gawk} @subentry fatal errors, function arguments
17558@cindex @command{gawk} @subentry function arguments and
17559@cindex differences in @command{awk} and @command{gawk} @subentry function arguments
17560Each built-in function accepts a certain number of arguments.
17561In some cases, arguments can be omitted. The defaults for omitted
17562arguments vary from function to function and are described under the
17563individual functions.  In some @command{awk} implementations, extra
17564arguments given to built-in functions are ignored.  However, in @command{gawk},
17565it is a fatal error to give extra arguments to a built-in function.
17566
17567When a function is called, expressions that create the function's actual
17568parameters are evaluated completely before the call is performed.
17569For example, in the following code fragment:
17570
17571@example
17572i = 4
17573j = sqrt(i++)
17574@end example
17575
17576@cindex evaluation order @subentry functions
17577@cindex functions @subentry built-in @subentry evaluation order
17578@cindex built-in functions @subentry evaluation order
17579@noindent
17580the variable @code{i} is incremented to the value five before @code{sqrt()}
17581is called with a value of four for its actual parameter.
17582The order of evaluation of the expressions used for the function's
17583parameters is undefined.  Thus, avoid writing programs that
17584assume that parameters are evaluated from left to right or from
17585right to left.  For example:
17586
17587@example
17588i = 5
17589j = atan2(++i, i *= 2)
17590@end example
17591
17592If the order of evaluation is left to right, then @code{i} first becomes
17593six, and then 12, and @code{atan2()} is called with the two arguments six
17594and 12.  But if the order of evaluation is right to left, @code{i}
17595first becomes 10, then 11, and @code{atan2()} is called with the
17596two arguments 11 and 10.
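
The post-increment case shown earlier can be checked directly:

@example
$ @kbd{awk 'BEGIN @{ i = 4; j = sqrt(i++); print i, j @}'}
@print{} 5 2
@end example

One way to avoid depending on the order of evaluation, sketched here with
the hypothetical temporary variables @code{y} and @code{x}, is to perform
the side effects in separate statements so that their order is explicit:

@example
i = 5
y = ++i            # i is now 6
x = (i *= 2)       # i is now 12
j = atan2(y, x)    # always called as atan2(6, 12)
@end example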
17597
17598@node Numeric Functions
17599@subsection Numeric Functions
17600@cindex numeric @subentry functions
17601
17602The following list describes all of
17603the built-in functions that work with numbers.
17604Optional parameters are enclosed in square brackets@w{ ([ ]):}
17605
17606@c @asis for docbook
17607@table @asis
17608@item @code{atan2(@var{y}, @var{x})}
17609@cindexawkfunc{atan2}
17610@cindex arctangent
17611Return the arctangent of @code{@var{y} / @var{x}} in radians.
17612You can use @samp{pi = atan2(0, -1)} to retrieve the value of
17613@value{PI}.
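
For instance, the following sketch uses that idiom to convert degrees to
radians. The function name @code{deg2rad()} is made up for this example:

@example
# Convert an angle in degrees to radians.
function deg2rad(d) @{ return d * atan2(0, -1) / 180 @}

BEGIN @{
    printf("%.5f\n", atan2(0, -1))       # prints 3.14159
    printf("%.5f\n", deg2rad(180))       # also prints 3.14159
    printf("%.5f\n", atan2(1, 1) * 4)    # atan2(1, 1) is a quarter of pi
@}
@end example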
17614
17615@item @code{cos(@var{x})}
17616@cindexawkfunc{cos}
17617@cindex cosine
17618Return the cosine of @var{x}, with @var{x} in radians.
17619
17620@item @code{exp(@var{x})}
17621@cindexawkfunc{exp}
17622@cindex exponent
17623Return the exponential of @var{x} (@code{e ^ @var{x}}) or report
17624an error if @var{x} is out of range.  The range of values @var{x} can have
17625depends on your machine's floating-point representation.
17626
17627@item @code{int(@var{x})}
17628@cindexawkfunc{int}
17629@cindex round to nearest integer
Return the integer part of @var{x}, truncated toward zero; that is, the
nearest integer to @var{x} that lies between @var{x} and zero.
17632For example, @code{int(3)} is 3, @code{int(3.9)} is 3, @code{int(-3.9)}
17633is @minus{}3, and @code{int(-3)} is @minus{}3 as well.
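
Because @code{int()} truncates rather than rounds, rounding to the nearest
integer has to be coded separately. The following is one possible sketch;
the name @code{round_half_away()} is hypothetical, and ties are rounded
away from zero:

@example
# Round x to the nearest integer, with ties going away from zero.
function round_half_away(x)
@{
    return int(x + (x < 0 ? -0.5 : 0.5))
@}

BEGIN @{
    print int(2.7), round_half_away(2.7)      # prints "2 3"
    print int(-2.7), round_half_away(-2.7)    # prints "-2 -3"
@}
@end example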
17634
17635@ifset INTDIV
17636@item @code{intdiv0(@var{numerator}, @var{denominator}, @var{result})}
17637@cindexawkfunc{intdiv0}
17638@cindex intdiv0
17639Perform integer division, similar to the standard C @code{div()} function.
17640First, truncate @code{numerator} and @code{denominator}
17641towards zero, creating integer values.  Clear the @code{result}
17642array, and then set @code{result["quotient"]} to the result of
17643@samp{numerator / denominator}, truncated towards zero to an integer,
17644and set @code{result["remainder"]} to the result of @samp{numerator %
17645denominator}, truncated towards zero to an integer.
17646Attempting division by zero causes a fatal error.
17647The function returns zero upon success, and @minus{}1 upon error.
17648
17649This function is
17650primarily intended for use with arbitrary length integers; it avoids
17651creating MPFR arbitrary precision floating-point values (@pxref{Arbitrary
17652Precision Integers}).
17653
17654This function is a @code{gawk} extension.  It is not available in
17655compatibility mode (@pxref{Options}).
17656@end ifset
17657
17658@item @code{log(@var{x})}
17659@cindexawkfunc{log}
17660@cindex logarithm
17661Return the natural logarithm of @var{x}, if @var{x} is positive;
17662otherwise, return @code{NaN} (``not a number'') on IEEE 754 systems.
17663Additionally, @command{gawk} prints a warning message when @code{x}
17664is negative.
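
Logarithms to other bases can be derived from the natural logarithm.
As a sketch, using the hypothetical helper function @code{log_base()}:

@example
# Return the logarithm of x to the given base.
# Both arguments are assumed to be positive, and base must not be 1.
function log_base(x, base)
@{
    return log(x) / log(base)
@}

BEGIN @{
    print log_base(8, 2)       # prints 3
    print log_base(100, 10)    # prints 2
@}
@end example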
17665
17666@cindex Beebe, Nelson H.F.@:
17667@item @code{rand()}
17668@cindexawkfunc{rand}
17669@cindex random numbers @subentry @code{rand()}/@code{srand()} functions
17670Return a random number.  The values of @code{rand()} are
17671uniformly distributed between zero and one.
17672The value could be zero but is never one.@footnote{The C version of
17673@code{rand()} on many Unix systems is known to produce fairly poor
17674sequences of random numbers.  However, nothing requires that an
17675@command{awk} implementation use the C @code{rand()} to implement the
17676@command{awk} version of @code{rand()}.  In fact, for many years,
17677@command{gawk} used the BSD @code{random()} function, which is
17678considerably better than @code{rand()}, to produce random numbers.
17679From @value{PVERSION} 4.1.4, courtesy of Nelson H.F.@: Beebe, @command{gawk}
uses the Bays-Durham shuffle buffer algorithm which considerably extends
17681the period of the random number generator, and eliminates short-range and
17682long-range correlations that might exist in the original generator.}
17683
17684Often random integers are needed instead.  Following is a user-defined function
17685that can be used to obtain a random nonnegative integer less than @var{n}:
17686
17687@example
17688function randint(n)
17689@{
17690    return int(n * rand())
17691@}
17692@end example
17693
17694@noindent
17695The multiplication produces a random number greater than or equal to
17696zero and less than @code{n}.  Using @code{int()}, this result is made into
17697an integer between zero and @code{n} @minus{} 1, inclusive.
17698
17699The following example uses a similar function to produce random integers
17700between one and @var{n}.  This program prints a new random number for
17701each input record:
17702
17703@example
17704# Function to roll a simulated die.
17705function roll(n) @{ return 1 + int(rand() * n) @}
17706
17707# Roll 3 six-sided dice and
17708# print total number of points.
17709@{
17710    printf("%d points\n", roll(6) + roll(6) + roll(6))
17711@}
17712@end example
17713
17714@cindex seeding random number generator
17715@cindex random numbers @subentry seed of
17716@quotation CAUTION
17717In most @command{awk} implementations, including @command{gawk},
17718@code{rand()} starts generating numbers from the same
17719starting number, or @dfn{seed}, each time you run @command{awk}.@footnote{@command{mawk}
17720uses a different seed each time.}  Thus,
17721a program generates the same results each time you run it.
17722The numbers are random within one @command{awk} run but predictable
17723from run to run.  This is convenient for debugging, but if you want
17724a program to do different things each time it is used, you must change
17725the seed to a value that is different in each run.  To do this,
17726use @code{srand()}.
17727@end quotation
17728
17729@item @code{sin(@var{x})}
17730@cindexawkfunc{sin}
17731@cindex sine
17732Return the sine of @var{x}, with @var{x} in radians.
17733
17734@item @code{sqrt(@var{x})}
17735@cindexawkfunc{sqrt}
17736@cindex square root
Return the positive square root of @var{x}.
For example, @code{sqrt(4)} is 2.
@command{gawk} prints a warning message if @var{x} is negative.
17740
17741@item @code{srand(}[@var{x}]@code{)}
17742@cindexawkfunc{srand}
17743Set the starting point, or seed,
17744for generating random numbers to the value @var{x}.
17745
17746Each seed value leads to a particular sequence of random
17747numbers.@footnote{Computer-generated random numbers really are not truly
17748random.  They are technically known as @dfn{pseudorandom}.  This means
17749that although the numbers in a sequence appear to be random, you can in
17750fact generate the same sequence of random numbers over and over again.}
17751Thus, if the seed is set to the same value a second time,
17752the same sequence of random numbers is produced again.
17753
17754@quotation CAUTION
17755Different @command{awk} implementations use different random-number
17756generators internally.  Don't expect the same @command{awk} program
17757to produce the same series of random numbers when executed by
17758different versions of @command{awk}.
17759@end quotation
17760
17761If the argument @var{x} is omitted, as in @samp{srand()}, then the current
date and time of day are used for a seed.  This is the usual way to get a
sequence of random numbers that differs from one run to the next.
17764
17765The return value of @code{srand()} is the previous seed.  This makes it
17766easy to keep track of the seeds in case you need to consistently reproduce
17767sequences of random numbers.
17768
17769POSIX does not specify the initial seed; it differs among @command{awk}
17770implementations.
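
The following sketch illustrates two of the points above: reseeding with
the same value reproduces the sequence, and @code{srand()} returns the
seed that was previously in effect.  The seed value and the variable
names are arbitrary:

@example
BEGIN @{
    srand(42)
    first = rand()

    prev = srand(42)    # reseed; prev is the seed that was set above
    second = rand()

    result = (first == second) ? "same" : "different"
    print prev          # prints 42
    print result        # prints "same"
@}
@end example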
17771@end table
17772
17773@node String Functions
17774@subsection String-Manipulation Functions
17775@cindex string-manipulation functions
17776
17777The functions in this @value{SECTION} look at or change the text of one
17778or more strings.
17779
17780@command{gawk} understands locales (@pxref{Locales}) and does all
17781string processing in terms of @emph{characters}, not @emph{bytes}.
17782This distinction is particularly important to understand for locales
17783where one character may be represented by multiple bytes.  Thus, for
17784example, @code{length()} returns the number of characters in a string,
17785and not the number of bytes used to represent those characters. Similarly,
17786@code{index()} works with character indices, and not byte indices.
17787
17788@quotation CAUTION
17789A number of functions deal with indices into strings.  For these
17790functions, the first character of a string is at position (index) one.
17791This is different from C and the languages descended from it, where the
17792first character is at position zero.  You need to remember this when
17793doing index calculations, particularly if you are used to C.
17794@end quotation
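
For example, the first character of @code{"hello"} is found at, and
extracted from, position one:

@example
$ @kbd{awk 'BEGIN @{ s = "hello"; print index(s, "h"), substr(s, 1, 1) @}'}
@print{} 1 h
@end example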
17795
17796In the following list, optional parameters are enclosed in square brackets@w{ ([ ]).}
17797Several functions perform string substitution; the full discussion is
17798provided in the description of the @code{sub()} function, which comes
17799toward the end, because the list is presented alphabetically.
17800
17801Those functions that are specific to @command{gawk} are marked with a
17802pound sign (@samp{#}).  They are not available in compatibility mode
17803(@pxref{Options}):
17804
17805
17806@menu
17807* Gory Details::                More than you want to know about @samp{\} and
17808                                @samp{&} with @code{sub()}, @code{gsub()}, and
17809                                @code{gensub()}.
17810@end menu
17811
17812@c @asis for docbook
17813@table @asis
17814@item @code{asort(}@var{source} [@code{,} @var{dest} [@code{,} @var{how} ] ]@code{) #}
17815@itemx @code{asorti(}@var{source} [@code{,} @var{dest} [@code{,} @var{how} ] ]@code{) #}
17816@cindexgawkfunc{asorti}
17817@cindex sort array
17818@cindex arrays @subentry elements @subentry retrieving number of
17819@cindexgawkfunc{asort}
17820@cindex sort array indices
17821These two functions are similar in behavior, so they are described
17822together.
17823
17824@quotation NOTE
17825The following description ignores the third argument, @var{how}, as it
17826requires understanding features that we have not discussed yet.  Thus,
17827the discussion here is a deliberate simplification.  (We do provide all
17828the details later on; see @ref{Array Sorting Functions} for the full story.)
17829@end quotation
17830
17831Both functions return the number of elements in the array @var{source}.
For @code{asort()}, @command{gawk} sorts the values of @var{source}
17833and replaces the indices of the sorted values of @var{source} with
17834sequential integers starting with one.  If the optional array @var{dest}
17835is specified, then @var{source} is duplicated into @var{dest}.  @var{dest}
17836is then sorted, leaving the indices of @var{source} unchanged.
17837
17838@cindex @command{gawk} @subentry @code{IGNORECASE} variable in
17839When comparing strings, @code{IGNORECASE} affects the sorting
17840(@pxref{Array Sorting Functions}).  If the
17841@var{source} array contains subarrays as values (@pxref{Arrays of
17842Arrays}), they will come last, after all scalar values.
17843Subarrays are @emph{not} recursively sorted.
17844
17845For example, if the contents of @code{a} are as follows:
17846
17847@example
17848a["last"] = "de"
17849a["first"] = "sac"
17850a["middle"] = "cul"
17851@end example
17852
17853@noindent
17854A call to @code{asort()}:
17855
17856@example
17857asort(a)
17858@end example
17859
17860@noindent
17861results in the following contents of @code{a}:
17862
17863@example
17864@group
17865a[1] = "cul"
17866a[2] = "de"
17867a[3] = "sac"
17868@end group
17869@end example
17870
17871The @code{asorti()} function works similarly to @code{asort()}; however,
17872the @emph{indices} are sorted, instead of the values. Thus, in the
17873previous example, starting with the same initial set of indices and
17874values in @code{a}, calling @samp{asorti(a)} would yield:
17875
17876@example
17877a[1] = "first"
17878a[2] = "last"
17879a[3] = "middle"
17880@end example
17881
17882@quotation NOTE
17883You may not use either @code{SYMTAB} or @code{FUNCTAB} as the second
17884argument to these functions.  Attempting to do so produces a fatal error.
17885You may use them as the first argument, but only if providing a second
17886array to use for the actual sorting.
17887@end quotation
17888
17889You are allowed to use the same array for both the @var{source} and @var{dest}
17890arguments, but doing so only makes sense if you're also supplying the third argument.
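
As a sketch of the two-argument form, the following program sorts the
values of the array @code{a} from the earlier example into a second array
(here given the arbitrary name @code{sorted}), leaving @code{a} itself
untouched:

@example
BEGIN @{
    a["last"] = "de"
    a["first"] = "sac"
    a["middle"] = "cul"

    n = asort(a, sorted)    # sorted values go into sorted[1] .. sorted[n]
    for (i = 1; i <= n; i++)
        print i, sorted[i]
    print a["first"]        # the original indices of a are still there
@}
@end example

@noindent
When run with @command{gawk}, this prints:

@example
1 cul
2 de
3 sac
sac
@end example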
17891
17892@item @code{gensub(@var{regexp}, @var{replacement}, @var{how}} [@code{, @var{target}}]@code{) #}
17893@cindexgawkfunc{gensub}
17894@cindex search and replace in strings
17895@cindex substitute in string
17896Search the target string @var{target} for matches of the regular
17897expression @var{regexp}.  If @var{how} is a string beginning with
17898@samp{g} or @samp{G} (short for ``global''), then replace all matches
17899of @var{regexp} with @var{replacement}.  Otherwise, treat @var{how}
17900as a number indicating which match of @var{regexp} to replace.  Treat
17901numeric values less than one as if they were one.  If no @var{target}
17902is supplied, use @code{$0}.  Return the modified string as the result
17903of the function. The original target string is @emph{not} changed.
17904
17905The returned value is @emph{always} a string, even if the original
17906@var{target} was a number or a regexp value.
17907
17908@code{gensub()} is a general substitution function.  Its purpose is
17909to provide more features than the standard @code{sub()} and @code{gsub()}
17910functions.
17911
17912@code{gensub()} provides an additional feature that is not available
17913in @code{sub()} or @code{gsub()}: the ability to specify components of a
17914regexp in the replacement text.  This is done by using parentheses in
17915the regexp to mark the components and then specifying @samp{\@var{N}}
17916in the replacement text, where @var{N} is a digit from 1 to 9.
17917For example:
17918
17919@example
17920$ @kbd{gawk '}
17921> @kbd{BEGIN @{}
17922>      @kbd{a = "abc def"}
17923>      @kbd{b = gensub(/(.+) (.+)/, "\\2 \\1", "g", a)}
17924>      @kbd{print b}
17925> @kbd{@}'}
17926@print{} def abc
17927@end example
17928
17929@noindent
17930As with @code{sub()}, you must type two backslashes in order
17931to get one into the string.
17932In the replacement text, the sequence @samp{\0} represents the entire
17933matched text, as does the character @samp{&}.
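
The following short example illustrates that the two forms are
equivalent; both calls wrap the matched text in angle brackets:

@example
$ @kbd{echo hello |}
> @kbd{gawk '@{ print gensub(/l+/, "<&>", 1), gensub(/l+/, "<\\0>", 1) @}'}
@print{} he<ll>o he<ll>o
@end example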
17934
17935The following example shows how you can use the third argument to control
17936which match of the regexp should be changed:
17937
17938@example
17939$ @kbd{echo a b c a b c |}
17940> @kbd{gawk '@{ print gensub(/a/, "AA", 2) @}'}
17941@print{} a b c AA b c
17942@end example
17943
17944In this case, @code{$0} is the default target string.
17945@code{gensub()} returns the new string as its result, which is
17946passed directly to @code{print} for printing.
17947
17948@c @cindex automatic warnings
17949@c @cindex warnings, automatic
17950If the @var{how} argument is a string that does not begin with @samp{g} or
17951@samp{G}, or if it is a number that is less than or equal to zero, only one
17952substitution is performed.  If @var{how} is zero, @command{gawk} issues
17953a warning message.
17954
17955If @var{regexp} does not match @var{target}, @code{gensub()}'s return value
17956is the original unchanged value of @var{target}.  Note that, as mentioned
17957above, the returned value is a string, even if @var{target} was not.
17958
17959@item @code{gsub(@var{regexp}, @var{replacement}} [@code{, @var{target}}]@code{)}
17960@cindexawkfunc{gsub}
17961Search @var{target} for
17962@emph{all} of the longest, leftmost, @emph{nonoverlapping} matching
17963substrings it can find and replace them with @var{replacement}.
17964The @samp{g} in @code{gsub()} stands for
17965``global,'' which means replace everywhere.  For example:
17966
17967@example
17968@{ gsub(/Britain/, "United Kingdom"); print @}
17969@end example
17970
17971@noindent
17972replaces all occurrences of the string @samp{Britain} with @samp{United
17973Kingdom} for all input records.
17974
17975The @code{gsub()} function returns the number of substitutions made.  If
17976the variable to search and alter (@var{target}) is
17977omitted, then the entire input record (@code{$0}) is used.
17978As in @code{sub()}, the characters @samp{&} and @samp{\} are special,
17979and the third argument must be assignable.
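
As a brief illustration of the return value, the following command saves
the number of substitutions in the (arbitrarily named) variable @code{n}
and prints it in front of the modified record:

@example
$ @kbd{echo "Britain and Great Britain" |}
> @kbd{awk '@{ n = gsub(/Britain/, "United Kingdom"); print n ": " $0 @}'}
@print{} 2: United Kingdom and Great United Kingdom
@end example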
17980
17981@item @code{index(@var{in}, @var{find})}
17982@cindexawkfunc{index}
17983@cindex search for substring
17984@cindex find substring in string
17985Search the string @var{in} for the first occurrence of the string
17986@var{find}, and return the position in characters where that occurrence
17987begins in the string @var{in}.  Consider the following example:
17988
17989@example
17990$ @kbd{awk 'BEGIN @{ print index("peanut", "an") @}'}
17991@print{} 3
17992@end example
17993
17994@noindent
17995If @var{find} is not found, @code{index()} returns zero.
17996
17997@cindex dark corner @subentry regexp as second argument to @code{index()}
17998With BWK @command{awk} and @command{gawk},
17999it is a fatal error to use a regexp constant for @var{find}.
18000Other implementations allow it, simply treating the regexp
18001constant as an expression meaning @samp{$0 ~ /regexp/}. @value{DARKCORNER}
18002
18003@item @code{length(}[@var{string}]@code{)}
18004@cindexawkfunc{length}
18005@cindex string @subentry length
18006@cindex length of string
18007Return the number of characters in @var{string}.  If
18008@var{string} is a number, the length of the digit string representing
18009that number is returned.  For example, @code{length("abcde")} is five.  By
18010contrast, @code{length(15 * 35)} works out to three. In this example,
18011@iftex
18012@math{15 @cdot 35 = 525},
18013@end iftex
18014@ifnottex
18015@ifnotdocbook
1801615 * 35 = 525,
18017@end ifnotdocbook
18018@end ifnottex
18019@docbook
1802015 &sdot; 35 = 525,
18021@end docbook
18022and 525 is then converted to the string @code{"525"}, which has
18023three characters.
18024
18025@cindex length of input record
18026@cindex input record, length of
18027If no argument is supplied, @code{length()} returns the length of @code{$0}.
18028
18029@c @cindex historical features
18030@cindex portability @subentry @code{length()} function
18031@cindex POSIX @command{awk} @subentry functions and @subentry @code{length()}
18032@quotation NOTE
18033In older versions of @command{awk}, the @code{length()} function could
18034be called
18035without any parentheses.  Doing so is considered poor practice,
18036although the 2008 POSIX standard explicitly allows it, to
18037support historical practice.  For programs to be maximally portable,
18038always supply the parentheses.
18039@end quotation
18040
18041@cindex dark corner @subentry @code{length()} function
18042If @code{length()} is called with a variable that has not been used,
18043@command{gawk} forces the variable to be a scalar.  Other
18044implementations of @command{awk} leave the variable without a type.
18045@value{DARKCORNER}
18046Consider:
18047
18048@example
18049$ @kbd{gawk 'BEGIN @{ print length(x) ; x[1] = 1 @}'}
18050@print{} 0
18051@error{} gawk: fatal: attempt to use scalar `x' as array
18052
18053$ @kbd{nawk 'BEGIN @{ print length(x) ; x[1] = 1 @}'}
18054@print{} 0
18055@end example
18056
18057@noindent
18058If @option{--lint} has
18059been specified on the command line, @command{gawk} issues a
18060warning about this.
18061
18062@cindex common extensions @subentry @code{length()} applied to an array
18063@cindex extensions @subentry common @subentry @code{length()} applied to an array
18064@cindex differences in @command{awk} and @command{gawk} @subentry @code{length()} function
18065@cindex number of array elements
18066@cindex arrays @subentry number of elements
18067With @command{gawk} and several other @command{awk} implementations, when given an
18068array argument, the @code{length()} function returns the number of elements
18069in the array. @value{COMMONEXT}
18070This is less useful than it might seem at first, as the
18071array is not guaranteed to be indexed from one to the number of elements
18072in it.
18073If @option{--lint} is provided on the command line
18074(@pxref{Options}),
18075@command{gawk} warns that passing an array argument is not portable.
18076If @option{--posix} is supplied, using an array argument is a fatal error
18077(@pxref{Arrays}).
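
The following example shows the element count for a small array whose
indices are not the integers from one to three, which is why the count
alone cannot be used to loop over the elements:

@example
$ @kbd{gawk 'BEGIN @{ a["x"] = 1; a["y"] = 2; a[10] = 3; print length(a) @}'}
@print{} 3
@end example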
18078
18079@item @code{match(@var{string}, @var{regexp}} [@code{, @var{array}}]@code{)}
18080@cindexawkfunc{match}
18081@cindex string @subentry regular expression match of
18082@cindex match regexp in string
18083Search @var{string} for the
18084longest, leftmost substring matched by the regular expression
18085@var{regexp} and return the character position (index)
18086at which that substring begins (one, if it starts at the beginning of
18087@var{string}).  If no match is found, return zero.
18088
18089The @var{regexp} argument may be either a regexp constant
18090(@code{/}@dots{}@code{/}) or a string constant (@code{"}@dots{}@code{"}).
18091In the latter case, the string is treated as a regexp to be matched.
18092@xref{Computed Regexps} for a
18093discussion of the difference between the two forms, and the
18094implications for writing your program correctly.
18095
18096The order of the first two arguments is the opposite of most other string
18097functions that work with regular expressions, such as
18098@code{sub()} and @code{gsub()}.  It might help to remember that
18099for @code{match()}, the order is the same as for the @samp{~} operator:
18100@samp{@var{string} ~ @var{regexp}}.
18101
18102@cindex @code{RSTART} variable @subentry @code{match()} function and
18103@cindex @code{RLENGTH} variable @subentry @code{match()} function and
18104@cindex @code{match()} function @subentry @code{RSTART}/@code{RLENGTH} variables
18105@cindex @code{match()} function @subentry side effects
18106@cindex side effects @subentry @code{match()} function
18107The @code{match()} function sets the predefined variable @code{RSTART} to
18108the index.  It also sets the predefined variable @code{RLENGTH} to the
18109length in characters of the matched substring.  If no match is found,
18110@code{RSTART} is set to zero, and @code{RLENGTH} to @minus{}1.
18111
18112For example:
18113
18114@example
18115@c file eg/misc/findpat.awk
18116@{
18117    if ($1 == "FIND")
18118        regex = $2
18119    else @{
18120        where = match($0, regex)
18121        if (where != 0)
18122            print "Match of", regex, "found at", where, "in", $0
    @}
18124@}
18125@c endfile
18126@end example
18127
18128@noindent
18129This program looks for lines that match the regular expression stored in
18130the variable @code{regex}.  This regular expression can be changed.  If the
18131first word on a line is @samp{FIND}, @code{regex} is changed to be the
18132second word on that line.  Therefore, if given:
18133
18134@example
18135@c file eg/misc/findpat.data
18136FIND ru+n
18137My program runs
18138but not very quickly
18139FIND Melvin
18140JF+KM
18141This line is property of Reality Engineering Co.
18142Melvin was here.
18143@c endfile
18144@end example
18145
18146@noindent
18147@command{awk} prints:
18148
18149@example
18150Match of ru+n found at 12 in My program runs
18151Match of Melvin found at 1 in Melvin was here.
18152@end example
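
As a further sketch, @code{RSTART} and @code{RLENGTH} can be passed
directly to @code{substr()} to extract the matched text.  The input
string here is simply one line of the sample data shown earlier:

@example
$ @kbd{echo "My program runs" |}
> @kbd{awk '@{ if (match($0, /ru+n/))}
>         @kbd{print RSTART, RLENGTH, substr($0, RSTART, RLENGTH) @}'}
@print{} 12 3 run
@end example

@noindent
When nothing matches, both variables are reset:

@example
$ @kbd{awk 'BEGIN @{ print match("abc", /z/), RSTART, RLENGTH @}'}
@print{} 0 0 -1
@end example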
18153
18154@cindex differences in @command{awk} and @command{gawk} @subentry @code{match()} function
18155If @var{array} is present, it is cleared, and then the zeroth element
18156of @var{array} is set to the entire portion of @var{string}
18157matched by @var{regexp}.  If @var{regexp} contains parentheses,
18158the integer-indexed elements of @var{array} are set to contain the
18159portion of @var{string} matching the corresponding parenthesized
18160subexpression.
18161For example:
18162
18163@example
18164$ @kbd{echo foooobazbarrrrr |}
18165> @kbd{gawk '@{ match($0, /(fo+).+(bar*)/, arr)}
18166>         @kbd{print arr[1], arr[2] @}'}
18167@print{} foooo barrrrr
18168@end example
18169
In addition,
multidimensional subscripts are available that provide
the start index and length of each matched subexpression:
18173
18174@example
18175$ @kbd{echo foooobazbarrrrr |}
18176> @kbd{gawk '@{ match($0, /(fo+).+(bar*)/, arr)}
18177>           @kbd{print arr[1], arr[2]}
18178>           @kbd{print arr[1, "start"], arr[1, "length"]}
18179>           @kbd{print arr[2, "start"], arr[2, "length"]}
18180> @kbd{@}'}
18181@print{} foooo barrrrr
18182@print{} 1 5
18183@print{} 9 7
18184@end example
18185
There may not be subscripts for the start and length of every parenthesized
18187subexpression, because they may not all have matched text; thus, they
18188should be tested for with the @code{in} operator
18189(@pxref{Reference to Elements}).
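
A guarded lookup might look like the following sketch, in which the
optional second group (an assumption made purely for illustration) may or
may not take part in the match:

@example
@{
    if (match($0, /(fo+)(bar)?/, arr)) @{
        if (2 in arr)
            print "group 2 starts at", arr[2, "start"]
        else
            print "group 2 did not take part in the match"
    @}
@}
@end example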
18190
18191@cindex troubleshooting @subentry @code{match()} function
18192The @var{array} argument to @code{match()} is a
18193@command{gawk} extension.  In compatibility mode
18194(@pxref{Options}),
18195using a third argument is a fatal error.
18196
18197@item @code{patsplit(@var{string}, @var{array}} [@code{, @var{fieldpat}} [@code{, @var{seps}} ] ]@code{) #}
18198@cindexgawkfunc{patsplit}
18199@cindex split string into array
18200Divide
18201@var{string} into pieces (or ``fields'') defined by @var{fieldpat}
18202and store the pieces in @var{array} and the separator strings in the
18203@var{seps} array.  The first piece is stored in
18204@code{@var{array}[1]}, the second piece in @code{@var{array}[2]}, and so
18205forth.  The third argument, @var{fieldpat}, is
18206a regexp describing the fields in @var{string} (just as @code{FPAT} is
18207a regexp describing the fields in input records).
18208It may be either a regexp constant or a string.
18209If @var{fieldpat} is omitted, the value of @code{FPAT} is used.
18210@code{patsplit()} returns the number of elements created.
18211@code{@var{seps}[@var{i}]} is
18212the possibly null separator string
18213after @code{@var{array}[@var{i}]}.
18214The possibly null leading separator will be in @code{@var{seps}[0]}.
18215So a non-null @var{string} with @var{n} fields will have @var{n+1} separators.
18216A null @var{string} has no fields or separators.
18217
18218The @code{patsplit()} function splits strings into pieces in a
18219manner similar to the way input lines are split into fields using @code{FPAT}
18220(@pxref{Splitting By Content}).
18221
18222Before splitting the string, @code{patsplit()} deletes any previously existing
18223elements in the arrays @var{array} and @var{seps}.
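
The following sketch (the sample string and the array names @code{parts}
and @code{seps} are chosen only for illustration) prints each field
followed by its separator in brackets:

@example
$ @kbd{gawk 'BEGIN @{}
>     @kbd{n = patsplit("aa:bb::cc", parts, /[a-z]+/, seps)}
>     @kbd{for (i = 1; i <= n; i++)}
>         @kbd{printf "%s[%s]", parts[i], seps[i]}
>     @kbd{print ""}
> @kbd{@}'}
@print{} aa[:]bb[::]cc[]
@end example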
18224
18225@item @code{split(@var{string}, @var{array}} [@code{, @var{fieldsep}} [@code{, @var{seps}} ] ]@code{)}
18226@cindexawkfunc{split}
18227Divide @var{string} into pieces separated by @var{fieldsep}
18228and store the pieces in @var{array} and the separator strings in the
18229@var{seps} array.  The first piece is stored in
18230@code{@var{array}[1]}, the second piece in @code{@var{array}[2]}, and so
18231forth.  The string value of the third argument, @var{fieldsep}, is
18232a regexp describing where to split @var{string} (much as @code{FS} can
18233be a regexp describing where to split input records).
18234If @var{fieldsep} is omitted, the value of @code{FS} is used.
18235@code{split()} returns the number of elements created.
18236@var{seps} is a @command{gawk} extension, with @code{@var{seps}[@var{i}]}
18237being the separator string
18238between @code{@var{array}[@var{i}]} and @code{@var{array}[@var{i}+1]}.
18239If @var{fieldsep} is a single
18240space, then any leading whitespace goes into @code{@var{seps}[0]} and
18241any trailing
18242whitespace goes into @code{@var{seps}[@var{n}]}, where @var{n} is the
18243return value of
18244@code{split()} (i.e., the number of elements in @var{array}).
18245
18246The @code{split()} function splits strings into pieces in the same way
18247that input lines are split into fields.  For example:
18248
18249@example
18250split("cul-de-sac", a, "-", seps)
18251@end example
18252
18253@noindent
18254@cindex strings @subentry splitting, example
18255splits the string @code{"cul-de-sac"} into three fields using @samp{-} as the
18256separator.  It sets the contents of the array @code{a} as follows:
18257
18258@example
18259a[1] = "cul"
18260a[2] = "de"
18261a[3] = "sac"
18262@end example
18263
@noindent
and sets the contents of the array @code{seps} as follows:
18265
18266@example
18267seps[1] = "-"
18268seps[2] = "-"
18269@end example
18270
18271@noindent
18272The value returned by this call to @code{split()} is three.
18273
18274@cindex differences in @command{awk} and @command{gawk} @subentry @code{split()} function
18275As with input field-splitting, when the value of @var{fieldsep} is
18276@w{@code{" "}}, leading and trailing whitespace is ignored in values assigned to
18277the elements of
18278@var{array} but not in @var{seps}, and the elements
18279are separated by runs of whitespace.
18280Also, as with input field splitting, if @var{fieldsep} is the null string, each
18281individual character in the string is split into its own array element.
18282@value{COMMONEXT}
18283Additionally, if @var{fieldsep} is a single-character string, that string acts
18284as the separator, even if its value is a regular expression metacharacter.
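
For instance, in this brief sketch a period used as @var{fieldsep}
splits on literal periods rather than matching any character, and a null
@var{fieldsep} yields one element per character (where that common
extension is supported):

@example
$ @kbd{gawk 'BEGIN @{ n = split("a.b.c", f, "."); print n, f[2] @}'}
@print{} 3 b
$ @kbd{gawk 'BEGIN @{ n = split("abc", f, ""); print n, f[1], f[3] @}'}
@print{} 3 a c
@end example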
18285
18286Note, however, that @code{RS} has no effect on the way @code{split()}
18287works. Even though @samp{RS = ""} causes the newline character to also be an input
18288field separator, this does not affect how @code{split()} splits strings.
18289
18290@cindex dark corner @subentry @code{split()} function
18291Modern implementations of @command{awk}, including @command{gawk}, allow
18292the third argument to be a regexp constant (@w{@code{/}@dots{}@code{/}})
18293as well as a string.  @value{DARKCORNER}
18294The POSIX standard allows this as well.
18295@xref{Computed Regexps} for a
18296discussion of the difference between using a string constant or a regexp constant,
18297and the implications for writing your program correctly.
18298
18299Before splitting the string, @code{split()} deletes any previously existing
18300elements in the arrays @var{array} and @var{seps}.
18301
If @var{string} is null, the array has no elements. (Thus, splitting a
null string is a portable way to delete all the elements of an array
with one statement.
@xref{Delete}.)
18305
18306If @var{string} does not match @var{fieldsep} at all (but is not null),
18307@var{array} has one element only. The value of that element is the original
18308@var{string}.
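
Both cases can be seen in this short sketch (the array name @code{a} is
arbitrary):

@example
$ @kbd{gawk 'BEGIN @{ a["x"] = 1; n = split("", a); print n, length(a)}
>              @kbd{n = split("abc", a, ","); print n, a[1] @}'}
@print{} 0 0
@print{} 1 abc
@end example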
18309
18310@cindex POSIX mode
18311In POSIX mode (@pxref{Options}), the fourth argument is not allowed.
18312
18313@item @code{sprintf(@var{format}, @var{expression1}, @dots{})}
18314@cindexawkfunc{sprintf}
18315@cindex formatting @subentry strings
18316Return (without printing) the string that @code{printf} would
18317have printed out with the same arguments
18318(@pxref{Printf}).
18319For example:
18320
18321@example
18322pival = sprintf("pi = %.2f (approx.)", 22/7)
18323@end example
18324
18325@noindent
18326assigns the string @w{@samp{pi = 3.14 (approx.)}} to the variable @code{pival}.
18327
18328@cindexgawkfunc{strtonum}
18329@cindex converting @subentry string to numbers
18330@item @code{strtonum(@var{str}) #}
18331Examine @var{str} and return its numeric value.  If @var{str}
18332begins with a leading @samp{0}, @code{strtonum()} assumes that @var{str}
18333is an octal number.  If @var{str} begins with a leading @samp{0x} or
18334@samp{0X}, @code{strtonum()} assumes that @var{str} is a hexadecimal number.
18335For example:
18336
18337@example
18338$ @kbd{echo 0x11 |}
18339> @kbd{gawk '@{ printf "%d\n", strtonum($1) @}'}
18340@print{} 17
18341@end example
18342
18343Using the @code{strtonum()} function is @emph{not} the same as adding zero
18344to a string value; the automatic coercion of strings to numbers
18345works only for decimal data, not for octal or hexadecimal.@footnote{Unless
18346you use the @option{--non-decimal-data} option, which isn't recommended.
18347@xref{Nondecimal Data} for more information.}
18348
18349Note also that @code{strtonum()} uses the current locale's decimal point
18350for recognizing numbers (@pxref{Locales}).
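
The difference from ordinary string-to-number conversion shows up in
this sketch, which assumes @command{gawk} is run in its default mode
(without @option{--posix} or @option{--non-decimal-data}):

@example
$ @kbd{gawk 'BEGIN @{ s = "0x11"; print strtonum(s), s + 0}
>              @kbd{t = "011";  print strtonum(t), t + 0 @}'}
@print{} 17 0
@print{} 9 11
@end example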
18351
18352@item @code{sub(@var{regexp}, @var{replacement}} [@code{, @var{target}}]@code{)}
18353@cindexawkfunc{sub}
18354@cindex replace in string
18355Search @var{target}, which is treated as a string, for the
18356leftmost, longest substring matched by the regular expression @var{regexp}.
18357Modify the entire string
18358by replacing the matched text with @var{replacement}.
18359The modified string becomes the new value of @var{target}.
18360Return the number of substitutions made (zero or one).
18361
18362The @var{regexp} argument may be either a regexp constant
18363(@code{/}@dots{}@code{/}) or a string constant (@code{"}@dots{}@code{"}).
18364In the latter case, the string is treated as a regexp to be matched.
18365@xref{Computed Regexps} for a
18366discussion of the difference between the two forms, and the
18367implications for writing your program correctly.
18368
18369This function is peculiar because @var{target} is not simply
18370used to compute a value, and not just any expression will do---it
18371must be a variable, field, or array element so that @code{sub()} can
18372store a modified value there.  If this argument is omitted, then the
18373default is to use and alter @code{$0}.@footnote{Note that this means
18374that the record will first be regenerated using the value of @code{OFS} if
18375any fields have been changed, and that the fields will be updated
18376after the substitution, even if the operation is a ``no-op'' such
18377as @samp{sub(/^/, "")}.}
18378For example:
18379
18380@example
18381str = "water, water, everywhere"
18382sub(/at/, "ith", str)
18383@end example
18384
18385@noindent
18386sets @code{str} to @w{@samp{wither, water, everywhere}}, by replacing the
18387leftmost longest occurrence of @samp{at} with @samp{ith}.
18388
18389If the special character @samp{&} appears in @var{replacement}, it
18390stands for the precise substring that was matched by @var{regexp}.  (If
18391the regexp can match more than one string, then this precise substring
18392may vary.)  For example:
18393
18394@example
18395@{ sub(/candidate/, "& and his wife"); print @}
18396@end example
18397
18398@noindent
18399changes the first occurrence of @samp{candidate} to @samp{candidate
18400and his wife} on each input line.
18401Here is another example:
18402
18403@example
18404$ @kbd{awk 'BEGIN @{}
18405>         @kbd{str = "daabaaa"}
18406>         @kbd{sub(/a+/, "C&C", str)}
18407>         @kbd{print str}
18408> @kbd{@}'}
18409@print{} dCaaCbaaa
18410@end example
18411
18412@noindent
18413This shows how @samp{&} can represent a nonconstant string and also
18414illustrates the ``leftmost, longest'' rule in regexp matching
18415(@pxref{Leftmost Longest}).
18416
18417The effect of this special character (@samp{&}) can be turned off by putting a
18418backslash before it in the string.  As usual, to insert one backslash in
18419the string, you must write two backslashes.  Therefore, write @samp{\\&}
18420in a string constant to include a literal @samp{&} in the replacement.
18421For example, the following shows how to replace the first @samp{|} on each line with
18422an @samp{&}:
18423
18424@example
18425@{ sub(/\|/, "\\&"); print @}
18426@end example
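
As a small sketch contrasting the two forms, an unescaped @samp{&}
inserts the matched text, whereas @samp{\\&} yields a literal ampersand:

@example
$ @kbd{echo "x 12 y 345" | awk '@{ gsub(/[0-9]+/, "<&>"); print @}'}
@print{} x <12> y <345>
$ @kbd{echo "a|b|c" | awk '@{ sub(/\|/, "\\&"); print @}'}
@print{} a&b|c
@end example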
18427
18428@cindex @code{sub()} function @subentry arguments of
18429@cindex @code{gsub()} function @subentry arguments of
18430@cindex side effects @subentry @code{sub()} function
18431@cindex side effects @subentry @code{gsub()} function
18432As mentioned, the third argument to @code{sub()} must
18433be a variable, field, or array element.
18434Some versions of @command{awk} allow the third argument to
18435be an expression that is not an lvalue.  In such a case, @code{sub()}
18436still searches for the pattern and returns zero or one, but the result of
18437the substitution (if any) is thrown away because there is no place
18438to put it.  Such versions of @command{awk} accept expressions
18439like the following:
18440
18441@example
18442sub(/USA/, "United States", "the USA and Canada")
18443@end example
18444
18445@noindent
18446@cindex troubleshooting @subentry @code{gsub()}/@code{sub()} functions
18447For historical compatibility, @command{gawk} accepts such erroneous code.
18448However, using any other nonchangeable
18449object as the third parameter causes a fatal error and your program
18450will not run.
18451
18452Finally, if the @var{regexp} is not a regexp constant, it is converted into a
18453string, and then the value of that string is treated as the regexp to match.
18454
18455@item @code{substr(@var{string}, @var{start}} [@code{, @var{length}} ]@code{)}
18456@cindexawkfunc{substr}
18457@cindex substring
18458Return a @var{length}-character-long substring of @var{string},
18459starting at character number @var{start}.  The first character of a
18460string is character number one.@footnote{This is different from
18461C and C++, in which the first character is number zero.}
18462For example, @code{substr("washington", 5, 3)} returns @code{"ing"}.
18463
18464If @var{length} is not present, @code{substr()} returns the whole suffix of
18465@var{string} that begins at character number @var{start}.  For example,
18466@code{substr("washington", 5)} returns @code{"ington"}.  The whole
18467suffix is also returned
18468if @var{length} is greater than the number of characters remaining
18469in the string, counting from character @var{start}.
18470
18471@cindex Brian Kernighan's @command{awk}
18472If @var{start} is less than one, @code{substr()} treats it as
18473if it was one. (POSIX doesn't specify what to do in this case:
18474BWK @command{awk} acts this way, and therefore @command{gawk}
18475does too.)
18476If @var{start} is greater than the number of characters
18477in the string, @code{substr()} returns the null string.
18478Similarly, if @var{length} is present but less than or equal to zero,
18479the null string is returned.
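
These boundary cases are illustrated by the following sketch; the
brackets are added only to make null results visible:

@example
$ @kbd{awk 'BEGIN @{ s = "washington"}
>             @kbd{print substr(s, 8, 10)}
>             @kbd{print "[" substr(s, 11) "]"}
>             @kbd{print "[" substr(s, 3, 0) "]" @}'}
@print{} ton
@print{} []
@print{} []
@end example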
18480
18481@cindex troubleshooting @subentry @code{substr()} function
18482The string returned by @code{substr()} @emph{cannot} be
18483assigned.  Thus, it is a mistake to attempt to change a portion of
18484a string, as shown in the following example:
18485
18486@example
18487string = "abcdef"
18488# try to get "abCDEf", won't work
18489substr(string, 3, 3) = "CDE"
18490@end example
18491
18492@noindent
18493It is also a mistake to use @code{substr()} as the third argument
18494of @code{sub()} or @code{gsub()}:
18495
18496@example
18497gsub(/xyz/, "pdq", substr($0, 5, 20))  # WRONG
18498@end example
18499
18500@cindex portability @subentry @code{substr()} function
18501(Some commercial versions of @command{awk} treat
18502@code{substr()} as assignable, but doing so is not portable.)
18503
18504If you need to replace bits and pieces of a string, combine @code{substr()}
18505with string concatenation, in the following manner:
18506
18507@example
18508string = "abcdef"
18509@dots{}
18510string = substr(string, 1, 2) "CDE" substr(string, 6)
18511@end example
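
The same idea can be wrapped in a small helper.  The function name
@code{replace_at()} and its parameters are hypothetical, not part of
@command{awk}:

@example
# replace_at --- return str with len characters, beginning at
# position start, replaced by new
function replace_at(str, start, len, new)
@{
    return substr(str, 1, start - 1) new substr(str, start + len)
@}

BEGIN @{ print replace_at("abcdef", 3, 3, "CDE") @}   # prints abCDEf
@end example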
18512
18513@cindex case sensitivity @subentry converting case
18514@cindex strings @subentry converting letter case
18515@item @code{tolower(@var{string})}
18516@cindexawkfunc{tolower}
18517@cindex converting @subentry string to lower case
18518Return a copy of @var{string}, with each uppercase character
18519in the string replaced with its corresponding lowercase character.
18520Nonalphabetic characters are left unchanged.  For example,
18521@code{tolower("MiXeD cAsE 123")} returns @code{"mixed case 123"}.
18522
18523@item @code{toupper(@var{string})}
18524@cindexawkfunc{toupper}
18525@cindex converting @subentry string to upper case
18526Return a copy of @var{string}, with each lowercase character
18527in the string replaced with its corresponding uppercase character.
18528Nonalphabetic characters are left unchanged.  For example,
18529@code{toupper("MiXeD cAsE 123")} returns @code{"MIXED CASE 123"}.
18530@end table
18531
18532At first glance, the @code{split()} and @code{patsplit()} functions appear to be
18533mirror images of each other. But there are differences:
18534
18535@itemize @bullet
18536@item @code{split()} treats its third argument like @code{FS}, with all the
18537special rules involved for @code{FS}.
18538
18539@item Matching of null strings differs. This is discussed in @ref{FS versus FPAT}.
18540@end itemize
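
As a rough sketch of the difference in viewpoint, @code{split()}
describes the separators while @code{patsplit()} describes the fields, so
an empty field between two commas is counted by the former but not by a
field pattern that requires at least one character:

@example
$ @kbd{gawk 'BEGIN @{ n = split("a,b,,c", f, ",")}
>              @kbd{m = patsplit("a,b,,c", g, /[^,]+/)}
>              @kbd{print n, m @}'}
@print{} 4 3
@end example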
18541
18542@sidebar Matching the Null String
18543@cindex matching @subentry null strings
18544@cindex null strings @subentry matching
18545@cindex @code{*} (asterisk) @subentry @code{*} operator @subentry null strings, matching
18546@cindex asterisk (@code{*}) @subentry @code{*} operator @subentry null strings, matching
18547
18548In @command{awk}, the @samp{*} operator can match the null string.
18549This is particularly important for the @code{sub()}, @code{gsub()},
18550and @code{gensub()} functions.  For example:
18551
18552@example
18553$ @kbd{echo abc | awk '@{ gsub(/m*/, "X"); print @}'}
18554@print{} XaXbXcX
18555@end example
18556
18557@noindent
18558Although this makes a certain amount of sense, it can be surprising.
18559@end sidebar
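
If matching the null string is not what you want, one option is to
require at least one occurrence with @samp{+}.  In this sketch the same
input is then left unchanged:

@example
$ @kbd{echo abc | awk '@{ gsub(/m+/, "X"); print @}'}
@print{} abc
@end example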
18560
18561
18562@node Gory Details
18563@subsubsection More about @samp{\} and @samp{&} with @code{sub()}, @code{gsub()}, and @code{gensub()}
18564
18565@cindex escape processing @subentry @code{gsub()}/@code{gensub()}/@code{sub()} functions
18566@cindex @code{sub()} function @subentry escape processing
18567@cindex @code{gsub()} function @subentry escape processing
18568@cindex @code{gensub()} function (@command{gawk}) @subentry escape processing
18569@cindex @code{\} (backslash) @subentry @code{gsub()}/@code{gensub()}/@code{sub()} functions and
18570@cindex backslash (@code{\}) @subentry @code{gsub()}/@code{gensub()}/@code{sub()} functions and
18571@cindex @code{&} (ampersand) @subentry @code{gsub()}/@code{gensub()}/@code{sub()} functions and
18572@cindex ampersand (@code{&}) @subentry @code{gsub()}/@code{gensub()}/@code{sub()} functions and
18573
18574@quotation CAUTION
18575This subsubsection has been reported to cause headaches.
18576You might want to skip it upon first reading.
18577@end quotation
18578
18579When using @code{sub()}, @code{gsub()}, or @code{gensub()}, and trying to get literal
18580backslashes and ampersands into the replacement text, you need to remember
18581that there are several levels of @dfn{escape processing} going on.
18582
18583First, there is the @dfn{lexical} level, which is when @command{awk} reads
18584your program
18585and builds an internal copy of it to execute.
18586Then there is the runtime level, which is when @command{awk} actually scans the
18587replacement string to determine what to generate.
18588
18589@cindex Brian Kernighan's @command{awk}
18590At both levels, @command{awk} looks for a defined set of characters that
18591can come after a backslash.  At the lexical level, it looks for the
18592escape sequences listed in @ref{Escape Sequences}.
18593Thus, for every @samp{\} that @command{awk} processes at the runtime
18594level, you must type two backslashes at the lexical level.
18595When a character that is not valid for an escape sequence follows the
18596@samp{\}, BWK @command{awk} and @command{gawk} both simply remove the initial
18597@samp{\} and put the next character into the string. Thus, for
18598example, @code{"a\qb"} is treated as @code{"aqb"}.
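
As a small check of the lexical level, the two backslashes typed in the
string constant below become a single backslash in the stored string,
which is therefore three characters long:

@example
$ @kbd{awk 'BEGIN @{ s = "a\\b"; print s, length(s) @}'}
@print{} a\b 3
@end example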
18599
18600At the runtime level, the various functions handle sequences of
18601@samp{\} and @samp{&} differently.  The situation is (sadly) somewhat complex.
18602Historically, the @code{sub()} and @code{gsub()} functions treated the
18603two-character sequence @samp{\&} specially; this sequence was replaced in
18604the generated text with a single @samp{&}.  Any other @samp{\} within
18605the @var{replacement} string that did not precede an @samp{&} was passed
18606through unchanged.  This is illustrated in @ref{table-sub-escapes}.
18607
@c Thanks to Karl Berry for help with the TeX stuff.
18609@float Table,table-sub-escapes
18610@caption{Historical escape sequence processing for @code{sub()} and @code{gsub()}}
18611@tex
18612\vbox{\bigskip
18613% We need more characters for escape and tab ...
18614\catcode`_ = 0
18615\catcode`! = 4
18616% ... since this table has lots of &'s and \'s, so we unspecialize them.
18617\catcode`\& = \other \catcode`\\ = \other
18618_halign{_hfil#!_qquad_hfil#!_qquad#_hfil_cr
18619      You type!@code{sub()} sees!@code{sub()} generates_cr
18620_hrulefill!_hrulefill!_hrulefill_cr
18621     @code{\&}!       @code{&}!The matched text_cr
18622    @code{\\&}!      @code{\&}!A literal @samp{&}_cr
18623   @code{\\\&}!      @code{\&}!A literal @samp{&}_cr
18624  @code{\\\\&}!     @code{\\&}!A literal @samp{\&}_cr
18625 @code{\\\\\&}!     @code{\\&}!A literal @samp{\&}_cr
18626@code{\\\\\\&}!    @code{\\\&}!A literal @samp{\\&}_cr
18627    @code{\\q}!      @code{\q}!A literal @samp{\q}_cr
18628}
18629_bigskip}
18630@end tex
18631@ifdocbook
18632@multitable @columnfractions .20 .20 .60
18633@headitem You type @tab @code{sub()} sees @tab @code{sub()} generates
18634@item @code{\&}      @tab @code{&}    @tab The matched text
18635@item @code{\\&}     @tab @code{\&}   @tab A literal @samp{&}
18636@item @code{\\\&}    @tab @code{\&}   @tab A literal @samp{&}
18637@item @code{\\\\&}   @tab @code{\\&}  @tab A literal @samp{\&}
18638@item @code{\\\\\&}  @tab @code{\\&}  @tab A literal @samp{\&}
18639@item @code{\\\\\\&} @tab @code{\\\&} @tab A literal @samp{\\&}
18640@item @code{\\q}     @tab @code{\q}   @tab A literal @samp{\q}
18641@end multitable
18642@end ifdocbook
18643@ifnottex
18644@ifnotdocbook
18645@display
18646 You type         @code{sub()} sees          @code{sub()} generates
18647 --------         ----------          ---------------
18648     @code{\&}              @code{&}            The matched text
18649    @code{\\&}             @code{\&}            A literal @samp{&}
18650   @code{\\\&}             @code{\&}            A literal @samp{&}
18651  @code{\\\\&}            @code{\\&}            A literal @samp{\&}
18652 @code{\\\\\&}            @code{\\&}            A literal @samp{\&}
18653@code{\\\\\\&}           @code{\\\&}            A literal @samp{\\&}
18654    @code{\\q}             @code{\q}            A literal @samp{\q}
18655@end display
18656@end ifnotdocbook
18657@end ifnottex
18658@end float
18659
18660@noindent
18661This table shows the lexical-level processing, where
18662an odd number of backslashes becomes an even number at the runtime level,
18663as well as the runtime processing done by @code{sub()}.
(For the sake of simplicity, the remaining tables show only the
case of even numbers of backslashes entered at the lexical level.)
18666
18667The problem with the historical approach is that there is no way to get
18668a literal @samp{\} followed by the matched text.
18669
Several editions of the POSIX standard attempted to fix this problem
but weren't successful. The details are no longer relevant.
18672
18673At one point, the @command{gawk} maintainer submitted
18674proposed text for a revised standard that
18675reverts to rules that correspond more closely to the original existing
18676practice. The proposed rules have special cases that make it possible
18677to produce a @samp{\} preceding the matched text.
18678This is shown in
18679@ref{table-sub-proposed}.
18680
18681@float Table,table-sub-proposed
18682@caption{@command{gawk} rules for @code{sub()} and backslash}
18683@tex
18684\vbox{\bigskip
18685% We need more characters for escape and tab ...
18686\catcode`_ = 0
18687\catcode`! = 4
18688% ... since this table has lots of &'s and \'s, so we unspecialize them.
18689\catcode`\& = \other \catcode`\\ = \other
18690_halign{_hfil#!_qquad_hfil#!_qquad#_hfil_cr
18691    You type!@code{sub()} sees!@code{sub()} generates_cr
18692_hrulefill!_hrulefill!_hrulefill_cr
18693@code{\\\\\\&}!     @code{\\\&}!A literal @samp{\&}_cr
18694@code{\\\\&}!     @code{\\&}!A literal @samp{\}, followed by the matched text_cr
18695  @code{\\&}!      @code{\&}!A literal @samp{&}_cr
18696  @code{\\q}!      @code{\q}!A literal @samp{\q}_cr
18697 @code{\\\\}!      @code{\\}!@code{\\}_cr
18698}
18699_bigskip}
18700@end tex
18701@ifdocbook
18702@multitable @columnfractions .20 .20 .60
18703@headitem You type @tab @code{sub()} sees @tab @code{sub()} generates
18704@item @code{\\\\\\&} @tab @code{\\\&} @tab A literal @samp{\&}
18705@item @code{\\\\&}   @tab @code{\\&}  @tab A literal @samp{\}, followed by the matched text
18706@item @code{\\&}     @tab @code{\&}   @tab A literal @samp{&}
18707@item @code{\\q}     @tab @code{\q}   @tab A literal @samp{\q}
18708@item @code{\\\\}    @tab @code{\\}   @tab @code{\\}
18709@end multitable
18710@end ifdocbook
18711@ifnottex
18712@ifnotdocbook
18713@display
18714 You type         @code{sub()} sees         @code{sub()} generates
18715 --------         ----------         ---------------
18716@code{\\\\\\&}           @code{\\\&}            A literal @samp{\&}
18717  @code{\\\\&}            @code{\\&}            A literal @samp{\}, followed by the matched text
18718    @code{\\&}             @code{\&}            A literal @samp{&}
18719    @code{\\q}             @code{\q}            A literal @samp{\q}
18720   @code{\\\\}             @code{\\}            @code{\\}
18721@end display
18722@end ifnotdocbook
18723@end ifnottex
18724@end float
18725
18726In a nutshell, at the runtime level, there are now three special sequences
18727of characters (@samp{\\\&}, @samp{\\&}, and @samp{\&}) whereas historically
18728there was only one.  However, as in the historical case, any @samp{\} that
18729is not part of one of these three sequences is not special and appears
18730in the output literally.
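
The first three rows of @ref{table-sub-proposed} can be tried from the
shell as follows.  This is a sketch that assumes the program is enclosed
in single quotes, so that the shell passes the backslashes through
unchanged:

@example
$ @kbd{echo abc | gawk '@{ sub(/b/, "\\&"); print @}'}
@print{} a&c
$ @kbd{echo abc | gawk '@{ sub(/b/, "\\\\&"); print @}'}
@print{} a\bc
$ @kbd{echo abc | gawk '@{ sub(/b/, "\\\\\\&"); print @}'}
@print{} a\&c
@end example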
18731
18732@command{gawk} 3.0 and 3.1 follow these rules for @code{sub()} and
18733@code{gsub()}.  The POSIX standard took much longer to be revised than
18734was expected.  In addition, the @command{gawk} maintainer's proposal was
18735lost during the standardization process.  The final rules are
18736somewhat simpler.  The results are similar except for one case.
18737
18738@cindex POSIX @command{awk} @subentry functions and @subentry @code{gsub()}/@code{sub()}
18739The POSIX rules state that @samp{\&} in the replacement string produces
18740a literal @samp{&}, @samp{\\} produces a literal @samp{\}, and @samp{\} followed
18741by anything else is not special; the @samp{\} is placed straight into the output.
18742These rules are presented in @ref{table-posix-sub}.
18743
18744@float Table,table-posix-sub
18745@caption{POSIX rules for @code{sub()} and @code{gsub()}}
18746@tex
18747\vbox{\bigskip
18748% We need more characters for escape and tab ...
18749\catcode`_ = 0
18750\catcode`! = 4
18751% ... since this table has lots of &'s and \'s, so we unspecialize them.
18752\catcode`\& = \other \catcode`\\ = \other
18753_halign{_hfil#!_qquad_hfil#!_qquad#_hfil_cr
18754    You type!@code{sub()} sees!@code{sub()} generates_cr
18755_hrulefill!_hrulefill!_hrulefill_cr
18756@code{\\\\\\&}!     @code{\\\&}!A literal @samp{\&}_cr
18757@code{\\\\&}!     @code{\\&}!A literal @samp{\}, followed by the matched text_cr
18758  @code{\\&}!      @code{\&}!A literal @samp{&}_cr
18759  @code{\\q}!      @code{\q}!A literal @samp{\q}_cr
18760 @code{\\\\}!      @code{\\}!@code{\}_cr
18761}
18762_bigskip}
18763@end tex
18764@ifdocbook
18765@multitable @columnfractions .20 .20 .60
18766@headitem You type @tab @code{sub()} sees @tab @code{sub()} generates
18767@item @code{\\\\\\&} @tab @code{\\\&} @tab A literal @samp{\&}
18768@item @code{\\\\&}   @tab @code{\\&}  @tab A literal @samp{\}, followed by the matched text
18769@item @code{\\&}     @tab @code{\&}   @tab A literal @samp{&}
18770@item @code{\\q}     @tab @code{\q}   @tab A literal @samp{\q}
18771@item @code{\\\\}    @tab @code{\\}   @tab @code{\}
18772@end multitable
18773@end ifdocbook
18774@ifnottex
18775@ifnotdocbook
18776@display
18777 You type         @code{sub()} sees         @code{sub()} generates
18778 --------         ----------         ---------------
18779@code{\\\\\\&}           @code{\\\&}            A literal @samp{\&}
18780  @code{\\\\&}            @code{\\&}            A literal @samp{\}, followed by the matched text
18781    @code{\\&}             @code{\&}            A literal @samp{&}
18782    @code{\\q}             @code{\q}            A literal @samp{\q}
18783   @code{\\\\}             @code{\\}            @code{\}
18784@end display
18785@end ifnotdocbook
18786@end ifnottex
18787@end float
18788
18789The only case where the difference is noticeable is the last one: @samp{\\\\}
18790is seen as @samp{\\} and produces @samp{\} instead of @samp{\\}.
18791
18792Starting with @value{PVERSION} 3.1.4, @command{gawk} followed the POSIX rules
18793when @option{--posix} was specified (@pxref{Options}). Otherwise,
18794it continued to follow the proposed rules, as
18795that had been its behavior for many years.
18796
18797When @value{PVERSION} 4.0.0 was released, the @command{gawk} maintainer
18798made the POSIX rules the default, breaking well over a decade's worth
18799of backward compatibility.@footnote{This was rather naive of him, despite
18800there being a note in this @value{SECTION} indicating that the next major version
18801would move to the POSIX rules.} Needless to say, this was a bad idea,
18802and as of @value{PVERSION} 4.0.1, @command{gawk} resumed its historical
18803behavior, and only follows the POSIX rules when @option{--posix} is given.
18804
18805The rules for @code{gensub()} are considerably simpler. At the runtime
18806level, whenever @command{gawk} sees a @samp{\}, if the following character
18807is a digit, then the text that matched the corresponding parenthesized
18808subexpression is placed in the generated output.  Otherwise,
18809no matter what character follows the @samp{\}, it
18810appears in the generated text and the @samp{\} does not,
18811as shown in @ref{table-gensub-escapes}.
18812
18813@float Table,table-gensub-escapes
18814@caption{Escape sequence processing for @code{gensub()}}
18815@tex
18816\vbox{\bigskip
18817% We need more characters for escape and tab ...
18818\catcode`_ = 0
18819\catcode`! = 4
18820% ... since this table has lots of &'s and \'s, so we unspecialize them.
18821\catcode`\& = \other \catcode`\\ = \other
18822_halign{_hfil#!_qquad_hfil#!_qquad#_hfil_cr
18823    You type!@code{gensub()} sees!@code{gensub()} generates_cr
18824_hrulefill!_hrulefill!_hrulefill_cr
18825      @code{&}!           @code{&}!The matched text_cr
18826    @code{\\&}!          @code{\&}!A literal @samp{&}_cr
18827   @code{\\\\}!          @code{\\}!A literal @samp{\}_cr
18828  @code{\\\\&}!         @code{\\&}!A literal @samp{\}, then the matched text_cr
18829@code{\\\\\\&}!        @code{\\\&}!A literal @samp{\&}_cr
18830    @code{\\q}!          @code{\q}!A literal @samp{q}_cr
18831}
18832_bigskip}
18833@end tex
18834@ifdocbook
18835@multitable @columnfractions .20 .20 .60
18836@headitem You type @tab @code{gensub()} sees @tab @code{gensub()} generates
18837@item @code{&}       @tab @code{&}    @tab The matched text
18838@item @code{\\&}     @tab @code{\&}   @tab A literal @samp{&}
18839@item @code{\\\\}    @tab @code{\\}   @tab A literal @samp{\}
18840@item @code{\\\\&}   @tab @code{\\&}  @tab A literal @samp{\}, then the matched text
18841@item @code{\\\\\\&} @tab @code{\\\&} @tab A literal @samp{\&}
18842@item @code{\\q}     @tab  @code{\q}  @tab A literal @samp{q}
18843@end multitable
18844@end ifdocbook
18845@ifnottex
18846@ifnotdocbook
18847@display
18848  You type          @code{gensub()} sees         @code{gensub()} generates
18849  --------          -------------         ------------------
18850      @code{&}                    @code{&}            The matched text
18851    @code{\\&}                   @code{\&}            A literal @samp{&}
18852   @code{\\\\}                   @code{\\}            A literal @samp{\}
18853  @code{\\\\&}                  @code{\\&}            A literal @samp{\}, then the matched text
18854@code{\\\\\\&}                 @code{\\\&}            A literal @samp{\&}
18855    @code{\\q}                   @code{\q}            A literal @samp{q}
18856@end display
18857@end ifnotdocbook
18858@end ifnottex
18859@end float
18860
18861Because of the complexity of the lexical- and runtime-level processing
18862and the special cases for @code{sub()} and @code{gsub()},
18863we recommend the use of @command{gawk} and @code{gensub()} when you have
18864to do substitutions.
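
As a brief illustration of the rules in @ref{table-gensub-escapes},
the following sketch shows what you type in a string constant and what
@code{gensub()} produces.  The sample string is arbitrary and was chosen
only for this illustration:

@example
BEGIN @{
    s = "hello world"
    # "\\2 \\1" reaches gensub() as \2 \1: swap the two subexpressions
    print gensub(/(hello) (world)/, "\\2 \\1", 1, s)
    # "[&]" wraps each piece of matched text in brackets
    print gensub(/o/, "[&]", "g", s)
    # "\\&" reaches gensub() as \&: a literal ampersand
    print gensub(/o/, "\\&", "g", s)
@}
@end example

@noindent
When run with @command{gawk}, this should print:

@example
world hello
hell[o] w[o]rld
hell& w&rld
@end example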
18865
18866@node I/O Functions
18867@subsection Input/Output Functions
18868@cindex input/output @subentry functions
18869
18870The following functions relate to input/output (I/O).
18871Optional parameters are enclosed in square brackets ([ ]):
18872
18873@table @asis
18874@item @code{close(}@var{filename} [@code{,} @var{how}]@code{)}
18875@cindexawkfunc{close}
18876@cindex files @subentry closing
18877@cindex close file or coprocess
18878Close the file @var{filename} for input or output. Alternatively, the
18879argument may be a shell command that was used for creating a coprocess, or
18880for redirecting to or from a pipe; then the coprocess or pipe is closed.
@xref{Close Files And Pipes},
for more information.
18883
18884When closing a coprocess, it is occasionally useful to first close
18885one end of the two-way pipe and then to close the other.  This is done
18886by providing a second argument to @code{close()}.  This second argument
18887(@var{how})
18888should be one of the two string values @code{"to"} or @code{"from"},
18889indicating which end of the pipe to close.  Case in the string does
18890not matter.
18891@xref{Two-way I/O},
18892which discusses this feature in more detail and gives an example.
18893
18894Note that the second argument to @code{close()} is a @command{gawk}
18895extension; it is not available in compatibility mode (@pxref{Options}).
18896
18897@item @code{fflush(}[@var{filename}]@code{)}
18898@cindexawkfunc{fflush}
18899@cindex flush buffered output
18900Flush any buffered output associated with @var{filename}, which is either a
18901file opened for writing or a shell command for redirecting output to
18902a pipe or coprocess.
18903
18904@cindex buffers @subentry flushing
18905@cindex output @subentry buffering
Many utility programs @dfn{buffer} their output (i.e., they hold in memory
the information destined for a disk file or the screen until there is enough
for it to be worthwhile to send the data to the output device).
18909This is often more efficient than writing
18910every little bit of information as soon as it is ready.  However, sometimes
18911it is necessary to force a program to @dfn{flush} its buffers (i.e.,
18912write the information to its destination, even if a buffer is not full).
18913This is the purpose of the @code{fflush()} function---@command{gawk} also
18914buffers its output, and the @code{fflush()} function forces
18915@command{gawk} to flush its buffers.
18916
18917@cindex extensions @subentry common @subentry @code{fflush()} function
18918@cindex Brian Kernighan's @command{awk}
18919Brian Kernighan added @code{fflush()} to his @command{awk} in April
189201992.  For two decades, it was a common extension.  In December
189212012, it was accepted for inclusion into the POSIX standard.
18922See @uref{http://austingroupbugs.net/view.php?id=634, the Austin Group website}.
18923
18924POSIX standardizes @code{fflush()} as follows: if there
18925is no argument, or if the argument is the null string (@w{@code{""}}),
18926then @command{awk} flushes the buffers for @emph{all} open output files
18927and pipes.
18928
18929@quotation NOTE
18930Prior to @value{PVERSION} 4.0.2, @command{gawk}
18931would flush only the standard output if there was no argument,
18932and flush all output files and pipes if the argument was the null
18933string. This was changed in order to be compatible with BWK
18934@command{awk}, in the hope that standardizing this
18935feature in POSIX would then be easier (which indeed proved to be the case).
18936
18937With @command{gawk},
18938you can use @samp{fflush("/dev/stdout")} if you wish to flush
18939only the standard output.
18940@end quotation
18941
18942@c @cindex automatic warnings
18943@c @cindex warnings, automatic
18944@cindex troubleshooting @subentry @code{fflush()} function
18945@code{fflush()} returns zero if the buffer is successfully flushed;
18946otherwise, it returns a nonzero value. (@command{gawk} returns @minus{}1.)
18947In the case where all buffers are flushed, the return value is zero
18948only if all buffers were flushed successfully.  Otherwise, it is
@minus{}1, and @command{gawk} warns about the @var{filename} that caused the problem.
18950
18951@command{gawk} also issues a warning message if you attempt to flush
18952a file or pipe that was opened for reading (such as with @code{getline}),
18953or if @var{filename} is not an open file, pipe, or coprocess.
18954In such a case, @code{fflush()} returns @minus{}1, as well.
18955
18956@c end the table to let the sidebar take up the full width of the page.
18957@end table
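
To make the two-argument form of @code{close()} described earlier more
concrete, here is a minimal sketch.  It assumes that a @command{sort}
utility is available and uses @command{gawk}'s @samp{|&} coprocess
operator.  Closing only the @code{"to"} end lets @command{sort} see
end-of-file while its output can still be read:

@example
BEGIN @{
    cmd = "sort"
    print "banana" |& cmd
    print "apple"  |& cmd
    close(cmd, "to")        # done writing; sort can now produce output
    while ((cmd |& getline line) > 0)
        print line
    close(cmd)              # close the remaining (read) end
@}
@end example

@noindent
Without the first @code{close()}, the loop would likely wait forever,
because @command{sort} cannot produce output until it has seen all of
its input.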
18958
18959@sidebar Interactive Versus Noninteractive Buffering
18960@cindex buffering @subentry interactive vs.@: noninteractive
18961
18962As a side point, buffering issues can be even more confusing if
18963your program is @dfn{interactive} (i.e., communicating
18964with a user sitting at a keyboard).@footnote{A program is interactive
18965if the standard output is connected to a terminal device. On modern
18966systems, this means your keyboard and screen.}
18967
18968@c Thanks to Walter.Mecky@dresdnerbank.de for this example, and for
18969@c motivating me to write this section.
18970Interactive programs generally @dfn{line buffer} their output (i.e., they
18971write out every line).  Noninteractive programs wait until they have
18972a full buffer, which may be many lines of output.
18973Here is an example of the difference:
18974
18975@example
18976$ @kbd{awk '@{ print $1 + $2 @}'}
18977@kbd{1 1}
18978@print{} 2
18979@kbd{2 3}
18980@print{} 5
18981@kbd{Ctrl-d}
18982@end example
18983
18984@noindent
18985Each line of output is printed immediately. Compare that behavior
18986with this example:
18987
18988@example
18989$ @kbd{awk '@{ print $1 + $2 @}' | cat}
18990@kbd{1 1}
18991@kbd{2 3}
18992@kbd{Ctrl-d}
18993@print{} 2
18994@print{} 5
18995@end example
18996
18997@noindent
18998Here, no output is printed until after the @kbd{Ctrl-d} is typed, because
18999it is all buffered and sent down the pipe to @command{cat} in one shot.
19000@end sidebar
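
Calling @code{fflush()} after each @code{print} is one way to get
line-at-a-time behavior even when the output goes to a pipe.  The
following variation on the sidebar's second example is only a sketch;
the exact interleaving you see depends on how your terminal echoes
the input:

@example
$ @kbd{awk '@{ print $1 + $2; fflush() @}' | cat}
@kbd{1 1}
@print{} 2
@kbd{2 3}
@print{} 5
@kbd{Ctrl-d}
@end example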
19001
19002@table @asis
19003@item @code{system(@var{command})}
19004@cindexawkfunc{system}
19005@cindex invoke shell command
19006@cindex interacting with other programs
19007Execute the operating system
19008command @var{command} and then return to the @command{awk} program.
19009Return @var{command}'s exit status (see further on).
19010
19011For example, if the following fragment of code is put in your @command{awk}
19012program:
19013
19014@example
19015END @{
19016     system("date | mail -s 'awk run done' root")
19017@}
19018@end example
19019
19020@noindent
19021the system administrator is sent mail when the @command{awk} program
19022finishes processing input and begins its end-of-input processing.
19023
19024Note that redirecting @code{print} or @code{printf} into a pipe is often
19025enough to accomplish your task.  If you need to run many commands, it
19026is more efficient to simply print them down a pipeline to the shell:
19027
19028@example
19029while (@var{more stuff to do})
19030    print @var{command} | "/bin/sh"
19031close("/bin/sh")
19032@end example
19033
19034@noindent
19035@cindex troubleshooting @subentry @code{system()} function
19036@cindex @option{--sandbox} option @subentry disabling @code{system()} function
19037However, if your @command{awk}
19038program is interactive, @code{system()} is useful for running large
19039self-contained programs, such as a shell or an editor.
19040Some operating systems cannot implement the @code{system()} function.
19041@code{system()} causes a fatal error if it is not supported.
19042
19043@quotation NOTE
19044When @option{--sandbox} is specified, the @code{system()} function is disabled
19045(@pxref{Options}).
19046@end quotation
19047
On POSIX systems, a command's exit status is a 16-bit number. The exit
value passed to the C @code{exit()} function is held in the high-order
eight bits. The low-order bits indicate whether the process was killed by a
signal: bits 0--6 hold the number of the guilty signal, and bit 7 is set
if the process also dumped core.
19052
19053Traditionally, @command{awk}'s @code{system()} function has simply
returned the exit status value divided by 256. In the normal case, this
gives the exit status, but in the case of death-by-signal it yields
19056a fractional floating-point value.@footnote{In private correspondence,
19057Dr.@: Kernighan has indicated to me that the way this was done
19058was probably a mistake.} POSIX states that @command{awk}'s
19059@code{system()} should return the full 16-bit value.
19060
@command{gawk} steers a middle course.
19062The return values are summarized in @ref{table-system-return-values}.
19063
19064@float Table,table-system-return-values
19065@caption{Return values from @code{system()}}
19066@multitable @columnfractions .40 .60
19067@headitem Situation @tab Return value from @code{system()}
19068@item @option{--traditional} @tab C @code{system()}'s value divided by 256
19069@item @option{--posix} @tab C @code{system()}'s value
19070@item Normal exit of command @tab Command's exit status
19071@item Death by signal of command @tab 256 + number of murderous signal
19072@item Death by signal of command with core dump @tab 512 + number of murderous signal
19073@item Some kind of error @tab @minus{}1
19074@end multitable
19075@end float
19076@end table
19077
As of August, 2018, BWK @command{awk} follows @command{gawk}'s behavior
19079for the return value of @code{system()}.
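
The following function is a sketch of how a program might interpret
these values when @command{gawk} is running in its default mode (with
neither @option{--traditional} nor @option{--posix}).  The name
@code{describe_status()} is made up for this illustration:

@example
# describe_status --- turn a system() return value into text
function describe_status(ret)
@{
    if (ret < 0)
        return "could not run command"
    if (ret >= 512)
        return "killed by signal " (ret - 512) ", core dumped"
    if (ret >= 256)
        return "killed by signal " (ret - 256)
    return "exit status " ret
@}

BEGIN @{
    print describe_status(system("exit 3"))
@}
@end example

@noindent
With @command{gawk}, this should print @samp{exit status 3}.
By contrast, under the traditional rule, a command killed by signal 9
would typically yield 9 divided by 256, or 0.03515625.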
19080
19081@sidebar Controlling Output Buffering with @code{system()}
19082@cindex buffers @subentry flushing
19083@cindex buffering @subentry input/output
19084@cindex output @subentry buffering
19085
19086The @code{fflush()} function provides explicit control over output buffering for
19087individual files and pipes.  However, its use is not portable to many older
19088@command{awk} implementations.  An alternative method to flush output
19089buffers is to call @code{system()} with a null string as its argument:
19090
19091@example
19092system("")   # flush output
19093@end example
19094
19095@noindent
19096@command{gawk} treats this use of the @code{system()} function as a special
19097case and is smart enough not to run a shell (or other command
19098interpreter) with the empty command.  Therefore, with @command{gawk}, this
19099idiom is not only useful, it is also efficient.  Although this method should work
19100with other @command{awk} implementations, it does not necessarily avoid
19101starting an unnecessary shell.  (Other implementations may only
19102flush the buffer associated with the standard output and not necessarily
19103all buffered output.)
19104
19105If you think about what a programmer expects, it makes sense that
19106@code{system()} should flush any pending output.  The following program:
19107
19108@example
19109BEGIN @{
19110     print "first print"
19111     system("echo system echo")
19112     print "second print"
19113@}
19114@end example
19115
19116@noindent
19117must print:
19118
19119@example
19120first print
19121system echo
19122second print
19123@end example
19124
19125@noindent
19126and not:
19127
19128@example
19129system echo
19130first print
19131second print
19132@end example
19133
19134If @command{awk} did not flush its buffers before calling @code{system()},
19135you would see the latter (undesirable) output.
19136@end sidebar
19137
19138@node Time Functions
19139@subsection Time Functions
19140@cindex time functions
19141
19142@cindex timestamps
19143@cindex log files, timestamps in
19144@cindex files @subentry log, timestamps in
19145@cindex @command{gawk} @subentry timestamps
19146@cindex POSIX @command{awk} @subentry timestamps and
19147@command{awk} programs are commonly used to process log files
19148containing timestamp information, indicating when a
19149particular log record was written.  Many programs log their timestamps
19150in the form returned by the @code{time()} system call, which is the
19151number of seconds since a particular epoch.  On POSIX-compliant systems,
19152it is the number of seconds since
191531970-01-01 00:00:00 UTC, not counting leap
19154@ifclear FOR_PRINT
19155seconds.@footnote{@xref{Glossary}, especially the entries ``Epoch'' and ``UTC.''}
19156@end ifclear
19157@ifset FOR_PRINT
19158seconds.
19159@end ifset
19160All known POSIX-compliant systems support timestamps from 0 through
19161@iftex
19162@math{2^{31} - 1},
19163@end iftex
19164@ifinfo
191652^31 - 1,
19166@end ifinfo
19167@ifnottex
19168@ifnotinfo
191692@sup{31} @minus{} 1,
19170@end ifnotinfo
19171@end ifnottex
19172which is sufficient to represent times through
191732038-01-19 03:14:07 UTC.  Many systems support a wider range of timestamps,
19174including negative timestamps that represent times before the
19175epoch.
19176
19177@cindex @command{date} utility @subentry GNU
19178@cindex time @subentry retrieving
19179In order to make it easier to process such log files and to produce
19180useful reports, @command{gawk} provides the following functions for
19181working with timestamps.  They are @command{gawk} extensions; they are
19182not specified in the POSIX standard.@footnote{The GNU @command{date} utility can
19183also do many of the things described here.  Its use may be preferable
19184for simple time-related operations in shell scripts.}
19185However, recent versions
19186of @command{mawk} (@pxref{Other Versions}) also support these functions.
19187Optional parameters are enclosed in square brackets ([ ]):
19188
19189@c @asis for docbook
19190@table @asis
19191@item @code{mktime(@var{datespec}} [@code{, @var{utc-flag}} ]@code{)}
19192@cindexgawkfunc{mktime}
19193@cindex generate time values
19194Turn @var{datespec} into a timestamp in the same form
19195as is returned by @code{systime()}.  It is similar to the function of the
19196same name in ISO C.  The argument, @var{datespec}, is a string of the form
19197@w{@code{"@var{YYYY} @var{MM} @var{DD} @var{HH} @var{MM} @var{SS} [@var{DST}]"}}.
19198The string consists of six or seven numbers representing, respectively,
19199the full year including century, the month from 1 to 12, the day of the month
19200from 1 to 31, the hour of the day from 0 to 23, the minute from 0 to
1920159, the second from 0 to 60,@footnote{Occasionally there are
19202minutes in a year with a leap second, which is why the
19203seconds can go up to 60.}
19204and an optional daylight-savings flag.
19205
19206The values of these numbers need not be within the ranges specified;
19207for example, an hour of @minus{}1 means 1 hour before midnight.
19208The origin-zero Gregorian calendar is assumed, with year 0 preceding
19209year 1 and year @minus{}1 preceding year 0.
19210If @var{utc-flag} is present and is either nonzero or non-null, the time
19211is assumed to be in the UTC time zone; otherwise, the
19212time is assumed to be in the local time zone.
19213If the @var{DST} daylight-savings flag is positive, the time is assumed to be
19214daylight savings time; if zero, the time is assumed to be standard
19215time; and if negative (the default), @code{mktime()} attempts to determine
19216whether daylight savings time is in effect for the specified time.
19217
19218If @var{datespec} does not contain enough elements or if the resulting time
19219is out of range, @code{mktime()} returns @minus{}1.
19220
19221@cindex @command{gawk} @subentry @code{PROCINFO} array in
19222@cindex @code{PROCINFO} array
19223@item @code{strftime(}[@var{format} [@code{,} @var{timestamp} [@code{,} @var{utc-flag}] ] ]@code{)}
19224@cindexgawkfunc{strftime}
19225@cindex format time string
19226Format the time specified by @var{timestamp}
19227based on the contents of the @var{format} string and return the result.
19228It is similar to the function of the same name in ISO C.
19229If @var{utc-flag} is present and is either nonzero or non-null, the value
19230is formatted as UTC (Coordinated Universal Time, formerly GMT or Greenwich
19231Mean Time). Otherwise, the value is formatted for the local time zone.
19232The @var{timestamp} is in the same format as the value returned by the
19233@code{systime()} function.  If no @var{timestamp} argument is supplied,
19234@command{gawk} uses the current time of day as the timestamp.
19235Without a @var{format} argument, @code{strftime()} uses
19236the value of @code{PROCINFO["strftime"]} as the format string
19237(@pxref{Built-in Variables}).
19238The default string value is
19239@code{@w{"%a %b %e %H:%M:%S %Z %Y"}}.  This format string produces
19240output that is equivalent to that of the @command{date} utility.
19241You can assign a new value to @code{PROCINFO["strftime"]} to
19242change the default format; see the following list for the various format directives.
19243
19244@item @code{systime()}
19245@cindexgawkfunc{systime}
19246@cindex timestamps
19247@cindex current system time
19248Return the current time as the number of seconds since
19249the system epoch.  On POSIX systems, this is the number of seconds
19250since 1970-01-01 00:00:00 UTC, not counting leap seconds.
19251It may be a different number on other systems.
19252@end table
19253
19254The @code{systime()} function allows you to compare a timestamp from a
19255log file with the current time of day.  In particular, it is easy to
19256determine how long ago a particular record was logged.  It also allows
19257you to produce log records using the ``seconds since the epoch'' format.
19258
19259@cindex converting @subentry dates to timestamps
19260@cindex dates @subentry converting to timestamps
19261@cindex timestamps @subentry converting dates to
19262The @code{mktime()} function allows you to convert a textual representation
19263of a date and time into a timestamp.   This makes it easy to do before/after
19264comparisons of dates and times, particularly when dealing with date and
19265time data coming from an external source, such as a log file.
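
For instance, the following sketch passes a nonzero @var{utc-flag} so
that the results do not depend on the local time zone.  The dates are
arbitrary and were chosen only for illustration:

@example
BEGIN @{
    # One day after the POSIX epoch, interpreted as UTC
    print mktime("1970 01 02 00 00 00", 1)

    # Out-of-range fields are normalized: January 32 is February 1
    a = mktime("2021 01 32 00 00 00", 1)
    b = mktime("2021 02 01 00 00 00", 1)
    print (a == b ? "same" : "different")

    # Seconds elapsed between a logged time and now
    then = mktime("2021 10 01 00 00 00", 1)
    print systime() - then
@}
@end example

@noindent
On a POSIX system, the first @code{print} should produce 86400 and
the second should produce @samp{same}.  The third value depends on
when the program is run.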
19266
19267The @code{strftime()} function allows you to easily turn a timestamp
19268into human-readable information.  It is similar in nature to the @code{sprintf()}
19269function
19270(@pxref{String Functions}),
19271in that it copies nonformat specification characters verbatim to the
19272returned string, while substituting date and time values for format
19273specifications in the @var{format} string.
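
As a short sketch of this behavior, the following program formats
timestamp zero as UTC, mixes ordinary text with format specifications,
and then changes the default format through @code{PROCINFO["strftime"]}.
The particular format strings are only examples:

@example
BEGIN @{
    # Timestamp 0 formatted as UTC
    print strftime("%Y-%m-%d %H:%M:%S", 0, 1)

    # Text that is not a format specification is copied verbatim
    print strftime("Logged on %A at %H:%M", 0, 1)

    # Change the default format used when no arguments are given
    PROCINFO["strftime"] = "%Y-%m-%d"
    print strftime()
@}
@end example

@noindent
On a POSIX system, the first line of output should be
@samp{1970-01-01 00:00:00}.  The weekday name in the second line depends
on the locale, and the third line shows the current local date.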
19274
19275@cindex format specifiers @subentry @code{strftime()} function (@command{gawk})
19276@code{strftime()} is guaranteed by the 1999 ISO C
19277standard@footnote{Unfortunately,
19278not every system's @code{strftime()} necessarily
19279supports all of the conversions listed here.}
19280to support the following date format specifications:
19281
19282@table @code
19283@item %a
19284The locale's abbreviated weekday name.
19285
19286@item %A
19287The locale's full weekday name.
19288
19289@item %b
19290The locale's abbreviated month name.
19291
19292@item %B
19293The locale's full month name.
19294
19295@item %c
19296The locale's ``appropriate'' date and time representation.
(This is @samp{%a %b %e %T %Y} in the @code{"C"} locale.)
19298
19299@item %C
19300The century part of the current year.
19301This is the year divided by 100 and truncated to the next
19302lower integer.
19303
19304@item %d
19305The day of the month as a decimal number (01--31).
19306
19307@item %D
19308Equivalent to specifying @samp{%m/%d/%y}.
19309
19310@item %e
19311The day of the month, padded with a space if it is only one digit.
19312
19313@item %F
19314Equivalent to specifying @samp{%Y-%m-%d}.
19315This is the ISO 8601 date format.
19316
19317@item %g
19318The year modulo 100 of the ISO 8601 week number, as a decimal number (00--99).
For example, January 1, 2012, is in week 52 of 2011. Thus, the year
of its ISO 8601 week number is 2011, even though its year is 2012.
Similarly, December 31, 2012, is in week 1 of 2013. Thus, the year
of its ISO 8601 week number is 2013, even though its year is 2012.
19323
19324@item %G
19325The full year of the ISO week number, as a decimal number.
19326
19327@item %h
19328Equivalent to @samp{%b}.
19329
19330@item %H
19331The hour (24-hour clock) as a decimal number (00--23).
19332
19333@item %I
19334The hour (12-hour clock) as a decimal number (01--12).
19335
19336@item %j
19337The day of the year as a decimal number (001--366).
19338
19339@item %m
19340The month as a decimal number (01--12).
19341
19342@item %M
19343The minute as a decimal number (00--59).
19344
19345@item %n
19346A newline character (ASCII LF).
19347
19348@item %p
19349The locale's equivalent of the AM/PM designations associated
19350with a 12-hour clock.
19351
19352@item %r
19353The locale's 12-hour clock time.
19354(This is @samp{%I:%M:%S %p} in the @code{"C"} locale.)
19355
19356@item %R
19357Equivalent to specifying @samp{%H:%M}.
19358
19359@item %S
19360The second as a decimal number (00--60).
19361
19362@item %t
19363A TAB character.
19364
19365@item %T
19366Equivalent to specifying @samp{%H:%M:%S}.
19367
19368@item %u
19369The weekday as a decimal number (1--7).  Monday is day one.
19370
19371@item %U
19372The week number of the year (with the first Sunday as the first day of week one)
19373as a decimal number (00--53).
19374
19375@cindex ISO @subentry ISO 8601 date and time standard
19376@item %V
The week number of the year (with Monday as the first
day of the week) as a decimal number (01--53).
19379The method for determining the week number is as specified by ISO 8601.
19380(To wit: if the week containing January 1 has four or more days in the
19381new year, then it is week one; otherwise it is the last week
19382[52 or 53] of the previous year and the next week is week one.)
19383
19384@item %w
19385The weekday as a decimal number (0--6).  Sunday is day zero.
19386
19387@item %W
19388The week number of the year (with the first Monday as the first day of week one)
19389as a decimal number (00--53).
19390
19391@item %x
19392The locale's ``appropriate'' date representation.
(This is @samp{%m/%d/%y} in the @code{"C"} locale.)
19394
19395@item %X
19396The locale's ``appropriate'' time representation.
19397(This is @samp{%T} in the @code{"C"} locale.)
19398
19399@item %y
19400The year modulo 100 as a decimal number (00--99).
19401
19402@item %Y
19403The full year as a decimal number (e.g., 2015).
19404
19405@c @cindex RFC 822
19406@c @cindex RFC 1036
19407@item %z
19408The time zone offset in a @samp{+@var{HHMM}} format (e.g., the format
19409necessary to produce RFC 822/RFC 1036 date headers).
19410
19411@item %Z
19412The time zone name or abbreviation; no characters if
19413no time zone is determinable.
19414
19415@item %Ec %EC %Ex %EX %Ey %EY %Od %Oe %OH
19416@itemx %OI %Om %OM %OS %Ou %OU %OV %Ow %OW %Oy
19417``Alternative representations'' for the specifications
19418that use only the second letter (@samp{%c}, @samp{%C},
19419and so on).@footnote{If you don't understand any of this, don't worry about
19420it; these facilities are meant to make it easier to ``internationalize''
19421programs.
19422Other internationalization features are described in
19423@ref{Internationalization}.}
19424(These facilitate compliance with the POSIX @command{date} utility.)
19425
19426@item %%
19427A literal @samp{%}.
19428@end table
19429
19430If a conversion specifier is not one of those just listed, the behavior is
19431undefined.@footnote{This is because ISO C leaves the
19432behavior of the C version of @code{strftime()} undefined and @command{gawk}
19433uses the system's version of @code{strftime()} if it's there.
19434Typically, the conversion specifier either does not appear in the
19435returned string or appears literally.}
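
The week-related specifications are the ones most easily confused.
The following sketch formats a single date, chosen because it falls at
a year boundary, in several ways.  It relies on the @var{utc-flag}
arguments of @code{mktime()} and @code{strftime()} so that the local
time zone does not matter:

@example
BEGIN @{
    ts = mktime("2012 12 31 12 00 00", 1)   # Monday, December 31, 2012
    print strftime("%Y-%m-%d", ts, 1)       # calendar date
    print strftime("%G-W%V-%u", ts, 1)      # ISO 8601 year, week, weekday
    print strftime("%j", ts, 1)             # day of the year
@}
@end example

@noindent
On a system whose @code{strftime()} supports these specifications,
this should print:

@example
2012-12-31
2013-W01-1
366
@end example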
19436
19437For systems that are not yet fully standards-compliant,
19438@command{gawk} supplies a copy of
19439@code{strftime()} from the GNU C Library.
19440It supports all of the just-listed format specifications.
19441If that version is
19442used to compile @command{gawk} (@pxref{Installation}),
19443then the following additional format specifications are available:
19444
19445@table @code
19446@item %k
19447The hour (24-hour clock) as a decimal number (0--23).
19448Single-digit numbers are padded with a space.
19449
19450@item %l
19451The hour (12-hour clock) as a decimal number (1--12).
19452Single-digit numbers are padded with a space.
19453
19454@ignore
19455@item %N
19456The ``Emperor/Era'' name.
19457Equivalent to @samp{%C}.
19458
19459@item %o
19460The ``Emperor/Era'' year.
19461Equivalent to @samp{%y}.
19462@end ignore
19463
19464@item %s
19465The time as a decimal timestamp in seconds since the epoch.
19466
19467@ignore
19468@item %v
19469The date in VMS format (e.g., @samp{20-JUN-1991}).
19470@end ignore
19471@end table
19472
19473Additionally, the alternative representations are recognized but their
19474normal representations are used.
19475
19476@cindex @code{date} utility @subentry POSIX
19477@cindex POSIX @command{awk} @subentry @code{date} utility and
19478The following example is an @command{awk} implementation of the POSIX
19479@command{date} utility.  Normally, the @command{date} utility prints the
19480current date and time of day in a well-known format.  However, if you
19481provide an argument to it that begins with a @samp{+}, @command{date}
19482copies nonformat specifier characters to the standard output and
19483interprets the current time according to the format specifiers in
19484the string.  For example:
19485
19486@example
19487$ @kbd{date '+Today is %A, %B %d, %Y.'}
19488@print{} Today is Monday, September 22, 2014.
19489@end example
19490
19491Here is the @command{gawk} version of the @command{date} utility.
19492It has a shell ``wrapper'' to handle the @option{-u} option,
19493which requires that @command{date} run as if the time zone
19494is set to UTC:
19495
19496@example
19497#! /bin/sh
19498#
19499# date --- approximate the POSIX 'date' command
19500
19501case $1 in
19502-u)  TZ=UTC0     # use UTC
19503     export TZ
19504     shift ;;
19505esac
19506
19507gawk 'BEGIN  @{
19508    format = PROCINFO["strftime"]
19509    exitval = 0
19510
19511    if (ARGC > 2)
19512        exitval = 1
19513    else if (ARGC == 2) @{
19514        format = ARGV[1]
19515        if (format ~ /^\+/)
19516            format = substr(format, 2)   # remove leading +
19517    @}
19518    print strftime(format)
19519    exit exitval
19520@}' "$@@"
19521@end example
19522
19523@node Bitwise Functions
19524@subsection Bit-Manipulation Functions
19525@cindex bit-manipulation functions
19526@cindex bitwise @subentry operations
19527@cindex AND bitwise operation
19528@cindex OR bitwise operation
19529@cindex XOR bitwise operation
19530@cindex operations, bitwise
19531@quotation
19532@i{I can explain it for you, but I can't understand it for you.}
19533@author Anonymous
19534@end quotation
19535
19536Many languages provide the ability to perform @dfn{bitwise} operations
19537on two integer numbers.  In other words, the operation is performed on
19538each successive pair of bits in the operands.
19539Three common operations are bitwise AND, OR, and XOR.
19540The operations are described in @ref{table-bitwise-ops}.
19541
19542@c 11/2014: Postprocessing turns the docbook informaltable
19543@c into a table. Hurray for scripting!
19544@float Table,table-bitwise-ops
19545@caption{Bitwise operations}
19546@ifnottex
19547@ifnotdocbook
19548@verbatim
19549                Bit operator
19550          |  AND  |   OR  |  XOR
19551          |---+---+---+---+---+---
19552Operands  | 0 | 1 | 0 | 1 | 0 | 1
19553----------+---+---+---+---+---+---
19554    0     | 0   0 | 0   1 | 0   1
19555    1     | 0   1 | 1   1 | 1   0
19556@end verbatim
19557@end ifnotdocbook
19558@end ifnottex
19559@tex
19560\centerline{
19561\vbox{\bigskip % space above the table (about 1 linespace)
19562% Because we have vertical rules, we can't let TeX insert interline space
19563% in its usual way.
19564\offinterlineskip
19565\halign{\strut\hfil#\quad\hfil  % operands
19566        &\vrule#&\quad#\quad    % rule, 0 (of and)
19567        &\vrule#&\quad#\quad    % rule, 1 (of and)
19568        &\vrule#                % rule between and and or
19569        &\quad#\quad            % 0 (of or)
        &\vrule#&\quad#\quad    % rule, 1 (of or)
19571        &\vrule#                % rule between or and xor
19572        &\quad#\quad            % 0 of xor
19573        &\vrule#&\quad#\quad    % rule, 1 of xor
19574        \cr
19575&\omit&\multispan{11}\hfil\bf Bit operator\hfil\cr
19576\noalign{\smallskip}
19577&     &\multispan3\hfil AND\hfil&&\multispan3\hfil  OR\hfil
19578                           &&\multispan3\hfil XOR\hfil\cr
19579\bf Operands&&0&&1&&0&&1&&0&&1\cr
19580\noalign{\hrule}
19581\omit&height 2pt&&\omit&&&&\omit&&&&\omit\cr
19582\noalign{\hrule height0pt}% without this the rule does not extend; why?
195830&&0&\omit&0&&0&\omit&1&&0&\omit&1\cr
195841&&0&\omit&1&&1&\omit&1&&1&\omit&0\cr
19585}}}
19586@end tex
19587
19588@docbook
19589<informaltable>
19590
19591<tgroup cols="7" colsep="1">
19592<colspec colname="c1"/>
19593<colspec colname="c2"/>
19594<colspec colname="c3"/>
19595<colspec colname="c4"/>
19596<colspec colname="c5"/>
19597<colspec colname="c6"/>
19598<colspec colname="c7"/>
19599<spanspec spanname="optitle" namest="c2" nameend="c7" align="center"/>
19600<spanspec spanname="andspan" namest="c2" nameend="c3" align="center"/>
19601<spanspec spanname="orspan" namest="c4" nameend="c5" align="center"/>
19602<spanspec spanname="xorspan" namest="c6" nameend="c7" align="center"/>
19603
19604<tbody>
19605<row>
19606<entry colsep="0"></entry>
19607<entry spanname="optitle"><emphasis role="bold">Bit operator</emphasis></entry>
19608</row>
19609
19610<row rowsep="1">
19611<entry rowsep="0"></entry>
19612<entry spanname="andspan">AND</entry>
19613<entry spanname="orspan">OR</entry>
19614<entry spanname="xorspan">XOR</entry>
19615</row>
19616
19617<row rowsep="1">
19618<entry ><emphasis role="bold">Operands</emphasis></entry>
19619<entry colsep="0">0</entry>
19620<entry colsep="1">1</entry>
19621<entry colsep="0">0</entry>
19622<entry colsep="1">1</entry>
19623<entry colsep="0">0</entry>
19624<entry colsep="1">1</entry>
19625</row>
19626
19627<row>
19628<entry align="center">0</entry>
19629<entry colsep="0">0</entry>
19630<entry>0</entry>
19631<entry colsep="0">0</entry>
19632<entry>1</entry>
19633<entry colsep="0">0</entry>
19634<entry>1</entry>
19635</row>
19636
19637<row>
19638<entry align="center">1</entry>
19639<entry colsep="0">0</entry>
19640<entry>1</entry>
19641<entry colsep="0">1</entry>
19642<entry>1</entry>
19643<entry colsep="0">1</entry>
19644<entry>0</entry>
19645</row>
19646
19647</tbody>
19648</tgroup>
19649</informaltable>
19650@end docbook
19651@end float
19652
19653@cindex bitwise @subentry complement
19654@cindex complement, bitwise
19655As you can see, the result of an AND operation is 1 only when @emph{both}
19656bits are 1.
19657The result of an OR operation is 1 if @emph{either} bit is 1.
19658The result of an XOR operation is 1 if either bit is 1,
19659but not both.
19660The next operation is the @dfn{complement}; the complement of 1 is 0 and
19661the complement of 0 is 1. Thus, this operation ``flips'' all the bits
19662of a given value.
19663
19664@cindex bitwise @subentry shift
19665@cindex left shift, bitwise
19666@cindex right shift, bitwise
19667@cindex shift, bitwise
19668Finally, two other common operations are to shift the bits left or right.
19669For example, if you have a bit string @samp{10111001} and you shift it
19670right by three bits, you end up with @samp{00010111}.@footnote{This example
19671shows that zeros come in on the left side. For @command{gawk}, this is
19672always true, but in some languages, it's possible to have the left side
19673fill with ones.}
19674If you start over again with @samp{10111001} and shift it left by three
19675bits, you end up with @samp{11001000}.  The following list describes
19676@command{gawk}'s built-in functions that implement the bitwise operations.
19677Optional parameters are enclosed in square brackets ([ ]):
19678
19679@cindex @command{gawk} @subentry bitwise operations in
19680@table @asis
19681@cindexgawkfunc{and}
19682@cindex bitwise @subentry AND
19683@item @code{and(}@var{v1}@code{,} @var{v2} [@code{,} @dots{}]@code{)}
19684Return the bitwise AND of the arguments. There must be at least two.
19685
19686@cindexgawkfunc{compl}
19687@cindex bitwise @subentry complement
19688@item @code{compl(@var{val})}
19689Return the bitwise complement of @var{val}.
19690
19691@cindexgawkfunc{lshift}
19692@item @code{lshift(@var{val}, @var{count})}
19693Return the value of @var{val}, shifted left by @var{count} bits.
19694
19695@cindexgawkfunc{or}
19696@cindex bitwise @subentry OR
19697@item @code{or(}@var{v1}@code{,} @var{v2} [@code{,} @dots{}]@code{)}
19698Return the bitwise OR of the arguments. There must be at least two.
19699
19700@cindexgawkfunc{rshift}
19701@item @code{rshift(@var{val}, @var{count})}
19702Return the value of @var{val}, shifted right by @var{count} bits.
19703
19704@cindexgawkfunc{xor}
19705@cindex bitwise @subentry XOR
19706@item @code{xor(}@var{v1}@code{,} @var{v2} [@code{,} @dots{}]@code{)}
19707Return the bitwise XOR of the arguments. There must be at least two.
19708@end table
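
For instance, the following sketch applies each function to small
operands whose bit patterns are easy to check by hand (the operand
values are chosen purely for illustration):

@example
$ @kbd{gawk 'BEGIN @{ print and(12, 10), or(12, 10), xor(12, 10) @}'}
@print{} 8 14 6
$ @kbd{gawk 'BEGIN @{ print lshift(1, 4), rshift(256, 4) @}'}
@print{} 16 16
@end example

@noindent
In binary, 12 is @samp{1100} and 10 is @samp{1010}, so the AND is
@samp{1000} (8), the OR is @samp{1110} (14), and the XOR is
@samp{0110} (6).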
19709
19710@quotation CAUTION
19711Beginning with @command{gawk} @value{PVERSION} 4.2, negative
19712operands are not allowed for any of these functions. A negative
19713operand produces a fatal error.  See the sidebar
19714``Beware The Smoke and Mirrors!'' for more information as to why.
19715@end quotation
19716
19717Here is a user-defined function (@pxref{User-defined})
19718that illustrates the use of these functions:
19719
19720@cindex @code{bits2str()} user-defined function
19721@cindex user-defined @subentry function @subentry @code{bits2str()}
19722@cindex @file{testbits.awk} program
19723@example
19724@group
19725@c file eg/lib/bits2str.awk
19726# bits2str --- turn an integer into readable ones and zeros
19727
19728function bits2str(bits,        data, mask)
19729@{
19730    if (bits == 0)
19731        return "0"
19732
19733    mask = 1
19734    for (; bits != 0; bits = rshift(bits, 1))
19735        data = (and(bits, mask) ? "1" : "0") data
19736
19737    while ((length(data) % 8) != 0)
19738        data = "0" data
19739
19740    return data
19741@}
19742@c endfile
19743@end group
19744
19745@c this is a hack to make testbits.awk self-contained
19746@ignore
19747@c file eg/prog/testbits.awk
19748# bits2str --- turn an integer into readable ones and zeros
19749
19750function bits2str(bits,        data, mask)
19751@{
19752    if (bits == 0)
19753        return "0"
19754
19755    mask = 1
19756    for (; bits != 0; bits = rshift(bits, 1))
19757        data = (and(bits, mask) ? "1" : "0") data
19758
19759    while ((length(data) % 8) != 0)
19760        data = "0" data
19761
19762    return data
19763@}
19764@c endfile
19765@end ignore
19766@c file eg/prog/testbits.awk
19767BEGIN @{
19768    printf "123 = %s\n", bits2str(123)
19769    printf "0123 = %s\n", bits2str(0123)
19770    printf "0x99 = %s\n", bits2str(0x99)
19771    comp = compl(0x99)
19772    printf "compl(0x99) = %#x = %s\n", comp, bits2str(comp)
19773    shift = lshift(0x99, 2)
19774    printf "lshift(0x99, 2) = %#x = %s\n", shift, bits2str(shift)
19775    shift = rshift(0x99, 2)
19776    printf "rshift(0x99, 2) = %#x = %s\n", shift, bits2str(shift)
19777@}
19778@c endfile
19779@end example
19780
19781@noindent
19782This program produces the following output when run:
19783
19784@example
19785$ @kbd{gawk -f testbits.awk}
19786@print{} 123 = 01111011
19787@print{} 0123 = 01010011
19788@print{} 0x99 = 10011001
19789@print{} compl(0x99) = 0x3fffffffffff66 =
19790@print{} 00111111111111111111111111111111111111111111111101100110
19791@print{} lshift(0x99, 2) = 0x264 = 0000001001100100
19792@print{} rshift(0x99, 2) = 0x26 = 00100110
19793@end example
19794
19795@cindex converting @subentry string to numbers
19796@cindex strings @subentry converting
19797@cindex numbers @subentry converting
19798@cindex converting @subentry numbers to strings
19799@cindex numbers @subentry as string of bits
19800The @code{bits2str()} function turns a binary number into a string.
19801Initializing @code{mask} to one creates
19802a binary value where the rightmost bit
19803is set to one.  Using this mask,
19804the function repeatedly checks the rightmost bit.
19805ANDing the mask with the value indicates whether the
19806rightmost bit is one or not. If so, a @code{"1"} is concatenated onto the front
19807of the string.
19808Otherwise, a @code{"0"} is added.
19809The value is then shifted right by one bit and the loop continues
19810until there are no more one bits.
19811
If the initial value is zero, the function returns a simple @code{"0"}.
Otherwise, at the end, it pads the string with leading zeros so that its
length is a multiple of eight, matching the 8-bit bytes typical of modern computers.
19815
19816The main code in the @code{BEGIN} rule shows the difference between the
19817decimal and octal values for the same numbers
19818(@pxref{Nondecimal-numbers}),
19819and then demonstrates the
19820results of the @code{compl()}, @code{lshift()}, and @code{rshift()} functions.
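
The same shift-and-mask idea can test a single bit.  The following is
a minimal sketch; the helper name @code{testbit()} is hypothetical and
not part of @command{gawk}.  It shifts the bit of interest into the
rightmost position and then ANDs the result with one:

@example
@group
# testbit --- return 1 if bit n of val is set, 0 otherwise
# (hypothetical helper; bits are counted from zero at the right)

function testbit(val, n)
@{
    return and(rshift(val, n), 1)
@}

BEGIN @{
    # 0x99 is 10011001 in binary
    print testbit(0x99, 0)    # rightmost bit is set
    print testbit(0x99, 1)    # next bit is clear
    print testbit(0x99, 7)    # leftmost bit of the byte is set
@}
@end group
@end example

@noindent
When run with @command{gawk}, this sketch prints @samp{1}, @samp{0},
and @samp{1}, each on its own line.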
19821
19822@sidebar Beware The Smoke and Mirrors!
19823
In other languages, bitwise operations are performed on integer values,
19825not floating-point values.  As a general statement, such operations work
19826best when performed on unsigned integers.
19827
19828@command{gawk} attempts to treat the arguments to the bitwise functions
19829as unsigned integers.  For this reason, negative arguments produce a
19830fatal error.
19831
19832In normal operation, for all of these functions, first the
19833double-precision floating-point value is converted to the widest C
19834unsigned integer type, then the bitwise operation is performed.  If the
19835result cannot be represented exactly as a C @code{double}, leading
19836nonzero bits are removed one by one until it can be represented exactly.
19837The result is then converted back into a C @code{double}.@footnote{If you don't
19838understand this paragraph, the upshot is that @command{gawk} can only
19839store a particular range of integer values; numbers outside that range
19840are reduced to fit within the range.}
19841
19842However, when using arbitrary precision arithmetic with the @option{-M}
19843option (@pxref{Arbitrary Precision Arithmetic}), the results may differ.
19844This is particularly noticeable with the @code{compl()} function:
19845
19846@example
19847$ @kbd{gawk 'BEGIN @{ print compl(42) @}'}
19848@print{} 9007199254740949
19849$ @kbd{gawk -M 'BEGIN @{ print compl(42) @}'}
19850@print{} -43
19851@end example
19852
19853What's going on becomes clear when printing the results
19854in hexadecimal:
19855
19856@example
19857$ @kbd{gawk 'BEGIN @{ printf "%#x\n", compl(42) @}'}
19858@print{} 0x1fffffffffffd5
19859$ @kbd{gawk -M 'BEGIN @{ printf "%#x\n", compl(42) @}'}
19860@print{} 0xffffffffffffffd5
19861@end example
19862
19863When using the @option{-M} option, under the hood, @command{gawk} uses
19864GNU MP arbitrary precision integers which have at least 64 bits of precision.
19865When not using @option{-M}, @command{gawk} stores integral values in
regular double-precision floating point, which maintains only 53 bits of
19867precision.  Furthermore, the GNU MP library treats (or at least seems to treat)
19868the leading bit as a sign bit; thus the result with @option{-M} in this case is
19869a negative number.
19870
19871In short, using @command{gawk} for any but the simplest kind of bitwise
19872operations is probably a bad idea; caveat emptor!
19873
19874@end sidebar
19875
19876@node Type Functions
19877@subsection Getting Type Information
19878
19879@command{gawk} provides two functions that let you distinguish
19880the type of a variable.
19881This is necessary for writing code
19882that traverses every element of an array of arrays
19883(@pxref{Arrays of Arrays}), and in other contexts.
19884
19885@table @code
19886@cindexgawkfunc{isarray}
19887@cindex scalar or array
19888@item isarray(@var{x})
19889Return a true value if @var{x} is an array. Otherwise, return false.
19890
19891@cindexgawkfunc{typeof}
19892@cindex variable type, @code{typeof()} function (@command{gawk})
19893@cindex type @subentry of variable, @code{typeof()} function (@command{gawk})
19894@item typeof(@var{x})
19895Return one of the following strings, depending upon the type of @var{x}:
19896
19897@c nested table
19898@table @code
19899@item "array"
19900@var{x} is an array.
19901
19902@item "regexp"
19903@var{x} is a strongly typed regexp (@pxref{Strong Regexp Constants}).
19904
19905@item "number"
19906@var{x} is a number.
19907
19908@item "string"
19909@var{x} is a string.
19910
19911@item "strnum"
19912@var{x} is a number that started life as user input, such as a field or
19913the result of calling @code{split()}. (I.e., @var{x} has the strnum
19914attribute; @pxref{Variable Typing}.)
19915
19916@item "unassigned"
19917@var{x} is a scalar variable that has not been assigned a value yet.
19918For example:
19919
19920@example
19921BEGIN @{
19922    # creates a[1] but it has no assigned value
19923    a[1]
19924    print typeof(a[1])  # unassigned
19925@}
19926@end example
19927
19928@item "untyped"
@var{x} has not been used yet at all; it can become a scalar or an
19930array.  The typing could even conceivably differ from run to run of
19931the same program! For example:
19932
19933@example
19934BEGIN @{
19935    print "initially, typeof(v) = ", typeof(v)
19936
19937    if ("FOO" in ENVIRON)
19938        make_scalar(v)
19939    else
19940        make_array(v)
19941
19942    print "typeof(v) =", typeof(v)
19943@}
19944
19945function make_scalar(p,    l) @{ l = p @}
19946
19947function make_array(p) @{ p[1] = 1 @}
19948@end example
19949
19950@end table
19951@end table
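
As a short illustration of the more common return values, consider the
following sketch.  The values tested are arbitrary; the second command
reads its data from standard input so that the field carries the strnum
attribute:

@example
$ @kbd{gawk 'BEGIN @{ print typeof(42), typeof("42"), typeof(@@/42/) @}'}
@print{} number string regexp
$ @kbd{echo 42 | gawk '@{ print typeof($1), typeof($1 "") @}'}
@print{} strnum string
@end example

@noindent
Concatenating the field with the null string produces a new value that
is a plain string, which is why the last result differs from the one
before it.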
19952
19953@code{isarray()} is meant for use in two circumstances. The first is when
19954traversing a multidimensional array: you can test if an element is itself
19955an array or not.  The second is inside the body of a user-defined function
19956(not discussed yet; @pxref{User-defined}), to test if a parameter is an
19957array or not.
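
The first circumstance might look like the following sketch.  The
function name @code{walk()} and the sample data are hypothetical; the
function recurses into any element that is itself an array and prints
every other element:

@example
# walk --- print every scalar element of a possibly nested array
# (hypothetical helper, shown only as an illustration)

function walk(arr, name,    i)
@{
    for (i in arr) @{
        if (isarray(arr[i]))
            walk(arr[i], (name "[" i "]"))
        else
            printf "%s[%s] = %s\n", name, i, arr[i]
    @}
@}

BEGIN @{
    a[1] = "x"
    a[2][1] = "y"
    walk(a, "a")
@}
@end example

@noindent
This prints one line for @code{a[1]} and one for @code{a[2][1]}.
The order in which a @code{for} loop visits array elements is not
guaranteed, so the two lines may appear in either order.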
19958
19959@quotation NOTE
19960While you can use @code{isarray()} at the global level to test variables,
19961doing so makes no sense. Because @emph{you} are the one writing the
19962program, @emph{you} are supposed to know if your variables are arrays
19963or not.
19964@end quotation
19965
19966The @code{typeof()} function is general; it allows you to determine
19967if a variable or function parameter is a scalar (number, string,
19968or strongly typed regexp) or an array.
19969
19970Normally, passing a variable that has never been used to a built-in
19971function causes it to become a scalar variable (unassigned).
19972However, @code{isarray()} and @code{typeof()} are different; they do
19973not change their arguments from untyped to unassigned.
19974
19975@cindex dark corner @subentry array elements created by reference
19976By ``variable'' we mean one denoted by a simple identifier.  Array elements
19977that come into existence simply by referencing them
are different; they are automatically forced to be scalars. Consider:
19979
19980@example
19981$ @kbd{gawk 'BEGIN @{ print typeof(x) @}'}
19982@print{} untyped
19983$ @kbd{gawk 'BEGIN @{ print typeof(x["foo"]) @}'}
19984@print{} unassigned
19985@end example
19986
19987@noindent
19988@code{x["foo"]} comes into existence before it is passed to @code{typeof()};
19989@code{typeof()} cannot tell that it didn't exist prior to being called.
19990@value{DARKCORNER}
19991
19992@c FIXME: For 5.2, if this will change, update this bit of doc.
19993@c This may change in a future release, whereby @command{gawk}
19994@c would allow such an unassigned array element to be used for
19995@c a multidimensional array, and not remain a scalar forever
19996@c (or until deleted).
19997
19998@node I18N Functions
19999@subsection String-Translation Functions
20000@cindex @command{gawk} @subentry string-translation functions
20001@cindex functions @subentry string-translation
20002@cindex string-translation functions
20003@cindex internationalization
20004@cindex @command{awk} programs @subentry internationalizing
20005
20006@command{gawk} provides facilities for internationalizing @command{awk} programs.
20007These include the functions described in the following list.
20008The descriptions here are purposely brief.
20009@xref{Internationalization},
20010for the full story.
20011Optional parameters are enclosed in square brackets ([ ]):
20012
20013@table @asis
20014@cindexgawkfunc{bindtextdomain}
20015@cindex set directory of message catalogs
20016@item @code{bindtextdomain(@var{directory}} [@code{,} @var{domain}]@code{)}
20017Set the directory in which
20018@command{gawk} will look for message translation files, in case they
20019will not or cannot be placed in the ``standard'' locations
20020(e.g., during testing).
20021It returns the directory in which @var{domain} is ``bound.''
20022
20023The default @var{domain} is the value of @code{TEXTDOMAIN}.
20024If @var{directory} is the null string (@code{""}), then
20025@code{bindtextdomain()} returns the current binding for the
20026given @var{domain}.
20027
20028@cindexgawkfunc{dcgettext}
20029@cindex translate string
20030@item @code{dcgettext(@var{string}} [@code{,} @var{domain} [@code{,} @var{category}] ]@code{)}
20031Return the translation of @var{string} in
20032text domain @var{domain} for locale category @var{category}.
20033The default value for @var{domain} is the current value of @code{TEXTDOMAIN}.
20034The default value for @var{category} is @code{"LC_MESSAGES"}.
20035
20036@cindexgawkfunc{dcngettext}
20037@item @code{dcngettext(@var{string1}, @var{string2}, @var{number}} [@code{,} @var{domain} [@code{,} @var{category}] ]@code{)}
20038Return the plural form used for @var{number} of the
20039translation of @var{string1} and @var{string2} in text domain
20040@var{domain} for locale category @var{category}. @var{string1} is the
20041English singular variant of a message, and @var{string2} is the English plural
20042variant of the same message.
20043The default value for @var{domain} is the current value of @code{TEXTDOMAIN}.
20044The default value for @var{category} is @code{"LC_MESSAGES"}.
20045@end table
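
As a brief sketch, assuming that no message catalog is installed for
the current text domain, the translation functions return their English
arguments unchanged, and @code{dcngettext()} chooses between the
singular and plural strings based on @var{number}.  The message strings
here are made up for the illustration:

@example
$ @kbd{gawk 'BEGIN @{ print dcngettext("one file", "many files", 1) @}'}
@print{} one file
$ @kbd{gawk 'BEGIN @{ print dcngettext("one file", "many files", 3) @}'}
@print{} many files
@end example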
20046
20047@node User-defined
20048@section User-Defined Functions
20049
20050@cindex user-defined @subentry functions
20051@cindex functions @subentry user-defined
20052Complicated @command{awk} programs can often be simplified by defining
20053your own functions.  User-defined functions can be called just like
20054built-in ones (@pxref{Function Calls}), but it is up to you to define
20055them (i.e., to tell @command{awk} what they should do).
20056
20057@menu
20058* Definition Syntax::           How to write definitions and what they mean.
20059* Function Example::            An example function definition and what it
20060                                does.
20061* Function Calling::            Calling user-defined functions.
20062* Return Statement::            Specifying the value a function returns.
20063* Dynamic Typing::              How variable types can change at runtime.
20064@end menu
20065
20066@node Definition Syntax
20067@subsection Function Definition Syntax
20068
20069@quotation
20070@i{It's entirely fair to say that the awk syntax for local
20071variable definitions is appallingly awful.}
20072@author Brian Kernighan
20073@end quotation
20074
20075@cindex functions @subentry defining
20076Definitions of functions can appear anywhere between the rules of an
20077@command{awk} program.  Thus, the general form of an @command{awk} program is
20078extended to include sequences of rules @emph{and} user-defined function
20079definitions.
20080There is no need to put the definition of a function
20081before all uses of the function.  This is because @command{awk} reads the
20082entire program before starting to execute any of it.
20083
20084The definition of a function named @var{name} looks like this:
20085
20086@display
20087@group
20088@code{function} @var{name}@code{(}[@var{parameter-list}]@code{)}
20089@code{@{}
20090     @var{body-of-function}
20091@code{@}}
20092@end group
20093@end display
20094
20095@cindex names @subentry functions
20096@cindex functions @subentry names of
20097@cindex naming issues @subentry functions
20098@noindent
20099Here, @var{name} is the name of the function to define.  A valid function
20100name is like a valid variable name: a sequence of letters, digits, and
20101underscores that doesn't start with a digit.
20102Here too, only the 52 upper- and lowercase English letters may
20103be used in a function name.
20104Within a single @command{awk} program, any particular name can only be
20105used as a variable, array, or function.
20106
20107@var{parameter-list} is an optional list of the function's arguments and local
20108variable names, separated by commas.  When the function is called,
20109the argument names are used to hold the argument values given in
20110the call.
20111
20112A function cannot have two parameters with the same name, nor may it
20113have a parameter with the same name as the function itself.
20114
20115@quotation CAUTION
20116According to the POSIX standard, function parameters
20117cannot have the same name as one of the special predefined variables
20118(@pxref{Built-in Variables}), nor may a function parameter have the
20119same name as another function.
20120
20121@cindex dark corner @subentry parameter name restrictions
20122Not all versions of @command{awk} enforce
20123these restrictions.  @value{DARKCORNER}
20124@command{gawk} always enforces the first restriction.
20125With @option{--posix} (@pxref{Options}),
20126it also enforces the second restriction.
20127@end quotation
20128
20129Local variables act like the empty string if referenced where a string
20130value is required, and like zero if referenced where a numeric value
20131is required. This is the same as the behavior of regular variables that have never been
20132assigned a value.  (There is more to understand about local variables;
20133@pxref{Dynamic Typing}.)
20134
20135The @var{body-of-function} consists of @command{awk} statements.  It is the
20136most important part of the definition, because it says what the function
20137should actually @emph{do}.  The argument names exist to give the body a
20138way to talk about the arguments; local variables exist to give the body
20139places to keep temporary values.
20140
20141Argument names are not distinguished syntactically from local variable
20142names. Instead, the number of arguments supplied when the function is
20143called determines how many argument variables there are.  Thus, if three
20144argument values are given, the first three names in @var{parameter-list}
20145are arguments and the rest are local variables.
20146
20147It follows that if the number of arguments is not the same in all calls
20148to the function, some of the names in @var{parameter-list} may be
20149arguments on some occasions and local variables on others.  Another
20150way to think of this is that omitted arguments default to the
20151null string.
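
The following sketch shows one way to take advantage of this.
The function @code{greet()} and its parameter names are hypothetical;
the second parameter acts as an optional argument with a default value:

@example
function greet(name, greeting)
@{
    # an omitted argument compares equal to the null string
    if (greeting == "")
        greeting = "Hello"
    print greeting ", " name
@}

BEGIN @{
    greet("world")               # greeting omitted
    greet("world", "Goodbye")    # greeting supplied
@}
@end example

@noindent
When run, this prints @samp{Hello, world} followed by
@samp{Goodbye, world}.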
20152
20153@cindex programming conventions @subentry functions @subentry writing
20154Usually when you write a function, you know how many names you intend to
20155use for arguments and how many you intend to use as local variables.  It is
20156conventional to place some extra space between the arguments and
20157the local variables, in order to document how your function is supposed to be used.
20158
20159@cindex variables @subentry shadowing
20160@cindex shadowing of variable values
20161During execution of the function body, the arguments and local variable
20162values hide, or @dfn{shadow}, any variables of the same names used in the
20163rest of the program.  The shadowed variables are not accessible in the
20164function definition, because there is no way to name them while their
20165names have been taken away for the arguments and local variables.  All other variables
20166used in the @command{awk} program can be referenced or set normally in the
20167function's body.
20168
20169The arguments and local variables last only as long as the function body
20170is executing.  Once the body finishes, you can once again access the
20171variables that were shadowed while the function was running.
20172
20173@cindex recursive functions
20174@cindex functions @subentry recursive
20175The function body can contain expressions that call functions.  They
20176can even call this function, either directly or by way of another
20177function.  When this happens, we say the function is @dfn{recursive}.
20178The act of a function calling itself is called @dfn{recursion}.
20179
20180All the built-in functions return a value to their caller.
20181User-defined functions can do so also, using the @code{return} statement,
20182which is described in detail in @ref{Return Statement}.
20183Many of the subsequent examples in this @value{SECTION} use
20184the @code{return} statement.
20185
20186@cindex common extensions @subentry @code{func} keyword
20187@cindex extensions @subentry common @subentry @code{func} keyword
20188@c @cindex POSIX @command{awk}
20189@cindex @command{awk} @subentry language, POSIX version
20190@cindex POSIX @command{awk} @subentry @code{function} keyword in
20191In many @command{awk} implementations, including @command{gawk},
20192the keyword @code{function} may be
20193abbreviated @code{func}. @value{COMMONEXT}
20194However, POSIX only specifies the use of
20195the keyword @code{function}.  This actually has some practical implications.
20196If @command{gawk} is in POSIX-compatibility mode
20197(@pxref{Options}), then the following
20198statement does @emph{not} define a function:
20199
20200@example
20201func foo() @{ a = sqrt($1) ; print a @}
20202@end example
20203
20204@noindent
20205Instead, it defines a rule that, for each record, concatenates the value
20206of the variable @samp{func} with the return value of the function @samp{foo}.
20207If the resulting string is non-null, the action is executed.
20208This is probably not what is desired.  (@command{awk} accepts this input as
20209syntactically valid, because functions may be used before they are defined
20210in @command{awk} programs.@footnote{This program won't actually run,
20211because @code{foo()} is undefined.})
20212
20213@cindex portability @subentry functions, defining
20214To ensure that your @command{awk} programs are portable, always use the
20215keyword @code{function} when defining a function.
20216
20217@node Function Example
20218@subsection Function Definition Examples
20219@cindex function definition example
20220
20221Here is an example of a user-defined function, called @code{myprint()}, that
20222takes a number and prints it in a specific format:
20223
20224@example
20225function myprint(num)
20226@{
20227     printf "%6.3g\n", num
20228@}
20229@end example
20230
20231@noindent
20232To illustrate, here is an @command{awk} rule that uses our @code{myprint()}
20233function:
20234
20235@example
20236$3 > 0     @{ myprint($3) @}
20237@end example
20238
20239@noindent
20240This program prints, in our special format, all the third fields that
20241contain a positive number in our input.  Therefore, when given the following input:
20242
20243@example
20244 1.2   3.4    5.6   7.8
20245 9.10 11.12 -13.14 15.16
2024617.18 19.20  21.22 23.24
20247@end example
20248
20249@noindent
20250this program, using our function to format the results, prints:
20251
20252@example
20253   5.6
20254  21.2
20255@end example
20256
20257This function deletes all the elements in an array (recall that the
20258extra whitespace signifies the start of the local variable list):
20259
20260@example
20261@group
20262function delarray(a,    i)
20263@{
20264    for (i in a)
20265        delete a[i]
20266@}
20267@end group
20268@end example
20269
20270When working with arrays, it is often necessary to delete all the elements
20271in an array and start over with a new list of elements
20272(@pxref{Delete}).
20273Instead of having
20274to repeat this loop everywhere that you need to clear out
20275an array, your program can just call @code{delarray()}.
20276(This guarantees portability.  The use of @samp{delete @var{array}} to delete
20277the contents of an entire array is a relatively recent@footnote{Late in 2012.}
20278addition to the POSIX standard.)
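
As a quick check, the following sketch fills a small array, clears it
with @code{delarray()}, and then counts the elements that remain.
The array name @code{seen} and its contents are arbitrary:

@example
function delarray(a,    i)
@{
    for (i in a)
        delete a[i]
@}

BEGIN @{
    seen["apple"] = 1
    seen["pear"] = 1
    delarray(seen)

    # count whatever is left; nothing should be
    for (k in seen)
        n++
    print n + 0
@}
@end example

@noindent
The program prints @samp{0}, because the loop body never runs once
the array is empty.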
20279
20280The following is an example of a recursive function.  It takes a string
20281as an input parameter and returns the string in reverse order.
20282Recursive functions must always have a test that stops the recursion.
20283In this case, the recursion terminates when the input string is
20284already empty:
20285
20286@c 8/2014: Thanks to Mike Brennan for the improved formulation
20287@cindex @code{rev()} user-defined function
20288@cindex user-defined @subentry function @subentry @code{rev()}
20289@example
20290function rev(str)
20291@{
20292    if (str == "")
20293        return ""
20294
20295    return (rev(substr(str, 2)) substr(str, 1, 1))
20296@}
20297@end example
20298
20299If this function is in a file named @file{rev.awk}, it can be tested
20300this way:
20301
20302@example
20303$ @kbd{echo "Don't Panic!" |}
20304> @kbd{gawk -e '@{ print rev($0) @}' -f rev.awk}
20305@print{} !cinaP t'noD
20306@end example
20307
20308The C @code{ctime()} function takes a timestamp and returns it as a string,
20309formatted in a well-known fashion.
20310The following example uses the built-in @code{strftime()} function
20311(@pxref{Time Functions})
20312to create an @command{awk} version of @code{ctime()}:
20313
20314@cindex @code{ctime()} user-defined function
20315@cindex user-defined @subentry function @subentry @code{ctime()}
20316@example
20317@c file eg/lib/ctime.awk
20318# ctime.awk
20319#
20320# awk version of C ctime(3) function
20321
20322function ctime(ts,    format)
20323@{
20324    format = "%a %b %e %H:%M:%S %Z %Y"
20325
20326    if (ts == 0)
20327        ts = systime()       # use current time as default
20328    return strftime(format, ts)
20329@}
20330@c endfile
20331@end example
20332
20333You might think that @code{ctime()} could use @code{PROCINFO["strftime"]}
20334for its format string. That would be a mistake, because @code{ctime()} is
20335supposed to return the time formatted in a standard fashion, and user-level
20336code could have changed @code{PROCINFO["strftime"]}.
20337
20338@node Function Calling
20339@subsection Calling User-Defined Functions
20340
20341@cindex functions @subentry user-defined @subentry calling
20342@dfn{Calling a function} means causing the function to run and do its job.
20343A function call is an expression and its value is the value returned by
20344the function.
20345
20346@menu
20347* Calling A Function::          Don't use spaces.
20348* Variable Scope::              Controlling variable scope.
20349* Pass By Value/Reference::     Passing parameters.
20350* Function Caveats::            Other points to know about functions.
20351@end menu
20352
20353@node Calling A Function
20354@subsubsection Writing a Function Call
20355
20356A function call consists of the function name followed by the arguments
20357in parentheses.  @command{awk} expressions are what you write in the
20358call for the arguments.  Each time the call is executed, these
20359expressions are evaluated, and the values become the actual arguments.  For
20360example, here is a call to @code{foo()} with three arguments (the first
20361being a string concatenation):
20362
20363@example
20364foo(x y, "lose", 4 * z)
20365@end example
20366
20367@quotation CAUTION
20368Whitespace characters (spaces and TABs) are not allowed
20369between the function name and the opening parenthesis of the argument list.
20370If you write whitespace by mistake, @command{awk} might think that you mean
20371to concatenate a variable with an expression in parentheses.  However, it
20372notices that you used a function name and not a variable name, and reports
20373an error.
20374@end quotation
20375
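As a small illustration of this rule, the following sketch defines a
hypothetical function @code{double()} (the name is only an assumption made
for this example) and calls it correctly.  The commented-out line shows the
form to avoid:

@example
function double(n)
@{
    return 2 * n
@}

BEGIN @{
    print double(21)       # correct: no space before the parenthesis
    # print double (21)    # wrong: whitespace after the function name
@}
@end example

@noindent
As written, the program prints @samp{42}.
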
20376@node Variable Scope
20377@subsubsection Controlling Variable Scope
20378
20379@cindex local variables @subentry in a function
20380@cindex variables @subentry local to a function
20381Unlike in many languages,
20382there is no way to make a variable local to a @code{@{} @dots{} @code{@}} block in
20383@command{awk}, but you can make a variable local to a function. It is
20384good practice to do so whenever a variable is needed only in that
20385function.
20386
20387To make a variable local to a function, simply declare the variable as
20388an argument after the actual function arguments
20389(@pxref{Definition Syntax}).
20390Look at the following example, where variable
20391@code{i} is a global variable used by both functions @code{foo()} and
20392@code{bar()}:
20393
20394@example
20395function bar()
20396@{
20397    for (i = 0; i < 3; i++)
20398        print "bar's i=" i
20399@}
20400
20401function foo(j)
20402@{
20403    i = j + 1
20404    print "foo's i=" i
20405    bar()
20406    print "foo's i=" i
20407@}
20408
20409BEGIN @{
20410      i = 10
20411      print "top's i=" i
20412      foo(0)
20413      print "top's i=" i
20414@}
20415@end example
20416
Running this script produces the following, because every use of @code{i},
whether in @code{foo()}, in @code{bar()}, or at the top level, refers to
the same global variable:
20420
20421@example
20422top's i=10
20423foo's i=1
20424bar's i=0
20425bar's i=1
20426bar's i=2
20427foo's i=3
20428top's i=3
20429@end example
20430
20431If you want @code{i} to be local to both @code{foo()} and @code{bar()}, do as
20432follows (the extra space before @code{i} is a coding convention to
20433indicate that @code{i} is a local variable, not an argument):
20434
20435@example
20436function bar(    i)
20437@{
20438    for (i = 0; i < 3; i++)
20439        print "bar's i=" i
20440@}
20441
20442function foo(j,    i)
20443@{
20444    i = j + 1
20445    print "foo's i=" i
20446    bar()
20447    print "foo's i=" i
20448@}
20449
20450BEGIN @{
20451      i = 10
20452      print "top's i=" i
20453      foo(0)
20454      print "top's i=" i
20455@}
20456@end example
20457
20458Running the corrected script produces the following:
20459
20460@example
20461top's i=10
20462foo's i=1
20463bar's i=0
20464bar's i=1
20465bar's i=2
20466foo's i=1
20467top's i=10
20468@end example
20469
Besides scalar values (strings and numbers), you may also have
local arrays.  If you use a parameter name as an array, @command{awk}
treats it as an array that is local to the function.
In addition, each recursive call creates a new array.
Consider this example:
20475
20476@example
20477@group
20478function some_func(p1,      a)
20479@{
20480    if (p1++ > 3)
20481        return
20482@end group
20483
20484    a[p1] = p1
20485
20486    some_func(p1)
20487
20488    printf("At level %d, index %d %s found in a\n",
20489         p1, (p1 - 1), (p1 - 1) in a ? "is" : "is not")
20490    printf("At level %d, index %d %s found in a\n",
20491         p1, p1, p1 in a ? "is" : "is not")
20492    print ""
20493@}
20494
20495BEGIN @{
20496    some_func(1)
20497@}
20498@end example
20499
20500When run, this program produces the following output:
20501
20502@example
20503At level 4, index 3 is not found in a
20504At level 4, index 4 is found in a
20505
20506At level 3, index 2 is not found in a
20507At level 3, index 3 is found in a
20508
20509At level 2, index 1 is not found in a
20510At level 2, index 2 is found in a
20511@end example
20512
20513@node Pass By Value/Reference
20514@subsubsection Passing Function Arguments by Value Or by Reference
20515
20516In @command{awk}, when you declare a function, there is no way to
20517declare explicitly whether the arguments are passed @dfn{by value} or
20518@dfn{by reference}.
20519
20520Instead, the passing convention is determined at runtime when
20521the function is called, according to the following rule:
20522if the argument is an array variable, then it is passed by reference.
20523Otherwise, the argument is passed by value.
20524
20525@cindex call by value
20526Passing an argument by value means that when a function is called, it
20527is given a @emph{copy} of the value of this argument.
20528The caller may use a variable as the expression for the argument, but
20529the called function does not know this---it only knows what value the
20530argument had.  For example, if you write the following code:
20531
20532@example
20533foo = "bar"
20534z = myfunc(foo)
20535@end example
20536
20537@noindent
20538then you should not think of the argument to @code{myfunc()} as being
20539``the variable @code{foo}.''  Instead, think of the argument as the
20540string value @code{"bar"}.
20541If the function @code{myfunc()} alters the values of its local variables,
20542this has no effect on any other variables.  Thus, if @code{myfunc()}
20543does this:
20544
20545@example
20546@group
20547function myfunc(str)
20548@{
20549   print str
20550   str = "zzz"
20551   print str
20552@}
20553@end group
20554@end example
20555
20556@noindent
20557to change its first argument variable @code{str}, it does @emph{not}
20558change the value of @code{foo} in the caller.  The role of @code{foo} in
20559calling @code{myfunc()} ended when its value (@code{"bar"}) was computed.
20560If @code{str} also exists outside of @code{myfunc()}, the function body
20561cannot alter this outer value, because it is shadowed during the
20562execution of @code{myfunc()} and cannot be seen or changed from there.
20563
20564@cindex call by reference
20565@cindex arrays @subentry as parameters to functions
20566@cindex functions @subentry arrays as parameters to
20567However, when arrays are the parameters to functions, they are @emph{not}
20568copied.  Instead, the array itself is made available for direct manipulation
20569by the function.  This is usually termed @dfn{call by reference}.
20570Changes made to an array parameter inside the body of a function @emph{are}
20571visible outside that function.
20572
20573@quotation NOTE
20574Changing an array parameter inside a function
20575can be very dangerous if you do not watch what you are doing.
20576For example:
20577
20578@example
20579function changeit(array, ind, nvalue)
20580@{
20581     array[ind] = nvalue
20582@}
20583
20584BEGIN @{
20585    a[1] = 1; a[2] = 2; a[3] = 3
20586    changeit(a, 2, "two")
20587    printf "a[1] = %s, a[2] = %s, a[3] = %s\n",
20588            a[1], a[2], a[3]
20589@}
20590@end example
20591
20592@noindent
20593prints @samp{a[1] = 1, a[2] = two, a[3] = 3}, because
20594@code{changeit()} stores @code{"two"} in the second element of @code{a}.
20595@end quotation
20596
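The two conventions can be seen side by side in the following sketch.
The function @code{modify()} and the variable names are hypothetical,
chosen only for this illustration:

@example
function modify(s, arr)
@{
    s = "changed"           # changes only the local copy
    arr["key"] = "changed"  # changes the caller's array
@}

BEGIN @{
    str = "original"
    data["key"] = "original"
    modify(str, data)
    print str               # still "original"
    print data["key"]       # now "changed"
@}
@end example
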
20597@node Function Caveats
20598@subsubsection Other Points About Calling Functions
20599
20600@cindex undefined functions
20601@cindex functions @subentry undefined
20602Some @command{awk} implementations allow you to call a function that
20603has not been defined. They only report a problem at runtime, when the
20604program actually tries to call the function. For example:
20605
20606@example
20607BEGIN @{
20608    if (0)
20609        foo()
20610    else
20611        bar()
20612@}
20613function bar() @{ @dots{} @}
20614# note that `foo' is not defined
20615@end example
20616
@noindent
Because the condition of the @code{if} statement is never true,
@code{foo()} is never called, so it is not really a problem that it has
not been defined.  Usually, though, it is a
problem if a program calls an undefined function.
20621
20622@cindex lint checking @subentry undefined functions
20623If @option{--lint} is specified
20624(@pxref{Options}),
20625@command{gawk} reports calls to undefined functions.
20626
20627@cindex portability @subentry @code{next} statement in user-defined functions
20628Some @command{awk} implementations generate a runtime
20629error if you use either the @code{next} statement
20630or the @code{nextfile} statement
20631(@pxref{Next Statement}, and
20632@ifdocbook
20633@ref{Nextfile Statement})
20634@end ifdocbook
20635@ifnotdocbook
20636@pxref{Nextfile Statement})
20637@end ifnotdocbook
20638inside a user-defined function.
20639@command{gawk} does not have this limitation.
20640
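For instance, with @command{gawk} a helper function along the following
lines can skip unwanted records on behalf of the main rule.  This is only a
sketch; @code{skip_comments()} is a hypothetical name, and the program may
fail with the @command{awk} implementations mentioned above:

@example
# skip_comments --- skip records that start with `#'

function skip_comments()
@{
    if ($0 ~ /^#/)
        next    # abandon the current record from inside the function
@}

@{
    skip_comments()
    print
@}
@end example
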
20641You can call a function and pass it more parameters than it was declared
20642with, like so:
20643
20644@example
20645function foo(p1, p2)
20646@{
20647    @dots{}
20648@}
20649
20650BEGIN @{
20651    foo(1, 2, 3, 4)
20652@}
20653@end example
20654
20655Doing so is bad practice, however.  The called function cannot do
20656anything with the additional values being passed to it, so @command{awk}
20657evaluates the expressions but then just throws them away.
20658
20659More importantly, such a call is confusing for whoever will next read your
20660program.@footnote{Said person might even be you, sometime in the future,
20661at which point you will wonder, ``what was I thinking?!?''} Function
20662parameters generally are input items that influence the computation
20663performed by the function.  Calling a function with more parameters than
20664it accepts gives the false impression that those values are important
20665to the function, when in fact they are not.
20666
20667Because this is such a bad practice, @command{gawk} @emph{unconditionally}
20668issues a warning whenever it executes such a function call.  (If you
20669don't like the warning, fix your code!  It's incorrect, after all.)
20670
20671@node Return Statement
20672@subsection The @code{return} Statement
20673@cindex @code{return} statement, user-defined functions
20674
20675As seen in several earlier examples,
20676the body of a user-defined function can contain a @code{return} statement.
20677This statement returns control to the calling part of the @command{awk} program.  It
20678can also be used to return a value for use in the rest of the @command{awk}
20679program.  It looks like this:
20680
20681@display
20682@code{return} [@var{expression}]
20683@end display
20684
20685The @var{expression} part is optional.
20686Due most likely to an oversight, POSIX does not define what the return
20687value is if you omit the @var{expression}.  Technically speaking, this
20688makes the returned value undefined, and therefore, unpredictable.
20689In practice, though, all versions of @command{awk} simply return the
20690null string, which acts like zero if used in a numeric context.
20691
20692A @code{return} statement without an @var{expression} is assumed at the end of
20693every function definition.  So, if control reaches the end of the function
20694body, then technically the function returns an unpredictable value.
20695In practice, it returns the empty string.  @command{awk}
20696does @emph{not} warn you if you use the return value of such a function.
20697
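The following sketch shows this behavior in practice.  Both function
names are hypothetical, and the result described is what implementations
do in practice rather than something POSIX guarantees:

@example
function bare_return()
@{
    return
@}

function falls_off_end()
@{
@}

BEGIN @{
    x = bare_return()
    y = falls_off_end()
    printf "[%s] [%s]\n", x, y    # both values are empty strings
    print x + 0, y + 0            # both act like zero in arithmetic
@}
@end example

@noindent
With @command{gawk}, this prints @samp{[] []} followed by @samp{0 0}.
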
20698Sometimes, you want to write a function for what it does, not for
20699what it returns.  Such a function corresponds to a @code{void} function
20700in C, C++, or Java, or to a @code{procedure} in Ada.  Thus, it may be appropriate to not
20701return any value; simply bear in mind that you should not be using the
20702return value of such a function.
20703
20704The following is an example of a user-defined function that returns a value
20705for the largest number among the elements of an array:
20706
20707@example
20708function maxelt(vec,   i, ret)
20709@{
20710     for (i in vec) @{
20711          if (ret == "" || vec[i] > ret)
20712               ret = vec[i]
20713     @}
20714     return ret
20715@}
20716@end example
20717
20718@cindex programming conventions @subentry function parameters
20719@noindent
You call @code{maxelt()} with one argument, which is an array name.  The local
variables @code{i} and @code{ret} are not intended to be arguments.
Nothing stops you from passing more than one argument
to @code{maxelt()}, but the results would be strange.  The extra space before
@code{i} in the function parameter list indicates that @code{i} and
@code{ret} are local variables.
You should follow this convention when defining functions.
20727
20728The following program uses the @code{maxelt()} function.  It loads an
20729array, calls @code{maxelt()}, and then reports the maximum number in that
20730array:
20731
20732@example
20733function maxelt(vec,   i, ret)
20734@{
20735     for (i in vec) @{
20736          if (ret == "" || vec[i] > ret)
20737               ret = vec[i]
20738     @}
20739     return ret
20740@}
20741
20742@group
20743# Load all fields of each record into nums.
20744@{
20745     for(i = 1; i <= NF; i++)
20746          nums[NR, i] = $i
20747@}
20748@end group
20749
20750END @{
20751     print maxelt(nums)
20752@}
20753@end example
20754
20755Given the following input:
20756
20757@example
20758 1 5 23 8 16
2075944 3 5 2 8 26
20760256 291 1396 2962 100
20761-6 467 998 1101
2076299385 11 0 225
20763@end example
20764
20765@noindent
20766the program reports (predictably) that 99,385 is the largest value
20767in the array.
20768
20769@node Dynamic Typing
20770@subsection Functions and Their Effects on Variable Typing
20771
20772@command{awk} is a very fluid language.
20773It is possible that @command{awk} can't tell if an identifier
20774represents a scalar variable or an array until runtime.
20775Here is an annotated sample program:
20776
20777@example
20778function foo(a)
20779@{
20780    a[1] = 1   # parameter is an array
20781@}
20782
20783BEGIN @{
20784    b = 1
20785    foo(b)  # invalid: fatal type mismatch
20786
20787    foo(x)  # x uninitialized, becomes an array dynamically
20788    x = 1   # now not allowed, runtime error
20789@}
20790@end example
20791
20792In this example, the first call to @code{foo()} generates
20793a fatal error, so @command{awk} will not report the second
20794error. If you comment out that call, though, then @command{awk}
20795does report the second error.
20796
20797Usually, such things aren't a big issue, but it's worth
20798being aware of them.
20799
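With a sufficiently recent @command{gawk}, the @code{typeof()} function
offers one way to watch a variable acquire its type.  The following is only
a sketch; the function name @code{fill()} is an assumption made for this
example:

@example
function fill(arr)
@{
    arr[1] = 1      # parameter is used as an array
@}

BEGIN @{
    print typeof(x)     # x has not been used yet
    fill(x)             # x becomes an array dynamically
    print typeof(x)
@}
@end example

@noindent
With @command{gawk} @value{VERSION}, this should print @samp{untyped}
and then @samp{array}.
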
20800@node Indirect Calls
20801@section Indirect Function Calls
20802
20803@cindex indirect function calls
20804@cindex function calls @subentry indirect
20805@cindex function pointers
20806@cindex pointers to functions
20807@cindex differences in @command{awk} and @command{gawk} @subentry indirect function calls
20808
20809This section describes an advanced, @command{gawk}-specific extension.
20810
20811Often, you may wish to defer the choice of function to call until runtime.
20812For example, you may have different kinds of records, each of which
20813should be processed differently.
20814
20815Normally, you would have to use a series of @code{if}-@code{else}
20816statements to decide which function to call.  By using @dfn{indirect}
20817function calls, you can specify the name of the function to call as a
20818string variable, and then call the function.  Let's look at an example.
20819
20820Suppose you have a file with your test scores for the classes you
20821are taking, and
20822you wish to get the sum and the average of
20823your test scores.
20824The first field is the class name. The following fields
20825are the functions to call to process the data, up to a ``marker''
20826field @samp{data:}.  Following the marker, to the end of the record,
20827are the various numeric test scores.
20828
20829Here is the initial file:
20830
20831@example
20832@c file eg/data/class_data1
20833Biology_101 sum average data: 87.0 92.4 78.5 94.9
20834Chemistry_305 sum average data: 75.2 98.3 94.7 88.2
20835English_401 sum average data: 100.0 95.6 87.1 93.4
20836@c endfile
20837@end example
20838
20839To process the data, you might write initially:
20840
20841@example
20842@{
20843    class = $1
20844    for (i = 2; $i != "data:"; i++) @{
20845        if ($i == "sum")
20846            sum()   # processes the whole record
20847        else if ($i == "average")
20848            average()
20849        @dots{}           # and so on
20850    @}
20851@}
20852@end example
20853
20854@noindent
20855This style of programming works, but can be awkward.  With @dfn{indirect}
20856function calls, you tell @command{gawk} to use the @emph{value} of a
20857variable as the @emph{name} of the function to call.
20858
20859@cindex @code{@@} (at-sign) @subentry @code{@@}-notation for indirect function calls
20860@cindex at-sign (@code{@@}) @subentry @code{@@}-notation for indirect function calls
20861@cindex indirect function calls @subentry @code{@@}-notation
20862@cindex function calls @subentry indirect @subentry @code{@@}-notation for
20863The syntax is similar to that of a regular function call: an identifier
20864immediately followed by an opening parenthesis, any arguments, and then
20865a closing parenthesis, with the addition of a leading @samp{@@}
20866character:
20867
20868@example
20869the_func = "sum"
20870result = @@the_func()   # calls the sum() function
20871@end example
20872
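A complete, if minimal, program along these lines might look as follows.
The functions @code{hello()} and @code{goodbye()} are hypothetical and
exist only for this sketch:

@example
function hello(name)
@{
    return "Hello, " name
@}

function goodbye(name)
@{
    return "Goodbye, " name
@}

BEGIN @{
    the_func = "hello"
    print @@the_func("world")    # calls hello()
    the_func = "goodbye"
    print @@the_func("world")    # calls goodbye()
@}
@end example

@noindent
This prints @samp{Hello, world} and then @samp{Goodbye, world}.
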
20873Here is a full program that processes the previously shown data,
20874using indirect function calls:
20875
20876@example
20877@c file eg/prog/indirectcall.awk
20878# indirectcall.awk --- Demonstrate indirect function calls
20879@c endfile
20880@ignore
20881@c file eg/prog/indirectcall.awk
20882#
20883# Arnold Robbins, arnold@@skeeve.com, Public Domain
20884# January 2009
20885@c endfile
20886@end ignore
20887
20888@c file eg/prog/indirectcall.awk
20889# average --- return the average of the values in fields $first - $last
20890
20891function average(first, last,   sum, i)
20892@{
20893    sum = 0;
20894    for (i = first; i <= last; i++)
20895        sum += $i
20896
20897    return sum / (last - first + 1)
20898@}
20899
20900# sum --- return the sum of the values in fields $first - $last
20901
20902function sum(first, last,   ret, i)
20903@{
20904    ret = 0;
20905    for (i = first; i <= last; i++)
20906        ret += $i
20907
20908    return ret
20909@}
20910@c endfile
20911@end example
20912
These two functions expect to work on fields; thus, the parameters
@code{first} and @code{last} indicate where in the fields to start and end.
Otherwise, they perform the expected computations and are not unusual.
The rule that drives them comes next:
20916
20917@example
20918@c file eg/prog/indirectcall.awk
20919# For each record, print the class name and the requested statistics
20920@{
20921    class_name = $1
20922    gsub(/_/, " ", class_name)  # Replace _ with spaces
20923
20924    # find start
20925    for (i = 1; i <= NF; i++) @{
20926        if ($i == "data:") @{
20927            start = i + 1
20928            break
20929        @}
20930    @}
20931
20932    printf("%s:\n", class_name)
20933    for (i = 2; $i != "data:"; i++) @{
20934        the_function = $i
20935        printf("\t%s: <%s>\n", $i, @@the_function(start, NF) "")
20936    @}
20937    print ""
20938@}
20939@c endfile
20940@end example
20941
20942This is the main processing for each record. It prints the class name (with
20943underscores replaced with spaces). It then finds the start of the actual data,
20944saving it in @code{start}.
20945The last part of the code loops through each function name (from @code{$2} up to
20946the marker, @samp{data:}), calling the function named by the field. The indirect
20947function call itself occurs as a parameter in the call to @code{printf}.
20948(The @code{printf} format string uses @samp{%s} as the format specifier so that we
20949can use functions that return strings, as well as numbers. Note that the result
20950from the indirect call is concatenated with the empty string, in order to force
20951it to be a string value.)
20952
20953Here is the result of running the program:
20954
20955@example
20956$ @kbd{gawk -f indirectcall.awk class_data1}
20957@print{} Biology 101:
20958@print{}     sum: <352.8>
20959@print{}     average: <88.2>
20960@print{}
20961@print{} Chemistry 305:
20962@print{}     sum: <356.4>
20963@print{}     average: <89.1>
20964@print{}
20965@print{} English 401:
20966@print{}     sum: <376.1>
20967@print{}     average: <94.025>
20968@end example
20969
20970The ability to use indirect function calls is more powerful than you may
20971think at first.  The C and C++ languages provide ``function pointers,'' which
20972are a mechanism for calling a function chosen at runtime.  One of the most
20973well-known uses of this ability is the C @code{qsort()} function, which sorts
20974an array using the famous ``quicksort'' algorithm
20975(see @uref{https://en.wikipedia.org/wiki/Quicksort, the Wikipedia article}
20976for more information).  To use this function, you supply a pointer to a comparison
20977function.  This mechanism allows you to sort arbitrary data in an arbitrary
20978fashion.
20979
20980We can do something similar using @command{gawk}, like this:
20981
20982@example
20983@c file eg/lib/quicksort.awk
20984# quicksort.awk --- Quicksort algorithm, with user-supplied
20985#                   comparison function
20986@c endfile
20987@ignore
20988@c file eg/lib/quicksort.awk
20989#
20990# Arnold Robbins, arnold@@skeeve.com, Public Domain
20991# January 2009
20992
20993@c endfile
20994@end ignore
20995@c file eg/lib/quicksort.awk
20996
20997# quicksort --- C.A.R. Hoare's quicksort algorithm. See Wikipedia
20998#               or almost any algorithms or computer science text.
20999@c endfile
21000@ignore
21001@c file eg/lib/quicksort.awk
21002#
21003# Adapted from K&R-II, page 110
21004@c endfile
21005@end ignore
21006@c file eg/lib/quicksort.awk
21007
21008function quicksort(data, left, right, less_than,    i, last)
21009@{
21010    if (left >= right)  # do nothing if array contains fewer
21011        return          # than two elements
21012
21013    quicksort_swap(data, left, int((left + right) / 2))
21014    last = left
21015    for (i = left + 1; i <= right; i++)
21016        if (@@less_than(data[i], data[left]))
21017            quicksort_swap(data, ++last, i)
21018    quicksort_swap(data, left, last)
21019    quicksort(data, left, last - 1, less_than)
21020    quicksort(data, last + 1, right, less_than)
21021@}
21022
21023# quicksort_swap --- helper function for quicksort, should really be inline
21024
21025function quicksort_swap(data, i, j,      temp)
21026@{
21027    temp = data[i]
21028    data[i] = data[j]
21029    data[j] = temp
21030@}
21031@c endfile
21032@end example
21033
21034The @code{quicksort()} function receives the @code{data} array, the starting and ending
21035indices to sort (@code{left} and @code{right}), and the name of a function that
21036performs a ``less than'' comparison.  It then implements the quicksort algorithm.
21037
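As a quick, standalone illustration, the following sketch sorts an array of
words.  The comparison function @code{str_lt()} and the file name
@file{sortwords.awk} are assumptions made for this example:

@example
# sortwords.awk --- sort words using quicksort() and a string comparison

function str_lt(left, right)
@{
    return (("" left) < ("" right))   # force a string comparison
@}

BEGIN @{
    n = split("pear apple fig banana", words)
    quicksort(words, 1, n, "str_lt")
    for (i = 1; i <= n; i++)
        printf("%s%s", words[i], (i < n ? " " : "\n"))
@}
@end example

@noindent
Running @samp{gawk -f quicksort.awk -f sortwords.awk} should print
@samp{apple banana fig pear}.
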
21038To make use of the sorting function, we return to our previous example. The
21039first thing to do is write some comparison functions:
21040
21041@example
21042@c file eg/prog/indirectcall.awk
21043@group
21044# num_lt --- do a numeric less than comparison
21045
21046function num_lt(left, right)
21047@{
21048    return ((left + 0) < (right + 0))
21049@}
21050@end group
21051
21052# num_ge --- do a numeric greater than or equal to comparison
21053
21054function num_ge(left, right)
21055@{
21056    return ((left + 0) >= (right + 0))
21057@}
21058@c endfile
21059@end example
21060
21061The @code{num_ge()} function is needed to perform a descending sort; when used
21062to perform a ``less than'' test, it actually does the opposite (greater than
21063or equal to), which yields data sorted in descending order.
21064
21065Next comes a sorting function.  It is parameterized with the starting and
21066ending field numbers and the comparison function. It builds an array with
21067the data and calls @code{quicksort()} appropriately, and then formats the
21068results as a single string:
21069
21070@example
21071@c file eg/prog/indirectcall.awk
21072# do_sort --- sort the data according to `compare'
21073#             and return it as a string
21074
21075function do_sort(first, last, compare,      data, i, retval)
21076@{
21077    delete data
21078    for (i = 1; first <= last; first++) @{
21079        data[i] = $first
21080        i++
21081    @}
21082
21083    quicksort(data, 1, i-1, compare)
21084
21085    retval = data[1]
21086    for (i = 2; i in data; i++)
21087        retval = retval " " data[i]
21088
21089    return retval
21090@}
21091@c endfile
21092@end example
21093
21094Finally, the two sorting functions call @code{do_sort()}, passing in the
21095names of the two comparison functions:
21096
21097@example
21098@c file eg/prog/indirectcall.awk
21099@group
21100# sort --- sort the data in ascending order and return it as a string
21101
21102function sort(first, last)
21103@{
21104    return do_sort(first, last, "num_lt")
21105@}
21106@end group
21107
21108@group
21109# rsort --- sort the data in descending order and return it as a string
21110
21111function rsort(first, last)
21112@{
21113    return do_sort(first, last, "num_ge")
21114@}
21115@end group
21116@c endfile
21117@end example
21118
21119Here is an extended version of the @value{DF}:
21120
21121@example
21122@c file eg/data/class_data2
21123Biology_101 sum average sort rsort data: 87.0 92.4 78.5 94.9
21124Chemistry_305 sum average sort rsort data: 75.2 98.3 94.7 88.2
21125English_401 sum average sort rsort data: 100.0 95.6 87.1 93.4
21126@c endfile
21127@end example
21128
Here are the results when the enhanced program is run:
21130
21131@example
21132$ @kbd{gawk -f quicksort.awk -f indirectcall.awk class_data2}
21133@print{} Biology 101:
21134@print{}     sum: <352.8>
21135@print{}     average: <88.2>
21136@print{}     sort: <78.5 87.0 92.4 94.9>
21137@print{}     rsort: <94.9 92.4 87.0 78.5>
21138@print{}
21139@print{} Chemistry 305:
21140@print{}     sum: <356.4>
21141@print{}     average: <89.1>
21142@print{}     sort: <75.2 88.2 94.7 98.3>
21143@print{}     rsort: <98.3 94.7 88.2 75.2>
21144@print{}
21145@print{} English 401:
21146@print{}     sum: <376.1>
21147@print{}     average: <94.025>
21148@print{}     sort: <87.1 93.4 95.6 100.0>
21149@print{}     rsort: <100.0 95.6 93.4 87.1>
21150@end example
21151
Another example where indirect function calls are useful can be found in
processing arrays. This is described in @ref{Walking Arrays}.
21154
21155Remember that you must supply a leading @samp{@@} in front of an indirect function call.
21156
21157Starting with @value{PVERSION} 4.1.2 of @command{gawk}, indirect function
21158calls may also be used with built-in functions and with extension functions
21159(@pxref{Dynamic Extensions}). There are some limitations when calling
21160built-in functions indirectly, as follows.
21161
21162@itemize @value{BULLET}
21163@item
21164You cannot pass a regular expression constant to a built-in function
21165through an indirect function call.@footnote{This may change in a future
21166version; recheck the documentation that comes with your version of
21167@command{gawk} to see if it has.} This applies to the @code{sub()},
21168@code{gsub()}, @code{gensub()}, @code{match()}, @code{split()} and
21169@code{patsplit()} functions.
21170
21171@item
21172If calling @code{sub()} or @code{gsub()}, you may only pass two arguments,
21173since those functions are unusual in that they update their third argument.
21174This means that @code{$0} will be updated.
21175@end itemize
21176
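The following sketch illustrates both the feature and these limitations.
It assumes @command{gawk} 4.1.2 or later; the variable name @code{fn} is
arbitrary:

@example
BEGIN @{
    fn = "toupper"
    print @@fn("hello")         # calls the built-in toupper()

    $0 = "aaa"
    fn = "gsub"
    n = @@fn("a", "b")          # two arguments only; operates on $0
    print n, $0
@}
@end example

@noindent
This should print @samp{HELLO} and then @samp{3 bbb}.  Note that the
pattern is given as the string @code{"a"}, not as the regexp constant
@code{/a/}.
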
21177@command{gawk} does its best to make indirect function calls efficient.
21178For example, in the following case:
21179
21180@example
21181for (i = 1; i <= n; i++)
21182    @@the_func()
21183@end example
21184
21185@noindent
21186@command{gawk} looks up the actual function to call only once.
21187
21188@node Functions Summary
21189@section Summary
21190
21191@itemize @value{BULLET}
21192@item
21193@command{awk} provides built-in functions and lets you define your own
21194functions.
21195
21196@item
POSIX @command{awk} provides three kinds of built-in functions: numeric,
string, and I/O.  @command{gawk} provides functions that sort arrays, work
with values representing time, do bit manipulation, determine variable
type (array versus scalar), and internationalize and localize programs.
@command{gawk} also provides several extensions to some of the standard
functions, typically in the form of additional arguments.
21203
21204@item
21205Functions accept zero or more arguments and return a value.  The
21206expressions that provide the argument values are completely evaluated
21207before the function is called. Order of evaluation is not defined.
21208The return value can be ignored.
21209
21210@item
21211The handling of backslash in @code{sub()} and @code{gsub()} is not simple.
21212It is more straightforward in @command{gawk}'s @code{gensub()} function,
21213but that function still requires care in its use.
21214
21215@item
User-defined functions provide important capabilities but come with
some syntactic inelegancies. In a function call, there cannot be any
space between the function name and the opening parenthesis of the
argument list.  Also, there is no provision for local variables, so the
convention is to add extra parameters, and to separate them visually
from the real parameters by extra whitespace.
21222
21223@item
21224User-defined functions may call other user-defined (and built-in)
21225functions and may call themselves recursively. Function parameters
21226``hide'' any global variables of the same names.
21227You cannot use the name of a reserved variable (such as @code{ARGC})
21228as the name of a parameter in user-defined functions.
21229
21230@item
21231Scalar values are passed to user-defined functions by value. Array
21232parameters are passed by reference; any changes made by the function to
21233array parameters are thus visible after the function has returned.
21234
21235@item
21236Use the @code{return} statement to return from a user-defined function.
21237An optional expression becomes the function's return value.  Only scalar
21238values may be returned by a function.
21239
21240@item
21241If a variable that has never been used is passed to a user-defined
21242function, how that function treats the variable can set its nature:
21243either scalar or array.
21244
21245@item
21246@command{gawk} provides indirect function calls using a special syntax.
21247By setting a variable to the name of a function, you can
21248determine at runtime what function will be called at that point in the
21249program. This is equivalent to function pointers in C and C++.
21250
21251@end itemize
21252
21253
21254@ifnotinfo
21255@part @value{PART2}Problem Solving with @command{awk}
21256@end ifnotinfo
21257
21258@ifdocbook
21259Part II shows how to use @command{awk} and @command{gawk} for problem solving.
21260There is lots of code here for you to read and learn from.
21261It contains the following chapters:
21262
21263@itemize @value{BULLET}
21264@item
21265@ref{Library Functions}
21266
21267@item
21268@ref{Sample Programs}
21269@end itemize
21270@end ifdocbook
21271
21272@node Library Functions
21273@chapter A Library of @command{awk} Functions
21274@cindex libraries of @command{awk} functions
21275@cindex functions @subentry library
21276@cindex functions @subentry user-defined @subentry library of
21277
21278@ref{User-defined} describes how to write
21279your own @command{awk} functions.  Writing functions is important, because
21280it allows you to encapsulate algorithms and program tasks in a single
21281place.  It simplifies programming, making program development more
21282manageable and making programs more readable.
21283
21284@cindex Kernighan, Brian @subentry quotes
21285@cindex Plauger, P.J.@:
21286In their seminal 1976 book, @cite{Software Tools},@footnote{Sadly, over 35
21287years later, many of the lessons taught by this book have yet to be
21288learned by a vast number of practicing programmers.} Brian Kernighan
21289and P.J.@: Plauger wrote:
21290
21291@quotation
21292Good Programming is not learned from generalities, but by seeing how
21293significant programs can be made clean, easy to read, easy to maintain and
21294modify, human-engineered, efficient and reliable, by the application of
21295common sense and good programming practices.  Careful study and imitation
21296of good programs leads to better writing.
21297@end quotation
21298
21299In fact, they felt this idea was so important that they placed this
21300statement on the cover of their book.  Because we believe strongly
21301that their statement is correct, this @value{CHAPTER} and @ref{Sample
21302Programs}, provide a good-sized body of code for you to read and, we hope,
21303to learn from.
21304
21305This @value{CHAPTER} presents a library of useful @command{awk} functions.
21306Many of the sample programs presented later in this @value{DOCUMENT}
21307use these functions.
21308The functions are presented here in a progression from simple to complex.
21309
21310@cindex Texinfo
21311@ref{Extract Program}
21312presents a program that you can use to extract the source code for
21313these example library functions and programs from the Texinfo source
21314for this @value{DOCUMENT}.
21315(This has already been done as part of the @command{gawk} distribution.)
21316
21317@ifclear FOR_PRINT
21318If you have written one or more useful, general-purpose @command{awk} functions
21319and would like to contribute them to the @command{awk} user community, see
21320@ref{How To Contribute}, for more information.
21321@end ifclear
21322
21323@cindex portability @subentry example programs
21324The programs in this @value{CHAPTER} and in
21325@ref{Sample Programs},
21326freely use @command{gawk}-specific features.
21327Rewriting these programs for different implementations of @command{awk}
21328is pretty straightforward:
21329
21330@itemize @value{BULLET}
21331@item
21332Diagnostic error messages are sent to @file{/dev/stderr}.
21333Use @samp{| "cat 1>&2"} instead of @samp{> "/dev/stderr"} if your system
21334does not have a @file{/dev/stderr}, or if you cannot use @command{gawk}.
21335
21336@item
21337A number of programs use @code{nextfile}
21338(@pxref{Nextfile Statement})
21339to skip any remaining input in the input file.
21340
21341@item
21342@c 12/2000: Thanks to Nelson Beebe for pointing out the output issue.
21343@cindex case sensitivity @subentry example programs
21344@cindex @code{IGNORECASE} variable @subentry in example programs
21345Finally, some of the programs choose to ignore upper- and lowercase
21346distinctions in their input. They do so by assigning one to @code{IGNORECASE}.
21347You can achieve almost the same effect@footnote{The effects are
21348not identical.  Output of the transformed
21349record will be in all lowercase, while @code{IGNORECASE} preserves the original
21350contents of the input record.} by adding the following rule to the
21351beginning of the program:
21352
21353@example
21354# ignore case
21355@{ $0 = tolower($0) @}
21356@end example
21357
21358@noindent
21359Also, verify that all regexp and string constants used in
21360comparisons use only lowercase letters.
21361@end itemize
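
As a minimal sketch of the first point above, diagnostics can be routed
through @command{cat} so that they still reach standard error on systems
without @file{/dev/stderr}.  The function name @code{warn()} is hypothetical
and is not part of the library in this @value{CHAPTER}:

@example
# warn --- hypothetical helper: print a message on standard error
function warn(msg)
@{
    print msg | "cat 1>&2"
    close("cat 1>&2")    # flush now so messages are not delayed
@}

BEGIN @{ warn("warning: something looks odd") @}
@end example

Closing the pipe after each message is one way to keep diagnostics from
sitting in a buffer; it costs one @command{cat} process per message.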
21362
21363@menu
21364* Library Names::               How to best name private global variables in
21365                                library functions.
21366* General Functions::           Functions that are of general use.
21367* Data File Management::        Functions for managing command-line data
21368                                files.
21369* Getopt Function::             A function for processing command-line
21370                                arguments.
21371* Passwd Functions::            Functions for getting user information.
21372* Group Functions::             Functions for getting group information.
21373* Walking Arrays::              A function to walk arrays of arrays.
21374* Library Functions Summary::   Summary of library functions.
21375* Library Exercises::           Exercises.
21376@end menu
21377
21378@node Library Names
21379@section Naming Library Function Global Variables
21380
21381@cindex names @subentry arrays/variables
21382@cindex names @subentry functions
21383@cindex naming issues
21384@cindex @command{awk} programs @subentry documenting
21385@cindex documentation @subentry of @command{awk} programs
21386Due to the way the @command{awk} language evolved, variables are either
21387@dfn{global} (usable by the entire program) or @dfn{local} (usable just by
21388a specific function).  There is no intermediate state analogous to
21389@code{static} variables in C.
21390
21391@cindex variables @subentry global @subentry for library functions
21392@cindex private variables
21393@cindex variables @subentry private
21394Library functions often need to have global variables that they can use to
21395preserve state information between calls to the function---for example,
21396@code{getopt()}'s variable @code{_opti}
21397(@pxref{Getopt Function}).
21398Such variables are called @dfn{private}, as the only functions that need to
21399use them are the ones in the library.
21400
21401When writing a library function, you should try to choose names for your
21402private variables that will not conflict with any variables used by
21403either another library function or a user's main program.  For example, a
21404name like @code{i} or @code{j} is not a good choice, because user programs
21405often use variable names like these for their own purposes.
21406
21407@cindex programming conventions @subentry private variable names
21408The example programs shown in this @value{CHAPTER} all start the names of their
21409private variables with an underscore (@samp{_}).  Users generally don't use
21410leading underscores in their variable names, so this convention immediately
21411decreases the chances that the variable names will be accidentally shared
21412with the user's program.
21413
21414@cindex @code{_} (underscore) @subentry in names of private variables
21415@cindex underscore (@code{_}) @subentry in names of private variables
21416In addition, several of the library functions use a prefix that helps
21417indicate what function or set of functions use the variables---for example,
21418@code{_pw_byname()} in the user database routines
21419(@pxref{Passwd Functions}).
21420This convention is recommended, as it even further decreases the
21421chance of inadvertent conflict among variable names.  Note that this
convention can be used equally well for variable names and for private
21423function names.@footnote{Although all the library routines could have
21424been rewritten to use this convention, this was not done, in order to
21425show how our own @command{awk} programming style has evolved and to
21426provide some basis for this discussion.}
21427
21428As a final note on variable naming, if a function makes global variables
21429available for use by a main program, it is a good convention to start those
21430variables' names with a capital letter---for
21431example, @code{getopt()}'s @code{Opterr} and @code{Optind} variables
21432(@pxref{Getopt Function}).
21433The leading capital letter indicates that it is global, while the fact that
21434the variable name is not all capital letters indicates that the variable is
21435not one of @command{awk}'s predefined variables, such as @code{FS}.
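
The following sketch pulls these conventions together.  The library and
all of its names (@code{seq_next()}, @code{_seq_last}, and @code{Seq_step})
are hypothetical, chosen only for illustration:

@example
# Hypothetical library illustrating the naming conventions

BEGIN @{ Seq_step = 1 @}    # public: the main program may change it

function seq_next()
@{
    # _seq_last is private state preserved between calls
    _seq_last += Seq_step
    return _seq_last
@}

BEGIN @{
    print seq_next()    # prints 1
    Seq_step = 10
    print seq_next()    # prints 11
@}
@end example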
21436
21437@cindex @option{--dump-variables} option @subentry using for library functions
21438It is also important that @emph{all} variables in library
21439functions that do not need to save state are, in fact, declared
21440local.@footnote{@command{gawk}'s @option{--dump-variables} command-line
21441option is useful for verifying this.} If this is not done, the variables
21442could accidentally be used in the user's program, leading to bugs that
21443are very difficult to track down:
21444
21445@example
21446function lib_func(x, y,    l1, l2)
21447@{
21448    @dots{}
21449    # some_var should be local but by oversight is not
21450    @var{use variable} some_var
21451    @dots{}
21452@}
21453@end example
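
To make the danger concrete, here is a small hypothetical example (the
function @code{count_vowels()} is not part of the library).  The function
forgets to declare @code{i} as local, so it silently changes the @code{i}
used by the caller's loop:

@example
function count_vowels(s,    n)
@{
    n = 0
    # i should have been declared local, but was not
    for (i = 1; i <= length(s); i++)
        if (substr(s, i, 1) ~ /[aeiou]/)
            n++
    return n
@}

BEGIN @{
    for (i = 1; i <= 3; i++) @{
        print "iteration", i
        count_vowels("banana")
    @}
@}
@end example

The loop in the @code{BEGIN} rule is meant to run three times, but it
runs only once: the call leaves the global @code{i} set to seven.
Adding @code{i} to the function's parameter list as an extra local
fixes the problem.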
21454
21455@cindex arrays @subentry associative @subentry library functions and
21456@cindex libraries of @command{awk} functions @subentry associative arrays and
21457@cindex functions @subentry library @subentry associative arrays and
21458@cindex Tcl
21459A different convention, common in the Tcl community, is to use a single
21460associative array to hold the values needed by the library function(s), or
21461``package.''  This significantly decreases the number of actual global names
21462in use.  For example, the functions described in
21463@ref{Passwd Functions}
21464might have used array elements @code{@w{PW_data["inited"]}}, @code{@w{PW_data["total"]}},
21465@code{@w{PW_data["count"]}}, and @code{@w{PW_data["awklib"]}}, instead of
21466@code{@w{_pw_inited}}, @code{@w{_pw_awklib}}, @code{@w{_pw_total}},
21467and @code{@w{_pw_count}}.
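
As a rough sketch of this style, the following hypothetical stack package
keeps all of its state, including the current depth, in the single array
@code{STACK_data}.  None of these names come from the library in this
@value{CHAPTER}:

@example
function stack_push(val)
@{
    STACK_data["depth"]++
    STACK_data[STACK_data["depth"]] = val
@}

function stack_pop(    top, val)
@{
    top = STACK_data["depth"]
    if (top < 1)
        return ""         # empty stack
    val = STACK_data[top]
    delete STACK_data[top]
    STACK_data["depth"] = top - 1
    return val
@}

BEGIN @{
    stack_push("a")
    stack_push("b")
    first = stack_pop()
    second = stack_pop()
    print first, second    # prints: b a
@}
@end example

Only one global name, @code{STACK_data}, is visible to the rest of the
program, no matter how many values the package needs to remember.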
21468
21469The conventions presented in this @value{SECTION} are exactly
21470that: conventions. You are not required to write your programs this
21471way---we merely recommend that you do so.
21472
21473Beginning with @value{PVERSION} 5.0, @command{gawk} provides
21474a powerful mechanism for solving the problems described in this
21475section: @dfn{namespaces}.  Namespaces and their use are described
21476in detail in @ref{Namespaces}.
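
For a first impression, the following sketch assumes @command{gawk} 5.0
or later.  The namespace @code{seq} and the names inside it are
hypothetical:

@example
@@namespace "seq"

BEGIN @{ last = 0 @}       # this is seq::last

function next_val()       # this is seq::next_val()
@{
    return ++last
@}

@@namespace "awk"

BEGIN @{
    last = "untouched"    # awk::last, a different variable
    a = seq::next_val()
    b = seq::next_val()
    print a, b, last      # prints: 1 2 untouched
@}
@end example

The library's @code{last} and the main program's @code{last} no longer
collide, without any need for underscores or prefixes.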
21477
21478@node General Functions
21479@section General Programming
21480
21481This @value{SECTION} presents a number of functions that are of general
21482programming use.
21483
21484@menu
21485* Strtonum Function::           A replacement for the built-in
21486                                @code{strtonum()} function.
21487* Assert Function::             A function for assertions in @command{awk}
21488                                programs.
21489* Round Function::              A function for rounding if @code{sprintf()}
21490                                does not do it correctly.
21491* Cliff Random Function::       The Cliff Random Number Generator.
21492* Ordinal Functions::           Functions for using characters as numbers and
21493                                vice versa.
21494* Join Function::               A function to join an array into a string.
21495* Getlocaltime Function::       A function to get formatted times.
21496* Readfile Function::           A function to read an entire file at once.
21497* Shell Quoting::               A function to quote strings for the shell.
21498* Isnumeric Function::          A function to test whether a value is numeric.
21499@end menu
21500
21501@node Strtonum Function
21502@subsection Converting Strings to Numbers
21503
21504The @code{strtonum()} function (@pxref{String Functions})
21505is a @command{gawk} extension.  The following function
21506provides an implementation for other versions of @command{awk}:
21507
21508@example
21509@c file eg/lib/strtonum.awk
21510# mystrtonum --- convert string to number
21511
21512@c endfile
21513@ignore
21514@c file eg/lib/strtonum.awk
21515#
21516# Arnold Robbins, arnold@@skeeve.com, Public Domain
21517# February, 2004
21518# Revised June, 2014
21519
21520@c endfile
21521@end ignore
21522@c file eg/lib/strtonum.awk
21523function mystrtonum(str,        ret, n, i, k, c)
21524@{
21525    if (str ~ /^0[0-7]*$/) @{
21526        # octal
21527        n = length(str)
21528        ret = 0
21529        for (i = 1; i <= n; i++) @{
21530            c = substr(str, i, 1)
21531            # index() returns 0 if c not in string,
21532            # includes c == "0"
21533            k = index("1234567", c)
21534
21535            ret = ret * 8 + k
21536        @}
21537    @} else if (str ~ /^0[xX][[:xdigit:]]+$/) @{
21538        # hexadecimal
21539        str = substr(str, 3)    # lop off leading 0x
21540        n = length(str)
21541        ret = 0
21542        for (i = 1; i <= n; i++) @{
21543            c = substr(str, i, 1)
21544            c = tolower(c)
21545            # index() returns 0 if c not in string,
21546            # includes c == "0"
21547            k = index("123456789abcdef", c)
21548
21549            ret = ret * 16 + k
21550        @}
21551    @} else if (str ~ \
  /^[-+]?([0-9]+([.][0-9]*)?|[.][0-9]+)([Ee][-+]?[0-9]+)?$/) @{
21553        # decimal number, possibly floating point
21554        ret = str + 0
21555    @} else
21556        ret = "NOT-A-NUMBER"
21557
21558    return ret
21559@}
21560
21561# BEGIN @{     # gawk test harness
21562#     a[1] = "25"
21563#     a[2] = ".31"
21564#     a[3] = "0123"
21565#     a[4] = "0xdeadBEEF"
21566#     a[5] = "123.45"
21567#     a[6] = "1.e3"
21568#     a[7] = "1.32"
21569#     a[8] = "1.32E2"
21570#
21571#     for (i = 1; i in a; i++)
21572#         print a[i], strtonum(a[i]), mystrtonum(a[i])
21573# @}
21574@c endfile
21575@end example
21576
21577The function first looks for C-style octal numbers (base 8).
21578If the input string matches a regular expression describing octal
21579numbers, then @code{mystrtonum()} loops through each character in the
21580string.  It sets @code{k} to the index in @code{"1234567"} of the current
21581octal digit.
21582The return value will either be the same number as the digit, or zero
21583if the character is not there, which will be true for a @samp{0}.
21584This is safe, because the regexp test in the @code{if} ensures that
21585only octal values are converted.
21586
21587Similar logic applies to the code that checks for and converts a
21588hexadecimal value, which starts with @samp{0x} or @samp{0X}.
21589The use of @code{tolower()} simplifies the computation for finding
21590the correct numeric value for each hexadecimal digit.
21591
21592Finally, if the string matches the (rather complicated) regexp for a
21593regular decimal integer or floating-point number, the computation
21594@samp{ret = str + 0} lets @command{awk} convert the value to a
21595number.
21596
21597A commented-out test program is included, so that the function can
21598be tested with @command{gawk} and the results compared to the built-in
21599@code{strtonum()} function.
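
As a brief usage sketch, assume that the function has been saved in
@file{strtonum.awk} and that the following rule is in a hypothetical
file named @file{demo.awk}.  Running @samp{awk -f strtonum.awk -f demo.awk}
should then print the values noted in the comments:

@example
BEGIN @{
    print mystrtonum("017")       # octal: prints 15
    print mystrtonum("0x1F")      # hexadecimal: prints 31
    print mystrtonum("12.5")      # decimal: prints 12.5
    print mystrtonum("twelve")    # prints NOT-A-NUMBER
@}
@end example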
21600
21601@node Assert Function
21602@subsection Assertions
21603
21604@cindex assertions
21605@cindex @code{assert()} function (C library)
21606@cindex C library functions @subentry @code{assert()}
21607@cindex libraries of @command{awk} functions @subentry assertions
21608@cindex functions @subentry library @subentry assertions
21609@cindex @command{awk} programs @subentry lengthy @subentry assertions
21610When writing large programs, it is often useful to know
21611that a condition or set of conditions is true.  Before proceeding with a
21612particular computation, you make a statement about what you believe to be
21613the case.  Such a statement is known as an
21614@dfn{assertion}.  The C language provides an @code{<assert.h>} header file
21615and corresponding @code{assert()} macro that a programmer can use to make
21616assertions.  If an assertion fails, the @code{assert()} macro arranges to
21617print a diagnostic message describing the condition that should have
21618been true but was not, and then it kills the program.  In C, using
@code{assert()} looks like this:
21620
21621@example
21622@group
21623#include <assert.h>
21624
21625int myfunc(int a, double b)
21626@{
21627     assert(a <= 5 && b >= 17.1);
21628     @dots{}
21629@}
21630@end group
21631@end example
21632
21633If the assertion fails, the program prints a message similar to this:
21634
21635@example
21636prog.c:5: assertion failed: a <= 5 && b >= 17.1
21637@end example
21638
21639@cindex @code{assert()} user-defined function
21640@cindex user-defined @subentry function @subentry @code{assert()}
21641The C language makes it possible to turn the condition into a string for use
21642in printing the diagnostic message.  This is not possible in @command{awk}, so
21643this @code{assert()} function also requires a string version of the condition
21644that is being tested.
21645Following is the function:
21646
21647@example
21648@c file eg/lib/assert.awk
21649# assert --- assert that a condition is true. Otherwise, exit.
21650
21651@c endfile
21652@ignore
21653@c file eg/lib/assert.awk
21654#
21655# Arnold Robbins, arnold@@skeeve.com, Public Domain
21656# May, 1993
21657
21658@c endfile
21659@end ignore
21660@c file eg/lib/assert.awk
21661function assert(condition, string)
21662@{
21663    if (! condition) @{
21664        printf("%s:%d: assertion failed: %s\n",
21665            FILENAME, FNR, string) > "/dev/stderr"
21666        _assert_exit = 1
21667        exit 1
21668    @}
21669@}
21670
21671@group
21672END @{
21673    if (_assert_exit)
21674        exit 1
21675@}
21676@end group
21677@c endfile
21678@end example
21679
21680The @code{assert()} function tests the @code{condition} parameter. If it
21681is false, it prints a message to standard error, using the @code{string}
21682parameter to describe the failed condition.  It then sets the variable
21683@code{_assert_exit} to one and executes the @code{exit} statement.
21684The @code{exit} statement jumps to the @code{END} rule. If the @code{END}
21685rule finds @code{_assert_exit} to be true, it exits immediately.
21686
21687The purpose of the test in the @code{END} rule is to
21688keep any other @code{END} rules from running.  When an assertion fails, the
21689program should exit immediately.
21690If no assertions fail, then @code{_assert_exit} is still
21691false when the @code{END} rule is run normally, and the rest of the
21692program's @code{END} rules execute.
21693For all of this to work correctly, @file{assert.awk} must be the
21694first source file read by @command{awk}.
21695The function can be used in a program in the following way:
21696
21697@example
21698function myfunc(a, b)
21699@{
21700     assert(a <= 5 && b >= 17.1, "a <= 5 && b >= 17.1")
21701     @dots{}
21702@}
21703@end example
21704
21705@noindent
21706If the assertion fails, you see a message similar to the following:
21707
21708@example
21709mydata:1357: assertion failed: a <= 5 && b >= 17.1
21710@end example
21711
21712@cindex @code{END} pattern @subentry @code{assert()} user-defined function and
21713There is a small problem with this version of @code{assert()}.
21714An @code{END} rule is automatically added
21715to the program calling @code{assert()}.  Normally, if a program consists
21716of just a @code{BEGIN} rule, the input files and/or standard input are
21717not read. However, now that the program has an @code{END} rule, @command{awk}
21718attempts to read the input @value{DF}s or standard input
21719(@pxref{Using BEGIN/END}),
21720most likely causing the program to hang as it waits for input.
21721
21722@cindex @code{BEGIN} pattern @subentry @code{assert()} user-defined function and
21723There is a simple workaround to this:
21724make sure that such a @code{BEGIN} rule always ends
21725with an @code{exit} statement.
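
The following sketch shows the workaround in a program that has only a
@code{BEGIN} rule.  It assumes @command{gawk}, whose @code{@@include}
directive is used to load @file{assert.awk}, and it assumes that the file
can be found through @env{AWKPATH}; the data values are made up for
illustration:

@example
@@include "assert"

BEGIN @{
    n = split("10 20 30", vals, " ")
    assert(n == 3, "n == 3")
    print "all assertions passed"
    exit    # keeps awk from waiting for input
@}
@end example

Without the final @code{exit}, the @code{END} rule supplied by
@file{assert.awk} would cause @command{gawk} to try to read standard input
after the @code{BEGIN} rule finishes.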
21726
21727@node Round Function
21728@subsection Rounding Numbers
21729
21730@cindex rounding numbers
21731@cindex numbers @subentry rounding
21732@cindex libraries of @command{awk} functions @subentry rounding numbers
21733@cindex functions @subentry library @subentry rounding numbers
21734@cindex @code{print} statement @subentry @code{sprintf()} function and
21735@cindex @code{printf} statement @subentry @code{sprintf()} function and
21736@cindex @code{sprintf()} function @subentry @code{print}/@code{printf} statements and
21737The way @code{printf} and @code{sprintf()}
21738(@pxref{Printf})
21739perform rounding often depends upon the system's C @code{sprintf()}
21740subroutine.  On many machines, @code{sprintf()} rounding is @dfn{unbiased},
21741which means it doesn't always round a trailing .5 up, contrary
21742to naive expectations.  In unbiased rounding, .5 rounds to even,
21743rather than always up, so 1.5 rounds to 2 but 4.5 rounds to 4.  This means
21744that if you are using a format that does rounding (e.g., @code{"%.0f"}),
21745you should check what your system does.  The following function does
21746traditional rounding; it might be useful if your @command{awk}'s @code{printf}
21747does unbiased rounding:
21748
21749@cindex @code{round()} user-defined function
21750@cindex user-defined @subentry function @subentry @code{round()}
21751@example
21752@c file eg/lib/round.awk
21753# round.awk --- do normal rounding
21754@c endfile
21755@ignore
21756@c file eg/lib/round.awk
21757#
21758# Arnold Robbins, arnold@@skeeve.com, Public Domain
21759# August, 1996
21760@c endfile
21761@end ignore
21762@c file eg/lib/round.awk
21763
21764function round(x,   ival, aval, fraction)
21765@{
21766   ival = int(x)    # integer part, int() truncates
21767
21768   # see if fractional part
21769   if (ival == x)   # no fraction
21770      return ival   # ensure no decimals
21771
21772   if (x < 0) @{
21773      aval = -x     # absolute value
21774      ival = int(aval)
21775      fraction = aval - ival
21776      if (fraction >= .5)
21777         return int(x) - 1   # -2.5 --> -3
21778      else
21779         return int(x)       # -2.3 --> -2
21780   @} else @{
21781      fraction = x - ival
21782      if (fraction >= .5)
21783         return ival + 1
21784      else
21785         return ival
21786   @}
21787@}
21788@c endfile
21789@c don't include test harness in the file that gets installed
21790@group
21791# test harness
21792# @{ print $0, round($0) @}
21793@end group
21794@end example
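
To see how @code{printf} on your own system treats halfway cases, a quick
check along the following lines may help.  This is only a sketch; the
values tested are arbitrary:

@example
BEGIN @{
    # halfway cases; each is exactly representable in binary
    for (x = 0.5; x <= 4.5; x++)
        printf("%.1f rounds to %.0f\n", x, x)
@}
@end example

If the output shows 0.5 rounding to 0 and 2.5 rounding to 2, your system
does unbiased rounding, and @code{round()} may be worth using.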
21795
21796@node Cliff Random Function
21797@subsection The Cliff Random Number Generator
21798@cindex random numbers @subentry Cliff
21799@cindex Cliff random numbers
21800@cindex numbers @subentry Cliff random
21801@cindex functions @subentry library @subentry Cliff random numbers
21802
21803The
21804@uref{http://mathworld.wolfram.com/CliffRandomNumberGenerator.html, Cliff random number generator}
21805is a very simple random number generator that ``passes the noise sphere test
21806for randomness by showing no structure.''
It is easily programmed, in fewer than 10 lines of @command{awk} code:
21808
21809@cindex @code{cliff_rand()} user-defined function
21810@cindex user-defined @subentry function @subentry @code{cliff_rand()}
21811@example
21812@c file eg/lib/cliff_rand.awk
21813# cliff_rand.awk --- generate Cliff random numbers
21814@c endfile
21815@ignore
21816@c file eg/lib/cliff_rand.awk
21817#
21818# Arnold Robbins, arnold@@skeeve.com, Public Domain
21819# December 2000
21820@c endfile
21821@end ignore
21822@c file eg/lib/cliff_rand.awk
21823
21824BEGIN @{ _cliff_seed = 0.1 @}
21825
21826function cliff_rand()
21827@{
21828    _cliff_seed = (100 * log(_cliff_seed)) % 1
21829    if (_cliff_seed < 0)
21830        _cliff_seed = - _cliff_seed
21831    return _cliff_seed
21832@}
21833@c endfile
21834@end example
21835
21836This algorithm requires an initial ``seed'' of 0.1.  Each new value
21837uses the current seed as input for the calculation.
21838If the built-in @code{rand()} function
21839(@pxref{Numeric Functions})
21840isn't random enough, you might try using this function instead.
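
As a usage sketch, the following hypothetical helper, @code{cliff_roll()},
scales the generator's output to an integer between one and @var{n}.
It assumes @command{gawk} with @file{cliff_rand.awk} reachable through
@env{AWKPATH}:

@example
@@include "cliff_rand"

# cliff_roll --- hypothetical helper: integer in the range 1..n
function cliff_roll(n)
@{
    return int(cliff_rand() * n) + 1
@}

BEGIN @{
    for (i = 1; i <= 5; i++)
        printf("%d ", cliff_roll(6))
    print ""
@}
@end example

Because the seed always starts at 0.1, every run produces the same
sequence.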
21841
21842@node Ordinal Functions
21843@subsection Translating Between Characters and Numbers
21844
21845@cindex libraries of @command{awk} functions @subentry character values as numbers
21846@cindex functions @subentry library @subentry character values as numbers
21847@cindex characters @subentry values of as numbers
21848@cindex numbers @subentry as values of characters
21849One commercial implementation of @command{awk} supplies a built-in function,
21850@code{ord()}, which takes a character and returns the numeric value for that
21851character in the machine's character set.  If the string passed to
21852@code{ord()} has more than one character, only the first one is used.
21853
21854The inverse of this function is @code{chr()} (from the function of the same
21855name in Pascal), which takes a number and returns the corresponding character.
21856Both functions are written very nicely in @command{awk}; there is no real
21857reason to build them into the @command{awk} interpreter:
21858
21859@cindex @code{ord()} user-defined function
21860@cindex user-defined @subentry function @subentry @code{ord()}
21861@cindex @code{chr()} user-defined function
21862@cindex user-defined @subentry function @subentry @code{chr()}
21863@cindex @code{_ord_init()} user-defined function
21864@cindex user-defined @subentry function @subentry @code{_ord_init()}
21865@example
21866@c file eg/lib/ord.awk
21867# ord.awk --- do ord and chr
21868
21869# Global identifiers:
21870#    _ord_:        numerical values indexed by characters
21871#    _ord_init:    function to initialize _ord_
21872@c endfile
21873@ignore
21874@c file eg/lib/ord.awk
21875#
21876# Arnold Robbins, arnold@@skeeve.com, Public Domain
21877# 16 January, 1992
21878# 20 July, 1992, revised
21879@c endfile
21880@end ignore
21881@c file eg/lib/ord.awk
21882
21883BEGIN    @{ _ord_init() @}
21884
21885function _ord_init(    low, high, i, t)
21886@{
21887    low = sprintf("%c", 7) # BEL is ascii 7
21888    if (low == "\a") @{    # regular ascii
21889        low = 0
21890        high = 127
21891    @} else if (sprintf("%c", 128 + 7) == "\a") @{
21892        # ascii, mark parity
21893        low = 128
21894        high = 255
21895    @} else @{        # ebcdic(!)
21896        low = 0
21897        high = 255
21898    @}
21899
21900    for (i = low; i <= high; i++) @{
21901        t = sprintf("%c", i)
21902        _ord_[t] = i
21903    @}
21904@}
21905@c endfile
21906@end example
21907
21908@cindex character sets (machine character encodings)
21909@cindex ASCII
21910@cindex EBCDIC
21911@cindex Unicode
21912@cindex mark parity
21913Some explanation of the numbers used by @code{_ord_init()} is worthwhile.
21914The most prominent character set in use today is ASCII.@footnote{This
21915is changing; many systems use Unicode, a very large character set
21916that includes ASCII as a subset.  On systems with full Unicode support,
21917a character can occupy up to 32 bits, making simple tests such as
21918used here prohibitively expensive.}
21919Although an
219208-bit byte can hold 256 distinct values (from 0 to 255), ASCII only
21921defines characters that use the values from 0 to 127.@footnote{ASCII
21922has been extended in many countries to use the values from 128 to 255
for country-specific characters.  If your system uses these extensions,
21924you can simplify @code{_ord_init()} to loop from 0 to 255.}
21925In the now distant past,
21926at least one minicomputer manufacturer
21927@c Pr1me, blech
21928used ASCII, but with mark parity, meaning that the leftmost bit in the byte
21929is always 1.  This means that on those systems, characters
21930have numeric values from 128 to 255.
21931Finally, large mainframe systems use the EBCDIC character set, which
21932uses all 256 values.
21933There are other character sets in use on some older systems, but
21934they are not really worth worrying about:
21935
21936@example
21937@c file eg/lib/ord.awk
21938function ord(str,    c)
21939@{
21940    # only first character is of interest
21941    c = substr(str, 1, 1)
21942    return _ord_[c]
21943@}
21944
21945function chr(c)
21946@{
21947    # force c to be numeric by adding 0
21948    return sprintf("%c", c + 0)
21949@}
21950@c endfile
21951
21952#### test code ####
21953# BEGIN @{
21954#    for (;;) @{
21955#        printf("enter a character: ")
21956#        if (getline var <= 0)
21957#            break
21958#        printf("ord(%s) = %d\n", var, ord(var))
21959#    @}
21960# @}
21961@c endfile
21962@end example
21963
21964An obvious improvement to these functions is to move the code for the
21965@code{@w{_ord_init}} function into the body of the @code{BEGIN} rule.  It was
21966written this way initially for ease of development.
21967There is a ``test program'' in a @code{BEGIN} rule, to test the
21968function.  It is commented out for production use.
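
A minimal sketch of that improvement follows.  It assumes an ASCII-based
system on which looping from 0 to 255 is appropriate, so it drops the
tests for mark parity and EBCDIC.  Because a @code{BEGIN} rule has no
local variables, the loop counter becomes a global and is given the
private name @code{_ord_i}, which is an assumption of this sketch:

@example
BEGIN @{
    # assumes ASCII, possibly with 8-bit extensions
    for (_ord_i = 0; _ord_i <= 255; _ord_i++)
        _ord_[sprintf("%c", _ord_i)] = _ord_i
@}

function ord(str)
@{
    # only the first character is of interest
    return _ord_[substr(str, 1, 1)]
@}

function chr(c)
@{
    # force c to be numeric by adding 0
    return sprintf("%c", c + 0)
@}

BEGIN @{
    print ord("A"), chr(97)    # on ASCII systems: 65 a
@}
@end example

The extra global is the price of this arrangement; keeping the loop
inside a function, as in the original, keeps the counter local.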
21969
21970@node Join Function
21971@subsection Merging an Array into a String
21972
21973@cindex libraries of @command{awk} functions @subentry merging arrays into strings
21974@cindex functions @subentry library @subentry merging arrays into strings
21975@cindex strings @subentry merging arrays into
21976@cindex arrays @subentry merging into strings
21977When doing string processing, it is often useful to be able to join
21978all the strings in an array into one long string.  The following function,
21979@code{join()}, accomplishes this task.  It is used later in several of
21980the application programs
21981(@pxref{Sample Programs}).
21982
21983Good function design is important; this function needs to be general, but it
21984should also have a reasonable default behavior.  It is called with an array
21985as well as the beginning and ending indices of the elements in the array to be
21986merged.  This assumes that the array indices are numeric---a reasonable
21987assumption, as the array was likely created with @code{split()}
21988(@pxref{String Functions}):
21989
21990@cindex @code{join()} user-defined function
21991@cindex user-defined @subentry function @subentry @code{join()}
21992@example
21993@c file eg/lib/join.awk
21994# join.awk --- join an array into a string
21995@c endfile
21996@ignore
21997@c file eg/lib/join.awk
21998#
21999# Arnold Robbins, arnold@@skeeve.com, Public Domain
22000# May 1993
22001@c endfile
22002@end ignore
22003@c file eg/lib/join.awk
22004
22005function join(array, start, end, sep,    result, i)
22006@{
22007    if (sep == "")
22008       sep = " "
22009    else if (sep == SUBSEP) # magic value
22010       sep = ""
22011    result = array[start]
22012    for (i = start + 1; i <= end; i++)
22013        result = result sep array[i]
22014    return result
22015@}
22016@c endfile
22017@end example
22018
22019An optional additional argument is the separator to use when joining the
22020strings back together.  If the caller supplies a nonempty value,
22021@code{join()} uses it; if it is not supplied, it has a null
22022value.  In this case, @code{join()} uses a single space as a default
22023separator for the strings.  If the value is equal to @code{SUBSEP},
22024then @code{join()} joins the strings with no separator between them.
22025@code{SUBSEP} serves as a ``magic'' value to indicate that there should
22026be no separation between the component strings.@footnote{It would
22027be nice if @command{awk} had an assignment operator for concatenation.
22028The lack of an explicit operator for concatenation makes string operations
22029more difficult than they really need to be.}
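
The following usage sketch shows the three behaviors.  It assumes
@command{gawk}, whose @code{@@include} directive loads @file{join.awk},
and that the file can be found through @env{AWKPATH}; the input string
is arbitrary:

@example
@@include "join"

BEGIN @{
    n = split("usr local share awk", parts, " ")
    print join(parts, 1, n)            # usr local share awk
    print join(parts, 1, n, "/")       # usr/local/share/awk
    print join(parts, 2, 3, SUBSEP)    # localshare
@}
@end example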
22030
22031@node Getlocaltime Function
22032@subsection Managing the Time of Day
22033
22034@cindex libraries of @command{awk} functions @subentry managing @subentry time
22035@cindex functions @subentry library @subentry managing time
22036@cindex timestamps @subentry formatted
22037@cindex time @subentry managing
22038The @code{systime()} and @code{strftime()} functions described in
22039@ref{Time Functions}
22040provide the minimum functionality necessary for dealing with the time of day
22041in human-readable form.  Although @code{strftime()} is extensive, the control
22042formats are not necessarily easy to remember or intuitively obvious when
22043reading a program.
22044
22045The following function, @code{getlocaltime()}, populates a user-supplied array
22046with preformatted time information.  It returns a string with the current
22047time formatted in the same way as the @command{date} utility:
22048
22049@cindex @code{getlocaltime()} user-defined function
22050@cindex user-defined @subentry function @subentry @code{getlocaltime()}
22051@example
22052@c file eg/lib/gettime.awk
22053# getlocaltime.awk --- get the time of day in a usable format
22054@c endfile
22055@ignore
22056@c file eg/lib/gettime.awk
22057#
22058# Arnold Robbins, arnold@@skeeve.com, Public Domain, May 1993
22059#
22060@c endfile
22061@end ignore
22062@c file eg/lib/gettime.awk
22063
22064# Returns a string in the format of output of date(1)
22065# Populates the array argument time with individual values:
22066#    time["second"]       -- seconds (0 - 59)
22067#    time["minute"]       -- minutes (0 - 59)
22068#    time["hour"]         -- hours (0 - 23)
#    time["althour"]      -- hours (1 - 12)
22070#    time["monthday"]     -- day of month (1 - 31)
22071#    time["month"]        -- month of year (1 - 12)
22072#    time["monthname"]    -- name of the month
22073#    time["shortmonth"]   -- short name of the month
22074#    time["year"]         -- year modulo 100 (0 - 99)
22075#    time["fullyear"]     -- full year
22076#    time["weekday"]      -- day of week (Sunday = 0)
22077#    time["altweekday"]   -- day of week (Monday = 0)
22078#    time["dayname"]      -- name of weekday
22079#    time["shortdayname"] -- short name of weekday
#    time["yearday"]      -- day of year (1 - 366)
22081#    time["timezone"]     -- abbreviation of timezone name
22082#    time["ampm"]         -- AM or PM designation
22083#    time["weeknum"]      -- week number, Sunday first day
22084#    time["altweeknum"]   -- week number, Monday first day
22085
22086function getlocaltime(time,    ret, now, i)
22087@{
22088    # get time once, avoids unnecessary system calls
22089    now = systime()
22090
22091    # return date(1)-style output
22092    ret = strftime("%a %b %e %H:%M:%S %Z %Y", now)
22093
22094    # clear out target array
22095    delete time
22096
22097    # fill in values, force numeric values to be
22098    # numeric by adding 0
22099    time["second"]       = strftime("%S", now) + 0
22100    time["minute"]       = strftime("%M", now) + 0
22101    time["hour"]         = strftime("%H", now) + 0
22102    time["althour"]      = strftime("%I", now) + 0
22103    time["monthday"]     = strftime("%d", now) + 0
22104    time["month"]        = strftime("%m", now) + 0
22105    time["monthname"]    = strftime("%B", now)
22106    time["shortmonth"]   = strftime("%b", now)
22107    time["year"]         = strftime("%y", now) + 0
22108    time["fullyear"]     = strftime("%Y", now) + 0
22109    time["weekday"]      = strftime("%w", now) + 0
    time["altweekday"]   = (time["weekday"] + 6) % 7
22111    time["dayname"]      = strftime("%A", now)
22112    time["shortdayname"] = strftime("%a", now)
22113    time["yearday"]      = strftime("%j", now) + 0
22114    time["timezone"]     = strftime("%Z", now)
22115    time["ampm"]         = strftime("%p", now)
22116    time["weeknum"]      = strftime("%U", now) + 0
22117    time["altweeknum"]   = strftime("%W", now) + 0
22118
22119    return ret
22120@}
22121@c endfile
22122@end example
22123
22124The string indices are easier to use and read than the various formats
22125required by @code{strftime()}.  The @code{alarm} program presented in
22126@ref{Alarm Program}
22127uses this function.
22128A more general design for the @code{getlocaltime()} function would have
22129allowed the user to supply an optional timestamp value to use instead
22130of the current time.
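
As a brief illustration of how calling code might use the array, here
is a minimal sketch.  It assumes that @file{gettime.awk} has been loaded
with @option{-f}; the array name @code{now} and the output layout are
arbitrary choices made for this example:

@example
BEGIN @{
    stamp = getlocaltime(now)

    # named elements instead of strftime() format letters
    printf "It is %02d:%02d on %s, %s %d, %d\n",
        now["hour"], now["minute"], now["dayname"],
        now["monthname"], now["monthday"], now["fullyear"]

    # the return value is the date(1)-style string
    print stamp
@}
@end example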
22131
22132@node Readfile Function
22133@subsection Reading a Whole File at Once
22134
22135Often, it is convenient to have the entire contents of a file available
22136in memory as a single string. A straightforward but naive way to
22137do that might be as follows:
22138
22139@example
22140function readfile1(file,    tmp, contents)
22141@{
22142    if ((getline tmp < file) < 0)
22143        return
22144
22145    contents = tmp RT
22146    while ((getline tmp < file) > 0)
22147        contents = contents tmp RT
22148
22149    close(file)
22150    return contents
22151@}
22152@end example
22153
22154This function reads from @code{file} one record at a time, building
22155up the full contents of the file in the local variable @code{contents}.
22156It works, but is not necessarily efficient.
22157
22158The following function, based on a suggestion by Denis Shirokov,
22159reads the entire contents of the named file in one shot:
22160
22161@cindex @code{readfile()} user-defined function
22162@cindex user-defined @subentry function @subentry @code{readfile()}
22163@example
22164@c file eg/lib/readfile.awk
22165# readfile.awk --- read an entire file at once
22166@c endfile
22167@ignore
22168@c file eg/lib/readfile.awk
22169#
22170# Original idea by Denis Shirokov, cosmogen@@gmail.com, April 2013
22171#
22172@c endfile
22173@end ignore
22174@c file eg/lib/readfile.awk
22175
22176function readfile(file,     tmp, save_rs)
22177@{
22178    save_rs = RS
22179    RS = "^$"
22180    getline tmp < file
22181    close(file)
22182    RS = save_rs
22183
22184    return tmp
22185@}
22186@c endfile
22187@end example
22188
22189It works by setting @code{RS} to @samp{^$}, a regular expression that
22190will never match if the file has contents.  @command{gawk} reads data from
22191the file into @code{tmp}, attempting to match @code{RS}.  The match fails
22192after each read, but fails quickly, such that @command{gawk} fills
22193@code{tmp} with the entire contents of the file.
22194(@xref{Records} for information on @code{RT} and @code{RS}.)
22195
22196In the case that @code{file} is empty, the return value is the null
22197string.  Thus, calling code may use something like:
22198
22199@example
22200contents = readfile("/some/path")
22201if (length(contents) == 0)
22202    # file was empty @dots{}
22203@end example
22204
22205This tests the result to see if it is empty or not. An equivalent
22206test would be @samp{@w{contents == ""}}.
22207
22208@xref{Extension Sample Readfile} for an extension function that
22209also reads an entire file into memory.
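
The following sketch shows one way that calling code might use
@code{readfile()}.  It assumes that @file{readfile.awk} has been loaded
with @option{-f} and that the file to read is named as the first
command-line argument; the report format is purely illustrative:

@example
BEGIN @{
    contents = readfile(ARGV[1])
    if (contents == "")
        print ARGV[1] ": empty (or could not be read)"
    else
        printf "%s: %d characters\n", ARGV[1], length(contents)
@}
@end example

Because @code{readfile()} as written does not check the return value
of @code{getline}, a file that cannot be opened also yields the null
string, which is why the message in this sketch mentions both cases.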
22210
22211@node Shell Quoting
22212@subsection Quoting Strings to Pass to the Shell
22213
22214@c included by permission
22215@ignore
22216Date: Sun, 27 Jul 2014 17:16:16 -0700
22217Message-ID: <CAKuGj+iCF_obaCLDUX60aSAgbfocFVtguG39GyeoNxTFby5sqQ@mail.gmail.com>
22218Subject: Useful awk function
22219From: Mike Brennan <mike@madronabluff.com>
22220To: Arnold Robbins <arnold@skeeve.com>
22221@end ignore
22222
22223Michael Brennan offers the following programming pattern,
22224which he uses frequently:
22225
22226@example
22227#! /bin/sh
22228
22229awkp='
22230   @dots{}
22231   '
22232
22233@var{input_program} | awk "$awkp" | /bin/sh
22234@end example
22235
22236For example, a program of his named @command{flac-edit} has this form:
22237
22238@example
22239$ @kbd{flac-edit -song="Whoope! That's Great" file.flac}
22240@end example
22241
22242It generates the following output, which is to be piped to
22243the shell (@file{/bin/sh}):
22244
22245@example
22246chmod +w file.flac
22247metaflac --remove-tag=TITLE file.flac
22248LANG=en_US.88591 metaflac --set-tag=TITLE='Whoope! That'"'"'s Great' file.flac
22249chmod -w file.flac
22250@end example
22251
22252Note the need for shell quoting.  The function @code{shell_quote()}
22253does it.  @code{SINGLE} is the one-character string @code{"'"} and
22254@code{QSINGLE} is the three-character string @code{"\"'\""}:
22255
22256@example
22257@c file eg/lib/shellquote.awk
22258# shell_quote --- quote an argument for passing to the shell
22259@c endfile
22260@ignore
22261@c file eg/lib/shellquote.awk
22262#
22263# Michael Brennan
22264# brennan@@madronabluff.com
22265# September 2014
22266@c endfile
22267@end ignore
22268@c file eg/lib/shellquote.awk
22269
22270function shell_quote(s,             # parameter
22271    SINGLE, QSINGLE, i, X, n, ret)  # locals
22272@{
22273    if (s == "")
22274        return "\"\""
22275
22276    SINGLE = "\x27"  # single quote
22277    QSINGLE = "\"\x27\""
22278    n = split(s, X, SINGLE)
22279
22280    ret = SINGLE X[1] SINGLE
22281    for (i = 2; i <= n; i++)
22282        ret = ret QSINGLE SINGLE X[i] SINGLE
22283
22284    return ret
22285@}
22286@c endfile
22287@end example
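
As a small sketch of how the function might be called, the following
program builds one of the commands shown earlier.  The @value{FN}
@file{demo.awk} is hypothetical, and @file{shellquote.awk} is assumed
to contain the function just presented:

@example
$ @kbd{cat demo.awk}
@print{} BEGIN @{
@print{}     title = "Whoope! That's Great"
@print{}     print "metaflac --set-tag=TITLE=" shell_quote(title) " file.flac"
@print{} @}

$ @kbd{gawk -f shellquote.awk -f demo.awk}
@print{} metaflac --set-tag=TITLE='Whoope! That'"'"'s Great' file.flac
@end example

The embedded single quote splits the string into two pieces.  Each piece
is wrapped in single quotes, and the quote itself is emitted between them
inside double quotes.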
22288
22289@node Isnumeric Function
@subsection Checking Whether a Value Is Numeric
22291
22292A frequent programming question is how to ascertain whether a value is numeric.
22293This can be solved by using this example function @code{isnumeric()}, which
22294employs the trick of converting a string value to user input by using the
22295@code{split()} function:
22296
22297@cindex @code{isnumeric()} user-defined function
22298@cindex user-defined @subentry function @subentry @code{isnumeric()}
22299@example
22300@c file eg/lib/isnumeric.awk
22301# isnumeric --- check whether a value is numeric
22302
22303function isnumeric(x,  f)
22304@{
22305    switch (typeof(x)) @{
22306    case "strnum":
22307    case "number":
22308        return 1
22309    case "string":
22310        return (split(x, f, " ") == 1) && (typeof(f[1]) == "strnum")
22311    default:
22312        return 0
22313    @}
22314@}
22315@c endfile
22316@end example
22317
22318Please note that leading or trailing white space is disregarded in deciding
22319whether a value is numeric or not, so if it matters to you, you may want
22320to add an additional check for that.
22321
22322Traditionally, it has been recommended to check for numeric values using the
22323test @samp{x+0 == x}. This function is superior in two ways: it will not
22324report that unassigned variables contain numeric values; and it recognizes
22325string values with numeric contents where @code{CONVFMT} does not yield
22326the original string.
22327On the other hand, it uses the @code{typeof()} function
22328(@pxref{Type Functions}), which is specific to @command{gawk}.
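
The following short sketch exercises @code{isnumeric()} on a few values.
It assumes that @file{isnumeric.awk} has been loaded with @option{-f};
the test values are chosen only for illustration:

@example
BEGIN @{
    print isnumeric(42)         # 1: a number
    print isnumeric("42")       # 1: a string that looks numeric
    print isnumeric(" 3.0e2 ")  # 1: surrounding whitespace ignored
    print isnumeric("abc")      # 0: not numeric
    print isnumeric("1 2")      # 0: splits into two pieces
@}
@end example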
22329
22330@node Data File Management
22331@section @value{DDF} Management
22332
22333@cindex files @subentry managing
22334@cindex libraries of @command{awk} functions @subentry managing @subentry data files
22335@cindex functions @subentry library @subentry managing data files
22336This @value{SECTION} presents functions that are useful for managing
22337command-line @value{DF}s.
22338
22339@menu
22340* Filetrans Function::          A function for handling data file transitions.
22341* Rewind Function::             A function for rereading the current file.
22342* File Checking::               Checking that data files are readable.
22343* Empty Files::                 Checking for zero-length files.
22344* Ignoring Assigns::            Treating assignments as file names.
22345@end menu
22346
22347@node Filetrans Function
22348@subsection Noting @value{DDF} Boundaries
22349
22350@cindex files @subentry managing @subentry data file boundaries
22351@cindex files @subentry initialization and cleanup
22352The @code{BEGIN} and @code{END} rules are each executed exactly once, at
22353the beginning and end of your @command{awk} program, respectively
22354(@pxref{BEGIN/END}).
22355We (the @command{gawk} authors) once had a user who mistakenly thought that the
22356@code{BEGIN} rules were executed at the beginning of each @value{DF} and the
22357@code{END} rules were executed at the end of each @value{DF}.
22358
22359When informed
22360that this was not the case, the user requested that we add new special
22361patterns to @command{gawk}, named @code{BEGIN_FILE} and @code{END_FILE}, that
22362would have the desired behavior.  He even supplied us the code to do so.
22363
22364Adding these special patterns to @command{gawk} wasn't necessary;
22365the job can be done cleanly in @command{awk} itself, as illustrated
22366by the following library program.
22367It arranges to call two user-supplied functions, @code{beginfile()} and
22368@code{endfile()}, at the beginning and end of each @value{DF}.
22369Besides solving the problem in only nine(!) lines of code, it does so
22370@emph{portably}; this works with any implementation of @command{awk}:
22371
22372@example
22373# transfile.awk
22374#
22375# Give the user a hook for filename transitions
22376#
22377# The user must supply functions beginfile() and endfile()
22378# that each take the name of the file being started or
22379# finished, respectively.
22380@c #
22381@c # Arnold Robbins, arnold@@skeeve.com, Public Domain
22382@c # January 1992
22383
22384FILENAME != _oldfilename @{
22385    if (_oldfilename != "")
22386        endfile(_oldfilename)
22387    _oldfilename = FILENAME
22388    beginfile(FILENAME)
22389@}
22390
22391END @{ endfile(FILENAME) @}
22392@end example
22393
22394This file must be loaded before the user's ``main'' program, so that the
22395rule it supplies is executed first.
22396
22397This rule relies on @command{awk}'s @code{FILENAME} variable, which
22398automatically changes for each new @value{DF}.  The current @value{FN} is
22399saved in a private variable, @code{_oldfilename}.  If @code{FILENAME} does
22400not equal @code{_oldfilename}, then a new @value{DF} is being processed and
22401it is necessary to call @code{endfile()} for the old file.  Because
22402@code{endfile()} should only be called if a file has been processed, the
22403program first checks to make sure that @code{_oldfilename} is not the null
22404string.  The program then assigns the current @value{FN} to
22405@code{_oldfilename} and calls @code{beginfile()} for the file.
22406Because, like all @command{awk} variables, @code{_oldfilename} is
22407initialized to the null string, this rule executes correctly even for the
22408first @value{DF}.
22409
22410The program also supplies an @code{END} rule to do the final processing for
22411the last file.  Because this @code{END} rule comes before any @code{END} rules
22412supplied in the ``main'' program, @code{endfile()} is called first.  Once
22413again, the value of multiple @code{BEGIN} and @code{END} rules should be clear.
22414
22415@cindex @code{beginfile()} user-defined function
22416@cindex user-defined @subentry function @subentry @code{beginfile()}
22417@cindex @code{endfile()} user-defined function
22418@cindex user-defined @subentry function @subentry @code{endfile()}
22419If the same @value{DF} occurs twice in a row on the command line, then
22420@code{endfile()} and @code{beginfile()} are not executed at the end of the
22421first pass and at the beginning of the second pass.
22422The following version solves the problem:
22423
22424@example
22425@c file eg/lib/ftrans.awk
22426# ftrans.awk --- handle datafile transitions
22427#
22428# user supplies beginfile() and endfile() functions
22429@c endfile
22430@ignore
22431@c file eg/lib/ftrans.awk
22432#
22433# Arnold Robbins, arnold@@skeeve.com, Public Domain
22434# November 1992
22435@c endfile
22436@end ignore
22437@c file eg/lib/ftrans.awk
22438
22439FNR == 1 @{
22440    if (_filename_ != "")
22441        endfile(_filename_)
22442    _filename_ = FILENAME
22443    beginfile(FILENAME)
22444@}
22445
22446END @{ endfile(_filename_) @}
22447@c endfile
22448@end example
22449
22450@ref{Wc Program}
22451shows how this library function can be used and
22452how it simplifies writing the main program.
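
As a smaller illustration, here is a minimal sketch of a main program
that reports the number of records in each @value{DF}.  The @value{FN}
@file{count.awk} and the variable @code{nlines} are hypothetical names
used only for this example:

@example
# count.awk --- hypothetical main program for use with ftrans.awk

function beginfile(name)
@{
    nlines = 0      # start a fresh count for each file
@}

function endfile(name)
@{
    printf "%s: %d lines\n", name, nlines
@}

@{ nlines++ @}
@end example

It might be run as
@samp{gawk -f ftrans.awk -f count.awk file1 file2}, with the library
file named first so that its rule runs before the counting rule.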
22453
22454@sidebar So Why Does @command{gawk} Have @code{BEGINFILE} and @code{ENDFILE}?
22455
22456You are probably wondering, if @code{beginfile()} and @code{endfile()}
22457functions can do the job, why does @command{gawk} have
22458@code{BEGINFILE} and @code{ENDFILE} patterns?
22459
22460Good question.  Normally, if @command{awk} cannot open a file, this
22461causes an immediate fatal error.  In this case, there is no way for a
22462user-defined function to deal with the problem, as the mechanism for
22463calling it relies on the file being open and at the first record.  Thus,
22464the main reason for @code{BEGINFILE} is to give you a ``hook'' to catch
22465files that cannot be processed.  @code{ENDFILE} exists for symmetry,
22466and because it provides an easy way to do per-file cleanup processing.
22467For more information, refer to @ref{BEGINFILE/ENDFILE}.
22468@end sidebar
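
A minimal sketch of such a ``hook'' follows.  It is specific to
@command{gawk}, and the wording of the message is only an example:

@example
BEGINFILE @{
    # ERRNO is nonempty if the file could not be opened
    if (ERRNO != "") @{
        print "skipping " FILENAME ": " ERRNO > "/dev/stderr"
        nextfile
    @}
@}
@end example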
22469
22470@node Rewind Function
22471@subsection Rereading the Current File
22472
22473@cindex files @subentry reading
22474Another request for a new built-in function was for a
22475function that would make it possible to reread the current file.
22476The requesting user didn't want to have to use @code{getline}
22477(@pxref{Getline})
22478inside a loop.
22479
22480However, as long as you are not in the @code{END} rule, it is
22481quite easy to arrange to immediately close the current input file
22482and then start over with it from the top.
22483For lack of a better name, we'll call the function @code{rewind()}:
22484
22485@cindex @code{rewind()} user-defined function
22486@cindex user-defined @subentry function @subentry @code{rewind()}
22487@example
22488@c file eg/lib/rewind.awk
22489# rewind.awk --- rewind the current file and start over
22490@c endfile
22491@ignore
22492@c file eg/lib/rewind.awk
22493#
22494# Arnold Robbins, arnold@@skeeve.com, Public Domain
22495# September 2000
22496@c endfile
22497@end ignore
22498@c file eg/lib/rewind.awk
22499
22500function rewind(    i)
22501@{
22502    # shift remaining arguments up
22503    for (i = ARGC; i > ARGIND; i--)
22504        ARGV[i] = ARGV[i-1]
22505
22506    # make sure gawk knows to keep going
22507    ARGC++
22508
22509    # make current file next to get done
22510    ARGV[ARGIND+1] = FILENAME
22511
22512    # do it
22513    nextfile
22514@}
22515@c endfile
22516@end example
22517
22518The @code{rewind()} function relies on the @code{ARGIND} variable
22519(@pxref{Auto-set}), which is specific to @command{gawk}.  It also
22520relies on the @code{nextfile} keyword (@pxref{Nextfile Statement}).
22521Because of this, you should not call it from an @code{ENDFILE} rule.
22522(This isn't necessary anyway, because @command{gawk} goes to the next
22523file as soon as an @code{ENDFILE} rule finishes!)
22524
You need to be careful calling @code{rewind()}.  You can end up
causing an infinite loop if you don't pay attention; for example, by
rewinding unconditionally every time a particular record is reached.
Here is an example use:
22528
22529@example
22530$ @kbd{cat data}
22531@print{} a
22532@print{} b
22533@print{} c
22534@print{} d
22535@print{} e
22536
$ @kbd{cat test.awk}
22538@print{} FNR == 3 && ! rewound @{
22539@print{}    rewound = 1
22540@print{}    rewind()
22541@print{} @}
22542@print{}
22543@print{} @{ print FILENAME, FNR, $0 @}
22544
$ @kbd{gawk -f rewind.awk -f test.awk data}
22546@print{} data 1 a
22547@print{} data 2 b
22548@print{} data 1 a
22549@print{} data 2 b
22550@print{} data 3 c
22551@group
22552@print{} data 4 d
22553@print{} data 5 e
22554@end group
22555@end example
22556
22557@node File Checking
22558@subsection Checking for Readable @value{DDF}s
22559
22560@cindex troubleshooting @subentry readable data files
22561@cindex readable data files, checking
22562@cindex files @subentry skipping
22563Normally, if you give @command{awk} a @value{DF} that isn't readable,
22564it stops with a fatal error.  There are times when you might want to
22565just ignore such files and keep going.@footnote{The @code{BEGINFILE}
22566special pattern (@pxref{BEGINFILE/ENDFILE}) provides an alternative
22567mechanism for dealing with files that can't be opened.  However, the
22568code here provides a portable solution.} You can do this by prepending
22569the following program to your @command{awk} program:
22570
22571@cindex @file{readable.awk} program
22572@example
22573@c file eg/lib/readable.awk
22574# readable.awk --- library file to skip over unreadable files
22575@c endfile
22576@ignore
22577@c file eg/lib/readable.awk
22578#
22579# Arnold Robbins, arnold@@skeeve.com, Public Domain
22580# October 2000
22581# December 2010
22582@c endfile
22583@end ignore
22584@c file eg/lib/readable.awk
22585
22586BEGIN @{
22587    for (i = 1; i < ARGC; i++) @{
22588        if (ARGV[i] ~ /^[a-zA-Z_][a-zA-Z0-9_]*=.*/ \
22589            || ARGV[i] == "-" || ARGV[i] == "/dev/stdin")
22590            continue    # assignment or standard input
22591        else if ((getline junk < ARGV[i]) < 0) # unreadable
22592            delete ARGV[i]
22593        else
22594            close(ARGV[i])
22595    @}
22596@}
22597@c endfile
22598@end example
22599
22600@cindex troubleshooting @subentry @code{getline} command
This works because a failed @code{getline} is not a fatal error.
22602Removing the element from @code{ARGV} with @code{delete}
22603skips the file (because it's no longer in the list).
22604See also @ref{ARGC and ARGV}.
22605
22606Because @command{awk} variable names only allow the English letters,
22607the regular expression check purposely does not use character classes
22608such as @samp{[:alpha:]} and @samp{[:alnum:]}
22609(@pxref{Bracket Expressions}).
22610
22611@node Empty Files
22612@subsection Checking for Zero-Length Files
22613
22614All known @command{awk} implementations silently skip over zero-length files.
22615This is a by-product of @command{awk}'s implicit
22616read-a-record-and-match-against-the-rules loop: when @command{awk}
22617tries to read a record from an empty file, it immediately receives an
22618end-of-file indication, closes the file, and proceeds on to the next
22619command-line @value{DF}, @emph{without} executing any user-level
22620@command{awk} program code.
22621
22622Using @command{gawk}'s @code{ARGIND} variable
22623(@pxref{Built-in Variables}), it is possible to detect when an empty
22624@value{DF} has been skipped.  Similar to the library file presented
22625in @ref{Filetrans Function}, the following library file calls a function named
22626@code{zerofile()} that the user must provide.  The arguments passed are
22627the @value{FN} and the position in @code{ARGV} where it was found:
22628
22629@cindex @file{zerofile.awk} program
22630@example
22631@c file eg/lib/zerofile.awk
22632# zerofile.awk --- library file to process empty input files
22633@c endfile
22634@ignore
22635@c file eg/lib/zerofile.awk
22636#
22637# Arnold Robbins, arnold@@skeeve.com, Public Domain
22638# June 2003
22639@c endfile
22640@end ignore
22641@c file eg/lib/zerofile.awk
22642
22643BEGIN @{ Argind = 0 @}
22644
22645ARGIND > Argind + 1 @{
22646    for (Argind++; Argind < ARGIND; Argind++)
22647        zerofile(ARGV[Argind], Argind)
22648@}
22649
22650ARGIND != Argind @{ Argind = ARGIND @}
22651
22652END @{
22653    if (ARGIND > Argind)
22654        for (Argind++; Argind <= ARGIND; Argind++)
22655            zerofile(ARGV[Argind], Argind)
22656@}
22657@c endfile
22658@end example
22659
22660The user-level variable @code{Argind} allows the @command{awk} program
22661to track its progress through @code{ARGV}.  Whenever the program detects
22662that @code{ARGIND} is greater than @samp{Argind + 1}, it means that one or
22663more empty files were skipped.  The action then calls @code{zerofile()} for
22664each such file, incrementing @code{Argind} along the way.
22665
The @samp{ARGIND != Argind} rule simply keeps @code{Argind} up to date
in the normal case.
22668
22669Finally, the @code{END} rule catches the case of any empty files at
22670the end of the command-line arguments.  Note that the test in the
22671condition of the @code{for} loop uses the @samp{<=} operator,
22672not @samp{<}.
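
The @code{zerofile()} function itself can be quite small.  Here is a
minimal sketch that merely warns about each empty file; the message
format is an arbitrary choice for this example:

@example
function zerofile(name, pos)
@{
    printf "warning: %s (ARGV[%d]) is empty\n",
        name, pos > "/dev/stderr"
@}
@end example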
22673
22674@node Ignoring Assigns
22675@subsection Treating Assignments as @value{FFN}s
22676
22677@cindex assignments as file names
22678@cindex file names @subentry assignments as
22679Occasionally, you might not want @command{awk} to process command-line
22680variable assignments
22681(@pxref{Assignment Options}).
22682In particular, if you have a @value{FN} that contains an @samp{=} character,
22683@command{awk} treats the @value{FN} as an assignment and does not process it.
22684
22685Some users have suggested an additional command-line option for @command{gawk}
22686to disable command-line assignments.  However, some simple programming with
22687a library file does the trick:
22688
22689@cindex @file{noassign.awk} program
22690@example
22691@c file eg/lib/noassign.awk
22692# noassign.awk --- library file to avoid the need for a
22693# special option that disables command-line assignments
22694@c endfile
22695@ignore
22696@c file eg/lib/noassign.awk
22697#
22698# Arnold Robbins, arnold@@skeeve.com, Public Domain
22699# October 1999
22700@c endfile
22701@end ignore
22702@c file eg/lib/noassign.awk
22703
22704function disable_assigns(argc, argv,    i)
22705@{
22706    for (i = 1; i < argc; i++)
22707        if (argv[i] ~ /^[a-zA-Z_][a-zA-Z0-9_]*=.*/)
22708            argv[i] = ("./" argv[i])
22709@}
22710
22711BEGIN @{
22712    if (No_command_assign)
22713        disable_assigns(ARGC, ARGV)
22714@}
22715@c endfile
22716@end example
22717
22718You then run your program this way:
22719
22720@example
22721awk -v No_command_assign=1 -f noassign.awk -f yourprog.awk *
22722@end example
22723
22724The function works by looping through the arguments.
22725It prepends @samp{./} to
22726any argument that matches the form
22727of a variable assignment, turning that argument into a @value{FN}.
22728
22729The use of @code{No_command_assign} allows you to disable command-line
22730assignments at invocation time, by giving the variable a true value.
22731When not set, it is initially zero (i.e., false), so the command-line arguments
22732are left alone.
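
To see the effect without involving real files, the function can be
called directly on a private array.  The following sketch assumes that
@file{noassign.awk} has been loaded with @option{-f}; the array
@code{args} and its contents are made up for the example:

@example
BEGIN @{
    args[0] = "awk"
    args[1] = "count=3"     # looks like an assignment
    args[2] = "data.txt"    # ordinary file name
    disable_assigns(3, args)
    for (i = 1; i < 3; i++)
        print args[i]
@}
@end example

This prints @samp{./count=3} followed by @samp{data.txt}: only the
argument that has the form of an assignment is changed.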
22733
22734@node Getopt Function
22735@section Processing Command-Line Options
22736
22737@cindex libraries of @command{awk} functions @subentry command-line options
22738@cindex functions @subentry library @subentry command-line options
22739@cindex command line @subentry options @subentry processing
22740@cindex options @subentry command-line @subentry processing
22741@cindex functions @subentry library @subentry C library
22742@cindex arguments @subentry processing
22743Most utilities on POSIX-compatible systems take options on
22744the command line that can be used to change the way a program behaves.
22745@command{awk} is an example of such a program
22746(@pxref{Options}).
22747Often, options take @dfn{arguments} (i.e., data that the program needs to
22748correctly obey the command-line option).  For example, @command{awk}'s
22749@option{-F} option requires a string to use as the field separator.
22750The first occurrence on the command line of either @option{--} or a
22751string that does not begin with @samp{-} ends the options.
22752
22753@cindex @code{getopt()} function (C library)
22754@cindex C library functions @subentry @code{getopt()}
22755Modern Unix systems provide a C function named @code{getopt()} for processing
22756command-line arguments.  The programmer provides a string describing the
22757one-letter options. If an option requires an argument, it is followed in the
22758string with a colon.  @code{getopt()} is also passed the
22759count and values of the command-line arguments and is called in a loop.
22760@code{getopt()} processes the command-line arguments for option letters.
22761Each time around the loop, it returns a single character representing the
22762next option letter that it finds, or @samp{?} if it finds an invalid option.
22763When it returns @minus{}1, there are no options left on the command line.
22764
22765When using @code{getopt()}, options that do not take arguments can be
22766grouped together.  Furthermore, options that take arguments require that the
22767argument be present.  The argument can immediately follow the option letter,
22768or it can be a separate command-line argument.
22769
22770Given a hypothetical program that takes
22771three command-line options, @option{-a}, @option{-b}, and @option{-c}, where
22772@option{-b} requires an argument, all of the following are valid ways of
22773invoking the program:
22774
22775@example
22776prog -a -b foo -c data1 data2 data3
22777prog -ac -bfoo -- data1 data2 data3
22778prog -acbfoo data1 data2 data3
22779@end example
22780
22781Notice that when the argument is grouped with its option, the rest of
22782the argument is considered to be the option's argument.
22783In this example, @option{-acbfoo} indicates that all of the
22784@option{-a}, @option{-b}, and @option{-c} options were supplied,
22785and that @samp{foo} is the argument to the @option{-b} option.
22786
22787@code{getopt()} provides four external variables that the programmer can use:
22788
22789@table @code
22790@item optind
22791The index in the argument value array (@code{argv}) where the first
22792nonoption command-line argument can be found.
22793
22794@item optarg
22795The string value of the argument to an option.
22796
22797@item opterr
22798Usually @code{getopt()} prints an error message when it finds an invalid
22799option.  Setting @code{opterr} to zero disables this feature.  (An
22800application might want to print its own error message.)
22801
22802@item optopt
22803The letter representing the command-line option.
22804@end table
22805
22806The following C fragment shows how @code{getopt()} might process command-line
22807arguments for @command{awk}:
22808
22809@example
22810int
22811main(int argc, char *argv[])
22812@{
22813    @dots{}
22814    /* print our own message */
22815    opterr = 0;
22816    while ((c = getopt(argc, argv, "v:f:F:W:")) != -1) @{
22817        switch (c) @{
22818        case 'f':    /* file */
22819            @dots{}
22820            break;
22821        case 'F':    /* field separator */
22822            @dots{}
22823            break;
22824        case 'v':    /* variable assignment */
22825            @dots{}
22826            break;
22827        case 'W':    /* extension */
22828            @dots{}
22829            break;
22830        case '?':
22831        default:
22832            usage();
22833            break;
22834        @}
22835    @}
22836    @dots{}
22837@}
22838@end example
22839
The GNU project's versions of the original Unix utilities popularized
the use of long command-line options; for example, @option{--help}
in addition to @option{-h}.  Arguments to long options are either provided
as separate command-line arguments (@samp{--source '@var{program-text}'})
or separated from the option with an @samp{=} sign
(@samp{--source='@var{program-text}'}).
22846
22847As a side point, @command{gawk} actually uses the GNU @code{getopt_long()}
22848function to process both normal and GNU-style long options
22849(@pxref{Options}).
22850
22851The abstraction provided by @code{getopt()} is very useful and is quite
22852handy in @command{awk} programs as well.  Following is an @command{awk}
22853version of @code{getopt()} that accepts both short and long options.
22854
22855This function highlights one of the
22856greatest weaknesses in @command{awk}, which is that it is very poor at
22857manipulating single characters.  The function needs repeated calls to
22858@code{substr()} in order to access individual characters
22859(@pxref{String Functions}).@footnote{This
22860function was written before @command{gawk} acquired the ability to
22861split strings into single characters using @code{""} as the separator.
22862We have left it alone, as using @code{substr()} is more portable.}
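
To make the point concrete, this is roughly what stepping through a
string one character at a time looks like with @code{substr()}.  It is
only a sketch; the variable @code{opts} and its value are invented for
the example:

@example
BEGIN @{
    opts = "acbfoo"
    for (i = 1; i <= length(opts); i++)
        printf "character %d is %s\n", i, substr(opts, i, 1)
@}
@end example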
22863
22864The discussion that follows walks through the code a bit at a time:
22865
22866@cindex @code{getopt()} user-defined function
22867@cindex user-defined @subentry function @subentry @code{getopt()}
22868@example
22869@c file eg/lib/getopt.awk
22870# getopt.awk --- Do C library getopt(3) function in awk
22871#                Also supports long options.
22872@c endfile
22873@ignore
22874@c file eg/lib/getopt.awk
22875#
22876# Arnold Robbins, arnold@@skeeve.com, Public Domain
22877#
22878# Initial version: March, 1991
22879# Revised: May, 1993
22880# Long options added by Greg Minshall, January 2020
22881@c endfile
22882@end ignore
22883@c file eg/lib/getopt.awk
22884
22885# External variables:
22886#    Optind -- index in ARGV of first nonoption argument
22887#    Optarg -- string value of argument to current option
22888#    Opterr -- if nonzero, print our own diagnostic
22889#    Optopt -- current option letter
22890
22891# Returns:
22892#    -1     at end of options
22893#    "?"    for unrecognized option
22894#    <s>    a string representing the current option
22895
22896# Private Data:
22897#    _opti  -- index in multiflag option, e.g., -abc
22898@c endfile
22899@end example
22900
22901The function starts out with comments presenting
22902a list of the global variables it uses,
22903what the return values are, what they mean, and any global variables that
22904are ``private'' to this library function.  Such documentation is essential
22905for any program, and particularly for library functions.
22906
22907The @code{getopt()} function first checks that it was indeed called with
a string of options (the @code{options} parameter).  If both
@code{options} and @code{longopts} have a zero length,
@code{getopt()} immediately returns @minus{}1:
22911
22912@cindex @code{getopt()} user-defined function
22913@cindex user-defined @subentry function @subentry @code{getopt()}
22914@example
22915@c file eg/lib/getopt.awk
22916function getopt(argc, argv, options, longopts,    thisopt, i, j)
22917@{
22918    if (length(options) == 0 && length(longopts) == 0)
22919        return -1                # no options given
22920
22921@group
22922    if (argv[Optind] == "--") @{  # all done
22923        Optind++
22924        _opti = 0
22925        return -1
22926@end group
22927    @} else if (argv[Optind] !~ /^-[^:[:space:]]/) @{
22928        _opti = 0
22929        return -1
22930    @}
22931@c endfile
22932@end example
22933
22934The next thing to check for is the end of the options.  A @option{--}
22935ends the command-line options, as does any command-line argument that
22936does not begin with a @samp{-} (unless it is an argument to a preceding
22937option).  @code{Optind} steps through
22938the array of command-line arguments; it retains its value across calls
22939to @code{getopt()}, because it is a global variable.
22940
The regular expression @code{@w{/^-[^:[:space:]]/}}
22942checks for a @samp{-} followed by anything
22943that is not whitespace and not a colon.
22944If the current command-line argument does not match this pattern,
22945it is not an option, and it ends option processing.
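
The pattern can be tried on its own.  The following sketch (the sample
arguments and the names @code{arg} and @code{kind} are made up for
illustration) classifies a few strings the same way @code{getopt()} does:

@example
BEGIN @{
    n = split("-a --long -:x - file", arg, " ")
    for (i = 1; i <= n; i++) @{
        kind = (arg[i] ~ /^-[^:[:space:]]/) ? "option" : "not an option"
        printf("%-7s %s\n", arg[i], kind)
    @}
@}
@end example

Only @samp{-a} and @samp{--long} are reported as options; a lone
@samp{-} has nothing after the dash, and @samp{-:x} has a colon there.
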
22946Now, we
22947check to see if we are processing a short (single letter) option, or a
22948long option (indicated by two dashes, e.g., @samp{--filename}).  If it
22949is a short option, we continue on:
22950
22951@example
22952@c file eg/lib/getopt.awk
22953    if (argv[Optind] !~ /^--/) @{        # if this is a short option
22954        if (_opti == 0)
22955            _opti = 2
22956        thisopt = substr(argv[Optind], _opti, 1)
22957        Optopt = thisopt
22958        i = index(options, thisopt)
22959        if (i == 0) @{
22960            if (Opterr)
22961                printf("%c -- invalid option\n", thisopt) > "/dev/stderr"
22962            if (_opti >= length(argv[Optind])) @{
22963                Optind++
22964                _opti = 0
22965            @} else
22966                _opti++
22967            return "?"
22968        @}
22969@c endfile
22970@end example
22971
22972The @code{_opti} variable tracks the position in the current command-line
22973argument (@code{argv[Optind]}).  If multiple options are
22974grouped together with one @samp{-} (e.g., @option{-abx}), it is necessary
22975to return them to the user one at a time.
22976
22977If @code{_opti} is equal to zero, it is set to two, which is the index in
22978the string of the next character to look at (we skip the @samp{-}, which
22979is at position one).  The variable @code{thisopt} holds the character,
22980obtained with @code{substr()}.  It is saved in @code{Optopt} for the main
22981program to use.
22982
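The bookkeeping can be pictured with a short sketch.  Here the
hypothetical variable @code{pos} plays the role of @code{_opti},
stepping through a grouped argument one option letter at a time:

@example
BEGIN @{
    arg = "-abx"
    # Start at 2 to skip the "-" at position 1
    for (pos = 2; pos <= length(arg); pos++)
        printf("position %d: option %s\n", pos, substr(arg, pos, 1))
@}
@end example
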
22983If @code{thisopt} is not in the @code{options} string, then it is an
22984invalid option.  If @code{Opterr} is nonzero, @code{getopt()} prints an error
22985message on the standard error that is similar to the message from the C
22986version of @code{getopt()}.
22987
22988Because the option is invalid, it is necessary to skip it and move on to the
22989next option character.  If @code{_opti} is greater than or equal to the
22990length of the current command-line argument, it is necessary to move on
22991to the next argument, so @code{Optind} is incremented and @code{_opti} is reset
22992to zero. Otherwise, @code{Optind} is left alone and @code{_opti} is merely
22993incremented.
22994
22995In any case, because the option is invalid, @code{getopt()} returns @code{"?"}.
22996The main program can examine @code{Optopt} if it needs to know what the
22997invalid option letter actually is. Continuing on:
22998
22999@example
23000@c file eg/lib/getopt.awk
23001        if (substr(options, i + 1, 1) == ":") @{
23002            # get option argument
23003            if (length(substr(argv[Optind], _opti + 1)) > 0)
23004                Optarg = substr(argv[Optind], _opti + 1)
23005            else
23006                Optarg = argv[++Optind]
23007            _opti = 0
23008        @} else
23009            Optarg = ""
23010@c endfile
23011@end example
23012
23013If the option requires an argument, the option letter is followed by a colon
23014in the @code{options} string.  If there are remaining characters in the
23015current command-line argument (@code{argv[Optind]}), then the rest of that
23016string is assigned to @code{Optarg}.  Otherwise, the next command-line
23017argument is used (@samp{-xFOO} versus @samp{@w{-x FOO}}). In either case,
23018@code{_opti} is reset to zero, because there are no more characters left to
23019examine in the current command-line argument. Continuing:
23020
23021@example
23022@c file eg/lib/getopt.awk
23023        if (_opti == 0 || _opti >= length(argv[Optind])) @{
23024            Optind++
23025            _opti = 0
23026        @} else
23027            _opti++
23028        return thisopt
23029@c endfile
23030@end example
23031
Finally, for a short option, if @code{_opti} is either zero or greater
than or equal to the length of the current command-line argument, it means this
23034element in @code{argv} is through being processed, so @code{Optind} is
23035incremented to point to the next element in @code{argv}.  If neither
23036condition is true, then only @code{_opti} is incremented, so that the
23037next option letter can be processed on the next call to @code{getopt()}.
23038
23039On the other hand, if the earlier test found that this was a long
23040option, we take a different branch:
23041
23042@example
23043@c file eg/lib/getopt.awk
23044    @} else @{
23045        j = index(argv[Optind], "=")
23046        if (j > 0)
23047            thisopt = substr(argv[Optind], 3, j - 3)
23048        else
23049            thisopt = substr(argv[Optind], 3)
23050        Optopt = thisopt
23051@c endfile
23052@end example
23053
23054First, we search this option for a possible embedded equal sign, as the
23055specification of long options allows an argument to an option
23056@samp{--someopt} to be specified as @samp{--someopt=answer} as well as
23057@samp{@w{--someopt answer}}.
23058
23059@example
23060@c file eg/lib/getopt.awk
23061        i = match(longopts, "(^|,)" thisopt "($|[,:])")
23062        if (i == 0) @{
23063            if (Opterr)
23064                 printf("%s -- invalid option\n", thisopt) > "/dev/stderr"
23065            Optind++
23066            return "?"
23067        @}
23068@c endfile
23069@end example
23070
23071Next, we try to find the current option in @code{longopts}.  The regular
23072expression given to @code{match()}, @code{@w{"(^|,)" thisopt "($|[,:])"}},
23073matches this option at the beginning of @code{longopts}, or at the
23074beginning of a subsequent long option (the previous long option would
23075have been terminated by a comma), and, in any case, either at the end of
23076the @code{longopts} string (@samp{$}), or followed by a comma
(separating this option from a subsequent option) or a colon (indicating
that this long option takes an argument); @samp{@w{[,:]}} covers both cases.
23079
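To see how this pattern behaves, the following sketch applies it to the
@code{longopts} string from the test program shown later.  The candidate
names and the variables @code{try}, @code{k}, and @code{pos} are
invented for the illustration:

@example
BEGIN @{
    longopts = "longa,longb:,otherc,otherd"
    n = split("longa longb other otherd", try, " ")
    for (k = 1; k <= n; k++) @{
        pos = match(longopts, "(^|,)" try[k] "($|[,:])")
        printf("%-7s %s\n", try[k], pos ? "found" : "not found")
    @}
@}
@end example

Here @samp{other} is not found, because the pattern requires the whole
name to match; abbreviations of long options are not recognized.
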
23080Using this regular expression, we check to see if the current option
23081might possibly be in @code{longopts} (if @code{longopts} is not
specified, this test will also fail).  If the option is not found, we
print an error message when @code{Opterr} is nonzero, skip the argument
by incrementing @code{Optind}, and return @code{"?"}. Continuing on:
23084
23085@example
23086@c file eg/lib/getopt.awk
        # the last character matched is ":" if an argument is required
        if (substr(longopts, i-1+RLENGTH, 1) == ":") @{
23088            if (j > 0)
23089                Optarg = substr(argv[Optind], j + 1)
23090            else
23091                Optarg = argv[++Optind]
23092        @} else
23093            Optarg = ""
23094@c endfile
23095@end example
23096
23097We now check to see if this option takes an argument and, if so, we set
23098@code{Optarg} to the value of that argument (either a value after an
23099equal sign specified on the command line, immediately adjoining the long
23100option string, or as the next argument on the command line).
23101
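The @samp{--someopt=answer} form can be taken apart as follows.  This is
only a sketch with made-up variable names (@code{arg}, @code{name},
@code{value}), mirroring the @code{index()} and @code{substr()} calls
in @code{getopt()}:

@example
BEGIN @{
    arg = "--longb=foo=bar"
    j = index(arg, "=")             # position of the first "="
    name = substr(arg, 3, j - 3)    # text between "--" and "="
    value = substr(arg, j + 1)      # everything after the first "="
    printf("name = <%s>, value = <%s>\n", name, value)
@}
@end example

Because @code{index()} finds only the first equal sign, any later equal
signs remain part of the value.
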
23102@example
23103@c file eg/lib/getopt.awk
23104        Optind++
23105        return thisopt
23106    @}
23107@}
23108@c endfile
23109@end example
23110
We increase @code{Optind} (which we already increased once if a required
argument was supplied as a separate command-line argument instead of
after an equal sign), and return the long option (minus its leading dashes).
23114
23115The @code{BEGIN} rule initializes both @code{Opterr} and @code{Optind} to one.
23116@code{Opterr} is set to one, because the default behavior is for @code{getopt()}
23117to print a diagnostic message upon seeing an invalid option.  @code{Optind}
23118is set to one, because there's no reason to look at the program name, which is
23119in @code{ARGV[0]}:
23120
23121@example
23122@c file eg/lib/getopt.awk
23123BEGIN @{
23124    Opterr = 1    # default is to diagnose
23125    Optind = 1    # skip ARGV[0]
23126
23127    # test program
23128    if (_getopt_test) @{
23129        _myshortopts = "ab:cd"
23130        _mylongopts = "longa,longb:,otherc,otherd"
23131
23132        while ((_go_c = getopt(ARGC, ARGV, _myshortopts, _mylongopts)) != -1)
23133            printf("c = <%s>, Optarg = <%s>\n", _go_c, Optarg)
23134        printf("non-option arguments:\n")
23135        for (; Optind < ARGC; Optind++)
23136            printf("\tARGV[%d] = <%s>\n", Optind, ARGV[Optind])
23137    @}
23138@}
23139@c endfile
23140@end example
23141
23142The rest of the @code{BEGIN} rule is a simple test program.  Here are the
23143results of some sample runs of the test program:
23144
23145@example
23146$ @kbd{awk -f getopt.awk -v _getopt_test=1 -- -a -cbARG bax -x}
23147@print{} c = <a>, Optarg = <>
23148@print{} c = <c>, Optarg = <>
23149@print{} c = <b>, Optarg = <ARG>
23150@print{} non-option arguments:
23151@print{}         ARGV[3] = <bax>
23152@print{}         ARGV[4] = <-x>
23153
23154$ @kbd{awk -f getopt.awk -v _getopt_test=1 -- -a -x -- xyz abc}
23155@print{} c = <a>, Optarg = <>
23156@error{} x -- invalid option
23157@print{} c = <?>, Optarg = <>
23158@print{} non-option arguments:
23159@print{}         ARGV[4] = <xyz>
23160@print{}         ARGV[5] = <abc>
23161
23162$ @kbd{awk -f getopt.awk -v _getopt_test=1 -- -a \}
23163> @kbd{--longa -b xx --longb=foo=bar --otherd --otherc arg1 arg2}
23164@print{} c = <a>, Optarg = <>
23165@print{} c = <longa>, Optarg = <>
23166@print{} c = <b>, Optarg = <xx>
23167@print{} c = <longb>, Optarg = <foo=bar>
23168@print{} c = <otherd>, Optarg = <>
23169@print{} c = <otherc>, Optarg = <>
23170@print{} non-option arguments:
@print{}         ARGV[8] = <arg1>
@print{}         ARGV[9] = <arg2>
23173@end example
23174
23175In all the runs, the first @option{--} terminates the arguments to
23176@command{awk}, so that it does not try to interpret the @option{-a},
23177etc., as its own options.
23178
23179@quotation NOTE
23180After @code{getopt()} is through,
user-level code must clear out all the elements of @code{ARGV} from 1
up to, but not including, @code{Optind}, so that @command{awk} does not
try to process the command-line options as @value{FN}s.
23184@end quotation
23185
23186Using @samp{#!} with the @option{-E} option may help avoid
23187conflicts between your program's options and @command{gawk}'s options,
23188as @option{-E} causes @command{gawk} to abandon processing of
23189further options
23190(@pxref{Executable Scripts} and
23191@ifnotdocbook
23192@pxref{Options}).
23193@end ifnotdocbook
23194@ifdocbook
23195@ref{Options}).
23196@end ifdocbook
23197
23198Several of the sample programs presented in
23199@ref{Sample Programs},
23200use @code{getopt()} to process their arguments.
23201
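A calling program typically looks something like the following sketch.
The option letters, the long option names, the @file{prog.awk} file
name, and the variables @code{verbose} and @code{outfile} are all
hypothetical.  The sketch also assumes that @file{getopt.awk} is loaded
first, so that its @code{BEGIN} rule has already set @code{Optind} and
@code{Opterr}:

@example
# usage: awk -f getopt.awk -f prog.awk -- -v -o out.txt file
BEGIN @{
    while ((c = getopt(ARGC, ARGV, "vo:", "verbose,output:")) != -1) @{
        if (c == "v" || c == "verbose")
            verbose = 1
        else if (c == "o" || c == "output")
            outfile = Optarg
        else            # "?": invalid option
            exit 1
    @}

    # Clear the processed options so awk does not open them as files
    for (i = 1; i < Optind; i++)
        ARGV[i] = ""
@}
@end example
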
23202@node Passwd Functions
23203@section Reading the User Database
23204
23205@cindex libraries of @command{awk} functions @subentry user database, reading
23206@cindex functions @subentry library @subentry user database, reading
23207@cindex user database, reading
23208@cindex database @subentry users, reading
23209@cindex @code{PROCINFO} array
23210The @code{PROCINFO} array
23211(@pxref{Built-in Variables})
23212provides access to the current user's real and effective user and group ID
23213numbers, and, if available, the user's supplementary group set.
23214However, because these are numbers, they do not provide very useful
23215information to the average user.  There needs to be some way to find the
23216user information associated with the user and group ID numbers.  This
23217@value{SECTION} presents a suite of functions for retrieving information from the
user database.  @xref{Group Functions},
for a similar suite that retrieves information from the group database.
23220
23221@cindex @code{getpwent()} function (C library)
23222@cindex C library functions @subentry @code{getpwent()}
23223@cindex @code{getpwent()} user-defined function
23224@cindex user-defined @subentry function @subentry @code{getpwent()}
23225@cindex users, information about @subentry retrieving
23226@cindex login information
23227@cindex account information
23228@cindex password file
23229@cindex files @subentry password
23230The POSIX standard does not define the file where user information is
23231kept.  Instead, it provides the @code{<pwd.h>} header file
23232and several C language subroutines for obtaining user information.
23233The primary function is @code{getpwent()}, for ``get password entry.''
23234The ``password'' comes from the original user database file,
23235@file{/etc/passwd}, which stores user information along with the
23236encrypted passwords (hence the name).
23237
23238@cindex @command{pwcat} program
23239Although an @command{awk} program could simply read @file{/etc/passwd}
23240directly, this file may not contain complete information about the
23241system's set of users.@footnote{It is often the case that password
23242information is stored in a network database.} To be sure you are able to
23243produce a readable and complete version of the user database, it is necessary
23244to write a small C program that calls @code{getpwent()}.  @code{getpwent()}
23245is defined as returning a pointer to a @code{struct passwd}.  Each time it
23246is called, it returns the next entry in the database.  When there are
23247no more entries, it returns @code{NULL}, the null pointer.  When this
23248happens, the C program should call @code{endpwent()} to close the database.
23249Following is @command{pwcat}, a C program that ``cats'' the password database:
23250
23251@example
23252@c file eg/lib/pwcat.c
23253/*
23254 * pwcat.c
23255 *
23256 * Generate a printable version of the password database.
23257 */
23258@c endfile
23259@ignore
23260@c file eg/lib/pwcat.c
23261/*
23262 * Arnold Robbins, arnold@@skeeve.com, May 1993
23263 * Public Domain
23264 * December 2010, move to ANSI C definition for main().
23265 */
23266
23267#if HAVE_CONFIG_H
23268#include <config.h>
23269#endif
23270
23271@c endfile
23272@end ignore
23273@c file eg/lib/pwcat.c
23274#include <stdio.h>
23275#include <pwd.h>
23276
23277@c endfile
23278@ignore
23279@c file eg/lib/pwcat.c
23280#if defined (STDC_HEADERS)
23281#include <stdlib.h>
23282#endif
23283
23284@c endfile
23285@end ignore
23286@c file eg/lib/pwcat.c
23287int
23288main(int argc, char **argv)
23289@{
23290    struct passwd *p;
23291
23292    while ((p = getpwent()) != NULL)
23293@c endfile
23294@ignore
23295@c file eg/lib/pwcat.c
23296#ifdef HAVE_STRUCT_PASSWD_PW_PASSWD
23297@c endfile
23298@end ignore
23299@c file eg/lib/pwcat.c
23300        printf("%s:%s:%ld:%ld:%s:%s:%s\n",
23301            p->pw_name, p->pw_passwd, (long) p->pw_uid,
23302            (long) p->pw_gid, p->pw_gecos, p->pw_dir, p->pw_shell);
23303@c endfile
23304@ignore
23305@c file eg/lib/pwcat.c
23306#else
23307        printf("%s:*:%ld:%ld:%s:%s\n",
23308            p->pw_name, (long) p->pw_uid,
23309            (long) p->pw_gid, p->pw_dir, p->pw_shell);
23310#endif
23311@c endfile
23312@end ignore
23313@c file eg/lib/pwcat.c
23314
23315    endpwent();
23316    return 0;
23317@}
23318@c endfile
23319@end example
23320
23321If you don't understand C, don't worry about it.
23322The output from @command{pwcat} is the user database, in the traditional
23323@file{/etc/passwd} format of colon-separated fields.  The fields are:
23324
23325@table @asis
23326@item Login name
23327The user's login name.
23328
23329@item Encrypted password
23330The user's encrypted password.  This may not be available on some systems.
23331
23332@item User-ID
23333The user's numeric user ID number.
23334(On some systems, it's a C @code{long}, and not an @code{int}.  Thus,
23335we cast it to @code{long} for all cases.)
23336
23337@item Group-ID
23338The user's numeric group ID number.
23339(Similar comments about @code{long} versus @code{int} apply here.)
23340
23341@item Full name
23342The user's full name, and perhaps other information associated with the
23343user.
23344
23345@item Home directory
23346The user's login (or ``home'') directory (familiar to shell programmers as
23347@code{$HOME}).
23348
23349@item Login shell
23350The program that is run when the user logs in.  This is usually a
23351shell, such as Bash.
23352@end table
23353
23354A few lines representative of @command{pwcat}'s output are as follows:
23355
23356@cindex Jacobs, Andrew
23357@cindex Robbins @subentry Arnold
23358@cindex Robbins @subentry Miriam
23359@example
23360$ @kbd{pwcat}
23361@print{} root:x:0:1:Operator:/:/bin/sh
23362@print{} nobody:*:65534:65534::/:
23363@print{} daemon:*:1:1::/:
23364@print{} sys:*:2:2::/:/bin/csh
23365@print{} bin:*:3:3::/bin:
23366@print{} arnold:xyzzy:2076:10:Arnold Robbins:/home/arnold:/bin/sh
23367@print{} miriam:yxaay:112:10:Miriam Robbins:/home/miriam:/bin/sh
23368@print{} andy:abcca2:113:10:Andy Jacobs:/home/andy:/bin/sh
23369@dots{}
23370@end example
23371
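Because the fields are separated by colons, a line in this format is
easy to take apart with @code{split()}.  The following sketch uses one
of the sample lines above; the names @code{line} and @code{pw} are
arbitrary:

@example
BEGIN @{
    line = "arnold:xyzzy:2076:10:Arnold Robbins:/home/arnold:/bin/sh"
    n = split(line, pw, ":")
    printf("%d fields\n", n)
    printf("login = %s, uid = %d, gid = %d\n", pw[1], pw[3], pw[4])
    printf("home = %s, shell = %s\n", pw[6], pw[7])
@}
@end example
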
23372With that introduction, following is a group of functions for getting user
23373information.  There are several functions here, corresponding to the C
23374functions of the same names:
23375
23376@cindex @code{_pw_init()} user-defined function
23377@cindex user-defined @subentry function @subentry @code{_pw_init()}
23378@example
23379@c file eg/lib/passwdawk.in
23380# passwd.awk --- access password file information
23381@c endfile
23382@ignore
23383@c file eg/lib/passwdawk.in
23384#
23385# Arnold Robbins, arnold@@skeeve.com, Public Domain
23386# May 1993
23387# Revised October 2000
23388# Revised December 2010
23389@c endfile
23390@end ignore
23391@c file eg/lib/passwdawk.in
23392
23393BEGIN @{
23394    # tailor this to suit your system
23395    _pw_awklib = "/usr/local/libexec/awk/"
23396@}
23397
23398function _pw_init(    oldfs, oldrs, olddol0, pwcat, using_fw, using_fpat)
23399@{
23400    if (_pw_inited)
23401        return
23402
23403    oldfs = FS
23404    oldrs = RS
23405    olddol0 = $0
23406    using_fw = (PROCINFO["FS"] == "FIELDWIDTHS")
23407    using_fpat = (PROCINFO["FS"] == "FPAT")
23408    FS = ":"
23409    RS = "\n"
23410
23411    pwcat = _pw_awklib "pwcat"
23412    while ((pwcat | getline) > 0) @{
23413        _pw_byname[$1] = $0
23414        _pw_byuid[$3] = $0
23415        _pw_bycount[++_pw_total] = $0
23416    @}
23417    close(pwcat)
23418    _pw_count = 0
23419    _pw_inited = 1
23420    FS = oldfs
23421    if (using_fw)
23422        FIELDWIDTHS = FIELDWIDTHS
23423    else if (using_fpat)
23424        FPAT = FPAT
23425    RS = oldrs
23426    $0 = olddol0
23427@}
23428@c endfile
23429@end example
23430
23431@cindex @code{BEGIN} pattern @subentry @code{pwcat} program
23432The @code{BEGIN} rule sets a private variable to the directory where
23433@command{pwcat} is stored.  Because it is used to help out an @command{awk} library
23434routine, we have chosen to put it in @file{/usr/local/libexec/awk};
23435however, you might want it to be in a different directory on your system.
23436
The function @code{_pw_init()} stores three copies of the user information
in three associative arrays.  The arrays are indexed by username
23439(@code{_pw_byname}), by user ID number (@code{_pw_byuid}), and by order of
23440occurrence (@code{_pw_bycount}).
23441The variable @code{_pw_inited} is used for efficiency, as @code{_pw_init()}
23442needs to be called only once.
23443
23444@cindex @code{PROCINFO} array @subentry testing the field splitting
23445@cindex @code{getline} command @subentry @code{_pw_init()} function
23446Because this function uses @code{getline} to read information from
23447@command{pwcat}, it first saves the values of @code{FS}, @code{RS}, and @code{$0}.
23448It notes in the variable @code{using_fw} whether field splitting
23449with @code{FIELDWIDTHS} is in effect or not.
23450Doing so is necessary, as these functions could be called
23451from anywhere within a user's program, and the user may have his
23452or her own way of splitting records and fields.
23453This makes it possible to restore the correct
23454field-splitting mechanism later.  The test can only be true for
23455@command{gawk}.  It is false if using @code{FS} or @code{FPAT},
23456or on some other @command{awk} implementation.
23457
23458The code that checks for using @code{FPAT}, using @code{using_fpat}
23459and @code{PROCINFO["FS"]}, is similar.
23460
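In @command{gawk}, the value of @code{PROCINFO["FS"]} can be observed
changing as the field-splitting mechanism changes.  The following
sketch is @command{gawk}-specific, and the field widths are arbitrary:

@example
BEGIN @{
    print PROCINFO["FS"]        # "FS", the default mechanism
    FIELDWIDTHS = "3 4"
    print PROCINFO["FS"]        # now "FIELDWIDTHS"
    FS = FS                     # any assignment to FS switches back
    print PROCINFO["FS"]        # "FS" again
@}
@end example

This is why @code{_pw_init()} assigns @code{FIELDWIDTHS} or @code{FPAT}
to itself after restoring @code{FS}: the assignment reselects that
mechanism.
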
23461The main part of the function uses a loop to read database lines, split
23462the lines into fields, and then store the lines into each array as necessary.
23463When the loop is done, @code{@w{_pw_init()}} cleans up by closing the pipeline,
23464setting @code{@w{_pw_inited}} to one, and restoring @code{FS}
23465(and @code{FIELDWIDTHS} or @code{FPAT}
23466if necessary), @code{RS}, and @code{$0}.
23467The use of @code{@w{_pw_count}} is explained shortly.
23468
23469@cindex @code{getpwnam()} function (C library)
23470@cindex C library functions @subentry @code{getpwnam()}
23471The @code{getpwnam()} function takes a username as a string argument. If that
23472user is in the database, it returns the appropriate line. Otherwise, it
23473relies on the array reference to a nonexistent
23474element to create the element with the null string as its value:
23475
23476@cindex @code{getpwnam()} user-defined function
23477@cindex user-defined @subentry function @subentry @code{getpwnam()}
23478@example
23479@group
23480@c file eg/lib/passwdawk.in
23481function getpwnam(name)
23482@{
23483    _pw_init()
23484    return _pw_byname[name]
23485@}
23486@c endfile
23487@end group
23488@end example
23489
23490@cindex @code{getpwuid()} function (C library)
23491@cindex C library functions @subentry @code{getpwuid()}
23492Similarly, the @code{getpwuid()} function takes a user ID number
23493argument. If that user number is in the database, it returns the
23494appropriate line. Otherwise, it returns the null string:
23495
23496@cindex @code{getpwuid()} user-defined function
23497@cindex user-defined @subentry function @subentry @code{getpwuid()}
23498@example
23499@c file eg/lib/passwdawk.in
23500function getpwuid(uid)
23501@{
23502    _pw_init()
23503    return _pw_byuid[uid]
23504@}
23505@c endfile
23506@end example
23507
23508@cindex @code{getpwent()} function (C library)
23509@cindex C library functions @subentry @code{getpwent()}
23510The @code{getpwent()} function simply steps through the database, one entry at
23511a time.  It uses @code{_pw_count} to track its current position in the
23512@code{_pw_bycount} array:
23513
23514@cindex @code{getpwent()} user-defined function
23515@cindex user-defined @subentry function @subentry @code{getpwent()}
23516@example
23517@c file eg/lib/passwdawk.in
23518function getpwent()
23519@{
23520    _pw_init()
23521    if (_pw_count < _pw_total)
23522        return _pw_bycount[++_pw_count]
23523    return ""
23524@}
23525@c endfile
23526@end example
23527
23528@cindex @code{endpwent()} function (C library)
23529@cindex C library functions @subentry @code{endpwent()}
23530The @code{@w{endpwent()}} function resets @code{@w{_pw_count}} to zero, so that
23531subsequent calls to @code{getpwent()} start over again:
23532
23533@cindex @code{endpwent()} user-defined function
23534@cindex user-defined @subentry function @subentry @code{endpwent()}
23535@example
23536@c file eg/lib/passwdawk.in
23537function endpwent()
23538@{
23539    _pw_count = 0
23540@}
23541@c endfile
23542@end example
23543
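Taken together, these functions might be used as in the following
sketch.  It assumes that the library is saved as @file{passwd.awk},
that @command{pwcat} is installed in the directory named by
@code{_pw_awklib}, and that a user named @samp{root} exists; the
@file{users.awk} file name and the names @code{line} and @code{pw} are
arbitrary:

@example
# usage: awk -f passwd.awk -f users.awk
BEGIN @{
    # List every login name with its home directory
    while ((line = getpwent()) != "") @{
        split(line, pw, ":")
        printf("%-12s %s\n", pw[1], pw[6])
    @}
    endpwent()      # rewind for any later getpwent() calls

    # Look up a single user by name
    if ((line = getpwnam("root")) != "") @{
        split(line, pw, ":")
        print "root's login shell is", pw[7]
    @}
@}
@end example
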
23544A conscious design decision in this suite is that each subroutine calls
23545@code{@w{_pw_init()}} to initialize the database arrays.
23546The overhead of running
23547a separate process to generate the user database, and the I/O to scan it,
23548are only incurred if the user's main program actually calls one of these
23549functions.  If this library file is loaded along with a user's program, but
23550none of the routines are ever called, then there is no extra runtime overhead.
(The alternative is to move the body of @code{@w{_pw_init()}} into a
23552@code{BEGIN} rule, which always runs @command{pwcat}.  This simplifies the
23553code but runs an extra process that may never be needed.)
23554
23555In turn, calling @code{_pw_init()} is not too expensive, because the
23556@code{_pw_inited} variable keeps the program from reading the data more than
23557once.  If you are worried about squeezing every last cycle out of your
23558@command{awk} program, the check of @code{_pw_inited} could be moved out of
23559@code{_pw_init()} and duplicated in all the other functions.  In practice,
23560this is not necessary, as most @command{awk} programs are I/O-bound,
23561and such a change would clutter up the code.
23562
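The initialize-on-first-use pattern is easy to see in isolation.  In the
following sketch, the names @code{_demo_init()}, @code{_demo_inited},
and @code{demo_lookup()} are hypothetical stand-ins for the library's
own functions, and a @code{print} statement takes the place of the
expensive work:

@example
function _demo_init()
@{
    if (_demo_inited)
        return
    print "reading data (this happens only once)"
    _demo_inited = 1
@}

function demo_lookup(key)
@{
    _demo_init()
    return "value for " key
@}

BEGIN @{
    print demo_lookup("a")
    print demo_lookup("b")
@}
@end example

The message appears once, before the first result, no matter how many
lookups follow.
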
23563The @command{id} program in @ref{Id Program}
23564uses these functions.
23565
23566@node Group Functions
23567@section Reading the Group Database
23568
23569@cindex libraries of @command{awk} functions @subentry group database, reading
23570@cindex functions @subentry library @subentry group database, reading
23571@cindex group database, reading
23572@cindex database @subentry group, reading
23573@cindex @code{PROCINFO} array @subentry group membership and
23574@cindex @code{getgrent()} function (C library)
23575@cindex C library functions @subentry @code{getgrent()}
23576@cindex @code{getgrent()} user-defined function
23577@cindex user-defined @subentry function @subentry @code{getgrent()}
23578@cindex groups, information about
23579@cindex account information
23580@cindex group file
23581@cindex files @subentry group
23582Much of the discussion presented in
23583@ref{Passwd Functions}
23584applies to the group database as well.  Although there has traditionally
23585been a well-known file (@file{/etc/group}) in a well-known format, the POSIX
23586standard only provides a set of C library routines
23587(@code{<grp.h>} and @code{getgrent()})
23588for accessing the information.
23589Even though this file may exist, it may not have
23590complete information.  Therefore, as with the user database, it is necessary
23591to have a small C program that generates the group database as its output.
23592@command{grcat}, a C program that ``cats'' the group database,
23593is as follows:
23594
23595@cindex @command{grcat} program
23596@example
23597@c file eg/lib/grcat.c
23598/*
23599 * grcat.c
23600 *
23601 * Generate a printable version of the group database.
23602 */
23603@c endfile
23604@ignore
23605@c file eg/lib/grcat.c
23606/*
23607 * Arnold Robbins, arnold@@skeeve.com, May 1993
23608 * Public Domain
23609 * December 2010, move to ANSI C definition for main().
23610 */
23611
23612#if HAVE_CONFIG_H
23613#include <config.h>
23614#endif
23615
23616#if defined (STDC_HEADERS)
23617#include <stdlib.h>
23618#endif
23619
23620#ifndef HAVE_GETGRENT
23621int main() { return 0; }
23622#else
23623@c endfile
23624@end ignore
23625@c file eg/lib/grcat.c
23626#include <stdio.h>
23627#include <grp.h>
23628
23629int
23630main(int argc, char **argv)
23631@{
23632    struct group *g;
23633    int i;
23634
23635    while ((g = getgrent()) != NULL) @{
23636@c endfile
23637@ignore
23638@c file eg/lib/grcat.c
23639#ifdef HAVE_STRUCT_GROUP_GR_PASSWD
23640@c endfile
23641@end ignore
23642@c file eg/lib/grcat.c
23643        printf("%s:%s:%ld:", g->gr_name, g->gr_passwd,
23644                                     (long) g->gr_gid);
23645@c endfile
23646@ignore
23647@c file eg/lib/grcat.c
23648#else
23649        printf("%s:*:%ld:", g->gr_name, (long) g->gr_gid);
23650#endif
23651@c endfile
23652@end ignore
23653@c file eg/lib/grcat.c
23654        for (i = 0; g->gr_mem[i] != NULL; i++) @{
23655            printf("%s", g->gr_mem[i]);
23656@group
23657            if (g->gr_mem[i+1] != NULL)
23658                putchar(',');
23659        @}
23660@end group
23661        putchar('\n');
23662    @}
23663    endgrent();
23664    return 0;
23665@}
23666@c endfile
23667@ignore
23668@c file eg/lib/grcat.c
23669#endif /* HAVE_GETGRENT */
23670@c endfile
23671@end ignore
23672@end example
23673
23674Each line in the group database represents one group.  The fields are
23675separated with colons and represent the following information:
23676
23677@table @asis
23678@item Group Name
23679The group's name.
23680
23681@item Group Password
23682The group's encrypted password. In practice, this field is never used;
23683it is usually empty or set to @samp{*}.
23684
23685@item Group ID Number
23686The group's numeric group ID number;
23687the association of name to number must be unique within the file.
23688(On some systems it's a C @code{long}, and not an @code{int}.  Thus,
23689we cast it to @code{long} for all cases.)
23690
23691@item Group Member List
23692A comma-separated list of usernames.  These users are members of the group.
23693Modern Unix systems allow users to be members of several groups
23694simultaneously.  If your system does, then there are elements
23695@code{"group1"} through @code{"group@var{N}"} in @code{PROCINFO}
23696for those group ID numbers.
23697(Note that @code{PROCINFO} is a @command{gawk} extension;
23698@pxref{Built-in Variables}.)
23699@end table
23700
23701Here is what running @command{grcat} might produce:
23702
23703@example
23704$ @kbd{grcat}
23705@print{} wheel:*:0:arnold
23706@print{} nogroup:*:65534:
23707@print{} daemon:*:1:
23708@print{} kmem:*:2:
23709@print{} staff:*:10:arnold,miriam,andy
23710@print{} other:*:20:
23711@dots{}
23712@end example
23713
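A line in this format can be taken apart in two steps, first on colons
and then on commas.  The following sketch uses one of the sample lines
above; the array names @code{gr} and @code{member} are arbitrary:

@example
BEGIN @{
    line = "staff:*:10:arnold,miriam,andy"
    split(line, gr, ":")
    n = split(gr[4], member, ",")
    printf("group %s (gid %d) has %d members:\n", gr[1], gr[3], n)
    for (i = 1; i <= n; i++)
        print "    " member[i]
@}
@end example
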
23714Here are the functions for obtaining information from the group database.
23715There are several, modeled after the C library functions of the same names:
23716
23717@cindex @code{getline} command @subentry @code{_gr_init()} user-defined function
23718@cindex @code{_gr_init()} user-defined function
23719@cindex user-defined @subentry function @subentry @code{_gr_init()}
23720@example
23721@c file eg/lib/groupawk.in
23722# group.awk --- functions for dealing with the group file
23723@c endfile
23724@ignore
23725@c file eg/lib/groupawk.in
23726#
23727# Arnold Robbins, arnold@@skeeve.com, Public Domain
23728# May 1993
23729# Revised October 2000
23730# Revised December 2010
23731@c endfile
23732@end ignore
23733@c line break on _gr_init for smallbook
23734@c file eg/lib/groupawk.in
23735
23736BEGIN @{
23737    # Change to suit your system
23738    _gr_awklib = "/usr/local/libexec/awk/"
23739@}
23740
23741function _gr_init(    oldfs, oldrs, olddol0, grcat,
23742                             using_fw, using_fpat, n, a, i)
23743@{
23744    if (_gr_inited)
23745        return
23746
23747    oldfs = FS
23748    oldrs = RS
23749    olddol0 = $0
23750    using_fw = (PROCINFO["FS"] == "FIELDWIDTHS")
23751    using_fpat = (PROCINFO["FS"] == "FPAT")
23752    FS = ":"
23753    RS = "\n"
23754
23755    grcat = _gr_awklib "grcat"
23756    while ((grcat | getline) > 0) @{
23757        if ($1 in _gr_byname)
23758            _gr_byname[$1] = _gr_byname[$1] "," $4
23759        else
23760            _gr_byname[$1] = $0
23761        if ($3 in _gr_bygid)
23762            _gr_bygid[$3] = _gr_bygid[$3] "," $4
23763        else
23764            _gr_bygid[$3] = $0
23765
23766        n = split($4, a, "[ \t]*,[ \t]*")
23767        for (i = 1; i <= n; i++)
23768            if (a[i] in _gr_groupsbyuser)
23769                _gr_groupsbyuser[a[i]] = _gr_groupsbyuser[a[i]] " " $1
23770            else
23771                _gr_groupsbyuser[a[i]] = $1
23772
23773        _gr_bycount[++_gr_count] = $0
23774    @}
23775    close(grcat)
23776    _gr_count = 0
23777    _gr_inited++
23778    FS = oldfs
23779    if (using_fw)
23780        FIELDWIDTHS = FIELDWIDTHS
23781    else if (using_fpat)
23782        FPAT = FPAT
23783    RS = oldrs
23784    $0 = olddol0
23785@}
23786@c endfile
23787@end example
23788
23789The @code{BEGIN} rule sets a private variable to the directory where
23790@command{grcat} is stored.  Because it is used to help out an @command{awk} library
23791routine, we have chosen to put it in @file{/usr/local/libexec/awk}.  You might
23792want it to be in a different directory on your system.
23793
23794These routines follow the same general outline as the user database routines
23795(@pxref{Passwd Functions}).
23796The @code{@w{_gr_inited}} variable is used to
23797ensure that the database is scanned no more than once.
23798The @code{@w{_gr_init()}} function first saves @code{FS},
23799@code{RS}, and
23800@code{$0}, and then sets @code{FS} and @code{RS} to the correct values for
23801scanning the group information.
23802It also takes care to note whether @code{FIELDWIDTHS} or @code{FPAT}
23803is being used, and to restore the appropriate field-splitting mechanism.
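
The self-assignment used to restore the field-splitting mechanism can be
seen in isolation in the following minimal sketch.  It is an illustration
only, not part of @file{group.awk}, and it assumes @command{gawk}, because
@code{PROCINFO["FS"]} and @code{FIELDWIDTHS} are @command{gawk} extensions:

@example
BEGIN @{
    FIELDWIDTHS = "3 3"
    print PROCINFO["FS"]        # FIELDWIDTHS
    FS = ":"                    # assigning FS switches to FS splitting
    print PROCINFO["FS"]        # FS
    FIELDWIDTHS = FIELDWIDTHS   # self-assignment switches back
    print PROCINFO["FS"]        # FIELDWIDTHS
@}
@end example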
23804
23805The group information is stored in several associative arrays.
23806The arrays are indexed by group name (@code{@w{_gr_byname}}), by group ID number
23807(@code{@w{_gr_bygid}}), and by position in the database (@code{@w{_gr_bycount}}).
23808There is an additional array indexed by username (@code{@w{_gr_groupsbyuser}}),
23809which is a space-separated list of groups to which each user belongs.
23810
23811Unlike in the user database, it is possible to have multiple records in the
23812database for the same group.  This is common when a group has a large number
23813of members.  A pair of such entries might look like the following:
23814
23815@example
23816tvpeople:*:101:johnny,jay,arsenio
23817tvpeople:*:101:david,conan,tom,joan
23818@end example
23819
For this reason, @code{_gr_init()} checks whether a group name or
group ID number has already been seen.  If so, the usernames are
23822simply concatenated onto the previous list of users.@footnote{There is a
23823subtle problem with the code just presented.  Suppose that
23824the first time there were no names. This code adds the names with
23825a leading comma. It also doesn't check that there is a @code{$4}.}
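
One way to avoid the leading comma is to append a separator only when the
stored record already lists at least one member.  The following is a
minimal sketch, not part of @file{group.awk}; the helper name
@code{_gr_merge()} is hypothetical:

@example
# _gr_merge --- hypothetical helper: add members to a stored record
function _gr_merge(record, members)
@{
    if (members == "")      # nothing to add
        return record
    if (record ~ /:$/)      # stored record has no members yet
        return record members
    return record "," members
@}
@end example

@noindent
With such a helper, the assignments in @code{_gr_init()} could be written
as @samp{_gr_byname[$1] = _gr_merge(_gr_byname[$1], $4)}, and similarly
for @code{_gr_bygid}.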
23826
23827Finally, @code{_gr_init()} closes the pipeline to @command{grcat}, restores
23828@code{FS} (and @code{FIELDWIDTHS} or @code{FPAT}, if necessary), @code{RS}, and @code{$0},
23829initializes @code{_gr_count} to zero
23830(it is used later), and makes @code{_gr_inited} nonzero.
23831
23832@cindex @code{getgrnam()} function (C library)
23833@cindex C library functions @subentry @code{getgrnam()}
The @code{getgrnam()} function takes a group name as its argument, and if that
group exists, its database record is returned.
23836Otherwise, it
23837relies on the array reference to a nonexistent
23838element to create the element with the null string as its value:
23839
23840@cindex @code{getgrnam()} user-defined function
23841@cindex user-defined @subentry function @subentry @code{getgrnam()}
23842@example
23843@c file eg/lib/groupawk.in
23844function getgrnam(group)
23845@{
23846    _gr_init()
23847    return _gr_byname[group]
23848@}
23849@c endfile
23850@end example
23851
23852@cindex @code{getgrgid()} function (C library)
23853@cindex C library functions @subentry @code{getgrgid()}
23854The @code{getgrgid()} function is similar; it takes a numeric group ID and
23855looks up the information associated with that group ID:
23856
23857@cindex @code{getgrgid()} user-defined function
23858@cindex user-defined @subentry function @subentry @code{getgrgid()}
23859@example
23860@c file eg/lib/groupawk.in
23861function getgrgid(gid)
23862@{
23863    _gr_init()
23864    return _gr_bygid[gid]
23865@}
23866@c endfile
23867@end example
23868
23869@cindex @code{getgruser()} function (C library)
23870@cindex C library functions @subentry @code{getgruser()}
23871The @code{getgruser()} function does not have a C counterpart. It takes a
23872username and returns the list of groups that have the user as a member:
23873
23874@cindex @code{getgruser()} user-defined function
23875@cindex user-defined @subentry function @subentry @code{getgruser()}
23876@example
23877@c file eg/lib/groupawk.in
23878function getgruser(user)
23879@{
23880    _gr_init()
23881    return _gr_groupsbyuser[user]
23882@}
23883@c endfile
23884@end example
23885
23886@cindex @code{getgrent()} function (C library)
23887@cindex C library functions @subentry @code{getgrent()}
23888The @code{getgrent()} function steps through the database one entry at a time.
23889It uses @code{_gr_count} to track its position in the list:
23890
23891@cindex @code{getgrent()} user-defined function
23892@cindex user-defined @subentry function @subentry @code{getgrent()}
23893@example
23894@c file eg/lib/groupawk.in
23895function getgrent()
23896@{
23897    _gr_init()
23898    if (++_gr_count in _gr_bycount)
23899        return _gr_bycount[_gr_count]
23900@group
23901    return ""
23902@}
23903@end group
23904@c endfile
23905@end example
23906
23907@cindex @code{endgrent()} function (C library)
23908@cindex C library functions @subentry @code{endgrent()}
23909The @code{endgrent()} function resets @code{_gr_count} to zero so that @code{getgrent()} can
23910start over again:
23911
23912@cindex @code{endgrent()} user-defined function
23913@cindex user-defined @subentry function @subentry @code{endgrent()}
23914@example
23915@c file eg/lib/groupawk.in
23916function endgrent()
23917@{
23918    _gr_count = 0
23919@}
23920@c endfile
23921@end example
23922
23923As with the user database routines, each function calls @code{_gr_init()} to
23924initialize the arrays.  Doing so only incurs the extra overhead of running
23925@command{grcat} if these functions are used (as opposed to moving the body of
23926@code{_gr_init()} into a @code{BEGIN} rule).
23927
23928Most of the work is in scanning the database and building the various
23929associative arrays.  The functions that the user calls are themselves very
simple, relying on @command{awk}'s associative arrays to do the work.
23931
23932The @command{id} program in @ref{Id Program}
23933uses these functions.
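
As a usage illustration, the following sketch looks up one group, lists
the groups of one user, and then steps through the whole database.  The
group name @samp{staff} and the username @samp{arnold} are taken from the
sample @command{grcat} output shown earlier and are assumptions about
your system:

@example
BEGIN @{
    print getgrnam("staff")       # full record, or "" if no such group
    print getgruser("arnold")     # space-separated group names

    while ((entry = getgrent()) != "")
        print entry
    endgrent()                    # rewind for any later getgrent() calls
@}
@end example

@noindent
Assuming this is saved in a file named @file{grtest.awk} (a hypothetical
name), it can be run with @samp{gawk -f group.awk -f grtest.awk}.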
23934
23935@node Walking Arrays
23936@section Traversing Arrays of Arrays
23937
23938@ref{Arrays of Arrays} described how @command{gawk}
23939provides arrays of arrays.  In particular, any element of
23940an array may be either a scalar or another array. The
23941@code{isarray()} function (@pxref{Type Functions})
23942lets you distinguish an array
23943from a scalar.
23944The following function, @code{walk_array()}, recursively traverses
23945an array, printing the element indices and values.
23946You call it with the array and a string representing the name
23947of the array:
23948
23949@cindex @code{walk_array()} user-defined function
23950@cindex user-defined @subentry function @subentry @code{walk_array()}
23951@example
23952@c file eg/lib/walkarray.awk
23953function walk_array(arr, name,      i)
23954@{
23955    for (i in arr) @{
23956        if (isarray(arr[i]))
23957            walk_array(arr[i], (name "[" i "]"))
23958        else
23959            printf("%s[%s] = %s\n", name, i, arr[i])
23960    @}
23961@}
23962@c endfile
23963@end example
23964
23965@noindent
23966It works by looping over each element of the array. If any given
23967element is itself an array, the function calls itself recursively,
23968passing the subarray and a new string representing the current index.
23969Otherwise, the function simply prints the element's name, index, and value.
23970Here is a main program to demonstrate:
23971
23972@example
23973BEGIN @{
23974    a[1] = 1
23975    a[2][1] = 21
23976    a[2][2] = 22
23977    a[3] = 3
23978    a[4][1][1] = 411
23979    a[4][2] = 42
23980
23981    walk_array(a, "a")
23982@}
23983@end example
23984
23985When run, the program produces the following output:
23986
23987@example
23988$ @kbd{gawk -f walk_array.awk}
23989@print{} a[1] = 1
23990@print{} a[2][1] = 21
23991@print{} a[2][2] = 22
23992@print{} a[3] = 3
23993@print{} a[4][1][1] = 411
23994@print{} a[4][2] = 42
23995@end example
23996
23997The function just presented simply prints the
23998name and value of each scalar array element. However, it is easy to
23999generalize it, by passing in the name of a function to call
24000when walking an array. The modified function looks like this:
24001
24002@example
24003@c file eg/lib/processarray.awk
24004function process_array(arr, name, process, do_arrays,   i, new_name)
24005@{
24006    for (i in arr) @{
24007        new_name = (name "[" i "]")
24008        if (isarray(arr[i])) @{
24009            if (do_arrays)
24010                @@process(new_name, arr[i])
24011            process_array(arr[i], new_name, process, do_arrays)
24012        @} else
24013            @@process(new_name, arr[i])
24014    @}
24015@}
24016@c endfile
24017@end example
24018
24019The arguments are as follows:
24020
24021@table @code
24022@item arr
24023The array.
24024
24025@item name
24026The name of the array (a string).
24027
24028@item process
24029The name of the function to call.
24030
24031@item do_arrays
If this is true, @code{process} is also called for elements that are
subarrays, so the function it names must be able to handle them.
24033@end table
24034
24035If subarrays are to be processed, that is done before walking them further.
24036
24037When run with the following scaffolding, the function produces the same
24038results as does the earlier version of @code{walk_array()}:
24039
24040@example
24041BEGIN @{
24042    a[1] = 1
24043    a[2][1] = 21
24044    a[2][2] = 22
24045    a[3] = 3
24046    a[4][1][1] = 411
24047    a[4][2] = 42
24048
24049    process_array(a, "a", "do_print", 0)
24050@}
24051
24052function do_print(name, element)
24053@{
24054    printf "%s = %s\n", name, element
24055@}
24056@end example
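
When @code{do_arrays} is true, the function named by @code{process} also
receives subarrays, so it has to test its second argument.  A minimal
sketch of such a callback follows; the name @code{do_print_all()} is
hypothetical:

@example
function do_print_all(name, element)
@{
    if (isarray(element))
        printf "%s is a subarray with %d element(s)\n",
               name, length(element)
    else
        printf "%s = %s\n", name, element
@}
@end example

@noindent
It would be used by calling
@samp{process_array(a, "a", "do_print_all", 1)}.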
24057
24058@node Library Functions Summary
24059@section Summary
24060
24061@itemize @value{BULLET}
24062@item
24063Reading programs is an excellent way to learn Good Programming.
24064The functions and programs provided in this @value{CHAPTER} and the next
24065are intended to serve that purpose.
24066
24067@item
24068When writing general-purpose library functions, put some thought into how
24069to name any global variables so that they won't conflict with variables
24070from a user's program.
24071
24072@item
24073The functions presented here fit into the following categories:
24074
24075@c nested list
24076@table @asis
24077@item General problems
24078Number-to-string conversion, testing assertions, rounding, random number
24079generation, converting characters to numbers, joining strings, getting
24080easily usable time-of-day information, and reading a whole file in
24081one shot
24082
24083@item Managing @value{DF}s
24084Noting @value{DF} boundaries, rereading the current file, checking for
24085readable files, checking for zero-length files, and treating assignments
24086as @value{FN}s
24087
24088@item Processing command-line options
24089An @command{awk} version of the standard C @code{getopt()} function
24090
24091@item Reading the user and group databases
24092Two sets of routines that parallel the C library versions
24093
24094@item Traversing arrays of arrays
24095Two functions that traverse an array of arrays to any depth
24096@end table
24097@c end nested list
24098
24099@end itemize
24100
24101@c EXCLUDE START
24102@node Library Exercises
24103@section Exercises
24104
24105@enumerate
24106@item
24107In @ref{Empty Files}, we presented the @file{zerofile.awk} program,
24108which made use of @command{gawk}'s @code{ARGIND} variable.  Can this
24109problem be solved without relying on @code{ARGIND}?  If so, how?
24110
24111@ignore
24112# zerofile2.awk --- same thing, portably
24113
24114BEGIN @{
24115    ARGIND = Argind = 0
24116    for (i = 1; i < ARGC; i++)
24117        Fnames[ARGV[i]]++
24118
24119@}
24120FNR == 1 @{
24121    while (ARGV[ARGIND] != FILENAME)
24122        ARGIND++
24123    Seen[FILENAME]++
24124    if (Seen[FILENAME] == Fnames[FILENAME])
24125        do
24126            ARGIND++
24127        while (ARGV[ARGIND] != FILENAME)
24128@}
24129ARGIND > Argind + 1 @{
24130    for (Argind++; Argind < ARGIND; Argind++)
24131        zerofile(ARGV[Argind], Argind)
24132@}
24133ARGIND != Argind @{
24134    Argind = ARGIND
24135@}
24136END @{
24137    if (ARGIND < ARGC - 1)
24138        ARGIND = ARGC - 1
24139    if (ARGIND > Argind)
24140        for (Argind++; Argind <= ARGIND; Argind++)
24141            zerofile(ARGV[Argind], Argind)
24142@}
24143@end ignore
24144
24145@item
24146As a related challenge, revise that code to handle the case where
24147an intervening value in @code{ARGV} is a variable assignment.
24148
24149@ignore
24150@c June 13 2015: Antonio points out that this is answered in the text. Ooops.
24151@item
24152@ref{Walking Arrays} presented a function that walked a multidimensional
24153array to print it out.  However, walking an array and processing
24154each element is a general-purpose operation.  Generalize the
24155@code{walk_array()} function by adding an additional parameter named
24156@code{process}.
24157
24158Then, inside the loop, instead of printing the array element's index and
24159value, use the indirect function call syntax (@pxref{Indirect Calls})
24160on @code{process}, passing it the index and the value.
24161
24162When calling @code{walk_array()}, you would pass the name of a
24163user-defined function that expects to receive an index and a value,
24164and then processes the element.
24165
24166Test your new version by printing the array; you should end up with
24167output identical to that of the original version.
24168@end ignore
24169
24170@end enumerate
24171@c EXCLUDE END
24172
24173
24174@node Sample Programs
24175@chapter Practical @command{awk} Programs
24176@cindex @command{awk} programs @subentry examples of
24177
24178@c FULLXREF ON
24179@ref{Library Functions},
24180presents the idea that reading programs in a language contributes to
24181learning that language.  This @value{CHAPTER} continues that theme,
24182presenting a potpourri of @command{awk} programs for your reading
24183enjoyment.
24184@c FULLXREF OFF
24185@ifnotinfo
24186There are three @value{SECTION}s.
24187The first describes how to run the programs presented
24188in this @value{CHAPTER}.
24189
24190The second presents @command{awk}
24191versions of several common POSIX utilities.
You are hopefully already familiar with these programs,
and therefore already understand the problems they solve.
24194By reimplementing these programs in @command{awk},
24195you can focus on the @command{awk}-related aspects of solving
24196the programming problems.
24197
24198The third is a grab bag of interesting programs.
24199These solve a number of different data-manipulation and management
24200problems.  Many of the programs are short, which emphasizes @command{awk}'s
24201ability to do a lot in just a few lines of code.
24202@end ifnotinfo
24203
24204Many of these programs use library functions presented in
24205@ref{Library Functions}.
24206
24207@menu
24208* Running Examples::            How to run these examples.
24209* Clones::                      Clones of common utilities.
24210* Miscellaneous Programs::      Some interesting @command{awk} programs.
24211* Programs Summary::            Summary of programs.
24212* Programs Exercises::          Exercises.
24213@end menu
24214
24215@node Running Examples
24216@section Running the Example Programs
24217
24218To run a given program, you would typically do something like this:
24219
24220@example
24221awk -f @var{program} -- @var{options} @var{files}
24222@end example
24223
24224@noindent
24225Here, @var{program} is the name of the @command{awk} program (such as
24226@file{cut.awk}), @var{options} are any command-line options for the
24227program that start with a @samp{-}, and @var{files} are the actual @value{DF}s.
24228
24229If your system supports the @samp{#!} executable interpreter mechanism
24230(@pxref{Executable Scripts}),
24231you can instead run your program directly:
24232
24233@example
24234cut.awk -c1-8 myfiles > results
24235@end example
24236
24237If your @command{awk} is not @command{gawk}, you may instead need to use this:
24238
24239@example
24240cut.awk -- -c1-8 myfiles > results
24241@end example
24242
24243@node Clones
24244@section Reinventing Wheels for Fun and Profit
24245@cindex POSIX @subentry programs, implementing in @command{awk}
24246
24247This @value{SECTION} presents a number of POSIX utilities implemented in
24248@command{awk}.  Reinventing these programs in @command{awk} is often enjoyable,
24249because the algorithms can be very clearly expressed, and the code is usually
24250very concise and simple.  This is true because @command{awk} does so much for you.
24251
24252It should be noted that these programs are not necessarily intended to
24253replace the installed versions on your system.
24254Nor may all of these programs be fully compliant with the most recent
24255POSIX standard.  This is not a problem; their
24256purpose is to illustrate @command{awk} language programming for ``real-world''
24257tasks.
24258
24259The programs are presented in alphabetical order.
24260
24261@menu
24262* Cut Program::                 The @command{cut} utility.
24263* Egrep Program::               The @command{egrep} utility.
24264* Id Program::                  The @command{id} utility.
24265* Split Program::               The @command{split} utility.
24266* Tee Program::                 The @command{tee} utility.
24267* Uniq Program::                The @command{uniq} utility.
24268* Wc Program::                  The @command{wc} utility.
24269@end menu
24270
24271@node Cut Program
24272@subsection Cutting Out Fields and Columns
24273
24274@cindex @command{cut} utility
24276@cindex fields @subentry cutting
24277@cindex columns @subentry cutting
24278The @command{cut} utility selects, or ``cuts,'' characters or fields
24279from its standard input and sends them to its standard output.
24280Fields are separated by TABs by default,
24281but you may supply a command-line option to change the field
24282@dfn{delimiter} (i.e., the field-separator character). @command{cut}'s
24283definition of fields is less general than @command{awk}'s.
24284
24285A common use of @command{cut} might be to pull out just the login names of
24286logged-on users from the output of @command{who}.  For example, the following
24287pipeline generates a sorted, unique list of the logged-on users:
24288
24289@example
24290who | cut -c1-8 | sort | uniq
24291@end example
24292
24293The options for @command{cut} are:
24294
24295@table @code
24296@item -c @var{list}
24297Use @var{list} as the list of characters to cut out.  Items within the list
24298may be separated by commas, and ranges of characters can be separated with
24299dashes.  The list @samp{1-8,15,22-35} specifies characters 1 through
243008, 15, and 22 through 35.
24301
24302@item -d @var{delim}
24303Use @var{delim} as the field-separator character instead of the TAB
24304character.
24305
24306@item -f @var{list}
24307Use @var{list} as the list of fields to cut out.
24308
24309@item -s
24310Suppress printing of lines that do not contain the field delimiter.
24311@end table
24312
24313The @command{awk} implementation of @command{cut} uses the @code{getopt()} library
24314function (@pxref{Getopt Function})
24315and the @code{join()} library function
24316(@pxref{Join Function}).
24317
24318The current POSIX version of @command{cut} has options to cut fields based on
24319both bytes and characters. This version does not attempt to implement those options,
24320as @command{awk} works exclusively in terms of characters.
24321
24322The program begins with a comment describing the options, the library
24323functions needed, and a @code{usage()} function that prints out a usage
24324message and exits.  @code{usage()} is called if invalid arguments are
24325supplied:
24326
24327@cindex @file{cut.awk} program
24328@example
24329@c file eg/prog/cut.awk
24330# cut.awk --- implement cut in awk
24331@c endfile
24332@ignore
24333@c file eg/prog/cut.awk
24334#
24335# Arnold Robbins, arnold@@skeeve.com, Public Domain
24336# May 1993
24337@c endfile
24338@end ignore
24339@c file eg/prog/cut.awk
24340
24341# Options:
24342#    -c list     Cut characters
24343#    -f list     Cut fields
24344#    -d c        Field delimiter character
24345#
24346#    -s          Suppress lines without the delimiter
24347#
24348# Requires getopt() and join() library functions
24349
24350@group
24351function usage()
24352@{
24353    print("usage: cut [-f list] [-d c] [-s] [files...]") > "/dev/stderr"
24354    print("       cut [-c list] [files...]") > "/dev/stderr"
24355    exit 1
24356@}
24357@end group
24358@c endfile
24359@end example
24360
24361@cindex @code{BEGIN} pattern @subentry running @command{awk} programs and
24362@cindex @code{FS} variable @subentry running @command{awk} programs and
24363Next comes a @code{BEGIN} rule that parses the command-line options.
24364It sets @code{FS} to a single TAB character, because that is @command{cut}'s
24365default field separator. The rule then sets the output field separator to be the
24366same as the input field separator.  A loop using @code{getopt()} steps
24367through the command-line options.  Exactly one of the variables
24368@code{by_fields} or @code{by_chars} is set to true, to indicate that
24369processing should be done by fields or by characters, respectively.
24370When cutting by characters, the output field separator is set to the null
24371string:
24372
24373@example
24374@c file eg/prog/cut.awk
24375BEGIN @{
24376    FS = "\t"    # default
24377    OFS = FS
24378    while ((c = getopt(ARGC, ARGV, "sf:c:d:")) != -1) @{
24379        if (c == "f") @{
24380            by_fields = 1
24381            fieldlist = Optarg
24382        @} else if (c == "c") @{
24383            by_chars = 1
24384            fieldlist = Optarg
24385            OFS = ""
24386        @} else if (c == "d") @{
24387            if (length(Optarg) > 1) @{
24388                printf("cut: using first character of %s" \
24389                       " for delimiter\n", Optarg) > "/dev/stderr"
24390                Optarg = substr(Optarg, 1, 1)
24391            @}
24392            fs = FS = Optarg
24393            OFS = FS
24394            if (FS == " ")    # defeat awk semantics
24395                FS = "[ ]"
24396        @} else if (c == "s")
24397            suppress = 1
24398        else
24399            usage()
24400    @}
24401
24402    # Clear out options
24403    for (i = 1; i < Optind; i++)
24404        ARGV[i] = ""
24405@c endfile
24406@end example
24407
24408@cindex field separator @subentry spaces as
24409The code must take
24410special care when the field delimiter is a space.  Using
24411a single space (@code{@w{" "}}) for the value of @code{FS} is
24412incorrect---@command{awk} would separate fields with runs of spaces,
24413TABs, and/or newlines, and we want them to be separated with individual
24414spaces.
24415To this end, we save the original space character in the variable
24416@code{fs} for later use; after setting @code{FS} to @code{@w{"[ ]"}} we can't
24417use it directly to see if the field delimiter character is in the string.
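
The difference between the two settings can be checked with @code{split()},
which treats a separator of @code{@w{" "}} the same way that @code{FS} does.
This is a small illustration, not part of @file{cut.awk}:

@example
BEGIN @{
    line = "a  b"                       # two spaces between the letters
    print split(line, parts, " ")       # 2: runs of spaces count once
    print split(line, parts, "[ ]")     # 3: "a", "", and "b"
@}
@end example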
24418
24419Also remember that after @code{getopt()} is through
24420(as described in @ref{Getopt Function}),
24421we have to
clear out all the elements of @code{ARGV} from 1 up to (but not including) @code{Optind},
24423so that @command{awk} does not try to process the command-line options
24424as @value{FN}s.
24425
24426After dealing with the command-line options, the program verifies that the
24427options make sense.  Only one or the other of @option{-c} and @option{-f}
24428should be used, and both require a field list.  Then the program calls
24429either @code{set_fieldlist()} or @code{set_charlist()} to pull apart the
24430list of fields or characters:
24431
24432@example
24433@c file eg/prog/cut.awk
24434    if (by_fields && by_chars)
24435        usage()
24436
24437    if (by_fields == 0 && by_chars == 0)
24438        by_fields = 1    # default
24439
24440@group
24441    if (fieldlist == "") @{
24442        print "cut: needs list for -c or -f" > "/dev/stderr"
24443        exit 1
24444    @}
24445@end group
24446
24447    if (by_fields)
24448        set_fieldlist()
24449    else
24450        set_charlist()
24451@}
24452@c endfile
24453@end example
24454
24455@code{set_fieldlist()} splits the field list apart at the commas
24456into an array.  Then, for each element of the array, it looks to
24457see if the element is actually a range, and if so, splits it apart.
24458The function checks the range
24459to make sure that the first number is smaller than the second.
24460Each number in the list is added to the @code{flist} array, which
24461simply lists the fields that will be printed.  Normal field splitting
24462is used.  The program lets @command{awk} handle the job of doing the
24463field splitting:
24464
24465@example
24466@c file eg/prog/cut.awk
24467function set_fieldlist(        n, m, i, j, k, f, g)
24468@{
24469    n = split(fieldlist, f, ",")
24470    j = 1    # index in flist
24471    for (i = 1; i <= n; i++) @{
24472        if (index(f[i], "-") != 0) @{ # a range
24473            m = split(f[i], g, "-")
24474@group
24475            if (m != 2 || g[1] >= g[2]) @{
24476                printf("cut: bad field list: %s\n",
24477                                  f[i]) > "/dev/stderr"
24478                exit 1
24479            @}
24480@end group
24481            for (k = g[1]; k <= g[2]; k++)
24482                flist[j++] = k
24483        @} else
24484            flist[j++] = f[i]
24485    @}
24486    nfields = j - 1
24487@}
24488@c endfile
24489@end example
24490
24491The @code{set_charlist()} function is more complicated than
24492@code{set_fieldlist()}.
24493The idea here is to use @command{gawk}'s @code{FIELDWIDTHS} variable
24494(@pxref{Constant Size}),
24495which describes constant-width input.  When using a character list, that is
24496exactly what we have.
24497
24498Setting up @code{FIELDWIDTHS} is more complicated than simply listing the
24499fields that need to be printed.  We have to keep track of the fields to
24500print and also the intervening characters that have to be skipped.
24501For example, suppose you wanted characters 1 through 8, 15, and
2450222 through 35.  You would use @samp{-c 1-8,15,22-35}.  The necessary value
24503for @code{FIELDWIDTHS} is @code{@w{"8 6 1 6 14"}}.  This yields five
24504fields, and the fields to print
24505are @code{$1}, @code{$3}, and @code{$5}.
24506The intermediate fields are @dfn{filler},
24507which is stuff in between the desired data.
24508@code{flist} lists the fields to print, and @code{t} tracks the
24509complete field list, including filler fields:
24510
24511@example
24512@c file eg/prog/cut.awk
24513function set_charlist(    field, i, j, f, g, n, m, t,
24514                          filler, last, len)
24515@{
24516    field = 1   # count total fields
24517    n = split(fieldlist, f, ",")
24518    j = 1       # index in flist
24519    for (i = 1; i <= n; i++) @{
24520        if (index(f[i], "-") != 0) @{ # range
24521            m = split(f[i], g, "-")
24522            if (m != 2 || g[1] >= g[2]) @{
24523                printf("cut: bad character list: %s\n",
24524                               f[i]) > "/dev/stderr"
24525                exit 1
24526            @}
24527            len = g[2] - g[1] + 1
24528            if (g[1] > 1)  # compute length of filler
24529                filler = g[1] - last - 1
24530            else
24531                filler = 0
24532@group
24533            if (filler)
24534                t[field++] = filler
24535@end group
24536            t[field++] = len  # length of field
24537            last = g[2]
24538            flist[j++] = field - 1
24539        @} else @{
24540            if (f[i] > 1)
24541                filler = f[i] - last - 1
24542            else
24543                filler = 0
24544            if (filler)
24545                t[field++] = filler
24546            t[field++] = 1
24547            last = f[i]
24548            flist[j++] = field - 1
24549        @}
24550    @}
24551    FIELDWIDTHS = join(t, 1, field - 1)
24552    nfields = j - 1
24553@}
24554@c endfile
24555@end example
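
The @code{FIELDWIDTHS} value from the @samp{-c 1-8,15,22-35} example can
be tried directly.  This sketch is an illustration only; the input string
is made up so that each character's position is easy to count:

@example
BEGIN @{
    FIELDWIDTHS = "8 6 1 6 14"
    $0 = "abcdefghijklmnopqrstuvwxyz0123456789"
    print $1, $3, $5
@}
@end example

@noindent
The output consists of characters 1 through 8, character 15, and
characters 22 through 35:

@example
@print{} abcdefgh o vwxyz012345678
@end example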
24556
24557Next is the rule that processes the data.  If the @option{-s} option
24558is given, then @code{suppress} is true.  The first @code{if} statement
24559makes sure that the input record does have the field separator.  If
24560@command{cut} is processing fields, @code{suppress} is true, and the field
24561separator character is not in the record, then the record is skipped.
24562
24563If the record is valid, then @command{gawk} has split the data
24564into fields, either using the character in @code{FS} or using fixed-length
24565fields and @code{FIELDWIDTHS}.  The loop goes through the list of fields
24566that should be printed.  The corresponding field is printed if it contains data.
24567If the next field also has data, then the separator character is
24568written out between the fields:
24569
24570@example
24571@c file eg/prog/cut.awk
24572@{
24573    if (by_fields && suppress && index($0, fs) == 0)
24574        next
24575
24576    for (i = 1; i <= nfields; i++) @{
24577        if ($flist[i] != "") @{
24578            printf "%s", $flist[i]
24579            if (i < nfields && $flist[i+1] != "")
24580                printf "%s", OFS
24581        @}
24582    @}
24583    print ""
24584@}
24585@c endfile
24586@end example
24587
24588This version of @command{cut} relies on @command{gawk}'s @code{FIELDWIDTHS}
24589variable to do the character-based cutting.  It is possible in
24590other @command{awk} implementations to use @code{substr()}
24591(@pxref{String Functions}), but
24592it is also extremely painful.
24593The @code{FIELDWIDTHS} variable supplies an elegant solution to the problem
24594of picking the input line apart by characters.
24595
24596
24597@node Egrep Program
24598@subsection Searching for Regular Expressions in Files
24599
24600@cindex regular expressions @subentry searching for
24601@cindex searching @subentry files for regular expressions
24602@cindex files @subentry searching for regular expressions
24603@cindex @command{egrep} utility
24604The @command{grep} family of programs searches files for patterns.
24605These programs have an unusual history.
24606Initially there was @command{grep} (Global Regular Expression Print),
24607which used what are now called Basic Regular Expressions (BREs).
24608Later there was @command{egrep} (Extended @command{grep}) which used
24609what are now called Extended Regular Expressions (EREs). (These are almost
24610identical to those available in @command{awk}; @pxref{Regexp}).
24611There was also @command{fgrep} (Fast @command{grep}), which searched
for matches of one or more fixed strings.
24613
24614POSIX chose to combine these three programs into one, simply named
24615@command{grep}.  On a POSIX system, @command{grep}'s default behavior
is to search using BREs. You use @option{-E} to specify the use
24617of EREs, and @option{-F} to specify searching for fixed strings.
24618
24619In practice, systems continue to come with separate @command{egrep}
24620and @command{fgrep} utilities, for backwards compatibility. This
24621@value{SECTION} provides an @command{awk} implementation of @command{egrep},
24622which supports all of the POSIX-mandated options.
24623You invoke it as follows:
24624
24625@display
24626@command{egrep} [@var{options}] @code{'@var{pattern}'} @var{files} @dots{}
24627@end display
24628
24629The @var{pattern} is a regular expression.  In typical usage, the regular
24630expression is quoted to prevent the shell from expanding any of the
24631special characters as @value{FN} wildcards.  Normally, @command{egrep}
24632prints the lines that matched.  If multiple @value{FN}s are provided on
24633the command line, each output line is preceded by the name of the file
24634and a colon.
24635
24636The options to @command{egrep} are as follows:
24637
24638@table @code
24639@item -c
24640Print a count of the lines that matched the pattern, instead of the
24641lines themselves.
24642
24643@item -e @var{pattern}
24644Use @var{pattern} as the regexp to match.  The purpose of the @option{-e}
24645option is to allow patterns that start with a @samp{-}.
24646
24647@item -i
24648Ignore case distinctions in both the pattern and the input data.
24649
@item -l
Only print (list) the names of the files that matched, not the lines that matched.

@item -n
Add the line number to each output line.
24652
24653@item -q
24654Be quiet.  No output is produced and the exit value indicates whether
24655the pattern was matched.
24656
24657@item -s
24658Be silent. Do not print error messages for files that could
24659not be opened.
24660
24661@item -v
24662Invert the sense of the test. @command{egrep} prints the lines that do
24663@emph{not} match the pattern and exits successfully if the pattern is not
24664matched.
24665
24666@item -x
24667Match the entire input line in order to consider the match as having
24668succeeded.
24669@end table
24670
24671This version uses the @code{getopt()} library function
24672(@pxref{Getopt Function}) and @command{gawk}'s
24673@code{BEGINFILE} and @code{ENDFILE} special patterns
24674(@pxref{BEGINFILE/ENDFILE}).
24675
24676The program begins with descriptive comments and then a @code{BEGIN} rule
24677that processes the command-line arguments with @code{getopt()}.  The @option{-i}
24678(ignore case) option is particularly easy with @command{gawk}; we just use the
24679@code{IGNORECASE} predefined variable
24680(@pxref{Built-in Variables}):
24681
24682@cindex @file{egrep.awk} program
24683@example
24684@c file eg/prog/egrep.awk
24685# egrep.awk --- simulate egrep in awk
24686#
24687@c endfile
24688@ignore
24689@c file eg/prog/egrep.awk
24690# Arnold Robbins, arnold@@skeeve.com, Public Domain
24691# May 1993
24692# Revised September 2020
24693
24694@c endfile
24695@end ignore
24696@c file eg/prog/egrep.awk
24697# Options:
24698#    -c    count of lines
24699#    -e    argument is pattern
24700#    -i    ignore case
24701#    -l    print filenames only
24702#    -n    add line number to output
24703#    -q    quiet - use exit value
24704#    -s    silent - don't print errors
24705#    -v    invert test, success if no match
24706#    -x    the entire line must match
24707#
24708# Requires getopt library function
24709# Uses IGNORECASE, BEGINFILE and ENDFILE
24710# Invoke using gawk -f egrep.awk -- options ...
24711
24712BEGIN @{
24713    while ((c = getopt(ARGC, ARGV, "ce:ilnqsvx")) != -1) @{
24714        if (c == "c")
24715            count_only++
24716        else if (c == "e")
24717            pattern = Optarg
24718        else if (c == "i")
24719            IGNORECASE = 1
24720        else if (c == "l")
24721            filenames_only++
24722        else if (c == "n")
24723            line_numbers++
24724        else if (c == "q")
24725            no_print++
24726        else if (c == "s")
24727            no_errors++
24728        else if (c == "v")
24729            invert++
24730        else if (c == "x")
24731            full_line++
24732        else
24733            usage()
24734    @}
24735@c endfile
24736@end example
24737
24738@noindent
24739Note the comment about invocation: Because several of the options overlap
24740with @command{gawk}'s, a @option{--} is needed to tell @command{gawk}
24741to stop looking for options.
24742
24743Next comes the code that handles the @command{egrep}-specific behavior.
@command{egrep} uses the first nonoption argument on the command line
as the pattern if none is supplied with @option{-e}.
If the pattern is still empty, no pattern was supplied at all, so it's
necessary to print an error message and exit.
24748The @command{awk} command-line arguments up to @code{ARGV[Optind]}
24749are cleared, so that @command{awk} won't try to process them as files.  If no
24750files are specified, the standard input is used, and if multiple files are
24751specified, we make sure to note this so that the @value{FN}s can precede the
24752matched lines in the output:
24753
24754@example
24755@c file eg/prog/egrep.awk
24756    if (pattern == "")
24757        pattern = ARGV[Optind++]
24758
24759    if (pattern == "")
        usage()
24761
24762    for (i = 1; i < Optind; i++)
24763        ARGV[i] = ""
24764
24765    if (Optind >= ARGC) @{
24766        ARGV[1] = "-"
24767        ARGC = 2
24768    @} else if (ARGC - Optind > 1)
24769        do_filenames++
24770@}
24771@c endfile
24772@end example
24773
24774The @code{BEGINFILE} rule executes
24775when each new file is processed.  In this case, it is fairly simple; it
24776initializes a variable @code{fcount} to zero. @code{fcount} tracks
24777how many lines in the current file matched the pattern.
24778
This is also where we implement the @option{-s} option. If @code{ERRNO}
is set (meaning that the file could not be opened) and @option{-s} was
supplied, we skip to the next file with @code{nextfile}. Otherwise,
@command{gawk} would exit with an error:
24783
24784@example
24785@c file eg/prog/egrep.awk
24786BEGINFILE @{
24787    fcount = 0
24788    if (ERRNO && no_errors)
24789        nextfile
24790@}
24791@c endfile
24792@end example
24793
24794The @code{ENDFILE} rule executes after each file has been processed.
24795It affects the output only when the user wants a count of the number of lines that
24796matched.  @code{no_print} is true only if the exit status is desired.
24797@code{count_only} is true if line counts are desired.  @command{egrep}
24798therefore only prints line counts if printing and counting are enabled.
24799The output format must be adjusted depending upon the number of files to
24800process.  Finally, @code{fcount} is added to @code{total}, so that we
24801know the total number of lines that matched the pattern:
24802
24803@example
24804@c file eg/prog/egrep.awk
24805ENDFILE @{
24806    if (! no_print && count_only) @{
24807        if (do_filenames)
            print FILENAME ":" fcount
24809        else
24810            print fcount
24811    @}
24812
24813@group
24814    total += fcount
24815@}
24816@end group
24817@c endfile
24818@end example
24819
24820The following rule does most of the work of matching lines. The variable
24821@code{matches} is true (non-zero) if the line matched the pattern.
24822If the user specified that the entire line must match (with @option{-x}),
24823the code checks this condition by looking at the values of
24824@code{RSTART} and @code{RLENGTH}.  If those indicate that the match
24825is not over the full line, @code{matches} is set to zero (false).
24826
24827If the user
24828wants lines that did not match, we invert the sense of @code{matches}
24829using the @samp{!} operator. We then increment @code{fcount} with the value of
24830@code{matches}, which is either one or zero, depending upon a
24831successful or unsuccessful match.  If the line does not match, the
24832@code{next} statement just moves on to the next input line.
24833
24834We make a number of additional tests, but only if we
24835are not counting lines.  First, if the user only wants the exit status
24836(@code{no_print} is true), then it is enough to know that @emph{one}
24837line in this file matched, and we can skip on to the next file with
24838@code{nextfile}.  Similarly, if we are only printing @value{FN}s, we can
24839print the @value{FN}, and then skip to the next file with @code{nextfile}.
Finally, each matching line is printed, preceded when appropriate by the
@value{FN} and a colon, and by the line number and a colon:
24843
24844@cindex @code{!} (exclamation point) @subentry @code{!} operator
24845@cindex exclamation point (@code{!}) @subentry @code{!} operator
24846@example
24847@c file eg/prog/egrep.awk
24848@{
24849    matches = match($0, pattern)
24850    if (matches && full_line && (RSTART != 1 || RLENGTH != length()))
        matches = 0
24852
24853    if (invert)
24854        matches = ! matches
24855
24856    fcount += matches    # 1 or 0
24857
24858    if (! matches)
24859        next
24860
24861    if (! count_only) @{
24862        if (no_print)
24863            nextfile
24864
24865        if (filenames_only) @{
24866            print FILENAME
24867            nextfile
24868        @}
24869
        # optional file name and line number prefixes
        if (do_filenames)
            printf("%s:", FILENAME)
        if (line_numbers)
            printf("%d:", FNR)
        print
24877    @}
24878@}
24879@c endfile
24880@end example
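
The full-line test can be tried in isolation.  The following short sketch
is not part of @file{egrep.awk}; the sample line and patterns are made up
for illustration.  It shows the values that @code{match()} leaves in
@code{RSTART} and @code{RLENGTH}, and how comparing them with the length
of the line distinguishes a partial match from a full-line one:

@example
BEGIN @{
    line = "foobar"          # made-up input line

    # A match that covers only part of the line
    if (match(line, /foo/))
        printf("RSTART=%d RLENGTH=%d length=%d\n",
               RSTART, RLENGTH, length(line))
    full = (RSTART == 1 && RLENGTH == length(line))
    print (full ? "full-line match" : "partial match")

    # A match that covers the entire line
    match(line, /fo+bar/)
    full = (RSTART == 1 && RLENGTH == length(line))
    print (full ? "full-line match" : "partial match")
@}
@end example

@noindent
The first pattern matches only the first three characters, so the
test fails; the second pattern matches all six characters, so the test succeeds.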
24881
24882The @code{END} rule takes care of producing the correct exit status. If
24883there are no matches, the exit status is one; otherwise, it is zero:
24884
24885@example
24886@c file eg/prog/egrep.awk
24887END @{
24888    exit (total == 0)
24889@}
24890@c endfile
24891@end example
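
The expression in the @code{exit} statement relies on a comparison yielding
one when it is true and zero when it is false.  As a small illustration
(the values assigned to @code{total} here are arbitrary):

@example
BEGIN @{
    total = 0
    print (total == 0)    # prints 1: no matches, failure status
    total = 3
    print (total == 0)    # prints 0: some matches, success status
@}
@end example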
24892
24893The @code{usage()} function prints a usage message in case of invalid options,
24894and then exits:
24895
24896@example
24897@c file eg/prog/egrep.awk
24898function usage()
24899@{
24900    print("Usage:\tegrep [-cilnqsvx] [-e pat] [files ...]") > "/dev/stderr"
24901    print("\tegrep [-cilnqsvx] pat [files ...]") > "/dev/stderr"
24902    exit 1
24903@}
24904@c endfile
24905@end example
24906
24907@node Id Program
24908@subsection Printing Out User Information
24909
24910@cindex printing @subentry user information
24911@cindex users, information about @subentry printing
24912@cindex @command{id} utility
24913The @command{id} utility lists a user's real and effective user ID numbers,
24914real and effective group ID numbers, and the user's group set, if any.
24915@command{id} only prints the effective user ID and group ID if they are
24916different from the real ones.  If possible, @command{id} also supplies the
24917corresponding user and group names.  The output might look like this:
24918
24919@example
24920$ @kbd{id}
24921@print{} uid=1000(arnold) gid=1000(arnold) groups=1000(arnold),4(adm),7(lp),27(sudo)
24922@end example
24923
24924@cindex @code{PROCINFO} array @subentry user and group ID numbers and
24925This information is part of what is provided by @command{gawk}'s
24926@code{PROCINFO} array (@pxref{Built-in Variables}).
24927However, the @command{id} utility provides a more palatable output than just
24928individual numbers.
24929
24930The POSIX version of @command{id} takes several options that give you
control over the output's format, such as printing only real IDs, or printing
24932only numbers or only names.  Additionally, you can print the information
24933for a specific user, instead of that of the current user.
24934
24935Here is a version of POSIX @command{id} written in @command{awk}.
24936It uses the @code{getopt()} library function
24937(@pxref{Getopt Function}),
24938the user database library functions
24939(@pxref{Passwd Functions}),
24940and the group database library functions
24941(@pxref{Group Functions})
24942from @ref{Library Functions}.
24943
24944The program is moderately straightforward.  All the work is done in the
24945@code{BEGIN} rule.
24946It starts with explanatory comments, a list of options,
24947and then a @code{usage()} function:
24948
24949@cindex @file{id.awk} program
24950@example
24951@c file eg/prog/id.awk
24952# id.awk --- implement id in awk
24953#
24954# Requires user and group library functions and getopt
24955@c endfile
24956@ignore
24957@c file eg/prog/id.awk
24958#
24959# Arnold Robbins, arnold@@skeeve.com, Public Domain
24960# May 1993
24961# Revised February 1996
24962# Revised May 2014
24963# Revised September 2014
24964# Revised September 2020
24965
24966@c endfile
24967@end ignore
24968@c file eg/prog/id.awk
24969# output is:
24970# uid=12(foo) euid=34(bar) gid=3(baz) \
24971#             egid=5(blat) groups=9(nine),2(two),1(one)
24972
24973# Options:
#   -G Output all group ids as space separated numbers (rgid, egid, groups)
#   -g Output only the egid as a number
#   -n Output name instead of the numeric value (with -g/-G/-u)
#   -r Output ruid/rgid instead of effective id
24978#   -u Output only effective user id, as a number
24979
24980@group
24981function usage()
24982@{
24983    printf("Usage:\n" \
24984           "\tid [user]\n" \
24985           "\tid -G [-n] [user]\n" \
24986           "\tid -g [-nr] [user]\n" \
24987           "\tid -u [-nr] [user]\n") > "/dev/stderr"
24988
24989    exit 1
24990@}
24991@end group
24992@c endfile
24993@end example
24994
24995The first step is to parse the options using @code{getopt()},
24996and to set various flag variables according to the options given:
24997
24998@example
24999@c file eg/prog/id.awk
25000BEGIN @{
25001    # parse args
25002    while ((c = getopt(ARGC, ARGV, "Ggnru")) != -1) @{
25003        if (c == "G")
25004            groupset_only++
25005        else if (c == "g")
25006            egid_only++
25007        else if (c == "n")
25008            names_not_groups++
25009        else if (c == "r")
25010            real_ids_only++
25011        else if (c == "u")
25012            euid_only++
25013        else
25014            usage()
25015    @}
25016@c endfile
25017@end example
25018
25019The next step is to check that no conflicting options were
25020provided. @option{-G} and @option{-r} are mutually exclusive.
25021It is also not allowed to provide more than one user name
25022on the command line:
25023
25024@example
25025@c file eg/prog/id.awk
25026    if (groupset_only && real_ids_only)
25027        usage()
25028    else if (ARGC - Optind > 1)
25029        usage()
25030@c endfile
25031@end example
25032
25033The user and group ID numbers are obtained from
@code{PROCINFO} for the current user, or from the
user and group databases for a user supplied on
25036the command line. In the latter case, @code{real_ids_only}
25037is set, since it's not possible to print information about
25038the effective user and group IDs:
25039
25040@example
25041@c file eg/prog/id.awk
25042    if (ARGC - Optind == 0) @{
25043        # gather info for current user
25044        uid = PROCINFO["uid"]
25045        euid = PROCINFO["euid"]
25046        gid = PROCINFO["gid"]
25047        egid = PROCINFO["egid"]
25048        for (i = 1; ("group" i) in PROCINFO; i++)
25049            groupset[i] = PROCINFO["group" i]
25050    @} else @{
25051        fill_info_for_user(ARGV[ARGC-1])
25052        real_ids_only++
25053    @}
25054@c endfile
25055@end example
25056
25057The test in the @code{for} loop is worth noting.
25058Any supplementary groups in the @code{PROCINFO} array have the
25059indices @code{"group1"} through @code{"group@var{N}"} for some
25060@var{N} (i.e., the total number of supplementary groups).
25061However, we don't know in advance how many of these groups
25062there are.
25063
25064This loop works by starting at one, concatenating the value with
25065@code{"group"}, and then using @code{in} to see if that value is
25066in the array (@pxref{Reference to Elements}).  Eventually, @code{i} increments past
25067the last group in the array and the loop exits.
25068
25069The loop is also correct if there are @emph{no} supplementary
25070groups; then the condition is false the first time it's
25071tested, and the loop body never executes.
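
The same idiom works for any array whose indices follow a numbered
pattern.  In this sketch, the array @code{info} and its contents are
made-up stand-ins for the group entries in @code{PROCINFO}:

@example
BEGIN @{
    # Hypothetical data, standing in for PROCINFO["group1"], ...
    info["group1"] = 4
    info["group2"] = 27

    for (i = 1; ("group" i) in info; i++)
        print "group" i, "is", info["group" i]

    # The loop stops at the first missing index
    print "number of groups:", i - 1
@}
@end example

@noindent
This prints one line for each of the two entries, followed by
@samp{number of groups: 2}.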
25072
25073
25074Now, based on the options, we decide what information to print.
25075For @option{-G} (print just the group set), we then select
25076whether to print names or numbers. In either case, when done
25077we exit:
25078
25079@example
25080@c file eg/prog/id.awk
25081    if (groupset_only) @{
25082        if (names_not_groups) @{
25083            for (i = 1; i in groupset; i++) @{
25084                entry = getgrgid(groupset[i])
25085                name = get_first_field(entry)
25086                printf("%s", name)
25087                if ((i + 1) in groupset)
25088                    printf(" ")
25089            @}
25090        @} else @{
25091            for (i = 1; i in groupset; i++) @{
25092                printf("%u", groupset[i])
25093                if ((i + 1) in groupset)
25094                    printf(" ")
25095            @}
25096        @}
25097
25098        print ""    # final newline
25099        exit 0
25100    @}
25101@c endfile
25102@end example
25103
25104Otherwise, for @option{-g} (effective group ID only), we
25105check if @option{-r} was also provided, in which case we
25106use the real group ID. Then based on @option{-n}, we decide
25107whether to print names or numbers. Here too, when done,
25108we exit:
25109
25110@example
25111@c file eg/prog/id.awk
25112    else if (egid_only) @{
25113        id = real_ids_only ? gid : egid
25114        if (names_not_groups) @{
25115            entry = getgrgid(id)
25116            name = get_first_field(entry)
25117            printf("%s\n", name)
25118        @} else @{
25119            printf("%u\n", id)
25120        @}
25121
25122        exit 0
25123    @}
25124@c endfile
25125@end example
25126
25127The @code{get_first_field()} function extracts the group name from
25128the group database entry for the given group ID.
25129
25130Similar processing logic applies to @option{-u} (effective user ID only),
25131combined with @option{-r} and @option{-n}:
25132
25133@example
25134@c file eg/prog/id.awk
25135    else if (euid_only) @{
25136        id = real_ids_only ? uid : euid
25137        if (names_not_groups) @{
25138            entry = getpwuid(id)
25139            name = get_first_field(entry)
25140            printf("%s\n", name)
25141        @} else @{
25142            printf("%u\n", id)
25143        @}
25144
25145        exit 0
25146    @}
25147@c endfile
25148@end example
25149
25150At this point, we haven't exited yet, so we print
25151the regular, default output, based either on the current
25152user's information, or that of the user whose name was
25153provided on the command line. We start with the real user ID:
25154
25155@example
25156@c file eg/prog/id.awk
25157    printf("uid=%d", uid)
25158    pw = getpwuid(uid)
25159    print_first_field(pw)
25160@c endfile
25161@end example
25162
25163The @code{print_first_field()} function prints the user's
25164login name from the password file entry, surrounded by
parentheses. It is shown shortly.
25166Printing the effective user ID is next:
25167
25168@example
25169@c file eg/prog/id.awk
25170    if (euid != uid && ! real_ids_only) @{
25171        printf(" euid=%d", euid)
25172        pw = getpwuid(euid)
25173        print_first_field(pw)
25174    @}
25175@c endfile
25176@end example
25177
25178Similar logic applies to the real and effective group IDs:
25179
25180@example
25181@c file eg/prog/id.awk
25182    printf(" gid=%d", gid)
25183    pw = getgrgid(gid)
25184    print_first_field(pw)
25185
25186    if (egid != gid && ! real_ids_only) @{
25187        printf(" egid=%d", egid)
25188        pw = getgrgid(egid)
25189        print_first_field(pw)
25190    @}
25191@c endfile
25192@end example
25193
25194Finally, we print the group set and the terminating newline:
25195
25196@example
25197@c file eg/prog/id.awk
25198    for (i = 1; i in groupset; i++) @{
25199        if (i == 1)
25200            printf(" groups=")
25201        group = groupset[i]
25202        printf("%d", group)
25203        pw = getgrgid(group)
25204        print_first_field(pw)
25205        if ((i + 1) in groupset)
25206            printf(",")
25207    @}
25208
25209    print ""
25210@}
25211@c endfile
25212@end example
25213
25214The @code{get_first_field()} function extracts the first field
25215from a password or group file entry for use as a user or group
25216name. Fields are separated by @samp{:} characters:
25217
25218@example
25219@c file eg/prog/id.awk
25220function get_first_field(str,  a)
25221@{
25222    if (str != "") @{
25223        split(str, a, ":")
25224        return a[1]
25225    @}
25226@}
25227@c endfile
25228@end example
25229
25230This function is then used by @code{print_first_field()} to
25231output the given name surrounded by parentheses:
25232
25233@example
25234@c file eg/prog/id.awk
function print_first_field(str,    first)
25236@{
25237    first = get_first_field(str)
25238    printf("(%s)", first)
25239@}
25240@c endfile
25241@end example
25242
25243These two functions simply isolate out some code that is used repeatedly,
25244making the whole program shorter and cleaner. In particular, moving the
25245check for the empty string into @code{get_first_field()} saves several
25246lines of code.
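
To see the splitting at work on its own, consider this sketch.  The
record is a made-up entry in the usual colon-separated password file
format; it is not read from any real database:

@example
BEGIN @{
    # Hypothetical password file entry
    entry = "arnold:x:1000:1000:Arnold Robbins:/home/arnold:/bin/sh"

    n = split(entry, fields, ":")
    printf("%d fields, name=%s uid=%d gid=%d\n",
           n, fields[1], fields[3] + 0, fields[4] + 0)
@}
@end example

@noindent
The first field is the login name, and the third and fourth fields are the
numeric user and group IDs.  Adding zero forces each of them to be treated
as a number.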
25247
25248Finally, @code{fill_info_for_user()} fetches user, group, and group
set information for the user named on the command line.  The code is fairly
25250straightforward, merely requiring that we exit if the given user doesn't
25251exist:
25252
25253@example
25254@c file eg/prog/id.awk
25255function fill_info_for_user(user,
25256                            pwent, fields, groupnames, grent, groups, i)
25257@{
25258    pwent = getpwnam(user)
25259    if (pwent == "") @{
25260        printf("id: '%s': no such user\n", user) > "/dev/stderr"
25261        exit 1
25262    @}
25263
25264    split(pwent, fields, ":")
25265    uid = fields[3] + 0
25266    gid = fields[4] + 0
25267@c endfile
25268@end example
25269
Getting the group set is a little awkward. The library routine
@code{getgruser()} returns a list of group @emph{names}. Each name
has to be looked up and converted back into a group number,
so that the rest of the code will work as expected:
25274
25275@example
25276@ignore
25277@c file eg/prog/id.awk
25278
25279@c endfile
25280@end ignore
25281@c file eg/prog/id.awk
25282    groupnames = getgruser(user)
25283    split(groupnames, groups, " ")
25284    for (i = 1; i in groups; i++) @{
25285        grent = getgrnam(groups[i])
25286        split(grent, fields, ":")
25287        groupset[i] = fields[3] + 0
25288    @}
25289@}
25290@c endfile
25291@end example
25292
25293@node Split Program
25294@subsection Splitting a Large File into Pieces
25295
25296@cindex files @subentry splitting
@cindex @command{split} utility
25298The @command{split} utility splits large text files into smaller pieces.
25299The usage follows the POSIX standard for @command{split} and is as follows:
25300
25301@display
25302@command{split} [@option{-l} @var{count}] [@option{-a} @var{suffix-len}] [@var{file} [@var{outname}]]
@command{split} [@option{-b} @var{N}[@code{k}|@code{m}]] [@option{-a} @var{suffix-len}] [@var{file} [@var{outname}]]
25304@end display
25305
25306By default, the output files are named @file{xaa}, @file{xab}, and so
25307on. Each file has 1,000 lines in it, with the likely exception of the
25308last file.
25309
25310The @command{split} program has evolved over time, and the current POSIX
25311version is more complicated than the original Unix version.  The options
25312and what they do are as follows:
25313
25314@table @asis
25315@item @option{-a} @var{suffix-len}
25316Use @var{suffix-len} characters for the suffix. For example, if @var{suffix-len}
25317is four, the output files would range from @file{xaaaa} to @file{xzzzz}.
25318
@item @option{-b} @var{N}[@code{k}|@code{m}]
25320Instead of each file containing a specified number of lines, each file
25321should have (at most) @var{N} bytes.  Supplying a trailing @samp{k}
25322multiplies @var{N} by 1,024, yielding kilobytes.  Supplying a trailing
25323@samp{m} multiplies @var{N} by 1,048,576 (@math{1,024 @value{TIMES} 1,024})
25324yielding megabytes.  (This option is mutually exclusive with @option{-l}).
25325
25326@item @option{-l} @var{count}
25327Each file should have at most @var{count} lines, instead of the default
253281,000.  (This option is mutually exclusive with @option{-b}).
25329@end table
25330
25331If supplied, @var{file} is the input file to read. Otherwise standard
25332input is processed.  If supplied, @var{outname} is the leading prefix
25333to use for @value{FN}s, instead of @samp{x}.
25334
25335In order to use the @option{-b} option, @command{gawk} should be invoked
25336with its @option{-b} option (@pxref{Options}), or with the environment
25337variable @env{LC_ALL} set to @samp{C}, so that each input byte is treated
25338as a separate character.@footnote{Using @option{-b} twice requires
25339separating @command{gawk}'s options from those of the program.  For example:
25340@samp{gawk -f getopt.awk -f split.awk -b -- -b 42m large-file.txt split-}.}
25341
25342Here is an implementation of @command{split} in @command{awk}. It uses the
25343@code{getopt()} function presented in @ref{Getopt Function}.
25344
25345The program begins with a standard descriptive comment and then
25346a @code{usage()} function describing the options. The variable
25347@code{common} keeps the function's lines short so that they
25348look nice on the page:
25349
25350@cindex @file{split.awk} program
25351@example
25352@c file eg/prog/split.awk
25353# split.awk --- do split in awk
25354#
25355# Requires getopt() library function.
25356@c endfile
25357@ignore
25358@c file eg/prog/split.awk
25359#
25360# Arnold Robbins, arnold@@skeeve.com, Public Domain
25361# May 1993
25362# Revised slightly, May 2014
25363# Rewritten September 2020
25364
25365@c endfile
25366@end ignore
25367@c file eg/prog/split.awk
25368
25369function usage(     common)
25370@{
25371    common = "[-a suffix-len] [file [outname]]"
25372    printf("usage: split [-l count]  %s\n", common) > "/dev/stderr"
25373    printf("       split [-b N[k|m]] %s\n", common) > "/dev/stderr"
25374    exit 1
25375@}
25376@c endfile
25377@end example
25378
25379Next, in a @code{BEGIN} rule we set the default values and parse the arguments.
25380After that we initialize the data structures used to cycle the suffix
25381from @samp{aa@dots{}} to @samp{zz@dots{}}. Finally we set the name of
25382the first output file:
25383
25384@example
25385@c file eg/prog/split.awk
25386BEGIN @{
25387    # Set defaults:
25388    Suffix_length = 2
25389    Line_count = 1000
25390    Byte_count = 0
25391    Outfile = "x"
25392
25393    parse_arguments()
25394
25395    init_suffix_data()
25396
25397    Output = (Outfile compute_suffix())
25398@}
25399@c endfile
25400@end example
25401
25402Parsing the arguments is straightforward.  The program follows our
25403convention (@pxref{Library Names}) of having important global variables
25404start with an uppercase letter:
25405
25406@example
25407@c file eg/prog/split.awk
25408function parse_arguments(   i, c, l, modifier)
25409@{
25410    while ((c = getopt(ARGC, ARGV, "a:b:l:")) != -1) @{
25411        if (c == "a")
25412            Suffix_length = Optarg + 0
25413        else if (c == "b") @{
25414            Byte_count = Optarg + 0
25415            Line_count = 0
25416
25417            l = length(Optarg)
25418            modifier = substr(Optarg, l, 1)
25419            if (modifier == "k")
25420                Byte_count *= 1024
25421            else if (modifier == "m")
25422                Byte_count *= 1024 * 1024
25423        @} else if (c == "l") @{
25424            Line_count = Optarg + 0
25425            Byte_count = 0
25426        @} else
25427            usage()
25428    @}
25429
25430    # Clear out options
25431    for (i = 1; i < Optind; i++)
25432        ARGV[i] = ""
25433
25434    # Check for filename
25435    if (ARGV[Optind]) @{
25436        Optind++
25437
25438        # Check for different prefix
25439        if (ARGV[Optind]) @{
25440            Outfile = ARGV[Optind]
25441            ARGV[Optind] = ""
25442
25443            if (++Optind < ARGC)
25444                usage()
25445        @}
25446    @}
25447@}
25448@c endfile
25449@end example
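
The handling of the size modifier can be exercised separately.  In
this sketch, @code{to_bytes()} is a hypothetical helper that is not part
of @file{split.awk}; it merely repeats the arithmetic used for @option{-b}:

@example
# to_bytes --- hypothetical helper, for illustration only
function to_bytes(spec,    n, modifier)
@{
    n = spec + 0        # leading digits; a trailing letter is ignored
    modifier = substr(spec, length(spec), 1)
    if (modifier == "k")
        n *= 1024
    else if (modifier == "m")
        n *= 1024 * 1024
    return n
@}

BEGIN @{
    print to_bytes("512"), to_bytes("4k"), to_bytes("42m")
@}
@end example

@noindent
This prints @samp{512 4096 44040192}.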
25450
25451Managing the @value{FN} suffix is interesting.
25452Given a suffix of length three, say, the values go from
25453@samp{aaa}, @samp{aab}, @samp{aac} and so on, all the way to
25454@samp{zzx}, @samp{zzy}, and finally @samp{zzz}.
25455There are two important aspects to this:
25456
25457@itemize @bullet
25458@item
25459We have to be
25460able to easily generate these suffixes, and in particular
25461easily handle ``rolling over''; for example, going from
25462@samp{abz} to @samp{aca}.
25463
25464@item
25465We have to tell when we've finished with the last file,
25466so that if we still have more input data we can print an
25467error message and exit. The trick is to handle this @emph{after}
25468using the last suffix, and not when the final suffix is created.
25469@end itemize
25470
25471The computation is handled by @code{compute_suffix()}.
25472This function is called every time a new file is opened.
25473
25474The flow here is messy, because we want to generate @samp{zzzz} (say),
25475and use it, and only produce an error after all the @value{FN}
25476suffixes have been used up. The logical steps are as follows:
25477
25478@enumerate 1
25479@item
25480Generate the suffix, saving the value in @code{result} to return.
25481To do this, the supplementary array @code{Suffix_ind} contains one
25482element for each letter in the suffix.  Each element ranges from 1 to
2548326, acting as the index into a string containing all the lowercase
25484letters of the English alphabet.
25485It is initialized by @code{init_suffix_data()}.
@code{result} is built up one letter at a time, using @code{substr()}.
25487
25488@item
25489Prepare the data structures for the next time @code{compute_suffix()}
25490is called. To do this, we loop over @code{Suffix_ind}, @emph{backwards}.
25491If the current element is less than 26, it's incremented and the loop
25492breaks (@samp{abq} goes to @samp{abr}). Otherwise, the element is
25493reset to one and we move down the list (@samp{abz} to @samp{aca}).
25494Thus, the @code{Suffix_ind} array is always ``one step ahead'' of the actual
25495@value{FN} suffix to be returned.
25496
25497@item
25498Check if we've gone past the limit of possible @value{FN}s.
If @code{Reached_last} is true, print a message and exit. Otherwise,
check if @code{Suffix_ind} describes a suffix where all the letters are
@samp{z}. If so, we're about to return the final suffix, and
we set @code{Reached_last} to true so that the @emph{next} call to
@code{compute_suffix()} will cause a failure.
25504@end enumerate
25505
25506Physically, the steps in the function occur in the order 3, 1, 2:
25507
25508@example
25509@c file eg/prog/split.awk
25510function compute_suffix(    i, result, letters)
25511@{
25512    # Logical step 3
25513    if (Reached_last) @{
25514        printf("split: too many files!\n") > "/dev/stderr"
25515        exit 1
25516    @} else if (on_last_file())
25517        Reached_last = 1    # fail when wrapping after 'zzz'
25518
25519    # Logical step 1
25520    result = ""
25521    letters = "abcdefghijklmnopqrstuvwxyz"
25522    for (i = 1; i <= Suffix_length; i++)
25523        result = result substr(letters, Suffix_ind[i], 1)
25524
25525    # Logical step 2
25526    for (i = Suffix_length; i >= 1; i--) @{
25527        if (++Suffix_ind[i] > 26) @{
25528            Suffix_ind[i] = 1
25529        @} else
25530            break
25531    @}
25532
25533    return result
25534@}
25535@c endfile
25536@end example
25537
25538The @code{Suffix_ind} array and @code{Reached_last} are initialized
25539by @code{init_suffix_data()}:
25540
25541@example
25542@c file eg/prog/split.awk
25543function init_suffix_data(  i)
25544@{
25545    for (i = 1; i <= Suffix_length; i++)
25546        Suffix_ind[i] = 1
25547
25548    Reached_last = 0
25549@}
25550@c endfile
25551@end example
25552
25553The function @code{on_last_file()} returns true if @code{Suffix_ind} describes
25554a suffix where all the letters are @samp{z} by checking that all the elements
25555in the array are equal to 26:
25556
25557@example
25558@c file eg/prog/split.awk
25559function on_last_file(  i, on_last)
25560@{
25561    on_last = 1
25562    for (i = 1; i <= Suffix_length; i++) @{
25563        on_last = on_last && (Suffix_ind[i] == 26)
25564    @}
25565
25566    return on_last
25567@}
25568@c endfile
25569@end example
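
The carry logic in steps 1 and 2 behaves much like a car's odometer.
The following standalone sketch is not part of @file{split.awk}; the
function name @code{next_suffix()} and its parameters are invented here
for illustration, and the index array is passed explicitly instead of
using the program's global variables:

@example
# Illustrative only: return the current suffix, then advance the indices
function next_suffix(ind, len,    i, result, letters)
@{
    letters = "abcdefghijklmnopqrstuvwxyz"
    result = ""
    for (i = 1; i <= len; i++)
        result = result substr(letters, ind[i], 1)

    # advance, starting with the rightmost position
    for (i = len; i >= 1; i--) @{
        if (++ind[i] > 26)
            ind[i] = 1      # wrap around and carry to the left
        else
            break
    @}
    return result
@}

BEGIN @{
    len = 2
    for (i = 1; i <= len; i++)
        ind[i] = 1
    for (n = 1; n <= 27; n++) @{
        s = next_suffix(ind, len)
        if (n == 1 || n == 26 || n == 27)
            print n, s
    @}
@}
@end example

@noindent
With a two-letter suffix, the three lines printed should be @samp{1 aa},
@samp{26 az}, and @samp{27 ba}.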
25570
25571The actual work of splitting the input file is done by the next two rules.
25572Since splitting by line count and splitting by byte count are mutually
25573exclusive, we simply use two separate rules, one for when @code{Line_count}
25574is greater than zero, and another for when @code{Byte_count} is greater than zero.
25575
25576The variable @code{tcount} counts how many lines have been processed so far.
25577When it exceeds @code{Line_count}, it's time to close the previous file and
25578switch to a new one:
25579
25580@example
25581@c file eg/prog/split.awk
25582Line_count > 0 @{
25583    if (++tcount > Line_count) @{
25584        close(Output)
25585        Output = (Outfile compute_suffix())
25586        tcount = 1
25587    @}
25588    print > Output
25589@}
25590@c endfile
25591@end example
25592
25593The rule for handling bytes is more complicated.  Since lines most likely
25594vary in length, the @code{Byte_count} boundary may be hit in the middle of
25595an input record.  In that case, @command{split} has to write enough of the
25596first bytes of the input record to finish up @code{Byte_count} bytes, close
25597the file, open a new file, and write the rest of the record to the new file.
25598The logic here does all that:
25599
25600@example
25601@c file eg/prog/split.awk
25602Byte_count > 0 @{
25603    # `+ 1' is for the final newline
25604    if (tcount + length($0) + 1 > Byte_count) @{ # would overflow
25605        # compute leading bytes
25606        leading_bytes = Byte_count - tcount
25607
25608        # write leading bytes
25609        printf("%s", substr($0, 1, leading_bytes)) > Output
25610
25611        # close old file, open new file
25612        close(Output)
25613        Output = (Outfile compute_suffix())
25614
25615        # set up first bytes for new file
25616        $0 = substr($0, leading_bytes + 1)  # trailing bytes
25617        tcount = 0
25618    @}
25619
25620    # write full record or trailing bytes
25621    tcount += length($0) + 1
25622    print > Output
25623@}
25624@c endfile
25625@end example
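
To make the arithmetic concrete, here is a small standalone sketch with
made-up values: a limit of 10 bytes, 7 bytes already written, and a
six-character record.  The variables @code{rec}, @code{head}, and
@code{tail} are hypothetical names used only in this illustration:

@example
BEGIN @{
    Byte_count = 10     # assumed limit per output file
    tcount = 7          # assumed bytes already in the current file
    rec = "abcdef"      # stands in for $0

    if (tcount + length(rec) + 1 > Byte_count) @{
        leading_bytes = Byte_count - tcount         # 3
        head = substr(rec, 1, leading_bytes)        # finishes the old file
        tail = substr(rec, leading_bytes + 1)       # starts the new file
        printf("old file gets \"%s\", new file gets \"%s\"\n", head, tail)
    @}
@}
@end example

@noindent
Here the old file receives @samp{abc} and the new file starts with
@samp{def}.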
25626
25627Finally, the @code{END} rule cleans up by closing the last output file:
25628
25629@example
25630@c file eg/prog/split.awk
25631END @{
25632    close(Output)
25633@}
25634@c endfile
25635@end example
25636
25637@node Tee Program
25638@subsection Duplicating Output into Multiple Files
25639
25640@cindex files @subentry multiple, duplicating output into
25641@cindex output @subentry duplicating into files
25642@cindex @code{tee} utility
25643The @code{tee} program is known as a ``pipe fitting.''  @code{tee} copies
25644its standard input to its standard output and also duplicates it to the
25645files named on the command line.  Its usage is as follows:
25646
25647@display
25648@command{tee} [@option{-a}] @var{file} @dots{}
25649@end display
25650
25651The @option{-a} option tells @code{tee} to append to the named files, instead of
25652truncating them and starting over.
25653
25654The @code{BEGIN} rule first makes a copy of all the command-line arguments
25655into an array named @code{copy}.
25656@code{ARGV[0]} is not needed, so it is not copied.
25657@code{tee} cannot use @code{ARGV} directly, because @command{awk} attempts to
25658process each @value{FN} in @code{ARGV} as input data.
25659
25660@cindex flag variables
25661If the first argument is @option{-a}, then the flag variable
25662@code{append} is set to true, and both @code{ARGV[1]} and
25663@code{copy[1]} are deleted. If @code{ARGC} is less than two, then no
25664@value{FN}s were supplied and @code{tee} prints a usage message and exits.
25665Finally, @command{awk} is forced to read the standard input by setting
25666@code{ARGV[1]} to @code{"-"} and @code{ARGC} to two:
25667
25668@cindex @file{tee.awk} program
25669@example
25670@c file eg/prog/tee.awk
25671# tee.awk --- tee in awk
25672#
25673# Copy standard input to all named output files.
25674# Append content if -a option is supplied.
25675#
25676@c endfile
25677@ignore
25678@c file eg/prog/tee.awk
25679# Arnold Robbins, arnold@@skeeve.com, Public Domain
25680# May 1993
25681# Revised December 1995
25682
25683@c endfile
25684@end ignore
25685@c file eg/prog/tee.awk
25686BEGIN @{
25687    for (i = 1; i < ARGC; i++)
25688        copy[i] = ARGV[i]
25689
25690    if (ARGV[1] == "-a") @{
25691        append = 1
25692        delete ARGV[1]
25693        delete copy[1]
25694        ARGC--
25695    @}
25696    if (ARGC < 2) @{
25697        print "usage: tee [-a] file ..." > "/dev/stderr"
25698        exit 1
25699    @}
25700    ARGV[1] = "-"
25701    ARGC = 2
25702@}
25703@c endfile
25704@end example
25705
25706The following single rule does all the work.  Because there is no pattern, it is
25707executed for each line of input.  The body of the rule simply prints the
25708line into each file on the command line, and then to the standard output:
25709
25710@example
25711@c file eg/prog/tee.awk
25712@{
25713    # moving the if outside the loop makes it run faster
25714    if (append)
25715        for (i in copy)
25716            print >> copy[i]
25717    else
25718        for (i in copy)
25719            print > copy[i]
25720    print
25721@}
25722@c endfile
25723@end example
25724
25725@noindent
25726It is also possible to write the loop this way:
25727
25728@example
25729@group
25730for (i in copy)
25731    if (append)
25732        print >> copy[i]
25733@end group
25734@group
25735    else
25736        print > copy[i]
25737@end group
25738@end example
25739
25740@noindent
25741This is more concise, but it is also less efficient.  The @samp{if} is
25742tested for each record and for each output file.  By duplicating the loop
25743body, the @samp{if} is only tested once for each input record.  If there are
25744@var{N} input records and @var{M} output files, the first method only
25745executes @var{N} @samp{if} statements, while the second executes
25746@var{N}@code{*}@var{M} @samp{if} statements.
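
The difference can be tallied with a short sketch.  It does not write
any files; it merely counts how many times each version would evaluate
the @samp{if}, using assumed values of 1,000 records and three output
files:

@example
BEGIN @{
    N = 1000    # assumed number of input records
    M = 3       # assumed number of output files
    for (r = 1; r <= N; r++) @{
        first++                 # `if' outside the loop: once per record
        for (f = 1; f <= M; f++)
            second++            # `if' inside the loop: once per file
    @}
    print first, second
@}
@end example

@noindent
This prints @samp{1000 3000}.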
25747
25748Finally, the @code{END} rule cleans up by closing all the output files:
25749
25750@example
25751@c file eg/prog/tee.awk
25752END @{
25753    for (i in copy)
25754        close(copy[i])
25755@}
25756@c endfile
25757@end example
25758
25759@node Uniq Program
25760@subsection Printing Nonduplicated Lines of Text
25761
25762@cindex printing @subentry unduplicated lines of text
25763@cindex text, printing @subentry unduplicated lines of
25764@cindex @command{uniq} utility
25765The @command{uniq} utility reads sorted lines of data on its standard
25766input, and by default removes duplicate lines.  In other words, it only
25767prints unique lines---hence the name.  @command{uniq} has a number of
25768options. The usage is as follows:
25769
25770@display
25771@command{uniq} [@option{-udc} [@code{-f @var{n}}] [@code{-s @var{n}}]] [@var{inputfile} [@var{outputfile}]]
25772@end display
25773
25774The options for @command{uniq} are:
25775
25776@table @code
25777@item -d
25778Print only repeated (duplicated) lines.
25779
25780@item -u
25781Print only nonrepeated (unique) lines.
25782
25783@item -c
25784Count lines. This option overrides @option{-d} and @option{-u}.  Both repeated
25785and nonrepeated lines are counted.
25786
25787@item -f @var{n}
25788Skip @var{n} fields before comparing lines.  The definition of fields
25789is similar to @command{awk}'s default: nonwhitespace characters separated
25790by runs of spaces and/or TABs.
25791
25792@item -s @var{n}
25793Skip @var{n} characters before comparing lines.  Any fields specified with
25794@option{-f} are skipped first.
25795
25796@item @var{inputfile}
25797Data is read from the input file named on the command line, instead of from
25798the standard input.
25799
25800@item @var{outputfile}
25801The generated output is sent to the named output file, instead of to the
25802standard output.
25803@end table
25804
25805Normally @command{uniq} behaves as if both the @option{-d} and
25806@option{-u} options are provided.
25807
25808@command{uniq} uses the
25809@code{getopt()} library function
25810(@pxref{Getopt Function})
25811and the @code{join()} library function
25812(@pxref{Join Function}).
25813
25814The program begins with a @code{usage()} function and then a brief outline of
25815the options and their meanings in comments:
25816
25817@cindex @file{uniq.awk} program
25818@example
25819@c file eg/prog/uniq.awk
25820@group
25821# uniq.awk --- do uniq in awk
25822#
25823# Requires getopt() and join() library functions
25824@end group
25825@c endfile
25826@ignore
25827@c file eg/prog/uniq.awk
25828#
25829# Arnold Robbins, arnold@@skeeve.com, Public Domain
25830# May 1993
25831# Updated August 2020 to current POSIX
25832@c endfile
25833@end ignore
25834@c file eg/prog/uniq.awk
25835
25836function usage()
25837@{
25838    print("Usage: uniq [-udc [-f fields] [-s chars]] " \
25839          "[ in [ out ]]") > "/dev/stderr"
25840    exit 1
25841@}
25842
25843# -c    count lines. overrides -d and -u
25844# -d    only repeated lines
25845# -u    only nonrepeated lines
25846# -f n  skip n fields
25847# -s n  skip n characters, skip fields first
25848@c endfile
25849@end example
25850
25851The POSIX standard for @command{uniq} allows options to start with
25852@samp{+} as well as with @samp{-}.  An initial @code{BEGIN} rule
25853traverses the arguments changing any leading @samp{+} to @samp{-}
25854so that the @code{getopt()} function can parse the options:
25855
25856@example
25857@c file eg/prog/uniq.awk
25858# As of 2020, '+' can be used as the option character in addition to '-'
25859# Previously allowed use of -N to skip fields and +N to skip
25860# characters is no longer allowed, and not supported by this version.
25861
25862BEGIN @{
25863    # Convert + to - so getopt can handle things
25864    for (i = 1; i < ARGC; i++) @{
25865        first = substr(ARGV[i], 1, 1)
25866        if (ARGV[i] == "--" || (first != "-" && first != "+"))
25867            break
25868        else if (first == "+")
25869            # Replace "+" with "-"
25870            ARGV[i] = "-" substr(ARGV[i], 2)
25871    @}
25872@}
25873@c endfile
25874@end example
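
The rewriting of a single argument can be shown in isolation.  In this
sketch, @code{plus_to_minus()} is a hypothetical helper and not part of
@file{uniq.awk}:

@example
# Illustrative only: turn a leading "+" into "-"
function plus_to_minus(arg)
@{
    if (substr(arg, 1, 1) == "+")
        return "-" substr(arg, 2)
    return arg
@}

BEGIN @{
    print plus_to_minus("+c"), plus_to_minus("-d"), plus_to_minus("file")
@}
@end example

@noindent
This prints @samp{-c -d file}.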
25875
25876The next @code{BEGIN} rule deals with the command-line arguments and options.
25877If no options are supplied, then the default is taken, to print both
25878repeated and nonrepeated lines.  The output file, if provided, is assigned
25879to @code{outputfile}.  Early on, @code{outputfile} is initialized to the
25880standard output, @file{/dev/stdout}:
25881
25882@example
25883@c file eg/prog/uniq.awk
25884BEGIN @{
25885    count = 1
25886    outputfile = "/dev/stdout"
25887    opts = "udcf:s:"
25888    while ((c = getopt(ARGC, ARGV, opts)) != -1) @{
25889        if (c == "u")
25890            non_repeated_only++
25891        else if (c == "d")
25892            repeated_only++
25893        else if (c == "c")
25894            do_count++
25895        else if (c == "f")
25896            fcount = Optarg + 0
25897        else if (c == "s")
25898            charcount = Optarg + 0
25899        else
25900            usage()
25901    @}
25902
25903    for (i = 1; i < Optind; i++)
25904        ARGV[i] = ""
25905
25906    if (repeated_only == 0 && non_repeated_only == 0)
25907        repeated_only = non_repeated_only = 1
25908
25909    if (ARGC - Optind == 2) @{
25910        outputfile = ARGV[ARGC - 1]
25911        ARGV[ARGC - 1] = ""
25912    @}
25913@}
25914@c endfile
25915@end example
25916
25917The following function, @code{are_equal()}, compares the current line,
25918@code{$0}, to the previous line, @code{last}.  It handles skipping fields
25919and characters.  If no field count and no character count are specified,
25920@code{are_equal()} returns one or zero depending upon the result of a
25921simple string comparison of @code{last} and @code{$0}.
25922
25923Otherwise, things get more complicated.  If fields have to be skipped,
25924each line is broken into an array using @code{split()} (@pxref{String
25925Functions}); the desired fields are then joined back into a line
25926using @code{join()}.  The joined lines are stored in @code{clast} and
25927@code{cline}.  If no fields are skipped, @code{clast} and @code{cline}
25928are set to @code{last} and @code{$0}, respectively.  Finally, if
25929characters are skipped, @code{substr()} is used to strip off the leading
25930@code{charcount} characters in @code{clast} and @code{cline}.  The two
25931strings are then compared and @code{are_equal()} returns the result:
25932
25933@example
25934@c file eg/prog/uniq.awk
25935@group
25936function are_equal(    n, m, clast, cline, alast, aline)
25937@{
25938    if (fcount == 0 && charcount == 0)
25939        return (last == $0)
25940@end group
25941
25942    if (fcount > 0) @{
25943        n = split(last, alast)
25944        m = split($0, aline)
25945        clast = join(alast, fcount+1, n)
25946        cline = join(aline, fcount+1, m)
25947    @} else @{
25948        clast = last
25949        cline = $0
25950    @}
25951    if (charcount) @{
25952        clast = substr(clast, charcount + 1)
25953        cline = substr(cline, charcount + 1)
25954    @}
25955@group
25956
25957    return (clast == cline)
25958@}
25959@end group
25960@c endfile
25961@end example
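
The effect of skipping fields and characters is easier to see on sample
data.  The following standalone sketch uses a hypothetical function,
@code{compare_key()}, that returns the part of a line taking part in the
comparison.  It rejoins fields with a single space, which is assumed here
to match what @code{join()} does by default:

@example
# Illustrative only: compute the portion of a line that is compared
function compare_key(line, fcount, charcount,    n, i, parts, key, sep)
@{
    key = line
    if (fcount > 0) @{
        n = split(line, parts)
        key = sep = ""
        for (i = fcount + 1; i <= n; i++) @{
            key = key sep parts[i]
            sep = " "
        @}
    @}
    if (charcount > 0)
        key = substr(key, charcount + 1)
    return key
@}

BEGIN @{
    a = "10:01 alice  login"
    b = "10:07 alice login"
    print (compare_key(a, 1, 0) == compare_key(b, 1, 0))
    print compare_key("10:01 xxalice", 1, 2)
@}
@end example

@noindent
The first @code{print} yields @samp{1}, because the two lines are equal
once the timestamp field is skipped.  The second yields @samp{alice}:
one field is skipped first, and then two characters.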
25962
25963The following two rules are the body of the program.  The first one is
25964executed only for the very first line of data.  It sets @code{last} equal to
25965@code{$0}, so that subsequent lines of text have something to be compared to.
25966
The second rule does the work. The variable @code{equal} is one or zero,
depending upon the results of @code{are_equal()}'s comparison. If @command{uniq}
is counting lines (@option{-c}) and the lines are equal, then it increments
the @code{count} variable.  Otherwise, the lines differ, so it prints the
previous line along with its count and resets @code{count}.

If @command{uniq} is not counting, and if the lines are equal, @code{count} is incremented.
Nothing is printed, as the point is to remove duplicates.
Otherwise, if @command{uniq} is printing repeated lines and more than
one line is seen, or if @command{uniq} is printing nonrepeated lines
and only one line is seen, then the previous line is printed, and @code{count}
is reset.
25979
25980Finally, similar logic is used in the @code{END} rule to print the final
25981line of input data:
25982
25983@example
25984@c file eg/prog/uniq.awk
25985NR == 1 @{
25986    last = $0
25987    next
25988@}
25989
25990@{
25991    equal = are_equal()
25992
25993    if (do_count) @{    # overrides -d and -u
25994        if (equal)
25995            count++
25996        else @{
25997            printf("%4d %s\n", count, last) > outputfile
25998            last = $0
25999            count = 1    # reset
26000        @}
26001        next
26002    @}
26003
26004    if (equal)
26005        count++
26006    else @{
26007        if ((repeated_only && count > 1) ||
26008            (non_repeated_only && count == 1))
26009                print last > outputfile
26010        last = $0
26011        count = 1
26012    @}
26013@}
26014
26015END @{
26016    if (do_count)
26017        printf("%4d %s\n", count, last) > outputfile
26018@group
26019    else if ((repeated_only && count > 1) ||
26020            (non_repeated_only && count == 1))
26021        print last > outputfile
26022    close(outputfile)
26023@}
26024@end group
26025@c endfile
26026@end example
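
The counting branch can be traced on a small sample.  The following
sketch is a simplification, not the program itself: it takes its
``input'' from a hard-coded string, ignores all options other than
counting, and uses an array named @code{line} that does not appear in
@file{uniq.awk}:

@example
BEGIN @{
    n = split("a a b c c c", line, " ")   # six sample "records"
    last = line[1]
    count = 1
    for (i = 2; i <= n; i++) @{
        if (line[i] == last)
            count++
        else @{
            printf("%4d %s\n", count, last)
            last = line[i]
            count = 1    # reset
        @}
    @}
    printf("%4d %s\n", count, last)    # what the END rule does
@}
@end example

@noindent
The output is one line for each run of equal records, giving counts of
2, 1, and 3 for @samp{a}, @samp{b}, and @samp{c}, respectively.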
26027
26028As a side note, this program does not follow our recommended convention of naming
26029global variables with a leading capital letter.  Doing that would
26030make the program a little easier to follow.
26031
26032@ifset FOR_PRINT
26033@cindex Kernighan, Brian @subentry quotes
26034The logic for choosing which lines to print represents a @dfn{state
26035machine}, which is ``a device which can be in one of a set number
26036of stable conditions depending on its previous condition and on the
26037present values of its inputs.''@footnote{This definition is from
26038@uref{https://www.lexico.com/en/definition/state_machine}.} Brian
26039Kernighan suggests that ``an alternative approach to state machines is
26040to just read the input into an array, then use indexing.  It's almost
26041always easier code, and for most inputs where you would use this, just
26042as fast.''  Consider how to rewrite the logic to follow this suggestion.
26043@end ifset
26044
26045
26046@node Wc Program
26047@subsection Counting Things
26048
26049@cindex counting words, lines, characters, and bytes
26050@cindex input files @subentry counting elements in
26051@cindex words @subentry counting
26052@cindex characters @subentry counting
26053@cindex lines @subentry counting
26054@cindex bytes @subentry counting
26055@cindex @command{wc} utility
26056The @command{wc} (word count) utility counts lines, words, characters
26057and bytes in one or more input files.
26058
26059@menu
26060* Bytes vs. Characters::        Modern character sets.
26061* Using extensions::            A brief intro to extensions.
26062* @command{wc} program::        Code for @file{wc.awk}.
26063@end menu
26064
26065@node Bytes vs. Characters
26066@subsubsection Modern Character Sets
26067
26068In the early days of computing, single bytes were used for storing
26069characters.  The most common character sets were ASCII and EBCDIC,
26070which each provided all the English upper- and lowercase letters, the 10
26071Hindu-Arabic numerals from 0 through 9, and a number of other standard
26072punctuation and control characters.
26073
26074Today, the most popular character set in use is Unicode (of which ASCII
26075is a pure subset). Unicode provides tens of thousands of unique characters
26076(called @dfn{code points}) to cover most existing human languages (living
and dead) and a number of nonhuman ones as well (such as Klingon and
26078J.R.R.@: Tolkien's elvish languages).
26079
26080To save space in files, Unicode code points are @dfn{encoded}, where each
26081character takes from one to four bytes in the file.  UTF-8 is possibly
26082the most popular of such @dfn{multibyte encodings}.
26083
26084The POSIX standard requires that @command{awk} function in terms
26085of characters, not bytes.  Thus in @command{gawk}, @code{length()},
26086@code{substr()}, @code{split()}, @code{match()} and the other string
26087functions (@pxref{String Functions}) all work in terms of characters in
the local character set, and not in terms of bytes. (Not all @command{awk}
implementations do so, though.)
26090
26091There is no standard, built-in way to distinguish characters from bytes
26092in an @command{awk} program.  For an @command{awk} implementation of
26093@command{wc}, which needs to make such a distinction, we will have to
26094use an external extension.
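
As a rough illustration of the difference, consider the following
sketch.  The string holds the word ``naive'' spelled with a two-byte
UTF-8 character, written using octal escapes so that the example itself
stays in plain ASCII:

@example
BEGIN @{
    s = "na\303\257ve"      # the two escapes form one UTF-8 character
    print length(s)
@}
@end example

@noindent
In a UTF-8 locale, @command{gawk} would be expected to print 5 (the
number of characters), whereas in the @samp{C} locale it would be
expected to print 6 (the number of bytes).  The result depends on the
environment and on the @command{awk} implementation; the program cannot
ask for one answer or the other.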
26095
26096@node Using extensions
@subsubsection A Brief Introduction to Extensions
26098
26099Loadable extensions are presented in full detail in @ref{Dynamic Extensions}.
26100They provide a way to add functions to @command{gawk} which can call
26101out to other facilities written in C or C++.
26102
26103For the purposes of
26104@file{wc.awk}, it's enough to know that the extension is loaded
26105with the @code{@@load} directive, and the additional function we
26106will use is called @code{mbs_length()}.  This function returns the
26107number of bytes in a string, not the number of characters.
26108
26109The @code{"mbs"} extension comes from the @code{gawkextlib}
26110project. @xref{gawkextlib} for more information.
26111
26112@node @command{wc} program
26113@subsubsection Code for @file{wc.awk}
26114
26115The usage for @command{wc} is as follows:
26116
26117@display
26118@command{wc} [@option{-lwcm}] [@var{files} @dots{}]
26119@end display
26120
26121If no files are specified on the command line, @command{wc} reads its standard
26122input. If there are multiple files, it also prints total counts for all
26123the files.  The options and their meanings are as follows:
26124
26125@table @code
26126@item -c
26127Count only bytes.
26128Once upon a time, the @samp{c} in this option stood for ``characters.''
But, as explained earlier, bytes and characters are no longer synonymous
26130with each other.
26131
26132@item -l
26133Count only lines.
26134
26135@item -m
26136Count only characters.
26137
26138@item -w
26139Count only words.
26140A ``word'' is a contiguous sequence of nonwhitespace characters, separated
26141by spaces and/or TABs.  Luckily, this is the normal way @command{awk} separates
26142fields in its input data.
26143@end table
26144
26145Implementing @command{wc} in @command{awk} is particularly elegant,
26146because @command{awk} does a lot of the work for us; it splits lines into
26147words (i.e., fields) and counts them, it counts lines (i.e., records),
26148and it can easily tell us how long a line is in characters.
26149
26150This program uses the @code{getopt()} library function
26151(@pxref{Getopt Function})
26152and the file-transition functions
26153(@pxref{Filetrans Function}).
26154
26155This version has one notable difference from older versions of
26156@command{wc}: it always prints the counts in the order lines, words,
26157characters and bytes.  Older versions note the order of the @option{-l},
26158@option{-w}, and @option{-c} options on the command line, and print the
26159counts in that order.  POSIX does not mandate this behavior, though.
26160
26161The @code{BEGIN} rule does the argument processing.  The variable
26162@code{print_total} is true if more than one file is named on the
26163command line:
26164
26165@cindex @file{wc.awk} program
26166@example
26167@c file eg/prog/wc.awk
26168# wc.awk --- count lines, words, characters, bytes
26169@c endfile
26170@ignore
26171@c file eg/prog/wc.awk
26172#
26173# Arnold Robbins, arnold@@skeeve.com, Public Domain
26174# May 1993
26175# Revised September 2020
26176@c endfile
26177@end ignore
26178@c file eg/prog/wc.awk
26179
26180# Options:
26181#    -l    only count lines
26182#    -w    only count words
26183#    -c    only count bytes
26184#    -m    only count characters
26185#
26186# Default is to count lines, words, bytes
26187#
26188# Requires getopt() and file transition library functions
26189# Requires mbs extension from gawkextlib
26190
26191@@load "mbs"
26192
26193BEGIN @{
26194    # let getopt() print a message about
26195    # invalid options. we ignore them
26196    while ((c = getopt(ARGC, ARGV, "lwcm")) != -1) @{
26197        if (c == "l")
26198            do_lines = 1
26199        else if (c == "w")
26200            do_words = 1
26201        else if (c == "c")
26202            do_bytes = 1
26203        else if (c == "m")
26204            do_chars = 1
26205    @}
26206    for (i = 1; i < Optind; i++)
26207        ARGV[i] = ""
26208
26209    # if no options, do lines, words, bytes
26210    if (! do_lines && ! do_words && ! do_chars && ! do_bytes)
26211        do_lines = do_words = do_bytes = 1
26212
26213    print_total = (ARGC - i > 1)
26214@}
26215@c endfile
26216@end example
26217
26218The @code{beginfile()} function is simple; it just resets the counts of lines,
26219words, characters and bytes to zero, and saves the current @value{FN} in
26220@code{fname}:
26221
26222@example
26223@c file eg/prog/wc.awk
26224function beginfile(file)
26225@{
26226    lines = words = chars = bytes = 0
26227    fname = FILENAME
26228@}
26229@c endfile
26230@end example
26231
26232The @code{endfile()} function adds the current file's numbers to the
running totals of lines, words, characters, and bytes.  It then prints out those
26234numbers for the file that was just read. It relies on @code{beginfile()}
26235to reset the numbers for the following @value{DF}:
26236
26237@example
26238@c file eg/prog/wc.awk
26239function endfile(file)
26240@{
26241    tlines += lines
26242    twords += words
26243    tchars += chars
26244    tbytes += bytes
26245    if (do_lines)
26246        printf "\t%d", lines
26247@group
26248    if (do_words)
26249        printf "\t%d", words
26250@end group
26251    if (do_chars)
26252        printf "\t%d", chars
26253    if (do_bytes)
26254        printf "\t%d", bytes
26255    printf "\t%s\n", fname
26256@}
26257@c endfile
26258@end example
26259
26260There is one rule that is executed for each line. It adds the length of
26261the record, plus one, to @code{chars}.  Adding one plus the record length
26262is needed because the newline character separating records (the value
26263of @code{RS}) is not part of the record itself, and thus not included
26264in its length.  Similarly, it adds the length of the record in bytes,
26265plus one, to @code{bytes}.  Next, @code{lines} is incremented for each
26266line read, and @code{words} is incremented by the value of @code{NF},
26267which is the number of ``words'' on this line:
26268
26269@example
26270@c file eg/prog/wc.awk
26271# do per line
26272@{
26273    chars += length($0) + 1    # get newline
26274    bytes += mbs_length($0) + 1
26275    lines++
26276    words += NF
26277@}
26278@c endfile
26279@end example
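
The per-record arithmetic can be checked by hand on a sample record.
This sketch assigns a made-up line to @code{$0} in a @code{BEGIN} rule
and leaves out the byte count, so that it does not need the
@code{"mbs"} extension:

@example
BEGIN @{
    $0 = "hello brave world"    # sample record, without its newline
    chars = length($0) + 1      # 17 characters plus the newline
    words = NF                  # three fields
    print chars, words
@}
@end example

@noindent
This prints @samp{18 3}.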
26280
26281Finally, the @code{END} rule simply prints the totals for all the files:
26282
26283@example
26284@c file eg/prog/wc.awk
26285END @{
26286    if (print_total) @{
26287        if (do_lines)
26288            printf "\t%d", tlines
26289        if (do_words)
26290            printf "\t%d", twords
26291        if (do_chars)
26292            printf "\t%d", tchars
26293        if (do_bytes)
26294            printf "\t%d", tbytes
26295        print "\ttotal"
26296    @}
26297@}
26298@c endfile
26299@end example
26300
26301@node Miscellaneous Programs
26302@section A Grab Bag of @command{awk} Programs
26303
26304This @value{SECTION} is a large ``grab bag'' of miscellaneous programs.
26305We hope you find them both interesting and enjoyable.
26306
26307@menu
26308* Dupword Program::             Finding duplicated words in a document.
26309* Alarm Program::               An alarm clock.
26310* Translate Program::           A program similar to the @command{tr} utility.
26311* Labels Program::              Printing mailing labels.
26312* Word Sorting::                A program to produce a word usage count.
26313* History Sorting::             Eliminating duplicate entries from a history
26314                                file.
26315* Extract Program::             Pulling out programs from Texinfo source
26316                                files.
26317* Simple Sed::                  A Simple Stream Editor.
26318* Igawk Program::               A wrapper for @command{awk} that includes
26319                                files.
26320* Anagram Program::             Finding anagrams from a dictionary.
26321* Signature Program::           People do amazing things with too much time on
26322                                their hands.
26323@end menu
26324
26325@node Dupword Program
26326@subsection Finding Duplicated Words in a Document
26327
26328@cindex words @subentry duplicate, searching for
26329@cindex searching @subentry for words
26330@cindex documents, searching
26331A common error when writing large amounts of prose is to accidentally
26332duplicate words.  Typically you will see this in text as something like ``the
26333the program does the following@dots{}''  When the text is online, often
26334the duplicated words occur at the end of one line and the
26335@iftex
26336the
26337@end iftex
26338beginning of
26339another, making them very difficult to spot.
26340@c as here!
26341
26342This program, @file{dupword.awk}, scans through a file one line at a time
26343and looks for adjacent occurrences of the same word.  It also saves the last
26344word on a line (in the variable @code{prev}) for comparison with the first
26345word on the next line.
26346
26347@cindex Texinfo
The first statement makes sure that the line is all lowercase,
26349so that, for example, ``The'' and ``the'' compare equal to each other.
26350The next statement replaces nonalphanumeric and nonwhitespace characters
26351with spaces, so that punctuation does not affect the comparison either.
26352The characters are replaced with spaces so that formatting controls
26353don't create nonsense words (e.g., the Texinfo @samp{@@code@{NF@}}
26354becomes @samp{codeNF} if punctuation is simply deleted).  The record is
26355then resplit into fields, yielding just the actual words on the line,
26356and ensuring that there are no empty fields.
26357
26358If there are no fields left after removing all the punctuation, the
26359current record is skipped.  Otherwise, the program loops through each
26360word, comparing it to the previous one:
26361
26362@cindex @file{dupword.awk} program
26363@example
26364@c file eg/prog/dupword.awk
26365# dupword.awk --- find duplicate words in text
26366@c endfile
26367@ignore
26368@c file eg/prog/dupword.awk
26369#
26370# Arnold Robbins, arnold@@skeeve.com, Public Domain
26371# December 1991
26372# Revised October 2000
26373
26374@c endfile
26375@end ignore
26376@c file eg/prog/dupword.awk
26377@{
26378    $0 = tolower($0)
26379    gsub(/[^[:alnum:][:blank:]]/, " ");
26380    $0 = $0         # re-split
26381    if (NF == 0)
26382        next
26383    if ($1 == prev)
26384        printf("%s:%d: duplicate %s\n",
26385            FILENAME, FNR, $1)
26386    for (i = 2; i <= NF; i++)
26387        if ($i == $(i-1))
26388            printf("%s:%d: duplicate %s\n",
26389                FILENAME, FNR, $i)
26390    prev = $NF
26391@}
26392@c endfile
26393@end example
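
The normalization steps can be tried on a single line.  The following
sketch applies the same three statements to a made-up record in a
@code{BEGIN} rule and reports only duplicates within that record; it
does not track @code{prev} across lines:

@example
BEGIN @{
    $0 = "The @@code@{NF@} variable, the THE count."
    $0 = tolower($0)
    gsub(/[^[:alnum:][:blank:]]/, " ")
    $0 = $0         # re-split
    # fields are now: the code nf variable the the count
    for (i = 2; i <= NF; i++)
        if ($i == $(i-1))
            print "duplicate", $i
@}
@end example

@noindent
After the substitution, the record has seven fields, and the sketch
reports one duplicate, @samp{the}.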
26394
26395@node Alarm Program
26396@subsection An Alarm Clock Program
26397@cindex insomnia, cure for
26398@cindex Robbins @subentry Arnold
26399@quotation
26400@i{Nothing cures insomnia like a ringing alarm clock.}
26401@author Arnold Robbins
26402@end quotation
26403@cindex Quanstrom, Erik
26404@ignore
26405Date: Sat, 15 Feb 2014 16:47:09 -0500
26406Subject: Re: 9atom install question
26407Message-ID: <l2jcvx6j6mey60xnrkb0hhob.1392500829294@email.android.com>
26408From: Erik Quanstrom <quanstro@quanstro.net>
26409To: Aharon Robbins <arnold@skeeve.com>
26410
26411yes.
26412
26413- erik
26414
26415Aharon Robbins <arnold@skeeve.com> wrote:
26416
26417>> sleep is for web developers.
26418>
26419>Can I quote you, in the gawk manual?
26420>
26421>Thanks,
26422>
26423>Arnold
26424@end ignore
26425@quotation
26426@i{Sleep is for web developers.}
26427@author Erik Quanstrom
26428@end quotation
26429
26430@cindex time @subentry alarm clock example program
26431@cindex alarm clock example program
The following program is a simple ``alarm clock.''
26433You give it a time of day and an optional message.  At the specified time,
26434it prints the message on the standard output. In addition, you can give it
26435the number of times to repeat the message as well as a delay between
26436repetitions.
26437
26438This program uses the @code{getlocaltime()} function from
26439@ref{Getlocaltime Function}.
26440
26441@cindex ASCII
26442All the work is done in the @code{BEGIN} rule.  The first part is argument
26443checking and setting of defaults: the delay, the count, and the message to
26444print.  If the user supplied a message without the ASCII BEL
26445character (known as the ``alert'' character, @code{"\a"}), then it is added to
26446the message.  (On many systems, printing the ASCII BEL generates an
26447audible alert. Thus, when the alarm goes off, the system calls attention
26448to itself in case the user is not looking at the computer.)
26449Just for a change, this program uses a @code{switch} statement
26450(@pxref{Switch Statement}), but the processing could be done with a series of
26451@code{if}-@code{else} statements instead.
26452Here is the program:
26453
26454@cindex @file{alarm.awk} program
26455@example
26456@c file eg/prog/alarm.awk
26457# alarm.awk --- set an alarm
26458#
26459# Requires getlocaltime() library function
26460@c endfile
26461@ignore
26462@c file eg/prog/alarm.awk
26463#
26464# Arnold Robbins, arnold@@skeeve.com, Public Domain
26465# May 1993
26466# Revised December 2010
26467
26468@c endfile
26469@end ignore
26470@c file eg/prog/alarm.awk
26471# usage: alarm time [ "message" [ count [ delay ] ] ]
26472
26473BEGIN @{
26474    # Initial argument sanity checking
26475    usage1 = "usage: alarm time ['message' [count [delay]]]"
26476    usage2 = sprintf("\t(%s) time ::= hh:mm", ARGV[1])
26477
26478    if (ARGC < 2) @{
26479        print usage1 > "/dev/stderr"
26480        print usage2 > "/dev/stderr"
26481        exit 1
26482    @}
26483    switch (ARGC) @{
26484    case 5:
26485        delay = ARGV[4] + 0
26486        # fall through
26487    case 4:
26488        count = ARGV[3] + 0
26489        # fall through
26490    case 3:
26491        message = ARGV[2]
26492        break
26493    default:
26494        if (ARGV[1] !~ /[[:digit:]]?[[:digit:]]:[[:digit:]]@{2@}/) @{
26495            print usage1 > "/dev/stderr"
26496            print usage2 > "/dev/stderr"
26497            exit 1
26498        @}
26499        break
26500    @}
26501
26502    # set defaults for once we reach the desired time
26503    if (delay == 0)
26504        delay = 180    # 3 minutes
26505@group
26506    if (count == 0)
26507        count = 5
26508@end group
26509    if (message == "")
26510        message = sprintf("\aIt is now %s!\a", ARGV[1])
26511    else if (index(message, "\a") == 0)
26512        message = "\a" message "\a"
26513@c endfile
26514@end example
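
For comparison, here is a sketch of how the same argument processing
might look with @code{if}-@code{else} statements in place of the
@code{switch}.  It is not part of @file{alarm.awk}; the final
@code{printf} is added only so that the sketch can be run on its own
to show which values were picked up from the command line:

@example
# hypothetical if-else version of the switch; not in alarm.awk
BEGIN @{
    usage1 = "usage: alarm time ['message' [count [delay]]]"
    usage2 = sprintf("\t(%s) time ::= hh:mm", ARGV[1])

    if (ARGC >= 3 && ARGC <= 5) @{
        if (ARGC == 5)
            delay = ARGV[4] + 0
        if (ARGC >= 4)
            count = ARGV[3] + 0
        message = ARGV[2]
    @} else if (ARGV[1] !~ /[[:digit:]]?[[:digit:]]:[[:digit:]]@{2@}/) @{
        print usage1 > "/dev/stderr"
        print usage2 > "/dev/stderr"
        exit 1
    @}
    printf "message=%s count=%d delay=%d\n", message, count, delay
@}
@end example

@noindent
Each fall-through @code{case} in the @code{switch} becomes a test
on @code{ARGC}, which is arguably harder to read at a glance.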
26515
26516The next @value{SECTION} of code turns the alarm time into hours and minutes,
26517converts it (if necessary) to a 24-hour clock, and then turns that
26518time into a count of the seconds since midnight.  Next it turns the current
26519time into a count of seconds since midnight.  The difference between the two
26520is how long to wait before setting off the alarm:
26521
26522@example
26523@c file eg/prog/alarm.awk
26524    # split up alarm time
26525    split(ARGV[1], atime, ":")
26526    hour = atime[1] + 0    # force numeric
26527    minute = atime[2] + 0  # force numeric
26528
26529    # get current broken down time
26530    getlocaltime(now)
26531
26532    # if time given is 12-hour hours and it's after that
26533    # hour, e.g., `alarm 5:30' at 9 a.m. means 5:30 p.m.,
26534    # then add 12 to real hour
26535    if (hour < 12 && now["hour"] > hour)
26536        hour += 12
26537
26538    # set target time in seconds since midnight
26539    target = (hour * 60 * 60) + (minute * 60)
26540
26541    # get current time in seconds since midnight
26542    current = (now["hour"] * 60 * 60) + \
26543               (now["minute"] * 60) + now["second"]
26544
26545    # how long to sleep for
26546    naptime = target - current
26547    if (naptime <= 0) @{
26548        print "alarm: time is in the past!" > "/dev/stderr"
26549        exit 1
26550    @}
26551@c endfile
26552@end example
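
As a worked example of this arithmetic, suppose the command is
@samp{alarm 5:30} and the current time is 9:15:20 a.m.  The following
standalone sketch hard-codes those assumed values instead of calling
@code{getlocaltime()}, so that the result is reproducible:

@example
BEGIN @{
    hour = 5; minute = 30                # from `alarm 5:30'
    now["hour"] = 9; now["minute"] = 15  # assumed current time,
    now["second"] = 20                   # 9:15:20 a.m.

    if (hour < 12 && now["hour"] > hour)
        hour += 12                       # 5 becomes 17 (5:30 p.m.)

    target = (hour * 60 * 60) + (minute * 60)
    current = (now["hour"] * 60 * 60) + \
               (now["minute"] * 60) + now["second"]
    print target, current, target - current
@}
@print{} 63000 33320 29680
@end example

@noindent
The alarm would thus wait 29,680 seconds, which is 8 hours, 14 minutes,
and 40 seconds.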
26553
26554@cindex @command{sleep} utility
26555Finally, the program uses the @code{system()} function
26556(@pxref{I/O Functions})
26557to call the @command{sleep} utility.  The @command{sleep} utility simply pauses
26558for the given number of seconds.  If the exit status is not zero,
26559the program assumes that @command{sleep} was interrupted and exits. If
26560@command{sleep} exited with an OK status (zero), then the program prints the
26561message in a loop, again using @command{sleep} to delay for however many
26562seconds are necessary:
26563
26564@example
26565@c file eg/prog/alarm.awk
26566    # zzzzzz..... go away if interrupted
26567    if (system(sprintf("sleep %d", naptime)) != 0)
26568        exit 1
26569
26570    # time to notify!
26571    command = sprintf("sleep %d", delay)
26572    for (i = 1; i <= count; i++) @{
26573        print message
26574        # if sleep command interrupted, go away
26575        if (system(command) != 0)
26576            break
26577    @}
26578
26579    exit 0
26580@}
26581@c endfile
26582@end example
26583
26584@node Translate Program
26585@subsection Transliterating Characters
26586
26587@cindex characters @subentry transliterating
26588@cindex @command{tr} utility
26589The system @command{tr} utility transliterates characters.  For example, it is
26590often used to map uppercase letters into lowercase for further processing:
26591
26592@example
26593@var{generate data} | tr 'A-Z' 'a-z' | @var{process data} @dots{}
26594@end example
26595
26596@command{tr} requires two lists of characters.@footnote{On some older
26597systems, including Solaris, the system version of @command{tr} may require
26598that the lists be written as range expressions enclosed in square brackets
26599(@samp{[a-z]}) and quoted, to prevent the shell from attempting a
26600@value{FN} expansion.  This is not a feature.}  When processing the input, the
26601first character in the first list is replaced with the first character
26602in the second list, the second character in the first list is replaced
26603with the second character in the second list, and so on.  If there are
26604more characters in the ``from'' list than in the ``to'' list, the last
26605character of the ``to'' list is used for the remaining characters in the
26606``from'' list.
26607
26608Once upon a time,
26609@c early or mid-1989!
26610a user proposed adding a transliteration function
26611to @command{gawk}.
26612@c Wishing to avoid gratuitous new features,
26613@c at least theoretically
26614The following program was written to
26615prove that character transliteration could be done with a user-level
26616function.  This program is not as complete as the system @command{tr} utility,
26617but it does most of the job.
26618
26619The @command{translate} program was written long before @command{gawk}
26620acquired the ability to split each character in a string into separate
array elements.  Thus, it makes repeated use of the @code{substr()}
built-in function (@pxref{String Functions}) to pick strings apart one
character at a time.  There are two functions.  The first, @code{stranslate()},
26624takes three arguments:
26625
26626@table @code
26627@item from
26628A list of characters from which to translate
26629
26630@item to
26631A list of characters to which to translate
26632
26633@item target
26634The string on which to do the translation
26635@end table
26636
Associative arrays make the translation part fairly easy. @code{t_ar} holds
the ``to'' characters, indexed by the ``from'' characters.  Then a simple
loop goes through @code{target}, one character at a time.  For each character
in @code{target}, if the character is an index of @code{t_ar} (that is, it
appears in @code{from}), it is replaced with the corresponding @code{to}
character.  The function returns the resulting string.
26642
26643The @code{translate()} function calls @code{stranslate()}, using @code{$0}
26644as the target.  The main program sets two global variables, @code{FROM} and
26645@code{TO}, from the command line, and then changes @code{ARGV} so that
26646@command{awk} reads from the standard input.
26647
26648Finally, the processing rule simply calls @code{translate()} for each record:
26649
26650@cindex @file{translate.awk} program
26651@example
26652@c file eg/prog/translate.awk
26653# translate.awk --- do tr-like stuff
26654@c endfile
26655@ignore
26656@c file eg/prog/translate.awk
26657#
26658# Arnold Robbins, arnold@@skeeve.com, Public Domain
26659# August 1989
26660# February 2009 - bug fix
26661
26662@c endfile
26663@end ignore
26664@c file eg/prog/translate.awk
26665# Bugs: does not handle things like tr A-Z a-z; it has
26666# to be spelled out. However, if `to' is shorter than `from',
26667# the last character in `to' is used for the rest of `from'.
26668
26669function stranslate(from, to, target,     lf, lt, ltarget, t_ar, i, c,
26670                                                               result)
26671@{
26672    lf = length(from)
26673    lt = length(to)
26674    ltarget = length(target)
26675    for (i = 1; i <= lt; i++)
26676        t_ar[substr(from, i, 1)] = substr(to, i, 1)
26677    if (lt < lf)
26678        for (; i <= lf; i++)
26679            t_ar[substr(from, i, 1)] = substr(to, lt, 1)
26680    for (i = 1; i <= ltarget; i++) @{
26681        c = substr(target, i, 1)
26682        if (c in t_ar)
26683            c = t_ar[c]
26684        result = result c
26685    @}
26686    return result
26687@}
26688
26689function translate(from, to)
26690@{
26691    return $0 = stranslate(from, to, $0)
26692@}
26693
26694# main program
26695BEGIN @{
26696@group
26697    if (ARGC < 3) @{
26698        print "usage: translate from to" > "/dev/stderr"
26699        exit
26700    @}
26701@end group
26702    FROM = ARGV[1]
26703    TO = ARGV[2]
26704    ARGC = 2
26705    ARGV[1] = "-"
26706@}
26707
26708@{
26709    translate(FROM, TO)
26710    print
26711@}
26712@c endfile
26713@end example
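
Assuming the program has been saved in a file named @file{translate.awk},
a quick test might look like the following.  The second command shows
the last character of the ``to'' list being reused when that list is
shorter than the ``from'' list:

@example
$ @kbd{echo "hello, world" | gawk -f translate.awk lo xy}
@print{} hexxy, wyrxd
$ @kbd{echo "hello, world" | gawk -f translate.awk lod x}
@print{} hexxx, wxrxx
@end example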
26714
26715It is possible to do character transliteration in a user-level
26716function, but it is not necessarily efficient, and we (the @command{gawk}
26717developers) started to consider adding a built-in function.  However,
26718shortly after writing this program, we learned that Brian Kernighan
26719had added the @code{toupper()} and @code{tolower()} functions to his
26720@command{awk} (@pxref{String Functions}).  These functions handle the
26721vast majority of the cases where character transliteration is necessary,
26722and so we chose to simply add those functions to @command{gawk} as well
26723and then leave well enough alone.
26724
26725An obvious improvement to this program would be to set up the
26726@code{t_ar} array only once, in a @code{BEGIN} rule. However, this
26727assumes that the ``from'' and ``to'' lists
26728will never change throughout the lifetime of the program.
26729
26730Another obvious improvement is to enable the use of ranges,
26731such as @samp{a-z}, as allowed by the @command{tr} utility.
26732Look at the code for @file{cut.awk} (@pxref{Cut Program})
26733for inspiration.
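
As a starting point, here is a minimal sketch of a range expander.
The function name @code{expand_ranges()} and the global array
@code{_range_ord} are invented for this illustration and are not part
of @file{translate.awk}.  The sketch assumes ASCII characters only and
does no error checking:

@example
# expand_ranges --- hypothetical helper: turn "a-e" into "abcde"
# (a sketch; handles ASCII characters only)

function expand_ranges(list,    i, n, c, lo, hi, out)
@{
    if (! ("a" in _range_ord))      # build lookup table once
        for (i = 1; i < 128; i++)
            _range_ord[sprintf("%c", i)] = i

    n = length(list)
    for (i = 1; i <= n; i++) @{
        c = substr(list, i, 1)
        if (i + 2 <= n && substr(list, i + 1, 1) == "-") @{
            lo = _range_ord[c]
            hi = _range_ord[substr(list, i + 2, 1)]
            for (; lo <= hi; lo++)
                out = out sprintf("%c", lo)
            i += 2                  # skip over "-" and the end character
        @} else
            out = out c
    @}
    return out
@}

BEGIN @{
    print expand_ranges("a-e")
    print expand_ranges("A-Cxyz0-3")
@}
@print{} abcde
@print{} ABCxyz0123
@end example

@noindent
With such a function available, the @code{BEGIN} rule of
@file{translate.awk} could pass @code{ARGV[1]} and @code{ARGV[2]}
through it before assigning them to @code{FROM} and @code{TO}.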
26734
26735
26736@node Labels Program
26737@subsection Printing Mailing Labels
26738
26739@cindex printing @subentry mailing labels
26740@cindex mailing labels, printing
26741Here is a ``real-world''@footnote{``Real world'' is defined as
26742``a program actually used to get something done.''}
26743program.  This
26744script reads lists of names and
26745addresses and generates mailing labels.  Each page of labels has 20 labels
26746on it, two across and 10 down.  The addresses are guaranteed to be no more
26747than five lines of data.  Each address is separated from the next by a blank
26748line.
26749
26750The basic idea is to read 20 labels' worth of data.  Each line of each label
26751is stored in the @code{line} array.  The single rule takes care of filling
26752the @code{line} array and printing the page when 20 labels have been read.
26753
26754The @code{BEGIN} rule simply sets @code{RS} to the empty string, so that
26755@command{awk} splits records at blank lines
26756(@pxref{Records}).
26757It sets @code{MAXLINES} to 100, because 100 is the maximum number
26758of lines on the page
26759@iftex
26760(@math{20 @cdot 5 = 100}).
26761@end iftex
26762@ifnottex
26763@ifnotdocbook
26764(20 * 5 = 100).
26765@end ifnotdocbook
26766@end ifnottex
26767@docbook
26768(20 &sdot; 5 = 100).
26769@end docbook
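
To see the effect of setting @code{RS} to the empty string, consider
this small experiment with two made-up addresses.  Each blank-line-separated
block becomes one record, which @code{split()} then breaks into lines:

@example
$ @kbd{printf 'Ann\n1 Main St\n\nBob\n2 Oak Ave\nApt 3\n' |}
> @kbd{gawk 'BEGIN @{ RS = "" @}}
> @kbd{      @{ n = split($0, a, "\n"); print NR ": " n " lines, first is " a[1] @}'}
@print{} 1: 2 lines, first is Ann
@print{} 2: 3 lines, first is Bob
@end example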
26770
26771Most of the work is done in the @code{printpage()} function.
26772The label lines are stored sequentially in the @code{line} array.  But they
26773have to print horizontally: @code{line[1]} next to @code{line[6]},
26774@code{line[2]} next to @code{line[7]}, and so on.  Two loops
26775accomplish this.  The outer loop, controlled by @code{i}, steps through
26776every 10 lines of data; this is each row of labels.  The inner loop,
26777controlled by @code{j}, goes through the lines within the row.
26778As @code{j} goes from 0 to 4, @samp{i+j} is the @code{j}th line in
26779the row, and @samp{i+j+5} is the entry next to it.  The output ends up
26780looking something like this:
26781
26782@example
26783line 1          line 6
26784line 2          line 7
26785line 3          line 8
26786line 4          line 9
26787line 5          line 10
26788@dots{}
26789@end example
26790
26791@noindent
26792The @code{printf} format string @samp{%-41s} left-aligns
26793the data and prints it within a fixed-width field.
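
The following one-liner illustrates the difference between left and
right alignment, using a field width of 10 instead of 41 so that the
output fits on the line:

@example
$ @kbd{gawk 'BEGIN @{ printf "[%-10s] [%10s]\n", "left", "right" @}'}
@print{} [left      ] [     right]
@end example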
26794
26795As a final note, an extra blank line is printed at lines 21 and 61, to keep
26796the output lined up on the labels.  This is dependent on the particular
26797brand of labels in use when the program was written.  You will also note
26798that there are two blank lines at the top and two blank lines at the bottom.
26799
26800The @code{END} rule arranges to flush the final page of labels; there may
26801not have been an even multiple of 20 labels in the data:
26802
26803@cindex @file{labels.awk} program
26804@example
26805@c file eg/prog/labels.awk
26806# labels.awk --- print mailing labels
26807@c endfile
26808@ignore
26809@c file eg/prog/labels.awk
26810#
26811# Arnold Robbins, arnold@@skeeve.com, Public Domain
26812# June 1992
26813# December 2010, minor edits
26814@c endfile
26815@end ignore
26816@c file eg/prog/labels.awk
26817
26818# Each label is 5 lines of data that may have blank lines.
26819# The label sheets have 2 blank lines at the top and 2 at
26820# the bottom.
26821
26822BEGIN    @{ RS = "" ; MAXLINES = 100 @}
26823
26824function printpage(    i, j)
26825@{
26826    if (Nlines <= 0)
26827        return
26828
26829    printf "\n\n"        # header
26830
26831    for (i = 1; i <= Nlines; i += 10) @{
26832        if (i == 21 || i == 61)
26833            print ""
26834        for (j = 0; j < 5; j++) @{
26835            if (i + j > MAXLINES)
26836                break
26837            printf "   %-41s %s\n", line[i+j], line[i+j+5]
26838        @}
26839        print ""
26840    @}
26841
26842    printf "\n\n"        # footer
26843
26844    delete line
26845@}
26846
26847# main rule
26848@{
26849    if (Count >= 20) @{
26850        printpage()
26851        Count = 0
26852        Nlines = 0
26853    @}
26854    n = split($0, a, "\n")
26855    for (i = 1; i <= n; i++)
26856        line[++Nlines] = a[i]
26857    for (; i <= 5; i++)
26858        line[++Nlines] = ""
26859    Count++
26860@}
26861
26862END @{
26863    printpage()
26864@}
26865@c endfile
26866@end example
26867
26868@node Word Sorting
26869@subsection Generating Word-Usage Counts
26870
26871@cindex words @subentry usage counts, generating
26872
26873When working with large amounts of text, it can be interesting to know
26874how often different words appear.  For example, an author may overuse
26875certain words, in which case he or she might wish to find synonyms to substitute
26876for words that appear too often. This @value{SUBSECTION} develops a
26877program for counting words and presenting the frequency information
26878in a useful format.
26879
26880At first glance, a program like this would seem to do the job:
26881
26882@example
26883# wordfreq-first-try.awk --- print list of word frequencies
26884
26885@{
26886    for (i = 1; i <= NF; i++)
26887        freq[$i]++
26888@}
26889
26890@group
26891END @{
26892    for (word in freq)
26893        printf "%s\t%d\n", word, freq[word]
26894@}
26895@end group
26896@end example
26897
26898The program relies on @command{awk}'s default field-splitting
26899mechanism to break each line up into ``words'' and uses an
26900associative array named @code{freq}, indexed by each word, to count
26901the number of times the word occurs. In the @code{END} rule,
26902it prints the counts.
26903
26904This program has several problems that prevent it from being
26905useful on real text files:
26906
26907@itemize @value{BULLET}
26908@item
26909The @command{awk} language considers upper- and lowercase characters to be
26910distinct.  Therefore, ``bartender'' and ``Bartender'' are not treated
26911as the same word.  This is undesirable, because words are capitalized
26912if they begin sentences in normal text, and a frequency analyzer should
26913not be sensitive to capitalization.
26914
26915@item
26916Words are detected using the @command{awk} convention that fields are
26917separated just by whitespace.  Other characters in the input (except
26918newlines) don't have any special meaning to @command{awk}.  This means that
26919punctuation characters count as part of words.
26920
26921@item
26922The output does not come out in any useful order.  You're more likely to be
26923interested in which words occur most frequently or in having an alphabetized
26924table of how frequently each word occurs.
26925@end itemize
26926
26927@cindex @command{sort} utility
26928The first problem can be solved by using @code{tolower()} to remove case
26929distinctions.  The second problem can be solved by using @code{gsub()}
26930to remove punctuation characters.  Finally, we solve the third problem
26931by using the system @command{sort} utility to process the output of the
26932@command{awk} script.  Here is the new version of the program:
26933
26934@cindex @file{wordfreq.awk} program
26935@example
26936@c file eg/prog/wordfreq.awk
26937# wordfreq.awk --- print list of word frequencies
26938
26939@{
26940    $0 = tolower($0)    # remove case distinctions
26941    # remove punctuation
26942    gsub(/[^[:alnum:]_[:blank:]]/, "", $0)
26943    for (i = 1; i <= NF; i++)
26944        freq[$i]++
26945@}
26946
26947@c endfile
26948END @{
26949    for (word in freq)
26950        printf "%s\t%d\n", word, freq[word]
26951@}
26952@end example
26953
26954The regexp @code{/[^[:alnum:]_[:blank:]]/} might have been written
26955@code{/[[:punct:]]/}, but then underscores would also be removed,
26956and we want to keep them.
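
The effect of the two transformations can be checked on a single line
of made-up input.  Note that the apostrophe, comma, and period are
removed, but the underscore survives:

@example
$ @kbd{echo "Don't panic, user_name." |}
> @kbd{gawk '@{ $0 = tolower($0)}
> @kbd{        gsub(/[^[:alnum:]_[:blank:]]/, "", $0); print @}'}
@print{} dont panic user_name
@end example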
26957
26958Assuming we have saved this program in a file named @file{wordfreq.awk},
26959and that the data is in @file{file1}, the following pipeline:
26960
26961@example
26962awk -f wordfreq.awk file1 | sort -k 2nr
26963@end example
26964
26965@noindent
26966produces a table of the words appearing in @file{file1} in order of
26967decreasing frequency.
26968
26969The @command{awk} program suitably massages the
26970data and produces a word frequency table, which is not ordered.
26971The @command{awk} script's output is then sorted by the @command{sort}
26972utility and printed on the screen.
26973
26974The options given to @command{sort}
26975specify a sort that uses the second field of each input line (skipping
26976one field), that the sort keys should be treated as numeric quantities
26977(otherwise @samp{15} would come before @samp{5}), and that the sorting
26978should be done in descending (reverse) order.
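
Here is a small demonstration of these options on three made-up lines.
Spaces are used between the fields instead of tabs to keep the example
readable; by default, @command{sort} accepts either as a field separator:

@example
$ @kbd{printf 'the 15\nawk 5\ngawk 120\n' | sort -k 2nr}
@print{} gawk 120
@print{} the 15
@print{} awk 5
@end example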
26979
26980The @command{sort} could even be done from within the program, by changing
26981the @code{END} action to:
26982
26983@example
26984@c file eg/prog/wordfreq.awk
26985END @{
26986    sort = "sort -k 2nr"
26987    for (word in freq)
26988        printf "%s\t%d\n", word, freq[word] | sort
26989    close(sort)
26990@}
26991@c endfile
26992@end example
26993
26994This way of sorting must be used on systems that do not
26995have true pipes at the command-line (or batch-file) level.
26996See the general operating system documentation for more information on how
26997to use the @command{sort} program.
26998
26999@node History Sorting
27000@subsection Removing Duplicates from Unsorted Text
27001
27002@cindex lines @subentry duplicate, removing
27003The @command{uniq} program
27004(@pxref{Uniq Program})
27005removes duplicate lines from @emph{sorted} data.
27006
Suppose, however, you need to remove duplicate lines from a @value{DF} but
want to preserve the order in which the lines appear.  A good example of
27009this might be a shell history file.  The history file keeps a copy of all
27010the commands you have entered, and it is not unusual to repeat a command
27011several times in a row.  Occasionally you might want to compact the history
27012by removing duplicate entries.  Yet it is desirable to maintain the order
27013of the original commands.
27014
27015This simple program does the job.  It uses two arrays.  The @code{data}
27016array is indexed by the text of each line.
27017For each line, @code{data[$0]} is incremented.
27018If a particular line has not
27019been seen before, then @code{data[$0]} is zero.
27020In this case, the text of the line is stored in @code{lines[count]}.
27021Each element of @code{lines} is a unique command, and the indices of
27022@code{lines} indicate the order in which those lines are encountered.
27023The @code{END} rule simply prints out the lines, in order:
27024
27025@cindex Rakitzis, Byron
27026@cindex @file{histsort.awk} program
27027@example
27028@c file eg/prog/histsort.awk
27029# histsort.awk --- compact a shell history file
27030# Thanks to Byron Rakitzis for the general idea
27031@c endfile
27032@ignore
27033@c file eg/prog/histsort.awk
27034#
27035# Arnold Robbins, arnold@@skeeve.com, Public Domain
27036# May 1993
27037@c endfile
27038@end ignore
27039@c file eg/prog/histsort.awk
27040
27041@group
27042@{
27043    if (data[$0]++ == 0)
27044        lines[++count] = $0
27045@}
27046@end group
27047
27048@group
27049END @{
27050    for (i = 1; i <= count; i++)
27051        print lines[i]
27052@}
27053@end group
27054@c endfile
27055@end example
27056
27057This program also provides a foundation for generating other useful
27058information.  For example, using the following @code{print} statement in the
27059@code{END} rule indicates how often a particular command is used:
27060
27061@example
27062print data[lines[i]], lines[i]
27063@end example
27064
27065@noindent
27066This works because @code{data[$0]} is incremented each time a line is
27067seen.
27068
27069@c rick@openfortress.nl, Tue, 24 Dec 2019 13:43:06 +0100
27070Rick van Rein offers the following one-liner to do the same job of
27071removing duplicates from unsorted text:
27072
27073@example
27074awk '@{ if (! seen[$0]++) print @}'
27075@end example
27076
27077This can be simplified even further, at the risk of becoming
27078almost too obscure:
27079
27080@example
27081awk '! seen[$0]++'
27082@end example
27083
27084@noindent
27085This version uses the expression as a pattern, relying on
27086@command{awk}'s default action of printing the line when
27087the pattern is true.
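
For example, given a short made-up history in which two commands are
repeated, only the first occurrence of each line is printed, and the
original order is kept:

@example
$ @kbd{printf 'ls\ncd /tmp\nls\nmake\ncd /tmp\n' | gawk '! seen[$0]++'}
@print{} ls
@print{} cd /tmp
@print{} make
@end example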
27088
27089@node Extract Program
27090@subsection Extracting Programs from Texinfo Source Files
27091
27092@cindex Texinfo @subentry extracting programs from source files
27093@cindex files @subentry Texinfo, extracting programs from
27094@ifnotinfo
27095Both this chapter and the previous chapter
27096(@ref{Library Functions})
27097present a large number of @command{awk} programs.
27098@end ifnotinfo
27099@ifinfo
27100The nodes
27101@ref{Library Functions},
27102and @ref{Sample Programs},
27103are the top level nodes for a large number of @command{awk} programs.
27104@end ifinfo
27105If you want to experiment with these programs, it is tedious to type
27106them in by hand.  Here we present a program that can extract parts of a
27107Texinfo input file into separate files.
27108
27109@cindex Texinfo
27110This @value{DOCUMENT} is written in @uref{https://www.gnu.org/software/texinfo/, Texinfo},
27111the GNU Project's document formatting language.
27112A single Texinfo source file can be used to produce both
27113printed documentation, with @TeX{}, and online documentation.
27114@ifnotinfo
27115(Texinfo is fully documented in the book
27116@cite{Texinfo---The GNU Documentation Format},
27117available from the Free Software Foundation,
27118and also available @uref{https://www.gnu.org/software/texinfo/manual/texinfo/, online}.)
27119@end ifnotinfo
27120@ifinfo
27121(The Texinfo language is described fully, starting with
27122@inforef{Top, , Texinfo, texinfo,Texinfo---The GNU Documentation Format}.)
27123@end ifinfo
27124
27125For our purposes, it is enough to know three things about Texinfo input
27126files:
27127
27128@itemize @value{BULLET}
27129@item
27130The ``at'' symbol (@samp{@@}) is special in Texinfo, much as
27131the backslash (@samp{\}) is in C
27132or @command{awk}.  Literal @samp{@@} symbols are represented in Texinfo source
27133files as @samp{@@@@}.
27134
27135@item
27136Comments start with either @samp{@@c} or @samp{@@comment}.
27137The file-extraction program works by using special comments that start
27138at the beginning of a line.
27139
27140@item
27141Lines containing @samp{@@group} and @samp{@@end group} commands bracket
27142example text that should not be split across a page boundary.
27143(Unfortunately, @TeX{} isn't always smart enough to do things exactly right,
27144so we have to give it some help.)
27145@end itemize
27146
27147The following program, @file{extract.awk}, reads through a Texinfo source
27148file and does two things, based on the special comments.
27149Upon seeing @samp{@w{@@c system @dots{}}},
27150it runs a command, by extracting the command text from the
27151control line and passing it on to the @code{system()} function
27152(@pxref{I/O Functions}).
27153Upon seeing @samp{@@c file @var{filename}}, each subsequent line is sent to
27154the file @var{filename}, until @samp{@@c endfile} is encountered.
27155The rules in @file{extract.awk} match either @samp{@@c} or
27156@samp{@@comment} by letting the @samp{omment} part be optional.
27157Lines containing @samp{@@group} and @samp{@@end group} are simply removed.
27158@file{extract.awk} uses the @code{join()} library function
27159(@pxref{Join Function}).
27160
27161The example programs in the online Texinfo source for @cite{@value{TITLE}}
27162(@file{gawktexi.in}) have all been bracketed inside @samp{file} and
27163@samp{endfile} lines.  The @command{gawk} distribution uses a copy of
27164@file{extract.awk} to extract the sample programs and install many
27165of them in a standard directory where @command{gawk} can find them.
27166The Texinfo file looks something like this:
27167
27168@example
27169@dots{}
27170This program has a @@code@{BEGIN@} rule
27171that prints a nice message:
27172
27173@@example
27174@@c file examples/messages.awk
27175BEGIN @@@{ print "Don't panic!" @@@}
27176@@c endfile
27177@@end example
27178
27179It also prints some final advice:
27180
27181@@example
27182@@c file examples/messages.awk
27183END @@@{ print "Always avoid bored archaeologists!" @@@}
27184@@c endfile
27185@@end example
27186@dots{}
27187@end example
27188
27189@file{extract.awk} begins by setting @code{IGNORECASE} to one, so that
27190mixed upper- and lowercase letters in the directives won't matter.
27191
27192The first rule handles calling @code{system()}, checking that a command is
27193given (@code{NF} is at least three) and also checking that the command
27194exits with a zero exit status, signifying OK:
27195
27196@cindex @file{extract.awk} program
27197@example
27198@c file eg/prog/extract.awk
27199# extract.awk --- extract files and run programs from Texinfo files
27200@c endfile
27201@ignore
27202@c file eg/prog/extract.awk
27203#
27204# Arnold Robbins, arnold@@skeeve.com, Public Domain
27205# May 1993
27206# Revised September 2000
27207@c endfile
27208@end ignore
27209@c file eg/prog/extract.awk
27210
27211BEGIN    @{ IGNORECASE = 1 @}
27212
27213/^@@c(omment)?[ \t]+system/ @{
27214    if (NF < 3) @{
27215        e = ("extract: " FILENAME ":" FNR)
27216        e = (e  ": badly formed `system' line")
27217        print e > "/dev/stderr"
27218        next
27219    @}
27220    $1 = ""
27221    $2 = ""
27222    stat = system($0)
27223    if (stat != 0) @{
27224        e = ("extract: " FILENAME ":" FNR)
27225        e = (e ": warning: system returned " stat)
27226        print e > "/dev/stderr"
27227    @}
27228@}
27229@c endfile
27230@end example
27231
27232@noindent
27233The variable @code{e} is used so that the rule
27234fits nicely on the @value{PAGE}.
27235
The second rule handles moving data into files.  It verifies that a
@value{FN} is given in the directive.  If the file named is not the
current file, then the name of the current file is saved in
@code{filelist} so that it can be closed later, and the named file
becomes the current file.  Leaving every output file open until the end
allows the use of the @samp{>} redirection for printing the contents,
keeping open-file management simple.
27242
27243The @code{for} loop does the work.  It reads lines using @code{getline}
27244(@pxref{Getline}).
27245For an unexpected end-of-file, it calls the @code{@w{unexpected_eof()}}
27246function.  If the line is an ``endfile'' line, then it breaks out of
27247the loop.
27248If the line is an @samp{@@group} or @samp{@@end group} line, then it
27249ignores it and goes on to the next line.
27250Similarly, comments within examples are also ignored.
27251
27252Most of the work is in the following few lines.  If the line has no @samp{@@}
27253symbols, the program can print it directly.
27254Otherwise, each leading @samp{@@} must be stripped off.
27255To remove the @samp{@@} symbols, the line is split into separate elements of
27256the array @code{a}, using the @code{split()} function
27257(@pxref{String Functions}).
27258The @samp{@@} symbol is used as the separator character.
27259Each element of @code{a} that is empty indicates two successive @samp{@@}
27260symbols in the original line.  For each two empty elements (@samp{@@@@} in
27261the original file), we have to add a single @samp{@@} symbol back in.
27262
27263When the processing of the array is finished, @code{join()} is called with the
27264value of @code{SUBSEP} (@pxref{Multidimensional}),
27265to rejoin the pieces back into a single
27266line.  That line is then printed to the output file:
27267
27268@example
27269@c file eg/prog/extract.awk
27270/^@@c(omment)?[ \t]+file/ @{
27271    if (NF != 3) @{
27272        e = ("extract: " FILENAME ":" FNR ": badly formed `file' line")
27273        print e > "/dev/stderr"
27274        next
27275    @}
27276    if ($3 != curfile) @{
27277        if (curfile != "")
27278            filelist[curfile] = 1   # save to close later
27279        curfile = $3
27280    @}
27281
27282    for (;;) @{
27283        if ((getline line) <= 0)
27284            unexpected_eof()
27285        if (line ~ /^@@c(omment)?[ \t]+endfile/)
27286            break
27287        else if (line ~ /^@@(end[ \t]+)?group/)
27288            continue
        else if (line ~ /^@@c(omment)?[ \t]+/)
27290            continue
27291        if (index(line, "@@") == 0) @{
27292            print line > curfile
27293            continue
27294        @}
27295        n = split(line, a, "@@")
27296        # if a[1] == "", means leading @@,
27297        # don't add one back in.
27298        for (i = 2; i <= n; i++) @{
27299            if (a[i] == "") @{ # was an @@@@
27300                a[i] = "@@"
27301                if (a[i+1] == "")
27302                    i++
27303            @}
27304        @}
27305@group
27306        print join(a, 1, n, SUBSEP) > curfile
27307    @}
27308@}
27309@end group
27310@c endfile
27311@end example
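
The handling of @samp{@@} symbols can be tried in isolation with a
sketch such as the following.  The function name @code{unescape_at()}
is invented for this illustration and does not appear in
@file{extract.awk}.  To keep the sketch self-contained, a plain
concatenation loop stands in for the call to @code{join()}:

@example
# unescape_at --- hypothetical helper, not part of extract.awk

function unescape_at(line,    a, n, i, result)
@{
    n = split(line, a, "@@")
    for (i = 2; i <= n; i++) @{
        if (a[i] == "") @{     # was an @@@@
            a[i] = "@@"
            if (a[i+1] == "")
                i++
        @}
    @}
    result = a[1]
    for (i = 2; i <= n; i++)   # rejoin with no separator
        result = result a[i]
    return result
@}

BEGIN @{
    print unescape_at("BEGIN @@@{ print \"a@@@@b\" @@@}")
@}
@print{} BEGIN @{ print "a@@b" @}
@end example

@noindent
The single @samp{@@} symbols in front of the braces disappear, and the
doubled @samp{@@@@} inside the string becomes one literal @samp{@@}.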
27312
27313An important thing to note is the use of the @samp{>} redirection.
27314Output done with @samp{>} only opens the file once; it stays open and
27315subsequent output is appended to the file
27316(@pxref{Redirection}).
This makes it easy to mix program text and explanatory prose for the same
sample source file (as has been done here!) without any hassle.  The output
files are closed only when all of the input has been processed.
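
To see how @samp{>} and @code{close()} interact, consider the following
sketch.  It is only an illustration: the scratch @value{FN}
@file{/tmp/redir-demo.txt} is an assumption, and the program is not part
of @file{extract.awk}:

@example
BEGIN @{
    out = "/tmp/redir-demo.txt"   # hypothetical scratch file
    print "first"  > out          # opens the file, truncating it
    print "second" > out          # file is already open; output is added
    close(out)
    print "third"  > out          # reopened after close(): truncated again
    close(out)

    # Show what survived; only the line "third" is left
    while ((getline line < out) > 0)
        print line
    close(out)
@}
@end example

After the first @code{close()}, the file holds two lines.  Reopening it
with @samp{>} discards them, so a program that closes an output file and
later writes to it again loses everything written earlier.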
27321
27322When a new @value{FN} is encountered, instead of closing the file,
27323the program saves the name of the current file in @code{filelist}.
27324This makes it possible to interleave the code for more than one file in
27325the Texinfo input file.  (Previous versions of this program @emph{did}
27326close the file. But because of the @samp{>} redirection, a file whose
27327parts were not all one after the other ended up getting clobbered.)
27328An @code{END} rule then closes all the open files when processing
27329is finished:
27330
27331@example
27332@c file eg/prog/extract.awk
27333@group
27334END @{
27335    close(curfile)          # close the last one
27336    for (f in filelist)     # close all the rest
27337        close(f)
27338@}
27339@end group
27340@c endfile
27341@end example
27342
27343Finally, the function @code{@w{unexpected_eof()}} prints an appropriate
27344error message and then exits:
27345
27346@example
27347@c file eg/prog/extract.awk
27348@group
27349function unexpected_eof()
27350@{
27351    printf("extract: %s:%d: unexpected EOF or error\n",
27352                     FILENAME, FNR) > "/dev/stderr"
27353    exit 1
27354@}
27355@end group
27356@c endfile
27357@end example
27358
27359@node Simple Sed
27360@subsection A Simple Stream Editor
27361
27362@cindex @command{sed} utility
27363@cindex stream editors
27364The @command{sed} utility is a @dfn{stream editor}, a program that reads a
27365stream of data, makes changes to it, and passes it on.
27366It is often used to make global changes to a large file or to a stream
27367of data generated by a pipeline of commands.
27368Although @command{sed} is a complicated program in its own right, its most common
27369use is to perform global substitutions in the middle of a pipeline:
27370
27371@example
27372@var{command1} < orig.data | sed 's/old/new/g' | @var{command2} > result
27373@end example
27374
27375Here, @samp{s/old/new/g} tells @command{sed} to look for the regexp
27376@samp{old} on each input line and globally replace it with the text
27377@samp{new} (i.e., all the occurrences on a line).  This is similar to
27378@command{awk}'s @code{gsub()} function
27379(@pxref{String Functions}).
27380
27381The following program, @file{awksed.awk}, accepts at least two command-line
27382arguments: the pattern to look for and the text to replace it with. Any
27383additional arguments are treated as @value{DF} names to process. If none
27384are provided, the standard input is used:
27385
27386@cindex Brennan, Michael
27387@cindex @command{awksed.awk} program
27388@c @cindex simple stream editor
27389@c @cindex stream editor, simple
27390@example
27391@c file eg/prog/awksed.awk
27392# awksed.awk --- do s/foo/bar/g using just print
27393#    Thanks to Michael Brennan for the idea
27394@c endfile
27395@ignore
27396@c file eg/prog/awksed.awk
27397#
27398# Arnold Robbins, arnold@@skeeve.com, Public Domain
27399# August 1995
27400@c endfile
27401@end ignore
27402@c file eg/prog/awksed.awk
27403
27404function usage()
27405@{
27406    print "usage: awksed pat repl [files...]" > "/dev/stderr"
27407    exit 1
27408@}
27409
27410@group
27411BEGIN @{
27412    # validate arguments
27413    if (ARGC < 3)
27414        usage()
27415@end group
27416
27417    RS = ARGV[1]
27418    ORS = ARGV[2]
27419
27420    # don't use arguments as files
27421    ARGV[1] = ARGV[2] = ""
27422@}
27423
27424@group
27425# look ma, no hands!
27426@{
27427    if (RT == "")
27428        printf "%s", $0
27429    else
27430        print
27431@}
27432@end group
27433@c endfile
27434@end example
27435
27436The program relies on @command{gawk}'s ability to have @code{RS} be a regexp,
27437as well as on the setting of @code{RT} to the actual text that terminates the
27438record (@pxref{Records}).
27439
27440The idea is to have @code{RS} be the pattern to look for. @command{gawk}
27441automatically sets @code{$0} to the text between matches of the pattern.
27442This is text that we want to keep, unmodified.  Then, by setting @code{ORS}
27443to the replacement text, a simple @code{print} statement outputs the
27444text we want to keep, followed by the replacement text.
27445
27446There is one wrinkle to this scheme, which is what to do if the last record
27447doesn't end with text that matches @code{RS}.  Using a @code{print}
27448statement unconditionally prints the replacement text, which is not correct.
27449However, if the file did not end in text that matches @code{RS}, @code{RT}
27450is set to the null string.  In this case, we can print @code{$0} using
27451@code{printf}
27452(@pxref{Printf}).
27453
27454The @code{BEGIN} rule handles the setup, checking for the right number
27455of arguments and calling @code{usage()} if there is a problem. Then it sets
27456@code{RS} and @code{ORS} from the command-line arguments and sets
27457@code{ARGV[1]} and @code{ARGV[2]} to the null string, so that they are
27458not treated as @value{FN}s
27459(@pxref{ARGC and ARGV}).
27460
27461The @code{usage()} function prints an error message and exits.
27462Finally, the single rule handles the printing scheme outlined earlier,
27463using @code{print} or @code{printf} as appropriate, depending upon the
27464value of @code{RT}.
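
As an illustration, here is a sample run.  It assumes that the program
has been saved as @file{awksed.awk} in the current directory; the input
text is made up for the example:

@example
$ @kbd{echo 'the old man and the old sea' | gawk -f awksed.awk old new}
@print{} the new man and the new sea
@end example

The input is read as three records.  The first two are terminated
by @samp{old} and are printed followed by @samp{new}.  The final record,
@samp{ sea} plus the newline, has an empty @code{RT} and is printed as is.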
27465
27466@node Igawk Program
27467@subsection An Easy Way to Use Library Functions
27468
27469@cindex libraries of @command{awk} functions @subentry example program for using
27470@cindex functions @subentry library @subentry example program for using
27471In @ref{Include Files}, we saw how @command{gawk} provides a built-in
27472file-inclusion capability.  However, this is a @command{gawk} extension.
27473This @value{SECTION} provides the motivation for making file inclusion
27474available for standard @command{awk}, and shows how to do it using a
27475combination of shell and @command{awk} programming.
27476
27477Using library functions in @command{awk} can be very beneficial. It
27478encourages code reuse and the writing of general functions. Programs are
27479smaller and therefore clearer.
27480However, using library functions is only easy when writing @command{awk}
27481programs; it is painful when running them, requiring multiple @option{-f}
27482options.  If @command{gawk} is unavailable, then so too is the @env{AWKPATH}
27483environment variable and the ability to put @command{awk} functions into a
27484library directory (@pxref{Options}).
27485It would be nice to be able to write programs in the following manner:
27486
27487@example
27488# library functions
27489@@include getopt.awk
27490@@include join.awk
27491@dots{}
27492
27493# main program
27494BEGIN @{
27495    while ((c = getopt(ARGC, ARGV, "a:b:cde")) != -1)
27496        @dots{}
27497    @dots{}
27498@}
27499@end example
27500
27501The following program, @file{igawk.sh}, provides this service.
27502It simulates @command{gawk}'s searching of the @env{AWKPATH} variable
27503and also allows @dfn{nested} includes (i.e., a file that is included
27504with @code{@@include} can contain further @code{@@include} statements).
27505@command{igawk} makes an effort to only include files once, so that nested
27506includes don't accidentally include a library function twice.
27507
27508@command{igawk} should behave just like @command{gawk} externally.  This
27509means it should accept all of @command{gawk}'s command-line arguments,
27510including the ability to have multiple source files specified via
27511@option{-f} and the ability to mix command-line and library source files.
27512
27513The program is written using the POSIX Shell (@command{sh}) command
27514language.@footnote{Fully explaining the @command{sh} language is beyond
27515the scope of this book. We provide some minimal explanations, but see
27516a good shell programming book if you wish to understand things in more
27517depth.} It works as follows:
27518
27519@enumerate
27520@item
27521Loop through the arguments, saving anything that doesn't represent
27522@command{awk} source code for later, when the expanded program is run.
27523
27524@item
27525For any arguments that do represent @command{awk} text, put the arguments into
27526a shell variable that will be expanded.  There are two cases:
27527
27528@enumerate a
27529@item
27530Literal text, provided with @option{-e} or @option{--source}.  This
27531text is just appended directly.
27532
27533@item
27534Source @value{FN}s, provided with @option{-f}.  We use a neat trick and
27535append @samp{@@include @var{filename}} to the shell variable's contents.
27536Because the file-inclusion program works the way @command{gawk} does, this
27537gets the text of the file included in the program at the correct point.
27538@end enumerate
27539
27540@item
27541Run an @command{awk} program (naturally) over the shell variable's contents to expand
27542@code{@@include} statements.  The expanded program is placed in a second
27543shell variable.
27544
27545@item
27546Run the expanded program with @command{gawk} and any other original command-line
27547arguments that the user supplied (such as the @value{DF} names).
27548@end enumerate
27549
27550This program uses shell variables extensively: for storing command-line arguments and
27551the text of the @command{awk} program that will expand the user's program, for the
27552user's original program, and for the expanded program.  Doing so removes some
27553potential problems that might arise were we to use temporary files instead,
27554at the cost of making the script somewhat more complicated.
27555
27556The initial part of the program turns on shell tracing if the first
27557argument is @samp{debug}.
27558
27559The next part loops through all the command-line arguments.
27560There are several cases of interest:
27561
27562@c @asis for docbook
27563@table @asis
27564@item @option{--}
27565This ends the arguments to @command{igawk}.  Anything else should be passed on
27566to the user's @command{awk} program without being evaluated.
27567
27568@item @option{-W}
This indicates that the next option is specific to @command{gawk}.  To make
argument processing easier, the @option{-W} is prepended to the
remaining arguments and the loop continues.  (This is an @command{sh}
27572programming trick.  Don't worry about it if you are not familiar with
27573@command{sh}.)
27574
27575@item @option{-v}, @option{-F}
27576These are saved and passed on to @command{gawk}.
27577
27578@item @option{-f}, @option{--file}, @option{--file=}, @option{-Wfile=}
27579The @value{FN} is appended to the shell variable @code{program} with an
27580@code{@@include} statement.
27581The @command{expr} utility is used to remove the leading option part of the
27582argument (e.g., @samp{--file=}).
27583(Typical @command{sh} usage would be to use the @command{echo} and @command{sed}
27584utilities to do this work.  Unfortunately, some versions of @command{echo} evaluate
27585escape sequences in their arguments, possibly mangling the program text.
27586Using @command{expr} avoids this problem.)
27587
27588@item @option{--source}, @option{--source=}, @option{-Wsource=}
27589The source text is appended to @code{program}.
27590
27591@item @option{--version}, @option{-Wversion}
27592@command{igawk} prints its version number, runs @samp{gawk --version}
27593to get the @command{gawk} version information, and then exits.
27594@end table
27595
27596If none of the @option{-f}, @option{--file}, @option{-Wfile}, @option{--source},
27597or @option{-Wsource} arguments are supplied, then the first nonoption argument
27598should be the @command{awk} program.  If there are no command-line
27599arguments left, @command{igawk} prints an error message and exits.
27600Otherwise, the first argument is appended to @code{program}.
27601In any case, after the arguments have been processed,
27602the shell variable
27603@code{program} contains the complete text of the original @command{awk}
27604program.
27605
27606The program is as follows:
27607
27608@cindex @code{igawk.sh} program
27609@example
27610@c file eg/prog/igawk.sh
27611#! /bin/sh
27612# igawk --- like gawk but do @@include processing
27613@c endfile
27614@ignore
27615@c file eg/prog/igawk.sh
27616#
27617# Arnold Robbins, arnold@@skeeve.com, Public Domain
27618# July 1993
27619# December 2010, minor edits
27620@c endfile
27621@end ignore
27622@c file eg/prog/igawk.sh
27623
27624if [ "$1" = debug ]
27625then
27626    set -x
27627    shift
27628fi
27629
27630# A literal newline, so that program text is formatted correctly
27631n='
27632'
27633
27634# Initialize variables to empty
27635program=
27636opts=
27637
27638while [ $# -ne 0 ] # loop over arguments
27639do
27640    case $1 in
27641    --)     shift
27642            break ;;
27643
27644    -W)     shift
            # The $@{x?'message here'@} construct prints a
            # diagnostic if $x is not set
27647            set -- -W"$@{@@?'missing operand'@}"
27648            continue ;;
27649
27650    -[vF])  opts="$opts $1 '$@{2?'missing operand'@}'"
27651            shift ;;
27652
27653    -[vF]*) opts="$opts '$1'" ;;
27654
27655    -f)     program="$program$n@@include $@{2?'missing operand'@}"
27656            shift ;;
27657
27658    -f*)    f=$(expr "$1" : '-f\(.*\)')
27659            program="$program$n@@include $f" ;;
27660
27661    -[W-]file=*)
27662            f=$(expr "$1" : '-.file=\(.*\)')
27663            program="$program$n@@include $f" ;;
27664
27665    -[W-]file)
27666            program="$program$n@@include $@{2?'missing operand'@}"
27667            shift ;;
27668
27669    -[W-]source=*)
27670            t=$(expr "$1" : '-.source=\(.*\)')
27671            program="$program$n$t" ;;
27672
27673    -[W-]source)
27674            program="$program$n$@{2?'missing operand'@}"
27675            shift ;;
27676
27677    -[W-]version)
27678            echo igawk: version 3.0 1>&2
27679            gawk --version
27680            exit 0 ;;
27681
27682    -[W-]*) opts="$opts '$1'" ;;
27683
27684    *)      break ;;
27685    esac
27686    shift
27687done
27688
27689if [ -z "$program" ]
27690then
27691     program=$@{1?'missing program'@}
27692     shift
27693fi
27694
27695# At this point, `program' has the program.
27696@c endfile
27697@end example
27698
27699The @command{awk} program to process @code{@@include} directives
27700is stored in the shell variable @code{expand_prog}.  Doing this keeps
27701the shell script readable.  The @command{awk} program
27702reads through the user's program, one line at a time, using @code{getline}
27703(@pxref{Getline}).  The input
27704@value{FN}s and @code{@@include} statements are managed using a stack.
27705As each @code{@@include} is encountered, the current @value{FN} is
27706``pushed'' onto the stack and the file named in the @code{@@include}
27707directive becomes the current @value{FN}.  As each file is finished,
27708the stack is ``popped,'' and the previous input file becomes the current
27709input file again.  The process is started by making the original file
27710the first one on the stack.
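
The stack itself needs nothing more than an array and an integer index.
The following sketch shows the push and pop operations on their own.
The @value{FN}s are made up for the illustration, and no files are
actually read:

@example
BEGIN @{
    stackptr = 0
    input[stackptr] = "main.awk"        # the original file

    input[++stackptr] = "getopt.awk"    # push: an @@include was seen
    input[++stackptr] = "join.awk"      # push: a nested @@include

    for (; stackptr >= 0; stackptr--)   # pop as each file is finished
        print "finished", input[stackptr]
@}
@end example

The names are printed in the reverse of the order in which they were
pushed: @file{join.awk}, then @file{getopt.awk}, and finally @file{main.awk}.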
27711
27712The @code{pathto()} function does the work of finding the full path to
27713a file.  It simulates @command{gawk}'s behavior when searching the
27714@env{AWKPATH} environment variable
27715(@pxref{AWKPATH Variable}).
27716If a @value{FN} has a @samp{/} in it, no path search is done.
27717Similarly, if the @value{FN} is @code{"-"}, then that string is
27718used as-is.  Otherwise,
27719the @value{FN} is concatenated with the name of each directory in
27720the path, and an attempt is made to open the generated @value{FN}.
27721The only way to test if a file can be read in @command{awk} is to go
27722ahead and try to read it with @code{getline}; this is what @code{pathto()}
27723does.@footnote{On some very old versions of @command{awk}, the test
27724@samp{getline junk < t} can loop forever if the file exists but is empty.}
27725If the file can be read, it is closed and the @value{FN}
27726is returned:
27727
27728@ignore
27729An alternative way to test for the file's existence would be to call
27730@samp{system("test -r " t)}, which uses the @command{test} utility to
27731see if the file exists and is readable.  The disadvantage to this method
27732is that it requires creating an extra process and can thus be slightly
27733slower.
27734@end ignore
27735
27736@example
27737@c file eg/prog/igawk.sh
27738expand_prog='
27739
27740function pathto(file,    i, t, junk)
27741@{
27742    if (index(file, "/") != 0)
27743        return file
27744
27745    if (file == "-")
27746        return file
27747
27748    for (i = 1; i <= ndirs; i++) @{
27749        t = (pathlist[i] "/" file)
27750@group
27751        if ((getline junk < t) > 0) @{
27752            # found it
27753            close(t)
27754            return t
27755        @}
27756@end group
27757    @}
27758    return ""
27759@}
27760@c endfile
27761@end example
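
The same technique can be packaged as a standalone test.  In this
sketch, the function name @code{readable()} and the @value{FN} being
tested are assumptions made for the illustration:

@example
# readable --- hypothetical helper: can at least one record be read?

function readable(file,    junk, ok)
@{
    ok = ((getline junk < file) > 0)
    close(file)
    return ok
@}

BEGIN @{
    f = "/etc/passwd"    # any file name will do
    print f, (readable(f) ? "is readable" : "cannot be read or is empty")
@}
@end example

Because @code{getline} returns zero at end of file, a file that exists
but is empty fails this test, just as it would in @code{pathto()}.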
27762
27763The main program is contained inside one @code{BEGIN} rule.  The first thing it
27764does is set up the @code{pathlist} array that @code{pathto()} uses.  After
27765splitting the path on @samp{:}, null elements are replaced with @code{"."},
27766which represents the current directory:
27767
27768@example
27769@c file eg/prog/igawk.sh
27770BEGIN @{
27771    path = ENVIRON["AWKPATH"]
27772    ndirs = split(path, pathlist, ":")
27773    for (i = 1; i <= ndirs; i++) @{
27774        if (pathlist[i] == "")
27775            pathlist[i] = "."
27776    @}
27777@c endfile
27778@end example
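
For example, the following sketch applies the same steps to a made-up
search path that contains an empty element (the value of @code{path} is
an assumption chosen for the illustration):

@example
BEGIN @{
    path = "/usr/local/share/awk::/usr/share/awk"   # hypothetical value
    ndirs = split(path, pathlist, ":")
    for (i = 1; i <= ndirs; i++) @{
        if (pathlist[i] == "")
            pathlist[i] = "."
        print i, pathlist[i]
    @}
@}
@end example

This prints the three directories @file{/usr/local/share/awk}, @file{.},
and @file{/usr/share/awk}, each preceded by its index.  If the path is
the null string, @code{split()} returns zero, and so there are no
directories to search.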
27779
27780The stack is initialized with @code{ARGV[1]}, which will be @code{"/dev/stdin"}.
27781The main loop comes next.  Input lines are read in succession. Lines that
27782do not start with @code{@@include} are printed verbatim.
27783If the line does start with @code{@@include}, the @value{FN} is in @code{$2}.
27784@code{pathto()} is called to generate the full path.  If it cannot, then the program
27785prints an error message and continues.
27786
The next thing to check is whether the file has already been included.  The
27788@code{processed} array is indexed by the full @value{FN} of each included
27789file and it tracks this information for us.  If the file is
27790seen again, a warning message is printed. Otherwise, the new @value{FN} is
27791pushed onto the stack and processing continues.
27792
27793Finally, when @code{getline} encounters the end of the input file, the file
27794is closed and the stack is popped.  When @code{stackptr} is less than zero,
27795the program is done:
27796
27797@example
27798@c file eg/prog/igawk.sh
27799    stackptr = 0
27800    input[stackptr] = ARGV[1] # ARGV[1] is first file
27801
27802    for (; stackptr >= 0; stackptr--) @{
27803        while ((getline < input[stackptr]) > 0) @{
27804            if (tolower($1) != "@@include") @{
27805                print
27806                continue
27807            @}
27808            fpath = pathto($2)
27809            if (fpath == "") @{
27810                printf("igawk: %s:%d: cannot find %s\n",
27811                    input[stackptr], FNR, $2) > "/dev/stderr"
27812                continue
27813            @}
27814            if (! (fpath in processed)) @{
27815                processed[fpath] = input[stackptr]
27816                input[++stackptr] = fpath  # push onto stack
27817            @} else
27818                print $2, "included in", input[stackptr],
27819                    "already included in",
27820                    processed[fpath] > "/dev/stderr"
27821        @}
27822        close(input[stackptr])
27823    @}
27824@}'  # close quote ends `expand_prog' variable
27825
27826processed_program=$(gawk -- "$expand_prog" /dev/stdin << EOF
27827$program
27828EOF
27829)
27830@c endfile
27831@end example
27832
27833The shell construct @samp{@var{command} << @var{marker}} is called
27834a @dfn{here document}.  Everything in the shell script up to the
27835@var{marker} is fed to @var{command} as input.  The shell processes
27836the contents of the here document for variable and command substitution
27837(and possibly other things as well, depending upon the shell).
27838
27839The shell construct @samp{$(@dots{})} is called @dfn{command substitution}.
27840The output of the command inside the parentheses is substituted
27841into the command line.
27842Because the result is used in a variable assignment,
27843it is saved as a single string, even if the results contain whitespace.
27844
27845The expanded program is saved in the variable @code{processed_program}.
27846It's done in these steps:
27847
27848@enumerate
27849@item
27850Run @command{gawk} with the @code{@@include}-processing program (the
27851value of the @code{expand_prog} shell variable) reading standard input.
27852
27853@item
27854Standard input is the contents of the user's program,
27855from the shell variable @code{program}.
27856Feed its contents to @command{gawk} via a here document.
27857
27858@item
27859Save the results of this processing in the shell variable
27860@code{processed_program} by using command substitution.
27861@end enumerate
27862
27863The last step is to call @command{gawk} with the expanded program,
27864along with the original
27865options and command-line arguments that the user supplied:
27866
27867@example
27868@c file eg/prog/igawk.sh
27869eval gawk $opts -- '"$processed_program"' '"$@@"'
27870@c endfile
27871@end example
27872
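The @command{eval} command is a shell construct that reruns the shell's parsing
process.  This keeps things properly quoted: the single quotes saved in
@code{opts} are interpreted during the second parse, so that each saved
option value reaches @command{gawk} as a single argument, and
@samp{"$processed_program"} and @samp{"$@@"} are not expanded until
that second parse.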
27875
27876This version of @command{igawk} represents the fifth version of this program.
27877There are four key simplifications that make the program work better:
27878
27879@itemize @value{BULLET}
27880@item
27881Using @code{@@include} even for the files named with @option{-f} makes building
27882the initial collected @command{awk} program much simpler; all the
27883@code{@@include} processing can be done once.
27884
27885@item
27886Not trying to save the line read with @code{getline}
27887in the @code{pathto()} function when testing for the
27888file's accessibility for use with the main program simplifies things
27889considerably.
27890
27891@item
27892Using a @code{getline} loop in the @code{BEGIN} rule does it all in one
27893place.  It is not necessary to call out to a separate loop for processing
27894nested @code{@@include} statements.
27895
27896@item
27897Instead of saving the expanded program in a temporary file, putting it in a shell variable
27898avoids some potential security problems.
27899This has the disadvantage that the script relies upon more features
27900of the @command{sh} language, making it harder to follow for those who
27901aren't familiar with @command{sh}.
27902@end itemize
27903
Also, this program illustrates that it is often worthwhile to combine
@command{sh} and @command{awk} programming.  You can usually
27906accomplish quite a lot, without having to resort to low-level programming
27907in C or C++, and it is frequently easier to do certain kinds of string
27908and argument manipulation using the shell than it is in @command{awk}.
27909
27910Finally, @command{igawk} shows that it is not always necessary to add new
27911features to a program; they can often be layered on top.@footnote{@command{gawk}
27912does @code{@@include} processing itself in order to support the use
27913of @command{awk} programs as Web CGI scripts.}
27914
27915
27916@node Anagram Program
27917@subsection Finding Anagrams from a Dictionary
27918
27919@cindex anagrams, finding
27920An interesting programming challenge is to
27921search for @dfn{anagrams} in a
27922word list (such as
27923@file{/usr/share/dict/words} on many GNU/Linux systems).
27924One word is an anagram of another if both words contain
27925the same letters
27926(e.g., ``babbling'' and ``blabbing'').
27927
27928Column 2, Problem C, of Jon Bentley's @cite{Programming Pearls}, Second
27929Edition, presents an elegant algorithm.  The idea is to give words that
27930are anagrams a common signature, sort all the words together by their
27931signatures, and then print them.  Dr.@: Bentley observes that taking the
27932letters in each word and sorting them produces those common signatures.
27933
27934The following program uses arrays of arrays to bring together
27935words with the same signature and array sorting to print the words
27936in sorted order:
27937
27938@cindex @file{anagram.awk} program
27939@example
27940@c file eg/prog/anagram.awk
27941# anagram.awk --- An implementation of the anagram-finding algorithm
27942#                 from Jon Bentley's "Programming Pearls," 2nd edition.
27943#                 Addison Wesley, 2000, ISBN 0-201-65788-0.
27944#                 Column 2, Problem C, section 2.8, pp 18-20.
27945@c endfile
27946@ignore
27947@c file eg/prog/anagram.awk
27948#
27949# This program requires gawk 4.0 or newer.
27950# Required gawk-specific features:
27951#   - True multidimensional arrays
27952#   - split() with "" as separator splits out individual characters
27953#   - asort() and asorti() functions
27954#
27955# See https://savannah.gnu.org/projects/gawk.
27956#
27957# Arnold Robbins
27958# arnold@@skeeve.com
27959# Public Domain
27960# January, 2011
27961@c endfile
27962@end ignore
27963@c file eg/prog/anagram.awk
27964
27965/'s$/   @{ next @}        # Skip possessives
27966@c endfile
27967@end example
27968
27969The program starts with a header, and then a rule to skip
27970possessives in the dictionary file. The next rule builds
27971up the data structure. The first dimension of the array
27972is indexed by the signature; the second dimension is the word
27973itself:
27974
27975@example
27976@c file eg/prog/anagram.awk
27977@{
27978    key = word2key($1)  # Build signature
27979    data[key][$1] = $1  # Store word with signature
27980@}
27981@c endfile
27982@end example
27983
27984The @code{word2key()} function creates the signature.
27985It splits the word apart into individual letters,
27986sorts the letters, and then joins them back together:
27987
27988@example
27989@c file eg/prog/anagram.awk
27990# word2key --- split word apart into letters, sort, and join back together
27991
27992function word2key(word,     a, i, n, result)
27993@{
27994    n = split(word, a, "")
27995    asort(a)
27996
27997    for (i = 1; i <= n; i++)
27998        result = result a[i]
27999
28000    return result
28001@}
28002@c endfile
28003@end example
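
A quick way to check that two words share a signature is to call the
function directly.  This sketch repeats @code{word2key()} so that it can
be run by itself with @command{gawk}; the @code{BEGIN} rule is added
only for the illustration:

@example
function word2key(word,     a, i, n, result)
@{
    n = split(word, a, "")    # one array element per letter
    asort(a)                  # sort the letters

    for (i = 1; i <= n; i++)
        result = result a[i]

    return result
@}

BEGIN @{
    print word2key("babbling")    # prints abbbgiln
    print word2key("blabbing")    # prints abbbgiln
@}
@end example

Both words produce the same signature, so the main program stores
them under the same index in the first dimension of @code{data}.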
28004
28005Finally, the @code{END} rule traverses the array
28006and prints out the anagram lists.  It sends the output
28007to the system @command{sort} command because otherwise
28008the anagrams would appear in arbitrary order:
28009
28010@example
28011@c file eg/prog/anagram.awk
28012END @{
28013    sort = "sort"
28014    for (key in data) @{
28015        # Sort words with same key
28016        nwords = asorti(data[key], words)
28017        if (nwords == 1)
28018            continue
28019
28020        # And print. Minor glitch: trailing space at end of each line
28021        for (j = 1; j <= nwords; j++)
28022            printf("%s ", words[j]) | sort
28023        print "" | sort
28024    @}
28025    close(sort)
28026@}
28027@c endfile
28028@end example
28029
28030Here is some partial output when the program is run:
28031
28032@example
28033$ @kbd{gawk -f anagram.awk /usr/share/dict/words | grep '^b'}
28034@dots{}
28035babbled blabbed
28036babbler blabber brabble
28037babblers blabbers brabbles
28038babbling blabbing
28039babbly blabby
28040babel bable
28041babels beslab
28042babery yabber
28043@dots{}
28044@end example
28045
28046
28047@node Signature Program
28048@subsection And Now for Something Completely Different
28049
28050@cindex signature program
28051@cindex Brini, Davide
28052The following program was written by Davide Brini
28053@c (@email{dave_br@@gmx.com})
28054and is published on @uref{http://backreference.org/2011/02/03/obfuscated-awk/,
28055his website}.
28056It serves as his signature in the Usenet group @code{comp.lang.awk}.
28057He supplies the following copyright terms:
28058
28059@quotation
28060Copyright @copyright{} 2008 Davide Brini
28061
28062Copying and distribution of the code published in this page, with or without
28063modification, are permitted in any medium without royalty provided the copyright
28064notice and this notice are preserved.
28065@end quotation
28066
28067Here is the program:
28068
28069@example
28070@group
28071awk 'BEGIN@{O="~"~"~";o="=="=="==";o+=+o;x=O""O;while(X++<=x+o+o)c=c"%c";
28072printf c,(x-O)*(x-O),x*(x-o)-o,x*(x-O)+x-O-o,+x*(x-O)-x+o,X*(o*o+O)+x-O,
28073X*(X-x)-o*o,(x+X)*o*o+o,x*(X-x)-O-O,x-O+(O+o+X+x)*(o+O),X*X-X*(x-O)-x+O,
28074O+X*(o*(o+O)+O),+x+O+X*o,x*(x-o),(o+X+x)*o*o-(x-O-O),O+(X-x)*(X+O),x-O@}'
28075@end group
28076@end example
28077
28078@cindex Johansen, Chris
28079We leave it to you to determine what the program does.  (If you are
28080truly desperate to understand it, see Chris Johansen's explanation,
28081which is embedded in the Texinfo source file for this @value{DOCUMENT}.)
28082
28083@ignore
28084To: "Arnold Robbins" <arnold@skeeve.com>
28085Date: Sat, 20 Aug 2011 13:50:46 -0400
28086Subject: The GNU Awk User's Guide, Section 13.3.11
28087From: "Chris Johansen" <johansen@main.nc.us>
28088Message-ID: <op.v0iw6wlv7finx3@asusodin.thrudvang.lan>
28089
28090Arnold, you don't know me, but we have a tenuous connection.  My wife is
28091Barbara A. Field, FAIA, GIT '65 (B. Arch.).
28092
28093I have had a couple of paper copies of "Effective Awk Programming" for
28094years, and now I'm going through a Kindle version of "The GNU Awk User's
28095Guide" again.  When I got to section 13.3.11, I reformatted and lightly
28096commented Davide Brin's signature script to understand its workings.
28097
28098It occurs to me that this might have pedagogical value as an example
28099(although imperfect) of the value of whitespace and comments, and a
28100starting point for that discussion.  It certainly helped _me_ understand
28101what's going on.  You are welcome to it, as-is or modified (subject to
28102Davide's constraints, of course, which I think I have met).
28103
28104If I were to include it in a future edition, I would put it at some
28105distance from section 13.3.11, say, as a note or an appendix, so as not to
28106be a "spoiler" to the puzzle.
28107
28108Best regards,
28109--
28110Chris Johansen {johansen at main dot nc dot us}
28111  . . . collapsing the probability wave function, sending ripples of
28112certainty through the space-time continuum.
28113
28114
28115#! /usr/bin/gawk -f
28116
28117# From "13.3.11 And Now For Something Completely Different"
28118#   https://www.gnu.org/software/gawk/manual/html_node/Signature-Program.html#Signature-Program
28119
28120# Copyright @copyright{} 2008 Davide Brini
28121
28122# Copying and distribution of the code published in this page, with
28123# or without modification, are permitted in any medium without
28124# royalty provided the copyright notice and this notice are preserved.
28125
28126BEGIN {
28127  O = "~" ~ "~";    #  1
28128  o = "==" == "=="; #  1
28129  o += +o;          #  2
28130  x = O "" O;       # 11
28131
28132
28133  while ( X++ <= x + o + o ) c = c "%c";
28134
28135  # O is  1
28136  # o is  2
28137  # x is 11
28138  # X is 17
28139  # c is "%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c"
28140
28141  printf c,
28142    ( x - O )*( x - O),                  # 100 d
28143    x*( x - o ) - o,                     #  97 a
28144    x*( x - O ) + x - O - o,             # 118 v
28145    +x*( x - O ) - x + o,                # 101 e
28146    X*( o*o + O ) + x - O,               #  95 _
28147    X*( X - x ) - o*o,                   #  98 b
28148    ( x + X )*o*o + o,                   # 114 r
28149    x*( X - x ) - O - O,                 #  64 @
28150    x - O + ( O + o + X + x )*( o + O ), # 103 g
28151    X*X - X*( x - O ) - x + O,           # 109 m
28152    O + X*( o*( o + O ) + O ),           # 120 x
28153    +x + O + X*o,                        #  46 .
28154    x*( x - o),                          #  99 c
    ( o + X + x )*o*o - ( x - O - O ),   # 111 o
28156    O + ( X - x )*( X + O ),             # 109 m
28157    x - O                                #  10 \n
28158}
28159@end ignore
28160
28161@node Programs Summary
28162@section Summary
28163
28164@itemize @value{BULLET}
28165@item
28166The programs provided in this @value{CHAPTER}
28167continue on the theme that reading programs is an excellent way to learn
28168Good Programming.
28169
28170@item
28171Using @samp{#!} to make @command{awk} programs directly runnable makes
28172them easier to use.  Otherwise, invoke the program using @samp{awk
28173-f @dots{}}.
28174
28175@item
28176Reimplementing standard POSIX programs in @command{awk} is a pleasant
28177exercise; @command{awk}'s expressive power lets you write such programs
28178in relatively few lines of code, yet they are functionally complete
28179and usable.
28180
28181@item
28182One of standard @command{awk}'s weaknesses is working with individual
28183characters.  The ability to use @code{split()} with the empty string as
28184the separator can considerably simplify such tasks.
28185
28186@item
28187The examples here demonstrate the usefulness of the library
28188functions from @ref{Library Functions}
28189for a number of real (if small) programs.
28190
28191@item
Besides reinventing POSIX wheels, other programs solve a selection of
28193interesting problems, such as finding duplicate words in text, printing
28194mailing labels, and finding anagrams.
28195
28196@end itemize
28197
28198@c EXCLUDE START
28199@node Programs Exercises
28200@section Exercises
28201
28202@enumerate
28203@item
28204Rewrite @file{cut.awk} (@pxref{Cut Program})
28205using @code{split()} with @code{""} as the separator.
28206
28207@item
28208In @ref{Egrep Program}, we mentioned that @samp{egrep -i} could be
28209simulated in versions of @command{awk} without @code{IGNORECASE} by
28210using @code{tolower()} on the line and the pattern. In a footnote there,
28211we also mentioned that this solution has a bug: the translated line is
28212output, and not the original one.  Fix this problem.
28213@c Exercise: Fix this, w/array and new line as key to original line
28214
28215@item
28216The POSIX version of @command{id} takes options that control which
28217information is printed.  Modify the @command{awk} version
28218(@pxref{Id Program}) to accept the same arguments and perform in the
28219same way.
28220
28221@item
28222The @file{split.awk} program (@pxref{Split Program}) assumes
28223that letters are contiguous in the character set,
28224which isn't true for EBCDIC systems.
28225Fix this problem.
28226(Hint: Consider a different way to work through the alphabet,
28227without relying on @code{ord()} and @code{chr()}.)
28228
28229@item
28230@cindex Kernighan, Brian @subentry quotes
In @file{uniq.awk} (@pxref{Uniq Program}), the
28232logic for choosing which lines to print represents a @dfn{state
28233machine}, which is ``a device which can be in one of a set number of stable
28234conditions depending on its previous condition and on the present values
28235of its inputs.''@footnote{This definition is from
28236@uref{https://www.lexico.com/en/definition/state_machine}.}
28237Brian Kernighan suggests that
28238``an alternative approach to state machines is to just read
28239the input into an array, then use indexing.  It's almost always
28240easier code, and for most inputs where you would use this, just
28241as fast.''  Rewrite the logic to follow this
28242suggestion.
28243
28244
28245@item
28246Why can't the @file{wc.awk} program (@pxref{Wc Program}) just
28247use the value of @code{FNR} in @code{endfile()}?
28248Hint: Examine the code in @ref{Filetrans Function}.
28249
28250@ignore
28251@command{wc} can't just use the value of @code{FNR} in
28252@code{endfile()}. If you examine the code in @ref{Filetrans Function},
28253you will see that @code{FNR} has already been reset by the time
28254@code{endfile()} is called.
28255@end ignore
28256
28257@item
28258Manipulation of individual characters in the @command{translate} program
28259(@pxref{Translate Program}) is painful using standard @command{awk}
28260functions.  Given that @command{gawk} can split strings into individual
28261characters using @code{""} as the separator, how might you use this
28262feature to simplify the program?
28263
28264@item
28265The @file{extract.awk} program (@pxref{Extract Program}) was written
28266before @command{gawk} had the @code{gensub()} function.  Use it
28267to simplify the code.
28268
28269@item
28270Compare the performance of the @file{awksed.awk} program
28271(@pxref{Simple Sed}) with the more straightforward:
28272
28273@example
28274BEGIN @{
28275    pat = ARGV[1]
28276    repl = ARGV[2]
28277    ARGV[1] = ARGV[2] = ""
28278@}
28279
28280@{ gsub(pat, repl); print @}
28281@end example
28282
28283@item
28284What are the advantages and disadvantages of @file{awksed.awk} versus
28285the real @command{sed} utility?
28286
28287@ignore
28288  Advantage: egrep regexps
28289             speed (?)
28290  Disadvantage: no & in replacement text
28291
28292Others?
28293@end ignore
28294
28295@item
28296In @ref{Igawk Program}, we mentioned that not trying to save the line
28297read with @code{getline} in the @code{pathto()} function when testing
28298for the file's accessibility for use with the main program simplifies
things considerably.  What problem does this engender, though?
28300@c answer, reading from "-" or /dev/stdin
28301
28302@cindex search paths
28303@cindex search paths @subentry for source files
28304@cindex source files, search path for
28305@cindex files @subentry source, search path for
28306@cindex directories @subentry searching @subentry for source files
28307@item
28308As an additional example of the idea that it is not always necessary to
28309add new features to a program, consider the idea of having two files in
28310a directory in the search path:
28311
28312@table @file
28313@item default.awk
28314This file contains a set of default library functions, such
28315as @code{getopt()} and @code{assert()}.
28316
28317@item site.awk
28318This file contains library functions that are specific to a site or
28319installation; i.e., locally developed functions.
28320Having a separate file allows @file{default.awk} to change with
28321new @command{gawk} releases, without requiring the system administrator to
28322update it each time by adding the local functions.
28323@end table
28324
28325One user
28326@c Karl Berry, karl@ileaf.com, 10/95
28327suggested that @command{gawk} be modified to automatically read these files
28328upon startup.  Instead, it would be very simple to modify @command{igawk}
28329to do this. Since @command{igawk} can process nested @code{@@include}
28330directives, @file{default.awk} could simply contain @code{@@include}
28331statements for the desired library functions.
28332Make this change.
28333
28334@item
Modify @file{anagram.awk} (@pxref{Anagram Program}) to avoid
28336the use of the external @command{sort} utility.
28337
28338@end enumerate
28339@c EXCLUDE END
28340
28341@ifnotinfo
28342@part @value{PART3}Moving Beyond Standard @command{awk} with @command{gawk}
28343@end ifnotinfo
28344
28345@ifdocbook
28346Part III focuses on features specific to @command{gawk}.
28347It contains the following chapters:
28348
28349@itemize @value{BULLET}
28350@item
28351@ref{Namespaces}
28352
28353@item
28354@ref{Advanced Features}
28355
28356@item
28357@ref{Internationalization}
28358
28359@item
28360@ref{Debugger}
28361
28362@item
28363@ref{Arbitrary Precision Arithmetic}
28364
28365@item
28366@ref{Dynamic Extensions}
28367@end itemize
28368@end ifdocbook
28369
28370@node Advanced Features
28371@chapter Advanced Features of @command{gawk}
28372@cindex @command{gawk} @subentry features @subentry advanced
28373@cindex advanced features @subentry @command{gawk}
28374@ignore
28375Contributed by: Peter Langston <pud!psl@bellcore.bellcore.com>
28376
28377    Found in Steve English's "signature" line:
28378
28379"Write documentation as if whoever reads it is a violent psychopath
28380who knows where you live."
28381@end ignore
28382@cindex Langston, Peter
28383@cindex English, Steve
28384@quotation
28385@i{Write documentation as if whoever reads it is
28386a violent psychopath who knows where you live.}
28387@author Steve English, as quoted by Peter Langston
28388@end quotation
28389
28390This @value{CHAPTER} discusses advanced features in @command{gawk}.
28391It's a bit of a ``grab bag'' of items that are otherwise unrelated
28392to each other.
28393First, we look at a command-line option that allows @command{gawk} to recognize
28394nondecimal numbers in input data, not just in @command{awk}
28395programs.
28396Then, @command{gawk}'s special features for sorting arrays are presented.
28397Next, two-way I/O, discussed briefly in earlier parts of this
28398@value{DOCUMENT}, is described in full detail, along with the basics
28399of TCP/IP networking.  Finally, we see how @command{gawk}
28400can @dfn{profile} an @command{awk} program, making it possible to tune
28401it for performance.
28402
28403@c FULLXREF ON
28404Additional advanced features are discussed in separate @value{CHAPTER}s of their
28405own:
28406
28407@itemize @value{BULLET}
28408@item
28409@ref{Internationalization}, discusses how to internationalize
28410your @command{awk} programs, so that they can speak multiple
28411national languages.
28412
28413@item
28414@ref{Debugger}, describes @command{gawk}'s built-in command-line
28415debugger for debugging @command{awk} programs.
28416
28417@item
28418@ref{Arbitrary Precision Arithmetic}, describes how you can use
28419@command{gawk} to perform arbitrary-precision arithmetic.
28420
28421@item
28422@ref{Dynamic Extensions},
28423discusses the ability to dynamically add new built-in functions to
28424@command{gawk}.
28425@end itemize
28426@c FULLXREF OFF
28427
28428@menu
28429* Nondecimal Data::             Allowing nondecimal input data.
28430* Array Sorting::               Facilities for controlling array traversal and
28431                                sorting arrays.
28432* Two-way I/O::                 Two-way communications with another process.
28433* TCP/IP Networking::           Using @command{gawk} for network programming.
28434* Profiling::                   Profiling your @command{awk} programs.
28435* Extension Philosophy::        What should be built-in and what should not.
28436* Advanced Features Summary::   Summary of advanced features.
28437@end menu
28438
28439@node Nondecimal Data
28440@section Allowing Nondecimal Input Data
28441@cindex @option{--non-decimal-data} option
28442@cindex advanced features @subentry nondecimal input data
28443@cindex input @subentry data, nondecimal
28444@cindex constants @subentry nondecimal
28445
28446If you run @command{gawk} with the @option{--non-decimal-data} option,
28447you can have nondecimal values in your input data:
28448
28449@example
28450$ @kbd{echo 0123 123 0x123 |}
28451> @kbd{gawk --non-decimal-data '@{ printf "%d, %d, %d\n", $1, $2, $3 @}'}
28452@print{} 83, 123, 291
28453@end example
28454
28455For this feature to work, write your program so that
28456@command{gawk} treats your data as numeric:
28457
28458@example
28459$ @kbd{echo 0123 123 0x123 | gawk '@{ print $1, $2, $3 @}'}
28460@print{} 0123 123 0x123
28461@end example
28462
28463@noindent
28464The @code{print} statement treats its expressions as strings.
28465Although the fields can act as numbers when necessary,
28466they are still strings, so @code{print} does not try to treat them
28467numerically.  You need to add zero to a field to force it to
28468be treated as a number.  For example:
28469
28470@example
28471$ @kbd{echo 0123 123 0x123 | gawk --non-decimal-data '}
28472> @kbd{@{ print $1, $2, $3}
28473>   @kbd{print $1 + 0, $2 + 0, $3 + 0 @}'}
28474@print{} 0123 123 0x123
28475@print{} 83 123 291
28476@end example
28477
28478Because it is common to have decimal data with leading zeros, and because
28479using this facility could lead to surprising results, the default is to leave it
28480disabled.  If you want it, you must explicitly request it.
28481
28482@cindex programming conventions @subentry @option{--non-decimal-data} option
28483@cindex @option{--non-decimal-data} option @subentry @code{strtonum()} function and
28484@cindex @code{strtonum()} function (@command{gawk}) @subentry @option{--non-decimal-data} option and
28485@quotation CAUTION
28486@emph{Use of this option is not recommended.}
28487It can break old programs very badly.
28488Instead, use the @code{strtonum()} function to convert your data
28489(@pxref{String Functions}).
28490This makes your programs easier to write and easier to read, and
28491leads to less surprising results.
28492
28493This option may disappear in a future version of @command{gawk}.
28494@end quotation
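
For comparison, here is a minimal sketch of the same conversion done with
@code{strtonum()}, which needs no special command-line option:

@example
$ @kbd{echo 0123 123 0x123 |}
> @kbd{gawk '@{ print strtonum($1), strtonum($2), strtonum($3) @}'}
@print{} 83 123 291
@end example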
28495
28496@node Array Sorting
28497@section Controlling Array Traversal and Array Sorting
28498
28499@command{gawk} lets you control the order in which a
28500@samp{for (@var{indx} in @var{array})}
28501loop traverses an array.
28502
28503In addition, two built-in functions, @code{asort()} and @code{asorti()},
28504let you sort arrays based on the array values and indices, respectively.
28505These two functions also provide control over the sorting criteria used
28506to order the elements during sorting.
28507
28508@menu
28509* Controlling Array Traversal:: How to use PROCINFO["sorted_in"].
28510* Array Sorting Functions::     How to use @code{asort()} and @code{asorti()}.
28511@end menu
28512
28513@node Controlling Array Traversal
28514@subsection Controlling Array Traversal
28515
28516By default, the order in which a @samp{for (@var{indx} in @var{array})} loop
28517scans an array is not defined; it is generally based upon
28518the internal implementation of arrays inside @command{awk}.
28519
28520Often, though, it is desirable to be able to loop over the elements
28521in a particular order that you, the programmer, choose.  @command{gawk}
28522lets you do this.
28523
28524@ref{Controlling Scanning} describes how you can assign special,
28525predefined values to @code{PROCINFO["sorted_in"]} in order to
28526control the order in which @command{gawk} traverses an array
28527during a @code{for} loop.
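
As a brief illustration (the array name @code{count} and its contents
are made up for this sketch), the following program uses the predefined
value @code{"@@val_num_desc"} to visit elements from the largest
numeric value to the smallest:

@example
BEGIN @{
    count["pear"] = 3
    count["apple"] = 7
    count["fig"] = 5

    # visit elements by descending numeric value
    PROCINFO["sorted_in"] = "@@val_num_desc"
    for (fruit in count)
        print fruit, count[fruit]
@}
@end example

@noindent
When @command{gawk} is not in POSIX mode, this prints @samp{apple 7},
then @samp{fig 5}, and finally @samp{pear 3}.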
28528
28529In addition, the value of @code{PROCINFO["sorted_in"]} can be a
28530function name.@footnote{This is why the predefined sorting orders
28531start with an @samp{@@} character, which cannot be part of an identifier.}
28532This lets you traverse an array based on any custom criterion.
28533The array elements are ordered according to the return value of this
28534function.  The comparison function should be defined with at least
28535four arguments:
28536
28537@example
28538function comp_func(i1, v1, i2, v2)
28539@{
28540    @var{compare elements 1 and 2 in some fashion}
28541    @var{return < 0; 0; or > 0}
28542@}
28543@end example
28544
28545Here, @code{i1} and @code{i2} are the indices, and @code{v1} and @code{v2}
28546are the corresponding values of the two elements being compared.
28547Either @code{v1} or @code{v2}, or both, can be arrays if the array being
28548traversed contains subarrays as values.
(@xref{Arrays of Arrays}, for more information about subarrays.)
28550The three possible return values are interpreted as follows:
28551
28552@table @code
28553@item comp_func(i1, v1, i2, v2) < 0
28554Index @code{i1} comes before index @code{i2} during loop traversal.
28555
28556@item comp_func(i1, v1, i2, v2) == 0
28557Indices @code{i1} and @code{i2}
28558come together, but the relative order with respect to each other is undefined.
28559
28560@item comp_func(i1, v1, i2, v2) > 0
28561Index @code{i1} comes after index @code{i2} during loop traversal.
28562@end table
28563
28564Our first comparison function can be used to scan an array in
28565numerical order of the indices:
28566
28567@example
28568@group
28569function cmp_num_idx(i1, v1, i2, v2)
28570@{
28571     # numerical index comparison, ascending order
28572     return (i1 - i2)
28573@}
28574@end group
28575@end example
28576
28577Our second function traverses an array based on the string order of
28578the element values rather than by indices:
28579
28580@example
28581function cmp_str_val(i1, v1, i2, v2)
28582@{
28583    # string value comparison, ascending order
28584    v1 = v1 ""
28585    v2 = v2 ""
28586    if (v1 < v2)
28587        return -1
28588    return (v1 != v2)
28589@}
28590@end example
28591
28592The third
28593comparison function makes all numbers, and numeric strings without
28594any leading or trailing spaces, come out first during loop traversal:
28595
28596@example
28597function cmp_num_str_val(i1, v1, i2, v2,   n1, n2)
28598@{
28599     # numbers before string value comparison, ascending order
28600     n1 = v1 + 0
28601     n2 = v2 + 0
28602     if (n1 == v1)
28603         return (n2 == v2) ? (n1 - n2) : -1
28604     else if (n2 == v2)
28605         return 1
28606     return (v1 < v2) ? -1 : (v1 != v2)
28607@}
28608@end example
28609
28610Here is a main program to demonstrate how @command{gawk}
28611behaves using each of the previous functions:
28612
28613@example
28614BEGIN @{
28615    data["one"] = 10
28616    data["two"] = 20
28617    data[10] = "one"
28618    data[100] = 100
28619    data[20] = "two"
28620
28621    f[1] = "cmp_num_idx"
28622    f[2] = "cmp_str_val"
28623    f[3] = "cmp_num_str_val"
28624    for (i = 1; i <= 3; i++) @{
28625        printf("Sort function: %s\n", f[i])
28626        PROCINFO["sorted_in"] = f[i]
28627        for (j in data)
28628            printf("\tdata[%s] = %s\n", j, data[j])
28629        print ""
28630    @}
28631@}
28632@end example
28633
28634Here are the results when the program is run:
28635
28636@example
28637$ @kbd{gawk -f compdemo.awk}
28638@print{} Sort function: cmp_num_idx      @ii{Sort by numeric index}
28639@print{}     data[two] = 20
28640@print{}     data[one] = 10              @ii{Both strings are numerically zero}
28641@print{}     data[10] = one
28642@print{}     data[20] = two
28643@print{}     data[100] = 100
28644@print{}
28645@print{} Sort function: cmp_str_val      @ii{Sort by element values as strings}
28646@print{}     data[one] = 10
28647@print{}     data[100] = 100             @ii{String 100 is less than string 20}
28648@print{}     data[two] = 20
28649@print{}     data[10] = one
28650@print{}     data[20] = two
28651@print{}
28652@print{} Sort function: cmp_num_str_val  @ii{Sort all numeric values before all strings}
28653@print{}     data[one] = 10
28654@print{}     data[two] = 20
28655@print{}     data[100] = 100
28656@print{}     data[10] = one
28657@print{}     data[20] = two
28658@end example
28659
28660Consider sorting the entries of a GNU/Linux system password file
28661according to login name.  The following program sorts records
28662by a specific field position and can be used for this purpose:
28663
28664@example
28665# passwd-sort.awk --- simple program to sort by field position
28666# field position is specified by the global variable POS
28667
28668function cmp_field(i1, v1, i2, v2)
28669@{
28670    # comparison by value, as string, and ascending order
28671    return v1[POS] < v2[POS] ? -1 : (v1[POS] != v2[POS])
28672@}
28673
28674@{
28675    for (i = 1; i <= NF; i++)
28676        a[NR][i] = $i
28677@}
28678
28679@group
28680END @{
28681    PROCINFO["sorted_in"] = "cmp_field"
28682@end group
28683    if (POS < 1 || POS > NF)
28684        POS = 1
28685
28686    for (i in a) @{
28687        for (j = 1; j <= NF; j++)
28688            printf("%s%c", a[i][j], j < NF ? ":" : "")
28689        print ""
28690    @}
28691@}
28692@end example
28693
28694The first field in each entry of the password file is the user's login name,
28695and the fields are separated by colons.
28696Each record defines a subarray,
28697with each field as an element in the subarray.
28698Running the program produces the
28699following output:
28700
28701@example
$ @kbd{gawk -v POS=1 -F: -f passwd-sort.awk /etc/passwd}
28703@print{} adm:x:3:4:adm:/var/adm:/sbin/nologin
28704@print{} apache:x:48:48:Apache:/var/www:/sbin/nologin
28705@print{} avahi:x:70:70:Avahi daemon:/:/sbin/nologin
28706@dots{}
28707@end example
28708
The comparison function should always return the same value when given a
28710specific pair of array elements as its arguments.  If inconsistent
28711results are returned, then the order is undefined.  This behavior can be
28712exploited to introduce random order into otherwise seemingly
28713ordered data:
28714
28715@example
28716function cmp_randomize(i1, v1, i2, v2)
28717@{
28718    # random order (caution: this may never terminate!)
28719    return (2 - 4 * rand())
28720@}
28721@end example
28722
28723As already mentioned, the order of the indices is arbitrary if two
28724elements compare equal.  This is usually not a problem, but letting
28725the tied elements come out in arbitrary order can be an issue, especially
28726when comparing item values.  The partial ordering of the equal elements
28727may change the next time the array is traversed, if other elements are added to or
28728removed from the array.  One way to resolve ties when comparing elements
28729with otherwise equal values is to include the indices in the comparison
28730rules.  Note that doing this may make the loop traversal less efficient,
28731so consider it only if necessary.  The following comparison functions
28732force a deterministic order, and are based on the fact that the
28733(string) indices of two elements are never equal:
28734
28735@example
28736function cmp_numeric(i1, v1, i2, v2)
28737@{
28738    # numerical value (and index) comparison, descending order
28739    return (v1 != v2) ? (v2 - v1) : (i2 - i1)
28740@}
28741
28742@group
28743function cmp_string(i1, v1, i2, v2)
28744@{
28745    # string value (and index) comparison, descending order
28746    v1 = v1 i1
28747    v2 = v2 i2
28748    return (v1 > v2) ? -1 : (v1 != v2)
28749@}
28750@end group
28751@end example
28752
28753@c Avoid using the term ``stable'' when describing the unpredictable behavior
28754@c if two items compare equal.  Usually, the goal of a "stable algorithm"
28755@c is to maintain the original order of the items, which is a meaningless
28756@c concept for a list constructed from a hash.
28757
28758A custom comparison function can often simplify ordered loop
28759traversal, and the sky is really the limit when it comes to
28760designing such a function.
28761
28762When string comparisons are made during a sort, either for element
28763values where one or both aren't numbers, or for element indices
28764handled as strings, the value of @code{IGNORECASE}
28765(@pxref{Built-in Variables}) controls whether
28766the comparisons treat corresponding upper- and lowercase letters as
28767equivalent or distinct.
28768
28769Another point to keep in mind is that in the case of subarrays,
28770the element values can themselves be arrays; a production comparison
28771function should use the @code{isarray()} function
28772(@pxref{Type Functions})
28773to check for this, and choose a defined sorting order for subarrays.
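
The following is one possible sketch of such a function.  The name
@code{cmp_scalars_first()} and the decision to place subarrays after
all scalar values, ordered among themselves by index, are assumptions
made purely for illustration:

@example
function cmp_scalars_first(i1, v1, i2, v2)
@{
    # subarrays come after scalars; order subarrays by index
    if (isarray(v1))
        return isarray(v2) ? ((i1 < i2) ? -1 : (i1 != i2)) : 1
    if (isarray(v2))
        return -1
    # both values are scalars: ordinary comparison, ascending order
    return (v1 < v2) ? -1 : (v1 != v2)
@}
@end example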
28774
28775@cindex POSIX mode
28776All sorting based on @code{PROCINFO["sorted_in"]}
28777is disabled in POSIX mode,
28778because the @code{PROCINFO} array is not special in that case.
28779
28780As a side note, sorting the array indices before traversing
28781the array has been reported to add a 15% to 20% overhead to the
28782execution time of @command{awk} programs. For this reason,
28783sorted array traversal is not the default.
28784
28785@c The @command{gawk}
28786@c maintainers believe that only the people who wish to use a
28787@c feature should have to pay for it.
28788
28789@node Array Sorting Functions
28790@subsection Sorting Array Values and Indices with @command{gawk}
28791
28792@cindex arrays @subentry sorting @subentry @code{asort()} function (@command{gawk})
28793@cindex arrays @subentry sorting @subentry @code{asorti()} function (@command{gawk})
28794@cindexgawkfunc{asort}
28795@cindex @code{asort()} function (@command{gawk}) @subentry arrays, sorting
28796@cindex @code{asort()} function (@command{gawk}) @subentry side effects
28797@cindexgawkfunc{asorti}
28798@cindex @code{asorti()} function (@command{gawk}) @subentry arrays, sorting
28799@cindex @code{asorti()} function (@command{gawk}) @subentry side effects
28800@cindex sort function, arrays, sorting
28801In most @command{awk} implementations, sorting an array requires writing
28802a @code{sort()} function.  This can be educational for exploring
28803different sorting algorithms, but usually that's not the point of the program.
28804@command{gawk} provides the built-in @code{asort()} and @code{asorti()}
28805functions (@pxref{String Functions}) for sorting arrays.  For example:
28806
28807@example
28808@var{populate the array} data
28809n = asort(data)
28810for (i = 1; i <= n; i++)
28811    @var{do something with} data[i]
28812@end example
28813
28814After the call to @code{asort()}, the array @code{data} is indexed from 1
28815to some number @var{n}, the total number of elements in @code{data}.
28816(This count is @code{asort()}'s return value.)
28817@code{data[1]} @value{LEQ} @code{data[2]} @value{LEQ} @code{data[3]}, and so on.
28818The default comparison is based on the type of the elements
28819(@pxref{Typing and Comparison}).
28820All numeric values come before all string values,
28821which in turn come before all subarrays.
28822
28823@cindex side effects @subentry @code{asort()} function
28824@cindex side effects @subentry @code{asorti()} function
28825An important side effect of calling @code{asort()} is that
28826@emph{the array's original indices are irrevocably lost}.
28827As this isn't always desirable, @code{asort()} accepts a
28828second argument:
28829
28830@example
28831@var{populate the array} source
28832n = asort(source, dest)
28833for (i = 1; i <= n; i++)
28834    @var{do something with} dest[i]
28835@end example
28836
28837In this case, @command{gawk} copies the @code{source} array into the
28838@code{dest} array and then sorts @code{dest}, destroying its indices.
28839However, the @code{source} array is not affected.
28840
28841Often, what's needed is to sort on the values of the @emph{indices}
28842instead of the values of the elements.  To do that, use the
@code{asorti()} function.  The interface and behavior are identical to
those of @code{asort()}, except that the index values are used for sorting
28845and become the values of the result array:
28846
28847@example
28848@{ source[$0] = some_func($0) @}
28849
28850END @{
28851    n = asorti(source, dest)
28852    for (i = 1; i <= n; i++) @{
28853        @ii{Work with sorted indices directly:}
28854        @var{do something with} dest[i]
28855        @dots{}
28856        @ii{Access original array via sorted indices:}
28857        @var{do something with} source[dest[i]]
28858    @}
28859@}
28860@end example
28861
28862So far, so good. Now it starts to get interesting.  Both @code{asort()}
28863and @code{asorti()} accept a third string argument to control comparison
28864of array elements.  When we introduced @code{asort()} and @code{asorti()}
28865in @ref{String Functions}, we ignored this third argument; however,
28866now is the time to describe how this argument affects these two functions.
28867
28868Basically, the third argument specifies how the array is to be sorted.
28869There are two possibilities.  As with @code{PROCINFO["sorted_in"]},
28870this argument may be one of the predefined names that @command{gawk}
28871provides (@pxref{Controlling Scanning}), or it may be the name of a
28872user-defined function (@pxref{Controlling Array Traversal}).
28873
28874In the latter case, @emph{the function can compare elements in any way
28875it chooses}, taking into account just the indices, just the values,
28876or both.  This is extremely powerful.
28877
28878Once the array is sorted, @code{asort()} takes the @emph{values} in
28879their final order and uses them to fill in the result array, whereas
28880@code{asorti()} takes the @emph{indices} in their final order and uses
28881them to fill in the result array.
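
To make the difference concrete, here is a small sketch (the array
names and data are arbitrary) that sorts the same array by descending
numeric value with each function:

@example
BEGIN @{
    split("30 4 100 25", data)

    # asort() fills the result with the values, in sorted order
    n = asort(data, vals, "@@val_num_desc")
    for (i = 1; i <= n; i++)
        printf("%s%s", vals[i], i < n ? " " : "\n")

    # asorti() fills the result with the indices, in the same order
    n = asorti(data, inds, "@@val_num_desc")
    for (i = 1; i <= n; i++)
        printf("%s%s", inds[i], i < n ? " " : "\n")
@}
@end example

@noindent
The first loop prints the values, @samp{100 30 25 4}; the second prints
the original indices of those values in the same order, @samp{3 1 4 2}.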
28882
28883@cindex reference counting, sorting arrays
28884@quotation NOTE
28885Copying array indices and elements isn't expensive in terms of memory.
28886Internally, @command{gawk} maintains @dfn{reference counts} to data.
28887For example, when @code{asort()} copies the first array to the second one,
28888there is only one copy of the original array elements' data, even though
28889both arrays use the values.
28890@end quotation
28891
28892You may use the same array for both the first and second arguments to
28893@code{asort()} and @code{asorti()}.  Doing so only makes sense if you
28894are also supplying the third argument, since @command{awk} doesn't
28895provide a way to pass that third argument without also passing the first
28896and second ones.
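
For instance, this sketch (with made-up data) sorts an array in place
in descending string order:

@example
BEGIN @{
    split("pear apple fig", data)
    # same array as source and destination, so that the
    # third argument can be supplied
    n = asort(data, data, "@@val_str_desc")
    for (i = 1; i <= n; i++)
        print data[i]
@}
@end example

@noindent
This prints @samp{pear}, @samp{fig}, and @samp{apple}, one per line.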
28897
28898@c Document It And Call It A Feature. Sigh.
28899@cindex @command{gawk} @subentry @code{IGNORECASE} variable in
28900@cindex arrays @subentry sorting @subentry @code{IGNORECASE} variable and
28901@cindex @code{IGNORECASE} variable @subentry array sorting functions and
28902Because @code{IGNORECASE} affects string comparisons, the value
28903of @code{IGNORECASE} also affects sorting for both @code{asort()} and @code{asorti()}.
28904Note also that the locale's sorting order does @emph{not}
28905come into play; comparisons are based on character values only.@footnote{This
28906is true because locale-based comparison occurs only when in
28907POSIX-compatibility mode, and because @code{asort()} and @code{asorti()} are
28908@command{gawk} extensions, they are not available in that case.}
28909
28910The following example demonstrates the use of a comparison function with
28911@code{asort()}.  The comparison function, @code{case_fold_compare()}, maps
28912both values to lowercase in order to compare them ignoring case.
28913
28914@example
28915@group
28916# case_fold_compare --- compare as strings, ignoring case
28917
28918function case_fold_compare(i1, v1, i2, v2,    l, r)
28919@{
28920    l = tolower(v1)
28921@end group
28922    r = tolower(v2)
28923
28924    if (l < r)
28925        return -1
28926    else if (l == r)
28927        return 0
28928    else
28929        return 1
28930@}
28931@end example
28932
28933And here is the test program for it:
28934
28935@example
28936# Test program
28937
28938BEGIN @{
28939    Letters = "abcdefghijklmnopqrstuvwxyz" \
28940              "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
28941    split(Letters, data, "")
28942
28943    asort(data, result, "case_fold_compare")
28944
28945    j = length(result)
28946    for (i = 1; i <= j; i++) @{
28947        printf("%s", result[i])
28948        if (i % (j/2) == 0)
28949            printf("\n")
28950        else
28951            printf(" ")
28952    @}
28953@}
28954@end example
28955
28956When run, we get the following:
28957
28958@example
28959$ @kbd{gawk -f case_fold_compare.awk}
28960@print{} A a B b c C D d e E F f g G H h i I J j k K l L M m
28961@print{} n N O o p P Q q r R S s t T u U V v w W X x y Y z Z
28962@end example
28963
28964@node Two-way I/O
28965@section Two-Way Communications with Another Process
28966
28967@c 8/2014. Neither Mike nor BWK saw this as relevant. Commenting it out.
28968@ignore
28969@cindex Brennan, Michael
28970@cindex programmers, attractiveness of
28971@smallexample
28972@c Path: cssun.mathcs.emory.edu!gatech!newsxfer3.itd.umich.edu!news-peer.sprintlink.net!news-sea-19.sprintlink.net!news-in-west.sprintlink.net!news.sprintlink.net!Sprint!204.94.52.5!news.whidbey.com!brennan
28973From: brennan@@whidbey.com (Mike Brennan)
28974Newsgroups: comp.lang.awk
28975Subject: Re: Learn the SECRET to Attract Women Easily
28976Date: 4 Aug 1997 17:34:46 GMT
28977@c Organization: WhidbeyNet
28978@c Lines: 12
28979Message-ID: <5s53rm$eca@@news.whidbey.com>
28980@c References: <5s20dn$2e1@chronicle.concentric.net>
28981@c Reply-To: brennan@whidbey.com
28982@c NNTP-Posting-Host: asn202.whidbey.com
28983@c X-Newsreader: slrn (0.9.4.1 UNIX)
28984@c Xref: cssun.mathcs.emory.edu comp.lang.awk:5403
28985
28986On 3 Aug 1997 13:17:43 GMT, Want More Dates???
28987<tracy78@@kilgrona.com> wrote:
28988>Learn the SECRET to Attract Women Easily
28989>
28990>The SCENT(tm)  Pheromone Sex Attractant For Men to Attract Women
28991
28992The scent of awk programmers is a lot more attractive to women than
28993the scent of perl programmers.
28994--
28995Mike Brennan
28996@c brennan@@whidbey.com
28997@end smallexample
28998@end ignore
28999
29000@cindex advanced features @subentry processes, communicating with
29001@cindex processes, two-way communications with
29002It is often useful to be able to
29003send data to a separate program for
29004processing and then read the result.  This can always be
29005done with temporary files:
29006
29007@example
29008# Write the data for processing
29009tempfile = ("mydata." PROCINFO["pid"])
29010while (@var{not done with data})
29011    print @var{data} | ("subprogram > " tempfile)
29012close("subprogram > " tempfile)
29013
29014# Read the results, remove tempfile when done
29015while ((getline newdata < tempfile) > 0)
29016    @var{process} newdata @var{appropriately}
29017close(tempfile)
29018system("rm " tempfile)
29019@end example
29020
29021@noindent
29022This works, but not elegantly.  Among other things, it requires that
29023the program be run in a directory that cannot be shared among users;
29024for example, @file{/tmp} will not do, as another user might happen
29025to be using a temporary file with the same name.@footnote{Michael
Brennan suggests the use of @code{rand()} to generate unique
29027@value{FN}s. This is a valid point; nevertheless, temporary files
29028remain more difficult to use than two-way pipes.} @c 8/2014
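
As a concrete illustration of the temporary-file approach, here is a
minimal sketch that sorts three words with the system @command{sort}
utility.  The word list, the helper name @code{sort_via_tempfile()}, and
the choice of @command{sort} as the subprogram are assumptions made for
this illustration only:

```awk
# sort_via_tempfile --- sort items[1..n] by piping them to "sort",
# which saves its output in a temporary file, then read the file back.
# Fills result[1..count] and returns count.
# (Hypothetical helper, for illustration only.)

function sort_via_tempfile(items, n, result,    tempfile, cmd, i, count, line)
{
    tempfile = ("mydata." PROCINFO["pid"])
    cmd = "LC_ALL=C sort > " tempfile

    # Write the data for processing
    for (i = 1; i <= n; i++)
        print items[i] | cmd
    close(cmd)              # wait until sort has written tempfile

    # Read the results, remove tempfile when done
    delete result
    count = 0
    while ((getline line < tempfile) > 0)
        result[++count] = line
    close(tempfile)
    system("rm " tempfile)

    return count
}

BEGIN {
    n = split("pear apple fig", words, " ")
    m = sort_via_tempfile(words, n, sorted)
    for (i = 1; i <= m; i++)
        print "got", sorted[i]
}
```

Note that the program must call @code{close()} on the output pipe before
it reads the temporary file; otherwise, @command{sort} may not have
finished writing it.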
29029
29030@cindex coprocesses
29031@cindex input/output @subentry two-way
29032@cindex @code{|} (vertical bar) @subentry @code{|&} operator (I/O)
29033@cindex vertical bar (@code{|}) @subentry @code{|&} operator (I/O)
29034@cindex @command{csh} utility @subentry @code{|&} operator, comparison with
29035However, with @command{gawk}, it is possible to
29036open a @emph{two-way} pipe to another process.  The second process is
29037termed a @dfn{coprocess}, as it runs in parallel with @command{gawk}.
29038The two-way connection is created using the @samp{|&} operator
29039(borrowed from the Korn shell, @command{ksh}):@footnote{This is very
29040different from the same operator in the C shell and in Bash.}
29041
29042@example
29043do @{
29044    print @var{data} |& "subprogram"
29045    "subprogram" |& getline results
29046@} while (@var{data left to process})
29047close("subprogram")
29048@end example
29049
29050The first time an I/O operation is executed using the @samp{|&}
29051operator, @command{gawk} creates a two-way pipeline to a child process
29052that runs the other program.  Output created with @code{print}
29053or @code{printf} is written to the program's standard input, and
29054output from the program's standard output can be read by the @command{gawk}
29055program using @code{getline}.
29056As is the case with processes started by @samp{|}, the subprogram
29057can be any program, or pipeline of programs, that can be started by
29058the shell.
29059
29060There are some cautionary items to be aware of:
29061
29062@itemize @value{BULLET}
29063@item
29064As the code inside @command{gawk} currently stands, the coprocess's
29065standard error goes to the same place that the parent @command{gawk}'s
29066standard error goes. It is not possible to read the child's
29067standard error separately.
29068
29069@cindex deadlocks
29070@cindex buffering @subentry input/output
29071@cindex @code{getline} command @subentry deadlock and
29072@item
29073I/O buffering may be a problem.  @command{gawk} automatically
29074flushes all output down the pipe to the coprocess.
29075However, if the coprocess does not flush its output,
29076@command{gawk} may hang when doing a @code{getline} in order to read
29077the coprocess's results.  This could lead to a situation
29078known as @dfn{deadlock}, where each process is waiting for the
29079other one to do something.
29080@end itemize
29081
29082@cindex @code{close()} function @subentry two-way pipes and
29083It is possible to close just one end of the two-way pipe to
29084a coprocess, by supplying a second argument to the @code{close()}
29085function of either @code{"to"} or @code{"from"}
29086(@pxref{Close Files And Pipes}).
29087These strings tell @command{gawk} to close the end of the pipe
29088that sends data to the coprocess or the end that reads from it,
29089respectively.
29090
29091@cindex @command{sort} utility @subentry coprocesses and
29092This is particularly necessary in order to use
29093the system @command{sort} utility as part of a coprocess;
29094@command{sort} must read @emph{all} of its input
29095data before it can produce any output.
29096The @command{sort} program does not receive an end-of-file indication
29097until @command{gawk} closes the write end of the pipe.
29098
29099When you have finished writing data to the @command{sort}
29100utility, you can close the @code{"to"} end of the pipe, and
29101then start reading sorted data via @code{getline}.
29102For example:
29103
29104@example
29105BEGIN @{
29106    command = "LC_ALL=C sort"
29107    n = split("abcdefghijklmnopqrstuvwxyz", a, "")
29108
29109    for (i = n; i > 0; i--)
29110        print a[i] |& command
29111    close(command, "to")
29112
29113    while ((command |& getline line) > 0)
29114        print "got", line
29115    close(command)
29116@}
29117@end example
29118
29119This program writes the letters of the alphabet in reverse order, one
29120per line, down the two-way pipe to @command{sort}.  It then closes the
29121write end of the pipe, so that @command{sort} receives an end-of-file
29122indication.  This causes @command{sort} to sort the data and write the
29123sorted data back to the @command{gawk} program.  Once all of the data
29124has been read, @command{gawk} terminates the coprocess and exits.
29125
29126@cindex ASCII
29127As a side note, the assignment @samp{LC_ALL=C} in the @command{sort}
29128command ensures traditional Unix (ASCII) sorting from @command{sort}.
29129This is not strictly necessary here, but it's good to know how to do this.
29130
29131Be careful when closing the @code{"from"} end of a two-way pipe; in this
29132case @command{gawk} waits for the child process to exit, which may cause
29133your program to hang.  (Thus, this particular feature is of much less
29134use in practice than being able to close the @code{"to"} end.)
29135
29136@quotation CAUTION
29137Normally,
29138it is a fatal error to write to the @code{"to"} end of a two-way
29139pipe which has been closed, and it is also a fatal error to read
29140from the @code{"from"} end of a two-way pipe that has been closed.
29141
29142You may set @code{PROCINFO["@var{command}", "NONFATAL"]} to
29143make such operations become nonfatal. If you do so, you then need
29144to check @code{ERRNO} after each @code{print}, @code{printf},
29145or @code{getline}.
29146@xref{Nonfatal}, for more information.
29147@end quotation
29148
29149@cindex @command{gawk} @subentry @code{PROCINFO} array in
29150@cindex @code{PROCINFO} array @subentry communications via ptys and
29151You may also use pseudo-ttys (ptys) for
29152two-way communication instead of pipes, if your system supports them.
29153This is done on a per-command basis, by setting a special element
29154in the @code{PROCINFO} array
29155(@pxref{Auto-set}),
29156like so:
29157
29158@example
29159command = "sort -nr"           # command, save in convenience variable
29160PROCINFO[command, "pty"] = 1   # update PROCINFO
29161print @dots{} |& command           # start two-way pipe
29162@dots{}
29163@end example
29164
29165@noindent
29166If your system does not have ptys, or if all the system's ptys are in use,
29167@command{gawk} automatically falls back to using regular pipes.
29168
29169Using ptys usually avoids the buffer deadlock issues described earlier,
29170at some loss in performance. This is because the tty driver buffers
and sends data line-by-line.  On systems with the @command{stdbuf} utility
29172(part of the @uref{https://www.gnu.org/software/coreutils/coreutils.html,
29173GNU Coreutils package}), you can use that program instead of ptys.
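
The following sketch shows one way this might look.  It assumes that the
GNU versions of @command{stdbuf} and @command{tr} are installed, and that
@command{tr} writes through the C standard I/O library so that
@samp{stdbuf -oL} can make its output line buffered.  The helper name
@code{upcase_via_coprocess()} and the global variable @code{Upcase} are
hypothetical:

```awk
# upcase_via_coprocess --- send one line to a line-buffered "tr"
# coprocess and return its reply, or "" if nothing could be read.
# (Hypothetical helper; assumes GNU "stdbuf" and "tr" are installed.)

function upcase_via_coprocess(text,    reply)
{
    print text |& Upcase            # gawk flushes this automatically
    if ((Upcase |& getline reply) <= 0)
        return ""
    return reply
}

BEGIN {
    # "stdbuf -oL" requests line-buffered standard output for "tr"
    Upcase = "stdbuf -oL tr a-z A-Z"

    n = split("one two three", words, " ")
    for (i = 1; i <= n; i++)
        print words[i], "->", upcase_via_coprocess(words[i])
    close(Upcase)
}
```

Where these assumptions hold, each reply arrives as soon as its line has
been sent, so the program does not need to close the @code{"to"} end of
the pipe before reading.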
29174
29175Note also that ptys are not fully transparent. Certain binary control
codes, such as @kbd{Ctrl-d} for end-of-file, are interpreted by the tty
29177driver and not passed through.
29178
29179@quotation CAUTION
29180Finally, coprocesses open up the possibility of @dfn{deadlock} between
29181@command{gawk} and the program running in the coprocess. This can occur
29182if you send ``too much'' data to the coprocess before reading any back;
29183each process is blocked writing data with no one available to read what
29184they've already written.  There is no workaround for deadlock; careful
29185programming and knowledge of the behavior of the coprocess are required.
29186@end quotation
29187
@c From email sent January 4, 2018.
29189The following example, due to Andrew Schorr, demonstrates how
29190using ptys can help deal with buffering deadlocks.
29191
29192Suppose @command{gawk} were unable to add numbers.
29193You could use a coprocess to do it. Here's an exceedingly
29194simple program written for that purpose:
29195
29196@example
29197$ @kbd{cat add.c}
29198#include <stdio.h>
29199
29200int
29201main(void)
29202@{
29203    int x, y;
29204    while (scanf("%d %d", & x, & y) == 2)
29205        printf("%d\n", x + y);
29206    return 0;
29207@}
29208$ @kbd{cc -O add.c -o add}      @ii{Compile the program}
29209@end example
29210
29211You could then write an exceedingly simple @command{gawk} program
29212to add numbers by passing them to the coprocess:
29213
29214@example
29215$ @kbd{echo 1 2 |}
29216> @kbd{gawk -v cmd=./add '@{ print |& cmd; cmd |& getline x; print x @}'}
29217@end example
29218
29219And it would deadlock, because @file{add.c} fails to call
29220@samp{setlinebuf(stdout)}. The @command{add} program freezes.
29221
29222Now try instead:
29223
29224@example
29225$ @kbd{echo 1 2 |}
> @kbd{gawk -v cmd=./add 'BEGIN @{ PROCINFO[cmd, "pty"] = 1 @}}
29227> @kbd{                 @{ print |& cmd; cmd |& getline x; print x @}'}
29228@print{} 3
29229@end example
29230
29231By using a pty, @command{gawk} fools the standard I/O library into
29232thinking it has an interactive session, so it defaults to line buffering.
29233And now, magically, it works!
29234
29235@node TCP/IP Networking
29236@section Using @command{gawk} for Network Programming
29237@cindex advanced features @subentry network programming
29238@cindex networks @subentry programming
29239@cindex TCP/IP
29240@cindex @code{/inet/@dots{}} special files (@command{gawk})
29241@cindex files @subentry @code{/inet/@dots{}} (@command{gawk})
29242@cindex @code{/inet4/@dots{}} special files (@command{gawk})
29243@cindex files @subentry @code{/inet4/@dots{}} (@command{gawk})
29244@cindex @code{/inet6/@dots{}} special files (@command{gawk})
29245@cindex files @subentry @code{/inet6/@dots{}} (@command{gawk})
29246@cindex @code{EMRED}
29247@ifnotdocbook
29248@quotation
29249@code{EMRED}:@*
29250@ @ @ @ @i{A host is a host from coast to coast,@*
29251@ @ @ @ and nobody talks to a host that's close,@*
29252@ @ @ @ unless the host that isn't close@*
29253@ @ @ @ is busy, hung, or dead.}
29254@author Mike O'Brien (aka Mr.@: Protocol)
29255@end quotation
29256@end ifnotdocbook
29257
29258@docbook
29259<blockquote>
29260<attribution>Mike O'Brien (aka Mr.&nbsp;Protocol)</attribution>
29261<literallayout class="normal"><literal>EMRED</literal>:
29262&nbsp;&nbsp;&nbsp;&nbsp;<emphasis>A host is a host from coast to coast,</emphasis>
&nbsp;&nbsp;&nbsp;&nbsp;<emphasis>and nobody talks to a host that's close,</emphasis>
29264&nbsp;&nbsp;&nbsp;&nbsp;<emphasis>unless the host that isn't close</emphasis>
29265&nbsp;&nbsp;&nbsp;&nbsp;<emphasis>is busy, hung, or dead.</emphasis></literallayout>
29266</blockquote>
29267@end docbook
29268
29269In addition to being able to open a two-way pipeline to a coprocess
29270on the same system
29271(@pxref{Two-way I/O}),
29272it is possible to make a two-way connection to
29273another process on another system across an IP network connection.
29274
29275You can think of this as just a @emph{very long} two-way pipeline to
29276a coprocess.
29277The way @command{gawk} decides that you want to use TCP/IP networking is
29278by recognizing special @value{FN}s that begin with one of @samp{/inet/},
29279@samp{/inet4/}, or @samp{/inet6/}.
29280
29281The full syntax of the special @value{FN} is
29282@file{/@var{net-type}/@var{protocol}/@var{local-port}/@var{remote-host}/@var{remote-port}}.
29283The components are:
29284
29285@table @var
29286@item net-type
29287Specifies the kind of Internet connection to make.
29288Use @samp{/inet4/} to force IPv4, and
29289@samp{/inet6/} to force IPv6.
29290Plain @samp{/inet/} (which used to be the only option) uses
29291the system default, most likely IPv4.
29292
29293@item protocol
The protocol to use over IP.  This must be either @samp{tcp} or
@samp{udp}, for a TCP or UDP connection,
respectively.  TCP should be used for most applications.
29297
29298@item local-port
29299@cindex @code{getaddrinfo()} function (C library)
29300@cindex C library functions @subentry @code{getaddrinfo()}
29301The local TCP or UDP port number to use.  Use a port number of @samp{0}
29302when you want the system to pick a port. This is what you should do
29303when writing a TCP or UDP client.
29304You may also use a well-known service name, such as @samp{smtp}
29305or @samp{http}, in which case @command{gawk} attempts to determine
29306the predefined port number using the C @code{getaddrinfo()} function.
29307
29308@item remote-host
29309The IP address or fully qualified domain name of the Internet
29310host to which you want to connect.
29311
29312@item remote-port
29313The TCP or UDP port number to use on the given @var{remote-host}.
29314Again, use @samp{0} if you don't care, or else a well-known
29315service name.
29316@end table
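
To make the layout of the five components concrete, the following sketch
assembles special @value{FN}s from their parts.  The helper
@code{inet_name()} is hypothetical; it is not part of @command{gawk},
which only sees the resulting string:

```awk
# inet_name --- assemble a special file name from its five components.
# (Hypothetical helper; not part of gawk.)

function inet_name(net_type, protocol, local_port, remote_host, remote_port)
{
    return "/" net_type "/" protocol "/" local_port \
           "/" remote_host "/" remote_port
}

BEGIN {
    # A TCP client: local port 0 lets the system pick the port,
    # and the remote port is given as a well-known service name
    print inet_name("inet", "tcp", 0, "localhost", "daytime")

    # Force IPv4 and UDP, with a numeric remote port
    print inet_name("inet4", "udp", 0, "localhost", 13)
}
```

This program only prints the two names, @file{/inet/tcp/0/localhost/daytime}
and @file{/inet4/udp/0/localhost/13}; no connection is made until such a
name is used with the @samp{|&} operator.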
29317
29318@cindex @command{gawk} @subentry @code{ERRNO} variable in
29319@cindex @code{ERRNO} variable
29320@quotation NOTE
29321Failure in opening a two-way socket will result in a nonfatal error
29322being returned to the calling code. The value of @code{ERRNO} indicates
29323the error (@pxref{Auto-set}).
29324@end quotation
29325
29326Consider the following very simple example:
29327
29328@example
29329BEGIN @{
29330    Service = "/inet/tcp/0/localhost/daytime"
29331    Service |& getline
29332    print $0
29333    close(Service)
29334@}
29335@end example
29336
29337This program reads the current date and time from the local system's
29338TCP @code{daytime} server.
29339It then prints the results and closes the connection.
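
The preceding program does not check whether the connection succeeded.
Because a failure to open the socket is nonfatal, a more defensive
version can test the return value of @code{getline} and consult
@code{ERRNO}.  The following is only a sketch: the helper
@code{read_line_from()}, the global variable @code{Failure}, and the
wording of the messages are assumptions made for illustration:

```awk
# read_line_from --- read one line from a network service.
# Returns the line, or "" on failure; the global Failure then
# describes what went wrong.
# (Hypothetical helper, for illustration only.)

function read_line_from(service,    line, status)
{
    Failure = ""
    status = (service |& getline line)
    if (status < 0)             # could not connect, or read error
        Failure = (ERRNO != "") ? ERRNO : "unknown error"
    else if (status == 0)       # connected, but got end-of-file
        Failure = "no data received"
    close(service)
    return (status > 0) ? line : ""
}

BEGIN {
    Service = "/inet/tcp/0/localhost/daytime"
    reply = read_line_from(Service)
    if (Failure != "")
        print "cannot read from " Service ": " Failure > "/dev/stderr"
    else
        print reply
}
```

The value of @code{ERRNO} is saved before the call to @code{close()},
so that a later operation cannot overwrite it.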
29340
29341Because this topic is extensive, the use of @command{gawk} for
29342TCP/IP programming is documented separately.
29343@ifinfo
29344See
29345@inforef{Top, , General Introduction, gawkinet, @value{GAWKINETTITLE}},
29346@end ifinfo
29347@ifnotinfo
29348See
29349@uref{https://www.gnu.org/software/gawk/manual/gawkinet/,
29350@cite{@value{GAWKINETTITLE}}},
29351which comes as part of the @command{gawk} distribution,
29352@end ifnotinfo
29353for a much more complete introduction and discussion, as well as
29354extensive examples.
29355
29356@quotation NOTE
29357@command{gawk} can only open direct sockets. There is currently
no way to access services available over Secure Sockets Layer
29359(SSL); this includes any web service whose URL starts with @samp{https://}.
29360@end quotation
29361
29362
29363@node Profiling
29364@section Profiling Your @command{awk} Programs
29365@cindex @command{awk} programs @subentry profiling
29366@cindex profiling @command{awk} programs
29367@cindex @code{awkprof.out} file
29368@cindex files @subentry @code{awkprof.out}
29369
29370You may produce execution traces of your @command{awk} programs.
29371This is done by passing the option @option{--profile} to @command{gawk}.
29372When @command{gawk} has finished running, it creates a profile of your program in a file
29373named @file{awkprof.out}. Because it is profiling, it also executes up to 45% slower than
29374@command{gawk} normally does.
29375
29376@cindex @option{--profile} option
29377As shown in the following example,
29378the @option{--profile} option can be used to change the name of the file
29379where @command{gawk} will write the profile:
29380
29381@example
29382gawk --profile=myprog.prof -f myprog.awk data1 data2
29383@end example
29384
29385@noindent
29386In the preceding example, @command{gawk} places the profile in
29387@file{myprog.prof} instead of in @file{awkprof.out}.
29388
29389Here is a sample session showing a simple @command{awk} program,
29390its input data, and the results from running @command{gawk} with the
29391@option{--profile} option.  First, the @command{awk} program:
29392
29393@example
29394BEGIN @{ print "First BEGIN rule" @}
29395
29396END @{ print "First END rule" @}
29397
29398/foo/ @{
29399    print "matched /foo/, gosh"
29400    for (i = 1; i <= 3; i++)
29401        sing()
29402@}
29403
29404@{
29405    if (/foo/)
29406        print "if is true"
29407    else
29408        print "else is true"
29409@}
29410
29411BEGIN @{ print "Second BEGIN rule" @}
29412
29413END @{ print "Second END rule" @}
29414
29415function sing(    dummy)
29416@{
29417    print "I gotta be me!"
29418@}
29419@end example
29420
29421Following is the input data:
29422
29423@example
29424foo
29425bar
29426baz
29427foo
29428junk
29429@end example
29430
29431Here is the @file{awkprof.out} that results from running the
29432@command{gawk} profiler on this program and data (this example also
29433illustrates that @command{awk} programmers sometimes get up very early
29434in the morning to work):
29435
29436@cindex @code{BEGIN} pattern @subentry profiling and
29437@cindex @code{END} pattern @subentry profiling and
29438@example
29439    # gawk profile, created Mon Sep 29 05:16:21 2014
29440
29441    # BEGIN rule(s)
29442
29443    BEGIN @{
29444 1          print "First BEGIN rule"
29445    @}
29446
29447    BEGIN @{
29448 1          print "Second BEGIN rule"
29449    @}
29450
29451    # Rule(s)
29452
29453 5  /foo/ @{ # 2
29454 2          print "matched /foo/, gosh"
29455 6          for (i = 1; i <= 3; i++) @{
29456 6                  sing()
29457            @}
29458    @}
29459
29460 5  @{
29461 5          if (/foo/) @{ # 2
29462 2                  print "if is true"
29463 3          @} else @{
29464 3                  print "else is true"
29465            @}
29466    @}
29467
29468    # END rule(s)
29469
29470    END @{
29471 1          print "First END rule"
29472    @}
29473
29474    END @{
29475 1          print "Second END rule"
29476    @}
29477
29478
29479    # Functions, listed alphabetically
29480
29481 6  function sing(dummy)
29482    @{
29483 6          print "I gotta be me!"
29484    @}
29485@end example
29486
29487This example illustrates many of the basic features of profiling output.
29488They are as follows:
29489
29490@itemize @value{BULLET}
29491@item
29492The program is printed in the order @code{BEGIN} rules,
29493@code{BEGINFILE} rules,
29494pattern--action rules,
29495@code{ENDFILE} rules, @code{END} rules, and functions, listed
29496alphabetically.
29497Multiple @code{BEGIN} and @code{END} rules retain their
29498separate identities, as do
29499multiple @code{BEGINFILE} and @code{ENDFILE} rules.
29500
29501@cindex patterns @subentry counts, in a profile
29502@item
29503Pattern--action rules have two counts.
29504The first count, to the left of the rule, shows how many times
29505the rule's pattern was @emph{tested}.
29506The second count, to the right of the rule's opening left brace
29507in a comment,
29508shows how many times the rule's action was @emph{executed}.
29509The difference between the two indicates how many times the rule's
29510pattern evaluated to false.
29511
29512@item
29513Similarly,
29514the count for an @code{if}-@code{else} statement shows how many times
29515the condition was tested.
29516To the right of the opening left brace for the @code{if}'s body
29517is a count showing how many times the condition was true.
29518The count for the @code{else}
29519indicates how many times the test failed.
29520
29521@cindex loops @subentry count for header, in a profile
29522@item
29523The count for a loop header (such as @code{for}
29524or @code{while}) shows how many times the loop test was executed.
29525(Because of this, you can't just look at the count on the first
29526statement in a rule to determine how many times the rule was executed.
29527If the first statement is a loop, the count is misleading.)
29528
29529@cindex functions @subentry user-defined @subentry counts, in a profile
29530@cindex user-defined @subentry functions @subentry counts, in a profile
29531@item
29532For user-defined functions, the count next to the @code{function}
29533keyword indicates how many times the function was called.
29534The counts next to the statements in the body show how many times
29535those statements were executed.
29536
29537@cindex @code{@{@}} (braces)
29538@cindex braces (@code{@{@}})
29539@item
29540The layout uses ``K&R'' style with TABs.
29541Braces are used everywhere, even when
29542the body of an @code{if}, @code{else}, or loop is only a single statement.
29543
29544@cindex @code{()} (parentheses) @subentry in a profile
29545@cindex parentheses @code{()} @subentry in a profile
29546@item
29547Parentheses are used only where needed, as indicated by the structure
29548of the program and the precedence rules.
29549For example, @samp{(3 + 5) * 4} means add three and five, then multiply
29550the total by four.  However, @samp{3 + 5 * 4} has no parentheses, and
29551means @samp{3 + (5 * 4)}.
Explicit parentheses in the source program are retained, though.
29553
29554@ignore
29555@item
29556All string concatenations are parenthesized too.
29557(This could be made a bit smarter.)
29558@end ignore
29559
29560@item
29561Parentheses are used around the arguments to @code{print}
29562and @code{printf} only when
29563the @code{print} or @code{printf} statement is followed by a redirection.
29564Similarly, if
29565the target of a redirection isn't a scalar, it gets parenthesized.
29566
29567@item
29568@command{gawk} supplies leading comments in
29569front of the @code{BEGIN} and @code{END} rules,
29570the @code{BEGINFILE} and @code{ENDFILE} rules,
29571the pattern--action rules, and the functions.
29572
29573@item
29574Functions are listed alphabetically. All functions in the @code{awk}
29575namespace are listed first, in alphabetical order.  Then come the
29576functions in namespaces.  The namespaces are listed in alphabetical order,
29577and the functions within each namespace are listed alphabetically.
29578
29579@end itemize
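
Because every count appears at the start of its line, a short
@command{gawk} program can list the most frequently executed lines of a
profile.  The following sketch relies only on the layout shown in the
sample output above; the helper @code{profile_count()}, the arrays
@code{Count} and @code{Text}, and the @value{FN} @file{hot.awk} used
below are hypothetical:

```awk
# profile_count --- return the execution count that starts a line of
# profiler output, or -1 if the line does not begin with a count.
# (Hypothetical helper; it relies only on the layout shown above.)

function profile_count(line,    m)
{
    if (match(line, /^[[:space:]]*([0-9]+)[[:space:]]/, m))
        return m[1] + 0
    return -1
}

# Remember every line that carries a leading count
(n = profile_count($0)) >= 0 {
    sub(/^[[:space:]]*[0-9]+[[:space:]]+/, "")
    Count[NR] = n
    Text[NR] = $0
}

END {
    # Visit the saved lines from the highest count to the lowest
    PROCINFO["sorted_in"] = "@val_num_desc"
    for (i in Count)
        printf "%6d  %s\n", Count[i], Text[i]
}
```

If the program is saved as @file{hot.awk}, it could be run as
@samp{gawk -f hot.awk awkprof.out}.  Keep in mind that the count on a
rule's pattern or on a loop header reflects how often the test was
evaluated, not how often the body was executed.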
29580
29581The profiled version of your program may not look exactly like what you
29582typed when you wrote it.  This is because @command{gawk} creates the
29583profiled version by ``pretty-printing'' its internal representation of
29584the program.  The advantage to this is that @command{gawk} can produce
29585a standard representation.
29586Also, things such as:
29587
29588@example
29589/foo/
29590@end example
29591
29592@noindent
29593come out as:
29594
29595@example
29596/foo/   @{
29597    print
29598@}
29599@end example
29600
29601@noindent
29602which is correct, but possibly unexpected.
29603(If a program uses both @samp{print $0} and plain
29604@samp{print}, that distinction is retained.)
29605
29606@cindex profiling @command{awk} programs @subentry dynamically
29607@cindex @command{gawk} @subentry dynamic profiling
29608@cindex @command{gawk} @subentry profiling programs
29609@cindex dynamic profiling
29610Besides creating profiles when a program has completed,
29611@command{gawk} can produce a profile while it is running.
29612This is useful if your @command{awk} program goes into an
29613infinite loop and you want to see what has been executed.
29614To use this feature, run @command{gawk} with the @option{--profile}
29615option in the background:
29616
29617@example
29618$ @kbd{gawk --profile -f myprog &}
29619[1] 13992
29620@end example
29621
29622@cindex @command{kill} command, dynamic profiling
29623@cindex @code{USR1} signal, for dynamic profiling
29624@cindex @code{SIGUSR1} signal, for dynamic profiling
29625@cindex signals @subentry @code{USR1}/@code{SIGUSR1}, for profiling
29626@noindent
29627The shell prints a job number and process ID number; in this case, 13992.
29628Use the @command{kill} command to send the @code{USR1} signal
29629to @command{gawk}:
29630
29631@example
29632$ @kbd{kill -USR1 13992}
29633@end example
29634
29635@noindent
29636As usual, the profiled version of the program is written to
29637@file{awkprof.out}, or to a different file if one was specified with
29638the @option{--profile} option.
29639
29640Along with the regular profile, as shown earlier, the profile file
29641includes a trace of any active functions:
29642
29643@example
29644# Function Call Stack:
29645
29646#   3. baz
29647#   2. bar
29648#   1. foo
29649# -- main --
29650@end example
29651
29652You may send @command{gawk} the @code{USR1} signal as many times as you like.
29653Each time, the profile and function call trace are appended to the output
29654profile file.
29655
29656@cindex @code{HUP} signal, for dynamic profiling
29657@cindex @code{SIGHUP} signal, for dynamic profiling
29658@cindex signals @subentry @code{HUP}/@code{SIGHUP}, for profiling
29659If you use the @code{HUP} signal instead of the @code{USR1} signal,
29660@command{gawk} produces the profile and the function call trace and then exits.
29661
29662@cindex @code{INT} signal (MS-Windows)
29663@cindex @code{SIGINT} signal (MS-Windows)
29664@cindex signals @subentry @code{INT}/@code{SIGINT} (MS-Windows)
29665@cindex @code{QUIT} signal (MS-Windows)
29666@cindex @code{SIGQUIT} signal (MS-Windows)
29667@cindex signals @subentry @code{QUIT}/@code{SIGQUIT} (MS-Windows)
29668When @command{gawk} runs on MS-Windows systems, it uses the
29669@code{INT} and @code{QUIT} signals for producing the profile, and in
29670the case of the @code{INT} signal, @command{gawk} exits.  This is
29671because these systems don't support the @command{kill} command, so the
29672only signals you can deliver to a program are those generated by the
29673keyboard.  The @code{INT} signal is generated by the
29674@kbd{Ctrl-c} or @kbd{Ctrl-BREAK} key, while the
29675@code{QUIT} signal is generated by the @kbd{Ctrl-\} key.
29676
29677@cindex pretty printing
29678Finally, @command{gawk} also accepts another option, @option{--pretty-print}.
29679When called this way, @command{gawk} ``pretty-prints'' the program into
29680@file{awkprof.out}, without any execution counts.
29681
29682@quotation NOTE
29683Once upon a time, the @option{--pretty-print} option would also run
29684your program.  This is no longer the case.
29685@end quotation
29686
29687@cindex profiling, pretty printing, difference with
29688@cindex pretty printing @subentry profiling, difference with
29689There is a significant difference between the output created when
29690profiling, and that created when pretty-printing.  Pretty-printed output
29691preserves the original comments that were in the program, although their
29692placement may not correspond exactly to their original locations in the
29693source code. However, no comments should be lost.
29694Also, @command{gawk} does the best it can to preserve
29695the distinction between comments at the end of a statement and comments
29696on lines by themselves. This isn't always perfect, though.
29697
29698However, as a deliberate design decision, profiling output @emph{omits}
29699the original program's comments. This allows you to focus on the
29700execution count data and helps you avoid the temptation to use the
29701profiler for pretty-printing.
29702
29703Additionally, pretty-printed output does not have the leading indentation
29704that the profiling output does. This makes it easy to pretty-print your
29705code once development is completed, and then use the result as the final
29706version of your program.
29707
29708Because the internal representation of your program is formatted to
29709recreate an @command{awk} program, profiling and pretty-printing
29710automatically disable @command{gawk}'s default optimizations.
29711
29712Profiling and pretty-printing also preserve the original format of numeric
29713constants; if you used an octal or hexadecimal value in your source
29714code, it will appear that way in the output.
29715
29716@node Extension Philosophy
29717@section Builtin Features versus Extensions
29718
29719As this and subsequent @value{CHAPTER}s show, @command{gawk} has a
large number of extensions over standard @command{awk} built in to
29721the program.  These have developed over time.  More recently, the
29722focus has moved to using the extension mechanism (@pxref{Dynamic Extensions})
29723for adding features.  This @value{SECTION} discusses the ``guiding philosophy''
29724behind what should be added to the interpreter as a built-in
29725feature versus what should be done in extensions.
29726
29727There are several goals:
29728
29729@enumerate 1
29730@item
29731Keep the language @command{awk}; it should not become unrecognizable, even
29732if programs in it will only run on @command{gawk}.
29733
29734@item
29735Keep the core from getting any larger unless absolutely necessary.
29736
29737@item
29738Add new functionality either in @command{awk} scripts (@option{-f},
29739@code{@@include}) or in loadable extensions written in C or C++
29740(@option{-l}, @code{@@load}).
29741
29742@item
Extend the core interpreter only if some feature:

@c sublist
@enumerate A
@item
Is truly desirable.
29749@item
29750Cannot be done via library files or loadable extensions.
29751@item
29752Can be implemented without too much pain in the core.
29753@end enumerate
29754@end enumerate
29755Combining modules with @command{awk} files is a powerful technique.
29756Some of the sample extensions demonstrate this.
29757
29758Loading extensions and library files should not be done automatically,
29759because then there's overhead that most users don't want or need.
29760
29761@node Advanced Features Summary
29762@section Summary
29763
29764@itemize @value{BULLET}
29765@item
29766The @option{--non-decimal-data} option causes @command{gawk} to treat
29767octal- and hexadecimal-looking input data as octal and hexadecimal.
29768This option should be used with caution or not at all; use of @code{strtonum()}
29769is preferable.
29770Note that this option may disappear in a future version of @command{gawk}.
29771
29772@item
29773You can take over complete control of sorting in @samp{for (@var{indx} in @var{array})}
29774array traversal by setting @code{PROCINFO["sorted_in"]} to the name of a user-defined
29775function that does the comparison of array elements based on index and value.
29776
29777@item
29778Similarly, you can supply the name of a user-defined comparison function as the
third argument to either @code{asort()} or @code{asorti()} to control how
29780those functions sort arrays. Or you may provide one of the predefined control
29781strings that work for @code{PROCINFO["sorted_in"]}.
29782
29783@item
29784You can use the @samp{|&} operator to create a two-way pipe to a coprocess.
29785You read from the coprocess with @code{getline} and write to it with @code{print}
29786or @code{printf}. Use @code{close()} to close off the coprocess completely, or
29787optionally, close off one side of the two-way communications.
29788
29789@item
29790By using special @value{FN}s with the @samp{|&} operator, you can open a
29791TCP/IP (or UDP/IP) connection to remote hosts on the Internet. @command{gawk}
29792supports both IPv4 and IPv6.
29793
29794@item
29795You can generate statement count profiles of your program. This can help you
29796determine which parts of your program may be taking the most time and let
you tune them more easily.  Sending the @code{USR1} signal while profiling causes
@command{gawk} to dump the profile, including a function call stack, and keep going.
29799
29800@item
29801You can also just ``pretty-print'' the program.
29802
29803@item
29804New features should be developed using the extension mechanism if possible;
29805they should be added to the core interpreter only as a last resort.
29806@end itemize
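
As a minimal sketch of the two sorting items above, the following program
uses one comparison function both for @code{for} loop traversal and as the
third argument to @code{asort()}.  The function name @code{by_value_desc}
and the sample data are made up for illustration:

@example
# Hypothetical comparison function: larger values sort first.
function by_value_desc(i1, v1, i2, v2)
@{
    return v2 - v1
@}

BEGIN @{
    count["apple"] = 3; count["pear"] = 7; count["fig"] = 5

    # Control the order of `for (name in count)' traversal
    PROCINFO["sorted_in"] = "by_value_desc"
    for (name in count)
        print name, count[name]

    # Use the same function as the third argument to asort()
    n = asort(count, sorted, "by_value_desc")
    for (i = 1; i <= n; i++)
        print i, sorted[i]
@}
@end example

@noindent
With this data, the loop should visit @code{pear}, @code{fig}, and
@code{apple} in that order, and @code{sorted} should end up holding
7, 5, and 3.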
29807
29808
29809@node Internationalization
29810@chapter Internationalization with @command{gawk}
29811
29812@cindex Robbins @subentry Malka
29813@cindex Moon, Sailor
29814@cindex Sailor Moon @seeentry{Moon, Sailor}
29815@quotation
29816@i{Moon@dots{} Gorgeous@dots{} MEDITATION!}
29817@author Pretty Guardian Sailor Moon Eternal, The Movie
29818@end quotation
29819
29820@quotation
29821@i{It probably sounded better in Japanese.}
29822@author Malka Robbins
29823@end quotation
29824
29825Once upon a time, computer makers
29826wrote software that worked only in English.
29827Eventually, hardware and software vendors noticed that if their
29828systems worked in the native languages of non-English-speaking
29829countries, they were able to sell more systems.
29830As a result, internationalization and localization
29831of programs and software systems became a common practice.
29832
29833@cindex internationalization @subentry localization
29834@cindex @command{gawk} @subentry internationalization @seeentry{internationalization}
29835@cindex internationalization @subentry localization @subentry @command{gawk} and
29836For many years, the ability to provide internationalization
29837was largely restricted to programs written in C and C++.
29838This @value{CHAPTER} describes the underlying library @command{gawk}
29839uses for internationalization, as well as how
29840@command{gawk} makes internationalization
29841features available at the @command{awk} program level.
29842Having internationalization available at the @command{awk} level
29843gives software developers additional flexibility---they are no
29844longer forced to write in C or C++ when internationalization is
29845a requirement.
29846
29847@menu
29848* I18N and L10N::               Internationalization and Localization.
29849* Explaining gettext::          How GNU @command{gettext} works.
29850* Programmer i18n::             Features for the programmer.
29851* Translator i18n::             Features for the translator.
29852* I18N Example::                A simple i18n example.
29853* Gawk I18N::                   @command{gawk} is also internationalized.
29854* I18N Summary::                Summary of I18N stuff.
29855@end menu
29856
29857@node I18N and L10N
29858@section Internationalization and Localization
29859
29860@cindex internationalization
29861@cindex localization @seeentry{internationalization, localization}
29862@cindex internationalization @subentry localization
29863@dfn{Internationalization} means writing (or modifying) a program once,
29864in such a way that it can use multiple languages without requiring
29865further source code changes.
29866@dfn{Localization} means providing the data necessary for an
29867internationalized program to work in a particular language.
29868Most typically, these terms refer to features such as the language
29869used for printing error messages, the language used to read
29870responses, and information related to how numerical and
29871monetary values are printed and read.
29872
29873@node Explaining gettext
29874@section GNU @command{gettext}
29875
29876@cindex internationalizing a program
29877@cindex @command{gettext} library
29878@command{gawk} uses GNU @command{gettext} to provide its internationalization
29879features.
29880The facilities in GNU @command{gettext} focus on messages: strings printed
29881by a program, either directly or via formatting with @code{printf} or
29882@code{sprintf()}.@footnote{For some operating systems, the @command{gawk}
29883port doesn't support GNU @command{gettext}.
29884Therefore, these features are not available
29885if you are using one of those operating systems. Sorry.}
29886
29887@cindex portability @subentry @command{gettext} library and
29888When using GNU @command{gettext}, each application has its own
29889@dfn{text domain}.  This is a unique name, such as @samp{kpilot} or @samp{gawk},
29890that identifies the application.
29891A complete application may have multiple components---programs written
29892in C or C++, as well as scripts written in @command{sh} or @command{awk}.
29893All of the components use the same text domain.
29894
29895To make the discussion concrete, assume we're writing an application
29896named @command{guide}.  Internationalization consists of the
29897following steps, in this order:
29898
29899@enumerate
29900@item
29901The programmer reviews the source for all of @command{guide}'s components
29902and marks each string that is a candidate for translation.
29903For example, @code{"`-F': option required"} is a good candidate for translation.
29904A table with strings of option names is not (e.g., @command{gawk}'s
29905@option{--profile} option should remain the same, no matter what the local
29906language).
29907
29908@cindex @code{textdomain()} function (C library)
29909@cindex C library functions @subentry @code{textdomain()}
29910@item
29911The programmer indicates the application's text domain
(@code{"guide"}) to the @command{gettext} library,
29913by calling the @code{textdomain()} function.
29914
29915@cindex @code{.pot} files
29916@cindex files @subentry @code{.pot}
29917@cindex portable object @subentry template files
29918@cindex files @subentry portable object @subentry template file (@file{.pot})
29919@item
29920Messages from the application are extracted from the source code and
29921collected into a portable object template file (@file{guide.pot}),
29922which lists the strings and their translations.
29923The translations are initially empty.
29924The original (usually English) messages serve as the key for
29925lookup of the translations.
29926
29927@cindex @code{.po} files
29928@cindex files @subentry @code{.po}
29929@cindex portable object @subentry files
29930@cindex files @subentry portable object
29931@item
29932For each language with a translator, @file{guide.pot}
is copied to a portable object file (@file{.po})
29934and translations are created and shipped with the application.
29935For example, there might be a @file{fr.po} for a French translation.
29936
29937@cindex @code{.gmo} files
29938@cindex files @subentry @code{.gmo}
29939@cindex message object files
29940@cindex files @subentry message object
29941@item
29942Each language's @file{.po} file is converted into a binary
29943message object (@file{.gmo}) file.
29944A message object file contains the original messages and their
29945translations in a binary format that allows fast lookup of translations
29946at runtime.
29947
29948@item
29949When @command{guide} is built and installed, the binary translation files
29950are installed in a standard place.
29951
29952@cindex @code{bindtextdomain()} function (C library)
29953@cindex C library functions @subentry @code{bindtextdomain()}
29954@item
29955For testing and development, it is possible to tell @command{gettext}
29956to use @file{.gmo} files in a different directory than the standard
29957one by using the @code{bindtextdomain()} function.
29958
29959@cindex @code{.gmo} files @subentry specifying directory of
29960@cindex files @subentry @code{.gmo} @subentry specifying directory of
29961@cindex message object files @subentry specifying directory of
29962@cindex files @subentry message object @subentry specifying directory of
29963@item
29964At runtime, @command{guide} looks up each string via a call
29965to @code{gettext()}.  The returned string is the translated string
29966if available, or the original string if not.
29967
29968@item
29969If necessary, it is possible to access messages from a different
29970text domain than the one belonging to the application, without
29971having to switch the application's default text domain back
29972and forth.
29973@end enumerate
29974
29975@cindex @code{gettext()} function (C library)
29976@cindex C library functions @subentry @code{gettext()}
29977In C (or C++), the string marking and dynamic translation lookup
29978are accomplished by wrapping each string in a call to @code{gettext()}:
29979
29980@example
29981printf("%s", gettext("Don't Panic!\n"));
29982@end example
29983
29984The tools that extract messages from source code pull out all
29985strings enclosed in calls to @code{gettext()}.
29986
29987@cindex @code{_} (underscore) @subentry C macro
29988@cindex underscore (@code{_}) @subentry C macro
29989The GNU @command{gettext} developers, recognizing that typing
29990@samp{gettext(@dots{})} over and over again is both painful and ugly to look
29991at, use the macro @samp{_} (an underscore) to make things easier:
29992
29993@example
29994/* In the standard header file: */
29995#define _(str) gettext(str)
29996
29997/* In the program text: */
29998printf("%s", _("Don't Panic!\n"));
29999@end example
30000
30001@cindex internationalization @subentry localization @subentry locale categories
30002@cindex @command{gettext} library @subentry locale categories
30003@cindex locale categories
30004@noindent
30005This reduces the typing overhead to just three extra characters per string
30006and is considerably easier to read as well.
30007
30008There are locale @dfn{categories}
30009for different types of locale-related information.
30010The defined locale categories that @command{gettext} knows about are:
30011
30012@table @code
30013@cindex @code{LC_MESSAGES} locale category
30014@item LC_MESSAGES
30015Text messages.  This is the default category for @command{gettext}
30016operations, but it is possible to supply a different one explicitly,
30017if necessary.  (It is almost never necessary to supply a different category.)
30018
30019@cindex sorting characters in different languages
30020@cindex @code{LC_COLLATE} locale category
30021@item LC_COLLATE
30022Text-collation information (i.e., how different characters
30023and/or groups of characters sort in a given language).
30024
30025@cindex @code{LC_CTYPE} locale category
30026@item LC_CTYPE
30027Character-type information (alphabetic, digit, upper- or lowercase, and
30028so on) as well as character encoding.
30029@ignore
30030In June 2001 Bruno Haible wrote:
30031- Description of LC_CTYPE: It determines both
30032  1. character encoding,
30033  2. character type information.
30034  (For example, in both KOI8-R and ISO-8859-5 the character type information
  is the same - cyrillic letters count as 'alpha' - but the encoding is
30036  different.)
30037@end ignore
30038This information is accessed via the
30039POSIX character classes in regular expressions,
30040such as @code{/[[:alnum:]]/}
30041(@pxref{Bracket Expressions}).
30042
30043@cindex monetary information, localization
30044@cindex currency symbols, localization
30045@cindex internationalization @subentry localization @subentry monetary information
30046@cindex internationalization @subentry localization @subentry currency symbols
30047@cindex @code{LC_MONETARY} locale category
30048@item LC_MONETARY
30049Monetary information, such as the currency symbol, and whether the
30050symbol goes before or after a number.
30051
30052@cindex @code{LC_NUMERIC} locale category
30053@item LC_NUMERIC
30054Numeric information, such as which characters to use for the decimal
30055point and the thousands separator.@footnote{Americans
30056use a comma every three decimal places and a period for the decimal
30057point, while many Europeans do exactly the opposite:
300581,234.56 versus 1.234,56.}
30059
30060@cindex time @subentry localization and
30061@cindex dates @subentry information related to, localization
30062@cindex @code{LC_TIME} locale category
30063@item LC_TIME
30064Time- and date-related information, such as 12- or 24-hour clock, month printed
30065before or after the day in a date, local month abbreviations, and so on.
30066
30067@cindex @code{LC_ALL} locale category
30068@item LC_ALL
30069All of the above.  (Not too useful in the context of @command{gettext}.)
30070@end table
30071
30072@quotation NOTE
30073@cindex @env{LANGUAGE} environment variable
30074@cindex environment variables @subentry @env{LANGUAGE}
30075As described in @ref{Locales}, environment variables with the same
30076name as the locale categories (@env{LC_CTYPE}, @env{LC_ALL}, etc.)
30077influence @command{gawk}'s behavior (and that of other utilities).
30078
Normally, these variables also affect how the @command{gettext} library
30080finds translations.  However, the @env{LANGUAGE} environment variable
30081overrides the @env{LC_@var{xxx}} variables. Many GNU/Linux systems
30082may define this variable without your knowledge, causing @command{gawk}
30083to not find the correct translations.  If this happens to you,
30084look to see if @env{LANGUAGE} is defined, and if so, use the shell's
30085@command{unset} command to remove it.
30086@end quotation
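
If translations seem to be ignored, a short check such as the following
can show whether @env{LANGUAGE} is set in the environment that
@command{gawk} sees.  This is only a diagnostic sketch; it reads the
standard @code{ENVIRON} array and changes nothing:

@example
BEGIN @{
    # ENVIRON holds the environment inherited by gawk
    if ("LANGUAGE" in ENVIRON)
        printf "LANGUAGE is set to \"%s\"\n", ENVIRON["LANGUAGE"]
    else
        print "LANGUAGE is not set"
@}
@end example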
30087
30088@cindex @env{GAWK_LOCALE_DIR} environment variable
30089@cindex environment variables @subentry @env{GAWK_LOCALE_DIR}
30090For testing translations of @command{gawk} itself, you can set
30091the @env{GAWK_LOCALE_DIR} environment variable. See the documentation
30092for the C @code{bindtextdomain()} function and also see
30093@ref{Other Environment Variables}.
30094
30095@node Programmer i18n
30096@section Internationalizing @command{awk} Programs
30097@cindex @command{awk} programs @subentry internationalizing
30098
30099@command{gawk} provides the following variables for
30100internationalization:
30101
30102@table @code
30103@cindex @code{TEXTDOMAIN} variable
30104@item TEXTDOMAIN
30105This variable indicates the application's text domain.
30106For compatibility with GNU @command{gettext}, the default
30107value is @code{"messages"}.
30108
30109@cindex internationalization @subentry localization @subentry marked strings
30110@cindex strings @subentry for localization
30111@item _"your message here"
30112String constants marked with a leading underscore
30113are candidates for translation at runtime.
30114String constants without a leading underscore are not translated.
30115@end table
30116
30117@command{gawk} provides the following functions for
30118internationalization:
30119
30120@table @code
30121@cindexgawkfunc{dcgettext}
30122@item @code{dcgettext(@var{string}} [@code{,} @var{domain} [@code{,} @var{category}]]@code{)}
30123Return the translation of @var{string} in
30124text domain @var{domain} for locale category @var{category}.
30125The default value for @var{domain} is the current value of @code{TEXTDOMAIN}.
30126The default value for @var{category} is @code{"LC_MESSAGES"}.
30127
30128If you supply a value for @var{category}, it must be a string equal to
30129one of the known locale categories described in
30130@ifnotinfo
30131the previous @value{SECTION}.
30132@end ifnotinfo
30133@ifinfo
30134@ref{Explaining gettext}.
30135@end ifinfo
30136You must also supply a text domain.  Use @code{TEXTDOMAIN} if
30137you want to use the current domain.
30138
30139@quotation CAUTION
30140The order of arguments to the @command{awk} version
30141of the @code{dcgettext()} function is purposely different from the order for
30142the C version.  The @command{awk} version's order was
30143chosen to be simple and to allow for reasonable @command{awk}-style
30144default arguments.
30145@end quotation
30146
30147@cindexgawkfunc{dcngettext}
30148@item @code{dcngettext(@var{string1}, @var{string2}, @var{number}} [@code{,} @var{domain} [@code{,} @var{category}]]@code{)}
30149Return the plural form used for @var{number} of the
30150translation of @var{string1} and @var{string2} in text domain
30151@var{domain} for locale category @var{category}. @var{string1} is the
30152English singular variant of a message, and @var{string2} is the English plural
30153variant of the same message.
30154The default value for @var{domain} is the current value of @code{TEXTDOMAIN}.
30155The default value for @var{category} is @code{"LC_MESSAGES"}.
30156
30157The same remarks about argument order as for the @code{dcgettext()} function apply.
30158
30159@cindex @code{.gmo} files @subentry specifying directory of
30160@cindex files @subentry @code{.gmo} @subentry specifying directory of
30161@cindex message object files @subentry specifying directory of
30162@cindex files @subentry message object @subentry specifying directory of
30163@cindexgawkfunc{bindtextdomain}
30164@item @code{bindtextdomain(@var{directory}} [@code{,} @var{domain} ]@code{)}
30165Change the directory in which
30166@command{gettext} looks for @file{.gmo} files, in case they
30167will not or cannot be placed in the standard locations
30168(e.g., during testing).
30169Return the directory in which @var{domain} is ``bound.''
30170
30171The default @var{domain} is the value of @code{TEXTDOMAIN}.
30172If @var{directory} is the null string (@code{""}), then
30173@code{bindtextdomain()} returns the current binding for the
30174given @var{domain}.
30175@end table
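
The following sketch shows how the optional arguments to
@code{dcgettext()} and @code{bindtextdomain()} described above might be
used together.  The text domain @code{"guide"} is the running example from
this @value{CHAPTER}; the message @code{"Monday"} and the use of the
@code{"LC_TIME"} category are assumptions made purely for illustration:

@example
BEGIN @{
    TEXTDOMAIN = "guide"

    # Default domain (TEXTDOMAIN) and default category ("LC_MESSAGES")
    msg = dcgettext("Don't Panic")

    # An explicit category requires an explicit domain as well
    day = dcgettext("Monday", TEXTDOMAIN, "LC_TIME")

    # A null directory only queries the current binding
    dir = bindtextdomain("")

    print msg
    print day
    print "guide is bound to", dir
@}
@end example

@noindent
If no translation is installed for @code{"guide"}, the two lookups should
simply return the original strings.  The directory reported by
@code{bindtextdomain()} depends on the system.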
30176
30177To use these facilities in your @command{awk} program, follow these steps:
30178
30179@enumerate
30180@cindex @code{BEGIN} pattern @subentry @code{TEXTDOMAIN} variable and
30181@cindex @code{TEXTDOMAIN} variable @subentry @code{BEGIN} pattern and
30182@item
30183Set the variable @code{TEXTDOMAIN} to the text domain of
30184your program.  This is best done in a @code{BEGIN} rule
30185(@pxref{BEGIN/END}),
30186or it can also be done via the @option{-v} command-line
30187option (@pxref{Options}):
30188
30189@example
30190BEGIN @{
30191    TEXTDOMAIN = "guide"
30192    @dots{}
30193@}
30194@end example
30195
30196@cindex @code{_} (underscore) @subentry translatable strings
30197@cindex underscore (@code{_}) @subentry translatable strings
30198@item
30199Mark all translatable strings with a leading underscore (@samp{_})
30200character.  It @emph{must} be adjacent to the opening
30201quote of the string.  For example:
30202
30203@example
30204print _"hello, world"
30205x = _"you goofed"
30206printf(_"Number of users is %d\n", nusers)
30207@end example
30208
30209@item
30210If you are creating strings dynamically, you can
30211still translate them, using the @code{dcgettext()}
30212built-in function:@footnote{Thanks to Bruno Haible for this
30213example.}
30214
30215@example
30216if (groggy)
30217    message = dcgettext("%d customers disturbing me\n", "adminprog")
30218else
30219    message = dcgettext("enjoying %d customers\n", "adminprog")
30220printf(message, ncustomers)
30221@end example
30222
30223Here, the call to @code{dcgettext()} supplies a different
30224text domain (@code{"adminprog"}) in which to find the
30225message, but it uses the default @code{"LC_MESSAGES"} category.
30226
30227The previous example only works if @code{ncustomers} is greater than one.
30228This example would be better done with @code{dcngettext()}:
30229
30230@example
30231if (groggy)
30232    message = dcngettext("%d customer disturbing me\n",
30233                         "%d customers disturbing me\n",
30234                         ncustomers, "adminprog")
30235else
30236    message = dcngettext("enjoying %d customer\n",
30237                         "enjoying %d customers\n",
30238                         ncustomers, "adminprog")
30239printf(message, ncustomers)
30240@end example
30241
30242
30243@cindex @code{LC_MESSAGES} locale category @subentry @code{bindtextdomain()} function (@command{gawk})
30244@item
30245During development, you might want to put the @file{.gmo}
30246file in a private directory for testing.  This is done
30247with the @code{bindtextdomain()} built-in function:
30248
30249@example
30250BEGIN @{
30251   TEXTDOMAIN = "guide"   # our text domain
30252   if (Testing) @{
30253       # where to find our files
30254       bindtextdomain("testdir")
30255       # joe is in charge of adminprog
30256       bindtextdomain("../joe/testdir", "adminprog")
30257   @}
30258   @dots{}
30259@}
30260@end example
30261
30262@end enumerate
30263
30264@xref{I18N Example}
30265for an example program showing the steps to create
30266and use translations from @command{awk}.
30267
30268@node Translator i18n
30269@section Translating @command{awk} Programs
30270
30271@cindex @code{.po} files
30272@cindex files @subentry @code{.po}
30273@cindex portable object @subentry files
30274@cindex files @subentry portable object
30275Once a program's translatable strings have been marked, they must
30276be extracted to create the initial @file{.pot} file.
30277As part of translation, it is often helpful to rearrange the order
30278in which arguments to @code{printf} are output.
30279
30280@command{gawk}'s @option{--gen-pot} command-line option extracts
30281the messages and is discussed next.
30282After that, @code{printf}'s ability to
30283rearrange the order for @code{printf} arguments at runtime
30284is covered.
30285
30286@menu
30287* String Extraction::           Extracting marked strings.
30288* Printf Ordering::             Rearranging @code{printf} arguments.
30289* I18N Portability::            @command{awk}-level portability issues.
30290@end menu
30291
30292@node String Extraction
30293@subsection Extracting Marked Strings
30294@cindex strings @subentry extracting
30295@cindex @option{--gen-pot} option
30296@cindex command line @subentry options @subentry string extraction
30297@cindex string @subentry extraction (internationalization)
30298@cindex marked string extraction (internationalization)
30299@cindex extraction, of marked strings (internationalization)
30300
30301@cindex @option{--gen-pot} option
30302Once your @command{awk} program is working, and all the strings have
30303been marked and you've set (and perhaps bound) the text domain,
30304it is time to produce translations.
30305First, use the @option{--gen-pot} command-line option to create
30306the initial @file{.pot} file:
30307
30308@example
30309gawk --gen-pot -f guide.awk > guide.pot
30310@end example
30311
30312@cindex @command{xgettext} utility
30313When run with @option{--gen-pot}, @command{gawk} does not execute your
30314program.  Instead, it parses it as usual and prints all marked strings
30315to standard output in the format of a GNU @command{gettext} Portable Object
30316file.  Also included in the output are any constant strings that
30317appear as the first argument to @code{dcgettext()} or as the first and
second arguments to @code{dcngettext()}.@footnote{The
30319@command{xgettext} utility that comes with GNU
30320@command{gettext} can handle @file{.awk} files.}
30321You should distribute the generated @file{.pot} file with
30322your @command{awk} program; translators will eventually use it
30323to provide you translations that you can also then distribute.
30324@xref{I18N Example}
30325for the full list of steps to go through to create and test
30326translations for @command{guide}.
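
As a rough illustration of which strings @option{--gen-pot} picks up,
consider the following fragment.  The variable names and the text domain
@code{"adminprog"} are only examples, and the comments restate the rule
given above rather than actual @command{gawk} output:

@example
BEGIN @{
    # Extracted: marked with a leading underscore
    print _"hello, world"

    # Extracted: constant first argument to dcgettext()
    msg = dcgettext("you goofed", "adminprog")

    # Extracted: constant first and second arguments to dcngettext()
    fmt = dcngettext("%d user\n", "%d users\n", nusers, "adminprog")

    # Not extracted: an ordinary, unmarked string constant
    text = "not marked"

    # Not extracted: the first argument is a variable, not a constant
    again = dcgettext(text, "adminprog")
@}
@end example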
30327
30328@node Printf Ordering
30329@subsection Rearranging @code{printf} Arguments
30330
30331@cindex @code{printf} statement @subentry positional specifiers
30332@cindex positional specifiers, @code{printf} statement
30333Format strings for @code{printf} and @code{sprintf()}
30334(@pxref{Printf})
30335present a special problem for translation.
30336Consider the following:@footnote{This example is borrowed
30337from the GNU @command{gettext} manual.}
30338
30339@example
30340printf(_"String `%s' has %d characters\n",
          string, length(string))
30342@end example
30343
30344A possible German translation for this might be:
30345
30346@example
30347"%d Zeichen lang ist die Zeichenkette `%s'\n"
30348@end example
30349
30350The problem should be obvious: the order of the format
30351specifications is different from the original!
30352Even though @code{gettext()} can return the translated string
30353at runtime,
30354it cannot change the argument order in the call to @code{printf}.
30355
30356To solve this problem, @code{printf} format specifiers may have
30357an additional optional element, which we call a @dfn{positional specifier}.
30358For example:
30359
30360@example
30361"%2$d Zeichen lang ist die Zeichenkette `%1$s'\n"
30362@end example
30363
30364Here, the positional specifier consists of an integer count, which indicates which
30365argument to use, and a @samp{$}. Counts are one-based, and the
30366format string itself is @emph{not} included.  Thus, in the following
30367example, @samp{string} is the first argument and @samp{length(string)} is the second:
30368
30369@example
30370$ @kbd{gawk 'BEGIN @{}
30371>     @kbd{string = "Don\47t Panic"}
30372>     @kbd{printf "%2$d characters live in \"%1$s\"\n",}
30373>                         @kbd{string, length(string)}
30374> @kbd{@}'}
30375@print{} 11 characters live in "Don't Panic"
30376@end example
30377
30378If present, positional specifiers come first in the format specification,
30379before the flags, the field width, and/or the precision.
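
As a small sketch of this ordering, the following format places a
positional specifier before a @samp{-} flag and a width in the first
specification, and before a @samp{0} flag and a width in the second.
The values are arbitrary:

@example
$ @kbd{gawk 'BEGIN @{ printf "%2$-8s|%1$05d\n", 42, "left" @}'}
@print{} left    |00042
@end example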
30380
30381Positional specifiers can be used with the dynamic field width and
30382precision capability:
30383
30384@example
30385$ @kbd{gawk 'BEGIN @{}
30386>    @kbd{printf("%*.*s\n", 10, 20, "hello")}
30387>    @kbd{printf("%3$*2$.*1$s\n", 20, 10, "hello")}
30388> @kbd{@}'}
30389@print{}      hello
30390@print{}      hello
30391@end example
30392
30393@quotation NOTE
30394When using @samp{*} with a positional specifier, the @samp{*}
30395comes first, then the integer position, and then the @samp{$}.
30396This is somewhat counterintuitive.
30397@end quotation
30398
30399@cindex @code{printf} statement @subentry positional specifiers @subentry mixing with regular formats
30400@cindex positional specifiers, @code{printf} statement @subentry mixing with regular formats
30401@cindex format specifiers @subentry mixing regular with positional specifiers
30402@command{gawk} does not allow you to mix regular format specifiers
30403and those with positional specifiers in the same string:
30404
30405@example
30406@group
30407$ @kbd{gawk 'BEGIN @{ printf "%d %3$s\n", 1, 2, "hi" @}'}
30408@error{} gawk: cmd. line:1: fatal: must use `count$' on all formats or none
30409@end group
30410@end example
30411
30412@quotation NOTE
30413There are some pathological cases that @command{gawk} may fail to
30414diagnose.  In such cases, the output may not be what you expect.
30415It's still a bad idea to try mixing them, even if @command{gawk}
30416doesn't detect it.
30417@end quotation
30418
30419Although positional specifiers can be used directly in @command{awk} programs,
30420their primary purpose is to help in producing correct translations of
30421format strings into languages different from the one in which the program
30422is first written.
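
To make this concrete, here is a minimal sketch that stands in for a
translation lookup by storing both format strings in variables.  The
variable names @code{fmt_en} and @code{fmt_de} are hypothetical; in a real
program, the German format would come from the message catalog and not
from the source code:

@example
BEGIN @{
    string = "Don't Panic"

    # Original format, written with positional specifiers
    fmt_en = "String `%1$s' has %2$d characters\n"
    # Format as a translator might supply it
    fmt_de = "%2$d Zeichen lang ist die Zeichenkette `%1$s'\n"

    # The argument order in the calls is identical
    printf fmt_en, string, length(string)
    printf fmt_de, string, length(string)
@}
@end example

@noindent
Both calls pass @code{string} first and @code{length(string)} second.
Only the format string decides where each value appears in the output,
so the program itself does not need to change for each language.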
30423
30424@node I18N Portability
30425@subsection @command{awk} Portability Issues
30426
30427@cindex portability @subentry internationalization and
30428@cindex internationalization @subentry localization @subentry portability and
30429@command{gawk}'s internationalization features were purposely chosen to
30430have as little impact as possible on the portability of @command{awk}
30431programs that use them to other versions of @command{awk}.
30432Consider this program:
30433
30434@example
30435BEGIN @{
30436    TEXTDOMAIN = "guide"
30437    if (Test_Guide)   # set with -v
30438        bindtextdomain("/test/guide/messages")
30439    print _"don't panic!"
30440@}
30441@end example
30442
30443@noindent
30444As written, it won't work on other versions of @command{awk}.
30445However, it is actually almost portable, requiring very little
30446change:
30447
30448@itemize @value{BULLET}
30449@cindex @code{TEXTDOMAIN} variable @subentry portability and
30450@item
30451Assignments to @code{TEXTDOMAIN} won't have any effect,
30452because @code{TEXTDOMAIN} is not special in other @command{awk} implementations.
30453
30454@item
30455Non-GNU versions of @command{awk} treat marked strings
30456as the concatenation of a variable named @code{_} with the string
30457following it.@footnote{This is good fodder for an ``Obfuscated
30458@command{awk}'' contest.} Typically, the variable @code{_} has
30459the null string (@code{""}) as its value, leaving the original string constant as
30460the result.
30461
30462@item
30463By defining ``dummy'' functions to replace @code{dcgettext()}, @code{dcngettext()},
30464and @code{bindtextdomain()}, the @command{awk} program can be made to run, but
30465all the messages are output in the original language.
30466For example:
30467
30468@cindex @code{bindtextdomain()} function (@command{gawk}) @subentry portability and
30469@cindex @code{dcgettext()} function (@command{gawk}) @subentry portability and
30470@cindex @code{dcngettext()} function (@command{gawk}) @subentry portability and
30471@example
30472@c file eg/lib/libintl.awk
30473function bindtextdomain(dir, domain)
30474@{
30475    return dir
30476@}
30477
30478function dcgettext(string, domain, category)
30479@{
30480    return string
30481@}
30482
30483function dcngettext(string1, string2, number, domain, category)
30484@{
30485    return (number == 1 ? string1 : string2)
30486@}
30487@c endfile
30488@end example
30489
30490@item
30491The use of positional specifications in @code{printf} or
30492@code{sprintf()} is @emph{not} portable.
30493To support @code{gettext()} at the C level, many systems' C versions of
30494@code{sprintf()} do support positional specifiers.  But it works only if
30495enough arguments are supplied in the function call.  Many versions of
30496@command{awk} pass @code{printf} formats and arguments unchanged to the
30497underlying C library version of @code{sprintf()}, but only one format and
30498argument at a time.  What happens if a positional specification is
30499used is anybody's guess.
30500However, because the positional specifications are primarily for use in
30501@emph{translated} format strings, and because non-GNU @command{awk}s never
30502retrieve the translated string, this should not be a problem in practice.
30503@end itemize
30504
30505@node I18N Example
30506@section A Simple Internationalization Example
30507
30508Now let's look at a step-by-step example of how to internationalize and
30509localize a simple @command{awk} program, using @file{guide.awk} as our
30510original source:
30511
30512@example
30513@c file eg/prog/guide.awk
30514BEGIN @{
30515    TEXTDOMAIN = "guide"
30516    bindtextdomain(".")  # for testing
30517    print _"Don't Panic"
30518    print _"The Answer Is", 42
30519    print "Pardon me, Zaphod who?"
30520@}
30521@c endfile
30522@end example
30523
30524@noindent
30525Run @samp{gawk --gen-pot} to create the @file{.pot} file:
30526
30527@example
30528$ @kbd{gawk --gen-pot -f guide.awk > guide.pot}
30529@end example
30530
30531@noindent
30532This produces:
30533
30534@example
30535@c file eg/data/guide.po
30536#: guide.awk:4
30537msgid "Don't Panic"
30538msgstr ""
30539
30540#: guide.awk:5
30541msgid "The Answer Is"
30542msgstr ""
30543
30544@c endfile
30545@end example
30546
30547This original portable object template file is saved and reused for each language
30548into which the application is translated.  The @code{msgid}
30549is the original string and the @code{msgstr} is the translation.
30550
30551@quotation NOTE
30552Strings not marked with a leading underscore do not
30553appear in the @file{guide.pot} file.
30554@end quotation
30555
30556Next, the messages must be translated.
30557Here is a translation to a hypothetical dialect of English,
30558called ``Mellow'':@footnote{Perhaps it would be better if it were
30559called ``Hippy.'' Ah, well.}
30560
30561@example
30562@group
30563$ @kbd{cp guide.pot guide-mellow.po}
30564@var{Add translations to} guide-mellow.po @dots{}
30565@end group
30566@end example
30567
30568@noindent
30569Following are the translations:
30570
30571@example
30572@c file eg/data/guide-mellow.po
30573#: guide.awk:4
30574msgid "Don't Panic"
30575msgstr "Hey man, relax!"
30576
30577#: guide.awk:5
30578msgid "The Answer Is"
30579msgstr "Like, the scoop is"
30580
30581@c endfile
30582@end example
30583
30584@cindex GNU/Linux
30585@quotation NOTE
30586The following instructions apply to GNU/Linux with the GNU C Library. Be
30587aware that the actual steps may change over time, that the following
30588description may not be accurate for all GNU/Linux distributions, and
30589that things may work entirely differently on other operating systems.
30590@end quotation
30591
30592The next step is to make the directory to hold the binary message object
30593file and then to create the @file{guide.mo} file.
30594The directory has the form @file{@var{locale}/LC_MESSAGES}, where
30595@var{locale} is a locale name known to the C @command{gettext} routines.
30596
30597@cindex @env{LANGUAGE} environment variable
30598@cindex environment variables @subentry @env{LANGUAGE}
30599@cindex @env{LC_ALL} environment variable
30600@cindex environment variables @subentry @env{LC_ALL}
30601@cindex @env{LANG} environment variable
30602@cindex environment variables @subentry @env{LANG}
30603@cindex @env{LC_MESSAGES} environment variable
30604@cindex environment variables @subentry @env{LC_MESSAGES}
30605How do we know which locale to use?  It turns out that there are
30606four different environment variables used by the C @command{gettext} routines.
In order, they are @env{$LANGUAGE}, @env{$LC_ALL}, @env{$LC_MESSAGES}, and
@env{$LANG}.@footnote{Well, sort of. It seems that if @env{$LC_ALL}
is set to @samp{C}, then no translations are done. Go figure.}
30610Thus, we check the value of @env{$LANGUAGE}:
30611
30612@example
30613$ @kbd{echo $LANGUAGE}
30614@print{} en_US.UTF-8
30615@end example
30616
30617@noindent
30618We next make the directories:
30619
30620@example
30621$ @kbd{mkdir en_US.UTF-8 en_US.UTF-8/LC_MESSAGES}
30622@end example
30623
30624@cindex @code{.po} files @subentry converting to @code{.mo}
30625@cindex files @subentry @code{.po} @subentry converting to @code{.mo}
30626@cindex @code{.mo} files, converting from @code{.po}
30627@cindex files @subentry @code{.mo}, converting from @code{.po}
30628@cindex portable object @subentry files @subentry converting to message object files
30629@cindex files @subentry portable object @subentry converting to message object files
30630@cindex message object files @subentry converting from portable object files
30631@cindex files @subentry message object @subentry converting from portable object files
30632@cindex @command{msgfmt} utility
30633The @command{msgfmt} utility converts the human-readable
30634@file{.po} file into a machine-readable @file{.mo} file.
30635By default, @command{msgfmt} creates a file named @file{messages}.
30636This file must be renamed and placed in the proper directory (using
30637the @option{-o} option) so that @command{gawk} can find it:
30638
30639@example
30640$ @kbd{msgfmt guide-mellow.po -o en_US.UTF-8/LC_MESSAGES/guide.mo}
30641@end example
30642
30643Finally, we run the program to test it:
30644
30645@example
30646$ @kbd{gawk -f guide.awk}
30647@print{} Hey man, relax!
30648@print{} Like, the scoop is 42
30649@print{} Pardon me, Zaphod who?
30650@end example
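
If the translated messages do not appear, it can help to confirm what
actually ended up in the compiled catalog.  As a sketch, assuming that the
@command{msgunfmt} utility from GNU @command{gettext} is installed and
that the directory layout is the one shown earlier, the following command
converts the binary file back into readable @file{.po} form on standard
output:

@example
$ @kbd{msgunfmt en_US.UTF-8/LC_MESSAGES/guide.mo}
@end example

@noindent
The result should contain the same @code{msgid} and @code{msgstr} pairs
as @file{guide-mellow.po}.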
30651
30652If the three replacement functions for @code{dcgettext()}, @code{dcngettext()},
30653and @code{bindtextdomain()}
30654(@pxref{I18N Portability})
30655are in a file named @file{libintl.awk},
30656then we can run @file{guide.awk} unchanged as follows:
30657
30658@example
30659$ @kbd{gawk --posix -f guide.awk -f libintl.awk}
30660@print{} Don't Panic
30661@print{} The Answer Is 42
30662@print{} Pardon me, Zaphod who?
30663@end example
30664
30665@node Gawk I18N
30666@section @command{gawk} Can Speak Your Language
30667
30668@command{gawk} itself has been internationalized
30669using the GNU @command{gettext} package.
30670(GNU @command{gettext} is described in
30671complete detail in
30672@ifinfo
30673@inforef{Top, , GNU @command{gettext} utilities, gettext, GNU @command{gettext} utilities}.)
30674@end ifinfo
30675@ifnotinfo
30676@uref{https://www.gnu.org/software/gettext/manual/,
30677@cite{GNU @command{gettext} utilities}}.)
30678@end ifnotinfo
30679As of this writing, the latest version of GNU @command{gettext} is
30680@uref{ftp://ftp.gnu.org/gnu/gettext/gettext-0.19.8.1.tar.gz,
30681@value{PVERSION} 0.19.8.1}.
30682
30683If a translation of @command{gawk}'s messages exists,
30684then @command{gawk} produces usage messages, warnings,
30685and fatal errors in the local language.
30686
30687@node I18N Summary
30688@section Summary
30689
30690@itemize @value{BULLET}
30691@item
30692Internationalization means writing a program such that it can use multiple
30693languages without requiring source code changes.  Localization means
30694providing the data necessary for an internationalized program to work
30695in a particular language.
30696
30697@item
30698@command{gawk} uses GNU @command{gettext} to let you internationalize
30699and localize @command{awk} programs.  A program's text domain identifies
30700the program for grouping all messages and other data together.
30701
30702@item
30703You mark a program's strings for translation by preceding them with
30704an underscore. Once that is done, the strings are extracted into a
30705@file{.pot} file.  This file is copied for each language into a @file{.po}
file, and the @file{.po} files are compiled into @file{.mo} files for
30707use at runtime.
30708
30709@item
30710You can use positional specifications with @code{sprintf()} and
30711@code{printf} to rearrange the placement of argument values in formatted
30712strings and output. This is useful for the translation of format
30713control strings.
30714
30715@item
30716The internationalization features have been designed so that they
30717can be easily worked around in a standard @command{awk}.
30718
30719@item
30720@command{gawk} itself has been internationalized and ships with
30721a number of translations for its messages.
30722
30723@end itemize
30724
30725
30726@node Debugger
30727@chapter Debugging @command{awk} Programs
30728@cindex debugging @subentry @command{awk} programs
30729
30730@c The original text for this chapter was contributed by Efraim Yawitz.
30731
30732It would be nice if computer programs worked perfectly the first time they
30733were run, but in real life, this rarely happens for programs of
30734any complexity.  Thus, most programming languages have facilities available
30735for ``debugging'' programs, and @command{awk} is no exception.
30736
30737The @command{gawk} debugger is purposely modeled after
30738@uref{https://www.gnu.org/software/gdb/, the GNU Debugger (GDB)}
30739command-line debugger.  If you are familiar with GDB, learning
30740how to use @command{gawk} for debugging your programs is easy.
30741
30742@menu
30743* Debugging::                   Introduction to @command{gawk} debugger.
30744* Sample Debugging Session::    Sample debugging session.
30745* List of Debugger Commands::   Main debugger commands.
30746* Readline Support::            Readline support.
30747* Limitations::                 Limitations and future plans.
30748* Debugging Summary::           Debugging summary.
30749@end menu
30750
30751@node Debugging
30752@section Introduction to the @command{gawk} Debugger
30753
30754This @value{SECTION} introduces debugging in general and begins
30755the discussion of debugging in @command{gawk}.
30756
30757@menu
30758* Debugging Concepts::          Debugging in General.
30759* Debugging Terms::             Additional Debugging Concepts.
30760* Awk Debugging::               Awk Debugging.
30761@end menu
30762
30763@node Debugging Concepts
30764@subsection Debugging in General
30765
30766(If you have used debuggers in other languages, you may want to skip
30767ahead to @ref{Awk Debugging}.)
30768
30769Of course, a debugging program cannot remove bugs for you, because it has
30770no way of knowing what you or your users consider a ``bug'' versus a
30771``feature.''  (Sometimes, we humans have a hard time with this ourselves.)
30772In that case, what can you expect from such a tool?  The answer to that
30773depends on the language being debugged, but in general, you can expect at
30774least the following:
30775
30776@cindex debugger @subentry capabilities
30777@itemize @value{BULLET}
30778@item
30779The ability to watch a program execute its instructions one by one,
30780giving you, the programmer, the opportunity to think about what is happening
30781on a time scale of seconds, minutes, or hours, rather than the nanosecond
30782time scale at which the code usually runs.
30783
30784@item
30785The opportunity to not only passively observe the operation of your
30786program, but to control it and try different paths of execution, without
30787having to change your source files.
30788
30789@item
30790The chance to see the values of data in the program at any point in
30791execution, and also to change that data on the fly, to see how that
30792affects what happens afterward.  (This often includes the ability
30793to look at internal data structures besides the variables you actually
30794defined in your code.)
30795
30796@item
30797The ability to obtain additional information about your program's state
30798or even its internal structure.
30799@end itemize
30800
30801All of these tools provide a great amount of help in using your own
30802skills and understanding of the goals of your program to find where it
30803is going wrong (or, for that matter, to better comprehend a perfectly
30804functional program that you or someone else wrote).
30805
30806@node Debugging Terms
30807@subsection Debugging Concepts
30808
30809@cindex debugger @subentry concepts
Before diving into the details, we need to introduce several
30811important concepts that apply to just about all debuggers.
30812The following list defines terms used throughout the rest of
30813this @value{CHAPTER}:
30814
30815@table @dfn
30816@cindex call stack @subentry explanation of
30817@cindex stack frame (debugger)
30818@item Stack frame
30819Programs generally call functions during the course of their execution.
30820One function can call another, or a function can call itself (recursion).
30821You can view the chain of called functions (main program calls A, which
30822calls B, which calls C), as a stack of executing functions: the currently
30823running function is the topmost one on the stack, and when it finishes
30824(returns), the next one down then becomes the active function.
30825Such a stack is termed a @dfn{call stack}.
30826
30827For each function on the call stack, the system maintains a data area
30828that contains the function's parameters, local variables, and return value,
30829as well as any other ``bookkeeping'' information needed to manage the
30830call stack.  This data area is termed a @dfn{stack frame}.
30831
30832@command{gawk} also follows this model, and gives you
30833access to the call stack and to each stack frame. You can see the
30834call stack, as well as from where each function on the stack was
30835invoked. Commands that print the call stack print information about
30836each stack frame (as detailed later on).
30837
30838@item Breakpoint
30839@cindex breakpoint
30840During debugging, you often wish to let the program run until it
30841reaches a certain point, and then continue execution from there one
30842statement (or instruction) at a time.  The way to do this is to set
30843a @dfn{breakpoint} within the program.  A breakpoint is where the
30844execution of the program should break off (stop), so that you can
30845take over control of the program's execution.  You can add and remove
30846as many breakpoints as you like.
30847
30848@item Watchpoint
30849@cindex watchpoint (debugger)
30850A watchpoint is similar to a breakpoint.  The difference is that
30851breakpoints are oriented around the code: stop when a certain point in the
30852code is reached.  A watchpoint, however, specifies that program execution
30853should stop when a @emph{data value} is changed.  This is useful, as
30854sometimes it happens that a variable receives an erroneous value, and it's
30855hard to track down where this happens just by looking at the code.
30856By using a watchpoint, you can stop whenever a variable is assigned to,
30857and usually find the errant code quite quickly.
30858@end table
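
As a preview of how these terms map onto debugger commands (all covered
later in this @value{CHAPTER}), the following sketch sets a breakpoint,
sets a watchpoint, starts the program, and then prints the call stack.
The names @code{my_func} and @code{total} are hypothetical; they stand in
for a function and a global variable in your own program.  The debugger's
responses are omitted:

@example
gawk> @kbd{break my_func}
gawk> @kbd{watch total}
gawk> @kbd{run}
gawk> @kbd{backtrace}
@end example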
30859
30860@node Awk Debugging
30861@subsection @command{awk} Debugging
30862
30863Debugging an @command{awk} program has some specific aspects that are
30864not shared with programs written in other languages.
30865
30866First of all, the fact that @command{awk} programs usually take input
30867line by line from a file or files and operate on those lines using specific
30868rules makes it especially useful to organize viewing the execution of
30869the program in terms of these rules.  As we will see, each @command{awk}
30870rule is treated almost like a function call, with its own specific block
30871of instructions.
30872
30873In addition, because @command{awk} is by design a very concise language,
30874it is easy to lose sight of everything that is going on ``inside''
30875each line of @command{awk} code.  The debugger provides the opportunity
30876to look at the individual primitive instructions carried out
30877by the higher-level @command{awk} commands.@footnote{The ``primitive
30878instructions'' are defined by @command{gawk} itself; the debugger
30879does not work at the level of machine instructions.}
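
For example, the debugger's @code{dump} command (described later in this
@value{CHAPTER}) lists these primitive instructions for the loaded
program.  The output is omitted here, because its exact form depends on
the program and on the version of @command{gawk}:

@example
gawk> @kbd{dump}
@end example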
30880
30881@node Sample Debugging Session
30882@section Sample @command{gawk} Debugging Session
30883@cindex sample debugging session
30884@cindex example debugging session
30885@cindex debugging @subentry example session
30886
30887In order to illustrate the use of @command{gawk} as a debugger, let's look at a sample
30888debugging session.  We will use the @command{awk} implementation of the
30889POSIX @command{uniq} command presented earlier (@pxref{Uniq Program})
30890as our example.
30891
30892@menu
30893* Debugger Invocation::         How to Start the Debugger.
30894* Finding The Bug::             Finding the Bug.
30895@end menu
30896
30897@node Debugger Invocation
30898@subsection How to Start the Debugger
30899@cindex starting the debugger
30900@cindex debugger @subentry how to start
30901
30902Starting the debugger is almost exactly like running @command{gawk} normally,
30903except you have to pass an additional option, @option{--debug}, or the
30904corresponding short option, @option{-D}.  The file(s) containing the
30905program and any supporting code are given on the command line as arguments
30906to one or more @option{-f} options. (@command{gawk} is not designed
30907to debug command-line programs, only programs contained in files.)
30908In our case, we invoke the debugger like this:
30909
30910@example
30911$ @kbd{gawk -D -f getopt.awk -f join.awk -f uniq.awk -1 inputfile}
30912@end example
30913
30914@noindent
where @file{getopt.awk}, @file{join.awk}, and @file{uniq.awk} are all in @env{$AWKPATH}.
30916(Experienced users of GDB or similar debuggers should note that
30917this syntax is slightly different from what you are used to.
30918With the @command{gawk} debugger, you give the arguments for running the program
30919in the command line to the debugger rather than as part of the @code{run}
30920command at the debugger prompt.)
30921The @option{-1} is an option to @file{uniq.awk}.
30922
30923@cindex debugger @subentry prompt
30924Instead of immediately running the program on @file{inputfile}, as
30925@command{gawk} would ordinarily do, the debugger merely loads all
30926the program source files, compiles them internally, and then gives
30927us a prompt:
30928
30929@example
30930gawk>
30931@end example
30932
30933@noindent
30934from which we can issue commands to the debugger.  At this point, no
30935code has been executed.
30936
30937@node Finding The Bug
30938@subsection Finding the Bug
30939
30940Let's say that we are having a problem using (a faulty version of)
30941@file{uniq.awk} in ``field-skipping'' mode, and it doesn't seem to be
30942catching lines which should be identical when skipping the first field,
30943such as:
30944
30945@example
30946awk is a wonderful program!
30947gawk is a wonderful program!
30948@end example
30949
30950This could happen if we were thinking (C-like) of the fields in a record
30951as being numbered in a zero-based fashion, so instead of the lines:
30952
30953@example
30954clast = join(alast, fcount+1, n)
30955cline = join(aline, fcount+1, m)
30956@end example
30957
30958@noindent
30959we wrote:
30960
30961@example
30962clast = join(alast, fcount, n)
30963cline = join(aline, fcount, m)
30964@end example
30965
30966The first thing we usually want to do when trying to investigate a
30967problem like this is to put a breakpoint in the program so that we can
30968watch it at work and catch what it is doing wrong.  A reasonable spot for
30969a breakpoint in @file{uniq.awk} is at the beginning of the function
30970@code{are_equal()}, which compares the current line with the previous one. To set
the breakpoint, use the @code{b} (short for @code{break}) command:
30972
30973@cindex debugger @subentry setting a breakpoint
30974@cindex debugger @subentry commands @subentry @code{breakpoint}
30975@cindex debugger @subentry commands @subentry @code{break}
30976@cindex debugger @subentry commands @subentry @code{b} (@code{break})
30977@example
30978gawk> @kbd{b are_equal}
30979@print{} Breakpoint 1 set at file `awklib/eg/prog/uniq.awk', line 63
30980@end example
30981
30982The debugger tells us the file and line number where the breakpoint is.
30983Now type @samp{r} or @samp{run} and the program runs until it hits
30984the breakpoint for the first time:
30985
30986@cindex debugger @subentry running the program
30987@cindex debugger @subentry commands @subentry @code{run}
30988@example
30989gawk> @kbd{r}
30990@print{} Starting program:
30991@print{} Stopping in Rule ...
30992@print{} Breakpoint 1, are_equal(n, m, clast, cline, alast, aline)
30993         at `awklib/eg/prog/uniq.awk':63
30994@print{} 63          if (fcount == 0 && charcount == 0)
30995gawk>
30996@end example
30997
30998Now we can look at what's going on inside our program.  First of all,
30999let's see how we got to where we are.  At the prompt, we type @samp{bt}
31000(short for ``backtrace''), and the debugger responds with a
31001listing of the current stack frames:
31002
31003@cindex debugger @subentry stack frames, showing
31004@cindex debugger @subentry commands @subentry @code{bt} (@code{backtrace})
31005@cindex debugger @subentry commands @subentry @code{backtrace}
31006@example
31007gawk> @kbd{bt}
31008@print{} #0  are_equal(n, m, clast, cline, alast, aline)
         at `awklib/eg/prog/uniq.awk':63
31010@print{} #1  in main() at `awklib/eg/prog/uniq.awk':88
31011@end example
31012
31013This tells us that @code{are_equal()} was called by the main program at
31014line 88 of @file{uniq.awk}.  (This is not a big surprise, because this
31015is the only call to @code{are_equal()} in the program, but in more complex
31016programs, knowing who called a function and with what parameters can be
31017the key to finding the source of the problem.)
31018
31019Now that we're in @code{are_equal()}, we can start looking at the values
31020of some variables.  Let's say we type @samp{p n}
31021(@code{p} is short for ``print'').  We would expect to see the value of
31022@code{n}, a parameter to @code{are_equal()}.  Actually, the debugger
31023gives us:
31024
31025@cindex debugger @subentry commands @subentry @code{print}
31026@cindex debugger @subentry commands @subentry @code{p} (@code{print})
31027@example
31028gawk> @kbd{p n}
31029@print{} n = untyped variable
31030@end example
31031
31032@noindent
31033In this case, @code{n} is an uninitialized local variable, because the
31034function was called without arguments (@pxref{Function Calls}).
31035
31036A more useful variable to display might be the current record:
31037
31038@example
31039gawk> @kbd{p $0}
31040@print{} $0 = "gawk is a wonderful program!"
31041@end example
31042
31043@noindent
31044This might be a bit puzzling at first, as this is the second line of
31045our test input.  Let's look at @code{NR}:
31046
31047@example
31048gawk> @kbd{p NR}
31049@print{} NR = 2
31050@end example
31051
31052@noindent
31053So we can see that @code{are_equal()} was only called for the second record
31054of the file.  Of course, this is because our program contains a rule for
31055@samp{NR == 1}:
31056
31057@example
31058NR == 1 @{
31059    last = $0
31060    next
31061@}
31062@end example
31063
31064OK, let's just check that that rule worked correctly:
31065
31066@example
31067gawk> @kbd{p last}
31068@print{} last = "awk is a wonderful program!"
31069@end example
31070
31071Everything we have done so far has verified that the program has worked as
31072planned, up to and including the call to @code{are_equal()}, so the problem must
31073be inside this function.  To investigate further, we must begin
31074``stepping through'' the lines of @code{are_equal()}.  We start by typing
31075@samp{n} (for ``next''):
31076
31077@cindex debugger @subentry commands @subentry @code{n} (@code{next})
31078@cindex debugger @subentry commands @subentry @code{next}
31079@example
31080@group
31081gawk> @kbd{n}
31082@print{} 66          if (fcount > 0) @{
31083@end group
31084@end example
31085
31086This tells us that @command{gawk} is now ready to execute line 66, which
31087decides whether to give the lines the special ``field-skipping'' treatment
31088indicated by the @option{-1} command-line option.  (Notice that we skipped
31089from where we were before, at line 63, to here, because the condition
31090in line 63, @samp{if (fcount == 0 && charcount == 0)}, was false.)
31091
31092Continuing to step, we now get to the splitting of the current and
31093last records:
31094
31095@example
31096gawk> @kbd{n}
31097@print{} 67              n = split(last, alast)
31098gawk> @kbd{n}
31099@print{} 68              m = split($0, aline)
31100@end example
31101
31102At this point, we should be curious to see what our records were split
31103into, so we try to look:
31104
31105@example
31106gawk> @kbd{p n m alast aline}
31107@print{} n = 5
31108@print{} m = untyped variable
31109@print{} alast = array, 5 elements
31110@print{} aline = untyped variable
31111@end example
31112
31113@noindent
31114(The @code{p} command can take more than one argument, similar to
31115@command{awk}'s @code{print} statement.)
31116
31117This is kind of disappointing, though.  All we found out is that there
31118are five elements in @code{alast}; @code{m} and @code{aline} don't have
31119values because we are at line 68 but haven't executed it yet.
31120This information is useful enough (we now know that
31121none of the words were accidentally left out), but what if we want to see
31122inside the array?
31123
31124@cindex debugger @subentry printing single array elements
31125The first choice would be to use subscripts:
31126
31127@example
31128gawk> @kbd{p alast[0]}
31129@print{} "0" not in array `alast'
31130@end example
31131
31132@noindent
31133Oops!
31134
31135@example
31136gawk> @kbd{p alast[1]}
31137@print{} alast["1"] = "awk"
31138@end example
31139
31140This would be kind of slow for a 100-member array, though, so
31141@command{gawk} provides a shortcut (reminiscent of another language
31142not to be mentioned):
31143
31144@cindex debugger @subentry printing all array elements
31145@example
31146gawk> @kbd{p @@alast}
31147@print{} alast["1"] = "awk"
31148@print{} alast["2"] = "is"
31149@print{} alast["3"] = "a"
31150@print{} alast["4"] = "wonderful"
31151@print{} alast["5"] = "program!"
31152@end example
31153
31154It looks like we got this far OK.  Let's take another step
31155or two:
31156
31157@example
31158gawk> @kbd{n}
31159@print{} 69              clast = join(alast, fcount, n)
31160gawk> @kbd{n}
31161@print{} 70              cline = join(aline, fcount, m)
31162@end example
31163
31164Well, here we are at our error (sorry to spoil the suspense).  What we
31165had in mind was to join the fields starting from the second one to make
31166the virtual record to compare, and if the first field were numbered zero,
31167this would work.  Let's look at what we've got:
31168
31169@example
31170gawk> @kbd{p cline clast}
31171@print{} cline = "gawk is a wonderful program!"
31172@print{} clast = "awk is a wonderful program!"
31173@end example
31174
31175Hey, those look pretty familiar!  They're just our original, unaltered
31176input records.  A little thinking (the human brain is still the best
31177debugging tool), and we realize that we were off by one!
31178
31179We get out of the debugger:
31180
31181@example
31182gawk> @kbd{q}
31183@print{} The program is running. Exit anyway (y/n)? @kbd{y}
31184@end example
31185
31186@noindent
31187Then we get into an editor:
31188
31189@example
31190clast = join(alast, fcount+1, n)
31191cline = join(aline, fcount+1, m)
31192@end example
31193
31194@noindent
31195and problem solved!
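
To see the off-by-one effect in isolation, here is a minimal sketch that
can be run outside the debugger.  It is only an illustration: the
@code{join()} function below is a simplified stand-in for the library
function in @file{join.awk}, and @code{fcount} is set by hand to mimic
the @option{-1} option:

@example
# Simplified stand-in for join() from join.awk:
# concatenate array[start] .. array[end], separated by single spaces
function join(array, start, end,    result, i)
@{
    result = array[start]
    for (i = start + 1; i <= end; i++)
        result = result " " array[i]
    return result
@}

BEGIN @{
    fcount = 1    # skip one field, as with the -1 option
    n = split("awk is a wonderful program!", alast)
    print join(alast, fcount, n)        # buggy: starts at field 1
    print join(alast, fcount + 1, n)    # fixed: starts at field 2
@}
@end example

@noindent
The first @code{print} statement outputs the whole record, which is why
the faulty comparison never skipped anything; the second outputs
@samp{is a wonderful program!}, the text that should actually be compared.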
31196
31197@node List of Debugger Commands
31198@section Main Debugger Commands
31199
31200The @command{gawk} debugger command set can be divided into the
31201following categories:
31202
31203@itemize @value{BULLET}
31204
31205@item
31206Breakpoint control
31207
31208@item
31209Execution control
31210
31211@item
31212Viewing and changing data
31213
31214@item
31215Working with the stack
31216
31217@item
31218Getting information
31219
31220@item
31221Miscellaneous
31222@end itemize
31223
31224@cindex debugger @subentry repeating commands
Each of these is discussed in the following subsections.
31226In the following descriptions, commands that may be abbreviated
31227show the abbreviation on a second description line.
31228A debugger command name may also be truncated if that partial
31229name is unambiguous. The debugger has the built-in capability to
31230automatically repeat the previous command just by hitting @kbd{Enter}.
31231This works for the commands @code{list}, @code{next}, @code{nexti},
31232@code{step}, @code{stepi}, and @code{continue} executed without any
31233argument.
31234
31235@menu
31236* Breakpoint Control::          Control of Breakpoints.
31237* Debugger Execution Control::  Control of Execution.
31238* Viewing And Changing Data::   Viewing and Changing Data.
31239* Execution Stack::             Dealing with the Stack.
31240* Debugger Info::               Obtaining Information about the Program and
31241                                the Debugger State.
31242* Miscellaneous Debugger Commands:: Miscellaneous Commands.
31243@end menu
31244
31245@node Breakpoint Control
31246@subsection Control of Breakpoints
31247
31248As we saw earlier, the first thing you probably want to do in a debugging
31249session is to get your breakpoints set up, because your program
will otherwise just run as if it were not under the debugger.  The commands for
31251controlling breakpoints are:
31252
31253@table @asis
31254@cindex debugger @subentry commands @subentry @code{b} (@code{break})
31255@cindex debugger @subentry commands @subentry @code{break}
31256@cindex @code{break} debugger command
31257@cindex @code{b} debugger command (alias for @code{break})
31258@cindex set breakpoint
31259@cindex breakpoint @subentry setting
31260@item @code{break} [[@var{filename}@code{:}]@var{n} | @var{function}] [@code{"@var{expression}"}]
31261@itemx @code{b} [[@var{filename}@code{:}]@var{n} | @var{function}] [@code{"@var{expression}"}]
31262Without any argument, set a breakpoint at the next instruction
31263to be executed in the selected stack frame.
31264Arguments can be one of the following:
31265
31266@c @asis for docbook
31267@c nested table
31268@table @asis
31269@item @var{n}
31270Set a breakpoint at line number @var{n} in the current source file.
31271
31272@item @var{filename}@code{:}@var{n}
31273Set a breakpoint at line number @var{n} in source file @var{filename}.
31274
31275@item @var{function}
31276Set a breakpoint at entry to (the first instruction of)
31277function @var{function}.
31278@end table
31279
31280Each breakpoint is assigned a number that can be used to delete it from
31281the breakpoint list using the @code{delete} command.
31282
31283With a breakpoint, you may also supply a condition.  This is an
31284@command{awk} expression (enclosed in double quotes) that the debugger
31285evaluates whenever the breakpoint is reached. If the condition is true,
31286then the debugger stops execution and prompts for a command. Otherwise,
31287it continues executing the program.
31288
31289@cindex debugger @subentry commands @subentry @code{clear}
31290@cindex @code{clear} debugger command
31291@cindex delete breakpoint @subentry at location
31292@cindex breakpoint @subentry at location, how to delete
31293@item @code{clear} [[@var{filename}@code{:}]@var{n} | @var{function}]
31294Without any argument, delete any breakpoint at the next instruction
31295to be executed in the selected stack frame. If the program stops at
31296a breakpoint, this deletes that breakpoint so that the program
31297does not stop at that location again.  Arguments can be one of the following:
31298
31299@c nested table
31300@table @asis
31301@item @var{n}
31302Delete breakpoint(s) set at line number @var{n} in the current source file.
31303
31304@item @var{filename}@code{:}@var{n}
31305Delete breakpoint(s) set at line number @var{n} in source file @var{filename}.
31306
31307@item @var{function}
31308Delete breakpoint(s) set at entry to function @var{function}.
31309@end table
31310
31311@cindex debugger @subentry commands @subentry @code{condition}
31312@cindex @code{condition} debugger command
31313@cindex breakpoint @subentry condition
31314@item @code{condition} @var{n} @code{"@var{expression}"}
31315Add a condition to existing breakpoint or watchpoint @var{n}. The
31316condition is an @command{awk} expression @emph{enclosed in double quotes}
31317that the debugger evaluates
31318whenever the breakpoint or watchpoint is reached. If the condition is true, then
31319the debugger stops execution and prompts for a command. Otherwise,
31320the debugger continues executing the program. If the condition expression is
31321not specified, any existing condition is removed (i.e., the breakpoint or
31322watchpoint is made unconditional).
31323
31324@cindex debugger @subentry commands @subentry @code{d} (@code{delete})
31325@cindex debugger @subentry commands @subentry @code{delete}
31326@cindex @code{delete} debugger command
31327@cindex @code{d} debugger command (alias for @code{delete})
31328@cindex delete breakpoint @subentry by number
31329@cindex breakpoint @subentry delete by number
31330@item @code{delete} [@var{n1 n2} @dots{}] [@var{n}--@var{m}]
31331@itemx @code{d} [@var{n1 n2} @dots{}] [@var{n}--@var{m}]
31332Delete specified breakpoints or a range of breakpoints. Delete
31333all defined breakpoints if no argument is supplied.
31334
31335@cindex debugger @subentry commands @subentry @code{disable}
31336@cindex @code{disable} debugger command
31337@cindex disable breakpoint
31338@cindex breakpoint @subentry how to disable or enable
31339@item @code{disable} [@var{n1 n2} @dots{} | @var{n}--@var{m}]
31340Disable specified breakpoints or a range of breakpoints. Without
31341any argument, disable all breakpoints.
31342
31343@cindex debugger @subentry commands @subentry @code{e} (@code{enable})
31344@cindex debugger @subentry commands @subentry @code{enable}
31345@cindex @code{enable} debugger command
31346@cindex @code{e} debugger command (alias for @code{enable})
31347@cindex enable breakpoint
31348@item @code{enable} [@code{del} | @code{once}] [@var{n1 n2} @dots{}] [@var{n}--@var{m}]
31349@itemx @code{e} [@code{del} | @code{once}] [@var{n1 n2} @dots{}] [@var{n}--@var{m}]
31350Enable specified breakpoints or a range of breakpoints. Without
31351any argument, enable all breakpoints.
31352Optionally, you can specify how to enable the breakpoints:
31353
31354@c nested table
31355@table @code
31356@item del
31357Enable the breakpoints temporarily, then delete each one when
31358the program stops at it.
31359
31360@item once
31361Enable the breakpoints temporarily, then disable each one when
31362the program stops at it.
31363@end table
31364
31365@cindex debugger @subentry commands @subentry @code{ignore}
31366@cindex @code{ignore} debugger command
31367@cindex ignore breakpoint
31368@item @code{ignore} @var{n} @var{count}
31369Ignore breakpoint number @var{n} the next @var{count} times it is
31370hit.
31371
31372@cindex debugger @subentry commands @subentry @code{t} (@code{tbreak})
31373@cindex debugger @subentry commands @subentry @code{tbreak}
31374@cindex @code{tbreak} debugger command
31375@cindex @code{t} debugger command (alias for @code{tbreak})
31376@cindex temporary breakpoint
31377@item @code{tbreak} [[@var{filename}@code{:}]@var{n} | @var{function}]
31378@itemx @code{t} [[@var{filename}@code{:}]@var{n} | @var{function}]
31379Set a temporary breakpoint (enabled for only one stop).
31380The arguments are the same as for @code{break}.
31381@end table
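
As a sketch of how these commands fit together, the following session
sets a breakpoint, skips its first few hits, and finally removes it.
The line number and the breakpoint number are illustrative assumptions,
and the debugger's responses are omitted:

@example
gawk> @kbd{break 42}
gawk> @kbd{ignore 1 3}
gawk> @kbd{disable 1}
gawk> @kbd{enable once 1}
gawk> @kbd{delete 1}
@end example

@noindent
Here, @code{ignore} lets the program pass the breakpoint three times
before stopping at it, @code{disable} switches the breakpoint off without
forgetting it, @samp{enable once} re-arms it for a single stop, and
@code{delete} discards it.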
31382
31383@node Debugger Execution Control
31384@subsection Control of Execution
31385
31386Now that your breakpoints are ready, you can start running the program
31387and observing its behavior.  There are more commands for controlling
31388execution of the program than we saw in our earlier example:
31389
31390@table @asis
31391@cindex debugger @subentry commands @subentry @code{commands}
31392@cindex @code{commands} debugger command
31393@cindex debugger @subentry commands @subentry @code{silent}
31394@cindex @code{silent} debugger command
31395@cindex debugger @subentry commands @subentry @code{end}
31396@cindex @code{end} debugger command
31397@cindex breakpoint @subentry commands to execute at
31398@cindex commands to execute at breakpoint
31399@item @code{commands} [@var{n}]
31400@itemx @code{silent}
31401@itemx @dots{}
31402@itemx @code{end}
31403Set a list of commands to be executed upon stopping at
31404a breakpoint or watchpoint. @var{n} is the breakpoint or watchpoint number.
31405Without a number, the last one set is used. The actual commands follow,
31406starting on the next line, and terminated by the @code{end} command.
31407If the command @code{silent} is in the list, the usual messages about
31408stopping at a breakpoint and the source line are not printed. Any command
31409in the list that resumes execution (e.g., @code{continue}) terminates the list
31410(an implicit @code{end}), and subsequent commands are ignored.
31411For example:
31412
31413@example
31414gawk> @kbd{commands}
31415> @kbd{silent}
31416> @kbd{printf "A silent breakpoint; i = %d\n", i}
31417> @kbd{info locals}
31418> @kbd{set i = 10}
31419> @kbd{continue}
31420> @kbd{end}
31421gawk>
31422@end example
31423
31424@cindex debugger @subentry commands @subentry @code{c} (@code{continue})
31425@cindex debugger @subentry commands @subentry @code{continue}
31426@cindex continue program, in debugger
31427@cindex @code{continue} debugger command
31428@item @code{continue} [@var{count}]
31429@itemx @code{c} [@var{count}]
31430Resume program execution. If continued from a breakpoint and @var{count} is
31431specified, ignore the breakpoint at that location the next @var{count} times
31432before stopping.
31433
31434@cindex debugger @subentry commands @subentry @code{finish}
31435@cindex @code{finish} debugger command
31436@item @code{finish}
31437Execute until the selected stack frame returns.
31438Print the returned value.
31439
31440@cindex debugger @subentry commands @subentry @code{n} (@code{next})
31441@cindex debugger @subentry commands @subentry @code{next}
31442@cindex @code{next} debugger command
31443@cindex @code{n} debugger command (alias for @code{next})
31444@cindex single-step execution, in the debugger
31445@item @code{next} [@var{count}]
31446@itemx @code{n} [@var{count}]
31447Continue execution to the next source line, stepping over function calls.
31448The argument @var{count} controls how many times to repeat the action, as
31449in @code{step}.
31450
31451@cindex debugger @subentry commands @subentry @code{ni} (@code{nexti})
31452@cindex debugger @subentry commands @subentry @code{nexti}
31453@cindex @code{nexti} debugger command
31454@cindex @code{ni} debugger command (alias for @code{nexti})
31455@item @code{nexti} [@var{count}]
31456@itemx @code{ni} [@var{count}]
31457Execute one (or @var{count}) instruction(s), stepping over function calls.
31458
31459@cindex debugger @subentry commands @subentry @code{return}
31460@cindex @code{return} debugger command
31461@item @code{return} [@var{value}]
31462Cancel execution of a function call. If @var{value} (either a string or a
31463number) is specified, it is used as the function's return value. If used in a
31464frame other than the innermost one (the currently executing function; i.e.,
31465frame number 0), discard all inner frames in addition to the selected one,
31466and the caller of that frame becomes the innermost frame.
31467
31468@cindex debugger @subentry commands @subentry @code{r} (@code{run})
31469@cindex debugger @subentry commands @subentry @code{run}
31470@cindex @code{run} debugger command
31471@cindex @code{r} debugger command (alias for @code{run})
31472@item @code{run}
31473@itemx @code{r}
Start or restart execution of the program. When restarting, the debugger
31475retains the current breakpoints, watchpoints, command history,
31476automatic display variables, and debugger options.
31477
31478@cindex debugger @subentry commands @subentry @code{s} (@code{step})
31479@cindex debugger @subentry commands @subentry @code{step}
31480@cindex @code{step} debugger command
31481@cindex @code{s} debugger command (alias for @code{step})
31482@item @code{step} [@var{count}]
31483@itemx @code{s} [@var{count}]
31484Continue execution until control reaches a different source line in the
31485current stack frame, stepping inside any function called within
the line.  If the argument @var{count} is supplied, the debugger steps that many
times before stopping, unless it encounters a breakpoint or watchpoint.
31488
31489@cindex debugger @subentry commands @subentry @code{si} (@code{stepi})
31490@cindex debugger @subentry commands @subentry @code{stepi}
31491@cindex @code{stepi} debugger command
31492@cindex @code{si} debugger command (alias for @code{stepi})
31493@item @code{stepi} [@var{count}]
31494@itemx @code{si} [@var{count}]
31495Execute one (or @var{count}) instruction(s), stepping inside function calls.
31496(For illustration of what is meant by an ``instruction'' in @command{gawk},
31497see the output shown under @code{dump} in @ref{Miscellaneous Debugger Commands}.)
31498
31499@cindex debugger @subentry commands @subentry @code{u} (@code{until})
31500@cindex debugger @subentry commands @subentry @code{until}
31501@cindex @code{until} debugger command
31502@cindex @code{u} debugger command (alias for @code{until})
31503@item @code{until} [[@var{filename}@code{:}]@var{n} | @var{function}]
31504@itemx @code{u} [[@var{filename}@code{:}]@var{n} | @var{function}]
31505Without any argument, continue execution until a line past the current
31506line in the current stack frame is reached. With an argument,
31507continue execution until the specified location is reached, or the current
31508stack frame returns.
31509@end table
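
The difference between the stepping commands is easiest to see when the
program is stopped at a line that calls a user-defined function.
The following sketch assumes such a line; the debugger's responses
are omitted:

@example
gawk> @kbd{step}
gawk> @kbd{finish}
gawk> @kbd{next 3}
@end example

@noindent
In this sequence, @code{step} enters the called function, @code{finish}
runs until that function returns and prints the returned value, and
@samp{next 3} then advances three source lines in the caller, treating any
further function calls as single steps.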
31510
31511@node Viewing And Changing Data
31512@subsection Viewing and Changing Data
31513
The commands for viewing and changing variables inside @command{gawk} are:
31515
31516@table @asis
31517@cindex debugger @subentry commands @subentry @code{display}
31518@cindex @code{display} debugger command
31519@item @code{display} [@var{var} | @code{$}@var{n}]
31520Add variable @var{var} (or field @code{$@var{n}}) to the display list.
31521The value of the variable or field is displayed each time the program stops.
31522Each variable added to the list is identified by a unique number:
31523
31524@example
31525gawk> @kbd{display x}
31526@print{} 10: x = 1
31527@end example
31528
31529@noindent
31530This displays the assigned item number, the variable name, and its current value.
31531If the display variable refers to a function parameter, it is silently
31532deleted from the list as soon as the execution reaches a context where
31533no such variable of the given name exists.
Without an argument, @code{display} shows the current values of
the items on the list.
31536
31537@cindex debugger @subentry commands @subentry @code{eval}
31538@cindex @code{eval} debugger command
31539@cindex evaluate expressions, in debugger
31540@item @code{eval "@var{awk statements}"}
31541Evaluate @var{awk statements} in the context of the running program.
31542You can do anything that an @command{awk} program would do: assign
31543values to variables, call functions, and so on.
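
As a minimal illustration, the following command prints the current
record number and the number of fields in the current record:

@example
gawk> @kbd{eval "print NR, NF"}
@end example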
31544
31545@quotation NOTE
31546You cannot use @code{eval} to execute a statement containing
31547any of the following:
31548@code{exit},
31549@code{getline},
31550@code{next},
31551@code{nextfile},
31552or
31553@code{return}.
31554@end quotation
31555
31556@item @code{eval} @var{param}, @dots{}
31557@itemx @var{awk statements}
31558@itemx @code{end}
31559This form of @code{eval} is similar, but it allows you to define
31560``local variables'' that exist in the context of the
31561@var{awk statements}, instead of using variables or function
31562parameters defined by the program.
31563
31564@cindex debugger @subentry commands @subentry @code{p} (@code{print})
31565@cindex debugger @subentry commands @subentry @code{print}
31566@cindex @code{print} debugger command
31567@cindex @code{p} debugger command (alias for @code{print})
31568@cindex print variables, in debugger
31569@item @code{print} @var{var1}[@code{,} @var{var2} @dots{}]
31570@itemx @code{p} @var{var1}[@code{,} @var{var2} @dots{}]
31571Print the value of a @command{gawk} variable or field.
31572Fields must be referenced by constants:
31573
31574@example
31575gawk> @kbd{print $3}
31576@end example
31577
31578@noindent
31579This prints the third field in the input record (if the specified field does not
31580exist, it prints @samp{Null field}). A variable can be an array element, with
31581the subscripts being constant string values. To print the contents of an array,
31582prefix the name of the array with the @samp{@@} symbol:
31583
31584@example
31585gawk> @kbd{print @@a}
31586@end example
31587
31588@noindent
31589This prints the indices and the corresponding values for all elements in
31590the array @code{a}.
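
A single element is printed by giving its subscript as a constant string.
The following sketch assumes that the program has an array named
@code{counts} with an element indexed by @code{"apple"}; both names are
hypothetical:

@example
gawk> @kbd{print counts["apple"]}
@end example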
31591
31592@cindex debugger @subentry commands @subentry @code{printf}
31593@cindex @code{printf} debugger command
31594@item @code{printf} @var{format} [@code{,} @var{arg} @dots{}]
31595Print formatted text. The @var{format} may include escape sequences,
31596such as @samp{\n}
31597(@pxref{Escape Sequences}).
31598No newline is printed unless one is specified.
31599
31600@cindex debugger @subentry commands @subentry @code{set}
31601@cindex @code{set} debugger command
31602@cindex assign values to variables, in debugger
31603@item @code{set} @var{var}@code{=}@var{value}
31604Assign a constant (number or string) value to an @command{awk} variable
31605or field.
31606String values must be enclosed between double quotes (@code{"}@dots{}@code{"}).
31607
31608You can also set special @command{awk} variables, such as @code{FS},
31609@code{NF}, @code{NR}, and so on.
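
For example, assuming that the program uses variables named @code{limit}
and @code{name} (hypothetical names chosen for illustration), the following
commands assign a number, a string, and a new field separator:

@example
gawk> @kbd{set limit = 25}
gawk> @kbd{set name = "draft"}
gawk> @kbd{set FS = ":"}
@end example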
31610
31611@cindex debugger @subentry commands @subentry @code{w} (@code{watch})
31612@cindex debugger @subentry commands @subentry @code{watch}
31613@cindex @code{watch} debugger command
31614@cindex @code{w} debugger command (alias for @code{watch})
31615@cindex set watchpoint
31616@item @code{watch} @var{var} | @code{$}@var{n} [@code{"@var{expression}"}]
31617@itemx @code{w} @var{var} | @code{$}@var{n} [@code{"@var{expression}"}]
31618Add variable @var{var} (or field @code{$@var{n}}) to the watch list.
31619The debugger then stops whenever
31620the value of the variable or field changes. Each watched item is assigned a
31621number that can be used to delete it from the watch list using the
31622@code{unwatch} command.
31623
31624With a watchpoint, you may also supply a condition.  This is an
31625@command{awk} expression (enclosed in double quotes) that the debugger
31626evaluates whenever the watchpoint is reached. If the condition is true,
31627then the debugger stops execution and prompts for a command. Otherwise,
the debugger continues executing the program.
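
As a sketch, assuming that the program has a variable named @code{total}
(a hypothetical name), the first command below stops the program only when
@code{total} changes and the new value exceeds 1000; the second command
lists the watch list so that you can see the number assigned to the item:

@example
gawk> @kbd{watch total "total > 1000"}
gawk> @kbd{info watch}
@end example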
31629
31630@cindex debugger @subentry commands @subentry @code{undisplay}
31631@cindex @code{undisplay} debugger command
31632@cindex stop automatic display, in debugger
31633@item @code{undisplay} [@var{n}]
31634Remove item number @var{n} (or all items, if no argument) from the
31635automatic display list.
31636
31637@cindex debugger @subentry commands @subentry @code{unwatch}
31638@cindex @code{unwatch} debugger command
31639@cindex delete watchpoint
31640@item @code{unwatch} [@var{n}]
31641Remove item number @var{n} (or all items, if no argument) from the
31642watch list.
31643
31644@end table
31645
31646@node Execution Stack
31647@subsection Working with the Stack
31648
31649Whenever you run a program that contains any function calls,
31650@command{gawk} maintains a stack of all of the function calls leading up
31651to where the program is right now.  You can see how you got to where you are,
31652and also move around in the stack to see what the state of things was in the
31653functions that called the one you are in.  The commands for doing this are:
31654
31655@table @asis
31656@cindex debugger @subentry commands @subentry @code{bt} (@code{backtrace})
31657@cindex debugger @subentry commands @subentry @code{backtrace}
31658@cindex debugger @subentry commands @subentry @code{where} (@code{backtrace})
31659@cindex @code{backtrace} debugger command
31660@cindex @code{bt} debugger command (alias for @code{backtrace})
31661@cindex @code{where} debugger command (alias for @code{backtrace})
31662@cindex call stack @subentry display in debugger
31663@cindex traceback, display in debugger
31664@item @code{backtrace} [@var{count}]
31665@itemx @code{bt} [@var{count}]
31666@itemx @code{where} [@var{count}]
Print a backtrace of all function calls (stack frames), or the innermost @var{count}
frames if @var{count} > 0. Print the outermost @var{count} frames if
31669@var{count} < 0.  The backtrace displays the name and arguments to each
31670function, the source @value{FN}, and the line number.
31671The alias @code{where} for @code{backtrace} is provided for longtime
31672GDB users who may be used to that command.
31673
31674@cindex debugger @subentry commands @subentry @code{down}
31675@cindex @code{down} debugger command
31676@item @code{down} [@var{count}]
31677Move @var{count} (default 1) frames down the stack toward the innermost frame.
31678Then select and print the frame.
31679
31680@cindex debugger @subentry commands @subentry @code{f} (@code{frame})
31681@cindex debugger @subentry commands @subentry @code{frame}
31682@cindex @code{frame} debugger command
31683@cindex @code{f} debugger command (alias for @code{frame})
31684@item @code{frame} [@var{n}]
31685@itemx @code{f} [@var{n}]
31686Select and print stack frame @var{n}.  Frame 0 is the currently executing,
31687or @dfn{innermost}, frame (function call); frame 1 is the frame that
31688called the innermost one. The highest-numbered frame is the one for the
31689main program.  The printed information consists of the frame number,
31690function and argument names, source file, and the source line.
31691
31692@cindex debugger @subentry commands @subentry @code{up}
31693@cindex @code{up} debugger command
31694@item @code{up} [@var{count}]
31695Move @var{count} (default 1) frames up the stack toward the outermost frame.
31696Then select and print the frame.
31697@end table
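
A typical use of these commands, sketched here with the debugger's output
omitted, is to find out how the program reached the current function and
to inspect the caller's state:

@example
gawk> @kbd{backtrace}
gawk> @kbd{up}
gawk> @kbd{info locals}
gawk> @kbd{frame 0}
@end example

@noindent
After @code{up}, the selected frame is the one that called the innermost
function, so @samp{info locals} lists that caller's local variables.
@samp{frame 0} then returns to the currently executing function.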
31698
31699@node Debugger Info
31700@subsection Obtaining Information About the Program and the Debugger State
31701
31702Besides looking at the values of variables, there is often a need to get
31703other sorts of information about the state of your program and of the
31704debugging environment itself.  The @command{gawk} debugger has one command that
31705provides this information, appropriately called @code{info}.  @code{info}
31706is used with one of a number of arguments that tell it exactly what
31707you want to know:
31708
31709@table @asis
31710@cindex debugger @subentry commands @subentry @code{i} (@code{info})
31711@cindex debugger @subentry commands @subentry @code{info}
31712@cindex @code{info} debugger command
31713@cindex @code{i} debugger command (alias for @code{info})
31714@item @code{info} @var{what}
31715@itemx @code{i} @var{what}
31716The value for @var{what} should be one of the following:
31717
31718@c nested table
31719@table @code
31720@item args
31721@cindex show in debugger @subentry function arguments
31722@cindex function arguments, show in debugger
31723List arguments of the selected frame.
31724
31725@item break
31726@cindex show in debugger @subentry breakpoints
31727@cindex breakpoint @subentry show all in debugger
31728List all currently set breakpoints.
31729
31730@item display
31731@cindex automatic displays, in debugger
31732List all items in the automatic display list.
31733
31734@item frame
31735@cindex describe call stack frame, in debugger
31736Give a description of the selected stack frame.
31737
31738@item functions
31739@cindex list function definitions, in debugger
31740@cindex function definitions, list in debugger
31741List all function definitions including source @value{FN}s and
31742line numbers.
31743
31744@item locals
31745@cindex show in debugger @subentry local variables
31746@cindex local variables @subentry show in debugger
31747List local variables of the selected frame.
31748
31749@item source
31750@cindex show in debugger @subentry name of current source file
31751@cindex current source file, show in debugger
31752@cindex source file, show in debugger
31753Print the name of the current source file. Each time the program stops, the
31754current source file is the file containing the current instruction.
31755When the debugger first starts, the current source file is the first file
31756included via the @option{-f} option. The
31757@samp{list @var{filename}:@var{lineno}} command can
31758be used at any time to change the current source.
31759
31760@item sources
31761@cindex show in debugger @subentry all source files
31762@cindex all source files, show in debugger
31763List all program sources.
31764
31765@item variables
31766@cindex list all global variables, in debugger
31767@cindex global variables, show in debugger
31768List all global variables.
31769
31770@item watch
31771@cindex show in debugger @subentry watchpoints
31772@cindex watchpoints, show in debugger
31773List all items in the watch list.
31774@end table
31775@end table
31776
31777Additional commands give you control over the debugger, the ability to
31778save the debugger's state, and the ability to run debugger commands
31779from a file.  The commands are:
31780
31781@table @asis
31782@cindex debugger @subentry commands @subentry @code{o} (@code{option})
31783@cindex debugger @subentry commands @subentry @code{option}
31784@cindex @code{option} debugger command
31785@cindex @code{o} debugger command (alias for @code{option})
31786@cindex display debugger options
31787@cindex debugger @subentry options
31788@item @code{option} [@var{name}[@code{=}@var{value}]]
31789@itemx @code{o} [@var{name}[@code{=}@var{value}]]
31790Without an argument, display the available debugger options
31791and their current values. @samp{option @var{name}} shows the current
31792value of the named option. @samp{option @var{name}=@var{value}} assigns
31793a new value to the named option.
31794The available options are:
31795
31796@c nested table
31797@c asis for docbook
31798@table @asis
31799@item @code{history_size}
31800@cindex debugger @subentry history size
31801Set the maximum number of lines to keep in the history file
31802@file{./.gawk_history}.  The default is 100.
31803
31804@item @code{listsize}
31805@cindex debugger @subentry default list amount
31806Specify the number of lines that @code{list} prints. The default is 15.
31807
31808@item @code{outfile}
31809@cindex redirect @command{gawk} output, in debugger
31810Send @command{gawk} output to a file; debugger output still goes
31811to standard output. An empty string (@code{""}) resets output to
31812standard output.
31813
31814@item @code{prompt}
31815@cindex debugger @subentry prompt
31816Change the debugger prompt. The default is @samp{@w{gawk> }}.
31817
31818@item @code{save_history} [@code{on} | @code{off}]
31819@cindex debugger @subentry history file
31820Save command history to file @file{./.gawk_history}.
31821The default is @code{on}.
31822
31823@item @code{save_options} [@code{on} | @code{off}]
31824@cindex save debugger options
31825Save current options to file @file{./.gawkrc} upon exit.
31826The default is @code{on}.
31827Options are read back into the next session upon startup.
31828
31829@item @code{trace} [@code{on} | @code{off}]
31830@cindex instruction tracing, in debugger
31831@cindex debugger @subentry instruction tracing
31832Turn instruction tracing on or off. The default is @code{off}.
31833@end table
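
For example, the following commands display all options, display the
current value of @code{listsize}, and then change it; the value 25 is
arbitrary:

@example
gawk> @kbd{option}
gawk> @kbd{option listsize}
gawk> @kbd{option listsize=25}
@end example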
31834
31835@cindex debugger @subentry save commands to a file
31836@item @code{save} @var{filename}
31837Save the commands from the current session to the given @value{FN},
so that they can be replayed using the @code{source} command.
31839
31840@item @code{source} @var{filename}
31841@cindex debugger @subentry read commands from a file
31842Run command(s) from a file; an error in any command does not
31843terminate execution of subsequent commands. Comments (lines starting
31844with @samp{#}) are allowed in a command file.
31845Empty lines are ignored; they do @emph{not}
31846repeat the last command.
31847You can't restart the program by having more than one @code{run}
31848command in the file. Also, the list of commands may include additional
31849@code{source} commands; however, the @command{gawk} debugger will not source the
31850same file more than once in order to avoid infinite recursion.
31851
31852In addition to, or instead of, the @code{source} command, you can use
31853the @option{-D @var{file}} or @option{--debug=@var{file}} command-line
31854options to execute commands from a file non-interactively
31855(@pxref{Options}).
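
As a sketch, a command file might look like the following.  The @value{FN}
@file{cmds.dbg} and the function name @code{report} are hypothetical:

@example
# Stop at the function report and show how the program got there.
break report
run

backtrace
info locals
@end example

@noindent
Assuming a program in @file{prog.awk} and input in @file{data.txt}
(also hypothetical names), the file could then be run non-interactively
like so:

@example
$ @kbd{gawk --debug=cmds.dbg -f prog.awk data.txt}
@end example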
31856@end table
31857
31858@node Miscellaneous Debugger Commands
31859@subsection Miscellaneous Commands
31860
31861There are a few more commands that do not fit into the
31862previous categories, as follows:
31863
31864@table @asis
31865@cindex debugger @subentry commands @subentry @code{dump}
31866@cindex @code{dump} debugger command
31867@item @code{dump} [@var{filename}]
31868Dump byte code of the program to standard output or to the file
31869named in @var{filename}.  This prints a representation of the internal
31870instructions that @command{gawk} executes to implement the @command{awk}
31871commands in a program.  This can be very enlightening, as the following
31872partial dump of Davide Brini's obfuscated code
31873(@pxref{Signature Program}) demonstrates:
31874
31875@smallexample
31876@group
31877gawk> @kbd{dump}
31878@print{}        # BEGIN
31879@print{}
31880@print{} [  1:0xfcd340] Op_rule           : [in_rule = BEGIN] [source_file = brini.awk]
31881@end group
31882@print{} [  1:0xfcc240] Op_push_i         : "~" [MALLOC|STRING|STRCUR]
31883@print{} [  1:0xfcc2a0] Op_push_i         : "~" [MALLOC|STRING|STRCUR]
31884@print{} [  1:0xfcc280] Op_match          :
31885@print{} [  1:0xfcc1e0] Op_store_var      : O
31886@print{} [  1:0xfcc2e0] Op_push_i         : "==" [MALLOC|STRING|STRCUR]
31887@print{} [  1:0xfcc340] Op_push_i         : "==" [MALLOC|STRING|STRCUR]
31888@print{} [  1:0xfcc320] Op_equal          :
31889@print{} [  1:0xfcc200] Op_store_var      : o
31890@print{} [  1:0xfcc380] Op_push           : o
31891@print{} [  1:0xfcc360] Op_plus_i         : 0 [MALLOC|NUMCUR|NUMBER]
31892@print{} [  1:0xfcc220] Op_push_lhs       : o [do_reference = true]
31893@print{} [  1:0xfcc300] Op_assign_plus    :
31894@print{} [   :0xfcc2c0] Op_pop            :
31895@print{} [  1:0xfcc400] Op_push           : O
31896@print{} [  1:0xfcc420] Op_push_i         : "" [MALLOC|STRING|STRCUR]
31897@print{} [   :0xfcc4a0] Op_no_op          :
31898@print{} [  1:0xfcc480] Op_push           : O
31899@print{} [   :0xfcc4c0] Op_concat         : [expr_count = 3] [concat_flag = 0]
31900@print{} [  1:0xfcc3c0] Op_store_var      : x
31901@print{} [  1:0xfcc440] Op_push_lhs       : X [do_reference = true]
31902@print{} [  1:0xfcc3a0] Op_postincrement  :
31903@print{} [  1:0xfcc4e0] Op_push           : x
31904@print{} [  1:0xfcc540] Op_push           : o
31905@print{} [  1:0xfcc500] Op_plus           :
31906@print{} [  1:0xfcc580] Op_push           : o
31907@print{} [  1:0xfcc560] Op_plus           :
31908@print{} [  1:0xfcc460] Op_leq            :
31909@print{} [   :0xfcc5c0] Op_jmp_false      : [target_jmp = 0xfcc5e0]
31910@print{} [  1:0xfcc600] Op_push_i         : "%c" [MALLOC|STRING|STRCUR]
31911@print{} [   :0xfcc660] Op_no_op          :
31912@print{} [  1:0xfcc520] Op_assign_concat  : c
31913@print{} [   :0xfcc620] Op_jmp            : [target_jmp = 0xfcc440]
31914@dots{}
31915@print{} [     2:0xfcc5a0] Op_K_printf         : [expr_count = 17] [redir_type = ""]
31916@print{} [      :0xfcc140] Op_no_op            :
31917@print{} [      :0xfcc1c0] Op_atexit           :
31918@print{} [      :0xfcc640] Op_stop             :
31919@print{} [      :0xfcc180] Op_no_op            :
31920@print{} [      :0xfcd150] Op_after_beginfile  :
31921@group
31922@print{} [      :0xfcc160] Op_no_op            :
31923@print{} [      :0xfcc1a0] Op_after_endfile    :
31924gawk>
31925@end group
31926@end smallexample
31927
31928@cindex @code{exit} debugger command
31929@cindex exit the debugger
31930@item @code{exit}
31931Exit the debugger.
31932See the entry for @samp{quit}, later in this list.
31933
31934@cindex debugger @subentry commands @subentry @code{h} (@code{help})
31935@cindex debugger @subentry commands @subentry @code{help}
31936@cindex @code{help} debugger command
31937@cindex @code{h} debugger command (alias for @code{help})
31938@item @code{help}
31939@itemx @code{h}
31940Print a list of all of the @command{gawk} debugger commands with a short
31941summary of their usage.  @samp{help @var{command}} prints the information
31942about the command @var{command}.
31943
31944@cindex debugger @subentry commands @subentry @code{l} (@code{list})
31945@cindex debugger @subentry commands @subentry @code{list}
31946@cindex @code{list} debugger command
31947@cindex @code{l} debugger command (alias for @code{list})
31948@item @code{list} [@code{-} | @code{+} | @var{n} | @var{filename}@code{:}@var{n} | @var{n}--@var{m} | @var{function}]
31949@itemx @code{l} [@code{-} | @code{+} | @var{n} | @var{filename}@code{:}@var{n} | @var{n}--@var{m} | @var{function}]
31950Print the specified lines (default 15) from the current source file
31951or the file named @var{filename}. The possible arguments to @code{list}
31952are as follows:
31953
31954@c nested table
31955@table @asis
31956@item @code{-} (Minus)
31957Print lines before the lines last printed.
31958
31959@item @code{+}
31960Print lines after the lines last printed.
31961@code{list} without any argument does the same thing.
31962
31963@item @var{n}
31964Print lines centered around line number @var{n}.
31965
31966@item  @var{n}--@var{m}
31967Print lines from @var{n} to @var{m}.
31968
31969@item @var{filename}@code{:}@var{n}
31970Print lines centered around line number @var{n} in
31971source file @var{filename}. This command may change the current source file.
31972
31973@item @var{function}
31974Print lines centered around the beginning of the
31975function @var{function}. This command may change the current source file.
31976@end table
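
For example, the following commands print the lines around line 40, then
the lines that follow them, then lines 10 through 20, and finally the lines
around line 5 of another source file.  The line numbers and the @value{FN}
@file{util.awk} are illustrative assumptions:

@example
gawk> @kbd{list 40}
gawk> @kbd{list +}
gawk> @kbd{list 10-20}
gawk> @kbd{list util.awk:5}
@end example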
31977
31978@cindex debugger @subentry commands @subentry @code{q} (@code{quit})
31979@cindex debugger @subentry commands @subentry @code{quit}
31980@cindex @code{quit} debugger command
31981@cindex @code{q} debugger command (alias for @code{quit})
31982@cindex exit the debugger
31983@item @code{quit}
31984@itemx @code{q}
31985Exit the debugger.  Debugging is great fun, but sometimes we all have
31986to tend to other obligations in life, and sometimes we find the bug
31987and are free to go on to the next one!  As we saw earlier, if you are
31988running a program, the debugger warns you when you type
31989@samp{q} or @samp{quit}, to make sure you really want to quit.
31990
31991@cindex debugger @subentry commands @subentry @code{trace}
31992@cindex @code{trace} debugger command
31993@item @code{trace} [@code{on} | @code{off}]
31994Turn on or off continuous printing of the instructions that are about to
31995be executed, along with the @command{awk} lines they
31996implement.  The default is @code{off}.
31997
31998It is to be hoped that most of the ``opcodes'' in these instructions are
31999fairly self-explanatory, and using @code{stepi} and @code{nexti} while
32000@code{trace} is on will make them into familiar friends.
32001
32002@end table
32003
32004@node Readline Support
32005@section Readline Support
32006@cindex command completion, in debugger
32007@cindex debugger @subentry command completion
32008@cindex history expansion, in debugger
32009@cindex debugger @subentry history expansion
32010
32011If @command{gawk} is compiled with
32012@uref{http://cnswww.cns.cwru.edu/php/chet/readline/readline.html,
32013the GNU Readline library}, you can take advantage of that library's
32014command completion and history expansion features. The following types
32015of completion are available:
32016
32017@table @asis
32018@item Command completion
32019Command names.
32020
32021@item Source @value{FN} completion
32022Source @value{FN}s. Relevant commands are
32023@code{break},
32024@code{clear},
32025@code{list},
32026@code{tbreak},
32027and
32028@code{until}.
32029
32030@item Argument completion
32031Non-numeric arguments to a command.
32032Relevant commands are @code{enable} and @code{info}.
32033
32034@item Variable name completion
32035Global variable names, and function arguments in the current context
32036if the program is running. Relevant commands are
32037@code{display},
32038@code{print},
32039@code{set},
32040and
32041@code{watch}.
32042
32043@end table
32044
32045@node Limitations
32046@section Limitations
32047
32048@cindex debugger @subentry limitations
32049We hope you find the @command{gawk} debugger useful and enjoyable to work with,
32050but as with any program, especially in its early releases, it still has
some limitations.  A few that are worth being aware of are:
32052
32053@itemize @value{BULLET}
32054@item
32055At this point, the debugger does not give a detailed explanation of
32056what you did wrong when you type in something it doesn't like. Rather, it just
32057responds @samp{syntax error}.  When you do figure out what your mistake was,
32058though, you'll feel like a real guru.
32059
32060@item
32061@c NOTE: no comma after the ref{} on purpose, due to following
32062@c parenthetical remark.
32063If you perused the dump of opcodes in @ref{Miscellaneous Debugger Commands}
32064(or if you are already familiar with @command{gawk} internals),
32065you will realize that much of the internal manipulation of data
32066in @command{gawk}, as in many interpreters, is done on a stack.
32067@code{Op_push}, @code{Op_pop}, and the like are the ``bread and butter'' of
32068most @command{gawk} code.
32069
32070Unfortunately, as of now, the @command{gawk}
32071debugger does not allow you to examine the stack's contents.
32072That is, the intermediate results of expression evaluation are on the
32073stack, but cannot be printed.  Rather, only variables that are defined
32074in the program can be printed.  Of course, a workaround for
32075this is to use more explicit variables at the debugging stage and then
32076change back to obscure, perhaps more optimal code later.
32077
32078@item
32079There is no way to look ``inside'' the process of compiling
32080regular expressions to see if you got it right.  As an @command{awk}
32081programmer, you are expected to know the meaning of
32082@code{/[^[:alnum:][:blank:]]/}.
32083
32084@item
32085The @command{gawk} debugger is designed to be used by running a program (with all its
32086parameters) on the command line, as described in @ref{Debugger Invocation}.
32087There is no way (as of now) to attach or ``break into'' a running program.
This seems reasonable for a language that is used mainly for short,
quickly executing programs.
32090
32091@item
32092The @command{gawk} debugger only accepts source code supplied with the @option{-f} option.
32093If you have a shell script that provides an @command{awk} program as a command
32094line parameter, and you need to use the debugger, you can write the script
32095to a temporary file, and use that as the program, with the @option{-f} option. This
32096might look like this:
32097
32098@example
32099cat << \EOF > /tmp/script.$$
32100@dots{}                                  @ii{Your program here}
32101EOF
32102gawk -D -f /tmp/script.$$
32103rm /tmp/script.$$
32104@end example
32105@end itemize
32106
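The workaround for the stack limitation described earlier can be
sketched as follows.  The variable and array names here (@code{amount},
@code{factor}, @code{rate}, and so on) are hypothetical and serve only
as an illustration; the point is that each intermediate result gets a
name that the debugger's @code{print} command can display:

@example
# Original, compact rule (intermediate results live only on the stack):
#     @{ total += substr($3, 2) * rate[$1] @}

# The same rule, expanded for a debugging session:
@{
    amount = substr($3, 2)      # inspect with: print amount
    factor = rate[$1]           # inspect with: print factor
    product = amount * factor   # inspect with: print product
    total += product
@}
@end example
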
32107@ignore
32108@c 11/2016: This no longer applies after all the type cleanup work that's been done.
32109One other point is worth discussing.  Conventional debuggers run in a
32110separate process (and thus address space) from the programs that they
32111debug (the @dfn{debuggee}, if you will).
32112
32113The @command{gawk} debugger is different; it is an integrated part
32114of @command{gawk} itself.  This makes it possible, in rare cases,
32115for @command{gawk} to become an excellent demonstrator of Heisenberg
32116Uncertainty physics, where the mere act of observing something can change
32117it. Consider the following:@footnote{Thanks to Hermann Peifer for
32118this example.}
32119
32120@example
32121$ @kbd{cat test.awk}
32122@print{} @{ print typeof($1), typeof($2) @}
32123$ @kbd{cat test.data}
32124@print{} abc 123
32125$ @kbd{gawk -f test.awk test.data}
32126@print{} strnum strnum
32127@end example
32128
32129This is all as expected: field data has the STRNUM attribute
32130(@pxref{Variable Typing}).  Now watch what happens when we run
32131this program under the debugger:
32132
32133@example
32134$ @kbd{gawk -D -f test.awk test.data}
32135gawk> @kbd{w $1}                        @ii{Set watchpoint on} $1
32136@print{} Watchpoint 1: $1
32137gawk> @kbd{w $2}                        @ii{Set watchpoint on} $2
32138@print{} Watchpoint 2: $2
32139gawk> @kbd{r}                           @ii{Start the program}
32140@print{} Starting program:
32141@print{} Stopping in Rule ...
32142@print{} Watchpoint 1: $1               @ii{Watchpoint fires}
32143@print{}   Old value: ""
32144@print{}   New value: "abc"
32145@print{} main() at `test.awk':1
32146@print{} 1       @{ print typeof($1), typeof($2) @}
32147gawk> @kbd{n}                           @ii{Keep going @dots{}}
32148@print{} Watchpoint 2: $2               @ii{Watchpoint fires}
32149@print{}   Old value: ""
32150@print{}   New value: "123"
32151@print{} main() at `test.awk':1
32152@print{} 1       @{ print typeof($1), typeof($2) @}
32153gawk> @kbd{n}                           @ii{Get result from} typeof()
32154@print{} strnum number                  @ii{Result for} $2 @ii{isn't right}
32155@print{} Program exited normally with exit value: 0
32156gawk> @kbd{quit}
32157@end example
32158
32159In this case, the act of comparing the new value of @code{$2}
32160with the old one caused @command{gawk} to evaluate it and determine that it
32161is indeed a number, and this is reflected in the result of
32162@code{typeof()}.
32163
32164Cases like this where the debugger is not transparent to the program's
32165execution should be rare. If you encounter one, please report it
32166(@pxref{Bugs}).
32167@end ignore
32168
32169@ignore
32170Look forward to a future release when these and other missing features may
32171be added, and of course feel free to try to add them yourself!
32172@end ignore
32173
32174@node Debugging Summary
32175@section Summary
32176
32177@itemize @value{BULLET}
32178@item
32179Programs rarely work correctly the first time.  Finding bugs
32180is called debugging, and a program that helps you find bugs is a
32181debugger.  @command{gawk} has a built-in debugger that works very
32182similarly to the GNU Debugger, GDB.
32183
32184@item
32185Debuggers let you step through your program one statement at a time,
32186examine and change variable and array values, and do a number of other
32187things that let you understand what your program is actually doing (as
32188opposed to what it is supposed to do).
32189
32190@item
32191Like most debuggers, the @command{gawk} debugger works in terms of stack
32192frames, and lets you set both breakpoints (stop at a point in the code)
32193and watchpoints (stop when a data value changes).
32194
32195@item
32196The debugger command set is fairly complete, providing control over
32197breakpoints, execution, viewing and changing data, working with the stack,
32198getting information, and other tasks.
32199
32200@item
32201If the GNU Readline library is available when @command{gawk} is
32202compiled, it is used by the debugger to provide command-line history
32203and editing.
32204
32205@item
Usually, the debugger does not affect the
program being debugged, but occasionally it can.
32208
32209@end itemize
32210
32211@hyphenation{name-space name-spaces Name-space Name-spaces}
32212@node Namespaces
32213@chapter Namespaces in @command{gawk}
32214
32215This @value{CHAPTER} describes a feature that is specific to @command{gawk}.
32216
32217@quotation CAUTION
The feature described in this @value{CHAPTER} is new.  It is entirely
possible, and even likely, that there are dark corners (if not bugs)
still lurking within the implementation. If you find any such,
please report them (@pxref{Bugs}).
32222@end quotation
32223
32224@menu
32225* Global Namespace::            The global namespace in standard
32226                                @command{awk}.
32227* Qualified Names::             How to qualify names with a namespace.
32228* Default Namespace::           The default namespace.
32229* Changing The Namespace::      How to change the namespace.
32230* Naming Rules::                Namespace and Component Naming Rules.
32231* Internal Name Management::    How names are stored internally.
32232* Namespace Example::           An example of code using a namespace.
32233* Namespace And Features::      Namespaces and other @command{gawk} features.
32234* Namespace Summary::           Summarizing namespaces.
32235@end menu
32236
32237@node Global Namespace
32238@section Standard @command{awk}'s Single Namespace
32239
32240@cindex namespace @subentry definition of
32241@cindex namespace @subentry standard @command{awk}, global
In standard @command{awk}, there is a single, global @dfn{namespace}.
32243This means that @emph{all} function names and global variable names must
32244be unique. For example, two different @command{awk} source files cannot
32245both define a function named @code{min()}, or define the same identifier,
32246used as a scalar in one and as an array in the other.
32247
This situation is okay when programs are small, say a few hundred
lines, or even a few thousand, but it prevents the development of
reusable libraries of @command{awk} functions, and can
cause independently developed library files to accidentally step on each
other's ``private'' global variables
(@pxref{Library Names}).
32254
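As a small illustration of the problem (the @value{FN}s and function
bodies here are made up for this sketch), consider two independently
written library files that each happen to define @code{min()}:

@example
# numlib.awk --- hypothetical numeric helpers
function min(a, b) @{ return (a < b ? a : b) @}

# strlib.awk --- hypothetical string helpers
function min(s, t) @{ return (length(s) < length(t) ? s : t) @}
@end example

@noindent
Loading both files into one program, for example with
@samp{gawk -f numlib.awk -f strlib.awk @dots{}}, is expected to fail,
because the second definition of @code{min()} collides with the first.
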
32255@cindex package, definition of
32256@cindex module, definition of
32257Most other programming languages solve this issue by providing some kind
32258of namespace control: a way to say ``this function is in namespace @var{xxx},
32259and that function is in namespace @var{yyy}.''  (Of course, there is then
still a single namespace for the namespaces, but the hope is that there
are far fewer namespaces in use by any given program, and thus much
less chance for collisions.)  These facilities are sometimes referred
32263to as @dfn{packages} or @dfn{modules}.
32264
32265Starting with @value{PVERSION} 5.0, @command{gawk} provides a
32266simple mechanism to put functions and global variables into separate namespaces.
32267
32268@node Qualified Names
32269@section Qualified Names
32270
32271@cindex qualified name @subentry definition of
32272@cindex namespaces @subentry qualified names
32273@cindex @code{:} (colon) @subentry @code{::} namespace separator
32274@cindex colon (@code{:}) @subentry @code{::} namespace separator
32275@cindex component name
32276A @dfn{qualified name} is an identifier that includes a namespace name,
32277the namespace separator @code{::}, and a @dfn{component} name.  For example, one
32278might have a function named @code{posix::getpid()}.  Here, the namespace
32279is @code{posix} and the function name within the namespace (the component)
32280is @code{getpid()}.  The namespace and component names are separated by
32281a double-colon.  Only one such separator is allowed in a qualified name.
32282
32283@quotation NOTE
32284Unlike C++, the @code{::} is @emph{not} an operator.  No spaces are
32285allowed between the namespace name, the @code{::}, and the component name.
32286@end quotation
32287
32288@cindex qualified name @subentry use of
32289You must use qualified names from one namespace to access variables
32290and functions in another.  This is especially important when using
32291variable names to index the special @code{SYMTAB} array (@pxref{Auto-set}),
32292and when making indirect function calls (@pxref{Indirect Calls}).
32293
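Here is a minimal sketch of both cases.  The namespace @code{report}
and the names @code{Count}, @code{show()}, and @code{fn} are
hypothetical, chosen only for illustration:

@example
@@namespace "report"

function show()                         # report::show()
@{
    print "Count is", Count
@}

BEGIN @{
    Count = 3                           # report::Count
    print SYMTAB["report::Count"]       # index with the qualified name
    fn = "report::show"                 # qualified name in the string
    @@fn()                               # indirect call to report::show()
@}
@end example

@noindent
In both the @code{SYMTAB} index and the string used for the indirect
call, the name is spelled out in full, since a string's contents are
not adjusted to the namespace in effect where the string appears.
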
32294@node Default Namespace
32295@section The Default Namespace
32296
32297@cindex namespace @subentry default
32298@cindex namespace @subentry @code{awk}
32299@cindex @code{awk} @subentry namespace
32300The default namespace, not surprisingly, is @code{awk}.
32301All of the predefined @command{awk} and @command{gawk} variables
32302are in this namespace, and thus have qualified names like
32303@code{awk::ARGC}, @code{awk::NF}, and so on.
32304
32305@cindex uppercase names, namespace for
32306Furthermore, even when you have changed the namespace for your
32307current source file (@pxref{Changing The Namespace}), @command{gawk}
32308forces unqualified identifiers whose names are all uppercase letters
32309to be in the @code{awk} namespace.  This makes it possible for you to easily
32310reference @command{gawk}'s global variables from different namespaces.
32311It also keeps your code looking natural.
32312
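For instance, in the following sketch (the namespace @code{stats} and
the variables @code{count} and @code{TOTAL} are hypothetical), only
@code{count} ends up in the @code{stats} namespace:

@example
@@namespace "stats"

@{
    count++             # stats::count
    TOTAL += $1         # awk::TOTAL, since the name is all uppercase
    print NR, NF        # awk::NR and awk::NF, no qualification needed
@}
@end example
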
32313@node Changing The Namespace
32314@section Changing The Namespace
32315
32316@cindex namespaces @subentry changing
32317@cindex @code{@@} (at-sign) @subentry @code{@@namespace} directive
32318@cindex at-sign (@code{@@}) @subentry @code{@@namespace} directive
32319@cindex @code{@@namespace} directive @sortas{namespace directive}
32320In order to set the current namespace, use an @code{@@namespace} directive
32321at the top level of your program:
32322
32323@example
32324@@namespace "passwd"
32325
32326BEGIN @{ @dots{} @}
32327@dots{}
32328@end example
32329
32330After this directive, all simple non-completely-uppercase identifiers are
32331placed into the @code{passwd} namespace.
32332
32333You can change the namespace multiple times within a single
32334source file, although this is likely to become confusing if you
32335do it too much.
32336
32337@quotation NOTE
32338Association of unqualified identifiers to a namespace is handled while
32339@command{gawk} parses your program, @emph{before} it starts to run.  There is
32340no concept of a ``current'' namespace once your program starts executing.
32341Be sure you understand this.
32342@end quotation
32343
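The following sketch illustrates the point; the namespace, function,
and variable names are hypothetical.  Each unqualified @code{x} is bound
to a namespace according to the directive in effect where it appears
in the source text, not according to which function happens to be running:

@example
@@namespace "one"

function init() @{ x = 1 @}               # one::x

@@namespace "two"

function show() @{ print x, one::x @}     # two::x and one::x

BEGIN @{
    one::init()         # assigns one::x
    show()              # two::show(); two::x was never assigned
@}
@end example

@noindent
Even though @code{show()} runs after @code{one::init()}, its
unqualified @code{x} still refers to @code{two::x}, which is empty.
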
32344@cindex namespace @subentry implicit
32345@cindex implicit namespace
32346Each source file for @option{-i} and @option{-f} starts out with
32347an implicit @samp{@@namespace "awk"}.  Similarly, each chunk of
32348command-line code supplied with @option{-e} has such an implicit
32349initial statement (@pxref{Options}).
32350
32351@cindex current namespace, pushing and popping
32352@cindex namespace @subentry pushing and popping
32353Files included with @code{@@include} (@pxref{Include Files}) ``push''
32354and ``pop'' the current namespace. That is, each @code{@@include} saves
32355the current namespace and starts over with an implicit @samp{@@namespace
32356"awk"} which remains in effect until an explicit @code{@@namespace}
32357directive is seen.  When @command{gawk} finishes processing the included
32358file, the saved namespace is restored and processing continues where it
32359left off in the original file.
32360
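A sketch of this behavior, assuming a hypothetical library file
@file{helper.awk} and hypothetical function names:

@example
@@namespace "outer"

function before() @{ @dots{} @}     # outer::before()

@@include "helper.awk"            # helper.awk starts out in the awk namespace

function after() @{ @dots{} @}      # outer::after(); the namespace was restored
@end example
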
32361@cindex @code{@@} (at-sign) @subentry @code{@@namespace} directive @subentry @code{BEGIN}, @code{BEGINFILE}, @code{END}, @code{ENDFILE} and
32362@cindex at-sign (@code{@@}) @subentry @code{@@namespace} directive @subentry @code{BEGIN}, @code{BEGINFILE}, @code{END}, @code{ENDFILE} and
32363@cindex @code{BEGIN} pattern @subentry @code{@@namespace} directive and
32364@cindex @code{BEGINFILE} pattern @subentry @code{@@namespace} directive and
32365@cindex @code{END} pattern @subentry @code{@@namespace} directive and
32366@cindex @code{ENDFILE} pattern @subentry @code{@@namespace} directive and
32367@cindex @code{@@namespace} directive @sortas{namespace directive}
32368The use of @code{@@namespace} has no influence upon the order of execution
32369of @code{BEGIN}, @code{BEGINFILE}, @code{END}, and @code{ENDFILE} rules.
32370
32371@node Naming Rules
32372@section Namespace and Component Naming Rules
32373
32374@cindex naming rules, namespace and component names
32375@cindex namespaces @subentry naming rules
32376@c not "component names" to merge with other index entry
32377@cindex component name @subentry naming rules
32378A number of rules apply to the namespace and component names, as follows.
32379
32380@itemize @bullet
32381@item
32382It is a syntax error to use qualified names for function parameter names.
32383
32384@item
32385It is a syntax error to use any standard @command{awk} reserved word (such
32386as @code{if} or @code{for}), or the name of any standard built-in function
32387(such as @code{sin()} or @code{gsub()}) as either part of a qualified name.
32388Thus, the following produces a syntax error:
32389
32390@example
32391@@namespace "example"
32392
32393function gsub(str, pat, result) @{ @dots{} @}
32394@end example
32395
32396@item
32397Outside the @code{awk} namespace, the names of the additional @command{gawk}
32398built-in functions (such as @code{gensub()} or @code{strftime()}) @emph{may}
32399be used as component names.  The same set of names may be used as namespace
32400names, although this has the potential to be confusing.
32401
32402@item
32403The additional @command{gawk} built-in functions may still be called
32404from outside the @code{awk} namespace by qualifying them. For example,
32405@code{awk::systime()}.  Here is a somewhat silly example demonstrating
32406this rule and the previous one:
32407
32408@example
32409BEGIN @{
32410    print "in awk namespace, systime() =", systime()
32411@}
32412
32413@@namespace "testing"
32414
32415function systime()
32416@{
32417    print "in testing namespace, systime() =", awk::systime()
32418@}
32419
32420BEGIN @{
32421    systime()
32422@}
32423@end example
32424
@noindent
When run, it produces output like this:
32428
32429@example
32430$ @kbd{gawk -f systime.awk}
32431@print{} in awk namespace, systime() = 1500488503
32432@print{} in testing namespace, systime() = 1500488503
32433@end example
32434
32435@item
@command{gawk}'s predefined variable names may be used as namespace
and component names: @code{NF::NR} is valid, if possibly not all that useful.
32438@end itemize
32439
32440@node Internal Name Management
32441@section Internal Name Management
32442
32443@cindex name management
32444@cindex @code{awk} @subentry namespace @subentry identifier name storage
32445@cindex @code{awk} @subentry namespace @subentry use for indirect function calls
32446For backwards compatibility, all identifiers in the @code{awk} namespace
32447are stored internally as unadorned identifiers (that is, without a
32448leading @samp{awk::}).  This is mainly relevant
32449when using such identifiers as indices for @code{SYMTAB}, @code{FUNCTAB},
32450and @code{PROCINFO["identifiers"]} (@pxref{Auto-set}), and for use in
32451indirect function calls (@pxref{Indirect Calls}).
32452
32453In program code, to refer to variables and functions in the @code{awk}
32454namespace from another namespace, you must still use the @samp{awk::}
32455prefix. For example:
32456
32457@example
32458@@namespace "awk"          @ii{This is the default namespace}
32459
32460BEGIN @{
32461    Title = "My Report"   @ii{Qualified name is} awk::Title
32462@}
32463
32464@@namespace "report"       @ii{Now in} report @ii{namespace}
32465
32466function compute()        @ii{This is really} report::compute()
32467@{
32468    print awk::Title      @ii{But would be} SYMTAB["Title"]
32469    @dots{}
32470@}
32471@end example
32472
32473@node Namespace Example
32474@section Namespace Example
32475
32476@cindex namespace @subentry example code
32477The following example is a revised version of the suite of routines
32478developed in @ref{Passwd Functions}. See there for an explanation
32479of how the code works.
32480
32481The formulation here, due mainly to Andrew Schorr, is rather elegant.
32482All of the implementation functions and variables are in the
32483@code{passwd} namespace, whereas the main interface functions are
32484defined in the @code{awk} namespace.
32485
32486@example
32487@c file eg/lib/ns_passwd.awk
32488# ns_passwd.awk --- access password file information
32489@c endfile
32490@ignore
32491@c file eg/lib/ns_passwd.awk
32492#
32493# Arnold Robbins, arnold@@skeeve.com, Public Domain
32494# May 1993
32495# Revised October 2000
32496# Revised December 2010
32497#
32498# Reworked for namespaces June 2017, with help from
32499# Andrew J.@: Schorr, aschorr@@telemetry-investments.com
32500@c endfile
32501@end ignore
32502@c file eg/lib/ns_passwd.awk
32503
32504@@namespace "passwd"
32505
32506BEGIN @{
32507    # tailor this to suit your system
32508    Awklib = "/usr/local/libexec/awk/"
32509@}
32510
32511function Init(    oldfs, oldrs, olddol0, pwcat, using_fw, using_fpat)
32512@{
32513    if (Inited)
32514        return
32515
32516    oldfs = FS
32517    oldrs = RS
32518    olddol0 = $0
32519    using_fw = (PROCINFO["FS"] == "FIELDWIDTHS")
32520    using_fpat = (PROCINFO["FS"] == "FPAT")
32521    FS = ":"
32522    RS = "\n"
32523
32524    pwcat = Awklib "pwcat"
32525    while ((pwcat | getline) > 0) @{
32526        Byname[$1] = $0
32527        Byuid[$3] = $0
32528        Bycount[++Total] = $0
32529    @}
32530    close(pwcat)
32531    Count = 0
32532    Inited = 1
32533    FS = oldfs
32534    if (using_fw)
32535        FIELDWIDTHS = FIELDWIDTHS
32536    else if (using_fpat)
32537        FPAT = FPAT
32538    RS = oldrs
32539    $0 = olddol0
32540@}
32541
32542function awk::getpwnam(name)
32543@{
32544    Init()
32545    return Byname[name]
32546@}
32547
32548function awk::getpwuid(uid)
32549@{
32550    Init()
32551    return Byuid[uid]
32552@}
32553
32554function awk::getpwent()
32555@{
32556    Init()
32557    if (Count < Total)
32558        return Bycount[++Count]
32559    return ""
32560@}
32561
32562function awk::endpwent()
32563@{
32564    Count = 0
32565@}
32566@c endfile
32567@end example
32568
32569As you can see, this version also follows the convention mentioned in
32570@ref{Library Names}, whereby global variable and function names
32571start with a capital letter.
32572
Here is a simple test program. Since it's in a separate file, unadorned
identifiers are looked up in the @code{awk} namespace:
32575
32576@example
32577BEGIN @{
32578    while ((p = getpwent()) != "")
32579        print p
32580@}
32581@end example
32582
@noindent
Here's what happens when it's run:
32586
32587@example
32588$ @kbd{gawk -f ns_passwd.awk -f testpasswd.awk}
32589@print{} root:x:0:0:root:/root:/bin/bash
32590@print{} daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
32591@print{} bin:x:2:2:bin:/bin:/usr/sbin/nologin
32592@print{} sys:x:3:3:sys:/dev:/usr/sbin/nologin
32593@dots{}
32594@end example
32595
32596@node Namespace And Features
32597@section Namespaces and Other @command{gawk} Features
32598
32599This @value{SECTION} looks briefly at how the namespace facility interacts
32600with other important @command{gawk} features.
32601
32602@cindex namespaces @subentry interaction with @subentry profiler
32603@cindex namespaces @subentry interaction with @subentry pretty printer
32604@cindex profiler, interaction with namespaces
32605@cindex pretty printer, interaction with namespaces
32606The profiler and pretty-printer (@pxref{Profiling}) have been enhanced
32607to understand namespaces and the namespace naming rules presented in
32608@ref{Naming Rules}.  In particular, the output groups functions in the same
32609namespace together, and has @code{@@namespace} directives in front
32610of rules as necessary. This allows component names to be
32611simple identifiers, instead of using qualified identifiers everywhere.
32612
32613@cindex namespaces @subentry interaction with @subentry debugger
32614@cindex debugger @subentry interaction with namespaces
32615Interaction with the debugger (@pxref{Debugging}) has not had to change
32616(at least as of this writing).  Some of the internal byte codes changed
32617in order to accommodate namespaces, and the debugger's @code{dump} command
32618was adjusted to match.
32619
32620@cindex namespaces @subentry interaction with @subentry extension API
32621@cindex extension API @subentry interaction with namespaces
32622The extension API (@pxref{Dynamic Extensions}) has always allowed for
32623placing functions into a different namespace, although this was not
32624previously implemented.  However, the symbol lookup and symbol update
32625routines did not have provision for including a namespace. That has now
32626been corrected (@pxref{Symbol table by name}).
32627@xref{Extension Sample Inplace}, for a nice example of an extension that
32628leverages a namespace shared by cooperating @command{awk} and C code.
32629
32630@node Namespace Summary
32631@section Summary
32632
32633@itemize @value{BULLET}
32634@item
32635Standard @command{awk} provides a single namespace for all global
32636identifiers (scalars, arrays, and functions).  This is limiting when
32637one wants to develop libraries of reusable functions or function suites.
32638
32639@item
32640@command{gawk} provides multiple namespaces by using qualified names:
32641names consisting of a namespace name, a double colon, @code{::}, and a
32642component name.  Namespace names might still possibly conflict, but this
32643is true of any language providing namespaces, modules, or packages.
32644
32645@item
The default namespace is @code{awk}. The rules for namespace and
32647component names are provided in @ref{Naming Rules}. The rules are
32648designed in such a way as to make namespace-aware code continue to
32649look and work naturally while still providing the necessary power and
32650flexibility.
32651
32652@item
32653Other parts of @command{gawk} have been extended as necessary to integrate
32654namespaces smoothly with their operation.  This applies most notably to
32655the profiler / pretty-printer (@pxref{Profiling}) and to the extension
32656facility (@pxref{Dynamic Extensions}).
32657
32658@cindex namespaces @subentry backwards compatibility
32659@item
32660Overall, the namespace facility was designed and implemented such that
32661backwards compatibility is paramount. Programs that don't use namespaces
32662should see absolutely no difference in behavior when run by a namespace-capable
32663version of @command{gawk}.
32664@end itemize
32665
32666@node Arbitrary Precision Arithmetic
32667@chapter Arithmetic and Arbitrary-Precision Arithmetic with @command{gawk}
32668@cindex arbitrary precision
32669@cindex multiple precision
32670@cindex infinite precision
32671@cindex floating-point @subentry numbers @subentry arbitrary-precision
32672
32673This @value{CHAPTER} introduces some basic concepts relating to
32674how computers do arithmetic and defines some important terms.
32675It then proceeds to describe floating-point arithmetic,
32676which is what @command{awk} uses for all its computations, including a
32677discussion of arbitrary-precision floating-point arithmetic, which is
32678a feature available only in @command{gawk}. It continues on to present
32679arbitrary-precision integers, and concludes with a description of some
32680points where @command{gawk} and the POSIX standard are not quite in
32681agreement.
32682
32683@quotation NOTE
Most users of @command{gawk} can safely skip this @value{CHAPTER}.
32685But if you want to do scientific calculations with @command{gawk},
32686this is the place to be.
32687@end quotation
32688
32689@menu
32690* Computer Arithmetic::           A quick intro to computer math.
32691* Math Definitions::              Defining terms used.
32692* MPFR features::                 The MPFR features in @command{gawk}.
32693* FP Math Caution::               Things to know.
32694* Arbitrary Precision Integers::  Arbitrary Precision Integer Arithmetic with
32695                                  @command{gawk}.
32696* Checking for MPFR::             How to check if MPFR is available.
32697* POSIX Floating Point Problems:: Standards Versus Existing Practice.
32698* Floating point summary::        Summary of floating point discussion.
32699@end menu
32700
32701@node Computer Arithmetic
32702@section A General Description of Computer Arithmetic
32703
32704Until now, we have worked with data as either numbers or
32705strings.  Ultimately, however, computers represent everything in terms
32706of @dfn{binary digits}, or @dfn{bits}.  A decimal digit can take on any
32707of 10 values: zero through nine.  A binary digit can take on any of two
32708values, zero or one.  Using binary, computers (and computer software)
32709can represent and manipulate numerical and character data.  In general,
32710the more bits you can use to represent a particular thing, the greater
32711the range of possible values it can take on.
32712
32713Modern computers support at least two, and often more, ways to do
32714arithmetic.  Each kind of arithmetic uses a different representation
32715(organization of the bits) for the numbers.  The kinds of arithmetic
32716that interest us are:
32717
32718@table @asis
32719@item Decimal arithmetic
32720This is the kind of arithmetic you learned in elementary school, using
32721paper and pencil (and/or a calculator). In theory, numbers can have an
32722arbitrary number of digits on either side (or both sides) of the decimal
32723point, and the results of a computation are always exact.
32724
32725Some modern systems can do decimal arithmetic in hardware, but usually you
32726need a special software library to provide access to these instructions.
32727There are also libraries that do decimal arithmetic entirely in software.
32728
32729Despite the fact that some users expect @command{gawk} to be performing
32730decimal arithmetic,@footnote{We don't know why they expect this, but
32731they do.} it does not do so.
32732
32733@item Integer arithmetic
32734In school, integer values were referred to as ``whole'' numbers---that
32735is, numbers without any fractional part, such as 1, 42, or @minus{}17.
32736The advantage to integer numbers is that they represent values exactly.
32737The disadvantage is that their range is limited.
32738
32739@cindex unsigned integers
32740@cindex integers @subentry unsigned
32741In computers, integer values come in two flavors: @dfn{signed} and
32742@dfn{unsigned}.  Signed values may be negative or positive, whereas
32743unsigned values are always greater than or equal
32744to zero.
32745
32746In computer systems, integer arithmetic is exact, but the possible
32747range of values is limited.  Integer arithmetic is generally faster than
32748floating-point arithmetic.
32749
32750@cindex floating-point @subentry numbers
32751@item Floating-point arithmetic
32752Floating-point numbers represent what were called in school ``real''
32753numbers (i.e., those that have a fractional part, such as 3.1415927).
32754The advantage to floating-point numbers is that they can represent a
32755much larger range of values than can integers.  The disadvantage is that
32756there are numbers that they cannot represent exactly.
32757
32758Modern systems support floating-point arithmetic in hardware, with a
32759limited range of values.  There are software libraries that allow
32760the use of arbitrary-precision floating-point calculations.
32761
32762@cindex floating-point @subentry numbers @subentry single-precision
32763@cindex floating-point @subentry numbers @subentry double-precision
32764@cindex floating-point @subentry numbers @subentry arbitrary-precision
32765@cindex single-precision
32766@cindex double-precision
32767@cindex arbitrary precision
32768POSIX @command{awk} uses @dfn{double-precision} floating-point numbers, which
32769can hold more digits than @dfn{single-precision} floating-point numbers.
32770@command{gawk} has facilities for performing arbitrary-precision
32771floating-point arithmetic, which we describe in more detail shortly.
32772@end table
32773
32774Computers work with integer and floating-point values of different
32775ranges. Integer values are usually either 32 or 64 bits in size.
32776Single-precision floating-point values occupy 32 bits, whereas double-precision
32777floating-point values occupy 64 bits.
32778(Quadruple-precision floating point values also exist. They occupy 128 bits,
32779but such numbers are not available in @command{awk}.)
32780Floating-point values are always
32781signed. The possible ranges of values are shown in @ref{table-numeric-ranges}
32782and @ref{table-floating-point-ranges}.
32783
32784@float Table,table-numeric-ranges
32785@caption{Value ranges for integer representations}
32786@multitable @columnfractions .34 .33 .33
32787@headitem Representation @tab Minimum value @tab Maximum value
32788@item 32-bit signed integer @tab @minus{}2,147,483,648 @tab 2,147,483,647
32789@item 32-bit unsigned integer @tab 0 @tab 4,294,967,295
32790@item 64-bit signed integer @tab @minus{}9,223,372,036,854,775,808 @tab 9,223,372,036,854,775,807
32791@item 64-bit unsigned integer @tab 0 @tab 18,446,744,073,709,551,615
32792@end multitable
32793@end float
32794
32795@float Table,table-floating-point-ranges
32796@caption{Approximate value ranges for floating-point number representations}
32797@multitable @columnfractions .38 .22 .22 .23
32798@iftex
32799@headitem Representation @tab @w{Minimum positive} @w{nonzero value} @tab Minimum @w{finite value} @tab Maximum @w{finite value}
32800@end iftex
32801@ifnottex
32802@headitem Representation @tab Minimum positive nonzero value @tab Minimum finite value @tab Maximum finite value
32803@end ifnottex
32804@iftex
32805@item @w{Single-precision floating-point} @tab @math{1.175494 @cdot 10^{-38}} @tab @math{-3.402823 @cdot 10^{38}} @tab @math{3.402823 @cdot 10^{38}}
32806@item @w{Double-precision floating-point} @tab @math{2.225074 @cdot 10^{-308}} @tab @math{-1.797693 @cdot 10^{308}} @tab @math{1.797693 @cdot 10^{308}}
32807@item @w{Quadruple-precision floating-point} @tab @math{3.362103 @cdot 10^{-4932}} @tab @math{-1.189731 @cdot 10^{4932}} @tab @math{1.189731 @cdot 10^{4932}}
32808@end iftex
32809@ifinfo
32810@item Single-precision floating-point @tab 1.175494e-38 @tab -3.402823e+38 @tab 3.402823e+38
32811@item Double-precision floating-point @tab 2.225074e-308 @tab -1.797693e+308 @tab 1.797693e+308
32812@item Quadruple-precision floating-point @tab 3.362103e-4932 @tab -1.189731e+4932 @tab 1.189731e+4932
32813@end ifinfo
32814@ifnottex
32815@ifnotinfo
32816@item Single-precision floating-point @tab 1.175494*10@sup{-38} @tab -3.402823*10@sup{38} @tab 3.402823*10@sup{38}
32817@item Double-precision floating-point @tab 2.225074*10@sup{-308} @tab -1.797693*10@sup{308} @tab 1.797693*10@sup{308}
32818@item Quadruple-precision floating-point @tab 3.362103*10@sup{-4932} @tab -1.189731*10@sup{4932} @tab 1.189731*10@sup{4932}
32819@end ifnotinfo
32820@end ifnottex
32821@end multitable
32822@end float
32823
32824@node Math Definitions
32825@section Other Stuff to Know
32826
32827The rest of this @value{CHAPTER} uses a number of terms. Here are some
32828informal definitions that should help you work your way through the material
32829here:
32830
32831@table @dfn
32832@item Accuracy
32833A floating-point calculation's accuracy is how close it comes
32834to the real (paper and pencil) value.
32835
32836@item Error
32837The difference between what the result of a computation ``should be''
32838and what it actually is.  It is best to minimize error as much
32839as possible.
32840
32841@item Exponent
32842The order of magnitude of a value;
32843some number of bits in a floating-point value store the exponent.
32844
32845@item Inf
A special value representing infinity. Most arithmetic operations involving
a finite number and infinity produce infinity.
32848
32849@item NaN
32850``Not a number.''@footnote{Thanks to Michael Brennan for this description,
32851which we have paraphrased, and for the examples.} A special value that
32852results from attempting a calculation that has no answer as a real number.
32853In such a case, programs can either receive a floating-point exception,
32854or get @code{NaN} back as the result. The IEEE 754 standard recommends
32855that systems return @code{NaN}. Some examples:
32856
32857@table @code
32858@item sqrt(-1)
32859This makes sense in the range of complex numbers, but not in the
32860range of real numbers, so the result is @code{NaN}.
32861
32862@item log(-8)
32863@minus{}8 is out of the domain of @code{log()}, so the result is @code{NaN}.
32864@end table
32865
32866@item Normalized
32867How the significand (see later in this list) is usually stored. The
32868value is adjusted so that the first bit is one, and then that leading
32869one is assumed instead of physically stored.  This provides one
32870extra bit of precision.
32871
32872@item Precision
32873The number of bits used to represent a floating-point number.
32874The more bits, the more digits you can represent.
32875Binary and decimal precisions are related approximately, according to the
32876formula:
32877
32878@display
32879@iftex
32880@math{prec = 3.322 @cdot dps}
32881@end iftex
32882@ifnottex
32883@ifnotdocbook
32884@var{prec} = 3.322 * @var{dps}
32885@end ifnotdocbook
32886@end ifnottex
32887@docbook
32888<emphasis>prec</emphasis> = 3.322 &sdot; <emphasis>dps</emphasis>
32889@end docbook
32890@end display
32891
32892@noindent
Here, @emph{prec} denotes the binary precision
(measured in bits) and @emph{dps} (short for decimal places)
denotes the number of decimal digits.
32896
32897@item Rounding mode
32898How numbers are rounded up or down when necessary.
32899More details are provided later.
32900
32901@item Significand
32902A floating-point value consists of the significand multiplied by 10
32903to the power of the exponent. For example, in @code{1.2345e67},
32904the significand is @code{1.2345}.
32905
32906@item Stability
32907From @uref{https://en.wikipedia.org/wiki/Numerical_stability,
32908the Wikipedia article on numerical stability}:
32909``Calculations that can be proven not to magnify approximation errors
32910are called @dfn{numerically stable}.''
32911@end table
32912
32913See @uref{https://en.wikipedia.org/wiki/Accuracy_and_precision,
32914the Wikipedia article on accuracy and precision} for more information
32915on some of those terms.
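
As a rough illustration of the @code{Inf} and @code{NaN} entries in the
preceding list, the following sketch produces both values using only
standard @command{awk} functions.  The exact spelling of the printed
values (for example, @samp{inf} versus @samp{+inf}) varies among
@command{awk} versions and platforms, and @command{gawk} may also print
a warning for the negative argument to @code{log()}, so no output is
shown here:

@example
$ @kbd{gawk 'BEGIN @{}
>   @kbd{inf = -log(0)          # log(0) is negative infinity}
>   @kbd{print inf, inf + 1     # infinity plus a finite value}
>   @kbd{print inf - inf        # no meaningful answer: NaN}
>   @kbd{print log(-8)          # outside the domain of log()}
> @kbd{@}'}
@end example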
32916
32917On modern systems, floating-point hardware uses the representation and
32918operations defined by the IEEE 754 standard.
32919Three of the standard IEEE 754 types are 32-bit single precision,
3292064-bit double precision, and 128-bit quadruple precision.
32921The standard also specifies extended precision formats
32922to allow greater precisions and larger exponent ranges.
32923(@command{awk} uses only the 64-bit double-precision format.)
32924
32925@ref{table-ieee-formats} lists the precision and exponent
32926field values for the basic IEEE 754 binary formats.
32927
32928@float Table,table-ieee-formats
32929@caption{Basic IEEE format values}
32930@multitable @columnfractions .20 .20 .20 .20 .20
32931@headitem Name @tab Total bits @tab Precision @tab Minimum exponent @tab Maximum exponent
32932@item Single @tab 32 @tab 24 @tab @minus{}126 @tab +127
32933@item Double @tab 64 @tab 53 @tab @minus{}1022 @tab +1023
32934@item Quadruple @tab 128 @tab 113 @tab @minus{}16382 @tab +16383
32935@end multitable
32936@end float
32937
32938@quotation NOTE
32939The precision numbers include the implied leading one that gives them
32940one extra bit of significand.
32941@end quotation
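
As a rough check of the approximate formula relating binary and decimal
precision given earlier, the following sketch divides each precision in
@ref{table-ieee-formats} by 3.322.  The array name @code{prec} and the
message text are chosen only for this illustration:

@example
$ @kbd{gawk 'BEGIN @{}
>   @kbd{n = split("24 53 113", prec, " ")}
>   @kbd{for (i = 1; i <= n; i++)}
>       @kbd{printf("%3d bits is about %.2f decimal digits\n", prec[i], prec[i] / 3.322)}
> @kbd{@}'}
@print{}  24 bits is about 7.22 decimal digits
@print{}  53 bits is about 15.95 decimal digits
@print{} 113 bits is about 34.02 decimal digits
@end example

@noindent
The figure for double precision is consistent with the common rule of
thumb that a C @code{double} carries about 15 to 16 significant
decimal digits.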
32942
32943@node MPFR features
32944@section Arbitrary-Precision Arithmetic Features in @command{gawk}
32945
32946By default, @command{gawk} uses the double-precision floating-point values
32947supplied by the hardware of the system it runs on.  However, if it was
32948compiled to do so, and the @option{-M} command-line option is supplied,
32949@command{gawk} uses the @uref{http://www.mpfr.org,
32950GNU MPFR} and @uref{https://gmplib.org, GNU MP} (GMP) libraries for
32951arbitrary-precision arithmetic on numbers.  You can see if MPFR support
32952is available like so:
32953
32954@example
32955$ @kbd{gawk --version}
32956@print{} GNU Awk 4.1.2, API: 1.1 (GNU MPFR 3.1.0-p3, GNU MP 5.0.2)
32957@print{} Copyright (C) 1989, 1991-2015 Free Software Foundation.
32958@dots{}
32959@end example
32960
32961@noindent
32962(You may see different version numbers than what's shown here. That's OK;
32963what's important is to see that GNU MPFR and GNU MP are listed in
32964the output.)
32965
32966Additionally, there are a few elements available in the @code{PROCINFO}
32967array to provide information about the MPFR and GMP libraries
32968(@pxref{Auto-set}).
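
As a minimal sketch, the following checks for one such element from
within a program.  It assumes that the element is named
@code{PROCINFO["mpfr_version"]}, as described in @ref{Auto-set};
the message text is only illustrative, and whether the element is also
present without @option{-M} may depend on your version of @command{gawk}:

@example
$ @kbd{gawk -M 'BEGIN @{}
>   @kbd{if ("mpfr_version" in PROCINFO)}
>       @kbd{print "MPFR support:", PROCINFO["mpfr_version"]}
>   @kbd{else}
>       @kbd{print "no MPFR support in this gawk"}
> @kbd{@}'}
@end example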
32969
32970The MPFR library provides precise control over precisions and rounding
32971modes, and gives correctly rounded, reproducible, platform-independent
32972results.  With the @option{-M} command-line option,
32973all floating-point arithmetic operators and numeric functions
32974can yield results to any desired precision level supported by MPFR.
32975
32976Two predefined variables, @code{PREC} and @code{ROUNDMODE},
32977provide control over the working precision and the rounding mode.
32978The precision and the rounding mode are set globally for every operation
32979to follow.
32980@xref{Setting precision} and @ref{Setting the rounding mode}
32981for more information.
32982
32983@node FP Math Caution
32984@section Floating-Point Arithmetic: Caveat Emptor!
32985
32986@quotation
32987@i{Math class is tough!}
32988@author Teen Talk Barbie, July 1992
32989@end quotation
32990
32991This @value{SECTION} provides a high-level overview of the issues
32992involved when doing lots of floating-point arithmetic.@footnote{There
32993is a very nice @uref{http://www.validlab.com/goldberg/paper.pdf,
32994paper on floating-point arithmetic} by David Goldberg, ``What Every
32995Computer Scientist Should Know About Floating-Point Arithmetic,''
32996@cite{ACM Computing Surveys} @strong{23}, 1 (1991-03): 5-48.  This is
32997worth reading if you are interested in the details, but it does require
32998a background in computer science.}
32999The discussion applies to both hardware and arbitrary-precision
33000floating-point arithmetic.
33001
33002@quotation CAUTION
33003The material here is purposely general. If you need to do serious
33004computer arithmetic, you should do some research first, and not
33005rely just on what we tell you.
33006@end quotation
33007
33008@menu
* Inexactness of computations:: Floating-point math is not exact.
33010* Getting Accuracy::            Getting more accuracy takes some work.
33011* Try To Round::                Add digits and round.
33012* Setting precision::           How to set the precision.
33013* Setting the rounding mode::   How to set the rounding mode.
33014@end menu
33015
33016@node Inexactness of computations
33017@subsection Floating-Point Arithmetic Is Not Exact
33018
33019Binary floating-point representations and arithmetic are inexact.
33020Simple values like 0.1 cannot be precisely represented using
33021binary floating-point numbers, and the limited precision of
33022floating-point numbers means that slight changes in
33023the order of operations or the precision of intermediate storage
33024can change the result. To make matters worse, with arbitrary-precision
33025floating-point arithmetic, you can set the precision before starting a
33026computation, but then you cannot be sure of the number of significant
33027decimal places in the final result.
33028
33029@menu
33030* Inexact representation::      Numbers are not exactly represented.
* Comparing FP Values::         How to compare floating-point values.
33032* Errors accumulate::           Errors get bigger as they go.
33033@end menu
33034
33035@node Inexact representation
33036@subsubsection Many Numbers Cannot Be Represented Exactly
33037
33038So, before you start to write any code, you should think
33039about what you really want and what's really happening. Consider the
33040two numbers in the following example:
33041
33042@example
33043x = 0.875             # 1/2 + 1/4 + 1/8
33044y = 0.425
33045@end example
33046
33047Unlike the number in @code{y}, the number stored in @code{x}
33048is exactly representable
33049in binary because it can be written as a finite sum of one or
33050more fractions whose denominators are all powers of two.
33051When @command{gawk} reads a floating-point number from
33052program source, it automatically rounds that number to whatever
33053precision your machine supports. If you try to print the numeric
33054content of a variable using an output format string of @code{"%.17g"},
33055it may not produce the same number as you assigned to it:
33056
33057@example
33058$ @kbd{gawk 'BEGIN @{ x = 0.875; y = 0.425}
33059> @kbd{              printf("%0.17g, %0.17g\n", x, y) @}'}
33060@print{} 0.875, 0.42499999999999999
33061@end example
33062
33063Often the error is so small you do not even notice it, and if you do,
33064you can always specify how much precision you would like in your output.
33065Usually this is a format string like @code{"%.15g"}, which, when
33066used in the previous example, produces an output identical to the input.
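
For instance, here is the previous example again, changed only to use
15 significant digits in the format string:

@example
$ @kbd{gawk 'BEGIN @{ x = 0.875; y = 0.425}
> @kbd{              printf("%0.15g, %0.15g\n", x, y) @}'}
@print{} 0.875, 0.425
@end example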
33067
33068@node Comparing FP Values
33069@subsubsection Be Careful Comparing Values
33070
33071Because the underlying representation can be a little bit off from the exact value,
33072comparing floating-point values to see if they are exactly equal is generally a bad idea.
33073Here is an example where it does not work like you would expect:
33074
33075@example
33076$ @kbd{gawk 'BEGIN @{ print (0.1 + 12.2 == 12.3) @}'}
33077@print{} 0
33078@end example
33079
33080The general wisdom when comparing floating-point values is to see if
33081they are within some small range of each other (called a @dfn{delta},
33082or @dfn{tolerance}).
33083You have to decide how small a delta is important to you. Code to do
33084this looks something like the following:
33085
33086@example
33087@group
33088delta = 0.00001                 # for example
33089difference = abs(a - b)         # subtract the two values
if (difference < delta)
    print "all ok"              # close enough to be considered equal
else
    print "not ok"              # differ by delta or more
33094@end group
33095@end example
33096
33097@noindent
33098(We assume that you have a simple absolute value function named
33099@code{abs()} defined elsewhere in your program.)  If you write a
33100function to compare values with a delta, you should be sure
33101to use @samp{difference < abs(delta)} in case someone passes
33102in a negative delta value.
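
The comparison just described can be packaged as a small function.
The following is only a sketch: @code{abs()} and @code{fp_equal()} are
names chosen for this illustration, not built-in functions, and the
delta of 0.00001 is an arbitrary choice:

@example
# abs --- return the absolute value of x
function abs(x)
@{
    return (x < 0) ? -x : x
@}

# fp_equal --- return true if a and b are within delta of each other
function fp_equal(a, b, delta)
@{
    return (abs(a - b) < abs(delta))
@}

BEGIN @{
    print fp_equal(0.1 + 12.2, 12.3, 0.00001)
@}
@end example

@noindent
When run, this program prints @samp{1}, because the two values differ
by far less than the chosen delta, even though a test with @samp{==}
reports them as unequal.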
33103
33104@node Errors accumulate
33105@subsubsection Errors Accumulate
33106
33107The loss of accuracy during a single computation with floating-point
33108numbers usually isn't enough to worry about. However, if you compute a
33109value that is the result of a sequence of floating-point operations,
33110the error can accumulate and greatly affect the computation itself.
33111Here is an attempt to compute the value of @value{PI} using one of its
33112many series representations:
33113
33114@example
33115BEGIN @{
33116    x = 1.0 / sqrt(3.0)
33117    n = 6
33118    for (i = 1; i < 30; i++) @{
33119        n = n * 2.0
33120        x = (sqrt(x * x + 1) - 1) / x
33121        printf("%.15f\n", n * x)
33122    @}
33123@}
33124@end example
33125
33126When run, the early errors propagate through later computations,
33127causing the loop to terminate prematurely after attempting to divide by zero:
33128
33129@example
33130$ @kbd{gawk -f pi.awk}
33131@print{} 3.215390309173475
33132@print{} 3.159659942097510
33133@print{} 3.146086215131467
33134@print{} 3.142714599645573
33135@dots{}
33136@print{} 3.224515243534819
33137@print{} 2.791117213058638
33138@print{} 0.000000000000000
33139@error{} gawk: pi.awk:6: fatal: division by zero attempted
33140@end example
33141
33142Here is an additional example where the inaccuracies in internal representations
33143yield an unexpected result:
33144
33145@example
33146$ @kbd{gawk 'BEGIN @{}
33147>   @kbd{for (d = 1.1; d <= 1.5; d += 0.1)    # loop five times (?)}
33148>       @kbd{i++}
33149>   @kbd{print i}
33150> @kbd{@}'}
33151@print{} 4
33152@end example
33153
33154@node Getting Accuracy
33155@subsection Getting the Accuracy You Need
33156
33157Can arbitrary-precision arithmetic give exact results? There are
33158no easy answers. The standard rules of algebra often do not apply
33159when using floating-point arithmetic.
33160Among other things, the distributive and associative laws
33161do not hold completely, and order of operation may be important
33162for your computation. Rounding error, cumulative precision loss,
33163and underflow are often troublesome.
33164
33165When @command{gawk} tests the expressions @samp{0.1 + 12.2} and
33166@samp{12.3} for equality using the machine double-precision arithmetic,
33167it decides that they are not equal!  (@xref{Comparing FP Values}.)
33168You can get the result you want by increasing the precision; 56 bits in
33169this case does the job:
33170
33171@example
33172$ @kbd{gawk -M -v PREC=56 'BEGIN @{ print (0.1 + 12.2 == 12.3) @}'}
33173@print{} 1
33174@end example
33175
33176If adding more bits is good, perhaps adding even more bits of
33177precision is better?
33178Here is what happens if we use an even larger value of @code{PREC}:
33179
33180@example
33181$ @kbd{gawk -M -v PREC=201 'BEGIN @{ print (0.1 + 12.2 == 12.3) @}'}
33182@print{} 0
33183@end example
33184
33185This is not a bug in @command{gawk} or in the MPFR library.
33186It is easy to forget that the finite number of bits used to store the value
33187is often just an approximation after proper rounding.
33188The test for equality succeeds if and only if @emph{all} bits in the two operands
33189are exactly the same. Because this is not necessarily true after floating-point
33190computations with a particular precision and effective rounding mode,
33191a straight test for equality may not work. Instead, compare the
33192two numbers to see if they are within the desirable delta of each other.
33193
33194In applications where 15 or fewer decimal places suffice,
33195hardware double-precision arithmetic can be adequate, and is usually much faster.
33196But you need to keep in mind that every floating-point operation
33197can suffer a new rounding error with catastrophic consequences, as illustrated
33198by our earlier attempt to compute the value of @value{PI}.
33199Extra precision can greatly enhance the stability and the accuracy
33200of your computation in such cases.
33201
33202Additionally, you should understand that
33203repeated addition is not necessarily equivalent to multiplication
33204in floating-point arithmetic. In the example in
33205@ref{Errors accumulate}:
33206
33207@example
33208$ @kbd{gawk 'BEGIN @{}
33209>   @kbd{for (d = 1.1; d <= 1.5; d += 0.1)    # loop five times (?)}
33210>       @kbd{i++}
33211>   @kbd{print i}
33212> @kbd{@}'}
33213@print{} 4
33214@end example
33215
33216@noindent
33217you may or may not succeed in getting the correct result by choosing
33218an arbitrarily large value for @code{PREC}. Reformulation of
33219the problem at hand is often the correct approach in such situations.
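
One possible reformulation of this loop, offered only as a sketch,
counts with an integer (which is represented exactly) and derives the
fractional value from it.  The variable name @code{n} is chosen just
for this illustration:

@example
$ @kbd{gawk 'BEGIN @{}
>   @kbd{for (n = 11; n <= 15; n++) @{   # the loop test uses exact integers}
>       @kbd{d = n / 10                 # derive 1.1, 1.2, ..., 1.5}
>       @kbd{i++}
>   @kbd{@}}
>   @kbd{print i}
> @kbd{@}'}
@print{} 5
@end example

@noindent
Each value of @code{d} is still only an approximation, but the error no
longer accumulates from one iteration to the next, and it no longer
affects the loop test.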
33220
33221@node Try To Round
33222@subsection Try a Few Extra Bits of Precision and Rounding
33223
33224Instead of arbitrary-precision floating-point arithmetic,
33225often all you need is an adjustment of your logic
33226or a different order for the operations in your calculation.
33227The stability and the accuracy of the computation of @value{PI}
33228in the earlier example can be enhanced by using the following
33229simple algebraic transformation:
33230
33231@example
33232(sqrt(x * x + 1) - 1) / x @equiv{} x / (sqrt(x * x + 1) + 1)
33233@end example
33234
33235@noindent
33236After making this change, the program converges to
33237@value{PI} in under 30 iterations:
33238
33239@example
33240$ @kbd{gawk -f pi2.awk}
33241@print{} 3.215390309173473
33242@print{} 3.159659942097501
33243@print{} 3.146086215131436
33244@print{} 3.142714599645370
33245@print{} 3.141873049979825
33246@dots{}
33247@print{} 3.141592653589797
33248@print{} 3.141592653589797
33249@end example
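
The text does not list @file{pi2.awk}.  A plausible version, assuming it
is simply the earlier program with the update of @code{x} replaced by
the transformed expression, might look like this:

@example
BEGIN @{
    x = 1.0 / sqrt(3.0)
    n = 6
    for (i = 1; i < 30; i++) @{
        n = n * 2.0
        # transformed update: avoids subtracting two nearly equal values
        x = x / (sqrt(x * x + 1) + 1)
        printf("%.15f\n", n * x)
    @}
@}
@end example

@noindent
The original expression subtracts one from a square root that gets
closer and closer to one as @code{x} shrinks, so most of the
significant digits cancel.  The transformed expression only adds
and divides, which avoids that cancellation.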
33250
33251@node Setting precision
33252@subsection Setting the Precision
33253
33254@command{gawk} uses a global working precision; it does not keep track of
33255the precision or accuracy of individual numbers. Performing an arithmetic
33256operation or calling a built-in function rounds the result to the current
33257working precision. The default working precision is 53 bits, which you can
33258modify using the predefined variable @code{PREC}. You can also set the
33259value to one of the predefined case-insensitive strings
33260shown in @ref{table-predefined-precision-strings},
33261to emulate an IEEE 754 binary format.
33262
33263@float Table,table-predefined-precision-strings
33264@caption{Predefined precision strings for @code{PREC}}
33265@multitable {@code{"double"}} {12345678901234567890123456789012345}
33266@headitem @code{PREC} @tab IEEE 754 binary format
33267@item @code{"half"} @tab 16-bit half-precision
33268@item @code{"single"} @tab Basic 32-bit single precision
33269@item @code{"double"} @tab Basic 64-bit double precision
33270@item @code{"quad"} @tab Basic 128-bit quadruple precision
33271@item @code{"oct"} @tab 256-bit octuple precision
33272@end multitable
33273@end float
33274
33275The following example illustrates the effects of changing precision
33276on arithmetic operations:
33277
33278@example
33279$ @kbd{gawk -M -v PREC=100 'BEGIN @{ x = 1.0e-400; print x + 0}
33280>   @kbd{PREC = "double"; print x + 0 @}'}
33281@print{} 1e-400
33282@print{} 0
33283@end example
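
As a further sketch, you can watch the working precision take effect by
printing the same quotient at two settings of @code{PREC}.  The values
53 and 200 and the format @code{"%.40f"} are arbitrary choices for
illustration:

@example
$ @kbd{gawk -M -v PREC=53 'BEGIN @{ printf("%.40f\n", 1/3) @}'}
$ @kbd{gawk -M -v PREC=200 'BEGIN @{ printf("%.40f\n", 1/3) @}'}
@end example

@noindent
With 53 bits, the printed digits should stop being threes after roughly
the sixteenth place, whereas 200 bits is more than enough to get all 40
places right.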
33284
33285@quotation CAUTION
33286Be wary of floating-point constants! When reading a floating-point
33287constant from program source code, @command{gawk} uses the default
33288precision (that of a C @code{double}), unless overridden by an assignment
33289to the special variable @code{PREC} on the command line, to store it
33290internally as an MPFR number.  Changing the precision using @code{PREC}
33291in the program text does @emph{not} change the precision of a constant.
33292
33293If you need to represent a floating-point constant at a higher precision
33294than the default and cannot use a command-line assignment to @code{PREC},
33295you should either specify the constant as a string, or as a rational
33296number, whenever possible. The following example illustrates the
33297differences among various ways to print a floating-point constant:
33298
33299@example
33300$ @kbd{gawk -M 'BEGIN @{ PREC = 113; printf("%0.25f\n", 0.1) @}'}
33301@print{} 0.1000000000000000055511151
33302$ @kbd{gawk -M -v PREC=113 'BEGIN @{ printf("%0.25f\n", 0.1) @}'}
33303@print{} 0.1000000000000000000000000
33304$ @kbd{gawk -M 'BEGIN @{ PREC = 113; printf("%0.25f\n", "0.1") @}'}
33305@print{} 0.1000000000000000000000000
33306$ @kbd{gawk -M 'BEGIN @{ PREC = 113; printf("%0.25f\n", 1/10) @}'}
33307@print{} 0.1000000000000000000000000
33308@end example
33309@end quotation
33310
33311@node Setting the rounding mode
33312@subsection Setting the Rounding Mode
33313
33314@cindex @code{ROUNDMODE} variable
33315The @code{ROUNDMODE} variable provides
33316program-level control over the rounding mode.
33317The correspondence between @code{ROUNDMODE} and the IEEE
33318rounding modes is shown in @ref{table-gawk-rounding-modes}.
33319
33320@float Table,table-gawk-rounding-modes
33321@caption{@command{gawk} rounding modes}
33322@multitable @columnfractions .45 .30 .25
33323@headitem Rounding mode @tab IEEE name @tab @code{ROUNDMODE}
33324@item Round to nearest, ties to even @tab @code{roundTiesToEven} @tab @code{"N"} or @code{"n"}
33325@item Round toward positive infinity @tab @code{roundTowardPositive} @tab @code{"U"} or @code{"u"}
33326@item Round toward negative infinity @tab @code{roundTowardNegative} @tab @code{"D"} or @code{"d"}
33327@item Round toward zero @tab @code{roundTowardZero} @tab @code{"Z"} or @code{"z"}
33328@item Round away from zero @tab @tab @code{"A"} or @code{"a"}
33329@end multitable
33330@end float
33331
33332@code{ROUNDMODE} has the default value @code{"N"}, which
33333selects the IEEE 754 rounding mode @code{roundTiesToEven}.
33334In @ref{table-gawk-rounding-modes}, the value @code{"A"} selects
33335rounding away from zero. This is only available if your version of the
33336MPFR library supports it; otherwise, setting @code{ROUNDMODE} to @code{"A"}
33337has no effect.
33338
33339The default mode @code{roundTiesToEven} is the most preferred,
33340but the least intuitive. This method does the obvious thing for most values,
33341by rounding them up or down to the nearest digit.
33342For example, rounding 1.132 to two digits yields 1.13,
33343and rounding 1.157 yields 1.16.
33344
33345However, when it comes to rounding a value that is exactly halfway between,
33346things do not work the way you probably learned in school.
33347In this case, the number is rounded to the nearest even digit.
33348So rounding 0.125 to two digits rounds down to 0.12,
33349but rounding 0.6875 to three digits rounds up to 0.688.
33350You probably have already encountered this rounding mode when
33351using @code{printf} to format floating-point numbers.
33352For example:
33353
33354@example
33355BEGIN @{
33356    x = -4.5
33357    for (i = 1; i < 10; i++) @{
33358        x += 1.0
33359        printf("%4.1f => %2.0f\n", x, x)
33360    @}
33361@}
33362@end example
33363
33364@noindent
33365produces the following output when run on the author's system:@footnote{It
33366is possible for the output to be completely different if the
33367C library in your system does not use the IEEE 754 even-rounding
33368rule to round halfway cases for @code{printf}.}
33369
33370@example
33371-3.5 => -4
33372-2.5 => -2
33373-1.5 => -2
33374-0.5 => 0
33375 0.5 => 0
33376 1.5 => 2
33377 2.5 => 2
33378 3.5 => 4
33379 4.5 => 4
33380@end example
33381
33382The theory behind @code{roundTiesToEven} is that it more or less evenly
33383distributes upward and downward rounds of exact halves, which might
33384cause any accumulating round-off error to cancel itself out. This is the
33385default rounding mode for IEEE 754 computing functions and operators.
33386
33387@c January 2018. Thanks to nethox@gmail.com for the example.
33388@sidebar Rounding Modes and Conversion
33389It's important to understand that, along with @code{CONVFMT} and
33390@code{OFMT}, the rounding mode affects how numbers are converted to strings.
33391For example, consider the following program:
33392
33393@example
33394BEGIN @{
33395    pi = 3.1416
33396    OFMT = "%.f"        # Print value as integer
33397    print pi            # ROUNDMODE = "N" by default.
33398    ROUNDMODE = "U"     # Now change ROUNDMODE
33399    print pi
33400@}
33401@end example
33402
33403@noindent
33404Running this program produces this output:
33405
33406@example
33407$ @kbd{gawk -M -f roundmode.awk}
33408@print{} 3
33409@print{} 4
33410@end example
33411@end sidebar
33412
The other rounding modes are used less often.  Rounding toward positive infinity
(@code{roundTowardPositive}) and toward negative infinity
(@code{roundTowardNegative}) are typically used to implement interval
33416arithmetic, where you adjust the rounding mode to calculate upper and
33417lower bounds for the range of output. The @code{roundTowardZero} mode can
33418be used for converting floating-point numbers to integers.  When rounding
33419away from zero, the nearest number with magnitude greater than or equal to
33420the value is selected.
33421
33422Some numerical analysts will tell you that your choice of rounding
33423style has tremendous impact on the final outcome, and advise you to
33424wait until final output for any rounding. Instead, you can often avoid
33425round-off error problems by setting the precision initially to some
33426value sufficiently larger than the final desired precision, so that
33427the accumulation of round-off error does not influence the outcome.
33428If you suspect that results from your computation are sensitive to
33429accumulation of round-off error, look for a significant difference in
33430output when you change the rounding mode to be sure.
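
One way to carry out such a check, sketched here for a POSIX-style
shell, is to run the same computation once per rounding mode and
compare the results.  The computation (adding 0.1 ten thousand times)
and the variable names @code{mode} and @code{sum} are placeholders for
your own program:

@example
$ @kbd{for mode in N U D Z}
> @kbd{do}
>     @kbd{gawk -M -v ROUNDMODE=$mode 'BEGIN @{}
>         @kbd{for (i = 1; i <= 10000; i++)}
>             @kbd{sum += 0.1}
>         @kbd{printf("%s: %.20g\n", ROUNDMODE, sum)}
>     @kbd{@}'}
> @kbd{done}
@end example

@noindent
If the four results agree to as many digits as you need, round-off
error is probably not a concern for that computation.  If they differ
within the digits you care about, consider raising @code{PREC} or
reformulating the computation.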
33431
33432@node Arbitrary Precision Integers
33433@section Arbitrary-Precision Integer Arithmetic with @command{gawk}
33434@cindex integers @subentry arbitrary precision
33435@cindex arbitrary precision @subentry integers
33436
33437When given the @option{-M} option,
33438@command{gawk} performs all integer arithmetic using GMP arbitrary-precision
33439integers.  Any number that looks like an integer in a source
33440or @value{DF} is stored as an arbitrary-precision integer.  The size
33441of the integer is limited only by the available memory.  For example,
33442the following computes
33443@iftex
33444@math{5^{4^{3^{2}}}},
33445@end iftex
33446@ifinfo
334475^4^3^2,
33448@end ifinfo
33449@ifnottex
33450@ifnotinfo
334515@sup{4@sup{3@sup{2}}},
33452@end ifnotinfo
33453@end ifnottex
33454the result of which is beyond the
33455limits of ordinary hardware double-precision floating-point values:
33456
33457@example
33458$ @kbd{gawk -M 'BEGIN @{}
33459>   @kbd{x = 5^4^3^2}
33460>   @kbd{print "number of digits =", length(x)}
33461>   @kbd{print substr(x, 1, 20), "...", substr(x, length(x) - 19, 20)}
33462> @kbd{@}'}
33463@print{} number of digits = 183231
33464@print{} 62060698786608744707 ... 92256259918212890625
33465@end example
33466
33467If instead you were to compute the same value using arbitrary-precision
33468floating-point values, the precision needed for correct output (using
33469the formula
33470@iftex
33471@math{prec = 3.322 @cdot dps})
33472would be @math{3.322 @cdot 183231},
33473@end iftex
33474@ifnottex
33475@ifnotdocbook
33476@samp{prec = 3.322 * dps})
33477would be 3.322 x 183231,
33478@end ifnotdocbook
33479@end ifnottex
33480@docbook
33481<emphasis>prec</emphasis> = 3.322 &sdot; <emphasis>dps</emphasis>)
33482would be
33483<emphasis>prec</emphasis> = 3.322 &sdot; 183231,
33484@end docbook
33485or 608693.
33486
33487The result from an arithmetic operation with an integer and a floating-point value
33488is a floating-point value with a precision equal to the working precision.
33489The following program calculates the eighth term in
33490Sylvester's sequence@footnote{Weisstein, Eric W.
33491@cite{Sylvester's Sequence}. From MathWorld---A Wolfram Web Resource
33492@w{(@url{http://mathworld.wolfram.com/SylvestersSequence.html}).}}
33493using a recurrence:
33494
33495@example
33496$ @kbd{gawk -M 'BEGIN @{}
33497>   @kbd{s = 2.0}
33498>   @kbd{for (i = 1; i <= 7; i++)}
33499>       @kbd{s = s * (s - 1) + 1}
33500>   @kbd{print s}
33501> @kbd{@}'}
33502@print{} 113423713055421845118910464
33503@end example
33504
33505The output differs from the actual number, 113,423,713,055,421,844,361,000,443,
33506because the default precision of 53 bits is not enough to represent the
33507floating-point results exactly. You can either increase the precision
33508(100 bits is enough in this case), or replace the floating-point constant
33509@samp{2.0} with an integer, to perform all computations using integer
33510arithmetic to get the correct output.
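
For example, here is the same program with the constant written as an
integer, so that every operation is carried out in arbitrary-precision
integer arithmetic:

@example
$ @kbd{gawk -M 'BEGIN @{}
>   @kbd{s = 2}
>   @kbd{for (i = 1; i <= 7; i++)}
>       @kbd{s = s * (s - 1) + 1}
>   @kbd{print s}
> @kbd{@}'}
@print{} 113423713055421844361000443
@end example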
33511
33512Sometimes @command{gawk} must implicitly convert an arbitrary-precision
33513integer into an arbitrary-precision floating-point value.  This is
33514primarily because the MPFR library does not always provide the relevant
33515interface to process arbitrary-precision integers or mixed-mode numbers
33516as needed by an operation or function.  In such a case, the precision is
33517set to the minimum value necessary for exact conversion, and the working
33518precision is not used for this purpose.  If this is not what you need or
33519want, you can employ a subterfuge and convert the integer to floating
33520point first, like this:
33521
33522@example
33523gawk -M 'BEGIN @{ n = 13; print (n + 0.0) % 2.0 @}'
33524@end example
33525
33526You can avoid this issue altogether by specifying the number as a floating-point value
33527to begin with:
33528
33529@example
33530gawk -M 'BEGIN @{ n = 13.0; print n % 2.0 @}'
33531@end example
33532
33533Note that for this particular example, it is likely best
33534to just use the following:
33535
33536@example
33537gawk -M 'BEGIN @{ n = 13; print n % 2 @}'
33538@end example
33539
When dividing two arbitrary-precision integers with either
@samp{/} or @samp{%}, the result is typically an arbitrary-precision
floating-point value (unless the denominator evenly
divides into the numerator).
33544@ifset INTDIV
In order to do integer division
or remainder with arbitrary-precision integers, use the built-in
33547@code{intdiv0()} function (@pxref{Numeric Functions}).
33548
33549You can simulate the @code{intdiv0()} function in standard @command{awk}
33550using this user-defined function:
33551
33552@example
33553@c file eg/lib/intdiv0.awk
33554# intdiv0 --- do integer division
33555
33556@c endfile
33557@ignore
33558@c file eg/lib/intdiv0.awk
33559#
33560# Arnold Robbins, arnold@@skeeve.com, Public Domain
33561# July, 2014
33562#
33563# Name changed from div() to intdiv()
33564# April, 2015
33565#
33566# Changed to intdiv0()
33567# April, 2016
33568
33569@c endfile
33570
33571@end ignore
33572@c file eg/lib/intdiv0.awk
33573function intdiv0(numerator, denominator, result)
33574@{
33575    split("", result)
33576
33577    numerator = int(numerator)
33578    denominator = int(denominator)
33579    result["quotient"] = int(numerator / denominator)
33580    result["remainder"] = int(numerator % denominator)
33581
33582    return 0.0
33583@}
33584@c endfile
33585@end example
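
For comparison, the same truncating split into quotient and remainder
can be sketched in C for machine-sized integers.  This is only an
illustration and is not part of @command{gawk}: the name
@code{intdiv0_sketch()} and the zero-denominator check are assumptions
made here, while @code{lldiv()} is the standard C99 library function.

```c
#include <stdlib.h>

/*
 * intdiv0_sketch: hypothetical C analogue of the awk function above.
 * Stores the truncated quotient and the remainder through the
 * pointers.  Returns 0 on success, or -1 if the denominator is zero
 * (the awk version has no such check).
 */
int
intdiv0_sketch(long long numerator, long long denominator,
               long long *quotient, long long *remainder)
{
    lldiv_t r;

    if (denominator == 0)
        return -1;

    r = lldiv(numerator, denominator);  /* truncates toward zero */
    *quotient = r.quot;
    *remainder = r.rem;
    return 0;
}
```

Calling it with 17 and 5 stores a quotient of 3 and a remainder of 2;
with @minus{}17 and 5 it stores @minus{}3 and @minus{}2, because both
results are truncated toward zero.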
33586
33587The following example program, contributed by Katie Wasserman,
33588uses @code{intdiv0()} to
33589compute the digits of @value{PI} to as many places as you
33590choose to set:
33591
33592@example
33593@c file eg/prog/pi.awk
33594@group
33595# pi.awk --- compute the digits of pi
33596@c endfile
33598@ignore
33599@c file eg/prog/pi.awk
33600#
33601# Katie Wasserman, katie@@wass.net
33602# August 2014
33603@c endfile
33604@end ignore
33605@c file eg/prog/pi.awk
33606
33607BEGIN @{
33608    digits = 100000
33609    two = 2 * 10 ^ digits
33610@end group
33611    pi = two
33612    for (m = digits * 4; m > 0; --m) @{
33613        d = m * 2 + 1
33614        x = pi * m
33615        intdiv0(x, d, result)
33616        pi = result["quotient"]
33617        pi = pi + two
33618    @}
33619    print pi
33620@}
33621@c endfile
33622@end example
33623
33624@ignore
33625Date: Wed, 20 Aug 2014 10:19:11 -0400
33626To: arnold@skeeve.com
33627From: Katherine Wasserman <katie@wass.net>
33628Subject: Re: computation of digits of pi?
33629
33630Arnold,
33631
33632>The program that you sent to compute the digits of pi using div(). Is
33633>that some standard algorithm that every math student knows? If so,
33634>what's it called?
33635
33636It's not that well known but it's not that obscure either
33637
33638It's Euler's modification to Newton's method for calculating pi.
33639
33640Take a look at lines (23) - (25)  here: http://mathworld.wolfram.com/PiFormulas.htm
33641
33642The algorithm I wrote simply expands the multiply by 2 and works from the innermost expression outwards.  I used this to program HP calculators because it's quite easy to modify for tiny memory devices with smallish word sizes.
33643
33644http://www.hpmuseum.org/cgi-sys/cgiwrap/hpmuseum/articles.cgi?read=899
33645
33646-Katie
33647@end ignore
33648
33649When asked about the algorithm used, Katie replied:
33650
33651@quotation
33652It's not that well known but it's not that obscure either.
33653It's Euler's modification to Newton's method for calculating pi.
33654Take a look at lines (23) - (25) here: @uref{http://mathworld.wolfram.com/PiFormulas.html}.
33655
33656The algorithm I wrote simply expands the multiply by 2 and works from
33657the innermost expression outwards.  I used this to program HP calculators
33658because it's quite easy to modify for tiny memory devices with smallish
33659word sizes. See
33660@uref{http://www.hpmuseum.org/cgi-sys/cgiwrap/hpmuseum/articles.cgi?read=899}.
33661@end quotation
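
The inner loop of @file{pi.awk} can also be sketched with fixed-width
integers, which shows the ``innermost expression outwards'' evaluation
on a small scale.  The function name @code{pi_digits_sketch()} and the
15-digit limit are assumptions made for this illustration; the limit
keeps every intermediate product within a 64-bit @code{long long},
whereas the @command{awk} program relies on arbitrary-precision
integers.

```c
/*
 * pi_digits_sketch: fixed-point version of the loop in pi.awk.
 * Returns pi scaled by 10^digits, computed with truncating integer
 * division.  Limited to 1 <= digits <= 15 so that every intermediate
 * product fits in a 64-bit long long; returns -1 otherwise.
 */
long long
pi_digits_sketch(int digits)
{
    long long two = 2, pi;
    int i, m;

    if (digits < 1 || digits > 15)
        return -1;

    for (i = 0; i < digits; i++)
        two *= 10;                  /* two = 2 * 10^digits */

    pi = two;
    for (m = digits * 4; m > 0; --m)
        pi = (pi * m) / (m * 2 + 1) + two;

    return pi;
}
```

Because each division truncates, the last digit or two of the result
can be too small; the leading digits are the ones to trust.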
33662@end ifset
33663
33664@node Checking for MPFR
33665@section How To Check If MPFR Is Available
33666
33667@cindex checking for MPFR
33668@cindex MPFR, checking for
33669Occasionally, you might like to be able to check if @command{gawk}
33670was invoked with the @option{-M} option, enabling arbitrary-precision
33671arithmetic.  You can do so with the following function, contributed
33672by Andrew Schorr:
33673
33674@example
33675@c file eg/lib/have_mpfr.awk
33676# adequate_math_precision --- return true if we have enough bits
33677@c endfile
33678@ignore
33679@c file eg/lib/have_mpfr.awk
33680#
33681# Andrew Schorr, aschorr@@telemetry-investments.com, Public Domain
33682# May 2017
33683@c endfile
33684@end ignore
33685@c file eg/lib/have_mpfr.awk
33686
33687function adequate_math_precision(n)
33688@{
33689    return (1 != (1+(1/(2^(n-1)))))
33690@}
33691@c endfile
33692@end example
33693
33694Here is code that invokes the function in order to check
33695if arbitrary-precision arithmetic is available:
33696
33697@example
33698BEGIN @{
33699    # How many bits of mantissa precision are required
33700    # for this program to function properly?
33701    fpbits = 123
33702
33703    # We hope that we were invoked with MPFR enabled. If so, the
33704    # following statement should configure calculations to our desired
33705    # precision.
33706    PREC = fpbits
33707
33708    if (! adequate_math_precision(fpbits)) @{
33709        print("Error: insufficient computation precision available.\n" \
33710              "Try again with the -M argument?") > "/dev/stderr"
33711        # Note: you may need to set a flag here to bail out of END rules
33712        exit 1
33713    @}
33714@}
33715@end example
33716
33717Please be aware that @code{exit} will jump to the @code{END} rules, if present (@pxref{Exit Statement}).
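
The same test can be tried against a C @code{double} to see why it
works.  The following is a sketch only: the name
@code{double_has_bits()} is made up here, and the comments assume an
IEEE 754 system, where a @code{double} carries 53 bits of mantissa
precision.

```c
/*
 * double_has_bits: hypothetical C counterpart of
 * adequate_math_precision().  Return 1 if a C double can tell
 * 1 + 2^-(n-1) apart from 1, that is, if it has at least n bits of
 * mantissa precision; return 0 otherwise.
 */
int
double_has_bits(int n)
{
    volatile double sum;    /* volatile: force rounding to double */
    double eps = 1.0;
    int i;

    for (i = 1; i < n; i++)
        eps /= 2.0;         /* eps = 2^-(n-1) */

    sum = 1.0 + eps;
    return sum != 1.0;
}
```

Under that assumption the function returns 1 for 53 bits and 0 for
54 bits or for the 123 bits requested in the @command{awk} example,
which is the situation the @code{BEGIN} rule reports when @option{-M}
is missing.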
33718
33719@node POSIX Floating Point Problems
33720@section Standards Versus Existing Practice
33721
33722Historically, @command{awk} has converted any nonnumeric-looking string
33723to the numeric value zero, when required.  Furthermore, the original
33724definition of the language and the original POSIX standards specified that
33725@command{awk} only understands decimal numbers (base 10), and not octal
33726(base 8) or hexadecimal numbers (base 16).
33727
33728Changes in the language of the
337292001 and 2004 POSIX standards can be interpreted to imply that @command{awk}
33730should support additional features.  These features are:
33731
33732@itemize @value{BULLET}
33733@item
33734Interpretation of floating-point data values specified in hexadecimal
33735notation (e.g., @code{0xDEADBEEF}). (Note: data values, @emph{not}
33736source code constants.)
33737
33738@item
33739Support for the special IEEE 754 floating-point values ``not a number''
33740(NaN), positive infinity (``inf''), and negative infinity (``@minus{}inf'').
33741In particular, the format for these values is as specified by the ISO 1999
33742C standard, which ignores case and can allow implementation-dependent additional
33743characters after the @samp{nan} and allow either @samp{inf} or @samp{infinity}.
33744@end itemize
33745
33746The first problem is that both of these are clear changes to historical
33747practice:
33748
33749@itemize @value{BULLET}
33750@item
33751The @command{gawk} maintainer feels that supporting hexadecimal
33752floating-point values, in particular, is ugly, and was never intended by the
33753original designers to be part of the language.
33754
33755@item
33756Allowing completely alphabetic strings to have valid numeric
33757values is also a very severe departure from historical practice.
33758@end itemize
33759
33760The second problem is that the @command{gawk} maintainer feels that this
33761interpretation of the standard, which required a certain amount of
33762``language lawyering'' to arrive at in the first place, was not even
33763intended by the standard developers.  In other words, ``We see how you
33764got where you are, but we don't think that that's where you want to be.''
33765
33766Recognizing these issues, but attempting to provide compatibility
33767with the earlier versions of the standard,
33768the 2008 POSIX standard added explicit wording to allow, but not require,
33769that @command{awk} support hexadecimal floating-point values and
33770special values for ``not a number'' and infinity.
33771
33772Although the @command{gawk} maintainer continues to feel that
33773providing those features is inadvisable,
33774nevertheless, on systems that support IEEE floating point, it seems
33775reasonable to provide @emph{some} way to support NaN and infinity values.
33776The solution implemented in @command{gawk} is as follows:
33777
33778@itemize @value{BULLET}
33779@item
33780With the @option{--posix} command-line option, @command{gawk} becomes
33781``hands off.'' String values are passed directly to the system library's
33782@code{strtod()} function, and if it successfully returns a numeric value,
33783that is what's used.@footnote{You asked for it, you got it.}
33784By definition, the results are not portable across
33785different systems.  They are also a little surprising:
33786
33787@example
33788$ @kbd{echo nanny | gawk --posix '@{ print $1 + 0 @}'}
33789@print{} nan
33790$ @kbd{echo 0xDeadBeef | gawk --posix '@{ print $1 + 0 @}'}
33791@print{} 3735928559
33792@end example
33793
33794@item
33795Without @option{--posix}, @command{gawk} interprets the four string values
33796@samp{+inf},
33797@samp{-inf},
33798@samp{+nan},
33799and
33800@samp{-nan}
33801specially, producing the corresponding special numeric values.
The leading sign acts as a signal to @command{gawk} (and the user)
33803that the value is really numeric.  Hexadecimal floating point is
33804not supported (unless you also use @option{--non-decimal-data},
33805which is @emph{not} recommended). For example:
33806
33807@example
33808$ @kbd{echo nanny | gawk '@{ print $1 + 0 @}'}
33809@print{} 0
33810$ @kbd{echo +nan | gawk '@{ print $1 + 0 @}'}
33811@print{} +nan
33812$ @kbd{echo 0xDeadBeef | gawk '@{ print $1 + 0 @}'}
33813@print{} 0
33814@end example
33815
33816@command{gawk} ignores case in the four special values.
33817Thus, @samp{+nan} and @samp{+NaN} are the same.
33818@end itemize
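
The non-POSIX rule just described, in which only the four signed
strings are special, can be sketched in C as follows.  This is an
illustration and not @command{gawk}'s actual code; the function name
@code{parse_signed_special()} is hypothetical.

```c
#include <ctype.h>
#include <math.h>
#include <stddef.h>

/*
 * parse_signed_special: hypothetical helper, not gawk's actual code.
 * If s is exactly one of "+inf", "-inf", "+nan", or "-nan" (in any
 * letter case), store the matching value in *out and return 1.
 * Otherwise leave *out alone and return 0.
 */
int
parse_signed_special(const char *s, double *out)
{
    static const char *const names[] = { "inf", "nan" };
    size_t i, j;

    if (s[0] != '+' && s[0] != '-')
        return 0;           /* the leading sign is mandatory */

    for (i = 0; i < 2; i++) {
        for (j = 0; names[i][j] != '\0'; j++)
            if (tolower((unsigned char) s[j + 1]) != names[i][j])
                break;
        if (names[i][j] != '\0' || s[j + 1] != '\0')
            continue;       /* mismatch, or trailing characters */

        *out = (i == 0) ? INFINITY : NAN;
        if (s[0] == '-')
            *out = -*out;
        return 1;
    }
    return 0;
}
```

With this rule, @samp{nanny} and @samp{0xDeadBeef} are rejected and so
would be treated as ordinary strings, whereas a C99 @code{strtod()}
accepts the leading @samp{nan} of @samp{nanny} and the whole of
@samp{0xDeadBeef}, as the @option{--posix} example shows.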
33819
33820@cindex POSIX mode
33821Besides handling input, @command{gawk} also needs to print ``correct'' values on
33822output when a value is either NaN or infinity. Starting with @value{PVERSION}
338234.2.2, for such values @command{gawk} prints one of the four strings
33824just described: @samp{+inf}, @samp{-inf}, @samp{+nan}, or @samp{-nan}.
In POSIX mode, by contrast, @command{gawk} prints whatever
the system's C @code{printf()} function produces for the value
with the @code{%g} format string.
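
The non-POSIX output rule can be sketched in C like this.  The helper
name @code{format_number_sketch()} is hypothetical and the code is not
taken from @command{gawk}; it only shows one way to produce the four
strings from the standard @code{isnan()}, @code{isinf()}, and
@code{signbit()} macros.

```c
#include <math.h>
#include <stdio.h>

/*
 * format_number_sketch: hypothetical helper, not gawk's actual code.
 * Writes one of the four strings "+inf", "-inf", "+nan", or "-nan"
 * for a special value, and falls back to printf's %g otherwise.
 * Returns the value returned by snprintf().
 */
int
format_number_sketch(double v, char *buf, size_t size)
{
    const char *sign = signbit(v) ? "-" : "+";

    if (isnan(v))
        return snprintf(buf, size, "%snan", sign);
    if (isinf(v))
        return snprintf(buf, size, "%sinf", sign);
    return snprintf(buf, size, "%g", v);
}
```

Printing the sign explicitly keeps the output in the same form that
@command{gawk} accepts on input without @option{--posix}.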
33828
33829@node Floating point summary
33830@section Summary
33831
33832@itemize @value{BULLET}
33833@item
33834Most computer arithmetic is done using either integers or floating-point
33835values.  Standard @command{awk} uses double-precision
33836floating-point values.
33837
33838@item
33839In the early 1990s Barbie mistakenly said, ``Math class is tough!''
33840Although math isn't tough, floating-point arithmetic isn't the same
33841as pencil-and-paper math, and care must be taken:
33842
33843@c nested list
33844@itemize @value{MINUS}
33845@item
33846Not all numbers can be represented exactly.
33847
33848@item
33849Comparing values should use a delta, instead of being done directly
33850with @samp{==} and @samp{!=}.
33851
33852@item
33853Errors accumulate.
33854
33855@item
33856Operations are not always truly associative or distributive.
33857@end itemize
33858
33859@item
33860Increasing the accuracy can help, but it is not a panacea.
33861
33862@item
33863Often, increasing the accuracy and then rounding to the desired
33864number of digits produces reasonable results.
33865
33866@item
33867Use @option{-M} (or @option{--bignum}) to enable MPFR
33868arithmetic. Use @code{PREC} to set the precision in bits, and
33869@code{ROUNDMODE} to set the IEEE 754 rounding mode.
33870
33871@item
33872With @option{-M}, @command{gawk} performs
33873arbitrary-precision integer arithmetic using the GMP library.
33874This is faster and more space-efficient than using MPFR for
33875the same calculations.
33876
33877@item
33878There are several areas with respect to floating-point
33879numbers where @command{gawk} disagrees with the POSIX standard.
33880It pays to be aware of them.
33881
33882@item
33883Overall, there is no need to be unduly suspicious about the results from
33884floating-point arithmetic. The lesson to remember is that floating-point
33885arithmetic is always more complex than arithmetic using pencil and
33886paper. In order to take advantage of the power of floating-point arithmetic,
33887you need to know its limitations and work within them. For most casual
33888use of floating-point arithmetic, you will often get the expected result
33889if you simply round the display of your final results to the correct number
33890of significant decimal digits.
33891
33892@item
33893As general advice, avoid presenting numerical data in a manner that
33894implies better precision than is actually the case.
33895
33896@end itemize
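
The advice about comparing with a delta can be sketched in C as
follows.  The helper name @code{nearly_equal()} and the use of an
absolute tolerance are choices made for this illustration; a relative
tolerance is often a better fit when the values vary widely in
magnitude.

```c
/*
 * nearly_equal: return 1 if a and b differ by no more than delta,
 * and 0 otherwise (including when either value is NaN).
 */
int
nearly_equal(double a, double b, double delta)
{
    double diff = a - b;

    if (diff < 0)
        diff = -diff;
    return diff <= delta;
}
```

For instance, @code{nearly_equal(0.1 + 0.2, 0.3, 1e-9)} returns 1 even
though comparing the same two values with @samp{==} is unreliable.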
33897
33898@node Dynamic Extensions
33899@chapter Writing Extensions for @command{gawk}
33900@cindex dynamically loaded extensions
33901
33902It is possible to add new functions written in C or C++ to @command{gawk} using
33903dynamically loaded libraries. This facility is available on systems
33904that support the C @code{dlopen()} and @code{dlsym()}
33905functions.  This @value{CHAPTER} describes how to create extensions
33906using code written in C or C++.
33907
33908If you don't know anything about C programming, you can safely skip this
33909@value{CHAPTER}, although you may wish to review the documentation on the
33910extensions that come with @command{gawk} (@pxref{Extension Samples}),
33911and the information on the @code{gawkextlib} project (@pxref{gawkextlib}).
33912The sample extensions are automatically built and installed when
33913@command{gawk} is.
33914
33915@quotation NOTE
33916When @option{--sandbox} is specified, extensions are disabled
33917(@pxref{Options}).
33918@end quotation
33919
33920@menu
33921* Extension Intro::             What is an extension.
33922* Plugin License::              A note about licensing.
33923* Extension Mechanism Outline:: An outline of how it works.
33924* Extension API Description::   A full description of the API.
33925* Finding Extensions::          How @command{gawk} finds compiled extensions.
33926* Extension Example::           Example C code for an extension.
33927* Extension Samples::           The sample extensions that ship with
33928                                @command{gawk}.
33929* gawkextlib::                  The @code{gawkextlib} project.
33930* Extension summary::           Extension summary.
33931* Extension Exercises::         Exercises.
33932@end menu
33933
33934@node Extension Intro
33935@section Introduction
33936
33937@cindex plug-in
33938An @dfn{extension} (sometimes called a @dfn{plug-in}) is a piece of
33939external compiled code that @command{gawk} can load at runtime to
33940provide additional functionality, over and above the built-in capabilities
33941described in the rest of this @value{DOCUMENT}.
33942
33943Extensions are useful because they allow you (of course) to extend
33944@command{gawk}'s functionality. For example, they can provide access to
33945system calls (such as @code{chdir()} to change directory) and to other
33946C library routines that could be of use.  As with most software,
33947``the sky is the limit''; if you can imagine something that you might
33948want to do and can write in C or C++, you can write an extension to do it!
33949
33950Extensions are written in C or C++, using the @dfn{application programming
33951interface} (API) defined for this purpose by the @command{gawk}
33952developers.  The rest of this @value{CHAPTER} explains
33953the facilities that the API provides and how to use
33954them, and presents a small example extension.  In addition, it documents
33955the sample extensions included in the @command{gawk} distribution
33956and describes the @code{gawkextlib} project.
33957@ifclear FOR_PRINT
33958@xref{Extension Design}, for a discussion of the extension mechanism
33959goals and design.
33960@end ifclear
33961@ifset FOR_PRINT
33962See @uref{https://www.gnu.org/software/gawk/manual/html_node/Extension-Design.html}
33963for a discussion of the extension mechanism
33964goals and design.
33965@end ifset
33966
33967@node Plugin License
33968@section Extension Licensing
33969
33970Every dynamic extension must be distributed under a license that is
33971compatible with the GNU GPL (@pxref{Copying}).
33972
33973In order for the extension to tell @command{gawk} that it is
33974properly licensed, the extension must define the global symbol
33975@code{plugin_is_GPL_compatible}.  If this symbol does not exist,
33976@command{gawk} emits a fatal error and exits when it tries to load
33977your extension.
33978
33979The declared type of the symbol should be @code{int}.  It does not need
33980to be in any allocated section, though.  The code merely asserts that
33981the symbol exists in the global scope.  Something like this is enough:
33982
33983@example
33984int plugin_is_GPL_compatible;
33985@end example
33986
33987@node Extension Mechanism Outline
33988@section How It Works at a High Level
33989
33990Communication between
33991@command{gawk} and an extension is two-way.  First, when an extension
33992is loaded, @command{gawk} passes it a pointer to a @code{struct} whose fields are
33993function pointers.
33994@ifnotdocbook
33995This is shown in @ref{figure-load-extension}.
33996@end ifnotdocbook
33997@ifdocbook
33998This is shown in @inlineraw{docbook, <xref linkend="figure-load-extension"/>}.
33999@end ifdocbook
34000
34001@ifnotdocbook
34002@float Figure,figure-load-extension
34003@caption{Loading the extension}
34004@center @image{api-figure1, , , Loading the extension}
34005@end float
34006@end ifnotdocbook
34007
34008@docbook
34009<figure id="figure-load-extension" float="0">
34010<title>Loading the extension</title>
34011<mediaobject>
34012<imageobject role="web"><imagedata fileref="api-figure1.png" format="PNG"/></imageobject>
34013</mediaobject>
34014</figure>
34015@end docbook
34016
34017The extension can call functions inside @command{gawk} through these
34018function pointers, at runtime, without needing (link-time) access
34019to @command{gawk}'s symbols.  One of these function pointers is to a
34020function for ``registering'' new functions.
34021@ifnotdocbook
34022This is shown in @ref{figure-register-new-function}.
34023@end ifnotdocbook
34024@ifdocbook
34025This is shown in @inlineraw{docbook, <xref linkend="figure-register-new-function"/>}.
34026@end ifdocbook
34027
34028@ifnotdocbook
34029@float Figure,figure-register-new-function
34030@caption{Registering a new function}
@center @image{api-figure2, , , Registering a new function}
34032@end float
34033@end ifnotdocbook
34034
34035@docbook
34036<figure id="figure-register-new-function" float="0">
34037<title>Registering a new function</title>
34038<mediaobject>
34039<imageobject role="web"><imagedata fileref="api-figure2.png" format="PNG"/></imageobject>
34040</mediaobject>
34041</figure>
34042@end docbook
34043
34044In the other direction, the extension registers its new functions
34045with @command{gawk} by passing function pointers to the functions that
34046provide the new feature (@code{do_chdir()}, for example).  @command{gawk}
34047associates the function pointer with a name and can then call it, using a
34048defined calling convention.
34049@ifnotdocbook
34050This is shown in @ref{figure-call-new-function}.
34051@end ifnotdocbook
34052@ifdocbook
34053This is shown in @inlineraw{docbook, <xref linkend="figure-call-new-function"/>}.
34054@end ifdocbook
34055
34056@ifnotdocbook
34057@float Figure,figure-call-new-function
34058@caption{Calling the new function}
34059@center @image{api-figure3, , , Calling the new function}
34060@end float
34061@end ifnotdocbook
34062
34063@docbook
34064<figure id="figure-call-new-function" float="0">
34065<title>Calling the new function</title>
34066<mediaobject>
34067<imageobject role="web"><imagedata fileref="api-figure3.png" format="PNG"/></imageobject>
34068</mediaobject>
34069</figure>
34070@end docbook
34071
34072The @code{do_@var{xxx}()} function, in turn, then uses the function
34073pointers in the API @code{struct} to do its work, such as updating
34074variables or arrays, printing messages, setting @code{ERRNO}, and so on.
34075
34076Convenience macros make calling through the function pointers look
34077like regular function calls so that extension code is quite readable
34078and understandable.
34079
34080Although all of this sounds somewhat complicated, the result is that
34081extension code is quite straightforward to write and to read. You can
34082see this in the sample extension @file{filefuncs.c} (@pxref{Extension
34083Example}) and also in the @file{testext.c} code for testing the APIs.
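
The two-way arrangement can be modeled in a few lines of
self-contained C.  Everything in this sketch is hypothetical
(@code{toy_api_t}, @code{toy_load()}, @code{toy_call()}, and so on);
it does not use @file{gawkapi.h}, and the real structure, registration
call, and calling convention are described later in this
@value{CHAPTER}.

```c
#include <stddef.h>
#include <string.h>

typedef int (*toy_func_t)(int arg);

/* The table of host entry points handed to the extension at load time. */
typedef struct toy_api {
    int (*register_func)(const char *name, toy_func_t func);
    void (*warning)(const char *message);
} toy_api_t;

/* ---- host side ---- */
#define TOY_MAX_FUNCS 8
static struct { const char *name; toy_func_t func; } toy_funcs[TOY_MAX_FUNCS];
static size_t toy_nfuncs;
static int toy_nwarnings;

static int
host_register(const char *name, toy_func_t func)
{
    if (toy_nfuncs == TOY_MAX_FUNCS)
        return 0;
    toy_funcs[toy_nfuncs].name = name;
    toy_funcs[toy_nfuncs].func = func;
    toy_nfuncs++;
    return 1;
}

static void
host_warning(const char *message)
{
    (void) message;
    toy_nwarnings++;    /* a real host would print the message */
}

static const toy_api_t toy_host_api = { host_register, host_warning };

/* Look up a registered function by name and call it. */
int
toy_call(const char *name, int arg, int *result)
{
    size_t i;

    for (i = 0; i < toy_nfuncs; i++)
        if (strcmp(toy_funcs[i].name, name) == 0) {
            *result = toy_funcs[i].func(arg);
            return 1;
        }
    return 0;
}

/* ---- extension side ---- */
static const toy_api_t *api;    /* saved when the extension is loaded */

/* Convenience macro: calling through the pointer reads like a call. */
#define toy_warning(msg) (api->warning(msg))

static int
do_double(int arg)
{
    if (arg < 0)
        toy_warning("do_double: negative argument");
    return 2 * arg;
}

/* Entry point the host calls once, passing its table of pointers. */
int
toy_load(const toy_api_t *host)
{
    api = host;
    return api->register_func("double_it", do_double);
}
```

After the host calls @code{toy_load(&toy_host_api)}, a call such as
@code{toy_call("double_it", 21, &r)} reaches @code{do_double()} through
the registered pointer, and @code{do_double()} in turn reaches the host
only through the table it was given.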
34084
34085Some other bits and pieces:
34086
34087@itemize @value{BULLET}
34088@item
34089The API provides access to @command{gawk}'s @code{do_@var{xxx}} values,
34090reflecting command-line options, like @code{do_lint}, @code{do_profiling},
34091and so on (@pxref{Extension API Variables}).
34092These are informational: an extension cannot affect their values
34093inside @command{gawk}.  In addition, attempting to assign to them
34094produces a compile-time error.
34095
34096@item
34097The API also provides major and minor version numbers, so that an
34098extension can check if the @command{gawk} it is loaded with supports the
34099facilities it was compiled with.  (Version mismatches ``shouldn't''
34100happen, but we all know how @emph{that} goes.)
@xref{Extension Versioning}, for details.
34102@end itemize
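
Both points can be illustrated with a small sketch.  The names
(@code{TOY_API_MAJOR}, @code{toy_version_ok()}, @code{struct toy_flags})
are hypothetical, and the compatibility rule shown, equal major
versions and a host minor version at least as new as the extension's,
is an assumption made for this example rather than a statement of what
@file{gawkapi.h} does.

```c
/*
 * Hypothetical stand-ins for the version numbers an extension was
 * compiled against; the real names live in gawkapi.h.
 */
#define TOY_API_MAJOR 3
#define TOY_API_MINOR 1

/*
 * toy_version_ok: return 1 if a host offering host_major.host_minor
 * can load this extension.  Rule assumed here: the major versions
 * must match exactly, and the host's minor version must not be older
 * than the one the extension was built with.
 */
int
toy_version_ok(int host_major, int host_minor)
{
    return host_major == TOY_API_MAJOR && host_minor >= TOY_API_MINOR;
}

/*
 * Read-only view of host options.  Because the members are const,
 * an assignment such as flags->do_lint = 0 fails to compile.
 */
struct toy_flags {
    const int do_lint;
    const int do_profiling;
};

/* toy_is_linting: extensions may inspect the flags but not set them. */
int
toy_is_linting(const struct toy_flags *flags)
{
    return flags->do_lint != 0;
}
```

An extension would typically run such a version check once, at load
time, and refuse to continue if it fails.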
34103
34104@node Extension API Description
34105@section API Description
34106@cindex extension API
34107
34108C or C++ code for an extension must include the header file
34109@file{gawkapi.h}, which declares the functions and defines the data
34110types used to communicate with @command{gawk}.
34111This (rather large) @value{SECTION} describes the API in detail.
34112
34113@menu
34114* Extension API Functions Introduction:: Introduction to the API functions.
34115* General Data Types::                   The data types.
34116* Memory Allocation Functions::          Functions for allocating memory.
34117* Constructor Functions::                Functions for creating values.
34118* API Ownership of MPFR and GMP Values:: Managing MPFR and GMP Values.
34119* Registration Functions::               Functions to register things with
34120                                         @command{gawk}.
34121* Printing Messages::                    Functions for printing messages.
34122* Updating @code{ERRNO}::                Functions for updating @code{ERRNO}.
34123* Requesting Values::                    How to get a value.
34124* Accessing Parameters::                 Functions for accessing parameters.
34125* Symbol Table Access::                  Functions for accessing global
34126                                         variables.
34127* Array Manipulation::                   Functions for working with arrays.
34128* Redirection API::                      How to access and manipulate
34129                                         redirections.
34130* Extension API Variables::              Variables provided by the API.
34131* Extension API Boilerplate::            Boilerplate code for using the API.
34132* Changes from API V1::                  Changes from V1 of the API.
34133@end menu
34134
34135@node Extension API Functions Introduction
34136@subsection Introduction
34137
34138Access to facilities within @command{gawk} is achieved
34139by calling through function pointers passed into your extension.
34140
34141API function pointers are provided for the following kinds of operations:
34142
34143@itemize @value{BULLET}
34144@item
34145Allocating, reallocating, and releasing memory.
34146
34147@item
34148Registration functions. You may register:
34149
34150@c nested list
34151@itemize @value{MINUS}
34152@item
34153Extension functions
34154@item
34155Exit callbacks
34156@item
34157A version string
34158@item
34159Input parsers
34160@item
34161Output wrappers
34162@item
34163Two-way processors
34164@end itemize
34165
34166All of these are discussed in detail later in this @value{CHAPTER}.
34167
34168@item
34169Printing fatal, warning, and ``lint'' warning messages.
34170
34171@item
34172Updating @code{ERRNO}, or unsetting it.
34173
34174@item
34175Accessing parameters, including converting an undefined parameter into
34176an array.
34177
34178@item
34179Symbol table access: retrieving a global variable, creating one,
34180or changing one.
34181
34182@item
34183Creating and releasing cached values; this provides an
34184efficient way to use values for multiple variables and
34185can be a big performance win.
34186
34187@item
34188Manipulating arrays:
34189
34190@itemize @value{MINUS}
34191@item
34192Retrieving, adding, deleting, and modifying elements
34193
34194@item
34195Getting the count of elements in an array
34196
34197@item
34198Creating a new array
34199
34200@item
34201Clearing an array
34202
34203@item
34204Flattening an array for easy C-style looping over all its indices and elements
34205@end itemize
34206
34207@item
34208Accessing and manipulating redirections.
34209
34210@end itemize
34211
34212Some points about using the API:
34213
34214@itemize @value{BULLET}
34215@item
34216The following types, macros, and/or functions are referenced
34217in @file{gawkapi.h}.  For correct use, you must therefore include the
34218corresponding standard header file @emph{before} including @file{gawkapi.h}.
34219The list of macros and related header files is shown in @ref{table-api-std-headers}.
34220
34221@float Table,table-api-std-headers
34222@caption{Standard header files needed by API}
34223@multitable {@code{memset()}, @code{memcpy()}} {@code{<sys/types.h>}}
34224@headitem C entity @tab Header file
34225@item @code{EOF} @tab @code{<stdio.h>}
34226@item Values for @code{errno} @tab @code{<errno.h>}
34227@item @code{FILE} @tab @code{<stdio.h>}
34228@item @code{NULL} @tab @code{<stddef.h>}
34229@item @code{memcpy()} @tab @code{<string.h>}
34230@item @code{memset()} @tab @code{<string.h>}
34231@item @code{size_t} @tab @code{<sys/types.h>}
34232@item @code{struct stat} @tab @code{<sys/stat.h>}
34233@end multitable
34234@end float
34235
34236Due to portability concerns, especially to systems that are not
34237fully standards-compliant, it is your responsibility
34238to include the correct files in the correct way. This requirement
34239is necessary in order to keep @file{gawkapi.h} clean, instead of becoming
34240a portability hodge-podge as can be seen in some parts of
34241the @command{gawk} source code.
34242
34243@item
34244If your extension uses MPFR facilities, and you wish to receive such
34245values from @command{gawk} and/or pass such values to it, you must include the
34246@code{<mpfr.h>} header before including @code{<gawkapi.h>}.
34247
34248@item
34249The @file{gawkapi.h} file may be included more than once without ill effect.
34250Doing so, however, is poor coding practice.
34251
34252@item
The API uses only ISO C 90 features, with one exception: the
``constructor'' functions use the @code{inline} keyword. If your compiler
34255does not support this keyword, you should either place
34256@samp{-Dinline=''} on your command line or use the GNU Autotools and include a
34257@file{config.h} file in your extensions.
34258
34259@item
34260All pointers filled in by @command{gawk} point to memory
34261managed by @command{gawk} and should be treated by the extension as
34262read-only.
34263
34264Memory for @emph{all} strings passed into @command{gawk}
34265from the extension @emph{must} come from calling one of
34266@code{gawk_malloc()}, @code{gawk_calloc()}, or @code{gawk_realloc()},
34267and is managed by @command{gawk} from then on.
34268
34269Memory for MPFR/GMP values that come from @command{gawk}
34270should also be treated as read-only.  However, unlike strings,
34271memory for MPFR/GMP values allocated by an extension and passed
34272into @command{gawk} is @emph{copied} by @command{gawk}; the extension
34273should then free the values itself to avoid memory leaks. This is
discussed further in @ref{API Ownership of MPFR and GMP Values}.
34275
34276@item
34277The API defines several simple @code{struct}s that map values as seen
34278from @command{awk}.  A value can be a @code{double}, a string, or an
34279array (as in multidimensional arrays, or when creating a new array).
34280
34281String values maintain both pointer and length, because embedded @sc{nul}
34282characters are allowed.
34283
34284@quotation NOTE
34285By intent, @command{gawk} maintains strings using the current multibyte
34286encoding (as defined by @env{LC_@var{xxx}} environment variables)
34287and not using wide characters.  This matches how @command{gawk} stores
34288strings internally and also how characters are likely to be input into
34289and output from files.
34290@end quotation
34291
34292@quotation NOTE
34293String values passed to an extension by @command{gawk} are always
34294@sc{nul}-terminated.  Thus it is safe to pass such string values to
34295standard library and system routines. However, because @command{gawk}
34296allows embedded @sc{nul} characters in string data, before using the data
34297as a regular C string, you should check that the length for that string
34298passed to the extension matches the return value of @code{strlen()}
34299for it.
34300@end quotation
34301
34302@item
34303When retrieving a value (such as a parameter or that of a global variable
34304or array element), the extension requests a specific type (number, string,
34305scalar, value cookie, array, or ``undefined'').  When the request is
34306``undefined,'' the returned value will have the real underlying type.
34307
34308However, if the request and actual type don't match, the access function
34309returns ``false'' and fills in the type of the actual value that is there,
34310so that the extension can, e.g., print an error message
34311(such as ``scalar passed where array expected'').
34312
34313@c This is documented in the header file and needs some expanding upon.
34314@c The table there should be presented here
34315@end itemize
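
The last two points in the list, checking for embedded @sc{nul}
characters and requesting a value of a specific type, can be modeled
with a toy tagged union.  All of the names here (@code{toy_value_t},
@code{toy_request()}, @code{toy_is_c_string()}) are hypothetical, and
the sketch performs no type conversions; it models only the rule as
stated above, not the behavior of the real access functions.

```c
#include <stddef.h>
#include <string.h>

typedef enum { TOY_UNDEFINED, TOY_NUMBER, TOY_STRING } toy_type_t;

typedef struct toy_value {
    toy_type_t type;
    union {
        double num;
        struct { const char *str; size_t len; } s;
    } u;
} toy_value_t;

/*
 * toy_request: copy *actual into *result if it has the wanted type,
 * or if the caller asked for TOY_UNDEFINED ("whatever is there").
 * On a mismatch, fill in only the actual type and return 0, so the
 * caller can report what it found.
 */
int
toy_request(const toy_value_t *actual, toy_type_t wanted,
            toy_value_t *result)
{
    if (wanted != TOY_UNDEFINED && wanted != actual->type) {
        result->type = actual->type;
        return 0;
    }
    *result = *actual;
    return 1;
}

/*
 * toy_is_c_string: return 1 if a string value has no embedded NUL,
 * that is, if its recorded length equals what strlen() reports.
 */
int
toy_is_c_string(const toy_value_t *v)
{
    return v->type == TOY_STRING && strlen(v->u.s.str) == v->u.s.len;
}
```

A caller that gets 0 back from @code{toy_request()} can inspect
@code{result->type} to build a message such as ``scalar passed where
array expected.''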
34316
34317You may call the API functions by using the function pointers
34318directly, but the interface is not so pretty. To make extension code look
34319more like regular code, the @file{gawkapi.h} header file defines several
34320macros that you should use in your code.  This @value{SECTION} presents
34321the macros as if they were functions.
34322
34323@node General Data Types
34324@subsection General-Purpose Data Types
34325
34326@cindex Robbins @subentry Arnold
34327@cindex Ramey, Chet
34328@quotation
34329@i{I have a true love/hate relationship with unions.}
34330@author Arnold Robbins
34331@end quotation
34332
34333@quotation
34334@i{That's the thing about unions: the compiler will arrange things so they
34335can accommodate both love and hate.}
34336@author Chet Ramey
34337@end quotation
34338
34339The extension API defines a number of simple types and structures for
34340general-purpose use. Additional, more specialized, data structures are
34341introduced in subsequent @value{SECTION}s, together with the functions
34342that use them.
34343
34344The general-purpose types and structures are as follows:
34345
34346@table @code
34347@item typedef void *awk_ext_id_t;
34348A value of this type is received from @command{gawk} when an extension is loaded.
34349That value must then be passed back to @command{gawk} as the first parameter of
34350each API function.
34351
34352@item #define awk_const @dots{}
34353This macro expands to @samp{const} when compiling an extension,
34354and to nothing when compiling @command{gawk} itself.  This makes
34355certain fields in the API data structures unwritable from extension code,
34356while allowing @command{gawk} to use them as it needs to.
34357
34358@item typedef enum awk_bool @{
34359@itemx @ @ @ @ awk_false = 0,
34360@itemx @ @ @ @ awk_true
34361@itemx @} awk_bool_t;
34362A simple Boolean type.
34363
34364@item typedef struct awk_string @{
34365@itemx @ @ @ @ char *str;@ @ @ @ @ @ /* data */
34366@itemx @ @ @ @ size_t len;@ @ @ @ @ /* length thereof, in chars */
34367@itemx @} awk_string_t;
34368This represents a mutable string. @command{gawk}
34369owns the memory pointed to if it supplied
34370the value. Otherwise, it takes ownership of the memory pointed to.
34371@emph{Such memory must come from calling one of the
34372@code{gawk_malloc()}, @code{gawk_calloc()}, or
34373@code{gawk_realloc()} functions!}
34374
34375As mentioned earlier, strings are maintained using the current
34376multibyte encoding.
34377
34378@item typedef enum @{
34379@itemx @ @ @ @ AWK_UNDEFINED,
34380@itemx @ @ @ @ AWK_NUMBER,
34381@itemx @ @ @ @ AWK_STRING,
34382@itemx @ @ @ @ AWK_REGEX,
34383@itemx @ @ @ @ AWK_STRNUM,
34384@itemx @ @ @ @ AWK_ARRAY,
34385@itemx @ @ @ @ AWK_SCALAR,@ @ @ @ @ @ @ @ @ /* opaque access to a variable */
34386@itemx @ @ @ @ AWK_VALUE_COOKIE@ @ @ @ /* for updating a previously created value */
34387@itemx @} awk_valtype_t;
34388This @code{enum} indicates the type of a value.
34389It is used in the following @code{struct}.
34390
34391@item typedef struct awk_value @{
34392@itemx @ @ @ @ awk_valtype_t   val_type;
34393@itemx @ @ @ @ union @{
34394@itemx @ @ @ @ @ @ @ @ awk_string_t@ @ @ @ @ @ @ s;
@itemx @ @ @ @ @ @ @ @ awk_number_t@ @ @ @ @ @ @ n;
34396@itemx @ @ @ @ @ @ @ @ awk_array_t@ @ @ @ @ @ @ @ a;
34397@itemx @ @ @ @ @ @ @ @ awk_scalar_t@ @ @ @ @ @ @ scl;
34398@itemx @ @ @ @ @ @ @ @ awk_value_cookie_t@ vc;
34399@itemx @ @ @ @ @} u;
34400@itemx @} awk_value_t;
34401An ``@command{awk} value.''
34402The @code{val_type} member indicates what kind of value the
34403@code{union} holds, and each member is of the appropriate type.
34404
34405@item #define str_value@ @ @ @ @ @ u.s
34406@itemx #define strnum_value@ @ @ str_value
34407@itemx #define regex_value@ @ @ @ str_value
34408@itemx #define num_value@ @ @ @ @ @ u.n.d
34409@itemx #define num_type@ @ @ @ @ @ @ u.n.type
34410@itemx #define num_ptr@ @ @ @ @ @ @ @ u.n.ptr
34411@itemx #define array_cookie@ @ @ u.a
34412@itemx #define scalar_cookie@ @ u.scl
34413@itemx #define value_cookie@ @ @ u.vc
34414Using these macros makes accessing the fields of the @code{awk_value_t} more
34415readable.
34416
34417@item enum AWK_NUMBER_TYPE @{
34418@itemx @ @ @ @ AWK_NUMBER_TYPE_DOUBLE,
34419@itemx @ @ @ @ AWK_NUMBER_TYPE_MPFR,
34420@itemx @ @ @ @ AWK_NUMBER_TYPE_MPZ
34421@itemx @};
34422This @code{enum} is used in the following structure for defining the
34423type of numeric value that is being worked with.  It is declared at the
34424top level of the file so that it works correctly for C++ as well as for C.
34425
34426@item typedef struct awk_number @{
34427@itemx @ @ @ @ double d;
34428@itemx @ @ @ @ enum AWK_NUMBER_TYPE type;
34429@itemx @ @ @ @ void *ptr;
34430@itemx @} awk_number_t;
34431This represents a numeric value.  Internally, @command{gawk} stores
34432every number as either a C @code{double}, a GMP integer, or an MPFR
34433arbitrary-precision floating-point value.  In order to allow extensions
34434to also support GMP and MPFR values, numeric values are passed in this
34435structure.
34436
The double-precision @code{d} element is always populated
in data received from @command{gawk}. In addition, by examining the
@code{type} member, an extension can determine whether the @code{ptr}
member points to a GMP integer (type @code{mpz_ptr}) or to an MPFR
floating-point value (type @code{mpfr_ptr}), and cast it appropriately.
34442
34443@quotation CAUTION
Any MPFR or MPZ values that you create and pass to @command{gawk}
to save are @emph{copied}. This means you are responsible for releasing
the storage once you're done with it. See the sample @code{intdiv}
extension for some example code.
34448@end quotation
34449
34450@item typedef void *awk_scalar_t;
34451Scalars can be represented as an opaque type. These values are obtained
34452from @command{gawk} and then passed back into it. This is discussed
34453in a general fashion in the text following this list, and in more detail in
34454@ref{Symbol table by cookie}.
34455
34456@item typedef void *awk_value_cookie_t;
34457A ``value cookie'' is an opaque type representing a cached value.
34458This is also discussed in a general fashion in the text following this list,
34459and in more detail in @ref{Cached values}.
34460
34461@end table
34462
34463Scalar values in @command{awk} are numbers, strings, strnums, or typed regexps. The
34464@code{awk_value_t} struct represents values.  The @code{val_type} member
34465indicates what is in the @code{union}.
34466
34467Representing numbers is easy---the API uses a C @code{double}.  Strings
34468require more work. Because @command{gawk} allows embedded @sc{nul} bytes
34469in string values, a string must be represented as a pair containing a
34470data pointer and length. This is the @code{awk_string_t} type.
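
The following is a minimal sketch of why the length must travel with
the data. The @code{counted_string_t} type and the
@code{count_embedded_nuls()} helper are hypothetical stand-ins invented
for illustration; they are not part of @file{gawkapi.h}.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for awk_string_t: a data pointer plus a length. */
typedef struct counted_string {
    const char *str;   /* data; may contain NUL bytes */
    size_t len;        /* number of bytes in str */
} counted_string_t;

/* Count the NUL bytes stored inside the string's data. */
static size_t
count_embedded_nuls(const counted_string_t *s)
{
    size_t i, count = 0;

    assert(s != NULL && s->str != NULL);
    for (i = 0; i < s->len; i++)
        if (s->str[i] == '\0')
            count++;
    return count;
}

/* "ab\0cd" is five bytes of data; sizeof also counts the terminator. */
static const char sample_data[] = "ab\0cd";
static const counted_string_t sample = {
    sample_data, sizeof(sample_data) - 1
};
```

Here @code{strlen(sample.str)} stops at the first @sc{nul} byte and
reports a shorter length than @code{sample.len}, which is why extension
code should rely on the stored length instead.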
34471
34472A strnum (numeric string) value is represented as a string and consists
34473of user input data that appears to be numeric.
34474When an extension creates a strnum value, the result is a string flagged
34475as user input. Subsequent parsing by @command{gawk} then determines whether it
34476looks like a number and should be treated as a strnum, or as a regular string.
34477
34478This is useful in cases where an extension function would like to do something
34479comparable to the @code{split()} function which sets the strnum attribute
34480on the array elements it creates.  For example, an extension that implements
34481CSV splitting would want to use this feature. This is also useful for a
34482function that retrieves a data item from a database. The PostgreSQL
34483@code{PQgetvalue()} function, for example, returns a string that may be numeric
34484or textual depending on the contents.
34485
34486Typed regexp values (@pxref{Strong Regexp Constants}) are not of
34487much use to extension functions.  Extension functions can tell that
34488they've received them, and create them for scalar values. Otherwise,
34489they can examine the text of the regexp through @code{regex_value.str}
34490and @code{regex_value.len}.
34491
34492Identifiers (i.e., the names of global variables) can be associated
34493with either scalar values or with arrays.  In addition, @command{gawk}
34494provides true arrays of arrays, where any given array element can
34495itself be an array.  Discussion of arrays is delayed until
34496@ref{Array Manipulation}.
34497
34498The various macros listed earlier make it easier to use the elements
34499of the @code{union} as if they were fields in a @code{struct}; this
34500is a common coding practice in C.  Such code is easier to write and to
34501read, but it remains @emph{your} responsibility to make sure that
34502the @code{val_type} member correctly reflects the type of the value in
34503the @code{awk_value_t} struct.
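
As an illustration of keeping the tag and the @code{union} in step, here
is a small sketch. All of the @code{demo_} names are hypothetical,
reduced stand-ins for the real declarations in @file{gawkapi.h}; only
the pattern is the point.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified, hypothetical stand-ins; the real types live in gawkapi.h. */
typedef enum { DEMO_UNDEFINED, DEMO_NUMBER, DEMO_STRING } demo_valtype_t;

typedef struct demo_value {
    demo_valtype_t val_type;
    union {
        struct { const char *str; size_t len; } s;
        double n;
    } u;
} demo_value_t;

/* Accessor macros in the style of str_value and num_value. */
#define demo_str_value u.s
#define demo_num_value u.n

/* Return the number held in *v, or fallback if *v is not a number. */
static double
demo_get_number(const demo_value_t *v, double fallback)
{
    assert(v != NULL);
    if (v->val_type != DEMO_NUMBER)   /* check the tag before reading */
        return fallback;
    return v->demo_num_value;
}

/* Store a number, setting the tag and the union member together. */
static void
demo_set_number(demo_value_t *v, double d)
{
    assert(v != NULL);
    v->val_type = DEMO_NUMBER;
    v->demo_num_value = d;
}
```

Funneling every write through a helper such as
@code{demo_set_number()} is one way to avoid a tag that disagrees with
the member actually stored.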
34504
34505Conceptually, the first three members of the @code{union} (number, string,
34506and array) are all that is needed for working with @command{awk} values.
34507However, because the API provides routines for accessing and changing
34508the value of a global scalar variable only by using the variable's name,
34509there is a performance penalty: @command{gawk} must find the variable
34510each time it is accessed and changed.  This turns out to be a real issue,
34511not just a theoretical one.
34512
34513Thus, if you know that your extension will spend considerable time
34514reading and/or changing the value of one or more scalar variables, you
34515can obtain a @dfn{scalar cookie}@footnote{See
34516@uref{http://catb.org/jargon/html/C/cookie.html, the ``cookie'' entry in the Jargon file} for a
34517definition of @dfn{cookie}, and @uref{http://catb.org/jargon/html/M/magic-cookie.html,
34518the ``magic cookie'' entry in the Jargon file} for a nice example.
34519@ifclear FOR_PRINT
34520See also the entry for ``Cookie'' in the @ref{Glossary}.
34521@end ifclear
34522}
34523object for that variable, and then use
34524the cookie for getting the variable's value or for changing the variable's
34525value.
34526The @code{awk_scalar_t} type holds a scalar cookie, and the
34527@code{scalar_cookie} macro provides access to the value of that type
34528in the @code{awk_value_t} struct.
34529Given a scalar cookie, @command{gawk} can directly retrieve or
34530modify the value, as required, without having to find it first.
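
The idea can be sketched with a toy symbol table. Everything here
(the @code{toy_} names, the linear search, the table contents) is an
assumption made for illustration and says nothing about how
@command{gawk} actually stores variables.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A toy symbol table; hypothetical, not gawk's implementation. */
typedef struct toy_var {
    const char *name;
    double value;
} toy_var_t;

typedef toy_var_t *toy_cookie_t;   /* plays the role of a scalar cookie */

static toy_var_t toy_table[] = {
    { "NR", 0.0 }, { "count", 0.0 }, { "total", 0.0 },
};
static size_t toy_lookups;   /* how many name searches were performed */

/* Slow path: search by name on every call. */
static toy_cookie_t
toy_lookup(const char *name)
{
    size_t i;

    assert(name != NULL);
    toy_lookups++;
    for (i = 0; i < sizeof(toy_table) / sizeof(toy_table[0]); i++)
        if (strcmp(toy_table[i].name, name) == 0)
            return &toy_table[i];
    return NULL;
}

/* Fast path: the cookie already identifies the variable. */
static void
toy_add_by_cookie(toy_cookie_t cookie, double amount)
{
    assert(cookie != NULL);
    cookie->value += amount;
}

/* Add 1 to a variable n times, looking it up only once. */
static double
toy_bump(const char *name, int n)
{
    toy_cookie_t cookie = toy_lookup(name);
    int i;

    if (cookie == NULL)
        return -1.0;
    for (i = 0; i < n; i++)
        toy_add_by_cookie(cookie, 1.0);
    return cookie->value;
}
```

However many updates @code{toy_bump()} performs, the name search runs
once; the remaining accesses go straight through the cookie.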
34531
34532The @code{awk_value_cookie_t} type and @code{value_cookie} macro are similar.
34533If you know that you wish to
34534use the same numeric or string @emph{value} for one or more variables,
34535you can create the value once, retaining a @dfn{value cookie} for it,
34536and then pass in that value cookie whenever you wish to set the value of a
34537variable.  This saves storage space within the running @command{gawk}
34538process and reduces the time needed to create the value.
34539
34540@node Memory Allocation Functions
34541@subsection Memory Allocation Functions and Convenience Macros
34542@cindex allocating memory for extensions
34543@cindex extensions @subentry loadable @subentry allocating memory
34544@cindex memory, allocating for extensions
34545
34546The API provides a number of @dfn{memory allocation} functions for
34547allocating memory that can be passed to @command{gawk}, as well as a number of
34548convenience macros.
34549This @value{SUBSECTION} presents them all as function prototypes, in
34550the way that extension code would use them:
34551
34552@table @code
34553@item void *gawk_malloc(size_t size);
34554Call the correct version of @code{malloc()} to allocate storage that may
34555be passed to @command{gawk}.
34556
34557@item void *gawk_calloc(size_t nmemb, size_t size);
34558Call the correct version of @code{calloc()} to allocate storage that may
34559be passed to @command{gawk}.
34560
34561@item void *gawk_realloc(void *ptr, size_t size);
34562Call the correct version of @code{realloc()} to allocate storage that may
34563be passed to @command{gawk}.
34564
34565@item void gawk_free(void *ptr);
34566Call the correct version of @code{free()} to release storage that was
34567allocated with @code{gawk_malloc()}, @code{gawk_calloc()}, or @code{gawk_realloc()}.
34568@end table
34569
34570The API has to provide these functions because it is possible
34571for an extension to be compiled and linked against a different
34572version of the C library than was used for the @command{gawk}
34573executable.@footnote{This is more common on MS-Windows systems, but it
34574can happen on Unix-like systems as well.} If @command{gawk} were
34575to use its version of @code{free()} when the memory came from an
34576unrelated version of @code{malloc()}, unexpected behavior would
34577likely result.
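
One way to picture the arrangement is a table of allocator entry points
that the host program hands to the extension. The sketch below assumes
hypothetical names (@code{host_alloc_t}, @code{host_malloc()}, and so
on); the real API reaches its allocator through the macros described
above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical table of allocator entry points supplied by the host. */
typedef struct host_alloc {
    void *(*alloc)(size_t size);
    void (*release)(void *ptr);
} host_alloc_t;

static int host_live_blocks;   /* blocks handed out and not yet released */

static void *
host_malloc(size_t size)
{
    void *p = malloc(size);

    if (p != NULL)
        host_live_blocks++;
    return p;
}

static void
host_free(void *ptr)
{
    if (ptr != NULL)
        host_live_blocks--;
    free(ptr);
}

static const host_alloc_t host_api = { host_malloc, host_free };

/* Extension side: build a string using only the host's allocator. */
static char *
ext_make_greeting(const host_alloc_t *api)
{
    static const char text[] = "Don't Panic!";
    char *copy;

    assert(api != NULL);
    copy = api->alloc(sizeof(text));
    if (copy != NULL)
        memcpy(copy, text, sizeof(text));
    return copy;
}
```

Because both allocation and release go through the same table, the
memory is always returned to the allocator that produced it, whichever
C library the extension itself was linked against.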
34578
Three convenience macros may be used for allocating storage
from @code{gawk_malloc()}, @code{gawk_calloc()}, and
@code{gawk_realloc()}. If the allocation fails, they cause @command{gawk}
to exit with a fatal error message.  They should be used as if they were
procedure calls that do not return a value:
34584
34585@table @code
34586@item #define emalloc(pointer, type, size, message) @dots{}
34587The arguments to this macro are as follows:
34588
34589@c nested table
34590@table @code
34591@item pointer
34592The pointer variable to point at the allocated storage.
34593
34594@item type
34595The type of the pointer variable.  This is used to create a cast for
34596the call to @code{gawk_malloc()}.
34597
34598@item size
34599The total number of bytes to be allocated.
34600
34601@item message
34602A message to be prefixed to the fatal error message. Typically this is the name
34603of the function using the macro.
34604@end table
34605
34606@noindent
34607For example, you might allocate a string value like so:
34608
34609@example
34610@group
34611awk_value_t result;
34612char *message;
34613const char greet[] = "Don't Panic!";
34614
34615emalloc(message, char *, sizeof(greet), "myfunc");
34616strcpy(message, greet);
34617make_malloced_string(message, strlen(message), & result);
34618@end group
34619@end example
34620
34621@sp 2
34622@item #define ezalloc(pointer, type, size, message) @dots{}
34623This is like @code{emalloc()}, but it calls @code{gawk_calloc()}
34624instead of @code{gawk_malloc()}.
34625The arguments are the same as for the @code{emalloc()} macro, but this
34626macro guarantees that the memory returned is initialized to zero.
34627
34628@item #define erealloc(pointer, type, size, message) @dots{}
34629This is like @code{emalloc()}, but it calls @code{gawk_realloc()}
34630instead of @code{gawk_malloc()}.
34631The arguments are the same as for the @code{emalloc()} macro.
34632@end table
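
To show the general shape of such a macro, here is a hedged sketch.
@code{demo_emalloc()} is a hypothetical analogue written for this
example: it calls plain @code{malloc()} and @code{exit()}, whereas the
real macros call the @code{gawk_} allocators and @command{gawk}'s own
fatal-error routine.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * demo_emalloc: hypothetical analogue of emalloc().  It uses plain
 * malloc() and exit(); the real macro uses gawk_malloc() and gawk's
 * fatal-error routine.
 */
#define demo_emalloc(pointer, type, size, message) \
    do { \
        if (((pointer) = (type) malloc(size)) == NULL) { \
            fprintf(stderr, "%s: malloc of %lu bytes failed\n", \
                    (message), (unsigned long) (size)); \
            exit(EXIT_FAILURE); \
        } \
    } while (0)

/* Return a freshly allocated copy of src; never returns NULL. */
static char *
demo_copy_string(const char *src)
{
    char *copy;
    size_t size;

    assert(src != NULL);
    size = strlen(src) + 1;
    demo_emalloc(copy, char *, size, "demo_copy_string");
    memcpy(copy, src, size);
    return copy;
}
```

Wrapping the body in @code{do @dots{} while (0)} lets the macro be
used like a statement that returns no value, which matches how the text
above says the real macros should be called.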
34633
34634Two additional functions allocate MPFR and GMP objects for use
34635by extension functions that need to create and then return such
34636values.
34637
34638@quotation NOTE
34639These functions are obsolete. Extension functions that need local MPFR
34640and GMP values should simply allocate them on the stack and clear them,
34641as any other code would.
34642@end quotation
34643
34644@noindent
34645The functions are:
34646
34647@table @code
34648@item void *get_mpfr_ptr();
34649Allocate and initialize an MPFR object and return a pointer to it.
34650If the allocation fails, @command{gawk} exits with a fatal
34651``out of memory'' error.  If @command{gawk} was compiled without
34652MPFR support, calling this function causes a fatal error.
34653
34654@item void *get_mpz_ptr();
34655Allocate and initialize a GMP object and return a pointer to it.
34656If the allocation fails, @command{gawk} exits with a fatal
34657``out of memory'' error.  If @command{gawk} was compiled without
34658MPFR support, calling this function causes a fatal error.
34659@end table
34660
Both of these functions return @samp{void *}, since the @file{gawkapi.h}
header file should not depend on @code{<mpfr.h>} (or on @code{<gmp.h>},
which is included from @code{<mpfr.h>}).  The actual return values are of
34664types @code{mpfr_ptr} and @code{mpz_ptr} respectively, and you should cast
34665the return values appropriately before assigning the results to variables
34666of the correct types.
34667
34668The memory allocated by these functions should be freed with
34669@code{gawk_free()}.
34670
34671@node Constructor Functions
34672@subsection Constructor Functions
34673
34674The API provides a number of @dfn{constructor} functions for creating
34675string and numeric values, as well as a number of convenience macros.
34676This @value{SUBSECTION} presents them all as function prototypes, in
34677the way that extension code would use them:
34678
34679@table @code
34680@item static inline awk_value_t *
34681@itemx make_const_string(const char *string, size_t length, awk_value_t *result);
34682This function creates a string value in the @code{awk_value_t} variable
34683pointed to by @code{result}. It expects @code{string} to be a C string constant
34684(or other string data), and automatically creates a @emph{copy} of the data
34685for storage in @code{result}. It returns @code{result}.
34686
34687@item static inline awk_value_t *
34688@itemx make_malloced_string(const char *string, size_t length, awk_value_t *result);
34689This function creates a string value in the @code{awk_value_t} variable
34690pointed to by @code{result}. It expects @code{string} to be a @samp{char *}
34691value pointing to data previously obtained from @code{gawk_malloc()}, @code{gawk_calloc()}, or @code{gawk_realloc()}. The idea here
34692is that the data is passed directly to @command{gawk}, which assumes
34693responsibility for it. It returns @code{result}.
34694
34695@item static inline awk_value_t *
34696@itemx make_null_string(awk_value_t *result);
34697This specialized function creates a null string (the ``undefined'' value)
34698in the @code{awk_value_t} variable pointed to by @code{result}.
34699It returns @code{result}.
34700
34701@item static inline awk_value_t *
34702@itemx make_number(double num, awk_value_t *result);
34703This function simply creates a numeric value in the @code{awk_value_t} variable
34704pointed to by @code{result}.
34705
34706@item static inline awk_value_t *
34707@itemx make_number_mpz(void *mpz, awk_value_t *result);
34708This function creates a GMP number value in @code{result}.
34709The @code{mpz} must be from a call to @code{get_mpz_ptr()}
34710(and thus be of real underlying type @code{mpz_ptr}).
34711
34712@item static inline awk_value_t *
34713@itemx make_number_mpfr(void *mpfr, awk_value_t *result);
34714This function creates an MPFR number value in @code{result}.
34715The @code{mpfr} must be from a call to @code{get_mpfr_ptr()}.
34716
34717@item static inline awk_value_t *
34718@itemx make_const_user_input(const char *string, size_t length, awk_value_t *result);
34719This function is identical to @code{make_const_string()}, but the string is
34720flagged as user input that should be treated as a strnum value if the contents
34721of the string are numeric.
34722
34723@item static inline awk_value_t *
34724@itemx make_malloced_user_input(const char *string, size_t length, awk_value_t *result);
34725This function is identical to @code{make_malloced_string()}, but the string is
34726flagged as user input that should be treated as a strnum value if the contents
34727of the string are numeric.
34728
34729@item static inline awk_value_t *
34730@itemx make_const_regex(const char *string, size_t length, awk_value_t *result);
This function creates a strongly typed regexp value by allocating a copy of the string.
@code{string} is the regular expression of length @code{length}.
34733
34734@item static inline awk_value_t *
34735@itemx make_malloced_regex(const char *string, size_t length, awk_value_t *result);
This function creates a strongly typed regexp value.  @code{string} is
the regular expression of length @code{length}.  It expects @code{string}
to be a @samp{char *} value pointing to data previously obtained from
@code{gawk_malloc()}, @code{gawk_calloc()}, or @code{gawk_realloc()}.
34740
34741@end table
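
The difference between the ``const'' and ``malloced'' constructors is
one of ownership, which the following sketch tries to make concrete.
The @code{demo_} names are hypothetical, and the sketch uses plain
@code{malloc()} and reports failure by returning @code{NULL}; it is not
the code in @file{gawkapi.h}.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a string-valued awk_value_t. */
typedef struct demo_strval {
    char *str;
    size_t len;
} demo_strval_t;

/* Like make_const_string(): store a private copy of the caller's data. */
static demo_strval_t *
demo_make_const(const char *string, size_t length, demo_strval_t *result)
{
    assert(string != NULL && result != NULL);
    result->str = malloc(length + 1);
    if (result->str == NULL)
        return NULL;   /* sketch only: report failure to the caller */
    memcpy(result->str, string, length);
    result->str[length] = '\0';
    result->len = length;
    return result;
}

/* Like make_malloced_string(): adopt the caller's buffer; no copy. */
static demo_strval_t *
demo_make_malloced(char *string, size_t length, demo_strval_t *result)
{
    assert(string != NULL && result != NULL);
    result->str = string;
    result->len = length;
    return result;
}
```

After the ``const'' form, the caller still owns its original data.
After the ``malloced'' form, the buffer belongs to whoever holds the
result, so the caller must not free or reuse it.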
34742
34743@node API Ownership of MPFR and GMP Values
34744@subsection Managing MPFR and GMP Values
34745@cindex MPFR values, API ownership of
34746@cindex GMP values, API ownership of
34747@cindex API, ownership of MPFR and GMP values
34748
34749MPFR and GMP values are different from string values, where you can
34750``take ownership'' of the value simply by assigning pointers. For example:
34751
34752@example
34753char *p = gawk_malloc(42);      p @ii{``owns'' the memory}
34754char *q = p;
34755p = NULL;                       @ii{now} q @ii{``owns'' it}
34756@end example
34757
34758MPFR and GMP objects are indeed allocated on the stack or dynamically,
34759but the MPFR and GMP libraries treat these objects as values, the same way that
34760you would pass an @code{int} or a @code{double} by value.  There is no
34761way to ``transfer ownership'' of MPFR and GMP objects.  Thus, code in
34762an extension should look like this:
34763
34764@example
mpz_t part1, part2, answer;             @ii{declare local values}

mpz_init(part1);                        @ii{initialize before first use}
mpz_init(part2);
mpz_init(answer);

mpz_set_si(part1, 21);                  @ii{do some computations}
mpz_set_si(part2, 21);
mpz_add(answer, part1, part2);
@dots{}
/* assume that result is a parameter of type (awk_value_t *). */
make_number_mpz(answer, result);        @ii{set it with final GMP value}

mpz_clear(part1);                       @ii{release intermediate values}
mpz_clear(part2);
mpz_clear(answer);

return result;
34779@end example
34780
34781@node Registration Functions
34782@subsection Registration Functions
34783@cindex register loadable extension
34784@cindex extensions @subentry loadable @subentry registration
34785
This @value{SUBSECTION} describes the API functions for
registering parts of your extension with @command{gawk}.
34788
34789@menu
34790* Extension Functions::         Registering extension functions.
34791* Exit Callback Functions::     Registering an exit callback.
34792* Extension Version String::    Registering a version string.
34793* Input Parsers::               Registering an input parser.
34794* Output Wrappers::             Registering an output wrapper.
34795* Two-way processors::          Registering a two-way processor.
34796@end menu
34797
34798@node Extension Functions
34799@subsubsection Registering An Extension Function
34800
34801Extension functions are described by the following record:
34802
34803@example
34804@group
34805typedef struct awk_ext_func @{
34806@ @ @ @ const char *name;
34807@ @ @ @ awk_value_t *(*const function)(int num_actual_args,
34808@ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ awk_value_t *result,
34809@ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ struct awk_ext_func *finfo);
34810@ @ @ @ const size_t max_expected_args;
34811@ @ @ @ const size_t min_required_args;
34812@ @ @ @ awk_bool_t suppress_lint;
34813@ @ @ @ void *data;        /* opaque pointer to any extra state */
34814@} awk_ext_func_t;
34815@end group
34816@end example
34817
34818The fields are:
34819
34820@table @code
34821@item const char *name;
34822The name of the new function.
34823@command{awk}-level code calls the function by this name.
34824This is a regular C string.
34825
34826Function names must obey the rules for @command{awk}
34827identifiers. That is, they must begin with either an English letter
34828or an underscore, which may be followed by any number of
34829letters, digits, and underscores.
34830Letter case in function names is significant.
34831
34832@item awk_value_t *(*const function)(int num_actual_args,
34833@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ awk_value_t *result,
34834@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ struct awk_ext_func *finfo);
34835This is a pointer to the C function that provides the extension's
34836functionality.
34837The function must fill in @code{*result} with either a number,
34838a string, or a regexp.
34839@command{gawk} takes ownership of any string memory.
34840As mentioned earlier, string memory @emph{must} come from one of
34841@code{gawk_malloc()}, @code{gawk_calloc()}, or @code{gawk_realloc()}.
34842
34843The @code{num_actual_args} argument tells the C function how many
34844actual parameters were passed from the calling @command{awk} code.
34845
34846The @code{finfo} parameter is a pointer to the @code{awk_ext_func_t} for
34847this function. The called function may access data within it as desired, or not.
34848
34849The function must return the value of @code{result}.
34850This is for the convenience of the calling code inside @command{gawk}.
34851
34852@item const size_t max_expected_args;
34853This is the maximum number of arguments the function expects to receive.
34854If called with more arguments than this, and if lint checking has
34855been enabled, then @command{gawk} prints a warning message.  For more
34856information, see the entry for @code{suppress_lint}, later in this list.
34857
34858@item const size_t min_required_args;
34859This is the minimum number of arguments the function expects to receive.
34860If called with fewer arguments, @command{gawk} prints a fatal error
34861message and exits.
34862
34863@item awk_bool_t suppress_lint;
34864This flag tells @command{gawk} not to print a lint message if lint
34865checking has been enabled and if more arguments were supplied in the call
34866than expected.  An extension function can tell if @command{gawk} already
34867printed at least one such message by checking if @samp{num_actual_args >
34868finfo->max_expected_args}.  If so, and the function does not want more
34869lint messages to be printed, it should set @code{finfo->suppress_lint}
34870to @code{awk_true}.
34871
34872@item void *data;
This is an opaque pointer to any data that an extension function may
wish to have available when called.  Passing the @code{awk_ext_func_t}
structure to the extension function, with this pointer available
in it, makes it possible to write a single C or C++ function that
implements multiple @command{awk}-level extension functions.
34878@end table
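
The use of the @code{data} pointer to share one C function among
several @command{awk}-level names can be sketched as follows. The
@code{demo_} types are reduced, hypothetical stand-ins for
@code{awk_value_t} and @code{awk_ext_func_t}, and the function names
@code{double_it} and @code{triple_it} are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Reduced, hypothetical stand-ins for awk_value_t and awk_ext_func_t. */
typedef struct demo_value { double num; } demo_value_t;

typedef struct demo_ext_func {
    const char *name;
    demo_value_t *(*function)(int num_actual_args, demo_value_t *result,
                              struct demo_ext_func *finfo);
    void *data;   /* extra state for the function */
} demo_ext_func_t;

/* One C function; finfo->data selects the scale factor to apply. */
static demo_value_t *
do_scale(int num_actual_args, demo_value_t *result, demo_ext_func_t *finfo)
{
    const double *factor;

    assert(result != NULL && finfo != NULL && finfo->data != NULL);
    (void) num_actual_args;   /* argument fetching is omitted here */
    factor = finfo->data;
    result->num *= *factor;   /* result->num stands in for the argument */
    return result;
}

static double two = 2.0, three = 3.0;

/* Two awk-level names share the same C implementation. */
static demo_ext_func_t demo_funcs[] = {
    { "double_it", do_scale, &two },
    { "triple_it", do_scale, &three },
};
```

Each table entry carries its own state, so the shared function needs no
global variables to tell which name it was called under.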
34879
34880Once you have a record representing your extension function, you register
34881it with @command{gawk} using this API function:
34882
34883@table @code
34884@item awk_bool_t add_ext_func(const char *name_space, awk_ext_func_t *func);
34885This function returns true upon success, false otherwise.
34886The @code{name_space} parameter is the namespace in which to place
34887the function (@pxref{Namespaces}).
34888Use an empty string (@code{""}) or @code{"awk"} to place
34889the function in the default @code{awk} namespace.
34890The @code{func} pointer is the address of a
34891@code{struct} representing your function, as just described.
34892
@command{gawk} does not modify what @code{func} points to, but the
extension function itself receives this pointer and can modify what it
points to; thus, it is purposely not declared to be @code{const}.
34896@end table
34897
34898The combination of @code{min_required_args}, @code{max_expected_args},
34899and @code{suppress_lint} may be confusing. Here is how you should
34900set things up.
34901
34902@table @asis
34903@item Any number of arguments is valid
34904Set @code{min_required_args} and @code{max_expected_args} to zero and
34905set @code{suppress_lint} to @code{awk_true}.
34906
34907@item A minimum number of arguments is required, no limit on maximum number of arguments
34908Set @code{min_required_args} to the minimum required. Set
34909@code{max_expected_args} to zero and
34910set @code{suppress_lint} to @code{awk_true}.
34911
34912@item A minimum number of arguments is required, a maximum number is expected
34913Set @code{min_required_args} to the minimum required. Set
34914@code{max_expected_args} to the maximum expected.
34915Set @code{suppress_lint} to @code{awk_false}.
34916
34917@item A minimum number of arguments is required, and no more than a maximum is allowed
34918Set @code{min_required_args} to the minimum required. Set
34919@code{max_expected_args} to the maximum expected.
34920Set @code{suppress_lint} to @code{awk_false}.
In your extension function, check that @code{num_actual_args} does not
exceed @code{finfo->max_expected_args}. If it does, issue a fatal error message.
34923@end table
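
The rules above can be modeled in a few lines of C. This is only a
sketch of the behavior as this @value{SUBSECTION} describes it; the
@code{demo_limits_t} type, the @code{args_status_t} codes, and
@code{demo_check_args()} are hypothetical and do not appear in
@command{gawk}.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { ARGS_OK, ARGS_LINT_WARNING, ARGS_FATAL } args_status_t;

/* Hypothetical subset of awk_ext_func_t holding the three settings. */
typedef struct demo_limits {
    size_t max_expected_args;
    size_t min_required_args;
    int suppress_lint;
} demo_limits_t;

/*
 * Model of the checks described above: too few arguments is fatal,
 * too many draws a lint warning unless lint is off or suppressed.
 */
static args_status_t
demo_check_args(const demo_limits_t *lim, size_t actual, int lint_enabled)
{
    assert(lim != NULL);
    if (actual < lim->min_required_args)
        return ARGS_FATAL;
    if (actual > lim->max_expected_args
        && lint_enabled && ! lim->suppress_lint)
        return ARGS_LINT_WARNING;
    return ARGS_OK;
}

/* "Any number of arguments is valid." */
static const demo_limits_t any_number = { 0, 0, 1 };
/* "A minimum is required, a maximum is expected": 1 to 3 arguments. */
static const demo_limits_t one_to_three = { 3, 1, 0 };
```

Note that in this model an excess argument is never fatal by itself; an
extension that wants a hard upper limit has to enforce it in its own
code, as the last case in the list explains.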
34924
34925@node Exit Callback Functions
34926@subsubsection Registering An Exit Callback Function
34927
34928An @dfn{exit callback} function is a function that
34929@command{gawk} calls before it exits.
34930Such functions are useful if you have general ``cleanup'' tasks
34931that should be performed in your extension (such as closing database
34932connections or other resource deallocations).
34933You can register such
34934a function with @command{gawk} using the following function:
34935
34936@table @code
34937@item void awk_atexit(void (*funcp)(void *data, int exit_status),
34938@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ void *arg0);
34939The parameters are:
34940
34941@c nested table
34942@table @code
34943@item funcp
34944A pointer to the function to be called before @command{gawk} exits. The @code{data}
34945parameter will be the original value of @code{arg0}.
34946The @code{exit_status} parameter is the exit status value that
34947@command{gawk} intends to pass to the @code{exit()} system call.
34948
34949@item arg0
34950A pointer to private data that @command{gawk} saves in order to pass to
34951the function pointed to by @code{funcp}.
34952@end table
34953@end table
34954
34955Exit callback functions are called in last-in, first-out (LIFO)
34956order---that is, in the reverse order in which they are registered with
34957@command{gawk}.
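
A LIFO registry of this kind can be sketched with a singly linked list
that is pushed at the front. The @code{demo_atexit()} and
@code{demo_run_exit_callbacks()} functions below are hypothetical; the
text does not say how @command{gawk} keeps its own list.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical registry; gawk's own bookkeeping is not shown in the text. */
typedef struct exit_entry {
    void (*funcp)(void *data, int exit_status);
    void *arg0;
    struct exit_entry *next;
} exit_entry_t;

static exit_entry_t *exit_list;   /* most recently registered first */

/* Register a callback by pushing it on the front of the list. */
static int
demo_atexit(void (*funcp)(void *data, int exit_status), void *arg0)
{
    exit_entry_t *e;

    assert(funcp != NULL);
    e = malloc(sizeof(*e));
    if (e == NULL)
        return 0;
    e->funcp = funcp;
    e->arg0 = arg0;
    e->next = exit_list;
    exit_list = e;
    return 1;
}

/* Run and discard every callback, newest first. */
static void
demo_run_exit_callbacks(int exit_status)
{
    while (exit_list != NULL) {
        exit_entry_t *e = exit_list;

        exit_list = e->next;
        e->funcp(e->arg0, exit_status);
        free(e);
    }
}

/* Sample callback: append the character that data points at to a log. */
static char demo_log[8];
static size_t demo_log_len;

static void
demo_record(void *data, int exit_status)
{
    (void) exit_status;
    if (demo_log_len < sizeof(demo_log) - 1)
        demo_log[demo_log_len++] = *(const char *) data;
}
```

Registering callbacks for @code{'a'}, @code{'b'}, and @code{'c'} in that
order and then running them visits the entries in reverse, which lets a
later registration rely on resources set up by an earlier one.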
34958
34959@node Extension Version String
34960@subsubsection Registering An Extension Version String
34961
34962You can register a version string that indicates the name and
34963version of your extension with @command{gawk}, as follows:
34964
34965@table @code
34966@item void register_ext_version(const char *version);
34967Register the string pointed to by @code{version} with @command{gawk}.
34968Note that @command{gawk} does @emph{not} copy the @code{version} string, so
34969it should not be changed.
34970@end table
34971
34972@command{gawk} prints all registered extension version strings when it
34973is invoked with the @option{--version} option.
34974
34975@node Input Parsers
34976@subsubsection Customized Input Parsers
34977@cindex customized input parser
34978
34979By default, @command{gawk} reads text files as its input. It uses the value
34980of @code{RS} to find the end of the record, and then uses @code{FS}
34981(or @code{FIELDWIDTHS} or @code{FPAT}) to split it into fields (@pxref{Reading Files}).
34982Additionally, it sets the value of @code{RT} (@pxref{Built-in Variables}).
34983
34984If you want, you can provide your own custom input parser.  An input
34985parser's job is to return a record to the @command{gawk} record-processing
34986code, along with indicators for the value and length of the data to be
34987used for @code{RT}, if any.
34988
34989To provide an input parser, you must first provide two functions
34990(where @var{XXX} is a prefix name for your extension):
34991
34992@table @code
34993@item awk_bool_t @var{XXX}_can_take_file(const awk_input_buf_t *iobuf);
34994This function examines the information available in @code{iobuf}
34995(which we discuss shortly).  Based on the information there, it
34996decides if the input parser should be used for this file.
34997If so, it should return true. Otherwise, it should return false.
34998It should not change any state (variable values, etc.) within @command{gawk}.
34999
35000@item awk_bool_t @var{XXX}_take_control_of(awk_input_buf_t *iobuf);
35001When @command{gawk} decides to hand control of the file over to the
35002input parser, it calls this function.  This function in turn must fill
35003in certain fields in the @code{awk_input_buf_t} structure and ensure
35004that certain conditions are true.  It should then return true. If an
35005error of some kind occurs, it should not fill in any fields and should
35006return false; then @command{gawk} will not use the input parser.
35007The details are presented shortly.
35008@end table
35009
35010Your extension should package these functions inside an
35011@code{awk_input_parser_t}, which looks like this:
35012
35013@example
35014@group
35015typedef struct awk_input_parser @{
35016    const char *name;   /* name of parser */
35017    awk_bool_t (*can_take_file)(const awk_input_buf_t *iobuf);
35018    awk_bool_t (*take_control_of)(awk_input_buf_t *iobuf);
35019    awk_const struct awk_input_parser *awk_const next;   /* for gawk */
35020@} awk_input_parser_t;
35021@end group
35022@end example
35023
35024The fields are:
35025
35026@table @code
35027@item const char *name;
35028The name of the input parser. This is a regular C string.
35029
35030@item awk_bool_t (*can_take_file)(const awk_input_buf_t *iobuf);
35031A pointer to your @code{@var{XXX}_can_take_file()} function.
35032
35033@item awk_bool_t (*take_control_of)(awk_input_buf_t *iobuf);
35034A pointer to your @code{@var{XXX}_take_control_of()} function.
35035
@item awk_const struct awk_input_parser *awk_const next;
35037This is for use by @command{gawk};
35038therefore it is marked @code{awk_const} so that the extension cannot
35039modify it.
35040@end table
35041
35042The steps are as follows:
35043
35044@enumerate
35045@item
35046Create a @code{static awk_input_parser_t} variable and initialize it
35047appropriately.
35048
35049@item
35050When your extension is loaded, register your input parser with
35051@command{gawk} using the @code{register_input_parser()} API function
35052(described next).
35053@end enumerate
35054
35055An @code{awk_input_buf_t} looks like this:
35056
35057@example
35058typedef struct awk_input @{
35059    const char *name;       /* filename */
35060    int fd;                 /* file descriptor */
35061#define INVALID_HANDLE (-1)
35062    void *opaque;           /* private data for input parsers */
35063    int (*get_record)(char **out, struct awk_input *iobuf,
35064                      int *errcode, char **rt_start, size_t *rt_len,
35065                      const awk_fieldwidth_info_t **field_width);
35066    ssize_t (*read_func)();
35067    void (*close_func)(struct awk_input *iobuf);
35068    struct stat sbuf;       /* stat buf */
35069@} awk_input_buf_t;
35070@end example
35071
35072The fields can be divided into two categories: those for use (initially,
35073at least) by @code{@var{XXX}_can_take_file()}, and those for use by
35074@code{@var{XXX}_take_control_of()}.  The first group of fields and their uses
35075are as follows:
35076
35077@table @code
35078@item const char *name;
35079The name of the file.
35080
35081@item int fd;
35082A file descriptor for the file.  If @command{gawk} was able to
35083open the file, then @code{fd} will @emph{not} be equal to
35084@code{INVALID_HANDLE}. Otherwise, it will.
35085
35086@item struct stat sbuf;
35087If the file descriptor is valid, then @command{gawk} will have filled
35088in this structure via a call to the @code{fstat()} system call.
35089@end table
35090
35091The @code{@var{XXX}_can_take_file()} function should examine these
35092fields and decide if the input parser should be used for the file.
35093The decision can be made based upon @command{gawk} state (the value
35094of a variable defined previously by the extension and set by
35095@command{awk} code), the name of the
35096file, whether or not the file descriptor is valid, the information
35097in the @code{struct stat}, or any combination of these factors.
35098
35099Once @code{@var{XXX}_can_take_file()} has returned true, and
35100@command{gawk} has decided to use your input parser, it calls
35101@code{@var{XXX}_take_control_of()}.  That function then fills
35102either the @code{get_record} field or the @code{read_func} field in
35103the @code{awk_input_buf_t}.  It must also ensure that @code{fd} is @emph{not}
35104set to @code{INVALID_HANDLE}.  The following list describes the fields that
35105may be filled by @code{@var{XXX}_take_control_of()}:
35106
35107@table @code
35108@item void *opaque;
35109This is used to hold any state information needed by the input parser
35110for this file.  It is ``opaque'' to @command{gawk}.  The input parser
35111is not required to use this pointer.
35112
35113@item int@ (*get_record)(char@ **out,
35114@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ struct@ awk_input *iobuf,
35115@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ int *errcode,
35116@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ char **rt_start,
35117@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ size_t *rt_len,
35118@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ const awk_fieldwidth_info_t **field_width);
35119This function pointer should point to a function that creates the input
35120records.  Said function is the core of the input parser.  Its behavior
35121is described in the text following this list.
35122
35123@item ssize_t (*read_func)();
35124This function pointer should point to a function that has the
35125same behavior as the standard POSIX @code{read()} system call.
35126It is an alternative to the @code{get_record} pointer.  Its behavior
35127is also described in the text following this list.
35128
35129@item void (*close_func)(struct awk_input *iobuf);
35130This function pointer should point to a function that does
35131the ``teardown.'' It should release any resources allocated by
35132@code{@var{XXX}_take_control_of()}.  It may also close the file. If it
35133does so, it should set the @code{fd} field to @code{INVALID_HANDLE}.
35134
35135If @code{fd} is still not @code{INVALID_HANDLE} after the call to this
35136function, @command{gawk} calls the regular @code{close()} system call.
35137
35138Having a ``teardown'' function is optional. If your input parser does
35139not need it, do not set this field.  Then, @command{gawk} calls the
35140regular @code{close()} system call on the file descriptor, so it should
35141be valid.
35142@end table
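
As a rough sketch of how these two functions might fit together, consider
a hypothetical parser that claims regular files whose names end in
@file{.xxx}.  The suffix and the @code{xxx_get_record()} function are
assumptions made purely for illustration, and the fragment assumes that
@file{gawkapi.h} and the system headers it needs (plus @code{<string.h>})
have already been included:

@example
/* Hypothetical record reader, defined elsewhere in the extension. */
static int xxx_get_record(char **out, awk_input_buf_t *iobuf,
                          int *errcode, char **rt_start, size_t *rt_len,
                          const awk_fieldwidth_info_t **field_width);

/* Decide, from the name, descriptor, and stat data, whether to
   handle this file. */
static awk_bool_t
xxx_can_take_file(const awk_input_buf_t *iobuf)
@{
    size_t len;

    if (iobuf == NULL || iobuf->fd == INVALID_HANDLE)
        return awk_false;
    if (! S_ISREG(iobuf->sbuf.st_mode))
        return awk_false;

    len = strlen(iobuf->name);
    if (len > 4 && strcmp(iobuf->name + len - 4, ".xxx") == 0)
        return awk_true;
    return awk_false;
@}

/* Install the record reader.  close_func is left unset, so gawk
   closes the (still valid) file descriptor itself. */
static awk_bool_t
xxx_take_control_of(awk_input_buf_t *iobuf)
@{
    iobuf->get_record = xxx_get_record;
    return awk_true;
@}
@end example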
35143
35144The @code{@var{XXX}_get_record()} function does the work of creating
35145input records.  The parameters are as follows:
35146
35147@table @code
35148@item char **out
This is a pointer to a @code{char *} variable that is set to point
to the record.  @command{gawk} makes its own copy of the data, so
the extension remains responsible for managing this storage.
35152
35153@item struct awk_input *iobuf
35154This is the @code{awk_input_buf_t} for the file.  The fields should be
35155used for reading data (@code{fd}) and for managing private state
35156(@code{opaque}), if any.
35157
35158@item int *errcode
35159If an error occurs, @code{*errcode} should be set to an appropriate
35160code from @code{<errno.h>}.
35161
35162@item char **rt_start
35163@itemx size_t *rt_len
35164If the concept of a ``record terminator'' makes sense, then
35165@code{*rt_start} should be set to point to the data to be used for
35166@code{RT}, and @code{*rt_len} should be set to the length of the
35167data. Otherwise, @code{*rt_len} should be set to zero.
35168@command{gawk} makes its own copy of this data, so the
35169extension must manage this storage.
35170
35171@item const awk_fieldwidth_info_t **field_width
35172If @code{field_width} is not @code{NULL}, then @code{*field_width} will be initialized
35173to @code{NULL}, and the function may set it to point to a structure
35174supplying field width information to override the default
35175field parsing mechanism. Note that this structure will not
35176be copied by @command{gawk}; it must persist at least until the next call
35177to @code{get_record} or @code{close_func}. Note also that @code{field_width} is
35178@code{NULL} when @code{getline} is assigning the results to a variable, thus
35179field parsing is not needed. If the parser does set @code{*field_width},
35180then @command{gawk} uses this layout to parse the input record,
35181and the @code{PROCINFO["FS"]} value will be @code{"API"} while this record
35182is active in @code{$0}.
35183The @code{awk_fieldwidth_info_t} data structure
35184is described below.
35185@end table
35186
35187The return value is the length of the buffer pointed to by
35188@code{*out}, or @code{EOF} if end-of-file was reached or an
35189error occurred.
35190
35191It is guaranteed that @code{errcode} is a valid pointer, so there is no
35192need to test for a @code{NULL} value.  @command{gawk} sets @code{*errcode}
35193to zero, so there is no need to set it unless an error occurs.
35194
If an error does occur, the function should return @code{EOF} and set
@code{*errcode} to a value greater than zero.  Whenever @code{*errcode}
is nonzero upon return, @command{gawk} automatically updates
the @code{ERRNO} variable based on its value.
(In general, setting @samp{*errcode = errno} should do the right thing.)
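
The following fragment sketches one possible @code{@var{XXX}_get_record()}
function.  It assumes, only for the sake of the example, that the file
consists of fixed-size 16-byte records with no record terminator; the name
@code{XXX_RECLEN} is hypothetical.  A single static buffer is used because
@command{gawk} copies the record data.  The fragment also assumes that
@code{<errno.h>}, @code{<stdio.h>}, @code{<unistd.h>}, and
@file{gawkapi.h} have been included:

@example
#define XXX_RECLEN 16   /* hypothetical fixed record size */

static int
xxx_get_record(char **out, awk_input_buf_t *iobuf, int *errcode,
               char **rt_start, size_t *rt_len,
               const awk_fieldwidth_info_t **field_width)
@{
    static char buf[XXX_RECLEN];
    ssize_t n;

    (void) field_width;     /* use gawk's default field parsing */

    if (out == NULL || iobuf == NULL)
        return EOF;

    n = read(iobuf->fd, buf, sizeof(buf));
    if (n < 0) @{
        *errcode = errno;   /* gawk updates ERRNO from this */
        return EOF;
    @}
    if (n == 0)
        return EOF;         /* end of file; *errcode is still zero */

    *out = buf;             /* gawk copies the record */
    *rt_start = NULL;       /* no record terminator here */
    *rt_len = 0;
    return (int) n;         /* length of the record */
@}
@end example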
35200
35201As an alternative to supplying a function that returns an input record,
35202you may instead supply a function that simply reads bytes, and let
35203@command{gawk} parse the data into records.  If you do so, the data
35204should be returned in the multibyte encoding of the current locale.
35205Such a function should follow the same behavior as the @code{read()}
35206system call, and you fill in the @code{read_func} pointer with its
35207address in the @code{awk_input_buf_t} structure.
35208
35209By default, @command{gawk} sets the @code{read_func} pointer to
35210point to the @code{read()} system call. So your extension need not
35211set this field explicitly.
35212
35213@quotation NOTE
35214You must choose one method or the other: either a function that
35215returns a record, or one that returns raw data.  In particular,
35216if you supply a function to get a record, @command{gawk} will
35217call it, and will never call the raw read function.
35218@end quotation
35219
35220@command{gawk} ships with a sample extension that reads directories,
35221returning records for each entry in a directory (@pxref{Extension
35222Sample Readdir}).  You may wish to use that code as a guide for writing
35223your own input parser.
35224
35225When writing an input parser, you should think about (and document)
35226how it is expected to interact with @command{awk} code.  You may want
35227it to always be called, and to take effect as appropriate (as the
35228@code{readdir} extension does).  Or you may want it to take effect
35229based upon the value of an @command{awk} variable, as the XML extension
35230from the @code{gawkextlib} project does (@pxref{gawkextlib}).
35231In the latter case, code in a @code{BEGINFILE} rule
35232can look at @code{FILENAME} and @code{ERRNO} to decide whether or
35233not to activate an input parser (@pxref{BEGINFILE/ENDFILE}).
35234
35235You register your input parser with the following function:
35236
35237@table @code
35238@item void register_input_parser(awk_input_parser_t *input_parser);
35239Register the input parser pointed to by @code{input_parser} with
35240@command{gawk}.
35241@end table
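
Putting the two steps together, registration might look roughly like
the following.  The @code{xxx_} names are placeholders, and the sketch
assumes the extension follows the usual boilerplate conventions, in which
@code{init_func} points to a function that is called when the extension
is loaded:

@example
static awk_bool_t xxx_can_take_file(const awk_input_buf_t *iobuf);
static awk_bool_t xxx_take_control_of(awk_input_buf_t *iobuf);

static awk_input_parser_t xxx_parser = @{
    "xxx",                  /* name of parser */
    xxx_can_take_file,
    xxx_take_control_of,
    NULL                    /* next: for gawk's use */
@};

/* Called at load time; registers the parser with gawk. */
static awk_bool_t
init_xxx(void)
@{
    register_input_parser(& xxx_parser);
    return awk_true;
@}

static awk_bool_t (*init_func)(void) = init_xxx;
@end example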
35242
35243If you would like to override the default field parsing mechanism for a given
35244record, then you must populate an @code{awk_fieldwidth_info_t} structure,
35245which looks like this:
35246
35247@example
35248typedef struct @{
35249        awk_bool_t     use_chars; /* false ==> use bytes */
35250        size_t         nf;        /* number of fields in record (NF) */
35251        struct awk_field_info @{
35252                size_t skip;      /* amount to skip before field starts */
35253                size_t len;       /* length of field */
35254        @} fields[1];              /* actual dimension should be nf */
35255@} awk_fieldwidth_info_t;
35256@end example
35257
35258The fields are:
35259
35260@table @code
35261@item awk_bool_t use_chars;
35262Set this to @code{awk_true} if the field lengths are specified in terms
35263of potentially multi-byte characters, and set it to @code{awk_false} if
35264the lengths are in terms of bytes.
35265Performance will be better if the values are supplied in
35266terms of bytes.
35267
35268@item size_t nf;
Set this to the number of fields in the input record, i.e., @code{NF}.
35270
35271@item struct awk_field_info fields[nf];
35272This is a variable-length array whose actual dimension should be @code{nf}.
35273For each field, the @code{skip} element should be set to the number
35274of characters or bytes, as controlled by the @code{use_chars} flag,
35275to skip before the start of this field. The @code{len} element provides
the length of the field. The values in @code{fields[0]} provide the information
for @code{$1}, and so on through the @code{fields[nf-1]} element containing
the information for @code{$NF}.
35278@end table
35279
35280A convenience macro @code{awk_fieldwidth_info_size(numfields)} is provided to
35281calculate the appropriate size of a variable-length
35282@code{awk_fieldwidth_info_t} structure containing @code{numfields} fields. This can
35283be used as an argument to @code{malloc()} or in a union to allocate space
35284statically. Please refer to the @code{readdir_test} sample extension for an
35285example.
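
As a small illustration, the following hypothetical helper allocates and
fills in a structure describing three byte-oriented fields, suitable for
a record such as @samp{2021-10-31}.  The function name and the layout are
assumptions chosen for the example; each @code{skip} value is taken to be
counted from the end of the preceding field.  The fragment assumes that
@code{<stdlib.h>} and @file{gawkapi.h} have been included:

@example
static awk_fieldwidth_info_t *
make_date_fields(void)
@{
    awk_fieldwidth_info_t *fw;

    fw = (awk_fieldwidth_info_t *) malloc(awk_fieldwidth_info_size(3));
    if (fw == NULL)
        return NULL;

    fw->use_chars = awk_false;      /* lengths are in bytes */
    fw->nf = 3;

    fw->fields[0].skip = 0;         /* $1 = "2021" */
    fw->fields[0].len = 4;
    fw->fields[1].skip = 1;         /* skip "-", $2 = "10" */
    fw->fields[1].len = 2;
    fw->fields[2].skip = 1;         /* skip "-", $3 = "31" */
    fw->fields[2].len = 2;

    return fw;
@}
@end example

Because @command{gawk} does not copy the structure, the pointer stored in
@code{*field_width} must remain valid until the next call to
@code{get_record} or @code{close_func}.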
35286
35287@node Output Wrappers
35288@subsubsection Customized Output Wrappers
35289@cindex customized output wrapper
35290
35291@cindex output wrapper
35292An @dfn{output wrapper} is the mirror image of an input parser.
35293It allows an extension to take over the output to a file opened
35294with the @samp{>} or @samp{>>} I/O redirection operators (@pxref{Redirection}).
35295
The output wrapper structure is very similar to the input parser structure:
35297
35298@example
35299typedef struct awk_output_wrapper @{
35300    const char *name;   /* name of the wrapper */
35301    awk_bool_t (*can_take_file)(const awk_output_buf_t *outbuf);
35302    awk_bool_t (*take_control_of)(awk_output_buf_t *outbuf);
35303    awk_const struct awk_output_wrapper *awk_const next;  /* for gawk */
35304@} awk_output_wrapper_t;
35305@end example
35306
35307The members are as follows:
35308
35309@table @code
35310@item const char *name;
35311This is the name of the output wrapper.
35312
35313@item awk_bool_t (*can_take_file)(const awk_output_buf_t *outbuf);
35314This points to a function that examines the information in
35315the @code{awk_output_buf_t} structure pointed to by @code{outbuf}.
35316It should return true if the output wrapper wants to take over the
35317file, and false otherwise.  It should not change any state (variable
35318values, etc.) within @command{gawk}.
35319
35320@item awk_bool_t (*take_control_of)(awk_output_buf_t *outbuf);
35321The function pointed to by this field is called when @command{gawk}
35322decides to let the output wrapper take control of the file. It should
35323fill in appropriate members of the @code{awk_output_buf_t} structure,
35324as described next, and return true if successful, false otherwise.
35325
@item awk_const struct awk_output_wrapper *awk_const next;
35327This is for use by @command{gawk};
35328therefore it is marked @code{awk_const} so that the extension cannot
35329modify it.
35330@end table
35331
35332The @code{awk_output_buf_t} structure looks like this:
35333
35334@example
35335typedef struct awk_output_buf @{
35336    const char *name;   /* name of output file */
35337    const char *mode;   /* mode argument to fopen */
35338    FILE *fp;           /* stdio file pointer */
35339    awk_bool_t redirected;  /* true if a wrapper is active */
35340    void *opaque;       /* for use by output wrapper */
35341    size_t (*gawk_fwrite)(const void *buf, size_t size, size_t count,
35342                FILE *fp, void *opaque);
35343    int (*gawk_fflush)(FILE *fp, void *opaque);
35344    int (*gawk_ferror)(FILE *fp, void *opaque);
35345    int (*gawk_fclose)(FILE *fp, void *opaque);
35346@} awk_output_buf_t;
35347@end example
35348
35349Here too, your extension will define @code{@var{XXX}_can_take_file()}
35350and @code{@var{XXX}_take_control_of()} functions that examine and update
35351data members in the @code{awk_output_buf_t}.
35352The data members are as follows:
35353
35354@table @code
35355@item const char *name;
35356The name of the output file.
35357
35358@item const char *mode;
35359The mode string (as would be used in the second argument to @code{fopen()})
35360with which the file was opened.
35361
35362@item FILE *fp;
35363The @code{FILE} pointer from @code{<stdio.h>}. @command{gawk} opens the file
35364before attempting to find an output wrapper.
35365
35366@item awk_bool_t redirected;
35367This field must be set to true by the @code{@var{XXX}_take_control_of()} function.
35368
35369@item void *opaque;
35370This pointer is opaque to @command{gawk}. The extension should use it to store
35371a pointer to any private data associated with the file.
35372
35373@item size_t (*gawk_fwrite)(const void *buf, size_t size, size_t count,
35374@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ FILE *fp, void *opaque);
35375@itemx int (*gawk_fflush)(FILE *fp, void *opaque);
35376@itemx int (*gawk_ferror)(FILE *fp, void *opaque);
35377@itemx int (*gawk_fclose)(FILE *fp, void *opaque);
These pointers should be set to point to functions that perform
the same tasks as the corresponding @code{<stdio.h>} functions, if appropriate.
35380@command{gawk} uses these function pointers for all output.
35381@command{gawk} initializes the pointers to point to internal ``pass-through''
35382functions that just call the regular @code{<stdio.h>} functions, so an
35383extension only needs to redefine those functions that are appropriate for
35384what it does.
35385@end table
35386
35387The @code{@var{XXX}_can_take_file()} function should make a decision based
35388upon the @code{name} and @code{mode} fields, and any additional state
35389(such as @command{awk} variable values) that is appropriate.
35390
35391When @command{gawk} calls @code{@var{XXX}_take_control_of()}, that function should fill
35392in the other fields as appropriate, except for @code{fp}, which it should just
35393use normally.
35394
35395You register your output wrapper with the following function:
35396
35397@table @code
35398@item void register_output_wrapper(awk_output_wrapper_t *output_wrapper);
35399Register the output wrapper pointed to by @code{output_wrapper} with
35400@command{gawk}.
35401@end table
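
The following is a minimal sketch of an output wrapper, under the
assumption that it should take over files whose names end in
@file{.xxx} and merely count the bytes written to them.  All of the
@code{xxx_} names are hypothetical.  Only @code{gawk_fwrite} and
@code{gawk_fclose} are replaced; the other pointers keep
@command{gawk}'s pass-through defaults.  The fragment assumes that
@code{<stdlib.h>}, @code{<string.h>}, and @file{gawkapi.h} have been included:

@example
static size_t
xxx_fwrite(const void *buf, size_t size, size_t count,
           FILE *fp, void *opaque)
@{
    size_t *total = (size_t *) opaque;
    size_t n = fwrite(buf, size, count, fp);

    *total += n * size;     /* private bookkeeping */
    return n;
@}

static int
xxx_fclose(FILE *fp, void *opaque)
@{
    free(opaque);           /* release the private counter */
    return fclose(fp);
@}

static awk_bool_t
xxx_out_can_take_file(const awk_output_buf_t *outbuf)
@{
    size_t len;

    if (outbuf == NULL)
        return awk_false;
    len = strlen(outbuf->name);
    if (len > 4 && strcmp(outbuf->name + len - 4, ".xxx") == 0)
        return awk_true;
    return awk_false;
@}

static awk_bool_t
xxx_out_take_control_of(awk_output_buf_t *outbuf)
@{
    size_t *total = (size_t *) calloc(1, sizeof(*total));

    if (total == NULL)
        return awk_false;
    outbuf->opaque = total;
    outbuf->gawk_fwrite = xxx_fwrite;
    outbuf->gawk_fclose = xxx_fclose;
    outbuf->redirected = awk_true;  /* required */
    return awk_true;
@}

static awk_output_wrapper_t xxx_wrapper = @{
    "xxx",
    xxx_out_can_take_file,
    xxx_out_take_control_of,
    NULL
@};
@end example

@noindent
The wrapper would then be registered at load time with
@samp{register_output_wrapper(& xxx_wrapper);}.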
35402
35403@node Two-way processors
35404@subsubsection Customized Two-way Processors
35405@cindex customized two-way processor
35406
35407A @dfn{two-way processor} combines an input parser and an output wrapper for
35408two-way I/O with the @samp{|&} operator (@pxref{Redirection}).  It makes identical
use of the @code{awk_input_buf_t} and @code{awk_output_buf_t} structures
35410as described earlier.
35411
35412A two-way processor is represented by the following structure:
35413
35414@example
35415typedef struct awk_two_way_processor @{
35416    const char *name;   /* name of the two-way processor */
35417    awk_bool_t (*can_take_two_way)(const char *name);
35418    awk_bool_t (*take_control_of)(const char *name,
35419                                  awk_input_buf_t *inbuf,
35420                                  awk_output_buf_t *outbuf);
35421    awk_const struct awk_two_way_processor *awk_const next;  /* for gawk */
35422@} awk_two_way_processor_t;
35423@end example
35424
35425The fields are as follows:
35426
35427@table @code
35428@item const char *name;
35429The name of the two-way processor.
35430
35431@item awk_bool_t (*can_take_two_way)(const char *name);
The function pointed to by this field should return true if it wants to
take over two-way I/O for this @value{FN}.
It should not change any state (variable
values, etc.) within @command{gawk}.
35435
35436@item awk_bool_t (*take_control_of)(const char *name,
35437@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ awk_input_buf_t *inbuf,
35438@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ awk_output_buf_t *outbuf);
35439The function pointed to by this field should fill in the @code{awk_input_buf_t} and
35440@code{awk_output_buf_t} structures pointed to by @code{inbuf} and
35441@code{outbuf}, respectively.  These structures were described earlier.
35442
@item awk_const struct awk_two_way_processor *awk_const next;
35444This is for use by @command{gawk};
35445therefore it is marked @code{awk_const} so that the extension cannot
35446modify it.
35447@end table
35448
As with the input parser and output wrapper, you provide
35450``yes I can take this'' and ``take over for this'' functions,
35451@code{@var{XXX}_can_take_two_way()} and @code{@var{XXX}_take_control_of()}.
35452
35453You register your two-way processor with the following function:
35454
35455@table @code
35456@item void register_two_way_processor(awk_two_way_processor_t *two_way_processor);
35457Register the two-way processor pointed to by @code{two_way_processor} with
35458@command{gawk}.
35459@end table
35460
35461@node Printing Messages
35462@subsection Printing Messages
35463@cindex printing @subentry messages from extensions
35464@cindex messages from extensions
35465
You can print different kinds of diagnostic messages from your
extension, as described here.  Note that for these functions,
35468you must pass in the extension ID received from @command{gawk}
35469when the extension was loaded:@footnote{Because the API uses only ISO C 90
35470features, it cannot make use of the ISO C 99 variadic macro feature to hide
35471that parameter. More's the pity.}
35472
35473@table @code
35474@item void fatal(awk_ext_id_t id, const char *format, ...);
35475Print a message and then cause @command{gawk} to exit immediately.
35476
35477@item void nonfatal(awk_ext_id_t id, const char *format, ...);
35478Print a nonfatal error message.
35479
35480@item void warning(awk_ext_id_t id, const char *format, ...);
35481Print a warning message.
35482
35483@item void lintwarn(awk_ext_id_t id, const char *format, ...);
35484Print a ``lint warning.''  Normally this is the same as printing a
35485warning message, but if @command{gawk} was invoked with @samp{--lint=fatal},
35486then lint warnings become fatal error messages.
35487@end table
35488
35489All of these functions are otherwise like the C @code{printf()}
35490family of functions, where the @code{format} parameter is a string
35491with literal characters and formatting codes intermixed.
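
For instance, an extension function might report problems with its
arguments along the following lines.  The function @code{do_xxx()} is
hypothetical, and the sketch assumes the usual global @code{ext_id}
variable set up when the extension is loaded, as well as the
@code{do_lint} macro from @file{gawkapi.h}:

@example
static awk_value_t *
do_xxx(int nargs, awk_value_t *result, struct awk_ext_func *unused)
@{
    (void) unused;

    if (nargs < 1) @{
        nonfatal(ext_id, "xxx: expected 1 argument, got %d", nargs);
        return make_number(-1.0, result);
    @}
    if (do_lint && nargs > 1)
        lintwarn(ext_id, "xxx: ignoring %d extra argument(s)",
                 nargs - 1);

    /* ... the real work would go here ... */
    return make_number(0.0, result);
@}
@end example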
35492
35493@node Updating @code{ERRNO}
35494@subsection Updating @code{ERRNO}
35495
35496The following functions allow you to update the @code{ERRNO}
35497variable:
35498
35499@table @code
35500@item void update_ERRNO_int(int errno_val);
35501Set @code{ERRNO} to the string equivalent of the error code
35502in @code{errno_val}. The value should be one of the defined
35503error codes in @code{<errno.h>}, and @command{gawk} turns it
35504into a (possibly translated) string using the C @code{strerror()} function.
35505
35506@item void update_ERRNO_string(const char *string);
Set @code{ERRNO} directly to the value of @code{string}.
35508@command{gawk} makes a copy of the value of @code{string}.
35509
35510@item void unset_ERRNO(void);
35511Unset @code{ERRNO}.
35512@end table
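
A typical pattern, sketched here with a hypothetical
@code{do_xxx_size()} extension function, is to clear @code{ERRNO} on
entry and to set it only when something goes wrong.  The fragment assumes
that @code{<errno.h>}, @code{<sys/stat.h>}, and @file{gawkapi.h} have
been included:

@example
/* Return the size of the named file, or -1 with ERRNO set. */
static awk_value_t *
do_xxx_size(int nargs, awk_value_t *result,
            struct awk_ext_func *unused)
@{
    awk_value_t path;
    struct stat sb;

    (void) nargs;
    (void) unused;

    unset_ERRNO();
    if (! get_argument(0, AWK_STRING, & path)) @{
        update_ERRNO_string("xxx_size: argument is not a string");
        return make_number(-1.0, result);
    @}
    if (stat(path.str_value.str, & sb) < 0) @{
        update_ERRNO_int(errno);    /* e.g., ENOENT */
        return make_number(-1.0, result);
    @}
    return make_number((double) sb.st_size, result);
@}
@end example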
35513
35514@node Requesting Values
35515@subsection Requesting Values
35516
35517All of the functions that return values from @command{gawk}
35518work in the same way. You pass in an @code{awk_valtype_t} value
35519to indicate what kind of value you expect.  If the actual value
35520matches what you requested, the function returns true and fills
35521in the @code{awk_value_t} result.
35522Otherwise, the function returns false, and the @code{val_type}
35523member indicates the type of the actual value.  You may then
35524print an error message or reissue the request for the actual
35525value type, as appropriate.  This behavior is summarized in
35526@ref{table-value-types-returned}.
35527
35528@float Table,table-value-types-returned
35529@caption{API value types returned}
35530@docbook
35531<informaltable>
35532<tgroup cols="8">
35533  <colspec colname="c1"/>
35534  <colspec colname="c2"/>
35535  <colspec colname="c3"/>
35536  <colspec colname="c4"/>
35537  <colspec colname="c5"/>
35538  <colspec colname="c6"/>
35539  <colspec colname="c7"/>
35540  <colspec colname="c8"/>
35541  <spanspec spanname="hspan" namest="c3" nameend="c8" align="center"/>
35542  <thead>
35543    <row><entry></entry><entry spanname="hspan"><para>Type of Actual Value</para></entry></row>
35544    <row>
35545      <entry></entry>
35546      <entry></entry>
35547      <entry><para>String</para></entry>
35548      <entry><para>Strnum</para></entry>
35549      <entry><para>Number</para></entry>
35550      <entry><para>Regex</para></entry>
35551      <entry><para>Array</para></entry>
35552      <entry><para>Undefined</para></entry>
35553    </row>
35554  </thead>
35555  <tbody>
35556    <row>
35557      <entry></entry>
35558      <entry><para><emphasis role="bold">String</emphasis></para></entry>
35559      <entry><para>String</para></entry>
35560      <entry><para>String</para></entry>
35561      <entry><para>String</para></entry>
35562      <entry><para>String</para></entry>
35563      <entry><para>false</para></entry>
35564      <entry><para>false</para></entry>
35565    </row>
35566    <row>
35567      <entry></entry>
35568      <entry><para><emphasis role="bold">Strnum</emphasis></para></entry>
35569      <entry><para>false</para></entry>
35570      <entry><para>Strnum</para></entry>
35571      <entry><para>Strnum</para></entry>
35572      <entry><para>false</para></entry>
35573      <entry><para>false</para></entry>
35574      <entry><para>false</para></entry>
35575    </row>
35576    <row>
35577      <entry></entry>
35578      <entry><para><emphasis role="bold">Number</emphasis></para></entry>
35579      <entry><para>Number</para></entry>
35580      <entry><para>Number</para></entry>
35581      <entry><para>Number</para></entry>
35582      <entry><para>false</para></entry>
35583      <entry><para>false</para></entry>
35584      <entry><para>false</para></entry>
35585    </row>
35586    <row>
35587      <entry><para><emphasis role="bold">Type</emphasis></para></entry>
35588      <entry><para><emphasis role="bold">Regex</emphasis></para></entry>
35589      <entry><para>false</para></entry>
35590      <entry><para>false</para></entry>
      <entry><para>false</para></entry>
      <entry><para>Regex</para></entry>
35593      <entry><para>false</para></entry>
35594      <entry><para>false</para></entry>
35595    </row>
35596    <row>
35597      <entry><para><emphasis role="bold">Requested</emphasis></para></entry>
35598      <entry><para><emphasis role="bold">Array</emphasis></para></entry>
35599      <entry><para>false</para></entry>
35600      <entry><para>false</para></entry>
35601      <entry><para>false</para></entry>
35602      <entry><para>false</para></entry>
35603      <entry><para>Array</para></entry>
35604      <entry><para>false</para></entry>
35605    </row>
35606    <row>
35607      <entry></entry>
35608      <entry><para><emphasis role="bold">Scalar</emphasis></para></entry>
35609      <entry><para>Scalar</para></entry>
35610      <entry><para>Scalar</para></entry>
35611      <entry><para>Scalar</para></entry>
35612      <entry><para>Scalar</para></entry>
35613      <entry><para>false</para></entry>
35614      <entry><para>false</para></entry>
35615    </row>
35616    <row>
35617      <entry></entry>
35618      <entry><para><emphasis role="bold">Undefined</emphasis></para></entry>
35619      <entry><para>String</para></entry>
35620      <entry><para>Strnum</para></entry>
35621      <entry><para>Number</para></entry>
35622      <entry><para>Regex</para></entry>
35623      <entry><para>Array</para></entry>
35624      <entry><para>Undefined</para></entry>
35625    </row>
35626    <row>
35627      <entry></entry>
35628      <entry><para><emphasis role="bold">Value cookie</emphasis></para></entry>
35629      <entry><para>false</para></entry>
35630      <entry><para>false</para></entry>
35631      <entry><para>false</para></entry>
35632      <entry><para>false</para></entry>
35633      <entry><para>false</para></entry>
35634      <entry><para>false</para></entry>
35635    </row>
35636  </tbody>
35637</tgroup>
35638</informaltable>
35639@end docbook
35640
35641@ifnotplaintext
35642@ifnotdocbook
35643@multitable @columnfractions .50 .50
35644@headitem @tab Type of Actual Value
35645@end multitable
35646@c 10/2014: Thanks to Karl Berry for this bit to reduce the space:
35647@tex
35648\vglue-1.1\baselineskip
35649@end tex
35650@c @multitable @columnfractions .166 .166 .198 .15 .15 .166
35651@multitable {Requested} {Undefined} {Number} {Number} {Scalar} {Regex} {Array} {Undefined}
35652@headitem @tab @tab String @tab Strnum @tab Number @tab Regex @tab Array @tab Undefined
35653@item @tab @b{String} @tab String @tab String @tab String @tab String @tab false @tab false
35654@item @tab @b{Strnum} @tab false @tab Strnum @tab Strnum @tab false @tab false @tab false
35655@item @tab @b{Number} @tab Number @tab Number @tab Number @tab false @tab false @tab false
35656@item @b{Type} @tab @b{Regex} @tab false @tab false @tab false @tab Regex @tab false @tab false
35657@item @b{Requested} @tab @b{Array} @tab false @tab false @tab false @tab false @tab Array @tab false
35658@item @tab @b{Scalar} @tab Scalar @tab Scalar @tab Scalar @tab Scalar @tab false @tab false
35659@item @tab @b{Undefined} @tab String @tab Strnum @tab Number @tab Regex @tab Array @tab Undefined
35660@item @tab @b{Value cookie} @tab false @tab false @tab false @tab false @tab false @tab false
35661@end multitable
35662@end ifnotdocbook
35663@end ifnotplaintext
35664@ifplaintext
35665@verbatim
35666                        +-------------------------------------------------------+
35667                        |                   Type of Actual Value:               |
35668                        +--------+--------+--------+--------+-------+-----------+
35669                        | String | Strnum | Number | Regex  | Array | Undefined |
35670+-----------+-----------+--------+--------+--------+--------+-------+-----------+
35671|           | String    | String | String | String | String | false | false     |
35672|           +-----------+--------+--------+--------+--------+-------+-----------+
35673|           | Strnum    | false  | Strnum | Strnum | false  | false | false     |
35674|           +-----------+--------+--------+--------+--------+-------+-----------+
35675|           | Number    | Number | Number | Number | false  | false | false     |
35676|           +-----------+--------+--------+--------+--------+-------+-----------+
35677|           | Regex     | false  | false  | false  | Regex  | false | false     |
35678|   Type    +-----------+--------+--------+--------+--------+-------+-----------+
35679| Requested | Array     | false  | false  | false  | false  | Array | false     |
35680|           +-----------+--------+--------+--------+--------+-------+-----------+
35681|           | Scalar    | Scalar | Scalar | Scalar | Scalar | false | false     |
35682|           +-----------+--------+--------+--------+--------+-------+-----------+
35683|           | Undefined | String | Strnum | Number | Regex  | Array | Undefined |
35684|           +-----------+--------+--------+--------+--------+-------+-----------+
35685|           | Value     | false  | false  | false  | false  | false | false     |
35686|           | Cookie    |        |        |        |        |       |           |
35687+-----------+-----------+--------+--------+--------+--------+-------+-----------+
35688@end verbatim
35689@end ifplaintext
35690@end float
35691
35692@node Accessing Parameters
35693@subsection Accessing and Updating Parameters
35694
35695Two functions give you access to the arguments (parameters)
35696passed to your extension function. They are:
35697
35698@table @code
35699@item awk_bool_t get_argument(size_t count,
35700@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ awk_valtype_t wanted,
35701@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ awk_value_t *result);
35702Fill in the @code{awk_value_t} structure pointed to by @code{result}
35703with the @code{count}th argument.  Return true if the actual
35704type matches @code{wanted}, and false otherwise.  In the latter
35705case, @code{result@w{->}val_type} indicates the actual type
35706(@pxref{table-value-types-returned}).  Counts are zero-based---the first
35707argument is numbered zero, the second one, and so on. @code{wanted}
35708indicates the type of value expected.
35709
35710@item awk_bool_t set_argument(size_t count, awk_array_t array);
35711Convert a parameter that was undefined into an array; this provides
35712call by reference for arrays.  Return false if @code{count} is too big,
35713or if the argument's type is not undefined.  @xref{Array Manipulation}
35714for more information on creating arrays.
35715@end table
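
The two functions can be combined to accept either an existing array or
an as-yet-unused variable as the first argument.  The following fragment
is only a sketch: @code{xxx_get_array_arg()} is a hypothetical helper, and
it relies on @code{create_array()}, which is described in the
discussion of array manipulation:

@example
/* Fetch argument 0 as an array, creating one if it is undefined.
   Return awk_false if the argument is something else. */
static awk_bool_t
xxx_get_array_arg(awk_array_t *array)
@{
    awk_value_t arg;

    if (get_argument(0, AWK_ARRAY, & arg)) @{
        *array = arg.array_cookie;
        return awk_true;
    @}

    /* The request failed; val_type holds the actual type. */
    if (arg.val_type == AWK_UNDEFINED) @{
        *array = create_array();
        return set_argument(0, *array);
    @}

    warning(ext_id, "xxx: first argument is not an array");
    return awk_false;
@}
@end example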
35716
35717@node Symbol Table Access
35718@subsection Symbol Table Access
35719@cindex accessing global variables from extensions
35720
35721Two sets of routines provide access to global variables, and one set
35722allows you to create and release cached values.
35723
35724@menu
35725* Symbol table by name::        Accessing variables by name.
35726* Symbol table by cookie::      Accessing variables by ``cookie''.
35727* Cached values::               Creating and using cached values.
35728@end menu
35729
35730@node Symbol table by name
35731@subsubsection Variable Access and Update by Name
35732
35733The following routines provide the ability to access and update
35734global @command{awk}-level variables by name.  In compiler terminology,
35735identifiers of different kinds are termed @dfn{symbols}, thus the ``sym''
35736in the routines' names.  The data structure that stores information
35737about symbols is termed a @dfn{symbol table}.
35738The functions are as follows:
35739
35740@table @code
35741@item awk_bool_t sym_lookup(const char *name,
35742@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ awk_valtype_t wanted,
35743@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ awk_value_t *result);
35744Fill in the @code{awk_value_t} structure pointed to by @code{result}
35745with the value of the variable named by the string @code{name}, which is
35746a regular C string.  @code{wanted} indicates the type of value expected.
35747Return true if the actual type matches @code{wanted}, and false otherwise.
35748In the latter case, @code{result->val_type} indicates the actual type
35749(@pxref{table-value-types-returned}).
35750
@item awk_bool_t sym_lookup_ns(const char *name_space,
@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ const char *name,
35753@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ awk_valtype_t wanted,
35754@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ awk_value_t *result);
35755This is like @code{sym_lookup()}, but the @code{name_space} parameter allows you
35756to specify which namespace @code{name} is part of.  @code{name_space} cannot be
35757@code{NULL}. If it is @code{""} or @code{"awk"}, then @code{name} is searched
35758for in the default @code{awk} namespace.
35759
35760Note that @code{namespace} is a C++ keyword. For interoperability with C++,
35761you should avoid using that identifier in C code.
35762
35763@item awk_bool_t sym_update(const char *name, awk_value_t *value);
35764Update the variable named by the string @code{name}, which is a regular
35765C string.  The variable is added to @command{gawk}'s symbol table
35766if it is not there.  Return true if everything worked, and false otherwise.
35767
35768Changing types (scalar to array or vice versa) of an existing variable
35769is @emph{not} allowed, nor may this routine be used to update an array.
35770This routine cannot be used to update any of the predefined
35771variables (such as @code{ARGC} or @code{NF}).
35772
35773@item awk_bool_t sym_update_ns(const char *name_space, const char *name, awk_value_t *value);
35774This is like @code{sym_update()}, but the @code{name_space} parameter allows you
35775to specify which namespace @code{name} is part of.  @code{name_space} cannot be
35776@code{NULL}. If it is @code{""} or @code{"awk"}, then @code{name} is searched
35777for in the default @code{awk} namespace.
35778@end table
35779
35780An extension can look up the value of @command{gawk}'s special variables.
35781However, with the exception of the @code{PROCINFO} array, an extension
35782cannot change any of those variables.
35783
35784When searching for or updating variables outside the @code{awk} namespace
35785(@pxref{Namespaces}), function and variable names must be simple
35786identifiers.@footnote{Allowing both namespace plus identifier and
35787@code{foo::bar} would have been too confusing to document, and to code
35788and test.} In addition, namespace names and variable and function names
35789must follow the rules given in @ref{Naming Rules}.
35790
35791@node Symbol table by cookie
35792@subsubsection Variable Access and Update by Cookie
35793
35794A @dfn{scalar cookie} is an opaque handle that provides access
35795to a global variable or array. It is an optimization that
35796avoids looking up variables in @command{gawk}'s symbol table every time
35797access is needed. This was discussed earlier, in @ref{General Data Types}.
35798
35799@need 1500
35800The following functions let you work with scalar cookies:
35801
35802@table @code
35803@item awk_bool_t sym_lookup_scalar(awk_scalar_t cookie,
35804@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ awk_valtype_t wanted,
35805@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ awk_value_t *result);
35806Retrieve the current value of a scalar cookie.
35807Once you have obtained a scalar cookie using @code{sym_lookup()}, you can
35808use this function to get its value more efficiently.
35809Return false if the value cannot be retrieved.
35810
35811@item awk_bool_t sym_update_scalar(awk_scalar_t cookie, awk_value_t *value);
35812Update the value associated with a scalar cookie.  Return false if
35813the new value is not of type @code{AWK_STRING}, @code{AWK_STRNUM}, @code{AWK_REGEX}, or @code{AWK_NUMBER}.
35814Here too, the predefined variables may not be updated.
35815@end table
35816
35817It is not obvious at first glance how to work with scalar cookies or
35818what their @i{raison d'@^etre} really is.  In theory, the @code{sym_lookup()}
35819and @code{sym_update()} routines are all you really need to work with
35820variables.  For example, you might have code that looks up the value of
35821a variable, evaluates a condition, and then possibly changes the value
35822of the variable based on the result of that evaluation, like so:
35823
35824@example
35825/*  do_magic --- do something really great */
35826
35827static awk_value_t *
35828do_magic(int nargs, awk_value_t *result)
35829@{
35830    awk_value_t value;
35831
35832    if (   sym_lookup("MAGIC_VAR", AWK_NUMBER, & value)
35833        && some_condition(value.num_value)) @{
35834            value.num_value += 42;
35835            sym_update("MAGIC_VAR", & value);
35836    @}
35837
35838    return make_number(0.0, result);
35839@}
35840@end example
35841
35842@noindent
35843This code looks (and is) simple and straightforward. So what's the problem?
35844
35845Well, consider what happens if @command{awk}-level code associated
35846with your extension calls the @code{magic()} function (implemented in
35847C by @code{do_magic()}), once per record, while processing hundreds
35848of thousands or millions of records.  The @code{MAGIC_VAR} variable is
35849looked up in the symbol table once or twice per function call!
35850
35851The symbol table lookup is really pure overhead; it is considerably
35852more efficient to get a cookie that represents the variable, and use
35853that to get the variable's value and update it as needed.@footnote{The
35854difference is measurable and quite real. Trust us.}
35855
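The cost difference can be seen in miniature with a self-contained toy
model.  This is only a sketch: the @code{toy_} names, the three-entry
table, and the linear search are assumptions made for illustration, and
none of it is @command{gawk}'s real symbol table or API.  The model counts
name comparisons for a by-name update loop and for a loop that fetches a
handle once and reuses it:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a symbol table entry; not gawk's real layout. */
struct toy_symbol {
    const char *name;
    double num_value;
};

static struct toy_symbol toy_table[] = {
    { "NR_SEEN",   0.0 },
    { "LAST_KEY",  0.0 },
    { "MAGIC_VAR", 42.0 },
};

/* Counts name comparisons so the cost of each strategy is visible. */
static unsigned long toy_compare_count;

/* Find a variable by name; the search cost is paid on every call. */
static struct toy_symbol *
toy_lookup_by_name(const char *name)
{
    size_t i;

    assert(name != NULL);
    for (i = 0; i < sizeof(toy_table) / sizeof(toy_table[0]); i++) {
        toy_compare_count++;
        if (strcmp(toy_table[i].name, name) == 0)
            return & toy_table[i];
    }
    return NULL;
}

/* Update MAGIC_VAR once per "record", looking it up by name each time.
   Returns the number of name comparisons performed. */
static unsigned long
toy_update_by_name(int nrecords)
{
    int i;

    toy_compare_count = 0;
    for (i = 0; i < nrecords; i++) {
        struct toy_symbol *sym = toy_lookup_by_name("MAGIC_VAR");

        assert(sym != NULL);
        sym->num_value += 1.0;
    }
    return toy_compare_count;
}

/* Same work, but the handle (the "cookie") is fetched only once.
   Returns the number of name comparisons performed. */
static unsigned long
toy_update_by_cookie(int nrecords)
{
    int i;
    struct toy_symbol *cookie;

    toy_compare_count = 0;
    cookie = toy_lookup_by_name("MAGIC_VAR");
    assert(cookie != NULL);
    for (i = 0; i < nrecords; i++)
        cookie->num_value += 1.0;
    return toy_compare_count;
}
```

In this model, the by-name loop pays for a full search on every record,
whereas the handle-based loop pays for exactly one search no matter how
many records are processed.
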
35856Thus, the way to use cookies is as follows.  First, install
35857your extension's variable in @command{gawk}'s symbol table using
35858@code{sym_update()}, as usual. Then get a scalar cookie for the variable
35859using @code{sym_lookup()}:
35860
35861@example
35862@group
35863static awk_scalar_t magic_var_cookie;    /* cookie for MAGIC_VAR */
35864
35865static void
35866my_extension_init()
35867@{
35868    awk_value_t value;
35869@end group
35870
35871    /* install initial value */
35872    sym_update("MAGIC_VAR", make_number(42.0, & value));
35873
35874    /* get the cookie */
35875    sym_lookup("MAGIC_VAR", AWK_SCALAR, & value);
35876
35877    /* save the cookie */
35878    magic_var_cookie = value.scalar_cookie;
35879    @dots{}
35880@}
35881@end example
35882
35883Next, use the routines in this @value{SECTION} for retrieving and updating
35884the value through the cookie.  Thus, @code{do_magic()} now becomes
35885something like this:
35886
35887@example
35888/*  do_magic --- do something really great */
35889
35890static awk_value_t *
35891do_magic(int nargs, awk_value_t *result)
35892@{
35893    awk_value_t value;
35894
35895    if (   sym_lookup_scalar(magic_var_cookie, AWK_NUMBER, & value)
35896        && some_condition(value.num_value)) @{
35897            value.num_value += 42;
35898            sym_update_scalar(magic_var_cookie, & value);
35899    @}
35900    @dots{}
35901
35902    return make_number(0.0, result);
35903@}
35904@end example
35905
35906@quotation NOTE
35907The previous code omitted error checking for
35908presentation purposes.  Your extension code should be more robust
35909and carefully check the return values from the API functions.
35910@end quotation
35911
35912@node Cached values
35913@subsubsection Creating and Using Cached Values
35914
35915The routines in this @value{SECTION} allow you to create and release
35916cached values.  Like scalar cookies, in theory, cached values
35917are not necessary. You can create numbers and strings using
35918the functions in @ref{Constructor Functions}. You can then
35919assign those values to variables using @code{sym_update()}
35920or @code{sym_update_scalar()}, as you like.
35921
35922However, you can understand the point of cached values if you remember that
35923@emph{every} string value's storage @emph{must} come from @code{gawk_malloc()},
35924@code{gawk_calloc()}, or @code{gawk_realloc()}.
35925If you have 20 variables, all of which have the same string value, you
35926must create 20 identical copies of the string.@footnote{Numeric values
35927are clearly less problematic, requiring only a C @code{double} to store.
35928But of course, GMP and MPFR values @emph{do} take up more memory.}
35929
35930It is clearly more efficient, if possible, to create a value once, and
35931then tell @command{gawk} to reuse the value for multiple variables. That
35932is what the routines in this @value{SECTION} let you do.  The functions are as follows:
35933
35934@table @code
35935@item awk_bool_t create_value(awk_value_t *value, awk_value_cookie_t *result);
35936Create a cached string or numeric value from @code{value} for
35937efficient later assignment.  Only values of type @code{AWK_NUMBER}, @code{AWK_REGEX}, @code{AWK_STRNUM},
35938and @code{AWK_STRING} are allowed.  Any other type is rejected.
35939@code{AWK_UNDEFINED} could be allowed, but doing so would result in
35940inferior performance.
35941
35942@item awk_bool_t release_value(awk_value_cookie_t vc);
35943Release the memory associated with a value cookie obtained
35944from @code{create_value()}.
35945@end table
35946
35947You use value cookies in a fashion similar to the way you use scalar cookies.
35948In the extension initialization routine, you create the value cookie:
35949
35950@example
35951static awk_value_cookie_t answer_cookie;  /* static value cookie */
35952
35953static void
35954my_extension_init()
35955@{
35956    awk_value_t value;
35957    char *long_string;
35958    size_t long_string_len;
35959
35960    /* code from earlier */
35961    @dots{}
35962    /* @dots{} fill in long_string and long_string_len @dots{} */
35963    make_malloced_string(long_string, long_string_len, & value);
35964    create_value(& value, & answer_cookie);    /* create cookie */
35965    @dots{}
35966@}
35967@end example
35968
35969Once the value is created, you can use it as the value of any number
35970of variables:
35971
35972@example
35973static awk_value_t *
35974do_magic(int nargs, awk_value_t *result)
35975@{
35976    awk_value_t new_value;
35977
35978    @dots{}    /* as earlier */
35979
    new_value.val_type = AWK_VALUE_COOKIE;
    new_value.value_cookie = answer_cookie;
    sym_update("VAR1", & new_value);
    sym_update("VAR2", & new_value);
    @dots{}
    sym_update("VAR100", & new_value);
35986    @dots{}
35987@}
35988@end example
35989
35990@noindent
35991Using value cookies in this way saves considerable storage, as all of
35992@code{VAR1} through @code{VAR100} share the same value.
35993
35994You might be wondering, ``Is this sharing problematic?
35995What happens if @command{awk} code assigns a new value to @code{VAR1};
35996are all the others changed too?''
35997
35998That's a great question. The answer is that no, it's not a problem.
35999Internally, @command{gawk} uses @dfn{reference-counted strings}. This means
36000that many variables can share the same string value, and @command{gawk}
36001keeps track of the usage.  When a variable's value changes, @command{gawk}
36002simply decrements the reference count on the old value and updates
36003the variable to use the new value.
36004
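The idea of reference counting can be sketched with a small stand-alone
model.  Everything here is hypothetical (the @code{toy_} types and
functions are invented for illustration); @command{gawk}'s internal
representation is different and is not visible to extensions.  The sketch
shares one string among three variables and then assigns a new value to
one of them:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical shared string; gawk's internal representation differs. */
struct toy_string {
    char *text;
    int refcount;
};

/* A variable is just a pointer to a (possibly shared) value. */
struct toy_variable {
    struct toy_string *value;
};

/* Create a value holding one reference, owned by the caller. */
static struct toy_string *
toy_string_new(const char *text)
{
    struct toy_string *s = malloc(sizeof(*s));

    assert(s != NULL);
    s->text = malloc(strlen(text) + 1);
    assert(s->text != NULL);
    strcpy(s->text, text);
    s->refcount = 1;
    return s;
}

/* Drop one reference; free the storage when nobody uses it anymore. */
static void
toy_string_unref(struct toy_string *s)
{
    if (s != NULL && --s->refcount == 0) {
        free(s->text);
        free(s);
    }
}

/* Assign: share the new value, then release the old one. */
static void
toy_assign(struct toy_variable *var, struct toy_string *new_value)
{
    struct toy_string *old = var->value;

    new_value->refcount++;
    var->value = new_value;
    toy_string_unref(old);
}

/* Share one value among three variables, then reassign one of them.
   Returns 1 if the other two variables are unaffected, 0 otherwise. */
static int
toy_sharing_demo(void)
{
    struct toy_variable var1 = { NULL }, var2 = { NULL }, var3 = { NULL };
    struct toy_string *cached = toy_string_new("a long string");
    struct toy_string *other = toy_string_new("something else");
    int ok;

    toy_assign(& var1, cached);
    toy_assign(& var2, cached);
    toy_assign(& var3, cached);    /* count: creator + 3 variables */

    toy_assign(& var1, other);     /* only var1 changes */

    ok = cached->refcount == 3
         && var2.value == cached && var3.value == cached
         && strcmp(var1.value->text, "something else") == 0;

    /* release the variables' references, then the creator's */
    toy_string_unref(var1.value);
    toy_string_unref(var2.value);
    toy_string_unref(var3.value);
    toy_string_unref(cached);
    toy_string_unref(other);
    return ok;
}
```

In the model, reassigning one variable only lowers the count on the shared
value; the remaining variables keep pointing at the same, unchanged string.
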
36005Finally, as part of your cleanup action (@pxref{Exit Callback Functions})
36006you should release any cached values that you created, using
36007@code{release_value()}.
36008
36009@node Array Manipulation
36010@subsection Array Manipulation
36011@cindex array manipulation in extensions
36012@cindex extensions @subentry loadable @subentry array manipulation in
36013
36014The primary data structure@footnote{OK, the only data structure.} in @command{awk}
36015is the associative array (@pxref{Arrays}).
36016Extensions need to be able to manipulate @command{awk} arrays.
36017The API provides a number of data structures for working with arrays,
36018functions for working with individual elements, and functions for
36019working with arrays as a whole. This includes the ability to
36020``flatten'' an array so that it is easy for C code to traverse
36021every element in an array.  The array data structures integrate
36022nicely with the data structures for values to make it easy to
36023both work with and create true arrays of arrays (@pxref{General Data Types}).
36024
36025@menu
36026* Array Data Types::            Data types for working with arrays.
36027* Array Functions::             Functions for working with arrays.
36028* Flattening Arrays::           How to flatten arrays.
36029* Creating Arrays::             How to create and populate arrays.
36030@end menu
36031
36032@node Array Data Types
36033@subsubsection Array Data Types
36034
36035The data types associated with arrays are as follows:
36036
36037@table @code
36038@item typedef void *awk_array_t;
36039If you request the value of an array variable, you get back an
36040@code{awk_array_t} value. This value is opaque@footnote{It is also
36041a ``cookie,'' but the @command{gawk} developers did not wish to overuse this
36042term.} to the extension; it uniquely identifies the array but can
36043only be used by passing it into API functions or receiving it from API
functions. This is very similar to the way @samp{FILE *} values are used
36045with the @code{<stdio.h>} library routines.
36046
36047@item typedef struct awk_element @{
36048@itemx @ @ @ @ /* convenience linked list pointer, not used by gawk */
36049@itemx @ @ @ @ struct awk_element *next;
36050@itemx @ @ @ @ enum @{
36051@itemx @ @ @ @ @ @ @ @ AWK_ELEMENT_DEFAULT = 0,@ @ /* set by gawk */
36052@itemx @ @ @ @ @ @ @ @ AWK_ELEMENT_DELETE = 1@ @ @ @ /* set by extension */
36053@itemx @ @ @ @ @} flags;
36054@itemx @ @ @ @ awk_value_t    index;
36055@itemx @ @ @ @ awk_value_t    value;
36056@itemx @} awk_element_t;
36057The @code{awk_element_t} is a ``flattened''
array element. @command{gawk} produces an array of these
36059inside the @code{awk_flat_array_t} (see the next item).
36060Individual elements may be marked for deletion. New elements must be added
36061individually, one at a time, using the separate API for that purpose.
36062The fields are as follows:
36063
36064@c nested table
36065@table @code
36066@item struct awk_element *next;
36067This pointer is for the convenience of extension writers.  It allows
36068an extension to create a linked list of new elements that can then be
36069added to an array in a loop that traverses the list.
36070
36071@item enum @{ @dots{} @} flags;
36072A set of flag values that convey information between the extension
36073and @command{gawk}.  Currently there is only one: @code{AWK_ELEMENT_DELETE}.
36074Setting it causes @command{gawk} to delete the
36075element from the original array upon release of the flattened array.
36076
36077@item index
36078@itemx value
36079The index and value of the element, respectively.
36080@emph{All} memory pointed to by @code{index} and @code{value} belongs to @command{gawk}.
36081@end table
36082
36083@item typedef struct awk_flat_array @{
36084@itemx @ @ @ @ awk_const void *awk_const opaque1;@ @ @ @ /* for use by gawk */
36085@itemx @ @ @ @ awk_const void *awk_const opaque2;@ @ @ @ /* for use by gawk */
36086@itemx @ @ @ @ awk_const size_t count;@ @ @ @ @ /* how many elements */
36087@itemx @ @ @ @ awk_element_t elements[1];@ @ /* will be extended */
36088@itemx @} awk_flat_array_t;
36089This is a flattened array. When an extension gets one of these
36090from @command{gawk}, the @code{elements} array is of actual
36091size @code{count}.
36092The @code{opaque1} and @code{opaque2} pointers are for use by @command{gawk};
36093therefore they are marked @code{awk_const} so that the extension cannot
36094modify them.
36095@end table
36096
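The @samp{elements[1]} declaration with its ``will be extended'' comment
is the traditional C idiom for a structure with a variable-length tail.
The following stand-alone sketch shows how such a structure is typically
allocated and traversed.  The @code{toy_} types are simplified stand-ins
invented for this example; an extension never allocates an
@code{awk_flat_array_t} itself, since @command{gawk} does that:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Simplified, hypothetical stand-ins for the API structures. */
struct toy_element {
    int index;
    double value;
};

struct toy_flat_array {
    size_t count;                    /* how many elements */
    struct toy_element elements[1];  /* will be extended */
};

/* Allocate the header plus room for all `count' elements in one block. */
static struct toy_flat_array *
toy_flat_alloc(size_t count)
{
    struct toy_flat_array *fa;
    size_t extra = count > 0 ? count - 1 : 0;   /* one is already declared */
    size_t i;

    fa = malloc(sizeof(*fa) + extra * sizeof(fa->elements[0]));
    assert(fa != NULL);
    fa->count = count;
    for (i = 0; i < count; i++) {
        fa->elements[i].index = (int) i + 1;
        fa->elements[i].value = 10.0 * (double) (i + 1);
    }
    return fa;
}

/* Traverse using `count' as the bound, not the declared array size. */
static double
toy_flat_sum(const struct toy_flat_array *fa)
{
    double sum = 0.0;
    size_t i;

    for (i = 0; i < fa->count; i++)
        sum += fa->elements[i].value;
    return sum;
}

/* Allocate, traverse, and free; returns the sum of the values. */
static double
toy_flat_demo(size_t count)
{
    struct toy_flat_array *fa = toy_flat_alloc(count);
    double sum = toy_flat_sum(fa);

    free(fa);
    return sum;
}
```

The loop index is a @code{size_t} to match the type of @code{count}.
Newer C code often spells the tail as a C99 flexible array member
(@samp{elements[]}); the one-element form shown here is the older,
widely supported variant.
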
36097@node Array Functions
36098@subsubsection Array Functions
36099
36100The following functions relate to individual array elements:
36101
36102@table @code
36103@item awk_bool_t get_element_count(awk_array_t a_cookie, size_t *count);
36104For the array represented by @code{a_cookie}, place in @code{*count}
36105the number of elements it contains. A subarray counts as a single element.
36106Return false if there is an error.
36107
36108@item awk_bool_t get_array_element(awk_array_t a_cookie,
36109@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ const awk_value_t *const index,
36110@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ awk_valtype_t wanted,
36111@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ awk_value_t *result);
36112For the array represented by @code{a_cookie}, return in @code{*result}
36113the value of the element whose index is @code{index}.
36114@code{wanted} specifies the type of value you wish to retrieve.
36115Return false if @code{wanted} does not match the actual type or if
36116@code{index} is not in the array (@pxref{table-value-types-returned}).
36117
36118The value for @code{index} can be numeric, in which case @command{gawk}
36119converts it to a string. Using nonintegral values is possible, but
36120requires that you understand how such values are converted to strings
36121(@pxref{Conversion}); thus, using integral values is safest.
36122
36123As with @emph{all} strings passed into @command{gawk} from an extension,
36124the string value of @code{index} must come from @code{gawk_malloc()},
36125@code{gawk_calloc()}, or @code{gawk_realloc()}, and
36126@command{gawk} releases the storage.
36127
36128@item awk_bool_t set_array_element(awk_array_t a_cookie,
36129@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ const@ awk_value_t *const index,
36130@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ const@ awk_value_t *const value);
36131In the array represented by @code{a_cookie}, create or modify
36132the element whose index is given by @code{index}.
36133The @code{ARGV} and @code{ENVIRON} arrays may not be changed,
36134although the @code{PROCINFO} array can be.
36135
36136@item awk_bool_t set_array_element_by_elem(awk_array_t a_cookie,
36137@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ awk_element_t element);
36138Like @code{set_array_element()}, but take the @code{index} and @code{value}
36139from @code{element}. This is a convenience macro.
36140
36141@item awk_bool_t del_array_element(awk_array_t a_cookie,
36142@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ const awk_value_t* const index);
36143Remove the element with the given index from the array
36144represented by @code{a_cookie}.
36145Return true if the element was removed, or false if the element did
36146not exist in the array.
36147@end table
36148
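To see why integral index values are the safest choice for
@code{get_array_element()}, consider how @command{awk} turns a number
into a subscript string: integral values are converted as integers, and
other values are formatted using @code{CONVFMT}, whose default is
@code{"%.6g"}.  The following stand-alone sketch models that rule.  The
helper names are hypothetical, the range check is a simplification, and
the code is not what @command{gawk} runs internally:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * toy_index_to_string: model of number-to-subscript conversion.
 * Hypothetical helper, not part of the gawk API.
 */
static const char *
toy_index_to_string(double d, char *buf, size_t buflen)
{
    assert(buf != NULL && buflen > 0);

    /* Integral values (within a safely representable range) print as integers. */
    if (d > -1e15 && d < 1e15 && (double) (long long) d == d)
        snprintf(buf, buflen, "%lld", (long long) d);
    else
        snprintf(buf, buflen, "%.6g", d);   /* default CONVFMT */
    return buf;
}

/* Return 1 if d converts to the subscript string `expected', else 0. */
static int
toy_index_is(double d, const char *expected)
{
    char buf[64];

    return strcmp(toy_index_to_string(d, buf, sizeof(buf)), expected) == 0;
}
```

Under this model, @code{3.0} becomes the subscript @code{"3"}, while
@code{0.3} and @code{0.1 + 0.2} (which differ as C @code{double} values)
both become @code{"0.3"} and so name the same element.  Likewise,
@code{1000000.5} becomes @code{"1e+06"}, which is probably not the
subscript the caller had in mind.
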
36149The following functions relate to arrays as a whole:
36150
36151@table @code
36152@item awk_array_t create_array(void);
36153Create a new array to which elements may be added.
@xref{Creating Arrays}, for a discussion of how to
36155create a new array and add elements to it.
36156
36157@item awk_bool_t clear_array(awk_array_t a_cookie);
36158Clear the array represented by @code{a_cookie}.
36159Return false if there was some kind of problem, true otherwise.
36160The array remains an array, but after calling this function, it
36161has no elements. This is equivalent to using the @code{delete}
36162statement (@pxref{Delete}).
36163
36164@item awk_bool_t flatten_array_typed(awk_array_t a_cookie,
36165@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ awk_flat_array_t **data,
36166@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ awk_valtype_t index_type,
36167@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ awk_valtype_t value_type);
36168For the array represented by @code{a_cookie}, create an @code{awk_flat_array_t}
36169structure and fill it in with indices and values of the requested types.
36170Set the pointer whose address is passed as @code{data}
36171to point to this structure.
36172Return true upon success, or false otherwise.
36173@ifset FOR_PRINT
36174See the next @value{SECTION}
36175@end ifset
36176@ifclear FOR_PRINT
36177@xref{Flattening Arrays},
36178@end ifclear
36179for a discussion of how to
36180flatten an array and work with it.
36181
36182@item awk_bool_t flatten_array(awk_array_t a_cookie, awk_flat_array_t **data);
36183For the array represented by @code{a_cookie}, create an @code{awk_flat_array_t}
36184structure and fill it in with @code{AWK_STRING} indices and
36185@code{AWK_UNDEFINED} values.
36186This is superseded by @code{flatten_array_typed()}.
36187It is provided as a macro, and remains for convenience and for source code
36188compatibility with the previous version of the API.
36189
36190@item awk_bool_t release_flattened_array(awk_array_t a_cookie,
36191@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ awk_flat_array_t *data);
36192When done with a flattened array, release the storage using this function.
36193You must pass in both the original array cookie and the address of
36194the created @code{awk_flat_array_t} structure.
36195The function returns true upon success, false otherwise.
36196@end table
36197
36198@node Flattening Arrays
36199@subsubsection Working With All The Elements of an Array
36200
36201To @dfn{flatten} an array is to create a structure that
36202represents the full array in a fashion that makes it easy
36203for C code to traverse the entire array.  Some of the code
36204in @file{extension/testext.c} does this, and also serves
36205as a nice example showing how to use the APIs.
36206
36207We walk through that part of the code one step at a time.
36208First, the @command{gawk} script that drives the test extension:
36209
36210@example
36211@@load "testext"
36212BEGIN @{
36213    n = split("blacky rusty sophie raincloud lucky", pets)
36214    printf("pets has %d elements\n", length(pets))
36215    ret = dump_array_and_delete("pets", "3")
36216    printf("dump_array_and_delete(pets) returned %d\n", ret)
36217    if ("3" in pets)
36218        printf("dump_array_and_delete() did NOT remove index \"3\"!\n")
36219    else
36220        printf("dump_array_and_delete() did remove index \"3\"!\n")
36221    print ""
36222@}
36223@end example
36224
36225@noindent
36226This code creates an array with @code{split()} (@pxref{String Functions})
36227and then calls @code{dump_array_and_delete()}. That function looks up
36228the array whose name is passed as the first argument, and
36229deletes the element at the index passed in the second argument.
36230The @command{awk} code then prints the return value and checks if the element
36231was indeed deleted.  Here is the C code that implements
36232@code{dump_array_and_delete()}. It has been edited slightly for
36233presentation.
36234
36235The first part declares variables, sets up the default
36236return value in @code{result}, and checks that the function
36237was called with the correct number of arguments:
36238
36239@example
36240static awk_value_t *
36241dump_array_and_delete(int nargs, awk_value_t *result)
36242@{
36243    awk_value_t value, value2, value3;
36244    awk_flat_array_t *flat_array;
36245    size_t count;
36246    char *name;
    size_t i;
36248
36249    assert(result != NULL);
36250    make_number(0.0, result);
36251
36252    if (nargs != 2) @{
36253        printf("dump_array_and_delete: nargs not right "
36254               "(%d should be 2)\n", nargs);
36255        goto out;
36256    @}
36257@end example
36258
36259The function then proceeds in steps, as follows. First, retrieve
36260the name of the array, passed as the first argument, followed by
36261the array itself. If either operation fails, print an
36262error message and return:
36263
36264@example
36265    /* get argument named array as flat array and print it */
36266    if (get_argument(0, AWK_STRING, & value)) @{
36267        name = value.str_value.str;
36268        if (sym_lookup(name, AWK_ARRAY, & value2))
36269            printf("dump_array_and_delete: sym_lookup of %s passed\n",
36270                   name);
36271        else @{
36272            printf("dump_array_and_delete: sym_lookup of %s failed\n",
36273                   name);
36274            goto out;
36275        @}
36276    @} else @{
36277        printf("dump_array_and_delete: get_argument(0) failed\n");
36278        goto out;
36279    @}
36280@end example
36281
36282For testing purposes and to make sure that the C code sees
36283the same number of elements as the @command{awk} code,
36284the second step is to get the count of elements in the array
36285and print it:
36286
36287@example
36288    if (! get_element_count(value2.array_cookie, & count)) @{
36289        printf("dump_array_and_delete: get_element_count failed\n");
36290        goto out;
36291    @}
36292
36293    printf("dump_array_and_delete: incoming size is %lu\n",
36294           (unsigned long) count);
36295@end example
36296
36297The third step is to actually flatten the array, and then
36298to double-check that the count in the @code{awk_flat_array_t}
36299is the same as the count just retrieved:
36300
36301@example
36302    if (! flatten_array_typed(value2.array_cookie, & flat_array,
36303                              AWK_STRING, AWK_UNDEFINED)) @{
36304        printf("dump_array_and_delete: could not flatten array\n");
36305        goto out;
36306    @}
36307
36308    if (flat_array->count != count) @{
36309        printf("dump_array_and_delete: flat_array->count (%lu)"
36310               " != count (%lu)\n",
36311                (unsigned long) flat_array->count,
36312                (unsigned long) count);
36313        goto out;
36314    @}
36315@end example
36316
36317The fourth step is to retrieve the index of the element
36318to be deleted, which was passed as the second argument.
36319Remember that argument counts passed to @code{get_argument()}
36320are zero-based, and thus the second argument is numbered one:
36321
36322@example
36323    if (! get_argument(1, AWK_STRING, & value3)) @{
36324        printf("dump_array_and_delete: get_argument(1) failed\n");
36325        goto out;
36326    @}
36327@end example
36328
36329The fifth step is where the ``real work'' is done. The function
36330loops over every element in the array, printing the index and
36331element values. In addition, upon finding the element with the
36332index that is supposed to be deleted, the function sets the
36333@code{AWK_ELEMENT_DELETE} bit in the @code{flags} field
36334of the element.  When the array is released, @command{gawk}
36335traverses the flattened array, and deletes any elements that
36336have this flag bit set:
36337
36338@example
36339    for (i = 0; i < flat_array->count; i++) @{
36340        printf("\t%s[\"%.*s\"] = %s\n",
36341            name,
36342            (int) flat_array->elements[i].index.str_value.len,
36343            flat_array->elements[i].index.str_value.str,
36344            valrep2str(& flat_array->elements[i].value));
36345
36346        if (strcmp(value3.str_value.str,
36347                   flat_array->elements[i].index.str_value.str) == 0) @{
36348            flat_array->elements[i].flags |= AWK_ELEMENT_DELETE;
36349            printf("dump_array_and_delete: marking element \"%s\" "
36350                   "for deletion\n",
36351                flat_array->elements[i].index.str_value.str);
36352        @}
36353    @}
36354@end example
36355
36356The sixth step is to release the flattened array. This tells
36357@command{gawk} that the extension is no longer using the array,
36358and that it should delete any elements marked for deletion.
36359@command{gawk} also frees any storage that was allocated,
36360so you should not use the pointer (@code{flat_array} in this
36361code) once you have called @code{release_flattened_array()}:
36362
36363@example
36364    if (! release_flattened_array(value2.array_cookie, flat_array)) @{
36365        printf("dump_array_and_delete: could not release flattened array\n");
36366        goto out;
36367    @}
36368@end example
36369
36370Finally, because everything was successful, the function sets the
36371return value to success, and returns:
36372
36373@example
36374@group
36375    make_number(1.0, result);
36376out:
36377    return result;
36378@}
36379@end group
36380@end example
36381
36382Here is the output from running this part of the test:
36383
36384@example
36385pets has 5 elements
36386dump_array_and_delete: sym_lookup of pets passed
36387dump_array_and_delete: incoming size is 5
36388        pets["1"] = "blacky"
36389        pets["2"] = "rusty"
36390        pets["3"] = "sophie"
36391dump_array_and_delete: marking element "3" for deletion
36392        pets["4"] = "raincloud"
36393        pets["5"] = "lucky"
36394dump_array_and_delete(pets) returned 1
36395dump_array_and_delete() did remove index "3"!
36396@end example
36397
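The two-phase protocol used by @code{dump_array_and_delete()} (mark
elements while traversing, then let the release step do the actual
deletion) can be modeled without @command{gawk} at all.  The following
sketch is only an illustration: the @code{toy_} names and the fixed table
of pets are assumptions, and the ``release'' here simply compacts a C
array instead of updating a real @command{awk} array:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flag values mirroring the API's naming. */
enum { TOY_ELEMENT_DEFAULT = 0, TOY_ELEMENT_DELETE = 1 };

struct toy_entry {
    const char *index;
    const char *value;
    int flags;
};

static struct toy_entry toy_pets[] = {
    { "1", "blacky",    TOY_ELEMENT_DEFAULT },
    { "2", "rusty",     TOY_ELEMENT_DEFAULT },
    { "3", "sophie",    TOY_ELEMENT_DEFAULT },
    { "4", "raincloud", TOY_ELEMENT_DEFAULT },
    { "5", "lucky",     TOY_ELEMENT_DEFAULT },
};

/* Phase 1: traverse and mark; nothing is removed yet.
   Returns the number of entries marked. */
static size_t
toy_mark(struct toy_entry *entries, size_t count, const char *victim)
{
    size_t i, marked = 0;

    assert(entries != NULL && victim != NULL);
    for (i = 0; i < count; i++) {
        if (strcmp(entries[i].index, victim) == 0) {
            entries[i].flags |= TOY_ELEMENT_DELETE;
            marked++;
        }
    }
    return marked;
}

/* Phase 2: on release, drop every marked entry.
   Returns the number of entries that remain. */
static size_t
toy_release(struct toy_entry *entries, size_t count)
{
    size_t in, out = 0;

    assert(entries != NULL);
    for (in = 0; in < count; in++) {
        if ((entries[in].flags & TOY_ELEMENT_DELETE) == 0)
            entries[out++] = entries[in];
    }
    return out;
}
```

Because the traversal never removes anything, the loop bound and the
element positions stay stable for its whole duration; the removal happens
in one place, after the traversal is finished.
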
36398@node Creating Arrays
36399@subsubsection How To Create and Populate Arrays
36400
36401Besides working with arrays created by @command{awk} code, you can
36402create arrays and populate them as you see fit, and then @command{awk}
36403code can access them and manipulate them.
36404
36405There are two important points about creating arrays from extension code:
36406
36407@itemize @value{BULLET}
36408@item
36409You must install a new array into @command{gawk}'s symbol
36410table immediately upon creating it.  Once you have done so,
36411you can then populate the array.
36412
36413@ignore
36414Strictly speaking, this is required only
36415for arrays that will have subarrays as elements; however it is
36416a good idea to always do this.  This restriction may be relaxed
36417in a subsequent revision of the API.
36418@end ignore
36419
36420Similarly, if installing a new array as a subarray of an existing array,
36421you must add the new array to its parent before adding any elements to it.
36422
36423Thus, the correct way to build an array is to work ``top down.''  Create
36424the array, and immediately install it in @command{gawk}'s symbol table
36425using @code{sym_update()}, or install it as an element in a previously
36426existing array using @code{set_array_element()}.  We show example code shortly.
36427
36428@item
Due to @command{gawk} internals, after using @code{sym_update()} to install an array
into @command{gawk}, you have to retrieve the array cookie from the value
passed in to @code{sym_update()} before doing anything else with it, like so:
36432
36433@example
36434awk_value_t val;
36435awk_array_t new_array;
36436
36437new_array = create_array();
36438val.val_type = AWK_ARRAY;
36439val.array_cookie = new_array;
36440
36441/* install array in the symbol table */
36442sym_update("array", & val);
36443
36444new_array = val.array_cookie;    /* YOU MUST DO THIS */
36445@end example
36446
If installing an array as a subarray, you must also retrieve the value
of the array cookie after the call to @code{set_array_element()}.
36449@end itemize
36450
36451The following C code is a simple test extension to create an array
36452with two regular elements and with a subarray. The leading @code{#include}
36453directives and boilerplate variable declarations
36454(@pxref{Extension API Boilerplate})
36455are omitted for brevity.
36456The first step is to create a new array and then install it
36457in the symbol table:
36458
36459@example
36460@ignore
36461#ifdef HAVE_CONFIG_H
36462#include <config.h>
36463#endif
36464
36465#include <stdio.h>
36466#include <assert.h>
36467#include <errno.h>
36468#include <stdlib.h>
36469#include <string.h>
36470#include <unistd.h>
36471
36472#include <sys/types.h>
36473#include <sys/stat.h>
36474
36475#include "gawkapi.h"
36476
36477static const gawk_api_t *api;   /* for convenience macros to work */
36478static awk_ext_id_t ext_id;
36479static const char *ext_version = "testarray extension: version 1.0";
36480
36481int plugin_is_GPL_compatible;
36482
36483@end ignore
36484/* create_new_array --- create a named array */
36485
36486static void
36487create_new_array()
36488@{
36489    awk_array_t a_cookie;
36490    awk_array_t subarray;
36491    awk_value_t index, value;
36492
36493    a_cookie = create_array();
36494    value.val_type = AWK_ARRAY;
36495    value.array_cookie = a_cookie;
36496
36497    if (! sym_update("new_array", & value))
36498        printf("create_new_array: sym_update(\"new_array\") failed!\n");
36499    a_cookie = value.array_cookie;
36500@end example
36501
36502@noindent
36503Note how @code{a_cookie} is reset from the @code{array_cookie} field in
36504the @code{value} structure.
36505
36506The second step is to install two regular values into @code{new_array}:
36507
36508@example
36509    (void) make_const_string("hello", 5, & index);
36510    (void) make_const_string("world", 5, & value);
36511    if (! set_array_element(a_cookie, & index, & value)) @{
        printf("create_new_array: set_array_element failed\n");
36513        return;
36514    @}
36515
36516    (void) make_const_string("answer", 6, & index);
36517    (void) make_number(42.0, & value);
36518    if (! set_array_element(a_cookie, & index, & value)) @{
        printf("create_new_array: set_array_element failed\n");
36520        return;
36521    @}
36522@end example
36523
36524The third step is to create the subarray and install it:
36525
36526@example
36527    (void) make_const_string("subarray", 8, & index);
36528    subarray = create_array();
36529    value.val_type = AWK_ARRAY;
36530    value.array_cookie = subarray;
36531    if (! set_array_element(a_cookie, & index, & value)) @{
        printf("create_new_array: set_array_element failed\n");
36533        return;
36534    @}
36535    subarray = value.array_cookie;
36536@end example
36537
36538The final step is to populate the subarray with its own element:
36539
36540@example
36541    (void) make_const_string("foo", 3, & index);
36542    (void) make_const_string("bar", 3, & value);
36543    if (! set_array_element(subarray, & index, & value)) @{
        printf("create_new_array: set_array_element failed\n");
36545        return;
36546    @}
36547@}
36548@ignore
36549static awk_ext_func_t func_table[] = @{
36550    @{ NULL, NULL, 0 @}
36551@};
36552
36553/* init_testarray --- additional initialization function */
36554
36555static awk_bool_t init_testarray(void)
36556@{
36557    create_new_array();
36558
36559    return awk_true;
36560@}
36561
36562static awk_bool_t (*init_func)(void) = init_testarray;
36563
36564dl_load_func(func_table, testarray, "")
36565@end ignore
36566@end example
36567
36568Here is a sample script that loads the extension
36569and then dumps the array:
36570
36571@example
36572@@load "subarray"
36573
36574function dumparray(name, array,     i)
36575@{
36576    for (i in array)
36577        if (isarray(array[i]))
36578            dumparray(name "[\"" i "\"]", array[i])
36579        else
36580            printf("%s[\"%s\"] = %s\n", name, i, array[i])
36581@}
36582
36583BEGIN @{
36584    dumparray("new_array", new_array);
36585@}
36586@end example
36587
36588Here is the result of running the script:
36589
36590@example
36591$ @kbd{AWKLIBPATH=$PWD gawk -f subarray.awk}
36592@print{} new_array["subarray"]["foo"] = bar
36593@print{} new_array["hello"] = world
36594@print{} new_array["answer"] = 42
36595@end example
36596
36597@noindent
36598(@xref{Finding Extensions} for more information on the
36599@env{AWKLIBPATH} environment variable.)
36600
36601@node Redirection API
36602@subsection Accessing and Manipulating Redirections
36603
36604The following function allows extensions to access and manipulate redirections.
36605
36606@table @code
36607@item awk_bool_t get_file(const char *name,
36608@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ size_t name_len,
36609@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ const char *filetype,
36610@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ int fd,
36611@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ const awk_input_buf_t **ibufp,
36612@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ const awk_output_buf_t **obufp);
36613Look up file @code{name} in @command{gawk}'s internal redirection table.
36614If @code{name} is @code{NULL} or @code{name_len} is zero, return
36615data for the currently open input file corresponding to @code{FILENAME}.
(This does not access the @code{filetype} argument, so that may be undefined.)
36617If the file is not already open, attempt to open it.
36618The @code{filetype} argument must be zero-terminated and should be one of:
36619
36620@table @code
36621@item ">"
36622A file opened for output.
36623
36624@item ">>"
36625A file opened for append.
36626
36627@item "<"
36628A file opened for input.
36629
36630@item "|>"
36631A pipe opened for output.
36632
36633@item "|<"
36634A pipe opened for input.
36635
36636@item "|&"
36637A two-way coprocess.
36638@end table
36639
36640On error, return @code{awk_false}.  Otherwise, return
36641@code{awk_true}, and return additional information about the redirection
36642in the @code{ibufp} and @code{obufp} pointers.
36643
36644For input redirections, the @code{*ibufp} value should be non-@code{NULL},
36645and @code{*obufp} should be @code{NULL}.  For output redirections,
36646the @code{*obufp} value should be non-@code{NULL}, and @code{*ibufp}
36647should be @code{NULL}.  For two-way coprocesses, both values should
36648be non-@code{NULL}.
36649
36650In the usual case, the extension is interested in @code{(*ibufp)->fd}
36651and/or @code{fileno((*obufp)->fp)}.  If the file is not already
36652open, and the @code{fd} argument is nonnegative, @command{gawk}
36653will use that file descriptor instead of opening the file in the
usual way.  If @code{fd} is nonnegative, but the file is already open,
@command{gawk} ignores @code{fd} and returns the existing file.  It is
36656the caller's responsibility to notice that neither the @code{fd} in
36657the returned @code{awk_input_buf_t} nor the @code{fd} in the returned
36658@code{awk_output_buf_t} matches the requested value.
36659
36660Note that supplying a file descriptor is currently @emph{not} supported
36661for pipes.  However, supplying a file descriptor should work for input,
36662output, append, and two-way (coprocess) sockets.  If @code{filetype}
36663is two-way, @command{gawk} assumes that it is a socket!  Note that in
36664the two-way case, the input and output file descriptors may differ.
36665To check for success, you must check whether either matches.
36666@end table
36667
36668It is anticipated that this API function will be used to implement I/O
36669multiplexing and a socket library.
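
The descriptor check described above can be sketched on its own.
The helper below is hypothetical and is not part of @file{gawkapi.h}.
It assumes that the caller has already extracted the descriptors from
the buffers returned by @code{get_file()}, for instance
@code{(*ibufp)->fd} and @code{fileno((*obufp)->fp)}, and passes
@code{-1} for a side whose buffer pointer came back @code{NULL}:

@example
#include <stdbool.h>

/* fd_was_adopted --- hypothetical helper; did gawk use our descriptor? */

static bool
fd_was_adopted(int requested_fd, int input_fd, int output_fd)
@{
    if (requested_fd < 0)
        return false;   /* no descriptor was supplied */

    /* for two-way redirections, a match on either side is enough */
    return input_fd == requested_fd || output_fd == requested_fd;
@}
@end example

@noindent
A @code{false} result after a successful @code{get_file()} call suggests
that the redirection was already open and the supplied descriptor was ignored.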
36670
36671@node Extension API Variables
36672@subsection API Variables
36673
The API provides two sets of variables.  The first provides information
about the version of the API (both the version with which the extension
was compiled and the version with which @command{gawk} was compiled).
The second provides information about how @command{gawk} was invoked.
36678
36679@menu
36680* Extension Versioning::          API Version information.
36681* Extension GMP/MPFR Versioning:: Version information about GMP and MPFR.
36682* Extension API Informational Variables:: Variables providing information about
36683                                  @command{gawk}'s invocation.
36684@end menu
36685
36686@node Extension Versioning
36687@subsubsection API Version Constants and Variables
36688@cindex API @subentry version
36689@cindex extension API @subentry version number
36690
36691The API provides both a ``major'' and a ``minor'' version number.
36692The API versions are available at compile time as C preprocessor defines
36693to support conditional compilation, and as enum constants to facilitate
36694debugging:
36695
36696@float Table,gawk-api-version
36697@caption{gawk API version constants}
36698@multitable {@b{API Version}} {@code{gawk_api_major_version}} {@code{GAWK_API_MAJOR_VERSION}}
36699@headitem API Version @tab C Preprocessor Define @tab enum constant
36700@item Major @tab @code{gawk_api_major_version} @tab @code{GAWK_API_MAJOR_VERSION}
36701@item Minor @tab @code{gawk_api_minor_version} @tab @code{GAWK_API_MINOR_VERSION}
36702@end multitable
36703@end float
36704
36705The minor version increases when new functions are added to the API. Such
36706new functions are always added to the end of the API @code{struct}.
36707
36708The major version increases (and the minor version is reset to zero) if any
36709of the data types change size or member order, or if any of the existing
36710functions change signature.
36711
An extension may be compiled against one version
of the API but loaded by a @command{gawk} that implements a different
version. For this reason, the major and minor API versions of the
36715running @command{gawk} are included in the API @code{struct} as read-only
36716constant integers:
36717
36718@table @code
36719@item api->major_version
36720The major version of the running @command{gawk}.
36721
36722@item api->minor_version
36723The minor version of the running @command{gawk}.
36724@end table
36725
36726It is up to the extension to decide if there are API incompatibilities.
36727Typically, a check like this is enough:
36728
36729@example
36730if (   api->major_version != GAWK_API_MAJOR_VERSION
36731    || api->minor_version < GAWK_API_MINOR_VERSION) @{
36732        fprintf(stderr, "foo_extension: version mismatch with gawk!\n");
36733        fprintf(stderr, "\tmy version (%d, %d), gawk version (%d, %d)\n",
36734                GAWK_API_MAJOR_VERSION, GAWK_API_MINOR_VERSION,
36735                api->major_version, api->minor_version);
36736        exit(1);
36737@}
36738@end example
36739
36740Such code is included in the boilerplate @code{dl_load_func()} macro
36741provided in @file{gawkapi.h} (discussed in
36742@ref{Extension API Boilerplate}).
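
The same rule can be written as a small predicate, which may be
convenient for exercising the logic outside of @command{gawk}.
This is only a sketch: the function name and its parameters are
illustrative and do not appear in @file{gawkapi.h}.  The first pair
of arguments stands for the versions the extension was compiled against,
the second pair for the versions reported by the running @command{gawk}:

@example
#include <stdbool.h>

/* api_versions_compatible --- hypothetical mirror of the check above */

static bool
api_versions_compatible(int ext_major, int ext_minor,
                        int gawk_major, int gawk_minor)
@{
    /* the major versions must match exactly @dots{} */
    if (gawk_major != ext_major)
        return false;

    /* @dots{} and gawk must offer at least the minor version we expect */
    return gawk_minor >= ext_minor;
@}
@end example

@noindent
For example, an extension compiled against version 3.1 is accepted by
a @command{gawk} providing 3.2, but one compiled against 3.2 is rejected
by a @command{gawk} providing only 3.1.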
36743
36744@node Extension GMP/MPFR Versioning
36745@subsubsection GMP and MPFR Version Information
36746
36747The API also includes information about the versions of GMP and MPFR
36748with which the running @command{gawk} was compiled (if any).
36749They are included in the API @code{struct} as read-only
36750constant integers:
36751
36752@table @code
36753@item api->gmp_major_version
36754The major version of the GMP library used to compile @command{gawk}.
36755
36756@item api->gmp_minor_version
36757The minor version of the GMP library used to compile @command{gawk}.
36758
36759@item api->mpfr_major_version
36760The major version of the MPFR library used to compile @command{gawk}.
36761
36762@item api->mpfr_minor_version
36763The minor version of the MPFR library used to compile @command{gawk}.
36764@end table
36765
36766These fields are set to zero if @command{gawk} was compiled without
36767MPFR support.
36768
36769You can check if the versions of MPFR and GMP that you are using match those
36770of @command{gawk} with the following macro:
36771
36772@table @code
36773@item check_mpfr_version(extension)
36774The @code{extension} is the extension id passed to all the other macros
36775and functions defined in @file{gawkapi.h}.  If you have not included
36776the @code{<mpfr.h>} header file, then this macro will be defined to do nothing.
36777
36778If you have included that file, then this macro compares the MPFR
36779and GMP major and minor versions against those of the library you are
36780compiling against.  If your libraries are newer than @command{gawk}'s, it
36781produces a fatal error message.
36782
36783The @code{dl_load_func()} macro (@pxref{Extension API Boilerplate})
36784calls @code{check_mpfr_version()}.
36785@end table
36786
36787@node Extension API Informational Variables
36788@subsubsection Informational Variables
36789@cindex API @subentry informational variables
36790@cindex extension API @subentry informational variables
36791
36792The API provides access to several variables that describe
36793whether the corresponding command-line options were enabled when
36794@command{gawk} was invoked.  The variables are:
36795
36796@table @code
@item do_debug
This variable is true if @command{gawk} was invoked with the @option{--debug} option.

@item do_lint
This variable is true if @command{gawk} was invoked with the @option{--lint} option.

@item do_mpfr
This variable is true if @command{gawk} was invoked with the @option{--bignum} option.

@item do_profile
This variable is true if @command{gawk} was invoked with the @option{--profile} option.

@item do_sandbox
This variable is true if @command{gawk} was invoked with the @option{--sandbox} option.

@item do_traditional
This variable is true if @command{gawk} was invoked with the @option{--traditional} option.
36814@end table
36815
36816The value of @code{do_lint} can change if @command{awk} code
36817modifies the @code{LINT} predefined variable (@pxref{Built-in Variables}).
36818The others should not change during execution.
36819
36820@node Extension API Boilerplate
36821@subsection Boilerplate Code
36822
36823As mentioned earlier (@pxref{Extension Mechanism Outline}), the function
36824definitions as presented are really macros. To use these macros, your
36825extension must provide a small amount of boilerplate code (variables and
36826functions) toward the top of your source file, using predefined names
36827as described here.  The boilerplate needed is also provided in comments
36828in the @file{gawkapi.h} header file:
36829
36830@example
36831@group
36832/* Boilerplate code: */
36833int plugin_is_GPL_compatible;
36834
static const gawk_api_t *api;
36836@end group
36837static awk_ext_id_t ext_id;
36838static const char *ext_version = NULL; /* or @dots{} = "some string" */
36839
36840static awk_ext_func_t func_table[] = @{
36841    @{ "name", do_name, 1, 0, awk_false, NULL @},
36842    /* @dots{} */
36843@};
36844
36845/* EITHER: */
36846
36847static awk_bool_t (*init_func)(void) = NULL;
36848
36849/* OR: */
36850
36851static awk_bool_t
36852init_my_extension(void)
36853@{
36854    @dots{}
36855@}
36856
36857static awk_bool_t (*init_func)(void) = init_my_extension;
36858
36859dl_load_func(func_table, some_name, "name_space_in_quotes")
36860@end example
36861
36862These variables and functions are as follows:
36863
36864@table @code
36865@item int plugin_is_GPL_compatible;
36866This asserts that the extension is compatible with
36867@ifclear FOR_PRINT
36868the GNU GPL (@pxref{Copying}).
36869@end ifclear
36870@ifset FOR_PRINT
36871the GNU GPL.
36872@end ifset
36873If your extension does not have this, @command{gawk}
36874will not load it (@pxref{Plugin License}).
36875
@item static const gawk_api_t *api;
36877This global @code{static} variable should be set to point to
36878the @code{gawk_api_t} pointer that @command{gawk} passes to your
36879@code{dl_load()} function.  This variable is used by all of the macros.
36880
36881@item static awk_ext_id_t ext_id;
36882This global static variable should be set to the @code{awk_ext_id_t}
36883value that @command{gawk} passes to your @code{dl_load()} function.
36884This variable is used by all of the macros.
36885
36886@item static const char *ext_version = NULL; /* or @dots{} = "some string" */
36887This global @code{static} variable should be set either
36888to @code{NULL}, or to point to a string giving the name and version of
36889your extension.
36890
36891@item static awk_ext_func_t func_table[] = @{ @dots{} @};
36892This is an array of one or more @code{awk_ext_func_t} structures,
36893as described earlier (@pxref{Extension Functions}).
36894It can then be looped over for multiple calls to
36895@code{add_ext_func()}.
36896
36897@c Use @var{OR} for docbook
36898@item static awk_bool_t (*init_func)(void) = NULL;
36899@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @var{OR}
36900@itemx static awk_bool_t init_my_extension(void) @{ @dots{} @}
36901@itemx static awk_bool_t (*init_func)(void) = init_my_extension;
36902If you need to do some initialization work, you should define a
36903function that does it (creates variables, opens files, etc.)
36904and then define the @code{init_func} pointer to point to your
36905function.
36906The function should return @code{awk_false} upon failure, or @code{awk_true}
36907if everything goes well.
36908
36909If you don't need to do any initialization, define the pointer and
36910initialize it to @code{NULL}.
36911
36912@item dl_load_func(func_table, some_name, "name_space_in_quotes")
36913This macro expands to a @code{dl_load()} function that performs
36914all the necessary initializations.
36915@end table
36916
36917The point of all the variables and arrays is to let the
36918@code{dl_load()} function (from the @code{dl_load_func()}
36919macro) do all the standard work. It does the following:
36920
36921@enumerate 1
36922@item
36923Check the API versions. If the extension major version does not match
36924@command{gawk}'s, or if the extension minor version is greater than
36925@command{gawk}'s, it prints a fatal error message and exits.
36926
36927@item
36928Check the MPFR and GMP versions. If there is a mismatch, it prints
36929a fatal error message and exits.
36930
36931@item
36932Load the functions defined in @code{func_table}.
36933If any of them fails to load, it prints a warning message but
36934continues on.
36935
36936@item
36937If the @code{init_func} pointer is not @code{NULL}, call the
36938function it points to. If it returns @code{awk_false}, print a
36939warning message.
36940
36941@item
36942If @code{ext_version} is not @code{NULL}, register
36943the version string with @command{gawk}.
36944@end enumerate
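
The ``warn but continue'' behavior of the function-loading and
initialization steps above can be modeled in a few lines of stand-alone C.
This is a simplified sketch and not the code that
@code{dl_load_func()} expands to: @code{struct sketch_func},
@code{register_sketch()}, and @code{load_sketch()} are hypothetical
names, the version checks and version-string registration are left out,
and returning an overall success flag is an assumption of the sketch:

@example
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

struct sketch_func @{
    const char *name;   /* awk-level name of the function */
@};

/* register_sketch --- stand-in for registration; rejects empty names */

static bool
register_sketch(const char *name)
@{
    return name != NULL && name[0] != '\0';
@}

/* load_sketch --- register every entry, then run the optional init hook */

static bool
load_sketch(const struct sketch_func *table, size_t count,
            bool (*init)(void), size_t *loaded)
@{
    size_t i;
    int errors = 0;

    *loaded = 0;
    for (i = 0; i < count; i++) @{
        if (register_sketch(table[i].name))
            (*loaded)++;
        else @{
            fprintf(stderr, "warning: could not add entry %lu\n",
                    (unsigned long) i);
            errors++;   /* warn, but keep going */
        @}
    @}

    if (init != NULL && ! init()) @{
        fprintf(stderr, "warning: initialization function failed\n");
        errors++;
    @}

    return errors == 0;
@}
@end example

@noindent
A table whose first entry is rejected still has its remaining entries
registered; only the final result reports that something went wrong.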
36945
36946
36947@node Changes from API V1
36948@subsection Changes From Version 1 of the API
36949
36950The current API is @emph{not} binary compatible with version 1 of the API.
36951You will have to recompile your extensions in order to use them with
36952the current version of @command{gawk}.
36953
36954Fortunately, at the possible expense of some compile-time warnings, the API remains
36955source-code--compatible with the previous API. The major differences are
36956the additional members in the @code{awk_ext_func_t} structure, and the
36957addition of the third argument to the C implementation function
36958(@pxref{Extension Functions}).
36959
36960Here is a list of individual features that changed from version 1 to
36961version 2 of the API:
36962
36963@itemize @bullet
36964
36965@item
36966Numeric values can now have MPFR/MPZ variants
36967(@pxref{General Data Types}).
36968
36969@item
36970There are new string types: @code{AWK_REGEX} and @code{AWK_STRNUM}
36971(@pxref{General Data Types}).
36972
36973@item
36974The @code{ezalloc()} macro is new
36975(@pxref{Memory Allocation Functions}).
36976
36977@item
36978The @code{awk_ext_func_t} structure changed. Instead of
36979@code{num_expected_args}, it now has @code{max_expected} and
36980@code{min_required}
36981(@pxref{Extension Functions}).
36982
36983@item
36984For @code{get_record()}, an input parser can now specify field widths
36985(@pxref{Input Parsers}).
36986
36987@item
36988Extensions can now produce nonfatal error messages
36989(@pxref{Printing Messages}).
36990
36991@item
36992When flattening an array, you can now specify the index and value types
36993(@pxref{Array Functions}).
36994
36995@item
36996The @code{get_file()} API is new
36997(@pxref{Redirection API}).
36998@end itemize
36999
37000@node Finding Extensions
37001@section How @command{gawk} Finds Extensions
37002@cindex extensions @subentry loadable @subentry search path
37003@cindex finding extensions
37004
37005Compiled extensions have to be installed in a directory where
37006@command{gawk} can find them.  If @command{gawk} is configured and
37007built in the default fashion, the directory in which to find
37008extensions is @file{/usr/local/lib/gawk}.  You can also specify a search
37009path with a list of directories to search for compiled extensions.
37010@xref{AWKLIBPATH Variable} for more information.
37011
37012@node Extension Example
37013@section Example: Some File Functions
37014@cindex extensions @subentry loadable @subentry example
37015
37016@quotation
37017@i{No matter where you go, there you are.}
37018@author Buckaroo Banzai
37019@end quotation
37020
37021@c It's enough to show chdir and stat, no need for fts
37022
37023Two useful functions that are not in @command{awk} are @code{chdir()} (so
37024that an @command{awk} program can change its directory) and @code{stat()}
37025(so that an @command{awk} program can gather information about a file).
37026In order to illustrate the API in action, this @value{SECTION} implements
37027these functions for @command{gawk} in an extension.
37028
37029@menu
37030* Internal File Description::   What the new functions will do.
37031* Internal File Ops::           The code for internal file operations.
37032* Using Internal File Ops::     How to use an external extension.
37033@end menu
37034
37035@node Internal File Description
37036@subsection Using @code{chdir()} and @code{stat()}
37037
37038This @value{SECTION} shows how to use the new functions at
37039the @command{awk} level once they've been integrated into the
37040running @command{gawk} interpreter.  Using @code{chdir()} is very
37041straightforward. It takes one argument, the new directory to change to:
37042
37043@example
37044@@load "filefuncs"
37045@dots{}
37046newdir = "/home/arnold/funstuff"
37047ret = chdir(newdir)
37048if (ret < 0) @{
37049    printf("could not change to %s: %s\n", newdir, ERRNO) > "/dev/stderr"
37050    exit 1
37051@}
37052@dots{}
37053@end example
37054
37055The return value is negative if the @code{chdir()} failed, and
37056@code{ERRNO} (@pxref{Built-in Variables}) is set to a string indicating
37057the error.
37058
37059Using @code{stat()} is a bit more complicated.  The C @code{stat()}
37060function fills in a structure that has a fair amount of information.
37061The right way to model this in @command{awk} is to fill in an associative
37062array with the appropriate information:
37063
37064@c broke printf for page breaking
37065@example
37066file = "/home/arnold/.profile"
37067ret = stat(file, fdata)
37068if (ret < 0) @{
37069    printf("could not stat %s: %s\n",
37070             file, ERRNO) > "/dev/stderr"
37071    exit 1
37072@}
37073printf("size of %s is %d bytes\n", file, fdata["size"])
37074@end example
37075
37076The @code{stat()} function always clears the data array, even if
37077the @code{stat()} fails.  It fills in the following elements:
37078
37079@table @code
37080@item "name"
37081The name of the file that was @code{stat()}ed.
37082
37083@item "dev"
37084@itemx "ino"
37085The file's device and inode numbers, respectively.
37086
37087@item "mode"
37088The file's mode, as a numeric value. This includes both the file's
37089type and its permissions.
37090
37091@item "nlink"
37092The number of hard links (directory entries) the file has.
37093
37094@item "uid"
37095@itemx "gid"
37096The numeric user and group ID numbers of the file's owner.
37097
37098@item "size"
37099The size in bytes of the file.
37100
37101@item "blocks"
37102The number of disk blocks the file actually occupies. This may not
37103be a function of the file's size if the file has holes.
37104
37105@item "atime"
37106@itemx "mtime"
37107@itemx "ctime"
37108The file's last access, modification, and inode update times,
37109respectively.  These are numeric timestamps, suitable for formatting
37110with @code{strftime()}
37111(@pxref{Time Functions}).
37112
37113@item "pmode"
37114The file's ``printable mode.''  This is a string representation of
37115the file's type and permissions, such as is produced by
37116@samp{ls -l}---for example, @code{"drwxr-xr-x"}.
37117
37118@item "type"
37119A printable string representation of the file's type.  The value
37120is one of the following:
37121
37122@table @code
37123@item "blockdev"
37124@itemx "chardev"
37125The file is a block or character device (``special file'').
37126
37127@ignore
37128@item "door"
37129The file is a Solaris ``door'' (special file used for
37130interprocess communications).
37131@end ignore
37132
37133@item "directory"
37134The file is a directory.
37135
37136@item "fifo"
37137The file is a named pipe (also known as a FIFO).
37138
37139@item "file"
37140The file is just a regular file.
37141
37142@item "socket"
37143The file is an @code{AF_UNIX} (``Unix domain'') socket in the
37144filesystem.
37145
37146@item "symlink"
37147The file is a symbolic link.
37148@end table
37149
37150@c 5/2013: Thanks to Corinna Vinschen for this information.
37151@item "devbsize"
37152The size of a block for the element indexed by @code{"blocks"}.
37153This information is derived from either the @code{DEV_BSIZE}
37154constant defined in @code{<sys/param.h>} on most systems,
37155or the @code{S_BLKSIZE} constant in @code{<sys/stat.h>} on BSD systems.
37156For some other systems, @dfn{a priori} knowledge is used to provide
37157a value. Where no value can be determined, it defaults to 512.
37158@end table
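
The selection order described for @code{"devbsize"} can be sketched
with the C preprocessor.  The function name below is illustrative, the
sketch assumes that @code{<sys/param.h>} is available, and it leaves out
the system-specific cases for which @dfn{a priori} knowledge is needed:

@example
#include <sys/types.h>
#include <sys/param.h>  /* DEV_BSIZE, on most systems */
#include <sys/stat.h>   /* S_BLKSIZE, on BSD systems */

/* device_blocksize_sketch --- choose the unit used for "blocks" */

static int
device_blocksize_sketch(void)
@{
#if defined(DEV_BSIZE)
    return DEV_BSIZE;
#elif defined(S_BLKSIZE)
    return S_BLKSIZE;
#else
    return 512;     /* the documented default */
#endif
@}
@end example

@noindent
Multiplying the @code{"blocks"} element by this value gives the number
of bytes actually allocated to the file.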
37159
37160Several additional elements may be present, depending upon the operating
37161system and the type of the file.  You can test for them in your @command{awk}
37162program by using the @code{in} operator
37163(@pxref{Reference to Elements}):
37164
37165@table @code
37166@item "blksize"
37167The preferred block size for I/O to the file. This field is not
37168present on all POSIX-like systems in the C @code{stat} structure.
37169
37170@item "linkval"
37171If the file is a symbolic link, this element is the name of the
37172file the link points to (i.e., the value of the link).
37173
37174@item "rdev"
37175@itemx "major"
37176@itemx "minor"
37177If the file is a block or character device file, then these values
37178represent the numeric device number and the major and minor components
37179of that number, respectively.
37180@end table
37181
37182@node Internal File Ops
37183@subsection C Code for @code{chdir()} and @code{stat()}
37184
37185Here is the C code for these extensions.@footnote{This version is
37186edited slightly for presentation.  See @file{extension/filefuncs.c}
37187in the @command{gawk} distribution for the complete version.}
37188
37189The file includes a number of standard header files, and then includes
37190the @file{gawkapi.h} header file, which provides the API definitions.
37191Those are followed by the necessary variable declarations
37192to make use of the API macros and boilerplate code
37193(@pxref{Extension API Boilerplate}):
37194
37195@example
37196#ifdef HAVE_CONFIG_H
37197#include <config.h>
37198#endif
37199
37200#include <stdio.h>
37201#include <assert.h>
37202#include <errno.h>
37203#include <stdlib.h>
37204#include <string.h>
37205#include <unistd.h>
37206
37207#include <sys/types.h>
37208#include <sys/stat.h>
37209
37210#include "gawkapi.h"
37211
37212#include "gettext.h"
37213#define _(msgid)  gettext(msgid)
37214#define N_(msgid) msgid
37215
37216#include "gawkfts.h"
37217#include "stack.h"
37218
37219static const gawk_api_t *api;    /* for convenience macros to work */
37220static awk_ext_id_t ext_id;
37221static awk_bool_t init_filefuncs(void);
37222static awk_bool_t (*init_func)(void) = init_filefuncs;
37223static const char *ext_version = "filefuncs extension: version 1.0";
37224
37225int plugin_is_GPL_compatible;
37226@end example
37227
37228@cindex programming conventions @subentry @command{gawk} extensions
By convention, for an @command{awk} function @code{foo()}, the C function
that implements it is called @code{do_foo()}.  The function should have
three arguments. The first is an @code{int}, usually called @code{nargs},
that represents the number of actual arguments for the function.
The second is a pointer to an @code{awk_value_t} structure, usually named
@code{result}. The third is a pointer to a @code{struct awk_ext_func},
which is not used here:
37235
37236@example
37237@group
37238/*  do_chdir --- provide dynamically loaded chdir() function for gawk */
37239
37240static awk_value_t *
37241do_chdir(int nargs, awk_value_t *result, struct awk_ext_func *unused)
37242@end group
37243@{
37244    awk_value_t newdir;
37245    int ret = -1;
37246
37247    assert(result != NULL);
37248@end example
37249
37250The @code{newdir}
37251variable represents the new directory to change to, which is retrieved
37252with @code{get_argument()}.  Note that the first argument is
37253numbered zero.
37254
If the argument is retrieved successfully, the function calls the
@code{chdir()} system call. If the @code{chdir()} fails,
it updates @code{ERRNO}:
37258
37259@example
37260    if (get_argument(0, AWK_STRING, & newdir)) @{
37261        ret = chdir(newdir.str_value.str);
37262        if (ret < 0)
37263            update_ERRNO_int(errno);
37264    @}
37265@end example
37266
37267Finally, the function returns the return value to the @command{awk} level:
37268
37269@example
37270    return make_number(ret, result);
37271@}
37272@end example
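
The error-handling pattern in @code{do_chdir()} can be tried outside
of @command{gawk} with a stand-alone function.  In this sketch,
@code{chdir_sketch()} is a hypothetical name, and copying the
@code{strerror()} text into a caller-supplied buffer merely stands in
for the call to @code{update_ERRNO_int()}:

@example
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* chdir_sketch --- change directory; on failure, describe errno */

static int
chdir_sketch(const char *dir, char *errbuf, size_t errlen)
@{
    int ret = chdir(dir);
    int saved_errno = errno;    /* capture before any other call */

    if (errlen > 0) @{
        errbuf[0] = '\0';
        if (ret < 0)
            snprintf(errbuf, errlen, "%s", strerror(saved_errno));
    @}

    return ret;
@}
@end example

@noindent
As with the extension, a negative return value signals failure, and the
message is filled in only in that case.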
37273
37274The @code{stat()} extension is more involved.  First comes a function
37275that turns a numeric mode into a printable representation
37276(e.g., octal @code{0644} becomes @samp{-rw-r--r--}). This is omitted here for brevity:
37277
37278@example
37279/* format_mode --- turn a stat mode field into something readable */
37280
37281static char *
37282format_mode(unsigned long fmode)
37283@{
37284    @dots{}
37285@}
37286@end example
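
To give an idea of what such a conversion involves, here is a simplified
sketch.  It is not the code from @file{extension/filefuncs.c}: the name
@code{format_mode_sketch()} is illustrative, the result is written into
a buffer supplied by the caller, and the setuid, setgid, and sticky bits
are ignored:

@example
#include <sys/types.h>
#include <sys/stat.h>

/* format_mode_sketch --- fill buf with an ls-style mode string */

static char *
format_mode_sketch(mode_t fmode, char buf[11])
@{
    static const char rwx[] = "rwxrwxrwx";
    int i;

    if (S_ISDIR(fmode))
        buf[0] = 'd';
    else if (S_ISCHR(fmode))
        buf[0] = 'c';
    else if (S_ISBLK(fmode))
        buf[0] = 'b';
    else if (S_ISFIFO(fmode))
        buf[0] = 'p';
#ifdef S_ISLNK
    else if (S_ISLNK(fmode))
        buf[0] = 'l';
#endif
#ifdef S_ISSOCK
    else if (S_ISSOCK(fmode))
        buf[0] = 's';
#endif
    else
        buf[0] = '-';

    /* walk from the owner's read bit (0400) down to others' execute */
    for (i = 0; i < 9; i++)
        buf[i + 1] = (fmode & (1 << (8 - i))) ? rwx[i] : '-';
    buf[10] = '\0';

    return buf;
@}
@end example

@noindent
With this sketch, a regular file with mode @code{0644} yields
@samp{-rw-r--r--}, and a directory with mode @code{0755} yields
@samp{drwxr-xr-x}.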
37287
37288Next comes a function for reading symbolic links, which is also
37289omitted here for brevity:
37290
37291@example
37292/* read_symlink --- read a symbolic link into an allocated buffer.
37293   @dots{} */
37294
37295static char *
37296read_symlink(const char *fname, size_t bufsize, ssize_t *linksize)
37297@{
37298    @dots{}
37299@}
37300@end example
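
One plausible way to write such a function is sketched here.  It keeps
the signature shown above but is otherwise an assumption, not the
distributed code: @code{read_symlink_sketch()} treats @code{bufsize} as
a hint for the length of the link's value (zero meaning ``unknown''),
uses plain @code{malloc()}, and doubles the buffer whenever
@code{readlink()} may have truncated the result:

@example
#include <errno.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* read_symlink_sketch --- return link value in malloc()ed storage */

static char *
read_symlink_sketch(const char *fname, size_t bufsize, ssize_t *linksize)
@{
    size_t size = (bufsize > 0) ? bufsize + 1 : 128;

    for (;;) @{
        char *buf = malloc(size);
        int saved_errno;

        if (buf == NULL)
            return NULL;

        *linksize = readlink(fname, buf, size);
        if (*linksize >= 0 && (size_t) *linksize < size) @{
            buf[*linksize] = '\0';  /* readlink() does not terminate */
            return buf;
        @}

        saved_errno = errno;
        free(buf);
        if (*linksize < 0) @{
            errno = saved_errno;    /* report readlink()'s error */
            return NULL;
        @}
        if (size > (size_t) -1 / 2) @{
            errno = ENAMETOOLONG;
            return NULL;
        @}
        size *= 2;      /* possibly truncated: retry with more room */
    @}
@}
@end example

@noindent
The caller is responsible for releasing the returned buffer with
@code{free()}.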
37301
37302Two helper functions simplify entering values in the
37303array that will contain the result of the @code{stat()}:
37304
37305@example
37306/* array_set --- set an array element */
37307
37308static void
37309array_set(awk_array_t array, const char *sub, awk_value_t *value)
37310@{
37311    awk_value_t index;
37312
37313    set_array_element(array,
37314                      make_const_string(sub, strlen(sub), & index),
37315                      value);
37316
37317@}
37318
37319/* array_set_numeric --- set an array element with a number */
37320
37321static void
37322array_set_numeric(awk_array_t array, const char *sub, double num)
37323@{
37324    awk_value_t tmp;
37325
37326    array_set(array, sub, make_number(num, & tmp));
37327@}
37328@end example
37329
37330The following function does most of the work to fill in
37331the @code{awk_array_t} result array with values obtained
37332from a valid @code{struct stat}. This work is done in a separate function
37333to support the @code{stat()} function for @command{gawk} and also
37334to support the @code{fts()} extension, which is included in
37335the same file but whose code is not shown here
37336(@pxref{Extension Sample File Functions}).
37337
37338The first part of the function is variable declarations,
37339including a table to map file types to strings:
37340
37341@example
37342/* fill_stat_array --- do the work to fill an array with stat info */
37343
37344static int
37345fill_stat_array(const char *name, awk_array_t array, struct stat *sbuf)
37346@{
37347    char *pmode;    /* printable mode */
37348    const char *type = "unknown";
37349    awk_value_t tmp;
37350    static struct ftype_map @{
37351        unsigned int mask;
37352        const char *type;
37353    @} ftype_map[] = @{
37354        @{ S_IFREG, "file" @},
37355        @{ S_IFBLK, "blockdev" @},
37356        @{ S_IFCHR, "chardev" @},
37357        @{ S_IFDIR, "directory" @},
37358#ifdef S_IFSOCK
37359        @{ S_IFSOCK, "socket" @},
37360#endif
37361#ifdef S_IFIFO
37362        @{ S_IFIFO, "fifo" @},
37363#endif
37364#ifdef S_IFLNK
37365        @{ S_IFLNK, "symlink" @},
37366#endif
37367#ifdef S_IFDOOR /* Solaris weirdness */
37368        @{ S_IFDOOR, "door" @},
37369#endif
37370    @};
37371    int j, k;
37372@end example
37373
37374The destination array is cleared, and then code fills in
37375various elements based on values in the @code{struct stat}:
37376
37377@example
37378    /* empty out the array */
37379    clear_array(array);
37380
37381    /* fill in the array */
37382    array_set(array, "name", make_const_string(name, strlen(name),
37383                                               & tmp));
37384    array_set_numeric(array, "dev", sbuf->st_dev);
37385    array_set_numeric(array, "ino", sbuf->st_ino);
37386    array_set_numeric(array, "mode", sbuf->st_mode);
37387    array_set_numeric(array, "nlink", sbuf->st_nlink);
37388    array_set_numeric(array, "uid", sbuf->st_uid);
37389    array_set_numeric(array, "gid", sbuf->st_gid);
37390    array_set_numeric(array, "size", sbuf->st_size);
37391    array_set_numeric(array, "blocks", sbuf->st_blocks);
37392    array_set_numeric(array, "atime", sbuf->st_atime);
37393    array_set_numeric(array, "mtime", sbuf->st_mtime);
37394    array_set_numeric(array, "ctime", sbuf->st_ctime);
37395
37396    /* for block and character devices, add rdev,
37397       major and minor numbers */
37398    if (S_ISBLK(sbuf->st_mode) || S_ISCHR(sbuf->st_mode)) @{
37399        array_set_numeric(array, "rdev", sbuf->st_rdev);
37400        array_set_numeric(array, "major", major(sbuf->st_rdev));
37401        array_set_numeric(array, "minor", minor(sbuf->st_rdev));
37402    @}
37403@end example
37404
37405@noindent
37406The latter part of the function makes selective additions
37407to the destination array, depending upon the availability of
37408certain members and/or the type of the file. It then returns zero,
37409for success:
37410
37411@example
37412@group
37413#ifdef HAVE_STRUCT_STAT_ST_BLKSIZE
37414    array_set_numeric(array, "blksize", sbuf->st_blksize);
37415#endif
37416@end group
37417
37418    pmode = format_mode(sbuf->st_mode);
37419    array_set(array, "pmode", make_const_string(pmode, strlen(pmode),
37420                                                & tmp));
37421
37422    /* for symbolic links, add a linkval field */
37423    if (S_ISLNK(sbuf->st_mode)) @{
37424        char *buf;
37425        ssize_t linksize;
37426
37427        if ((buf = read_symlink(name, sbuf->st_size,
37428                    & linksize)) != NULL)
37429            array_set(array, "linkval",
37430                      make_malloced_string(buf, linksize, & tmp));
37431        else
37432            warning(ext_id, _("stat: unable to read symbolic link `%s'"),
37433                    name);
37434    @}
37435
37436    /* add a type field */
37437    type = "unknown";   /* shouldn't happen */
37438    for (j = 0, k = sizeof(ftype_map)/sizeof(ftype_map[0]); j < k; j++) @{
37439        if ((sbuf->st_mode & S_IFMT) == ftype_map[j].mask) @{
37440            type = ftype_map[j].type;
37441            break;
37442        @}
37443    @}
37444
37445    array_set(array, "type", make_const_string(type, strlen(type), & tmp));
37446
37447    return 0;
37448@}
37449@end example
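
The type lookup at the end of @code{fill_stat_array()} can be tried
outside of @command{gawk}.  The following is a minimal sketch, not code
from @file{filefuncs.c}: the name @code{file_type_name()} is hypothetical,
and the table is a shortened copy of @code{ftype_map}:

@verbatim
```c
#define _XOPEN_SOURCE 700   /* for the S_IF* constants in strict modes */
#include <stddef.h>
#include <sys/types.h>
#include <sys/stat.h>

/* file_type_name --- map the type bits of a mode to a string.
   Hypothetical helper; it mirrors the ftype_map loop shown earlier. */

static const char *
file_type_name(mode_t mode)
{
    static const struct {
        mode_t mask;
        const char *type;
    } map[] = {
        { S_IFREG, "file" },
        { S_IFBLK, "blockdev" },
        { S_IFCHR, "chardev" },
        { S_IFDIR, "directory" },
#ifdef S_IFIFO
        { S_IFIFO, "fifo" },
#endif
#ifdef S_IFLNK
        { S_IFLNK, "symlink" },
#endif
    };
    size_t i;

    /* S_IFMT selects the file-type bits; permission bits are ignored */
    for (i = 0; i < sizeof(map) / sizeof(map[0]); i++)
        if ((mode & S_IFMT) == map[i].mask)
            return map[i].type;

    return "unknown";
}
```
@end verbatim

@noindent
Because the mode is masked with @code{S_IFMT} before the comparison,
the permission bits do not influence the result.  A mode with no
recognized type bits falls through to @code{"unknown"}.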
37450
The third argument to @code{stat()} was not discussed previously. This
argument is optional. By default, @code{do_stat()} gets the file
information with the @code{lstat()} system call, so that a symbolic link
is described itself, rather than the file it points to.  This is done by
using a function pointer, @code{statfunc}, which is initialized to point
to @code{lstat()}.  If the third argument is present, @code{statfunc}
is set to point to @code{stat()} instead, and symbolic links are followed.
37459
37460Here is the @code{do_stat()} function, which starts with
37461variable declarations and argument checking:
37462
37463@example
37464/* do_stat --- provide a stat() function for gawk */
37465
37466static awk_value_t *
37467do_stat(int nargs, awk_value_t *result, struct awk_ext_func *unused)
37468@{
37469    awk_value_t file_param, array_param;
37470    char *name;
37471    awk_array_t array;
37472    int ret;
37473    struct stat sbuf;
37474    /* default is lstat() */
37475    int (*statfunc)(const char *path, struct stat *sbuf) = lstat;
37476
37477    assert(result != NULL);
37478@end example
37479
37480Then comes the actual work. First, the function gets the arguments.
37481Next, it gets the information for the file.  If the called function
37482(@code{lstat()} or @code{stat()}) returns an error, the code sets
37483@code{ERRNO} and returns:
37484
37485@example
37486    /* file is first arg, array to hold results is second */
37487    if (   ! get_argument(0, AWK_STRING, & file_param)
37488        || ! get_argument(1, AWK_ARRAY, & array_param)) @{
37489        warning(ext_id, _("stat: bad parameters"));
37490        return make_number(-1, result);
37491    @}
37492
37493    if (nargs == 3) @{
37494        statfunc = stat;
37495    @}
37496
37497    name = file_param.str_value.str;
37498    array = array_param.array_cookie;
37499
37500    /* always empty out the array */
37501    clear_array(array);
37502
37503    /* stat the file; if error, set ERRNO and return */
37504    ret = statfunc(name, & sbuf);
37505@group
37506    if (ret < 0) @{
37507        update_ERRNO_int(errno);
37508        return make_number(ret, result);
37509    @}
37510@end group
37511@end example
37512
37513The tedious work is done by @code{fill_stat_array()}, shown
37514earlier.  When done, the function returns the result from @code{fill_stat_array()}:
37515
37516@example
37517@group
37518    ret = fill_stat_array(name, array, & sbuf);
37519
37520    return make_number(ret, result);
37521@}
37522@end group
37523@end example
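
The function-pointer technique used by @code{do_stat()} does not depend
on the extension API.  Here is a minimal standalone sketch of it; the
name @code{stat_path()} and its @code{follow} parameter are hypothetical
and are not part of @file{filefuncs.c}:

@verbatim
```c
#define _XOPEN_SOURCE 700   /* for lstat() in strict modes */
#include <errno.h>
#include <sys/types.h>
#include <sys/stat.h>

/* stat_path --- stat a file, following symbolic links only on request.
   Returns 0 on success, or -1 with errno set on failure. */

static int
stat_path(const char *path, int follow, struct stat *sbuf)
{
    /* default is lstat(), which describes a symbolic link itself */
    int (*statfunc)(const char *path, struct stat *sbuf) = lstat;

    if (follow)
        statfunc = stat;    /* describe the file the link points to */

    return statfunc(path, sbuf);
}
```
@end verbatim

@noindent
As in @code{do_stat()}, the caller sees a negative return value upon
failure and can consult @code{errno} for the reason.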
37524
37525Finally, it's necessary to provide the ``glue'' that loads the
37526new function(s) into @command{gawk}.
37527
37528The @code{filefuncs} extension also provides an @code{fts()}
37529function, which we omit here
37530(@pxref{Extension Sample File Functions}).
37531For its sake, there is an initialization
37532function:
37533
37534@example
37535/* init_filefuncs --- initialization routine */
37536
37537static awk_bool_t
37538init_filefuncs(void)
37539@{
37540    @dots{}
37541@}
37542@end example
37543
37544We are almost done. We need an array of @code{awk_ext_func_t}
37545structures for loading each function into @command{gawk}:
37546
37547@example
37548static awk_ext_func_t func_table[] = @{
37549    @{ "chdir", do_chdir, 1, 1, awk_false, NULL @},
37550    @{ "stat",  do_stat, 3, 2, awk_false, NULL @},
37551    @dots{}
37552@};
37553@end example
37554
37555Each extension must have a routine named @code{dl_load()} to load
37556everything that needs to be loaded.  It is simplest to use the
@code{dl_load_func()} macro in @file{gawkapi.h}:
37558
37559@example
37560/* define the dl_load() function using the boilerplate macro */
37561
37562dl_load_func(func_table, filefuncs, "")
37563@end example
37564
37565And that's it!
37566
37567@node Using Internal File Ops
37568@subsection Integrating the Extensions
37569
37570@cindex @command{gawk} @subentry interpreter, adding code to
37571Now that the code is written, it must be possible to add it at
37572runtime to the running @command{gawk} interpreter.  First, the
37573code must be compiled.  Assuming that the functions are in
37574a file named @file{filefuncs.c}, and @var{idir} is the location
37575of the @file{gawkapi.h} header file,
37576the following steps@footnote{In practice, you would probably want to
37577use the GNU Autotools (Automake, Autoconf, Libtool, and @command{gettext}) to
37578configure and build your libraries. Instructions for doing so are beyond
37579the scope of this @value{DOCUMENT}. @xref{gawkextlib} for Internet links to
37580the tools.} create a GNU/Linux shared library:
37581
37582@example
37583$ @kbd{gcc -fPIC -shared -DHAVE_CONFIG_H -c -O -g -I@var{idir} filefuncs.c}
37584$ @kbd{gcc -o filefuncs.so -shared filefuncs.o}
37585@end example
37586
37587Once the library exists, it is loaded by using the @code{@@load} keyword:
37588
37589@example
37590# file testff.awk
37591@@load "filefuncs"
37592
37593BEGIN @{
37594    "pwd" | getline curdir  # save current directory
37595    close("pwd")
37596
37597    chdir("/tmp")
37598    system("pwd")   # test it
37599    chdir(curdir)   # go back
37600
37601    print "Info for testff.awk"
37602    ret = stat("testff.awk", data)
37603    print "ret =", ret
37604    for (i in data)
37605        printf "data[\"%s\"] = %s\n", i, data[i]
37606    print "testff.awk modified:",
37607        strftime("%m %d %Y %H:%M:%S", data["mtime"])
37608
37609    print "\nInfo for JUNK"
37610    ret = stat("JUNK", data)
37611    print "ret =", ret
37612    for (i in data)
37613        printf "data[\"%s\"] = %s\n", i, data[i]
37614    print "JUNK modified:", strftime("%m %d %Y %H:%M:%S", data["mtime"])
37615@}
37616@end example
37617
37618The @env{AWKLIBPATH} environment variable tells
37619@command{gawk} where to find extensions (@pxref{Finding Extensions}).
37620We set it to the current directory and run the program:
37621
37622@example
37623$ @kbd{AWKLIBPATH=$PWD gawk -f testff.awk}
37624@print{} /tmp
37625@print{} Info for testff.awk
37626@print{} ret = 0
37627@print{} data["blksize"] = 4096
37628@print{} data["devbsize"] = 512
37629@print{} data["mtime"] = 1412004710
37630@print{} data["mode"] = 33204
37631@print{} data["type"] = file
37632@print{} data["dev"] = 2053
37633@print{} data["gid"] = 1000
37634@print{} data["ino"] = 10358899
37635@print{} data["ctime"] = 1412004710
37636@print{} data["blocks"] = 8
37637@print{} data["nlink"] = 1
37638@print{} data["name"] = testff.awk
37639@print{} data["atime"] = 1412004716
37640@print{} data["pmode"] = -rw-rw-r--
37641@print{} data["size"] = 666
37642@print{} data["uid"] = 1000
37643@print{} testff.awk modified: 09 29 2014 18:31:50
37644@print{}
37645@print{} Info for JUNK
37646@print{} ret = -1
37647@print{} JUNK modified: 01 01 1970 02:00:00
37648@end example
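
In the output for @file{testff.awk}, the @code{"mode"} and @code{"pmode"}
elements are two views of the same value: 33204 is octal 100664, which
on typical GNU/Linux systems means a regular file with permissions 664.
The following sketch shows one way to produce such a string.  It is a
simplified, hypothetical stand-in for @code{format_mode()}, whose real
code is not shown in this @value{DOCUMENT}; it ignores the setuid,
setgid, and sticky bits:

@verbatim
```c
#define _XOPEN_SOURCE 700   /* for the S_IS* macros in strict modes */
#include <sys/types.h>
#include <sys/stat.h>

/* mode_string --- render a mode the way ls does, for example
   "-rw-rw-r--".  buf must have room for 11 characters. */

static void
mode_string(mode_t mode, char buf[11])
{
    static const char rwx[] = "rwxrwxrwx";
    int i;

    if (S_ISDIR(mode))
        buf[0] = 'd';
    else if (S_ISLNK(mode))
        buf[0] = 'l';
    else if (S_ISCHR(mode))
        buf[0] = 'c';
    else if (S_ISBLK(mode))
        buf[0] = 'b';
    else
        buf[0] = '-';

    /* bit 8 is owner-read; bit 0 is other-execute */
    for (i = 0; i < 9; i++)
        buf[i + 1] = (mode & (1u << (8 - i))) ? rwx[i] : '-';
    buf[10] = '\0';
}
```
@end verbatim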
37649
37650@node Extension Samples
37651@section The Sample Extensions in the @command{gawk} Distribution
37652@cindex extensions @subentry loadable @subentry distributed with @command{gawk}
37653
37654This @value{SECTION} provides a brief overview of the sample extensions
37655that come in the @command{gawk} distribution. Some of them are intended
37656for production use (e.g., the @code{filefuncs}, @code{readdir}, and
37657@code{inplace} extensions).  Others mainly provide example code that
37658shows how to use the extension API.
37659
37660@menu
37661* Extension Sample File Functions::   The file functions sample.
37662* Extension Sample Fnmatch::          An interface to @code{fnmatch()}.
37663* Extension Sample Fork::             An interface to @code{fork()} and other
37664                                      process functions.
37665* Extension Sample Inplace::          Enabling in-place file editing.
37666* Extension Sample Ord::              Character to value to character
37667                                      conversions.
37668* Extension Sample Readdir::          An interface to @code{readdir()}.
37669* Extension Sample Revout::           Reversing output sample output wrapper.
37670* Extension Sample Rev2way::          Reversing data sample two-way processor.
37671* Extension Sample Read write array:: Serializing an array to a file.
37672* Extension Sample Readfile::         Reading an entire file into a string.
37673* Extension Sample Time::             An interface to @code{gettimeofday()}
37674                                      and @code{sleep()}.
37675* Extension Sample API Tests::        Tests for the API.
37676@end menu
37677
37678@node Extension Sample File Functions
37679@subsection File-Related Functions
37680
37681The @code{filefuncs} extension provides three different functions, as follows.
37682The usage is:
37683
37684@table @asis
37685@item @code{@@load "filefuncs"}
37686This is how you load the extension.
37687
37688@cindex @code{chdir()} extension function
37689@item @code{result = chdir("/some/directory")}
37690The @code{chdir()} function is a direct hook to the @code{chdir()}
37691system call to change the current directory.  It returns zero
37692upon success or a value less than zero upon error.
37693In the latter case, it updates @code{ERRNO}.
37694
37695@cindex @code{stat()} extension function
37696@item @code{result = stat("/some/path", statdata} [@code{, follow}]@code{)}
37697The @code{stat()} function provides a hook into the
37698@code{stat()} system call.
37699It returns zero upon success or a value less than zero upon error.
37700In the latter case, it updates @code{ERRNO}.
37701
37702By default, it uses the @code{lstat()} system call.  However, if passed
37703a third argument, it uses @code{stat()} instead.
37704
37705In all cases, it clears the @code{statdata} array.
37706When the call is successful, @code{stat()} fills the @code{statdata}
37707array with information retrieved from the filesystem, as follows:
37708
37709@multitable @columnfractions .15 .50 .20
37710@headitem Subscript @tab Field in @code{struct stat} @tab File type
37711@item @code{"name"} @tab The @value{FN} @tab All
37712@item @code{"dev"} @tab @code{st_dev} @tab All
37713@item @code{"ino"} @tab @code{st_ino} @tab All
37714@item @code{"mode"} @tab @code{st_mode} @tab All
37715@item @code{"nlink"} @tab @code{st_nlink} @tab All
37716@item @code{"uid"} @tab @code{st_uid} @tab All
37717@item @code{"gid"} @tab @code{st_gid} @tab All
37718@item @code{"size"} @tab @code{st_size} @tab All
@item @code{"blocks"} @tab @code{st_blocks} @tab All
37719@item @code{"atime"} @tab @code{st_atime} @tab All
37720@item @code{"mtime"} @tab @code{st_mtime} @tab All
37721@item @code{"ctime"} @tab @code{st_ctime} @tab All
37722@item @code{"rdev"} @tab @code{st_rdev} @tab Device files
@item @code{"major"} @tab Major device number, from @code{major(st_rdev)} @tab Device files
@item @code{"minor"} @tab Minor device number, from @code{minor(st_rdev)} @tab Device files
37725@item @code{"blksize"} @tab @code{st_blksize} @tab All
37726@item @code{"pmode"} @tab A human-readable version of the mode value, like that printed by
37727@command{ls} (for example, @code{"-rwxr-xr-x"}) @tab All
37728@item @code{"linkval"} @tab The value of the symbolic link @tab Symbolic links
37729@item @code{"type"} @tab The type of the file as a string---one of
37730@code{"file"},
37731@code{"blockdev"},
37732@code{"chardev"},
37733@code{"directory"},
37734@code{"socket"},
37735@code{"fifo"},
37736@code{"symlink"},
37737@code{"door"},
37738or
37739@code{"unknown"}
37740(not all systems support all file types) @tab All
37741@end multitable
37742
37743@cindex @code{fts()} extension function
37744@item @code{flags = or(FTS_PHYSICAL, ...)}
37745@itemx @code{result = fts(pathlist, flags, filedata)}
37746Walk the file trees provided in @code{pathlist} and fill in the
37747@code{filedata} array, as described next.  @code{flags} is the bitwise
37748OR of several predefined values, also described in a moment.
37749Return zero if there were no errors, otherwise return @minus{}1.
37750@end table
37751
37752The @code{fts()} function provides a hook to the C library @code{fts()}
37753routines for traversing file hierarchies.  Instead of returning data
37754about one file at a time in a stream, it fills in a multidimensional
37755array with data about each file and directory encountered in the requested
37756hierarchies.
37757
37758The arguments are as follows:
37759
37760@table @code
37761@item pathlist
37762An array of @value{FN}s.  The element values are used; the index values are ignored.
37763
37764@item flags
37765This should be the bitwise OR of one or more of the following
37766predefined constant flag values.  At least one of
37767@code{FTS_LOGICAL} or @code{FTS_PHYSICAL} must be provided; otherwise
37768@code{fts()} returns an error value and sets @code{ERRNO}.
37769The flags are:
37770
37771@c nested table
37772@table @code
37773@item FTS_LOGICAL
37774Do a ``logical'' file traversal, where the information returned for
37775a symbolic link refers to the linked-to file, and not to the symbolic
37776link itself.  This flag is mutually exclusive with @code{FTS_PHYSICAL}.
37777
37778@item FTS_PHYSICAL
37779Do a ``physical'' file traversal, where the information returned for a
37780symbolic link refers to the symbolic link itself.  This flag is mutually
37781exclusive with @code{FTS_LOGICAL}.
37782
37783@item FTS_NOCHDIR
37784As a performance optimization, the C library @code{fts()} routines
37785change directory as they traverse a file hierarchy.  This flag disables
37786that optimization.
37787
37788@item FTS_COMFOLLOW
37789Immediately follow a symbolic link named in @code{pathlist},
37790whether or not @code{FTS_LOGICAL} is set.
37791
37792@item FTS_SEEDOT
37793By default, the C library @code{fts()} routines do not return entries for
37794@file{.} (dot) and @file{..} (dot-dot).  This option causes entries for
37795dot-dot to also be included.  (The extension always includes an entry
37796for dot; more on this in a moment.)
37797
37798@item FTS_XDEV
37799During a traversal, do not cross onto a different mounted filesystem.
37800@end table
37801
37802@item filedata
37803The @code{filedata} array holds the results.
37804@code{fts()} first clears it.  Then it creates
37805an element in @code{filedata} for every element in @code{pathlist}.
37806The index is the name of the directory or file given in @code{pathlist}.
37807The element for this index is itself an array.  There are two cases:
37808
37809@c nested table
37810@table @emph
37811@item The path is a file
37812In this case, the array contains two or three elements:
37813
37814@c doubly nested table
37815@table @code
37816@item "path"
37817The full path to this file, starting from the ``root'' that was given
37818in the @code{pathlist} array.
37819
37820@item "stat"
37821This element is itself an array, containing the same information as provided
37822by the @code{stat()} function described earlier for its
37823@code{statdata} argument.  The element may not be present if
37824the @code{stat()} system call for the file failed.
37825
37826@item "error"
37827If some kind of error was encountered, the array will also
37828contain an element named @code{"error"}, which is a string describing the error.
37829@end table
37830
37831@item The path is a directory
37832In this case, the array contains one element for each entry in the
37833directory.  If an entry is a file, that element is the same as for files, just
37834described.  If the entry is a directory, that element is (recursively)
37835an array describing the subdirectory.  If @code{FTS_SEEDOT} was provided
37836in the flags, then there will also be an element named @code{".."}.  This
37837element will be an array containing the data as provided by @code{stat()}.
37838
37839In addition, there will be an element whose index is @code{"."}.
37840This element is an array containing the same two or three elements as
37841for a file: @code{"path"}, @code{"stat"}, and @code{"error"}.
37842@end table
37843@end table
37844
37845The @code{fts()} function returns zero if there were no errors.
37846Otherwise, it returns @minus{}1.
37847
37848@quotation NOTE
37849The @code{fts()} extension does not exactly mimic the
37850interface of the C library @code{fts()} routines, choosing instead to
37851provide an interface that is based on associative arrays, which is
37852more comfortable to use from an @command{awk} program.  This includes the
37853lack of a comparison function, because @command{gawk} already provides
37854powerful array sorting facilities.  Although an @code{fts_read()}-like
37855interface could have been provided, this felt less natural than simply
37856creating a multidimensional array to represent the file hierarchy and
37857its information.
37858@end quotation
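
For comparison, here is a minimal sketch of the streaming C interface
that the extension wraps.  It assumes a system whose C library provides
@file{fts.h} (the GNU C Library and the BSDs do).  The function name
@code{count_tree_entries()} is hypothetical:

@verbatim
```c
#define _DEFAULT_SOURCE     /* fts.h is a BSD extension */
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fts.h>

/* count_tree_entries --- count the files and directories under root.
   Returns -1 if the walk cannot be started or if any entry
   cannot be examined. */

static long
count_tree_entries(const char *root)
{
    char path[4096];
    char *paths[2];
    FTS *tree;
    FTSENT *ent;
    long count = 0;
    int failed = 0;

    if (strlen(root) >= sizeof(path))
        return -1;
    strcpy(path, root);     /* fts_open() takes non-const strings */
    paths[0] = path;
    paths[1] = NULL;

    /* FTS_PHYSICAL: report symbolic links, not their targets */
    tree = fts_open(paths, FTS_PHYSICAL | FTS_NOCHDIR, NULL);
    if (tree == NULL)
        return -1;

    /* one entry at a time, instead of one array for the whole tree */
    while ((ent = fts_read(tree)) != NULL) {
        switch (ent->fts_info) {
        case FTS_DP:        /* directory, second (post-order) visit */
            break;
        case FTS_DNR:       /* directory that cannot be read */
        case FTS_ERR:       /* some other error */
        case FTS_NS:        /* no stat information available */
            failed = 1;
            break;
        default:
            count++;
            break;
        }
    }
    fts_close(tree);

    return failed ? -1 : count;
}
```
@end verbatim

@noindent
The C caller has to classify each entry as it arrives.  With the
extension, the same information is available afterwards by walking
the @code{filedata} array.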
37859
37860See @file{test/fts.awk} in the @command{gawk} distribution for an example
37861use of the @code{fts()} extension function.
37862
37863@node Extension Sample Fnmatch
37864@subsection Interface to @code{fnmatch()}
37865
37866This extension provides an interface to the C library
37867@code{fnmatch()} function.  The usage is:
37868
37869@table @code
37870@item @@load "fnmatch"
37871This is how you load the extension.
37872
37873@cindex @code{fnmatch()} extension function
37874@item result = fnmatch(pattern, string, flags)
37875The return value is zero on success, @code{FNM_NOMATCH}
37876if the string did not match the pattern, or
37877a different nonzero value if an error occurred.
37878@end table
37879
In addition to the @code{fnmatch()} function, the @code{fnmatch} extension
adds one constant (@code{FNM_NOMATCH}) and an array of flag values
named @code{FNM}.
37883
37884The arguments to @code{fnmatch()} are:
37885
37886@table @code
37887@item pattern
The @value{FN} wildcard to match.

@item string
The @value{FN} string to match against the pattern.

@item flags
Either zero, or the bitwise OR of one or more of the
flags in the @code{FNM} array.
37896@end table
37897
37898The flags are as follows:
37899
37900@multitable @columnfractions .25 .75
37901@headitem Array element @tab Corresponding flag defined by @code{fnmatch()}
37902@item @code{FNM["CASEFOLD"]} @tab @code{FNM_CASEFOLD}
37903@item @code{FNM["FILE_NAME"]} @tab @code{FNM_FILE_NAME}
37904@item @code{FNM["LEADING_DIR"]} @tab @code{FNM_LEADING_DIR}
37905@item @code{FNM["NOESCAPE"]} @tab @code{FNM_NOESCAPE}
37906@item @code{FNM["PATHNAME"]} @tab @code{FNM_PATHNAME}
37907@item @code{FNM["PERIOD"]} @tab @code{FNM_PERIOD}
37908@end multitable
37909
37910Here is an example:
37911
37912@example
37913@@load "fnmatch"
37914@dots{}
37915flags = or(FNM["PERIOD"], FNM["NOESCAPE"])
37916if (fnmatch("*.a", "foo.c", flags) == FNM_NOMATCH)
37917    print "no match"
37918@end example
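
The three possible outcomes described earlier (match, no match, error)
come directly from the C library function.  The following sketch calls
@code{fnmatch()} from C; the wrapper name @code{name_matches()} is
hypothetical and is not part of the extension:

@verbatim
```c
#include <fnmatch.h>

/* name_matches --- return 1 if string matches the wildcard pattern,
   0 if it does not, and -1 if fnmatch() reports an error. */

static int
name_matches(const char *pattern, const char *string, int flags)
{
    int ret = fnmatch(pattern, string, flags);

    if (ret == 0)
        return 1;
    if (ret == FNM_NOMATCH)
        return 0;
    return -1;      /* any other nonzero value is an error */
}
```
@end verbatim

@noindent
With @code{FNM_PERIOD}, a leading period in the string must be matched
explicitly, so the pattern @samp{*} does not match @file{.hidden}.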
37919
37920@node Extension Sample Fork
37921@subsection Interface to @code{fork()}, @code{wait()}, and @code{waitpid()}
37922
37923The @code{fork} extension adds three functions, as follows:
37924
37925@table @code
37926@item @@load "fork"
37927This is how you load the extension.
37928
37929@cindex @code{fork()} extension function
37930@item pid = fork()
37931This function creates a new process. The return value is zero in the
37932child and the process ID number of the child in the parent, or @minus{}1
37933upon error. In the latter case, @code{ERRNO} indicates the problem.
37934In the child, @code{PROCINFO["pid"]} and @code{PROCINFO["ppid"]} are
37935updated to reflect the correct values.
37936
37937@cindex @code{waitpid()} extension function
37938@item ret = waitpid(pid)
37939This function takes a numeric argument, which is the process ID to
37940wait for. The return value is that of the
37941@code{waitpid()} system call.
37942
37943@cindex @code{wait()} extension function
37944@item ret = wait()
37945This function waits for the first child to die.
37946The return value is that of the
37947@code{wait()} system call.
37948@end table
37949
37950There is no corresponding @code{exec()} function.
37951
37952Here is an example:
37953
37954@example
37955@@load "fork"
37956@dots{}
37957if ((pid = fork()) == 0)
37958    print "hello from the child"
37959else
37960    print "hello from the parent"
37961@end example
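
The return value conventions of @code{fork()} and @code{waitpid()} are
those of the underlying system calls.  The following C sketch shows the
same parent and child split, with the parent collecting the child's exit
status.  The function name @code{child_exit_status()} is hypothetical:

@verbatim
```c
#define _XOPEN_SOURCE 700
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* child_exit_status --- fork a child that exits with the given code,
   wait for it, and return the code as seen by the parent.
   Returns -1 on error. */

static int
child_exit_status(int code)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;          /* fork() failed; errno says why */

    if (pid == 0)
        _exit(code);        /* child: fork() returned zero */

    /* parent: fork() returned the child's process ID */
    if (waitpid(pid, &status, 0) != pid)
        return -1;

    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```
@end verbatim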
37962
37963@node Extension Sample Inplace
37964@subsection Enabling In-Place File Editing
37965
37966@cindex @code{inplace} extension
37967The @code{inplace} extension emulates GNU @command{sed}'s @option{-i} option,
37968which performs ``in-place'' editing of each input file.
37969It uses the bundled @file{inplace.awk} include file to invoke the extension
37970properly.  This extension makes use of the namespace facility to place
37971all the variables and functions in the @code{inplace} namespace
37972(@pxref{Namespaces}):
37973
37974@example
37975@c file eg/lib/inplace.awk
37976@group
37977# inplace --- load and invoke the inplace extension.
37978@c endfile
37979@ignore
37980@c file eg/lib/inplace.awk
37981#
37982# Copyright (C) 2013, 2017, 2019 the Free Software Foundation, Inc.
37983#
37984# This file is part of GAWK, the GNU implementation of the
37985# AWK Programming Language.
37986#
37987# GAWK is free software; you can redistribute it and/or modify
37988# it under the terms of the GNU General Public License as published by
37989# the Free Software Foundation; either version 3 of the License, or
37990# (at your option) any later version.
37991#
37992# GAWK is distributed in the hope that it will be useful,
37993# but WITHOUT ANY WARRANTY; without even the implied warranty of
37994# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
37995# GNU General Public License for more details.
37996#
37997# You should have received a copy of the GNU General Public License
37998# along with this program; if not, write to the Free Software
37999# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02110-1301, USA
38000#
38001# Andrew J. Schorr, aschorr@@telemetry-investments.com
38002# January 2013
38003#
38004# Revised for namespaces
38005# Arnold Robbins, arnold@@skeeve.com
38006# July 2017
38007# June 2019, add backwards compatibility
38008@c endfile
38009@end ignore
38010@c file eg/lib/inplace.awk
38011
38012@@load "inplace"
38013
38014# Please set inplace::suffix to make a backup copy.  For example, you may
38015# want to set inplace::suffix to .bak on the command line or in a BEGIN rule.
38016
38017# Before there were namespaces in gawk, this extension used
38018# INPLACE_SUFFIX as the variable for making backup copies. We allow this
38019# too, so that any code that used the previous version continues to work.
38020
38021# By default, each filename on the command line will be edited inplace.
38022# But you can selectively disable this by adding an inplace::enable=0 argument
38023# prior to files that you do not want to process this way.  You can then
38024# reenable it later on the commandline by putting inplace::enable=1 before files
38025# that you wish to be subject to inplace editing.
38026
38027# N.B. We call inplace::end() in the BEGINFILE and END rules so that any
38028# actions in an ENDFILE rule will be redirected as expected.
38029
38030@@namespace "inplace"
38031@end group
38032
38033@group
38034BEGIN @{
38035    enable = 1         # enabled by default
38036@}
38037@end group
38038
38039@group
38040BEGINFILE @{
38041    sfx = (suffix ? suffix : awk::INPLACE_SUFFIX)
38042    if (filename != "")
38043        end(filename, sfx)
38044    if (enable)
38045        begin(filename = FILENAME, sfx)
38046    else
38047        filename = ""
38048@}
38049@end group
38050
38051@group
38052END @{
38053    if (filename != "")
38054        end(filename, (suffix ? suffix : awk::INPLACE_SUFFIX))
38055@}
38056@end group
38057@c endfile
38058@end example
38059
38060For each regular file that is processed, the extension redirects
38061standard output to a temporary file configured to have the same owner
38062and permissions as the original.  After the file has been processed,
38063the extension restores standard output to its original destination.
38064If @code{inplace::suffix} is not an empty string, the original file is
38065linked to a backup @value{FN} created by appending that suffix.  Finally,
38066the temporary file is renamed to the original @value{FN}.
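
The sequence just described (temporary file, optional backup link, then
rename) can be sketched in C as follows.  This is only an illustration
under simplifying assumptions and is not the extension's code: the name
@code{replace_file()} is hypothetical, the new contents are passed as a
string instead of being captured from standard output, and no attempt
is made to change the owner of the new file:

@verbatim
```c
#define _XOPEN_SOURCE 700
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>

/* replace_file --- replace the contents of path with text.
   If suffix is a nonempty string, the old contents remain
   available as path followed by suffix.  Returns 0 or -1. */

static int
replace_file(const char *path, const char *suffix, const char *text)
{
    char tmp[4096], backup[4096];
    struct stat sb;
    size_t len = strlen(text);
    int fd;

    if (stat(path, &sb) < 0)
        return -1;
    if (snprintf(tmp, sizeof(tmp), "%s.XXXXXX", path) >= (int) sizeof(tmp))
        return -1;

    /* new contents go to a temporary file in the same directory */
    if ((fd = mkstemp(tmp)) < 0)
        return -1;
    if (fchmod(fd, sb.st_mode & 07777) < 0
        || write(fd, text, len) != (ssize_t) len) {
        close(fd);
        unlink(tmp);
        return -1;
    }
    if (close(fd) < 0) {
        unlink(tmp);
        return -1;
    }

    /* optional backup: a second link to the original file */
    if (suffix != NULL && suffix[0] != '\0') {
        if (snprintf(backup, sizeof(backup), "%s%s", path, suffix)
            >= (int) sizeof(backup)) {
            unlink(tmp);
            return -1;
        }
        unlink(backup);         /* link() fails if the name exists */
        if (link(path, backup) < 0) {
            unlink(tmp);
            return -1;
        }
    }

    /* the original is untouched until this rename succeeds */
    if (rename(tmp, path) < 0) {
        unlink(tmp);
        return -1;
    }
    return 0;
}
```
@end verbatim

@noindent
Doing the rename last is what keeps the original file intact if an
earlier step fails.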
38067
Note that the use of this feature can be controlled by placing
@samp{inplace::enable=0} on the command line prior to listing files that
should not be processed this way.  You can reenable in-place editing by adding
an @samp{inplace::enable=1} argument prior to files that should be subject
to in-place editing.
38073
38074The @code{inplace::filename} variable serves to keep track of the
38075current @value{FN} so as to not invoke @code{inplace::end()} before
38076processing the first file.
38077
38078If any error occurs, the extension issues a fatal error to terminate
38079processing immediately without damaging the original file.
38080
38081Here are some simple examples:
38082
38083@example
38084$ @kbd{gawk -i inplace '@{ gsub(/foo/, "bar") @}; @{ print @}' file1 file2 file3}
38085@end example
38086
38087To keep a backup copy of the original files, try this:
38088
38089@example
38090$ @kbd{gawk -i inplace -v inplace::suffix=.bak '@{ gsub(/foo/, "bar") @}}
38091> @kbd{@{ print @}' file1 file2 file3}
38092@end example
38093
Please note that, while the extension does attempt to preserve ownership
and permissions, it makes no attempt to copy the ACLs from the original file.

If the program dies prematurely, as might happen if an unhandled signal is
received, a temporary file may be left behind.
38097
38098@node Extension Sample Ord
@subsection Character and Numeric Values: @code{ord()} and @code{chr()}
38100
38101The @code{ordchr} extension adds two functions, named
38102@code{ord()} and @code{chr()}, as follows:
38103
38104@table @code
38105@item @@load "ordchr"
38106This is how you load the extension.
38107
38108@cindex @code{ord()} extension function
38109@item number = ord(string)
38110Return the numeric value of the first character in @code{string}.
38111
38112@cindex @code{chr()} extension function
38113@item char = chr(number)
38114Return a string whose first character is that represented by @code{number}.
38115@end table
38116
38117These functions are inspired by the Pascal language functions
38118of the same name.  Here is an example:
38119
38120@example
38121@@load "ordchr"
38122@dots{}
38123printf("The numeric value of 'A' is %d\n", ord("A"))
38124printf("The string value of 65 is %s\n", chr(65))
38125@end example
38126
38127@node Extension Sample Readdir
38128@subsection Reading Directories
38129
38130The @code{readdir} extension adds an input parser for directories.
38131The usage is as follows:
38132
38133@cindex @code{readdir} extension
38134@example
38135@@load "readdir"
38136@end example
38137
38138When this extension is in use, instead of skipping directories named
38139on the command line (or with @code{getline}),
38140they are read, with each entry returned as a record.
38141
The record consists of three fields, separated by forward slash characters.
The first two are the inode number and the @value{FN}.
The third field is a single letter indicating the type of the file, on
systems where the directory entry contains the file type. The letters and
their corresponding file types are shown in @ref{table-readdir-file-types}.
38148
38149@float Table,table-readdir-file-types
38150@caption{File types returned by the @code{readdir} extension}
38151@multitable @columnfractions .1 .9
38152@headitem Letter @tab File type
38153@item @code{b} @tab Block device
38154@item @code{c} @tab Character device
38155@item @code{d} @tab Directory
38156@item @code{f} @tab Regular file
38157@item @code{l} @tab Symbolic link
38158@item @code{p} @tab Named pipe (FIFO)
38159@item @code{s} @tab Socket
38160@item @code{u} @tab Anything else (unknown)
38161@end multitable
38162@end float
38163
38164On systems without the file type information, the third field is always
38165@samp{u}.
38166
38167@quotation NOTE
38168On GNU/Linux systems, there are filesystems that don't support the
38169@code{d_type} entry (see the @i{readdir}(3) manual page), and so the file
38170type is always @samp{u}.  You can use the @code{filefuncs} extension to call
38171@code{stat()} in order to get correct type information.
38172@end quotation
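
The record format can be illustrated with a short C sketch built on
@code{opendir()} and @code{readdir()}.  It assumes a system that
provides the nonstandard @code{d_type} member and the @code{DT_}
constants, as GNU/Linux and the BSDs do.  The function names are
hypothetical and the code is not taken from the extension:

@verbatim
```c
#define _DEFAULT_SOURCE     /* for d_type and the DT_ constants */
#include <dirent.h>
#include <stdio.h>

/* type_letter --- map a d_type value to a one-letter file type */

static char
type_letter(unsigned char d_type)
{
    switch (d_type) {
    case DT_BLK:  return 'b';
    case DT_CHR:  return 'c';
    case DT_DIR:  return 'd';
    case DT_REG:  return 'f';
    case DT_LNK:  return 'l';
    case DT_FIFO: return 'p';
    case DT_SOCK: return 's';
    default:      return 'u';   /* DT_UNKNOWN and anything else */
    }
}

/* print_dir_records --- print one inode/name/type record per entry.
   Returns the number of records, or -1 if dir cannot be opened. */

static int
print_dir_records(const char *dir)
{
    DIR *dp = opendir(dir);
    struct dirent *ent;
    int count = 0;

    if (dp == NULL)
        return -1;

    while ((ent = readdir(dp)) != NULL) {
        printf("%llu/%s/%c\n", (unsigned long long) ent->d_ino,
               ent->d_name, type_letter(ent->d_type));
        count++;
    }
    closedir(dp);

    return count;
}
```
@end verbatim

@noindent
On a filesystem that does not fill in @code{d_type}, every record
printed by this sketch ends in @samp{u}, which is the situation that
the preceding note describes.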
38173
38174By default, if a directory cannot be opened (due to permission problems,
38175for example), @command{gawk} will exit.  As with regular files, this
38176situation can be handled using a @code{BEGINFILE} rule that checks
38177@code{ERRNO} and prints an error or otherwise handles the problem.
38178
38179Here is an example:
38180
38181@example
38182@@load "readdir"
38183@dots{}
38184BEGIN @{ FS = "/" @}
38185@{ print "@value{FN} is", $2 @}
38186@end example
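
A slightly fuller sketch, combining the points above, might skip any
directory that cannot be opened and count the entries by the type letter
from @ref{table-readdir-file-types}.  The array name @code{count} is chosen
only for illustration:

@example
@@load "readdir"

BEGIN @{ FS = "/" @}

# Skip unreadable directories instead of exiting
BEGINFILE @{
    if (ERRNO != "") @{
        printf "skipping %s: %s\n", FILENAME, ERRNO > "/dev/stderr"
        nextfile
    @}
@}

# Fields are inode/name/type
@{ count[NF >= 3 ? $3 : "u"]++ @}

END @{
    for (type in count)
        printf "%s %d\n", type, count[type]
@}
@end example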
38187
38188@node Extension Sample Revout
38189@subsection Reversing Output
38190
38191The @code{revoutput} extension adds a simple output wrapper that reverses
38192the characters in each output line.  Its main purpose is to show how to
38193write an output wrapper, although it may be mildly amusing for the unwary.
38194Here is an example:
38195
38196@cindex @code{revoutput} extension
38197@example
38198@@load "revoutput"
38199
38200BEGIN @{
38201    REVOUT = 1
38202    print "don't panic" > "/dev/stdout"
38203@}
38204@end example
38205
38206The output from this program is @samp{cinap t'nod}.
38207
38208@node Extension Sample Rev2way
38209@subsection Two-Way I/O Example
38210
38211The @code{revtwoway} extension adds a simple two-way processor that
38212reverses the characters in each line sent to it for reading back by
38213the @command{awk} program.  Its main purpose is to show how to write
38214a two-way processor, although it may also be mildly amusing.
38215The following example shows how to use it:
38216
38217@cindex @code{revtwoway} extension
38218@example
38219@@load "revtwoway"
38220
38221BEGIN @{
38222    cmd = "/magic/mirror"
38223    print "don't panic" |& cmd
38224    cmd |& getline result
38225    print result
38226    close(cmd)
38227@}
38228@end example
38229
38230The output from this program
38231@ifnotinfo
38232also is:
38233@end ifnotinfo
38234@ifinfo
38235is:
38236@end ifinfo
38237@samp{cinap t'nod}.
38238
38239@node Extension Sample Read write array
38240@subsection Dumping and Restoring an Array
38241
38242The @code{rwarray} extension adds two functions,
38243named @code{writea()} and @code{reada()}, as follows:
38244
38245@table @code
38246@item @@load "rwarray"
38247This is how you load the extension.
38248
38249@cindex @code{writea()} extension function
38250@item ret = writea(file, array)
38251This function takes a string argument, which is the name of the file
38252to which to dump the array, and the array itself as the second argument.
38253@code{writea()} understands arrays of arrays.  It returns one on
38254success, or zero upon failure.
38255
38256@cindex @code{reada()} extension function
38257@item ret = reada(file, array)
38258@code{reada()} is the inverse of @code{writea()};
38259it reads the file named as its first argument, filling in
38260the array named as the second argument. It clears the array first.
38261Here too, the return value is one on success, or zero upon failure.
38262@end table
38263
38264The array created by @code{reada()} is identical to that written by
38265@code{writea()} in the sense that the contents are the same. However,
38266due to implementation issues, the array traversal order of the re-created
38267array is likely to be different from that of the original array.  As array
38268traversal order in @command{awk} is by default undefined, this is (technically)
38269not a problem.  If you need to guarantee a particular traversal
38270order, use the array sorting features in @command{gawk} to do so
38271(@pxref{Array Sorting}).
38272
38273The file contains binary data.  All integral values are written in network
38274byte order.  However, double-precision floating-point values are written
38275as native binary data.  Thus, arrays containing only string data can
38276theoretically be dumped on systems with one byte order and restored on
38277systems with a different one, but this has not been tried.
38278
38279Here is an example:
38280
38281@example
38282@@load "rwarray"
38283@dots{}
38284ret = writea("arraydump.bin", array)
38285@dots{}
38286ret = reada("arraydump.bin", array)
38287@end example
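
The following sketch shows one way to check the return values and to impose
a traversal order on the restored array.  The array contents are made up
for the example, and the choice of @code{"@@ind_str_asc"} is just one
possible ordering:

@example
@@load "rwarray"

BEGIN @{
    data["alpha"] = 1
    data["beta"]["x"] = "nested"

    if (! writea("arraydump.bin", data)) @{
        print "writea() failed" > "/dev/stderr"
        exit 1
    @}

    delete data     # reada() would clear it anyway
    if (! reada("arraydump.bin", data)) @{
        print "reada() failed" > "/dev/stderr"
        exit 1
    @}

    # Impose a traversal order; it is otherwise undefined
    PROCINFO["sorted_in"] = "@@ind_str_asc"
    for (i in data)
        print i, (isarray(data[i]) ? "(subarray)" : data[i])
@}
@end example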
38288
38289@node Extension Sample Readfile
38290@subsection Reading an Entire File
38291
38292The @code{readfile} extension adds a single function
38293named @code{readfile()}, and an input parser:
38294
38295@table @code
38296@item @@load "readfile"
38297This is how you load the extension.
38298
38299@cindex @code{readfile()} extension function
38300@item result = readfile("/some/path")
38301The argument is the name of the file to read.  The return value is a
38302string containing the entire contents of the requested file.  Upon error,
38303the function returns the empty string and sets @code{ERRNO}.
38304
38305@item BEGIN @{ PROCINFO["readfile"] = 1 @}
38306In addition, the extension adds an input parser that is activated if
38307@code{PROCINFO["readfile"]} exists.
38308When activated, each input file is returned in its entirety as @code{$0}.
38309@code{RT} is set to the null string.
38310@end table
38311
38312Here is an example:
38313
38314@example
38315@@load "readfile"
38316@dots{}
38317contents = readfile("/path/to/file");
38318if (contents == "" && ERRNO != "") @{
38319    print("problem reading file", ERRNO) > "/dev/stderr"
    @dots{}
38321@}
38322@end example
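
The input parser can be used in a similar way.  As a small sketch, the
following program reports the length of each file named on the command
line; because each file arrives as a single record, the main rule runs
once per file:

@example
@@load "readfile"

BEGIN @{ PROCINFO["readfile"] = 1 @}

# $0 holds the entire contents of the current file
@{ printf "%s: %d characters\n", FILENAME, length($0) @}
@end example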
38323
38324@node Extension Sample Time
38325@subsection Extension Time Functions
38326
38327@quotation CAUTION
As of @command{gawk} @value{PVERSION} 5.1.0, this extension is considered to be obsolete.
38329It is replaced by the @code{timex} extension in @code{gawkextlib}
38330(@pxref{gawkextlib}).
38331
38332For @value{PVERSION} 5.1, no warning will be issued if this extension is used.
For the next major release, a warning will be issued. In the release after that,
this extension will be removed from the distribution.
38335@end quotation
38336
38337The @code{time} extension adds two functions, named @code{gettimeofday()}
38338and @code{sleep()}, as follows:
38339
38340@table @code
38341@item @@load "time"
38342This is how you load the extension.
38343
38344@cindex @code{gettimeofday()} extension function
38345@item the_time = gettimeofday()
38346Return the time in seconds that has elapsed since 1970-01-01 UTC as a
38347floating-point value.  If the time is unavailable on this platform, return
38348@minus{}1 and set @code{ERRNO}.  The returned time should have sub-second
38349precision, but the actual precision may vary based on the platform.
If the standard C @code{gettimeofday()} system call is available on this
platform, then the function simply returns its value.  Otherwise, if on MS-Windows,
it tries to use @code{GetSystemTimeAsFileTime()}.
38353
38354@cindex @code{sleep()} extension function
38355@item result = sleep(@var{seconds})
38356Attempt to sleep for @var{seconds} seconds.  If @var{seconds} is negative,
38357or the attempt to sleep fails, return @minus{}1 and set @code{ERRNO}.
38358Otherwise, return zero after sleeping for the indicated amount of time.
38359Note that @var{seconds} may be a floating-point (nonintegral) value.
38360Implementation details: depending on platform availability, this function
38361tries to use @code{nanosleep()} or @code{select()} to implement the delay.
38362@end table
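
As a brief sketch, the two functions can be combined to time a delay.
The quarter-second interval is arbitrary, and the measured value will vary
with the platform's timer precision:

@example
@@load "time"

BEGIN @{
    start = gettimeofday()
    if (sleep(0.25) < 0)
        print "sleep failed:", ERRNO > "/dev/stderr"
    printf "elapsed: %.3f seconds\n", gettimeofday() - start
@}
@end example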
38363
38364@node Extension Sample API Tests
38365@subsection API Tests
38366@cindex @code{testext} extension
38367
38368The @code{testext} extension exercises parts of the extension API that
38369are not tested by the other samples.  The @file{extension/testext.c}
38370file contains both the C code for the extension and @command{awk}
38371test code inside C comments that run the tests. The testing framework
38372extracts the @command{awk} code and runs the tests.  See the source file
38373for more information.
38374
38375@node gawkextlib
38376@section The @code{gawkextlib} Project
38377@cindex extensions @subentry loadable @subentry @code{gawkextlib} project
38378
38379@cindex @code{gawkextlib} project
38380The @uref{https://sourceforge.net/projects/gawkextlib/, @code{gawkextlib}}
38381project provides a number of @command{gawk} extensions, including one for
38382processing XML files.  This is the evolution of the original @command{xgawk}
38383(XML @command{gawk}) project.
38384
Some of the more interesting extensions are:
38386
38387@itemize @value{BULLET}
38388@item
38389@code{abort} extension. It allows you to exit immediately from your
38390@command{awk} program without running the @code{END} rules.
38391
38392@item
38393@code{json} extension.
38394This serializes a multidimensional array into a JSON string, and
38395can deserialize a JSON string into a @command{gawk} array.
38396This extension is interesting since it is written in C++ instead of C.
38397
38398@item
38399MPFR library extension.
38400This provides access to a number of MPFR functions that @command{gawk}'s
38401native MPFR support does not.
38402
38403@item
38404Select extension. It provides functionality based on the
38405@code{select()} system call.
38406
38407@item
XML parser extension, using the @uref{https://expat.sourceforge.net, Expat}
XML parsing library.
38410@end itemize
38411
38412@cindex @command{git} utility
38413You can check out the code for the @code{gawkextlib} project
38414using the @uref{https://git-scm.com, Git} distributed source
38415code control system.  The command is as follows:
38416
38417@example
38418git clone git://git.code.sf.net/p/gawkextlib/code gawkextlib-code
38419@end example
38420
38421@cindex RapidJson JSON parser library
38422You will need to have the @uref{http://www.rapidjson.org, RapidJson}
38423JSON parser library installed in order to build and use the @code{json} extension.
38424
38425@cindex Expat XML parser library
38426You will need to have the @uref{https://expat.sourceforge.net, Expat}
38427XML parser library installed in order to build and use the XML extension.
38428
38429In addition, you must have the GNU Autotools installed
38430(@uref{https://www.gnu.org/software/autoconf, Autoconf},
38431@uref{https://www.gnu.org/software/automake, Automake},
38432@uref{https://www.gnu.org/software/libtool, Libtool},
38433and
38434@uref{https://www.gnu.org/software/gettext, GNU @command{gettext}}).
38435
38436The simple recipe for building and testing @code{gawkextlib} is as follows.
38437First, build and install @command{gawk}:
38438
38439@example
38440cd .../path/to/gawk/code
38441./configure --prefix=/tmp/newgawk     @ii{Install in /tmp/newgawk for now}
38442make && make check                    @ii{Build and check that all is OK}
38443make install                          @ii{Install gawk}
38444@end example
38445
38446Next, go to @url{https://sourceforge.net/projects/gawkextlib/files} to
38447download @code{gawkextlib} and any extensions that you would like to build.
38448The @file{README} file at that site explains how to build the code.  If you
38449installed @command{gawk} in a non-standard location, you will need to
38450specify @samp{./configure --with-gawk=@var{/path/to/gawk}} to find it.
38451You may need to use the @command{sudo} utility
38452to install both @command{gawk} and @code{gawkextlib}, depending upon
38453how your system works.
38454
38455If you write an extension that you wish to share with other
38456@command{gawk} users, consider doing so through the
38457@code{gawkextlib} project.
38458See the project's website for more information.
38459
38460@node Extension summary
38461@section Summary
38462
38463@itemize @value{BULLET}
38464@item
38465You can write extensions (sometimes called plug-ins) for @command{gawk}
38466in C or C++ using the application programming interface (API) defined
38467by the @command{gawk} developers.
38468
38469@item
38470Extensions must have a license compatible with the GNU General Public
38471License (GPL), and they must assert that fact by declaring a variable
38472named @code{plugin_is_GPL_compatible}.
38473
38474@item
38475Communication between @command{gawk} and an extension is two-way.
38476@command{gawk} passes a @code{struct} to the extension that contains
38477various data fields and function pointers.  The extension can then call
38478into @command{gawk} via the supplied function pointers to accomplish
38479certain tasks.
38480
38481@item
38482One of these tasks is to ``register'' the name and implementation of
38483new @command{awk}-level functions with @command{gawk}.  The implementation
38484takes the form of a C function pointer with a defined signature.
38485By convention, implementation functions are named @code{do_@var{XXXX}()}
38486for some @command{awk}-level function @code{@var{XXXX}()}.
38487
38488@item
38489The API is defined in a header file named @file{gawkapi.h}. You must include
38490a number of standard header files @emph{before} including it in your source file.
38491
38492@item
38493API function pointers are provided for the following kinds of operations:
38494
38495@itemize @value{BULLET}
38496@item
38497Allocating, reallocating, and releasing memory
38498
38499@item
38500Registration functions (you may register
38501extension functions,
38502exit callbacks,
38503a version string,
38504input parsers,
38505output wrappers,
38506and two-way processors)
38507
38508@item
38509Printing fatal, nonfatal, warning, and ``lint'' warning messages
38510
38511@item
38512Updating @code{ERRNO}, or unsetting it
38513
38514@item
38515Accessing parameters, including converting an undefined parameter into
38516an array
38517
38518@item
38519Symbol table access (retrieving a global variable, creating one,
38520or changing one)
38521
38522@item
38523Creating and releasing cached values; this provides an
38524efficient way to use values for multiple variables and
38525can be a big performance win
38526
38527@item
38528Manipulating arrays
38529(retrieving, adding, deleting, and modifying elements;
38530getting the count of elements in an array;
38531creating a new array;
38532clearing an array;
38533and
38534flattening an array for easy C-style looping over all its indices and elements)
38535@end itemize
38536
38537@item
38538The API defines a number of standard data types for representing
38539@command{awk} values, array elements, and arrays.
38540
38541@item
38542The API provides convenience functions for constructing values.
38543It also provides memory management functions to ensure compatibility
38544between memory allocated by @command{gawk} and memory allocated by an
38545extension.
38546
38547@item
38548@emph{All} memory passed from @command{gawk} to an extension must be
38549treated as read-only by the extension.
38550
38551@item
38552@emph{All} memory passed from an extension to @command{gawk} must come from
38553the API's memory allocation functions. @command{gawk} takes responsibility for
38554the memory and releases it when appropriate.
38555
38556@item
38557The API provides information about the running version of @command{gawk} so
38558that an extension can make sure it is compatible with the @command{gawk}
38559that loaded it.
38560
38561@item
38562It is easiest to start a new extension by copying the boilerplate code
38563described in this @value{CHAPTER}.  Macros in the @file{gawkapi.h} header
38564file make this easier to do.
38565
38566@item
38567The @command{gawk} distribution includes a number of small but useful
38568sample extensions. The @code{gawkextlib} project includes several more
38569(larger) extensions.  If you wish to write an extension and contribute it
38570to the community of @command{gawk} users, the @code{gawkextlib} project
38571is the place to do so.
38572
38573@end itemize
38574
38575@c EXCLUDE START
38576@node Extension Exercises
38577@section Exercises
38578
38579@enumerate
38580@item
38581Add functions to implement system calls such as @code{chown()},
38582@code{chmod()}, and @code{umask()} to the file operations extension
38583presented in @ref{Internal File Ops}.
38584
38585@c Idea from comp.lang.awk, February 2015
38586@item
Write an input parser that prints a prompt if the input is
from a ``terminal'' device.  You can use the @code{isatty()}
38589function to tell if the input file is a terminal. (Hint: this function
38590is usually expensive to call; try to call it just once.)
38591The content of the prompt should come from a variable settable
38592by @command{awk}-level code.
38593You can write the prompt to standard error. However,
38594for best results, open a new file descriptor (or file pointer)
38595on @file{/dev/tty} and print the prompt there, in case standard
38596error has been redirected.
38597
38598Why is standard error a better
38599choice than standard output for writing the prompt?
38600Which reading mechanism should you replace, the one to get
38601a record, or the one to read raw bytes?
38602
38603@item
38604Write a wrapper script that provides an interface similar to
38605@samp{sed -i} for the ``inplace'' extension presented in
38606@ref{Extension Sample Inplace}.
38607
38608@end enumerate
38609@c EXCLUDE END
38610
38611@ifnotinfo
38612@part @value{PART4}Appendices
38613@end ifnotinfo
38614
38615@ifdocbook
38616
38617@ifclear FOR_PRINT
38618Part IV contains the appendices (including the two licenses that cover
38619the @command{gawk} source code and this @value{DOCUMENT}, respectively)
38620and the Glossary:
38621@end ifclear
38622
38623@ifset FOR_PRINT
38624Part IV contains three appendices, the last of which is the license that
38625covers the @command{gawk} source code:
38626@end ifset
38627
38628@itemize @value{BULLET}
38629@item
38630@ref{Language History}
38631
38632@item
38633@ref{Installation}
38634
38635@ifclear FOR_PRINT
38636@item
38637@ref{Notes}
38638
38639@item
38640@ref{Basic Concepts}
38641
38642@item
38643@ref{Glossary}
38644@end ifclear
38645
38646@item
38647@ref{Copying}
38648
38649@ifclear FOR_PRINT
38650@item
38651@ref{GNU Free Documentation License}
38652@end ifclear
38653@end itemize
38654@end ifdocbook
38655
38656@node Language History
38657@appendix The Evolution of the @command{awk} Language
38658
38659This @value{DOCUMENT} describes the GNU implementation of @command{awk},
38660which follows the POSIX specification.  Many longtime @command{awk}
38661users learned @command{awk} programming with the original @command{awk}
38662implementation in Version 7 Unix.  (This implementation was the basis for
38663@command{awk} in Berkeley Unix, through 4.3-Reno.  Subsequent versions
38664of Berkeley Unix, and, for a while, some systems derived from 4.4BSD-Lite, used various
38665versions of @command{gawk} for their @command{awk}.)  This @value{CHAPTER}
38666briefly describes the evolution of the @command{awk} language, with
38667cross-references to other parts of the @value{DOCUMENT} where you can
38668find more information.
38669
38670@ifset FOR_PRINT
38671To save space, we have omitted
38672information on the history of features in @command{gawk} from this
38673edition. You can find it in the
38674@uref{https://www.gnu.org/software/gawk/manual/html_node/Feature-History.html,
38675online documentation}.
38676@end ifset
38677
38678@menu
38679* V7/SVR3.1::                   The major changes between V7 and System V
38680                                Release 3.1.
38681* SVR4::                        Minor changes between System V Releases 3.1
38682                                and 4.
38683* POSIX::                       New features from the POSIX standard.
38684* BTL::                         New features from Brian Kernighan's version of
38685                                @command{awk}.
38686* POSIX/GNU::                   The extensions in @command{gawk} not in POSIX
38687                                @command{awk}.
38688* Feature History::             The history of the features in @command{gawk}.
38689* Common Extensions::           Common Extensions Summary.
38690* Ranges and Locales::          How locales used to affect regexp ranges.
38691* Contributors::                The major contributors to @command{gawk}.
38692* History summary::             History summary.
38693@end menu
38694
38695@node V7/SVR3.1
38696@appendixsec Major Changes Between V7 and SVR3.1
38697@cindex @command{awk} @subentry versions of
38698@cindex @command{awk} @subentry versions of @subentry changes between V7 and SVR3.1
38699
38700The @command{awk} language evolved considerably between the release of
38701Version 7 Unix (1978) and the new version that was first made generally available in
38702System V Release 3.1 (1987).  This @value{SECTION} summarizes the changes, with
38703cross-references to further details:
38704
38705@itemize @value{BULLET}
38706@item
38707The requirement for @samp{;} to separate rules on a line
38708(@pxref{Statements/Lines})
38709
38710@item
38711User-defined functions and the @code{return} statement
38712(@pxref{User-defined})
38713
38714@item
38715The @code{delete} statement (@pxref{Delete})
38716
38717@item
38718The @code{do}-@code{while} statement
38719(@pxref{Do Statement})
38720
38721@item
38722The built-in functions @code{atan2()}, @code{cos()}, @code{sin()}, @code{rand()}, and
38723@code{srand()} (@pxref{Numeric Functions})
38724
38725@item
38726The built-in functions @code{gsub()}, @code{sub()}, and @code{match()}
38727(@pxref{String Functions})
38728
38729@item
38730The built-in functions @code{close()} and @code{system()}
38731(@pxref{I/O Functions})
38732
38733@item
38734The @code{ARGC}, @code{ARGV}, @code{FNR}, @code{RLENGTH}, @code{RSTART},
38735and @code{SUBSEP} predefined variables (@pxref{Built-in Variables})
38736
38737@item
38738Assignable @code{$0} (@pxref{Changing Fields})
38739
38740@item
38741The conditional expression using the ternary operator @samp{?:}
38742(@pxref{Conditional Exp})
38743
38744@item
38745The expression @samp{@var{indx} in @var{array}} outside of @code{for}
38746statements (@pxref{Reference to Elements})
38747
38748@item
38749The exponentiation operator @samp{^}
38750(@pxref{Arithmetic Ops}) and its assignment operator
38751form @samp{^=} (@pxref{Assignment Ops})
38752
38753@item
38754C-compatible operator precedence, which breaks some old @command{awk}
38755programs (@pxref{Precedence})
38756
38757@item
38758Regexps as the value of @code{FS}
38759(@pxref{Field Separators}) and as the
38760third argument to the @code{split()} function
38761(@pxref{String Functions}), rather than using only the first character
38762of @code{FS}
38763
38764@item
38765Dynamic regexps as operands of the @samp{~} and @samp{!~} operators
38766(@pxref{Computed Regexps})
38767
38768@item
38769The escape sequences @samp{\b}, @samp{\f}, and @samp{\r}
38770(@pxref{Escape Sequences})
38771
38772@item
38773Redirection of input for the @code{getline} function
38774(@pxref{Getline})
38775
38776@item
38777Multiple @code{BEGIN} and @code{END} rules
38778(@pxref{BEGIN/END})
38779
38780@item
38781Multidimensional arrays
38782(@pxref{Multidimensional})
38783@end itemize
38784
38785@node SVR4
38786@appendixsec Changes Between SVR3.1 and SVR4
38787
38788@cindex @command{awk} @subentry versions of @subentry changes between SVR3.1 and SVR4
38789The System V Release 4 (1989) version of Unix @command{awk} added these features
38790(some of which originated in @command{gawk}):
38791
38792@itemize @value{BULLET}
38793@item
38794The @code{ENVIRON} array (@pxref{Built-in Variables})
38795@c gawk and MKS awk
38796
38797@item
38798Multiple @option{-f} options on the command line
38799(@pxref{Options})
38800@c MKS awk
38801
38802@item
38803The @option{-v} option for assigning variables before program execution begins
38804(@pxref{Options})
38805@c GNU, Bell Laboratories & MKS together
38806
38807@item
38808The @option{--} signal for terminating command-line options
38809
38810@item
38811The @samp{\a}, @samp{\v}, and @samp{\x} escape sequences
38812(@pxref{Escape Sequences})
38813@c GNU, for ANSI C compat
38814
38815@item
38816A defined return value for the @code{srand()} built-in function
38817(@pxref{Numeric Functions})
38818
38819@item
38820The @code{toupper()} and @code{tolower()} built-in string functions
38821for case translation
38822(@pxref{String Functions})
38823
38824@item
38825A cleaner specification for the @samp{%c} format-control letter in the
38826@code{printf} function
38827(@pxref{Control Letters})
38828
38829@item
38830The ability to dynamically pass the field width and precision (@code{"%*.*d"})
38831in the argument list of @code{printf} and @code{sprintf()}
38832(@pxref{Control Letters})
38833
38834@item
38835The use of regexp constants, such as @code{/foo/}, as expressions, where
38836they are equivalent to using the matching operator, as in @samp{$0 ~ /foo/}
38837(@pxref{Using Constant Regexps})
38838
38839@item
38840Processing of escape sequences inside command-line variable assignments
38841(@pxref{Assignment Options})
38842@end itemize
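
Several of these features can be seen together in a short program.
The following is only an illustrative sketch; the variable names and
values are arbitrary:

@example
BEGIN @{
    width = 8; prec = 3
    printf "%*.*d\n", width, prec, 42     # width and precision from arguments
    print toupper("svr4"), tolower("AWK")
    print "HOME is", ENVIRON["HOME"]

    $0 = "some food"
    if (/foo/)                            # same as $0 ~ /foo/
        print "matched"
@}
@end example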
38843
38844@node POSIX
38845@appendixsec Changes Between SVR4 and POSIX @command{awk}
38846@cindex @command{awk} @subentry versions of @subentry changes between SVR4 and POSIX @command{awk}
38847@cindex POSIX @command{awk} @subentry changes in @command{awk} versions
38848
38849The POSIX Command Language and Utilities standard for @command{awk} (1992)
38850introduced the following changes into the language:
38851
38852@itemize @value{BULLET}
38853@item
38854The use of @option{-W} for implementation-specific options
38855(@pxref{Options})
38856
38857@item
38858The use of @code{CONVFMT} for controlling the conversion of numbers
38859to strings (@pxref{Conversion})
38860
38861@item
38862The concept of a numeric string and tighter comparison rules to go
38863with it (@pxref{Typing and Comparison})
38864
38865@item
38866The use of predefined variables as function parameter names is forbidden
38867(@pxref{Definition Syntax})
38868
38869@item
38870More complete documentation of many of the previously undocumented
38871features of the language
38872@end itemize
38873
38874In 2012, a number of extensions that had been commonly available for
38875many years were finally added to POSIX. They are:
38876
38877@itemize @value{BULLET}
38878@item
38879The @code{fflush()} built-in function for flushing buffered output
38880(@pxref{I/O Functions})
38881
38882@item
38883The @code{nextfile} statement
38884(@pxref{Nextfile Statement})
38885
38886@item
38887The ability to delete all of an array at once with @samp{delete @var{array}}
38888(@pxref{Delete})
38889
38890@end itemize
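
As a small sketch of how these three features fit together, the following
program looks only at the first record of each input file.  The array name
@code{words} is chosen just for the example:

@example
FNR == 1 @{
    print FILENAME ": " $0
    fflush()        # flush buffered output right away
    words[$1]++
    nextfile        # skip the rest of this file
@}

END @{
    n = 0
    for (w in words)
        n++
    print n, "distinct first words"
    delete words    # remove every element at once
@}
@end example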
38891
38892@xref{Common Extensions} for a list of common extensions
38893not permitted by the POSIX standard.
38894
38895The 2018 POSIX standard can be found online at
38896@url{https://pubs.opengroup.org/onlinepubs/9699919799/}.
38897
38898
38899@node BTL
38900@appendixsec Extensions in Brian Kernighan's @command{awk}
38901
38902@cindex @command{awk} @subentry versions of @seealso{Brian Kernighan's @command{awk}}
38903@cindex extensions @subentry Brian Kernighan's @command{awk}
38904@cindex Brian Kernighan's @command{awk} @subentry extensions
38905@cindex Kernighan, Brian
38906Brian Kernighan
38907has made his version available via his home page
38908(@pxref{Other Versions}).
38909
38910This @value{SECTION} describes common extensions that
38911originally appeared in his version of @command{awk}:
38912
38913@itemize @value{BULLET}
38914@item
38915The @samp{**} and @samp{**=} operators
38916(@pxref{Arithmetic Ops}
38917and
38918@ref{Assignment Ops})
38919
38920@item
38921The use of @code{func} as an abbreviation for @code{function}
38922(@pxref{Definition Syntax})
38923
38924@item
38925The @code{fflush()} built-in function for flushing buffered output
38926(@pxref{I/O Functions})
38927
38928@ignore
38929@item
38930The @code{SYMTAB} array, that allows access to @command{awk}'s internal symbol
38931table. This feature was never documented for his @command{awk}, largely because
38932it is somewhat shakily implemented. For instance, you cannot access arrays
38933or array elements through it
38934@end ignore
38935@end itemize
38936
38937@xref{Common Extensions} for a full list of the extensions
38938available in his @command{awk}.
38939
38940@node POSIX/GNU
38941@appendixsec Extensions in @command{gawk} Not in POSIX @command{awk}
38942
38943@cindex compatibility mode (@command{gawk}) @subentry extensions
38944@cindex extensions @subentry in @command{gawk}, not in POSIX @command{awk}
38945@cindex POSIX @subentry @command{gawk} extensions not included in
38946The GNU implementation, @command{gawk}, adds a large number of features.
38947They can all be disabled with either the @option{--traditional} or
38948@option{--posix} options
38949(@pxref{Options}).
38950
38951A number of features have come and gone over the years. This @value{SECTION}
38952summarizes the additional features over POSIX @command{awk} that are
38953in the current version of @command{gawk}.
38954
38955@itemize @value{BULLET}
38956
38957@item
38958Additional predefined variables:
38959
38960@itemize @value{MINUS}
38961@item
38962The
38963@code{ARGIND},
38964@code{BINMODE},
38965@code{ERRNO},
38966@code{FIELDWIDTHS},
38967@code{FPAT},
38968@code{IGNORECASE},
38969@code{LINT},
38970@code{PROCINFO},
38971@code{RT},
38972and
38973@code{TEXTDOMAIN}
38974variables
38975(@pxref{Built-in Variables})
38976@end itemize
38977
38978@item
38979Special files in I/O redirections:
38980
38981@itemize @value{MINUS}
38982@item
38983The @file{/dev/stdin}, @file{/dev/stdout}, @file{/dev/stderr}, and
38984@file{/dev/fd/@var{N}} special @value{FN}s
38985(@pxref{Special Files})
38986
38987@item
38988The @file{/inet}, @file{/inet4}, and @file{/inet6} special files for
38989TCP/IP networking using @samp{|&} to specify which version of the
38990IP protocol to use
38991(@pxref{TCP/IP Networking})
38992@end itemize
38993
38994@item
38995Changes and/or additions to the language:
38996
38997@itemize @value{MINUS}
38998@item
38999The @samp{\x} escape sequence
39000(@pxref{Escape Sequences})
39001
39002@item
39003Full support for both POSIX and GNU regexps
39004(@pxref{Regexp})
39005
39006@item
39007The ability for @code{FS} and for the third
39008argument to @code{split()} to be null strings
39009(@pxref{Single Character Fields})
39010
39011@item
39012The ability for @code{RS} to be a regexp
39013(@pxref{Records})
39014
39015@item
39016The ability to use octal and hexadecimal constants in @command{awk}
39017program source code
39018(@pxref{Nondecimal-numbers})
39019
39020@item
39021The @samp{|&} operator for two-way I/O to a coprocess
39022(@pxref{Two-way I/O})
39023
39024@item
39025Indirect function calls
39026(@pxref{Indirect Calls})
39027
39028@item
39029Directories on the command line produce a warning and are skipped
39030(@pxref{Command-line directories})
39031
39032@item
Output errors with @code{print} and @code{printf} need not be fatal
39034(@pxref{Nonfatal})
39035@end itemize
39036
39037@item
39038New keywords:
39039
39040@itemize @value{MINUS}
39041@item
39042The @code{BEGINFILE} and @code{ENDFILE} special patterns
39043(@pxref{BEGINFILE/ENDFILE})
39044
39045@item
39046The @code{switch} statement
39047(@pxref{Switch Statement})
39048@end itemize
39049
39050@item
39051Changes to standard @command{awk} functions:
39052
39053@itemize @value{MINUS}
39054@item
39055The optional second argument to @code{close()} that allows closing one end
39056of a two-way pipe to a coprocess
39057(@pxref{Two-way I/O})
39058
39059@item
39060POSIX compliance for @code{gsub()} and @code{sub()} with @option{--posix}
39061
39062@item
39063The @code{length()} function accepts an array argument
39064and returns the number of elements in the array
39065(@pxref{String Functions})
39066
39067@item
39068The optional third argument to the @code{match()} function
39069for capturing text-matching subexpressions within a regexp
39070(@pxref{String Functions})
39071
39072@item
39073Positional specifiers in @code{printf} formats for
39074making translations easier
39075(@pxref{Printf Ordering})
39076
39077@item
39078The @code{split()} function's additional optional fourth
39079argument, which is an array to hold the text of the field separators
39080(@pxref{String Functions})
39081@end itemize
39082
39083@item
39084Additional functions only in @command{gawk}:
39085
39086@itemize @value{MINUS}
39087@item
39088The @code{gensub()}, @code{patsplit()}, and @code{strtonum()} functions
39089for more powerful text manipulation
39090(@pxref{String Functions})
39091
39092@item
39093The @code{asort()} and @code{asorti()} functions for sorting arrays
39094(@pxref{Array Sorting})
39095
39096@item
39097The @code{mktime()}, @code{systime()}, and @code{strftime()}
39098functions for working with timestamps
39099(@pxref{Time Functions})
39100
39101@item
39102The
39103@code{and()},
39104@code{compl()},
39105@code{lshift()},
39106@code{or()},
39107@code{rshift()},
39108and
39109@code{xor()}
39110functions for bit manipulation
39111(@pxref{Bitwise Functions})
39112@c In 4.1, and(), or() and xor() grew the ability to take > 2 arguments
39113
39114@item
39115The @code{isarray()} function to check if a variable is an array or not
39116(@pxref{Type Functions})
39117
39118@item
39119The @code{bindtextdomain()}, @code{dcgettext()}, and @code{dcngettext()}
39120functions for internationalization
39121(@pxref{Programmer i18n})
39122
39123@ifset INTDIV
39124@item
39125The @code{intdiv0()} function for doing integer
39126division and remainder
39127(@pxref{Numeric Functions})
39128@end ifset
39129@end itemize
39130
39131@item
39132Changes and/or additions in the command-line options:
39133
39134@itemize @value{MINUS}
39135@item
39136The @env{AWKPATH} environment variable for specifying a path search for
39137the @option{-f} command-line option
39138(@pxref{Options})
39139
39140@item
39141The @env{AWKLIBPATH} environment variable for specifying a path search for
39142the @option{-l} command-line option
39143(@pxref{Options})
39144
39145@item
39146The
39147@option{-b},
39148@option{-c},
39149@option{-C},
39150@option{-d},
39151@option{-D},
39152@option{-e},
39153@option{-E},
39154@option{-g},
39155@option{-h},
39156@option{-i},
39157@option{-l},
39158@option{-L},
39159@option{-M},
39160@option{-n},
39161@option{-N},
39162@option{-o},
39163@option{-O},
39164@option{-p},
39165@option{-P},
39166@option{-r},
39167@option{-s},
39168@option{-S},
39169@option{-t},
39170and
39171@option{-V}
39172short options. Also, the
39173ability to use GNU-style long-named options that start with @option{--},
39174and the
39175@option{--assign},
39176@option{--bignum},
39177@option{--characters-as-bytes},
39178@option{--copyright},
39179@option{--debug},
39180@option{--dump-variables},
39181@option{--exec},
39182@option{--field-separator},
39183@option{--file},
39184@option{--gen-pot},
39185@option{--help},
39186@option{--include},
39187@option{--lint},
39188@option{--lint-old},
39189@option{--load},
39190@option{--non-decimal-data},
39191@option{--optimize},
39192@option{--no-optimize},
39193@option{--posix},
39194@option{--pretty-print},
39195@option{--profile},
39196@option{--re-interval},
39197@option{--sandbox},
39198@option{--source},
39199@option{--traditional},
39200@option{--use-lc-numeric},
39201and
39202@option{--version}
39203long options
39204(@pxref{Options}).
39205@end itemize
39206
39207@c       new ports
39208
39209@item
39210Support for the following obsolete systems was removed from the code
39211and the documentation for @command{gawk} @value{PVERSION} 4.0:
39212
39213@c nested table
39214@itemize @value{MINUS}
39215@item
39216Amiga
39217
39218@item
39219Atari
39220
39221@item
39222BeOS
39223
39224@item
39225Cray
39226
39227@item
39228MIPS RiscOS
39229
39230@item
39231MS-DOS with the Microsoft Compiler
39232
39233@item
39234MS-Windows with the Microsoft Compiler
39235
39236@item
39237NeXT
39238
39239@item
39240SunOS 3.x, Sun 386 (Road Runner)
39241
39242@item
39243Tandem (non-POSIX)
39244
39245@item
39246Prestandard VAX C compiler for VAX/VMS
39247
39248@item
39249GCC for VAX and Alpha has not been tested for a while.
39250
39251@end itemize
39252
39253@item
39254Support for the following obsolete system was removed from the code
39255for @command{gawk} @value{PVERSION} 4.1:
39256
39257@c nested table
39258@itemize @value{MINUS}
39259@item
39260Ultrix
39261@end itemize
39262
39263@item
39264Support for the following systems was removed from the code
39265for @command{gawk} @value{PVERSION} 4.2:
39266
39267@c nested table
39268@itemize @value{MINUS}
39269@item
39270MirBSD
39271
39272@item
39273GNU/Linux on Alpha
39274@end itemize
39275
39276@end itemize
39277
39278@c XXX ADD MORE STUFF HERE
39279
39280
39281@c This does not need to be in the formal book.
39282@ifclear FOR_PRINT
39283@node Feature History
39284@appendixsec History of @command{gawk} Features
39285
39286@ignore
39287See the thread:
39288https://groups.google.com/forum/#!topic/comp.lang.awk/SAUiRuff30c
39289This motivated me to add this section.
39290@end ignore
39291
39292@ignore
39293I've tried to follow this general order, esp.@: for the 3.0 and 3.1 sections:
39294       variables
39295       special files
39296       language changes (e.g., hex constants)
39297       differences in standard awk functions
39298       new gawk functions
39299       new keywords
39300       new command-line options
39301       behavioral changes
39302       extension API changes
39303       new / deprecated / removed ports
39304       installation time stuff
39305Within each category, be alphabetical.
39306@end ignore
39307
39308This @value{SECTION} describes the features in @command{gawk}
39309over and above those in POSIX @command{awk},
39310in the order they were added to @command{gawk}.
39311
39312Version 2.10 of @command{gawk} introduced the following features:
39313
39314@itemize @value{BULLET}
39315@item
39316The @env{AWKPATH} environment variable for specifying a path search for
39317the @option{-f} command-line option
39318(@pxref{Options}).
39319
39320@item
39321The @code{IGNORECASE} variable and its effects
39322(@pxref{Case-sensitivity}).
39323
39324@item
39325The @file{/dev/stdin}, @file{/dev/stdout}, @file{/dev/stderr} and
39326@file{/dev/fd/@var{N}} special @value{FN}s
39327(@pxref{Special Files}).
39328@end itemize
39329
39330Version 2.13 of @command{gawk} introduced the following features:
39331
39332@itemize @value{BULLET}
39333@item
39334The @code{FIELDWIDTHS} variable and its effects
39335(@pxref{Constant Size}).
39336
39337@item
39338The @code{systime()} and @code{strftime()} built-in functions for obtaining
39339and printing timestamps
39340(@pxref{Time Functions}).
39341
39342@item
39343Additional command-line options
39344(@pxref{Options}):
39345
39346@itemize @value{MINUS}
39347@item
39348The @option{-W lint} option to provide error and portability checking
both for the source code and at runtime.
39350
39351@item
39352The @option{-W compat} option to turn off the GNU extensions.
39353
39354@item
39355The @option{-W posix} option for full POSIX compliance.
39356@end itemize
39357@end itemize
39358
39359Version 2.14 of @command{gawk} introduced the following feature:
39360
39361@itemize @value{BULLET}
39362@item
39363The @code{next file} statement for skipping to the next @value{DF}
39364(@pxref{Nextfile Statement}).
39365@end itemize
39366
39367Version 2.15 of @command{gawk} introduced the following features:
39368
39369@itemize @value{BULLET}
39370@item
39371New variables (@pxref{Built-in Variables}):
39372
39373@itemize @value{MINUS}
39374@item
39375@code{ARGIND}, which tracks the movement of @code{FILENAME}
39376through @code{ARGV}.
39377
39378@item
39379@code{ERRNO}, which contains the system error message when
39380@code{getline} returns @minus{}1 or @code{close()} fails.
39381@end itemize
39382
39383@item
39384The @file{/dev/pid}, @file{/dev/ppid}, @file{/dev/pgrpid}, and
39385@file{/dev/user} special @value{FN}s. These have since been removed.
39386
39387@item
39388The ability to delete all of an array at once with @samp{delete @var{array}}
39389(@pxref{Delete}).
39390
39391@item
39392Command-line option changes
39393(@pxref{Options}):
39394
39395@itemize @value{MINUS}
39396@item
39397The ability to use GNU-style long-named options that start with @option{--}.
39398
39399@item
39400The @option{--source} option for mixing command-line and library-file
39401source code.
39402@end itemize
39403@end itemize
39404
39405Version 3.0 of @command{gawk} introduced the following features:
39406
39407@itemize @value{BULLET}
39408@item
39409New or changed variables:
39410
39411@itemize @value{MINUS}
39412@item
39413@code{IGNORECASE} changed, now applying to string comparison as well
39414as regexp operations
39415(@pxref{Case-sensitivity}).
39416
39417@item
39418@code{RT}, which contains the input text that matched @code{RS}
39419(@pxref{Records}).
39420@end itemize
39421
39422@item
39423Full support for both POSIX and GNU regexps
39424(@pxref{Regexp}).
39425
39426@item
39427The @code{gensub()} function for more powerful text manipulation
39428(@pxref{String Functions}).
39429
39430@item
39431The @code{strftime()} function acquired a default time format,
39432allowing it to be called with no arguments
39433(@pxref{Time Functions}).
39434
39435@item
39436The ability for @code{FS} and for the third
39437argument to @code{split()} to be null strings
39438(@pxref{Single Character Fields}).
39439
39440@item
39441The ability for @code{RS} to be a regexp
39442(@pxref{Records}).
39443
39444@item
39445The @code{next file} statement became @code{nextfile}
39446(@pxref{Nextfile Statement}).
39447
39448@item
39449The @code{fflush()} function from
39450BWK @command{awk}
39451(then at Bell Laboratories;
39452@pxref{I/O Functions}).
39453
39454@item
39455New command-line options:
39456
39457@itemize @value{MINUS}
39458@item
39459The @option{--lint-old} option to
39460warn about constructs that are not available in
39461the original Version 7 Unix version of @command{awk}
39462(@pxref{V7/SVR3.1}).
39463
39464@item
39465The @option{-m} option from BWK @command{awk}.  (Brian was
39466still at Bell Laboratories at the time.)  This was later removed from
39467both his @command{awk} and from @command{gawk}.
39468
39469@item
39470The @option{--re-interval} option to provide interval expressions in regexps
39471(@pxref{Regexp Operators}).
39472
39473@item
39474The @option{--traditional} option was added as a better name for
39475@option{--compat} (@pxref{Options}).
39476@end itemize
39477
39478@item
39479The use of GNU Autoconf to control the configuration process
39480(@pxref{Quick Installation}).
39481
39482@item
39483Amiga support.
39484This has since been removed.
39485
39486@end itemize
39487
39488Version 3.1 of @command{gawk} introduced the following features:
39489
39490@itemize @value{BULLET}
39491@item
39492New variables
39493(@pxref{Built-in Variables}):
39494
39495@itemize @value{MINUS}
39496@item
39497@code{BINMODE}, for non-POSIX systems,
39498which allows binary I/O for input and/or output files
39499(@pxref{PC Using}).
39500
39501@item
39502@code{LINT}, which dynamically controls lint warnings.
39503
39504@item
39505@code{PROCINFO}, an array for providing process-related information.
39506
39507@item
39508@code{TEXTDOMAIN}, for setting an application's internationalization text domain
39509(@pxref{Internationalization}).
39510@end itemize
39511
39512@item
39513The ability to use octal and hexadecimal constants in @command{awk}
39514program source code
39515(@pxref{Nondecimal-numbers}).
39516
39517@item
39518The @samp{|&} operator for two-way I/O to a coprocess
39519(@pxref{Two-way I/O}).
39520
39521@item
39522The @file{/inet} special files for TCP/IP networking using @samp{|&}
39523(@pxref{TCP/IP Networking}).
39524
39525@item
39526The optional second argument to @code{close()} that allows closing one end
39527of a two-way pipe to a coprocess
39528(@pxref{Two-way I/O}).
39529
39530@item
39531The optional third argument to the @code{match()} function
39532for capturing text-matching subexpressions within a regexp
39533(@pxref{String Functions}).
39534
39535@item
39536Positional specifiers in @code{printf} formats for
39537making translations easier
39538(@pxref{Printf Ordering}).
39539
39540@item
39541A number of new built-in functions:
39542
39543@itemize @value{MINUS}
39544@item
39545The @code{asort()} and @code{asorti()} functions for sorting arrays
39546(@pxref{Array Sorting}).
39547
39548@item
39549The @code{bindtextdomain()}, @code{dcgettext()} and @code{dcngettext()} functions
39550for internationalization
39551(@pxref{Programmer i18n}).
39552
39553@item
39554The @code{extension()} function and the ability to add
new built-in functions dynamically. This has since been removed.
39556It was replaced by the new extension mechanism.
39557@xref{Dynamic Extensions}.
39558
39559@item
39560The @code{mktime()} function for creating timestamps
39561(@pxref{Time Functions}).
39562
39563@item
39564The @code{and()}, @code{or()}, @code{xor()}, @code{compl()},
39565@code{lshift()}, @code{rshift()}, and @code{strtonum()} functions
39566(@pxref{Bitwise Functions}).
39567@end itemize
39568
39569@item
39570@cindex @code{next file} statement
39571The support for @samp{next file} as two words was removed completely
39572(@pxref{Nextfile Statement}).
39573
39574@item
39575Additional command-line options
39576(@pxref{Options}):
39577
39578@itemize @value{MINUS}
39579@item
39580The @option{--dump-variables} option to print a list of all global variables.
39581
39582@item
39583The @option{--exec} option, for use in CGI scripts.
39584
39585@item
39586The @option{--gen-po} command-line option and the use of a leading
39587underscore to mark strings that should be translated
39588(@pxref{String Extraction}).
39589
39590@item
39591The @option{--non-decimal-data} option to allow non-decimal
39592input data
39593(@pxref{Nondecimal Data}).
39594
39595@item
39596The @option{--profile} option and @command{pgawk}, the
39597profiling version of @command{gawk}, for producing execution
39598profiles of @command{awk} programs
39599(@pxref{Profiling}).
39600
39601@item
39602The @option{--use-lc-numeric} option to force @command{gawk}
39603to use the locale's decimal point for parsing input data
39604(@pxref{Conversion}).
39605@end itemize
39606
39607@item
39608The use of GNU Automake to help in standardizing the configuration process
39609(@pxref{Quick Installation}).
39610
39611@item
39612The use of GNU @command{gettext} for @command{gawk}'s own message output
39613(@pxref{Gawk I18N}).
39614
39615@item
39616BeOS support. This was later removed.
39617
39618@item
39619Tandem support. This was later removed.
39620
39621@item
39622The Atari port became officially unsupported and was
39623later removed entirely.
39624
39625@item
39626The source code changed to use ISO C standard-style function definitions.
39627
39628@item
39629POSIX compliance for @code{sub()} and @code{gsub()}
39630(@pxref{Gory Details}).
39631
39632@item
39633The @code{length()} function was extended to accept an array argument
39634and return the number of elements in the array
39635(@pxref{String Functions}).
39636
39637@item
39638The @code{strftime()} function acquired a third argument to
39639enable printing times as UTC
39640(@pxref{Time Functions}).
39641@end itemize
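
As an informal illustration of a few of the version 3.1 additions listed
above, the following sketch uses the three-argument @code{match()},
@code{strtonum()}, and @code{and()}.  The sample string and the variable
names are made up for the example:

@example
BEGIN @{
    s = "release 5.1"
    # The third argument receives the parenthesized subexpressions
    if (match(s, /([0-9]+)\.([0-9]+)/, m))
        print m[1], m[2]          # prints "5 1"
    print strtonum("0x1F")        # hexadecimal string to number: 31
    print and(12, 10)             # bitwise AND of 1100 and 1010: 8
@}
@end example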
39642
39643Version 4.0 of @command{gawk} introduced the following features:
39644
39645@itemize @value{BULLET}
39646
39647@item
39648Variable additions:
39649
39650@itemize @value{MINUS}
39651@item
39652@code{FPAT}, which allows you to specify a regexp that matches
39653the fields, instead of matching the field separator
39654(@pxref{Splitting By Content}).
39655
39656@item
39657If @code{PROCINFO["sorted_in"]} exists, @samp{for (iggy in foo)} loops sort the
39658indices before looping over them.  The value of this element
39659provides control over how the indices are sorted before the loop
39660traversal starts
39661(@pxref{Controlling Scanning}).
39662
39663@item
39664@code{PROCINFO["strftime"]}, which holds
39665the default format for @code{strftime()}
39666(@pxref{Time Functions}).
39667@end itemize
39668
39669@item
39670The special files @file{/dev/pid}, @file{/dev/ppid}, @file{/dev/pgrpid}
39671and @file{/dev/user} were removed.
39672
39673@item
39674Support for IPv6 was added via the @file{/inet6} special file.
39675@file{/inet4} forces IPv4 and @file{/inet} chooses the system
39676default, which is probably IPv4
39677(@pxref{TCP/IP Networking}).
39678
39679@item
39680The use of @samp{\s} and @samp{\S} escape sequences in regular expressions
39681(@pxref{GNU Regexp Operators}).
39682
39683@item
39684Interval expressions became part of default regular expressions
39685(@pxref{Regexp Operators}).
39686
39687@item
39688POSIX character classes work even with @option{--traditional}
39689(@pxref{Regexp Operators}).
39690
39691@item
39692@code{break} and @code{continue} became invalid outside a loop,
39693even with @option{--traditional}
39694(@pxref{Break Statement}, and also see
39695@ref{Continue Statement}).
39696
39697@item
39698@code{fflush()}, @code{nextfile}, and @samp{delete @var{array}}
are allowed with @option{--posix} or @option{--traditional}, since they
39700are all now part of POSIX.
39701
39702@item
39703An optional third argument to
39704@code{asort()} and @code{asorti()}, specifying how to sort
39705(@pxref{String Functions}).
39706
39707@item
The behavior of @code{fflush()} changed to match BWK @command{awk}
and POSIX; now both @samp{fflush()} and @samp{fflush("")}
39710flush all open output redirections
39711(@pxref{I/O Functions}).
39712
39713@item
The @code{isarray()}
function, which determines whether an item is an array,
to make it possible to traverse arrays of arrays
39717(@pxref{Type Functions}).
39718
39719@item
The @code{patsplit()}
function, which provides the same capability as @code{FPAT} for splitting strings
39722(@pxref{String Functions}).
39723
39724@item
39725An optional fourth argument to the @code{split()} function,
39726which is an array to hold the values of the separators
39727(@pxref{String Functions}).
39728
39729@item
39730Arrays of arrays
39731(@pxref{Arrays of Arrays}).
39732
39733@item
39734The @code{BEGINFILE} and @code{ENDFILE} special patterns
39735(@pxref{BEGINFILE/ENDFILE}).
39736
39737@item
39738Indirect function calls
39739(@pxref{Indirect Calls}).
39740
39741@item
39742@code{switch} / @code{case} are enabled by default
39743(@pxref{Switch Statement}).
39744
39745@item
39746Command-line option changes
39747(@pxref{Options}):
39748
39749@itemize @value{MINUS}
39750@item
39751The @option{-b} and @option{--characters-as-bytes} options
39752which prevent @command{gawk} from treating input as a multibyte string.
39753
39754@item
39755The redundant @option{--compat}, @option{--copyleft}, and @option{--usage}
39756long options were removed.
39757
39758@item
39759The @option{--gen-po} option was finally renamed to the correct @option{--gen-pot}.
39760
39761@item
39762The @option{--sandbox} option which disables certain features.
39763
39764@item
39765All long options acquired corresponding short options, for use in @samp{#!} scripts.
39766@end itemize
39767
39768@item
39769Directories named on the command line now produce a warning, not a fatal
error, unless @option{--posix} or @option{--traditional} is used
39771(@pxref{Command-line directories}).
39772
39773@item
39774The @command{gawk} internals were rewritten, bringing the @command{dgawk}
39775debugger and possibly improved performance
39776(@pxref{Debugger}).
39777
39778@item
39779Per the GNU Coding Standards, dynamic extensions must now define
39780a global symbol indicating that they are GPL-compatible
39781(@pxref{Plugin License}).
39782
39783@item
39784@cindex POSIX mode
39785In POSIX mode, string comparisons use @code{strcoll()} / @code{wcscoll()}
39786(@pxref{POSIX String Comparison}).
39787
39788@item
39789The option for raw sockets was removed, since it was never implemented
39790(@pxref{TCP/IP Networking}).
39791
39792@item
39793Ranges of the form @samp{[d-h]} are treated as if they were in the
C locale, no matter what kind of regexp is being used, and even if
@option{--posix} is specified
39796(@pxref{Ranges and Locales}).
39797
39798@item
39799Support was removed for the following systems:
39800
39801@itemize @value{MINUS}
39802@item
39803Atari
39804
39805@item
39806Amiga
39807
39808@item
39809BeOS
39810
39811@item
39812Cray
39813
39814@item
39815MIPS RiscOS
39816
39817@item
39818MS-DOS with the Microsoft Compiler
39819
39820@item
39821MS-Windows with the Microsoft Compiler
39822
39823@item
39824NeXT
39825
39826@item
39827SunOS 3.x, Sun 386 (Road Runner)
39828
39829@item
39830Tandem (non-POSIX)
39831
39832@item
39833Prestandard VAX C compiler for VAX/VMS
39834@end itemize
39835@end itemize
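
Several of the version 4.0 additions listed above can be combined in a
short program.  The following is only a sketch: the comma-separated input
format and the array names @code{parts} and @code{seps} are assumptions
made for the illustration.  Given an input line such as @samp{a,b,c},
it should report two separators and three fields:

@example
BEGIN @{
    # FPAT describes what a field looks like, not what separates fields
    FPAT = "[^,]+"
    # Visit array indices in ascending string order
    PROCINFO["sorted_in"] = "@@ind_str_asc"
@}

@{
    # The fourth argument to split() receives the separator strings
    n = split($0, parts, /,/, seps)
    for (i in seps)
        printf "separator %d is <%s>\n", i, seps[i]
    print "fields:", NF, "pieces:", n
@}
@end example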
39836
39837Version 4.1 of @command{gawk} introduced the following features:
39838
39839@itemize @value{BULLET}
39840
39841@item
39842Three new arrays:
39843@code{SYMTAB}, @code{FUNCTAB}, and @code{PROCINFO["identifiers"]}
39844(@pxref{Auto-set}).
39845
39846@item
The three executables @command{gawk}, @command{pgawk}, and @command{dgawk} were merged into
one, named just @command{gawk}.  As a result, the command-line options changed.
39849
39850@item
39851Command-line option changes
39852(@pxref{Options}):
39853
39854@itemize @value{MINUS}
39855@item
39856The @option{-D} option invokes the debugger.
39857
39858@item
39859The @option{-i} and @option{--include} options
39860load @command{awk} library files.
39861
39862@item
39863The @option{-l} and @option{--load} options load compiled dynamic extensions.
39864
39865@item
39866The @option{-M} and @option{--bignum} options enable MPFR.
39867
39868@item
39869The @option{-o} option only does pretty-printing.
39870
39871@item
39872The @option{-p} option is used for profiling.
39873
39874@item
39875The @option{-R} option was removed.
39876@end itemize
39877
39878@item
39879Support for high precision arithmetic with MPFR
39880(@pxref{Arbitrary Precision Arithmetic}).
39881
39882@item
39883The @code{and()}, @code{or()} and @code{xor()} functions
39884changed to allow any number of arguments,
39885with a minimum of two
39886(@pxref{Bitwise Functions}).
39887
39888@item
39889The dynamic extension interface was completely redone
39890(@pxref{Dynamic Extensions}).
39891
39892@item
39893Redirected @code{getline} became allowed inside
39894@code{BEGINFILE} and @code{ENDFILE}
39895(@pxref{BEGINFILE/ENDFILE}).
39896
39897@item
39898The @code{where} command was added to the debugger
39899(@pxref{Execution Stack}).
39900
39901@item
39902Support for Ultrix was removed.
39903
39904@end itemize
39905
39906Version 4.2 of @command{gawk} introduced the following changes:
39907
@itemize @value{BULLET}
39909@item
39910Changes to @code{ENVIRON} are reflected into @command{gawk}'s
39911environment and that of programs that it runs.
39912@xref{Auto-set}.
39913
39914@item
39915@code{FIELDWIDTHS} was enhanced to allow skipping characters
39916before assigning a value to a field
39917(@pxref{Splitting By Content}).
39918
39919@item
39920The @code{PROCINFO["argv"]} array.
39921@xref{Auto-set}.
39922
39923@item
39924The maximum number of hexadecimal digits in @samp{\x} escapes
39925is now two.
39926@xref{Escape Sequences}.
39927
39928@item
39929Strongly typed regexp constants of the form @samp{@@/@dots{}/}
39930(@pxref{Strong Regexp Constants}).
39931
39932@item
39933The bitwise functions changed, making negative arguments into
39934a fatal error (@pxref{Bitwise Functions}).
39935
39936@ifset INTDIV
39937@item
39938The @code{intdiv0()} function.
39939@xref{Numeric Functions}.
39940@end ifset
39941
39942@item
39943The @code{mktime()} function now accepts an optional
39944second argument
39945(@pxref{Time Functions}).
39946
39947@item
39948The @code{typeof()} function (@pxref{Type Functions}).
39949
39950@item
39951Optimizations are enabled by default. Use @option{-s} /
39952@option{--no-optimize} to disable optimizations.
39953
39954@item
39955For many years, POSIX specified that default field splitting
39956only allowed spaces and tabs to separate fields, and this was
39957how @command{gawk} behaved with @option{--posix}. As of 2013,
39958the standard restored historical behavior, and now default
39959field splitting with @option{--posix} also allows newlines to
39960separate fields.
39961
39962@item
39963Nonfatal output with @code{print} and @code{printf}.
39964@xref{Nonfatal}.
39965
39966@item
Retryable I/O via @code{PROCINFO[@var{input-file}, "RETRY"]}
39968(@pxref{Retrying Input}).
39969
39970@item
39971Changes to the pretty-printer (@pxref{Profiling}):
39972
39973@c nested table
39974@itemize @value{MINUS}
39975@item
39976The @option{--pretty-print} option no longer runs the @command{awk}
39977program too.
39978
39979@item
39980Comments in the source program are preserved and placed into the
39981output file.
39982
39983@item
39984Explicit parentheses for expressions
39985in the input are preserved in the generated output.
39986@end itemize
39987
39988@item
39989Improvements to the extension API
39990(@pxref{Dynamic Extensions}):
39991
39992@c nested
39993@itemize @value{MINUS}
39994@item
39995The @code{get_file()} function to access open redirections.
39996
39997@item
39998The @code{nonfatal()} function for generating nonfatal error messages.
39999
40000@item
40001Support for GMP and MPFR values.
40002
40003@item
40004Input parsers can now override the default field parsing mechanism
40005by specifying explicit locations.
40006@end itemize
40007
40008@item
40009Shell startup files are supplied with the distribution and
40010installed by @samp{make install} (@pxref{Shell Startup Files}).
40011
40012@item
40013The @command{igawk} program and its manual page are no longer
40014installed when @command{gawk} is built.
40015@xref{Igawk Program}.
40016
40017@item
40018Support for MirBSD was removed.
40019
40020@item
40021Support for GNU/Linux on Alpha was removed.
40022
40023@end itemize
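
As a brief sketch of two of the version 4.2 additions listed above, the
following program stores a strongly typed regexp constant in a variable
and inspects it with @code{typeof()}.  The variable name @code{digits}
and the test string are chosen only for the example:

@example
BEGIN @{
    digits = @@/^[[:digit:]]+$/     # strongly typed regexp constant
    print typeof(digits)           # prints "regexp"
    print typeof(42), typeof("x")  # prints "number string"
    if ("2021" ~ digits)
        print "all digits"
@}
@end example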
40024
40025Version 5.0 added the following features:
40026
@itemize @value{BULLET}
40028@item
40029The @code{PROCINFO["platform"]} array element, which allows you
40030to write code that takes the operating system / platform into account.
40031@end itemize
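
For instance, a program might use this element to choose a path-list
separator.  The following is a hedged sketch, not a recommended idiom:
the variable name @code{sep} is hypothetical, and the value printed
depends on how @command{gawk} was built (for example, @code{"posix"}
on GNU/Linux systems):

@example
BEGIN @{
    # Hypothetical choice of separator based on the platform string
    sep = (PROCINFO["platform"] == "mingw") ? ";" : ":"
    printf "platform=%s, separator=%s\n", PROCINFO["platform"], sep
@}
@end example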
40032
40033Version 5.1 was created to release @command{gawk} with a correct
40034major version number for the API. This was overlooked for version 5.0,
40035unfortunately. It added the following features:
40036
@itemize @value{BULLET}
40038@item
40039The index for this manual was completely reworked.
40040
40041@item
40042Support was added for MSYS2.
40043
40044@item
40045@code{asort()} and @code{asorti()} were changed to
40046allow @code{FUNCTAB} and @code{SYMTAB} as the first argument if a
40047second destination array is supplied (@pxref{String Functions}).
40048
40049@item
40050The @option{-I}/@option{--trace} options were added to
40051print a trace of the byte codes as they execute (@pxref{Options}).
40052
40053@item
40054@code{$0} and the fields are now cleared before starting a
40055@code{BEGINFILE} rule (@pxref{BEGINFILE/ENDFILE}).
40056
40057@item
40058Several example programs in the manual were updated to their modern
40059POSIX equivalents.
40060
40061@item
40062The ``no effect'' lint warnings from @option{--lint} were fixed up
40063and now behave more sanely (@pxref{Options}).
40064
40065@item
Handling of Infinity and NaN values was improved.
40067@xref{Math Definitions}, and also see
40068@ref{POSIX Floating Point Problems}.
40069@end itemize
40070
40071@c XXX ADD MORE STUFF HERE
40072@end ifclear
40073
40074@node Common Extensions
40075@appendixsec Common Extensions Summary
40076
40077@cindex extensions @subentry Brian Kernighan's @command{awk}
40078@cindex extensions @subentry @command{mawk}
40079The following table summarizes the common extensions supported
40080by @command{gawk}, Brian Kernighan's @command{awk}, and @command{mawk},
40081the three most widely used freely available versions of @command{awk}
40082(@pxref{Other Versions}).
40083
40084@multitable {@file{/dev/stderr} special file} {BWK @command{awk}} {@command{mawk}} {@command{gawk}} {Now standard}
40085@headitem Feature @tab BWK @command{awk} @tab @command{mawk} @tab @command{gawk} @tab Now standard
40086@item @samp{\x} escape sequence @tab X @tab X @tab X @tab
40087@item @code{FS} as null string @tab X @tab X @tab X @tab
40088@item @file{/dev/stdin} special file @tab X @tab X @tab X @tab
40089@item @file{/dev/stdout} special file @tab X @tab X @tab X @tab
40090@item @file{/dev/stderr} special file @tab X @tab X @tab X @tab
40091@item @code{delete} without subscript @tab X @tab X @tab X @tab X
40092@item @code{fflush()} function @tab X @tab X @tab X @tab X
40093@item @code{length()} of an array @tab X @tab X @tab X @tab
40094@item @code{nextfile} statement @tab X @tab X @tab X @tab X
40095@item @code{**} and @code{**=} operators @tab X @tab @tab X @tab
40096@item @code{func} keyword @tab X @tab @tab X @tab
40097@item @code{BINMODE} variable @tab @tab X @tab X @tab
40098@item @code{RS} as regexp @tab X @tab X @tab X @tab
40099@item Time-related functions @tab @tab X @tab X @tab
40100@end multitable
40101
40102@node Ranges and Locales
40103@appendixsec Regexp Ranges and Locales: A Long Sad Story
40104
40105This @value{SECTION} describes the confusing history of ranges within
40106regular expressions and their interactions with locales, and how this
40107affected different versions of @command{gawk}.
40108
40109@cindex ASCII
40110@cindex EBCDIC
40111The original Unix tools that worked with regular expressions defined
40112character ranges (such as @samp{[a-z]}) to match any character between
40113the first character in the range and the last character in the range,
40114inclusive.  Ordering was based on the numeric value of each character
40115in the machine's native character set.  Thus, on ASCII-based systems,
40116@samp{[a-z]} matched all the lowercase letters, and only the lowercase
40117letters, as the numeric values for the letters from @samp{a} through
40118@samp{z} were contiguous.  (On an EBCDIC system, the range @samp{[a-z]}
40119includes additional nonalphabetic characters as well.)
40120
40121Almost all introductory Unix literature explained range expressions
40122as working in this fashion, and in particular, would teach that the
40123``correct'' way to match lowercase letters was with @samp{[a-z]}, and
40124that @samp{[A-Z]} was the ``correct'' way to match uppercase letters.
40125And indeed, this was true.@footnote{And Life was good.}
40126
40127The 1992 POSIX standard introduced the idea of locales (@pxref{Locales}).
40128Because many locales include other letters besides the plain 26
40129letters of the English alphabet, the POSIX standard added
40130character classes (@pxref{Bracket Expressions}) as a way to match
40131different kinds of characters besides the traditional ones in the ASCII
40132character set.
40133
40134However, the standard @emph{changed} the interpretation of range expressions.
40135In the @code{"C"} and @code{"POSIX"} locales, a range expression like
40136@samp{[a-dx-z]} is still equivalent to @samp{[abcdxyz]}, as in ASCII.
40137But outside those locales, the ordering was defined to be based on
40138@dfn{collation order}.
40139
40140What does that mean?
40141In many locales, @samp{A} and @samp{a} are both less than @samp{B}.
40142In other words, these locales sort characters in dictionary order,
40143and @samp{[a-dx-z]} is typically not equivalent to @samp{[abcdxyz]};
40144instead, it might be equivalent to @samp{[ABCXYabcdxyz]}, for example.
40145
40146This point needs to be emphasized: much literature teaches that you should
40147use @samp{[a-z]} to match a lowercase character.  But on systems with
40148non-ASCII locales, this also matches all of the uppercase characters
40149except @samp{A} or @samp{Z}!  This was a continuous cause of confusion, even well
40150into the twenty-first century.
40151
40152To demonstrate these issues, the following example uses the @code{sub()}
40153function, which does text replacement (@pxref{String Functions}).  Here,
40154the intent is to remove trailing uppercase characters:
40155
40156@example
40157$ @kbd{echo something1234abc | gawk-3.1.8 '@{ sub("[A-Z]*$", ""); print @}'}
40158@print{} something1234a
40159@end example
40160
40161@noindent
40162This output is unexpected, as the @samp{bc} at the end of
40163@samp{something1234abc} should not normally match @samp{[A-Z]*}.
40164This result is due to the locale setting (and thus you may not see
40165it on your system).
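
One way to sidestep the locale dependence is to force the @code{"C"}
locale for the single command, because ranges keep their ASCII meaning
there.  The following is only a minimal sketch and is not part of the
@command{gawk} distribution: the function name @code{strip_trailing_upper}
is hypothetical, and it assumes that some POSIX @command{awk} is available
on the search path as @command{awk}.

```shell
# Hypothetical helper: remove trailing uppercase ASCII letters from each
# argument.  LC_ALL=C forces the "C" locale for the awk process only, so
# [A-Z] means exactly the 26 uppercase ASCII letters.
strip_trailing_upper() {
    printf '%s\n' "$@" | LC_ALL=C awk '{ sub("[A-Z]*$", ""); print }'
}
```

Called as @samp{strip_trailing_upper something1234abc}, the sketch should
leave the lowercase @samp{abc} in place, whatever the user's own locale
settings are.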
40166
40167@cindex Unicode
40168@cindex ASCII
40169Similar considerations apply to other ranges.  For example, @samp{["-/]}
40170is perfectly valid in ASCII, but is not valid in many Unicode locales,
40171such as @code{en_US.UTF-8}.
40172
40173Early versions of @command{gawk} used regexp matching code that was not
40174locale-aware, so ranges had their traditional interpretation.
40175
40176When @command{gawk} switched to using locale-aware regexp matchers,
40177the problems began; especially as both GNU/Linux and commercial Unix
40178vendors started implementing non-ASCII locales, @emph{and making them
40179the default}.  Perhaps the most frequently asked question became something
40180like, ``Why does @samp{[A-Z]} match lowercase letters?!?''
40181
40182@cindex Berry, Karl
40183This situation existed for close to 10 years, if not more, and
40184the @command{gawk} maintainer grew weary of trying to explain that
40185@command{gawk} was being nicely standards-compliant, and that the issue
40186was in the user's locale.  During the development of @value{PVERSION} 4.0,
40187he modified @command{gawk} to always treat ranges in the original,
40188pre-POSIX fashion, unless @option{--posix} was used (@pxref{Options}).@footnote{And
40189thus was born the Campaign for Rational Range Interpretation (or
40190RRI). A number of GNU tools have already implemented this change,
40191or will soon.  Thanks to Karl Berry for coining the phrase ``Rational
40192Range Interpretation.''}
40193
40194Fortunately, shortly before the final release of @command{gawk} 4.0,
40195the maintainer learned that the 2008 standard had changed the
40196definition of ranges, such that outside the @code{"C"} and @code{"POSIX"}
40197locales, the meaning of range expressions was @emph{undefined}.@footnote{See
40198@uref{https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_03_05, the standard}
40199and
40200@uref{https://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xbd_chap09.html#tag_21_09_03_05, its rationale}.}
40201
40202By using this lovely technical term, the standard gives license
40203to implementers to implement ranges in whatever way they choose.
40204The @command{gawk} maintainer chose to apply the pre-POSIX meaning
40205both with the default regexp matching and when @option{--traditional} or
@option{--posix} is used.
40207In all cases @command{gawk} remains POSIX-compliant.
40208
40209@node Contributors
40210@appendixsec Major Contributors to @command{gawk}
40211@cindex @command{gawk} @subentry list of contributors to
40212@quotation
40213@i{Always give credit where credit is due.}
40214@author Anonymous
40215@end quotation
40216
40217This @value{SECTION} names the major contributors to @command{gawk}
40218and/or this @value{DOCUMENT}, in approximate chronological order:
40219
40220@itemize @value{BULLET}
40221@item
40222@cindex Aho, Alfred
40223@cindex Weinberger, Peter
40224@cindex Kernighan, Brian
40225Dr.@: Alfred V.@: Aho,
40226Dr.@: Peter J.@: Weinberger, and
40227Dr.@: Brian W.@: Kernighan, all of Bell Laboratories,
40228designed and implemented Unix @command{awk},
40229from which @command{gawk} gets the majority of its feature set.
40230
40231@item
40232@cindex Rubin, Paul
40233Paul Rubin
40234did the initial design and implementation in 1986, and wrote
40235the first draft (around 40 pages) of this @value{DOCUMENT}.
40236
40237@item
40238@cindex Fenlason, Jay
40239Jay Fenlason
40240finished the initial implementation.
40241
40242@item
40243@cindex Close, Diane
40244Diane Close
40245revised the first draft of this @value{DOCUMENT}, bringing it
40246to around 90 pages.
40247
40248@item
40249@cindex Stallman, Richard
40250Richard Stallman
40251helped finish the implementation and the initial draft of this
40252@value{DOCUMENT}.
40253He is also the founder of the FSF and the GNU Project.
40254
40255@item
40256@cindex Woods, John
40257John Woods
40258contributed parts of the code (mostly fixes) in
40259the initial version of @command{gawk}.
40260
40261@item
40262@cindex Trueman, David
40263In 1988,
40264David Trueman
40265took over primary maintenance of @command{gawk},
40266making it compatible with ``new'' @command{awk}, and
40267greatly improving its performance.
40268
40269@item
40270@cindex Kwok, Conrad
40271@cindex Garfinkle, Scott
40272@cindex Williams, Kent
40273Conrad Kwok,
40274Scott Garfinkle,
40275and
40276Kent Williams
40277did the initial ports to MS-DOS with various versions of MSC.
40278
40279@item
40280@cindex Rankin, Pat
40281Pat Rankin
40282provided the VMS port and its documentation.
40283
40284@item
40285@cindex Peterson, Hal
40286Hal Peterson
40287provided help in porting @command{gawk} to Cray systems.
40288(This is no longer supported.)
40289
40290@item
40291@cindex Rommel, Kai Uwe
40292Kai Uwe Rommel
40293provided the initial port to OS/2 and its documentation.
40294
40295@item
40296@cindex Jaegermann, Michal
40297Michal Jaegermann
40298provided the port to Atari systems and its documentation.
40299(This port is no longer supported.)
40300He continues to provide portability checking,
40301and has done a lot of work to make sure @command{gawk}
40302works on non-32-bit systems.
40303
40304@item
40305@cindex Fish, Fred
40306Fred Fish
40307provided the port to Amiga systems and its documentation.
40308(With Fred's sad passing, this is no longer supported.)
40309
40310@item
40311@cindex Deifik, Scott
40312Scott Deifik
40313formerly maintained the MS-DOS port using DJGPP.
40314
40315@item
40316@cindex Zaretskii, Eli
40317Eli Zaretskii
40318currently maintains the MS-Windows port using MinGW.
40319
40320@item
40321@cindex Grigera, Juan
40322Juan Grigera
40323provided a port to Windows32 systems.
40324(This is no longer supported.)
40325
40326
40327@item
40328@cindex Hankerson, Darrel
40329For many years,
40330Dr.@: Darrel Hankerson
40331acted as coordinator for the various ports to different PC platforms
40332and created binary distributions for various PC operating systems.
40333He was also instrumental in keeping the documentation up to date for
40334the various PC platforms.
40335
40336@item
40337@cindex Zoulas, Christos
40338Christos Zoulas
40339provided the @code{extension()}
40340built-in function for dynamically adding new functions.
40341(This was obsoleted at @command{gawk} 4.1.)
40342
40343@item
40344@cindex Kahrs, J@"urgen
40345J@"urgen Kahrs
40346contributed the initial version of the TCP/IP networking
40347code and documentation, and motivated the inclusion of the @samp{|&} operator.
40348
40349@item
40350@cindex Davies, Stephen
40351Stephen Davies
40352provided the initial port to Tandem systems and its documentation.
40353(However, this is no longer supported.)
40354He was also instrumental in the initial work to integrate the
40355byte-code internals into the @command{gawk} code base.
40356Additionally, he did most of the work enabling the pretty-printer
40357to preserve and output comments.
40358
40359@item
40360@cindex Woehlke, Matthew
40361Matthew Woehlke
40362provided improvements for Tandem's POSIX-compliant systems.
40363
40364@item
40365@cindex Brown, Martin
40366Martin Brown
40367provided the port to BeOS and its documentation.
40368(This is no longer supported.)
40369
40370@item
40371@cindex Peters, Arno
40372Arno Peters
40373did the initial work to convert @command{gawk} to use
40374GNU Automake and GNU @command{gettext}.
40375
40376@item
40377@cindex Broder, Alan J.@:
40378Alan J.@: Broder
40379provided the initial version of the @code{asort()} function
40380as well as the code for the optional third argument to the
40381@code{match()} function.
40382
40383@item
40384@cindex Buening, Andreas
40385Andreas Buening
40386updated the @command{gawk} port for OS/2.
40387
40388@item
40389@cindex Hasegawa, Isamu
40390Isamu Hasegawa,
40391of IBM in Japan, contributed support for multibyte characters.
40392
40393@item
40394@cindex Benzinger, Michael
40395Michael Benzinger contributed the initial code for @code{switch} statements.
40396
40397@item
40398@cindex McPhee, Patrick T.J.@:
40399Patrick T.J.@: McPhee contributed the code for dynamic loading in Windows32
40400environments.
40401(This is no longer supported.)
40402
40403@item
40404@cindex Wallin, Anders
40405Anders Wallin helped keep the VMS port going for several years.
40406
40407@item
40408@cindex Gordon, Assaf
40409Assaf Gordon contributed the initial code to implement the
40410@option{--sandbox} option.
40411
40412@item
40413@cindex Haque, John
40414John Haque made the following contributions:
40415
40416@itemize @value{MINUS}
40417@item
40418The modifications to convert @command{gawk}
40419into a byte-code interpreter, including the debugger
40420
40421@item
40422The addition of true arrays of arrays
40423
40424@item
40425The additional modifications for support of arbitrary-precision arithmetic
40426
40427@item
40428The initial text of
40429@ref{Arbitrary Precision Arithmetic}
40430
40431@item
40432The work to merge the three versions of @command{gawk}
40433into one, for the 4.1 release
40434
40435@item
40436Improved array internals for arrays indexed by integers
40437
40438@item
The improved array sorting features, driven by John together
with Pat Rankin
40441@end itemize
40442
40443@cindex Papadopoulos, Panos
40444@item
40445Panos Papadopoulos contributed the original text for @ref{Include Files}.
40446
40447@item
40448@cindex Yawitz, Efraim
40449Efraim Yawitz contributed the original text for @ref{Debugger}.
40450
40451@item
40452@cindex Schorr, Andrew
40453The development of the extension API first released with
40454@command{gawk} 4.1 was driven primarily by
40455Arnold Robbins and Andrew Schorr, with notable contributions from
40456the rest of the development team.
40457
40458@cindex Malmberg, John
40459@item
40460John Malmberg contributed significant improvements to the
40461OpenVMS port and the related documentation.
40462
40463@item
40464@cindex Colombo, Antonio
40465Antonio Giovanni Colombo rewrote a number of examples in the early
40466chapters that were severely dated, for which I am incredibly grateful.
40467He also provided and maintains the Italian translation.
40468
40469@item
40470@cindex Curreli, Marco
40471Marco Curreli, together with Antonio Colombo, translated this
40472@value{DOCUMENT} into Italian.  It is included in the @command{gawk}
40473distribution.
40474
40475@item
40476@cindex Guerrero, Juan Manuel
40477Juan Manuel Guerrero took over maintenance of the DJGPP port.
40478
40479@item
40480@cindex Jannick
40481``Jannick'' provided support for MSYS2.
40482
40483@item
40484@cindex Robbins @subentry Arnold
40485Arnold Robbins
40486has been working on @command{gawk} since 1988, at first
40487helping David Trueman, and as the primary maintainer since around 1994.
40488@end itemize
40489
40490@node History summary
40491@appendixsec Summary
40492
40493@itemize @value{BULLET}
40494@item
40495The @command{awk} language has evolved over time. The first release
40496was with V7 Unix, circa 1978.  In 1987, for System V Release 3.1,
40497major additions, including user-defined functions, were made to the language.
40498Additional changes were made for System V Release 4, in 1989.
40499Since then, further minor changes have happened under the auspices of the
40500POSIX standard.
40501
40502@item
40503Brian Kernighan's @command{awk} provides a small number of extensions
40504that are implemented in common with other versions of @command{awk}.
40505
40506@item
40507@command{gawk} provides a large number of extensions over POSIX @command{awk}.
40508They can be disabled with either the @option{--traditional} or @option{--posix}
40509options.
40510
40511@item
40512@cindex ASCII
40513@cindex EBCDIC
40514The interaction of POSIX locales and regexp matching in @command{gawk} has been confusing over
40515the years. Today, @command{gawk} implements Rational Range Interpretation, where
ranges of the form @samp{[a-z]} match @emph{only} the characters numerically from
40517@samp{a} through @samp{z} in the machine's native character set.  Usually this is ASCII,
40518but it can be EBCDIC on IBM S/390 systems.
40519
40520@item
40521Many people have contributed to @command{gawk} development over the years.
40522We hope that the list provided in this @value{CHAPTER} is complete and gives
40523the appropriate credit where credit is due.
40524
40525@end itemize
40526
40527@node Installation
40528@appendix Installing @command{gawk}
40529
40530@c last two commas are part of see also
40531@cindex operating systems
40532@cindex operating systems @seealso{GNU/Linux}
40533@cindex operating systems @seealso{PC operating systems}
40534@cindex operating systems @seealso{Unix}
40535@cindex @command{gawk} @subentry installing
40536@cindex installing @command{gawk}
40537This appendix provides instructions for installing @command{gawk} on the
40538various platforms that are supported by the developers.  The primary
40539developer supports GNU/Linux (and Unix), whereas the other ports are
40540contributed.
@xref{Bugs},
for the email addresses of the people who maintain
the respective ports.
40544
40545@menu
40546* Gawk Distribution::           What is in the @command{gawk} distribution.
40547* Unix Installation::           Installing @command{gawk} under various
40548                                versions of Unix.
40549* Non-Unix Installation::       Installation on Other Operating Systems.
40550* Bugs::                        Reporting Problems and Bugs.
40551* Other Versions::              Other freely available @command{awk}
40552                                implementations.
40553* Installation summary::        Summary of installation.
40554@end menu
40555
40556@node Gawk Distribution
40557@appendixsec The @command{gawk} Distribution
40558@cindex source code @subentry @command{gawk}
40559
40560This @value{SECTION} describes how to get the @command{gawk}
40561distribution, how to extract it, and then what is in the various files and
40562subdirectories.
40563
40564@menu
40565* Getting::                     How to get the distribution.
40566* Extracting::                  How to extract the distribution.
40567* Distribution contents::       What is in the distribution.
40568@end menu
40569
40570@node Getting
40571@appendixsubsec Getting the @command{gawk} Distribution
40572@cindex @command{gawk} @subentry source code, obtaining
40573There are two ways to get GNU software:
40574
40575@itemize @value{BULLET}
40576@item
40577Copy it from someone else who already has it.
40578
40579@cindex FSF (Free Software Foundation)
40580@cindex Free Software Foundation (FSF)
40581@item
40582Retrieve @command{gawk}
40583from the Internet host
40584@code{ftp.gnu.org}, in the directory @file{/gnu/gawk}.
40585Both anonymous @command{ftp} and @code{http} access are supported.
40586If you have the @command{wget} program, you can use a command like
40587the following:
40588
40589@example
40590wget https://ftp.gnu.org/gnu/gawk/gawk-@value{VERSION}.@value{PATCHLEVEL}.tar.gz
40591@end example
40592@end itemize
40593
40594The GNU software archive is mirrored around the world.
40595The up-to-date list of mirror sites is available from
40596@uref{https://www.gnu.org/order/ftp.html, the main FSF website}.
40597Try to use one of the mirrors; they
40598will be less busy, and you can usually find one closer to your site.
40599
40600You may also retrieve the @command{gawk} source code from the official
40601Git repository; for more information see @ref{Accessing The Source}.
40602
40603@node Extracting
40604@appendixsubsec Extracting the Distribution
40605@command{gawk} is distributed as several @command{tar} files compressed with
40606different compression programs: @command{gzip}, @command{bzip2},
40607and @command{xz}. For simplicity, the rest of these instructions assume
40608you are using the one compressed with the GNU Gzip program (@command{gzip}).
40609
40610Once you have the distribution (e.g.,
40611@file{gawk-@value{VERSION}.@value{PATCHLEVEL}.tar.gz}),
40612use @command{gzip} to expand the
40613file and then use @command{tar} to extract it.  You can use the following
40614pipeline to produce the @command{gawk} distribution:
40615
40616@example
40617gzip -d -c gawk-@value{VERSION}.@value{PATCHLEVEL}.tar.gz | tar -xvpf -
40618@end example
40619
40620On a system with GNU @command{tar}, you can let @command{tar}
40621do the decompression for you:
40622
40623@example
40624tar -xvpzf gawk-@value{VERSION}.@value{PATCHLEVEL}.tar.gz
40625@end example
40626
40627@noindent
40628Extracting the archive
40629creates a directory named @file{gawk-@value{VERSION}.@value{PATCHLEVEL}}
40630in the current directory.
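
The same pipeline idea carries over to the other compressed forms.  The
following is a hedged sketch rather than anything shipped with
@command{gawk}: the function names @code{decompressor_for} and
@code{extract_tarball} are hypothetical, and the sketch assumes the
conventional @file{.tar.gz}, @file{.tar.bz2}, and @file{.tar.xz} suffixes.

```shell
# Hypothetical helper: print the decompression program that matches a
# tarball's suffix, or fail for a suffix this sketch does not know.
decompressor_for() {
    case $1 in
        *.tar.gz)  echo gzip ;;
        *.tar.bz2) echo bzip2 ;;
        *.tar.xz)  echo xz ;;
        *) echo "unrecognized suffix: $1" >&2; return 1 ;;
    esac
}

# Hypothetical helper: decompress to standard output and let tar read
# the archive from the pipe, as in the gzip pipeline shown earlier.
extract_tarball() {
    prog=$(decompressor_for "$1") || return 1
    "$prog" -d -c "$1" | tar -xvpf -
}
```

All three programs accept @option{-d} (decompress) and @option{-c} (write
to standard output), which is why one pipeline shape serves for each of them.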
40631
40632The distribution @value{FN} is of the form
40633@file{gawk-@var{V}.@var{R}.@var{P}.tar.gz}.
40634The @var{V} represents the major version of @command{gawk},
40635the @var{R} represents the current release of version @var{V}, and
40636the @var{P} represents a @dfn{patch level}, meaning that minor bugs have
40637been fixed in the release.  The current patch level is @value{PATCHLEVEL},
40638but when retrieving distributions, you should get the version with the highest
40639version, release, and patch level.  (Note, however, that patch levels greater than
40640or equal to 60 denote ``beta'' or nonproduction software; you might not want
40641to retrieve such a version unless you don't mind experimenting.)
40642If you are not on a Unix or GNU/Linux system, you need to make other arrangements
40643for getting and extracting the @command{gawk} distribution.  You should consult
40644a local expert.
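
The naming convention makes it possible to check for a beta release
mechanically.  The following is a minimal sketch under the assumption that
the name really has the three-part @file{gawk-@var{V}.@var{R}.@var{P}.tar.gz}
form; the function name @code{gawk_is_beta} is hypothetical.

```shell
# Hypothetical helper: succeed (exit status 0) if the patch level P in a
# name of the form gawk-V.R.P.tar.gz is 60 or greater.
gawk_is_beta() {
    name=${1##*/}           # drop any leading directory
    name=${name#gawk-}      # drop the "gawk-" prefix
    name=${name%.tar.*}     # drop the ".tar.gz" (or similar) suffix
    case $name in
        *.*.*) ;;           # V.R.P: all three parts are present
        *) return 1 ;;      # anything else is not handled by this sketch
    esac
    patch=${name##*.}       # keep what follows the last dot
    [ "$patch" -ge 60 ] 2>/dev/null
}
```

For instance, @samp{gawk_is_beta gawk-5.1.60.tar.gz} succeeds, whereas the
same test on @file{gawk-5.1.1.tar.gz} fails.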
40645
40646@node Distribution contents
40647@appendixsubsec Contents of the @command{gawk} Distribution
40648@cindex @command{gawk} @subentry distribution
40649
40650The @command{gawk} distribution has a number of C source files,
40651documentation files,
40652subdirectories, and files related to the configuration process
40653(@pxref{Unix Installation}),
40654as well as several subdirectories related to different non-Unix
40655operating systems:
40656
40657@table @asis
40658@item Various @samp{.c}, @samp{.y}, and @samp{.h} files
40659These files contain the actual @command{gawk} source code.
40660@end table
40661
40662@table @file
40663@item support/*
40664C header and source files for routines that @command{gawk}
40665uses, but that are not part of its core functionality.
40666For example, argument parsing, regular expression matching,
40667and random number generating routines are all kept here.
40668
40669@item ABOUT-NLS
40670A file containing information about GNU @command{gettext} and translations.
40671
40672@item AUTHORS
40673A file with some information about the authorship of @command{gawk}.
40674It exists only to satisfy the pedants at the Free Software Foundation.
40675
40676@item README
40677@itemx README_d/README.*
40678Descriptive files: @file{README} for @command{gawk} under Unix and the
40679rest for the various hardware and software combinations.
40680
40681@item INSTALL
40682A file providing an overview of the configuration and installation process.
40683
40684@item ChangeLog
40685A detailed list of source code changes as bugs are fixed or improvements made.
40686There are similar files in all of the subdirectories.
40687
40688@item ChangeLog.0
40689@itemx ChangeLog.1
40690Older lists of source code changes.
40691There are similar files in all of the subdirectories.
40692
40693@item NEWS
40694A list of changes to @command{gawk} since the last release or patch.
40695There may be similar files in other subdirectories.
40696
40697@item NEWS.0
40698@itemx NEWS.1
40699Older lists of changes to @command{gawk}.
40700There may be similar files in other subdirectories.
40701
40702@item COPYING
40703The GNU General Public License.
40704
40705@item POSIX.STD
40706A description of behaviors in the POSIX standard for @command{awk} that
40707are left undefined, or where @command{gawk} may not comply fully, as well
40708as a list of things that the POSIX standard should describe but does not.
40709
40710@cindex artificial intelligence, @command{gawk} and
40711@item doc/awkforai.txt
40712Pointers to the original draft of
40713a short article describing why @command{gawk} is a good language for
40714artificial intelligence (AI) programming.
40715
40716@item doc/bc_notes
40717A brief description of @command{gawk}'s ``byte code'' internals.
40718
40719@item doc/README.card
40720@itemx doc/ad.block
40721@itemx doc/awkcard.in
40722@itemx doc/cardfonts
40723@itemx doc/colors
40724@itemx doc/macros
40725@itemx doc/no.colors
40726@itemx doc/setter.outline
40727The @command{troff} source for a five-color @command{awk} reference card.
40728A modern version of @command{troff} such as GNU @command{troff} (@command{groff}) is
40729needed to produce the color version. See the file @file{README.card}
40730for instructions if you have an older @command{troff}.
40731
40732@item doc/gawk.1
40733The @command{troff} source for a manual page describing @command{gawk}.
40734This is distributed for the convenience of Unix users.
40735
40736@cindex Texinfo
40737@item doc/gawktexi.in
40738@itemx doc/sidebar.awk
40739The Texinfo source file for this @value{DOCUMENT}.
40740It should be processed by @file{doc/sidebar.awk}
40741before processing with @command{texi2dvi} or @command{texi2pdf}
40742to produce a printed document, and
40743with @command{makeinfo} to produce an Info or HTML file.
40744The @file{Makefile} takes care of this processing and produces
40745printable output via @command{texi2dvi} or @command{texi2pdf}.
40746
40747@item doc/gawk.texi
40748The file produced after processing @file{gawktexi.in}
40749with @file{sidebar.awk}.
40750
40751@item doc/gawk.info
40752The generated Info file for this @value{DOCUMENT}.
40753
40754@item doc/gawkinet.texi
40755The Texinfo source file for
40756@ifinfo
40757@inforef{Top, , General Introduction, gawkinet, @value{GAWKINETTITLE}}.
40758@end ifinfo
40759@ifnotinfo
40760@cite{@value{GAWKINETTITLE}}.
40761@end ifnotinfo
40762It should be processed with @TeX{}
40763(via @command{texi2dvi} or @command{texi2pdf})
40764to produce a printed document and
40765with @command{makeinfo} to produce an Info or HTML file.
40766
40767@item doc/gawkinet.info
40768The generated Info file for
40769@cite{@value{GAWKINETTITLE}}.
40770
40771@item doc/gawkworkflow.texi
40772The Texinfo source file for
40773@ifinfo
40774@inforef{Top, , General Introduction, gawkworkflow, @value{GAWKWORKFLOWTITLE}}.
40775@end ifinfo
40776@ifnotinfo
40777@cite{@value{GAWKWORKFLOWTITLE}}.
40778@end ifnotinfo
40779It should be processed with @TeX{}
40780(via @command{texi2dvi} or @command{texi2pdf})
40781to produce a printed document and
40782with @command{makeinfo} to produce an Info or HTML file.
40783
40784@item doc/gawkworkflow.info
40785The generated Info file for
40786@cite{@value{GAWKWORKFLOWTITLE}}.
40787
40788@item doc/igawk.1
40789The @command{troff} source for a manual page describing the @command{igawk}
40790program presented in
40791@ref{Igawk Program}.
40792(Since @command{gawk} can do its own @code{@@include} processing,
neither @command{igawk} nor @file{igawk.1} is installed.)
40794
40795@item doc/it/*
40796Files for the Italian translation of this @value{DOCUMENT}, produced and
40797contributed by Antonio Colombo and Marco Curreli.
40798
40799@item doc/Makefile.in
40800The input file used during the configuration process to generate the
40801actual @file{Makefile} for creating the documentation.
40802
40803@item Makefile.am
40804@itemx */Makefile.am
40805Files used by the GNU Automake software for generating
40806the @file{Makefile.in} files used by Autoconf and
40807@command{configure}.
40808
40809@item Makefile.in
40810@itemx aclocal.m4
40811@itemx bisonfix.awk
40812@itemx config.guess
40813@itemx configh.in
40814@itemx configure.ac
40815@itemx configure
40816@itemx custom.h
40817@itemx depcomp
40818@itemx install-sh
40819@itemx missing_d/*
40820@itemx mkinstalldirs
40821@itemx m4/*
40822These files and subdirectories are used when configuring and compiling
40823@command{gawk} for various Unix systems.  Most of them are explained
40824in @ref{Unix Installation}. The rest are there to support the main
40825infrastructure.
40826
40827@item po/*
The @file{po} directory contains message translations.
40829
40830@item awklib/extract.awk
40831@itemx awklib/Makefile.am
40832@itemx awklib/Makefile.in
40833@itemx awklib/eg/*
40834The @file{awklib} directory contains a copy of @file{extract.awk}
40835(@pxref{Extract Program}),
40836which can be used to extract the sample programs from the Texinfo
40837source file for this @value{DOCUMENT}. It also contains a @file{Makefile.in} file, which
40838@command{configure} uses to generate a @file{Makefile}.
40839@file{Makefile.am} is used by GNU Automake to create @file{Makefile.in}.
40840The library functions from
40841@ref{Library Functions},
40842are included as ready-to-use files in the @command{gawk} distribution.
40843They are installed as part of the installation process.
40844The rest of the programs in this @value{DOCUMENT} are available in appropriate
40845subdirectories of @file{awklib/eg}.
40846
40847@item extension/*
40848The source code, manual pages, and infrastructure files for
40849the sample extensions included with @command{gawk}.
40850@xref{Dynamic Extensions}, for more information.
40851
40852@item extras/*
40853Additional non-essential files.  Currently, this directory contains some shell
40854startup files to be installed in @file{/etc/profile.d} to aid in manipulating
40855the @env{AWKPATH} and @env{AWKLIBPATH} environment variables.
40856@xref{Shell Startup Files}, for more information.
40857
40858@item posix/*
40859Files needed for building @command{gawk} on POSIX-compliant systems.
40860
40861@item pc/*
40862Files needed for building @command{gawk} under MS-Windows
(@pxref{PC Installation}, for details).
40864
40865@item vms/*
40866Files needed for building @command{gawk} under Vax/VMS and OpenVMS
(@pxref{VMS Installation}, for details).
40868
40869@item test/*
40870A test suite for
40871@command{gawk}.  You can use @samp{make check} from the top-level @command{gawk}
40872directory to run your version of @command{gawk} against the test suite.
40873If @command{gawk} successfully passes @samp{make check}, then you can
40874be confident of a successful port.
40875@end table
40876
40877@node Unix Installation
40878@appendixsec Compiling and Installing @command{gawk} on Unix-Like Systems
40879
40880Usually, you can compile and install @command{gawk} by typing only two
40881commands.  However, if you use an unusual system, you may need
40882to configure @command{gawk} for your system yourself.
40883
40884@menu
40885* Quick Installation::               Compiling @command{gawk} under Unix.
40886* Shell Startup Files::              Shell convenience functions.
40887* Additional Configuration Options:: Other compile-time options.
40888* Configuration Philosophy::         How it's all supposed to work.
40889* Compiling from Git::               Compiling from Git.
40890* Building the Documentation::       Building the Documentation.
40891@end menu
40892
40893@node Quick Installation
40894@appendixsubsec Compiling @command{gawk} for Unix-Like Systems
40895
40896@menu
40897* Compiling with MPFR::         Building with MPFR.
40898@end menu
40899
40900The normal installation steps should work on all modern commercial
40901Unix-derived systems, GNU/Linux, BSD-based systems, and the Cygwin
40902environment for MS-Windows.
40903
40904After you have extracted the @command{gawk} distribution, @command{cd}
40905to @file{gawk-@value{VERSION}.@value{PATCHLEVEL}}.  As with most GNU
40906software, you configure @command{gawk} for your system by running the
40907@command{configure} program.  This program is a Bourne shell script that
40908is generated automatically using GNU Autoconf.
40909@ifnotinfo
40910(The Autoconf software is
40911described fully in
40912@cite{Autoconf---Generating Automatic Configuration Scripts},
40913which can be found online at
40914@uref{https://www.gnu.org/software/autoconf/manual/index.html,
40915the Free Software Foundation's website}.)
40916@end ifnotinfo
40917@ifinfo
40918(The Autoconf software is described fully starting with
40919@inforef{Top, , Autoconf, autoconf,Autoconf---Generating Automatic Configuration Scripts}.)
40920@end ifinfo
40921
40922To configure @command{gawk}, simply run @command{configure}:
40923
40924@example
40925sh ./configure
40926@end example
40927
40928This produces a @file{Makefile} and @file{config.h} tailored to your system.
40929The @file{config.h} file describes various facts about your system.
40930You might want to edit the @file{Makefile} to
40931change the @code{CFLAGS} variable, which controls
40932the command-line options that are passed to the C compiler (such as
40933optimization levels or compiling for debugging).
40934
40935Alternatively, you can add your own values for most @command{make}
40936variables on the command line, such as @code{CC} and @code{CFLAGS}, when
40937running @command{configure}:
40938
40939@example
40940CC=cc CFLAGS=-g sh ./configure
40941@end example
40942
40943@noindent
40944See the file @file{INSTALL} in the @command{gawk} distribution for
40945all the details.
40946
40947After you have run @command{configure} and possibly edited the @file{Makefile},
40948type:
40949
40950@example
40951make
40952@end example
40953
40954@noindent
40955Shortly thereafter, you should have an executable version of @command{gawk}.
40956That's all there is to it!
40957To verify that @command{gawk} is working properly,
40958run @samp{make check}.  All of the tests should succeed.
40959If these steps do not work, or if any of the tests fail,
40960check the files in the @file{README_d} directory to see if you've
40961found a known problem.  If the failure is not described there,
40962send in a bug report (@pxref{Bugs}).
40963
40964Of course, once you've built @command{gawk}, it is likely that you will
40965wish to install it.  To do so, you need to run the command @samp{make
40966install}, as a user with the appropriate permissions.  How to do this
40967varies by system, but on many systems you can use the @command{sudo}
40968command to do so.  The command then becomes @samp{sudo make install}. It
40969is likely that you will be asked for your password, and you will have
40970to have been set up previously as a user who is allowed to run the
40971@command{sudo} command.
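
If you cannot or prefer not to use @command{sudo}, the standard Autoconf
@option{--prefix} option lets you install under a directory that you own.
The following is a sketch of the whole sequence under that assumption.  The
function name @code{build_gawk} and the default location
@file{$HOME/.local} are illustrative choices, not something that the
@command{gawk} distribution prescribes.

```shell
# Hypothetical helper: configure, build, test, and install gawk under a
# per-user prefix, so that no elevated permissions are needed.
# Run it from the top-level directory of the extracted distribution.
build_gawk() {
    prefix=${1:-"$HOME/.local"}
    sh ./configure --prefix="$prefix" &&   # generate Makefile and config.h
    make &&                                # compile gawk
    make check &&                          # run the test suite
    make install                           # install under $prefix
}
```

Each step runs only if the previous one succeeded, so a failing
@samp{make check} stops the sequence before anything is installed.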
40972
40973
40974@node Compiling with MPFR
40975@appendixsubsubsec Building With MPFR
40976
40977@cindex MPFR library, building with
40978Use of the MPFR library with @command{gawk}
40979is an optional feature: if you have the MPFR and GMP libraries already installed
40980when you configure and build @command{gawk},
@command{gawk} will automatically be able to use them.
40982
40983You can install these libraries from source code by fetching them
40984from the GNU distribution site at @code{ftp.gnu.org}.
40985
40986Most modern systems provide package managers which save you the trouble
40987of building from source. They fetch and install the library header files
40988and binaries for you.  You will need to research how to do this for
40989your particular system.
40990
40991@node Shell Startup Files
40992@appendixsubsec Shell Startup Files
40993
40994The distribution contains shell startup files @file{gawk.sh} and
40995@file{gawk.csh}, containing functions to aid in manipulating
40996the @env{AWKPATH} and @env{AWKLIBPATH} environment variables.
40997On a Fedora GNU/Linux system, these files should be installed in @file{/etc/profile.d};
40998on other platforms, the appropriate location may be different.
40999
41000@table @command
41001
41002@cindex @command{gawkpath_default} shell function
41003@cindex shell function @subentry @command{gawkpath_default}
41004@item gawkpath_default
41005Reset the @env{AWKPATH} environment variable to its default value.
41006
41007@cindex @command{gawkpath_prepend} shell function
41008@cindex shell function @subentry @command{gawkpath_prepend}
41009@item gawkpath_prepend
41010Add the argument to the front of the @env{AWKPATH} environment variable.
41011
41012@cindex @command{gawkpath_append} shell function
41013@cindex shell function @subentry @command{gawkpath_append}
41014@item gawkpath_append
41015Add the argument to the end of the @env{AWKPATH} environment variable.
41016
41017@cindex @command{gawklibpath_default} shell function
41018@cindex shell function @subentry @command{gawklibpath_default}
41019@item gawklibpath_default
41020Reset the @env{AWKLIBPATH} environment variable to its default value.
41021
41022@cindex @command{gawklibpath_prepend} shell function
41023@cindex shell function @subentry @command{gawklibpath_prepend}
41024@item gawklibpath_prepend
41025Add the argument to the front of the @env{AWKLIBPATH} environment variable.
41026
41027@cindex @command{gawklibpath_append} shell function
41028@cindex shell function @subentry @command{gawklibpath_append}
41029@item gawklibpath_append
41030Add the argument to the end of the @env{AWKLIBPATH} environment variable.
41031
41032@end table
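
As an illustration, the following sketch loads the functions into a
running Bourne-style shell and adds a directory to the search path.
The @file{/etc/profile.d} location is the Fedora one mentioned earlier,
and @file{$HOME/awklib} is a hypothetical directory used only for this
example; adjust both for your system:

@example
# Sketch only: the startup file location varies by platform,
# and $HOME/awklib is a made-up directory.
if [ -r /etc/profile.d/gawk.sh ]; then
    . /etc/profile.d/gawk.sh
    gawkpath_append "$HOME/awklib"
fi
echo "AWKPATH is now: $AWKPATH"
@end example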
41033
41034
41035@node Additional Configuration Options
41036@appendixsubsec Additional Configuration Options
41037@cindex @command{gawk} @subentry configuring @subentry options
41038@cindex configuration options, @command{gawk}
41039
41040There are several additional options you may use on the @command{configure}
41041command line when compiling @command{gawk} from scratch, including:
41042
41043@table @code
41044
41045@cindex @option{--disable-extensions} configuration option
41046@cindex configuration option @subentry @option{--disable-extensions}
41047@item --disable-extensions
41048Disable the extension mechanism within @command{gawk}. With this
41049option, it is not possible to use dynamic extensions.  This also
41050disables configuring and building the sample extensions in the
41051@file{extension} directory.
41052
41053This option may be useful for cross-compiling.
41054The default action is to dynamically check if the extensions
41055can be configured and compiled.
41056
41057@cindex @option{--disable-lint} configuration option
41058@cindex configuration option @subentry @option{--disable-lint}
41059@item --disable-lint
41060Disable all lint checking within @command{gawk}.  The
41061@option{--lint} and @option{--lint-old} options
41062(@pxref{Options})
41063are accepted, but silently do nothing.
41064Similarly, setting the @code{LINT} variable
41065(@pxref{User-modified})
41066has no effect on the running @command{awk} program.
41067
41068When used with the GNU Compiler Collection's (GCC's)
41069automatic dead-code-elimination, this option
41070cuts almost 23K bytes off the size of the @command{gawk}
41071executable on GNU/Linux x86_64 systems.  Results on other systems and
41072with other compilers are likely to vary.
41073Using this option may bring you some slight performance improvement.
41074
41075@quotation CAUTION
41076Using this option will cause some of the tests in the test suite
41077to fail.  This option may be removed at a later date.
41078@end quotation
41079
41080@cindex @option{--disable-mpfr} configuration option
41081@cindex configuration option @subentry @option{--disable-mpfr}
41082@item --disable-mpfr
41083Skip checking for the MPFR and GMP libraries. This is useful
41084mainly for the developers, to make sure nothing breaks if
41085MPFR support is not available.
41086
41087@cindex @option{--disable-nls} configuration option
41088@cindex configuration option @subentry @option{--disable-nls}
41089@item --disable-nls
41090Disable all message-translation facilities.
41091This is usually not desirable, but it may bring you some slight performance
41092improvement.
41093
41094@cindex @option{--enable-versioned-extension-dir} configuration option
41095@cindex configuration option @subentry @option{--enable-versioned-extension-dir}
41096@item --enable-versioned-extension-dir
41097Use a versioned directory for extensions.  The directory name will
41098include the major and minor API versions in it. This makes it possible
41099to keep extensions for different API versions on the same system
41100without their conflicting with one another.
41101
41102@end table
41103
41104Use the command @samp{./configure --help} to see the full list of
41105options supplied by @command{configure}.
41106
41107@node Configuration Philosophy
41108@appendixsubsec The Configuration Process
41109
41110@cindex @command{gawk} @subentry configuring
41111This @value{SECTION} is of interest only if you know something about using the
41112C language and Unix-like operating systems.
41113
41114The source code for @command{gawk} generally attempts to adhere to formal
41115standards wherever possible.  This means that @command{gawk} uses library
41116routines that are specified by the ISO C standard and by the POSIX
41117operating system interface standard.
41118The @command{gawk} source code requires using an ISO C compiler (the 1999
41119standard).
41120
41121Many Unix systems do not support all of either the ISO or the
41122POSIX standards.  The @file{missing_d} subdirectory in the @command{gawk}
41123distribution contains replacement versions of those functions that are
41124most likely to be missing.
41125
41126The @file{config.h} file that @command{configure} creates contains
41127definitions that describe features of the particular operating system
where you are attempting to compile @command{gawk}.  This file
describes three things: which header files are available, so that
they can be correctly included; which (supposedly) standard functions
are actually available in your C libraries; and various miscellaneous
facts about your operating system.  For example, there may not be an
41133@code{st_blksize} element in the @code{stat} structure.  In this case,
41134@samp{HAVE_STRUCT_STAT_ST_BLKSIZE} is undefined.
41135
41136@cindex @code{custom.h} file
41137It is possible for your C compiler to lie to @command{configure}. It may
41138do so by not exiting with an error when a library function is not
41139available.  To get around this, edit the @file{custom.h} file.
41140Use an @samp{#ifdef} that is appropriate for your system, and either
41141@code{#define} any constants that @command{configure} should have defined but
41142didn't, or @code{#undef} any constants that @command{configure} defined and
41143should not have.  The @file{custom.h} file is automatically included by
41144the @file{config.h} file.
41145
41146It is also possible that the @command{configure} program generated by
41147Autoconf will not work on your system in some other fashion.
41148If you do have a problem, the @file{configure.ac} file is the input for
41149Autoconf.  You may be able to change this file and generate a
41150new version of @command{configure} that works on your system
41151(@pxref{Bugs}
41152for information on how to report problems in configuring @command{gawk}).
41153The same mechanism may be used to send in updates to @file{configure.ac}
41154and/or @file{custom.h}.
41155
41156@node Compiling from Git
41157@appendixsubsec Compiling from Git
41158
41159Building @command{gawk} directly from the development source control
41160repository is possible, but not recommended for everyday users, as the
41161code may not be as stable as released versions are.  If you really do
41162want to do that, here are the steps:
41163
41164@example
41165git clone https://git.savannah.gnu.org/r/gawk.git
41166cd gawk
41167./bootstrap.sh && ./configure && make && make check
41168@end example
41169
41170@node Building the Documentation
41171@appendixsubsec Building the Documentation
41172
41173@cindex documentation @subentry building @subentry Info files
41174The generated Info documentation is included in the distribution
41175@command{tar} files and in the Git source code repository; you should
41176not need to rebuild it. However, if it needs to be done, simply running
41177@command{make} will do it, assuming that you have a recent enough version
41178of @command{makeinfo} installed.
41179
41180@cindex documentation @subentry building @subentry PDF
41181If you wish to build the PDF version of the manuals, you will need
41182to have @TeX{} installed, and possibly additional packages that
41183provide the necessary fonts and tools, such as @command{dvi2pdf}
41184and @command{ps2pdf}.  You will also need GNU Troff (@command{groff})
41185installed in order to format the reference card and the manual page
41186(@pxref{Distribution contents}).  Managing this process is beyond the
41187scope of this @value{DOCUMENT}.
41188
41189Assuming you have all you need, then the following commands produce the
41190PDF versions of the documentation:
41191
41192@example
41193cd doc
41194make pdf
41195@end example
41196
41197@noindent
41198This creates PDF versions of all three Texinfo documents included
41199in the distribution, as well as of the manual page and the reference card.
41200
41201@cindex documentation @subentry building @subentry HTML
41202Similarly, if you have a recent enough version of @command{makeinfo},
41203you can make the HTML version of the manuals with:
41204
41205@example
41206cd doc
41207make html
41208@end example
41209
41210@noindent
41211This creates HTML versions of all three Texinfo documents included
41212in the distribution.
41213
41214@node Non-Unix Installation
41215@appendixsec Installation on Other Operating Systems
41216
41217This @value{SECTION} describes how to install @command{gawk} on
41218various non-Unix systems.
41219
41220@menu
41221* PC Installation::             Installing and Compiling @command{gawk} on
41222                                Microsoft Windows.
41223* VMS Installation::            Installing @command{gawk} on VMS.
41224@end menu
41225
41226@node PC Installation
41227@appendixsubsec Installation on MS-Windows
41228
41229@cindex PC operating systems, @command{gawk} on @subentry installing
41230@cindex operating systems @subentry PC, @command{gawk} on @subentry installing
41231This @value{SECTION} covers installation and usage of @command{gawk}
41232on Intel architecture machines running any version of MS-Windows.
41233In this @value{SECTION}, the term ``Windows32''
41234refers to any of Microsoft Windows 95/98/ME/NT/2000/XP/Vista/7/8/10.
41235
41236See also the @file{README_d/README.pc} file in the distribution.
41237
41238@menu
41239* PC Binary Installation::      Installing a prepared distribution.
41240* PC Compiling::                Compiling @command{gawk} for Windows32.
41241* PC Using::                    Running @command{gawk} on Windows32.
41242* Cygwin::                      Building and running @command{gawk} for
41243                                Cygwin.
41244* MSYS::                        Using @command{gawk} In The MSYS Environment.
41245@end menu
41246
41247@node PC Binary Installation
41248@appendixsubsubsec Installing a Prepared Distribution for MS-Windows Systems
41249@cindex installing @command{gawk} @subentry MS-Windows
41250
41251The only supported binary distribution for MS-Windows systems
41252is that provided by Eli Zaretskii's @uref{https://sourceforge.net/projects/ezwinports/,
41253``ezwinports''} project.  Install the compiled @command{gawk} from there.
41254
41255@node PC Compiling
41256@appendixsubsubsec Compiling @command{gawk} for PC Operating Systems
41257
@command{gawk} can be compiled for Windows32 using MinGW.
41259The file @file{README_d/README.pc} in the @command{gawk} distribution
41260contains additional notes, and @file{pc/Makefile} contains important
41261information on compilation options.
41262
41263@cindex compiling @command{gawk} @subentry for MS-Windows
41264To build @command{gawk} for Windows32, copy the files in
41265the @file{pc} directory (@emph{except} for @file{ChangeLog}) to the
41266directory with the rest of the @command{gawk} sources, then invoke
41267@command{make} with the appropriate target name as an argument to
41268build @command{gawk}.  The @file{Makefile} copied from the @file{pc}
41269directory contains a configuration section with comments and may need
41270to be edited in order to work with your @command{make} utility.
41271
41272The @file{Makefile} supports a number of targets for building various
41273MS-DOS and Windows32 versions.  A list of targets is printed if the
41274@command{make} command is given without a target.  As an example,
41275to build a native MS-Windows binary of @command{gawk} using the MinGW tools,
41276type @samp{make mingw32}.
41277
41278@node PC Using
41279@appendixsubsubsec Using @command{gawk} on PC Operating Systems
41280@cindex operating systems @subentry PC, @command{gawk} on
41281@cindex PC operating systems, @command{gawk} on
41282
41283Information in this section applies to the MinGW and
41284DJGPP ports of @command{gawk}. @xref{Cygwin} for information
41285about the Cygwin port.
41286
41287Under MS-Windows, the MinGW environment supports
41288both the @samp{|&} operator and TCP/IP networking
41289(@pxref{TCP/IP Networking}).
41290The DJGPP environment does not support @samp{|&}.
41291
41292@cindex search paths
41293@cindex search paths @subentry for source files
41294@cindex @command{gawk} @subentry MS-Windows version of
41295@cindex @code{;} (semicolon) @subentry @env{AWKPATH} variable and
41296@cindex semicolon (@code{;}) @subentry @env{AWKPATH} variable and
41297@cindex @env{AWKPATH} environment variable
41298@cindex environment variables @subentry @env{AWKPATH}
41299The MS-Windows version of @command{gawk} searches for
41300program files as described in @ref{AWKPATH Variable}.  However,
41301semicolons (rather than colons) separate elements in the @env{AWKPATH}
41302variable.  If @env{AWKPATH} is not set or is empty, then the default
41303search path is @samp{@w{.;c:/lib/awk;c:/gnu/lib/awk}}.
41304
41305@cindex common extensions @subentry @code{BINMODE} variable
41306@cindex extensions @subentry common @subentry @code{BINMODE} variable
41307@cindex differences in @command{awk} and @command{gawk} @subentry @code{BINMODE} variable
41308@cindex @code{BINMODE} variable
41309Under MS-Windows,
41310@command{gawk} (and many other text programs) silently
41311translates end-of-line @samp{\r\n} to @samp{\n} on input and @samp{\n}
41312to @samp{\r\n} on output.  A special @code{BINMODE} variable @value{COMMONEXT}
41313allows control over these translations and is interpreted as follows:
41314
41315@itemize @value{BULLET}
41316@item
41317If @code{BINMODE} is @code{"r"} or one,
41318then
41319binary mode is set on read (i.e., no translations on reads).
41320
41321@item
41322If @code{BINMODE} is @code{"w"} or two,
41323then
41324binary mode is set on write (i.e., no translations on writes).
41325
41326@item
41327If @code{BINMODE} is @code{"rw"} or @code{"wr"} or three,
41328binary mode is set for both read and write.
41329
41330@item
41331@code{BINMODE=@var{non-null-string}} is
41332the same as @samp{BINMODE=3} (i.e., no translations on
41333reads or writes).  However, @command{gawk} issues a warning
41334message if the string is not one of @code{"rw"} or @code{"wr"}.
41335@end itemize
41336
41337@noindent
41338The modes for standard input and standard output are set one time
41339only (after the
41340command line is read, but before processing any of the @command{awk} program).
41341Setting @code{BINMODE} for standard input or
41342standard output is accomplished by using an
41343appropriate @samp{-v BINMODE=@var{N}} option on the command line.
41344@code{BINMODE} is set at the time a file or pipe is opened and cannot be
41345changed midstream.
41346
41347On POSIX-compatible systems, this variable's value has no effect.
41348Thus, if you think your program will run on multiple different systems
41349and that you may need to use @code{BINMODE}, you should simply set it
41350(in the program or on the command line) unconditionally, and not worry
41351about the operating system on which your program is running.
41352
41353The name @code{BINMODE} was chosen to match @command{mawk}
41354(@pxref{Other Versions}).
41355@command{mawk} and @command{gawk} handle @code{BINMODE} similarly; however,
41356@command{mawk} adds a @samp{-W BINMODE=@var{N}} option and an environment
41357variable that can set @code{BINMODE}, @code{RS}, and @code{ORS}.  The
41358files @file{binmode[1-3].awk} (under @file{gnu/lib/awk} in some of the
41359prepared binary distributions) have been chosen to match @command{mawk}'s @samp{-W
41360BINMODE=@var{N}} option.  These can be changed or discarded; in particular,
41361the setting of @code{RS} giving the fewest ``surprises'' is open to debate.
41362@command{mawk} uses @samp{RS = "\r\n"} if binary mode is set on read, which is
41363appropriate for files with the MS-DOS-style end-of-line.
41364
41365To illustrate, the following examples set binary mode on writes for standard
41366output and other files, and set @code{ORS} as the ``usual'' MS-DOS-style
41367end-of-line:
41368
41369@example
41370gawk -v BINMODE=2 -v ORS="\r\n" @dots{}
41371@end example
41372
41373@noindent
41374or:
41375
41376@example
41377gawk -v BINMODE=w -f binmode2.awk @dots{}
41378@end example
41379
41380@noindent
41381These give the same result as the @samp{-W BINMODE=2} option in
41382@command{mawk}.
41383The following changes the record separator to @code{"\r\n"} and sets binary
41384mode on reads, but does not affect the mode on standard input:
41385
41386@example
41387gawk -v RS="\r\n" -e "BEGIN @{ BINMODE = 1 @}" @dots{}
41388@end example
41389
41390@noindent
41391or:
41392
41393@example
41394gawk -f binmode1.awk @dots{}
41395@end example
41396
41397@noindent
41398With proper quoting, in the first example the setting of @code{RS} can be
41399moved into the @code{BEGIN} rule.
41400
41401@node Cygwin
41402@appendixsubsubsec Using @command{gawk} In The Cygwin Environment
41403@cindex compiling @command{gawk} @subentry for Cygwin
41404
41405@command{gawk} can be built and used ``out of the box'' under MS-Windows
41406if you are using the @uref{http://www.cygwin.com, Cygwin environment}.
41407This environment provides an excellent simulation of GNU/Linux, using
41408Bash, GCC, GNU Make,
41409and other GNU programs.  Compilation and installation for Cygwin is the
41410same as for a Unix system:
41411
41412@example
41413tar -xvpzf gawk-@value{VERSION}.@value{PATCHLEVEL}.tar.gz
41414cd gawk-@value{VERSION}.@value{PATCHLEVEL}
41415./configure
41416make && make check
41417@end example
41418
41419When compared to GNU/Linux on the same system, the @samp{configure}
41420step on Cygwin takes considerably longer.  However, it does finish,
41421and then the @samp{make} proceeds as usual.
41422
41423@cindex installing @command{gawk} @subentry Cygwin
41424You may also install @command{gawk} using the regular Cygwin installer.
In general, Cygwin supplies the latest released version.
41426
41427Recent versions of Cygwin open all files in binary mode. This means
41428that you should use @samp{RS = "\r?\n"} in order to be able to
41429handle standard MS-Windows text files with carriage-return plus
41430line-feed line endings.
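
For example, the following sketch feeds @command{gawk} one line ending
in carriage-return plus line-feed and one ending in a bare line-feed.
The sample data is made up for the illustration.  Because the
record separator matches both kinds of line ending, the string
comparisons succeed and no stray carriage return is left in @code{$0}:

@example
# Sketch only: the input text is invented for this example.
printf 'one\r\ntwo\n' |
  gawk -v RS='\r?\n' '$0 == "one" || $0 == "two"'
@end example

@noindent
Both records are printed.  With the default value of @code{RS}, the
first record would be @samp{one} followed by a carriage return and
would not match.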
41431
41432The Cygwin environment supports
41433both the @samp{|&} operator and TCP/IP networking
41434(@pxref{TCP/IP Networking}).
41435
41436@node MSYS
41437@appendixsubsubsec Using @command{gawk} In The MSYS Environment
41438
41439In the MSYS environment under MS-Windows, @command{gawk} automatically
41440uses binary mode for reading and writing files.  Thus, there is no
41441need to use the @code{BINMODE} variable.
41442
This can cause problems with other Unix-like components ported to
MS-Windows that expect @command{gawk} to translate @code{"\r\n"}
automatically, because it does not.
41446
41447Under MSYS2, compilation using the standard @samp{./configure && make}
41448recipe works ``out of the box.''
41449
41450@node VMS Installation
41451@appendixsubsec Compiling and Installing @command{gawk} on Vax/VMS and OpenVMS
41452
41453@c based on material from Pat Rankin <rankin@eql.caltech.edu>
41454@c now rankin@pactechdata.com
41455@c now r.pat.rankin@gmail.com
41456
41457@cindex @command{gawk} @subentry VMS version of
41458@cindex installing @command{gawk} @subentry VMS
41459This @value{SUBSECTION} describes how to compile and install @command{gawk} under OpenVMS.
41460The older designation ``VMS'' is used throughout to refer to OpenVMS.
41461
41462@menu
41463* VMS Compilation::             How to compile @command{gawk} under VMS.
41464* VMS Dynamic Extensions::      Compiling @command{gawk} dynamic extensions on
41465                                VMS.
41466* VMS Installation Details::    How to install @command{gawk} under VMS.
41467* VMS Running::                 How to run @command{gawk} under VMS.
41468* VMS GNV::                     The VMS GNV Project.
41469@end menu
41470
41471@node VMS Compilation
41472@appendixsubsubsec Compiling @command{gawk} on VMS
41473@cindex compiling @command{gawk} @subentry for VMS
41474
41475To compile @command{gawk} under VMS, there is a @code{DCL} command procedure
41476that issues all the necessary @code{CC} and @code{LINK} commands. There is
41477also a @file{Makefile} for use with the @code{MMS} and @code{MMK} utilities.
41478From the source directory, use either:
41479
41480@example
41481$ @kbd{@@[.vms]vmsbuild.com}
41482@end example
41483
41484@noindent
41485or:
41486
41487@example
41488$ @kbd{MMS/DESCRIPTION=[.vms]descrip.mms gawk}
41489@end example
41490
41491@noindent
41492or:
41493
41494@example
41495$ @kbd{MMK/DESCRIPTION=[.vms]descrip.mms gawk}
41496@end example
41497
41498@command{MMK} is an open source, free, near-clone of @command{MMS} and
41499can better handle ODS-5 volumes with upper- and lowercase @value{FN}s.
41500@command{MMK} is available from @uref{https://github.com/endlesssoftware/mmk}.
41501
41502With ODS-5 volumes and extended parsing enabled, the case of the target
41503parameter may need to be exact.
41504
41505@command{gawk} has been tested under VAX/VMS 7.3 and Alpha/VMS 7.3-1
41506using Compaq C V6.4, and under Alpha/VMS 7.3, Alpha/VMS 7.3-2, and IA64/VMS 8.3.
The most recent builds used HP C V7.3 on Alpha VMS 8.3 and on both
Alpha and IA64 VMS 8.4.@footnote{The IA64 architecture
is also known as ``Itanium.''}
41510
41511@xref{VMS GNV} for information on building
41512@command{gawk} as a PCSI kit that is compatible with the GNV product.
41513
41514@node VMS Dynamic Extensions
41515@appendixsubsubsec Compiling @command{gawk} Dynamic Extensions on VMS
41516
41517The extensions that have been ported to VMS can be built using one of
41518the following commands:
41519
41520@example
41521$ @kbd{MMS/DESCRIPTION=[.vms]descrip.mms extensions}
41522@end example
41523
41524@noindent
41525or:
41526
41527@example
41528$ @kbd{MMK/DESCRIPTION=[.vms]descrip.mms extensions}
41529@end example
41530
41531@command{gawk} uses @code{AWKLIBPATH} as either an environment variable
41532or a logical name to find the dynamic extensions.
41533
41534Dynamic extensions need to be compiled with the same compiler options for
41535floating-point, pointer size, and symbol name handling as were used
41536to compile @command{gawk} itself.
Alpha and Itanium should use IEEE floating point.  The pointer size is 32 bits,
and the symbol name handling should be exact case with CRC shortening for
symbols longer than 31 characters.
41540
41541For Alpha and Itanium:
41542
41543@example
41544/name=(as_is,short)
41545/float=ieee/ieee_mode=denorm_results
41546@end example
41547
41548For VAX:
41549
41550@example
41551/name=(as_is,short)
41552@end example
41553
41554Compile-time macros need to be defined before the first VMS-supplied
41555header file is included, as follows:
41556
41557@example
41558#if (__CRTL_VER >= 70200000) && !defined (__VAX)
41559#define _LARGEFILE 1
41560#endif
41561
41562#ifndef __VAX
41563#ifdef __CRTL_VER
41564#if __CRTL_VER >= 80200000
41565#define _USE_STD_STAT 1
41566#endif
41567#endif
41568#endif
41569@end example
41570
41571If you are writing your own extensions to run on VMS, you must supply these
41572definitions yourself. The @file{config.h} file created when building @command{gawk}
41573on VMS does this for you; if instead you use that file or a similar one, then you
41574must remember to include it before any VMS-supplied header files.
41575
41576@node VMS Installation Details
41577@appendixsubsubsec Installing @command{gawk} on VMS
41578
41579To use @command{gawk}, all you need is a ``foreign'' command, which is a
41580@code{DCL} symbol whose value begins with a dollar sign. For example:
41581
41582@example
41583$ @kbd{GAWK :== $disk1:[gnubin]gawk}
41584@end example
41585
41586@noindent
41587Substitute the actual location of @command{gawk.exe} for
41588@samp{$disk1:[gnubin]}. The symbol should be placed in the
41589@file{login.com} of any user who wants to run @command{gawk},
41590so that it is defined every time the user logs on.
41591Alternatively, the symbol may be placed in the system-wide
41592@file{sylogin.com} procedure, which allows all users
41593to run @command{gawk}.
41594
41595If your @command{gawk} was installed by a PCSI kit into the
41596@file{GNV$GNU:} directory tree, the program will be known as
41597@file{GNV$GNU:[bin]gnv$gawk.exe} and the help file will be
41598@file{GNV$GNU:[vms_help]gawk.hlp}.
41599
41600The PCSI kit also installs a @file{GNV$GNU:[vms_bin]gawk_verb.cld} file
41601that can be used to add @command{gawk} and @command{awk} as DCL commands.
41602
41603For just the current process you can use:
41604
41605@example
41606$ @kbd{set command gnv$gnu:[vms_bin]gawk_verb.cld}
41607@end example
41608
41609Or the system manager can use @file{GNV$GNU:[vms_bin]gawk_verb.cld} to
41610add the @command{gawk} and @command{awk} commands to the system-wide @samp{DCLTABLES}.
41611
41612The DCL syntax is documented in the @file{gawk.hlp} file.
41613
41614Optionally, the @file{gawk.hlp} entry can be loaded into a VMS help library:
41615
41616@example
41617$ @kbd{LIBRARY/HELP sys$help:helplib [.vms]gawk.hlp}
41618@end example
41619
41620@noindent
41621(You may want to substitute a site-specific help library rather than
41622the standard VMS library @samp{HELPLIB}.)  After loading the help text,
41623the command:
41624
41625@example
41626$ @kbd{HELP GAWK}
41627@end example
41628
41629@noindent
41630provides information about both the @command{gawk} implementation and the
41631@command{awk} programming language.
41632
41633The logical name @samp{AWK_LIBRARY} can designate a default location
41634for @command{awk} program files.  For the @option{-f} option, if the specified
41635@value{FN} has no device or directory path information in it, @command{gawk}
41636looks in the current directory first, then in the directory specified
41637by the translation of @samp{AWK_LIBRARY} if the file is not found.
41638If, after searching in both directories, the file still is not found,
41639@command{gawk} appends the suffix @samp{.awk} to the @value{FN} and retries
41640the file search.  If @samp{AWK_LIBRARY} has no definition, a default value
41641of @samp{SYS$LIBRARY:} is used for it.
41642
41643@node VMS Running
41644@appendixsubsubsec Running @command{gawk} on VMS
41645
41646Command-line parsing and quoting conventions are significantly different
41647on VMS, so examples in this @value{DOCUMENT} or from other sources often need minor
41648changes.  They @emph{are} minor though, and all @command{awk} programs
41649should run correctly.
41650
41651Here are a couple of trivial tests:
41652
41653@example
41654$ @kbd{gawk -- "BEGIN @{print ""Hello, World!""@}"}
41655$ @kbd{gawk -"W" version}
41656! could also be -"W version" or "-W version"
41657@end example
41658
41659@noindent
41660Note that uppercase and mixed-case text must be quoted.
41661
41662The VMS port of @command{gawk} includes a @code{DCL}-style interface in addition
41663to the original shell-style interface (see the help entry for details).
41664One side effect of dual command-line parsing is that if there is only a
41665single parameter (as in the quoted string program), the command
41666becomes ambiguous.  To work around this, the normally optional @option{--}
41667flag is required to force Unix-style parsing rather than @code{DCL} parsing.
41668If any other dash-type options (or multiple parameters such as @value{DF}s to
41669process) are present, there is no ambiguity and @option{--} can be omitted.
41670
41671@cindex exit status, of @command{gawk} @subentry on VMS
41672The @code{exit} value is a Unix-style value and is encoded into a VMS exit
41673status value when the program exits.
41674
41675The VMS severity bits will be set based on the @code{exit} value.
41676A failure is indicated by 1, and VMS sets the @code{ERROR} status.
41677A fatal error is indicated by 2, and VMS sets the @code{FATAL} status.
All other values will have the @code{SUCCESS} status.  The exit value is
encoded to comply with VMS coding standards: the status carries the
@code{C_FACILITY_NO} of @code{0x350000}, with the constant @code{0xA000}
added to the exit value shifted left by 3 bits to make room for the
severity codes.
41682
41683To extract the actual @command{gawk} exit code from the VMS status, use:
41684
41685@example
41686unix_status = (vms_status .and. %x7f8) / 8
41687@end example
41688
41689@noindent
41690A C program that uses @code{exec()} to call @command{gawk} will get the original
41691Unix-style exit value.
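
The arithmetic can be checked on any system with a POSIX shell.  The
following sketch encodes a @command{gawk} exit value of 1 and then
applies the extraction formula.  It assumes that the fields are simply
added together and that the @code{ERROR} severity has the standard VMS
value of 2; the variable names are illustrative only:

@example
# Sketch only: assumes the fields are summed and that the
# VMS severity values are SUCCESS = 1, ERROR = 2, FATAL = 4.
facility=$(( 0x350000 + 0xA000 ))

# An exit value of 1 (a failure) is encoded with ERROR severity
vms_status=$(( facility + (1 << 3) + 2 ))

# Mask off the exit-value bits and shift them back down
unix_status=$(( (vms_status & 0x7f8) / 8 ))
printf '%x %d\n' "$vms_status" "$unix_status"
@end example

@noindent
This prints @samp{35a00a 1}: the mask @code{0x7f8} discards both the
severity bits and the facility bits, leaving only the original exit value.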
41692
41693Older versions of @command{gawk} for VMS treated a Unix exit code 0 as 1,
41694a failure as 2, a fatal error as 4, and passed all the other numbers through.
41695This violated the VMS exit status coding requirements.
41696
41697@cindex floating-point @subentry numbers @subentry VAX/VMS
41698VAX/VMS floating point uses unbiased rounding. @xref{Round Function}.
41699
41700VMS reports time values in GMT unless one of the @code{SYS$TIMEZONE_RULE}
41701or @code{TZ} logical names is set.  Older versions of VMS, such as VAX/VMS
417027.3, do not set these logical names.
41703
41704@cindex search paths
41705@cindex search paths @subentry for source files
41706The default search path, when looking for @command{awk} program files specified
41707by the @option{-f} option, is @code{"SYS$DISK:[],AWK_LIBRARY:"}.  The logical
41708name @env{AWKPATH} can be used to override this default.  The format
41709of @env{AWKPATH} is a comma-separated list of directory specifications.
41710When defining it, the value should be quoted so that it retains a single
41711translation and not a multitranslation @code{RMS} searchlist.
41712
41713@cindex redirection @subentry on VMS
41714
If you are redirecting data to a VMS command or utility, the current
implementation requires setting up a VMS foreign command that runs
a command file before invoking @command{gawk}.
(This restriction may be removed in a future release of @command{gawk} on VMS.)

This restriction also applies to running @command{gawk} under GNV,
as redirection is always to a DCL command.
41722
41723Without this command file, the input data will also appear prepended
41724to the output data.
41725
41726This also allows simulating POSIX commands that are not found on VMS or the
41727use of GNV utilities.
41728
41729The example below is for @command{gawk} redirecting data to the VMS
41730@command{sort} command.
41731
41732@example
41733$ sort = "@@device:[dir]vms_gawk_sort.com"
41734@end example
41735
41736The command file needs to be of the format in the example below.
41737
41738The first line inhibits the passed input data from also showing up in the
41739output.  It must be in the format in the example.
41740
The next line creates a foreign command that overrides the outer foreign
command, which prevents infinite recursion of command files.

The next-to-last command redirects @code{sys$input} to be
@code{sys$command}, in order to pick up the data that is being redirected
to the command.
41747
41748The last line runs the actual command.  It must be the last command as the data
41749redirected from @command{gawk} will be read when the command file ends.
41750
41751@example
41752$!'f$verify(0,0)'
41753$ sort := sort
41754$ define/user sys$input sys$command:
41755$ sort sys$input: sys$output:
41756@end example
41757
41758@node VMS GNV
41759@appendixsubsubsec The VMS GNV Project
41760
41761The VMS GNV package provides a build environment similar to POSIX with ports
41762of a collection of open source tools.  The @command{gawk} found in the GNV
41763base kit is an older port.  Currently, the GNV project is being reorganized
41764to supply individual PCSI packages for each component.
41765See @w{@uref{https://sourceforge.net/p/gnv/wiki/InstallingGNVPackages/}.}
41766
41767The normal build procedure for @command{gawk} produces a program that
41768is suitable for use with GNV.
41769
41770The file @file{vms/gawk_build_steps.txt} in the distribution documents
41771the procedure for building a VMS PCSI kit that is compatible with GNV.
41772
41773@node Bugs
41774@appendixsec Reporting Problems and Bugs
41775@cindex archaeologists
41776@quotation
41777@i{There is nothing more dangerous than a bored archaeologist.}
41778@author Douglas Adams, @cite{The Hitchhiker's Guide to the Galaxy}
41779@end quotation
41780@c the radio show, not the book. :-)
41781
41782@cindex debugging @command{gawk}, bug reports
41783@cindex troubleshooting @subentry @command{gawk} @subentry bug reports
41784If you have problems with @command{gawk} or think that you have found a bug,
41785report it to the developers; we cannot promise to do anything,
41786but we might well want to fix it.
41787
41788@menu
41789* Bug definition::              Defining what is and is not a bug.
41790* Bug address::                 Where to send reports to.
41791* Usenet::                      Where not to send reports to.
41792* Performance bugs::            What to do if you think there is a performance
41793                                issue.
41794* Asking for help::             Dealing with non-bug questions.
41795* Maintainers::                 Maintainers of non-*nix ports.
41796@end menu
41797
41798@node Bug definition
41799@appendixsubsec Defining What Is and What Is Not A Bug
41800
41801Before talking about reporting bugs, let's define what is a bug,
41802and what is not.
41803
41804A bug is:
41805
41806@itemize @bullet
41807@item
41808When @command{gawk} behaves differently from what's described
41809in the POSIX standard, and that difference is not mentioned
41810in this @value{DOCUMENT} as being done on purpose.
41811
41812@item
41813When @command{gawk} behaves differently from what's described
41814in this @value{DOCUMENT}.
41815
41816@item
41817When @command{gawk} behaves differently from other @command{awk}
41818implementations in particular circumstances, and that behavior cannot
41819be attributed to an additional feature in @command{gawk}.
41820
41821@item
41822Something that is obviously wrong, such as a core dump.
41823
41824@item
41825When this @value{DOCUMENT} is unclear or ambiguous about a particular
41826feature's behavior.
41827@end itemize
41828
41829The following things are @emph{not} bugs, and should not be reported
41830to the bug mailing list.  You can ask about them on the ``help'' mailing
41831list (@pxref{Asking for help}), but don't be surprised if you get an
41832answer of the form ``that's how @command{gawk} behaves and it isn't
41833going to change.'' Here's the list:
41834
41835@itemize @bullet
41836@item
41837Missing features, for any definition of @dfn{feature}. For example,
41838additional built-in arithmetic functions, or additional ways to split
41839fields or records, or anything else.
41840
41841The number of features that @command{gawk} does @emph{not} have is
41842by definition infinite.  It cannot be all things to all people.
41843In short, just because @command{gawk} doesn't do what @emph{you}
41844think it should, it's not necessarily a bug.
41845
41846@item
41847Behaviors that are defined by the POSIX standard and/or for historical
41848compatibility with Unix @command{awk}.  Even if you happen to dislike
41849those behaviors, they're not going to change: changing them would
41850break millions of existing @command{awk} programs.
41851
41852@item
41853Behaviors that differ from how it's done in other languages. @command{awk}
41854and @command{gawk} stand on their own and do not have to follow the crowd.
41855This is particularly true when the requested behavior change would break
41856backwards compatibility.
41857
41858This applies also to differences in behavior between @command{gawk}
41859and other language compilers and interpreters, such as wishes for more
41860detailed descriptions of what the problem is when a syntax error is
41861encountered.
41862
41863@item
41864Documentation issues of the form ``the manual doesn't tell me how to
41865do XYZ.''  The manual is not a cookbook to solve every little problem
41866you may have.  Its purpose is to teach you how to solve your problems
41867on your own.
41868
41869@item
41870General questions and discussion about @command{awk} programming or
41871why @command{gawk} behaves the way it does. For that use the ``help''
41872mailing list: see @ref{Asking for help}.
41873@end itemize
41874
41875For more information, see @uref{http://www.skeeve.com/fork-my-code.html,
41876@cite{Fork My Code, Please!---An Open Letter To Those of You Who Are Unhappy}},
41877by Arnold Robbins and Chet Ramey.
41878
41879@node Bug address
41880@appendixsubsec Submitting Bug Reports
41881
41882Before reporting a bug, make sure you have really found a genuine bug.
41883
41884Here are the steps for submitting a bug report. Following them will
41885make both your life and the lives of the maintainers much easier.
41886
41887@enumerate 1
41888@item
41889Make sure that what you want to report is appropriate.
41890@xref{Bug definition}.  If it's not, you are wasting your
41891time and ours.
41892
41893@item
41894Verify that you have the latest version of @command{gawk}.
41895Many bugs (usually subtle ones) are fixed at each release, and if yours
41896is out-of-date, the problem may already have been solved.
41897
41898@item
Please see if setting the environment variable @env{LC_ALL}
to @code{C} causes things to behave as you expect. If so, it's
a locale issue, and may or may not really be a bug.
41902
41903@item
41904Carefully reread the documentation and see if it says you can do
41905what you're trying to do.  If it's not clear whether you should be able
41906to do something or not, report that too; it's a bug in the documentation!
41907
41908@item
41909Before reporting a bug or trying to fix it yourself, try to isolate it
41910to the smallest possible @command{awk} program and input @value{DF} that
41911reproduce the problem.  Then send us:
41912
41913@itemize @bullet
41914@item
41915The program and @value{DF}.
41916
41917@item
41918Some idea of what kind of Unix system you're using.
41919
41920@item
41921The compiler you used to compile @command{gawk}.
41922
41923@item
41924The exact results
41925@command{gawk} gave you.  Also say what you expected to occur; this helps
41926us decide whether the problem is really in the documentation.
41927
41928@item
41929The version number of @command{gawk} you are using.
41930You can get this information with the command @samp{gawk --version}.
41931@end itemize
41932
41933@item
41934Do @emph{not} send screenshots. Instead, use copy/paste to send text, or
41935send files.
41936
41937@item
41938Do send files as attachments, instead of inline. This avoids corruption
41939by mailer programs out in the wilds of the Internet.
41940
41941@item
41942Please be sure to send all mail in @emph{plain text},
41943not (or not exclusively) in HTML.
41944
41945@item
41946@emph{All email must be in English. This is the only language
41947understood in common by all the maintainers.}
41948@end enumerate
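
The locale check in the steps above can be automated with a short shell
function.  This is a minimal sketch, not something supplied with
@command{gawk}: the function name @code{locale_check} and the @env{AWK}
variable are assumptions made for this illustration.

```shell
# Sketch only: locale_check is a hypothetical helper, not part of gawk.
# AWK names the interpreter to test; it defaults to gawk.
AWK=${AWK:-gawk}

# Usage: locale_check PROGRAM-FILE DATA-FILE
# Prints "same" if the output is identical under the current locale
# and under LC_ALL=C, and "differs" otherwise.
locale_check() {
    current=$("$AWK" -f "$1" "$2" 2>&1)
    posix=$(LC_ALL=C "$AWK" -f "$1" "$2" 2>&1)
    if [ "$current" = "$posix" ]; then
        echo same
    else
        echo differs
    fi
}
```

If the function prints @samp{differs}, the behavior depends on the locale,
and it is worth saying so in the report.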
41949
41950@cindex @email{bug-gawk@@gnu.org} bug reporting address
41951@cindex email address for bug reports, @email{bug-gawk@@gnu.org}
41952@cindex bug reports, email address, @email{bug-gawk@@gnu.org}
41953Once you have a precise problem description, send email to
41954@EMAIL{bug-gawk@@gnu.org,bug dash gawk at gnu dot org}.
41955
41956The @command{gawk} maintainers subscribe to this address, and
41957thus they will receive your bug report.
41958Although you can send mail to the maintainers directly,
41959the bug reporting address is preferred because the
41960email list is archived at the GNU Project.
41961
41962@quotation NOTE
41963Many distributions of GNU/Linux and the various BSD-based operating systems
41964have their own bug reporting systems.  If you report a bug using your distribution's
41965bug reporting system, you should also send a copy to
41966@EMAIL{bug-gawk@@gnu.org,bug dash gawk at gnu dot org}.
41967
41968This is for two reasons.  First, although some distributions forward
41969bug reports ``upstream'' to the GNU mailing list, many don't, so there is a good
41970chance that the @command{gawk}  maintainers won't even see the bug report!  Second,
41971mail to the GNU list is archived, and having everything at the GNU Project
41972keeps things self-contained and not dependent on other organizations.
41973@end quotation
41974
41975Please note: We ask that you follow the
41976@uref{https://gnu.org/philosophy/kind-communication.html,
41977GNU Kind Communication Guidelines} in your correspondence on the
41978list (as well as off of it).
41979
41980@node Usenet
41981@appendixsubsec Please Don't Post Bug Reports to USENET
41982
41983@quotation
41984@c Date: Sun, 17 May 2015 19:50:14 -0400
41985@c From: Chet Ramey <chet.ramey@case.edu>
41986@c Reply-To: chet.ramey@case.edu
41987@c Organization: ITS, Case Western Reserve University
41988@c To: Aharon Robbins <arnold@skeeve.com>
41989@c CC: chet.ramey@case.edu
41990I gave up on Usenet a couple of years ago and haven't really looked back.
41991It's like sports talk radio---you feel smarter for not having read it.
41992@author Chet Ramey
41993@end quotation
41994
41995@cindex @code{comp.lang.awk} newsgroup
41996Please do @emph{not} try to report bugs in @command{gawk} by posting to the
41997Usenet/Internet newsgroup @code{comp.lang.awk}.  Although some of the
@command{gawk} developers occasionally read this newsgroup, the primary
41999@command{gawk} maintainer no longer does.  Thus it's virtually guaranteed
42000that he will @emph{not} see your posting.
42001
42002If you really don't care about the previous paragraph and continue to
42003post bug reports in @code{comp.lang.awk}, then understand that you're
42004not reporting bugs, you're just whining.
42005
42006Similarly, posting bug reports or questions in web forums (such
42007as @uref{https://stackoverflow.com/, Stack Overflow}) may get you
42008an answer, but it won't be from the @command{gawk} maintainers,
42009who do not spend their time in web forums.  The steps described here are
42010the only officially recognized way for reporting bugs.  Really.
42011
42012@ignore
42013And another one:
42014
42015Date: Thu, 11 Jun 2015 09:00:56 -0400
42016From: Chet Ramey <chet.ramey@case.edu>
42017
42018My memory was imperfect.  Back in June 2009, I wrote:
42019
42020"That's the nice thing about open source, right?  You can take your ball
42021and run to another section of the playground.  Then, if you like mixing
42022metaphors, you can throw rocks from there."
42023@end ignore
42024
42025@node Performance bugs
42026@appendixsubsec What To Do If You Think There Is A Performance Issue
42027
42028@cindex performance, checking issues
42029@cindex profiling, compiling @command{gawk} for
42030If you think that @command{gawk} is too slow at doing a particular task,
42031you should investigate before sending in a bug report. Here are the steps
42032to follow:
42033
42034@enumerate 1
42035@item
42036Run @command{gawk} with the @option{--profile} option (@pxref{Options})
42037to see what your
42038program is doing. It may be that you have written it in an inefficient manner.
For example, you may be doing something for every record that could be done
just once per file.
42041(Use a @code{BEGINFILE} rule; @pxref{BEGINFILE/ENDFILE}.)
42042Or you may be doing something for every file that only needs to be done
42043once per run of the program.
42044(Use a @code{BEGIN} rule; @pxref{BEGIN/END}.)
42045
42046@item
42047If profiling at the @command{awk} level doesn't help, then you will
42048need to compile @command{gawk} itself for profiling at the C language level.
42049
42050To do that, start with the latest released version of
42051@command{gawk}. Unpack the source code in a new directory, and configure
42052it:
42053
42054@example
42055$ @kbd{tar -xpzvf gawk-X.Y.Z.tar.gz}
42056@print{} @dots{}                                @ii{Output omitted}
42057$ @kbd{cd gawk-X.Y.Z}
42058$ @kbd{./configure}
42059@print{} @dots{}                                @ii{Output omitted}
42060@end example
42061
42062@item
42063Edit the files @file{Makefile} and @file{support/Makefile}.
42064Change every instance of @option{-O2} or @option{-O} to @option{-pg}.
42065This causes @command{gawk} to be compiled for profiling.
42066
42067@item
42068Compile the program by running the @command{make} command:
42069
42070@example
42071@group
42072$ @kbd{make}
42073@print{} @dots{}                                @ii{Output omitted}
42074@end group
42075@end example
42076
42077@item
42078Run the freshly compiled @command{gawk} on a @emph{real} program,
42079using @emph{real} data.  Using an artificial program to try to time one
42080particular feature of @command{gawk} is useless; real @command{awk} programs
42081generally spend most of their time doing I/O, not computing.  If you want to prove
42082that something is slow, it @emph{must} be done using a real program and real data.
42083
42084Use a data file that is large enough for the statistical profiling to measure
42085where @command{gawk} spends its time. It should be at least 100 megabytes in size.
42086
42087@example
42088$ @kbd{./gawk -f realprogram.awk realdata > /dev/null}
42089@end example
42090
42091@item
42092When done, you should have a file in the current directory named @file{gmon.out}.
42093Run the command @samp{gprof gawk gmon.out > gprof.out}.
42094
42095@item
42096Submit a bug report explaining what you think is slow. Include the @file{gprof.out}
42097file with it.
42098
42099Preferably, you should also submit the program and the data, or else indicate where to
42100get the data if the file is large.
42101
42102@item
42103If you have not submitted your program and data, be prepared to apply patches and
42104rerun the profiling in order to see if the patches were effective.
42105
42106@end enumerate
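
The @file{Makefile} edit in the steps above can also be done mechanically.
The following is a minimal sketch under stated assumptions: the function
name @code{profile_flags} is hypothetical, and it only rewrites
@option{-O} and @option{-O2} when they appear as separate options followed
by whitespace or the end of a line.  Review the result before running
@command{make}.

```shell
# Sketch only: profile_flags is a hypothetical helper, not part of gawk.
# It reads Makefile text on standard input and writes it to standard
# output with each standalone -O or -O2 option replaced by -pg.
profile_flags() {
    sed -e 's/-O2\{0,1\}\([[:space:]]\)/-pg\1/g' \
        -e 's/-O2\{0,1\}$/-pg/'
}
```

One way to use it is to filter each file into a temporary copy, for
example @samp{profile_flags < Makefile > Makefile.new}, check the copy,
and then move it into place.  Do the same for @file{support/Makefile}.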
42107
If you are unable or unwilling to perform the steps listed above, then you will
just have to live with @command{gawk} as it is.
42110
42111@node Asking for help
42112@appendixsubsec Where To Send Non-bug Questions
42113
42114If you have questions related to @command{awk} programming, or why @command{gawk}
42115behaves a certain way, or any other @command{awk}- or @command{gawk}-related issue,
42116please @emph{do not} send it to the bug reporting address.
42117
42118As of July, 2021, there is a separate mailing list for this purpose:
42119@EMAIL{help-gawk@@gnu.org, help dash gawk at gnu dot org}.
42120Anything that is not a bug report should be sent to that list.
42121
42122@quotation NOTE
42123If you disregard these directions and send non-bug mails to the bug list,
42124you will be told to use the help list.
42125After two such requests you will be silently @emph{blacklisted} from the bug list.
42126@end quotation
42127
42128Please note: As with the bug list, we ask that you follow the
42129@uref{https://gnu.org/philosophy/kind-communication.html,
42130GNU Kind Communication Guidelines} in your correspondence on the help
42131list (as well as off of it).
42132
42133@cindex Proulx, Bob
If you wish to subscribe to the list, in order to help out
42135others, or to learn from others, here are instructions, courtesy
42136of Bob Proulx:
42137
42138@table @emph
42139@item Subscribe by email
42140
42141Send an email message to
42142@EMAIL{help-gawk-request@@gnu.org, help dash gawk dash request at gnu dot org}
42143with ``subscribe'' in
42144the body of the message.  The subject does not matter and is not used.
42145
42146@item Subscribe by web form
42147
42148To use the web interface visit
42149@uref{https://lists.gnu.org/mailman/listinfo/help-gawk,
42150the list information page}.
42151Use the
42152subscribe form to fill out your email address and submit using the
42153@code{Subscribe} button.
42154
42155@item Reply to the confirmation message
42156
In both cases, reply to the confirmation message that is sent to
your address.
42159@end table
42160
42161Bob mentions that you may also use email for subscribing and
42162unsubscribing. For example:
42163
42164@example
42165$ @kbd{echo help | mailx -s request help-gawk-request@@gnu.org}
42166$ @kbd{echo subscribe | mailx -s request help-gawk-request@@gnu.org}
42167$ @kbd{echo unsubscribe | mailx -s request help-gawk-request@@gnu.org}
42168@end example
42169
42170@node Maintainers
42171@appendixsubsec Reporting Problems with Non-Unix Ports
42172
42173If you find bugs in one of the non-Unix ports of @command{gawk},
42174send an email to the bug list, with a copy to the
42175person who maintains that port.  The maintainers are named in the following list,
42176as well as in the @file{README} file in the @command{gawk} distribution.
42177Information in the @file{README} file should be considered authoritative
42178if it conflicts with this @value{DOCUMENT}.
42179
42180The people maintaining the various @command{gawk} ports are:
42181
42182@c put the index entries outside the table, for docbook
42183@cindex Buening, Andreas
42184@cindex Malmberg, John
42185@cindex G., Daniel Richard
42186@cindex Robbins @subentry Arnold
42187@cindex Zaretskii, Eli
42188@cindex Guerrero, Juan Manuel
42189@multitable {MS-Windows with MinGW} {123456789012345678901234567890123456789001234567890}
42190@item Unix and POSIX systems @tab Arnold Robbins, @EMAIL{arnold@@skeeve.com,arnold at skeeve dot com}
42191
42192@item MS-DOS with DJGPP @tab Juan Manuel Guerrero, @EMAIL{juan.guerrero@@gmx.de, juan dot guerrero at gmx dot de}
42193
42194@item MS-Windows with MinGW @tab Eli Zaretskii, @EMAIL{eliz@@gnu.org,eliz at gnu dot org}
42195
42196@c Leave this in the document on purpose.
42197@c OS/2 is not mentioned anywhere else though.
42198@item OS/2 @tab Andreas Buening, @EMAIL{andreas.buening@@nexgo.de,andreas dot buening at nexgo dot de}
42199
42200@item VMS @tab John Malmberg, @EMAIL{wb8tyw@@qsl.net,wb8tyw at qsl dot net}
42201
@item z/OS (OS/390) @tab Daniel Richard G., @EMAIL{skunk@@iSKUNK.ORG,skunk at iSKUNK dot ORG}
42203@end multitable
42204
42205If your bug is also reproducible under Unix, send a copy of your
42206report to the @EMAIL{bug-gawk@@gnu.org,bug dash gawk at gnu dot org} email list as well.
42207
42208@node Other Versions
42209@appendixsec Other Freely Available @command{awk} Implementations
42210@cindex @command{awk} @subentry implementations
42211@ignore
42212From: emory!amc.com!brennan (Michael Brennan)
42213Subject: C++ comments in awk programs
42214To: arnold@gnu.ai.mit.edu (Arnold Robbins)
42215Date: Wed, 4 Sep 1996 08:11:48 -0700 (PDT)
42216
42217@end ignore
42218@cindex Brennan, Michael
42219@ifnotdocbook
42220@quotation
42221@i{It's kind of fun to put comments like this in your awk code:}@*
42222@ @ @ @ @ @ @code{// Do C++ comments work? answer: yes! of course}
42223@author Michael Brennan
42224@end quotation
42225@end ifnotdocbook
42226
42227@docbook
42228<blockquote><attribution>Michael Brennan</attribution>
42229<literallayout><emphasis>It's kind of fun to put comments like this in your awk code.</emphasis>
42230&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<literal>// Do C++ comments work? answer: yes! of course</literal></literallayout>
42231</blockquote>
42232@end docbook
42233
42234There are a number of other freely available @command{awk} implementations.
42235This @value{SECTION} briefly describes where to get them:
42236
42237@table @asis
42238@cindex Kernighan, Brian
42239@cindex source code @subentry Brian Kernighan's @command{awk}
42240@cindex @command{awk} @subentry versions of @seealso{Brian Kernighan's @command{awk}}
42241@cindex Brian Kernighan's @command{awk} @subentry source code
42242@item Unix @command{awk}
42243Brian Kernighan, one of the original designers of Unix @command{awk},
42244has made his implementation of
42245@command{awk} freely available.
42246You can retrieve it from GitHub:
42247
42248@cindex @command{git} utility
42249@example
git clone https://github.com/onetrueawk/awk bwkawk
42251@end example
42252
42253@noindent
42254This command creates a copy of the @uref{https://git-scm.com, Git}
42255repository in a directory named @file{bwkawk}.  If you omit the last argument
42256from the @command{git} command line, the repository copy is created in a
42257directory named @file{awk}.
42258
42259This version requires an ISO C (1990 standard) compiler; the C compiler
42260from GCC (the GNU Compiler Collection) works quite nicely.
42261
42262To build it, review the settings in the @file{makefile}, and then just run
42263@command{make}.  Note that the result of compilation is named
42264@command{a.out}; you will have to rename it to something reasonable.
42265
@xref{Common Extensions},
42267for a list of extensions in this @command{awk} that are not in POSIX @command{awk}.
42268
42269As a side note, Dan Bornstein has created a Git repository tracking
42270all the versions of BWK @command{awk} that he could find. It's
available at @uref{https://github.com/danfuzz/one-true-awk}.
42272
42273@cindex Brennan, Michael
42274@cindex @command{mawk} utility
42275@cindex source code @subentry @command{mawk}
42276@item @command{mawk}
42277Michael Brennan wrote an independent implementation of @command{awk},
42278called @command{mawk}.  It is available under the
42279@ifclear FOR_PRINT
42280GPL (@pxref{Copying}),
42281@end ifclear
42282@ifset FOR_PRINT
42283GPL,
42284@end ifset
42285just as @command{gawk} is.
42286
42287The original distribution site for the @command{mawk} source code
42288no longer has it.  A copy is available at
42289@uref{http://www.skeeve.com/gawk/mawk1.3.3.tar.gz}.
42290
42291In 2009, Thomas Dickey took on @command{mawk} maintenance.
42292Basic information is available on
42293@uref{http://www.invisible-island.net/mawk, the project's web page}.
42294The download URL is
42295@url{http://invisible-island.net/datafiles/release/mawk.tar.gz}.
42296
42297Once you have it,
42298@command{gunzip} may be used to decompress this file. Installation
42299is similar to @command{gawk}'s
42300(@pxref{Unix Installation}).
42301
@xref{Common Extensions},
42303for a list of extensions in @command{mawk} that are not in POSIX @command{awk}.
42304
42305@item @command{mawk} 2.0
42306In 2016, Michael Brennan resumed @command{mawk} development.
42307His development snapshots are available via Git from the project's
42308@uref{https://github.com/mikebrennan000/mawk-2, GitHub page}.
42309
42310@cindex Sumner, Andrew
42311@cindex @command{awka} compiler for @command{awk}
42312@cindex source code @subentry @command{awka}
42313@item @command{awka}
42314Written by Andrew Sumner,
42315@command{awka} translates @command{awk} programs into C, compiles them,
42316and links them with a library of functions that provide the core
42317@command{awk} functionality.
42318It also has a number of extensions.
42319
42320Both the @command{awk} translator and the library are released under the GPL.
42321
42322To get @command{awka}, go to @url{https://sourceforge.net/projects/awka}.
42323@c You can reach Andrew Sumner at @email{andrew@@zbcom.net}.
42324@c andrewsumner@@yahoo.net
42325
42326The project seems to be frozen; no new code changes have been made
42327since approximately 2001.
42328
42329@item Revive Awka
42330This project, available at @uref{https://github.com/noyesno/awka},
42331intends to fix bugs in @command{awka} and add more features.
42332
42333@cindex Beebe, Nelson H.F.@:
42334@cindex @command{pawk} (profiling version of Brian Kernighan's @command{awk})
42335@cindex source code @subentry @command{pawk} (profiling version of Brian Kernighan's @command{awk})
42336@item @command{pawk}
42337Nelson H.F.@: Beebe at the University of Utah has modified
42338BWK @command{awk} to provide timing and profiling information.
42339It is different from @command{gawk} with the @option{--profile} option
42340(@pxref{Profiling})
42341in that it uses CPU-based profiling, not line-count
42342profiling.  You may find it at either
42343@uref{ftp://ftp.math.utah.edu/pub/pawk/pawk-20030606.tar.gz}
42344or
42345@uref{http://www.math.utah.edu/pub/pawk/pawk-20030606.tar.gz}.
42346
42347@item BusyBox @command{awk}
42348@cindex BusyBox Awk
42349@cindex source code @subentry BusyBox Awk
42350BusyBox is a GPL-licensed program providing small versions of many
42351applications within a single executable. It is aimed at embedded systems.
42352It includes a full implementation of POSIX @command{awk}.  When building
42353it, be careful not to do @samp{make install} as it will overwrite
42354copies of other applications in your @file{/usr/local/bin}.  For more
42355information, see the @uref{https://busybox.net, project's home page}.
42356
42357@cindex OpenSolaris
42358@cindex Solaris, POSIX-compliant @command{awk}
42359@cindex source code @subentry Solaris @command{awk}
42360@item The OpenSolaris POSIX @command{awk}
42361The versions of @command{awk} in @file{/usr/xpg4/bin} and
42362@file{/usr/xpg6/bin} on Solaris are more or less POSIX-compliant.
42363They are based on the @command{awk} from Mortice Kern Systems for PCs.
42364We were able to make this code compile and work under GNU/Linux
42365with 1--2 hours of work.  Making it more generally portable (using
42366GNU Autoconf and/or Automake) would take more work, and this
42367has not been done, at least to our knowledge.
42368
42369@cindex Illumos, POSIX-compliant @command{awk}
42370@cindex source code @subentry Illumos @command{awk}
42371The source code used to be available from the OpenSolaris website.
42372However, that project was ended and the website shut down.  Fortunately, the
42373@uref{https://wiki.illumos.org/display/illumos/illumos+Home, Illumos project}
42374makes this implementation available.  You can view the files one at a time from
42375@uref{https://github.com/joyent/illumos-joyent/blob/master/usr/src/cmd/awk_xpg4}.
42376
42377@cindex @command{frawk}
42378@cindex source code @subentry @command{frawk}
42379@item @command{frawk}
42380This is a language for writing short programs.  ``To a first
42381approximation, it is an implementation of the AWK language;
42382many common @command{awk} programs produce equivalent output
42383when passed to @command{frawk}.''  However, it has a number of
42384important additional features.  The code is available at
42385@uref{https://github.com/ezrosent/frawk}.
42386
42387@cindex @command{goawk}
42388@cindex Go implementation of @command{awk}
42389@cindex source code @subentry @command{goawk}
42390@cindex programming languages @subentry Go
42391@item @command{goawk}
42392This is an @command{awk} interpreter written in the
42393@uref{https://golang.org/, Go programming language}.
42394It implements POSIX @command{awk}, with a few minor extensions.
42395Source code is available from @uref{https://github.com/benhoyt/goawk}.
42396The author wrote a nice
42397@uref{https://benhoyt.com/writings/goawk/, article}
42398describing the implementation.
42399
42400@cindex @command{jawk}
42401@cindex Java implementation of @command{awk}
42402@cindex source code @subentry @command{jawk}
42403@item @command{jawk}
42404This is an interpreter for @command{awk} written in Java. It claims
42405to be a full interpreter, although because it uses Java facilities
42406for I/O and for regexp matching, the language it supports is different
42407from POSIX @command{awk}.  More information is available on the
42408@uref{http://jawk.sourceforge.net, project's home page}.
42409
42410@item Hoijui's @command{jawk}
42411This project, available at @uref{https://github.com/hoijui/Jawk},
42412is another @command{awk} interpreter written in Java. It uses
42413modern Java build tools.
42414
42415@item Libmawk
42416@cindex libmawk
42417@cindex source code @subentry libmawk
42418This is an embeddable @command{awk} interpreter derived from
42419@command{mawk}. For more information, see
42420@uref{http://repo.hu/projects/libmawk/}.
42421
42422@cindex source code @subentry embeddable @command{awk} interpreter
42423@cindex Neacsu, Mircea
42424@item Mircea Neacsu's Embeddable @command{awk}
42425Mircea Neacsu has created an embeddable @command{awk}
interpreter, based on BWK @command{awk}. It's available
42427at @uref{https://github.com/neacsum/awk}.
42428
42429@item @code{pawk}
42430@cindex source code @subentry @command{pawk} (Python version)
42431@cindex @code{pawk}, @command{awk}-like facilities for Python
42432This is a Python module that claims to bring @command{awk}-like
42433features to Python. See @uref{https://github.com/alecthomas/pawk}
42434for more information. (This is not related to Nelson Beebe's
42435modified version of BWK @command{awk}, described earlier.)
42436
42437@item @w{QSE @command{awk}}
42438@cindex QSE @command{awk}
42439@cindex source code @subentry QSE @command{awk}
42440This is an embeddable @command{awk} interpreter. For more information,
42441see @uref{https://code.google.com/p/qse/}. @c and @uref{http://awk.info/?tools/qse}.
42442
42443@item @command{QTawk}
42444@cindex QuikTrim Awk
42445@cindex source code @subentry QuikTrim Awk
42446This is an independent implementation of @command{awk} distributed
42447under the GPL. It has a large number of extensions over standard
42448@command{awk} and may not be 100% syntactically compatible with it.
42449See @uref{http://www.quiktrim.org/QTawk.html} for more information,
42450including the manual. The download link there is out of date; see
42451@uref{http://www.quiktrim.org/#AdditionalResources} for the latest
42452download link.
42453
42454The project may also be frozen; no new code changes have been made
42455since approximately 2014.
42456
42457@item Other versions
42458See also the ``Versions and implementations'' section of the
42459@uref{https://en.wikipedia.org/wiki/Awk_language#Versions_and_implementations,
42460Wikipedia article} on @command{awk} for information on additional versions.
42461
42462@end table
42463
42464An interesting collection of library functions is available
42465at @uref{https://github.com/e36freak/awk-libs}.
42466
42467@node Installation summary
42468@appendixsec Summary
42469
42470@itemize @value{BULLET}
42471@item
42472The @command{gawk} distribution is available from the GNU Project's main
42473distribution site, @code{ftp.gnu.org}.  The canonical build recipe is:
42474
42475@example
42476wget https://ftp.gnu.org/gnu/gawk/gawk-@value{VERSION}.@value{PATCHLEVEL}.tar.gz
42477tar -xvpzf gawk-@value{VERSION}.@value{PATCHLEVEL}.tar.gz
42478cd gawk-@value{VERSION}.@value{PATCHLEVEL}
42479./configure && make && make check
42480@end example
42481
42482@quotation NOTE
42483Because of the @samp{https://} URL, you may have to supply the
42484@option{--no-check-certificate} option to @command{wget} to download
42485the file.
42486@end quotation
42487
42488@item
42489@command{gawk} may be built on non-POSIX systems as well. The currently
42490supported systems are MS-Windows using
42491MSYS, MSYS2, DJGPP, MinGW, and Cygwin,
42492@c OS/2,
42493and both Vax/VMS and OpenVMS.
42494Instructions for each system are included in this @value{APPENDIX}.
42495
42496@item
42497Bug reports should be sent via email to @EMAIL{bug-gawk@@gnu.org, bug dash gawk at gnu dot org}.
42498Bug reports should be in English and should include the version of @command{gawk},
42499how it was compiled, and a short program and @value{DF} that demonstrate
42500the problem.
42501
42502@item
42503Non-bug emails should be sent to @EMAIL{help-gawk@@gnu.org, help dash gawk at gnu dot org}.
42504Repeatedly sending non-bug emails to the bug list will get you blacklisted from it.
42505
42506@item
42507There are a number of other freely available @command{awk}
42508implementations.  Many are POSIX-compliant; others are less so.
42509
42510@end itemize
42511
42512
42513@ifclear FOR_PRINT
42514@node Notes
42515@appendix Implementation Notes
42516@cindex @command{gawk} @subentry implementation issues
42517@cindex implementation issues, @command{gawk}
42518
42519This appendix contains information mainly of interest to implementers and
42520maintainers of @command{gawk}.  Everything in it applies specifically to
42521@command{gawk} and not to other implementations.
42522
42523@menu
42524* Compatibility Mode::          How to disable certain @command{gawk}
42525                                extensions.
42526* Additions::                   Making Additions To @command{gawk}.
42527* Future Extensions::           New features that may be implemented one day.
42528* Implementation Limitations::  Some limitations of the implementation.
42529* Extension Design::            Design notes about the extension API.
42530* Notes summary::               Summary of implementation notes.
42531@end menu
42532
42533@node Compatibility Mode
42534@appendixsec Downward Compatibility and Debugging
42535@cindex @command{gawk} @subentry implementation issues @subentry downward compatibility
42536@cindex @command{gawk} @subentry implementation issues @subentry debugging
42537@cindex troubleshooting @subentry @command{gawk}
42538@cindex implementation issues, @command{gawk} @subentry debugging
42539
42540@xref{POSIX/GNU},
42541for a summary of the GNU extensions to the @command{awk} language and program.
42542All of these features can be turned off by invoking @command{gawk} with the
42543@option{--traditional} option or with the @option{--posix} option.
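
As a small illustration (a sketch, assuming @command{gawk} is installed
and on your @env{PATH}), the @code{IGNORECASE} variable is a GNU extension
that has no special meaning in compatibility mode.  The same command line
therefore behaves differently depending on the option given:

@example
echo ABC | gawk -v IGNORECASE=1 '/abc/'                # prints ABC
echo ABC | gawk --traditional -v IGNORECASE=1 '/abc/'  # prints nothing
echo ABC | gawk --posix -v IGNORECASE=1 '/abc/'        # prints nothing
@end example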
42544
42545If @command{gawk} is compiled for debugging with @samp{-DDEBUG}, then there
42546is one more option available on the command line:
42547
42548@table @code
42549@item -Y
42550@itemx --parsedebug
42551Print out the parse stack information as the program is being parsed.
42552@end table
42553
42554This option is intended only for serious @command{gawk} developers
42555and not for the casual user.  It probably has not even been compiled into
42556your version of @command{gawk}, since it slows down execution.
42557
42558@node Additions
42559@appendixsec Making Additions to @command{gawk}
42560
42561If you find that you want to enhance @command{gawk} in a significant
42562fashion, you are perfectly free to do so.  That is the point of having
42563free software; the source code is available and you are free to change
42564it as you want (@pxref{Copying}).
42565
42566This @value{SECTION} discusses the ways you might want to change @command{gawk}
42567as well as any considerations you should bear in mind.
42568
42569@menu
42570* Accessing The Source::        Accessing the Git repository.
42571* Adding Code::                 Adding code to the main body of
42572                                @command{gawk}.
42573* New Ports::                   Porting @command{gawk} to a new operating
42574                                system.
42575* Derived Files::               Why derived files are kept in the Git
42576                                repository.
42577@end menu
42578
42579@node Accessing The Source
42580@appendixsubsec Accessing The @command{gawk} Git Repository
42581
42582As @command{gawk} is Free Software, the source code is always available.
42583@ref{Gawk Distribution} describes how to get and build the formal,
42584released versions of @command{gawk}.
42585
42586@cindex @command{git} utility
42587However, if you want to modify @command{gawk} and contribute back your
42588changes, you will probably wish to work with the development version.
42589To do so, you will need to access the @command{gawk} source code
42590repository.  The code is maintained using the
42591@uref{https://git-scm.com, Git distributed version control system}.
42592You will need to install it if your system doesn't have it.
42593Once you have done so, use the command:
42594
42595@example
42596git clone git://git.savannah.gnu.org/gawk.git
42597@end example
42598
42599@noindent
42600This clones the @command{gawk} repository.  If you are behind a
42601firewall that does not allow you to use the Git native protocol, you
42602can still access the repository using:
42603
42604@example
42605git clone https://git.savannah.gnu.org/r/gawk.git
42606@end example
42607
42608Once you have made changes, you can use @samp{git diff} to produce a
42609patch, and send that to the @command{gawk} maintainer; see @ref{Bugs},
42610for how to do that.
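
For example, the following sketch produces such a patch.  It uses a
throwaway repository in place of a real @command{gawk} clone, so that it
can be tried without network access; the names @file{scratch},
@file{NEWS}, and @file{my-change.diff} are made up for illustration:

@example
git init -q scratch                     # stand-in for a gawk clone
echo 'old text' > scratch/NEWS
git -C scratch add NEWS                 # record the original contents
echo 'new text' > scratch/NEWS          # make a change
git -C scratch diff > my-change.diff    # unified diff of the change
@end example

@noindent
In a real clone, you would run @samp{git diff} from the top of the source
tree after editing, and mail the resulting file.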
42611
Once upon a time there was a Git--CVS gateway for use by people who could
not install Git. However, this gateway no longer works, so you may have
better luck using a more modern version control system such as Bazaar,
which has a Git plug-in for working with Git repositories.
42616
42617@node Adding Code
42618@appendixsubsec Adding New Features
42619
42620@cindex adding @subentry features to @command{gawk}
42621@cindex features @subentry adding to @command{gawk}
42622@cindex @command{gawk} @subentry features @subentry adding
42623You are free to add any new features you like to @command{gawk}.
42624However, if you want your changes to be incorporated into the @command{gawk}
42625distribution, there are several steps that you need to take in order to
42626make it possible to include them:
42627
42628@enumerate 1
42629@item
42630Before building the new feature into @command{gawk} itself,
42631consider writing it as an extension
42632(@pxref{Dynamic Extensions}).
42633If that's not possible, continue with the rest of the steps in this list.
42634
42635@item
42636Be prepared to sign the appropriate paperwork.
42637In order for the FSF to distribute your changes, you must either place
42638those changes in the public domain and submit a signed statement to that
42639effect, or assign the copyright in your changes to the FSF.
42640Both of these actions are easy to do and @emph{many} people have done so
42641already. If you have questions, please contact me
42642(@pxref{Bugs}),
42643or @EMAIL{assign@@gnu.org,assign at gnu dot org}.
42644
42645@item
42646Get the latest version.
42647It is much easier for me to integrate changes if they are relative to
42648the most recent distributed version of @command{gawk}, or better yet,
42649relative to the latest code in the Git repository.  If your version of
42650@command{gawk} is very old, I may not be able to integrate your changes at all.
42651(@xref{Getting},
42652for information on getting the latest version of @command{gawk}.)
42653
42654@item
42655@ifnotinfo
42656Follow the @cite{GNU Coding Standards}.
42657@end ifnotinfo
42658@ifinfo
42659See @inforef{Top, , Version, standards, GNU Coding Standards}.
42660@end ifinfo
42661This document describes how GNU software should be written. If you haven't
42662read it, please do so, preferably @emph{before} starting to modify @command{gawk}.
42663(The @cite{GNU Coding Standards} are available from
42664the GNU Project's
42665@uref{https://www.gnu.org/prep/standards/, website}.
42666Texinfo, Info, and DVI versions are also available.)
42667
42668@cindex @command{gawk} @subentry coding style in
42669@item
42670Use the @command{gawk} coding style.
42671The C code for @command{gawk} follows the instructions in the
42672@cite{GNU Coding Standards}, with minor exceptions.  The code is formatted
using the traditional ``K&R'' style, particularly as regards the placement
42674of braces and the use of TABs.  In brief, the coding rules for @command{gawk}
42675are as follows:
42676
42677@itemize @value{BULLET}
42678@item
42679Use ANSI/ISO style (prototype) function headers when defining functions.
42680
42681@item
42682Put the name of the function at the beginning of its own line.
42683
42684@item
42685Use @samp{#elif} instead of nesting @samp{#if} inside @samp{#else}.
42686
42687@item
42688Put the return type of the function, even if it is @code{int}, on the
42689line above the line with the name and arguments of the function.
42690
42691@item
42692Put spaces around parentheses used in control structures
42693(@code{if}, @code{while}, @code{for}, @code{do}, @code{switch},
42694and @code{return}).
42695
42696@item
42697Do not put spaces in front of parentheses used in function calls.
42698
42699@item
42700Put spaces around all C operators and after commas in function calls.
42701
42702@item
42703Do not use the comma operator to produce multiple side effects, except
42704in @code{for} loop initialization and increment parts, and in macro bodies.
42705
42706@item
42707Use real TABs for indenting, not spaces.
42708
42709@item
42710Use the ``K&R'' brace layout style.
42711
42712@item
42713Use comparisons against @code{NULL} and @code{'\0'} in the conditions of
42714@code{if}, @code{while}, and @code{for} statements, as well as in the @code{case}s
42715of @code{switch} statements, instead of just the
42716plain pointer or character value.
42717
42718@item
42719Use @code{true} and @code{false} for @code{bool} values,
42720the @code{NULL} symbolic constant for pointer values,
42721and the character constant @code{'\0'} where appropriate, instead of @code{1}
42722and @code{0}.
42723
42724@item
42725Provide one-line descriptive comments for each function.
42726
42727@item
42728Do not use the @code{alloca()} function for allocating memory off the
42729stack.  Its use causes more portability trouble than is worth the minor
42730benefit of not having to free the storage. Instead, use @code{malloc()}
42731and @code{free()}.
42732
42733@item
42734Do not use comparisons of the form @samp{! strcmp(a, b)} or similar.
42735As Henry Spencer once said, ``@code{strcmp()} is not a boolean!''
42736Instead, use @samp{strcmp(a, b) == 0}.
42737
42738@item
42739If adding new bit flag values, use explicit hexadecimal constants
42740(@code{0x001}, @code{0x002}, @code{0x004}, and so on) instead of
42741shifting one left by successive amounts (@samp{(1<<0)}, @samp{(1<<1)},
42742and so on).
42743@end itemize
42744
42745@quotation NOTE
42746If I have to reformat your code to follow the coding style used in
42747@command{gawk}, I may not bother to integrate your changes at all.
42748@end quotation
42749
42750@cindex Texinfo
42751@item
42752Update the documentation.
42753Along with your new code, please supply new sections and/or chapters
42754for this @value{DOCUMENT}.  If at all possible, please use real
42755Texinfo, instead of just supplying unformatted ASCII text (although
42756even that is better than no documentation at all).
42757Conventions to be followed in @cite{@value{TITLE}} are provided
42758after the @samp{@@bye} at the end of the Texinfo source file.
42759If possible, please update the @command{man} page as well.
42760
42761You will also have to sign paperwork for your documentation changes.
42762
42763@cindex @command{git} utility
42764@item
42765Submit changes as unified diffs.
42766Use @samp{diff -u -r -N} to compare
42767the original @command{gawk} source tree with your version.
42768I recommend using the GNU version of @command{diff}, or best of all,
42769@samp{git diff} or @samp{git format-patch}.
42770Send the output produced by @command{diff} to me when you
42771submit your changes.
42772(@xref{Bugs}, for the electronic mail
42773information.)
42774
42775Using this format makes it easy for me to apply your changes to the
42776master version of the @command{gawk} source code (using @command{patch}).
42777If I have to apply the changes manually, using a text editor, I may
42778not do so, particularly if there are lots of changes.
42779
42780@item
42781Include an entry for the @file{ChangeLog} file with your submission.
42782This helps further minimize the amount of work I have to do,
42783making it easier for me to accept patches.
42784It is simplest if you just make this part of your diff.
42785@end enumerate
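
The unified-diff step above can be sketched with two small stand-in
directory trees.  The names @file{gawk-orig}, @file{gawk-new}, and
@file{tree-change.diff} are hypothetical; in practice the two trees
would be the pristine @command{gawk} source and your modified copy:

@example
mkdir -p gawk-orig gawk-new
echo 'old text' > gawk-orig/NEWS
echo 'new text' > gawk-new/NEWS
echo 'a new file' > gawk-new/EXTRA      # -N makes new files part of the diff
# diff exits with status 1 when the trees differ; that is not an error.
diff -u -r -N gawk-orig gawk-new > tree-change.diff || test $? -eq 1
@end example

@noindent
The resulting file is the kind of input that @command{patch} consumes on
the receiving end.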
42786
42787Although this sounds like a lot of work, please remember that while you
42788may write the new code, I have to maintain it and support it. If it
42789isn't possible for me to do that with a minimum of extra work, then I
42790probably will not.
42791
42792@node New Ports
42793@appendixsubsec Porting @command{gawk} to a New Operating System
42794@cindex portability @subentry @command{gawk}
42795@cindex operating systems @subentry porting @command{gawk} to
42796
42797@cindex porting @command{gawk}
42798If you want to port @command{gawk} to a new operating system, there are
42799several steps:
42800
42801@enumerate 1
42802@item
42803Follow the guidelines in
42804@ifinfo
42805@ref{Adding Code},
42806@end ifinfo
42807@ifnotinfo
42808the previous @value{SECTION}
42809@end ifnotinfo
42810concerning coding style, submission of diffs, and so on.
42811
42812@item
42813Be prepared to sign the appropriate paperwork.
42814In order for the FSF to distribute your code, you must either place
42815your code in the public domain and submit a signed statement to that
42816effect, or assign the copyright in your code to the FSF.
42817Both of these actions are easy to do and @emph{many} people have done so
42818already. If you have questions, please contact me, or
42819@EMAIL{gnu@@gnu.org, gnu at gnu dot org}.
42820
42821@item
42822When doing a port, bear in mind that your code must coexist peacefully
42823with the rest of @command{gawk} and the other ports. Avoid gratuitous
42824changes to the system-independent parts of the code. If at all possible,
42825avoid sprinkling @samp{#ifdef}s just for your port throughout the
42826code.
42827
42828If the changes needed for a particular system affect too much of the
42829code, I probably will not accept them.  In such a case, you can, of course,
42830distribute your changes on your own, as long as you comply
42831with the GPL
42832(@pxref{Copying}).
42833
42834@item
42835A number of the files that come with @command{gawk} are maintained by other
42836people.  Thus, you should not change them
42837unless it is for a very good reason; i.e., changes are not out of the
42838question, but changes to these files are scrutinized extra carefully.
42839These are all the files in the @file{support} directory
42840within the @command{gawk} distribution. See there.
42841
42842@item
42843A number of other files are provided by the GNU
42844Autotools (Autoconf, Automake, and GNU @command{gettext}).
42845You should not change them either, unless it is for a very
42846good reason. The files are
42847@file{ABOUT-NLS},
42848@file{config.guess},
42849@file{config.rpath},
42850@file{config.sub},
42851@file{depcomp},
42852@file{INSTALL},
42853@file{install-sh},
42854@file{missing},
42855@file{mkinstalldirs},
42856and
42857@file{ylwrap}.
42858
42859@item
42860Be willing to continue to maintain the port.
42861Non-Unix operating systems are supported by volunteers who maintain
the code needed to compile and run @command{gawk} on their systems. If no one
42863volunteers to maintain a port, it becomes unsupported and it may
42864be necessary to remove it from the distribution.
42865
42866@item
42867Supply an appropriate @file{gawkmisc.???} file.
42868Each port has its own @file{gawkmisc.???} that implements certain
42869operating system specific functions. This is cleaner than a plethora of
42870@samp{#ifdef}s scattered throughout the code.  The @file{gawkmisc.c} in
42871the main source directory includes the appropriate
42872@file{gawkmisc.???} file from each subdirectory.
42873Be sure to update it as well.
42874
42875Each port's @file{gawkmisc.???} file has a suffix reminiscent of the machine
42876or operating system for the port---for example, @file{pc/gawkmisc.pc} and
42877@file{vms/gawkmisc.vms}. The use of separate suffixes, instead of plain
42878@file{gawkmisc.c}, makes it possible to move files from a port's subdirectory
into the main source directory, without accidentally destroying the real
42880@file{gawkmisc.c} file.  (Currently, this is only an issue for the
42881PC operating system ports.)
42882
42883@item
42884Supply a @file{Makefile} as well as any other C source and header files that are
42885necessary for your operating system.  All your code should be in a
42886separate subdirectory, with a name that is the same as, or reminiscent
42887of, either your operating system or the computer system.  If possible,
42888try to structure things so that it is not necessary to move files out
42889of the subdirectory into the main source directory.  If that is not
42890possible, then be sure to avoid using names for your files that
42891duplicate the names of files in the main source directory.
42892
42893@item
42894Update the documentation.
42895Please write a section (or sections) for this @value{DOCUMENT} describing the
42896installation and compilation steps needed to compile and/or install
42897@command{gawk} for your system.
42898@end enumerate
42899
42900Following these steps makes it much easier to integrate your changes
42901into @command{gawk} and have them coexist happily with other
42902operating systems' code that is already there.
42903
42904In the code that you supply and maintain, feel free to use a
42905coding style and brace layout that suits your taste.
42906
42907@node Derived Files
42908@appendixsubsec Why Generated Files Are Kept In Git
42909
42910@cindex Git, use of for @command{gawk} source code
42911@c From emails written March 22, 2012, to the gawk developers list.
42912
42913If you look at the @command{gawk} source in the Git
42914repository, you will notice that it includes files that are automatically
42915generated by GNU infrastructure tools, such as @file{Makefile.in} from
42916Automake and even @file{configure} from Autoconf.
42917
This differs from many Free Software projects, which do not store
the derived files.  Omitting them keeps the repository less cluttered
and makes it easier to see the substantive changes when comparing versions
and trying to understand what changed between commits.
42922
42923However, there are several reasons why the @command{gawk} maintainer
42924likes to have everything in the repository.
42925
42926First, because it is then easy to reproduce any given version completely,
42927without relying upon the availability of (older, likely obsolete, and
42928maybe even impossible to find) other tools.
42929
42930As an extreme example, if you ever even think about trying to compile,
42931oh, say, the V7 @command{awk}, you will discover that not only do you
42932have to bootstrap the V7 @command{yacc} to do so, but you also need the
42933V7 @command{lex}.  And the latter is pretty much impossible to bring up
42934on a modern GNU/Linux system.@footnote{We tried. It was painful.}
42935
42936(Or, let's say @command{gawk} 1.2 required @command{bison} whatever-it-was
42937in 1989 and that there was no @file{awkgram.c} file in the repository.  Is
42938there a guarantee that we could find that @command{bison} version? Or that
42939@emph{it} would build?)
42940
42941If the repository has all the generated files, then it's easy to just check
42942them out and build. (Or @emph{easier}, depending upon how far back we go.)
42943
42944And that brings us to the second (and stronger) reason why all the files
42945really need to be in Git.  It boils down to who do you cater
42946to---the @command{gawk} developer(s), or the user who just wants to check
42947out a version and try it out?
42948
42949The @command{gawk} maintainer
42950wants it to be possible for any interested @command{awk} user in the
42951world to just clone the repository, check out the branch of interest and
42952build it. Without their having to have the correct version(s) of the
42953autotools.@footnote{There is one GNU program that is (in our opinion)
42954severely difficult to bootstrap from the Git repository. For
42955example, on the author's old (but still working) PowerPC Macintosh with
42956Mac OS X 10.5, it was necessary to bootstrap a ton of software, starting
42957with Git itself, in order to try to work with the latest code.
42958It's not pleasant, and especially on older systems, it's a big waste
42959of time.
42960
42961Starting with the latest tarball was no picnic either. The maintainers
42962had dropped @file{.gz} and @file{.bz2} files and only distribute
42963@file{.tar.xz} files.  It was necessary to bootstrap @command{xz} first!}
42964That is the point of the @file{bootstrap.sh} file.  It touches the
42965various other files in the right order such that
42966
42967@example
42968# The canonical incantation for building GNU software:
42969./bootstrap.sh && ./configure && make
42970@end example
42971
42972@noindent
42973will @emph{just work}.
42974
42975This is extremely important for the @code{master} and
42976@code{gawk-@var{X}.@var{Y}-stable} branches.
42977
42978Further, the @command{gawk} maintainer would argue that it's also
42979important for the @command{gawk} developers. When he tried to check out
42980the @code{xgawk} branch@footnote{A branch (since removed) created by one of the other
42981developers that did not include the generated files.} to build it, he
42982couldn't. (No @file{ltmain.sh} file, and he had no idea how to create it,
42983and that was not the only problem.)
42984
42985He felt @emph{extremely} frustrated.  With respect to that branch,
42986the maintainer is no different than Jane User who wants to try to build
42987@code{gawk-4.1-stable} or @code{master} from the repository.
42988
42989Thus, the maintainer thinks that it's not just important, but critical,
42990that for any given branch, the above incantation @emph{just works}.
42991
42992@c Added 9/2014:
42993A third reason to have all the files is that without them, using @samp{git
42994bisect} to try to find the commit that introduced a bug is exceedingly
42995difficult. The maintainer tried to do that on another project that
42996requires running bootstrapping scripts just to create @command{configure}
42997and so on; it was really painful. When the repository is self-contained,
42998using @command{git bisect} in it is very easy.
42999
43000@c So - that's my reasoning and philosophy.
43001
43002What are some of the consequences and/or actions to take?
43003
43004@enumerate 1
43005@item
43006We don't mind that there are differing files in the different branches
43007as a result of different versions of the autotools.
43008
43009@enumerate A
43010@item
43011It's the maintainer's job to merge them and he will deal with it.
43012
43013@item
43014He is really good at @samp{git diff x y > /tmp/diff1 ; gvim /tmp/diff1} to
43015remove the diffs that aren't of interest in order to review code.
43016@end enumerate
43017
43018@item
43019It would certainly help if everyone used the same versions of the GNU tools
43020as he does, which in general are the latest released versions of
43021Automake,
43022Autoconf,
43023@command{bison},
43024GNU @command{gettext},
43025and
43026Libtool.
43027
43028@ignore
43029If it would help if I sent out an ``I just upgraded to version x.y
43030of tool Z'' kind of message to this list, I can do that.  Up until
43031now it hasn't been a real issue since I'm the only one who's been
43032dorking with the configuration machinery.
43033@end ignore
43034
43035@c @enumerate A
43036@c @item
43037Installing from source is quite easy. It's how the maintainer worked for years
43038(and still works).
43039He had @file{/usr/local/bin} at the front of his @env{PATH} and just did:
43040
43041@example
43042wget https://ftp.gnu.org/gnu/@var{package}/@var{package}-@var{x}.@var{y}.@var{z}.tar.gz
43043tar -xpzvf @var{package}-@var{x}.@var{y}.@var{z}.tar.gz
43044cd @var{package}-@var{x}.@var{y}.@var{z}
43045./configure && make && make check
43046make install    # as root
43047@end example
43048
43049@quotation NOTE
43050Because of the @samp{https://} URL, you may have to supply the
43051@option{--no-check-certificate} option to @command{wget} to download
43052the file.
43053@end quotation
43054
43055@c @item
43056@ignore
43057These days the maintainer uses Ubuntu 12.04 which is medium current, but
43058he is already doing the above for Automake, Autoconf, and @command{bison}.
43059@end ignore
43060
43061@ignore
43062(C. Rant: Recent Linux versions with GNOME 3 really suck. What
43063    are all those people thinking?  Fedora 15 was such a bust it drove
43064    me to Ubuntu, but Ubuntu 11.04 and 11.10 are totally unusable from
43065    a UI perspective. Bleah.)
43066@end ignore
43067@c @end enumerate
43068
43069@ignore
43070@item
43071If someone still feels really strongly about all this, then perhaps they
43072can have two branches, one for their development with just the clean
43073changes, and one that is buildable (xgawk and xgawk-buildable, maybe).
43074Or, as I suggested in another mail, make commits in pairs, the first with
43075the "real" changes and the second with "everything else needed for
43076 building".
43077@end ignore
43078@end enumerate
43079
43080Most of the above was originally written by the maintainer to other
43081@command{gawk} developers.  It raised the objection from one of
43082the developers ``@dots{} that anybody pulling down the source from
43083Git is not an end user.''
43084
43085However, this is not true. There are ``power @command{awk} users''
43086who can build @command{gawk} (using the magic incantation shown previously)
43087but who can't program in C.  Thus, the major branches should be
43088kept buildable all the time.
43089
43090It was then suggested that there be a @command{cron} job to create
nightly tarballs of ``the source.''  Here, the problem is that there
are several source trees, one for each of the various branches! So,
43093nightly tarballs aren't the answer, especially as the repository can go
43094for weeks without significant change being introduced.
43095
43096Fortunately, the Git server can meet this need. For any given
43097branch named @var{branchname}, use:
43098
43099@example
43100wget https://git.savannah.gnu.org/cgit/gawk.git/snapshot/gawk-@var{branchname}.tar.gz
43101@end example
43102
43103@noindent
43104to retrieve a snapshot of the given branch.
43105
43106@node Future Extensions
43107@appendixsec Probable Future Extensions
43108@ignore
43109From emory!scalpel.netlabs.com!lwall Tue Oct 31 12:43:17 1995
43110Return-Path: <emory!scalpel.netlabs.com!lwall>
43111Message-Id: <9510311732.AA28472@scalpel.netlabs.com>
43112To: arnold@skeeve.atl.ga.us (Arnold D. Robbins)
43113Subject: Re: May I quote you?
43114In-Reply-To: Your message of "Tue, 31 Oct 95 09:11:00 EST."
43115             <m0tAHPQ-00014MC@skeeve.atl.ga.us>
43116Date: Tue, 31 Oct 95 09:32:46 -0800
43117From: Larry Wall <emory!scalpel.netlabs.com!lwall>
43118
43119: Greetings. I am working on the release of gawk 3.0. Part of it will be a
43120: thoroughly updated manual. One of the sections deals with planned future
43121: extensions and enhancements.  I have the following at the beginning
43122: of it:
43123:
43124: @cindex PERL
43125: @cindex Wall, Larry
43126: @display
43127: @i{AWK is a language similar to PERL, only considerably more elegant.} @*
43128: Arnold Robbins
43129: @sp 1
43130: @i{Hey!} @*
43131: Larry Wall
43132: @end display
43133:
43134: Before I actually release this for publication, I wanted to get your
43135: permission to quote you.  (Hopefully, in the spirit of much of GNU, the
43136: implied humor is visible... :-)
43137
43138I think that would be fine.
43139
43140Larry
43141@end ignore
43142@cindex Perl
43143@cindex Wall, Larry
43144@cindex Robbins @subentry Arnold
43145@quotation
43146@i{AWK is a language similar to PERL, only considerably more elegant.}
43147@author Arnold Robbins
43148@end quotation
43149
43150@quotation
43151@i{Hey!}
43152@author Larry Wall
43153@end quotation
43154
43155The @file{TODO} file in the @code{master} branch of the @command{gawk}
43156Git repository lists possible future enhancements.  Some of these relate
43157to the source code, and others to possible new features.  Please see
43158that file for the list.
43159@xref{Additions},
43160if you are interested in tackling any of the projects listed there.
43161
43162@node Implementation Limitations
43163@appendixsec Some Limitations of the Implementation
43164
The following table describes the limits of @command{gawk} on a Unix-like
system (although they vary even then). Other systems may have
different limits.
43168
43169@multitable @columnfractions .40 .60
43170@headitem Item @tab Limit
43171@item Characters in a character class @tab 2^(number of bits per byte)
43172@item Length of input record in bytes @tab @code{ULONG_MAX}
43173@item Length of output record @tab Unlimited
43174@item Length of source line @tab Unlimited
43175@item Number of fields in a record @tab @code{ULONG_MAX}
43176@item Number of file redirections @tab Unlimited
@item Number of input records in one file @tab @code{LONG_MAX}
@item Number of input records total @tab @code{LONG_MAX}
43179@item Number of pipe redirections @tab min(number of processes per user, number of open files)
43180@item Numeric values @tab Double-precision floating point (if not using MPFR)
43181@item Size of a field in bytes @tab @code{ULONG_MAX}
43182@item Size of a literal string in bytes @tab @code{ULONG_MAX}
43183@item Size of a printf string in bytes @tab @code{ULONG_MAX}
43184@end multitable
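
On many systems the concrete values behind these symbolic limits can be
displayed with the POSIX @command{getconf} utility.  This is only a
sketch: the set of names that @command{getconf} accepts varies between
systems, and the three used here are assumed to be available:

@example
getconf CHAR_BIT     # number of bits per byte
getconf LONG_BIT     # width of a long, in bits
getconf ULONG_MAX    # largest unsigned long value
@end example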
43185
43186@node Extension Design
43187@appendixsec Extension API Design
43188
43189This @value{SECTION} documents the design of the extension API,
43190including a discussion of some of the history and problems that needed
43191to be solved.
43192
43193The first version of extensions for @command{gawk} was developed in
43194the mid-1990s and released with @command{gawk} 3.1 in the late 1990s.
43195The basic mechanisms and design remained unchanged for close to 15 years,
43196until 2012.
43197
43198The old extension mechanism used data types and functions from
43199@command{gawk} itself, with a ``clever hack'' to install extension
43200functions.
43201
43202@command{gawk} included some sample extensions, of which a few were
43203really useful.  However, it was clear from the outset that the extension
43204mechanism was bolted onto the side and was not really well thought out.
43205
43206@menu
43207* Old Extension Problems::           Problems with the old mechanism.
43208* Extension New Mechanism Goals::    Goals for the new mechanism.
43209* Extension Other Design Decisions:: Some other design decisions.
43210* Extension Future Growth::          Some room for future growth.
43211@end menu
43212
43213@node Old Extension Problems
43214@appendixsubsec Problems With The Old Mechanism
43215
43216The old extension mechanism had several problems:
43217
43218@itemize @value{BULLET}
43219@item
43220It depended heavily upon @command{gawk} internals.  Any time the
43221@code{NODE} structure@footnote{A critical central data structure
43222inside @command{gawk}.} changed, an extension would have to be
recompiled.  Furthermore, writing a nontrivial extension required understanding
something about @command{gawk}'s internal functions.  There was some
43225documentation in this @value{DOCUMENT}, but it was quite minimal.
43226
43227@item
43228Being able to call into @command{gawk} from an extension required linker
43229facilities that are common on Unix-derived systems but that did
43230not work on MS-Windows systems; users wanting extensions on MS-Windows
43231had to statically link them into @command{gawk}, even though MS-Windows supports
43232dynamic loading of shared objects.
43233
43234@item
43235The API would change occasionally as @command{gawk} changed; no compatibility
43236between versions was ever offered or planned for.
43237@end itemize
43238
43239Despite the drawbacks, the @command{xgawk} project developers forked
43240@command{gawk} and developed several significant extensions. They also
43241enhanced @command{gawk}'s facilities relating to file inclusion and
43242shared object access.
43243
43244A new API was desired for a long time, but only in 2012 did the
43245@command{gawk} maintainer and the @command{xgawk} developers finally
43246start working on it together.  More information about the @command{xgawk}
43247project is provided in @ref{gawkextlib}.
43248
43249@node Extension New Mechanism Goals
43250@appendixsubsec Goals For A New Mechanism
43251
43252Some goals for the new API were:
43253
43254@itemize @value{BULLET}
43255@item
43256The API should be independent of @command{gawk} internals.  Changes in
43257@command{gawk} internals should not be visible to the writer of an
43258extension function.
43259
43260@item
43261The API should provide @emph{binary} compatibility across @command{gawk}
43262releases as long as the API itself does not change.
43263
43264@item
43265The API should enable extensions written in C or C++ to have roughly the
43266same ``appearance'' to @command{awk}-level code as @command{awk}
43267functions do. This means that extensions should have:
43268
43269@itemize @value{MINUS}
43270@item
43271The ability to access function parameters.
43272
43273@item
43274The ability to turn an undefined parameter into an array (call by reference).
43275
43276@item
43277The ability to create, access and update global variables.
43278
43279@item
43280Easy access to all the elements of an array at once (``array flattening'')
in order to loop over all the elements in an easy fashion for C code.
43282
43283@item
43284The ability to create arrays (including @command{gawk}'s true
43285arrays of arrays).
43286@end itemize
43287@end itemize
43288
43289Some additional important goals were:
43290
43291@itemize @value{BULLET}
43292@item
43293The API should use only features in ISO C 90, so that extensions
43294can be written using the widest range of C and C++ compilers. The header
43295should include the appropriate @samp{#ifdef __cplusplus} and @samp{extern "C"}
43296magic so that a C++ compiler could be used.  (If using C++, the runtime
43297system has to be smart enough to call any constructors and destructors,
43298as @command{gawk} is a C program. As of this writing, this has not been
43299tested.)
43300
43301@item
43302The API mechanism should not require access to @command{gawk}'s
43303symbols@footnote{The @dfn{symbols} are the variables and functions
43304defined inside @command{gawk}.  Access to these symbols by code
43305external to @command{gawk} loaded dynamically at runtime is
43306problematic on MS-Windows.} by the compile-time or dynamic linker,
43307in order to enable creation of extensions that also work on MS-Windows.
43308@end itemize
43309
43310During development, it became clear that there were other features
43311that should be available to extensions, which were also subsequently
43312provided:
43313
43314@itemize @value{BULLET}
43315@item
43316Extensions should have the ability to hook into @command{gawk}'s
43317I/O redirection mechanism.  In particular, the @command{xgawk}
43318developers provided a so-called ``open hook'' to take over reading
43319records.  During development, this was generalized to allow
43320extensions to hook into input processing, output processing, and
43321two-way I/O.
43322
43323@item
43324An extension should be able to provide a ``call back'' function
43325to perform cleanup actions when @command{gawk} exits.
43326
43327@item
43328An extension should be able to provide a version string so that
43329@command{gawk}'s @option{--version} option can provide information
43330about extensions as well.
43331@end itemize
43332
43333The requirement to avoid access to @command{gawk}'s symbols is, at first
43334glance, a difficult one to meet.
43335
43336One design, apparently used by Perl and Ruby and maybe others, would
43337be to make the mainline @command{gawk} code into a library, with the
43338@command{gawk} utility a small C @code{main()} function linked against
43339the library.
43340
43341This seemed like the tail wagging the dog, complicating build and
43342installation and making a simple copy of the @command{gawk} executable
43343from one system to another (or one place to another on the same
43344system!) into a chancy operation.
43345
43346Pat Rankin suggested the solution that was adopted.
43347@xref{Extension Mechanism Outline}, for the details.
43348
43349@node Extension Other Design Decisions
43350@appendixsubsec Other Design Decisions
43351
43352As an arbitrary design decision, extensions can read the values of
43353predefined variables and arrays (such as @code{ARGV} and @code{FS}), but cannot
43354change them, with the exception of @code{PROCINFO}.
43355
43356The reason for this is to prevent an extension function from affecting
43357the flow of an @command{awk} program outside its control.  While a real
43358@command{awk} function can do what it likes, that is at the discretion
43359of the programmer.  An extension function should provide a service or
43360make a C API available for use within @command{awk}, and not mess with
43361@code{FS} or @code{ARGC} and @code{ARGV}.
43362
43363In addition, it becomes easy to start down a slippery slope. How
43364much access to @command{gawk} facilities do extensions need?
43365Do they need @code{getline}?  What about calling @code{gsub()} or
43366compiling regular expressions?  What about calling into @command{awk}
43367functions? (@emph{That} would be messy.)
43368
43369In order to avoid these issues, the @command{gawk} developers chose
43370to start with the simplest, most basic features that are still truly useful.
43371
43372Another decision is that although @command{gawk} provides nice things like
43373MPFR, and arrays indexed internally by integers, these features are not
43374being brought out to the API in order to keep things simple and close to
43375traditional @command{awk} semantics.  (In fact, arrays indexed internally
43376by integers are so transparent that they aren't even documented!)
43377
43378Additionally, all functions in the API check that their pointer
43379input parameters are not @code{NULL}. If they are, they return an error.
43380(It is a good idea for extension code to verify that
43381pointers received from @command{gawk} are not @code{NULL}.
43382Such a thing should not happen, but the @command{gawk} developers
43383are only human, and they have been known to occasionally make
43384mistakes.)
43385
43386With time, the API will undoubtedly evolve; the @command{gawk} developers
43387expect this to be driven by user needs. For now, the current API seems
43388to provide a minimal yet powerful set of features for creating extensions.
43389
43390@node Extension Future Growth
43391@appendixsubsec Room For Future Growth
43392
43393The API can later be expanded, in at least the following way:
43394
43395@itemize @value{BULLET}
43396@item
43397@command{gawk} passes an ``extension id'' into the extension when it
43398first loads the extension.  The extension then passes this id back
43399to @command{gawk} with each function call.  This mechanism allows
43400@command{gawk} to identify the extension calling into it, should it need
43401to know.
43402
43403@end itemize
43404
43405Of course, as of this writing, no decisions have been made with respect
43406to the above.
43407
43408@node Notes summary
43409@appendixsec Summary
43410
43411@itemize @value{BULLET}
43412@item
43413@command{gawk}'s extensions can be disabled with either the
43414@option{--traditional} option or with the @option{--posix} option.
43415The @option{--parsedebug} option is available if @command{gawk} is
43416compiled with @samp{-DDEBUG}.
43417
43418@item
43419The source code for @command{gawk} is maintained in a publicly
43420accessible Git repository. Anyone may check it out and view the source.
43421
43422@item
43423Contributions to @command{gawk} are welcome. Following the steps
43424outlined in this @value{CHAPTER} will make it easier to integrate
43425your contributions into the code base.
43426This applies both to new feature contributions and to ports to
43427additional operating systems.
43428
43429@item
43430@command{gawk} has some limits---generally those that are imposed by
43431the machine architecture.
43432
43433@item
43434The extension API design was intended to solve a number of problems
43435with the previous extension mechanism, enable features needed by
the @command{xgawk} project, and provide binary compatibility going forward.
43437
43438@item
43439The previous extension mechanism is no longer supported and was
43440removed from the code base with the 4.2 release.
43441
43442@end itemize
43443
43444
43445@node Basic Concepts
43446@appendix Basic Programming Concepts
43447@cindex programming @subentry concepts
43449
43450This @value{APPENDIX} attempts to define some of the basic concepts
43451and terms that are used throughout the rest of this @value{DOCUMENT}.
43452As this @value{DOCUMENT} is specifically about @command{awk},
43453and not about computer programming in general, the coverage here
43454is by necessity fairly cursory and simplistic.
43455(If you need more background, there are many
43456other introductory texts that you should refer to instead.)
43457
43458@menu
43459* Basic High Level::            The high level view.
43460* Basic Data Typing::           A very quick intro to data types.
43461@end menu
43462
43463@node Basic High Level
43464@appendixsec What a Program Does
43465
43466@cindex processing data
43467At the most basic level, the job of a program is to process
43468some input data and produce results.
43469@ifnotdocbook
43470See @ref{figure-general-flow}.
43471@end ifnotdocbook
43472@ifdocbook
43473See @inlineraw{docbook, <xref linkend="figure-general-flow"/>}.
43474@end ifdocbook
43475
43476@ifnotdocbook
43477@float Figure,figure-general-flow
43478@caption{General Program Flow}
43479@center @image{general-program, , , General program flow}
43480@end float
43481@end ifnotdocbook
43482
43483@docbook
43484<figure id="figure-general-flow" float="0">
43485<title>General Program Flow</title>
43486<mediaobject>
43487<imageobject role="web"><imagedata fileref="general-program.png" format="PNG"/></imageobject>
43488</mediaobject>
43489</figure>
43490@end docbook
43491
43492@cindex compiled programs
43493@cindex interpreted programs
43494The ``program'' in the figure can be either a compiled
43495program@footnote{Compiled programs are typically written
43496in lower-level languages such as C, C++, or Ada,
43497and then translated, or @dfn{compiled}, into a form that
43498the computer can execute directly.}
43499(such as @command{ls}),
43500or it may be @dfn{interpreted}.  In the latter case, a machine-executable
43501program such as @command{awk} reads your program, and then uses the
43502instructions in your program to process the data.
43503
43504@cindex programming @subentry basic steps
43505When you write a program, it usually consists
43506of the following, very basic set of steps,
43507@ifnotdocbook
43508as shown in @ref{figure-process-flow}:
43509@end ifnotdocbook
43510@ifdocbook
43511as shown in @inlineraw{docbook, <xref linkend="figure-process-flow"/>}:
43512@end ifdocbook
43513
43514@ifnotdocbook
43515@float Figure,figure-process-flow
@caption{Basic Program Stages}
43517@center @image{process-flow, , , Basic Program Stages}
43518@end float
43519@end ifnotdocbook
43520
43521@docbook
43522<figure id="figure-process-flow" float="0">
43523<title>Basic Program Stages</title>
43524<mediaobject>
43525<imageobject role="web"><imagedata fileref="process-flow.png" format="PNG"/></imageobject>
43526</mediaobject>
43527</figure>
43528@end docbook
43529
43530@table @asis
43531@item Initialization
43532These are the things you do before actually starting to process
43533data, such as checking arguments, initializing any data you need
43534to work with, and so on.
43535This step corresponds to @command{awk}'s @code{BEGIN} rule
43536(@pxref{BEGIN/END}).
43537
43538If you were baking a cake, this might consist of laying out all the
43539mixing bowls and the baking pan, and making sure you have all the
43540ingredients that you need.
43541
43542@item Processing
43543This is where the actual work is done.  Your program reads data,
43544one logical chunk at a time, and processes it as appropriate.
43545
43546In most programming languages, you have to manually manage the reading
43547of data, checking to see if there is more each time you read a chunk.
43548@command{awk}'s pattern-action paradigm
43549(@pxref{Getting Started})
43550handles the mechanics of this for you.
43551
43552In baking a cake, the processing corresponds to the actual labor:
43553breaking eggs, mixing the flour, water, and other ingredients, and then putting the cake
43554into the oven.
43555
43556@item Clean Up
43557Once you've processed all the data, you may have things you need to
43558do before exiting.
43559This step corresponds to @command{awk}'s @code{END} rule
43560(@pxref{BEGIN/END}).
43561
43562After the cake comes out of the oven, you still have to wrap it in
43563plastic wrap to keep anyone from tasting it, as well as wash
43564the mixing bowls and utensils.
43565@end table
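
The three stages map directly onto the layout of an @command{awk} program.
The following is only a minimal sketch; the variable names @code{records}
and @code{chars} are invented for this illustration and have no special
meaning to @command{awk}:

@example
# Initialization: runs once, before any input is read
BEGIN @{
    records = 0
    chars = 0
@}

# Processing: runs once for each chunk of input;
# $0 holds the chunk currently being processed
@{
    records++
    chars += length($0)
@}

# Clean up: runs once, after all the input has been read
END @{
    print records, "records,", chars, "characters"
@}
@end example

Note that the program contains no loop and no test for the end of the
input; @command{awk} supplies both.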
43566
43567@cindex algorithms
43568An @dfn{algorithm} is a detailed set of instructions necessary to accomplish
43569a task, or process data.  It is much the same as a recipe for baking
43570a cake.  Programs implement algorithms.  Often, it is up to you to design
43571the algorithm and implement it, simultaneously.
43572
43573@cindex records
43574@cindex fields
43575The ``logical chunks'' we talked about previously are called @dfn{records},
43576similar to the records a company keeps on employees, a school keeps for
43577students, or a doctor keeps for patients.
43578Each record has many component parts, such as first and last names,
43579date of birth, address, and so on.  The component parts are referred
43580to as the @dfn{fields} of the record.
43581
43582The act of reading data is termed @dfn{input}, and that of
43583generating results, not too surprisingly, is termed @dfn{output}.
43584They are often referred to together as ``input/output,''
43585and even more often, as ``I/O'' for short.
43586(You will also see ``input'' and ``output'' used as verbs.)
43587
43588@cindex data-driven languages
43589@cindex languages, data-driven
@command{awk} manages the reading of data for you, as well as
breaking it up into records and fields.  Your program's job is to
43592tell @command{awk} what to do with the data.  You do this by describing
43593@dfn{patterns} in the data to look for, and @dfn{actions} to execute
43594when those patterns are seen.  This @dfn{data-driven} nature of
43595@command{awk} programs usually makes them both easier to write
43596and easier to read.
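
As a small illustration of this style, consider the following sketch.
It assumes input records in which the third field is a number; that layout,
the threshold of 100, and the text @samp{error} are assumptions made
only for this example:

@example
# Pattern: the third field is greater than 100.
# Action: print the first field of that record.
$3 > 100 @{ print $1 @}

# A pattern with no action: print every record
# that contains the text "error".
/error/
@end example

Each rule simply states which records are of interest and what to do
with them; the reading and splitting of the input happen automatically.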
43597
43598@node Basic Data Typing
43599@appendixsec Data Values in a Computer
43600
43601@cindex variables
43602In a program,
43603you keep track of information and values in things called @dfn{variables}.
43604A variable is just a name for a given value, such as @code{first_name},
43605@code{last_name}, @code{address}, and so on.
43606@command{awk} has several predefined variables, and it has
43607special names to refer to the current input record
43608and the fields of the record.
43609You may also group multiple
43610associated values under one name, as an array.
43611
43612@cindex values @subentry numeric
43613@cindex values @subentry string
43614@cindex scalar values
43615Data, particularly in @command{awk}, consists of either numeric
43616values, such as 42 or 3.1415927, or string values.
43617String values are essentially anything that's not a number, such as a name.
43618Strings are sometimes referred to as @dfn{character data}, since they
43619store the individual characters that comprise them.
Individual values, whether numeric or string, are
referred to as @dfn{scalar} values.
43622Groups of values, such as arrays, are not scalars.
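
The following sketch shows the difference.  The names and values are
made up for the purpose of illustration:

@example
BEGIN @{
    first_name = "Ada"           # a scalar holding a string
    age = 36                     # a scalar holding a number
    phone["home"] = "555-0100"   # an array: several values
    phone["work"] = "555-0199"   #   grouped under one name
    print first_name, age, phone["home"]
@}
@end example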
43623
43624@ref{Computer Arithmetic}, provided a basic introduction to numeric
43625types (integer and floating-point) and how they are used in a computer.
43626Please review that information, including a number of caveats that
43627were presented.
43628
43629@cindex null strings
43630While you are probably used to the idea of a number without a value (i.e., zero),
43631it takes a bit more getting used to the idea of zero-length character data.
43632Nevertheless, such a thing exists.
43633It is called the @dfn{null string}.
43634The null string is character data that has no value.
43635In other words, it is empty.  It is written in @command{awk} programs
43636like this: @code{""}.
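
A short sketch may make the idea more concrete (the variable name
@code{s} is arbitrary):

@example
BEGIN @{
    s = ""
    print length(s)     # the null string has zero characters
    if (s == "")
        print "s is the null string"
@}
@end example

This program prints @samp{0}, followed by the message.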
43637
Humans are used to working in decimal; i.e., base 10.  In base 10,
digits go from 0 to 9, and then ``roll over'' into the next
43640@iftex
43641column.  (Remember grade school? @math{42 = 4\times 10 + 2}.)
43642@end iftex
43643@ifnottex
43644column.  (Remember grade school? 42 = 4 x 10 + 2.)
43645@end ifnottex
43646
43647There are other number bases though.  Computers commonly use base 2
43648or @dfn{binary}, base 8 or @dfn{octal}, and base 16 or @dfn{hexadecimal}.
43649In binary, each column represents two times the value in the column to
43650its right. Each column may contain either a 0 or a 1.
43651@iftex
43652Thus, binary 1010 represents @math{(1\times 8) + (0\times 4) + (1\times 2) + (0\times 1)}, or decimal 10.
43653@end iftex
43654@ifnottex
43655Thus, binary 1010 represents (1 x 8) + (0 x 4) + (1 x 2)
43656+ (0 x 1), or decimal 10.
43657@end ifnottex
43658Octal and hexadecimal are discussed more in
43659@ref{Nondecimal-numbers}.
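
The column arithmetic just described can be sketched in @command{awk}
itself.  The function name @code{bin2dec()} is hypothetical; it is not
part of @command{awk} or @command{gawk}, and the sketch assumes that its
argument contains only the characters @samp{0} and @samp{1}:

@example
# bin2dec: convert a string of binary digits to a number
function bin2dec(str,    i, value)
@{
    value = 0
    for (i = 1; i <= length(str); i++)
        # moving one column to the right doubles what came before
        value = value * 2 + substr(str, i, 1)
    return value
@}

BEGIN @{ print bin2dec("1010") @}
@end example

Working from left to right, the running value goes 1, 2, 5, 10, so the
program prints @samp{10}.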
43660
43661At the very lowest level, computers store values as groups of binary digits,
43662or @dfn{bits}.  Modern computers group bits into groups of eight, called @dfn{bytes}.
43663Advanced applications sometimes have to manipulate bits directly,
43664and @command{gawk} provides functions for doing so.
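
For instance, the following sketch uses three of the functions described in
@ref{Bitwise Functions}.  These functions are specific to @command{gawk},
so the example is not expected to work with other versions of
@command{awk}; the operands were chosen arbitrarily:

@example
BEGIN @{
    print and(12, 10)    # 1100 AND 1010 is 1000, or 8
    print or(12, 10)     # 1100 OR  1010 is 1110, or 14
    print lshift(1, 3)   # 1 shifted left 3 places is 1000, or 8
@}
@end example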
43665
43666Programs are written in programming languages.
43667Hundreds, if not thousands, of programming languages exist.
43668One of the most popular is the C programming language.
43669The C language had a very strong influence on the design of
43670the @command{awk} language.
43671
43672@cindex Kernighan, Brian
43673@cindex Ritchie, Dennis
43674There have been several versions of C.  The first is often referred to
43675as ``K&R'' C, after the initials of Brian Kernighan and Dennis Ritchie,
43676the authors of the first book on C.  (Dennis Ritchie created the language,
43677and Brian Kernighan was one of the creators of @command{awk}.)
43678
43679In the mid-1980s, an effort began to produce an international standard
43680for C.  This work culminated in 1989, with the production of the ANSI
43681standard for C.  This standard became an ISO standard in 1990.
43682In 1999, a revised ISO C standard was approved and released.
43683Where it makes sense, POSIX @command{awk} is compatible with 1999 ISO C.
43684
43685
43686@node Glossary
43687@unnumbered Glossary
43688
43689@table @asis
43690@item Action
43691A series of @command{awk} statements attached to a rule.  If the rule's
43692pattern matches an input record, @command{awk} executes the
43693rule's action.  Actions are always enclosed in braces.
43694(@xref{Action Overview}.)
43695
43696@cindex Ada programming language
43697@cindex programming languages @subentry Ada
43698@item Ada
43699A programming language originally defined by the U.S.@: Department of
43700Defense for embedded programming. It was designed to enforce good
43701Software Engineering practices.
43702
43703@cindex Spencer, Henry
43704@cindex @command{sed} utility
43705@cindex amazing @command{awk} assembler (@command{aaa})
43706@cindex @command{aaa} (amazing @command{awk} assembler) program
43707@item Amazing @command{awk} Assembler
43708Henry Spencer at the University of Toronto wrote a retargetable assembler
43709completely as @command{sed} and @command{awk} scripts.  It is thousands
43710of lines long, including machine descriptions for several eight-bit
43711microcomputers.  It is a good example of a program that would have been
43712better written in another language.
43713@c You can get it from @uref{http://awk.info/?awk100/aaa}.
43714
43715@cindex amazingly workable formatter (@command{awf})
43716@cindex @command{awf} (amazingly workable formatter) program
43717@item Amazingly Workable Formatter (@command{awf})
43718Henry Spencer at the University of Toronto wrote a formatter that accepts
43719a large subset of the @samp{nroff -ms} and @samp{nroff -man} formatting
43720commands, using @command{awk} and @command{sh}.
43721@c It is available
43722@c from @uref{http://awk.info/?tools/awf}.
43723
43724@item Anchor
43725The regexp metacharacters @samp{^} and @samp{$}, which force the match
43726to the beginning or end of the string, respectively.
43727
43728@cindex ANSI
43729@item ANSI
43730The American National Standards Institute.  This organization produces
43731many standards, among them the standards for the C and C++ programming
43732languages.
43733These standards often become international standards as well. See also
43734``ISO.''
43735
43736@item Argument
43737An argument can be two different things.  It can be an option or a
43738@value{FN} passed to a command while invoking it from the command line, or
it can be something passed to a @dfn{function} inside a program, e.g.,
43740inside @command{awk}.
43741
43742In the latter case, an argument can be passed to a function in two ways.
43743Either it is given to the called function by value, i.e., a copy of the
43744value of the variable is made available to the called function, but the
43745original variable cannot be modified by the function itself; or it is
given by reference, i.e., a pointer to the variable in question is passed to
the function, which can then directly modify it. In @command{awk},
scalars are passed by value, and arrays are passed by reference.
43749See ``Pass By Value/Reference.''
43750
43751@item Array
43752A grouping of multiple values under the same name.
43753Most languages just provide sequential arrays.
43754@command{awk} provides associative arrays.
43755
43756@item Assertion
43757A statement in a program that a condition is true at this point in the program.
43758Useful for reasoning about how a program is supposed to behave.
43759
43760@item Assignment
43761An @command{awk} expression that changes the value of some @command{awk}
43762variable or data object.  An object that you can assign to is called an
43763@dfn{lvalue}.  The assigned values are called @dfn{rvalues}.
43764@xref{Assignment Ops}.
43765
43766@item Associative Array
43767Arrays in which the indices may be numbers or strings, not just
43768sequential integers in a fixed range.
43769
43770@item @command{awk} Language
43771The language in which @command{awk} programs are written.
43772
43773@item @command{awk} Program
43774An @command{awk} program consists of a series of @dfn{patterns} and
43775@dfn{actions}, collectively known as @dfn{rules}.  For each input record
43776given to the program, the program's rules are all processed in turn.
43777@command{awk} programs may also contain function definitions.
43778
43779@item @command{awk} Script
43780Another name for an @command{awk} program.
43781
43782@item Bash
43783The GNU version of the standard shell
43784@ifnotinfo
43785(the @b{B}ourne-@b{A}gain @b{SH}ell).
43786@end ifnotinfo
43787@ifinfo
43788(the Bourne-Again SHell).
43789@end ifinfo
43790See also ``Bourne Shell.''
43791
43792@item Binary
43793Base-two notation, where the digits are @code{0}--@code{1}. Since
43794electronic circuitry works ``naturally'' in base 2 (just think of Off/On),
43795everything inside a computer is calculated using base 2. Each digit
43796represents the presence (or absence) of a power of 2 and is called a
43797@dfn{bit}. So, for example, the base-two number @code{10101} is
43798@iftex
43799the same as decimal 21, (@math{(1\times 16) + (1\times 4) + (1\times 1)}).
43800@end iftex
43801@ifnottex
43802the same as decimal 21, ((1 x 16) + (1 x 4) + (1 x 1)).
43803@end ifnottex
43804
43805Since base-two numbers quickly become
43806very long to read and write, they are usually grouped by 3 (i.e., they are
43807read as octal numbers), or by 4 (i.e., they are read as hexadecimal
43808numbers). There is no direct way to insert base 2 numbers in a C program.
43809If need arises, such numbers are usually inserted as octal or hexadecimal
43810numbers. The number of base-two digits that fit into registers used for
43811representing integer numbers in computers is a rough indication of the
43812computing power of the computer itself.  Most computers nowadays use 64
43813bits for representing integer numbers in their registers, but 32-bit,
4381416-bit and 8-bit registers have been widely used in the past.
@xref{Nondecimal-numbers}.

@item Bit
43817Short for ``Binary Digit.''
43818All values in computer memory ultimately reduce to binary digits: values
43819that are either zero or one.
43820Groups of bits may be interpreted differently---as integers,
43821floating-point numbers, character data, addresses of other
43822memory objects, or other data.
43823@command{awk} lets you work with floating-point numbers and strings.
43824@command{gawk} lets you manipulate bit values with the built-in
43825functions described in
43826@ref{Bitwise Functions}.
43827
43828Computers are often defined by how many bits they use to represent integer
values.  Typical systems today are 64-bit systems, although 32-bit systems
remain in use, and 16-bit systems have essentially
disappeared.
43832
43833@item Boolean Expression
43834Named after the English mathematician Boole. See also ``Logical Expression.''
43835
43836@item Bourne Shell
43837The standard shell (@file{/bin/sh}) on Unix and Unix-like systems,
originally written by Stephen R.@: Bourne at Bell Laboratories.
43839Many shells (Bash, @command{ksh}, @command{pdksh}, @command{zsh}) are
43840generally upwardly compatible with the Bourne shell.
43841
43842@item Braces
43843The characters @samp{@{} and @samp{@}}.  Braces are used in
43844@command{awk} for delimiting actions, compound statements, and function
43845bodies.
43846
43847@item Bracket Expression
43848Inside a @dfn{regular expression}, an expression included in square
43849brackets, meant to designate a single character as belonging to a
43850specified character class. A bracket expression can contain a list of one
43851or more characters, like @samp{[abc]}, a range of characters, like
43852@samp{[A-Z]}, or a name, delimited by @samp{:}, that designates a known set
43853of characters, like @samp{[:digit:]}. The form of bracket expression
43854enclosed between @samp{:} is independent of the underlying representation
of the characters themselves, which could utilize the ASCII, EBCDIC, or
43856Unicode codesets, depending on the architecture of the computer system, and on
43857localization.
43858See also ``Regular Expression.''
43859
43860@item Built-in Function
43861The @command{awk} language provides built-in functions that perform various
43862numerical, I/O-related, and string computations.  Examples are
43863@code{sqrt()} (for the square root of a number) and @code{substr()} (for a
43864substring of a string).
43865@command{gawk} provides functions for timestamp management, bit manipulation,
43866array sorting, type checking,
43867and runtime string translation.
43868(@xref{Built-in}.)
43869
43870@item Built-in Variable
43871@code{ARGC},
43872@code{ARGV},
43873@code{CONVFMT},
43874@code{ENVIRON},
43875@code{FILENAME},
43876@code{FNR},
43877@code{FS},
43878@code{NF},
43879@code{NR},
43880@code{OFMT},
43881@code{OFS},
43882@code{ORS},
43883@code{RLENGTH},
43884@code{RSTART},
43885@code{RS},
43886and
43887@code{SUBSEP}
43888are the variables that have special meaning to @command{awk}.
43889In addition,
43890@code{ARGIND},
43891@code{BINMODE},
43892@code{ERRNO},
43893@code{FIELDWIDTHS},
43894@code{FPAT},
43895@code{IGNORECASE},
43896@code{LINT},
43897@code{PROCINFO},
43898@code{RT},
43899and
43900@code{TEXTDOMAIN}
43901are the variables that have special meaning to @command{gawk}.
43902Changing some of them affects @command{awk}'s running environment.
43903(@xref{Built-in Variables}.)
43904
43905@item C
43906The system programming language that most GNU software is written in.  The
43907@command{awk} programming language has C-like syntax, and this @value{DOCUMENT}
43908points out similarities between @command{awk} and C when appropriate.
43909
43910In general, @command{gawk} attempts to be as similar to the 1990 version
43911of ISO C as makes sense.
43912
43913@item C Shell
43914The C Shell (@command{csh} or its improved version, @command{tcsh}) is a Unix shell that was
43915created by Bill Joy in the late 1970s. The C shell was differentiated from
43916other shells by its interactive features and overall style, which
43917looks more like C. The C Shell is not backward compatible with the Bourne
43918Shell, so special attention is required when converting scripts
43919written for other Unix shells to the C shell, especially with regard to the management of
43920shell variables.
43921See also ``Bourne Shell.''
43922
43923@item C++
43924A popular object-oriented programming language derived from C.
43925
43926@item Character Class
43927See ``Bracket Expression.''
43928
43929@item Character List
43930See ``Bracket Expression.''
43931
43932@cindex ASCII
43933@cindex ISO @subentry ISO 8859-1 character set
43934@cindex ISO @subentry ISO Latin-1 character set
43935@cindex character sets (machine character encodings)
43936@cindex Unicode
43937@item Character Set
43938The set of numeric codes used by a computer system to represent the
43939characters (letters, numbers, punctuation, etc.) of a particular country
43940or place. The most common character set in use today is ASCII (American
43941Standard Code for Information Interchange).  Many European
43942countries use an extension of ASCII known as ISO-8859-1 (ISO Latin-1).
43943The @uref{http://www.unicode.org, Unicode character set} is
43944increasingly popular and standard, and is particularly
43945widely used on GNU/Linux systems.
43946
43947@cindex Kernighan, Brian
43948@cindex Bentley, Jon
43949@cindex @command{chem} utility
43950@item CHEM
43951A preprocessor for @command{pic} that reads descriptions of molecules
43952and produces @command{pic} input for drawing them.
43953It was written in @command{awk}
43954by Brian Kernighan and Jon Bentley, and is available from
43955@uref{http://netlib.org/typesetting/chem}.
43956
43957@item Comparison Expression
43958A relation that is either true or false, such as @samp{a < b}.
43959Comparison expressions are used in @code{if}, @code{while}, @code{do},
43960and @code{for}
43961statements, and in patterns to select which input records to process.
43962(@xref{Typing and Comparison}.)
43963
43964@cindex compiled programs
43965@item Compiler
43966A program that translates human-readable source code into
43967machine-executable object code.  The object code is then executed
43968directly by the computer.
43969See also ``Interpreter.''
43970
43971@item Complemented Bracket Expression
The negation of a @dfn{bracket expression}; it matches every character @emph{not}
described by the corresponding bracket expression. A @samp{^} placed immediately after
the opening bracket negates the expression.  For example, @samp{[^[:digit:]]}
matches any character that is not a digit, and @samp{[^bad]}
matches any character that is not one of the letters @samp{b}, @samp{a},
or @samp{d}.
43978See ``Bracket Expression.''
43979
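As an illustration of a complemented bracket expression, the following
sketch (the sample input is made up) uses @code{gsub()} to delete every
character that is not a digit:

@example
$ @kbd{echo "tel: 555-1234" | awk '@{ gsub(/[^[:digit:]]/, ""); print @}'}
@print{} 5551234
@end example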
43980@item Compound Statement
43981A series of @command{awk} statements, enclosed in curly braces.  Compound
43982statements may be nested.
43983(@xref{Statements}.)
43984
43985@item Computed Regexps
43986See ``Dynamic Regular Expressions.''
43987
43988@item Concatenation
43989Concatenating two strings means sticking them together, one after another,
43990producing a new string.  For example, the string @samp{foo} concatenated with
43991the string @samp{bar} gives the string @samp{foobar}.
43992(@xref{Concatenation}.)
43993
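A minimal sketch of concatenation; the variable names are arbitrary.
Placing two expressions next to each other is all that is required:

@example
$ @kbd{awk 'BEGIN @{ x = "foo"; y = "bar"; print x y @}'}
@print{} foobar
@end example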
43994@item Conditional Expression
43995An expression using the @samp{?:} ternary operator, such as
43996@samp{@var{expr1} ? @var{expr2} : @var{expr3}}.  The expression
43997@var{expr1} is evaluated; if the result is true, the value of the whole
43998expression is the value of @var{expr2}; otherwise the value is
43999@var{expr3}.  In either case, only one of @var{expr2} and @var{expr3}
44000is evaluated. (@xref{Conditional Exp}.)
44001
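The following sketch (with arbitrary variable names) shows a conditional
expression in which only the selected branch is evaluated.  Because the
condition is true, @code{i} is incremented and @code{j} is left alone:

@example
$ @kbd{awk 'BEGIN @{ i = j = 0; r = (1 ? ++i : ++j); print r, i, j @}'}
@print{} 1 1 0
@end example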
44002@item Control Statement
44003A control statement is an instruction to perform a given operation or a set
44004of operations inside an @command{awk} program, if a given condition is
44005true. Control statements are: @code{if}, @code{for}, @code{while}, and
44006@code{do}
44007(@pxref{Statements}).
44008
44009@cindex McIlroy, Doug
44010@cindex cookie
44011@item Cookie
44012A peculiar goodie, token, saying or remembrance
44013produced by or presented to a program. (With thanks to Professor Doug McIlroy.)
44014@ignore
44015From: Doug McIlroy <doug@cs.dartmouth.edu>
44016Date: Sat, 13 Oct 2012 19:55:25 -0400
44017To: arnold@skeeve.com
44018Subject: Re: origin of the term "cookie"?
44019
44020I believe the term "cookie", for a more or less inscrutable
44021saying or crumb of information, was injected into Unix
44022jargon by Bob Morris, who used the word quite frequently.
44023It had no fixed meaning as it now does in browsers.
44024
44025The word had been around long before it was recognized in
44026the 8th edition glossary (earlier editions had no glossary):
44027
44028cookie   a peculiar goodie, token, saying or remembrance
44029returned by or presented to a program. [I would say that
44030"returned by" would better read "produced by", and assume
44031responsibility for the inexactitude.]
44032
44033Doug McIlroy
44034
44035From: Doug McIlroy <doug@cs.dartmouth.edu>
44036Date: Sun, 14 Oct 2012 10:08:43 -0400
44037To: arnold@skeeve.com
44038Subject: Re: origin of the term "cookie"?
44039
44040> Can I forward your email to Eric Raymond, for possible addition to the
44041> Jargon File?
44042
44043Sure. I might add that I don't know how "cookie" entered Morris's
44044vocabulary. Certainly "values of beta give rise to dom!" (see google)
44045was an early, if not the earliest Unix cookie.  The fact that it was
44046found lying around on a model 37 teletype (which had Greek beta in
44047its type box) suggests that maybe it was seen to be like milk and
44048cookies laid out for Santa Claus. Morris was wont to make such
44049connections.
44050
44051Doug
44052@end ignore
44053
44054@item Coprocess
A subordinate program with which two-way communication is possible.
44056
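As a sketch of how a coprocess might be used in @command{gawk}, the
following program sends two lines to @command{sort} through the
@samp{|&} operator and then reads the sorted result back.  The choice of
@command{sort} and the sample data are only assumptions made for
illustration:

@example
BEGIN @{
    cmd = "sort"                  # any filter program would do
    print "banana" |& cmd         # send data to the coprocess
    print "apple" |& cmd
    close(cmd, "to")              # signal end of input to sort
    while ((cmd |& getline line) > 0)
        print "got:", line        # read the sorted results back
    close(cmd)
@}
@end example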
44057@item Curly Braces
44058See ``Braces.''
44059
44060@cindex dark corner
44061@item Dark Corner
44062An area in the language where specifications often were (or still
44063are) not clear, leading to unexpected or undesirable behavior.
44064Such areas are marked in this @value{DOCUMENT} with
44065@iftex
44066the picture of a flashlight in the margin
44067@end iftex
44068@ifnottex
44069``(d.c.)'' in the text
44070@end ifnottex
44071and are indexed under the heading ``dark corner.''
44072
44073@item Data Driven
44074A description of @command{awk} programs, where you specify the data you
44075are interested in processing, and what to do when that data is seen.
44076
44077@item Data Objects
44078These are numbers and strings of characters.  Numbers are converted into
44079strings and vice versa, as needed.
44080(@xref{Conversion}.)
44081
44082@item Deadlock
44083The situation in which two communicating processes are each waiting
44084for the other to perform an action.
44085
44086@item Debugger
44087A program used to help developers remove ``bugs'' from (de-bug)
44088their programs.
44089
44090@item Double Precision
44091An internal representation of numbers that can have fractional parts.
44092Double precision numbers keep track of more digits than do single precision
44093numbers, but operations on them are sometimes more expensive.  This is the way
44094@command{awk} stores numeric values.  It is the C type @code{double}.
44095
44096@item Dynamic Regular Expression
44097A dynamic regular expression is a regular expression written as an
44098ordinary expression.  It could be a string constant, such as
44099@code{"foo"}, but it may also be an expression whose value can vary.
44100(@xref{Computed Regexps}.)
44101
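For instance, in the following sketch the regexp is held in a variable
(the name @code{re} is arbitrary) and could just as well be computed at
run time:

@example
$ @kbd{awk 'BEGIN @{ re = "^[0-9]+$"; if ("42" ~ re) print "number" @}'}
@print{} number
@end example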
44102@item Empty String
44103See ``Null String.''
44104
44105@item Environment
44106A collection of strings, of the form @samp{@var{name}=@var{val}}, that each
44107program has available to it. Users generally place values into the
44108environment in order to provide information to various programs. Typical
44109examples are the environment variables @env{HOME} and @env{PATH}.
44110
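An @command{awk} program can look up the environment through the
@code{ENVIRON} array.  As a small sketch, the following command prints
whatever value @env{HOME} has on the system where it is run:

@example
$ @kbd{awk 'BEGIN @{ print ENVIRON["HOME"] @}'}
@end example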
44111@cindex epoch, definition of
44112@item Epoch
44113The date used as the ``beginning of time'' for timestamps.
44114Time values in most systems are represented as seconds since the epoch,
44115with library functions available for converting these values into
44116standard date and time formats.
44117
44118The epoch on Unix and POSIX systems is 1970-01-01 00:00:00 UTC.
44119See also ``GMT'' and ``UTC.''
44120
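As a quick check, and assuming a @command{gawk} whose @code{strftime()}
accepts a third ``UTC flag'' argument, formatting the timestamp zero
shows the epoch itself:

@example
$ @kbd{gawk 'BEGIN @{ print strftime("%Y-%m-%d %H:%M:%S", 0, 1) @}'}
@print{} 1970-01-01 00:00:00
@end example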
44121@item Escape Sequences
44122@cindex ASCII
44123A special sequence of characters used for describing nonprinting
44124characters, such as @samp{\n} for newline or @samp{\033} for the ASCII
44125ESC (Escape) character. (@xref{Escape Sequences}.)
44126
44127@item Extension
44128An additional feature or change to a programming language or
44129utility not defined by that language's or utility's standard.
44130@command{gawk} has (too) many extensions over POSIX @command{awk}.
44131
44132@item FDL
44133See ``Free Documentation License.''
44134
44135@item Field
44136When @command{awk} reads an input record, it splits the record into pieces
44137separated by whitespace (or by a separator regexp that you can
44138change by setting the predefined variable @code{FS}).  Such pieces are
called fields.  If the pieces are of fixed length, you can use the predefined
variable @code{FIELDWIDTHS} to describe their lengths.
44141If you wish to specify the contents of fields instead of the field
44142separator, you can use the predefined variable @code{FPAT} to do so.
44143(@xref{Field Separators},
44144@ref{Constant Size},
44145and
44146@ref{Splitting By Content}.)
44147
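The following sketch, using made-up input, shows a record split first by
a separator character and then by fixed widths (the latter is specific
to @command{gawk}):

@example
$ @kbd{echo "alpha:beta:gamma" | awk -F: '@{ print $2 @}'}
@print{} beta
$ @kbd{echo 20211031 |}
> @kbd{gawk 'BEGIN @{ FIELDWIDTHS = "4 2 2" @} @{ print $1, $2, $3 @}'}
@print{} 2021 10 31
@end example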
44148@item Flag
44149A variable whose truth value indicates the existence or nonexistence
44150of some condition.
44151
44152@item Floating-Point Number
44153Often referred to in mathematical terms as a ``rational'' or real number,
44154this is just a number that can have a fractional part.
44155See also ``Double Precision'' and ``Single Precision.''
44156
44157@item Format
44158Format strings control the appearance of output in the
44159@code{strftime()} and @code{sprintf()} functions, and in the
44160@code{printf} statement as well.  Also, data conversions from numbers to strings
44161are controlled by the format strings contained in the predefined variables
44162@code{CONVFMT} and @code{OFMT}. (@xref{Control Letters}.)
44163
44164@item Fortran
44165Shorthand for FORmula TRANslator, one of the first programming languages
44166available for scientific calculations. It was created by John Backus,
44167and has been available since 1957. It is still in use today.
44168
44169@item Free Documentation License
44170This document describes the terms under which this @value{DOCUMENT}
44171is published and may be copied. (@xref{GNU Free Documentation License}.)
44172
44173@cindex FSF (Free Software Foundation)
44174@cindex Free Software Foundation (FSF)
44175@cindex Stallman, Richard
44176@item Free Software Foundation
44177A nonprofit organization dedicated
44178to the production and distribution of freely distributable software.
44179It was founded by Richard M.@: Stallman, the author of the original
44180Emacs editor.  GNU Emacs is the most widely used version of Emacs today.
44181
44182@item FSF
44183See ``Free Software Foundation.''
44184
44185@item Function
A part of an @command{awk} program that can be invoked from any point in
the program, to perform a task.  @command{awk} has several built-in
functions.
Users can define their own functions anywhere in the program.
Functions can be recursive, i.e., they may invoke themselves.
44191@xref{Functions}.
44192In @command{gawk} it is also possible to have functions shared
44193among different programs, and included where required using the
44194@code{@@include} directive
44195(@pxref{Include Files}).
44196In @command{gawk} the name of the function that should be invoked
44197can be generated at run time, i.e., dynamically.
44198The @command{gawk} extension API provides constructor functions
44199(@pxref{Constructor Functions}).
44200
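As a sketch, the following program defines a recursive function (the
name @code{fact} is only an example) and calls it both directly and, in
@command{gawk}, indirectly through a variable holding its name.  Both
@code{print} statements should produce 120:

@example
function fact(n)
@{
    return (n <= 1) ? 1 : n * fact(n - 1)    # recursive call
@}

BEGIN @{
    print fact(5)        # direct call
    f = "fact"
    print @@f(5)         # indirect call (gawk extension)
@}
@end example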
44201
44202@item @command{gawk}
44203The GNU implementation of @command{awk}.
44204
44205@cindex GPL (General Public License)
44206@item General Public License
44207This document describes the terms under which @command{gawk} and its source
44208code may be distributed. (@xref{Copying}.)
44209
44210@item GMT
44211``Greenwich Mean Time.''
44212This is the old term for UTC.
44213It is the time of day used internally for Unix and POSIX systems.
44214See also ``Epoch'' and ``UTC.''
44215
44216@cindex FSF (Free Software Foundation)
44217@cindex Free Software Foundation (FSF)
44218@cindex GNU Project
44219@item GNU
``GNU's not Unix.''  An ongoing project of the Free Software Foundation
44221to create a complete, freely distributable, POSIX-compliant computing
44222environment.
44223
44224@item GNU/Linux
44225A variant of the GNU system using the Linux kernel, instead of the
44226Free Software Foundation's Hurd kernel.
44227The Linux kernel is a stable, efficient, full-featured clone of Unix that has
44228been ported to a variety of architectures.
44229It is most popular on PC-class systems, but runs well on a variety of
44230other systems too.
44231The Linux kernel source code is available under the terms of the GNU General
44232Public License, which is perhaps its most important aspect.
44233
44234@item GPL
44235See ``General Public License.''
44236
44237@item Hexadecimal
44238Base 16 notation, where the digits are @code{0}--@code{9} and
44239@code{A}--@code{F}, with @samp{A}
44240representing 10, @samp{B} representing 11, and so on, up to @samp{F} for 15.
44241Hexadecimal numbers are written in C using a leading @samp{0x},
44242@iftex
44243to indicate their base.  Thus, @code{0x12} is 18 (@math{(1\times 16) + 2}).
44244@end iftex
44245@ifnottex
44246to indicate their base.  Thus, @code{0x12} is 18 ((1 x 16) + 2).
44247@end ifnottex
44248@xref{Nondecimal-numbers}.
44249
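As a small sketch, and assuming @command{gawk} is running in its default
mode (not in POSIX or traditional mode), a hexadecimal constant can be
written directly in the program text, and the @samp{%x} format converts
in the other direction:

@example
$ @kbd{gawk 'BEGIN @{ printf "%d %x\n", 0x12, 18 @}'}
@print{} 18 12
@end example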
44250@item I/O
44251Abbreviation for ``Input/Output,'' the act of moving data into and/or
44252out of a running program.
44253
44254@item Input Record
44255A single chunk of data that is read in by @command{awk}.  Usually, an @command{awk} input
44256record consists of one line of text.
44257(@xref{Records}.)
44258
44259@item Integer
44260A whole number, i.e., a number that does not have a fractional part.
44261
44262@item Internationalization
44263The process of writing or modifying a program so
44264that it can use multiple languages without requiring
44265further source code changes.
44266
44267@cindex interpreted programs
44268@item Interpreter
44269A program that reads human-readable source code directly, and uses
44270the instructions in it to process data and produce results.
44271@command{awk} is typically (but not always) implemented as an interpreter.
44272See also ``Compiler.''
44273
44274@item Interval Expression
44275A component of a regular expression that lets you specify repeated matches of
44276some part of the regexp.  Interval expressions were not originally available
44277in @command{awk} programs.
44278
44279@cindex ISO
44280@item ISO
44281The International Organization for Standardization.
44282This organization produces international standards for many things, including
44283programming languages, such as C and C++.
44284In the computer arena, important standards like those for C, C++, and POSIX
44285become both American national and ISO international standards simultaneously.
44286This @value{DOCUMENT} refers to Standard C as ``ISO C'' throughout.
44287See @uref{https://www.iso.org/iso/home/about.htm, the ISO website} for more
44288information about the name of the organization and its language-independent
44289three-letter acronym.
44290
44291@cindex Java programming language
44292@cindex programming languages @subentry Java
44293@item Java
44294A modern programming language originally developed by Sun Microsystems
(now Oracle) supporting object-oriented programming.  Although usually
44296implemented by compiling to the instructions for a standard virtual
44297machine (the JVM), the language can be compiled to native code.
44298
44299@item Keyword
44300In the @command{awk} language, a keyword is a word that has special
44301meaning.  Keywords are reserved and may not be used as variable names.
44302
44303@command{gawk}'s keywords are:
44304@code{BEGIN},
44305@code{BEGINFILE},
44306@code{END},
44307@code{ENDFILE},
44308@code{break},
44309@code{case},
44310@code{continue},
44311@code{default},
44312@code{delete},
44313@code{do@dots{}while},
44314@code{else},
44315@code{exit},
44316@code{for@dots{}in},
44317@code{for},
44318@code{function},
44319@code{func},
44320@code{if},
44321@code{next},
44322@code{nextfile},
44323@code{switch},
44324and
44325@code{while}.
44326
44327@item Korn Shell
44328The Korn Shell (@command{ksh}) is a Unix shell which was developed by David Korn at Bell
44329Laboratories in the early 1980s. The Korn Shell is backward-compatible with the Bourne
44330shell and includes many features of the C shell.
44331See also ``Bourne Shell.''
44332
44333@cindex LGPL (Lesser General Public License)
44334@cindex Lesser General Public License (LGPL)
44335@cindex GNU Lesser General Public License
44336@item Lesser General Public License
44337This document describes the terms under which binary library archives
44338or shared objects,
44339and their source code may be distributed.
44340
44341@item LGPL
44342See ``Lesser General Public License.''
44343
44344@item Linux
44345See ``GNU/Linux.''
44346
44347@item Localization
44348The process of providing the data necessary for an
44349internationalized program to work in a particular language.
44350
44351@item Logical Expression
44352An expression using the operators for logic, AND, OR, and NOT, written
44353@samp{&&}, @samp{||}, and @samp{!} in @command{awk}. Often called Boolean
44354expressions, after the mathematician who pioneered this kind of
44355mathematical logic.
44356
44357@item Lvalue
44358An expression that can appear on the left side of an assignment
44359operator.  In most languages, lvalues can be variables or array
44360elements.  In @command{awk}, a field designator can also be used as an
44361lvalue.
44362
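For example, in this sketch the field designator @code{$2} appears on
the left side of an assignment, which causes the record to be rebuilt
with the new value:

@example
$ @kbd{echo "a b c" | awk '@{ $2 = "X"; print @}'}
@print{} a X c
@end example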
44363@item Matching
44364The act of testing a string against a regular expression.  If the
44365regexp describes the contents of the string, it is said to @dfn{match} it.
44366
44367@item Metacharacters
44368Characters used within a regexp that do not stand for themselves.
44369Instead, they denote regular expression operations, such as repetition,
44370grouping, or alternation.
44371
44372@item Nesting
44373Nesting is where information is organized in layers, or where objects
44374contain other similar objects.
44375In @command{gawk} the @code{@@include}
44376directive can be nested. The ``natural'' nesting of arithmetic and
44377logical operations can be changed using parentheses
44378(@pxref{Precedence}).
44379
44380@item No-op
44381An operation that does nothing.
44382
44383@item Null String
44384A string with no characters in it.  It is represented explicitly in
44385@command{awk} programs by placing two double quote characters next to
44386each other (@code{""}).  It can appear in input data by having two successive
44387occurrences of the field separator appear next to each other.
44388
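In the following sketch, with made-up input, the two adjacent commas
produce a second field that is the null string, so its length is zero
even though three fields are counted:

@example
$ @kbd{echo "a,,c" | awk -F, '@{ print NF, length($2) @}'}
@print{} 3 0
@end example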
44389@item Number
44390A numeric-valued data object.  Modern @command{awk} implementations use
44391double precision floating-point to represent numbers.
44392Ancient @command{awk} implementations used single precision floating-point.
44393
44394@item Octal
44395Base-eight notation, where the digits are @code{0}--@code{7}.
44396Octal numbers are written in C using a leading @samp{0},
44397@iftex
44398to indicate their base.  Thus, @code{013} is 11 (@math{(1\times 8) + 3}).
44399@end iftex
44400@ifnottex
44401to indicate their base.  Thus, @code{013} is 11 ((1 x 8) + 3).
44402@end ifnottex
44403@xref{Nondecimal-numbers}.
44404
44405@item Output Record
44406A single chunk of data that is written out by @command{awk}.  Usually, an
44407@command{awk} output record consists of one or more lines of text.
44408@xref{Records}.
44409
44410@item Pattern
44411Patterns tell @command{awk} which input records are interesting to which
44412rules.
44413
44414A pattern is an arbitrary conditional expression against which input is
44415tested.  If the condition is satisfied, the pattern is said to @dfn{match}
44416the input record.  A typical pattern might compare the input record against
44417a regular expression. (@xref{Pattern Overview}.)
44418
44419@item PEBKAC
44420An acronym describing what is possibly the most frequent
44421source of computer usage problems. (Problem Exists Between
44422Keyboard And Chair.)
44423
44424@item Plug-in
See ``Extension.''
44426
44427@item POSIX
44428The name for a series of standards
44429that specify a Portable Operating System interface.  The ``IX'' denotes
44430the Unix heritage of these standards.  The main standard of interest for
44431@command{awk} users is
44432@cite{IEEE Standard for Information Technology, Standard 1003.1@sup{TM}-2017
44433(Revision of IEEE Std 1003.1-2008)}.
44434The 2018 POSIX standard can be found online at
44435@url{https://pubs.opengroup.org/onlinepubs/9699919799/}.
44436
44437@item Precedence
44438The order in which operations are performed when operators are used
44439without explicit parentheses.
44440
44441@item Private
44442Variables and/or functions that are meant for use exclusively by library
44443functions and not for the main @command{awk} program. Special care must be
44444taken when naming such variables and functions.
44445(@xref{Library Names}.)
44446
44447@item Range (of input lines)
44448A sequence of consecutive lines from the input file(s).  A pattern
44449can specify ranges of input lines for @command{awk} to process or it can
44450specify single lines. (@xref{Pattern Overview}.)
44451
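As a sketch, the range pattern below selects the second through third
input lines; the sample input is arbitrary:

@example
$ @kbd{printf 'a\nb\nc\nd\n' | awk 'NR == 2, NR == 3'}
@print{} b
@print{} c
@end example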
44452@item Record
See ``Input Record'' and ``Output Record.''
44454
44455@item Recursion
44456When a function calls itself, either directly or indirectly.
44457If this is clear, stop, and proceed to the next entry.
44458Otherwise, refer to the entry for ``recursion.''
44459
44460@item Redirection
44461Redirection means performing input from something other than the standard input
44462stream, or performing output to something other than the standard output stream.
44463
44464You can redirect input to the @code{getline} statement using
44465the @samp{<}, @samp{|}, and @samp{|&} operators.
44466You can redirect the output of the @code{print} and @code{printf} statements
44467to a file or a system command, using the @samp{>}, @samp{>>}, @samp{|}, and @samp{|&}
44468operators.
44469(@xref{Getline},
44470and @ref{Redirection}.)
44471
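The following program is a sketch of the three most common forms of
redirection.  The @value{FN} @file{/tmp/example.out} is hypothetical;
any writable file would serve:

@example
BEGIN @{
    out = "/tmp/example.out"     # hypothetical file name
    print "hello" > out          # output redirected to a file
    close(out)                   # flush and close before rereading
    while ((getline line < out) > 0)   # input redirected from the file
        print "read:", line
    print "b\na" | "sort"        # output piped to a command
    close("sort")
@}
@end example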
44472@item Reference Counts
44473An internal mechanism in @command{gawk} to minimize the amount of memory
44474needed to store the value of string variables. If the value assumed by
44475a variable is used in more than one place, only one copy of the value
44476itself is kept, and the associated reference count is increased when the
44477same value is used by an additional variable, and decreased when the related
44478variable is no longer in use. When the reference count goes to zero,
44479the memory space used to store the value of the variable is freed.
44480
44481@item Regexp
44482See ``Regular Expression.''
44483
44484@item Regular Expression
44485A regular expression (``regexp'' for short) is a pattern that denotes a
44486set of strings, possibly an infinite set.  For example, the regular expression
44487@samp{R.*xp} matches any string starting with the letter @samp{R}
44488and ending with the letters @samp{xp}.  In @command{awk}, regular expressions are
44489used in patterns and in conditional expressions.  Regular expressions may contain
44490escape sequences. (@xref{Regexp}.)
44491
44492@item Regular Expression Constant
44493A regular expression constant is a regular expression written within
44494slashes, such as @code{/foo/}.  This regular expression is chosen
44495when you write the @command{awk} program and cannot be changed during
44496its execution. (@xref{Regexp Usage}.)
44497
44498@item Regular Expression Operators
44499See ``Metacharacters.''
44500
44501@item Rounding
44502Rounding the result of an arithmetic operation can be tricky.
44503More than one way of rounding exists, and in @command{gawk}
44504it is possible to choose which method should be used in a program.
44505@xref{Setting the rounding mode}.
44506
44507@item Rule
44508A segment of an @command{awk} program that specifies how to process single
44509input records.  A rule consists of a @dfn{pattern} and an @dfn{action}.
44510@command{awk} reads an input record; then, for each rule, if the input record
44511satisfies the rule's pattern, @command{awk} executes the rule's action.
44512Otherwise, the rule does nothing for that input record.
44513
44514@item Rvalue
44515A value that can appear on the right side of an assignment operator.
44516In @command{awk}, essentially every expression has a value. These values
44517are rvalues.
44518
44519@item Scalar
44520A single value, be it a number or a string.
44521Regular variables are scalars; arrays and functions are not.
44522
44523@item Search Path
44524In @command{gawk}, a list of directories to search for @command{awk} program source files.
44525In the shell, a list of directories to search for executable programs.
44526
44527@item @command{sed}
44528See ``Stream Editor.''
44529
44530@item Seed
44531The initial value, or starting point, for a sequence of random numbers.
44532
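As a sketch, reusing the same seed (the value 42 is arbitrary) restarts
the sequence, so the first random number is the same both times:

@example
BEGIN @{
    srand(42); a = rand()
    srand(42); b = rand()    # same seed, so the same first value
    print (a == b)           # prints 1
@}
@end example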
44533@item Shell
44534The command interpreter for Unix and POSIX-compliant systems.
44535The shell works both interactively, and as a programming language
44536for batch files, or shell scripts.
44537
44538@item Short-Circuit
44539The nature of the @command{awk} logical operators @samp{&&} and @samp{||}.
44540If the value of the entire expression is determinable from evaluating just
44541the lefthand side of these operators, the righthand side is not
44542evaluated.
44543(@xref{Boolean Ops}.)
44544
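Short-circuit evaluation is often used to guard an operation that would
otherwise fail.  In this sketch (the variable name is arbitrary), the
division is never attempted because the lefthand side of @samp{&&} is
already false:

@example
BEGIN @{
    n = 0
    if (n != 0 && 10 / n > 1)    # 10 / n is not evaluated when n is 0
        print "large"
    else
        print "division skipped"
@}
@end example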
44545@item Side Effect
44546A side effect occurs when an expression has an effect aside from merely
44547producing a value.  Assignment expressions, increment and decrement
44548expressions, and function calls have side effects.
44549(@xref{Assignment Ops}.)
44550
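For example, in this sketch the expression @code{i++} produces the value
5, which is assigned to @code{j}, and as a side effect changes @code{i}
to 6:

@example
$ @kbd{awk 'BEGIN @{ i = 5; j = i++; print i, j @}'}
@print{} 6 5
@end example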
44551@item Single Precision
44552An internal representation of numbers that can have fractional parts.
44553Single precision numbers keep track of fewer digits than do double precision
44554numbers, but operations on them are sometimes less expensive in terms of CPU time.
44555This is the type used by some ancient versions of @command{awk} to store
44556numeric values.  It is the C type @code{float}.
44557
44558@item Space
44559The character generated by hitting the space bar on the keyboard.
44560
44561@item Special File
44562A @value{FN} interpreted internally by @command{gawk}, instead of being handed
44563directly to the underlying operating system---for example, @file{/dev/stderr}.
44564(@xref{Special Files}.)
44565
44566@item Statement
An instruction inside an @command{awk} program, in the action part
of a pattern--action rule or inside an
@command{awk} function. A statement can be a variable assignment,
an array operation, a loop, etc.
44571
44572@item Stream Editor
44573A program that reads records from an input stream and processes them one
44574or more at a time.  This is in contrast with batch programs, which may
44575expect to read their input files in entirety before starting to do
44576anything, as well as with interactive programs which require input from the
44577user.
44578
44579@item String
44580A datum consisting of a sequence of characters, such as @samp{I am a
44581string}.  Constant strings are written with double quotes in the
44582@command{awk} language and may contain escape sequences.
44583(@xref{Escape Sequences}.)
44584
44585@item Tab
44586The character generated by hitting the @kbd{TAB} key on the keyboard.
44587It usually expands to up to eight spaces upon output.
44588
44589@item Text Domain
44590A unique name that identifies an application.
44591Used for grouping messages that are translated at runtime
44592into the local language.
44593
44594@item Timestamp
44595A value in the ``seconds since the epoch'' format used by Unix
44596and POSIX systems.  Used for the @command{gawk} functions
44597@code{mktime()}, @code{strftime()}, and @code{systime()}.
44598See also ``Epoch,'' ``GMT,'' and ``UTC.''
44599
44600@cindex GNU/Linux
44601@cindex Unix
44602@cindex BSD-based operating systems
44603@cindex NetBSD
44604@cindex FreeBSD
44605@cindex OpenBSD
44606@item Unix
A computer operating system originally developed in the early 1970s at
44608AT&T Bell Laboratories.  It initially became popular in universities around
44609the world and later moved into commercial environments as a software
44610development system and network server system. There are many commercial
44611versions of Unix, as well as several work-alike systems whose source code
44612is freely available (such as GNU/Linux, @uref{http://www.netbsd.org, NetBSD},
44613@uref{https://www.freebsd.org, FreeBSD}, and @uref{http://www.openbsd.org, OpenBSD}).
44614
44615@item UTC
The accepted abbreviation for ``Coordinated Universal Time.''
44617This is standard time in Greenwich, England, which is used as a
44618reference time for day and date calculations.
44619See also ``Epoch'' and ``GMT.''
44620
44621@item Variable
44622A name for a value. In @command{awk}, variables may be either scalars
44623or arrays.
44624
44625@item Whitespace
44626A sequence of space, TAB, or newline characters occurring inside an input
44627record or a string.
44628
44629@end table
44630
44631@end ifclear
44632
44633@c The GNU General Public License.
44634@node Copying
44635@unnumbered GNU General Public License
44636@ifnotdocbook
44637@center Version 3, 29 June 2007
44638@end ifnotdocbook
44639@docbook
44640<subtitle>Version 3, 29 June 2007</subtitle>
44641@end docbook
44642
44643@c This file is intended to be included within another document,
44644@c hence no sectioning command or @node.
44645
44646@display
44647Copyright @copyright{} 2007 Free Software Foundation, Inc. @url{https://fsf.org/}
44648
44649Everyone is permitted to copy and distribute verbatim copies of this
44650license document, but changing it is not allowed.
44651@end display
44652
44653@c fakenode --- for prepinfo
44654@heading Preamble
44655
44656The GNU General Public License is a free, copyleft license for
44657software and other kinds of works.
44658
44659The licenses for most software and other practical works are designed
44660to take away your freedom to share and change the works.  By contrast,
44661the GNU General Public License is intended to guarantee your freedom
44662to share and change all versions of a program---to make sure it remains
44663free software for all its users.  We, the Free Software Foundation,
44664use the GNU General Public License for most of our software; it
44665applies also to any other work released this way by its authors.  You
44666can apply it to your programs, too.
44667
44668When we speak of free software, we are referring to freedom, not
44669price.  Our General Public Licenses are designed to make sure that you
44670have the freedom to distribute copies of free software (and charge for
44671them if you wish), that you receive source code or can get it if you
44672want it, that you can change the software or use pieces of it in new
44673free programs, and that you know you can do these things.
44674
44675To protect your rights, we need to prevent others from denying you
44676these rights or asking you to surrender the rights.  Therefore, you
44677have certain responsibilities if you distribute copies of the
44678software, or if you modify it: responsibilities to respect the freedom
44679of others.
44680
44681For example, if you distribute copies of such a program, whether
44682gratis or for a fee, you must pass on to the recipients the same
44683freedoms that you received.  You must make sure that they, too,
44684receive or can get the source code.  And you must show them these
44685terms so they know their rights.
44686
44687Developers that use the GNU GPL protect your rights with two steps:
44688(1) assert copyright on the software, and (2) offer you this License
44689giving you legal permission to copy, distribute and/or modify it.
44690
44691For the developers' and authors' protection, the GPL clearly explains
44692that there is no warranty for this free software.  For both users' and
44693authors' sake, the GPL requires that modified versions be marked as
44694changed, so that their problems will not be attributed erroneously to
44695authors of previous versions.
44696
44697Some devices are designed to deny users access to install or run
44698modified versions of the software inside them, although the
44699manufacturer can do so.  This is fundamentally incompatible with the
44700aim of protecting users' freedom to change the software.  The
44701systematic pattern of such abuse occurs in the area of products for
44702individuals to use, which is precisely where it is most unacceptable.
44703Therefore, we have designed this version of the GPL to prohibit the
44704practice for those products.  If such problems arise substantially in
44705other domains, we stand ready to extend this provision to those
44706domains in future versions of the GPL, as needed to protect the
44707freedom of users.
44708
44709Finally, every program is threatened constantly by software patents.
44710States should not allow patents to restrict development and use of
44711software on general-purpose computers, but in those that do, we wish
44712to avoid the special danger that patents applied to a free program
44713could make it effectively proprietary.  To prevent this, the GPL
44714assures that patents cannot be used to render the program non-free.
44715
44716The precise terms and conditions for copying, distribution and
44717modification follow.
44718
44719@c fakenode --- for prepinfo
44720@heading TERMS AND CONDITIONS
44721
44722@enumerate 0
44723@item Definitions.
44724
44725``This License'' refers to version 3 of the GNU General Public License.
44726
44727``Copyright'' also means copyright-like laws that apply to other kinds
44728of works, such as semiconductor masks.
44729
44730``The Program'' refers to any copyrightable work licensed under this
44731License.  Each licensee is addressed as ``you''.  ``Licensees'' and
44732``recipients'' may be individuals or organizations.
44733
44734To ``modify'' a work means to copy from or adapt all or part of the work
44735in a fashion requiring copyright permission, other than the making of
44736an exact copy.  The resulting work is called a ``modified version'' of
44737the earlier work or a work ``based on'' the earlier work.
44738
44739A ``covered work'' means either the unmodified Program or a work based
44740on the Program.
44741
44742To ``propagate'' a work means to do anything with it that, without
44743permission, would make you directly or secondarily liable for
44744infringement under applicable copyright law, except executing it on a
44745computer or modifying a private copy.  Propagation includes copying,
44746distribution (with or without modification), making available to the
44747public, and in some countries other activities as well.
44748
44749To ``convey'' a work means any kind of propagation that enables other
44750parties to make or receive copies.  Mere interaction with a user
44751through a computer network, with no transfer of a copy, is not
44752conveying.
44753
44754An interactive user interface displays ``Appropriate Legal Notices'' to
44755the extent that it includes a convenient and prominently visible
44756feature that (1) displays an appropriate copyright notice, and (2)
44757tells the user that there is no warranty for the work (except to the
44758extent that warranties are provided), that licensees may convey the
44759work under this License, and how to view a copy of this License.  If
44760the interface presents a list of user commands or options, such as a
44761menu, a prominent item in the list meets this criterion.
44762
44763@item Source Code.
44764
44765The ``source code'' for a work means the preferred form of the work for
44766making modifications to it.  ``Object code'' means any non-source form
44767of a work.
44768
44769A ``Standard Interface'' means an interface that either is an official
44770standard defined by a recognized standards body, or, in the case of
44771interfaces specified for a particular programming language, one that
44772is widely used among developers working in that language.
44773
44774The ``System Libraries'' of an executable work include anything, other
44775than the work as a whole, that (a) is included in the normal form of
44776packaging a Major Component, but which is not part of that Major
44777Component, and (b) serves only to enable use of the work with that
44778Major Component, or to implement a Standard Interface for which an
44779implementation is available to the public in source code form.  A
44780``Major Component'', in this context, means a major essential component
44781(kernel, window system, and so on) of the specific operating system
44782(if any) on which the executable work runs, or a compiler used to
44783produce the work, or an object code interpreter used to run it.
44784
44785The ``Corresponding Source'' for a work in object code form means all
44786the source code needed to generate, install, and (for an executable
44787work) run the object code and to modify the work, including scripts to
44788control those activities.  However, it does not include the work's
44789System Libraries, or general-purpose tools or generally available free
44790programs which are used unmodified in performing those activities but
44791which are not part of the work.  For example, Corresponding Source
44792includes interface definition files associated with source files for
44793the work, and the source code for shared libraries and dynamically
44794linked subprograms that the work is specifically designed to require,
44795such as by intimate data communication or control flow between those
44796subprograms and other parts of the work.
44797
44798The Corresponding Source need not include anything that users can
44799regenerate automatically from other parts of the Corresponding Source.
44800
44801The Corresponding Source for a work in source code form is that same
44802work.
44803
44804@item Basic Permissions.
44805
44806All rights granted under this License are granted for the term of
44807copyright on the Program, and are irrevocable provided the stated
44808conditions are met.  This License explicitly affirms your unlimited
44809permission to run the unmodified Program.  The output from running a
44810covered work is covered by this License only if the output, given its
44811content, constitutes a covered work.  This License acknowledges your
44812rights of fair use or other equivalent, as provided by copyright law.
44813
44814You may make, run and propagate covered works that you do not convey,
44815without conditions so long as your license otherwise remains in force.
44816You may convey covered works to others for the sole purpose of having
44817them make modifications exclusively for you, or provide you with
44818facilities for running those works, provided that you comply with the
44819terms of this License in conveying all material for which you do not
44820control copyright.  Those thus making or running the covered works for
44821you must do so exclusively on your behalf, under your direction and
44822control, on terms that prohibit them from making any copies of your
44823copyrighted material outside their relationship with you.
44824
44825Conveying under any other circumstances is permitted solely under the
44826conditions stated below.  Sublicensing is not allowed; section 10
44827makes it unnecessary.
44828
44829@item Protecting Users' Legal Rights From Anti-Circumvention Law.
44830
44831No covered work shall be deemed part of an effective technological
44832measure under any applicable law fulfilling obligations under article
4483311 of the WIPO copyright treaty adopted on 20 December 1996, or
44834similar laws prohibiting or restricting circumvention of such
44835measures.
44836
44837When you convey a covered work, you waive any legal power to forbid
44838circumvention of technological measures to the extent such
44839circumvention is effected by exercising rights under this License with
44840respect to the covered work, and you disclaim any intention to limit
44841operation or modification of the work as a means of enforcing, against
44842the work's users, your or third parties' legal rights to forbid
44843circumvention of technological measures.
44844
44845@item Conveying Verbatim Copies.
44846
44847You may convey verbatim copies of the Program's source code as you
44848receive it, in any medium, provided that you conspicuously and
44849appropriately publish on each copy an appropriate copyright notice;
44850keep intact all notices stating that this License and any
44851non-permissive terms added in accord with section 7 apply to the code;
44852keep intact all notices of the absence of any warranty; and give all
44853recipients a copy of this License along with the Program.
44854
44855You may charge any price or no price for each copy that you convey,
44856and you may offer support or warranty protection for a fee.
44857
44858@item Conveying Modified Source Versions.
44859
44860You may convey a work based on the Program, or the modifications to
44861produce it from the Program, in the form of source code under the
44862terms of section 4, provided that you also meet all of these
44863conditions:
44864
44865@enumerate a
44866@item
44867The work must carry prominent notices stating that you modified it,
44868and giving a relevant date.
44869
44870@item
44871The work must carry prominent notices stating that it is released
44872under this License and any conditions added under section 7.  This
44873requirement modifies the requirement in section 4 to ``keep intact all
44874notices''.
44875
44876@item
44877You must license the entire work, as a whole, under this License to
44878anyone who comes into possession of a copy.  This License will
44879therefore apply, along with any applicable section 7 additional terms,
44880to the whole of the work, and all its parts, regardless of how they
44881are packaged.  This License gives no permission to license the work in
44882any other way, but it does not invalidate such permission if you have
44883separately received it.
44884
44885@item
44886If the work has interactive user interfaces, each must display
44887Appropriate Legal Notices; however, if the Program has interactive
44888interfaces that do not display Appropriate Legal Notices, your work
44889need not make them do so.
44890@end enumerate
44891
44892A compilation of a covered work with other separate and independent
44893works, which are not by their nature extensions of the covered work,
44894and which are not combined with it such as to form a larger program,
44895in or on a volume of a storage or distribution medium, is called an
44896``aggregate'' if the compilation and its resulting copyright are not
44897used to limit the access or legal rights of the compilation's users
44898beyond what the individual works permit.  Inclusion of a covered work
44899in an aggregate does not cause this License to apply to the other
44900parts of the aggregate.
44901
44902@item  Conveying Non-Source Forms.
44903
44904You may convey a covered work in object code form under the terms of
44905sections 4 and 5, provided that you also convey the machine-readable
44906Corresponding Source under the terms of this License, in one of these
44907ways:
44908
44909@enumerate a
44910@item
44911Convey the object code in, or embodied in, a physical product
44912(including a physical distribution medium), accompanied by the
44913Corresponding Source fixed on a durable physical medium customarily
44914used for software interchange.
44915
44916@item
44917Convey the object code in, or embodied in, a physical product
44918(including a physical distribution medium), accompanied by a written
44919offer, valid for at least three years and valid for as long as you
44920offer spare parts or customer support for that product model, to give
44921anyone who possesses the object code either (1) a copy of the
44922Corresponding Source for all the software in the product that is
44923covered by this License, on a durable physical medium customarily used
44924for software interchange, for a price no more than your reasonable
44925cost of physically performing this conveying of source, or (2) access
44926to copy the Corresponding Source from a network server at no charge.
44927
44928@item
44929Convey individual copies of the object code with a copy of the written
44930offer to provide the Corresponding Source.  This alternative is
44931allowed only occasionally and noncommercially, and only if you
44932received the object code with such an offer, in accord with subsection
449336b.
44934
44935@item
44936Convey the object code by offering access from a designated place
44937(gratis or for a charge), and offer equivalent access to the
44938Corresponding Source in the same way through the same place at no
44939further charge.  You need not require recipients to copy the
44940Corresponding Source along with the object code.  If the place to copy
44941the object code is a network server, the Corresponding Source may be
44942on a different server (operated by you or a third party) that supports
44943equivalent copying facilities, provided you maintain clear directions
44944next to the object code saying where to find the Corresponding Source.
44945Regardless of what server hosts the Corresponding Source, you remain
44946obligated to ensure that it is available for as long as needed to
44947satisfy these requirements.
44948
44949@item
44950Convey the object code using peer-to-peer transmission, provided you
44951inform other peers where the object code and Corresponding Source of
44952the work are being offered to the general public at no charge under
44953subsection 6d.
44954
44955@end enumerate
44956
44957A separable portion of the object code, whose source code is excluded
44958from the Corresponding Source as a System Library, need not be
44959included in conveying the object code work.
44960
44961A ``User Product'' is either (1) a ``consumer product'', which means any
44962tangible personal property which is normally used for personal,
44963family, or household purposes, or (2) anything designed or sold for
44964incorporation into a dwelling.  In determining whether a product is a
44965consumer product, doubtful cases shall be resolved in favor of
44966coverage.  For a particular product received by a particular user,
44967``normally used'' refers to a typical or common use of that class of
44968product, regardless of the status of the particular user or of the way
44969in which the particular user actually uses, or expects or is expected
44970to use, the product.  A product is a consumer product regardless of
44971whether the product has substantial commercial, industrial or
44972non-consumer uses, unless such uses represent the only significant
44973mode of use of the product.
44974
44975``Installation Information'' for a User Product means any methods,
44976procedures, authorization keys, or other information required to
44977install and execute modified versions of a covered work in that User
44978Product from a modified version of its Corresponding Source.  The
44979information must suffice to ensure that the continued functioning of
44980the modified object code is in no case prevented or interfered with
44981solely because modification has been made.
44982
44983If you convey an object code work under this section in, or with, or
44984specifically for use in, a User Product, and the conveying occurs as
44985part of a transaction in which the right of possession and use of the
44986User Product is transferred to the recipient in perpetuity or for a
44987fixed term (regardless of how the transaction is characterized), the
44988Corresponding Source conveyed under this section must be accompanied
44989by the Installation Information.  But this requirement does not apply
44990if neither you nor any third party retains the ability to install
44991modified object code on the User Product (for example, the work has
44992been installed in ROM).
44993
44994The requirement to provide Installation Information does not include a
44995requirement to continue to provide support service, warranty, or
44996updates for a work that has been modified or installed by the
44997recipient, or for the User Product in which it has been modified or
44998installed.  Access to a network may be denied when the modification
44999itself materially and adversely affects the operation of the network
45000or violates the rules and protocols for communication across the
45001network.
45002
45003Corresponding Source conveyed, and Installation Information provided,
45004in accord with this section must be in a format that is publicly
45005documented (and with an implementation available to the public in
45006source code form), and must require no special password or key for
45007unpacking, reading or copying.
45008
45009@item Additional Terms.
45010
45011``Additional permissions'' are terms that supplement the terms of this
45012License by making exceptions from one or more of its conditions.
45013Additional permissions that are applicable to the entire Program shall
45014be treated as though they were included in this License, to the extent
45015that they are valid under applicable law.  If additional permissions
45016apply only to part of the Program, that part may be used separately
45017under those permissions, but the entire Program remains governed by
45018this License without regard to the additional permissions.
45019
45020When you convey a copy of a covered work, you may at your option
45021remove any additional permissions from that copy, or from any part of
45022it.  (Additional permissions may be written to require their own
45023removal in certain cases when you modify the work.)  You may place
45024additional permissions on material, added by you to a covered work,
45025for which you have or can give appropriate copyright permission.
45026
45027Notwithstanding any other provision of this License, for material you
45028add to a covered work, you may (if authorized by the copyright holders
45029of that material) supplement the terms of this License with terms:
45030
45031@enumerate a
45032@item
45033Disclaiming warranty or limiting liability differently from the terms
45034of sections 15 and 16 of this License; or
45035
45036@item
45037Requiring preservation of specified reasonable legal notices or author
45038attributions in that material or in the Appropriate Legal Notices
45039displayed by works containing it; or
45040
45041@item
45042Prohibiting misrepresentation of the origin of that material, or
45043requiring that modified versions of such material be marked in
45044reasonable ways as different from the original version; or
45045
45046@item
45047Limiting the use for publicity purposes of names of licensors or
45048authors of the material; or
45049
45050@item
45051Declining to grant rights under trademark law for use of some trade
45052names, trademarks, or service marks; or
45053
45054@item
45055Requiring indemnification of licensors and authors of that material by
45056anyone who conveys the material (or modified versions of it) with
45057contractual assumptions of liability to the recipient, for any
45058liability that these contractual assumptions directly impose on those
45059licensors and authors.
45060@end enumerate
45061
45062All other non-permissive additional terms are considered ``further
45063restrictions'' within the meaning of section 10.  If the Program as you
45064received it, or any part of it, contains a notice stating that it is
45065governed by this License along with a term that is a further
45066restriction, you may remove that term.  If a license document contains
45067a further restriction but permits relicensing or conveying under this
45068License, you may add to a covered work material governed by the terms
45069of that license document, provided that the further restriction does
45070not survive such relicensing or conveying.
45071
45072If you add terms to a covered work in accord with this section, you
45073must place, in the relevant source files, a statement of the
45074additional terms that apply to those files, or a notice indicating
45075where to find the applicable terms.
45076
45077Additional terms, permissive or non-permissive, may be stated in the
45078form of a separately written license, or stated as exceptions; the
45079above requirements apply either way.
45080
45081@item Termination.
45082
45083You may not propagate or modify a covered work except as expressly
45084provided under this License.  Any attempt otherwise to propagate or
45085modify it is void, and will automatically terminate your rights under
45086this License (including any patent licenses granted under the third
45087paragraph of section 11).
45088
45089However, if you cease all violation of this License, then your license
45090from a particular copyright holder is reinstated (a) provisionally,
45091unless and until the copyright holder explicitly and finally
45092terminates your license, and (b) permanently, if the copyright holder
45093fails to notify you of the violation by some reasonable means prior to
4509460 days after the cessation.
45095
45096Moreover, your license from a particular copyright holder is
45097reinstated permanently if the copyright holder notifies you of the
45098violation by some reasonable means, this is the first time you have
45099received notice of violation of this License (for any work) from that
45100copyright holder, and you cure the violation prior to 30 days after
45101your receipt of the notice.
45102
45103Termination of your rights under this section does not terminate the
45104licenses of parties who have received copies or rights from you under
45105this License.  If your rights have been terminated and not permanently
45106reinstated, you do not qualify to receive new licenses for the same
45107material under section 10.
45108
45109@item Acceptance Not Required for Having Copies.
45110
45111You are not required to accept this License in order to receive or run
45112a copy of the Program.  Ancillary propagation of a covered work
45113occurring solely as a consequence of using peer-to-peer transmission
45114to receive a copy likewise does not require acceptance.  However,
45115nothing other than this License grants you permission to propagate or
45116modify any covered work.  These actions infringe copyright if you do
45117not accept this License.  Therefore, by modifying or propagating a
45118covered work, you indicate your acceptance of this License to do so.
45119
45120@item Automatic Licensing of Downstream Recipients.
45121
45122Each time you convey a covered work, the recipient automatically
45123receives a license from the original licensors, to run, modify and
45124propagate that work, subject to this License.  You are not responsible
45125for enforcing compliance by third parties with this License.
45126
45127An ``entity transaction'' is a transaction transferring control of an
45128organization, or substantially all assets of one, or subdividing an
45129organization, or merging organizations.  If propagation of a covered
45130work results from an entity transaction, each party to that
45131transaction who receives a copy of the work also receives whatever
45132licenses to the work the party's predecessor in interest had or could
45133give under the previous paragraph, plus a right to possession of the
45134Corresponding Source of the work from the predecessor in interest, if
45135the predecessor has it or can get it with reasonable efforts.
45136
45137You may not impose any further restrictions on the exercise of the
45138rights granted or affirmed under this License.  For example, you may
45139not impose a license fee, royalty, or other charge for exercise of
45140rights granted under this License, and you may not initiate litigation
45141(including a cross-claim or counterclaim in a lawsuit) alleging that
45142any patent claim is infringed by making, using, selling, offering for
45143sale, or importing the Program or any portion of it.
45144
45145@item Patents.
45146
45147A ``contributor'' is a copyright holder who authorizes use under this
45148License of the Program or a work on which the Program is based.  The
45149work thus licensed is called the contributor's ``contributor version''.
45150
45151A contributor's ``essential patent claims'' are all patent claims owned
45152or controlled by the contributor, whether already acquired or
45153hereafter acquired, that would be infringed by some manner, permitted
45154by this License, of making, using, or selling its contributor version,
45155but do not include claims that would be infringed only as a
45156consequence of further modification of the contributor version.  For
45157purposes of this definition, ``control'' includes the right to grant
45158patent sublicenses in a manner consistent with the requirements of
45159this License.
45160
45161Each contributor grants you a non-exclusive, worldwide, royalty-free
45162patent license under the contributor's essential patent claims, to
45163make, use, sell, offer for sale, import and otherwise run, modify and
45164propagate the contents of its contributor version.
45165
45166In the following three paragraphs, a ``patent license'' is any express
45167agreement or commitment, however denominated, not to enforce a patent
45168(such as an express permission to practice a patent or covenant not to
45169sue for patent infringement).  To ``grant'' such a patent license to a
45170party means to make such an agreement or commitment not to enforce a
45171patent against the party.
45172
45173If you convey a covered work, knowingly relying on a patent license,
45174and the Corresponding Source of the work is not available for anyone
45175to copy, free of charge and under the terms of this License, through a
45176publicly available network server or other readily accessible means,
45177then you must either (1) cause the Corresponding Source to be so
45178available, or (2) arrange to deprive yourself of the benefit of the
45179patent license for this particular work, or (3) arrange, in a manner
45180consistent with the requirements of this License, to extend the patent
45181license to downstream recipients.  ``Knowingly relying'' means you have
45182actual knowledge that, but for the patent license, your conveying the
45183covered work in a country, or your recipient's use of the covered work
45184in a country, would infringe one or more identifiable patents in that
45185country that you have reason to believe are valid.
45186
45187If, pursuant to or in connection with a single transaction or
45188arrangement, you convey, or propagate by procuring conveyance of, a
45189covered work, and grant a patent license to some of the parties
45190receiving the covered work authorizing them to use, propagate, modify
45191or convey a specific copy of the covered work, then the patent license
45192you grant is automatically extended to all recipients of the covered
45193work and works based on it.
45194
45195A patent license is ``discriminatory'' if it does not include within the
45196scope of its coverage, prohibits the exercise of, or is conditioned on
45197the non-exercise of one or more of the rights that are specifically
45198granted under this License.  You may not convey a covered work if you
45199are a party to an arrangement with a third party that is in the
45200business of distributing software, under which you make payment to the
45201third party based on the extent of your activity of conveying the
45202work, and under which the third party grants, to any of the parties
45203who would receive the covered work from you, a discriminatory patent
45204license (a) in connection with copies of the covered work conveyed by
45205you (or copies made from those copies), or (b) primarily for and in
45206connection with specific products or compilations that contain the
45207covered work, unless you entered into that arrangement, or that patent
45208license was granted, prior to 28 March 2007.
45209
45210Nothing in this License shall be construed as excluding or limiting
45211any implied license or other defenses to infringement that may
45212otherwise be available to you under applicable patent law.
45213
45214@item No Surrender of Others' Freedom.
45215
45216If conditions are imposed on you (whether by court order, agreement or
45217otherwise) that contradict the conditions of this License, they do not
45218excuse you from the conditions of this License.  If you cannot convey
45219a covered work so as to satisfy simultaneously your obligations under
45220this License and any other pertinent obligations, then as a
45221consequence you may not convey it at all.  For example, if you agree
45222to terms that obligate you to collect a royalty for further conveying
45223from those to whom you convey the Program, the only way you could
45224satisfy both those terms and this License would be to refrain entirely
45225from conveying the Program.
45226
45227@item Use with the GNU Affero General Public License.
45228
45229Notwithstanding any other provision of this License, you have
45230permission to link or combine any covered work with a work licensed
45231under version 3 of the GNU Affero General Public License into a single
45232combined work, and to convey the resulting work.  The terms of this
45233License will continue to apply to the part which is the covered work,
45234but the special requirements of the GNU Affero General Public License,
45235section 13, concerning interaction through a network will apply to the
45236combination as such.
45237
45238@item Revised Versions of this License.
45239
45240The Free Software Foundation may publish revised and/or new versions
45241of the GNU General Public License from time to time.  Such new
45242versions will be similar in spirit to the present version, but may
45243differ in detail to address new problems or concerns.
45244
45245Each version is given a distinguishing version number.  If the Program
45246specifies that a certain numbered version of the GNU General Public
45247License ``or any later version'' applies to it, you have the option of
45248following the terms and conditions either of that numbered version or
45249of any later version published by the Free Software Foundation.  If
45250the Program does not specify a version number of the GNU General
45251Public License, you may choose any version ever published by the Free
45252Software Foundation.
45253
45254If the Program specifies that a proxy can decide which future versions
45255of the GNU General Public License can be used, that proxy's public
45256statement of acceptance of a version permanently authorizes you to
45257choose that version for the Program.
45258
45259Later license versions may give you additional or different
45260permissions.  However, no additional obligations are imposed on any
45261author or copyright holder as a result of your choosing to follow a
45262later version.
45263
45264@item Disclaimer of Warranty.
45265
45266THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY
45267APPLICABLE LAW.  EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT
45268HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM ``AS IS'' WITHOUT
45269WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT
45270LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
45271A PARTICULAR PURPOSE.  THE ENTIRE RISK AS TO THE QUALITY AND
45272PERFORMANCE OF THE PROGRAM IS WITH YOU.  SHOULD THE PROGRAM PROVE
45273DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR OR
45274CORRECTION.
45275
45276@item Limitation of Liability.
45277
45278IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
45279WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR
45280CONVEYS THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES,
45281INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES
45282ARISING OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT
45283NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR
45284LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM
45285TO OPERATE WITH ANY OTHER PROGRAMS), EVEN IF SUCH HOLDER OR OTHER
45286PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
45287
45288@item Interpretation of Sections 15 and 16.
45289
45290If the disclaimer of warranty and limitation of liability provided
45291above cannot be given local legal effect according to their terms,
45292reviewing courts shall apply local law that most closely approximates
45293an absolute waiver of all civil liability in connection with the
45294Program, unless a warranty or assumption of liability accompanies a
45295copy of the Program in return for a fee.
45296
45297@end enumerate
45298
45299@c fakenode --- for prepinfo
45300@heading END OF TERMS AND CONDITIONS
45301
45302@c fakenode --- for prepinfo
45303@heading How to Apply These Terms to Your New Programs
45304
45305If you develop a new program, and you want it to be of the greatest
45306possible use to the public, the best way to achieve this is to make it
45307free software which everyone can redistribute and change under these
45308terms.
45309
45310To do so, attach the following notices to the program.  It is safest
45311to attach them to the start of each source file to most effectively
45312state the exclusion of warranty; and each file should have at least
45313the ``copyright'' line and a pointer to where the full notice is found.
45314
45315@smallexample
45316@var{one line to give the program's name and a brief idea of what it does.}
45317Copyright (C) @var{year} @var{name of author}
45318
45319This program is free software: you can redistribute it and/or modify
45320it under the terms of the GNU General Public License as published by
45321the Free Software Foundation, either version 3 of the License, or (at
45322your option) any later version.
45323
45324This program is distributed in the hope that it will be useful, but
45325WITHOUT ANY WARRANTY; without even the implied warranty of
45326MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
45327General Public License for more details.
45328
45329You should have received a copy of the GNU General Public License
45330along with this program.  If not, see @url{https://www.gnu.org/licenses/}.
45331@end smallexample
45332
45333Also add information on how to contact you by electronic and paper mail.
45334
45335If the program does terminal interaction, make it output a short
45336notice like this when it starts in an interactive mode:
45337
45338@smallexample
45339@var{program} Copyright (C) @var{year} @var{name of author}
45340This program comes with ABSOLUTELY NO WARRANTY; for details type @samp{show w}.
45341This is free software, and you are welcome to redistribute it
45342under certain conditions; type @samp{show c} for details.
45343@end smallexample
45344
45345The hypothetical commands @samp{show w} and @samp{show c} should show
45346the appropriate parts of the General Public License.  Of course, your
45347program's commands might be different; for a GUI interface, you would
45348use an ``about box''.
45349
45350You should also get your employer (if you work as a programmer) or school,
45351if any, to sign a ``copyright disclaimer'' for the program, if necessary.
45352For more information on this, and how to apply and follow the GNU GPL, see
45353@url{https://www.gnu.org/licenses/}.
45354
45355The GNU General Public License does not permit incorporating your
45356program into proprietary programs.  If your program is a subroutine
45357library, you may consider it more useful to permit linking proprietary
45358applications with the library.  If this is what you want to do, use
45359the GNU Lesser General Public License instead of this License.  But
45360first, please read @url{https://www.gnu.org/philosophy/why-not-lgpl.html}.
45361
45362@ifclear FOR_PRINT
45363@c The GNU Free Documentation License.
45364@node GNU Free Documentation License
45365@unnumbered GNU Free Documentation License
45366@ifnotdocbook
45367@center Version 1.3, 3 November 2008
45368@end ifnotdocbook
45369
45370@docbook
45371<subtitle>Version 1.3, 3 November 2008</subtitle>
45372@end docbook
45373
45374@cindex FDL (Free Documentation License)
45375@cindex Free Documentation License (FDL)
45376@cindex GNU Free Documentation License
45377
45378@c This file is intended to be included within another document,
45379@c hence no sectioning command or @node.
45380
45381@display
45382Copyright @copyright{} 2000, 2001, 2002, 2007, 2008 Free Software Foundation, Inc.
45383@uref{https://fsf.org/}
45384
45385Everyone is permitted to copy and distribute verbatim copies
45386of this license document, but changing it is not allowed.
45387@end display
45388
45389@enumerate 0
45390@item
45391PREAMBLE
45392
45393The purpose of this License is to make a manual, textbook, or other
45394functional and useful document @dfn{free} in the sense of freedom: to
45395assure everyone the effective freedom to copy and redistribute it,
45396with or without modifying it, either commercially or noncommercially.
45397Secondarily, this License preserves for the author and publisher a way
45398to get credit for their work, while not being considered responsible
45399for modifications made by others.
45400
45401This License is a kind of ``copyleft'', which means that derivative
45402works of the document must themselves be free in the same sense.  It
45403complements the GNU General Public License, which is a copyleft
45404license designed for free software.
45405
45406We have designed this License in order to use it for manuals for free
45407software, because free software needs free documentation: a free
45408program should come with manuals providing the same freedoms that the
45409software does.  But this License is not limited to software manuals;
45410it can be used for any textual work, regardless of subject matter or
45411whether it is published as a printed book.  We recommend this License
45412principally for works whose purpose is instruction or reference.
45413
45414@item
45415APPLICABILITY AND DEFINITIONS
45416
45417This License applies to any manual or other work, in any medium, that
45418contains a notice placed by the copyright holder saying it can be
45419distributed under the terms of this License.  Such a notice grants a
45420world-wide, royalty-free license, unlimited in duration, to use that
45421work under the conditions stated herein.  The ``Document'', below,
45422refers to any such manual or work.  Any member of the public is a
45423licensee, and is addressed as ``you''.  You accept the license if you
45424copy, modify or distribute the work in a way requiring permission
45425under copyright law.
45426
45427A ``Modified Version'' of the Document means any work containing the
45428Document or a portion of it, either copied verbatim, or with
45429modifications and/or translated into another language.
45430
45431A ``Secondary Section'' is a named appendix or a front-matter section
45432of the Document that deals exclusively with the relationship of the
45433publishers or authors of the Document to the Document's overall
45434subject (or to related matters) and contains nothing that could fall
45435directly within that overall subject.  (Thus, if the Document is in
45436part a textbook of mathematics, a Secondary Section may not explain
45437any mathematics.)  The relationship could be a matter of historical
45438connection with the subject or with related matters, or of legal,
45439commercial, philosophical, ethical or political position regarding
45440them.
45441
45442The ``Invariant Sections'' are certain Secondary Sections whose titles
45443are designated, as being those of Invariant Sections, in the notice
45444that says that the Document is released under this License.  If a
45445section does not fit the above definition of Secondary then it is not
45446allowed to be designated as Invariant.  The Document may contain zero
45447Invariant Sections.  If the Document does not identify any Invariant
45448Sections then there are none.
45449
45450The ``Cover Texts'' are certain short passages of text that are listed,
45451as Front-Cover Texts or Back-Cover Texts, in the notice that says that
45452the Document is released under this License.  A Front-Cover Text may
45453be at most 5 words, and a Back-Cover Text may be at most 25 words.
45454
45455A ``Transparent'' copy of the Document means a machine-readable copy,
45456represented in a format whose specification is available to the
45457general public, that is suitable for revising the document
45458straightforwardly with generic text editors or (for images composed of
45459pixels) generic paint programs or (for drawings) some widely available
45460drawing editor, and that is suitable for input to text formatters or
45461for automatic translation to a variety of formats suitable for input
45462to text formatters.  A copy made in an otherwise Transparent file
45463format whose markup, or absence of markup, has been arranged to thwart
45464or discourage subsequent modification by readers is not Transparent.
45465An image format is not Transparent if used for any substantial amount
45466of text.  A copy that is not ``Transparent'' is called ``Opaque''.
45467
45468Examples of suitable formats for Transparent copies include plain
45469@sc{ascii} without markup, Texinfo input format, La@TeX{} input
45470format, @acronym{SGML} or @acronym{XML} using a publicly available
45471@acronym{DTD}, and standard-conforming simple @acronym{HTML},
45472PostScript or @acronym{PDF} designed for human modification.  Examples
45473of transparent image formats include @acronym{PNG}, @acronym{XCF} and
45474@acronym{JPG}.  Opaque formats include proprietary formats that can be
45475read and edited only by proprietary word processors, @acronym{SGML} or
45476@acronym{XML} for which the @acronym{DTD} and/or processing tools are
45477not generally available, and the machine-generated @acronym{HTML},
45478PostScript or @acronym{PDF} produced by some word processors for
45479output purposes only.
45480
45481The ``Title Page'' means, for a printed book, the title page itself,
45482plus such following pages as are needed to hold, legibly, the material
45483this License requires to appear in the title page.  For works in
45484formats which do not have any title page as such, ``Title Page'' means
45485the text near the most prominent appearance of the work's title,
45486preceding the beginning of the body of the text.
45487
45488The ``publisher'' means any person or entity that distributes copies
45489of the Document to the public.
45490
45491A section ``Entitled XYZ'' means a named subunit of the Document whose
45492title either is precisely XYZ or contains XYZ in parentheses following
45493text that translates XYZ in another language.  (Here XYZ stands for a
45494specific section name mentioned below, such as ``Acknowledgements'',
45495``Dedications'', ``Endorsements'', or ``History''.)  To ``Preserve the Title''
45496of such a section when you modify the Document means that it remains a
45497section ``Entitled XYZ'' according to this definition.
45498
45499The Document may include Warranty Disclaimers next to the notice which
45500states that this License applies to the Document.  These Warranty
45501Disclaimers are considered to be included by reference in this
45502License, but only as regards disclaiming warranties: any other
45503implication that these Warranty Disclaimers may have is void and has
45504no effect on the meaning of this License.
45505
45506@item
45507VERBATIM COPYING
45508
45509You may copy and distribute the Document in any medium, either
45510commercially or noncommercially, provided that this License, the
45511copyright notices, and the license notice saying this License applies
45512to the Document are reproduced in all copies, and that you add no other
45513conditions whatsoever to those of this License.  You may not use
45514technical measures to obstruct or control the reading or further
45515copying of the copies you make or distribute.  However, you may accept
45516compensation in exchange for copies.  If you distribute a large enough
45517number of copies you must also follow the conditions in section 3.
45518
45519You may also lend copies, under the same conditions stated above, and
45520you may publicly display copies.
45521
45522@item
45523COPYING IN QUANTITY
45524
45525If you publish printed copies (or copies in media that commonly have
45526printed covers) of the Document, numbering more than 100, and the
45527Document's license notice requires Cover Texts, you must enclose the
45528copies in covers that carry, clearly and legibly, all these Cover
45529Texts: Front-Cover Texts on the front cover, and Back-Cover Texts on
45530the back cover.  Both covers must also clearly and legibly identify
45531you as the publisher of these copies.  The front cover must present
45532the full title with all words of the title equally prominent and
45533visible.  You may add other material on the covers in addition.
45534Copying with changes limited to the covers, as long as they preserve
45535the title of the Document and satisfy these conditions, can be treated
45536as verbatim copying in other respects.
45537
45538If the required texts for either cover are too voluminous to fit
45539legibly, you should put the first ones listed (as many as fit
45540reasonably) on the actual cover, and continue the rest onto adjacent
45541pages.
45542
45543If you publish or distribute Opaque copies of the Document numbering
45544more than 100, you must either include a machine-readable Transparent
45545copy along with each Opaque copy, or state in or with each Opaque copy
45546a computer-network location from which the general network-using
45547public has access to download using public-standard network protocols
45548a complete Transparent copy of the Document, free of added material.
45549If you use the latter option, you must take reasonably prudent steps,
45550when you begin distribution of Opaque copies in quantity, to ensure
45551that this Transparent copy will remain thus accessible at the stated
45552location until at least one year after the last time you distribute an
45553Opaque copy (directly or through your agents or retailers) of that
45554edition to the public.
45555
45556It is requested, but not required, that you contact the authors of the
45557Document well before redistributing any large number of copies, to give
45558them a chance to provide you with an updated version of the Document.
45559
45560@item
45561MODIFICATIONS
45562
45563You may copy and distribute a Modified Version of the Document under
45564the conditions of sections 2 and 3 above, provided that you release
45565the Modified Version under precisely this License, with the Modified
45566Version filling the role of the Document, thus licensing distribution
45567and modification of the Modified Version to whoever possesses a copy
45568of it.  In addition, you must do these things in the Modified Version:
45569
45570@enumerate A
45571@item
45572Use in the Title Page (and on the covers, if any) a title distinct
45573from that of the Document, and from those of previous versions
45574(which should, if there were any, be listed in the History section
45575of the Document).  You may use the same title as a previous version
45576if the original publisher of that version gives permission.
45577
45578@item
45579List on the Title Page, as authors, one or more persons or entities
45580responsible for authorship of the modifications in the Modified
45581Version, together with at least five of the principal authors of the
45582Document (all of its principal authors, if it has fewer than five),
45583unless they release you from this requirement.
45584
45585@item
45586State on the Title page the name of the publisher of the
45587Modified Version, as the publisher.
45588
45589@item
45590Preserve all the copyright notices of the Document.
45591
45592@item
45593Add an appropriate copyright notice for your modifications
45594adjacent to the other copyright notices.
45595
45596@item
45597Include, immediately after the copyright notices, a license notice
45598giving the public permission to use the Modified Version under the
45599terms of this License, in the form shown in the Addendum below.
45600
45601@item
45602Preserve in that license notice the full lists of Invariant Sections
45603and required Cover Texts given in the Document's license notice.
45604
45605@item
45606Include an unaltered copy of this License.
45607
45608@item
45609Preserve the section Entitled ``History'', Preserve its Title, and add
45610to it an item stating at least the title, year, new authors, and
45611publisher of the Modified Version as given on the Title Page.  If
45612there is no section Entitled ``History'' in the Document, create one
45613stating the title, year, authors, and publisher of the Document as
45614given on its Title Page, then add an item describing the Modified
45615Version as stated in the previous sentence.
45616
45617@item
45618Preserve the network location, if any, given in the Document for
45619public access to a Transparent copy of the Document, and likewise
45620the network locations given in the Document for previous versions
45621it was based on.  These may be placed in the ``History'' section.
45622You may omit a network location for a work that was published at
45623least four years before the Document itself, or if the original
45624publisher of the version it refers to gives permission.
45625
45626@item
45627For any section Entitled ``Acknowledgements'' or ``Dedications'', Preserve
45628the Title of the section, and preserve in the section all the
45629substance and tone of each of the contributor acknowledgements and/or
45630dedications given therein.
45631
45632@item
45633Preserve all the Invariant Sections of the Document,
45634unaltered in their text and in their titles.  Section numbers
45635or the equivalent are not considered part of the section titles.
45636
45637@item
45638Delete any section Entitled ``Endorsements''.  Such a section
45639may not be included in the Modified Version.
45640
45641@item
45642Do not retitle any existing section to be Entitled ``Endorsements'' or
45643to conflict in title with any Invariant Section.
45644
45645@item
45646Preserve any Warranty Disclaimers.
45647@end enumerate
45648
45649If the Modified Version includes new front-matter sections or
45650appendices that qualify as Secondary Sections and contain no material
45651copied from the Document, you may at your option designate some or all
45652of these sections as invariant.  To do this, add their titles to the
45653list of Invariant Sections in the Modified Version's license notice.
45654These titles must be distinct from any other section titles.
45655
45656You may add a section Entitled ``Endorsements'', provided it contains
45657nothing but endorsements of your Modified Version by various
45658parties---for example, statements of peer review or that the text has
45659been approved by an organization as the authoritative definition of a
45660standard.
45661
45662You may add a passage of up to five words as a Front-Cover Text, and a
45663passage of up to 25 words as a Back-Cover Text, to the end of the list
45664of Cover Texts in the Modified Version.  Only one passage of
45665Front-Cover Text and one of Back-Cover Text may be added by (or
45666through arrangements made by) any one entity.  If the Document already
45667includes a cover text for the same cover, previously added by you or
45668by arrangement made by the same entity you are acting on behalf of,
45669you may not add another; but you may replace the old one, on explicit
45670permission from the previous publisher that added the old one.
45671
45672The author(s) and publisher(s) of the Document do not by this License
45673give permission to use their names for publicity for or to assert or
45674imply endorsement of any Modified Version.
45675
45676@item
45677COMBINING DOCUMENTS
45678
45679You may combine the Document with other documents released under this
45680License, under the terms defined in section 4 above for modified
45681versions, provided that you include in the combination all of the
45682Invariant Sections of all of the original documents, unmodified, and
45683list them all as Invariant Sections of your combined work in its
45684license notice, and that you preserve all their Warranty Disclaimers.
45685
45686The combined work need only contain one copy of this License, and
45687multiple identical Invariant Sections may be replaced with a single
45688copy.  If there are multiple Invariant Sections with the same name but
45689different contents, make the title of each such section unique by
45690adding at the end of it, in parentheses, the name of the original
45691author or publisher of that section if known, or else a unique number.
45692Make the same adjustment to the section titles in the list of
45693Invariant Sections in the license notice of the combined work.
45694
45695In the combination, you must combine any sections Entitled ``History''
45696in the various original documents, forming one section Entitled
45697``History''; likewise combine any sections Entitled ``Acknowledgements'',
45698and any sections Entitled ``Dedications''.  You must delete all
45699sections Entitled ``Endorsements.''
45700
45701@item
45702COLLECTIONS OF DOCUMENTS
45703
45704You may make a collection consisting of the Document and other documents
45705released under this License, and replace the individual copies of this
45706License in the various documents with a single copy that is included in
45707the collection, provided that you follow the rules of this License for
45708verbatim copying of each of the documents in all other respects.
45709
45710You may extract a single document from such a collection, and distribute
45711it individually under this License, provided you insert a copy of this
45712License into the extracted document, and follow this License in all
45713other respects regarding verbatim copying of that document.
45714
45715@item
45716AGGREGATION WITH INDEPENDENT WORKS
45717
45718A compilation of the Document or its derivatives with other separate
45719and independent documents or works, in or on a volume of a storage or
45720distribution medium, is called an ``aggregate'' if the copyright
45721resulting from the compilation is not used to limit the legal rights
45722of the compilation's users beyond what the individual works permit.
45723When the Document is included in an aggregate, this License does not
45724apply to the other works in the aggregate which are not themselves
45725derivative works of the Document.
45726
45727If the Cover Text requirement of section 3 is applicable to these
45728copies of the Document, then if the Document is less than one half of
45729the entire aggregate, the Document's Cover Texts may be placed on
45730covers that bracket the Document within the aggregate, or the
45731electronic equivalent of covers if the Document is in electronic form.
45732Otherwise they must appear on printed covers that bracket the whole
45733aggregate.
45734
45735@item
45736TRANSLATION
45737
45738Translation is considered a kind of modification, so you may
45739distribute translations of the Document under the terms of section 4.
45740Replacing Invariant Sections with translations requires special
45741permission from their copyright holders, but you may include
45742translations of some or all Invariant Sections in addition to the
45743original versions of these Invariant Sections.  You may include a
45744translation of this License, and all the license notices in the
45745Document, and any Warranty Disclaimers, provided that you also include
45746the original English version of this License and the original versions
45747of those notices and disclaimers.  In case of a disagreement between
45748the translation and the original version of this License or a notice
45749or disclaimer, the original version will prevail.
45750
45751If a section in the Document is Entitled ``Acknowledgements'',
45752``Dedications'', or ``History'', the requirement (section 4) to Preserve
45753its Title (section 1) will typically require changing the actual
45754title.
45755
45756@item
45757TERMINATION
45758
45759You may not copy, modify, sublicense, or distribute the Document
45760except as expressly provided under this License.  Any attempt
45761otherwise to copy, modify, sublicense, or distribute it is void, and
45762will automatically terminate your rights under this License.
45763
45764However, if you cease all violation of this License, then your license
45765from a particular copyright holder is reinstated (a) provisionally,
45766unless and until the copyright holder explicitly and finally
45767terminates your license, and (b) permanently, if the copyright holder
45768fails to notify you of the violation by some reasonable means prior to
60 days after the cessation.
45770
45771Moreover, your license from a particular copyright holder is
45772reinstated permanently if the copyright holder notifies you of the
45773violation by some reasonable means, this is the first time you have
45774received notice of violation of this License (for any work) from that
45775copyright holder, and you cure the violation prior to 30 days after
45776your receipt of the notice.
45777
45778Termination of your rights under this section does not terminate the
45779licenses of parties who have received copies or rights from you under
45780this License.  If your rights have been terminated and not permanently
45781reinstated, receipt of a copy of some or all of the same material does
45782not give you any rights to use it.
45783
45784@item
45785FUTURE REVISIONS OF THIS LICENSE
45786
45787The Free Software Foundation may publish new, revised versions
45788of the GNU Free Documentation License from time to time.  Such new
45789versions will be similar in spirit to the present version, but may
45790differ in detail to address new problems or concerns.  See
45791@uref{https://www.gnu.org/copyleft/}.
45792
45793Each version of the License is given a distinguishing version number.
45794If the Document specifies that a particular numbered version of this
45795License ``or any later version'' applies to it, you have the option of
45796following the terms and conditions either of that specified version or
45797of any later version that has been published (not as a draft) by the
45798Free Software Foundation.  If the Document does not specify a version
45799number of this License, you may choose any version ever published (not
45800as a draft) by the Free Software Foundation.  If the Document
45801specifies that a proxy can decide which future versions of this
45802License can be used, that proxy's public statement of acceptance of a
45803version permanently authorizes you to choose that version for the
45804Document.
45805
45806@item
45807RELICENSING
45808
45809``Massive Multiauthor Collaboration Site'' (or ``MMC Site'') means any
45810World Wide Web server that publishes copyrightable works and also
45811provides prominent facilities for anybody to edit those works.  A
45812public wiki that anybody can edit is an example of such a server.  A
45813``Massive Multiauthor Collaboration'' (or ``MMC'') contained in the
45814site means any set of copyrightable works thus published on the MMC
45815site.
45816
45817``CC-BY-SA'' means the Creative Commons Attribution-Share Alike 3.0
45818license published by Creative Commons Corporation, a not-for-profit
45819corporation with a principal place of business in San Francisco,
45820California, as well as future copyleft versions of that license
45821published by that same organization.
45822
45823``Incorporate'' means to publish or republish a Document, in whole or
45824in part, as part of another Document.
45825
45826An MMC is ``eligible for relicensing'' if it is licensed under this
45827License, and if all works that were first published under this License
45828somewhere other than this MMC, and subsequently incorporated in whole
45829or in part into the MMC, (1) had no cover texts or invariant sections,
45830and (2) were thus incorporated prior to November 1, 2008.
45831
45832The operator of an MMC Site may republish an MMC contained in the site
45833under CC-BY-SA on the same site at any time before August 1, 2009,
45834provided the MMC is eligible for relicensing.
45835
45836@end enumerate
45837
45838@c fakenode --- for prepinfo
45839@unnumberedsec ADDENDUM: How to use this License for your documents
45840
45841To use this License in a document you have written, include a copy of
45842the License in the document and put the following copyright and
45843license notices just after the title page:
45844
45845@smallexample
45846@group
45847  Copyright (C)  @var{year}  @var{your name}.
45848  Permission is granted to copy, distribute and/or modify this document
45849  under the terms of the GNU Free Documentation License, Version 1.3
45850  or any later version published by the Free Software Foundation;
45851  with no Invariant Sections, no Front-Cover Texts, and no Back-Cover
45852  Texts.  A copy of the license is included in the section entitled ``GNU
45853  Free Documentation License''.
45854@end group
45855@end smallexample
45856
45857If you have Invariant Sections, Front-Cover Texts and Back-Cover Texts,
45858replace the ``with@dots{}Texts.'' line with this:
45859
45860@smallexample
45861@group
45862    with the Invariant Sections being @var{list their titles}, with
45863    the Front-Cover Texts being @var{list}, and with the Back-Cover Texts
45864    being @var{list}.
45865@end group
45866@end smallexample
45867
45868If you have Invariant Sections without Cover Texts, or some other
45869combination of the three, merge those two alternatives to suit the
45870situation.
45871
45872If your document contains nontrivial examples of program code, we
45873recommend releasing these examples in parallel under your choice of
45874free software license, such as the GNU General Public License,
45875to permit their use in free software.
45876
45877@end ifclear
45878
45879@ifnotdocbook
45880@node Index
45881@unnumbered Index
45882@end ifnotdocbook
45883@printindex cp
45884
45885@bye
45886
45887Unresolved Issues:
45888------------------
1. From ADR.
45890
45891   Robert J. Chassell points out that awk programs should have some indication
45892   of how to use them.  It would be useful to perhaps have a "programming
45893   style" section of the manual that would include this and other tips.
45894
45895Consistency issues:
45896	/.../ regexps are in @code, not @samp
45897	".." strings are in @code, not @samp
45898	no @print before @dots
45899	values of expressions in the text (@code{x} has the value 15),
45900		should be in roman, not @code
45901	Use   TAB   and not   tab
45902	Use   ESC   and not   ESCAPE
45903	Use   space and not   blank	to describe the space bar's character
45904	The term "blank" is thus basically reserved for "blank lines" etc.
45905	To make dark corners work, the @value{DARKCORNER} has to be outside
45906		closing `.' of a sentence and after (pxref{...}).
45907	Make sure that each @value{DARKCORNER} has an index entry, and
45908		also that each `@cindex dark corner' has an @value{DARKCORNER}.
45909	" " should have an @w{} around it
45910	Use "non-" only with language names or acronyms, or the words bug and option and null
45911	Use @command{ftp} when talking about anonymous ftp
45912	Use uppercase and lowercase, not "upper-case" and "lower-case"
45913		or "upper case" and "lower case"
45914	Use "single precision" and "double precision", not "single-precision" or "double-precision"
45915	Use alphanumeric, not alpha-numeric
45916	Use POSIX-compliant, not POSIX compliant
45917	Use --foo, not -Wfoo when describing long options
45918	Use "Bell Laboratories", but not "Bell Labs".
45919	Use "behavior" instead of "behaviour".
45920	Use "coprocess" instead of "co-process".
45921	Use "zeros" instead of "zeroes".
45922	Use "nonzero" not "non-zero".
45923	Use "runtime" not "run time" or "run-time".
45924	Use "command-line" as an adjective and "command line" as a noun.
45925	Use "online" not "on-line".
45926	Use "whitespace" not "white space".
45927	Use "Input/Output", not "input/output". Also "I/O", not "i/o".
45928	Use "lefthand"/"righthand", not "left-hand"/"right-hand".
45929	Use "workaround", not "work-around".
45930	Use "startup"/"cleanup", not "start-up"/"clean-up"
45931	Use "filesystem", not "file system"
45932	Use @code{do}, and not @code{do}-@code{while}, except where
45933		actually discussing the do-while.
45934	Use "versus" in text and "vs." in index entries
45935	Use @code{"C"} for the C locale, not ``C'' or @samp{C}.
45936	The words "a", "and", "as", "between", "for", "from", "in", "of",
45937		"on", "that", "the", "to", "with", and "without",
45938		should not be capitalized in @chapter, @section etc.
45939		"Into" and "How" should.
45940	Search for @dfn; make sure important items are also indexed.
45941	"e.g." should always be followed by a comma.
45942	"i.e." should always be followed by a comma.
	The numbers zero through ten should be spelled out, except when
		talking about file descriptor numbers. For numbers greater than
		ten or less than zero, it's ok to use numerals.
45946	For most cases, do NOT put a comma before "and", "or" or "but".
45947		But exercise taste with this rule.
45948	Don't show the awk command with a program in quotes when it's
		just the program.  I.e.,
45950
45951			{
45952				....
45953			}
45954
45955		not
45956			awk '{
45957				...
45958			}'
45959
	Do show it when showing command-line arguments, data files, etc., even
45961		if there is no output shown.
45962
45963	Use numbered lists only to show a sequential series of steps.
45964
45965	Use @code{xxx} for the xxx operator in indexing statements, not @samp.
45966	Use MS-Windows not MS Windows
45967	Use MS-DOS not MS DOS
45968	Use an empty set of parentheses after built-in and awk function names.
45969	Use "multiFOO" without a hyphen.
45970	Use "time zone" as two words, not "timezone".
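
Several of the word-choice rules above are mechanical enough to check with
a script.  The following is only a rough sketch, not part of the gawk
distribution: the DISCOURAGED table is a hand-picked subset of the rules,
and the names check_line() and check_text() are made up for illustration.
It flags candidates for review and cannot tell legitimate uses apart (for
instance, these notes themselves quote the discouraged spellings).

```python
import re

# Hypothetical subset of the rules above: discouraged pattern -> preferred form.
DISCOURAGED = {
    r"\bbehaviour\b": "behavior",
    r"\bco-process\b": "coprocess",
    r"\bzeroes\b": "zeros",
    r"\bnon-zero\b": "nonzero",
    r"\brun[- ]time\b": "runtime",
    r"\bon-line\b": "online",
    r"\bwhite space\b": "whitespace",
    r"\bwork-around\b": "workaround",
    r"\bfile system\b": "filesystem",
    r"\btimezone\b": "time zone",
}


def check_line(line):
    """Return (found, preferred) pairs for discouraged spellings in one line."""
    hits = []
    for pattern, preferred in DISCOURAGED.items():
        for match in re.finditer(pattern, line, flags=re.IGNORECASE):
            hits.append((match.group(0), preferred))
    return hits


def check_text(text):
    """Return (line number, found, preferred) triples for a whole text."""
    report = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for found, preferred in check_line(line):
            report.append((lineno, found, preferred))
    return report
```

Calling check_text() on the contents of a .texi file yields one triple per
suspicious spelling; each hit still needs a human decision, in keeping with
the "exercise taste" spirit of these notes.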
45971
45972Date: Wed, 13 Apr 94 15:20:52 -0400
45973From: rms@gnu.org (Richard Stallman)
45974To: gnu-prog@gnu.org
45975Subject: A reminder: no pathnames in GNU
45976
45977It's a GNU convention to use the term "file name" for the name of a
45978file, never "pathname".  We use the term "path" for search paths,
45979which are lists of file names.  Using it for a single file name as
45980well is potentially confusing to users.
45981
45982So please check any documentation you maintain, if you think you might
45983have used "pathname".
45984
45985Note that "file name" should be two words when it appears as ordinary
45986text.  It's ok as one word when it's a metasyntactic variable, though.
45987
45988------------------------
45989ORA uses filename, thus the macro.
45990
45991Suggestions:
45992------------
45993
45994Better sidebars can almost sort of be done with:
45995
45996	@ifdocbook
	@macro sidebar{title, content}
45998	@inlinefmt{docbook, <sidebar><title>}
45999	\title\
46000	@inlinefmt{docbook, </title>}
46001	\content\
46002	@inlinefmt{docbook, </sidebar>}
46003	@end macro
46004	@end ifdocbook
46005
46006
46007	@ifnotdocbook
	@macro sidebar{title, content}
46009	@cartouche
46010	@center @b{\title\}
46011
46012	\content\
46013	@end cartouche
46014	@end macro
46015	@end ifnotdocbook
46016
46017But to use it you have to say
46018
46019	@sidebar{Title Here,
46020	@include file-with-content
46021	}
46022
46023which sorta sucks.
46024
46025TODO:
46026