\input texinfo @c -*-texinfo-*-
@c vim: filetype=texinfo
@c %**start of header (This is for running Texinfo on a region.)
@setfilename gawk.info
@settitle The GNU Awk User's Guide
@c %**end of header (This is for running Texinfo on a region.)

@dircategory Text creation and manipulation
@direntry
* Gawk: (gawk). A text scanning and processing language.
@end direntry
@dircategory Individual utilities
@direntry
* awk: (gawk)Invoking Gawk. Text scanning and processing.
@end direntry

@ifset FOR_PRINT
@tex
\gdef\xrefprintnodename#1{``#1''}
@end tex
@end ifset

@ifclear FOR_PRINT
@c With early 2014 texinfo.tex, restore PDF links and colors
@tex
\gdef\linkcolor{0.5 0.09 0.12} % Dark Red
\gdef\urlcolor{0.5 0.09 0.12} % Also
\global\urefurlonlylinktrue
@end tex
@end ifclear

@ifnotdocbook
@set BULLET @bullet{}
@set MINUS @minus{}
@end ifnotdocbook

@ifdocbook
@set BULLET
@set MINUS
@end ifdocbook

@iftex
@set TIMES @times
@end iftex
@ifnottex
@set TIMES *
@end ifnottex

@c Let texinfo.tex give us full section titles
@xrefautomaticsectiontitle on

@c The following information should be updated here only!
@c This sets the edition of the document, the version of gawk it
@c applies to and all the info about who's publishing this edition

@c These apply across the board.
@set UPDATE-MONTH October, 2021
@set VERSION 5.1
@set PATCHLEVEL 1

@set GAWKINETTITLE TCP/IP Internetworking with @command{gawk}
@set GAWKWORKFLOWTITLE Participating in @command{gawk} Development
@ifset FOR_PRINT
@set TITLE Effective awk Programming
@end ifset
@ifclear FOR_PRINT
@set TITLE GAWK: Effective AWK Programming
@end ifclear
@set SUBTITLE A User's Guide for GNU Awk
@set EDITION 5.1

@iftex
@set DOCUMENT book
@set CHAPTER chapter
@set APPENDIX appendix
@set SECTION section
@set SUBSECTION subsection
@set DARKCORNER @inmargin{@image{lflashlight,1cm}, @image{rflashlight,1cm}}
@set COMMONEXT (c.e.)
@set PAGE page
@end iftex
@ifinfo
@set DOCUMENT Info file
@set CHAPTER major node
@set APPENDIX major node
@set SECTION minor node
@set SUBSECTION node
@set DARKCORNER (d.c.)
@set COMMONEXT (c.e.)
@set PAGE screen
@end ifinfo
@ifhtml
@set DOCUMENT Web page
@set CHAPTER chapter
@set APPENDIX appendix
@set SECTION section
@set SUBSECTION subsection
@set DARKCORNER (d.c.)
@set COMMONEXT (c.e.)
@set PAGE screen
@end ifhtml
@ifdocbook
@set DOCUMENT book
@set CHAPTER chapter
@set APPENDIX appendix
@set SECTION section
@set SUBSECTION subsection
@set DARKCORNER (d.c.)
@set COMMONEXT (c.e.)
@set PAGE page
@end ifdocbook
@ifxml
@set DOCUMENT book
@set CHAPTER chapter
@set APPENDIX appendix
@set SECTION section
@set SUBSECTION subsection
@set DARKCORNER (d.c.)
@set COMMONEXT (c.e.)
@set PAGE page
@end ifxml
@ifplaintext
@set DOCUMENT book
@set CHAPTER chapter
@set APPENDIX appendix
@set SECTION section
@set SUBSECTION subsection
@set DARKCORNER (d.c.)
@set COMMONEXT (c.e.)
@set PAGE page
@end ifplaintext

@ifdocbook
@c empty on purpose
@set PART1
@set PART2
@set PART3
@set PART4
@end ifdocbook

@ifnotdocbook
@set PART1 Part I:@*
@set PART2 Part II:@*
@set PART3 Part III:@*
@set PART4 Part IV:@*
@end ifnotdocbook

@c some special symbols
@iftex
@set LEQ @math{@leq}
@set PI @math{@pi}
@end iftex
@ifdocbook
@set LEQ @inlineraw{docbook, ≤}
@set PI @inlineraw{docbook, &pgr;}
@end ifdocbook
@ifnottex
@ifnotdocbook
@set LEQ <=
@set PI @i{pi}
@end ifnotdocbook
@end ifnottex

@ifnottex
@ifnotdocbook
@macro ii{text}
@i{\text\}
@end macro
@end ifnotdocbook
@end ifnottex

@ifdocbook
@macro ii{text}
@inlineraw{docbook,<lineannotation>\text\</lineannotation>}
@end macro
@end ifdocbook

@ifclear FOR_PRINT
@set FN file name
@set FFN File name
@set DF data file
@set DDF Data file
@set PVERSION version
@end ifclear
@ifset FOR_PRINT
@set FN filename
@set FFN Filename
@set DF datafile
@set DDF Datafile
@set PVERSION version
@end ifset

@c For HTML, spell out email addresses, to avoid problems with
@c address harvesters for spammers.
@ifhtml
@macro EMAIL{real,spelled}
``\spelled\''
@end macro
@end ifhtml
@ifnothtml
@macro EMAIL{real,spelled}
@email{\real\}
@end macro
@end ifnothtml

@c Indexing macros
@ifinfo

@macro cindexawkfunc{name}
@cindex @code{\name\}
@end macro

@macro cindexgawkfunc{name}
@cindex @code{\name\}
@end macro

@end ifinfo

@ifnotinfo

@macro cindexawkfunc{name}
@cindex @code{\name\()} function
@end macro

@macro cindexgawkfunc{name}
@cindex @code{\name\()} function (@command{gawk})
@end macro
@end ifnotinfo

@ignore
Some comments on the layout for TeX.
1. Use at least texinfo.tex 2016-02-05.07.
@end ignore

@c merge the function and variable indexes into the concept index
@ifinfo
@synindex fn cp
@synindex vr cp
@end ifinfo
@iftex
@syncodeindex fn cp
@syncodeindex vr cp
@end iftex
@ifxml
@syncodeindex fn cp
@syncodeindex vr cp
@end ifxml
@ifdocbook
@synindex fn cp
@synindex vr cp
@end ifdocbook

@c If "finalout" is commented out, the printed output will show
@c black boxes that mark lines that are too long. Thus, it is
@c unwise to comment it out when running a master in case there are
@c overfulls which are deemed okay.

@iftex
@finalout
@end iftex

@c Enabled '-quotes in PDF files so that cut/paste works in
@c more places.

@codequoteundirected on
@codequotebacktick on

@copying
@docbook
<para>
“To boldly go where no man has gone before” is a
Registered Trademark of Paramount Pictures Corporation.</para>

<para>Published by:</para>

<literallayout class="normal">Free Software Foundation
51 Franklin Street, Fifth Floor
Boston, MA 02110-1301 USA
Phone: +1-617-542-5942
Fax: +1-617-542-2652
Email: <email>gnu@@gnu.org</email>
URL: <ulink url="https://www.gnu.org">https://www.gnu.org/</ulink></literallayout>

<literallayout class="normal">Copyright © 1989, 1991, 1992, 1993, 1996–2005, 2007, 2009–2021
Free Software Foundation, Inc.
All Rights Reserved.</literallayout>
@end docbook

@ifnotdocbook
Copyright @copyright{} 1989, 1991, 1992, 1993, 1996--2005, 2007, 2009--2021 @*
Free Software Foundation, Inc.
@end ifnotdocbook
@sp 2

This is Edition @value{EDITION} of @cite{@value{TITLE}: @value{SUBTITLE}},
for the @value{VERSION}.@value{PATCHLEVEL} (or later) version of the GNU
implementation of AWK.

Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License, Version 1.3 or
any later version published by the Free Software Foundation; with the
Invariant Sections being ``GNU General Public License'', with the
Front-Cover Texts being ``A GNU Manual'', and with the Back-Cover Texts
as in (a) below.
@ifclear FOR_PRINT
A copy of the license is included in the section entitled
``GNU Free Documentation License''.
@end ifclear
@ifset FOR_PRINT
A copy of the license
may be found on the Internet at
@uref{https://www.gnu.org/software/gawk/manual/html_node/GNU-Free-Documentation-License.html,
the GNU Project's website}.
@end ifset

@enumerate a
@item
The FSF's Back-Cover Text is: ``You have the freedom to
copy and modify this GNU manual.''
@end enumerate
@end copying

@c Comment out the "smallbook" for technical review. Saves
@c considerable paper. Remember to turn it back on *before*
@c starting the page-breaking work.

@c 4/2002: Karl Berry recommends commenting out this and the
@c `@setchapternewpage odd', and letting users use `texi2dvi -t'
@c if they want to waste paper.
@c @smallbook


@c Uncomment this for the release. Leaving it off saves paper
@c during editing and review.
@setchapternewpage odd

@shorttitlepage GNU Awk
@titlepage
@title @value{TITLE}
@subtitle @value{SUBTITLE}
@subtitle Edition @value{EDITION}
@subtitle @value{UPDATE-MONTH}
@author Arnold D. Robbins

@ifnotdocbook
@c Include the Distribution inside the titlepage environment so
@c that headings are turned off. Headings on and off do not work.

@page
@vskip 0pt plus 1filll
``To boldly go where no man has gone before'' is a
Registered Trademark of Paramount Pictures Corporation.
@*
@c sorry, i couldn't resist
@sp 3
Published by:
@sp 1

Free Software Foundation @*
51 Franklin Street, Fifth Floor @*
Boston, MA 02110-1301 USA @*
Phone: +1-617-542-5942 @*
Fax: +1-617-542-2652 @*
Email: @email{gnu@@gnu.org} @*
URL: @uref{https://www.gnu.org/} @*

@c This one is correct for gawk 3.1.0 from the FSF
ISBN 1-882114-28-0 @*
@sp 2
@insertcopying
@end ifnotdocbook
@end titlepage

@c Thanks to Bob Chassell for directions on doing dedications.
@iftex
@headings off
@page
@w{ }
@sp 9
@center @i{To my parents, for their love, and for the wonderful example they set for me.}
@sp 1
@center @i{To my wife, Miriam, for making me complete.
Thank you for building your life together with me.}
@sp 1
@center @i{To our children, Chana, Rivka, Nachum, and Malka, for enrichening our lives in innumerable ways.}
@sp 1
@w{ }
@page
@w{ }
@page
@headings on
@end iftex

@docbook
<dedication>
<para>To my parents, for their love, and for the wonderful
example they set for me.</para>
<para>To my wife Miriam, for making me complete.
Thank you for building your life together with me.</para>
<para>To our children Chana, Rivka, Nachum and Malka,
for enrichening our lives in innumerable ways.</para>
</dedication>
@end docbook

@iftex
@headings off
@evenheading @thispage@ @ @ @strong{@value{TITLE}} @| @|
@oddheading @| @| @strong{@thischapter}@ @ @ @thispage
@end iftex

@ifnottex
@ifnotxml
@ifnotdocbook
@node Top
@top General Introduction
@c Preface node should come right after the Top
@c node, in `unnumbered' sections, then the chapter, `What is gawk'.
@c Licensing nodes are appendices, they're not central to AWK.

This file documents @command{awk}, a program that you can use to select
particular records in a file and perform operations upon them.

@insertcopying

@end ifnotdocbook
@end ifnotxml
@end ifnottex

@menu
* Foreword3:: Some nice words about this
        @value{DOCUMENT}.
* Foreword4:: More nice words.
* Preface:: What this @value{DOCUMENT} is about; brief
        history and acknowledgments.
* Getting Started:: A basic introduction to using
        @command{awk}. How to run an @command{awk}
        program. Command-line syntax.
* Invoking Gawk:: How to run @command{gawk}.
* Regexp:: All about matching things using regular
        expressions.
* Reading Files:: How to read files and manipulate fields.
* Printing:: How to print using @command{awk}. Describes
        the @code{print} and @code{printf}
        statements. Also describes redirection of
        output.
* Expressions:: Expressions are the basic building blocks
        of statements.
* Patterns and Actions:: Overviews of patterns and actions.
* Arrays:: The description and use of arrays. Also
        includes array-oriented control statements.
* Functions:: Built-in and user-defined functions.
* Library Functions:: A Library of @command{awk} Functions.
* Sample Programs:: Many @command{awk} programs with complete
        explanations.
* Advanced Features:: Stuff for advanced users, specific to
        @command{gawk}.
* Internationalization:: Getting @command{gawk} to speak your
        language.
* Debugger:: The @command{gawk} debugger.
* Namespaces:: How namespaces work in @command{gawk}.
* Arbitrary Precision Arithmetic:: Arbitrary precision arithmetic with
        @command{gawk}.
* Dynamic Extensions:: Adding new built-in functions to
        @command{gawk}.
* Language History:: The evolution of the @command{awk}
        language.
* Installation:: Installing @command{gawk} under various
        operating systems.
* Notes:: Notes about adding things to @command{gawk}
        and possible future work.
* Basic Concepts:: A very quick introduction to programming
        concepts.
* Glossary:: An explanation of some unfamiliar terms.
* Copying:: Your right to copy and distribute
        @command{gawk}.
* GNU Free Documentation License:: The license for this @value{DOCUMENT}.
* Index:: Concept and Variable Index.

@detailmenu
* History:: The history of @command{gawk} and
        @command{awk}.
* Names:: What name to use to find
        @command{awk}.
* This Manual:: Using this @value{DOCUMENT}. Includes
        sample input files that you can use.
* Conventions:: Typographical Conventions.
* Manual History:: Brief history of the GNU project and
        this @value{DOCUMENT}.
* How To Contribute:: Helping to save the world.
* Acknowledgments:: Acknowledgments.
* Running gawk:: How to run @command{gawk} programs;
        includes command-line syntax.
* One-shot:: Running a short throwaway
        @command{awk} program.
* Read Terminal:: Using no input files (input from the
        keyboard instead).
* Long:: Putting permanent @command{awk}
        programs in files.
* Executable Scripts:: Making self-contained @command{awk}
        programs.
* Comments:: Adding documentation to @command{gawk}
        programs.
* Quoting:: More discussion of shell quoting
        issues.
* DOS Quoting:: Quoting in Windows Batch Files.
* Sample Data Files:: Sample data files for use in the
        @command{awk} programs illustrated in
        this @value{DOCUMENT}.
* Very Simple:: A very simple example.
* Two Rules:: A less simple one-line example using
        two rules.
* More Complex:: A more complex example.
* Statements/Lines:: Subdividing or combining statements
        into lines.
* Other Features:: Other Features of @command{awk}.
* When:: When to use @command{gawk} and when to
        use other things.
* Intro Summary:: Summary of the introduction.
* Command Line:: How to run @command{awk}.
* Options:: Command-line options and their
        meanings.
* Other Arguments:: Input file names and variable
        assignments.
* Naming Standard Input:: How to specify standard input with
        other files.
* Environment Variables:: The environment variables
        @command{gawk} uses.
* AWKPATH Variable:: Searching directories for
        @command{awk} programs.
* AWKLIBPATH Variable:: Searching directories for
        @command{awk} shared libraries.
* Other Environment Variables:: The environment variables.
* Exit Status:: @command{gawk}'s exit status.
* Include Files:: Including other files into your
        program.
* Loading Shared Libraries:: Loading shared libraries into your
        program.
* Obsolete:: Obsolete Options and/or features.
* Undocumented:: Undocumented Options and Features.
* Invoking Summary:: Invocation summary.
* Regexp Usage:: How to Use Regular Expressions.
* Escape Sequences:: How to write nonprinting characters.
* Regexp Operators:: Regular Expression Operators.
* Regexp Operator Details:: The actual details.
* Interval Expressions:: Notes on interval expressions.
* Bracket Expressions:: What can go between @samp{[...]}.
* Leftmost Longest:: How much text matches.
* Computed Regexps:: Using Dynamic Regexps.
* GNU Regexp Operators:: Operators specific to GNU software.
* Case-sensitivity:: How to do case-insensitive matching.
* Regexp Summary:: Regular expressions summary.
* Records:: Controlling how data is split into
        records.
* awk split records:: How standard @command{awk} splits
        records.
* gawk split records:: How @command{gawk} splits records.
* Fields:: An introduction to fields.
* Nonconstant Fields:: Nonconstant Field Numbers.
* Changing Fields:: Changing the Contents of a Field.
* Field Separators:: The field separator and how to change
        it.
* Default Field Splitting:: How fields are normally separated.
* Regexp Field Splitting:: Using regexps as the field separator.
* Single Character Fields:: Making each character a separate
        field.
* Command Line Field Separator:: Setting @code{FS} from the command
        line.
* Full Line Fields:: Making the full line be a single
        field.
* Field Splitting Summary:: Some final points and a summary table.
* Constant Size:: Reading constant width data.
* Fixed width data:: Processing fixed-width data.
* Skipping intervening:: Skipping intervening fields.
* Allowing trailing data:: Capturing optional trailing data.
* Fields with fixed data:: Field values with fixed-width data.
* Splitting By Content:: Defining Fields By Content
* More CSV:: More on CSV files.
* FS versus FPAT:: A subtle difference.
* Testing field creation:: Checking how @command{gawk} is
        splitting records.
* Multiple Line:: Reading multiline records.
* Getline:: Reading files under explicit program
        control using the @code{getline}
        function.
* Plain Getline:: Using @code{getline} with no
        arguments.
* Getline/Variable:: Using @code{getline} into a variable.
* Getline/File:: Using @code{getline} from a file.
* Getline/Variable/File:: Using @code{getline} into a variable
        from a file.
* Getline/Pipe:: Using @code{getline} from a pipe.
* Getline/Variable/Pipe:: Using @code{getline} into a variable
        from a pipe.
* Getline/Coprocess:: Using @code{getline} from a coprocess.
* Getline/Variable/Coprocess:: Using @code{getline} into a variable
        from a coprocess.
* Getline Notes:: Important things to know about
        @code{getline}.
* Getline Summary:: Summary of @code{getline} Variants.
* Read Timeout:: Reading input with a timeout.
* Retrying Input:: Retrying input after certain errors.
* Command-line directories:: What happens if you put a directory on
        the command line.
* Input Summary:: Input summary.
* Input Exercises:: Exercises.
* Print:: The @code{print} statement.
* Print Examples:: Simple examples of @code{print}
        statements.
* Output Separators:: The output separators and how to
        change them.
* OFMT:: Controlling Numeric Output With
        @code{print}.
* Printf:: The @code{printf} statement.
* Basic Printf:: Syntax of the @code{printf} statement.
* Control Letters:: Format-control letters.
* Format Modifiers:: Format-specification modifiers.
* Printf Examples:: Several examples.
* Redirection:: How to redirect output to multiple
        files and pipes.
* Special FD:: Special files for I/O.
* Special Files:: File name interpretation in
        @command{gawk}. @command{gawk} allows
        access to inherited file descriptors.
* Other Inherited Files:: Accessing other open files with
        @command{gawk}.
* Special Network:: Special files for network
        communications.
* Special Caveats:: Things to watch out for.
* Close Files And Pipes:: Closing Input and Output Files and
        Pipes.
* Nonfatal:: Enabling Nonfatal Output.
* Output Summary:: Output summary.
* Output Exercises:: Exercises.
* Values:: Constants, Variables, and Regular
        Expressions.
* Constants:: String, numeric and regexp constants.
* Scalar Constants:: Numeric and string constants.
* Nondecimal-numbers:: What are octal and hex numbers.
* Regexp Constants:: Regular Expression constants.
* Using Constant Regexps:: When and how to use a regexp constant.
* Standard Regexp Constants:: Regexp constants in standard
        @command{awk}.
* Strong Regexp Constants:: Strongly typed regexp constants.
* Variables:: Variables give names to values for
        later use.
* Using Variables:: Using variables in your programs.
* Assignment Options:: Setting variables on the command line
        and a summary of command-line syntax.
        This is an advanced method of input.
* Conversion:: The conversion of strings to numbers
        and vice versa.
* Strings And Numbers:: How @command{awk} Converts Between
        Strings And Numbers.
* Locale influences conversions:: How the locale may affect conversions.
* All Operators:: @command{gawk}'s operators.
* Arithmetic Ops:: Arithmetic operations (@samp{+},
        @samp{-}, etc.)
* Concatenation:: Concatenating strings.
* Assignment Ops:: Changing the value of a variable or a
        field.
* Increment Ops:: Incrementing the numeric value of a
        variable.
* Truth Values and Conditions:: Testing for true and false.
* Truth Values:: What is ``true'' and what is
        ``false''.
* Typing and Comparison:: How variables acquire types and how
        this affects comparison of numbers and
        strings with @samp{<}, etc.
* Variable Typing:: String type versus numeric type.
* Comparison Operators:: The comparison operators.
* POSIX String Comparison:: String comparison with POSIX rules.
* Boolean Ops:: Combining comparison expressions using
        boolean operators @samp{||} (``or''),
        @samp{&&} (``and'') and @samp{!}
        (``not'').
* Conditional Exp:: Conditional expressions select between
        two subexpressions under control of a
        third subexpression.
* Function Calls:: A function call is an expression.
* Precedence:: How various operators nest.
* Locales:: How the locale affects things.
* Expressions Summary:: Expressions summary.
* Pattern Overview:: What goes into a pattern.
* Regexp Patterns:: Using regexps as patterns.
* Expression Patterns:: Any expression can be used as a
        pattern.
* Ranges:: Pairs of patterns specify record
        ranges.
* BEGIN/END:: Specifying initialization and cleanup
        rules.
* Using BEGIN/END:: How and why to use BEGIN/END rules.
* I/O And BEGIN/END:: I/O issues in BEGIN/END rules.
* BEGINFILE/ENDFILE:: Two special patterns for advanced
        control.
* Empty:: The empty pattern, which matches every
        record.
* Using Shell Variables:: How to use shell variables with
        @command{awk}.
* Action Overview:: What goes into an action.
* Statements:: Describes the various control
        statements in detail.
* If Statement:: Conditionally execute some
        @command{awk} statements.
* While Statement:: Loop until some condition is
        satisfied.
* Do Statement:: Do specified action while looping
        until some condition is satisfied.
* For Statement:: Another looping statement, that
        provides initialization and increment
        clauses.
* Switch Statement:: Switch/case evaluation for conditional
        execution of statements based on a
        value.
* Break Statement:: Immediately exit the innermost
        enclosing loop.
* Continue Statement:: Skip to the end of the innermost
        enclosing loop.
* Next Statement:: Stop processing the current input
        record.
* Nextfile Statement:: Stop processing the current file.
* Exit Statement:: Stop execution of @command{awk}.
* Built-in Variables:: Summarizes the predefined variables.
* User-modified:: Built-in variables that you change to
        control @command{awk}.
* Auto-set:: Built-in variables where @command{awk}
        gives you information.
* ARGC and ARGV:: Ways to use @code{ARGC} and
        @code{ARGV}.
* Pattern Action Summary:: Patterns and Actions summary.
* Array Basics:: The basics of arrays.
* Array Intro:: Introduction to Arrays
* Reference to Elements:: How to examine one element of an
        array.
* Assigning Elements:: How to change an element of an array.
* Array Example:: Basic Example of an Array
* Scanning an Array:: A variation of the @code{for}
        statement. It loops through the
        indices of an array's existing
        elements.
* Controlling Scanning:: Controlling the order in which arrays
        are scanned.
* Numeric Array Subscripts:: How to use numbers as subscripts in
        @command{awk}.
* Uninitialized Subscripts:: Using Uninitialized variables as
        subscripts.
* Delete:: The @code{delete} statement removes an
        element from an array.
* Multidimensional:: Emulating multidimensional arrays in
        @command{awk}.
* Multiscanning:: Scanning multidimensional arrays.
* Arrays of Arrays:: True multidimensional arrays.
* Arrays Summary:: Summary of arrays.
* Built-in:: Summarizes the built-in functions.
* Calling Built-in:: How to call built-in functions.
* Numeric Functions:: Functions that work with numbers,
        including @code{int()}, @code{sin()}
        and @code{rand()}.
* String Functions:: Functions for string manipulation,
        such as @code{split()}, @code{match()}
        and @code{sprintf()}.
* Gory Details:: More than you want to know about
        @samp{\} and @samp{&} with
        @code{sub()}, @code{gsub()}, and
        @code{gensub()}.
* I/O Functions:: Functions for files and shell
        commands.
* Time Functions:: Functions for dealing with timestamps.
* Bitwise Functions:: Functions for bitwise operations.
* Type Functions:: Functions for type information.
* I18N Functions:: Functions for string translation.
* User-defined:: Describes User-defined functions in
        detail.
* Definition Syntax:: How to write definitions and what they
        mean.
* Function Example:: An example function definition and
        what it does.
* Function Calling:: Calling user-defined functions.
* Calling A Function:: Don't use spaces.
* Variable Scope:: Controlling variable scope.
* Pass By Value/Reference:: Passing parameters.
* Function Caveats:: Other points to know about functions.
* Return Statement:: Specifying the value a function
        returns.
* Dynamic Typing:: How variable types can change at
        runtime.
* Indirect Calls:: Choosing the function to call at
        runtime.
* Functions Summary:: Summary of functions.
* Library Names:: How to best name private global
        variables in library functions.
* General Functions:: Functions that are of general use.
* Strtonum Function:: A replacement for the built-in
        @code{strtonum()} function.
* Assert Function:: A function for assertions in
        @command{awk} programs.
* Round Function:: A function for rounding if
        @code{sprintf()} does not do it
        correctly.
* Cliff Random Function:: The Cliff Random Number Generator.
* Ordinal Functions:: Functions for using characters as
        numbers and vice versa.
* Join Function:: A function to join an array into a
        string.
* Getlocaltime Function:: A function to get formatted times.
* Readfile Function:: A function to read an entire file at
        once.
* Shell Quoting:: A function to quote strings for the
        shell.
* Isnumeric Function:: A function to test whether a value is
        numeric.
* Data File Management:: Functions for managing command-line
        data files.
* Filetrans Function:: A function for handling data file
        transitions.
* Rewind Function:: A function for rereading the current
        file.
* File Checking:: Checking that data files are readable.
* Empty Files:: Checking for zero-length files.
* Ignoring Assigns:: Treating assignments as file names.
* Getopt Function:: A function for processing command-line
        arguments.
* Passwd Functions:: Functions for getting user
        information.
* Group Functions:: Functions for getting group
        information.
* Walking Arrays:: A function to walk arrays of arrays.
* Library Functions Summary:: Summary of library functions.
* Library Exercises:: Exercises.
* Running Examples:: How to run these examples.
* Clones:: Clones of common utilities.
* Cut Program:: The @command{cut} utility.
* Egrep Program:: The @command{egrep} utility.
* Id Program:: The @command{id} utility.
* Split Program:: The @command{split} utility.
* Tee Program:: The @command{tee} utility.
* Uniq Program:: The @command{uniq} utility.
* Wc Program:: The @command{wc} utility.
* Bytes vs. Characters:: Modern character sets.
* Using extensions:: A brief intro to extensions.
* @command{wc} program:: Code for @file{wc.awk}.
* Miscellaneous Programs:: Some interesting @command{awk}
        programs.
* Dupword Program:: Finding duplicated words in a
        document.
* Alarm Program:: An alarm clock.
* Translate Program:: A program similar to the @command{tr}
        utility.
* Labels Program:: Printing mailing labels.
* Word Sorting:: A program to produce a word usage
        count.
* History Sorting:: Eliminating duplicate entries from a
        history file.
* Extract Program:: Pulling out programs from Texinfo
        source files.
* Simple Sed:: A Simple Stream Editor.
* Igawk Program:: A wrapper for @command{awk} that
        includes files.
* Anagram Program:: Finding anagrams from a dictionary.
* Signature Program:: People do amazing things with too much
        time on their hands.
* Programs Summary:: Summary of programs.
* Programs Exercises:: Exercises.
* Nondecimal Data:: Allowing nondecimal input data.
* Array Sorting:: Facilities for controlling array
        traversal and sorting arrays.
* Controlling Array Traversal:: How to use PROCINFO["sorted_in"].
* Array Sorting Functions:: How to use @code{asort()} and
        @code{asorti()}.
* Two-way I/O:: Two-way communications with another
        process.
* TCP/IP Networking:: Using @command{gawk} for network
        programming.
* Profiling:: Profiling your @command{awk} programs.
* Extension Philosophy:: What should be built-in and what
        should not.
* Advanced Features Summary:: Summary of advanced features.
* I18N and L10N:: Internationalization and Localization.
* Explaining gettext:: How GNU @command{gettext} works.
* Programmer i18n:: Features for the programmer.
* Translator i18n:: Features for the translator.
* String Extraction:: Extracting marked strings.
* Printf Ordering:: Rearranging @code{printf} arguments.
* I18N Portability:: @command{awk}-level portability
        issues.
* I18N Example:: A simple i18n example.
* Gawk I18N:: @command{gawk} is also
        internationalized.
* I18N Summary:: Summary of I18N stuff.
* Debugging:: Introduction to @command{gawk}
        debugger.
* Debugging Concepts:: Debugging in General.
* Debugging Terms:: Additional Debugging Concepts.
* Awk Debugging:: Awk Debugging.
* Sample Debugging Session:: Sample debugging session.
* Debugger Invocation:: How to Start the Debugger.
* Finding The Bug:: Finding the Bug.
* List of Debugger Commands:: Main debugger commands.
* Breakpoint Control:: Control of Breakpoints.
* Debugger Execution Control:: Control of Execution.
* Viewing And Changing Data:: Viewing and Changing Data.
* Execution Stack:: Dealing with the Stack.
* Debugger Info:: Obtaining Information about the
        Program and the Debugger State.
* Miscellaneous Debugger Commands:: Miscellaneous Commands.
* Readline Support:: Readline support.
* Limitations:: Limitations and future plans.
* Debugging Summary:: Debugging summary.
* Global Namespace:: The global namespace in standard
        @command{awk}.
* Qualified Names:: How to qualify names with a namespace.
* Default Namespace:: The default namespace.
* Changing The Namespace:: How to change the namespace.
* Naming Rules:: Namespace and Component Naming Rules.
* Internal Name Management:: How names are stored internally.
* Namespace Example:: An example of code using a namespace.
* Namespace And Features:: Namespaces and other @command{gawk}
        features.
* Namespace Summary:: Summarizing namespaces.
* Computer Arithmetic:: A quick intro to computer math.
* Math Definitions:: Defining terms used.
* MPFR features:: The MPFR features in @command{gawk}.
* FP Math Caution:: Things to know.
* Inexactness of computations:: Floating point math is not exact.
* Inexact representation:: Numbers are not exactly represented.
* Comparing FP Values:: How to compare floating point values.
* Errors accumulate:: Errors get bigger as they go.
* Getting Accuracy:: Getting more accuracy takes some work.
* Try To Round:: Add digits and round.
* Setting precision:: How to set the precision.
* Setting the rounding mode:: How to set the rounding mode.
* Arbitrary Precision Integers:: Arbitrary Precision Integer Arithmetic
        with @command{gawk}.
* Checking for MPFR:: How to check if MPFR is available.
* POSIX Floating Point Problems:: Standards Versus Existing Practice.
* Floating point summary:: Summary of floating point discussion.
* Extension Intro:: What is an extension.
* Plugin License:: A note about licensing.
* Extension Mechanism Outline:: An outline of how it works.
* Extension API Description:: A full description of the API.
* Extension API Functions Introduction:: Introduction to the API functions.
* General Data Types:: The data types.
* Memory Allocation Functions:: Functions for allocating memory.
* Constructor Functions:: Functions for creating values.
* API Ownership of MPFR and GMP Values:: Managing MPFR and GMP Values.
* Registration Functions:: Functions to register things with
        @command{gawk}.
* Extension Functions:: Registering extension functions.
* Exit Callback Functions:: Registering an exit callback.
* Extension Version String:: Registering a version string.
* Input Parsers:: Registering an input parser.
* Output Wrappers:: Registering an output wrapper.
* Two-way processors:: Registering a two-way processor.
* Printing Messages:: Functions for printing messages.
* Updating @code{ERRNO}:: Functions for updating @code{ERRNO}.
* Requesting Values:: How to get a value.
* Accessing Parameters:: Functions for accessing parameters.
* Symbol Table Access:: Functions for accessing global
        variables.
* Symbol table by name:: Accessing variables by name.
* Symbol table by cookie:: Accessing variables by ``cookie''.
* Cached values:: Creating and using cached values.
* Array Manipulation:: Functions for working with arrays.
* Array Data Types:: Data types for working with arrays.
* Array Functions:: Functions for working with arrays.
* Flattening Arrays:: How to flatten arrays.
* Creating Arrays:: How to create and populate arrays.
* Redirection API:: How to access and manipulate
        redirections.
* Extension API Variables:: Variables provided by the API.
* Extension Versioning:: API Version information.
* Extension GMP/MPFR Versioning:: Version information about GMP and
        MPFR.
* Extension API Informational Variables:: Variables providing information about
        @command{gawk}'s invocation.
* Extension API Boilerplate:: Boilerplate code for using the API.
* Changes from API V1:: Changes from V1 of the API.
* Finding Extensions:: How @command{gawk} finds compiled
        extensions.
* Extension Example:: Example C code for an extension.
* Internal File Description:: What the new functions will do.
* Internal File Ops:: The code for internal file operations.
* Using Internal File Ops:: How to use an external extension.
* Extension Samples:: The sample extensions that ship with
        @command{gawk}.
* Extension Sample File Functions:: The file functions sample.
* Extension Sample Fnmatch:: An interface to @code{fnmatch()}.
* Extension Sample Fork:: An interface to @code{fork()} and
        other process functions.
* Extension Sample Inplace:: Enabling in-place file editing.
* Extension Sample Ord:: Character to value to character
        conversions.
* Extension Sample Readdir:: An interface to @code{readdir()}.
* Extension Sample Revout:: Reversing output sample output
        wrapper.
* Extension Sample Rev2way:: Reversing data sample two-way
        processor.
* Extension Sample Read write array:: Serializing an array to a file.
* Extension Sample Readfile:: Reading an entire file into a string.
* Extension Sample Time:: An interface to @code{gettimeofday()}
        and @code{sleep()}.
* Extension Sample API Tests:: Tests for the API.
* gawkextlib:: The @code{gawkextlib} project.
* Extension summary:: Extension summary.
* Extension Exercises:: Exercises.
* V7/SVR3.1:: The major changes between V7 and
        System V Release 3.1.
* SVR4:: Minor changes between System V
        Releases 3.1 and 4.
* POSIX:: New features from the POSIX standard.
* BTL:: New features from Brian Kernighan's
        version of @command{awk}.
* POSIX/GNU:: The extensions in @command{gawk} not
        in POSIX @command{awk}.
* Feature History:: The history of the features in
        @command{gawk}.
* Common Extensions:: Common Extensions Summary.
* Ranges and Locales:: How locales used to affect regexp
        ranges.
* Contributors:: The major contributors to
        @command{gawk}.
* History summary:: History summary.
* Gawk Distribution:: What is in the @command{gawk}
        distribution.
* Getting:: How to get the distribution.
* Extracting:: How to extract the distribution.
* Distribution contents:: What is in the distribution.
* Unix Installation:: Installing @command{gawk} under
        various versions of Unix.
* Quick Installation:: Compiling @command{gawk} under Unix.
* Compiling with MPFR:: Building with MPFR.
* Shell Startup Files:: Shell convenience functions.
* Additional Configuration Options:: Other compile-time options.
* Configuration Philosophy:: How it's all supposed to work.
* Compiling from Git:: Compiling from Git.
* Building the Documentation:: Building the Documentation.
* Non-Unix Installation:: Installation on Other Operating
        Systems.
* PC Installation:: Installing and Compiling
        @command{gawk} on Microsoft Windows.
* PC Binary Installation:: Installing a prepared distribution.
* PC Compiling:: Compiling @command{gawk} for
        Windows32.
* PC Using:: Running @command{gawk} on Windows32.
* Cygwin:: Building and running @command{gawk}
        for Cygwin.
* MSYS:: Using @command{gawk} In The MSYS
        Environment.
* VMS Installation:: Installing @command{gawk} on VMS.
* VMS Compilation:: How to compile @command{gawk} under
        VMS.
* VMS Dynamic Extensions:: Compiling @command{gawk} dynamic
        extensions on VMS.
* VMS Installation Details:: How to install @command{gawk} under
        VMS.
* VMS Running:: How to run @command{gawk} under VMS.
* VMS GNV:: The VMS GNV Project.
* Bugs:: Reporting Problems and Bugs.
* Bug definition:: Defining what is and is not a bug.
* Bug address:: Where to send reports to.
* Usenet:: Where not to send reports to.
* Performance bugs:: What to do if you think there is a
        performance issue.
* Asking for help:: Dealing with non-bug questions.
* Maintainers:: Maintainers of non-*nix ports.
* Other Versions:: Other freely available @command{awk}
        implementations.
* Installation summary:: Summary of installation.
* Compatibility Mode:: How to disable certain @command{gawk}
        extensions.
* Additions:: Making Additions To @command{gawk}.
* Accessing The Source:: Accessing the Git repository.
* Adding Code:: Adding code to the main body of
        @command{gawk}.
* New Ports:: Porting @command{gawk} to a new
        operating system.
* Derived Files:: Why derived files are kept in the Git
        repository.
* Future Extensions:: New features that may be implemented
        one day.
* Implementation Limitations:: Some limitations of the
        implementation.
* Extension Design:: Design notes about the extension API.
* Old Extension Problems:: Problems with the old mechanism.
* Extension New Mechanism Goals:: Goals for the new mechanism.
* Extension Other Design Decisions:: Some other design decisions.
* Extension Future Growth:: Some room for future growth.
* Notes summary:: Summary of implementation notes.
* Basic High Level:: The high level view.
* Basic Data Typing:: A very quick intro to data types.
@end detailmenu
@end menu

@c dedication for Info file
@ifinfo
To my parents, for their love, and for the wonderful
example they set for me.
@sp 1
To my wife Miriam, for making me complete.
Thank you for building your life together with me.
@sp 1
To our children Chana, Rivka, Nachum and Malka,
for enrichening our lives in innumerable ways.
@end ifinfo

@summarycontents
@contents

@node Foreword3
@unnumbered Foreword to the Third Edition

@c This bit is post-processed by a script which turns the chapter
@c tag into a preface tag, and moves this stuff to before the title.
@c Bleah.
@docbook
  <prefaceinfo>
    <author>
      <firstname>Michael</firstname>
      <surname>Brennan</surname>
      <!-- can't put mawk into command tags. sigh. -->
      <affiliation><jobtitle>Author of mawk</jobtitle></affiliation>
    </author>
    <date>March 2001</date>
  </prefaceinfo>
@end docbook

Arnold Robbins and I are good friends. We were introduced
@c 11 years ago
in 1990
by circumstances---and our favorite programming language, AWK.
The circumstances started a couple of years
earlier. I was working at a new job and noticed an unplugged
Unix computer sitting in the corner. No one knew how to use it,
and neither did I. However,
a couple of days later, it was running, and
I was @code{root} and the one-and-only user.
That day, I began the transition from statistician to Unix programmer.

On one of many trips to the library or bookstore in search of
books on Unix, I found the gray AWK book, a.k.a.@:
Alfred V.@: Aho, Brian W.@: Kernighan, and
Peter J.@: Weinberger's @cite{The AWK Programming Language} (Addison-Wesley,
1988).
@command{awk}'s simple programming paradigm---find a pattern in the
input and then perform an action---often reduced complex or tedious
data manipulations to a few lines of code. I was excited to try my
hand at programming in AWK.

Alas, the @command{awk} on my computer was a limited version of the
language described in the gray book. I discovered that my computer
had ``old @command{awk}'' and the book described
``new @command{awk}.''
I learned that this was typical; the old version refused to step
aside or relinquish its name. If a system had a new @command{awk}, it was
invariably called @command{nawk}, and few systems had it.
The best way to get a new @command{awk} was to @command{ftp} the source code for
@command{gawk} from @code{prep.ai.mit.edu}. @command{gawk} was a version of
new @command{awk} written by David Trueman and Arnold, and available under
the GNU General Public License.

(Incidentally,
it's no longer difficult to find a new @command{awk}. @command{gawk} ships with
GNU/Linux, and you can download binaries or source code for almost
any system; my wife uses @command{gawk} on her VMS box.)

My Unix system started out unplugged from the wall; it certainly was not
plugged into a network. So, oblivious to the existence of @command{gawk}
and the Unix community in general, and desiring a new @command{awk}, I wrote
my own, called @command{mawk}.
Before I was finished, I knew about @command{gawk},
but it was too late to stop, so I eventually posted
to a @code{comp.sources} newsgroup.

A few days after my posting, I got a friendly email
from Arnold introducing
himself. He suggested we share design and algorithms and
attached a draft of the POSIX standard so
that I could update @command{mawk} to support language extensions added
after publication of @cite{The AWK Programming Language}.

Frankly, if our roles had
been reversed, I would not have been so open and we probably would
have never met. I'm glad we did meet.
He is an AWK expert's AWK expert and a genuinely nice person.
Arnold contributes significant amounts of his
expertise and time to the Free Software Foundation.

This book is the @command{gawk} reference manual, but at its core it
is a book about AWK programming that
will appeal to a wide audience.
It is a definitive reference to the AWK language as defined by the
1987 Bell Laboratories release and codified in the 1992 POSIX Utilities
standard.

On the other hand, the novice AWK programmer can study
a wealth of practical programs that emphasize
the power of AWK's basic idioms:
data-driven control flow, pattern matching with regular expressions,
and associative arrays.
Those looking for something new can try out @command{gawk}'s
interface to network protocols via special @file{/inet} files.

The programs in this book make clear that an AWK program is
typically much smaller and faster to develop than
a counterpart written in C.
Consequently, there is often a payoff to prototyping an
algorithm or design in AWK to get it running quickly and expose
problems early. Often, the interpreted performance is adequate
and the AWK prototype becomes the product.

The new @command{pgawk} (profiling @command{gawk}), produces
program execution counts.
I recently experimented with an algorithm that for
@ifnotdocbook
@math{n}
@end ifnotdocbook
@ifdocbook
@i{n}
@end ifdocbook
lines of input, exhibited
@tex
$\sim\! Cn^2$
@end tex
@ifnottex
@ifnotdocbook
~ C n^2
@end ifnotdocbook
@end ifnottex
@docbook
<emphasis>∼ Cn<superscript>2</superscript></emphasis>
@end docbook
performance, while
theory predicted
@tex
$\sim\!
Cn\log n$
@end tex
@ifnottex
@ifnotdocbook
~ C n log n
@end ifnotdocbook
@end ifnottex
@docbook
<emphasis>∼ Cn log n</emphasis>
@end docbook
behavior. A few minutes poring
over the @file{awkprof.out} profile pinpointed the problem to
a single line of code. @command{pgawk} is a welcome addition to
my programmer's toolbox.

Arnold has distilled over a decade of experience writing and
using AWK programs, and developing @command{gawk}, into this book. If you use
AWK or want to learn how, then read this book.

@ifnotdocbook
@cindex Brennan, Michael
@display
Michael Brennan
Author of @command{mawk}
March 2001
@end display
@end ifnotdocbook

@node Foreword4
@unnumbered Foreword to the Fourth Edition

@c This bit is post-processed by a script which turns the chapter
@c tag into a preface tag, and moves this stuff to before the title.
@c Bleah.
@docbook
  <prefaceinfo>
    <author>
      <firstname>Michael</firstname>
      <surname>Brennan</surname>
      <!-- can't put mawk into command tags. sigh. -->
      <affiliation><jobtitle>Author of mawk</jobtitle></affiliation>
    </author>
    <date>October 2014</date>
  </prefaceinfo>
@end docbook

Some things don't change. Thirteen years ago I wrote:
``If you use AWK or want to learn how, then read this book.''
True then, and still true today.

Learning to use a programming language is about more than mastering the
syntax. One needs to acquire an understanding of how to use the
features of the language to solve practical programming problems.
A focus of this book is many examples that show how to use AWK.

Some things do change. Our computers are much faster and have more memory.
Consequently, speed and storage inefficiencies of a high-level language
matter less.
Prototyping in AWK and then rewriting in C for performance
reasons happens less, because more often the prototype is fast enough.

Of course, there are computing operations that are best done in C or C++.
With @command{gawk} 4.1 and later, you do not have to choose between writing
your program in AWK or in C/C++. You can write most of your
program in AWK and the aspects that require C/C++ capabilities can be written
in C/C++, and then the pieces glued together when the @command{gawk} module loads
the C/C++ module as a dynamic plug-in.
@c Chapter 16
@ref{Dynamic Extensions},
has all the
details, and, as expected, many examples to help you learn the ins and outs.

I enjoy programming in AWK and had fun (re)reading this book.
I think you will too.

@ifnotdocbook
@cindex Brennan, Michael
@display
Michael Brennan
Author of @command{mawk}
October 2014
@end display
@end ifnotdocbook

@node Preface
@unnumbered Preface
@c I saw a comment somewhere that the preface should describe the book itself,
@c and the introduction should describe what the book covers.
@c
@c 12/2000: Chuck wants the preface & intro combined.

@c This bit is post-processed by a script which turns the chapter
@c tag into a preface tag, and moves this stuff to before the title.
@c Bleah.
@docbook
  <prefaceinfo>
    <author>
      <firstname>Arnold</firstname>
      <surname>Robbins</surname>
      <affiliation><jobtitle>Nof Ayalon</jobtitle></affiliation>
      <affiliation><jobtitle>Israel</jobtitle></affiliation>
    </author>
    <date>February 2015</date>
  </prefaceinfo>
@end docbook

@cindex @command{awk}
Several kinds of tasks occur repeatedly when working with text files.
You might want to extract certain lines and discard the rest.
Or you
may need to make changes wherever certain patterns appear, but leave the
rest of the file alone. Such jobs are often easy with @command{awk}.
The @command{awk} utility interprets a special-purpose programming
language that makes it easy to handle simple data-reformatting jobs.

@cindex @command{gawk}
The GNU implementation of @command{awk} is called @command{gawk}; if you
invoke it with the proper options or environment variables,
it is fully compatible with
the POSIX@footnote{The 2018 POSIX standard is accessible online at
@w{@url{https://pubs.opengroup.org/onlinepubs/9699919799/}.}}
specification of the @command{awk} language
and with the Unix version of @command{awk} maintained
by Brian Kernighan.
This means that all
properly written @command{awk} programs should work with @command{gawk}.
So most of the time, we don't distinguish between @command{gawk} and other
@command{awk} implementations.

@cindex @command{awk} @subentry POSIX and @seealso{POSIX @command{awk}}
@cindex @command{awk} @subentry POSIX and
@cindex POSIX @subentry @command{awk} and
@cindex @command{gawk} @subentry @command{awk} and
@cindex @command{awk} @subentry @command{gawk} and
@cindex @command{awk} @subentry uses for
Using @command{awk} you can:

@itemize @value{BULLET}
@item
Manage small, personal databases

@item
Generate reports

@item
Validate data

@item
Produce indexes and perform other document-preparation tasks

@item
Experiment with algorithms that you can adapt later to other computer
languages
@end itemize

@cindex @command{awk} @seealso{@command{gawk}}
@cindex @command{gawk} @seealso{@command{awk}}
@cindex @command{gawk} @subentry uses for
In addition,
@command{gawk}
provides facilities that make it easy to:

@itemize @value{BULLET}
@item
Extract bits and
pieces of data for processing

@item
Sort data

@item
Perform simple network communications

@item
Profile and debug @command{awk} programs

@item
Extend the language with functions written in C or C++
@end itemize

This @value{DOCUMENT} teaches you about the @command{awk} language and
how you can use it effectively. You should already be familiar with basic
system commands, such as @command{cat} and @command{ls},@footnote{These utilities
are available on POSIX-compliant systems, as well as on traditional
Unix-based systems. If you are using some other operating system, you still need to
be familiar with the ideas of I/O redirection and pipes.} as well as basic shell
facilities, such as input/output (I/O) redirection and pipes.

@cindex GNU @command{awk} @seeentry{@command{gawk}}
Implementations of the @command{awk} language are available for many
different computing environments. This @value{DOCUMENT}, while describing
the @command{awk} language in general, also describes the particular
implementation of @command{awk} called @command{gawk} (which stands for
``GNU @command{awk}''). @command{gawk} runs on a broad range of Unix systems,
ranging from Intel-architecture PC-based computers
up through large-scale systems.
@command{gawk} has also been ported to Mac OS X,
Microsoft Windows
(all versions),
and OpenVMS.@footnote{Some other, obsolete systems to which @command{gawk}
was once ported are no longer supported and the code for those systems
has been removed.}

@menu
* History:: The history of @command{gawk} and
        @command{awk}.
* Names:: What name to use to find @command{awk}.
* This Manual:: Using this @value{DOCUMENT}. Includes sample
        input files that you can use.
* Conventions:: Typographical Conventions.
* Manual History:: Brief history of the GNU project and this
        @value{DOCUMENT}.
* How To Contribute:: Helping to save the world.
* Acknowledgments:: Acknowledgments.
@end menu

@node History
@unnumberedsec History of @command{awk} and @command{gawk}
@cindex recipe for a programming language
@cindex programming language, recipe for
@sidebar Recipe for a Programming Language

@multitable {2 parts} {1 part @code{egrep}} {1 part @code{snobol}}
@item @tab 1 part @code{egrep} @tab 1 part @code{snobol}
@item @tab 2 parts @code{ed} @tab 3 parts C
@end multitable

Blend all parts well using @code{lex} and @code{yacc}.
Document minimally and release.

After eight years, add another part @code{egrep} and two
more parts C. Document very well and release.
@end sidebar

@cindex Aho, Alfred
@cindex Weinberger, Peter
@cindex Kernighan, Brian
@cindex @command{awk} @subentry history of
The name @command{awk} comes from the initials of its designers: Alfred V.@:
Aho, Peter J.@: Weinberger, and Brian W.@: Kernighan. The original version of
@command{awk} was written in 1977 at AT&T Bell Laboratories.
In 1985, a new version made the programming
language more powerful, introducing user-defined functions, multiple input
streams, and computed regular expressions.
This new version became widely available with Unix System V
Release 3.1 (1987).
The version in System V Release 4 (1989) added some new features and cleaned
up the behavior in some of the ``dark corners'' of the language.
The specification for @command{awk} in the POSIX Command Language
and Utilities standard further clarified the language.
Both the @command{gawk} designers and the original @command{awk} designers at Bell Laboratories
provided feedback for the POSIX specification.

@cindex Rubin, Paul
@cindex Fenlason, Jay
@cindex Trueman, David
Paul Rubin wrote @command{gawk} in 1986.
Jay Fenlason completed it, with advice from Richard Stallman. John Woods
contributed parts of the code as well. In 1988 and 1989, David Trueman, with
help from me, thoroughly reworked @command{gawk} for compatibility
with the newer @command{awk}.
Circa 1994, I became the primary maintainer.
Current development focuses on bug fixes,
performance improvements, standards compliance, and, occasionally, new features.

In May 1997, J@"urgen Kahrs felt the need for network access
from @command{awk}, and with a little help from me, set about adding
features to do this for @command{gawk}. At that time, he also
wrote the bulk of
@cite{@value{GAWKINETTITLE}}
(a separate document, available as part of the @command{gawk} distribution).
His code finally became part of the main @command{gawk} distribution
with @command{gawk} @value{PVERSION} 3.1.

John Haque rewrote the @command{gawk} internals, in the process providing
an @command{awk}-level debugger. This version became available as
@command{gawk} @value{PVERSION} 4.0 in 2011.

@xref{Contributors}
for a full list of those who have made important contributions to @command{gawk}.

@node Names
@unnumberedsec A Rose by Any Other Name

@cindex @command{awk} @subentry new vs.@: old
The @command{awk} language has evolved over the years. Full details are
provided in @ref{Language History}.
The language described in this @value{DOCUMENT}
is often referred to as ``new @command{awk}.''
By analogy, the original version of @command{awk} is
referred to as ``old @command{awk}.''

On most current systems, when you run the @command{awk} utility
you get some version of new @command{awk}.@footnote{Only
Solaris systems still use an old @command{awk} for the
default @command{awk} utility. A more modern @command{awk} lives in
@file{/usr/xpg6/bin} on these systems.} If your system's standard
@command{awk} is the old one, you will see something like this
if you try the following test program:

@example
@group
$ @kbd{awk 1 /dev/null}
@error{} awk: syntax error near line 1
@error{} awk: bailing out near line 1
@end group
@end example

@noindent
In this case, you should find a version of new @command{awk},
or just install @command{gawk}!

Throughout this @value{DOCUMENT}, whenever we refer to a language feature
that should be available in any complete implementation of POSIX @command{awk},
we simply use the term @command{awk}. When referring to a feature that is
specific to the GNU implementation, we use the term @command{gawk}.

@node This Manual
@unnumberedsec Using This Book
@cindex @command{awk} @subentry terms describing

The term @command{awk} refers to a particular program as well as to the language you
use to tell this program what to do. When we need to be careful, we call
the language ``the @command{awk} language,''
and the program ``the @command{awk} utility.''
This @value{DOCUMENT} explains
both how to write programs in the @command{awk} language and how to
run the @command{awk} utility.
The term ``@command{awk} program'' refers to a program written by you in
the @command{awk} programming language.

@cindex @command{gawk} @subentry @command{awk} and
@cindex @command{awk} @subentry @command{gawk} and
@cindex POSIX @command{awk}
Primarily, this @value{DOCUMENT} explains the features of @command{awk}
as defined in the POSIX standard. It does so in the context of the
@command{gawk} implementation. While doing so, it also
attempts to describe important differences between @command{gawk}
and other @command{awk}
@ifclear FOR_PRINT
implementations.@footnote{All such differences
appear in the index under the
entry ``differences in @command{awk} and @command{gawk}.''}
@end ifclear
@ifset FOR_PRINT
implementations.
@end ifset
Finally, it notes any @command{gawk} features that are not in
the POSIX standard for @command{awk}.

@ifnotinfo
This @value{DOCUMENT} has the difficult task of being both a tutorial and a reference.
If you are a novice, feel free to skip over details that seem too complex.
You should also ignore the many cross-references; they are for the
expert user and for the Info and
@uref{https://www.gnu.org/software/gawk/manual/, HTML}
versions of the @value{DOCUMENT}.
@end ifnotinfo

There are sidebars
scattered throughout the @value{DOCUMENT}.
They add a more complete explanation of points that are relevant, but not likely
to be of interest on first reading.
@ifclear FOR_PRINT
All appear in the index, under the heading ``sidebar.''
@end ifclear

Most of the time, the examples use complete @command{awk} programs.
Some of the more advanced @value{SECTION}s show only the part of the @command{awk}
program that illustrates the concept being described.

Although this @value{DOCUMENT} is aimed principally at people who have not been
exposed
to @command{awk}, there is a lot of information here that even the @command{awk}
expert should find useful.
In particular, the description of POSIX
@command{awk} and the example programs in
@ref{Library Functions}, and
@ifnotdocbook
in
@end ifnotdocbook
@ref{Sample Programs},
should be of interest.

This @value{DOCUMENT} is split into several parts, as follows:

@c FULLXREF ON

@itemize @value{BULLET}
@item
Part I describes the @command{awk} language and the @command{gawk} program in detail.
It starts with the basics, and continues through all of the features of @command{awk}.
It contains the following chapters:

@c nested
@itemize @value{MINUS}
@item
@ref{Getting Started},
provides the essentials you need to know to begin using @command{awk}.

@item
@ref{Invoking Gawk},
describes how to run @command{gawk}, the meaning of its
command-line options, and how it finds @command{awk}
program source files.

@item
@ref{Regexp},
introduces regular expressions in general, and in particular the flavors
supported by POSIX @command{awk} and @command{gawk}.

@item
@ref{Reading Files},
describes how @command{awk} reads your data.
It introduces the concepts of records and fields, as well
as the @code{getline} command.
I/O redirection is first described here.
Network I/O is also briefly introduced here.

@item
@ref{Printing},
describes how @command{awk} programs can produce output with
@code{print} and @code{printf}.

@item
@ref{Expressions},
describes expressions, which are the basic building blocks
for getting most things done in a program.

@item
@ref{Patterns and Actions},
describes how to write patterns for matching records, actions for
doing something when a record is matched, and the predefined variables
@command{awk} and @command{gawk} use.

@item
@ref{Arrays},
covers @command{awk}'s one-and-only data structure: the associative array.
Deleting array elements and whole arrays is described, as well as
sorting arrays in @command{gawk}. The @value{CHAPTER} also describes how
@command{gawk} provides arrays of arrays.

@item
@ref{Functions},
describes the built-in functions @command{awk} and @command{gawk} provide,
as well as how to define your own functions. It also discusses how
@command{gawk} lets you call functions indirectly.
@end itemize

@item
Part II shows how to use @command{awk} and @command{gawk} for problem solving.
There is lots of code here for you to read and learn from.
This part contains the following chapters:

@c nested
@itemize @value{MINUS}
@item
@ref{Library Functions}, provides a number of functions meant to
be used from main @command{awk} programs.

@item
@ref{Sample Programs},
provides many sample @command{awk} programs.
@end itemize

Reading these two chapters allows you to see @command{awk}
solving real problems.

@item
Part III focuses on features specific to @command{gawk}.
It contains the following chapters:

@c nested
@itemize @value{MINUS}
@item
@ref{Advanced Features},
describes a number of advanced features.
Of particular note
are the abilities to control the order of array traversal,
have two-way communications with another process,
perform TCP/IP networking, and
profile your @command{awk} programs.

@item
@ref{Internationalization},
describes special features for translating program
messages into different languages at runtime.

@item
@ref{Debugger}, describes the @command{gawk} debugger.

@item
@ref{Namespaces}, describes how @command{gawk} allows variables and/or
functions of the same name to be in different namespaces.

@item
@ref{Arbitrary Precision Arithmetic},
describes advanced arithmetic facilities.

@item
@ref{Dynamic Extensions}, describes how to add new variables and
functions to @command{gawk} by writing extensions in C or C++.
@end itemize

@item
@ifclear FOR_PRINT
Part IV provides the appendices, the Glossary, and two licenses that cover
the @command{gawk} source code and this @value{DOCUMENT}, respectively.
It contains the following appendices:
@end ifclear
@ifset FOR_PRINT
Part IV provides the following appendices,
including the GNU General Public License:
@end ifset

@itemize @value{MINUS}
@item
@ref{Language History},
describes how the @command{awk} language has evolved since
its first release to the present. It also describes how @command{gawk}
has acquired features over time.

@item
@ref{Installation},
describes how to get @command{gawk}, how to compile it
on POSIX-compatible systems,
and how to compile and use it on different
non-POSIX systems. It also describes how to report bugs
in @command{gawk} and where to get other freely
available @command{awk} implementations.

@ifset FOR_PRINT
@item
@ref{Copying},
presents the license that covers the @command{gawk} source code.
@end ifset

@ifclear FOR_PRINT
@item
@ref{Notes},
describes how to disable @command{gawk}'s extensions, as
well as how to contribute new code to @command{gawk},
and some possible future directions for @command{gawk} development.

@item
@ref{Basic Concepts},
provides some very cursory background material for those who
are completely unfamiliar with computer programming.

@item
The @ref{Glossary}, defines most, if not all, of the significant terms used
throughout the @value{DOCUMENT}. If you find terms that you aren't familiar with,
try looking them up here.

@item
@ref{Copying}, and
@ref{GNU Free Documentation License},
present the licenses that cover the @command{gawk} source code
and this @value{DOCUMENT}, respectively.
@end ifclear
@end itemize
@end itemize

@ifset FOR_PRINT
The version of this @value{DOCUMENT} distributed with @command{gawk}
contains additional appendices and other end material.
To save space, we have omitted them from the
printed edition. You may find them online, as follows:

@itemize @value{BULLET}
@item
@uref{https://www.gnu.org/software/gawk/manual/html_node/Notes.html,
The appendix on implementation notes}
describes how to disable @command{gawk}'s extensions, how to contribute
new code to @command{gawk}, where to find information on some possible
future directions for @command{gawk} development, and the design decisions
behind the extension API.

@item
@uref{https://www.gnu.org/software/gawk/manual/html_node/Basic-Concepts.html,
The appendix on basic concepts}
provides some very cursory background material for those who
are completely unfamiliar with computer programming.

@item
@uref{https://www.gnu.org/software/gawk/manual/html_node/Glossary.html,
The Glossary}
defines most, if not all, of the significant terms used
throughout the @value{DOCUMENT}. If you find terms that you aren't familiar with,
try looking them up here.

@item
@uref{https://www.gnu.org/software/gawk/manual/html_node/GNU-Free-Documentation-License.html,
The GNU FDL}
is the license that covers this @value{DOCUMENT}.
@end itemize

@c ok not to use CHAPTER / SECTION here
Some of the chapters have exercise sections; these have also been
omitted from the print edition but are available online.
@end ifset

@c FULLXREF OFF

@node Conventions
@unnumberedsec Typographical Conventions

@cindex Texinfo
This @value{DOCUMENT} is written in @uref{https://www.gnu.org/software/texinfo/, Texinfo},
the GNU documentation formatting language.
A single Texinfo source file is used to produce both the printed and online
versions of the documentation.
@ifnotinfo
Because of this, the typographical conventions
are slightly different than in other books you may have read.
@end ifnotinfo
@ifinfo
This @value{SECTION} briefly documents the typographical conventions used in Texinfo.
@end ifinfo

Examples you would type at the command line are preceded by the common
shell primary and secondary prompts, @samp{$} and @samp{>}, respectively.
Input that you type is shown @kbd{like this}.
@c 8/2014: @print{} is stripped from the texi to make docbook.
@ifclear FOR_PRINT
Output from the command is preceded by the glyph ``@print{}''.
This typically represents the command's standard output.
@end ifclear
@ifset FOR_PRINT
Output from the command, usually its standard output, appears
@code{like this}.
@end ifset
Error messages and other output on the command's standard error are preceded
by the glyph ``@error{}''. For example:

@example
$ @kbd{echo hi on stdout}
@print{} hi on stdout
$ @kbd{echo hello on stderr 1>&2}
@error{} hello on stderr
@end example

@ifnotinfo
In the text, almost anything related to programming, such as
command names,
variable and function names, and string, numeric and regexp constants
appear in @code{this font}. Code fragments
appear in the same font and quoted, @samp{like this}.
Things that are replaced by the user or programmer
appear in @var{this font}.
Options look like this: @option{-f}.
@value{FFN}s are indicated like this: @file{/path/to/ourfile}.
@ifclear FOR_PRINT
Some things are
emphasized @emph{like this}, and if a point needs to be made
strongly, it is done @strong{like this}.
@end ifclear
The first occurrence of
a new term is usually its @dfn{definition} and appears in the same
font as the previous occurrence of ``definition'' in this sentence.
@end ifnotinfo

Characters that you type at the keyboard look @kbd{like this}. In particular,
there are special characters called ``control characters.'' These are
characters that you type by holding down both the @kbd{CONTROL} key and
another key, at the same time. For example, a @kbd{Ctrl-d} is typed
by first pressing and holding the @kbd{CONTROL} key, next
pressing the @kbd{d} key, and finally releasing both keys.

For the sake of brevity, throughout this @value{DOCUMENT}, we refer to
Brian Kernighan's version of @command{awk} as ``BWK @command{awk}.''
(@xref{Other Versions} for information on his and other versions.)

@ifset FOR_PRINT
@quotation NOTE
Notes of interest look like this.
@end quotation

@quotation CAUTION
Cautionary or warning notes look like this.
@end quotation
@end ifset

@c fakenode --- for prepinfo
@unnumberedsubsec Dark Corners
@cindex Kernighan, Brian @subentry quotes
@quotation
@i{Dark corners are basically fractal---no matter how much
you illuminate, there's always a smaller but darker one.}
@author Brian Kernighan
@end quotation

@cindex d.c. @seeentry{dark corner}
@cindex dark corner
Until the POSIX standard (and @cite{@value{TITLE}}),
many features of @command{awk} were either poorly documented or not
documented at all. Descriptions of such features
(often called ``dark corners'') are noted in this @value{DOCUMENT} with
@iftex
the picture of a flashlight in the margin, as shown here.
@value{DARKCORNER}
@end iftex
@ifnottex
``(d.c.).''
@end ifnottex
@ifclear FOR_PRINT
They also appear in the index under the heading ``dark corner.''
@end ifclear

But, as noted by the opening quote, any coverage of dark
corners is by definition incomplete.

@cindex c.e. @seeentry{common extensions}
Extensions to the standard @command{awk} language that are supported by
more than one @command{awk} implementation are marked
@ifclear FOR_PRINT
``@value{COMMONEXT},'' and listed in the index under ``common extensions''
and ``extensions, common.''
@end ifclear
@ifset FOR_PRINT
``@value{COMMONEXT}'' for ``common extension.''
@end ifset

@node Manual History
@unnumberedsec The GNU Project and This Book

@cindex FSF (Free Software Foundation)
@cindex Free Software Foundation (FSF)
@cindex Stallman, Richard
The Free Software Foundation (FSF) is a nonprofit organization dedicated
to the production and distribution of freely distributable software.
It was founded by Richard M.@: Stallman, the author of the original
Emacs editor. GNU Emacs is the most widely used version of Emacs today.

@cindex GNU Project
@cindex GPL (General Public License)
@cindex GNU General Public License @seeentry{GPL}
@cindex General Public License @seeentry{GPL}
@cindex documentation @subentry online
The GNU@footnote{GNU stands for ``GNU's Not Unix.''}
Project is an ongoing effort on the part of the Free Software
Foundation to create a complete, freely distributable, POSIX-compliant
computing environment.
The FSF uses the GNU General Public License (GPL) to ensure that
its software's
source code is always available to the end user.
@ifclear FOR_PRINT
A copy of the GPL is included
@ifnotinfo
in this @value{DOCUMENT}
@end ifnotinfo
for your reference
(@pxref{Copying}).
@end ifclear
The GPL applies to the C language source code for @command{gawk}.
To find out more about the FSF and the GNU Project online,
see @uref{https://www.gnu.org, the GNU Project's home page}.
This @value{DOCUMENT} may also be read from
@uref{https://www.gnu.org/software/gawk/manual/, GNU's website}.

@ifclear FOR_PRINT
A shell, an editor (Emacs), highly portable optimizing C, C++, and
Objective-C compilers, a symbolic debugger and dozens of large and
small utilities (such as @command{gawk}), have all been completed and are
freely available. The GNU operating
system kernel (the HURD), has been released but remains in an early
stage of development.

@cindex Linux @seeentry{GNU/Linux}
@cindex GNU/Linux
@cindex operating systems @subentry BSD-based
Until the GNU operating system is more fully developed, you should
consider using GNU/Linux, a freely distributable, Unix-like operating
system for Intel,
Power Architecture,
Sun SPARC, IBM S/390, and other
systems.@footnote{The terminology ``GNU/Linux'' is explained
in the @ref{Glossary}.}
Many GNU/Linux distributions are
available for download from the Internet.
@end ifclear

@ifnotinfo
The @value{DOCUMENT} you are reading is actually free---at least, the
information in it is free to anyone. The machine-readable
source code for the @value{DOCUMENT} comes with @command{gawk}.
@ifclear FOR_PRINT
(Take a moment to check the Free Documentation
License in @ref{GNU Free Documentation License}.)
@end ifclear
@end ifnotinfo

@cindex Close, Diane
The @value{DOCUMENT} itself has gone through multiple previous editions.
Paul Rubin wrote the very first draft of @cite{The GAWK Manual};
it was around 40 pages long.
Diane Close and Richard Stallman improved it, yielding a
version that was
around 90 pages and barely described the original, ``old''
version of @command{awk}.

I started working with that version in the fall of 1988.
As work on it progressed,
the FSF published several preliminary versions (numbered 0.@var{x}).
In 1996, edition 1.0 was released with @command{gawk} 3.0.0.
The FSF published the first two editions under
the title @cite{The GNU Awk User's Guide}.
@ifset FOR_PRINT
SSC published two editions of the @value{DOCUMENT} under the
title @cite{Effective awk Programming}, and O'Reilly published
the third edition in 2001.
@end ifset

This edition maintains the basic structure of the previous editions.
For FSF edition 4.0, the content was thoroughly reviewed and updated. All
references to @command{gawk} versions prior to 4.0 were removed.
Of significant note for that edition was the addition of @ref{Debugger}.

For FSF edition
@ifclear FOR_PRINT
5.0,
@end ifclear
@ifset FOR_PRINT
@value{EDITION}
(the fourth edition as published by O'Reilly),
@end ifset
the content has been reorganized into parts,
and the major new additions are @ref{Arbitrary Precision Arithmetic},
and @ref{Dynamic Extensions}.

This @value{DOCUMENT} will undoubtedly continue to evolve. If you
find an error in the @value{DOCUMENT}, please report it! @xref{Bugs}
for information on submitting problem reports electronically.

@ifset FOR_PRINT
@c fakenode --- for prepinfo
@unnumberedsec How to Stay Current

You may have a newer version of @command{gawk} than the
one described here. To find out what has changed,
you should first look at the @file{NEWS} file in the @command{gawk}
distribution, which provides a high-level summary of the changes in
each release.

You can then look at the @uref{https://www.gnu.org/software/gawk/manual/,
online version} of this @value{DOCUMENT} to read about any new features.
@end ifset

@ifclear FOR_PRINT
@node How To Contribute
@unnumberedsec How to Contribute

As the maintainer of GNU @command{awk}, I once thought that I would be
able to manage a collection of publicly available @command{awk} programs
and I even solicited contributions. Making things available on the Internet
helps keep the @command{gawk} distribution down to manageable size.

The initial collection of material, such as it is, is still available
at @uref{ftp://ftp.freefriends.org/arnold/Awkstuff}.

In the hopes of doing something more broad, I acquired the
@code{awklang.org} domain. Late in 2017, a volunteer took on the task
of managing it.

If you have written an interesting @command{awk} program that
you would like to share with the rest of the world, please see
@uref{http://www.awklang.org} and use the ``Contact'' link.

If you have written a @command{gawk} extension, please see
@ref{gawkextlib}.
@end ifclear

@node Acknowledgments
@unnumberedsec Acknowledgments

The initial draft of @cite{The GAWK Manual} had the following acknowledgments:

@quotation
Many people need to be thanked for their assistance in producing this
manual. Jay Fenlason contributed many ideas and sample programs. Richard
Mlynarik and Robert Chassell gave helpful comments on drafts of this
manual. The paper @cite{A Supplemental Document for AWK} by John W.@:
Pierce of the Chemistry Department at UC San Diego, pinpointed several
issues relevant both to @command{awk} implementation and to this manual, that
would otherwise have escaped us.
@end quotation

@cindex Stallman, Richard
I would like to acknowledge Richard M.@: Stallman, for his vision of a
better world and for his courage in founding the FSF and starting the
GNU Project.

@ifclear FOR_PRINT
Earlier editions of this @value{DOCUMENT} had the following acknowledgments:
@end ifclear
@ifset FOR_PRINT
The previous edition of this @value{DOCUMENT} had
the following acknowledgments:
@end ifset

@quotation
The following people (in alphabetical order)
provided helpful comments on various
versions of this book:
Rick Adams,
Dr.@: Nelson H.F. Beebe,
Karl Berry,
Dr.@: Michael Brennan,
Rich Burridge,
Claire Cloutier,
Diane Close,
Scott Deifik,
Christopher (``Topher'') Eliot,
Jeffrey Friedl,
Dr.@: Darrel Hankerson,
Michal Jaegermann,
Dr.@: Richard J.@: LeBlanc,
Michael Lijewski,
Pat Rankin,
Miriam Robbins,
Mary Sheehan,
and
Chuck Toporek.

@cindex Berry, Karl
@cindex Chassell, Robert J.@:
@c @cindex Texinfo
Robert J.@: Chassell provided much valuable advice on
the use of Texinfo.
He also deserves special thanks for
convincing me @emph{not} to title this @value{DOCUMENT}
@cite{How to Gawk Politely}.
Karl Berry helped significantly with the @TeX{} part of Texinfo.

@cindex Hartholz @subentry Marshall
@cindex Hartholz @subentry Elaine
@cindex Schreiber @subentry Bert
@cindex Schreiber @subentry Rita
I would like to thank Marshall and Elaine Hartholz of Seattle and
Dr.@: Bert and Rita Schreiber of Detroit for large amounts of quiet vacation
time in their homes, which allowed me to make significant progress on
this @value{DOCUMENT} and on @command{gawk} itself.

@cindex Hughes, Phil
Phil Hughes of SSC
contributed in a very important way by loaning me his laptop GNU/Linux
system, not once, but twice, which allowed me to do a lot of work while
away from home.

@cindex Trueman, David
David Trueman deserves special credit; he has done a yeoman job
of evolving @command{gawk} so that it performs well and without bugs.
Although he is no longer involved with @command{gawk},
working with him on this project was a significant pleasure.

@cindex Drepper, Ulrich
@cindex GNITS mailing list
@cindex mailing list, GNITS
The intrepid members of the GNITS mailing list, and most notably Ulrich
Drepper, provided invaluable help and feedback for the design of the
internationalization features.

Chuck Toporek, Mary Sheehan, and Claire Cloutier of O'Reilly & Associates contributed
significant editorial help for this @value{DOCUMENT} for the
3.1 release of @command{gawk}.
@end quotation

@cindex Beebe, Nelson H.F.@:
@cindex Buening, Andreas
@cindex Collado, Manuel
@cindex Colombo, Antonio
@cindex Davies, Stephen
@cindex Deifik, Scott
@cindex Demaille, Akim
@cindex G., Daniel Richard
@cindex Guerrero, Juan Manuel
@cindex Hankerson, Darrel
@cindex Jaegermann, Michal
@cindex Kahrs, J@"urgen
@cindex Kasal, Stepan
@cindex Malmberg, John
@cindex Ramey, Chet
@cindex Rankin, Pat
@cindex Schorr, Andrew
@cindex Vinschen, Corinna
@cindex Zaretskii, Eli

Dr.@: Nelson Beebe,
Andreas Buening,
Dr.@: Manuel Collado,
Antonio Colombo,
Stephen Davies,
Scott Deifik,
Akim Demaille,
Daniel Richard G.,
Juan Manuel Guerrero,
Darrel Hankerson,
Michal Jaegermann,
J@"urgen Kahrs,
Stepan Kasal,
John Malmberg,
Chet Ramey,
Pat Rankin,
Andrew Schorr,
Corinna Vinschen,
and Eli Zaretskii
(in alphabetical order)
make up
the current @command{gawk} ``crack portability team.'' Without
their hard work and help, @command{gawk} would not be nearly the robust,
portable program it is today. It has been and continues to be a pleasure
working with this team of fine people.

Notable code and documentation contributions were made by
a number of people. @xref{Contributors} for the full list.

@ifset FOR_PRINT
@cindex Oram, Andy
Thanks to Andy Oram of O'Reilly Media for initiating
the fourth edition and for his support during the work.
Thanks to Jasmine Kwityn for her copyediting work.
@end ifset

Thanks to Michael Brennan for the Forewords.

@cindex Dumas, Patrice
@cindex Berry, Karl
@cindex Smith, Gavin
Thanks to Patrice Dumas for the new @command{makeinfo} program.
Thanks to Karl Berry for his past work on Texinfo, and
to Gavin Smith, who continues to work to improve
the Texinfo markup language.

@cindex Kernighan, Brian
@cindex Brennan, Michael
@cindex Day, Robert P.J.@:
Robert P.J.@: Day, Michael Brennan, and Brian Kernighan kindly acted as
reviewers for the 2015 edition of this @value{DOCUMENT}. Their feedback
helped improve the final work.

I would also like to thank Brian Kernighan for his invaluable assistance during the
testing and debugging of @command{gawk}, and for his ongoing
help and advice in clarifying numerous points about the language.
We could not have done nearly as good a job on either @command{gawk}
or its documentation without his help.

Brian is in a class by himself as a programmer and technical
author. I have to thank him (yet again) for his ongoing friendship
and for being a role model to me for over 30 years!
Having him as a reviewer is an exciting privilege.
It has also
been extremely humbling@enddots{}

@cindex Robbins @subentry Miriam
@cindex Robbins @subentry Jean
@cindex Robbins @subentry Harry
@cindex G-d
I must thank my wonderful wife, Miriam, for her patience through
the many versions of this project, for her proofreading,
and for sharing me with the computer.
I would like to thank my parents for their love, and for the grace with
which they raised and educated me.
Finally, I also must acknowledge my gratitude to G-d, for the many opportunities
He has sent my way, as well as for the gifts He has given me with which to
take advantage of those opportunities.
@ifnotdocbook
@sp 2
@noindent
Arnold Robbins @*
Nof Ayalon @*
Israel @*
March, 2020
@end ifnotdocbook

@ifnotinfo
@part @value{PART1}The @command{awk} Language
@end ifnotinfo

@ifdocbook

Part I describes the @command{awk} language and @command{gawk} program
in detail. It starts with the basics, and continues through all of
the features of @command{awk}. Included also are many, but not all,
of the features of @command{gawk}.
This part contains the
following chapters:

@itemize @value{BULLET}
@item
@ref{Getting Started}

@item
@ref{Invoking Gawk}

@item
@ref{Regexp}

@item
@ref{Reading Files}

@item
@ref{Printing}

@item
@ref{Expressions}

@item
@ref{Patterns and Actions}

@item
@ref{Arrays}

@item
@ref{Functions}
@end itemize
@end ifdocbook

@node Getting Started
@chapter Getting Started with @command{awk}
@c @cindex script, definition of
@c @cindex rule, definition of
@c @cindex program, definition of
@c @cindex basic function of @command{awk}
@cindex @command{awk} @subentry function of

The basic function of @command{awk} is to search files for lines (or other
units of text) that contain certain patterns. When a line matches one
of the patterns, @command{awk} performs specified actions on that line.
@command{awk} continues to process input lines in this way until it reaches
the end of the input files.

@cindex @command{awk} @subentry uses for
@cindex programming languages @subentry data-driven vs.@: procedural
@cindex @command{awk} programs
Programs in @command{awk} are different from programs in most other languages,
because @command{awk} programs are @dfn{data driven} (i.e., you describe
the data you want to work with and then what to do when you find it).
Most other languages are @dfn{procedural}; you have to describe, in great
detail, every step the program should take. When working with procedural
languages, it is usually much
harder to clearly describe the data your program will process.
For this reason, @command{awk} programs are often refreshingly easy to
read and write.
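
As a rough sketch of the data-driven style, the following one-line
program prints every input line that contains the string @samp{error}.
The @value{FN} @file{logfile} is hypothetical, chosen only for illustration;
the notation itself is explained in the rest of this @value{CHAPTER}:

@example
awk '/error/ @{ print @}' logfile
@end example

@noindent
The program states which data is of interest (lines matching @samp{error})
and what to do with it (print it). Opening the file and looping over its
lines are left to @command{awk}.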

@cindex program, definition of
@cindex rule, definition of
When you run @command{awk}, you specify an @command{awk} @dfn{program} that
tells @command{awk} what to do. The program consists of a series of
@dfn{rules} (it may also contain @dfn{function definitions},
an advanced feature that we will ignore for now;
@pxref{User-defined}). Each rule specifies one
pattern to search for and one action to perform
upon finding the pattern.

Syntactically, a rule consists of a @dfn{pattern} followed by an
@dfn{action}. The action is enclosed in braces to separate it from the
pattern. Newlines usually separate rules. Therefore, an @command{awk}
program looks like this:

@example
@var{pattern} @{ @var{action} @}
@var{pattern} @{ @var{action} @}
@dots{}
@end example

@menu
* Running gawk::                How to run @command{gawk} programs; includes
                                command-line syntax.
* Sample Data Files::           Sample data files for use in the @command{awk}
                                programs illustrated in this @value{DOCUMENT}.
* Very Simple::                 A very simple example.
* Two Rules::                   A less simple one-line example using two
                                rules.
* More Complex::                A more complex example.
* Statements/Lines::            Subdividing or combining statements into
                                lines.
* Other Features::              Other Features of @command{awk}.
* When::                        When to use @command{gawk} and when to use
                                other things.
* Intro Summary::               Summary of the introduction.
@end menu

@node Running gawk
@section How to Run @command{awk} Programs

@cindex @command{awk} programs @subentry running
There are several ways to run an @command{awk} program.
If the program is
short, it is easiest to include it in the command that runs @command{awk},
like this:

@example
awk '@var{program}' @var{input-file1} @var{input-file2} @dots{}
@end example

@cindex command line @subentry formats
When the program is long, it is usually more convenient to put it in a file
and run it with a command like this:

@example
awk -f @var{program-file} @var{input-file1} @var{input-file2} @dots{}
@end example

This @value{SECTION} discusses both mechanisms, along with several
variations of each.

@menu
* One-shot::                    Running a short throwaway @command{awk}
                                program.
* Read Terminal::               Using no input files (input from the keyboard
                                instead).
* Long::                        Putting permanent @command{awk} programs in
                                files.
* Executable Scripts::          Making self-contained @command{awk} programs.
* Comments::                    Adding documentation to @command{gawk}
                                programs.
* Quoting::                     More discussion of shell quoting issues.
@end menu

@node One-shot
@subsection One-Shot Throwaway @command{awk} Programs

Once you are familiar with @command{awk}, you will often type in simple
programs the moment you want to use them. Then you can write the
program as the first argument of the @command{awk} command, like this:

@example
awk '@var{program}' @var{input-file1} @var{input-file2} @dots{}
@end example

@noindent
where @var{program} consists of a series of patterns and
actions, as described earlier.

@cindex single quote (@code{'})
@cindex @code{'} (single quote)
This command format instructs the @dfn{shell}, or command interpreter,
to start @command{awk} and use the @var{program} to process records in the
input file(s). There are single quotes around @var{program} so
the shell won't interpret any @command{awk} characters as special shell
characters.
The quotes also cause the shell to treat all of @var{program} as
a single argument for @command{awk}, and allow @var{program} to be more
than one line long.

@cindex shells @subentry scripts
@cindex @command{awk} programs @subentry running @subentry from shell scripts
This format is also useful for running short or medium-sized @command{awk}
programs from shell scripts, because it avoids the need for a separate
file for the @command{awk} program. A self-contained shell script is more
reliable because there are no other files to misplace.

Later in this chapter, in
@ifdocbook
the @value{SECTION}
@end ifdocbook
@ref{Very Simple},
we'll see examples of several short,
self-contained programs.

@node Read Terminal
@subsection Running @command{awk} Without Input Files

@cindex standard input
@cindex input @subentry standard
@cindex input files @subentry running @command{awk} without
You can also run @command{awk} without any input files. If you type the
following command line:

@example
awk '@var{program}'
@end example

@noindent
@command{awk} applies the @var{program} to the @dfn{standard input},
which usually means whatever you type on the keyboard. This continues
until you indicate end-of-file by typing @kbd{Ctrl-d}.
(On non-POSIX operating systems, the end-of-file character may be different.)

@cindex files @subentry input @seeentry{input files}
@cindex input files @subentry running @command{awk} without
@cindex @command{awk} programs @subentry running @subentry without input files
As an example, the following program prints a friendly piece of advice
(from Douglas Adams's @cite{The Hitchhiker's Guide to the Galaxy}),
to keep you from worrying about the complexities of computer
programming:

@example
$ @kbd{awk 'BEGIN @{ print "Don\47t Panic!" @}'}
@print{} Don't Panic!
@end example

@command{awk} executes statements associated with @code{BEGIN} before
reading any input. If there are no other statements in your program,
as is the case here, @command{awk} just stops, instead of trying to read
input it doesn't know how to process.
The @samp{\47} is a magic way (explained later) of getting a single quote into
the program, without having to engage in ugly shell quoting tricks.

@quotation NOTE
If you use Bash as your shell, you should execute the
command @samp{set +H} before running this program interactively, to
disable the C shell-style command history, which treats @samp{!} as a
special character. We recommend putting this command into your personal
startup file.
@end quotation

This next simple @command{awk} program
emulates the @command{cat} utility; it copies whatever you type on the
keyboard to its standard output (why this works is explained shortly):

@example
$ @kbd{awk '@{ print @}'}
@kbd{Now is the time for all good men}
@print{} Now is the time for all good men
@kbd{to come to the aid of their country.}
@print{} to come to the aid of their country.
@kbd{Four score and seven years ago, ...}
@print{} Four score and seven years ago, ...
@kbd{What, me worry?}
@print{} What, me worry?
@kbd{Ctrl-d}
@end example

@node Long
@subsection Running Long Programs

@cindex @command{awk} programs @subentry running
@cindex @command{awk} programs @subentry lengthy
@cindex files @subentry @command{awk} programs in
Sometimes @command{awk} programs are very long. In these cases, it is
more convenient to put the program into a separate file.
In order to tell
@command{awk} to use that file for its program, you type:

@example
awk -f @var{source-file} @var{input-file1} @var{input-file2} @dots{}
@end example

@cindex @option{-f} option
@cindex command line @subentry option @option{-f}
The @option{-f} instructs the @command{awk} utility to get the
@command{awk} program from the file @var{source-file} (@pxref{Options}).
Any @value{FN} can be used for @var{source-file}. For example, you
could put the program:

@example
BEGIN @{ print "Don't Panic!" @}
@end example

@noindent
into the file @file{advice}. Then this command:

@example
awk -f advice
@end example

@noindent
does the same thing as this one:

@example
awk 'BEGIN @{ print "Don\47t Panic!" @}'
@end example

@cindex quoting @subentry in @command{gawk} command lines
@noindent
This was explained earlier
(@pxref{Read Terminal}).
Note that you don't usually need single quotes around the @value{FN} that you
specify with @option{-f}, because most @value{FN}s don't contain any of the shell's
special characters. Notice that in @file{advice}, the @command{awk}
program did not have single quotes around it. The quotes are only needed
for programs that are provided on the @command{awk} command line.
(Also, placing the program in a file allows us to use a literal single quote in the program
text, instead of the magic @samp{\47}.)

@cindex single quote (@code{'}) @subentry in @command{gawk} command lines
@cindex @code{'} (single quote) @subentry in @command{gawk} command lines
If you want to clearly identify an @command{awk} program file as such,
you can add the extension @file{.awk} to the @value{FN}. This doesn't
affect the execution of the @command{awk} program but it does make
``housekeeping'' easier.
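
For instance, assuming the @file{advice} file created earlier in this
@value{SECTION} is in the current directory, a minimal sketch of renaming
it and running it under its new name might look like this:

@example
$ @kbd{mv advice advice.awk}
$ @kbd{awk -f advice.awk}
@print{} Don't Panic!
@end example

@noindent
The program behaves exactly as before; only the @value{FN} passed to
@option{-f} changes.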

@node Executable Scripts
@subsection Executable @command{awk} Programs
@cindex @command{awk} programs
@cindex @code{#} (number sign) @subentry @code{#!} (executable scripts)
@cindex Unix @subentry @command{awk} scripts and
@cindex number sign (@code{#}) @subentry @code{#!} (executable scripts)

Once you have learned @command{awk}, you may want to write self-contained
@command{awk} scripts, using the @samp{#!} script mechanism. You can do
this on many systems.@footnote{The @samp{#!} mechanism works on
GNU/Linux systems, BSD-based systems, and commercial Unix systems.}
For example, you could update the file @file{advice} to look like this:

@example
#! /bin/awk -f

BEGIN @{ print "Don't Panic!" @}
@end example

@noindent
After making this file executable (with the @command{chmod} utility),
simply type @samp{./advice}
at the shell and the system arranges to run @command{awk} as if you had
typed @samp{awk -f advice}:

@example
$ @kbd{chmod +x advice}
$ @kbd{./advice}
@print{} Don't Panic!
@end example

@noindent
Self-contained @command{awk} scripts are useful when you want to write a
program that users can invoke without their having to know that the program is
written in @command{awk}.

@sidebar Understanding @samp{#!}
@cindex portability @subentry @code{#!} (executable scripts)

@command{awk} is an @dfn{interpreted} language. This means that the
@command{awk} utility reads your program and then processes your data
according to the instructions in your program. (This is different
from a @dfn{compiled} language such as C, where your program is first
compiled into machine code that is executed directly by your system's
processor.) The @command{awk} utility is thus termed an @dfn{interpreter}.
Many modern languages are interpreted.

The line beginning with @samp{#!} lists the full @value{FN} of an
interpreter to run and a single optional initial command-line argument
to pass to that interpreter. The operating system then runs the
interpreter with the given argument and the full argument list of the
executed program. The first argument in the list is the full @value{FN}
of the @command{awk} program. The rest of the argument list contains
either options to @command{awk}, or @value{DF}s, or both. (Note that on
many systems @command{awk} is found in @file{/usr/bin} instead of
in @file{/bin}.)

Some systems limit the length of the interpreter name to 32 characters.
Often, this can be dealt with by using a symbolic link.

You should not put more than one argument on the @samp{#!}
line after the path to @command{awk}. It does not work. The operating system
treats the rest of the line as a single argument and passes it to @command{awk}.
Doing this leads to confusing behavior---most likely a usage diagnostic
of some sort from @command{awk}.

@cindex @code{ARGC}/@code{ARGV} variables @subentry portability and
@cindex portability @subentry @code{ARGV} variable
@cindex dark corner @subentry @code{ARGV} variable, value of
Finally, the value of @code{ARGV[0]}
(@pxref{Built-in Variables})
varies depending upon your operating system.
Some systems put @samp{awk} there, some put the full pathname
of @command{awk} (such as @file{/bin/awk}), and some put the name
of your script (@samp{advice}). @value{DARKCORNER}
Don't rely on the value of @code{ARGV[0]}
to provide your script name.
@end sidebar

@node Comments
@subsection Comments in @command{awk} Programs
@cindex @code{#} (number sign) @subentry commenting
@cindex number sign (@code{#}) @subentry commenting
@cindex commenting
@cindex @command{awk} programs @subentry documenting

A @dfn{comment} is some text that is included in a program for the sake
of human readers; it is not really an executable part of the program. Comments
can explain what the program does and how it works. Nearly all
programming languages have provisions for comments, as programs are
typically hard to understand without them.

In the @command{awk} language, a comment starts with the number sign
character (@samp{#}) and continues to the end of the line.
The @samp{#} does not have to be the first character on the line. The
@command{awk} language ignores the rest of a line following a number sign.
For example, we could have put the following into @file{advice}:

@example
# This program prints a nice, friendly message. It helps
# keep novice users from being afraid of the computer.
BEGIN @{ print "Don't Panic!" @}
@end example

You can put comment lines into keyboard-composed throwaway @command{awk}
programs, but this usually isn't very useful; the purpose of a
comment is to help you or another person understand the program
when reading it at a later time.

@cindex quoting @subentry for small awk programs
@cindex single quote (@code{'}) @subentry vs.@: apostrophe
@cindex @code{'} (single quote) @subentry vs.@: apostrophe
@quotation CAUTION
As mentioned in
@ref{One-shot},
you can enclose short to medium-sized programs in single quotes,
in order to keep
your shell scripts self-contained. When doing so, @emph{don't} put
an apostrophe (i.e., a single quote) into a comment (or anywhere else
in your program).
The shell interprets the quote as the closing
quote for the entire program. As a result, usually the shell
prints a message about mismatched quotes, and if @command{awk} actually
runs, it will probably print strange messages about syntax errors.
For example, look at the following:

@example
$ @kbd{awk '@{ print "hello" @} # let's be cute'}
>
@end example

The shell sees that the first two quotes match, and that
a new quoted object begins at the end of the command line.
It therefore prompts with the secondary prompt, waiting for more input.
With Unix @command{awk}, closing the quoted string produces this result:

@example
$ @kbd{awk '@{ print "hello" @} # let's be cute'}
> @kbd{'}
@error{} awk: can't open file be
@error{} source line number 1
@end example

@cindex @code{\} (backslash)
@cindex backslash (@code{\})
Putting a backslash before the single quote in @samp{let's} wouldn't help,
because backslashes are not special inside single quotes.
The next @value{SUBSECTION} describes the shell's quoting rules.
@end quotation

@node Quoting
@subsection Shell Quoting Issues
@cindex shell quoting, rules for

@menu
* DOS Quoting::                 Quoting in Windows Batch Files.
@end menu

For short to medium-length @command{awk} programs, it is most convenient
to enter the program on the @command{awk} command line.
This is best done by enclosing the entire program in single quotes.
This is true whether you are entering the program interactively at
the shell prompt, or writing it as part of a larger shell script:

@example
awk '@var{program text}' @var{input-file1} @var{input-file2} @dots{}
@end example

@cindex shells @subentry quoting @subentry rules for
@cindex Bourne shell, quoting rules for
Once you are working with the shell, it is helpful to have a basic
knowledge of shell quoting rules. The following rules apply only to
POSIX-compliant, Bourne-style shells (such as Bash, the GNU Bourne-Again
Shell). If you use the C shell, you're on your own.

Before diving into the rules, we introduce a concept that appears
throughout this @value{DOCUMENT}, which is that of the @dfn{null},
or empty, string.

The null string is character data that has no value.
In other words, it is empty. It is written in @command{awk} programs
like this: @code{""}. In the shell, it can be written using single
or double quotes: @code{""} or @code{''}. Although the null string has
no characters in it, it does exist. For example, consider this command:

@example
$ @kbd{echo ""}
@end example

@noindent
Here, the @command{echo} utility receives a single argument, even
though that argument has no characters in it. In the rest of this
@value{DOCUMENT}, we use the terms @dfn{null string} and @dfn{empty string}
interchangeably. Now, on to the quoting rules:

@itemize @value{BULLET}
@item
Quoted items can be concatenated with nonquoted items as well as with other
quoted items. The shell turns everything into one argument for
the command.

@item
Preceding any single character with a backslash (@samp{\}) quotes
that character. The shell removes the backslash and passes the quoted
character on to the command.

@item
@cindex @code{\} (backslash) @subentry in shell commands
@cindex backslash (@code{\}) @subentry in shell commands
@cindex single quote (@code{'}) @subentry in shell commands
@cindex @code{'} (single quote) @subentry in shell commands
Single quotes protect everything between the opening and closing quotes.
The shell does no interpretation of the quoted text, passing it on verbatim
to the command.
It is @emph{impossible} to embed a single quote inside single-quoted text.
Refer back to
@ref{Comments}
for an example of what happens if you try.

@item
@cindex double quote (@code{"}) @subentry in shell commands
@cindex @code{"} (double quote) @subentry in shell commands
Double quotes protect most things between the opening and closing quotes.
The shell does at least variable and command substitution on the quoted text.
Different shells may do additional kinds of processing on double-quoted text.

Because certain characters within double-quoted text are processed by the shell,
they must be @dfn{escaped} within the text. Of note are the characters
@samp{$}, @samp{`}, @samp{\}, and @samp{"}, all of which must be preceded by
a backslash within double-quoted text if they are to be passed on literally
to the program. (The leading backslash is stripped first.)
Thus, the example seen
@ifnotinfo
previously
@end ifnotinfo
in @ref{Read Terminal}:

@example
awk 'BEGIN @{ print "Don\47t Panic!" @}'
@end example

@noindent
could instead be written this way:

@example
$ @kbd{awk "BEGIN @{ print \"Don't Panic!\" @}"}
@print{} Don't Panic!
@end example

@cindex single quote (@code{'}) @subentry with double quotes
@cindex @code{'} (single quote) @subentry with double quotes
Note that the single quote is not special within double quotes.

@item
Null strings are removed when they occur as part of a non-null
command-line argument, while explicit null objects are kept.
For example, to specify that the field separator @code{FS} should
be set to the null string, use:

@example
awk -F "" '@var{program}' @var{files} # correct
@end example

@noindent
@cindex null strings @subentry in @command{gawk} arguments, quoting and
Don't use this:

@example
awk -F"" '@var{program}' @var{files} # wrong!
@end example

@noindent
In the second case, @command{awk} attempts to use the text of the program
as the value of @code{FS}, and the first @value{FN} as the text of the program!
This results in syntax errors at best, and confusing behavior at worst.
@end itemize

@cindex quoting @subentry in @command{gawk} command lines @subentry tricks for
Mixing single and double quotes is difficult. You have to resort
to shell quoting tricks, like this:

@example
$ @kbd{awk 'BEGIN @{ print "Here is a single quote <'"'"'>" @}'}
@print{} Here is a single quote <'>
@end example

@noindent
This program consists of three concatenated quoted strings. The first and the
third are single-quoted, and the second is double-quoted.

This can be ``simplified'' to:

@example
$ @kbd{awk 'BEGIN @{ print "Here is a single quote <'\''>" @}'}
@print{} Here is a single quote <'>
@end example

@noindent
Judge for yourself which of these two is the more readable.
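
One way to convince yourself that each of these forms reaches
@command{awk} as a single argument is to hand the same quoted text to
@command{printf}, which reuses its format once for every argument it
receives. This is only a sketch; the variable names are arbitrary, and a
shortened program text stands in for the one used above:

```shell
# printf repeats its format for each argument, so exactly one pair of
# brackets in the result means the shell built exactly one argument.

# Single-quoted text, a double-quoted single quote, single-quoted text.
arg_dq=$(printf '[%s]' 'BEGIN { print "<'"'"'>" }')

# Single-quoted text, a backslash-escaped single quote, single-quoted text.
arg_bs=$(printf '[%s]' 'BEGIN { print "<'\''>" }')

echo "$arg_dq"
echo "$arg_bs"
```

Both lines of output should be identical, showing that the two tricks
differ only in how they look on the command line.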

Another option is to use double quotes, escaping the embedded, @command{awk}-level
double quotes:

@example
$ @kbd{awk "BEGIN @{ print \"Here is a single quote <'>\" @}"}
@print{} Here is a single quote <'>
@end example

@noindent
This option is also painful, because double quotes, backslashes, and dollar signs
are very common in more advanced @command{awk} programs.

A third option is to use the octal escape sequence equivalents
(@pxref{Escape Sequences})
for the
single- and double-quote characters, like so:

@example
@group
$ @kbd{awk 'BEGIN @{ print "Here is a single quote <\47>" @}'}
@print{} Here is a single quote <'>
$ @kbd{awk 'BEGIN @{ print "Here is a double quote <\42>" @}'}
@print{} Here is a double quote <">
@end group
@end example

@noindent
This works nicely, but you should comment clearly what the
escape sequences mean.

A fourth option is to use command-line variable assignment, like this:

@example
$ @kbd{awk -v sq="'" 'BEGIN @{ print "Here is a single quote <" sq ">" @}'}
@print{} Here is a single quote <'>
@end example

(Here, the two string constants and the value of @code{sq} are concatenated
into a single string that is printed by @code{print}.)

If you really need both single and double quotes in your @command{awk}
program, it is probably best to move it into a separate file, where
the shell won't be part of the picture and you can say what you mean.

@node DOS Quoting
@subsubsection Quoting in MS-Windows Batch Files

@ignore
Date: Wed, 21 May 2008 09:58:43 +0200 (CEST)
From: jeroen.brink@inter.NL.net
Subject: (g)awk "contribution"
To: arnold@skeeve.com
Message-id: <42220.193.172.132.34.1211356723.squirrel@webmail.internl.net>

Hello Arnold,

maybe you can help me out.
Found your email on the GNU/awk online manual
pages.

I've searched hard to figure out how, on Windows, to print double quotes.
Couldn't find it in the Quotes area, nor on google or elsewhere. Finally i
figured out how to do this myself.

How to print all lines in a file surrounded by double quotes (on Windows):

gawk "{ print \"\042\" $0 \"\042\" }" <file>

Maybe this is a helpfull tip for other (Windows) gawk users. However, i
don't have a clue as to where to "publish" this tip! Do you?

Kind regards,

Jeroen Brink
@end ignore

Although this @value{DOCUMENT} generally only worries about POSIX systems and the
POSIX shell, the following issue arises often enough for many users that
it is worth addressing.

@cindex Brink, Jeroen
The ``shells'' on Microsoft Windows systems use the double-quote
character for quoting, and make it difficult or impossible to include an
escaped double-quote character in a command-line script. The following
example, courtesy of Jeroen Brink, shows how to escape the double quotes
from this one-liner script that prints all lines in a file surrounded by
double quotes:

@example
@{ print "\"" $0 "\"" @}
@end example

@noindent
On an MS-Windows command line, the one-liner script above may be passed as
follows:

@example
gawk "@{ print \"\042\" $0 \"\042\" @}" @var{file}
@end example

In this example the @samp{\042} is the octal code for a double-quote;
@command{gawk} converts it into a real double-quote for output by
the @code{print} statement.

In MS-Windows escaping double-quotes is a little tricky because you use
backslashes to escape double-quotes, but backslashes themselves are not
escaped in the usual way; indeed they are either duplicated or not,
depending upon whether there is a subsequent double-quote.
The MS-Windows
rule for double-quoting a string is the following:

@enumerate
@item
For each double quote in the original string, let @var{N} be the number
of backslashes immediately before it (@var{N} might be zero). Replace these @var{N}
backslashes with @math{2@value{TIMES}@var{N}+1} backslashes.

@item
Let @var{N} be the number of backslashes trailing the original string
(@var{N} might be zero). Replace these @var{N} backslashes with
@math{2@value{TIMES}@var{N}} backslashes.

@item
Surround the resulting string by double-quotes.
@end enumerate

So to double-quote the one-liner script @samp{@{ print "\"" $0 "\"" @}}
from the previous example you would do it this way:

@example
gawk "@{ print \"\\\"\" $0 \"\\\"\" @}" @var{file}
@end example

@noindent
However, the use of @samp{\042} instead of @samp{\\\"} is also possible
and easier to read, because backslashes that are not followed by a
double-quote don't need duplication.

@node Sample Data Files
@section @value{DDF}s for the Examples

@cindex input files @subentry examples
@cindex @code{mail-list} file
Many of the examples in this @value{DOCUMENT} take their input from two sample
@value{DF}s. The first, @file{mail-list}, represents a list of people's names
together with their email addresses and information about those people.
The second @value{DF}, called @file{inventory-shipped}, contains
information about monthly shipments. In both files,
each line is considered to be one @dfn{record}.

In @file{mail-list}, each record contains the name of a person,
his/her phone number, his/her email address, and a code for his/her relationship
with the author of the list.
The columns are aligned using spaces.
An @samp{A} in the last column
means that the person is an acquaintance.
An @samp{F} in the last
column means that the person is a friend.
An @samp{R} means that the person is a relative:

@example
@c system if test ! -d eg ; then mkdir eg ; fi
@c system if test ! -d eg/lib ; then mkdir eg/lib ; fi
@c system if test ! -d eg/data ; then mkdir eg/data ; fi
@c system if test ! -d eg/prog ; then mkdir eg/prog ; fi
@c system if test ! -d eg/misc ; then mkdir eg/misc ; fi
@c file eg/data/mail-list
Amelia       555-5553     amelia.zodiacusque@@gmail.com    F
Anthony      555-3412     anthony.asserturo@@hotmail.com   A
Becky        555-7685     becky.algebrarum@@gmail.com      A
Bill         555-1675     bill.drowning@@hotmail.com       A
Broderick    555-0542     broderick.aliquotiens@@yahoo.com R
Camilla      555-2912     camilla.infusarum@@skynet.be     R
Fabius       555-1234     fabius.undevicesimus@@ucb.edu    F
Julie        555-6699     julie.perscrutabor@@skeeve.com   F
Martin       555-6480     martin.codicibus@@hotmail.com    A
Samuel       555-3430     samuel.lanceolis@@shu.edu        A
Jean-Paul    555-2127     jeanpaul.campanorum@@nyu.edu     R
@c endfile
@end example

@cindex @code{inventory-shipped} file
The @value{DF} @file{inventory-shipped} represents
information about shipments during the year.
Each record contains the month, the number
of green crates shipped, the number of red boxes shipped, the number of
orange bags shipped, and the number of blue packages shipped,
respectively. There are 16 entries, covering the 12 months of last year
and the first four months of the current year.
An empty line separates the data for the two years:

@example
@c file eg/data/inventory-shipped
Jan  13  25  15 115
Feb  15  32  24 226
Mar  15  24  34 228
Apr  31  52  63 420
May  16  34  29 208
Jun  31  42  75 492
Jul  24  34  67 436
Aug  15  34  47 316
Sep  13  55  37 277
Oct  29  54  68 525
Nov  20  87  82 577
Dec  17  35  61 401

Jan  21  36  64 620
Feb  26  58  80 652
Mar  24  75  70 495
Apr  21  70  74 514
@c endfile
@end example

The sample files are included in the @command{gawk} distribution,
in the directory @file{awklib/eg/data}.

@node Very Simple
@section Some Simple Examples

The following command runs a simple @command{awk} program that searches the
input file @file{mail-list} for the character string @samp{li} (a
grouping of characters is usually called a @dfn{string};
the term @dfn{string} is based on similar usage in English, such
as ``a string of pearls'' or ``a string of cars in a train''):

@example
awk '/li/ @{ print $0 @}' mail-list
@end example

@noindent
When lines containing @samp{li} are found, they are printed because
@w{@samp{print $0}} means print the current line. (Just @samp{print} by
itself means the same thing, so we could have written that
instead.)

You will notice that slashes (@samp{/}) surround the string @samp{li}
in the @command{awk} program. The slashes indicate that @samp{li}
is the pattern to search for. This type of pattern is called a
@dfn{regular expression}, which is covered in more detail later
(@pxref{Regexp}).
The pattern is allowed to match parts of words.
There are
single quotes around the @command{awk} program so that the shell won't
interpret any of it as special shell characters.

Here is what this program prints:

@example
$ @kbd{awk '/li/ @{ print $0 @}' mail-list}
@print{} Amelia       555-5553     amelia.zodiacusque@@gmail.com    F
@print{} Broderick    555-0542     broderick.aliquotiens@@yahoo.com R
@print{} Julie        555-6699     julie.perscrutabor@@skeeve.com   F
@print{} Samuel       555-3430     samuel.lanceolis@@shu.edu        A
@end example

@cindex actions @subentry default
@cindex patterns @subentry default
In an @command{awk} rule, either the pattern or the action can be omitted,
but not both. If the pattern is omitted, then the action is performed
for @emph{every} input line. If the action is omitted, the default
action is to print all lines that match the pattern.

@cindex actions @subentry empty
Thus, we could leave out the action (the @code{print} statement and the
braces) in the previous example and the result would be the same:
@command{awk} prints all lines matching the pattern @samp{li}. By comparison,
omitting the @code{print} statement but retaining the braces makes an
empty action that does nothing (i.e., no lines are printed).

@cindex @command{awk} programs @subentry one-line examples
Many practical @command{awk} programs are just a line or two long. Following is a
collection of useful, short programs to get you started. Some of these
programs contain constructs that haven't been covered yet. (The description
of the program will give you a good idea of what is going on, but you'll
need to read the rest of the @value{DOCUMENT} to become an @command{awk} expert!)
Most of the examples use a @value{DF} named @file{data}. This is just a
placeholder; if you use these programs yourself, substitute
your own @value{FN}s for @file{data}.

@cindex @command{ls} utility
Some of the following examples use the output of @w{@samp{ls -l}} as input.
@command{ls} is a system command that gives you a listing of the files in a
directory. With the @option{-l} option, this listing includes each file's
size and the date the file was last modified. Its output looks like this:

@example
-rw-r--r--  1 arnold   user   1933 Nov  7 13:05 Makefile
-rw-r--r--  1 arnold   user  10809 Nov  7 13:03 awk.h
-rw-r--r--  1 arnold   user    983 Apr 13 12:14 awk.tab.h
-rw-r--r--  1 arnold   user  31869 Jun 15 12:20 awkgram.y
-rw-r--r--  1 arnold   user  22414 Nov  7 13:03 awk1.c
-rw-r--r--  1 arnold   user  37455 Nov  7 13:03 awk2.c
-rw-r--r--  1 arnold   user  27511 Dec  9 13:07 awk3.c
-rw-r--r--  1 arnold   user   7989 Nov  7 13:03 awk4.c
@end example

@noindent
The first field contains read-write permissions, the second field contains
the number of links to the file, and the third field identifies the
file's owner. The fourth field identifies the file's group. The fifth
field contains the file's size in bytes. The sixth, seventh, and eighth
fields contain the month, day, and time, respectively, that the file
was last modified. Finally, the ninth field contains the @value{FN}.

For future reference, note that there is often more than
one way to do things in @command{awk}. At some point, you may want
to look back at these examples and see if
you can come up with different ways to do the same things shown here:

@itemize @value{BULLET}
@item
Print every line that is longer than 80 characters:

@example
awk 'length($0) > 80' data
@end example

The sole rule has a relational expression as its pattern and has no
action---so it uses the default action, printing the record.

@item
Print the length of the longest input line:

@example
@group
awk '@{ if (length($0) > max) max = length($0) @}
     END @{ print max @}' data
@end group
@end example

The code associated with @code{END} executes after all
input has been read; it's the other side of the coin to @code{BEGIN}.

@cindex @command{expand} utility
@item
Print the length of the longest line in @file{data}:

@example
expand data | awk '@{ if (x < length($0)) x = length($0) @}
                   END @{ print "maximum line length is " x @}'
@end example

This example differs slightly from the previous one:
the input is processed by the @command{expand} utility to change TABs
into spaces, so the widths compared are actually the right-margin columns,
as opposed to the number of input characters on each line.

@item
Print every line that has at least one field:

@example
awk 'NF > 0' data
@end example

This is an easy way to delete blank lines from a file (or rather, to
create a new file similar to the old file but from which the blank lines
have been removed).

@item
Print seven random numbers from 0 to 100, inclusive:

@example
awk 'BEGIN @{ for (i = 1; i <= 7; i++)
                 print int(101 * rand()) @}'
@end example

@item
Print the total number of bytes used by @var{files}:

@example
ls -l @var{files} | awk '@{ x += $5 @}
                         END @{ print "total bytes: " x @}'
@end example

@item
Print the total number of kilobytes used by @var{files}:

@c Don't use \ continuation, not discussed yet
@c Remember that awk does floating point division,
@c no need for (x+1023) / 1024
@example
ls -l @var{files} | awk '@{ x += $5 @}
   END @{ print "total K-bytes:", x / 1024 @}'
@end example

@item
Print a sorted list of the login names of all users:

@example
awk -F: '@{ print $1 @}' /etc/passwd | sort
@end example

@item
Count the lines in a file:

@example
awk 'END @{ print NR @}' data
@end example

@item
Print the even-numbered lines in the @value{DF}:

@example
awk 'NR % 2 == 0' data
@end example

If you used the expression @samp{NR % 2 == 1} instead,
the program would print the odd-numbered lines.
@end itemize

@node Two Rules
@section An Example with Two Rules
@cindex @command{awk} programs

The @command{awk} utility reads the input files one line at a
time. For each line, @command{awk} tries the patterns of each rule.
If several patterns match, then several actions execute in the order in
which they appear in the @command{awk} program. If no patterns match, then
no actions run.

After processing all the rules that match the line (and perhaps there are none),
@command{awk} reads the next line. (However,
@pxref{Next Statement}
@ifdocbook
and @ref{Nextfile Statement}.)
@end ifdocbook
@ifnotdocbook
and also @pxref{Nextfile Statement}.)
@end ifnotdocbook
This continues until the program reaches the end of the file.
For example, the following @command{awk} program contains two rules:

@example
/12/ @{ print $0 @}
/21/ @{ print $0 @}
@end example

@noindent
The first rule has the string @samp{12} as the
pattern and @samp{print $0} as the action. The second rule has the
string @samp{21} as the pattern and also has @samp{print $0} as the
action. Each rule's action is enclosed in its own pair of braces.

This program prints every line that contains the string
@samp{12} @emph{or} the string @samp{21}. If a line contains both
strings, it is printed twice, once by each rule.

This is what happens if we run this program on our two sample @value{DF}s,
@file{mail-list} and @file{inventory-shipped}:

@example
$ @kbd{awk '/12/ @{ print $0 @}}
> @kbd{/21/ @{ print $0 @}' mail-list inventory-shipped}
@print{} Anthony      555-3412     anthony.asserturo@@hotmail.com   A
@print{} Camilla      555-2912     camilla.infusarum@@skynet.be     R
@print{} Fabius       555-1234     fabius.undevicesimus@@ucb.edu    F
@print{} Jean-Paul    555-2127     jeanpaul.campanorum@@nyu.edu     R
@print{} Jean-Paul    555-2127     jeanpaul.campanorum@@nyu.edu     R
@print{} Jan  21  36  64 620
@print{} Apr  21  70  74 514
@end example

@noindent
Note how the line beginning with @samp{Jean-Paul}
in @file{mail-list} was printed twice, once for each rule.

@node More Complex
@section A More Complex Example

Now that we've mastered some simple tasks, let's look at
what typical @command{awk}
programs do. This example shows how @command{awk} can be used to
summarize, select, and rearrange the output of another utility.
It uses
features that haven't been covered yet, so don't worry if you don't
understand all the details:

@example
ls -l | awk '$6 == "Nov" @{ sum += $5 @}
             END @{ print sum @}'
@end example

@cindex @command{ls} utility
This command prints the total number of bytes in all the files in the
current directory that were last modified in November (of any year).

As a reminder, the output of @w{@samp{ls -l}} gives you a listing of the
files in a directory, including each file's size and the date the file
was last modified. The first field contains read-write permissions,
the second field contains the number of links to the file, and the
third field identifies the file's owner. The fourth field identifies
the file's group. The fifth field contains the file's size in bytes.
The sixth, seventh, and eighth fields contain the month, day, and time,
respectively, that the file was last modified. Finally, the ninth field
contains the @value{FN}.

@c @cindex automatic initialization
@cindex initialization, automatic
The @samp{$6 == "Nov"} in our @command{awk} program is an expression that
tests whether the sixth field of the output from @w{@samp{ls -l}}
matches the string @samp{Nov}. Each time a line has the string
@samp{Nov} for its sixth field, @command{awk} performs the action
@samp{sum += $5}. This adds the fifth field (the file's size) to the variable
@code{sum}. As a result, when @command{awk} has finished reading all the
input lines, @code{sum} is the total of the sizes of the files whose
lines matched the pattern. (This works because @command{awk} variables
are automatically initialized to zero.)

After the last line of output from @command{ls} has been processed, the
@code{END} rule executes and prints the value of @code{sum}.
In this example, the value of @code{sum} is 80600.
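
The arithmetic can be checked without depending on the contents of any
real directory. The following sketch feeds the sample @w{@samp{ls -l}}
listing shown earlier to the same program through a shell variable; the
variable names @code{listing} and @code{total} are arbitrary choices
made for this illustration:

```shell
# Sample listing, copied from the ls -l output shown earlier.
listing='-rw-r--r--  1 arnold   user   1933 Nov  7 13:05 Makefile
-rw-r--r--  1 arnold   user  10809 Nov  7 13:03 awk.h
-rw-r--r--  1 arnold   user    983 Apr 13 12:14 awk.tab.h
-rw-r--r--  1 arnold   user  31869 Jun 15 12:20 awkgram.y
-rw-r--r--  1 arnold   user  22414 Nov  7 13:03 awk1.c
-rw-r--r--  1 arnold   user  37455 Nov  7 13:03 awk2.c
-rw-r--r--  1 arnold   user  27511 Dec  9 13:07 awk3.c
-rw-r--r--  1 arnold   user   7989 Nov  7 13:03 awk4.c'

# Add up field 5 (the size) on lines whose field 6 (the month) is "Nov".
# sum starts out empty, which awk treats as zero in arithmetic.
total=$(printf '%s\n' "$listing" | awk '$6 == "Nov" { sum += $5 }
                                        END { print sum }')

echo "$total"
```

The five November entries contribute 1933 + 10809 + 22414 + 37455 + 7989
bytes, so the command prints 80600, in agreement with the value of
@code{sum} given above.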

These more advanced @command{awk} techniques are covered in later
@value{SECTION}s
(@pxref{Action Overview}). Before you can move on to more
advanced @command{awk} programming, you have to know how @command{awk} interprets
your input and displays your output. By manipulating fields and using
@code{print} statements, you can produce some very useful and
impressive-looking reports.

@node Statements/Lines
@section @command{awk} Statements Versus Lines
@cindex line breaks
@cindex newlines

Most often, each line in an @command{awk} program is a separate statement or
separate rule, like this:

@example
awk '/12/ @{ print $0 @}
     /21/ @{ print $0 @}' mail-list inventory-shipped
@end example

@cindex @command{gawk} @subentry newlines in
However, @command{gawk} ignores newlines after any of the following
symbols and keywords:

@example
, @{ ? : || && do else
@end example

@noindent
A newline at any other point is considered the end of the
statement.@footnote{The @samp{?} and @samp{:} referred to here are the
operators of the three-operand conditional expression described in
@ref{Conditional Exp}.
Splitting lines after @samp{?} and @samp{:} is a minor @command{gawk}
extension; if @option{--posix} is specified
(@pxref{Options}), then this extension is disabled.}

@cindex @code{\} (backslash) @subentry continuing lines and
@cindex backslash (@code{\}) @subentry continuing lines and
If you would like to split a single statement into two lines at a point
where a newline would terminate it, you can @dfn{continue} it by ending the
first line with a backslash character (@samp{\}). The backslash must be
the final character on the line in order to be recognized as a continuation
character. A backslash followed by a newline is allowed anywhere in the statement, even
in the middle of a string or regular expression.
For example:

@example
awk '/This regular expression is too long, so continue it\
 on the next line/ @{ print $1 @}'
@end example

@noindent
@cindex portability @subentry backslash continuation and
We have generally not used backslash continuation in our sample programs.
@command{gawk} places no limit on the
length of a line, so backslash continuation is never strictly necessary;
it just makes programs more readable. For this same reason, as well as
for clarity, we have kept most statements short in the programs
presented throughout the @value{DOCUMENT}.

Backslash continuation is
most useful when your @command{awk} program is in a separate source file
instead of entered from the command line. You should also note that
many @command{awk} implementations are more particular about where you
may use backslash continuation. For example, they may not allow you to
split a string constant using backslash continuation. Thus, for maximum
portability of your @command{awk} programs, it is best not to split your
lines in the middle of a regular expression or a string.
@c 10/2000: gawk, mawk, and current bell labs awk allow it,
@c solaris 2.7 nawk does not. Solaris /usr/xpg4/bin/awk does though! sigh.

@cindex @command{csh} utility
@cindex line continuations @subentry with C shell
@cindex backslash (@code{\}) @subentry continuing lines and @subentry in @command{csh}
@cindex @code{\} (backslash) @subentry continuing lines and @subentry in @command{csh}
@quotation CAUTION
@emph{Backslash continuation does not work as described
with the C shell.} It works for @command{awk} programs in files and
for one-shot programs, @emph{provided} you are using a POSIX-compliant
shell, such as the Unix Bourne shell or Bash. But the C shell behaves
differently! There you must use two backslashes in a row, followed by
a newline.
Note also that when using the C shell, @emph{every} newline
in your @command{awk} program must be escaped with a backslash. To illustrate:

@example
% @kbd{awk 'BEGIN @{ \}
? @kbd{ print \\}
? @kbd{ "hello, world" \}
? @kbd{@}'}
@print{} hello, world
@end example

@noindent
Here, the @samp{%} and @samp{?} are the C shell's primary and secondary
prompts, analogous to the standard shell's @samp{$} and @samp{>}.

Compare the previous example to how it is done with a POSIX-compliant shell:

@example
$ @kbd{awk 'BEGIN @{}
> @kbd{print \}
> @kbd{"hello, world"}
> @kbd{@}'}
@print{} hello, world
@end example
@end quotation

@command{awk} is a line-oriented language. Each rule's action has to
begin on the same line as the pattern. To have the pattern and action
on separate lines, you @emph{must} use backslash continuation; there
is no other option.

@cindex backslash (@code{\}) @subentry continuing lines and @subentry comments and
@cindex @code{\} (backslash) @subentry continuing lines and @subentry comments and
@cindex commenting @subentry backslash continuation and
Another thing to keep in mind is that backslash continuation and
comments do not mix. As soon as @command{awk} sees the @samp{#} that
starts a comment, it ignores @emph{everything} on the rest of the
line. For example:

@example
@group
$ @kbd{gawk 'BEGIN @{ print "dont panic" # a friendly \}
> @kbd{ BEGIN rule}
> @kbd{@}'}
@error{} gawk: cmd. line:2: BEGIN rule
@error{} gawk: cmd. line:2: ^ syntax error
@end group
@end example

@noindent
In this case, it looks like the backslash would continue the comment onto the
next line. However, the backslash-newline combination is never even
noticed because it is ``hidden'' inside the comment. Thus, the
@code{BEGIN} is noted as a syntax error.
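
One way to keep such a comment, sketched here as an illustration rather
than as the only possible fix, is to drop the backslash and let the
comment end at the newline. The newline then terminates the
@code{print} statement, and the closing brace follows on its own line:

@example
$ @kbd{awk 'BEGIN @{ print "dont panic"   # a friendly BEGIN rule}
> @kbd{@}'}
@print{} dont panic
@end example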

@cindex statements @subentry multiple
@cindex @code{;} (semicolon) @subentry separating statements in actions
@cindex semicolon (@code{;}) @subentry separating statements in actions
@cindex @code{;} (semicolon) @subentry separating rules
@cindex semicolon (@code{;}) @subentry separating rules
When @command{awk} statements within one rule are short, you might want to put
more than one of them on a line. This is accomplished by separating the statements
with a semicolon (@samp{;}).
This also applies to the rules themselves.
Thus, the program shown at the start of this @value{SECTION}
could also be written this way:

@example
/12/ @{ print $0 @} ; /21/ @{ print $0 @}
@end example

@quotation NOTE
The requirement that rules on the same line be
separated with a semicolon was not in the original @command{awk}
language; it was added for consistency with the treatment of statements
within an action.
@end quotation

@node Other Features
@section Other Features of @command{awk}

@cindex variables
The @command{awk} language provides a number of predefined, or
@dfn{built-in}, variables that your programs can use to get information
from @command{awk}. There are other variables your program can set
as well to control how @command{awk} processes your data.

In addition, @command{awk} provides a number of built-in functions for doing
common computational and string-related operations.
@command{gawk} provides built-in functions for working with timestamps,
performing bit manipulation, translating strings at runtime (internationalization),
determining the type of a variable,
and sorting arrays.

As we develop our presentation of the @command{awk} language, we will introduce
most of the variables and many of the functions.
They are described
systematically in @ref{Built-in Variables} and in
@ref{Built-in}.

@node When
@section When to Use @command{awk}

@cindex @command{awk} @subentry uses for
Now that you've seen some of what @command{awk} can do,
you might wonder how @command{awk} could be useful for you. By using
utility programs, advanced patterns, field separators, arithmetic
statements, and other selection criteria, you can produce much more
complex output. The @command{awk} language is very useful for producing
reports from large amounts of raw data, such as summarizing information
from the output of other utility programs like @command{ls}.
(@xref{More Complex}.)

Programs written with @command{awk} are usually much smaller than they would
be in other languages. This makes @command{awk} programs easy to compose and
use. Often, @command{awk} programs can be quickly composed at your keyboard,
used once, and thrown away. Because @command{awk} programs are interpreted, you
can avoid the (usually lengthy) compilation part of the typical
edit-compile-test-debug cycle of software development.

@cindex BWK @command{awk} @seeentry{Brian Kernighan's @command{awk}}
@cindex Brian Kernighan's @command{awk}
Complex programs have been written in @command{awk}, including a complete
retargetable assembler for
@ifclear FOR_PRINT
eight-bit microprocessors (@pxref{Glossary}, for more information),
@end ifclear
@ifset FOR_PRINT
eight-bit microprocessors,
@end ifset
and a microcode assembler for a special-purpose Prolog
computer.
The original @command{awk}'s capabilities were strained by tasks
of such complexity, but modern versions are more capable.

@cindex @command{awk} programs @subentry complex
If you find yourself writing @command{awk} scripts of more than, say,
a few hundred lines, you might consider using a different programming
language. The shell is good at string and pattern matching; in addition,
it allows powerful use of the system utilities. Python offers a nice
balance between high-level ease of programming and access to system
facilities.@footnote{Other popular scripting languages include Ruby
and Perl.}

@node Intro Summary
@section Summary

@itemize @value{BULLET}
@item
Programs in @command{awk} consist of @var{pattern}--@var{action} pairs.

@item
An @var{action} without a @var{pattern} always runs. The default
@var{action} for a pattern without one is @samp{@{ print $0 @}}.

@item
Use either
@samp{awk '@var{program}' @var{files}}
or
@samp{awk -f @var{program-file} @var{files}}
to run @command{awk}.

@item
You may use the special @samp{#!} header line to create @command{awk}
programs that are directly executable.

@item
Comments in @command{awk} programs start with @samp{#} and continue to
the end of the same line.

@item
Be aware of quoting issues when writing @command{awk} programs as
part of a larger shell script (or MS-Windows batch file).

@item
You may use backslash continuation to continue a source line.
Lines are automatically continued after
a comma, open brace, question mark, colon,
@samp{||}, @samp{&&}, @code{do}, and @code{else}.
@end itemize

@node Invoking Gawk
@chapter Running @command{awk} and @command{gawk}

This @value{CHAPTER} covers how to run @command{awk}, both POSIX-standard
and @command{gawk}-specific command-line options, and what
@command{awk} and
@command{gawk} do with nonoption arguments.
It then proceeds to cover how @command{gawk} searches for source files,
reading standard input along with other files, @command{gawk}'s
environment variables, @command{gawk}'s exit status, using include files,
and obsolete and undocumented options and/or features.

Many of the options and features described here are discussed in
more detail later in the @value{DOCUMENT}; feel free to skip over
things in this @value{CHAPTER} that don't interest you right now.

@menu
* Command Line::                How to run @command{awk}.
* Options::                     Command-line options and their meanings.
* Other Arguments::             Input file names and variable assignments.
* Naming Standard Input::       How to specify standard input with other
                                files.
* Environment Variables::       The environment variables @command{gawk} uses.
* Exit Status::                 @command{gawk}'s exit status.
* Include Files::               Including other files into your program.
* Loading Shared Libraries::    Loading shared libraries into your program.
* Obsolete::                    Obsolete Options and/or features.
* Undocumented::                Undocumented Options and Features.
* Invoking Summary::            Invocation summary.
@end menu

@node Command Line
@section Invoking @command{awk}
@cindex command line @subentry invoking @command{awk} from
@cindex @command{awk} @subentry invoking
@cindex arguments @subentry command-line @subentry invoking @command{awk}
@cindex options @subentry command-line @subentry invoking @command{awk}

There are two ways to run @command{awk}---with an explicit program or with
one or more program files.
Here are templates for both of them; items
enclosed in [@dots{}] in these templates are optional:

@display
@command{awk} [@var{options}] @option{-f} @var{progfile} [@option{--}] @var{file} @dots{}
@command{awk} [@var{options}] [@option{--}] @code{'@var{program}'} @var{file} @dots{}
@end display

@cindex GNU long options
@cindex long options
@cindex options @subentry long
In addition to traditional one-letter POSIX-style options, @command{gawk} also
supports GNU long options.

@cindex dark corner @subentry invoking @command{awk}
@cindex lint checking @subentry empty programs
It is possible to invoke @command{awk} with an empty program:

@example
awk '' datafile1 datafile2
@end example

@cindex @option{--lint} option
@cindex dark corner @subentry empty programs
@noindent
Doing so makes little sense, though; @command{awk} exits
silently when given an empty program.
@value{DARKCORNER}
If @option{--lint} has
been specified on the command line, @command{gawk} issues a
warning that the program is empty.

@node Options
@section Command-Line Options
@cindex options @subentry command-line
@cindex command line @subentry options
@cindex GNU long options
@cindex options @subentry long

Options begin with a dash and consist of a single character.
GNU-style long options consist of two dashes and a keyword.
The keyword can be abbreviated, as long as the abbreviation allows the option
to be uniquely identified. If the option takes an argument, either the
keyword is immediately followed by an equals sign (@samp{=}) and the
argument's value, or the keyword and the argument's value are separated
by whitespace (spaces or TABs).
If a particular option with a value is given more than once, it is (usually)
the last value that counts.
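
The following sketch illustrates these rules with the
@option{--field-separator} option, which is described later in this
@value{SECTION}. It assumes that @command{gawk} is installed under that
name; the input line and the abbreviation @option{--field-sep} were
chosen only for illustration. The first three commands show the
@samp{=} form, the whitespace-separated form, and an abbreviated keyword.
The last one gives @option{-F} twice, so the second value is the one
used:

@example
$ @kbd{echo 'a:b:c' | gawk --field-separator=: '@{ print $2 @}'}
@print{} b
$ @kbd{echo 'a:b:c' | gawk --field-separator : '@{ print $2 @}'}
@print{} b
$ @kbd{echo 'a:b:c' | gawk --field-sep=: '@{ print $2 @}'}
@print{} b
$ @kbd{echo 'a:b:c' | gawk -F, -F: '@{ print $2 @}'}
@print{} b
@end example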

@cindex POSIX @command{awk} @subentry GNU long options and
Each long option for @command{gawk} has a corresponding
POSIX-style short option.
The long and short options are
interchangeable in all contexts.
The following list describes options mandated by the POSIX standard:

@table @code
@item -F @var{fs}
@itemx --field-separator @var{fs}
@cindex @option{-F} option
@cindex @option{--field-separator} option
@cindex @code{FS} variable @subentry @option{--field-separator} option and
Set the @code{FS} variable to @var{fs}
(@pxref{Field Separators}).

@item -f @var{source-file}
@itemx --file @var{source-file}
@cindex @option{-f} option
@cindex @option{--file} option
@cindex @command{awk} programs @subentry location of
Read the @command{awk} program source from @var{source-file}
instead of in the first nonoption argument.
This option may be given multiple times; the @command{awk}
program consists of the concatenation of the contents of
each specified @var{source-file}.

Files named with @option{-f} are treated as if they had @samp{@@namespace "awk"}
at their beginning. @xref{Changing The Namespace}, for more information
on this advanced feature.

@item -v @var{var}=@var{val}
@itemx --assign @var{var}=@var{val}
@cindex @option{-v} option
@cindex @option{--assign} option
@cindex variables @subentry setting
Set the variable @var{var} to the value @var{val} @emph{before}
execution of the program begins. Such variable values are available
inside the @code{BEGIN} rule
(@pxref{Other Arguments}).

The @option{-v} option can only set one variable, but it can be used
more than once, setting another variable each time, like this:
@samp{awk @w{-v foo=1} @w{-v bar=2} @dots{}}.

@cindex predefined variables @subentry @code{-v} option, setting with
@cindex variables @subentry predefined @subentry @code{-v} option, setting with
@quotation CAUTION
Using @option{-v} to set the values of the built-in
variables may lead to surprising results. @command{awk} will reset the
values of those variables as it needs to, possibly ignoring any
initial value you may have given.
@end quotation

@item -W @var{gawk-opt}
@cindex @option{-W} option
Provide an implementation-specific option.
This is the POSIX convention for providing implementation-specific options.
These options
also have corresponding GNU-style long options.
Note that the long options may be abbreviated, as long as
the abbreviations remain unique.
The full list of @command{gawk}-specific options is provided next.

@item --
@cindex command line @subentry options @subentry end of
@cindex options @subentry command-line @subentry end of
Signal the end of the command-line options. The following arguments
are not treated as options even if they begin with @samp{-}. This
interpretation of @option{--} follows the POSIX argument parsing
conventions.

@cindex @code{-} (hyphen) @subentry file names beginning with
@cindex hyphen (@code{-}) @subentry file names beginning with
This is useful if you have @value{FN}s that start with @samp{-},
or in shell scripts, if you have @value{FN}s that will be specified
by the user that could start with @samp{-}.
It is also useful for passing options on to the @command{awk}
program; see @ref{Getopt Function}.
@end table

The following list describes @command{gawk}-specific options:

@c Have to use @asis here to get docbook to come out right.
@table @asis
@item @option{-b}
@itemx @option{--characters-as-bytes}
@cindex @option{-b} option
@cindex @option{--characters-as-bytes} option
Cause @command{gawk} to treat all input data as single-byte characters.
In addition, all output written with @code{print} or @code{printf}
is treated as single-byte characters.

Normally, @command{gawk} follows the POSIX standard and attempts to process
its input data according to the current locale (@pxref{Locales}). This can often involve
converting multibyte characters into wide characters (internally), and
can lead to problems or confusion if the input data does not contain valid
multibyte characters. This option is an easy way to tell @command{gawk},
``Hands off my data!''

@item @option{-c}
@itemx @option{--traditional}
@cindex @option{-c} option
@cindex @option{--traditional} option
@cindex compatibility mode (@command{gawk}) @subentry specifying
Specify @dfn{compatibility mode}, in which the GNU extensions to
the @command{awk} language are disabled, so that @command{gawk} behaves just
like BWK @command{awk}.
@xref{POSIX/GNU},
which summarizes the extensions.
@ifclear FOR_PRINT
Also see
@ref{Compatibility Mode}.
@end ifclear

@item @option{-C}
@itemx @option{--copyright}
@cindex @option{-C} option
@cindex @option{--copyright} option
@cindex GPL (General Public License) @subentry printing
Print the short version of the General Public License and then exit.

@item @option{-d}[@var{file}]
@itemx @option{--dump-variables}[@code{=}@var{file}]
@cindex @option{-d} option
@cindex @option{--dump-variables} option
@cindex dump all variables of a program
@cindex @file{awkvars.out} file
@cindex files @subentry @file{awkvars.out}
@cindex variables @subentry global @subentry printing list of
Print a sorted list of global variables, their types, and final values
to @var{file}. If no @var{file} is provided, print this
list to a file named @file{awkvars.out} in the current directory.
No space is allowed between the @option{-d} and @var{file}, if
@var{file} is supplied.

@cindex troubleshooting @subentry typographical errors, global variables
Having a list of all global variables is a good way to look for
typographical errors in your programs.
You would also use this option if you have a large program with a lot of
functions, and you want to be sure that your functions don't
inadvertently use global variables that you meant to be local.
(This is a particularly easy mistake to make with simple variable
names like @code{i}, @code{j}, etc.)

@item @option{-D}[@var{file}]
@itemx @option{--debug}[@code{=}@var{file}]
@cindex @option{-D} option
@cindex @option{--debug} option
@cindex @command{awk} programs @subentry debugging, enabling
Enable debugging of @command{awk} programs
(@pxref{Debugging}).
By default, the debugger reads commands interactively from the keyboard
(standard input).
The optional @var{file} argument allows you to specify a file with a list
of commands for the debugger to execute noninteractively.
No space is allowed between the @option{-D} and @var{file}, if
@var{file} is supplied.

@item @option{-e} @var{program-text}
@itemx @option{--source} @var{program-text}
@cindex @option{-e} option
@cindex @option{--source} option
@cindex source code @subentry mixing
Provide program source code in the @var{program-text}.
This option allows you to mix source code in files with source
code that you enter on the command line.
This is particularly useful
when you have library functions that you want to use from your command-line
programs (@pxref{AWKPATH Variable}).

Note that @command{gawk} treats each string as if it ended with
a newline character (even if it doesn't). This makes building
the total program easier.

@quotation CAUTION
Prior to @value{PVERSION} 5.0, there was
no requirement that each @var{program-text}
be a full syntactic unit. I.e., the following worked:

@example
$ @kbd{gawk -e 'BEGIN @{ a = 5 ;' -e 'print a @}'}
@print{} 5
@end example

@noindent
However, this is no longer true. If you have any scripts that
rely upon this feature, you should revise them.

This is because each @var{program-text} is treated as if it had
@samp{@@namespace "awk"} at its beginning. @xref{Changing The Namespace},
for more information.
@end quotation

@item @option{-E} @var{file}
@itemx @option{--exec} @var{file}
@cindex @option{-E} option
@cindex @option{--exec} option
@cindex @command{awk} programs @subentry location of
@cindex CGI, @command{awk} scripts for
Similar to @option{-f}, read @command{awk} program text from @var{file}.
There are two differences from @option{-f}:

@itemize @value{BULLET}
@item
This option terminates option processing; anything
else on the command line is passed on directly to the @command{awk} program.

@item
Command-line variable assignments of the form
@samp{@var{var}=@var{value}} are disallowed.
@end itemize

This option is particularly necessary for World Wide Web CGI applications
that pass arguments through the URL; using this option prevents a malicious
(or other) user from passing in options, assignments, or @command{awk} source
code (via @option{-e}) to the CGI application.@footnote{For more detail,
please see Section 4.4 of @uref{http://www.ietf.org/rfc/rfc3875,
RFC 3875}. Also see the
@uref{https://lists.gnu.org/archive/html/bug-gawk/2014-11/msg00022.html,
explanatory note sent to the @command{gawk} bug
mailing list}.}
This option should be used
with @samp{#!} scripts (@pxref{Executable Scripts}), like so:

@example
#! /usr/local/bin/gawk -E

@var{awk program here @dots{}}
@end example

@item @option{-g}
@itemx @option{--gen-pot}
@cindex @option{-g} option
@cindex @option{--gen-pot} option
@cindex portable object @subentry files @subentry generating
@cindex files @subentry portable object @subentry generating
Analyze the source program and
generate a GNU @command{gettext} portable object template file on standard
output for all string constants that have been marked for translation.
@xref{Internationalization},
for information about this option.

@item @option{-h}
@itemx @option{--help}
@cindex @option{-h} option
@cindex @option{--help} option
@cindex GNU long options @subentry printing list of
@cindex options @subentry printing list of
@cindex printing @subentry list of options
Print a ``usage'' message summarizing the short- and long-style options
that @command{gawk} accepts and then exit.

@item @option{-i} @var{source-file}
@itemx @option{--include} @var{source-file}
@cindex @option{-i} option
@cindex @option{--include} option
@cindex @command{awk} programs @subentry location of
Read an @command{awk} source library from @var{source-file}.
This option
is completely equivalent to using the @code{@@include} directive inside
your program. It is very similar to the @option{-f} option,
but there are two important differences. First, when @option{-i} is
used, the program source is not loaded if it has been previously
loaded, whereas with @option{-f}, @command{gawk} always loads the file.
Second, because this option is intended to be used with code libraries,
@command{gawk} does not recognize such files as constituting main program
input. Thus, after processing an @option{-i} argument, @command{gawk}
still expects to find the main source code via the @option{-f} option
or on the command line.

Files named with @option{-i} are treated as if they had @samp{@@namespace "awk"}
at their beginning. @xref{Changing The Namespace}, for more information.

@item @option{-I}
@itemx @option{--trace}
@cindex @option{-I} option
@cindex @option{--trace} option
@cindex trace, internal instructions
@cindex instructions, trace of internal
@cindex op-codes, trace of internal
Print the internal byte code names as they are executed when running
the program. The trace is printed to standard error. Each ``op code''
is preceded by a @code{+}
sign in the output.

@item @option{-l} @var{ext}
@itemx @option{--load} @var{ext}
@cindex @option{-l} option
@cindex @option{--load} option
@cindex loading extensions
Load a dynamic extension named @var{ext}. Extensions
are stored as system shared libraries.
This option searches for the library using the @env{AWKLIBPATH}
environment variable. The correct library suffix for your platform will be
supplied by default, so it need not be specified in the extension name.
The extension initialization routine should be named @code{dl_load()}.
An alternative is to use the @code{@@load} keyword inside the program to load
a shared library.
This advanced feature is described in detail in @ref{Dynamic Extensions}.

@item @option{-L}[@var{value}]
@itemx @option{--lint}[@code{=}@var{value}]
@cindex @option{-L} option
@cindex @option{--lint} option
@cindex lint checking @subentry issuing warnings
@cindex warnings, issuing
Warn about constructs that are dubious or nonportable to
other @command{awk} implementations.
No space is allowed between the @option{-L} and @var{value}, if
@var{value} is supplied.
Some warnings are issued when @command{gawk} first reads your program. Others
are issued at runtime, as your program executes. The optional
argument may be one of the following:

@table @code
@item fatal
Cause lint warnings to become fatal errors.
This may be drastic, but its use will certainly encourage the
development of cleaner @command{awk} programs.

@item invalid
Only issue warnings about things
that are actually invalid. (This is not fully implemented yet.)

@item no-ext
Disable warnings about @command{gawk} extensions.
@end table

Some warnings are only printed once, even if the dubious constructs they
warn about occur multiple times in your @command{awk} program. Thus,
when eliminating problems pointed out by @option{--lint}, you should take
care to search for all occurrences of each inappropriate construct. As
@command{awk} programs are usually short, doing so is not burdensome.

@item @option{-M}
@itemx @option{--bignum}
@cindex @option{-M} option
@cindex @option{--bignum} option
Select arbitrary-precision arithmetic on numbers. This option has no effect
if @command{gawk} is not compiled to use the GNU MPFR and MP libraries
(@pxref{Arbitrary Precision Arithmetic}).

@item @option{-n}
@itemx @option{--non-decimal-data}
@cindex @option{-n} option
@cindex @option{--non-decimal-data} option
@cindex hexadecimal values, enabling interpretation of
@cindex octal values, enabling interpretation of
@cindex troubleshooting @subentry @option{--non-decimal-data} option
Enable automatic interpretation of octal and hexadecimal
values in input data
(@pxref{Nondecimal Data}).

@quotation CAUTION
This option can severely break old programs. Use with care. Also note
that this option may disappear in a future version of @command{gawk}.
@end quotation

@item @option{-N}
@itemx @option{--use-lc-numeric}
@cindex @option{-N} option
@cindex @option{--use-lc-numeric} option
Force the use of the locale's decimal point character
when parsing numeric input data (@pxref{Locales}).

@cindex pretty printing
@item @option{-o}[@var{file}]
@itemx @option{--pretty-print}[@code{=}@var{file}]
@cindex @option{-o} option
@cindex @option{--pretty-print} option
Enable pretty-printing of @command{awk} programs.
Implies @option{--no-optimize}.
By default, the output program is created in a file named @file{awkprof.out}
(@pxref{Profiling}).
The optional @var{file} argument allows you to specify a different
@value{FN} for the output.
No space is allowed between the @option{-o} and @var{file}, if
@var{file} is supplied.

@quotation NOTE
In the past, this option would also execute your program.
This is no longer the case.
@end quotation

@item @option{-O}
@itemx @option{--optimize}
@cindex @option{--optimize} option
@cindex @option{-O} option
Enable @command{gawk}'s default optimizations on the internal
representation of the program. At the moment, this includes just simple
constant folding.

Optimization is enabled by default.
This option remains primarily for backwards compatibility. However, it may
be used to cancel the effect of an earlier @option{-s} option
(see later in this list).

@item @option{-p}[@var{file}]
@itemx @option{--profile}[@code{=}@var{file}]
@cindex @option{-p} option
@cindex @option{--profile} option
@cindex @command{awk} @subentry profiling, enabling
Enable profiling of @command{awk} programs
(@pxref{Profiling}).
Implies @option{--no-optimize}.
By default, profiles are created in a file named @file{awkprof.out}.
The optional @var{file} argument allows you to specify a different
@value{FN} for the profile file.
No space is allowed between the @option{-p} and @var{file}, if
@var{file} is supplied.

The profile contains execution counts for each statement in the program
in the left margin, and function call counts for each function.

@item @option{-P}
@itemx @option{--posix}
@cindex @option{-P} option
@cindex @option{--posix} option
@cindex POSIX mode
@cindex @command{gawk} @subentry extensions, disabling
Operate in strict POSIX mode. This disables all @command{gawk}
extensions (just like @option{--traditional}) and
disables all extensions not allowed by POSIX.
@xref{Common Extensions} for a summary of the extensions
in @command{gawk} that are disabled by this option.
Also,
the following additional
restrictions apply:

@itemize @value{BULLET}

@cindex newlines
@cindex whitespace @subentry newlines as
@item
Newlines are not allowed after @samp{?} or @samp{:}
(@pxref{Conditional Exp}).


@cindex @code{FS} variable @subentry TAB character as
@item
Specifying @samp{-Ft} on the command line does not set the value
of @code{FS} to be a single TAB character
(@pxref{Field Separators}).

@cindex locale decimal point character
@cindex decimal point character, locale specific
@item
The locale's decimal point character is used for parsing input
data (@pxref{Locales}).
@end itemize

@c @cindex automatic warnings
@c @cindex warnings, automatic
@cindex @option{--traditional} option @subentry @option{--posix} option and
@cindex @option{--posix} option @subentry @option{--traditional} option and
If you supply both @option{--traditional} and @option{--posix} on the
command line, @option{--posix} takes precedence. @command{gawk}
issues a warning if both options are supplied.

@item @option{-r}
@itemx @option{--re-interval}
@cindex @option{-r} option
@cindex @option{--re-interval} option
@cindex regular expressions @subentry interval expressions and
Allow interval expressions
(@pxref{Regexp Operators})
in regexps.
This is now @command{gawk}'s default behavior.
Nevertheless, this option remains (both for backward compatibility
and for use in combination with @option{--traditional}).

@item @option{-s}
@itemx @option{--no-optimize}
@cindex @option{--no-optimize} option
@cindex @option{-s} option
Disable @command{gawk}'s default optimizations on the internal
representation of the program.

@item @option{-S}
@itemx @option{--sandbox}
@cindex @option{-S} option
@cindex @option{--sandbox} option
@cindex sandbox mode
@cindex @code{ARGV} array
Disable the @code{system()} function,
input redirections with @code{getline},
output redirections with @code{print} and @code{printf},
and dynamic extensions.
Also, disallow adding @value{FN}s to @code{ARGV} that were
not there when @command{gawk} started running.
This is particularly useful when you want to run @command{awk} scripts
from questionable sources and need to make sure the scripts
can't access your system (other than the specified input @value{DF}s).

@item @option{-t}
@itemx @option{--lint-old}
@cindex @option{-t} option
@cindex @option{--lint-old} option
Warn about constructs that are not available in the original version of
@command{awk} from Version 7 Unix
(@pxref{V7/SVR3.1}).

@item @option{-V}
@itemx @option{--version}
@cindex @option{-V} option
@cindex @option{--version} option
@cindex @command{gawk} @subentry version of @subentry printing information about
Print version information for this particular copy of @command{gawk}.
This allows you to determine if your copy of @command{gawk} is up to date
with respect to whatever the Free Software Foundation is currently
distributing.
It is also useful for bug reports
(@pxref{Bugs}).

@cindex @code{-} (hyphen) @subentry @code{--} end of options marker
@cindex hyphen (@code{-}) @subentry @code{--} end of options marker
@item @code{--}
Mark the end of all options.
Any command-line arguments following @code{--} are placed in @code{ARGV},
even if they start with a minus sign.
@end table

In compatibility mode,
as long as program text has been supplied,
any other options are flagged as invalid with a warning message but
are otherwise ignored.

@cindex @option{-F} option @subentry @option{-Ft} sets @code{FS} to TAB
In compatibility mode, as a special case, if the value of @var{fs} supplied
to the @option{-F} option is @samp{t}, then @code{FS} is set to the TAB
character (@code{"\t"}). This is true only for @option{--traditional} and not
for @option{--posix}
(@pxref{Field Separators}).

@cindex @option{-f} option @subentry multiple uses
The @option{-f} option may be used more than once on the command line.
If it is, @command{awk} reads its program source from all of the named files, as
if they had been concatenated together into one big file. This is
useful for creating libraries of @command{awk} functions. These functions
can be written once and then retrieved from a standard place, instead
of having to be included in each individual program.
The @option{-i} option is similar in this regard.
(As mentioned in
@ref{Definition Syntax},
function names must be unique.)

With standard @command{awk}, library functions can still be used, even
if the program is entered at the keyboard,
by specifying @samp{-f /dev/tty}. After typing your program,
type @kbd{Ctrl-d} (the end-of-file character) to terminate it.
(You may also use @samp{-f -} to read program source from the standard
input, but then you will not be able to also use the standard input as a
source of data.)

Because it is clumsy using the standard @command{awk} mechanisms to mix
source file and command-line @command{awk} programs, @command{gawk}
provides the @option{-e} option. This does not require you to
preempt the standard input for your source code, and it allows you to easily
mix command-line and library source code (@pxref{AWKPATH Variable}).
As with @option{-f}, the @option{-e} and @option{-i}
options may also be used multiple times on the command line.

@cindex @option{-e} option
If no @option{-f} option (or @option{-e} option for @command{gawk})
is specified, then @command{awk} uses the first nonoption command-line
argument as the text of the program source code. Arguments on
the command line that follow the program text are entered into the
@code{ARGV} array; @command{awk} does @emph{not} continue to parse the
command line looking for options.
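
As a rough illustration of mixing library and command-line source code
with @option{-f} and @option{-e}, the following sketch assumes a
hypothetical library file named @file{lib.awk} that defines a function
@code{double()}. Neither name comes from the @command{gawk} distribution:

@example
$ @kbd{cat lib.awk}
@print{} # lib.awk: hypothetical library file, for illustration only
@print{} function double(x) @{ return 2 * x @}
$ @kbd{echo 21 | gawk -f lib.awk -e '@{ print double($1) @}'}
@print{} 42
@end example

@noindent
The data still arrives on standard input, because neither @option{-f}
nor @option{-e} uses standard input for program text here.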

@cindex @env{POSIXLY_CORRECT} environment variable
@cindex environment variables @subentry @env{POSIXLY_CORRECT}
@cindex lint checking @subentry @env{POSIXLY_CORRECT} environment variable
@cindex POSIX mode
If the environment variable @env{POSIXLY_CORRECT} exists,
then @command{gawk} behaves in strict POSIX mode, exactly as if
you had supplied @option{--posix}.
Many GNU programs look for this environment variable to suppress
extensions that conflict with POSIX, but @command{gawk} behaves
differently: it suppresses all extensions, even those that do not
conflict with POSIX, and behaves in
strict POSIX mode. If @option{--lint} is supplied on the command line
and @command{gawk} turns on POSIX mode because of @env{POSIXLY_CORRECT},
then it issues a warning message indicating that POSIX
mode is in effect.
You would typically set this variable in your shell's startup file.
For a Bourne-compatible shell (such as Bash), you would add these
lines to the @file{.profile} file in your home directory:

@example
POSIXLY_CORRECT=true
export POSIXLY_CORRECT
@end example

@cindex @command{csh} utility @subentry @env{POSIXLY_CORRECT} environment variable
For a C shell-compatible
shell,@footnote{Not recommended.}
you would add this line to the @file{.login} file in your home directory:

@example
setenv POSIXLY_CORRECT true
@end example

@cindex portability @subentry @env{POSIXLY_CORRECT} environment variable
Having @env{POSIXLY_CORRECT} set is not recommended for daily use,
but it is good for testing the portability of your programs to other
environments.
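
For such one-off portability tests, a Bourne-compatible shell also lets
you place the assignment in front of a single command, so that no
startup file needs to change. This is only a sketch; @file{myprog.awk}
and @file{mydata} are placeholder names:

@example
# POSIX mode for this one invocation only.
# myprog.awk and mydata are placeholder names.
POSIXLY_CORRECT=true gawk -f myprog.awk mydata
@end example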

@node Other Arguments
@section Other Command-Line Arguments
@cindex command line @subentry arguments
@cindex arguments @subentry command-line

Any additional arguments on the command line are normally treated as
input files to be processed in the order specified. However, an
argument that has the form @code{@var{var}=@var{value}} assigns
the value @var{value} to the variable @var{var}---it does not specify a
file at all. (See @ref{Assignment Options}.) In the following example,
@samp{count=1} is a variable assignment, not a @value{FN}:

@example
awk -f program.awk file1 count=1 file2
@end example

@noindent
As a side point, should you really need to have @command{awk}
process a file named @file{count=1} (or any file whose name looks like
a variable assignment), precede the file name with @samp{./}, like so:

@example
awk -f program.awk file1 ./count=1 file2
@end example

@cindex @command{gawk} @subentry @code{ARGIND} variable in
@cindex @code{ARGIND} variable @subentry command-line arguments
@cindex @code{ARGV} array, indexing into
@cindex @code{ARGC}/@code{ARGV} variables @subentry command-line arguments
@cindex @command{gawk} @subentry @code{PROCINFO} array in
All the command-line arguments are made available to your @command{awk} program in the
@code{ARGV} array (@pxref{Built-in Variables}). Command-line options
and the program text (if present) are omitted from @code{ARGV}.
All other arguments, including variable assignments, are
included. As each element of @code{ARGV} is processed, @command{gawk}
sets @code{ARGIND} to the index in @code{ARGV} of the
current element. (@command{gawk} makes the full command line,
including program text and options, available in @code{PROCINFO["argv"]};
@pxref{Auto-set}.)
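
As a small sketch of what ends up in @code{ARGV}, the following
@code{BEGIN}-only program (which therefore never opens the named files)
prints each element, starting at index one. The @value{FN}s and the
variable names are placeholders:

@example
$ @kbd{gawk -v x=1 'BEGIN @{ for (i = 1; i < ARGC; i++) print i, ARGV[i] @}' file1 count=1 file2}
@print{} 1 file1
@print{} 2 count=1
@print{} 3 file2
@end example

@noindent
The @option{-v} option and the program text do not appear, but the
variable assignment @samp{count=1} does.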

Changing @code{ARGC} and @code{ARGV} in your @command{awk} program lets
you control how @command{awk} processes the input files; this is described
in more detail in @ref{ARGC and ARGV}.

@cindex input files @subentry variable assignments and
@cindex variable assignments and input files
The distinction between @value{FN} arguments and variable-assignment
arguments is made when @command{awk} is about to open the next input file.
At that point in execution, it checks the @value{FN} to see whether
it is really a variable assignment; if so, @command{awk} sets the variable
instead of reading a file.

Therefore, the variables actually receive the given values after all
previously specified files have been read. In particular, the values of
variables assigned in this fashion are @emph{not} available inside a
@code{BEGIN} rule
(@pxref{BEGIN/END}),
because such rules are run before @command{awk} begins scanning the argument list.

@cindex dark corner @subentry escape sequences
The variable values given on the command line are processed for escape
sequences (@pxref{Escape Sequences}).
@value{DARKCORNER}

In some very early implementations of @command{awk}, when a variable assignment
occurred before any @value{FN}s, the assignment would happen @emph{before}
the @code{BEGIN} rule was executed. @command{awk}'s behavior was thus
inconsistent; some command-line assignments were available inside the
@code{BEGIN} rule, while others were not. Unfortunately,
some applications came to depend
upon this ``feature.'' When @command{awk} was changed to be more consistent,
the @option{-v} option was added to accommodate applications that depended
upon the old behavior.
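
The difference in timing can be seen with a short experiment. This is
a sketch, assuming a system that provides @file{/dev/null} to serve as
an empty input file; the variable name @code{n} is arbitrary:

@example
$ @kbd{gawk 'BEGIN @{ print "BEGIN: [" n "]" @} END @{ print "END: [" n "]" @}' n=5 /dev/null}
@print{} BEGIN: []
@print{} END: [5]
$ @kbd{gawk -v n=5 'BEGIN @{ print "BEGIN: [" n "]" @}'}
@print{} BEGIN: [5]
@end example

@noindent
The command-line assignment takes effect only when @command{gawk}
reaches it in the argument list, whereas the @option{-v} assignment is
already in place when the @code{BEGIN} rule runs.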

The variable assignment feature is most useful for assigning to variables
such as @code{RS}, @code{OFS}, and @code{ORS}, which control input and
output formats, before scanning the @value{DF}s. It is also useful for
controlling state if multiple passes are needed over a @value{DF}. For
example:

@cindex files @subentry multiple passes over
@example
awk 'pass == 1 @{ @var{pass 1 stuff} @}
     pass == 2 @{ @var{pass 2 stuff} @}' pass=1 mydata pass=2 mydata
@end example

Given the variable assignment feature, the @option{-F} option for setting
the value of @code{FS} is not
strictly necessary. It remains for historical compatibility.

@sidebar Quoting Shell Variables On The @command{awk} Command Line
@cindex quoting @subentry in @command{gawk} command lines
@cindex shell quoting, rules for
@cindex null strings @subentry in @command{gawk} arguments, quoting and

Small @command{awk} programs are often embedded in larger shell scripts,
so it's worthwhile to understand some shell basics. Consider the following:

@example
f=""
awk '@{ print("hi") @}' $f
@end example

In this case, @command{awk} reads from standard input instead of trying
to open any command-line files. To the unwary, this looks like @command{awk}
is hanging.

However, @command{awk} doesn't see an explicit empty string. When a
variable expansion is the null string, @emph{and} it's not quoted,
the shell simply removes it from the command line. To demonstrate:

@example
$ @kbd{f=""}
$ @kbd{awk 'BEGIN @{ print ARGC @}' $f}
@print{} 1
$ @kbd{awk 'BEGIN @{ print ARGC @}' "$f"}
@print{} 2
@end example
@end sidebar

@node Naming Standard Input
@section Naming Standard Input

Often, you may wish to read standard input together with other files.
For example, you may wish to read one file, read standard input coming
from a pipe, and then read another file.

The way to name the standard input, with all versions of @command{awk},
is to use a single, standalone minus sign or dash, @samp{-}. For example:

@example
@var{some_command} | awk -f myprog.awk file1 - file2
@end example

@noindent
Here, @command{awk} first reads @file{file1}, then it reads
the output of @var{some_command}, and finally it reads
@file{file2}.

You may also use @code{"-"} to name standard input when reading
files with @code{getline} (@pxref{Getline/File}).
And, you can even use @code{"-"} with the @option{-f} option
to read program source code from standard input (@pxref{Options}).

In addition, @command{gawk} allows you to specify the special
@value{FN} @file{/dev/stdin}, both on the command line and
with @code{getline}.
Some other versions of @command{awk} also support this, but it
is not standard.
(Some operating systems provide a @file{/dev/stdin} file
in the filesystem; however, @command{gawk} always processes
this @value{FN} itself.)

@node Environment Variables
@section The Environment Variables @command{gawk} Uses
@cindex environment variables @subentry used by @command{gawk}

A number of environment variables influence how @command{gawk}
behaves.

@menu
* AWKPATH Variable::            Searching directories for @command{awk}
                                programs.
* AWKLIBPATH Variable::         Searching directories for @command{awk} shared
                                libraries.
* Other Environment Variables:: The environment variables.
@end menu

@node AWKPATH Variable
@subsection The @env{AWKPATH} Environment Variable
@cindex @env{AWKPATH} environment variable
@cindex environment variables @subentry @env{AWKPATH}
@cindex directories @subentry searching @subentry for source files
@cindex search paths @subentry for source files
@cindex differences in @command{awk} and @command{gawk} @subentry @env{AWKPATH} environment variable
@ifinfo
The previous @value{SECTION} described how @command{awk} program files can be named
on the command line with the @option{-f} option.
@end ifinfo
In most @command{awk}
implementations, you must supply a precise pathname for each program
file, unless the file is in the current directory.
But with @command{gawk}, if the @value{FN} supplied to the @option{-f}
or @option{-i} options
does not contain a directory separator @samp{/}, then @command{gawk} searches a list of
directories (called the @dfn{search path}) one by one, looking for a
file with the specified name.

The search path is a string consisting of directory names
separated by colons.@footnote{Semicolons on MS-Windows.}
@command{gawk} gets its search path from the
@env{AWKPATH} environment variable. If that variable does not exist,
or if it has an empty value,
@command{gawk} uses a default path (described shortly).

The search path feature is particularly helpful for building libraries
of useful @command{awk} functions. The library files can be placed in a
standard directory in the default path and then specified on
the command line with a short @value{FN}. Otherwise, you would have to
type the full @value{FN} for each file.

By using the @option{-i} or @option{-f} options, your command-line
@command{awk} programs can use facilities in @command{awk} library files
(@pxref{Library Functions}).
Path searching is not done if @command{gawk} is in compatibility mode.
This is true for both @option{--traditional} and @option{--posix}.
@xref{Options}.

If the source code file is not found after the initial search, the path is searched
again after adding the suffix @samp{.awk} to the @value{FN}.

@command{gawk}'s path search mechanism is similar
to the shell's.
(See @uref{https://www.gnu.org/software/bash/manual/,
@cite{The Bourne-Again SHell manual}}.)
It treats a null entry in the path as indicating the current
directory.
(A null entry is indicated by starting or ending the path with a
colon or by placing two colons next to each other [@samp{::}].)

@quotation NOTE
To include the current directory in the path, either place @file{.}
as an entry in the path or write a null entry in the path.

Different past versions of @command{gawk} would also look explicitly in
the current directory, either before or after the path search. As of
@value{PVERSION} 4.1.2, this no longer happens; if you wish to look
in the current directory, you must include @file{.} either as a separate
entry or as a null entry in the search path.
@end quotation

The default value for @env{AWKPATH} is
@samp{.:/usr/local/share/awk}.@footnote{Your version of @command{gawk}
may use a different directory; it
will depend upon how @command{gawk} was built and installed. The actual
directory is the value of @code{$(pkgdatadir)} generated when
@command{gawk} was configured.
(For more detail, see the @file{INSTALL} file in the source distribution,
and see @ref{Quick Installation}.
You probably don't need to worry about this,
though.)} Since @file{.} is included at the beginning, @command{gawk}
searches first in the current directory and then in @file{/usr/local/share/awk}.
In practice, this means that you will rarely need to change the
value of @env{AWKPATH}.

@xref{Shell Startup Files}, for information on functions that help to
manipulate the @env{AWKPATH} variable.

@command{gawk} places the value of the search path that it used into
@code{ENVIRON["AWKPATH"]}. This provides access to the actual search
path value from within an @command{awk} program.

Although you can change @code{ENVIRON["AWKPATH"]} within your @command{awk}
program, this has no effect on the running program's behavior. This makes
sense: the @env{AWKPATH} environment variable is used to find the program
source files. Once your program is running, all the files have been
found, and @command{gawk} no longer needs to use @env{AWKPATH}.

@node AWKLIBPATH Variable
@subsection The @env{AWKLIBPATH} Environment Variable
@cindex @env{AWKLIBPATH} environment variable
@cindex environment variables @subentry @env{AWKLIBPATH}
@cindex directories @subentry searching @subentry for loadable extensions
@cindex search paths @subentry for loadable extensions
@cindex differences in @command{awk} and @command{gawk} @subentry @env{AWKLIBPATH} environment variable

The @env{AWKLIBPATH} environment variable is similar to the @env{AWKPATH}
variable, but it is used to search for loadable extensions (stored as
system shared libraries) specified with the @option{-l} option rather
than for source files. If the extension is not found, the path is
searched again after adding the appropriate shared library suffix for
the platform. For example, on GNU/Linux systems, the suffix @samp{.so}
is used. The search path specified is also used for extensions loaded
via the @code{@@load} keyword (@pxref{Loading Shared Libraries}).

If @env{AWKLIBPATH} does not exist in the environment, or if it has
an empty value, @command{gawk} uses a default path; this
is typically @samp{/usr/local/lib/gawk}, although it can vary depending
upon how @command{gawk} was built.@footnote{Your version of @command{gawk}
may use a different directory; it
will depend upon how @command{gawk} was built and installed. The actual
directory is the value of @code{$(pkgextensiondir)} generated when
@command{gawk} was configured.
(For more detail, see the @file{INSTALL} file in the source distribution,
and see @ref{Quick Installation}.
You probably don't need to worry about this,
though.)}

@xref{Shell Startup Files}, for information on functions that help to
manipulate the @env{AWKLIBPATH} variable.

@command{gawk} places the value of the search path that it used into
@code{ENVIRON["AWKLIBPATH"]}. This provides access to the actual search
path value from within an @command{awk} program.

Although you can change @code{ENVIRON["AWKLIBPATH"]} within your
@command{awk} program, this has no effect on the running program's
behavior. This makes sense: the @env{AWKLIBPATH} environment variable
is used to find any requested extensions, and they are loaded before
the program starts to run. Once your program is running, all the
extensions have been found, and @command{gawk} no longer needs to use
@env{AWKLIBPATH}.

@node Other Environment Variables
@subsection Other Environment Variables

A number of other environment variables affect @command{gawk}'s
behavior, but they are more specialized. Those in the following
list are meant to be used by regular users:

@table @env
@item GAWK_MSEC_SLEEP
Specifies the interval between connection retries,
in milliseconds.
On systems that do not support
the @code{usleep()} system call,
the value is rounded up to an integral number of seconds.

@item GAWK_READ_TIMEOUT
Specifies the time, in milliseconds, for @command{gawk} to
wait for input before returning with an error.
@xref{Read Timeout}.

@item GAWK_SOCK_RETRIES
Controls the number of times @command{gawk} attempts to
retry a two-way TCP/IP (socket) connection before giving up.
@xref{TCP/IP Networking}.
Note that when nonfatal I/O is enabled (@pxref{Nonfatal}),
@command{gawk} only tries to open a TCP/IP socket once.

@item POSIXLY_CORRECT
Causes @command{gawk} to switch to POSIX-compatibility
mode, disabling all traditional and GNU extensions.
@xref{Options}.
@end table

The environment variables in the following list are meant
for use by the @command{gawk} developers for testing and tuning.
They are subject to change. The variables are:

@table @env
@item AWKBUFSIZE
This variable only affects @command{gawk} on POSIX-compliant systems.
With a value of @samp{exact}, @command{gawk} uses the size of each input
file as the size of the memory buffer to allocate for I/O. Otherwise,
the value should be a number, and @command{gawk} uses that number as
the size of the buffer to allocate. (When this variable is not set,
@command{gawk} uses the smaller of the file's size and the ``default''
blocksize, which is usually the filesystem's I/O blocksize.)

@item AWK_HASH
If this variable exists with a value of @samp{gst}, @command{gawk}
switches to using the hash function from GNU Smalltalk for
managing arrays.
This function may be marginally faster than the standard function.

@item AWKREADFUNC
If this variable exists, @command{gawk} switches to reading source
files one line at a time, instead of reading in blocks.
This exists
for debugging problems on filesystems on non-POSIX operating systems
where I/O is performed in records, not in blocks.

@item GAWK_MSG_SRC
If this variable exists, @command{gawk} includes the @value{FN}
and line number within the @command{gawk} source code
from which warning and/or fatal messages
are generated. Its purpose is to help isolate the source of a
message, as there are multiple places that produce the
same warning or error message.

@item GAWK_LOCALE_DIR
Specifies the location of compiled message object files
for @command{gawk} itself. This is passed to the @code{bindtextdomain()}
function when @command{gawk} starts up.

@item GAWK_NO_DFA
If this variable exists, @command{gawk} does not use the DFA regexp matcher
for ``does it match'' kinds of tests. This can cause @command{gawk}
to be slower. Its purpose is to help isolate differences between the
two regexp matchers that @command{gawk} uses internally. (There aren't
supposed to be differences, but occasionally theory and practice don't
coordinate with each other.)

@item GAWK_STACKSIZE
This specifies the amount by which @command{gawk} should grow its
internal evaluation stack, when needed.

@item INT_CHAIN_MAX
This specifies the intended maximum number of items @command{gawk} will maintain on a
hash chain for managing arrays indexed by integers.

@item STR_CHAIN_MAX
This specifies the intended maximum number of items @command{gawk} will maintain on a
hash chain for managing arrays indexed by strings.

@item TIDYMEM
If this variable exists, @command{gawk} uses the @code{mtrace()} library
calls from the GNU C library to help track down possible memory leaks.
@end table

@node Exit Status
@section @command{gawk}'s Exit Status

@cindex exit status, of @command{gawk}
If the @code{exit} statement is used with a value
(@pxref{Exit Statement}), then @command{gawk} exits with
the numeric value given to it.

Otherwise, if there were no problems during execution,
@command{gawk} exits with the value of the C constant
@code{EXIT_SUCCESS}. This is usually zero.

If an error occurs, @command{gawk} exits with the value of
the C constant @code{EXIT_FAILURE}. This is usually one.

If @command{gawk} exits because of a fatal error, the exit
status is two. On non-POSIX systems, this value may be mapped
to @code{EXIT_FAILURE}.

@node Include Files
@section Including Other Files into Your Program

@c Panos Papadopoulos <panos1962@gmail.com> contributed the original
@c text for this section.

This @value{SECTION} describes a feature that is specific to @command{gawk}.

@cindex @code{@@} (at-sign) @subentry @code{@@include} directive
@cindex at-sign (@code{@@}) @subentry @code{@@include} directive
@cindex file inclusion, @code{@@include} directive
@cindex including files, @code{@@include} directive
@cindex @code{@@include} directive @sortas{include directive}
The @code{@@include} keyword can be used to read external @command{awk} source
files. This gives you the ability to split large @command{awk} source files
into smaller, more manageable pieces, and also lets you reuse common @command{awk}
code from various @command{awk} scripts. In other words, you can group
together @command{awk} functions used to carry out specific tasks
into external files. These files can be used just like function libraries,
using the @code{@@include} keyword in conjunction with the @env{AWKPATH}
environment variable. Note that source files may also be included
using the @option{-i} option.

Let's see an example.
We'll start with two (trivial) @command{awk} scripts, namely
@file{test1} and @file{test2}. Here is the @file{test1} script:

@example
BEGIN @{
    print "This is script test1."
@}
@end example

@noindent
and here is @file{test2}:

@example
@@include "test1"
BEGIN @{
    print "This is script test2."
@}
@end example

Running @command{gawk} with @file{test2}
produces the following result:

@example
$ @kbd{gawk -f test2}
@print{} This is script test1.
@print{} This is script test2.
@end example

@command{gawk} runs the @file{test2} script, which includes @file{test1}
using the @code{@@include}
keyword. So, to include external @command{awk} source files, you just
use @code{@@include} followed by the name of the file to be included,
enclosed in double quotes.

@quotation NOTE
Keep in mind that this is a language construct and the @value{FN} cannot
be a string variable, but rather just a literal string constant in double quotes.
@end quotation

The files to be included may be nested; e.g., given a third
script, namely @file{test3}:

@example
@group
@@include "test2"
BEGIN @{
    print "This is script test3."
@}
@end group
@end example

@noindent
Running @command{gawk} with the @file{test3} script produces the
following results:

@example
$ @kbd{gawk -f test3}
@print{} This is script test1.
@print{} This is script test2.
@print{} This is script test3.
@end example

The @value{FN} can, of course, be a pathname. For example:

@example
@@include "../io_funcs"
@end example

@noindent
and:

@example
@@include "/usr/awklib/network"
@end example

@noindent
are both valid.
The @env{AWKPATH} environment variable can be of great
value when using @code{@@include}. The same rules for the use
of the @env{AWKPATH} variable in command-line file searches
(@pxref{AWKPATH Variable}) apply to
@code{@@include} also.

This is very helpful in constructing @command{gawk} function libraries.
If you have a large script with useful, general-purpose @command{awk}
functions, you can break it down into library files and put those files
in a special directory. You can then include those ``libraries,''
either by using the full pathnames of the files, or by setting the @env{AWKPATH}
environment variable accordingly and then using @code{@@include} with
just the file part of the full pathname. Of course,
you can keep library files in more than one directory;
the more complex the working
environment is, the more directories you may need to organize the files
to be included.

Given the ability to specify multiple @option{-f} options, the
@code{@@include} mechanism is not strictly necessary.
However, the @code{@@include} keyword
can help you in constructing self-contained @command{gawk} programs,
thus reducing the need for writing complex and tedious command lines.
In particular, @code{@@include} is very useful for writing CGI scripts
to be run from web pages.

The rules for finding a source file described in @ref{AWKPATH Variable} also
apply to files loaded with @code{@@include}.

Finally, files included with @code{@@include}
are treated as if they had @samp{@@namespace "awk"}
at their beginning. @xref{Changing The Namespace}, for more information.

@node Loading Shared Libraries
@section Loading Dynamic Extensions into Your Program

This @value{SECTION} describes a feature that is specific to @command{gawk}.

@cindex @code{@@} (at-sign) @subentry @code{@@load} directive
@cindex at-sign (@code{@@}) @subentry @code{@@load} directive
@cindex loading extensions @subentry @code{@@load} directive
@cindex extensions @subentry loadable @subentry loading, @code{@@load} directive
@cindex @code{@@load} directive @sortas{load directive}
The @code{@@load} keyword can be used to read external @command{awk} extensions
(stored as system shared libraries).
This allows you to link in compiled code that may offer superior
performance and/or give you access to extended capabilities not supported
by the @command{awk} language. The @env{AWKLIBPATH} variable is used to
search for the extension. Using @code{@@load} is completely equivalent
to using the @option{-l} command-line option.

If the extension is not initially found in @env{AWKLIBPATH}, another
search is conducted after appending the platform's default shared library
suffix to the @value{FN}. For example, on GNU/Linux systems, the suffix
@samp{.so} is used:

@example
$ @kbd{gawk '@@load "ordchr"; BEGIN @{print chr(65)@}'}
@print{} A
@end example

@noindent
This is equivalent to the following example:

@example
@group
$ @kbd{gawk -lordchr 'BEGIN @{print chr(65)@}'}
@print{} A
@end group
@end example

@noindent
For command-line usage, the @option{-l} option is more convenient,
but @code{@@load} is useful for embedding inside an @command{awk} source file
that requires access to an extension.

@ref{Dynamic Extensions}, describes how to write extensions (in C or C++)
that can be loaded with either @code{@@load} or the @option{-l} option.
It also describes the @code{ordchr} extension.

@node Obsolete
@section Obsolete Options and/or Features

@c update this section for each release!

@cindex options @subentry deprecated
@cindex features @subentry deprecated
@cindex obsolete features
This @value{SECTION} describes features and/or command-line options from
previous releases of @command{gawk} that either are not available in the
current version or are still supported but deprecated (meaning that
they will @emph{not} be in the next release).

The process-related special files @file{/dev/pid}, @file{/dev/ppid},
@file{/dev/pgrpid}, and @file{/dev/user} were deprecated in @command{gawk}
3.1, but still worked. As of @value{PVERSION} 4.0, they are no longer
interpreted specially by @command{gawk}. (Use @code{PROCINFO} instead;
see @ref{Auto-set}.)

@ignore
This @value{SECTION}
is thus essentially a place holder,
in case some option becomes obsolete in a future version of @command{gawk}.
@end ignore

@node Undocumented
@section Undocumented Options and Features
@cindex undocumented features
@cindex features @subentry undocumented
@cindex Skywalker, Luke
@cindex Kenobi, Obi-Wan
@cindex jedi knights
@cindex knights, jedi
@quotation
@i{Use the Source, Luke!}
@author Obi-Wan
@end quotation

@cindex shells @subentry sea
This @value{SECTION} intentionally left
blank.

@ignore
@c If these came out in the Info file or TeX document, then they wouldn't
@c be undocumented, would they?

@command{gawk} has one undocumented option:

@table @code
@item -W nostalgia
@itemx --nostalgia
Print the message @samp{awk: bailing out near line 1} and dump core.
This option was inspired by the common behavior of very early versions of
Unix @command{awk} and by a t-shirt.
The message is @emph{not} subject to translation in non-English locales.
@c so there! nyah, nyah.
@end table

Early versions of @command{awk} used to not require any separator (either
a newline or @samp{;}) between the rules in @command{awk} programs. Thus,
it was common to see one-line programs like:

@example
awk '@{ sum += $1 @} END @{ print sum @}'
@end example

@command{gawk} actually supports this but it is purposely undocumented
because it is bad style. The correct way to write such a program
is either:

@example
awk '@{ sum += $1 @} ; END @{ print sum @}'
@end example

@noindent
or:

@example
awk '@{ sum += $1 @}
     END @{ print sum @}' data
@end example

@noindent
@xref{Statements/Lines}, for a fuller explanation.

You can insert newlines after the @samp{;} in @code{for} loops.
This seems to have been a long-undocumented feature in Unix @command{awk}.

Similarly, you may use @code{print} or @code{printf} statements in the
@var{init} and @var{increment} parts of a @code{for} loop. This is another
long-undocumented ``feature'' of Unix @command{awk}.

@command{gawk} lets you use the names of built-in functions that are
@command{gawk} extensions as the names of parameters in user-defined functions.
This is intended to ``future-proof'' old code that happens to use
function names added by @command{gawk} after the code was written.
Standard @command{awk} built-in functions, such as @code{sin()} or
@code{substr()}, are @emph{not} shadowed in this way.

You can use a @samp{P} modifier for the @code{printf()} floating-point
format control letters to use the underlying C library's result for
NaN and Infinity values, instead of the special values @command{gawk}
usually produces, as described in @ref{POSIX Floating Point Problems}.
This is mainly useful for the included unit tests.

The @code{typeof()} built-in function
(@pxref{Type Functions})
takes an optional second array argument that, if present, will be cleared
and populated with some information about the internal implementation of
the variable. This can be useful for debugging. At the moment, this
returns a textual version of the flags for scalar variables, and the
array back-end implementation type for arrays. This interface is subject
to change and may not be stable.

When not in POSIX or compatibility mode, if you set @code{LINENO} to a
numeric value using the @option{-v} option, @command{gawk} adds that value
to the real line number for use in error messages. This is intended for
use within Bash shell scripts, such that the error message will reflect
the line number in the shell script, instead of in the @command{awk}
program. To demonstrate:

@example
$ @kbd{gawk -v LINENO=10 'BEGIN @{ print("hi" @}'}
@error{} gawk: cmd. line:11: BEGIN @{ print("hi" @}
@error{} gawk: cmd. line:11: ^ syntax error
@end example

@end ignore

@node Invoking Summary
@section Summary

@itemize @value{BULLET}

@c From Neil R. Ormos
@item
@command{gawk} parses arguments on the command line, left to right, to
determine if they should be treated as options or as non-option arguments.

@item
@command{gawk} recognizes several options which control its operation,
as described in @ref{Options}. All options begin with @samp{-}.

@item
Any argument that is not recognized as an option is treated as a
non-option argument, even if it begins with @samp{-}.

@itemize @value{MINUS}
@item
However, when an option itself requires an argument, and the option is separated
from that argument on the command line by at least one space, the space
is ignored, and the argument is considered to be related to the option.
Thus, in
the invocation @samp{gawk -F x}, the @samp{x} is treated as belonging to the
@option{-F} option, not as a separate non-option argument.
@end itemize

@item
Once @command{gawk} finds a non-option argument, it stops looking for
options. Therefore, all following arguments are also non-option arguments,
even if they resemble recognized options.

@item
If no @option{-e} or @option{-f} options are present, @command{gawk}
expects the program text to be in the first non-option argument.

@item
All non-option arguments, except program text provided in the first
non-option argument, are placed in @code{ARGV} as explained in
@ref{ARGC and ARGV}, and are processed as described in @ref{Other Arguments}.
@c And I wrote:
Adjusting @code{ARGC} and @code{ARGV}
affects how @command{awk} processes input.

@c ----------------------------------------

@item
The three standard options for all versions of @command{awk} are
@option{-f}, @option{-F}, and @option{-v}. @command{gawk} supplies these
and many others, as well as corresponding GNU-style long options.

@item
Non-option command-line arguments are usually treated as @value{FN}s,
unless they have the form @samp{@var{var}=@var{value}}, in which case
they are taken as variable assignments to be performed at that point
in processing the input.

@item
You can use a single minus sign (@samp{-}) to refer to standard input
on the command line. @command{gawk} also lets you use the special
@value{FN} @file{/dev/stdin}.

@item
@command{gawk} pays attention to a number of environment variables.
@env{AWKPATH}, @env{AWKLIBPATH}, and @env{POSIXLY_CORRECT} are the
most important ones.

@item
@command{gawk}'s exit status conveys information to the program
that invoked it.
Use the @code{exit} statement from within
an @command{awk} program to set the exit status.

@item
@command{gawk} allows you to include other @command{awk} source files into
your program using the @code{@@include} statement and/or the @option{-i}
and @option{-f} command-line options.

@item
@command{gawk} allows you to load additional functions written in C
or C++ using the @code{@@load} statement and/or the @option{-l} option.
(This advanced feature is described later, in @ref{Dynamic Extensions}.)
@end itemize

@node Regexp
@chapter Regular Expressions
@cindex regexp
@cindex regular expressions

A @dfn{regular expression}, or @dfn{regexp}, is a way of describing a
set of strings.
Because regular expressions are such a fundamental part of @command{awk}
programming, their format and use deserve a separate @value{CHAPTER}.

@cindex forward slash (@code{/}) @subentry to enclose regular expressions
@cindex @code{/} (forward slash) @subentry to enclose regular expressions
A regular expression enclosed in slashes (@samp{/})
is an @command{awk} pattern that matches every input record whose text
belongs to that set.
The simplest regular expression is a sequence of letters, numbers, or
both. Such a regexp matches any string that contains that sequence.
Thus, the regexp @samp{foo} matches any string containing @samp{foo}.
Likewise, the pattern @code{/foo/} matches any input record containing
the three adjacent characters @samp{foo} @emph{anywhere} in the record. Other
kinds of regexps let you specify more complicated classes of strings.

@ifnotinfo
Initially, the examples in this @value{CHAPTER} are simple.
As we explain more about how
regular expressions work, we present more complicated instances.
@end ifnotinfo

@menu
* Regexp Usage::                How to Use Regular Expressions.
* Escape Sequences::            How to write nonprinting characters.
* Regexp Operators::            Regular Expression Operators.
* Bracket Expressions::         What can go between @samp{[...]}.
* Leftmost Longest::            How much text matches.
* Computed Regexps::            Using Dynamic Regexps.
* GNU Regexp Operators::        Operators specific to GNU software.
* Case-sensitivity::            How to do case-insensitive matching.
* Regexp Summary::              Regular expressions summary.
@end menu

@node Regexp Usage
@section How to Use Regular Expressions

@cindex patterns @subentry regexp constants as
@cindex regular expressions @subentry as patterns
A regular expression can be used as a pattern by enclosing it in
slashes. Then the regular expression is tested against the
entire text of each record. (Normally, it only needs
to match some part of the text in order to succeed.) For example, the
following prints the second field of each record where the string
@samp{li} appears anywhere in the record:

@example
$ @kbd{awk '/li/ @{ print $2 @}' mail-list}
@print{} 555-5553
@print{} 555-0542
@print{} 555-6699
@print{} 555-3430
@end example

@cindex regular expressions @subentry operators
@cindex operators @subentry string-matching
@c @cindex operators, @code{~}
@cindex string-matching operators
@cindex @code{~} (tilde), @code{~} operator
@cindex tilde (@code{~}), @code{~} operator
@cindex @code{!} (exclamation point) @subentry @code{!~} operator
@cindex exclamation point (@code{!}) @subentry @code{!~} operator
@c @cindex operators, @code{!~}
@cindex @code{if} statement @subentry use of regexps in
@cindex @code{while} statement @subentry use of regexps in
@cindex @code{do}-@code{while} statement @subentry use of regexps in
@c @cindex statements, @code{if}
@c @cindex statements, @code{while}
@c @cindex statements, @code{do}
Regular expressions can also be
used in matching expressions. These
expressions allow you to specify the string to match against; it need
not be the entire current input record. The two operators @samp{~}
and @samp{!~} perform regular expression comparisons. Expressions
using these operators can be used as patterns, or in @code{if},
@code{while}, @code{for}, and @code{do} statements.
(@xref{Statements}.)
For example, the following is true if the expression @var{exp} (taken
as a string) matches @var{regexp}:

@example
@var{exp} ~ /@var{regexp}/
@end example

@noindent
This example matches, or selects, all input records with the uppercase
letter @samp{J} somewhere in the first field:

@example
$ @kbd{awk '$1 ~ /J/' inventory-shipped}
@print{} Jan 13 25 15 115
@print{} Jun 31 42 75 492
@print{} Jul 24 34 67 436
@print{} Jan 21 36 64 620
@end example

So does this:

@example
awk '@{ if ($1 ~ /J/) print @}' inventory-shipped
@end example

This next example is true if the expression @var{exp}
(taken as a character string)
does @emph{not} match @var{regexp}:

@example
@var{exp} !~ /@var{regexp}/
@end example

The following example matches,
or selects, all input records whose first field @emph{does not} contain
the uppercase letter @samp{J}:

@example
$ @kbd{awk '$1 !~ /J/' inventory-shipped}
@print{} Feb 15 32 24 226
@print{} Mar 15 24 34 228
@print{} Apr 31 52 63 420
@print{} May 16 34 29 208
@dots{}
@end example

@cindex regexp constants
@cindex constants @subentry regexp
@cindex regular expressions, constants @seeentry{regexp constants}
When a regexp is enclosed in slashes, such as @code{/foo/}, we call it
a @dfn{regexp constant}, much like @code{5.27} is a numeric constant and
@code{"foo"} is a string constant.
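
The following shell fragment is a minimal sketch of the two matching
operators and is not one of the distributed examples. It assumes three
records copied from the @file{inventory-shipped} output shown earlier,
supplied on standard input instead of from the file; the shell variable
names @code{data}, @code{matched}, and @code{unmatched} are illustrative only:

```shell
# Three records taken from the inventory-shipped output shown earlier.
data='Jan 13 25 15 115
Feb 15 32 24 226
Jun 31 42 75 492'

# Pattern form of ~ : select records whose first field contains an uppercase J.
matched=$(printf '%s\n' "$data" | awk '$1 ~ /J/')

# Pattern form of !~ : select records whose first field does not contain J.
unmatched=$(printf '%s\n' "$data" | awk '$1 !~ /J/')

printf 'matched:\n%s\n' "$matched"
printf 'unmatched:\n%s\n' "$unmatched"
```

Because each rule consists of a pattern with no action, @command{awk}
applies its default action and prints the entire matching record.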

@node Escape Sequences
@section Escape Sequences

@cindex escape sequences
@cindex escape sequences @seealso{backslash}
@cindex backslash (@code{\}) @subentry in escape sequences
@cindex @code{\} (backslash) @subentry in escape sequences
Some characters cannot be included literally in string constants
(@code{"foo"}) or regexp constants (@code{/foo/}).
Instead, they should be represented with @dfn{escape sequences},
which are character sequences beginning with a backslash (@samp{\}).
One use of an escape sequence is to include a double-quote character in
a string constant. Because a plain double quote ends the string, you
must use @samp{\"} to represent an actual double-quote character as a
part of the string. For example:

@example
$ @kbd{awk 'BEGIN @{ print "He said \"hi!\" to her." @}'}
@print{} He said "hi!" to her.
@end example

The backslash character itself is another character that cannot be
included normally; you must write @samp{\\} to put one backslash in the
string or regexp. Thus, the string whose contents are the two characters
@samp{"} and @samp{\} must be written @code{"\"\\"}.

Other escape sequences represent unprintable characters
such as TAB or newline. There is nothing to stop you from entering most
unprintable characters directly in a string constant or regexp constant,
but they may look ugly.

The following list presents
all the escape sequences used in @command{awk} and
what they represent. Unless noted otherwise, all these escape
sequences apply to both string constants and regexp constants:

@cindex ASCII
@table @code
@item \\
A literal backslash, @samp{\}.

@c @cindex @command{awk} language, V.4 version
@cindex @code{\} (backslash) @subentry @code{\a} escape sequence
@cindex backslash (@code{\}) @subentry @code{\a} escape sequence
@item \a
The ``alert'' character, @kbd{Ctrl-g}, ASCII code 7 (BEL).
(This often makes some sort of audible noise.)

@cindex @code{\} (backslash) @subentry @code{\b} escape sequence
@cindex backslash (@code{\}) @subentry @code{\b} escape sequence
@item \b
Backspace, @kbd{Ctrl-h}, ASCII code 8 (BS).

@cindex @code{\} (backslash) @subentry @code{\f} escape sequence
@cindex backslash (@code{\}) @subentry @code{\f} escape sequence
@item \f
Formfeed, @kbd{Ctrl-l}, ASCII code 12 (FF).

@cindex @code{\} (backslash) @subentry @code{\n} escape sequence
@cindex backslash (@code{\}) @subentry @code{\n} escape sequence
@item \n
Newline, @kbd{Ctrl-j}, ASCII code 10 (LF).

@cindex @code{\} (backslash) @subentry @code{\r} escape sequence
@cindex backslash (@code{\}) @subentry @code{\r} escape sequence
@item \r
Carriage return, @kbd{Ctrl-m}, ASCII code 13 (CR).

@cindex @code{\} (backslash) @subentry @code{\t} escape sequence
@cindex backslash (@code{\}) @subentry @code{\t} escape sequence
@item \t
Horizontal TAB, @kbd{Ctrl-i}, ASCII code 9 (HT).

@c @cindex @command{awk} language, V.4 version
@cindex @code{\} (backslash) @subentry @code{\v} escape sequence
@cindex backslash (@code{\}) @subentry @code{\v} escape sequence
@item \v
Vertical TAB, @kbd{Ctrl-k}, ASCII code 11 (VT).

@cindex @code{\} (backslash) @subentry @code{\}@var{nnn} escape sequence
@cindex backslash (@code{\}) @subentry @code{\}@var{nnn} escape sequence
@item \@var{nnn}
The octal value @var{nnn}, where @var{nnn} stands for 1 to 3 digits
between @samp{0} and @samp{7}. For example, the code for the ASCII ESC
(escape) character is @samp{\033}.

@c @cindex @command{awk} language, V.4 version
@c @cindex @command{awk} language, POSIX version
@cindex @code{\} (backslash) @subentry @code{\x} escape sequence
@cindex backslash (@code{\}) @subentry @code{\x} escape sequence
@cindex common extensions @subentry @code{\x} escape sequence
@cindex extensions @subentry common @subentry @code{\x} escape sequence
@item \x@var{hh}@dots{}
The hexadecimal value @var{hh}, where @var{hh} stands for a sequence
of hexadecimal digits (@samp{0}--@samp{9}, and either @samp{A}--@samp{F}
or @samp{a}--@samp{f}). A maximum of two digits is allowed after
the @samp{\x}. Any further hexadecimal digits are treated as simple
letters or numbers. @value{COMMONEXT}
(The @samp{\x} escape sequence is not allowed in POSIX @command{awk}.)

@quotation CAUTION
In ISO C, the escape sequence continues until the first non-hexadecimal
digit is seen.
For many years, @command{gawk} would continue incorporating
hexadecimal digits into the value until a non-hexadecimal digit
or the end of the string was encountered.
However, using more than two hexadecimal digits produced
undefined results.
As of @value{PVERSION} 4.2, only two digits
are processed.
@end quotation

@cindex @code{\} (backslash) @subentry @code{\/} escape sequence
@cindex backslash (@code{\}) @subentry @code{\/} escape sequence
@item \/
A literal slash (should be used for regexp constants only).
This sequence is used when you want to write a regexp
constant that contains a slash
(such as @code{/.*:\/home\/[[:alnum:]]+:.*/}; the @samp{[[:alnum:]]}
notation is discussed in @ref{Bracket Expressions}).
Because the regexp is delimited by
slashes, you need to escape any slash that is part of the pattern,
in order to tell @command{awk} to keep processing the rest of the regexp.

@cindex @code{\} (backslash) @subentry @code{\"} escape sequence
@cindex backslash (@code{\}) @subentry @code{\"} escape sequence
@item \"
A literal double quote (should be used for string constants only).
This sequence is used when you want to write a string
constant that contains a double quote
(such as @code{"He said \"hi!\" to her."}).
Because the string is delimited by
double quotes, you need to escape any quote that is part of the string,
in order to tell @command{awk} to keep processing the rest of the string.
@end table

In @command{gawk}, a number of additional two-character sequences that begin
with a backslash have special meaning in regexps.
@xref{GNU Regexp Operators}.

In a regexp, a backslash before any character that is not in the previous list
and not listed in
@ref{GNU Regexp Operators}
means that the next character should be taken literally, even if it would
normally be a regexp operator. For example, @code{/a\+b/} matches the three
characters @samp{a+b}.

@cindex backslash (@code{\}) @subentry in escape sequences
@cindex @code{\} (backslash) @subentry in escape sequences
@cindex portability
For complete portability, do not use a backslash before any character that
is neither shown in the previous list nor a regexp operator.
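
As a small illustration of the literal-backslash rule, the following
sketch compares an escaped and an unescaped plus sign. The two-line
sample input and the shell variable names @code{sample}, @code{literal},
and @code{operator} are made up for this sketch:

```shell
# Made-up sample input: one line with a literal plus sign, one without.
sample='a+b
aab'

# Escaped: the backslash makes + an ordinary character,
# so only the line containing the text a+b is selected.
literal=$(printf '%s\n' "$sample" | awk '/a\+b/')

# Unescaped: + is the "one or more" operator applied to the preceding a,
# so the regexp needs one or more a's followed directly by b.
operator=$(printf '%s\n' "$sample" | awk '/a+b/')

printf 'escaped   /a\\+b/ selects: %s\n' "$literal"
printf 'unescaped /a+b/  selects: %s\n' "$operator"
```

The shell's single quotes pass the backslash through to @command{awk}
unchanged, so no extra quoting is needed for the regexp constant.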

@c 11/2014: Moved so as to not stack sidebars
@sidebar Backslash Before Regular Characters
@cindex portability @subentry backslash in escape sequences
@cindex POSIX @command{awk} @subentry backslashes in string constants
@cindex backslash (@code{\}) @subentry in escape sequences @subentry POSIX and
@cindex @code{\} (backslash) @subentry in escape sequences @subentry POSIX and

@cindex troubleshooting @subentry backslash before nonspecial character
If you place a backslash in a string constant before something that is
not one of the characters previously listed, POSIX @command{awk} purposely
leaves what happens as undefined. There are two choices:

@c @cindex automatic warnings
@c @cindex warnings, automatic
@cindex Brian Kernighan's @command{awk}
@table @asis
@item Strip the backslash out
This is what BWK @command{awk} and @command{gawk} both do.
For example, @code{"a\qc"} is the same as @code{"aqc"}.
(Because this is such an easy bug both to introduce and to miss,
@command{gawk} warns you about it.)
Consider @samp{FS = @w{"[ \t]+\|[ \t]+"}}, which is meant to use vertical bars
surrounded by whitespace as the field separator. There should be
two backslashes in the string: @samp{FS = @w{"[ \t]+\\|[ \t]+"}}.
@c I did this! This is why I added the warning.

@cindex @command{gawk} @subentry escape sequences
@cindex @command{gawk} @subentry escape sequences @seealso{backslash}
@cindex Unix @command{awk} @subentry backslashes in escape sequences
@cindex @command{mawk} utility
@item Leave the backslash alone
Some other @command{awk} implementations do this.
In such implementations, typing @code{"a\qc"} is the same as typing
@code{"a\\qc"}.
@end table
@end sidebar

To summarize:

@itemize @value{BULLET}
@item
The escape sequences in the preceding list are always processed first,
for both string constants and regexp constants. This happens very early,
as soon as @command{awk} reads your program.

@item
@command{gawk} processes both regexp constants and dynamic regexps
(@pxref{Computed Regexps})
for the special operators listed in
@ref{GNU Regexp Operators}.

@item
A backslash before any other character means to treat that character
literally.
@end itemize

@sidebar Escape Sequences for Metacharacters
@cindex metacharacters @subentry escape sequences for

Suppose you use an octal or hexadecimal
escape to represent a regexp metacharacter.
(See @ref{Regexp Operators}.)
Does @command{awk} treat the character as a literal character or as a regexp
operator?

@cindex dark corner @subentry escape sequences @subentry for metacharacters
Historically, such characters were taken literally.
@value{DARKCORNER}
However, the POSIX standard indicates that they should be treated
as real metacharacters, which is what @command{gawk} does.
In compatibility mode (@pxref{Options}),
@command{gawk} treats the characters represented by octal and hexadecimal
escape sequences literally when used in regexp constants. Thus,
in compatibility mode, @code{/a\52b/} is equivalent to @code{/a\*b/}.
@end sidebar

@node Regexp Operators
@section Regular Expression Operators
@cindex regular expressions @subentry operators
@cindex metacharacters @subentry in regular expressions

You can combine regular expressions with special characters,
called @dfn{regular expression operators} or @dfn{metacharacters}, to
increase the power and versatility of regular expressions.

@menu
* Regexp Operator Details::     The actual details.
* Interval Expressions::        Notes on interval expressions.
@end menu

@node Regexp Operator Details
@subsection Regexp Operators in @command{awk}

The escape sequences described
@ifnotinfo
earlier
@end ifnotinfo
in @ref{Escape Sequences}
are valid inside a regexp. They are introduced by a @samp{\} and
are recognized and converted into corresponding real characters as
the very first step in processing regexps.

Here is a list of metacharacters. All characters that are not escape
sequences and that are not listed here stand for themselves:

@c Use @asis so the docbook comes out ok. Sigh.
@table @asis
@cindex backslash (@code{\}) @subentry regexp operator
@cindex @code{\} (backslash) @subentry regexp operator
@item @code{\}
This suppresses the special meaning of a character when
matching. For example, @samp{\$}
matches the character @samp{$}.

@cindex regular expressions @subentry anchors in
@cindex Texinfo @subentry chapter beginnings in files
@cindex @code{^} (caret) @subentry regexp operator
@cindex caret (@code{^}) @subentry regexp operator
@item @code{^}
This matches the beginning of a string. @samp{^@@chapter}
matches @samp{@@chapter} at the beginning of a string,
for example, and can be used
to identify chapter beginnings in Texinfo source files.
The @samp{^} is known as an @dfn{anchor}, because it anchors the pattern to
match only at the beginning of the string.

It is important to realize that @samp{^} does not match the beginning of
a line (the point right after a @samp{\n} newline character) embedded in a string.
The condition is not true in the following example:

@example
if ("line1\nLINE 2" ~ /^L/) @dots{}
@end example

@cindex @code{$} (dollar sign) @subentry regexp operator
@cindex dollar sign (@code{$}) @subentry regexp operator
@item @code{$}
This is similar to @samp{^}, but it matches only at the end of a string.
For example, @samp{p$}
matches a record that ends with a @samp{p}. The @samp{$} is an anchor
and does not match the end of a line
(the point right before a @samp{\n} newline character)
embedded in a string.
The condition in the following example is not true:

@example
if ("line1\nLINE 2" ~ /1$/) @dots{}
@end example

@cindex @code{.} (period), regexp operator
@cindex period (@code{.}), regexp operator
@item @code{.} (period)
This matches any single character,
@emph{including} the newline character. For example, @samp{.P}
matches any single character followed by a @samp{P} in a string. Using
concatenation, we can make a regular expression such as @samp{U.A}, which
matches any three-character sequence that begins with @samp{U} and ends
with @samp{A}.

@cindex POSIX mode
@cindex POSIX @command{awk} @subentry period (@code{.}), using
In strict POSIX mode (@pxref{Options}),
@samp{.} does not match the @sc{nul}
character, which is a character with all bits equal to zero.
Otherwise, @sc{nul} is just another character. Other versions of @command{awk}
may not be able to match the @sc{nul} character.

@cindex @code{[]} (square brackets), regexp operator
@cindex square brackets (@code{[]}), regexp operator
@cindex bracket expressions
@cindex character sets (in regular expressions) @seeentry{bracket expressions}
@cindex character lists @seeentry{bracket expressions}
@cindex character classes @seeentry{bracket expressions}
@item @code{[}@dots{}@code{]}
This is called a @dfn{bracket expression}.@footnote{In other literature,
you may see a bracket expression referred to as either a
@dfn{character set}, a @dfn{character class}, or a @dfn{character list}.}
It matches any @emph{one} of the characters that are enclosed in
the square brackets. For example, @samp{[MVX]} matches any one of
the characters @samp{M}, @samp{V}, or @samp{X} in a string. A full
discussion of what can be inside the square brackets of a bracket expression
is given in
@ref{Bracket Expressions}.

@cindex bracket expressions @subentry complemented
@item @code{[^}@dots{}@code{]}
This is a @dfn{complemented bracket expression}. The first character after
the @samp{[} @emph{must} be a @samp{^}. It matches any single character
@emph{except} those in the square brackets. For example, @samp{[^awk]}
matches any character that is not an @samp{a}, @samp{w},
or @samp{k}.

@cindex @code{|} (vertical bar)
@cindex vertical bar (@code{|})
@item @code{|}
This is the @dfn{alternation operator} and it is used to specify
alternatives. The @samp{|} has the lowest precedence of all the regular
expression operators. For example, @samp{^P|[aeiouy]} matches any string
that matches either @samp{^P} or @samp{[aeiouy]}. This means it matches
any string that starts with @samp{P} or contains (anywhere within it)
a lowercase English vowel.

The alternation applies to the largest possible regexps on either side.

@cindex @code{()} (parentheses) @subentry regexp operator
@cindex parentheses @code{()} @subentry regexp operator
@item @code{(}@dots{}@code{)}
Parentheses are used for grouping in regular expressions, as in
arithmetic. They can be used to concatenate regular expressions
containing the alternation operator, @samp{|}. For example,
@samp{@@(samp|code)\@{[^@}]+\@}} matches both @samp{@@code@{foo@}} and
@samp{@@samp@{bar@}}.
(These are Texinfo formatting control sequences. The @samp{+} is
explained further on in this list.)

The left or opening parenthesis is always a metacharacter; to match
one literally, precede it with a backslash. However, the right or
closing parenthesis is only special when paired with a left parenthesis;
an unpaired right parenthesis is (silently) treated as a regular character.

@cindex @code{*} (asterisk) @subentry @code{*} operator @subentry as regexp operator
@cindex asterisk (@code{*}) @subentry @code{*} operator @subentry as regexp operator
@item @code{*}
This symbol means that the preceding regular expression should be
repeated as many times as necessary to find a match. For example, @samp{ph*}
applies the @samp{*} symbol to the preceding @samp{h} and looks for matches
of one @samp{p} followed by any number of @samp{h}s. This also matches
just @samp{p} if no @samp{h}s are present.

There are two subtle points to understand about how @samp{*} works.
First, the @samp{*} applies only to the single preceding regular expression
component (e.g., in @samp{ph*}, it applies just to the @samp{h}).
To cause @samp{*} to apply to a larger subexpression, use parentheses:
@samp{(ph)*} matches @samp{ph}, @samp{phph}, @samp{phphph}, and so on.

Second, @samp{*} finds as many repetitions as possible. If the text
to be matched is @samp{phhhhhhhhhhhhhhooey}, @samp{ph*} matches all of
the @samp{h}s.

@cindex @code{+} (plus sign) @subentry regexp operator
@cindex plus sign (@code{+}) @subentry regexp operator
@item @code{+}
This symbol is similar to @samp{*}, except that the preceding expression must be
matched at least once. This means that @samp{wh+y}
would match @samp{why} and @samp{whhy}, but not @samp{wy}, whereas
@samp{wh*y} would match all three.

@cindex @code{?} (question mark) @subentry regexp operator
@cindex question mark (@code{?}) @subentry regexp operator
@item @code{?}
This symbol is similar to @samp{*}, except that the preceding expression can be
matched either once or not at all. For example, @samp{fe?d}
matches @samp{fed} and @samp{fd}, but nothing else.

@cindex @code{@{@}} (braces) @subentry regexp operator
@cindex braces (@code{@{@}}) @subentry regexp operator
@cindex interval expressions, regexp operator
@item @code{@{}@var{n}@code{@}}
@itemx @code{@{}@var{n}@code{,@}}
@itemx @code{@{}@var{n}@code{,}@var{m}@code{@}}
One or two numbers inside braces denote an @dfn{interval expression}.
If there is one number in the braces, the preceding regexp is repeated
@var{n} times.
If there are two numbers separated by a comma, the preceding regexp is
repeated @var{n} to @var{m} times.
If there is one number followed by a comma, then the preceding regexp
is repeated at least @var{n} times:

@table @code
@item wh@{3@}y
Matches @samp{whhhy}, but not @samp{why} or @samp{whhhhy}.

@item wh@{3,5@}y
Matches @samp{whhhy}, @samp{whhhhy}, or @samp{whhhhhy} only.

@item wh@{2,@}y
Matches @samp{whhy}, @samp{whhhy}, and so on.
5744@end table 5745@end table 5746 5747@cindex precedence @subentry regexp operators 5748@cindex regular expressions @subentry operators @subentry precedence of 5749In regular expressions, the @samp{*}, @samp{+}, and @samp{?} operators, 5750as well as the braces @samp{@{} and @samp{@}}, 5751have 5752the highest precedence, followed by concatenation, and finally by @samp{|}. 5753As in arithmetic, parentheses can change how operators are grouped. 5754 5755@cindex POSIX @command{awk} @subentry regular expressions and 5756@cindex @command{gawk} @subentry regular expressions @subentry precedence 5757In POSIX @command{awk} and @command{gawk}, the @samp{*}, @samp{+}, and 5758@samp{?} operators stand for themselves when there is nothing in the 5759regexp that precedes them. For example, @code{/+/} matches a literal 5760plus sign. However, many other versions of @command{awk} treat such a 5761usage as a syntax error. 5762 5763@sidebar What About The Empty Regexp? 5764@cindex empty regexps 5765@cindex regexps, empty 5766We describe here an advanced regexp usage. Feel free to skip it 5767upon first reading. 5768 5769You can supply an empty regexp constant (@samp{//}) in all places 5770where a regexp is expected. Is this useful? What does it match? 5771 5772It is useful. It matches the (invisible) empty string at the start 5773and end of a string of characters, as well as the empty string 5774between characters. This is best illustrated with the @code{gsub()} 5775function, which makes global substitutions in a string 5776(@pxref{String Functions}). 
Normal usage of @code{gsub()} is like 5777so: 5778 5779@example 5780$ @kbd{awk '} 5781> @kbd{BEGIN @{} 5782> @kbd{ x = "ABC_CBA"} 5783> @kbd{ gsub(/B/, "bb", x)} 5784> @kbd{ print x} 5785> @kbd{@}'} 5786@print{} AbbC_CbbA 5787@end example 5788 5789We can use @code{gsub()} to see where the empty strings 5790are that match the empty regexp: 5791 5792@example 5793$ @kbd{awk '} 5794> @kbd{BEGIN @{} 5795> @kbd{ x = "ABC"} 5796> @kbd{ gsub(//, "x", x)} 5797> @kbd{ print x} 5798> @kbd{@}'} 5799@print{} xAxBxCx 5800@end example 5801@end sidebar 5802 5803@node Interval Expressions 5804@subsection Some Notes On Interval Expressions 5805 5806@cindex POSIX @command{awk} @subentry interval expressions in 5807Interval expressions were not traditionally available in @command{awk}. 5808They were added as part of the POSIX standard to make @command{awk} 5809and @command{egrep} consistent with each other. 5810 5811@cindex @command{gawk} @subentry interval expressions and 5812Initially, because old programs may use @samp{@{} and @samp{@}} in regexp 5813constants, 5814@command{gawk} did @emph{not} match interval expressions 5815in regexps. 5816 5817However, beginning with @value{PVERSION} 4.0, 5818@command{gawk} does match interval expressions by default. 5819This is because compatibility with POSIX has become more 5820important to most @command{gawk} users than compatibility with 5821old programs. 5822 5823For programs that use @samp{@{} and @samp{@}} in regexp constants, 5824it is good practice to always escape them with a backslash. Then the 5825regexp constants are valid and work the way you want them to, using 5826any version of @command{awk}.@footnote{Use two backslashes if you're 5827using a string constant with a regexp operator or function.} 5828 5829When @samp{@{} and @samp{@}} appear in regexp constants 5830in a way that cannot be interpreted as an interval expression 5831(such as @code{/q@{a@}/}), then they stand for themselves. 
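
As a brief illustration of the points above, the following sketch
assumes @command{gawk} @value{PVERSION} 4.0 or later running without
@option{--traditional}; the input lines are made up for the example.
The first command treats the braces as an interval expression, while
the second escapes them so that they match literally:

@example
$ @kbd{printf 'why\nwhhhy\nwh@{3@}y\n' | gawk '/^wh@{3@}y$/'}
@print{} whhhy
$ @kbd{printf 'why\nwhhhy\nwh@{3@}y\n' | gawk '/^wh\@{3\@}y$/'}
@print{} wh@{3@}y
@end example

Escaping the braces, as in the second command, is the form to prefer
when the braces are meant literally.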
5832 5833As mentioned, interval expressions were not traditionally available 5834in @command{awk}. In March of 2019, BWK @command{awk} (finally) acquired them. 5835Nonetheless, because they were not available for 5836so many decades, @command{gawk} continues to not supply them 5837when in compatibility mode (@pxref{Options}). 5838 5839POSIX says that interval expressions containing repetition counts greater 5840than 255 produce unspecified results. 5841 5842@cindex Eggert, Paul 5843In the manual for GNU @command{grep}, Paul Eggert notes the following: 5844 5845@quotation 5846Interval expressions may be implemented internally via repetition. 5847For example, @samp{^(a|bc)@{2,4@}$} might be implemented as 5848@samp{^(a|bc)(a|bc)((a|bc)(a|bc)?)?$}. A large repetition count may 5849exhaust memory or greatly slow matching. Even small counts can cause 5850problems if cascaded; for example, @samp{grep -E 5851".*@{10,@}@{10,@}@{10,@}@{10,@}@{10,@}"} is likely to overflow a 5852stack. Fortunately, regular expressions like these are typically 5853artificial, and cascaded repetitions do not conform to POSIX so cannot 5854be used in portable programs anyway. 5855@end quotation 5856 5857@noindent 5858This same caveat applies to @command{gawk}. 5859 5860@node Bracket Expressions 5861@section Using Bracket Expressions 5862@cindex bracket expressions 5863@cindex bracket expressions @subentry range expressions 5864@cindex range expressions (regexps) 5865@cindex bracket expressions @subentry character lists 5866 5867As mentioned earlier, a bracket expression matches any character among 5868those listed between the opening and closing square brackets. 5869 5870Within a bracket expression, a @dfn{range expression} consists of two 5871characters separated by a hyphen. It matches any single character that 5872sorts between the two characters, based upon the system's native character 5873set. For example, @samp{[0-9]} is equivalent to @samp{[0123456789]}. 
(See @ref{Ranges and Locales} for an explanation of how the POSIX
standard and @command{gawk} have changed over time. This is mainly
of historical interest.)

With the increasing popularity of the
@uref{http://www.unicode.org, Unicode character standard},
there is an additional wrinkle to consider. Octal and hexadecimal
escape sequences inside bracket expressions are taken to represent
only single-byte characters (characters whose values fit within
the range 0--255). To match a range of characters where the endpoints
of the range are larger than 255, enter the multibyte encodings of
the characters directly.

@cindex @code{\} (backslash) @subentry in bracket expressions
@cindex backslash (@code{\}) @subentry in bracket expressions
@cindex @code{^} (caret) @subentry in bracket expressions
@cindex caret (@code{^}) @subentry in bracket expressions
@cindex @code{-} (hyphen) @subentry in bracket expressions
@cindex hyphen (@code{-}) @subentry in bracket expressions
To include one of the characters @samp{\}, @samp{]}, @samp{-}, or @samp{^} in a
bracket expression, put a @samp{\} in front of it. For example:

@example
[d\]]
@end example

@noindent
matches either @samp{d} or @samp{]}.
Additionally, if you place @samp{]} right after the opening
@samp{[}, the closing bracket is treated as one of the
characters to be matched.

@cindex POSIX @command{awk} @subentry bracket expressions and
@cindex Extended Regular Expressions (EREs)
@cindex EREs (Extended Regular Expressions)
@cindex @command{egrep} utility
The treatment of @samp{\} in bracket expressions
is compatible with other @command{awk}
implementations and is also mandated by POSIX.
The regular expressions in @command{awk} are a superset
of the POSIX specification for Extended Regular Expressions (EREs).
5915POSIX EREs are based on the regular expressions accepted by the 5916traditional @command{egrep} utility. 5917 5918@cindex bracket expressions @subentry character classes 5919@cindex POSIX @command{awk} @subentry bracket expressions and @subentry character classes 5920@dfn{Character classes} are a feature introduced in the POSIX standard. 5921A character class is a special notation for describing 5922lists of characters that have a specific attribute, but the 5923actual characters can vary from country to country and/or 5924from character set to character set. For example, the notion of what 5925is an alphabetic character differs between the United States and France. 5926 5927A character class is only valid in a regexp @emph{inside} the 5928brackets of a bracket expression. Character classes consist of @samp{[:}, 5929a keyword denoting the class, and @samp{:]}. 5930@ref{table-char-classes} lists the character classes defined by the 5931POSIX standard. 5932 5933@float Table,table-char-classes 5934@caption{POSIX character classes} 5935@multitable @columnfractions .15 .85 5936@headitem Class @tab Meaning 5937@item @code{[:alnum:]} @tab Alphanumeric characters 5938@item @code{[:alpha:]} @tab Alphabetic characters 5939@item @code{[:blank:]} @tab Space and TAB characters 5940@item @code{[:cntrl:]} @tab Control characters 5941@item @code{[:digit:]} @tab Numeric characters 5942@item @code{[:graph:]} @tab Characters that are both printable and visible 5943(a space is printable but not visible, whereas an @samp{a} is both) 5944@item @code{[:lower:]} @tab Lowercase alphabetic characters 5945@item @code{[:print:]} @tab Printable characters (characters that are not control characters) 5946@item @code{[:punct:]} @tab Punctuation characters (characters that are not letters, digits, 5947control characters, or space characters) 5948@item @code{[:space:]} @tab Space characters (these are: space, TAB, newline, carriage return, formfeed and vertical tab) 5949@item @code{[:upper:]} 
@tab Uppercase alphabetic characters 5950@item @code{[:xdigit:]} @tab Characters that are hexadecimal digits 5951@end multitable 5952@end float 5953 5954For example, before the POSIX standard, you had to write @code{/[A-Za-z0-9]/} 5955to match alphanumeric characters. If your 5956character set had other alphabetic characters in it, this would not 5957match them. 5958With the POSIX character classes, you can write 5959@code{/[[:alnum:]]/} to match the alphabetic 5960and numeric characters in your character set. 5961 5962@ignore 5963From eliz@gnu.org Fri Feb 15 03:38:41 2019 5964Date: Fri, 15 Feb 2019 12:38:23 +0200 5965From: Eli Zaretskii <eliz@gnu.org> 5966To: arnold@skeeve.com 5967CC: pengyu.ut@gmail.com, bug-gawk@gnu.org 5968Subject: Re: [bug-gawk] Does gawk character classes follow this? 5969 5970> From: arnold@skeeve.com 5971> Date: Fri, 15 Feb 2019 03:01:34 -0700 5972> Cc: pengyu.ut@gmail.com, bug-gawk@gnu.org 5973> 5974> I get the feeling that there's something really bothering you, but 5975> I don't understand what. 5976> 5977> Can you clarify, please? 5978 5979I thought I already did: we cannot be expected to provide a definitive 5980description of what the named classes stand for, because the answer 5981depends on various factors out of our control. 5982@end ignore 5983 5984@c Thanks to 5985@c Date: Tue, 01 Jul 2014 07:39:51 +0200 5986@c From: Hermann Peifer <peifer@gmx.eu> 5987@cindex ASCII 5988Some utilities that match regular expressions provide a nonstandard 5989@samp{[:ascii:]} character class; @command{awk} does not. However, you 5990can simulate such a construct using @samp{[\x00-\x7F]}. This matches 5991all values numerically between zero and 127, which is the defined 5992range of the ASCII character set. Use a complemented character list 5993(@samp{[^\x00-\x7F]}) to match any single-byte characters that are not 5994in the ASCII range. 
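
The following sketch tries out two of the character classes from
@ref{table-char-classes} and the ASCII-range idiom just described.
The input text is invented for the illustration. The second command
assumes @command{gawk} running in the C locale, where every byte is
a single character, and uses @code{printf} to produce one byte with
the value 233 (octal 351), which lies outside the ASCII range:

@example
$ @kbd{echo 'gawk-5.1, awk_1988' |}
> @kbd{awk '@{ gsub(/[[:alpha:]]+/, "A"); gsub(/[[:digit:]]+/, "9"); print @}'}
@print{} A-9.9, A_9
$ @kbd{printf 'plain\ncaf\351\n' | LC_ALL=C gawk '/[^\x00-\x7F]/ @{ print NR @}'}
@print{} 2
@end example

@noindent
Only the second input record contains a non-ASCII byte, so only its
record number is printed.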
5995 5996@quotation NOTE 5997Some older versions of Unix @command{awk} 5998treat @code{[:blank:]} like @code{[:space:]}, incorrectly matching 5999more characters than they should. Caveat Emptor. 6000@end quotation 6001 6002@cindex bracket expressions @subentry collating elements 6003@cindex bracket expressions @subentry non-ASCII 6004@cindex collating elements 6005Two additional special sequences can appear in bracket expressions. 6006These apply to non-ASCII character sets, which can have single symbols 6007(called @dfn{collating elements}) that are represented with more than one 6008character. They can also have several characters that are equivalent for 6009@dfn{collating}, or sorting, purposes. (For example, in French, a plain ``e'' 6010and a grave-accented ``@`e'' are equivalent.) 6011These sequences are: 6012 6013@table @asis 6014@cindex bracket expressions @subentry collating symbols 6015@cindex collating symbols 6016@item Collating symbols 6017Multicharacter collating elements enclosed between 6018@samp{[.} and @samp{.]}. For example, if @samp{ch} is a collating element, 6019then @samp{[[.ch.]]} is a regexp that matches this collating element, whereas 6020@samp{[ch]} is a regexp that matches either @samp{c} or @samp{h}. 6021 6022@cindex bracket expressions @subentry equivalence classes 6023@item Equivalence classes 6024Locale-specific names for a list of 6025characters that are equal. The name is enclosed between 6026@samp{[=} and @samp{=]}. 6027For example, the name @samp{e} might be used to represent all of 6028``e,'' ``@^e,'' ``@`e,'' and ``@'e.'' In this case, @samp{[[=e=]]} is a regexp 6029that matches any of @samp{e}, @samp{@^e}, @samp{@'e}, or @samp{@`e}. 6030@end table 6031 6032These features are very valuable in non-English-speaking locales. 
6033 6034@cindex internationalization @subentry localization @subentry character classes 6035@cindex @command{gawk} @subentry character classes and 6036@cindex POSIX @command{awk} @subentry bracket expressions and @subentry character classes 6037@quotation CAUTION 6038The library functions that @command{gawk} uses for regular 6039expression matching currently recognize only POSIX character classes; 6040they do not recognize collating symbols or equivalence classes. 6041@end quotation 6042@c maybe one day ... 6043 6044Inside a bracket expression, an opening bracket (@samp{[}) that does 6045not start a character class, collating element or equivalence class is 6046taken literally. This is also true of @samp{.} and @samp{*}. 6047 6048@node Leftmost Longest 6049@section How Much Text Matches? 6050 6051@cindex regular expressions @subentry leftmost longest match 6052@c @cindex matching, leftmost longest 6053Consider the following: 6054 6055@example 6056echo aaaabcd | awk '@{ sub(/a+/, "<A>"); print @}' 6057@end example 6058 6059This example uses the @code{sub()} function to make a change to the input 6060record. (@code{sub()} replaces the first instance of any text matched 6061by the first argument with the string provided as the second argument; 6062@pxref{String Functions}.) Here, the regexp @code{/a+/} indicates ``one 6063or more @samp{a} characters,'' and the replacement text is @samp{<A>}. 6064 6065The input contains four @samp{a} characters. 6066@command{awk} (and POSIX) regular expressions always match 6067the leftmost, @emph{longest} sequence of input characters that can 6068match. Thus, all four @samp{a} characters are 6069replaced with @samp{<A>} in this example: 6070 6071@example 6072$ @kbd{echo aaaabcd | awk '@{ sub(/a+/, "<A>"); print @}'} 6073@print{} <A>bcd 6074@end example 6075 6076For simple match/no-match tests, this is not so important. 
But when doing 6077text matching and substitutions with the @code{match()}, @code{sub()}, @code{gsub()}, 6078and @code{gensub()} functions, it is very important. 6079@ifinfo 6080@xref{String Functions}, 6081for more information on these functions. 6082@end ifinfo 6083Understanding this principle is also important for regexp-based record 6084and field splitting (@pxref{Records}, 6085and also @pxref{Field Separators}). 6086 6087@node Computed Regexps 6088@section Using Dynamic Regexps 6089 6090@cindex regular expressions @subentry computed 6091@cindex regular expressions @subentry dynamic 6092@cindex @code{~} (tilde), @code{~} operator 6093@cindex tilde (@code{~}), @code{~} operator 6094@cindex @code{!} (exclamation point) @subentry @code{!~} operator 6095@cindex exclamation point (@code{!}) @subentry @code{!~} operator 6096@c @cindex operators, @code{~} 6097@c @cindex operators, @code{!~} 6098The righthand side of a @samp{~} or @samp{!~} operator need not be a 6099regexp constant (i.e., a string of characters between slashes). It may 6100be any expression. The expression is evaluated and converted to a string 6101if necessary; the contents of the string are then used as the 6102regexp. A regexp computed in this way is called a @dfn{dynamic 6103regexp} or a @dfn{computed regexp}: 6104 6105@example 6106BEGIN @{ digits_regexp = "[[:digit:]]+" @} 6107$0 ~ digits_regexp @{ print @} 6108@end example 6109 6110@noindent 6111This sets @code{digits_regexp} to a regexp that describes one or more digits, 6112and tests whether the input record matches this regexp. 6113 6114@quotation NOTE 6115When using the @samp{~} and @samp{!~} 6116operators, be aware that there is a difference between a regexp constant 6117enclosed in slashes and a string constant enclosed in double quotes. 
6118If you are going to use a string constant, you have to understand that 6119the string is, in essence, scanned @emph{twice}: the first time when 6120@command{awk} reads your program, and the second time when it goes to 6121match the string on the lefthand side of the operator with the pattern 6122on the right. This is true of any string-valued expression (such as 6123@code{digits_regexp}, shown in the previous example), not just string constants. 6124@end quotation 6125 6126@cindex regexp constants @subentry slashes vs.@: quotes 6127@cindex @code{\} (backslash) @subentry in regexp constants 6128@cindex backslash (@code{\}) @subentry in regexp constants 6129@cindex @code{"} (double quote) @subentry in regexp constants 6130@cindex double quote (@code{"}) @subentry in regexp constants 6131What difference does it make if the string is 6132scanned twice? The answer has to do with escape sequences, and particularly 6133with backslashes. To get a backslash into a regular expression inside a 6134string, you have to type two backslashes. 6135 6136For example, @code{/\*/} is a regexp constant for a literal @samp{*}. 6137Only one backslash is needed. To do the same thing with a string, 6138you have to type @code{"\\*"}. The first backslash escapes the 6139second one so that the string actually contains the 6140two characters @samp{\} and @samp{*}. 6141 6142@cindex troubleshooting @subentry regexp constants vs.@: string constants 6143@cindex regexp constants @subentry vs.@: string constants 6144@cindex string @subentry constants @subentry vs.@: regexp constants 6145Given that you can use both regexp and string constants to describe 6146regular expressions, which should you use? The answer is ``regexp 6147constants,'' for several reasons: 6148 6149@itemize @value{BULLET} 6150@item 6151String constants are more complicated to write and 6152more difficult to read. Using regexp constants makes your programs 6153less error-prone. 
Not understanding the difference between the two 6154kinds of constants is a common source of errors. 6155 6156@item 6157It is more efficient to use regexp constants. @command{awk} can note 6158that you have supplied a regexp and store it internally in a form that 6159makes pattern matching more efficient. When using a string constant, 6160@command{awk} must first convert the string into this internal form and 6161then perform the pattern matching. 6162 6163@item 6164Using regexp constants is better form; it shows clearly that you 6165intend a regexp match. 6166@end itemize 6167 6168@sidebar Using @code{\n} in Bracket Expressions of Dynamic Regexps 6169@cindex regular expressions @subentry dynamic @subentry with embedded newlines 6170@cindex newlines @subentry in dynamic regexps 6171 6172Some older versions of @command{awk} do not allow the newline 6173character to be used inside a bracket expression for a dynamic regexp: 6174 6175@example 6176$ @kbd{awk '$0 ~ "[ \t\n]"'} 6177@error{} awk: newline in character class [ 6178@error{} ]... 6179@error{} source line number 1 6180@error{} context is 6181@error{} $0 ~ "[ >>> \t\n]" <<< 6182@end example 6183 6184@cindex newlines @subentry in regexp constants 6185But a newline in a regexp constant works with no problem: 6186 6187@example 6188$ @kbd{awk '$0 ~ /[ \t\n]/'} 6189@kbd{here is a sample line} 6190@print{} here is a sample line 6191@kbd{Ctrl-d} 6192@end example 6193 6194@command{gawk} does not have this problem, and it isn't likely to 6195occur often in practice, but it's worth noting for future reference. 
6196@end sidebar 6197 6198@node GNU Regexp Operators 6199@section @command{gawk}-Specific Regexp Operators 6200 6201@c This section adapted (long ago) from the regex-0.12 manual 6202 6203@cindex regular expressions @subentry operators @subentry @command{gawk} 6204@cindex @command{gawk} @subentry regular expressions @subentry operators 6205@cindex operators @subentry GNU-specific 6206@cindex regular expressions @subentry operators @subentry for words 6207@cindex word, regexp definition of 6208GNU software that deals with regular expressions provides a number of 6209additional regexp operators. These operators are described in this 6210@value{SECTION} and are specific to @command{gawk}; 6211they are not available in other @command{awk} implementations. 6212Most of the additional operators deal with word matching. 6213For our purposes, a @dfn{word} is a sequence of one or more letters, digits, 6214or underscores (@samp{_}): 6215 6216@table @code 6217@c @cindex operators, @code{\s} (@command{gawk}) 6218@cindex backslash (@code{\}) @subentry @code{\s} operator (@command{gawk}) 6219@cindex @code{\} (backslash) @subentry @code{\s} operator (@command{gawk}) 6220@item \s 6221Matches any space character as defined by the current locale. 6222Think of it as shorthand for 6223@w{@samp{[[:space:]]}}. 6224 6225@c @cindex operators, @code{\S} (@command{gawk}) 6226@cindex backslash (@code{\}) @subentry @code{\S} operator (@command{gawk}) 6227@cindex @code{\} (backslash) @subentry @code{\S} operator (@command{gawk}) 6228@item \S 6229Matches any character that is not a space, as defined by the current locale. 6230Think of it as shorthand for 6231@w{@samp{[^[:space:]]}}. 
6232 6233@c @cindex operators, @code{\w} (@command{gawk}) 6234@cindex backslash (@code{\}) @subentry @code{\w} operator (@command{gawk}) 6235@cindex @code{\} (backslash) @subentry @code{\w} operator (@command{gawk}) 6236@item \w 6237Matches any word-constituent character---that is, it matches any 6238letter, digit, or underscore. Think of it as shorthand for 6239@w{@samp{[[:alnum:]_]}}. 6240 6241@c @cindex operators, @code{\W} (@command{gawk}) 6242@cindex backslash (@code{\}) @subentry @code{\W} operator (@command{gawk}) 6243@cindex @code{\} (backslash) @subentry @code{\W} operator (@command{gawk}) 6244@item \W 6245Matches any character that is not word-constituent. 6246Think of it as shorthand for 6247@w{@samp{[^[:alnum:]_]}}. 6248 6249@c @cindex operators, @code{\<} (@command{gawk}) 6250@cindex backslash (@code{\}) @subentry @code{\<} operator (@command{gawk}) 6251@cindex @code{\} (backslash) @subentry @code{\<} operator (@command{gawk}) 6252@item \< 6253Matches the empty string at the beginning of a word. 6254For example, @code{/\<away/} matches @samp{away} but not 6255@samp{stowaway}. 6256 6257@c @cindex operators, @code{\>} (@command{gawk}) 6258@cindex backslash (@code{\}) @subentry @code{\>} operator (@command{gawk}) 6259@cindex @code{\} (backslash) @subentry @code{\>} operator (@command{gawk}) 6260@item \> 6261Matches the empty string at the end of a word. 6262For example, @code{/stow\>/} matches @samp{stow} but not @samp{stowaway}. 6263 6264@c @cindex operators, @code{\y} (@command{gawk}) 6265@cindex backslash (@code{\}) @subentry @code{\y} operator (@command{gawk}) 6266@cindex @code{\} (backslash) @subentry @code{\y} operator (@command{gawk}) 6267@cindex word boundaries, matching 6268@item \y 6269Matches the empty string at either the beginning or the 6270end of a word (i.e., the word boundar@strong{y}). For example, @samp{\yballs?\y} 6271matches either @samp{ball} or @samp{balls}, as a separate word. 
@c @cindex operators, @code{\B} (@command{gawk})
@cindex backslash (@code{\}) @subentry @code{\B} operator (@command{gawk})
@cindex @code{\} (backslash) @subentry @code{\B} operator (@command{gawk})
@item \B
Matches the empty string that occurs between two
word-constituent characters. For example,
@code{/\Brat\B/} matches @samp{crate}, but it does not match @samp{dirty rat}.
@samp{\B} is essentially the opposite of @samp{\y}.
@end table

@cindex buffers @subentry operators for
@cindex regular expressions @subentry operators @subentry for buffers
@cindex operators @subentry string-matching @subentry for buffers
There are two other operators that work on buffers. In Emacs, a
@dfn{buffer} is, naturally, an Emacs buffer.
Other GNU programs, including @command{gawk},
consider the entire string to match as the buffer.
The operators are:

@table @code
@item \`
@c @cindex operators, @code{\`} (@command{gawk})
@cindex backslash (@code{\}) @subentry @code{\`} operator (@command{gawk})
@cindex @code{\} (backslash) @subentry @code{\`} operator (@command{gawk})
Matches the empty string at the
beginning of a buffer (string).

@c @cindex operators, @code{\'} (@command{gawk})
@cindex backslash (@code{\}) @subentry @code{\'} operator (@command{gawk})
@cindex @code{\} (backslash) @subentry @code{\'} operator (@command{gawk})
@item \'
Matches the empty string at the
end of a buffer (string).
@end table

@cindex @code{^} (caret) @subentry regexp operator
@cindex caret (@code{^}) @subentry regexp operator
@cindex @code{$} (dollar sign) @subentry regexp operator
@cindex dollar sign (@code{$}) @subentry regexp operator
Because @samp{^} and @samp{$} always work in terms of the beginning
and end of strings, these operators don't add any new capabilities
for @command{awk}.
They are provided for compatibility with other 6315GNU software. 6316 6317@cindex @command{gawk} @subentry word-boundary operator 6318@cindex word-boundary operator (@command{gawk}) 6319@cindex operators @subentry word-boundary (@command{gawk}) 6320In other GNU software, the word-boundary operator is @samp{\b}. However, 6321that conflicts with the @command{awk} language's definition of @samp{\b} 6322as backspace, so @command{gawk} uses a different letter. 6323An alternative method would have been to require two backslashes in the 6324GNU operators, but this was deemed too confusing. The current 6325method of using @samp{\y} for the GNU @samp{\b} appears to be the 6326lesser of two evils. 6327 6328@cindex regular expressions @subentry @command{gawk}, command-line options 6329@cindex @command{gawk} @subentry command-line options, regular expressions and 6330The various command-line options 6331(@pxref{Options}) 6332control how @command{gawk} interprets characters in regexps: 6333 6334@table @asis 6335@item No options 6336In the default case, @command{gawk} provides all the facilities of 6337POSIX regexps and the 6338@ifnotinfo 6339previously described 6340GNU regexp operators. 6341@end ifnotinfo 6342@ifnottex 6343@ifnotdocbook 6344GNU regexp operators described 6345in @ref{Regexp Operators}. 6346@end ifnotdocbook 6347@end ifnottex 6348 6349@item @option{--posix} 6350Match only POSIX regexps; the GNU operators are not special 6351(e.g., @samp{\w} matches a literal @samp{w}). Interval expressions 6352are allowed. 6353 6354@cindex Brian Kernighan's @command{awk} 6355@item @option{--traditional} 6356Match traditional Unix @command{awk} regexps. The GNU operators 6357are not special, and interval expressions are not available. 6358Because BWK @command{awk} supports them, 6359the POSIX character classes (@samp{[[:alnum:]]}, etc.) are available. 
6360Characters described by octal and hexadecimal escape sequences are 6361treated literally, even if they represent regexp metacharacters. 6362 6363@item @option{--re-interval} 6364Allow interval expressions in regexps, if @option{--traditional} 6365has been provided. 6366Otherwise, interval expressions are available by default. 6367@end table 6368 6369@node Case-sensitivity 6370@section Case Sensitivity in Matching 6371 6372@cindex regular expressions @subentry case sensitivity 6373@cindex case sensitivity @subentry regexps and 6374Case is normally significant in regular expressions, both when matching 6375ordinary characters (i.e., not metacharacters) and inside bracket 6376expressions. Thus, a @samp{w} in a regular expression matches only a lowercase 6377@samp{w} and not an uppercase @samp{W}. 6378 6379The simplest way to do a case-independent match is to use a bracket 6380expression---for example, @samp{[Ww]}. However, this can be cumbersome if 6381you need to use it often, and it can make the regular expressions harder 6382to read. There are two alternatives that you might prefer. 6383 6384One way to perform a case-insensitive match at a particular point in the 6385program is to convert the data to a single case, using the 6386@code{tolower()} or @code{toupper()} built-in string functions (which we 6387haven't discussed yet; 6388@pxref{String Functions}). 6389For example: 6390 6391@example 6392tolower($1) ~ /foo/ @{ @dots{} @} 6393@end example 6394 6395@noindent 6396converts the first field to lowercase before matching against it. 6397This works in any POSIX-compliant @command{awk}. 
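
For instance, the following sketch (with made-up input) compares the
two portable approaches described so far. The bracket-expression
regexp is tested against the whole record, whereas the
@code{tolower()} test looks only at the first field:

@example
$ @kbd{printf 'Foo bar\nbaz FOO\nqux\n' | awk '/[Ff][Oo][Oo]/'}
@print{} Foo bar
@print{} baz FOO
$ @kbd{printf 'Foo bar\nbaz FOO\nqux\n' | awk 'tolower($1) ~ /foo/'}
@print{} Foo bar
@end example

@noindent
To test the whole record with the second approach, use
@code{tolower($0)} instead of @code{tolower($1)}.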
6398 6399@cindex @command{gawk} @subentry regular expressions @subentry case sensitivity 6400@cindex case sensitivity @subentry @command{gawk} 6401@cindex differences in @command{awk} and @command{gawk} @subentry regular expressions 6402@cindex @code{~} (tilde), @code{~} operator 6403@cindex tilde (@code{~}), @code{~} operator 6404@cindex @code{!} (exclamation point) @subentry @code{!~} operator 6405@cindex exclamation point (@code{!}) @subentry @code{!~} operator 6406@cindex @code{IGNORECASE} variable @subentry with @code{~} and @code{!~} operators 6407@cindex @command{gawk} @subentry @code{IGNORECASE} variable in 6408@c @cindex variables, @code{IGNORECASE} 6409Another method, specific to @command{gawk}, is to set the variable 6410@code{IGNORECASE} to a nonzero value (@pxref{Built-in Variables}). 6411When @code{IGNORECASE} is not zero, @emph{all} regexp and string 6412operations ignore case. 6413 6414Changing the value of @code{IGNORECASE} dynamically controls the 6415case sensitivity of the program as it runs. Case is significant by 6416default because @code{IGNORECASE} (like most variables) is initialized 6417to zero: 6418 6419@example 6420x = "aB" 6421if (x ~ /ab/) @dots{} # this test will fail 6422 6423IGNORECASE = 1 6424if (x ~ /ab/) @dots{} # now it will succeed 6425@end example 6426 6427In general, you cannot use @code{IGNORECASE} to make certain rules 6428case insensitive and other rules case sensitive, as there is no 6429straightforward way 6430to set @code{IGNORECASE} just for the pattern of 6431a particular rule.@footnote{Experienced C and C++ programmers will note 6432that it is possible, using something like 6433@samp{IGNORECASE = 1 && /foObAr/ @{ @dots{} @}} 6434and 6435@samp{IGNORECASE = 0 || /foobar/ @{ @dots{} @}}. 6436However, this is somewhat obscure and we don't recommend it.} 6437To do this, use either bracket expressions or @code{tolower()}. 
However, one 6438thing you can do with @code{IGNORECASE} only is dynamically turn 6439case sensitivity on or off for all the rules at once. 6440 6441@code{IGNORECASE} can be set on the command line or in a @code{BEGIN} rule 6442(@pxref{Other Arguments}; also 6443@pxref{Using BEGIN/END}). 6444Setting @code{IGNORECASE} from the command line is a way to make 6445a program case insensitive without having to edit it. 6446 6447@c @cindex ISO 8859-1 6448@c @cindex ISO Latin-1 6449In multibyte locales, the equivalences between upper- and lowercase 6450characters are tested based on the wide-character values of the locale's 6451character set. Prior to @value{PVERSION} 5.0, single-byte characters were 6452tested based on the ISO-8859-1 (ISO Latin-1) character set. However, as 6453of @value{PVERSION} 5.0, single-byte characters are also tested based on 6454the values of the locale's character set.@footnote{If you don't understand 6455this, don't worry about it; it just means that @command{gawk} does the 6456right thing.} 6457 6458The value of @code{IGNORECASE} has no effect if @command{gawk} is in 6459compatibility mode (@pxref{Options}). 6460Case is always significant in compatibility mode. 6461 6462@node Regexp Summary 6463@section Summary 6464 6465@itemize @value{BULLET} 6466@item 6467Regular expressions describe sets of strings to be matched. 6468In @command{awk}, regular expression constants are written enclosed 6469between slashes: @code{/}@dots{}@code{/}. 6470 6471@item 6472Regexp constants may be used standalone in patterns and 6473in conditional expressions, or as part of matching expressions 6474using the @samp{~} and @samp{!~} operators. 6475 6476@item 6477Escape sequences let you represent nonprintable characters and 6478also let you represent regexp metacharacters as literal characters 6479to be matched. 6480 6481@item 6482Regexp operators provide grouping, alternation, and repetition. 

@item
Bracket expressions give you a shorthand for specifying sets
of characters that can match at a particular point in a regexp.
Within bracket expressions, POSIX character classes let you specify
certain groups of characters in a locale-independent fashion.

@item
Regular expressions match the leftmost longest text in the string being
matched.  This matters for cases where you need to know the extent of
the match, such as for text substitution and when the record separator
is a regexp.

@item
Matching expressions may use dynamic regexps (i.e., string values
treated as regular expressions).

@item
@command{gawk}'s @code{IGNORECASE} variable lets you control the
case sensitivity of regexp matching.  In other @command{awk}
versions, use @code{tolower()} or @code{toupper()}.

@end itemize


@node Reading Files
@chapter Reading Input Files

@cindex reading input files
@cindex input files @subentry reading
@cindex input files
@cindex @code{FILENAME} variable
In the typical @command{awk} program,
@command{awk} reads all input either from the
standard input (by default, this is the keyboard, but often it is a pipe from another
command) or from files whose names you specify on the @command{awk}
command line.  If you specify input files, @command{awk} reads them
in order, processing all the data from one before going on to the next.
The name of the current input file can be found in the predefined variable
@code{FILENAME}
(@pxref{Built-in Variables}).

@cindex records
@cindex fields
The input is read in units called @dfn{records}, and is processed by the
rules of your program one record at a time.
By default, each record is one line.  Each
record is automatically split into chunks called @dfn{fields}.
This makes it more convenient for programs to work on the parts of a record.
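
As a small illustration of these ideas, the following sketch prints
the first field of every record, prefixed by the name of the file that
the record came from.  It assumes that the sample files @file{mail-list}
and @file{inventory-shipped} used elsewhere in this @value{DOCUMENT}
are in the current directory:

@example
awk '@{ print FILENAME ": " $1 @}' mail-list inventory-shipped
@end example

@noindent
All of the records from @file{mail-list} are processed before any
record from @file{inventory-shipped} is read.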

@cindex @code{getline} command
On rare occasions, you may need to use the @code{getline} command.
The @code{getline} command is valuable both because it
can do explicit input from any number of files, and because the files
used with it do not have to be named on the @command{awk} command line
(@pxref{Getline}).

@menu
* Records::                     Controlling how data is split into records.
* Fields::                      An introduction to fields.
* Nonconstant Fields::          Nonconstant Field Numbers.
* Changing Fields::             Changing the Contents of a Field.
* Field Separators::            The field separator and how to change it.
* Constant Size::               Reading constant width data.
* Splitting By Content::        Defining Fields By Content
* Testing field creation::      Checking how @command{gawk} is splitting
                                records.
* Multiple Line::               Reading multiline records.
* Getline::                     Reading files under explicit program control
                                using the @code{getline} function.
* Read Timeout::                Reading input with a timeout.
* Retrying Input::              Retrying input after certain errors.
* Command-line directories::    What happens if you put a directory on the
                                command line.
* Input Summary::               Input summary.
* Input Exercises::             Exercises.
@end menu

@node Records
@section How Input Is Split into Records

@cindex input @subentry splitting into records
@cindex records @subentry splitting input into
@cindex @code{NR} variable
@cindex @code{FNR} variable
@command{awk} divides the input for your program into records and fields.
It keeps track of the number of records that have been read so far from
the current input file.  This value is stored in a predefined variable
called @code{FNR}, which is reset to zero every time a new file is started.
Another predefined variable, @code{NR}, records the total number of input
records read so far from all @value{DF}s.
It starts at zero, but is
never automatically reset to zero.

Normally, records are separated by newline characters.  You can control how
records are separated by assigning values to the built-in variable @code{RS}.
If @code{RS} is any single character, that character separates records.
Otherwise (in @command{gawk}), @code{RS} is treated as a regular expression.
This mechanism is explained in greater detail shortly.

@menu
* awk split records::           How standard @command{awk} splits records.
* gawk split records::          How @command{gawk} splits records.
@end menu

@node awk split records
@subsection Record Splitting with Standard @command{awk}

@cindex separators @subentry for records
@cindex record separators
Records are separated by a character called the @dfn{record separator}.
By default, the record separator is the newline character.
This is why records are, by default, single lines.
To use a different character for the record separator,
simply assign that character to the predefined variable @code{RS}.

@cindex record separators @subentry newlines as
@cindex newlines @subentry as record separators
@cindex @code{RS} variable
Like any other variable,
the value of @code{RS} can be changed in the @command{awk} program
with the assignment operator, @samp{=}
(@pxref{Assignment Ops}).
The new record-separator character should be enclosed in quotation marks,
which indicate a string constant.  Often, the right time to do this is
at the beginning of execution, before any input is processed,
so that the very first record is read with the proper separator.
To do this, use the special @code{BEGIN} pattern
(@pxref{BEGIN/END}).
For example:

@example
awk 'BEGIN @{ RS = "u" @}
     @{ print $0 @}' mail-list
@end example

@noindent
changes the value of @code{RS} to @samp{u}, before reading any input.
The new value is a string whose first character is the letter ``u''; as a result, records
are separated by the letter ``u''.  Then the input file is read, and the second
rule in the @command{awk} program (the action with no pattern) prints each
record.  Because each @code{print} statement adds a newline at the end of
its output, this @command{awk} program copies the input
with each @samp{u} changed to a newline.  Here are the results of running
the program on @file{mail-list}:

@example
@group
$ @kbd{awk 'BEGIN @{ RS = "u" @}}
> @kbd{@{ print $0 @}' mail-list}
@end group
@print{} Amelia 555-5553 amelia.zodiac
@print{} sq
@print{} e@@gmail.com F
@print{} Anthony 555-3412 anthony.assert
@print{} ro@@hotmail.com A
@print{} Becky 555-7685 becky.algebrar
@print{} m@@gmail.com A
@print{} Bill 555-1675 bill.drowning@@hotmail.com A
@print{} Broderick 555-0542 broderick.aliq
@print{} otiens@@yahoo.com R
@print{} Camilla 555-2912 camilla.inf
@print{} sar
@print{} m@@skynet.be R
@print{} Fabi
@print{} s 555-1234 fabi
@print{} s.
@print{} ndevicesim
@print{} s@@
@print{} cb.ed
@print{} F
@print{} J
@print{} lie 555-6699 j
@print{} lie.perscr
@print{} tabor@@skeeve.com F
@print{} Martin 555-6480 martin.codicib
@print{} s@@hotmail.com A
@print{} Sam
@print{} el 555-3430 sam
@print{} el.lanceolis@@sh
@print{} .ed
@print{} A
@print{} Jean-Pa
@print{} l 555-2127 jeanpa
@print{} l.campanor
@print{} m@@ny
@print{} .ed
@print{} R
@print{}
@end example

@noindent
Note that the entry for the name @samp{Bill} is not split.
In the original @value{DF}
(@pxref{Sample Data Files}),
the line looks like this:

@example
Bill 555-1675 bill.drowning@@hotmail.com A
@end example

@noindent
It contains no @samp{u}, so there is no reason to split the record,
unlike the others, which each have one or more occurrences of the @samp{u}.
In fact, this record is treated as part of the previous record;
the newline separating them in the output
is the original newline in the @value{DF}, not the one added by
@command{awk} when it printed the record!

@cindex record separators @subentry changing
@cindex separators @subentry for records
Another way to change the record separator is on the command line,
using the variable-assignment feature
(@pxref{Other Arguments}):

@example
awk '@{ print $0 @}' RS="u" mail-list
@end example

@noindent
This sets @code{RS} to @samp{u} before processing @file{mail-list}.

Using an alphabetic character such as @samp{u} for the record separator
is highly likely to produce strange results.
Using an unusual character such as @samp{/} is more likely to
produce correct behavior in the majority of cases, but there
are no guarantees.  The moral is: Know Your Data.

@command{gawk} allows @code{RS} to be a full regular expression
(discussed shortly; @pxref{gawk split records}).  Even so, using
a regular expression metacharacter, such as @samp{.} as the single
character in the value of @code{RS} has no special effect: it is
treated literally.  This is required for backwards compatibility with
both Unix @command{awk} and with POSIX.

@cindex dark corner @subentry input files
Reaching the end of an input file terminates the current input record,
even if the last character in the file is not the character in @code{RS}.
@value{DARKCORNER}

@cindex empty strings @seeentry{null strings}
@cindex null strings
@cindex strings @subentry empty @seeentry{null strings}
The empty string @code{""} (a string without any characters)
has a special meaning
as the value of @code{RS}.  It means that records are separated
by one or more blank lines and nothing else.
@xref{Multiple Line} for more details.

If you change the value of @code{RS} in the middle of an @command{awk} run,
the new value is used to delimit subsequent records, but the record
currently being processed, as well as records already processed, are not
affected.

@cindex @command{gawk} @subentry @code{RT} variable in
@cindex @code{RT} variable
@cindex records @subentry terminating
@cindex terminating records
@cindex differences in @command{awk} and @command{gawk} @subentry record separators
@cindex differences in @command{awk} and @command{gawk} @subentry @code{RS}/@code{RT} variables
@cindex regular expressions @subentry as record separators
@cindex record separators @subentry regular expressions as
@cindex separators @subentry for records @subentry regular expressions as
After the end of the record has been determined, @command{gawk}
sets the variable @code{RT} to the text in the input that matched
@code{RS}.

@node gawk split records
@subsection Record Splitting with @command{gawk}

@cindex common extensions @subentry @code{RS} as a regexp
@cindex extensions @subentry common @subentry @code{RS} as a regexp
When using @command{gawk}, the value of @code{RS} is not limited to a
one-character string.  If it contains more than one character, it is
treated as a regular expression
(@pxref{Regexp}). @value{COMMONEXT}
In general, each record
ends at the next string that matches the regular expression; the next
record starts at the end of the matching string.
This general rule is
actually at work in the usual case, where @code{RS} contains just a
newline: a record ends at the beginning of the next matching string (the
next newline in the input), and the following record starts just after
the end of this string (at the first character of the following line).
The newline, because it matches @code{RS}, is not part of either record.

When @code{RS} is a single character, @code{RT}
contains the same single character.  However, when @code{RS} is a
regular expression, @code{RT} contains
the actual input text that matched the regular expression.

If the input file ends without any text matching @code{RS},
@command{gawk} sets @code{RT} to the null string.

The following example illustrates both of these features.
It sets @code{RS} equal to a regular expression that
matches either a newline or a series of one or more uppercase letters
with optional leading and/or trailing whitespace:

@example
@group
$ @kbd{echo record 1 AAAA record 2 BBBB record 3 |}
> @kbd{gawk 'BEGIN @{ RS = "\n|( *[[:upper:]]+ *)" @}}
> @kbd{@{ print "Record =", $0,"and RT = [" RT "]" @}'}
@end group
@print{} Record = record 1 and RT = [ AAAA ]
@print{} Record = record 2 and RT = [ BBBB ]
@print{} Record = record 3 and RT = [
@print{} ]
@end example

@noindent
The square brackets delineate the contents of @code{RT}, letting you
see the leading and trailing whitespace.  The final value of
@code{RT} is a newline.
@xref{Simple Sed} for a more useful example
of @code{RS} as a regexp and @code{RT}.
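
Because @code{RT} holds exactly the text that terminated each record,
it can be used to reproduce the original separators on output.  The
following is a minimal sketch (the input text and the bracket markers
are only for illustration); it treats any run of whitespace as the
record separator and prints each record in brackets, followed by the
separator that ended it:

@example
@group
$ @kbd{echo 'a  b   c' |}
> @kbd{gawk 'BEGIN @{ RS = "[[:space:]]+" @}}
> @kbd{@{ printf "[%s]%s", $0, RT @}'}
@end group
@print{} [a]  [b]   [c]
@end example

@noindent
The spacing between the records in the output is the same as in the
input, and the final newline comes from @code{RT}, not from @code{printf}.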

If you set @code{RS} to a regular expression that allows optional
trailing text, such as @samp{RS = "abc(XYZ)?"}, it is possible, due
to implementation constraints, that @command{gawk} may match the leading
part of the regular expression, but not the trailing part, particularly
if the input text that could match the trailing part is fairly long.
@command{gawk} attempts to avoid this problem, but currently, there's
no guarantee that this will never happen.

@sidebar Caveats When Using Regular Expressions for @code{RS}
Remember that in @command{awk}, the @samp{^} and @samp{$} anchor
metacharacters match the beginning and end of a @emph{string}, and not
the beginning and end of a @emph{line}.  As a result, something like
@samp{RS = "^[[:upper:]]"} can only match at the beginning of a file.
This is because @command{gawk} views the input file as one long string
that happens to contain newline characters.
It is thus best to avoid anchor metacharacters in the value of @code{RS}.

Record splitting with regular expressions works differently than
regexp matching with the @code{sub()}, @code{gsub()}, and @code{gensub()}
(@pxref{String Functions}).  Those functions allow a regexp to match the empty string;
record splitting does not.  Thus, for example @samp{RS = "()"} does @emph{not}
split records between characters.
@end sidebar

@cindex @command{gawk} @subentry @code{RT} variable in
@cindex @code{RT} variable
@cindex differences in @command{awk} and @command{gawk} @subentry @code{RS}/@code{RT} variables
The use of @code{RS} as a regular expression and the @code{RT}
variable are @command{gawk} extensions; they are not available in
compatibility mode
(@pxref{Options}).
In compatibility mode, only the first character of the value of
@code{RS} determines the end of the record.
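
The following sketch shows the difference; the input text is only
illustrative, and @option{--traditional} is the option that selects
compatibility mode.  With a two-character value for @code{RS},
compatibility mode uses just the @samp{a} as the record separator, so
the @samp{b} stays in the data:

@example
$ @kbd{printf '1ab2ab3' | gawk --traditional 'BEGIN @{ RS = "ab" @} @{ print @}'}
@print{} 1
@print{} b2
@print{} b3
@end example

@noindent
Without @option{--traditional}, @command{gawk} treats @samp{ab} as a
regexp and prints @samp{1}, @samp{2}, and @samp{3}.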

@cindex Brian Kernighan's @command{awk}
@command{mawk} has allowed @code{RS} to be a regexp for decades.
As of October, 2019, BWK @command{awk} also supports it.  Neither
version supplies @code{RT}, however.

@sidebar @code{RS = "\0"} Is Not Portable
@cindex portability @subentry data files as single record
There are times when you might want to treat an entire @value{DF} as a
single record.  The only way to make this happen is to give @code{RS}
a value that you know doesn't occur in the input file.  This is hard
to do in a general way, such that a program always works for arbitrary
input files.

You might think that for text files, the @sc{nul} character, which
consists of a character with all bits equal to zero, is a good
value to use for @code{RS} in this case:

@example
BEGIN @{ RS = "\0" @}  # whole file becomes one record?
@end example

@cindex differences in @command{awk} and @command{gawk} @subentry strings @subentry storing
@command{gawk} in fact accepts this, and uses the @sc{nul}
character for the record separator.
This works for certain special files, such as @file{/proc/self/environ} on
GNU/Linux systems, where the @sc{nul} character is in fact the record separator.
However, this usage is @emph{not} portable
to most other @command{awk} implementations.

@cindex dark corner @subentry strings, storing
Almost all other @command{awk} implementations@footnote{At least that we know
about.} store strings internally as C-style strings.  C strings use the
@sc{nul} character as the string terminator.  In effect, this means that
@samp{RS = "\0"} is the same as @samp{RS = ""}.
@value{DARKCORNER}

It happens that recent versions of @command{mawk} can use the @sc{nul}
character as a record separator.  However, this is a special case:
@command{mawk} does not allow embedded @sc{nul} characters in strings.
(This may change in a future version of @command{mawk}.)

@cindex records @subentry treating files as
@cindex treating files, as single records
@cindex single records, treating files as
@xref{Readfile Function} for an interesting way to read
whole files.  If you are using @command{gawk}, see @ref{Extension Sample
Readfile} for another option.
@end sidebar

@node Fields
@section Examining Fields

@cindex examining fields
@cindex fields
@cindex accessing fields
@cindex fields @subentry examining
@cindex whitespace @subentry definition of
When @command{awk} reads an input record, the record is
automatically @dfn{parsed} or separated by the @command{awk} utility into chunks
called @dfn{fields}.  By default, fields are separated by @dfn{whitespace},
like words in a line.
Whitespace in @command{awk} means any string of one or more spaces,
TABs, or newlines; other characters
that are considered whitespace by other languages
(such as formfeed, vertical tab, etc.) are @emph{not} considered
whitespace by @command{awk}.

The purpose of fields is to make it more convenient for you to refer to
these pieces of the record.  You don't have to use them---you can
operate on the whole record if you want---but fields are what make
simple @command{awk} programs so powerful.

@cindex field operator @code{$}
@cindex @code{$} (dollar sign) @subentry @code{$} field operator
@cindex dollar sign (@code{$}) @subentry @code{$} field operator
@cindex field operators, dollar sign as
You use a dollar sign (@samp{$})
to refer to a field in an @command{awk} program,
followed by the number of the field you want.  Thus, @code{$1}
refers to the first field, @code{$2} to the second, and so on.
(Unlike in the Unix shells, the field numbers are not limited to single digits.
@code{$127} is the 127th field in the record.)
For example, suppose the following is a line of input:

@example
This seems like a pretty nice example.
@end example

@noindent
Here the first field, or @code{$1}, is @samp{This}, the second field, or
@code{$2}, is @samp{seems}, and so on.  Note that the last field,
@code{$7}, is @samp{example.}.  Because there is no space between the
@samp{e} and the @samp{.}, the period is considered part of the seventh
field.

@cindex @code{NF} variable
@cindex fields @subentry number of
@code{NF} is a predefined variable whose value is the number of fields
in the current record.  @command{awk} automatically updates the value
of @code{NF} each time it reads a record.  No matter how many fields
there are, the last field in a record can be represented by @code{$NF}.
So, @code{$NF} is the same as @code{$7}, which is @samp{example.}.
If you try to reference a field beyond the last
one (such as @code{$8} when the record has only seven fields), you get
the empty string.  (If used in a numeric operation, you get zero.)

The use of @code{$0}, which looks like a reference to the ``zeroth'' field, is
a special case: it represents the whole input record.  Use it
when you are not interested in specific fields.
Here are some more examples:

@example
$ @kbd{awk '$1 ~ /li/ @{ print $0 @}' mail-list}
@print{} Amelia 555-5553 amelia.zodiacusque@@gmail.com F
@print{} Julie 555-6699 julie.perscrutabor@@skeeve.com F
@end example

@noindent
This example prints each record in the file @file{mail-list} whose first
field contains the string @samp{li}.

By contrast, the following example looks for @samp{li} in @emph{the
entire record} and prints the first and last fields for each matching
input record:

@example
$ @kbd{awk '/li/ @{ print $1, $NF @}' mail-list}
@print{} Amelia F
@print{} Broderick R
@print{} Julie F
@print{} Samuel A
@end example

@node Nonconstant Fields
@section Nonconstant Field Numbers
@cindex fields @subentry numbers
@cindex field numbers

A field number need not be a constant.  Any expression in
the @command{awk} language can be used after a @samp{$} to refer to a
field.  The value of the expression specifies the field number.  If the
value is a string, rather than a number, it is converted to a number.
Consider this example:

@example
awk '@{ print $NR @}'
@end example

@noindent
Recall that @code{NR} is the number of records read so far: one in the
first record, two in the second, and so on.  So this example prints the first
field of the first record, the second field of the second record, and so
on.  For the twentieth record, field number 20 is printed; most likely,
the record has fewer than 20 fields, so this prints a blank line.
Here is another example of using expressions as field numbers:

@example
awk '@{ print $(2*2) @}' mail-list
@end example

@command{awk} evaluates the expression @samp{(2*2)} and uses
its value as the number of the field to print.  The @samp{*}
represents multiplication, so the expression @samp{2*2} evaluates to four.
The parentheses are used so that the multiplication is done before the
@samp{$} operation; they are necessary whenever there is a binary
operator@footnote{A @dfn{binary operator}, such as @samp{*} for
multiplication, is one that takes two operands.
The distinction
is required because @command{awk} also has unary (one-operand)
and ternary (three-operand) operators.}
in the field-number expression.  This example, then, prints the
type of relationship (the fourth field) for every line of the file
@file{mail-list}.  (All of the @command{awk} operators are listed, in
order of decreasing precedence, in
@ref{Precedence}.)

If the field number you compute is zero, you get the entire record.
Thus, @samp{$(2-2)} has the same value as @code{$0}.  Negative field
numbers are not allowed; trying to reference one usually terminates
the program.  (The POSIX standard does not define
what happens when you reference a negative field number.  @command{gawk}
notices this and terminates your program.  Other @command{awk}
implementations may behave differently.)

As mentioned in @ref{Fields},
@command{awk} stores the current record's number of fields in the built-in
variable @code{NF} (also @pxref{Built-in Variables}).  Thus, the expression
@code{$NF} is not a special feature---it is the direct consequence of
evaluating @code{NF} and using its value as a field number.

@node Changing Fields
@section Changing the Contents of a Field

@cindex fields @subentry changing contents of
The contents of a field, as seen by @command{awk}, can be changed within an
@command{awk} program; this changes what @command{awk} perceives as the
current input record.  (The actual input is untouched; @command{awk} @emph{never}
modifies the input file.)
Consider the following example and its output:

@example
$ @kbd{awk '@{ nboxes = $3 ; $3 = $3 - 10}
> @kbd{print nboxes, $3 @}' inventory-shipped}
@print{} 25 15
@print{} 32 22
@print{} 24 14
@dots{}
@end example

@noindent
The program first saves the original value of field three in the variable
@code{nboxes}.
The @samp{-} sign represents subtraction, so this program reassigns
field three, @code{$3}, as the original value of field three minus ten:
@samp{$3 - 10}.  (@xref{Arithmetic Ops}.)
Then it prints the original and new values for field three.
(Someone in the warehouse made a consistent mistake while inventorying
the red boxes.)

For this to work, the text in @code{$3} must make sense
as a number; the string of characters must be converted to a number
for the computer to do arithmetic on it.  The number resulting
from the subtraction is converted back to a string of characters that
then becomes field three.
@xref{Conversion}.

When the value of a field is changed (as perceived by @command{awk}), the
text of the input record is recalculated to contain the new field where
the old one was.  In other words, @code{$0} changes to reflect the altered
field.  Thus, this program
prints a copy of the input file, with 10 subtracted from the second
field of each line:

@example
$ @kbd{awk '@{ $2 = $2 - 10; print $0 @}' inventory-shipped}
@print{} Jan 3 25 15 115
@print{} Feb 5 32 24 226
@print{} Mar 5 24 34 228
@dots{}
@end example

It is also possible to assign contents to fields that are out
of range.  For example:

@example
$ @kbd{awk '@{ $6 = ($5 + $4 + $3 + $2)}
> @kbd{    print $6 @}' inventory-shipped}
@print{} 168
@print{} 297
@print{} 301
@dots{}
@end example

@cindex adding @subentry fields
@cindex fields @subentry adding
@noindent
We've just created @code{$6}, whose value is the sum of fields
@code{$2}, @code{$3}, @code{$4}, and @code{$5}.  The @samp{+} sign
represents addition.  For the file @file{inventory-shipped}, @code{$6}
represents the total number of parcels shipped for a particular month.
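
The same total can also be computed with a loop, using a variable as
the field number (@pxref{Nonconstant Fields}).  This is only a sketch,
and the variable names @code{i} and @code{total} are arbitrary; it
adds up fields two through @code{NF} without creating a new field:

@example
$ @kbd{awk '@{ total = 0}
> @kbd{       for (i = 2; i <= NF; i++)}
> @kbd{           total += $i}
> @kbd{       print $1, total @}' inventory-shipped}
@print{} Jan 168
@print{} Feb 297
@print{} Mar 301
@dots{}
@end example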

Creating a new field changes @command{awk}'s internal copy of the current
input record, which is the value of @code{$0}.  Thus, if you do @samp{print $0}
after adding a field, the record printed includes the new field, with
the appropriate number of field separators between it and the previously
existing fields.

@cindex @code{OFS} variable
@cindex output field separator @seeentry{@code{OFS} variable}
@cindex field separator @seealso{@code{OFS}}
This recomputation affects and is affected by
@code{NF} (the number of fields; @pxref{Fields}).
For example, the value of @code{NF} is set to the number of the highest
field you create.
The exact format of @code{$0} is also affected by a feature that has not been discussed yet:
the @dfn{output field separator}, @code{OFS},
used to separate the fields (@pxref{Output Separators}).

Note, however, that merely @emph{referencing} an out-of-range field
does @emph{not} change the value of either @code{$0} or @code{NF}.
Referencing an out-of-range field only produces an empty string.  For
example:

@example
if ($(NF+1) != "")
    print "can't happen"
else
    print "everything is normal"
@end example

@noindent
should print @samp{everything is normal}, because @code{NF+1} is certain
to be out of range.  (@xref{If Statement}
for more information about @command{awk}'s @code{if-else} statements.
@xref{Typing and Comparison}
for more information about the @samp{!=} operator.)

It is important to note that making an assignment to an existing field
changes the
value of @code{$0} but does not change the value of @code{NF},
even when you assign the empty string to a field.
For example:

@example
$ @kbd{echo a b c d | awk '@{ OFS = ":"; $2 = ""}
> @kbd{print $0; print NF @}'}
@print{} a::c:d
@print{} 4
@end example

@noindent
The field is still there; it just has an empty value, delimited by
the two colons between @samp{a} and @samp{c}.
This example shows what happens if you create a new field:

@example
$ @kbd{echo a b c d | awk '@{ OFS = ":"; $2 = ""; $6 = "new"}
> @kbd{print $0; print NF @}'}
@print{} a::c:d::new
@print{} 6
@end example

@noindent
The intervening field, @code{$5}, is created with an empty value
(indicated by the second pair of adjacent colons),
and @code{NF} is updated with the value six.

@cindex dark corner @subentry @code{NF} variable, decrementing
@cindex @code{NF} variable @subentry decrementing
Decrementing @code{NF} throws away the values of the fields
after the new value of @code{NF} and recomputes @code{$0}.
@value{DARKCORNER}
Here is an example:

@example
$ @kbd{echo a b c d e f | awk '@{ print "NF =", NF;}
> @kbd{                         NF = 3; print $0 @}'}
@print{} NF = 6
@print{} a b c
@end example

@cindex portability @subentry @code{NF} variable, decrementing
@quotation CAUTION
Some versions of @command{awk} don't
rebuild @code{$0} when @code{NF} is decremented.
Until August, 2018, this included BWK @command{awk}; fortunately
his version now handles this correctly.
@end quotation

Finally, there are times when it is convenient to force
@command{awk} to rebuild the entire record, using the current
values of the fields and @code{OFS}.  To do this, use the
seemingly innocuous assignment:

@example
@group
$1 = $1   # force record to be reconstituted
print $0  # or whatever else with $0
@end group
@end example

@noindent
This forces @command{awk} to rebuild the record.
It does help
to add a comment, as we've shown here.

There is a flip side to the relationship between @code{$0} and
the fields.  Any assignment to @code{$0} causes the record to be
reparsed into fields using the @emph{current} value of @code{FS}.
This also applies to any built-in function that updates @code{$0},
such as @code{sub()} and @code{gsub()}
(@pxref{String Functions}).

@sidebar Understanding @code{$0}

It is important to remember that @code{$0} is the @emph{full}
record, exactly as it was read from the input.  This includes
any leading or trailing whitespace, and the exact whitespace (or other
characters) that separates the fields.

It is a common error to try to change the field separators
in a record simply by setting @code{FS} and @code{OFS}, and then
expecting a plain @samp{print} or @samp{print $0} to print the
modified record.

But this does not work, because nothing was done to change the record
itself.  Instead, you must force the record to be rebuilt, typically
with a statement such as @samp{$1 = $1}, as described earlier.
@end sidebar


@node Field Separators
@section Specifying How Fields Are Separated

@menu
* Default Field Splitting::      How fields are normally separated.
* Regexp Field Splitting::       Using regexps as the field separator.
* Single Character Fields::      Making each character a separate field.
* Command Line Field Separator:: Setting @code{FS} from the command line.
* Full Line Fields::             Making the full line be a single field.
* Field Splitting Summary::      Some final points and a summary table.
7231@end menu 7232 7233@cindex @code{FS} variable 7234@cindex fields @subentry separating 7235@cindex field separator 7236@cindex fields @subentry separating 7237The @dfn{field separator}, which is either a single character or a regular 7238expression, controls the way @command{awk} splits an input record into fields. 7239@command{awk} scans the input record for character sequences that 7240match the separator; the fields themselves are the text between the matches. 7241 7242In the examples that follow, we use the bullet symbol (@bullet{}) to 7243represent spaces in the output. 7244If the field separator is @samp{oo}, then the following line: 7245 7246@example 7247moo goo gai pan 7248@end example 7249 7250@noindent 7251is split into three fields: @samp{m}, @samp{@bullet{}g}, and 7252@samp{@bullet{}gai@bullet{}pan}. 7253Note the leading spaces in the values of the second and third fields. 7254 7255@cindex troubleshooting @subentry @command{awk} uses @code{FS} not @code{IFS} 7256The field separator is represented by the predefined variable @code{FS}. 7257Shell programmers take note: @command{awk} does @emph{not} use the 7258name @code{IFS} that is used by the POSIX-compliant shells (such as 7259the Unix Bourne shell, @command{sh}, or Bash). 7260 7261@cindex @code{FS} variable @subentry changing value of 7262The value of @code{FS} can be changed in the @command{awk} program with the 7263assignment operator, @samp{=} (@pxref{Assignment Ops}). 7264Often, the right time to do this is at the beginning of execution 7265before any input has been processed, so that the very first record 7266is read with the proper separator. To do this, use the special 7267@code{BEGIN} pattern 7268(@pxref{BEGIN/END}). 7269For example, here we set the value of @code{FS} to the string 7270@code{","}: 7271 7272@example 7273awk 'BEGIN @{ FS = "," @} ; @{ print $2 @}' 7274@end example 7275 7276@cindex @code{BEGIN} pattern 7277@noindent 7278Given the input line: 7279 7280@example 7281John Q. 
Smith, 29 Oak St., Walamazoo, MI 42139 7282@end example 7283 7284@noindent 7285this @command{awk} program extracts and prints the string 7286@samp{@bullet{}29@bullet{}Oak@bullet{}St.}. 7287 7288@cindex field separator @subentry choice of 7289@cindex regular expressions @subentry as field separators 7290@cindex field separator @subentry regular expression as 7291Sometimes the input data contains separator characters that don't 7292separate fields the way you thought they would. For instance, the 7293person's name in the example we just used might have a title or 7294suffix attached, such as: 7295 7296@example 7297John Q. Smith, LXIX, 29 Oak St., Walamazoo, MI 42139 7298@end example 7299 7300@noindent 7301The same program would extract @samp{@bullet{}LXIX} instead of 7302@samp{@bullet{}29@bullet{}Oak@bullet{}St.}. 7303If you were expecting the program to print the 7304address, you would be surprised. The moral is to choose your data layout and 7305separator characters carefully to prevent such problems. 7306(If the data is not in a form that is easy to process, perhaps you 7307can massage it first with a separate @command{awk} program.) 7308 7309 7310@node Default Field Splitting 7311@subsection Whitespace Normally Separates Fields 7312 7313@cindex field separator @subentry whitespace as 7314@cindex whitespace @subentry as field separators 7315@cindex field separator @subentry @code{FS} variable and 7316@cindex separators @subentry field @subentry @code{FS} variable and 7317Fields are normally separated by whitespace sequences 7318(spaces, TABs, and newlines), not by single spaces. Two spaces in a row do not 7319delimit an empty field. The default value of the field separator @code{FS} 7320is a string containing a single space, @w{@code{" "}}. If @command{awk} 7321interpreted this value in the usual way, each space character would separate 7322fields, so two spaces in a row would make an empty field between them. 
7323The reason this does not happen is that a single space as the value of 7324@code{FS} is a special case---it is taken to specify the default manner 7325of delimiting fields. 7326 7327If @code{FS} is any other single character, such as @code{","}, then 7328each occurrence of that character separates two fields. Two consecutive 7329occurrences delimit an empty field. If the character occurs at the 7330beginning or the end of the line, that too delimits an empty field. The 7331space character is the only single character that does not follow these 7332rules. 7333 7334@node Regexp Field Splitting 7335@subsection Using Regular Expressions to Separate Fields 7336 7337@cindex regular expressions @subentry as field separators 7338@cindex field separator @subentry regular expression as 7339The previous @value{SUBSECTION} 7340discussed the use of single characters or simple strings as the 7341value of @code{FS}. 7342More generally, the value of @code{FS} may be a string containing any 7343regular expression. In this case, each match in the record for the regular 7344expression separates fields. For example, the assignment: 7345 7346@example 7347FS = ", \t" 7348@end example 7349 7350@noindent 7351makes every area of an input line that consists of a comma followed by a 7352space and a TAB into a field separator. 7353@ifinfo 7354(@samp{\t} 7355is an @dfn{escape sequence} that stands for a TAB; 7356@pxref{Escape Sequences}, 7357for the complete list of similar escape sequences.) 7358@end ifinfo 7359 7360For a less trivial example of a regular expression, try using 7361single spaces to separate fields the way single commas are used. 7362@code{FS} can be set to @w{@code{"[@ ]"}} (left bracket, space, right 7363bracket). This regular expression matches a single space and nothing else 7364(@pxref{Regexp}). 
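As a rough illustration of these rules, the following commands contrast a single-character separator, the default single-space value, and the bracketed single space. The input strings are made up for this sketch and are not taken from the sample data files; the expected results appear as comments:

```shell
# FS is a comma: the two adjacent commas delimit an empty third field.
echo 'a,b,,d' | awk -F, '{ print NF ": [" $3 "]" }'
# 4: []

# Default FS (a single space): the two spaces form one separator.
echo 'a  b' | awk '{ print NF ": [" $2 "]" }'
# 2: [b]

# FS is "[ ]": each space separates fields, so the second field is empty.
echo 'a  b' | awk -F'[ ]' '{ print NF ": [" $2 "]" }'
# 3: []
```

Printing @code{NF} next to a bracketed field is a convenient way to make empty fields visible.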
7365 7366There is an important difference between the two cases of @samp{FS = @w{" "}} 7367(a single space) and @samp{FS = @w{"[ \t\n]+"}} 7368(a regular expression matching one or more spaces, TABs, or newlines). 7369For both values of @code{FS}, fields are separated by @dfn{runs} 7370(multiple adjacent occurrences) of spaces, TABs, 7371and/or newlines. However, when the value of @code{FS} is @w{@code{" "}}, 7372@command{awk} first strips leading and trailing whitespace from 7373the record and then decides where the fields are. 7374For example, the following pipeline prints @samp{b}: 7375 7376@example 7377$ @kbd{echo ' a b c d ' | awk '@{ print $2 @}'} 7378@print{} b 7379@end example 7380 7381@noindent 7382However, this pipeline prints @samp{a} (note the extra spaces around 7383each letter): 7384 7385@example 7386$ @kbd{echo ' a b c d ' | awk 'BEGIN @{ FS = "[ \t\n]+" @}} 7387> @kbd{@{ print $2 @}'} 7388@print{} a 7389@end example 7390 7391@noindent 7392@cindex null strings 7393@cindex strings @subentry null 7394In this case, the first field is null, or empty. 7395 7396The stripping of leading and trailing whitespace also comes into 7397play whenever @code{$0} is recomputed. For instance, study this pipeline: 7398 7399@example 7400$ @kbd{echo ' a b c d' | awk '@{ print; $2 = $2; print @}'} 7401@print{} a b c d 7402@print{} a b c d 7403@end example 7404 7405@noindent 7406The first @code{print} statement prints the record as it was read, 7407with leading whitespace intact. The assignment to @code{$2} rebuilds 7408@code{$0} by concatenating @code{$1} through @code{$NF} together, 7409separated by the value of @code{OFS} (which is a space by default). 7410Because the leading whitespace was ignored when finding @code{$1}, 7411it is not part of the new @code{$0}. Finally, the last @code{print} 7412statement prints the new @code{$0}. 
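The same distinction shows up when the record is rebuilt under a regexp separator. The following sketch (with made-up input) forces a rebuild with @samp{$1 = $1} under both settings. With the regexp, the leading whitespace counts as a separator, so the empty first field survives and the new record begins with @code{OFS}:

```shell
# Default FS: leading blanks are ignored, so the rebuilt record starts with "a".
echo '  a b' | awk '{ $1 = $1; print NF ": [" $0 "]" }'
# 2: [a b]

# Regexp FS: the leading blanks are a separator, $1 is empty,
# and the rebuilt record starts with OFS (a single space).
echo '  a b' | awk 'BEGIN { FS = "[ \t\n]+" } { $1 = $1; print NF ": [" $0 "]" }'
# 3: [ a b]
```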

@cindex @code{FS} variable @subentry containing @code{^}
@cindex @code{^} (caret) @subentry in @code{FS}
@cindex dark corner @subentry @code{^}, in @code{FS}
There is an additional subtlety to be aware of when using regular expressions
for field splitting.
It is not well specified in the POSIX standard, or anywhere else, what @samp{^}
means when splitting fields.  Does the @samp{^} match only at the beginning of
the entire record? Or is each field separator a new string?  It turns out that
different @command{awk} versions answer this question differently, and you
should not rely on any specific behavior in your programs.
@value{DARKCORNER}

@cindex Brian Kernighan's @command{awk}
As a point of information, BWK @command{awk} allows @samp{^}
to match only at the beginning of the record. @command{gawk}
also works this way. For example:

@example
$ @kbd{echo 'xxAA xxBxx C' |}
> @kbd{gawk -F '(^x+)|( +)' '@{ for (i = 1; i <= NF; i++)}
> @kbd{            printf "-->%s<--\n", $i @}'}
@print{} --><--
@print{} -->AA<--
@print{} -->xxBxx<--
@print{} -->C<--
@end example

Finally, field splitting with regular expressions works differently than
regexp matching with the @code{sub()}, @code{gsub()}, and @code{gensub()}
functions (@pxref{String Functions}).  Those functions allow a regexp to
match the empty string; field splitting does not.  Thus, for example,
@samp{FS = "()"} does @emph{not} split fields between characters.
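The difference can be seen with a regexp that is able to match the empty string. The following sketch uses @samp{x*} rather than @samp{()}, together with made-up input, and assumes @command{gawk}; other @command{awk} versions may treat such a separator differently:

```shell
# gsub() lets the regexp match the empty string before each
# character and at the end of the record.
echo abc | gawk '{ gsub(/x*/, "-"); print }'
# -a-b-c-

# As FS, the same regexp separates fields only where it matches
# at least one character, so the record splits at "xx" and nowhere else.
echo axxbc | gawk -F'x*' '{ print NF ": " $1 " " $2 }'
# 2: a bc
```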
7446 7447@node Single Character Fields 7448@subsection Making Each Character a Separate Field 7449 7450@cindex common extensions @subentry single character fields 7451@cindex extensions @subentry common @subentry single character fields 7452@cindex differences in @command{awk} and @command{gawk} @subentry single-character fields 7453@cindex single-character fields 7454@cindex fields @subentry single-character 7455There are times when you may want to examine each character 7456of a record separately. This can be done in @command{gawk} by 7457simply assigning the null string (@code{""}) to @code{FS}. @value{COMMONEXT} 7458In this case, 7459each individual character in the record becomes a separate field. 7460For example: 7461 7462@example 7463$ @kbd{echo a b | gawk 'BEGIN @{ FS = "" @}} 7464> @kbd{@{} 7465> @kbd{for (i = 1; i <= NF; i = i + 1)} 7466> @kbd{print "Field", i, "is", $i} 7467> @kbd{@}'} 7468@print{} Field 1 is a 7469@print{} Field 2 is 7470@print{} Field 3 is b 7471@end example 7472 7473@cindex dark corner @subentry @code{FS} as null string 7474@cindex @code{FS} variable @subentry null string as 7475Traditionally, the behavior of @code{FS} equal to @code{""} was not defined. 7476In this case, most versions of Unix @command{awk} simply treat the entire record 7477as only having one field. 7478@value{DARKCORNER} 7479In compatibility mode 7480(@pxref{Options}), 7481if @code{FS} is the null string, then @command{gawk} also 7482behaves this way. 7483 7484@node Command Line Field Separator 7485@subsection Setting @code{FS} from the Command Line 7486@cindex @option{-F} option @subentry command-line 7487@cindex field separator @subentry on command line 7488@cindex command line @subentry @code{FS} on, setting 7489@cindex @code{FS} variable @subentry setting from command line 7490 7491@code{FS} can be set on the command line. Use the @option{-F} option to 7492do so. 
For example: 7493 7494@example 7495awk -F, '@var{program}' @var{input-files} 7496@end example 7497 7498@noindent 7499sets @code{FS} to the @samp{,} character. Notice that the option uses 7500an uppercase @samp{F} instead of a lowercase @samp{f}. The latter 7501option (@option{-f}) specifies a file containing an @command{awk} program. 7502 7503The value used for the argument to @option{-F} is processed in exactly the 7504same way as assignments to the predefined variable @code{FS}. 7505Any special characters in the field separator must be escaped 7506appropriately. For example, to use a @samp{\} as the field separator 7507on the command line, you would have to type: 7508 7509@example 7510# same as FS = "\\" 7511awk -F\\\\ '@dots{}' files @dots{} 7512@end example 7513 7514@noindent 7515@cindex field separator @subentry backslash (@code{\}) as 7516@cindex @code{\} (backslash) @subentry as field separator 7517@cindex backslash (@code{\}) @subentry as field separator 7518Because @samp{\} is used for quoting in the shell, @command{awk} sees 7519@samp{-F\\}. Then @command{awk} processes the @samp{\\} for escape 7520characters (@pxref{Escape Sequences}), finally yielding 7521a single @samp{\} to use for the field separator. 7522 7523@c @cindex historical features 7524As a special case, in compatibility mode 7525(@pxref{Options}), 7526if the argument to @option{-F} is @samp{t}, then @code{FS} is set to 7527the TAB character. If you type @samp{-F\t} at the 7528shell, without any quotes, the @samp{\} gets deleted, so @command{awk} 7529figures that you really want your fields to be separated with TABs and 7530not @samp{t}s. Use @samp{-v FS="t"} or @samp{-F"[t]"} on the command line 7531if you really do want to separate your fields with @samp{t}s. 7532Use @samp{-F '\t'} when not in compatibility mode to specify that TABs 7533separate fields. 
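A short sketch of the two quoted forms follows. The input is made up for illustration, and the commands assume an @command{awk} that is not running in compatibility mode. The unquoted @samp{-Ft} form is deliberately left out, because its meaning depends on the implementation:

```shell
# The quoted '\t' reaches awk intact and is processed as the
# escape sequence for TAB, so the first field is the whole word.
printf 'cat\tdog\n' | awk -F '\t' '{ print $1 }'
# cat

# A bracket expression splits on the letter "t" itself.
printf 'cat\tdog\n' | awk -F '[t]' '{ print $1 }'
# ca
```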
7534 7535As an example, let's use an @command{awk} program file called @file{edu.awk} 7536that contains the pattern @code{/edu/} and the action @samp{print $1}: 7537 7538@example 7539/edu/ @{ print $1 @} 7540@end example 7541 7542Let's also set @code{FS} to be the @samp{-} character and run the 7543program on the file @file{mail-list}. The following command prints a 7544list of the names of the people that work at or attend a university, and 7545the first three digits of their phone numbers: 7546 7547@example 7548$ @kbd{awk -F- -f edu.awk mail-list} 7549@print{} Fabius 555 7550@print{} Samuel 555 7551@print{} Jean 7552@end example 7553 7554@noindent 7555Note the third line of output. The third line 7556in the original file looked like this: 7557 7558@example 7559Jean-Paul 555-2127 jeanpaul.campanorum@@nyu.edu R 7560@end example 7561 7562The @samp{-} as part of the person's name was used as the field 7563separator, instead of the @samp{-} in the phone number that was 7564originally intended. This demonstrates why you have to be careful in 7565choosing your field and record separators. 7566 7567@cindex Unix @command{awk} @subentry password files, field separators and 7568Perhaps the most common use of a single character as the field separator 7569occurs when processing the Unix system password file. On many Unix 7570systems, each user has a separate entry in the system password file, with one 7571line per user. The information in these lines is separated by colons. 7572The first field is the user's login name and the second is the user's 7573encrypted or shadow password. (A shadow password is indicated by the 7574presence of a single @samp{x} in the second field.) 
A password file 7575entry might look like this: 7576 7577@cindex Robbins @subentry Arnold 7578@example 7579arnold:x:2076:10:Arnold Robbins:/home/arnold:/bin/bash 7580@end example 7581 7582The following program searches the system password file and prints 7583the entries for users whose full name is not indicated: 7584 7585@example 7586awk -F: '$5 == ""' /etc/passwd 7587@end example 7588 7589@node Full Line Fields 7590@subsection Making the Full Line Be a Single Field 7591 7592Occasionally, it's useful to treat the whole input line as a 7593single field. This can be done easily and portably simply by 7594setting @code{FS} to @code{"\n"} (a newline):@footnote{Thanks to 7595Andrew Schorr for this tip.} 7596 7597@example 7598awk -F'\n' '@var{program}' @var{files @dots{}} 7599@end example 7600 7601@noindent 7602When you do this, @code{$1} is the same as @code{$0}. 7603 7604@sidebar Changing @code{FS} Does Not Affect the Fields 7605 7606@cindex POSIX @command{awk} @subentry field separators and 7607@cindex field separator @subentry POSIX and 7608According to the POSIX standard, @command{awk} is supposed to behave 7609as if each record is split into fields at the time it is read. 7610In particular, this means that if you change the value of @code{FS} 7611after a record is read, the values of the fields (i.e., how they were split) 7612should reflect the old value of @code{FS}, not the new one. 7613 7614@cindex dark corner @subentry field separators 7615@cindex @command{sed} utility 7616@cindex stream editors 7617However, many older implementations of @command{awk} do not work this way. Instead, 7618they defer splitting the fields until a field is actually 7619referenced. The fields are split 7620using the @emph{current} value of @code{FS}! 7621@value{DARKCORNER} 7622This behavior can be difficult 7623to diagnose. 
The following example illustrates the difference 7624between the two methods: 7625 7626@example 7627sed 1q /etc/passwd | awk '@{ FS = ":" ; print $1 @}' 7628@end example 7629 7630@noindent 7631which usually prints: 7632 7633@example 7634root 7635@end example 7636 7637@noindent 7638on an incorrect implementation of @command{awk}, while @command{gawk} 7639prints the full first line of the file, something like: 7640 7641@example 7642root:x:0:0:Root:/: 7643@end example 7644 7645(The @command{sed}@footnote{The @command{sed} utility is a ``stream editor.'' 7646Its behavior is also defined by the POSIX standard.} 7647command prints just the first line of @file{/etc/passwd}.) 7648@end sidebar 7649 7650@node Field Splitting Summary 7651@subsection Field-Splitting Summary 7652 7653It is important to remember that when you assign a string constant 7654as the value of @code{FS}, it undergoes normal @command{awk} string 7655processing. For example, with Unix @command{awk} and @command{gawk}, 7656the assignment @samp{FS = "\.."} assigns the character string @code{".."} 7657to @code{FS} (the backslash is stripped). This creates a regexp meaning 7658``fields are separated by occurrences of any two characters.'' 7659If instead you want fields to be separated by a literal period followed 7660by any single character, use @samp{FS = "\\.."}. 7661 7662The following list summarizes how fields are split, based on the value 7663of @code{FS} (@samp{==} means ``is equal to''): 7664 7665@table @code 7666@item FS == " " 7667Fields are separated by runs of whitespace. Leading and trailing 7668whitespace are ignored. This is the default. 7669 7670@item FS == @var{any other single character} 7671Fields are separated by each occurrence of the character. Multiple 7672successive occurrences delimit empty fields, as do leading and 7673trailing occurrences. 7674The character can even be a regexp metacharacter; it does not need 7675to be escaped. 
7676 7677@item FS == @var{regexp} 7678Fields are separated by occurrences of characters that match @var{regexp}. 7679Leading and trailing matches of @var{regexp} delimit empty fields. 7680 7681@item FS == "" 7682Each individual character in the record becomes a separate field. 7683(This is a common extension; it is not specified by the POSIX standard.) 7684@end table 7685 7686@sidebar @code{FS} and @code{IGNORECASE} 7687 7688The @code{IGNORECASE} variable 7689(@pxref{User-modified}) 7690affects field splitting @emph{only} when the value of @code{FS} is a regexp. 7691It has no effect when @code{FS} is a single character, even if 7692that character is a letter. Thus, in the following code: 7693 7694@example 7695FS = "c" 7696IGNORECASE = 1 7697$0 = "aCa" 7698print $1 7699@end example 7700 7701@noindent 7702The output is @samp{aCa}. If you really want to split fields on an 7703alphabetic character while ignoring case, use a regexp that will 7704do it for you (e.g., @samp{FS = "[c]"}). In this case, @code{IGNORECASE} 7705will take effect. 7706@end sidebar 7707 7708 7709@node Constant Size 7710@section Reading Fixed-Width Data 7711 7712@cindex data, fixed-width 7713@cindex fixed-width data 7714@cindex advanced features @subentry fixed-width data 7715 7716@c O'Reilly doesn't like it as a note the first thing in the section. 7717This @value{SECTION} discusses an advanced 7718feature of @command{gawk}. If you are a novice @command{awk} user, 7719you might want to skip it on the first reading. 7720 7721@command{gawk} provides a facility for dealing with fixed-width fields 7722with no distinctive field separator. We discuss this feature in 7723the following @value{SUBSECTION}s. 7724 7725@menu 7726* Fixed width data:: Processing fixed-width data. 7727* Skipping intervening:: Skipping intervening fields. 7728* Allowing trailing data:: Capturing optional trailing data. 7729* Fields with fixed data:: Field values with fixed-width data. 
7730@end menu 7731 7732@node Fixed width data 7733@subsection Processing Fixed-Width Data 7734 7735An example of fixed-width data would be the input for old Fortran programs 7736where numbers are run together, or the output of programs that did not 7737anticipate the use of their output as input for other programs. 7738 7739An example of the latter is a table where all the columns are lined up 7740by the use of a variable number of spaces and @emph{empty fields are 7741just spaces}. Clearly, @command{awk}'s normal field splitting based 7742on @code{FS} does not work well in this case. Although a portable 7743@command{awk} program can use a series of @code{substr()} calls on 7744@code{$0} (@pxref{String Functions}), this is awkward and inefficient 7745for a large number of fields. 7746 7747@cindex troubleshooting @subentry fatal errors @subentry field widths, specifying 7748@cindex @command{w} utility 7749@cindex @code{FIELDWIDTHS} variable 7750@cindex @command{gawk} @subentry @code{FIELDWIDTHS} variable in 7751The splitting of an input record into fixed-width fields is specified by 7752assigning a string containing space-separated numbers to the built-in 7753variable @code{FIELDWIDTHS}. Each number specifies the width of the 7754field, @emph{including} columns between fields. If you want to ignore 7755the columns between fields, you can specify the width as a separate 7756field that is subsequently ignored. It is a fatal error to supply a 7757field width that has a negative value. 7758 7759The following data is the output of the Unix @command{w} utility. 
It is useful 7760to illustrate the use of @code{FIELDWIDTHS}: 7761 7762@example 7763@group 7764 10:06pm up 21 days, 14:04, 23 users 7765User tty login@ idle JCPU PCPU what 7766hzuo ttyV0 8:58pm 9 5 vi p24.tex 7767hzang ttyV3 6:37pm 50 -csh 7768eklye ttyV5 9:53pm 7 1 em thes.tex 7769dportein ttyV6 8:17pm 1:47 -csh 7770gierd ttyD3 10:00pm 1 elm 7771dave ttyD4 9:47pm 4 4 w 7772brent ttyp0 26Jun91 4:46 26:46 4:41 bash 7773dave ttyq4 26Jun9115days 46 46 wnewmail 7774@end group 7775@end example 7776 7777The following program takes this input, converts the idle time to 7778number of seconds, and prints out the first two fields and the calculated 7779idle time: 7780 7781@example 7782BEGIN @{ FIELDWIDTHS = "9 6 10 6 7 7 35" @} 7783NR > 2 @{ 7784 idle = $4 7785 sub(/^ +/, "", idle) # strip leading spaces 7786 if (idle == "") 7787 idle = 0 7788 if (idle ~ /:/) @{ # hh:mm 7789 split(idle, t, ":") 7790 idle = t[1] * 60 + t[2] 7791 @} 7792 if (idle ~ /days/) 7793 idle *= 24 * 60 * 60 7794 7795 print $1, $2, idle 7796@} 7797@end example 7798 7799@quotation NOTE 7800The preceding program uses a number of @command{awk} features that 7801haven't been introduced yet. 7802@end quotation 7803 7804Running the program on the data produces the following results: 7805 7806@example 7807hzuo ttyV0 0 7808hzang ttyV3 50 7809eklye ttyV5 0 7810dportein ttyV6 107 7811gierd ttyD3 1 7812dave ttyD4 0 7813brent ttyp0 286 7814dave ttyq4 1296000 7815@end example 7816 7817Another (possibly more practical) example of fixed-width input data 7818is the input from a deck of balloting cards. In some parts of 7819the United States, voters mark their choices by punching holes in computer 7820cards. These cards are then processed to count the votes for any particular 7821candidate or on any particular issue. Because a voter may choose not to 7822vote on some issue, any column on the card may be empty. 
An @command{awk} 7823program for processing such data could use the @code{FIELDWIDTHS} feature 7824to simplify reading the data. (Of course, getting @command{gawk} to run on 7825a system with card readers is another story!) 7826 7827@node Skipping intervening 7828@subsection Skipping Intervening Fields 7829 7830Starting in @value{PVERSION} 4.2, each field width may optionally be 7831preceded by a colon-separated value specifying the number of characters 7832to skip before the field starts. Thus, the preceding program could be 7833rewritten to specify @code{FIELDWIDTHS} like so: 7834 7835@example 7836BEGIN @{ FIELDWIDTHS = "8 1:5 4:7 6 1:6 1:6 2:33" @} 7837@end example 7838 7839This strips away some of the white space separating the fields. With such 7840a change, the program produces the following results: 7841 7842@example 7843hzang ttyV3 50 7844eklye ttyV5 0 7845dportein ttyV6 107 7846gierd ttyD3 1 7847dave ttyD4 0 7848brent ttyp0 286 7849dave ttyq4 1296000 7850@end example 7851 7852@node Allowing trailing data 7853@subsection Capturing Optional Trailing Data 7854 7855There are times when fixed-width data may be followed by additional data 7856that has no fixed length. Such data may or may not be present, but if 7857it is, it should be possible to get at it from an @command{awk} program. 7858 7859Starting with @value{PVERSION} 4.2, in order to provide a way to say ``anything 7860else in the record after the defined fields,'' @command{gawk} 7861allows you to add a final @samp{*} character to the value of 7862@code{FIELDWIDTHS}. There can only be one such character, and it must 7863be the final non-whitespace character in @code{FIELDWIDTHS}. 
For example:

@example
$ @kbd{cat fw.awk}                         @ii{Show the program}
@print{} BEGIN @{ FIELDWIDTHS = "2 2 *" @}
@print{} @{ print NF, $1, $2, $3 @}
$ @kbd{cat fw.in}                          @ii{Show sample input}
@print{} 1234abcdefghi
$ @kbd{gawk -f fw.awk fw.in}               @ii{Run the program}
@print{} 3 12 34 abcdefghi
@end example

@node Fields with fixed data
@subsection Field Values With Fixed-Width Data

So far, so good.  But what happens if there isn't as much data as there
should be based on the contents of @code{FIELDWIDTHS}? Or, what happens
if there is more data than expected?

For many years, what happened in these cases was not well defined. Starting
with @value{PVERSION} 4.2, the rules are as follows:

@table @asis
@item Enough data for some fields
For example, suppose @code{FIELDWIDTHS} is set to @code{"2 3 4"} and the
input record is @samp{aabbb}.  In this case, @code{NF} is set to two.

@item Not enough data for a field
For example, suppose @code{FIELDWIDTHS} is set to @code{"2 3 4"} and the
input record is @samp{aab}.  In this case, @code{NF} is set to two and
@code{$2} has the value @code{"b"}.  The idea is that even though there
aren't as many characters as were expected, there are some, so the data
should be made available to the program.

@item Too much data
For example, suppose @code{FIELDWIDTHS} is set to @code{"2 3 4"} and the
input record is @samp{aabbbccccddd}.  In this case, @code{NF} is set to
three and the extra characters (@samp{ddd}) are ignored.  If you want
@command{gawk} to capture the extra characters, supply a final @samp{*}
in the value of @code{FIELDWIDTHS}.

@item Too much data, but with @samp{*} supplied
For example, suppose @code{FIELDWIDTHS} is set to @code{"2 3 4 *"} and the
input record is @samp{aabbbccccddd}.  In this case, @code{NF} is set to
four, and @code{$4} has the value @code{"ddd"}.
7909 7910@end table 7911 7912@node Splitting By Content 7913@section Defining Fields by Content 7914 7915@menu 7916* More CSV:: More on CSV files. 7917* FS versus FPAT:: A subtle difference. 7918@end menu 7919 7920@c O'Reilly doesn't like it as a note the first thing in the section. 7921This @value{SECTION} discusses an advanced 7922feature of @command{gawk}. If you are a novice @command{awk} user, 7923you might want to skip it on the first reading. 7924 7925@cindex advanced features @subentry specifying field content 7926Normally, when using @code{FS}, @command{gawk} defines the fields as the 7927parts of the record that occur in between each field separator. In other 7928words, @code{FS} defines what a field @emph{is not}, instead of what a field 7929@emph{is}. 7930However, there are times when you really want to define the fields by 7931what they are, and not by what they are not. 7932 7933@cindex CSV (comma separated values) data @subentry parsing with @code{FPAT} 7934@cindex Comma separated values (CSV) data @subentry parsing with @code{FPAT} 7935The most notorious such case 7936is so-called @dfn{comma-separated values} (CSV) data. Many spreadsheet programs, 7937for example, can export their data into text files, where each record is 7938terminated with a newline, and fields are separated by commas. If 7939commas only separated the data, there wouldn't be an issue. The problem comes when 7940one of the fields contains an @emph{embedded} comma. 7941In such cases, most programs embed the field in double quotes.@footnote{The 7942CSV format lacked a formal standard definition for many years. 
7943@uref{http://www.ietf.org/rfc/rfc4180.txt, RFC 4180} 7944standardizes the most common practices.} 7945So, we might have data like this: 7946 7947@example 7948@c file eg/misc/addresses.csv 7949Robbins,Arnold,"1234 A Pretty Street, NE",MyTown,MyState,12345-6789,USA 7950@c endfile 7951@end example 7952 7953@cindex @command{gawk} @subentry @code{FPAT} variable in 7954@cindex @code{FPAT} variable 7955The @code{FPAT} variable offers a solution for cases like this. 7956The value of @code{FPAT} should be a string that provides a regular expression. 7957This regular expression describes the contents of each field. 7958 7959In the case of CSV data as presented here, each field is either ``anything that 7960is not a comma,'' or ``a double quote, anything that is not a double quote, and a 7961closing double quote.'' (There are more complicated definitions of CSV data, 7962treated shortly.) 7963If written as a regular expression constant 7964(@pxref{Regexp}), 7965we would have @code{/([^,]+)|("[^"]+")/}. 7966Writing this as a string requires us to escape the double quotes, leading to: 7967 7968@example 7969FPAT = "([^,]+)|(\"[^\"]+\")" 7970@end example 7971 7972Putting this to use, here is a simple program to parse the data: 7973 7974@example 7975@c file eg/misc/simple-csv.awk 7976@group 7977BEGIN @{ 7978 FPAT = "([^,]+)|(\"[^\"]+\")" 7979@} 7980@end group 7981 7982@group 7983@{ 7984 print "NF = ", NF 7985 for (i = 1; i <= NF; i++) @{ 7986 printf("$%d = <%s>\n", i, $i) 7987 @} 7988@} 7989@end group 7990@c endfile 7991@end example 7992 7993When run, we get the following: 7994 7995@example 7996$ @kbd{gawk -f simple-csv.awk addresses.csv} 7997NF = 7 7998$1 = <Robbins> 7999$2 = <Arnold> 8000$3 = <"1234 A Pretty Street, NE"> 8001$4 = <MyTown> 8002$5 = <MyState> 8003$6 = <12345-6789> 8004$7 = <USA> 8005@end example 8006 8007Note the embedded comma in the value of @code{$3}. 
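To see why @code{FPAT} is needed here, it may help to compare it with plain comma splitting on the same record. The following sketch holds the sample line in a shell variable (the name @code{line} is only for illustration) and assumes @command{gawk} is installed:

```shell
line='Robbins,Arnold,"1234 A Pretty Street, NE",MyTown,MyState,12345-6789,USA'

# Splitting on commas with FS breaks the quoted address in two.
echo "$line" | gawk -F, '{ print NF ": " $3 }'
# 8: "1234 A Pretty Street

# Describing the fields with FPAT keeps the address together.
echo "$line" | gawk 'BEGIN { FPAT = "([^,]+)|(\"[^\"]+\")" } { print NF ": " $3 }'
# 7: "1234 A Pretty Street, NE"
```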
8008 8009A straightforward improvement when processing CSV data of this sort 8010would be to remove the quotes when they occur, with something like this: 8011 8012@example 8013if (substr($i, 1, 1) == "\"") @{ 8014 len = length($i) 8015 $i = substr($i, 2, len - 2) # Get text within the two quotes 8016@} 8017@end example 8018 8019@quotation NOTE 8020Some programs export CSV data that contains embedded newlines between 8021the double quotes. @command{gawk} provides no way to deal with this. 8022Even though a formal specification for CSV data exists, there isn't much 8023more to be done; 8024the @code{FPAT} mechanism provides an elegant solution for the majority 8025of cases, and the @command{gawk} developers are satisfied with that. 8026@end quotation 8027 8028As written, the regexp used for @code{FPAT} requires that each field 8029contain at least one character. A straightforward modification 8030(changing the first @samp{+} to @samp{*}) allows fields to be empty: 8031 8032@example 8033FPAT = "([^,]*)|(\"[^\"]+\")" 8034@end example 8035 8036@c 4/2015: 8037@c Consider use of FPAT = "([^,]*)|(\"[^\"]*\")" 8038@c (star in latter part of value) to allow quoted strings to be empty. 8039@c Per email from Ed Morton <mortoneccc@comcast.net> 8040@c 8041@c WONTFIX: 10/2020 8042@c This is too much work. FPAT and CSV files are very flaky and 8043@c fragile. Doing something like this is merely inviting trouble. 8044 8045As with @code{FS}, the @code{IGNORECASE} variable (@pxref{User-modified}) 8046affects field splitting with @code{FPAT}. 8047 8048Assigning a value to @code{FPAT} overrides field splitting 8049with @code{FS} and with @code{FIELDWIDTHS}. 8050 8051Finally, the @code{patsplit()} function makes the same functionality 8052available for splitting regular strings (@pxref{String Functions}). 
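As a minimal sketch of that last point, the following program applies the same regexp to an ordinary string with @code{patsplit()} and strips the surrounding quotes from each piece. The string and the array name @code{parts} are made up for illustration:

```shell
gawk 'BEGIN {
    s = "p,\"q,r\",s"
    # patsplit() fills parts[] with the text that matches the regexp
    # and returns the number of pieces found.
    n = patsplit(s, parts, /([^,]+)|("[^"]+")/)
    for (i = 1; i <= n; i++) {
        if (substr(parts[i], 1, 1) == "\"")    # drop the enclosing quotes
            parts[i] = substr(parts[i], 2, length(parts[i]) - 2)
        print i ": " parts[i]
    }
}'
# 1: p
# 2: q,r
# 3: s
```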

@node More CSV
@subsection More on CSV Files

@cindex Collado, Manuel
Manuel Collado notes that in addition to commas, a CSV field can also
contain quotes, which have to be escaped by doubling them. The previously
described regexps fail to accept quoted fields with both commas and
quotes inside. He suggests that the simplest @code{FPAT} expression that
recognizes this kind of field is @code{/([^,]*)|("([^"]|"")+")/}. He
provides the following input data to test these variants:

@example
@c file eg/misc/sample.csv
p,"q,r",s
p,"q""r",s
p,"q,""r",s
p,"",s
p,,s
@c endfile
@end example

@noindent
And here is his test program:

@example
@c file eg/misc/test-csv.awk
@group
BEGIN @{
     fp[0] = "([^,]+)|(\"[^\"]+\")"
     fp[1] = "([^,]*)|(\"[^\"]+\")"
     fp[2] = "([^,]*)|(\"([^\"]|\"\")+\")"
     FPAT = fp[fpat+0]
@}
@end group

@group
@{
     print "<" $0 ">"
     printf("NF = %s ", NF)
     for (i = 1; i <= NF; i++) @{
         printf("<%s>", $i)
     @}
     print ""
@}
@end group
@c endfile
@end example

When run on the third variant, it produces:

@example
$ @kbd{gawk -v fpat=2 -f test-csv.awk sample.csv}
@print{} <p,"q,r",s>
@print{} NF = 3 <p><"q,r"><s>
@print{} <p,"q""r",s>
@print{} NF = 3 <p><"q""r"><s>
@print{} <p,"q,""r",s>
@print{} NF = 3 <p><"q,""r"><s>
@print{} <p,"",s>
@print{} NF = 3 <p><""><s>
@print{} <p,,s>
@print{} NF = 3 <p><><s>
@end example

@cindex Collado, Manuel
@cindex @code{CSVMODE} library for @command{gawk}
@cindex CSV (comma separated values) data @subentry parsing with @code{CSVMODE} library
@cindex Comma separated values (CSV) data @subentry parsing with @code{CSVMODE} library
In general, using @code{FPAT} to do your own CSV parsing is like having
a bed with a blanket that's not quite big enough.
There's always a corner
that isn't covered. We recommend, instead, that you use Manuel Collado's
@uref{http://mcollado.z15.es/xgawk/, @code{CSVMODE} library for @command{gawk}}.

@node FS versus FPAT
@subsection @code{FS} Versus @code{FPAT}: A Subtle Difference

As we discussed earlier, @code{FS} describes the data between fields (``what fields are not'')
and @code{FPAT} describes the fields themselves (``what fields are'').
This leads to a subtle difference in how fields are found when using regexps as the value
for @code{FS} or @code{FPAT}.

In order to distinguish one field from another, there must be a non-empty separator between
each field. This makes intuitive sense---otherwise one could not distinguish fields from
separators.

Thus, regular expression matching as done when splitting fields with @code{FS} is not
allowed to match the null string; it must always match at least one character, in order
to be able to proceed through the entire record.

On the other hand, regular expression matching with @code{FPAT} can match the null
string, and the non-matching intervening characters function as the separators.

This same difference is reflected in how matching is done with the @code{split()}
and @code{patsplit()} functions (@pxref{String Functions}).

@node Testing field creation
@section Checking How @command{gawk} Is Splitting Records

@cindex @command{gawk} @subentry splitting fields and
As we've seen, @command{gawk} provides three independent methods to split
input records into fields. The mechanism used is based on which of the
three variables---@code{FS}, @code{FIELDWIDTHS}, or @code{FPAT}---was
last assigned to. In addition, an API input parser may choose to override
the record parsing mechanism; please refer to @ref{Input Parsers} for
further information about this feature.
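
The ``last assigned to'' rule can be observed with a small sketch such as
the following. It relies on @code{PROCINFO["FS"]} and on the
@samp{FS = FS} idiom, both of which are described just below; the
particular values assigned here are arbitrary choices for illustration:

@example
BEGIN @{
    FS = ","
    print PROCINFO["FS"]      # FS
    FPAT = "[^,]+"
    print PROCINFO["FS"]      # FPAT
    FIELDWIDTHS = "3 5"
    print PROCINFO["FS"]      # FIELDWIDTHS
    FS = FS                   # back to regular field splitting
    print PROCINFO["FS"]      # FS
@}
@end example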

To restore normal field splitting after using @code{FIELDWIDTHS}
and/or @code{FPAT}, simply assign a value to @code{FS}.
You can use @samp{FS = FS} to do this,
without having to know the current value of @code{FS}.

In order to tell which kind of field splitting is in effect,
use @code{PROCINFO["FS"]} (@pxref{Auto-set}).
The value is @code{"FS"} if regular field splitting is being used,
@code{"FIELDWIDTHS"} if fixed-width field splitting is being used,
or @code{"FPAT"} if content-based field splitting is being used:

@example
if (PROCINFO["FS"] == "FS")
    @var{regular field splitting} @dots{}
else if (PROCINFO["FS"] == "FIELDWIDTHS")
    @var{fixed-width field splitting} @dots{}
else if (PROCINFO["FS"] == "FPAT")
    @var{content-based field splitting} @dots{}
else
    @var{API input parser field splitting} @dots{} @ii{(advanced feature)}
@end example

This information is useful when writing a function that needs to
temporarily change @code{FS} or @code{FIELDWIDTHS}, read some records,
and then restore the original settings (@pxref{Passwd Functions} for an
example of such a function).

@node Multiple Line
@section Multiple-Line Records

@cindex multiple-line records
@cindex records @subentry multiline
@cindex input @subentry multiline records
@cindex files @subentry reading @subentry multiline records
@cindex input, files @seeentry{input files}
In some databases, a single line cannot conveniently hold all the
information in one entry. In such cases, you can use multiline
records. The first step in doing this is to choose your data format.

@cindex record separators @subentry with multiline records
One technique is to use an unusual character or string to separate
records.
For example, you could use the formfeed character (written
@samp{\f} in @command{awk}, as in C) to separate them, making each record
a page of the file. To do this, just set the variable @code{RS} to
@code{"\f"} (a string containing the formfeed character). Any
other character could equally well be used, as long as it won't be part
of the data in a record.

@cindex @code{RS} variable @subentry multiline records and
Another technique is to have blank lines separate records. By a special
dispensation, an empty string as the value of @code{RS} indicates that
records are separated by one or more blank lines. When @code{RS} is set
to the empty string, each record always ends at the first blank line
encountered. The next record doesn't start until the first nonblank
line that follows. No matter how many blank lines appear in a row, they
all act as one record separator.
(Blank lines must be completely empty; lines that contain only
whitespace do not count.)

@cindex leftmost longest match
@cindex matching @subentry leftmost longest
You can achieve the same effect as @samp{RS = ""} by assigning the
string @code{"\n\n+"} to @code{RS}. This regexp matches the newline
at the end of the record and one or more blank lines after the record.
In addition, a regular expression always matches the longest possible
sequence when there is a choice
(@pxref{Leftmost Longest}).
So, the next record doesn't start until
the first nonblank line that follows---no matter how many blank lines
appear in a row, they are considered one record separator.

@cindex dark corner @subentry multiline records
However, there is an important difference between @samp{RS = ""} and
@samp{RS = "\n\n+"}.
In the first case, leading newlines in the input
@value{DF} are ignored, and if a file ends without extra blank lines
after the last record, the final newline is removed from the record.
In the second case, this special processing is not done.
@value{DARKCORNER}

@cindex field separator @subentry in multiline records
@cindex @code{FS} variable @subentry in multiline records
Now that the input is separated into records, the second step is to
separate the fields in the records. One way to do this is to divide each
of the lines into fields in the normal manner. This happens by default
as the result of a special feature. When @code{RS} is set to the empty
string @emph{and} @code{FS} is set to a single character,
the newline character @emph{always} acts as a field separator.
This is in addition to whatever field separations result from
@code{FS}.

@quotation NOTE
When @code{FS} is the null string (@code{""})
or a regexp, this special feature of @code{RS} does not apply.
It does apply to the default field separator of a single space:
@samp{FS = @w{" "}}.

Note that language in the POSIX specification implies that
this special feature should apply when @code{FS} is a regexp.
However, Unix @command{awk} has never behaved that way, nor has
@command{gawk}. This is essentially a bug in POSIX.
@c Noted as of 4/2019; working to get the standard fixed.
@end quotation

The original motivation for this special exception was probably to provide
useful behavior in the default case (i.e., @code{FS} is equal
to @w{@code{" "}}). This feature can be a problem if you really don't
want the newline character to separate fields, because there is no way to
prevent it. However, you can work around this by using the @code{split()}
function to break up the record manually
(@pxref{String Functions}).
If you have a single-character field separator, you can work around
the special feature in a different way, by making @code{FS} into a
regexp for that single character. For example, if the field
separator is a percent character, instead of
@samp{FS = "%"}, use @samp{FS = "[%]"}.

Another way to separate fields is to
put each field on a separate line: to do this, just set the
variable @code{FS} to the string @code{"\n"}.
(This single-character separator matches a single newline.)
A practical example of a @value{DF} organized this way might be a mailing
list, where blank lines separate the entries. Consider a mailing
list in a file named @file{addresses}, which looks like this:

@example
Jane Doe
123 Main Street
Anywhere, SE 12345-6789

John Smith
456 Tree-lined Avenue
Smallville, MW 98765-4321
@dots{}
@end example

@noindent
A simple program to process this file is as follows:

@example
# addrs.awk --- simple mailing list program

# Records are separated by blank lines.
# Each line is one field.
BEGIN @{ RS = "" ; FS = "\n" @}

@{
      print "Name is:", $1
      print "Address is:", $2
      print "City and State are:", $3
      print ""
@}
@end example

Running the program produces the following output:

@example
$ @kbd{awk -f addrs.awk addresses}
@print{} Name is: Jane Doe
@print{} Address is: 123 Main Street
@print{} City and State are: Anywhere, SE 12345-6789
@print{}
@print{} Name is: John Smith
@print{} Address is: 456 Tree-lined Avenue
@print{} City and State are: Smallville, MW 98765-4321
@print{}
@dots{}
@end example

@xref{Labels Program} for a more realistic program dealing with
address lists. The following list summarizes how records are split,
based on the value of
@ifinfo
@code{RS}.
(@samp{==} means ``is equal to.'')
@end ifinfo
@ifnotinfo
@code{RS}:
@end ifnotinfo

@table @code
@item RS == "\n"
Records are separated by the newline character (@samp{\n}). In effect,
every line in the @value{DF} is a separate record, including blank lines.
This is the default.

@item RS == @var{any single character}
Records are separated by each occurrence of the character. Multiple
successive occurrences delimit empty records.

@item RS == ""
Records are separated by runs of blank lines.
When @code{FS} is a single character, then
the newline character
always serves as a field separator, in addition to whatever value
@code{FS} may have. Leading and trailing newlines in a file are ignored.

@item RS == @var{regexp}
Records are separated by occurrences of characters that match @var{regexp}.
Leading and trailing matches of @var{regexp} delimit empty records.
(This is a @command{gawk} extension; it is not specified by the
POSIX standard.)
@end table

@cindex @command{gawk} @subentry @code{RT} variable in
@cindex @code{RT} variable
@cindex differences in @command{awk} and @command{gawk} @subentry @code{RS}/@code{RT} variables
If not in compatibility mode (@pxref{Options}), @command{gawk} sets
@code{RT} to the input text that matched the value specified by @code{RS}.
But if the input file ended without any text that matches @code{RS},
then @command{gawk} sets @code{RT} to the null string.

@node Getline
@section Explicit Input with @code{getline}

@cindex @code{getline} command @subentry explicit input with
@cindex input @subentry explicit
So far we have been getting our input data from @command{awk}'s main
input stream---either the standard input (usually your keyboard, sometimes
the output from another program) or the
files specified on the command line.
The @command{awk} language has a
special built-in command called @code{getline} that
can be used to read input under your explicit control.

The @code{getline} command is used in several different ways and should
@emph{not} be used by beginners.
The examples that follow the explanation of the @code{getline} command
include material that has not been covered yet. Therefore, come back
and study the @code{getline} command @emph{after} you have reviewed the
rest of
@ifinfo
this @value{DOCUMENT}
@end ifinfo
@ifhtml
this @value{DOCUMENT}
@end ifhtml
@ifnotinfo
@ifnothtml
Parts I and II
@end ifnothtml
@end ifnotinfo
and have a good knowledge of how @command{awk} works.

@cindex @command{gawk} @subentry @code{ERRNO} variable in
@cindex @code{ERRNO} variable @subentry with @command{getline} command
@cindex differences in @command{awk} and @command{gawk} @subentry @code{getline} command
@cindex @code{getline} command @subentry return values
@cindex @option{--sandbox} option @subentry input redirection with @code{getline}

The @code{getline} command returns 1 if it finds a record and 0 if
it encounters the end of the file. If there is some error in getting
a record, such as a file that cannot be opened, then @code{getline}
returns @minus{}1. In this case, @command{gawk} sets the variable
@code{ERRNO} to a string describing the error that occurred.

If @code{ERRNO} indicates that the I/O operation may be
retried, and @code{PROCINFO["@var{input}", "RETRY"]} is set,
then @code{getline} returns @minus{}2
instead of @minus{}1, and further calls to @code{getline}
may be attempted. @xref{Retrying Input} for further information about
this feature.

In the following examples, @var{command} stands for a string value that
represents a shell command.
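
Because an error yields a negative (and therefore ``true'') value, a loop
that tests only @samp{while (getline @dots{})} can spin forever on a file
that cannot be opened. The following sketch shows one way to test the
return value explicitly. It uses the @samp{getline @var{var} < @var{file}}
form described later in this @value{SECTION}; the @value{FN} and the
variable names @code{status} and @code{line} are hypothetical:

@example
BEGIN @{
    file = "/some/nonexistent/file"     # hypothetical name
    # Loop only while a record was actually read (return value 1)
    while ((status = (getline line < file)) > 0)
        print line
    # status is now 0 (end of file) or negative (error)
    if (status < 0)
        print "cannot read " file ": " ERRNO > "/dev/stderr"
    close(file)
@}
@end example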

@quotation NOTE
When @option{--sandbox} is specified (@pxref{Options}),
reading lines from files, pipes, and coprocesses is disabled.
@end quotation

@menu
* Plain Getline::               Using @code{getline} with no arguments.
* Getline/Variable::            Using @code{getline} into a variable.
* Getline/File::                Using @code{getline} from a file.
* Getline/Variable/File::       Using @code{getline} into a variable from a
                                file.
* Getline/Pipe::                Using @code{getline} from a pipe.
* Getline/Variable/Pipe::       Using @code{getline} into a variable from a
                                pipe.
* Getline/Coprocess::           Using @code{getline} from a coprocess.
* Getline/Variable/Coprocess::  Using @code{getline} into a variable from a
                                coprocess.
* Getline Notes::               Important things to know about @code{getline}.
* Getline Summary::             Summary of @code{getline} Variants.
@end menu

@node Plain Getline
@subsection Using @code{getline} with No Arguments

The @code{getline} command can be used without arguments to read input
from the current input file. All it does in this case is read the next
input record and split it up into fields. This is useful if you've
finished processing the current record, but want to do some special
processing on the next record @emph{right now}. For example:

@c 6/2019: Thanks to Mark Krauze <daburashka@ya.ru> for suggested
@c improvements (the inner while loop).
@example
# Remove text between /* and */, inclusive
@{
    while ((start = index($0, "/*")) != 0) @{
        out = substr($0, 1, start - 1)  # leading part of the string
        rest = substr($0, start + 2)    # ... */ ...
        while ((end = index(rest, "*/")) == 0) @{  # is */ in trailing part?
            # get more text
            if (getline <= 0) @{
                print("unexpected EOF or error:", ERRNO) > "/dev/stderr"
                exit
            @}
            # build up the line using string concatenation
            rest = rest $0
        @}
        rest = substr(rest, end + 2)  # remove comment
        # build up the output line using string concatenation
        $0 = out rest
    @}
    print $0
@}
@end example

This @command{awk} program deletes C-style comments (@samp{/* @dots{}
*/}) from the input.
It uses a number of features we haven't covered yet, including
string concatenation
(@pxref{Concatenation})
and the @code{index()} and @code{substr()} built-in
functions
(@pxref{String Functions}).
By replacing the @samp{print $0} with other
statements, you could perform more complicated processing on the
decommented input, such as searching for matches of a regular
expression.

Here is some sample input:

@example
mon/*comment*/key
rab/*commen
t*/bit
horse /*comment*/more text
part 1 /*comment*/part 2 /*comment*/part 3
no comment
@end example

When run, the output is:

@example
$ @kbd{awk -f strip_comments.awk example_text}
@print{} monkey
@print{} rabbit
@print{} horse more text
@print{} part 1 part 2 part 3
@print{} no comment
@end example

This form of the @code{getline} command sets @code{NF},
@code{NR}, @code{FNR}, @code{RT}, and the value of @code{$0}.

@quotation NOTE
The new value of @code{$0} is used to test
the patterns of any subsequent rules. The original value
of @code{$0} that triggered the rule that executed @code{getline}
is lost.
By contrast, the @code{next} statement reads a new record
but immediately begins processing it normally, starting with the first
rule in the program. @xref{Next Statement}.
@end quotation

@node Getline/Variable
@subsection Using @code{getline} into a Variable
@cindex @code{getline} command @subentry into a variable
@cindex variables @subentry @code{getline} command into, using

You can use @samp{getline @var{var}} to read the next record from
@command{awk}'s input into the variable @var{var}. No other processing is
done.
For example, suppose the next line is a comment or a special string,
and you want to read it without triggering
any rules. This form of @code{getline} allows you to read that line
and store it in a variable so that the main
read-a-line-and-check-each-rule loop of @command{awk} never sees it.
The following example swaps every two lines of input:

@example
@group
@{
     if ((getline tmp) > 0) @{
          print tmp
          print $0
     @} else
          print $0
@}
@end group
@end example

@noindent
It takes the following list:

@example
wan
tew
free
phore
@end example

@noindent
and produces these results:

@example
tew
wan
phore
free
@end example

The @code{getline} command used in this way sets only the variables
@code{NR}, @code{FNR}, and @code{RT} (and, of course, @var{var}).
The record is not
split into fields, so the values of the fields (including @code{$0}) and
the value of @code{NF} do not change.

@node Getline/File
@subsection Using @code{getline} from a File

@cindex @code{getline} command @subentry from a file
@cindex input redirection
@cindex redirection @subentry of input
@cindex @code{<} (left angle bracket) @subentry @code{<} operator (I/O)
@cindex left angle bracket (@code{<}) @subentry @code{<} operator (I/O)
@cindex operators @subentry input/output
Use @samp{getline < @var{file}} to read the next record from @var{file}.
Here, @var{file} is a string-valued expression that
specifies the @value{FN}. @samp{< @var{file}} is called a @dfn{redirection}
because it directs input to come from a different place.
For example, the following
program reads its input record from the file @file{secondary.input} when it
encounters a first field with a value equal to 10 in the current input
file:

@example
@{
    if ($1 == 10) @{
         getline < "secondary.input"
         print
    @} else
         print
@}
@end example

Because the main input stream is not used, the values of @code{NR} and
@code{FNR} are not changed. However, the record it reads is split into fields in
the normal manner, so the values of @code{$0} and the other fields are
changed, resulting in a new value of @code{NF}.
@code{RT} is also set.

@cindex POSIX @command{awk} @subentry @code{<} operator and
@c Thanks to Paul Eggert for initial wording here
According to POSIX, @samp{getline < @var{expression}} is ambiguous if
@var{expression} contains unparenthesized operators other than
@samp{$}; for example, @samp{getline < dir "/" file} is ambiguous
because the concatenation operator (not discussed yet; @pxref{Concatenation})
is not parenthesized. You should write it as @samp{getline < (dir "/" file)} if
you want your program to be portable to all @command{awk} implementations.

@node Getline/Variable/File
@subsection Using @code{getline} into a Variable from a File
@cindex variables @subentry @code{getline} command into, using

Use @samp{getline @var{var} < @var{file}} to read input
from the file
@var{file}, and put it in the variable @var{var}. As earlier, @var{file}
is a string-valued expression that specifies the file from which to read.

In this version of @code{getline}, none of the predefined variables are
changed and the record is not split into fields.
The only variable
changed is @var{var}.@footnote{This is not quite true. @code{RT} could
be changed if @code{RS} is a regular expression.}
For example, the following program copies all the input files to the
output, except for records that say @w{@samp{@@include @var{filename}}}.
Such a record is replaced by the contents of the file
@var{filename}:

@example
@{
     if (NF == 2 && $1 == "@@include") @{
          while ((getline line < $2) > 0)
               print line
          close($2)
     @} else
          print
@}
@end example

Note here how the name of the extra input file is not built into
the program; it is taken directly from the data, specifically from the second field on
the @code{@@include} line.

The @code{close()} function is called to ensure that if two identical
@code{@@include} lines appear in the input, the entire specified file is
included twice.
@xref{Close Files And Pipes}.

One deficiency of this program is that it does not process nested
@code{@@include} statements
(i.e., @code{@@include} statements in included files)
the way a true macro preprocessor would.
@xref{Igawk Program} for a program
that does handle nested @code{@@include} statements.

@node Getline/Pipe
@subsection Using @code{getline} from a Pipe

@c From private email, dated October 2, 1988. Used by permission, March 2013.
@cindex Kernighan, Brian @subentry quotes
@quotation
@i{Omniscience has much to recommend it.
Failing that, attention to details would be useful.}
@author Brian Kernighan
@end quotation

@cindex @code{|} (vertical bar) @subentry @code{|} operator (I/O)
@cindex vertical bar (@code{|}) @subentry @code{|} operator (I/O)
@cindex input pipeline
@cindex pipe @subentry input
@cindex operators @subentry input/output
The output of a command can also be piped into @code{getline}, using
@samp{@var{command} | getline}. In
this case, the string @var{command} is run as a shell command and its output
is piped into @command{awk} to be used as input. This form of @code{getline}
reads one record at a time from the pipe.
For example, the following program copies its input to its output, except for
lines that begin with @samp{@@execute}, which are replaced by the output
produced by running the rest of the line as a shell command:

@example
@group
@{
     if ($1 == "@@execute") @{
          tmp = substr($0, 10)        # Remove "@@execute"
          while ((tmp | getline) > 0)
               print
          close(tmp)
     @} else
          print
@}
@end group
@end example

@noindent
The @code{close()} function is called to ensure that if two identical
@samp{@@execute} lines appear in the input, the command is run for
each one.
@ifnottex
@ifnotdocbook
@xref{Close Files And Pipes}.
@end ifnotdocbook
@end ifnottex
@c This example is unrealistic, since you could just use system
Given the input:

@example
foo
bar
baz
@@execute who
bletch
@end example

@noindent
the program might produce:

@cindex Robbins @subentry Bill
@cindex Robbins @subentry Miriam
@cindex Robbins @subentry Arnold
@example
foo
bar
baz
arnold     ttyv0   Jul 13 14:22
miriam     ttyp0   Jul 13 14:23     (murphy:0)
bill       ttyp1   Jul 13 14:23     (murphy:0)
bletch
@end example

@noindent
Notice that this program ran the command @command{who} and printed the result.
(If you try this program yourself, you will of course get different results,
depending upon who is logged in on your system.)

This variation of @code{getline} splits the record into fields, sets the
value of @code{NF}, and recomputes the value of @code{$0}. The values of
@code{NR} and @code{FNR} are not changed.
@code{RT} is set.

@cindex POSIX @command{awk} @subentry @code{|} I/O operator and
@c Thanks to Paul Eggert for initial wording here
According to POSIX, @samp{@var{expression} | getline} is ambiguous if
@var{expression} contains unparenthesized operators other than
@samp{$}---for example, @samp{@w{"echo "} "date" | getline} is ambiguous
because the concatenation operator is not parenthesized. You should
write it as @samp{(@w{"echo "} "date") | getline} if you want your program
to be portable to all @command{awk} implementations.

@cindex Brian Kernighan's @command{awk}
@cindex @command{mawk} utility
@quotation NOTE
Unfortunately, @command{gawk} has not been consistent in its treatment
of a construct like @samp{@w{"echo "} "date" | getline}.
Most versions, including the current version, treat it as
@samp{@w{("echo "} "date") | getline}.
(This is also how BWK @command{awk} behaves.)
Some versions instead treat it as
@samp{@w{"echo "} ("date" | getline)}.
(This is how @command{mawk} behaves.)
In short, @emph{always} use explicit parentheses, and then you won't
have to worry.
@end quotation

@node Getline/Variable/Pipe
@subsection Using @code{getline} into a Variable from a Pipe
@cindex variables @subentry @code{getline} command into, using

When you use @samp{@var{command} | getline @var{var}}, the
output of @var{command} is sent through a pipe to
@code{getline} and into the variable @var{var}. For example, the
following program reads the current date and time into the variable
@code{current_time}, using the @command{date} utility, and then
prints it:

@example
BEGIN @{
     "date" | getline current_time
     close("date")
     print "Report printed on " current_time
@}
@end example

In this version of @code{getline}, none of the predefined variables are
changed and the record is not split into fields. However, @code{RT} is set.

@ifinfo
@c Thanks to Paul Eggert for initial wording here
According to POSIX, @samp{@var{expression} | getline @var{var}} is ambiguous if
@var{expression} contains unparenthesized operators other than
@samp{$}; for example, @samp{@w{"echo "} "date" | getline @var{var}} is ambiguous
because the concatenation operator is not parenthesized. You should
write it as @samp{(@w{"echo "} "date") | getline @var{var}} if you want your
program to be portable to other @command{awk} implementations.
@end ifinfo

@node Getline/Coprocess
@subsection Using @code{getline} from a Coprocess
@cindex coprocesses @subentry @code{getline} from
@cindex @code{getline} command @subentry coprocesses, using from
@cindex @code{|} (vertical bar) @subentry @code{|&} operator (I/O)
@cindex vertical bar (@code{|}) @subentry @code{|&} operator (I/O)
@cindex operators @subentry input/output
@cindex differences in @command{awk} and @command{gawk} @subentry input/output operators

Reading input into @code{getline} from a pipe is a one-way operation.
The command that is started with @samp{@var{command} | getline} only
sends data @emph{to} your @command{awk} program.

On occasion, you might want to send data to another program
for processing and then read the results back.
@command{gawk} allows you to start a @dfn{coprocess}, with which two-way
communications are possible. This is done with the @samp{|&}
operator.
Typically, you write data to the coprocess first and then
read the results back, as shown in the following:

@example
print "@var{some query}" |& "db_server"
"db_server" |& getline
@end example

@noindent
which sends a query to @command{db_server} and then reads the results.

The values of @code{NR} and
@code{FNR} are not changed,
because the main input stream is not used.
However, the record is split into fields in
the normal manner, thus changing the values of @code{$0}, of the other fields,
and of @code{NF} and @code{RT}.

Coprocesses are an advanced feature. They are discussed here only because
this is the @value{SECTION} on @code{getline}.
@xref{Two-way I/O},
where coprocesses are discussed in more detail.
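
Because @command{db_server} is only a placeholder, here is a runnable
sketch of the same write-then-read pattern that uses the @command{sort}
utility as the coprocess instead. It assumes @command{gawk} and a
@command{sort} program on the search path; the variable names are made up
for the illustration. The write end of the two-way pipe is closed first,
so that @command{sort} sees end-of-file and produces its output:

@example
BEGIN @{
    cmd = "sort"
    print "pear"  |& cmd
    print "apple" |& cmd
    close(cmd, "to")        # done writing; sort can now emit results
    while ((cmd |& getline line) > 0)
        print line
    close(cmd)
@}
@end example

@noindent
The lines come back in sorted order, @samp{apple} before @samp{pear}.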

@node Getline/Variable/Coprocess
@subsection Using @code{getline} into a Variable from a Coprocess
@cindex variables @subentry @code{getline} command into, using

When you use @samp{@var{command} |& getline @var{var}}, the output from
the coprocess @var{command} is sent through a two-way pipe to @code{getline}
and into the variable @var{var}.

In this version of @code{getline}, none of the predefined variables are
changed and the record is not split into fields. The only variable
changed is @var{var}.
However, @code{RT} is set.

@ifinfo
Coprocesses are an advanced feature. They are discussed here only because
this is the @value{SECTION} on @code{getline}.
@xref{Two-way I/O},
where coprocesses are discussed in more detail.
@end ifinfo

@node Getline Notes
@subsection Points to Remember About @code{getline}
Here are some miscellaneous points about @code{getline} that
you should bear in mind:

@itemize @value{BULLET}
@item
When @code{getline} changes the value of @code{$0} and @code{NF},
@command{awk} does @emph{not} automatically jump to the start of the
program and start testing the new record against every pattern.
However, the new record is tested against any subsequent rules.

@cindex differences in @command{awk} and @command{gawk} @subentry implementation limitations
@cindex implementation issues, @command{gawk} @subentry limits
@cindex @command{awk} @subentry implementations @subentry limits
@cindex @command{gawk} @subentry implementation issues @subentry limits
@item
Some very old @command{awk} implementations limit the number of pipelines that an @command{awk}
program may have open to just one. In @command{gawk}, there is no such limit.
You can open as many pipelines (and coprocesses) as the underlying operating
system permits.
8891 8892@cindex side effects @subentry @code{FILENAME} variable 8893@cindex @code{FILENAME} variable @subentry @code{getline}, setting with 8894@cindex dark corner @subentry @code{FILENAME} variable 8895@cindex @code{getline} command @subentry @code{FILENAME} variable and 8896@cindex @code{BEGIN} pattern @subentry @code{getline} and 8897@item 8898An interesting side effect occurs if you use @code{getline} without a 8899redirection inside a @code{BEGIN} rule. Because an unredirected @code{getline} 8900reads from the command-line @value{DF}s, the first @code{getline} command 8901causes @command{awk} to set the value of @code{FILENAME}. Normally, 8902@code{FILENAME} does not have a value inside @code{BEGIN} rules, because you 8903have not yet started to process the command-line @value{DF}s. 8904@value{DARKCORNER} 8905(See @ref{BEGIN/END}; 8906also @pxref{Auto-set}.) 8907 8908@item 8909Using @code{FILENAME} with @code{getline} 8910(@samp{getline < FILENAME}) 8911is likely to be a source of 8912confusion. @command{awk} opens a separate input stream from the 8913current input file. However, by not using a variable, @code{$0} 8914and @code{NF} are still updated. If you're doing this, it's 8915probably by accident, and you should reconsider what it is you're 8916trying to accomplish. 8917 8918@item 8919@ifdocbook 8920The next @value{SECTION} 8921@end ifdocbook 8922@ifnotdocbook 8923@ref{Getline Summary}, 8924@end ifnotdocbook 8925presents a table summarizing the 8926@code{getline} variants and which variables they can affect. 8927It is worth noting that those variants that do not use redirection 8928can cause @code{FILENAME} to be updated if they cause 8929@command{awk} to start reading a new input file. 8930 8931@item 8932@cindex Moore, Duncan 8933If the variable being assigned is an expression with side effects, 8934different versions of @command{awk} behave differently upon encountering 8935end-of-file. 
Some versions don't evaluate the expression; many versions 8936(including @command{gawk}) do. Here is an example, courtesy of Duncan Moore: 8937 8938@ignore 8939Date: Sun, 01 Apr 2012 11:49:33 +0100 8940From: Duncan Moore <duncan.moore@@gmx.com> 8941@end ignore 8942 8943@example 8944BEGIN @{ 8945 system("echo 1 > f") 8946 while ((getline a[++c] < "f") > 0) @{ @} 8947 print c 8948@} 8949@end example 8950 8951@noindent 8952Here, the side effect is the @samp{++c}. Is @code{c} incremented if 8953end-of-file is encountered before the element in @code{a} is assigned? 8954 8955@command{gawk} treats @code{getline} like a function call, and evaluates 8956the expression @samp{a[++c]} before attempting to read from @file{f}. 8957However, some versions of @command{awk} only evaluate the expression once they 8958know that there is a string value to be assigned. 8959@end itemize 8960 8961@node Getline Summary 8962@subsection Summary of @code{getline} Variants 8963@cindex @code{getline} command @subentry variants 8964 8965@ref{table-getline-variants} 8966summarizes the eight variants of @code{getline}, 8967listing which predefined variables are set by each one, 8968and whether the variant is standard or a @command{gawk} extension. 8969Note: for each variant, @command{gawk} sets the @code{RT} predefined variable. 
8970 8971@float Table,table-getline-variants 8972@caption{@code{getline} variants and what they set} 8973@multitable @columnfractions .33 .38 .27 8974@headitem Variant @tab Effect @tab @command{awk} / @command{gawk} 8975@item @code{getline} @tab Sets @code{$0}, @code{NF}, @code{FNR}, @code{NR}, and @code{RT} @tab @command{awk} 8976@item @code{getline} @var{var} @tab Sets @var{var}, @code{FNR}, @code{NR}, and @code{RT} @tab @command{awk} 8977@item @code{getline <} @var{file} @tab Sets @code{$0}, @code{NF}, and @code{RT} @tab @command{awk} 8978@item @code{getline @var{var} < @var{file}} @tab Sets @var{var} and @code{RT} @tab @command{awk} 8979@item @var{command} @code{| getline} @tab Sets @code{$0}, @code{NF}, and @code{RT} @tab @command{awk} 8980@item @var{command} @code{| getline} @var{var} @tab Sets @var{var} and @code{RT} @tab @command{awk} 8981@item @var{command} @code{|& getline} @tab Sets @code{$0}, @code{NF}, and @code{RT} @tab @command{gawk} 8982@item @var{command} @code{|& getline} @var{var} @tab Sets @var{var} and @code{RT} @tab @command{gawk} 8983@end multitable 8984@end float 8985 8986@node Read Timeout 8987@section Reading Input with a Timeout 8988@cindex timeout, reading input 8989 8990@cindex differences in @command{awk} and @command{gawk} @subentry read timeouts 8991This @value{SECTION} describes a feature that is specific to @command{gawk}. 8992 8993You may specify a timeout in milliseconds for reading input from the keyboard, 8994a pipe, or two-way communication, including TCP/IP sockets. This can be done 8995on a per-input, per-command, or per-connection basis, by setting a special 8996element in the @code{PROCINFO} array (@pxref{Auto-set}): 8997 8998@example 8999PROCINFO["input_name", "READ_TIMEOUT"] = @var{timeout in milliseconds} 9000@end example 9001 9002When set, this causes @command{gawk} to time out and return failure 9003if no data is available to read within the specified timeout period. 
9004For example, a TCP client can decide to give up on receiving 9005any response from the server after a certain amount of time: 9006 9007@example 9008@group 9009Service = "/inet/tcp/0/localhost/daytime" 9010PROCINFO[Service, "READ_TIMEOUT"] = 100 9011if ((Service |& getline) > 0) 9012 print $0 9013else if (ERRNO != "") 9014 print ERRNO 9015@end group 9016@end example 9017 9018Here is how to read interactively from the user@footnote{This assumes 9019that standard input is the keyboard.} without waiting 9020for more than five seconds: 9021 9022@example 9023PROCINFO["/dev/stdin", "READ_TIMEOUT"] = 5000 9024while ((getline < "/dev/stdin") > 0) 9025 print $0 9026@end example 9027 9028@command{gawk} terminates the read operation if input does not 9029arrive after waiting for the timeout period, returns failure, 9030and sets @code{ERRNO} to an appropriate string value. 9031A negative or zero value for the timeout is the same as specifying 9032no timeout at all. 9033 9034A timeout can also be set for reading from the keyboard in the implicit 9035loop that reads input records and matches them against patterns, 9036like so: 9037 9038@example 9039$ @kbd{gawk 'BEGIN @{ PROCINFO["-", "READ_TIMEOUT"] = 5000 @}} 9040> @kbd{@{ print "You entered: " $0 @}'} 9041@kbd{gawk} 9042@print{} You entered: gawk 9043@end example 9044 9045In this case, failure to respond within five seconds results in the following 9046error message: 9047 9048@example 9049@error{} gawk: cmd. line:2: (FILENAME=- FNR=1) fatal: error reading input file `-': Connection timed out 9050@end example 9051 9052The timeout can be set or changed at any time, and will take effect on the 9053next attempt to read from the input device. 
In the following example, 9054we start with a timeout value of one second, and progressively 9055reduce it by one-tenth of a second until we wait indefinitely 9056for the input to arrive: 9057 9058@example 9059PROCINFO[Service, "READ_TIMEOUT"] = 1000 9060while ((Service |& getline) > 0) @{ 9061 print $0 9062 PROCINFO[Service, "READ_TIMEOUT"] -= 100 9063@} 9064@end example 9065 9066@quotation NOTE 9067You should not assume that the read operation will block 9068exactly after the tenth record has been printed. It is possible that 9069@command{gawk} will read and buffer more than one record's 9070worth of data the first time. Because of this, changing the value 9071of timeout like in the preceding example is not very useful. 9072@end quotation 9073 9074@cindex @env{GAWK_READ_TIMEOUT} environment variable 9075@cindex environment variables @subentry @env{GAWK_READ_TIMEOUT} 9076If the @code{PROCINFO} element is not present and the 9077@env{GAWK_READ_TIMEOUT} environment variable exists, 9078@command{gawk} uses its value to initialize the timeout value. 9079The exclusive use of the environment variable to specify timeout 9080has the disadvantage of not being able to control it 9081on a per-command or per-connection basis. 9082 9083@command{gawk} considers a timeout event to be an error even though 9084the attempt to read from the underlying device may 9085succeed in a later attempt. This is a limitation, and it also 9086means that you cannot use this to multiplex input from 9087two or more sources. @xref{Retrying Input} for a way to enable 9088later I/O attempts to succeed. 9089 9090Assigning a timeout value prevents read operations from 9091blocking indefinitely. But bear in mind that there are other ways 9092@command{gawk} can stall waiting for an input device to be ready. 
9093A network client can sometimes take a long time to establish 9094a connection before it can start reading any data, 9095or the attempt to open a FIFO special file for reading can block 9096indefinitely until some other process opens it for writing. 9097 9098@node Retrying Input 9099@section Retrying Reads After Certain Input Errors 9100@cindex retrying input 9101 9102@cindex differences in @command{awk} and @command{gawk} @subentry retrying input 9103This @value{SECTION} describes a feature that is specific to @command{gawk}. 9104 9105When @command{gawk} encounters an error while reading input, by 9106default @code{getline} returns @minus{}1, and subsequent attempts to 9107read from that file result in an end-of-file indication. However, you 9108may optionally instruct @command{gawk} to allow I/O to be retried when 9109certain errors are encountered by setting a special element in 9110the @code{PROCINFO} array (@pxref{Auto-set}): 9111 9112@example 9113PROCINFO["@var{input_name}", "RETRY"] = 1 9114@end example 9115 9116When this element exists, @command{gawk} checks the value of the system 9117(C language) 9118@code{errno} variable when an I/O error occurs. If @code{errno} indicates 9119a subsequent I/O attempt may succeed, @code{getline} instead returns 9120@minus{}2 and 9121further calls to @code{getline} may succeed. This applies to the @code{errno} 9122values @code{EAGAIN}, @code{EWOULDBLOCK}, @code{EINTR}, or @code{ETIMEDOUT}. 9123 9124This feature is useful in conjunction with 9125@code{PROCINFO["@var{input_name}", "READ_TIMEOUT"]} or situations where a file 9126descriptor has been configured to behave in a non-blocking fashion. 
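
The following sketch shows one way the two @code{PROCINFO} elements might
be combined. It is an illustration under stated assumptions, not a
definitive recipe: the variable names (@code{input}, @code{rc},
@code{tries}), the one-second timeout, and the limit of three attempts
are all arbitrary choices, and it assumes that standard input is a
device on which read timeouts are supported:

@example
BEGIN @{
    input = "/dev/stdin"
    PROCINFO[input, "READ_TIMEOUT"] = 1000    # one second per attempt
    PROCINFO[input, "RETRY"] = 1              # allow retryable errors
    while ((rc = (getline line < input)) != 0) @{
        if (rc == -2) @{                       # timed out; may retry
            if (++tries >= 3) @{
                print "giving up after " tries " timeouts"
                break
            @}
            continue
        @}
        if (rc < 0) @{                         # unrecoverable error
            print "read error: " ERRNO
            break
        @}
        print "got: " line
        tries = 0                             # reset after a good read
    @}
@}
@end example

@noindent
The loop distinguishes the three outcomes described above: a positive
return value for a record, @minus{}2 for an error that may be retried,
and @minus{}1 for any other error. A return value of zero (end-of-file)
ends the loop.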
9127 9128@node Command-line directories 9129@section Directories on the Command Line 9130@cindex differences in @command{awk} and @command{gawk} @subentry command-line directories 9131@cindex directories @subentry command-line 9132@cindex command line @subentry directories on 9133 9134According to the POSIX standard, files named on the @command{awk} 9135command line must be text files; it is a fatal error if they are not. 9136Most versions of @command{awk} treat a directory on the command line as 9137a fatal error. 9138 9139By default, @command{gawk} produces a warning for a directory on the 9140command line, but otherwise ignores it. This makes it easier to use 9141shell wildcards with your @command{awk} program: 9142 9143@example 9144$ @kbd{gawk -f whizprog.awk *} @ii{Directories could kill this program} 9145@end example 9146 9147If either of the @option{--posix} 9148or @option{--traditional} options is given, then @command{gawk} reverts 9149to treating a directory on the command line as a fatal error. 9150 9151@xref{Extension Sample Readdir} for a way to treat directories 9152as usable data from an @command{awk} program. 9153 9154@node Input Summary 9155@section Summary 9156 9157@itemize @value{BULLET} 9158@item 9159Input is split into records based on the value of @code{RS}. 9160The possibilities are as follows: 9161 9162@multitable @columnfractions .25 .35 .40 9163@headitem Value of @code{RS} @tab Records are split on @dots{} @tab @command{awk} / @command{gawk} 9164@item Any single character @tab That character @tab @command{awk} 9165@item The empty string (@code{""}) @tab Runs of two or more newlines @tab @command{awk} 9166@item A regexp @tab Text that matches the regexp @tab @command{gawk} 9167@end multitable 9168 9169@item 9170@code{FNR} indicates how many records have been read from the current input file; 9171@code{NR} indicates how many records have been read in total. 9172 9173@item 9174@command{gawk} sets @code{RT} to the text matched by @code{RS}. 
9175 9176@item 9177After splitting the input into records, @command{awk} further splits 9178the records into individual fields, named @code{$1}, @code{$2}, and so 9179on. @code{$0} is the whole record, and @code{NF} indicates how many 9180fields there are. The default way to split fields is between whitespace 9181characters. 9182 9183@item 9184Fields may be referenced using a variable, as in @code{$NF}. Fields 9185may also be assigned values, which causes the value of @code{$0} to be 9186recomputed when it is later referenced. Assigning to a field with a number 9187greater than @code{NF} creates the field and rebuilds the record, using 9188@code{OFS} to separate the fields. Incrementing @code{NF} does the same 9189thing. Decrementing @code{NF} throws away fields and rebuilds the record. 9190 9191@item 9192Field splitting is more complicated than record splitting: 9193 9194@multitable @columnfractions .40 .40 .20 9195@headitem Field separator value @tab Fields are split @dots{} @tab @command{awk} / @command{gawk} 9196@item @code{FS == " "} @tab On runs of whitespace @tab @command{awk} 9197@item @code{FS == @var{any single character}} @tab On that character @tab @command{awk} 9198@item @code{FS == @var{regexp}} @tab On text matching the regexp @tab @command{awk} 9199@item @code{FS == ""} @tab Such that each individual character is a separate field @tab @command{gawk} 9200@item @code{FIELDWIDTHS == @var{list of columns}} @tab Based on character position @tab @command{gawk} 9201@item @code{FPAT == @var{regexp}} @tab On the text surrounding text matching the regexp @tab @command{gawk} 9202@end multitable 9203 9204@item 9205Using @samp{FS = "\n"} causes the entire record to be a single field 9206(assuming that newlines separate records). 9207 9208@item 9209@code{FS} may be set from the command line using the @option{-F} option. 9210This can also be done using command-line variable assignment. 9211 9212@item 9213Use @code{PROCINFO["FS"]} to see how fields are being split. 
9214 9215@item 9216Use @code{getline} in its various forms to read additional records 9217from the default input stream, from a file, or from a pipe or coprocess. 9218 9219@item 9220Use @code{PROCINFO[@var{file}, "READ_TIMEOUT"]} to cause reads to time out 9221for @var{file}. 9222 9223@cindex POSIX mode 9224@item 9225Directories on the command line are fatal for standard @command{awk}; 9226@command{gawk} ignores them if not in POSIX mode. 9227 9228@end itemize 9229 9230@c EXCLUDE START 9231@node Input Exercises 9232@section Exercises 9233 9234@enumerate 9235@item 9236Using the @code{FIELDWIDTHS} variable (@pxref{Constant Size}), 9237write a program to read election data, where each record represents 9238one voter's votes. Come up with a way to define which columns are 9239associated with each ballot item, and print the total votes, 9240including abstentions, for each item. 9241 9242@end enumerate 9243@c EXCLUDE END 9244 9245@node Printing 9246@chapter Printing Output 9247 9248@cindex printing 9249@cindex output, printing @seeentry{printing} 9250One of the most common programming actions is to @dfn{print}, or output, 9251some or all of the input. Use the @code{print} statement 9252for simple output, and the @code{printf} statement 9253for fancier formatting. 9254The @code{print} statement is not limited when 9255computing @emph{which} values to print. However, with two exceptions, 9256you cannot specify @emph{how} to print them---how many 9257columns, whether to use exponential notation or not, and so on. 9258(For the exceptions, @pxref{Output Separators} and 9259@ref{OFMT}.) 9260For printing with specifications, you need the @code{printf} statement 9261(@pxref{Printf}). 
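
As a small sketch of the difference (the variable @code{x} and the
format string are chosen purely for illustration), the following prints
the same value once with @code{print}, which applies the default numeric
format, and once with @code{printf}, which is given an explicit width
and precision:

@example
$ @kbd{awk 'BEGIN @{ x = 3.14159265; print x; printf "%8.2f|\n", x @}'}
@print{} 3.14159
@print{}     3.14|
@end example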
9262 9263@cindex @code{print} statement 9264@cindex @code{printf} statement 9265Besides basic and formatted printing, this @value{CHAPTER} 9266also covers I/O redirections to files and pipes, introduces 9267the special @value{FN}s that @command{gawk} processes internally, 9268and discusses the @code{close()} built-in function. 9269 9270@menu 9271* Print:: The @code{print} statement. 9272* Print Examples:: Simple examples of @code{print} statements. 9273* Output Separators:: The output separators and how to change them. 9274* OFMT:: Controlling Numeric Output With @code{print}. 9275* Printf:: The @code{printf} statement. 9276* Redirection:: How to redirect output to multiple files and 9277 pipes. 9278* Special FD:: Special files for I/O. 9279* Special Files:: File name interpretation in @command{gawk}. 9280 @command{gawk} allows access to inherited file 9281 descriptors. 9282* Close Files And Pipes:: Closing Input and Output Files and Pipes. 9283* Nonfatal:: Enabling Nonfatal Output. 9284* Output Summary:: Output summary. 9285* Output Exercises:: Exercises. 9286@end menu 9287 9288@node Print 9289@section The @code{print} Statement 9290 9291Use the @code{print} statement to produce output with simple, standardized 9292formatting. You specify only the strings or numbers to print, in a 9293list separated by commas. They are output, separated by single spaces, 9294followed by a newline. The statement looks like this: 9295 9296@example 9297print @var{item1}, @var{item2}, @dots{} 9298@end example 9299 9300@noindent 9301The entire list of items may be optionally enclosed in parentheses. The 9302parentheses are necessary if any of the item expressions uses the @samp{>} 9303relational operator; otherwise it could be confused with an output redirection 9304(@pxref{Redirection}). 9305 9306The items to print can be constant strings or numbers, fields of the 9307current record (such as @code{$1}), variables, or any @command{awk} 9308expression. 
Numeric values are converted to strings and then printed. 9309 9310@cindex records @subentry printing 9311@cindex lines @subentry blank, printing 9312@cindex text, printing 9313The simple statement @samp{print} with no items is equivalent to 9314@samp{print $0}: it prints the entire current record. To print a blank 9315line, use @samp{print ""}. 9316To print a fixed piece of text, use a string constant, such as 9317@w{@code{"Don't Panic"}}, as one item. If you forget to use the 9318double-quote characters, your text is taken as an @command{awk} 9319expression, and you will probably get an error. Keep in mind that a 9320space is printed between any two items. 9321 9322Note that the @code{print} statement is a statement and not an 9323expression---you can't use it in the pattern part of a 9324pattern--action statement, for example. 9325 9326@node Print Examples 9327@section @code{print} Statement Examples 9328 9329Each @code{print} statement makes at least one line of output. However, it 9330isn't limited to only one line. If an item value is a string containing a 9331newline, the newline is output along with the rest of the string. A 9332single @code{print} statement can make any number of lines this way. 
9333 9334@cindex newlines @subentry printing 9335The following is an example of printing a string that contains embedded 9336@ifinfo 9337newlines 9338(the @samp{\n} is an escape sequence, used to represent the newline 9339character; @pxref{Escape Sequences}): 9340@end ifinfo 9341@ifhtml 9342newlines 9343(the @samp{\n} is an escape sequence, used to represent the newline 9344character; @pxref{Escape Sequences}): 9345@end ifhtml 9346@ifnotinfo 9347@ifnothtml 9348newlines: 9349@end ifnothtml 9350@end ifnotinfo 9351 9352@example 9353@group 9354$ @kbd{awk 'BEGIN @{ print "line one\nline two\nline three" @}'} 9355@print{} line one 9356@print{} line two 9357@print{} line three 9358@end group 9359@end example 9360 9361@cindex fields @subentry printing 9362The next example, which is run on the @file{inventory-shipped} file, 9363prints the first two fields of each input record, with a space between 9364them: 9365 9366@example 9367$ @kbd{awk '@{ print $1, $2 @}' inventory-shipped} 9368@print{} Jan 13 9369@print{} Feb 15 9370@print{} Mar 15 9371@dots{} 9372@end example 9373 9374@cindex @code{print} statement @subentry commas, omitting 9375@cindex troubleshooting @subentry @code{print} statement, omitting commas 9376A common mistake in using the @code{print} statement is to omit the comma 9377between two items. This often has the effect of making the items run 9378together in the output, with no space. The reason for this is that 9379juxtaposing two string expressions in @command{awk} means to concatenate 9380them. Here is the same program, without the comma: 9381 9382@example 9383$ @kbd{awk '@{ print $1 $2 @}' inventory-shipped} 9384@print{} Jan13 9385@print{} Feb15 9386@print{} Mar15 9387@dots{} 9388@end example 9389 9390@cindex @code{BEGIN} pattern @subentry headings, adding 9391To someone unfamiliar with the @file{inventory-shipped} file, neither 9392example's output makes much sense. A heading line at the beginning 9393would make it clearer. 
Let's add some headings to our table of months 9394(@code{$1}) and green crates shipped (@code{$2}). We do this using 9395a @code{BEGIN} rule (@pxref{BEGIN/END}) so that the headings are only 9396printed once: 9397 9398@example 9399awk 'BEGIN @{ print "Month Crates" 9400 print "----- ------" @} 9401 @{ print $1, $2 @}' inventory-shipped 9402@end example 9403 9404@noindent 9405When run, the program prints the following: 9406 9407@example 9408Month Crates 9409----- ------ 9410Jan 13 9411Feb 15 9412Mar 15 9413@dots{} 9414@end example 9415 9416@noindent 9417The only problem, however, is that the headings and the table data 9418don't line up! We can fix this by printing some spaces between the 9419two fields: 9420 9421@example 9422@group 9423awk 'BEGIN @{ print "Month Crates" 9424 print "----- ------" @} 9425 @{ print $1, " ", $2 @}' inventory-shipped 9426@end group 9427@end example 9428 9429@cindex @code{printf} statement @subentry columns, aligning 9430@cindex columns @subentry aligning 9431Lining up columns this way can get pretty 9432complicated when there are many columns to fix. Counting spaces for two 9433or three columns is simple, but any more than this can take up 9434a lot of time. This is why the @code{printf} statement was 9435created (@pxref{Printf}); 9436one of its specialties is lining up columns of data. 9437 9438@cindex line continuations @subentry in @code{print} statement 9439@cindex @code{print} statement @subentry line continuations and 9440@quotation NOTE 9441You can continue either a @code{print} or 9442@code{printf} statement simply by putting a newline after any comma 9443(@pxref{Statements/Lines}). 9444@end quotation 9445 9446@node Output Separators 9447@section Output Separators 9448 9449@cindex @code{OFS} variable 9450As mentioned previously, a @code{print} statement contains a list 9451of items separated by commas. In the output, the items are normally 9452separated by single spaces. 
However, this doesn't need to be the case; 9453a single space is simply the default. Any string of 9454characters may be used as the @dfn{output field separator} by setting the 9455predefined variable @code{OFS}. The initial value of this variable 9456is the string @w{@code{" "}} (i.e., a single space). 9457 9458The output from an entire @code{print} statement is called an @dfn{output 9459record}. Each @code{print} statement outputs one output record, and 9460then outputs a string called the @dfn{output record separator} (or 9461@code{ORS}). The initial value of @code{ORS} is the string @code{"\n"} 9462(i.e., a newline character). Thus, each @code{print} statement normally 9463makes a separate line. 9464 9465@cindex output @subentry records 9466@cindex output record separator @seeentry{@code{ORS} variable} 9467@cindex @code{ORS} variable 9468@cindex @code{BEGIN} pattern @subentry @code{OFS}/@code{ORS} variables, assigning values to 9469In order to change how output fields and records are separated, assign 9470new values to the variables @code{OFS} and @code{ORS}. The usual 9471place to do this is in the @code{BEGIN} rule 9472(@pxref{BEGIN/END}), so 9473that it happens before any input is processed. It can also be done 9474with assignments on the command line, before the names of the input 9475files, or using the @option{-v} command-line option 9476(@pxref{Options}). 
9477The following example prints the first and second fields of each input 9478record, separated by a semicolon, with a blank line added after each 9479newline: 9480 9481 9482@example 9483$ @kbd{awk 'BEGIN @{ OFS = ";"; ORS = "\n\n" @}} 9484> @kbd{@{ print $1, $2 @}' mail-list} 9485@print{} Amelia;555-5553 9486@print{} 9487@print{} Anthony;555-3412 9488@print{} 9489@print{} Becky;555-7685 9490@print{} 9491@print{} Bill;555-1675 9492@print{} 9493@print{} Broderick;555-0542 9494@print{} 9495@print{} Camilla;555-2912 9496@print{} 9497@print{} Fabius;555-1234 9498@print{} 9499@print{} Julie;555-6699 9500@print{} 9501@print{} Martin;555-6480 9502@print{} 9503@print{} Samuel;555-3430 9504@print{} 9505@print{} Jean-Paul;555-2127 9506@print{} 9507@end example 9508 9509If the value of @code{ORS} does not contain a newline, the program's output 9510runs together on a single line. 9511 9512@node OFMT 9513@section Controlling Numeric Output with @code{print} 9514@cindex numeric @subentry output format 9515@cindex formats, numeric output 9516When printing numeric values with the @code{print} statement, 9517@command{awk} internally converts each number to a string of characters 9518and prints that string. @command{awk} uses the @code{sprintf()} function 9519to do this conversion 9520(@pxref{String Functions}). 9521For now, it suffices to say that the @code{sprintf()} 9522function accepts a @dfn{format specification} that tells it how to format 9523numbers (or strings), and that there are a number of different ways in which 9524numbers can be formatted. The different format specifications are discussed 9525more fully in 9526@ref{Control Letters}. 9527 9528@cindexawkfunc{sprintf} 9529@cindex @code{OFMT} variable 9530@cindex output @subentry format specifier, @code{OFMT} 9531The predefined variable @code{OFMT} contains the format specification 9532that @code{print} uses with @code{sprintf()} when it wants to convert a 9533number to a string for printing. 
9534The default value of @code{OFMT} is @code{"%.6g"}. 9535The way @code{print} prints numbers can be changed 9536by supplying a different format specification 9537for the value of @code{OFMT}, as shown in the following example: 9538 9539@example 9540$ @kbd{awk 'BEGIN @{} 9541> @kbd{OFMT = "%.0f" # print numbers as integers (rounds)} 9542> @kbd{print 17.23, 17.54 @}'} 9543@print{} 17 18 9544@end example 9545 9546@noindent 9547@cindex dark corner @subentry @code{OFMT} variable 9548@cindex POSIX @command{awk} @subentry @code{OFMT} variable and 9549@cindex @code{OFMT} variable @subentry POSIX @command{awk} and 9550According to the POSIX standard, @command{awk}'s behavior is undefined 9551if @code{OFMT} contains anything but a floating-point conversion specification. 9552@value{DARKCORNER} 9553 9554@node Printf 9555@section Using @code{printf} Statements for Fancier Printing 9556 9557@cindex @code{printf} statement 9558@cindex output @subentry formatted 9559@cindex formatting @subentry output 9560For more precise control over the output format than what is 9561provided by @code{print}, use @code{printf}. 9562With @code{printf} you can 9563specify the width to use for each item, as well as various 9564formatting choices for numbers (such as what output base to use, whether to 9565print an exponent, whether to print a sign, and how many digits to print 9566after the decimal point). 9567 9568@menu 9569* Basic Printf:: Syntax of the @code{printf} statement. 9570* Control Letters:: Format-control letters. 9571* Format Modifiers:: Format-specification modifiers. 9572* Printf Examples:: Several examples. 
9573@end menu 9574 9575@node Basic Printf 9576@subsection Introduction to the @code{printf} Statement 9577 9578@cindex @code{printf} statement @subentry syntax of 9579A simple @code{printf} statement looks like this: 9580 9581@example 9582printf @var{format}, @var{item1}, @var{item2}, @dots{} 9583@end example 9584 9585@noindent 9586As for @code{print}, the entire list of arguments may optionally be 9587enclosed in parentheses. Here too, the parentheses are necessary if any 9588of the item expressions uses the @samp{>} relational operator; otherwise, 9589it can be confused with an output redirection (@pxref{Redirection}). 9590 9591@cindex format specifiers 9592The difference between @code{printf} and @code{print} is the @var{format} 9593argument. This is an expression whose value is taken as a string; it 9594specifies how to output each of the other arguments. It is called the 9595@dfn{format string}. 9596 9597The format string is very similar to that in the ISO C library function 9598@code{printf()}. Most of @var{format} is text to output verbatim. 9599Scattered among this text are @dfn{format specifiers}---one per item. 9600Each format specifier says to output the next item in the argument list 9601at that place in the format. 9602 9603The @code{printf} statement does not automatically append a newline 9604to its output. It outputs only what the format string specifies. 9605So if a newline is needed, you must include one in the format string. 9606The output separator variables @code{OFS} and @code{ORS} have no effect 9607on @code{printf} statements. For example: 9608 9609@example 9610@group 9611$ @kbd{awk 'BEGIN @{} 9612> @kbd{ORS = "\nOUCH!\n"; OFS = "+"} 9613> @kbd{msg = "Don\47t Panic!"} 9614> @kbd{printf "%s\n", msg} 9615> @kbd{@}'} 9616@print{} Don't Panic! 9617@end group 9618@end example 9619 9620@noindent 9621Here, neither the @samp{+} nor the @samp{OUCH!} appears in 9622the output message. 
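
The points above can be seen together in the following short sketch
(the strings and numbers are arbitrary sample values). The first
@code{printf} uses one format specifier per item and prints no newline
of its own, the second supplies the newline, and the third encloses its
arguments in parentheses so that @samp{>} is treated as a comparison
rather than as a redirection:

@example
$ @kbd{awk 'BEGIN @{ printf "%s has %d items", "cart", 3}
> @kbd{printf "\n"}
> @kbd{printf("%d\n", 2 > 1) @}'}
@print{} cart has 3 items
@print{} 1
@end example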
9623 9624@node Control Letters 9625@subsection Format-Control Letters 9626@cindex @code{printf} statement @subentry format-control characters 9627@cindex format specifiers @subentry @code{printf} statement 9628 9629A format specifier starts with the character @samp{%} and ends with 9630a @dfn{format-control letter}---it tells the @code{printf} statement 9631how to output one item. The format-control letter specifies what @emph{kind} 9632of value to print. The rest of the format specifier is made up of 9633optional @dfn{modifiers} that control @emph{how} to print the value, such as 9634the field width. Here is a list of the format-control letters: 9635 9636@c @asis for docbook to come out right 9637@table @asis 9638@item @code{%a}, @code{%A} 9639A floating point number of the form 9640[@code{-}]@code{0x@var{h}.@var{hhhh}p+-@var{dd}} 9641(C99 hexadecimal floating point format). 9642For @code{%A}, 9643uppercase letters are used instead of lowercase ones. 9644 9645@quotation NOTE 9646The current POSIX standard requires support for @code{%a} and @code{%A} in 9647@command{awk}. As far as we know, besides @command{gawk}, the only other 9648version of @command{awk} that actually implements them is BWK @command{awk}. 9649Their use is thus highly nonportable! 9650 9651Furthermore, these formats are not available on any system where the 9652underlying C library @code{printf()} function does not support them. As 9653of this writing, among current systems, only OpenVMS is known to not 9654support them. 9655@end quotation 9656 9657@item @code{%c} 9658Print a number as a character; thus, @samp{printf "%c", 965965} outputs the letter @samp{A}. The output for a string value is 9660the first character of the string. 9661 9662@cindex dark corner @subentry format-control characters 9663@cindex @command{gawk} @subentry format-control characters 9664@quotation NOTE 9665The POSIX standard says the first character of a string is printed.
9666In locales with multibyte characters, @command{gawk} attempts to 9667convert the leading bytes of the string into a valid wide character 9668and then to print the multibyte encoding of that character. 9669Similarly, when printing a numeric value, @command{gawk} allows the 9670value to be within the numeric range of values that can be held 9671in a wide character. 9672If the conversion to multibyte encoding fails, @command{gawk} 9673uses the low eight bits of the value as the character to print. 9674 9675Other @command{awk} versions generally restrict themselves to printing 9676the first byte of a string or to numeric values within the range of 9677a single byte (0--255). 9678@value{DARKCORNER} 9679@end quotation 9680 9681 9682@item @code{%d}, @code{%i} 9683Print a decimal integer. 9684The two control letters are equivalent. 9685(The @samp{%i} specification is for compatibility with ISO C.) 9686 9687@item @code{%e}, @code{%E} 9688Print a number in scientific (exponential) notation. 9689For example: 9690 9691@example 9692printf "%4.3e\n", 1950 9693@end example 9694 9695@noindent 9696prints @samp{1.950e+03}, with a total of four significant figures, three of 9697which follow the decimal point. 9698(The @samp{4.3} represents two modifiers, 9699discussed in the next @value{SUBSECTION}.) 9700@samp{%E} uses @samp{E} instead of @samp{e} in the output. 9701 9702@item @code{%f} 9703Print a number in floating-point notation. 9704For example: 9705 9706@example 9707printf "%4.3f", 1950 9708@end example 9709 9710@noindent 9711prints @samp{1950.000}, with a minimum of four significant figures, three of 9712which follow the decimal point. 9713(The @samp{4.3} represents two modifiers, 9714discussed in the next @value{SUBSECTION}.) 9715 9716On systems supporting IEEE 754 floating-point format, values 9717representing negative 9718infinity are formatted as 9719@samp{-inf} or @samp{-infinity}, 9720and positive infinity as 9721@samp{inf} or @samp{infinity}. 
9722The special ``not a number'' value formats as @samp{-nan} or @samp{nan} 9723(@pxref{Math Definitions}). 9724 9725@item @code{%F} 9726Like @samp{%f}, but the infinity and ``not a number'' values are spelled 9727using uppercase letters. 9728 9729The @samp{%F} format is a POSIX extension to ISO C; not all systems 9730support it. On those that don't, @command{gawk} uses @samp{%f} instead. 9731 9732@item @code{%g}, @code{%G} 9733Print a number in either scientific notation or in floating-point 9734notation, whichever uses fewer characters; if the result is printed in 9735scientific notation, @samp{%G} uses @samp{E} instead of @samp{e}. 9736 9737@item @code{%o} 9738Print an unsigned octal integer 9739(@pxref{Nondecimal-numbers}). 9740 9741@item @code{%s} 9742Print a string. 9743 9744@item @code{%u} 9745Print an unsigned decimal integer. 9746(This format is of marginal use, because all numbers in @command{awk} 9747are floating point; it is provided primarily for compatibility with C.) 9748 9749@item @code{%x}, @code{%X} 9750Print an unsigned hexadecimal integer; 9751@samp{%X} uses the letters @samp{A} through @samp{F} 9752instead of @samp{a} through @samp{f} 9753(@pxref{Nondecimal-numbers}). 9754 9755@item @code{%%} 9756Print a single @samp{%}. 9757This does not consume an 9758argument and it ignores any modifiers. 9759@end table 9760 9761@cindex dark corner @subentry format-control characters 9762@cindex @command{gawk} @subentry format-control characters 9763@quotation NOTE 9764When using the integer format-control letters for values that are 9765outside the range of the widest C integer type, @command{gawk} switches to 9766the @samp{%g} format specifier. If @option{--lint} is provided on the 9767command line (@pxref{Options}), @command{gawk} 9768warns about this. Other versions of @command{awk} may print invalid 9769values or do something else entirely. 
9770@value{DARKCORNER} 9771@end quotation 9772 9773@quotation NOTE 9774The IEEE 754 standard for floating-point arithmetic allows for special 9775values that represent ``infinity'' (positive and negative) and values 9776that are ``not a number'' (NaN). 9777 9778Input and output of these values occurs as text strings. This is 9779somewhat problematic for the @command{awk} language, which predates 9780the IEEE standard. Further details are provided in 9781@ref{POSIX Floating Point Problems}; please see there. 9782@end quotation 9783 9784@node Format Modifiers 9785@subsection Modifiers for @code{printf} Formats 9786 9787@cindex @code{printf} statement @subentry modifiers 9788@cindex modifiers, in format specifiers 9789A format specification can also include @dfn{modifiers} that can control 9790how much of the item's value is printed, as well as how much space it gets. 9791The modifiers come between the @samp{%} and the format-control letter. 9792We use the bullet symbol ``@bullet{}'' in the following examples to 9793represent 9794spaces in the output. Here are the possible modifiers, in the order in 9795which they may appear: 9796 9797@table @asis 9798@cindex differences in @command{awk} and @command{gawk} @subentry @code{print}/@code{printf} statements 9799@cindex @code{printf} statement @subentry positional specifiers 9800@c the code{} does NOT start a secondary 9801@cindex positional specifiers, @code{printf} statement 9802@item @code{@var{N}$} 9803An integer constant followed by a @samp{$} is a @dfn{positional specifier}. 9804Normally, format specifications are applied to arguments in the order 9805given in the format string. With a positional specifier, the format 9806specification is applied to a specific argument, instead of what 9807would be the next argument in the list. Positional specifiers begin 9808counting with one. 
Thus: 9809 9810@example 9811printf "%s %s\n", "don't", "panic" 9812printf "%2$s %1$s\n", "panic", "don't" 9813@end example 9814 9815@noindent 9816prints the famous friendly message twice. 9817 9818At first glance, this feature doesn't seem to be of much use. 9819It is in fact a @command{gawk} extension, intended for use in translating 9820messages at runtime. 9821@xref{Printf Ordering}, 9822which describes how and why to use positional specifiers. 9823For now, we ignore them. 9824 9825@item @code{-} (Minus) 9826The minus sign, used before the width modifier (see later on in 9827this list), 9828says to left-justify 9829the argument within its specified width. Normally, the argument 9830is printed right-justified in the specified width. Thus: 9831 9832@example 9833printf "%-4s", "foo" 9834@end example 9835 9836@noindent 9837prints @samp{foo@bullet{}}. 9838 9839@item @var{space} 9840For numeric conversions, prefix positive values with a space and 9841negative values with a minus sign. 9842 9843@item @code{+} 9844The plus sign, used before the width modifier (see later on in 9845this list), 9846says to always supply a sign for numeric conversions, even if the data 9847to format is positive. The @samp{+} overrides the space modifier. 9848 9849@item @code{#} 9850Use an ``alternative form'' for certain control letters. 9851For @samp{%o}, supply a leading zero. 9852For @samp{%x} and @samp{%X}, supply a leading @samp{0x} or @samp{0X} for 9853a nonzero result. 9854For @samp{%e}, @samp{%E}, @samp{%f}, and @samp{%F}, the result always 9855contains a decimal point. 9856For @samp{%g} and @samp{%G}, trailing zeros are not removed from the result. 9857 9858@item @code{0} 9859A leading @samp{0} (zero) acts as a flag indicating that output should be 9860padded with zeros instead of spaces. 9861This applies only to the numeric output formats. 9862This flag only has an effect when the field width is wider than the 9863value to print. 
9864 9865@item @code{'} 9866A single quote or apostrophe character is a POSIX extension to ISO C. 9867It indicates that the integer part of a floating-point value, or the 9868entire part of an integer decimal value, should have a thousands-separator 9869character in it. This only works in locales that support such characters. 9870For example: 9871 9872@example 9873$ @kbd{cat thousands.awk} @ii{Show source program} 9874@print{} BEGIN @{ printf "%'d\n", 1234567 @} 9875$ @kbd{LC_ALL=C gawk -f thousands.awk} 9876@print{} 1234567 @ii{Results in} "C" @ii{locale} 9877$ @kbd{LC_ALL=en_US.UTF-8 gawk -f thousands.awk} 9878@print{} 1,234,567 @ii{Results in US English UTF locale} 9879@end example 9880 9881@noindent 9882For more information about locales and internationalization issues, 9883see @ref{Locales}. 9884 9885@quotation NOTE 9886The @samp{'} flag is a nice feature, but its use complicates things: it 9887becomes difficult to use it in command-line programs. For information 9888on appropriate quoting tricks, see @ref{Quoting}. 9889@end quotation 9890 9891@item @var{width} 9892This is a number specifying the desired minimum width of a field. Inserting any 9893number between the @samp{%} sign and the format-control character forces the 9894field to expand to this width. The default way to do this is to 9895pad with spaces on the left. For example: 9896 9897@example 9898printf "%4s", "foo" 9899@end example 9900 9901@noindent 9902prints @samp{@bullet{}foo}. 9903 9904The value of @var{width} is a minimum width, not a maximum. If the item 9905value requires more than @var{width} characters, it can be as wide as 9906necessary. Thus, the following: 9907 9908@example 9909printf "%4s", "foobar" 9910@end example 9911 9912@noindent 9913prints @samp{foobar}. 9914 9915Preceding the @var{width} with a minus sign causes the output to be 9916padded with spaces on the right, instead of on the left. 
9917 9918@item @code{.@var{prec}} 9919A period followed by an integer constant 9920specifies the precision to use when printing. 9921The meaning of the precision varies by control letter: 9922 9923@table @asis 9924@item @code{%d}, @code{%i}, @code{%o}, @code{%u}, @code{%x}, @code{%X} 9925Minimum number of digits to print. 9926 9927@item @code{%e}, @code{%E}, @code{%f}, @code{%F} 9928Number of digits to the right of the decimal point. 9929 9930@item @code{%g}, @code{%G} 9931Maximum number of significant digits. 9932 9933@item @code{%s} 9934Maximum number of characters from the string that should print. 9935@end table 9936 9937Thus, the following: 9938 9939@example 9940printf "%.4s", "foobar" 9941@end example 9942 9943@noindent 9944prints @samp{foob}. 9945@end table 9946 9947The C library @code{printf}'s dynamic @var{width} and @var{prec} 9948capability (e.g., @code{"%*.*s"}) is supported. Instead of 9949supplying explicit @var{width} and/or @var{prec} values in the format 9950string, they are passed in the argument list. For example: 9951 9952@example 9953w = 5 9954p = 3 9955s = "abcdefg" 9956printf "%*.*s\n", w, p, s 9957@end example 9958 9959@noindent 9960is exactly equivalent to: 9961 9962@example 9963s = "abcdefg" 9964printf "%5.3s\n", s 9965@end example 9966 9967@noindent 9968Both programs output @samp{@w{@bullet{}@bullet{}abc}}. 9969Earlier versions of @command{awk} did not support this capability. 9970If you must use such a version, you may simulate this feature by using 9971concatenation to build up the format string, like so: 9972 9973@example 9974w = 5 9975p = 3 9976s = "abcdefg" 9977printf "%" w "." p "s\n", s 9978@end example 9979 9980@noindent 9981This is not particularly easy to read, but it does work. 
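
One practical use of a dynamic width is sizing a column to fit the data.
The following is a minimal sketch, assuming input such as @file{mail-list}
whose first field is a name; the array @code{name} and the variable
@code{w} are illustrative names only. It remembers the longest name seen
and then uses that length as the width of a left-justified column:

@example
@{
    name[NR] = $1
    if (length($1) > w)
        w = length($1)      # widest name seen so far
@}

END @{
    for (i = 1; i <= NR; i++)
        printf "%-*s|\n", w, name[i]
@}
@end example

@noindent
Because the width is taken from the argument list, the format string
itself never needs to be rebuilt.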
9982 9983@c @cindex lint checks 9984@cindex troubleshooting @subentry fatal errors @subentry @code{printf} format strings 9985@cindex POSIX @command{awk} @subentry @code{printf} format strings and 9986C programmers may be used to supplying additional modifiers (@samp{h}, 9987@samp{j}, @samp{l}, @samp{L}, @samp{t}, and @samp{z}) in @code{printf} 9988format strings. These are not valid in @command{awk}. Most @command{awk} 9989implementations silently ignore them. If @option{--lint} is provided 9990on the command line (@pxref{Options}), @command{gawk} warns about their 9991use. If @option{--posix} is supplied, their use is a fatal error. 9992 9993@node Printf Examples 9994@subsection Examples Using @code{printf} 9995 9996The following simple example shows 9997how to use @code{printf} to make an aligned table: 9998 9999@example 10000awk '@{ printf "%-10s %s\n", $1, $2 @}' mail-list 10001@end example 10002 10003@noindent 10004This command 10005prints the names of the people (@code{$1}) in the file 10006@file{mail-list} as a string of 10 characters that are left-justified. It also 10007prints the phone numbers (@code{$2}) next on the line. This 10008produces an aligned two-column table of names and phone numbers, 10009as shown here: 10010 10011@example 10012$ @kbd{awk '@{ printf "%-10s %s\n", $1, $2 @}' mail-list} 10013@print{} Amelia 555-5553 10014@print{} Anthony 555-3412 10015@print{} Becky 555-7685 10016@print{} Bill 555-1675 10017@print{} Broderick 555-0542 10018@print{} Camilla 555-2912 10019@print{} Fabius 555-1234 10020@print{} Julie 555-6699 10021@print{} Martin 555-6480 10022@print{} Samuel 555-3430 10023@print{} Jean-Paul 555-2127 10024@end example 10025 10026In this case, the phone numbers had to be printed as strings because 10027the numbers are separated by dashes. Printing the phone numbers as 10028numbers would have produced just the first three digits: @samp{555}. 10029This would have been pretty confusing. 
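
The truncation is easy to demonstrate. As a small sketch, using a name
and number from @file{mail-list} written as literal strings, the @samp{%d}
format converts the string to a number, and that conversion stops at the
first dash:

@example
$ @kbd{awk 'BEGIN @{ printf "%-10s %d\n", "Amelia", "555-5553" @}'}
@print{} Amelia     555
@end example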
10030 10031It wasn't necessary to specify a width for the phone numbers because 10032they are last on their lines. They don't need to have spaces 10033after them. 10034 10035The table could be made to look even nicer by adding headings to the 10036tops of the columns. This is done using a @code{BEGIN} rule 10037(@pxref{BEGIN/END}) 10038so that the headers are only printed once, at the beginning of 10039the @command{awk} program: 10040 10041@example 10042awk 'BEGIN @{ print "Name Number" 10043 print "---- ------" @} 10044 @{ printf "%-10s %s\n", $1, $2 @}' mail-list 10045@end example 10046 10047The preceding example mixes @code{print} and @code{printf} statements in 10048the same program. Using just @code{printf} statements can produce the 10049same results: 10050 10051@example 10052awk 'BEGIN @{ printf "%-10s %s\n", "Name", "Number" 10053 printf "%-10s %s\n", "----", "------" @} 10054 @{ printf "%-10s %s\n", $1, $2 @}' mail-list 10055@end example 10056 10057@noindent 10058Printing each column heading with the same format specification 10059used for the column elements ensures that the headings 10060are aligned just like the columns. 10061 10062The fact that the same format specification is used three times can be 10063emphasized by storing it in a variable, like this: 10064 10065@example 10066awk 'BEGIN @{ format = "%-10s %s\n" 10067 printf format, "Name", "Number" 10068 printf format, "----", "------" @} 10069 @{ printf format, $1, $2 @}' mail-list 10070@end example 10071 10072 10073@node Redirection 10074@section Redirecting Output of @code{print} and @code{printf} 10075 10076@cindex output redirection 10077@cindex redirection @subentry of output 10078@cindex @option{--sandbox} option @subentry output redirection with @code{print} @subentry @code{printf} 10079So far, the output from @code{print} and @code{printf} has gone 10080to the standard 10081output, usually the screen. Both @code{print} and @code{printf} can 10082also send their output to other places. 
10083This is called @dfn{redirection}. 10084 10085@quotation NOTE 10086When @option{--sandbox} is specified (@pxref{Options}), 10087redirecting output to files, pipes, and coprocesses is disabled. 10088@end quotation 10089 10090A redirection appears after the @code{print} or @code{printf} statement. 10091Redirections in @command{awk} are written just like redirections in shell 10092commands, except that they are written inside the @command{awk} program. 10093 10094@c the commas here are part of the see also 10095@cindex @code{print} statement @seealso{redirection of output} 10096@cindex @code{printf} statement @seealso{redirection of output} 10097There are four forms of output redirection: output to a file, output 10098appended to a file, output through a pipe to another command, and output 10099to a coprocess. We show them all for the @code{print} statement, 10100but they work identically for @code{printf}: 10101 10102@table @code 10103@cindex @code{>} (right angle bracket) @subentry @code{>} operator (I/O) 10104@cindex right angle bracket (@code{>}) @subentry @code{>} operator (I/O) 10105@cindex operators @subentry input/output 10106@item print @var{items} > @var{output-file} 10107This redirection prints the items into the output file named 10108@var{output-file}. The @value{FN} @var{output-file} can be any 10109expression. Its value is changed to a string and then used as a 10110@value{FN} (@pxref{Expressions}). 10111 10112When this type of redirection is used, the @var{output-file} is erased 10113before the first output is written to it. Subsequent writes to the same 10114@var{output-file} do not erase @var{output-file}, but append to it. 10115(This is different from how you use redirections in shell scripts.) 10116If @var{output-file} does not exist, it is created. 
For example, here
is how an @command{awk} program can write a list of people's names to one
file named @file{name-list}, and a list of phone numbers to another file
named @file{phone-list}:

@example
$ @kbd{awk '@{ print $2 > "phone-list"}
> @kbd{print $1 > "name-list" @}' mail-list}
$ @kbd{cat phone-list}
@print{} 555-5553
@print{} 555-3412
@dots{}
$ @kbd{cat name-list}
@print{} Amelia
@print{} Anthony
@dots{}
@end example

@noindent
Each output file contains one name or number per line.

@cindex @code{>} (right angle bracket) @subentry @code{>>} operator (I/O)
@cindex right angle bracket (@code{>}) @subentry @code{>>} operator (I/O)
@item print @var{items} >> @var{output-file}
This redirection prints the items into the preexisting output file
named @var{output-file}. The difference between this and the
single-@samp{>} redirection is that the old contents (if any) of
@var{output-file} are not erased. Instead, the @command{awk} output is
appended to the file.
If @var{output-file} does not exist, then it is created.

@cindex @code{|} (vertical bar) @subentry @code{|} operator (I/O)
@cindex pipe @subentry output
@cindex output @subentry pipes
@item print @var{items} | @var{command}
It is possible to send output to another program through a pipe
instead of into a file. This redirection opens a pipe to
@var{command}, and writes the values of @var{items} through this pipe
to another process created to execute @var{command}.

The redirection argument @var{command} is actually an @command{awk}
expression. Its value is converted to a string whose contents give
the shell command to be run.
For example, the following produces two
files, one unsorted list of people's names, and one list sorted in reverse
alphabetical order:

@ignore
10/2000:
This isn't the best style, since COMMAND is assigned for each
record. It's done to avoid overfull hboxes in TeX. Leave it
alone for now and let's hope no-one notices.
@end ignore

@example
@group
awk '@{ print $1 > "names.unsorted"
       command = "sort -r > names.sorted"
       print $1 | command @}' mail-list
@end group
@end example

The unsorted list is written with an ordinary redirection, while
the sorted list is written by piping through the @command{sort} utility.

The next example uses redirection to mail a message to the mailing
list @code{bug-system}. This might be useful when trouble is encountered
in an @command{awk} script run periodically for system maintenance:

@example
report = "mail bug-system"
print("Awk script failed:", $0) | report
print("at record number", FNR, "of", FILENAME) | report
close(report)
@end example

The @code{close()} function is called here because it's a good idea to close
the pipe as soon as all the intended output has been sent to it.
@xref{Close Files And Pipes}
for more information.

This example also illustrates the use of a variable to represent
a @var{file} or @var{command}---it is not necessary to always
use a string constant. Using a variable is generally a good idea,
because (if you mean to refer to that same file or command)
@command{awk} requires that the string value be written identically
every time.
10202 10203@cindex coprocesses 10204@cindex @code{|} (vertical bar) @subentry @code{|&} operator (I/O) 10205@cindex operators @subentry input/output 10206@cindex differences in @command{awk} and @command{gawk} @subentry input/output operators 10207@item print @var{items} |& @var{command} 10208This redirection prints the items to the input of @var{command}. 10209The difference between this and the 10210single-@samp{|} redirection is that the output from @var{command} 10211can be read with @code{getline}. 10212Thus, @var{command} is a @dfn{coprocess}, which works together with 10213but is subsidiary to the @command{awk} program. 10214 10215This feature is a @command{gawk} extension, and is not available in 10216POSIX @command{awk}. 10217@ifnotdocbook 10218@xref{Getline/Coprocess}, 10219for a brief discussion. 10220@xref{Two-way I/O}, 10221for a more complete discussion. 10222@end ifnotdocbook 10223@ifdocbook 10224@xref{Getline/Coprocess} 10225for a brief discussion and 10226@ref{Two-way I/O} 10227for a more complete discussion. 10228@end ifdocbook 10229@end table 10230 10231Redirecting output using @samp{>}, @samp{>>}, @samp{|}, or @samp{|&} 10232asks the system to open a file, pipe, or coprocess only if the particular 10233@var{file} or @var{command} you specify has not already been written 10234to by your program or if it has been closed since it was last written to. 10235 10236@cindex troubleshooting @subentry printing 10237It is a common error to use @samp{>} redirection for the first @code{print} 10238to a file, and then to use @samp{>>} for subsequent output: 10239 10240@example 10241# clear the file 10242print "Don't panic" > "guide.txt" 10243@dots{} 10244# append 10245print "Avoid improbability generators" >> "guide.txt" 10246@end example 10247 10248@noindent 10249This is indeed how redirections must be used from the shell. But in 10250@command{awk}, it isn't necessary. 
In this kind of case, a program should 10251use @samp{>} for all the @code{print} statements, because the output file 10252is only opened once. (It happens that if you mix @samp{>} and @samp{>>} 10253output is produced in the expected order. However, mixing the operators 10254for the same file is definitely poor style, and is confusing to readers 10255of your program.) 10256 10257@cindex differences in @command{awk} and @command{gawk} @subentry implementation limitations 10258@cindex implementation issues, @command{gawk} @subentry limits 10259@cindex @command{awk} @subentry implementation issues @subentry pipes 10260@cindex @command{gawk} @subentry implementation issues @subentry pipes 10261@ifnotinfo 10262As mentioned earlier 10263(@pxref{Getline Notes}), 10264many 10265@end ifnotinfo 10266@ifnottex 10267@ifnotdocbook 10268Many 10269@end ifnotdocbook 10270@end ifnottex 10271older 10272@command{awk} implementations limit the number of pipelines that an @command{awk} 10273program may have open to just one! In @command{gawk}, there is no such limit. 10274@command{gawk} allows a program to 10275open as many pipelines as the underlying operating system permits. 10276 10277@sidebar Piping into @command{sh} 10278@cindex shells @subentry piping commands into 10279 10280A particularly powerful way to use redirection is to build command lines 10281and pipe them into the shell, @command{sh}. For example, suppose you 10282have a list of files brought over from a system where all the @value{FN}s 10283are stored in uppercase, and you wish to rename them to have names in 10284all lowercase. The following program is both simple and efficient: 10285 10286@c @cindex @command{mv} utility 10287@example 10288@{ printf("mv %s %s\n", $0, tolower($0)) | "sh" @} 10289 10290END @{ close("sh") @} 10291@end example 10292 10293The @code{tolower()} function returns its argument string with all 10294uppercase characters converted to lowercase 10295(@pxref{String Functions}). 
10296The program builds up a list of command lines, 10297using the @command{mv} utility to rename the files. 10298It then sends the list to the shell for execution. 10299 10300@xref{Shell Quoting} for a function that can help in generating 10301command lines to be fed to the shell. 10302@end sidebar 10303 10304@node Special FD 10305@section Special Files for Standard Preopened Data Streams 10306@cindex standard input 10307@cindex input @subentry standard 10308@cindex standard output 10309@cindex output @subentry standard 10310@cindex error output 10311@cindex standard error 10312@cindex file descriptors 10313@cindex files @subentry descriptors @seeentry{file descriptors} 10314 10315Running programs conventionally have three input and output streams 10316already available to them for reading and writing. These are known 10317as the @dfn{standard input}, @dfn{standard output}, and @dfn{standard 10318error output}. These open streams (and any other open files or pipes) 10319are often referred to by the technical term @dfn{file descriptors}. 10320 10321These streams are, by default, connected to your keyboard and screen, but 10322they are often redirected with the shell, via the @samp{<}, @samp{<<}, 10323@samp{>}, @samp{>>}, @samp{>&}, and @samp{|} operators. Standard error 10324is typically used for writing error messages; the reason there are two separate 10325streams, standard output and standard error, is so that they can be 10326redirected separately. 10327 10328@cindex differences in @command{awk} and @command{gawk} @subentry error messages 10329@cindex error handling 10330In traditional implementations of @command{awk}, the only way to write an error 10331message to standard error in an @command{awk} program is as follows: 10332 10333@example 10334print "Serious error detected!" 
| "cat 1>&2" 10335@end example 10336 10337@noindent 10338This works by opening a pipeline to a shell command that can access the 10339standard error stream that it inherits from the @command{awk} process. 10340@c 8/2014: Mike Brennan says not to cite this as inefficient. So, fixed. 10341This is far from elegant, and it also requires a 10342separate process. So people writing @command{awk} programs often 10343don't do this. Instead, they send the error messages to the 10344screen, like this: 10345 10346@example 10347print "Serious error detected!" > "/dev/tty" 10348@end example 10349 10350@noindent 10351(@file{/dev/tty} is a special file supplied by the operating system 10352that is connected to your keyboard and screen. It represents the 10353``terminal,''@footnote{The ``tty'' in @file{/dev/tty} stands for 10354``Teletype,'' a serial terminal.} which on modern systems is a keyboard 10355and screen, not a serial console.) 10356This generally has the same effect, but not always: although the 10357standard error stream is usually the screen, it can be redirected; when 10358that happens, writing to the screen is not correct. In fact, if 10359@command{awk} is run from a background job, it may not have a 10360terminal at all. 10361Then opening @file{/dev/tty} fails. 10362 10363@command{gawk}, BWK @command{awk}, and @command{mawk} provide 10364special @value{FN}s for accessing the three standard streams. 10365If the @value{FN} matches one of these special names when @command{gawk} 10366(or one of the others) redirects input or output, then it directly uses 10367the descriptor that the @value{FN} stands for. 
These special 10368@value{FN}s work for all operating systems that @command{gawk} 10369has been ported to, not just those that are POSIX-compliant: 10370 10371@cindex common extensions @subentry @code{/dev/stdin} special file 10372@cindex common extensions @subentry @code{/dev/stdout} special file 10373@cindex common extensions @subentry @code{/dev/stderr} special file 10374@cindex extensions @subentry common @subentry @code{/dev/stdin} special file 10375@cindex extensions @subentry common @subentry @code{/dev/stdout} special file 10376@cindex extensions @subentry common @subentry @code{/dev/stderr} special file 10377@cindex file names @subentry standard streams in @command{gawk} 10378@cindex @code{/dev/@dots{}} special files 10379@cindex files @subentry @code{/dev/@dots{}} special files 10380@cindex @code{/dev/fd/@var{N}} special files (@command{gawk}) 10381@table @file 10382@item /dev/stdin 10383The standard input (file descriptor 0). 10384 10385@item /dev/stdout 10386The standard output (file descriptor 1). 10387 10388@item /dev/stderr 10389The standard error output (file descriptor 2). 10390@end table 10391 10392With these facilities, 10393the proper way to write an error message then becomes: 10394 10395@example 10396print "Serious error detected!" > "/dev/stderr" 10397@end example 10398 10399@cindex troubleshooting @subentry quotes with file names 10400Note the use of quotes around the @value{FN}. 10401Like with any other redirection, the value must be a string. 10402It is a common error to omit the quotes, which leads 10403to confusing results. 10404 10405@command{gawk} does not treat these @value{FN}s as special when 10406in POSIX-compatibility mode. However, because BWK @command{awk} 10407supports them, @command{gawk} does support them even when 10408invoked with the @option{--traditional} option (@pxref{Options}). 
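
As a minimal sketch of how this might be used in practice, the following
defines a small helper for diagnostics. The function name @code{warn()},
the program name @samp{myprog}, and the two-field check are all
hypothetical, chosen only for illustration:

@example
# warn: print a diagnostic on standard error
function warn(msg)
@{
    printf "myprog: %s:%d: %s\n", FILENAME, FNR, msg > "/dev/stderr"
@}

NF < 2 @{ warn("expected at least two fields") @}
@end example

@noindent
Because the message goes to standard error, redirecting the program's
standard output to a file does not capture the diagnostics.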
10409 10410@node Special Files 10411@section Special @value{FFN}s in @command{gawk} 10412@cindex @command{gawk} @subentry file names in 10413 10414Besides access to standard input, standard output, and standard error, 10415@command{gawk} provides access to any open file descriptor. 10416Additionally, there are special @value{FN}s reserved for 10417TCP/IP networking. 10418 10419@menu 10420* Other Inherited Files:: Accessing other open files with 10421 @command{gawk}. 10422* Special Network:: Special files for network communications. 10423* Special Caveats:: Things to watch out for. 10424@end menu 10425 10426@node Other Inherited Files 10427@subsection Accessing Other Open Files with @command{gawk} 10428 10429Besides the @code{/dev/stdin}, @code{/dev/stdout}, and @code{/dev/stderr} 10430special @value{FN}s mentioned earlier, @command{gawk} provides syntax 10431for accessing any other inherited open file: 10432 10433@table @file 10434@item /dev/fd/@var{N} 10435The file associated with file descriptor @var{N}. Such a file must 10436be opened by the program initiating the @command{awk} execution (typically 10437the shell). Unless special pains are taken in the shell from which 10438@command{gawk} is invoked, only descriptors 0, 1, and 2 are available. 10439@end table 10440 10441The @value{FN}s @file{/dev/stdin}, @file{/dev/stdout}, and @file{/dev/stderr} 10442are essentially aliases for @file{/dev/fd/0}, @file{/dev/fd/1}, and 10443@file{/dev/fd/2}, respectively. However, those names are more self-explanatory. 10444 10445Note that using @code{close()} on a @value{FN} of the 10446form @code{"/dev/fd/@var{N}"}, for file descriptor numbers 10447above two, does actually close the given file descriptor. 
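
For example, the shell can open an additional descriptor before starting
@command{gawk}. In this sketch, the @samp{3>} shell redirection and the
@value{FN} @file{extra.out} are illustrative assumptions; the shell opens
descriptor 3 for writing, and @command{gawk} then writes to it by name:

@example
$ @kbd{gawk 'BEGIN @{ print "to descriptor 3" > "/dev/fd/3" @}' 3> extra.out}
$ @kbd{cat extra.out}
@print{} to descriptor 3
@end example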
10448 10449@node Special Network 10450@subsection Special Files for Network Communications 10451@cindex networks @subentry support for 10452@cindex TCP/IP @subentry support for 10453 10454@command{gawk} programs 10455can open a two-way 10456TCP/IP connection, acting as either a client or a server. 10457This is done using a special @value{FN} of the form: 10458 10459@example 10460@file{/@var{net-type}/@var{protocol}/@var{local-port}/@var{remote-host}/@var{remote-port}} 10461@end example 10462 10463The @var{net-type} is one of @samp{inet}, @samp{inet4}, or @samp{inet6}. 10464The @var{protocol} is one of @samp{tcp} or @samp{udp}, 10465and the other fields represent the other essential pieces of information 10466for making a networking connection. 10467These @value{FN}s are used with the @samp{|&} operator for communicating 10468with @w{a coprocess} 10469(@pxref{Two-way I/O}). 10470This is an advanced feature, mentioned here only for completeness. 10471Full discussion is delayed until 10472@ref{TCP/IP Networking}. 10473 10474@node Special Caveats 10475@subsection Special @value{FFN} Caveats 10476 10477Here are some things to bear in mind when using the 10478special @value{FN}s that @command{gawk} provides: 10479 10480@itemize @value{BULLET} 10481@cindex compatibility mode (@command{gawk}) @subentry file names 10482@cindex file names @subentry in compatibility mode 10483@cindex POSIX mode 10484@item 10485Recognition of the @value{FN}s for the three standard preopened 10486files is disabled only in POSIX mode. 10487 10488@item 10489Recognition of the other special @value{FN}s is disabled if @command{gawk} is in 10490compatibility mode (either @option{--traditional} or @option{--posix}; 10491@pxref{Options}). 10492 10493@item 10494@command{gawk} @emph{always} 10495interprets these special @value{FN}s. 
For example, using @samp{/dev/fd/4}
for output actually writes on file descriptor 4, and not on a new
file descriptor that is @code{dup()}ed from file descriptor 4. Most of
the time this does not matter; however, it is important to @emph{not}
close any of the files related to file descriptors 0, 1, and 2.
Doing so results in unpredictable behavior.
@end itemize

@node Close Files And Pipes
@section Closing Input and Output Redirections
@cindex files @subentry output @seeentry{output files}
@cindex input files @subentry closing
@cindex output @subentry files, closing
@cindex pipe @subentry closing
@cindex coprocesses @subentry closing
@cindex @code{getline} command @subentry coprocesses, using from

If the same @value{FN} or the same shell command is used with @code{getline}
more than once during the execution of an @command{awk} program
(@pxref{Getline}),
the file is opened (or the command is executed) the first time only.
At that time, the first record of input is read from that file or command.
The next time the same file or command is used with @code{getline},
another record is read from it, and so on.

Similarly, when a file or pipe is opened for output, @command{awk} remembers
the @value{FN} or command associated with it, and subsequent
writes to the same file or command are appended to the previous writes.
The file or pipe stays open until @command{awk} exits.

@cindexawkfunc{close}
This implies that special steps are necessary in order to read the same
file again from the beginning, or to rerun a shell command (rather than
reading more output from the same command).
The @code{close()} function
makes these things possible:

@example
close(@var{filename})
@end example

@noindent
or:

@example
close(@var{command})
@end example

The argument @var{filename} or @var{command} can be any expression. Its
value must @emph{exactly} match the string that was used to open the file or
start the command (spaces and other ``irrelevant'' characters
included). For example, if you open a pipe with this:

@example
"sort -r names" | getline foo
@end example

@noindent
then you must close it with this:

@example
close("sort -r names")
@end example

Once this function call is executed, the next @code{getline} from that
file or command, or the next @code{print} or @code{printf} to that
file or command, reopens the file or reruns the command.
Because the expression that you use to close a file or pipeline must
exactly match the expression used to open the file or run the command,
it is good practice to use a variable to store the @value{FN} or command.
The previous example becomes the following:

@example
@group
sortcom = "sort -r names"
sortcom | getline foo
@end group
@group
@dots{}
close(sortcom)
@end group
@end example

@noindent
This helps avoid hard-to-find typographical errors in your @command{awk}
programs. Here are some of the reasons for closing an output file:

@itemize @value{BULLET}
@item
To write a file and read it back later on in the same @command{awk}
program. Close the file after writing it, then
begin reading it with @code{getline}.

@item
To write numerous files, successively, in the same @command{awk}
program.
If the files aren't closed, eventually @command{awk} may exceed a
system limit on the number of open files in one process. It is best to
close each one when the program has finished writing it.

@item
To make a command finish. When output is redirected through a pipe,
the command reading the pipe normally continues to try to read input
as long as the pipe is open. Often this means the command cannot
really do its work until the pipe is closed. For example, if
output is redirected to the @command{mail} program, the message is not
actually sent until the pipe is closed.

@item
To run the same program a second time, with the same arguments.
This is not the same thing as giving more input to the first run!

For example, suppose a program pipes output to the @command{mail} program.
If it outputs several lines redirected to this pipe without closing
it, they make a single message of several lines. By contrast, if the
program closes the pipe after each line of output, then each line makes
a separate message.
@end itemize

@cindex differences in @command{awk} and @command{gawk} @subentry @code{close()} function
@cindex portability @subentry @code{close()} function and
@cindex @code{close()} function @subentry portability
If you use more files than the system allows you to have open,
@command{gawk} attempts to multiplex the available open files among
your @value{DF}s. @command{gawk}'s ability to do this depends upon the
facilities of your operating system, so it may not always work. It is
therefore both good practice and good portability advice to always
use @code{close()} on your files when you are done with them.
In fact, if you are using a lot of pipes, it is essential that
you close commands when done.
For example, consider something like this:

@example
@{
    @dots{}
    command = ("grep " $1 " /some/file | my_prog -q " $3)
    while ((command | getline) > 0) @{
        @var{process output of} command
    @}
    # need close(command) here
@}
@end example

This example creates a new pipeline based on data in @emph{each} record.
Without the call to @code{close()} indicated in the comment, @command{awk}
creates child processes to run the commands, until it eventually
runs out of file descriptors for more pipelines.

Even though each command has finished (as indicated by the end-of-file
return status from @code{getline}), the child process is not
terminated;@footnote{The technical terminology is rather morbid.
The finished child is called a ``zombie,'' and cleaning up after
it is referred to as ``reaping.''}
@c Good old UNIX: give the marketing guys fits, that's the ticket
more importantly, the file descriptor for the pipe
is not closed and released until @code{close()} is called or
@command{awk} exits.

@code{close()} silently does nothing if given an argument that
does not represent a file, pipe, or coprocess that was opened with
a redirection. In such a case, it returns a negative value,
indicating an error. In addition, @command{gawk} sets @code{ERRNO}
to a string indicating the error.

Note also that @samp{close(FILENAME)} has no ``magic'' effects on the
implicit loop that reads through the files named on the command line.
It is, more likely, a close of a file that was never opened with a
redirection, so @command{awk} silently does nothing, except return
a negative value.
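
As a small illustration of the points above, the following sketch reads
from the same command before and after a call to @code{close()}, and
then calls @code{close()} on a name that was never opened. The command
@samp{echo hello} and the variable names are arbitrary choices made for
this example:

@example
BEGIN @{
    cmd = "echo hello"
    cmd | getline first          # runs the command, reads its only line
    eof = (cmd | getline again)  # 0: end of input; command is not rerun
    close(cmd)
    ok = (cmd | getline second)  # 1: command was rerun after close()
    print eof, ok, second
    print close("never-opened")  # negative: no such redirection exists
@}
@end example

@noindent
With @command{gawk}, this should print @samp{0 1 hello} followed by
@samp{-1}.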

@cindex @code{|} (vertical bar) @subentry @code{|&} operator (I/O) @subentry pipes, closing
When using the @samp{|&} operator to communicate with a coprocess,
it is occasionally useful to be able to close one end of the two-way
pipe without closing the other.
This is done by supplying a second argument to @code{close()}.
As in any other call to @code{close()},
the first argument is the name of the command or special file used
to start the coprocess.
The second argument should be a string, with either of the values
@code{"to"} or @code{"from"}. Case does not matter.
As this is an advanced feature, discussion is
delayed until
@ref{Two-way I/O},
which describes it in more detail and gives an example.

@sidebar Using @code{close()}'s Return Value
@cindex dark corner @subentry @code{close()} function
@cindex @code{close()} function @subentry return value
@cindex return value, @code{close()} function
@cindex differences in @command{awk} and @command{gawk} @subentry @code{close()} function
@cindex Unix @command{awk} @subentry @code{close()} function and

In many older versions of Unix @command{awk}, the @code{close()} function
is actually a statement.
@value{DARKCORNER}
It is a syntax error to try to use the return
value from @code{close()}:

@example
command = "@dots{}"
command | getline info
retval = close(command)  # syntax error in many Unix awks
@end example

@cindex @command{gawk} @subentry @code{ERRNO} variable in
@cindex @code{ERRNO} variable @subentry with @command{close()} function
@command{gawk} treats @code{close()} as a function.
The return value is @minus{}1 if the argument names something
that was never opened with a redirection, or if there is
a system problem closing the file or process.
In these cases, @command{gawk} sets the predefined variable
@code{ERRNO} to a string describing the problem.

In @command{gawk}, starting with @value{PVERSION} 4.2, when closing a pipe or
coprocess (input or output), the return value is the exit status of the
command, as described in @ref{table-close-pipe-return-values}.@footnote{Prior
to @value{PVERSION} 4.2, the return value from closing a pipe or coprocess
was the full 16-bit exit value as defined by the @code{wait()} system
call.} Otherwise, it is the return value from the system's @code{close()}
or @code{fclose()} C functions when closing input or output files,
respectively. This value is zero if the close succeeds, or @minus{}1
if it fails.

@float Table,table-close-pipe-return-values
@caption{Return values from @code{close()} of a pipe}
@multitable @columnfractions .50 .50
@headitem Situation @tab Return value from @code{close()}
@item Normal exit of command @tab Command's exit status
@item Death by signal of command @tab 256 + number of murderous signal
@item Death by signal of command with core dump @tab 512 + number of murderous signal
@item Some kind of error @tab @minus{}1
@end multitable
@end float

@cindex POSIX mode
The POSIX standard is very vague; it says that @code{close()}
returns zero on success and a nonzero value otherwise. In general,
different implementations vary in what they report when closing
pipes; thus, the return value cannot be used portably.
@value{DARKCORNER}
In POSIX mode (@pxref{Options}), @command{gawk} just returns zero
when closing a pipe.
@end sidebar

@node Nonfatal
@section Enabling Nonfatal Output

This @value{SECTION} describes a @command{gawk}-specific feature.

In standard @command{awk}, output with @code{print} or @code{printf}
to a nonexistent file, or some other I/O error (such as filling up the
disk) is a fatal error.

@example
$ @kbd{gawk 'BEGIN @{ print "hi" > "/no/such/file" @}'}
@error{} gawk: cmd. line:1: fatal: can't redirect to `/no/such/file' (No
@error{} such file or directory)
@end example

@command{gawk} makes it possible to detect that an error has
occurred, allowing you to possibly recover from the error, or
at least print an error message of your choosing before exiting.
You can do this in one of two ways:

@itemize @bullet
@item
For all output files, by assigning any value to @code{PROCINFO["NONFATAL"]}.

@item
On a per-file basis, by assigning any value to
@code{PROCINFO[@var{filename}, "NONFATAL"]}.
Here, @var{filename} is the name of the file to which
you wish output to be nonfatal.
@end itemize

Once you have enabled nonfatal output, you must check @code{ERRNO}
after every relevant @code{print} or @code{printf} statement to
see if something went wrong. It is also a good idea to initialize
@code{ERRNO} to zero before attempting the output. For example:

@example
$ @kbd{gawk '}
> @kbd{BEGIN @{}
> @kbd{    PROCINFO["NONFATAL"] = 1}
> @kbd{    ERRNO = 0}
> @kbd{    print "hi" > "/no/such/file"}
> @kbd{    if (ERRNO) @{}
> @kbd{        print("Output failed:", ERRNO) > "/dev/stderr"}
> @kbd{        exit 1}
> @kbd{    @}}
> @kbd{@}'}
@error{} Output failed: No such file or directory
@end example

Here, @command{gawk} did not produce a fatal error; instead
it let the @command{awk} program code detect the problem and handle it.

This mechanism also works for standard output and standard error.
For standard output, you may use @code{PROCINFO["-", "NONFATAL"]}
or @code{PROCINFO["/dev/stdout", "NONFATAL"]}. For standard error, use
@code{PROCINFO["/dev/stderr", "NONFATAL"]}.

@cindex @env{GAWK_SOCK_RETRIES} environment variable
@cindex environment variables @subentry @env{GAWK_SOCK_RETRIES}
When attempting to open a TCP/IP socket (@pxref{TCP/IP Networking}),
@command{gawk} tries multiple times. The @env{GAWK_SOCK_RETRIES}
environment variable (@pxref{Other Environment Variables}) allows you to
override @command{gawk}'s built-in default number of attempts. However,
once nonfatal I/O is enabled for a given socket, @command{gawk} only
retries once, relying on @command{awk}-level code to notice that there
was a problem.

@node Output Summary
@section Summary

@itemize @value{BULLET}
@item
The @code{print} statement prints comma-separated expressions. Each
expression is separated by the value of @code{OFS} and terminated by
the value of @code{ORS}. @code{OFMT} provides the conversion format
for numeric values for the @code{print} statement.

@item
The @code{printf} statement provides finer-grained control over output,
with format-control letters for different data types and various flags
that modify the behavior of the format-control letters.

@item
Output from both @code{print} and @code{printf} may be redirected to
files, pipes, and coprocesses.

@item
@command{gawk} provides special @value{FN}s for access to standard input,
output, and error, and for network communications.

@item
Use @code{close()} to close open file, pipe, and coprocess redirections.
For coprocesses, it is possible to close only one direction of the
communications.

@item
Normally errors with @code{print} or @code{printf} are fatal.
@command{gawk} lets you make output errors be nonfatal either for
all files or on a per-file basis. You must then check for errors
after every relevant output statement.

@end itemize

@c EXCLUDE START
@node Output Exercises
@section Exercises

@enumerate
@item
Rewrite the program:

@example
awk 'BEGIN @{ print "Month Crates"
             print "----- ------" @}
           @{ print $1, " ", $2 @}' inventory-shipped
@end example

@noindent
from @ref{Output Separators}, by using a new value of @code{OFS}.

@item
Use the @code{printf} statement to line up the headings and table data
for the @file{inventory-shipped} example that was covered in @ref{Print}.

@item
What happens if you forget the double quotes when redirecting
output, as follows:

@example
BEGIN @{ print "Serious error detected!" > /dev/stderr @}
@end example

@end enumerate
@c EXCLUDE END


@node Expressions
@chapter Expressions
@cindex expressions

Expressions are the basic building blocks of @command{awk} patterns
and actions. An expression evaluates to a value that you can print, test,
or pass to a function. Additionally, an expression
can assign a new value to a variable or a field by using an assignment operator.

An expression can serve as a pattern or action statement on its own.
Most other kinds of
statements contain one or more expressions that specify the data on which to
operate. As in other languages, expressions in @command{awk} can include
variables, array references, constants, and function calls, as well as
combinations of these with various operators.

@menu
* Values::                      Constants, Variables, and Regular Expressions.
* All Operators::               @command{gawk}'s operators.
* Truth Values and Conditions:: Testing for true and false.
* Function Calls::              A function call is an expression.
* Precedence::                  How various operators nest.
* Locales::                     How the locale affects things.
* Expressions Summary::         Expressions summary.
@end menu

@node Values
@section Constants, Variables, and Conversions

Expressions are built up from values and the operations performed
upon them. This @value{SECTION} describes the elementary objects
that provide the values used in expressions.

@menu
* Constants::                   String, numeric and regexp constants.
* Using Constant Regexps::      When and how to use a regexp constant.
* Variables::                   Variables give names to values for later use.
* Conversion::                  The conversion of strings to numbers and vice
                                versa.
@end menu

@node Constants
@subsection Constant Expressions

@cindex constants @subentry types of

The simplest type of expression is the @dfn{constant}, which always has
the same value. There are three types of constants: numeric,
string, and regular expression.

Each is used in the appropriate context when you need a data
value that isn't going to change. Numeric constants can
have different forms, but are internally stored in an identical manner.

@menu
* Scalar Constants::            Numeric and string constants.
* Nondecimal-numbers::          What are octal and hex numbers.
* Regexp Constants::            Regular Expression constants.
@end menu

@node Scalar Constants
@subsubsection Numeric and String Constants

@cindex constants @subentry numeric
@cindex numeric @subentry constants
A @dfn{numeric constant} stands for a number. This number can be an
integer, a decimal fraction, or a number in scientific (exponential)
notation.@footnote{The internal representation of all numbers,
including integers, uses double-precision floating-point numbers.
On most modern systems, these are in IEEE 754 standard format.
@xref{Arbitrary Precision Arithmetic}, for much more information.}
Here are some examples of numeric constants that all
have the same value:

@example
105
1.05e+2
1050e-1
@end example

@cindex string @subentry constants
@cindex constants @subentry string
A @dfn{string constant} consists of a sequence of characters enclosed in
double quotation marks. For example:

@example
"parrot"
@end example

@noindent
@cindex differences in @command{awk} and @command{gawk} @subentry strings
@cindex strings @subentry length limitations
@cindex ASCII
represents the string whose contents are @samp{parrot}. Strings in
@command{gawk} can be of any length, and they can contain any of the possible
eight-bit ASCII characters, including ASCII @sc{nul} (character code zero).
Other @command{awk}
implementations may have difficulty with some character codes.

Some languages allow you to continue long strings across
multiple lines by ending the line with a backslash. For example in C:

@example
#include <stdio.h>

int main()
@{
    printf("hello, \
world\n");
    return 0;
@}
@end example

@noindent
In such a case, the C compiler removes both the backslash and the newline,
producing a string as if it had been typed @samp{"hello, world\n"}.
This is useful when a single string needs to contain a large amount of text.

The POSIX standard says explicitly that newlines are not allowed inside string
constants. And indeed, all @command{awk} implementations report an error
if you try to do so. For example:

@example
$ @kbd{gawk 'BEGIN @{ print "hello, }
> @kbd{world" @}'}
@print{} gawk: cmd. line:1: BEGIN @{ print "hello,
@print{} gawk: cmd. line:1:               ^ unterminated string
@print{} gawk: cmd. line:1: BEGIN @{ print "hello,
@print{} gawk: cmd. line:1:                               ^ syntax error
@end example

@cindex dark corner @subentry string continuation
@cindex strings @subentry continuation across lines
@cindex differences in @command{awk} and @command{gawk} @subentry strings
Although POSIX doesn't define what happens if you use an escaped
newline, as in the previous C example, all known versions of
@command{awk} allow you to do so. Unfortunately, what each one
does with such a string varies. @value{DARKCORNER} @command{gawk},
@command{mawk}, and the OpenSolaris POSIX @command{awk}
(@pxref{Other Versions}) elide the backslash and newline, as in C:

@example
$ @kbd{gawk 'BEGIN @{ print "hello, \}
> @kbd{world" @}'}
@print{} hello, world
@end example

@cindex POSIX mode
In POSIX mode (@pxref{Options}), @command{gawk} does not
allow escaped newlines. Otherwise, it behaves as just described.

BWK @command{awk} and BusyBox @command{awk}
remove the backslash but leave the newline
intact, as part of the string:

@example
$ @kbd{nawk 'BEGIN @{ print "hello, \}
> @kbd{world" @}'}
@print{} hello,
@print{} world
@end example

@node Nondecimal-numbers
@subsubsection Octal and Hexadecimal Numbers
@cindex octal numbers
@cindex hexadecimal numbers
@cindex numbers @subentry octal
@cindex numbers @subentry hexadecimal

In @command{awk}, all numbers are in decimal (i.e., base 10). Many other
programming languages allow you to specify numbers in other bases, often
octal (base 8) and hexadecimal (base 16).
In octal, the numbers go 0, 1, 2, 3, 4, 5, 6, 7, 10, 11, 12, and so on.
Just as @samp{11} in decimal is 1 times 10 plus 1, so
@samp{11} in octal is 1 times 8 plus 1. This equals 9 in decimal.
In hexadecimal, there are 16 digits. Because the everyday decimal
number system only has ten digits (@samp{0}--@samp{9}), the letters
@samp{a} through @samp{f} represent the rest.
(Case in the letters is usually irrelevant; hexadecimal @samp{a} and @samp{A}
have the same value.)
Thus, @samp{11} in
hexadecimal is 1 times 16 plus 1, which equals 17 in decimal.

Just by looking at plain @samp{11}, you can't tell what base it's in.
So, in C, C++, and other languages derived from C,
@c such as PERL, but we won't mention that....
there is a special notation to signify the base.
Octal numbers start with a leading @samp{0},
and hexadecimal numbers start with a leading @samp{0x} or @samp{0X}:

@table @code
@item 11
Decimal value 11

@item 011
Octal 11, decimal value 9

@item 0x11
Hexadecimal 11, decimal value 17
@end table

This example shows the difference:

@example
$ @kbd{gawk 'BEGIN @{ printf "%d, %d, %d\n", 011, 11, 0x11 @}'}
@print{} 9, 11, 17
@end example

Being able to use octal and hexadecimal constants in your programs is most
useful when working with data that cannot be represented conveniently as
characters or as regular numbers, such as binary data of various sorts.

@cindex @command{gawk} @subentry octal numbers and
@cindex @command{gawk} @subentry hexadecimal numbers and
@command{gawk} allows the use of octal and hexadecimal
constants in your program text. However, such numbers in the input data
are not treated differently; doing so by default would break old
programs.
(If you really need to do this, use the @option{--non-decimal-data}
command-line option;
@pxref{Nondecimal Data}.)
If you have octal or hexadecimal data,
you can use the @code{strtonum()} function
(@pxref{String Functions})
to convert the data into a number.
Most of the time, you will want to use octal or hexadecimal constants
when working with the built-in bit-manipulation functions;
see @ref{Bitwise Functions}
for more information.

Unlike in some early C implementations, @samp{8} and @samp{9} are not
valid in octal constants. For example, @command{gawk} treats @samp{018}
as decimal 18:

@example
$ @kbd{gawk 'BEGIN @{ print "021 is", 021 ; print 018 @}'}
@print{} 021 is 17
@print{} 18
@end example

@cindex compatibility mode (@command{gawk}) @subentry octal numbers
@cindex compatibility mode (@command{gawk}) @subentry hexadecimal numbers
Octal and hexadecimal source code constants are a @command{gawk} extension.
If @command{gawk} is in compatibility mode
(@pxref{Options}),
they are not available.

@sidebar A Constant's Base Does Not Affect Its Value

Once a numeric constant has
been converted internally into a number,
@command{gawk} no longer remembers
what the original form of the constant was; the internal value is
always used.
This has particular consequences for conversion of
numbers to strings:

@example
$ @kbd{gawk 'BEGIN @{ printf "0x11 is <%s>\n", 0x11 @}'}
@print{} 0x11 is <17>
@end example
@end sidebar

@node Regexp Constants
@subsubsection Regular Expression Constants

@cindex regexp constants
@cindex @code{~} (tilde), @code{~} operator
@cindex tilde (@code{~}), @code{~} operator
@cindex @code{!} (exclamation point) @subentry @code{!~} operator
@cindex exclamation point (@code{!}) @subentry @code{!~} operator
A @dfn{regexp constant} is a regular expression description enclosed in
slashes, such as @code{@w{/^beginning and end$/}}. Most regexps used in
@command{awk} programs are constant, but the @samp{~} and @samp{!~}
matching operators can also match computed or dynamic regexps
(which are typically just ordinary strings or variables that contain a regexp,
but could be more complex expressions).

@node Using Constant Regexps
@subsection Using Regular Expression Constants

Regular expression constants consist of text describing
a regular expression enclosed in slashes (such as @code{/the +answer/}).
This @value{SECTION} describes how such constants work in
POSIX @command{awk} and @command{gawk}, and then goes on to describe
@dfn{strongly typed regexp constants}, which are a @command{gawk} extension.

@menu
* Standard Regexp Constants::   Regexp constants in standard @command{awk}.
* Strong Regexp Constants::     Strongly typed regexp constants.
@end menu

@node Standard Regexp Constants
@subsubsection Standard Regular Expression Constants

@cindex dark corner @subentry regexp constants
When used on the righthand side of the @samp{~} or @samp{!~}
operators, a regexp constant merely stands for the regexp that is to be
matched.
However, regexp constants (such as @code{/foo/}) may be used like simple expressions.
When a
regexp constant appears by itself, it has the same meaning as if it appeared
in a pattern (i.e., @samp{($0 ~ /foo/)}).
@value{DARKCORNER}
@xref{Expression Patterns}.
This means that the following two code segments:

@example
if ($0 ~ /barfly/ || $0 ~ /camelot/)
    print "found"
@end example

@noindent
and:

@example
if (/barfly/ || /camelot/)
    print "found"
@end example

@noindent
are exactly equivalent.
One rather bizarre consequence of this rule is that the following
Boolean expression is valid, but does not do what its author probably
intended:

@example
# Note that /foo/ is on the left of the ~
if (/foo/ ~ $1) print "found foo"
@end example

@c @cindex automatic warnings
@c @cindex warnings, automatic
@cindex @command{gawk} @subentry regexp constants and
@cindex regexp constants @subentry in @command{gawk}
@noindent
This code is ``obviously'' testing @code{$1} for a match against the regexp
@code{/foo/}. But in fact, the expression @samp{/foo/ ~ $1} really means
@samp{($0 ~ /foo/) ~ $1}. In other words, first match the input record
against the regexp @code{/foo/}. The result is either zero or one,
depending upon the success or failure of the match. That result
is then matched against the first field in the record.
Because it is unlikely that you would ever really want to make this kind of
test, @command{gawk} issues a warning when it sees this construct in
a program.
Another consequence of this rule is that the assignment statement:

@example
matches = /foo/
@end example

@noindent
assigns either zero or one to the variable @code{matches}, depending
upon the contents of the current input record.
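
A quick way to observe this behavior, assuming a two-line input chosen
only for illustration, is to print the assigned value for each record:

@example
$ @kbd{printf 'foo bar\nbaz\n' | gawk '@{ matches = /foo/; print matches @}'}
@print{} 1
@print{} 0
@end example

@noindent
The first record contains @samp{foo}, so @code{matches} is one; the
second does not, so it is zero.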

@cindex differences in @command{awk} and @command{gawk} @subentry regexp constants
@cindex dark corner @subentry regexp constants @subentry as arguments to user-defined functions
@cindexgawkfunc{gensub}
@cindexawkfunc{sub}
@cindexawkfunc{gsub}
Constant regular expressions are also used as the first argument for
the @code{gensub()}, @code{sub()}, and @code{gsub()} functions, as the
second argument of the @code{match()} function,
and as the third argument of the @code{split()} and @code{patsplit()} functions
(@pxref{String Functions}).
Modern implementations of @command{awk}, including @command{gawk}, allow
the third argument of @code{split()} to be a regexp constant, but some
older implementations do not.
@value{DARKCORNER}
Because some built-in functions accept regexp constants as arguments,
confusion can arise when attempting to use regexp constants as arguments
to user-defined functions (@pxref{User-defined}). For example:

@example
@group
function mysub(pat, repl, str, global)
@{
    if (global)
        gsub(pat, repl, str)
    else
        sub(pat, repl, str)
    return str
@}
@end group

@group
@{
    @dots{}
    text = "hi! hi yourself!"
    mysub(/hi/, "howdy", text, 1)
    @dots{}
@}
@end group
@end example

@c @cindex automatic warnings
@c @cindex warnings, automatic
In this example, the programmer wants to pass a regexp constant to the
user-defined function @code{mysub()}, which in turn passes it on to
either @code{sub()} or @code{gsub()}. However, what really happens is that
the @code{pat} parameter is assigned a value of either one or zero, depending upon whether
or not @code{$0} matches @code{/hi/}.
@command{gawk} issues a warning when it sees a regexp constant used as
a parameter to a user-defined function, because passing a truth value in
this way is probably not what was intended.

@node Strong Regexp Constants
@subsubsection Strongly Typed Regexp Constants

This @value{SECTION} describes a @command{gawk}-specific feature.

As we saw in the previous @value{SECTION},
regexp constants (@code{/@dots{}/}) hold a strange position in the
@command{awk} language. In most contexts, they act like an expression:
@samp{$0 ~ /@dots{}/}. In other contexts, they denote only a regexp to
be matched. In no case are they really a ``first class citizen'' of the
language. That is, you cannot define a scalar variable whose type is
``regexp'' in the same sense that you can define a variable to be a
number or a string:

@example
num = 42        @ii{Numeric variable}
str = "hi"      @ii{String variable}
re = /foo/      @ii{Wrong!} re @ii{is the result of} $0 ~ /foo/
@end example

For a number of more advanced use cases,
it would be nice to have regexp constants that
are @dfn{strongly typed}; in other words, that denote a regexp useful
for matching, and not an expression.

@cindex values @subentry regexp
@command{gawk} provides this feature. A strongly typed regexp constant
looks almost like a regular regexp constant, except that it is preceded
by an @samp{@@} sign:

@example
re = @@/foo/     @ii{Regexp variable}
@end example

Strongly typed regexp constants @emph{cannot} be used everywhere that a
regular regexp constant can, because this would make the language even more
confusing. Instead, you may use them only in certain contexts:

@itemize @bullet
@item
On the righthand side of the @samp{~} and @samp{!~} operators: @samp{some_var ~ @@/foo/}
(@pxref{Regexp Usage}).

@item
In the @code{case} part of a @code{switch} statement
(@pxref{Switch Statement}).

@item
As an argument to one of the built-in functions that accept regexp constants:
@code{gensub()},
@code{gsub()},
@code{match()},
@code{patsplit()},
@code{split()},
and
@code{sub()}
(@pxref{String Functions}).

@item
As a parameter in a call to a user-defined function
(@pxref{User-defined}).

@item
As the return value of a user-defined function.

@item
On the righthand side of an assignment to a variable: @samp{some_var = @@/foo/}.
In this case, the type of @code{some_var} is regexp. Additionally, @code{some_var}
can be used with @samp{~} and @samp{!~}, passed to one of the built-in functions
listed above, or passed as a parameter to a user-defined function.
@end itemize

You may use the @option{-v} option (@pxref{Options}) to assign a
strongly typed regexp constant to a variable on the command line, like so:

@example
gawk -v pattern='@@/something(interesting)+/' @dots{}
@end example

@noindent
You may also make such assignments as regular command-line arguments
(@pxref{Other Arguments}).

You may use the @code{typeof()} built-in function
(@pxref{Type Functions})
to determine if a variable or function parameter is
a regexp variable.

The true power of this feature comes from the ability to create variables that
have regexp type. Such variables can be passed on to user-defined functions,
without the confusing aspects of computed regular expressions created from
strings or string constants. They may also be passed through indirect function
calls (@pxref{Indirect Calls})
and on to the built-in functions that accept regexp constants.
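
The following is a minimal sketch, not taken from the original text, of a
regexp-typed variable being passed through the @code{mysub()} function shown
in the previous @value{SECTION}. It assumes a @command{gawk} recent enough to
provide @code{typeof()}; the shell function name @code{typed_regexp_demo} is
purely illustrative:

```shell
# Illustrative wrapper (hypothetical name); requires gawk, not plain awk.
typed_regexp_demo() {
    gawk '
    function mysub(pat, repl, str, global)
    {
        if (global)
            gsub(pat, repl, str)
        else
            sub(pat, repl, str)
        return str
    }

    BEGIN {
        re = @/hi/      # regexp-typed value, not the result of $0 ~ /hi/
        print typeof(re)
        print mysub(re, "howdy", "hi! hi yourself!", 1)
    }'
}

# Run the demonstration only where gawk is installed.
if command -v gawk >/dev/null 2>&1; then
    typed_regexp_demo
fi
```

With @command{gawk} installed, this should print @samp{regexp} followed by
@samp{howdy! howdy yourself!}, because @code{pat} carries the regexp itself
rather than a truth value.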

When used in numeric conversions, strongly typed regexp variables convert
to zero. When used in string conversions, they convert to the string
value of the original regexp text.

There is an additional, interesting corner case. When used as the third
argument to @code{sub()} or @code{gsub()}, they retain their type. Thus,
if you have something like this:

@example
re = @@/don't panic/
sub(/don't/, "do", re)
print typeof(re), re
@end example

@noindent
then @code{re} retains its type, but now attempts to match the string
@samp{do panic}. This provides a (very indirect) way to create regexp-typed
variables at runtime.

@node Variables
@subsection Variables

@cindex variables @subentry user-defined
@cindex user-defined @subentry variables
@dfn{Variables} are ways of storing values at one point in your program for
use later in another part of your program. They can be manipulated
entirely within the program text, and they can also be assigned values
on the @command{awk} command line.

@menu
* Using Variables::             Using variables in your programs.
* Assignment Options::          Setting variables on the command line and a
                                summary of command-line syntax. This is an
                                advanced method of input.
@end menu

@node Using Variables
@subsubsection Using Variables in a Program

Variables let you give names to values and refer to them later. Variables
have already been used in many of the examples. The name of a variable
must be a sequence of letters, digits, or underscores, and it may not begin
with a digit.
Here, a @dfn{letter} is any one of the 52 upper- and lowercase
English letters. Other characters that may be defined as letters
in non-English locales are not valid in variable names.
Case is significant in variable names; @code{a} and @code{A}
are distinct variables.

A variable name is a valid expression by itself; it represents the
variable's current value. Variables are given new values with
@dfn{assignment operators}, @dfn{increment operators}, and
@dfn{decrement operators}
(@pxref{Assignment Ops}).
In addition, the @code{sub()} and @code{gsub()} functions can
change a variable's value, and the @code{match()}, @code{split()},
and @code{patsplit()} functions can change the contents of their
array parameters (@pxref{String Functions}).

@cindex variables @subentry built-in
@cindex variables @subentry initializing
A few variables have special built-in meanings, such as @code{FS} (the
field separator) and @code{NF} (the number of fields in the current input
record). @xref{Built-in Variables} for a list of the predefined variables.
These predefined variables can be used and assigned just like all other
variables, but their values are also used or changed automatically by
@command{awk}. All predefined variables' names are entirely uppercase.

Variables in @command{awk} can be assigned either numeric or string values.
The kind of value a variable holds can change over the life of a program.
By default, variables are initialized to the empty string, which
is zero if converted to a number. There is no need to explicitly
initialize a variable in @command{awk},
which is what you would do in C and in most other traditional languages.
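
As a small sketch of the default initialization just described, the
following uses two variables that are never assigned before their first use.
The names @code{uninit_demo}, @code{label}, and @code{total} are
illustrative only:

```shell
# Illustrative wrapper (hypothetical name) around a plain awk program.
uninit_demo() {
    awk 'BEGIN {
        # label and total are used here without any prior assignment
        print "[" label "]", total + 0, length(label)
        total += 5      # the empty initial value acts as zero
        print total
    }'
}

uninit_demo
```

The first line of output shows an empty string between the brackets, a
numeric value of zero, and a length of zero; the second line shows that
@code{total} can be accumulated into right away.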

@node Assignment Options
@subsubsection Assigning Variables on the Command Line
@cindex variables @subentry assigning on command line
@cindex command line @subentry variables, assigning on

Any @command{awk} variable can be set by including a @dfn{variable assignment}
among the arguments on the command line when @command{awk} is invoked
(@pxref{Other Arguments}).
Such an assignment has the following form:

@example
@var{variable}=@var{text}
@end example

@cindex @option{-v} option
@noindent
With it, a variable is set either at the beginning of the
@command{awk} run or in between input files.
When the assignment is preceded with the @option{-v} option,
as in the following:

@example
-v @var{variable}=@var{text}
@end example

@noindent
the variable is set at the very beginning, even before the
@code{BEGIN} rules execute. The @option{-v} option and its assignment
must precede all the @value{FN} arguments, as well as the program text.
(@xref{Options} for more information about
the @option{-v} option.)
Otherwise, the variable assignment is performed at a time determined by
its position among the input file arguments---after the processing of the
preceding input file argument. For example:

@example
awk '@{ print $n @}' n=4 inventory-shipped n=2 mail-list
@end example

@noindent
prints the value of field number @code{n} for all input records. Before
the first file is read, the command line sets the variable @code{n}
equal to four. This causes the fourth field to be printed in lines from
@file{inventory-shipped}.
After the first file has finished,
but before the second file is started, @code{n} is set to two, so that the
second field is printed in lines from @file{mail-list}:

@example
$ @kbd{awk '@{ print $n @}' n=4 inventory-shipped n=2 mail-list}
@print{} 15
@print{} 24
@dots{}
@print{} 555-5553
@print{} 555-3412
@dots{}
@end example

@cindex dark corner @subentry command-line arguments
Command-line arguments are made available for explicit examination by
the @command{awk} program in the @code{ARGV} array
(@pxref{ARGC and ARGV}).
@command{awk} processes the values of command-line assignments for escape
sequences
(@pxref{Escape Sequences}).
@value{DARKCORNER}

Normally, variables assigned on the command line (with or without the
@option{-v} option) are treated as strings. When such variables are
used as numbers, @command{awk}'s normal automatic conversion of strings
to numbers takes place, and everything ``just works.''

However, @command{gawk} supports variables whose types are ``regexp''.
You can assign variables of this type using the following syntax:

@example
gawk -v 're1=@@/foo|bar/' '@dots{}' /path/to/file1 're2=@@/baz|quux/' /path/to/file2
@end example

@noindent
Strongly typed regexps are an advanced feature (@pxref{Strong Regexp Constants}).
We mention them here only for completeness.

@node Conversion
@subsection Conversion of Strings and Numbers

Number-to-string and string-to-number conversion are generally
straightforward. There can be subtleties to be aware of;
this @value{SECTION} discusses this important facet of @command{awk}.

@menu
* Strings And Numbers::         How @command{awk} Converts Between Strings And
                                Numbers.
* Locale influences conversions:: How the locale may affect conversions.
@end menu

@node Strings And Numbers
@subsubsection How @command{awk} Converts Between Strings and Numbers

@cindex converting @subentry string to numbers
@cindex strings @subentry converting
@cindex numbers @subentry converting
@cindex converting @subentry numbers to strings
Strings are converted to numbers and numbers are converted to strings, if the context
of the @command{awk} program demands it. For example, if the value of
either @code{foo} or @code{bar} in the expression @samp{foo + bar}
happens to be a string, it is converted to a number before the addition
is performed. If numeric values appear in string concatenation, they
are converted to strings. Consider the following:

@example
@group
two = 2; three = 3
print (two three) + 4
@end group
@end example

@noindent
This prints the (numeric) value 27. The numeric values of
the variables @code{two} and @code{three} are converted to strings and
concatenated together. The resulting string is converted back to the
number 23, to which 4 is then added.

@cindex null strings @subentry converting numbers to strings
@cindex type @subentry conversion
If, for some reason, you need to force a number to be converted to a
string, concatenate that number with the empty string, @code{""}.
To force a string to be converted to a number, add zero to that string.
A string is converted to a number by interpreting any numeric prefix
of the string as numerals:
@code{"2.5"} converts to 2.5, @code{"1e3"} converts to 1,000, and @code{"25fix"}
has a numeric value of 25.
Strings that can't be interpreted as valid numbers convert to zero.
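
The conversions described above can be tried out with a short program such
as the following sketch. The shell function name @code{conversion_demo} and
the variables @code{n} and @code{s} are illustrative only:

```shell
# Illustrative wrapper (hypothetical name) around a plain awk program.
conversion_demo() {
    awk 'BEGIN {
        two = 2; three = 3
        print (two three) + 4       # concatenate to "23", then add 4

        # Adding zero forces string-to-number conversion.
        print "2.5" + 0, "1e3" + 0, "25fix" + 0, "fix" + 0

        n = 12
        s = n ""                    # concatenating "" forces a string
        print length(s)
    }'
}

conversion_demo
```

The three lines of output are @samp{27}, then @samp{2.5 1000 25 0}, and
finally @samp{2}, the length of the string @code{"12"}.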

@cindex @code{CONVFMT} variable
The exact manner in which numbers are converted into strings is controlled
by the @command{awk} predefined variable @code{CONVFMT} (@pxref{Built-in Variables}).
Numbers are converted using the @code{sprintf()} function
with @code{CONVFMT} as the format
specifier
(@pxref{String Functions}).

@code{CONVFMT}'s default value is @code{"%.6g"}, which creates a value with
at most six significant digits. For some applications, you might want to
change it to specify more precision.
On most modern machines,
17 digits is usually enough to capture a floating-point number's
value exactly.@footnote{Pathological cases can require up to
752 digits (!), but we doubt that you need to worry about this.}

@cindex dark corner @subentry @code{CONVFMT} variable
Strange results can occur if you set @code{CONVFMT} to a string that doesn't
tell @code{sprintf()} how to format floating-point numbers in a useful way.
For example, if you forget the @samp{%} in the format, @command{awk} converts
all numbers to the same constant string.

As a special case, if a number is an integer, then the result of converting
it to a string is @emph{always} an integer, no matter what the value of
@code{CONVFMT} may be. Given the following code fragment:

@example
CONVFMT = "%2.2f"
a = 12
b = a ""
@end example

@noindent
@code{b} has the value @code{"12"}, not @code{"12.00"}.
@value{DARKCORNER}

@sidebar Pre-POSIX @command{awk} Used @code{OFMT} for String Conversion
@cindex POSIX @command{awk} @subentry @code{OFMT} variable and
@cindex @code{OFMT} variable
@cindex portability @subentry new @command{awk} vs.@: old @command{awk}
@cindex @command{awk} @subentry new vs.@: old @subentry @code{OFMT} variable
Prior to the POSIX standard, @command{awk} used the value
of @code{OFMT} for converting numbers to strings. @code{OFMT}
specifies the output format to use when printing numbers with @code{print}.
@code{CONVFMT} was introduced in order to separate the semantics of
conversion from the semantics of printing. Both @code{CONVFMT} and
@code{OFMT} have the same default value: @code{"%.6g"}. In the vast majority
of cases, old @command{awk} programs do not change their behavior.
@xref{Print} for more information on the @code{print} statement.
@end sidebar

@node Locale influences conversions
@subsubsection Locales Can Influence Conversion

Where you are can matter when it comes to converting between numbers and
strings. The local character set and language---the @dfn{locale}---can
affect numeric formats. In particular, for @command{awk} programs,
it affects the decimal point character and the thousands-separator
character. The @code{"C"} locale, and most English-language locales,
use the period character (@samp{.}) as the decimal point and don't
have a thousands separator. However, many (if not most) European and
non-English locales use the comma (@samp{,}) as the decimal point
character. European locales often use either a space or a period as
the thousands separator, if they have one.

@cindex dark corner @subentry locale's decimal point character
The POSIX standard says that @command{awk} always uses the period as the decimal
point when reading the @command{awk} program source code, and for
command-line variable assignments (@pxref{Other Arguments}). However,
when interpreting input data, for @code{print} and @code{printf} output,
and for number-to-string conversion, the local decimal point character
is used. @value{DARKCORNER} In all cases, numbers in source code and
in input data cannot have a thousands separator. Here are some examples
indicating the difference in behavior, on a GNU/Linux system:

@example
$ @kbd{export POSIXLY_CORRECT=1}                        @ii{Force POSIX behavior}
$ @kbd{gawk 'BEGIN @{ printf "%g\n", 3.1415927 @}'}
@print{} 3.14159
$ @kbd{LC_ALL=en_DK.utf-8 gawk 'BEGIN @{ printf "%g\n", 3.1415927 @}'}
@print{} 3,14159
$ @kbd{echo 4,321 | gawk '@{ print $1 + 1 @}'}
@print{} 5
$ @kbd{echo 4,321 | LC_ALL=en_DK.utf-8 gawk '@{ print $1 + 1 @}'}
@print{} 5,321
@end example

@noindent
The @code{en_DK.utf-8} locale is for English in Denmark, where the comma acts as
the decimal point separator. In the normal @code{"C"} locale, @command{gawk}
treats @samp{4,321} as 4, while in the Danish locale, it's treated
as the full number including the fractional part, 4.321.

@cindex POSIX mode
Some earlier versions of @command{gawk} fully complied with this aspect
of the standard. However, many users in non-English locales complained
about this behavior, because their data used a period as the decimal
point, so the default behavior was restored to use a period as the
decimal point character. You can use the @option{--use-lc-numeric}
option (@pxref{Options}) to force @command{gawk} to use the locale's
decimal point character.
(@command{gawk} also uses the locale's decimal
point character when in POSIX mode, either via @option{--posix} or the
@env{POSIXLY_CORRECT} environment variable, as shown previously.)

@ref{table-locale-affects} describes the cases in which the locale's decimal
point character is used and when a period is used. Some of these
features have not been described yet.

@float Table,table-locale-affects
@caption{Locale decimal point versus a period}
@multitable @columnfractions .15 .20 .45
@headitem Feature @tab Default @tab @option{--posix} or @option{--use-lc-numeric}
@item @code{%'g} @tab Use locale @tab Use locale
@item @code{%g} @tab Use period @tab Use locale
@item Input @tab Use period @tab Use locale
@item @code{strtonum()} @tab Use period @tab Use locale
@end multitable
@end float

Finally, modern-day formal standards and the IEEE standard floating-point
representation can have an unusual but important effect on the way
@command{gawk} converts some special string values to numbers. The details
are presented in @ref{POSIX Floating Point Problems}.

@node All Operators
@section Operators: Doing Something with Values

This @value{SECTION} introduces the @dfn{operators} that make use
of the values provided by constants and variables.

@menu
* Arithmetic Ops::              Arithmetic operations (@samp{+}, @samp{-},
                                etc.)
* Concatenation::               Concatenating strings.
* Assignment Ops::              Changing the value of a variable or a field.
* Increment Ops::               Incrementing the numeric value of a variable.
@end menu

@node Arithmetic Ops
@subsection Arithmetic Operators
@cindex arithmetic operators
@cindex operators @subentry arithmetic
@c @cindex addition
@c @cindex subtraction
@c @cindex multiplication
@c @cindex division
@c @cindex remainder
@c @cindex quotient
@c @cindex exponentiation

The @command{awk} language uses the common arithmetic operators when
evaluating expressions. All of these arithmetic operators follow normal
precedence rules and work as you would expect them to.

The following example uses a file named @file{grades}, which contains
a list of student names as well as three test scores per student (it's
a small class):

@example
Pat   100 97 58
Sandy  84 72 93
Chris  72 92 89
@end example

@noindent
This program takes the file @file{grades} and prints the average
of the scores:

@example
$ @kbd{awk '@{ sum = $2 + $3 + $4 ; avg = sum / 3}
> @kbd{print $1, avg @}' grades}
@print{} Pat 85
@print{} Sandy 83
@print{} Chris 84.3333
@end example

The following list provides the arithmetic operators in @command{awk},
in order from the highest precedence to the lowest:

@table @code
@cindex common extensions @subentry @code{**} operator
@cindex extensions @subentry common @subentry @code{**} operator
@cindex POSIX @command{awk} @subentry arithmetic operators and
@item @var{x} ^ @var{y}
@itemx @var{x} ** @var{y}
Exponentiation; @var{x} raised to the @var{y} power. @samp{2 ^ 3} has
the value eight; the character sequence @samp{**} is equivalent to
@samp{^}. @value{COMMONEXT}

@item - @var{x}
Negation.

@item + @var{x}
Unary plus; the expression is converted to a number.

@item @var{x} * @var{y}
Multiplication.

@cindex troubleshooting @subentry division
@cindex division
@item @var{x} / @var{y}
Division; because all numbers in @command{awk} are floating-point
numbers, the result is @emph{not} rounded to an integer---@samp{3 / 4} has
the value 0.75. (It is a common mistake, especially for C programmers,
to forget that @emph{all} numbers in @command{awk} are floating point,
and that division of integer-looking constants produces a real number,
not an integer.)

@item @var{x} % @var{y}
Remainder; further discussion is provided in the text, just
after this list.

@item @var{x} + @var{y}
Addition.

@item @var{x} - @var{y}
Subtraction.
@end table

Unary plus and minus have the same precedence,
the multiplicative operators (@samp{*}, @samp{/}, and @samp{%}) all have
the same precedence, and
addition and subtraction have the same precedence.

@cindex differences in @command{awk} and @command{gawk} @subentry trunc-mod operation
@cindex trunc-mod operation
When computing the remainder of @samp{@var{x} % @var{y}},
the quotient is rounded toward zero to an integer and
multiplied by @var{y}. This result is subtracted from @var{x};
this operation is sometimes known as ``trunc-mod.'' The following
relation always holds:

@example
b * int(a / b) + (a % b) == a
@end example

One possibly undesirable effect of this definition of remainder is that
@samp{@var{x} % @var{y}} is negative if @var{x} is negative. Thus:

@example
-17 % 8 = -1
@end example

@noindent
This definition is compliant with the POSIX standard, which says that the @code{%}
operator produces results equivalent to using the standard C
@code{fmod()} function, and that function in turn works as just
described.
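
The trunc-mod relation above can be checked directly. The following is a
small sketch using the same values as the example; the shell function name
@code{truncmod_demo} is illustrative only:

```shell
# Illustrative wrapper (hypothetical name) around a plain awk program.
truncmod_demo() {
    awk 'BEGIN {
        a = -17; b = 8
        print a % b                         # remainder takes the sign of a
        print b * int(a / b) + (a % b)      # reconstructs a
    }'
}

truncmod_demo
```

Here @code{int(a / b)} is @minus{}2, so the program prints @samp{-1} and
then @samp{-17}.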

In other @command{awk} implementations, the signedness of the remainder
may be machine-dependent.

@cindex portability @subentry @code{**} operator and
@cindex @code{*} (asterisk) @subentry @code{**} operator
@cindex asterisk (@code{*}) @subentry @code{**} operator
@quotation NOTE
The POSIX standard only specifies the use of @samp{^}
for exponentiation.
For maximum portability, do not use the @samp{**} operator.
@end quotation

@node Concatenation
@subsection String Concatenation
@cindex Kernighan, Brian @subentry quotes
@quotation
@i{It seemed like a good idea at the time.}
@author Brian Kernighan
@end quotation

@cindex string @subentry operators
@cindex operators @subentry string
@cindex concatenating
There is only one string operation: concatenation. It does not have a
specific operator to represent it. Instead, concatenation is performed by
writing expressions next to one another, with no operator. For example:

@example
$ @kbd{awk '@{ print "Field number one: " $1 @}' mail-list}
@print{} Field number one: Amelia
@print{} Field number one: Anthony
@dots{}
@end example

Without the space in the string constant after the @samp{:}, the line
runs together. For example:

@example
$ @kbd{awk '@{ print "Field number one:" $1 @}' mail-list}
@print{} Field number one:Amelia
@print{} Field number one:Anthony
@dots{}
@end example

@cindex troubleshooting @subentry string concatenation
Because string concatenation does not have an explicit operator, it is
often necessary to ensure that it happens at the right time by using
parentheses to enclose the items to concatenate.
For example,
you might expect that the
following code fragment concatenates @code{file} and @code{name}:

@example
file = "file"
name = "name"
print "something meaningful" > file name
@end example

@cindex Brian Kernighan's @command{awk}
@cindex @command{mawk} utility
@noindent
This produces a syntax error with some versions of Unix
@command{awk}.@footnote{It happens that BWK
@command{awk}, @command{gawk}, and @command{mawk} all ``get it right,''
but you should not rely on this.}
It is necessary to use the following:

@example
print "something meaningful" > (file name)
@end example

@cindex order of evaluation, concatenation
@cindex evaluation order @subentry concatenation
@cindex side effects
Parentheses should be used around concatenation in all but the
most common contexts, such as on the righthand side of @samp{=}.
Be careful about the kinds of expressions used in string concatenation.
In particular, the order of evaluation of expressions used for concatenation
is undefined in the @command{awk} language. Consider this example:

@example
BEGIN @{
    a = "don't"
    print (a " " (a = "panic"))
@}
@end example

@noindent
It is not defined whether the second assignment to @code{a} happens
before or after the value of @code{a} is retrieved for producing the
concatenated value. The result could be either @samp{don't panic},
or @samp{panic panic}.
@c see test/nasty.awk for a worse example

The precedence of concatenation, when mixed with other operators, is often
counter-intuitive.
Consider this example:

@ignore
> To: bug-gnu-utils@@gnu.org
> CC: arnold@@gnu.org
> Subject: gawk 3.0.4 bug with {print -12 " " -24}
> From: Russell Schulz <Russell_Schulz@locutus.ofB.ORG>
> Date: Tue, 8 Feb 2000 19:56:08 -0700
>
> gawk 3.0.4 on NT gives me:
>
> prompt> cat bad.awk
> BEGIN { print -12 " " -24; }
>
> prompt> gawk -f bad.awk
> -12-24
>
> when I would expect
>
> -12 -24
>
> I have not investigated the source, or other implementations. The
> bug is there on my NT and DOS versions 2.15.6 .
@end ignore

@example
$ @kbd{awk 'BEGIN @{ print -12 " " -24 @}'}
@print{} -12-24
@end example

This ``obviously'' is concatenating @minus{}12, a space, and @minus{}24.
But where did the space disappear to?
The answer lies in the combination of operator precedences and
@command{awk}'s automatic conversion rules. To get the desired result,
write the program this way:

@example
$ @kbd{awk 'BEGIN @{ print -12 " " (-24) @}'}
@print{} -12 -24
@end example

This forces @command{awk} to treat the @samp{-} on the @samp{-24} as unary.
Otherwise, it's parsed as follows:

@display
    @minus{}12 (@code{"@ "} @minus{} 24)
@result{} @minus{}12 (0 @minus{} 24)
@result{} @minus{}12 (@minus{}24)
@result{} @minus{}12@minus{}24
@end display

As mentioned earlier,
when mixing concatenation with other operators, @emph{parenthesize}. Otherwise,
you're never quite sure what you'll get.
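
The same precedence rule applies to addition. The following sketch shows
that arithmetic binds more tightly than concatenation, and how parentheses
change the grouping. The shell function name @code{concat_precedence_demo}
is illustrative only:

```shell
# Illustrative wrapper (hypothetical name) around a plain awk program.
concat_precedence_demo() {
    awk 'BEGIN {
        # Parsed as "total: " (1 + 2): the addition happens first.
        print "total: " 1 + 2

        # Forcing the concatenation first yields the string "total: 1",
        # whose numeric value is zero, so the sum is 2.
        print ("total: " 1) + 2
    }'
}

concat_precedence_demo
```

The output is @samp{total: 3} followed by @samp{2}. The second result is
rarely what anyone wants, which is one more reason to parenthesize
explicitly.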

@node Assignment Ops
@subsection Assignment Expressions
@cindex assignment operators
@cindex operators @subentry assignment
@cindex expressions @subentry assignment
@cindex @code{=} (equals sign) @subentry @code{=} operator
@cindex equals sign (@code{=}) @subentry @code{=} operator
An @dfn{assignment} is an expression that stores a (usually different)
value into a variable. For example, let's assign the value one to the variable
@code{z}:

@example
z = 1
@end example

After this expression is executed, the variable @code{z} has the value one.
Whatever old value @code{z} had before the assignment is forgotten.

Assignments can also store string values. For example, the
following stores
the value @code{"this food is good"} in the variable @code{message}:

@example
thing = "food"
predicate = "good"
message = "this " thing " is " predicate
@end example

@noindent
@cindex side effects @subentry assignment expressions
This also illustrates string concatenation.
The @samp{=} sign is called an @dfn{assignment operator}. It is the
simplest assignment operator because the value of the righthand
operand is stored unchanged.
Most operators (addition, concatenation, and so on) have no effect
except to compute a value. If the value isn't used, there's no reason to
use the operator. An assignment operator is different; it does
produce a value, but even if you ignore it, the assignment still
makes itself felt through the alteration of the variable. We call this
a @dfn{side effect}.

@cindex lvalues/rvalues
@cindex rvalues/lvalues
@cindex assignment operators @subentry lvalues/rvalues
@cindex operators @subentry assignment
The lefthand operand of an assignment need not be a variable
(@pxref{Variables}); it can also be a field
(@pxref{Changing Fields}) or
an array element (@pxref{Arrays}).
These are all called @dfn{lvalues},
which means they can appear on the lefthand side of an assignment operator.
The righthand operand may be any expression; it produces the new value
that the assignment stores in the specified variable, field, or array
element. (Such values are called @dfn{rvalues}.)

@cindex variables @subentry types of
It is important to note that variables do @emph{not} have permanent types.
A variable's type is simply the type of whatever value was last assigned
to it. In the following program fragment, the variable
@code{foo} has a numeric value at first, and a string value later on:

@example
@group
foo = 1
print foo
@end group
@group
foo = "bar"
print foo
@end group
@end example

@noindent
When the second assignment gives @code{foo} a string value, the fact that
it previously had a numeric value is forgotten.

String values that do not begin with a numeric prefix
(@pxref{Strings And Numbers}) have a numeric value of
zero. After executing the following code, the value of @code{foo} is five:

@example
foo = "a string"
foo = foo + 5
@end example

@quotation NOTE
Using a variable as a number and then later as a string
can be confusing and is poor programming style. The previous two examples
illustrate how @command{awk} works, @emph{not} how you should write your
programs!
@end quotation

An assignment is an expression, so it has a value---the same value that
is assigned.
Thus, @samp{z = 1} is an expression with the value one.
One consequence of this is that you can write multiple assignments together,
such as:

@example
x = y = z = 5
@end example

@noindent
This example stores the value five in all three variables
(@code{x}, @code{y}, and @code{z}).
It does so because the
value of @samp{z = 5}, which is five, is stored into @code{y} and then
the value of @samp{y = z = 5}, which is five, is stored into @code{x}.

Assignments may be used anywhere an expression is called for. For
example, it is valid to write @samp{x != (y = 1)} to set @code{y} to one,
and then test whether @code{x} equals one. But this style tends to make
programs hard to read; such nesting of assignments should be avoided,
except perhaps in a one-shot program.

@cindex @code{+} (plus sign) @subentry @code{+=} operator
@cindex plus sign (@code{+}) @subentry @code{+=} operator
Aside from @samp{=}, there are several other assignment operators that
do arithmetic with the old value of the variable. For example, the
operator @samp{+=} computes a new value by adding the righthand value
to the old value of the variable. Thus, the following assignment adds
five to the value of @code{foo}:

@example
foo += 5
@end example

@noindent
This is equivalent to the following:

@example
foo = foo + 5
@end example

@noindent
Use whichever makes the meaning of your program clearer.

There are situations where using @samp{+=} (or any assignment operator)
is @emph{not} the same as simply repeating the lefthand operand in the
righthand expression.
For example: 12117 12118@cindex Rankin, Pat 12119@example 12120@group 12121# Thanks to Pat Rankin for this example 12122BEGIN @{ 12123 foo[rand()] += 5 12124 for (x in foo) 12125 print x, foo[x] 12126@end group 12127 12128@group 12129 bar[rand()] = bar[rand()] + 5 12130 for (x in bar) 12131 print x, bar[x] 12132@} 12133@end group 12134@end example 12135 12136@cindex operators @subentry assignment @subentry evaluation order 12137@cindex assignment operators @subentry evaluation order 12138@noindent 12139The indices of @code{bar} are practically guaranteed to be different, because 12140@code{rand()} returns different values each time it is called. 12141(Arrays and the @code{rand()} function haven't been covered yet. 12142@xref{Arrays}, 12143and 12144@ifnotdocbook 12145@pxref{Numeric Functions} 12146@end ifnotdocbook 12147@ifdocbook 12148@ref{Numeric Functions} 12149@end ifdocbook 12150for more information.) 12151This example illustrates an important fact about assignment 12152operators: the lefthand expression is only evaluated @emph{once}. 12153 12154It is up to the implementation as to which expression is evaluated 12155first, the lefthand or the righthand. 12156Consider this example: 12157 12158@example 12159i = 1 12160a[i += 2] = i + 1 12161@end example 12162 12163@noindent 12164The value of @code{a[3]} could be either two or four. 12165 12166@ref{table-assign-ops} lists the arithmetic assignment operators. In each 12167case, the righthand operand is an expression whose value is converted 12168to a number. 
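The operators listed in @ref{table-assign-ops}, and the rule that the lefthand expression is evaluated only once, can be tried out with a short sketch such as the following. It is an illustration only: the function @code{bump()}, the counter @code{calls}, and the shell variable @code{assign_demo} are hypothetical names chosen for this example, and the @samp{**=} operator is deliberately left out for portability.

```shell
#!/bin/sh
# Sketch only: bump(), calls, and assign_demo are hypothetical names.
assign_demo=$(awk '
function bump() { return ++calls }    # counts how often a subscript is evaluated
BEGIN {
    x = 10
    x += 5      # 15
    x -= 3      # 12
    x *= 2      # 24
    x /= 4      # 6
    x %= 4      # 2
    x ^= 3      # 8
    print "x =", x

    a[bump()] += 5                 # subscript expression evaluated once
    print "calls after += :", calls
    b[bump()] = b[bump()] + 5      # subscript expression evaluated twice
    print "calls after = :", calls
}')
printf '%s\n' "$assign_demo"
```

The counter makes the difference visible without relying on @code{rand()}: the compound assignment calls @code{bump()} once, whereas the spelled-out form calls it twice, no matter which side the implementation evaluates first.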
12169 12170@cindex @code{-} (hyphen) @subentry @code{-=} operator 12171@cindex hyphen (@code{-}) @subentry @code{-=} operator 12172@cindex @code{*} (asterisk) @subentry @code{*=} operator 12173@cindex asterisk (@code{*}) @subentry @code{*=} operator 12174@cindex @code{/} (forward slash) @subentry @code{/=} operator 12175@cindex forward slash (@code{/}) @subentry @code{/=} operator 12176@cindex @code{%} (percent sign) @subentry @code{%=} operator 12177@cindex percent sign (@code{%}) @subentry @code{%=} operator 12178@cindex @code{^} (caret) @subentry @code{^=} operator 12179@cindex caret (@code{^}) @subentry @code{^=} operator 12180@cindex @code{*} (asterisk) @subentry @code{**=} operator 12181@cindex asterisk (@code{*}) @subentry @code{**=} operator 12182@float Table,table-assign-ops 12183@caption{Arithmetic assignment operators} 12184@multitable @columnfractions .30 .70 12185@headitem Operator @tab Effect 12186@item @var{lvalue} @code{+=} @var{increment} @tab Add @var{increment} to the value of @var{lvalue}. 12187@item @var{lvalue} @code{-=} @var{decrement} @tab Subtract @var{decrement} from the value of @var{lvalue}. 12188@item @var{lvalue} @code{*=} @var{coefficient} @tab Multiply the value of @var{lvalue} by @var{coefficient}. 12189@item @var{lvalue} @code{/=} @var{divisor} @tab Divide the value of @var{lvalue} by @var{divisor}. 12190@item @var{lvalue} @code{%=} @var{modulus} @tab Set @var{lvalue} to its remainder by @var{modulus}. 12191@cindex common extensions @subentry @code{**=} operator 12192@cindex extensions @subentry common @subentry @code{**=} operator 12193@cindex @command{awk} @subentry language, POSIX version 12194@cindex POSIX @command{awk} 12195@item @var{lvalue} @code{^=} @var{power} @tab Raise @var{lvalue} to the power @var{power}. 12196@item @var{lvalue} @code{**=} @var{power} @tab Raise @var{lvalue} to the power @var{power}. 
@value{COMMONEXT} 12197@end multitable 12198@end float 12199 12200@cindex POSIX @command{awk} @subentry @code{**=} operator and 12201@cindex portability @subentry @code{**=} operator and 12202@quotation NOTE 12203Only the @samp{^=} operator is specified by POSIX. 12204For maximum portability, do not use the @samp{**=} operator. 12205@end quotation 12206 12207@sidebar Syntactic Ambiguities Between @samp{/=} and Regular Expressions 12208@cindex dark corner @subentry regexp constants @subentry @code{/=} operator and 12209@cindex @code{/} (forward slash) @subentry @code{/=} operator @subentry vs.@: @code{/=@dots{}/} regexp constant 12210@cindex forward slash (@code{/}) @subentry @code{/=} operator @subentry vs.@: @code{/=@dots{}/} regexp constant 12211@cindex regexp constants @subentry @code{/=@dots{}/} @subentry @code{/=} operator and 12212 12213@c derived from email from "Nelson H. F. Beebe" <beebe@math.utah.edu> 12214@c Date: Mon, 1 Sep 1997 13:38:35 -0600 (MDT) 12215 12216@cindex dark corner @subentry @code{/=} operator vs.@: @code{/=@dots{}/} regexp constant 12217@cindex ambiguity, syntactic: @code{/=} operator vs.@: @code{/=@dots{}/} regexp constant 12218@cindex syntactic ambiguity: @code{/=} operator vs.@: @code{/=@dots{}/} regexp constant 12219@cindex @code{/=} operator vs.@: @code{/=@dots{}/} regexp constant 12220There is a syntactic ambiguity between the @code{/=} assignment 12221operator and regexp constants whose first character is an @samp{=}. 12222@value{DARKCORNER} 12223This is most notable in some commercial @command{awk} versions. 
12224For example: 12225 12226@example 12227$ @kbd{awk /==/ /dev/null} 12228@error{} awk: syntax error at source line 1 12229@error{} context is 12230@error{} >>> /= <<< 12231@error{} awk: bailing out at source line 1 12232@end example 12233 12234@noindent 12235A workaround is: 12236 12237@example 12238awk '/[=]=/' /dev/null 12239@end example 12240 12241@command{gawk} does not have this problem; BWK @command{awk} 12242and @command{mawk} also do not. 12243@end sidebar 12244 12245@node Increment Ops 12246@subsection Increment and Decrement Operators 12247 12248@cindex increment operators 12249@cindex operators @subentry decrement/increment 12250@dfn{Increment} and @dfn{decrement operators} increase or decrease the value of 12251a variable by one. An assignment operator can do the same thing, so 12252the increment operators add no power to the @command{awk} language; however, they 12253are convenient abbreviations for very common operations. 12254 12255@cindex side effects 12256@cindex @code{+} (plus sign) @subentry @code{++} operator 12257@cindex plus sign (@code{+}) @subentry @code{++} operator 12258@cindex side effects @subentry decrement/increment operators 12259The operator used for adding one is written @samp{++}. It can be used to increment 12260a variable either before or after taking its value. 12261To @dfn{pre-increment} a variable @code{v}, write @samp{++v}. This adds 12262one to the value of @code{v}---that new value is also the value of the 12263expression. (The assignment expression @samp{v += 1} is completely equivalent.) 12264Writing the @samp{++} after the variable specifies @dfn{post-increment}. This 12265increments the variable value just the same; the difference is that the 12266value of the increment expression itself is the variable's @emph{old} 12267value. Thus, if @code{foo} has the value four, then the expression @samp{foo++} 12268has the value four, but it changes the value of @code{foo} to five. 
12269In other words, the operator returns the old value of the variable, 12270but with the side effect of incrementing it. 12271 12272The post-increment @samp{foo++} is nearly the same as writing @samp{(foo 12273+= 1) - 1}. It is not perfectly equivalent because all numbers in 12274@command{awk} are floating point---in floating point, @samp{foo + 1 - 1} does 12275not necessarily equal @code{foo}. But the difference is minute as 12276long as you stick to numbers that are fairly small (less than 12277@iftex 12278@math{10^{12}}). 12279@end iftex 12280@ifinfo 122811e12). 12282@end ifinfo 12283@ifnottex 12284@ifnotinfo 1228510@sup{12}). 12286@end ifnotinfo 12287@end ifnottex 12288 12289@cindex @code{$} (dollar sign) @subentry incrementing fields and arrays 12290@cindex dollar sign (@code{$}) @subentry incrementing fields and arrays 12291Fields and array elements are incremented 12292just like variables. (Use @samp{$(i++)} when you want to do a field reference 12293and a variable increment at the same time. The parentheses are necessary 12294because of the precedence of the field reference operator @samp{$}.) 12295 12296@cindex decrement operators 12297The decrement operator @samp{--} works just like @samp{++}, except that 12298it subtracts one instead of adding it. As with @samp{++}, it can be used before 12299the lvalue to pre-decrement or after it to post-decrement. 12300Following is a summary of increment and decrement expressions: 12301 12302@table @code 12303@cindex @code{+} (plus sign) @subentry @code{++} operator 12304@cindex plus sign (@code{+}) @subentry @code{++} operator 12305@item ++@var{lvalue} 12306Increment @var{lvalue}, returning the new value as the 12307value of the expression. 12308 12309@item @var{lvalue}++ 12310Increment @var{lvalue}, returning the @emph{old} value of @var{lvalue} 12311as the value of the expression. 
12312 12313@cindex @code{-} (hyphen) @subentry @code{--} operator 12314@cindex hyphen (@code{-}) @subentry @code{--} operator 12315@item --@var{lvalue} 12316Decrement @var{lvalue}, returning the new value as the 12317value of the expression. 12318(This expression is 12319like @samp{++@var{lvalue}}, but instead of adding, it subtracts.) 12320 12321@item @var{lvalue}-- 12322Decrement @var{lvalue}, returning the @emph{old} value of @var{lvalue} 12323as the value of the expression. 12324(This expression is 12325like @samp{@var{lvalue}++}, but instead of adding, it subtracts.) 12326@end table 12327 12328@sidebar Operator Evaluation Order 12329@cindex precedence 12330@cindex operators @subentry precedence of 12331@cindex portability @subentry operators 12332@cindex evaluation order 12333@cindex Marx, Groucho 12334@quotation 12335@i{Doctor, it hurts when I do this!@* 12336Then don't do that!} 12337@author Groucho Marx 12338@end quotation 12339 12340@noindent 12341What happens for something like the following? 12342 12343@example 12344b = 6 12345print b += b++ 12346@end example 12347 12348@noindent 12349Or something even stranger? 12350 12351@example 12352b = 6 12353b += ++b + b++ 12354print b 12355@end example 12356 12357@cindex side effects 12358In other words, when do the various side effects prescribed by the 12359postfix operators (@samp{b++}) take effect? 12360When side effects happen is @dfn{implementation-defined}. 12361In other words, it is up to the particular version of @command{awk}. 12362The result for the first example may be 12 or 13, and for the second, it 12363may be 22 or 23. 12364 12365In short, doing things like this is not recommended and definitely 12366not anything that you can rely upon for portability. 12367You should avoid such things in your own programs. 12368@c You'll sleep better at night and be able to look at yourself 12369@c in the mirror in the morning. 
12370@end sidebar 12371 12372@node Truth Values and Conditions 12373@section Truth Values and Conditions 12374 12375In certain contexts, expression values also serve as ``truth values''; i.e., 12376they determine what should happen next as the program runs. This 12377@value{SECTION} describes how @command{awk} defines ``true'' and ``false'' 12378and how values are compared. 12379 12380@menu 12381* Truth Values:: What is ``true'' and what is ``false''. 12382* Typing and Comparison:: How variables acquire types and how this 12383 affects comparison of numbers and strings with 12384 @samp{<}, etc. 12385* Boolean Ops:: Combining comparison expressions using boolean 12386 operators @samp{||} (``or''), @samp{&&} 12387 (``and'') and @samp{!} (``not''). 12388* Conditional Exp:: Conditional expressions select between two 12389 subexpressions under control of a third 12390 subexpression. 12391@end menu 12392 12393@node Truth Values 12394@subsection True and False in @command{awk} 12395@cindex truth values 12396@cindex logical false/true 12397@cindex false, logical 12398@cindex true, logical 12399 12400@cindex null strings 12401Many programming languages have a special representation for the concepts 12402of ``true'' and ``false.'' Such languages usually use the special 12403constants @code{true} and @code{false}, or perhaps their uppercase 12404equivalents. 12405However, @command{awk} is different. 12406It borrows a very simple concept of true and 12407false from C. In @command{awk}, any nonzero numeric value @emph{or} any 12408nonempty string value is true. Any other value (zero or the null 12409string, @code{""}) is false. 
The following program prints @samp{A strange 12410truth value} three times: 12411 12412@example 12413BEGIN @{ 12414 if (3.1415927) 12415 print "A strange truth value" 12416 if ("Four Score And Seven Years Ago") 12417 print "A strange truth value" 12418 if (j = 57) 12419 print "A strange truth value" 12420@} 12421@end example 12422 12423@cindex dark corner @subentry @code{"0"} is actually true 12424There is a surprising consequence of the ``nonzero or non-null'' rule: 12425the string constant @code{"0"} is actually true, because it is non-null. 12426@value{DARKCORNER} 12427 12428@node Typing and Comparison 12429@subsection Variable Typing and Comparison Expressions 12430@quotation 12431@i{The Guide is definitive. Reality is frequently inaccurate.} 12432@author Douglas Adams, @cite{The Hitchhiker's Guide to the Galaxy} 12433@end quotation 12434@c 2/2015: Antonio Colombo points out that this is really from 12435@c The Restaurant at the End of the Universe. But I'm going to 12436@c leave it alone. 12437 12438@cindex comparison expressions 12439@cindex expressions @subentry comparison 12440@cindex expressions, matching @seeentry{comparison expressions} 12441@cindex matching @subentry expressions @seeentry{comparison expressions} 12442@cindex relational operators @seeentry{comparison operators} 12443@cindex operators, relational @seeentry{operators, comparison} 12444@cindex variables @subentry types of @subentry comparison expressions and 12445Unlike in other programming languages, in @command{awk} variables do not have a 12446fixed type. Instead, they can be either a number or a string, depending 12447upon the value that is assigned to them. 12448We look now at how variables are typed, and how @command{awk} 12449compares variables. 12450 12451@menu 12452* Variable Typing:: String type versus numeric type. 12453* Comparison Operators:: The comparison operators. 12454* POSIX String Comparison:: String comparison with POSIX rules. 
12455@end menu 12456 12457@node Variable Typing 12458@subsubsection String Type versus Numeric Type 12459 12460Scalar objects in @command{awk} (variables, array elements, and fields) 12461are @emph{dynamically} typed. This means their type can change as the 12462program runs, from @dfn{untyped} before any use,@footnote{@command{gawk} 12463calls this @dfn{unassigned}, as the following example shows.} to string 12464or number, and then from string to number or number to string, as the 12465program progresses. (@command{gawk} also provides regexp-typed scalars, 12466but let's ignore that for now; @pxref{Strong Regexp Constants}.) 12467 12468You can't do much with untyped variables, other than tell that they 12469are untyped. The following program tests @code{a} against @code{""} 12470and @code{0}; the test succeeds when @code{a} has never been assigned 12471a value. It also uses the built-in @code{typeof()} function 12472(not presented yet; @pxref{Type Functions}) to show @code{a}'s type: 12473 12474@example 12475$ @kbd{gawk 'BEGIN @{ print (a == "" && a == 0 ?} 12476> @kbd{"a is untyped" : "a has a type!") ; print typeof(a) @}'} 12477@print{} a is untyped 12478@print{} unassigned 12479@end example 12480 12481A scalar has numeric type when assigned a numeric value, 12482such as from a numeric constant, or from another scalar 12483with numeric type: 12484 12485@example 12486$ @kbd{gawk 'BEGIN @{ a = 42 ; print typeof(a)} 12487> @kbd{b = a ; print typeof(b) @}'} 12488number 12489number 12490@end example 12491 12492Similarly, a scalar has string type when assigned a string 12493value, such as from a string constant, or from another scalar 12494with string type: 12495 12496@example 12497$ @kbd{gawk 'BEGIN @{ a = "forty two" ; print typeof(a)} 12498> @kbd{b = a ; print typeof(b) @}'} 12499string 12500string 12501@end example 12502 12503So far, this is all simple and straightforward. What happens, though, 12504when @command{awk} has to process data from a user? 
Let's start with 12505field data. What should the following command produce as output? 12506 12507@example 12508echo hello | awk '@{ printf("%s %s < 42\n", $1, 12509 ($1 < 42 ? "is" : "is not")) @}' 12510@end example 12511 12512@noindent 12513Since @samp{hello} is alphabetic data, @command{awk} can only do a string 12514comparison. Internally, it converts @code{42} into @code{"42"} and compares 12515the two string values @code{"hello"} and @code{"42"}. Here's the result: 12516 12517@example 12518$ @kbd{echo hello | awk '@{ printf("%s %s < 42\n", $1,} 12519> @kbd{ ($1 < 42 ? "is" : "is not")) @}'} 12520@print{} hello is not < 42 12521@end example 12522 12523However, what happens when data from a user @emph{looks like} a number? 12524On the one hand, in reality, the input data consists of characters, not 12525binary numeric 12526values. But, on the other hand, the data looks numeric, and @command{awk} 12527really ought to treat it as such. And indeed, it does: 12528 12529@example 12530$ @kbd{echo 37 | awk '@{ printf("%s %s < 42\n", $1,} 12531> @kbd{ ($1 < 42 ? "is" : "is not")) @}'} 12532@print{} 37 is < 42 12533@end example 12534 12535Here are the rules for when @command{awk} 12536treats data as a number, and for when it treats data as a string. 12537 12538@cindex numeric @subentry strings 12539@cindex strings @subentry numeric 12540@cindex POSIX @command{awk} @subentry numeric strings and 12541The POSIX standard uses the term @dfn{numeric string} for input data that 12542looks numeric. The @samp{37} in the previous example is a numeric string. 12543So what is the type of a numeric string? Answer: numeric. 12544 12545The type of a variable is important because the types of two variables 12546determine how they are compared. 12547Variable typing follows these definitions and rules: 12548 12549@itemize @value{BULLET} 12550@item 12551A numeric constant or the result of a numeric operation has the @dfn{numeric} 12552attribute. 
12553 12554@item 12555A string constant or the result of a string operation has the @dfn{string} 12556attribute. 12557 12558@item 12559Fields, @code{getline} input, @code{FILENAME}, @code{ARGV} elements, 12560@code{ENVIRON} elements, and the elements of an array created by 12561@code{match()}, @code{split()}, and @code{patsplit()} that are numeric 12562strings have the @dfn{strnum} attribute.@footnote{Thus, a POSIX 12563numeric string and @command{gawk}'s strnum are the same thing.} 12564Otherwise, they have 12565the @dfn{string} attribute. Uninitialized variables also have the 12566@dfn{strnum} attribute. 12567 12568@item 12569Attributes propagate across assignments but are not changed by 12570any use. 12571@c (Although a use may cause the entity to acquire an additional 12572@c value such that it has both a numeric and string value, this leaves the 12573@c attribute unchanged.) 12574@c This is important but not relevant 12575@end itemize 12576 12577The last rule is particularly important. In the following program, 12578@code{a} has numeric type, even though it is later used in a string 12579operation: 12580 12581@example 12582BEGIN @{ 12583 a = 12.345 12584 b = a " is a cute number" 12585 print b 12586@} 12587@end example 12588 12589When two operands are compared, either string comparison or numeric comparison 12590may be used. This depends upon the attributes of the operands, according to the 12591following symmetric matrix: 12592 12593@c thanks to Karl Berry, kb@cs.umb.edu, for major help with TeX tables 12594@tex 12595\centerline{ 12596\vbox{\bigskip % space above the table (about 1 linespace) 12597% Because we have vertical rules, we can't let TeX insert interline space 12598% in its usual way. 12599\offinterlineskip 12600% 12601% Define the table template. & separates columns, and \cr ends the 12602% template (and each row). # is replaced by the text of that entry on 12603% each row. 
The template for the first column breaks down like this: 12604% \strut -- a way to make each line have the height and depth 12605% of a normal line of type, since we turned off interline spacing. 12606% \hfil -- infinite glue; has the effect of right-justifying in this case. 12607% # -- replaced by the text (for instance, `STRNUM', in the last row). 12608% \quad -- about the width of an `M'. Just separates the columns. 12609% 12610% The second column (\vrule#) is what generates the vertical rule that 12611% spans table rows. 12612% 12613% The doubled && before the next entry means `repeat the following 12614% template as many times as necessary on each line' -- in our case, twice. 12615% 12616% The template itself, \quad#\hfil, left-justifies with a little space before. 12617% 12618\halign{\strut\hfil#\quad&\vrule#&&\quad#\hfil\cr 12619 &&STRING &NUMERIC &STRNUM\cr 12620% The \omit tells TeX to skip inserting the template for this column on 12621% this particular row. In this case, we only want a little extra space 12622% to separate the heading row from the rule below it. the depth 2pt -- 12623% `\vrule depth 2pt' is that little space. 12624\omit &depth 2pt\cr 12625% This is the horizontal rule below the heading. Since it has nothing to 12626% do with the columns of the table, we use \noalign to get it in there. 12627\noalign{\hrule} 12628% Like above, this time a little more space. 12629\omit &depth 4pt\cr 12630% The remaining rows have nothing special about them. 
12631STRING &&string &string &string\cr 12632NUMERIC &&string &numeric &numeric\cr 12633STRNUM &&string &numeric &numeric\cr 12634}}} 12635@end tex 12636@ifnottex 12637@ifnotdocbook 12638@verbatim 12639 +---------------------------------------------- 12640 | STRING NUMERIC STRNUM 12641--------+---------------------------------------------- 12642 | 12643STRING | string string string 12644 | 12645NUMERIC | string numeric numeric 12646 | 12647STRNUM | string numeric numeric 12648--------+---------------------------------------------- 12649@end verbatim 12650@end ifnotdocbook 12651@end ifnottex 12652@docbook 12653<informaltable> 12654<tgroup cols="4"> 12655<colspec colname="1" align="left"/> 12656<colspec colname="2" align="left"/> 12657<colspec colname="3" align="left"/> 12658<colspec colname="4" align="left"/> 12659<thead> 12660<row> 12661<entry/> 12662<entry>STRING</entry> 12663<entry>NUMERIC</entry> 12664<entry>STRNUM</entry> 12665</row> 12666</thead> 12667 12668<tbody> 12669<row> 12670<entry><emphasis role="bold">STRING</emphasis></entry> 12671<entry>string</entry> 12672<entry>string</entry> 12673<entry>string</entry> 12674</row> 12675 12676<row> 12677<entry><emphasis role="bold">NUMERIC</emphasis></entry> 12678<entry>string</entry> 12679<entry>numeric</entry> 12680<entry>numeric</entry> 12681</row> 12682 12683<row> 12684<entry><emphasis role="bold">STRNUM</emphasis></entry> 12685<entry>string</entry> 12686<entry>numeric</entry> 12687<entry>numeric</entry> 12688</row> 12689 12690</tbody> 12691</tgroup> 12692</informaltable> 12693 12694@end docbook 12695 12696The basic idea is that user input that looks numeric---and @emph{only} 12697user input---should be treated as numeric, even though it is actually 12698made of characters and is therefore also a string. 
12699Thus, for example, the string constant @w{@code{" +3.14"}}, 12700when it appears in program source code, 12701is a string---even though it looks numeric---and 12702is @emph{never} treated as a number for comparison 12703purposes. 12704 12705In short, when one operand is a ``pure'' string, such as a string 12706constant, then a string comparison is performed. Otherwise, a 12707numeric comparison is performed. 12708(The primary difference between a number and a strnum is that 12709for strnums @command{gawk} preserves the original string value that 12710the scalar had when it came in.) 12711 12712This point bears additional emphasis: 12713Input that looks numeric @emph{is} numeric. 12714All other input is treated as strings. 12715 12716Thus, the six-character input string @w{@samp{ +3.14}} receives the 12717strnum attribute. In contrast, the eight characters 12718@w{@code{" +3.14"}} appearing in program text comprise a string constant. 12719The following examples print @samp{1} when the comparison between 12720the two different constants is true, and @samp{0} otherwise: 12721 12722@c 22.9.2014: Tested with mawk and BWK awk, got same results. 
12723@example 12724$ @kbd{echo ' +3.14' | awk '@{ print($0 == " +3.14") @}'} @ii{True} 12725@print{} 1 12726$ @kbd{echo ' +3.14' | awk '@{ print($0 == "+3.14") @}'} @ii{False} 12727@print{} 0 12728$ @kbd{echo ' +3.14' | awk '@{ print($0 == "3.14") @}'} @ii{False} 12729@print{} 0 12730$ @kbd{echo ' +3.14' | awk '@{ print($0 == 3.14) @}'} @ii{True} 12731@print{} 1 12732$ @kbd{echo ' +3.14' | awk '@{ print($1 == " +3.14") @}'} @ii{False} 12733@print{} 0 12734$ @kbd{echo ' +3.14' | awk '@{ print($1 == "+3.14") @}'} @ii{True} 12735@print{} 1 12736$ @kbd{echo ' +3.14' | awk '@{ print($1 == "3.14") @}'} @ii{False} 12737@print{} 0 12738$ @kbd{echo ' +3.14' | awk '@{ print($1 == 3.14) @}'} @ii{True} 12739@print{} 1 12740@end example 12741 12742You can see the type of an input field (or other user input) 12743using @code{typeof()}: 12744 12745@example 12746$ @kbd{echo hello 37 | gawk '@{ print typeof($1), typeof($2) @}'} 12747@print{} string strnum 12748@end example 12749 12750@node Comparison Operators 12751@subsubsection Comparison Operators 12752@cindex operators @subentry comparison 12753 12754@dfn{Comparison expressions} compare strings or numbers for 12755relationships such as equality. They are written using @dfn{relational 12756operators}, which are a superset of those in C. 12757@ref{table-relational-ops} describes them. 
12758 12759@cindex @code{<} (left angle bracket) @subentry @code{<} operator 12760@cindex left angle bracket (@code{<}) @subentry @code{<} operator 12761@cindex @code{<} (left angle bracket) @subentry @code{<=} operator 12762@cindex left angle bracket (@code{<}) @subentry @code{<=} operator 12763@cindex @code{>} (right angle bracket) @subentry @code{>=} operator 12764@cindex right angle bracket (@code{>}) @subentry @code{>=} operator 12765@cindex @code{>} (right angle bracket) @subentry @code{>} operator 12766@cindex right angle bracket (@code{>}) @subentry @code{>} operator 12767@cindex @code{=} (equals sign) @subentry @code{==} operator 12768@cindex equals sign (@code{=}) @subentry @code{==} operator 12769@cindex @code{!} (exclamation point) @subentry @code{!=} operator 12770@cindex exclamation point (@code{!}) @subentry @code{!=} operator 12771@cindex @code{~} (tilde), @code{~} operator 12772@cindex tilde (@code{~}), @code{~} operator 12773@cindex @code{!} (exclamation point) @subentry @code{!~} operator 12774@cindex exclamation point (@code{!}) @subentry @code{!~} operator 12775@cindex @code{in} operator 12776@float Table,table-relational-ops 12777@caption{Relational operators} 12778@multitable @columnfractions .25 .75 12779@headitem Expression @tab Result 12780@item @var{x} @code{<} @var{y} @tab True if @var{x} is less than @var{y} 12781@item @var{x} @code{<=} @var{y} @tab True if @var{x} is less than or equal to @var{y} 12782@item @var{x} @code{>} @var{y} @tab True if @var{x} is greater than @var{y} 12783@item @var{x} @code{>=} @var{y} @tab True if @var{x} is greater than or equal to @var{y} 12784@item @var{x} @code{==} @var{y} @tab True if @var{x} is equal to @var{y} 12785@item @var{x} @code{!=} @var{y} @tab True if @var{x} is not equal to @var{y} 12786@item @var{x} @code{~} @var{y} @tab True if the string @var{x} matches the regexp denoted by @var{y} 12787@item @var{x} @code{!~} @var{y} @tab True if the string @var{x} does not match the regexp denoted by 
@var{y} 12788@item @var{subscript} @code{in} @var{array} @tab True if the array @var{array} has an element with the subscript @var{subscript} 12789@end multitable 12790@end float 12791 12792Comparison expressions have the value one if true and zero if false. 12793When comparing operands of mixed types, numeric operands are converted 12794to strings using the value of @code{CONVFMT} 12795(@pxref{Conversion}). 12796 12797Strings are compared 12798by comparing the first character of each, then the second character of each, 12799and so on. Thus, @code{"10"} is less than @code{"9"}. If there are two 12800strings where one is a prefix of the other, the shorter string is less than 12801the longer one. Thus, @code{"abc"} is less than @code{"abcd"}. 12802 12803@cindex troubleshooting @subentry @code{==} operator 12804It is very easy to accidentally mistype the @samp{==} operator and 12805leave off one of the @samp{=} characters. The result is still valid 12806@command{awk} code, but the program does not do what is intended: 12807 12808@example 12809@group 12810if (a = b) # oops! should be a == b 12811 @dots{} 12812else 12813 @dots{} 12814@end group 12815@end example 12816 12817@noindent 12818Unless @code{b} happens to be zero or the null string, the @code{if} 12819part of the test always succeeds. Because the operators are 12820so similar, this kind of error is very difficult to spot when 12821scanning the source code. 
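The string-comparison rules and the @samp{=} versus @samp{==} pitfall described above can be checked with a small sketch like this one. It is an illustration under stated assumptions rather than a definitive test: the shell variable @code{cmp_demo} and the @command{awk} variables @code{a} and @code{b} are names made up for the example.

```shell
#!/bin/sh
# Sketch only: cmp_demo, a, and b are hypothetical names.
cmp_demo=$(awk 'BEGIN {
    print ("10" < "9")        # 1: string comparison, "1" sorts before "9"
    print (10 < 9)            # 0: numeric comparison
    print ("abc" < "abcd")    # 1: a prefix is less than the longer string

    a = 1; b = 2
    if (a = b)                # assignment, not comparison: a becomes 2, so the test is true
        print "a = b is true; a is now", a

    a = 1
    if (a == b)               # real comparison: 1 is not equal to 2
        print "a == b is true"
    else
        print "a == b is false; a is still", a
}')
printf '%s\n' "$cmp_demo"
```

Printing @code{a} after each test shows the side effect of the mistyped operator: the accidental assignment silently overwrites @code{a}, whereas the comparison leaves it alone.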
12822 12823The following list of expressions illustrates the kinds of comparisons 12824@command{awk} performs, as well as what the result of each comparison is: 12825 12826@table @code 12827@item 1.5 <= 2.0 12828Numeric comparison (true) 12829 12830@item "abc" >= "xyz" 12831String comparison (false) 12832 12833@item 1.5 != " +2" 12834String comparison (true) 12835 12836@item "1e2" < "3" 12837String comparison (true) 12838 12839@item a = 2; b = "2" 12840@itemx a == b 12841String comparison (true) 12842 12843@item a = 2; b = " +2" 12844@itemx a == b 12845String comparison (false) 12846@end table 12847 12848In this example: 12849 12850@example 12851$ @kbd{echo 1e2 3 | awk '@{ print ($1 < $2) ? "true" : "false" @}'} 12852@print{} false 12853@end example 12854 12855@cindex comparison expressions @subentry string vs.@: regexp 12856@c @cindex string comparison vs.@: regexp comparison 12857@c @cindex regexp comparison vs.@: string comparison 12858@noindent 12859the result is @samp{false} because both @code{$1} and @code{$2} 12860are user input. They are numeric strings---therefore both have 12861the strnum attribute, dictating a numeric comparison. 12862The purpose of the comparison rules and the use of numeric strings is 12863to attempt to produce the behavior that is ``least surprising,'' while 12864still ``doing the right thing.'' 12865 12866String comparisons and regular expression comparisons are very different. 12867For example: 12868 12869@example 12870x == "foo" 12871@end example 12872 12873@noindent 12874has the value one (that is, it is true) only if the variable @code{x} 12875is precisely @samp{foo}. By contrast: 12876 12877@example 12878x ~ /foo/ 12879@end example 12880 12881@noindent 12882has the value one if @code{x} contains @samp{foo}, such as 12883@code{"Oh, what a fool am I!"}. 
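The difference between the two tests can be seen side by side in a sketch such as the following, which feeds three sample lines to @command{awk} and prints, for each line, the result of the string comparison and of the regexp match. The sample input and the shell variable @code{match_demo} are assumptions made for this illustration.

```shell
#!/bin/sh
# Sketch only: match_demo and the sample input lines are hypothetical.
# Output columns: input line | ($0 == "foo") | ($0 ~ /foo/)
match_demo=$(printf '%s\n' 'foo' 'Oh, what a fool am I!' 'bar' | awk '{
    printf "%s|%d|%d\n", $0, ($0 == "foo"), ($0 ~ /foo/)
}')
printf '%s\n' "$match_demo"
```

Only the first line satisfies the equality test, while both the first and the second satisfy the match, because @samp{fool} contains @samp{foo}.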
12884 12885@cindex @code{~} (tilde), @code{~} operator 12886@cindex tilde (@code{~}), @code{~} operator 12887@cindex @code{!} (exclamation point) @subentry @code{!~} operator 12888@cindex exclamation point (@code{!}) @subentry @code{!~} operator 12889The righthand operand of the @samp{~} and @samp{!~} operators may be 12890either a regexp constant (@code{/}@dots{}@code{/}) or an ordinary 12891expression. In the latter case, the value of the expression as a string is used as a 12892dynamic regexp (@pxref{Regexp Usage}; also 12893@pxref{Computed Regexps}). 12894 12895@cindex @command{awk} @subentry regexp constants and 12896@cindex regexp constants 12897A constant regular 12898expression in slashes by itself is also an expression. 12899@code{/@var{regexp}/} is an abbreviation for the following comparison expression: 12900 12901@example 12902$0 ~ /@var{regexp}/ 12903@end example 12904 12905One special place where @code{/foo/} is @emph{not} an abbreviation for 12906@samp{$0 ~ /foo/} is when it is the righthand operand of @samp{~} or 12907@samp{!~}. 12908@xref{Using Constant Regexps}, 12909where this is discussed in more detail. 12910 12911@node POSIX String Comparison 12912@subsubsection String Comparison Based on Locale Collating Order 12913 12914The POSIX standard used to say that all string comparisons are 12915performed based on the locale's @dfn{collating order}. This 12916is the order in which characters sort, as defined by the locale 12917(for more discussion, @pxref{Locales}). This order is usually very 12918different from the results obtained when doing straight byte-by-byte 12919comparison.@footnote{Technically, string comparison is supposed to behave 12920the same way as if the strings were compared with the C @code{strcoll()} 12921function.} 12922 12923@cindex POSIX mode 12924Because this behavior differs considerably from existing practice, 12925@command{gawk} only implemented it when in POSIX mode (@pxref{Options}). 
Here is an example to illustrate the difference, in an @code{en_US.UTF-8}
locale:

@example
$ @kbd{gawk 'BEGIN @{ printf("ABC < abc = %s\n",}
>                     @kbd{("ABC" < "abc" ? "TRUE" : "FALSE")) @}'}
@print{} ABC < abc = TRUE
$ @kbd{gawk --posix 'BEGIN @{ printf("ABC < abc = %s\n",}
>                             @kbd{("ABC" < "abc" ? "TRUE" : "FALSE")) @}'}
@print{} ABC < abc = FALSE
@end example

Fortunately, as of August 2016, comparison based on locale
collating order is no longer required for the @code{==} and @code{!=}
operators.@footnote{See @uref{http://austingroupbugs.net/view.php?id=1070,
the Austin Group website}.} However, comparison based on locales is still
required for @code{<}, @code{<=}, @code{>}, and @code{>=}. POSIX thus
recommends as follows:

@quotation
Since the @code{==} operator checks whether strings are identical,
not whether they collate equally, applications needing to check whether
strings collate equally can use:

@example
a <= b && a >= b
@end example
@end quotation

@cindex POSIX mode
As of @value{PVERSION} 4.2, @command{gawk} continues to use locale
collating order for @code{<}, @code{<=}, @code{>}, and @code{>=} only
in POSIX mode.

@ignore
References: http://austingroupbugs.net/view.php?id=963
and http://austingroupbugs.net/view.php?id=1070.
@end ignore

@node Boolean Ops
@subsection Boolean Expressions
@cindex and Boolean-logic operator
@cindex or Boolean-logic operator
@cindex not Boolean-logic operator
@cindex expressions @subentry Boolean
@cindex Boolean expressions
@cindex operators, Boolean @seeentry{Boolean expressions}
@cindex Boolean operators @seeentry{Boolean expressions}
@cindex logical operators @seeentry{Boolean expressions}
@cindex operators, logical @seeentry{Boolean expressions}

A @dfn{Boolean expression} is a combination of comparison expressions or
matching expressions, using the Boolean operators ``or''
(@samp{||}), ``and'' (@samp{&&}), and ``not'' (@samp{!}), along with
parentheses to control nesting. The truth value of the Boolean expression is
computed by combining the truth values of the component expressions.
Boolean expressions are also referred to as @dfn{logical expressions}.
The terms are equivalent.

Boolean expressions can be used wherever comparison and matching
expressions can be used. They can be used in @code{if}, @code{while},
@code{do}, and @code{for} statements
(@pxref{Statements}).
They have numeric values (one if true, zero if false) that come into play
if the result of the Boolean expression is stored in a variable or
used in arithmetic.

In addition, every Boolean expression is also a valid pattern, so
you can use one as a pattern to control the execution of rules.
The Boolean operators are:

@table @code
@item @var{boolean1} && @var{boolean2}
True if both @var{boolean1} and @var{boolean2} are true.
For example,
the following statement prints the current input record if it contains
both @samp{edu} and @samp{li}:

@example
if ($0 ~ /edu/ && $0 ~ /li/) print
@end example

@cindex side effects @subentry Boolean operators
The subexpression @var{boolean2} is evaluated only if @var{boolean1}
is true. This can make a difference when @var{boolean2} contains
expressions that have side effects. In the case of @samp{$0 ~ /foo/ &&
($2 == bar++)}, the variable @code{bar} is not incremented if there is
no substring @samp{foo} in the record.

@item @var{boolean1} || @var{boolean2}
True if at least one of @var{boolean1} or @var{boolean2} is true.
For example, the following statement prints all records in the input
that contain @emph{either} @samp{edu} or
@samp{li}:

@example
if ($0 ~ /edu/ || $0 ~ /li/) print
@end example

The subexpression @var{boolean2} is evaluated only if @var{boolean1}
is false. This can make a difference when @var{boolean2} contains
expressions that have side effects.
(Thus, this test never really distinguishes records that contain both
@samp{edu} and @samp{li}---as soon as @samp{edu} is matched,
the full test succeeds.)

@item ! @var{boolean}
True if @var{boolean} is false. For example,
the following program prints @samp{no home!} in
the unusual event that the @env{HOME} environment
variable is not defined:

@example
BEGIN @{ if (! ("HOME" in ENVIRON))
            print "no home!" @}
@end example

(The @code{in} operator is described in
@ref{Reference to Elements}.)
@end table

@cindex short-circuit operators
@cindex operators @subentry short-circuit
@cindex @code{&} (ampersand) @subentry @code{&&} operator
@cindex ampersand (@code{&}) @subentry @code{&&} operator
@cindex @code{|} (vertical bar) @subentry @code{||} operator
@cindex vertical bar (@code{|}) @subentry @code{||} operator
The @samp{&&} and @samp{||} operators are called @dfn{short-circuit}
operators because of the way they work. Evaluation of the full expression
is ``short-circuited'' if the result can be determined partway through
its evaluation.

@cindex line continuations
Statements that end with @samp{&&} or @samp{||} can be continued simply
by putting a newline after them. But you cannot put a newline in front
of either of these operators without using backslash continuation
(@pxref{Statements/Lines}).

@cindex @code{!} (exclamation point) @subentry @code{!} operator
@cindex exclamation point (@code{!}) @subentry @code{!} operator
@cindex newlines
@cindex variables @subentry flag
@cindex flag variables
The actual value of an expression using the @samp{!} operator is
either one or zero, depending upon the truth value of the expression it
is applied to.
The @samp{!} operator is often useful for changing the sense of a flag
variable from false to true and back again. For example, the following
program is one way to print lines in between special bracketing lines:

@example
$1 == "START"   @{ interested = ! interested; next @}
interested      @{ print @}
$1 == "END"     @{ interested = ! interested; next @}
@end example

@noindent
The variable @code{interested}, as with all @command{awk} variables, starts
out initialized to zero, which is also false.
When a line is seen whose
first field is @samp{START}, the value of @code{interested} is toggled
to true, using @samp{!}. The next rule prints lines as long as
@code{interested} is true. When a line is seen whose first field is
@samp{END}, @code{interested} is toggled back to false.@footnote{This
program has a bug; it prints lines starting with @samp{END}. How
would you fix it?}

@ignore
Scott Deifik points out that this program isn't robust against
bogus input data, but the point is to illustrate the use of `!',
so we'll leave well enough alone.
@end ignore

Most commonly, the @samp{!} operator is used in the conditions of
@code{if} and @code{while} statements, where it often makes more
sense to phrase the logic in the negative:

@example
if (! @var{some condition} || @var{some other condition}) @{
    @var{@dots{} do whatever processing @dots{}}
@}
@end example

@cindex @code{next} statement
@quotation NOTE
The @code{next} statement is discussed in
@ref{Next Statement}.
@code{next} tells @command{awk} to skip the rest of the rules, get the
next record, and start processing the rules over again at the top.
The reason it's there is to avoid printing the bracketing
@samp{START} and @samp{END} lines.
@end quotation

@node Conditional Exp
@subsection Conditional Expressions
@cindex conditional expressions
@cindex expressions @subentry conditional
@cindex expressions @subentry selecting

A @dfn{conditional expression} is a special kind of expression that has
three operands. It allows you to use one expression's value to select
one of two other expressions.
The conditional expression in @command{awk} is the same as in the C
language, as shown here:

@example
@var{selector} ?
@var{if-true-exp} : @var{if-false-exp}
@end example

@noindent
There are three subexpressions. The first, @var{selector}, is always
computed first. If it is ``true'' (not zero or not null), then
@var{if-true-exp} is computed next, and its value becomes the value of
the whole expression. Otherwise, @var{if-false-exp} is computed next,
and its value becomes the value of the whole expression.
For example, the following expression produces the absolute value of @code{x}:

@example
x >= 0 ? x : -x
@end example

@cindex side effects @subentry conditional expressions
Each time the conditional expression is computed, only one of
@var{if-true-exp} and @var{if-false-exp} is used; the other is ignored.
This is important when the expressions have side effects. For example,
this conditional expression examines element @code{i} of either array
@code{a} or array @code{b}, and increments @code{i}:

@example
x == y ? a[i++] : b[i++]
@end example

@noindent
This is guaranteed to increment @code{i} exactly once, because each time
only one of the two increment expressions is executed
and the other is not.
@xref{Arrays},
for more information about arrays.

@cindex differences in @command{awk} and @command{gawk} @subentry line continuations
@cindex line continuations @subentry @command{gawk}
@cindex @command{gawk} @subentry line continuation in
As a minor @command{gawk} extension,
a statement that uses @samp{?:} can be continued simply
by putting a newline after either character.
However, putting a newline in front
of either character does not work without using backslash continuation
(@pxref{Statements/Lines}).
If @option{--posix} is specified
(@pxref{Options}), this extension is disabled.
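
Returning to the side-effect example, the following minimal sketch
shows that @code{i} is incremented exactly once. The array contents
and the values of @code{x} and @code{y} are arbitrary choices of ours:

@example
BEGIN @{
    a[0] = "from a"; b[0] = "from b"
    i = 0; x = 1; y = 2
    # x != y, so only b[i++] is evaluated; a[i++] is skipped entirely
    val = (x == y ? a[i++] : b[i++])
    print val, i
@}
@end example

@noindent
This should print @samp{from b 1}: the value came from @code{b[0]},
and @code{i} advanced only once.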

@node Function Calls
@section Function Calls
@cindex function calls

A @dfn{function} is a name for a particular calculation.
This enables you to
ask for it by name at any point in the program. For
example, the function @code{sqrt()} computes the square root of a number.

@cindex functions @subentry built-in
A fixed set of functions are @dfn{built in}, which means they are
available in every @command{awk} program. The @code{sqrt()} function is one
of these. @xref{Built-in} for a list of built-in
functions and their descriptions. In addition, you can define
functions for use in your program.
@xref{User-defined}
for instructions on how to do this.
Finally, @command{gawk} lets you write functions in C or C++
that may be called from your program (@pxref{Dynamic Extensions}).

@cindex arguments @subentry in function calls
The way to use a function is with a @dfn{function call} expression,
which consists of the function name followed immediately by a list of
@dfn{arguments} in parentheses. The arguments are expressions that
provide the raw materials for the function's calculations.
When there is more than one argument, they are separated by commas. If
there are no arguments, just write @samp{()} after the function name.
The following examples show function calls with and without arguments:

@example
sqrt(x^2 + y^2)        @ii{one argument}
atan2(y, x)            @ii{two arguments}
rand()                 @ii{no arguments}
@end example

@cindex troubleshooting @subentry function call syntax
@quotation CAUTION
Do not put any space between the function name and the opening parenthesis!
A user-defined function name looks just like the name of a
variable---a space would make the expression look like concatenation of
a variable with an expression inside parentheses.
With built-in functions, space before the parenthesis is harmless, but
it is best not to get into the habit of using space to avoid mistakes
with user-defined functions.
@end quotation

Each function expects a particular number
of arguments. For example, the @code{sqrt()} function must be called with
a single argument, the number whose square root is to be taken:

@example
sqrt(@var{argument})
@end example

Some of the built-in functions have one or
more optional arguments.
If those arguments are not supplied, the functions
use a reasonable default value.
@xref{Built-in} for full details. If arguments
are omitted in calls to user-defined functions, then those arguments are
treated as local variables. Such local variables act like the
empty string if referenced where a string value is required,
and like zero if referenced where a numeric value is required
(@pxref{User-defined}).

As an advanced feature, @command{gawk} provides indirect function calls,
which is a way to choose the function to call at runtime, instead of
when you write the source code to your program. We defer discussion of
this feature until later; see @ref{Indirect Calls}.

@cindex side effects @subentry function calls
Like every other expression, the function call has a value, often
called the @dfn{return value}, which is computed by the function
based on the arguments you give it. In this example, the return value
of @samp{sqrt(@var{argument})} is the square root of @var{argument}.
The following program reads numbers, one number per line, and prints
the square root of each one:

@example
$ @kbd{awk '@{ print "The square root of", $1, "is", sqrt($1) @}'}
@kbd{1}
@print{} The square root of 1 is 1
@kbd{3}
@print{} The square root of 3 is 1.73205
@kbd{5}
@print{} The square root of 5 is 2.23607
@kbd{Ctrl-d}
@end example

A function can also have side effects, such as assigning
values to certain variables or doing I/O.
This program shows how the @code{match()} function
(@pxref{String Functions})
changes the variables @code{RSTART} and @code{RLENGTH}:

@example
@{
    if (match($1, $2))
        print RSTART, RLENGTH
    else
        print "no match"
@}
@end example

@noindent
Here is a sample run:

@example
$ @kbd{awk -f matchit.awk}
@kbd{aaccdd c+}
@print{} 3 2
@kbd{foo bar}
@print{} no match
@kbd{abcdefg e}
@print{} 5 1
@end example

@node Precedence
@section Operator Precedence (How Operators Nest)
@cindex precedence
@cindex operators @subentry precedence of

@dfn{Operator precedence} determines how operators are grouped when
different operators appear close by in one expression. For example,
@samp{*} has higher precedence than @samp{+}; thus, @samp{a + b * c}
means to multiply @code{b} and @code{c}, and then add @code{a} to the
product (i.e., @samp{a + (b * c)}).

The normal precedence of the operators can be overruled by using parentheses.
Think of the precedence rules as saying where the
parentheses are assumed to be. In
fact, it is wise to always use parentheses whenever there is an unusual
combination of operators, because other people who read the program may
not remember what the precedence is in this case.
Even experienced programmers occasionally forget the exact rules,
which leads to mistakes.
Explicit parentheses help prevent
any such mistakes.

When operators of equal precedence are used together, the leftmost
operator groups first, except for the assignment, conditional, and
exponentiation operators, which group in the opposite order.
Thus, @samp{a - b + c} groups as @samp{(a - b) + c} and
@samp{a = b = c} groups as @samp{a = (b = c)}.

Normally the precedence of prefix unary operators does not matter,
because there is only one way to interpret
them: innermost first. Thus, @samp{$++i} means @samp{$(++i)} and
@samp{++$x} means @samp{++($x)}. However, when another operator follows
the operand, then the precedence of the unary operators can matter.
@samp{$x^2} means @samp{($x)^2}, but @samp{-x^2} means
@samp{-(x^2)}, because @samp{-} has lower precedence than @samp{^},
whereas @samp{$} has higher precedence.
Also, operators cannot be combined in a way that violates the
precedence rules; for example, @samp{$$0++--} is not a valid
expression because the first @samp{$} has higher precedence than the
@samp{++}; to avoid the problem the expression can be rewritten as
@samp{$($0++)--}.

This list presents @command{awk}'s operators, in order of highest
to lowest precedence:

@c @asis for docbook to come out right
@table @asis
@item @code{(}@dots{}@code{)}
Grouping.

@cindex @code{$} (dollar sign) @subentry @code{$} field operator
@cindex dollar sign (@code{$}) @subentry @code{$} field operator
@item @code{$}
Field reference.

@cindex @code{+} (plus sign) @subentry @code{++} operator
@cindex plus sign (@code{+}) @subentry @code{++} operator
@cindex @code{-} (hyphen) @subentry @code{--} operator
@cindex hyphen (@code{-}) @subentry @code{--} operator
@item @code{++ --}
Increment, decrement.

@cindex @code{^} (caret) @subentry @code{^} operator
@cindex caret (@code{^}) @subentry @code{^} operator
@cindex @code{*} (asterisk) @subentry @code{**} operator
@cindex asterisk (@code{*}) @subentry @code{**} operator
@item @code{^ **}
Exponentiation. These operators group right to left.

@cindex @code{+} (plus sign) @subentry @code{+} operator
@cindex plus sign (@code{+}) @subentry @code{+} operator
@cindex @code{-} (hyphen) @subentry @code{-} operator
@cindex hyphen (@code{-}) @subentry @code{-} operator
@cindex @code{!} (exclamation point) @subentry @code{!} operator
@cindex exclamation point (@code{!}) @subentry @code{!} operator
@item @code{+ - !}
Unary plus, minus, logical ``not.''

@cindex @code{*} (asterisk) @subentry @code{*} operator @subentry as multiplication operator
@cindex asterisk (@code{*}) @subentry @code{*} operator @subentry as multiplication operator
@cindex @code{/} (forward slash) @subentry @code{/} operator
@cindex forward slash (@code{/}) @subentry @code{/} operator
@cindex @code{%} (percent sign) @subentry @code{%} operator
@cindex percent sign (@code{%}) @subentry @code{%} operator
@item @code{* / %}
Multiplication, division, remainder.

@cindex @code{+} (plus sign) @subentry @code{+} operator
@cindex plus sign (@code{+}) @subentry @code{+} operator
@cindex @code{-} (hyphen) @subentry @code{-} operator
@cindex hyphen (@code{-}) @subentry @code{-} operator
@item @code{+ -}
Addition, subtraction.

@item String concatenation
There is no special symbol for concatenation.
The operands are simply written side by side
(@pxref{Concatenation}).

@cindex @code{<} (left angle bracket) @subentry @code{<} operator
@cindex left angle bracket (@code{<}) @subentry @code{<} operator
@cindex @code{<} (left angle bracket) @subentry @code{<=} operator
@cindex left angle bracket (@code{<}) @subentry @code{<=} operator
@cindex @code{>} (right angle bracket) @subentry @code{>=} operator
@cindex right angle bracket (@code{>}) @subentry @code{>=} operator
@cindex @code{>} (right angle bracket) @subentry @code{>} operator
@cindex right angle bracket (@code{>}) @subentry @code{>} operator
@cindex @code{=} (equals sign) @subentry @code{==} operator
@cindex equals sign (@code{=}) @subentry @code{==} operator
@cindex @code{!} (exclamation point) @subentry @code{!=} operator
@cindex exclamation point (@code{!}) @subentry @code{!=} operator
@cindex @code{>} (right angle bracket) @subentry @code{>>} operator (I/O)
@cindex right angle bracket (@code{>}) @subentry @code{>>} operator (I/O)
@cindex operators @subentry input/output
@cindex @code{|} (vertical bar) @subentry @code{|} operator (I/O)
@cindex vertical bar (@code{|}) @subentry @code{|} operator (I/O)
@cindex operators @subentry input/output
@cindex @code{|} (vertical bar) @subentry @code{|&} operator (I/O)
@cindex vertical bar (@code{|}) @subentry @code{|&} operator (I/O)
@cindex operators @subentry input/output
@item @code{< <= == != > >= >> | |&}
Relational and redirection.
The relational operators and the redirections have the same precedence
level. Characters such as @samp{>} serve both as relationals and as
redirections; the context distinguishes between the two meanings.

@cindex @code{print} statement @subentry I/O operators in
@cindex @code{printf} statement @subentry I/O operators in
Note that the I/O redirection operators in @code{print} and @code{printf}
statements belong to the statement level, not to expressions. The
redirection does not produce an expression that could be the operand of
another operator. As a result, it does not make sense to use a
redirection operator near another operator of lower precedence without
parentheses. Such combinations (e.g., @samp{print foo > a ? b : c})
result in syntax errors.
The correct way to write this statement is @samp{print foo > (a ? b : c)}.

@cindex @code{~} (tilde), @code{~} operator
@cindex tilde (@code{~}), @code{~} operator
@cindex @code{!} (exclamation point) @subentry @code{!~} operator
@cindex exclamation point (@code{!}) @subentry @code{!~} operator
@item @code{~ !~}
Matching, nonmatching.

@cindex @code{in} operator
@item @code{in}
Array membership.

@cindex @code{&} (ampersand) @subentry @code{&&} operator
@cindex ampersand (@code{&}) @subentry @code{&&} operator
@item @code{&&}
Logical ``and.''

@cindex @code{|} (vertical bar) @subentry @code{||} operator
@cindex vertical bar (@code{|}) @subentry @code{||} operator
@item @code{||}
Logical ``or.''

@cindex @code{?} (question mark) @subentry @code{?:} operator
@cindex question mark (@code{?}) @subentry @code{?:} operator
@cindex @code{:} (colon) @subentry @code{?:} operator
@cindex colon (@code{:}) @subentry @code{?:} operator
@item @code{?:}
Conditional. This operator groups right to left.

@cindex @code{+} (plus sign) @subentry @code{+=} operator
@cindex plus sign (@code{+}) @subentry @code{+=} operator
@cindex @code{-} (hyphen) @subentry @code{-=} operator
@cindex hyphen (@code{-}) @subentry @code{-=} operator
@cindex @code{*} (asterisk) @subentry @code{*=} operator
@cindex asterisk (@code{*}) @subentry @code{*=} operator
@cindex @code{*} (asterisk) @subentry @code{**=} operator
@cindex asterisk (@code{*}) @subentry @code{**=} operator
@cindex @code{/} (forward slash) @subentry @code{/=} operator
@cindex forward slash (@code{/}) @subentry @code{/=} operator
@cindex @code{%} (percent sign) @subentry @code{%=} operator
@cindex percent sign (@code{%}) @subentry @code{%=} operator
@cindex @code{^} (caret) @subentry @code{^=} operator
@cindex caret (@code{^}) @subentry @code{^=} operator
@item @code{= += -= *= /= %= ^= **=}
Assignment. These operators group right to left.
@end table

@cindex POSIX @command{awk} @subentry @code{**} operator and
@cindex portability @subentry operators @subentry not in POSIX @command{awk}
@quotation NOTE
The @samp{|&}, @samp{**}, and @samp{**=} operators are not specified by POSIX.
For maximum portability, do not use them.
@end quotation

@node Locales
@section Where You Are Makes a Difference
@cindex locale, definition of

Modern systems support the notion of @dfn{locales}: a way to tell the
system about the local character set and language. The ISO C standard
defines a default @code{"C"} locale, which is an environment that is
typical of what many C programmers are used to.

Once upon a time, the locale setting used to affect regexp matching,
but this is no longer true (@pxref{Ranges and Locales}).

Locales can affect record splitting. For the normal case of @samp{RS =
"\n"}, the locale is largely irrelevant.
For other single-character
record separators, setting @samp{LC_ALL=C} in the environment will
give you much better performance when reading records. Otherwise,
@command{gawk} has to make several function calls, @emph{per input
character}, to find the record terminator.

Locales can affect how dates and times are formatted (@pxref{Time
Functions}). For example, a common way to abbreviate the date September
4, 2015, in the United States is ``9/4/15.'' In many countries in
Europe, however, it is abbreviated ``4.9.15.'' Thus, the @samp{%x}
specification in a @code{"US"} locale might produce @samp{9/4/15},
while in a @code{"EUROPE"} locale, it might produce @samp{4.9.15}.

According to POSIX, string comparison is also affected by locales (similar
to regular expressions). The details are presented in @ref{POSIX String
Comparison}.

Finally, the locale affects the value of the decimal point character
used when @command{gawk} parses input data. This is discussed in detail
in @ref{Conversion}.

@node Expressions Summary
@section Summary

@itemize @value{BULLET}
@item
Expressions are the basic elements of computation in programs. They are
built from constants, variables, function calls, and combinations of the
various kinds of values with operators.

@item
@command{awk} supplies three kinds of constants: numeric, string, and
regexp. @command{gawk} lets you specify numeric constants in octal
and hexadecimal (bases 8 and 16) as well as decimal (base 10).
In certain contexts, a standalone regexp constant such as @code{/foo/}
has the same meaning as @samp{$0 ~ /foo/}.

@item
Variables hold values between uses in computations. A number of built-in
variables provide information to your @command{awk} program, and a number
of others let you control how @command{awk} behaves.

@item
Numbers are automatically converted to strings, and strings to numbers,
as needed by @command{awk}. Numeric values are converted as if they were
formatted with @code{sprintf()} using the format in @code{CONVFMT}.
Locales can influence the conversions.

@item
@command{awk} provides the usual arithmetic operators (addition,
subtraction, multiplication, division, modulus), and unary plus and minus.
It also provides comparison operators, Boolean operators, an array membership
testing operator, and regexp
matching operators. String concatenation is accomplished by placing
two expressions next to each other; there is no explicit operator.
The three-operand @samp{?:} operator provides an ``if-else'' test within
expressions.

@item
Assignment operators provide convenient shorthands for common arithmetic
operations.

@item
In @command{awk}, a value is considered to be true if it is nonzero
@emph{or} non-null. Otherwise, the value is false.

@item
A variable's type is set upon each assignment and may change over its
lifetime. The type determines how it behaves in comparisons (string
or numeric).

@item
Function calls return a value that may be used as part of a larger
expression. Expressions used to pass parameter values are fully
evaluated before the function is called. @command{awk} provides
built-in and user-defined functions; this is described in
@ref{Functions}.

@item
Operator precedence specifies the order in which operations are performed,
unless explicitly overridden by parentheses. @command{awk}'s operator
precedence is compatible with that of C.

@item
Locales can affect the format of data as output by an @command{awk}
program, and occasionally the format for data read as input.

@end itemize


@node Patterns and Actions
@chapter Patterns, Actions, and Variables
@cindex patterns

As you have already seen, each @command{awk} rule consists of
a pattern with an associated action. This @value{CHAPTER} describes how
you build patterns and actions, what kinds of things you can do within
actions, and @command{awk}'s predefined variables.

The pattern--action rules and the statements available for use
within actions form the core of @command{awk} programming.
In a sense, everything covered
up to here has been the foundation
that programs are built on top of. Now it's time to start
building something useful.

@menu
* Pattern Overview::            What goes into a pattern.
* Using Shell Variables::       How to use shell variables with @command{awk}.
* Action Overview::             What goes into an action.
* Statements::                  Describes the various control statements in
                                detail.
* Built-in Variables::          Summarizes the predefined variables.
* Pattern Action Summary::      Patterns and Actions summary.
@end menu

@node Pattern Overview
@section Pattern Elements

@menu
* Regexp Patterns::             Using regexps as patterns.
* Expression Patterns::         Any expression can be used as a pattern.
* Ranges::                      Pairs of patterns specify record ranges.
* BEGIN/END::                   Specifying initialization and cleanup rules.
* BEGINFILE/ENDFILE::           Two special patterns for advanced control.
* Empty::                       The empty pattern, which matches every record.
@end menu

@cindex patterns @subentry types of
Patterns in @command{awk} control the execution of rules---a rule is
executed when its pattern matches the current input record.
The following is a summary of the types of @command{awk} patterns:

@table @code
@item /@var{regular expression}/
A regular expression.
It matches when the text of the
input record fits the regular expression.
(@xref{Regexp}.)

@item @var{expression}
A single expression. It matches when its value
is nonzero (if a number) or non-null (if a string).
(@xref{Expression Patterns}.)

@item @var{begpat}, @var{endpat}
A pair of patterns separated by a comma, specifying a @dfn{range} of records.
The range includes both the initial record that matches @var{begpat} and
the final record that matches @var{endpat}.
(@xref{Ranges}.)

@item BEGIN
@itemx END
Special patterns for you to supply startup or cleanup actions for your
@command{awk} program.
(@xref{BEGIN/END}.)

@item BEGINFILE
@itemx ENDFILE
Special patterns for you to supply startup or cleanup actions to be
done on a per-file basis.
(@xref{BEGINFILE/ENDFILE}.)

@item @var{empty}
The empty pattern matches every input record.
(@xref{Empty}.)
@end table

@node Regexp Patterns
@subsection Regular Expressions as Patterns
@cindex patterns @subentry regexp constants as
@cindex regular expressions @subentry as patterns

Regular expressions are one of the first kinds of patterns presented
in this book.
This kind of pattern is simply a regexp constant in the pattern part of
a rule. Its meaning is @samp{$0 ~ /@var{pattern}/}.
The pattern matches when the input record matches the regexp.
For example:

@example
/foo|bar|baz/  @{ buzzwords++ @}
END            @{ print buzzwords, "buzzwords seen" @}
@end example

@node Expression Patterns
@subsection Expressions as Patterns
@cindex expressions @subentry as patterns
@cindex patterns @subentry expressions as

Any @command{awk} expression is valid as an @command{awk} pattern.
13685The pattern matches if the expression's value is nonzero (if a 13686number) or non-null (if a string). 13687The expression is reevaluated each time the rule is tested against a new 13688input record. If the expression uses fields such as @code{$1}, the 13689value depends directly on the new input record's text; otherwise, it 13690depends on only what has happened so far in the execution of the 13691@command{awk} program. 13692 13693@cindex comparison expressions @subentry as patterns 13694@cindex patterns @subentry comparison expressions as 13695Comparison expressions, using the comparison operators described in 13696@ref{Typing and Comparison}, 13697are a very common kind of pattern. 13698Regexp matching and nonmatching are also very common expressions. 13699The left operand of the @samp{~} and @samp{!~} operators is a string. 13700The right operand is either a constant regular expression enclosed in 13701slashes (@code{/@var{regexp}/}), or any expression whose string value 13702is used as a dynamic regular expression 13703(@pxref{Computed Regexps}). 13704The following example prints the second field of each input record 13705whose first field is precisely @samp{li}: 13706 13707@cindex @code{/} (forward slash) @subentry patterns and 13708@cindex forward slash (@code{/}) @subentry patterns and 13709@cindex @code{~} (tilde), @code{~} operator 13710@cindex tilde (@code{~}), @code{~} operator 13711@cindex @code{!} (exclamation point) @subentry @code{!~} operator 13712@cindex exclamation point (@code{!}) @subentry @code{!~} operator 13713@example 13714$ @kbd{awk '$1 == "li" @{ print $2 @}' mail-list} 13715@end example 13716 13717@noindent 13718(There is no output, because there is no person with the exact name @samp{li}.) 
13719Contrast this with the following regular expression match, which 13720accepts any record with a first field that contains @samp{li}: 13721 13722@example 13723$ @kbd{awk '$1 ~ /li/ @{ print $2 @}' mail-list} 13724@print{} 555-5553 13725@print{} 555-6699 13726@end example 13727 13728@cindex regexp constants @subentry as patterns 13729@cindex patterns @subentry regexp constants as 13730A regexp constant as a pattern is also a special case of an expression 13731pattern. The expression @code{/li/} has the value one if @samp{li} 13732appears in the current input record. Thus, as a pattern, @code{/li/} 13733matches any record containing @samp{li}. 13734 13735@cindex Boolean expressions @subentry as patterns 13736@cindex patterns @subentry Boolean expressions as 13737Boolean expressions are also commonly used as patterns. 13738Whether the pattern 13739matches an input record depends on whether its subexpressions match. 13740For example, the following command prints all the records in 13741@file{mail-list} that contain both @samp{edu} and @samp{li}: 13742 13743@example 13744$ @kbd{awk '/edu/ && /li/' mail-list} 13745@print{} Samuel 555-3430 samuel.lanceolis@@shu.edu A 13746@end example 13747 13748The following command prints all records in 13749@file{mail-list} that contain @emph{either} @samp{edu} or @samp{li} 13750(or both, of course): 13751 13752@example 13753$ @kbd{awk '/edu/ || /li/' mail-list} 13754@print{} Amelia 555-5553 amelia.zodiacusque@@gmail.com F 13755@print{} Broderick 555-0542 broderick.aliquotiens@@yahoo.com R 13756@print{} Fabius 555-1234 fabius.undevicesimus@@ucb.edu F 13757@print{} Julie 555-6699 julie.perscrutabor@@skeeve.com F 13758@print{} Samuel 555-3430 samuel.lanceolis@@shu.edu A 13759@print{} Jean-Paul 555-2127 jeanpaul.campanorum@@nyu.edu R 13760@end example 13761 13762The following command prints all records in 13763@file{mail-list} that do @emph{not} contain the string @samp{li}: 13764 13765@example 13766$ @kbd{awk '! 
/li/' mail-list} 13767@print{} Anthony 555-3412 anthony.asserturo@@hotmail.com A 13768@print{} Becky 555-7685 becky.algebrarum@@gmail.com A 13769@print{} Bill 555-1675 bill.drowning@@hotmail.com A 13770@print{} Camilla 555-2912 camilla.infusarum@@skynet.be R 13771@print{} Fabius 555-1234 fabius.undevicesimus@@ucb.edu F 13772@group 13773@print{} Martin 555-6480 martin.codicibus@@hotmail.com A 13774@print{} Jean-Paul 555-2127 jeanpaul.campanorum@@nyu.edu R 13775@end group 13776@end example 13777 13778@cindex @code{BEGIN} pattern @subentry Boolean patterns and 13779@cindex @code{END} pattern @subentry Boolean patterns and 13780@cindex @code{BEGINFILE} pattern @subentry Boolean patterns and 13781@cindex @code{ENDFILE} pattern @subentry Boolean patterns and 13782The subexpressions of a Boolean operator in a pattern can be constant regular 13783expressions, comparisons, or any other @command{awk} expressions. Range 13784patterns are not expressions, so they cannot appear inside Boolean 13785patterns. Likewise, the special patterns @code{BEGIN}, @code{END}, 13786@code{BEGINFILE}, and @code{ENDFILE}, 13787which never match any input record, are not expressions and cannot 13788appear inside Boolean patterns. 13789 13790The precedence of the different operators that can appear in 13791patterns is described in @ref{Precedence}. 13792 13793@node Ranges 13794@subsection Specifying Record Ranges with Patterns 13795 13796@cindex range patterns 13797@cindex patterns @subentry ranges in 13798@cindex lines @subentry matching ranges of 13799@cindex @code{,} (comma), in range patterns 13800@cindex comma (@code{,}), in range patterns 13801A @dfn{range pattern} is made of two patterns separated by a comma, in 13802the form @samp{@var{begpat}, @var{endpat}}. It is used to match ranges of 13803consecutive input records. The first pattern, @var{begpat}, controls 13804where the range begins, while @var{endpat} controls where 13805the pattern ends. 
For example, the following: 13806 13807@example 13808awk '$1 == "on", $1 == "off"' myfile 13809@end example 13810 13811@noindent 13812prints every record in @file{myfile} between @samp{on}/@samp{off} pairs, inclusive. 13813 13814A range pattern starts out by matching @var{begpat} against every 13815input record. When a record matches @var{begpat}, the range pattern is 13816@dfn{turned on}, and the range pattern matches this record as well. As long as 13817the range pattern stays turned on, it automatically matches every input 13818record read. The range pattern also matches @var{endpat} against every 13819input record; when this succeeds, the range pattern is @dfn{turned off} again 13820for the following record. Then the range pattern goes back to checking 13821@var{begpat} against each record. 13822 13823@cindex @code{if} statement @subentry actions, changing 13824The record that turns on the range pattern and the one that turns it 13825off both match the range pattern. If you don't want to operate on 13826these records, you can write @code{if} statements in the rule's action 13827to distinguish them from the records you are interested in. 13828 13829It is possible for a pattern to be turned on and off by the same 13830record. If the record satisfies both conditions, then the action is 13831executed for just that record. 13832For example, suppose there is text between two identical markers (e.g., 13833the @samp{%} symbol), each on its own line, that should be ignored. 13834A first attempt would be to 13835combine a range pattern that describes the delimited text with the 13836@code{next} statement 13837(not discussed yet, @pxref{Next Statement}). 13838This causes @command{awk} to skip any further processing of the current 13839record and start over again with the next input record. 
Such a program 13840looks like this: 13841 13842@example 13843/^%$/,/^%$/ @{ next @} 13844 @{ print @} 13845@end example 13846 13847@noindent 13848@cindex lines @subentry skipping between markers 13849@c @cindex flag variables 13850This program fails because the range pattern is both turned on and turned off 13851by the first line, which just has a @samp{%} on it. To accomplish this task, 13852write the program in the following manner, using a flag: 13853 13854@cindex @code{!} (exclamation point) @subentry @code{!} operator 13855@example 13856/^%$/ @{ skip = ! skip; next @} 13857skip == 1 @{ next @} # skip lines with `skip' set 13858@end example 13859 13860In a range pattern, the comma (@samp{,}) has the lowest precedence of 13861all the operators (i.e., it is evaluated last). Thus, the following 13862program attempts to combine a range pattern with another, simpler test: 13863 13864@example 13865echo Yes | awk '/1/,/2/ || /Yes/' 13866@end example 13867 13868The intent of this program is @samp{(/1/,/2/) || /Yes/}. 13869However, @command{awk} interprets this as @samp{/1/, (/2/ || /Yes/)}. 13870This cannot be changed or worked around; range patterns do not combine 13871with other patterns: 13872 13873@example 13874$ @kbd{echo Yes | gawk '(/1/,/2/) || /Yes/'} 13875@error{} gawk: cmd. line:1: (/1/,/2/) || /Yes/ 13876@error{} gawk: cmd. line:1: ^ syntax error 13877@end example 13878 13879@cindex range patterns @subentry line continuation and 13880@cindex dark corner @subentry range patterns, line continuation and 13881As a minor point of interest, although it is poor style, 13882POSIX allows you to put a newline after the comma in 13883a range pattern. @value{DARKCORNER} 13884 13885@node BEGIN/END 13886@subsection The @code{BEGIN} and @code{END} Special Patterns 13887 13888@cindex @code{BEGIN} pattern 13889@cindex @code{END} pattern 13890All the patterns described so far are for matching input records. 13891The @code{BEGIN} and @code{END} special patterns are different. 
13892They supply startup and cleanup actions for @command{awk} programs. 13893@code{BEGIN} and @code{END} rules must have actions; there is no default 13894action for these rules because there is no current record when they run. 13895@code{BEGIN} and @code{END} rules are often referred to as 13896``@code{BEGIN} and @code{END} blocks'' by longtime @command{awk} 13897programmers. 13898 13899@menu 13900* Using BEGIN/END:: How and why to use BEGIN/END rules. 13901* I/O And BEGIN/END:: I/O issues in BEGIN/END rules. 13902@end menu 13903 13904@node Using BEGIN/END 13905@subsubsection Startup and Cleanup Actions 13906 13907@cindex @code{BEGIN} pattern 13908@cindex @code{END} pattern 13909A @code{BEGIN} rule is executed once only, before the first input record 13910is read. Likewise, an @code{END} rule is executed once only, after all the 13911input is read. For example: 13912 13913@example 13914$ @kbd{awk '} 13915> @kbd{BEGIN @{ print "Analysis of \"li\"" @}} 13916> @kbd{/li/ @{ ++n @}} 13917> @kbd{END @{ print "\"li\" appears in", n, "records." @}' mail-list} 13918@print{} Analysis of "li" 13919@print{} "li" appears in 4 records. 13920@end example 13921 13922@cindex @code{BEGIN} pattern @subentry operators and 13923@cindex @code{END} pattern @subentry operators and 13924This program finds the number of records in the input file @file{mail-list} 13925that contain the string @samp{li}. The @code{BEGIN} rule prints a title 13926for the report. There is no need to use the @code{BEGIN} rule to 13927initialize the counter @code{n} to zero, as @command{awk} does this 13928automatically (@pxref{Variables}). 13929The second rule increments the variable @code{n} every time a 13930record containing the pattern @samp{li} is read. The @code{END} rule 13931prints the value of @code{n} at the end of the run. 13932 13933The special patterns @code{BEGIN} and @code{END} cannot be used in ranges 13934or with Boolean operators (indeed, they cannot be used with any operators). 
An @command{awk} program may have multiple @code{BEGIN} and/or @code{END}
rules. They are executed in the order in which they appear: all the @code{BEGIN}
rules at startup and all the @code{END} rules at termination.

@code{BEGIN} and @code{END} rules may be intermixed with other rules.
This feature was added in the 1987 version of @command{awk} and is included
in the POSIX standard.
The original (1978) version of @command{awk}
required the @code{BEGIN} rule to be placed at the beginning of the
program, the @code{END} rule to be placed at the end, and only allowed one of
each.
This is no longer required, but it is a good idea to follow this template
in terms of program organization and readability.

Multiple @code{BEGIN} and @code{END} rules are useful for writing
library functions, because each library file can have its own @code{BEGIN} and/or
@code{END} rule to do its own initialization and/or cleanup.
The order in which library functions are named on the command line
controls the order in which their @code{BEGIN} and @code{END} rules are
executed. Therefore, you have to be careful when writing such rules in
library files so that the order in which they are executed doesn't matter.
@xref{Options} for more information on
using library functions.
@xref{Library Functions},
for a number of useful library functions.

If an @command{awk} program has only @code{BEGIN} rules and no
other rules, then the program exits after the @code{BEGIN} rules are
run.@footnote{The original version of @command{awk} kept
reading and ignoring input until the end of the file was seen.} However, if an
@code{END} rule exists, then the input is read, even if there are
no other rules in the program. This is necessary in case the @code{END}
rule checks the @code{FNR} and @code{NR} variables, or the fields.
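
The ordering rules described above can be checked with a short experiment.
The following is only a sketch: the three-line input held in the shell
variable @code{input} and the labels @samp{B1}, @samp{B2}, @samp{E1},
and @samp{E2} are made up for illustration and do not come from any
sample file used in this @value{DOCUMENT}.

```shell
# Hypothetical three-line input, generated inline so no data file is needed.
input='a
b
c'

# Two BEGIN rules and two END rules, intermixed with a main rule.
# Rules of each kind run in the order in which they appear in the program.
order=$(printf '%s\n' "$input" | awk '
BEGIN { printf "B1 " }
      { n++ }
END   { printf "E1 " }
BEGIN { printf "B2 " }
END   { printf "E2 n=%d\n", n }')
echo "$order"

# A program with only an END rule: the input is still read,
# so NR holds the number of records by the time END runs.
count=$(printf '%s\n' "$input" | awk 'END { print NR }')
echo "$count"
```

Both @code{BEGIN} rules run before any record is read, even though the
second one appears after the main rule, and the second command shows
that the mere presence of an @code{END} rule causes the input to be consumed.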
13968 13969@node I/O And BEGIN/END 13970@subsubsection Input/Output from @code{BEGIN} and @code{END} Rules 13971 13972@cindex input/output @subentry from @code{BEGIN} and @code{END} 13973There are several (sometimes subtle) points to be aware of when doing I/O 13974from a @code{BEGIN} or @code{END} rule. 13975The first has to do with the value of @code{$0} in a @code{BEGIN} 13976rule. Because @code{BEGIN} rules are executed before any input is read, 13977there simply is no input record, and therefore no fields, when 13978executing @code{BEGIN} rules. References to @code{$0} and the fields 13979yield a null string or zero, depending upon the context. One way 13980to give @code{$0} a real value is to execute a @code{getline} command 13981without a variable (@pxref{Getline}). 13982Another way is simply to assign a value to @code{$0}. 13983 13984@cindex Brian Kernighan's @command{awk} 13985@cindex differences in @command{awk} and @command{gawk} @subentry @code{BEGIN}/@code{END} patterns 13986@cindex POSIX @command{awk} @subentry @code{BEGIN}/@code{END} patterns 13987@cindex @code{print} statement @subentry @code{BEGIN}/@code{END} patterns and 13988@cindex @code{BEGIN} pattern @subentry @code{print} statement and 13989@cindex @code{END} pattern @subentry @code{print} statement and 13990The second point is similar to the first, but from the other direction. 13991Traditionally, due largely to implementation issues, @code{$0} and 13992@code{NF} were @emph{undefined} inside an @code{END} rule. 13993The POSIX standard specifies that @code{NF} is available in an @code{END} 13994rule. It contains the number of fields from the last input record. 13995@c FIXME: Update this if POSIX is ever fixed. 13996Most probably due to an oversight, the standard does not say that @code{$0} 13997is also preserved, although logically one would think that it should be. 
13998In fact, all of BWK @command{awk}, @command{mawk}, and @command{gawk} 13999preserve the value of @code{$0} for use in @code{END} rules. Be aware, 14000however, that some other implementations and many older versions 14001of Unix @command{awk} do not. 14002 14003The third point follows from the first two. The meaning of @samp{print} 14004inside a @code{BEGIN} or @code{END} rule is the same as always: 14005@samp{print $0}. If @code{$0} is the null string, then this prints an 14006empty record. Many longtime @command{awk} programmers use an unadorned 14007@samp{print} in @code{BEGIN} and @code{END} rules to mean @samp{@w{print ""}}, 14008relying on @code{$0} being null. Although one might generally get away with 14009this in @code{BEGIN} rules, it is a very bad idea in @code{END} rules, 14010at least in @command{gawk}. It is also poor style, because if an empty 14011line is needed in the output, the program should print one explicitly. 14012 14013@cindex @code{next} statement @subentry @code{BEGIN}/@code{END} patterns and 14014@cindex @code{nextfile} statement @subentry @code{BEGIN}/@code{END} patterns and 14015@cindex @code{BEGIN} pattern @subentry @code{next}/@code{nextfile} statements and 14016@cindex @code{END} pattern @subentry @code{next}/@code{nextfile} statements and 14017Finally, the @code{next} and @code{nextfile} statements are not allowed 14018in a @code{BEGIN} rule, because the implicit 14019read-a-record-and-match-against-the-rules loop has not started yet. Similarly, those statements 14020are not valid in an @code{END} rule, because all the input has been read. 14021(@xref{Next Statement} and 14022@ifnotdocbook 14023@pxref{Nextfile Statement}.) 14024@end ifnotdocbook 14025@ifdocbook 14026@ref{Nextfile Statement}.) 
14027@end ifdocbook 14028 14029@node BEGINFILE/ENDFILE 14030@subsection The @code{BEGINFILE} and @code{ENDFILE} Special Patterns 14031@cindex @code{BEGINFILE} pattern 14032@cindex @code{ENDFILE} pattern 14033@cindex differences in @command{awk} and @command{gawk} @subentry @code{BEGINFILE}/@code{ENDFILE} patterns 14034 14035This @value{SECTION} describes a @command{gawk}-specific feature. 14036 14037Two special kinds of rule, @code{BEGINFILE} and @code{ENDFILE}, give 14038you ``hooks'' into @command{gawk}'s command-line file processing loop. 14039As with the @code{BEGIN} and @code{END} rules 14040@ifnottex 14041@ifnotdocbook 14042(@pxref{BEGIN/END}), 14043@end ifnotdocbook 14044@end ifnottex 14045@iftex 14046(see the previous @value{SECTION}), 14047@end iftex 14048@ifdocbook 14049(see the previous @value{SECTION}), 14050@end ifdocbook 14051@code{BEGINFILE} rules in a program execute in the order they are 14052read by @command{gawk}. Similarly, all @code{ENDFILE} rules also execute in 14053the order they are read. 14054 14055The bodies of the @code{BEGINFILE} rules execute just before 14056@command{gawk} reads the first record from a file. @code{FILENAME} 14057is set to the name of the current file, and @code{FNR} is set to zero. 14058 14059Prior to @value{PVERSION} 5.1.1 of @command{gawk}, as an accident of the 14060implementation, @code{$0} and the fields retained any previous values 14061they had in @code{BEGINFILE} rules. Starting with @value{PVERSION} 140625.1.1, @code{$0} and the fields are cleared, since no record has been 14063read yet from the file that is about to be processed. 14064 14065The @code{BEGINFILE} rule provides you the opportunity to accomplish two tasks 14066that would otherwise be difficult or impossible to perform: 14067 14068@itemize @value{BULLET} 14069@item 14070You can test if the file is readable. Normally, it is a fatal error if a 14071file named on the command line cannot be opened for reading. 
However, 14072you can bypass the fatal error and move on to the next file on the 14073command line. 14074 14075@cindex @command{gawk} @subentry @code{ERRNO} variable in 14076@cindex @code{ERRNO} variable @subentry with @code{BEGINFILE} pattern 14077@cindex @code{nextfile} statement @subentry @code{BEGINFILE}/@code{ENDFILE} patterns and 14078You do this by checking if the @code{ERRNO} variable is not the empty 14079string; if so, then @command{gawk} was not able to open the file. In 14080this case, your program can execute the @code{nextfile} statement 14081(@pxref{Nextfile Statement}). This causes @command{gawk} to skip 14082the file entirely. Otherwise, @command{gawk} exits with the usual 14083fatal error. 14084 14085@item 14086If you have written extensions that modify the record handling (by 14087inserting an ``input parser''; @pxref{Input Parsers}), you can invoke 14088them at this point, before @command{gawk} has started processing the file. 14089(This is a @emph{very} advanced feature, currently used only by the 14090@uref{https://sourceforge.net/projects/gawkextlib, @code{gawkextlib} project}.) 14091@end itemize 14092 14093The @code{ENDFILE} rule is called when @command{gawk} has finished processing 14094the last record in an input file. For the last input file, 14095it will be called before any @code{END} rules. 14096The @code{ENDFILE} rule is executed even for empty input files. 14097 14098Normally, when an error occurs when reading input in the normal 14099input-processing loop, the error is fatal. However, if a @code{BEGINFILE} 14100rule is present, the error becomes non-fatal, and instead @code{ERRNO} 14101is set. This makes it possible to catch and process I/O errors at the 14102level of the @command{awk} program. 14103 14104@cindex @code{next} statement @subentry @code{BEGINFILE}/@code{ENDFILE} patterns and 14105The @code{next} statement (@pxref{Next Statement}) is not allowed inside 14106either a @code{BEGINFILE} or an @code{ENDFILE} rule. 
The @code{nextfile} 14107statement is allowed only inside a 14108@code{BEGINFILE} rule, not inside an @code{ENDFILE} rule. 14109 14110@cindex @code{getline} command @subentry @code{BEGINFILE}/@code{ENDFILE} patterns and 14111The @code{getline} statement (@pxref{Getline}) is restricted inside 14112both @code{BEGINFILE} and @code{ENDFILE}: only redirected 14113forms of @code{getline} are allowed. 14114 14115@code{BEGINFILE} and @code{ENDFILE} are @command{gawk} extensions. 14116In most other @command{awk} implementations, or if @command{gawk} is in 14117compatibility mode (@pxref{Options}), they are not special. 14118 14119@node Empty 14120@subsection The Empty Pattern 14121 14122@cindex empty pattern 14123@cindex patterns @subentry empty 14124An empty (i.e., nonexistent) pattern is considered to match @emph{every} 14125input record. For example, the program: 14126 14127@example 14128awk '@{ print $1 @}' mail-list 14129@end example 14130 14131@noindent 14132prints the first field of every record. 14133 14134@node Using Shell Variables 14135@section Using Shell Variables in Programs 14136@cindex shells @subentry variables 14137@cindex @command{awk} programs @subentry shell variables in 14138@c @cindex shell and @command{awk} interaction 14139 14140@command{awk} programs are often used as components in larger 14141programs written in shell. 14142For example, it is very common to use a shell variable to 14143hold a pattern that the @command{awk} program searches for. 14144There are two ways to get the value of the shell variable 14145into the body of the @command{awk} program. 14146 14147@cindex shells @subentry quoting 14148A common method is to use shell quoting to substitute 14149the variable's value into the program inside the script. 
14150For example, consider the following program: 14151 14152@example 14153@group 14154printf "Enter search pattern: " 14155read pattern 14156awk "/$pattern/ "'@{ nmatches++ @} 14157 END @{ print nmatches, "found" @}' /path/to/data 14158@end group 14159@end example 14160 14161@noindent 14162The @command{awk} program consists of two pieces of quoted text 14163that are concatenated together to form the program. 14164The first part is double-quoted, which allows substitution of 14165the @code{pattern} shell variable inside the quotes. 14166The second part is single-quoted. 14167 14168Variable substitution via quoting works, but can potentially be 14169messy. It requires a good understanding of the shell's quoting rules 14170(@pxref{Quoting}), 14171and it's often difficult to correctly 14172match up the quotes when reading the program. 14173 14174A better method is to use @command{awk}'s variable assignment feature 14175(@pxref{Assignment Options}) 14176to assign the shell variable's value to an @command{awk} variable. 14177Then use dynamic regexps to match the pattern 14178(@pxref{Computed Regexps}). 14179The following shows how to redo the 14180previous example using this technique: 14181 14182@example 14183printf "Enter search pattern: " 14184read pattern 14185awk -v pat="$pattern" '$0 ~ pat @{ nmatches++ @} 14186 END @{ print nmatches, "found" @}' /path/to/data 14187@end example 14188 14189@noindent 14190Now, the @command{awk} program is just one single-quoted string. 14191The assignment @samp{-v pat="$pattern"} still requires double quotes, 14192in case there is whitespace in the value of @code{$pattern}. 14193The @command{awk} variable @code{pat} could be named @code{pattern} 14194too, but that would be more confusing. 
Using a variable also 14195provides more flexibility, as the variable can be used anywhere inside 14196the program---for printing, as an array subscript, or for any other 14197use---without requiring the quoting tricks at every point in the program. 14198 14199@node Action Overview 14200@section Actions 14201@c @cindex action, definition of 14202@c @cindex curly braces 14203@c @cindex action, curly braces 14204@c @cindex action, separating statements 14205@cindex actions 14206 14207An @command{awk} program or script consists of a series of 14208rules and function definitions interspersed. (Functions are 14209described later. @xref{User-defined}.) 14210A rule contains a pattern and an action, either of which (but not 14211both) may be omitted. The purpose of the @dfn{action} is to tell 14212@command{awk} what to do once a match for the pattern is found. Thus, 14213in outline, an @command{awk} program generally looks like this: 14214 14215@display 14216[@var{pattern}] @code{@{ @var{action} @}} 14217 @var{pattern} [@code{@{ @var{action} @}}] 14218@dots{} 14219@code{function @var{name}(@var{args}) @{ @dots{} @}} 14220@dots{} 14221@end display 14222 14223@cindex @code{@{@}} (braces) @subentry actions and 14224@cindex braces (@code{@{@}}) @subentry actions and 14225@cindex separators @subentry for statements in actions 14226@cindex newlines @subentry separating statements in actions 14227@cindex @code{;} (semicolon) @subentry separating statements in actions 14228@cindex semicolon (@code{;}) @subentry separating statements in actions 14229An action consists of one or more @command{awk} @dfn{statements}, enclosed 14230in braces (@samp{@{@r{@dots{}}@}}). Each statement specifies one 14231thing to do. The statements are separated by newlines or semicolons. 14232The braces around an action must be used even if the action 14233contains only one statement, or if it contains no statements at 14234all. However, if you omit the action entirely, omit the braces as 14235well. 
An omitted action is equivalent to @samp{@{ print $0 @}}: 14236 14237@example 14238/foo/ @{ @} @ii{match @code{foo}, do nothing --- empty action} 14239/foo/ @ii{match @code{foo}, print the record --- omitted action} 14240@end example 14241 14242The following types of statements are supported in @command{awk}: 14243 14244@table @asis 14245@cindex side effects @subentry statements 14246@item Expressions 14247Call functions or assign values to variables 14248(@pxref{Expressions}). Executing 14249this kind of statement simply computes the value of the expression. 14250This is useful when the expression has side effects 14251(@pxref{Assignment Ops}). 14252 14253@item Control statements 14254Specify the control flow of @command{awk} 14255programs. The @command{awk} language gives you C-like constructs 14256(@code{if}, @code{for}, @code{while}, and @code{do}) as well as a few 14257special ones (@pxref{Statements}). 14258 14259@item Compound statements 14260Enclose one or more statements in braces. A compound statement 14261is used in order to put several statements together in the body of an 14262@code{if}, @code{while}, @code{do}, or @code{for} statement. 14263 14264@item Input statements 14265Use the @code{getline} command 14266(@pxref{Getline}). 14267Also supplied in @command{awk} are the @code{next} 14268statement (@pxref{Next Statement}) 14269and the @code{nextfile} statement 14270(@pxref{Nextfile Statement}). 14271 14272@item Output statements 14273Such as @code{print} and @code{printf}. 14274@xref{Printing}. 14275 14276@item Deletion statements 14277For deleting array elements. 14278@xref{Delete}. 14279@end table 14280 14281@node Statements 14282@section Control Statements in Actions 14283@cindex control statements 14284@cindex statements @subentry control, in actions 14285@cindex actions @subentry control statements in 14286 14287@dfn{Control statements}, such as @code{if}, @code{while}, and so on, 14288control the flow of execution in @command{awk} programs. 
Most of @command{awk}'s 14289control statements are patterned after similar statements in C. 14290 14291@cindex compound statements, control statements and 14292@cindex statements @subentry compound, control statements and 14293@cindex body @subentry in actions 14294@cindex @code{@{@}} (braces) @subentry statements, grouping 14295@cindex braces (@code{@{@}}) @subentry statements, grouping 14296@cindex newlines @subentry separating statements in actions 14297@cindex @code{;} (semicolon) @subentry separating statements in actions 14298@cindex semicolon (@code{;}) @subentry separating statements in actions 14299All the control statements start with special keywords, such as @code{if} 14300and @code{while}, to distinguish them from simple expressions. 14301Many control statements contain other statements. For example, the 14302@code{if} statement contains another statement that may or may not be 14303executed. The contained statement is called the @dfn{body}. 14304To include more than one statement in the body, group them into a 14305single @dfn{compound statement} with braces, separating them with 14306newlines or semicolons. 14307 14308@menu 14309* If Statement:: Conditionally execute some @command{awk} 14310 statements. 14311* While Statement:: Loop until some condition is satisfied. 14312* Do Statement:: Do specified action while looping until some 14313 condition is satisfied. 14314* For Statement:: Another looping statement, that provides 14315 initialization and increment clauses. 14316* Switch Statement:: Switch/case evaluation for conditional 14317 execution of statements based on a value. 14318* Break Statement:: Immediately exit the innermost enclosing loop. 14319* Continue Statement:: Skip to the end of the innermost enclosing 14320 loop. 14321* Next Statement:: Stop processing the current input record. 14322* Nextfile Statement:: Stop processing the current file. 14323* Exit Statement:: Stop execution of @command{awk}. 
@end menu

@node If Statement
@subsection The @code{if}-@code{else} Statement

@cindex @code{if} statement
The @code{if}-@code{else} statement is @command{awk}'s decision-making
statement. It looks like this:

@display
@code{if (@var{condition}) @var{then-body}} [@code{else @var{else-body}}]
@end display

@noindent
The @var{condition} is an expression that controls what the rest of the
statement does. If the @var{condition} is true, @var{then-body} is
executed; otherwise, @var{else-body} is executed.
The @code{else} part of the statement is
optional. The condition is considered false if its value is zero or
the null string; otherwise, the condition is true.
Refer to the following:

@example
@group
if (x % 2 == 0)
    print "x is even"
else
    print "x is odd"
@end group
@end example

In this example, if the expression @samp{x % 2 == 0} is true (i.e.,
if the value of @code{x} is evenly divisible by two), then the first
@code{print} statement is executed; otherwise, the second @code{print}
statement is executed.
If the @code{else} keyword appears on the same line as @var{then-body} and
@var{then-body} is not a compound statement (i.e., not surrounded by
braces), then a semicolon must separate @var{then-body} from
the @code{else}.
To illustrate this, the previous example can be rewritten as:

@example
if (x % 2 == 0) print "x is even"; else
        print "x is odd"
@end example

@noindent
If the @samp{;} is left out, @command{awk} can't interpret the statement and
it produces a syntax error. Don't actually write programs this way,
because a human reader might fail to see the @code{else} if it is not
the first thing on its line.
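
To try the one-line form from the shell, something like the following
sketch can be used. The shell function name @code{classify} is
hypothetical and exists only for this illustration; it feeds a single
number to @command{awk} as the first field of one record.

```shell
# classify is a hypothetical helper name used only for this sketch.
classify() {
    echo "$1" | awk '{
        x = $1
        # then-body and else share a line, so the semicolon is required
        if (x % 2 == 0) print "x is even"; else print "x is odd"
    }'
}

classify 4
classify 7
```

Removing the @samp{;} before @code{else} in this sketch should make
@command{awk} report a syntax error, as described above.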

@node While Statement
@subsection The @code{while} Statement
@cindex @code{while} statement
@cindex loops
@cindex loops @subentry @code{while}
@cindex loops @seealso{@code{while} statement}

In programming, a @dfn{loop} is a part of a program that can
be executed two or more times in succession.
The @code{while} statement is the simplest looping statement in
@command{awk}. It repeatedly executes a statement as long as a condition is
true. It looks like this:

@example
while (@var{condition})
  @var{body}
@end example

@cindex body @subentry in loops
@noindent
@var{body} is a statement called the @dfn{body} of the loop,
and @var{condition} is an expression that controls how long the loop
keeps running.
The first thing the @code{while} statement does is test the @var{condition}.
If the @var{condition} is true, it executes the statement @var{body}.
@ifinfo
(The @var{condition} is true when the value
is not zero and not a null string.)
@end ifinfo
After @var{body} has been executed,
@var{condition} is tested again, and if it is still true, @var{body}
executes again. This process repeats until the @var{condition} is no longer
true. If the @var{condition} is initially false, the body of the loop
never executes and @command{awk} continues with the statement following
the loop.
This example prints the first three fields of each record, one per line:

@example
awk '
@{
    i = 1
    while (i <= 3) @{
        print $i
        i++
    @}
@}' inventory-shipped
@end example

@noindent
The body of this loop is a compound statement enclosed in braces,
containing two statements.
The loop works in the following manner: first, the value of @code{i} is set to one.
Then, the @code{while} statement tests whether @code{i} is less than or equal to
three. This is true when @code{i} equals one, so the @code{i}th
field is printed. Then the @samp{i++} increments the value of @code{i}
and the loop repeats. The loop terminates when @code{i} reaches four.

A newline is not required between the condition and the
body; however, using one makes the program clearer unless the body is a
compound statement or else is very simple. The newline after the open brace
that begins the compound statement is not required either, but the
program is harder to read without it.

@node Do Statement
@subsection The @code{do}-@code{while} Statement
@cindex @code{do}-@code{while} statement
@cindex loops @subentry @code{do}-@code{while}

The @code{do} loop is a variation of the @code{while} looping statement.
The @code{do} loop executes the @var{body} once and then repeats the
@var{body} as long as the @var{condition} is true. It looks like this:

@example
do
  @var{body}
while (@var{condition})
@end example

Even if the @var{condition} is false at the start, the @var{body}
executes at least once (and only once, unless executing @var{body}
makes @var{condition} true). Contrast this with the corresponding
@code{while} statement:

@example
while (@var{condition})
    @var{body}
@end example

@noindent
This statement does not execute the @var{body} even once if the
@var{condition} is false to begin with. The following is an example of
a @code{do} statement:

@example
@{
    i = 1
    do @{
        print $0
        i++
    @} while (i <= 10)
@}
@end example

@noindent
This program prints each input record 10 times.
However, it isn't a very
realistic example, because in this case an ordinary @code{while} would do
just as well. This situation reflects actual experience; only
occasionally is there a real use for a @code{do} statement.

@node For Statement
@subsection The @code{for} Statement
@cindex @code{for} statement
@cindex loops @subentry @code{for} @subentry iterative

The @code{for} statement makes it more convenient to count iterations of a
loop. The general form of the @code{for} statement looks like this:

@example
for (@var{initialization}; @var{condition}; @var{increment})
  @var{body}
@end example

@noindent
The @var{initialization}, @var{condition}, and @var{increment} parts are
arbitrary @command{awk} expressions, and @var{body} stands for any
@command{awk} statement.

The @code{for} statement starts by executing @var{initialization}.
Then, as long
as the @var{condition} is true, it repeatedly executes @var{body} and then
@var{increment}. Typically, @var{initialization} sets a variable to
either zero or one, @var{increment} adds one to it, and @var{condition}
compares it against the desired number of iterations.
For example:

@example
awk '
@{
    for (i = 1; i <= 3; i++)
        print $i
@}' inventory-shipped
@end example

@noindent
This prints the first three fields of each input record, with one
input field per output line.
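
As a further sketch of the same counting pattern, the loop can cover
every field of a record rather than a fixed number of them by comparing
the counter against the built-in variable @code{NF}. This variant is an
illustration added here, not one of the manual's own sample programs:

@example
awk '
@{
    # NF holds the number of fields in the current record
    for (i = 1; i <= NF; i++)
        print i, $i
@}' inventory-shipped
@end example

@noindent
Each output line shows a field's position followed by the field itself.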

@c @cindex comma operator, not supported
C and C++ programmers might expect to be able to use the comma
operator to set more than one variable in the @var{initialization}
part of the @code{for} loop, or to increment multiple variables in the
@var{increment} part of the loop, like so:

@example
for (i = 0, j = length(a); i < j; i++, j--) @dots{}   @ii{C/C++, not awk!}
@end example

@noindent
You cannot do this; the comma operator is not supported in @command{awk}.
There are workarounds, but they are nonobvious and can lead to
code that is difficult to read and understand. It is best, therefore,
to simply write additional initializations as separate statements
preceding the @code{for} loop and to place additional increment statements
at the end of the loop's body.

Most often, @var{increment} is an increment expression, as in the earlier
example. But this is not required; it can be any expression
whatsoever. For example, the following statement prints all the powers of two
between 1 and 100:

@example
for (i = 1; i <= 100; i *= 2)
    print i
@end example

If there is nothing to be done, any of the three expressions in the
parentheses following the @code{for} keyword may be omitted. Thus,
@w{@samp{for (; x > 0;)}} is equivalent to @w{@samp{while (x > 0)}}. If the
@var{condition} is omitted, it is treated as true, effectively
yielding an @dfn{infinite loop} (i.e., a loop that never terminates).
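
The advice given earlier about doing without the comma operator can be
sketched as follows. The array contents and the variable names
@code{n}, @code{i}, and @code{j} are arbitrary choices made for
illustration:

@example
BEGIN @{
    n = split("a b c d", a, " ")
    j = n                  # extra initialization goes before the loop
    for (i = 1; i < j; i++) @{
        print a[i], a[j]
        j--                # extra increment goes at the end of the body
    @}
@}
@end example

@noindent
This prints @samp{a d} and then @samp{b c}, walking inward from both
ends of the array. Because @samp{j--} is an ordinary statement in the
body, anything that skips the rest of the body also skips it.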

In most cases, a @code{for} loop is an abbreviation for a @code{while}
loop, as shown here:

@example
@var{initialization}
while (@var{condition}) @{
  @var{body}
  @var{increment}
@}
@end example

@cindex loops @subentry @code{continue} statement and
@noindent
The only exception is when the @code{continue} statement
(@pxref{Continue Statement}) is used
inside the loop. Changing a @code{for} statement to a @code{while}
statement in this way can change the effect of the @code{continue}
statement inside the loop.

The @command{awk} language has a @code{for} statement in addition to a
@code{while} statement because a @code{for} loop is often both less work to
type and more natural to think of. Counting the number of iterations is
very common in loops. It can be easier to think of this counting as part
of looping rather than as something to do inside the loop.

@cindex @code{in} operator
There is an alternative version of the @code{for} loop, for iterating over
all the indices of an array:

@example
for (i in array)
    @var{do something with} array[i]
@end example

@noindent
@xref{Scanning an Array}
for more information on this version of the @code{for} loop.

@node Switch Statement
@subsection The @code{switch} Statement
@cindex @code{switch} statement
@cindex @code{case} keyword
@cindex @code{default} keyword

This @value{SECTION} describes a @command{gawk}-specific feature.
If @command{gawk} is in compatibility mode (@pxref{Options}),
it is not available.

The @code{switch} statement allows the evaluation of an expression and
the execution of statements based on a @code{case} match. Case statements
are checked for a match in the order they are defined.
If no suitable
@code{case} is found, the @code{default} section is executed, if supplied.

Each @code{case} contains a single constant, be it numeric, string,
or regexp. The @code{switch} expression is evaluated, and then each
@code{case}'s constant is compared against the result in turn. The
type of constant determines the comparison: numeric or string do the
usual comparisons. A regexp constant (either regular, @code{/foo/}, or
strongly typed, @code{@@/foo/}) does a regular expression match against
the string value of the original expression. The general form of the
@code{switch} statement looks like this:

@example
switch (@var{expression}) @{
case @var{value or regular expression}:
    @var{case-body}
default:
    @var{default-body}
@}
@end example

Control flow in
the @code{switch} statement works as it does in C. Once a match to a given
case is made, the case statement bodies execute until a @code{break},
@code{continue}, @code{next}, @code{nextfile}, or @code{exit} is encountered,
or the end of the @code{switch} statement itself. For example:

@example
while ((c = getopt(ARGC, ARGV, "aksx")) != -1) @{
    switch (c) @{
    case "a":
        # report size of all files
        all_files = TRUE;
        break
    case "k":
        BLOCK_SIZE = 1024       # 1K block size
        break
    case "s":
        # do sums only
        sum_only = TRUE
        break
    case "x":
        # don't cross filesystems
        fts_flags = or(fts_flags, FTS_XDEV)
        break
    case "?":
    default:
        usage()
        break
    @}
@}
@end example

Note that if none of the statements specified here halt execution
of a matched @code{case} statement, execution falls through to the
next @code{case} until execution halts.
In this example, the
@code{case} for @code{"?"} falls through to the @code{default}
case, which is to call a function named @code{usage()}.
(The @code{getopt()} function being called here is
described in @ref{Getopt Function}.)

@node Break Statement
@subsection The @code{break} Statement
@cindex @code{break} statement
@cindex loops @subentry exiting
@cindex loops @subentry @code{break} statement and

The @code{break} statement jumps out of the innermost @code{for},
@code{while}, or @code{do} loop that encloses it. The following example
finds the smallest divisor of any integer, and also identifies prime
numbers:

@example
@group
# find smallest divisor of num
@{
    num = $1
    for (divisor = 2; divisor * divisor <= num; divisor++) @{
        if (num % divisor == 0)
            break
    @}
@end group
@group
    if (num % divisor == 0)
        printf "Smallest divisor of %d is %d\n", num, divisor
    else
        printf "%d is prime\n", num
@}
@end group
@end example

When the remainder is zero in the first @code{if} statement, @command{awk}
immediately @dfn{breaks out} of the containing @code{for} loop. This means
that @command{awk} proceeds immediately to the statement following the loop
and continues processing. (This is very different from the @code{exit}
statement, which stops the entire @command{awk} program.
@xref{Exit Statement}.)
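
The word ``innermost'' matters when loops are nested. The following
is a small sketch, with loop bounds chosen arbitrarily for
illustration, in which @code{break} leaves only the inner loop while
the outer loop keeps running:

@example
BEGIN @{
    for (i = 1; i <= 3; i++) @{
        for (j = 1; j <= 3; j++) @{
            if (j == 2)
                break          # leaves only the inner (j) loop
            print i, j
        @}
    @}
@}
@end example

@noindent
This prints @samp{1 1}, @samp{2 1}, and @samp{3 1}, one per line: the
inner loop is cut short each time, but the outer loop still completes
all three iterations.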

The following program illustrates how the @var{condition} of a @code{for}
or @code{while} statement could be replaced with a @code{break} inside
an @code{if}:

@example
# find smallest divisor of num
@{
    num = $1
    for (divisor = 2; ; divisor++) @{
        if (num % divisor == 0) @{
            printf "Smallest divisor of %d is %d\n", num, divisor
            break
        @}
        if (divisor * divisor > num) @{
            printf "%d is prime\n", num
            break
        @}
    @}
@}
@end example

The @code{break} statement is also used to break out of the
@code{switch} statement.
This is discussed in @ref{Switch Statement}.

@c @cindex @code{break}, outside of loops
@c @cindex historical features
@c @cindex @command{awk} language, POSIX version
@cindex POSIX @command{awk} @subentry @code{break} statement and
@cindex dark corner @subentry @code{break} statement
@cindex @command{gawk} @subentry @code{break} statement in
@cindex Brian Kernighan's @command{awk}
The @code{break} statement has no meaning when
used outside the body of a loop or @code{switch}.
However, although it was never documented,
historical implementations of @command{awk} treated the @code{break}
statement outside of a loop as if it were a @code{next} statement
(@pxref{Next Statement}).
@value{DARKCORNER}
Recent versions of BWK @command{awk} no longer allow this usage,
nor does @command{gawk}.

@node Continue Statement
@subsection The @code{continue} Statement

@cindex @code{continue} statement
Similar to @code{break}, the @code{continue} statement is used only inside
@code{for}, @code{while}, and @code{do} loops. It skips
over the rest of the loop body, causing the next cycle around the loop
to begin immediately. Contrast this with @code{break}, which jumps out
of the loop altogether.
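
The contrast can be seen by using both statements in one loop. This is
only an illustrative sketch; the values 2 and 4 are arbitrary:

@example
BEGIN @{
    for (i = 1; i <= 5; i++) @{
        if (i == 2)
            continue    # skip the print for 2 and go on to 3
        if (i == 4)
            break       # leave the loop entirely
        print i
    @}
@}
@end example

@noindent
This prints 1 and 3. The value 2 is skipped by @code{continue}, and
the loop ends at 4 because of @code{break}, so 5 is never reached.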

The @code{continue} statement in a @code{for} loop directs @command{awk} to
skip the rest of the body of the loop and resume execution with the
increment-expression of the @code{for} statement. The following program
illustrates this fact:

@example
BEGIN @{
     for (x = 0; x <= 20; x++) @{
         if (x == 5)
             continue
         printf "%d ", x
     @}
     print ""
@}
@end example

@noindent
This program prints all the numbers from 0 to 20---except for 5, for
which the @code{printf} is skipped. Because the increment @samp{x++}
is not skipped, @code{x} does not remain stuck at 5. Contrast the
@code{for} loop from the previous example with the following @code{while} loop:

@example
BEGIN @{
     x = 0
     while (x <= 20) @{
         if (x == 5)
             continue
         printf "%d ", x
         x++
     @}
     print ""
@}
@end example

@noindent
This program loops forever once @code{x} reaches 5, because
the increment (@samp{x++}) is never reached.

@c @cindex @code{continue}, outside of loops
@c @cindex historical features
@c @cindex @command{awk} language, POSIX version
@cindex POSIX @command{awk} @subentry @code{continue} statement and
@cindex dark corner @subentry @code{continue} statement
@cindex @command{gawk} @subentry @code{continue} statement in
@cindex Brian Kernighan's @command{awk}
The @code{continue} statement has no special meaning with respect to the
@code{switch} statement, nor does it have any meaning when used outside the
body of a loop. Historical versions of @command{awk} treated a @code{continue}
statement outside a loop the same way they treated a @code{break}
statement outside a loop: as if it were a @code{next}
statement
@ifset FOR_PRINT
(discussed in the following @value{SECTION}).
@end ifset
@ifclear FOR_PRINT
(@pxref{Next Statement}).
@end ifclear
@value{DARKCORNER}
Recent versions of BWK @command{awk} no longer work this way, nor
does @command{gawk}.

@node Next Statement
@subsection The @code{next} Statement
@cindex @code{next} statement

The @code{next} statement forces @command{awk} to immediately stop processing
the current record and go on to the next record. This means that no
further rules are executed for the current record, and the rest of the
current rule's action isn't executed.

Contrast this with the effect of the @code{getline} function
(@pxref{Getline}). That also causes
@command{awk} to read the next record immediately, but it does not alter the
flow of control in any way (i.e., the rest of the current action executes
with a new input record).

@cindex @command{awk} programs @subentry execution of
At the highest level, @command{awk} program execution is a loop that reads
an input record and then tests each rule's pattern against it. If you
think of this loop as a @code{for} statement whose body contains the
rules, then the @code{next} statement is analogous to a @code{continue}
statement. It skips to the end of the body of this implicit loop and
executes the increment (which reads another record).

For example, suppose an @command{awk} program works only on records
with four fields, and it shouldn't fail when given bad input. To avoid
complicating the rest of the program, write a ``weed out'' rule near
the beginning, in the following manner:

@example
NF != 4 @{
    printf("%s:%d: skipped: NF != 4\n", FILENAME, FNR) > "/dev/stderr"
    next
@}
@end example

@noindent
Because of the @code{next} statement,
the program's subsequent rules won't see the bad record.
The error
message is redirected to the standard error output stream, as error
messages should be.
For more detail, see
@ref{Special Files}.

If the @code{next} statement causes the end of the input to be reached,
then the code in any @code{END} rules is executed.
@xref{BEGIN/END}.

The @code{next} statement is not allowed inside @code{BEGINFILE} and
@code{ENDFILE} rules. @xref{BEGINFILE/ENDFILE}.

@c @cindex @code{next}, inside a user-defined function
@cindex @command{awk} @subentry language, POSIX version
@cindex @code{BEGIN} pattern @subentry @code{next}/@code{nextfile} statements and
@cindex @code{END} pattern @subentry @code{next}/@code{nextfile} statements and
@cindex POSIX @command{awk} @subentry @code{next}/@code{nextfile} statements and
@cindex @code{next} statement @subentry user-defined functions and
@cindex functions @subentry user-defined @subentry @code{next}/@code{nextfile} statements and
According to the POSIX standard, the behavior is undefined if the
@code{next} statement is used in a @code{BEGIN} or @code{END} rule.
@command{gawk} treats it as a syntax error. Although POSIX does not disallow it,
most other @command{awk} implementations don't allow the @code{next}
statement inside function bodies (@pxref{User-defined}).
@command{gawk} does allow it; just as with any
other @code{next} statement, a @code{next} statement inside a function
body reads the next record and starts processing it with the first rule
in the program.

@node Nextfile Statement
@subsection The @code{nextfile} Statement
@cindex @code{nextfile} statement

The @code{nextfile} statement
is similar to the @code{next} statement.
However, instead of abandoning processing of the current record, the
@code{nextfile} statement instructs @command{awk} to stop processing the
current @value{DF}.
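
As a minimal sketch of what this looks like in practice, the following
command prints only the first record of each file and then moves on to
the next file. The file names @file{data1} and @file{data2} are
hypothetical placeholders:

@example
awk 'FNR == 1 @{ print FILENAME ": " $0; nextfile @}' data1 data2
@end example

@noindent
The rule fires once per file because @code{FNR} restarts at one for
each new file.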

Upon execution of the @code{nextfile} statement,
@code{FILENAME} is
updated to the name of the next @value{DF} listed on the command line,
@code{FNR} is reset to one,
and processing
starts over with the first rule in the program.
If the @code{nextfile} statement causes the end of the input to be reached,
then the code in any @code{END} rules is executed. An exception to this is
when @code{nextfile} is invoked during execution of any statement in an
@code{END} rule; in this case, it causes the program to stop immediately.
@xref{BEGIN/END}.

The @code{nextfile} statement is useful when there are many @value{DF}s
to process but it isn't necessary to process every record in every file.
Without @code{nextfile},
in order to move on to the next @value{DF}, a program
would have to continue scanning the unwanted records. The @code{nextfile}
statement accomplishes this much more efficiently.

In @command{gawk}, execution of @code{nextfile} causes additional things
to happen: any @code{ENDFILE} rules are executed if @command{gawk} is
not currently in an @code{END} rule, @code{ARGIND} is
incremented, and any @code{BEGINFILE} rules are executed. (@code{ARGIND}
hasn't been introduced yet. @xref{Built-in Variables}.)

There is an additional, special, use case
with @command{gawk}. @code{nextfile} is useful inside a @code{BEGINFILE}
rule to skip over a file that would otherwise cause @command{gawk}
to exit with a fatal error. In this special case, @code{ENDFILE} rules are not
executed. @xref{BEGINFILE/ENDFILE}.

Although it might seem that @samp{close(FILENAME)} would accomplish
the same as @code{nextfile}, this isn't true. @code{close()} is
reserved for closing files, pipes, and coprocesses that are
opened with redirections.
It is not related to the main processing that
@command{awk} does with the files listed in @code{ARGV}.

@quotation NOTE
For many years, @code{nextfile} was a
common extension. In September 2012, it was accepted for
inclusion into the POSIX standard.
See @uref{http://austingroupbugs.net/view.php?id=607, the Austin Group website}.
@end quotation

@cindex functions @subentry user-defined @subentry @code{next}/@code{nextfile} statements and
@cindex @code{nextfile} statement @subentry user-defined functions and
@cindex Brian Kernighan's @command{awk}
@cindex @command{mawk} utility
The current version of BWK @command{awk} and @command{mawk}
also support @code{nextfile}. However, they don't allow the
@code{nextfile} statement inside function bodies (@pxref{User-defined}).
@command{gawk} does; a @code{nextfile} inside a function body reads the
first record from the next file and starts processing it with the first
rule in the program, just as any other @code{nextfile} statement.

@node Exit Statement
@subsection The @code{exit} Statement

@cindex @code{exit} statement
The @code{exit} statement causes @command{awk} to immediately stop
executing the current rule and to stop processing input; any remaining input
is ignored. The @code{exit} statement is written as follows:

@display
@code{exit} [@var{return code}]
@end display

@cindex @code{BEGIN} pattern @subentry @code{exit} statement and
@cindex @code{END} pattern @subentry @code{exit} statement and
When an @code{exit} statement is executed from a @code{BEGIN} rule, the
program stops processing everything immediately. No input records are
read. However, if an @code{END} rule is present,
as part of executing the @code{exit} statement,
the @code{END} rule is executed
(@pxref{BEGIN/END}).
If @code{exit} is used in the body of an @code{END} rule, it causes
the program to stop immediately.

An @code{exit} statement that is not part of a @code{BEGIN} or @code{END}
rule stops the execution of any further automatic rules for the current
record, skips reading any remaining input records, and executes the
@code{END} rule if there is one. @command{gawk} also skips
any @code{ENDFILE} rules; they do not execute.

In such a case,
if you don't want the @code{END} rule to do its job, set a variable
to a nonzero value before the @code{exit} statement and check that variable in
the @code{END} rule.
@xref{Assert Function}
for an example that does this.

@cindex dark corner @subentry @code{exit} statement
If an argument is supplied to @code{exit}, its value is used as the exit
status code for the @command{awk} process. If no argument is supplied,
@code{exit} causes @command{awk} to return a ``success'' status.
In the case where an argument
is supplied to a first @code{exit} statement, and then @code{exit} is
called a second time from an @code{END} rule with no argument,
@command{awk} uses the previously supplied exit value. @value{DARKCORNER}
@xref{Exit Status} for more information.

@cindex programming conventions @subentry @code{exit} statement
For example, suppose an error condition occurs that is difficult or
impossible to handle. Conventionally, programs report this by
exiting with a nonzero status.
An @command{awk} program can do this
using an @code{exit} statement with a nonzero argument, as shown
in the following example:

@example
@group
BEGIN @{
    if (("date" | getline date_now) <= 0) @{
        print "Can't get system date" > "/dev/stderr"
        exit 1
    @}
@end group
@group
    print "current date is", date_now
    close("date")
@}
@end group
@end example

@quotation NOTE
For full portability, exit values should be between zero and 126, inclusive.
Negative values, and values of 127 or greater, may not produce consistent
results across different operating systems.
@end quotation


@node Built-in Variables
@section Predefined Variables
@cindex predefined variables
@cindex variables @subentry predefined

Most @command{awk} variables are available to use for your own
purposes; they never change unless your program assigns values to
them, and they never affect anything unless your program examines them.
However, a few variables in @command{awk} have special built-in meanings.
@command{awk} examines some of these automatically, so that they enable you
to tell @command{awk} how to do certain things. Others are set
automatically by @command{awk}, so that they carry information from the
internal workings of @command{awk} to your program.

@cindex @command{gawk} @subentry predefined variables and
This @value{SECTION} documents all of @command{gawk}'s predefined variables,
most of which are also documented in the @value{CHAPTER}s describing
their areas of activity.

@menu
* User-modified::               Built-in variables that you change to control
                                @command{awk}.
* Auto-set::                    Built-in variables where @command{awk} gives
                                you information.
* ARGC and ARGV::               Ways to use @code{ARGC} and @code{ARGV}.
@end menu

@node User-modified
@subsection Built-in Variables That Control @command{awk}
@cindex predefined variables @subentry user-modifiable
@cindex user-modifiable variables

The following is an alphabetical list of variables that you can change to
control how @command{awk} does certain things.

The variables that are specific to @command{gawk} are marked with a pound
sign (@samp{#}). These variables are @command{gawk} extensions. In other
@command{awk} implementations or if @command{gawk} is in compatibility
mode (@pxref{Options}), they are not special. (Any exceptions are noted
in the description of each variable.)

@table @code
@cindex @code{BINMODE} variable
@cindex binary input/output
@cindex input/output @subentry binary
@cindex differences in @command{awk} and @command{gawk} @subentry @code{BINMODE} variable
@item BINMODE #
On non-POSIX systems, this variable specifies use of binary mode
for all I/O. Numeric values of one, two, or three specify that input
files, output files, or all files, respectively, should use binary I/O.
A numeric value less than zero is treated as zero, and a numeric value
greater than three is treated as three. Alternatively, string values
of @code{"r"} or @code{"w"} specify that input files and output files,
respectively, should use binary I/O. A string value of @code{"rw"} or
@code{"wr"} indicates that all files should use binary I/O. Any other
string value is treated the same as @code{"rw"}, but causes @command{gawk}
to generate a warning message. @code{BINMODE} is described in more
detail in @ref{PC Using}. @command{mawk} (@pxref{Other Versions})
also supports this variable, but only using numeric values.

@cindex @code{CONVFMT} variable
@cindex POSIX @command{awk} @subentry @code{CONVFMT} variable and
@cindex numbers @subentry converting @subentry to strings
@cindex strings @subentry converting @subentry numbers to
@item CONVFMT
A string that controls the conversion of numbers to
strings (@pxref{Conversion}).
It works by being passed, in effect, as the first argument to the
@code{sprintf()} function
(@pxref{String Functions}).
Its default value is @code{"%.6g"}.
@code{CONVFMT} was introduced by the POSIX standard.

@cindex @command{gawk} @subentry @code{FIELDWIDTHS} variable in
@cindex @code{FIELDWIDTHS} variable
@cindex differences in @command{awk} and @command{gawk} @subentry @code{FIELDWIDTHS} variable
@cindex field separator @subentry @code{FIELDWIDTHS} variable and
@cindex separators @subentry field @subentry @code{FIELDWIDTHS} variable and
@item FIELDWIDTHS #
A space-separated list of columns that tells @command{gawk}
how to split input with fixed columnar boundaries.
Starting in @value{PVERSION} 4.2, each field width may optionally be
preceded by a colon-separated value specifying the number of characters to skip
before the field starts.
Assigning a value to @code{FIELDWIDTHS}
overrides the use of @code{FS} and @code{FPAT} for field splitting.
@xref{Constant Size} for more information.

@cindex @command{gawk} @subentry @code{FPAT} variable in
@cindex @code{FPAT} variable
@cindex differences in @command{awk} and @command{gawk} @subentry @code{FPAT} variable
@cindex field separator @subentry @code{FPAT} variable and
@cindex separators @subentry field @subentry @code{FPAT} variable and
@item FPAT #
A regular expression (as a string) that tells @command{gawk}
to create the fields based on text that matches the regular expression.
Assigning a value to @code{FPAT}
overrides the use of @code{FS} and @code{FIELDWIDTHS} for field splitting.
@xref{Splitting By Content} for more information.

@cindex @code{FS} variable
@cindex separators @subentry field
@cindex field separator
@item FS
The input field separator (@pxref{Field Separators}).
The value is a single-character string or a multicharacter regular
expression that matches the separations between fields in an input
record. If the value is the null string (@code{""}), then each
character in the record becomes a separate field.
(This behavior is a @command{gawk} extension. POSIX @command{awk} does not
specify the behavior when @code{FS} is the null string.
Nonetheless, some other versions of @command{awk} also treat
@code{""} specially.)

The default value is @w{@code{" "}}, a string consisting of a single
space. As a special exception, this value means that any sequence of
spaces, TABs, and/or newlines is a single separator. It also causes
spaces, TABs, and newlines at the beginning and end of a record to
be ignored.

You can set the value of @code{FS} on the command line using the
@option{-F} option:

@example
awk -F, '@var{program}' @var{input-files}
@end example

@cindex @command{gawk} @subentry field separators and
If @command{gawk} is using @code{FIELDWIDTHS} or @code{FPAT}
for field splitting,
assigning a value to @code{FS} causes @command{gawk} to return to
the normal, @code{FS}-based field splitting. An easy way to do this
is to simply say @samp{FS = FS}, perhaps with an explanatory comment.

@cindex @command{gawk} @subentry @code{IGNORECASE} variable in
@cindex @code{IGNORECASE} variable
@cindex differences in @command{awk} and @command{gawk} @subentry @code{IGNORECASE} variable
@cindex case sensitivity @subentry string comparisons and
@cindex case sensitivity @subentry regexps and
@cindex regular expressions @subentry case sensitivity
@item IGNORECASE #
If @code{IGNORECASE} is nonzero or non-null, then all string comparisons
and all regular expression matching are case-independent.
This applies to
regexp matching with @samp{~} and @samp{!~},
the @code{gensub()}, @code{gsub()}, @code{index()}, @code{match()},
@code{patsplit()}, @code{split()}, and @code{sub()} functions,
record termination with @code{RS}, and field splitting with
@code{FS} and @code{FPAT}.
However, the value of @code{IGNORECASE} does @emph{not} affect array subscripting
and it does not affect field splitting when using a single-character
field separator.
@xref{Case-sensitivity}.

@cindex @command{gawk} @subentry @code{LINT} variable in
@cindex @code{LINT} variable
@cindex differences in @command{awk} and @command{gawk} @subentry @code{LINT} variable
@cindex lint checking
@item LINT #
When this variable is true (nonzero or non-null), @command{gawk}
behaves as if the @option{--lint} command-line option is in effect
(@pxref{Options}).
With a value of @code{"fatal"}, lint warnings become fatal errors.
With a value of @code{"invalid"}, only warnings about things that are
actually invalid are issued. (This is not fully implemented yet.)
Any other true value prints nonfatal warnings.
Assigning a false value to @code{LINT} turns off the lint warnings.

This variable is a @command{gawk} extension. It is not special
in other @command{awk} implementations.
Unlike with the other special variables,
changing @code{LINT} does affect the production of lint warnings,
even if @command{gawk} is in compatibility mode. Much as
the @option{--lint} and @option{--traditional} options independently
control different aspects of @command{gawk}'s behavior, the control
of lint warnings during program execution is independent of the flavor
of @command{awk} being executed.

@cindex @code{OFMT} variable
@cindex numbers @subentry converting @subentry to strings
@cindex strings @subentry converting @subentry numbers to
@item OFMT
A string that controls conversion of numbers to
strings (@pxref{Conversion}) for
printing with the @code{print} statement. It works by being passed
as the first argument to the @code{sprintf()} function
(@pxref{String Functions}).
Its default value is @code{"%.6g"}. Earlier versions of @command{awk}
used @code{OFMT} to specify the format for converting numbers to
strings in general expressions; this is now done by @code{CONVFMT}.

@cindex @code{print} statement @subentry @code{OFMT} variable and
@cindex @code{OFS} variable
@cindex separators @subentry field
@cindex field separator
@item OFS
The output field separator (@pxref{Output Separators}). It is
output between the fields printed by a @code{print} statement. Its
default value is @w{@code{" "}}, a string consisting of a single space.

@cindex @code{ORS} variable
@item ORS
The output record separator. It is output at the end of every
@code{print} statement. Its default value is @code{"\n"}, the newline
character. (@xref{Output Separators}.)

@cindex @code{PREC} variable
@item PREC #
The working precision of arbitrary-precision floating-point numbers,
53 bits by default (@pxref{Setting precision}).
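
As a quick sketch of the effect of @code{PREC} (this assumes a
@command{gawk} built with MPFR support and invoked with the @option{-M}
option; the variable names and the precision of 200 bits are arbitrary
choices), the following prints the same quotient at two working precisions:

@example
gawk -M 'BEGIN @{
    x = 1; y = 3                # variables, so the division happens at runtime
    printf("%.40f\n", x / y)    # computed at the default precision
    PREC = 200                  # request 200 bits from now on
    printf("%.40f\n", x / y)    # computed at the new precision
@}'
@end example

@noindent
The second line of output should agree with one third to more digits
than the first.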

@cindex @code{ROUNDMODE} variable
@item ROUNDMODE #
The rounding mode to use for arbitrary-precision arithmetic on
numbers, by default @code{"N"} (@code{roundTiesToEven} in
the IEEE 754 standard; @pxref{Setting the rounding mode}).

@cindex @code{RS} variable
@cindex separators @subentry for records
@cindex record separators
@item @code{RS}
The input record separator. Its default value is a string
containing a single newline character, which means that an input record
consists of a single line of text.
It can also be the null string, in which case records are separated by
runs of blank lines.
If it is a regexp, records are separated by
matches of the regexp in the input text.
(@xref{Records}.)

The ability for @code{RS} to be a regular expression
is a @command{gawk} extension.
In most other @command{awk} implementations,
or if @command{gawk} is in compatibility mode
(@pxref{Options}),
just the first character of @code{RS}'s value is used.

@cindex @code{SUBSEP} variable
@cindex separators @subentry subscript
@cindex subscript separators
@item @code{SUBSEP}
The subscript separator. It has the default value of
@code{"\034"} and is used to separate the parts of the indices of a
multidimensional array. Thus, the expression @samp{@w{foo["A", "B"]}}
really accesses @code{foo["A\034B"]}
(@pxref{Multidimensional}).

@cindex @command{gawk} @subentry @code{TEXTDOMAIN} variable in
@cindex @code{TEXTDOMAIN} variable
@cindex differences in @command{awk} and @command{gawk} @subentry @code{TEXTDOMAIN} variable
@cindex internationalization @subentry localization
@item TEXTDOMAIN #
Used for internationalization of programs at the
@command{awk} level.
It sets the default text domain for specially
marked string constants in the source text, as well as for the
@code{dcgettext()}, @code{dcngettext()}, and @code{bindtextdomain()} functions
(@pxref{Internationalization}).
The default value of @code{TEXTDOMAIN} is @code{"messages"}.
@end table

@node Auto-set
@subsection Built-in Variables That Convey Information

@cindex predefined variables @subentry conveying information
@cindex variables @subentry predefined @subentry conveying information
The following is an alphabetical list of variables that @command{awk}
sets automatically on certain occasions in order to provide
information to your program.

The variables that are specific to @command{gawk} are marked with a pound
sign (@samp{#}). These variables are @command{gawk} extensions. In other
@command{awk} implementations or if @command{gawk} is in compatibility
mode (@pxref{Options}), they are not special:

@c @asis for docbook
@table @asis
@cindex @code{ARGC}/@code{ARGV} variables
@cindex arguments @subentry command-line
@cindex command line @subentry arguments
@item @code{ARGC}, @code{ARGV}
The command-line arguments available to @command{awk} programs are stored in
an array called @code{ARGV}. @code{ARGC} is the number of command-line
arguments present. @xref{Other Arguments}.
Unlike most @command{awk} arrays,
@code{ARGV} is indexed from 0 to @code{ARGC} @minus{} 1.
In the following example:

@example
@group
$ @kbd{awk 'BEGIN @{}
>     @kbd{for (i = 0; i < ARGC; i++)}
>         @kbd{print ARGV[i]}
> @kbd{@}' inventory-shipped mail-list}
@print{} awk
@print{} inventory-shipped
@print{} mail-list
@end group
@end example

@noindent
@code{ARGV[0]} contains @samp{awk}, @code{ARGV[1]}
contains @samp{inventory-shipped}, and @code{ARGV[2]} contains
@samp{mail-list}. The value of @code{ARGC} is three, one more than the
index of the last element in @code{ARGV}, because the elements are numbered
from zero.

@cindex programming conventions @subentry @code{ARGC}/@code{ARGV} variables
The names @code{ARGC} and @code{ARGV}, as well as the convention of indexing
the array from 0 to @code{ARGC} @minus{} 1, are derived from the C language's
method of accessing command-line arguments.

@cindex dark corner @subentry value of @code{ARGV[0]}
The value of @code{ARGV[0]} can vary from system to system.
Also, you should note that the program text is @emph{not} included in
@code{ARGV}, nor are any of @command{awk}'s command-line options.
@xref{ARGC and ARGV} for information
about how @command{awk} uses these variables.
@value{DARKCORNER}

@cindex @code{ARGIND} variable
@cindex differences in @command{awk} and @command{gawk} @subentry @code{ARGIND} variable
@item @code{ARGIND #}
The index in @code{ARGV} of the current file being processed.
Every time @command{gawk} opens a new @value{DF} for processing, it sets
@code{ARGIND} to the index in @code{ARGV} of the @value{FN}.
When @command{gawk} is processing the input files,
@samp{FILENAME == ARGV[ARGIND]} is always true.
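
For instance, the following sketch (the file roles are an assumption made
for illustration; it supposes that exactly two data files are named as
@code{ARGV[1]} and @code{ARGV[2]}, with no variable assignments among them)
uses @code{ARGIND} to treat the first file as a list of keys and the
second as the data to filter:

@example
# Remember the first field of every record in the first file
ARGIND == 1 @{ seen[$1] = 1; next @}

# Print records of the second file whose first field was seen
ARGIND == 2 && ($1 in seen)
@end example

@noindent
Because the test is on the position in @code{ARGV} rather than on
@code{FILENAME}, this continues to work if the same file is named twice.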

@cindex files @subentry processing, @code{ARGIND} variable and
This variable is useful in file processing; it allows you to tell how far
along you are in the list of @value{DF}s as well as to distinguish between
successive instances of the same @value{FN} on the command line.

@cindex file names @subentry distinguishing
While you can change the value of @code{ARGIND} within your @command{awk}
program, @command{gawk} automatically sets it to a new value when it
opens the next file.

@cindex @code{ENVIRON} array
@cindex environment variables @subentry in @code{ENVIRON} array
@item @code{ENVIRON}
An associative array containing the values of the environment. The array
indices are the environment variable names; the elements are the values of
the particular environment variables. For example,
@code{ENVIRON["HOME"]} might be @code{/home/arnold}.

For POSIX @command{awk}, changing this array does not affect the
environment passed on to any programs that @command{awk} may spawn via
redirection or the @code{system()} function.

However, beginning with @value{PVERSION} 4.2, if not in POSIX
compatibility mode, @command{gawk} does update its own environment when
@code{ENVIRON} is changed, thus changing the environment seen by programs
that it creates. You should therefore be especially careful if you
modify @code{ENVIRON["PATH"]}, which is the search path for finding
executable programs.

This can also affect the running @command{gawk} program, since some of the
built-in functions may pay attention to certain environment variables.
The most notable instance of this is @code{mktime()} (@pxref{Time
Functions}), which pays attention to the value of the @env{TZ} environment
variable on many systems.

Some operating systems may not have environment variables.
On such systems, the @code{ENVIRON} array is empty (except for
@w{@code{ENVIRON["AWKPATH"]}} and
@w{@code{ENVIRON["AWKLIBPATH"]}};
@pxref{AWKPATH Variable} and
@ifdocbook
@ref{AWKLIBPATH Variable}).
@end ifdocbook
@ifnotdocbook
@pxref{AWKLIBPATH Variable}).
@end ifnotdocbook

@cindex @command{gawk} @subentry @code{ERRNO} variable in
@cindex @code{ERRNO} variable
@cindex differences in @command{awk} and @command{gawk} @subentry @code{ERRNO} variable
@cindex error handling @subentry @code{ERRNO} variable and
@item @code{ERRNO #}
If a system error occurs during a redirection for @code{getline}, during
a read for @code{getline}, or during a @code{close()} operation, then
@code{ERRNO} contains a string describing the error.

In addition, @command{gawk} clears @code{ERRNO} before opening each
command-line input file. This enables checking if the file is readable
inside a @code{BEGINFILE} pattern (@pxref{BEGINFILE/ENDFILE}).

Otherwise, @code{ERRNO} works similarly to the C variable @code{errno}.
Except for the case just mentioned, @command{gawk} @emph{never} clears
it (sets it to zero or @code{""}). Thus, you should only expect its
value to be meaningful when an I/O operation returns a failure value,
such as @code{getline} returning @minus{}1. You are, of course, free
to clear it yourself before doing an I/O operation.

If the value of @code{ERRNO} corresponds to a system error in the C
@code{errno} variable, then @code{PROCINFO["errno"]} will be set to the value
of @code{errno}. For non-system errors, @code{PROCINFO["errno"]} will
be zero.

@cindex @code{FILENAME} variable
@cindex dark corner @subentry @code{FILENAME} variable
@item @code{FILENAME}
The name of the current input file.
When no @value{DF}s are listed
on the command line, @command{awk} reads from the standard input and
@code{FILENAME} is set to @code{"-"}. @code{FILENAME} changes each
time a new file is read (@pxref{Reading Files}). Inside a @code{BEGIN}
rule, the value of @code{FILENAME} is @code{""}, because there are no input
files being processed yet.@footnote{Some early implementations of Unix
@command{awk} initialized @code{FILENAME} to @code{"-"}, even if there
were @value{DF}s to be processed. This behavior was incorrect and should
not be relied upon in your programs.} @value{DARKCORNER} Note, though,
that using @code{getline} (@pxref{Getline}) inside a @code{BEGIN} rule
can give @code{FILENAME} a value.

@cindex @code{FNR} variable
@item @code{FNR}
The current record number in the current file. @command{awk} increments
@code{FNR} each time it reads a new record (@pxref{Records}).
@command{awk} resets @code{FNR} to zero each time it starts a new
input file.

@cindex @code{NF} variable
@item @code{NF}
The number of fields in the current input record.
@code{NF} is set each time a new record is read, when a new field is
created, or when @code{$0} changes (@pxref{Fields}).

Unlike most of the variables described in this @value{SUBSECTION},
assigning a value to @code{NF} has the potential to affect
@command{awk}'s internal workings. In particular, assignments
to @code{NF} can be used to create fields in or remove fields from the
current record. @xref{Changing Fields}.

@cindex @code{FUNCTAB} array
@cindex @command{gawk} @subentry @code{FUNCTAB} array in
@cindex differences in @command{awk} and @command{gawk} @subentry @code{FUNCTAB} variable
@item @code{FUNCTAB #}
An array whose indices and corresponding values are the names of all
the built-in, user-defined, and extension functions in the program.
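
One possible use, shown here as a sketch (the function name @code{greet}
is invented for the example), is to check that a function exists before
calling it indirectly:

@example
function greet(name)
@{
    print "Hello,", name
@}

BEGIN @{
    fname = "greet"
    if (fname in FUNCTAB)
        @@fname("world")        # indirect call through the name
    else
        print "no such function:", fname > "/dev/stderr"
@}
@end example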

@quotation NOTE
Attempting to use the @code{delete} statement with the @code{FUNCTAB}
array causes a fatal error. Any attempt to assign to an element of
@code{FUNCTAB} also causes a fatal error.
@end quotation

@cindex @code{NR} variable
@item @code{NR}
The number of input records @command{awk} has processed since
the beginning of the program's execution
(@pxref{Records}).
@command{awk} increments @code{NR} each time it reads a new record.

@cindex @command{gawk} @subentry @code{PROCINFO} array in
@cindex @code{PROCINFO} array
@cindex differences in @command{awk} and @command{gawk} @subentry @code{PROCINFO} array
@item @code{PROCINFO #}
The elements of this array provide access to information about the
running @command{awk} program.
The following elements (listed alphabetically)
are guaranteed to be available:

@table @code
@item PROCINFO["argv"]
@cindex command line @subentry arguments
The @code{PROCINFO["argv"]} array contains all of the command-line arguments
(after glob expansion and redirection processing on platforms where that must
be done manually by the program) with subscripts ranging from 0 through
@code{argc} @minus{} 1. For example, @code{PROCINFO["argv"][0]} will contain
the name by which @command{gawk} was invoked. Here is an example of how this
feature may be used:

@example
gawk '
BEGIN @{
    for (i = 0; i < length(PROCINFO["argv"]); i++)
        print i, PROCINFO["argv"][i]
@}'
@end example

Please note that this differs from the standard @code{ARGV} array which does
not include command-line arguments that have already been processed by
@command{gawk} (@pxref{ARGC and ARGV}).

@cindex effective group ID of @command{gawk} user
@item PROCINFO["egid"]
The value of the @code{getegid()} system call.

@item PROCINFO["errno"]
The value of the C @code{errno} variable when @code{ERRNO} is set to
the associated error message.

@item PROCINFO["euid"]
@cindex effective user ID of @command{gawk} user
The value of the @code{geteuid()} system call.

@item PROCINFO["FS"]
This is
@code{"FS"} if field splitting with @code{FS} is in effect,
@code{"FIELDWIDTHS"} if field splitting with @code{FIELDWIDTHS} is in effect,
@code{"FPAT"} if field matching with @code{FPAT} is in effect,
or @code{"API"} if field splitting is controlled by an API input parser.

@item PROCINFO["gid"]
@cindex group ID of @command{gawk} user
The value of the @code{getgid()} system call.

@item PROCINFO["identifiers"]
@cindex program identifiers
A subarray, indexed by the names of all identifiers used in the text of
the @command{awk} program. An @dfn{identifier} is simply the name of a variable
(be it scalar or array), built-in function, user-defined function, or
extension function. For each identifier, the value of the element is
one of the following:

@table @code
@item "array"
The identifier is an array.

@item "builtin"
The identifier is a built-in function.

@item "extension"
The identifier is an extension function loaded via
@code{@@load} or @option{-l}.

@item "scalar"
The identifier is a scalar.

@item "untyped"
The identifier is untyped (could be used as a scalar or an array;
@command{gawk} doesn't know yet).

@item "user"
The identifier is a user-defined function.
@end table

@noindent
The values indicate what @command{gawk} knows about the identifiers
after it has finished parsing the program; they are @emph{not} updated
while the program runs.
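
As an illustration (a sketch; the names @code{double} and @code{total}
are invented for the example), this program lists what @command{gawk}
recorded about each identifier:

@example
function double(x) @{ return 2 * x @}

BEGIN @{
    total = double(21)
    for (name in PROCINFO["identifiers"])
        print name, PROCINFO["identifiers"][name]
@}
@end example

@noindent
The list includes @code{double} with the value @code{"user"}; the order
of the output lines is not specified.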

@item PROCINFO["platform"]
@cindex platform running on
@cindex @code{PROCINFO} array @subentry platform running on
This element gives a string indicating the platform for which
@command{gawk} was compiled. The value will be one of the following:

@c nested table
@table @code
@item "djgpp"
@itemx "mingw"
Microsoft Windows, using either DJGPP or MinGW, respectively.

@item "os2"
OS/2.

@item "os390"
OS/390.

@item "posix"
GNU/Linux, Cygwin, Mac OS X, and legacy Unix systems.

@item "vms"
OpenVMS or Vax/VMS.
@end table

@item PROCINFO["pgrpid"]
@cindex process group ID of @command{gawk} process
The process group ID of the current process.

@item PROCINFO["pid"]
@cindex process ID of @command{gawk} process
The process ID of the current process.

@item PROCINFO["ppid"]
@cindex parent process ID of @command{gawk} process
The parent process ID of the current process.

@item PROCINFO["strftime"]
The default time format string for @code{strftime()}.
Assigning a new value to this element changes the default.
@xref{Time Functions}.

@item PROCINFO["uid"]
The value of the @code{getuid()} system call.

@item PROCINFO["version"]
@cindex version of @subentry @command{gawk}
@cindex @command{gawk} @subentry version of
The version of @command{gawk}.
@end table

The following additional elements in the array
are available to provide information about the MPFR and GMP libraries
if your version of @command{gawk} supports arbitrary-precision arithmetic
(@pxref{Arbitrary Precision Arithmetic}):

@table @code
@item PROCINFO["gmp_version"]
@cindex version of @subentry GNU MP library
The version of the GNU MP library.

@cindex version of @subentry GNU MPFR library
@item PROCINFO["mpfr_version"]
The version of the GNU MPFR library.

@item PROCINFO["prec_max"]
@cindex maximum precision supported by MPFR library
The maximum precision supported by MPFR.

@item PROCINFO["prec_min"]
@cindex minimum precision required by MPFR library
The minimum precision required by MPFR.
@end table

The following additional elements in the array are available to provide
information about the version of the extension API, if your version
of @command{gawk} supports dynamic loading of extension functions
(@pxref{Dynamic Extensions}):

@table @code
@item PROCINFO["api_major"]
@cindex version of @subentry @command{gawk} extension API
@cindex extension API @subentry version number
The major version of the extension API.

@item PROCINFO["api_minor"]
The minor version of the extension API.
@end table

@cindex supplementary groups of @command{gawk} process
On some systems, there may be elements in the array, @code{"group1"}
through @code{"group@var{N}"} for some @var{N}. @var{N} is the number of
supplementary groups that the process has. Use the @code{in} operator
to test for these elements
(@pxref{Reference to Elements}).

The following elements allow you to change @command{gawk}'s behavior:

@table @code
@item PROCINFO["NONFATAL"]
If this element exists, then I/O errors for all redirections become nonfatal.
@xref{Nonfatal}.

@item PROCINFO["@var{name}", "NONFATAL"]
Make I/O errors for @var{name} be nonfatal.
@xref{Nonfatal}.

@item PROCINFO["@var{command}", "pty"]
For two-way communication to @var{command}, use a pseudo-tty instead
of setting up a two-way pipe.
@xref{Two-way I/O} for more information.

@item PROCINFO["@var{input_name}", "READ_TIMEOUT"]
Set a timeout for reading from input redirection @var{input_name}.
@xref{Read Timeout} for more information.

@item PROCINFO["@var{input_name}", "RETRY"]
If an I/O error that may be retried occurs when reading data from
@var{input_name}, and this array entry exists, then @code{getline} returns
@minus{}2 instead of following the default behavior of returning @minus{}1
and configuring @var{input_name} to return no further data. An I/O error
that may be retried is one where @code{errno} has the value @code{EAGAIN},
@code{EWOULDBLOCK}, @code{EINTR}, or @code{ETIMEDOUT}. This may be useful
in conjunction with @code{PROCINFO["@var{input_name}", "READ_TIMEOUT"]}
or situations where a file descriptor has been configured to behave in
a non-blocking fashion.
@xref{Retrying Input} for more information.

@item PROCINFO["sorted_in"]
If this element exists in @code{PROCINFO}, its value controls the
order in which array indices will be processed by
@samp{for (@var{indx} in @var{array})} loops.
This is an advanced feature, so we defer the
full description until later; see
@ref{Controlling Scanning}.
@end table

@cindex @code{RLENGTH} variable
@item @code{RLENGTH}
The length of the substring matched by the
@code{match()} function
(@pxref{String Functions}).
@code{RLENGTH} is set by invoking the @code{match()} function. Its value
is the length of the matched string, or @minus{}1 if no match is found.

@cindex @code{RSTART} variable
@item @code{RSTART}
The start index in characters of the substring that is matched by the
@code{match()} function
(@pxref{String Functions}).
@code{RSTART} is set by invoking the @code{match()} function.
Its value
is the position of the string where the matched substring starts, or zero
if no match was found.

@cindex @command{gawk} @subentry @code{RT} variable in
@cindex @code{RT} variable
@cindex differences in @command{awk} and @command{gawk} @subentry @code{RS}/@code{RT} variables
@item @code{RT #}
The input text that matched the text denoted by @code{RS},
the record separator. It is set every time a record is read.

@cindex @command{gawk} @subentry @code{SYMTAB} array in
@cindex @code{SYMTAB} array
@cindex differences in @command{awk} and @command{gawk} @subentry @code{SYMTAB} variable
@item @code{SYMTAB #}
An array whose indices are the names of all defined global variables and
arrays in the program. @code{SYMTAB} makes @command{gawk}'s symbol table
visible to the @command{awk} programmer. It is built as @command{gawk}
parses the program and is complete before the program starts to run.

The array may be used for indirect access to read or write the value of
a variable:

@example
foo = 5
SYMTAB["foo"] = 4
print foo    # prints 4
@end example

@noindent
The @code{isarray()} function (@pxref{Type Functions}) may be used to test
if an element in @code{SYMTAB} is an array.
Also, you may not use the @code{delete} statement with the
@code{SYMTAB} array.

Prior to @value{PVERSION} 5.0 of @command{gawk}, you could
use an index for @code{SYMTAB} that was not a predefined identifier:

@example
SYMTAB["xxx"] = 5
print SYMTAB["xxx"]
@end example

@noindent
This no longer works, instead producing a fatal error, as it led
to rampant confusion.

@cindex Schorr, Andrew
The @code{SYMTAB} array is more interesting than it looks. Andrew Schorr
points out that it effectively gives @command{awk} data pointers.
Consider his
example:

@example
@group
# Indirect multiply of any variable by amount, return result

function multiply(variable, amount)
@{
    return SYMTAB[variable] *= amount
@}
@end group
@end example

@noindent
You would use it like this:

@example
BEGIN @{
    answer = 10.5
    multiply("answer", 4)
    print "The answer is", answer
@}
@end example

@noindent
When run, this produces:

@example
$ @kbd{gawk -f answer.awk}
@print{} The answer is 42
@end example

@quotation NOTE
In order to avoid severe time-travel paradoxes,@footnote{Not to mention
difficult implementation issues.} neither @code{FUNCTAB} nor @code{SYMTAB}
is available as an element within the @code{SYMTAB} array.
@end quotation
@end table

@sidebar Changing @code{NR} and @code{FNR}
@cindex @code{NR} variable @subentry changing
@cindex @code{FNR} variable @subentry changing
@cindex dark corner @subentry @code{FNR}/@code{NR} variables
@command{awk} increments @code{NR} and @code{FNR}
each time it reads a record, instead of setting them to the absolute
value of the number of records read. This means that a program can
change these variables and their new values are incremented for
each record.
@value{DARKCORNER}
The following example shows this:

@example
$ @kbd{echo '1}
> @kbd{2}
> @kbd{3}
> @kbd{4' | awk 'NR == 2 @{ NR = 17 @}}
> @kbd{@{ print NR @}'}
@print{} 1
@print{} 17
@print{} 18
@print{} 19
@end example

@noindent
Before @code{FNR} was added to the @command{awk} language
(@pxref{V7/SVR3.1}),
many @command{awk} programs used this feature to track the number of
records in a file by resetting @code{NR} to zero when @code{FILENAME}
changed.
@end sidebar

@node ARGC and ARGV
@subsection Using @code{ARGC} and @code{ARGV}
@cindex @code{ARGC}/@code{ARGV} variables @subentry how to use
@cindex arguments @subentry command-line
@cindex command line @subentry arguments

@ref{Auto-set}
presented the following program describing the information contained in @code{ARGC}
and @code{ARGV}:

@example
@group
$ @kbd{awk 'BEGIN @{}
>     @kbd{for (i = 0; i < ARGC; i++)}
>         @kbd{print ARGV[i]}
> @kbd{@}' inventory-shipped mail-list}
@print{} awk
@print{} inventory-shipped
@print{} mail-list
@end group
@end example

@noindent
In this example, @code{ARGV[0]} contains @samp{awk}, @code{ARGV[1]}
contains @samp{inventory-shipped}, and @code{ARGV[2]} contains
@samp{mail-list}.
Notice that the @command{awk} program is not entered in @code{ARGV}. The
other command-line options, with their arguments, are also not
entered. This includes variable assignments done with the @option{-v}
option (@pxref{Options}).
Normal variable assignments on the command line @emph{are}
treated as arguments and do show up in the @code{ARGV} array.
Given the following program in a file named @file{showargs.awk}:

@example
BEGIN @{
    printf "A=%d, B=%d\n", A, B
    for (i = 0; i < ARGC; i++)
        printf "\tARGV[%d] = %s\n", i, ARGV[i]
@}
END   @{ printf "A=%d, B=%d\n", A, B @}
@end example

@noindent
Running it produces the following:

@example
$ @kbd{awk -v A=1 -f showargs.awk B=2 /dev/null}
@print{} A=1, B=0
@print{}         ARGV[0] = awk
@print{}         ARGV[1] = B=2
@print{}         ARGV[2] = /dev/null
@print{} A=1, B=2
@end example

A program can alter @code{ARGC} and the elements of @code{ARGV}.
Each time @command{awk} reaches the end of an input file, it uses the next
element of @code{ARGV} as the name of the next input file. By storing a
different string there, a program can change which files are read.
Use @code{"-"} to represent the standard input. Storing
additional elements and incrementing @code{ARGC} causes
additional files to be read.

If the value of @code{ARGC} is decreased, that eliminates input files
from the end of the list. By recording the old value of @code{ARGC}
elsewhere, a program can treat the eliminated arguments as
something other than @value{FN}s.

To eliminate a file from the middle of the list, store the null string
(@code{""}) into @code{ARGV} in place of the file's name. As a
special feature, @command{awk} ignores @value{FN}s that have been
replaced with the null string.
Another option is to
use the @code{delete} statement to remove elements from
@code{ARGV} (@pxref{Delete}).

All of these actions are typically done in the @code{BEGIN} rule,
before actual processing of the input begins.
@xref{Split Program} and
@ifnotdocbook
@pxref{Tee Program}
@end ifnotdocbook
@ifdocbook
@ref{Tee Program}
@end ifdocbook
for examples
of each way of removing elements from @code{ARGV}.
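
The techniques just described can be combined, as in this sketch (which
argument to skip, and the decision to read the standard input last, are
arbitrary choices made for illustration):

@example
BEGIN @{
    # Skip the first command-line argument (assumed here to be
    # a data file) by replacing its name with the null string
    if (ARGC > 1)
        ARGV[1] = ""

    # Append the standard input as an additional input source
    ARGV[ARGC] = "-"
    ARGC++
@}

@{ print @}
@end example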

To actually get options into an @command{awk} program,
end the @command{awk} options with @option{--} and then supply
the @command{awk} program's options, in the following manner:

@example
awk -f myprog.awk -- -v -q file1 file2 @dots{}
@end example

The following fragment processes @code{ARGV} in order to examine, and
then remove, the previously mentioned command-line options:

@example
BEGIN @{
    for (i = 1; i < ARGC; i++) @{
        if (ARGV[i] == "-v")
            verbose = 1
        else if (ARGV[i] == "-q")
            debug = 1
        else if (ARGV[i] ~ /^-./) @{
            e = sprintf("%s: unrecognized option -- %c",
                    ARGV[0], substr(ARGV[i], 2, 1))
            print e > "/dev/stderr"
        @} else
            break
        delete ARGV[i]
    @}
@}
@end example

@cindex differences in @command{awk} and @command{gawk} @subentry @code{ARGC}/@code{ARGV} variables
Ending the @command{awk} options with @option{--} isn't
necessary in @command{gawk}. Unless @option{--posix} has
been specified, @command{gawk} silently puts any unrecognized options
into @code{ARGV} for the @command{awk} program to deal with. As soon
as it sees an unknown option, @command{gawk} stops looking for other
options that it might otherwise recognize. The previous command line with
@command{gawk} would be:

@example
gawk -f myprog.awk -q -v file1 file2 @dots{}
@end example

@noindent
Because @option{-q} is not a valid @command{gawk} option, it and the
following @option{-v} are passed on to the @command{awk} program.
(@xref{Getopt Function} for an @command{awk} library function that
parses command-line options.)
15973 15974When designing your program, you should choose options that don't 15975conflict with @command{gawk}'s, because it will process any options 15976that it accepts before passing the rest of the command line on to 15977your program. Using @samp{#!} with the @option{-E} option may help 15978(@pxref{Executable Scripts} 15979and 15980@ifnotdocbook 15981@pxref{Options}). 15982@end ifnotdocbook 15983@ifdocbook 15984@ref{Options}). 15985@end ifdocbook 15986 15987@node Pattern Action Summary 15988@section Summary 15989 15990@itemize @value{BULLET} 15991@item 15992Pattern--action pairs make up the basic elements of an @command{awk} 15993program. Patterns are either normal expressions, range expressions, 15994or regexp constants; one of the special keywords @code{BEGIN}, @code{END}, 15995@code{BEGINFILE}, or @code{ENDFILE}; or empty. The action executes if 15996the current record matches the pattern. Empty (missing) patterns match 15997all records. 15998 15999@item 16000I/O from @code{BEGIN} and @code{END} rules has certain constraints. 16001This is also true, only more so, for @code{BEGINFILE} and @code{ENDFILE} 16002rules. The latter two give you ``hooks'' into @command{gawk}'s file 16003processing, allowing you to recover from a file that otherwise would 16004cause a fatal error (such as a file that cannot be opened). 16005 16006@item 16007Shell variables can be used in @command{awk} programs by careful 16008use of shell quoting. It is easier to pass a shell variable into 16009@command{awk} by using the @option{-v} option and an @command{awk} 16010variable. 16011 16012@item 16013Actions consist of statements enclosed in curly braces. Statements 16014are built up from expressions, control statements, compound statements, 16015input and output statements, and deletion statements. 16016 16017@item 16018The control statements in @command{awk} are @code{if}-@code{else}, 16019@code{while}, @code{for}, and @code{do}-@code{while}. 
@command{gawk}
adds the @code{switch} statement. There are two flavors of @code{for}
statement: one for performing general looping, and the other for iterating
through an array.

@item
@code{break} exits a loop (or a @code{switch}) early, and
@code{continue} starts the next iteration of a loop.

@item
@code{next} and @code{nextfile} let you read the next record and start
over at the top of your program or skip to the next input file and
start over, respectively.

@item
The @code{exit} statement terminates your program. When executed
from an action (or function body), it transfers control to the
@code{END} rules. From within an @code{END} rule, it exits
immediately. You may pass an optional numeric value to be used
as @command{awk}'s exit status.

@item
Some predefined variables provide control over @command{awk}, mainly for I/O.
Other variables convey information from @command{awk} to your program.

@item
@code{ARGC} and @code{ARGV} make the command-line arguments available
to your program. Manipulating them from a @code{BEGIN} rule lets you
control how @command{awk} will process the provided @value{DF}s.

@end itemize

@node Arrays
@chapter Arrays in @command{awk}
@cindex arrays

An @dfn{array} is a table of values called @dfn{elements}. The
elements of an array are distinguished by their @dfn{indices}. Indices
may be either numbers or strings.

This @value{CHAPTER} describes how arrays work in @command{awk},
how to use array elements, how to scan through every element in an array,
and how to remove array elements.
It also describes how @command{awk} simulates multidimensional
arrays, as well as some of the less obvious points about array usage.
16064The @value{CHAPTER} moves on to discuss @command{gawk}'s facility 16065for sorting arrays, and ends with a brief description of @command{gawk}'s 16066ability to support true arrays of arrays. 16067 16068@menu 16069* Array Basics:: The basics of arrays. 16070* Numeric Array Subscripts:: How to use numbers as subscripts in 16071 @command{awk}. 16072* Uninitialized Subscripts:: Using Uninitialized variables as subscripts. 16073* Delete:: The @code{delete} statement removes an element 16074 from an array. 16075* Multidimensional:: Emulating multidimensional arrays in 16076 @command{awk}. 16077* Arrays of Arrays:: True multidimensional arrays. 16078* Arrays Summary:: Summary of arrays. 16079@end menu 16080 16081@node Array Basics 16082@section The Basics of Arrays 16083 16084This @value{SECTION} presents the basics: working with elements 16085in arrays one at a time, and traversing all of the elements in 16086an array. 16087 16088@menu 16089* Array Intro:: Introduction to Arrays 16090* Reference to Elements:: How to examine one element of an array. 16091* Assigning Elements:: How to change an element of an array. 16092* Array Example:: Basic Example of an Array 16093* Scanning an Array:: A variation of the @code{for} statement. It 16094 loops through the indices of an array's 16095 existing elements. 16096* Controlling Scanning:: Controlling the order in which arrays are 16097 scanned. 16098@end menu 16099 16100@node Array Intro 16101@subsection Introduction to Arrays 16102 16103@cindex Wall, Larry 16104@quotation 16105@i{Doing linear scans over an associative array is like trying to club someone 16106to death with a loaded Uzi.} 16107@author Larry Wall 16108@end quotation 16109 16110The @command{awk} language provides one-dimensional arrays 16111for storing groups of related strings or numbers. 16112Every @command{awk} array must have a name. Array names have the same 16113syntax as variable names; any valid variable name would also be a valid 16114array name. 
But one name cannot be used in both ways (as an array and 16115as a variable) in the same @command{awk} program. 16116 16117Arrays in @command{awk} superficially resemble arrays in other programming 16118languages, but there are fundamental differences. In @command{awk}, it 16119isn't necessary to specify the size of an array before starting to use it. 16120Additionally, any number or string, not just consecutive integers, 16121may be used as an array index. 16122 16123In most other languages, arrays must be @dfn{declared} before use, 16124including a specification of 16125how many elements or components they contain. In such languages, the 16126declaration causes a contiguous block of memory to be allocated for that 16127many elements. Usually, an index in the array must be a nonnegative integer. 16128For example, the index zero specifies the first element in the array, which is 16129actually stored at the beginning of the block of memory. Index one 16130specifies the second element, which is stored in memory right after the 16131first element, and so on. It is impossible to add more elements to the 16132array, because it has room only for as many elements as given in 16133the declaration. 16134(Some languages allow arbitrary starting and ending 16135indices---e.g., @samp{15 .. 27}---but the size of the array is still fixed when 16136the array is declared.) 16137 16138@c 1/2015: Do not put the numeric values into @code. Array element 16139@c values are no different than scalar variable values. 16140A contiguous array of four elements might look like 16141@ifnotdocbook 16142@ref{figure-array-elements}, 16143@end ifnotdocbook 16144@ifdocbook 16145@inlineraw{docbook, <xref linkend="figure-array-elements"/>}, 16146@end ifdocbook 16147conceptually, if the element values are eight, @code{"foo"}, 16148@code{""}, and 30. 
16149 16150@ifnotdocbook 16151@float Figure,figure-array-elements 16152@caption{A contiguous array} 16153@center @image{array-elements, , , A Contiguous Array} 16154@end float 16155@end ifnotdocbook 16156 16157@docbook 16158<figure id="figure-array-elements" float="0"> 16159<title>A contiguous array</title> 16160<mediaobject> 16161<imageobject role="web"><imagedata fileref="array-elements.png" format="PNG"/></imageobject> 16162</mediaobject> 16163</figure> 16164@end docbook 16165 16166@noindent 16167Only the values are stored; the indices are implicit from the order of 16168the values. Here, eight is the value at index zero, because eight appears in the 16169position with zero elements before it. 16170 16171@cindex arrays @subentry indexing 16172@cindex indexing arrays 16173@cindex associative arrays 16174@cindex arrays @subentry associative 16175Arrays in @command{awk} are different---they are @dfn{associative}. This means 16176that each array is a collection of pairs---an index and its corresponding 16177array element value: 16178 16179@ifnotdocbook 16180@c extra empty column to indent it right 16181@multitable @columnfractions .1 .1 .1 16182@headitem @tab Index @tab Value 16183@item @tab @code{3} @tab @code{30} 16184@item @tab @code{1} @tab @code{"foo"} 16185@item @tab @code{0} @tab @code{8} 16186@item @tab @code{2} @tab @code{""} 16187@end multitable 16188@end ifnotdocbook 16189 16190@docbook 16191<informaltable> 16192<tgroup cols="2"> 16193<colspec colname="1" align="left"/> 16194<colspec colname="2" align="left"/> 16195<thead> 16196<row> 16197<entry>Index</entry> 16198<entry>Value</entry> 16199</row> 16200</thead> 16201 16202<tbody> 16203<row> 16204<entry><literal>3</literal></entry> 16205<entry><literal>30</literal></entry> 16206</row> 16207 16208<row> 16209<entry><literal>1</literal></entry> 16210<entry><literal>"foo"</literal></entry> 16211</row> 16212 16213<row> 16214<entry><literal>0</literal></entry> 16215<entry><literal>8</literal></entry> 16216</row> 
16217 16218<row> 16219<entry><literal>2</literal></entry> 16220<entry><literal>""</literal></entry> 16221</row> 16222 16223</tbody> 16224</tgroup> 16225</informaltable> 16226 16227@end docbook 16228 16229@noindent 16230The pairs are shown in jumbled order because their order is 16231irrelevant.@footnote{The ordering will vary among @command{awk} 16232implementations, which typically use hash tables to store array elements 16233and values.} 16234 16235One advantage of associative arrays is that new pairs can be added 16236at any time. For example, suppose a tenth element is added to the array 16237whose value is @w{@code{"number ten"}}. The result is: 16238 16239@ifnotdocbook 16240@c extra empty column to indent it right 16241@multitable @columnfractions .1 .1 .2 16242@headitem @tab Index @tab Value 16243@item @tab @code{10} @tab @code{"number ten"} 16244@item @tab @code{3} @tab @code{30} 16245@item @tab @code{1} @tab @code{"foo"} 16246@item @tab @code{0} @tab @code{8} 16247@item @tab @code{2} @tab @code{""} 16248@end multitable 16249@end ifnotdocbook 16250 16251@docbook 16252<informaltable> 16253<tgroup cols="2"> 16254<colspec colname="1" align="left"/> 16255<colspec colname="2" align="left"/> 16256<thead> 16257<row> 16258<entry>Index</entry> 16259<entry>Value</entry> 16260</row> 16261</thead> 16262<tbody> 16263 16264<row> 16265<entry><literal>10</literal></entry> 16266<entry><literal>"number ten"</literal></entry> 16267</row> 16268 16269<row> 16270<entry><literal>3</literal></entry> 16271<entry><literal>30</literal></entry> 16272</row> 16273 16274<row> 16275<entry><literal>1</literal></entry> 16276<entry><literal>"foo"</literal></entry> 16277</row> 16278 16279<row> 16280<entry><literal>0</literal></entry> 16281<entry><literal>8</literal></entry> 16282</row> 16283 16284<row> 16285<entry><literal>2</literal></entry> 16286<entry><literal>""</literal></entry> 16287</row> 16288 16289</tbody> 16290</tgroup> 16291</informaltable> 16292 16293@end docbook 16294 
16295@noindent 16296@cindex sparse arrays 16297@cindex arrays @subentry sparse 16298Now the array is @dfn{sparse}, which just means some indices are missing. 16299It has elements 0--3 and 10, but doesn't have elements 4, 5, 6, 7, 8, or 9. 16300 16301Another consequence of associative arrays is that the indices don't 16302have to be nonnegative integers. Any number, or even a string, can be 16303an index. For example, the following is an array that translates words from 16304English to French: 16305 16306@ifnotdocbook 16307@multitable @columnfractions .1 .1 .1 16308@headitem @tab Index @tab Value 16309@item @tab @code{"dog"} @tab @code{"chien"} 16310@item @tab @code{"cat"} @tab @code{"chat"} 16311@item @tab @code{"one"} @tab @code{"un"} 16312@item @tab @code{1} @tab @code{"un"} 16313@end multitable 16314@end ifnotdocbook 16315 16316@docbook 16317<informaltable> 16318<tgroup cols="2"> 16319<colspec colname="1" align="left"/> 16320<colspec colname="2" align="left"/> 16321<thead> 16322<row> 16323<entry>Index</entry> 16324<entry>Value</entry> 16325</row> 16326</thead> 16327<tbody> 16328<row> 16329<entry><literal>"dog"</literal></entry> 16330<entry><literal>"chien"</literal></entry> 16331</row> 16332 16333<row> 16334<entry><literal>"cat"</literal></entry> 16335<entry><literal>"chat"</literal></entry> 16336</row> 16337 16338<row> 16339<entry><literal>"one"</literal></entry> 16340<entry><literal>"un"</literal></entry> 16341</row> 16342 16343<row> 16344<entry><literal>1</literal></entry> 16345<entry><literal>"un"</literal></entry> 16346</row> 16347 16348</tbody> 16349</tgroup> 16350</informaltable> 16351 16352@end docbook 16353 16354@noindent 16355Here we decided to translate the number one in both spelled-out and 16356numeric form---thus illustrating that a single array can have both 16357numbers and strings as indices. 16358(In fact, array subscripts are always strings. 
There are some subtleties to how numbers work when used as
array subscripts; this is discussed in more detail in
@ref{Numeric Array Subscripts}.)
Here, the number @code{1} isn't double-quoted, because @command{awk}
automatically converts it to a string.

@cindex @command{gawk} @subentry @code{IGNORECASE} variable in
@cindex case sensitivity @subentry array indices and
@cindex arrays @subentry @code{IGNORECASE} variable and
@cindex @code{IGNORECASE} variable @subentry array indices and
The value of @code{IGNORECASE} has no effect upon array subscripting.
The identical string value used to store an array element must be used
to retrieve it.
When @command{awk} creates an array (e.g., with the @code{split()}
built-in function),
that array's indices are consecutive integers starting at one.
(@xref{String Functions}.)

@command{awk}'s arrays are efficient---the time to access an element
is independent of the number of elements in the array.

@node Reference to Elements
@subsection Referring to an Array Element
@cindex arrays @subentry referencing elements
@cindex array members
@cindex elements in arrays

The principal way to use an array is to refer to one of its elements.
An @dfn{array reference} is an expression as follows:

@example
@var{array}[@var{index-expression}]
@end example

@noindent
Here, @var{array} is the name of an array. The expression @var{index-expression} is
the index of the desired element of the array.

@c 1/2015: Having the 4.3 in @samp is a little iffy. It's essentially
@c an expression though, so leave be. It's too early in the discussion
@c to mention that it's really a string.
The value of the array reference is the current value of that array
element.
For example, @code{foo[4.3]} is an expression referencing the element 16402of array @code{foo} at index @samp{4.3}. 16403 16404@cindex arrays @subentry unassigned elements 16405@cindex unassigned array elements 16406@cindex empty array elements 16407A reference to an array element that has no recorded value yields a value of 16408@code{""}, the null string. This includes elements 16409that have not been assigned any value as well as elements that have been 16410deleted (@pxref{Delete}). 16411 16412@cindex non-existent array elements 16413@cindex arrays @subentry elements @subentry that don't exist 16414@quotation NOTE 16415A reference to an element that does not exist @emph{automatically} creates 16416that array element, with the null string as its value. (In some cases, 16417this is unfortunate, because it might waste memory inside @command{awk}.) 16418 16419Novice @command{awk} programmers often make the mistake of checking if 16420an element exists by checking if the value is empty: 16421 16422@example 16423# Check if "foo" exists in a: @ii{Incorrect!} 16424if (a["foo"] != "") @dots{} 16425@end example 16426 16427@noindent 16428This is incorrect for two reasons. First, it @emph{creates} @code{a["foo"]} 16429if it didn't exist before! Second, it is valid (if a bit unusual) to set 16430an array element equal to the empty string. 16431@end quotation 16432 16433@c @cindex arrays, @code{in} operator and 16434@cindex @code{in} operator @subentry testing if array element exists 16435To determine whether an element exists in an array at a certain index, use 16436the following expression: 16437 16438@example 16439@var{indx} in @var{array} 16440@end example 16441 16442@cindex side effects @subentry array indexing 16443@noindent 16444This expression tests whether the particular index @var{indx} exists, 16445without the side effect of creating that element if it is not present. 
16446The expression has the value one (true) if @code{@var{array}[@var{indx}]} 16447exists and zero (false) if it does not exist. 16448(We use @var{indx} here, because @samp{index} is the name of a built-in 16449function.) 16450For example, this statement tests whether the array @code{frequencies} 16451contains the index @samp{2}: 16452 16453@example 16454@group 16455if (2 in frequencies) 16456 print "Subscript 2 is present." 16457@end group 16458@end example 16459 16460Note that this is @emph{not} a test of whether the array 16461@code{frequencies} contains an element whose @emph{value} is two. 16462There is no way to do that except to scan all the elements. Also, this 16463@emph{does not} create @code{frequencies[2]}, while the following 16464(incorrect) alternative does: 16465 16466@example 16467@group 16468if (frequencies[2] != "") 16469 print "Subscript 2 is present." 16470@end group 16471@end example 16472 16473@node Assigning Elements 16474@subsection Assigning Array Elements 16475@cindex arrays @subentry elements @subentry assigning values 16476@cindex elements in arrays @subentry assigning values 16477 16478Array elements can be assigned values just like 16479@command{awk} variables: 16480 16481@example 16482@var{array}[@var{index-expression}] = @var{value} 16483@end example 16484 16485@noindent 16486@var{array} is the name of an array. The expression 16487@var{index-expression} is the index of the element of the array that is 16488assigned a value. The expression @var{value} is the value to 16489assign to that element of the array. 16490 16491@node Array Example 16492@subsection Basic Array Example 16493@cindex arrays @subentry example of using 16494 16495The following program takes a list of lines, each beginning with a line 16496number, and prints them out in order of line number. The line numbers 16497are not in order when they are first read---instead, they 16498are scrambled. 
This program sorts the lines by making an array using 16499the line numbers as subscripts. The program then prints out the lines 16500in sorted order of their numbers. It is a very simple program and gets 16501confused upon encountering repeated numbers, gaps, or lines that don't 16502begin with a number: 16503 16504@example 16505@c file eg/misc/arraymax.awk 16506@{ 16507 if ($1 > max) 16508 max = $1 16509 arr[$1] = $0 16510@} 16511 16512END @{ 16513 for (x = 1; x <= max; x++) 16514 print arr[x] 16515@} 16516@c endfile 16517@end example 16518 16519The first rule keeps track of the largest line number seen so far; 16520it also stores each line into the array @code{arr}, at an index that 16521is the line's number. 16522The second rule runs after all the input has been read, to print out 16523all the lines. 16524When this program is run with the following input: 16525 16526@example 16527@group 16528@c file eg/misc/arraymax.data 165295 I am the Five man 165302 Who are you? The new number two! 165314 . . . And four on the floor 165321 Who is number one? 165333 I three you. 16534@c endfile 16535@end group 16536@end example 16537 16538@noindent 16539Its output is: 16540 16541@example 16542@group 165431 Who is number one? 165442 Who are you? The new number two! 165453 I three you. 165464 . . . And four on the floor 165475 I am the Five man 16548@end group 16549@end example 16550 16551If a line number is repeated, the last line with a given number overrides 16552the others. 
16553Gaps in the line numbers can be handled with an easy improvement to the 16554program's @code{END} rule, as follows: 16555 16556@example 16557@group 16558END @{ 16559 for (x = 1; x <= max; x++) 16560 if (x in arr) 16561 print arr[x] 16562@} 16563@end group 16564@end example 16565 16566@node Scanning an Array 16567@subsection Scanning All Elements of an Array 16568@cindex elements in arrays @subentry scanning 16569@cindex scanning arrays 16570@cindex arrays @subentry scanning 16571@cindex loops @subentry @code{for} @subentry array scanning 16572 16573In programs that use arrays, it is often necessary to use a loop that 16574executes once for each element of an array. In other languages, where 16575arrays are contiguous and indices are limited to nonnegative integers, 16576this is easy: all the valid indices can be found by counting from 16577the lowest index up to the highest. This technique won't do the job 16578in @command{awk}, because any number or string can be an array index. 16579So @command{awk} has a special kind of @code{for} statement for scanning 16580an array: 16581 16582@example 16583@group 16584for (@var{var} in @var{array}) 16585 @var{body} 16586@end group 16587@end example 16588 16589@noindent 16590@cindex @code{in} operator @subentry use in loops 16591This loop executes @var{body} once for each index in @var{array} that the 16592program has previously used, with the variable @var{var} set to that index. 16593 16594@cindex arrays @subentry @code{for} statement and 16595@cindex @code{for} statement @subentry looping over arrays 16596The following program uses this form of the @code{for} statement. The 16597first rule scans the input records and notes which words appear (at 16598least once) in the input, by storing a one into the array @code{used} with 16599the word as the index. The second rule scans the elements of @code{used} to 16600find all the distinct words that appear in the input. 
It prints each 16601word that is more than 10 characters long and also prints the number of 16602such words. 16603@xref{String Functions} 16604for more information on the built-in function @code{length()}. 16605 16606@example 16607@group 16608# Record a 1 for each word that is used at least once 16609@{ 16610 for (i = 1; i <= NF; i++) 16611 used[$i] = 1 16612@} 16613@end group 16614 16615@group 16616# Find number of distinct words more than 10 characters long 16617END @{ 16618 for (x in used) @{ 16619 if (length(x) > 10) @{ 16620 ++num_long_words 16621 print x 16622 @} 16623 @} 16624 print num_long_words, "words longer than 10 characters" 16625@} 16626@end group 16627@end example 16628 16629@noindent 16630@xref{Word Sorting} 16631for a more detailed example of this type. 16632 16633@cindex arrays @subentry elements @subentry order of access by @code{in} operator 16634@cindex elements in arrays @subentry order of access by @code{in} operator 16635@cindex @code{in} operator @subentry order of array access 16636The order in which elements of the array are accessed by this statement 16637is determined by the internal arrangement of the array elements within 16638@command{awk} and in standard @command{awk} cannot be controlled 16639or changed. This can lead to problems if new elements are added to 16640@var{array} by statements in the loop body; it is not predictable whether 16641the @code{for} loop will reach them. Similarly, changing @var{var} inside 16642the loop may produce strange results. It is best to avoid such things. 16643 16644As a point of information, @command{gawk} sets up the list of elements 16645to be iterated over before the loop starts, and does not change it. 16646But not all @command{awk} versions do so. 
Consider this program, named 16647@file{loopcheck.awk}: 16648 16649@example 16650BEGIN @{ 16651 a["here"] = "here" 16652 a["is"] = "is" 16653 a["a"] = "a" 16654 a["loop"] = "loop" 16655 for (i in a) @{ 16656 j++ 16657 a[j] = j 16658 print i 16659 @} 16660@} 16661@end example 16662 16663Here is what happens when run with @command{gawk} (and @command{mawk}): 16664 16665@example 16666$ @kbd{gawk -f loopcheck.awk} 16667@print{} here 16668@print{} loop 16669@print{} a 16670@print{} is 16671@end example 16672 16673Contrast this to BWK @command{awk}: 16674 16675@example 16676$ @kbd{nawk -f loopcheck.awk} 16677@print{} loop 16678@print{} here 16679@print{} is 16680@print{} a 16681@print{} 1 16682@end example 16683 16684@node Controlling Scanning 16685@subsection Using Predefined Array Scanning Orders with @command{gawk} 16686 16687This @value{SUBSECTION} describes a feature that is specific to @command{gawk}. 16688 16689By default, when a @code{for} loop traverses an array, the order 16690is undefined, meaning that the @command{awk} implementation 16691determines the order in which the array is traversed. 16692This order is usually based on the internal implementation of arrays 16693and will vary from one version of @command{awk} to the next. 16694 16695@cindex array scanning order, controlling 16696@cindex controlling array scanning order 16697Often, though, you may wish to do something simple, such as 16698``traverse the array by comparing the indices in ascending order,'' 16699or ``traverse the array by comparing the values in descending order.'' 16700@command{gawk} provides two mechanisms that give you this control: 16701 16702@itemize @value{BULLET} 16703@item 16704Set @code{PROCINFO["sorted_in"]} to one of a set of predefined values. 16705We describe this now. 16706 16707@item 16708Set @code{PROCINFO["sorted_in"]} to the name of a user-defined function 16709to use for comparison of array elements. This advanced feature 16710is described later in @ref{Array Sorting}. 
16711@end itemize 16712 16713@cindex @code{PROCINFO} array @subentry values of @code{sorted_in} 16714The following special values for @code{PROCINFO["sorted_in"]} are available: 16715 16716@table @code 16717@item "@@unsorted" 16718Array elements are processed in arbitrary order, which is the default 16719@command{awk} behavior. 16720 16721@item "@@ind_str_asc" 16722Order by indices in ascending order compared as strings; this is the most basic sort. 16723(Internally, array indices are always strings, so with @samp{a[2*5] = 1} 16724the index is @code{"10"} rather than numeric 10.) 16725 16726@item "@@ind_num_asc" 16727Order by indices in ascending order but force them to be treated as numbers in the process. 16728Any index with a non-numeric value will end up positioned as if it were zero. 16729 16730@item "@@val_type_asc" 16731Order by element values in ascending order (rather than by indices). 16732Ordering is by the type assigned to the element 16733(@pxref{Typing and Comparison}). 16734All numeric values come before all string values, 16735which in turn come before all subarrays. 16736(Subarrays have not been described yet; 16737@pxref{Arrays of Arrays}.) 16738 16739If you choose to use this feature in traversing @code{FUNCTAB} 16740(@pxref{Auto-set}), then the order is built-in functions first 16741(@pxref{Built-in}), then user-defined functions (@pxref{User-defined}) 16742next, and finally functions loaded from an extension 16743(@pxref{Dynamic Extensions}). 16744 16745@item "@@val_str_asc" 16746Order by element values in ascending order (rather than by indices). Scalar values are 16747compared as strings. 16748If the string values are identical, 16749the index string values are compared instead. 16750When comparing non-scalar values, 16751@code{"@@val_type_asc"} sort ordering is used, so subarrays, if present, 16752come out last. 16753 16754@item "@@val_num_asc" 16755Order by element values in ascending order (rather than by indices). 
Scalar values are 16756compared as numbers. 16757Non-scalar values are compared using @code{"@@val_type_asc"} sort ordering, 16758so subarrays, if present, come out last. 16759When numeric values are equal, the string values are used to provide 16760an ordering: this guarantees consistent results across different 16761versions of the C @code{qsort()} function,@footnote{When two elements 16762compare as equal, the C @code{qsort()} function does not guarantee 16763that they will maintain their original relative order after sorting. 16764Using the string value to provide a unique ordering when the numeric 16765values are equal ensures that @command{gawk} behaves consistently 16766across different environments.} which @command{gawk} uses internally 16767to perform the sorting. 16768If the string values are also identical, 16769the index string values are compared instead. 16770 16771 16772@item "@@ind_str_desc" 16773Like @code{"@@ind_str_asc"}, but the 16774string indices are ordered from high to low. 16775 16776@item "@@ind_num_desc" 16777Like @code{"@@ind_num_asc"}, but the 16778numeric indices are ordered from high to low. 16779 16780@item "@@val_type_desc" 16781Like @code{"@@val_type_asc"}, but the 16782element values, based on type, are ordered from high to low. 16783Subarrays, if present, come out first. 16784 16785@item "@@val_str_desc" 16786Like @code{"@@val_str_asc"}, but the 16787element values, treated as strings, are ordered from high to low. 16788If the string values are identical, 16789the index string values are compared instead. 16790When comparing non-scalar values, 16791@code{"@@val_type_desc"} sort ordering is used, so subarrays, if present, 16792come out first. 16793 16794@item "@@val_num_desc" 16795Like @code{"@@val_num_asc"}, but the 16796element values, treated as numbers, are ordered from high to low. 16797If the numeric values are equal, the string values are compared instead. 
16798If they are also identical, the index string values are compared instead. 16799Non-scalar values are compared using @code{"@@val_type_desc"} sort ordering, 16800so subarrays, if present, come out first. 16801@end table 16802 16803The array traversal order is determined before the @code{for} loop 16804starts to run. Changing @code{PROCINFO["sorted_in"]} in the loop body 16805does not affect the loop. 16806For example: 16807 16808@example 16809$ @kbd{gawk '} 16810> @kbd{BEGIN @{} 16811> @kbd{ a[4] = 4} 16812> @kbd{ a[3] = 3} 16813> @kbd{ for (i in a)} 16814> @kbd{ print i, a[i]} 16815> @kbd{@}'} 16816@print{} 4 4 16817@print{} 3 3 16818$ @kbd{gawk '} 16819> @kbd{BEGIN @{} 16820> @kbd{ PROCINFO["sorted_in"] = "@@ind_str_asc"} 16821> @kbd{ a[4] = 4} 16822> @kbd{ a[3] = 3} 16823> @kbd{ for (i in a)} 16824> @kbd{ print i, a[i]} 16825> @kbd{@}'} 16826@print{} 3 3 16827@print{} 4 4 16828@end example 16829 16830When sorting an array by element values, if a value happens to be 16831a subarray then it is considered to be greater than any string or 16832numeric value, regardless of what the subarray itself contains, 16833and all subarrays are treated as being equal to each other. Their 16834order relative to each other is determined by their index strings. 16835 16836Here are some additional things to bear in mind about sorted 16837array traversal: 16838 16839@itemize @value{BULLET} 16840@item 16841The value of @code{PROCINFO["sorted_in"]} is global. That is, it affects 16842all array traversal @code{for} loops. 
If you need to change it within your
own code, you should check whether it's defined, save the old value,
and restore it (or remove it) when you are done:

@example
@dots{}
# remember whether a traversal order was already set
had_sorted = ("sorted_in" in PROCINFO)
if (had_sorted)
    save_sorted = PROCINFO["sorted_in"]
PROCINFO["sorted_in"] = "@@val_str_desc" # or whatever
@dots{}
if (had_sorted)
    PROCINFO["sorted_in"] = save_sorted
else
    delete PROCINFO["sorted_in"]
@end example

@item
As already mentioned, the default array traversal order is represented by
@code{"@@unsorted"}. You can also get the default behavior by assigning
the null string to @code{PROCINFO["sorted_in"]} or by just deleting the
@code{"sorted_in"} element from the @code{PROCINFO} array with
the @code{delete} statement.
(The @code{delete} statement hasn't been described yet; @pxref{Delete}.)
@end itemize

In addition, @command{gawk} provides built-in functions for
sorting arrays; see @ref{Array Sorting Functions}.

@node Numeric Array Subscripts
@section Using Numbers to Subscript Arrays

@cindex numbers @subentry as array subscripts
@cindex array subscripts @subentry numbers as
@cindex arrays @subentry numeric subscripts
@cindex subscripts in arrays @subentry numbers as
@cindex @code{CONVFMT} variable @subentry array subscripts and
An important aspect to remember about arrays is that @emph{array subscripts
are always strings}. When a numeric value is used as a subscript,
it is converted to a string value before being used for subscripting
(@pxref{Conversion}).
This means that the value of the predefined variable @code{CONVFMT} can
affect how your program accesses elements of an array.
For example:

@example
xyz = 12.153
data[xyz] = 1
CONVFMT = "%2.2f"
if (xyz in data)
    printf "%s is in data\n", xyz
else
    printf "%s is not in data\n", xyz
@end example

@noindent
This prints @samp{12.15 is not in data}. The first statement gives
@code{xyz} a numeric value. Assigning to
@code{data[xyz]} subscripts @code{data} with the string value @code{"12.153"}
(using the default conversion value of @code{CONVFMT}, @code{"%.6g"}).
Thus, the array element @code{data["12.153"]} is assigned the value one.
The program then changes
the value of @code{CONVFMT}. The test @samp{(xyz in data)} generates a new
string value from @code{xyz}---this time @code{"12.15"}---because the value of
@code{CONVFMT} allows only two digits after the decimal point. This test fails,
because @code{"12.15"} is different from @code{"12.153"}.

@cindex converting @subentry integer array subscripts to strings
@cindex integer array indices
According to the rules for conversions
(@pxref{Conversion}), integer
values always convert to strings as integers, no matter what the
value of @code{CONVFMT} may happen to be. So the usual case of
the following works:

@example
for (i = 1; i <= maxsub; i++)
    @ii{do something with} array[i]
@end example

The ``integer values always convert to strings as integers'' rule
has an additional consequence for array indexing.
Octal and hexadecimal constants
@ifnotdocbook
(@pxref{Nondecimal-numbers})
@end ifnotdocbook
@ifdocbook
(covered in @ref{Nondecimal-numbers})
@end ifdocbook
are converted internally into numbers, and their original form
is forgotten. This means, for example, that @code{array[17]},
@code{array[021]}, and @code{array[0x11]} all refer to the same element!
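
As a minimal sketch of this behavior (the array name @code{seen} is
chosen here purely for illustration), the following can be run with
@command{gawk} in its default mode, where octal and hexadecimal constants
are recognized in program text:

@example
$ @kbd{gawk 'BEGIN @{}
> @kbd{    seen[17] = "decimal"}
> @kbd{    seen[021] = "octal"     # overwrites seen[17]}
> @kbd{    seen[0x11] = "hex"      # overwrites it again}
> @kbd{    print length(seen), seen[17]}
> @kbd{@}'}
@print{} 1 hex
@end example

@noindent
Only one element exists, because all three constants have the numeric value
17 and therefore the same subscript string, @code{"17"}.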

As with many things in @command{awk}, the majority of the time
things work as you would expect them to. But it is useful to have a precise
knowledge of the actual rules, as they can sometimes have a subtle
effect on your programs.

@node Uninitialized Subscripts
@section Using Uninitialized Variables as Subscripts

@cindex variables @subentry uninitialized, as array subscripts
@cindex uninitialized variables, as array subscripts
@cindex subscripts in arrays @subentry uninitialized variables as
@cindex arrays @subentry subscripts, uninitialized variables as
Suppose it's necessary to write a program
to print the input data in reverse order.
A reasonable attempt to do so (with some test
data) might look like this:

@example
$ @kbd{echo 'line 1}
> @kbd{line 2}
> @kbd{line 3' | awk '@{ l[lines] = $0; ++lines @}}
> @kbd{END @{}
> @kbd{    for (i = lines - 1; i >= 0; i--)}
> @kbd{        print l[i]}
> @kbd{@}'}
@print{} line 3
@print{} line 2
@end example

Unfortunately, the very first line of input data did not appear in the
output!

At first glance, we would think that this program should have worked.
The variable @code{lines}
is uninitialized, and uninitialized variables have the numeric value zero.
So, @command{awk} should have printed the value of @code{l[0]}.

The issue here is that subscripts for @command{awk} arrays are @emph{always}
strings. Uninitialized variables, when used as strings, have the
value @code{""}, not zero. Thus, @samp{line 1} ends up stored in
@code{l[""]}.
The following version of the program works correctly:

@example
@{ l[lines++] = $0 @}
END @{
    for (i = lines - 1; i >= 0; i--)
        print l[i]
@}
@end example

Here, the @samp{++} forces @code{lines} to be numeric, thus making
the ``old value'' numeric zero. This is then converted to @code{"0"}
as the array subscript.

@cindex array subscripts @subentry null string as
@cindex null strings @subentry as array subscripts
@cindex dark corner @subentry array subscripts
@cindex lint checking @subentry array subscripts
Even though it is somewhat unusual, the null string
(@code{""}) is a valid array subscript.
@value{DARKCORNER}
@command{gawk} warns about the use of the null string as a subscript
if @option{--lint} is provided
on the command line (@pxref{Options}).

@node Delete
@section The @code{delete} Statement
@cindex @code{delete} statement
@cindex deleting @subentry elements in arrays
@cindex arrays @subentry elements @subentry deleting
@cindex elements in arrays @subentry deleting

To remove an individual element of an array, use the @code{delete}
statement:

@example
delete @var{array}[@var{index-expression}]
@end example

Once an array element has been deleted, any value the element once
had is no longer available. It is as if the element had never
been referred to or been given a value.
The following is an example of deleting elements in an array:

@example
for (i in frequencies)
    delete frequencies[i]
@end example

@noindent
This example removes all the elements from the array @code{frequencies}.
Once an element is deleted, a subsequent @code{for} statement to scan the array
does not report that element and using the @code{in} operator to check for
the presence of that element returns zero (i.e., false):

@example
delete foo[4]
if (4 in foo)
    print "This will never be printed"
@end example

@cindex null strings @subentry deleting array elements and
It is important to note that deleting an element is @emph{not} the
same as assigning it a null value (the empty string, @code{""}).
For example:

@example
@group
foo[4] = ""
if (4 in foo)
    print "This is printed, even though foo[4] is empty"
@end group
@end example

@cindex lint checking @subentry array subscripts
It is not an error to delete an element that does not exist.
However, if @option{--lint} is provided on the command line
(@pxref{Options}),
@command{gawk} issues a warning message when an element that
is not in the array is deleted.

@cindex common extensions @subentry @code{delete} to delete entire arrays
@cindex extensions @subentry common @subentry @code{delete} to delete entire arrays
@cindex arrays @subentry deleting entire contents
@cindex deleting @subentry entire arrays
@cindex @code{delete} @var{array}
@cindex differences in @command{awk} and @command{gawk} @subentry array elements, deleting
All the elements of an array may be deleted with a single statement
by leaving off the subscript in the @code{delete} statement,
as follows:

@example
delete @var{array}
@end example

Using this version of the @code{delete} statement is about three times
more efficient than the equivalent loop that deletes each element one
at a time.
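
As a quick check of the subscript-free form, here is a small sketch
(the array name @code{tmp} and its contents are illustrative only, and
@code{length()} applied to an array is assumed to be available, as it is
in @command{gawk}):

@example
BEGIN @{
    n = split("a b c", tmp)    # tmp now holds three elements
    delete tmp                 # remove them all in one statement
    print n, length(tmp)       # prints "3 0"
@}
@end example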

This form of the @code{delete} statement is also supported
by BWK @command{awk} and @command{mawk}, as well as
by a number of other implementations.

@cindex Brian Kernighan's @command{awk}
@quotation NOTE
For many years, using @code{delete} without a subscript was a common
extension. In September 2012, it was accepted for inclusion into the
POSIX standard. See @uref{http://austingroupbugs.net/view.php?id=544,
the Austin Group website}.
@end quotation

@cindex portability @subentry deleting array elements
@cindex Brennan, Michael
The following statement provides a portable but nonobvious way to clear
out an array:@footnote{Thanks to Michael Brennan for pointing this out.}

@example
split("", array)
@end example

@cindex @code{split()} function @subentry array elements, deleting
The @code{split()} function
(@pxref{String Functions})
clears out the target array first. This call asks it to split
apart the null string. Because there is no data to split out, the
function simply clears the array and then returns.

@quotation CAUTION
Deleting all the elements from an array does not change its type; you cannot
clear an array and then use the array's name as a scalar
(i.e., a regular variable). For example, the following does not work:

@example
a[1] = 3
delete a
a = 3
@end example
@end quotation

@node Multidimensional
@section Multidimensional Arrays

@menu
* Multiscanning::               Scanning multidimensional arrays.
@end menu

@cindex subscripts in arrays @subentry multidimensional
@cindex arrays @subentry multidimensional
A @dfn{multidimensional array} is an array in which an element is identified
by a sequence of indices instead of a single index. For example, a
two-dimensional array requires two indices.
The usual way (in many
languages, including @command{awk}) to refer to an element of a
two-dimensional array named @code{grid} is with
@code{grid[@var{x},@var{y}]}.

@cindex @code{SUBSEP} variable @subentry multidimensional arrays and
Multidimensional arrays are supported in @command{awk} through
concatenation of indices into one string.
@command{awk} converts the indices into strings
(@pxref{Conversion}) and
concatenates them together, with a separator between them. This creates
a single string that describes the values of the separate indices. The
combined string is used as a single index into an ordinary,
one-dimensional array. The separator used is the value of the built-in
variable @code{SUBSEP}.

For example, suppose we evaluate the expression @samp{foo[5,12] = "value"}
when the value of @code{SUBSEP} is @code{"@@"}. The numbers 5 and 12 are
converted to strings and
concatenated with an @samp{@@} between them, yielding @code{"5@@12"}; thus,
the array element @code{foo["5@@12"]} is set to @code{"value"}.

Once the element's value is stored, @command{awk} has no record of whether
it was stored with a single index or a sequence of indices. The two
expressions @samp{foo[5,12]} and @w{@samp{foo[5 SUBSEP 12]}} are always
equivalent.

The default value of @code{SUBSEP} is the string @code{"\034"},
which contains a nonprinting character that is unlikely to appear in an
@command{awk} program or in most input data.
The usefulness of choosing an unlikely character comes from the fact
that index values that contain a string matching @code{SUBSEP} can lead to
combined strings that are ambiguous. Suppose that @code{SUBSEP} is
@code{"@@"}; then @w{@samp{foo["a@@b", "c"]}} and @w{@samp{foo["a",
"b@@c"]}} are indistinguishable because both are actually
stored as @samp{foo["a@@b@@c"]}.
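
The ambiguity is easy to reproduce. The following sketch assumes
nothing beyond the rules just described; the stored strings are
illustrative, and @code{length()} applied to an array is assumed to be
available, as it is in @command{gawk}:

@example
BEGIN @{
    SUBSEP = "@@"
    foo["a@@b", "c"] = "first"
    foo["a", "b@@c"] = "second"     # same combined index: replaces "first"
    print length(foo), foo["a@@b@@c"]
@}
@end example

@noindent
This prints @samp{1 second}: only one element exists, and the second
assignment replaced the value stored by the first.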

@cindex @code{in} operator @subentry index existence in multidimensional arrays
To test whether a particular index sequence exists in a
multidimensional array, use the same operator (@code{in}) that is
used for single-dimensional arrays. Write the whole sequence of indices
in parentheses, separated by commas, as the left operand:

@example
if ((@var{subscript1}, @var{subscript2}, @dots{}) in @var{array})
    @dots{}
@end example

Here is an example that treats its input as a two-dimensional array of
fields; it rotates this array 90 degrees clockwise and prints the
result. It assumes that all lines have the same number of
elements:

@example
@{
    if (max_nf < NF)
        max_nf = NF
    max_nr = NR
    for (x = 1; x <= NF; x++)
        vector[x, NR] = $x
@}

END @{
    for (x = 1; x <= max_nf; x++) @{
        for (y = max_nr; y >= 1; --y)
            printf("%s ", vector[x, y])
        printf("\n")
    @}
@}
@end example

@noindent
When given the input:

@example
@group
1 2 3 4 5 6
2 3 4 5 6 1
3 4 5 6 1 2
4 5 6 1 2 3
@end group
@end example

@noindent
the program produces the following output:

@example
@group
4 3 2 1
5 4 3 2
6 5 4 3
1 6 5 4
2 1 6 5
3 2 1 6
@end group
@end example

@node Multiscanning
@subsection Scanning Multidimensional Arrays

There is no special @code{for} statement for scanning a
``multidimensional'' array. There cannot be one, because, in truth,
@command{awk} does not have
multidimensional arrays or elements---there is only a
multidimensional @emph{way of accessing} an array.

@cindex subscripts in arrays @subentry multidimensional @subentry scanning
@cindex arrays @subentry multidimensional @subentry scanning
@cindex scanning multidimensional arrays
However, if your program has an array that is always accessed as
multidimensional, you can get the effect of scanning it by combining
the scanning @code{for} statement
(@pxref{Scanning an Array}) with the
built-in @code{split()} function
(@pxref{String Functions}).
It works in the following manner:

@example
for (combined in array) @{
    split(combined, separate, SUBSEP)
    @dots{}
@}
@end example

@noindent
This sets the variable @code{combined} to
each concatenated combined index in the array, and splits it
into the individual indices by breaking it apart where the value of
@code{SUBSEP} appears. The individual indices then become the elements of
the array @code{separate}.

Thus, if a value is previously stored in @code{array[1, "foo"]}, then
an element with index @code{"1\034foo"} exists in @code{array}. (Recall
that the default value of @code{SUBSEP} is the character with code 034.)
Sooner or later, the @code{for} statement finds that index and does an
iteration with the variable @code{combined} set to @code{"1\034foo"}.
Then the @code{split()} function is called as follows:

@example
split("1\034foo", separate, "\034")
@end example

@noindent
The result is to set @code{separate[1]} to @code{"1"} and
@code{separate[2]} to @code{"foo"}. Presto! The original sequence of
separate indices is recovered.


@node Arrays of Arrays
@section Arrays of Arrays
@cindex arrays @subentry arrays of arrays

@command{gawk} goes beyond standard @command{awk}'s multidimensional
array access and provides true arrays of
arrays.
Elements of a subarray are referred to by their own indices
enclosed in square brackets, just like the elements of the main array.
For example, the following creates a two-element subarray at index @code{1}
of the main array @code{a}:

@example
a[1][1] = 1
a[1][2] = 2
@end example

This simulates a true two-dimensional array. Each subarray element can
contain another subarray as a value, which in turn can hold other arrays
as well. In this way, you can create arrays of three or more dimensions.
The indices can be any @command{awk} expressions, including scalars
separated by commas (i.e., a regular @command{awk} simulated
multidimensional subscript). So the following is valid in
@command{gawk}:

@example
a[1][3][1, "name"] = "barney"
@end example

Each subarray and the main array can be of different length. In fact, the
elements of an array or its subarray do not all have to have the same
type. This means that the main array and any of its subarrays can be
nonrectangular, or jagged in structure. You can assign a scalar value to
the index @code{4} of the main array @code{a}, even though @code{a[1]}
is itself an array and not a scalar:

@example
a[4] = "An element in a jagged array"
@end example

The terms @dfn{dimension}, @dfn{row}, and @dfn{column} are
meaningless when applied
to such an array, but we will use ``dimension'' henceforth to imply the
maximum number of indices needed to refer to an existing element. The
type of any element that has already been assigned cannot be changed
by assigning a value of a different type.
You have to first delete the
current element, which effectively makes @command{gawk} forget about
the element at that index:

@example
delete a[4]
a[4][5][6][7] = "An element in a four-dimensional array"
@end example

@noindent
This removes the scalar value from index @code{4} and then inserts a
three-level nested subarray
containing a scalar. You can also
delete an entire subarray or subarray of subarrays:

@example
delete a[4][5]
a[4][5] = "An element in subarray a[4]"
@end example

But recall that you cannot delete the main array @code{a} and then use it
as a scalar.

The built-in functions that take array arguments can also be used
with subarrays. For example, the following code fragment uses @code{length()}
(@pxref{String Functions})
to determine the number of elements in the main array @code{a} and
its subarrays:

@example
print length(a), length(a[1]), length(a[1][3])
@end example

@noindent
This results in the following output for our main array @code{a}
(the values are separated by the default value of @code{OFS}, a single space):

@example
2 3 1
@end example

@noindent
The @samp{@var{subscript} in @var{array}} expression
(@pxref{Reference to Elements}) works similarly for both
regular @command{awk}-style
arrays and arrays of arrays. For example, the tests @samp{1 in a},
@samp{3 in a[1]}, and @samp{(1, "name") in a[1][3]} all evaluate to
one (true) for our array @code{a}.

The @samp{for (item in array)} statement (@pxref{Scanning an Array})
can be nested to scan all the
elements of an array of arrays if it is rectangular in structure.
In order
to print the contents (scalar values) of a two-dimensional array of arrays
(i.e., in which each first-level element is itself an
array, not necessarily of the same length),
you could use the following code:

@example
for (i in array)
    for (j in array[i])
        print array[i][j]
@end example

The @code{isarray()} function (@pxref{Type Functions})
lets you test if an array element is itself an array:

@example
for (i in array) @{
    if (isarray(array[i])) @{
        for (j in array[i]) @{
            print array[i][j]
        @}
    @}
    else
        print array[i]
@}
@end example

If the structure of a jagged array of arrays is known in advance,
you can often devise workarounds using control statements. For example,
the following code prints the elements of our main array @code{a}:

@example
@group
for (i in a) @{
    for (j in a[i]) @{
        if (j == 3) @{
            for (k in a[i][j])
                print a[i][j][k]
@end group
@group
        @} else
            print a[i][j]
    @}
@}
@end group
@end example

@noindent
@xref{Walking Arrays} for a user-defined function that ``walks'' an
arbitrarily dimensioned array of arrays.

Recall that a reference to an uninitialized array element yields a value
of @code{""}, the null string. This has one important implication when you
intend to use a subarray as an argument to a function, as illustrated by
the following example:

@example
$ @kbd{gawk 'BEGIN @{ split("a b c d", b[1]); print b[1][1] @}'}
@error{} gawk: cmd.
line:1: fatal: split: second argument is not an array
@end example

The way to work around this is to first force @code{b[1]} to be an array by
creating an arbitrary index:

@example
$ @kbd{gawk 'BEGIN @{ b[1][1] = ""; split("a b c d", b[1]); print b[1][1] @}'}
@print{} a
@end example

@node Arrays Summary
@section Summary

@itemize @value{BULLET}
@item
Standard @command{awk} provides one-dimensional associative arrays
(arrays indexed by string values). All arrays are associative; numeric
indices are converted automatically to strings.

@item
Array elements are referenced as @code{@var{array}[@var{indx}]}.
Referencing an element creates it if it did not exist previously.

@item
The proper way to see if an array has an element with a given index
is to use the @code{in} operator: @samp{@var{indx} in @var{array}}.

@item
Use @samp{for (@var{indx} in @var{array}) @dots{}} to scan through all the
individual elements of an array. In the body of the loop, @var{indx} takes
on the value of each element's index in turn.

@item
The order in which a @samp{for (@var{indx} in @var{array})} loop
traverses an array is undefined in POSIX @command{awk} and varies among
implementations. @command{gawk} lets you control the order by assigning
special predefined values to @code{PROCINFO["sorted_in"]}.

@item
Use @samp{delete @var{array}[@var{indx}]} to delete an individual element.
To delete all of the elements in an array,
use @samp{delete @var{array}}.
This latter feature has been a common extension for many
years and is now standard, but may not be supported by all commercial
versions of @command{awk}.

@item
Standard @command{awk} simulates multidimensional arrays by separating
subscript values with commas.
The values are concatenated into a
single string, separated by the value of @code{SUBSEP}. The fact
that such a subscript was created in this way is not retained; thus,
changing @code{SUBSEP} may have unexpected consequences. You can use
@samp{(@var{sub1}, @var{sub2}, @dots{}) in @var{array}} to see if such
a multidimensional subscript exists in @var{array}.

@item
@command{gawk} provides true arrays of arrays. You use a separate
set of square brackets for each dimension in such an array:
@code{data[row][col]}, for example. Array elements may thus be either
scalar values (number or string) or other arrays.

@item
Use the @code{isarray()} built-in function to determine if an array
element is itself a subarray.

@end itemize


@node Functions
@chapter Functions

@cindex functions @subentry built-in
@cindex built-in functions
This @value{CHAPTER} describes @command{awk}'s built-in functions,
which fall into three categories: numeric, string, and I/O.
@command{gawk} provides additional groups of functions
to work with values that represent time, do
bit manipulation, sort arrays,
provide type information, and internationalize and localize programs.

Besides the built-in functions, @command{awk} has provisions for
writing new functions that the rest of a program can use.
The second half of this @value{CHAPTER} describes these
@dfn{user-defined} functions.
Finally, we explore indirect function calls, a @command{gawk}-specific
extension that lets you determine at runtime what function is to
be called.

@menu
* Built-in::                    Summarizes the built-in functions.
* User-defined::                Describes user-defined functions in detail.
* Indirect Calls::              Choosing the function to call at runtime.
* Functions Summary::           Summary of functions.
@end menu

@node Built-in
@section Built-in Functions

@dfn{Built-in} functions are always available for your @command{awk}
program to call. This @value{SECTION} defines all the built-in functions
in @command{awk}; some of these are mentioned in other @value{SECTION}s
but are summarized here for your convenience.

@menu
* Calling Built-in::            How to call built-in functions.
* Numeric Functions::           Functions that work with numbers, including
                                @code{int()}, @code{sin()} and @code{rand()}.
* String Functions::            Functions for string manipulation, such as
                                @code{split()}, @code{match()} and
                                @code{sprintf()}.
* I/O Functions::               Functions for files and shell commands.
* Time Functions::              Functions for dealing with timestamps.
* Bitwise Functions::           Functions for bitwise operations.
* Type Functions::              Functions for type information.
* I18N Functions::              Functions for string translation.
@end menu

@node Calling Built-in
@subsection Calling Built-in Functions

To call one of @command{awk}'s built-in functions, write the name of
the function followed
by arguments in parentheses. For example, @samp{atan2(y + z, 1)}
is a call to the function @code{atan2()} and has two arguments.

@cindex programming conventions @subentry functions @subentry calling
@cindex whitespace @subentry functions, calling
Whitespace is ignored between the built-in function name and the
opening parenthesis, but nonetheless it is good practice to avoid using whitespace
there. User-defined functions do not permit whitespace in this way, and
it is easier to avoid mistakes by following a simple
convention that always works---no whitespace after a function name.
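
A short sketch of the convention follows. The function name @code{twice()}
is hypothetical and serves only as an example:

@example
function twice(x) @{ return 2 * x @}

BEGIN @{
    print substr ("hello", 1, 2)   # built-in: the space is tolerated
    print substr("hello", 1, 2)    # same call, preferred style
    print twice(4)                 # user-defined: no space before the parenthesis
@}
@end example

@noindent
Both @code{substr()} calls print @samp{he}, and the last statement prints
@samp{8}.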

@cindex troubleshooting @subentry @command{gawk} @subentry fatal errors, function arguments
@cindex @command{gawk} @subentry function arguments and
@cindex differences in @command{awk} and @command{gawk} @subentry function arguments
Each built-in function accepts a certain number of arguments.
In some cases, arguments can be omitted. The defaults for omitted
arguments vary from function to function and are described under the
individual functions. In some @command{awk} implementations, extra
arguments given to built-in functions are ignored. However, in @command{gawk},
it is a fatal error to give extra arguments to a built-in function.

When a function is called, expressions that create the function's actual
parameters are evaluated completely before the call is performed.
For example, in the following code fragment:

@example
i = 4
j = sqrt(i++)
@end example

@cindex evaluation order @subentry functions
@cindex functions @subentry built-in @subentry evaluation order
@cindex built-in functions @subentry evaluation order
@noindent
the variable @code{i} is incremented to the value five before @code{sqrt()}
is called with a value of four for its actual parameter.
The order of evaluation of the expressions used for the function's
parameters is undefined. Thus, avoid writing programs that
assume that parameters are evaluated from left to right or from
right to left. For example:

@example
i = 5
j = atan2(++i, i *= 2)
@end example

If the order of evaluation is left to right, then @code{i} first becomes
six, and then 12, and @code{atan2()} is called with the two arguments six
and 12. But if the order of evaluation is right to left, @code{i}
first becomes 10, then 11, and @code{atan2()} is called with the
two arguments 11 and 10.
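
One way to stay independent of the evaluation order is to perform the side
effects in separate statements first. The following is only a sketch; the
variable names @code{y} and @code{x} are introduced here for illustration:

@example
BEGIN @{
    i = 5
    y = ++i            # i is now 6
    x = (i *= 2)       # i is now 12
    j = atan2(y, x)    # always called as atan2(6, 12)
    print i, y, x
@}
@end example

@noindent
This prints @samp{12 6 12} regardless of how an implementation orders
argument evaluation.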

@node Numeric Functions
@subsection Numeric Functions
@cindex numeric @subentry functions

The following list describes all of
the built-in functions that work with numbers.
Optional parameters are enclosed in square brackets@w{ ([ ]):}

@c @asis for docbook
@table @asis
@item @code{atan2(@var{y}, @var{x})}
@cindexawkfunc{atan2}
@cindex arctangent
Return the arctangent of @code{@var{y} / @var{x}} in radians.
You can use @samp{pi = atan2(0, -1)} to retrieve the value of
@value{PI}.

@item @code{cos(@var{x})}
@cindexawkfunc{cos}
@cindex cosine
Return the cosine of @var{x}, with @var{x} in radians.

@item @code{exp(@var{x})}
@cindexawkfunc{exp}
@cindex exponent
Return the exponential of @var{x} (@code{e ^ @var{x}}) or report
an error if @var{x} is out of range. The range of values @var{x} can have
depends on your machine's floating-point representation.

@item @code{int(@var{x})}
@cindexawkfunc{int}
@cindex round to nearest integer
Return the nearest integer to @var{x}, located between @var{x} and zero and
truncated toward zero.
For example, @code{int(3)} is 3, @code{int(3.9)} is 3, @code{int(-3.9)}
is @minus{}3, and @code{int(-3)} is @minus{}3 as well.

@ifset INTDIV
@item @code{intdiv0(@var{numerator}, @var{denominator}, @var{result})}
@cindexawkfunc{intdiv0}
@cindex intdiv0
Perform integer division, similar to the standard C @code{div()} function.
First, truncate @code{numerator} and @code{denominator}
towards zero, creating integer values. Clear the @code{result}
array, and then set @code{result["quotient"]} to the result of
@samp{numerator / denominator}, truncated towards zero to an integer,
and set @code{result["remainder"]} to the result of @samp{numerator %
denominator}, truncated towards zero to an integer.
Attempting division by zero causes a fatal error.
The function returns zero upon success, and @minus{}1 upon error.

This function is
primarily intended for use with arbitrary length integers; it avoids
creating MPFR arbitrary precision floating-point values
(@pxref{Arbitrary Precision Integers}).

This function is a @command{gawk} extension. It is not available in
compatibility mode (@pxref{Options}).
@end ifset

@item @code{log(@var{x})}
@cindexawkfunc{log}
@cindex logarithm
Return the natural logarithm of @var{x}, if @var{x} is positive;
otherwise, return @code{NaN} (``not a number'') on IEEE 754 systems.
Additionally, @command{gawk} prints a warning message when @var{x}
is negative.

@cindex Beebe, Nelson H.F.@:
@item @code{rand()}
@cindexawkfunc{rand}
@cindex random numbers @subentry @code{rand()}/@code{srand()} functions
Return a random number. The values of @code{rand()} are
uniformly distributed between zero and one.
The value could be zero but is never one.@footnote{The C version of
@code{rand()} on many Unix systems is known to produce fairly poor
sequences of random numbers. However, nothing requires that an
@command{awk} implementation use the C @code{rand()} to implement the
@command{awk} version of @code{rand()}. In fact, for many years,
@command{gawk} used the BSD @code{random()} function, which is
considerably better than @code{rand()}, to produce random numbers.
From @value{PVERSION} 4.1.4, courtesy of Nelson H.F.@: Beebe, @command{gawk}
uses the Bays-Durham shuffle buffer algorithm, which considerably extends
the period of the random number generator, and eliminates short-range and
long-range correlations that might exist in the original generator.}

Often random integers are needed instead.
Following is a user-defined function
that can be used to obtain a random nonnegative integer less than @var{n}:

@example
function randint(n)
@{
    return int(n * rand())
@}
@end example

@noindent
The multiplication produces a random number greater than or equal to
zero and less than @code{n}. Using @code{int()}, this result is made into
an integer between zero and @code{n} @minus{} 1, inclusive.

The following example uses a similar function to produce random integers
between one and @var{n}. This program prints a new random number for
each input record:

@example
# Function to roll a simulated die.
function roll(n) @{ return 1 + int(rand() * n) @}

# Roll 3 six-sided dice and
# print total number of points.
@{
    printf("%d points\n", roll(6) + roll(6) + roll(6))
@}
@end example

@cindex seeding random number generator
@cindex random numbers @subentry seed of
@quotation CAUTION
In most @command{awk} implementations, including @command{gawk},
@code{rand()} starts generating numbers from the same
starting number, or @dfn{seed}, each time you run @command{awk}.@footnote{@command{mawk}
uses a different seed each time.} Thus,
a program generates the same results each time you run it.
The numbers are random within one @command{awk} run but predictable
from run to run. This is convenient for debugging, but if you want
a program to do different things each time it is used, you must change
the seed to a value that is different in each run. To do this,
use @code{srand()}.
@end quotation

@item @code{sin(@var{x})}
@cindexawkfunc{sin}
@cindex sine
Return the sine of @var{x}, with @var{x} in radians.

@item @code{sqrt(@var{x})}
@cindexawkfunc{sqrt}
@cindex square root
Return the positive square root of @var{x}.
For example, @code{sqrt(4)} is 2.
@command{gawk} prints a warning message
if @var{x} is negative.

@item @code{srand(}[@var{x}]@code{)}
@cindexawkfunc{srand}
Set the starting point, or seed,
for generating random numbers to the value @var{x}.

Each seed value leads to a particular sequence of random
numbers.@footnote{Computer-generated random numbers really are not truly
random. They are technically known as @dfn{pseudorandom}. This means
that although the numbers in a sequence appear to be random, you can in
fact generate the same sequence of random numbers over and over again.}
Thus, if the seed is set to the same value a second time,
the same sequence of random numbers is produced again.

@quotation CAUTION
Different @command{awk} implementations use different random-number
generators internally. Don't expect the same @command{awk} program
to produce the same series of random numbers when executed by
different versions of @command{awk}.
@end quotation

If the argument @var{x} is omitted, as in @samp{srand()}, then the current
date and time of day are used for a seed. This is the way to get
sequences of random numbers that differ from one run to the next.

The return value of @code{srand()} is the previous seed. This makes it
easy to keep track of the seeds in case you need to consistently reproduce
sequences of random numbers.

POSIX does not specify the initial seed; it differs among @command{awk}
implementations.
@end table

@node String Functions
@subsection String-Manipulation Functions
@cindex string-manipulation functions

The functions in this @value{SECTION} look at or change the text of one
or more strings.

@command{gawk} understands locales (@pxref{Locales}) and does all
string processing in terms of @emph{characters}, not @emph{bytes}.
17782This distinction is particularly important to understand for locales 17783where one character may be represented by multiple bytes. Thus, for 17784example, @code{length()} returns the number of characters in a string, 17785and not the number of bytes used to represent those characters. Similarly, 17786@code{index()} works with character indices, and not byte indices. 17787 17788@quotation CAUTION 17789A number of functions deal with indices into strings. For these 17790functions, the first character of a string is at position (index) one. 17791This is different from C and the languages descended from it, where the 17792first character is at position zero. You need to remember this when 17793doing index calculations, particularly if you are used to C. 17794@end quotation 17795 17796In the following list, optional parameters are enclosed in square brackets@w{ ([ ]).} 17797Several functions perform string substitution; the full discussion is 17798provided in the description of the @code{sub()} function, which comes 17799toward the end, because the list is presented alphabetically. 17800 17801Those functions that are specific to @command{gawk} are marked with a 17802pound sign (@samp{#}). They are not available in compatibility mode 17803(@pxref{Options}): 17804 17805 17806@menu 17807* Gory Details:: More than you want to know about @samp{\} and 17808 @samp{&} with @code{sub()}, @code{gsub()}, and 17809 @code{gensub()}. 17810@end menu 17811 17812@c @asis for docbook 17813@table @asis 17814@item @code{asort(}@var{source} [@code{,} @var{dest} [@code{,} @var{how} ] ]@code{) #} 17815@itemx @code{asorti(}@var{source} [@code{,} @var{dest} [@code{,} @var{how} ] ]@code{) #} 17816@cindexgawkfunc{asorti} 17817@cindex sort array 17818@cindex arrays @subentry elements @subentry retrieving number of 17819@cindexgawkfunc{asort} 17820@cindex sort array indices 17821These two functions are similar in behavior, so they are described 17822together. 

@quotation NOTE
The following description ignores the third argument, @var{how}, as it
requires understanding features that we have not discussed yet. Thus,
the discussion here is a deliberate simplification. (We do provide all
the details later on; see @ref{Array Sorting Functions} for the full story.)
@end quotation

Both functions return the number of elements in the array @var{source}.
For @code{asort()}, @command{gawk} sorts the values of @var{source}
and replaces the indices of the sorted values of @var{source} with
sequential integers starting with one. If the optional array @var{dest}
is specified, then @var{source} is duplicated into @var{dest}. @var{dest}
is then sorted, leaving the indices of @var{source} unchanged.

@cindex @command{gawk} @subentry @code{IGNORECASE} variable in
When comparing strings, @code{IGNORECASE} affects the sorting
(@pxref{Array Sorting Functions}). If the
@var{source} array contains subarrays as values
(@pxref{Arrays of Arrays}), they will come last, after all scalar values.
Subarrays are @emph{not} recursively sorted.

For example, if the contents of @code{a} are as follows:

@example
a["last"] = "de"
a["first"] = "sac"
a["middle"] = "cul"
@end example

@noindent
a call to @code{asort()}:

@example
asort(a)
@end example

@noindent
results in the following contents of @code{a}:

@example
@group
a[1] = "cul"
a[2] = "de"
a[3] = "sac"
@end group
@end example

The @code{asorti()} function works similarly to @code{asort()}; however,
the @emph{indices} are sorted, instead of the values.
Thus, in the 17873previous example, starting with the same initial set of indices and 17874values in @code{a}, calling @samp{asorti(a)} would yield: 17875 17876@example 17877a[1] = "first" 17878a[2] = "last" 17879a[3] = "middle" 17880@end example 17881 17882@quotation NOTE 17883You may not use either @code{SYMTAB} or @code{FUNCTAB} as the second 17884argument to these functions. Attempting to do so produces a fatal error. 17885You may use them as the first argument, but only if providing a second 17886array to use for the actual sorting. 17887@end quotation 17888 17889You are allowed to use the same array for both the @var{source} and @var{dest} 17890arguments, but doing so only makes sense if you're also supplying the third argument. 17891 17892@item @code{gensub(@var{regexp}, @var{replacement}, @var{how}} [@code{, @var{target}}]@code{) #} 17893@cindexgawkfunc{gensub} 17894@cindex search and replace in strings 17895@cindex substitute in string 17896Search the target string @var{target} for matches of the regular 17897expression @var{regexp}. If @var{how} is a string beginning with 17898@samp{g} or @samp{G} (short for ``global''), then replace all matches 17899of @var{regexp} with @var{replacement}. Otherwise, treat @var{how} 17900as a number indicating which match of @var{regexp} to replace. Treat 17901numeric values less than one as if they were one. If no @var{target} 17902is supplied, use @code{$0}. Return the modified string as the result 17903of the function. The original target string is @emph{not} changed. 17904 17905The returned value is @emph{always} a string, even if the original 17906@var{target} was a number or a regexp value. 17907 17908@code{gensub()} is a general substitution function. Its purpose is 17909to provide more features than the standard @code{sub()} and @code{gsub()} 17910functions. 
17911 17912@code{gensub()} provides an additional feature that is not available 17913in @code{sub()} or @code{gsub()}: the ability to specify components of a 17914regexp in the replacement text. This is done by using parentheses in 17915the regexp to mark the components and then specifying @samp{\@var{N}} 17916in the replacement text, where @var{N} is a digit from 1 to 9. 17917For example: 17918 17919@example 17920$ @kbd{gawk '} 17921> @kbd{BEGIN @{} 17922> @kbd{a = "abc def"} 17923> @kbd{b = gensub(/(.+) (.+)/, "\\2 \\1", "g", a)} 17924> @kbd{print b} 17925> @kbd{@}'} 17926@print{} def abc 17927@end example 17928 17929@noindent 17930As with @code{sub()}, you must type two backslashes in order 17931to get one into the string. 17932In the replacement text, the sequence @samp{\0} represents the entire 17933matched text, as does the character @samp{&}. 17934 17935The following example shows how you can use the third argument to control 17936which match of the regexp should be changed: 17937 17938@example 17939$ @kbd{echo a b c a b c |} 17940> @kbd{gawk '@{ print gensub(/a/, "AA", 2) @}'} 17941@print{} a b c AA b c 17942@end example 17943 17944In this case, @code{$0} is the default target string. 17945@code{gensub()} returns the new string as its result, which is 17946passed directly to @code{print} for printing. 17947 17948@c @cindex automatic warnings 17949@c @cindex warnings, automatic 17950If the @var{how} argument is a string that does not begin with @samp{g} or 17951@samp{G}, or if it is a number that is less than or equal to zero, only one 17952substitution is performed. If @var{how} is zero, @command{gawk} issues 17953a warning message. 17954 17955If @var{regexp} does not match @var{target}, @code{gensub()}'s return value 17956is the original unchanged value of @var{target}. Note that, as mentioned 17957above, the returned value is a string, even if @var{target} was not. 
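
As a quick check of these rules, the following session is a minimal
sketch. The sample string is arbitrary, and the session assumes that
@command{gawk} is installed under that name and is not running in
compatibility mode. It shows that @code{gensub()} leaves @var{target}
untouched, and that a regexp that does not match yields the original value:

@example
$ @kbd{gawk 'BEGIN @{}
> @kbd{s = "a b c a b c"}
> @kbd{t = gensub(/a/, "AA", "g", s)}
> @kbd{print s; print t}
> @kbd{print gensub(/z/, "AA", 1, s)}
> @kbd{@}'}
@print{} a b c a b c
@print{} AA b c AA b c
@print{} a b c a b c
@end example

@noindent
The first line of output is the unchanged @code{s}; only the returned
copy in @code{t} carries the substitutions.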
17958 17959@item @code{gsub(@var{regexp}, @var{replacement}} [@code{, @var{target}}]@code{)} 17960@cindexawkfunc{gsub} 17961Search @var{target} for 17962@emph{all} of the longest, leftmost, @emph{nonoverlapping} matching 17963substrings it can find and replace them with @var{replacement}. 17964The @samp{g} in @code{gsub()} stands for 17965``global,'' which means replace everywhere. For example: 17966 17967@example 17968@{ gsub(/Britain/, "United Kingdom"); print @} 17969@end example 17970 17971@noindent 17972replaces all occurrences of the string @samp{Britain} with @samp{United 17973Kingdom} for all input records. 17974 17975The @code{gsub()} function returns the number of substitutions made. If 17976the variable to search and alter (@var{target}) is 17977omitted, then the entire input record (@code{$0}) is used. 17978As in @code{sub()}, the characters @samp{&} and @samp{\} are special, 17979and the third argument must be assignable. 17980 17981@item @code{index(@var{in}, @var{find})} 17982@cindexawkfunc{index} 17983@cindex search for substring 17984@cindex find substring in string 17985Search the string @var{in} for the first occurrence of the string 17986@var{find}, and return the position in characters where that occurrence 17987begins in the string @var{in}. Consider the following example: 17988 17989@example 17990$ @kbd{awk 'BEGIN @{ print index("peanut", "an") @}'} 17991@print{} 3 17992@end example 17993 17994@noindent 17995If @var{find} is not found, @code{index()} returns zero. 17996 17997@cindex dark corner @subentry regexp as second argument to @code{index()} 17998With BWK @command{awk} and @command{gawk}, 17999it is a fatal error to use a regexp constant for @var{find}. 18000Other implementations allow it, simply treating the regexp 18001constant as an expression meaning @samp{$0 ~ /regexp/}. 
@value{DARKCORNER} 18002 18003@item @code{length(}[@var{string}]@code{)} 18004@cindexawkfunc{length} 18005@cindex string @subentry length 18006@cindex length of string 18007Return the number of characters in @var{string}. If 18008@var{string} is a number, the length of the digit string representing 18009that number is returned. For example, @code{length("abcde")} is five. By 18010contrast, @code{length(15 * 35)} works out to three. In this example, 18011@iftex 18012@math{15 @cdot 35 = 525}, 18013@end iftex 18014@ifnottex 18015@ifnotdocbook 1801615 * 35 = 525, 18017@end ifnotdocbook 18018@end ifnottex 18019@docbook 1802015 ⋅ 35 = 525, 18021@end docbook 18022and 525 is then converted to the string @code{"525"}, which has 18023three characters. 18024 18025@cindex length of input record 18026@cindex input record, length of 18027If no argument is supplied, @code{length()} returns the length of @code{$0}. 18028 18029@c @cindex historical features 18030@cindex portability @subentry @code{length()} function 18031@cindex POSIX @command{awk} @subentry functions and @subentry @code{length()} 18032@quotation NOTE 18033In older versions of @command{awk}, the @code{length()} function could 18034be called 18035without any parentheses. Doing so is considered poor practice, 18036although the 2008 POSIX standard explicitly allows it, to 18037support historical practice. For programs to be maximally portable, 18038always supply the parentheses. 18039@end quotation 18040 18041@cindex dark corner @subentry @code{length()} function 18042If @code{length()} is called with a variable that has not been used, 18043@command{gawk} forces the variable to be a scalar. Other 18044implementations of @command{awk} leave the variable without a type. 
18045@value{DARKCORNER} 18046Consider: 18047 18048@example 18049$ @kbd{gawk 'BEGIN @{ print length(x) ; x[1] = 1 @}'} 18050@print{} 0 18051@error{} gawk: fatal: attempt to use scalar `x' as array 18052 18053$ @kbd{nawk 'BEGIN @{ print length(x) ; x[1] = 1 @}'} 18054@print{} 0 18055@end example 18056 18057@noindent 18058If @option{--lint} has 18059been specified on the command line, @command{gawk} issues a 18060warning about this. 18061 18062@cindex common extensions @subentry @code{length()} applied to an array 18063@cindex extensions @subentry common @subentry @code{length()} applied to an array 18064@cindex differences in @command{awk} and @command{gawk} @subentry @code{length()} function 18065@cindex number of array elements 18066@cindex arrays @subentry number of elements 18067With @command{gawk} and several other @command{awk} implementations, when given an 18068array argument, the @code{length()} function returns the number of elements 18069in the array. @value{COMMONEXT} 18070This is less useful than it might seem at first, as the 18071array is not guaranteed to be indexed from one to the number of elements 18072in it. 18073If @option{--lint} is provided on the command line 18074(@pxref{Options}), 18075@command{gawk} warns that passing an array argument is not portable. 18076If @option{--posix} is supplied, using an array argument is a fatal error 18077(@pxref{Arrays}). 18078 18079@item @code{match(@var{string}, @var{regexp}} [@code{, @var{array}}]@code{)} 18080@cindexawkfunc{match} 18081@cindex string @subentry regular expression match of 18082@cindex match regexp in string 18083Search @var{string} for the 18084longest, leftmost substring matched by the regular expression 18085@var{regexp} and return the character position (index) 18086at which that substring begins (one, if it starts at the beginning of 18087@var{string}). If no match is found, return zero. 
18088 18089The @var{regexp} argument may be either a regexp constant 18090(@code{/}@dots{}@code{/}) or a string constant (@code{"}@dots{}@code{"}). 18091In the latter case, the string is treated as a regexp to be matched. 18092@xref{Computed Regexps} for a 18093discussion of the difference between the two forms, and the 18094implications for writing your program correctly. 18095 18096The order of the first two arguments is the opposite of most other string 18097functions that work with regular expressions, such as 18098@code{sub()} and @code{gsub()}. It might help to remember that 18099for @code{match()}, the order is the same as for the @samp{~} operator: 18100@samp{@var{string} ~ @var{regexp}}. 18101 18102@cindex @code{RSTART} variable @subentry @code{match()} function and 18103@cindex @code{RLENGTH} variable @subentry @code{match()} function and 18104@cindex @code{match()} function @subentry @code{RSTART}/@code{RLENGTH} variables 18105@cindex @code{match()} function @subentry side effects 18106@cindex side effects @subentry @code{match()} function 18107The @code{match()} function sets the predefined variable @code{RSTART} to 18108the index. It also sets the predefined variable @code{RLENGTH} to the 18109length in characters of the matched substring. If no match is found, 18110@code{RSTART} is set to zero, and @code{RLENGTH} to @minus{}1. 18111 18112For example: 18113 18114@example 18115@c file eg/misc/findpat.awk 18116@{ 18117 if ($1 == "FIND") 18118 regex = $2 18119 else @{ 18120 where = match($0, regex) 18121 if (where != 0) 18122 print "Match of", regex, "found at", where, "in", $0 18123 @} 18124@} 18125@c endfile 18126@end example 18127 18128@noindent 18129This program looks for lines that match the regular expression stored in 18130the variable @code{regex}. This regular expression can be changed. If the 18131first word on a line is @samp{FIND}, @code{regex} is changed to be the 18132second word on that line. 
Therefore, if given:

@example
@c file eg/misc/findpat.data
FIND ru+n
My program runs
but not very quickly
FIND Melvin
JF+KM
This line is property of Reality Engineering Co.
Melvin was here.
@c endfile
@end example

@noindent
@command{awk} prints:

@example
Match of ru+n found at 12 in My program runs
Match of Melvin found at 1 in Melvin was here.
@end example

@cindex differences in @command{awk} and @command{gawk} @subentry @code{match()} function
If @var{array} is present, it is cleared, and then the zeroth element
of @var{array} is set to the entire portion of @var{string}
matched by @var{regexp}. If @var{regexp} contains parentheses,
the integer-indexed elements of @var{array} are set to contain the
portion of @var{string} matching the corresponding parenthesized
subexpression.
For example:

@example
$ @kbd{echo foooobazbarrrrr |}
> @kbd{gawk '@{ match($0, /(fo+).+(bar*)/, arr)}
> @kbd{print arr[1], arr[2] @}'}
@print{} foooo barrrrr
@end example

In addition,
multidimensional subscripts are available providing
the start index and length of each matched subexpression:

@example
$ @kbd{echo foooobazbarrrrr |}
> @kbd{gawk '@{ match($0, /(fo+).+(bar*)/, arr)}
> @kbd{print arr[1], arr[2]}
> @kbd{print arr[1, "start"], arr[1, "length"]}
> @kbd{print arr[2, "start"], arr[2, "length"]}
> @kbd{@}'}
@print{} foooo barrrrr
@print{} 1 5
@print{} 9 7
@end example

There may not be subscripts for the start and length for every parenthesized
subexpression, because they may not all have matched text; thus, they
should be tested for with the @code{in} operator
(@pxref{Reference to Elements}).
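
The following sketch shows one way to make that test. The regexp and
the input are made up for illustration, and the session assumes that
@command{gawk} is not running in compatibility mode. Only the first
alternative can match @samp{foobar}, so the @code{"start"} and
@code{"length"} subscripts should exist for the first subexpression
but not for the second:

@example
$ @kbd{echo foobar |}
> @kbd{gawk '@{ match($0, /(foo)|(baz)/, arr)}
> @kbd{for (i = 1; i <= 2; i++)}
> @kbd{    if ((i, "start") in arr)}
> @kbd{        print i, arr[i, "start"], arr[i, "length"]}
> @kbd{    else}
> @kbd{        print i, "did not match"}
> @kbd{@}'}
@print{} 1 1 3
@print{} 2 did not match
@end example

@noindent
Testing with @code{in} avoids creating empty elements in @code{arr}
as a side effect of merely referencing them.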

@cindex troubleshooting @subentry @code{match()} function
The @var{array} argument to @code{match()} is a
@command{gawk} extension. In compatibility mode
(@pxref{Options}),
using a third argument is a fatal error.

@item @code{patsplit(@var{string}, @var{array}} [@code{, @var{fieldpat}} [@code{, @var{seps}} ] ]@code{) #}
@cindexgawkfunc{patsplit}
@cindex split string into array
Divide
@var{string} into pieces (or ``fields'') defined by @var{fieldpat}
and store the pieces in @var{array} and the separator strings in the
@var{seps} array. The first piece is stored in
@code{@var{array}[1]}, the second piece in @code{@var{array}[2]}, and so
forth. The third argument, @var{fieldpat}, is
a regexp describing the fields in @var{string} (just as @code{FPAT} is
a regexp describing the fields in input records).
It may be either a regexp constant or a string.
If @var{fieldpat} is omitted, the value of @code{FPAT} is used.
@code{patsplit()} returns the number of elements created.
@code{@var{seps}[@var{i}]} is
the possibly null separator string
after @code{@var{array}[@var{i}]}.
The possibly null leading separator will be in @code{@var{seps}[0]}.
So a non-null @var{string} with @var{n} fields will have @var{n}+1 separators.
A null @var{string} has no fields or separators.

The @code{patsplit()} function splits strings into pieces in a
manner similar to the way input lines are split into fields using @code{FPAT}
(@pxref{Splitting By Content}).

Before splitting the string, @code{patsplit()} deletes any previously existing
elements in the arrays @var{array} and @var{seps}.
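
As a small illustration, here is a sketch in which the sample string
and the array names @code{parts} and @code{seps} are arbitrary
choices. The fields are runs of digits, and the text around them
ends up in @code{seps}:

@example
$ @kbd{gawk 'BEGIN @{}
> @kbd{n = patsplit("ab12cd345", parts, /[0-9]+/, seps)}
> @kbd{print n}
> @kbd{for (i = 1; i <= n; i++) print i, parts[i]}
> @kbd{for (i = 0; i <= n; i++) print i, "<" seps[i] ">"}
> @kbd{@}'}
@print{} 2
@print{} 1 12
@print{} 2 345
@print{} 0 <ab>
@print{} 1 <cd>
@print{} 2 <>
@end example

@noindent
Here the leading separator @samp{ab} is in @code{seps[0]}, and the
separator following the last field is null.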
18224 18225@item @code{split(@var{string}, @var{array}} [@code{, @var{fieldsep}} [@code{, @var{seps}} ] ]@code{)} 18226@cindexawkfunc{split} 18227Divide @var{string} into pieces separated by @var{fieldsep} 18228and store the pieces in @var{array} and the separator strings in the 18229@var{seps} array. The first piece is stored in 18230@code{@var{array}[1]}, the second piece in @code{@var{array}[2]}, and so 18231forth. The string value of the third argument, @var{fieldsep}, is 18232a regexp describing where to split @var{string} (much as @code{FS} can 18233be a regexp describing where to split input records). 18234If @var{fieldsep} is omitted, the value of @code{FS} is used. 18235@code{split()} returns the number of elements created. 18236@var{seps} is a @command{gawk} extension, with @code{@var{seps}[@var{i}]} 18237being the separator string 18238between @code{@var{array}[@var{i}]} and @code{@var{array}[@var{i}+1]}. 18239If @var{fieldsep} is a single 18240space, then any leading whitespace goes into @code{@var{seps}[0]} and 18241any trailing 18242whitespace goes into @code{@var{seps}[@var{n}]}, where @var{n} is the 18243return value of 18244@code{split()} (i.e., the number of elements in @var{array}). 18245 18246The @code{split()} function splits strings into pieces in the same way 18247that input lines are split into fields. For example: 18248 18249@example 18250split("cul-de-sac", a, "-", seps) 18251@end example 18252 18253@noindent 18254@cindex strings @subentry splitting, example 18255splits the string @code{"cul-de-sac"} into three fields using @samp{-} as the 18256separator. It sets the contents of the array @code{a} as follows: 18257 18258@example 18259a[1] = "cul" 18260a[2] = "de" 18261a[3] = "sac" 18262@end example 18263 18264and sets the contents of the array @code{seps} as follows: 18265 18266@example 18267seps[1] = "-" 18268seps[2] = "-" 18269@end example 18270 18271@noindent 18272The value returned by this call to @code{split()} is three. 
18273 18274@cindex differences in @command{awk} and @command{gawk} @subentry @code{split()} function 18275As with input field-splitting, when the value of @var{fieldsep} is 18276@w{@code{" "}}, leading and trailing whitespace is ignored in values assigned to 18277the elements of 18278@var{array} but not in @var{seps}, and the elements 18279are separated by runs of whitespace. 18280Also, as with input field splitting, if @var{fieldsep} is the null string, each 18281individual character in the string is split into its own array element. 18282@value{COMMONEXT} 18283Additionally, if @var{fieldsep} is a single-character string, that string acts 18284as the separator, even if its value is a regular expression metacharacter. 18285 18286Note, however, that @code{RS} has no effect on the way @code{split()} 18287works. Even though @samp{RS = ""} causes the newline character to also be an input 18288field separator, this does not affect how @code{split()} splits strings. 18289 18290@cindex dark corner @subentry @code{split()} function 18291Modern implementations of @command{awk}, including @command{gawk}, allow 18292the third argument to be a regexp constant (@w{@code{/}@dots{}@code{/}}) 18293as well as a string. @value{DARKCORNER} 18294The POSIX standard allows this as well. 18295@xref{Computed Regexps} for a 18296discussion of the difference between using a string constant or a regexp constant, 18297and the implications for writing your program correctly. 18298 18299Before splitting the string, @code{split()} deletes any previously existing 18300elements in the arrays @var{array} and @var{seps}. 18301 18302If @var{string} is null, the array has no elements. (So this is a portable 18303way to delete an entire array with one statement. 18304@xref{Delete}.) 18305 18306If @var{string} does not match @var{fieldsep} at all (but is not null), 18307@var{array} has one element only. The value of that element is the original 18308@var{string}. 
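
The special cases just described can be seen together in the following
sketch. The strings and array names are arbitrary, and the session
assumes that @command{gawk} is not running in POSIX or compatibility
mode, because it uses the fourth argument and a null @var{fieldsep}:

@example
$ @kbd{gawk 'BEGIN @{}
> @kbd{n = split("  a  b ", arr, " ", seps)}
> @kbd{print n, "[" arr[1] "]", "[" arr[2] "]"}
> @kbd{print length(seps[0]), length(seps[1]), length(seps[2])}
> @kbd{n = split("abc", chars, ""); print n, chars[1], chars[3]}
> @kbd{n = split("abc", one, "-"); print n, one[1]}
> @kbd{print split("", none, "-")}
> @kbd{@}'}
@print{} 2 [a] [b]
@print{} 2 2 1
@print{} 3 a c
@print{} 1 abc
@print{} 0
@end example

@noindent
The second line of output gives the lengths of the leading, middle, and
trailing runs of whitespace, which are kept in @code{seps} even though
they are stripped from the values in @code{arr}.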
18309 18310@cindex POSIX mode 18311In POSIX mode (@pxref{Options}), the fourth argument is not allowed. 18312 18313@item @code{sprintf(@var{format}, @var{expression1}, @dots{})} 18314@cindexawkfunc{sprintf} 18315@cindex formatting @subentry strings 18316Return (without printing) the string that @code{printf} would 18317have printed out with the same arguments 18318(@pxref{Printf}). 18319For example: 18320 18321@example 18322pival = sprintf("pi = %.2f (approx.)", 22/7) 18323@end example 18324 18325@noindent 18326assigns the string @w{@samp{pi = 3.14 (approx.)}} to the variable @code{pival}. 18327 18328@cindexgawkfunc{strtonum} 18329@cindex converting @subentry string to numbers 18330@item @code{strtonum(@var{str}) #} 18331Examine @var{str} and return its numeric value. If @var{str} 18332begins with a leading @samp{0}, @code{strtonum()} assumes that @var{str} 18333is an octal number. If @var{str} begins with a leading @samp{0x} or 18334@samp{0X}, @code{strtonum()} assumes that @var{str} is a hexadecimal number. 18335For example: 18336 18337@example 18338$ @kbd{echo 0x11 |} 18339> @kbd{gawk '@{ printf "%d\n", strtonum($1) @}'} 18340@print{} 17 18341@end example 18342 18343Using the @code{strtonum()} function is @emph{not} the same as adding zero 18344to a string value; the automatic coercion of strings to numbers 18345works only for decimal data, not for octal or hexadecimal.@footnote{Unless 18346you use the @option{--non-decimal-data} option, which isn't recommended. 18347@xref{Nondecimal Data} for more information.} 18348 18349Note also that @code{strtonum()} uses the current locale's decimal point 18350for recognizing numbers (@pxref{Locales}). 18351 18352@item @code{sub(@var{regexp}, @var{replacement}} [@code{, @var{target}}]@code{)} 18353@cindexawkfunc{sub} 18354@cindex replace in string 18355Search @var{target}, which is treated as a string, for the 18356leftmost, longest substring matched by the regular expression @var{regexp}. 
18357Modify the entire string 18358by replacing the matched text with @var{replacement}. 18359The modified string becomes the new value of @var{target}. 18360Return the number of substitutions made (zero or one). 18361 18362The @var{regexp} argument may be either a regexp constant 18363(@code{/}@dots{}@code{/}) or a string constant (@code{"}@dots{}@code{"}). 18364In the latter case, the string is treated as a regexp to be matched. 18365@xref{Computed Regexps} for a 18366discussion of the difference between the two forms, and the 18367implications for writing your program correctly. 18368 18369This function is peculiar because @var{target} is not simply 18370used to compute a value, and not just any expression will do---it 18371must be a variable, field, or array element so that @code{sub()} can 18372store a modified value there. If this argument is omitted, then the 18373default is to use and alter @code{$0}.@footnote{Note that this means 18374that the record will first be regenerated using the value of @code{OFS} if 18375any fields have been changed, and that the fields will be updated 18376after the substitution, even if the operation is a ``no-op'' such 18377as @samp{sub(/^/, "")}.} 18378For example: 18379 18380@example 18381str = "water, water, everywhere" 18382sub(/at/, "ith", str) 18383@end example 18384 18385@noindent 18386sets @code{str} to @w{@samp{wither, water, everywhere}}, by replacing the 18387leftmost longest occurrence of @samp{at} with @samp{ith}. 18388 18389If the special character @samp{&} appears in @var{replacement}, it 18390stands for the precise substring that was matched by @var{regexp}. (If 18391the regexp can match more than one string, then this precise substring 18392may vary.) For example: 18393 18394@example 18395@{ sub(/candidate/, "& and his wife"); print @} 18396@end example 18397 18398@noindent 18399changes the first occurrence of @samp{candidate} to @samp{candidate 18400and his wife} on each input line. 
18401Here is another example: 18402 18403@example 18404$ @kbd{awk 'BEGIN @{} 18405> @kbd{str = "daabaaa"} 18406> @kbd{sub(/a+/, "C&C", str)} 18407> @kbd{print str} 18408> @kbd{@}'} 18409@print{} dCaaCbaaa 18410@end example 18411 18412@noindent 18413This shows how @samp{&} can represent a nonconstant string and also 18414illustrates the ``leftmost, longest'' rule in regexp matching 18415(@pxref{Leftmost Longest}). 18416 18417The effect of this special character (@samp{&}) can be turned off by putting a 18418backslash before it in the string. As usual, to insert one backslash in 18419the string, you must write two backslashes. Therefore, write @samp{\\&} 18420in a string constant to include a literal @samp{&} in the replacement. 18421For example, the following shows how to replace the first @samp{|} on each line with 18422an @samp{&}: 18423 18424@example 18425@{ sub(/\|/, "\\&"); print @} 18426@end example 18427 18428@cindex @code{sub()} function @subentry arguments of 18429@cindex @code{gsub()} function @subentry arguments of 18430@cindex side effects @subentry @code{sub()} function 18431@cindex side effects @subentry @code{gsub()} function 18432As mentioned, the third argument to @code{sub()} must 18433be a variable, field, or array element. 18434Some versions of @command{awk} allow the third argument to 18435be an expression that is not an lvalue. In such a case, @code{sub()} 18436still searches for the pattern and returns zero or one, but the result of 18437the substitution (if any) is thrown away because there is no place 18438to put it. Such versions of @command{awk} accept expressions 18439like the following: 18440 18441@example 18442sub(/USA/, "United States", "the USA and Canada") 18443@end example 18444 18445@noindent 18446@cindex troubleshooting @subentry @code{gsub()}/@code{sub()} functions 18447For historical compatibility, @command{gawk} accepts such erroneous code. 
18448However, using any other nonchangeable 18449object as the third parameter causes a fatal error and your program 18450will not run. 18451 18452Finally, if the @var{regexp} is not a regexp constant, it is converted into a 18453string, and then the value of that string is treated as the regexp to match. 18454 18455@item @code{substr(@var{string}, @var{start}} [@code{, @var{length}} ]@code{)} 18456@cindexawkfunc{substr} 18457@cindex substring 18458Return a @var{length}-character-long substring of @var{string}, 18459starting at character number @var{start}. The first character of a 18460string is character number one.@footnote{This is different from 18461C and C++, in which the first character is number zero.} 18462For example, @code{substr("washington", 5, 3)} returns @code{"ing"}. 18463 18464If @var{length} is not present, @code{substr()} returns the whole suffix of 18465@var{string} that begins at character number @var{start}. For example, 18466@code{substr("washington", 5)} returns @code{"ington"}. The whole 18467suffix is also returned 18468if @var{length} is greater than the number of characters remaining 18469in the string, counting from character @var{start}. 18470 18471@cindex Brian Kernighan's @command{awk} 18472If @var{start} is less than one, @code{substr()} treats it as 18473if it was one. (POSIX doesn't specify what to do in this case: 18474BWK @command{awk} acts this way, and therefore @command{gawk} 18475does too.) 18476If @var{start} is greater than the number of characters 18477in the string, @code{substr()} returns the null string. 18478Similarly, if @var{length} is present but less than or equal to zero, 18479the null string is returned. 18480 18481@cindex troubleshooting @subentry @code{substr()} function 18482The string returned by @code{substr()} @emph{cannot} be 18483assigned. 
Thus, it is a mistake to attempt to change a portion of
a string, as shown in the following example:

@example
string = "abcdef"
# try to get "abCDEf", won't work
substr(string, 3, 3) = "CDE"
@end example

@noindent
It is also a mistake to use @code{substr()} as the third argument
of @code{sub()} or @code{gsub()}:

@example
gsub(/xyz/, "pdq", substr($0, 5, 20))  # WRONG
@end example

@cindex portability @subentry @code{substr()} function
(Some commercial versions of @command{awk} treat
@code{substr()} as assignable, but doing so is not portable.)

If you need to replace bits and pieces of a string, combine @code{substr()}
with string concatenation, in the following manner:

@example
string = "abcdef"
@dots{}
string = substr(string, 1, 2) "CDE" substr(string, 6)
@end example

@cindex case sensitivity @subentry converting case
@cindex strings @subentry converting letter case
@item @code{tolower(@var{string})}
@cindexawkfunc{tolower}
@cindex converting @subentry string to lower case
Return a copy of @var{string}, with each uppercase character
in the string replaced with its corresponding lowercase character.
Nonalphabetic characters are left unchanged. For example,
@code{tolower("MiXeD cAsE 123")} returns @code{"mixed case 123"}.

@item @code{toupper(@var{string})}
@cindexawkfunc{toupper}
@cindex converting @subentry string to upper case
Return a copy of @var{string}, with each lowercase character
in the string replaced with its corresponding uppercase character.
Nonalphabetic characters are left unchanged. For example,
@code{toupper("MiXeD cAsE 123")} returns @code{"MIXED CASE 123"}.
@end table

At first glance, the @code{split()} and @code{patsplit()} functions appear to be
mirror images of each other.
But there are differences:

@itemize @bullet
@item @code{split()} treats its third argument like @code{FS}, with all the
special rules involved for @code{FS}.

@item Matching of null strings differs. This is discussed in @ref{FS versus FPAT}.
@end itemize

@sidebar Matching the Null String
@cindex matching @subentry null strings
@cindex null strings @subentry matching
@cindex @code{*} (asterisk) @subentry @code{*} operator @subentry null strings, matching
@cindex asterisk (@code{*}) @subentry @code{*} operator @subentry null strings, matching

In @command{awk}, the @samp{*} operator can match the null string.
This is particularly important for the @code{sub()}, @code{gsub()},
and @code{gensub()} functions. For example:

@example
$ @kbd{echo abc | awk '@{ gsub(/m*/, "X"); print @}'}
@print{} XaXbXcX
@end example

@noindent
Although this makes a certain amount of sense, it can be surprising.
@end sidebar


@node Gory Details
@subsubsection More about @samp{\} and @samp{&} with @code{sub()}, @code{gsub()}, and @code{gensub()}

@cindex escape processing @subentry @code{gsub()}/@code{gensub()}/@code{sub()} functions
@cindex @code{sub()} function @subentry escape processing
@cindex @code{gsub()} function @subentry escape processing
@cindex @code{gensub()} function (@command{gawk}) @subentry escape processing
@cindex @code{\} (backslash) @subentry @code{gsub()}/@code{gensub()}/@code{sub()} functions and
@cindex backslash (@code{\}) @subentry @code{gsub()}/@code{gensub()}/@code{sub()} functions and
@cindex @code{&} (ampersand) @subentry @code{gsub()}/@code{gensub()}/@code{sub()} functions and
@cindex ampersand (@code{&}) @subentry @code{gsub()}/@code{gensub()}/@code{sub()} functions and

@quotation CAUTION
This subsubsection has been reported to cause headaches.
You might want to skip it upon first reading.
@end quotation

When using @code{sub()}, @code{gsub()}, or @code{gensub()}, and trying to get literal
backslashes and ampersands into the replacement text, you need to remember
that there are several levels of @dfn{escape processing} going on.

First, there is the @dfn{lexical} level, which is when @command{awk} reads
your program
and builds an internal copy of it to execute.
Then there is the runtime level, which is when @command{awk} actually scans the
replacement string to determine what to generate.

@cindex Brian Kernighan's @command{awk}
At both levels, @command{awk} looks for a defined set of characters that
can come after a backslash. At the lexical level, it looks for the
escape sequences listed in @ref{Escape Sequences}.
Thus, for every @samp{\} that @command{awk} processes at the runtime
level, you must type two backslashes at the lexical level.
When a character that is not valid for an escape sequence follows the
@samp{\}, BWK @command{awk} and @command{gawk} both simply remove the initial
@samp{\} and put the next character into the string. Thus, for
example, @code{"a\qb"} is treated as @code{"aqb"}.

At the runtime level, the various functions handle sequences of
@samp{\} and @samp{&} differently. The situation is (sadly) somewhat complex.
Historically, the @code{sub()} and @code{gsub()} functions treated the
two-character sequence @samp{\&} specially; this sequence was replaced in
the generated text with a single @samp{&}. Any other @samp{\} within
the @var{replacement} string that did not precede an @samp{&} was passed
through unchanged. This is illustrated in @ref{table-sub-escapes}.

@c Thanks to Karl Berry for help with the TeX stuff.
@float Table,table-sub-escapes
@caption{Historical escape sequence processing for @code{sub()} and @code{gsub()}}
@tex
\vbox{\bigskip
% We need more characters for escape and tab ...
\catcode`_ = 0
\catcode`! = 4
% ... since this table has lots of &'s and \'s, so we unspecialize them.
\catcode`\& = \other \catcode`\\ = \other
_halign{_hfil#!_qquad_hfil#!_qquad#_hfil_cr
    You type!@code{sub()} sees!@code{sub()} generates_cr
_hrulefill!_hrulefill!_hrulefill_cr
     @code{\&}!           @code{&}!The matched text_cr
    @code{\\&}!          @code{\&}!A literal @samp{&}_cr
   @code{\\\&}!          @code{\&}!A literal @samp{&}_cr
  @code{\\\\&}!         @code{\\&}!A literal @samp{\&}_cr
 @code{\\\\\&}!         @code{\\&}!A literal @samp{\&}_cr
@code{\\\\\\&}!        @code{\\\&}!A literal @samp{\\&}_cr
    @code{\\q}!          @code{\q}!A literal @samp{\q}_cr
}
_bigskip}
@end tex
@ifdocbook
@multitable @columnfractions .20 .20 .60
@headitem You type @tab @code{sub()} sees @tab @code{sub()} generates
@item @code{\&} @tab @code{&} @tab The matched text
@item @code{\\&} @tab @code{\&} @tab A literal @samp{&}
@item @code{\\\&} @tab @code{\&} @tab A literal @samp{&}
@item @code{\\\\&} @tab @code{\\&} @tab A literal @samp{\&}
@item @code{\\\\\&} @tab @code{\\&} @tab A literal @samp{\&}
@item @code{\\\\\\&} @tab @code{\\\&} @tab A literal @samp{\\&}
@item @code{\\q} @tab @code{\q} @tab A literal @samp{\q}
@end multitable
@end ifdocbook
@ifnottex
@ifnotdocbook
@display
 You type         @code{sub()} sees          @code{sub()} generates
 --------         ----------          ---------------
     @code{\&}              @code{&}            The matched text
    @code{\\&}             @code{\&}            A literal @samp{&}
   @code{\\\&}             @code{\&}            A literal @samp{&}
  @code{\\\\&}            @code{\\&}            A literal @samp{\&}
 @code{\\\\\&}            @code{\\&}            A literal @samp{\&}
@code{\\\\\\&}           @code{\\\&}            A literal @samp{\\&}
    @code{\\q}             @code{\q}            A literal @samp{\q}
@end display
@end ifnotdocbook
@end ifnottex
@end float

@noindent
This table shows the lexical-level processing, where
an odd number of backslashes becomes an even number at the runtime level,
as well as the runtime processing done by @code{sub()}.
(For the sake of simplicity, the rest of the following tables only show the
case of even numbers of backslashes entered at the lexical level.)

The problem with the historical approach is that there is no way to get
a literal @samp{\} followed by the matched text.

Several editions of the POSIX standard attempted to fix this problem
but weren't successful. The details are irrelevant at this point in time.

At one point, the @command{gawk} maintainer submitted
proposed text for a revised standard that
reverts to rules that correspond more closely to the original existing
practice. The proposed rules have special cases that make it possible
to produce a @samp{\} preceding the matched text.
This is shown in
@ref{table-sub-proposed}.

@float Table,table-sub-proposed
@caption{@command{gawk} rules for @code{sub()} and backslash}
@tex
\vbox{\bigskip
% We need more characters for escape and tab ...
\catcode`_ = 0
\catcode`! = 4
% ... since this table has lots of &'s and \'s, so we unspecialize them.
\catcode`\& = \other \catcode`\\ = \other
_halign{_hfil#!_qquad_hfil#!_qquad#_hfil_cr
    You type!@code{sub()} sees!@code{sub()} generates_cr
_hrulefill!_hrulefill!_hrulefill_cr
@code{\\\\\\&}!           @code{\\\&}!A literal @samp{\&}_cr
  @code{\\\\&}!            @code{\\&}!A literal @samp{\}, followed by the matched text_cr
    @code{\\&}!             @code{\&}!A literal @samp{&}_cr
    @code{\\q}!             @code{\q}!A literal @samp{\q}_cr
   @code{\\\\}!             @code{\\}!@code{\\}_cr
}
_bigskip}
@end tex
@ifdocbook
@multitable @columnfractions .20 .20 .60
@headitem You type @tab @code{sub()} sees @tab @code{sub()} generates
@item @code{\\\\\\&} @tab @code{\\\&} @tab A literal @samp{\&}
@item @code{\\\\&} @tab @code{\\&} @tab A literal @samp{\}, followed by the matched text
@item @code{\\&} @tab @code{\&} @tab A literal @samp{&}
@item @code{\\q} @tab @code{\q} @tab A literal @samp{\q}
@item @code{\\\\} @tab @code{\\} @tab @code{\\}
@end multitable
@end ifdocbook
@ifnottex
@ifnotdocbook
@display
 You type         @code{sub()} sees         @code{sub()} generates
 --------         ----------         ---------------
@code{\\\\\\&}           @code{\\\&}            A literal @samp{\&}
  @code{\\\\&}            @code{\\&}            A literal @samp{\}, followed by the matched text
    @code{\\&}             @code{\&}            A literal @samp{&}
    @code{\\q}             @code{\q}            A literal @samp{\q}
   @code{\\\\}             @code{\\}            @code{\\}
@end display
@end ifnotdocbook
@end ifnottex
@end float

In a nutshell, at the runtime level, there are now three special sequences
of characters (@samp{\\\&}, @samp{\\&}, and @samp{\&}) whereas historically
there was only one. However, as in the historical case, any @samp{\} that
is not part of one of these three sequences is not special and appears
in the output literally.

@command{gawk} 3.0 and 3.1 follow these rules for @code{sub()} and
@code{gsub()}. The POSIX standard took much longer to be revised than
was expected. In addition, the @command{gawk} maintainer's proposal was
lost during the standardization process. The final rules are
somewhat simpler. The results are similar except for one case.
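
The three special sequences of the @command{gawk} rules
(@pxref{table-sub-proposed}) can be tried directly from the shell.
The following is only a sketch: the input @samp{abc} and the pattern
@code{/b/} are arbitrary choices made for illustration, and the commands
assume that @command{gawk} is run without @option{--posix}.
Inside single quotes the shell passes the backslashes through untouched,
so what you type here is exactly what @command{gawk} reads at the lexical level:

@example
$ @kbd{echo abc | gawk '@{ sub(/b/, "\\&"); print @}'}
@print{} a&c
$ @kbd{echo abc | gawk '@{ sub(/b/, "\\\\&"); print @}'}
@print{} a\bc
$ @kbd{echo abc | gawk '@{ sub(/b/, "\\\\\\&"); print @}'}
@print{} a\&c
@end example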

@cindex POSIX @command{awk} @subentry functions and @subentry @code{gsub()}/@code{sub()}
The POSIX rules state that @samp{\&} in the replacement string produces
a literal @samp{&}, @samp{\\} produces a literal @samp{\}, and @samp{\} followed
by anything else is not special; the @samp{\} is placed straight into the output.
These rules are presented in @ref{table-posix-sub}.

@float Table,table-posix-sub
@caption{POSIX rules for @code{sub()} and @code{gsub()}}
@tex
\vbox{\bigskip
% We need more characters for escape and tab ...
\catcode`_ = 0
\catcode`! = 4
% ... since this table has lots of &'s and \'s, so we unspecialize them.
\catcode`\& = \other \catcode`\\ = \other
_halign{_hfil#!_qquad_hfil#!_qquad#_hfil_cr
    You type!@code{sub()} sees!@code{sub()} generates_cr
_hrulefill!_hrulefill!_hrulefill_cr
@code{\\\\\\&}!           @code{\\\&}!A literal @samp{\&}_cr
  @code{\\\\&}!            @code{\\&}!A literal @samp{\}, followed by the matched text_cr
    @code{\\&}!             @code{\&}!A literal @samp{&}_cr
    @code{\\q}!             @code{\q}!A literal @samp{\q}_cr
   @code{\\\\}!             @code{\\}!@code{\}_cr
}
_bigskip}
@end tex
@ifdocbook
@multitable @columnfractions .20 .20 .60
@headitem You type @tab @code{sub()} sees @tab @code{sub()} generates
@item @code{\\\\\\&} @tab @code{\\\&} @tab A literal @samp{\&}
@item @code{\\\\&} @tab @code{\\&} @tab A literal @samp{\}, followed by the matched text
@item @code{\\&} @tab @code{\&} @tab A literal @samp{&}
@item @code{\\q} @tab @code{\q} @tab A literal @samp{\q}
@item @code{\\\\} @tab @code{\\} @tab @code{\}
@end multitable
@end ifdocbook
@ifnottex
@ifnotdocbook
@display
 You type         @code{sub()} sees         @code{sub()} generates
 --------         ----------         ---------------
@code{\\\\\\&}           @code{\\\&}            A literal @samp{\&}
  @code{\\\\&}            @code{\\&}            A literal @samp{\}, followed by the matched text
    @code{\\&}             @code{\&}            A literal @samp{&}
    @code{\\q}             @code{\q}            A literal @samp{\q}
   @code{\\\\}             @code{\\}            @code{\}
@end display
@end ifnotdocbook
@end ifnottex
@end float

The only case where the difference is noticeable is the last one: @samp{\\\\}
is seen as @samp{\\} and produces @samp{\} instead of @samp{\\}.

Starting with @value{PVERSION} 3.1.4, @command{gawk} followed the POSIX rules
when @option{--posix} was specified (@pxref{Options}). Otherwise,
it continued to follow the proposed rules, as
that had been its behavior for many years.
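
The one differing case can be observed by running the same substitution
with and without @option{--posix}. This is a sketch rather than a
definitive test: the input @samp{abc} and the pattern @code{/b/} are
arbitrary, and it assumes a @command{gawk} that follows the
@command{gawk} rules by default and the POSIX rules only under
@option{--posix}:

@example
$ @kbd{echo abc | gawk '@{ sub(/b/, "\\\\"); print @}'}
@print{} a\\c
$ @kbd{echo abc | gawk --posix '@{ sub(/b/, "\\\\"); print @}'}
@print{} a\c
@end example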

When @value{PVERSION} 4.0.0 was released, the @command{gawk} maintainer
made the POSIX rules the default, breaking well over a decade's worth
of backward compatibility.@footnote{This was rather naive of him, despite
there being a note in this @value{SECTION} indicating that the next major version
would move to the POSIX rules.} Needless to say, this was a bad idea,
and as of @value{PVERSION} 4.0.1, @command{gawk} resumed its historical
behavior, and only follows the POSIX rules when @option{--posix} is given.

The rules for @code{gensub()} are considerably simpler. At the runtime
level, whenever @command{gawk} sees a @samp{\}, if the following character
is a digit, then the text that matched the corresponding parenthesized
subexpression is placed in the generated output. Otherwise,
no matter what character follows the @samp{\}, it
appears in the generated text and the @samp{\} does not,
as shown in @ref{table-gensub-escapes}.

@float Table,table-gensub-escapes
@caption{Escape sequence processing for @code{gensub()}}
@tex
\vbox{\bigskip
% We need more characters for escape and tab ...
\catcode`_ = 0
\catcode`! = 4
% ... since this table has lots of &'s and \'s, so we unspecialize them.
\catcode`\& = \other \catcode`\\ = \other
_halign{_hfil#!_qquad_hfil#!_qquad#_hfil_cr
    You type!@code{gensub()} sees!@code{gensub()} generates_cr
_hrulefill!_hrulefill!_hrulefill_cr
      @code{&}!           @code{&}!The matched text_cr
    @code{\\&}!          @code{\&}!A literal @samp{&}_cr
   @code{\\\\}!          @code{\\}!A literal @samp{\}_cr
  @code{\\\\&}!         @code{\\&}!A literal @samp{\}, then the matched text_cr
@code{\\\\\\&}!        @code{\\\&}!A literal @samp{\&}_cr
    @code{\\q}!          @code{\q}!A literal @samp{q}_cr
}
_bigskip}
@end tex
@ifdocbook
@multitable @columnfractions .20 .20 .60
@headitem You type @tab @code{gensub()} sees @tab @code{gensub()} generates
@item @code{&} @tab @code{&} @tab The matched text
@item @code{\\&} @tab @code{\&} @tab A literal @samp{&}
@item @code{\\\\} @tab @code{\\} @tab A literal @samp{\}
@item @code{\\\\&} @tab @code{\\&} @tab A literal @samp{\}, then the matched text
@item @code{\\\\\\&} @tab @code{\\\&} @tab A literal @samp{\&}
@item @code{\\q} @tab @code{\q} @tab A literal @samp{q}
@end multitable
@end ifdocbook
@ifnottex
@ifnotdocbook
@display
  You type          @code{gensub()} sees         @code{gensub()} generates
  --------          -------------         ------------------
      @code{&}                    @code{&}            The matched text
    @code{\\&}                   @code{\&}            A literal @samp{&}
   @code{\\\\}                   @code{\\}            A literal @samp{\}
  @code{\\\\&}                  @code{\\&}            A literal @samp{\}, then the matched text
@code{\\\\\\&}                 @code{\\\&}            A literal @samp{\&}
    @code{\\q}                   @code{\q}            A literal @samp{q}
@end display
@end ifnotdocbook
@end ifnottex
@end float

Because of the complexity of the lexical- and runtime-level processing
and the special cases for @code{sub()} and @code{gsub()},
we recommend the use of @command{gawk} and @code{gensub()} when you have
to do substitutions.

@node I/O Functions
@subsection Input/Output Functions
@cindex input/output @subentry functions

The following functions relate to input/output (I/O).
Optional parameters are enclosed in square brackets ([ ]):

@table @asis
@item @code{close(}@var{filename} [@code{,} @var{how}]@code{)}
@cindexawkfunc{close}
@cindex files @subentry closing
@cindex close file or coprocess
Close the file @var{filename} for input or output.
Alternatively, the
argument may be a shell command that was used for creating a coprocess, or
for redirecting to or from a pipe; then the coprocess or pipe is closed.
@xref{Close Files And Pipes}
for more information.

When closing a coprocess, it is occasionally useful to first close
one end of the two-way pipe and then to close the other. This is done
by providing a second argument to @code{close()}. This second argument
(@var{how})
should be one of the two string values @code{"to"} or @code{"from"},
indicating which end of the pipe to close. Case in the string does
not matter.
@xref{Two-way I/O},
which discusses this feature in more detail and gives an example.

Note that the second argument to @code{close()} is a @command{gawk}
extension; it is not available in compatibility mode (@pxref{Options}).

@item @code{fflush(}[@var{filename}]@code{)}
@cindexawkfunc{fflush}
@cindex flush buffered output
Flush any buffered output associated with @var{filename}, which is either a
file opened for writing or a shell command for redirecting output to
a pipe or coprocess.

@cindex buffers @subentry flushing
@cindex output @subentry buffering
Many utility programs @dfn{buffer} their output (i.e., they save information
to write to a disk file or the screen in memory until there is enough
for it to be worthwhile to send the data to the output device).
This is often more efficient than writing
every little bit of information as soon as it is ready. However, sometimes
it is necessary to force a program to @dfn{flush} its buffers (i.e.,
write the information to its destination, even if a buffer is not full).
This is the purpose of the @code{fflush()} function: @command{gawk} also
buffers its output, and the @code{fflush()} function forces
@command{gawk} to flush its buffers.
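
As a minimal sketch of where this matters, consider a program whose
output goes to a pipe or a file, so that nothing is written until a
buffer fills. The rule below calls @code{fflush()} after each
@code{print} so that every line is pushed out as soon as it is produced.
The progress message itself is made up purely for illustration:

@example
@{
    print "processed record", NR
    fflush()    # write buffered output to its destination now
@}
@end example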

@cindex extensions @subentry common @subentry @code{fflush()} function
@cindex Brian Kernighan's @command{awk}
Brian Kernighan added @code{fflush()} to his @command{awk} in April
1992. For two decades, it was a common extension. In December
2012, it was accepted for inclusion into the POSIX standard.
See @uref{http://austingroupbugs.net/view.php?id=634, the Austin Group website}.

POSIX standardizes @code{fflush()} as follows: if there
is no argument, or if the argument is the null string (@w{@code{""}}),
then @command{awk} flushes the buffers for @emph{all} open output files
and pipes.

@quotation NOTE
Prior to @value{PVERSION} 4.0.2, @command{gawk}
would flush only the standard output if there was no argument,
and flush all output files and pipes if the argument was the null
string. This was changed in order to be compatible with BWK
@command{awk}, in the hope that standardizing this
feature in POSIX would then be easier (which indeed proved to be the case).

With @command{gawk},
you can use @samp{fflush("/dev/stdout")} if you wish to flush
only the standard output.
@end quotation

@c @cindex automatic warnings
@c @cindex warnings, automatic
@cindex troubleshooting @subentry @code{fflush()} function
@code{fflush()} returns zero if the buffer is successfully flushed;
otherwise, it returns a nonzero value. (@command{gawk} returns @minus{}1.)
In the case where all buffers are flushed, the return value is zero
only if all buffers were flushed successfully. Otherwise, it is
@minus{}1, and @command{gawk} warns about the problem @var{filename}.

@command{gawk} also issues a warning message if you attempt to flush
a file or pipe that was opened for reading (such as with @code{getline}),
or if @var{filename} is not an open file, pipe, or coprocess.
In such a case, @code{fflush()} returns @minus{}1, as well.

@c end the table to let the sidebar take up the full width of the page.
@end table

@sidebar Interactive Versus Noninteractive Buffering
@cindex buffering @subentry interactive vs.@: noninteractive

As a side point, buffering issues can be even more confusing if
your program is @dfn{interactive} (i.e., communicating
with a user sitting at a keyboard).@footnote{A program is interactive
if the standard output is connected to a terminal device. On modern
systems, this means your keyboard and screen.}

@c Thanks to Walter.Mecky@dresdnerbank.de for this example, and for
@c motivating me to write this section.
Interactive programs generally @dfn{line buffer} their output (i.e., they
write out every line). Noninteractive programs wait until they have
a full buffer, which may be many lines of output.
Here is an example of the difference:

@example
$ @kbd{awk '@{ print $1 + $2 @}'}
@kbd{1 1}
@print{} 2
@kbd{2 3}
@print{} 5
@kbd{Ctrl-d}
@end example

@noindent
Each line of output is printed immediately. Compare that behavior
with this example:

@example
$ @kbd{awk '@{ print $1 + $2 @}' | cat}
@kbd{1 1}
@kbd{2 3}
@kbd{Ctrl-d}
@print{} 2
@print{} 5
@end example

@noindent
Here, no output is printed until after the @kbd{Ctrl-d} is typed, because
it is all buffered and sent down the pipe to @command{cat} in one shot.
@end sidebar

@table @asis
@item @code{system(@var{command})}
@cindexawkfunc{system}
@cindex invoke shell command
@cindex interacting with other programs
Execute the operating system
command @var{command} and then return to the @command{awk} program.
Return @var{command}'s exit status (see further on).
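
A minimal sketch of testing that return value follows. The file name is
only a placeholder chosen for illustration, and the @command{test}
utility is assumed to be available, as it is on POSIX systems:

@example
BEGIN @{
    # system() returns the command's exit status; zero means success
    status = system("test -r /etc/passwd")
    if (status != 0)
        print "cannot read /etc/passwd" > "/dev/stderr"
@}
@end example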

For example, if the following fragment of code is put in your @command{awk}
program:

@example
END @{
     system("date | mail -s 'awk run done' root")
@}
@end example

@noindent
the system administrator is sent mail when the @command{awk} program
finishes processing input and begins its end-of-input processing.

Note that redirecting @code{print} or @code{printf} into a pipe is often
enough to accomplish your task. If you need to run many commands, it
is more efficient to simply print them down a pipeline to the shell:

@example
while (@var{more stuff to do})
    print @var{command} | "/bin/sh"
close("/bin/sh")
@end example

@noindent
@cindex troubleshooting @subentry @code{system()} function
@cindex @option{--sandbox} option @subentry disabling @code{system()} function
However, if your @command{awk}
program is interactive, @code{system()} is useful for running large
self-contained programs, such as a shell or an editor.
Some operating systems cannot implement the @code{system()} function.
@code{system()} causes a fatal error if it is not supported.

@quotation NOTE
When @option{--sandbox} is specified, the @code{system()} function is disabled
(@pxref{Options}).
@end quotation

On POSIX systems, a command's exit status is a 16-bit number. The exit
value passed to the C @code{exit()} function is held in the high-order
eight bits. The low-order bits indicate if the process was killed by a
signal (bit 7) and if so, the guilty signal number (bits 0--6).

Traditionally, @command{awk}'s @code{system()} function has simply
returned the exit status value divided by 256.
In the normal case this
gives the exit status, but in the case of death-by-signal it yields
a fractional floating-point value.@footnote{In private correspondence,
Dr.@: Kernighan has indicated to me that the way this was done
was probably a mistake.} POSIX states that @command{awk}'s
@code{system()} should return the full 16-bit value.

@command{gawk} takes a middle course.
The return values are summarized in @ref{table-system-return-values}.

@float Table,table-system-return-values
@caption{Return values from @code{system()}}
@multitable @columnfractions .40 .60
@headitem Situation @tab Return value from @code{system()}
@item @option{--traditional} @tab C @code{system()}'s value divided by 256
@item @option{--posix} @tab C @code{system()}'s value
@item Normal exit of command @tab Command's exit status
@item Death by signal of command @tab 256 + number of murderous signal
@item Death by signal of command with core dump @tab 512 + number of murderous signal
@item Some kind of error @tab @minus{}1
@end multitable
@end float
@end table

As of August, 2018, BWK @command{awk} now follows @command{gawk}'s behavior
for the return value of @code{system()}.

@sidebar Controlling Output Buffering with @code{system()}
@cindex buffers @subentry flushing
@cindex buffering @subentry input/output
@cindex output @subentry buffering

The @code{fflush()} function provides explicit control over output buffering for
individual files and pipes. However, its use is not portable to many older
@command{awk} implementations.
An alternative method to flush output
buffers is to call @code{system()} with a null string as its argument:

@example
system("")   # flush output
@end example

@noindent
@command{gawk} treats this use of the @code{system()} function as a special
case and is smart enough not to run a shell (or other command
interpreter) with the empty command. Therefore, with @command{gawk}, this
idiom is not only useful, it is also efficient. Although this method should work
with other @command{awk} implementations, it does not necessarily avoid
starting an unnecessary shell. (Other implementations may only
flush the buffer associated with the standard output and not necessarily
all buffered output.)

If you think about what a programmer expects, it makes sense that
@code{system()} should flush any pending output. The following program:

@example
BEGIN @{
     print "first print"
     system("echo system echo")
     print "second print"
@}
@end example

@noindent
must print:

@example
first print
system echo
second print
@end example

@noindent
and not:

@example
system echo
first print
second print
@end example

If @command{awk} did not flush its buffers before calling @code{system()},
you would see the latter (undesirable) output.
@end sidebar

@node Time Functions
@subsection Time Functions
@cindex time functions

@cindex timestamps
@cindex log files, timestamps in
@cindex files @subentry log, timestamps in
@cindex @command{gawk} @subentry timestamps
@cindex POSIX @command{awk} @subentry timestamps and
@command{awk} programs are commonly used to process log files
containing timestamp information, indicating when a
particular log record was written.
Many programs log their timestamps
in the form returned by the @code{time()} system call, which is the
number of seconds since a particular epoch. On POSIX-compliant systems,
it is the number of seconds since
1970-01-01 00:00:00 UTC, not counting leap
@ifclear FOR_PRINT
seconds.@footnote{@xref{Glossary}, especially the entries ``Epoch'' and ``UTC.''}
@end ifclear
@ifset FOR_PRINT
seconds.
@end ifset
All known POSIX-compliant systems support timestamps from 0 through
@iftex
@math{2^{31} - 1},
@end iftex
@ifinfo
2^31 - 1,
@end ifinfo
@ifnottex
@ifnotinfo
2@sup{31} @minus{} 1,
@end ifnotinfo
@end ifnottex
which is sufficient to represent times through
2038-01-19 03:14:07 UTC. Many systems support a wider range of timestamps,
including negative timestamps that represent times before the
epoch.

@cindex @command{date} utility @subentry GNU
@cindex time @subentry retrieving
In order to make it easier to process such log files and to produce
useful reports, @command{gawk} provides the following functions for
working with timestamps. They are @command{gawk} extensions; they are
not specified in the POSIX standard.@footnote{The GNU @command{date} utility can
also do many of the things described here. Its use may be preferable
for simple time-related operations in shell scripts.}
However, recent versions
of @command{mawk} (@pxref{Other Versions}) also support these functions.
Optional parameters are enclosed in square brackets ([ ]):

@c @asis for docbook
@table @asis
@item @code{mktime(@var{datespec}} [@code{, @var{utc-flag}} ]@code{)}
@cindexgawkfunc{mktime}
@cindex generate time values
Turn @var{datespec} into a timestamp in the same form
as is returned by @code{systime()}. It is similar to the function of the
same name in ISO C.
The argument, @var{datespec}, is a string of the form
@w{@code{"@var{YYYY} @var{MM} @var{DD} @var{HH} @var{MM} @var{SS} [@var{DST}]"}}.
The string consists of six or seven numbers representing, respectively,
the full year including century, the month from 1 to 12, the day of the month
from 1 to 31, the hour of the day from 0 to 23, the minute from 0 to
59, the second from 0 to 60,@footnote{Occasionally there are
minutes in a year with a leap second, which is why the
seconds can go up to 60.}
and an optional daylight-savings flag.

The values of these numbers need not be within the ranges specified;
for example, an hour of @minus{}1 means 1 hour before midnight.
The origin-zero Gregorian calendar is assumed, with year 0 preceding
year 1 and year @minus{}1 preceding year 0.
If @var{utc-flag} is present and is either nonzero or non-null, the time
is assumed to be in the UTC time zone; otherwise, the
time is assumed to be in the local time zone.
If the @var{DST} daylight-savings flag is positive, the time is assumed to be
daylight savings time; if zero, the time is assumed to be standard
time; and if negative (the default), @code{mktime()} attempts to determine
whether daylight savings time is in effect for the specified time.

If @var{datespec} does not contain enough elements or if the resulting time
is out of range, @code{mktime()} returns @minus{}1.

@cindex @command{gawk} @subentry @code{PROCINFO} array in
@cindex @code{PROCINFO} array
@item @code{strftime(}[@var{format} [@code{,} @var{timestamp} [@code{,} @var{utc-flag}] ] ]@code{)}
@cindexgawkfunc{strftime}
@cindex format time string
Format the time specified by @var{timestamp}
based on the contents of the @var{format} string and return the result.
It is similar to the function of the same name in ISO C.
If @var{utc-flag} is present and is either nonzero or non-null, the value
is formatted as UTC (Coordinated Universal Time, formerly GMT or Greenwich
Mean Time). Otherwise, the value is formatted for the local time zone.
The @var{timestamp} is in the same format as the value returned by the
@code{systime()} function. If no @var{timestamp} argument is supplied,
@command{gawk} uses the current time of day as the timestamp.
Without a @var{format} argument, @code{strftime()} uses
the value of @code{PROCINFO["strftime"]} as the format string
(@pxref{Built-in Variables}).
The default string value is
@code{@w{"%a %b %e %H:%M:%S %Z %Y"}}. This format string produces
output that is equivalent to that of the @command{date} utility.
You can assign a new value to @code{PROCINFO["strftime"]} to
change the default format; see the following list for the various format directives.

@item @code{systime()}
@cindexgawkfunc{systime}
@cindex timestamps
@cindex current system time
Return the current time as the number of seconds since
the system epoch. On POSIX systems, this is the number of seconds
since 1970-01-01 00:00:00 UTC, not counting leap seconds.
It may be a different number on other systems.
@end table

The @code{systime()} function allows you to compare a timestamp from a
log file with the current time of day. In particular, it is easy to
determine how long ago a particular record was logged. It also allows
you to produce log records using the ``seconds since the epoch'' format.

@cindex converting @subentry dates to timestamps
@cindex dates @subentry converting to timestamps
@cindex timestamps @subentry converting dates to
The @code{mktime()} function allows you to convert a textual representation
of a date and time into a timestamp.
This makes it easy to do before/after
comparisons of dates and times, particularly when dealing with date and
time data coming from an external source, such as a log file.

The @code{strftime()} function allows you to easily turn a timestamp
into human-readable information. It is similar in nature to the @code{sprintf()}
function
(@pxref{String Functions}),
in that it copies nonformat specification characters verbatim to the
returned string, while substituting date and time values for format
specifications in the @var{format} string.

@cindex format specifiers @subentry @code{strftime()} function (@command{gawk})
@code{strftime()} is guaranteed by the 1999 ISO C
standard@footnote{Unfortunately,
not every system's @code{strftime()} necessarily
supports all of the conversions listed here.}
to support the following date format specifications:

@table @code
@item %a
The locale's abbreviated weekday name.

@item %A
The locale's full weekday name.

@item %b
The locale's abbreviated month name.

@item %B
The locale's full month name.

@item %c
The locale's ``appropriate'' date and time representation.
(This is @samp{%a %b %e %T %Y} in the @code{"C"} locale.)

@item %C
The century part of the current year.
This is the year divided by 100 and truncated to the next
lower integer.

@item %d
The day of the month as a decimal number (01--31).

@item %D
Equivalent to specifying @samp{%m/%d/%y}.

@item %e
The day of the month, padded with a space if it is only one digit.

@item %F
Equivalent to specifying @samp{%Y-%m-%d}.
This is the ISO 8601 date format.

@item %g
The year modulo 100 of the ISO 8601 week number, as a decimal number (00--99).
For example, January 1, 2012, is in week 52 of 2011.
Thus, the year
of its ISO 8601 week number is 2011, even though its year is 2012.
Similarly, December 31, 2012, is in week 1 of 2013. Thus, the year
of its ISO week number is 2013, even though its year is 2012.

@item %G
The full year of the ISO week number, as a decimal number.

@item %h
Equivalent to @samp{%b}.

@item %H
The hour (24-hour clock) as a decimal number (00--23).

@item %I
The hour (12-hour clock) as a decimal number (01--12).

@item %j
The day of the year as a decimal number (001--366).

@item %m
The month as a decimal number (01--12).

@item %M
The minute as a decimal number (00--59).

@item %n
A newline character (ASCII LF).

@item %p
The locale's equivalent of the AM/PM designations associated
with a 12-hour clock.

@item %r
The locale's 12-hour clock time.
(This is @samp{%I:%M:%S %p} in the @code{"C"} locale.)

@item %R
Equivalent to specifying @samp{%H:%M}.

@item %S
The second as a decimal number (00--60).

@item %t
A TAB character.

@item %T
Equivalent to specifying @samp{%H:%M:%S}.

@item %u
The weekday as a decimal number (1--7). Monday is day one.

@item %U
The week number of the year (with the first Sunday as the first day of week one)
as a decimal number (00--53).

@cindex ISO @subentry ISO 8601 date and time standard
@item %V
The week number of the year (with the first Monday as the first
day of week one) as a decimal number (01--53).
The method for determining the week number is as specified by ISO 8601.
(To wit: if the week containing January 1 has four or more days in the
new year, then it is week one; otherwise it is the last week
[52 or 53] of the previous year and the next week is week one.)

@item %w
The weekday as a decimal number (0--6). Sunday is day zero.

@item %W
The week number of the year (with the first Monday as the first day of week one)
as a decimal number (00--53).

@item %x
The locale's ``appropriate'' date representation.
(This is @samp{%m/%d/%y} in the @code{"C"} locale.)

@item %X
The locale's ``appropriate'' time representation.
(This is @samp{%T} in the @code{"C"} locale.)

@item %y
The year modulo 100 as a decimal number (00--99).

@item %Y
The full year as a decimal number (e.g., 2015).

@c @cindex RFC 822
@c @cindex RFC 1036
@item %z
The time zone offset in a @samp{+@var{HHMM}} format (e.g., the format
necessary to produce RFC 822/RFC 1036 date headers).

@item %Z
The time zone name or abbreviation; no characters if
no time zone is determinable.

@item %Ec %EC %Ex %EX %Ey %EY %Od %Oe %OH
@itemx %OI %Om %OM %OS %Ou %OU %OV %Ow %OW %Oy
``Alternative representations'' for the specifications
that use only the second letter (@samp{%c}, @samp{%C},
and so on).@footnote{If you don't understand any of this, don't worry about
it; these facilities are meant to make it easier to ``internationalize''
programs.
Other internationalization features are described in
@ref{Internationalization}.}
(These facilitate compliance with the POSIX @command{date} utility.)

@item %%
A literal @samp{%}.
@end table

If a conversion specifier is not one of those just listed, the behavior is
undefined.@footnote{This is because ISO C leaves the
behavior of the C version of @code{strftime()} undefined and @command{gawk}
uses the system's version of @code{strftime()} if it's there.
Typically, the conversion specifier either does not appear in the
returned string or appears literally.}

For systems that are not yet fully standards-compliant,
@command{gawk} supplies a copy of
@code{strftime()} from the GNU C Library.
It supports all of the just-listed format specifications.
If that version is
used to compile @command{gawk} (@pxref{Installation}),
then the following additional format specifications are available:

@table @code
@item %k
The hour (24-hour clock) as a decimal number (0--23).
Single-digit numbers are padded with a space.

@item %l
The hour (12-hour clock) as a decimal number (1--12).
Single-digit numbers are padded with a space.

@ignore
@item %N
The ``Emperor/Era'' name.
Equivalent to @samp{%C}.

@item %o
The ``Emperor/Era'' year.
Equivalent to @samp{%y}.
@end ignore

@item %s
The time as a decimal timestamp in seconds since the epoch.

@ignore
@item %v
The date in VMS format (e.g., @samp{20-JUN-1991}).
@end ignore
@end table

Additionally, the alternative representations are recognized but their
normal representations are used.

@cindex @code{date} utility @subentry POSIX
@cindex POSIX @command{awk} @subentry @code{date} utility and
The following example is an @command{awk} implementation of the POSIX
@command{date} utility. Normally, the @command{date} utility prints the
current date and time of day in a well-known format. However, if you
provide an argument to it that begins with a @samp{+}, @command{date}
copies nonformat specifier characters to the standard output and
interprets the current time according to the format specifiers in
the string. For example:

@example
$ @kbd{date '+Today is %A, %B %d, %Y.'}
@print{} Today is Monday, September 22, 2014.
@end example

Here is the @command{gawk} version of the @command{date} utility.
It has a shell ``wrapper'' to handle the @option{-u} option,
which requires that @command{date} run as if the time zone
is set to UTC:

@example
#! /bin/sh
#
# date --- approximate the POSIX 'date' command

case $1 in
-u)  TZ=UTC0     # use UTC
     export TZ
     shift ;;
esac

gawk 'BEGIN  @{
    format = PROCINFO["strftime"]
    exitval = 0

    if (ARGC > 2)
        exitval = 1
    else if (ARGC == 2) @{
        format = ARGV[1]
        if (format ~ /^\+/)
            format = substr(format, 2)   # remove leading +
    @}
    print strftime(format)
    exit exitval
@}' "$@@"
@end example

@node Bitwise Functions
@subsection Bit-Manipulation Functions
@cindex bit-manipulation functions
@cindex bitwise @subentry operations
@cindex AND bitwise operation
@cindex OR bitwise operation
@cindex XOR bitwise operation
@cindex operations, bitwise
@quotation
@i{I can explain it for you, but I can't understand it for you.}
@author Anonymous
@end quotation

Many languages provide the ability to perform @dfn{bitwise} operations
on two integer numbers. In other words, the operation is performed on
each successive pair of bits in the operands.
Three common operations are bitwise AND, OR, and XOR.
The operations are described in @ref{table-bitwise-ops}.

@c 11/2014: Postprocessing turns the docbook informaltable
@c into a table. Hurray for scripting!
@float Table,table-bitwise-ops
@caption{Bitwise operations}
@ifnottex
@ifnotdocbook
@verbatim
                Bit operator
          |  AND  |   OR  |  XOR
          |---+---+---+---+---+---
Operands  | 0 | 1 | 0 | 1 | 0 | 1
----------+---+---+---+---+---+---
    0     | 0   0 | 0   1 | 0   1
    1     | 0   1 | 1   1 | 1   0
@end verbatim
@end ifnotdocbook
@end ifnottex
@tex
\centerline{
\vbox{\bigskip % space above the table (about 1 linespace)
% Because we have vertical rules, we can't let TeX insert interline space
% in its usual way.
\offinterlineskip
\halign{\strut\hfil#\quad\hfil  % operands
        &\vrule#&\quad#\quad    % rule, 0 (of and)
        &\vrule#&\quad#\quad    % rule, 1 (of and)
        &\vrule#                % rule between and and or
        &\quad#\quad            % 0 (of or)
        &\vrule#&\quad#\quad    % rule, 1 (of or)
        &\vrule#                % rule between or and xor
        &\quad#\quad            % 0 of xor
        &\vrule#&\quad#\quad    % rule, 1 of xor
        \cr
&\omit&\multispan{11}\hfil\bf Bit operator\hfil\cr
\noalign{\smallskip}
&     &\multispan3\hfil AND\hfil&&\multispan3\hfil OR\hfil
                           &&\multispan3\hfil XOR\hfil\cr
\bf Operands&&0&&1&&0&&1&&0&&1\cr
\noalign{\hrule}
\omit&height 2pt&&\omit&&&&\omit&&&&\omit\cr
\noalign{\hrule height0pt}% without this the rule does not extend; why?
0&&0&\omit&0&&0&\omit&1&&0&\omit&1\cr
1&&0&\omit&1&&1&\omit&1&&1&\omit&0\cr
}}}
@end tex

@docbook
<informaltable>

<tgroup cols="7" colsep="1">
<colspec colname="c1"/>
<colspec colname="c2"/>
<colspec colname="c3"/>
<colspec colname="c4"/>
<colspec colname="c5"/>
<colspec colname="c6"/>
<colspec colname="c7"/>
<spanspec spanname="optitle" namest="c2" nameend="c7" align="center"/>
<spanspec spanname="andspan" namest="c2" nameend="c3" align="center"/>
<spanspec spanname="orspan" namest="c4" nameend="c5" align="center"/>
<spanspec spanname="xorspan" namest="c6" nameend="c7" align="center"/>

<tbody>
<row>
<entry colsep="0"></entry>
<entry spanname="optitle"><emphasis role="bold">Bit operator</emphasis></entry>
</row>

<row rowsep="1">
<entry rowsep="0"></entry>
<entry spanname="andspan">AND</entry>
<entry spanname="orspan">OR</entry>
<entry spanname="xorspan">XOR</entry>
</row>

<row rowsep="1">
<entry ><emphasis role="bold">Operands</emphasis></entry>
<entry colsep="0">0</entry>
<entry colsep="1">1</entry>
<entry colsep="0">0</entry>
<entry colsep="1">1</entry>
<entry colsep="0">0</entry>
<entry colsep="1">1</entry>
</row>

<row>
<entry align="center">0</entry>
<entry colsep="0">0</entry>
<entry>0</entry>
<entry colsep="0">0</entry>
<entry>1</entry>
<entry colsep="0">0</entry>
<entry>1</entry>
</row>

<row>
<entry align="center">1</entry>
<entry colsep="0">0</entry>
<entry>1</entry>
<entry colsep="0">1</entry>
<entry>1</entry>
<entry colsep="0">1</entry>
<entry>0</entry>
</row>

</tbody>
</tgroup>
</informaltable>
@end docbook
@end float

@cindex bitwise @subentry complement
@cindex complement, bitwise
As
you can see, the result of an AND operation is 1 only when @emph{both}
bits are 1.
The result of an OR operation is 1 if @emph{either} bit is 1.
The result of an XOR operation is 1 if either bit is 1,
but not both.
The next operation is the @dfn{complement}; the complement of 1 is 0 and
the complement of 0 is 1. Thus, this operation ``flips'' all the bits
of a given value.

@cindex bitwise @subentry shift
@cindex left shift, bitwise
@cindex right shift, bitwise
@cindex shift, bitwise
Finally, two other common operations are to shift the bits left or right.
For example, if you have a bit string @samp{10111001} and you shift it
right by three bits, you end up with @samp{00010111}.@footnote{This example
shows that zeros come in on the left side. For @command{gawk}, this is
always true, but in some languages, it's possible to have the left side
fill with ones.}
If you start over again with @samp{10111001} and shift it left by three
bits, you end up with @samp{11001000}. The following list describes
@command{gawk}'s built-in functions that implement the bitwise operations.
Optional parameters are enclosed in square brackets ([ ]):

@cindex @command{gawk} @subentry bitwise operations in
@table @asis
@cindexgawkfunc{and}
@cindex bitwise @subentry AND
@item @code{and(}@var{v1}@code{,} @var{v2} [@code{,} @dots{}]@code{)}
Return the bitwise AND of the arguments. There must be at least two.

@cindexgawkfunc{compl}
@cindex bitwise @subentry complement
@item @code{compl(@var{val})}
Return the bitwise complement of @var{val}.

@cindexgawkfunc{lshift}
@item @code{lshift(@var{val}, @var{count})}
Return the value of @var{val}, shifted left by @var{count} bits.

@cindexgawkfunc{or}
@cindex bitwise @subentry OR
@item @code{or(}@var{v1}@code{,} @var{v2} [@code{,} @dots{}]@code{)}
Return the bitwise OR of the arguments. There must be at least two.

@cindexgawkfunc{rshift}
@item @code{rshift(@var{val}, @var{count})}
Return the value of @var{val}, shifted right by @var{count} bits.

@cindexgawkfunc{xor}
@cindex bitwise @subentry XOR
@item @code{xor(}@var{v1}@code{,} @var{v2} [@code{,} @dots{}]@code{)}
Return the bitwise XOR of the arguments. There must be at least two.
@end table

@quotation CAUTION
Beginning with @command{gawk} @value{PVERSION} 4.2, negative
operands are not allowed for any of these functions. A negative
operand produces a fatal error. See the sidebar
``Beware The Smoke and Mirrors!'' for more information as to why.
@end quotation

Here is a user-defined function (@pxref{User-defined})
that illustrates the use of these functions:

@cindex @code{bits2str()} user-defined function
@cindex user-defined @subentry function @subentry @code{bits2str()}
@cindex @file{testbits.awk} program
@example
@group
@c file eg/lib/bits2str.awk
# bits2str --- turn an integer into readable ones and zeros

function bits2str(bits,        data, mask)
@{
    if (bits == 0)
        return "0"

    mask = 1
    for (; bits != 0; bits = rshift(bits, 1))
        data = (and(bits, mask) ? "1" : "0") data

    while ((length(data) % 8) != 0)
        data = "0" data

    return data
@}
@c endfile
@end group

@c this is a hack to make testbits.awk self-contained
@ignore
@c file eg/prog/testbits.awk
# bits2str --- turn an integer into readable ones and zeros

function bits2str(bits,        data, mask)
@{
    if (bits == 0)
        return "0"

    mask = 1
    for (; bits != 0; bits = rshift(bits, 1))
        data = (and(bits, mask) ? "1" : "0") data

    while ((length(data) % 8) != 0)
        data = "0" data

    return data
@}
@c endfile
@end ignore
@c file eg/prog/testbits.awk
BEGIN @{
    printf "123 = %s\n", bits2str(123)
    printf "0123 = %s\n", bits2str(0123)
    printf "0x99 = %s\n", bits2str(0x99)
    comp = compl(0x99)
    printf "compl(0x99) = %#x = %s\n", comp, bits2str(comp)
    shift = lshift(0x99, 2)
    printf "lshift(0x99, 2) = %#x = %s\n", shift, bits2str(shift)
    shift = rshift(0x99, 2)
    printf "rshift(0x99, 2) = %#x = %s\n", shift, bits2str(shift)
@}
@c endfile
@end example

@noindent
This program produces the following output when run:

@example
$ @kbd{gawk -f testbits.awk}
@print{} 123 = 01111011
@print{} 0123 = 01010011
@print{} 0x99 = 10011001
@print{} compl(0x99) = 0x3fffffffffff66 =
@print{} 00111111111111111111111111111111111111111111111101100110
@print{} lshift(0x99, 2) = 0x264 = 0000001001100100
@print{} rshift(0x99, 2) = 0x26 = 00100110
@end example

@cindex converting @subentry string to numbers
@cindex strings @subentry converting
@cindex numbers @subentry converting
@cindex converting @subentry numbers to strings
@cindex numbers @subentry as string of bits
The @code{bits2str()} function turns a binary number into a string.
Initializing @code{mask} to one creates
a binary value where the rightmost bit
is set to one. Using this mask,
the function repeatedly checks the rightmost bit.
ANDing the mask with the value indicates whether the
rightmost bit is one or not. If so, a @code{"1"} is concatenated onto the front
of the string.
Otherwise, a @code{"0"} is added.
The value is then shifted right by one bit and the loop continues
until there are no more one bits.

If the initial value is zero, it returns a simple @code{"0"}.
Otherwise, at the end, it pads the value with zeros to represent multiples
of 8-bit quantities. This is typical in modern computers.

The main code in the @code{BEGIN} rule shows the difference between the
decimal and octal values for the same numbers
(@pxref{Nondecimal-numbers}),
and then demonstrates the
results of the @code{compl()}, @code{lshift()}, and @code{rshift()} functions.

@sidebar Beware The Smoke and Mirrors!

In other languages, bitwise operations are performed on integer values,
not floating-point values. As a general statement, such operations work
best when performed on unsigned integers.

@command{gawk} attempts to treat the arguments to the bitwise functions
as unsigned integers. For this reason, negative arguments produce a
fatal error.

In normal operation, for all of these functions, first the
double-precision floating-point value is converted to the widest C
unsigned integer type, then the bitwise operation is performed. If the
result cannot be represented exactly as a C @code{double}, leading
nonzero bits are removed one by one until it can be represented exactly.
The result is then converted back into a C @code{double}.@footnote{If you don't
understand this paragraph, the upshot is that @command{gawk} can only
store a particular range of integer values; numbers outside that range
are reduced to fit within the range.}

However, when using arbitrary precision arithmetic with the @option{-M}
option (@pxref{Arbitrary Precision Arithmetic}), the results may differ.
This is particularly noticeable with the @code{compl()} function:

@example
$ @kbd{gawk 'BEGIN @{ print compl(42) @}'}
@print{} 9007199254740949
$ @kbd{gawk -M 'BEGIN @{ print compl(42) @}'}
@print{} -43
@end example

What's going on becomes clear when printing the results
in hexadecimal:

@example
$ @kbd{gawk 'BEGIN @{ printf "%#x\n", compl(42) @}'}
@print{} 0x1fffffffffffd5
$ @kbd{gawk -M 'BEGIN @{ printf "%#x\n", compl(42) @}'}
@print{} 0xffffffffffffffd5
@end example

When using the @option{-M} option, under the hood, @command{gawk} uses
GNU MP arbitrary precision integers which have at least 64 bits of precision.
When not using @option{-M}, @command{gawk} stores integral values in
regular double-precision floating point, which maintains only 53 bits of
precision. Furthermore, the GNU MP library treats (or at least seems to treat)
the leading bit as a sign bit; thus the result with @option{-M} in this case is
a negative number.

In short, using @command{gawk} for any but the simplest kind of bitwise
operations is probably a bad idea; caveat emptor!

@end sidebar

@node Type Functions
@subsection Getting Type Information

@command{gawk} provides two functions that let you distinguish
the type of a variable.
This is necessary for writing code
that traverses every element of an array of arrays
(@pxref{Arrays of Arrays}), and in other contexts.

@table @code
@cindexgawkfunc{isarray}
@cindex scalar or array
@item isarray(@var{x})
Return a true value if @var{x} is an array. Otherwise, return false.

@cindexgawkfunc{typeof}
@cindex variable type, @code{typeof()} function (@command{gawk})
@cindex type @subentry of variable, @code{typeof()} function (@command{gawk})
@item typeof(@var{x})
Return one of the following strings, depending upon the type of @var{x}:

@c nested table
@table @code
@item "array"
@var{x} is an array.

@item "regexp"
@var{x} is a strongly typed regexp (@pxref{Strong Regexp Constants}).

@item "number"
@var{x} is a number.

@item "string"
@var{x} is a string.

@item "strnum"
@var{x} is a number that started life as user input, such as a field or
the result of calling @code{split()}. (I.e., @var{x} has the strnum
attribute; @pxref{Variable Typing}.)

@item "unassigned"
@var{x} is a scalar variable that has not been assigned a value yet.
For example:

@example
BEGIN @{
    # creates a[1] but it has no assigned value
    a[1]
    print typeof(a[1])  # unassigned
@}
@end example

@item "untyped"
@var{x} has not been used yet at all; it can become a scalar or an
array. The typing could even conceivably differ from run to run of
the same program!
For example:

@example
BEGIN @{
    print "initially, typeof(v) = ", typeof(v)

    if ("FOO" in ENVIRON)
        make_scalar(v)
    else
        make_array(v)

    print "typeof(v) =", typeof(v)
@}

function make_scalar(p,    l) @{ l = p @}

function make_array(p) @{ p[1] = 1 @}
@end example

@end table
@end table

@code{isarray()} is meant for use in two circumstances. The first is when
traversing a multidimensional array: you can test if an element is itself
an array or not. The second is inside the body of a user-defined function
(not discussed yet; @pxref{User-defined}), to test if a parameter is an
array or not.

@quotation NOTE
While you can use @code{isarray()} at the global level to test variables,
doing so makes no sense. Because @emph{you} are the one writing the
program, @emph{you} are supposed to know if your variables are arrays
or not.
@end quotation

The @code{typeof()} function is general; it allows you to determine
if a variable or function parameter is a scalar (number, string,
or strongly typed regexp) or an array.

Normally, passing a variable that has never been used to a built-in
function causes it to become a scalar variable (unassigned).
However, @code{isarray()} and @code{typeof()} are different; they do
not change their arguments from untyped to unassigned.

@cindex dark corner @subentry array elements created by reference
By ``variable'' we mean one denoted by a simple identifier. Array elements
that come into existence simply by referencing them
are different; they are automatically forced to be scalars.
Consider:

@example
$ @kbd{gawk 'BEGIN @{ print typeof(x) @}'}
@print{} untyped
$ @kbd{gawk 'BEGIN @{ print typeof(x["foo"]) @}'}
@print{} unassigned
@end example

@noindent
@code{x["foo"]} comes into existence before it is passed to @code{typeof()};
@code{typeof()} cannot tell that it didn't exist prior to being called.
@value{DARKCORNER}

@c FIXME: For 5.2, if this will change, update this bit of doc.
@c This may change in a future release, whereby @command{gawk}
@c would allow such an unassigned array element to be used for
@c a multidimensional array, and not remain a scalar forever
@c (or until deleted).

@node I18N Functions
@subsection String-Translation Functions
@cindex @command{gawk} @subentry string-translation functions
@cindex functions @subentry string-translation
@cindex string-translation functions
@cindex internationalization
@cindex @command{awk} programs @subentry internationalizing

@command{gawk} provides facilities for internationalizing @command{awk} programs.
These include the functions described in the following list.
The descriptions here are purposely brief.
@xref{Internationalization},
for the full story.
Optional parameters are enclosed in square brackets ([ ]):

@table @asis
@cindexgawkfunc{bindtextdomain}
@cindex set directory of message catalogs
@item @code{bindtextdomain(@var{directory}} [@code{,} @var{domain}]@code{)}
Set the directory in which
@command{gawk} will look for message translation files, in case they
will not or cannot be placed in the ``standard'' locations
(e.g., during testing).
It returns the directory in which @var{domain} is ``bound.''

The default @var{domain} is the value of @code{TEXTDOMAIN}.
If @var{directory} is the null string (@code{""}), then
@code{bindtextdomain()} returns the current binding for the
given @var{domain}.

@cindexgawkfunc{dcgettext}
@cindex translate string
@item @code{dcgettext(@var{string}} [@code{,} @var{domain} [@code{,} @var{category}] ]@code{)}
Return the translation of @var{string} in
text domain @var{domain} for locale category @var{category}.
The default value for @var{domain} is the current value of @code{TEXTDOMAIN}.
The default value for @var{category} is @code{"LC_MESSAGES"}.

@cindexgawkfunc{dcngettext}
@item @code{dcngettext(@var{string1}, @var{string2}, @var{number}} [@code{,} @var{domain} [@code{,} @var{category}] ]@code{)}
Return the plural form used for @var{number} of the
translation of @var{string1} and @var{string2} in text domain
@var{domain} for locale category @var{category}. @var{string1} is the
English singular variant of a message, and @var{string2} is the English plural
variant of the same message.
The default value for @var{domain} is the current value of @code{TEXTDOMAIN}.
The default value for @var{category} is @code{"LC_MESSAGES"}.
@end table

@node User-defined
@section User-Defined Functions

@cindex user-defined @subentry functions
@cindex functions @subentry user-defined
Complicated @command{awk} programs can often be simplified by defining
your own functions. User-defined functions can be called just like
built-in ones (@pxref{Function Calls}), but it is up to you to define
them (i.e., to tell @command{awk} what they should do).

@menu
* Definition Syntax::           How to write definitions and what they mean.
* Function Example::            An example function definition and what it
                                does.
* Function Calling::            Calling user-defined functions.
* Return Statement::            Specifying the value a function returns.
* Dynamic Typing::              How variable types can change at runtime.
@end menu

@node Definition Syntax
@subsection Function Definition Syntax

@quotation
@i{It's entirely fair to say that the awk syntax for local
variable definitions is appallingly awful.}
@author Brian Kernighan
@end quotation

@cindex functions @subentry defining
Definitions of functions can appear anywhere between the rules of an
@command{awk} program. Thus, the general form of an @command{awk} program is
extended to include sequences of rules @emph{and} user-defined function
definitions.
There is no need to put the definition of a function
before all uses of the function. This is because @command{awk} reads the
entire program before starting to execute any of it.

The definition of a function named @var{name} looks like this:

@display
@group
@code{function} @var{name}@code{(}[@var{parameter-list}]@code{)}
@code{@{}
     @var{body-of-function}
@code{@}}
@end group
@end display

@cindex names @subentry functions
@cindex functions @subentry names of
@cindex naming issues @subentry functions
@noindent
Here, @var{name} is the name of the function to define. A valid function
name is like a valid variable name: a sequence of letters, digits, and
underscores that doesn't start with a digit.
Here too, only the 52 upper- and lowercase English letters may
be used in a function name.
Within a single @command{awk} program, any particular name can only be
used as a variable, array, or function.

@var{parameter-list} is an optional list of the function's arguments and local
variable names, separated by commas. When the function is called,
the argument names are used to hold the argument values given in
the call.
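
As a minimal sketch of the points so far (the function name
@code{greet} is hypothetical and is used only for illustration),
the following program calls a function before its definition appears
and passes one argument, which the parameter @code{name} holds while
the body runs:

@example
BEGIN @{
    # The call may come before the definition; awk reads the
    # entire program before it starts executing any of it.
    greet("world")
@}

# greet --- hypothetical function, for illustration only
function greet(name)
@{
    print "Hello, " name
@}
@end example

@noindent
When run, this sketch prints @samp{Hello, world}.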

A function cannot have two parameters with the same name, nor may it
have a parameter with the same name as the function itself.

@quotation CAUTION
According to the POSIX standard, function parameters
cannot have the same name as one of the special predefined variables
(@pxref{Built-in Variables}), nor may a function parameter have the
same name as another function.

@cindex dark corner @subentry parameter name restrictions
Not all versions of @command{awk} enforce
these restrictions.  @value{DARKCORNER}
@command{gawk} always enforces the first restriction.
With @option{--posix} (@pxref{Options}),
it also enforces the second restriction.
@end quotation

Local variables act like the empty string if referenced where a string
value is required, and like zero if referenced where a numeric value
is required. This is the same as the behavior of regular variables that
have never been assigned a value.  (There is more to understand about
local variables; @pxref{Dynamic Typing}.)

The @var{body-of-function} consists of @command{awk} statements.  It is the
most important part of the definition, because it says what the function
should actually @emph{do}.  The argument names exist to give the body a
way to talk about the arguments; local variables exist to give the body
places to keep temporary values.

Argument names are not distinguished syntactically from local variable
names.  Instead, the number of arguments supplied when the function is
called determines how many argument variables there are.  Thus, if three
argument values are given, the first three names in @var{parameter-list}
are arguments and the rest are local variables.
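
For illustration only, here is a sketch using a made-up function,
@code{show()}, that is called with different numbers of arguments:

@example
function show(a, b, c)
@{
    printf "a=<%s> b=<%s> c=<%s>\n", a, b, c
@}

BEGIN @{
    show("x", "y", "z")    # a, b, and c are all arguments
    show("x")              # only a is an argument; b and c are locals
@}
@end example

@noindent
The second call should print @samp{a=<x> b=<> c=<>}, because @code{b} and
@code{c} start out with the null string as their value.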
20146 20147It follows that if the number of arguments is not the same in all calls 20148to the function, some of the names in @var{parameter-list} may be 20149arguments on some occasions and local variables on others. Another 20150way to think of this is that omitted arguments default to the 20151null string. 20152 20153@cindex programming conventions @subentry functions @subentry writing 20154Usually when you write a function, you know how many names you intend to 20155use for arguments and how many you intend to use as local variables. It is 20156conventional to place some extra space between the arguments and 20157the local variables, in order to document how your function is supposed to be used. 20158 20159@cindex variables @subentry shadowing 20160@cindex shadowing of variable values 20161During execution of the function body, the arguments and local variable 20162values hide, or @dfn{shadow}, any variables of the same names used in the 20163rest of the program. The shadowed variables are not accessible in the 20164function definition, because there is no way to name them while their 20165names have been taken away for the arguments and local variables. All other variables 20166used in the @command{awk} program can be referenced or set normally in the 20167function's body. 20168 20169The arguments and local variables last only as long as the function body 20170is executing. Once the body finishes, you can once again access the 20171variables that were shadowed while the function was running. 20172 20173@cindex recursive functions 20174@cindex functions @subentry recursive 20175The function body can contain expressions that call functions. They 20176can even call this function, either directly or by way of another 20177function. When this happens, we say the function is @dfn{recursive}. 20178The act of a function calling itself is called @dfn{recursion}. 20179 20180All the built-in functions return a value to their caller. 
20181User-defined functions can do so also, using the @code{return} statement, 20182which is described in detail in @ref{Return Statement}. 20183Many of the subsequent examples in this @value{SECTION} use 20184the @code{return} statement. 20185 20186@cindex common extensions @subentry @code{func} keyword 20187@cindex extensions @subentry common @subentry @code{func} keyword 20188@c @cindex POSIX @command{awk} 20189@cindex @command{awk} @subentry language, POSIX version 20190@cindex POSIX @command{awk} @subentry @code{function} keyword in 20191In many @command{awk} implementations, including @command{gawk}, 20192the keyword @code{function} may be 20193abbreviated @code{func}. @value{COMMONEXT} 20194However, POSIX only specifies the use of 20195the keyword @code{function}. This actually has some practical implications. 20196If @command{gawk} is in POSIX-compatibility mode 20197(@pxref{Options}), then the following 20198statement does @emph{not} define a function: 20199 20200@example 20201func foo() @{ a = sqrt($1) ; print a @} 20202@end example 20203 20204@noindent 20205Instead, it defines a rule that, for each record, concatenates the value 20206of the variable @samp{func} with the return value of the function @samp{foo}. 20207If the resulting string is non-null, the action is executed. 20208This is probably not what is desired. (@command{awk} accepts this input as 20209syntactically valid, because functions may be used before they are defined 20210in @command{awk} programs.@footnote{This program won't actually run, 20211because @code{foo()} is undefined.}) 20212 20213@cindex portability @subentry functions, defining 20214To ensure that your @command{awk} programs are portable, always use the 20215keyword @code{function} when defining a function. 
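
The shadowing rule described earlier in this @value{SUBSECTION} can be
sketched as follows. The names @code{bump()}, @code{n}, and @code{total}
are invented for this example:

@example
function bump(n)          # this n shadows the global n
@{
    n = n + 1             # changes only the function's own n
    total = total + n     # total is not shadowed, so the global is updated
@}

BEGIN @{
    n = 10; total = 0
    bump(5)
    print n, total
@}
@end example

@noindent
This should print @samp{10 6}: the global @code{n} is untouched, whereas
the global @code{total} is modified by the function body.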
20216 20217@node Function Example 20218@subsection Function Definition Examples 20219@cindex function definition example 20220 20221Here is an example of a user-defined function, called @code{myprint()}, that 20222takes a number and prints it in a specific format: 20223 20224@example 20225function myprint(num) 20226@{ 20227 printf "%6.3g\n", num 20228@} 20229@end example 20230 20231@noindent 20232To illustrate, here is an @command{awk} rule that uses our @code{myprint()} 20233function: 20234 20235@example 20236$3 > 0 @{ myprint($3) @} 20237@end example 20238 20239@noindent 20240This program prints, in our special format, all the third fields that 20241contain a positive number in our input. Therefore, when given the following input: 20242 20243@example 20244 1.2 3.4 5.6 7.8 20245 9.10 11.12 -13.14 15.16 2024617.18 19.20 21.22 23.24 20247@end example 20248 20249@noindent 20250this program, using our function to format the results, prints: 20251 20252@example 20253 5.6 20254 21.2 20255@end example 20256 20257This function deletes all the elements in an array (recall that the 20258extra whitespace signifies the start of the local variable list): 20259 20260@example 20261@group 20262function delarray(a, i) 20263@{ 20264 for (i in a) 20265 delete a[i] 20266@} 20267@end group 20268@end example 20269 20270When working with arrays, it is often necessary to delete all the elements 20271in an array and start over with a new list of elements 20272(@pxref{Delete}). 20273Instead of having 20274to repeat this loop everywhere that you need to clear out 20275an array, your program can just call @code{delarray()}. 20276(This guarantees portability. The use of @samp{delete @var{array}} to delete 20277the contents of an entire array is a relatively recent@footnote{Late in 2012.} 20278addition to the POSIX standard.) 20279 20280The following is an example of a recursive function. It takes a string 20281as an input parameter and returns the string in reverse order. 
20282Recursive functions must always have a test that stops the recursion. 20283In this case, the recursion terminates when the input string is 20284already empty: 20285 20286@c 8/2014: Thanks to Mike Brennan for the improved formulation 20287@cindex @code{rev()} user-defined function 20288@cindex user-defined @subentry function @subentry @code{rev()} 20289@example 20290function rev(str) 20291@{ 20292 if (str == "") 20293 return "" 20294 20295 return (rev(substr(str, 2)) substr(str, 1, 1)) 20296@} 20297@end example 20298 20299If this function is in a file named @file{rev.awk}, it can be tested 20300this way: 20301 20302@example 20303$ @kbd{echo "Don't Panic!" |} 20304> @kbd{gawk -e '@{ print rev($0) @}' -f rev.awk} 20305@print{} !cinaP t'noD 20306@end example 20307 20308The C @code{ctime()} function takes a timestamp and returns it as a string, 20309formatted in a well-known fashion. 20310The following example uses the built-in @code{strftime()} function 20311(@pxref{Time Functions}) 20312to create an @command{awk} version of @code{ctime()}: 20313 20314@cindex @code{ctime()} user-defined function 20315@cindex user-defined @subentry function @subentry @code{ctime()} 20316@example 20317@c file eg/lib/ctime.awk 20318# ctime.awk 20319# 20320# awk version of C ctime(3) function 20321 20322function ctime(ts, format) 20323@{ 20324 format = "%a %b %e %H:%M:%S %Z %Y" 20325 20326 if (ts == 0) 20327 ts = systime() # use current time as default 20328 return strftime(format, ts) 20329@} 20330@c endfile 20331@end example 20332 20333You might think that @code{ctime()} could use @code{PROCINFO["strftime"]} 20334for its format string. That would be a mistake, because @code{ctime()} is 20335supposed to return the time formatted in a standard fashion, and user-level 20336code could have changed @code{PROCINFO["strftime"]}. 
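
As a usage sketch, assuming the function above is saved in
@file{ctime.awk} and loaded with @option{-f}, a program might call it
like this (the timestamp @code{86400} is an arbitrary illustrative value):

@example
BEGIN @{
    print ctime()         # no argument: ts compares equal to zero,
                          # so the current time is used
    print ctime(86400)    # one day after the start of the epoch
@}
@end example

@noindent
The exact output depends on the current time, time zone, and locale.
One consequence of the test @samp{ts == 0} is that this version cannot
be asked to format timestamp zero itself.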
20337 20338@node Function Calling 20339@subsection Calling User-Defined Functions 20340 20341@cindex functions @subentry user-defined @subentry calling 20342@dfn{Calling a function} means causing the function to run and do its job. 20343A function call is an expression and its value is the value returned by 20344the function. 20345 20346@menu 20347* Calling A Function:: Don't use spaces. 20348* Variable Scope:: Controlling variable scope. 20349* Pass By Value/Reference:: Passing parameters. 20350* Function Caveats:: Other points to know about functions. 20351@end menu 20352 20353@node Calling A Function 20354@subsubsection Writing a Function Call 20355 20356A function call consists of the function name followed by the arguments 20357in parentheses. @command{awk} expressions are what you write in the 20358call for the arguments. Each time the call is executed, these 20359expressions are evaluated, and the values become the actual arguments. For 20360example, here is a call to @code{foo()} with three arguments (the first 20361being a string concatenation): 20362 20363@example 20364foo(x y, "lose", 4 * z) 20365@end example 20366 20367@quotation CAUTION 20368Whitespace characters (spaces and TABs) are not allowed 20369between the function name and the opening parenthesis of the argument list. 20370If you write whitespace by mistake, @command{awk} might think that you mean 20371to concatenate a variable with an expression in parentheses. However, it 20372notices that you used a function name and not a variable name, and reports 20373an error. 20374@end quotation 20375 20376@node Variable Scope 20377@subsubsection Controlling Variable Scope 20378 20379@cindex local variables @subentry in a function 20380@cindex variables @subentry local to a function 20381Unlike in many languages, 20382there is no way to make a variable local to a @code{@{} @dots{} @code{@}} block in 20383@command{awk}, but you can make a variable local to a function. 
It is 20384good practice to do so whenever a variable is needed only in that 20385function. 20386 20387To make a variable local to a function, simply declare the variable as 20388an argument after the actual function arguments 20389(@pxref{Definition Syntax}). 20390Look at the following example, where variable 20391@code{i} is a global variable used by both functions @code{foo()} and 20392@code{bar()}: 20393 20394@example 20395function bar() 20396@{ 20397 for (i = 0; i < 3; i++) 20398 print "bar's i=" i 20399@} 20400 20401function foo(j) 20402@{ 20403 i = j + 1 20404 print "foo's i=" i 20405 bar() 20406 print "foo's i=" i 20407@} 20408 20409BEGIN @{ 20410 i = 10 20411 print "top's i=" i 20412 foo(0) 20413 print "top's i=" i 20414@} 20415@end example 20416 20417Running this script produces the following, because the @code{i} in 20418functions @code{foo()} and @code{bar()} and at the top level refer to the same 20419variable instance: 20420 20421@example 20422top's i=10 20423foo's i=1 20424bar's i=0 20425bar's i=1 20426bar's i=2 20427foo's i=3 20428top's i=3 20429@end example 20430 20431If you want @code{i} to be local to both @code{foo()} and @code{bar()}, do as 20432follows (the extra space before @code{i} is a coding convention to 20433indicate that @code{i} is a local variable, not an argument): 20434 20435@example 20436function bar( i) 20437@{ 20438 for (i = 0; i < 3; i++) 20439 print "bar's i=" i 20440@} 20441 20442function foo(j, i) 20443@{ 20444 i = j + 1 20445 print "foo's i=" i 20446 bar() 20447 print "foo's i=" i 20448@} 20449 20450BEGIN @{ 20451 i = 10 20452 print "top's i=" i 20453 foo(0) 20454 print "top's i=" i 20455@} 20456@end example 20457 20458Running the corrected script produces the following: 20459 20460@example 20461top's i=10 20462foo's i=1 20463bar's i=0 20464bar's i=1 20465bar's i=2 20466foo's i=1 20467top's i=10 20468@end example 20469 20470Besides scalar values (strings and numbers), you may also have 20471local arrays. 
By using a parameter name as an array, @command{awk} 20472treats it as an array, and it is local to the function. 20473In addition, recursive calls create new arrays. 20474Consider this example: 20475 20476@example 20477@group 20478function some_func(p1, a) 20479@{ 20480 if (p1++ > 3) 20481 return 20482@end group 20483 20484 a[p1] = p1 20485 20486 some_func(p1) 20487 20488 printf("At level %d, index %d %s found in a\n", 20489 p1, (p1 - 1), (p1 - 1) in a ? "is" : "is not") 20490 printf("At level %d, index %d %s found in a\n", 20491 p1, p1, p1 in a ? "is" : "is not") 20492 print "" 20493@} 20494 20495BEGIN @{ 20496 some_func(1) 20497@} 20498@end example 20499 20500When run, this program produces the following output: 20501 20502@example 20503At level 4, index 3 is not found in a 20504At level 4, index 4 is found in a 20505 20506At level 3, index 2 is not found in a 20507At level 3, index 3 is found in a 20508 20509At level 2, index 1 is not found in a 20510At level 2, index 2 is found in a 20511@end example 20512 20513@node Pass By Value/Reference 20514@subsubsection Passing Function Arguments by Value Or by Reference 20515 20516In @command{awk}, when you declare a function, there is no way to 20517declare explicitly whether the arguments are passed @dfn{by value} or 20518@dfn{by reference}. 20519 20520Instead, the passing convention is determined at runtime when 20521the function is called, according to the following rule: 20522if the argument is an array variable, then it is passed by reference. 20523Otherwise, the argument is passed by value. 20524 20525@cindex call by value 20526Passing an argument by value means that when a function is called, it 20527is given a @emph{copy} of the value of this argument. 20528The caller may use a variable as the expression for the argument, but 20529the called function does not know this---it only knows what value the 20530argument had. 
For example, if you write the following code: 20531 20532@example 20533foo = "bar" 20534z = myfunc(foo) 20535@end example 20536 20537@noindent 20538then you should not think of the argument to @code{myfunc()} as being 20539``the variable @code{foo}.'' Instead, think of the argument as the 20540string value @code{"bar"}. 20541If the function @code{myfunc()} alters the values of its local variables, 20542this has no effect on any other variables. Thus, if @code{myfunc()} 20543does this: 20544 20545@example 20546@group 20547function myfunc(str) 20548@{ 20549 print str 20550 str = "zzz" 20551 print str 20552@} 20553@end group 20554@end example 20555 20556@noindent 20557to change its first argument variable @code{str}, it does @emph{not} 20558change the value of @code{foo} in the caller. The role of @code{foo} in 20559calling @code{myfunc()} ended when its value (@code{"bar"}) was computed. 20560If @code{str} also exists outside of @code{myfunc()}, the function body 20561cannot alter this outer value, because it is shadowed during the 20562execution of @code{myfunc()} and cannot be seen or changed from there. 20563 20564@cindex call by reference 20565@cindex arrays @subentry as parameters to functions 20566@cindex functions @subentry arrays as parameters to 20567However, when arrays are the parameters to functions, they are @emph{not} 20568copied. Instead, the array itself is made available for direct manipulation 20569by the function. This is usually termed @dfn{call by reference}. 20570Changes made to an array parameter inside the body of a function @emph{are} 20571visible outside that function. 20572 20573@quotation NOTE 20574Changing an array parameter inside a function 20575can be very dangerous if you do not watch what you are doing. 
For example:

@example
function changeit(array, ind, nvalue)
@{
    array[ind] = nvalue
@}

BEGIN @{
    a[1] = 1; a[2] = 2; a[3] = 3
    changeit(a, 2, "two")
    printf "a[1] = %s, a[2] = %s, a[3] = %s\n",
            a[1], a[2], a[3]
@}
@end example

@noindent
prints @samp{a[1] = 1, a[2] = two, a[3] = 3}, because
@code{changeit()} stores @code{"two"} in the second element of @code{a}.
@end quotation

@node Function Caveats
@subsubsection Other Points About Calling Functions

@cindex undefined functions
@cindex functions @subentry undefined
Some @command{awk} implementations allow you to call a function that
has not been defined. They only report a problem at runtime, when the
program actually tries to call the function. For example:

@example
BEGIN @{
    if (0)
        foo()
    else
        bar()
@}
function bar() @{ @dots{} @}
# note that `foo' is not defined
@end example

@noindent
Because the @samp{if} statement will never be true, it is not really a
problem that @code{foo()} has not been defined.  Usually, though, it is a
problem if a program calls an undefined function.

@cindex lint checking @subentry undefined functions
If @option{--lint} is specified
(@pxref{Options}),
@command{gawk} reports calls to undefined functions.

@cindex portability @subentry @code{next} statement in user-defined functions
Some @command{awk} implementations generate a runtime
error if you use either the @code{next} statement
or the @code{nextfile} statement
(@pxref{Next Statement}, and
@ifdocbook
@ref{Nextfile Statement})
@end ifdocbook
@ifnotdocbook
@pxref{Nextfile Statement})
@end ifnotdocbook
inside a user-defined function.
@command{gawk} does not have this limitation.
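
As a sketch of what this permits in @command{gawk} (the function name
@code{skip_comments()} is hypothetical), a helper function called from
an ordinary rule can decide to abandon the current record:

@example
# skip_comments --- skip records that begin with a `#'

function skip_comments()
@{
    if ($0 ~ /^#/)
        next
@}

@{
    skip_comments()
    print
@}
@end example

@noindent
With @command{gawk}, this should print every input record that does not
begin with @samp{#}. Other @command{awk} implementations may reject
the program, so avoid this style in code that must be portable.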
20640 20641You can call a function and pass it more parameters than it was declared 20642with, like so: 20643 20644@example 20645function foo(p1, p2) 20646@{ 20647 @dots{} 20648@} 20649 20650BEGIN @{ 20651 foo(1, 2, 3, 4) 20652@} 20653@end example 20654 20655Doing so is bad practice, however. The called function cannot do 20656anything with the additional values being passed to it, so @command{awk} 20657evaluates the expressions but then just throws them away. 20658 20659More importantly, such a call is confusing for whoever will next read your 20660program.@footnote{Said person might even be you, sometime in the future, 20661at which point you will wonder, ``what was I thinking?!?''} Function 20662parameters generally are input items that influence the computation 20663performed by the function. Calling a function with more parameters than 20664it accepts gives the false impression that those values are important 20665to the function, when in fact they are not. 20666 20667Because this is such a bad practice, @command{gawk} @emph{unconditionally} 20668issues a warning whenever it executes such a function call. (If you 20669don't like the warning, fix your code! It's incorrect, after all.) 20670 20671@node Return Statement 20672@subsection The @code{return} Statement 20673@cindex @code{return} statement, user-defined functions 20674 20675As seen in several earlier examples, 20676the body of a user-defined function can contain a @code{return} statement. 20677This statement returns control to the calling part of the @command{awk} program. It 20678can also be used to return a value for use in the rest of the @command{awk} 20679program. It looks like this: 20680 20681@display 20682@code{return} [@var{expression}] 20683@end display 20684 20685The @var{expression} part is optional. 20686Due most likely to an oversight, POSIX does not define what the return 20687value is if you omit the @var{expression}. 
Technically speaking, this 20688makes the returned value undefined, and therefore, unpredictable. 20689In practice, though, all versions of @command{awk} simply return the 20690null string, which acts like zero if used in a numeric context. 20691 20692A @code{return} statement without an @var{expression} is assumed at the end of 20693every function definition. So, if control reaches the end of the function 20694body, then technically the function returns an unpredictable value. 20695In practice, it returns the empty string. @command{awk} 20696does @emph{not} warn you if you use the return value of such a function. 20697 20698Sometimes, you want to write a function for what it does, not for 20699what it returns. Such a function corresponds to a @code{void} function 20700in C, C++, or Java, or to a @code{procedure} in Ada. Thus, it may be appropriate to not 20701return any value; simply bear in mind that you should not be using the 20702return value of such a function. 20703 20704The following is an example of a user-defined function that returns a value 20705for the largest number among the elements of an array: 20706 20707@example 20708function maxelt(vec, i, ret) 20709@{ 20710 for (i in vec) @{ 20711 if (ret == "" || vec[i] > ret) 20712 ret = vec[i] 20713 @} 20714 return ret 20715@} 20716@end example 20717 20718@cindex programming conventions @subentry function parameters 20719@noindent 20720You call @code{maxelt()} with one argument, which is an array name. The local 20721variables @code{i} and @code{ret} are not intended to be arguments; 20722there is nothing to stop you from passing more than one argument 20723to @code{maxelt()} but the results would be strange. The extra space before 20724@code{i} in the function parameter list indicates that @code{i} and 20725@code{ret} are local variables. 20726You should follow this convention when defining functions. 20727 20728The following program uses the @code{maxelt()} function. 
It loads an
array, calls @code{maxelt()}, and then reports the maximum number in that
array:

@example
function maxelt(vec,   i, ret)
@{
    for (i in vec) @{
        if (ret == "" || vec[i] > ret)
            ret = vec[i]
    @}
    return ret
@}

@group
# Load all fields of each record into nums.
@{
    for(i = 1; i <= NF; i++)
        nums[NR, i] = $i
@}
@end group

END @{
    print maxelt(nums)
@}
@end example

Given the following input:

@example
 1 5 23 8 16
44 3 5 2 8 26
256 291 1396 2962 100
-6 467 998 1101
99385 11 0 225
@end example

@noindent
the program reports (predictably) that 99,385 is the largest value
in the array.

@node Dynamic Typing
@subsection Functions and Their Effects on Variable Typing

@command{awk} is a very fluid language.
It is possible that @command{awk} can't tell if an identifier
represents a scalar variable or an array until runtime.
Here is an annotated sample program:

@example
function foo(a)
@{
    a[1] = 1   # parameter is an array
@}

BEGIN @{
    b = 1
    foo(b)  # invalid: fatal type mismatch

    foo(x)  # x uninitialized, becomes an array dynamically
    x = 1   # now not allowed, runtime error
@}
@end example

In this example, the first call to @code{foo()} generates
a fatal error, so @command{awk} will not report the second
error. If you comment out that call, though, then @command{awk}
does report the second error.

Usually, such things aren't a big issue, but it's worth
being aware of them.
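
One way to watch such a change happen is with the @code{typeof()}
function. The following is only a sketch, and it assumes a version of
@command{gawk} recent enough to provide @code{typeof()}; it is not
portable to other @command{awk} implementations:

@example
function foo(a)
@{
    a[1] = 1
@}

BEGIN @{
    print typeof(x)     # x has not been used yet
    foo(x)              # x becomes an array dynamically
    print typeof(x)
@}
@end example

@noindent
The second @code{print} statement should report @samp{array}, showing
that the call to @code{foo()} fixed the type of @code{x}.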

@node Indirect Calls
@section Indirect Function Calls

@cindex indirect function calls
@cindex function calls @subentry indirect
@cindex function pointers
@cindex pointers to functions
@cindex differences in @command{awk} and @command{gawk} @subentry indirect function calls

This section describes an advanced, @command{gawk}-specific extension.

Often, you may wish to defer the choice of function to call until runtime.
For example, you may have different kinds of records, each of which
should be processed differently.

Normally, you would have to use a series of @code{if}-@code{else}
statements to decide which function to call.  By using @dfn{indirect}
function calls, you can specify the name of the function to call as a
string variable, and then call the function.  Let's look at an example.

Suppose you have a file with your test scores for the classes you
are taking, and you wish to get the sum and the average of
your test scores.
The first field is the class name.  The following fields
are the functions to call to process the data, up to a ``marker''
field @samp{data:}.  Following the marker, to the end of the record,
are the various numeric test scores.
20828 20829Here is the initial file: 20830 20831@example 20832@c file eg/data/class_data1 20833Biology_101 sum average data: 87.0 92.4 78.5 94.9 20834Chemistry_305 sum average data: 75.2 98.3 94.7 88.2 20835English_401 sum average data: 100.0 95.6 87.1 93.4 20836@c endfile 20837@end example 20838 20839To process the data, you might write initially: 20840 20841@example 20842@{ 20843 class = $1 20844 for (i = 2; $i != "data:"; i++) @{ 20845 if ($i == "sum") 20846 sum() # processes the whole record 20847 else if ($i == "average") 20848 average() 20849 @dots{} # and so on 20850 @} 20851@} 20852@end example 20853 20854@noindent 20855This style of programming works, but can be awkward. With @dfn{indirect} 20856function calls, you tell @command{gawk} to use the @emph{value} of a 20857variable as the @emph{name} of the function to call. 20858 20859@cindex @code{@@} (at-sign) @subentry @code{@@}-notation for indirect function calls 20860@cindex at-sign (@code{@@}) @subentry @code{@@}-notation for indirect function calls 20861@cindex indirect function calls @subentry @code{@@}-notation 20862@cindex function calls @subentry indirect @subentry @code{@@}-notation for 20863The syntax is similar to that of a regular function call: an identifier 20864immediately followed by an opening parenthesis, any arguments, and then 20865a closing parenthesis, with the addition of a leading @samp{@@} 20866character: 20867 20868@example 20869the_func = "sum" 20870result = @@the_func() # calls the sum() function 20871@end example 20872 20873Here is a full program that processes the previously shown data, 20874using indirect function calls: 20875 20876@example 20877@c file eg/prog/indirectcall.awk 20878# indirectcall.awk --- Demonstrate indirect function calls 20879@c endfile 20880@ignore 20881@c file eg/prog/indirectcall.awk 20882# 20883# Arnold Robbins, arnold@@skeeve.com, Public Domain 20884# January 2009 20885@c endfile 20886@end ignore 20887 20888@c file eg/prog/indirectcall.awk 20889# 
average --- return the average of the values in fields $first - $last 20890 20891function average(first, last, sum, i) 20892@{ 20893 sum = 0; 20894 for (i = first; i <= last; i++) 20895 sum += $i 20896 20897 return sum / (last - first + 1) 20898@} 20899 20900# sum --- return the sum of the values in fields $first - $last 20901 20902function sum(first, last, ret, i) 20903@{ 20904 ret = 0; 20905 for (i = first; i <= last; i++) 20906 ret += $i 20907 20908 return ret 20909@} 20910@c endfile 20911@end example 20912 20913These two functions expect to work on fields; thus, the parameters 20914@code{first} and @code{last} indicate where in the fields to start and end. 20915Otherwise, they perform the expected computations and are not unusual: 20916 20917@example 20918@c file eg/prog/indirectcall.awk 20919# For each record, print the class name and the requested statistics 20920@{ 20921 class_name = $1 20922 gsub(/_/, " ", class_name) # Replace _ with spaces 20923 20924 # find start 20925 for (i = 1; i <= NF; i++) @{ 20926 if ($i == "data:") @{ 20927 start = i + 1 20928 break 20929 @} 20930 @} 20931 20932 printf("%s:\n", class_name) 20933 for (i = 2; $i != "data:"; i++) @{ 20934 the_function = $i 20935 printf("\t%s: <%s>\n", $i, @@the_function(start, NF) "") 20936 @} 20937 print "" 20938@} 20939@c endfile 20940@end example 20941 20942This is the main processing for each record. It prints the class name (with 20943underscores replaced with spaces). It then finds the start of the actual data, 20944saving it in @code{start}. 20945The last part of the code loops through each function name (from @code{$2} up to 20946the marker, @samp{data:}), calling the function named by the field. The indirect 20947function call itself occurs as a parameter in the call to @code{printf}. 20948(The @code{printf} format string uses @samp{%s} as the format specifier so that we 20949can use functions that return strings, as well as numbers. 
Note that the result 20950from the indirect call is concatenated with the empty string, in order to force 20951it to be a string value.) 20952 20953Here is the result of running the program: 20954 20955@example 20956$ @kbd{gawk -f indirectcall.awk class_data1} 20957@print{} Biology 101: 20958@print{} sum: <352.8> 20959@print{} average: <88.2> 20960@print{} 20961@print{} Chemistry 305: 20962@print{} sum: <356.4> 20963@print{} average: <89.1> 20964@print{} 20965@print{} English 401: 20966@print{} sum: <376.1> 20967@print{} average: <94.025> 20968@end example 20969 20970The ability to use indirect function calls is more powerful than you may 20971think at first. The C and C++ languages provide ``function pointers,'' which 20972are a mechanism for calling a function chosen at runtime. One of the most 20973well-known uses of this ability is the C @code{qsort()} function, which sorts 20974an array using the famous ``quicksort'' algorithm 20975(see @uref{https://en.wikipedia.org/wiki/Quicksort, the Wikipedia article} 20976for more information). To use this function, you supply a pointer to a comparison 20977function. This mechanism allows you to sort arbitrary data in an arbitrary 20978fashion. 20979 20980We can do something similar using @command{gawk}, like this: 20981 20982@example 20983@c file eg/lib/quicksort.awk 20984# quicksort.awk --- Quicksort algorithm, with user-supplied 20985# comparison function 20986@c endfile 20987@ignore 20988@c file eg/lib/quicksort.awk 20989# 20990# Arnold Robbins, arnold@@skeeve.com, Public Domain 20991# January 2009 20992 20993@c endfile 20994@end ignore 20995@c file eg/lib/quicksort.awk 20996 20997# quicksort --- C.A.R. Hoare's quicksort algorithm. See Wikipedia 20998# or almost any algorithms or computer science text. 
20999@c endfile 21000@ignore 21001@c file eg/lib/quicksort.awk 21002# 21003# Adapted from K&R-II, page 110 21004@c endfile 21005@end ignore 21006@c file eg/lib/quicksort.awk 21007 21008function quicksort(data, left, right, less_than, i, last) 21009@{ 21010 if (left >= right) # do nothing if array contains fewer 21011 return # than two elements 21012 21013 quicksort_swap(data, left, int((left + right) / 2)) 21014 last = left 21015 for (i = left + 1; i <= right; i++) 21016 if (@@less_than(data[i], data[left])) 21017 quicksort_swap(data, ++last, i) 21018 quicksort_swap(data, left, last) 21019 quicksort(data, left, last - 1, less_than) 21020 quicksort(data, last + 1, right, less_than) 21021@} 21022 21023# quicksort_swap --- helper function for quicksort, should really be inline 21024 21025function quicksort_swap(data, i, j, temp) 21026@{ 21027 temp = data[i] 21028 data[i] = data[j] 21029 data[j] = temp 21030@} 21031@c endfile 21032@end example 21033 21034The @code{quicksort()} function receives the @code{data} array, the starting and ending 21035indices to sort (@code{left} and @code{right}), and the name of a function that 21036performs a ``less than'' comparison. It then implements the quicksort algorithm. 21037 21038To make use of the sorting function, we return to our previous example. 
The first thing to do is write some comparison functions:

@example
@c file eg/prog/indirectcall.awk
@group
# num_lt --- do a numeric less than comparison

function num_lt(left, right)
@{
    return ((left + 0) < (right + 0))
@}
@end group

# num_ge --- do a numeric greater than or equal to comparison

function num_ge(left, right)
@{
    return ((left + 0) >= (right + 0))
@}
@c endfile
@end example

The @code{num_ge()} function is needed to perform a descending sort; when used
to perform a ``less than'' test, it actually does the opposite (greater than
or equal to), which yields data sorted in descending order.

Next comes a sorting function. It is parameterized with the starting and
ending field numbers and the comparison function. It builds an array with
the data and calls @code{quicksort()} appropriately, and then formats the
results as a single string:

@example
@c file eg/prog/indirectcall.awk
# do_sort --- sort the data according to `compare'
#             and return it as a string

function do_sort(first, last, compare,      data, i, retval)
@{
    delete data
    for (i = 1; first <= last; first++) @{
        data[i] = $first
        i++
    @}

    quicksort(data, 1, i-1, compare)

    retval = data[1]
    for (i = 2; i in data; i++)
        retval = retval " " data[i]

    return retval
@}
@c endfile
@end example

Finally, the two sorting functions call @code{do_sort()}, passing in the
names of the two comparison functions:

@example
@c file eg/prog/indirectcall.awk
@group
# sort --- sort the data in ascending order and return it as a string

function sort(first, last)
@{
    return do_sort(first, last, "num_lt")
@}
@end group

@group
# rsort --- sort the data in descending order and return it as a string

function rsort(first, last)
@{
    return do_sort(first, last, "num_ge")
@}
@end group
@c endfile
@end example

Here is an extended version of the @value{DF}:

@example
@c file eg/data/class_data2
Biology_101 sum average sort rsort data: 87.0 92.4 78.5 94.9
Chemistry_305 sum average sort rsort data: 75.2 98.3 94.7 88.2
English_401 sum average sort rsort data: 100.0 95.6 87.1 93.4
@c endfile
@end example

Finally, here are the results when the enhanced program is run:

@example
$ @kbd{gawk -f quicksort.awk -f indirectcall.awk class_data2}
@print{} Biology 101:
@print{}     sum: <352.8>
@print{}     average: <88.2>
@print{}     sort: <78.5 87.0 92.4 94.9>
@print{}     rsort: <94.9 92.4 87.0 78.5>
@print{}
@print{} Chemistry 305:
@print{}     sum: <356.4>
@print{}     average: <89.1>
@print{}     sort: <75.2 88.2 94.7 98.3>
@print{}     rsort: <98.3 94.7 88.2 75.2>
@print{}
@print{} English 401:
@print{}     sum: <376.1>
@print{}     average: <94.025>
@print{}     sort: <87.1 93.4 95.6 100.0>
@print{}     rsort: <100.0 95.6 93.4 87.1>
@end example

Another example where indirect function calls are useful can be found in
processing arrays. This is described in @ref{Walking Arrays}.

Remember that you must supply a leading @samp{@@} in front of an indirect function call.

Starting with @value{PVERSION} 4.1.2 of @command{gawk}, indirect function
calls may also be used with built-in functions and with extension functions
(@pxref{Dynamic Extensions}). There are some limitations when calling
built-in functions indirectly, as follows.

@itemize @value{BULLET}
@item
You cannot pass a regular expression constant to a built-in function
through an indirect function call.@footnote{This may change in a future
version; recheck the documentation that comes with your version of
@command{gawk} to see if it has.} This applies to the @code{sub()},
@code{gsub()}, @code{gensub()}, @code{match()}, @code{split()} and
@code{patsplit()} functions.

@item
If calling @code{sub()} or @code{gsub()}, you may only pass two arguments,
since those functions are unusual in that they update their third argument.
This means that @code{$0} will be updated.
@end itemize

@command{gawk} does its best to make indirect function calls efficient.
For example, in the following case:

@example
for (i = 1; i <= n; i++)
    @@the_func()
@end example

@noindent
@command{gawk} looks up the actual function to call only once.

@node Functions Summary
@section Summary

@itemize @value{BULLET}
@item
@command{awk} provides built-in functions and lets you define your own
functions.

@item
POSIX @command{awk} provides three kinds of built-in functions: numeric,
string, and I/O. @command{gawk} provides functions that sort arrays, work
with values representing time, do bit manipulation, determine variable
type (array versus scalar), and internationalize and localize programs.
@command{gawk} also provides several extensions to some of the standard
functions, typically in the form of additional arguments.

@item
Functions accept zero or more arguments and return a value. The
expressions that provide the argument values are completely evaluated
before the function is called. Order of evaluation is not defined.
The return value can be ignored.

@item
The handling of backslash in @code{sub()} and @code{gsub()} is not simple.
It is more straightforward in @command{gawk}'s @code{gensub()} function,
but that function still requires care in its use.

@item
User-defined functions provide important capabilities but come with
some syntactic inelegancies. In a function call, there cannot be any
space between the function name and the opening left parenthesis of the
argument list. Also, there is no provision for local variables, so the
convention is to add extra parameters, and to separate them visually
from the real parameters by extra whitespace.

@item
User-defined functions may call other user-defined (and built-in)
functions and may call themselves recursively. Function parameters
``hide'' any global variables of the same names.
You cannot use the name of a reserved variable (such as @code{ARGC})
as the name of a parameter in user-defined functions.

@item
Scalar values are passed to user-defined functions by value. Array
parameters are passed by reference; any changes made by the function to
array parameters are thus visible after the function has returned.

@item
Use the @code{return} statement to return from a user-defined function.
An optional expression becomes the function's return value. Only scalar
values may be returned by a function.

@item
If a variable that has never been used is passed to a user-defined
function, how that function treats the variable can set its nature:
either scalar or array.

@item
@command{gawk} provides indirect function calls using a special syntax.
By setting a variable to the name of a function, you can
determine at runtime what function will be called at that point in the
program. This is equivalent to function pointers in C and C++.

@end itemize


@ifnotinfo
@part @value{PART2}Problem Solving with @command{awk}
@end ifnotinfo

@ifdocbook
Part II shows how to use @command{awk} and @command{gawk} for problem solving.
There is lots of code here for you to read and learn from.
It contains the following chapters:

@itemize @value{BULLET}
@item
@ref{Library Functions}

@item
@ref{Sample Programs}
@end itemize
@end ifdocbook

@node Library Functions
@chapter A Library of @command{awk} Functions
@cindex libraries of @command{awk} functions
@cindex functions @subentry library
@cindex functions @subentry user-defined @subentry library of

@ref{User-defined} describes how to write
your own @command{awk} functions. Writing functions is important, because
it allows you to encapsulate algorithms and program tasks in a single
place. It simplifies programming, making program development more
manageable and making programs more readable.

@cindex Kernighan, Brian @subentry quotes
@cindex Plauger, P.J.@:
In their seminal 1976 book, @cite{Software Tools},@footnote{Sadly, over 35
years later, many of the lessons taught by this book have yet to be
learned by a vast number of practicing programmers.} Brian Kernighan
and P.J.@: Plauger wrote:

@quotation
Good Programming is not learned from generalities, but by seeing how
significant programs can be made clean, easy to read, easy to maintain and
modify, human-engineered, efficient and reliable, by the application of
common sense and good programming practices. Careful study and imitation
of good programs leads to better writing.
@end quotation

In fact, they felt this idea was so important that they placed this
statement on the cover of their book.
Because we believe strongly
that their statement is correct, this @value{CHAPTER} and @ref{Sample
Programs}, provide a good-sized body of code for you to read and, we hope,
to learn from.

This @value{CHAPTER} presents a library of useful @command{awk} functions.
Many of the sample programs presented later in this @value{DOCUMENT}
use these functions.
The functions are presented here in a progression from simple to complex.

@cindex Texinfo
@ref{Extract Program}
presents a program that you can use to extract the source code for
these example library functions and programs from the Texinfo source
for this @value{DOCUMENT}.
(This has already been done as part of the @command{gawk} distribution.)

@ifclear FOR_PRINT
If you have written one or more useful, general-purpose @command{awk} functions
and would like to contribute them to the @command{awk} user community, see
@ref{How To Contribute}, for more information.
@end ifclear

@cindex portability @subentry example programs
The programs in this @value{CHAPTER} and in
@ref{Sample Programs},
freely use @command{gawk}-specific features.
Rewriting these programs for different implementations of @command{awk}
is pretty straightforward:

@itemize @value{BULLET}
@item
Diagnostic error messages are sent to @file{/dev/stderr}.
Use @samp{| "cat 1>&2"} instead of @samp{> "/dev/stderr"} if your system
does not have a @file{/dev/stderr}, or if you cannot use @command{gawk}.

@item
A number of programs use @code{nextfile}
(@pxref{Nextfile Statement})
to skip any remaining input in the input file.

@item
@c 12/2000: Thanks to Nelson Beebe for pointing out the output issue.
@cindex case sensitivity @subentry example programs
@cindex @code{IGNORECASE} variable @subentry in example programs
Finally, some of the programs choose to ignore upper- and lowercase
distinctions in their input. They do so by assigning one to @code{IGNORECASE}.
You can achieve almost the same effect@footnote{The effects are
not identical. Output of the transformed
record will be in all lowercase, while @code{IGNORECASE} preserves the original
contents of the input record.} by adding the following rule to the
beginning of the program:

@example
# ignore case
@{ $0 = tolower($0) @}
@end example

@noindent
Also, verify that all regexp and string constants used in
comparisons use only lowercase letters.
@end itemize

@menu
* Library Names::               How to best name private global variables in
                                library functions.
* General Functions::           Functions that are of general use.
* Data File Management::        Functions for managing command-line data
                                files.
* Getopt Function::             A function for processing command-line
                                arguments.
* Passwd Functions::            Functions for getting user information.
* Group Functions::             Functions for getting group information.
* Walking Arrays::              A function to walk arrays of arrays.
* Library Functions Summary::   Summary of library functions.
* Library Exercises::           Exercises.
@end menu

@node Library Names
@section Naming Library Function Global Variables

@cindex names @subentry arrays/variables
@cindex names @subentry functions
@cindex naming issues
@cindex @command{awk} programs @subentry documenting
@cindex documentation @subentry of @command{awk} programs
Due to the way the @command{awk} language evolved, variables are either
@dfn{global} (usable by the entire program) or @dfn{local} (usable just by
a specific function).
There is no intermediate state analogous to
@code{static} variables in C.

@cindex variables @subentry global @subentry for library functions
@cindex private variables
@cindex variables @subentry private
Library functions often need to have global variables that they can use to
preserve state information between calls to the function---for example,
@code{getopt()}'s variable @code{_opti}
(@pxref{Getopt Function}).
Such variables are called @dfn{private}, as the only functions that need to
use them are the ones in the library.

When writing a library function, you should try to choose names for your
private variables that will not conflict with any variables used by
either another library function or a user's main program. For example, a
name like @code{i} or @code{j} is not a good choice, because user programs
often use variable names like these for their own purposes.

@cindex programming conventions @subentry private variable names
The example programs shown in this @value{CHAPTER} all start the names of their
private variables with an underscore (@samp{_}). Users generally don't use
leading underscores in their variable names, so this convention immediately
decreases the chances that the variable names will be accidentally shared
with the user's program.

@cindex @code{_} (underscore) @subentry in names of private variables
@cindex underscore (@code{_}) @subentry in names of private variables
In addition, several of the library functions use a prefix that helps
indicate what function or set of functions use the variables---for example,
@code{_pw_byname} in the user database routines
(@pxref{Passwd Functions}).
This convention is recommended, as it even further decreases the
chance of inadvertent conflict among variable names.
Note that this
convention works equally well for variable names and for private
function names.@footnote{Although all the library routines could have
been rewritten to use this convention, this was not done, in order to
show how our own @command{awk} programming style has evolved and to
provide some basis for this discussion.}

As a final note on variable naming, if a function makes global variables
available for use by a main program, it is a good convention to start those
variables' names with a capital letter---for
example, @code{getopt()}'s @code{Opterr} and @code{Optind} variables
(@pxref{Getopt Function}).
The leading capital letter indicates that the variable is global, while the fact that
the variable name is not all capital letters indicates that the variable is
not one of @command{awk}'s predefined variables, such as @code{FS}.

@cindex @option{--dump-variables} option @subentry using for library functions
It is also important that @emph{all} variables in library
functions that do not need to save state are, in fact, declared
local.@footnote{@command{gawk}'s @option{--dump-variables} command-line
option is useful for verifying this.} If this is not done, the variables
could accidentally be used in the user's program, leading to bugs that
are very difficult to track down:

@example
function lib_func(x, y,    l1, l2)
@{
    @dots{}
    # some_var should be local but by oversight is not
    @var{use variable} some_var
    @dots{}
@}
@end example

@cindex arrays @subentry associative @subentry library functions and
@cindex libraries of @command{awk} functions @subentry associative arrays and
@cindex functions @subentry library @subentry associative arrays and
@cindex Tcl
A different convention, common in the Tcl community, is to use a single
associative array to hold the values needed by
the library function(s), or
``package.'' This significantly decreases the number of actual global names
in use. For example, the functions described in
@ref{Passwd Functions}
might have used array elements @code{@w{PW_data["inited"]}}, @code{@w{PW_data["total"]}},
@code{@w{PW_data["count"]}}, and @code{@w{PW_data["awklib"]}}, instead of
@code{@w{_pw_inited}}, @code{@w{_pw_total}}, @code{@w{_pw_count}},
and @code{@w{_pw_awklib}}.

The conventions presented in this @value{SECTION} are exactly
that: conventions. You are not required to write your programs this
way---we merely recommend that you do so.

Beginning with @value{PVERSION} 5.0, @command{gawk} provides
a powerful mechanism for solving the problems described in this
@value{SECTION}: @dfn{namespaces}. Namespaces and their use are described
in detail in @ref{Namespaces}.

@node General Functions
@section General Programming

This @value{SECTION} presents a number of functions that are of general
programming use.

@menu
* Strtonum Function::           A replacement for the built-in
                                @code{strtonum()} function.
* Assert Function::             A function for assertions in @command{awk}
                                programs.
* Round Function::              A function for rounding if @code{sprintf()}
                                does not do it correctly.
* Cliff Random Function::       The Cliff Random Number Generator.
* Ordinal Functions::           Functions for using characters as numbers and
                                vice versa.
* Join Function::               A function to join an array into a string.
* Getlocaltime Function::       A function to get formatted times.
* Readfile Function::           A function to read an entire file at once.
* Shell Quoting::               A function to quote strings for the shell.
* Isnumeric Function::          A function to test whether a value is numeric.
@end menu

@node Strtonum Function
@subsection Converting Strings to Numbers

The @code{strtonum()} function (@pxref{String Functions})
is a @command{gawk} extension. The following function
provides an implementation for other versions of @command{awk}:

@example
@c file eg/lib/strtonum.awk
# mystrtonum --- convert string to number

@c endfile
@ignore
@c file eg/lib/strtonum.awk
#
# Arnold Robbins, arnold@@skeeve.com, Public Domain
# February, 2004
# Revised June, 2014

@c endfile
@end ignore
@c file eg/lib/strtonum.awk
function mystrtonum(str,        ret, n, i, k, c)
@{
    if (str ~ /^0[0-7]*$/) @{
        # octal
        n = length(str)
        ret = 0
        for (i = 1; i <= n; i++) @{
            c = substr(str, i, 1)
            # index() returns 0 if c not in string,
            # includes c == "0"
            k = index("1234567", c)

            ret = ret * 8 + k
        @}
    @} else if (str ~ /^0[xX][[:xdigit:]]+$/) @{
        # hexadecimal
        str = substr(str, 3)    # lop off leading 0x
        n = length(str)
        ret = 0
        for (i = 1; i <= n; i++) @{
            c = substr(str, i, 1)
            c = tolower(c)
            # index() returns 0 if c not in string,
            # includes c == "0"
            k = index("123456789abcdef", c)

            ret = ret * 16 + k
        @}
    @} else if (str ~ \
  /^[-+]?(([0-9]+([.][0-9]*)?|[.][0-9]+)([Ee][-+]?[0-9]+)?)$/) @{
        # decimal number, possibly floating point
        ret = str + 0
    @} else
        ret = "NOT-A-NUMBER"

    return ret
@}

# BEGIN @{     # gawk test harness
#     a[1] = "25"
#     a[2] = ".31"
#     a[3] = "0123"
#     a[4] = "0xdeadBEEF"
#     a[5] = "123.45"
#     a[6] = "1.e3"
#     a[7] = "1.32"
#     a[8] = "1.32E2"
#
#     for (i = 1; i in a; i++)
#         print a[i], strtonum(a[i]), mystrtonum(a[i])
# @}
@c endfile
@end example

The function
first looks for C-style octal numbers (base 8).
If the input string matches a regular expression describing octal
numbers, then @code{mystrtonum()} loops through each character in the
string. It sets @code{k} to the index in @code{"1234567"} of the current
octal digit.
The return value will either be the same number as the digit, or zero
if the character is not there, which will be true for a @samp{0}.
This is safe, because the regexp test in the @code{if} ensures that
only octal values are converted.

Similar logic applies to the code that checks for and converts a
hexadecimal value, which starts with @samp{0x} or @samp{0X}.
The use of @code{tolower()} simplifies the computation for finding
the correct numeric value for each hexadecimal digit.

Finally, if the string matches the (rather complicated) regexp for a
regular decimal integer or floating-point number, the computation
@samp{ret = str + 0} lets @command{awk} convert the value to a
number.

A commented-out test program is included, so that the function can
be tested with @command{gawk} and the results compared to the built-in
@code{strtonum()} function.

@node Assert Function
@subsection Assertions

@cindex assertions
@cindex @code{assert()} function (C library)
@cindex C library functions @subentry @code{assert()}
@cindex libraries of @command{awk} functions @subentry assertions
@cindex functions @subentry library @subentry assertions
@cindex @command{awk} programs @subentry lengthy @subentry assertions
When writing large programs, it is often useful to know
that a condition or set of conditions is true. Before proceeding with a
particular computation, you make a statement about what you believe to be
the case. Such a statement is known as an
@dfn{assertion}.
The C language provides an @code{<assert.h>} header file
and corresponding @code{assert()} macro that a programmer can use to make
assertions. If an assertion fails, the @code{assert()} macro arranges to
print a diagnostic message describing the condition that should have
been true but was not, and then it kills the program. In C, using
@code{assert()} looks like this:

@example
@group
#include <assert.h>

int myfunc(int a, double b)
@{
     assert(a <= 5 && b >= 17.1);
     @dots{}
@}
@end group
@end example

If the assertion fails, the program prints a message similar to this:

@example
prog.c:5: assertion failed: a <= 5 && b >= 17.1
@end example

@cindex @code{assert()} user-defined function
@cindex user-defined @subentry function @subentry @code{assert()}
The C language makes it possible to turn the condition into a string for use
in printing the diagnostic message. This is not possible in @command{awk}, so
this @code{assert()} function also requires a string version of the condition
that is being tested.
Following is the function:

@example
@c file eg/lib/assert.awk
# assert --- assert that a condition is true. Otherwise, exit.

@c endfile
@ignore
@c file eg/lib/assert.awk
#
# Arnold Robbins, arnold@@skeeve.com, Public Domain
# May, 1993

@c endfile
@end ignore
@c file eg/lib/assert.awk
function assert(condition, string)
@{
    if (! condition) @{
        printf("%s:%d: assertion failed: %s\n",
            FILENAME, FNR, string) > "/dev/stderr"
        _assert_exit = 1
        exit 1
    @}
@}

@group
END @{
    if (_assert_exit)
        exit 1
@}
@end group
@c endfile
@end example

The @code{assert()} function tests the @code{condition} parameter.
If it
is false, it prints a message to standard error, using the @code{string}
parameter to describe the failed condition. It then sets the variable
@code{_assert_exit} to one and executes the @code{exit} statement.
The @code{exit} statement jumps to the @code{END} rule. If the @code{END}
rule finds @code{_assert_exit} to be true, it exits immediately.

The purpose of the test in the @code{END} rule is to
keep any other @code{END} rules from running. When an assertion fails, the
program should exit immediately.
If no assertions fail, then @code{_assert_exit} is still
false when the @code{END} rule is run normally, and the rest of the
program's @code{END} rules execute.
For all of this to work correctly, @file{assert.awk} must be the
first source file read by @command{awk}.
The function can be used in a program in the following way:

@example
function myfunc(a, b)
@{
     assert(a <= 5 && b >= 17.1, "a <= 5 && b >= 17.1")
     @dots{}
@}
@end example

@noindent
If the assertion fails, you see a message similar to the following:

@example
mydata:1357: assertion failed: a <= 5 && b >= 17.1
@end example

@cindex @code{END} pattern @subentry @code{assert()} user-defined function and
There is a small problem with this version of @code{assert()}.
An @code{END} rule is automatically added
to the program calling @code{assert()}. Normally, if a program consists
of just a @code{BEGIN} rule, the input files and/or standard input are
not read. However, now that the program has an @code{END} rule, @command{awk}
attempts to read the input @value{DF}s or standard input
(@pxref{Using BEGIN/END}),
most likely causing the program to hang as it waits for input.

@cindex @code{BEGIN} pattern @subentry @code{assert()} user-defined function and
There is a simple workaround to this:
make sure that such a @code{BEGIN} rule always ends
with an @code{exit} statement.

@node Round Function
@subsection Rounding Numbers

@cindex rounding numbers
@cindex numbers @subentry rounding
@cindex libraries of @command{awk} functions @subentry rounding numbers
@cindex functions @subentry library @subentry rounding numbers
@cindex @code{print} statement @subentry @code{sprintf()} function and
@cindex @code{printf} statement @subentry @code{sprintf()} function and
@cindex @code{sprintf()} function @subentry @code{print}/@code{printf} statements and
The way @code{printf} and @code{sprintf()}
(@pxref{Printf})
perform rounding often depends upon the system's C @code{sprintf()}
subroutine. On many machines, @code{sprintf()} rounding is @dfn{unbiased},
which means it doesn't always round a trailing .5 up, contrary
to naive expectations. In unbiased rounding, .5 rounds to even,
rather than always up, so 1.5 rounds to 2 but 4.5 rounds to 4. This means
that if you are using a format that does rounding (e.g., @code{"%.0f"}),
you should check what your system does.
The following function does
traditional rounding; it might be useful if your @command{awk}'s @code{printf}
does unbiased rounding:

@cindex @code{round()} user-defined function
@cindex user-defined @subentry function @subentry @code{round()}
@example
@c file eg/lib/round.awk
# round.awk --- do normal rounding
@c endfile
@ignore
@c file eg/lib/round.awk
#
# Arnold Robbins, arnold@@skeeve.com, Public Domain
# August, 1996
@c endfile
@end ignore
@c file eg/lib/round.awk

function round(x,   ival, aval, fraction)
@{
    ival = int(x)    # integer part, int() truncates

    # see if fractional part
    if (ival == x)   # no fraction
        return ival  # ensure no decimals

    if (x < 0) @{
        aval = -x     # absolute value
        ival = int(aval)
        fraction = aval - ival
        if (fraction >= .5)
            return int(x) - 1   # -2.5 --> -3
        else
            return int(x)       # -2.3 --> -2
    @} else @{
        fraction = x - ival
        if (fraction >= .5)
            return ival + 1
        else
            return ival
    @}
@}
@c endfile
@c don't include test harness in the file that gets installed
@group
# test harness
# @{ print $0, round($0) @}
@end group
@end example

@node Cliff Random Function
@subsection The Cliff Random Number Generator
@cindex random numbers @subentry Cliff
@cindex Cliff random numbers
@cindex numbers @subentry Cliff random
@cindex functions @subentry library @subentry Cliff random numbers

The
@uref{http://mathworld.wolfram.com/CliffRandomNumberGenerator.html, Cliff random number generator}
is a very simple random number generator that ``passes the noise sphere test
for randomness by showing no structure.''
It is easily programmed, in fewer than 10 lines of @command{awk} code:

@cindex @code{cliff_rand()} user-defined function
@cindex user-defined @subentry function @subentry @code{cliff_rand()}
@example
@c file eg/lib/cliff_rand.awk
# cliff_rand.awk --- generate Cliff random numbers
@c endfile
@ignore
@c file eg/lib/cliff_rand.awk
#
# Arnold Robbins, arnold@@skeeve.com, Public Domain
# December 2000
@c endfile
@end ignore
@c file eg/lib/cliff_rand.awk

BEGIN @{ _cliff_seed = 0.1 @}

function cliff_rand()
@{
    _cliff_seed = (100 * log(_cliff_seed)) % 1
    if (_cliff_seed < 0)
        _cliff_seed = - _cliff_seed
    return _cliff_seed
@}
@c endfile
@end example

This algorithm requires an initial ``seed'' of 0.1. Each new value
uses the current seed as input for the calculation.
If the built-in @code{rand()} function
(@pxref{Numeric Functions})
isn't random enough, you might try using this function instead.

@node Ordinal Functions
@subsection Translating Between Characters and Numbers

@cindex libraries of @command{awk} functions @subentry character values as numbers
@cindex functions @subentry library @subentry character values as numbers
@cindex characters @subentry values of as numbers
@cindex numbers @subentry as values of characters
One commercial implementation of @command{awk} supplies a built-in function,
@code{ord()}, which takes a character and returns the numeric value for that
character in the machine's character set. If the string passed to
@code{ord()} has more than one character, only the first one is used.

The inverse of this function is @code{chr()} (from the function of the same
name in Pascal), which takes a number and returns the corresponding character.
Both functions are written very nicely in @command{awk}; there is no real
reason to build them into the @command{awk} interpreter:

@cindex @code{ord()} user-defined function
@cindex user-defined @subentry function @subentry @code{ord()}
@cindex @code{chr()} user-defined function
@cindex user-defined @subentry function @subentry @code{chr()}
@cindex @code{_ord_init()} user-defined function
@cindex user-defined @subentry function @subentry @code{_ord_init()}
@example
@c file eg/lib/ord.awk
# ord.awk --- do ord and chr

# Global identifiers:
#    _ord_:        numerical values indexed by characters
#    _ord_init:    function to initialize _ord_
@c endfile
@ignore
@c file eg/lib/ord.awk
#
# Arnold Robbins, arnold@@skeeve.com, Public Domain
# 16 January, 1992
# 20 July, 1992, revised
@c endfile
@end ignore
@c file eg/lib/ord.awk

BEGIN    @{ _ord_init() @}

function _ord_init(    low, high, i, t)
@{
    low = sprintf("%c", 7) # BEL is ascii 7
    if (low == "\a") @{    # regular ascii
        low = 0
        high = 127
    @} else if (sprintf("%c", 128 + 7) == "\a") @{
        # ascii, mark parity
        low = 128
        high = 255
    @} else @{        # ebcdic(!)
        low = 0
        high = 255
    @}

    for (i = low; i <= high; i++) @{
        t = sprintf("%c", i)
        _ord_[t] = i
    @}
@}
@c endfile
@end example

@cindex character sets (machine character encodings)
@cindex ASCII
@cindex EBCDIC
@cindex Unicode
@cindex mark parity
Some explanation of the numbers used by @code{_ord_init()} is worthwhile.
The most prominent character set in use today is ASCII.@footnote{This
is changing; many systems use Unicode, a very large character set
that includes ASCII as a subset.
On systems with full Unicode support,
a character can occupy up to 32 bits, making simple tests such as
used here prohibitively expensive.}
Although an
8-bit byte can hold 256 distinct values (from 0 to 255), ASCII only
defines characters that use the values from 0 to 127.@footnote{ASCII
has been extended in many countries to use the values from 128 to 255
for country-specific characters. If your system uses these extensions,
you can simplify @code{_ord_init()} to loop from 0 to 255.}
In the now distant past,
at least one minicomputer manufacturer
@c Pr1me, blech
used ASCII, but with mark parity, meaning that the leftmost bit in the byte
is always 1. This means that on those systems, characters
have numeric values from 128 to 255.
Finally, large mainframe systems use the EBCDIC character set, which
uses all 256 values.
There are other character sets in use on some older systems, but
they are not really worth worrying about:

@example
@c file eg/lib/ord.awk
function ord(str,    c)
@{
    # only first character is of interest
    c = substr(str, 1, 1)
    return _ord_[c]
@}

function chr(c)
@{
    # force c to be numeric by adding 0
    return sprintf("%c", c + 0)
@}
@c endfile

#### test code ####
# BEGIN @{
#    for (;;) @{
#        printf("enter a character: ")
#        if (getline var <= 0)
#            break
#        printf("ord(%s) = %d\n", var, ord(var))
#    @}
# @}
@c endfile
@end example

An obvious improvement to these functions is to move the code for the
@code{@w{_ord_init}} function into the body of the @code{BEGIN} rule. It was
written this way initially for ease of development.
There is a ``test program'' in a @code{BEGIN} rule, to test the
function. It is commented out for production use.
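
As a rough illustration of how @code{ord()} and @code{chr()} work together,
here is a minimal sketch that can be run from the shell. It assumes an
ASCII system, so it fills the table with a plain loop from 0 to 127 rather
than probing the character set the way @code{_ord_init()} does. The shell
variable @code{out} and the sample values are made up for this example:

```shell
# Sketch only: assumes ASCII; the library version of _ord_init()
# also handles mark parity and EBCDIC.
out=$(awk '
function ord(str,    c)
{
    c = substr(str, 1, 1)        # only the first character counts
    return _ord_[c]
}

function chr(c)
{
    return sprintf("%c", c + 0)  # adding 0 forces a numeric value
}

BEGIN {
    for (i = 0; i <= 127; i++)   # simplified table setup (ASCII only)
        _ord_[sprintf("%c", i)] = i

    print ord("A")               # numeric value of a character
    print chr(97)                # character for a numeric value
    print chr(ord("a") - 32)     # uppercase an ASCII letter
}')
echo "$out"
```

On an ASCII system this should print 65, then a, then A, each on its own line.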

@node Join Function
@subsection Merging an Array into a String

@cindex libraries of @command{awk} functions @subentry merging arrays into strings
@cindex functions @subentry library @subentry merging arrays into strings
@cindex strings @subentry merging arrays into
@cindex arrays @subentry merging into strings
When doing string processing, it is often useful to be able to join
all the strings in an array into one long string. The following function,
@code{join()}, accomplishes this task. It is used later in several of
the application programs
(@pxref{Sample Programs}).

Good function design is important; this function needs to be general, but it
should also have a reasonable default behavior. It is called with an array
as well as the beginning and ending indices of the elements in the array to be
merged. This assumes that the array indices are numeric---a reasonable
assumption, as the array was likely created with @code{split()}
(@pxref{String Functions}):

@cindex @code{join()} user-defined function
@cindex user-defined @subentry function @subentry @code{join()}
@example
@c file eg/lib/join.awk
# join.awk --- join an array into a string
@c endfile
@ignore
@c file eg/lib/join.awk
#
# Arnold Robbins, arnold@@skeeve.com, Public Domain
# May 1993
@c endfile
@end ignore
@c file eg/lib/join.awk

function join(array, start, end, sep,    result, i)
@{
    if (sep == "")
        sep = " "
    else if (sep == SUBSEP) # magic value
        sep = ""
    result = array[start]
    for (i = start + 1; i <= end; i++)
        result = result sep array[i]
    return result
@}
@c endfile
@end example

An optional additional argument is the separator to use when joining the
strings back together.
If the caller supplies a nonempty value,
@code{join()} uses it; if it is not supplied, it has a null
value. In this case, @code{join()} uses a single space as a default
separator for the strings. If the value is equal to @code{SUBSEP},
then @code{join()} joins the strings with no separator between them.
@code{SUBSEP} serves as a ``magic'' value to indicate that there should
be no separation between the component strings.@footnote{It would
be nice if @command{awk} had an assignment operator for concatenation.
The lack of an explicit operator for concatenation makes string operations
more difficult than they really need to be.}

@node Getlocaltime Function
@subsection Managing the Time of Day

@cindex libraries of @command{awk} functions @subentry managing @subentry time
@cindex functions @subentry library @subentry managing time
@cindex timestamps @subentry formatted
@cindex time @subentry managing
The @code{systime()} and @code{strftime()} functions described in
@ref{Time Functions}
provide the minimum functionality necessary for dealing with the time of day
in human-readable form. Although @code{strftime()} is extensive, the control
formats are not necessarily easy to remember or intuitively obvious when
reading a program.

The following function, @code{getlocaltime()}, populates a user-supplied array
with preformatted time information.
It returns a string with the current
time formatted in the same way as the @command{date} utility:

@cindex @code{getlocaltime()} user-defined function
@cindex user-defined @subentry function @subentry @code{getlocaltime()}
@example
@c file eg/lib/gettime.awk
# getlocaltime.awk --- get the time of day in a usable format
@c endfile
@ignore
@c file eg/lib/gettime.awk
#
# Arnold Robbins, arnold@@skeeve.com, Public Domain, May 1993
#
@c endfile
@end ignore
@c file eg/lib/gettime.awk

# Returns a string in the format of output of date(1)
# Populates the array argument time with individual values:
#    time["second"]       -- seconds (0 - 59)
#    time["minute"]       -- minutes (0 - 59)
#    time["hour"]         -- hours (0 - 23)
#    time["althour"]      -- hours (1 - 12)
#    time["monthday"]     -- day of month (1 - 31)
#    time["month"]        -- month of year (1 - 12)
#    time["monthname"]    -- name of the month
#    time["shortmonth"]   -- short name of the month
#    time["year"]         -- year modulo 100 (0 - 99)
#    time["fullyear"]     -- full year
#    time["weekday"]      -- day of week (Sunday = 0)
#    time["altweekday"]   -- day of week (Monday = 1)
#    time["dayname"]      -- name of weekday
#    time["shortdayname"] -- short name of weekday
#    time["yearday"]      -- day of year (1 - 366)
#    time["timezone"]     -- abbreviation of timezone name
#    time["ampm"]         -- AM or PM designation
#    time["weeknum"]      -- week number, Sunday first day
#    time["altweeknum"]   -- week number, Monday first day

function getlocaltime(time,    ret, now, i)
@{
    # get time once, avoids unnecessary system calls
    now = systime()

    # return date(1)-style output
    ret = strftime("%a %b %e %H:%M:%S %Z %Y", now)

    # clear out target array
    delete time

    # fill in values, force numeric values to be
    # numeric by adding 0
    time["second"]       = strftime("%S", now) + 0
    time["minute"]       = strftime("%M", now) + 0
    time["hour"]         = strftime("%H", now) + 0
    time["althour"]      = strftime("%I", now) + 0
    time["monthday"]     = strftime("%d", now) + 0
    time["month"]        = strftime("%m", now) + 0
    time["monthname"]    = strftime("%B", now)
    time["shortmonth"]   = strftime("%b", now)
    time["year"]         = strftime("%y", now) + 0
    time["fullyear"]     = strftime("%Y", now) + 0
    time["weekday"]      = strftime("%w", now) + 0
    time["altweekday"]   = strftime("%u", now) + 0
    time["dayname"]      = strftime("%A", now)
    time["shortdayname"] = strftime("%a", now)
    time["yearday"]      = strftime("%j", now) + 0
    time["timezone"]     = strftime("%Z", now)
    time["ampm"]         = strftime("%p", now)
    time["weeknum"]      = strftime("%U", now) + 0
    time["altweeknum"]   = strftime("%W", now) + 0

    return ret
@}
@c endfile
@end example

The string indices are easier to use and read than the various formats
required by @code{strftime()}. The @code{alarm} program presented in
@ref{Alarm Program}
uses this function.
A more general design for the @code{getlocaltime()} function would have
allowed the user to supply an optional timestamp value to use instead
of the current time.

@node Readfile Function
@subsection Reading a Whole File at Once

Often, it is convenient to have the entire contents of a file available
in memory as a single string.
A straightforward but naive way to
do that might be as follows:

@example
function readfile1(file,    tmp, contents)
@{
    if ((getline tmp < file) < 0)
        return

    contents = tmp RT
    while ((getline tmp < file) > 0)
        contents = contents tmp RT

    close(file)
    return contents
@}
@end example

This function reads from @code{file} one record at a time, building
up the full contents of the file in the local variable @code{contents}.
It works, but is not necessarily efficient.

The following function, based on a suggestion by Denis Shirokov,
reads the entire contents of the named file in one shot:

@cindex @code{readfile()} user-defined function
@cindex user-defined @subentry function @subentry @code{readfile()}
@example
@c file eg/lib/readfile.awk
# readfile.awk --- read an entire file at once
@c endfile
@ignore
@c file eg/lib/readfile.awk
#
# Original idea by Denis Shirokov, cosmogen@@gmail.com, April 2013
#
@c endfile
@end ignore
@c file eg/lib/readfile.awk

function readfile(file,    tmp, save_rs)
@{
    save_rs = RS
    RS = "^$"
    getline tmp < file
    close(file)
    RS = save_rs

    return tmp
@}
@c endfile
@end example

It works by setting @code{RS} to @samp{^$}, a regular expression that
will never match if the file has contents. @command{gawk} reads data from
the file into @code{tmp}, attempting to match @code{RS}. The match fails
after each read, but fails quickly, such that @command{gawk} fills
@code{tmp} with the entire contents of the file.
(@xref{Records} for information on @code{RT} and @code{RS}.)

In the case that @code{file} is empty, the return value is the null
string.
Thus, calling code may use something like:

@example
contents = readfile("/some/path")
if (length(contents) == 0)
    # file was empty @dots{}
@end example

This tests the result to see if it is empty or not. An equivalent
test would be @samp{@w{contents == ""}}.

@xref{Extension Sample Readfile} for an extension function that
also reads an entire file into memory.

@node Shell Quoting
@subsection Quoting Strings to Pass to the Shell

@c included by permission
@ignore
Date: Sun, 27 Jul 2014 17:16:16 -0700
Message-ID: <CAKuGj+iCF_obaCLDUX60aSAgbfocFVtguG39GyeoNxTFby5sqQ@mail.gmail.com>
Subject: Useful awk function
From: Mike Brennan <mike@madronabluff.com>
To: Arnold Robbins <arnold@skeeve.com>
@end ignore

Michael Brennan offers the following programming pattern,
which he uses frequently:

@example
#! /bin/sh

awkp='
   @dots{}
   '

@var{input_program} | awk "$awkp" | /bin/sh
@end example

For example, a program of his named @command{flac-edit} has this form:

@example
$ @kbd{flac-edit -song="Whoope! That's Great" file.flac}
@end example

It generates the following output, which is to be piped to
the shell (@file{/bin/sh}):

@example
chmod +w file.flac
metaflac --remove-tag=TITLE file.flac
LANG=en_US.88591 metaflac --set-tag=TITLE='Whoope! That'"'"'s Great' file.flac
chmod -w file.flac
@end example

Note the need for shell quoting. The function @code{shell_quote()}
does it.
@code{SINGLE} is the one-character string @code{"'"} and
@code{QSINGLE} is the three-character string @code{"\"'\""}:

@example
@c file eg/lib/shellquote.awk
# shell_quote --- quote an argument for passing to the shell
@c endfile
@ignore
@c file eg/lib/shellquote.awk
#
# Michael Brennan
# brennan@@madronabluff.com
# September 2014
@c endfile
@end ignore
@c file eg/lib/shellquote.awk

function shell_quote(s,             # parameter
    SINGLE, QSINGLE, i, X, n, ret)  # locals
@{
    if (s == "")
        return "\"\""

    SINGLE = "\x27"  # single quote
    QSINGLE = "\"\x27\""
    n = split(s, X, SINGLE)

    ret = SINGLE X[1] SINGLE
    for (i = 2; i <= n; i++)
        ret = ret QSINGLE SINGLE X[i] SINGLE

    return ret
@}
@c endfile
@end example

@node Isnumeric Function
@subsection Checking Whether a Value Is Numeric

A frequent programming question is how to ascertain whether a value is numeric.
This can be solved by using this example function @code{isnumeric()}, which
employs the trick of converting a string value to user input by using the
@code{split()} function:

@cindex @code{isnumeric()} user-defined function
@cindex user-defined @subentry function @subentry @code{isnumeric()}
@example
@c file eg/lib/isnumeric.awk
# isnumeric --- check whether a value is numeric

function isnumeric(x,    f)
@{
    switch (typeof(x)) @{
    case "strnum":
    case "number":
        return 1
    case "string":
        return (split(x, f, " ") == 1) && (typeof(f[1]) == "strnum")
    default:
        return 0
    @}
@}
@c endfile
@end example

Please note that leading or trailing white space is disregarded in deciding
whether a value is numeric or not, so if it matters to you, you may want
to add an additional check for that.

Traditionally, it has been recommended to check for numeric values using the
test @samp{x+0 == x}. This function is superior in two ways: it will not
report that unassigned variables contain numeric values; and it recognizes
string values with numeric contents where @code{CONVFMT} does not yield
the original string.
On the other hand, it uses the @code{typeof()} function
(@pxref{Type Functions}), which is specific to @command{gawk}.

@node Data File Management
@section @value{DDF} Management

@cindex files @subentry managing
@cindex libraries of @command{awk} functions @subentry managing @subentry data files
@cindex functions @subentry library @subentry managing data files
This @value{SECTION} presents functions that are useful for managing
command-line @value{DF}s.

@menu
* Filetrans Function::          A function for handling data file transitions.
* Rewind Function::             A function for rereading the current file.
* File Checking::               Checking that data files are readable.
* Empty Files::                 Checking for zero-length files.
* Ignoring Assigns::            Treating assignments as file names.
@end menu

@node Filetrans Function
@subsection Noting @value{DDF} Boundaries

@cindex files @subentry managing @subentry data file boundaries
@cindex files @subentry initialization and cleanup
The @code{BEGIN} and @code{END} rules are each executed exactly once, at
the beginning and end of your @command{awk} program, respectively
(@pxref{BEGIN/END}).
We (the @command{gawk} authors) once had a user who mistakenly thought that the
@code{BEGIN} rules were executed at the beginning of each @value{DF} and the
@code{END} rules were executed at the end of each @value{DF}.

When informed
that this was not the case, the user requested that we add new special
patterns to @command{gawk}, named @code{BEGIN_FILE} and @code{END_FILE}, that
would have the desired behavior. He even supplied us the code to do so.

Adding these special patterns to @command{gawk} wasn't necessary;
the job can be done cleanly in @command{awk} itself, as illustrated
by the following library program.
It arranges to call two user-supplied functions, @code{beginfile()} and
@code{endfile()}, at the beginning and end of each @value{DF}.
Besides solving the problem in only nine(!) lines of code, it does so
@emph{portably}; this works with any implementation of @command{awk}:

@example
# transfile.awk
#
# Give the user a hook for filename transitions
#
# The user must supply functions beginfile() and endfile()
# that each take the name of the file being started or
# finished, respectively.
@c #
@c # Arnold Robbins, arnold@@skeeve.com, Public Domain
@c # January 1992

FILENAME != _oldfilename @{
    if (_oldfilename != "")
        endfile(_oldfilename)
    _oldfilename = FILENAME
    beginfile(FILENAME)
@}

END @{ endfile(FILENAME) @}
@end example

This file must be loaded before the user's ``main'' program, so that the
rule it supplies is executed first.

This rule relies on @command{awk}'s @code{FILENAME} variable, which
automatically changes for each new @value{DF}. The current @value{FN} is
saved in a private variable, @code{_oldfilename}. If @code{FILENAME} does
not equal @code{_oldfilename}, then a new @value{DF} is being processed and
it is necessary to call @code{endfile()} for the old file.
Because
@code{endfile()} should only be called if a file has been processed, the
program first checks to make sure that @code{_oldfilename} is not the null
string. The program then assigns the current @value{FN} to
@code{_oldfilename} and calls @code{beginfile()} for the file.
Because, like all @command{awk} variables, @code{_oldfilename} is
initialized to the null string, this rule executes correctly even for the
first @value{DF}.

The program also supplies an @code{END} rule to do the final processing for
the last file. Because this @code{END} rule comes before any @code{END} rules
supplied in the ``main'' program, @code{endfile()} is called first. Once
again, the value of multiple @code{BEGIN} and @code{END} rules should be clear.

@cindex @code{beginfile()} user-defined function
@cindex user-defined @subentry function @subentry @code{beginfile()}
@cindex @code{endfile()} user-defined function
@cindex user-defined @subentry function @subentry @code{endfile()}
If the same @value{DF} occurs twice in a row on the command line, then
@code{endfile()} and @code{beginfile()} are not executed at the end of the
first pass and at the beginning of the second pass.
The following version solves the problem:

@example
@c file eg/lib/ftrans.awk
# ftrans.awk --- handle datafile transitions
#
# user supplies beginfile() and endfile() functions
@c endfile
@ignore
@c file eg/lib/ftrans.awk
#
# Arnold Robbins, arnold@@skeeve.com, Public Domain
# November 1992
@c endfile
@end ignore
@c file eg/lib/ftrans.awk

FNR == 1 @{
    if (_filename_ != "")
        endfile(_filename_)
    _filename_ = FILENAME
    beginfile(FILENAME)
@}

END @{ endfile(_filename_) @}
@c endfile
@end example

@ref{Wc Program}
shows how this library function can be used and
how it simplifies writing the main program.

@sidebar So Why Does @command{gawk} Have @code{BEGINFILE} and @code{ENDFILE}?

You are probably wondering, if @code{beginfile()} and @code{endfile()}
functions can do the job, why does @command{gawk} have
@code{BEGINFILE} and @code{ENDFILE} patterns?

Good question. Normally, if @command{awk} cannot open a file, this
causes an immediate fatal error. In this case, there is no way for a
user-defined function to deal with the problem, as the mechanism for
calling it relies on the file being open and at the first record. Thus,
the main reason for @code{BEGINFILE} is to give you a ``hook'' to catch
files that cannot be processed. @code{ENDFILE} exists for symmetry,
and because it provides an easy way to do per-file cleanup processing.
For more information, refer to @ref{BEGINFILE/ENDFILE}.
@end sidebar

@node Rewind Function
@subsection Rereading the Current File

@cindex files @subentry reading
Another request for a new built-in function was for a
function that would make it possible to reread the current file.
The requesting user didn't want to have to use @code{getline}
(@pxref{Getline})
inside a loop.

However, as long as you are not in the @code{END} rule, it is
quite easy to arrange to immediately close the current input file
and then start over with it from the top.
For lack of a better name, we'll call the function @code{rewind()}:

@cindex @code{rewind()} user-defined function
@cindex user-defined @subentry function @subentry @code{rewind()}
@example
@c file eg/lib/rewind.awk
# rewind.awk --- rewind the current file and start over
@c endfile
@ignore
@c file eg/lib/rewind.awk
#
# Arnold Robbins, arnold@@skeeve.com, Public Domain
# September 2000
@c endfile
@end ignore
@c file eg/lib/rewind.awk

function rewind(    i)
@{
    # shift remaining arguments up
    for (i = ARGC; i > ARGIND; i--)
        ARGV[i] = ARGV[i-1]

    # make sure gawk knows to keep going
    ARGC++

    # make current file next to get done
    ARGV[ARGIND+1] = FILENAME

    # do it
    nextfile
@}
@c endfile
@end example

The @code{rewind()} function relies on the @code{ARGIND} variable
(@pxref{Auto-set}), which is specific to @command{gawk}. It also
relies on the @code{nextfile} keyword (@pxref{Nextfile Statement}).
Because of this, you should not call it from an @code{ENDFILE} rule.
(This isn't necessary anyway, because @command{gawk} goes to the next
file as soon as an @code{ENDFILE} rule finishes!)

You need to be careful calling @code{rewind()}. You can end up
causing an infinite loop if you don't pay attention. Here is an
example use:

@example
$ @kbd{cat data}
@print{} a
@print{} b
@print{} c
@print{} d
@print{} e

$ @kbd{cat test.awk}
@print{} FNR == 3 && ! rewound @{
@print{}    rewound = 1
@print{}    rewind()
@print{} @}
@print{}
@print{} @{ print FILENAME, FNR, $0 @}

$ @kbd{gawk -f rewind.awk -f test.awk data}
@print{} data 1 a
@print{} data 2 b
@print{} data 1 a
@print{} data 2 b
@print{} data 3 c
@group
@print{} data 4 d
@print{} data 5 e
@end group
@end example

@node File Checking
@subsection Checking for Readable @value{DDF}s

@cindex troubleshooting @subentry readable data files
@cindex readable data files, checking
@cindex files @subentry skipping
Normally, if you give @command{awk} a @value{DF} that isn't readable,
it stops with a fatal error. There are times when you might want to
just ignore such files and keep going.@footnote{The @code{BEGINFILE}
special pattern (@pxref{BEGINFILE/ENDFILE}) provides an alternative
mechanism for dealing with files that can't be opened. However, the
code here provides a portable solution.} You can do this by prepending
the following program to your @command{awk} program:

@cindex @file{readable.awk} program
@example
@c file eg/lib/readable.awk
# readable.awk --- library file to skip over unreadable files
@c endfile
@ignore
@c file eg/lib/readable.awk
#
# Arnold Robbins, arnold@@skeeve.com, Public Domain
# October 2000
# December 2010
@c endfile
@end ignore
@c file eg/lib/readable.awk

BEGIN @{
    for (i = 1; i < ARGC; i++) @{
        if (ARGV[i] ~ /^[a-zA-Z_][a-zA-Z0-9_]*=.*/ \
            || ARGV[i] == "-" || ARGV[i] == "/dev/stdin")
            continue    # assignment or standard input
        else if ((getline junk < ARGV[i]) < 0) # unreadable
            delete ARGV[i]
        else
            close(ARGV[i])
    @}
@}
@c endfile
@end example

@cindex troubleshooting @subentry @code{getline} command
This works, because the
@code{getline} won't be fatal.
Removing the element from @code{ARGV} with @code{delete}
skips the file (because it's no longer in the list).
See also @ref{ARGC and ARGV}.

Because @command{awk} variable names only allow the English letters,
the regular expression check purposely does not use character classes
such as @samp{[:alpha:]} and @samp{[:alnum:]}
(@pxref{Bracket Expressions}).

@node Empty Files
@subsection Checking for Zero-Length Files

All known @command{awk} implementations silently skip over zero-length files.
This is a by-product of @command{awk}'s implicit
read-a-record-and-match-against-the-rules loop: when @command{awk}
tries to read a record from an empty file, it immediately receives an
end-of-file indication, closes the file, and proceeds on to the next
command-line @value{DF}, @emph{without} executing any user-level
@command{awk} program code.

Using @command{gawk}'s @code{ARGIND} variable
(@pxref{Built-in Variables}), it is possible to detect when an empty
@value{DF} has been skipped. Similar to the library file presented
in @ref{Filetrans Function}, the following library file calls a function named
@code{zerofile()} that the user must provide.
The arguments passed are
the @value{FN} and the position in @code{ARGV} where it was found:

@cindex @file{zerofile.awk} program
@example
@c file eg/lib/zerofile.awk
# zerofile.awk --- library file to process empty input files
@c endfile
@ignore
@c file eg/lib/zerofile.awk
#
# Arnold Robbins, arnold@@skeeve.com, Public Domain
# June 2003
@c endfile
@end ignore
@c file eg/lib/zerofile.awk

BEGIN @{ Argind = 0 @}

ARGIND > Argind + 1 @{
    for (Argind++; Argind < ARGIND; Argind++)
        zerofile(ARGV[Argind], Argind)
@}

ARGIND != Argind @{ Argind = ARGIND @}

END @{
    if (ARGIND > Argind)
        for (Argind++; Argind <= ARGIND; Argind++)
            zerofile(ARGV[Argind], Argind)
@}
@c endfile
@end example

The user-level variable @code{Argind} allows the @command{awk} program
to track its progress through @code{ARGV}. Whenever the program detects
that @code{ARGIND} is greater than @samp{Argind + 1}, it means that one or
more empty files were skipped. The action then calls @code{zerofile()} for
each such file, incrementing @code{Argind} along the way.

The @samp{ARGIND != Argind} rule simply keeps @code{Argind} up to date
in the normal case.

Finally, the @code{END} rule catches the case of any empty files at
the end of the command-line arguments. Note that the test in the
condition of the @code{for} loop uses the @samp{<=} operator,
not @samp{<}.

@node Ignoring Assigns
@subsection Treating Assignments as @value{FFN}s

@cindex assignments as file names
@cindex file names @subentry assignments as
Occasionally, you might not want @command{awk} to process command-line
variable assignments
(@pxref{Assignment Options}).
In particular, if you have a @value{FN} that contains an @samp{=} character,
@command{awk} treats the @value{FN} as an assignment and does not process it.

Some users have suggested an additional command-line option for @command{gawk}
to disable command-line assignments. However, some simple programming with
a library file does the trick:

@cindex @file{noassign.awk} program
@example
@c file eg/lib/noassign.awk
# noassign.awk --- library file to avoid the need for a
# special option that disables command-line assignments
@c endfile
@ignore
@c file eg/lib/noassign.awk
#
# Arnold Robbins, arnold@@skeeve.com, Public Domain
# October 1999
@c endfile
@end ignore
@c file eg/lib/noassign.awk

function disable_assigns(argc, argv,    i)
@{
    for (i = 1; i < argc; i++)
        if (argv[i] ~ /^[a-zA-Z_][a-zA-Z0-9_]*=.*/)
            argv[i] = ("./" argv[i])
@}

BEGIN @{
    if (No_command_assign)
        disable_assigns(ARGC, ARGV)
@}
@c endfile
@end example

You then run your program this way:

@example
awk -v No_command_assign=1 -f noassign.awk -f yourprog.awk *
@end example

The function works by looping through the arguments.
It prepends @samp{./} to
any argument that matches the form
of a variable assignment, turning that argument into a @value{FN}.

The use of @code{No_command_assign} allows you to disable command-line
assignments at invocation time, by giving the variable a true value.
When not set, it is initially zero (i.e., false), so the command-line arguments
are left alone.
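
To see the effect, the following shell sketch creates a scratch directory
holding a file whose name looks like an assignment, and then runs an inline
copy of the library code against it. The scratch directory, the file
name @file{a=b.txt}, the shell variables, and the one-line ``main'' program
are all made up for this example:

```shell
dir=$(mktemp -d)
echo hello > "$dir/a=b.txt"     # a name that looks like an assignment

out=$(cd "$dir" && awk -v No_command_assign=1 '
function disable_assigns(argc, argv,    i)
{
    for (i = 1; i < argc; i++)
        if (argv[i] ~ /^[a-zA-Z_][a-zA-Z0-9_]*=.*/)
            argv[i] = ("./" argv[i])
}

BEGIN {
    if (No_command_assign)
        disable_assigns(ARGC, ARGV)
}

{ print FILENAME ": " $0 }      # stand-in for the main program
' a=b.txt)
echo "$out"
rm -r "$dir"
```

The record is reported as coming from @file{./a=b.txt}, showing that the
argument was read as a file rather than being treated as an assignment.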

@node Getopt Function
@section Processing Command-Line Options

@cindex libraries of @command{awk} functions @subentry command-line options
@cindex functions @subentry library @subentry command-line options
@cindex command line @subentry options @subentry processing
@cindex options @subentry command-line @subentry processing
@cindex functions @subentry library @subentry C library
@cindex arguments @subentry processing
Most utilities on POSIX-compatible systems take options on
the command line that can be used to change the way a program behaves.
@command{awk} is an example of such a program
(@pxref{Options}).
Often, options take @dfn{arguments} (i.e., data that the program needs to
correctly obey the command-line option).  For example, @command{awk}'s
@option{-F} option requires a string to use as the field separator.
The first occurrence on the command line of either @option{--} or a
string that does not begin with @samp{-} ends the options.

@cindex @code{getopt()} function (C library)
@cindex C library functions @subentry @code{getopt()}
Modern Unix systems provide a C function named @code{getopt()} for processing
command-line arguments.  The programmer provides a string describing the
one-letter options. If an option requires an argument, it is followed in the
string with a colon.  @code{getopt()} is also passed the
count and values of the command-line arguments and is called in a loop.
@code{getopt()} processes the command-line arguments for option letters.
Each time around the loop, it returns a single character representing the
next option letter that it finds, or @samp{?} if it finds an invalid option.
When it returns @minus{}1, there are no options left on the command line.

When using @code{getopt()}, options that do not take arguments can be
grouped together.
Furthermore, options that take arguments require that the
argument be present.  The argument can immediately follow the option letter,
or it can be a separate command-line argument.

Given a hypothetical program that takes
three command-line options, @option{-a}, @option{-b}, and @option{-c}, where
@option{-b} requires an argument, all of the following are valid ways of
invoking the program:

@example
prog -a -b foo -c data1 data2 data3
prog -ac -bfoo -- data1 data2 data3
prog -acbfoo data1 data2 data3
@end example

Notice that when the argument is grouped with its option, the rest of
the argument is considered to be the option's argument.
In this example, @option{-acbfoo} indicates that all of the
@option{-a}, @option{-b}, and @option{-c} options were supplied,
and that @samp{foo} is the argument to the @option{-b} option.

@code{getopt()} provides four external variables that the programmer can use:

@table @code
@item optind
The index in the argument value array (@code{argv}) where the first
nonoption command-line argument can be found.

@item optarg
The string value of the argument to an option.

@item opterr
Usually @code{getopt()} prints an error message when it finds an invalid
option.  Setting @code{opterr} to zero disables this feature.  (An
application might want to print its own error message.)

@item optopt
The letter representing the command-line option.
@end table

The following C fragment shows how @code{getopt()} might process command-line
arguments for @command{awk}:

@example
int
main(int argc, char *argv[])
@{
    @dots{}
    /* print our own message */
    opterr = 0;
    while ((c = getopt(argc, argv, "v:f:F:W:")) != -1) @{
        switch (c) @{
        case 'f':    /* file */
            @dots{}
            break;
        case 'F':    /* field separator */
            @dots{}
            break;
        case 'v':    /* variable assignment */
            @dots{}
            break;
        case 'W':    /* extension */
            @dots{}
            break;
        case '?':
        default:
            usage();
            break;
        @}
    @}
    @dots{}
@}
@end example

The GNU project's version of the original Unix utilities popularized
the use of long command-line options.  For example, a program might accept
@option{--help} in addition to @option{-h}.  Arguments to long options are
either provided as separate command-line arguments
(@samp{--source '@var{program-text}'})
or separated from the option with an @samp{=} sign
(@samp{--source='@var{program-text}'}).

As a side point, @command{gawk} actually uses the GNU @code{getopt_long()}
function to process both normal and GNU-style long options
(@pxref{Options}).

The abstraction provided by @code{getopt()} is very useful and is quite
handy in @command{awk} programs as well.  Following is an @command{awk}
version of @code{getopt()} that accepts both short and long options.

This function highlights one of the
greatest weaknesses in @command{awk}, which is that it is very poor at
manipulating single characters.  The function needs repeated calls to
@code{substr()} in order to access individual characters
(@pxref{String Functions}).@footnote{This
function was written before @command{gawk} acquired the ability to
split strings into single characters using @code{""} as the separator.
We have left it alone, as using @code{substr()} is more portable.}

The discussion that follows walks through the code a bit at a time:

@cindex @code{getopt()} user-defined function
@cindex user-defined @subentry function @subentry @code{getopt()}
@example
@c file eg/lib/getopt.awk
# getopt.awk --- Do C library getopt(3) function in awk
#                Also supports long options.
@c endfile
@ignore
@c file eg/lib/getopt.awk
#
# Arnold Robbins, arnold@@skeeve.com, Public Domain
#
# Initial version: March, 1991
# Revised: May, 1993
# Long options added by Greg Minshall, January 2020
@c endfile
@end ignore
@c file eg/lib/getopt.awk

# External variables:
#    Optind -- index in ARGV of first nonoption argument
#    Optarg -- string value of argument to current option
#    Opterr -- if nonzero, print our own diagnostic
#    Optopt -- current option letter

# Returns:
#    -1     at end of options
#    "?"    for unrecognized option
#    <s>    a string representing the current option

# Private Data:
#    _opti  -- index in multiflag option, e.g., -abc
@c endfile
@end example

The function starts out with comments presenting
a list of the global variables it uses,
what the return values are, what they mean, and any global variables that
are ``private'' to this library function.  Such documentation is essential
for any program, and particularly for library functions.

The @code{getopt()} function first checks that it was indeed called with
a string of options (the @code{options} parameter).
If both
@code{options} and @code{longopts} have a zero length,
@code{getopt()} immediately returns @minus{}1:

@cindex @code{getopt()} user-defined function
@cindex user-defined @subentry function @subentry @code{getopt()}
@example
@c file eg/lib/getopt.awk
function getopt(argc, argv, options, longopts,    thisopt, i, j)
@{
    if (length(options) == 0 && length(longopts) == 0)
        return -1                # no options given

@group
    if (argv[Optind] == "--") @{  # all done
        Optind++
        _opti = 0
        return -1
@end group
    @} else if (argv[Optind] !~ /^-[^:[:space:]]/) @{
        _opti = 0
        return -1
    @}
@c endfile
@end example

The next thing to check for is the end of the options.  A @option{--}
ends the command-line options, as does any command-line argument that
does not begin with a @samp{-} (unless it is an argument to a preceding
option).  @code{Optind} steps through
the array of command-line arguments; it retains its value across calls
to @code{getopt()}, because it is a global variable.

The regular expression @code{@w{/^-[^:[:space:]]/}}
checks for a @samp{-} followed by anything
that is not whitespace and not a colon.
If the current command-line argument does not match this pattern,
it is not an option, and it ends option processing.
Now, we
check to see if we are processing a short (single letter) option, or a
long option (indicated by two dashes, e.g., @samp{--filename}).
If it
is a short option, we continue on:

@example
@c file eg/lib/getopt.awk
    if (argv[Optind] !~ /^--/) @{        # if this is a short option
        if (_opti == 0)
            _opti = 2
        thisopt = substr(argv[Optind], _opti, 1)
        Optopt = thisopt
        i = index(options, thisopt)
        if (i == 0) @{
            if (Opterr)
                printf("%c -- invalid option\n", thisopt) > "/dev/stderr"
            if (_opti >= length(argv[Optind])) @{
                Optind++
                _opti = 0
            @} else
                _opti++
            return "?"
        @}
@c endfile
@end example

The @code{_opti} variable tracks the position in the current command-line
argument (@code{argv[Optind]}).  If multiple options are
grouped together with one @samp{-} (e.g., @option{-abx}), it is necessary
to return them to the user one at a time.

If @code{_opti} is equal to zero, it is set to two, which is the index in
the string of the next character to look at (we skip the @samp{-}, which
is at position one).  The variable @code{thisopt} holds the character,
obtained with @code{substr()}.  It is saved in @code{Optopt} for the main
program to use.

If @code{thisopt} is not in the @code{options} string, then it is an
invalid option.  If @code{Opterr} is nonzero, @code{getopt()} prints an error
message on the standard error that is similar to the message from the C
version of @code{getopt()}.

Because the option is invalid, it is necessary to skip it and move on to the
next option character.  If @code{_opti} is greater than or equal to the
length of the current command-line argument, it is necessary to move on
to the next argument, so @code{Optind} is incremented and @code{_opti} is reset
to zero. Otherwise, @code{Optind} is left alone and @code{_opti} is merely
incremented.

In any case, because the option is invalid, @code{getopt()} returns @code{"?"}.
The main program can examine @code{Optopt} if it needs to know what the
invalid option letter actually is. Continuing on:

@example
@c file eg/lib/getopt.awk
        if (substr(options, i + 1, 1) == ":") @{
            # get option argument
            if (length(substr(argv[Optind], _opti + 1)) > 0)
                Optarg = substr(argv[Optind], _opti + 1)
            else
                Optarg = argv[++Optind]
            _opti = 0
        @} else
            Optarg = ""
@c endfile
@end example

If the option requires an argument, the option letter is followed by a colon
in the @code{options} string.  If there are remaining characters in the
current command-line argument (@code{argv[Optind]}), then the rest of that
string is assigned to @code{Optarg}.  Otherwise, the next command-line
argument is used (@samp{-xFOO} versus @samp{@w{-x FOO}}). In either case,
@code{_opti} is reset to zero, because there are no more characters left to
examine in the current command-line argument. Continuing:

@example
@c file eg/lib/getopt.awk
        if (_opti == 0 || _opti >= length(argv[Optind])) @{
            Optind++
            _opti = 0
        @} else
            _opti++
        return thisopt
@c endfile
@end example

Finally, for a short option, if @code{_opti} is either zero or greater
than or equal to the length of the current command-line argument, it means
this element in @code{argv} is through being processed, so @code{Optind} is
incremented to point to the next element in @code{argv}.  If neither
condition is true, then only @code{_opti} is incremented, so that the
next option letter can be processed on the next call to @code{getopt()}.
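
To make the stepping concrete, here is a small standalone sketch (not part
of @file{getopt.awk}) that mimics how successive calls walk through a
grouped argument one character at a time.  The variable names @code{arg}
and @code{pos} are ours, and the sketch ignores option arguments and
error handling:

@example
# sketch only: pos plays the role of _opti, starting at 2 to skip the "-"
BEGIN @{
    arg = "-abx"
    for (pos = 2; pos <= length(arg); pos++)
        printf("call %d yields <%s>\n", pos - 1, substr(arg, pos, 1))
@}
@end example

In the real function, the loop is replaced by the global variable
@code{_opti}, which survives between calls, and @code{Optind} advances
once @code{_opti} reaches the length of the argument.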

On the other hand, if the earlier test found that this was a long
option, we take a different branch:

@example
@c file eg/lib/getopt.awk
    @} else @{
        j = index(argv[Optind], "=")
        if (j > 0)
            thisopt = substr(argv[Optind], 3, j - 3)
        else
            thisopt = substr(argv[Optind], 3)
        Optopt = thisopt
@c endfile
@end example

First, we search this option for a possible embedded equal sign, as the
specification of long options allows an argument to an option
@samp{--someopt} to be specified as @samp{--someopt=answer} as well as
@samp{@w{--someopt answer}}.

@example
@c file eg/lib/getopt.awk
        i = match(longopts, "(^|,)" thisopt "($|[,:])")
        if (i == 0) @{
            if (Opterr)
                 printf("%s -- invalid option\n", thisopt) > "/dev/stderr"
            Optind++
            return "?"
        @}
@c endfile
@end example

Next, we try to find the current option in @code{longopts}.  The regular
expression given to @code{match()}, @code{@w{"(^|,)" thisopt "($|[,:])"}},
matches this option at the beginning of @code{longopts}, or at the
beginning of a subsequent long option (the previous long option would
have been terminated by a comma), and, in any case, either at the end of
the @code{longopts} string (@samp{$}), or followed by a comma
(separating this option from a subsequent option) or a colon (indicating
that this long option takes an argument), both matched by @samp{@w{[,:]}}.

Using this regular expression, we check to see if the current option
might possibly be in @code{longopts} (if @code{longopts} is not
specified, this test will also fail).  In case of an error, we possibly
print an error message and then return @code{"?"}.
Continuing on:

@example
@c file eg/lib/getopt.awk
        # RLENGTH, set by match(), spans any leading comma and the
        # trailing delimiter, so this picks out the last matched character
        if (substr(longopts, i-1+RLENGTH, 1) == ":") @{
            if (j > 0)
                Optarg = substr(argv[Optind], j + 1)
            else
                Optarg = argv[++Optind]
        @} else
            Optarg = ""
@c endfile
@end example

We now check to see if this option takes an argument and, if so, we set
@code{Optarg} to the value of that argument (either a value after an
equal sign specified on the command line, immediately adjoining the long
option string, or as the next argument on the command line).
The test looks at the last character matched by @code{match()}, located
with @code{RLENGTH}, so that it works both for the first option in
@code{longopts} and for options preceded by a comma.

@example
@c file eg/lib/getopt.awk
        Optind++
        return thisopt
    @}
@}
@c endfile
@end example

We increase @code{Optind} (which we already increased once if a required
argument was supplied as a separate command-line argument rather than
after an equal sign), and return the
long option (minus its leading dashes).

The @code{BEGIN} rule initializes both @code{Opterr} and @code{Optind} to one.
@code{Opterr} is set to one, because the default behavior is for @code{getopt()}
to print a diagnostic message upon seeing an invalid option.  @code{Optind}
is set to one, because there's no reason to look at the program name, which is
in @code{ARGV[0]}:

@example
@c file eg/lib/getopt.awk
BEGIN @{
    Opterr = 1    # default is to diagnose
    Optind = 1    # skip ARGV[0]

    # test program
    if (_getopt_test) @{
        _myshortopts = "ab:cd"
        _mylongopts = "longa,longb:,otherc,otherd"

        while ((_go_c = getopt(ARGC, ARGV, _myshortopts, _mylongopts)) != -1)
            printf("c = <%s>, Optarg = <%s>\n", _go_c, Optarg)
        printf("non-option arguments:\n")
        for (; Optind < ARGC; Optind++)
            printf("\tARGV[%d] = <%s>\n", Optind, ARGV[Optind])
    @}
@}
@c endfile
@end example

The rest of the @code{BEGIN} rule is a simple test program.
Here are the
results of some sample runs of the test program:

@example
$ @kbd{awk -f getopt.awk -v _getopt_test=1 -- -a -cbARG bax -x}
@print{} c = <a>, Optarg = <>
@print{} c = <c>, Optarg = <>
@print{} c = <b>, Optarg = <ARG>
@print{} non-option arguments:
@print{}         ARGV[3] = <bax>
@print{}         ARGV[4] = <-x>

$ @kbd{awk -f getopt.awk -v _getopt_test=1 -- -a -x -- xyz abc}
@print{} c = <a>, Optarg = <>
@error{} x -- invalid option
@print{} c = <?>, Optarg = <>
@print{} non-option arguments:
@print{}         ARGV[4] = <xyz>
@print{}         ARGV[5] = <abc>

$ @kbd{awk -f getopt.awk -v _getopt_test=1 -- -a \}
> @kbd{--longa -b xx --longb=foo=bar --otherd --otherc arg1 arg2}
@print{} c = <a>, Optarg = <>
@print{} c = <longa>, Optarg = <>
@print{} c = <b>, Optarg = <xx>
@print{} c = <longb>, Optarg = <foo=bar>
@print{} c = <otherd>, Optarg = <>
@print{} c = <otherc>, Optarg = <>
@print{} non-option arguments:
@print{}         ARGV[8] = <arg1>
@print{}         ARGV[9] = <arg2>
@end example

In all the runs, the first @option{--} terminates the arguments to
@command{awk}, so that it does not try to interpret the @option{-a},
etc., as its own options.

@quotation NOTE
After @code{getopt()} is through,
user-level code must clear out all the elements of @code{ARGV} from 1
up to, but not including, @code{Optind}, so that @command{awk} does not
try to process the command-line options as @value{FN}s.
@end quotation

Using @samp{#!} with the @option{-E} option may help avoid
conflicts between your program's options and @command{gawk}'s options,
as @option{-E} causes @command{gawk} to abandon processing of
further options
(@pxref{Executable Scripts} and
@ifnotdocbook
@pxref{Options}).
@end ifnotdocbook
@ifdocbook
@ref{Options}).
@end ifdocbook

Several of the sample programs presented in
@ref{Sample Programs},
use @code{getopt()} to process their arguments.

@node Passwd Functions
@section Reading the User Database

@cindex libraries of @command{awk} functions @subentry user database, reading
@cindex functions @subentry library @subentry user database, reading
@cindex user database, reading
@cindex database @subentry users, reading
@cindex @code{PROCINFO} array
The @code{PROCINFO} array
(@pxref{Built-in Variables})
provides access to the current user's real and effective user and group ID
numbers, and, if available, the user's supplementary group set.
However, because these are numbers, they do not provide very useful
information to the average user.  There needs to be some way to find the
user information associated with the user and group ID numbers.  This
@value{SECTION} presents a suite of functions for retrieving information from the
user database.  @xref{Group Functions}
for a similar suite that retrieves information from the group database.

@cindex @code{getpwent()} function (C library)
@cindex C library functions @subentry @code{getpwent()}
@cindex @code{getpwent()} user-defined function
@cindex user-defined @subentry function @subentry @code{getpwent()}
@cindex users, information about @subentry retrieving
@cindex login information
@cindex account information
@cindex password file
@cindex files @subentry password
The POSIX standard does not define the file where user information is
kept.  Instead, it provides the @code{<pwd.h>} header file
and several C language subroutines for obtaining user information.
The primary function is @code{getpwent()}, for ``get password entry.''
The ``password'' comes from the original user database file,
@file{/etc/passwd}, which stores user information along with the
encrypted passwords (hence the name).

@cindex @command{pwcat} program
Although an @command{awk} program could simply read @file{/etc/passwd}
directly, this file may not contain complete information about the
system's set of users.@footnote{It is often the case that password
information is stored in a network database.} To be sure you are able to
produce a readable and complete version of the user database, it is necessary
to write a small C program that calls @code{getpwent()}.  @code{getpwent()}
is defined as returning a pointer to a @code{struct passwd}.  Each time it
is called, it returns the next entry in the database.  When there are
no more entries, it returns @code{NULL}, the null pointer.  When this
happens, the C program should call @code{endpwent()} to close the database.
Following is @command{pwcat}, a C program that ``cats'' the password database:

@example
@c file eg/lib/pwcat.c
/*
 * pwcat.c
 *
 * Generate a printable version of the password database.
 */
@c endfile
@ignore
@c file eg/lib/pwcat.c
/*
 * Arnold Robbins, arnold@@skeeve.com, May 1993
 * Public Domain
 * December 2010, move to ANSI C definition for main().
 */

#if HAVE_CONFIG_H
#include <config.h>
#endif

@c endfile
@end ignore
@c file eg/lib/pwcat.c
#include <stdio.h>
#include <pwd.h>

@c endfile
@ignore
@c file eg/lib/pwcat.c
#if defined (STDC_HEADERS)
#include <stdlib.h>
#endif

@c endfile
@end ignore
@c file eg/lib/pwcat.c
int
main(int argc, char **argv)
@{
    struct passwd *p;

    while ((p = getpwent()) != NULL)
@c endfile
@ignore
@c file eg/lib/pwcat.c
#ifdef HAVE_STRUCT_PASSWD_PW_PASSWD
@c endfile
@end ignore
@c file eg/lib/pwcat.c
        printf("%s:%s:%ld:%ld:%s:%s:%s\n",
            p->pw_name, p->pw_passwd, (long) p->pw_uid,
            (long) p->pw_gid, p->pw_gecos, p->pw_dir, p->pw_shell);
@c endfile
@ignore
@c file eg/lib/pwcat.c
#else
        printf("%s:*:%ld:%ld:%s:%s\n",
            p->pw_name, (long) p->pw_uid,
            (long) p->pw_gid, p->pw_dir, p->pw_shell);
#endif
@c endfile
@end ignore
@c file eg/lib/pwcat.c

    endpwent();
    return 0;
@}
@c endfile
@end example

If you don't understand C, don't worry about it.
The output from @command{pwcat} is the user database, in the traditional
@file{/etc/passwd} format of colon-separated fields.  The fields are:

@table @asis
@item Login name
The user's login name.

@item Encrypted password
The user's encrypted password.  This may not be available on some systems.

@item User-ID
The user's numeric user ID number.
(On some systems, it's a C @code{long}, and not an @code{int}.  Thus,
we cast it to @code{long} for all cases.)

@item Group-ID
The user's numeric group ID number.
(Similar comments about @code{long} versus @code{int} apply here.)

@item Full name
The user's full name, and perhaps other information associated with the
user.

@item Home directory
The user's login (or ``home'') directory (familiar to shell programmers as
@code{$HOME}).

@item Login shell
The program that is run when the user logs in.  This is usually a
shell, such as Bash.
@end table

A few lines representative of @command{pwcat}'s output are as follows:

@cindex Jacobs, Andrew
@cindex Robbins @subentry Arnold
@cindex Robbins @subentry Miriam
@example
$ @kbd{pwcat}
@print{} root:x:0:1:Operator:/:/bin/sh
@print{} nobody:*:65534:65534::/:
@print{} daemon:*:1:1::/:
@print{} sys:*:2:2::/:/bin/csh
@print{} bin:*:3:3::/bin:
@print{} arnold:xyzzy:2076:10:Arnold Robbins:/home/arnold:/bin/sh
@print{} miriam:yxaay:112:10:Miriam Robbins:/home/miriam:/bin/sh
@print{} andy:abcca2:113:10:Andy Jacobs:/home/andy:/bin/sh
@dots{}
@end example

With that introduction, following is a group of functions for getting user
information.
There are several functions here, corresponding to the C
functions of the same names:

@cindex @code{_pw_init()} user-defined function
@cindex user-defined @subentry function @subentry @code{_pw_init()}
@example
@c file eg/lib/passwdawk.in
# passwd.awk --- access password file information
@c endfile
@ignore
@c file eg/lib/passwdawk.in
#
# Arnold Robbins, arnold@@skeeve.com, Public Domain
# May 1993
# Revised October 2000
# Revised December 2010
@c endfile
@end ignore
@c file eg/lib/passwdawk.in

BEGIN @{
    # tailor this to suit your system
    _pw_awklib = "/usr/local/libexec/awk/"
@}

function _pw_init(    oldfs, oldrs, olddol0, pwcat, using_fw, using_fpat)
@{
    if (_pw_inited)
        return

    oldfs = FS
    oldrs = RS
    olddol0 = $0
    using_fw = (PROCINFO["FS"] == "FIELDWIDTHS")
    using_fpat = (PROCINFO["FS"] == "FPAT")
    FS = ":"
    RS = "\n"

    pwcat = _pw_awklib "pwcat"
    while ((pwcat | getline) > 0) @{
        _pw_byname[$1] = $0
        _pw_byuid[$3] = $0
        _pw_bycount[++_pw_total] = $0
    @}
    close(pwcat)
    _pw_count = 0
    _pw_inited = 1
    FS = oldfs
    if (using_fw)
        FIELDWIDTHS = FIELDWIDTHS
    else if (using_fpat)
        FPAT = FPAT
    RS = oldrs
    $0 = olddol0
@}
@c endfile
@end example

@cindex @code{BEGIN} pattern @subentry @code{pwcat} program
The @code{BEGIN} rule sets a private variable to the directory where
@command{pwcat} is stored.  Because it is used to help out an @command{awk} library
routine, we have chosen to put it in @file{/usr/local/libexec/awk};
however, you might want it to be in a different directory on your system.

The function @code{_pw_init()} fills three copies of the user information
into three associative arrays.
The arrays are indexed by username
(@code{_pw_byname}), by user ID number (@code{_pw_byuid}), and by order of
occurrence (@code{_pw_bycount}).
The variable @code{_pw_inited} is used for efficiency, as @code{_pw_init()}
needs to be called only once.

@cindex @code{PROCINFO} array @subentry testing the field splitting
@cindex @code{getline} command @subentry @code{_pw_init()} function
Because this function uses @code{getline} to read information from
@command{pwcat}, it first saves the values of @code{FS}, @code{RS}, and @code{$0}.
It notes in the variable @code{using_fw} whether field splitting
with @code{FIELDWIDTHS} is in effect or not.
Doing so is necessary, as these functions could be called
from anywhere within a user's program, and the user may have his
or her own way of splitting records and fields.
This makes it possible to restore the correct
field-splitting mechanism later.  The test can only be true for
@command{gawk}.  It is false if using @code{FS} or @code{FPAT},
or on some other @command{awk} implementation.

The code that checks for using @code{FPAT}, using @code{using_fpat}
and @code{PROCINFO["FS"]}, is similar.

The main part of the function uses a loop to read database lines, split
the lines into fields, and then store the lines into each array as necessary.
When the loop is done, @code{@w{_pw_init()}} cleans up by closing the pipeline,
setting @code{@w{_pw_inited}} to one, and restoring @code{FS}
(and @code{FIELDWIDTHS} or @code{FPAT}
if necessary), @code{RS}, and @code{$0}.
The use of @code{@w{_pw_count}} is explained shortly.

@cindex @code{getpwnam()} function (C library)
@cindex C library functions @subentry @code{getpwnam()}
The @code{getpwnam()} function takes a username as a string argument. If that
user is in the database, it returns the appropriate line.
Otherwise, it
relies on the array reference to a nonexistent
element to create the element with the null string as its value:

@cindex @code{getpwnam()} user-defined function
@cindex user-defined @subentry function @subentry @code{getpwnam()}
@example
@group
@c file eg/lib/passwdawk.in
function getpwnam(name)
@{
    _pw_init()
    return _pw_byname[name]
@}
@c endfile
@end group
@end example

@cindex @code{getpwuid()} function (C library)
@cindex C library functions @subentry @code{getpwuid()}
Similarly, the @code{getpwuid()} function takes a user ID number
argument. If that user number is in the database, it returns the
appropriate line. Otherwise, it returns the null string:

@cindex @code{getpwuid()} user-defined function
@cindex user-defined @subentry function @subentry @code{getpwuid()}
@example
@c file eg/lib/passwdawk.in
function getpwuid(uid)
@{
    _pw_init()
    return _pw_byuid[uid]
@}
@c endfile
@end example

@cindex @code{getpwent()} function (C library)
@cindex C library functions @subentry @code{getpwent()}
The @code{getpwent()} function simply steps through the database, one entry at
a time.
It uses @code{_pw_count} to track its current position in the
@code{_pw_bycount} array:

@cindex @code{getpwent()} user-defined function
@cindex user-defined @subentry function @subentry @code{getpwent()}
@example
@c file eg/lib/passwdawk.in
function getpwent()
@{
    _pw_init()
    if (_pw_count < _pw_total)
        return _pw_bycount[++_pw_count]
    return ""
@}
@c endfile
@end example

@cindex @code{endpwent()} function (C library)
@cindex C library functions @subentry @code{endpwent()}
The @code{@w{endpwent()}} function resets @code{@w{_pw_count}} to zero, so that
subsequent calls to @code{getpwent()} start over again:

@cindex @code{endpwent()} user-defined function
@cindex user-defined @subentry function @subentry @code{endpwent()}
@example
@c file eg/lib/passwdawk.in
function endpwent()
@{
    _pw_count = 0
@}
@c endfile
@end example

A conscious design decision in this suite is that each subroutine calls
@code{@w{_pw_init()}} to initialize the database arrays.
The overhead of running
a separate process to generate the user database, and the I/O to scan it,
are only incurred if the user's main program actually calls one of these
functions.  If this library file is loaded along with a user's program, but
none of the routines are ever called, then there is no extra runtime overhead.
(The alternative is to move the body of @code{@w{_pw_init()}} into a
@code{BEGIN} rule, which always runs @command{pwcat}.  This simplifies the
code but runs an extra process that may never be needed.)

In turn, calling @code{_pw_init()} is not too expensive, because the
@code{_pw_inited} variable keeps the program from reading the data more than
once.
If you are worried about squeezing every last cycle out of your
@command{awk} program, the check of @code{_pw_inited} could be moved out of
@code{_pw_init()} and duplicated in all the other functions.  In practice,
this is not necessary, as most @command{awk} programs are I/O-bound,
and such a change would clutter up the code.

The @command{id} program in @ref{Id Program}
uses these functions.

@node Group Functions
@section Reading the Group Database

@cindex libraries of @command{awk} functions @subentry group database, reading
@cindex functions @subentry library @subentry group database, reading
@cindex group database, reading
@cindex database @subentry group, reading
@cindex @code{PROCINFO} array @subentry group membership and
@cindex @code{getgrent()} function (C library)
@cindex C library functions @subentry @code{getgrent()}
@cindex @code{getgrent()} user-defined function
@cindex user-defined @subentry function @subentry @code{getgrent()}
@cindex groups, information about
@cindex account information
@cindex group file
@cindex files @subentry group
Much of the discussion presented in
@ref{Passwd Functions}
applies to the group database as well.  Although there has traditionally
been a well-known file (@file{/etc/group}) in a well-known format, the POSIX
standard only provides a set of C library routines
(@code{<grp.h>} and @code{getgrent()})
for accessing the information.
Even though this file may exist, it may not have
complete information.  Therefore, as with the user database, it is necessary
to have a small C program that generates the group database as its output.
23592@command{grcat}, a C program that ``cats'' the group database, 23593is as follows: 23594 23595@cindex @command{grcat} program 23596@example 23597@c file eg/lib/grcat.c 23598/* 23599 * grcat.c 23600 * 23601 * Generate a printable version of the group database. 23602 */ 23603@c endfile 23604@ignore 23605@c file eg/lib/grcat.c 23606/* 23607 * Arnold Robbins, arnold@@skeeve.com, May 1993 23608 * Public Domain 23609 * December 2010, move to ANSI C definition for main(). 23610 */ 23611 23612#if HAVE_CONFIG_H 23613#include <config.h> 23614#endif 23615 23616#if defined (STDC_HEADERS) 23617#include <stdlib.h> 23618#endif 23619 23620#ifndef HAVE_GETGRENT 23621int main() { return 0; } 23622#else 23623@c endfile 23624@end ignore 23625@c file eg/lib/grcat.c 23626#include <stdio.h> 23627#include <grp.h> 23628 23629int 23630main(int argc, char **argv) 23631@{ 23632 struct group *g; 23633 int i; 23634 23635 while ((g = getgrent()) != NULL) @{ 23636@c endfile 23637@ignore 23638@c file eg/lib/grcat.c 23639#ifdef HAVE_STRUCT_GROUP_GR_PASSWD 23640@c endfile 23641@end ignore 23642@c file eg/lib/grcat.c 23643 printf("%s:%s:%ld:", g->gr_name, g->gr_passwd, 23644 (long) g->gr_gid); 23645@c endfile 23646@ignore 23647@c file eg/lib/grcat.c 23648#else 23649 printf("%s:*:%ld:", g->gr_name, (long) g->gr_gid); 23650#endif 23651@c endfile 23652@end ignore 23653@c file eg/lib/grcat.c 23654 for (i = 0; g->gr_mem[i] != NULL; i++) @{ 23655 printf("%s", g->gr_mem[i]); 23656@group 23657 if (g->gr_mem[i+1] != NULL) 23658 putchar(','); 23659 @} 23660@end group 23661 putchar('\n'); 23662 @} 23663 endgrent(); 23664 return 0; 23665@} 23666@c endfile 23667@ignore 23668@c file eg/lib/grcat.c 23669#endif /* HAVE_GETGRENT */ 23670@c endfile 23671@end ignore 23672@end example 23673 23674Each line in the group database represents one group. The fields are 23675separated with colons and represent the following information: 23676 23677@table @asis 23678@item Group Name 23679The group's name. 
23680 23681@item Group Password 23682The group's encrypted password. In practice, this field is never used; 23683it is usually empty or set to @samp{*}. 23684 23685@item Group ID Number 23686The group's numeric group ID number; 23687the association of name to number must be unique within the file. 23688(On some systems it's a C @code{long}, and not an @code{int}. Thus, 23689we cast it to @code{long} for all cases.) 23690 23691@item Group Member List 23692A comma-separated list of usernames. These users are members of the group. 23693Modern Unix systems allow users to be members of several groups 23694simultaneously. If your system does, then there are elements 23695@code{"group1"} through @code{"group@var{N}"} in @code{PROCINFO} 23696for those group ID numbers. 23697(Note that @code{PROCINFO} is a @command{gawk} extension; 23698@pxref{Built-in Variables}.) 23699@end table 23700 23701Here is what running @command{grcat} might produce: 23702 23703@example 23704$ @kbd{grcat} 23705@print{} wheel:*:0:arnold 23706@print{} nogroup:*:65534: 23707@print{} daemon:*:1: 23708@print{} kmem:*:2: 23709@print{} staff:*:10:arnold,miriam,andy 23710@print{} other:*:20: 23711@dots{} 23712@end example 23713 23714Here are the functions for obtaining information from the group database. 
23715There are several, modeled after the C library functions of the same names: 23716 23717@cindex @code{getline} command @subentry @code{_gr_init()} user-defined function 23718@cindex @code{_gr_init()} user-defined function 23719@cindex user-defined @subentry function @subentry @code{_gr_init()} 23720@example 23721@c file eg/lib/groupawk.in 23722# group.awk --- functions for dealing with the group file 23723@c endfile 23724@ignore 23725@c file eg/lib/groupawk.in 23726# 23727# Arnold Robbins, arnold@@skeeve.com, Public Domain 23728# May 1993 23729# Revised October 2000 23730# Revised December 2010 23731@c endfile 23732@end ignore 23733@c line break on _gr_init for smallbook 23734@c file eg/lib/groupawk.in 23735 23736BEGIN @{ 23737 # Change to suit your system 23738 _gr_awklib = "/usr/local/libexec/awk/" 23739@} 23740 23741function _gr_init( oldfs, oldrs, olddol0, grcat, 23742 using_fw, using_fpat, n, a, i) 23743@{ 23744 if (_gr_inited) 23745 return 23746 23747 oldfs = FS 23748 oldrs = RS 23749 olddol0 = $0 23750 using_fw = (PROCINFO["FS"] == "FIELDWIDTHS") 23751 using_fpat = (PROCINFO["FS"] == "FPAT") 23752 FS = ":" 23753 RS = "\n" 23754 23755 grcat = _gr_awklib "grcat" 23756 while ((grcat | getline) > 0) @{ 23757 if ($1 in _gr_byname) 23758 _gr_byname[$1] = _gr_byname[$1] "," $4 23759 else 23760 _gr_byname[$1] = $0 23761 if ($3 in _gr_bygid) 23762 _gr_bygid[$3] = _gr_bygid[$3] "," $4 23763 else 23764 _gr_bygid[$3] = $0 23765 23766 n = split($4, a, "[ \t]*,[ \t]*") 23767 for (i = 1; i <= n; i++) 23768 if (a[i] in _gr_groupsbyuser) 23769 _gr_groupsbyuser[a[i]] = _gr_groupsbyuser[a[i]] " " $1 23770 else 23771 _gr_groupsbyuser[a[i]] = $1 23772 23773 _gr_bycount[++_gr_count] = $0 23774 @} 23775 close(grcat) 23776 _gr_count = 0 23777 _gr_inited++ 23778 FS = oldfs 23779 if (using_fw) 23780 FIELDWIDTHS = FIELDWIDTHS 23781 else if (using_fpat) 23782 FPAT = FPAT 23783 RS = oldrs 23784 $0 = olddol0 23785@} 23786@c endfile 23787@end example 23788 23789The @code{BEGIN} rule 
sets a private variable to the directory where 23790@command{grcat} is stored. Because it is used to help out an @command{awk} library 23791routine, we have chosen to put it in @file{/usr/local/libexec/awk}. You might 23792want it to be in a different directory on your system. 23793 23794These routines follow the same general outline as the user database routines 23795(@pxref{Passwd Functions}). 23796The @code{@w{_gr_inited}} variable is used to 23797ensure that the database is scanned no more than once. 23798The @code{@w{_gr_init()}} function first saves @code{FS}, 23799@code{RS}, and 23800@code{$0}, and then sets @code{FS} and @code{RS} to the correct values for 23801scanning the group information. 23802It also takes care to note whether @code{FIELDWIDTHS} or @code{FPAT} 23803is being used, and to restore the appropriate field-splitting mechanism. 23804 23805The group information is stored in several associative arrays. 23806The arrays are indexed by group name (@code{@w{_gr_byname}}), by group ID number 23807(@code{@w{_gr_bygid}}), and by position in the database (@code{@w{_gr_bycount}}). 23808There is an additional array indexed by username (@code{@w{_gr_groupsbyuser}}), 23809which is a space-separated list of groups to which each user belongs. 23810 23811Unlike in the user database, it is possible to have multiple records in the 23812database for the same group. This is common when a group has a large number 23813of members. A pair of such entries might look like the following: 23814 23815@example 23816tvpeople:*:101:johnny,jay,arsenio 23817tvpeople:*:101:david,conan,tom,joan 23818@end example 23819 23820For this reason, @code{_gr_init()} looks to see if a group name or 23821group ID number is already seen. If so, the usernames are 23822simply concatenated onto the previous list of users.@footnote{There is a 23823subtle problem with the code just presented. Suppose that 23824the first time there were no names. 
This code adds the names with 23825a leading comma. It also doesn't check that there is a @code{$4}.} 23826 23827Finally, @code{_gr_init()} closes the pipeline to @command{grcat}, restores 23828@code{FS} (and @code{FIELDWIDTHS} or @code{FPAT}, if necessary), @code{RS}, and @code{$0}, 23829initializes @code{_gr_count} to zero 23830(it is used later), and makes @code{_gr_inited} nonzero. 23831 23832@cindex @code{getgrnam()} function (C library) 23833@cindex C library functions @subentry @code{getgrnam()} 23834The @code{getgrnam()} function takes a group name as its argument, and if that 23835group exists, it is returned. 23836Otherwise, it 23837relies on the array reference to a nonexistent 23838element to create the element with the null string as its value: 23839 23840@cindex @code{getgrnam()} user-defined function 23841@cindex user-defined @subentry function @subentry @code{getgrnam()} 23842@example 23843@c file eg/lib/groupawk.in 23844function getgrnam(group) 23845@{ 23846 _gr_init() 23847 return _gr_byname[group] 23848@} 23849@c endfile 23850@end example 23851 23852@cindex @code{getgrgid()} function (C library) 23853@cindex C library functions @subentry @code{getgrgid()} 23854The @code{getgrgid()} function is similar; it takes a numeric group ID and 23855looks up the information associated with that group ID: 23856 23857@cindex @code{getgrgid()} user-defined function 23858@cindex user-defined @subentry function @subentry @code{getgrgid()} 23859@example 23860@c file eg/lib/groupawk.in 23861function getgrgid(gid) 23862@{ 23863 _gr_init() 23864 return _gr_bygid[gid] 23865@} 23866@c endfile 23867@end example 23868 23869@cindex @code{getgruser()} function (C library) 23870@cindex C library functions @subentry @code{getgruser()} 23871The @code{getgruser()} function does not have a C counterpart. 
It takes a 23872username and returns the list of groups that have the user as a member: 23873 23874@cindex @code{getgruser()} user-defined function 23875@cindex user-defined @subentry function @subentry @code{getgruser()} 23876@example 23877@c file eg/lib/groupawk.in 23878function getgruser(user) 23879@{ 23880 _gr_init() 23881 return _gr_groupsbyuser[user] 23882@} 23883@c endfile 23884@end example 23885 23886@cindex @code{getgrent()} function (C library) 23887@cindex C library functions @subentry @code{getgrent()} 23888The @code{getgrent()} function steps through the database one entry at a time. 23889It uses @code{_gr_count} to track its position in the list: 23890 23891@cindex @code{getgrent()} user-defined function 23892@cindex user-defined @subentry function @subentry @code{getgrent()} 23893@example 23894@c file eg/lib/groupawk.in 23895function getgrent() 23896@{ 23897 _gr_init() 23898 if (++_gr_count in _gr_bycount) 23899 return _gr_bycount[_gr_count] 23900@group 23901 return "" 23902@} 23903@end group 23904@c endfile 23905@end example 23906 23907@cindex @code{endgrent()} function (C library) 23908@cindex C library functions @subentry @code{endgrent()} 23909The @code{endgrent()} function resets @code{_gr_count} to zero so that @code{getgrent()} can 23910start over again: 23911 23912@cindex @code{endgrent()} user-defined function 23913@cindex user-defined @subentry function @subentry @code{endgrent()} 23914@example 23915@c file eg/lib/groupawk.in 23916function endgrent() 23917@{ 23918 _gr_count = 0 23919@} 23920@c endfile 23921@end example 23922 23923As with the user database routines, each function calls @code{_gr_init()} to 23924initialize the arrays. Doing so only incurs the extra overhead of running 23925@command{grcat} if these functions are used (as opposed to moving the body of 23926@code{_gr_init()} into a @code{BEGIN} rule). 23927 23928Most of the work is in scanning the database and building the various 23929associative arrays. 
The functions that the user calls are themselves very 23930simple, relying on @command{awk}'s associative arrays to do work. 23931 23932The @command{id} program in @ref{Id Program} 23933uses these functions. 23934 23935@node Walking Arrays 23936@section Traversing Arrays of Arrays 23937 23938@ref{Arrays of Arrays} described how @command{gawk} 23939provides arrays of arrays. In particular, any element of 23940an array may be either a scalar or another array. The 23941@code{isarray()} function (@pxref{Type Functions}) 23942lets you distinguish an array 23943from a scalar. 23944The following function, @code{walk_array()}, recursively traverses 23945an array, printing the element indices and values. 23946You call it with the array and a string representing the name 23947of the array: 23948 23949@cindex @code{walk_array()} user-defined function 23950@cindex user-defined @subentry function @subentry @code{walk_array()} 23951@example 23952@c file eg/lib/walkarray.awk 23953function walk_array(arr, name, i) 23954@{ 23955 for (i in arr) @{ 23956 if (isarray(arr[i])) 23957 walk_array(arr[i], (name "[" i "]")) 23958 else 23959 printf("%s[%s] = %s\n", name, i, arr[i]) 23960 @} 23961@} 23962@c endfile 23963@end example 23964 23965@noindent 23966It works by looping over each element of the array. If any given 23967element is itself an array, the function calls itself recursively, 23968passing the subarray and a new string representing the current index. 23969Otherwise, the function simply prints the element's name, index, and value. 
Here is a main program to demonstrate:

@example
BEGIN @{
    a[1] = 1
    a[2][1] = 21
    a[2][2] = 22
    a[3] = 3
    a[4][1][1] = 411
    a[4][2] = 42

    walk_array(a, "a")
@}
@end example

When run, the program produces the following output:

@example
$ @kbd{gawk -f walk_array.awk}
@print{} a[1] = 1
@print{} a[2][1] = 21
@print{} a[2][2] = 22
@print{} a[3] = 3
@print{} a[4][1][1] = 411
@print{} a[4][2] = 42
@end example

The function just presented simply prints the
name and value of each scalar array element. However, it is easy to
generalize it, by passing in the name of a function to call
when walking an array. The modified function looks like this:

@example
@c file eg/lib/processarray.awk
function process_array(arr, name, process, do_arrays,   i, new_name)
@{
    for (i in arr) @{
        new_name = (name "[" i "]")
        if (isarray(arr[i])) @{
            if (do_arrays)
                @@process(new_name, arr[i])
            process_array(arr[i], new_name, process, do_arrays)
        @} else
            @@process(new_name, arr[i])
    @}
@}
@c endfile
@end example

The arguments are as follows:

@table @code
@item arr
The array.

@item name
The name of the array (a string).

@item process
The name of the function to call.

@item do_arrays
If this is true, @code{process} is also called for each element that is
itself a subarray.
@end table

When @code{do_arrays} is true, @code{process} is called on a subarray
before the function descends into it.
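
As a sketch of what a true @code{do_arrays} changes, the following
self-contained program repeats @code{process_array()} and adds a callback
that reports subarrays and scalars differently. The @code{show()}
function, the sample data, and the @file{do_arrays.awk} @value{FN} are
made up for this illustration. Setting @code{PROCINFO["sorted_in"]}
is likewise an assumption, made only so that the traversal order is predictable:

@example
function process_array(arr, name, process, do_arrays,   i, new_name)
@{
    for (i in arr) @{
        new_name = (name "[" i "]")
        if (isarray(arr[i])) @{
            if (do_arrays)
                @@process(new_name, arr[i])
            process_array(arr[i], new_name, process, do_arrays)
        @} else
            @@process(new_name, arr[i])
    @}
@}

# show: hypothetical callback; it receives either a scalar or a subarray
function show(name, element)
@{
    if (isarray(element))
        printf "%s is a subarray\n", name
    else
        printf "%s = %s\n", name, element
@}

BEGIN @{
    # Assumption: visit indices in ascending numeric order
    PROCINFO["sorted_in"] = "@@ind_num_asc"

    a[1] = 1
    a[2][1] = 21
    a[2][2] = 22

    process_array(a, "a", "show", 1)
@}
@end example

With these settings, the subarray is reported once before its elements:

@example
$ @kbd{gawk -f do_arrays.awk}
@print{} a[1] = 1
@print{} a[2] is a subarray
@print{} a[2][1] = 21
@print{} a[2][2] = 22
@end example

Passing zero as the last argument instead would suppress the
@samp{is a subarray} line, because the callback would then see only scalars.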
24036 24037When run with the following scaffolding, the function produces the same 24038results as does the earlier version of @code{walk_array()}: 24039 24040@example 24041BEGIN @{ 24042 a[1] = 1 24043 a[2][1] = 21 24044 a[2][2] = 22 24045 a[3] = 3 24046 a[4][1][1] = 411 24047 a[4][2] = 42 24048 24049 process_array(a, "a", "do_print", 0) 24050@} 24051 24052function do_print(name, element) 24053@{ 24054 printf "%s = %s\n", name, element 24055@} 24056@end example 24057 24058@node Library Functions Summary 24059@section Summary 24060 24061@itemize @value{BULLET} 24062@item 24063Reading programs is an excellent way to learn Good Programming. 24064The functions and programs provided in this @value{CHAPTER} and the next 24065are intended to serve that purpose. 24066 24067@item 24068When writing general-purpose library functions, put some thought into how 24069to name any global variables so that they won't conflict with variables 24070from a user's program. 24071 24072@item 24073The functions presented here fit into the following categories: 24074 24075@c nested list 24076@table @asis 24077@item General problems 24078Number-to-string conversion, testing assertions, rounding, random number 24079generation, converting characters to numbers, joining strings, getting 24080easily usable time-of-day information, and reading a whole file in 24081one shot 24082 24083@item Managing @value{DF}s 24084Noting @value{DF} boundaries, rereading the current file, checking for 24085readable files, checking for zero-length files, and treating assignments 24086as @value{FN}s 24087 24088@item Processing command-line options 24089An @command{awk} version of the standard C @code{getopt()} function 24090 24091@item Reading the user and group databases 24092Two sets of routines that parallel the C library versions 24093 24094@item Traversing arrays of arrays 24095Two functions that traverse an array of arrays to any depth 24096@end table 24097@c end nested list 24098 24099@end itemize 24100 
24101@c EXCLUDE START 24102@node Library Exercises 24103@section Exercises 24104 24105@enumerate 24106@item 24107In @ref{Empty Files}, we presented the @file{zerofile.awk} program, 24108which made use of @command{gawk}'s @code{ARGIND} variable. Can this 24109problem be solved without relying on @code{ARGIND}? If so, how? 24110 24111@ignore 24112# zerofile2.awk --- same thing, portably 24113 24114BEGIN @{ 24115 ARGIND = Argind = 0 24116 for (i = 1; i < ARGC; i++) 24117 Fnames[ARGV[i]]++ 24118 24119@} 24120FNR == 1 @{ 24121 while (ARGV[ARGIND] != FILENAME) 24122 ARGIND++ 24123 Seen[FILENAME]++ 24124 if (Seen[FILENAME] == Fnames[FILENAME]) 24125 do 24126 ARGIND++ 24127 while (ARGV[ARGIND] != FILENAME) 24128@} 24129ARGIND > Argind + 1 @{ 24130 for (Argind++; Argind < ARGIND; Argind++) 24131 zerofile(ARGV[Argind], Argind) 24132@} 24133ARGIND != Argind @{ 24134 Argind = ARGIND 24135@} 24136END @{ 24137 if (ARGIND < ARGC - 1) 24138 ARGIND = ARGC - 1 24139 if (ARGIND > Argind) 24140 for (Argind++; Argind <= ARGIND; Argind++) 24141 zerofile(ARGV[Argind], Argind) 24142@} 24143@end ignore 24144 24145@item 24146As a related challenge, revise that code to handle the case where 24147an intervening value in @code{ARGV} is a variable assignment. 24148 24149@ignore 24150@c June 13 2015: Antonio points out that this is answered in the text. Ooops. 24151@item 24152@ref{Walking Arrays} presented a function that walked a multidimensional 24153array to print it out. However, walking an array and processing 24154each element is a general-purpose operation. Generalize the 24155@code{walk_array()} function by adding an additional parameter named 24156@code{process}. 24157 24158Then, inside the loop, instead of printing the array element's index and 24159value, use the indirect function call syntax (@pxref{Indirect Calls}) 24160on @code{process}, passing it the index and the value. 
24161 24162When calling @code{walk_array()}, you would pass the name of a 24163user-defined function that expects to receive an index and a value, 24164and then processes the element. 24165 24166Test your new version by printing the array; you should end up with 24167output identical to that of the original version. 24168@end ignore 24169 24170@end enumerate 24171@c EXCLUDE END 24172 24173 24174@node Sample Programs 24175@chapter Practical @command{awk} Programs 24176@cindex @command{awk} programs @subentry examples of 24177 24178@c FULLXREF ON 24179@ref{Library Functions}, 24180presents the idea that reading programs in a language contributes to 24181learning that language. This @value{CHAPTER} continues that theme, 24182presenting a potpourri of @command{awk} programs for your reading 24183enjoyment. 24184@c FULLXREF OFF 24185@ifnotinfo 24186There are three @value{SECTION}s. 24187The first describes how to run the programs presented 24188in this @value{CHAPTER}. 24189 24190The second presents @command{awk} 24191versions of several common POSIX utilities. 24192These are programs that you are hopefully already familiar with, 24193and therefore whose problems are understood. 24194By reimplementing these programs in @command{awk}, 24195you can focus on the @command{awk}-related aspects of solving 24196the programming problems. 24197 24198The third is a grab bag of interesting programs. 24199These solve a number of different data-manipulation and management 24200problems. Many of the programs are short, which emphasizes @command{awk}'s 24201ability to do a lot in just a few lines of code. 24202@end ifnotinfo 24203 24204Many of these programs use library functions presented in 24205@ref{Library Functions}. 24206 24207@menu 24208* Running Examples:: How to run these examples. 24209* Clones:: Clones of common utilities. 24210* Miscellaneous Programs:: Some interesting @command{awk} programs. 24211* Programs Summary:: Summary of programs. 
24212* Programs Exercises:: Exercises. 24213@end menu 24214 24215@node Running Examples 24216@section Running the Example Programs 24217 24218To run a given program, you would typically do something like this: 24219 24220@example 24221awk -f @var{program} -- @var{options} @var{files} 24222@end example 24223 24224@noindent 24225Here, @var{program} is the name of the @command{awk} program (such as 24226@file{cut.awk}), @var{options} are any command-line options for the 24227program that start with a @samp{-}, and @var{files} are the actual @value{DF}s. 24228 24229If your system supports the @samp{#!} executable interpreter mechanism 24230(@pxref{Executable Scripts}), 24231you can instead run your program directly: 24232 24233@example 24234cut.awk -c1-8 myfiles > results 24235@end example 24236 24237If your @command{awk} is not @command{gawk}, you may instead need to use this: 24238 24239@example 24240cut.awk -- -c1-8 myfiles > results 24241@end example 24242 24243@node Clones 24244@section Reinventing Wheels for Fun and Profit 24245@cindex POSIX @subentry programs, implementing in @command{awk} 24246 24247This @value{SECTION} presents a number of POSIX utilities implemented in 24248@command{awk}. Reinventing these programs in @command{awk} is often enjoyable, 24249because the algorithms can be very clearly expressed, and the code is usually 24250very concise and simple. This is true because @command{awk} does so much for you. 24251 24252It should be noted that these programs are not necessarily intended to 24253replace the installed versions on your system. 24254Nor may all of these programs be fully compliant with the most recent 24255POSIX standard. This is not a problem; their 24256purpose is to illustrate @command{awk} language programming for ``real-world'' 24257tasks. 24258 24259The programs are presented in alphabetical order. 24260 24261@menu 24262* Cut Program:: The @command{cut} utility. 24263* Egrep Program:: The @command{egrep} utility. 
* Id Program::                  The @command{id} utility.
* Split Program::               The @command{split} utility.
* Tee Program::                 The @command{tee} utility.
* Uniq Program::                The @command{uniq} utility.
* Wc Program::                  The @command{wc} utility.
@end menu

@node Cut Program
@subsection Cutting Out Fields and Columns

@cindex @command{cut} utility
@cindex fields @subentry cutting
@cindex columns @subentry cutting
The @command{cut} utility selects, or ``cuts,'' characters or fields
from its standard input and sends them to its standard output.
Fields are separated by TABs by default,
but you may supply a command-line option to change the field
@dfn{delimiter} (i.e., the field-separator character). @command{cut}'s
definition of fields is less general than @command{awk}'s.

A common use of @command{cut} might be to pull out just the login names of
logged-on users from the output of @command{who}. For example, the following
pipeline generates a sorted, unique list of the logged-on users:

@example
who | cut -c1-8 | sort | uniq
@end example

The options for @command{cut} are:

@table @code
@item -c @var{list}
Use @var{list} as the list of characters to cut out. Items within the list
may be separated by commas, and ranges of characters can be separated with
dashes. The list @samp{1-8,15,22-35} specifies characters 1 through
8, 15, and 22 through 35.

@item -d @var{delim}
Use @var{delim} as the field-separator character instead of the TAB
character.

@item -f @var{list}
Use @var{list} as the list of fields to cut out.

@item -s
Suppress printing of lines that do not contain the field delimiter.
24311@end table 24312 24313The @command{awk} implementation of @command{cut} uses the @code{getopt()} library 24314function (@pxref{Getopt Function}) 24315and the @code{join()} library function 24316(@pxref{Join Function}). 24317 24318The current POSIX version of @command{cut} has options to cut fields based on 24319both bytes and characters. This version does not attempt to implement those options, 24320as @command{awk} works exclusively in terms of characters. 24321 24322The program begins with a comment describing the options, the library 24323functions needed, and a @code{usage()} function that prints out a usage 24324message and exits. @code{usage()} is called if invalid arguments are 24325supplied: 24326 24327@cindex @file{cut.awk} program 24328@example 24329@c file eg/prog/cut.awk 24330# cut.awk --- implement cut in awk 24331@c endfile 24332@ignore 24333@c file eg/prog/cut.awk 24334# 24335# Arnold Robbins, arnold@@skeeve.com, Public Domain 24336# May 1993 24337@c endfile 24338@end ignore 24339@c file eg/prog/cut.awk 24340 24341# Options: 24342# -c list Cut characters 24343# -f list Cut fields 24344# -d c Field delimiter character 24345# 24346# -s Suppress lines without the delimiter 24347# 24348# Requires getopt() and join() library functions 24349 24350@group 24351function usage() 24352@{ 24353 print("usage: cut [-f list] [-d c] [-s] [files...]") > "/dev/stderr" 24354 print(" cut [-c list] [files...]") > "/dev/stderr" 24355 exit 1 24356@} 24357@end group 24358@c endfile 24359@end example 24360 24361@cindex @code{BEGIN} pattern @subentry running @command{awk} programs and 24362@cindex @code{FS} variable @subentry running @command{awk} programs and 24363Next comes a @code{BEGIN} rule that parses the command-line options. 24364It sets @code{FS} to a single TAB character, because that is @command{cut}'s 24365default field separator. The rule then sets the output field separator to be the 24366same as the input field separator. 
A loop using @code{getopt()} steps 24367through the command-line options. Exactly one of the variables 24368@code{by_fields} or @code{by_chars} is set to true, to indicate that 24369processing should be done by fields or by characters, respectively. 24370When cutting by characters, the output field separator is set to the null 24371string: 24372 24373@example 24374@c file eg/prog/cut.awk 24375BEGIN @{ 24376 FS = "\t" # default 24377 OFS = FS 24378 while ((c = getopt(ARGC, ARGV, "sf:c:d:")) != -1) @{ 24379 if (c == "f") @{ 24380 by_fields = 1 24381 fieldlist = Optarg 24382 @} else if (c == "c") @{ 24383 by_chars = 1 24384 fieldlist = Optarg 24385 OFS = "" 24386 @} else if (c == "d") @{ 24387 if (length(Optarg) > 1) @{ 24388 printf("cut: using first character of %s" \ 24389 " for delimiter\n", Optarg) > "/dev/stderr" 24390 Optarg = substr(Optarg, 1, 1) 24391 @} 24392 fs = FS = Optarg 24393 OFS = FS 24394 if (FS == " ") # defeat awk semantics 24395 FS = "[ ]" 24396 @} else if (c == "s") 24397 suppress = 1 24398 else 24399 usage() 24400 @} 24401 24402 # Clear out options 24403 for (i = 1; i < Optind; i++) 24404 ARGV[i] = "" 24405@c endfile 24406@end example 24407 24408@cindex field separator @subentry spaces as 24409The code must take 24410special care when the field delimiter is a space. Using 24411a single space (@code{@w{" "}}) for the value of @code{FS} is 24412incorrect---@command{awk} would separate fields with runs of spaces, 24413TABs, and/or newlines, and we want them to be separated with individual 24414spaces. 24415To this end, we save the original space character in the variable 24416@code{fs} for later use; after setting @code{FS} to @code{@w{"[ ]"}} we can't 24417use it directly to see if the field delimiter character is in the string. 
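
The difference between the two separators is easy to see in isolation.
The following sketch is not part of @file{cut.awk}. It uses
@code{split()}, which treats a separator of @code{@w{" "}} the same way
that @code{FS} does; the sample string and the @file{fs-space.awk}
@value{FN} are made up for the illustration:

@example
BEGIN @{
    line = "a  b c"    # two spaces between "a" and "b"

    # A single space means "runs of whitespace": three fields
    n = split(line, parts, " ")
    printf "%d fields\n", n

    # A bracket expression matches each space individually:
    # four fields, the second of which is empty
    n = split(line, parts, "[ ]")
    printf "%d fields, second is <%s>\n", n, parts[2]
@}
@end example

@example
$ @kbd{gawk -f fs-space.awk}
@print{} 3 fields
@print{} 4 fields, second is <>
@end example

The empty second field is what @command{cut} users expect from
@samp{-d ' '}, and it is why the program cannot leave @code{FS} set to
a plain space.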
24418 24419Also remember that after @code{getopt()} is through 24420(as described in @ref{Getopt Function}), 24421we have to 24422clear out all the elements of @code{ARGV} from 1 to @code{Optind}, 24423so that @command{awk} does not try to process the command-line options 24424as @value{FN}s. 24425 24426After dealing with the command-line options, the program verifies that the 24427options make sense. Only one or the other of @option{-c} and @option{-f} 24428should be used, and both require a field list. Then the program calls 24429either @code{set_fieldlist()} or @code{set_charlist()} to pull apart the 24430list of fields or characters: 24431 24432@example 24433@c file eg/prog/cut.awk 24434 if (by_fields && by_chars) 24435 usage() 24436 24437 if (by_fields == 0 && by_chars == 0) 24438 by_fields = 1 # default 24439 24440@group 24441 if (fieldlist == "") @{ 24442 print "cut: needs list for -c or -f" > "/dev/stderr" 24443 exit 1 24444 @} 24445@end group 24446 24447 if (by_fields) 24448 set_fieldlist() 24449 else 24450 set_charlist() 24451@} 24452@c endfile 24453@end example 24454 24455@code{set_fieldlist()} splits the field list apart at the commas 24456into an array. Then, for each element of the array, it looks to 24457see if the element is actually a range, and if so, splits it apart. 24458The function checks the range 24459to make sure that the first number is smaller than the second. 24460Each number in the list is added to the @code{flist} array, which 24461simply lists the fields that will be printed. Normal field splitting 24462is used. 
The program lets @command{awk} handle the job of doing the
field splitting:

@example
@c file eg/prog/cut.awk
function set_fieldlist(        n, m, i, j, k, f, g)
@{
    n = split(fieldlist, f, ",")
    j = 1    # index in flist
    for (i = 1; i <= n; i++) @{
        if (index(f[i], "-") != 0) @{ # a range
            m = split(f[i], g, "-")
@group
            if (m != 2 || g[1] >= g[2]) @{
                printf("cut: bad field list: %s\n",
                                  f[i]) > "/dev/stderr"
                exit 1
            @}
@end group
            for (k = g[1]; k <= g[2]; k++)
                flist[j++] = k
        @} else
            flist[j++] = f[i]
    @}
    nfields = j - 1
@}
@c endfile
@end example

The @code{set_charlist()} function is more complicated than
@code{set_fieldlist()}.
The idea here is to use @command{gawk}'s @code{FIELDWIDTHS} variable
(@pxref{Constant Size}),
which describes constant-width input. When using a character list, that is
exactly what we have.

Setting up @code{FIELDWIDTHS} is more complicated than simply listing the
fields that need to be printed. We have to keep track of the fields to
print and also the intervening characters that have to be skipped.
For example, suppose you wanted characters 1 through 8, 15, and
22 through 35. You would use @samp{-c 1-8,15,22-35}. The necessary value
for @code{FIELDWIDTHS} is @code{@w{"8 6 1 6 14"}}.
That is, 8 characters to keep (1 through 8), 6 to skip (9 through 14),
1 to keep (15), 6 to skip (16 through 21), and 14 to keep (22 through 35).
This yields five
fields, and the fields to print
are @code{$1}, @code{$3}, and @code{$5}.
The intermediate fields are @dfn{filler},
which is the text in between the desired data.
24508@code{flist} lists the fields to print, and @code{t} tracks the 24509complete field list, including filler fields: 24510 24511@example 24512@c file eg/prog/cut.awk 24513function set_charlist( field, i, j, f, g, n, m, t, 24514 filler, last, len) 24515@{ 24516 field = 1 # count total fields 24517 n = split(fieldlist, f, ",") 24518 j = 1 # index in flist 24519 for (i = 1; i <= n; i++) @{ 24520 if (index(f[i], "-") != 0) @{ # range 24521 m = split(f[i], g, "-") 24522 if (m != 2 || g[1] >= g[2]) @{ 24523 printf("cut: bad character list: %s\n", 24524 f[i]) > "/dev/stderr" 24525 exit 1 24526 @} 24527 len = g[2] - g[1] + 1 24528 if (g[1] > 1) # compute length of filler 24529 filler = g[1] - last - 1 24530 else 24531 filler = 0 24532@group 24533 if (filler) 24534 t[field++] = filler 24535@end group 24536 t[field++] = len # length of field 24537 last = g[2] 24538 flist[j++] = field - 1 24539 @} else @{ 24540 if (f[i] > 1) 24541 filler = f[i] - last - 1 24542 else 24543 filler = 0 24544 if (filler) 24545 t[field++] = filler 24546 t[field++] = 1 24547 last = f[i] 24548 flist[j++] = field - 1 24549 @} 24550 @} 24551 FIELDWIDTHS = join(t, 1, field - 1) 24552 nfields = j - 1 24553@} 24554@c endfile 24555@end example 24556 24557Next is the rule that processes the data. If the @option{-s} option 24558is given, then @code{suppress} is true. The first @code{if} statement 24559makes sure that the input record does have the field separator. If 24560@command{cut} is processing fields, @code{suppress} is true, and the field 24561separator character is not in the record, then the record is skipped. 24562 24563If the record is valid, then @command{gawk} has split the data 24564into fields, either using the character in @code{FS} or using fixed-length 24565fields and @code{FIELDWIDTHS}. The loop goes through the list of fields 24566that should be printed. The corresponding field is printed if it contains data. 
If the next field also has data, then the separator character is
written out between the fields:

@example
@c file eg/prog/cut.awk
@{
    if (by_fields && suppress && index($0, fs) == 0)
        next

    for (i = 1; i <= nfields; i++) @{
        if ($flist[i] != "") @{
            printf "%s", $flist[i]
            if (i < nfields && $flist[i+1] != "")
                printf "%s", OFS
        @}
    @}
    print ""
@}
@c endfile
@end example

This version of @command{cut} relies on @command{gawk}'s @code{FIELDWIDTHS}
variable to do the character-based cutting.  It is possible in
other @command{awk} implementations to use @code{substr()}
(@pxref{String Functions}), but
it is also extremely painful.
The @code{FIELDWIDTHS} variable supplies an elegant solution to the problem
of picking the input line apart by characters.


@node Egrep Program
@subsection Searching for Regular Expressions in Files

@cindex regular expressions @subentry searching for
@cindex searching @subentry files for regular expressions
@cindex files @subentry searching for regular expressions
@cindex @command{egrep} utility
The @command{grep} family of programs searches files for patterns.
These programs have an unusual history.
Initially there was @command{grep} (Global Regular Expression Print),
which used what are now called Basic Regular Expressions (BREs).
Later there was @command{egrep} (Extended @command{grep}) which used
what are now called Extended Regular Expressions (EREs).  (These are almost
identical to those available in @command{awk}; @pxref{Regexp}).
There was also @command{fgrep} (Fast @command{grep}), which searched
for matches of one or more fixed strings.

POSIX chose to combine these three programs into one, simply named
@command{grep}.
On a POSIX system, @command{grep}'s default behavior
is to search using BREs.  You use @option{-E} to specify the use
of EREs, and @option{-F} to specify searching for fixed strings.

In practice, systems continue to come with separate @command{egrep}
and @command{fgrep} utilities, for backwards compatibility.  This
@value{SECTION} provides an @command{awk} implementation of @command{egrep},
which supports all of the POSIX-mandated options.
You invoke it as follows:

@display
@command{egrep} [@var{options}] @code{'@var{pattern}'} @var{files} @dots{}
@end display

The @var{pattern} is a regular expression.  In typical usage, the regular
expression is quoted to prevent the shell from expanding any of the
special characters as @value{FN} wildcards.  Normally, @command{egrep}
prints the lines that matched.  If multiple @value{FN}s are provided on
the command line, each output line is preceded by the name of the file
and a colon.

The options to @command{egrep} are as follows:

@table @code
@item -c
Print a count of the lines that matched the pattern, instead of the
lines themselves.

@item -e @var{pattern}
Use @var{pattern} as the regexp to match.  The purpose of the @option{-e}
option is to allow patterns that start with a @samp{-}.

@item -i
Ignore case distinctions in both the pattern and the input data.

@item -l
Only print (list) the names of the files that matched, not the lines that matched.

@item -n
Precede each matching line with its line number in the file.
(As the program is written, this has an effect only when @value{FN}s
are also printed; i.e., when more than one file is named.)

@item -q
Be quiet.  No output is produced and the exit value indicates whether
the pattern was matched.

@item -s
Be silent.  Do not print error messages for files that could
not be opened.

@item -v
Invert the sense of the test.
@command{egrep} prints the lines that do
@emph{not} match the pattern and exits successfully if the pattern is not
matched.

@item -x
Match the entire input line in order to consider the match as having
succeeded.
@end table

This version uses the @code{getopt()} library function
(@pxref{Getopt Function}) and @command{gawk}'s
@code{BEGINFILE} and @code{ENDFILE} special patterns
(@pxref{BEGINFILE/ENDFILE}).

The program begins with descriptive comments and then a @code{BEGIN} rule
that processes the command-line arguments with @code{getopt()}.  The @option{-i}
(ignore case) option is particularly easy with @command{gawk}; we just use the
@code{IGNORECASE} predefined variable
(@pxref{Built-in Variables}):

@cindex @file{egrep.awk} program
@example
@c file eg/prog/egrep.awk
# egrep.awk --- simulate egrep in awk
#
@c endfile
@ignore
@c file eg/prog/egrep.awk
# Arnold Robbins, arnold@@skeeve.com, Public Domain
# May 1993
# Revised September 2020

@c endfile
@end ignore
@c file eg/prog/egrep.awk
# Options:
#    -c    count of lines
#    -e    argument is pattern
#    -i    ignore case
#    -l    print filenames only
#    -n    add line number to output
#    -q    quiet - use exit value
#    -s    silent - don't print errors
#    -v    invert test, success if no match
#    -x    the entire line must match
#
# Requires getopt library function
# Uses IGNORECASE, BEGINFILE and ENDFILE
# Invoke using gawk -f egrep.awk -- options ...

BEGIN @{
    while ((c = getopt(ARGC, ARGV, "ce:ilnqsvx")) != -1) @{
        if (c == "c")
            count_only++
        else if (c == "e")
            pattern = Optarg
        else if (c == "i")
            IGNORECASE = 1
        else if (c == "l")
            filenames_only++
        else if (c == "n")
            line_numbers++
        else if (c == "q")
            no_print++
        else if (c == "s")
            no_errors++
        else if (c == "v")
            invert++
        else if (c == "x")
            full_line++
        else
            usage()
    @}
@c endfile
@end example

@noindent
Note the comment about invocation: Because several of the options overlap
with @command{gawk}'s, a @option{--} is needed to tell @command{gawk}
to stop looking for options.

Next comes the code that handles the @command{egrep}-specific behavior.
@command{egrep} uses the first nonoption on the command line
if no pattern is supplied with @option{-e}.
If the pattern is empty, that means no pattern was supplied, so it's
necessary to print an error message and exit.
The @command{awk} command-line arguments up to @code{ARGV[Optind]}
are cleared, so that @command{awk} won't try to process them as files.  If no
files are specified, the standard input is used, and if multiple files are
specified, we make sure to note this so that the @value{FN}s can precede the
matched lines in the output:

@example
@c file eg/prog/egrep.awk
    if (pattern == "")
        pattern = ARGV[Optind++]

    if (pattern == "")
        usage()

    for (i = 1; i < Optind; i++)
        ARGV[i] = ""

    if (Optind >= ARGC) @{
        ARGV[1] = "-"
        ARGC = 2
    @} else if (ARGC - Optind > 1)
        do_filenames++
@}
@c endfile
@end example

The @code{BEGINFILE} rule executes
when each new file is processed.  In this case, it is fairly simple; it
initializes a variable @code{fcount} to zero.
@code{fcount} tracks
how many lines in the current file matched the pattern.

Here also is where we implement the @option{-s} option.  We check
if @code{ERRNO} has been set, and if @option{-s} was supplied.
In that case, it's necessary to move on to the next file.  Otherwise
@command{gawk} would exit with an error:

@example
@c file eg/prog/egrep.awk
BEGINFILE @{
    fcount = 0
    if (ERRNO && no_errors)
        nextfile
@}
@c endfile
@end example

The @code{ENDFILE} rule executes after each file has been processed.
It affects the output only when the user wants a count of the number of lines that
matched.  @code{no_print} is true only if the exit status is desired.
@code{count_only} is true if line counts are desired.  @command{egrep}
therefore only prints line counts if printing and counting are enabled.
The output format must be adjusted depending upon the number of files to
process.  Finally, @code{fcount} is added to @code{total}, so that we
know the total number of lines that matched the pattern:

@example
@c file eg/prog/egrep.awk
ENDFILE @{
    if (! no_print && count_only) @{
        if (do_filenames)
            print FILENAME ":" fcount   # FILENAME still names the file just read
        else
            print fcount
    @}

@group
    total += fcount
@}
@end group
@c endfile
@end example

The following rule does most of the work of matching lines.  The variable
@code{matches} is true (non-zero) if the line matched the pattern.
If the user specified that the entire line must match (with @option{-x}),
the code checks this condition by looking at the values of
@code{RSTART} and @code{RLENGTH}.  If those indicate that the match
is not over the full line, @code{matches} is set to zero (false).

If the user
wants lines that did not match, we invert the sense of @code{matches}
using the @samp{!} operator.
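
The full-line test and the inversion can be tried in isolation.  The
following is only a sketch, not part of @file{egrep.awk}: it hard-codes
the regexp @code{/foo/} and, for each input line, prints the line, the
value of the match flag after the @option{-x}-style check, and the
inverted flag:

@example
$ @kbd{printf 'foo\nfoobar\n' | gawk '@{ m = match($0, /foo/)}
> @kbd{if (m && (RSTART != 1 || RLENGTH != length())) m = 0}
> @kbd{print $0, m, ! m @}'}
@print{} foo 1 0
@print{} foobar 0 1
@end example
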
We then increment @code{fcount} with the value of
@code{matches}, which is either one or zero, depending upon a
successful or unsuccessful match.  If the line does not match, the
@code{next} statement just moves on to the next input line.

We make a number of additional tests, but only if we
are not counting lines.  First, if the user only wants the exit status
(@code{no_print} is true), then it is enough to know that @emph{one}
line in this file matched, and we can skip on to the next file with
@code{nextfile}.  Similarly, if we are only printing @value{FN}s, we can
print the @value{FN}, and then skip to the next file with @code{nextfile}.
Finally, each line is printed, with a leading @value{FN},
optional colon and line number, and the final colon
if necessary:

@cindex @code{!} (exclamation point) @subentry @code{!} operator
@cindex exclamation point (@code{!}) @subentry @code{!} operator
@example
@c file eg/prog/egrep.awk
@{
    matches = match($0, pattern)
    if (matches && full_line && (RSTART != 1 || RLENGTH != length()))
        matches = 0

    if (invert)
        matches = ! matches

    fcount += matches    # 1 or 0

    if (! matches)
        next

    if (! count_only) @{
        if (no_print)
            nextfile

        if (filenames_only) @{
            print FILENAME
            nextfile
        @}

        if (do_filenames)
            if (line_numbers)
                print FILENAME ":" FNR ":" $0
            else
                print FILENAME ":" $0
        else
            print
    @}
@}
@c endfile
@end example

The @code{END} rule takes care of producing the correct exit status.
If
there are no matches, the exit status is one; otherwise, it is zero:

@example
@c file eg/prog/egrep.awk
END @{
    exit (total == 0)
@}
@c endfile
@end example

The @code{usage()} function prints a usage message in case of invalid options,
and then exits:

@example
@c file eg/prog/egrep.awk
function usage()
@{
    print("Usage:\tegrep [-cilnqsvx] [-e pat] [files ...]") > "/dev/stderr"
    print("\tegrep [-cilnqsvx] pat [files ...]") > "/dev/stderr"
    exit 1
@}
@c endfile
@end example

@node Id Program
@subsection Printing Out User Information

@cindex printing @subentry user information
@cindex users, information about @subentry printing
@cindex @command{id} utility
The @command{id} utility lists a user's real and effective user ID numbers,
real and effective group ID numbers, and the user's group set, if any.
@command{id} only prints the effective user ID and group ID if they are
different from the real ones.  If possible, @command{id} also supplies the
corresponding user and group names.  The output might look like this:

@example
$ @kbd{id}
@print{} uid=1000(arnold) gid=1000(arnold) groups=1000(arnold),4(adm),7(lp),27(sudo)
@end example

@cindex @code{PROCINFO} array @subentry user and group ID numbers and
This information is part of what is provided by @command{gawk}'s
@code{PROCINFO} array (@pxref{Built-in Variables}).
However, the @command{id} utility provides a more palatable output than just
individual numbers.

The POSIX version of @command{id} takes several options that give you
control over the output's format, such as printing only real IDs, or printing
only numbers or only names.  Additionally, you can print the information
for a specific user, instead of that of the current user.
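
To see the bare numbers that @code{PROCINFO} supplies for the current
process, a one-liner such as the following can be used.  It is a small
@command{gawk}-specific sketch, not part of the program developed below;
the values depend on the system and the user, so no output is shown:

@example
$ @kbd{gawk 'BEGIN @{ printf("uid=%d euid=%d gid=%d egid=%d\n",}
> @kbd{PROCINFO["uid"], PROCINFO["euid"], PROCINFO["gid"], PROCINFO["egid"]) @}'}
@end example
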

Here is a version of POSIX @command{id} written in @command{awk}.
It uses the @code{getopt()} library function
(@pxref{Getopt Function}),
the user database library functions
(@pxref{Passwd Functions}),
and the group database library functions
(@pxref{Group Functions})
from @ref{Library Functions}.

The program is moderately straightforward.  All the work is done in the
@code{BEGIN} rule.
It starts with explanatory comments, a list of options,
and then a @code{usage()} function:

@cindex @file{id.awk} program
@example
@c file eg/prog/id.awk
# id.awk --- implement id in awk
#
# Requires user and group library functions and getopt
@c endfile
@ignore
@c file eg/prog/id.awk
#
# Arnold Robbins, arnold@@skeeve.com, Public Domain
# May 1993
# Revised February 1996
# Revised May 2014
# Revised September 2014
# Revised September 2020

@c endfile
@end ignore
@c file eg/prog/id.awk
# output is:
# uid=12(foo) euid=34(bar) gid=3(baz) \
#             egid=5(blat) groups=9(nine),2(two),1(one)

# Options:
#   -G Output all group ids as space separated numbers (rgid, egid, groups)
#   -g Output only the egid as a number
#   -n Output name instead of the numeric value (with -g/-G/-u)
#   -r Output ruid/rgid instead of effective id
#   -u Output only effective user id, as a number

@group
function usage()
@{
    printf("Usage:\n" \
           "\tid [user]\n" \
           "\tid -G [-n] [user]\n" \
           "\tid -g [-nr] [user]\n" \
           "\tid -u [-nr] [user]\n") > "/dev/stderr"

    exit 1
@}
@end group
@c endfile
@end example

The first step is to parse the options using @code{getopt()},
and to set various flag variables according to the options given:

@example
@c file eg/prog/id.awk
BEGIN @{
    # parse args
    while ((c = getopt(ARGC, ARGV, "Ggnru")) != -1) @{
        if (c == "G")
            groupset_only++
        else if (c == "g")
            egid_only++
        else if (c == "n")
            names_not_groups++
        else if (c == "r")
            real_ids_only++
        else if (c == "u")
            euid_only++
        else
            usage()
    @}
@c endfile
@end example

The next step is to check that no conflicting options were
provided.  @option{-G} and @option{-r} are mutually exclusive.
It is also not allowed to provide more than one user name
on the command line:

@example
@c file eg/prog/id.awk
    if (groupset_only && real_ids_only)
        usage()
    else if (ARGC - Optind > 1)
        usage()
@c endfile
@end example

The user and group ID numbers are obtained from
@code{PROCINFO} for the current user, or from the
user and group databases for a user supplied on
the command line.  In the latter case, @code{real_ids_only}
is set, since it's not possible to print information about
the effective user and group IDs:

@example
@c file eg/prog/id.awk
    if (ARGC - Optind == 0) @{
        # gather info for current user
        uid = PROCINFO["uid"]
        euid = PROCINFO["euid"]
        gid = PROCINFO["gid"]
        egid = PROCINFO["egid"]
        for (i = 1; ("group" i) in PROCINFO; i++)
            groupset[i] = PROCINFO["group" i]
    @} else @{
        fill_info_for_user(ARGV[ARGC-1])
        real_ids_only++
    @}
@c endfile
@end example

The test in the @code{for} loop is worth noting.
Any supplementary groups in the @code{PROCINFO} array have the
indices @code{"group1"} through @code{"group@var{N}"} for some
@var{N} (i.e., the total number of supplementary groups).
However, we don't know in advance how many of these groups
there are.
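
The same idiom can be tried on an ordinary array.  In this sketch,
@code{info} is a made-up stand-in for @code{PROCINFO}, holding three
hypothetical group IDs:

@example
$ @kbd{gawk 'BEGIN @{ info["group1"] = 4; info["group2"] = 7; info["group3"] = 27}
> @kbd{for (i = 1; ("group" i) in info; i++)}
> @kbd{    print "group" i, info["group" i]}
> @kbd{print "stopped at", i @}'}
@print{} group1 4
@print{} group2 7
@print{} group3 27
@print{} stopped at 4
@end example
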

This loop works by starting at one, concatenating the value with
@code{"group"}, and then using @code{in} to see if that value is
in the array (@pxref{Reference to Elements}).  Eventually, @code{i} increments past
the last group in the array and the loop exits.

The loop is also correct if there are @emph{no} supplementary
groups; then the condition is false the first time it's
tested, and the loop body never executes.

Now, based on the options, we decide what information to print.
For @option{-G} (print just the group set), we then select
whether to print names or numbers.  In either case, when done
we exit:

@example
@c file eg/prog/id.awk
    if (groupset_only) @{
        if (names_not_groups) @{
            for (i = 1; i in groupset; i++) @{
                entry = getgrgid(groupset[i])
                name = get_first_field(entry)
                printf("%s", name)
                if ((i + 1) in groupset)
                    printf(" ")
            @}
        @} else @{
            for (i = 1; i in groupset; i++) @{
                printf("%u", groupset[i])
                if ((i + 1) in groupset)
                    printf(" ")
            @}
        @}

        print ""    # final newline
        exit 0
    @}
@c endfile
@end example

Otherwise, for @option{-g} (effective group ID only), we
check if @option{-r} was also provided, in which case we
use the real group ID.  Then based on @option{-n}, we decide
whether to print names or numbers.  Here too, when done,
we exit:

@example
@c file eg/prog/id.awk
    else if (egid_only) @{
        id = real_ids_only ? gid : egid
        if (names_not_groups) @{
            entry = getgrgid(id)
            name = get_first_field(entry)
            printf("%s\n", name)
        @} else @{
            printf("%u\n", id)
        @}

        exit 0
    @}
@c endfile
@end example

The @code{get_first_field()} function extracts the group name from
the group database entry for the given group ID.
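
For instance, given a group database entry in the usual colon-separated
form, the name is simply the first field.  The entry below is made up
for illustration, and the sketch does the splitting inline rather than
calling the program's own function:

@example
$ @kbd{gawk 'BEGIN @{ entry = "adm:x:4:syslog,arnold"}
> @kbd{split(entry, a, ":"); print a[1] @}'}
@print{} adm
@end example
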

Similar processing logic applies to @option{-u} (effective user ID only),
combined with @option{-r} and @option{-n}:

@example
@c file eg/prog/id.awk
    else if (euid_only) @{
        id = real_ids_only ? uid : euid
        if (names_not_groups) @{
            entry = getpwuid(id)
            name = get_first_field(entry)
            printf("%s\n", name)
        @} else @{
            printf("%u\n", id)
        @}

        exit 0
    @}
@c endfile
@end example

At this point, we haven't exited yet, so we print
the regular, default output, based either on the current
user's information, or that of the user whose name was
provided on the command line.  We start with the real user ID:

@example
@c file eg/prog/id.awk
    printf("uid=%d", uid)
    pw = getpwuid(uid)
    print_first_field(pw)
@c endfile
@end example

The @code{print_first_field()} function prints the user's
login name from the password file entry, surrounded by
parentheses.  It is shown shortly.
Printing the effective user ID is next:

@example
@c file eg/prog/id.awk
    if (euid != uid && ! real_ids_only) @{
        printf(" euid=%d", euid)
        pw = getpwuid(euid)
        print_first_field(pw)
    @}
@c endfile
@end example

Similar logic applies to the real and effective group IDs:

@example
@c file eg/prog/id.awk
    printf(" gid=%d", gid)
    pw = getgrgid(gid)
    print_first_field(pw)

    if (egid != gid && ! real_ids_only) @{
        printf(" egid=%d", egid)
        pw = getgrgid(egid)
        print_first_field(pw)
    @}
@c endfile
@end example

Finally, we print the group set and the terminating newline:

@example
@c file eg/prog/id.awk
    for (i = 1; i in groupset; i++) @{
        if (i == 1)
            printf(" groups=")
        group = groupset[i]
        printf("%d", group)
        pw = getgrgid(group)
        print_first_field(pw)
        if ((i + 1) in groupset)
            printf(",")
    @}

    print ""
@}
@c endfile
@end example

The @code{get_first_field()} function extracts the first field
from a password or group file entry for use as a user or group
name.  Fields are separated by @samp{:} characters:

@example
@c file eg/prog/id.awk
function get_first_field(str,  a)
@{
    if (str != "") @{
        split(str, a, ":")
        return a[1]
    @}
@}
@c endfile
@end example

This function is then used by @code{print_first_field()} to
output the given name surrounded by parentheses:

@example
@c file eg/prog/id.awk
function print_first_field(str)
@{
    first = get_first_field(str)
    printf("(%s)", first)
@}
@c endfile
@end example

These two functions simply isolate out some code that is used repeatedly,
making the whole program shorter and cleaner.  In particular, moving the
check for the empty string into @code{get_first_field()} saves several
lines of code.

Finally, @code{fill_info_for_user()} fetches user, group, and group
set information for the user named on the command line.
The code is fairly
straightforward, merely requiring that we exit if the given user doesn't
exist:

@example
@c file eg/prog/id.awk
function fill_info_for_user(user,
                            pwent, fields, groupnames, grent, groups, i)
@{
    pwent = getpwnam(user)
    if (pwent == "") @{
        printf("id: '%s': no such user\n", user) > "/dev/stderr"
        exit 1
    @}

    split(pwent, fields, ":")
    uid = fields[3] + 0
    gid = fields[4] + 0
@c endfile
@end example

Getting the group set is a little awkward.  The library routine
@code{getgruser()} returns a list of group @emph{names}.  These
have to be gone through and turned back into group numbers,
so that the rest of the code will work as expected:

@example
@ignore
@c file eg/prog/id.awk

@c endfile
@end ignore
@c file eg/prog/id.awk
    groupnames = getgruser(user)
    split(groupnames, groups, " ")
    for (i = 1; i in groups; i++) @{
        grent = getgrnam(groups[i])
        split(grent, fields, ":")
        groupset[i] = fields[3] + 0
    @}
@}
@c endfile
@end example

@node Split Program
@subsection Splitting a Large File into Pieces

@cindex files @subentry splitting
@cindex @code{split} utility
The @command{split} utility splits large text files into smaller pieces.
The usage follows the POSIX standard for @command{split} and is as follows:

@display
@command{split} [@option{-l} @var{count}] [@option{-a} @var{suffix-len}] [@var{file} [@var{outname}]]
@command{split} @option{-b} @var{N}[@code{k}|@code{m}] [@option{-a} @var{suffix-len}] [@var{file} [@var{outname}]]
@end display

By default, the output files are named @file{xaa}, @file{xab}, and so
on.  Each file has 1,000 lines in it, with the likely exception of the
last file.
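
As a rough illustration of the defaults, the following sketch works out
how many output files to expect and how many lines land in the last one.
The 2,500-line input is hypothetical, and the variable names are chosen
just for this example:

@example
$ @kbd{gawk 'BEGIN @{ lines = 2500; per_file = 1000}
> @kbd{n = int((lines + per_file - 1) / per_file)}
> @kbd{print n, lines - (n - 1) * per_file @}'}
@print{} 3 500
@end example

@noindent
That is, such an input would produce @file{xaa} and @file{xab} with
1,000 lines each, and @file{xac} with the remaining 500.
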

The @command{split} program has evolved over time, and the current POSIX
version is more complicated than the original Unix version.  The options
and what they do are as follows:

@table @asis
@item @option{-a} @var{suffix-len}
Use @var{suffix-len} characters for the suffix.  For example, if @var{suffix-len}
is four, the output files would range from @file{xaaaa} to @file{xzzzz}.

@item @option{-b} @var{N}[@code{k}|@code{m}]
Instead of each file containing a specified number of lines, each file
should have (at most) @var{N} bytes.  Supplying a trailing @samp{k}
multiplies @var{N} by 1,024, yielding kilobytes.  Supplying a trailing
@samp{m} multiplies @var{N} by 1,048,576 (@math{1,024 @value{TIMES} 1,024})
yielding megabytes.  (This option is mutually exclusive with @option{-l}).

@item @option{-l} @var{count}
Each file should have at most @var{count} lines, instead of the default
1,000.  (This option is mutually exclusive with @option{-b}).
@end table

If supplied, @var{file} is the input file to read.  Otherwise standard
input is processed.  If supplied, @var{outname} is the leading prefix
to use for @value{FN}s, instead of @samp{x}.

In order to use the @option{-b} option, @command{gawk} should be invoked
with its @option{-b} option (@pxref{Options}), or with the environment
variable @env{LC_ALL} set to @samp{C}, so that each input byte is treated
as a separate character.@footnote{Using @option{-b} twice requires
separating @command{gawk}'s options from those of the program.  For example:
@samp{gawk -f getopt.awk -f split.awk -b -- -b 42m large-file.txt split-}.}

Here is an implementation of @command{split} in @command{awk}.  It uses the
@code{getopt()} function presented in @ref{Getopt Function}.
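
Before looking at the program, it may help to see the arithmetic behind
a @option{-b} argument such as @samp{42m}.  This is a standalone sketch
with variable names chosen for illustration; it relies on @command{awk}
converting the leading digits of a string to a number:

@example
$ @kbd{gawk 'BEGIN @{ arg = "42m"; n = arg + 0}
> @kbd{mod = substr(arg, length(arg), 1)}
> @kbd{if (mod == "k") n *= 1024}
> @kbd{else if (mod == "m") n *= 1024 * 1024}
> @kbd{print n @}'}
@print{} 44040192
@end example
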

The program begins with a standard descriptive comment and then
a @code{usage()} function describing the options.  The variable
@code{common} keeps the function's lines short so that they
look nice on the page:

@cindex @file{split.awk} program
@example
@c file eg/prog/split.awk
# split.awk --- do split in awk
#
# Requires getopt() library function.
@c endfile
@ignore
@c file eg/prog/split.awk
#
# Arnold Robbins, arnold@@skeeve.com, Public Domain
# May 1993
# Revised slightly, May 2014
# Rewritten September 2020

@c endfile
@end ignore
@c file eg/prog/split.awk

function usage(     common)
@{
    common = "[-a suffix-len] [file [outname]]"
    printf("usage: split [-l count]  %s\n", common) > "/dev/stderr"
    printf("       split [-b N[k|m]] %s\n", common) > "/dev/stderr"
    exit 1
@}
@c endfile
@end example

Next, in a @code{BEGIN} rule we set the default values and parse the arguments.
After that we initialize the data structures used to cycle the suffix
from @samp{aa@dots{}} to @samp{zz@dots{}}.  Finally we set the name of
the first output file:

@example
@c file eg/prog/split.awk
BEGIN @{
    # Set defaults:
    Suffix_length = 2
    Line_count = 1000
    Byte_count = 0
    Outfile = "x"

    parse_arguments()

    init_suffix_data()

    Output = (Outfile compute_suffix())
@}
@c endfile
@end example

Parsing the arguments is straightforward.
The program follows our
convention (@pxref{Library Names}) of having important global variables
start with an uppercase letter:

@example
@c file eg/prog/split.awk
function parse_arguments(   i, c, l, modifier)
@{
    while ((c = getopt(ARGC, ARGV, "a:b:l:")) != -1) @{
        if (c == "a")
            Suffix_length = Optarg + 0
        else if (c == "b") @{
            Byte_count = Optarg + 0
            Line_count = 0

            l = length(Optarg)
            modifier = substr(Optarg, l, 1)
            if (modifier == "k")
                Byte_count *= 1024
            else if (modifier == "m")
                Byte_count *= 1024 * 1024
        @} else if (c == "l") @{
            Line_count = Optarg + 0
            Byte_count = 0
        @} else
            usage()
    @}

    # Clear out options
    for (i = 1; i < Optind; i++)
        ARGV[i] = ""

    # Check for filename
    if (ARGV[Optind]) @{
        Optind++

        # Check for different prefix
        if (ARGV[Optind]) @{
            Outfile = ARGV[Optind]
            ARGV[Optind] = ""

            if (++Optind < ARGC)
                usage()
        @}
    @}
@}
@c endfile
@end example

Managing the @value{FN} suffix is interesting.
Given a suffix of length three, say, the values go from
@samp{aaa}, @samp{aab}, @samp{aac} and so on, all the way to
@samp{zzx}, @samp{zzy}, and finally @samp{zzz}.
There are two important aspects to this:

@itemize @bullet
@item
We have to be
able to easily generate these suffixes, and in particular
easily handle ``rolling over''; for example, going from
@samp{abz} to @samp{aca}.

@item
We have to tell when we've finished with the last file,
so that if we still have more input data we can print an
error message and exit.  The trick is to handle this @emph{after}
using the last suffix, and not when the final suffix is created.
@end itemize

The computation is handled by @code{compute_suffix()}.
This function is called every time a new file is opened.

The flow here is messy, because we want to generate @samp{zzzz} (say),
and use it, and only produce an error after all the @value{FN}
suffixes have been used up.  The logical steps are as follows:

@enumerate 1
@item
Generate the suffix, saving the value in @code{result} to return.
To do this, the supplementary array @code{Suffix_ind} contains one
element for each letter in the suffix.  Each element ranges from 1 to
26, acting as the index into a string containing all the lowercase
letters of the English alphabet.
It is initialized by @code{init_suffix_data()}.
@code{result} is built up one letter at a time, using @code{substr()}.

@item
Prepare the data structures for the next time @code{compute_suffix()}
is called.  To do this, we loop over @code{Suffix_ind}, @emph{backwards}.
If the current element is less than 26, it's incremented and the loop
breaks (@samp{abq} goes to @samp{abr}).  Otherwise, the element is
reset to one and we move on to the preceding element (@samp{abz} to @samp{aca}).
Thus, the @code{Suffix_ind} array is always ``one step ahead'' of the actual
@value{FN} suffix to be returned.

@item
Check if we've gone past the limit of possible @value{FN}s.
If @code{Reached_last} is true, print a message and exit.  Otherwise,
check if @code{Suffix_ind} describes a suffix where all the letters are
@samp{z}.  If that's the case, we're about to return the final suffix, so
we set @code{Reached_last} to true so that the @emph{next} call to
@code{compute_suffix()} will cause a failure.
@end enumerate

Physically, the steps in the function occur in the order 3, 1, 2:

@example
@c file eg/prog/split.awk
function compute_suffix(    i, result, letters)
@{
    # Logical step 3
    if (Reached_last) @{
        printf("split: too many files!\n") > "/dev/stderr"
        exit 1
    @} else if (on_last_file())
        Reached_last = 1    # fail when wrapping after 'zzz'

    # Logical step 1
    result = ""
    letters = "abcdefghijklmnopqrstuvwxyz"
    for (i = 1; i <= Suffix_length; i++)
        result = result substr(letters, Suffix_ind[i], 1)

    # Logical step 2
    for (i = Suffix_length; i >= 1; i--) @{
        if (++Suffix_ind[i] > 26) @{
            Suffix_ind[i] = 1
        @} else
            break
    @}

    return result
@}
@c endfile
@end example

The @code{Suffix_ind} array and @code{Reached_last} are initialized
by @code{init_suffix_data()}:

@example
@c file eg/prog/split.awk
function init_suffix_data(  i)
@{
    for (i = 1; i <= Suffix_length; i++)
        Suffix_ind[i] = 1

    Reached_last = 0
@}
@c endfile
@end example

The function @code{on_last_file()} returns true if @code{Suffix_ind} describes
a suffix where all the letters are @samp{z} by checking that all the elements
in the array are equal to 26:

@example
@c file eg/prog/split.awk
function on_last_file(  i, on_last)
@{
    on_last = 1
    for (i = 1; i <= Suffix_length; i++) @{
        on_last = on_last && (Suffix_ind[i] == 26)
    @}

    return on_last
@}
@c endfile
@end example

The actual work of splitting the input file is done by the next two rules.
Since splitting by line count and splitting by byte count are mutually
exclusive, we simply use two separate rules, one for when @code{Line_count}
is greater than zero, and another for when @code{Byte_count} is greater than zero.

The variable @code{tcount} counts how many lines have been processed so far.
When it exceeds @code{Line_count}, it's time to close the previous file and
switch to a new one:

@example
@c file eg/prog/split.awk
Line_count > 0 @{
    if (++tcount > Line_count) @{
        close(Output)
        Output = (Outfile compute_suffix())
        tcount = 1
    @}
    print > Output
@}
@c endfile
@end example

The rule for handling bytes is more complicated. Since lines most likely
vary in length, the @code{Byte_count} boundary may be hit in the middle of
an input record. In that case, @command{split} has to write enough of the
first bytes of the input record to finish up @code{Byte_count} bytes, close
the file, open a new file, and write the rest of the record to the new file.
The logic here does all that:

@example
@c file eg/prog/split.awk
Byte_count > 0 @{
    # `+ 1' is for the final newline
    if (tcount + length($0) + 1 > Byte_count) @{ # would overflow
        # compute leading bytes
        leading_bytes = Byte_count - tcount

        # write leading bytes
        printf("%s", substr($0, 1, leading_bytes)) > Output

        # close old file, open new file
        close(Output)
        Output = (Outfile compute_suffix())

        # set up first bytes for new file
        $0 = substr($0, leading_bytes + 1)  # trailing bytes
        tcount = 0
    @}

    # write full record or trailing bytes
    tcount += length($0) + 1
    print > Output
@}
@c endfile
@end example

Finally, the @code{END} rule cleans up by closing the last output file:

@example
@c file eg/prog/split.awk
END @{
    close(Output)
@}
@c endfile
@end example

@node Tee Program
@subsection Duplicating Output into Multiple Files

@cindex files @subentry multiple, duplicating output into
@cindex output @subentry duplicating into files
@cindex @code{tee} utility
The @code{tee} program is known as a ``pipe fitting.'' @code{tee} copies
its standard input to its standard output and also duplicates it to the
files named on the command line. Its usage is as follows:

@display
@command{tee} [@option{-a}] @var{file} @dots{}
@end display

The @option{-a} option tells @code{tee} to append to the named files, instead of
truncating them and starting over.

The @code{BEGIN} rule first makes a copy of all the command-line arguments
into an array named @code{copy}.
@code{ARGV[0]} is not needed, so it is not copied.
@code{tee} cannot use @code{ARGV} directly, because @command{awk} attempts to
process each @value{FN} in @code{ARGV} as input data.

@cindex flag variables
If the first argument is @option{-a}, then the flag variable
@code{append} is set to true, and both @code{ARGV[1]} and
@code{copy[1]} are deleted. If @code{ARGC} is less than two, then no
@value{FN}s were supplied and @code{tee} prints a usage message and exits.
Finally, @command{awk} is forced to read the standard input by setting
@code{ARGV[1]} to @code{"-"} and @code{ARGC} to two:

@cindex @file{tee.awk} program
@example
@c file eg/prog/tee.awk
# tee.awk --- tee in awk
#
# Copy standard input to all named output files.
# Append content if -a option is supplied.
#
@c endfile
@ignore
@c file eg/prog/tee.awk
# Arnold Robbins, arnold@@skeeve.com, Public Domain
# May 1993
# Revised December 1995

@c endfile
@end ignore
@c file eg/prog/tee.awk
BEGIN @{
    for (i = 1; i < ARGC; i++)
        copy[i] = ARGV[i]

    if (ARGV[1] == "-a") @{
        append = 1
        delete ARGV[1]
        delete copy[1]
        ARGC--
    @}
    if (ARGC < 2) @{
        print "usage: tee [-a] file ..." > "/dev/stderr"
        exit 1
    @}
    ARGV[1] = "-"
    ARGC = 2
@}
@c endfile
@end example

The following single rule does all the work. Because there is no pattern, it is
executed for each line of input.
The body of the rule simply prints the
line into each file on the command line, and then to the standard output:

@example
@c file eg/prog/tee.awk
@{
    # moving the if outside the loop makes it run faster
    if (append)
        for (i in copy)
            print >> copy[i]
    else
        for (i in copy)
            print > copy[i]
    print
@}
@c endfile
@end example

@noindent
It is also possible to write the loop this way:

@example
@group
for (i in copy)
    if (append)
        print >> copy[i]
@end group
@group
    else
        print > copy[i]
@end group
@end example

@noindent
This is more concise, but it is also less efficient. The @samp{if} is
tested for each record and for each output file. By duplicating the loop
body, the @samp{if} is only tested once for each input record. If there are
@var{N} input records and @var{M} output files, the first method only
executes @var{N} @samp{if} statements, while the second executes
@var{N}@code{*}@var{M} @samp{if} statements.

Finally, the @code{END} rule cleans up by closing all the output files:

@example
@c file eg/prog/tee.awk
END @{
    for (i in copy)
        close(copy[i])
@}
@c endfile
@end example

@node Uniq Program
@subsection Printing Nonduplicated Lines of Text

@cindex printing @subentry unduplicated lines of text
@cindex text, printing @subentry unduplicated lines of
@cindex @command{uniq} utility
The @command{uniq} utility reads sorted lines of data on its standard
input, and by default removes duplicate lines. In other words, it only
prints unique lines---hence the name. @command{uniq} has a number of
options.
The usage is as follows:

@display
@command{uniq} [@option{-udc} [@code{-f @var{n}}] [@code{-s @var{n}}]] [@var{inputfile} [@var{outputfile}]]
@end display

The options for @command{uniq} are:

@table @code
@item -d
Print only repeated (duplicated) lines.

@item -u
Print only nonrepeated (unique) lines.

@item -c
Count lines. This option overrides @option{-d} and @option{-u}. Both repeated
and nonrepeated lines are counted.

@item -f @var{n}
Skip @var{n} fields before comparing lines. The definition of fields
is similar to @command{awk}'s default: nonwhitespace characters separated
by runs of spaces and/or TABs.

@item -s @var{n}
Skip @var{n} characters before comparing lines. Any fields specified with
@option{-f} are skipped first.

@item @var{inputfile}
Data is read from the input file named on the command line, instead of from
the standard input.

@item @var{outputfile}
The generated output is sent to the named output file, instead of to the
standard output.
@end table

Normally @command{uniq} behaves as if both the @option{-d} and
@option{-u} options are provided.

@command{uniq} uses the
@code{getopt()} library function
(@pxref{Getopt Function})
and the @code{join()} library function
(@pxref{Join Function}).
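
To make the interaction of @option{-f} and @option{-s} concrete, here is a
minimal sketch of how a comparison key might be derived from a line.
The function @code{compare_key()} is a hypothetical helper written for this
illustration; it is not part of @file{uniq.awk}, which uses @code{split()}
and @code{join()} instead. The sketch assumes that a field is a run of
nonblank characters together with any blanks in front of it, so leading
blanks of the next field remain in the key:

```awk
# compare_key --- hypothetical helper, not part of uniq.awk
#
# Return the part of LINE that is left after skipping NFIELDS
# blank-separated fields and then NCHARS characters.
# Assumption: a field is optional blanks followed by nonblanks.

function compare_key(line, nfields, nchars,    i)
{
    for (i = 1; i <= nfields; i++)
        sub(/^[ \t]*[^ \t]+/, "", line)    # drop one leading field

    return substr(line, nchars + 1)        # then drop nchars characters
}

BEGIN {
    # Same key: only the first field (a timestamp) differs
    print "[" compare_key("10:01 alice login", 1, 0) "]"
    print "[" compare_key("10:07 alice login", 1, 0) "]"

    # Fields are skipped first, then characters
    print "[" compare_key("10:07 alice login", 1, 3) "]"
}
```

With such a key in hand, two lines count as duplicates when their keys
compare equal as strings, which is the test the options table describes.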

The program begins with a @code{usage()} function and then a brief outline of
the options and their meanings in comments:

@cindex @file{uniq.awk} program
@example
@c file eg/prog/uniq.awk
@group
# uniq.awk --- do uniq in awk
#
# Requires getopt() and join() library functions
@end group
@c endfile
@ignore
@c file eg/prog/uniq.awk
#
# Arnold Robbins, arnold@@skeeve.com, Public Domain
# May 1993
# Updated August 2020 to current POSIX
@c endfile
@end ignore
@c file eg/prog/uniq.awk

function usage()
@{
    print("Usage: uniq [-udc [-f fields] [-s chars]] " \
          "[ in [ out ]]") > "/dev/stderr"
    exit 1
@}

# -c    count lines. overrides -d and -u
# -d    only repeated lines
# -u    only nonrepeated lines
# -f n  skip n fields
# -s n  skip n characters, skip fields first
@c endfile
@end example

The POSIX standard for @command{uniq} allows options to start with
@samp{+} as well as with @samp{-}. An initial @code{BEGIN} rule
traverses the arguments changing any leading @samp{+} to @samp{-}
so that the @code{getopt()} function can parse the options:

@example
@c file eg/prog/uniq.awk
# As of 2020, '+' can be used as the option character in addition to '-'
# Previously allowed use of -N to skip fields and +N to skip
# characters is no longer allowed, and not supported by this version.

BEGIN @{
    # Convert + to - so getopt can handle things
    for (i = 1; i < ARGC; i++) @{
        first = substr(ARGV[i], 1, 1)
        if (ARGV[i] == "--" || (first != "-" && first != "+"))
            break
        else if (first == "+")
            # Replace "+" with "-"
            ARGV[i] = "-" substr(ARGV[i], 2)
    @}
@}
@c endfile
@end example

The next @code{BEGIN} rule deals with the command-line arguments and options.
If no options are supplied, then the default is taken, to print both
repeated and nonrepeated lines. The output file, if provided, is assigned
to @code{outputfile}. Early on, @code{outputfile} is initialized to the
standard output, @file{/dev/stdout}:

@example
@c file eg/prog/uniq.awk
BEGIN @{
    count = 1
    outputfile = "/dev/stdout"
    opts = "udcf:s:"
    while ((c = getopt(ARGC, ARGV, opts)) != -1) @{
        if (c == "u")
            non_repeated_only++
        else if (c == "d")
            repeated_only++
        else if (c == "c")
            do_count++
        else if (c == "f")
            fcount = Optarg + 0
        else if (c == "s")
            charcount = Optarg + 0
        else
            usage()
    @}

    for (i = 1; i < Optind; i++)
        ARGV[i] = ""

    if (repeated_only == 0 && non_repeated_only == 0)
        repeated_only = non_repeated_only = 1

    if (ARGC - Optind == 2) @{
        outputfile = ARGV[ARGC - 1]
        ARGV[ARGC - 1] = ""
    @}
@}
@c endfile
@end example

The following function, @code{are_equal()}, compares the current line,
@code{$0}, to the previous line, @code{last}. It handles skipping fields
and characters. If no field count and no character count are specified,
@code{are_equal()} returns one or zero depending upon the result of a
simple string comparison of @code{last} and @code{$0}.

Otherwise, things get more complicated. If fields have to be skipped,
each line is broken into an array using @code{split()} (@pxref{String
Functions}); the desired fields are then joined back into a line
using @code{join()}. The joined lines are stored in @code{clast} and
@code{cline}. If no fields are skipped, @code{clast} and @code{cline}
are set to @code{last} and @code{$0}, respectively.
Finally, if
characters are skipped, @code{substr()} is used to strip off the leading
@code{charcount} characters in @code{clast} and @code{cline}. The two
strings are then compared and @code{are_equal()} returns the result:

@example
@c file eg/prog/uniq.awk
@group
function are_equal(    n, m, clast, cline, alast, aline)
@{
    if (fcount == 0 && charcount == 0)
        return (last == $0)
@end group

    if (fcount > 0) @{
        n = split(last, alast)
        m = split($0, aline)
        clast = join(alast, fcount+1, n)
        cline = join(aline, fcount+1, m)
    @} else @{
        clast = last
        cline = $0
    @}
    if (charcount) @{
        clast = substr(clast, charcount + 1)
        cline = substr(cline, charcount + 1)
    @}
@group

    return (clast == cline)
@}
@end group
@c endfile
@end example

The following two rules are the body of the program. The first one is
executed only for the very first line of data. It sets @code{last} equal to
@code{$0}, so that subsequent lines of text have something to be compared to.

The second rule does the work. The variable @code{equal} is one or zero,
depending upon the results of @code{are_equal()}'s comparison. If @command{uniq}
is counting repeated lines, and the lines are equal, then it increments the @code{count} variable.
Otherwise, it prints the line and resets @code{count},
because the two lines are not equal.

If @command{uniq} is not counting, and if the lines are equal, @code{count} is incremented.
Nothing is printed, as the point is to remove duplicates.
Otherwise, if @command{uniq} is counting repeated lines and more than
one line is seen, or if @command{uniq} is counting nonrepeated lines
and only one line is seen, then the line is printed, and @code{count}
is reset.
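
The printing decision for the non-counting case can be isolated in a small
predicate, which may make the conditions easier to check by hand. This is
only a sketch: @code{should_print()} is a hypothetical function introduced
for this illustration and does not appear in @file{uniq.awk}; its two flag
parameters mirror the program's @code{repeated_only} and
@code{non_repeated_only} variables:

```awk
# should_print --- hypothetical helper, not part of uniq.awk
#
# COUNT is the length of a run of equal lines. The flags mirror
# the repeated_only and non_repeated_only variables.

function should_print(count, repeated_only, non_repeated_only)
{
    return (repeated_only && count > 1) ||
           (non_repeated_only && count == 1)
}

BEGIN {
    # Columns: run length, default (both flags), -d only, -u only
    for (count = 1; count <= 2; count++)
        print count, should_print(count, 1, 1),
              should_print(count, 1, 0), should_print(count, 0, 1)
}
```

With both flags set, which is the default, every run is printed exactly once;
with only one flag set, runs of the other kind are dropped.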

Finally, similar logic is used in the @code{END} rule to print the final
line of input data:

@example
@c file eg/prog/uniq.awk
NR == 1 @{
    last = $0
    next
@}

@{
    equal = are_equal()

    if (do_count) @{    # overrides -d and -u
        if (equal)
            count++
        else @{
            printf("%4d %s\n", count, last) > outputfile
            last = $0
            count = 1    # reset
        @}
        next
    @}

    if (equal)
        count++
    else @{
        if ((repeated_only && count > 1) ||
                (non_repeated_only && count == 1))
            print last > outputfile
        last = $0
        count = 1
    @}
@}

END @{
    if (do_count)
        printf("%4d %s\n", count, last) > outputfile
@group
    else if ((repeated_only && count > 1) ||
            (non_repeated_only && count == 1))
        print last > outputfile
    close(outputfile)
@}
@end group
@c endfile
@end example

As a side note, this program does not follow our recommended convention of naming
global variables with a leading capital letter. Doing that would
make the program a little easier to follow.

@ifset FOR_PRINT
@cindex Kernighan, Brian @subentry quotes
The logic for choosing which lines to print represents a @dfn{state
machine}, which is ``a device which can be in one of a set number
of stable conditions depending on its previous condition and on the
present values of its inputs.''@footnote{This definition is from
@uref{https://www.lexico.com/en/definition/state_machine}.} Brian
Kernighan suggests that ``an alternative approach to state machines is
to just read the input into an array, then use indexing. It's almost
always easier code, and for most inputs where you would use this, just
as fast.'' Consider how to rewrite the logic to follow this suggestion.
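
As a starting point, here is one possible sketch of the array-based approach.
It is deliberately incomplete: it handles none of the options, and only
reports each run of equal adjacent lines together with its length. The
function @code{emit_runs()} and the arrays @code{line} and @code{out} are
hypothetical names chosen for this illustration:

```awk
# emit_runs --- hypothetical helper, not part of uniq.awk
#
# Walk LINE[1..N] by index. For each run of equal adjacent lines,
# store "count<TAB>text" in OUT. Return the number of runs.

function emit_runs(line, n, out,    i, j, k)
{
    k = 0
    for (i = 1; i <= n; i = j) {
        # advance j past every line equal to line[i]
        j = i + 1
        while (j <= n && line[j] == line[i])
            j++
        out[++k] = (j - i) "\t" line[i]
    }
    return k
}

# Read the whole input first. Concatenating "" forces a string
# value, so that lines such as 1 and 1.0 do not compare equal.
{ line[NR] = $0 "" }

END {
    nruns = emit_runs(line, NR, out)
    for (r = 1; r <= nruns; r++)
        print out[r]
}
```

Because the run length is known before anything is printed, the choice of
what to print for each run no longer has to be spread across the main rule
and the @code{END} rule.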
@end ifset


@node Wc Program
@subsection Counting Things

@cindex counting words, lines, characters, and bytes
@cindex input files @subentry counting elements in
@cindex words @subentry counting
@cindex characters @subentry counting
@cindex lines @subentry counting
@cindex bytes @subentry counting
@cindex @command{wc} utility
The @command{wc} (word count) utility counts lines, words, characters
and bytes in one or more input files.

@menu
* Bytes vs. Characters::        Modern character sets.
* Using extensions::            A brief intro to extensions.
* @command{wc} program::        Code for @file{wc.awk}.
@end menu

@node Bytes vs. Characters
@subsubsection Modern Character Sets

In the early days of computing, single bytes were used for storing
characters. The most common character sets were ASCII and EBCDIC,
which each provided all the English upper- and lowercase letters, the 10
Hindu-Arabic numerals from 0 through 9, and a number of other standard
punctuation and control characters.

Today, the most popular character set in use is Unicode (of which ASCII
is a pure subset). Unicode provides tens of thousands of unique characters
(called @dfn{code points}) to cover most existing human languages (living
and dead) and a number of nonhuman ones as well (such as Klingon and
J.R.R.@: Tolkien's elvish languages).

To save space in files, Unicode code points are @dfn{encoded}, where each
character takes from one to four bytes in the file. UTF-8 is possibly
the most popular of such @dfn{multibyte encodings}.

The POSIX standard requires that @command{awk} function in terms
of characters, not bytes.
Thus in @command{gawk}, @code{length()},
@code{substr()}, @code{split()}, @code{match()} and the other string
functions (@pxref{String Functions}) all work in terms of characters in
the local character set, and not in terms of bytes. (Not all @command{awk}
implementations do so, though.)

There is no standard, built-in way to distinguish characters from bytes
in an @command{awk} program. For an @command{awk} implementation of
@command{wc}, which needs to make such a distinction, we will have to
use an external extension.

@node Using extensions
@subsubsection A Brief Introduction To Extensions

Loadable extensions are presented in full detail in @ref{Dynamic Extensions}.
They provide a way to add functions to @command{gawk} which can call
out to other facilities written in C or C++.

For the purposes of
@file{wc.awk}, it's enough to know that the extension is loaded
with the @code{@@load} directive, and the additional function we
will use is called @code{mbs_length()}. This function returns the
number of bytes in a string, not the number of characters.

The @code{"mbs"} extension comes from the @code{gawkextlib}
project. @xref{gawkextlib} for more information.

@node @command{wc} program
@subsubsection Code for @file{wc.awk}

The usage for @command{wc} is as follows:

@display
@command{wc} [@option{-lwcm}] [@var{files} @dots{}]
@end display

If no files are specified on the command line, @command{wc} reads its standard
input. If there are multiple files, it also prints total counts for all
the files. The options and their meanings are as follows:

@table @code
@item -c
Count only bytes.
Once upon a time, the @samp{c} in this option stood for ``characters.''
But, as explained earlier, bytes and characters are no longer synonymous
with each other.

@item -l
Count only lines.

@item -m
Count only characters.

@item -w
Count only words.
A ``word'' is a contiguous sequence of nonwhitespace characters, separated
by spaces and/or TABs. Luckily, this is the normal way @command{awk} separates
fields in its input data.
@end table

Implementing @command{wc} in @command{awk} is particularly elegant,
because @command{awk} does a lot of the work for us; it splits lines into
words (i.e., fields) and counts them, it counts lines (i.e., records),
and it can easily tell us how long a line is in characters.

This program uses the @code{getopt()} library function
(@pxref{Getopt Function})
and the file-transition functions
(@pxref{Filetrans Function}).

This version has one notable difference from older versions of
@command{wc}: it always prints the counts in the order lines, words,
characters and bytes. Older versions note the order of the @option{-l},
@option{-w}, and @option{-c} options on the command line, and print the
counts in that order. POSIX does not mandate this behavior, though.

The @code{BEGIN} rule does the argument processing.
The variable
@code{print_total} is true if more than one file is named on the
command line:

@cindex @file{wc.awk} program
@example
@c file eg/prog/wc.awk
# wc.awk --- count lines, words, characters, bytes
@c endfile
@ignore
@c file eg/prog/wc.awk
#
# Arnold Robbins, arnold@@skeeve.com, Public Domain
# May 1993
# Revised September 2020
@c endfile
@end ignore
@c file eg/prog/wc.awk

# Options:
#    -l    only count lines
#    -w    only count words
#    -c    only count bytes
#    -m    only count characters
#
# Default is to count lines, words, bytes
#
# Requires getopt() and file transition library functions
# Requires mbs extension from gawkextlib

@@load "mbs"

BEGIN @{
    # let getopt() print a message about
    # invalid options. we ignore them
    while ((c = getopt(ARGC, ARGV, "lwcm")) != -1) @{
        if (c == "l")
            do_lines = 1
        else if (c == "w")
            do_words = 1
        else if (c == "c")
            do_bytes = 1
        else if (c == "m")
            do_chars = 1
    @}
    for (i = 1; i < Optind; i++)
        ARGV[i] = ""

    # if no options, do lines, words, bytes
    if (! do_lines && ! do_words && ! do_chars && ! do_bytes)
        do_lines = do_words = do_bytes = 1

    print_total = (ARGC - i > 1)
@}
@c endfile
@end example

The @code{beginfile()} function is simple; it just resets the counts of lines,
words, characters and bytes to zero, and saves the current @value{FN} in
@code{fname}:

@example
@c file eg/prog/wc.awk
function beginfile(file)
@{
    lines = words = chars = bytes = 0
    fname = FILENAME
@}
@c endfile
@end example

The @code{endfile()} function adds the current file's numbers to the
running totals of lines, words, characters, and bytes.
It then prints out those
numbers for the file that was just read. It relies on @code{beginfile()}
to reset the numbers for the following @value{DF}:

@example
@c file eg/prog/wc.awk
function endfile(file)
@{
    tlines += lines
    twords += words
    tchars += chars
    tbytes += bytes
    if (do_lines)
        printf "\t%d", lines
@group
    if (do_words)
        printf "\t%d", words
@end group
    if (do_chars)
        printf "\t%d", chars
    if (do_bytes)
        printf "\t%d", bytes
    printf "\t%s\n", fname
@}
@c endfile
@end example

There is one rule that is executed for each line. It adds the length of
the record, plus one, to @code{chars}. Adding one plus the record length
is needed because the newline character separating records (the value
of @code{RS}) is not part of the record itself, and thus not included
in its length. Similarly, it adds the length of the record in bytes,
plus one, to @code{bytes}.
Next, @code{lines} is incremented for each
line read, and @code{words} is incremented by the value of @code{NF},
which is the number of ``words'' on this line:

@example
@c file eg/prog/wc.awk
# do per line
@{
    chars += length($0) + 1    # get newline
    bytes += mbs_length($0) + 1
    lines++
    words += NF
@}
@c endfile
@end example

Finally, the @code{END} rule simply prints the totals for all the files:

@example
@c file eg/prog/wc.awk
END @{
    if (print_total) @{
        if (do_lines)
            printf "\t%d", tlines
        if (do_words)
            printf "\t%d", twords
        if (do_chars)
            printf "\t%d", tchars
        if (do_bytes)
            printf "\t%d", tbytes
        print "\ttotal"
    @}
@}
@c endfile
@end example

@node Miscellaneous Programs
@section A Grab Bag of @command{awk} Programs

This @value{SECTION} is a large ``grab bag'' of miscellaneous programs.
We hope you find them both interesting and enjoyable.

@menu
* Dupword Program::             Finding duplicated words in a document.
* Alarm Program::               An alarm clock.
* Translate Program::           A program similar to the @command{tr} utility.
* Labels Program::              Printing mailing labels.
* Word Sorting::                A program to produce a word usage count.
* History Sorting::             Eliminating duplicate entries from a history
                                file.
* Extract Program::             Pulling out programs from Texinfo source
                                files.
* Simple Sed::                  A Simple Stream Editor.
* Igawk Program::               A wrapper for @command{awk} that includes
                                files.
* Anagram Program::             Finding anagrams from a dictionary.
* Signature Program::           People do amazing things with too much time on
                                their hands.
@end menu

@node Dupword Program
@subsection Finding Duplicated Words in a Document

@cindex words @subentry duplicate, searching for
@cindex searching @subentry for words
@cindex documents, searching
A common error when writing large amounts of prose is to accidentally
duplicate words. Typically you will see this in text as something like ``the
the program does the following@dots{}'' When the text is online, often
the duplicated words occur at the end of one line and the
@iftex
the
@end iftex
beginning of
another, making them very difficult to spot.
@c as here!

This program, @file{dupword.awk}, scans through a file one line at a time
and looks for adjacent occurrences of the same word. It also saves the last
word on a line (in the variable @code{prev}) for comparison with the first
word on the next line.

@cindex Texinfo
The first statement makes sure that the line is all lowercase,
so that, for example, ``The'' and ``the'' compare equal to each other.
The next statement replaces nonalphanumeric and nonwhitespace characters
with spaces, so that punctuation does not affect the comparison either.
The characters are replaced with spaces so that formatting controls
don't create nonsense words (e.g., the Texinfo @samp{@@code@{NF@}}
becomes @samp{codeNF} if punctuation is simply deleted). The record is
then resplit into fields, yielding just the actual words on the line,
and ensuring that there are no empty fields.

If there are no fields left after removing all the punctuation, the
current record is skipped.
Otherwise, the program loops through each
word, comparing it to the previous one:

@cindex @file{dupword.awk} program
@example
@c file eg/prog/dupword.awk
# dupword.awk --- find duplicate words in text
@c endfile
@ignore
@c file eg/prog/dupword.awk
#
# Arnold Robbins, arnold@@skeeve.com, Public Domain
# December 1991
# Revised October 2000

@c endfile
@end ignore
@c file eg/prog/dupword.awk
@{
    $0 = tolower($0)
    gsub(/[^[:alnum:][:blank:]]/, " ");
    $0 = $0         # re-split
    if (NF == 0)
        next
    if ($1 == prev)
        printf("%s:%d: duplicate %s\n",
            FILENAME, FNR, $1)
    for (i = 2; i <= NF; i++)
        if ($i == $(i-1))
            printf("%s:%d: duplicate %s\n",
                FILENAME, FNR, $i)
    prev = $NF
@}
@c endfile
@end example

@node Alarm Program
@subsection An Alarm Clock Program
@cindex insomnia, cure for
@cindex Robbins @subentry Arnold
@quotation
@i{Nothing cures insomnia like a ringing alarm clock.}
@author Arnold Robbins
@end quotation
@cindex Quanstrom, Erik
@ignore
Date: Sat, 15 Feb 2014 16:47:09 -0500
Subject: Re: 9atom install question
Message-ID: <l2jcvx6j6mey60xnrkb0hhob.1392500829294@email.android.com>
From: Erik Quanstrom <quanstro@quanstro.net>
To: Aharon Robbins <arnold@skeeve.com>

yes.

- erik

Aharon Robbins <arnold@skeeve.com> wrote:

>> sleep is for web developers.
>
>Can I quote you, in the gawk manual?
>
>Thanks,
>
>Arnold
@end ignore
@quotation
@i{Sleep is for web developers.}
@author Erik Quanstrom
@end quotation

@cindex time @subentry alarm clock example program
@cindex alarm clock example program
The following program is a simple ``alarm clock'' program.
You give it a time of day and an optional message.
At the specified time,
it prints the message on the standard output. In addition, you can give it
the number of times to repeat the message as well as a delay between
repetitions.

This program uses the @code{getlocaltime()} function from
@ref{Getlocaltime Function}.

@cindex ASCII
All the work is done in the @code{BEGIN} rule. The first part is argument
checking and setting of defaults: the delay, the count, and the message to
print. If the user supplied a message without the ASCII BEL
character (known as the ``alert'' character, @code{"\a"}), then it is added to
the message. (On many systems, printing the ASCII BEL generates an
audible alert. Thus, when the alarm goes off, the system calls attention
to itself in case the user is not looking at the computer.)
Just for a change, this program uses a @code{switch} statement
(@pxref{Switch Statement}), but the processing could be done with a series of
@code{if}-@code{else} statements instead.
Here is the program:

@cindex @file{alarm.awk} program
@example
@c file eg/prog/alarm.awk
# alarm.awk --- set an alarm
#
# Requires getlocaltime() library function
@c endfile
@ignore
@c file eg/prog/alarm.awk
#
# Arnold Robbins, arnold@@skeeve.com, Public Domain
# May 1993
# Revised December 2010

@c endfile
@end ignore
@c file eg/prog/alarm.awk
# usage: alarm time [ "message" [ count [ delay ] ] ]

BEGIN @{
    # Initial argument sanity checking
    usage1 = "usage: alarm time ['message' [count [delay]]]"
    usage2 = sprintf("\t(%s) time ::= hh:mm", ARGV[1])

    if (ARGC < 2) @{
        print usage1 > "/dev/stderr"
        print usage2 > "/dev/stderr"
        exit 1
    @}
    switch (ARGC) @{
    case 5:
        delay = ARGV[4] + 0
        # fall through
    case 4:
        count = ARGV[3] + 0
        # fall through
    case 3:
        message = ARGV[2]
        break
    default:
        if (ARGV[1] !~ /[[:digit:]]?[[:digit:]]:[[:digit:]]@{2@}/) @{
            print usage1 > "/dev/stderr"
            print usage2 > "/dev/stderr"
            exit 1
        @}
        break
    @}

    # set defaults for once we reach the desired time
    if (delay == 0)
        delay = 180    # 3 minutes
@group
    if (count == 0)
        count = 5
@end group
    if (message == "")
        message = sprintf("\aIt is now %s!\a", ARGV[1])
    else if (index(message, "\a") == 0)
        message = "\a" message "\a"
@c endfile
@end example

The next @value{SECTION} of code turns the alarm time into hours and minutes,
converts it (if necessary) to a 24-hour clock, and then turns that
time into a count of the seconds since midnight. Next it turns the current
time into a count of seconds since midnight.
The difference between the two
is how long to wait before setting off the alarm:

@example
@c file eg/prog/alarm.awk
    # split up alarm time
    split(ARGV[1], atime, ":")
    hour = atime[1] + 0    # force numeric
    minute = atime[2] + 0  # force numeric

    # get current broken down time
    getlocaltime(now)

    # if time given is 12-hour hours and it's after that
    # hour, e.g., `alarm 5:30' at 9 a.m. means 5:30 p.m.,
    # then add 12 to real hour
    if (hour < 12 && now["hour"] > hour)
        hour += 12

    # set target time in seconds since midnight
    target = (hour * 60 * 60) + (minute * 60)

    # get current time in seconds since midnight
    current = (now["hour"] * 60 * 60) + \
               (now["minute"] * 60) + now["second"]

    # how long to sleep for
    naptime = target - current
    if (naptime <= 0) @{
        print "alarm: time is in the past!" > "/dev/stderr"
        exit 1
    @}
@c endfile
@end example

@cindex @command{sleep} utility
Finally, the program uses the @code{system()} function
(@pxref{I/O Functions})
to call the @command{sleep} utility. The @command{sleep} utility simply pauses
for the given number of seconds. If the exit status is not zero,
the program assumes that @command{sleep} was interrupted and exits. If
@command{sleep} exited with an OK status (zero), then the program prints the
message in a loop, again using @command{sleep} to delay for however many
seconds are necessary:

@example
@c file eg/prog/alarm.awk
    # zzzzzz..... go away if interrupted
    if (system(sprintf("sleep %d", naptime)) != 0)
        exit 1

    # time to notify!
    command = sprintf("sleep %d", delay)
    for (i = 1; i <= count; i++) @{
        print message
        # if sleep command interrupted, go away
        if (system(command) != 0)
            break
    @}

    exit 0
@}
@c endfile
@end example

@node Translate Program
@subsection Transliterating Characters

@cindex characters @subentry transliterating
@cindex @command{tr} utility
The system @command{tr} utility transliterates characters. For example, it is
often used to map uppercase letters into lowercase for further processing:

@example
@var{generate data} | tr 'A-Z' 'a-z' | @var{process data} @dots{}
@end example

@command{tr} requires two lists of characters.@footnote{On some older
systems, including Solaris, the system version of @command{tr} may require
that the lists be written as range expressions enclosed in square brackets
(@samp{[a-z]}) and quoted, to prevent the shell from attempting a
@value{FN} expansion. This is not a feature.} When processing the input, the
first character in the first list is replaced with the first character
in the second list, the second character in the first list is replaced
with the second character in the second list, and so on. If there are
more characters in the ``from'' list than in the ``to'' list, the last
character of the ``to'' list is used for the remaining characters in the
``from'' list.

Once upon a time,
@c early or mid-1989!
a user proposed adding a transliteration function
to @command{gawk}.
@c Wishing to avoid gratuitous new features,
@c at least theoretically
The following program was written to
prove that character transliteration could be done with a user-level
function. This program is not as complete as the system @command{tr} utility,
but it does most of the job.

The @command{translate} program was written long before @command{gawk}
acquired the ability to split each character in a string into separate
array elements. Thus, it makes repeated use of the @code{length()} and
@code{substr()} built-in functions (@pxref{String
Functions}). There are two functions. The first, @code{stranslate()},
takes three arguments:

@table @code
@item from
A list of characters from which to translate

@item to
A list of characters to which to translate

@item target
The string on which to do the translation
@end table

Associative arrays make the translation part fairly easy. @code{t_ar} holds
the ``to'' characters, indexed by the ``from'' characters. Then a simple
loop goes through @code{target}, one character at a time. For each character
in @code{target}, if the character is an index in @code{t_ar},
it is replaced with the corresponding ``to'' character.

The @code{translate()} function calls @code{stranslate()}, using @code{$0}
as the target. The main program sets two global variables, @code{FROM} and
@code{TO}, from the command line, and then changes @code{ARGV} so that
@command{awk} reads from the standard input.

Finally, the processing rule simply calls @code{translate()} for each record:

@cindex @file{translate.awk} program
@example
@c file eg/prog/translate.awk
# translate.awk --- do tr-like stuff
@c endfile
@ignore
@c file eg/prog/translate.awk
#
# Arnold Robbins, arnold@@skeeve.com, Public Domain
# August 1989
# February 2009 - bug fix

@c endfile
@end ignore
@c file eg/prog/translate.awk
# Bugs: does not handle things like tr A-Z a-z; it has
# to be spelled out. However, if `to' is shorter than `from',
# the last character in `to' is used for the rest of `from'.

function stranslate(from, to, target,     lf, lt, ltarget, t_ar, i, c,
                                                               result)
@{
    lf = length(from)
    lt = length(to)
    ltarget = length(target)
    for (i = 1; i <= lt; i++)
        t_ar[substr(from, i, 1)] = substr(to, i, 1)
    if (lt < lf)
        for (; i <= lf; i++)
            t_ar[substr(from, i, 1)] = substr(to, lt, 1)
    for (i = 1; i <= ltarget; i++) @{
        c = substr(target, i, 1)
        if (c in t_ar)
            c = t_ar[c]
        result = result c
    @}
    return result
@}

function translate(from, to)
@{
    return $0 = stranslate(from, to, $0)
@}

# main program
BEGIN @{
@group
    if (ARGC < 3) @{
        print "usage: translate from to" > "/dev/stderr"
        exit
    @}
@end group
    FROM = ARGV[1]
    TO = ARGV[2]
    ARGC = 2
    ARGV[1] = "-"
@}

@{
    translate(FROM, TO)
    print
@}
@c endfile
@end example

It is possible to do character transliteration in a user-level
function, but it is not necessarily efficient, and we (the @command{gawk}
developers) started to consider adding a built-in function. However,
shortly after writing this program, we learned that Brian Kernighan
had added the @code{toupper()} and @code{tolower()} functions to his
@command{awk} (@pxref{String Functions}). These functions handle the
vast majority of the cases where character transliteration is necessary,
and so we chose to simply add those functions to @command{gawk} as well
and then leave well enough alone.

An obvious improvement to this program would be to set up the
@code{t_ar} array only once, in a @code{BEGIN} rule. However, this
assumes that the ``from'' and ``to'' lists
will never change throughout the lifetime of the program.
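
The first improvement might look something like the following sketch.
It is only an illustration and is not part of the @command{gawk}
distribution: the name @file{translate2.awk} and the global array
@code{T_ar} are invented for this example, and the code assumes that
the two lists are fixed once the program starts:

@example
# translate2.awk: hypothetical variant of translate.awk that
# builds the lookup table once, in the BEGIN rule

BEGIN @{
    if (ARGC < 3) @{
        print "usage: translate2 from to" > "/dev/stderr"
        exit 1
    @}
    FROM = ARGV[1]
    TO = ARGV[2]
    ARGC = 2
    ARGV[1] = "-"

    # T_ar (an assumed name) maps each "from" character to its
    # replacement; a short "to" list is padded with its last character
    lf = length(FROM)
    lt = length(TO)
    for (i = 1; i <= lf; i++)
        T_ar[substr(FROM, i, 1)] = substr(TO, (i <= lt ? i : lt), 1)
@}

@{
    # one pass over the record, reusing the table built above
    result = ""
    n = length($0)
    for (i = 1; i <= n; i++) @{
        c = substr($0, i, 1)
        result = result ((c in T_ar) ? T_ar[c] : c)
    @}
    print result
@}
@end example

@noindent
With this sketch, @samp{echo hello | gawk -f translate2.awk el ip}
prints @samp{hippo}. The table is no longer rebuilt for every record,
at the price of a global array.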

Another obvious improvement is to enable the use of ranges,
such as @samp{a-z}, as allowed by the @command{tr} utility.
Look at the code for @file{cut.awk} (@pxref{Cut Program})
for inspiration.


@node Labels Program
@subsection Printing Mailing Labels

@cindex printing @subentry mailing labels
@cindex mailing labels, printing
Here is a ``real-world''@footnote{``Real world'' is defined as
``a program actually used to get something done.''}
program. This
script reads lists of names and
addresses and generates mailing labels. Each page of labels has 20 labels
on it, two across and 10 down. The addresses are guaranteed to be no more
than five lines of data. Each address is separated from the next by a blank
line.

The basic idea is to read 20 labels' worth of data. Each line of each label
is stored in the @code{line} array. The single rule takes care of filling
the @code{line} array and printing the page when 20 labels have been read.

The @code{BEGIN} rule simply sets @code{RS} to the empty string, so that
@command{awk} splits records at blank lines
(@pxref{Records}).
It sets @code{MAXLINES} to 100, because 100 is the maximum number
of lines on the page
@iftex
(@math{20 @cdot 5 = 100}).
@end iftex
@ifnottex
@ifnotdocbook
(20 * 5 = 100).
@end ifnotdocbook
@end ifnottex
@docbook
(20 ⋅ 5 = 100).
@end docbook

Most of the work is done in the @code{printpage()} function.
The label lines are stored sequentially in the @code{line} array. But they
have to print horizontally: @code{line[1]} next to @code{line[6]},
@code{line[2]} next to @code{line[7]}, and so on. Two loops
accomplish this. The outer loop, controlled by @code{i}, steps through
every 10 lines of data; this is each row of labels.
The inner loop,
controlled by @code{j}, goes through the lines within the row.
As @code{j} goes from 0 to 4, @samp{i+j} is the @code{j}th line in
the row, and @samp{i+j+5} is the entry next to it. The output ends up
looking something like this:

@example
line 1          line 6
line 2          line 7
line 3          line 8
line 4          line 9
line 5          line 10
@dots{}
@end example

@noindent
The @code{printf} format string @samp{%-41s} left-aligns
the data and prints it within a fixed-width field.

As a final note, an extra blank line is printed at lines 21 and 61, to keep
the output lined up on the labels. This is dependent on the particular
brand of labels in use when the program was written. You will also note
that there are two blank lines at the top and two blank lines at the bottom.

The @code{END} rule arranges to flush the final page of labels; there may
not have been an even multiple of 20 labels in the data:

@cindex @file{labels.awk} program
@example
@c file eg/prog/labels.awk
# labels.awk --- print mailing labels
@c endfile
@ignore
@c file eg/prog/labels.awk
#
# Arnold Robbins, arnold@@skeeve.com, Public Domain
# June 1992
# December 2010, minor edits
@c endfile
@end ignore
@c file eg/prog/labels.awk

# Each label is 5 lines of data that may have blank lines.
# The label sheets have 2 blank lines at the top and 2 at
# the bottom.

BEGIN @{ RS = "" ; MAXLINES = 100 @}

function printpage(    i, j)
@{
    if (Nlines <= 0)
        return

    printf "\n\n"        # header

    for (i = 1; i <= Nlines; i += 10) @{
        if (i == 21 || i == 61)
            print ""
        for (j = 0; j < 5; j++) @{
            if (i + j > MAXLINES)
                break
            printf " %-41s %s\n", line[i+j], line[i+j+5]
        @}
        print ""
    @}

    printf "\n\n"        # footer

    delete line
@}

# main rule
@{
    if (Count >= 20) @{
        printpage()
        Count = 0
        Nlines = 0
    @}
    n = split($0, a, "\n")
    for (i = 1; i <= n; i++)
        line[++Nlines] = a[i]
    for (; i <= 5; i++)
        line[++Nlines] = ""
    Count++
@}

END @{
    printpage()
@}
@c endfile
@end example

@node Word Sorting
@subsection Generating Word-Usage Counts

@cindex words @subentry usage counts, generating

When working with large amounts of text, it can be interesting to know
how often different words appear. For example, an author may overuse
certain words, in which case he or she might wish to find synonyms to substitute
for words that appear too often. This @value{SUBSECTION} develops a
program for counting words and presenting the frequency information
in a useful format.

At first glance, a program like this would seem to do the job:

@example
# wordfreq-first-try.awk --- print list of word frequencies

@{
    for (i = 1; i <= NF; i++)
        freq[$i]++
@}

@group
END @{
    for (word in freq)
        printf "%s\t%d\n", word, freq[word]
@}
@end group
@end example

The program relies on @command{awk}'s default field-splitting
mechanism to break each line up into ``words'' and uses an
associative array named @code{freq}, indexed by each word, to count
the number of times the word occurs. In the @code{END} rule,
it prints the counts.

This program has several problems that prevent it from being
useful on real text files:

@itemize @value{BULLET}
@item
The @command{awk} language considers upper- and lowercase characters to be
distinct. Therefore, ``bartender'' and ``Bartender'' are not treated
as the same word. This is undesirable, because words are capitalized
if they begin sentences in normal text, and a frequency analyzer should
not be sensitive to capitalization.

@item
Words are detected using the @command{awk} convention that fields are
separated just by whitespace. Other characters in the input (except
newlines) don't have any special meaning to @command{awk}. This means that
punctuation characters count as part of words.

@item
The output does not come out in any useful order. You're more likely to be
interested in which words occur most frequently or in having an alphabetized
table of how frequently each word occurs.
@end itemize

@cindex @command{sort} utility
The first problem can be solved by using @code{tolower()} to remove case
distinctions. The second problem can be solved by using @code{gsub()}
to remove punctuation characters.
Finally, we solve the third problem
by using the system @command{sort} utility to process the output of the
@command{awk} script. Here is the new version of the program:

@cindex @file{wordfreq.awk} program
@example
@c file eg/prog/wordfreq.awk
# wordfreq.awk --- print list of word frequencies

@{
    $0 = tolower($0)    # remove case distinctions
    # remove punctuation
    gsub(/[^[:alnum:]_[:blank:]]/, "", $0)
    for (i = 1; i <= NF; i++)
        freq[$i]++
@}

@c endfile
END @{
    for (word in freq)
        printf "%s\t%d\n", word, freq[word]
@}
@end example

The regexp @code{/[^[:alnum:]_[:blank:]]/} might have been written
@code{/[[:punct:]]/}, but then underscores would also be removed,
and we want to keep them.

Assuming we have saved this program in a file named @file{wordfreq.awk},
and that the data is in @file{file1}, the following pipeline:

@example
awk -f wordfreq.awk file1 | sort -k 2nr
@end example

@noindent
produces a table of the words appearing in @file{file1} in order of
decreasing frequency.

The @command{awk} program suitably massages the
data and produces a word frequency table, which is not ordered.
The @command{awk} script's output is then sorted by the @command{sort}
utility and printed on the screen.

The options given to @command{sort}
specify a sort that uses the second field of each input line (skipping
one field), that the sort keys should be treated as numeric quantities
(otherwise @samp{15} would come before @samp{5}), and that the sorting
should be done in descending (reverse) order.
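
To see the effect of these options in isolation, here is a small
illustration that feeds @command{sort} three tab-separated lines in
the format that @file{wordfreq.awk} prints. The words and counts are
made up for the example:

@example
printf 'the\t15\nawk\t5\nsed\t120\n' | sort -k 2nr
@end example

@noindent
The line for @samp{sed} (120) comes out first, followed by
@samp{the} (15) and then @samp{awk} (5).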

The @command{sort} could even be done from within the program, by changing
the @code{END} action to:

@example
@c file eg/prog/wordfreq.awk
END @{
    sort = "sort -k 2nr"
    for (word in freq)
        printf "%s\t%d\n", word, freq[word] | sort
    close(sort)
@}
@c endfile
@end example

This way of sorting must be used on systems that do not
have true pipes at the command-line (or batch-file) level.
See the general operating system documentation for more information on how
to use the @command{sort} program.

@node History Sorting
@subsection Removing Duplicates from Unsorted Text

@cindex lines @subentry duplicate, removing
The @command{uniq} program
(@pxref{Uniq Program})
removes duplicate lines from @emph{sorted} data.

Suppose, however, you need to remove duplicate lines from a @value{DF} but
that you want to preserve the order the lines are in. A good example of
this might be a shell history file. The history file keeps a copy of all
the commands you have entered, and it is not unusual to repeat a command
several times in a row. Occasionally you might want to compact the history
by removing duplicate entries. Yet it is desirable to maintain the order
of the original commands.

This simple program does the job. It uses two arrays. The @code{data}
array is indexed by the text of each line.
For each line, @code{data[$0]} is incremented.
If a particular line has not
been seen before, then @code{data[$0]} is zero.
In this case, the text of the line is stored in @code{lines[count]}.
Each element of @code{lines} is a unique command, and the indices of
@code{lines} indicate the order in which those lines are encountered.
The @code{END} rule simply prints out the lines, in order:

@cindex Rakitzis, Byron
@cindex @file{histsort.awk} program
@example
@c file eg/prog/histsort.awk
# histsort.awk --- compact a shell history file
# Thanks to Byron Rakitzis for the general idea
@c endfile
@ignore
@c file eg/prog/histsort.awk
#
# Arnold Robbins, arnold@@skeeve.com, Public Domain
# May 1993
@c endfile
@end ignore
@c file eg/prog/histsort.awk

@group
@{
    if (data[$0]++ == 0)
        lines[++count] = $0
@}
@end group

@group
END @{
    for (i = 1; i <= count; i++)
        print lines[i]
@}
@end group
@c endfile
@end example

This program also provides a foundation for generating other useful
information. For example, using the following @code{print} statement in the
@code{END} rule indicates how often a particular command is used:

@example
print data[lines[i]], lines[i]
@end example

@noindent
This works because @code{data[$0]} is incremented each time a line is
seen.

@c rick@openfortress.nl, Tue, 24 Dec 2019 13:43:06 +0100
Rick van Rein offers the following one-liner to do the same job of
removing duplicates from unsorted text:

@example
awk '@{ if (! seen[$0]++) print @}'
@end example

This can be simplified even further, at the risk of becoming
almost too obscure:

@example
awk '! seen[$0]++'
@end example

@noindent
This version uses the expression as a pattern, relying on
@command{awk}'s default action of printing the line when
the pattern is true.
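
As a quick check of the pattern-only version, a handful of history
lines, made up for this illustration, can be piped through it:

@example
printf 'ls\ncd /tmp\nls\nmake\ncd /tmp\n' | awk '! seen[$0]++'
@end example

@noindent
The output is @samp{ls}, @samp{cd /tmp}, and @samp{make}, each on its
own line: only the first occurrence of each line is printed, and the
original order is kept.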

@node Extract Program
@subsection Extracting Programs from Texinfo Source Files

@cindex Texinfo @subentry extracting programs from source files
@cindex files @subentry Texinfo, extracting programs from
@ifnotinfo
Both this chapter and the previous chapter
(@ref{Library Functions})
present a large number of @command{awk} programs.
@end ifnotinfo
@ifinfo
The nodes
@ref{Library Functions},
and @ref{Sample Programs},
are the top-level nodes for a large number of @command{awk} programs.
@end ifinfo
If you want to experiment with these programs, it is tedious to type
them in by hand. Here we present a program that can extract parts of a
Texinfo input file into separate files.

@cindex Texinfo
This @value{DOCUMENT} is written in @uref{https://www.gnu.org/software/texinfo/, Texinfo},
the GNU Project's document formatting language.
A single Texinfo source file can be used to produce both
printed documentation, with @TeX{}, and online documentation.
@ifnotinfo
(Texinfo is fully documented in the book
@cite{Texinfo---The GNU Documentation Format},
available from the Free Software Foundation,
and also available @uref{https://www.gnu.org/software/texinfo/manual/texinfo/, online}.)
@end ifnotinfo
@ifinfo
(The Texinfo language is described fully, starting with
@inforef{Top, , Texinfo, texinfo,Texinfo---The GNU Documentation Format}.)
@end ifinfo

For our purposes, it is enough to know three things about Texinfo input
files:

@itemize @value{BULLET}
@item
The ``at'' symbol (@samp{@@}) is special in Texinfo, much as
the backslash (@samp{\}) is in C
or @command{awk}. Literal @samp{@@} symbols are represented in Texinfo source
files as @samp{@@@@}.

@item
Comments start with either @samp{@@c} or @samp{@@comment}.
The file-extraction program works by using special comments that start
at the beginning of a line.

@item
Lines containing @samp{@@group} and @samp{@@end group} commands bracket
example text that should not be split across a page boundary.
(Unfortunately, @TeX{} isn't always smart enough to do things exactly right,
so we have to give it some help.)
@end itemize

The following program, @file{extract.awk}, reads through a Texinfo source
file and does two things, based on the special comments.
Upon seeing @samp{@w{@@c system @dots{}}},
it runs a command, by extracting the command text from the
control line and passing it on to the @code{system()} function
(@pxref{I/O Functions}).
Upon seeing @samp{@@c file @var{filename}}, it sends each subsequent line to
the file @var{filename}, until @samp{@@c endfile} is encountered.
The rules in @file{extract.awk} match either @samp{@@c} or
@samp{@@comment} by letting the @samp{omment} part be optional.
Lines containing @samp{@@group} and @samp{@@end group} are simply removed.
@file{extract.awk} uses the @code{join()} library function
(@pxref{Join Function}).

The example programs in the online Texinfo source for @cite{@value{TITLE}}
(@file{gawktexi.in}) have all been bracketed inside @samp{file} and
@samp{endfile} lines. The @command{gawk} distribution uses a copy of
@file{extract.awk} to extract the sample programs and install many
of them in a standard directory where @command{gawk} can find them.
The Texinfo file looks something like this:

@example
@dots{}
This program has a @@code@{BEGIN@} rule
that prints a nice message:

@@example
@@c file examples/messages.awk
BEGIN @@@{ print "Don't panic!"
@@@}
@@c endfile
@@end example

It also prints some final advice:

@@example
@@c file examples/messages.awk
END @@@{ print "Always avoid bored archaeologists!" @@@}
@@c endfile
@@end example
@dots{}
@end example

@file{extract.awk} begins by setting @code{IGNORECASE} to one, so that
mixed upper- and lowercase letters in the directives won't matter.

The first rule handles calling @code{system()}, checking that a command is
given (@code{NF} is at least three) and also checking that the command
exits with a zero exit status, signifying OK:

@cindex @file{extract.awk} program
@example
@c file eg/prog/extract.awk
# extract.awk --- extract files and run programs from Texinfo files
@c endfile
@ignore
@c file eg/prog/extract.awk
#
# Arnold Robbins, arnold@@skeeve.com, Public Domain
# May 1993
# Revised September 2000
@c endfile
@end ignore
@c file eg/prog/extract.awk

BEGIN @{ IGNORECASE = 1 @}

/^@@c(omment)?[ \t]+system/ @{
    if (NF < 3) @{
        e = ("extract: " FILENAME ":" FNR)
        e = (e ": badly formed `system' line")
        print e > "/dev/stderr"
        next
    @}
    $1 = ""
    $2 = ""
    stat = system($0)
    if (stat != 0) @{
        e = ("extract: " FILENAME ":" FNR)
        e = (e ": warning: system returned " stat)
        print e > "/dev/stderr"
    @}
@}
@c endfile
@end example

@noindent
The variable @code{e} is used so that the rule
fits nicely on the @value{PAGE}.

The second rule handles moving data into files. It verifies that a
@value{FN} is given in the directive. If the file named is not the
current file, then the current file's name is saved in @code{filelist}
so that the file can be closed later.
Keeping a file
open from one directive to the next allows the use of the @samp{>}
redirection for printing the contents, keeping open-file management
simple.

The @code{for} loop does the work. It reads lines using @code{getline}
(@pxref{Getline}).
For an unexpected end-of-file, it calls the @code{@w{unexpected_eof()}}
function. If the line is an ``endfile'' line, then it breaks out of
the loop.
If the line is an @samp{@@group} or @samp{@@end group} line, then it
ignores it and goes on to the next line.
Similarly, comments within examples are also ignored.

Most of the work is in the following few lines. If the line has no @samp{@@}
symbols, the program can print it directly.
Otherwise, each leading @samp{@@} must be stripped off.
To remove the @samp{@@} symbols, the line is split into separate elements of
the array @code{a}, using the @code{split()} function
(@pxref{String Functions}).
The @samp{@@} symbol is used as the separator character.
Each element of @code{a} that is empty indicates two successive @samp{@@}
symbols in the original line. For each two empty elements (@samp{@@@@} in
the original file), we have to add a single @samp{@@} symbol back in.

When the processing of the array is finished, @code{join()} is called with the
value of @code{SUBSEP} (@pxref{Multidimensional}),
to rejoin the pieces into a single
line.
That line is then printed to the output file:

@example
@c file eg/prog/extract.awk
/^@@c(omment)?[ \t]+file/ @{
    if (NF != 3) @{
        e = ("extract: " FILENAME ":" FNR ": badly formed `file' line")
        print e > "/dev/stderr"
        next
    @}
    if ($3 != curfile) @{
        if (curfile != "")
            filelist[curfile] = 1   # save to close later
        curfile = $3
    @}

    for (;;) @{
        if ((getline line) <= 0)
            unexpected_eof()
        if (line ~ /^@@c(omment)?[ \t]+endfile/)
            break
        else if (line ~ /^@@(end[ \t]+)?group/)
            continue
        else if (line ~ /^@@c(omment)?[ \t]+/)
            continue
        if (index(line, "@@") == 0) @{
            print line > curfile
            continue
        @}
        n = split(line, a, "@@")
        # if a[1] == "", means leading @@,
        # don't add one back in.
        for (i = 2; i <= n; i++) @{
            if (a[i] == "") @{ # was an @@@@
                a[i] = "@@"
                if (a[i+1] == "")
                    i++
            @}
        @}
@group
        print join(a, 1, n, SUBSEP) > curfile
    @}
@}
@end group
@c endfile
@end example

An important thing to note is the use of the @samp{>} redirection.
Output done with @samp{>} only opens the file once; it stays open and
subsequent output is appended to the file
(@pxref{Redirection}).
This makes it easy to mix program text and explanatory prose for the same
sample source file (as has been done here!) without any hassle. Each file is
closed only when the end of the input file is reached.

When a new @value{FN} is encountered, instead of closing the file,
the program saves the name of the current file in @code{filelist}.
This makes it possible to interleave the code for more than one file in
the Texinfo input file. (Previous versions of this program @emph{did}
close the file.
But because of the @samp{>} redirection, a file whose
parts were not all one after the other ended up getting clobbered.)
An @code{END} rule then closes all the open files when processing
is finished:

@example
@c file eg/prog/extract.awk
@group
END @{
    close(curfile)          # close the last one
    for (f in filelist)     # close all the rest
        close(f)
@}
@end group
@c endfile
@end example

Finally, the function @code{@w{unexpected_eof()}} prints an appropriate
error message and then exits:

@example
@c file eg/prog/extract.awk
@group
function unexpected_eof()
@{
    printf("extract: %s:%d: unexpected EOF or error\n",
                     FILENAME, FNR) > "/dev/stderr"
    exit 1
@}
@end group
@c endfile
@end example

@node Simple Sed
@subsection A Simple Stream Editor

@cindex @command{sed} utility
@cindex stream editors
The @command{sed} utility is a @dfn{stream editor}, a program that reads a
stream of data, makes changes to it, and passes it on.
It is often used to make global changes to a large file or to a stream
of data generated by a pipeline of commands.
Although @command{sed} is a complicated program in its own right, its most common
use is to perform global substitutions in the middle of a pipeline:

@example
@var{command1} < orig.data | sed 's/old/new/g' | @var{command2} > result
@end example

Here, @samp{s/old/new/g} tells @command{sed} to look for the regexp
@samp{old} on each input line and globally replace it with the text
@samp{new} (i.e., all the occurrences on a line). This is similar to
@command{awk}'s @code{gsub()} function
(@pxref{String Functions}).

The following program, @file{awksed.awk}, accepts at least two command-line
arguments: the pattern to look for and the text to replace it with.
Any
additional arguments are treated as @value{DF} names to process. If none
are provided, the standard input is used:

@cindex Brennan, Michael
@cindex @file{awksed.awk} program
@c @cindex simple stream editor
@c @cindex stream editor, simple
@example
@c file eg/prog/awksed.awk
# awksed.awk --- do s/foo/bar/g using just print
# Thanks to Michael Brennan for the idea
@c endfile
@ignore
@c file eg/prog/awksed.awk
#
# Arnold Robbins, arnold@@skeeve.com, Public Domain
# August 1995
@c endfile
@end ignore
@c file eg/prog/awksed.awk

function usage()
@{
    print "usage: awksed pat repl [files...]" > "/dev/stderr"
    exit 1
@}

@group
BEGIN @{
    # validate arguments
    if (ARGC < 3)
        usage()
@end group

    RS = ARGV[1]
    ORS = ARGV[2]

    # don't use arguments as files
    ARGV[1] = ARGV[2] = ""
@}

@group
# look ma, no hands!
@{
    if (RT == "")
        printf "%s", $0
    else
        print
@}
@end group
@c endfile
@end example

The program relies on @command{gawk}'s ability to have @code{RS} be a regexp,
as well as on the setting of @code{RT} to the actual text that terminates the
record (@pxref{Records}).

The idea is to have @code{RS} be the pattern to look for. @command{gawk}
automatically sets @code{$0} to the text between matches of the pattern.
This is text that we want to keep, unmodified. Then, by setting @code{ORS}
to the replacement text, a simple @code{print} statement outputs the
text we want to keep, followed by the replacement text.

There is one wrinkle to this scheme, which is what to do if the last record
doesn't end with text that matches @code{RS}. Using a @code{print}
statement unconditionally prints the replacement text, which is not correct.
However, if the file did not end in text that matches @code{RS}, @code{RT}
is set to the null string. In this case, we can print @code{$0} using
@code{printf}
(@pxref{Printf}).

The @code{BEGIN} rule handles the setup, checking for the right number
of arguments and calling @code{usage()} if there is a problem. Then it sets
@code{RS} and @code{ORS} from the command-line arguments and sets
@code{ARGV[1]} and @code{ARGV[2]} to the null string, so that they are
not treated as @value{FN}s
(@pxref{ARGC and ARGV}).

The @code{usage()} function prints an error message and exits.
Finally, the single rule handles the printing scheme outlined earlier,
using @code{print} or @code{printf} as appropriate, depending upon the
value of @code{RT}.

@node Igawk Program
@subsection An Easy Way to Use Library Functions

@cindex libraries of @command{awk} functions @subentry example program for using
@cindex functions @subentry library @subentry example program for using
In @ref{Include Files}, we saw how @command{gawk} provides a built-in
file-inclusion capability. However, this is a @command{gawk} extension.
This @value{SECTION} provides the motivation for making file inclusion
available for standard @command{awk}, and shows how to do it using a
combination of shell and @command{awk} programming.

Using library functions in @command{awk} can be very beneficial. It
encourages code reuse and the writing of general functions. Programs are
smaller and therefore clearer.
However, using library functions is only easy when writing @command{awk}
programs; it is painful when running them, requiring multiple @option{-f}
options. If @command{gawk} is unavailable, then so too is the @env{AWKPATH}
environment variable and the ability to put @command{awk} functions into a
library directory (@pxref{Options}).
It would be nice to be able to write programs in the following manner:

@example
# library functions
@@include getopt.awk
@@include join.awk
@dots{}

# main program
BEGIN @{
    while ((c = getopt(ARGC, ARGV, "a:b:cde")) != -1)
        @dots{}
    @dots{}
@}
@end example

The following program, @file{igawk.sh}, provides this service.
It simulates @command{gawk}'s searching of the @env{AWKPATH} variable
and also allows @dfn{nested} includes (i.e., a file that is included
with @code{@@include} can contain further @code{@@include} statements).
@command{igawk} makes an effort to only include files once, so that nested
includes don't accidentally include a library function twice.

@command{igawk} should behave just like @command{gawk} externally. This
means it should accept all of @command{gawk}'s command-line arguments,
including the ability to have multiple source files specified via
@option{-f} and the ability to mix command-line and library source files.

The program is written using the POSIX Shell (@command{sh}) command
language.@footnote{Fully explaining the @command{sh} language is beyond
the scope of this book. We provide some minimal explanations, but see
a good shell programming book if you wish to understand things in more
depth.} It works as follows:

@enumerate
@item
Loop through the arguments, saving anything that doesn't represent
@command{awk} source code for later, when the expanded program is run.

@item
For any arguments that do represent @command{awk} text, put the arguments into
a shell variable that will be expanded. There are two cases:

@enumerate a
@item
Literal text, provided with @option{-e} or @option{--source}. This
text is just appended directly.

@item
Source @value{FN}s, provided with @option{-f}.
We use a neat trick and
append @samp{@@include @var{filename}} to the shell variable's contents.
Because the file-inclusion program works the way @command{gawk} does, this
gets the text of the file included in the program at the correct point.
@end enumerate

@item
Run an @command{awk} program (naturally) over the shell variable's contents to expand
@code{@@include} statements. The expanded program is placed in a second
shell variable.

@item
Run the expanded program with @command{gawk} and any other original command-line
arguments that the user supplied (such as the @value{DF} names).
@end enumerate

This program uses shell variables extensively: for storing command-line arguments and
the text of the @command{awk} program that will expand the user's program, for the
user's original program, and for the expanded program. Doing so removes some
potential problems that might arise were we to use temporary files instead,
at the cost of making the script somewhat more complicated.

The initial part of the program turns on shell tracing if the first
argument is @samp{debug}.

The next part loops through all the command-line arguments.
There are several cases of interest:

@c @asis for docbook
@table @asis
@item @option{--}
This ends the arguments to @command{igawk}. Anything else should be passed on
to the user's @command{awk} program without being evaluated.

@item @option{-W}
This indicates that the next option is specific to @command{gawk}. To make
argument processing easier, the @option{-W} is prepended to the
remaining arguments and the loop continues. (This is an @command{sh}
programming trick. Don't worry about it if you are not familiar with
@command{sh}.)

@item @option{-v}, @option{-F}
These are saved and passed on to @command{gawk}.

@item @option{-f}, @option{--file}, @option{--file=}, @option{-Wfile=}
The @value{FN} is appended to the shell variable @code{program} with an
@code{@@include} statement.
The @command{expr} utility is used to remove the leading option part of the
argument (e.g., @samp{--file=}).
(Typical @command{sh} usage would be to use the @command{echo} and @command{sed}
utilities to do this work. Unfortunately, some versions of @command{echo} evaluate
escape sequences in their arguments, possibly mangling the program text.
Using @command{expr} avoids this problem.)

@item @option{--source}, @option{--source=}, @option{-Wsource=}
The source text is appended to @code{program}.

@item @option{--version}, @option{-Wversion}
@command{igawk} prints its version number, runs @samp{gawk --version}
to get the @command{gawk} version information, and then exits.
@end table

If none of the @option{-f}, @option{--file}, @option{-Wfile}, @option{--source},
or @option{-Wsource} arguments are supplied, then the first nonoption argument
should be the @command{awk} program. If there are no command-line
arguments left, @command{igawk} prints an error message and exits.
Otherwise, the first argument is appended to @code{program}.
In any case, after the arguments have been processed,
the shell variable
@code{program} contains the complete text of the original @command{awk}
program.

The program is as follows:

@cindex @code{igawk.sh} program
@example
@c file eg/prog/igawk.sh
#! /bin/sh
# igawk --- like gawk but do @@include processing
@c endfile
@ignore
@c file eg/prog/igawk.sh
#
# Arnold Robbins, arnold@@skeeve.com, Public Domain
# July 1993
# December 2010, minor edits
@c endfile
@end ignore
@c file eg/prog/igawk.sh

if [ "$1" = debug ]
then
    set -x
    shift
fi

# A literal newline, so that program text is formatted correctly
n='
'

# Initialize variables to empty
program=
opts=

while [ $# -ne 0 ] # loop over arguments
do
    case $1 in
    --)     shift
            break ;;

    -W)     shift
            # The $@{x?'message here'@} construct prints a
            # diagnostic if $x is unset
            set -- -W"$@{@@?'missing operand'@}"
            continue ;;

    -[vF])  opts="$opts $1 '$@{2?'missing operand'@}'"
            shift ;;

    -[vF]*) opts="$opts '$1'" ;;

    -f)     program="$program$n@@include $@{2?'missing operand'@}"
            shift ;;

    -f*)    f=$(expr "$1" : '-f\(.*\)')
            program="$program$n@@include $f" ;;

    -[W-]file=*)
            f=$(expr "$1" : '-.file=\(.*\)')
            program="$program$n@@include $f" ;;

    -[W-]file)
            program="$program$n@@include $@{2?'missing operand'@}"
            shift ;;

    -[W-]source=*)
            t=$(expr "$1" : '-.source=\(.*\)')
            program="$program$n$t" ;;

    -[W-]source)
            program="$program$n$@{2?'missing operand'@}"
            shift ;;

    -[W-]version)
            echo igawk: version 3.0 1>&2
            gawk --version
            exit 0 ;;

    -[W-]*) opts="$opts '$1'" ;;

    *)      break ;;
    esac
    shift
done

if [ -z "$program" ]
then
     program=$@{1?'missing program'@}
     shift
fi

# At this point, `program' has the program.
@c endfile
@end example

The @command{awk} program to process @code{@@include} directives
is stored in the shell variable @code{expand_prog}. Doing this keeps
the shell script readable. The @command{awk} program
reads through the user's program, one line at a time, using @code{getline}
(@pxref{Getline}). The input
@value{FN}s and @code{@@include} statements are managed using a stack.
As each @code{@@include} is encountered, the current @value{FN} is
``pushed'' onto the stack and the file named in the @code{@@include}
directive becomes the current @value{FN}. As each file is finished,
the stack is ``popped,'' and the previous input file becomes the current
input file again. The process is started by making the original file
the first one on the stack.

The @code{pathto()} function does the work of finding the full path to
a file. It simulates @command{gawk}'s behavior when searching the
@env{AWKPATH} environment variable
(@pxref{AWKPATH Variable}).
If a @value{FN} has a @samp{/} in it, no path search is done.
Similarly, if the @value{FN} is @code{"-"}, then that string is
used as-is. Otherwise,
the @value{FN} is concatenated with the name of each directory in
the path, and an attempt is made to open the generated @value{FN}.
The only way to test if a file can be read in @command{awk} is to go
ahead and try to read it with @code{getline}; this is what @code{pathto()}
does.@footnote{On some very old versions of @command{awk}, the test
@samp{getline junk < t} can loop forever if the file exists but is empty.}
If the file can be read, it is closed and the @value{FN}
is returned:

@ignore
An alternative way to test for the file's existence would be to call
@samp{system("test -r " t)}, which uses the @command{test} utility to
see if the file exists and is readable.
The disadvantage to this method
is that it requires creating an extra process and can thus be slightly
slower.
@end ignore

@example
@c file eg/prog/igawk.sh
expand_prog='

function pathto(file,    i, t, junk)
@{
    if (index(file, "/") != 0)
        return file

    if (file == "-")
        return file

    for (i = 1; i <= ndirs; i++) @{
        t = (pathlist[i] "/" file)
@group
        if ((getline junk < t) > 0) @{
            # found it
            close(t)
            return t
        @}
@end group
    @}
    return ""
@}
@c endfile
@end example

The main program is contained inside one @code{BEGIN} rule. The first thing it
does is set up the @code{pathlist} array that @code{pathto()} uses. After
splitting the path on @samp{:}, null elements are replaced with @code{"."},
which represents the current directory:

@example
@c file eg/prog/igawk.sh
BEGIN @{
    path = ENVIRON["AWKPATH"]
    ndirs = split(path, pathlist, ":")
    for (i = 1; i <= ndirs; i++) @{
        if (pathlist[i] == "")
            pathlist[i] = "."
    @}
@c endfile
@end example

The stack is initialized with @code{ARGV[1]}, which will be @code{"/dev/stdin"}.
The main loop comes next. Input lines are read in succession. Lines that
do not start with @code{@@include} are printed verbatim.
If the line does start with @code{@@include}, the @value{FN} is in @code{$2}.
@code{pathto()} is called to generate the full path. If it cannot, then the program
prints an error message and continues.

The next thing to check is if the file is included already. The
@code{processed} array is indexed by the full @value{FN} of each included
file and it tracks this information for us. If the file is
seen again, a warning message is printed. Otherwise, the new @value{FN} is
pushed onto the stack and processing continues.

Finally, when @code{getline} encounters the end of the input file, the file
is closed and the stack is popped. When @code{stackptr} is less than zero,
the program is done:

@example
@c file eg/prog/igawk.sh
    stackptr = 0
    input[stackptr] = ARGV[1] # ARGV[1] is first file

    for (; stackptr >= 0; stackptr--) @{
        while ((getline < input[stackptr]) > 0) @{
            if (tolower($1) != "@@include") @{
                print
                continue
            @}
            fpath = pathto($2)
            if (fpath == "") @{
                printf("igawk: %s:%d: cannot find %s\n",
                    input[stackptr], FNR, $2) > "/dev/stderr"
                continue
            @}
            if (! (fpath in processed)) @{
                processed[fpath] = input[stackptr]
                input[++stackptr] = fpath  # push onto stack
            @} else
                print $2, "included in", input[stackptr],
                    "already included in",
                    processed[fpath] > "/dev/stderr"
        @}
        close(input[stackptr])
    @}
@}'  # close quote ends `expand_prog' variable

processed_program=$(gawk -- "$expand_prog" /dev/stdin << EOF
$program
EOF
)
@c endfile
@end example

The shell construct @samp{@var{command} << @var{marker}} is called
a @dfn{here document}. Everything in the shell script up to the
@var{marker} is fed to @var{command} as input. The shell processes
the contents of the here document for variable and command substitution
(and possibly other things as well, depending upon the shell).

The shell construct @samp{$(@dots{})} is called @dfn{command substitution}.
The output of the command inside the parentheses is substituted
into the command line.
Because the result is used in a variable assignment,
it is saved as a single string, even if the results contain whitespace.

The expanded program is saved in the variable @code{processed_program}.
It's done in these steps:

@enumerate
@item
Run @command{gawk} with the @code{@@include}-processing program (the
value of the @code{expand_prog} shell variable) reading standard input.

@item
Standard input is the contents of the user's program,
from the shell variable @code{program}.
Feed its contents to @command{gawk} via a here document.

@item
Save the results of this processing in the shell variable
@code{processed_program} by using command substitution.
@end enumerate

The last step is to call @command{gawk} with the expanded program,
along with the original
options and command-line arguments that the user supplied:

@example
@c file eg/prog/igawk.sh
eval gawk $opts -- '"$processed_program"' '"$@@"'
@c endfile
@end example

The @command{eval} command is a shell construct that reruns the shell's parsing
process. This keeps things properly quoted.

This is the fifth version of @command{igawk}.
There are four key simplifications that make the program work better:

@itemize @value{BULLET}
@item
Using @code{@@include} even for the files named with @option{-f} makes building
the initial collected @command{awk} program much simpler; all the
@code{@@include} processing can be done once.

@item
Not trying to save the line read with @code{getline}
in the @code{pathto()} function when testing for the
file's accessibility for use with the main program simplifies things
considerably.

@item
Using a @code{getline} loop in the @code{BEGIN} rule does it all in one
place. It is not necessary to call out to a separate loop for processing
nested @code{@@include} statements.

@item
Instead of saving the expanded program in a temporary file, putting it in a shell variable
avoids some potential security problems.
This has the disadvantage that the script relies upon more features
of the @command{sh} language, making it harder to follow for those who
aren't familiar with @command{sh}.
@end itemize

Also, this program illustrates that it is often worthwhile to combine
@command{sh} and @command{awk} programming. You can usually
accomplish quite a lot, without having to resort to low-level programming
in C or C++, and it is frequently easier to do certain kinds of string
and argument manipulation using the shell than it is in @command{awk}.

Finally, @command{igawk} shows that it is not always necessary to add new
features to a program; they can often be layered on top.@footnote{@command{gawk}
does @code{@@include} processing itself in order to support the use
of @command{awk} programs as Web CGI scripts.}


@node Anagram Program
@subsection Finding Anagrams from a Dictionary

@cindex anagrams, finding
An interesting programming challenge is to
search for @dfn{anagrams} in a
word list (such as
@file{/usr/share/dict/words} on many GNU/Linux systems).
One word is an anagram of another if both words contain
the same letters
(e.g., ``babbling'' and ``blabbing'').

Column 2, Problem C, of Jon Bentley's @cite{Programming Pearls}, Second
Edition, presents an elegant algorithm. The idea is to give words that
are anagrams a common signature, sort all the words together by their
signatures, and then print them. Dr.@: Bentley observes that taking the
letters in each word and sorting them produces those common signatures.
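
To make the signature idea concrete, here is a small sketch that computes
the signatures of the two words mentioned earlier. It is an illustration
only and is not part of @file{anagram.awk}; it uses standard shell
utilities instead of @command{awk}, and assumes POSIX versions of
@command{fold}, @command{sort}, and @command{tr}:

@example
$ @kbd{echo babbling | fold -w 1 | sort | tr -d '\n'; echo}
@print{} abbbgiln
$ @kbd{echo blabbing | fold -w 1 | sort | tr -d '\n'; echo}
@print{} abbbgiln
@end example

@noindent
Both words yield the same sorted-letter string, which is what
marks them as anagrams of each other.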

The following program uses arrays of arrays to bring together
words with the same signature and array sorting to print the words
in sorted order:

@cindex @file{anagram.awk} program
@example
@c file eg/prog/anagram.awk
# anagram.awk --- An implementation of the anagram-finding algorithm
#                 from Jon Bentley's "Programming Pearls," 2nd edition.
#                 Addison Wesley, 2000, ISBN 0-201-65788-0.
#                 Column 2, Problem C, section 2.8, pp 18-20.
@c endfile
@ignore
@c file eg/prog/anagram.awk
#
# This program requires gawk 4.0 or newer.
# Required gawk-specific features:
#   - True multidimensional arrays
#   - split() with "" as separator splits out individual characters
#   - asort() and asorti() functions
#
# See https://savannah.gnu.org/projects/gawk.
#
# Arnold Robbins
# arnold@@skeeve.com
# Public Domain
# January, 2011
@c endfile
@end ignore
@c file eg/prog/anagram.awk

/'s$/   @{ next @}        # Skip possessives
@c endfile
@end example

The program starts with a header, and then a rule to skip
possessives in the dictionary file. The next rule builds
up the data structure. The first dimension of the array
is indexed by the signature; the second dimension is the word
itself:

@example
@c file eg/prog/anagram.awk
@{
    key = word2key($1)  # Build signature
    data[key][$1] = $1  # Store word with signature
@}
@c endfile
@end example

The @code{word2key()} function creates the signature.
It splits the word apart into individual letters,
sorts the letters, and then joins them back together:

@example
@c file eg/prog/anagram.awk
# word2key --- split word apart into letters, sort, and join back together

function word2key(word,     a, i, n, result)
@{
    n = split(word, a, "")
    asort(a)

    for (i = 1; i <= n; i++)
        result = result a[i]

    return result
@}
@c endfile
@end example

Finally, the @code{END} rule traverses the array
and prints out the anagram lists. It sends the output
to the system @command{sort} command because otherwise
the anagrams would appear in arbitrary order:

@example
@c file eg/prog/anagram.awk
END @{
    sort = "sort"
    for (key in data) @{
        # Sort words with same key
        nwords = asorti(data[key], words)
        if (nwords == 1)
            continue

        # And print. Minor glitch: trailing space at end of each line
        for (j = 1; j <= nwords; j++)
            printf("%s ", words[j]) | sort
        print "" | sort
    @}
    close(sort)
@}
@c endfile
@end example

Here is some partial output when the program is run:

@example
$ @kbd{gawk -f anagram.awk /usr/share/dict/words | grep '^b'}
@dots{}
babbled blabbed
babbler blabber brabble
babblers blabbers brabbles
babbling blabbing
babbly blabby
babel bable
babels beslab
babery yabber
@dots{}
@end example


@node Signature Program
@subsection And Now for Something Completely Different

@cindex signature program
@cindex Brini, Davide
The following program was written by Davide Brini
@c (@email{dave_br@@gmx.com})
and is published on @uref{http://backreference.org/2011/02/03/obfuscated-awk/,
his website}.
It serves as his signature in the Usenet group @code{comp.lang.awk}.
He supplies the following copyright terms:

@quotation
Copyright @copyright{} 2008 Davide Brini

Copying and distribution of the code published in this page, with or without
modification, are permitted in any medium without royalty provided the copyright
notice and this notice are preserved.
@end quotation

Here is the program:

@example
@group
awk 'BEGIN@{O="~"~"~";o="=="=="==";o+=+o;x=O""O;while(X++<=x+o+o)c=c"%c";
printf c,(x-O)*(x-O),x*(x-o)-o,x*(x-O)+x-O-o,+x*(x-O)-x+o,X*(o*o+O)+x-O,
X*(X-x)-o*o,(x+X)*o*o+o,x*(X-x)-O-O,x-O+(O+o+X+x)*(o+O),X*X-X*(x-O)-x+O,
O+X*(o*(o+O)+O),+x+O+X*o,x*(x-o),(o+X+x)*o*o-(x-O-O),O+(X-x)*(X+O),x-O@}'
@end group
@end example

@cindex Johansen, Chris
We leave it to you to determine what the program does. (If you are
truly desperate to understand it, see Chris Johansen's explanation,
which is embedded in the Texinfo source file for this @value{DOCUMENT}.)

@ignore
To: "Arnold Robbins" <arnold@skeeve.com>
Date: Sat, 20 Aug 2011 13:50:46 -0400
Subject: The GNU Awk User's Guide, Section 13.3.11
From: "Chris Johansen" <johansen@main.nc.us>
Message-ID: <op.v0iw6wlv7finx3@asusodin.thrudvang.lan>

Arnold, you don't know me, but we have a tenuous connection. My wife is
Barbara A. Field, FAIA, GIT '65 (B. Arch.).

I have had a couple of paper copies of "Effective Awk Programming" for
years, and now I'm going through a Kindle version of "The GNU Awk User's
Guide" again. When I got to section 13.3.11, I reformatted and lightly
commented Davide Brin's signature script to understand its workings.

It occurs to me that this might have pedagogical value as an example
(although imperfect) of the value of whitespace and comments, and a
starting point for that discussion. It certainly helped _me_ understand
what's going on.
You are welcome to it, as-is or modified (subject to
Davide's constraints, of course, which I think I have met).

If I were to include it in a future edition, I would put it at some
distance from section 13.3.11, say, as a note or an appendix, so as not to
be a "spoiler" to the puzzle.

Best regards,
--
Chris Johansen {johansen at main dot nc dot us}
 . . . collapsing the probability wave function, sending ripples of
certainty through the space-time continuum.


#! /usr/bin/gawk -f

# From "13.3.11 And Now For Something Completely Different"
#   https://www.gnu.org/software/gawk/manual/html_node/Signature-Program.html#Signature-Program

# Copyright @copyright{} 2008 Davide Brini

# Copying and distribution of the code published in this page, with
# or without modification, are permitted in any medium without
# royalty provided the copyright notice and this notice are preserved.

BEGIN {
  O = "~" ~ "~";    #  1
  o = "==" == "=="; #  1
  o += +o;          #  2
  x = O "" O;       # 11


  while ( X++ <= x + o + o ) c = c "%c";

  # O is  1
  # o is  2
  # x is 11
  # X is 17
  # c is "%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c"

  printf c,
    ( x - O )*( x - O),                  # 100 d
    x*( x - o ) - o,                     #  97 a
    x*( x - O ) + x - O - o,             # 118 v
    +x*( x - O ) - x + o,                # 101 e
    X*( o*o + O ) + x - O,               #  95 _
    X*( X - x ) - o*o,                   #  98 b
    ( x + X )*o*o + o,                   # 114 r
    x*( X - x ) - O - O,                 #  64 @
    x - O + ( O + o + X + x )*( o + O ), # 103 g
    X*X - X*( x - O ) - x + O,           # 109 m
    O + X*( o*( o + O ) + O ),           # 120 x
    +x + O + X*o,                        #  46 .
    x*( x - o),                          #  99 c
    ( o + X + x )*o*o - ( x - O - O ),   # 111 o
    O + ( X - x )*( X + O ),             # 109 m
    x - O                                #  10 \n
}
@end ignore

@node Programs Summary
@section Summary

@itemize @value{BULLET}
@item
The programs provided in this @value{CHAPTER}
continue on the theme that reading programs is an excellent way to learn
Good Programming.

@item
Using @samp{#!} to make @command{awk} programs directly runnable makes
them easier to use. Otherwise, invoke the program using @samp{awk
-f @dots{}}.

@item
Reimplementing standard POSIX programs in @command{awk} is a pleasant
exercise; @command{awk}'s expressive power lets you write such programs
in relatively few lines of code, yet they are functionally complete
and usable.

@item
One of standard @command{awk}'s weaknesses is working with individual
characters. The ability to use @code{split()} with the empty string as
the separator can considerably simplify such tasks.

@item
The examples here demonstrate the usefulness of the library
functions from @ref{Library Functions}
for a number of real (if small) programs.

@item
Besides reinventing POSIX wheels, other programs solved a selection of
interesting problems, such as finding duplicate words in text, printing
mailing labels, and finding anagrams.

@end itemize

@c EXCLUDE START
@node Programs Exercises
@section Exercises

@enumerate
@item
Rewrite @file{cut.awk} (@pxref{Cut Program})
using @code{split()} with @code{""} as the separator.

@item
In @ref{Egrep Program}, we mentioned that @samp{egrep -i} could be
simulated in versions of @command{awk} without @code{IGNORECASE} by
using @code{tolower()} on the line and the pattern.
In a footnote there,
we also mentioned that this solution has a bug: the translated line is
output, and not the original one. Fix this problem.
@c Exercise: Fix this, w/array and new line as key to original line

@item
The POSIX version of @command{id} takes options that control which
information is printed. Modify the @command{awk} version
(@pxref{Id Program}) to accept the same arguments and perform in the
same way.

@item
The @file{split.awk} program (@pxref{Split Program}) assumes
that letters are contiguous in the character set,
which isn't true for EBCDIC systems.
Fix this problem.
(Hint: Consider a different way to work through the alphabet,
without relying on @code{ord()} and @code{chr()}.)

@item
@cindex Kernighan, Brian @subentry quotes
In @file{uniq.awk} (@pxref{Uniq Program}), the
logic for choosing which lines to print represents a @dfn{state
machine}, which is ``a device which can be in one of a set number of stable
conditions depending on its previous condition and on the present values
of its inputs.''@footnote{This definition is from
@uref{https://www.lexico.com/en/definition/state_machine}.}
Brian Kernighan suggests that
``an alternative approach to state machines is to just read
the input into an array, then use indexing. It's almost always
easier code, and for most inputs where you would use this, just
as fast.'' Rewrite the logic to follow this
suggestion.


@item
Why can't the @file{wc.awk} program (@pxref{Wc Program}) just
use the value of @code{FNR} in @code{endfile()}?
Hint: Examine the code in @ref{Filetrans Function}.

@ignore
@command{wc} can't just use the value of @code{FNR} in
@code{endfile()}.
If you examine the code in @ref{Filetrans Function},
you will see that @code{FNR} has already been reset by the time
@code{endfile()} is called.
@end ignore

@item
Manipulation of individual characters in the @command{translate} program
(@pxref{Translate Program}) is painful using standard @command{awk}
functions. Given that @command{gawk} can split strings into individual
characters using @code{""} as the separator, how might you use this
feature to simplify the program?

@item
The @file{extract.awk} program (@pxref{Extract Program}) was written
before @command{gawk} had the @code{gensub()} function. Use it
to simplify the code.

@item
Compare the performance of the @file{awksed.awk} program
(@pxref{Simple Sed}) with the more straightforward:

@example
BEGIN @{
    pat = ARGV[1]
    repl = ARGV[2]
    ARGV[1] = ARGV[2] = ""
@}

@{ gsub(pat, repl); print @}
@end example

@item
What are the advantages and disadvantages of @file{awksed.awk} versus
the real @command{sed} utility?

@ignore
	Advantage: egrep regexps
		   speed (?)
	Disadvantage: no & in replacement text

Others?
@end ignore

@item
In @ref{Igawk Program}, we mentioned that not trying to save the line
read with @code{getline} in the @code{pathto()} function when testing
for the file's accessibility for use with the main program simplifies
things considerably. What problem does this engender though?
@c answer, reading from "-" or /dev/stdin

@cindex search paths
@cindex search paths @subentry for source files
@cindex source files, search path for
@cindex files @subentry source, search path for
@cindex directories @subentry searching @subentry for source files
@item
As an additional example of the idea that it is not always necessary to
add new features to a program, consider the idea of having two files in
a directory in the search path:

@table @file
@item default.awk
This file contains a set of default library functions, such
as @code{getopt()} and @code{assert()}.

@item site.awk
This file contains library functions that are specific to a site or
installation; i.e., locally developed functions.
Having a separate file allows @file{default.awk} to change with
new @command{gawk} releases, without requiring the system administrator to
update it each time by adding the local functions.
@end table

One user
@c Karl Berry, karl@ileaf.com, 10/95
suggested that @command{gawk} be modified to automatically read these files
upon startup. Instead, it would be very simple to modify @command{igawk}
to do this. Since @command{igawk} can process nested @code{@@include}
directives, @file{default.awk} could simply contain @code{@@include}
statements for the desired library functions.
Make this change.

@item
Modify @file{anagram.awk} (@pxref{Anagram Program}) to avoid
the use of the external @command{sort} utility.

@end enumerate
@c EXCLUDE END

@ifnotinfo
@part @value{PART3}Moving Beyond Standard @command{awk} with @command{gawk}
@end ifnotinfo

@ifdocbook
Part III focuses on features specific to @command{gawk}.
It contains the following chapters:

@itemize @value{BULLET}
@item
@ref{Namespaces}

@item
@ref{Advanced Features}

@item
@ref{Internationalization}

@item
@ref{Debugger}

@item
@ref{Arbitrary Precision Arithmetic}

@item
@ref{Dynamic Extensions}
@end itemize
@end ifdocbook

@node Advanced Features
@chapter Advanced Features of @command{gawk}
@cindex @command{gawk} @subentry features @subentry advanced
@cindex advanced features @subentry @command{gawk}
@ignore
Contributed by: Peter Langston <pud!psl@bellcore.bellcore.com>

    Found in Steve English's "signature" line:

"Write documentation as if whoever reads it is a violent psychopath
who knows where you live."
@end ignore
@cindex Langston, Peter
@cindex English, Steve
@quotation
@i{Write documentation as if whoever reads it is
a violent psychopath who knows where you live.}
@author Steve English, as quoted by Peter Langston
@end quotation

This @value{CHAPTER} discusses advanced features in @command{gawk}.
It's a bit of a ``grab bag'' of items that are otherwise unrelated
to each other.
First, we look at a command-line option that allows @command{gawk} to recognize
nondecimal numbers in input data, not just in @command{awk}
programs.
Then, @command{gawk}'s special features for sorting arrays are presented.
Next, two-way I/O, discussed briefly in earlier parts of this
@value{DOCUMENT}, is described in full detail, along with the basics
of TCP/IP networking. Finally, we see how @command{gawk}
can @dfn{profile} an @command{awk} program, making it possible to tune
it for performance.

@c FULLXREF ON
Additional advanced features are discussed in separate @value{CHAPTER}s of their
own:

@itemize @value{BULLET}
@item
@ref{Internationalization}, discusses how to internationalize
your @command{awk} programs, so that they can speak multiple
national languages.

@item
@ref{Debugger}, describes @command{gawk}'s built-in command-line
debugger for debugging @command{awk} programs.

@item
@ref{Arbitrary Precision Arithmetic}, describes how you can use
@command{gawk} to perform arbitrary-precision arithmetic.

@item
@ref{Dynamic Extensions},
discusses the ability to dynamically add new built-in functions to
@command{gawk}.
@end itemize
@c FULLXREF OFF

@menu
* Nondecimal Data::             Allowing nondecimal input data.
* Array Sorting::               Facilities for controlling array traversal and
                                sorting arrays.
* Two-way I/O::                 Two-way communications with another process.
* TCP/IP Networking::           Using @command{gawk} for network programming.
* Profiling::                   Profiling your @command{awk} programs.
* Extension Philosophy::        What should be built-in and what should not.
* Advanced Features Summary::   Summary of advanced features.
@end menu

@node Nondecimal Data
@section Allowing Nondecimal Input Data
@cindex @option{--non-decimal-data} option
@cindex advanced features @subentry nondecimal input data
@cindex input @subentry data, nondecimal
@cindex constants @subentry nondecimal

If you run @command{gawk} with the @option{--non-decimal-data} option,
you can have nondecimal values in your input data:

@example
$ @kbd{echo 0123 123 0x123 |}
> @kbd{gawk --non-decimal-data '@{ printf "%d, %d, %d\n", $1, $2, $3 @}'}
@print{} 83, 123, 291
@end example

For this feature to work, write your program so that
@command{gawk} treats your data as numeric:

@example
$ @kbd{echo 0123 123 0x123 | gawk '@{ print $1, $2, $3 @}'}
@print{} 0123 123 0x123
@end example

@noindent
The @code{print} statement treats its expressions as strings.
Although the fields can act as numbers when necessary,
they are still strings, so @code{print} does not try to treat them
numerically.  You need to add zero to a field to force it to
be treated as a number.  For example:

@example
$ @kbd{echo 0123 123 0x123 | gawk --non-decimal-data '}
> @kbd{@{ print $1, $2, $3}
> @kbd{  print $1 + 0, $2 + 0, $3 + 0 @}'}
@print{} 0123 123 0x123
@print{} 83 123 291
@end example

Because it is common to have decimal data with leading zeros, and because
using this facility could lead to surprising results, the default is to leave it
disabled.  If you want it, you must explicitly request it.
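
As a hedged aside, the same conversions can be obtained without the
option by calling @command{gawk}'s @code{strtonum()} function
(@pxref{String Functions}).  The following sketch uses string constants
rather than input fields to keep things simple; check the behavior on
your own input data before relying on it:

@example
$ @kbd{gawk 'BEGIN @{ print strtonum("0123"), strtonum("123"), strtonum("0x123") @}'}
@print{} 83 123 291
@end example

@noindent
Here the conversion is requested explicitly for each value, so decimal
data with leading zeros elsewhere in the input is left alone.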

@cindex programming conventions @subentry @option{--non-decimal-data} option
@cindex @option{--non-decimal-data} option @subentry @code{strtonum()} function and
@cindex @code{strtonum()} function (@command{gawk}) @subentry @option{--non-decimal-data} option and
@quotation CAUTION
@emph{Use of this option is not recommended.}
It can break old programs very badly.
Instead, use the @code{strtonum()} function to convert your data
(@pxref{String Functions}).
This makes your programs easier to write and easier to read, and
leads to less surprising results.

This option may disappear in a future version of @command{gawk}.
@end quotation

@node Array Sorting
@section Controlling Array Traversal and Array Sorting

@command{gawk} lets you control the order in which a
@samp{for (@var{indx} in @var{array})}
loop traverses an array.

In addition, two built-in functions, @code{asort()} and @code{asorti()},
let you sort arrays based on the array values and indices, respectively.
These two functions also provide control over the sorting criteria used
to order the elements during sorting.

@menu
* Controlling Array Traversal:: How to use PROCINFO["sorted_in"].
* Array Sorting Functions::     How to use @code{asort()} and @code{asorti()}.
@end menu

@node Controlling Array Traversal
@subsection Controlling Array Traversal

By default, the order in which a @samp{for (@var{indx} in @var{array})} loop
scans an array is not defined; it is generally based upon
the internal implementation of arrays inside @command{awk}.

Often, though, it is desirable to be able to loop over the elements
in a particular order that you, the programmer, choose.  @command{gawk}
lets you do this.
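
As a brief illustration of such control, here is a minimal sketch that
uses @code{"@@ind_str_asc"}, one of the predefined traversal orders
(@pxref{Controlling Scanning}).  The array name and its contents are
made up for the example:

@example
BEGIN @{
    # Hypothetical sample data
    data["pear"] = 3
    data["apple"] = 1
    data["fig"] = 2

    # Visit indices in ascending string order
    PROCINFO["sorted_in"] = "@@ind_str_asc"
    for (i in data)
        print i, data[i]
@}
@end example

@noindent
With this setting, the loop visits @code{apple}, then @code{fig}, and
then @code{pear}, regardless of the order of insertion.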

@ref{Controlling Scanning} describes how you can assign special,
predefined values to @code{PROCINFO["sorted_in"]} in order to
control the order in which @command{gawk} traverses an array
during a @code{for} loop.

In addition, the value of @code{PROCINFO["sorted_in"]} can be a
function name.@footnote{This is why the predefined sorting orders
start with an @samp{@@} character, which cannot be part of an identifier.}
This lets you traverse an array based on any custom criterion.
The array elements are ordered according to the return value of this
function.  The comparison function should be defined with at least
four arguments:

@example
function comp_func(i1, v1, i2, v2)
@{
    @var{compare elements 1 and 2 in some fashion}
    @var{return < 0; 0; or > 0}
@}
@end example

Here, @code{i1} and @code{i2} are the indices, and @code{v1} and @code{v2}
are the corresponding values of the two elements being compared.
Either @code{v1} or @code{v2}, or both, can be arrays if the array being
traversed contains subarrays as values.
(@xref{Arrays of Arrays} for more information about subarrays.)
The three possible return values are interpreted as follows:

@table @code
@item comp_func(i1, v1, i2, v2) < 0
Index @code{i1} comes before index @code{i2} during loop traversal.

@item comp_func(i1, v1, i2, v2) == 0
Indices @code{i1} and @code{i2}
come together, but the relative order with respect to each other is undefined.

@item comp_func(i1, v1, i2, v2) > 0
Index @code{i1} comes after index @code{i2} during loop traversal.
@end table

Our first comparison function can be used to scan an array in
numerical order of the indices:

@example
@group
function cmp_num_idx(i1, v1, i2, v2)
@{
    # numerical index comparison, ascending order
    return (i1 - i2)
@}
@end group
@end example

Our second function traverses an array based on the string order of
the element values rather than by indices:

@example
function cmp_str_val(i1, v1, i2, v2)
@{
    # string value comparison, ascending order
    v1 = v1 ""
    v2 = v2 ""
    if (v1 < v2)
        return -1
    return (v1 != v2)
@}
@end example

The third
comparison function makes all numbers, and numeric strings without
any leading or trailing spaces, come out first during loop traversal:

@example
function cmp_num_str_val(i1, v1, i2, v2,   n1, n2)
@{
    # numbers before string value comparison, ascending order
    n1 = v1 + 0
    n2 = v2 + 0
    if (n1 == v1)
        return (n2 == v2) ? (n1 - n2) : -1
    else if (n2 == v2)
        return 1
    return (v1 < v2) ? -1 : (v1 != v2)
@}
@end example

Here is a main program to demonstrate how @command{gawk}
behaves using each of the previous functions:

@example
BEGIN @{
    data["one"] = 10
    data["two"] = 20
    data[10] = "one"
    data[100] = 100
    data[20] = "two"

    f[1] = "cmp_num_idx"
    f[2] = "cmp_str_val"
    f[3] = "cmp_num_str_val"
    for (i = 1; i <= 3; i++) @{
        printf("Sort function: %s\n", f[i])
        PROCINFO["sorted_in"] = f[i]
        for (j in data)
            printf("\tdata[%s] = %s\n", j, data[j])
        print ""
    @}
@}
@end example

Here are the results when the program is run:

@example
$ @kbd{gawk -f compdemo.awk}
@print{} Sort function: cmp_num_idx      @ii{Sort by numeric index}
@print{}     data[two] = 20
@print{}     data[one] = 10              @ii{Both strings are numerically zero}
@print{}     data[10] = one
@print{}     data[20] = two
@print{}     data[100] = 100
@print{}
@print{} Sort function: cmp_str_val      @ii{Sort by element values as strings}
@print{}     data[one] = 10
@print{}     data[100] = 100             @ii{String 100 is less than string 20}
@print{}     data[two] = 20
@print{}     data[10] = one
@print{}     data[20] = two
@print{}
@print{} Sort function: cmp_num_str_val  @ii{Sort all numeric values before all strings}
@print{}     data[one] = 10
@print{}     data[two] = 20
@print{}     data[100] = 100
@print{}     data[10] = one
@print{}     data[20] = two
@end example

Consider sorting the entries of a GNU/Linux system password file
according to login name.
The following program sorts records
by a specific field position and can be used for this purpose:

@example
# passwd-sort.awk --- simple program to sort by field position
# field position is specified by the global variable POS

function cmp_field(i1, v1, i2, v2)
@{
    # comparison by value, as string, and ascending order
    return v1[POS] < v2[POS] ? -1 : (v1[POS] != v2[POS])
@}

@{
    for (i = 1; i <= NF; i++)
        a[NR][i] = $i
@}

@group
END @{
    PROCINFO["sorted_in"] = "cmp_field"
@end group
    if (POS < 1 || POS > NF)
        POS = 1

    for (i in a) @{
        for (j = 1; j <= NF; j++)
            printf("%s%c", a[i][j], j < NF ? ":" : "")
        print ""
    @}
@}
@end example

The first field in each entry of the password file is the user's login name,
and the fields are separated by colons.
Each record defines a subarray,
with each field as an element in the subarray.
Running the program produces the
following output:

@example
$ @kbd{gawk -v POS=1 -F: -f passwd-sort.awk /etc/passwd}
@print{} adm:x:3:4:adm:/var/adm:/sbin/nologin
@print{} apache:x:48:48:Apache:/var/www:/sbin/nologin
@print{} avahi:x:70:70:Avahi daemon:/:/sbin/nologin
@dots{}
@end example

The comparison should normally always return the same value when given a
specific pair of array elements as its arguments.  If inconsistent
results are returned, then the order is undefined.  This behavior can be
exploited to introduce random order into otherwise seemingly
ordered data:

@example
function cmp_randomize(i1, v1, i2, v2)
@{
    # random order (caution: this may never terminate!)
    return (2 - 4 * rand())
@}
@end example

As already mentioned, the order of the indices is arbitrary if two
elements compare equal.
This is usually not a problem, but letting
the tied elements come out in arbitrary order can be an issue, especially
when comparing item values.  The partial ordering of the equal elements
may change the next time the array is traversed, if other elements are added to or
removed from the array.  One way to resolve ties when comparing elements
with otherwise equal values is to include the indices in the comparison
rules.  Note that doing this may make the loop traversal less efficient,
so consider it only if necessary.  The following comparison functions
force a deterministic order, and are based on the fact that the
(string) indices of two elements are never equal:

@example
function cmp_numeric(i1, v1, i2, v2)
@{
    # numerical value (and index) comparison, descending order
    return (v1 != v2) ? (v2 - v1) : (i2 - i1)
@}

@group
function cmp_string(i1, v1, i2, v2)
@{
    # string value (and index) comparison, descending order
    v1 = v1 i1
    v2 = v2 i2
    return (v1 > v2) ? -1 : (v1 != v2)
@}
@end group
@end example

@c Avoid using the term ``stable'' when describing the unpredictable behavior
@c if two items compare equal.  Usually, the goal of a "stable algorithm"
@c is to maintain the original order of the items, which is a meaningless
@c concept for a list constructed from a hash.

A custom comparison function can often simplify ordered loop
traversal, and the sky is really the limit when it comes to
designing such a function.

When string comparisons are made during a sort, either for element
values where one or both aren't numbers, or for element indices
handled as strings, the value of @code{IGNORECASE}
(@pxref{Built-in Variables}) controls whether
the comparisons treat corresponding upper- and lowercase letters as
equivalent or distinct.

Another point to keep in mind is that in the case of subarrays,
the element values can themselves be arrays; a production comparison
function should use the @code{isarray()} function
(@pxref{Type Functions})
to check for this, and choose a defined sorting order for subarrays.

@cindex POSIX mode
All sorting based on @code{PROCINFO["sorted_in"]}
is disabled in POSIX mode,
because the @code{PROCINFO} array is not special in that case.

As a side note, sorting the array indices before traversing
the array has been reported to add a 15% to 20% overhead to the
execution time of @command{awk} programs. For this reason,
sorted array traversal is not the default.

@c The @command{gawk}
@c maintainers believe that only the people who wish to use a
@c feature should have to pay for it.

@node Array Sorting Functions
@subsection Sorting Array Values and Indices with @command{gawk}

@cindex arrays @subentry sorting @subentry @code{asort()} function (@command{gawk})
@cindex arrays @subentry sorting @subentry @code{asorti()} function (@command{gawk})
@cindexgawkfunc{asort}
@cindex @code{asort()} function (@command{gawk}) @subentry arrays, sorting
@cindex @code{asort()} function (@command{gawk}) @subentry side effects
@cindexgawkfunc{asorti}
@cindex @code{asorti()} function (@command{gawk}) @subentry arrays, sorting
@cindex @code{asorti()} function (@command{gawk}) @subentry side effects
@cindex sort function, arrays, sorting
In most @command{awk} implementations, sorting an array requires writing
a @code{sort()} function.  This can be educational for exploring
different sorting algorithms, but usually that's not the point of the program.
@command{gawk} provides the built-in @code{asort()} and @code{asorti()}
functions (@pxref{String Functions}) for sorting arrays.
For example:

@example
@var{populate the array} data
n = asort(data)
for (i = 1; i <= n; i++)
    @var{do something with} data[i]
@end example

After the call to @code{asort()}, the array @code{data} is indexed from 1
to some number @var{n}, the total number of elements in @code{data}.
(This count is @code{asort()}'s return value.)
@code{data[1]} @value{LEQ} @code{data[2]} @value{LEQ} @code{data[3]}, and so on.
The default comparison is based on the type of the elements
(@pxref{Typing and Comparison}).
All numeric values come before all string values,
which in turn come before all subarrays.

@cindex side effects @subentry @code{asort()} function
@cindex side effects @subentry @code{asorti()} function
An important side effect of calling @code{asort()} is that
@emph{the array's original indices are irrevocably lost}.
As this isn't always desirable, @code{asort()} accepts a
second argument:

@example
@var{populate the array} source
n = asort(source, dest)
for (i = 1; i <= n; i++)
    @var{do something with} dest[i]
@end example

In this case, @command{gawk} copies the @code{source} array into the
@code{dest} array and then sorts @code{dest}, destroying its indices.
However, the @code{source} array is not affected.

Often, what's needed is to sort on the values of the @emph{indices}
instead of the values of the elements.  To do that, use the
@code{asorti()} function.
The interface and behavior are identical to
those of @code{asort()}, except that the index values are used for sorting
and become the values of the result array:

@example
@{ source[$0] = some_func($0) @}

END @{
    n = asorti(source, dest)
    for (i = 1; i <= n; i++) @{
        @ii{Work with sorted indices directly:}
        @var{do something with} dest[i]
        @dots{}
        @ii{Access original array via sorted indices:}
        @var{do something with} source[dest[i]]
    @}
@}
@end example

So far, so good. Now it starts to get interesting.  Both @code{asort()}
and @code{asorti()} accept a third string argument to control comparison
of array elements.  When we introduced @code{asort()} and @code{asorti()}
in @ref{String Functions}, we ignored this third argument; however,
now is the time to describe how this argument affects these two functions.

Basically, the third argument specifies how the array is to be sorted.
There are two possibilities.  As with @code{PROCINFO["sorted_in"]},
this argument may be one of the predefined names that @command{gawk}
provides (@pxref{Controlling Scanning}), or it may be the name of a
user-defined function (@pxref{Controlling Array Traversal}).

In the latter case, @emph{the function can compare elements in any way
it chooses}, taking into account just the indices, just the values,
or both.  This is extremely powerful.

Once the array is sorted, @code{asort()} takes the @emph{values} in
their final order and uses them to fill in the result array, whereas
@code{asorti()} takes the @emph{indices} in their final order and uses
them to fill in the result array.

@cindex reference counting, sorting arrays
@quotation NOTE
Copying array indices and elements isn't expensive in terms of memory.
Internally, @command{gawk} maintains @dfn{reference counts} to data.
For example, when @code{asort()} copies the first array to the second one,
there is only one copy of the original array elements' data, even though
both arrays use the values.
@end quotation

You may use the same array for both the first and second arguments to
@code{asort()} and @code{asorti()}.  Doing so only makes sense if you
are also supplying the third argument, since @command{awk} doesn't
provide a way to pass that third argument without also passing the first
and second ones.

@c Document It And Call It A Feature. Sigh.
@cindex @command{gawk} @subentry @code{IGNORECASE} variable in
@cindex arrays @subentry sorting @subentry @code{IGNORECASE} variable and
@cindex @code{IGNORECASE} variable @subentry array sorting functions and
Because @code{IGNORECASE} affects string comparisons, the value
of @code{IGNORECASE} also affects sorting for both @code{asort()} and @code{asorti()}.
Note also that the locale's sorting order does @emph{not}
come into play; comparisons are based on character values only.@footnote{This
is true because locale-based comparison occurs only when in
POSIX-compatibility mode, and because @code{asort()} and @code{asorti()} are
@command{gawk} extensions, they are not available in that case.}

The following example demonstrates the use of a comparison function with
@code{asort()}.  The comparison function, @code{case_fold_compare()}, maps
both values to lowercase in order to compare them ignoring case.

@example
@group
# case_fold_compare --- compare as strings, ignoring case

function case_fold_compare(i1, v1, i2, v2,    l, r)
@{
    l = tolower(v1)
@end group
    r = tolower(v2)

    if (l < r)
        return -1
    else if (l == r)
        return 0
    else
        return 1
@}
@end example

And here is the test program for it:

@example
# Test program

BEGIN @{
    Letters = "abcdefghijklmnopqrstuvwxyz" \
              "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    split(Letters, data, "")

    asort(data, result, "case_fold_compare")

    j = length(result)
    for (i = 1; i <= j; i++) @{
        printf("%s", result[i])
        if (i % (j/2) == 0)
            printf("\n")
        else
            printf(" ")
    @}
@}
@end example

When run, we get the following:

@example
$ @kbd{gawk -f case_fold_compare.awk}
@print{} A a B b c C D d e E F f g G H h i I J j k K l L M m
@print{} n N O o p P Q q r R S s t T u U V v w W X x y Y z Z
@end example

@node Two-way I/O
@section Two-Way Communications with Another Process

@c 8/2014. Neither Mike nor BWK saw this as relevant. Commenting it out.
@ignore
@cindex Brennan, Michael
@cindex programmers, attractiveness of
@smallexample
@c Path: cssun.mathcs.emory.edu!gatech!newsxfer3.itd.umich.edu!news-peer.sprintlink.net!news-sea-19.sprintlink.net!news-in-west.sprintlink.net!news.sprintlink.net!Sprint!204.94.52.5!news.whidbey.com!brennan
From: brennan@@whidbey.com (Mike Brennan)
Newsgroups: comp.lang.awk
Subject: Re: Learn the SECRET to Attract Women Easily
Date: 4 Aug 1997 17:34:46 GMT
@c Organization: WhidbeyNet
@c Lines: 12
Message-ID: <5s53rm$eca@@news.whidbey.com>
@c References: <5s20dn$2e1@chronicle.concentric.net>
@c Reply-To: brennan@whidbey.com
@c NNTP-Posting-Host: asn202.whidbey.com
@c X-Newsreader: slrn (0.9.4.1 UNIX)
@c Xref: cssun.mathcs.emory.edu comp.lang.awk:5403

On 3 Aug 1997 13:17:43 GMT, Want More Dates???
<tracy78@@kilgrona.com> wrote:
>Learn the SECRET to Attract Women Easily
>
>The SCENT(tm) Pheromone Sex Attractant For Men to Attract Women

The scent of awk programmers is a lot more attractive to women than
the scent of perl programmers.
--
Mike Brennan
@c brennan@@whidbey.com
@end smallexample
@end ignore

@cindex advanced features @subentry processes, communicating with
@cindex processes, two-way communications with
It is often useful to be able to
send data to a separate program for
processing and then read the result.  This can always be
done with temporary files:

@example
# Write the data for processing
tempfile = ("mydata." PROCINFO["pid"])
while (@var{not done with data})
    print @var{data} | ("subprogram > " tempfile)
close("subprogram > " tempfile)

# Read the results, remove tempfile when done
while ((getline newdata < tempfile) > 0)
    @var{process} newdata @var{appropriately}
close(tempfile)
system("rm " tempfile)
@end example

@noindent
This works, but not elegantly.  Among other things, it requires that
the program be run in a directory that cannot be shared among users;
for example, @file{/tmp} will not do, as another user might happen
to be using a temporary file with the same name.@footnote{Michael
Brennan suggests the use of @code{rand()} to generate unique
@value{FN}s. This is a valid point; nevertheless, temporary files
remain more difficult to use than two-way pipes.} @c 8/2014

@cindex coprocesses
@cindex input/output @subentry two-way
@cindex @code{|} (vertical bar) @subentry @code{|&} operator (I/O)
@cindex vertical bar (@code{|}) @subentry @code{|&} operator (I/O)
@cindex @command{csh} utility @subentry @code{|&} operator, comparison with
However, with @command{gawk}, it is possible to
open a @emph{two-way} pipe to another process.  The second process is
termed a @dfn{coprocess}, as it runs in parallel with @command{gawk}.
The two-way connection is created using the @samp{|&} operator
(borrowed from the Korn shell, @command{ksh}):@footnote{This is very
different from the same operator in the C shell and in Bash.}

@example
do @{
    print @var{data} |& "subprogram"
    "subprogram" |& getline results
@} while (@var{data left to process})
close("subprogram")
@end example

The first time an I/O operation is executed using the @samp{|&}
operator, @command{gawk} creates a two-way pipeline to a child process
that runs the other program.
Output created with @code{print}
or @code{printf} is written to the program's standard input, and
output from the program's standard output can be read by the @command{gawk}
program using @code{getline}.
As is the case with processes started by @samp{|}, the subprogram
can be any program, or pipeline of programs, that can be started by
the shell.

There are some cautionary items to be aware of:

@itemize @value{BULLET}
@item
As the code inside @command{gawk} currently stands, the coprocess's
standard error goes to the same place that the parent @command{gawk}'s
standard error goes. It is not possible to read the child's
standard error separately.

@cindex deadlocks
@cindex buffering @subentry input/output
@cindex @code{getline} command @subentry deadlock and
@item
I/O buffering may be a problem.  @command{gawk} automatically
flushes all output down the pipe to the coprocess.
However, if the coprocess does not flush its output,
@command{gawk} may hang when doing a @code{getline} in order to read
the coprocess's results.  This could lead to a situation
known as @dfn{deadlock}, where each process is waiting for the
other one to do something.
@end itemize

@cindex @code{close()} function @subentry two-way pipes and
It is possible to close just one end of the two-way pipe to
a coprocess, by supplying a second argument to the @code{close()}
function of either @code{"to"} or @code{"from"}
(@pxref{Close Files And Pipes}).
These strings tell @command{gawk} to close the end of the pipe
that sends data to the coprocess or the end that reads from it,
respectively.

@cindex @command{sort} utility @subentry coprocesses and
This is particularly necessary in order to use
the system @command{sort} utility as part of a coprocess;
@command{sort} must read @emph{all} of its input
data before it can produce any output.
The @command{sort} program does not receive an end-of-file indication
until @command{gawk} closes the write end of the pipe.

When you have finished writing data to the @command{sort}
utility, you can close the @code{"to"} end of the pipe, and
then start reading sorted data via @code{getline}.
For example:

@example
BEGIN @{
    command = "LC_ALL=C sort"
    n = split("abcdefghijklmnopqrstuvwxyz", a, "")

    for (i = n; i > 0; i--)
        print a[i] |& command
    close(command, "to")

    while ((command |& getline line) > 0)
        print "got", line
    close(command)
@}
@end example

This program writes the letters of the alphabet in reverse order, one
per line, down the two-way pipe to @command{sort}.  It then closes the
write end of the pipe, so that @command{sort} receives an end-of-file
indication.  This causes @command{sort} to sort the data and write the
sorted data back to the @command{gawk} program.  Once all of the data
has been read, @command{gawk} terminates the coprocess and exits.

@cindex ASCII
As a side note, the assignment @samp{LC_ALL=C} in the @command{sort}
command ensures traditional Unix (ASCII) sorting from @command{sort}.
This is not strictly necessary here, but it's good to know how to do this.

Be careful when closing the @code{"from"} end of a two-way pipe; in this
case @command{gawk} waits for the child process to exit, which may cause
your program to hang.  (Thus, this particular feature is of much less
use in practice than being able to close the @code{"to"} end.)

@quotation CAUTION
Normally,
it is a fatal error to write to the @code{"to"} end of a two-way
pipe that has been closed, and it is also a fatal error to read
from the @code{"from"} end of a two-way pipe that has been closed.

You may set @code{PROCINFO["@var{command}", "NONFATAL"]} to
make such operations become nonfatal. If you do so, you then need
to check @code{ERRNO} after each @code{print}, @code{printf},
or @code{getline}.
@xref{Nonfatal}, for more information.
@end quotation

@cindex @command{gawk} @subentry @code{PROCINFO} array in
@cindex @code{PROCINFO} array @subentry communications via ptys and
You may also use pseudo-ttys (ptys) for
two-way communication instead of pipes, if your system supports them.
This is done on a per-command basis, by setting a special element
in the @code{PROCINFO} array
(@pxref{Auto-set}),
like so:

@example
command = "sort -nr"           # command, save in convenience variable
PROCINFO[command, "pty"] = 1   # update PROCINFO
print @dots{} |& command           # start two-way pipe
@dots{}
@end example

@noindent
If your system does not have ptys, or if all the system's ptys are in use,
@command{gawk} automatically falls back to using regular pipes.

Using ptys usually avoids the buffer deadlock issues described earlier,
at some loss in performance. This is because the tty driver buffers
and sends data line-by-line.  On systems with the @command{stdbuf} utility
(part of the @uref{https://www.gnu.org/software/coreutils/coreutils.html,
GNU Coreutils package}), you can use that program instead of ptys.

Note also that ptys are not fully transparent. Certain binary control
codes, such as @kbd{Ctrl-d} for end-of-file, are interpreted by the tty
driver and not passed through.

@quotation CAUTION
Finally, coprocesses open up the possibility of @dfn{deadlock} between
@command{gawk} and the program running in the coprocess. This can occur
if you send ``too much'' data to the coprocess before reading any back;
each process is blocked writing data with no one available to read what
they've already written.  There is no workaround for deadlock; careful
programming and knowledge of the behavior of the coprocess are required.
@end quotation

@c From email send January 4, 2018.
The following example, due to Andrew Schorr, demonstrates how
using ptys can help deal with buffering deadlocks.

Suppose @command{gawk} were unable to add numbers.
You could use a coprocess to do it. Here's an exceedingly
simple program written for that purpose:

@example
$ @kbd{cat add.c}
#include <stdio.h>

int
main(void)
@{
    int x, y;
    while (scanf("%d %d", &x, &y) == 2)
        printf("%d\n", x + y);
    return 0;
@}
$ @kbd{cc -O add.c -o add}      @ii{Compile the program}
@end example

You could then write an exceedingly simple @command{gawk} program
to add numbers by passing them to the coprocess:

@example
$ @kbd{echo 1 2 |}
> @kbd{gawk -v cmd=./add '@{ print |& cmd; cmd |& getline x; print x @}'}
@end example

And it would deadlock, because @file{add.c} fails to call
@samp{setlinebuf(stdout)}. The @command{add} program freezes.

Now try instead:

@example
$ @kbd{echo 1 2 |}
> @kbd{gawk -v cmd=./add 'BEGIN @{ PROCINFO[cmd, "pty"] = 1 @}}
> @kbd{                   @{ print |& cmd; cmd |& getline x; print x @}'}
@print{} 3
@end example

By using a pty, @command{gawk} fools the standard I/O library into
thinking it has an interactive session, so it defaults to line buffering.
And now, magically, it works!
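
On systems that provide @command{stdbuf}, a similar effect may be had
without a pty.  The following is a sketch only; it assumes that
@command{stdbuf} is installed and that @command{add} is dynamically
linked against a C library that @command{stdbuf} can influence.
The @option{-oL} option asks for line-buffered standard output:

@example
$ @kbd{echo 1 2 |}
> @kbd{gawk -v cmd='stdbuf -oL ./add' \}
> @kbd{     '@{ print |& cmd; cmd |& getline x; print x @}'}
@end example

@noindent
If the assumptions hold, @command{add} flushes each result as soon as
it is written, and @command{gawk} can read the sum without hanging.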

@node TCP/IP Networking
@section Using @command{gawk} for Network Programming
@cindex advanced features @subentry network programming
@cindex networks @subentry programming
@cindex TCP/IP
@cindex @code{/inet/@dots{}} special files (@command{gawk})
@cindex files @subentry @code{/inet/@dots{}} (@command{gawk})
@cindex @code{/inet4/@dots{}} special files (@command{gawk})
@cindex files @subentry @code{/inet4/@dots{}} (@command{gawk})
@cindex @code{/inet6/@dots{}} special files (@command{gawk})
@cindex files @subentry @code{/inet6/@dots{}} (@command{gawk})
@cindex @code{EMRED}
@ifnotdocbook
@quotation
@code{EMRED}:@*
@ @ @ @ @i{A host is a host from coast to coast,@*
@ @ @ @ and nobody talks to a host that's close,@*
@ @ @ @ unless the host that isn't close@*
@ @ @ @ is busy, hung, or dead.}
@author Mike O'Brien (aka Mr.@: Protocol)
@end quotation
@end ifnotdocbook

@docbook
<blockquote>
<attribution>Mike O'Brien (aka Mr. Protocol)</attribution>
<literallayout class="normal"><literal>EMRED</literal>:
    <emphasis>A host is a host from coast to coast,</emphasis>
    <emphasis>and nobody talks to a host that's close,</emphasis>
    <emphasis>unless the host that isn't close</emphasis>
    <emphasis>is busy, hung, or dead.</emphasis></literallayout>
</blockquote>
@end docbook

In addition to being able to open a two-way pipeline to a coprocess
on the same system
(@pxref{Two-way I/O}),
it is possible to make a two-way connection to
another process on another system across an IP network connection.

You can think of this as just a @emph{very long} two-way pipeline to
a coprocess.
The way @command{gawk} decides that you want to use TCP/IP networking is
by recognizing special @value{FN}s that begin with one of @samp{/inet/},
@samp{/inet4/}, or @samp{/inet6/}.
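
Such a name is used with the @samp{|&} operator just like the name of a
coprocess. As a first, hedged sketch, the following program forces IPv4
and reports a failure instead of silently printing nothing. The choice of
the @code{daytime} service on @samp{localhost} is only an assumption made
for illustration; many systems do not run that service. The parts of the
name are described next.

@example
BEGIN @{
    service = "/inet4/tcp/0/localhost/daytime"   # illustrative endpoint
    if ((service |& getline line) > 0)
        print line
    else
        print "cannot read from " service ": " ERRNO > "/dev/stderr"
    close(service)
@}
@end example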

The full syntax of the special @value{FN} is
@file{/@var{net-type}/@var{protocol}/@var{local-port}/@var{remote-host}/@var{remote-port}}.
The components are:

@table @var
@item net-type
Specifies the kind of Internet connection to make.
Use @samp{/inet4/} to force IPv4, and
@samp{/inet6/} to force IPv6.
Plain @samp{/inet/} (which used to be the only option) uses
the system default, most likely IPv4.

@item protocol
The protocol to use over IP. This must be either @samp{tcp} or
@samp{udp}, for a TCP or UDP IP connection,
respectively. TCP should be used for most applications.

@item local-port
@cindex @code{getaddrinfo()} function (C library)
@cindex C library functions @subentry @code{getaddrinfo()}
The local TCP or UDP port number to use. Use a port number of @samp{0}
when you want the system to pick a port. This is what you should do
when writing a TCP or UDP client.
You may also use a well-known service name, such as @samp{smtp}
or @samp{http}, in which case @command{gawk} attempts to determine
the predefined port number using the C @code{getaddrinfo()} function.

@item remote-host
The IP address or fully qualified domain name of the Internet
host to which you want to connect.

@item remote-port
The TCP or UDP port number to use on the given @var{remote-host}.
Again, use @samp{0} if you don't care, or else a well-known
service name.
@end table

@cindex @command{gawk} @subentry @code{ERRNO} variable in
@cindex @code{ERRNO} variable
@quotation NOTE
Failure in opening a two-way socket will result in a nonfatal error
being returned to the calling code. The value of @code{ERRNO} indicates
the error (@pxref{Auto-set}).
@end quotation

Consider the following very simple example:

@example
BEGIN @{
    Service = "/inet/tcp/0/localhost/daytime"
    Service |& getline
    print $0
    close(Service)
@}
@end example

This program reads the current date and time from the local system's
TCP @code{daytime} server.
It then prints the results and closes the connection.

Because this topic is extensive, the use of @command{gawk} for
TCP/IP programming is documented separately.
@ifinfo
See
@inforef{Top, , General Introduction, gawkinet, @value{GAWKINETTITLE}},
@end ifinfo
@ifnotinfo
See
@uref{https://www.gnu.org/software/gawk/manual/gawkinet/,
@cite{@value{GAWKINETTITLE}}},
which comes as part of the @command{gawk} distribution,
@end ifnotinfo
for a much more complete introduction and discussion, as well as
extensive examples.

@quotation NOTE
@command{gawk} can only open direct sockets. There is currently
no way to access services available over Secure Sockets Layer
(SSL); this includes any web service whose URL starts with @samp{https://}.
@end quotation


@node Profiling
@section Profiling Your @command{awk} Programs
@cindex @command{awk} programs @subentry profiling
@cindex profiling @command{awk} programs
@cindex @code{awkprof.out} file
@cindex files @subentry @code{awkprof.out}

You may produce execution traces of your @command{awk} programs.
This is done by passing the option @option{--profile} to @command{gawk}.
When @command{gawk} has finished running, it creates a profile of your program in a file
named @file{awkprof.out}. Because it is profiling, it also executes up to 45% slower than
@command{gawk} normally does.
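
For instance, assuming a program file named @file{myprog.awk} and input
files named @file{data1} and @file{data2} (all placeholder names), a
minimal profiling run might look like the following sketch:

@example
$ @kbd{gawk --profile -f myprog.awk data1 data2}
$ @kbd{cat awkprof.out}
@end example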

@cindex @option{--profile} option
As shown in the following example,
the @option{--profile} option can be used to change the name of the file
where @command{gawk} will write the profile:

@example
gawk --profile=myprog.prof -f myprog.awk data1 data2
@end example

@noindent
In the preceding example, @command{gawk} places the profile in
@file{myprog.prof} instead of in @file{awkprof.out}.

Here is a sample session showing a simple @command{awk} program,
its input data, and the results from running @command{gawk} with the
@option{--profile} option. First, the @command{awk} program:

@example
BEGIN @{ print "First BEGIN rule" @}

END @{ print "First END rule" @}

/foo/ @{
    print "matched /foo/, gosh"
    for (i = 1; i <= 3; i++)
        sing()
@}

@{
    if (/foo/)
        print "if is true"
    else
        print "else is true"
@}

BEGIN @{ print "Second BEGIN rule" @}

END @{ print "Second END rule" @}

function sing(    dummy)
@{
    print "I gotta be me!"
@}
@end example

Following is the input data:

@example
foo
bar
baz
foo
junk
@end example

Here is the @file{awkprof.out} that results from running the
@command{gawk} profiler on this program and data (this example also
illustrates that @command{awk} programmers sometimes get up very early
in the morning to work):

@cindex @code{BEGIN} pattern @subentry profiling and
@cindex @code{END} pattern @subentry profiling and
@example
        # gawk profile, created Mon Sep 29 05:16:21 2014

        # BEGIN rule(s)

        BEGIN @{
     1          print "First BEGIN rule"
        @}

        BEGIN @{
     1          print "Second BEGIN rule"
        @}

        # Rule(s)

     5  /foo/ @{ # 2
     2          print "matched /foo/, gosh"
     6          for (i = 1; i <= 3; i++) @{
     6                  sing()
                @}
        @}

     5  @{
     5          if (/foo/) @{ # 2
     2                  print "if is true"
     3          @} else @{
     3                  print "else is true"
                @}
        @}

        # END rule(s)

        END @{
     1          print "First END rule"
        @}

        END @{
     1          print "Second END rule"
        @}


        # Functions, listed alphabetically

     6  function sing(dummy)
        @{
     6          print "I gotta be me!"
        @}
@end example

This example illustrates many of the basic features of profiling output.
They are as follows:

@itemize @value{BULLET}
@item
The program is printed in the order @code{BEGIN} rules,
@code{BEGINFILE} rules,
pattern--action rules,
@code{ENDFILE} rules, @code{END} rules, and functions, listed
alphabetically.
Multiple @code{BEGIN} and @code{END} rules retain their
separate identities, as do
multiple @code{BEGINFILE} and @code{ENDFILE} rules.

@cindex patterns @subentry counts, in a profile
@item
Pattern--action rules have two counts.
The first count, to the left of the rule, shows how many times
the rule's pattern was @emph{tested}.
The second count, to the right of the rule's opening left brace
in a comment,
shows how many times the rule's action was @emph{executed}.
The difference between the two indicates how many times the rule's
pattern evaluated to false.

@item
Similarly,
the count for an @code{if}-@code{else} statement shows how many times
the condition was tested.
To the right of the opening left brace for the @code{if}'s body
is a count showing how many times the condition was true.
The count for the @code{else}
indicates how many times the test failed.

@cindex loops @subentry count for header, in a profile
@item
The count for a loop header (such as @code{for}
or @code{while}) shows how many times the loop test was executed.
(Because of this, you can't just look at the count on the first
statement in a rule to determine how many times the rule was executed.
If the first statement is a loop, the count is misleading.)

@cindex functions @subentry user-defined @subentry counts, in a profile
@cindex user-defined @subentry functions @subentry counts, in a profile
@item
For user-defined functions, the count next to the @code{function}
keyword indicates how many times the function was called.
The counts next to the statements in the body show how many times
those statements were executed.

@cindex @code{@{@}} (braces)
@cindex braces (@code{@{@}})
@item
The layout uses ``K&R'' style with TABs.
Braces are used everywhere, even when
the body of an @code{if}, @code{else}, or loop is only a single statement.

@cindex @code{()} (parentheses) @subentry in a profile
@cindex parentheses @code{()} @subentry in a profile
@item
Parentheses are used only where needed, as indicated by the structure
of the program and the precedence rules.
For example, @samp{(3 + 5) * 4} means add three and five, then multiply
the total by four. In contrast, @samp{3 + 5 * 4} has no parentheses, and
means @samp{3 + (5 * 4)}.
Explicit parentheses in the source program are retained, however.

@ignore
@item
All string concatenations are parenthesized too.
(This could be made a bit smarter.)
@end ignore

@item
Parentheses are used around the arguments to @code{print}
and @code{printf} only when
the @code{print} or @code{printf} statement is followed by a redirection.
Similarly, if
the target of a redirection isn't a scalar, it gets parenthesized.

@item
@command{gawk} supplies leading comments in
front of the @code{BEGIN} and @code{END} rules,
the @code{BEGINFILE} and @code{ENDFILE} rules,
the pattern--action rules, and the functions.

@item
Functions are listed alphabetically. All functions in the @code{awk}
namespace are listed first, in alphabetical order. Then come the
functions in other namespaces. The namespaces are listed in alphabetical order,
and the functions within each namespace are listed alphabetically.

@end itemize

The profiled version of your program may not look exactly like what you
typed when you wrote it. This is because @command{gawk} creates the
profiled version by ``pretty-printing'' its internal representation of
the program. The advantage to this is that @command{gawk} can produce
a standard representation.
Also, things such as:

@example
/foo/
@end example

@noindent
come out as:

@example
/foo/   @{
    print
@}
@end example

@noindent
which is correct, but possibly unexpected.
(If a program uses both @samp{print $0} and plain
@samp{print}, that distinction is retained.)

@cindex profiling @command{awk} programs @subentry dynamically
@cindex @command{gawk} @subentry dynamic profiling
@cindex @command{gawk} @subentry profiling programs
@cindex dynamic profiling
Besides creating profiles when a program has completed,
@command{gawk} can produce a profile while it is running.
This is useful if your @command{awk} program goes into an
infinite loop and you want to see what has been executed.
To use this feature, run @command{gawk} with the @option{--profile}
option in the background:

@example
$ @kbd{gawk --profile -f myprog &}
[1] 13992
@end example

@cindex @command{kill} command, dynamic profiling
@cindex @code{USR1} signal, for dynamic profiling
@cindex @code{SIGUSR1} signal, for dynamic profiling
@cindex signals @subentry @code{USR1}/@code{SIGUSR1}, for profiling
@noindent
The shell prints a job number and process ID number; in this case, 13992.
Use the @command{kill} command to send the @code{USR1} signal
to @command{gawk}:

@example
$ @kbd{kill -USR1 13992}
@end example

@noindent
As usual, the profiled version of the program is written to
@file{awkprof.out}, or to a different file if one was specified with
the @option{--profile} option.

Along with the regular profile, as shown earlier, the profile file
includes a trace of any active functions:

@example
# Function Call Stack:

#   3. baz
#   2. bar
#   1. foo
# -- main --
@end example

You may send @command{gawk} the @code{USR1} signal as many times as you like.
Each time, the profile and function call trace are appended to the output
profile file.

@cindex @code{HUP} signal, for dynamic profiling
@cindex @code{SIGHUP} signal, for dynamic profiling
@cindex signals @subentry @code{HUP}/@code{SIGHUP}, for profiling
If you use the @code{HUP} signal instead of the @code{USR1} signal,
@command{gawk} produces the profile and the function call trace and then exits.

@cindex @code{INT} signal (MS-Windows)
@cindex @code{SIGINT} signal (MS-Windows)
@cindex signals @subentry @code{INT}/@code{SIGINT} (MS-Windows)
@cindex @code{QUIT} signal (MS-Windows)
@cindex @code{SIGQUIT} signal (MS-Windows)
@cindex signals @subentry @code{QUIT}/@code{SIGQUIT} (MS-Windows)
When @command{gawk} runs on MS-Windows systems, it uses the
@code{INT} and @code{QUIT} signals for producing the profile, and in
the case of the @code{INT} signal, @command{gawk} exits. This is
because these systems don't support the @command{kill} command, so the
only signals you can deliver to a program are those generated by the
keyboard. The @code{INT} signal is generated by the
@kbd{Ctrl-c} or @kbd{Ctrl-BREAK} key, while the
@code{QUIT} signal is generated by the @kbd{Ctrl-\} key.

@cindex pretty printing
Finally, @command{gawk} also accepts another option, @option{--pretty-print}.
When called this way, @command{gawk} ``pretty-prints'' the program into
@file{awkprof.out}, without any execution counts.

@quotation NOTE
Once upon a time, the @option{--pretty-print} option would also run
your program. This is no longer the case.
@end quotation

@cindex profiling, pretty printing, difference with
@cindex pretty printing @subentry profiling, difference with
There is a significant difference between the output created when
profiling, and that created when pretty-printing. Pretty-printed output
preserves the original comments that were in the program, although their
placement may not correspond exactly to their original locations in the
source code. However, no comments should be lost.
Also, @command{gawk} does the best it can to preserve
the distinction between comments at the end of a statement and comments
on lines by themselves. This isn't always perfect, though.

However, as a deliberate design decision, profiling output @emph{omits}
the original program's comments. This allows you to focus on the
execution count data and helps you avoid the temptation to use the
profiler for pretty-printing.

Additionally, pretty-printed output does not have the leading indentation
that the profiling output does. This makes it easy to pretty-print your
code once development is completed, and then use the result as the final
version of your program.

Because the internal representation of your program is formatted to
recreate an @command{awk} program, profiling and pretty-printing
automatically disable @command{gawk}'s default optimizations.

Profiling and pretty-printing also preserve the original format of numeric
constants; if you used an octal or hexadecimal value in your source
code, it will appear that way in the output.

@node Extension Philosophy
@section Builtin Features versus Extensions

As this and subsequent @value{CHAPTER}s show, @command{gawk} has a
large number of extensions over standard @command{awk} built into
the program. These have developed over time.
More recently, the
focus has moved to using the extension mechanism (@pxref{Dynamic Extensions})
for adding features. This @value{SECTION} discusses the ``guiding philosophy''
behind what should be added to the interpreter as a built-in
feature versus what should be done in extensions.

There are several goals:

@enumerate 1
@item
Keep the language @command{awk}; it should not become unrecognizable, even
if programs in it will only run on @command{gawk}.

@item
Keep the core from getting any larger unless absolutely necessary.

@item
Add new functionality either in @command{awk} scripts (@option{-f},
@code{@@include}) or in loadable extensions written in C or C++
(@option{-l}, @code{@@load}).

@item
Extend the core interpreter only if some feature:

@c sublist
@enumerate A
@item
Is truly desirable.
@item
Cannot be done via library files or loadable extensions.
@item
Can be implemented without too much pain in the core.
@end enumerate
@end enumerate
Combining modules with @command{awk} files is a powerful technique.
Some of the sample extensions demonstrate this.

Loading extensions and library files should not be done automatically,
because then there's overhead that most users don't want or need.

@node Advanced Features Summary
@section Summary

@itemize @value{BULLET}
@item
The @option{--non-decimal-data} option causes @command{gawk} to treat
octal- and hexadecimal-looking input data as octal and hexadecimal.
This option should be used with caution or not at all; use of @code{strtonum()}
is preferable.
Note that this option may disappear in a future version of @command{gawk}.

@item
You can take over complete control of sorting in @samp{for (@var{indx} in @var{array})}
array traversal by setting @code{PROCINFO["sorted_in"]} to the name of a user-defined
function that does the comparison of array elements based on index and value.

@item
Similarly, you can supply the name of a user-defined comparison function as the
third argument to either @code{asort()} or @code{asorti()} to control how
those functions sort arrays. Or you may provide one of the predefined control
strings that work for @code{PROCINFO["sorted_in"]}.

@item
You can use the @samp{|&} operator to create a two-way pipe to a coprocess.
You read from the coprocess with @code{getline} and write to it with @code{print}
or @code{printf}. Use @code{close()} to close off the coprocess completely, or
optionally, close off one side of the two-way communications.

@item
By using special @value{FN}s with the @samp{|&} operator, you can open a
TCP/IP (or UDP/IP) connection to remote hosts on the Internet. @command{gawk}
supports both IPv4 and IPv6.

@item
You can generate statement count profiles of your program. This can help you
determine which parts of your program may be taking the most time and let
you tune them more easily. Sending the @code{USR1} signal while profiling causes
@command{gawk} to dump the profile, including a function call stack, and keep going.

@item
You can also just ``pretty-print'' the program.

@item
New features should be developed using the extension mechanism if possible;
they should be added to the core interpreter only as a last resort.
@end itemize


@node Internationalization
@chapter Internationalization with @command{gawk}

@cindex Robbins @subentry Malka
@cindex Moon, Sailor
@cindex Sailor Moon @seeentry{Moon, Sailor}
@quotation
@i{Moon@dots{} Gorgeous@dots{} MEDITATION!}
@author Pretty Guardian Sailor Moon Eternal, The Movie
@end quotation

@quotation
@i{It probably sounded better in Japanese.}
@author Malka Robbins
@end quotation

Once upon a time, computer makers
wrote software that worked only in English.
Eventually, hardware and software vendors noticed that if their
systems worked in the native languages of non-English-speaking
countries, they were able to sell more systems.
As a result, internationalization and localization
of programs and software systems became a common practice.

@cindex internationalization @subentry localization
@cindex @command{gawk} @subentry internationalization @seeentry{internationalization}
@cindex internationalization @subentry localization @subentry @command{gawk} and
For many years, the ability to provide internationalization
was largely restricted to programs written in C and C++.
This @value{CHAPTER} describes the underlying library @command{gawk}
uses for internationalization, as well as how
@command{gawk} makes internationalization
features available at the @command{awk} program level.
Having internationalization available at the @command{awk} level
gives software developers additional flexibility---they are no
longer forced to write in C or C++ when internationalization is
a requirement.

@menu
* I18N and L10N::               Internationalization and Localization.
* Explaining gettext::          How GNU @command{gettext} works.
* Programmer i18n::             Features for the programmer.
* Translator i18n::             Features for the translator.
* I18N Example::                A simple i18n example.
* Gawk I18N::                   @command{gawk} is also internationalized.
* I18N Summary::                Summary of I18N stuff.
@end menu

@node I18N and L10N
@section Internationalization and Localization

@cindex internationalization
@cindex localization @seeentry{internationalization, localization}
@cindex internationalization @subentry localization
@dfn{Internationalization} means writing (or modifying) a program once,
in such a way that it can use multiple languages without requiring
further source code changes.
@dfn{Localization} means providing the data necessary for an
internationalized program to work in a particular language.
Most typically, these terms refer to features such as the language
used for printing error messages, the language used to read
responses, and information related to how numerical and
monetary values are printed and read.

@node Explaining gettext
@section GNU @command{gettext}

@cindex internationalizing a program
@cindex @command{gettext} library
@command{gawk} uses GNU @command{gettext} to provide its internationalization
features.
The facilities in GNU @command{gettext} focus on messages: strings printed
by a program, either directly or via formatting with @code{printf} or
@code{sprintf()}.@footnote{For some operating systems, the @command{gawk}
port doesn't support GNU @command{gettext}.
Therefore, these features are not available
if you are using one of those operating systems. Sorry.}

@cindex portability @subentry @command{gettext} library and
When using GNU @command{gettext}, each application has its own
@dfn{text domain}. This is a unique name, such as @samp{kpilot} or @samp{gawk},
that identifies the application.
A complete application may have multiple components---programs written
in C or C++, as well as scripts written in @command{sh} or @command{awk}.
All of the components use the same text domain.

To make the discussion concrete, assume we're writing an application
named @command{guide}. Internationalization consists of the
following steps, in this order:

@enumerate
@item
The programmer reviews the source for all of @command{guide}'s components
and marks each string that is a candidate for translation.
For example, @code{"`-F': option required"} is a good candidate for translation.
A table with strings of option names is not (e.g., @command{gawk}'s
@option{--profile} option should remain the same, no matter what the local
language).

@cindex @code{textdomain()} function (C library)
@cindex C library functions @subentry @code{textdomain()}
@item
The programmer indicates the application's text domain
(@code{"guide"}) to the @command{gettext} library,
by calling the @code{textdomain()} function.

@cindex @code{.pot} files
@cindex files @subentry @code{.pot}
@cindex portable object @subentry template files
@cindex files @subentry portable object @subentry template file (@file{.pot})
@item
Messages from the application are extracted from the source code and
collected into a portable object template file (@file{guide.pot}),
which lists the strings and their translations.
The translations are initially empty.
The original (usually English) messages serve as the key for
lookup of the translations.

@cindex @code{.po} files
@cindex files @subentry @code{.po}
@cindex portable object @subentry files
@cindex files @subentry portable object
@item
For each language with a translator, @file{guide.pot}
is copied to a portable object file (@file{.po})
and translations are created and shipped with the application.
For example, there might be a @file{fr.po} for a French translation.

@cindex @code{.gmo} files
@cindex files @subentry @code{.gmo}
@cindex message object files
@cindex files @subentry message object
@item
Each language's @file{.po} file is converted into a binary
message object (@file{.gmo}) file.
A message object file contains the original messages and their
translations in a binary format that allows fast lookup of translations
at runtime.

@item
When @command{guide} is built and installed, the binary translation files
are installed in a standard place.

@cindex @code{bindtextdomain()} function (C library)
@cindex C library functions @subentry @code{bindtextdomain()}
@item
For testing and development, it is possible to tell @command{gettext}
to use @file{.gmo} files in a different directory than the standard
one by using the @code{bindtextdomain()} function.

@cindex @code{.gmo} files @subentry specifying directory of
@cindex files @subentry @code{.gmo} @subentry specifying directory of
@cindex message object files @subentry specifying directory of
@cindex files @subentry message object @subentry specifying directory of
@item
At runtime, @command{guide} looks up each string via a call
to @code{gettext()}. The returned string is the translated string
if available, or the original string if not.

@item
If necessary, it is possible to access messages from a different
text domain than the one belonging to the application, without
having to switch the application's default text domain back
and forth.
@end enumerate

@cindex @code{gettext()} function (C library)
@cindex C library functions @subentry @code{gettext()}
In C (or C++), the string marking and dynamic translation lookup
are accomplished by wrapping each string in a call to @code{gettext()}:

@example
printf("%s", gettext("Don't Panic!\n"));
@end example

The tools that extract messages from source code pull out all
strings enclosed in calls to @code{gettext()}.

@cindex @code{_} (underscore) @subentry C macro
@cindex underscore (@code{_}) @subentry C macro
The GNU @command{gettext} developers, recognizing that typing
@samp{gettext(@dots{})} over and over again is both painful and ugly to look
at, use the macro @samp{_} (an underscore) to make things easier:

@example
/* In the standard header file: */
#define _(str) gettext(str)

/* In the program text: */
printf("%s", _("Don't Panic!\n"));
@end example

@cindex internationalization @subentry localization @subentry locale categories
@cindex @command{gettext} library @subentry locale categories
@cindex locale categories
@noindent
This reduces the typing overhead to just three extra characters per string
and is considerably easier to read as well.

There are locale @dfn{categories}
for different types of locale-related information.
The defined locale categories that @command{gettext} knows about are:

@table @code
@cindex @code{LC_MESSAGES} locale category
@item LC_MESSAGES
Text messages.
This is the default category for @command{gettext}
operations, but it is possible to supply a different one explicitly,
if necessary. (It is almost never necessary to supply a different category.)

@cindex sorting characters in different languages
@cindex @code{LC_COLLATE} locale category
@item LC_COLLATE
Text-collation information (i.e., how different characters
and/or groups of characters sort in a given language).

@cindex @code{LC_CTYPE} locale category
@item LC_CTYPE
Character-type information (alphabetic, digit, upper- or lowercase, and
so on) as well as character encoding.
@ignore
In June 2001 Bruno Haible wrote:
- Description of LC_CTYPE: It determines both
  1. character encoding,
  2. character type information.
  (For example, in both KOI8-R and ISO-8859-5 the character type information
  is the same - cyrillic letters could as 'alpha' - but the encoding is
  different.)
@end ignore
This information is accessed via the
POSIX character classes in regular expressions,
such as @code{/[[:alnum:]]/}
(@pxref{Bracket Expressions}).

@cindex monetary information, localization
@cindex currency symbols, localization
@cindex internationalization @subentry localization @subentry monetary information
@cindex internationalization @subentry localization @subentry currency symbols
@cindex @code{LC_MONETARY} locale category
@item LC_MONETARY
Monetary information, such as the currency symbol, and whether the
symbol goes before or after a number.

@cindex @code{LC_NUMERIC} locale category
@item LC_NUMERIC
Numeric information, such as which characters to use for the decimal
point and the thousands separator.@footnote{Americans
use a comma every three decimal places and a period for the decimal
point, while many Europeans do exactly the opposite:
1,234.56 versus 1.234,56.}

@cindex time @subentry localization and
@cindex dates @subentry information related to, localization
@cindex @code{LC_TIME} locale category
@item LC_TIME
Time- and date-related information, such as 12- or 24-hour clock, month printed
before or after the day in a date, local month abbreviations, and so on.

@cindex @code{LC_ALL} locale category
@item LC_ALL
All of the above. (Not too useful in the context of @command{gettext}.)
@end table

@quotation NOTE
@cindex @env{LANGUAGE} environment variable
@cindex environment variables @subentry @env{LANGUAGE}
As described in @ref{Locales}, environment variables with the same
name as the locale categories (@env{LC_CTYPE}, @env{LC_ALL}, etc.)
influence @command{gawk}'s behavior (and that of other utilities).

Normally, these variables also affect how the @code{gettext} library
finds translations. However, the @env{LANGUAGE} environment variable
overrides the @env{LC_@var{xxx}} variables. Many GNU/Linux systems
may define this variable without your knowledge, causing @command{gawk}
to not find the correct translations. If this happens to you,
look to see if @env{LANGUAGE} is defined, and if so, use the shell's
@command{unset} command to remove it.
@end quotation

@cindex @env{GAWK_LOCALE_DIR} environment variable
@cindex environment variables @subentry @env{GAWK_LOCALE_DIR}
For testing translations of @command{gawk} itself, you can set
the @env{GAWK_LOCALE_DIR} environment variable.
See the documentation
for the C @code{bindtextdomain()} function and also see
@ref{Other Environment Variables}.

@node Programmer i18n
@section Internationalizing @command{awk} Programs
@cindex @command{awk} programs @subentry internationalizing

@command{gawk} provides the following variables for
internationalization:

@table @code
@cindex @code{TEXTDOMAIN} variable
@item TEXTDOMAIN
This variable indicates the application's text domain.
For compatibility with GNU @command{gettext}, the default
value is @code{"messages"}.

@cindex internationalization @subentry localization @subentry marked strings
@cindex strings @subentry for localization
@item _"your message here"
String constants marked with a leading underscore
are candidates for translation at runtime.
String constants without a leading underscore are not translated.
@end table

@command{gawk} provides the following functions for
internationalization:

@table @code
@cindexgawkfunc{dcgettext}
@item @code{dcgettext(@var{string}} [@code{,} @var{domain} [@code{,} @var{category}]]@code{)}
Return the translation of @var{string} in
text domain @var{domain} for locale category @var{category}.
The default value for @var{domain} is the current value of @code{TEXTDOMAIN}.
The default value for @var{category} is @code{"LC_MESSAGES"}.

If you supply a value for @var{category}, it must be a string equal to
one of the known locale categories described in
@ifnotinfo
the previous @value{SECTION}.
@end ifnotinfo
@ifinfo
@ref{Explaining gettext}.
@end ifinfo
You must also supply a text domain. Use @code{TEXTDOMAIN} if
you want to use the current domain.

@quotation CAUTION
The order of arguments to the @command{awk} version
of the @code{dcgettext()} function is purposely different from the order for
the C version. The @command{awk} version's order was
chosen to be simple and to allow for reasonable @command{awk}-style
default arguments.
@end quotation

@cindexgawkfunc{dcngettext}
@item @code{dcngettext(@var{string1}, @var{string2}, @var{number}} [@code{,} @var{domain} [@code{,} @var{category}]]@code{)}
Return the plural form used for @var{number} of the
translation of @var{string1} and @var{string2} in text domain
@var{domain} for locale category @var{category}. @var{string1} is the
English singular variant of a message, and @var{string2} is the English plural
variant of the same message.
The default value for @var{domain} is the current value of @code{TEXTDOMAIN}.
The default value for @var{category} is @code{"LC_MESSAGES"}.

The same remarks about argument order as for the @code{dcgettext()} function apply.

@cindex @code{.gmo} files @subentry specifying directory of
@cindex files @subentry @code{.gmo} @subentry specifying directory of
@cindex message object files @subentry specifying directory of
@cindex files @subentry message object @subentry specifying directory of
@cindexgawkfunc{bindtextdomain}
@item @code{bindtextdomain(@var{directory}} [@code{,} @var{domain} ]@code{)}
Change the directory in which
@command{gettext} looks for @file{.gmo} files, in case they
will not or cannot be placed in the standard locations
(e.g., during testing).
Return the directory in which @var{domain} is ``bound.''

The default @var{domain} is the value of @code{TEXTDOMAIN}.
If @var{directory} is the null string (@code{""}), then
@code{bindtextdomain()} returns the current binding for the
given @var{domain}.
@end table

To use these facilities in your @command{awk} program, follow these steps:

@enumerate
@cindex @code{BEGIN} pattern @subentry @code{TEXTDOMAIN} variable and
@cindex @code{TEXTDOMAIN} variable @subentry @code{BEGIN} pattern and
@item
Set the variable @code{TEXTDOMAIN} to the text domain of
your program. This is best done in a @code{BEGIN} rule
(@pxref{BEGIN/END}),
or it can also be done via the @option{-v} command-line
option (@pxref{Options}):

@example
BEGIN @{
    TEXTDOMAIN = "guide"
    @dots{}
@}
@end example

@cindex @code{_} (underscore) @subentry translatable strings
@cindex underscore (@code{_}) @subentry translatable strings
@item
Mark all translatable strings with a leading underscore (@samp{_})
character. It @emph{must} be adjacent to the opening
quote of the string. For example:

@example
print _"hello, world"
x = _"you goofed"
printf(_"Number of users is %d\n", nusers)
@end example

@item
If you are creating strings dynamically, you can
still translate them, using the @code{dcgettext()}
built-in function:@footnote{Thanks to Bruno Haible for this
example.}

@example
if (groggy)
    message = dcgettext("%d customers disturbing me\n", "adminprog")
else
    message = dcgettext("enjoying %d customers\n", "adminprog")
printf(message, ncustomers)
@end example

Here, the call to @code{dcgettext()} supplies a different
text domain (@code{"adminprog"}) in which to find the
message, but it uses the default @code{"LC_MESSAGES"} category.

The previous example only works if @code{ncustomers} is greater than one.
This example would be better done with @code{dcngettext()}:

@example
if (groggy)
    message = dcngettext("%d customer disturbing me\n",
                         "%d customers disturbing me\n",
                         ncustomers, "adminprog")
else
    message = dcngettext("enjoying %d customer\n",
                         "enjoying %d customers\n",
                         ncustomers, "adminprog")
printf(message, ncustomers)
@end example


@cindex @code{LC_MESSAGES} locale category @subentry @code{bindtextdomain()} function (@command{gawk})
@item
During development, you might want to put the @file{.gmo}
file in a private directory for testing. This is done
with the @code{bindtextdomain()} built-in function:

@example
BEGIN @{
    TEXTDOMAIN = "guide"   # our text domain
    if (Testing) @{
        # where to find our files
        bindtextdomain("testdir")
        # joe is in charge of adminprog
        bindtextdomain("../joe/testdir", "adminprog")
    @}
    @dots{}
@}
@end example

@end enumerate

@xref{I18N Example}
for an example program showing the steps to create
and use translations from @command{awk}.

@node Translator i18n
@section Translating @command{awk} Programs

@cindex @code{.po} files
@cindex files @subentry @code{.po}
@cindex portable object @subentry files
@cindex files @subentry portable object
Once a program's translatable strings have been marked, they must
be extracted to create the initial @file{.pot} file.
As part of translation, it is often helpful to rearrange the order
in which arguments to @code{printf} are output.

@command{gawk}'s @option{--gen-pot} command-line option extracts
the messages and is discussed next.
After that, we cover @code{printf}'s ability to
rearrange the order of its arguments at runtime.

@menu
* String Extraction::           Extracting marked strings.
* Printf Ordering::             Rearranging @code{printf} arguments.
* I18N Portability::            @command{awk}-level portability issues.
@end menu

@node String Extraction
@subsection Extracting Marked Strings
@cindex strings @subentry extracting
@cindex @option{--gen-pot} option
@cindex command line @subentry options @subentry string extraction
@cindex string @subentry extraction (internationalization)
@cindex marked string extraction (internationalization)
@cindex extraction, of marked strings (internationalization)

@cindex @option{--gen-pot} option
Once your @command{awk} program is working, and all the strings have
been marked and you've set (and perhaps bound) the text domain,
it is time to produce translations.
First, use the @option{--gen-pot} command-line option to create
the initial @file{.pot} file:

@example
gawk --gen-pot -f guide.awk > guide.pot
@end example

@cindex @command{xgettext} utility
When run with @option{--gen-pot}, @command{gawk} does not execute your
program. Instead, it parses it as usual and prints all marked strings
to standard output in the format of a GNU @command{gettext} Portable Object
file. Also included in the output are any constant strings that
appear as the first argument to @code{dcgettext()} or as the first and
second argument to @code{dcngettext()}.@footnote{The
@command{xgettext} utility that comes with GNU
@command{gettext} can handle @file{.awk} files.}
You should distribute the generated @file{.pot} file with
your @command{awk} program; translators will eventually use it
to provide you translations that you can also then distribute.
@xref{I18N Example}
for the full list of steps to go through to create and test
translations for @command{guide}.
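
As a minimal sketch of these extraction rules, consider the following
program. It is not part of the @command{guide} example: the file name
@file{extract.awk}, the text domain @code{"demo"}, and the variable
@code{nfiles} are hypothetical, chosen only for illustration:

@example
# extract.awk --- hypothetical file for illustrating --gen-pot
BEGIN @{
    TEXTDOMAIN = "demo"     # assumed text domain
    nfiles = 3              # assumed sample value

    # Marked with a leading underscore: extracted
    print _"Processing files"

    # Constant first argument to dcgettext(): extracted
    print dcgettext("All done")

    # Constant first and second arguments to dcngettext(): extracted
    printf(dcngettext("%d file\n", "%d files\n", nfiles), nfiles)

    # Unmarked string constant: not extracted
    print "debug: finished"
@}
@end example

@noindent
Going by the rules just described, @samp{gawk --gen-pot -f extract.awk}
should produce entries covering the first four message strings and
nothing for the unmarked one.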

@node Printf Ordering
@subsection Rearranging @code{printf} Arguments

@cindex @code{printf} statement @subentry positional specifiers
@cindex positional specifiers, @code{printf} statement
Format strings for @code{printf} and @code{sprintf()}
(@pxref{Printf})
present a special problem for translation.
Consider the following:@footnote{This example is borrowed
from the GNU @command{gettext} manual.}

@example
printf(_"String `%s' has %d characters\n",
          string, length(string))
@end example

A possible German translation for this might be:

@example
"%d Zeichen lang ist die Zeichenkette `%s'\n"
@end example

The problem should be obvious: the order of the format
specifications is different from the original!
Even though @code{gettext()} can return the translated string
at runtime,
it cannot change the argument order in the call to @code{printf}.

To solve this problem, @code{printf} format specifiers may have
an additional optional element, which we call a @dfn{positional specifier}.
For example:

@example
"%2$d Zeichen lang ist die Zeichenkette `%1$s'\n"
@end example

Here, the positional specifier consists of an integer count, which indicates which
argument to use, and a @samp{$}. Counts are one-based, and the
format string itself is @emph{not} included.
Thus, in the following
example, @samp{string} is the first argument and @samp{length(string)} is the second:

@example
$ @kbd{gawk 'BEGIN @{}
> @kbd{    string = "Don\47t Panic"}
> @kbd{    printf "%2$d characters live in \"%1$s\"\n",}
> @kbd{                        string, length(string)}
> @kbd{@}'}
@print{} 11 characters live in "Don't Panic"
@end example

If present, positional specifiers come first in the format specification,
before the flags, the field width, and/or the precision.

Positional specifiers can be used with the dynamic field width and
precision capability:

@example
$ @kbd{gawk 'BEGIN @{}
> @kbd{   printf("%*.*s\n", 10, 20, "hello")}
> @kbd{   printf("%3$*2$.*1$s\n", 20, 10, "hello")}
> @kbd{@}'}
@print{}      hello
@print{}      hello
@end example

@quotation NOTE
When using @samp{*} with a positional specifier, the @samp{*}
comes first, then the integer position, and then the @samp{$}.
This is somewhat counterintuitive.
@end quotation

@cindex @code{printf} statement @subentry positional specifiers @subentry mixing with regular formats
@cindex positional specifiers, @code{printf} statement @subentry mixing with regular formats
@cindex format specifiers @subentry mixing regular with positional specifiers
@command{gawk} does not allow you to mix regular format specifiers
and those with positional specifiers in the same string:

@example
@group
$ @kbd{gawk 'BEGIN @{ printf "%d %3$s\n", 1, 2, "hi" @}'}
@error{} gawk: cmd. line:1: fatal: must use `count$' on all formats or none
@end group
@end example

@quotation NOTE
There are some pathological cases that @command{gawk} may fail to
diagnose. In such cases, the output may not be what you expect.
It's still a bad idea to try mixing them, even if @command{gawk}
doesn't detect it.
@end quotation

Although positional specifiers can be used directly in @command{awk} programs,
their primary purpose is to help in producing correct translations of
format strings into languages different from the one in which the program
is first written.

@node I18N Portability
@subsection @command{awk} Portability Issues

@cindex portability @subentry internationalization and
@cindex internationalization @subentry localization @subentry portability and
@command{gawk}'s internationalization features were purposely chosen to
have as little impact as possible on the portability of @command{awk}
programs that use them to other versions of @command{awk}.
Consider this program:

@example
BEGIN @{
    TEXTDOMAIN = "guide"
    if (Test_Guide)   # set with -v
        bindtextdomain("/test/guide/messages")
    print _"don't panic!"
@}
@end example

@noindent
As written, it won't work on other versions of @command{awk}.
However, it is actually almost portable, requiring very little
change:

@itemize @value{BULLET}
@cindex @code{TEXTDOMAIN} variable @subentry portability and
@item
Assignments to @code{TEXTDOMAIN} won't have any effect,
because @code{TEXTDOMAIN} is not special in other @command{awk} implementations.

@item
Non-GNU versions of @command{awk} treat marked strings
as the concatenation of a variable named @code{_} with the string
following it.@footnote{This is good fodder for an ``Obfuscated
@command{awk}'' contest.} Typically, the variable @code{_} has
the null string (@code{""}) as its value, leaving the original string constant as
the result.

@item
By defining ``dummy'' functions to replace @code{dcgettext()}, @code{dcngettext()},
and @code{bindtextdomain()}, the @command{awk} program can be made to run, but
all the messages are output in the original language.
For example:

@cindex @code{bindtextdomain()} function (@command{gawk}) @subentry portability and
@cindex @code{dcgettext()} function (@command{gawk}) @subentry portability and
@cindex @code{dcngettext()} function (@command{gawk}) @subentry portability and
@example
@c file eg/lib/libintl.awk
function bindtextdomain(dir, domain)
@{
    return dir
@}

function dcgettext(string, domain, category)
@{
    return string
@}

function dcngettext(string1, string2, number, domain, category)
@{
    return (number == 1 ? string1 : string2)
@}
@c endfile
@end example

@item
The use of positional specifications in @code{printf} or
@code{sprintf()} is @emph{not} portable.
To support @code{gettext()} at the C level, many systems' C versions of
@code{sprintf()} do support positional specifiers. But it works only if
enough arguments are supplied in the function call. Many versions of
@command{awk} pass @code{printf} formats and arguments unchanged to the
underlying C library version of @code{sprintf()}, but only one format and
argument at a time. What happens if a positional specification is
used is anybody's guess.
However, because the positional specifications are primarily for use in
@emph{translated} format strings, and because non-GNU @command{awk}s never
retrieve the translated string, this should not be a problem in practice.
@end itemize

@node I18N Example
@section A Simple Internationalization Example

Now let's look at a step-by-step example of how to internationalize and
localize a simple @command{awk} program, using @file{guide.awk} as our
original source:

@example
@c file eg/prog/guide.awk
BEGIN @{
    TEXTDOMAIN = "guide"
    bindtextdomain(".")  # for testing
    print _"Don't Panic"
    print _"The Answer Is", 42
    print "Pardon me, Zaphod who?"
@}
@c endfile
@end example

@noindent
Run @samp{gawk --gen-pot} to create the @file{.pot} file:

@example
$ @kbd{gawk --gen-pot -f guide.awk > guide.pot}
@end example

@noindent
This produces:

@example
@c file eg/data/guide.po
#: guide.awk:4
msgid "Don't Panic"
msgstr ""

#: guide.awk:5
msgid "The Answer Is"
msgstr ""

@c endfile
@end example

This original portable object template file is saved and reused for each language
into which the application is translated. The @code{msgid}
is the original string and the @code{msgstr} is the translation.

@quotation NOTE
Strings not marked with a leading underscore do not
appear in the @file{guide.pot} file.
@end quotation

Next, the messages must be translated.
Here is a translation to a hypothetical dialect of English,
called ``Mellow'':@footnote{Perhaps it would be better if it were
called ``Hippy.'' Ah, well.}

@example
@group
$ @kbd{cp guide.pot guide-mellow.po}
@var{Add translations to} guide-mellow.po @dots{}
@end group
@end example

@noindent
Following are the translations:

@example
@c file eg/data/guide-mellow.po
#: guide.awk:4
msgid "Don't Panic"
msgstr "Hey man, relax!"

#: guide.awk:5
msgid "The Answer Is"
msgstr "Like, the scoop is"

@c endfile
@end example

@cindex GNU/Linux
@quotation NOTE
The following instructions apply to GNU/Linux with the GNU C Library. Be
aware that the actual steps may change over time, that the following
description may not be accurate for all GNU/Linux distributions, and
that things may work entirely differently on other operating systems.
@end quotation

The next step is to make the directory to hold the binary message object
file and then to create the @file{guide.mo} file.
The directory has the form @file{@var{locale}/LC_MESSAGES}, where
@var{locale} is a locale name known to the C @command{gettext} routines.

@cindex @env{LANGUAGE} environment variable
@cindex environment variables @subentry @env{LANGUAGE}
@cindex @env{LC_ALL} environment variable
@cindex environment variables @subentry @env{LC_ALL}
@cindex @env{LANG} environment variable
@cindex environment variables @subentry @env{LANG}
@cindex @env{LC_MESSAGES} environment variable
@cindex environment variables @subentry @env{LC_MESSAGES}
How do we know which locale to use? It turns out that there are
four different environment variables used by the C @command{gettext} routines.
In order of precedence, they are @env{$LANGUAGE}, @env{$LC_ALL},
@env{$LC_MESSAGES}, and @env{$LANG}.@footnote{Well, sort of. It seems that if @env{$LC_ALL}
is set to @samp{C}, then no translations are done.
Go figure.}
Thus, we check the value of @env{$LANGUAGE}:

@example
$ @kbd{echo $LANGUAGE}
@print{} en_US.UTF-8
@end example

@noindent
We next make the directories:

@example
$ @kbd{mkdir en_US.UTF-8 en_US.UTF-8/LC_MESSAGES}
@end example

@cindex @code{.po} files @subentry converting to @code{.mo}
@cindex files @subentry @code{.po} @subentry converting to @code{.mo}
@cindex @code{.mo} files, converting from @code{.po}
@cindex files @subentry @code{.mo}, converting from @code{.po}
@cindex portable object @subentry files @subentry converting to message object files
@cindex files @subentry portable object @subentry converting to message object files
@cindex message object files @subentry converting from portable object files
@cindex files @subentry message object @subentry converting from portable object files
@cindex @command{msgfmt} utility
The @command{msgfmt} utility converts the human-readable
@file{.po} file into a machine-readable @file{.mo} file.
By default, @command{msgfmt} creates a file named @file{messages.mo}.
Use the @option{-o} option to give the file the proper name and to
place it in the proper directory so that @command{gawk} can find it:

@example
$ @kbd{msgfmt guide-mellow.po -o en_US.UTF-8/LC_MESSAGES/guide.mo}
@end example

Finally, we run the program to test it:

@example
$ @kbd{gawk -f guide.awk}
@print{} Hey man, relax!
@print{} Like, the scoop is 42
@print{} Pardon me, Zaphod who?
@end example

If the three replacement functions for @code{dcgettext()}, @code{dcngettext()},
and @code{bindtextdomain()}
(@pxref{I18N Portability})
are in a file named @file{libintl.awk},
then we can run @file{guide.awk} unchanged as follows:

@example
$ @kbd{gawk --posix -f guide.awk -f libintl.awk}
@print{} Don't Panic
@print{} The Answer Is 42
@print{} Pardon me, Zaphod who?
@end example

@node Gawk I18N
@section @command{gawk} Can Speak Your Language

@command{gawk} itself has been internationalized
using the GNU @command{gettext} package.
(GNU @command{gettext} is described in
complete detail in
@ifinfo
@inforef{Top, , GNU @command{gettext} utilities, gettext, GNU @command{gettext} utilities}.)
@end ifinfo
@ifnotinfo
@uref{https://www.gnu.org/software/gettext/manual/,
@cite{GNU @command{gettext} utilities}}.)
@end ifnotinfo
As of this writing, the latest version of GNU @command{gettext} is
@uref{ftp://ftp.gnu.org/gnu/gettext/gettext-0.19.8.1.tar.gz,
@value{PVERSION} 0.19.8.1}.

If a translation of @command{gawk}'s messages exists,
then @command{gawk} produces usage messages, warnings,
and fatal errors in the local language.

@node I18N Summary
@section Summary

@itemize @value{BULLET}
@item
Internationalization means writing a program such that it can use multiple
languages without requiring source code changes. Localization means
providing the data necessary for an internationalized program to work
in a particular language.

@item
@command{gawk} uses GNU @command{gettext} to let you internationalize
and localize @command{awk} programs. A program's text domain identifies
the program for grouping all messages and other data together.

@item
You mark a program's strings for translation by preceding them with
an underscore. Once that is done, the strings are extracted into a
@file{.pot} file. This file is copied for each language into a @file{.po}
file, and the @file{.po} files are compiled into @file{.gmo} files for
use at runtime.

@item
You can use positional specifications with @code{sprintf()} and
@code{printf} to rearrange the placement of argument values in formatted
strings and output. This is useful for the translation of format
control strings.

@item
The internationalization features have been designed so that they
can be easily worked around in a standard @command{awk}.

@item
@command{gawk} itself has been internationalized and ships with
a number of translations for its messages.

@end itemize


@node Debugger
@chapter Debugging @command{awk} Programs
@cindex debugging @subentry @command{awk} programs

@c The original text for this chapter was contributed by Efraim Yawitz.

It would be nice if computer programs worked perfectly the first time they
were run, but in real life, this rarely happens for programs of
any complexity. Thus, most programming languages have facilities available
for ``debugging'' programs, and @command{awk} is no exception.

The @command{gawk} debugger is purposely modeled after
@uref{https://www.gnu.org/software/gdb/, the GNU Debugger (GDB)}
command-line debugger. If you are familiar with GDB, learning
how to use @command{gawk} for debugging your programs is easy.

@menu
* Debugging::                   Introduction to @command{gawk} debugger.
* Sample Debugging Session::    Sample debugging session.
* List of Debugger Commands::   Main debugger commands.
* Readline Support::            Readline support.
* Limitations::                 Limitations and future plans.
* Debugging Summary::           Debugging summary.
@end menu

@node Debugging
@section Introduction to the @command{gawk} Debugger

This @value{SECTION} introduces debugging in general and begins
the discussion of debugging in @command{gawk}.

@menu
* Debugging Concepts::          Debugging in General.
* Debugging Terms::             Additional Debugging Concepts.
* Awk Debugging::               Awk Debugging.
@end menu

@node Debugging Concepts
@subsection Debugging in General

(If you have used debuggers in other languages, you may want to skip
ahead to @ref{Awk Debugging}.)

Of course, a debugging program cannot remove bugs for you, because it has
no way of knowing what you or your users consider a ``bug'' versus a
``feature.'' (Sometimes, we humans have a hard time with this ourselves.)
In that case, what can you expect from such a tool? The answer to that
depends on the language being debugged, but in general, you can expect at
least the following:

@cindex debugger @subentry capabilities
@itemize @value{BULLET}
@item
The ability to watch a program execute its instructions one by one,
giving you, the programmer, the opportunity to think about what is happening
on a time scale of seconds, minutes, or hours, rather than the nanosecond
time scale at which the code usually runs.

@item
The opportunity to not only passively observe the operation of your
program, but to control it and try different paths of execution, without
having to change your source files.

@item
The chance to see the values of data in the program at any point in
execution, and also to change that data on the fly, to see how that
affects what happens afterward. (This often includes the ability
to look at internal data structures besides the variables you actually
defined in your code.)

@item
The ability to obtain additional information about your program's state
or even its internal structure.
@end itemize

All of these tools provide a great amount of help in using your own
skills and understanding of the goals of your program to find where it
is going wrong (or, for that matter, to better comprehend a perfectly
functional program that you or someone else wrote).

@node Debugging Terms
@subsection Debugging Concepts

@cindex debugger @subentry concepts
Before diving into the details, we need to introduce several
important concepts that apply to just about all debuggers.
The following list defines terms used throughout the rest of
this @value{CHAPTER}:

@table @dfn
@cindex call stack @subentry explanation of
@cindex stack frame (debugger)
@item Stack frame
Programs generally call functions during the course of their execution.
One function can call another, or a function can call itself (recursion).
You can view the chain of called functions (main program calls A, which
calls B, which calls C), as a stack of executing functions: the currently
running function is the topmost one on the stack, and when it finishes
(returns), the next one down then becomes the active function.
Such a stack is termed a @dfn{call stack}.

For each function on the call stack, the system maintains a data area
that contains the function's parameters, local variables, and return value,
as well as any other ``bookkeeping'' information needed to manage the
call stack. This data area is termed a @dfn{stack frame}.

@command{gawk} also follows this model, and gives you
access to the call stack and to each stack frame. You can see the
call stack, as well as from where each function on the stack was
invoked.
Commands that print the call stack print information about
each stack frame (as detailed later on).

@item Breakpoint
@cindex breakpoint
During debugging, you often wish to let the program run until it
reaches a certain point, and then continue execution from there one
statement (or instruction) at a time. The way to do this is to set
a @dfn{breakpoint} within the program. A breakpoint is where the
execution of the program should break off (stop), so that you can
take over control of the program's execution. You can add and remove
as many breakpoints as you like.

@item Watchpoint
@cindex watchpoint (debugger)
A watchpoint is similar to a breakpoint. The difference is that
breakpoints are oriented around the code: stop when a certain point in the
code is reached. A watchpoint, however, specifies that program execution
should stop when a @emph{data value} is changed. This is useful, as
sometimes it happens that a variable receives an erroneous value, and it's
hard to track down where this happens just by looking at the code.
By using a watchpoint, you can stop whenever a variable is assigned to,
and usually find the errant code quite quickly.
@end table

@node Awk Debugging
@subsection @command{awk} Debugging

Debugging an @command{awk} program has some specific aspects that are
not shared with programs written in other languages.

First of all, the fact that @command{awk} programs usually take input
line by line from a file or files and operate on those lines using specific
rules makes it especially useful to organize viewing the execution of
the program in terms of these rules. As we will see, each @command{awk}
rule is treated almost like a function call, with its own specific block
of instructions.
30872 30873In addition, because @command{awk} is by design a very concise language, 30874it is easy to lose sight of everything that is going on ``inside'' 30875each line of @command{awk} code. The debugger provides the opportunity 30876to look at the individual primitive instructions carried out 30877by the higher-level @command{awk} commands.@footnote{The ``primitive 30878instructions'' are defined by @command{gawk} itself; the debugger 30879does not work at the level of machine instructions.} 30880 30881@node Sample Debugging Session 30882@section Sample @command{gawk} Debugging Session 30883@cindex sample debugging session 30884@cindex example debugging session 30885@cindex debugging @subentry example session 30886 30887In order to illustrate the use of @command{gawk} as a debugger, let's look at a sample 30888debugging session. We will use the @command{awk} implementation of the 30889POSIX @command{uniq} command presented earlier (@pxref{Uniq Program}) 30890as our example. 30891 30892@menu 30893* Debugger Invocation:: How to Start the Debugger. 30894* Finding The Bug:: Finding the Bug. 30895@end menu 30896 30897@node Debugger Invocation 30898@subsection How to Start the Debugger 30899@cindex starting the debugger 30900@cindex debugger @subentry how to start 30901 30902Starting the debugger is almost exactly like running @command{gawk} normally, 30903except you have to pass an additional option, @option{--debug}, or the 30904corresponding short option, @option{-D}. The file(s) containing the 30905program and any supporting code are given on the command line as arguments 30906to one or more @option{-f} options. (@command{gawk} is not designed 30907to debug command-line programs, only programs contained in files.) 
30908In our case, we invoke the debugger like this: 30909 30910@example 30911$ @kbd{gawk -D -f getopt.awk -f join.awk -f uniq.awk -1 inputfile} 30912@end example 30913 30914@noindent 30915where @file{getopt.awk}, @file{join.awk}, and @file{uniq.awk} are all in @env{$AWKPATH}. 30916(Experienced users of GDB or similar debuggers should note that 30917this syntax is slightly different from what you are used to. 30918With the @command{gawk} debugger, you give the arguments for running the program 30919on the command line to the debugger rather than as part of the @code{run} 30920command at the debugger prompt.) 30921The @option{-1} is an option to @file{uniq.awk}. 30922 30923@cindex debugger @subentry prompt 30924Instead of immediately running the program on @file{inputfile}, as 30925@command{gawk} would ordinarily do, the debugger merely loads all 30926the program source files, compiles them internally, and then gives 30927us a prompt: 30928 30929@example 30930gawk> 30931@end example 30932 30933@noindent 30934from which we can issue commands to the debugger. At this point, no 30935code has been executed. 30936 30937@node Finding The Bug 30938@subsection Finding the Bug 30939 30940Let's say that we are having a problem using (a faulty version of) 30941@file{uniq.awk} in ``field-skipping'' mode, and it doesn't seem to be 30942catching lines that should be identical when skipping the first field, 30943such as: 30944 30945@example 30946awk is a wonderful program! 30947gawk is a wonderful program!
30948@end example 30949 30950This could happen if we were thinking (C-like) of the fields in a record 30951as being numbered in a zero-based fashion, so instead of the lines: 30952 30953@example 30954clast = join(alast, fcount+1, n) 30955cline = join(aline, fcount+1, m) 30956@end example 30957 30958@noindent 30959we wrote: 30960 30961@example 30962clast = join(alast, fcount, n) 30963cline = join(aline, fcount, m) 30964@end example 30965 30966The first thing we usually want to do when trying to investigate a 30967problem like this is to put a breakpoint in the program so that we can 30968watch it at work and catch what it is doing wrong. A reasonable spot for 30969a breakpoint in @file{uniq.awk} is at the beginning of the function 30970@code{are_equal()}, which compares the current line with the previous one. To set 30971the breakpoint, use the @code{b} (breakpoint) command: 30972 30973@cindex debugger @subentry setting a breakpoint 30974@cindex debugger @subentry commands @subentry @code{breakpoint} 30975@cindex debugger @subentry commands @subentry @code{break} 30976@cindex debugger @subentry commands @subentry @code{b} (@code{break}) 30977@example 30978gawk> @kbd{b are_equal} 30979@print{} Breakpoint 1 set at file `awklib/eg/prog/uniq.awk', line 63 30980@end example 30981 30982The debugger tells us the file and line number where the breakpoint is. 30983Now type @samp{r} or @samp{run} and the program runs until it hits 30984the breakpoint for the first time: 30985 30986@cindex debugger @subentry running the program 30987@cindex debugger @subentry commands @subentry @code{run} 30988@example 30989gawk> @kbd{r} 30990@print{} Starting program: 30991@print{} Stopping in Rule ... 30992@print{} Breakpoint 1, are_equal(n, m, clast, cline, alast, aline) 30993 at `awklib/eg/prog/uniq.awk':63 30994@print{} 63 if (fcount == 0 && charcount == 0) 30995gawk> 30996@end example 30997 30998Now we can look at what's going on inside our program. 
First of all, 30999let's see how we got to where we are. At the prompt, we type @samp{bt} 31000(short for ``backtrace''), and the debugger responds with a 31001listing of the current stack frames: 31002 31003@cindex debugger @subentry stack frames, showing 31004@cindex debugger @subentry commands @subentry @code{bt} (@code{backtrace}) 31005@cindex debugger @subentry commands @subentry @code{backtrace} 31006@example 31007gawk> @kbd{bt} 31008@print{} #0 are_equal(n, m, clast, cline, alast, aline) 31009 at `awklib/eg/prog/uniq.awk':68 31010@print{} #1 in main() at `awklib/eg/prog/uniq.awk':88 31011@end example 31012 31013This tells us that @code{are_equal()} was called by the main program at 31014line 88 of @file{uniq.awk}. (This is not a big surprise, because this 31015is the only call to @code{are_equal()} in the program, but in more complex 31016programs, knowing who called a function and with what parameters can be 31017the key to finding the source of the problem.) 31018 31019Now that we're in @code{are_equal()}, we can start looking at the values 31020of some variables. Let's say we type @samp{p n} 31021(@code{p} is short for ``print''). We would expect to see the value of 31022@code{n}, a parameter to @code{are_equal()}. Actually, the debugger 31023gives us: 31024 31025@cindex debugger @subentry commands @subentry @code{print} 31026@cindex debugger @subentry commands @subentry @code{p} (@code{print}) 31027@example 31028gawk> @kbd{p n} 31029@print{} n = untyped variable 31030@end example 31031 31032@noindent 31033In this case, @code{n} is an uninitialized local variable, because the 31034function was called without arguments (@pxref{Function Calls}). 31035 31036A more useful variable to display might be the current record: 31037 31038@example 31039gawk> @kbd{p $0} 31040@print{} $0 = "gawk is a wonderful program!" 31041@end example 31042 31043@noindent 31044This might be a bit puzzling at first, as this is the second line of 31045our test input. 
Let's look at @code{NR}: 31046 31047@example 31048gawk> @kbd{p NR} 31049@print{} NR = 2 31050@end example 31051 31052@noindent 31053So we can see that @code{are_equal()} was only called for the second record 31054of the file. Of course, this is because our program contains a rule for 31055@samp{NR == 1}: 31056 31057@example 31058NR == 1 @{ 31059 last = $0 31060 next 31061@} 31062@end example 31063 31064OK, let's just check that that rule worked correctly: 31065 31066@example 31067gawk> @kbd{p last} 31068@print{} last = "awk is a wonderful program!" 31069@end example 31070 31071Everything we have done so far has verified that the program has worked as 31072planned, up to and including the call to @code{are_equal()}, so the problem must 31073be inside this function. To investigate further, we must begin 31074``stepping through'' the lines of @code{are_equal()}. We start by typing 31075@samp{n} (for ``next''): 31076 31077@cindex debugger @subentry commands @subentry @code{n} (@code{next}) 31078@cindex debugger @subentry commands @subentry @code{next} 31079@example 31080@group 31081gawk> @kbd{n} 31082@print{} 66 if (fcount > 0) @{ 31083@end group 31084@end example 31085 31086This tells us that @command{gawk} is now ready to execute line 66, which 31087decides whether to give the lines the special ``field-skipping'' treatment 31088indicated by the @option{-1} command-line option. (Notice that we skipped 31089from where we were before, at line 63, to here, because the condition 31090in line 63, @samp{if (fcount == 0 && charcount == 0)}, was false.) 
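
Before stepping further, it can help to see in isolation what the two
versions of the @code{join()} call shown at the start of this
@value{SUBSECTION} produce. The following stand-alone sketch is an
illustration under stated assumptions, not part of @file{uniq.awk}:
the file name @file{offbyone.awk} is hypothetical, @code{join()} here is
a simplified stand-in for the library function in @file{join.awk}, and
@code{fcount} is set by hand to the value that the @option{-1} option is
taken to request:

@example
# offbyone.awk --- hypothetical demo; not part of uniq.awk

# Simplified stand-in for join(): concatenate array[start..end],
# separated by single spaces.
function join(array, start, end,    result, i)
@{
    result = array[start]
    for (i = start + 1; i <= end; i++)
        result = result " " array[i]
    return result
@}

BEGIN @{
    fcount = 1          # assumed: skip one field, as with -1
    last = "awk is a wonderful program!"
    n = split(last, alast)
    print "faulty:", join(alast, fcount, n)        # starts at field 1
    print "fixed: ", join(alast, fcount + 1, n)    # starts at field 2
@}
@end example

@noindent
Running the sketch gives:

@example
$ @kbd{gawk -f offbyone.awk}
@print{} faulty: awk is a wonderful program!
@print{} fixed:  is a wonderful program!
@end example

@noindent
Because @command{awk} numbers fields starting at one, the faulty call
leaves the first field in place, so records that differ only in that
field never compare equal.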
31091 31092Continuing to step, we now get to the splitting of the current and 31093last records: 31094 31095@example 31096gawk> @kbd{n} 31097@print{} 67 n = split(last, alast) 31098gawk> @kbd{n} 31099@print{} 68 m = split($0, aline) 31100@end example 31101 31102At this point, we should be curious to see what our records were split 31103into, so we try to look: 31104 31105@example 31106gawk> @kbd{p n m alast aline} 31107@print{} n = 5 31108@print{} m = untyped variable 31109@print{} alast = array, 5 elements 31110@print{} aline = untyped variable 31111@end example 31112 31113@noindent 31114(The @code{p} command can take more than one argument, similar to 31115@command{awk}'s @code{print} statement.) 31116 31117This is kind of disappointing, though. All we found out is that there 31118are five elements in @code{alast}; @code{m} and @code{aline} don't have 31119values because we are at line 68 but haven't executed it yet. 31120This information is useful enough (we now know that 31121none of the words were accidentally left out), but what if we want to see 31122inside the array? 31123 31124@cindex debugger @subentry printing single array elements 31125The first choice would be to use subscripts: 31126 31127@example 31128gawk> @kbd{p alast[0]} 31129@print{} "0" not in array `alast' 31130@end example 31131 31132@noindent 31133Oops! 31134 31135@example 31136gawk> @kbd{p alast[1]} 31137@print{} alast["1"] = "awk" 31138@end example 31139 31140This would be kind of slow for a 100-member array, though, so 31141@command{gawk} provides a shortcut (reminiscent of another language 31142not to be mentioned): 31143 31144@cindex debugger @subentry printing all array elements 31145@example 31146gawk> @kbd{p @@alast} 31147@print{} alast["1"] = "awk" 31148@print{} alast["2"] = "is" 31149@print{} alast["3"] = "a" 31150@print{} alast["4"] = "wonderful" 31151@print{} alast["5"] = "program!" 31152@end example 31153 31154It looks like we got this far OK. 
Let's take another step 31155or two: 31156 31157@example 31158gawk> @kbd{n} 31159@print{} 69 clast = join(alast, fcount, n) 31160gawk> @kbd{n} 31161@print{} 70 cline = join(aline, fcount, m) 31162@end example 31163 31164Well, here we are at our error (sorry to spoil the suspense). What we 31165had in mind was to join the fields starting from the second one to make 31166the virtual record to compare, and if the first field were numbered zero, 31167this would work. Let's look at what we've got: 31168 31169@example 31170gawk> @kbd{p cline clast} 31171@print{} cline = "gawk is a wonderful program!" 31172@print{} clast = "awk is a wonderful program!" 31173@end example 31174 31175Hey, those look pretty familiar! They're just our original, unaltered 31176input records. A little thinking (the human brain is still the best 31177debugging tool), and we realize that we were off by one! 31178 31179We get out of the debugger: 31180 31181@example 31182gawk> @kbd{q} 31183@print{} The program is running. Exit anyway (y/n)? @kbd{y} 31184@end example 31185 31186@noindent 31187Then we get into an editor: 31188 31189@example 31190clast = join(alast, fcount+1, n) 31191cline = join(aline, fcount+1, m) 31192@end example 31193 31194@noindent 31195and problem solved! 31196 31197@node List of Debugger Commands 31198@section Main Debugger Commands 31199 31200The @command{gawk} debugger command set can be divided into the 31201following categories: 31202 31203@itemize @value{BULLET} 31204 31205@item 31206Breakpoint control 31207 31208@item 31209Execution control 31210 31211@item 31212Viewing and changing data 31213 31214@item 31215Working with the stack 31216 31217@item 31218Getting information 31219 31220@item 31221Miscellaneous 31222@end itemize 31223 31224@cindex debugger @subentry repeating commands 31225Each of these is discussed in the following subsections. 31226In the following descriptions, commands that may be abbreviated 31227show the abbreviation on a second description line.
31228A debugger command name may also be truncated if that partial 31229name is unambiguous. The debugger has the built-in capability to 31230automatically repeat the previous command just by hitting @kbd{Enter}. 31231This works for the commands @code{list}, @code{next}, @code{nexti}, 31232@code{step}, @code{stepi}, and @code{continue} executed without any 31233argument. 31234 31235@menu 31236* Breakpoint Control:: Control of Breakpoints. 31237* Debugger Execution Control:: Control of Execution. 31238* Viewing And Changing Data:: Viewing and Changing Data. 31239* Execution Stack:: Dealing with the Stack. 31240* Debugger Info:: Obtaining Information about the Program and 31241 the Debugger State. 31242* Miscellaneous Debugger Commands:: Miscellaneous Commands. 31243@end menu 31244 31245@node Breakpoint Control 31246@subsection Control of Breakpoints 31247 31248As we saw earlier, the first thing you probably want to do in a debugging 31249session is to get your breakpoints set up, because your program 31250will otherwise just run as if it were not under the debugger. The commands for 31251controlling breakpoints are: 31252 31253@table @asis 31254@cindex debugger @subentry commands @subentry @code{b} (@code{break}) 31255@cindex debugger @subentry commands @subentry @code{break} 31256@cindex @code{break} debugger command 31257@cindex @code{b} debugger command (alias for @code{break}) 31258@cindex set breakpoint 31259@cindex breakpoint @subentry setting 31260@item @code{break} [[@var{filename}@code{:}]@var{n} | @var{function}] [@code{"@var{expression}"}] 31261@itemx @code{b} [[@var{filename}@code{:}]@var{n} | @var{function}] [@code{"@var{expression}"}] 31262Without any argument, set a breakpoint at the next instruction 31263to be executed in the selected stack frame. 31264The argument can be one of the following: 31265 31266@c @asis for docbook 31267@c nested table 31268@table @asis 31269@item @var{n} 31270Set a breakpoint at line number @var{n} in the current source file.
31271 31272@item @var{filename}@code{:}@var{n} 31273Set a breakpoint at line number @var{n} in source file @var{filename}. 31274 31275@item @var{function} 31276Set a breakpoint at entry to (the first instruction of) 31277function @var{function}. 31278@end table 31279 31280Each breakpoint is assigned a number that can be used to delete it from 31281the breakpoint list using the @code{delete} command. 31282 31283With a breakpoint, you may also supply a condition. This is an 31284@command{awk} expression (enclosed in double quotes) that the debugger 31285evaluates whenever the breakpoint is reached. If the condition is true, 31286then the debugger stops execution and prompts for a command. Otherwise, 31287it continues executing the program. 31288 31289@cindex debugger @subentry commands @subentry @code{clear} 31290@cindex @code{clear} debugger command 31291@cindex delete breakpoint @subentry at location 31292@cindex breakpoint @subentry at location, how to delete 31293@item @code{clear} [[@var{filename}@code{:}]@var{n} | @var{function}] 31294Without any argument, delete any breakpoint at the next instruction 31295to be executed in the selected stack frame. If the program stops at 31296a breakpoint, this deletes that breakpoint so that the program 31297does not stop at that location again. Arguments can be one of the following: 31298 31299@c nested table 31300@table @asis 31301@item @var{n} 31302Delete breakpoint(s) set at line number @var{n} in the current source file. 31303 31304@item @var{filename}@code{:}@var{n} 31305Delete breakpoint(s) set at line number @var{n} in source file @var{filename}. 31306 31307@item @var{function} 31308Delete breakpoint(s) set at entry to function @var{function}. 
31309@end table 31310 31311@cindex debugger @subentry commands @subentry @code{condition} 31312@cindex @code{condition} debugger command 31313@cindex breakpoint @subentry condition 31314@item @code{condition} @var{n} @code{"@var{expression}"} 31315Add a condition to existing breakpoint or watchpoint @var{n}. The 31316condition is an @command{awk} expression @emph{enclosed in double quotes} 31317that the debugger evaluates 31318whenever the breakpoint or watchpoint is reached. If the condition is true, then 31319the debugger stops execution and prompts for a command. Otherwise, 31320the debugger continues executing the program. If the condition expression is 31321not specified, any existing condition is removed (i.e., the breakpoint or 31322watchpoint is made unconditional). 31323 31324@cindex debugger @subentry commands @subentry @code{d} (@code{delete}) 31325@cindex debugger @subentry commands @subentry @code{delete} 31326@cindex @code{delete} debugger command 31327@cindex @code{d} debugger command (alias for @code{delete}) 31328@cindex delete breakpoint @subentry by number 31329@cindex breakpoint @subentry delete by number 31330@item @code{delete} [@var{n1 n2} @dots{}] [@var{n}--@var{m}] 31331@itemx @code{d} [@var{n1 n2} @dots{}] [@var{n}--@var{m}] 31332Delete specified breakpoints or a range of breakpoints. Delete 31333all defined breakpoints if no argument is supplied. 31334 31335@cindex debugger @subentry commands @subentry @code{disable} 31336@cindex @code{disable} debugger command 31337@cindex disable breakpoint 31338@cindex breakpoint @subentry how to disable or enable 31339@item @code{disable} [@var{n1 n2} @dots{} | @var{n}--@var{m}] 31340Disable specified breakpoints or a range of breakpoints. Without 31341any argument, disable all breakpoints. 
31342 31343@cindex debugger @subentry commands @subentry @code{e} (@code{enable}) 31344@cindex debugger @subentry commands @subentry @code{enable} 31345@cindex @code{enable} debugger command 31346@cindex @code{e} debugger command (alias for @code{enable}) 31347@cindex enable breakpoint 31348@item @code{enable} [@code{del} | @code{once}] [@var{n1 n2} @dots{}] [@var{n}--@var{m}] 31349@itemx @code{e} [@code{del} | @code{once}] [@var{n1 n2} @dots{}] [@var{n}--@var{m}] 31350Enable specified breakpoints or a range of breakpoints. Without 31351any argument, enable all breakpoints. 31352Optionally, you can specify how to enable the breakpoints: 31353 31354@c nested table 31355@table @code 31356@item del 31357Enable the breakpoints temporarily, then delete each one when 31358the program stops at it. 31359 31360@item once 31361Enable the breakpoints temporarily, then disable each one when 31362the program stops at it. 31363@end table 31364 31365@cindex debugger @subentry commands @subentry @code{ignore} 31366@cindex @code{ignore} debugger command 31367@cindex ignore breakpoint 31368@item @code{ignore} @var{n} @var{count} 31369Ignore breakpoint number @var{n} the next @var{count} times it is 31370hit. 31371 31372@cindex debugger @subentry commands @subentry @code{t} (@code{tbreak}) 31373@cindex debugger @subentry commands @subentry @code{tbreak} 31374@cindex @code{tbreak} debugger command 31375@cindex @code{t} debugger command (alias for @code{tbreak}) 31376@cindex temporary breakpoint 31377@item @code{tbreak} [[@var{filename}@code{:}]@var{n} | @var{function}] 31378@itemx @code{t} [[@var{filename}@code{:}]@var{n} | @var{function}] 31379Set a temporary breakpoint (enabled for only one stop). 31380The arguments are the same as for @code{break}. 31381@end table 31382 31383@node Debugger Execution Control 31384@subsection Control of Execution 31385 31386Now that your breakpoints are ready, you can start running the program 31387and observing its behavior. 
There are more commands for controlling 31388execution of the program than we saw in our earlier example: 31389 31390@table @asis 31391@cindex debugger @subentry commands @subentry @code{commands} 31392@cindex @code{commands} debugger command 31393@cindex debugger @subentry commands @subentry @code{silent} 31394@cindex @code{silent} debugger command 31395@cindex debugger @subentry commands @subentry @code{end} 31396@cindex @code{end} debugger command 31397@cindex breakpoint @subentry commands to execute at 31398@cindex commands to execute at breakpoint 31399@item @code{commands} [@var{n}] 31400@itemx @code{silent} 31401@itemx @dots{} 31402@itemx @code{end} 31403Set a list of commands to be executed upon stopping at 31404a breakpoint or watchpoint. @var{n} is the breakpoint or watchpoint number. 31405Without a number, the last one set is used. The actual commands follow, 31406starting on the next line, and terminated by the @code{end} command. 31407If the command @code{silent} is in the list, the usual messages about 31408stopping at a breakpoint and the source line are not printed. Any command 31409in the list that resumes execution (e.g., @code{continue}) terminates the list 31410(an implicit @code{end}), and subsequent commands are ignored. 31411For example: 31412 31413@example 31414gawk> @kbd{commands} 31415> @kbd{silent} 31416> @kbd{printf "A silent breakpoint; i = %d\n", i} 31417> @kbd{info locals} 31418> @kbd{set i = 10} 31419> @kbd{continue} 31420> @kbd{end} 31421gawk> 31422@end example 31423 31424@cindex debugger @subentry commands @subentry @code{c} (@code{continue}) 31425@cindex debugger @subentry commands @subentry @code{continue} 31426@cindex continue program, in debugger 31427@cindex @code{continue} debugger command 31428@item @code{continue} [@var{count}] 31429@itemx @code{c} [@var{count}] 31430Resume program execution. 
If continued from a breakpoint and @var{count} is 31431specified, ignore the breakpoint at that location the next @var{count} times 31432before stopping. 31433 31434@cindex debugger @subentry commands @subentry @code{finish} 31435@cindex @code{finish} debugger command 31436@item @code{finish} 31437Execute until the selected stack frame returns. 31438Print the returned value. 31439 31440@cindex debugger @subentry commands @subentry @code{n} (@code{next}) 31441@cindex debugger @subentry commands @subentry @code{next} 31442@cindex @code{next} debugger command 31443@cindex @code{n} debugger command (alias for @code{next}) 31444@cindex single-step execution, in the debugger 31445@item @code{next} [@var{count}] 31446@itemx @code{n} [@var{count}] 31447Continue execution to the next source line, stepping over function calls. 31448The argument @var{count} controls how many times to repeat the action, as 31449in @code{step}. 31450 31451@cindex debugger @subentry commands @subentry @code{ni} (@code{nexti}) 31452@cindex debugger @subentry commands @subentry @code{nexti} 31453@cindex @code{nexti} debugger command 31454@cindex @code{ni} debugger command (alias for @code{nexti}) 31455@item @code{nexti} [@var{count}] 31456@itemx @code{ni} [@var{count}] 31457Execute one (or @var{count}) instruction(s), stepping over function calls. 31458 31459@cindex debugger @subentry commands @subentry @code{return} 31460@cindex @code{return} debugger command 31461@item @code{return} [@var{value}] 31462Cancel execution of a function call. If @var{value} (either a string or a 31463number) is specified, it is used as the function's return value. If used in a 31464frame other than the innermost one (the currently executing function; i.e., 31465frame number 0), discard all inner frames in addition to the selected one, 31466and the caller of that frame becomes the innermost frame. 
31467 31468@cindex debugger @subentry commands @subentry @code{r} (@code{run}) 31469@cindex debugger @subentry commands @subentry @code{run} 31470@cindex @code{run} debugger command 31471@cindex @code{r} debugger command (alias for @code{run}) 31472@item @code{run} 31473@itemx @code{r} 31474Start/restart execution of the program. When restarting, the debugger 31475retains the current breakpoints, watchpoints, command history, 31476automatic display variables, and debugger options. 31477 31478@cindex debugger @subentry commands @subentry @code{s} (@code{step}) 31479@cindex debugger @subentry commands @subentry @code{step} 31480@cindex @code{step} debugger command 31481@cindex @code{s} debugger command (alias for @code{step}) 31482@item @code{step} [@var{count}] 31483@itemx @code{s} [@var{count}] 31484Continue execution until control reaches a different source line in the 31485current stack frame, stepping inside any function called within 31486the line. If the argument @var{count} is supplied, steps that many times before 31487stopping, unless it encounters a breakpoint or watchpoint. 31488 31489@cindex debugger @subentry commands @subentry @code{si} (@code{stepi}) 31490@cindex debugger @subentry commands @subentry @code{stepi} 31491@cindex @code{stepi} debugger command 31492@cindex @code{si} debugger command (alias for @code{stepi}) 31493@item @code{stepi} [@var{count}] 31494@itemx @code{si} [@var{count}] 31495Execute one (or @var{count}) instruction(s), stepping inside function calls. 31496(For illustration of what is meant by an ``instruction'' in @command{gawk}, 31497see the output shown under @code{dump} in @ref{Miscellaneous Debugger Commands}.) 
31498 31499@cindex debugger @subentry commands @subentry @code{u} (@code{until}) 31500@cindex debugger @subentry commands @subentry @code{until} 31501@cindex @code{until} debugger command 31502@cindex @code{u} debugger command (alias for @code{until}) 31503@item @code{until} [[@var{filename}@code{:}]@var{n} | @var{function}] 31504@itemx @code{u} [[@var{filename}@code{:}]@var{n} | @var{function}] 31505Without any argument, continue execution until a line past the current 31506line in the current stack frame is reached. With an argument, 31507continue execution until the specified location is reached, or the current 31508stack frame returns. 31509@end table 31510 31511@node Viewing And Changing Data 31512@subsection Viewing and Changing Data 31513 31514The commands for viewing and changing variables inside of @command{gawk} are: 31515 31516@table @asis 31517@cindex debugger @subentry commands @subentry @code{display} 31518@cindex @code{display} debugger command 31519@item @code{display} [@var{var} | @code{$}@var{n}] 31520Add variable @var{var} (or field @code{$@var{n}}) to the display list. 31521The value of the variable or field is displayed each time the program stops. 31522Each variable added to the list is identified by a unique number: 31523 31524@example 31525gawk> @kbd{display x} 31526@print{} 10: x = 1 31527@end example 31528 31529@noindent 31530This displays the assigned item number, the variable name, and its current value. 31531If the display variable refers to a function parameter, it is silently 31532deleted from the list as soon as the execution reaches a context where 31533no such variable of the given name exists. 31534Without argument, @code{display} displays the current values of 31535items on the list. 
31536 31537@cindex debugger @subentry commands @subentry @code{eval} 31538@cindex @code{eval} debugger command 31539@cindex evaluate expressions, in debugger 31540@item @code{eval "@var{awk statements}"} 31541Evaluate @var{awk statements} in the context of the running program. 31542You can do anything that an @command{awk} program would do: assign 31543values to variables, call functions, and so on. 31544 31545@quotation NOTE 31546You cannot use @code{eval} to execute a statement containing 31547any of the following: 31548@code{exit}, 31549@code{getline}, 31550@code{next}, 31551@code{nextfile}, 31552or 31553@code{return}. 31554@end quotation 31555 31556@item @code{eval} @var{param}, @dots{} 31557@itemx @var{awk statements} 31558@itemx @code{end} 31559This form of @code{eval} is similar, but it allows you to define 31560``local variables'' that exist in the context of the 31561@var{awk statements}, instead of using variables or function 31562parameters defined by the program. 31563 31564@cindex debugger @subentry commands @subentry @code{p} (@code{print}) 31565@cindex debugger @subentry commands @subentry @code{print} 31566@cindex @code{print} debugger command 31567@cindex @code{p} debugger command (alias for @code{print}) 31568@cindex print variables, in debugger 31569@item @code{print} @var{var1}[@code{,} @var{var2} @dots{}] 31570@itemx @code{p} @var{var1}[@code{,} @var{var2} @dots{}] 31571Print the value of a @command{gawk} variable or field. 31572Fields must be referenced by constants: 31573 31574@example 31575gawk> @kbd{print $3} 31576@end example 31577 31578@noindent 31579This prints the third field in the input record (if the specified field does not 31580exist, it prints @samp{Null field}). A variable can be an array element, with 31581the subscripts being constant string values. 
To print the contents of an array, 31582prefix the name of the array with the @samp{@@} symbol: 31583 31584@example 31585gawk> @kbd{print @@a} 31586@end example 31587 31588@noindent 31589This prints the indices and the corresponding values for all elements in 31590the array @code{a}. 31591 31592@cindex debugger @subentry commands @subentry @code{printf} 31593@cindex @code{printf} debugger command 31594@item @code{printf} @var{format} [@code{,} @var{arg} @dots{}] 31595Print formatted text. The @var{format} may include escape sequences, 31596such as @samp{\n} 31597(@pxref{Escape Sequences}). 31598No newline is printed unless one is specified. 31599 31600@cindex debugger @subentry commands @subentry @code{set} 31601@cindex @code{set} debugger command 31602@cindex assign values to variables, in debugger 31603@item @code{set} @var{var}@code{=}@var{value} 31604Assign a constant (number or string) value to an @command{awk} variable 31605or field. 31606String values must be enclosed between double quotes (@code{"}@dots{}@code{"}). 31607 31608You can also set special @command{awk} variables, such as @code{FS}, 31609@code{NF}, @code{NR}, and so on. 31610 31611@cindex debugger @subentry commands @subentry @code{w} (@code{watch}) 31612@cindex debugger @subentry commands @subentry @code{watch} 31613@cindex @code{watch} debugger command 31614@cindex @code{w} debugger command (alias for @code{watch}) 31615@cindex set watchpoint 31616@item @code{watch} @var{var} | @code{$}@var{n} [@code{"@var{expression}"}] 31617@itemx @code{w} @var{var} | @code{$}@var{n} [@code{"@var{expression}"}] 31618Add variable @var{var} (or field @code{$@var{n}}) to the watch list. 31619The debugger then stops whenever 31620the value of the variable or field changes. Each watched item is assigned a 31621number that can be used to delete it from the watch list using the 31622@code{unwatch} command. 31623 31624With a watchpoint, you may also supply a condition. 
This is an 31625@command{awk} expression (enclosed in double quotes) that the debugger 31626evaluates whenever the watchpoint is reached. If the condition is true, 31627then the debugger stops execution and prompts for a command. Otherwise, 31628@command{gawk} continues executing the program. 31629 31630@cindex debugger @subentry commands @subentry @code{undisplay} 31631@cindex @code{undisplay} debugger command 31632@cindex stop automatic display, in debugger 31633@item @code{undisplay} [@var{n}] 31634Remove item number @var{n} (or all items, if no argument) from the 31635automatic display list. 31636 31637@cindex debugger @subentry commands @subentry @code{unwatch} 31638@cindex @code{unwatch} debugger command 31639@cindex delete watchpoint 31640@item @code{unwatch} [@var{n}] 31641Remove item number @var{n} (or all items, if no argument) from the 31642watch list. 31643 31644@end table 31645 31646@node Execution Stack 31647@subsection Working with the Stack 31648 31649Whenever you run a program that contains any function calls, 31650@command{gawk} maintains a stack of all of the function calls leading up 31651to where the program is right now. You can see how you got to where you are, 31652and also move around in the stack to see what the state of things was in the 31653functions that called the one you are in. 
The commands for doing this are:

@table @asis
@cindex debugger @subentry commands @subentry @code{bt} (@code{backtrace})
@cindex debugger @subentry commands @subentry @code{backtrace}
@cindex debugger @subentry commands @subentry @code{where} (@code{backtrace})
@cindex @code{backtrace} debugger command
@cindex @code{bt} debugger command (alias for @code{backtrace})
@cindex @code{where} debugger command (alias for @code{backtrace})
@cindex call stack @subentry display in debugger
@cindex traceback, display in debugger
@item @code{backtrace} [@var{count}]
@itemx @code{bt} [@var{count}]
@itemx @code{where} [@var{count}]
Print a backtrace of all function calls (stack frames), or the innermost
@var{count} frames if @var{count} > 0. Print the outermost @var{count} frames if
@var{count} < 0. The backtrace displays the name and arguments of each
function, the source @value{FN}, and the line number.
The alias @code{where} for @code{backtrace} is provided for longtime
GDB users who may be used to that command.

@cindex debugger @subentry commands @subentry @code{down}
@cindex @code{down} debugger command
@item @code{down} [@var{count}]
Move @var{count} (default 1) frames down the stack toward the innermost frame.
Then select and print the frame.

@cindex debugger @subentry commands @subentry @code{f} (@code{frame})
@cindex debugger @subentry commands @subentry @code{frame}
@cindex @code{frame} debugger command
@cindex @code{f} debugger command (alias for @code{frame})
@item @code{frame} [@var{n}]
@itemx @code{f} [@var{n}]
Select and print stack frame @var{n}. Frame 0 is the currently executing,
or @dfn{innermost}, frame (function call); frame 1 is the frame that
called the innermost one. The highest-numbered frame is the one for the
main program.
The printed information consists of the frame number,
function and argument names, source file, and the source line.

@cindex debugger @subentry commands @subentry @code{up}
@cindex @code{up} debugger command
@item @code{up} [@var{count}]
Move @var{count} (default 1) frames up the stack toward the outermost frame.
Then select and print the frame.
@end table

@node Debugger Info
@subsection Obtaining Information About the Program and the Debugger State

Besides looking at the values of variables, there is often a need to get
other sorts of information about the state of your program and of the
debugging environment itself. The @command{gawk} debugger has one command that
provides this information, appropriately called @code{info}. @code{info}
is used with one of a number of arguments that tell it exactly what
you want to know:

@table @asis
@cindex debugger @subentry commands @subentry @code{i} (@code{info})
@cindex debugger @subentry commands @subentry @code{info}
@cindex @code{info} debugger command
@cindex @code{i} debugger command (alias for @code{info})
@item @code{info} @var{what}
@itemx @code{i} @var{what}
The value for @var{what} should be one of the following:

@c nested table
@table @code
@item args
@cindex show in debugger @subentry function arguments
@cindex function arguments, show in debugger
List arguments of the selected frame.

@item break
@cindex show in debugger @subentry breakpoints
@cindex breakpoint @subentry show all in debugger
List all currently set breakpoints.

@item display
@cindex automatic displays, in debugger
List all items in the automatic display list.

@item frame
@cindex describe call stack frame, in debugger
Give a description of the selected stack frame.

@item functions
@cindex list function definitions, in debugger
@cindex function definitions, list in debugger
List all function definitions including source @value{FN}s and
line numbers.

@item locals
@cindex show in debugger @subentry local variables
@cindex local variables @subentry show in debugger
List local variables of the selected frame.

@item source
@cindex show in debugger @subentry name of current source file
@cindex current source file, show in debugger
@cindex source file, show in debugger
Print the name of the current source file. Each time the program stops, the
current source file is the file containing the current instruction.
When the debugger first starts, the current source file is the first file
included via the @option{-f} option. The
@samp{list @var{filename}:@var{lineno}} command can
be used at any time to change the current source.

@item sources
@cindex show in debugger @subentry all source files
@cindex all source files, show in debugger
List all program sources.

@item variables
@cindex list all global variables, in debugger
@cindex global variables, show in debugger
List all global variables.

@item watch
@cindex show in debugger @subentry watchpoints
@cindex watchpoints, show in debugger
List all items in the watch list.
@end table
@end table

Additional commands give you control over the debugger, the ability to
save the debugger's state, and the ability to run debugger commands
from a file.
The commands are:

@table @asis
@cindex debugger @subentry commands @subentry @code{o} (@code{option})
@cindex debugger @subentry commands @subentry @code{option}
@cindex @code{option} debugger command
@cindex @code{o} debugger command (alias for @code{option})
@cindex display debugger options
@cindex debugger @subentry options
@item @code{option} [@var{name}[@code{=}@var{value}]]
@itemx @code{o} [@var{name}[@code{=}@var{value}]]
Without an argument, display the available debugger options
and their current values. @samp{option @var{name}} shows the current
value of the named option. @samp{option @var{name}=@var{value}} assigns
a new value to the named option.
The available options are:

@c nested table
@c asis for docbook
@table @asis
@item @code{history_size}
@cindex debugger @subentry history size
Set the maximum number of lines to keep in the history file
@file{./.gawk_history}. The default is 100.

@item @code{listsize}
@cindex debugger @subentry default list amount
Specify the number of lines that @code{list} prints. The default is 15.

@item @code{outfile}
@cindex redirect @command{gawk} output, in debugger
Send @command{gawk} output to a file; debugger output still goes
to standard output. An empty string (@code{""}) resets output to
standard output.

@item @code{prompt}
@cindex debugger @subentry prompt
Change the debugger prompt. The default is @samp{@w{gawk> }}.

@item @code{save_history} [@code{on} | @code{off}]
@cindex debugger @subentry history file
Save command history to file @file{./.gawk_history}.
The default is @code{on}.

@item @code{save_options} [@code{on} | @code{off}]
@cindex save debugger options
Save current options to file @file{./.gawkrc} upon exit.
The default is @code{on}.
Options are read back into the next session upon startup.

@item @code{trace} [@code{on} | @code{off}]
@cindex instruction tracing, in debugger
@cindex debugger @subentry instruction tracing
Turn instruction tracing on or off. The default is @code{off}.
@end table

@cindex debugger @subentry save commands to a file
@item @code{save} @var{filename}
Save the commands from the current session to the given @value{FN},
so that they can be replayed using the @command{source} command.

@item @code{source} @var{filename}
@cindex debugger @subentry read commands from a file
Run command(s) from a file; an error in any command does not
terminate execution of subsequent commands. Comments (lines starting
with @samp{#}) are allowed in a command file.
Empty lines are ignored; they do @emph{not}
repeat the last command.
You can't restart the program by having more than one @code{run}
command in the file. Also, the list of commands may include additional
@code{source} commands; however, the @command{gawk} debugger will not source the
same file more than once in order to avoid infinite recursion.

In addition to, or instead of, the @code{source} command, you can use
the @option{-D @var{file}} or @option{--debug=@var{file}} command-line
options to execute commands from a file non-interactively
(@pxref{Options}).
@end table

@node Miscellaneous Debugger Commands
@subsection Miscellaneous Commands

There are a few more commands that do not fit into the
previous categories, as follows:

@table @asis
@cindex debugger @subentry commands @subentry @code{dump}
@cindex @code{dump} debugger command
@item @code{dump} [@var{filename}]
Dump byte code of the program to standard output or to the file
named in @var{filename}.
This prints a representation of the internal
instructions that @command{gawk} executes to implement the @command{awk}
commands in a program. This can be very enlightening, as the following
partial dump of Davide Brini's obfuscated code
(@pxref{Signature Program}) demonstrates:

@smallexample
@group
gawk> @kbd{dump}
@print{}     # BEGIN
@print{}
@print{} [     1:0xfcd340] Op_rule             : [in_rule = BEGIN] [source_file = brini.awk]
@end group
@print{} [     1:0xfcc240] Op_push_i           : "~" [MALLOC|STRING|STRCUR]
@print{} [     1:0xfcc2a0] Op_push_i           : "~" [MALLOC|STRING|STRCUR]
@print{} [     1:0xfcc280] Op_match            :
@print{} [     1:0xfcc1e0] Op_store_var        : O
@print{} [     1:0xfcc2e0] Op_push_i           : "==" [MALLOC|STRING|STRCUR]
@print{} [     1:0xfcc340] Op_push_i           : "==" [MALLOC|STRING|STRCUR]
@print{} [     1:0xfcc320] Op_equal            :
@print{} [     1:0xfcc200] Op_store_var        : o
@print{} [     1:0xfcc380] Op_push             : o
@print{} [     1:0xfcc360] Op_plus_i           : 0 [MALLOC|NUMCUR|NUMBER]
@print{} [     1:0xfcc220] Op_push_lhs         : o [do_reference = true]
@print{} [     1:0xfcc300] Op_assign_plus      :
@print{} [      :0xfcc2c0] Op_pop              :
@print{} [     1:0xfcc400] Op_push             : O
@print{} [     1:0xfcc420] Op_push_i           : "" [MALLOC|STRING|STRCUR]
@print{} [      :0xfcc4a0] Op_no_op            :
@print{} [     1:0xfcc480] Op_push             : O
@print{} [      :0xfcc4c0] Op_concat           : [expr_count = 3] [concat_flag = 0]
@print{} [     1:0xfcc3c0] Op_store_var        : x
@print{} [     1:0xfcc440] Op_push_lhs         : X [do_reference = true]
@print{} [     1:0xfcc3a0] Op_postincrement    :
@print{} [     1:0xfcc4e0] Op_push             : x
@print{} [     1:0xfcc540] Op_push             : o
@print{} [     1:0xfcc500] Op_plus             :
@print{} [     1:0xfcc580] Op_push             : o
@print{} [     1:0xfcc560] Op_plus             :
@print{} [     1:0xfcc460] Op_leq              :
@print{} [      :0xfcc5c0] Op_jmp_false        : [target_jmp = 0xfcc5e0]
@print{} [     1:0xfcc600] Op_push_i           : "%c" [MALLOC|STRING|STRCUR]
@print{} [      :0xfcc660] Op_no_op            :
@print{} [     1:0xfcc520] Op_assign_concat    : c
@print{} [      :0xfcc620] Op_jmp              : [target_jmp = 0xfcc440]
@dots{}
@print{} [     2:0xfcc5a0] Op_K_printf         : [expr_count = 17] [redir_type = ""]
@print{} [      :0xfcc140] Op_no_op            :
@print{} [      :0xfcc1c0] Op_atexit           :
@print{} [      :0xfcc640] Op_stop             :
@print{} [      :0xfcc180] Op_no_op            :
@print{} [      :0xfcd150] Op_after_beginfile  :
@group
@print{} [      :0xfcc160] Op_no_op            :
@print{} [      :0xfcc1a0] Op_after_endfile    :
gawk>
@end group
@end smallexample

@cindex @code{exit} debugger command
@cindex exit the debugger
@item @code{exit}
Exit the debugger.
See the entry for @samp{quit}, later in this list.

@cindex debugger @subentry commands @subentry @code{h} (@code{help})
@cindex debugger @subentry commands @subentry @code{help}
@cindex @code{help} debugger command
@cindex @code{h} debugger command (alias for @code{help})
@item @code{help}
@itemx @code{h}
Print a list of all of the @command{gawk} debugger commands with a short
summary of their usage. @samp{help @var{command}} prints the information
about the command @var{command}.

@cindex debugger @subentry commands @subentry @code{l} (@code{list})
@cindex debugger @subentry commands @subentry @code{list}
@cindex @code{list} debugger command
@cindex @code{l} debugger command (alias for @code{list})
@item @code{list} [@code{-} | @code{+} | @var{n} | @var{filename}@code{:}@var{n} | @var{n}--@var{m} | @var{function}]
@itemx @code{l} [@code{-} | @code{+} | @var{n} | @var{filename}@code{:}@var{n} | @var{n}--@var{m} | @var{function}]
Print the specified lines (default 15) from the current source file
or the file named @var{filename}. The possible arguments to @code{list}
are as follows:

@c nested table
@table @asis
@item @code{-} (Minus)
Print lines before the lines last printed.

@item @code{+}
Print lines after the lines last printed.
@code{list} without any argument does the same thing.

@item @var{n}
Print lines centered around line number @var{n}.

@item @var{n}--@var{m}
Print lines from @var{n} to @var{m}.

@item @var{filename}@code{:}@var{n}
Print lines centered around line number @var{n} in
source file @var{filename}. This command may change the current source file.

@item @var{function}
Print lines centered around the beginning of the
function @var{function}. This command may change the current source file.
@end table

@cindex debugger @subentry commands @subentry @code{q} (@code{quit})
@cindex debugger @subentry commands @subentry @code{quit}
@cindex @code{quit} debugger command
@cindex @code{q} debugger command (alias for @code{quit})
@cindex exit the debugger
@item @code{quit}
@itemx @code{q}
Exit the debugger. Debugging is great fun, but sometimes we all have
to tend to other obligations in life, and sometimes we find the bug
and are free to go on to the next one! As we saw earlier, if you are
running a program, the debugger warns you when you type
@samp{q} or @samp{quit}, to make sure you really want to quit.

@cindex debugger @subentry commands @subentry @code{trace}
@cindex @code{trace} debugger command
@item @code{trace} [@code{on} | @code{off}]
Turn on or off continuous printing of the instructions that are about to
be executed, along with the @command{awk} lines they
implement. The default is @code{off}.

It is to be hoped that most of the ``opcodes'' in these instructions are
fairly self-explanatory, and using @code{stepi} and @code{nexti} while
@code{trace} is on will make them into familiar friends.

@end table

@node Readline Support
@section Readline Support
@cindex command completion, in debugger
@cindex debugger @subentry command completion
@cindex history expansion, in debugger
@cindex debugger @subentry history expansion

If @command{gawk} is compiled with
@uref{http://cnswww.cns.cwru.edu/php/chet/readline/readline.html,
the GNU Readline library}, you can take advantage of that library's
command completion and history expansion features. The following types
of completion are available:

@table @asis
@item Command completion
Command names.

@item Source @value{FN} completion
Source @value{FN}s. Relevant commands are
@code{break},
@code{clear},
@code{list},
@code{tbreak},
and
@code{until}.

@item Argument completion
Non-numeric arguments to a command.
Relevant commands are @code{enable} and @code{info}.

@item Variable name completion
Global variable names, and function arguments in the current context
if the program is running. Relevant commands are
@code{display},
@code{print},
@code{set},
and
@code{watch}.

@end table

@node Limitations
@section Limitations

@cindex debugger @subentry limitations
We hope you find the @command{gawk} debugger useful and enjoyable to work with,
but as with any program, especially in its early releases, it still has
some limitations. A few that are worth being aware of are:

@itemize @value{BULLET}
@item
At this point, the debugger does not give a detailed explanation of
what you did wrong when you type in something it doesn't like. Rather, it just
responds @samp{syntax error}. When you do figure out what your mistake was,
though, you'll feel like a real guru.

@item
@c NOTE: no comma after the ref{} on purpose, due to following
@c parenthetical remark.
If you perused the dump of opcodes in @ref{Miscellaneous Debugger Commands}
(or if you are already familiar with @command{gawk} internals),
you will realize that much of the internal manipulation of data
in @command{gawk}, as in many interpreters, is done on a stack.
@code{Op_push}, @code{Op_pop}, and the like are the ``bread and butter'' of
most @command{gawk} code.

Unfortunately, as of now, the @command{gawk}
debugger does not allow you to examine the stack's contents.
That is, the intermediate results of expression evaluation are on the
stack, but cannot be printed. Rather, only variables that are defined
in the program can be printed. Of course, a workaround for
this is to use more explicit variables at the debugging stage and then
change back to obscure, perhaps more optimal code later.

@item
There is no way to look ``inside'' the process of compiling
regular expressions to see if you got it right. As an @command{awk}
programmer, you are expected to know the meaning of
@code{/[^[:alnum:][:blank:]]/}.

@item
The @command{gawk} debugger is designed to be used by running a program (with all its
parameters) on the command line, as described in @ref{Debugger Invocation}.
There is no way (as of now) to attach or ``break into'' a running program.
This seems reasonable for a language that is used mainly for short,
quickly executing programs.

@item
The @command{gawk} debugger only accepts source code supplied with the @option{-f} option.
If you have a shell script that provides an @command{awk} program as a command
line parameter, and you need to use the debugger, you can write the script
to a temporary file, and use that as the program, with the @option{-f} option.
This
might look like this:

@example
cat << \EOF > /tmp/script.$$
@dots{}                                   @ii{Your program here}
EOF
gawk -D -f /tmp/script.$$
rm /tmp/script.$$
@end example
@end itemize

@ignore
@c 11/2016: This no longer applies after all the type cleanup work that's been done.
One other point is worth discussing. Conventional debuggers run in a
separate process (and thus address space) from the programs that they
debug (the @dfn{debuggee}, if you will).

The @command{gawk} debugger is different; it is an integrated part
of @command{gawk} itself. This makes it possible, in rare cases,
for @command{gawk} to become an excellent demonstrator of Heisenberg
Uncertainty physics, where the mere act of observing something can change
it. Consider the following:@footnote{Thanks to Hermann Peifer for
this example.}

@example
$ @kbd{cat test.awk}
@print{} @{ print typeof($1), typeof($2) @}
$ @kbd{cat test.data}
@print{} abc 123
$ @kbd{gawk -f test.awk test.data}
@print{} strnum strnum
@end example

This is all as expected: field data has the STRNUM attribute
(@pxref{Variable Typing}). Now watch what happens when we run
this program under the debugger:

@example
$ @kbd{gawk -D -f test.awk test.data}
gawk> @kbd{w $1}                        @ii{Set watchpoint on} $1
@print{} Watchpoint 1: $1
gawk> @kbd{w $2}                        @ii{Set watchpoint on} $2
@print{} Watchpoint 2: $2
gawk> @kbd{r}                           @ii{Start the program}
@print{} Starting program:
@print{} Stopping in Rule ...
@print{} Watchpoint 1: $1               @ii{Watchpoint fires}
@print{}   Old value: ""
@print{}   New value: "abc"
@print{} main() at `test.awk':1
@print{} 1       @{ print typeof($1), typeof($2) @}
gawk> @kbd{n}                           @ii{Keep going @dots{}}
@print{} Watchpoint 2: $2               @ii{Watchpoint fires}
@print{}   Old value: ""
@print{}   New value: "123"
@print{} main() at `test.awk':1
@print{} 1       @{ print typeof($1), typeof($2) @}
gawk> @kbd{n}                           @ii{Get result from} typeof()
@print{} strnum number                  @ii{Result for} $2 @ii{isn't right}
@print{} Program exited normally with exit value: 0
gawk> @kbd{quit}
@end example

In this case, the act of comparing the new value of @code{$2}
with the old one caused @command{gawk} to evaluate it and determine that it
is indeed a number, and this is reflected in the result of
@code{typeof()}.

Cases like this where the debugger is not transparent to the program's
execution should be rare. If you encounter one, please report it
(@pxref{Bugs}).
@end ignore

@ignore
Look forward to a future release when these and other missing features may
be added, and of course feel free to try to add them yourself!
@end ignore

@node Debugging Summary
@section Summary

@itemize @value{BULLET}
@item
Programs rarely work correctly the first time. Finding bugs
is called debugging, and a program that helps you find bugs is a
debugger. @command{gawk} has a built-in debugger that works very
similarly to the GNU Debugger, GDB.

@item
Debuggers let you step through your program one statement at a time,
examine and change variable and array values, and do a number of other
things that let you understand what your program is actually doing (as
opposed to what it is supposed to do).

@item
Like most debuggers, the @command{gawk} debugger works in terms of stack
frames, and lets you set both breakpoints (stop at a point in the code)
and watchpoints (stop when a data value changes).

@item
The debugger command set is fairly complete, providing control over
breakpoints, execution, viewing and changing data, working with the stack,
getting information, and other tasks.

@item
If the GNU Readline library is available when @command{gawk} is
compiled, it is used by the debugger to provide command-line history
and editing.

@item
Usually, the debugger does not affect the
program being debugged, but occasionally it can.

@end itemize

@hyphenation{name-space name-spaces Name-space Name-spaces}
@node Namespaces
@chapter Namespaces in @command{gawk}

This @value{CHAPTER} describes a feature that is specific to @command{gawk}.

@quotation CAUTION
The feature described in this @value{CHAPTER} is new. It is entirely
possible, and even likely, that there are dark corners (if not bugs)
still lurking within the implementation. If you find any such,
please report them (@pxref{Bugs}).
@end quotation

@menu
* Global Namespace::            The global namespace in standard
                                @command{awk}.
* Qualified Names::             How to qualify names with a namespace.
* Default Namespace::           The default namespace.
* Changing The Namespace::      How to change the namespace.
* Naming Rules::                Namespace and Component Naming Rules.
* Internal Name Management::    How names are stored internally.
* Namespace Example::           An example of code using a namespace.
* Namespace And Features::      Namespaces and other @command{gawk} features.
* Namespace Summary::           Summarizing namespaces.
@end menu

@node Global Namespace
@section Standard @command{awk}'s Single Namespace

@cindex namespace @subentry definition of
@cindex namespace @subentry standard @command{awk}, global
In standard @command{awk}, there is a single, global, @dfn{namespace}.
This means that @emph{all} function names and global variable names must
be unique. For example, two different @command{awk} source files cannot
both define a function named @code{min()}, or define the same identifier,
used as a scalar in one and as an array in the other.

This situation is okay when programs are small, say a few hundred
lines, or even a few thousand, but it prevents the development of
reusable libraries of @command{awk} functions, and can
cause independently developed library files to accidentally step on each
other's ``private'' global variables
(@pxref{Library Names}).

@cindex package, definition of
@cindex module, definition of
Most other programming languages solve this issue by providing some kind
of namespace control: a way to say ``this function is in namespace @var{xxx},
and that function is in namespace @var{yyy}.'' (Of course, there is then
still a single namespace for the namespaces, but the hope is that there
are far fewer namespaces in use by any given program, and thus much
less chance for collisions.) These facilities are sometimes referred
to as @dfn{packages} or @dfn{modules}.

Starting with @value{PVERSION} 5.0, @command{gawk} provides a
simple mechanism to put functions and global variables into separate namespaces.
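
To make the collision problem described above concrete, here is a minimal
sketch. The file names @file{lib1.awk} and @file{lib2.awk}, the two
functions, and the variable @code{_count} are all hypothetical; they are
not part of @command{gawk} or of any real library:

@example
# lib1.awk --- hypothetical library that counts calls in a global variable
function tick()
@{
    return ++_count
@}

# lib2.awk --- hypothetical library, written independently of lib1.awk
function reset()
@{
    _count = 0    # with one global namespace, this is lib1.awk's _count too
@}
@end example

@noindent
If a program loads both files, a call to @code{reset()} silently zeroes the
counter that @code{tick()} relies on, because both files share the single
global namespace.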

@node Qualified Names
@section Qualified Names

@cindex qualified name @subentry definition of
@cindex namespaces @subentry qualified names
@cindex @code{:} (colon) @subentry @code{::} namespace separator
@cindex colon (@code{:}) @subentry @code{::} namespace separator
@cindex component name
A @dfn{qualified name} is an identifier that includes a namespace name,
the namespace separator @code{::}, and a @dfn{component} name. For example, one
might have a function named @code{posix::getpid()}. Here, the namespace
is @code{posix} and the function name within the namespace (the component)
is @code{getpid()}. The namespace and component names are separated by
a double-colon. Only one such separator is allowed in a qualified name.

@quotation NOTE
Unlike C++, the @code{::} is @emph{not} an operator. No spaces are
allowed between the namespace name, the @code{::}, and the component name.
@end quotation

@cindex qualified name @subentry use of
You must use qualified names from one namespace to access variables
and functions in another. This is especially important when using
variable names to index the special @code{SYMTAB} array (@pxref{Auto-set}),
and when making indirect function calls (@pxref{Indirect Calls}).

@node Default Namespace
@section The Default Namespace

@cindex namespace @subentry default
@cindex namespace @subentry @code{awk}
@cindex @code{awk} @subentry namespace
The default namespace, not surprisingly, is @code{awk}.
All of the predefined @command{awk} and @command{gawk} variables
are in this namespace, and thus have qualified names like
@code{awk::ARGC}, @code{awk::NF}, and so on.
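
As a small illustration of the default namespace (a sketch only; the
variable @code{n} is made up for this example), the unqualified and the
@samp{awk::}-qualified spellings of a name should refer to the same
variable in code that has not changed the namespace:

@example
BEGIN @{
    n = 3                        # unqualified, so n is in the awk namespace
    print awk::n                 # the same variable, fully qualified
    print (ARGC == awk::ARGC)    # predefined variables work the same way
@}
@end example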

@cindex uppercase names, namespace for
Furthermore, even when you have changed the namespace for your
current source file (@pxref{Changing The Namespace}), @command{gawk}
forces unqualified identifiers whose names are all uppercase letters
to be in the @code{awk} namespace. This makes it possible for you to easily
reference @command{gawk}'s global variables from different namespaces.
It also keeps your code looking natural.

@node Changing The Namespace
@section Changing The Namespace

@cindex namespaces @subentry changing
@cindex @code{@@} (at-sign) @subentry @code{@@namespace} directive
@cindex at-sign (@code{@@}) @subentry @code{@@namespace} directive
@cindex @code{@@namespace} directive @sortas{namespace directive}
In order to set the current namespace, use an @code{@@namespace} directive
at the top level of your program:

@example
@@namespace "passwd"

BEGIN @{ @dots{} @}
@dots{}
@end example

After this directive, all simple identifiers that are not entirely
uppercase are placed into the @code{passwd} namespace.

You can change the namespace multiple times within a single
source file, although this is likely to become confusing if you
do it too much.

@quotation NOTE
Association of unqualified identifiers to a namespace is handled while
@command{gawk} parses your program, @emph{before} it starts to run. There is
no concept of a ``current'' namespace once your program starts executing.
Be sure you understand this.
@end quotation

@cindex namespace @subentry implicit
@cindex implicit namespace
Each source file for @option{-i} and @option{-f} starts out with
an implicit @samp{@@namespace "awk"}. Similarly, each chunk of
command-line code supplied with @option{-e} has such an implicit
initial statement (@pxref{Options}).
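
The per-file reset to the @code{awk} namespace can be sketched as follows.
The file names @file{a.awk} and @file{b.awk}, the namespace @code{a}, and
the variable @code{x} are hypothetical, chosen only for illustration:

@example
# a.awk
@@namespace "a"

BEGIN @{
    x = 1            # this is a::x
@}

# b.awk --- a new source file, so it starts in the awk namespace again
BEGIN @{
    print a::x       # the variable set in a.awk
    print x          # awk::x, a different variable that is still unset
@}
@end example

@noindent
Run as @samp{gawk -f a.awk -f b.awk}, this would be expected to print
@samp{1} followed by an empty line, since the @code{@@namespace} directive
in @file{a.awk} does not carry over into @file{b.awk}.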

@cindex current namespace, pushing and popping
@cindex namespace @subentry pushing and popping
Files included with @code{@@include} (@pxref{Include Files}) ``push''
and ``pop'' the current namespace. That is, each @code{@@include} saves
the current namespace and starts over with an implicit @samp{@@namespace
"awk"} which remains in effect until an explicit @code{@@namespace}
directive is seen. When @command{gawk} finishes processing the included
file, the saved namespace is restored and processing continues where it
left off in the original file.

@cindex @code{@@} (at-sign) @subentry @code{@@namespace} directive @subentry @code{BEGIN}, @code{BEGINFILE}, @code{END}, @code{ENDFILE} and
@cindex at-sign (@code{@@}) @subentry @code{@@namespace} directive @subentry @code{BEGIN}, @code{BEGINFILE}, @code{END}, @code{ENDFILE} and
@cindex @code{BEGIN} pattern @subentry @code{@@namespace} directive and
@cindex @code{BEGINFILE} pattern @subentry @code{@@namespace} directive and
@cindex @code{END} pattern @subentry @code{@@namespace} directive and
@cindex @code{ENDFILE} pattern @subentry @code{@@namespace} directive and
@cindex @code{@@namespace} directive @sortas{namespace directive}
The use of @code{@@namespace} has no influence upon the order of execution
of @code{BEGIN}, @code{BEGINFILE}, @code{END}, and @code{ENDFILE} rules.

@node Naming Rules
@section Namespace and Component Naming Rules

@cindex naming rules, namespace and component names
@cindex namespaces @subentry naming rules
@c not "component names" to merge with other index entry
@cindex component name @subentry naming rules
A number of rules apply to the namespace and component names, as follows.

@itemize @bullet
@item
It is a syntax error to use qualified names for function parameter names.

@item
It is a syntax error to use any standard @command{awk} reserved word (such
as @code{if} or @code{for}), or the name of any standard built-in function
(such as @code{sin()} or @code{gsub()}) as either part of a qualified name.
Thus, the following produces a syntax error:

@example
@@namespace "example"

function gsub(str, pat, result) @{ @dots{} @}
@end example

@item
Outside the @code{awk} namespace, the names of the additional @command{gawk}
built-in functions (such as @code{gensub()} or @code{strftime()}) @emph{may}
be used as component names. The same set of names may be used as namespace
names, although this has the potential to be confusing.

@item
The additional @command{gawk} built-in functions may still be called
from outside the @code{awk} namespace by qualifying them. For example,
@code{awk::systime()}. Here is a somewhat silly example demonstrating
this rule and the previous one:

@example
BEGIN @{
    print "in awk namespace, systime() =", systime()
@}

@@namespace "testing"

function systime()
@{
    print "in testing namespace, systime() =", awk::systime()
@}

BEGIN @{
    systime()
@}
@end example

@noindent
When run, it produces output like this:

@example
$ @kbd{gawk -f systime.awk}
@print{} in awk namespace, systime() = 1500488503
@print{} in testing namespace, systime() = 1500488503
@end example

@item
@command{gawk} pre-defined variable names may be used:
@code{NF::NR} is valid, if possibly not all that useful.
@end itemize

@node Internal Name Management
@section Internal Name Management

@cindex name management
@cindex @code{awk} @subentry namespace @subentry identifier name storage
@cindex @code{awk} @subentry namespace @subentry use for indirect function calls
For backwards compatibility, all identifiers in the @code{awk} namespace
are stored internally as unadorned identifiers (that is, without a
leading @samp{awk::}). This is mainly relevant
when using such identifiers as indices for @code{SYMTAB}, @code{FUNCTAB},
and @code{PROCINFO["identifiers"]} (@pxref{Auto-set}), and for use in
indirect function calls (@pxref{Indirect Calls}).

In program code, to refer to variables and functions in the @code{awk}
namespace from another namespace, you must still use the @samp{awk::}
prefix. For example:

@example
@@namespace "awk"              @ii{This is the default namespace}

BEGIN @{
    Title = "My Report"       @ii{Qualified name is} awk::Title
@}

@@namespace "report"           @ii{Now in} report @ii{namespace}

function compute()            @ii{This is really} report::compute()
@{
    print awk::Title          @ii{But would be} SYMTAB["Title"]
    @dots{}
@}
@end example

@node Namespace Example
@section Namespace Example

@cindex namespace @subentry example code
The following example is a revised version of the suite of routines
developed in @ref{Passwd Functions}. See there for an explanation
of how the code works.

The formulation here, due mainly to Andrew Schorr, is rather elegant.
All of the implementation functions and variables are in the
@code{passwd} namespace, whereas the main interface functions are
defined in the @code{awk} namespace.

@example
@c file eg/lib/ns_passwd.awk
# ns_passwd.awk --- access password file information
@c endfile
@ignore
@c file eg/lib/ns_passwd.awk
#
# Arnold Robbins, arnold@@skeeve.com, Public Domain
# May 1993
# Revised October 2000
# Revised December 2010
#
# Reworked for namespaces June 2017, with help from
# Andrew J.@: Schorr, aschorr@@telemetry-investments.com
@c endfile
@end ignore
@c file eg/lib/ns_passwd.awk

@@namespace "passwd"

BEGIN @{
    # tailor this to suit your system
    Awklib = "/usr/local/libexec/awk/"
@}

function Init(    oldfs, oldrs, olddol0, pwcat, using_fw, using_fpat)
@{
    if (Inited)
        return

    oldfs = FS
    oldrs = RS
    olddol0 = $0
    using_fw = (PROCINFO["FS"] == "FIELDWIDTHS")
    using_fpat = (PROCINFO["FS"] == "FPAT")
    FS = ":"
    RS = "\n"

    pwcat = Awklib "pwcat"
    while ((pwcat | getline) > 0) @{
        Byname[$1] = $0
        Byuid[$3] = $0
        Bycount[++Total] = $0
    @}
    close(pwcat)
    Count = 0
    Inited = 1
    FS = oldfs
    if (using_fw)
        FIELDWIDTHS = FIELDWIDTHS
    else if (using_fpat)
        FPAT = FPAT
    RS = oldrs
    $0 = olddol0
@}

function awk::getpwnam(name)
@{
    Init()
    return Byname[name]
@}

function awk::getpwuid(uid)
@{
    Init()
    return Byuid[uid]
@}

function awk::getpwent()
@{
    Init()
    if (Count < Total)
        return Bycount[++Count]
    return ""
@}

function awk::endpwent()
@{
    Count = 0
@}
@c endfile
@end example

As you can see, this version also follows the convention mentioned in
@ref{Library Names}, whereby global variable and function names
start with a capital letter.

Here is a simple test program. Since it's in a separate file, unadorned
identifiers are looked up in the @code{awk} namespace:

@example
BEGIN @{
    while ((p = getpwent()) != "")
        print p
@}
@end example

@noindent
Here's what happens when it's run:

@example
$ @kbd{gawk -f ns_passwd.awk -f testpasswd.awk}
@print{} root:x:0:0:root:/root:/bin/bash
@print{} daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
@print{} bin:x:2:2:bin:/bin:/usr/sbin/nologin
@print{} sys:x:3:3:sys:/dev:/usr/sbin/nologin
@dots{}
@end example

@node Namespace And Features
@section Namespaces and Other @command{gawk} Features

This @value{SECTION} looks briefly at how the namespace facility interacts
with other important @command{gawk} features.

@cindex namespaces @subentry interaction with @subentry profiler
@cindex namespaces @subentry interaction with @subentry pretty printer
@cindex profiler, interaction with namespaces
@cindex pretty printer, interaction with namespaces
The profiler and pretty-printer (@pxref{Profiling}) have been enhanced
to understand namespaces and the namespace naming rules presented in
@ref{Naming Rules}. In particular, the output groups functions in the same
namespace together, and has @code{@@namespace} directives in front
of rules as necessary. This allows component names to be
simple identifiers, instead of using qualified identifiers everywhere.

@cindex namespaces @subentry interaction with @subentry debugger
@cindex debugger @subentry interaction with namespaces
Interaction with the debugger (@pxref{Debugging}) has not had to change
(at least as of this writing). Some of the internal byte codes changed
in order to accommodate namespaces, and the debugger's @code{dump} command
was adjusted to match.

@cindex namespaces @subentry interaction with @subentry extension API
@cindex extension API @subentry interaction with namespaces
The extension API (@pxref{Dynamic Extensions}) has always allowed for
placing functions into a different namespace, although this was not
previously implemented. However, the symbol lookup and symbol update
routines did not have provision for including a namespace. That has now
been corrected (@pxref{Symbol table by name}).
@xref{Extension Sample Inplace}, for a nice example of an extension that
leverages a namespace shared by cooperating @command{awk} and C code.

@node Namespace Summary
@section Summary

@itemize @value{BULLET}
@item
Standard @command{awk} provides a single namespace for all global
identifiers (scalars, arrays, and functions). This is limiting when
one wants to develop libraries of reusable functions or function suites.

@item
@command{gawk} provides multiple namespaces by using qualified names:
names consisting of a namespace name, a double colon, @code{::}, and a
component name. Namespace names might still possibly conflict, but this
is true of any language providing namespaces, modules, or packages.

@item
The default namespace is @command{awk}. The rules for namespace and
component names are provided in @ref{Naming Rules}. The rules are
designed in such a way as to make namespace-aware code continue to
look and work naturally while still providing the necessary power and
flexibility.

@item
Other parts of @command{gawk} have been extended as necessary to integrate
namespaces smoothly with their operation. This applies most notably to
the profiler / pretty-printer (@pxref{Profiling}) and to the extension
facility (@pxref{Dynamic Extensions}).

@cindex namespaces @subentry backwards compatibility
@item
Overall, the namespace facility was designed and implemented such that
backwards compatibility is paramount. Programs that don't use namespaces
should see absolutely no difference in behavior when run by a namespace-capable
version of @command{gawk}.
@end itemize

@node Arbitrary Precision Arithmetic
@chapter Arithmetic and Arbitrary-Precision Arithmetic with @command{gawk}
@cindex arbitrary precision
@cindex multiple precision
@cindex infinite precision
@cindex floating-point @subentry numbers @subentry arbitrary-precision

This @value{CHAPTER} introduces some basic concepts relating to
how computers do arithmetic and defines some important terms.
It then proceeds to describe floating-point arithmetic,
which is what @command{awk} uses for all its computations, including a
discussion of arbitrary-precision floating-point arithmetic, which is
a feature available only in @command{gawk}. It continues on to present
arbitrary-precision integers, and concludes with a description of some
points where @command{gawk} and the POSIX standard are not quite in
agreement.

@quotation NOTE
Most users of @command{gawk} can safely skip this chapter.
But if you want to do scientific calculations with @command{gawk},
this is the place to be.
@end quotation

@menu
* Computer Arithmetic::           A quick intro to computer math.
* Math Definitions::              Defining terms used.
* MPFR features::                 The MPFR features in @command{gawk}.
* FP Math Caution::               Things to know.
* Arbitrary Precision Integers::  Arbitrary Precision Integer Arithmetic with
                                  @command{gawk}.
* Checking for MPFR::             How to check if MPFR is available.
* POSIX Floating Point Problems:: Standards Versus Existing Practice.
* Floating point summary::        Summary of floating point discussion.
@end menu

@node Computer Arithmetic
@section A General Description of Computer Arithmetic

Until now, we have worked with data as either numbers or
strings. Ultimately, however, computers represent everything in terms
of @dfn{binary digits}, or @dfn{bits}. A decimal digit can take on any
of 10 values: zero through nine. A binary digit can take on any of two
values, zero or one. Using binary, computers (and computer software)
can represent and manipulate numerical and character data. In general,
the more bits you can use to represent a particular thing, the greater
the range of possible values it can take on.

Modern computers support at least two, and often more, ways to do
arithmetic. Each kind of arithmetic uses a different representation
(organization of the bits) for the numbers. The kinds of arithmetic
that interest us are:

@table @asis
@item Decimal arithmetic
This is the kind of arithmetic you learned in elementary school, using
paper and pencil (and/or a calculator). In theory, numbers can have an
arbitrary number of digits on either side (or both sides) of the decimal
point, and the results of a computation are always exact.

Some modern systems can do decimal arithmetic in hardware, but usually you
need a special software library to provide access to these instructions.
There are also libraries that do decimal arithmetic entirely in software.

Despite the fact that some users expect @command{gawk} to be performing
decimal arithmetic,@footnote{We don't know why they expect this, but
they do.} it does not do so.

@item Integer arithmetic
In school, integer values were referred to as ``whole'' numbers---that
is, numbers without any fractional part, such as 1, 42, or @minus{}17.
The advantage to integer numbers is that they represent values exactly.
The disadvantage is that their range is limited.

@cindex unsigned integers
@cindex integers @subentry unsigned
In computers, integer values come in two flavors: @dfn{signed} and
@dfn{unsigned}. Signed values may be negative or positive, whereas
unsigned values are always greater than or equal
to zero.

In computer systems, integer arithmetic is exact, but the possible
range of values is limited. Integer arithmetic is generally faster than
floating-point arithmetic.

@cindex floating-point @subentry numbers
@item Floating-point arithmetic
Floating-point numbers represent what were called in school ``real''
numbers (i.e., those that have a fractional part, such as 3.1415927).
The advantage to floating-point numbers is that they can represent a
much larger range of values than can integers. The disadvantage is that
there are numbers that they cannot represent exactly.

Modern systems support floating-point arithmetic in hardware, with a
limited range of values. There are software libraries that allow
the use of arbitrary-precision floating-point calculations.

@cindex floating-point @subentry numbers @subentry single-precision
@cindex floating-point @subentry numbers @subentry double-precision
@cindex floating-point @subentry numbers @subentry arbitrary-precision
@cindex single-precision
@cindex double-precision
@cindex arbitrary precision
POSIX @command{awk} uses @dfn{double-precision} floating-point numbers, which
can hold more digits than @dfn{single-precision} floating-point numbers.
@command{gawk} has facilities for performing arbitrary-precision
floating-point arithmetic, which we describe in more detail shortly.
@end table

Computers work with integer and floating-point values of different
ranges. Integer values are usually either 32 or 64 bits in size.
Single-precision floating-point values occupy 32 bits, whereas double-precision
floating-point values occupy 64 bits.
(Quadruple-precision floating point values also exist. They occupy 128 bits,
but such numbers are not available in @command{awk}.)
Floating-point values are always
signed. The possible ranges of values are shown in @ref{table-numeric-ranges}
and @ref{table-floating-point-ranges}.

@float Table,table-numeric-ranges
@caption{Value ranges for integer representations}
@multitable @columnfractions .34 .33 .33
@headitem Representation @tab Minimum value @tab Maximum value
@item 32-bit signed integer @tab @minus{}2,147,483,648 @tab 2,147,483,647
@item 32-bit unsigned integer @tab 0 @tab 4,294,967,295
@item 64-bit signed integer @tab @minus{}9,223,372,036,854,775,808 @tab 9,223,372,036,854,775,807
@item 64-bit unsigned integer @tab 0 @tab 18,446,744,073,709,551,615
@end multitable
@end float

@float Table,table-floating-point-ranges
@caption{Approximate value ranges for floating-point number representations}
@multitable @columnfractions .38 .22 .22 .23
@iftex
@headitem Representation @tab @w{Minimum positive} @w{nonzero value} @tab Minimum @w{finite value} @tab Maximum @w{finite value}
@end iftex
@ifnottex
@headitem Representation @tab Minimum positive nonzero value @tab Minimum finite value @tab Maximum finite value
@end ifnottex
@iftex
@item @w{Single-precision floating-point} @tab @math{1.175494 @cdot 10^{-38}} @tab @math{-3.402823 @cdot 10^{38}} @tab @math{3.402823 @cdot 10^{38}}
@item @w{Double-precision floating-point} @tab @math{2.225074 @cdot 10^{-308}} @tab @math{-1.797693 @cdot 10^{308}} @tab @math{1.797693 @cdot 10^{308}}
@item @w{Quadruple-precision floating-point} @tab @math{3.362103 @cdot 10^{-4932}} @tab @math{-1.189731 @cdot 10^{4932}} @tab @math{1.189731 @cdot 10^{4932}}
@end iftex
@ifinfo
@item Single-precision floating-point @tab 1.175494e-38 @tab -3.402823e+38 @tab 3.402823e+38
@item Double-precision floating-point @tab 2.225074e-308 @tab -1.797693e+308 @tab 1.797693e+308
@item Quadruple-precision floating-point @tab 3.362103e-4932 @tab -1.189731e+4932 @tab 1.189731e+4932
@end ifinfo
@ifnottex
@ifnotinfo
@item Single-precision floating-point @tab 1.175494*10@sup{-38} @tab -3.402823*10@sup{38} @tab 3.402823*10@sup{38}
@item Double-precision floating-point @tab 2.225074*10@sup{-308} @tab -1.797693*10@sup{308} @tab 1.797693*10@sup{308}
@item Quadruple-precision floating-point @tab 3.362103*10@sup{-4932} @tab -1.189731*10@sup{4932} @tab 1.189731*10@sup{4932}
@end ifnotinfo
@end ifnottex
@end multitable
@end float

@node Math Definitions
@section Other Stuff to Know

The rest of this @value{CHAPTER} uses a number of terms. Here are some
informal definitions that should help you work your way through the material
here:

@table @dfn
@item Accuracy
A floating-point calculation's accuracy is how close it comes
to the real (paper and pencil) value.

@item Error
The difference between what the result of a computation ``should be''
and what it actually is. It is best to minimize error as much
as possible.

@item Exponent
The order of magnitude of a value;
some number of bits in a floating-point value store the exponent.

@item Inf
A special value representing infinity. Operations involving another
number and infinity produce infinity.

@item NaN
``Not a number.''@footnote{Thanks to Michael Brennan for this description,
which we have paraphrased, and for the examples.} A special value that
results from attempting a calculation that has no answer as a real number.
In such a case, programs can either receive a floating-point exception,
or get @code{NaN} back as the result. The IEEE 754 standard recommends
that systems return @code{NaN}. Some examples:

@table @code
@item sqrt(-1)
This makes sense in the range of complex numbers, but not in the
range of real numbers, so the result is @code{NaN}.

@item log(-8)
@minus{}8 is out of the domain of @code{log()}, so the result is @code{NaN}.
@end table

@item Normalized
How the significand (see later in this list) is usually stored. The
value is adjusted so that the first bit is one, and then that leading
one is assumed instead of physically stored. This provides one
extra bit of precision.

@item Precision
The number of bits used to represent a floating-point number.
The more bits, the more digits you can represent.
Binary and decimal precisions are related approximately, according to the
formula:

@display
@iftex
@math{prec = 3.322 @cdot dps}
@end iftex
@ifnottex
@ifnotdocbook
@var{prec} = 3.322 * @var{dps}
@end ifnotdocbook
@end ifnottex
@docbook
<emphasis>prec</emphasis> = 3.322 ⋅ <emphasis>dps</emphasis>
@end docbook
@end display

@noindent
Here, @emph{prec} denotes the binary precision
(measured in bits) and @emph{dps} (short for decimal places)
is the number of decimal digits.

@item Rounding mode
How numbers are rounded up or down when necessary.
More details are provided later.

@item Significand
A floating-point value consists of the significand multiplied by 10
to the power of the exponent. For example, in @code{1.2345e67},
the significand is @code{1.2345}.

@item Stability
From @uref{https://en.wikipedia.org/wiki/Numerical_stability,
the Wikipedia article on numerical stability}:
``Calculations that can be proven not to magnify approximation errors
are called @dfn{numerically stable}.''
@end table

See @uref{https://en.wikipedia.org/wiki/Accuracy_and_precision,
the Wikipedia article on accuracy and precision} for more information
on some of those terms.

On modern systems, floating-point hardware uses the representation and
operations defined by the IEEE 754 standard.
Three of the standard IEEE 754 types are 32-bit single precision,
64-bit double precision, and 128-bit quadruple precision.
The standard also specifies extended precision formats
to allow greater precisions and larger exponent ranges.
(@command{awk} uses only the 64-bit double-precision format.)

@ref{table-ieee-formats} lists the precision and exponent
field values for the basic IEEE 754 binary formats.

@float Table,table-ieee-formats
@caption{Basic IEEE format values}
@multitable @columnfractions .20 .20 .20 .20 .20
@headitem Name @tab Total bits @tab Precision @tab Minimum exponent @tab Maximum exponent
@item Single @tab 32 @tab 24 @tab @minus{}126 @tab +127
@item Double @tab 64 @tab 53 @tab @minus{}1022 @tab +1023
@item Quadruple @tab 128 @tab 113 @tab @minus{}16382 @tab +16383
@end multitable
@end float

@quotation NOTE
The precision numbers include the implied leading one that gives them
one extra bit of significand.
@end quotation

@node MPFR features
@section Arbitrary-Precision Arithmetic Features in @command{gawk}

By default, @command{gawk} uses the double-precision floating-point values
supplied by the hardware of the system it runs on. However, if it was
compiled to do so, and the @option{-M} command-line option is supplied,
@command{gawk} uses the @uref{http://www.mpfr.org,
GNU MPFR} and @uref{https://gmplib.org, GNU MP} (GMP) libraries for
arbitrary-precision arithmetic on numbers. You can see if MPFR support
is available like so:

@example
$ @kbd{gawk --version}
@print{} GNU Awk 4.1.2, API: 1.1 (GNU MPFR 3.1.0-p3, GNU MP 5.0.2)
@print{} Copyright (C) 1989, 1991-2015 Free Software Foundation.
@dots{}
@end example

@noindent
(You may see different version numbers than what's shown here. That's OK;
what's important is to see that GNU MPFR and GNU MP are listed in
the output.)

Additionally, there are a few elements available in the @code{PROCINFO}
array to provide information about the MPFR and GMP libraries
(@pxref{Auto-set}).

The MPFR library provides precise control over precisions and rounding
modes, and gives correctly rounded, reproducible, platform-independent
results. With the @option{-M} command-line option,
all floating-point arithmetic operators and numeric functions
can yield results to any desired precision level supported by MPFR.

Two predefined variables, @code{PREC} and @code{ROUNDMODE},
provide control over the working precision and the rounding mode.
The precision and the rounding mode are set globally for every operation
to follow.
@xref{Setting precision} and @ref{Setting the rounding mode}
for more information.

@node FP Math Caution
@section Floating-Point Arithmetic: Caveat Emptor!

@quotation
@i{Math class is tough!}
@author Teen Talk Barbie, July 1992
@end quotation

This @value{SECTION} provides a high-level overview of the issues
involved when doing lots of floating-point arithmetic.@footnote{There
is a very nice @uref{http://www.validlab.com/goldberg/paper.pdf,
paper on floating-point arithmetic} by David Goldberg, ``What Every
Computer Scientist Should Know About Floating-Point Arithmetic,''
@cite{ACM Computing Surveys} @strong{23}, 1 (1991-03): 5-48. This is
worth reading if you are interested in the details, but it does require
a background in computer science.}
The discussion applies to both hardware and arbitrary-precision
floating-point arithmetic.

@quotation CAUTION
The material here is purposely general. If you need to do serious
computer arithmetic, you should do some research first, and not
rely just on what we tell you.
@end quotation

@menu
* Inexactness of computations:: Floating point math is not exact.
* Getting Accuracy::            Getting more accuracy takes some work.
* Try To Round::                Add digits and round.
* Setting precision::           How to set the precision.
* Setting the rounding mode::   How to set the rounding mode.
@end menu

@node Inexactness of computations
@subsection Floating-Point Arithmetic Is Not Exact

Binary floating-point representations and arithmetic are inexact.
Simple values like 0.1 cannot be precisely represented using
binary floating-point numbers, and the limited precision of
floating-point numbers means that slight changes in
the order of operations or the precision of intermediate storage
can change the result. To make matters worse, with arbitrary-precision
floating-point arithmetic, you can set the precision before starting a
computation, but then you cannot be sure of the number of significant
decimal places in the final result.

@menu
* Inexact representation::      Numbers are not exactly represented.
* Comparing FP Values::         How to compare floating point values.
* Errors accumulate::           Errors get bigger as they go.
@end menu

@node Inexact representation
@subsubsection Many Numbers Cannot Be Represented Exactly

So, before you start to write any code, you should think
about what you really want and what's really happening. Consider the
two numbers in the following example:

@example
x = 0.875             # 1/2 + 1/4 + 1/8
y = 0.425
@end example

Unlike the number in @code{y}, the number stored in @code{x}
is exactly representable
in binary because it can be written as a finite sum of one or
more fractions whose denominators are all powers of two.
When @command{gawk} reads a floating-point number from
program source, it automatically rounds that number to whatever
precision your machine supports. If you try to print the numeric
content of a variable using an output format string of @code{"%.17g"},
it may not produce the same number as you assigned to it:

@example
$ @kbd{gawk 'BEGIN @{ x = 0.875; y = 0.425}
> @kbd{              printf("%0.17g, %0.17g\n", x, y) @}'}
@print{} 0.875, 0.42499999999999999
@end example

Often the error is so small you do not even notice it, and if you do,
you can always specify how much precision you would like in your output.
Usually this is a format string like @code{"%.15g"}, which, when
used in the previous example, produces an output identical to the input.
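
The following sketch puts the two format strings side by side. It
reuses the value of @code{y} from the example above; the choice of 15
and 17 significant digits assumes IEEE 754 double-precision arithmetic:

@example
BEGIN @{
    y = 0.425
    printf("%.15g\n", y)    # rounds away the representation error
    printf("%.17g\n", y)    # enough digits to expose the stored value
@}
@end example

@noindent
On such a system, the first @code{printf} prints @samp{0.425} and the
second prints @samp{0.42499999999999999}.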

@node Comparing FP Values
@subsubsection Be Careful Comparing Values

Because the underlying representation can be a little bit off from the exact value,
comparing floating-point values to see if they are exactly equal is generally a bad idea.
Here is an example where it does not work like you would expect:

@example
$ @kbd{gawk 'BEGIN @{ print (0.1 + 12.2 == 12.3) @}'}
@print{} 0
@end example

The general wisdom when comparing floating-point values is to see if
they are within some small range of each other (called a @dfn{delta},
or @dfn{tolerance}).
You have to decide how small a delta is important to you. Code to do
this looks something like the following:

@example
@group
delta = 0.00001                 # for example
difference = abs(a - b)         # subtract the two values
if (difference < delta)
    # all ok
else
    # not ok
@end group
@end example

@noindent
(We assume that you have a simple absolute value function named
@code{abs()} defined elsewhere in your program.) If you write a
function to compare values with a delta, you should be sure
to use @samp{difference < abs(delta)} in case someone passes
in a negative delta value.

@node Errors accumulate
@subsubsection Errors Accumulate

The loss of accuracy during a single computation with floating-point
numbers usually isn't enough to worry about. However, if you compute a
value that is the result of a sequence of floating-point operations,
the error can accumulate and greatly affect the computation itself.
Here is an attempt to compute the value of @value{PI} using one of its
many series representations:

@example
BEGIN @{
    x = 1.0 / sqrt(3.0)
    n = 6
    for (i = 1; i < 30; i++) @{
        n = n * 2.0
        x = (sqrt(x * x + 1) - 1) / x
        printf("%.15f\n", n * x)
    @}
@}
@end example

When run, the early errors propagate through later computations,
causing the loop to terminate prematurely after attempting to divide by zero:

@example
$ @kbd{gawk -f pi.awk}
@print{} 3.215390309173475
@print{} 3.159659942097510
@print{} 3.146086215131467
@print{} 3.142714599645573
@dots{}
@print{} 3.224515243534819
@print{} 2.791117213058638
@print{} 0.000000000000000
@error{} gawk: pi.awk:6: fatal: division by zero attempted
@end example

Here is an additional example where the inaccuracies in internal representations
yield an unexpected result:

@example
$ @kbd{gawk 'BEGIN @{}
> @kbd{    for (d = 1.1; d <= 1.5; d += 0.1)    # loop five times (?)}
> @kbd{        i++}
> @kbd{    print i}
> @kbd{@}'}
@print{} 4
@end example

@node Getting Accuracy
@subsection Getting the Accuracy You Need

Can arbitrary-precision arithmetic give exact results? There are
no easy answers. The standard rules of algebra often do not apply
when using floating-point arithmetic.
Among other things, the distributive and associative laws
do not hold completely, and order of operation may be important
for your computation. Rounding error, cumulative precision loss,
and underflow are often troublesome.

When @command{gawk} tests the expressions @samp{0.1 + 12.2} and
@samp{12.3} for equality using the machine double-precision arithmetic,
it decides that they are not equal! (@xref{Comparing FP Values}.)
You can get the result you want by increasing the precision; 56 bits in
this case does the job:

@example
$ @kbd{gawk -M -v PREC=56 'BEGIN @{ print (0.1 + 12.2 == 12.3) @}'}
@print{} 1
@end example

If adding more bits is good, perhaps adding even more bits of
precision is better?
Here is what happens if we use an even larger value of @code{PREC}:

@example
$ @kbd{gawk -M -v PREC=201 'BEGIN @{ print (0.1 + 12.2 == 12.3) @}'}
@print{} 0
@end example

This is not a bug in @command{gawk} or in the MPFR library.
It is easy to forget that the finite number of bits used to store the value
is often just an approximation after proper rounding.
The test for equality succeeds if and only if @emph{all} bits in the two operands
are exactly the same. Because this is not necessarily true after floating-point
computations with a particular precision and effective rounding mode,
a straight test for equality may not work. Instead, compare the
two numbers to see if they are within the desirable delta of each other.

In applications where 15 or fewer decimal places suffice,
hardware double-precision arithmetic can be adequate, and is usually much faster.
But you need to keep in mind that every floating-point operation
can suffer a new rounding error with catastrophic consequences, as illustrated
by our earlier attempt to compute the value of @value{PI}.
Extra precision can greatly enhance the stability and the accuracy
of your computation in such cases.

Additionally, you should understand that
repeated addition is not necessarily equivalent to multiplication
in floating-point arithmetic.
In the example in
@ref{Errors accumulate}:

@example
$ @kbd{gawk 'BEGIN @{}
> @kbd{for (d = 1.1; d <= 1.5; d += 0.1)    # loop five times (?)}
> @kbd{i++}
> @kbd{print i}
> @kbd{@}'}
@print{} 4
@end example

@noindent
you may or may not succeed in getting the correct result by choosing
an arbitrarily large value for @code{PREC}. Reformulation of
the problem at hand is often the correct approach in such situations.

@node Try To Round
@subsection Try a Few Extra Bits of Precision and Rounding

Instead of arbitrary-precision floating-point arithmetic,
often all you need is an adjustment of your logic
or a different order for the operations in your calculation.
The stability and the accuracy of the computation of @value{PI}
in the earlier example can be enhanced by using the following
simple algebraic transformation:

@example
(sqrt(x * x + 1) - 1) / x @equiv{} x / (sqrt(x * x + 1) + 1)
@end example

@noindent
After making this change, the program converges to
@value{PI} in under 30 iterations:

@example
$ @kbd{gawk -f pi2.awk}
@print{} 3.215390309173473
@print{} 3.159659942097501
@print{} 3.146086215131436
@print{} 3.142714599645370
@print{} 3.141873049979825
@dots{}
@print{} 3.141592653589797
@print{} 3.141592653589797
@end example

@node Setting precision
@subsection Setting the Precision

@command{gawk} uses a global working precision; it does not keep track of
the precision or accuracy of individual numbers. Performing an arithmetic
operation or calling a built-in function rounds the result to the current
working precision. The default working precision is 53 bits, which you can
modify using the predefined variable @code{PREC}.
You can also set the
value to one of the predefined case-insensitive strings
shown in @ref{table-predefined-precision-strings},
to emulate an IEEE 754 binary format.

@float Table,table-predefined-precision-strings
@caption{Predefined precision strings for @code{PREC}}
@multitable {@code{"double"}} {12345678901234567890123456789012345}
@headitem @code{PREC} @tab IEEE 754 binary format
@item @code{"half"} @tab 16-bit half-precision
@item @code{"single"} @tab Basic 32-bit single precision
@item @code{"double"} @tab Basic 64-bit double precision
@item @code{"quad"} @tab Basic 128-bit quadruple precision
@item @code{"oct"} @tab 256-bit octuple precision
@end multitable
@end float

The following example illustrates the effects of changing precision
on arithmetic operations:

@example
$ @kbd{gawk -M -v PREC=100 'BEGIN @{ x = 1.0e-400; print x + 0}
> @kbd{PREC = "double"; print x + 0 @}'}
@print{} 1e-400
@print{} 0
@end example

@quotation CAUTION
Be wary of floating-point constants! When reading a floating-point
constant from program source code, @command{gawk} uses the default
precision (that of a C @code{double}), unless overridden by an assignment
to the special variable @code{PREC} on the command line, to store it
internally as an MPFR number. Changing the precision using @code{PREC}
in the program text does @emph{not} change the precision of a constant.

If you need to represent a floating-point constant at a higher precision
than the default and cannot use a command-line assignment to @code{PREC},
you should either specify the constant as a string, or as a rational
number, whenever possible.
The following example illustrates the
differences among various ways to print a floating-point constant:

@example
$ @kbd{gawk -M 'BEGIN @{ PREC = 113; printf("%0.25f\n", 0.1) @}'}
@print{} 0.1000000000000000055511151
$ @kbd{gawk -M -v PREC=113 'BEGIN @{ printf("%0.25f\n", 0.1) @}'}
@print{} 0.1000000000000000000000000
$ @kbd{gawk -M 'BEGIN @{ PREC = 113; printf("%0.25f\n", "0.1") @}'}
@print{} 0.1000000000000000000000000
$ @kbd{gawk -M 'BEGIN @{ PREC = 113; printf("%0.25f\n", 1/10) @}'}
@print{} 0.1000000000000000000000000
@end example
@end quotation

@node Setting the rounding mode
@subsection Setting the Rounding Mode

@cindex @code{ROUNDMODE} variable
The @code{ROUNDMODE} variable provides
program-level control over the rounding mode.
The correspondence between @code{ROUNDMODE} and the IEEE
rounding modes is shown in @ref{table-gawk-rounding-modes}.

@float Table,table-gawk-rounding-modes
@caption{@command{gawk} rounding modes}
@multitable @columnfractions .45 .30 .25
@headitem Rounding mode @tab IEEE name @tab @code{ROUNDMODE}
@item Round to nearest, ties to even @tab @code{roundTiesToEven} @tab @code{"N"} or @code{"n"}
@item Round toward positive infinity @tab @code{roundTowardPositive} @tab @code{"U"} or @code{"u"}
@item Round toward negative infinity @tab @code{roundTowardNegative} @tab @code{"D"} or @code{"d"}
@item Round toward zero @tab @code{roundTowardZero} @tab @code{"Z"} or @code{"z"}
@item Round away from zero @tab @tab @code{"A"} or @code{"a"}
@end multitable
@end float

@code{ROUNDMODE} has the default value @code{"N"}, which
selects the IEEE 754 rounding mode @code{roundTiesToEven}.
In @ref{table-gawk-rounding-modes}, the value @code{"A"} selects
rounding away from zero.
This is only available if your version of the
MPFR library supports it; otherwise, setting @code{ROUNDMODE} to @code{"A"}
has no effect.

The default mode @code{roundTiesToEven} is the most preferred,
but the least intuitive. This method does the obvious thing for most values,
by rounding them up or down to the nearest digit.
For example, rounding 1.132 to two digits yields 1.13,
and rounding 1.157 yields 1.16.

However, when it comes to rounding a value that is exactly halfway between,
things do not work the way you probably learned in school.
In this case, the number is rounded to the nearest even digit.
So rounding 0.125 to two digits rounds down to 0.12,
but rounding 0.6875 to three digits rounds up to 0.688.
You probably have already encountered this rounding mode when
using @code{printf} to format floating-point numbers.
For example:

@example
BEGIN @{
    x = -4.5
    for (i = 1; i < 10; i++) @{
        x += 1.0
        printf("%4.1f => %2.0f\n", x, x)
    @}
@}
@end example

@noindent
produces the following output when run on the author's system:@footnote{It
is possible for the output to be completely different if the
C library in your system does not use the IEEE 754 even-rounding
rule to round halfway cases for @code{printf}.}

@example
-3.5 => -4
-2.5 => -2
-1.5 => -2
-0.5 => 0
 0.5 => 0
 1.5 => 2
 2.5 => 2
 3.5 => 4
 4.5 => 4
@end example

The theory behind @code{roundTiesToEven} is that it more or less evenly
distributes upward and downward rounds of exact halves, which might
cause any accumulating round-off error to cancel itself out. This is the
default rounding mode for IEEE 754 computing functions and operators.

@c January 2018. Thanks to nethox@gmail.com for the example.
@sidebar Rounding Modes and Conversion
It's important to understand that, along with @code{CONVFMT} and
@code{OFMT}, the rounding mode affects how numbers are converted to strings.
For example, consider the following program:

@example
BEGIN @{
    pi = 3.1416
    OFMT = "%.f"        # Print value as integer
    print pi            # ROUNDMODE = "N" by default.
    ROUNDMODE = "U"     # Now change ROUNDMODE
    print pi
@}
@end example

@noindent
Running this program produces this output:

@example
$ @kbd{gawk -M -f roundmode.awk}
@print{} 3
@print{} 4
@end example
@end sidebar

The other rounding modes are rarely used. Rounding toward positive infinity
(@code{roundTowardPositive}) and toward negative infinity
(@code{roundTowardNegative}) are often used to implement interval
arithmetic, where you adjust the rounding mode to calculate upper and
lower bounds for the range of output. The @code{roundTowardZero} mode can
be used for converting floating-point numbers to integers. When rounding
away from zero, the nearest number with magnitude greater than or equal to
the value is selected.

Some numerical analysts will tell you that your choice of rounding
style has tremendous impact on the final outcome, and advise you to
wait until final output for any rounding. Instead, you can often avoid
round-off error problems by setting the precision initially to some
value sufficiently larger than the final desired precision, so that
the accumulation of round-off error does not influence the outcome.
If you suspect that results from your computation are sensitive to
accumulation of round-off error, look for a significant difference in
output when you change the rounding mode to be sure.
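
One way to carry out such a check is sketched here. The file name
@file{roundcheck.awk}, the loop count, and the choice of summing
@samp{0.1} are illustrative assumptions, not part of @command{gawk};
substitute your own computation. The program must be run with
@option{-M}, because otherwise @code{ROUNDMODE} has no effect:

@example
# roundcheck.awk --- repeat one computation under several rounding modes
# (hypothetical file name; run as: gawk -M -f roundcheck.awk)

BEGIN @{
    n = split("N U D Z", modes, " ")
    for (m = 1; m <= n; m++) @{
        ROUNDMODE = modes[m]            # switch the rounding mode
        sum = 0
        for (i = 1; i <= 1000; i++)     # same computation each time
            sum += 0.1
        printf("%s: %.17g\n", modes[m], sum)
    @}
@}
@end example

@noindent
If the printed values differ only in the last digit or two, round-off
error is probably not a concern for that computation. A larger spread
suggests raising @code{PREC} or reformulating the problem.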

@node Arbitrary Precision Integers
@section Arbitrary-Precision Integer Arithmetic with @command{gawk}
@cindex integers @subentry arbitrary precision
@cindex arbitrary precision @subentry integers

When given the @option{-M} option,
@command{gawk} performs all integer arithmetic using GMP arbitrary-precision
integers. Any number that looks like an integer in a source
or @value{DF} is stored as an arbitrary-precision integer. The size
of the integer is limited only by the available memory. For example,
the following computes
@iftex
@math{5^{4^{3^{2}}}},
@end iftex
@ifinfo
5^4^3^2,
@end ifinfo
@ifnottex
@ifnotinfo
5@sup{4@sup{3@sup{2}}},
@end ifnotinfo
@end ifnottex
the result of which is beyond the
limits of ordinary hardware double-precision floating-point values:

@example
$ @kbd{gawk -M 'BEGIN @{}
> @kbd{x = 5^4^3^2}
> @kbd{print "number of digits =", length(x)}
> @kbd{print substr(x, 1, 20), "...", substr(x, length(x) - 19, 20)}
> @kbd{@}'}
@print{} number of digits = 183231
@print{} 62060698786608744707 ... 92256259918212890625
@end example

If instead you were to compute the same value using arbitrary-precision
floating-point values, the precision needed for correct output (using
the formula
@iftex
@math{prec = 3.322 @cdot dps})
would be @math{3.322 @cdot 183231},
@end iftex
@ifnottex
@ifnotdocbook
@samp{prec = 3.322 * dps})
would be 3.322 x 183231,
@end ifnotdocbook
@end ifnottex
@docbook
<emphasis>prec</emphasis> = 3.322 ⋅ <emphasis>dps</emphasis>)
would be
<emphasis>prec</emphasis> = 3.322 ⋅ 183231,
@end docbook
or 608693.
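
The factor 3.322 approximates the base-2 logarithm of 10, that is,
the number of bits needed for each decimal digit. As a quick check of
the arithmetic (a sketch only; the variable name @code{dps} simply
mirrors the formula):

@example
$ @kbd{gawk 'BEGIN @{ dps = 183231; print int(3.322 * dps) @}'}
@print{} 608693
@end example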

The result from an arithmetic operation with an integer and a floating-point value
is a floating-point value with a precision equal to the working precision.
The following program calculates the eighth term in
Sylvester's sequence@footnote{Weisstein, Eric W.
@cite{Sylvester's Sequence}. From MathWorld---A Wolfram Web Resource
@w{(@url{http://mathworld.wolfram.com/SylvestersSequence.html}).}}
using a recurrence:

@example
$ @kbd{gawk -M 'BEGIN @{}
> @kbd{s = 2.0}
> @kbd{for (i = 1; i <= 7; i++)}
> @kbd{s = s * (s - 1) + 1}
> @kbd{print s}
> @kbd{@}'}
@print{} 113423713055421845118910464
@end example

The output differs from the actual number, 113,423,713,055,421,844,361,000,443,
because the default precision of 53 bits is not enough to represent the
floating-point results exactly. You can either increase the precision
(100 bits is enough in this case), or replace the floating-point constant
@samp{2.0} with an integer, to perform all computations using integer
arithmetic to get the correct output.

Sometimes @command{gawk} must implicitly convert an arbitrary-precision
integer into an arbitrary-precision floating-point value. This is
primarily because the MPFR library does not always provide the relevant
interface to process arbitrary-precision integers or mixed-mode numbers
as needed by an operation or function. In such a case, the precision is
set to the minimum value necessary for exact conversion, and the working
precision is not used for this purpose.
If this is not what you need or
want, you can employ a subterfuge and convert the integer to floating
point first, like this:

@example
gawk -M 'BEGIN @{ n = 13; print (n + 0.0) % 2.0 @}'
@end example

You can avoid this issue altogether by specifying the number as a floating-point value
to begin with:

@example
gawk -M 'BEGIN @{ n = 13.0; print n % 2.0 @}'
@end example

Note that for this particular example, it is likely best
to just use the following:

@example
gawk -M 'BEGIN @{ n = 13; print n % 2 @}'
@end example

When dividing two arbitrary precision integers with either
@samp{/} or @samp{%}, the result is typically an arbitrary
precision floating point value (unless the denominator evenly
divides into the numerator).
@ifset INTDIV
In order to do integer division
or remainder with arbitrary precision integers, use the built-in
@code{intdiv0()} function (@pxref{Numeric Functions}).

You can simulate the @code{intdiv0()} function in standard @command{awk}
using this user-defined function:

@example
@c file eg/lib/intdiv0.awk
# intdiv0 --- do integer division

@c endfile
@ignore
@c file eg/lib/intdiv0.awk
#
# Arnold Robbins, arnold@@skeeve.com, Public Domain
# July, 2014
#
# Name changed from div() to intdiv()
# April, 2015
#
# Changed to intdiv0()
# April, 2016

@c endfile

@end ignore
@c file eg/lib/intdiv0.awk
function intdiv0(numerator, denominator, result)
@{
    split("", result)

    numerator = int(numerator)
    denominator = int(denominator)
    result["quotient"] = int(numerator / denominator)
    result["remainder"] = int(numerator % denominator)

    return 0.0
@}
@c endfile
@end example

The following example program, contributed by Katie Wasserman,
uses @code{intdiv0()} to
compute the digits of @value{PI} to as many places as you
choose to set:

@example
@c file eg/prog/pi.awk
@group
# pi.awk --- compute the digits of pi
@c endfile
@c endfile
@ignore
@c file eg/prog/pi.awk
#
# Katie Wasserman, katie@@wass.net
# August 2014
@c endfile
@end ignore
@c file eg/prog/pi.awk

BEGIN @{
    digits = 100000
    two = 2 * 10 ^ digits
@end group
    pi = two
    for (m = digits * 4; m > 0; --m) @{
        d = m * 2 + 1
        x = pi * m
        intdiv0(x, d, result)
        pi = result["quotient"]
        pi = pi + two
    @}
    print pi
@}
@c endfile
@end example

@ignore
Date: Wed, 20 Aug 2014 10:19:11 -0400
To: arnold@skeeve.com
From: Katherine Wasserman <katie@wass.net>
Subject: Re: computation of digits of pi?

Arnold,

>The program that you sent to compute the digits of pi using div().
Is
>that some standard algorithm that every math student knows? If so,
>what's it called?

It's not that well known but it's not that obscure either

It's Euler's modification to Newton's method for calculating pi.

Take a look at lines (23) - (25) here: http://mathworld.wolfram.com/PiFormulas.htm

The algorithm I wrote simply expands the multiply by 2 and works from the innermost expression outwards. I used this to program HP calculators because it's quite easy to modify for tiny memory devices with smallish word sizes.

http://www.hpmuseum.org/cgi-sys/cgiwrap/hpmuseum/articles.cgi?read=899

-Katie
@end ignore

When asked about the algorithm used, Katie replied:

@quotation
It's not that well known but it's not that obscure either.
It's Euler's modification to Newton's method for calculating pi.
Take a look at lines (23) - (25) here: @uref{http://mathworld.wolfram.com/PiFormulas.html}.

The algorithm I wrote simply expands the multiply by 2 and works from
the innermost expression outwards. I used this to program HP calculators
because it's quite easy to modify for tiny memory devices with smallish
word sizes. See
@uref{http://www.hpmuseum.org/cgi-sys/cgiwrap/hpmuseum/articles.cgi?read=899}.
@end quotation
@end ifset

@node Checking for MPFR
@section How To Check If MPFR Is Available

@cindex checking for MPFR
@cindex MPFR, checking for
Occasionally, you might like to be able to check if @command{gawk}
was invoked with the @option{-M} option, enabling arbitrary-precision
arithmetic.
You can do so with the following function, contributed
by Andrew Schorr:

@example
@c file eg/lib/have_mpfr.awk
# adequate_math_precision --- return true if we have enough bits
@c endfile
@ignore
@c file eg/lib/have_mpfr.awk
#
# Andrew Schorr, aschorr@@telemetry-investments.com, Public Domain
# May 2017
@c endfile
@end ignore
@c file eg/lib/have_mpfr.awk

function adequate_math_precision(n)
@{
    return (1 != (1+(1/(2^(n-1)))))
@}
@c endfile
@end example

Here is code that invokes the function in order to check
if arbitrary-precision arithmetic is available:

@example
BEGIN @{
    # How many bits of mantissa precision are required
    # for this program to function properly?
    fpbits = 123

    # We hope that we were invoked with MPFR enabled. If so, the
    # following statement should configure calculations to our desired
    # precision.
    PREC = fpbits

    if (! adequate_math_precision(fpbits)) @{
        print("Error: insufficient computation precision available.\n" \
              "Try again with the -M argument?") > "/dev/stderr"
        # Note: you may need to set a flag here to bail out of END rules
        exit 1
    @}
@}
@end example

Please be aware that @code{exit} will jump to the @code{END} rules, if present (@pxref{Exit Statement}).

@node POSIX Floating Point Problems
@section Standards Versus Existing Practice

Historically, @command{awk} has converted any nonnumeric-looking string
to the numeric value zero, when required. Furthermore, the original
definition of the language and the original POSIX standards specified that
@command{awk} only understands decimal numbers (base 10), and not octal
(base 8) or hexadecimal numbers (base 16).
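
A quick way to see this historical behavior in @command{gawk} is
sketched here. It assumes that @command{gawk} is run without
@option{--posix} and without @option{--non-decimal-data}; the input
values are chosen only for illustration:

@example
$ @kbd{echo 'abc 0x11 011' | gawk '@{ print $1 + 0, $2 + 0, $3 + 0 @}'}
@print{} 0 0 11
@end example

@noindent
The alphabetic string and the hexadecimal-looking string both convert
to zero, and the value with a leading zero is treated as decimal 11,
not as octal.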

Changes in the language of the
2001 and 2004 POSIX standards can be interpreted to imply that @command{awk}
should support additional features. These features are:

@itemize @value{BULLET}
@item
Interpretation of floating-point data values specified in hexadecimal
notation (e.g., @code{0xDEADBEEF}). (Note: data values, @emph{not}
source code constants.)

@item
Support for the special IEEE 754 floating-point values ``not a number''
(NaN), positive infinity (``inf''), and negative infinity (``@minus{}inf'').
In particular, the format for these values is as specified by the ISO 1999
C standard, which ignores case and can allow implementation-dependent additional
characters after the @samp{nan} and allow either @samp{inf} or @samp{infinity}.
@end itemize

The first problem is that both of these are clear changes to historical
practice:

@itemize @value{BULLET}
@item
The @command{gawk} maintainer feels that supporting hexadecimal
floating-point values, in particular, is ugly, and was never intended by the
original designers to be part of the language.

@item
Allowing completely alphabetic strings to have valid numeric
values is also a very severe departure from historical practice.
@end itemize

The second problem is that the @command{gawk} maintainer feels that this
interpretation of the standard, which required a certain amount of
``language lawyering'' to arrive at in the first place, was not even
intended by the standard developers.
In other words, ``We see how you
got where you are, but we don't think that that's where you want to be.''

Recognizing these issues, but attempting to provide compatibility
with the earlier versions of the standard,
the 2008 POSIX standard added explicit wording to allow, but not require,
that @command{awk} support hexadecimal floating-point values and
special values for ``not a number'' and infinity.

Although the @command{gawk} maintainer continues to feel that
providing those features is inadvisable,
nevertheless, on systems that support IEEE floating point, it seems
reasonable to provide @emph{some} way to support NaN and infinity values.
The solution implemented in @command{gawk} is as follows:

@itemize @value{BULLET}
@item
With the @option{--posix} command-line option, @command{gawk} becomes
``hands off.'' String values are passed directly to the system library's
@code{strtod()} function, and if it successfully returns a numeric value,
that is what's used.@footnote{You asked for it, you got it.}
By definition, the results are not portable across
different systems. They are also a little surprising:

@example
$ @kbd{echo nanny | gawk --posix '@{ print $1 + 0 @}'}
@print{} nan
$ @kbd{echo 0xDeadBeef | gawk --posix '@{ print $1 + 0 @}'}
@print{} 3735928559
@end example

@item
Without @option{--posix}, @command{gawk} interprets the four string values
@samp{+inf},
@samp{-inf},
@samp{+nan},
and
@samp{-nan}
specially, producing the corresponding special numeric values.
The leading sign acts as a signal to @command{gawk} (and the user)
that the value is really numeric. Hexadecimal floating point is
not supported (unless you also use @option{--non-decimal-data},
which is @emph{not} recommended).
For example:

@example
$ @kbd{echo nanny | gawk '@{ print $1 + 0 @}'}
@print{} 0
$ @kbd{echo +nan | gawk '@{ print $1 + 0 @}'}
@print{} +nan
$ @kbd{echo 0xDeadBeef | gawk '@{ print $1 + 0 @}'}
@print{} 0
@end example

@command{gawk} ignores case in the four special values.
Thus, @samp{+nan} and @samp{+NaN} are the same.
@end itemize

@cindex POSIX mode
Besides handling input, @command{gawk} also needs to print ``correct'' values on
output when a value is either NaN or infinity. Starting with @value{PVERSION}
4.2.2, for such values @command{gawk} prints one of the four strings
just described: @samp{+inf}, @samp{-inf}, @samp{+nan}, or @samp{-nan}.
Similarly, in POSIX mode, @command{gawk} prints the result of
the system's C @code{printf()} function using the @code{%g} format string
for the value, whatever that may be.

@node Floating point summary
@section Summary

@itemize @value{BULLET}
@item
Most computer arithmetic is done using either integers or floating-point
values. Standard @command{awk} uses double-precision
floating-point values.

@item
In the early 1990s Barbie mistakenly said, ``Math class is tough!''
Although math isn't tough, floating-point arithmetic isn't the same
as pencil-and-paper math, and care must be taken:

@c nested list
@itemize @value{MINUS}
@item
Not all numbers can be represented exactly.

@item
Comparing values should use a delta, instead of being done directly
with @samp{==} and @samp{!=}.

@item
Errors accumulate.

@item
Operations are not always truly associative or distributive.
@end itemize

@item
Increasing the accuracy can help, but it is not a panacea.

@item
Often, increasing the accuracy and then rounding to the desired
number of digits produces reasonable results.

@item
Use @option{-M} (or @option{--bignum}) to enable MPFR
arithmetic. Use @code{PREC} to set the precision in bits, and
@code{ROUNDMODE} to set the IEEE 754 rounding mode.

@item
With @option{-M}, @command{gawk} performs
arbitrary-precision integer arithmetic using the GMP library.
This is faster and more space-efficient than using MPFR for
the same calculations.

@item
There are several areas with respect to floating-point
numbers where @command{gawk} disagrees with the POSIX standard.
It pays to be aware of them.

@item
Overall, there is no need to be unduly suspicious about the results from
floating-point arithmetic. The lesson to remember is that floating-point
arithmetic is always more complex than arithmetic using pencil and
paper. In order to take advantage of the power of floating-point arithmetic,
you need to know its limitations and work within them. For most casual
use of floating-point arithmetic, you will often get the expected result
if you simply round the display of your final results to the correct number
of significant decimal digits.

@item
As general advice, avoid presenting numerical data in a manner that
implies better precision than is actually the case.

@end itemize

@node Dynamic Extensions
@chapter Writing Extensions for @command{gawk}
@cindex dynamically loaded extensions

It is possible to add new functions written in C or C++ to @command{gawk} using
dynamically loaded libraries. This facility is available on systems
that support the C @code{dlopen()} and @code{dlsym()}
functions. This @value{CHAPTER} describes how to create extensions
using code written in C or C++.

If you don't know anything about C programming, you can safely skip this
@value{CHAPTER}, although you may wish to review the documentation on the
extensions that come with @command{gawk} (@pxref{Extension Samples}),
and the information on the @code{gawkextlib} project (@pxref{gawkextlib}).
The sample extensions are automatically built and installed when
@command{gawk} is.

@quotation NOTE
When @option{--sandbox} is specified, extensions are disabled
(@pxref{Options}).
@end quotation

@menu
* Extension Intro::             What is an extension.
* Plugin License::              A note about licensing.
* Extension Mechanism Outline:: An outline of how it works.
* Extension API Description::   A full description of the API.
* Finding Extensions::          How @command{gawk} finds compiled extensions.
* Extension Example::           Example C code for an extension.
* Extension Samples::           The sample extensions that ship with
                                @command{gawk}.
* gawkextlib::                  The @code{gawkextlib} project.
* Extension summary::           Extension summary.
* Extension Exercises::         Exercises.
@end menu

@node Extension Intro
@section Introduction

@cindex plug-in
An @dfn{extension} (sometimes called a @dfn{plug-in}) is a piece of
external compiled code that @command{gawk} can load at runtime to
provide additional functionality, over and above the built-in capabilities
described in the rest of this @value{DOCUMENT}.

Extensions are useful because they allow you (of course) to extend
@command{gawk}'s functionality. For example, they can provide access to
system calls (such as @code{chdir()} to change directory) and to other
C library routines that could be of use. As with most software,
``the sky is the limit''; if you can imagine something that you might
want to do and can write in C or C++, you can write an extension to do it!
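
From the @command{awk} side, using an extension can look like the
following sketch. It assumes that the sample @code{filefuncs} extension
(@pxref{Extension Samples}) is installed where @command{gawk} can find
it and that the directory @file{/tmp} exists; both are assumptions made
for illustration only:

@example
@@load "filefuncs"

BEGIN @{
    # chdir() here is supplied by the extension, not by gawk itself
    if (chdir("/tmp") < 0)
        printf("chdir failed: %s\n", ERRNO) > "/dev/stderr"
    else
        print "now in /tmp"
@}
@end example

@noindent
Once loaded, the new function is called just like a built-in function.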

Extensions are written in C or C++, using the @dfn{application programming
interface} (API) defined for this purpose by the @command{gawk}
developers. The rest of this @value{CHAPTER} explains
the facilities that the API provides and how to use
them, and presents a small example extension. In addition, it documents
the sample extensions included in the @command{gawk} distribution
and describes the @code{gawkextlib} project.
@ifclear FOR_PRINT
@xref{Extension Design}, for a discussion of the extension mechanism
goals and design.
@end ifclear
@ifset FOR_PRINT
See @uref{https://www.gnu.org/software/gawk/manual/html_node/Extension-Design.html}
for a discussion of the extension mechanism
goals and design.
@end ifset

@node Plugin License
@section Extension Licensing

Every dynamic extension must be distributed under a license that is
compatible with the GNU GPL (@pxref{Copying}).

In order for the extension to tell @command{gawk} that it is
properly licensed, the extension must define the global symbol
@code{plugin_is_GPL_compatible}. If this symbol does not exist,
@command{gawk} emits a fatal error and exits when it tries to load
your extension.

The declared type of the symbol should be @code{int}. It does not need
to be in any allocated section, though. The code merely asserts that
the symbol exists in the global scope. Something like this is enough:

@example
int plugin_is_GPL_compatible;
@end example

@node Extension Mechanism Outline
@section How It Works at a High Level

Communication between
@command{gawk} and an extension is two-way. First, when an extension
is loaded, @command{gawk} passes it a pointer to a @code{struct} whose fields are
function pointers.
@ifnotdocbook
This is shown in @ref{figure-load-extension}.
33996@end ifnotdocbook 33997@ifdocbook 33998This is shown in @inlineraw{docbook, <xref linkend="figure-load-extension"/>}. 33999@end ifdocbook 34000 34001@ifnotdocbook 34002@float Figure,figure-load-extension 34003@caption{Loading the extension} 34004@center @image{api-figure1, , , Loading the extension} 34005@end float 34006@end ifnotdocbook 34007 34008@docbook 34009<figure id="figure-load-extension" float="0"> 34010<title>Loading the extension</title> 34011<mediaobject> 34012<imageobject role="web"><imagedata fileref="api-figure1.png" format="PNG"/></imageobject> 34013</mediaobject> 34014</figure> 34015@end docbook 34016 34017The extension can call functions inside @command{gawk} through these 34018function pointers, at runtime, without needing (link-time) access 34019to @command{gawk}'s symbols. One of these function pointers is to a 34020function for ``registering'' new functions. 34021@ifnotdocbook 34022This is shown in @ref{figure-register-new-function}. 34023@end ifnotdocbook 34024@ifdocbook 34025This is shown in @inlineraw{docbook, <xref linkend="figure-register-new-function"/>}. 34026@end ifdocbook 34027 34028@ifnotdocbook 34029@float Figure,figure-register-new-function 34030@caption{Registering a new function} 34031@center @image{api-figure2, , , Registering a new Function} 34032@end float 34033@end ifnotdocbook 34034 34035@docbook 34036<figure id="figure-register-new-function" float="0"> 34037<title>Registering a new function</title> 34038<mediaobject> 34039<imageobject role="web"><imagedata fileref="api-figure2.png" format="PNG"/></imageobject> 34040</mediaobject> 34041</figure> 34042@end docbook 34043 34044In the other direction, the extension registers its new functions 34045with @command{gawk} by passing function pointers to the functions that 34046provide the new feature (@code{do_chdir()}, for example). @command{gawk} 34047associates the function pointer with a name and can then call it, using a 34048defined calling convention. 
34049@ifnotdocbook 34050This is shown in @ref{figure-call-new-function}. 34051@end ifnotdocbook 34052@ifdocbook 34053This is shown in @inlineraw{docbook, <xref linkend="figure-call-new-function"/>}. 34054@end ifdocbook 34055 34056@ifnotdocbook 34057@float Figure,figure-call-new-function 34058@caption{Calling the new function} 34059@center @image{api-figure3, , , Calling the new function} 34060@end float 34061@end ifnotdocbook 34062 34063@docbook 34064<figure id="figure-call-new-function" float="0"> 34065<title>Calling the new function</title> 34066<mediaobject> 34067<imageobject role="web"><imagedata fileref="api-figure3.png" format="PNG"/></imageobject> 34068</mediaobject> 34069</figure> 34070@end docbook 34071 34072The @code{do_@var{xxx}()} function, in turn, then uses the function 34073pointers in the API @code{struct} to do its work, such as updating 34074variables or arrays, printing messages, setting @code{ERRNO}, and so on. 34075 34076Convenience macros make calling through the function pointers look 34077like regular function calls so that extension code is quite readable 34078and understandable. 34079 34080Although all of this sounds somewhat complicated, the result is that 34081extension code is quite straightforward to write and to read. You can 34082see this in the sample extension @file{filefuncs.c} (@pxref{Extension 34083Example}) and also in the @file{testext.c} code for testing the APIs. 34084 34085Some other bits and pieces: 34086 34087@itemize @value{BULLET} 34088@item 34089The API provides access to @command{gawk}'s @code{do_@var{xxx}} values, 34090reflecting command-line options, like @code{do_lint}, @code{do_profiling}, 34091and so on (@pxref{Extension API Variables}). 34092These are informational: an extension cannot affect their values 34093inside @command{gawk}. In addition, attempting to assign to them 34094produces a compile-time error. 
34095 34096@item 34097The API also provides major and minor version numbers, so that an 34098extension can check if the @command{gawk} it is loaded with supports the 34099facilities it was compiled with. (Version mismatches ``shouldn't'' 34100happen, but we all know how @emph{that} goes.) 34101@xref{Extension Versioning} for details. 34102@end itemize 34103 34104@node Extension API Description 34105@section API Description 34106@cindex extension API 34107 34108C or C++ code for an extension must include the header file 34109@file{gawkapi.h}, which declares the functions and defines the data 34110types used to communicate with @command{gawk}. 34111This (rather large) @value{SECTION} describes the API in detail. 34112 34113@menu 34114* Extension API Functions Introduction:: Introduction to the API functions. 34115* General Data Types:: The data types. 34116* Memory Allocation Functions:: Functions for allocating memory. 34117* Constructor Functions:: Functions for creating values. 34118* API Ownership of MPFR and GMP Values:: Managing MPFR and GMP Values. 34119* Registration Functions:: Functions to register things with 34120 @command{gawk}. 34121* Printing Messages:: Functions for printing messages. 34122* Updating @code{ERRNO}:: Functions for updating @code{ERRNO}. 34123* Requesting Values:: How to get a value. 34124* Accessing Parameters:: Functions for accessing parameters. 34125* Symbol Table Access:: Functions for accessing global 34126 variables. 34127* Array Manipulation:: Functions for working with arrays. 34128* Redirection API:: How to access and manipulate 34129 redirections. 34130* Extension API Variables:: Variables provided by the API. 34131* Extension API Boilerplate:: Boilerplate code for using the API. 34132* Changes from API V1:: Changes from V1 of the API. 
34133@end menu 34134 34135@node Extension API Functions Introduction 34136@subsection Introduction 34137 34138Access to facilities within @command{gawk} is achieved 34139by calling through function pointers passed into your extension. 34140 34141API function pointers are provided for the following kinds of operations: 34142 34143@itemize @value{BULLET} 34144@item 34145Allocating, reallocating, and releasing memory. 34146 34147@item 34148Registration functions. You may register: 34149 34150@c nested list 34151@itemize @value{MINUS} 34152@item 34153Extension functions 34154@item 34155Exit callbacks 34156@item 34157A version string 34158@item 34159Input parsers 34160@item 34161Output wrappers 34162@item 34163Two-way processors 34164@end itemize 34165 34166All of these are discussed in detail later in this @value{CHAPTER}. 34167 34168@item 34169Printing fatal, warning, and ``lint'' warning messages. 34170 34171@item 34172Updating @code{ERRNO}, or unsetting it. 34173 34174@item 34175Accessing parameters, including converting an undefined parameter into 34176an array. 34177 34178@item 34179Symbol table access: retrieving a global variable, creating one, 34180or changing one. 34181 34182@item 34183Creating and releasing cached values; this provides an 34184efficient way to use values for multiple variables and 34185can be a big performance win. 34186 34187@item 34188Manipulating arrays: 34189 34190@itemize @value{MINUS} 34191@item 34192Retrieving, adding, deleting, and modifying elements 34193 34194@item 34195Getting the count of elements in an array 34196 34197@item 34198Creating a new array 34199 34200@item 34201Clearing an array 34202 34203@item 34204Flattening an array for easy C-style looping over all its indices and elements 34205@end itemize 34206 34207@item 34208Accessing and manipulating redirections. 
34209 34210@end itemize 34211 34212Some points about using the API: 34213 34214@itemize @value{BULLET} 34215@item 34216The following types, macros, and/or functions are referenced 34217in @file{gawkapi.h}. For correct use, you must therefore include the 34218corresponding standard header file @emph{before} including @file{gawkapi.h}. 34219The list of macros and related header files is shown in @ref{table-api-std-headers}. 34220 34221@float Table,table-api-std-headers 34222@caption{Standard header files needed by API} 34223@multitable {@code{memset()}, @code{memcpy()}} {@code{<sys/types.h>}} 34224@headitem C entity @tab Header file 34225@item @code{EOF} @tab @code{<stdio.h>} 34226@item Values for @code{errno} @tab @code{<errno.h>} 34227@item @code{FILE} @tab @code{<stdio.h>} 34228@item @code{NULL} @tab @code{<stddef.h>} 34229@item @code{memcpy()} @tab @code{<string.h>} 34230@item @code{memset()} @tab @code{<string.h>} 34231@item @code{size_t} @tab @code{<sys/types.h>} 34232@item @code{struct stat} @tab @code{<sys/stat.h>} 34233@end multitable 34234@end float 34235 34236Due to portability concerns, especially to systems that are not 34237fully standards-compliant, it is your responsibility 34238to include the correct files in the correct way. This requirement 34239is necessary in order to keep @file{gawkapi.h} clean, instead of becoming 34240a portability hodge-podge as can be seen in some parts of 34241the @command{gawk} source code. 34242 34243@item 34244If your extension uses MPFR facilities, and you wish to receive such 34245values from @command{gawk} and/or pass such values to it, you must include the 34246@code{<mpfr.h>} header before including @code{<gawkapi.h>}. 34247 34248@item 34249The @file{gawkapi.h} file may be included more than once without ill effect. 34250Doing so, however, is poor coding practice. 
34251 34252@item 34253Although the API only uses ISO C 90 features, there is an exception; the 34254``constructor'' functions use the @code{inline} keyword. If your compiler 34255does not support this keyword, you should either place 34256@samp{-Dinline=''} on your command line or use the GNU Autotools and include a 34257@file{config.h} file in your extensions. 34258 34259@item 34260All pointers filled in by @command{gawk} point to memory 34261managed by @command{gawk} and should be treated by the extension as 34262read-only. 34263 34264Memory for @emph{all} strings passed into @command{gawk} 34265from the extension @emph{must} come from calling one of 34266@code{gawk_malloc()}, @code{gawk_calloc()}, or @code{gawk_realloc()}, 34267and is managed by @command{gawk} from then on. 34268 34269Memory for MPFR/GMP values that come from @command{gawk} 34270should also be treated as read-only. However, unlike strings, 34271memory for MPFR/GMP values allocated by an extension and passed 34272into @command{gawk} is @emph{copied} by @command{gawk}; the extension 34273should then free the values itself to avoid memory leaks. This is 34274discussed further in @strong{API Ownership of MPFR and GMP Values}. 34275 34276@item 34277The API defines several simple @code{struct}s that map values as seen 34278from @command{awk}. A value can be a @code{double}, a string, or an 34279array (as in multidimensional arrays, or when creating a new array). 34280 34281String values maintain both pointer and length, because embedded @sc{nul} 34282characters are allowed. 34283 34284@quotation NOTE 34285By intent, @command{gawk} maintains strings using the current multibyte 34286encoding (as defined by @env{LC_@var{xxx}} environment variables) 34287and not using wide characters. This matches how @command{gawk} stores 34288strings internally and also how characters are likely to be input into 34289and output from files. 
34290@end quotation 34291 34292@quotation NOTE 34293String values passed to an extension by @command{gawk} are always 34294@sc{nul}-terminated. Thus it is safe to pass such string values to 34295standard library and system routines. However, because @command{gawk} 34296allows embedded @sc{nul} characters in string data, before using the data 34297as a regular C string, you should check that the length for that string 34298passed to the extension matches the return value of @code{strlen()} 34299for it. 34300@end quotation 34301 34302@item 34303When retrieving a value (such as a parameter or that of a global variable 34304or array element), the extension requests a specific type (number, string, 34305scalar, value cookie, array, or ``undefined''). When the request is 34306``undefined,'' the returned value will have the real underlying type. 34307 34308However, if the request and actual type don't match, the access function 34309returns ``false'' and fills in the type of the actual value that is there, 34310so that the extension can, e.g., print an error message 34311(such as ``scalar passed where array expected''). 34312 34313@c This is documented in the header file and needs some expanding upon. 34314@c The table there should be presented here 34315@end itemize 34316 34317You may call the API functions by using the function pointers 34318directly, but the interface is not so pretty. To make extension code look 34319more like regular code, the @file{gawkapi.h} header file defines several 34320macros that you should use in your code. This @value{SECTION} presents 34321the macros as if they were functions. 
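The pattern described above (a @code{struct} of function pointers handed to the extension at load time, plus macros that hide the pointer and the extension ID) can be sketched in a few lines of plain C. This is only an illustration under stated assumptions: every name in it (@code{mock_api_t}, @code{mock_ext_id_t}, @code{mock_load()}, @code{register_func()}, @code{set_error()}, and the @code{host_@var{xxx}} items) is hypothetical, and none of them is a declaration from @file{gawkapi.h}. The ``host'' half merely stands in for @command{gawk}.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for illustration; NOT the declarations in gawkapi.h. */
typedef void *mock_ext_id_t;

typedef struct mock_api {
    int  (*api_register)(mock_ext_id_t id, const char *name);
    void (*api_set_error)(mock_ext_id_t id, const char *message);
} mock_api_t;

/* ---- "host" side: plays the role of gawk in this sketch ---- */
static char host_error[64];     /* last error message recorded by the host */
static int  host_registered;    /* number of names accepted so far */

static int host_register(mock_ext_id_t id, const char *name)
{
    (void) id;
    if (name == NULL || name[0] == '\0')
        return 0;               /* reject a missing or empty name */
    host_registered++;
    return 1;
}

static void host_set_error(mock_ext_id_t id, const char *message)
{
    (void) id;
    strncpy(host_error, message, sizeof(host_error) - 1);
    host_error[sizeof(host_error) - 1] = '\0';
}

/* The table of function pointers that the host passes to the extension. */
const mock_api_t host_api = { host_register, host_set_error };

/* ---- "extension" side ---- */
static const mock_api_t *api;   /* saved when the extension is loaded */
static mock_ext_id_t ext_id;    /* passed back on every call */

/* Convenience macros hide both the pointer and the id argument. */
#define register_func(name)  (api->api_register(ext_id, (name)))
#define set_error(message)   (api->api_set_error(ext_id, (message)))

/* Called by the host at load time; returns nonzero on success. */
int mock_load(const mock_api_t *api_p, mock_ext_id_t id)
{
    assert(api_p != NULL);
    api = api_p;
    ext_id = id;

    /* A direct call through the pointer: correct, but noisy. */
    if (! api->api_register(ext_id, "chdir"))
        return 0;

    /* The same kind of call through a macro reads like a function call. */
    if (! register_func("stat")) {
        set_error("could not register stat");
        return 0;
    }
    return 1;
}
```

To experiment with the sketch, call @code{mock_load(&host_api, NULL)} from a small @code{main()}. The design point is that the extension never links against the host's symbols; everything it needs arrives through the table it was given.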
34322 34323@node General Data Types 34324@subsection General-Purpose Data Types 34325 34326@cindex Robbins @subentry Arnold 34327@cindex Ramey, Chet 34328@quotation 34329@i{I have a true love/hate relationship with unions.} 34330@author Arnold Robbins 34331@end quotation 34332 34333@quotation 34334@i{That's the thing about unions: the compiler will arrange things so they 34335can accommodate both love and hate.} 34336@author Chet Ramey 34337@end quotation 34338 34339The extension API defines a number of simple types and structures for 34340general-purpose use. Additional, more specialized, data structures are 34341introduced in subsequent @value{SECTION}s, together with the functions 34342that use them. 34343 34344The general-purpose types and structures are as follows: 34345 34346@table @code 34347@item typedef void *awk_ext_id_t; 34348A value of this type is received from @command{gawk} when an extension is loaded. 34349That value must then be passed back to @command{gawk} as the first parameter of 34350each API function. 34351 34352@item #define awk_const @dots{} 34353This macro expands to @samp{const} when compiling an extension, 34354and to nothing when compiling @command{gawk} itself. This makes 34355certain fields in the API data structures unwritable from extension code, 34356while allowing @command{gawk} to use them as it needs to. 34357 34358@item typedef enum awk_bool @{ 34359@itemx @ @ @ @ awk_false = 0, 34360@itemx @ @ @ @ awk_true 34361@itemx @} awk_bool_t; 34362A simple Boolean type. 34363 34364@item typedef struct awk_string @{ 34365@itemx @ @ @ @ char *str;@ @ @ @ @ @ /* data */ 34366@itemx @ @ @ @ size_t len;@ @ @ @ @ /* length thereof, in chars */ 34367@itemx @} awk_string_t; 34368This represents a mutable string. @command{gawk} 34369owns the memory pointed to if it supplied 34370the value. Otherwise, it takes ownership of the memory pointed to. 
34371@emph{Such memory must come from calling one of the 34372@code{gawk_malloc()}, @code{gawk_calloc()}, or 34373@code{gawk_realloc()} functions!} 34374 34375As mentioned earlier, strings are maintained using the current 34376multibyte encoding. 34377 34378@item typedef enum @{ 34379@itemx @ @ @ @ AWK_UNDEFINED, 34380@itemx @ @ @ @ AWK_NUMBER, 34381@itemx @ @ @ @ AWK_STRING, 34382@itemx @ @ @ @ AWK_REGEX, 34383@itemx @ @ @ @ AWK_STRNUM, 34384@itemx @ @ @ @ AWK_ARRAY, 34385@itemx @ @ @ @ AWK_SCALAR,@ @ @ @ @ @ @ @ @ /* opaque access to a variable */ 34386@itemx @ @ @ @ AWK_VALUE_COOKIE@ @ @ @ /* for updating a previously created value */ 34387@itemx @} awk_valtype_t; 34388This @code{enum} indicates the type of a value. 34389It is used in the following @code{struct}. 34390 34391@item typedef struct awk_value @{ 34392@itemx @ @ @ @ awk_valtype_t val_type; 34393@itemx @ @ @ @ union @{ 34394@itemx @ @ @ @ @ @ @ @ awk_string_t@ @ @ @ @ @ @ s; 34395@itemx @ @ @ @ @ @ @ @ awk_number_t@ @ @ @ @ @ @ n; 34396@itemx @ @ @ @ @ @ @ @ awk_array_t@ @ @ @ @ @ @ @ a; 34397@itemx @ @ @ @ @ @ @ @ awk_scalar_t@ @ @ @ @ @ @ scl; 34398@itemx @ @ @ @ @ @ @ @ awk_value_cookie_t@ vc; 34399@itemx @ @ @ @ @} u; 34400@itemx @} awk_value_t; 34401An ``@command{awk} value.'' 34402The @code{val_type} member indicates what kind of value the 34403@code{union} holds, and each member is of the appropriate type. 34404 34405@item #define str_value@ @ @ @ @ @ u.s 34406@itemx #define strnum_value@ @ @ str_value 34407@itemx #define regex_value@ @ @ @ str_value 34408@itemx #define num_value@ @ @ @ @ @ u.n.d 34409@itemx #define num_type@ @ @ @ @ @ @ u.n.type 34410@itemx #define num_ptr@ @ @ @ @ @ @ @ u.n.ptr 34411@itemx #define array_cookie@ @ @ u.a 34412@itemx #define scalar_cookie@ @ u.scl 34413@itemx #define value_cookie@ @ @ u.vc 34414Using these macros makes accessing the fields of the @code{awk_value_t} more 34415readable.
34416 34417@item enum AWK_NUMBER_TYPE @{ 34418@itemx @ @ @ @ AWK_NUMBER_TYPE_DOUBLE, 34419@itemx @ @ @ @ AWK_NUMBER_TYPE_MPFR, 34420@itemx @ @ @ @ AWK_NUMBER_TYPE_MPZ 34421@itemx @}; 34422This @code{enum} is used in the following structure for defining the 34423type of numeric value that is being worked with. It is declared at the 34424top level of the file so that it works correctly for C++ as well as for C. 34425 34426@item typedef struct awk_number @{ 34427@itemx @ @ @ @ double d; 34428@itemx @ @ @ @ enum AWK_NUMBER_TYPE type; 34429@itemx @ @ @ @ void *ptr; 34430@itemx @} awk_number_t; 34431This represents a numeric value. Internally, @command{gawk} stores 34432every number as either a C @code{double}, a GMP integer, or an MPFR 34433arbitrary-precision floating-point value. In order to allow extensions 34434to also support GMP and MPFR values, numeric values are passed in this 34435structure. 34436 34437The double-precision @code{d} element is always populated 34438in data received from @command{gawk}. In addition, by examining the 34439@code{type} member, an extension can determine whether the @code{ptr} 34440member points to a GMP integer (type @code{mpz_ptr}) or to an MPFR 34441floating-point value (type @code{mpfr_ptr}), and cast it appropriately. 34442 34443@quotation CAUTION 34444Any MPFR or MPZ values that you create and pass to @command{gawk} 34445to save are @emph{copied}. This means you are responsible for releasing 34446the storage once you're done with it. See the sample @code{intdiv} 34447extension for some example code. 34448@end quotation 34449 34450@item typedef void *awk_scalar_t; 34451Scalars can be represented as an opaque type. These values are obtained 34452from @command{gawk} and then passed back into it. This is discussed 34453in a general fashion in the text following this list, and in more detail in 34454@ref{Symbol table by cookie}.
34455 34456@item typedef void *awk_value_cookie_t; 34457A ``value cookie'' is an opaque type representing a cached value. 34458This is also discussed in a general fashion in the text following this list, 34459and in more detail in @ref{Cached values}. 34460 34461@end table 34462 34463Scalar values in @command{awk} are numbers, strings, strnums, or typed regexps. The 34464@code{awk_value_t} struct represents values. The @code{val_type} member 34465indicates what is in the @code{union}. 34466 34467Representing numbers is easy---the API uses a C @code{double}. Strings 34468require more work. Because @command{gawk} allows embedded @sc{nul} bytes 34469in string values, a string must be represented as a pair containing a 34470data pointer and length. This is the @code{awk_string_t} type. 34471 34472A strnum (numeric string) value is represented as a string and consists 34473of user input data that appears to be numeric. 34474When an extension creates a strnum value, the result is a string flagged 34475as user input. Subsequent parsing by @command{gawk} then determines whether it 34476looks like a number and should be treated as a strnum, or as a regular string. 34477 34478This is useful in cases where an extension function would like to do something 34479comparable to the @code{split()} function which sets the strnum attribute 34480on the array elements it creates. For example, an extension that implements 34481CSV splitting would want to use this feature. This is also useful for a 34482function that retrieves a data item from a database. The PostgreSQL 34483@code{PQgetvalue()} function, for example, returns a string that may be numeric 34484or textual depending on the contents. 34485 34486Typed regexp values (@pxref{Strong Regexp Constants}) are not of 34487much use to extension functions. Extension functions can tell that 34488they've received them, and create them for scalar values. 
Otherwise, 34489they can examine the text of the regexp through @code{regex_value.str} 34490and @code{regex_value.len}. 34491 34492Identifiers (i.e., the names of global variables) can be associated 34493with either scalar values or with arrays. In addition, @command{gawk} 34494provides true arrays of arrays, where any given array element can 34495itself be an array. Discussion of arrays is delayed until 34496@ref{Array Manipulation}. 34497 34498The various macros listed earlier make it easier to use the elements 34499of the @code{union} as if they were fields in a @code{struct}; this 34500is a common coding practice in C. Such code is easier to write and to 34501read, but it remains @emph{your} responsibility to make sure that 34502the @code{val_type} member correctly reflects the type of the value in 34503the @code{awk_value_t} struct. 34504 34505Conceptually, the first three members of the @code{union} (number, string, 34506and array) are all that is needed for working with @command{awk} values. 34507However, because the API provides routines for accessing and changing 34508the value of a global scalar variable only by using the variable's name, 34509there is a performance penalty: @command{gawk} must find the variable 34510each time it is accessed and changed. This turns out to be a real issue, 34511not just a theoretical one. 34512 34513Thus, if you know that your extension will spend considerable time 34514reading and/or changing the value of one or more scalar variables, you 34515can obtain a @dfn{scalar cookie}@footnote{See 34516@uref{http://catb.org/jargon/html/C/cookie.html, the ``cookie'' entry in the Jargon file} for a 34517definition of @dfn{cookie}, and @uref{http://catb.org/jargon/html/M/magic-cookie.html, 34518the ``magic cookie'' entry in the Jargon file} for a nice example. 34519@ifclear FOR_PRINT 34520See also the entry for ``Cookie'' in the @ref{Glossary}. 
34521@end ifclear 34522} 34523object for that variable, and then use 34524the cookie for getting the variable's value or for changing the variable's 34525value. 34526The @code{awk_scalar_t} type holds a scalar cookie, and the 34527@code{scalar_cookie} macro provides access to the value of that type 34528in the @code{awk_value_t} struct. 34529Given a scalar cookie, @command{gawk} can directly retrieve or 34530modify the value, as required, without having to find it first. 34531 34532The @code{awk_value_cookie_t} type and @code{value_cookie} macro are similar. 34533If you know that you wish to 34534use the same numeric or string @emph{value} for one or more variables, 34535you can create the value once, retaining a @dfn{value cookie} for it, 34536and then pass in that value cookie whenever you wish to set the value of a 34537variable. This saves storage space within the running @command{gawk} 34538process and reduces the time needed to create the value. 34539 34540@node Memory Allocation Functions 34541@subsection Memory Allocation Functions and Convenience Macros 34542@cindex allocating memory for extensions 34543@cindex extensions @subentry loadable @subentry allocating memory 34544@cindex memory, allocating for extensions 34545 34546The API provides a number of @dfn{memory allocation} functions for 34547allocating memory that can be passed to @command{gawk}, as well as a number of 34548convenience macros. 34549This @value{SUBSECTION} presents them all as function prototypes, in 34550the way that extension code would use them: 34551 34552@table @code 34553@item void *gawk_malloc(size_t size); 34554Call the correct version of @code{malloc()} to allocate storage that may 34555be passed to @command{gawk}. 34556 34557@item void *gawk_calloc(size_t nmemb, size_t size); 34558Call the correct version of @code{calloc()} to allocate storage that may 34559be passed to @command{gawk}. 
34560 34561@item void *gawk_realloc(void *ptr, size_t size); 34562Call the correct version of @code{realloc()} to allocate or resize storage that may 34563be passed to @command{gawk}. 34564 34565@item void gawk_free(void *ptr); 34566Call the correct version of @code{free()} to release storage that was 34567allocated with @code{gawk_malloc()}, @code{gawk_calloc()}, or @code{gawk_realloc()}. 34568@end table 34569 34570The API has to provide these functions because it is possible 34571for an extension to be compiled and linked against a different 34572version of the C library than was used for the @command{gawk} 34573executable.@footnote{This is more common on MS-Windows systems, but it 34574can happen on Unix-like systems as well.} If @command{gawk} were 34575to use its version of @code{free()} when the memory came from an 34576unrelated version of @code{malloc()}, unexpected behavior would 34577likely result. 34578 34579Three convenience macros may be used for allocating storage 34580from @code{gawk_malloc()}, @code{gawk_calloc()}, and 34581@code{gawk_realloc()}. If the allocation fails, they cause @command{gawk} 34582to exit with a fatal error message. They should be used as if they were 34583procedure calls that do not return a value: 34584 34585@table @code 34586@item #define emalloc(pointer, type, size, message) @dots{} 34587The arguments to this macro are as follows: 34588 34589@c nested table 34590@table @code 34591@item pointer 34592The pointer variable that is set to point at the allocated storage. 34593 34594@item type 34595The type of the pointer variable. This is used to create a cast for 34596the call to @code{gawk_malloc()}. 34597 34598@item size 34599The total number of bytes to be allocated. 34600 34601@item message 34602A message to be prefixed to the fatal error message. Typically this is the name 34603of the function using the macro.
34604@end table 34605 34606@noindent 34607For example, you might allocate a string value like so: 34608 34609@example 34610@group 34611awk_value_t result; 34612char *message; 34613const char greet[] = "Don't Panic!"; 34614 34615emalloc(message, char *, sizeof(greet), "myfunc"); 34616strcpy(message, greet); 34617make_malloced_string(message, strlen(message), & result); 34618@end group 34619@end example 34620 34621@sp 2 34622@item #define ezalloc(pointer, type, size, message) @dots{} 34623This is like @code{emalloc()}, but it calls @code{gawk_calloc()} 34624instead of @code{gawk_malloc()}. 34625The arguments are the same as for the @code{emalloc()} macro, but this 34626macro guarantees that the memory returned is initialized to zero. 34627 34628@item #define erealloc(pointer, type, size, message) @dots{} 34629This is like @code{emalloc()}, but it calls @code{gawk_realloc()} 34630instead of @code{gawk_malloc()}. 34631The arguments are the same as for the @code{emalloc()} macro. 34632@end table 34633 34634Two additional functions allocate MPFR and GMP objects for use 34635by extension functions that need to create and then return such 34636values. 34637 34638@quotation NOTE 34639These functions are obsolete. Extension functions that need local MPFR 34640and GMP values should simply allocate them on the stack and clear them, 34641as any other code would. 34642@end quotation 34643 34644@noindent 34645The functions are: 34646 34647@table @code 34648@item void *get_mpfr_ptr(); 34649Allocate and initialize an MPFR object and return a pointer to it. 34650If the allocation fails, @command{gawk} exits with a fatal 34651``out of memory'' error. If @command{gawk} was compiled without 34652MPFR support, calling this function causes a fatal error. 34653 34654@item void *get_mpz_ptr(); 34655Allocate and initialize a GMP object and return a pointer to it. 34656If the allocation fails, @command{gawk} exits with a fatal 34657``out of memory'' error. 
If @command{gawk} was compiled without 34658MPFR support, calling this function causes a fatal error. 34659@end table 34660 34661Both of these functions return @samp{void *}, since the @file{gawkapi.h} 34662header file should not have dependency upon @code{<mpfr.h>} (and @code{<gmp.h>}, 34663which is included from @code{<mpfr.h>}). The actual return values are of 34664types @code{mpfr_ptr} and @code{mpz_ptr} respectively, and you should cast 34665the return values appropriately before assigning the results to variables 34666of the correct types. 34667 34668The memory allocated by these functions should be freed with 34669@code{gawk_free()}. 34670 34671@node Constructor Functions 34672@subsection Constructor Functions 34673 34674The API provides a number of @dfn{constructor} functions for creating 34675string and numeric values, as well as a number of convenience macros. 34676This @value{SUBSECTION} presents them all as function prototypes, in 34677the way that extension code would use them: 34678 34679@table @code 34680@item static inline awk_value_t * 34681@itemx make_const_string(const char *string, size_t length, awk_value_t *result); 34682This function creates a string value in the @code{awk_value_t} variable 34683pointed to by @code{result}. It expects @code{string} to be a C string constant 34684(or other string data), and automatically creates a @emph{copy} of the data 34685for storage in @code{result}. It returns @code{result}. 34686 34687@item static inline awk_value_t * 34688@itemx make_malloced_string(const char *string, size_t length, awk_value_t *result); 34689This function creates a string value in the @code{awk_value_t} variable 34690pointed to by @code{result}. It expects @code{string} to be a @samp{char *} 34691value pointing to data previously obtained from @code{gawk_malloc()}, @code{gawk_calloc()}, or @code{gawk_realloc()}. The idea here 34692is that the data is passed directly to @command{gawk}, which assumes 34693responsibility for it. 
It returns @code{result}. 34694 34695@item static inline awk_value_t * 34696@itemx make_null_string(awk_value_t *result); 34697This specialized function creates a null string (the ``undefined'' value) 34698in the @code{awk_value_t} variable pointed to by @code{result}. 34699It returns @code{result}. 34700 34701@item static inline awk_value_t * 34702@itemx make_number(double num, awk_value_t *result); 34703This function simply creates a numeric value in the @code{awk_value_t} variable 34704pointed to by @code{result}. 34705 34706@item static inline awk_value_t * 34707@itemx make_number_mpz(void *mpz, awk_value_t *result); 34708This function creates a GMP number value in @code{result}. 34709The @code{mpz} must be from a call to @code{get_mpz_ptr()} 34710(and thus be of real underlying type @code{mpz_ptr}). 34711 34712@item static inline awk_value_t * 34713@itemx make_number_mpfr(void *mpfr, awk_value_t *result); 34714This function creates an MPFR number value in @code{result}. 34715The @code{mpfr} must be from a call to @code{get_mpfr_ptr()}. 34716 34717@item static inline awk_value_t * 34718@itemx make_const_user_input(const char *string, size_t length, awk_value_t *result); 34719This function is identical to @code{make_const_string()}, but the string is 34720flagged as user input that should be treated as a strnum value if the contents 34721of the string are numeric. 34722 34723@item static inline awk_value_t * 34724@itemx make_malloced_user_input(const char *string, size_t length, awk_value_t *result); 34725This function is identical to @code{make_malloced_string()}, but the string is 34726flagged as user input that should be treated as a strnum value if the contents 34727of the string are numeric. 34728 34729@item static inline awk_value_t * 34730@itemx make_const_regex(const char *string, size_t length, awk_value_t *result); 34731This function creates a strongly typed regexp value by allocating a copy of the string. 
@code{string} is the regular expression of length @code{length}.

@item static inline awk_value_t *
@itemx make_malloced_regex(const char *string, size_t length, awk_value_t *result);
This function creates a strongly typed regexp value. @code{string} is
the regular expression of length @code{length}. It expects @code{string}
to be a @samp{char *} value pointing to data previously obtained from
@code{gawk_malloc()}, @code{gawk_calloc()}, or @code{gawk_realloc()}.

@end table

@node API Ownership of MPFR and GMP Values
@subsection Managing MPFR and GMP Values
@cindex MPFR values, API ownership of
@cindex GMP values, API ownership of
@cindex API, ownership of MPFR and GMP values

MPFR and GMP values are different from string values, where you can
``take ownership'' of the value simply by assigning pointers. For example:

@example
char *p = gawk_malloc(42);      p @ii{``owns'' the memory}
char *q = p;
p = NULL;                       @ii{now} q @ii{``owns'' it}
@end example

MPFR and GMP objects are indeed allocated on the stack or dynamically,
but the MPFR and GMP libraries treat these objects as values, the same way that
you would pass an @code{int} or a @code{double} by value. There is no
way to ``transfer ownership'' of MPFR and GMP objects. Thus, code in
an extension should look like this:

@example
mpz_t part1, part2, answer;             @ii{declare local values}

mpz_inits(part1, part2, answer, NULL);  @ii{initialize them before use}
mpz_set_si(part1, 21);                  @ii{do some computations}
mpz_set_si(part2, 21);
mpz_add(answer, part1, part2);
@dots{}
/* assume that result is a parameter of type (awk_value_t *). */
make_number_mpz(answer, result);        @ii{set it with final GMP value}

mpz_clear(part1);                       @ii{release intermediate values}
mpz_clear(part2);
mpz_clear(answer);

return result;
@end example

@node Registration Functions
@subsection Registration Functions
@cindex register loadable extension
@cindex extensions @subentry loadable @subentry registration

This @value{SUBSECTION} describes the API functions for
registering parts of your extension with @command{gawk}.

@menu
* Extension Functions::         Registering extension functions.
* Exit Callback Functions::     Registering an exit callback.
* Extension Version String::    Registering a version string.
* Input Parsers::               Registering an input parser.
* Output Wrappers::             Registering an output wrapper.
* Two-way processors::          Registering a two-way processor.
@end menu

@node Extension Functions
@subsubsection Registering An Extension Function

Extension functions are described by the following record:

@example
@group
typedef struct awk_ext_func @{
@ @ @ @ const char *name;
@ @ @ @ awk_value_t *(*const function)(int num_actual_args,
@ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ awk_value_t *result,
@ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ struct awk_ext_func *finfo);
@ @ @ @ const size_t max_expected_args;
@ @ @ @ const size_t min_required_args;
@ @ @ @ awk_bool_t suppress_lint;
@ @ @ @ void *data;        /* opaque pointer to any extra state */
@} awk_ext_func_t;
@end group
@end example

The fields are:

@table @code
@item const char *name;
The name of the new function.
@command{awk}-level code calls the function by this name.
This is a regular C string.

Function names must obey the rules for @command{awk}
identifiers. That is, they must begin with either an English letter
or an underscore, which may be followed by any number of
letters, digits, and underscores.
Letter case in function names is significant.

@item awk_value_t *(*const function)(int num_actual_args,
@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ awk_value_t *result,
@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ struct awk_ext_func *finfo);
This is a pointer to the C function that provides the extension's
functionality.
The function must fill in @code{*result} with either a number,
a string, or a regexp.
@command{gawk} takes ownership of any string memory.
As mentioned earlier, string memory @emph{must} come from one of
@code{gawk_malloc()}, @code{gawk_calloc()}, or @code{gawk_realloc()}.

The @code{num_actual_args} argument tells the C function how many
actual parameters were passed from the calling @command{awk} code.

The @code{finfo} parameter is a pointer to the @code{awk_ext_func_t} for
this function. The called function may access data within it as desired, or not.

The function must return the value of @code{result}.
This is for the convenience of the calling code inside @command{gawk}.

@item const size_t max_expected_args;
This is the maximum number of arguments the function expects to receive.
If called with more arguments than this, and if lint checking has
been enabled, then @command{gawk} prints a warning message. For more
information, see the entry for @code{suppress_lint}, later in this list.

@item const size_t min_required_args;
This is the minimum number of arguments the function expects to receive.
If called with fewer arguments, @command{gawk} prints a fatal error
message and exits.

@item awk_bool_t suppress_lint;
This flag tells @command{gawk} not to print a lint message if lint
checking has been enabled and if more arguments were supplied in the call
than expected. An extension function can tell if @command{gawk} already
printed at least one such message by checking if @samp{num_actual_args >
finfo->max_expected_args}. If so, and the function does not want more
lint messages to be printed, it should set @code{finfo->suppress_lint}
to @code{awk_true}.

@item void *data;
This is an opaque pointer to any data that an extension function may
wish to have available when called. Passing the @code{awk_ext_func_t}
structure to the extension function, and having this pointer available
in it enable writing a single C or C++ function that implements multiple
@command{awk}-level extension functions.
@end table

Once you have a record representing your extension function, you register
it with @command{gawk} using this API function:

@table @code
@item awk_bool_t add_ext_func(const char *name_space, awk_ext_func_t *func);
This function returns true upon success, false otherwise.
The @code{name_space} parameter is the namespace in which to place
the function (@pxref{Namespaces}).
Use an empty string (@code{""}) or @code{"awk"} to place
the function in the default @code{awk} namespace.
The @code{func} pointer is the address of a
@code{struct} representing your function, as just described.

@command{gawk} does not modify what @code{func} points to, but the
extension function itself receives this pointer and can modify what it
points to, thus it is purposely not declared to be @code{const}.
@end table

The combination of @code{min_required_args}, @code{max_expected_args},
and @code{suppress_lint} may be confusing. Here is how you should
set things up.

@table @asis
@item Any number of arguments is valid
Set @code{min_required_args} and @code{max_expected_args} to zero and
set @code{suppress_lint} to @code{awk_true}.

@item A minimum number of arguments is required, no limit on maximum number of arguments
Set @code{min_required_args} to the minimum required. Set
@code{max_expected_args} to zero and
set @code{suppress_lint} to @code{awk_true}.

@item A minimum number of arguments is required, a maximum number is expected
Set @code{min_required_args} to the minimum required. Set
@code{max_expected_args} to the maximum expected.
Set @code{suppress_lint} to @code{awk_false}.

@item A minimum number of arguments is required, and no more than a maximum is allowed
Set @code{min_required_args} to the minimum required. Set
@code{max_expected_args} to the maximum expected.
Set @code{suppress_lint} to @code{awk_false}.
In your extension function, check that @code{num_actual_args} does not
exceed @code{finfo->max_expected_args}. If it does, issue a fatal error message.
@end table

@node Exit Callback Functions
@subsubsection Registering An Exit Callback Function

An @dfn{exit callback} function is a function that
@command{gawk} calls before it exits.
Such functions are useful if you have general ``cleanup'' tasks
that should be performed in your extension (such as closing database
connections or other resource deallocations).
You can register such
a function with @command{gawk} using the following function:

@table @code
@item void awk_atexit(void (*funcp)(void *data, int exit_status),
@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ void *arg0);
The parameters are:

@c nested table
@table @code
@item funcp
A pointer to the function to be called before @command{gawk} exits.
The @code{data}
parameter will be the original value of @code{arg0}.
The @code{exit_status} parameter is the exit status value that
@command{gawk} intends to pass to the @code{exit()} system call.

@item arg0
A pointer to private data that @command{gawk} saves in order to pass to
the function pointed to by @code{funcp}.
@end table
@end table

Exit callback functions are called in last-in, first-out (LIFO)
order---that is, in the reverse order in which they are registered with
@command{gawk}.

@node Extension Version String
@subsubsection Registering An Extension Version String

You can register a version string that indicates the name and
version of your extension with @command{gawk}, as follows:

@table @code
@item void register_ext_version(const char *version);
Register the string pointed to by @code{version} with @command{gawk}.
Note that @command{gawk} does @emph{not} copy the @code{version} string, so
it should not be changed.
@end table

@command{gawk} prints all registered extension version strings when it
is invoked with the @option{--version} option.

@node Input Parsers
@subsubsection Customized Input Parsers
@cindex customized input parser

By default, @command{gawk} reads text files as its input. It uses the value
of @code{RS} to find the end of the record, and then uses @code{FS}
(or @code{FIELDWIDTHS} or @code{FPAT}) to split it into fields (@pxref{Reading Files}).
Additionally, it sets the value of @code{RT} (@pxref{Built-in Variables}).

If you want, you can provide your own custom input parser. An input
parser's job is to return a record to the @command{gawk} record-processing
code, along with indicators for the value and length of the data to be
used for @code{RT}, if any.
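
As a rough illustration of that job, the following sketch shows one way
a parser might locate a record and its terminator inside a buffer that it
has already read. It is plain C and does not use the @command{gawk} API:
the helper name @code{find_record()} and its parameters are hypothetical,
chosen only for this example, and it assumes that a single newline ends
each record:

```c
#include <stddef.h>
#include <string.h>

/*
 * find_record --- hypothetical helper, not part of the gawk API.
 *
 * Scan buf[0..len-1] for a newline-terminated record.  The record
 * starts at buf; the return value is its length.  *rt_start and
 * *rt_len describe the terminator, in the way that a real parser
 * would report the data to be used for RT.  If there is no newline,
 * the whole buffer is the record and the terminator length is zero.
 */
static size_t
find_record(const char *buf, size_t len,
            const char **rt_start, size_t *rt_len)
{
    const char *nl = memchr(buf, '\n', len);

    if (nl == NULL) {
        *rt_start = NULL;       /* no terminator found */
        *rt_len = 0;
        return len;
    }

    *rt_start = nl;             /* terminator is the newline itself */
    *rt_len = 1;
    return (size_t) (nl - buf);
}
```

Given the five-byte buffer @code{"ab\ncd"}, this helper reports a record
of length two and a one-byte terminator.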

To provide an input parser, you must first provide two functions
(where @var{XXX} is a prefix name for your extension):

@table @code
@item awk_bool_t @var{XXX}_can_take_file(const awk_input_buf_t *iobuf);
This function examines the information available in @code{iobuf}
(which we discuss shortly). Based on the information there, it
decides if the input parser should be used for this file.
If so, it should return true. Otherwise, it should return false.
It should not change any state (variable values, etc.) within @command{gawk}.

@item awk_bool_t @var{XXX}_take_control_of(awk_input_buf_t *iobuf);
When @command{gawk} decides to hand control of the file over to the
input parser, it calls this function. This function in turn must fill
in certain fields in the @code{awk_input_buf_t} structure and ensure
that certain conditions are true. It should then return true. If an
error of some kind occurs, it should not fill in any fields and should
return false; then @command{gawk} will not use the input parser.
The details are presented shortly.
@end table

Your extension should package these functions inside an
@code{awk_input_parser_t}, which looks like this:

@example
@group
typedef struct awk_input_parser @{
    const char *name;   /* name of parser */
    awk_bool_t (*can_take_file)(const awk_input_buf_t *iobuf);
    awk_bool_t (*take_control_of)(awk_input_buf_t *iobuf);
    awk_const struct awk_input_parser *awk_const next;   /* for gawk */
@} awk_input_parser_t;
@end group
@end example

The fields are:

@table @code
@item const char *name;
The name of the input parser. This is a regular C string.

@item awk_bool_t (*can_take_file)(const awk_input_buf_t *iobuf);
A pointer to your @code{@var{XXX}_can_take_file()} function.

@item awk_bool_t (*take_control_of)(awk_input_buf_t *iobuf);
A pointer to your @code{@var{XXX}_take_control_of()} function.

@item awk_const struct awk_input_parser *awk_const next;
This is for use by @command{gawk};
therefore it is marked @code{awk_const} so that the extension cannot
modify it.
@end table

The steps are as follows:

@enumerate
@item
Create a @code{static awk_input_parser_t} variable and initialize it
appropriately.

@item
When your extension is loaded, register your input parser with
@command{gawk} using the @code{register_input_parser()} API function
(described next).
@end enumerate

An @code{awk_input_buf_t} looks like this:

@example
typedef struct awk_input @{
    const char *name;       /* filename */
    int fd;                 /* file descriptor */
#define INVALID_HANDLE (-1)
    void *opaque;           /* private data for input parsers */
    int (*get_record)(char **out, struct awk_input *iobuf,
                      int *errcode, char **rt_start, size_t *rt_len,
                      const awk_fieldwidth_info_t **field_width);
    ssize_t (*read_func)();
    void (*close_func)(struct awk_input *iobuf);
    struct stat sbuf;       /* stat buf */
@} awk_input_buf_t;
@end example

The fields can be divided into two categories: those for use (initially,
at least) by @code{@var{XXX}_can_take_file()}, and those for use by
@code{@var{XXX}_take_control_of()}. The first group of fields and their uses
are as follows:

@table @code
@item const char *name;
The name of the file.

@item int fd;
A file descriptor for the file. If @command{gawk} was able to
open the file, then @code{fd} will @emph{not} be equal to
@code{INVALID_HANDLE}. Otherwise, it will.

@item struct stat sbuf;
If the file descriptor is valid, then @command{gawk} will have filled
in this structure via a call to the @code{fstat()} system call.
@end table

The @code{@var{XXX}_can_take_file()} function should examine these
fields and decide if the input parser should be used for the file.
The decision can be made based upon @command{gawk} state (the value
of a variable defined previously by the extension and set by
@command{awk} code), the name of the
file, whether or not the file descriptor is valid, the information
in the @code{struct stat}, or any combination of these factors.

Once @code{@var{XXX}_can_take_file()} has returned true, and
@command{gawk} has decided to use your input parser, it calls
@code{@var{XXX}_take_control_of()}. That function then fills
either the @code{get_record} field or the @code{read_func} field in
the @code{awk_input_buf_t}. It must also ensure that @code{fd} is @emph{not}
set to @code{INVALID_HANDLE}. The following list describes the fields that
may be filled by @code{@var{XXX}_take_control_of()}:

@table @code
@item void *opaque;
This is used to hold any state information needed by the input parser
for this file. It is ``opaque'' to @command{gawk}. The input parser
is not required to use this pointer.

@item int@ (*get_record)(char@ **out,
@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ struct@ awk_input *iobuf,
@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ int *errcode,
@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ char **rt_start,
@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ size_t *rt_len,
@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ const awk_fieldwidth_info_t **field_width);
This function pointer should point to a function that creates the input
records. Said function is the core of the input parser.
Its behavior
is described in the text following this list.

@item ssize_t (*read_func)();
This function pointer should point to a function that has the
same behavior as the standard POSIX @code{read()} system call.
It is an alternative to the @code{get_record} pointer. Its behavior
is also described in the text following this list.

@item void (*close_func)(struct awk_input *iobuf);
This function pointer should point to a function that does
the ``teardown.'' It should release any resources allocated by
@code{@var{XXX}_take_control_of()}. It may also close the file. If it
does so, it should set the @code{fd} field to @code{INVALID_HANDLE}.

If @code{fd} is still not @code{INVALID_HANDLE} after the call to this
function, @command{gawk} calls the regular @code{close()} system call.

Having a ``teardown'' function is optional. If your input parser does
not need it, do not set this field. Then, @command{gawk} calls the
regular @code{close()} system call on the file descriptor, so it should
be valid.
@end table

The @code{@var{XXX}_get_record()} function does the work of creating
input records. The parameters are as follows:

@table @code
@item char **out
This is a pointer to a @code{char *} variable that is set to point
to the record. @command{gawk} makes its own copy of the data, so
the extension must manage this storage.

@item struct awk_input *iobuf
This is the @code{awk_input_buf_t} for the file. The fields should be
used for reading data (@code{fd}) and for managing private state
(@code{opaque}), if any.

@item int *errcode
If an error occurs, @code{*errcode} should be set to an appropriate
code from @code{<errno.h>}.

@item char **rt_start
@itemx size_t *rt_len
If the concept of a ``record terminator'' makes sense, then
@code{*rt_start} should be set to point to the data to be used for
@code{RT}, and @code{*rt_len} should be set to the length of the
data. Otherwise, @code{*rt_len} should be set to zero.
@command{gawk} makes its own copy of this data, so the
extension must manage this storage.

@item const awk_fieldwidth_info_t **field_width
If @code{field_width} is not @code{NULL}, then @code{*field_width} will be initialized
to @code{NULL}, and the function may set it to point to a structure
supplying field width information to override the default
field parsing mechanism. Note that this structure will not
be copied by @command{gawk}; it must persist at least until the next call
to @code{get_record} or @code{close_func}. Note also that @code{field_width} is
@code{NULL} when @code{getline} is assigning the results to a variable, thus
field parsing is not needed. If the parser does set @code{*field_width},
then @command{gawk} uses this layout to parse the input record,
and the @code{PROCINFO["FS"]} value will be @code{"API"} while this record
is active in @code{$0}.
The @code{awk_fieldwidth_info_t} data structure
is described below.
@end table

The return value is the length of the buffer pointed to by
@code{*out}, or @code{EOF} if end-of-file was reached or an
error occurred.

It is guaranteed that @code{errcode} is a valid pointer, so there is no
need to test for a @code{NULL} value. @command{gawk} sets @code{*errcode}
to zero, so there is no need to set it unless an error occurs.

If an error does occur, the function should return @code{EOF} and set
@code{*errcode} to a value greater than zero.
In that case, if @code{*errcode}
does not equal zero, @command{gawk} automatically updates
the @code{ERRNO} variable based on the value of @code{*errcode}.
(In general, setting @samp{*errcode = errno} should do the right thing.)

As an alternative to supplying a function that returns an input record,
you may instead supply a function that simply reads bytes, and let
@command{gawk} parse the data into records. If you do so, the data
should be returned in the multibyte encoding of the current locale.
Such a function should follow the same behavior as the @code{read()}
system call, and you fill in the @code{read_func} pointer with its
address in the @code{awk_input_buf_t} structure.

By default, @command{gawk} sets the @code{read_func} pointer to
point to the @code{read()} system call. So your extension need not
set this field explicitly.

@quotation NOTE
You must choose one method or the other: either a function that
returns a record, or one that returns raw data. In particular,
if you supply a function to get a record, @command{gawk} will
call it, and will never call the raw read function.
@end quotation

@command{gawk} ships with a sample extension that reads directories,
returning records for each entry in a directory (@pxref{Extension
Sample Readdir}). You may wish to use that code as a guide for writing
your own input parser.

When writing an input parser, you should think about (and document)
how it is expected to interact with @command{awk} code. You may want
it to always be called, and to take effect as appropriate (as the
@code{readdir} extension does). Or you may want it to take effect
based upon the value of an @command{awk} variable, as the XML extension
from the @code{gawkextlib} project does (@pxref{gawkextlib}).
In the latter case, code in a @code{BEGINFILE} rule
can look at @code{FILENAME} and @code{ERRNO} to decide whether or
not to activate an input parser (@pxref{BEGINFILE/ENDFILE}).

You register your input parser with the following function:

@table @code
@item void register_input_parser(awk_input_parser_t *input_parser);
Register the input parser pointed to by @code{input_parser} with
@command{gawk}.
@end table

If you would like to override the default field parsing mechanism for a given
record, then you must populate an @code{awk_fieldwidth_info_t} structure,
which looks like this:

@example
typedef struct @{
    awk_bool_t use_chars;   /* false ==> use bytes */
    size_t nf;              /* number of fields in record (NF) */
    struct awk_field_info @{
        size_t skip;        /* amount to skip before field starts */
        size_t len;         /* length of field */
    @} fields[1];            /* actual dimension should be nf */
@} awk_fieldwidth_info_t;
@end example

The fields are:

@table @code
@item awk_bool_t use_chars;
Set this to @code{awk_true} if the field lengths are specified in terms
of potentially multi-byte characters, and set it to @code{awk_false} if
the lengths are in terms of bytes.
Performance will be better if the values are supplied in
terms of bytes.

@item size_t nf;
Set this to the number of fields in the input record, i.e. @code{NF}.

@item struct awk_field_info fields[nf];
This is a variable-length array whose actual dimension should be @code{nf}.
For each field, the @code{skip} element should be set to the number
of characters or bytes, as controlled by the @code{use_chars} flag,
to skip before the start of this field. The @code{len} element provides
the length of the field.
The values in @code{fields[0]} provide the information
for @code{$1}, and so on through the @code{fields[nf-1]} element containing the information for @code{$NF}.
@end table

A convenience macro @code{awk_fieldwidth_info_size(numfields)} is provided to
calculate the appropriate size of a variable-length
@code{awk_fieldwidth_info_t} structure containing @code{numfields} fields. This can
be used as an argument to @code{malloc()} or in a union to allocate space
statically. Please refer to the @code{readdir_test} sample extension for an
example.

@node Output Wrappers
@subsubsection Customized Output Wrappers
@cindex customized output wrapper

@cindex output wrapper
An @dfn{output wrapper} is the mirror image of an input parser.
It allows an extension to take over the output to a file opened
with the @samp{>} or @samp{>>} I/O redirection operators (@pxref{Redirection}).

The output wrapper is very similar to the input parser structure:

@example
typedef struct awk_output_wrapper @{
    const char *name;   /* name of the wrapper */
    awk_bool_t (*can_take_file)(const awk_output_buf_t *outbuf);
    awk_bool_t (*take_control_of)(awk_output_buf_t *outbuf);
    awk_const struct awk_output_wrapper *awk_const next;  /* for gawk */
@} awk_output_wrapper_t;
@end example

The members are as follows:

@table @code
@item const char *name;
This is the name of the output wrapper.

@item awk_bool_t (*can_take_file)(const awk_output_buf_t *outbuf);
This points to a function that examines the information in
the @code{awk_output_buf_t} structure pointed to by @code{outbuf}.
It should return true if the output wrapper wants to take over the
file, and false otherwise. It should not change any state (variable
values, etc.) within @command{gawk}.

@item awk_bool_t (*take_control_of)(awk_output_buf_t *outbuf);
The function pointed to by this field is called when @command{gawk}
decides to let the output wrapper take control of the file. It should
fill in appropriate members of the @code{awk_output_buf_t} structure,
as described next, and return true if successful, false otherwise.

@item awk_const struct awk_output_wrapper *awk_const next;
This is for use by @command{gawk};
therefore it is marked @code{awk_const} so that the extension cannot
modify it.
@end table

The @code{awk_output_buf_t} structure looks like this:

@example
typedef struct awk_output_buf @{
    const char *name;       /* name of output file */
    const char *mode;       /* mode argument to fopen */
    FILE *fp;               /* stdio file pointer */
    awk_bool_t redirected;  /* true if a wrapper is active */
    void *opaque;           /* for use by output wrapper */
    size_t (*gawk_fwrite)(const void *buf, size_t size, size_t count,
                          FILE *fp, void *opaque);
    int (*gawk_fflush)(FILE *fp, void *opaque);
    int (*gawk_ferror)(FILE *fp, void *opaque);
    int (*gawk_fclose)(FILE *fp, void *opaque);
@} awk_output_buf_t;
@end example

Here too, your extension will define @code{@var{XXX}_can_take_file()}
and @code{@var{XXX}_take_control_of()} functions that examine and update
data members in the @code{awk_output_buf_t}.
The data members are as follows:

@table @code
@item const char *name;
The name of the output file.

@item const char *mode;
The mode string (as would be used in the second argument to @code{fopen()})
with which the file was opened.

@item FILE *fp;
The @code{FILE} pointer from @code{<stdio.h>}. @command{gawk} opens the file
before attempting to find an output wrapper.

@item awk_bool_t redirected;
This field must be set to true by the @code{@var{XXX}_take_control_of()} function.

@item void *opaque;
This pointer is opaque to @command{gawk}. The extension should use it to store
a pointer to any private data associated with the file.

@item size_t (*gawk_fwrite)(const void *buf, size_t size, size_t count,
@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ FILE *fp, void *opaque);
@itemx int (*gawk_fflush)(FILE *fp, void *opaque);
@itemx int (*gawk_ferror)(FILE *fp, void *opaque);
@itemx int (*gawk_fclose)(FILE *fp, void *opaque);
These pointers should be set to point to functions that perform
the same function as the corresponding @code{<stdio.h>} functions, if appropriate.
@command{gawk} uses these function pointers for all output.
@command{gawk} initializes the pointers to point to internal ``pass-through''
functions that just call the regular @code{<stdio.h>} functions, so an
extension only needs to redefine those functions that are appropriate for
what it does.
@end table

The @code{@var{XXX}_can_take_file()} function should make a decision based
upon the @code{name} and @code{mode} fields, and any additional state
(such as @command{awk} variable values) that is appropriate.

When @command{gawk} calls @code{@var{XXX}_take_control_of()}, that function should fill
in the other fields as appropriate, except for @code{fp}, which it should just
use normally.

You register your output wrapper with the following function:

@table @code
@item void register_output_wrapper(awk_output_wrapper_t *output_wrapper);
Register the output wrapper pointed to by @code{output_wrapper} with
@command{gawk}.
@end table

@node Two-way processors
@subsubsection Customized Two-way Processors
@cindex customized two-way processor

A @dfn{two-way processor} combines an input parser and an output wrapper for
two-way I/O with the @samp{|&} operator (@pxref{Redirection}). It makes identical
use of the @code{awk_input_buf_t} and @code{awk_output_buf_t} structures
as described earlier.

A two-way processor is represented by the following structure:

@example
typedef struct awk_two_way_processor @{
    const char *name;   /* name of the two-way processor */
    awk_bool_t (*can_take_two_way)(const char *name);
    awk_bool_t (*take_control_of)(const char *name,
                                  awk_input_buf_t *inbuf,
                                  awk_output_buf_t *outbuf);
    awk_const struct awk_two_way_processor *awk_const next;  /* for gawk */
@} awk_two_way_processor_t;
@end example

The fields are as follows:

@table @code
@item const char *name;
The name of the two-way processor.

@item awk_bool_t (*can_take_two_way)(const char *name);
The function pointed to by this field should return true if it wants to take over two-way I/O for this @value{FN}.
It should not change any state (variable
values, etc.) within @command{gawk}.

@item awk_bool_t (*take_control_of)(const char *name,
@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ awk_input_buf_t *inbuf,
@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ awk_output_buf_t *outbuf);
The function pointed to by this field should fill in the @code{awk_input_buf_t} and
@code{awk_output_buf_t} structures pointed to by @code{inbuf} and
@code{outbuf}, respectively. These structures were described earlier.
35442 35443@item awk_const struct awk_two_way_processor *awk_const next; 35444This is for use by @command{gawk}; 35445therefore it is marked @code{awk_const} so that the extension cannot 35446modify it. 35447@end table 35448 35449As with the input parser and output wrapper, you provide 35450``yes I can take this'' and ``take over for this'' functions, 35451@code{@var{XXX}_can_take_two_way()} and @code{@var{XXX}_take_control_of()}. 35452 35453You register your two-way processor with the following function: 35454 35455@table @code 35456@item void register_two_way_processor(awk_two_way_processor_t *two_way_processor); 35457Register the two-way processor pointed to by @code{two_way_processor} with 35458@command{gawk}. 35459@end table 35460 35461@node Printing Messages 35462@subsection Printing Messages 35463@cindex printing @subentry messages from extensions 35464@cindex messages from extensions 35465 35466You can print different kinds of messages from your 35467extension, as described here. Note that for these functions, 35468you must pass in the extension ID received from @command{gawk} 35469when the extension was loaded:@footnote{Because the API uses only ISO C 90 35470features, it cannot make use of the ISO C 99 variadic macro feature to hide 35471that parameter. More's the pity.} 35472 35473@table @code 35474@item void fatal(awk_ext_id_t id, const char *format, ...); 35475Print a message and then cause @command{gawk} to exit immediately. 35476 35477@item void nonfatal(awk_ext_id_t id, const char *format, ...); 35478Print a nonfatal error message. 35479 35480@item void warning(awk_ext_id_t id, const char *format, ...); 35481Print a warning message. 35482 35483@item void lintwarn(awk_ext_id_t id, const char *format, ...); 35484Print a ``lint warning.'' Normally this is the same as printing a 35485warning message, but if @command{gawk} was invoked with @samp{--lint=fatal}, 35486then lint warnings become fatal error messages.
35487@end table 35488 35489All of these functions are otherwise like the C @code{printf()} 35490family of functions, where the @code{format} parameter is a string 35491with literal characters and formatting codes intermixed. 35492 35493@node Updating @code{ERRNO} 35494@subsection Updating @code{ERRNO} 35495 35496The following functions allow you to update the @code{ERRNO} 35497variable: 35498 35499@table @code 35500@item void update_ERRNO_int(int errno_val); 35501Set @code{ERRNO} to the string equivalent of the error code 35502in @code{errno_val}. The value should be one of the defined 35503error codes in @code{<errno.h>}, and @command{gawk} turns it 35504into a (possibly translated) string using the C @code{strerror()} function. 35505 35506@item void update_ERRNO_string(const char *string); 35507Set @code{ERRNO} directly to the string value of @code{string}. 35508@command{gawk} makes a copy of the value of @code{string}. 35509 35510@item void unset_ERRNO(void); 35511Unset @code{ERRNO}. 35512@end table 35513 35514@node Requesting Values 35515@subsection Requesting Values 35516 35517All of the functions that return values from @command{gawk} 35518work in the same way. You pass in an @code{awk_valtype_t} value 35519to indicate what kind of value you expect. If the actual value 35520matches what you requested, the function returns true and fills 35521in the @code{awk_value_t} result. 35522Otherwise, the function returns false, and the @code{val_type} 35523member indicates the type of the actual value. You may then 35524print an error message or reissue the request for the actual 35525value type, as appropriate. This behavior is summarized in 35526@ref{table-value-types-returned}.
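
The request-and-check pattern just described can be sketched as follows.
This is only an illustration under stated assumptions:
@code{fetch_number()} is a hypothetical helper, not part of the API;
@code{ext_id} is assumed to be the extension ID that was saved when the
extension was loaded; and presetting @code{val_type} is a defensive
choice made for this sketch, not an API requirement.

@example
/* fetch_number --- hypothetical helper: get a variable as a number */

static awk_bool_t
fetch_number(const char *name, double *out)
@{
    awk_value_t val;

    /* assumption: preset the type in case the lookup fails early */
    val.val_type = AWK_UNDEFINED;

    if (sym_lookup(name, AWK_NUMBER, & val)) @{
        *out = val.num_value;
        return awk_true;
    @}

    /* the request failed; val_type reports what is really there */
    if (val.val_type == AWK_ARRAY)
        warning(ext_id, "%s is an array, not a number", name);
    else
        warning(ext_id, "%s has no numeric value", name);

    return awk_false;
@}
@end example

@noindent
A caller that can also handle arrays could reissue the request with
@code{AWK_ARRAY} instead of printing a warning when that is the type reported.
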
35527 35528@float Table,table-value-types-returned 35529@caption{API value types returned} 35530@docbook 35531<informaltable> 35532<tgroup cols="8"> 35533 <colspec colname="c1"/> 35534 <colspec colname="c2"/> 35535 <colspec colname="c3"/> 35536 <colspec colname="c4"/> 35537 <colspec colname="c5"/> 35538 <colspec colname="c6"/> 35539 <colspec colname="c7"/> 35540 <colspec colname="c8"/> 35541 <spanspec spanname="hspan" namest="c3" nameend="c8" align="center"/> 35542 <thead> 35543 <row><entry></entry><entry spanname="hspan"><para>Type of Actual Value</para></entry></row> 35544 <row> 35545 <entry></entry> 35546 <entry></entry> 35547 <entry><para>String</para></entry> 35548 <entry><para>Strnum</para></entry> 35549 <entry><para>Number</para></entry> 35550 <entry><para>Regex</para></entry> 35551 <entry><para>Array</para></entry> 35552 <entry><para>Undefined</para></entry> 35553 </row> 35554 </thead> 35555 <tbody> 35556 <row> 35557 <entry></entry> 35558 <entry><para><emphasis role="bold">String</emphasis></para></entry> 35559 <entry><para>String</para></entry> 35560 <entry><para>String</para></entry> 35561 <entry><para>String</para></entry> 35562 <entry><para>String</para></entry> 35563 <entry><para>false</para></entry> 35564 <entry><para>false</para></entry> 35565 </row> 35566 <row> 35567 <entry></entry> 35568 <entry><para><emphasis role="bold">Strnum</emphasis></para></entry> 35569 <entry><para>false</para></entry> 35570 <entry><para>Strnum</para></entry> 35571 <entry><para>Strnum</para></entry> 35572 <entry><para>false</para></entry> 35573 <entry><para>false</para></entry> 35574 <entry><para>false</para></entry> 35575 </row> 35576 <row> 35577 <entry></entry> 35578 <entry><para><emphasis role="bold">Number</emphasis></para></entry> 35579 <entry><para>Number</para></entry> 35580 <entry><para>Number</para></entry> 35581 <entry><para>Number</para></entry> 35582 <entry><para>false</para></entry> 35583 <entry><para>false</para></entry> 35584 <entry><para>false</para></entry> 
35585 </row> 35586 <row> 35587 <entry><para><emphasis role="bold">Type</emphasis></para></entry> 35588 <entry><para><emphasis role="bold">Regex</emphasis></para></entry> 35589 <entry><para>false</para></entry> 35590 <entry><para>false</para></entry> 35591 <entry><para>false</para></entry> 35592 <entry><para>Regex</para></entry> 35593 <entry><para>false</para></entry> 35594 <entry><para>false</para></entry> 35595 </row> 35596 <row> 35597 <entry><para><emphasis role="bold">Requested</emphasis></para></entry> 35598 <entry><para><emphasis role="bold">Array</emphasis></para></entry> 35599 <entry><para>false</para></entry> 35600 <entry><para>false</para></entry> 35601 <entry><para>false</para></entry> 35602 <entry><para>false</para></entry> 35603 <entry><para>Array</para></entry> 35604 <entry><para>false</para></entry> 35605 </row> 35606 <row> 35607 <entry></entry> 35608 <entry><para><emphasis role="bold">Scalar</emphasis></para></entry> 35609 <entry><para>Scalar</para></entry> 35610 <entry><para>Scalar</para></entry> 35611 <entry><para>Scalar</para></entry> 35612 <entry><para>Scalar</para></entry> 35613 <entry><para>false</para></entry> 35614 <entry><para>false</para></entry> 35615 </row> 35616 <row> 35617 <entry></entry> 35618 <entry><para><emphasis role="bold">Undefined</emphasis></para></entry> 35619 <entry><para>String</para></entry> 35620 <entry><para>Strnum</para></entry> 35621 <entry><para>Number</para></entry> 35622 <entry><para>Regex</para></entry> 35623 <entry><para>Array</para></entry> 35624 <entry><para>Undefined</para></entry> 35625 </row> 35626 <row> 35627 <entry></entry> 35628 <entry><para><emphasis role="bold">Value cookie</emphasis></para></entry> 35629 <entry><para>false</para></entry> 35630 <entry><para>false</para></entry> 35631 <entry><para>false</para></entry> 35632 <entry><para>false</para></entry> 35633 <entry><para>false</para></entry> 35634 <entry><para>false</para></entry> 35635 </row> 35636 </tbody> 35637</tgroup> 35638</informaltable>
35639@end docbook 35640 35641@ifnotplaintext 35642@ifnotdocbook 35643@multitable @columnfractions .50 .50 35644@headitem @tab Type of Actual Value 35645@end multitable 35646@c 10/2014: Thanks to Karl Berry for this bit to reduce the space: 35647@tex 35648\vglue-1.1\baselineskip 35649@end tex 35650@c @multitable @columnfractions .166 .166 .198 .15 .15 .166 35651@multitable {Requested} {Undefined} {Number} {Number} {Scalar} {Regex} {Array} {Undefined} 35652@headitem @tab @tab String @tab Strnum @tab Number @tab Regex @tab Array @tab Undefined 35653@item @tab @b{String} @tab String @tab String @tab String @tab String @tab false @tab false 35654@item @tab @b{Strnum} @tab false @tab Strnum @tab Strnum @tab false @tab false @tab false 35655@item @tab @b{Number} @tab Number @tab Number @tab Number @tab false @tab false @tab false 35656@item @b{Type} @tab @b{Regex} @tab false @tab false @tab false @tab Regex @tab false @tab false 35657@item @b{Requested} @tab @b{Array} @tab false @tab false @tab false @tab false @tab Array @tab false 35658@item @tab @b{Scalar} @tab Scalar @tab Scalar @tab Scalar @tab Scalar @tab false @tab false 35659@item @tab @b{Undefined} @tab String @tab Strnum @tab Number @tab Regex @tab Array @tab Undefined 35660@item @tab @b{Value cookie} @tab false @tab false @tab false @tab false @tab false @tab false 35661@end multitable 35662@end ifnotdocbook 35663@end ifnotplaintext 35664@ifplaintext 35665@verbatim 35666 +-------------------------------------------------------+ 35667 | Type of Actual Value: | 35668 +--------+--------+--------+--------+-------+-----------+ 35669 | String | Strnum | Number | Regex | Array | Undefined | 35670+-----------+-----------+--------+--------+--------+--------+-------+-----------+ 35671| | String | String | String | String | String | false | false | 35672| +-----------+--------+--------+--------+--------+-------+-----------+ 35673| | Strnum | false | Strnum | Strnum | false | false | false | 35674| 
+-----------+--------+--------+--------+--------+-------+-----------+ 35675| | Number | Number | Number | Number | false | false | false | 35676| +-----------+--------+--------+--------+--------+-------+-----------+ 35677| | Regex | false | false | false | Regex | false | false | 35678| Type +-----------+--------+--------+--------+--------+-------+-----------+ 35679| Requested | Array | false | false | false | false | Array | false | 35680| +-----------+--------+--------+--------+--------+-------+-----------+ 35681| | Scalar | Scalar | Scalar | Scalar | Scalar | false | false | 35682| +-----------+--------+--------+--------+--------+-------+-----------+ 35683| | Undefined | String | Strnum | Number | Regex | Array | Undefined | 35684| +-----------+--------+--------+--------+--------+-------+-----------+ 35685| | Value | false | false | false | false | false | false | 35686| | Cookie | | | | | | | 35687+-----------+-----------+--------+--------+--------+--------+-------+-----------+ 35688@end verbatim 35689@end ifplaintext 35690@end float 35691 35692@node Accessing Parameters 35693@subsection Accessing and Updating Parameters 35694 35695Two functions give you access to the arguments (parameters) 35696passed to your extension function. They are: 35697 35698@table @code 35699@item awk_bool_t get_argument(size_t count, 35700@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ awk_valtype_t wanted, 35701@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ awk_value_t *result); 35702Fill in the @code{awk_value_t} structure pointed to by @code{result} 35703with the @code{count}th argument. Return true if the actual 35704type matches @code{wanted}, and false otherwise. In the latter 35705case, @code{result@w{->}val_type} indicates the actual type 35706(@pxref{table-value-types-returned}). Counts are zero-based---the first 35707argument is numbered zero, the second one, and so on. @code{wanted} 35708indicates the type of value expected. 
35709 35710@item awk_bool_t set_argument(size_t count, awk_array_t array); 35711Convert a parameter that was undefined into an array; this provides 35712call by reference for arrays. Return false if @code{count} is too big, 35713or if the argument's type is not undefined. @xref{Array Manipulation} 35714for more information on creating arrays. 35715@end table 35716 35717@node Symbol Table Access 35718@subsection Symbol Table Access 35719@cindex accessing global variables from extensions 35720 35721Two sets of routines provide access to global variables, and one set 35722allows you to create and release cached values. 35723 35724@menu 35725* Symbol table by name:: Accessing variables by name. 35726* Symbol table by cookie:: Accessing variables by ``cookie''. 35727* Cached values:: Creating and using cached values. 35728@end menu 35729 35730@node Symbol table by name 35731@subsubsection Variable Access and Update by Name 35732 35733The following routines provide the ability to access and update 35734global @command{awk}-level variables by name. In compiler terminology, 35735identifiers of different kinds are termed @dfn{symbols}, thus the ``sym'' 35736in the routines' names. The data structure that stores information 35737about symbols is termed a @dfn{symbol table}. 35738The functions are as follows: 35739 35740@table @code 35741@item awk_bool_t sym_lookup(const char *name, 35742@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ awk_valtype_t wanted, 35743@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ awk_value_t *result); 35744Fill in the @code{awk_value_t} structure pointed to by @code{result} 35745with the value of the variable named by the string @code{name}, which is 35746a regular C string. @code{wanted} indicates the type of value expected. 35747Return true if the actual type matches @code{wanted}, and false otherwise. 35748In the latter case, @code{result->val_type} indicates the actual type 35749(@pxref{table-value-types-returned}). 
35750 35751@item awk_bool_t sym_lookup_ns(const char *name_space, 35752@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ const char *name, 35753@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ awk_valtype_t wanted, 35754@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ awk_value_t *result); 35755This is like @code{sym_lookup()}, but the @code{name_space} parameter allows you 35756to specify which namespace @code{name} is part of. @code{name_space} cannot be 35757@code{NULL}. If it is @code{""} or @code{"awk"}, then @code{name} is searched 35758for in the default @code{awk} namespace. 35759 35760Note that @code{namespace} is a C++ keyword. For interoperability with C++, 35761you should avoid using that identifier in C code. 35762 35763@item awk_bool_t sym_update(const char *name, awk_value_t *value); 35764Update the variable named by the string @code{name}, which is a regular 35765C string. The variable is added to @command{gawk}'s symbol table 35766if it is not there. Return true if everything worked, and false otherwise. 35767 35768Changing types (scalar to array or vice versa) of an existing variable 35769is @emph{not} allowed, nor may this routine be used to update an array. 35770This routine cannot be used to update any of the predefined 35771variables (such as @code{ARGC} or @code{NF}). 35772 35773@item awk_bool_t sym_update_ns(const char *name_space, const char *name, awk_value_t *value); 35774This is like @code{sym_update()}, but the @code{name_space} parameter allows you 35775to specify which namespace @code{name} is part of. @code{name_space} cannot be 35776@code{NULL}. If it is @code{""} or @code{"awk"}, then @code{name} is searched 35777for in the default @code{awk} namespace. 35778@end table 35779 35780An extension can look up the value of @command{gawk}'s special variables. 35781However, with the exception of the @code{PROCINFO} array, an extension 35782cannot change any of those variables.
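
As a brief illustration of the namespace-aware update routine, the
following sketch creates or updates a variable that @command{awk} code
would see as @code{myext::counter}. The names @code{set_ns_counter()},
@code{myext}, and @code{counter} are hypothetical and chosen only for
this example, and @code{ext_id} is assumed to be the extension ID saved
when the extension was loaded.

@example
/* set_ns_counter --- hypothetical helper: set myext::counter */

static awk_bool_t
set_ns_counter(double n)
@{
    awk_value_t val;

    /* namespace and variable name are passed as separate C strings */
    if (! sym_update_ns("myext", "counter", make_number(n, & val))) @{
        warning(ext_id, "could not update myext::counter");
        return awk_false;
    @}

    return awk_true;
@}
@end example

@noindent
Checking the return value matters here, because the update fails if the
variable already exists as an array.
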
35783 35784When searching for or updating variables outside the @code{awk} namespace 35785(@pxref{Namespaces}), function and variable names must be simple 35786identifiers.@footnote{Allowing both namespace plus identifier and 35787@code{foo::bar} would have been too confusing to document, and to code 35788and test.} In addition, namespace names and variable and function names 35789must follow the rules given in @ref{Naming Rules}. 35790 35791@node Symbol table by cookie 35792@subsubsection Variable Access and Update by Cookie 35793 35794A @dfn{scalar cookie} is an opaque handle that provides access 35795to a global variable or array. It is an optimization that 35796avoids looking up variables in @command{gawk}'s symbol table every time 35797access is needed. This was discussed earlier, in @ref{General Data Types}. 35798 35799@need 1500 35800The following functions let you work with scalar cookies: 35801 35802@table @code 35803@item awk_bool_t sym_lookup_scalar(awk_scalar_t cookie, 35804@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ awk_valtype_t wanted, 35805@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ awk_value_t *result); 35806Retrieve the current value of a scalar cookie. 35807Once you have obtained a scalar cookie using @code{sym_lookup()}, you can 35808use this function to get its value more efficiently. 35809Return false if the value cannot be retrieved. 35810 35811@item awk_bool_t sym_update_scalar(awk_scalar_t cookie, awk_value_t *value); 35812Update the value associated with a scalar cookie. Return false if 35813the new value is not of type @code{AWK_STRING}, @code{AWK_STRNUM}, @code{AWK_REGEX}, or @code{AWK_NUMBER}. 35814Here too, the predefined variables may not be updated. 35815@end table 35816 35817It is not obvious at first glance how to work with scalar cookies or 35818what their @i{raison d'@^etre} really is. 
In theory, the @code{sym_lookup()} 35819and @code{sym_update()} routines are all you really need to work with 35820variables. For example, you might have code that looks up the value of 35821a variable, evaluates a condition, and then possibly changes the value 35822of the variable based on the result of that evaluation, like so: 35823 35824@example 35825/* do_magic --- do something really great */ 35826 35827static awk_value_t * 35828do_magic(int nargs, awk_value_t *result) 35829@{ 35830 awk_value_t value; 35831 35832 if ( sym_lookup("MAGIC_VAR", AWK_NUMBER, & value) 35833 && some_condition(value.num_value)) @{ 35834 value.num_value += 42; 35835 sym_update("MAGIC_VAR", & value); 35836 @} 35837 35838 return make_number(0.0, result); 35839@} 35840@end example 35841 35842@noindent 35843This code looks (and is) simple and straightforward. So what's the problem? 35844 35845Well, consider what happens if @command{awk}-level code associated 35846with your extension calls the @code{magic()} function (implemented in 35847C by @code{do_magic()}), once per record, while processing hundreds 35848of thousands or millions of records. The @code{MAGIC_VAR} variable is 35849looked up in the symbol table once or twice per function call! 35850 35851The symbol table lookup is really pure overhead; it is considerably 35852more efficient to get a cookie that represents the variable, and use 35853that to get the variable's value and update it as needed.@footnote{The 35854difference is measurable and quite real. Trust us.} 35855 35856Thus, the way to use cookies is as follows. First, install 35857your extension's variable in @command{gawk}'s symbol table using 35858@code{sym_update()}, as usual. 
Then get a scalar cookie for the variable 35859using @code{sym_lookup()}: 35860 35861@example 35862@group 35863static awk_scalar_t magic_var_cookie; /* cookie for MAGIC_VAR */ 35864 35865static void 35866my_extension_init() 35867@{ 35868 awk_value_t value; 35869@end group 35870 35871 /* install initial value */ 35872 sym_update("MAGIC_VAR", make_number(42.0, & value)); 35873 35874 /* get the cookie */ 35875 sym_lookup("MAGIC_VAR", AWK_SCALAR, & value); 35876 35877 /* save the cookie */ 35878 magic_var_cookie = value.scalar_cookie; 35879 @dots{} 35880@} 35881@end example 35882 35883Next, use the routines in this @value{SECTION} for retrieving and updating 35884the value through the cookie. Thus, @code{do_magic()} now becomes 35885something like this: 35886 35887@example 35888/* do_magic --- do something really great */ 35889 35890static awk_value_t * 35891do_magic(int nargs, awk_value_t *result) 35892@{ 35893 awk_value_t value; 35894 35895 if ( sym_lookup_scalar(magic_var_cookie, AWK_NUMBER, & value) 35896 && some_condition(value.num_value)) @{ 35897 value.num_value += 42; 35898 sym_update_scalar(magic_var_cookie, & value); 35899 @} 35900 @dots{} 35901 35902 return make_number(0.0, result); 35903@} 35904@end example 35905 35906@quotation NOTE 35907The previous code omitted error checking for 35908presentation purposes. Your extension code should be more robust 35909and carefully check the return values from the API functions. 35910@end quotation 35911 35912@node Cached values 35913@subsubsection Creating and Using Cached Values 35914 35915The routines in this @value{SECTION} allow you to create and release 35916cached values. Like scalar cookies, in theory, cached values 35917are not necessary. You can create numbers and strings using 35918the functions in @ref{Constructor Functions}. You can then 35919assign those values to variables using @code{sym_update()} 35920or @code{sym_update_scalar()}, as you like. 
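
For contrast with what follows, here is a sketch of the approach that
does not use cached values: each variable receives its own freshly
created string. The helper @code{set_all_uncached()} and the
@code{VAR}@var{n} naming scheme are hypothetical, and
@code{make_const_string()} is assumed to be available from the
constructor functions mentioned above.

@example
/* set_all_uncached --- hypothetical helper: one copy per variable */

static awk_bool_t
set_all_uncached(const char *text, size_t len, int count)
@{
    awk_value_t value;
    char name[32];
    int i;

    for (i = 1; i <= count; i++) @{
        /* snprintf() needs <stdio.h> */
        snprintf(name, sizeof(name), "VAR%d", i);

        /* make_const_string() copies text into new storage each time */
        if (! sym_update(name, make_const_string(text, len, & value)))
            return awk_false;
    @}

    return awk_true;
@}
@end example
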
35921 35922However, you can understand the point of cached values if you remember that 35923@emph{every} string value's storage @emph{must} come from @code{gawk_malloc()}, 35924@code{gawk_calloc()}, or @code{gawk_realloc()}. 35925If you have 20 variables, all of which have the same string value, you 35926must create 20 identical copies of the string.@footnote{Numeric values 35927are clearly less problematic, requiring only a C @code{double} to store. 35928But of course, GMP and MPFR values @emph{do} take up more memory.} 35929 35930It is clearly more efficient, if possible, to create a value once, and 35931then tell @command{gawk} to reuse the value for multiple variables. That 35932is what the routines in this @value{SECTION} let you do. The functions are as follows: 35933 35934@table @code 35935@item awk_bool_t create_value(awk_value_t *value, awk_value_cookie_t *result); 35936Create a cached string or numeric value from @code{value} for 35937efficient later assignment. Only values of type @code{AWK_NUMBER}, @code{AWK_REGEX}, @code{AWK_STRNUM}, 35938and @code{AWK_STRING} are allowed. Any other type is rejected. 35939@code{AWK_UNDEFINED} could be allowed, but doing so would result in 35940inferior performance. 35941 35942@item awk_bool_t release_value(awk_value_cookie_t vc); 35943Release the memory associated with a value cookie obtained 35944from @code{create_value()}. 35945@end table 35946 35947You use value cookies in a fashion similar to the way you use scalar cookies. 
35948In the extension initialization routine, you create the value cookie: 35949 35950@example 35951static awk_value_cookie_t answer_cookie; /* static value cookie */ 35952 35953static void 35954my_extension_init() 35955@{ 35956 awk_value_t value; 35957 char *long_string; 35958 size_t long_string_len; 35959 35960 /* code from earlier */ 35961 @dots{} 35962 /* @dots{} fill in long_string and long_string_len @dots{} */ 35963 make_malloced_string(long_string, long_string_len, & value); 35964 create_value(& value, & answer_cookie); /* create cookie */ 35965 @dots{} 35966@} 35967@end example 35968 35969Once the value is created, you can use it as the value of any number 35970of variables: 35971 35972@example 35973static awk_value_t * 35974do_magic(int nargs, awk_value_t *result) 35975@{ 35976 awk_value_t new_value; 35977 35978 @dots{} /* as earlier */ 35979 35980 new_value.val_type = AWK_VALUE_COOKIE; 35981 new_value.value_cookie = answer_cookie; 35982 sym_update("VAR1", & new_value); 35983 sym_update("VAR2", & new_value); 35984 @dots{} 35985 sym_update("VAR100", & new_value); 35986 @dots{} 35987@} 35988@end example 35989 35990@noindent 35991Using value cookies in this way saves considerable storage, as all of 35992@code{VAR1} through @code{VAR100} share the same value. 35993 35994You might be wondering, ``Is this sharing problematic? 35995What happens if @command{awk} code assigns a new value to @code{VAR1}; 35996are all the others changed too?'' 35997 35998That's a great question. The answer is that no, it's not a problem. 35999Internally, @command{gawk} uses @dfn{reference-counted strings}. This means 36000that many variables can share the same string value, and @command{gawk} 36001keeps track of the usage. When a variable's value changes, @command{gawk} 36002simply decrements the reference count on the old value and updates 36003the variable to use the new value.
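
Because the sharing is safe, assigning one cached value to many
variables can be wrapped in a small helper that also checks each
result. The following is a sketch only: @code{assign_cached()} is a
hypothetical function, and the list of variable names is assumed to be
supplied by the caller.

@example
/* assign_cached --- hypothetical helper: share one cached value */

static awk_bool_t
assign_cached(awk_value_cookie_t vc, const char **names, size_t n)
@{
    awk_value_t value;
    size_t i;

    /* describe the cached value once; it is reused for every update */
    value.val_type = AWK_VALUE_COOKIE;
    value.value_cookie = vc;

    for (i = 0; i < n; i++)
        if (! sym_update(names[i], & value))
            return awk_false;    /* stop at the first failure */

    return awk_true;
@}
@end example

@noindent
A call such as @code{assign_cached(answer_cookie, names, 100)} would
then replace the long run of individual @code{sym_update()} calls.
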
36004 36005Finally, as part of your cleanup action (@pxref{Exit Callback Functions}) 36006you should release any cached values that you created, using 36007@code{release_value()}. 36008 36009@node Array Manipulation 36010@subsection Array Manipulation 36011@cindex array manipulation in extensions 36012@cindex extensions @subentry loadable @subentry array manipulation in 36013 36014The primary data structure@footnote{OK, the only data structure.} in @command{awk} 36015is the associative array (@pxref{Arrays}). 36016Extensions need to be able to manipulate @command{awk} arrays. 36017The API provides a number of data structures for working with arrays, 36018functions for working with individual elements, and functions for 36019working with arrays as a whole. This includes the ability to 36020``flatten'' an array so that it is easy for C code to traverse 36021every element in an array. The array data structures integrate 36022nicely with the data structures for values to make it easy to 36023both work with and create true arrays of arrays (@pxref{General Data Types}). 36024 36025@menu 36026* Array Data Types:: Data types for working with arrays. 36027* Array Functions:: Functions for working with arrays. 36028* Flattening Arrays:: How to flatten arrays. 36029* Creating Arrays:: How to create and populate arrays. 36030@end menu 36031 36032@node Array Data Types 36033@subsubsection Array Data Types 36034 36035The data types associated with arrays are as follows: 36036 36037@table @code 36038@item typedef void *awk_array_t; 36039If you request the value of an array variable, you get back an 36040@code{awk_array_t} value. This value is opaque@footnote{It is also 36041a ``cookie,'' but the @command{gawk} developers did not wish to overuse this 36042term.} to the extension; it uniquely identifies the array but can 36043only be used by passing it into API functions or receiving it from API 36044functions. 
This is very similar to the way @samp{FILE *} values are used 36045with the @code{<stdio.h>} library routines. 36046 36047@item typedef struct awk_element @{ 36048@itemx @ @ @ @ /* convenience linked list pointer, not used by gawk */ 36049@itemx @ @ @ @ struct awk_element *next; 36050@itemx @ @ @ @ enum @{ 36051@itemx @ @ @ @ @ @ @ @ AWK_ELEMENT_DEFAULT = 0,@ @ /* set by gawk */ 36052@itemx @ @ @ @ @ @ @ @ AWK_ELEMENT_DELETE = 1@ @ @ @ /* set by extension */ 36053@itemx @ @ @ @ @} flags; 36054@itemx @ @ @ @ awk_value_t index; 36055@itemx @ @ @ @ awk_value_t value; 36056@itemx @} awk_element_t; 36057The @code{awk_element_t} is a ``flattened'' 36058array element. @command{gawk} produces an array of these 36059inside the @code{awk_flat_array_t} (see the next item). 36060Individual elements may be marked for deletion. New elements must be added 36061individually, one at a time, using the separate API for that purpose. 36062The fields are as follows: 36063 36064@c nested table 36065@table @code 36066@item struct awk_element *next; 36067This pointer is for the convenience of extension writers. It allows 36068an extension to create a linked list of new elements that can then be 36069added to an array in a loop that traverses the list. 36070 36071@item enum @{ @dots{} @} flags; 36072A set of flag values that convey information between the extension 36073and @command{gawk}. Currently there is only one: @code{AWK_ELEMENT_DELETE}. 36074Setting it causes @command{gawk} to delete the 36075element from the original array upon release of the flattened array. 36076 36077@item index 36078@itemx value 36079The index and value of the element, respectively. 36080@emph{All} memory pointed to by @code{index} and @code{value} belongs to @command{gawk}.
36081@end table 36082 36083@item typedef struct awk_flat_array @{ 36084@itemx @ @ @ @ awk_const void *awk_const opaque1;@ @ @ @ /* for use by gawk */ 36085@itemx @ @ @ @ awk_const void *awk_const opaque2;@ @ @ @ /* for use by gawk */ 36086@itemx @ @ @ @ awk_const size_t count;@ @ @ @ @ /* how many elements */ 36087@itemx @ @ @ @ awk_element_t elements[1];@ @ /* will be extended */ 36088@itemx @} awk_flat_array_t; 36089This is a flattened array. When an extension gets one of these 36090from @command{gawk}, the @code{elements} array is of actual 36091size @code{count}. 36092The @code{opaque1} and @code{opaque2} pointers are for use by @command{gawk}; 36093therefore they are marked @code{awk_const} so that the extension cannot 36094modify them. 36095@end table 36096 36097@node Array Functions 36098@subsubsection Array Functions 36099 36100The following functions relate to individual array elements: 36101 36102@table @code 36103@item awk_bool_t get_element_count(awk_array_t a_cookie, size_t *count); 36104For the array represented by @code{a_cookie}, place in @code{*count} 36105the number of elements it contains. A subarray counts as a single element. 36106Return false if there is an error. 36107 36108@item awk_bool_t get_array_element(awk_array_t a_cookie, 36109@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ const awk_value_t *const index, 36110@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ awk_valtype_t wanted, 36111@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ awk_value_t *result); 36112For the array represented by @code{a_cookie}, return in @code{*result} 36113the value of the element whose index is @code{index}. 36114@code{wanted} specifies the type of value you wish to retrieve. 36115Return false if @code{wanted} does not match the actual type or if 36116@code{index} is not in the array (@pxref{table-value-types-returned}). 
36117 36118The value for @code{index} can be numeric, in which case @command{gawk} 36119converts it to a string. Using nonintegral values is possible, but 36120requires that you understand how such values are converted to strings 36121(@pxref{Conversion}); thus, using integral values is safest. 36122 36123As with @emph{all} strings passed into @command{gawk} from an extension, 36124the string value of @code{index} must come from @code{gawk_malloc()}, 36125@code{gawk_calloc()}, or @code{gawk_realloc()}, and 36126@command{gawk} releases the storage. 36127 36128@item awk_bool_t set_array_element(awk_array_t a_cookie, 36129@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ const@ awk_value_t *const index, 36130@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ const@ awk_value_t *const value); 36131In the array represented by @code{a_cookie}, create or modify 36132the element whose index is given by @code{index}. 36133The @code{ARGV} and @code{ENVIRON} arrays may not be changed, 36134although the @code{PROCINFO} array can be. 36135 36136@item awk_bool_t set_array_element_by_elem(awk_array_t a_cookie, 36137@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ awk_element_t element); 36138Like @code{set_array_element()}, but take the @code{index} and @code{value} 36139from @code{element}. This is a convenience macro. 36140 36141@item awk_bool_t del_array_element(awk_array_t a_cookie, 36142@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ const awk_value_t* const index); 36143Remove the element with the given index from the array 36144represented by @code{a_cookie}. 36145Return true if the element was removed, or false if the element did 36146not exist in the array. 36147@end table 36148 36149The following functions relate to arrays as a whole: 36150 36151@table @code 36152@item awk_array_t create_array(void); 36153Create a new array to which elements may be added. 
@xref{Creating Arrays} for a discussion of how to
create a new array and add elements to it.

@item awk_bool_t clear_array(awk_array_t a_cookie);
Clear the array represented by @code{a_cookie}.
Return false if there was some kind of problem, true otherwise.
The array remains an array, but after calling this function, it
has no elements. This is equivalent to using the @code{delete}
statement (@pxref{Delete}).

@item awk_bool_t flatten_array_typed(awk_array_t a_cookie,
@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ awk_flat_array_t **data,
@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ awk_valtype_t index_type,
@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ awk_valtype_t value_type);
For the array represented by @code{a_cookie}, create an @code{awk_flat_array_t}
structure and fill it in with indices and values of the requested types.
Set the pointer whose address is passed as @code{data}
to point to this structure.
Return true upon success, or false otherwise.
@ifset FOR_PRINT
See the next @value{SECTION}
@end ifset
@ifclear FOR_PRINT
@xref{Flattening Arrays},
@end ifclear
for a discussion of how to
flatten an array and work with it.

@item awk_bool_t flatten_array(awk_array_t a_cookie, awk_flat_array_t **data);
For the array represented by @code{a_cookie}, create an @code{awk_flat_array_t}
structure and fill it in with @code{AWK_STRING} indices and
@code{AWK_UNDEFINED} values.
This is superseded by @code{flatten_array_typed()}.
It is provided as a macro, and remains for convenience and for source code
compatibility with the previous version of the API.

@item awk_bool_t release_flattened_array(awk_array_t a_cookie,
@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ awk_flat_array_t *data);
When done with a flattened array, release the storage using this function.
You must pass in both the original array cookie and the address of
the created @code{awk_flat_array_t} structure.
The function returns true upon success, false otherwise.
@end table

@node Flattening Arrays
@subsubsection Working With All The Elements of an Array

To @dfn{flatten} an array is to create a structure that
represents the full array in a fashion that makes it easy
for C code to traverse the entire array. Some of the code
in @file{extension/testext.c} does this, and also serves
as a nice example showing how to use the APIs.

We walk through that part of the code one step at a time.
First, the @command{gawk} script that drives the test extension:

@example
@@load "testext"
BEGIN @{
    n = split("blacky rusty sophie raincloud lucky", pets)
    printf("pets has %d elements\n", length(pets))
    ret = dump_array_and_delete("pets", "3")
    printf("dump_array_and_delete(pets) returned %d\n", ret)
    if ("3" in pets)
        printf("dump_array_and_delete() did NOT remove index \"3\"!\n")
    else
        printf("dump_array_and_delete() did remove index \"3\"!\n")
    print ""
@}
@end example

@noindent
This code creates an array with @code{split()} (@pxref{String Functions})
and then calls @code{dump_array_and_delete()}. That function looks up
the array whose name is passed as the first argument, and
deletes the element at the index passed in the second argument.
The @command{awk} code then prints the return value and checks if the element
was indeed deleted. Here is the C code that implements
@code{dump_array_and_delete()}.
It has been edited slightly for
presentation.

The first part declares variables, sets up the default
return value in @code{result}, and checks that the function
was called with the correct number of arguments:

@example
static awk_value_t *
dump_array_and_delete(int nargs, awk_value_t *result)
@{
    awk_value_t value, value2, value3;
    awk_flat_array_t *flat_array;
    size_t count;
    char *name;
    int i;

    assert(result != NULL);
    make_number(0.0, result);

    if (nargs != 2) @{
        printf("dump_array_and_delete: nargs not right "
               "(%d should be 2)\n", nargs);
        goto out;
    @}
@end example

The function then proceeds in steps, as follows. First, retrieve
the name of the array, passed as the first argument, followed by
the array itself. If either operation fails, print an
error message and return:

@example
    /* get argument named array as flat array and print it */
    if (get_argument(0, AWK_STRING, & value)) @{
        name = value.str_value.str;
        if (sym_lookup(name, AWK_ARRAY, & value2))
            printf("dump_array_and_delete: sym_lookup of %s passed\n",
                   name);
        else @{
            printf("dump_array_and_delete: sym_lookup of %s failed\n",
                   name);
            goto out;
        @}
    @} else @{
        printf("dump_array_and_delete: get_argument(0) failed\n");
        goto out;
    @}
@end example

For testing purposes and to make sure that the C code sees
the same number of elements as the @command{awk} code,
the second step is to get the count of elements in the array
and print it:

@example
    if (!
        get_element_count(value2.array_cookie, & count)) @{
        printf("dump_array_and_delete: get_element_count failed\n");
        goto out;
    @}

    printf("dump_array_and_delete: incoming size is %lu\n",
           (unsigned long) count);
@end example

The third step is to actually flatten the array, and then
to double-check that the count in the @code{awk_flat_array_t}
is the same as the count just retrieved:

@example
    if (! flatten_array_typed(value2.array_cookie, & flat_array,
                              AWK_STRING, AWK_UNDEFINED)) @{
        printf("dump_array_and_delete: could not flatten array\n");
        goto out;
    @}

    if (flat_array->count != count) @{
        printf("dump_array_and_delete: flat_array->count (%lu)"
               " != count (%lu)\n",
                (unsigned long) flat_array->count,
                (unsigned long) count);
        goto out;
    @}
@end example

The fourth step is to retrieve the index of the element
to be deleted, which was passed as the second argument.
Remember that argument counts passed to @code{get_argument()}
are zero-based, and thus the second argument is numbered one:

@example
    if (! get_argument(1, AWK_STRING, & value3)) @{
        printf("dump_array_and_delete: get_argument(1) failed\n");
        goto out;
    @}
@end example

The fifth step is where the ``real work'' is done. The function
loops over every element in the array, printing the index and
element values. In addition, upon finding the element with the
index that is supposed to be deleted, the function sets the
@code{AWK_ELEMENT_DELETE} bit in the @code{flags} field
of the element.
When the array is released, @command{gawk}
traverses the flattened array, and deletes any elements that
have this flag bit set:

@example
    for (i = 0; i < flat_array->count; i++) @{
        printf("\t%s[\"%.*s\"] = %s\n",
            name,
            (int) flat_array->elements[i].index.str_value.len,
            flat_array->elements[i].index.str_value.str,
            valrep2str(& flat_array->elements[i].value));

        if (strcmp(value3.str_value.str,
                   flat_array->elements[i].index.str_value.str) == 0) @{
            flat_array->elements[i].flags |= AWK_ELEMENT_DELETE;
            printf("dump_array_and_delete: marking element \"%s\" "
                   "for deletion\n",
                flat_array->elements[i].index.str_value.str);
        @}
    @}
@end example

The sixth step is to release the flattened array. This tells
@command{gawk} that the extension is no longer using the array,
and that it should delete any elements marked for deletion.
@command{gawk} also frees any storage that was allocated,
so you should not use the pointer (@code{flat_array} in this
code) once you have called @code{release_flattened_array()}:

@example
    if (!
        release_flattened_array(value2.array_cookie, flat_array)) @{
        printf("dump_array_and_delete: could not release flattened array\n");
        goto out;
    @}
@end example

Finally, because everything was successful, the function sets the
return value to success, and returns:

@example
@group
    make_number(1.0, result);
out:
    return result;
@}
@end group
@end example

Here is the output from running this part of the test:

@example
pets has 5 elements
dump_array_and_delete: sym_lookup of pets passed
dump_array_and_delete: incoming size is 5
        pets["1"] = "blacky"
        pets["2"] = "rusty"
        pets["3"] = "sophie"
dump_array_and_delete: marking element "3" for deletion
        pets["4"] = "raincloud"
        pets["5"] = "lucky"
dump_array_and_delete(pets) returned 1
dump_array_and_delete() did remove index "3"!
@end example

@node Creating Arrays
@subsubsection How To Create and Populate Arrays

Besides working with arrays created by @command{awk} code, you can
create arrays and populate them as you see fit, and then @command{awk}
code can access them and manipulate them.

There are two important points about creating arrays from extension code:

@itemize @value{BULLET}
@item
You must install a new array into @command{gawk}'s symbol
table immediately upon creating it. Once you have done so,
you can then populate the array.

@ignore
Strictly speaking, this is required only
for arrays that will have subarrays as elements; however it is
a good idea to always do this. This restriction may be relaxed
in a subsequent revision of the API.
@end ignore

Similarly, if installing a new array as a subarray of an existing array,
you must add the new array to its parent before adding any elements to it.

Thus, the correct way to build an array is to work ``top down.'' Create
the array, and immediately install it in @command{gawk}'s symbol table
using @code{sym_update()}, or install it as an element in a previously
existing array using @code{set_array_element()}. We show example code shortly.

@item
Due to @command{gawk} internals, after using @code{sym_update()} to install an array
into @command{gawk}, you have to retrieve the array cookie from the value
passed in to @code{sym_update()} before doing anything else with it, like so:

@example
awk_value_t val;
awk_array_t new_array;

new_array = create_array();
val.val_type = AWK_ARRAY;
val.array_cookie = new_array;

/* install array in the symbol table */
sym_update("array", & val);

new_array = val.array_cookie;    /* YOU MUST DO THIS */
@end example

If installing an array as a subarray, you must also retrieve the value
of the array cookie after the call to @code{set_array_element()}.
@end itemize

The following C code is a simple test extension to create an array
with two regular elements and with a subarray. The leading @code{#include}
directives and boilerplate variable declarations
(@pxref{Extension API Boilerplate})
are omitted for brevity.
The first step is to create a new array and then install it
in the symbol table:

@example
@ignore
#ifdef HAVE_CONFIG_H
#include <config.h>
#endif

#include <stdio.h>
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#include <sys/types.h>
#include <sys/stat.h>

#include "gawkapi.h"

static const gawk_api_t *api;   /* for convenience macros to work */
static awk_ext_id_t ext_id;
static const char *ext_version = "testarray extension: version 1.0";

int plugin_is_GPL_compatible;

@end ignore
/* create_new_array --- create a named array */

static void
create_new_array()
@{
    awk_array_t a_cookie;
    awk_array_t subarray;
    awk_value_t index, value;

    a_cookie = create_array();
    value.val_type = AWK_ARRAY;
    value.array_cookie = a_cookie;

    if (! sym_update("new_array", & value))
        printf("create_new_array: sym_update(\"new_array\") failed!\n");
    a_cookie = value.array_cookie;
@end example

@noindent
Note how @code{a_cookie} is reset from the @code{array_cookie} field in
the @code{value} structure.

The second step is to install two regular values into @code{new_array}:

@example
    (void) make_const_string("hello", 5, & index);
    (void) make_const_string("world", 5, & value);
    if (! set_array_element(a_cookie, & index, & value)) @{
        printf("create_new_array: set_array_element failed\n");
        return;
    @}

    (void) make_const_string("answer", 6, & index);
    (void) make_number(42.0, & value);
    if (!
        set_array_element(a_cookie, & index, & value)) @{
        printf("create_new_array: set_array_element failed\n");
        return;
    @}
@end example

The third step is to create the subarray and install it:

@example
    (void) make_const_string("subarray", 8, & index);
    subarray = create_array();
    value.val_type = AWK_ARRAY;
    value.array_cookie = subarray;
    if (! set_array_element(a_cookie, & index, & value)) @{
        printf("create_new_array: set_array_element failed\n");
        return;
    @}
    subarray = value.array_cookie;
@end example

The final step is to populate the subarray with its own element:

@example
    (void) make_const_string("foo", 3, & index);
    (void) make_const_string("bar", 3, & value);
    if (! set_array_element(subarray, & index, & value)) @{
        printf("create_new_array: set_array_element failed\n");
        return;
    @}
@}
@ignore
static awk_ext_func_t func_table[] = @{
    @{ NULL, NULL, 0 @}
@};

/* init_testarray --- additional initialization function */

static awk_bool_t init_testarray(void)
@{
    create_new_array();

    return awk_true;
@}

static awk_bool_t (*init_func)(void) = init_testarray;

dl_load_func(func_table, testarray, "")
@end ignore
@end example

Here is a sample script that loads the extension
and then dumps the array:

@example
@@load "subarray"

function dumparray(name, array,     i)
@{
    for (i in array)
        if (isarray(array[i]))
            dumparray(name "[\"" i "\"]", array[i])
        else
            printf("%s[\"%s\"] = %s\n", name, i, array[i])
@}

BEGIN @{
    dumparray("new_array", new_array);
@}
@end example

Here is the result of running the script:

@example
$ @kbd{AWKLIBPATH=$PWD gawk -f subarray.awk}
@print{} new_array["subarray"]["foo"] =
bar
@print{} new_array["hello"] = world
@print{} new_array["answer"] = 42
@end example

@noindent
(@xref{Finding Extensions} for more information on the
@env{AWKLIBPATH} environment variable.)

@node Redirection API
@subsection Accessing and Manipulating Redirections

The following function allows extensions to access and manipulate redirections.

@table @code
@item awk_bool_t get_file(const char *name,
@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ size_t name_len,
@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ const char *filetype,
@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ int fd,
@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ const awk_input_buf_t **ibufp,
@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ const awk_output_buf_t **obufp);
Look up file @code{name} in @command{gawk}'s internal redirection table.
If @code{name} is @code{NULL} or @code{name_len} is zero, return
data for the currently open input file corresponding to @code{FILENAME}.
(This does not access the @code{filetype} argument, so that may be undefined.)
If the file is not already open, attempt to open it.
The @code{filetype} argument must be zero-terminated and should be one of:

@table @code
@item ">"
A file opened for output.

@item ">>"
A file opened for append.

@item "<"
A file opened for input.

@item "|>"
A pipe opened for output.

@item "|<"
A pipe opened for input.

@item "|&"
A two-way coprocess.
@end table

On error, return @code{awk_false}. Otherwise, return
@code{awk_true}, and return additional information about the redirection
in the @code{ibufp} and @code{obufp} pointers.

For input redirections, the @code{*ibufp} value should be non-@code{NULL},
and @code{*obufp} should be @code{NULL}.
For output redirections,
the @code{*obufp} value should be non-@code{NULL}, and @code{*ibufp}
should be @code{NULL}. For two-way coprocesses, both values should
be non-@code{NULL}.

In the usual case, the extension is interested in @code{(*ibufp)->fd}
and/or @code{fileno((*obufp)->fp)}. If the file is not already
open, and the @code{fd} argument is nonnegative, @command{gawk}
will use that file descriptor instead of opening the file in the
usual way. If @code{fd} is nonnegative, but the file exists already,
@command{gawk} ignores @code{fd} and returns the existing file. It is
the caller's responsibility to notice that neither the @code{fd} in
the returned @code{awk_input_buf_t} nor the @code{fd} in the returned
@code{awk_output_buf_t} matches the requested value.

Note that supplying a file descriptor is currently @emph{not} supported
for pipes. However, supplying a file descriptor should work for input,
output, append, and two-way (coprocess) sockets. If @code{filetype}
is two-way, @command{gawk} assumes that it is a socket! Note that in
the two-way case, the input and output file descriptors may differ.
To check for success, you must check whether either matches.
@end table

It is anticipated that this API function will be used to implement I/O
multiplexing and a socket library.

@node Extension API Variables
@subsection API Variables

The API provides two sets of variables. The first provides information
about the version of the API (both with which the extension was compiled,
and with which @command{gawk} was compiled). The second provides
information about how @command{gawk} was invoked.

@menu
* Extension Versioning::          API Version information.
* Extension GMP/MPFR Versioning:: Version information about GMP and MPFR.
* Extension API Informational Variables:: Variables providing information about
                                  @command{gawk}'s invocation.
@end menu

@node Extension Versioning
@subsubsection API Version Constants and Variables
@cindex API @subentry version
@cindex extension API @subentry version number

The API provides both a ``major'' and a ``minor'' version number.
The API versions are available at compile time as C preprocessor defines
to support conditional compilation, and as enum constants to facilitate
debugging:

@float Table,gawk-api-version
@caption{gawk API version constants}
@multitable {@b{API Version}} {@code{gawk_api_major_version}} {@code{GAWK_API_MAJOR_VERSION}}
@headitem API Version @tab C Preprocessor Define @tab enum constant
@item Major @tab @code{gawk_api_major_version} @tab @code{GAWK_API_MAJOR_VERSION}
@item Minor @tab @code{gawk_api_minor_version} @tab @code{GAWK_API_MINOR_VERSION}
@end multitable
@end float

The minor version increases when new functions are added to the API. Such
new functions are always added to the end of the API @code{struct}.

The major version increases (and the minor version is reset to zero) if any
of the data types change size or member order, or if any of the existing
functions change signature.

It could happen that an extension may be compiled against one version
of the API but loaded by a version of @command{gawk} using a different
version. For this reason, the major and minor API versions of the
running @command{gawk} are included in the API @code{struct} as read-only
constant integers:

@table @code
@item api->major_version
The major version of the running @command{gawk}.

@item api->minor_version
The minor version of the running @command{gawk}.
@end table

It is up to the extension to decide if there are API incompatibilities.
Typically, a check like this is enough:

@example
if (   api->major_version != GAWK_API_MAJOR_VERSION
    || api->minor_version < GAWK_API_MINOR_VERSION) @{
        fprintf(stderr, "foo_extension: version mismatch with gawk!\n");
        fprintf(stderr, "\tmy version (%d, %d), gawk version (%d, %d)\n",
                GAWK_API_MAJOR_VERSION, GAWK_API_MINOR_VERSION,
                api->major_version, api->minor_version);
        exit(1);
@}
@end example

Such code is included in the boilerplate @code{dl_load_func()} macro
provided in @file{gawkapi.h} (discussed in
@ref{Extension API Boilerplate}).

@node Extension GMP/MPFR Versioning
@subsubsection GMP and MPFR Version Information

The API also includes information about the versions of GMP and MPFR
with which the running @command{gawk} was compiled (if any).
They are included in the API @code{struct} as read-only
constant integers:

@table @code
@item api->gmp_major_version
The major version of the GMP library used to compile @command{gawk}.

@item api->gmp_minor_version
The minor version of the GMP library used to compile @command{gawk}.

@item api->mpfr_major_version
The major version of the MPFR library used to compile @command{gawk}.

@item api->mpfr_minor_version
The minor version of the MPFR library used to compile @command{gawk}.
@end table

These fields are set to zero if @command{gawk} was compiled without
MPFR support.

You can check if the versions of MPFR and GMP that you are using match those
of @command{gawk} with the following macro:

@table @code
@item check_mpfr_version(extension)
The @code{extension} is the extension id passed to all the other macros
and functions defined in @file{gawkapi.h}.
If you have not included
the @code{<mpfr.h>} header file, then this macro will be defined to do nothing.

If you have included that file, then this macro compares the MPFR
and GMP major and minor versions against those of the library you are
compiling against. If your libraries are newer than @command{gawk}'s, it
produces a fatal error message.

The @code{dl_load_func()} macro (@pxref{Extension API Boilerplate})
calls @code{check_mpfr_version()}.
@end table

@node Extension API Informational Variables
@subsubsection Informational Variables
@cindex API @subentry informational variables
@cindex extension API @subentry informational variables

The API provides access to several variables that describe
whether the corresponding command-line options were enabled when
@command{gawk} was invoked. The variables are:

@table @code
@item do_debug
This variable is true if @command{gawk} was invoked with the @option{--debug} option.

@item do_lint
This variable is true if @command{gawk} was invoked with the @option{--lint} option.

@item do_mpfr
This variable is true if @command{gawk} was invoked with the @option{--bignum} option.

@item do_profile
This variable is true if @command{gawk} was invoked with the @option{--profile} option.

@item do_sandbox
This variable is true if @command{gawk} was invoked with the @option{--sandbox} option.

@item do_traditional
This variable is true if @command{gawk} was invoked with the @option{--traditional} option.
@end table

The value of @code{do_lint} can change if @command{awk} code
modifies the @code{LINT} predefined variable (@pxref{Built-in Variables}).
The others should not change during execution.
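
As a rough illustration of how an extension might consult these variables,
the following fragment issues a lint warning only when @command{gawk} is
currently in lint mode. It is a sketch, not code from the @command{gawk}
distribution: the function name @code{do_frob()} and the message text are
hypothetical, and the fragment assumes that the boilerplate declarations of
@code{api} and @code{ext_id} (@pxref{Extension API Boilerplate}) are already
in place, so it cannot be compiled on its own:

@example
#include "gawkapi.h"

/* do_frob --- hypothetical extension function, for illustration only */

static awk_value_t *
do_frob(int nargs, awk_value_t *result, struct awk_ext_func *unused)
@{
    /* do_lint reflects gawk's lint setting at the time of the call */
    if (do_lint && nargs == 0)
        lintwarn(ext_id, "frob: called with no arguments");

    return make_number(0.0, result);
@}
@end example

@noindent
Because the value of @code{do_lint} can change while the program runs,
the fragment tests it on every call instead of saving its value when the
extension is loaded. The @code{lintwarn()} function is described in
@ref{Printing Messages}.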

@node Extension API Boilerplate
@subsection Boilerplate Code

As mentioned earlier (@pxref{Extension Mechanism Outline}), the function
definitions as presented are really macros. To use these macros, your
extension must provide a small amount of boilerplate code (variables and
functions) toward the top of your source file, using predefined names
as described here. The boilerplate needed is also provided in comments
in the @file{gawkapi.h} header file:

@example
@group
/* Boilerplate code: */
int plugin_is_GPL_compatible;

static const gawk_api_t *api;
@end group
static awk_ext_id_t ext_id;
static const char *ext_version = NULL; /* or @dots{} = "some string" */

static awk_ext_func_t func_table[] = @{
    @{ "name", do_name, 1, 0, awk_false, NULL @},
    /* @dots{} */
@};

/* EITHER: */

static awk_bool_t (*init_func)(void) = NULL;

/* OR: */

static awk_bool_t
init_my_extension(void)
@{
    @dots{}
@}

static awk_bool_t (*init_func)(void) = init_my_extension;

dl_load_func(func_table, some_name, "name_space_in_quotes")
@end example

These variables and functions are as follows:

@table @code
@item int plugin_is_GPL_compatible;
This asserts that the extension is compatible with
@ifclear FOR_PRINT
the GNU GPL (@pxref{Copying}).
@end ifclear
@ifset FOR_PRINT
the GNU GPL.
@end ifset
If your extension does not have this, @command{gawk}
will not load it (@pxref{Plugin License}).

@item static const gawk_api_t *api;
This global @code{static} variable should be set to point to
the @code{gawk_api_t} pointer that @command{gawk} passes to your
@code{dl_load()} function. This variable is used by all of the macros.

@item static awk_ext_id_t ext_id;
This global @code{static} variable should be set to the @code{awk_ext_id_t}
value that @command{gawk} passes to your @code{dl_load()} function.
This variable is used by all of the macros.

@item static const char *ext_version = NULL; /* or @dots{} = "some string" */
This global @code{static} variable should be set either
to @code{NULL}, or to point to a string giving the name and version of
your extension.

@item static awk_ext_func_t func_table[] = @{ @dots{} @};
This is an array of one or more @code{awk_ext_func_t} structures,
as described earlier (@pxref{Extension Functions}).
It can then be looped over for multiple calls to
@code{add_ext_func()}.

@c Use @var{OR} for docbook
@item static awk_bool_t (*init_func)(void) = NULL;
@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @var{OR}
@itemx static awk_bool_t init_my_extension(void) @{ @dots{} @}
@itemx static awk_bool_t (*init_func)(void) = init_my_extension;
If you need to do some initialization work, you should define a
function that does it (creates variables, opens files, etc.)
and then define the @code{init_func} pointer to point to your
function.
The function should return @code{awk_false} upon failure, or @code{awk_true}
if everything goes well.

If you don't need to do any initialization, define the pointer and
initialize it to @code{NULL}.

@item dl_load_func(func_table, some_name, "name_space_in_quotes")
This macro expands to a @code{dl_load()} function that performs
all the necessary initializations.
@end table

The point of all the variables and arrays is to let the
@code{dl_load()} function (from the @code{dl_load_func()}
macro) do all the standard work. It does the following:

@enumerate 1
@item
Check the API versions.
If the extension major version does not match
@command{gawk}'s, or if the extension minor version is greater than
@command{gawk}'s, it prints a fatal error message and exits.

@item
Check the MPFR and GMP versions. If there is a mismatch, it prints
a fatal error message and exits.

@item
Load the functions defined in @code{func_table}.
If any of them fails to load, it prints a warning message but
continues on.

@item
If the @code{init_func} pointer is not @code{NULL}, call the
function it points to. If it returns @code{awk_false}, print a
warning message.

@item
If @code{ext_version} is not @code{NULL}, register
the version string with @command{gawk}.
@end enumerate


@node Changes from API V1
@subsection Changes From Version 1 of the API

The current API is @emph{not} binary compatible with version 1 of the API.
You will have to recompile your extensions in order to use them with
the current version of @command{gawk}.

Fortunately, at the possible expense of some compile-time warnings, the API remains
source-code--compatible with the previous API. The major differences are
the additional members in the @code{awk_ext_func_t} structure, and the
addition of the third argument to the C implementation function
(@pxref{Extension Functions}).

Here is a list of individual features that changed from version 1 to
version 2 of the API:

@itemize @bullet

@item
Numeric values can now have MPFR/MPZ variants
(@pxref{General Data Types}).

@item
There are new string types: @code{AWK_REGEX} and @code{AWK_STRNUM}
(@pxref{General Data Types}).

@item
The @code{ezalloc()} macro is new
(@pxref{Memory Allocation Functions}).

@item
The @code{awk_ext_func_t} structure changed.
Instead of
@code{num_expected_args}, it now has @code{max_expected} and
@code{min_required}
(@pxref{Extension Functions}).

@item
For @code{get_record()}, an input parser can now specify field widths
(@pxref{Input Parsers}).

@item
Extensions can now produce nonfatal error messages
(@pxref{Printing Messages}).

@item
When flattening an array, you can now specify the index and value types
(@pxref{Array Functions}).

@item
The @code{get_file()} API is new
(@pxref{Redirection API}).
@end itemize

@node Finding Extensions
@section How @command{gawk} Finds Extensions
@cindex extensions @subentry loadable @subentry search path
@cindex finding extensions

Compiled extensions have to be installed in a directory where
@command{gawk} can find them. If @command{gawk} is configured and
built in the default fashion, the directory in which to find
extensions is @file{/usr/local/lib/gawk}. You can also specify a search
path with a list of directories to search for compiled extensions.
@xref{AWKLIBPATH Variable} for more information.

@node Extension Example
@section Example: Some File Functions
@cindex extensions @subentry loadable @subentry example

@quotation
@i{No matter where you go, there you are.}
@author Buckaroo Banzai
@end quotation

@c It's enough to show chdir and stat, no need for fts

Two useful functions that are not in @command{awk} are @code{chdir()} (so
that an @command{awk} program can change its directory) and @code{stat()}
(so that an @command{awk} program can gather information about a file).
In order to illustrate the API in action, this @value{SECTION} implements
these functions for @command{gawk} in an extension.

@menu
* Internal File Description::   What the new functions will do.
* Internal File Ops::           The code for internal file operations.
* Using Internal File Ops::     How to use an external extension.
@end menu

@node Internal File Description
@subsection Using @code{chdir()} and @code{stat()}

This @value{SECTION} shows how to use the new functions at
the @command{awk} level once they've been integrated into the
running @command{gawk} interpreter. Using @code{chdir()} is very
straightforward. It takes one argument, the new directory to change to:

@example
@@load "filefuncs"
@dots{}
newdir = "/home/arnold/funstuff"
ret = chdir(newdir)
if (ret < 0) @{
    printf("could not change to %s: %s\n", newdir, ERRNO) > "/dev/stderr"
    exit 1
@}
@dots{}
@end example

The return value is negative if the @code{chdir()} failed, and
@code{ERRNO} (@pxref{Built-in Variables}) is set to a string indicating
the error.

Using @code{stat()} is a bit more complicated. The C @code{stat()}
function fills in a structure that has a fair amount of information.
The right way to model this in @command{awk} is to fill in an associative
array with the appropriate information:

@c broke printf for page breaking
@example
file = "/home/arnold/.profile"
ret = stat(file, fdata)
if (ret < 0) @{
    printf("could not stat %s: %s\n",
             file, ERRNO) > "/dev/stderr"
    exit 1
@}
printf("size of %s is %d bytes\n", file, fdata["size"])
@end example

The @code{stat()} function always clears the data array, even if
the @code{stat()} fails. It fills in the following elements:

@table @code
@item "name"
The name of the file that was @code{stat()}ed.

@item "dev"
@itemx "ino"
The file's device and inode numbers, respectively.

@item "mode"
The file's mode, as a numeric value.
This includes both the file's
type and its permissions.

@item "nlink"
The number of hard links (directory entries) the file has.

@item "uid"
@itemx "gid"
The numeric user and group ID numbers of the file's owner.

@item "size"
The size in bytes of the file.

@item "blocks"
The number of disk blocks the file actually occupies. This may not
be a function of the file's size if the file has holes.

@item "atime"
@itemx "mtime"
@itemx "ctime"
The file's last access, modification, and inode update times,
respectively. These are numeric timestamps, suitable for formatting
with @code{strftime()}
(@pxref{Time Functions}).

@item "pmode"
The file's ``printable mode.'' This is a string representation of
the file's type and permissions, such as is produced by
@samp{ls -l}---for example, @code{"drwxr-xr-x"}.

@item "type"
A printable string representation of the file's type. The value
is one of the following:

@table @code
@item "blockdev"
@itemx "chardev"
The file is a block or character device (``special file'').

@ignore
@item "door"
The file is a Solaris ``door'' (special file used for
interprocess communications).
@end ignore

@item "directory"
The file is a directory.

@item "fifo"
The file is a named pipe (also known as a FIFO).

@item "file"
The file is just a regular file.

@item "socket"
The file is an @code{AF_UNIX} (``Unix domain'') socket in the
filesystem.

@item "symlink"
The file is a symbolic link.
@end table

@c 5/2013: Thanks to Corinna Vinschen for this information.
@item "devbsize"
The size of a block for the element indexed by @code{"blocks"}.
This information is derived from either the @code{DEV_BSIZE}
constant defined in @code{<sys/param.h>} on most systems,
or the @code{S_BLKSIZE} constant in @code{<sys/stat.h>} on BSD systems.
For some other systems, @dfn{a priori} knowledge is used to provide
a value. Where no value can be determined, it defaults to 512.
@end table

Several additional elements may be present, depending upon the operating
system and the type of the file. You can test for them in your @command{awk}
program by using the @code{in} operator
(@pxref{Reference to Elements}):

@table @code
@item "blksize"
The preferred block size for I/O to the file. This field is not
present on all POSIX-like systems in the C @code{stat} structure.

@item "linkval"
If the file is a symbolic link, this element is the name of the
file the link points to (i.e., the value of the link).

@item "rdev"
@itemx "major"
@itemx "minor"
If the file is a block or character device file, then these values
represent the numeric device number and the major and minor components
of that number, respectively.
@end table

@node Internal File Ops
@subsection C Code for @code{chdir()} and @code{stat()}

Here is the C code for these extensions.@footnote{This version is
edited slightly for presentation. See @file{extension/filefuncs.c}
in the @command{gawk} distribution for the complete version.}

The file includes a number of standard header files, and then includes
the @file{gawkapi.h} header file, which provides the API definitions.
Those are followed by the necessary variable declarations
to make use of the API macros and boilerplate code
(@pxref{Extension API Boilerplate}):

@example
#ifdef HAVE_CONFIG_H
#include <config.h>
#endif

#include <stdio.h>
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#include <sys/types.h>
#include <sys/stat.h>

#include "gawkapi.h"

#include "gettext.h"
#define _(msgid)  gettext(msgid)
#define N_(msgid) msgid

#include "gawkfts.h"
#include "stack.h"

static const gawk_api_t *api;    /* for convenience macros to work */
static awk_ext_id_t ext_id;
static awk_bool_t init_filefuncs(void);
static awk_bool_t (*init_func)(void) = init_filefuncs;
static const char *ext_version = "filefuncs extension: version 1.0";

int plugin_is_GPL_compatible;
@end example

@cindex programming conventions @subentry @command{gawk} extensions
By convention, for an @command{awk} function @code{foo()}, the C function
that implements it is called @code{do_foo()}. The function should have
three arguments. The first is an @code{int}, usually called @code{nargs},
that represents the number of actual arguments for the function.
The second is a pointer to an @code{awk_value_t} structure, usually named
@code{result}. The third is a pointer to the @code{awk_ext_func_t}
structure that describes the function; it is not used here:

@example
@group
/* do_chdir --- provide dynamically loaded chdir() function for gawk */

static awk_value_t *
do_chdir(int nargs, awk_value_t *result, struct awk_ext_func *unused)
@end group
@{
    awk_value_t newdir;
    int ret = -1;

    assert(result != NULL);
@end example

The @code{newdir}
variable represents the new directory to change to, which is retrieved
with @code{get_argument()}.
Note that the first argument is
numbered zero.

If the argument is retrieved successfully, the function calls the
@code{chdir()} system call. If that call fails,
it updates @code{ERRNO}:

@example
    if (get_argument(0, AWK_STRING, & newdir)) @{
        ret = chdir(newdir.str_value.str);
        if (ret < 0)
            update_ERRNO_int(errno);
    @}
@end example

Finally, the function returns the return value to the @command{awk} level:

@example
    return make_number(ret, result);
@}
@end example

The @code{stat()} extension is more involved. First comes a function
that turns a numeric mode into a printable representation
(e.g., octal @code{0644} becomes @samp{-rw-r--r--}). This is omitted here for brevity:

@example
/* format_mode --- turn a stat mode field into something readable */

static char *
format_mode(unsigned long fmode)
@{
    @dots{}
@}
@end example

Next comes a function for reading symbolic links, which is also
omitted here for brevity:

@example
/* read_symlink --- read a symbolic link into an allocated buffer.
   @dots{} */

static char *
read_symlink(const char *fname, size_t bufsize, ssize_t *linksize)
@{
    @dots{}
@}
@end example

Two helper functions simplify entering values in the
array that will contain the result of the @code{stat()}:

@example
/* array_set --- set an array element */

static void
array_set(awk_array_t array, const char *sub, awk_value_t *value)
@{
    awk_value_t index;

    set_array_element(array,
                      make_const_string(sub, strlen(sub), & index),
                      value);

@}

/* array_set_numeric --- set an array element with a number */

static void
array_set_numeric(awk_array_t array, const char *sub, double num)
@{
    awk_value_t tmp;

    array_set(array, sub, make_number(num, & tmp));
@}
@end example

The following function does most of the work to fill in
the @code{awk_array_t} result array with values obtained
from a valid @code{struct stat}. This work is done in a separate function
to support the @code{stat()} function for @command{gawk} and also
to support the @code{fts()} extension, which is included in
the same file but whose code is not shown here
(@pxref{Extension Sample File Functions}).

The first part of the function is variable declarations,
including a table to map file types to strings:

@example
/* fill_stat_array --- do the work to fill an array with stat info */

static int
fill_stat_array(const char *name, awk_array_t array, struct stat *sbuf)
@{
    char *pmode;    /* printable mode */
    const char *type = "unknown";
    awk_value_t tmp;
    static struct ftype_map @{
        unsigned int mask;
        const char *type;
    @} ftype_map[] = @{
        @{ S_IFREG, "file" @},
        @{ S_IFBLK, "blockdev" @},
        @{ S_IFCHR, "chardev" @},
        @{ S_IFDIR, "directory" @},
#ifdef S_IFSOCK
        @{ S_IFSOCK, "socket" @},
#endif
#ifdef S_IFIFO
        @{ S_IFIFO, "fifo" @},
#endif
#ifdef S_IFLNK
        @{ S_IFLNK, "symlink" @},
#endif
#ifdef S_IFDOOR /* Solaris weirdness */
        @{ S_IFDOOR, "door" @},
#endif
    @};
    int j, k;
@end example

The destination array is cleared, and then code fills in
various elements based on values in the @code{struct stat}:

@example
    /* empty out the array */
    clear_array(array);

    /* fill in the array */
    array_set(array, "name", make_const_string(name, strlen(name),
                                               & tmp));
    array_set_numeric(array, "dev", sbuf->st_dev);
    array_set_numeric(array, "ino", sbuf->st_ino);
    array_set_numeric(array, "mode", sbuf->st_mode);
    array_set_numeric(array, "nlink", sbuf->st_nlink);
    array_set_numeric(array, "uid", sbuf->st_uid);
    array_set_numeric(array, "gid", sbuf->st_gid);
    array_set_numeric(array, "size", sbuf->st_size);
    array_set_numeric(array, "blocks", sbuf->st_blocks);
    array_set_numeric(array, "atime", sbuf->st_atime);
    array_set_numeric(array, "mtime", sbuf->st_mtime);
    array_set_numeric(array, "ctime", sbuf->st_ctime);

    /* for block and character devices, add rdev,
       major and minor
numbers */
    if (S_ISBLK(sbuf->st_mode) || S_ISCHR(sbuf->st_mode)) @{
        array_set_numeric(array, "rdev", sbuf->st_rdev);
        array_set_numeric(array, "major", major(sbuf->st_rdev));
        array_set_numeric(array, "minor", minor(sbuf->st_rdev));
    @}
@end example

@noindent
The latter part of the function makes selective additions
to the destination array, depending upon the availability of
certain members and/or the type of the file. It then returns zero,
for success:

@example
@group
#ifdef HAVE_STRUCT_STAT_ST_BLKSIZE
    array_set_numeric(array, "blksize", sbuf->st_blksize);
#endif
@end group

    pmode = format_mode(sbuf->st_mode);
    array_set(array, "pmode", make_const_string(pmode, strlen(pmode),
                                                & tmp));

    /* for symbolic links, add a linkval field */
    if (S_ISLNK(sbuf->st_mode)) @{
        char *buf;
        ssize_t linksize;

        if ((buf = read_symlink(name, sbuf->st_size,
                    & linksize)) != NULL)
            array_set(array, "linkval",
                      make_malloced_string(buf, linksize, & tmp));
        else
            warning(ext_id, _("stat: unable to read symbolic link `%s'"),
                    name);
    @}

    /* add a type field */
    type = "unknown";   /* shouldn't happen */
    for (j = 0, k = sizeof(ftype_map)/sizeof(ftype_map[0]); j < k; j++) @{
        if ((sbuf->st_mode & S_IFMT) == ftype_map[j].mask) @{
            type = ftype_map[j].type;
            break;
        @}
    @}

    array_set(array, "type", make_const_string(type, strlen(type), & tmp));

    return 0;
@}
@end example

The third argument to @code{stat()} was not discussed previously. This
argument is optional. If present, it causes @code{do_stat()} to use
the @code{stat()} system call instead of the @code{lstat()} system
call. This is done by using a function pointer: @code{statfunc}.
@code{statfunc} is initialized to point to @code{lstat()} (instead
of @code{stat()}) to get the file information, in case the file is a
symbolic link. However, if the third argument is included, @code{statfunc}
is set to point to @code{stat()}, instead.

Here is the @code{do_stat()} function, which starts with
variable declarations and argument checking:

@example
/* do_stat --- provide a stat() function for gawk */

static awk_value_t *
do_stat(int nargs, awk_value_t *result, struct awk_ext_func *unused)
@{
    awk_value_t file_param, array_param;
    char *name;
    awk_array_t array;
    int ret;
    struct stat sbuf;
    /* default is lstat() */
    int (*statfunc)(const char *path, struct stat *sbuf) = lstat;

    assert(result != NULL);
@end example

Then comes the actual work. First, the function gets the arguments.
Next, it gets the information for the file. If the called function
(@code{lstat()} or @code{stat()}) returns an error, the code sets
@code{ERRNO} and returns:

@example
    /* file is first arg, array to hold results is second */
    if (   ! get_argument(0, AWK_STRING, & file_param)
        || ! get_argument(1, AWK_ARRAY, & array_param)) @{
        warning(ext_id, _("stat: bad parameters"));
        return make_number(-1, result);
    @}

    if (nargs == 3) @{
        statfunc = stat;
    @}

    name = file_param.str_value.str;
    array = array_param.array_cookie;

    /* always empty out the array */
    clear_array(array);

    /* stat the file; if error, set ERRNO and return */
    ret = statfunc(name, & sbuf);
@group
    if (ret < 0) @{
        update_ERRNO_int(errno);
        return make_number(ret, result);
    @}
@end group
@end example

The tedious work is done by @code{fill_stat_array()}, shown
earlier.
When done, the function returns the result from @code{fill_stat_array()}:

@example
@group
    ret = fill_stat_array(name, array, & sbuf);

    return make_number(ret, result);
@}
@end group
@end example

Finally, it's necessary to provide the ``glue'' that loads the
new function(s) into @command{gawk}.

The @code{filefuncs} extension also provides an @code{fts()}
function, which we omit here
(@pxref{Extension Sample File Functions}).
For its sake, there is an initialization
function:

@example
/* init_filefuncs --- initialization routine */

static awk_bool_t
init_filefuncs(void)
@{
    @dots{}
@}
@end example

We are almost done. We need an array of @code{awk_ext_func_t}
structures for loading each function into @command{gawk}:

@example
static awk_ext_func_t func_table[] = @{
    @{ "chdir", do_chdir, 1, 1, awk_false, NULL @},
    @{ "stat",  do_stat, 3, 2, awk_false, NULL @},
    @dots{}
@};
@end example

Each extension must have a routine named @code{dl_load()} to load
everything that needs to be loaded. It is simplest to use the
@code{dl_load_func()} macro in @file{gawkapi.h}:

@example
/* define the dl_load() function using the boilerplate macro */

dl_load_func(func_table, filefuncs, "")
@end example

And that's it!

@node Using Internal File Ops
@subsection Integrating the Extensions

@cindex @command{gawk} @subentry interpreter, adding code to
Now that the code is written, it must be possible to add it at
runtime to the running @command{gawk} interpreter. First, the
code must be compiled.
Assuming that the functions are in
a file named @file{filefuncs.c}, and @var{idir} is the location
of the @file{gawkapi.h} header file,
the following steps@footnote{In practice, you would probably want to
use the GNU Autotools (Automake, Autoconf, Libtool, and @command{gettext}) to
configure and build your libraries. Instructions for doing so are beyond
the scope of this @value{DOCUMENT}. @xref{gawkextlib} for Internet links to
the tools.} create a GNU/Linux shared library:

@example
$ @kbd{gcc -fPIC -shared -DHAVE_CONFIG_H -c -O -g -I@var{idir} filefuncs.c}
$ @kbd{gcc -o filefuncs.so -shared filefuncs.o}
@end example

Once the library exists, it is loaded by using the @code{@@load} keyword:

@example
# file testff.awk
@@load "filefuncs"

BEGIN @{
    "pwd" | getline curdir  # save current directory
    close("pwd")

    chdir("/tmp")
    system("pwd")   # test it
    chdir(curdir)   # go back

    print "Info for testff.awk"
    ret = stat("testff.awk", data)
    print "ret =", ret
    for (i in data)
        printf "data[\"%s\"] = %s\n", i, data[i]
    print "testff.awk modified:",
        strftime("%m %d %Y %H:%M:%S", data["mtime"])

    print "\nInfo for JUNK"
    ret = stat("JUNK", data)
    print "ret =", ret
    for (i in data)
        printf "data[\"%s\"] = %s\n", i, data[i]
    print "JUNK modified:", strftime("%m %d %Y %H:%M:%S", data["mtime"])
@}
@end example

The @env{AWKLIBPATH} environment variable tells
@command{gawk} where to find extensions (@pxref{Finding Extensions}).
We set it to the current directory and run the program:

@example
$ @kbd{AWKLIBPATH=$PWD gawk -f testff.awk}
@print{} /tmp
@print{} Info for testff.awk
@print{} ret = 0
@print{} data["blksize"] = 4096
@print{} data["devbsize"] = 512
@print{} data["mtime"] = 1412004710
@print{} data["mode"] = 33204
@print{} data["type"] = file
@print{} data["dev"] = 2053
@print{} data["gid"] = 1000
@print{} data["ino"] = 10358899
@print{} data["ctime"] = 1412004710
@print{} data["blocks"] = 8
@print{} data["nlink"] = 1
@print{} data["name"] = testff.awk
@print{} data["atime"] = 1412004716
@print{} data["pmode"] = -rw-rw-r--
@print{} data["size"] = 666
@print{} data["uid"] = 1000
@print{} testff.awk modified: 09 29 2014 18:31:50
@print{}
@print{} Info for JUNK
@print{} ret = -1
@print{} JUNK modified: 01 01 1970 02:00:00
@end example

@node Extension Samples
@section The Sample Extensions in the @command{gawk} Distribution
@cindex extensions @subentry loadable @subentry distributed with @command{gawk}

This @value{SECTION} provides a brief overview of the sample extensions
that come in the @command{gawk} distribution. Some of them are intended
for production use (e.g., the @code{filefuncs}, @code{readdir}, and
@code{inplace} extensions). Others mainly provide example code that
shows how to use the extension API.

@menu
* Extension Sample File Functions::   The file functions sample.
* Extension Sample Fnmatch::          An interface to @code{fnmatch()}.
* Extension Sample Fork::             An interface to @code{fork()} and other
                                      process functions.
* Extension Sample Inplace::          Enabling in-place file editing.
* Extension Sample Ord::              Character to value to character
                                      conversions.
* Extension Sample Readdir::          An interface to @code{readdir()}.
* Extension Sample Revout::           Reversing output sample output wrapper.
* Extension Sample Rev2way::          Reversing data sample two-way processor.
* Extension Sample Read write array:: Serializing an array to a file.
* Extension Sample Readfile::         Reading an entire file into a string.
* Extension Sample Time::             An interface to @code{gettimeofday()}
                                      and @code{sleep()}.
* Extension Sample API Tests::        Tests for the API.
@end menu

@node Extension Sample File Functions
@subsection File-Related Functions

The @code{filefuncs} extension provides three different functions, as follows.
The usage is:

@table @asis
@item @code{@@load "filefuncs"}
This is how you load the extension.

@cindex @code{chdir()} extension function
@item @code{result = chdir("/some/directory")}
The @code{chdir()} function is a direct hook to the @code{chdir()}
system call to change the current directory. It returns zero
upon success or a value less than zero upon error.
In the latter case, it updates @code{ERRNO}.

@cindex @code{stat()} extension function
@item @code{result = stat("/some/path", statdata} [@code{, follow}]@code{)}
The @code{stat()} function provides a hook into the
@code{stat()} system call.
It returns zero upon success or a value less than zero upon error.
In the latter case, it updates @code{ERRNO}.

By default, it uses the @code{lstat()} system call. However, if passed
a third argument, it uses @code{stat()} instead.

In all cases, it clears the @code{statdata} array.
When the call is successful, @code{stat()} fills the @code{statdata}
array with information retrieved from the filesystem, as follows:

@multitable @columnfractions .15 .50 .20
@headitem Subscript @tab Field in @code{struct stat} @tab File type
@item @code{"name"} @tab The @value{FN} @tab All
@item @code{"dev"} @tab @code{st_dev} @tab All
@item @code{"ino"} @tab @code{st_ino} @tab All
@item @code{"mode"} @tab @code{st_mode} @tab All
@item @code{"nlink"} @tab @code{st_nlink} @tab All
@item @code{"uid"} @tab @code{st_uid} @tab All
@item @code{"gid"} @tab @code{st_gid} @tab All
@item @code{"size"} @tab @code{st_size} @tab All
@item @code{"atime"} @tab @code{st_atime} @tab All
@item @code{"mtime"} @tab @code{st_mtime} @tab All
@item @code{"ctime"} @tab @code{st_ctime} @tab All
@item @code{"rdev"} @tab @code{st_rdev} @tab Device files
@item @code{"major"} @tab Major component of @code{st_rdev} @tab Device files
@item @code{"minor"} @tab Minor component of @code{st_rdev} @tab Device files
@item @code{"blksize"} @tab @code{st_blksize} @tab All
@item @code{"pmode"} @tab A human-readable version of the mode value, like that printed by
@command{ls} (for example, @code{"-rwxr-xr-x"}) @tab All
@item @code{"linkval"} @tab The value of the symbolic link @tab Symbolic links
@item @code{"type"} @tab The type of the file as a string---one of
@code{"file"},
@code{"blockdev"},
@code{"chardev"},
@code{"directory"},
@code{"socket"},
@code{"fifo"},
@code{"symlink"},
@code{"door"},
or
@code{"unknown"}
(not all systems support all file types) @tab All
@end multitable

@cindex @code{fts()} extension function
@item @code{flags = or(FTS_PHYSICAL, ...)}
@itemx @code{result = fts(pathlist, flags, filedata)}
Walk the file trees provided in @code{pathlist} and fill in the
@code{filedata} array, as described next.
@code{flags} is the bitwise
OR of several predefined values, also described in a moment.
Return zero if there were no errors, otherwise return @minus{}1.
@end table

The @code{fts()} function provides a hook to the C library @code{fts()}
routines for traversing file hierarchies. Instead of returning data
about one file at a time in a stream, it fills in a multidimensional
array with data about each file and directory encountered in the requested
hierarchies.

The arguments are as follows:

@table @code
@item pathlist
An array of @value{FN}s. The element values are used; the index values are ignored.

@item flags
This should be the bitwise OR of one or more of the following
predefined constant flag values. At least one of
@code{FTS_LOGICAL} or @code{FTS_PHYSICAL} must be provided; otherwise
@code{fts()} returns an error value and sets @code{ERRNO}.
The flags are:

@c nested table
@table @code
@item FTS_LOGICAL
Do a ``logical'' file traversal, where the information returned for
a symbolic link refers to the linked-to file, and not to the symbolic
link itself. This flag is mutually exclusive with @code{FTS_PHYSICAL}.

@item FTS_PHYSICAL
Do a ``physical'' file traversal, where the information returned for a
symbolic link refers to the symbolic link itself. This flag is mutually
exclusive with @code{FTS_LOGICAL}.

@item FTS_NOCHDIR
As a performance optimization, the C library @code{fts()} routines
change directory as they traverse a file hierarchy. This flag disables
that optimization.

@item FTS_COMFOLLOW
Immediately follow a symbolic link named in @code{pathlist},
whether or not @code{FTS_LOGICAL} is set.

@item FTS_SEEDOT
By default, the C library @code{fts()} routines do not return entries for
@file{.} (dot) and @file{..} (dot-dot).
This option causes entries for
dot-dot to also be included. (The extension always includes an entry
for dot; more on this in a moment.)

@item FTS_XDEV
During a traversal, do not cross onto a different mounted filesystem.
@end table

@item filedata
The @code{filedata} array holds the results.
@code{fts()} first clears it. Then it creates
an element in @code{filedata} for every element in @code{pathlist}.
The index is the name of the directory or file given in @code{pathlist}.
The element for this index is itself an array. There are two cases:

@c nested table
@table @emph
@item The path is a file
In this case, the array contains two or three elements:

@c doubly nested table
@table @code
@item "path"
The full path to this file, starting from the ``root'' that was given
in the @code{pathlist} array.

@item "stat"
This element is itself an array, containing the same information as provided
by the @code{stat()} function described earlier for its
@code{statdata} argument. The element may not be present if
the @code{stat()} system call for the file failed.

@item "error"
If some kind of error was encountered, the array will also
contain an element named @code{"error"}, which is a string describing the error.
@end table

@item The path is a directory
In this case, the array contains one element for each entry in the
directory. If an entry is a file, that element is the same as for files, just
described. If the entry is a directory, that element is (recursively)
an array describing the subdirectory. If @code{FTS_SEEDOT} was provided
in the flags, then there will also be an element named @code{".."}. This
element will be an array containing the data as provided by @code{stat()}.

In addition, there will be an element whose index is @code{"."}.
This element is an array containing the same two or three elements as
for a file: @code{"path"}, @code{"stat"}, and @code{"error"}.
@end table
@end table

The @code{fts()} function returns zero if there were no errors.
Otherwise, it returns @minus{}1.

@quotation NOTE
The @code{fts()} extension does not exactly mimic the
interface of the C library @code{fts()} routines, choosing instead to
provide an interface that is based on associative arrays, which is
more comfortable to use from an @command{awk} program. This includes the
lack of a comparison function, because @command{gawk} already provides
powerful array sorting facilities. Although an @code{fts_read()}-like
interface could have been provided, this felt less natural than simply
creating a multidimensional array to represent the file hierarchy and
its information.
@end quotation

See @file{test/fts.awk} in the @command{gawk} distribution for an example
use of the @code{fts()} extension function.

@node Extension Sample Fnmatch
@subsection Interface to @code{fnmatch()}

This extension provides an interface to the C library
@code{fnmatch()} function. The usage is:

@table @code
@item @@load "fnmatch"
This is how you load the extension.

@cindex @code{fnmatch()} extension function
@item result = fnmatch(pattern, string, flags)
The return value is zero on success, @code{FNM_NOMATCH}
if the string did not match the pattern, or
a different nonzero value if an error occurred.
@end table

In addition to the @code{fnmatch()} function, the @code{fnmatch} extension
adds one constant (@code{FNM_NOMATCH}), and an array of flag values
named @code{FNM}.
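
To make the return-value convention concrete, here is a minimal C sketch
of the C library @code{fnmatch()} function that the extension wraps.
It is an illustration under stated assumptions, not code from the
extension itself: the helper name @code{matches()} and its
1/0/@minus{}1 result convention are hypothetical, chosen only for
this example.

```c
#include <fnmatch.h>

/*
 * matches --- hypothetical helper (not part of the fnmatch extension).
 * Returns 1 if STRING matches PATTERN, 0 if it does not,
 * and -1 if fnmatch() reported an error.
 */

int
matches(const char *pattern, const char *string, int flags)
{
    int ret = fnmatch(pattern, string, flags);

    if (ret == 0)
        return 1;       /* zero means the string matched */
    if (ret == FNM_NOMATCH)
        return 0;       /* distinct value for "no match" */
    return -1;          /* any other nonzero value is an error */
}
```

With this helper, @code{matches("*.c", "foo.c", 0)} yields 1 and
@code{matches("*.a", "foo.c", 0)} yields 0. Passing @code{FNM_PERIOD}
requires a leading period to be matched explicitly, so
@code{matches("*rc", ".bashrc", FNM_PERIOD)} also yields 0.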

The arguments to @code{fnmatch()} are:

@table @code
@item pattern
The @value{FN} wildcard to match

@item string
The @value{FN} string

@item flags
Either zero, or the bitwise OR of one or more of the
flags in the @code{FNM} array
@end table

The flags are as follows:

@multitable @columnfractions .25 .75
@headitem Array element @tab Corresponding flag defined by @code{fnmatch()}
@item @code{FNM["CASEFOLD"]} @tab @code{FNM_CASEFOLD}
@item @code{FNM["FILE_NAME"]} @tab @code{FNM_FILE_NAME}
@item @code{FNM["LEADING_DIR"]} @tab @code{FNM_LEADING_DIR}
@item @code{FNM["NOESCAPE"]} @tab @code{FNM_NOESCAPE}
@item @code{FNM["PATHNAME"]} @tab @code{FNM_PATHNAME}
@item @code{FNM["PERIOD"]} @tab @code{FNM_PERIOD}
@end multitable

Here is an example:

@example
@@load "fnmatch"
@dots{}
flags = or(FNM["PERIOD"], FNM["NOESCAPE"])
if (fnmatch("*.a", "foo.c", flags) == FNM_NOMATCH)
    print "no match"
@end example

@node Extension Sample Fork
@subsection Interface to @code{fork()}, @code{wait()}, and @code{waitpid()}

The @code{fork} extension adds three functions, as follows:

@table @code
@item @@load "fork"
This is how you load the extension.

@cindex @code{fork()} extension function
@item pid = fork()
This function creates a new process. The return value is zero in the
child and the process ID number of the child in the parent, or @minus{}1
upon error. In the latter case, @code{ERRNO} indicates the problem.
In the child, @code{PROCINFO["pid"]} and @code{PROCINFO["ppid"]} are
updated to reflect the correct values.

@cindex @code{waitpid()} extension function
@item ret = waitpid(pid)
This function takes a numeric argument, which is the process ID to
wait for.
The return value is that of the
@code{waitpid()} system call.

@cindex @code{wait()} extension function
@item ret = wait()
This function waits for the first child to die.
The return value is that of the
@code{wait()} system call.
@end table

There is no corresponding @code{exec()} function.

Here is an example:

@example
@@load "fork"
@dots{}
if ((pid = fork()) == 0)
    print "hello from the child"
else
    print "hello from the parent"
@end example

@node Extension Sample Inplace
@subsection Enabling In-Place File Editing

@cindex @code{inplace} extension
The @code{inplace} extension emulates GNU @command{sed}'s @option{-i} option,
which performs ``in-place'' editing of each input file.
It uses the bundled @file{inplace.awk} include file to invoke the extension
properly. This extension makes use of the namespace facility to place
all the variables and functions in the @code{inplace} namespace
(@pxref{Namespaces}):

@example
@c file eg/lib/inplace.awk
@group
# inplace --- load and invoke the inplace extension.
@c endfile
@ignore
@c file eg/lib/inplace.awk
#
# Copyright (C) 2013, 2017, 2019 the Free Software Foundation, Inc.
#
# This file is part of GAWK, the GNU implementation of the
# AWK Programming Language.
#
# GAWK is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 3 of the License, or
# (at your option) any later version.
#
# GAWK is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA
#
# Andrew J. Schorr, aschorr@@telemetry-investments.com
# January 2013
#
# Revised for namespaces
# Arnold Robbins, arnold@@skeeve.com
# July 2017
# June 2019, add backwards compatibility
@c endfile
@end ignore
@c file eg/lib/inplace.awk

@@load "inplace"

# Please set inplace::suffix to make a backup copy. For example, you may
# want to set inplace::suffix to .bak on the command line or in a BEGIN rule.

# Before there were namespaces in gawk, this extension used
# INPLACE_SUFFIX as the variable for making backup copies. We allow this
# too, so that any code that used the previous version continues to work.

# By default, each filename on the command line will be edited inplace.
# But you can selectively disable this by adding an inplace::enable=0 argument
# prior to files that you do not want to process this way. You can then
# reenable it later on the commandline by putting inplace::enable=1 before files
# that you wish to be subject to inplace editing.

# N.B. We call inplace::end() in the BEGINFILE and END rules so that any
# actions in an ENDFILE rule will be redirected as expected.

@@namespace "inplace"
@end group

@group
BEGIN @{
    enable = 1         # enabled by default
@}
@end group

@group
BEGINFILE @{
    sfx = (suffix ? suffix : awk::INPLACE_SUFFIX)
    if (filename != "")
        end(filename, sfx)
    if (enable)
        begin(filename = FILENAME, sfx)
    else
        filename = ""
@}
@end group

@group
END @{
    if (filename != "")
        end(filename, (suffix ?
                      suffix : awk::INPLACE_SUFFIX))
@}
@end group
@c endfile
@end example

For each regular file that is processed, the extension redirects
standard output to a temporary file configured to have the same owner
and permissions as the original. After the file has been processed,
the extension restores standard output to its original destination.
If @code{inplace::suffix} is not an empty string, the original file is
linked to a backup @value{FN} created by appending that suffix. Finally,
the temporary file is renamed to the original @value{FN}.

Note that the use of this feature can be controlled by placing
@samp{inplace::enable=0} on the command line prior to listing files that
should not be processed this way. You can reenable inplace editing by adding
an @samp{inplace::enable=1} argument prior to files that should be subject
to inplace editing.

The @code{inplace::filename} variable serves to keep track of the
current @value{FN} so as to not invoke @code{inplace::end()} before
processing the first file.

If any error occurs, the extension issues a fatal error to terminate
processing immediately without damaging the original file.

Here are some simple examples:

@example
$ @kbd{gawk -i inplace '@{ gsub(/foo/, "bar") @}; @{ print @}' file1 file2 file3}
@end example

To keep a backup copy of the original files, try this:

@example
$ @kbd{gawk -i inplace -v inplace::suffix=.bak '@{ gsub(/foo/, "bar") @}}
> @kbd{@{ print @}' file1 file2 file3}
@end example

Please note that, while the extension does attempt to preserve ownership
and permissions, it makes no attempt to copy the ACLs from the original file.

If the program dies prematurely, as might happen if an unhandled signal
is received, a temporary file may be left behind.
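
As a sketch of the @samp{inplace::enable} mechanism described earlier
(the @value{FN}s are placeholders), the following command edits
@file{file1} and @file{file3} in place, but leaves @file{file2} untouched.
Because standard output is not redirected while in-place editing is
disabled, the modified text of @file{file2} should simply appear on
standard output:

@example
$ @kbd{gawk -i inplace '@{ gsub(/foo/, "bar") @}}
> @kbd{@{ print @}' file1 inplace::enable=0 file2 inplace::enable=1 file3}
@end example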

@node Extension Sample Ord
@subsection Character and Numeric Values: @code{ord()} and @code{chr()}

The @code{ordchr} extension adds two functions, named
@code{ord()} and @code{chr()}, as follows:

@table @code
@item @@load "ordchr"
This is how you load the extension.

@cindex @code{ord()} extension function
@item number = ord(string)
Return the numeric value of the first character in @code{string}.

@cindex @code{chr()} extension function
@item char = chr(number)
Return a string whose first character is that represented by @code{number}.
@end table

These functions are inspired by the Pascal language functions
of the same name. Here is an example:

@example
@@load "ordchr"
@dots{}
printf("The numeric value of 'A' is %d\n", ord("A"))
printf("The string value of 65 is %s\n", chr(65))
@end example

@node Extension Sample Readdir
@subsection Reading Directories

The @code{readdir} extension adds an input parser for directories.
The usage is as follows:

@cindex @code{readdir} extension
@example
@@load "readdir"
@end example

When this extension is in use, instead of skipping directories named
on the command line (or with @code{getline}),
they are read, with each entry returned as a record.

The record consists of three fields. The first two are the inode number and the
@value{FN}, separated by a forward slash character.
On systems where the directory entry contains the file type, the record
has a third field (also separated by a slash), which is a single letter
indicating the type of the file. The letters and their corresponding file
types are shown in @ref{table-readdir-file-types}.

@float Table,table-readdir-file-types
@caption{File types returned by the @code{readdir} extension}
@multitable @columnfractions .1 .9
@headitem Letter @tab File type
@item @code{b} @tab Block device
@item @code{c} @tab Character device
@item @code{d} @tab Directory
@item @code{f} @tab Regular file
@item @code{l} @tab Symbolic link
@item @code{p} @tab Named pipe (FIFO)
@item @code{s} @tab Socket
@item @code{u} @tab Anything else (unknown)
@end multitable
@end float

On systems without the file type information, the third field is always
@samp{u}.

@quotation NOTE
On GNU/Linux systems, there are filesystems that don't support the
@code{d_type} entry (see the @i{readdir}(3) manual page), and so the file
type is always @samp{u}. You can use the @code{filefuncs} extension to call
@code{stat()} in order to get correct type information.
@end quotation

By default, if a directory cannot be opened (due to permission problems,
for example), @command{gawk} will exit. As with regular files, this
situation can be handled using a @code{BEGINFILE} rule that checks
@code{ERRNO} and prints an error or otherwise handles the problem.

Here is an example:

@example
@@load "readdir"
@dots{}
BEGIN @{ FS = "/" @}
@{ print "@value{FN} is", $2 @}
@end example

@node Extension Sample Revout
@subsection Reversing Output

The @code{revoutput} extension adds a simple output wrapper that reverses
the characters in each output line. Its main purpose is to show how to
write an output wrapper, although it may be mildly amusing for the unwary.
Here is an example:

@cindex @code{revoutput} extension
@example
@@load "revoutput"

BEGIN @{
    REVOUT = 1
    print "don't panic" > "/dev/stdout"
@}
@end example

The output from this program is @samp{cinap t'nod}.

@node Extension Sample Rev2way
@subsection Two-Way I/O Example

The @code{revtwoway} extension adds a simple two-way processor that
reverses the characters in each line sent to it for reading back by
the @command{awk} program. Its main purpose is to show how to write
a two-way processor, although it may also be mildly amusing.
The following example shows how to use it:

@cindex @code{revtwoway} extension
@example
@@load "revtwoway"

BEGIN @{
    cmd = "/magic/mirror"
    print "don't panic" |& cmd
    cmd |& getline result
    print result
    close(cmd)
@}
@end example

The output from this program
@ifnotinfo
also is:
@end ifnotinfo
@ifinfo
is:
@end ifinfo
@samp{cinap t'nod}.

@node Extension Sample Read write array
@subsection Dumping and Restoring an Array

The @code{rwarray} extension adds two functions,
named @code{writea()} and @code{reada()}, as follows:

@table @code
@item @@load "rwarray"
This is how you load the extension.

@cindex @code{writea()} extension function
@item ret = writea(file, array)
This function takes a string argument, which is the name of the file
to which to dump the array, and the array itself as the second argument.
@code{writea()} understands arrays of arrays. It returns one on
success, or zero upon failure.

@cindex @code{reada()} extension function
@item ret = reada(file, array)
@code{reada()} is the inverse of @code{writea()};
it reads the file named as its first argument, filling in
the array named as the second argument. It clears the array first.
Here too, the return value is one on success, or zero upon failure.
@end table

The array created by @code{reada()} is identical to that written by
@code{writea()} in the sense that the contents are the same. However,
due to implementation issues, the array traversal order of the re-created
array is likely to be different from that of the original array. As array
traversal order in @command{awk} is by default undefined, this is (technically)
not a problem. If you need to guarantee a particular traversal
order, use the array sorting features in @command{gawk} to do so
(@pxref{Array Sorting}).

The file contains binary data. All integral values are written in network
byte order. However, double-precision floating-point values are written
as native binary data. Thus, arrays containing only string data can
theoretically be dumped on systems with one byte order and restored on
systems with a different one, but this has not been tried.

Here is an example:

@example
@@load "rwarray"
@dots{}
ret = writea("arraydump.bin", array)
@dots{}
ret = reada("arraydump.bin", array)
@end example

@node Extension Sample Readfile
@subsection Reading an Entire File

The @code{readfile} extension adds a single function
named @code{readfile()}, and an input parser:

@table @code
@item @@load "readfile"
This is how you load the extension.

@cindex @code{readfile()} extension function
@item result = readfile("/some/path")
The argument is the name of the file to read.
The return value is a
string containing the entire contents of the requested file. Upon error,
the function returns the empty string and sets @code{ERRNO}.

@item BEGIN @{ PROCINFO["readfile"] = 1 @}
In addition, the extension adds an input parser that is activated if
@code{PROCINFO["readfile"]} exists.
When activated, each input file is returned in its entirety as @code{$0}.
@code{RT} is set to the null string.
@end table

Here is an example:

@example
@@load "readfile"
@dots{}
contents = readfile("/path/to/file");
if (contents == "" && ERRNO != "") @{
    print("problem reading file", ERRNO) > "/dev/stderr"
    ...
@}
@end example

@node Extension Sample Time
@subsection Extension Time Functions

@quotation CAUTION
As of @command{gawk} @value{PVERSION} 5.1.0, this extension is considered to be obsolete.
It is replaced by the @code{timex} extension in @code{gawkextlib}
(@pxref{gawkextlib}).

For @value{PVERSION} 5.1, no warning will be issued if this extension is used.
For the next major release, a warning will be issued. In the release after that
this extension will be removed from the distribution.
@end quotation

The @code{time} extension adds two functions, named @code{gettimeofday()}
and @code{sleep()}, as follows:

@table @code
@item @@load "time"
This is how you load the extension.

@cindex @code{gettimeofday()} extension function
@item the_time = gettimeofday()
Return the time in seconds that has elapsed since 1970-01-01 UTC as a
floating-point value. If the time is unavailable on this platform, return
@minus{}1 and set @code{ERRNO}. The returned time should have sub-second
precision, but the actual precision may vary based on the platform.
If the standard C @code{gettimeofday()} system call is available on this
platform, then it simply returns the value. Otherwise, if on MS-Windows,
it tries to use @code{GetSystemTimeAsFileTime()}.

@cindex @code{sleep()} extension function
@item result = sleep(@var{seconds})
Attempt to sleep for @var{seconds} seconds. If @var{seconds} is negative,
or the attempt to sleep fails, return @minus{}1 and set @code{ERRNO}.
Otherwise, return zero after sleeping for the indicated amount of time.
Note that @var{seconds} may be a floating-point (nonintegral) value.
Implementation details: depending on platform availability, this function
tries to use @code{nanosleep()} or @code{select()} to implement the delay.
@end table

@node Extension Sample API Tests
@subsection API Tests
@cindex @code{testext} extension

The @code{testext} extension exercises parts of the extension API that
are not tested by the other samples. The @file{extension/testext.c}
file contains both the C code for the extension and @command{awk}
test code inside C comments that run the tests. The testing framework
extracts the @command{awk} code and runs the tests. See the source file
for more information.

@node gawkextlib
@section The @code{gawkextlib} Project
@cindex extensions @subentry loadable @subentry @code{gawkextlib} project

@cindex @code{gawkextlib} project
The @uref{https://sourceforge.net/projects/gawkextlib/, @code{gawkextlib}}
project provides a number of @command{gawk} extensions, including one for
processing XML files. This is the evolution of the original @command{xgawk}
(XML @command{gawk}) project.

There are a number of extensions. Some of the more interesting ones are:

@itemize @value{BULLET}
@item
@code{abort} extension.
It allows you to exit immediately from your
@command{awk} program without running the @code{END} rules.

@item
@code{json} extension.
This serializes a multidimensional array into a JSON string, and
can deserialize a JSON string into a @command{gawk} array.
This extension is interesting since it is written in C++ instead of C.

@item
MPFR library extension.
This provides access to a number of MPFR functions that @command{gawk}'s
native MPFR support does not.

@item
Select extension. It provides functionality based on the
@code{select()} system call.

@item
XML parser extension, using the @uref{https://expat.sourceforge.net, Expat}
XML parsing library
@end itemize

@cindex @command{git} utility
You can check out the code for the @code{gawkextlib} project
using the @uref{https://git-scm.com, Git} distributed source
code control system. The command is as follows:

@example
git clone git://git.code.sf.net/p/gawkextlib/code gawkextlib-code
@end example

@cindex RapidJson JSON parser library
You will need to have the @uref{http://www.rapidjson.org, RapidJson}
JSON parser library installed in order to build and use the @code{json} extension.

@cindex Expat XML parser library
You will need to have the @uref{https://expat.sourceforge.net, Expat}
XML parser library installed in order to build and use the XML extension.

In addition, you must have the GNU Autotools installed
(@uref{https://www.gnu.org/software/autoconf, Autoconf},
@uref{https://www.gnu.org/software/automake, Automake},
@uref{https://www.gnu.org/software/libtool, Libtool},
and
@uref{https://www.gnu.org/software/gettext, GNU @command{gettext}}).

The simple recipe for building and testing @code{gawkextlib} is as follows.
First, build and install @command{gawk}:

@example
cd .../path/to/gawk/code
./configure --prefix=/tmp/newgawk     @ii{Install in /tmp/newgawk for now}
make && make check                    @ii{Build and check that all is OK}
make install                          @ii{Install gawk}
@end example

Next, go to @url{https://sourceforge.net/projects/gawkextlib/files} to
download @code{gawkextlib} and any extensions that you would like to build.
The @file{README} file at that site explains how to build the code. If you
installed @command{gawk} in a non-standard location, you will need to
specify @samp{./configure --with-gawk=@var{/path/to/gawk}} to find it.
You may need to use the @command{sudo} utility
to install both @command{gawk} and @code{gawkextlib}, depending upon
how your system works.

If you write an extension that you wish to share with other
@command{gawk} users, consider doing so through the
@code{gawkextlib} project.
See the project's website for more information.

@node Extension summary
@section Summary

@itemize @value{BULLET}
@item
You can write extensions (sometimes called plug-ins) for @command{gawk}
in C or C++ using the application programming interface (API) defined
by the @command{gawk} developers.

@item
Extensions must have a license compatible with the GNU General Public
License (GPL), and they must assert that fact by declaring a variable
named @code{plugin_is_GPL_compatible}.

@item
Communication between @command{gawk} and an extension is two-way.
@command{gawk} passes a @code{struct} to the extension that contains
various data fields and function pointers. The extension can then call
into @command{gawk} via the supplied function pointers to accomplish
certain tasks.

@item
One of these tasks is to ``register'' the name and implementation of
new @command{awk}-level functions with @command{gawk}. The implementation
takes the form of a C function pointer with a defined signature.
By convention, implementation functions are named @code{do_@var{XXXX}()}
for some @command{awk}-level function @code{@var{XXXX}()}.

@item
The API is defined in a header file named @file{gawkapi.h}. You must include
a number of standard header files @emph{before} including it in your source file.

@item
API function pointers are provided for the following kinds of operations:

@itemize @value{BULLET}
@item
Allocating, reallocating, and releasing memory

@item
Registration functions (you may register
extension functions,
exit callbacks,
a version string,
input parsers,
output wrappers,
and two-way processors)

@item
Printing fatal, nonfatal, warning, and ``lint'' warning messages

@item
Updating @code{ERRNO}, or unsetting it

@item
Accessing parameters, including converting an undefined parameter into
an array

@item
Symbol table access (retrieving a global variable, creating one,
or changing one)

@item
Creating and releasing cached values; this provides an
efficient way to use values for multiple variables and
can be a big performance win

@item
Manipulating arrays
(retrieving, adding, deleting, and modifying elements;
getting the count of elements in an array;
creating a new array;
clearing an array;
and
flattening an array for easy C-style looping over all its indices and elements)
@end itemize

@item
The API defines a number of standard data types for representing
@command{awk} values, array elements, and arrays.

@item
The API provides convenience functions for constructing values.
It also provides memory management functions to ensure compatibility
between memory allocated by @command{gawk} and memory allocated by an
extension.

@item
@emph{All} memory passed from @command{gawk} to an extension must be
treated as read-only by the extension.

@item
@emph{All} memory passed from an extension to @command{gawk} must come from
the API's memory allocation functions. @command{gawk} takes responsibility for
the memory and releases it when appropriate.

@item
The API provides information about the running version of @command{gawk} so
that an extension can make sure it is compatible with the @command{gawk}
that loaded it.

@item
It is easiest to start a new extension by copying the boilerplate code
described in this @value{CHAPTER}. Macros in the @file{gawkapi.h} header
file make this easier to do.

@item
The @command{gawk} distribution includes a number of small but useful
sample extensions. The @code{gawkextlib} project includes several more
(larger) extensions. If you wish to write an extension and contribute it
to the community of @command{gawk} users, the @code{gawkextlib} project
is the place to do so.

@end itemize

@c EXCLUDE START
@node Extension Exercises
@section Exercises

@enumerate
@item
Add functions to implement system calls such as @code{chown()},
@code{chmod()}, and @code{umask()} to the file operations extension
presented in @ref{Internal File Ops}.

@c Idea from comp.lang.awk, February 2015
@item
Write an input parser that prints a prompt if the input is
from a ``terminal'' device. You can use the @code{isatty()}
function to tell if the input file is a terminal.
(Hint: this function
is usually expensive to call; try to call it just once.)
The content of the prompt should come from a variable settable
by @command{awk}-level code.
You can write the prompt to standard error. However,
for best results, open a new file descriptor (or file pointer)
on @file{/dev/tty} and print the prompt there, in case standard
error has been redirected.

Why is standard error a better
choice than standard output for writing the prompt?
Which reading mechanism should you replace, the one to get
a record, or the one to read raw bytes?

@item
Write a wrapper script that provides an interface similar to
@samp{sed -i} for the ``inplace'' extension presented in
@ref{Extension Sample Inplace}.

@end enumerate
@c EXCLUDE END

@ifnotinfo
@part @value{PART4}Appendices
@end ifnotinfo

@ifdocbook

@ifclear FOR_PRINT
Part IV contains the appendices (including the two licenses that cover
the @command{gawk} source code and this @value{DOCUMENT}, respectively)
and the Glossary:
@end ifclear

@ifset FOR_PRINT
Part IV contains three appendices, the last of which is the license that
covers the @command{gawk} source code:
@end ifset

@itemize @value{BULLET}
@item
@ref{Language History}

@item
@ref{Installation}

@ifclear FOR_PRINT
@item
@ref{Notes}

@item
@ref{Basic Concepts}

@item
@ref{Glossary}
@end ifclear

@item
@ref{Copying}

@ifclear FOR_PRINT
@item
@ref{GNU Free Documentation License}
@end ifclear
@end itemize
@end ifdocbook

@node Language History
@appendix The Evolution of the @command{awk} Language

This @value{DOCUMENT} describes the GNU implementation of @command{awk},
which follows the POSIX
specification. Many longtime @command{awk}
users learned @command{awk} programming with the original @command{awk}
implementation in Version 7 Unix. (This implementation was the basis for
@command{awk} in Berkeley Unix, through 4.3-Reno. Subsequent versions
of Berkeley Unix, and, for a while, some systems derived from 4.4BSD-Lite, used various
versions of @command{gawk} for their @command{awk}.) This @value{CHAPTER}
briefly describes the evolution of the @command{awk} language, with
cross-references to other parts of the @value{DOCUMENT} where you can
find more information.

@ifset FOR_PRINT
To save space, we have omitted
information on the history of features in @command{gawk} from this
edition. You can find it in the
@uref{https://www.gnu.org/software/gawk/manual/html_node/Feature-History.html,
online documentation}.
@end ifset

@menu
* V7/SVR3.1::                   The major changes between V7 and System V
                                Release 3.1.
* SVR4::                        Minor changes between System V Releases 3.1
                                and 4.
* POSIX::                       New features from the POSIX standard.
* BTL::                         New features from Brian Kernighan's version of
                                @command{awk}.
* POSIX/GNU::                   The extensions in @command{gawk} not in POSIX
                                @command{awk}.
* Feature History::             The history of the features in @command{gawk}.
* Common Extensions::           Common Extensions Summary.
* Ranges and Locales::          How locales used to affect regexp ranges.
* Contributors::                The major contributors to @command{gawk}.
* History summary::             History summary.
@end menu

@node V7/SVR3.1
@appendixsec Major Changes Between V7 and SVR3.1
@cindex @command{awk} @subentry versions of
@cindex @command{awk} @subentry versions of @subentry changes between V7 and SVR3.1

The @command{awk} language evolved considerably between the release of
Version 7 Unix (1978) and the new version that was first made generally available in
System V Release 3.1 (1987). This @value{SECTION} summarizes the changes, with
cross-references to further details:

@itemize @value{BULLET}
@item
The requirement for @samp{;} to separate rules on a line
(@pxref{Statements/Lines})

@item
User-defined functions and the @code{return} statement
(@pxref{User-defined})

@item
The @code{delete} statement (@pxref{Delete})

@item
The @code{do}-@code{while} statement
(@pxref{Do Statement})

@item
The built-in functions @code{atan2()}, @code{cos()}, @code{sin()}, @code{rand()}, and
@code{srand()} (@pxref{Numeric Functions})

@item
The built-in functions @code{gsub()}, @code{sub()}, and @code{match()}
(@pxref{String Functions})

@item
The built-in functions @code{close()} and @code{system()}
(@pxref{I/O Functions})

@item
The @code{ARGC}, @code{ARGV}, @code{FNR}, @code{RLENGTH}, @code{RSTART},
and @code{SUBSEP} predefined variables (@pxref{Built-in Variables})

@item
Assignable @code{$0} (@pxref{Changing Fields})

@item
The conditional expression using the ternary operator @samp{?:}
(@pxref{Conditional Exp})

@item
The expression @samp{@var{indx} in @var{array}} outside of @code{for}
statements (@pxref{Reference to Elements})

@item
The exponentiation operator @samp{^}
(@pxref{Arithmetic Ops}) and its assignment operator
form @samp{^=} (@pxref{Assignment Ops})

@item
C-compatible operator precedence, which breaks some old @command{awk}
programs (@pxref{Precedence})

@item
Regexps as the value of @code{FS}
(@pxref{Field Separators}) and as the
third argument to the @code{split()} function
(@pxref{String Functions}), rather than using only the first character
of @code{FS}

@item
Dynamic regexps as operands of the @samp{~} and @samp{!~} operators
(@pxref{Computed Regexps})

@item
The escape sequences @samp{\b}, @samp{\f}, and @samp{\r}
(@pxref{Escape Sequences})

@item
Redirection of input for the @code{getline} function
(@pxref{Getline})

@item
Multiple @code{BEGIN} and @code{END} rules
(@pxref{BEGIN/END})

@item
Multidimensional arrays
(@pxref{Multidimensional})
@end itemize

@node SVR4
@appendixsec Changes Between SVR3.1 and SVR4

@cindex @command{awk} @subentry versions of @subentry changes between SVR3.1 and SVR4
The System V Release 4 (1989) version of Unix @command{awk} added these features
(some of which originated in @command{gawk}):

@itemize @value{BULLET}
@item
The @code{ENVIRON} array (@pxref{Built-in Variables})
@c gawk and MKS awk

@item
Multiple @option{-f} options on the command line
(@pxref{Options})
@c MKS awk

@item
The @option{-v} option for assigning variables before program execution begins
(@pxref{Options})
@c GNU, Bell Laboratories & MKS together

@item
The @option{--} signal for terminating command-line options

@item
The @samp{\a}, @samp{\v}, and @samp{\x} escape sequences
(@pxref{Escape Sequences})
@c GNU, for ANSI C compat

@item
A defined return value for the @code{srand()} built-in function
(@pxref{Numeric Functions})

@item
The @code{toupper()} and @code{tolower()}
built-in string functions
for case translation
(@pxref{String Functions})

@item
A cleaner specification for the @samp{%c} format-control letter in the
@code{printf} function
(@pxref{Control Letters})

@item
The ability to dynamically pass the field width and precision (@code{"%*.*d"})
in the argument list of @code{printf} and @code{sprintf()}
(@pxref{Control Letters})

@item
The use of regexp constants, such as @code{/foo/}, as expressions, where
they are equivalent to using the matching operator, as in @samp{$0 ~ /foo/}
(@pxref{Using Constant Regexps})

@item
Processing of escape sequences inside command-line variable assignments
(@pxref{Assignment Options})
@end itemize

@node POSIX
@appendixsec Changes Between SVR4 and POSIX @command{awk}
@cindex @command{awk} @subentry versions of @subentry changes between SVR4 and POSIX @command{awk}
@cindex POSIX @command{awk} @subentry changes in @command{awk} versions

The POSIX Command Language and Utilities standard for @command{awk} (1992)
introduced the following changes into the language:

@itemize @value{BULLET}
@item
The use of @option{-W} for implementation-specific options
(@pxref{Options})

@item
The use of @code{CONVFMT} for controlling the conversion of numbers
to strings (@pxref{Conversion})

@item
The concept of a numeric string and tighter comparison rules to go
with it (@pxref{Typing and Comparison})

@item
The use of predefined variables as function parameter names is forbidden
(@pxref{Definition Syntax})

@item
More complete documentation of many of the previously undocumented
features of the language
@end itemize

In 2012, a number of extensions that had been commonly available for
many years were finally added to POSIX.
They are:

@itemize @value{BULLET}
@item
The @code{fflush()} built-in function for flushing buffered output
(@pxref{I/O Functions})

@item
The @code{nextfile} statement
(@pxref{Nextfile Statement})

@item
The ability to delete all of an array at once with @samp{delete @var{array}}
(@pxref{Delete})

@end itemize

@xref{Common Extensions} for a list of common extensions
not permitted by the POSIX standard.

The 2018 POSIX standard can be found online at
@url{https://pubs.opengroup.org/onlinepubs/9699919799/}.


@node BTL
@appendixsec Extensions in Brian Kernighan's @command{awk}

@cindex @command{awk} @subentry versions of @seealso{Brian Kernighan's @command{awk}}
@cindex extensions @subentry Brian Kernighan's @command{awk}
@cindex Brian Kernighan's @command{awk} @subentry extensions
@cindex Kernighan, Brian
Brian Kernighan
has made his version available via his home page
(@pxref{Other Versions}).

This @value{SECTION} describes common extensions that
originally appeared in his version of @command{awk}:

@itemize @value{BULLET}
@item
The @samp{**} and @samp{**=} operators
(@pxref{Arithmetic Ops}
and
@ref{Assignment Ops})

@item
The use of @code{func} as an abbreviation for @code{function}
(@pxref{Definition Syntax})

@item
The @code{fflush()} built-in function for flushing buffered output
(@pxref{I/O Functions})

@ignore
@item
The @code{SYMTAB} array, that allows access to @command{awk}'s internal symbol
table. This feature was never documented for his @command{awk}, largely because
it is somewhat shakily implemented.
For instance, you cannot access arrays
or array elements through it
@end ignore
@end itemize

@xref{Common Extensions} for a full list of the extensions
available in his @command{awk}.

@node POSIX/GNU
@appendixsec Extensions in @command{gawk} Not in POSIX @command{awk}

@cindex compatibility mode (@command{gawk}) @subentry extensions
@cindex extensions @subentry in @command{gawk}, not in POSIX @command{awk}
@cindex POSIX @subentry @command{gawk} extensions not included in
The GNU implementation, @command{gawk}, adds a large number of features.
They can all be disabled with either the @option{--traditional} or
@option{--posix} options
(@pxref{Options}).

A number of features have come and gone over the years. This @value{SECTION}
summarizes the additional features over POSIX @command{awk} that are
in the current version of @command{gawk}.

@itemize @value{BULLET}

@item
Additional predefined variables:

@itemize @value{MINUS}
@item
The
@code{ARGIND},
@code{BINMODE},
@code{ERRNO},
@code{FIELDWIDTHS},
@code{FPAT},
@code{IGNORECASE},
@code{LINT},
@code{PROCINFO},
@code{RT},
and
@code{TEXTDOMAIN}
variables
(@pxref{Built-in Variables})
@end itemize

@item
Special files in I/O redirections:

@itemize @value{MINUS}
@item
The @file{/dev/stdin}, @file{/dev/stdout}, @file{/dev/stderr}, and
@file{/dev/fd/@var{N}} special @value{FN}s
(@pxref{Special Files})

@item
The @file{/inet}, @file{/inet4}, and @file{/inet6} special files for
TCP/IP networking using @samp{|&} to specify which version of the
IP protocol to use
(@pxref{TCP/IP Networking})
@end itemize

@item
Changes and/or additions to the language:

@itemize @value{MINUS}
@item
The @samp{\x}
escape sequence
(@pxref{Escape Sequences})

@item
Full support for both POSIX and GNU regexps
(@pxref{Regexp})

@item
The ability for @code{FS} and for the third
argument to @code{split()} to be null strings
(@pxref{Single Character Fields})

@item
The ability for @code{RS} to be a regexp
(@pxref{Records})

@item
The ability to use octal and hexadecimal constants in @command{awk}
program source code
(@pxref{Nondecimal-numbers})

@item
The @samp{|&} operator for two-way I/O to a coprocess
(@pxref{Two-way I/O})

@item
Indirect function calls
(@pxref{Indirect Calls})

@item
Directories on the command line produce a warning and are skipped
(@pxref{Command-line directories})

@item
Output with @code{print} and @code{printf} need not be fatal
(@pxref{Nonfatal})
@end itemize

@item
New keywords:

@itemize @value{MINUS}
@item
The @code{BEGINFILE} and @code{ENDFILE} special patterns
(@pxref{BEGINFILE/ENDFILE})

@item
The @code{switch} statement
(@pxref{Switch Statement})
@end itemize

@item
Changes to standard @command{awk} functions:

@itemize @value{MINUS}
@item
The optional second argument to @code{close()} that allows closing one end
of a two-way pipe to a coprocess
(@pxref{Two-way I/O})

@item
POSIX compliance for @code{gsub()} and @code{sub()} with @option{--posix}

@item
The @code{length()} function accepts an array argument
and returns the number of elements in the array
(@pxref{String Functions})

@item
The optional third argument to the @code{match()} function
for capturing text-matching subexpressions within a regexp
(@pxref{String Functions})

@item
Positional specifiers in @code{printf} formats for
making translations easier
(@pxref{Printf Ordering})

@item
The @code{split()} function's additional optional fourth
argument, which is an array to hold the text of the field separators
(@pxref{String Functions})
@end itemize

@item
Additional functions only in @command{gawk}:

@itemize @value{MINUS}
@item
The @code{gensub()}, @code{patsplit()}, and @code{strtonum()} functions
for more powerful text manipulation
(@pxref{String Functions})

@item
The @code{asort()} and @code{asorti()} functions for sorting arrays
(@pxref{Array Sorting})

@item
The @code{mktime()}, @code{systime()}, and @code{strftime()}
functions for working with timestamps
(@pxref{Time Functions})

@item
The
@code{and()},
@code{compl()},
@code{lshift()},
@code{or()},
@code{rshift()},
and
@code{xor()}
functions for bit manipulation
(@pxref{Bitwise Functions})
@c In 4.1, and(), or() and xor() grew the ability to take > 2 arguments

@item
The @code{isarray()} function to check if a variable is an array or not
(@pxref{Type Functions})

@item
The @code{bindtextdomain()}, @code{dcgettext()}, and @code{dcngettext()}
functions for internationalization
(@pxref{Programmer i18n})

@ifset INTDIV
@item
The @code{intdiv0()} function for doing integer
division and remainder
(@pxref{Numeric Functions})
@end ifset
@end itemize

@item
Changes and/or additions in the command-line options:

@itemize @value{MINUS}
@item
The @env{AWKPATH} environment variable for specifying a path search for
the @option{-f} command-line option
(@pxref{Options})

@item
The @env{AWKLIBPATH} environment variable for specifying a path search for
the @option{-l} command-line option
(@pxref{Options})

@item
The
@option{-b},
@option{-c},
@option{-C},
@option{-d},
@option{-D},
@option{-e},
@option{-E},
@option{-g},
@option{-h},
@option{-i},
@option{-l},
@option{-L},
@option{-M},
@option{-n},
@option{-N},
@option{-o},
@option{-O},
@option{-p},
@option{-P},
@option{-r},
@option{-s},
@option{-S},
@option{-t},
and
@option{-V}
short options. Also, the
ability to use GNU-style long-named options that start with @option{--},
and the
@option{--assign},
@option{--bignum},
@option{--characters-as-bytes},
@option{--copyright},
@option{--debug},
@option{--dump-variables},
@option{--exec},
@option{--field-separator},
@option{--file},
@option{--gen-pot},
@option{--help},
@option{--include},
@option{--lint},
@option{--lint-old},
@option{--load},
@option{--non-decimal-data},
@option{--optimize},
@option{--no-optimize},
@option{--posix},
@option{--pretty-print},
@option{--profile},
@option{--re-interval},
@option{--sandbox},
@option{--source},
@option{--traditional},
@option{--use-lc-numeric},
and
@option{--version}
long options
(@pxref{Options}).
@end itemize

@c new ports

@item
Support for the following obsolete systems was removed from the code
and the documentation for @command{gawk} @value{PVERSION} 4.0:

@c nested table
@itemize @value{MINUS}
@item
Amiga

@item
Atari

@item
BeOS

@item
Cray

@item
MIPS RiscOS

@item
MS-DOS with the Microsoft Compiler

@item
MS-Windows with the Microsoft Compiler

@item
NeXT

@item
SunOS 3.x, Sun 386 (Road Runner)

@item
Tandem (non-POSIX)

@item
Prestandard VAX C compiler for VAX/VMS

@item
GCC for VAX and Alpha has not been tested for a while.

@end itemize

@item
Support for the following obsolete system was removed from the code
for @command{gawk} @value{PVERSION} 4.1:

@c nested table
@itemize @value{MINUS}
@item
Ultrix
@end itemize

@item
Support for the following systems was removed from the code
for @command{gawk} @value{PVERSION} 4.2:

@c nested table
@itemize @value{MINUS}
@item
MirBSD

@item
GNU/Linux on Alpha
@end itemize

@end itemize

@c XXX ADD MORE STUFF HERE


@c This does not need to be in the formal book.
@ifclear FOR_PRINT
@node Feature History
@appendixsec History of @command{gawk} Features

@ignore
See the thread:
https://groups.google.com/forum/#!topic/comp.lang.awk/SAUiRuff30c
This motivated me to add this section.
@end ignore

@ignore
I've tried to follow this general order, esp.@: for the 3.0 and 3.1 sections:
	variables
	special files
	language changes (e.g., hex constants)
	differences in standard awk functions
	new gawk functions
	new keywords
	new command-line options
	behavioral changes
	extension API changes
	new / deprecated / removed ports
	installation time stuff
Within each category, be alphabetical.
@end ignore

This @value{SECTION} describes the features in @command{gawk}
over and above those in POSIX @command{awk},
in the order they were added to @command{gawk}.

Version 2.10 of @command{gawk} introduced the following features:

@itemize @value{BULLET}
@item
The @env{AWKPATH} environment variable for specifying a path search for
the @option{-f} command-line option
(@pxref{Options}).

@item
The @code{IGNORECASE} variable and its effects
(@pxref{Case-sensitivity}).

@item
The @file{/dev/stdin}, @file{/dev/stdout}, @file{/dev/stderr} and
@file{/dev/fd/@var{N}} special @value{FN}s
(@pxref{Special Files}).
@end itemize

Version 2.13 of @command{gawk} introduced the following features:

@itemize @value{BULLET}
@item
The @code{FIELDWIDTHS} variable and its effects
(@pxref{Constant Size}).

@item
The @code{systime()} and @code{strftime()} built-in functions for obtaining
and printing timestamps
(@pxref{Time Functions}).

@item
Additional command-line options
(@pxref{Options}):

@itemize @value{MINUS}
@item
The @option{-W lint} option to provide error and portability checking
for both the source code and at runtime.

@item
The @option{-W compat} option to turn off the GNU extensions.

@item
The @option{-W posix} option for full POSIX compliance.
@end itemize
@end itemize

Version 2.14 of @command{gawk} introduced the following feature:

@itemize @value{BULLET}
@item
The @code{next file} statement for skipping to the next @value{DF}
(@pxref{Nextfile Statement}).
@end itemize

Version 2.15 of @command{gawk} introduced the following features:

@itemize @value{BULLET}
@item
New variables (@pxref{Built-in Variables}):

@itemize @value{MINUS}
@item
@code{ARGIND}, which tracks the movement of @code{FILENAME}
through @code{ARGV}.

@item
@code{ERRNO}, which contains the system error message when
@code{getline} returns @minus{}1 or @code{close()} fails.
@end itemize

@item
The @file{/dev/pid}, @file{/dev/ppid}, @file{/dev/pgrpid}, and
@file{/dev/user} special @value{FN}s. These have since been removed.

@item
The ability to delete all of an array at once with @samp{delete @var{array}}
(@pxref{Delete}).

@item
Command-line option changes
(@pxref{Options}):

@itemize @value{MINUS}
@item
The ability to use GNU-style long-named options that start with @option{--}.

@item
The @option{--source} option for mixing command-line and library-file
source code.
@end itemize
@end itemize

Version 3.0 of @command{gawk} introduced the following features:

@itemize @value{BULLET}
@item
New or changed variables:

@itemize @value{MINUS}
@item
@code{IGNORECASE} changed, now applying to string comparison as well
as regexp operations
(@pxref{Case-sensitivity}).

@item
@code{RT}, which contains the input text that matched @code{RS}
(@pxref{Records}).
@end itemize

@item
Full support for both POSIX and GNU regexps
(@pxref{Regexp}).

@item
The @code{gensub()} function for more powerful text manipulation
(@pxref{String Functions}).

@item
The @code{strftime()} function acquired a default time format,
allowing it to be called with no arguments
(@pxref{Time Functions}).

@item
The ability for @code{FS} and for the third
argument to @code{split()} to be null strings
(@pxref{Single Character Fields}).

@item
The ability for @code{RS} to be a regexp
(@pxref{Records}).

@item
The @code{next file} statement became @code{nextfile}
(@pxref{Nextfile Statement}).

@item
The @code{fflush()} function from
BWK @command{awk}
(then at Bell Laboratories;
@pxref{I/O Functions}).

@item
New command-line options:

@itemize @value{MINUS}
@item
The @option{--lint-old} option to
warn about constructs that are not available in
the original Version 7 Unix version of @command{awk}
(@pxref{V7/SVR3.1}).

@item
The @option{-m} option from BWK @command{awk}. (Brian was
still at Bell Laboratories at the time.) This was later removed from
both his @command{awk} and from @command{gawk}.

@item
The @option{--re-interval} option to provide interval expressions in regexps
(@pxref{Regexp Operators}).

@item
The @option{--traditional} option was added as a better name for
@option{--compat} (@pxref{Options}).
@end itemize

@item
The use of GNU Autoconf to control the configuration process
(@pxref{Quick Installation}).

@item
Amiga support.
This has since been removed.

@end itemize

Version 3.1 of @command{gawk} introduced the following features:

@itemize @value{BULLET}
@item
New variables
(@pxref{Built-in Variables}):

@itemize @value{MINUS}
@item
@code{BINMODE}, for non-POSIX systems,
which allows binary I/O for input and/or output files
(@pxref{PC Using}).

@item
@code{LINT}, which dynamically controls lint warnings.

@item
@code{PROCINFO}, an array for providing process-related information.

@item
@code{TEXTDOMAIN}, for setting an application's internationalization text domain
(@pxref{Internationalization}).
@end itemize

@item
The ability to use octal and hexadecimal constants in @command{awk}
program source code
(@pxref{Nondecimal-numbers}).

@item
The @samp{|&} operator for two-way I/O to a coprocess
(@pxref{Two-way I/O}).

@item
The @file{/inet} special files for TCP/IP networking using @samp{|&}
(@pxref{TCP/IP Networking}).

@item
The optional second argument to @code{close()} that allows closing one end
of a two-way pipe to a coprocess
(@pxref{Two-way I/O}).

@item
The optional third argument to the @code{match()} function
for capturing text-matching subexpressions within a regexp
(@pxref{String Functions}).

@item
Positional specifiers in @code{printf} formats for
making translations easier
(@pxref{Printf Ordering}).

@item
A number of new built-in functions:

@itemize @value{MINUS}
@item
The @code{asort()} and @code{asorti()} functions for sorting arrays
(@pxref{Array Sorting}).

@item
The @code{bindtextdomain()}, @code{dcgettext()} and @code{dcngettext()} functions
for internationalization
(@pxref{Programmer i18n}).

@item
The @code{extension()} function and the ability to add
new built-in functions dynamically. This has since been removed.
It was replaced by the new extension mechanism.
@xref{Dynamic Extensions}.

@item
The @code{mktime()} function for creating timestamps
(@pxref{Time Functions}).

@item
The @code{and()}, @code{or()}, @code{xor()}, @code{compl()},
@code{lshift()}, @code{rshift()}, and @code{strtonum()} functions
(@pxref{Bitwise Functions}).
@end itemize

@item
@cindex @code{next file} statement
The support for @samp{next file} as two words was removed completely
(@pxref{Nextfile Statement}).

@item
Additional command-line options
(@pxref{Options}):

@itemize @value{MINUS}
@item
The @option{--dump-variables} option to print a list of all global variables.

@item
The @option{--exec} option, for use in CGI scripts.

@item
The @option{--gen-po} command-line option and the use of a leading
underscore to mark strings that should be translated
(@pxref{String Extraction}).

@item
The @option{--non-decimal-data} option to allow non-decimal
input data
(@pxref{Nondecimal Data}).

@item
The @option{--profile} option and @command{pgawk}, the
profiling version of @command{gawk}, for producing execution
profiles of @command{awk} programs
(@pxref{Profiling}).

@item
The @option{--use-lc-numeric} option to force @command{gawk}
to use the locale's decimal point for parsing input data
(@pxref{Conversion}).
@end itemize

@item
The use of GNU Automake to help in standardizing the configuration process
(@pxref{Quick Installation}).

@item
The use of GNU @command{gettext} for @command{gawk}'s own message output
(@pxref{Gawk I18N}).

@item
BeOS support.
This was later removed.

@item
Tandem support. This was later removed.

@item
The Atari port became officially unsupported and was
later removed entirely.

@item
The source code changed to use ISO C standard-style function definitions.

@item
POSIX compliance for @code{sub()} and @code{gsub()}
(@pxref{Gory Details}).

@item
The @code{length()} function was extended to accept an array argument
and return the number of elements in the array
(@pxref{String Functions}).

@item
The @code{strftime()} function acquired a third argument to
enable printing times as UTC
(@pxref{Time Functions}).
@end itemize

Version 4.0 of @command{gawk} introduced the following features:

@itemize @value{BULLET}

@item
Variable additions:

@itemize @value{MINUS}
@item
@code{FPAT}, which allows you to specify a regexp that matches
the fields, instead of matching the field separator
(@pxref{Splitting By Content}).

@item
If @code{PROCINFO["sorted_in"]} exists, @samp{for (iggy in foo)} loops sort the
indices before looping over them. The value of this element
provides control over how the indices are sorted before the loop
traversal starts
(@pxref{Controlling Scanning}).

@item
@code{PROCINFO["strftime"]}, which holds
the default format for @code{strftime()}
(@pxref{Time Functions}).
@end itemize

@item
The special files @file{/dev/pid}, @file{/dev/ppid}, @file{/dev/pgrpid}
and @file{/dev/user} were removed.

@item
Support for IPv6 was added via the @file{/inet6} special file.
@file{/inet4} forces IPv4 and @file{/inet} chooses the system
default, which is probably IPv4
(@pxref{TCP/IP Networking}).

@item
The use of @samp{\s} and @samp{\S} escape sequences in regular expressions
(@pxref{GNU Regexp Operators}).

@item
Interval expressions became part of default regular expressions
(@pxref{Regexp Operators}).

@item
POSIX character classes work even with @option{--traditional}
(@pxref{Regexp Operators}).

@item
@code{break} and @code{continue} became invalid outside a loop,
even with @option{--traditional}
(@pxref{Break Statement}, and also see
@ref{Continue Statement}).

@item
@code{fflush()}, @code{nextfile}, and @samp{delete @var{array}}
are allowed if @option{--posix} or @option{--traditional}, since they
are all now part of POSIX.

@item
An optional third argument to
@code{asort()} and @code{asorti()}, specifying how to sort
(@pxref{String Functions}).

@item
The behavior of @code{fflush()} changed to match BWK @command{awk}
and for POSIX; now both @samp{fflush()} and @samp{fflush("")}
flush all open output redirections
(@pxref{I/O Functions}).

@item
The @code{isarray()}
function which distinguishes if an item is an array
or not, to make it possible to traverse arrays of arrays
(@pxref{Type Functions}).

@item
The @code{patsplit()}
function which gives the same capability as @code{FPAT}, for splitting
(@pxref{String Functions}).

@item
An optional fourth argument to the @code{split()} function,
which is an array to hold the values of the separators
(@pxref{String Functions}).

@item
Arrays of arrays
(@pxref{Arrays of Arrays}).

@item
The @code{BEGINFILE} and @code{ENDFILE} special patterns
(@pxref{BEGINFILE/ENDFILE}).

@item
Indirect function calls
(@pxref{Indirect Calls}).

@item
@code{switch} / @code{case} are enabled by default
(@pxref{Switch Statement}).

@item
Command-line option changes
(@pxref{Options}):

@itemize @value{MINUS}
@item
The @option{-b} and @option{--characters-as-bytes} options
which prevent @command{gawk} from treating input as a multibyte string.

@item
The redundant @option{--compat}, @option{--copyleft}, and @option{--usage}
long options were removed.

@item
The @option{--gen-po} option was finally renamed to the correct @option{--gen-pot}.

@item
The @option{--sandbox} option which disables certain features.

@item
All long options acquired corresponding short options, for use in @samp{#!} scripts.
@end itemize

@item
Directories named on the command line now produce a warning, not a fatal
error, unless @option{--posix} or @option{--traditional} are used
(@pxref{Command-line directories}).

@item
The @command{gawk} internals were rewritten, bringing the @command{dgawk}
debugger and possibly improved performance
(@pxref{Debugger}).

@item
Per the GNU Coding Standards, dynamic extensions must now define
a global symbol indicating that they are GPL-compatible
(@pxref{Plugin License}).

@item
@cindex POSIX mode
In POSIX mode, string comparisons use @code{strcoll()} / @code{wcscoll()}
(@pxref{POSIX String Comparison}).

@item
The option for raw sockets was removed, since it was never implemented
(@pxref{TCP/IP Networking}).

@item
Ranges of the form @samp{[d-h]} are treated as if they were in the
C locale, no matter what kind of regexp is being used, and even if
@option{--posix}
(@pxref{Ranges and Locales}).

@item
Support was removed for the following systems:

@itemize @value{MINUS}
@item
Atari

@item
Amiga

@item
BeOS

@item
Cray

@item
MIPS RiscOS

@item
MS-DOS with the Microsoft Compiler

@item
MS-Windows with the Microsoft Compiler

@item
NeXT

@item
SunOS 3.x, Sun 386 (Road Runner)

@item
Tandem (non-POSIX)

@item
Prestandard VAX C compiler for VAX/VMS
@end itemize
@end itemize

Version 4.1 of @command{gawk} introduced the following features:

@itemize @value{BULLET}

@item
Three new arrays:
@code{SYMTAB}, @code{FUNCTAB}, and @code{PROCINFO["identifiers"]}
(@pxref{Auto-set}).

@item
The three executables @command{gawk}, @command{pgawk}, and @command{dgawk}, were merged into
one, named just @command{gawk}. As a result the command-line options changed.

@item
Command-line option changes
(@pxref{Options}):

@itemize @value{MINUS}
@item
The @option{-D} option invokes the debugger.

@item
The @option{-i} and @option{--include} options
load @command{awk} library files.

@item
The @option{-l} and @option{--load} options load compiled dynamic extensions.

@item
The @option{-M} and @option{--bignum} options enable MPFR.

@item
The @option{-o} option only does pretty-printing.

@item
The @option{-p} option is used for profiling.

@item
The @option{-R} option was removed.
@end itemize

@item
Support for high precision arithmetic with MPFR
(@pxref{Arbitrary Precision Arithmetic}).

@item
The @code{and()}, @code{or()} and @code{xor()} functions
changed to allow any number of arguments,
with a minimum of two
(@pxref{Bitwise Functions}).

@item
The dynamic extension interface was completely redone
(@pxref{Dynamic Extensions}).

@item
Redirected @code{getline} became allowed inside
@code{BEGINFILE} and @code{ENDFILE}
(@pxref{BEGINFILE/ENDFILE}).

@item
The @code{where} command was added to the debugger
(@pxref{Execution Stack}).

@item
Support for Ultrix was removed.

@end itemize

Version 4.2 of @command{gawk} introduced the following changes:

@itemize @bullet
@item
Changes to @code{ENVIRON} are reflected into @command{gawk}'s
environment and that of programs that it runs.
@xref{Auto-set}.

@item
@code{FIELDWIDTHS} was enhanced to allow skipping characters
before assigning a value to a field
(@pxref{Splitting By Content}).

@item
The @code{PROCINFO["argv"]} array.
@xref{Auto-set}.

@item
The maximum number of hexadecimal digits in @samp{\x} escapes
is now two.
@xref{Escape Sequences}.

@item
Strongly typed regexp constants of the form @samp{@@/@dots{}/}
(@pxref{Strong Regexp Constants}).

@item
The bitwise functions changed, making negative arguments into
a fatal error (@pxref{Bitwise Functions}).

@ifset INTDIV
@item
The @code{intdiv0()} function.
@xref{Numeric Functions}.
@end ifset

@item
The @code{mktime()} function now accepts an optional
second argument
(@pxref{Time Functions}).

@item
The @code{typeof()} function (@pxref{Type Functions}).

@item
Optimizations are enabled by default. Use @option{-s} /
@option{--no-optimize} to disable optimizations.

@item
For many years, POSIX specified that default field splitting
only allowed spaces and tabs to separate fields, and this was
how @command{gawk} behaved with @option{--posix}. As of 2013,
the standard restored historical behavior, and now default
field splitting with @option{--posix} also allows newlines to
separate fields.

@item
Nonfatal output with @code{print} and @code{printf}.
@xref{Nonfatal}.

@item
Retryable I/O via @code{PROCINFO[@var{input-file}, "RETRY"]}
(@pxref{Retrying Input}).

@item
Changes to the pretty-printer (@pxref{Profiling}):

@c nested table
@itemize @value{MINUS}
@item
The @option{--pretty-print} option no longer runs the @command{awk}
program too.

@item
Comments in the source program are preserved and placed into the
output file.

@item
Explicit parentheses for expressions
in the input are preserved in the generated output.
@end itemize

@item
Improvements to the extension API
(@pxref{Dynamic Extensions}):

@c nested
@itemize @value{MINUS}
@item
The @code{get_file()} function to access open redirections.

@item
The @code{nonfatal()} function for generating nonfatal error messages.

@item
Support for GMP and MPFR values.

@item
Input parsers can now override the default field parsing mechanism
by specifying explicit locations.
@end itemize

@item
Shell startup files are supplied with the distribution and
installed by @samp{make install} (@pxref{Shell Startup Files}).

@item
The @command{igawk} program and its manual page are no longer
installed when @command{gawk} is built.
@xref{Igawk Program}.

@item
Support for MirBSD was removed.

@item
Support for GNU/Linux on Alpha was removed.

@end itemize

Version 5.0 added the following features:

@itemize
@item
The @code{PROCINFO["platform"]} array element, which allows you
to write code that takes the operating system / platform into account.
@end itemize

Version 5.1 was created to release @command{gawk} with a correct
major version number for the API. This was overlooked for version 5.0,
unfortunately. It added the following features:

@itemize
@item
The index for this manual was completely reworked.

@item
Support was added for MSYS2.

@item
@code{asort()} and @code{asorti()} were changed to
allow @code{FUNCTAB} and @code{SYMTAB} as the first argument if a
second destination array is supplied (@pxref{String Functions}).

@item
The @option{-I}/@option{--trace} options were added to
print a trace of the byte codes as they execute (@pxref{Options}).

@item
@code{$0} and the fields are now cleared before starting a
@code{BEGINFILE} rule (@pxref{BEGINFILE/ENDFILE}).

@item
Several example programs in the manual were updated to their modern
POSIX equivalents.

@item
The ``no effect'' lint warnings from @option{--lint} were fixed up
and now behave more sanely (@pxref{Options}).

@item
Handling of Infinity and NaN values was improved.
@xref{Math Definitions}, and also see
@ref{POSIX Floating Point Problems}.
@end itemize

@c XXX ADD MORE STUFF HERE
@end ifclear

@node Common Extensions
@appendixsec Common Extensions Summary

@cindex extensions @subentry Brian Kernighan's @command{awk}
@cindex extensions @subentry @command{mawk}
The following table summarizes the common extensions supported
by @command{gawk}, Brian Kernighan's @command{awk}, and @command{mawk},
the three most widely used freely available versions of @command{awk}
(@pxref{Other Versions}).

@multitable {@file{/dev/stderr} special file} {BWK @command{awk}} {@command{mawk}} {@command{gawk}} {Now standard}
@headitem Feature @tab BWK @command{awk} @tab @command{mawk} @tab @command{gawk} @tab Now standard
@item @samp{\x} escape sequence @tab X @tab X @tab X @tab
@item @code{FS} as null string @tab X @tab X @tab X @tab
@item @file{/dev/stdin} special file @tab X @tab X @tab X @tab
@item @file{/dev/stdout} special file @tab X @tab X @tab X @tab
@item @file{/dev/stderr} special file @tab X @tab X @tab X @tab
@item @code{delete} without subscript @tab X @tab X @tab X @tab X
@item @code{fflush()} function @tab X @tab X @tab X @tab X
@item @code{length()} of an array @tab X @tab X @tab X @tab
@item @code{nextfile} statement @tab X @tab X @tab X @tab X
@item @code{**} and @code{**=} operators @tab X @tab @tab X @tab
@item @code{func} keyword @tab X @tab @tab X @tab
@item @code{BINMODE} variable @tab @tab X @tab X @tab
@item @code{RS} as regexp @tab X @tab X @tab X @tab
@item Time-related functions @tab @tab X @tab X @tab
@end multitable

@node Ranges and Locales
@appendixsec Regexp Ranges and Locales: A Long Sad Story

This @value{SECTION} describes the confusing history of ranges within
regular expressions and their interactions with locales, and how this
affected different versions of @command{gawk}.

@cindex ASCII
@cindex EBCDIC
The original Unix tools that worked with regular expressions defined
character ranges (such as @samp{[a-z]}) to match any character between
the first character in the range and the last character in the range,
inclusive. Ordering was based on the numeric value of each character
in the machine's native character set. Thus, on ASCII-based systems,
@samp{[a-z]} matched all the lowercase letters, and only the lowercase
letters, as the numeric values for the letters from @samp{a} through
@samp{z} were contiguous. (On an EBCDIC system, the range @samp{[a-z]}
includes additional nonalphabetic characters as well.)

Almost all introductory Unix literature explained range expressions
as working in this fashion, and in particular, would teach that the
``correct'' way to match lowercase letters was with @samp{[a-z]}, and
that @samp{[A-Z]} was the ``correct'' way to match uppercase letters.
And indeed, this was true.@footnote{And Life was good.}

The 1992 POSIX standard introduced the idea of locales (@pxref{Locales}).
Because many locales include other letters besides the plain 26
letters of the English alphabet, the POSIX standard added
character classes (@pxref{Bracket Expressions}) as a way to match
different kinds of characters besides the traditional ones in the ASCII
character set.

However, the standard @emph{changed} the interpretation of range expressions.
In the @code{"C"} and @code{"POSIX"} locales, a range expression like
@samp{[a-dx-z]} is still equivalent to @samp{[abcdxyz]}, as in ASCII.
But outside those locales, the ordering was defined to be based on
@dfn{collation order}.

What does that mean?
In many locales, @samp{A} and @samp{a} are both less than @samp{B}.
In other words, these locales sort characters in dictionary order,
and @samp{[a-dx-z]} is typically not equivalent to @samp{[abcdxyz]};
instead, it might be equivalent to @samp{[ABCXYabcdxyz]}, for example.

This point needs to be emphasized: much literature teaches that you should
use @samp{[a-z]} to match a lowercase character. But on systems with
non-ASCII locales, this also matches all of the uppercase characters
except @samp{A} or @samp{Z}! This was a continuous cause of confusion, even well
into the twenty-first century.

To demonstrate these issues, the following example uses the @code{sub()}
function, which does text replacement (@pxref{String Functions}). Here,
the intent is to remove trailing uppercase characters:

@example
$ @kbd{echo something1234abc | gawk-3.1.8 '@{ sub("[A-Z]*$", ""); print @}'}
@print{} something1234a
@end example

@noindent
This output is unexpected, as the @samp{bc} at the end of
@samp{something1234abc} should not normally match @samp{[A-Z]*}.
This result is due to the locale setting (and thus you may not see
it on your system).

@cindex Unicode
@cindex ASCII
Similar considerations apply to other ranges. For example, @samp{["-/]}
is perfectly valid in ASCII, but is not valid in many Unicode locales,
such as @code{en_US.UTF-8}.

Early versions of @command{gawk} used regexp matching code that was not
locale-aware, so ranges had their traditional interpretation.

When @command{gawk} switched to using locale-aware regexp matchers,
the problems began; especially as both GNU/Linux and commercial Unix
vendors started implementing non-ASCII locales, @emph{and making them
the default}.
Perhaps the most frequently asked question became something
like, ``Why does @samp{[A-Z]} match lowercase letters?!?''

@cindex Berry, Karl
This situation existed for close to 10 years, if not more, and
the @command{gawk} maintainer grew weary of trying to explain that
@command{gawk} was being nicely standards-compliant, and that the issue
was in the user's locale. During the development of @value{PVERSION} 4.0,
he modified @command{gawk} to always treat ranges in the original,
pre-POSIX fashion, unless @option{--posix} was used (@pxref{Options}).@footnote{And
thus was born the Campaign for Rational Range Interpretation (or
RRI). A number of GNU tools have already implemented this change,
or will soon. Thanks to Karl Berry for coining the phrase ``Rational
Range Interpretation.''}

Fortunately, shortly before the final release of @command{gawk} 4.0,
the maintainer learned that the 2008 standard had changed the
definition of ranges, such that outside the @code{"C"} and @code{"POSIX"}
locales, the meaning of range expressions was @emph{undefined}.@footnote{See
@uref{https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_03_05, the standard}
and
@uref{https://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xbd_chap09.html#tag_21_09_03_05, its rationale}.}

By using this lovely technical term, the standard gives license
to implementers to implement ranges in whatever way they choose.
The @command{gawk} maintainer chose to apply the pre-POSIX meaning
both with the default regexp matching and when @option{--traditional} or
@option{--posix} are used.
In all cases @command{gawk} remains POSIX-compliant.
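
For readers who must still run an @command{awk} whose ranges follow the
locale's collation order, one possible workaround is sketched here. It is
an illustration only, not something this @value{DOCUMENT} prescribes: it
forces the @code{"C"} locale for a single @command{awk} invocation, where
POSIX defines ranges by character value. The function name
@code{strip_trailing_upper} is hypothetical and exists only for this example.

```shell
# Hypothetical helper (name chosen for illustration only): remove trailing
# uppercase ASCII letters from each input line.  LC_ALL=C selects the "C"
# locale for this one awk process, so [A-Z] means exactly A through Z.
strip_trailing_upper() {
    LC_ALL=C awk '{ sub(/[A-Z]*$/, ""); print }'
}

# No trailing uppercase letters here, so the line passes through unchanged.
echo something1234abc | strip_trailing_upper

# Only the trailing "ABC" is removed.
echo something1234ABC | strip_trailing_upper
```

Setting @env{LC_ALL} only for the one command leaves the rest of the
shell session, and any locale-dependent output it produces, untouched.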

@node Contributors
@appendixsec Major Contributors to @command{gawk}
@cindex @command{gawk} @subentry list of contributors to
@quotation
@i{Always give credit where credit is due.}
@author Anonymous
@end quotation

This @value{SECTION} names the major contributors to @command{gawk}
and/or this @value{DOCUMENT}, in approximate chronological order:

@itemize @value{BULLET}
@item
@cindex Aho, Alfred
@cindex Weinberger, Peter
@cindex Kernighan, Brian
Dr.@: Alfred V.@: Aho,
Dr.@: Peter J.@: Weinberger, and
Dr.@: Brian W.@: Kernighan, all of Bell Laboratories,
designed and implemented Unix @command{awk},
from which @command{gawk} gets the majority of its feature set.

@item
@cindex Rubin, Paul
Paul Rubin
did the initial design and implementation in 1986, and wrote
the first draft (around 40 pages) of this @value{DOCUMENT}.

@item
@cindex Fenlason, Jay
Jay Fenlason
finished the initial implementation.

@item
@cindex Close, Diane
Diane Close
revised the first draft of this @value{DOCUMENT}, bringing it
to around 90 pages.

@item
@cindex Stallman, Richard
Richard Stallman
helped finish the implementation and the initial draft of this
@value{DOCUMENT}.
He is also the founder of the FSF and the GNU Project.

@item
@cindex Woods, John
John Woods
contributed parts of the code (mostly fixes) in
the initial version of @command{gawk}.

@item
@cindex Trueman, David
In 1988,
David Trueman
took over primary maintenance of @command{gawk},
making it compatible with ``new'' @command{awk}, and
greatly improving its performance.

@item
@cindex Kwok, Conrad
@cindex Garfinkle, Scott
@cindex Williams, Kent
Conrad Kwok,
Scott Garfinkle,
and
Kent Williams
did the initial ports to MS-DOS with various versions of MSC.

@item
@cindex Rankin, Pat
Pat Rankin
provided the VMS port and its documentation.

@item
@cindex Peterson, Hal
Hal Peterson
provided help in porting @command{gawk} to Cray systems.
(This is no longer supported.)

@item
@cindex Rommel, Kai Uwe
Kai Uwe Rommel
provided the initial port to OS/2 and its documentation.

@item
@cindex Jaegermann, Michal
Michal Jaegermann
provided the port to Atari systems and its documentation.
(This port is no longer supported.)
He continues to provide portability checking,
and has done a lot of work to make sure @command{gawk}
works on non-32-bit systems.

@item
@cindex Fish, Fred
Fred Fish
provided the port to Amiga systems and its documentation.
(With Fred's sad passing, this is no longer supported.)

@item
@cindex Deifik, Scott
Scott Deifik
formerly maintained the MS-DOS port using DJGPP.

@item
@cindex Zaretskii, Eli
Eli Zaretskii
currently maintains the MS-Windows port using MinGW.

@item
@cindex Grigera, Juan
Juan Grigera
provided a port to Windows32 systems.
(This is no longer supported.)

@item
@cindex Hankerson, Darrel
For many years,
Dr.@: Darrel Hankerson
acted as coordinator for the various ports to different PC platforms
and created binary distributions for various PC operating systems.
He was also instrumental in keeping the documentation up to date for
the various PC platforms.

@item
@cindex Zoulas, Christos
Christos Zoulas
provided the @code{extension()}
built-in function for dynamically adding new functions.
(This was obsoleted at @command{gawk} 4.1.)

@item
@cindex Kahrs, J@"urgen
J@"urgen Kahrs
contributed the initial version of the TCP/IP networking
code and documentation, and motivated the inclusion of the @samp{|&} operator.

@item
@cindex Davies, Stephen
Stephen Davies
provided the initial port to Tandem systems and its documentation.
(However, this is no longer supported.)
He was also instrumental in the initial work to integrate the
byte-code internals into the @command{gawk} code base.
Additionally, he did most of the work enabling the pretty-printer
to preserve and output comments.

@item
@cindex Woehlke, Matthew
Matthew Woehlke
provided improvements for Tandem's POSIX-compliant systems.

@item
@cindex Brown, Martin
Martin Brown
provided the port to BeOS and its documentation.
(This is no longer supported.)

@item
@cindex Peters, Arno
Arno Peters
did the initial work to convert @command{gawk} to use
GNU Automake and GNU @command{gettext}.

@item
@cindex Broder, Alan J.@:
Alan J.@: Broder
provided the initial version of the @code{asort()} function
as well as the code for the optional third argument to the
@code{match()} function.

@item
@cindex Buening, Andreas
Andreas Buening
updated the @command{gawk} port for OS/2.

@item
@cindex Hasegawa, Isamu
Isamu Hasegawa,
of IBM in Japan, contributed support for multibyte characters.

@item
@cindex Benzinger, Michael
Michael Benzinger contributed the initial code for @code{switch} statements.

@item
@cindex McPhee, Patrick T.J.@:
Patrick T.J.@: McPhee contributed the code for dynamic loading in Windows32
environments.
(This is no longer supported.)

@item
@cindex Wallin, Anders
Anders Wallin helped keep the VMS port going for several years.

@item
@cindex Gordon, Assaf
Assaf Gordon contributed the initial code to implement the
@option{--sandbox} option.

@item
@cindex Haque, John
John Haque made the following contributions:

@itemize @value{MINUS}
@item
The modifications to convert @command{gawk}
into a byte-code interpreter, including the debugger

@item
The addition of true arrays of arrays

@item
The additional modifications for support of arbitrary-precision arithmetic

@item
The initial text of
@ref{Arbitrary Precision Arithmetic}

@item
The work to merge the three versions of @command{gawk}
into one, for the 4.1 release

@item
Improved array internals for arrays indexed by integers

@item
The improved array sorting features were also driven by John, together
with Pat Rankin
@end itemize

@cindex Papadopoulos, Panos
@item
Panos Papadopoulos contributed the original text for @ref{Include Files}.

@item
@cindex Yawitz, Efraim
Efraim Yawitz contributed the original text for @ref{Debugger}.

@item
@cindex Schorr, Andrew
The development of the extension API first released with
@command{gawk} 4.1 was driven primarily by
Arnold Robbins and Andrew Schorr, with notable contributions from
the rest of the development team.

@cindex Malmberg, John
@item
John Malmberg contributed significant improvements to the
OpenVMS port and the related documentation.

@item
@cindex Colombo, Antonio
Antonio Giovanni Colombo rewrote a number of examples in the early
chapters that were severely dated, for which I am incredibly grateful.
He also provided and maintains the Italian translation.

@item
@cindex Curreli, Marco
Marco Curreli, together with Antonio Colombo, translated this
@value{DOCUMENT} into Italian. It is included in the @command{gawk}
distribution.

@item
@cindex Guerrero, Juan Manuel
Juan Manuel Guerrero took over maintenance of the DJGPP port.

@item
@cindex Jannick
``Jannick'' provided support for MSYS2.

@item
@cindex Robbins @subentry Arnold
Arnold Robbins
has been working on @command{gawk} since 1988, at first
helping David Trueman, and as the primary maintainer since around 1994.
@end itemize

@node History summary
@appendixsec Summary

@itemize @value{BULLET}
@item
The @command{awk} language has evolved over time. The first release
was with V7 Unix, circa 1978. In 1987, for System V Release 3.1,
major additions, including user-defined functions, were made to the language.
Additional changes were made for System V Release 4, in 1989.
Since then, further minor changes have happened under the auspices of the
POSIX standard.

@item
Brian Kernighan's @command{awk} provides a small number of extensions
that are implemented in common with other versions of @command{awk}.

@item
@command{gawk} provides a large number of extensions over POSIX @command{awk}.
They can be disabled with either the @option{--traditional} or @option{--posix}
options.

@item
@cindex ASCII
@cindex EBCDIC
The interaction of POSIX locales and regexp matching in @command{gawk} has been confusing over
the years.
Today, @command{gawk} implements Rational Range Interpretation, where
ranges of the form @samp{[a-z]} match @emph{only} the characters numerically between
@samp{a} and @samp{z}, inclusive, in the machine's native character set. Usually this is ASCII,
but it can be EBCDIC on IBM S/390 systems.

@item
Many people have contributed to @command{gawk} development over the years.
We hope that the list provided in this @value{CHAPTER} is complete and gives
the appropriate credit where credit is due.

@end itemize

@node Installation
@appendix Installing @command{gawk}

@c last two commas are part of see also
@cindex operating systems
@cindex operating systems @seealso{GNU/Linux}
@cindex operating systems @seealso{PC operating systems}
@cindex operating systems @seealso{Unix}
@cindex @command{gawk} @subentry installing
@cindex installing @command{gawk}
This appendix provides instructions for installing @command{gawk} on the
various platforms that are supported by the developers. The primary
developer supports GNU/Linux (and Unix), whereas the other ports are
contributed.
@xref{Bugs},
for the email addresses of the people who maintain
the respective ports.

@menu
* Gawk Distribution::           What is in the @command{gawk} distribution.
* Unix Installation::           Installing @command{gawk} under various
                                versions of Unix.
* Non-Unix Installation::       Installation on Other Operating Systems.
* Bugs::                        Reporting Problems and Bugs.
* Other Versions::              Other freely available @command{awk}
                                implementations.
* Installation summary::        Summary of installation.
@end menu

@node Gawk Distribution
@appendixsec The @command{gawk} Distribution
@cindex source code @subentry @command{gawk}

This @value{SECTION} describes how to get the @command{gawk}
distribution, how to extract it, and then what is in the various files and
subdirectories.

@menu
* Getting::                     How to get the distribution.
* Extracting::                  How to extract the distribution.
* Distribution contents::       What is in the distribution.
@end menu

@node Getting
@appendixsubsec Getting the @command{gawk} Distribution
@cindex @command{gawk} @subentry source code, obtaining
There are two ways to get GNU software:

@itemize @value{BULLET}
@item
Copy it from someone else who already has it.

@cindex FSF (Free Software Foundation)
@cindex Free Software Foundation (FSF)
@item
Retrieve @command{gawk}
from the Internet host
@code{ftp.gnu.org}, in the directory @file{/gnu/gawk}.
Both anonymous @command{ftp} and @code{http} access are supported.
If you have the @command{wget} program, you can use a command like
the following:

@example
wget https://ftp.gnu.org/gnu/gawk/gawk-@value{VERSION}.@value{PATCHLEVEL}.tar.gz
@end example
@end itemize

The GNU software archive is mirrored around the world.
The up-to-date list of mirror sites is available from
@uref{https://www.gnu.org/order/ftp.html, the main FSF website}.
Try to use one of the mirrors; they
will be less busy, and you can usually find one closer to your site.

You may also retrieve the @command{gawk} source code from the official
Git repository; for more information see @ref{Accessing The Source}.

@node Extracting
@appendixsubsec Extracting the Distribution
@command{gawk} is distributed as several @command{tar} files compressed with
different compression programs: @command{gzip}, @command{bzip2},
and @command{xz}. For simplicity, the rest of these instructions assume
you are using the one compressed with the GNU Gzip program (@command{gzip}).

Once you have the distribution (e.g.,
@file{gawk-@value{VERSION}.@value{PATCHLEVEL}.tar.gz}),
use @command{gzip} to expand the
file and then use @command{tar} to extract it. You can use the following
pipeline to produce the @command{gawk} distribution:

@example
gzip -d -c gawk-@value{VERSION}.@value{PATCHLEVEL}.tar.gz | tar -xvpf -
@end example

On a system with GNU @command{tar}, you can let @command{tar}
do the decompression for you:

@example
tar -xvpzf gawk-@value{VERSION}.@value{PATCHLEVEL}.tar.gz
@end example

@noindent
Extracting the archive
creates a directory named @file{gawk-@value{VERSION}.@value{PATCHLEVEL}}
in the current directory.

The distribution @value{FN} is of the form
@file{gawk-@var{V}.@var{R}.@var{P}.tar.gz}.
The @var{V} represents the major version of @command{gawk},
the @var{R} represents the current release of version @var{V}, and
the @var{P} represents a @dfn{patch level}, meaning that minor bugs have
been fixed in the release. The current patch level is @value{PATCHLEVEL},
but when retrieving distributions, you should get the version with the highest
version, release, and patch level. (Note, however, that patch levels greater than
or equal to 60 denote ``beta'' or nonproduction software; you might not want
to retrieve such a version unless you don't mind experimenting.)
If you are not on a Unix or GNU/Linux system, you need to make other arrangements
for getting and extracting the @command{gawk} distribution. You should consult
a local expert.

@node Distribution contents
@appendixsubsec Contents of the @command{gawk} Distribution
@cindex @command{gawk} @subentry distribution

The @command{gawk} distribution has a number of C source files,
documentation files,
subdirectories, and files related to the configuration process
(@pxref{Unix Installation}),
as well as several subdirectories related to different non-Unix
operating systems:

@table @asis
@item Various @samp{.c}, @samp{.y}, and @samp{.h} files
These files contain the actual @command{gawk} source code.
@end table

@table @file
@item support/*
C header and source files for routines that @command{gawk}
uses, but that are not part of its core functionality.
For example, argument parsing, regular expression matching,
and random number generating routines are all kept here.

@item ABOUT-NLS
A file containing information about GNU @command{gettext} and translations.

@item AUTHORS
A file with some information about the authorship of @command{gawk}.
It exists only to satisfy the pedants at the Free Software Foundation.

@item README
@itemx README_d/README.*
Descriptive files: @file{README} for @command{gawk} under Unix and the
rest for the various hardware and software combinations.

@item INSTALL
A file providing an overview of the configuration and installation process.

@item ChangeLog
A detailed list of source code changes as bugs are fixed or improvements made.
There are similar files in all of the subdirectories.

@item ChangeLog.0
@itemx ChangeLog.1
Older lists of source code changes.
There are similar files in all of the subdirectories.

@item NEWS
A list of changes to @command{gawk} since the last release or patch.
There may be similar files in other subdirectories.

@item NEWS.0
@itemx NEWS.1
Older lists of changes to @command{gawk}.
There may be similar files in other subdirectories.

@item COPYING
The GNU General Public License.

@item POSIX.STD
A description of behaviors in the POSIX standard for @command{awk} that
are left undefined, or where @command{gawk} may not comply fully, as well
as a list of things that the POSIX standard should describe but does not.

@cindex artificial intelligence, @command{gawk} and
@item doc/awkforai.txt
Pointers to the original draft of
a short article describing why @command{gawk} is a good language for
artificial intelligence (AI) programming.

@item doc/bc_notes
A brief description of @command{gawk}'s ``byte code'' internals.

@item doc/README.card
@itemx doc/ad.block
@itemx doc/awkcard.in
@itemx doc/cardfonts
@itemx doc/colors
@itemx doc/macros
@itemx doc/no.colors
@itemx doc/setter.outline
The @command{troff} source for a five-color @command{awk} reference card.
A modern version of @command{troff} such as GNU @command{troff} (@command{groff}) is
needed to produce the color version. See the file @file{README.card}
for instructions if you have an older @command{troff}.

@item doc/gawk.1
The @command{troff} source for a manual page describing @command{gawk}.
This is distributed for the convenience of Unix users.

@cindex Texinfo
@item doc/gawktexi.in
@itemx doc/sidebar.awk
The Texinfo source file for this @value{DOCUMENT}.
It should be processed by @file{doc/sidebar.awk}
before processing with @command{texi2dvi} or @command{texi2pdf}
to produce a printed document, and
with @command{makeinfo} to produce an Info or HTML file.
The @file{Makefile} takes care of this processing and produces
printable output via @command{texi2dvi} or @command{texi2pdf}.

@item doc/gawk.texi
The file produced after processing @file{gawktexi.in}
with @file{sidebar.awk}.

@item doc/gawk.info
The generated Info file for this @value{DOCUMENT}.

@item doc/gawkinet.texi
The Texinfo source file for
@ifinfo
@inforef{Top, , General Introduction, gawkinet, @value{GAWKINETTITLE}}.
@end ifinfo
@ifnotinfo
@cite{@value{GAWKINETTITLE}}.
@end ifnotinfo
It should be processed with @TeX{}
(via @command{texi2dvi} or @command{texi2pdf})
to produce a printed document and
with @command{makeinfo} to produce an Info or HTML file.

@item doc/gawkinet.info
The generated Info file for
@cite{@value{GAWKINETTITLE}}.

@item doc/gawkworkflow.texi
The Texinfo source file for
@ifinfo
@inforef{Top, , General Introduction, gawkworkflow, @value{GAWKWORKFLOWTITLE}}.
@end ifinfo
@ifnotinfo
@cite{@value{GAWKWORKFLOWTITLE}}.
@end ifnotinfo
It should be processed with @TeX{}
(via @command{texi2dvi} or @command{texi2pdf})
to produce a printed document and
with @command{makeinfo} to produce an Info or HTML file.

@item doc/gawkworkflow.info
The generated Info file for
@cite{@value{GAWKWORKFLOWTITLE}}.

@item doc/igawk.1
The @command{troff} source for a manual page describing the @command{igawk}
program presented in
@ref{Igawk Program}.
(Since @command{gawk} can do its own @code{@@include} processing,
neither @command{igawk} nor @file{igawk.1} are installed.)

@item doc/it/*
Files for the Italian translation of this @value{DOCUMENT}, produced and
contributed by Antonio Colombo and Marco Curreli.

@item doc/Makefile.in
The input file used during the configuration process to generate the
actual @file{Makefile} for creating the documentation.

@item Makefile.am
@itemx */Makefile.am
Files used by the GNU Automake software for generating
the @file{Makefile.in} files used by Autoconf and
@command{configure}.

@item Makefile.in
@itemx aclocal.m4
@itemx bisonfix.awk
@itemx config.guess
@itemx configh.in
@itemx configure.ac
@itemx configure
@itemx custom.h
@itemx depcomp
@itemx install-sh
@itemx missing_d/*
@itemx mkinstalldirs
@itemx m4/*
These files and subdirectories are used when configuring and compiling
@command{gawk} for various Unix systems. Most of them are explained
in @ref{Unix Installation}. The rest are there to support the main
infrastructure.

@item po/*
The @file{po} directory contains message translations.

@item awklib/extract.awk
@itemx awklib/Makefile.am
@itemx awklib/Makefile.in
@itemx awklib/eg/*
The @file{awklib} directory contains a copy of @file{extract.awk}
(@pxref{Extract Program}),
which can be used to extract the sample programs from the Texinfo
source file for this @value{DOCUMENT}. It also contains a @file{Makefile.in} file, which
@command{configure} uses to generate a @file{Makefile}.
@file{Makefile.am} is used by GNU Automake to create @file{Makefile.in}.
The library functions from
@ref{Library Functions},
are included as ready-to-use files in the @command{gawk} distribution.
They are installed as part of the installation process.
The rest of the programs in this @value{DOCUMENT} are available in appropriate
subdirectories of @file{awklib/eg}.

@item extension/*
The source code, manual pages, and infrastructure files for
the sample extensions included with @command{gawk}.
@xref{Dynamic Extensions}, for more information.

@item extras/*
Additional non-essential files. Currently, this directory contains some shell
startup files to be installed in @file{/etc/profile.d} to aid in manipulating
the @env{AWKPATH} and @env{AWKLIBPATH} environment variables.
@xref{Shell Startup Files}, for more information.

@item posix/*
Files needed for building @command{gawk} on POSIX-compliant systems.

@item pc/*
Files needed for building @command{gawk} under MS-Windows
(@pxref{PC Installation} for details).

@item vms/*
Files needed for building @command{gawk} under Vax/VMS and OpenVMS
(@pxref{VMS Installation} for details).

@item test/*
A test suite for
@command{gawk}. You can use @samp{make check} from the top-level @command{gawk}
directory to run your version of @command{gawk} against the test suite.
If @command{gawk} successfully passes @samp{make check}, then you can
be confident of a successful port.
@end table

@node Unix Installation
@appendixsec Compiling and Installing @command{gawk} on Unix-Like Systems

Usually, you can compile and install @command{gawk} by typing only two
commands. However, if you use an unusual system, you may need
to configure @command{gawk} for your system yourself.

@menu
* Quick Installation::               Compiling @command{gawk} under Unix.
* Shell Startup Files::              Shell convenience functions.
* Additional Configuration Options:: Other compile-time options.
* Configuration Philosophy::         How it's all supposed to work.
* Compiling from Git::               Compiling from Git.
* Building the Documentation::       Building the Documentation.
@end menu

@node Quick Installation
@appendixsubsec Compiling @command{gawk} for Unix-Like Systems

@menu
* Compiling with MPFR::              Building with MPFR.
@end menu

The normal installation steps should work on all modern commercial
Unix-derived systems, GNU/Linux, BSD-based systems, and the Cygwin
environment for MS-Windows.

After you have extracted the @command{gawk} distribution, @command{cd}
to @file{gawk-@value{VERSION}.@value{PATCHLEVEL}}. As with most GNU
software, you configure @command{gawk} for your system by running the
@command{configure} program. This program is a Bourne shell script that
is generated automatically using GNU Autoconf.
@ifnotinfo
(The Autoconf software is
described fully in
@cite{Autoconf---Generating Automatic Configuration Scripts},
which can be found online at
@uref{https://www.gnu.org/software/autoconf/manual/index.html,
the Free Software Foundation's website}.)
@end ifnotinfo
@ifinfo
(The Autoconf software is described fully starting with
@inforef{Top, , Autoconf, autoconf,Autoconf---Generating Automatic Configuration Scripts}.)
@end ifinfo

To configure @command{gawk}, simply run @command{configure}:

@example
sh ./configure
@end example

This produces a @file{Makefile} and @file{config.h} tailored to your system.
The @file{config.h} file describes various facts about your system.
You might want to edit the @file{Makefile} to
change the @code{CFLAGS} variable, which controls
the command-line options that are passed to the C compiler (such as
optimization levels or compiling for debugging).

Alternatively, you can add your own values for most @command{make}
variables on the command line, such as @code{CC} and @code{CFLAGS}, when
running @command{configure}:

@example
CC=cc CFLAGS=-g sh ./configure
@end example

@noindent
See the file @file{INSTALL} in the @command{gawk} distribution for
all the details.

After you have run @command{configure} and possibly edited the @file{Makefile},
type:

@example
make
@end example

@noindent
Shortly thereafter, you should have an executable version of @command{gawk}.
That's all there is to it!
To verify that @command{gawk} is working properly,
run @samp{make check}. All of the tests should succeed.
If these steps do not work, or if any of the tests fail,
check the files in the @file{README_d} directory to see if you've
found a known problem. If the failure is not described there,
send in a bug report (@pxref{Bugs}).

Of course, once you've built @command{gawk}, it is likely that you will
wish to install it. To do so, you need to run the command @samp{make
install}, as a user with the appropriate permissions. How to do this
varies by system, but on many systems you can use the @command{sudo}
command to do so. The command then becomes @samp{sudo make install}. It
is likely that you will be asked for your password, and you will have
to have been set up previously as a user who is allowed to run the
@command{sudo} command.


@node Compiling with MPFR
@appendixsubsubsec Building With MPFR

@cindex MPFR library, building with
Use of the MPFR library with @command{gawk}
is an optional feature: if you have the MPFR and GMP libraries already installed
when you configure and build @command{gawk},
@command{gawk} automatically will be able to use them.
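
Once @command{gawk} is built, one way to check whether MPFR support was
picked up is to try the @option{-M} option (@pxref{Options}). The
following is only a sketch; it assumes that the freshly built
@command{gawk} executable is in the current directory:

@example
./gawk --version
./gawk -M 'BEGIN @{ print 2^100 + 1 @}'
@end example

@noindent
With MPFR and GMP support compiled in, the second command prints the
exact integer @samp{1267650600228229401496703205377}, and the version
output typically mentions both libraries. Without that support,
@command{gawk} warns that @option{-M} is ignored and falls back to
double-precision arithmetic.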

You can install these libraries from source code by fetching them
from the GNU distribution site at @code{ftp.gnu.org}.

Most modern systems provide package managers which save you the trouble
of building from source. They fetch and install the library header files
and binaries for you. You will need to research how to do this for
your particular system.

@node Shell Startup Files
@appendixsubsec Shell Startup Files

The distribution contains shell startup files @file{gawk.sh} and
@file{gawk.csh}, containing functions to aid in manipulating
the @env{AWKPATH} and @env{AWKLIBPATH} environment variables.
On a Fedora GNU/Linux system, these files should be installed in @file{/etc/profile.d};
on other platforms, the appropriate location may be different.

@table @command

@cindex @command{gawkpath_default} shell function
@cindex shell function @subentry @command{gawkpath_default}
@item gawkpath_default
Reset the @env{AWKPATH} environment variable to its default value.

@cindex @command{gawkpath_prepend} shell function
@cindex shell function @subentry @command{gawkpath_prepend}
@item gawkpath_prepend
Add the argument to the front of the @env{AWKPATH} environment variable.

@cindex @command{gawkpath_append} shell function
@cindex shell function @subentry @command{gawkpath_append}
@item gawkpath_append
Add the argument to the end of the @env{AWKPATH} environment variable.

@cindex @command{gawklibpath_default} shell function
@cindex shell function @subentry @command{gawklibpath_default}
@item gawklibpath_default
Reset the @env{AWKLIBPATH} environment variable to its default value.

@cindex @command{gawklibpath_prepend} shell function
@cindex shell function @subentry @command{gawklibpath_prepend}
@item gawklibpath_prepend
Add the argument to the front of the @env{AWKLIBPATH} environment variable.

@cindex @command{gawklibpath_append} shell function
@cindex shell function @subentry @command{gawklibpath_append}
@item gawklibpath_append
Add the argument to the end of the @env{AWKLIBPATH} environment variable.

@end table


@node Additional Configuration Options
@appendixsubsec Additional Configuration Options
@cindex @command{gawk} @subentry configuring @subentry options
@cindex configuration options, @command{gawk}

There are several additional options you may use on the @command{configure}
command line when compiling @command{gawk} from scratch, including:

@table @code

@cindex @option{--disable-extensions} configuration option
@cindex configuration option @subentry @option{--disable-extensions}
@item --disable-extensions
Disable the extension mechanism within @command{gawk}. With this
option, it is not possible to use dynamic extensions. This also
disables configuring and building the sample extensions in the
@file{extension} directory.

This option may be useful for cross-compiling.
The default action is to dynamically check if the extensions
can be configured and compiled.

@cindex @option{--disable-lint} configuration option
@cindex configuration option @subentry @option{--disable-lint}
@item --disable-lint
Disable all lint checking within @command{gawk}. The
@option{--lint} and @option{--lint-old} options
(@pxref{Options})
are accepted, but silently do nothing.
Similarly, setting the @code{LINT} variable
(@pxref{User-modified})
has no effect on the running @command{awk} program.

When used with the GNU Compiler Collection's (GCC's)
automatic dead-code-elimination, this option
cuts almost 23K bytes off the size of the @command{gawk}
executable on GNU/Linux x86_64 systems. Results on other systems and
with other compilers are likely to vary.
Using this option may bring you some slight performance improvement.

@quotation CAUTION
Using this option will cause some of the tests in the test suite
to fail. This option may be removed at a later date.
@end quotation

@cindex @option{--disable-mpfr} configuration option
@cindex configuration option @subentry @option{--disable-mpfr}
@item --disable-mpfr
Skip checking for the MPFR and GMP libraries. This is useful
mainly for the developers, to make sure nothing breaks if
MPFR support is not available.

@cindex @option{--disable-nls} configuration option
@cindex configuration option @subentry @option{--disable-nls}
@item --disable-nls
Disable all message-translation facilities.
This is usually not desirable, but it may bring you some slight performance
improvement.

@cindex @option{--enable-versioned-extension-dir} configuration option
@cindex configuration option @subentry @option{--enable-versioned-extension-dir}
@item --enable-versioned-extension-dir
Use a versioned directory for extensions. The directory name will
include the major and minor API versions in it. This makes it possible
to keep extensions for different API versions on the same system
without their conflicting with one another.

@end table

Use the command @samp{./configure --help} to see the full list of
options supplied by @command{configure}.
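
As an illustration, several of these options can be combined on one
@command{configure} command line. The particular combination shown here
is arbitrary and only a sketch; choose the options that suit your system:

@example
sh ./configure --disable-nls --enable-versioned-extension-dir
make && make check
@end example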

@node Configuration Philosophy
@appendixsubsec The Configuration Process

@cindex @command{gawk} @subentry configuring
This @value{SECTION} is of interest only if you know something about using the
C language and Unix-like operating systems.

The source code for @command{gawk} generally attempts to adhere to formal
standards wherever possible. This means that @command{gawk} uses library
routines that are specified by the ISO C standard and by the POSIX
operating system interface standard.
The @command{gawk} source code requires using an ISO C compiler (the 1999
standard).

Many Unix systems do not support all of either the ISO or the
POSIX standards. The @file{missing_d} subdirectory in the @command{gawk}
distribution contains replacement versions of those functions that are
most likely to be missing.

The @file{config.h} file that @command{configure} creates contains
definitions that describe features of the particular operating system
where you are attempting to compile @command{gawk}. The three things
described by this file are: what header files are available, so that
they can be correctly included, what (supposedly) standard functions
are actually available in your C libraries, and various miscellaneous
facts about your operating system. For example, there may not be an
@code{st_blksize} element in the @code{stat} structure. In this case,
@samp{HAVE_STRUCT_STAT_ST_BLKSIZE} is undefined.

@cindex @code{custom.h} file
It is possible for your C compiler to lie to @command{configure}. It may
do so by not exiting with an error when a library function is not
available. To get around this, edit the @file{custom.h} file.
Use an @samp{#ifdef} that is appropriate for your system, and either
@code{#define} any constants that @command{configure} should have defined but
didn't, or @code{#undef} any constants that @command{configure} defined and
should not have. The @file{custom.h} file is automatically included by
the @file{config.h} file.

It is also possible that the @command{configure} program generated by
Autoconf will not work on your system in some other fashion.
If you do have a problem, the @file{configure.ac} file is the input for
Autoconf. You may be able to change this file and generate a
new version of @command{configure} that works on your system
(@pxref{Bugs}
for information on how to report problems in configuring @command{gawk}).
The same mechanism may be used to send in updates to @file{configure.ac}
and/or @file{custom.h}.

@node Compiling from Git
@appendixsubsec Compiling from Git

Building @command{gawk} directly from the development source control
repository is possible, but not recommended for everyday users, as the
code may not be as stable as released versions are. If you really do
want to do that, here are the steps:

@example
git clone https://git.savannah.gnu.org/r/gawk.git
cd gawk
./bootstrap.sh && ./configure && make && make check
@end example

@node Building the Documentation
@appendixsubsec Building the Documentation

@cindex documentation @subentry building @subentry Info files
The generated Info documentation is included in the distribution
@command{tar} files and in the Git source code repository; you should
not need to rebuild it. However, if it needs to be done, simply running
@command{make} will do it, assuming that you have a recent enough version
of @command{makeinfo} installed.
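
Before rebuilding the Info files, it can help to check which version of
@command{makeinfo} is installed. The following is a sketch; the
@samp{info} target is assumed here because it is the conventional
Automake name for building Info files, and running plain @command{make}
as described above should work as well:

@example
makeinfo --version
cd doc
make info
@end example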

@cindex documentation @subentry building @subentry PDF
If you wish to build the PDF version of the manuals, you will need
to have @TeX{} installed, and possibly additional packages that
provide the necessary fonts and tools, such as @command{dvi2pdf}
and @command{ps2pdf}. You will also need GNU Troff (@command{groff})
installed in order to format the reference card and the manual page
(@pxref{Distribution contents}). Managing this process is beyond the
scope of this @value{DOCUMENT}.

Assuming you have all you need, then the following commands produce the
PDF versions of the documentation:

@example
cd doc
make pdf
@end example

@noindent
This creates PDF versions of all three Texinfo documents included
in the distribution, as well as of the manual page and the reference card.

@cindex documentation @subentry building @subentry HTML
Similarly, if you have a recent enough version of @command{makeinfo},
you can make the HTML version of the manuals with:

@example
cd doc
make html
@end example

@noindent
This creates HTML versions of all three Texinfo documents included
in the distribution.

@node Non-Unix Installation
@appendixsec Installation on Other Operating Systems

This @value{SECTION} describes how to install @command{gawk} on
various non-Unix systems.

@menu
* PC Installation::                  Installing and Compiling @command{gawk} on
                                     Microsoft Windows.
* VMS Installation::                 Installing @command{gawk} on VMS.
@end menu

@node PC Installation
@appendixsubsec Installation on MS-Windows

@cindex PC operating systems, @command{gawk} on @subentry installing
@cindex operating systems @subentry PC, @command{gawk} on @subentry installing
This @value{SECTION} covers installation and usage of @command{gawk}
on Intel architecture machines running any version of MS-Windows.
In this @value{SECTION}, the term ``Windows32''
refers to any of Microsoft Windows 95/98/ME/NT/2000/XP/Vista/7/8/10.

See also the @file{README_d/README.pc} file in the distribution.

@menu
* PC Binary Installation::           Installing a prepared distribution.
* PC Compiling::                     Compiling @command{gawk} for Windows32.
* PC Using::                         Running @command{gawk} on Windows32.
* Cygwin::                           Building and running @command{gawk} for
                                     Cygwin.
* MSYS::                             Using @command{gawk} In The MSYS Environment.
@end menu

@node PC Binary Installation
@appendixsubsubsec Installing a Prepared Distribution for MS-Windows Systems
@cindex installing @command{gawk} @subentry MS-Windows

The only supported binary distribution for MS-Windows systems
is that provided by Eli Zaretskii's @uref{https://sourceforge.net/projects/ezwinports/,
``ezwinports''} project. Install the compiled @command{gawk} from there.

@node PC Compiling
@appendixsubsubsec Compiling @command{gawk} for PC Operating Systems

@command{gawk} can be compiled for Windows32 using MinGW (Windows32).
The file @file{README_d/README.pc} in the @command{gawk} distribution
contains additional notes, and @file{pc/Makefile} contains important
information on compilation options.

@cindex compiling @command{gawk} @subentry for MS-Windows
To build @command{gawk} for Windows32, copy the files in
the @file{pc} directory (@emph{except} for @file{ChangeLog}) to the
directory with the rest of the @command{gawk} sources, then invoke
@command{make} with the appropriate target name as an argument to
build @command{gawk}. The @file{Makefile} copied from the @file{pc}
directory contains a configuration section with comments and may need
to be edited in order to work with your @command{make} utility.

The @file{Makefile} supports a number of targets for building various
MS-DOS and Windows32 versions. A list of targets is printed if the
@command{make} command is given without a target. As an example,
to build a native MS-Windows binary of @command{gawk} using the MinGW tools,
type @samp{make mingw32}.

@node PC Using
@appendixsubsubsec Using @command{gawk} on PC Operating Systems
@cindex operating systems @subentry PC, @command{gawk} on
@cindex PC operating systems, @command{gawk} on

Information in this section applies to the MinGW and
DJGPP ports of @command{gawk}. @xref{Cygwin} for information
about the Cygwin port.

Under MS-Windows, the MinGW environment supports
both the @samp{|&} operator and TCP/IP networking
(@pxref{TCP/IP Networking}).
The DJGPP environment does not support @samp{|&}.

@cindex search paths
@cindex search paths @subentry for source files
@cindex @command{gawk} @subentry MS-Windows version of
@cindex @code{;} (semicolon) @subentry @env{AWKPATH} variable and
@cindex semicolon (@code{;}) @subentry @env{AWKPATH} variable and
@cindex @env{AWKPATH} environment variable
@cindex environment variables @subentry @env{AWKPATH}
The MS-Windows version of @command{gawk} searches for
program files as described in @ref{AWKPATH Variable}.
However,
semicolons (rather than colons) separate elements in the @env{AWKPATH}
variable. If @env{AWKPATH} is not set or is empty, then the default
search path is @samp{@w{.;c:/lib/awk;c:/gnu/lib/awk}}.

@cindex common extensions @subentry @code{BINMODE} variable
@cindex extensions @subentry common @subentry @code{BINMODE} variable
@cindex differences in @command{awk} and @command{gawk} @subentry @code{BINMODE} variable
@cindex @code{BINMODE} variable
Under MS-Windows,
@command{gawk} (and many other text programs) silently
translates end-of-line @samp{\r\n} to @samp{\n} on input and @samp{\n}
to @samp{\r\n} on output. A special @code{BINMODE} variable @value{COMMONEXT}
allows control over these translations and is interpreted as follows:

@itemize @value{BULLET}
@item
If @code{BINMODE} is @code{"r"} or one,
then
binary mode is set on read (i.e., no translations on reads).

@item
If @code{BINMODE} is @code{"w"} or two,
then
binary mode is set on write (i.e., no translations on writes).

@item
If @code{BINMODE} is @code{"rw"} or @code{"wr"} or three,
binary mode is set for both read and write.

@item
@code{BINMODE=@var{non-null-string}} is
the same as @samp{BINMODE=3} (i.e., no translations on
reads or writes). However, @command{gawk} issues a warning
message if the string is not one of @code{"rw"} or @code{"wr"}.
@end itemize

@noindent
The modes for standard input and standard output are set one time
only (after the
command line is read, but before processing any of the @command{awk} program).
Setting @code{BINMODE} for standard input or
standard output is accomplished by using an
appropriate @samp{-v BINMODE=@var{N}} option on the command line.
@code{BINMODE} is set at the time a file or pipe is opened and cannot be
changed midstream.
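
As a sketch of the command-line form, the following counts the input
lines that end in a carriage return. Setting @code{BINMODE} to one keeps
@command{gawk} from translating @samp{\r\n} on reads, so the carriage
returns remain visible to the program. (The @value{FN} @file{data.txt}
is only a placeholder.)

@example
gawk -v BINMODE=1 "/\r$/ @{ n++ @} END @{ print n + 0 @}" data.txt
@end example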

On POSIX-compatible systems, this variable's value has no effect.
Thus, if you think your program will run on multiple different systems
and that you may need to use @code{BINMODE}, you should simply set it
(in the program or on the command line) unconditionally, and not worry
about the operating system on which your program is running.

The name @code{BINMODE} was chosen to match @command{mawk}
(@pxref{Other Versions}).
@command{mawk} and @command{gawk} handle @code{BINMODE} similarly; however,
@command{mawk} adds a @samp{-W BINMODE=@var{N}} option and an environment
variable that can set @code{BINMODE}, @code{RS}, and @code{ORS}. The
files @file{binmode[1-3].awk} (under @file{gnu/lib/awk} in some of the
prepared binary distributions) have been chosen to match @command{mawk}'s @samp{-W
BINMODE=@var{N}} option. These can be changed or discarded; in particular,
the setting of @code{RS} giving the fewest ``surprises'' is open to debate.
@command{mawk} uses @samp{RS = "\r\n"} if binary mode is set on read, which is
appropriate for files with the MS-DOS-style end-of-line.

To illustrate, the following examples set binary mode on writes for standard
output and other files, and set @code{ORS} as the ``usual'' MS-DOS-style
end-of-line:

@example
gawk -v BINMODE=2 -v ORS="\r\n" @dots{}
@end example

@noindent
or:

@example
gawk -v BINMODE=w -f binmode2.awk @dots{}
@end example

@noindent
These give the same result as the @samp{-W BINMODE=2} option in
@command{mawk}.
The following changes the record separator to @code{"\r\n"} and sets binary
mode on reads, but does not affect the mode on standard input:

@example
gawk -v RS="\r\n" -e "BEGIN @{ BINMODE = 1 @}" @dots{}
@end example

@noindent
or:

@example
gawk -f binmode1.awk @dots{}
@end example

@noindent
With proper quoting, in the first example the setting of @code{RS} can be
moved into the @code{BEGIN} rule.

@node Cygwin
@appendixsubsubsec Using @command{gawk} In The Cygwin Environment
@cindex compiling @command{gawk} @subentry for Cygwin

@command{gawk} can be built and used ``out of the box'' under MS-Windows
if you are using the @uref{http://www.cygwin.com, Cygwin environment}.
This environment provides an excellent simulation of GNU/Linux, using
Bash, GCC, GNU Make,
and other GNU programs. Compilation and installation for Cygwin is the
same as for a Unix system:

@example
tar -xvpzf gawk-@value{VERSION}.@value{PATCHLEVEL}.tar.gz
cd gawk-@value{VERSION}.@value{PATCHLEVEL}
./configure
make && make check
@end example

When compared to GNU/Linux on the same system, the @samp{configure}
step on Cygwin takes considerably longer. However, it does finish,
and then the @samp{make} proceeds as usual.

@cindex installing @command{gawk} @subentry Cygwin
You may also install @command{gawk} using the regular Cygwin installer.
In general, Cygwin supplies the latest released version.

Recent versions of Cygwin open all files in binary mode. This means
that you should use @samp{RS = "\r?\n"} in order to be able to
handle standard MS-Windows text files with carriage-return plus
line-feed line endings.

The Cygwin environment supports
both the @samp{|&} operator and TCP/IP networking
(@pxref{TCP/IP Networking}).
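
To illustrate the @code{RS} setting recommended earlier for MS-Windows
text files, the following sketch feeds two lines with carriage-return
plus line-feed endings to @command{gawk} and prints the length of each
record. (It assumes a Bourne-style shell such as Bash, whose
@command{printf} understands @samp{\r} and @samp{\n}.)

@example
printf 'one\r\ntwo\r\n' |
  gawk 'BEGIN @{ RS = "\r?\n" @} @{ print length($0) @}'
@end example

@noindent
Each record has length three, because the optional carriage return is
treated as part of the record separator and not as part of the record.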

@node MSYS
@appendixsubsubsec Using @command{gawk} In The MSYS Environment

In the MSYS environment under MS-Windows, @command{gawk} automatically
uses binary mode for reading and writing files. Thus, there is no
need to use the @code{BINMODE} variable.

This can cause problems with other Unix-like components ported to
MS-Windows that expect @command{gawk} to translate @code{"\r\n"}
automatically, because it does not.

Under MSYS2, compilation using the standard @samp{./configure && make}
recipe works ``out of the box.''

@node VMS Installation
@appendixsubsec Compiling and Installing @command{gawk} on Vax/VMS and OpenVMS

@c based on material from Pat Rankin <rankin@eql.caltech.edu>
@c now rankin@pactechdata.com
@c now r.pat.rankin@gmail.com

@cindex @command{gawk} @subentry VMS version of
@cindex installing @command{gawk} @subentry VMS
This @value{SUBSECTION} describes how to compile and install @command{gawk} under OpenVMS.
The older designation ``VMS'' is used throughout to refer to OpenVMS.

@menu
* VMS Compilation::                  How to compile @command{gawk} under VMS.
* VMS Dynamic Extensions::           Compiling @command{gawk} dynamic extensions on
                                     VMS.
* VMS Installation Details::         How to install @command{gawk} under VMS.
* VMS Running::                      How to run @command{gawk} under VMS.
* VMS GNV::                          The VMS GNV Project.
@end menu

@node VMS Compilation
@appendixsubsubsec Compiling @command{gawk} on VMS
@cindex compiling @command{gawk} @subentry for VMS

To compile @command{gawk} under VMS, there is a @code{DCL} command procedure
that issues all the necessary @code{CC} and @code{LINK} commands. There is
also a @file{Makefile} for use with the @code{MMS} and @code{MMK} utilities.
From the source directory, use one of the following:

@example
$ @kbd{@@[.vms]vmsbuild.com}
@end example

@noindent
or:

@example
$ @kbd{MMS/DESCRIPTION=[.vms]descrip.mms gawk}
@end example

@noindent
or:

@example
$ @kbd{MMK/DESCRIPTION=[.vms]descrip.mms gawk}
@end example

@command{MMK} is an open source, free, near-clone of @command{MMS} and
can better handle ODS-5 volumes with upper- and lowercase @value{FN}s.
@command{MMK} is available from @uref{https://github.com/endlesssoftware/mmk}.

With ODS-5 volumes and extended parsing enabled, the case of the target
parameter may need to be exact.

@command{gawk} has been tested under VAX/VMS 7.3 and Alpha/VMS 7.3-1
using Compaq C V6.4, and under Alpha/VMS 7.3, Alpha/VMS 7.3-2, and IA64/VMS 8.3.
The most recent builds used HP C V7.3 on Alpha VMS 8.3 and on both
Alpha and IA64 VMS 8.4.@footnote{The IA64 architecture
is also known as ``Itanium.''}

@xref{VMS GNV} for information on building
@command{gawk} as a PCSI kit that is compatible with the GNV product.

@node VMS Dynamic Extensions
@appendixsubsubsec Compiling @command{gawk} Dynamic Extensions on VMS

The extensions that have been ported to VMS can be built using one of
the following commands:

@example
$ @kbd{MMS/DESCRIPTION=[.vms]descrip.mms extensions}
@end example

@noindent
or:

@example
$ @kbd{MMK/DESCRIPTION=[.vms]descrip.mms extensions}
@end example

@command{gawk} uses @code{AWKLIBPATH} as either an environment variable
or a logical name to find the dynamic extensions.

Dynamic extensions need to be compiled with the same compiler options for
floating-point, pointer size, and symbol name handling as were used
to compile @command{gawk} itself.
Alpha and Itanium should use IEEE floating point. The pointer size is 32 bits,
and the symbol name handling should be exact case with CRC shortening for
symbols longer than 31 characters.

For Alpha and Itanium:

@example
/name=(as_is,short)
/float=ieee/ieee_mode=denorm_results
@end example

For VAX:

@example
/name=(as_is,short)
@end example

Compile-time macros need to be defined before the first VMS-supplied
header file is included, as follows:

@example
#if (__CRTL_VER >= 70200000) && !defined (__VAX)
#define _LARGEFILE 1
#endif

#ifndef __VAX
#ifdef __CRTL_VER
#if __CRTL_VER >= 80200000
#define _USE_STD_STAT 1
#endif
#endif
#endif
@end example

If you are writing your own extensions to run on VMS, you must supply these
definitions yourself. The @file{config.h} file created when building @command{gawk}
on VMS does this for you; if instead you use that file or a similar one, then you
must remember to include it before any VMS-supplied header files.

@node VMS Installation Details
@appendixsubsubsec Installing @command{gawk} on VMS

To use @command{gawk}, all you need is a ``foreign'' command, which is a
@code{DCL} symbol whose value begins with a dollar sign. For example:

@example
$ @kbd{GAWK :== $disk1:[gnubin]gawk}
@end example

@noindent
Substitute the actual location of @command{gawk.exe} for
@samp{$disk1:[gnubin]}. The symbol should be placed in the
@file{login.com} of any user who wants to run @command{gawk},
so that it is defined every time the user logs on.
Alternatively, the symbol may be placed in the system-wide
@file{sylogin.com} procedure, which allows all users
to run @command{gawk}.

If your @command{gawk} was installed by a PCSI kit into the
@file{GNV$GNU:} directory tree, the program will be known as
@file{GNV$GNU:[bin]gnv$gawk.exe} and the help file will be
@file{GNV$GNU:[vms_help]gawk.hlp}.

The PCSI kit also installs a @file{GNV$GNU:[vms_bin]gawk_verb.cld} file
that can be used to add @command{gawk} and @command{awk} as DCL commands.

For just the current process you can use:

@example
$ @kbd{set command gnv$gnu:[vms_bin]gawk_verb.cld}
@end example

Or the system manager can use @file{GNV$GNU:[vms_bin]gawk_verb.cld} to
add the @command{gawk} and @command{awk} commands to the system-wide @samp{DCLTABLES}.

The DCL syntax is documented in the @file{gawk.hlp} file.

Optionally, the @file{gawk.hlp} entry can be loaded into a VMS help library:

@example
$ @kbd{LIBRARY/HELP sys$help:helplib [.vms]gawk.hlp}
@end example

@noindent
(You may want to substitute a site-specific help library rather than
the standard VMS library @samp{HELPLIB}.) After loading the help text,
the command:

@example
$ @kbd{HELP GAWK}
@end example

@noindent
provides information about both the @command{gawk} implementation and the
@command{awk} programming language.

The logical name @samp{AWK_LIBRARY} can designate a default location
for @command{awk} program files. For the @option{-f} option, if the specified
@value{FN} has no device or directory path information in it, @command{gawk}
looks in the current directory first, then in the directory specified
by the translation of @samp{AWK_LIBRARY} if the file is not found.
If, after searching in both directories, the file still is not found,
@command{gawk} appends the suffix @samp{.awk} to the @value{FN} and retries
the file search.
If @samp{AWK_LIBRARY} has no definition, a default value
of @samp{SYS$LIBRARY:} is used for it.

@node VMS Running
@appendixsubsubsec Running @command{gawk} on VMS

Command-line parsing and quoting conventions are significantly different
on VMS, so examples in this @value{DOCUMENT} or from other sources often need minor
changes. They @emph{are} minor though, and all @command{awk} programs
should run correctly.

Here are a couple of trivial tests:

@example
$ @kbd{gawk -- "BEGIN @{print ""Hello, World!""@}"}
$ @kbd{gawk -"W" version}
! could also be -"W version" or "-W version"
@end example

@noindent
Note that uppercase and mixed-case text must be quoted.

The VMS port of @command{gawk} includes a @code{DCL}-style interface in addition
to the original shell-style interface (see the help entry for details).
One side effect of dual command-line parsing is that if there is only a
single parameter (as in the quoted string program), the command
becomes ambiguous. To work around this, the normally optional @option{--}
flag is required to force Unix-style parsing rather than @code{DCL} parsing.
If any other dash-type options (or multiple parameters such as @value{DF}s to
process) are present, there is no ambiguity and @option{--} can be omitted.

@cindex exit status, of @command{gawk} @subentry on VMS
The @code{exit} value is a Unix-style value and is encoded into a VMS exit
status value when the program exits.

The VMS severity bits will be set based on the @code{exit} value.
A failure is indicated by 1, and VMS sets the @code{ERROR} status.
A fatal error is indicated by 2, and VMS sets the @code{FATAL} status.
All other values will have the @code{SUCCESS} status.
The exit value is
encoded to comply with VMS coding standards and will have the
@code{C_FACILITY_NO} of @code{0x350000} with the constant @code{0xA000}
added to the number shifted over by 3 bits to make room for the severity codes.

To extract the actual @command{gawk} exit code from the VMS status, use:

@example
unix_status = (vms_status .and. %x7f8) / 8
@end example

@noindent
A C program that uses @code{exec()} to call @command{gawk} will get the original
Unix-style exit value.

Older versions of @command{gawk} for VMS treated a Unix exit code 0 as 1,
a failure as 2, a fatal error as 4, and passed all the other numbers through.
This violated the VMS exit status coding requirements.

@cindex floating-point @subentry numbers @subentry VAX/VMS
VAX/VMS floating point uses unbiased rounding. @xref{Round Function}.

VMS reports time values in GMT unless one of the @code{SYS$TIMEZONE_RULE}
or @code{TZ} logical names is set. Older versions of VMS, such as VAX/VMS
7.3, do not set these logical names.

@cindex search paths
@cindex search paths @subentry for source files
The default search path, when looking for @command{awk} program files specified
by the @option{-f} option, is @code{"SYS$DISK:[],AWK_LIBRARY:"}. The logical
name @env{AWKPATH} can be used to override this default. The format
of @env{AWKPATH} is a comma-separated list of directory specifications.
When defining it, the value should be quoted so that it retains a single
translation and not a multitranslation @code{RMS} searchlist.

@cindex redirection @subentry on VMS

The redirection restriction described next also applies to running
@command{gawk} under GNV, as redirection is always to a DCL command.

If you are redirecting data to a VMS command or utility, the current
implementation requires setting up a VMS foreign command that runs
a command file before invoking @command{gawk}.
(This restriction may be removed in a future release of @command{gawk} on VMS.)

Without this command file, the input data will also appear prepended
to the output data.

This also allows simulating POSIX commands that are not found on VMS or the
use of GNV utilities.

The example below is for @command{gawk} redirecting data to the VMS
@command{sort} command.

@example
$ sort = "@@device:[dir]vms_gawk_sort.com"
@end example

The command file needs to be of the format in the example below.

The first line inhibits the passed input data from also showing up in the
output. It must be in the format in the example.

The next line creates a foreign command that overrides the outer foreign
command, which prevents an infinite recursion of command files.

The next-to-last command redirects @code{sys$input} to be
@code{sys$command}, in order to pick up the data that is being redirected
to the command.

The last line runs the actual command. It must be the last command as the data
redirected from @command{gawk} will be read when the command file ends.

@example
$!'f$verify(0,0)'
$ sort := sort
$ define/user sys$input sys$command:
$ sort sys$input: sys$output:
@end example

@node VMS GNV
@appendixsubsubsec The VMS GNV Project

The VMS GNV package provides a build environment similar to POSIX with ports
of a collection of open source tools. The @command{gawk} found in the GNV
base kit is an older port. Currently, the GNV project is being reorganized
to supply individual PCSI packages for each component.
See @w{@uref{https://sourceforge.net/p/gnv/wiki/InstallingGNVPackages/}.}

The normal build procedure for @command{gawk} produces a program that
is suitable for use with GNV.

The file @file{vms/gawk_build_steps.txt} in the distribution documents
the procedure for building a VMS PCSI kit that is compatible with GNV.

@node Bugs
@appendixsec Reporting Problems and Bugs
@cindex archaeologists
@quotation
@i{There is nothing more dangerous than a bored archaeologist.}
@author Douglas Adams, @cite{The Hitchhiker's Guide to the Galaxy}
@end quotation
@c the radio show, not the book. :-)

@cindex debugging @command{gawk}, bug reports
@cindex troubleshooting @subentry @command{gawk} @subentry bug reports
If you have problems with @command{gawk} or think that you have found a bug,
report it to the developers; we cannot promise to do anything,
but we might well want to fix it.

@menu
* Bug definition::              Defining what is and is not a bug.
* Bug address::                 Where to send reports to.
* Usenet::                      Where not to send reports to.
* Performance bugs::            What to do if you think there is a performance
                                issue.
* Asking for help::             Dealing with non-bug questions.
* Maintainers::                 Maintainers of non-*nix ports.
@end menu

@node Bug definition
@appendixsubsec Defining What Is and What Is Not A Bug

Before talking about reporting bugs, let's define what is a bug,
and what is not.

A bug is:

@itemize @bullet
@item
When @command{gawk} behaves differently from what's described
in the POSIX standard, and that difference is not mentioned
in this @value{DOCUMENT} as being done on purpose.

@item
When @command{gawk} behaves differently from what's described
in this @value{DOCUMENT}.

@item
When @command{gawk} behaves differently from other @command{awk}
implementations in particular circumstances, and that behavior cannot
be attributed to an additional feature in @command{gawk}.

@item
Something that is obviously wrong, such as a core dump.

@item
When this @value{DOCUMENT} is unclear or ambiguous about a particular
feature's behavior.
@end itemize

The following things are @emph{not} bugs, and should not be reported
to the bug mailing list. You can ask about them on the ``help'' mailing
list (@pxref{Asking for help}), but don't be surprised if you get an
answer of the form ``that's how @command{gawk} behaves and it isn't
going to change.'' Here's the list:

@itemize @bullet
@item
Missing features, for any definition of @dfn{feature}. For example,
additional built-in arithmetic functions, or additional ways to split
fields or records, or anything else.

The number of features that @command{gawk} does @emph{not} have is
by definition infinite. It cannot be all things to all people.
In short, just because @command{gawk} doesn't do what @emph{you}
think it should, it's not necessarily a bug.

@item
Behaviors that are defined by the POSIX standard and/or for historical
compatibility with Unix @command{awk}. Even if you happen to dislike
those behaviors, they're not going to change: changing them would
break millions of existing @command{awk} programs.

@item
Behaviors that differ from how it's done in other languages. @command{awk}
and @command{gawk} stand on their own and do not have to follow the crowd.
This is particularly true when the requested behavior change would break
backwards compatibility.

This applies also to differences in behavior between @command{gawk}
and other language compilers and interpreters, such as wishes for more
detailed descriptions of what the problem is when a syntax error is
encountered.

@item
Documentation issues of the form ``the manual doesn't tell me how to
do XYZ.'' The manual is not a cookbook to solve every little problem
you may have. Its purpose is to teach you how to solve your problems
on your own.

@item
General questions and discussion about @command{awk} programming or
why @command{gawk} behaves the way it does. For that use the ``help''
mailing list: see @ref{Asking for help}.
@end itemize

For more information, see @uref{http://www.skeeve.com/fork-my-code.html,
@cite{Fork My Code, Please!---An Open Letter To Those of You Who Are Unhappy}},
by Arnold Robbins and Chet Ramey.

@node Bug address
@appendixsubsec Submitting Bug Reports

Before reporting a bug, make sure you have really found a genuine bug.

Here are the steps for submitting a bug report. Following them will
make both your life and the lives of the maintainers much easier.

@enumerate 1
@item
Make sure that what you want to report is appropriate.
@xref{Bug definition}. If it's not, you are wasting your
time and ours.

@item
Verify that you have the latest version of @command{gawk}.
Many bugs (usually subtle ones) are fixed at each release, and if yours
is out-of-date, the problem may already have been solved.

@item
Please see if setting the environment variable @env{LC_ALL}
to @code{LC_ALL=C} causes things to behave as you expect. If so, it's
a locale issue, and may or may not really be a bug.

@item
Carefully reread the documentation and see if it says you can do
what you're trying to do.
If it's not clear whether you should be able
to do something or not, report that too; it's a bug in the documentation!

@item
Before reporting a bug or trying to fix it yourself, try to isolate it
to the smallest possible @command{awk} program and input @value{DF} that
reproduce the problem. Then send us:

@itemize @bullet
@item
The program and @value{DF}.

@item
Some idea of what kind of Unix system you're using.

@item
The compiler you used to compile @command{gawk}.

@item
The exact results
@command{gawk} gave you. Also say what you expected to occur; this helps
us decide whether the problem is really in the documentation.

@item
The version number of @command{gawk} you are using.
You can get this information with the command @samp{gawk --version}.
@end itemize

@item
Do @emph{not} send screenshots. Instead, use copy/paste to send text, or
send files.

@item
Do send files as attachments, instead of inline. This avoids corruption
by mailer programs out in the wilds of the Internet.

@item
Please be sure to send all mail in @emph{plain text},
not (or not exclusively) in HTML.

@item
@emph{All email must be in English. This is the only language
understood in common by all the maintainers.}
@end enumerate

@cindex @email{bug-gawk@@gnu.org} bug reporting address
@cindex email address for bug reports, @email{bug-gawk@@gnu.org}
@cindex bug reports, email address, @email{bug-gawk@@gnu.org}
Once you have a precise problem description, send email to
@EMAIL{bug-gawk@@gnu.org,bug dash gawk at gnu dot org}.

The @command{gawk} maintainers subscribe to this address, and
thus they will receive your bug report.
Although you can send mail to the maintainers directly,
the bug reporting address is preferred because the
email list is archived at the GNU Project.

@quotation NOTE
Many distributions of GNU/Linux and the various BSD-based operating systems
have their own bug reporting systems. If you report a bug using your distribution's
bug reporting system, you should also send a copy to
@EMAIL{bug-gawk@@gnu.org,bug dash gawk at gnu dot org}.

This is for two reasons. First, although some distributions forward
bug reports ``upstream'' to the GNU mailing list, many don't, so there is a good
chance that the @command{gawk} maintainers won't even see the bug report! Second,
mail to the GNU list is archived, and having everything at the GNU Project
keeps things self-contained and not dependent on other organizations.
@end quotation

Please note: We ask that you follow the
@uref{https://gnu.org/philosophy/kind-communication.html,
GNU Kind Communication Guidelines} in your correspondence on the
list (as well as off of it).

@node Usenet
@appendixsubsec Please Don't Post Bug Reports to USENET

@quotation
@c Date: Sun, 17 May 2015 19:50:14 -0400
@c From: Chet Ramey <chet.ramey@case.edu>
@c Reply-To: chet.ramey@case.edu
@c Organization: ITS, Case Western Reserve University
@c To: Aharon Robbins <arnold@skeeve.com>
@c CC: chet.ramey@case.edu
I gave up on Usenet a couple of years ago and haven't really looked back.
It's like sports talk radio---you feel smarter for not having read it.
@author Chet Ramey
@end quotation

@cindex @code{comp.lang.awk} newsgroup
Please do @emph{not} try to report bugs in @command{gawk} by posting to the
Usenet/Internet newsgroup @code{comp.lang.awk}.
Although some of the
@command{gawk} developers occasionally read this newsgroup, the primary
@command{gawk} maintainer no longer does. Thus it's virtually guaranteed
that he will @emph{not} see your posting.

If you really don't care about the previous paragraph and continue to
post bug reports in @code{comp.lang.awk}, then understand that you're
not reporting bugs, you're just whining.

Similarly, posting bug reports or questions in web forums (such
as @uref{https://stackoverflow.com/, Stack Overflow}) may get you
an answer, but it won't be from the @command{gawk} maintainers,
who do not spend their time in web forums. The steps described here are
the only officially recognized way for reporting bugs. Really.

@ignore
And another one:

Date: Thu, 11 Jun 2015 09:00:56 -0400
From: Chet Ramey <chet.ramey@case.edu>

My memory was imperfect. Back in June 2009, I wrote:

"That's the nice thing about open source, right? You can take your ball
and run to another section of the playground. Then, if you like mixing
metaphors, you can throw rocks from there."
@end ignore

@node Performance bugs
@appendixsubsec What To Do If You Think There Is A Performance Issue

@cindex performance, checking issues
@cindex profiling, compiling @command{gawk} for
If you think that @command{gawk} is too slow at doing a particular task,
you should investigate before sending in a bug report. Here are the steps
to follow:

@enumerate 1
@item
Run @command{gawk} with the @option{--profile} option (@pxref{Options})
to see what your
program is doing. It may be that you have written it in an inefficient manner.
For example, you may be doing something for every record that could be done
just once, for every file.
(Use a @code{BEGINFILE} rule; @pxref{BEGINFILE/ENDFILE}.)
Or you may be doing something for every file that only needs to be done
once per run of the program.
(Use a @code{BEGIN} rule; @pxref{BEGIN/END}.)

@item
If profiling at the @command{awk} level doesn't help, then you will
need to compile @command{gawk} itself for profiling at the C language level.

To do that, start with the latest released version of
@command{gawk}. Unpack the source code in a new directory, and configure
it:

@example
$ @kbd{tar -xpzvf gawk-X.Y.Z.tar.gz}
@print{} @dots{} @ii{Output omitted}
$ @kbd{cd gawk-X.Y.Z}
$ @kbd{./configure}
@print{} @dots{} @ii{Output omitted}
@end example

@item
Edit the files @file{Makefile} and @file{support/Makefile}.
Change every instance of @option{-O2} or @option{-O} to @option{-pg}.
This causes @command{gawk} to be compiled for profiling.

@item
Compile the program by running the @command{make} command:

@example
@group
$ @kbd{make}
@print{} @dots{} @ii{Output omitted}
@end group
@end example

@item
Run the freshly compiled @command{gawk} on a @emph{real} program,
using @emph{real} data. Using an artificial program to try to time one
particular feature of @command{gawk} is useless; real @command{awk} programs
generally spend most of their time doing I/O, not computing. If you want to prove
that something is slow, it @emph{must} be done using a real program and real data.

Use a data file that is large enough for the statistical profiling to measure
where @command{gawk} spends its time. It should be at least 100 megabytes in size.

@example
$ @kbd{./gawk -f realprogram.awk realdata > /dev/null}
@end example

@item
When done, you should have a file in the current directory named @file{gmon.out}.
Run the command @samp{gprof gawk gmon.out > gprof.out}.

@item
Submit a bug report explaining what you think is slow. Include the @file{gprof.out}
file with it.

Preferably, you should also submit the program and the data, or else indicate where to
get the data if the file is large.

@item
If you have not submitted your program and data, be prepared to apply patches and
rerun the profiling in order to see if the patches were effective.

@end enumerate

If you are unable or unwilling to do the steps listed above, then you will
just have to live with @command{gawk} as it is.

@node Asking for help
@appendixsubsec Where To Send Non-bug Questions

If you have questions related to @command{awk} programming, or why @command{gawk}
behaves a certain way, or any other @command{awk}- or @command{gawk}-related issue,
please @emph{do not} send them to the bug reporting address.

As of July, 2021, there is a separate mailing list for this purpose:
@EMAIL{help-gawk@@gnu.org, help dash gawk at gnu dot org}.
Anything that is not a bug report should be sent to that list.

@quotation NOTE
If you disregard these directions and send non-bug mails to the bug list,
you will be told to use the help list.
After two such requests you will be silently @emph{blacklisted} from the bug list.
@end quotation

Please note: As with the bug list, we ask that you follow the
@uref{https://gnu.org/philosophy/kind-communication.html,
GNU Kind Communication Guidelines} in your correspondence on the help
list (as well as off of it).

@cindex Proulx, Bob
If you wish to subscribe to the list, in order to help out
others, or to learn from others, here are instructions, courtesy
of Bob Proulx:

@table @emph
@item Subscribe by email

Send an email message to
@EMAIL{help-gawk-request@@gnu.org, help dash gawk dash request at gnu dot org}
with ``subscribe'' in
the body of the message. The subject does not matter and is not used.

@item Subscribe by web form

To use the web interface, visit
@uref{https://lists.gnu.org/mailman/listinfo/help-gawk,
the list information page}.
Use the
subscribe form to fill out your email address and submit using the
@code{Subscribe} button.

@item Reply to the confirmation message

In both cases, reply to the confirmation message that is sent to
your address.
@end table

Bob mentions that you may also use email for subscribing and
unsubscribing. For example:

@example
$ @kbd{echo help | mailx -s request help-gawk-request@@gnu.org}
$ @kbd{echo subscribe | mailx -s request help-gawk-request@@gnu.org}
$ @kbd{echo unsubscribe | mailx -s request help-gawk-request@@gnu.org}
@end example

@node Maintainers
@appendixsubsec Reporting Problems with Non-Unix Ports

If you find bugs in one of the non-Unix ports of @command{gawk},
send an email to the bug list, with a copy to the
person who maintains that port. The maintainers are named in the following list,
as well as in the @file{README} file in the @command{gawk} distribution.
Information in the @file{README} file should be considered authoritative
if it conflicts with this @value{DOCUMENT}.

The people maintaining the various @command{gawk} ports are:

@c put the index entries outside the table, for docbook
@cindex Buening, Andreas
@cindex Malmberg, John
@cindex G., Daniel Richard
@cindex Robbins @subentry Arnold
@cindex Zaretskii, Eli
@cindex Guerrero, Juan Manuel
@multitable {MS-Windows with MinGW} {123456789012345678901234567890123456789001234567890}
@item Unix and POSIX systems @tab Arnold Robbins, @EMAIL{arnold@@skeeve.com,arnold at skeeve dot com}

@item MS-DOS with DJGPP @tab Juan Manuel Guerrero, @EMAIL{juan.guerrero@@gmx.de, juan dot guerrero at gmx dot de}

@item MS-Windows with MinGW @tab Eli Zaretskii, @EMAIL{eliz@@gnu.org,eliz at gnu dot org}

@c Leave this in the document on purpose.
@c OS/2 is not mentioned anywhere else though.
@item OS/2 @tab Andreas Buening, @EMAIL{andreas.buening@@nexgo.de,andreas dot buening at nexgo dot de}

@item VMS @tab John Malmberg, @EMAIL{wb8tyw@@qsl.net,wb8tyw at qsl dot net}

@item z/OS (OS/390) @tab Daniel Richard G.@: @EMAIL{skunk@@iSKUNK.ORG,skunk at iSKUNK dot ORG}
@end multitable

If your bug is also reproducible under Unix, send a copy of your
report to the @EMAIL{bug-gawk@@gnu.org,bug dash gawk at gnu dot org} email list as well.

@node Other Versions
@appendixsec Other Freely Available @command{awk} Implementations
@cindex @command{awk} @subentry implementations
@ignore
From: emory!amc.com!brennan (Michael Brennan)
Subject: C++ comments in awk programs
To: arnold@gnu.ai.mit.edu (Arnold Robbins)
Date: Wed, 4 Sep 1996 08:11:48 -0700 (PDT)

@end ignore
@cindex Brennan, Michael
@ifnotdocbook
@quotation
@i{It's kind of fun to put comments like this in your awk code:}@*
@ @ @ @ @ @ @code{// Do C++ comments work? answer: yes!
of course}
@author Michael Brennan
@end quotation
@end ifnotdocbook

@docbook
<blockquote><attribution>Michael Brennan</attribution>
<literallayout><emphasis>It's kind of fun to put comments like this in your awk code.</emphasis>
      <literal>// Do C++ comments work? answer: yes! of course</literal></literallayout>
</blockquote>
@end docbook

There are a number of other freely available @command{awk} implementations.
This @value{SECTION} briefly describes where to get them:

@table @asis
@cindex Kernighan, Brian
@cindex source code @subentry Brian Kernighan's @command{awk}
@cindex @command{awk} @subentry versions of @seealso{Brian Kernighan's @command{awk}}
@cindex Brian Kernighan's @command{awk} @subentry source code
@item Unix @command{awk}
Brian Kernighan, one of the original designers of Unix @command{awk},
has made his implementation of
@command{awk} freely available.
You can retrieve it from GitHub:

@cindex @command{git} utility
@example
git clone https://github.com/onetrueawk/awk bwkawk
@end example

@noindent
This command creates a copy of the @uref{https://git-scm.com, Git}
repository in a directory named @file{bwkawk}. If you omit the last argument
from the @command{git} command line, the repository copy is created in a
directory named @file{awk}.

This version requires an ISO C (1990 standard) compiler; the C compiler
from GCC (the GNU Compiler Collection) works quite nicely.

To build it, review the settings in the @file{makefile}, and then just run
@command{make}. Note that the result of compilation is named
@command{a.out}; you will have to rename it to something reasonable.

@xref{Common Extensions}
for a list of extensions in this @command{awk} that are not in POSIX @command{awk}.

As a side note, Dan Bornstein has created a Git repository tracking
all the versions of BWK @command{awk} that he could find. It's
available at @uref{https://github.com/danfuzz/one-true-awk}.

@cindex Brennan, Michael
@cindex @command{mawk} utility
@cindex source code @subentry @command{mawk}
@item @command{mawk}
Michael Brennan wrote an independent implementation of @command{awk},
called @command{mawk}. It is available under the
@ifclear FOR_PRINT
GPL (@pxref{Copying}),
@end ifclear
@ifset FOR_PRINT
GPL,
@end ifset
just as @command{gawk} is.

The original distribution site for the @command{mawk} source code
no longer has it. A copy is available at
@uref{http://www.skeeve.com/gawk/mawk1.3.3.tar.gz}.

In 2009, Thomas Dickey took on @command{mawk} maintenance.
Basic information is available on
@uref{http://www.invisible-island.net/mawk, the project's web page}.
The download URL is
@url{http://invisible-island.net/datafiles/release/mawk.tar.gz}.

Once you have it,
@command{gunzip} may be used to decompress this file. Installation
is similar to @command{gawk}'s
(@pxref{Unix Installation}).

@xref{Common Extensions}
for a list of extensions in @command{mawk} that are not in POSIX @command{awk}.

@item @command{mawk} 2.0
In 2016, Michael Brennan resumed @command{mawk} development.
His development snapshots are available via Git from the project's
@uref{https://github.com/mikebrennan000/mawk-2, GitHub page}.

@cindex Sumner, Andrew
@cindex @command{awka} compiler for @command{awk}
@cindex source code @subentry @command{awka}
@item @command{awka}
Written by Andrew Sumner,
@command{awka} translates @command{awk} programs into C, compiles them,
and links them with a library of functions that provide the core
@command{awk} functionality.
It also has a number of extensions.

Both the @command{awk} translator and the library are released under the GPL.

To get @command{awka}, go to @url{https://sourceforge.net/projects/awka}.
@c You can reach Andrew Sumner at @email{andrew@@zbcom.net}.
@c andrewsumner@@yahoo.net

The project seems to be frozen; no new code changes have been made
since approximately 2001.

@item Revive Awka
This project, available at @uref{https://github.com/noyesno/awka},
intends to fix bugs in @command{awka} and add more features.

@cindex Beebe, Nelson H.F.@:
@cindex @command{pawk} (profiling version of Brian Kernighan's @command{awk})
@cindex source code @subentry @command{pawk} (profiling version of Brian Kernighan's @command{awk})
@item @command{pawk}
Nelson H.F.@: Beebe at the University of Utah has modified
BWK @command{awk} to provide timing and profiling information.
It is different from @command{gawk} with the @option{--profile} option
(@pxref{Profiling})
in that it uses CPU-based profiling, not line-count
profiling. You may find it at either
@uref{ftp://ftp.math.utah.edu/pub/pawk/pawk-20030606.tar.gz}
or
@uref{http://www.math.utah.edu/pub/pawk/pawk-20030606.tar.gz}.

@item BusyBox @command{awk}
@cindex BusyBox Awk
@cindex source code @subentry BusyBox Awk
BusyBox is a GPL-licensed program providing small versions of many
applications within a single executable. It is aimed at embedded systems.
It includes a full implementation of POSIX @command{awk}. When building
it, be careful not to do @samp{make install} as it will overwrite
copies of other applications in your @file{/usr/local/bin}. For more
information, see the @uref{https://busybox.net, project's home page}.

@cindex OpenSolaris
@cindex Solaris, POSIX-compliant @command{awk}
@cindex source code @subentry Solaris @command{awk}
@item The OpenSolaris POSIX @command{awk}
The versions of @command{awk} in @file{/usr/xpg4/bin} and
@file{/usr/xpg6/bin} on Solaris are more or less POSIX-compliant.
They are based on the @command{awk} from Mortice Kern Systems for PCs.
We were able to make this code compile and work under GNU/Linux
with 1--2 hours of work. Making it more generally portable (using
GNU Autoconf and/or Automake) would take more work, and this
has not been done, at least to our knowledge.

@cindex Illumos, POSIX-compliant @command{awk}
@cindex source code @subentry Illumos @command{awk}
The source code used to be available from the OpenSolaris website.
However, that project was ended and the website shut down. Fortunately, the
@uref{https://wiki.illumos.org/display/illumos/illumos+Home, Illumos project}
makes this implementation available. You can view the files one at a time from
@uref{https://github.com/joyent/illumos-joyent/blob/master/usr/src/cmd/awk_xpg4}.

@cindex @command{frawk}
@cindex source code @subentry @command{frawk}
@item @command{frawk}
This is a language for writing short programs. ``To a first
approximation, it is an implementation of the AWK language;
many common @command{awk} programs produce equivalent output
when passed to @command{frawk}.'' However, it has a number of
important additional features. The code is available at
@uref{https://github.com/ezrosent/frawk}.

@cindex @command{goawk}
@cindex Go implementation of @command{awk}
@cindex source code @subentry @command{goawk}
@cindex programming languages @subentry Go
@item @command{goawk}
This is an @command{awk} interpreter written in the
@uref{https://golang.org/, Go programming language}.
It implements POSIX @command{awk}, with a few minor extensions.
Source code is available from @uref{https://github.com/benhoyt/goawk}.
The author wrote a nice
@uref{https://benhoyt.com/writings/goawk/, article}
describing the implementation.

@cindex @command{jawk}
@cindex Java implementation of @command{awk}
@cindex source code @subentry @command{jawk}
@item @command{jawk}
This is an interpreter for @command{awk} written in Java. It claims
to be a full interpreter, although because it uses Java facilities
for I/O and for regexp matching, the language it supports is different
from POSIX @command{awk}. More information is available on the
@uref{http://jawk.sourceforge.net, project's home page}.

@item Hoijui's @command{jawk}
This project, available at @uref{https://github.com/hoijui/Jawk},
is another @command{awk} interpreter written in Java. It uses
modern Java build tools.

@item Libmawk
@cindex libmawk
@cindex source code @subentry libmawk
This is an embeddable @command{awk} interpreter derived from
@command{mawk}. For more information, see
@uref{http://repo.hu/projects/libmawk/}.

@cindex source code @subentry embeddable @command{awk} interpreter
@cindex Neacsu, Mircea
@item Mircea Neacsu's Embeddable @command{awk}
Mircea Neacsu has created an embeddable @command{awk}
interpreter, based on BWK @command{awk}. It's available
at @uref{https://github.com/neacsum/awk}.

@item @code{pawk}
@cindex source code @subentry @command{pawk} (Python version)
@cindex @code{pawk}, @command{awk}-like facilities for Python
This is a Python module that claims to bring @command{awk}-like
features to Python. See @uref{https://github.com/alecthomas/pawk}
for more information. (This is not related to Nelson Beebe's
modified version of BWK @command{awk}, described earlier.)

@item @w{QSE @command{awk}}
@cindex QSE @command{awk}
@cindex source code @subentry QSE @command{awk}
This is an embeddable @command{awk} interpreter. For more information,
see @uref{https://code.google.com/p/qse/}. @c and @uref{http://awk.info/?tools/qse}.

@item @command{QTawk}
@cindex QuikTrim Awk
@cindex source code @subentry QuikTrim Awk
This is an independent implementation of @command{awk} distributed
under the GPL. It has a large number of extensions over standard
@command{awk} and may not be 100% syntactically compatible with it.
See @uref{http://www.quiktrim.org/QTawk.html} for more information,
including the manual. The download link there is out of date; see
@uref{http://www.quiktrim.org/#AdditionalResources} for the latest
download link.

The project may also be frozen; no new code changes have been made
since approximately 2014.

@item Other versions
See also the ``Versions and implementations'' section of the
@uref{https://en.wikipedia.org/wiki/Awk_language#Versions_and_implementations,
Wikipedia article} on @command{awk} for information on additional versions.

@end table

An interesting collection of library functions is available
at @uref{https://github.com/e36freak/awk-libs}.

@node Installation summary
@appendixsec Summary

@itemize @value{BULLET}
@item
The @command{gawk} distribution is available from the GNU Project's main
distribution site, @code{ftp.gnu.org}. The canonical build recipe is:

@example
wget https://ftp.gnu.org/gnu/gawk/gawk-@value{VERSION}.@value{PATCHLEVEL}.tar.gz
tar -xvpzf gawk-@value{VERSION}.@value{PATCHLEVEL}.tar.gz
cd gawk-@value{VERSION}.@value{PATCHLEVEL}
./configure && make && make check
@end example

@quotation NOTE
Because of the @samp{https://} URL, you may have to supply the
@option{--no-check-certificate} option to @command{wget} to download
the file.
@end quotation

@item
@command{gawk} may be built on non-POSIX systems as well. The currently
supported systems are MS-Windows using
MSYS, MSYS2, DJGPP, MinGW, and Cygwin,
@c OS/2,
and both Vax/VMS and OpenVMS.
Instructions for each system are included in this @value{APPENDIX}.

@item
Bug reports should be sent via email to @EMAIL{bug-gawk@@gnu.org, bug dash gawk at gnu dot org}.
Bug reports should be in English and should include the version of @command{gawk},
how it was compiled, and a short program and @value{DF} that demonstrate
the problem.

@item
Non-bug emails should be sent to @EMAIL{help-gawk@@gnu.org, help dash gawk at gnu dot org}.
Repeatedly sending non-bug emails to the bug list will get you blacklisted from it.

@item
There are a number of other freely available @command{awk}
implementations. Many are POSIX-compliant; others are less so.

@end itemize


@ifclear FOR_PRINT
@node Notes
@appendix Implementation Notes
@cindex @command{gawk} @subentry implementation issues
@cindex implementation issues, @command{gawk}

This appendix contains information mainly of interest to implementers and
maintainers of @command{gawk}. Everything in it applies specifically to
@command{gawk} and not to other implementations.

@menu
* Compatibility Mode::             How to disable certain @command{gawk}
                                   extensions.
* Additions::                      Making Additions To @command{gawk}.
* Future Extensions::              New features that may be implemented one day.
* Implementation Limitations::     Some limitations of the implementation.
* Extension Design::               Design notes about the extension API.
* Notes summary::                  Summary of implementation notes.
@end menu

@node Compatibility Mode
@appendixsec Downward Compatibility and Debugging
@cindex @command{gawk} @subentry implementation issues @subentry downward compatibility
@cindex @command{gawk} @subentry implementation issues @subentry debugging
@cindex troubleshooting @subentry @command{gawk}
@cindex implementation issues, @command{gawk} @subentry debugging

@xref{POSIX/GNU},
for a summary of the GNU extensions to the @command{awk} language and program.
All of these features can be turned off by invoking @command{gawk} with the
@option{--traditional} option or with the @option{--posix} option.

If @command{gawk} is compiled for debugging with @samp{-DDEBUG}, then there
is one more option available on the command line:

@table @code
@item -Y
@itemx --parsedebug
Print out the parse stack information as the program is being parsed.
@end table

This option is intended only for serious @command{gawk} developers
and not for the casual user.
It probably has not even been compiled into
your version of @command{gawk}, since it slows down execution.

@node Additions
@appendixsec Making Additions to @command{gawk}

If you find that you want to enhance @command{gawk} in a significant
fashion, you are perfectly free to do so. That is the point of having
free software; the source code is available and you are free to change
it as you want (@pxref{Copying}).

This @value{SECTION} discusses the ways you might want to change @command{gawk}
as well as any considerations you should bear in mind.

@menu
* Accessing The Source::           Accessing the Git repository.
* Adding Code::                    Adding code to the main body of
                                   @command{gawk}.
* New Ports::                      Porting @command{gawk} to a new operating
                                   system.
* Derived Files::                  Why derived files are kept in the Git
                                   repository.
@end menu

@node Accessing The Source
@appendixsubsec Accessing The @command{gawk} Git Repository

As @command{gawk} is Free Software, the source code is always available.
@ref{Gawk Distribution} describes how to get and build the formal,
released versions of @command{gawk}.

@cindex @command{git} utility
However, if you want to modify @command{gawk} and contribute back your
changes, you will probably wish to work with the development version.
To do so, you will need to access the @command{gawk} source code
repository. The code is maintained using the
@uref{https://git-scm.com, Git distributed version control system}.
You will need to install it if your system doesn't have it.
Once you have done so, use the command:

@example
git clone git://git.savannah.gnu.org/gawk.git
@end example

@noindent
This clones the @command{gawk} repository.
If you are behind a
firewall that does not allow you to use the Git native protocol, you
can still access the repository using:

@example
git clone https://git.savannah.gnu.org/r/gawk.git
@end example

Once you have made changes, you can use @samp{git diff} to produce a
patch, and send that to the @command{gawk} maintainer; see @ref{Bugs},
for how to do that.

Once upon a time there was a Git--CVS gateway for use by people who could
not install Git. However, this gateway no longer works, so you may have
better luck using a more modern version control system such as Bazaar,
which has a Git plug-in for working with Git repositories.

@node Adding Code
@appendixsubsec Adding New Features

@cindex adding @subentry features to @command{gawk}
@cindex features @subentry adding to @command{gawk}
@cindex @command{gawk} @subentry features @subentry adding
You are free to add any new features you like to @command{gawk}.
However, if you want your changes to be incorporated into the @command{gawk}
distribution, there are several steps that you need to take in order to
make it possible to include them:

@enumerate 1
@item
Before building the new feature into @command{gawk} itself,
consider writing it as an extension
(@pxref{Dynamic Extensions}).
If that's not possible, continue with the rest of the steps in this list.

@item
Be prepared to sign the appropriate paperwork.
In order for the FSF to distribute your changes, you must either place
those changes in the public domain and submit a signed statement to that
effect, or assign the copyright in your changes to the FSF.
Both of these actions are easy to do and @emph{many} people have done so
already. If you have questions, please contact me
(@pxref{Bugs}),
or @EMAIL{assign@@gnu.org,assign at gnu dot org}.

@item
Get the latest version.
It is much easier for me to integrate changes if they are relative to
the most recent distributed version of @command{gawk}, or better yet,
relative to the latest code in the Git repository. If your version of
@command{gawk} is very old, I may not be able to integrate your changes at all.
(@xref{Getting},
for information on getting the latest version of @command{gawk}.)

@item
@ifnotinfo
Follow the @cite{GNU Coding Standards}.
@end ifnotinfo
@ifinfo
See @inforef{Top, , Version, standards, GNU Coding Standards}.
@end ifinfo
This document describes how GNU software should be written. If you haven't
read it, please do so, preferably @emph{before} starting to modify @command{gawk}.
(The @cite{GNU Coding Standards} are available from
the GNU Project's
@uref{https://www.gnu.org/prep/standards/, website}.
Texinfo, Info, and DVI versions are also available.)

@cindex @command{gawk} @subentry coding style in
@item
Use the @command{gawk} coding style.
The C code for @command{gawk} follows the instructions in the
@cite{GNU Coding Standards}, with minor exceptions. The code is formatted
using the traditional ``K&R'' style, particularly as regards the placement
of braces and the use of TABs. In brief, the coding rules for @command{gawk}
are as follows:

@itemize @value{BULLET}
@item
Use ANSI/ISO style (prototype) function headers when defining functions.

@item
Put the name of the function at the beginning of its own line.

@item
Use @samp{#elif} instead of nesting @samp{#if} inside @samp{#else}.

@item
Put the return type of the function, even if it is @code{int}, on the
line above the line with the name and arguments of the function.

@item
Put spaces around parentheses used in control structures
(@code{if}, @code{while}, @code{for}, @code{do}, @code{switch},
and @code{return}).

@item
Do not put spaces in front of parentheses used in function calls.

@item
Put spaces around all C operators and after commas in function calls.

@item
Do not use the comma operator to produce multiple side effects, except
in @code{for} loop initialization and increment parts, and in macro bodies.

@item
Use real TABs for indenting, not spaces.

@item
Use the ``K&R'' brace layout style.

@item
Use comparisons against @code{NULL} and @code{'\0'} in the conditions of
@code{if}, @code{while}, and @code{for} statements, as well as in the @code{case}s
of @code{switch} statements, instead of just the
plain pointer or character value.

@item
Use @code{true} and @code{false} for @code{bool} values,
the @code{NULL} symbolic constant for pointer values,
and the character constant @code{'\0'} where appropriate, instead of @code{1}
and @code{0}.

@item
Provide one-line descriptive comments for each function.

@item
Do not use the @code{alloca()} function for allocating memory off the
stack. Its use causes more portability trouble than is worth the minor
benefit of not having to free the storage. Instead, use @code{malloc()}
and @code{free()}.

@item
Do not use comparisons of the form @samp{! strcmp(a, b)} or similar.
As Henry Spencer once said, ``@code{strcmp()} is not a boolean!''
Instead, use @samp{strcmp(a, b) == 0}.

@item
If adding new bit flag values, use explicit hexadecimal constants
(@code{0x001}, @code{0x002}, @code{0x004}, and so on) instead of
shifting one left by successive amounts (@samp{(1<<0)}, @samp{(1<<1)},
and so on).
@end itemize

@quotation NOTE
If I have to reformat your code to follow the coding style used in
@command{gawk}, I may not bother to integrate your changes at all.
@end quotation

@cindex Texinfo
@item
Update the documentation.
Along with your new code, please supply new sections and/or chapters
for this @value{DOCUMENT}. If at all possible, please use real
Texinfo, instead of just supplying unformatted ASCII text (although
even that is better than no documentation at all).
Conventions to be followed in @cite{@value{TITLE}} are provided
after the @samp{@@bye} at the end of the Texinfo source file.
If possible, please update the @command{man} page as well.

You will also have to sign paperwork for your documentation changes.

@cindex @command{git} utility
@item
Submit changes as unified diffs.
Use @samp{diff -u -r -N} to compare
the original @command{gawk} source tree with your version.
I recommend using the GNU version of @command{diff}, or best of all,
@samp{git diff} or @samp{git format-patch}.
Send the output produced by @command{diff} to me when you
submit your changes.
(@xref{Bugs}, for the electronic mail
information.)

Using this format makes it easy for me to apply your changes to the
master version of the @command{gawk} source code (using @command{patch}).
If I have to apply the changes manually, using a text editor, I may
not do so, particularly if there are lots of changes.

@item
Include an entry for the @file{ChangeLog} file with your submission.
This helps further minimize the amount of work I have to do,
making it easier for me to accept patches.
It is simplest if you just make this part of your diff.
@end enumerate

Although this sounds like a lot of work, please remember that while you
may write the new code, I have to maintain it and support it. If it
isn't possible for me to do that with a minimum of extra work, then I
probably will not.

@node New Ports
@appendixsubsec Porting @command{gawk} to a New Operating System
@cindex portability @subentry @command{gawk}
@cindex operating systems @subentry porting @command{gawk} to

@cindex porting @command{gawk}
If you want to port @command{gawk} to a new operating system, there are
several steps:

@enumerate 1
@item
Follow the guidelines in
@ifinfo
@ref{Adding Code},
@end ifinfo
@ifnotinfo
the previous @value{SECTION}
@end ifnotinfo
concerning coding style, submission of diffs, and so on.

@item
Be prepared to sign the appropriate paperwork.
In order for the FSF to distribute your code, you must either place
your code in the public domain and submit a signed statement to that
effect, or assign the copyright in your code to the FSF.
Both of these actions are easy to do and @emph{many} people have done so
already. If you have questions, please contact me, or
@EMAIL{gnu@@gnu.org, gnu at gnu dot org}.

@item
When doing a port, bear in mind that your code must coexist peacefully
with the rest of @command{gawk} and the other ports. Avoid gratuitous
changes to the system-independent parts of the code. If at all possible,
avoid sprinkling @samp{#ifdef}s just for your port throughout the
code.

If the changes needed for a particular system affect too much of the
code, I probably will not accept them. In such a case, you can, of course,
distribute your changes on your own, as long as you comply
with the GPL
(@pxref{Copying}).

@item
A number of the files that come with @command{gawk} are maintained by other
people. Thus, you should not change them
unless it is for a very good reason; i.e., changes are not out of the
question, but changes to these files are scrutinized extra carefully.
These are all the files in the @file{support} directory
within the @command{gawk} distribution. See there.

@item
A number of other files are provided by the GNU
Autotools (Autoconf, Automake, and GNU @command{gettext}).
You should not change them either, unless it is for a very
good reason. The files are
@file{ABOUT-NLS},
@file{config.guess},
@file{config.rpath},
@file{config.sub},
@file{depcomp},
@file{INSTALL},
@file{install-sh},
@file{missing},
@file{mkinstalldirs},
and
@file{ylwrap}.

@item
Be willing to continue to maintain the port.
Non-Unix operating systems are supported by volunteers who maintain
the code needed to compile and run @command{gawk} on their systems. If no one
volunteers to maintain a port, it becomes unsupported and it may
be necessary to remove it from the distribution.

@item
Supply an appropriate @file{gawkmisc.???} file.
Each port has its own @file{gawkmisc.???} that implements certain
operating system specific functions. This is cleaner than a plethora of
@samp{#ifdef}s scattered throughout the code. The @file{gawkmisc.c} in
the main source directory includes the appropriate
@file{gawkmisc.???} file from each subdirectory.
Be sure to update it as well.

Each port's @file{gawkmisc.???} file has a suffix reminiscent of the machine
or operating system for the port---for example, @file{pc/gawkmisc.pc} and
@file{vms/gawkmisc.vms}.
The use of separate suffixes, instead of plain
@file{gawkmisc.c}, makes it possible to move files from a port's subdirectory
into the main subdirectory, without accidentally destroying the real
@file{gawkmisc.c} file. (Currently, this is only an issue for the
PC operating system ports.)

@item
Supply a @file{Makefile} as well as any other C source and header files that are
necessary for your operating system. All your code should be in a
separate subdirectory, with a name that is the same as, or reminiscent
of, either your operating system or the computer system. If possible,
try to structure things so that it is not necessary to move files out
of the subdirectory into the main source directory. If that is not
possible, then be sure to avoid using names for your files that
duplicate the names of files in the main source directory.

@item
Update the documentation.
Please write a section (or sections) for this @value{DOCUMENT} describing the
installation and compilation steps needed to compile and/or install
@command{gawk} for your system.
@end enumerate

Following these steps makes it much easier to integrate your changes
into @command{gawk} and have them coexist happily with other
operating systems' code that is already there.

In the code that you supply and maintain, feel free to use a
coding style and brace layout that suits your taste.

@node Derived Files
@appendixsubsec Why Generated Files Are Kept In Git

@cindex Git, use of for @command{gawk} source code
@c From emails written March 22, 2012, to the gawk developers list.

If you look at the @command{gawk} source in the Git
repository, you will notice that it includes files that are automatically
generated by GNU infrastructure tools, such as @file{Makefile.in} from
Automake and even @file{configure} from Autoconf.
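
If you would like to see such files for yourself, you can filter the
list of files that Git tracks. The following is a minimal sketch and
not part of the @command{gawk} distribution: the function name
@code{list_generated} is made up for illustration, and the pattern
covers only the two kinds of files named above.

```shell
# list_generated: read file names on standard input, one per line, and
# print those that look like Autotools output.  The function name and
# the pattern are illustrative assumptions; only Makefile.in and
# configure are matched, at any directory depth.
list_generated() {
    grep -E '(^|/)(Makefile\.in|configure)$'
}

# Typical use, from the top of a clone of the repository:
#   git ls-files | list_generated
```

Because the function only filters standard input, it can be tried
on any list of file names, not just the output of @samp{git ls-files}.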

This is different from many Free Software projects that do not store
the derived files, because that keeps the repository less cluttered,
and it is easier to see the substantive changes when comparing versions
and trying to understand what changed between commits.

However, there are several reasons why the @command{gawk} maintainer
likes to have everything in the repository.

First, because it is then easy to reproduce any given version completely,
without relying upon the availability of (older, likely obsolete, and
maybe even impossible to find) other tools.

As an extreme example, if you ever even think about trying to compile,
oh, say, the V7 @command{awk}, you will discover that not only do you
have to bootstrap the V7 @command{yacc} to do so, but you also need the
V7 @command{lex}. And the latter is pretty much impossible to bring up
on a modern GNU/Linux system.@footnote{We tried. It was painful.}

(Or, let's say @command{gawk} 1.2 required @command{bison} whatever-it-was
in 1989 and that there was no @file{awkgram.c} file in the repository. Is
there a guarantee that we could find that @command{bison} version? Or that
@emph{it} would build?)

If the repository has all the generated files, then it's easy to just check
them out and build. (Or @emph{easier}, depending upon how far back we go.)

And that brings us to the second (and stronger) reason why all the files
really need to be in Git. It boils down to who do you cater
to---the @command{gawk} developer(s), or the user who just wants to check
out a version and try it out?

The @command{gawk} maintainer
wants it to be possible for any interested @command{awk} user in the
world to just clone the repository, check out the branch of interest and
build it.
Without their having to have the correct version(s) of the
autotools.@footnote{There is one GNU program that is (in our opinion)
severely difficult to bootstrap from the Git repository. For
example, on the author's old (but still working) PowerPC Macintosh with
Mac OS X 10.5, it was necessary to bootstrap a ton of software, starting
with Git itself, in order to try to work with the latest code.
It's not pleasant, and especially on older systems, it's a big waste
of time.

Starting with the latest tarball was no picnic either. The maintainers
had dropped @file{.gz} and @file{.bz2} files and only distribute
@file{.tar.xz} files. It was necessary to bootstrap @command{xz} first!}
That is the point of the @file{bootstrap.sh} file. It touches the
various other files in the right order such that

@example
# The canonical incantation for building GNU software:
./bootstrap.sh && ./configure && make
@end example

@noindent
will @emph{just work}.

This is extremely important for the @code{master} and
@code{gawk-@var{X}.@var{Y}-stable} branches.

Further, the @command{gawk} maintainer would argue that it's also
important for the @command{gawk} developers. When he tried to check out
the @code{xgawk} branch@footnote{A branch (since removed) created by one of the other
developers that did not include the generated files.} to build it, he
couldn't. (No @file{ltmain.sh} file, and he had no idea how to create it,
and that was not the only problem.)

He felt @emph{extremely} frustrated. With respect to that branch,
the maintainer is no different than Jane User who wants to try to build
@code{gawk-4.1-stable} or @code{master} from the repository.

Thus, the maintainer thinks that it's not just important, but critical,
that for any given branch, the above incantation @emph{just works}.
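
The idea behind @file{bootstrap.sh}, as described earlier in this
@value{SECTION}, is to refresh time stamps so that each generated file
looks at least as new as the files it was generated from; this keeps
@command{make} from trying to rerun the autotools. The following is only
a sketch of that technique, not the contents of the real script: the
function name @code{touch_in_order} is hypothetical, and the file names
in the usage comment are merely an assumed example of a plausible ordering.

```shell
# touch_in_order: update the modification time of each named file, in
# the order given, so that later arguments end up newer than earlier
# ones.  Missing files are skipped.  This is an illustrative sketch,
# not the real bootstrap.sh.
touch_in_order() {
    for f in "$@"
    do
        if [ -f "$f" ]
        then
            touch "$f"
            sleep 1    # keep time stamps distinct on coarse file systems
        fi
    done
}

# Example ordering (an assumption): inputs first, outputs last.
#   touch_in_order aclocal.m4 configure Makefile.in
```

The one-second pause trades speed for safety: some file systems record
time stamps with coarse resolution, so files touched in quick succession
could otherwise compare as equally old.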

@c Added 9/2014:
A third reason to have all the files is that without them, using @samp{git
bisect} to try to find the commit that introduced a bug is exceedingly
difficult. The maintainer tried to do that on another project that
requires running bootstrapping scripts just to create @command{configure}
and so on; it was really painful. When the repository is self-contained,
using @command{git bisect} in it is very easy.

@c So - that's my reasoning and philosophy.

What are some of the consequences and/or actions to take?

@enumerate 1
@item
We don't mind that there are differing files in the different branches
as a result of different versions of the autotools.

@enumerate A
@item
It's the maintainer's job to merge them and he will deal with it.

@item
He is really good at @samp{git diff x y > /tmp/diff1 ; gvim /tmp/diff1} to
remove the diffs that aren't of interest in order to review code.
@end enumerate

@item
It would certainly help if everyone used the same versions of the GNU tools
as he does, which in general are the latest released versions of
Automake,
Autoconf,
@command{bison},
GNU @command{gettext},
and
Libtool.

@ignore
If it would help if I sent out an ``I just upgraded to version x.y
of tool Z'' kind of message to this list, I can do that. Up until
now it hasn't been a real issue since I'm the only one who's been
dorking with the configuration machinery.
@end ignore

@c @enumerate A
@c @item
Installing from source is quite easy. It's how the maintainer worked for years
(and still works).
He had @file{/usr/local/bin} at the front of his @env{PATH} and just did:

@example
wget https://ftp.gnu.org/gnu/@var{package}/@var{package}-@var{x}.@var{y}.@var{z}.tar.gz
tar -xpzvf @var{package}-@var{x}.@var{y}.@var{z}.tar.gz
cd @var{package}-@var{x}.@var{y}.@var{z}
./configure && make && make check
make install # as root
@end example

@quotation NOTE
Because of the @samp{https://} URL, you may have to supply the
@option{--no-check-certificate} option to @command{wget} to download
the file.
@end quotation

@c @item
@ignore
These days the maintainer uses Ubuntu 12.04 which is medium current, but
he is already doing the above for Automake, Autoconf, and @command{bison}.
@end ignore

@ignore
(C. Rant: Recent Linux versions with GNOME 3 really suck. What
 are all those people thinking? Fedora 15 was such a bust it drove
 me to Ubuntu, but Ubuntu 11.04 and 11.10 are totally unusable from
 a UI perspective. Bleah.)
@end ignore
@c @end enumerate

@ignore
@item
If someone still feels really strongly about all this, then perhaps they
can have two branches, one for their development with just the clean
changes, and one that is buildable (xgawk and xgawk-buildable, maybe).
Or, as I suggested in another mail, make commits in pairs, the first with
the "real" changes and the second with "everything else needed for
 building".
@end ignore
@end enumerate

Most of the above was originally written by the maintainer to other
@command{gawk} developers. It raised the objection from one of
the developers ``@dots{} that anybody pulling down the source from
Git is not an end user.''

However, this is not true.
There are ``power @command{awk} users''
who can build @command{gawk} (using the magic incantation shown previously)
but who can't program in C. Thus, the major branches should be
kept buildable all the time.

It was then suggested that there be a @command{cron} job to create
nightly tarballs of ``the source.'' Here, the problem is that there
are source trees, corresponding to the various branches! So,
nightly tarballs aren't the answer, especially as the repository can go
for weeks without significant change being introduced.

Fortunately, the Git server can meet this need. For any given
branch named @var{branchname}, use:

@example
wget https://git.savannah.gnu.org/cgit/gawk.git/snapshot/gawk-@var{branchname}.tar.gz
@end example

@noindent
to retrieve a snapshot of the given branch.

@node Future Extensions
@appendixsec Probable Future Extensions
@ignore
From emory!scalpel.netlabs.com!lwall Tue Oct 31 12:43:17 1995
Return-Path: <emory!scalpel.netlabs.com!lwall>
Message-Id: <9510311732.AA28472@scalpel.netlabs.com>
To: arnold@skeeve.atl.ga.us (Arnold D. Robbins)
Subject: Re: May I quote you?
In-Reply-To: Your message of "Tue, 31 Oct 95 09:11:00 EST."
             <m0tAHPQ-00014MC@skeeve.atl.ga.us>
Date: Tue, 31 Oct 95 09:32:46 -0800
From: Larry Wall <emory!scalpel.netlabs.com!lwall>

: Greetings. I am working on the release of gawk 3.0. Part of it will be a
: thoroughly updated manual. One of the sections deals with planned future
: extensions and enhancements.
I have the following at the beginning
: of it:
:
: @cindex PERL
: @cindex Wall, Larry
: @display
: @i{AWK is a language similar to PERL, only considerably more elegant.} @*
: Arnold Robbins
: @sp 1
: @i{Hey!} @*
: Larry Wall
: @end display
:
: Before I actually release this for publication, I wanted to get your
: permission to quote you. (Hopefully, in the spirit of much of GNU, the
: implied humor is visible... :-)

I think that would be fine.

Larry
@end ignore
@cindex Perl
@cindex Wall, Larry
@cindex Robbins @subentry Arnold
@quotation
@i{AWK is a language similar to PERL, only considerably more elegant.}
@author Arnold Robbins
@end quotation

@quotation
@i{Hey!}
@author Larry Wall
@end quotation

The @file{TODO} file in the @code{master} branch of the @command{gawk}
Git repository lists possible future enhancements. Some of these relate
to the source code, and others to possible new features. Please see
that file for the list.
@xref{Additions},
if you are interested in tackling any of the projects listed there.

@node Implementation Limitations
@appendixsec Some Limitations of the Implementation

The following table describes the limits of @command{gawk} on a Unix-like
system (although they vary even then). Other systems may have
different limits.

@multitable @columnfractions .40 .60
@headitem Item @tab Limit
@item Characters in a character class @tab 2^(number of bits per byte)
@item Length of input record in bytes @tab @code{ULONG_MAX}
@item Length of output record @tab Unlimited
@item Length of source line @tab Unlimited
@item Number of fields in a record @tab @code{ULONG_MAX}
@item Number of file redirections @tab Unlimited
@item Number of input records in one file @tab @code{LONG_MAX}
@item Number of input records total @tab @code{LONG_MAX}
@item Number of pipe redirections @tab min(number of processes per user, number of open files)
@item Numeric values @tab Double-precision floating point (if not using MPFR)
@item Size of a field in bytes @tab @code{ULONG_MAX}
@item Size of a literal string in bytes @tab @code{ULONG_MAX}
@item Size of a printf string in bytes @tab @code{ULONG_MAX}
@end multitable

@node Extension Design
@appendixsec Extension API Design

This @value{SECTION} documents the design of the extension API,
including a discussion of some of the history and problems that needed
to be solved.

The first version of extensions for @command{gawk} was developed in
the mid-1990s and released with @command{gawk} 3.1 in the late 1990s.
The basic mechanisms and design remained unchanged for close to 15 years,
until 2012.

The old extension mechanism used data types and functions from
@command{gawk} itself, with a ``clever hack'' to install extension
functions.

@command{gawk} included some sample extensions, of which a few were
really useful. However, it was clear from the outset that the extension
mechanism was bolted onto the side and was not really well thought out.

@menu
* Old Extension Problems::           Problems with the old mechanism.
* Extension New Mechanism Goals::    Goals for the new mechanism.
* Extension Other Design Decisions::   Some other design decisions.
* Extension Future Growth::            Some room for future growth.
@end menu

@node Old Extension Problems
@appendixsubsec Problems With The Old Mechanism

The old extension mechanism had several problems:

@itemize @value{BULLET}
@item
It depended heavily upon @command{gawk} internals. Any time the
@code{NODE} structure@footnote{A critical central data structure
inside @command{gawk}.} changed, an extension would have to be
recompiled. Furthermore, writing real extensions required understanding
something about @command{gawk}'s internal functions. There was some
documentation in this @value{DOCUMENT}, but it was quite minimal.

@item
Being able to call into @command{gawk} from an extension required linker
facilities that are common on Unix-derived systems but that did
not work on MS-Windows systems; users wanting extensions on MS-Windows
had to statically link them into @command{gawk}, even though MS-Windows supports
dynamic loading of shared objects.

@item
The API would change occasionally as @command{gawk} changed; no compatibility
between versions was ever offered or planned for.
@end itemize

Despite the drawbacks, the @command{xgawk} project developers forked
@command{gawk} and developed several significant extensions. They also
enhanced @command{gawk}'s facilities relating to file inclusion and
shared object access.

A new API was desired for a long time, but only in 2012 did the
@command{gawk} maintainer and the @command{xgawk} developers finally
start working on it together. More information about the @command{xgawk}
project is provided in @ref{gawkextlib}.

@node Extension New Mechanism Goals
@appendixsubsec Goals For A New Mechanism

Some goals for the new API were:

@itemize @value{BULLET}
@item
The API should be independent of @command{gawk} internals. Changes in
@command{gawk} internals should not be visible to the writer of an
extension function.

@item
The API should provide @emph{binary} compatibility across @command{gawk}
releases as long as the API itself does not change.

@item
The API should enable extensions written in C or C++ to have roughly the
same ``appearance'' to @command{awk}-level code as @command{awk}
functions do. This means that extensions should have:

@itemize @value{MINUS}
@item
The ability to access function parameters.

@item
The ability to turn an undefined parameter into an array (call by reference).

@item
The ability to create, access and update global variables.

@item
Easy access to all the elements of an array at once (``array flattening'')
in order to loop over all the elements in an easy fashion for C code.

@item
The ability to create arrays (including @command{gawk}'s true
arrays of arrays).
@end itemize
@end itemize

Some additional important goals were:

@itemize @value{BULLET}
@item
The API should use only features in ISO C 90, so that extensions
can be written using the widest range of C and C++ compilers. The header
should include the appropriate @samp{#ifdef __cplusplus} and @samp{extern "C"}
magic so that a C++ compiler could be used. (If using C++, the runtime
system has to be smart enough to call any constructors and destructors,
as @command{gawk} is a C program. As of this writing, this has not been
tested.)

@item
The API mechanism should not require access to @command{gawk}'s
symbols@footnote{The @dfn{symbols} are the variables and functions
defined inside @command{gawk}. Access to these symbols by code
external to @command{gawk} loaded dynamically at runtime is
problematic on MS-Windows.} by the compile-time or dynamic linker,
in order to enable creation of extensions that also work on MS-Windows.
@end itemize

During development, it became clear that there were other features
that should be available to extensions, which were also subsequently
provided:

@itemize @value{BULLET}
@item
Extensions should have the ability to hook into @command{gawk}'s
I/O redirection mechanism. In particular, the @command{xgawk}
developers provided a so-called ``open hook'' to take over reading
records. During development, this was generalized to allow
extensions to hook into input processing, output processing, and
two-way I/O.

@item
An extension should be able to provide a ``call back'' function
to perform cleanup actions when @command{gawk} exits.

@item
An extension should be able to provide a version string so that
@command{gawk}'s @option{--version} option can provide information
about extensions as well.
@end itemize

The requirement to avoid access to @command{gawk}'s symbols is, at first
glance, a difficult one to meet.

One design, apparently used by Perl and Ruby and maybe others, would
be to make the mainline @command{gawk} code into a library, with the
@command{gawk} utility a small C @code{main()} function linked against
the library.

This seemed like the tail wagging the dog, complicating build and
installation and making a simple copy of the @command{gawk} executable
from one system to another (or one place to another on the same
system!)
into a chancy operation.

Pat Rankin suggested the solution that was adopted.
@xref{Extension Mechanism Outline}, for the details.

@node Extension Other Design Decisions
@appendixsubsec Other Design Decisions

As an arbitrary design decision, extensions can read the values of
predefined variables and arrays (such as @code{ARGV} and @code{FS}), but cannot
change them, with the exception of @code{PROCINFO}.

The reason for this is to prevent an extension function from affecting
the flow of an @command{awk} program outside its control. While a real
@command{awk} function can do what it likes, that is at the discretion
of the programmer. An extension function should provide a service or
make a C API available for use within @command{awk}, and not mess with
@code{FS} or @code{ARGC} and @code{ARGV}.

In addition, it becomes easy to start down a slippery slope. How
much access to @command{gawk} facilities do extensions need?
Do they need @code{getline}? What about calling @code{gsub()} or
compiling regular expressions? What about calling into @command{awk}
functions? (@emph{That} would be messy.)

In order to avoid these issues, the @command{gawk} developers chose
to start with the simplest, most basic features that are still truly useful.

Another decision is that although @command{gawk} provides nice things like
MPFR, and arrays indexed internally by integers, these features are not
being brought out to the API in order to keep things simple and close to
traditional @command{awk} semantics. (In fact, arrays indexed internally
by integers are so transparent that they aren't even documented!)

Additionally, all functions in the API check that their pointer
input parameters are not @code{NULL}. If they are, they return an error.
(It is a good idea for extension code to verify that
pointers received from @command{gawk} are not @code{NULL}.
Such a thing should not happen, but the @command{gawk} developers
are only human, and they have been known to occasionally make
mistakes.)

With time, the API will undoubtedly evolve; the @command{gawk} developers
expect this to be driven by user needs. For now, the current API seems
to provide a minimal yet powerful set of features for creating extensions.

@node Extension Future Growth
@appendixsubsec Room For Future Growth

The API can later be expanded, in at least the following way:

@itemize @value{BULLET}
@item
@command{gawk} passes an ``extension id'' into the extension when it
first loads the extension. The extension then passes this id back
to @command{gawk} with each function call. This mechanism allows
@command{gawk} to identify the extension calling into it, should it need
to know.

@end itemize

Of course, as of this writing, no decisions have been made with respect
to the above.

@node Notes summary
@appendixsec Summary

@itemize @value{BULLET}
@item
@command{gawk}'s extensions can be disabled with either the
@option{--traditional} option or with the @option{--posix} option.
The @option{--parsedebug} option is available if @command{gawk} is
compiled with @samp{-DDEBUG}.

@item
The source code for @command{gawk} is maintained in a publicly
accessible Git repository. Anyone may check it out and view the source.

@item
Contributions to @command{gawk} are welcome. Following the steps
outlined in this @value{CHAPTER} will make it easier to integrate
your contributions into the code base.
This applies both to new feature contributions and to ports to
additional operating systems.

@item
@command{gawk} has some limits---generally those that are imposed by
the machine architecture.

@item
The extension API design was intended to solve a number of problems
with the previous extension mechanism, enable features needed by
the @code{xgawk} project, and provide binary compatibility going forward.

@item
The previous extension mechanism is no longer supported and was
removed from the code base with the 4.2 release.

@end itemize


@node Basic Concepts
@appendix Basic Programming Concepts
@cindex programming @subentry concepts

This @value{APPENDIX} attempts to define some of the basic concepts
and terms that are used throughout the rest of this @value{DOCUMENT}.
As this @value{DOCUMENT} is specifically about @command{awk},
and not about computer programming in general, the coverage here
is by necessity fairly cursory and simplistic.
(If you need more background, there are many
other introductory texts that you should refer to instead.)

@menu
* Basic High Level::            The high level view.
* Basic Data Typing::           A very quick intro to data types.
@end menu

@node Basic High Level
@appendixsec What a Program Does

@cindex processing data
At the most basic level, the job of a program is to process
some input data and produce results.
@ifnotdocbook
See @ref{figure-general-flow}.
@end ifnotdocbook
@ifdocbook
See @inlineraw{docbook, <xref linkend="figure-general-flow"/>}.
@end ifdocbook

@ifnotdocbook
@float Figure,figure-general-flow
@caption{General Program Flow}
@center @image{general-program, , , General program flow}
@end float
@end ifnotdocbook

@docbook
<figure id="figure-general-flow" float="0">
<title>General Program Flow</title>
<mediaobject>
<imageobject role="web"><imagedata fileref="general-program.png" format="PNG"/></imageobject>
</mediaobject>
</figure>
@end docbook

@cindex compiled programs
@cindex interpreted programs
The ``program'' in the figure can be either a compiled
program@footnote{Compiled programs are typically written
in lower-level languages such as C, C++, or Ada,
and then translated, or @dfn{compiled}, into a form that
the computer can execute directly.}
(such as @command{ls}),
or it may be @dfn{interpreted}. In the latter case, a machine-executable
program such as @command{awk} reads your program, and then uses the
instructions in your program to process the data.
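
As a small illustration of the interpreted case (a sketch only; the
two input lines are made-up sample data supplied by @command{printf}),
the following command hands @command{awk} a one-line program, which
@command{awk} reads and then applies to each input record in turn:

@example
printf 'alpha\nbeta\n' | awk '@{ print NR ": " $0 @}'
@end example

@noindent
Here the program text is @samp{@{ print NR ": " $0 @}}; @command{awk}
itself is the machine-executable program that carries it out, printing
each record preceded by its record number.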

@cindex programming @subentry basic steps
When you write a program, it usually consists
of the following, very basic set of steps,
@ifnotdocbook
as shown in @ref{figure-process-flow}:
@end ifnotdocbook
@ifdocbook
as shown in @inlineraw{docbook, <xref linkend="figure-process-flow"/>}:
@end ifdocbook

@ifnotdocbook
@float Figure,figure-process-flow
@caption{Basic Program Steps}
@center @image{process-flow, , , Basic Program Steps}
@end float
@end ifnotdocbook

@docbook
<figure id="figure-process-flow" float="0">
<title>Basic Program Steps</title>
<mediaobject>
<imageobject role="web"><imagedata fileref="process-flow.png" format="PNG"/></imageobject>
</mediaobject>
</figure>
@end docbook

@table @asis
@item Initialization
These are the things you do before actually starting to process
data, such as checking arguments, initializing any data you need
to work with, and so on.
This step corresponds to @command{awk}'s @code{BEGIN} rule
(@pxref{BEGIN/END}).

If you were baking a cake, this might consist of laying out all the
mixing bowls and the baking pan, and making sure you have all the
ingredients that you need.

@item Processing
This is where the actual work is done. Your program reads data,
one logical chunk at a time, and processes it as appropriate.

In most programming languages, you have to manually manage the reading
of data, checking to see if there is more each time you read a chunk.
@command{awk}'s pattern-action paradigm
(@pxref{Getting Started})
handles the mechanics of this for you.

In baking a cake, the processing corresponds to the actual labor:
breaking eggs, mixing the flour, water, and other ingredients, and then putting the cake
into the oven.

@item Clean Up
Once you've processed all the data, you may have things you need to
do before exiting.
This step corresponds to @command{awk}'s @code{END} rule
(@pxref{BEGIN/END}).

After the cake comes out of the oven, you still have to wrap it in
plastic wrap to keep anyone from tasting it, as well as wash
the mixing bowls and utensils.
@end table

@cindex algorithms
An @dfn{algorithm} is a detailed set of instructions necessary to accomplish
a task, or process data. It is much the same as a recipe for baking
a cake. Programs implement algorithms. Often, it is up to you to design
the algorithm and implement it, simultaneously.

@cindex records
@cindex fields
The ``logical chunks'' we talked about previously are called @dfn{records},
similar to the records a company keeps on employees, a school keeps for
students, or a doctor keeps for patients.
Each record has many component parts, such as first and last names,
date of birth, address, and so on. The component parts are referred
to as the @dfn{fields} of the record.

The act of reading data is termed @dfn{input}, and that of
generating results, not too surprisingly, is termed @dfn{output}.
They are often referred to together as ``input/output,''
and even more often, as ``I/O'' for short.
(You will also see ``input'' and ``output'' used as verbs.)

@cindex data-driven languages
@cindex languages, data-driven
@command{awk} manages the reading of data for you, as well as
breaking it up into records and fields. Your program's job is to
tell @command{awk} what to do with the data. You do this by describing
@dfn{patterns} in the data to look for, and @dfn{actions} to execute
when those patterns are seen.
This @dfn{data-driven} nature of
@command{awk} programs usually makes them both easier to write
and easier to read.

@node Basic Data Typing
@appendixsec Data Values in a Computer

@cindex variables
In a program,
you keep track of information and values in things called @dfn{variables}.
A variable is just a name for a given value, such as @code{first_name},
@code{last_name}, @code{address}, and so on.
@command{awk} has several predefined variables, and it has
special names to refer to the current input record
and the fields of the record.
You may also group multiple
associated values under one name, as an array.

@cindex values @subentry numeric
@cindex values @subentry string
@cindex scalar values
Data, particularly in @command{awk}, consists of either numeric
values, such as 42 or 3.1415927, or string values.
String values are essentially anything that's not a number, such as a name.
Strings are sometimes referred to as @dfn{character data}, since they
store the individual characters that comprise them.
Individual values, whether numeric or string, are
referred to as @dfn{scalar} values.
Groups of values, such as arrays, are not scalars.

@ref{Computer Arithmetic}, provided a basic introduction to numeric
types (integer and floating-point) and how they are used in a computer.
Please review that information, including a number of caveats that
were presented.

@cindex null strings
While you are probably used to the idea of a number without a value (i.e., zero),
it takes a bit more getting used to the idea of zero-length character data.
Nevertheless, such a thing exists.
It is called the @dfn{null string}.
The null string is character data that has no value.
In other words, it is empty.
It is written in @command{awk} programs
like this: @code{""}.

Humans are used to working in decimal; i.e., base 10. In base 10,
numbers go from 0 to 9, and then ``roll over'' into the next
@iftex
column. (Remember grade school? @math{42 = 4\times 10 + 2}.)
@end iftex
@ifnottex
column. (Remember grade school? 42 = 4 x 10 + 2.)
@end ifnottex

There are other number bases though. Computers commonly use base 2
or @dfn{binary}, base 8 or @dfn{octal}, and base 16 or @dfn{hexadecimal}.
In binary, each column represents two times the value in the column to
its right. Each column may contain either a 0 or a 1.
@iftex
Thus, binary 1010 represents @math{(1\times 8) + (0\times 4) + (1\times 2) + (0\times 1)}, or decimal 10.
@end iftex
@ifnottex
Thus, binary 1010 represents (1 x 8) + (0 x 4) + (1 x 2)
+ (0 x 1), or decimal 10.
@end ifnottex
Octal and hexadecimal are discussed more in
@ref{Nondecimal-numbers}.

At the very lowest level, computers store values as groups of binary digits,
or @dfn{bits}. Modern computers group bits into groups of eight, called @dfn{bytes}.
Advanced applications sometimes have to manipulate bits directly,
and @command{gawk} provides functions for doing so.

Programs are written in programming languages.
Hundreds, if not thousands, of programming languages exist.
One of the most popular is the C programming language.
The C language had a very strong influence on the design of
the @command{awk} language.

@cindex Kernighan, Brian
@cindex Ritchie, Dennis
There have been several versions of C. The first is often referred to
as ``K&R'' C, after the initials of Brian Kernighan and Dennis Ritchie,
the authors of the first book on C. (Dennis Ritchie created the language,
and Brian Kernighan was one of the creators of @command{awk}.)

In the mid-1980s, an effort began to produce an international standard
for C. This work culminated in 1989, with the production of the ANSI
standard for C. This standard became an ISO standard in 1990.
In 1999, a revised ISO C standard was approved and released.
Where it makes sense, POSIX @command{awk} is compatible with 1999 ISO C.


@node Glossary
@unnumbered Glossary

@table @asis
@item Action
A series of @command{awk} statements attached to a rule. If the rule's
pattern matches an input record, @command{awk} executes the
rule's action. Actions are always enclosed in braces.
(@xref{Action Overview}.)

@cindex Ada programming language
@cindex programming languages @subentry Ada
@item Ada
A programming language originally defined by the U.S.@: Department of
Defense for embedded programming. It was designed to enforce good
software engineering practices.

@cindex Spencer, Henry
@cindex @command{sed} utility
@cindex amazing @command{awk} assembler (@command{aaa})
@cindex @command{aaa} (amazing @command{awk} assembler) program
@item Amazing @command{awk} Assembler
Henry Spencer at the University of Toronto wrote a retargetable assembler
completely as @command{sed} and @command{awk} scripts. It is thousands
of lines long, including machine descriptions for several eight-bit
microcomputers. It is a good example of a program that would have been
better written in another language.
@c You can get it from @uref{http://awk.info/?awk100/aaa}.

@cindex amazingly workable formatter (@command{awf})
@cindex @command{awf} (amazingly workable formatter) program
@item Amazingly Workable Formatter (@command{awf})
Henry Spencer at the University of Toronto wrote a formatter that accepts
a large subset of the @samp{nroff -ms} and @samp{nroff -man} formatting
commands, using @command{awk} and @command{sh}.
@c It is available
@c from @uref{http://awk.info/?tools/awf}.

@item Anchor
The regexp metacharacters @samp{^} and @samp{$}, which force the match
to the beginning or end of the string, respectively.

@cindex ANSI
@item ANSI
The American National Standards Institute. This organization produces
many standards, among them the standards for the C and C++ programming
languages.
These standards often become international standards as well. See also
``ISO.''

@item Argument
An argument can be two different things. It can be an option or a
@value{FN} passed to a command while invoking it from the command line, or
it can be something passed to a @dfn{function} inside a program, e.g.,
inside @command{awk}.

In the latter case, an argument can be passed to a function in two ways.
Either it is given to the called function by value, i.e., a copy of the
value of the variable is made available to the called function, but the
original variable cannot be modified by the function itself; or it is
given by reference, i.e., a pointer to the variable of interest is passed to
the function, which can then directly modify it. In @command{awk},
scalars are passed by value, and arrays are passed by reference.
See ``Pass By Value/Reference.''

@item Array
A grouping of multiple values under the same name.
Most languages just provide sequential arrays.
@command{awk} provides associative arrays.

@item Assertion
A statement in a program that a condition is true at this point in the program.
Useful for reasoning about how a program is supposed to behave.

@item Assignment
An @command{awk} expression that changes the value of some @command{awk}
variable or data object. An object that you can assign to is called an
@dfn{lvalue}. The assigned values are called @dfn{rvalues}.
@xref{Assignment Ops}.

@item Associative Array
Arrays in which the indices may be numbers or strings, not just
sequential integers in a fixed range.

@item @command{awk} Language
The language in which @command{awk} programs are written.

@item @command{awk} Program
An @command{awk} program consists of a series of @dfn{patterns} and
@dfn{actions}, collectively known as @dfn{rules}. For each input record
given to the program, the program's rules are all processed in turn.
@command{awk} programs may also contain function definitions.

@item @command{awk} Script
Another name for an @command{awk} program.

@item Bash
The GNU version of the standard shell
@ifnotinfo
(the @b{B}ourne-@b{A}gain @b{SH}ell).
@end ifnotinfo
@ifinfo
(the Bourne-Again SHell).
@end ifinfo
See also ``Bourne Shell.''

@item Binary
Base-two notation, where the only digits are @code{0} and @code{1}. Since
electronic circuitry works ``naturally'' in base 2 (just think of Off/On),
everything inside a computer is calculated using base 2. Each digit
represents the presence (or absence) of a power of 2 and is called a
@dfn{bit}. So, for example, the base-two number @code{10101} is
@iftex
the same as decimal 21, (@math{(1\times 16) + (1\times 4) + (1\times 1)}).
@end iftex
@ifnottex
the same as decimal 21, ((1 x 16) + (1 x 4) + (1 x 1)).
@end ifnottex

Since base-two numbers quickly become
very long to read and write, they are usually grouped by 3 (i.e., they are
read as octal numbers), or by 4 (i.e., they are read as hexadecimal
numbers). There is no direct way to insert base 2 numbers in a C program.
If need arises, such numbers are usually inserted as octal or hexadecimal
numbers. The number of base-two digits that fit into registers used for
representing integer numbers in computers is a rough indication of the
computing power of the computer itself. Most computers nowadays use 64
bits for representing integer numbers in their registers, but 32-bit,
16-bit and 8-bit registers have been widely used in the past.
@xref{Nondecimal-numbers}.

@item Bit
Short for ``Binary Digit.''
All values in computer memory ultimately reduce to binary digits: values
that are either zero or one.
Groups of bits may be interpreted differently---as integers,
floating-point numbers, character data, addresses of other
memory objects, or other data.
@command{awk} lets you work with floating-point numbers and strings.
@command{gawk} lets you manipulate bit values with the built-in
functions described in
@ref{Bitwise Functions}.

Computers are often defined by how many bits they use to represent integer
values. Most current systems are 64-bit systems; 32-bit systems are
still in use, and 16-bit systems have essentially
disappeared.

@item Boolean Expression
Named after the English mathematician Boole. See also ``Logical Expression.''

@item Bourne Shell
The standard shell (@file{/bin/sh}) on Unix and Unix-like systems,
originally written by Stephen R.@: Bourne at Bell Laboratories.
Many shells (Bash, @command{ksh}, @command{pdksh}, @command{zsh}) are
generally upwardly compatible with the Bourne shell.

@item Braces
The characters @samp{@{} and @samp{@}}. Braces are used in
@command{awk} for delimiting actions, compound statements, and function
bodies.

@item Bracket Expression
Inside a @dfn{regular expression}, an expression included in square
brackets, meant to designate a single character as belonging to a
specified character class. A bracket expression can contain a list of one
or more characters, like @samp{[abc]}, a range of characters, like
@samp{[A-Z]}, or a name, delimited by @samp{:}, that designates a known set
of characters, like @samp{[:digit:]}. The form of bracket expression
enclosed between @samp{:} is independent of the underlying representation
of the characters themselves, which could utilize the ASCII, EBCDIC, or
Unicode codesets, depending on the architecture of the computer system, and on
localization.
See also ``Regular Expression.''

@item Built-in Function
The @command{awk} language provides built-in functions that perform various
numerical, I/O-related, and string computations. Examples are
@code{sqrt()} (for the square root of a number) and @code{substr()} (for a
substring of a string).
@command{gawk} provides functions for timestamp management, bit manipulation,
array sorting, type checking,
and runtime string translation.
(@xref{Built-in}.)

@item Built-in Variable
@code{ARGC},
@code{ARGV},
@code{CONVFMT},
@code{ENVIRON},
@code{FILENAME},
@code{FNR},
@code{FS},
@code{NF},
@code{NR},
@code{OFMT},
@code{OFS},
@code{ORS},
@code{RLENGTH},
@code{RSTART},
@code{RS},
and
@code{SUBSEP}
are the variables that have special meaning to @command{awk}.
In addition,
@code{ARGIND},
@code{BINMODE},
@code{ERRNO},
@code{FIELDWIDTHS},
@code{FPAT},
@code{IGNORECASE},
@code{LINT},
@code{PROCINFO},
@code{RT},
and
@code{TEXTDOMAIN}
are the variables that have special meaning to @command{gawk}.
Changing some of them affects @command{awk}'s running environment.
(@xref{Built-in Variables}.)

@item C
The system programming language that most GNU software is written in. The
@command{awk} programming language has C-like syntax, and this @value{DOCUMENT}
points out similarities between @command{awk} and C when appropriate.

In general, @command{gawk} attempts to be as similar to the 1990 version
of ISO C as makes sense.

@item C Shell
The C Shell (@command{csh} or its improved version, @command{tcsh}) is a Unix shell that was
created by Bill Joy in the late 1970s. The C shell was differentiated from
other shells by its interactive features and overall style, which
looks more like C. The C Shell is not backward compatible with the Bourne
Shell, so special attention is required when converting scripts
written for other Unix shells to the C shell, especially with regard to the management of
shell variables.
See also ``Bourne Shell.''

@item C++
A popular object-oriented programming language derived from C.

@item Character Class
See ``Bracket Expression.''

@item Character List
See ``Bracket Expression.''

@cindex ASCII
@cindex ISO @subentry ISO 8859-1 character set
@cindex ISO @subentry ISO Latin-1 character set
@cindex character sets (machine character encodings)
@cindex Unicode
@item Character Set
The set of numeric codes used by a computer system to represent the
characters (letters, numbers, punctuation, etc.) of a particular country
or place.
The most common character set in use today is ASCII (American
Standard Code for Information Interchange). Many European
countries use an extension of ASCII known as ISO-8859-1 (ISO Latin-1).
The @uref{http://www.unicode.org, Unicode character set} is
increasingly popular and standard, and is particularly
widely used on GNU/Linux systems.

@cindex Kernighan, Brian
@cindex Bentley, Jon
@cindex @command{chem} utility
@item CHEM
A preprocessor for @command{pic} that reads descriptions of molecules
and produces @command{pic} input for drawing them.
It was written in @command{awk}
by Brian Kernighan and Jon Bentley, and is available from
@uref{http://netlib.org/typesetting/chem}.

@item Comparison Expression
A relation that is either true or false, such as @samp{a < b}.
Comparison expressions are used in @code{if}, @code{while}, @code{do},
and @code{for}
statements, and in patterns to select which input records to process.
(@xref{Typing and Comparison}.)

@cindex compiled programs
@item Compiler
A program that translates human-readable source code into
machine-executable object code. The object code is then executed
directly by the computer.
See also ``Interpreter.''

@item Complemented Bracket Expression
The negation of a @dfn{bracket expression}. All that is @emph{not}
described by a given bracket expression. The symbol @samp{^} precedes
the negated bracket expression. E.g.: @samp{[^[:digit:]]}
designates whatever character is not a digit. @samp{[^bad]}
designates whatever character is not one of the letters @samp{b}, @samp{a},
or @samp{d}.
See ``Bracket Expression.''

@item Compound Statement
A series of @command{awk} statements, enclosed in curly braces. Compound
statements may be nested.
(@xref{Statements}.)
43984 43985@item Computed Regexps 43986See ``Dynamic Regular Expressions.'' 43987 43988@item Concatenation 43989Concatenating two strings means sticking them together, one after another, 43990producing a new string. For example, the string @samp{foo} concatenated with 43991the string @samp{bar} gives the string @samp{foobar}. 43992(@xref{Concatenation}.) 43993 43994@item Conditional Expression 43995An expression using the @samp{?:} ternary operator, such as 43996@samp{@var{expr1} ? @var{expr2} : @var{expr3}}. The expression 43997@var{expr1} is evaluated; if the result is true, the value of the whole 43998expression is the value of @var{expr2}; otherwise the value is 43999@var{expr3}. In either case, only one of @var{expr2} and @var{expr3} 44000is evaluated. (@xref{Conditional Exp}.) 44001 44002@item Control Statement 44003A control statement is an instruction to perform a given operation or a set 44004of operations inside an @command{awk} program, if a given condition is 44005true. Control statements are: @code{if}, @code{for}, @code{while}, and 44006@code{do} 44007(@pxref{Statements}). 44008 44009@cindex McIlroy, Doug 44010@cindex cookie 44011@item Cookie 44012A peculiar goodie, token, saying or remembrance 44013produced by or presented to a program. (With thanks to Professor Doug McIlroy.) 44014@ignore 44015From: Doug McIlroy <doug@cs.dartmouth.edu> 44016Date: Sat, 13 Oct 2012 19:55:25 -0400 44017To: arnold@skeeve.com 44018Subject: Re: origin of the term "cookie"? 44019 44020I believe the term "cookie", for a more or less inscrutable 44021saying or crumb of information, was injected into Unix 44022jargon by Bob Morris, who used the word quite frequently. 44023It had no fixed meaning as it now does in browsers. 44024 44025The word had been around long before it was recognized in 44026the 8th edition glossary (earlier editions had no glossary): 44027 44028cookie a peculiar goodie, token, saying or remembrance 44029returned by or presented to a program. 
[I would say that 44030"returned by" would better read "produced by", and assume 44031responsibility for the inexactitude.] 44032 44033Doug McIlroy 44034 44035From: Doug McIlroy <doug@cs.dartmouth.edu> 44036Date: Sun, 14 Oct 2012 10:08:43 -0400 44037To: arnold@skeeve.com 44038Subject: Re: origin of the term "cookie"? 44039 44040> Can I forward your email to Eric Raymond, for possible addition to the 44041> Jargon File? 44042 44043Sure. I might add that I don't know how "cookie" entered Morris's 44044vocabulary. Certainly "values of beta give rise to dom!" (see google) 44045was an early, if not the earliest Unix cookie. The fact that it was 44046found lying around on a model 37 teletype (which had Greek beta in 44047its type box) suggests that maybe it was seen to be like milk and 44048cookies laid out for Santa Claus. Morris was wont to make such 44049connections. 44050 44051Doug 44052@end ignore 44053 44054@item Coprocess 44055A subordinate program with which two-way communications is possible. 44056 44057@item Curly Braces 44058See ``Braces.'' 44059 44060@cindex dark corner 44061@item Dark Corner 44062An area in the language where specifications often were (or still 44063are) not clear, leading to unexpected or undesirable behavior. 44064Such areas are marked in this @value{DOCUMENT} with 44065@iftex 44066the picture of a flashlight in the margin 44067@end iftex 44068@ifnottex 44069``(d.c.)'' in the text 44070@end ifnottex 44071and are indexed under the heading ``dark corner.'' 44072 44073@item Data Driven 44074A description of @command{awk} programs, where you specify the data you 44075are interested in processing, and what to do when that data is seen. 44076 44077@item Data Objects 44078These are numbers and strings of characters. Numbers are converted into 44079strings and vice versa, as needed. 44080(@xref{Conversion}.) 44081 44082@item Deadlock 44083The situation in which two communicating processes are each waiting 44084for the other to perform an action. 
44085 44086@item Debugger 44087A program used to help developers remove ``bugs'' from (de-bug) 44088their programs. 44089 44090@item Double Precision 44091An internal representation of numbers that can have fractional parts. 44092Double precision numbers keep track of more digits than do single precision 44093numbers, but operations on them are sometimes more expensive. This is the way 44094@command{awk} stores numeric values. It is the C type @code{double}. 44095 44096@item Dynamic Regular Expression 44097A dynamic regular expression is a regular expression written as an 44098ordinary expression. It could be a string constant, such as 44099@code{"foo"}, but it may also be an expression whose value can vary. 44100(@xref{Computed Regexps}.) 44101 44102@item Empty String 44103See ``Null String.'' 44104 44105@item Environment 44106A collection of strings, of the form @samp{@var{name}=@var{val}}, that each 44107program has available to it. Users generally place values into the 44108environment in order to provide information to various programs. Typical 44109examples are the environment variables @env{HOME} and @env{PATH}. 44110 44111@cindex epoch, definition of 44112@item Epoch 44113The date used as the ``beginning of time'' for timestamps. 44114Time values in most systems are represented as seconds since the epoch, 44115with library functions available for converting these values into 44116standard date and time formats. 44117 44118The epoch on Unix and POSIX systems is 1970-01-01 00:00:00 UTC. 44119See also ``GMT'' and ``UTC.'' 44120 44121@item Escape Sequences 44122@cindex ASCII 44123A special sequence of characters used for describing nonprinting 44124characters, such as @samp{\n} for newline or @samp{\033} for the ASCII 44125ESC (Escape) character. (@xref{Escape Sequences}.) 44126 44127@item Extension 44128An additional feature or change to a programming language or 44129utility not defined by that language's or utility's standard. 
44130@command{gawk} has (too) many extensions over POSIX @command{awk}. 44131 44132@item FDL 44133See ``Free Documentation License.'' 44134 44135@item Field 44136When @command{awk} reads an input record, it splits the record into pieces 44137separated by whitespace (or by a separator regexp that you can 44138change by setting the predefined variable @code{FS}). Such pieces are 44139called fields. If the pieces are of fixed length, you can use the built-in 44140variable @code{FIELDWIDTHS} to describe their lengths. 44141If you wish to specify the contents of fields instead of the field 44142separator, you can use the predefined variable @code{FPAT} to do so. 44143(@xref{Field Separators}, 44144@ref{Constant Size}, 44145and 44146@ref{Splitting By Content}.) 44147 44148@item Flag 44149A variable whose truth value indicates the existence or nonexistence 44150of some condition. 44151 44152@item Floating-Point Number 44153Often referred to in mathematical terms as a ``rational'' or real number, 44154this is just a number that can have a fractional part. 44155See also ``Double Precision'' and ``Single Precision.'' 44156 44157@item Format 44158Format strings control the appearance of output in the 44159@code{strftime()} and @code{sprintf()} functions, and in the 44160@code{printf} statement as well. Also, data conversions from numbers to strings 44161are controlled by the format strings contained in the predefined variables 44162@code{CONVFMT} and @code{OFMT}. (@xref{Control Letters}.) 44163 44164@item Fortran 44165Shorthand for FORmula TRANslator, one of the first programming languages 44166available for scientific calculations. It was created by John Backus, 44167and has been available since 1957. It is still in use today. 44168 44169@item Free Documentation License 44170This document describes the terms under which this @value{DOCUMENT} 44171is published and may be copied. (@xref{GNU Free Documentation License}.) 
44172 44173@cindex FSF (Free Software Foundation) 44174@cindex Free Software Foundation (FSF) 44175@cindex Stallman, Richard 44176@item Free Software Foundation 44177A nonprofit organization dedicated 44178to the production and distribution of freely distributable software. 44179It was founded by Richard M.@: Stallman, the author of the original 44180Emacs editor. GNU Emacs is the most widely used version of Emacs today. 44181 44182@item FSF 44183See ``Free Software Foundation.'' 44184 44185@item Function 44186A part of an @command{awk} program that can be invoked from any point in 44187the program, to perform a task. @command{awk} has several built-in 44188functions. 44189Users can define their own functions anywhere in the program. 44190Functions can be recursive, i.e., they may invoke themselves. 44191@xref{Functions}. 44192In @command{gawk} it is also possible to have functions shared 44193among different programs, and included where required using the 44194@code{@@include} directive 44195(@pxref{Include Files}). 44196In @command{gawk} the name of the function that should be invoked 44197can be generated at run time, i.e., dynamically. 44198The @command{gawk} extension API provides constructor functions 44199(@pxref{Constructor Functions}). 44200 44201 44202@item @command{gawk} 44203The GNU implementation of @command{awk}. 44204 44205@cindex GPL (General Public License) 44206@item General Public License 44207This document describes the terms under which @command{gawk} and its source 44208code may be distributed. (@xref{Copying}.) 44209 44210@item GMT 44211``Greenwich Mean Time.'' 44212This is the old term for UTC. 44213It is the time of day used internally for Unix and POSIX systems. 44214See also ``Epoch'' and ``UTC.'' 44215 44216@cindex FSF (Free Software Foundation) 44217@cindex Free Software Foundation (FSF) 44218@cindex GNU Project 44219@item GNU 44220``GNU's not Unix''.
An on-going project of the Free Software Foundation 44221to create a complete, freely distributable, POSIX-compliant computing 44222environment. 44223 44224@item GNU/Linux 44225A variant of the GNU system using the Linux kernel, instead of the 44226Free Software Foundation's Hurd kernel. 44227The Linux kernel is a stable, efficient, full-featured clone of Unix that has 44228been ported to a variety of architectures. 44229It is most popular on PC-class systems, but runs well on a variety of 44230other systems too. 44231The Linux kernel source code is available under the terms of the GNU General 44232Public License, which is perhaps its most important aspect. 44233 44234@item GPL 44235See ``General Public License.'' 44236 44237@item Hexadecimal 44238Base 16 notation, where the digits are @code{0}--@code{9} and 44239@code{A}--@code{F}, with @samp{A} 44240representing 10, @samp{B} representing 11, and so on, up to @samp{F} for 15. 44241Hexadecimal numbers are written in C using a leading @samp{0x}, 44242@iftex 44243to indicate their base. Thus, @code{0x12} is 18 (@math{(1\times 16) + 2}). 44244@end iftex 44245@ifnottex 44246to indicate their base. Thus, @code{0x12} is 18 ((1 x 16) + 2). 44247@end ifnottex 44248@xref{Nondecimal-numbers}. 44249 44250@item I/O 44251Abbreviation for ``Input/Output,'' the act of moving data into and/or 44252out of a running program. 44253 44254@item Input Record 44255A single chunk of data that is read in by @command{awk}. Usually, an @command{awk} input 44256record consists of one line of text. 44257(@xref{Records}.) 44258 44259@item Integer 44260A whole number, i.e., a number that does not have a fractional part. 44261 44262@item Internationalization 44263The process of writing or modifying a program so 44264that it can use multiple languages without requiring 44265further source code changes. 
44266 44267@cindex interpreted programs 44268@item Interpreter 44269A program that reads human-readable source code directly, and uses 44270the instructions in it to process data and produce results. 44271@command{awk} is typically (but not always) implemented as an interpreter. 44272See also ``Compiler.'' 44273 44274@item Interval Expression 44275A component of a regular expression that lets you specify repeated matches of 44276some part of the regexp. Interval expressions were not originally available 44277in @command{awk} programs. 44278 44279@cindex ISO 44280@item ISO 44281The International Organization for Standardization. 44282This organization produces international standards for many things, including 44283programming languages, such as C and C++. 44284In the computer arena, important standards like those for C, C++, and POSIX 44285become both American national and ISO international standards simultaneously. 44286This @value{DOCUMENT} refers to Standard C as ``ISO C'' throughout. 44287See @uref{https://www.iso.org/iso/home/about.htm, the ISO website} for more 44288information about the name of the organization and its language-independent 44289three-letter acronym. 44290 44291@cindex Java programming language 44292@cindex programming languages @subentry Java 44293@item Java 44294A modern programming language originally developed by Sun Microsystems 44295(now Oracle) supporting Object-Oriented programming. Although usually 44296implemented by compiling to the instructions for a standard virtual 44297machine (the JVM), the language can be compiled to native code. 44298 44299@item Keyword 44300In the @command{awk} language, a keyword is a word that has special 44301meaning. Keywords are reserved and may not be used as variable names. 
44302 44303@command{gawk}'s keywords are: 44304@code{BEGIN}, 44305@code{BEGINFILE}, 44306@code{END}, 44307@code{ENDFILE}, 44308@code{break}, 44309@code{case}, 44310@code{continue}, 44311@code{default}, 44312@code{delete}, 44313@code{do@dots{}while}, 44314@code{else}, 44315@code{exit}, 44316@code{for@dots{}in}, 44317@code{for}, 44318@code{function}, 44319@code{func}, 44320@code{if}, 44321@code{next}, 44322@code{nextfile}, 44323@code{switch}, 44324and 44325@code{while}. 44326 44327@item Korn Shell 44328The Korn Shell (@command{ksh}) is a Unix shell which was developed by David Korn at Bell 44329Laboratories in the early 1980s. The Korn Shell is backward-compatible with the Bourne 44330shell and includes many features of the C shell. 44331See also ``Bourne Shell.'' 44332 44333@cindex LGPL (Lesser General Public License) 44334@cindex Lesser General Public License (LGPL) 44335@cindex GNU Lesser General Public License 44336@item Lesser General Public License 44337This document describes the terms under which binary library archives 44338or shared objects, 44339and their source code may be distributed. 44340 44341@item LGPL 44342See ``Lesser General Public License.'' 44343 44344@item Linux 44345See ``GNU/Linux.'' 44346 44347@item Localization 44348The process of providing the data necessary for an 44349internationalized program to work in a particular language. 44350 44351@item Logical Expression 44352An expression using the operators for logic, AND, OR, and NOT, written 44353@samp{&&}, @samp{||}, and @samp{!} in @command{awk}. Often called Boolean 44354expressions, after the mathematician who pioneered this kind of 44355mathematical logic. 44356 44357@item Lvalue 44358An expression that can appear on the left side of an assignment 44359operator. In most languages, lvalues can be variables or array 44360elements. In @command{awk}, a field designator can also be used as an 44361lvalue. 44362 44363@item Matching 44364The act of testing a string against a regular expression. 
If the 44365regexp describes the contents of the string, it is said to @dfn{match} it. 44366 44367@item Metacharacters 44368Characters used within a regexp that do not stand for themselves. 44369Instead, they denote regular expression operations, such as repetition, 44370grouping, or alternation. 44371 44372@item Nesting 44373Nesting is where information is organized in layers, or where objects 44374contain other similar objects. 44375In @command{gawk} the @code{@@include} 44376directive can be nested. The ``natural'' nesting of arithmetic and 44377logical operations can be changed using parentheses 44378(@pxref{Precedence}). 44379 44380@item No-op 44381An operation that does nothing. 44382 44383@item Null String 44384A string with no characters in it. It is represented explicitly in 44385@command{awk} programs by placing two double quote characters next to 44386each other (@code{""}). It can appear in input data by having two successive 44387occurrences of the field separator appear next to each other. 44388 44389@item Number 44390A numeric-valued data object. Modern @command{awk} implementations use 44391double precision floating-point to represent numbers. 44392Ancient @command{awk} implementations used single precision floating-point. 44393 44394@item Octal 44395Base-eight notation, where the digits are @code{0}--@code{7}. 44396Octal numbers are written in C using a leading @samp{0}, 44397@iftex 44398to indicate their base. Thus, @code{013} is 11 (@math{(1\times 8) + 3}). 44399@end iftex 44400@ifnottex 44401to indicate their base. Thus, @code{013} is 11 ((1 x 8) + 3). 44402@end ifnottex 44403@xref{Nondecimal-numbers}. 44404 44405@item Output Record 44406A single chunk of data that is written out by @command{awk}. Usually, an 44407@command{awk} output record consists of one or more lines of text. 44408@xref{Records}. 44409 44410@item Pattern 44411Patterns tell @command{awk} which input records are interesting to which 44412rules. 
44413 44414A pattern is an arbitrary conditional expression against which input is 44415tested. If the condition is satisfied, the pattern is said to @dfn{match} 44416the input record. A typical pattern might compare the input record against 44417a regular expression. (@xref{Pattern Overview}.) 44418 44419@item PEBKAC 44420An acronym describing what is possibly the most frequent 44421source of computer usage problems. (Problem Exists Between 44422Keyboard And Chair.) 44423 44424@item Plug-in 44425See ``Extension.'' 44426 44427@item POSIX 44428The name for a series of standards 44429that specify a Portable Operating System interface. The ``IX'' denotes 44430the Unix heritage of these standards. The main standard of interest for 44431@command{awk} users is 44432@cite{IEEE Standard for Information Technology, Standard 1003.1@sup{TM}-2017 44433(Revision of IEEE Std 1003.1-2008)}. 44434The 2018 POSIX standard can be found online at 44435@url{https://pubs.opengroup.org/onlinepubs/9699919799/}. 44436 44437@item Precedence 44438The order in which operations are performed when operators are used 44439without explicit parentheses. 44440 44441@item Private 44442Variables and/or functions that are meant for use exclusively by library 44443functions and not for the main @command{awk} program. Special care must be 44444taken when naming such variables and functions. 44445(@xref{Library Names}.) 44446 44447@item Range (of input lines) 44448A sequence of consecutive lines from the input file(s). A pattern 44449can specify ranges of input lines for @command{awk} to process or it can 44450specify single lines. (@xref{Pattern Overview}.) 44451 44452@item Record 44453See ``Input Record'' and ``Output Record.'' 44454 44455@item Recursion 44456When a function calls itself, either directly or indirectly. 44457If this is clear, stop, and proceed to the next entry.
44458Otherwise, refer to the entry for ``recursion.'' 44459 44460@item Redirection 44461Redirection means performing input from something other than the standard input 44462stream, or performing output to something other than the standard output stream. 44463 44464You can redirect input to the @code{getline} statement using 44465the @samp{<}, @samp{|}, and @samp{|&} operators. 44466You can redirect the output of the @code{print} and @code{printf} statements 44467to a file or a system command, using the @samp{>}, @samp{>>}, @samp{|}, and @samp{|&} 44468operators. 44469(@xref{Getline}, 44470and @ref{Redirection}.) 44471 44472@item Reference Counts 44473An internal mechanism in @command{gawk} to minimize the amount of memory 44474needed to store the value of string variables. If the value assumed by 44475a variable is used in more than one place, only one copy of the value 44476itself is kept, and the associated reference count is increased when the 44477same value is used by an additional variable, and decreased when the related 44478variable is no longer in use. When the reference count goes to zero, 44479the memory space used to store the value of the variable is freed. 44480 44481@item Regexp 44482See ``Regular Expression.'' 44483 44484@item Regular Expression 44485A regular expression (``regexp'' for short) is a pattern that denotes a 44486set of strings, possibly an infinite set. For example, the regular expression 44487@samp{R.*xp} matches any string starting with the letter @samp{R} 44488and ending with the letters @samp{xp}. In @command{awk}, regular expressions are 44489used in patterns and in conditional expressions. Regular expressions may contain 44490escape sequences. (@xref{Regexp}.) 44491 44492@item Regular Expression Constant 44493A regular expression constant is a regular expression written within 44494slashes, such as @code{/foo/}. 
This regular expression is chosen 44495when you write the @command{awk} program and cannot be changed during 44496its execution. (@xref{Regexp Usage}.) 44497 44498@item Regular Expression Operators 44499See ``Metacharacters.'' 44500 44501@item Rounding 44502Rounding the result of an arithmetic operation can be tricky. 44503More than one way of rounding exists, and in @command{gawk} 44504it is possible to choose which method should be used in a program. 44505@xref{Setting the rounding mode}. 44506 44507@item Rule 44508A segment of an @command{awk} program that specifies how to process single 44509input records. A rule consists of a @dfn{pattern} and an @dfn{action}. 44510@command{awk} reads an input record; then, for each rule, if the input record 44511satisfies the rule's pattern, @command{awk} executes the rule's action. 44512Otherwise, the rule does nothing for that input record. 44513 44514@item Rvalue 44515A value that can appear on the right side of an assignment operator. 44516In @command{awk}, essentially every expression has a value. These values 44517are rvalues. 44518 44519@item Scalar 44520A single value, be it a number or a string. 44521Regular variables are scalars; arrays and functions are not. 44522 44523@item Search Path 44524In @command{gawk}, a list of directories to search for @command{awk} program source files. 44525In the shell, a list of directories to search for executable programs. 44526 44527@item @command{sed} 44528See ``Stream Editor.'' 44529 44530@item Seed 44531The initial value, or starting point, for a sequence of random numbers. 44532 44533@item Shell 44534The command interpreter for Unix and POSIX-compliant systems. 44535The shell works both interactively, and as a programming language 44536for batch files, or shell scripts. 44537 44538@item Short-Circuit 44539The nature of the @command{awk} logical operators @samp{&&} and @samp{||}. 
44540If the value of the entire expression is determinable from evaluating just 44541the lefthand side of these operators, the righthand side is not 44542evaluated. 44543(@xref{Boolean Ops}.) 44544 44545@item Side Effect 44546A side effect occurs when an expression has an effect aside from merely 44547producing a value. Assignment expressions, increment and decrement 44548expressions, and function calls have side effects. 44549(@xref{Assignment Ops}.) 44550 44551@item Single Precision 44552An internal representation of numbers that can have fractional parts. 44553Single precision numbers keep track of fewer digits than do double precision 44554numbers, but operations on them are sometimes less expensive in terms of CPU time. 44555This is the type used by some ancient versions of @command{awk} to store 44556numeric values. It is the C type @code{float}. 44557 44558@item Space 44559The character generated by hitting the space bar on the keyboard. 44560 44561@item Special File 44562A @value{FN} interpreted internally by @command{gawk}, instead of being handed 44563directly to the underlying operating system---for example, @file{/dev/stderr}. 44564(@xref{Special Files}.) 44565 44566@item Statement 44567An expression inside an @command{awk} program in the action part 44568of a pattern--action rule, or inside an 44569@command{awk} function. A statement can be a variable assignment, 44570an array operation, a loop, etc. 44571 44572@item Stream Editor 44573A program that reads records from an input stream and processes them one 44574or more at a time. This is in contrast with batch programs, which may 44575expect to read their input files in entirety before starting to do 44576anything, as well as with interactive programs which require input from the 44577user. 44578 44579@item String 44580A datum consisting of a sequence of characters, such as @samp{I am a 44581string}. 
Constant strings are written with double quotes in the 44582@command{awk} language and may contain escape sequences. 44583(@xref{Escape Sequences}.) 44584 44585@item Tab 44586The character generated by hitting the @kbd{TAB} key on the keyboard. 44587It usually expands to up to eight spaces upon output. 44588 44589@item Text Domain 44590A unique name that identifies an application. 44591Used for grouping messages that are translated at runtime 44592into the local language. 44593 44594@item Timestamp 44595A value in the ``seconds since the epoch'' format used by Unix 44596and POSIX systems. Used by the @command{gawk} functions 44597@code{mktime()}, @code{strftime()}, and @code{systime()}. 44598See also ``Epoch,'' ``GMT,'' and ``UTC.'' 44599 44600@cindex GNU/Linux 44601@cindex Unix 44602@cindex BSD-based operating systems 44603@cindex NetBSD 44604@cindex FreeBSD 44605@cindex OpenBSD 44606@item Unix 44607A computer operating system originally developed in the early 1970s at 44608AT&T Bell Laboratories. It initially became popular in universities around 44609the world and later moved into commercial environments as a software 44610development system and network server system. There are many commercial 44611versions of Unix, as well as several work-alike systems whose source code 44612is freely available (such as GNU/Linux, @uref{http://www.netbsd.org, NetBSD}, 44613@uref{https://www.freebsd.org, FreeBSD}, and @uref{http://www.openbsd.org, OpenBSD}). 44614 44615@item UTC 44616The accepted abbreviation for ``Coordinated Universal Time.'' 44617This is standard time in Greenwich, England, which is used as a 44618reference time for day and date calculations. 44619See also ``Epoch'' and ``GMT.'' 44620 44621@item Variable 44622A name for a value. In @command{awk}, variables may be either scalars 44623or arrays. 44624 44625@item Whitespace 44626A sequence of space, TAB, or newline characters occurring inside an input 44627record or a string.
44628 44629@end table 44630 44631@end ifclear 44632 44633@c The GNU General Public License. 44634@node Copying 44635@unnumbered GNU General Public License 44636@ifnotdocbook 44637@center Version 3, 29 June 2007 44638@end ifnotdocbook 44639@docbook 44640<subtitle>Version 3, 29 June 2007</subtitle> 44641@end docbook 44642 44643@c This file is intended to be included within another document, 44644@c hence no sectioning command or @node. 44645 44646@display 44647Copyright @copyright{} 2007 Free Software Foundation, Inc. @url{https://fsf.org/} 44648 44649Everyone is permitted to copy and distribute verbatim copies of this 44650license document, but changing it is not allowed. 44651@end display 44652 44653@c fakenode --- for prepinfo 44654@heading Preamble 44655 44656The GNU General Public License is a free, copyleft license for 44657software and other kinds of works. 44658 44659The licenses for most software and other practical works are designed 44660to take away your freedom to share and change the works. By contrast, 44661the GNU General Public License is intended to guarantee your freedom 44662to share and change all versions of a program---to make sure it remains 44663free software for all its users. We, the Free Software Foundation, 44664use the GNU General Public License for most of our software; it 44665applies also to any other work released this way by its authors. You 44666can apply it to your programs, too. 44667 44668When we speak of free software, we are referring to freedom, not 44669price. Our General Public Licenses are designed to make sure that you 44670have the freedom to distribute copies of free software (and charge for 44671them if you wish), that you receive source code or can get it if you 44672want it, that you can change the software or use pieces of it in new 44673free programs, and that you know you can do these things. 
44674 44675To protect your rights, we need to prevent others from denying you 44676these rights or asking you to surrender the rights. Therefore, you 44677have certain responsibilities if you distribute copies of the 44678software, or if you modify it: responsibilities to respect the freedom 44679of others. 44680 44681For example, if you distribute copies of such a program, whether 44682gratis or for a fee, you must pass on to the recipients the same 44683freedoms that you received. You must make sure that they, too, 44684receive or can get the source code. And you must show them these 44685terms so they know their rights. 44686 44687Developers that use the GNU GPL protect your rights with two steps: 44688(1) assert copyright on the software, and (2) offer you this License 44689giving you legal permission to copy, distribute and/or modify it. 44690 44691For the developers' and authors' protection, the GPL clearly explains 44692that there is no warranty for this free software. For both users' and 44693authors' sake, the GPL requires that modified versions be marked as 44694changed, so that their problems will not be attributed erroneously to 44695authors of previous versions. 44696 44697Some devices are designed to deny users access to install or run 44698modified versions of the software inside them, although the 44699manufacturer can do so. This is fundamentally incompatible with the 44700aim of protecting users' freedom to change the software. The 44701systematic pattern of such abuse occurs in the area of products for 44702individuals to use, which is precisely where it is most unacceptable. 44703Therefore, we have designed this version of the GPL to prohibit the 44704practice for those products. If such problems arise substantially in 44705other domains, we stand ready to extend this provision to those 44706domains in future versions of the GPL, as needed to protect the 44707freedom of users. 
44708 44709Finally, every program is threatened constantly by software patents. 44710States should not allow patents to restrict development and use of 44711software on general-purpose computers, but in those that do, we wish 44712to avoid the special danger that patents applied to a free program 44713could make it effectively proprietary. To prevent this, the GPL 44714assures that patents cannot be used to render the program non-free. 44715 44716The precise terms and conditions for copying, distribution and 44717modification follow. 44718 44719@c fakenode --- for prepinfo 44720@heading TERMS AND CONDITIONS 44721 44722@enumerate 0 44723@item Definitions. 44724 44725``This License'' refers to version 3 of the GNU General Public License. 44726 44727``Copyright'' also means copyright-like laws that apply to other kinds 44728of works, such as semiconductor masks. 44729 44730``The Program'' refers to any copyrightable work licensed under this 44731License. Each licensee is addressed as ``you''. ``Licensees'' and 44732``recipients'' may be individuals or organizations. 44733 44734To ``modify'' a work means to copy from or adapt all or part of the work 44735in a fashion requiring copyright permission, other than the making of 44736an exact copy. The resulting work is called a ``modified version'' of 44737the earlier work or a work ``based on'' the earlier work. 44738 44739A ``covered work'' means either the unmodified Program or a work based 44740on the Program. 44741 44742To ``propagate'' a work means to do anything with it that, without 44743permission, would make you directly or secondarily liable for 44744infringement under applicable copyright law, except executing it on a 44745computer or modifying a private copy. Propagation includes copying, 44746distribution (with or without modification), making available to the 44747public, and in some countries other activities as well. 
44748 44749To ``convey'' a work means any kind of propagation that enables other 44750parties to make or receive copies. Mere interaction with a user 44751through a computer network, with no transfer of a copy, is not 44752conveying. 44753 44754An interactive user interface displays ``Appropriate Legal Notices'' to 44755the extent that it includes a convenient and prominently visible 44756feature that (1) displays an appropriate copyright notice, and (2) 44757tells the user that there is no warranty for the work (except to the 44758extent that warranties are provided), that licensees may convey the 44759work under this License, and how to view a copy of this License. If 44760the interface presents a list of user commands or options, such as a 44761menu, a prominent item in the list meets this criterion. 44762 44763@item Source Code. 44764 44765The ``source code'' for a work means the preferred form of the work for 44766making modifications to it. ``Object code'' means any non-source form 44767of a work. 44768 44769A ``Standard Interface'' means an interface that either is an official 44770standard defined by a recognized standards body, or, in the case of 44771interfaces specified for a particular programming language, one that 44772is widely used among developers working in that language. 44773 44774The ``System Libraries'' of an executable work include anything, other 44775than the work as a whole, that (a) is included in the normal form of 44776packaging a Major Component, but which is not part of that Major 44777Component, and (b) serves only to enable use of the work with that 44778Major Component, or to implement a Standard Interface for which an 44779implementation is available to the public in source code form. 
A 44780``Major Component'', in this context, means a major essential component 44781(kernel, window system, and so on) of the specific operating system 44782(if any) on which the executable work runs, or a compiler used to 44783produce the work, or an object code interpreter used to run it. 44784 44785The ``Corresponding Source'' for a work in object code form means all 44786the source code needed to generate, install, and (for an executable 44787work) run the object code and to modify the work, including scripts to 44788control those activities. However, it does not include the work's 44789System Libraries, or general-purpose tools or generally available free 44790programs which are used unmodified in performing those activities but 44791which are not part of the work. For example, Corresponding Source 44792includes interface definition files associated with source files for 44793the work, and the source code for shared libraries and dynamically 44794linked subprograms that the work is specifically designed to require, 44795such as by intimate data communication or control flow between those 44796subprograms and other parts of the work. 44797 44798The Corresponding Source need not include anything that users can 44799regenerate automatically from other parts of the Corresponding Source. 44800 44801The Corresponding Source for a work in source code form is that same 44802work. 44803 44804@item Basic Permissions. 44805 44806All rights granted under this License are granted for the term of 44807copyright on the Program, and are irrevocable provided the stated 44808conditions are met. This License explicitly affirms your unlimited 44809permission to run the unmodified Program. The output from running a 44810covered work is covered by this License only if the output, given its 44811content, constitutes a covered work. This License acknowledges your 44812rights of fair use or other equivalent, as provided by copyright law. 
44813 44814You may make, run and propagate covered works that you do not convey, 44815without conditions so long as your license otherwise remains in force. 44816You may convey covered works to others for the sole purpose of having 44817them make modifications exclusively for you, or provide you with 44818facilities for running those works, provided that you comply with the 44819terms of this License in conveying all material for which you do not 44820control copyright. Those thus making or running the covered works for 44821you must do so exclusively on your behalf, under your direction and 44822control, on terms that prohibit them from making any copies of your 44823copyrighted material outside their relationship with you. 44824 44825Conveying under any other circumstances is permitted solely under the 44826conditions stated below. Sublicensing is not allowed; section 10 44827makes it unnecessary. 44828 44829@item Protecting Users' Legal Rights From Anti-Circumvention Law. 44830 44831No covered work shall be deemed part of an effective technological 44832measure under any applicable law fulfilling obligations under article 4483311 of the WIPO copyright treaty adopted on 20 December 1996, or 44834similar laws prohibiting or restricting circumvention of such 44835measures. 44836 44837When you convey a covered work, you waive any legal power to forbid 44838circumvention of technological measures to the extent such 44839circumvention is effected by exercising rights under this License with 44840respect to the covered work, and you disclaim any intention to limit 44841operation or modification of the work as a means of enforcing, against 44842the work's users, your or third parties' legal rights to forbid 44843circumvention of technological measures. 44844 44845@item Conveying Verbatim Copies. 
44846 44847You may convey verbatim copies of the Program's source code as you 44848receive it, in any medium, provided that you conspicuously and 44849appropriately publish on each copy an appropriate copyright notice; 44850keep intact all notices stating that this License and any 44851non-permissive terms added in accord with section 7 apply to the code; 44852keep intact all notices of the absence of any warranty; and give all 44853recipients a copy of this License along with the Program. 44854 44855You may charge any price or no price for each copy that you convey, 44856and you may offer support or warranty protection for a fee. 44857 44858@item Conveying Modified Source Versions. 44859 44860You may convey a work based on the Program, or the modifications to 44861produce it from the Program, in the form of source code under the 44862terms of section 4, provided that you also meet all of these 44863conditions: 44864 44865@enumerate a 44866@item 44867The work must carry prominent notices stating that you modified it, 44868and giving a relevant date. 44869 44870@item 44871The work must carry prominent notices stating that it is released 44872under this License and any conditions added under section 7. This 44873requirement modifies the requirement in section 4 to ``keep intact all 44874notices''. 44875 44876@item 44877You must license the entire work, as a whole, under this License to 44878anyone who comes into possession of a copy. This License will 44879therefore apply, along with any applicable section 7 additional terms, 44880to the whole of the work, and all its parts, regardless of how they 44881are packaged. This License gives no permission to license the work in 44882any other way, but it does not invalidate such permission if you have 44883separately received it. 
44884 44885@item 44886If the work has interactive user interfaces, each must display 44887Appropriate Legal Notices; however, if the Program has interactive 44888interfaces that do not display Appropriate Legal Notices, your work 44889need not make them do so. 44890@end enumerate 44891 44892A compilation of a covered work with other separate and independent 44893works, which are not by their nature extensions of the covered work, 44894and which are not combined with it such as to form a larger program, 44895in or on a volume of a storage or distribution medium, is called an 44896``aggregate'' if the compilation and its resulting copyright are not 44897used to limit the access or legal rights of the compilation's users 44898beyond what the individual works permit. Inclusion of a covered work 44899in an aggregate does not cause this License to apply to the other 44900parts of the aggregate. 44901 44902@item Conveying Non-Source Forms. 44903 44904You may convey a covered work in object code form under the terms of 44905sections 4 and 5, provided that you also convey the machine-readable 44906Corresponding Source under the terms of this License, in one of these 44907ways: 44908 44909@enumerate a 44910@item 44911Convey the object code in, or embodied in, a physical product 44912(including a physical distribution medium), accompanied by the 44913Corresponding Source fixed on a durable physical medium customarily 44914used for software interchange. 
44915 44916@item 44917Convey the object code in, or embodied in, a physical product 44918(including a physical distribution medium), accompanied by a written 44919offer, valid for at least three years and valid for as long as you 44920offer spare parts or customer support for that product model, to give 44921anyone who possesses the object code either (1) a copy of the 44922Corresponding Source for all the software in the product that is 44923covered by this License, on a durable physical medium customarily used 44924for software interchange, for a price no more than your reasonable 44925cost of physically performing this conveying of source, or (2) access 44926to copy the Corresponding Source from a network server at no charge. 44927 44928@item 44929Convey individual copies of the object code with a copy of the written 44930offer to provide the Corresponding Source. This alternative is 44931allowed only occasionally and noncommercially, and only if you 44932received the object code with such an offer, in accord with subsection 449336b. 44934 44935@item 44936Convey the object code by offering access from a designated place 44937(gratis or for a charge), and offer equivalent access to the 44938Corresponding Source in the same way through the same place at no 44939further charge. You need not require recipients to copy the 44940Corresponding Source along with the object code. If the place to copy 44941the object code is a network server, the Corresponding Source may be 44942on a different server (operated by you or a third party) that supports 44943equivalent copying facilities, provided you maintain clear directions 44944next to the object code saying where to find the Corresponding Source. 44945Regardless of what server hosts the Corresponding Source, you remain 44946obligated to ensure that it is available for as long as needed to 44947satisfy these requirements. 
44948 44949@item 44950Convey the object code using peer-to-peer transmission, provided you 44951inform other peers where the object code and Corresponding Source of 44952the work are being offered to the general public at no charge under 44953subsection 6d. 44954 44955@end enumerate 44956 44957A separable portion of the object code, whose source code is excluded 44958from the Corresponding Source as a System Library, need not be 44959included in conveying the object code work. 44960 44961A ``User Product'' is either (1) a ``consumer product'', which means any 44962tangible personal property which is normally used for personal, 44963family, or household purposes, or (2) anything designed or sold for 44964incorporation into a dwelling. In determining whether a product is a 44965consumer product, doubtful cases shall be resolved in favor of 44966coverage. For a particular product received by a particular user, 44967``normally used'' refers to a typical or common use of that class of 44968product, regardless of the status of the particular user or of the way 44969in which the particular user actually uses, or expects or is expected 44970to use, the product. A product is a consumer product regardless of 44971whether the product has substantial commercial, industrial or 44972non-consumer uses, unless such uses represent the only significant 44973mode of use of the product. 44974 44975``Installation Information'' for a User Product means any methods, 44976procedures, authorization keys, or other information required to 44977install and execute modified versions of a covered work in that User 44978Product from a modified version of its Corresponding Source. The 44979information must suffice to ensure that the continued functioning of 44980the modified object code is in no case prevented or interfered with 44981solely because modification has been made. 
44982 44983If you convey an object code work under this section in, or with, or 44984specifically for use in, a User Product, and the conveying occurs as 44985part of a transaction in which the right of possession and use of the 44986User Product is transferred to the recipient in perpetuity or for a 44987fixed term (regardless of how the transaction is characterized), the 44988Corresponding Source conveyed under this section must be accompanied 44989by the Installation Information. But this requirement does not apply 44990if neither you nor any third party retains the ability to install 44991modified object code on the User Product (for example, the work has 44992been installed in ROM). 44993 44994The requirement to provide Installation Information does not include a 44995requirement to continue to provide support service, warranty, or 44996updates for a work that has been modified or installed by the 44997recipient, or for the User Product in which it has been modified or 44998installed. Access to a network may be denied when the modification 44999itself materially and adversely affects the operation of the network 45000or violates the rules and protocols for communication across the 45001network. 45002 45003Corresponding Source conveyed, and Installation Information provided, 45004in accord with this section must be in a format that is publicly 45005documented (and with an implementation available to the public in 45006source code form), and must require no special password or key for 45007unpacking, reading or copying. 45008 45009@item Additional Terms. 45010 45011``Additional permissions'' are terms that supplement the terms of this 45012License by making exceptions from one or more of its conditions. 45013Additional permissions that are applicable to the entire Program shall 45014be treated as though they were included in this License, to the extent 45015that they are valid under applicable law. 
If additional permissions 45016apply only to part of the Program, that part may be used separately 45017under those permissions, but the entire Program remains governed by 45018this License without regard to the additional permissions. 45019 45020When you convey a copy of a covered work, you may at your option 45021remove any additional permissions from that copy, or from any part of 45022it. (Additional permissions may be written to require their own 45023removal in certain cases when you modify the work.) You may place 45024additional permissions on material, added by you to a covered work, 45025for which you have or can give appropriate copyright permission. 45026 45027Notwithstanding any other provision of this License, for material you 45028add to a covered work, you may (if authorized by the copyright holders 45029of that material) supplement the terms of this License with terms: 45030 45031@enumerate a 45032@item 45033Disclaiming warranty or limiting liability differently from the terms 45034of sections 15 and 16 of this License; or 45035 45036@item 45037Requiring preservation of specified reasonable legal notices or author 45038attributions in that material or in the Appropriate Legal Notices 45039displayed by works containing it; or 45040 45041@item 45042Prohibiting misrepresentation of the origin of that material, or 45043requiring that modified versions of such material be marked in 45044reasonable ways as different from the original version; or 45045 45046@item 45047Limiting the use for publicity purposes of names of licensors or 45048authors of the material; or 45049 45050@item 45051Declining to grant rights under trademark law for use of some trade 45052names, trademarks, or service marks; or 45053 45054@item 45055Requiring indemnification of licensors and authors of that material by 45056anyone who conveys the material (or modified versions of it) with 45057contractual assumptions of liability to the recipient, for any 45058liability that these 
contractual assumptions directly impose on those 45059licensors and authors. 45060@end enumerate 45061 45062All other non-permissive additional terms are considered ``further 45063restrictions'' within the meaning of section 10. If the Program as you 45064received it, or any part of it, contains a notice stating that it is 45065governed by this License along with a term that is a further 45066restriction, you may remove that term. If a license document contains 45067a further restriction but permits relicensing or conveying under this 45068License, you may add to a covered work material governed by the terms 45069of that license document, provided that the further restriction does 45070not survive such relicensing or conveying. 45071 45072If you add terms to a covered work in accord with this section, you 45073must place, in the relevant source files, a statement of the 45074additional terms that apply to those files, or a notice indicating 45075where to find the applicable terms. 45076 45077Additional terms, permissive or non-permissive, may be stated in the 45078form of a separately written license, or stated as exceptions; the 45079above requirements apply either way. 45080 45081@item Termination. 45082 45083You may not propagate or modify a covered work except as expressly 45084provided under this License. Any attempt otherwise to propagate or 45085modify it is void, and will automatically terminate your rights under 45086this License (including any patent licenses granted under the third 45087paragraph of section 11). 45088 45089However, if you cease all violation of this License, then your license 45090from a particular copyright holder is reinstated (a) provisionally, 45091unless and until the copyright holder explicitly and finally 45092terminates your license, and (b) permanently, if the copyright holder 45093fails to notify you of the violation by some reasonable means prior to 4509460 days after the cessation. 
45095 45096Moreover, your license from a particular copyright holder is 45097reinstated permanently if the copyright holder notifies you of the 45098violation by some reasonable means, this is the first time you have 45099received notice of violation of this License (for any work) from that 45100copyright holder, and you cure the violation prior to 30 days after 45101your receipt of the notice. 45102 45103Termination of your rights under this section does not terminate the 45104licenses of parties who have received copies or rights from you under 45105this License. If your rights have been terminated and not permanently 45106reinstated, you do not qualify to receive new licenses for the same 45107material under section 10. 45108 45109@item Acceptance Not Required for Having Copies. 45110 45111You are not required to accept this License in order to receive or run 45112a copy of the Program. Ancillary propagation of a covered work 45113occurring solely as a consequence of using peer-to-peer transmission 45114to receive a copy likewise does not require acceptance. However, 45115nothing other than this License grants you permission to propagate or 45116modify any covered work. These actions infringe copyright if you do 45117not accept this License. Therefore, by modifying or propagating a 45118covered work, you indicate your acceptance of this License to do so. 45119 45120@item Automatic Licensing of Downstream Recipients. 45121 45122Each time you convey a covered work, the recipient automatically 45123receives a license from the original licensors, to run, modify and 45124propagate that work, subject to this License. You are not responsible 45125for enforcing compliance by third parties with this License. 45126 45127An ``entity transaction'' is a transaction transferring control of an 45128organization, or substantially all assets of one, or subdividing an 45129organization, or merging organizations. 
If propagation of a covered 45130work results from an entity transaction, each party to that 45131transaction who receives a copy of the work also receives whatever 45132licenses to the work the party's predecessor in interest had or could 45133give under the previous paragraph, plus a right to possession of the 45134Corresponding Source of the work from the predecessor in interest, if 45135the predecessor has it or can get it with reasonable efforts. 45136 45137You may not impose any further restrictions on the exercise of the 45138rights granted or affirmed under this License. For example, you may 45139not impose a license fee, royalty, or other charge for exercise of 45140rights granted under this License, and you may not initiate litigation 45141(including a cross-claim or counterclaim in a lawsuit) alleging that 45142any patent claim is infringed by making, using, selling, offering for 45143sale, or importing the Program or any portion of it. 45144 45145@item Patents. 45146 45147A ``contributor'' is a copyright holder who authorizes use under this 45148License of the Program or a work on which the Program is based. The 45149work thus licensed is called the contributor's ``contributor version''. 45150 45151A contributor's ``essential patent claims'' are all patent claims owned 45152or controlled by the contributor, whether already acquired or 45153hereafter acquired, that would be infringed by some manner, permitted 45154by this License, of making, using, or selling its contributor version, 45155but do not include claims that would be infringed only as a 45156consequence of further modification of the contributor version. For 45157purposes of this definition, ``control'' includes the right to grant 45158patent sublicenses in a manner consistent with the requirements of 45159this License. 
45160 45161Each contributor grants you a non-exclusive, worldwide, royalty-free 45162patent license under the contributor's essential patent claims, to 45163make, use, sell, offer for sale, import and otherwise run, modify and 45164propagate the contents of its contributor version. 45165 45166In the following three paragraphs, a ``patent license'' is any express 45167agreement or commitment, however denominated, not to enforce a patent 45168(such as an express permission to practice a patent or covenant not to 45169sue for patent infringement). To ``grant'' such a patent license to a 45170party means to make such an agreement or commitment not to enforce a 45171patent against the party. 45172 45173If you convey a covered work, knowingly relying on a patent license, 45174and the Corresponding Source of the work is not available for anyone 45175to copy, free of charge and under the terms of this License, through a 45176publicly available network server or other readily accessible means, 45177then you must either (1) cause the Corresponding Source to be so 45178available, or (2) arrange to deprive yourself of the benefit of the 45179patent license for this particular work, or (3) arrange, in a manner 45180consistent with the requirements of this License, to extend the patent 45181license to downstream recipients. ``Knowingly relying'' means you have 45182actual knowledge that, but for the patent license, your conveying the 45183covered work in a country, or your recipient's use of the covered work 45184in a country, would infringe one or more identifiable patents in that 45185country that you have reason to believe are valid. 
45186 45187If, pursuant to or in connection with a single transaction or 45188arrangement, you convey, or propagate by procuring conveyance of, a 45189covered work, and grant a patent license to some of the parties 45190receiving the covered work authorizing them to use, propagate, modify 45191or convey a specific copy of the covered work, then the patent license 45192you grant is automatically extended to all recipients of the covered 45193work and works based on it. 45194 45195A patent license is ``discriminatory'' if it does not include within the 45196scope of its coverage, prohibits the exercise of, or is conditioned on 45197the non-exercise of one or more of the rights that are specifically 45198granted under this License. You may not convey a covered work if you 45199are a party to an arrangement with a third party that is in the 45200business of distributing software, under which you make payment to the 45201third party based on the extent of your activity of conveying the 45202work, and under which the third party grants, to any of the parties 45203who would receive the covered work from you, a discriminatory patent 45204license (a) in connection with copies of the covered work conveyed by 45205you (or copies made from those copies), or (b) primarily for and in 45206connection with specific products or compilations that contain the 45207covered work, unless you entered into that arrangement, or that patent 45208license was granted, prior to 28 March 2007. 45209 45210Nothing in this License shall be construed as excluding or limiting 45211any implied license or other defenses to infringement that may 45212otherwise be available to you under applicable patent law. 45213 45214@item No Surrender of Others' Freedom. 45215 45216If conditions are imposed on you (whether by court order, agreement or 45217otherwise) that contradict the conditions of this License, they do not 45218excuse you from the conditions of this License. 
If you cannot convey 45219a covered work so as to satisfy simultaneously your obligations under 45220this License and any other pertinent obligations, then as a 45221consequence you may not convey it at all. For example, if you agree 45222to terms that obligate you to collect a royalty for further conveying 45223from those to whom you convey the Program, the only way you could 45224satisfy both those terms and this License would be to refrain entirely 45225from conveying the Program. 45226 45227@item Use with the GNU Affero General Public License. 45228 45229Notwithstanding any other provision of this License, you have 45230permission to link or combine any covered work with a work licensed 45231under version 3 of the GNU Affero General Public License into a single 45232combined work, and to convey the resulting work. The terms of this 45233License will continue to apply to the part which is the covered work, 45234but the special requirements of the GNU Affero General Public License, 45235section 13, concerning interaction through a network will apply to the 45236combination as such. 45237 45238@item Revised Versions of this License. 45239 45240The Free Software Foundation may publish revised and/or new versions 45241of the GNU General Public License from time to time. Such new 45242versions will be similar in spirit to the present version, but may 45243differ in detail to address new problems or concerns. 45244 45245Each version is given a distinguishing version number. If the Program 45246specifies that a certain numbered version of the GNU General Public 45247License ``or any later version'' applies to it, you have the option of 45248following the terms and conditions either of that numbered version or 45249of any later version published by the Free Software Foundation. If 45250the Program does not specify a version number of the GNU General 45251Public License, you may choose any version ever published by the Free 45252Software Foundation. 
45253 45254If the Program specifies that a proxy can decide which future versions 45255of the GNU General Public License can be used, that proxy's public 45256statement of acceptance of a version permanently authorizes you to 45257choose that version for the Program. 45258 45259Later license versions may give you additional or different 45260permissions. However, no additional obligations are imposed on any 45261author or copyright holder as a result of your choosing to follow a 45262later version. 45263 45264@item Disclaimer of Warranty. 45265 45266THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY 45267APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT 45268HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM ``AS IS'' WITHOUT 45269WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT 45270LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR 45271A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND 45272PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE 45273DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR OR 45274CORRECTION. 45275 45276@item Limitation of Liability. 45277 45278IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING 45279WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR 45280CONVEYS THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, 45281INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES 45282ARISING OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT 45283NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR 45284LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM 45285TO OPERATE WITH ANY OTHER PROGRAMS), EVEN IF SUCH HOLDER OR OTHER 45286PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. 45287 45288@item Interpretation of Sections 15 and 16. 
45289 45290If the disclaimer of warranty and limitation of liability provided 45291above cannot be given local legal effect according to their terms, 45292reviewing courts shall apply local law that most closely approximates 45293an absolute waiver of all civil liability in connection with the 45294Program, unless a warranty or assumption of liability accompanies a 45295copy of the Program in return for a fee. 45296 45297@end enumerate 45298 45299@c fakenode --- for prepinfo 45300@heading END OF TERMS AND CONDITIONS 45301 45302@c fakenode --- for prepinfo 45303@heading How to Apply These Terms to Your New Programs 45304 45305If you develop a new program, and you want it to be of the greatest 45306possible use to the public, the best way to achieve this is to make it 45307free software which everyone can redistribute and change under these 45308terms. 45309 45310To do so, attach the following notices to the program. It is safest 45311to attach them to the start of each source file to most effectively 45312state the exclusion of warranty; and each file should have at least 45313the ``copyright'' line and a pointer to where the full notice is found. 45314 45315@smallexample 45316@var{one line to give the program's name and a brief idea of what it does.} 45317Copyright (C) @var{year} @var{name of author} 45318 45319This program is free software: you can redistribute it and/or modify 45320it under the terms of the GNU General Public License as published by 45321the Free Software Foundation, either version 3 of the License, or (at 45322your option) any later version. 45323 45324This program is distributed in the hope that it will be useful, but 45325WITHOUT ANY WARRANTY; without even the implied warranty of 45326MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU 45327General Public License for more details. 45328 45329You should have received a copy of the GNU General Public License 45330along with this program. 
If not, see @url{https://www.gnu.org/licenses/}. 45331@end smallexample 45332 45333Also add information on how to contact you by electronic and paper mail. 45334 45335If the program does terminal interaction, make it output a short 45336notice like this when it starts in an interactive mode: 45337 45338@smallexample 45339@var{program} Copyright (C) @var{year} @var{name of author} 45340This program comes with ABSOLUTELY NO WARRANTY; for details type @samp{show w}. 45341This is free software, and you are welcome to redistribute it 45342under certain conditions; type @samp{show c} for details. 45343@end smallexample 45344 45345The hypothetical commands @samp{show w} and @samp{show c} should show 45346the appropriate parts of the General Public License. Of course, your 45347program's commands might be different; for a GUI interface, you would 45348use an ``about box''. 45349 45350You should also get your employer (if you work as a programmer) or school, 45351if any, to sign a ``copyright disclaimer'' for the program, if necessary. 45352For more information on this, and how to apply and follow the GNU GPL, see 45353@url{https://www.gnu.org/licenses/}. 45354 45355The GNU General Public License does not permit incorporating your 45356program into proprietary programs. If your program is a subroutine 45357library, you may consider it more useful to permit linking proprietary 45358applications with the library. If this is what you want to do, use 45359the GNU Lesser General Public License instead of this License. But 45360first, please read @url{https://www.gnu.org/philosophy/why-not-lgpl.html}. 45361 45362@ifclear FOR_PRINT 45363@c The GNU Free Documentation License. 
@node GNU Free Documentation License
@unnumbered GNU Free Documentation License
@ifnotdocbook
@center Version 1.3, 3 November 2008
@end ifnotdocbook

@docbook
<subtitle>Version 1.3, 3 November 2008</subtitle>
@end docbook

@cindex FDL (Free Documentation License)
@cindex Free Documentation License (FDL)
@cindex GNU Free Documentation License

@c This file is intended to be included within another document,
@c hence no sectioning command or @node.

@display
Copyright @copyright{} 2000, 2001, 2002, 2007, 2008 Free Software Foundation, Inc.
@uref{https://fsf.org/}

Everyone is permitted to copy and distribute verbatim copies
of this license document, but changing it is not allowed.
@end display

@enumerate 0
@item
PREAMBLE

The purpose of this License is to make a manual, textbook, or other
functional and useful document @dfn{free} in the sense of freedom: to
assure everyone the effective freedom to copy and redistribute it,
with or without modifying it, either commercially or noncommercially.
Secondarily, this License preserves for the author and publisher a way
to get credit for their work, while not being considered responsible
for modifications made by others.

This License is a kind of ``copyleft'', which means that derivative
works of the document must themselves be free in the same sense. It
complements the GNU General Public License, which is a copyleft
license designed for free software.

We have designed this License in order to use it for manuals for free
software, because free software needs free documentation: a free
program should come with manuals providing the same freedoms that the
software does.
But this License is not limited to software manuals;
it can be used for any textual work, regardless of subject matter or
whether it is published as a printed book. We recommend this License
principally for works whose purpose is instruction or reference.

@item
APPLICABILITY AND DEFINITIONS

This License applies to any manual or other work, in any medium, that
contains a notice placed by the copyright holder saying it can be
distributed under the terms of this License. Such a notice grants a
world-wide, royalty-free license, unlimited in duration, to use that
work under the conditions stated herein. The ``Document'', below,
refers to any such manual or work. Any member of the public is a
licensee, and is addressed as ``you''. You accept the license if you
copy, modify or distribute the work in a way requiring permission
under copyright law.

A ``Modified Version'' of the Document means any work containing the
Document or a portion of it, either copied verbatim, or with
modifications and/or translated into another language.

A ``Secondary Section'' is a named appendix or a front-matter section
of the Document that deals exclusively with the relationship of the
publishers or authors of the Document to the Document's overall
subject (or to related matters) and contains nothing that could fall
directly within that overall subject. (Thus, if the Document is in
part a textbook of mathematics, a Secondary Section may not explain
any mathematics.) The relationship could be a matter of historical
connection with the subject or with related matters, or of legal,
commercial, philosophical, ethical or political position regarding
them.

The ``Invariant Sections'' are certain Secondary Sections whose titles
are designated, as being those of Invariant Sections, in the notice
that says that the Document is released under this License. If a
section does not fit the above definition of Secondary then it is not
allowed to be designated as Invariant. The Document may contain zero
Invariant Sections. If the Document does not identify any Invariant
Sections then there are none.

The ``Cover Texts'' are certain short passages of text that are listed,
as Front-Cover Texts or Back-Cover Texts, in the notice that says that
the Document is released under this License. A Front-Cover Text may
be at most 5 words, and a Back-Cover Text may be at most 25 words.

A ``Transparent'' copy of the Document means a machine-readable copy,
represented in a format whose specification is available to the
general public, that is suitable for revising the document
straightforwardly with generic text editors or (for images composed of
pixels) generic paint programs or (for drawings) some widely available
drawing editor, and that is suitable for input to text formatters or
for automatic translation to a variety of formats suitable for input
to text formatters. A copy made in an otherwise Transparent file
format whose markup, or absence of markup, has been arranged to thwart
or discourage subsequent modification by readers is not Transparent.
An image format is not Transparent if used for any substantial amount
of text. A copy that is not ``Transparent'' is called ``Opaque''.

Examples of suitable formats for Transparent copies include plain
@sc{ascii} without markup, Texinfo input format, La@TeX{} input
format, @acronym{SGML} or @acronym{XML} using a publicly available
@acronym{DTD}, and standard-conforming simple @acronym{HTML},
PostScript or @acronym{PDF} designed for human modification. Examples
of transparent image formats include @acronym{PNG}, @acronym{XCF} and
@acronym{JPG}. Opaque formats include proprietary formats that can be
read and edited only by proprietary word processors, @acronym{SGML} or
@acronym{XML} for which the @acronym{DTD} and/or processing tools are
not generally available, and the machine-generated @acronym{HTML},
PostScript or @acronym{PDF} produced by some word processors for
output purposes only.

The ``Title Page'' means, for a printed book, the title page itself,
plus such following pages as are needed to hold, legibly, the material
this License requires to appear in the title page. For works in
formats which do not have any title page as such, ``Title Page'' means
the text near the most prominent appearance of the work's title,
preceding the beginning of the body of the text.

The ``publisher'' means any person or entity that distributes copies
of the Document to the public.

A section ``Entitled XYZ'' means a named subunit of the Document whose
title either is precisely XYZ or contains XYZ in parentheses following
text that translates XYZ in another language. (Here XYZ stands for a
specific section name mentioned below, such as ``Acknowledgements'',
``Dedications'', ``Endorsements'', or ``History''.) To ``Preserve the Title''
of such a section when you modify the Document means that it remains a
section ``Entitled XYZ'' according to this definition.

The Document may include Warranty Disclaimers next to the notice which
states that this License applies to the Document. These Warranty
Disclaimers are considered to be included by reference in this
License, but only as regards disclaiming warranties: any other
implication that these Warranty Disclaimers may have is void and has
no effect on the meaning of this License.

@item
VERBATIM COPYING

You may copy and distribute the Document in any medium, either
commercially or noncommercially, provided that this License, the
copyright notices, and the license notice saying this License applies
to the Document are reproduced in all copies, and that you add no other
conditions whatsoever to those of this License. You may not use
technical measures to obstruct or control the reading or further
copying of the copies you make or distribute. However, you may accept
compensation in exchange for copies. If you distribute a large enough
number of copies you must also follow the conditions in section 3.

You may also lend copies, under the same conditions stated above, and
you may publicly display copies.

@item
COPYING IN QUANTITY

If you publish printed copies (or copies in media that commonly have
printed covers) of the Document, numbering more than 100, and the
Document's license notice requires Cover Texts, you must enclose the
copies in covers that carry, clearly and legibly, all these Cover
Texts: Front-Cover Texts on the front cover, and Back-Cover Texts on
the back cover. Both covers must also clearly and legibly identify
you as the publisher of these copies. The front cover must present
the full title with all words of the title equally prominent and
visible. You may add other material on the covers in addition.
Copying with changes limited to the covers, as long as they preserve
the title of the Document and satisfy these conditions, can be treated
as verbatim copying in other respects.

If the required texts for either cover are too voluminous to fit
legibly, you should put the first ones listed (as many as fit
reasonably) on the actual cover, and continue the rest onto adjacent
pages.

If you publish or distribute Opaque copies of the Document numbering
more than 100, you must either include a machine-readable Transparent
copy along with each Opaque copy, or state in or with each Opaque copy
a computer-network location from which the general network-using
public has access to download using public-standard network protocols
a complete Transparent copy of the Document, free of added material.
If you use the latter option, you must take reasonably prudent steps,
when you begin distribution of Opaque copies in quantity, to ensure
that this Transparent copy will remain thus accessible at the stated
location until at least one year after the last time you distribute an
Opaque copy (directly or through your agents or retailers) of that
edition to the public.

It is requested, but not required, that you contact the authors of the
Document well before redistributing any large number of copies, to give
them a chance to provide you with an updated version of the Document.

@item
MODIFICATIONS

You may copy and distribute a Modified Version of the Document under
the conditions of sections 2 and 3 above, provided that you release
the Modified Version under precisely this License, with the Modified
Version filling the role of the Document, thus licensing distribution
and modification of the Modified Version to whoever possesses a copy
of it.
In addition, you must do these things in the Modified Version:

@enumerate A
@item
Use in the Title Page (and on the covers, if any) a title distinct
from that of the Document, and from those of previous versions
(which should, if there were any, be listed in the History section
of the Document). You may use the same title as a previous version
if the original publisher of that version gives permission.

@item
List on the Title Page, as authors, one or more persons or entities
responsible for authorship of the modifications in the Modified
Version, together with at least five of the principal authors of the
Document (all of its principal authors, if it has fewer than five),
unless they release you from this requirement.

@item
State on the Title page the name of the publisher of the
Modified Version, as the publisher.

@item
Preserve all the copyright notices of the Document.

@item
Add an appropriate copyright notice for your modifications
adjacent to the other copyright notices.

@item
Include, immediately after the copyright notices, a license notice
giving the public permission to use the Modified Version under the
terms of this License, in the form shown in the Addendum below.

@item
Preserve in that license notice the full lists of Invariant Sections
and required Cover Texts given in the Document's license notice.

@item
Include an unaltered copy of this License.

@item
Preserve the section Entitled ``History'', Preserve its Title, and add
to it an item stating at least the title, year, new authors, and
publisher of the Modified Version as given on the Title Page.
If
there is no section Entitled ``History'' in the Document, create one
stating the title, year, authors, and publisher of the Document as
given on its Title Page, then add an item describing the Modified
Version as stated in the previous sentence.

@item
Preserve the network location, if any, given in the Document for
public access to a Transparent copy of the Document, and likewise
the network locations given in the Document for previous versions
it was based on. These may be placed in the ``History'' section.
You may omit a network location for a work that was published at
least four years before the Document itself, or if the original
publisher of the version it refers to gives permission.

@item
For any section Entitled ``Acknowledgements'' or ``Dedications'', Preserve
the Title of the section, and preserve in the section all the
substance and tone of each of the contributor acknowledgements and/or
dedications given therein.

@item
Preserve all the Invariant Sections of the Document,
unaltered in their text and in their titles. Section numbers
or the equivalent are not considered part of the section titles.

@item
Delete any section Entitled ``Endorsements''. Such a section
may not be included in the Modified Version.

@item
Do not retitle any existing section to be Entitled ``Endorsements'' or
to conflict in title with any Invariant Section.

@item
Preserve any Warranty Disclaimers.
@end enumerate

If the Modified Version includes new front-matter sections or
appendices that qualify as Secondary Sections and contain no material
copied from the Document, you may at your option designate some or all
of these sections as invariant. To do this, add their titles to the
list of Invariant Sections in the Modified Version's license notice.
These titles must be distinct from any other section titles.

You may add a section Entitled ``Endorsements'', provided it contains
nothing but endorsements of your Modified Version by various
parties---for example, statements of peer review or that the text has
been approved by an organization as the authoritative definition of a
standard.

You may add a passage of up to five words as a Front-Cover Text, and a
passage of up to 25 words as a Back-Cover Text, to the end of the list
of Cover Texts in the Modified Version. Only one passage of
Front-Cover Text and one of Back-Cover Text may be added by (or
through arrangements made by) any one entity. If the Document already
includes a cover text for the same cover, previously added by you or
by arrangement made by the same entity you are acting on behalf of,
you may not add another; but you may replace the old one, on explicit
permission from the previous publisher that added the old one.

The author(s) and publisher(s) of the Document do not by this License
give permission to use their names for publicity for or to assert or
imply endorsement of any Modified Version.

@item
COMBINING DOCUMENTS

You may combine the Document with other documents released under this
License, under the terms defined in section 4 above for modified
versions, provided that you include in the combination all of the
Invariant Sections of all of the original documents, unmodified, and
list them all as Invariant Sections of your combined work in its
license notice, and that you preserve all their Warranty Disclaimers.

The combined work need only contain one copy of this License, and
multiple identical Invariant Sections may be replaced with a single
copy.
If there are multiple Invariant Sections with the same name but
different contents, make the title of each such section unique by
adding at the end of it, in parentheses, the name of the original
author or publisher of that section if known, or else a unique number.
Make the same adjustment to the section titles in the list of
Invariant Sections in the license notice of the combined work.

In the combination, you must combine any sections Entitled ``History''
in the various original documents, forming one section Entitled
``History''; likewise combine any sections Entitled ``Acknowledgements'',
and any sections Entitled ``Dedications''. You must delete all
sections Entitled ``Endorsements.''

@item
COLLECTIONS OF DOCUMENTS

You may make a collection consisting of the Document and other documents
released under this License, and replace the individual copies of this
License in the various documents with a single copy that is included in
the collection, provided that you follow the rules of this License for
verbatim copying of each of the documents in all other respects.

You may extract a single document from such a collection, and distribute
it individually under this License, provided you insert a copy of this
License into the extracted document, and follow this License in all
other respects regarding verbatim copying of that document.

@item
AGGREGATION WITH INDEPENDENT WORKS

A compilation of the Document or its derivatives with other separate
and independent documents or works, in or on a volume of a storage or
distribution medium, is called an ``aggregate'' if the copyright
resulting from the compilation is not used to limit the legal rights
of the compilation's users beyond what the individual works permit.
When the Document is included in an aggregate, this License does not
apply to the other works in the aggregate which are not themselves
derivative works of the Document.

If the Cover Text requirement of section 3 is applicable to these
copies of the Document, then if the Document is less than one half of
the entire aggregate, the Document's Cover Texts may be placed on
covers that bracket the Document within the aggregate, or the
electronic equivalent of covers if the Document is in electronic form.
Otherwise they must appear on printed covers that bracket the whole
aggregate.

@item
TRANSLATION

Translation is considered a kind of modification, so you may
distribute translations of the Document under the terms of section 4.
Replacing Invariant Sections with translations requires special
permission from their copyright holders, but you may include
translations of some or all Invariant Sections in addition to the
original versions of these Invariant Sections. You may include a
translation of this License, and all the license notices in the
Document, and any Warranty Disclaimers, provided that you also include
the original English version of this License and the original versions
of those notices and disclaimers. In case of a disagreement between
the translation and the original version of this License or a notice
or disclaimer, the original version will prevail.

If a section in the Document is Entitled ``Acknowledgements'',
``Dedications'', or ``History'', the requirement (section 4) to Preserve
its Title (section 1) will typically require changing the actual
title.

@item
TERMINATION

You may not copy, modify, sublicense, or distribute the Document
except as expressly provided under this License.
Any attempt
otherwise to copy, modify, sublicense, or distribute it is void, and
will automatically terminate your rights under this License.

However, if you cease all violation of this License, then your license
from a particular copyright holder is reinstated (a) provisionally,
unless and until the copyright holder explicitly and finally
terminates your license, and (b) permanently, if the copyright holder
fails to notify you of the violation by some reasonable means prior to
60 days after the cessation.

Moreover, your license from a particular copyright holder is
reinstated permanently if the copyright holder notifies you of the
violation by some reasonable means, this is the first time you have
received notice of violation of this License (for any work) from that
copyright holder, and you cure the violation prior to 30 days after
your receipt of the notice.

Termination of your rights under this section does not terminate the
licenses of parties who have received copies or rights from you under
this License. If your rights have been terminated and not permanently
reinstated, receipt of a copy of some or all of the same material does
not give you any rights to use it.

@item
FUTURE REVISIONS OF THIS LICENSE

The Free Software Foundation may publish new, revised versions
of the GNU Free Documentation License from time to time. Such new
versions will be similar in spirit to the present version, but may
differ in detail to address new problems or concerns. See
@uref{https://www.gnu.org/copyleft/}.

Each version of the License is given a distinguishing version number.
If the Document specifies that a particular numbered version of this
License ``or any later version'' applies to it, you have the option of
following the terms and conditions either of that specified version or
of any later version that has been published (not as a draft) by the
Free Software Foundation. If the Document does not specify a version
number of this License, you may choose any version ever published (not
as a draft) by the Free Software Foundation. If the Document
specifies that a proxy can decide which future versions of this
License can be used, that proxy's public statement of acceptance of a
version permanently authorizes you to choose that version for the
Document.

@item
RELICENSING

``Massive Multiauthor Collaboration Site'' (or ``MMC Site'') means any
World Wide Web server that publishes copyrightable works and also
provides prominent facilities for anybody to edit those works. A
public wiki that anybody can edit is an example of such a server. A
``Massive Multiauthor Collaboration'' (or ``MMC'') contained in the
site means any set of copyrightable works thus published on the MMC
site.

``CC-BY-SA'' means the Creative Commons Attribution-Share Alike 3.0
license published by Creative Commons Corporation, a not-for-profit
corporation with a principal place of business in San Francisco,
California, as well as future copyleft versions of that license
published by that same organization.

``Incorporate'' means to publish or republish a Document, in whole or
in part, as part of another Document.

An MMC is ``eligible for relicensing'' if it is licensed under this
License, and if all works that were first published under this License
somewhere other than this MMC, and subsequently incorporated in whole
or in part into the MMC, (1) had no cover texts or invariant sections,
and (2) were thus incorporated prior to November 1, 2008.

The operator of an MMC Site may republish an MMC contained in the site
under CC-BY-SA on the same site at any time before August 1, 2009,
provided the MMC is eligible for relicensing.

@end enumerate

@c fakenode --- for prepinfo
@unnumberedsec ADDENDUM: How to use this License for your documents

To use this License in a document you have written, include a copy of
the License in the document and put the following copyright and
license notices just after the title page:

@smallexample
@group
  Copyright (C) @var{year} @var{your name}.
  Permission is granted to copy, distribute and/or modify this document
  under the terms of the GNU Free Documentation License, Version 1.3
  or any later version published by the Free Software Foundation;
  with no Invariant Sections, no Front-Cover Texts, and no Back-Cover
  Texts. A copy of the license is included in the section entitled ``GNU
  Free Documentation License''.
@end group
@end smallexample

If you have Invariant Sections, Front-Cover Texts and Back-Cover Texts,
replace the ``with@dots{}Texts.'' line with this:

@smallexample
@group
    with the Invariant Sections being @var{list their titles}, with
    the Front-Cover Texts being @var{list}, and with the Back-Cover Texts
    being @var{list}.
@end group
@end smallexample

If you have Invariant Sections without Cover Texts, or some other
combination of the three, merge those two alternatives to suit the
situation.

If your document contains nontrivial examples of program code, we
recommend releasing these examples in parallel under your choice of
free software license, such as the GNU General Public License,
to permit their use in free software.

@end ifclear

@ifnotdocbook
@node Index
@unnumbered Index
@end ifnotdocbook
@printindex cp

@bye

Unresolved Issues:
------------------
1. From ADR.

   Robert J. Chassell points out that awk programs should have some indication
   of how to use them. It would be useful to perhaps have a "programming
   style" section of the manual that would include this and other tips.

Consistency issues:
        /.../ regexps are in @code, not @samp
        ".." strings are in @code, not @samp
        no @print before @dots
        values of expressions in the text (@code{x} has the value 15),
                should be in roman, not @code
        Use TAB and not tab
        Use ESC and not ESCAPE
        Use space and not blank to describe the space bar's character
        The term "blank" is thus basically reserved for "blank lines" etc.
        To make dark corners work, the @value{DARKCORNER} has to be outside
                closing `.' of a sentence and after (pxref{...}).
        Make sure that each @value{DARKCORNER} has an index entry, and
                also that each `@cindex dark corner' has an @value{DARKCORNER}.
        " " should have an @w{} around it
        Use "non-" only with language names or acronyms, or the words bug and option and null
        Use @command{ftp} when talking about anonymous ftp
        Use uppercase and lowercase, not "upper-case" and "lower-case"
                or "upper case" and "lower case"
        Use "single precision" and "double precision", not "single-precision" or "double-precision"
        Use alphanumeric, not alpha-numeric
        Use POSIX-compliant, not POSIX compliant
        Use --foo, not -Wfoo when describing long options
        Use "Bell Laboratories", but not "Bell Labs".
        Use "behavior" instead of "behaviour".
        Use "coprocess" instead of "co-process".
        Use "zeros" instead of "zeroes".
        Use "nonzero" not "non-zero".
        Use "runtime" not "run time" or "run-time".
        Use "command-line" as an adjective and "command line" as a noun.
        Use "online" not "on-line".
        Use "whitespace" not "white space".
        Use "Input/Output", not "input/output". Also "I/O", not "i/o".
        Use "lefthand"/"righthand", not "left-hand"/"right-hand".
        Use "workaround", not "work-around".
        Use "startup"/"cleanup", not "start-up"/"clean-up"
        Use "filesystem", not "file system"
        Use @code{do}, and not @code{do}-@code{while}, except where
                actually discussing the do-while.
        Use "versus" in text and "vs." in index entries
        Use @code{"C"} for the C locale, not ``C'' or @samp{C}.
        The words "a", "and", "as", "between", "for", "from", "in", "of",
                "on", "that", "the", "to", "with", and "without",
                should not be capitalized in @chapter, @section etc.
                "Into" and "How" should.
        Search for @dfn; make sure important items are also indexed.
        "e.g." should always be followed by a comma.
        "i.e." should always be followed by a comma.
        The numbers zero through ten should be spelled out, except when
                talking about file descriptor numbers. > 10 and < 0, it's
                ok to use numbers.
        For most cases, do NOT put a comma before "and", "or" or "but".
                But exercise taste with this rule.
        Don't show the awk command with a program in quotes when it's
                just the program. I.e.

                {
                        ....
                }

        not
                awk '{
                        ...
                }'

        Do show it when showing command-line arguments, data files, etc, even
                if there is no output shown.

        Use numbered lists only to show a sequential series of steps.

        Use @code{xxx} for the xxx operator in indexing statements, not @samp.
        Use MS-Windows not MS Windows
        Use MS-DOS not MS DOS
        Use an empty set of parentheses after built-in and awk function names.
        Use "multiFOO" without a hyphen.
        Use "time zone" as two words, not "timezone".

Date: Wed, 13 Apr 94 15:20:52 -0400
From: rms@gnu.org (Richard Stallman)
To: gnu-prog@gnu.org
Subject: A reminder: no pathnames in GNU

It's a GNU convention to use the term "file name" for the name of a
file, never "pathname". We use the term "path" for search paths,
which are lists of file names. Using it for a single file name as
well is potentially confusing to users.

So please check any documentation you maintain, if you think you might
have used "pathname".

Note that "file name" should be two words when it appears as ordinary
text. It's ok as one word when it's a metasyntactic variable, though.

------------------------
ORA uses filename, thus the macro.

Suggestions:
------------

Better sidebars can almost sort of be done with:

    @ifdocbook
    @macro @sidebar{title, content}
    @inlinefmt{docbook, <sidebar><title>}
    \title\
    @inlinefmt{docbook, </title>}
    \content\
    @inlinefmt{docbook, </sidebar>}
    @end macro
    @end ifdocbook


    @ifnotdocbook
    @macro @sidebar{title, content}
    @cartouche
    @center @b{\title\}

    \content\
    @end cartouche
    @end macro
    @end ifnotdocbook

But to use it you have to say

    @sidebar{Title Here,
    @include file-with-content
    }

which sorta sucks.

TODO: